Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")

- Peter Xu has a series ("mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.

- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").

- Sergey Senozhatsky has improved the performance of zsmalloc during
compaction ("zsmalloc: small compaction improvements").

- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").

- xu xin has done some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").

- Jeff Xu has fixed the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").

- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").

- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").

- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").

- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").

- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").

- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").

- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").

- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").

- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").

- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").

- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").

- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").

- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").

- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").

- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").

- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").

- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").

- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").

- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").

- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").

- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").

- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").

- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").

- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").

- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").

- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").

- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").

- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").

- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").

- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").

- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").

- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").

- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").

- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").

* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...

+9586 -7102
+36 -4
Documentation/ABI/testing/sysfs-kernel-mm-damon
···
 file updates contents of schemes stats files of the kdamond.
 Writing 'update_schemes_tried_regions' to the file updates
 contents of 'tried_regions' directory of every scheme directory
-of this kdamond. Writing 'clear_schemes_tried_regions' to the
-file removes contents of the 'tried_regions' directory.
+of this kdamond. Writing 'update_schemes_tried_bytes' to the
+file updates only '.../tried_regions/total_bytes' files of this
+kdamond. Writing 'clear_schemes_tried_regions' to the file
+removes contents of the 'tried_regions' directory.

 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
 Date: Mar 2022
···
 Date: Dec 2022
 Contact: SeongJae Park <sj@kernel.org>
 Description: Writing to and reading from this file sets and gets the type of
              the memory of the interest. 'anon' for anonymous pages, or
-             'memcg' for specific memory cgroup can be written and read.
+             the memory of the interest. 'anon' for anonymous pages,
+             'memcg' for specific memory cgroup, 'addr' for address range
+             (an open-ended interval), or 'target' for DAMON monitoring
+             target can be written and read.

 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/memcg_path
 Date: Dec 2022
···
 Description: If 'memcg' is written to the 'type' file, writing to and
              reading from this file sets and gets the path to the memory
              cgroup of the interest.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_start
+Date: Jul 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: If 'addr' is written to the 'type' file, writing to or reading
+             from this file sets or gets the start address of the address
+             range for the filter.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_end
+Date: Jul 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: If 'addr' is written to the 'type' file, writing to or reading
+             from this file sets or gets the end address of the address
+             range for the filter.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/target_idx
+Date: Dec 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: If 'target' is written to the 'type' file, writing to or
+             reading from this file sets or gets the index of the DAMON
+             monitoring target of the interest.

 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/matching
 Date: Dec 2022
···
 Contact: SeongJae Park <sj@kernel.org>
 Description: Reading this file returns the number of the exceed events of
              the scheme's quotas.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/total_bytes
+Date: Jul 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the total amount of memory that
+             corresponding DAMON-based Operation Scheme's action has tried
+             to be applied.

 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/start
 Date: Oct 2022
+2 -2
Documentation/ABI/testing/sysfs-memory-page-offline
···
 dropping it if possible. The kernel will then be placed
 on the bad page list and never be reused.

-The offlining is done in kernel specific granuality.
+The offlining is done in kernel specific granularity.
 Normally it's the base page size of the kernel, but
 this might change.
···
 to access this page assuming it's poisoned by the
 hardware.

-The offlining is done in kernel specific granuality.
+The offlining is done in kernel specific granularity.
 Normally it's the base page size of the kernel, but
 this might change.
-2
Documentation/admin-guide/cgroup-v1/memory.rst
···
 memory.oom_control          set/show oom controls.
 memory.numa_stat            show the number of memory usage per numa
                             node
-memory.kmem.limit_in_bytes  This knob is deprecated and writing to
-                            it will return -ENOTSUPP.
 memory.kmem.usage_in_bytes  show current kernel memory allocation
 memory.kmem.failcnt         show the number of kernel memory usage
                             hits limits
+4 -10
Documentation/admin-guide/kdump/vmcoreinfo.rst
···
 The size of a nodemask_t type. Used to compute the number of online
 nodes.

-(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
--------------------------------------------------------------------------------------------------
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
+----------------------------------------------------------------------------------

 User-space tools compute their values based on the offset of these
 variables. The variables are used when excluding unnecessary pages.
···
 On linux-2.6.21 or later, the number of free pages is in
 vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.

-PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
-------------------------------------------------------------------------------
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
+-----------------------------------------------------------------------------------------

 Page attributes. These flags are used to filter various unnecessary for
 dumping pages.
···
 More page attributes. These flags are used to filter various unnecessary for
 dumping pages.

-
-HUGETLB_PAGE_DTOR
------------------
-
-The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
-excludes these pages.

 x86_64
 ======
+50 -26
Documentation/admin-guide/mm/damon/usage.rst
···
 │ │ │ │ │ │ │ filters/nr_filters
 │ │ │ │ │ │ │ │ 0/type,matching,memcg_id
 │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
-│ │ │ │ │ │ │ tried_regions/
+│ │ │ │ │ │ │ tried_regions/total_bytes
 │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
 │ │ │ │ │ │ │ │ ...
 │ │ │ │ │ │ ...
···
 user inputs in the sysfs files except ``state`` file again. Writing
 ``update_schemes_stats`` to ``state`` file updates the contents of stats files
 for each DAMON-based operation scheme of the kdamond. For details of the
-stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. Writing
-``update_schemes_tried_regions`` to ``state`` file updates the DAMON-based
-operation scheme action tried regions directory for each DAMON-based operation
-scheme of the kdamond. Writing ``clear_schemes_tried_regions`` to ``state``
-file clears the DAMON-based operating scheme action tried regions directory for
-each DAMON-based operation scheme of the kdamond. For details of the
-DAMON-based operation scheme action tried regions directory, please refer to
-:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
+stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
+
+Writing ``update_schemes_tried_regions`` to ``state`` file updates the
+DAMON-based operation scheme action tried regions directory for each
+DAMON-based operation scheme of the kdamond. Writing
+``update_schemes_tried_bytes`` to ``state`` file updates only
+``.../tried_regions/total_bytes`` files. Writing
+``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
+operating scheme action tried regions directory for each DAMON-based operation
+scheme of the kdamond. For details of the DAMON-based operation scheme action
+tried regions directory, please refer to :ref:`tried_regions section
+<sysfs_schemes_tried_regions>`.

 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
···
 to ``N-1``. Each directory represents each filter. The filters are evaluated
 in the numeric order.

-Each filter directory contains three files, namely ``type``, ``matcing``, and
-``memcg_path``. You can write one of two special keywords, ``anon`` for
-anonymous pages, or ``memcg`` for specific memory cgroup filtering. In case of
-the memory cgroup filtering, you can specify the memory cgroup of the interest
-by writing the path of the memory cgroup from the cgroups mount point to
-``memcg_path`` file. You can write ``Y`` or ``N`` to ``matching`` file to
-filter out pages that does or does not match to the type, respectively. Then,
-the scheme's action will not be applied to the pages that specified to be
-filtered out.
+Each filter directory contains six files, namely ``type``, ``matcing``,
+``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type``
+file, you can write one of four special keywords: ``anon`` for anonymous pages,
+``memcg`` for specific memory cgroup, ``addr`` for specific address range (an
+open-ended interval), or ``target`` for specific DAMON monitoring target
+filtering. In case of the memory cgroup filtering, you can specify the memory
+cgroup of the interest by writing the path of the memory cgroup from the
+cgroups mount point to ``memcg_path`` file. In case of the address range
+filtering, you can specify the start and end address of the range to
+``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring
+target filtering, you can specify the index of the target between the list of
+the DAMON context's monitoring targets list to ``target_idx`` file. You can
+write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does
+not match to the type, respectively. Then, the scheme's action will not be
+applied to the pages that specified to be filtered out.

 For example, below restricts a DAMOS action to be applied to only non-anonymous
 pages of all memory cgroups except ``/having_care_already``.::
···
     echo /having_care_already > 1/memcg_path
     echo N > 1/matching

-Note that filters are currently supported only when ``paddr``
-`implementation <sysfs_contexts>` is being used.
+Note that ``anon`` and ``memcg`` filters are currently supported only when
+``paddr`` `implementation <sysfs_contexts>` is being used.
+
+Also, memory regions that are filtered out by ``addr`` or ``target`` filters
+are not counted as the scheme has tried to those, while regions that filtered
+out by other type filters are counted as the scheme has tried to. The
+difference is applied to :ref:`stats <damos_stats>` and
+:ref:`tried regions <sysfs_schemes_tried_regions>`.

 .. _sysfs_schemes_stats:
···
 schemes/<N>/tried_regions/
 --------------------------

+This directory initially has one file, ``total_bytes``.
+
 When a special keyword, ``update_schemes_tried_regions``, is written to the
-relevant ``kdamonds/<N>/state`` file, DAMON creates directories named integer
-starting from ``0`` under this directory. Each directory contains files
-exposing detailed information about each of the memory region that the
-corresponding scheme's ``action`` has tried to be applied under this directory,
-during next :ref:`aggregation interval <sysfs_monitoring_attrs>`. The
-information includes address range, ``nr_accesses``, and ``age`` of the region.
+relevant ``kdamonds/<N>/state`` file, DAMON updates the ``total_bytes`` file so
+that reading it returns the total size of the scheme tried regions, and creates
+directories named integer starting from ``0`` under this directory. Each
+directory contains files exposing detailed information about each of the memory
+region that the corresponding scheme's ``action`` has tried to be applied under
+this directory, during next :ref:`aggregation interval
+<sysfs_monitoring_attrs>`. The information includes address range,
+``nr_accesses``, and ``age`` of the region.
+
+Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
+file will only update the ``total_bytes`` file, and will not create the
+subdirectories.

 The directories will be removed when another special keyword,
 ``clear_schemes_tried_regions``, is written to the relevant
+20 -7
Documentation/admin-guide/mm/ksm.rst
···

 general_profit
    how effective is KSM. The calculation is explained below.
+pages_scanned
+   how many pages are being scanned for ksm
 pages_shared
    how many shared pages are being used
 pages_sharing
···
    the number of KSM pages that hit the ``max_page_sharing`` limit
 stable_node_dups
    number of duplicated KSM pages
+ksm_zero_pages
+   how many zero pages that are still mapped into processes were mapped by
+   KSM when deduplicating.
+
+When ``use_zero_pages`` is/was enabled, the sum of ``pages_sharing`` +
+``ksm_zero_pages`` represents the actual number of pages saved by KSM.
+if ``use_zero_pages`` has never been enabled, ``ksm_zero_pages`` is 0.

 A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
 sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
···
 1) How to determine whether KSM save memory or consume memory in system-wide
    range? Here is a simple approximate calculation for reference::

-	general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
-			  sizeof(rmap_item);
+	general_profit =~ ksm_saved_pages * sizeof(page) - (all_rmap_items) *
+			  sizeof(rmap_item);

-   where all_rmap_items can be easily obtained by summing ``pages_sharing``,
-   ``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
+   where ksm_saved_pages equals to the sum of ``pages_sharing`` +
+   ``ksm_zero_pages`` of the system, and all_rmap_items can be easily
+   obtained by summing ``pages_sharing``, ``pages_shared``, ``pages_unshared``
+   and ``pages_volatile``.

 2) The KSM profit inner a single process can be similarly obtained by the
    following approximate calculation::

-	process_profit =~ ksm_merging_pages * sizeof(page) -
+	process_profit =~ ksm_saved_pages * sizeof(page) -
 			  ksm_rmap_items * sizeof(rmap_item).

-   where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
-   and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
-   is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
+   where ksm_saved_pages equals to the sum of ``ksm_merging_pages`` and
+   ``ksm_zero_pages``, both of which are shown under the directory
+   ``/proc/<pid>/ksm_stat``, and ksm_rmap_items is also shown in
+   ``/proc/<pid>/ksm_stat``. The process profit is also shown in
+   ``/proc/<pid>/ksm_stat`` as ksm_process_profit.

 From the perspective of application, a high ratio of ``ksm_rmap_items`` to
 ``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
+13 -1
Documentation/admin-guide/mm/memory-hotplug.rst
···
 					memory in a way that huge pages in bigger
 					granularity cannot be formed on hotplugged
 					memory.
+
+					With value "force" it could result in memory
+					wastage due to memmap size limitations. For
+					example, if the memmap for a memory block
+					requires 1 MiB, but the pageblock size is 2
+					MiB, 1 MiB of hotplugged memory will be wasted.
+					Note that there are still cases where the
+					feature cannot be enforced: for example, if the
+					memmap is smaller than a single page, or if the
+					architecture does not support the forced mode
+					in all configurations.
+
 ``online_policy``			read-write: Set the basic policy used for
 					automatic zone selection when onlining memory
 					blocks without specifying a target zone.
···
 (-> BUG), memory offlining will keep retrying until it eventually succeeds.

 When offlining is triggered from user space, the offlining context can be
-terminated by sending a fatal signal. A timeout based offlining can easily be
+terminated by sending a signal. A timeout based offlining can easily be
 implemented via::

	% timeout $TIMEOUT offline_block | failure_handling
+15
Documentation/admin-guide/mm/userfaultfd.rst
···
 support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
 respectively) to configure the mapping this way.

+Memory Poisioning Emulation
+---------------------------
+
+In response to a fault (either missing or minor), an action userspace can
+take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
+future faulters to either get a SIGBUS, or in KVM's case the guest will
+receive an MCE as if there were hardware memory poisoning.
+
+This is used to emulate hardware memory poisoning. Imagine a VM running on a
+machine which experiences a real hardware memory error. Later, we live migrate
+the VM to another physical machine. Since we want the migration to be
+transparent to the guest, we want that same address range to act as if it was
+still poisoned, even though it's on a new physical host which ostensibly
+doesn't have a memory error in the exact same spot.
+
 QEMU/KVM
 ========
+7 -7
Documentation/admin-guide/mm/zswap.rst
···
 Design
 ======

-Zswap receives pages for compression through the Frontswap API and is able to
+Zswap receives pages for compression from the swap subsystem and is able to
 evict pages from its own compressed pool on an LRU basis and write them back to
 the backing swap device in the case that the compressed pool is full.
···
 zbud pages). The zsmalloc type zpool has a more complex compressed page
 storage method, and it can achieve greater storage densities.

-When a swap page is passed from frontswap to zswap, zswap maintains a mapping
+When a swap page is passed from swapout to zswap, zswap maintains a mapping
 of the swap entry, a combination of the swap type and swap offset, to the zpool
 handle that references that compressed swap page. This mapping is achieved
 with a red-black tree per swap type. The swap offset is the search key for the
 tree nodes.

-During a page fault on a PTE that is a swap entry, frontswap calls the zswap
-load function to decompress the page into the page allocated by the page fault
-handler.
+During a page fault on a PTE that is a swap entry, the swapin code calls the
+zswap load function to decompress the page into the page allocated by the page
+fault handler.

 Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
-in the swap_map goes to 0) the swap code calls the zswap invalidate function,
-via frontswap, to free the compressed entry.
+in the swap_map goes to 0) the swap code calls the zswap invalidate function
+to free the compressed entry.

 Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
 controlled policy:
+1
Documentation/block/biovecs.rst
···
 bio_for_each_bvec_all()
 bio_first_bvec_all()
 bio_first_page_all()
+bio_first_folio_all()
 bio_last_bvec_all()

 * The following helpers iterate over single-page segment. The passed 'struct
+27 -28
Documentation/core-api/cachetlb.rst
···

	This is used primarily during fault processing.

-5) ``void update_mmu_cache(struct vm_area_struct *vma,
-   unsigned long address, pte_t *ptep)``
+5) ``void update_mmu_cache_range(struct vm_fault *vmf,
+   struct vm_area_struct *vma, unsigned long address, pte_t *ptep,
+   unsigned int nr)``

-	At the end of every page fault, this routine is invoked to
-	tell the architecture specific code that a translation
-	now exists at virtual address "address" for address space
-	"vma->vm_mm", in the software page tables.
+	At the end of every page fault, this routine is invoked to tell
+	the architecture specific code that translations now exists
+	in the software page tables for address space "vma->vm_mm"
+	at virtual address "address" for "nr" consecutive pages.
+
+	This routine is also invoked in various other places which pass
+	a NULL "vmf".

	A port may use this information in any way it so chooses.
	For example, it could use this event to pre-load TLB
···
	If D-cache aliasing is not an issue, these two routines may
	simply call memcpy/memset directly and do nothing more.

-  ``void flush_dcache_page(struct page *page)``
+  ``void flush_dcache_folio(struct folio *folio)``

	This routines must be called when:

···
	     and / or in high memory
	  b) the kernel is about to read from a page cache page and user space
	     shared/writable mappings of this page potentially exist. Note
-	     that {get,pin}_user_pages{_fast} already call flush_dcache_page
+	     that {get,pin}_user_pages{_fast} already call flush_dcache_folio
	     on any page found in the user address space and thus driver
	     code rarely needs to take this into account.
···
	The phrase "kernel writes to a page cache page" means, specifically,
	that the kernel executes store instructions that dirty data in that
-	page at the page->virtual mapping of that page. It is important to
+	page at the kernel virtual mapping of that page. It is important to
	flush here to handle D-cache aliasing, to make sure these kernel stores
	are visible to user space mappings of that page.
···
	If D-cache aliasing is not an issue, this routine may simply be defined
	as a nop on that architecture.

-	There is a bit set aside in page->flags (PG_arch_1) as "architecture
+	There is a bit set aside in folio->flags (PG_arch_1) as "architecture
	private". The kernel guarantees that, for pagecache pages, it will
	clear this bit when such a page first enters the pagecache.

-	This allows these interfaces to be implemented much more efficiently.
-	It allows one to "defer" (perhaps indefinitely) the actual flush if
-	there are currently no user processes mapping this page. See sparc64's
-	flush_dcache_page and update_mmu_cache implementations for an example
-	of how to go about doing this.
+	This allows these interfaces to be implemented much more
+	efficiently. It allows one to "defer" (perhaps indefinitely) the
+	actual flush if there are currently no user processes mapping this
+	page. See sparc64's flush_dcache_folio and update_mmu_cache_range
+	implementations for an example of how to go about doing this.

-	The idea is, first at flush_dcache_page() time, if page_file_mapping()
-	returns a mapping, and mapping_mapped on that mapping returns %false,
-	just mark the architecture private page flag bit. Later, in
-	update_mmu_cache(), a check is made of this flag bit, and if set the
-	flush is done and the flag bit is cleared.
+	The idea is, first at flush_dcache_folio() time, if
+	folio_flush_mapping() returns a mapping, and mapping_mapped() on that
+	mapping returns %false, just mark the architecture private page
+	flag bit. Later, in update_mmu_cache_range(), a check is made
+	of this flag bit, and if set the flush is done and the flag bit
+	is cleared.

 .. important::
···
		as did the cpu stores into the page to make it
		dirty. Again, see sparc64 for examples of how
		to deal with this.
-
-  ``void flush_dcache_folio(struct folio *folio)``
-	This function is called under the same circumstances as
-	flush_dcache_page(). It allows the architecture to
-	optimise for flushing the entire folio of pages instead
-	of flushing one page at a time.

   ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
   unsigned long user_vaddr, void *dst, void *src, int len)``
···
	When the kernel needs to access the contents of an anonymous
	page, it calls this function (currently only
-	get_user_pages()). Note: flush_dcache_page() deliberately
+	get_user_pages()). Note: flush_dcache_folio() deliberately
	doesn't work for an anonymous page. The default
	implementation is a nop (and should remain so for all coherent
	architectures). For incoherent architectures, it should flush
···
  ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``

	All the functionality of flush_icache_page can be implemented in
-	flush_dcache_page and update_mmu_cache. In the future, the hope
+	flush_dcache_folio and update_mmu_cache_range. In the future, the hope
	is to remove this interface completely.

 The final category of APIs is for I/O to deliberately aliased address
+25
Documentation/core-api/mm-api.rst
··· 115 115 .. kernel-doc:: include/linux/mmzone.h 116 116 .. kernel-doc:: mm/util.c 117 117 :functions: folio_mapping 118 + 119 + .. kernel-doc:: mm/rmap.c 120 + .. kernel-doc:: mm/migrate.c 121 + .. kernel-doc:: mm/mmap.c 122 + .. kernel-doc:: mm/kmemleak.c 123 + .. #kernel-doc:: mm/hmm.c (build warnings) 124 + .. kernel-doc:: mm/memremap.c 125 + .. kernel-doc:: mm/hugetlb.c 126 + .. kernel-doc:: mm/swap.c 127 + .. kernel-doc:: mm/zpool.c 128 + .. kernel-doc:: mm/memcontrol.c 129 + .. #kernel-doc:: mm/memory-tiers.c (build warnings) 130 + .. kernel-doc:: mm/shmem.c 131 + .. kernel-doc:: mm/migrate_device.c 132 + .. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files) 133 + .. kernel-doc:: mm/mapping_dirty_helpers.c 134 + .. #kernel-doc:: mm/memory-failure.c (build warnings) 135 + .. kernel-doc:: mm/percpu.c 136 + .. kernel-doc:: mm/maccess.c 137 + .. kernel-doc:: mm/vmscan.c 138 + .. kernel-doc:: mm/memory_hotplug.c 139 + .. kernel-doc:: mm/mmu_notifier.c 140 + .. kernel-doc:: mm/balloon_compaction.c 141 + .. kernel-doc:: mm/huge_memory.c 142 + .. kernel-doc:: mm/io-mapping.c
+1 -1
Documentation/features/vm/TLB/arch-support.txt
··· 9 9 | alpha: | TODO | 10 10 | arc: | TODO | 11 11 | arm: | TODO | 12 - | arm64: | N/A | 12 + | arm64: | ok | 13 13 | csky: | TODO | 14 14 | hexagon: | TODO | 15 15 | ia64: | TODO |
+24 -14
Documentation/filesystems/locking.rst
··· 636 636 637 637 prototypes:: 638 638 639 - void (*open)(struct vm_area_struct*); 640 - void (*close)(struct vm_area_struct*); 641 - vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *); 639 + void (*open)(struct vm_area_struct *); 640 + void (*close)(struct vm_area_struct *); 641 + vm_fault_t (*fault)(struct vm_fault *); 642 + vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order); 643 + vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end); 642 644 vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); 643 645 vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); 644 646 int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); 645 647 646 648 locking rules: 647 649 648 - ============= ========= =========================== 650 + ============= ========== =========================== 649 651 ops mmap_lock PageLocked(page) 650 - ============= ========= =========================== 651 - open: yes 652 - close: yes 653 - fault: yes can return with page locked 654 - map_pages: read 655 - page_mkwrite: yes can return with page locked 656 - pfn_mkwrite: yes 657 - access: yes 658 - ============= ========= =========================== 652 + ============= ========== =========================== 653 + open: write 654 + close: read/write 655 + fault: read can return with page locked 656 + huge_fault: maybe-read 657 + map_pages: maybe-read 658 + page_mkwrite: read can return with page locked 659 + pfn_mkwrite: read 660 + access: read 661 + ============= ========== =========================== 659 662 660 663 ->fault() is called when a previously not present pte is about to be faulted 661 664 in. The filesystem must find and return the page associated with the passed in ··· 668 665 subsequent truncate), and then return with VM_FAULT_LOCKED, and the page 669 666 locked. The VM will unlock the page. 670 667 668 + ->huge_fault() is called when there is no PUD or PMD entry present. 
This 669 + gives the filesystem the opportunity to install a PUD or PMD sized page. 670 + Filesystems can also use the ->fault method to return a PMD sized page, 671 + so implementing this function may not be necessary. In particular, 672 + filesystems should not call filemap_fault() from ->huge_fault(). 673 + The mmap_lock may not be held when this method is called. 674 + 671 675 ->map_pages() is called when VM asks to map easy accessible pages. 672 676 Filesystem should find and map pages associated with offsets from "start_pgoff" 673 677 till "end_pgoff". ->map_pages() is called with the RCU lock held and must 674 678 not block. If it's not possible to reach a page without blocking, 675 - filesystem should skip it. Filesystem should use do_set_pte() to setup 679 + filesystem should skip it. Filesystem should use set_pte_range() to setup 676 680 page table entry. Pointer to entry associated with the page is passed in 677 681 "pte" field in vm_fault structure. Pointers to entries for other offsets 678 682 should be calculated relative to "pte".
+11
Documentation/filesystems/porting.rst
··· 938 938 changed to simplify callers. The passed file is in a non-open state and on 939 939 success must be opened before returning (e.g. by calling 940 940 finish_open_simple()). 941 + 942 + --- 943 + 944 + **mandatory** 945 + 946 + Calling convention for ->huge_fault has changed. It now takes a page 947 + order instead of an enum page_entry_size, and it may be called without the 948 + mmap_lock held. All in-tree users have been audited and do not seem to 949 + depend on the mmap_lock being held, but out of tree users should verify 950 + for themselves. If they do need it, they can return VM_FAULT_RETRY to 951 + be called with the mmap_lock held.
+18 -6
Documentation/mm/damon/design.rst
··· 380 380 memory, and whether it should exclude the memory of the type (filter-out), or 381 381 all except the memory of the type (filter-in). 382 382 383 - As of this writing, anonymous page type and memory cgroup type are supported by 384 - the feature. Some filter target types can require additional arguments. For 385 - example, the memory cgroup filter type asks users to specify the file path of 386 - the memory cgroup for the filter. Hence, users can apply specific schemes to 387 - only anonymous pages, non-anonymous pages, pages of specific cgroups, all pages 388 - excluding those of specific cgroups, and any combination of those. 383 + Currently, anonymous page, memory cgroup, address range, and DAMON monitoring 384 + target type filters are supported by the feature. Some filter target types 385 + require additional arguments. The memory cgroup filter type asks users to 386 + specify the file path of the memory cgroup for the filter. The address range 387 + type asks for the start and end addresses of the range. The DAMON monitoring 388 + target type asks for the index of the target in the context's monitoring targets 389 + list. Hence, users can apply specific schemes to only anonymous pages, 390 + non-anonymous pages, pages of specific cgroups, all pages excluding those of 391 + specific cgroups, pages in specific address ranges, pages in specific DAMON 392 + monitoring targets, and any combination of those. 393 + 394 + To handle filters efficiently, the address range and DAMON monitoring target 395 + type filters are handled by the core layer, while the others are handled by the 396 + operations set. If a memory region is filtered by a core layer-handled filter, 397 + it is not counted as a region the scheme has tried. In contrast, if a 398 + memory region is filtered by an operations set layer-handled filter, it is 399 + counted as tried. The difference in accounting leads to changes 400 + in the statistics. 
389 401 390 402 391 403 Application Programming Interface
-264
Documentation/mm/frontswap.rst
··· 1 - ========= 2 - Frontswap 3 - ========= 4 - 5 - Frontswap provides a "transcendent memory" interface for swap pages. 6 - In some environments, dramatic performance savings may be obtained because 7 - swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. 8 - 9 - .. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/ 10 - 11 - Frontswap is so named because it can be thought of as the opposite of 12 - a "backing" store for a swap device. The storage is assumed to be 13 - a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming 14 - to the requirements of transcendent memory (such as Xen's "tmem", or 15 - in-kernel compressed memory, aka "zcache", or future RAM-like devices); 16 - this pseudo-RAM device is not directly accessible or addressable by the 17 - kernel and is of unknown and possibly time-varying size. The driver 18 - links itself to frontswap by calling frontswap_register_ops to set the 19 - frontswap_ops funcs appropriately and the functions it provides must 20 - conform to certain policies as follows: 21 - 22 - An "init" prepares the device to receive frontswap pages associated 23 - with the specified swap device number (aka "type"). A "store" will 24 - copy the page to transcendent memory and associate it with the type and 25 - offset associated with the page. A "load" will copy the page, if found, 26 - from transcendent memory into kernel memory, but will NOT remove the page 27 - from transcendent memory. An "invalidate_page" will remove the page 28 - from transcendent memory and an "invalidate_area" will remove ALL pages 29 - associated with the swap type (e.g., like swapoff) and notify the "device" 30 - to refuse further stores with that swap type. 31 - 32 - Once a page is successfully stored, a matching load on the page will normally 33 - succeed. So when the kernel finds itself in a situation where it needs 34 - to swap out a page, it first attempts to use frontswap. 
If the store returns 35 - success, the data has been successfully saved to transcendent memory and 36 - a disk write and, if the data is later read back, a disk read are avoided. 37 - If a store returns failure, transcendent memory has rejected the data, and the 38 - page can be written to swap as usual. 39 - 40 - Note that if a page is stored and the page already exists in transcendent memory 41 - (a "duplicate" store), either the store succeeds and the data is overwritten, 42 - or the store fails AND the page is invalidated. This ensures stale data may 43 - never be obtained from frontswap. 44 - 45 - If properly configured, monitoring of frontswap is done via debugfs in 46 - the `/sys/kernel/debug/frontswap` directory. The effectiveness of 47 - frontswap can be measured (across all swap devices) with: 48 - 49 - ``failed_stores`` 50 - how many store attempts have failed 51 - 52 - ``loads`` 53 - how many loads were attempted (all should succeed) 54 - 55 - ``succ_stores`` 56 - how many store attempts have succeeded 57 - 58 - ``invalidates`` 59 - how many invalidates were attempted 60 - 61 - A backend implementation may provide additional metrics. 62 - 63 - FAQ 64 - === 65 - 66 - * Where's the value? 67 - 68 - When a workload starts swapping, performance falls through the floor. 69 - Frontswap significantly increases performance in many such workloads by 70 - providing a clean, dynamic interface to read and write swap pages to 71 - "transcendent memory" that is otherwise not directly addressable to the kernel. 72 - This interface is ideal when data is transformed to a different form 73 - and size (such as with compression) or secretly moved (as might be 74 - useful for write-balancing for some RAM-like devices). Swap pages (and 75 - evicted page-cache pages) are a great use for this kind of slower-than-RAM- 76 - but-much-faster-than-disk "pseudo-RAM device". 
77 - 78 - Frontswap with a fairly small impact on the kernel, 79 - provides a huge amount of flexibility for more dynamic, flexible RAM 80 - utilization in various system configurations: 81 - 82 - In the single kernel case, aka "zcache", pages are compressed and 83 - stored in local memory, thus increasing the total anonymous pages 84 - that can be safely kept in RAM. Zcache essentially trades off CPU 85 - cycles used in compression/decompression for better memory utilization. 86 - Benchmarks have shown little or no impact when memory pressure is 87 - low while providing a significant performance improvement (25%+) 88 - on some workloads under high memory pressure. 89 - 90 - "RAMster" builds on zcache by adding "peer-to-peer" transcendent memory 91 - support for clustered systems. Frontswap pages are locally compressed 92 - as in zcache, but then "remotified" to another system's RAM. This 93 - allows RAM to be dynamically load-balanced back-and-forth as needed, 94 - i.e. when system A is overcommitted, it can swap to system B, and 95 - vice versa. RAMster can also be configured as a memory server so 96 - many servers in a cluster can swap, dynamically as needed, to a single 97 - server configured with a large amount of RAM... without pre-configuring 98 - how much of the RAM is available for each of the clients! 99 - 100 - In the virtual case, the whole point of virtualization is to statistically 101 - multiplex physical resources across the varying demands of multiple 102 - virtual machines. This is really hard to do with RAM and efforts to do 103 - it well with no kernel changes have essentially failed (except in some 104 - well-publicized special-case workloads). 105 - Specifically, the Xen Transcendent Memory backend allows otherwise 106 - "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple 107 - virtual machines, but the pages can be compressed and deduplicated to 108 - optimize RAM utilization. 
And when guest OS's are induced to surrender 109 - underutilized RAM (e.g. with "selfballooning"), sudden unexpected 110 - memory pressure may result in swapping; frontswap allows those pages 111 - to be swapped to and from hypervisor RAM (if overall host system memory 112 - conditions allow), thus mitigating the potentially awful performance impact 113 - of unplanned swapping. 114 - 115 - A KVM implementation is underway and has been RFC'ed to lkml. And, 116 - using frontswap, investigation is also underway on the use of NVM as 117 - a memory extension technology. 118 - 119 - * Sure there may be performance advantages in some situations, but 120 - what's the space/time overhead of frontswap? 121 - 122 - If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into 123 - nothingness and the only overhead is a few extra bytes per swapon'ed 124 - swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" 125 - registers, there is one extra global variable compared to zero for 126 - every swap page read or written. If CONFIG_FRONTSWAP is enabled 127 - AND a frontswap backend registers AND the backend fails every "store" 128 - request (i.e. provides no memory despite claiming it might), 129 - CPU overhead is still negligible -- and since every frontswap fail 130 - precedes a swap page write-to-disk, the system is highly likely 131 - to be I/O bound and using a small fraction of a percent of a CPU 132 - will be irrelevant anyway. 133 - 134 - As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend 135 - registers, one bit is allocated for every swap page for every swap 136 - device that is swapon'd. This is added to the EIGHT bits (which 137 - was sixteen until about 2.6.34) that the kernel already allocates 138 - for every swap page for every swap device that is swapon'd. (Hugh 139 - Dickins has observed that frontswap could probably steal one of 140 - the existing eight bits, but let's worry about that minor optimization 141 - later.) 
For very large swap disks (which are rare) on a standard 142 - 4K pagesize, this is 1MB per 32GB swap. 143 - 144 - When swap pages are stored in transcendent memory instead of written 145 - out to disk, there is a side effect that this may create more memory 146 - pressure that can potentially outweigh the other advantages. A 147 - backend, such as zcache, must implement policies to carefully (but 148 - dynamically) manage memory limits to ensure this doesn't happen. 149 - 150 - * OK, how about a quick overview of what this frontswap patch does 151 - in terms that a kernel hacker can grok? 152 - 153 - Let's assume that a frontswap "backend" has registered during 154 - kernel initialization; this registration indicates that this 155 - frontswap backend has access to some "memory" that is not directly 156 - accessible by the kernel. Exactly how much memory it provides is 157 - entirely dynamic and random. 158 - 159 - Whenever a swap-device is swapon'd frontswap_init() is called, 160 - passing the swap device number (aka "type") as a parameter. 161 - This notifies frontswap to expect attempts to "store" swap pages 162 - associated with that number. 163 - 164 - Whenever the swap subsystem is readying a page to write to a swap 165 - device (c.f swap_writepage()), frontswap_store is called. Frontswap 166 - consults with the frontswap backend and if the backend says it does NOT 167 - have room, frontswap_store returns -1 and the kernel swaps the page 168 - to the swap device as normal. Note that the response from the frontswap 169 - backend is unpredictable to the kernel; it may choose to never accept a 170 - page, it could accept every ninth page, or it might accept every 171 - page. But if the backend does accept a page, the data from the page 172 - has already been copied and associated with the type and offset, 173 - and the backend guarantees the persistence of the data. 
In this case, 174 - frontswap sets a bit in the "frontswap_map" for the swap device 175 - corresponding to the page offset on the swap device to which it would 176 - otherwise have written the data. 177 - 178 - When the swap subsystem needs to swap-in a page (swap_readpage()), 179 - it first calls frontswap_load() which checks the frontswap_map to 180 - see if the page was earlier accepted by the frontswap backend. If 181 - it was, the page of data is filled from the frontswap backend and 182 - the swap-in is complete. If not, the normal swap-in code is 183 - executed to obtain the page of data from the real swap device. 184 - 185 - So every time the frontswap backend accepts a page, a swap device read 186 - and (potentially) a swap device write are replaced by a "frontswap backend 187 - store" and (possibly) a "frontswap backend loads", which are presumably much 188 - faster. 189 - 190 - * Can't frontswap be configured as a "special" swap device that is 191 - just higher priority than any real swap device (e.g. like zswap, 192 - or maybe swap-over-nbd/NFS)? 193 - 194 - No. First, the existing swap subsystem doesn't allow for any kind of 195 - swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, 196 - but this would require fairly drastic changes. Even if it were 197 - rewritten, the existing swap subsystem uses the block I/O layer which 198 - assumes a swap device is fixed size and any page in it is linearly 199 - addressable. Frontswap barely touches the existing swap subsystem, 200 - and works around the constraints of the block I/O subsystem to provide 201 - a great deal of flexibility and dynamicity. 202 - 203 - For example, the acceptance of any swap page by the frontswap backend is 204 - entirely unpredictable. This is critical to the definition of frontswap 205 - backends because it grants completely dynamic discretion to the 206 - backend. In zcache, one cannot know a priori how compressible a page is. 
207 - "Poorly" compressible pages can be rejected, and "poorly" can itself be 208 - defined dynamically depending on current memory constraints. 209 - 210 - Further, frontswap is entirely synchronous whereas a real swap 211 - device is, by definition, asynchronous and uses block I/O. The 212 - block I/O layer is not only unnecessary, but may perform "optimizations" 213 - that are inappropriate for a RAM-oriented device including delaying 214 - the write of some pages for a significant amount of time. Synchrony is 215 - required to ensure the dynamicity of the backend and to avoid thorny race 216 - conditions that would unnecessarily and greatly complicate frontswap 217 - and/or the block I/O subsystem. That said, only the initial "store" 218 - and "load" operations need be synchronous. A separate asynchronous thread 219 - is free to manipulate the pages stored by frontswap. For example, 220 - the "remotification" thread in RAMster uses standard asynchronous 221 - kernel sockets to move compressed frontswap pages to a remote machine. 222 - Similarly, a KVM guest-side implementation could do in-guest compression 223 - and use "batched" hypercalls. 224 - 225 - In a virtualized environment, the dynamicity allows the hypervisor 226 - (or host OS) to do "intelligent overcommit". For example, it can 227 - choose to accept pages only until host-swapping might be imminent, 228 - then force guests to do their own swapping. 229 - 230 - There is a downside to the transcendent memory specifications for 231 - frontswap: Since any "store" might fail, there must always be a real 232 - slot on a real swap device to swap the page. Thus frontswap must be 233 - implemented as a "shadow" to every swapon'd device with the potential 234 - capability of holding every page that the swap device might have held 235 - and the possibility that it might hold no pages at all. This means 236 - that frontswap cannot contain more pages than the total of swapon'd 237 - swap devices. 
For example, if NO swap device is configured on some 238 - installation, frontswap is useless. Swapless portable devices 239 - can still use frontswap but a backend for such devices must configure 240 - some kind of "ghost" swap device and ensure that it is never used. 241 - 242 - * Why this weird definition about "duplicate stores"? If a page 243 - has been previously successfully stored, can't it always be 244 - successfully overwritten? 245 - 246 - Nearly always it can, but no, sometimes it cannot. Consider an example 247 - where data is compressed and the original 4K page has been compressed 248 - to 1K. Now an attempt is made to overwrite the page with data that 249 - is non-compressible and so would take the entire 4K. But the backend 250 - has no more space. In this case, the store must be rejected. Whenever 251 - frontswap rejects a store that would overwrite, it also must invalidate 252 - the old data and ensure that it is no longer accessible. Since the 253 - swap subsystem then writes the new data to the read swap device, 254 - this is the correct course of action to ensure coherency. 255 - 256 - * Why does the frontswap patch create the new include file swapfile.h? 257 - 258 - The frontswap code depends on some swap-subsystem-internal data 259 - structures that have, over the years, moved back and forth between 260 - static and global. This seemed a reasonable compromise: Define 261 - them as global but declare them in a new include file that isn't 262 - included by the large number of source files that include swap.h. 263 - 264 - Dan Magenheimer, last updated April 9, 2012
+1
Documentation/mm/highmem.rst
··· 206 206 ========= 207 207 208 208 .. kernel-doc:: include/linux/highmem.h 209 + .. kernel-doc:: mm/highmem.c 209 210 .. kernel-doc:: include/linux/highmem-internal.h
+7 -7
Documentation/mm/hugetlbfs_reserv.rst
··· 271 271 Freeing Huge Pages 272 272 ================== 273 273 274 - Huge page freeing is performed by the routine free_huge_page(). This routine 275 - is the destructor for hugetlbfs compound pages. As a result, it is only 276 - passed a pointer to the page struct. When a huge page is freed, reservation 277 - accounting may need to be performed. This would be the case if the page was 278 - associated with a subpool that contained reserves, or the page is being freed 279 - on an error path where a global reserve count must be restored. 274 + Huge pages are freed by free_huge_folio(). It is only passed a pointer 275 + to the folio as it is called from the generic MM code. When a huge page 276 + is freed, reservation accounting may need to be performed. This would 277 + be the case if the page was associated with a subpool that contained 278 + reserves, or the page is being freed on an error path where a global 279 + reserve count must be restored. 280 280 281 281 The page->private field points to any subpool associated with the page. 282 282 If the PagePrivate flag is set, it indicates the global reserve count should ··· 525 525 page is allocated but before it is instantiated. In this case, the page 526 526 allocation has consumed the reservation and made the appropriate subpool, 527 527 reservation map and global count adjustments. If the page is freed at this 528 - time (before instantiation and clearing of PagePrivate), then free_huge_page 528 + time (before instantiation and clearing of PagePrivate), then free_huge_folio 529 529 will increment the global reservation count. However, the reservation map 530 530 indicates the reservation was consumed. This resulting inconsistent state 531 531 will cause the 'leak' of a reserved huge page. The global reserve count will
-1
Documentation/mm/index.rst
··· 44 44 balance 45 45 damon/index 46 46 free_page_reporting 47 - frontswap 48 47 hmm 49 48 hwpoison 50 49 hugetlbfs_reserv
+6 -6
Documentation/mm/split_page_table_lock.rst
··· 58 58 =================================================== 59 59 60 60 There's no need in special enabling of PTE split page table lock: everything 61 - required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which 61 + required is done by pagetable_pte_ctor() and pagetable_pte_dtor(), which 62 62 must be called on PTE table allocation / freeing. 63 63 64 64 Make sure the architecture doesn't use slab allocator for page table ··· 68 68 PMD split lock only makes sense if you have more than two page table 69 69 levels. 70 70 71 - PMD split lock enabling requires pgtable_pmd_page_ctor() call on PMD table 72 - allocation and pgtable_pmd_page_dtor() on freeing. 71 + PMD split lock enabling requires pagetable_pmd_ctor() call on PMD table 72 + allocation and pagetable_pmd_dtor() on freeing. 73 73 74 74 Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and 75 75 pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing ··· 77 77 78 78 With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK. 79 79 80 - NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must 80 + NOTE: pagetable_pte_ctor() and pagetable_pmd_ctor() can fail -- it must 81 81 be handled properly. 82 82 83 83 page->ptl ··· 97 97 split lock with enabled DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, but costs 98 98 one more cache line for indirect access; 99 99 100 - The spinlock_t allocated in pgtable_pte_page_ctor() for PTE table and in 101 - pgtable_pmd_page_ctor() for PMD table. 100 + The spinlock_t allocated in pagetable_pte_ctor() for PTE table and in 101 + pagetable_pmd_ctor() for PMD table. 102 102 103 103 Please, never access page->ptl directly -- use appropriate helper.
+1
Documentation/mm/vmemmap_dedup.rst
··· 210 210 211 211 The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64), 212 212 PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64). 213 + For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst 213 214 214 215 The differences with HugeTLB are relatively minor. 215 216
+5
Documentation/mm/zsmalloc.rst
··· 263 263 objects and release zspages. In these cases, it is recommended to decrease 264 264 the limit on the size of the zspage chains (as specified by the 265 265 CONFIG_ZSMALLOC_CHAIN_SIZE option). 266 + 267 + Functions 268 + ========= 269 + 270 + .. kernel-doc:: mm/zsmalloc.c
+1
Documentation/powerpc/index.rst
··· 36 36 ultravisor 37 37 vas-api 38 38 vcpudispatch_stats 39 + vmemmap_dedup 39 40 40 41 features 41 42
+101
Documentation/powerpc/vmemmap_dedup.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ========== 4 + Device DAX 5 + ========== 6 + 7 + The device-dax interface uses the tail deduplication technique explained in 8 + Documentation/mm/vmemmap_dedup.rst 9 + 10 + On powerpc, vmemmap deduplication is only used with radix MMU translation. Also 11 + with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap 12 + deduplication. 13 + 14 + With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap 15 + page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no 16 + vmemmap deduplication possible. 17 + 18 + With 1G PUD level mapping, we require 16384 struct pages and a single 64K 19 + vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we 20 + require 16 64K pages in vmemmap to map the struct page for 1G PUD level mapping. 21 + 22 + Here's how things look like on device-dax after the sections are populated:: 23 + +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ 24 + | | | 0 | -------------> | 0 | 25 + | | +-----------+ +-----------+ 26 + | | | 1 | -------------> | 1 | 27 + | | +-----------+ +-----------+ 28 + | | | 2 | ----------------^ ^ ^ ^ ^ ^ 29 + | | +-----------+ | | | | | 30 + | | | 3 | ------------------+ | | | | 31 + | | +-----------+ | | | | 32 + | | | 4 | --------------------+ | | | 33 + | PUD | +-----------+ | | | 34 + | level | | . | ----------------------+ | | 35 + | mapping | +-----------+ | | 36 + | | | . | ------------------------+ | 37 + | | +-----------+ | 38 + | | | 15 | --------------------------+ 39 + | | +-----------+ 40 + | | 41 + | | 42 + | | 43 + +-----------+ 44 + 45 + 46 + With 4K page size, 2M PMD level mapping requires 512 struct pages and a single 47 + 4K vmemmap page contains 64 struct pages(4K/sizeof(struct page)). Hence we 48 + require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping. 
49 + 50 + Here's how things look like on device-dax after the sections are populated:: 51 + 52 + +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ 53 + | | | 0 | -------------> | 0 | 54 + | | +-----------+ +-----------+ 55 + | | | 1 | -------------> | 1 | 56 + | | +-----------+ +-----------+ 57 + | | | 2 | ----------------^ ^ ^ ^ ^ ^ 58 + | | +-----------+ | | | | | 59 + | | | 3 | ------------------+ | | | | 60 + | | +-----------+ | | | | 61 + | | | 4 | --------------------+ | | | 62 + | PMD | +-----------+ | | | 63 + | level | | 5 | ----------------------+ | | 64 + | mapping | +-----------+ | | 65 + | | | 6 | ------------------------+ | 66 + | | +-----------+ | 67 + | | | 7 | --------------------------+ 68 + | | +-----------+ 69 + | | 70 + | | 71 + | | 72 + +-----------+ 73 + 74 + With 1G PUD level mapping, we require 262144 struct pages and a single 4K 75 + vmemmap page can contain 64 struct pages (4K/sizeof(struct page)). Hence we 76 + require 4096 4K pages in vmemmap to map the struct pages for 1G PUD level 77 + mapping. 78 + 79 + Here's how things look like on device-dax after the sections are populated:: 80 + 81 + +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ 82 + | | | 0 | -------------> | 0 | 83 + | | +-----------+ +-----------+ 84 + | | | 1 | -------------> | 1 | 85 + | | +-----------+ +-----------+ 86 + | | | 2 | ----------------^ ^ ^ ^ ^ ^ 87 + | | +-----------+ | | | | | 88 + | | | 3 | ------------------+ | | | | 89 + | | +-----------+ | | | | 90 + | | | 4 | --------------------+ | | | 91 + | PUD | +-----------+ | | | 92 + | level | | . | ----------------------+ | | 93 + | mapping | +-----------+ | | 94 + | | | . | ------------------------+ | 95 + | | +-----------+ | 96 + | | | 4095 | --------------------------+ 97 + | | +-----------+ 98 + | | 99 + | | 100 + | | 101 + +-----------+
-196
Documentation/translations/zh_CN/mm/frontswap.rst
···
  1 - :Original: Documentation/mm/frontswap.rst
  2 -
  3 - :翻译:
  4 -
  5 -  司延腾 Yanteng Si <siyanteng@loongson.cn>
  6 -
  7 - :校译:
  8 -
  9 - =========
 10 - Frontswap
 11 - =========
 12 -
 13 - Frontswap为交换页提供了一个 “transcendent memory” 的接口。在一些环境中,由
 14 - 于交换页被保存在RAM(或类似RAM的设备)中,而不是交换磁盘,因此可以获得巨大的性能
 15 - 节省(提高)。
 16 -
 17 - .. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/
 18 -
 19 - Frontswap之所以这么命名,是因为它可以被认为是与swap设备的“back”存储相反。存
 20 - 储器被认为是一个同步并发安全的面向页面的“伪RAM设备”,符合transcendent memory
 21 - (如Xen的“tmem”,或内核内压缩内存,又称“zcache”,或未来的类似RAM的设备)的要
 22 - 求;这个伪RAM设备不能被内核直接访问或寻址,其大小未知且可能随时间变化。驱动程序通过
 23 - 调用frontswap_register_ops将自己与frontswap链接起来,以适当地设置frontswap_ops
 24 - 的功能,它提供的功能必须符合某些策略,如下所示:
 25 -
 26 - 一个 “init” 将设备准备好接收与指定的交换设备编号(又称“类型”)相关的frontswap
 27 - 交换页。一个 “store” 将把该页复制到transcendent memory,并与该页的类型和偏移
 28 - 量相关联。一个 “load” 将把该页,如果找到的话,从transcendent memory复制到内核
 29 - 内存,但不会从transcendent memory中删除该页。一个 “invalidate_page” 将从
 30 - transcendent memory中删除该页,一个 “invalidate_area” 将删除所有与交换类型
 31 - 相关的页(例如,像swapoff)并通知 “device” 拒绝进一步存储该交换类型。
 32 -
 33 - 一旦一个页面被成功存储,在该页面上的匹配加载通常会成功。因此,当内核发现自己处于需
 34 - 要交换页面的情况时,它首先尝试使用frontswap。如果存储的结果是成功的,那么数据就已
 35 - 经成功的保存到了transcendent memory中,并且避免了磁盘写入,如果后来再读回数据,
 36 - 也避免了磁盘读取。如果存储返回失败,transcendent memory已经拒绝了该数据,且该页
 37 - 可以像往常一样被写入交换空间。
 38 -
 39 - 请注意,如果一个页面被存储,而该页面已经存在于transcendent memory中(一个 “重复”
 40 - 的存储),要么存储成功,数据被覆盖,要么存储失败,该页面被废止。这确保了旧的数据永远
 41 - 不会从frontswap中获得。
 42 -
 43 - 如果配置正确,对frontswap的监控是通过 `/sys/kernel/debug/frontswap` 目录下的
 44 - debugfs完成的。frontswap的有效性可以通过以下方式测量(在所有交换设备中):
 45 -
 46 - ``failed_stores``
 47 - 	有多少次存储的尝试是失败的
 48 -
 49 - ``loads``
 50 - 	尝试了多少次加载(应该全部成功)
 51 -
 52 - ``succ_stores``
 53 - 	有多少次存储的尝试是成功的
 54 -
 55 - ``invalidates``
 56 - 	尝试了多少次作废
 57 -
 58 - 后台实现可以提供额外的指标。
 59 -
 60 - 经常问到的问题
 61 - ==============
 62 -
 63 - * 价值在哪里?
 64 -
 65 - 当一个工作负载开始交换时,性能就会下降。Frontswap通过提供一个干净的、动态的接口来
 66 - 读取和写入交换页到 “transcendent memory”,从而大大增加了许多这样的工作负载的性
 67 - 能,否则内核是无法直接寻址的。当数据被转换为不同的形式和大小(比如压缩)或者被秘密
 68 - 移动(对于一些类似RAM的设备来说,这可能对写平衡很有用)时,这个接口是理想的。交换
 69 - 页(和被驱逐的页面缓存页)是这种比RAM慢但比磁盘快得多的“伪RAM设备”的一大用途。
 70 -
 71 - Frontswap对内核的影响相当小,为各种系统配置中更动态、更灵活的RAM利用提供了巨大的
 72 - 灵活性:
 73 -
 74 - 在单一内核的情况下,又称“zcache”,页面被压缩并存储在本地内存中,从而增加了可以安
 75 - 全保存在RAM中的匿名页面总数。Zcache本质上是用压缩/解压缩的CPU周期换取更好的内存利
 76 - 用率。Benchmarks测试显示,当内存压力较低时,几乎没有影响,而在高内存压力下的一些
 77 - 工作负载上,则有明显的性能改善(25%以上)。
 78 -
 79 - “RAMster” 在zcache的基础上增加了对集群系统的 “peer-to-peer” transcendent memory
 80 - 的支持。Frontswap页面像zcache一样被本地压缩,但随后被“remotified” 到另一个系
 81 - 统的RAM。这使得RAM可以根据需要动态地来回负载平衡,也就是说,当系统A超载时,它可以
 82 - 交换到系统B,反之亦然。RAMster也可以被配置成一个内存服务器,因此集群中的许多服务器
 83 - 可以根据需要动态地交换到配置有大量内存的单一服务器上......而不需要预先配置每个客户
 84 - 有多少内存可用
 85 -
 86 - 在虚拟情况下,虚拟化的全部意义在于统计地将物理资源在多个虚拟机的不同需求之间进行复
 87 - 用。对于RAM来说,这真的很难做到,而且在不改变内核的情况下,要做好这一点的努力基本上
 88 - 是失败的(除了一些广为人知的特殊情况下的工作负载)。具体来说,Xen Transcendent Memory
 89 - 后端允许管理器拥有的RAM “fallow”,不仅可以在多个虚拟机之间进行“time-shared”,
 90 - 而且页面可以被压缩和重复利用,以优化RAM的利用率。当客户操作系统被诱导交出未充分利用
 91 - 的RAM时(如 “selfballooning”),突然出现的意外内存压力可能会导致交换;frontswap
 92 - 允许这些页面被交换到管理器RAM中或从管理器RAM中交换(如果整体主机系统内存条件允许),
 93 - 从而减轻计划外交换可能带来的可怕的性能影响。
 94 -
 95 - 一个KVM的实现正在进行中,并且已经被RFC'ed到lkml。而且,利用frontswap,对NVM作为
 96 - 内存扩展技术的调查也在进行中。
 97 -
 98 - * 当然,在某些情况下可能有性能上的优势,但frontswap的空间/时间开销是多少?
 99 -
100 - 如果 CONFIG_FRONTSWAP 被禁用,每个 frontswap 钩子都会编译成空,唯一的开销是每
101 - 个 swapon'ed swap 设备的几个额外字节。如果 CONFIG_FRONTSWAP 被启用,但没有
102 - frontswap的 “backend” 寄存器,每读或写一个交换页就会有一个额外的全局变量,而不
103 - 是零。如果 CONFIG_FRONTSWAP 被启用,并且有一个frontswap的backend寄存器,并且
104 - 后端每次 “store” 请求都失败(即尽管声称可能,但没有提供内存),CPU 的开销仍然可以
105 - 忽略不计 - 因为每次frontswap失败都是在交换页写到磁盘之前,系统很可能是 I/O 绑定
106 - 的,无论如何使用一小部分的 CPU 都是不相关的。
107 -
108 - 至于空间,如果CONFIG_FRONTSWAP被启用,并且有一个frontswap的backend注册,那么
109 - 每个交换设备的每个交换页都会被分配一个比特。这是在内核已经为每个交换设备的每个交换
110 - 页分配的8位(在2.6.34之前是16位)上增加的。(Hugh Dickins观察到,frontswap可能
111 - 会偷取现有的8个比特,但是我们以后再来担心这个小的优化问题)。对于标准的4K页面大小的
112 - 非常大的交换盘(这很罕见),这是每32GB交换盘1MB开销。
113 -
114 - 当交换页存储在transcendent memory中而不是写到磁盘上时,有一个副作用,即这可能会
115 - 产生更多的内存压力,有可能超过其他的优点。一个backend,比如zcache,必须实现策略
116 - 来仔细(但动态地)管理内存限制,以确保这种情况不会发生。
117 -
118 - * 好吧,那就用内核骇客能理解的术语来快速概述一下这个frontswap补丁的作用如何?
119 -
120 - 我们假设在内核初始化过程中,一个frontswap 的 “backend” 已经注册了;这个注册表
121 - 明这个frontswap 的 “backend” 可以访问一些不被内核直接访问的“内存”。它到底提
122 - 供了多少内存是完全动态和随机的。
123 -
124 - 每当一个交换设备被交换时,就会调用frontswap_init(),把交换设备的编号(又称“类
125 - 型”)作为一个参数传给它。这就通知了frontswap,以期待 “store” 与该号码相关的交
126 - 换页的尝试。
127 -
128 - 每当交换子系统准备将一个页面写入交换设备时(参见swap_writepage()),就会调用
129 - frontswap_store。Frontswap与frontswap backend协商,如果backend说它没有空
130 - 间,frontswap_store返回-1,内核就会照常把页换到交换设备上。注意,来自frontswap
131 - backend的响应对内核来说是不可预测的;它可能选择从不接受一个页面,可能接受每九个
132 - 页面,也可能接受每一个页面。但是如果backend确实接受了一个页面,那么这个页面的数
133 - 据已经被复制并与类型和偏移量相关联了,而且backend保证了数据的持久性。在这种情况
134 - 下,frontswap在交换设备的“frontswap_map” 中设置了一个位,对应于交换设备上的
135 - 页面偏移量,否则它就会将数据写入该设备。
136 -
137 - 当交换子系统需要交换一个页面时(swap_readpage()),它首先调用frontswap_load(),
138 - 检查frontswap_map,看这个页面是否早先被frontswap backend接受。如果是,该页
139 - 的数据就会从frontswap后端填充,换入就完成了。如果不是,正常的交换代码将被执行,
140 - 以便从真正的交换设备上获得这一页的数据。
141 -
142 - 所以每次frontswap backend接受一个页面时,交换设备的读取和(可能)交换设备的写
143 - 入都被 “frontswap backend store” 和(可能)“frontswap backend loads”
144 - 所取代,这可能会快得多。
145 -
146 - * frontswap不能被配置为一个 “特殊的” 交换设备,它的优先级要高于任何真正的交换
147 - 设备(例如像zswap,或者可能是swap-over-nbd/NFS)?
148 -
149 - 首先,现有的交换子系统不允许有任何种类的交换层次结构。也许它可以被重写以适应层次
150 - 结构,但这将需要相当大的改变。即使它被重写,现有的交换子系统也使用了块I/O层,它
151 - 假定交换设备是固定大小的,其中的任何页面都是可线性寻址的。Frontswap几乎没有触
152 - 及现有的交换子系统,而是围绕着块I/O子系统的限制,提供了大量的灵活性和动态性。
153 -
154 - 例如,frontswap backend对任何交换页的接受是完全不可预测的。这对frontswap backend
155 - 的定义至关重要,因为它赋予了backend完全动态的决定权。在zcache中,人们无法预
156 - 先知道一个页面的可压缩性如何。可压缩性 “差” 的页面会被拒绝,而 “差” 本身也可
157 - 以根据当前的内存限制动态地定义。
158 -
159 - 此外,frontswap是完全同步的,而真正的交换设备,根据定义,是异步的,并且使用
160 - 块I/O。块I/O层不仅是不必要的,而且可能进行 “优化”,这对面向RAM的设备来说是
161 - 不合适的,包括将一些页面的写入延迟相当长的时间。同步是必须的,以确保后端的动
162 - 态性,并避免棘手的竞争条件,这将不必要地大大增加frontswap和/或块I/O子系统的
163 - 复杂性。也就是说,只有最初的 “store” 和 “load” 操作是需要同步的。一个独立
164 - 的异步线程可以自由地操作由frontswap存储的页面。例如,RAMster中的 “remotification”
165 - 线程使用标准的异步内核套接字,将压缩的frontswap页面移动到远程机器。同样,
166 - KVM的客户方实现可以进行客户内压缩,并使用 “batched” hypercalls。
167 -
168 - 在虚拟化环境中,动态性允许管理程序(或主机操作系统)做“intelligent overcommit”。
169 - 例如,它可以选择只接受页面,直到主机交换可能即将发生,然后强迫客户机做他们
170 - 自己的交换。
171 -
172 - transcendent memory规格的frontswap有一个坏处。因为任何 “store” 都可
173 - 能失败,所以必须在一个真正的交换设备上有一个真正的插槽来交换页面。因此,
174 - frontswap必须作为每个交换设备的 “影子” 来实现,它有可能容纳交换设备可能
175 - 容纳的每一个页面,也有可能根本不容纳任何页面。这意味着frontswap不能包含比
176 - swap设备总数更多的页面。例如,如果在某些安装上没有配置交换设备,frontswap
177 - 就没有用。无交换设备的便携式设备仍然可以使用frontswap,但是这种设备的
178 - backend必须配置某种 “ghost” 交换设备,并确保它永远不会被使用。
179 -
180 -
181 - * 为什么会有这种关于 “重复存储” 的奇怪定义?如果一个页面以前被成功地存储过,
182 - 难道它不能总是被成功地覆盖吗?
183 -
184 - 几乎总是可以的,不,有时不能。考虑一个例子,数据被压缩了,原来的4K页面被压
185 - 缩到了1K。现在,有人试图用不可压缩的数据覆盖该页,因此会占用整个4K。但是
186 - backend没有更多的空间了。在这种情况下,这个存储必须被拒绝。每当frontswap
187 - 拒绝一个会覆盖的存储时,它也必须使旧的数据作废,并确保它不再被访问。因为交
188 - 换子系统会把新的数据写到读交换设备上,这是确保一致性的正确做法。
189 -
190 - * 为什么frontswap补丁会创建新的头文件swapfile.h?
191 -
192 - frontswap代码依赖于一些swap子系统内部的数据结构,这些数据结构多年来一直
193 - 在静态和全局之间来回移动。这似乎是一个合理的妥协:将它们定义为全局,但在一
194 - 个新的包含文件中声明它们,该文件不被包含swap.h的大量源文件所包含。
195 -
196 - Dan Magenheimer,最后更新于2012年4月9日
+2 -2
Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst
···
219 219 释放巨页
220 220 ========
221 221
222     - 巨页释放是由函数free_huge_page()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
    222 + 巨页释放是由函数free_huge_folio()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
223 223 递一个指向页面结构体的指针。当一个巨页被释放时,可能需要进行预留计算。如果该页与包含保
224 224 留的子池相关联,或者该页在错误路径上被释放,必须恢复全局预留计数,就会出现这种情况。
225 225
···
387 387
388 388 然而,有几种情况是,在一个巨页被分配后,但在它被实例化之前,就遇到了错误。在这种情况下,
389 389 页面分配已经消耗了预留,并进行了适当的子池、预留映射和全局计数调整。如果页面在这个时候被释放
390     - (在实例化和清除PagePrivate之前),那么free_huge_page将增加全局预留计数。然而,预留映射
    390 + (在实例化和清除PagePrivate之前),那么free_huge_folio将增加全局预留计数。然而,预留映射
391 391 显示报留被消耗了。这种不一致的状态将导致预留的巨页的 “泄漏” 。全局预留计数将比它原本的要高,
392 392 并阻止分配一个预先分配的页面。
393 393
-1
Documentation/translations/zh_CN/mm/index.rst
···
42 42    damon/index
43 43    free_page_reporting
44 44    ksm
45    -    frontswap
46 45    hmm
47 46    hwpoison
48 47    hugetlbfs_reserv
+7 -7
Documentation/translations/zh_CN/mm/split_page_table_lock.rst
···
56 56 架构对分页表锁的支持
57 57 ====================
58 58
59    - 没有必要特别启用PTE分页表锁:所有需要的东西都由pgtable_pte_page_ctor()
60    - 和pgtable_pte_page_dtor()完成,它们必须在PTE表分配/释放时被调用。
   59 + 没有必要特别启用PTE分页表锁:所有需要的东西都由pagetable_pte_ctor()
   60 + 和pagetable_pte_dtor()完成,它们必须在PTE表分配/释放时被调用。
61 61
62 62 确保架构不使用slab分配器来分配页表:slab使用page->slab_cache来分配其页
63 63 面。这个区域与page->ptl共享存储。
64 64
65 65 PMD分页锁只有在你有两个以上的页表级别时才有意义。
66 66
67    - 启用PMD分页锁需要在PMD表分配时调用pgtable_pmd_page_ctor(),在释放时调
68    - 用pgtable_pmd_page_dtor()。
   67 + 启用PMD分页锁需要在PMD表分配时调用pagetable_pmd_ctor(),在释放时调
   68 + 用pagetable_pmd_dtor()。
69 69
70 70 分配通常发生在pmd_alloc_one()中,释放发生在pmd_free()和pmd_free_tlb()
71 71 中,但要确保覆盖所有的PMD表分配/释放路径:即X86_PAE在pgd_alloc()中预先
···
73 73
74 74 一切就绪后,你可以设置CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK。
75 75
76    - 注意:pgtable_pte_page_ctor()和pgtable_pmd_page_ctor()可能失败--必
   76 + 注意:pagetable_pte_ctor()和pagetable_pmd_ctor()可能失败--必
77 77 须正确处理。
78 78
79 79 page->ptl
···
90 90 的指针并动态分配它。这允许在启用DEBUG_SPINLOCK或DEBUG_LOCK_ALLOC的
91 91 情况下使用分页锁,但由于间接访问而多花了一个缓存行。
92 92
93    - PTE表的spinlock_t分配在pgtable_pte_page_ctor()中,PMD表的spinlock_t
94    - 分配在pgtable_pmd_page_ctor()中。
   93 + PTE表的spinlock_t分配在pagetable_pte_ctor()中,PMD表的spinlock_t
   94 + 分配在pagetable_pmd_ctor()中。
95 95
96 96 请不要直接访问page->ptl - -使用适当的辅助函数。
-8
MAINTAINERS
···
8438 8438 F:	include/linux/freezer.h
8439 8439 F:	kernel/freezer.c
8440 8440
8441      - FRONTSWAP API
8442      - M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
8443      - L:	linux-kernel@vger.kernel.org
8444      - S:	Maintained
8445      - F:	include/linux/frontswap.h
8446      - F:	mm/frontswap.c
8447      -
8448 8441 FS-CACHE: LOCAL CACHING FOR NETWORK FILESYSTEMS
8449 8442 M:	David Howells <dhowells@redhat.com>
8450 8443 L:	linux-cachefs@redhat.com (moderated for non-subscribers)
···
14891 14898 M:	Eric Dumazet <edumazet@google.com>
14892 14899 L:	netdev@vger.kernel.org
14893 14900 S:	Maintained
14894       - F:	include/linux/net_mm.h
14895 14901 F:	include/linux/tcp.h
14896 14902 F:	include/net/tcp.h
14897 14903 F:	include/trace/events/tcp.h
+10 -3
arch/alpha/include/asm/cacheflush.h
···
53 53 #define flush_icache_user_page flush_icache_user_page
54 54 #endif /* CONFIG_SMP */
55 55
56    - /* This is used only in __do_fault and do_swap_page. */
57    - #define flush_icache_page(vma, page) \
58    - 	flush_icache_user_page((vma), (page), 0, 0)
   56 + /*
   57 +  * Both implementations of flush_icache_user_page flush the entire
   58 +  * address space, so one call, no matter how many pages.
   59 +  */
   60 + static inline void flush_icache_pages(struct vm_area_struct *vma,
   61 + 		struct page *page, unsigned int nr)
   62 + {
   63 + 	flush_icache_user_page(vma, page, 0, 0);
   64 + }
   65 + #define flush_icache_pages flush_icache_pages
59 66
60 67 #include <asm-generic/cacheflush.h>
61 68
+8 -2
arch/alpha/include/asm/pgtable.h
···
26 26  * hook is made available.
27 27  */
28 28 #define set_pte(pteptr, pteval) ((*(pteptr)) = (pteval))
29    - #define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
30 29
31 30 /* PMD_SHIFT determines the size of the area a second-level page table can map */
32 31 #define PMD_SHIFT	(PAGE_SHIFT + (PAGE_SHIFT-3))
···
188 189  * and a page entry and page directory to the page they refer to.
189 190  */
190 191 #define page_to_pa(page)	(page_to_pfn(page) << PAGE_SHIFT)
191     - #define pte_pfn(pte)	(pte_val(pte) >> 32)
    192 + #define PFN_PTE_SHIFT	32
    193 + #define pte_pfn(pte)	(pte_val(pte) >> PFN_PTE_SHIFT)
192 194
193 195 #define pte_page(pte)	pfn_to_page(pte_pfn(pte))
194 196 #define mk_pte(page, pgprot) \
···
300 300  */
301 301 extern inline void update_mmu_cache(struct vm_area_struct * vma,
302 302 	unsigned long address, pte_t *ptep)
    303 + {
    304 + }
    305 +
    306 + static inline void update_mmu_cache_range(struct vm_fault *vmf,
    307 + 		struct vm_area_struct *vma, unsigned long address,
    308 + 		pte_t *ptep, unsigned int nr)
303 309 {
304 310 }
305 311
+1
arch/arc/Kconfig
···
26 26 	select GENERIC_PENDING_IRQ if SMP
27 27 	select GENERIC_SCHED_CLOCK
28 28 	select GENERIC_SMP_IDLE_THREAD
   29 + 	select GENERIC_IOREMAP
29 30 	select HAVE_ARCH_KGDB
30 31 	select HAVE_ARCH_TRACEHOOK
31 32 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if ARC_MMU_V4
+4 -10
arch/arc/include/asm/cacheflush.h
···
18 18 #include <linux/mm.h>
19 19 #include <asm/shmparam.h>
20 20
21    - /*
22    -  * Semantically we need this because icache doesn't snoop dcache/dma.
23    -  * However ARC Cache flush requires paddr as well as vaddr, latter not available
24    -  * in the flush_icache_page() API. So we no-op it but do the equivalent work
25    -  * in update_mmu_cache()
26    -  */
27    - #define flush_icache_page(vma, page)
28    -
29 21 void flush_cache_all(void);
30 22
31 23 void flush_icache_range(unsigned long kstart, unsigned long kend);
32 24 void __sync_icache_dcache(phys_addr_t paddr, unsigned long vaddr, int len);
33    - void __inv_icache_page(phys_addr_t paddr, unsigned long vaddr);
34    - void __flush_dcache_page(phys_addr_t paddr, unsigned long vaddr);
   25 + void __inv_icache_pages(phys_addr_t paddr, unsigned long vaddr, unsigned nr);
   26 + void __flush_dcache_pages(phys_addr_t paddr, unsigned long vaddr, unsigned nr);
35 27
36 28 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
37 29
38 30 void flush_dcache_page(struct page *page);
   31 + void flush_dcache_folio(struct folio *folio);
   32 + #define flush_dcache_folio flush_dcache_folio
39 33
40 34 void dma_cache_wback_inv(phys_addr_t start, unsigned long sz);
41 35 void dma_cache_inv(phys_addr_t start, unsigned long sz);
+3 -4
arch/arc/include/asm/io.h
···
21 21 #endif
22 22
23 23 extern void __iomem *ioremap(phys_addr_t paddr, unsigned long size);
24    - extern void __iomem *ioremap_prot(phys_addr_t paddr, unsigned long size,
25    - 		unsigned long flags);
   24 + #define ioremap ioremap
   25 + #define ioremap_prot ioremap_prot
   26 + #define iounmap iounmap
26 27 static inline void __iomem *ioport_map(unsigned long port, unsigned int nr)
27 28 {
28 29 	return (void __iomem *)port;
···
32 31 static inline void ioport_unmap(void __iomem *addr)
33 32 {
34 33 }
35    -
36    - extern void iounmap(const volatile void __iomem *addr);
37 34
38 35 /*
39 36  * io{read,write}{16,32}be() macros
+5 -7
arch/arc/include/asm/pgtable-bits-arcv2.h
···
100 100 	return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
101 101 }
102 102
103     - static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
104     - 			      pte_t *ptep, pte_t pteval)
105     - {
106     - 	set_pte(ptep, pteval);
107     - }
    103 + struct vm_fault;
    104 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
    105 + 		unsigned long address, pte_t *ptep, unsigned int nr);
108 106
109     - void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
110     - 		      pte_t *ptep);
    107 + #define update_mmu_cache(vma, addr, ptep) \
    108 + 	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
111 109
112 110 /*
113 111  * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
+1
arch/arc/include/asm/pgtable-levels.h
···
169 169 #define pte_ERROR(e) \
170 170 	pr_crit("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, pte_val(e))
171 171
    172 + #define PFN_PTE_SHIFT		PAGE_SHIFT
172 173 #define pte_none(x)		(!pte_val(x))
173 174 #define pte_present(x)		(pte_val(x) & _PAGE_PRESENT)
174 175 #define pte_clear(mm,addr,ptep)	set_pte_at(mm, addr, ptep, __pte(0))
+37 -24
arch/arc/mm/cache.c
···
752 752  * There's a corollary case, where kernel READs from a userspace mapped page.
753 753  * If the U-mapping is not congruent to K-mapping, former needs flushing.
754 754  */
755     - void flush_dcache_page(struct page *page)
    755 + void flush_dcache_folio(struct folio *folio)
756 756 {
757 757 	struct address_space *mapping;
758 758
759 759 	if (!cache_is_vipt_aliasing()) {
760     - 		clear_bit(PG_dc_clean, &page->flags);
    760 + 		clear_bit(PG_dc_clean, &folio->flags);
761 761 		return;
762 762 	}
763 763
764 764 	/* don't handle anon pages here */
765     - 	mapping = page_mapping_file(page);
    765 + 	mapping = folio_flush_mapping(folio);
766 766 	if (!mapping)
767 767 		return;
768 768
···
771 771 	 * Make a note that K-mapping is dirty
772 772 	 */
773 773 	if (!mapping_mapped(mapping)) {
774     - 		clear_bit(PG_dc_clean, &page->flags);
775     - 	} else if (page_mapcount(page)) {
776     -
    774 + 		clear_bit(PG_dc_clean, &folio->flags);
    775 + 	} else if (folio_mapped(folio)) {
777 776 		/* kernel reading from page with U-mapping */
778     - 		phys_addr_t paddr = (unsigned long)page_address(page);
779     - 		unsigned long vaddr = page->index << PAGE_SHIFT;
    777 + 		phys_addr_t paddr = (unsigned long)folio_address(folio);
    778 + 		unsigned long vaddr = folio_pos(folio);
780 779
    780 + 		/*
    781 + 		 * vaddr is not actually the virtual address, but is
    782 + 		 * congruent to every user mapping.
    783 + 		 */
781 784 		if (addr_not_cache_congruent(paddr, vaddr))
782     - 			__flush_dcache_page(paddr, vaddr);
    785 + 			__flush_dcache_pages(paddr, vaddr,
    786 + 					folio_nr_pages(folio));
783 787 	}
784 788 }
    789 + EXPORT_SYMBOL(flush_dcache_folio);
    790 +
    791 + void flush_dcache_page(struct page *page)
    792 + {
    793 + 	return flush_dcache_folio(page_folio(page));
    794 + }
785 795 EXPORT_SYMBOL(flush_dcache_page);
786 796
···
931 921 }
932 922
933 923 /* wrapper to compile time eliminate alignment checks in flush loop */
934     - void __inv_icache_page(phys_addr_t paddr, unsigned long vaddr)
    924 + void __inv_icache_pages(phys_addr_t paddr, unsigned long vaddr, unsigned nr)
935 925 {
936     - 	__ic_line_inv_vaddr(paddr, vaddr, PAGE_SIZE);
    926 + 	__ic_line_inv_vaddr(paddr, vaddr, nr * PAGE_SIZE);
937 927 }
938 928
939 929 /*
940 930  * wrapper to clearout kernel or userspace mappings of a page
941 931  * For kernel mappings @vaddr == @paddr
942 932  */
943     - void __flush_dcache_page(phys_addr_t paddr, unsigned long vaddr)
    933 + void __flush_dcache_pages(phys_addr_t paddr, unsigned long vaddr, unsigned nr)
944 934 {
945     - 	__dc_line_op(paddr, vaddr & PAGE_MASK, PAGE_SIZE, OP_FLUSH_N_INV);
    935 + 	__dc_line_op(paddr, vaddr & PAGE_MASK, nr * PAGE_SIZE, OP_FLUSH_N_INV);
946 936 }
947 937
948 938 noinline void flush_cache_all(void)
···
972 962
973 963 	u_vaddr &= PAGE_MASK;
974 964
975     - 	__flush_dcache_page(paddr, u_vaddr);
    965 + 	__flush_dcache_pages(paddr, u_vaddr, 1);
976 966
977 967 	if (vma->vm_flags & VM_EXEC)
978     - 		__inv_icache_page(paddr, u_vaddr);
    968 + 		__inv_icache_pages(paddr, u_vaddr, 1);
979 969 }
980 970
981 971 void flush_cache_range(struct vm_area_struct *vma, unsigned long start,
···
988 978 		unsigned long u_vaddr)
989 979 {
990 980 	/* TBD: do we really need to clear the kernel mapping */
991     - 	__flush_dcache_page((phys_addr_t)page_address(page), u_vaddr);
992     - 	__flush_dcache_page((phys_addr_t)page_address(page),
993     - 			    (phys_addr_t)page_address(page));
    981 + 	__flush_dcache_pages((phys_addr_t)page_address(page), u_vaddr, 1);
    982 + 	__flush_dcache_pages((phys_addr_t)page_address(page),
    983 + 			    (phys_addr_t)page_address(page), 1);
994 984
995 985 }
996 986
···
999  989 void copy_user_highpage(struct page *to, struct page *from,
1000 990 	unsigned long u_vaddr, struct vm_area_struct *vma)
1001 991 {
     992 + 	struct folio *src = page_folio(from);
     993 + 	struct folio *dst = page_folio(to);
1002 994 	void *kfrom = kmap_atomic(from);
1003 995 	void *kto = kmap_atomic(to);
1004 996 	int clean_src_k_mappings = 0;
···
1017 1005 	 * addr_not_cache_congruent() is 0
1018 1006 	 */
1019 1007 	if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
1020      - 		__flush_dcache_page((unsigned long)kfrom, u_vaddr);
     1008 + 		__flush_dcache_pages((unsigned long)kfrom, u_vaddr, 1);
1021 1009 		clean_src_k_mappings = 1;
1022 1010 	}
···
1031 1019 	 * non copied user pages (e.g. read faults which wire in pagecache page
1032 1020 	 * directly).
1033 1021 	 */
1034      - 	clear_bit(PG_dc_clean, &to->flags);
     1022 + 	clear_bit(PG_dc_clean, &dst->flags);
1035 1023
1036 1024 	/*
1037 1025 	 * if SRC was already usermapped and non-congruent to kernel mapping
1038 1026 	 * sync the kernel mapping back to physical page
1039 1027 	 */
1040 1028 	if (clean_src_k_mappings) {
1041      - 		__flush_dcache_page((unsigned long)kfrom, (unsigned long)kfrom);
1042      - 		set_bit(PG_dc_clean, &from->flags);
     1029 + 		__flush_dcache_pages((unsigned long)kfrom,
     1030 + 				(unsigned long)kfrom, 1);
1043 1031 	} else {
1044      - 		clear_bit(PG_dc_clean, &from->flags);
     1032 + 		clear_bit(PG_dc_clean, &src->flags);
1045 1033 	}
1046 1034
1047 1035 	kunmap_atomic(kto);
···
1050 1038
1051 1039 void clear_user_page(void *to, unsigned long u_vaddr, struct page *page)
1052 1040 {
     1041 + 	struct folio *folio = page_folio(page);
1053 1042 	clear_page(to);
1054      - 	clear_bit(PG_dc_clean, &page->flags);
     1043 + 	clear_bit(PG_dc_clean, &folio->flags);
1055 1044 }
1056 1045 EXPORT_SYMBOL(clear_user_page);
1057 1046
+4 -45
arch/arc/mm/ioremap.c
···
 8  8 #include <linux/module.h>
 9  9 #include <linux/io.h>
10 10 #include <linux/mm.h>
11    - #include <linux/slab.h>
12 11 #include <linux/cache.h>
13 12
14 13 static inline bool arc_uncached_addr_space(phys_addr_t paddr)
···
24 25
25 26 void __iomem *ioremap(phys_addr_t paddr, unsigned long size)
26 27 {
27    - 	phys_addr_t end;
28    -
29    - 	/* Don't allow wraparound or zero size */
30    - 	end = paddr + size - 1;
31    - 	if (!size || (end < paddr))
32    - 		return NULL;
33    -
34 28 	/*
35 29 	 * If the region is h/w uncached, MMU mapping can be elided as optim
36 30 	 * The cast to u32 is fine as this region can only be inside 4GB
···
43 51  * ARC hardware uncached region, this one still goes thru the MMU as caller
44 52  * might need finer access control (R/W/X)
45 53  */
46    - void __iomem *ioremap_prot(phys_addr_t paddr, unsigned long size,
   54 + void __iomem *ioremap_prot(phys_addr_t paddr, size_t size,
47 55 			   unsigned long flags)
48 56 {
49    - 	unsigned int off;
50    - 	unsigned long vaddr;
51    - 	struct vm_struct *area;
52    - 	phys_addr_t end;
53 57 	pgprot_t prot = __pgprot(flags);
54 58
55    - 	/* Don't allow wraparound, zero size */
56    - 	end = paddr + size - 1;
57    - 	if ((!size) || (end < paddr))
58    - 		return NULL;
59    -
60    - 	/* An early platform driver might end up here */
61    - 	if (!slab_is_available())
62    - 		return NULL;
63    -
64 59 	/* force uncached */
65    - 	prot = pgprot_noncached(prot);
66    -
67    - 	/* Mappings have to be page-aligned */
68    - 	off = paddr & ~PAGE_MASK;
69    - 	paddr &= PAGE_MASK_PHYS;
70    - 	size = PAGE_ALIGN(end + 1) - paddr;
71    -
72    - 	/*
73    - 	 * Ok, go for it..
74    - 	 */
75    - 	area = get_vm_area(size, VM_IOREMAP);
76    - 	if (!area)
77    - 		return NULL;
78    - 	area->phys_addr = paddr;
79    - 	vaddr = (unsigned long)area->addr;
80    - 	if (ioremap_page_range(vaddr, vaddr + size, paddr, prot)) {
81    - 		vunmap((void __force *)vaddr);
82    - 		return NULL;
83    - 	}
84    - 	return (void __iomem *)(off + (char __iomem *)vaddr);
   60 + 	return generic_ioremap_prot(paddr, size, pgprot_noncached(prot));
85 61 }
86 62 EXPORT_SYMBOL(ioremap_prot);
87 63
88    -
89    - void iounmap(const volatile void __iomem *addr)
   64 + void iounmap(volatile void __iomem *addr)
90 65 {
91 66 	/* weird double cast to handle phys_addr_t > 32 bits */
92 67 	if (arc_uncached_addr_space((phys_addr_t)(u32)addr))
93 68 		return;
94 69
95    - 	vfree((void *)(PAGE_MASK & (unsigned long __force)addr));
   70 + 	generic_iounmap(addr);
96 71 }
97 72 EXPORT_SYMBOL(iounmap);
+11 -7
arch/arc/mm/tlb.c
···
467 467  * Note that flush (when done) involves both WBACK - so physical page is
468 468  * in sync as well as INV - so any non-congruent aliases don't remain
469 469  */
470     - void update_mmu_cache(struct vm_area_struct *vma, unsigned long vaddr_unaligned,
471     - 		      pte_t *ptep)
    470 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
    471 + 		unsigned long vaddr_unaligned, pte_t *ptep, unsigned int nr)
472 472 {
473 473 	unsigned long vaddr = vaddr_unaligned & PAGE_MASK;
474 474 	phys_addr_t paddr = pte_val(*ptep) & PAGE_MASK_PHYS;
···
491 491 	 */
492 492 	if ((vma->vm_flags & VM_EXEC) ||
493 493 	     addr_not_cache_congruent(paddr, vaddr)) {
494     -
495     - 		int dirty = !test_and_set_bit(PG_dc_clean, &page->flags);
    494 + 		struct folio *folio = page_folio(page);
    495 + 		int dirty = !test_and_set_bit(PG_dc_clean, &folio->flags);
496 496 		if (dirty) {
    497 + 			unsigned long offset = offset_in_folio(folio, paddr);
    498 + 			nr = folio_nr_pages(folio);
    499 + 			paddr -= offset;
    500 + 			vaddr -= offset;
497 501 			/* wback + inv dcache lines (K-mapping) */
498     - 			__flush_dcache_page(paddr, paddr);
    502 + 			__flush_dcache_pages(paddr, paddr, nr);
499 503
500 504 			/* invalidate any existing icache lines (U-mapping) */
501 505 			if (vma->vm_flags & VM_EXEC)
502     - 				__inv_icache_page(paddr, vaddr);
    506 + 				__inv_icache_pages(paddr, vaddr, nr);
503 507 		}
504 508 	}
505 509 }
···
535 531 	pmd_t *pmd)
536 532 {
537 533 	pte_t pte = __pte(pmd_val(*pmd));
538     - 	update_mmu_cache(vma, addr, &pte);
    534 + 	update_mmu_cache_range(NULL, vma, addr, &pte, HPAGE_PMD_NR);
539 535 }
540 536
541 537 void local_flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
+14 -15
arch/arm/include/asm/cacheflush.h
···
231 231 			vma->vm_flags);
232 232 }
233 233
234     - static inline void
235     - vivt_flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsigned long pfn)
    234 + static inline void vivt_flush_cache_pages(struct vm_area_struct *vma,
    235 + 		unsigned long user_addr, unsigned long pfn, unsigned int nr)
236 236 {
237 237 	struct mm_struct *mm = vma->vm_mm;
238 238
239 239 	if (!mm || cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
240 240 		unsigned long addr = user_addr & PAGE_MASK;
241     - 		__cpuc_flush_user_range(addr, addr + PAGE_SIZE, vma->vm_flags);
    241 + 		__cpuc_flush_user_range(addr, addr + nr * PAGE_SIZE,
    242 + 				vma->vm_flags);
242 243 	}
243 244 }
···
248 247 		vivt_flush_cache_mm(mm)
249 248 #define flush_cache_range(vma,start,end) \
250 249 		vivt_flush_cache_range(vma,start,end)
251     - #define flush_cache_page(vma,addr,pfn) \
252     - 		vivt_flush_cache_page(vma,addr,pfn)
    250 + #define flush_cache_pages(vma, addr, pfn, nr) \
    251 + 		vivt_flush_cache_pages(vma, addr, pfn, nr)
253 252 #else
254     - extern void flush_cache_mm(struct mm_struct *mm);
255     - extern void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end);
256     - extern void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsigned long pfn);
    253 + void flush_cache_mm(struct mm_struct *mm);
    254 + void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end);
    255 + void flush_cache_pages(struct vm_area_struct *vma, unsigned long user_addr,
    256 + 		unsigned long pfn, unsigned int nr);
257 257 #endif
258 258
259 259 #define flush_cache_dup_mm(mm) flush_cache_mm(mm)
    260 + #define flush_cache_page(vma, addr, pfn) flush_cache_pages(vma, addr, pfn, 1)
260 261
261 262 /*
262 263  * flush_icache_user_range is used when we want to ensure that the
···
292 289  * See update_mmu_cache for the user space part.
293 290  */
294 291 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
295     - extern void flush_dcache_page(struct page *);
    292 + void flush_dcache_page(struct page *);
    293 + void flush_dcache_folio(struct folio *folio);
    294 + #define flush_dcache_folio flush_dcache_folio
296 295
297 296 #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1
298 297 static inline void flush_kernel_vmap_range(void *addr, int size)
···
320 315
321 316 #define flush_dcache_mmap_lock(mapping)		xa_lock_irq(&mapping->i_pages)
322 317 #define flush_dcache_mmap_unlock(mapping)	xa_unlock_irq(&mapping->i_pages)
323 318
324     -
325     - /*
326     -  * We don't appear to need to do anything here. In fact, if we did, we'd
327     -  * duplicate cache flushing elsewhere performed by flush_dcache_page().
328     -  */
329     - #define flush_icache_page(vma,page)	do { } while (0)
330 319
331 320 /*
332 321  * flush_cache_vmap() is used when creating mappings (eg, via vmap,
+1
arch/arm/include/asm/hugetlb.h
··· 10 10 #ifndef _ASM_ARM_HUGETLB_H 11 11 #define _ASM_ARM_HUGETLB_H 12 12 13 + #include <asm/cacheflush.h> 13 14 #include <asm/page.h> 14 15 #include <asm/hugetlb-3level.h> 15 16 #include <asm-generic/hugetlb.h>
+3 -2
arch/arm/include/asm/pgtable.h
··· 207 207 extern void __sync_icache_dcache(pte_t pteval); 208 208 #endif 209 209 210 - void set_pte_at(struct mm_struct *mm, unsigned long addr, 211 - pte_t *ptep, pte_t pteval); 210 + void set_ptes(struct mm_struct *mm, unsigned long addr, 211 + pte_t *ptep, pte_t pteval, unsigned int nr); 212 + #define set_ptes set_ptes 212 213 213 214 static inline pte_t clear_pte_bit(pte_t pte, pgprot_t prot) 214 215 {
+7 -5
arch/arm/include/asm/tlb.h
··· 39 39 static inline void 40 40 __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr) 41 41 { 42 - pgtable_pte_page_dtor(pte); 42 + struct ptdesc *ptdesc = page_ptdesc(pte); 43 + 44 + pagetable_pte_dtor(ptdesc); 43 45 44 46 #ifndef CONFIG_ARM_LPAE 45 47 /* ··· 52 50 __tlb_adjust_range(tlb, addr - PAGE_SIZE, 2 * PAGE_SIZE); 53 51 #endif 54 52 55 - tlb_remove_table(tlb, pte); 53 + tlb_remove_ptdesc(tlb, ptdesc); 56 54 } 57 55 58 56 static inline void 59 57 __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr) 60 58 { 61 59 #ifdef CONFIG_ARM_LPAE 62 - struct page *page = virt_to_page(pmdp); 60 + struct ptdesc *ptdesc = virt_to_ptdesc(pmdp); 63 61 64 - pgtable_pmd_page_dtor(page); 65 - tlb_remove_table(tlb, page); 62 + pagetable_pmd_dtor(ptdesc); 63 + tlb_remove_ptdesc(tlb, ptdesc); 66 64 #endif 67 65 } 68 66
+9 -5
arch/arm/include/asm/tlbflush.h
··· 619 619 * If PG_dcache_clean is not set for the page, we need to ensure that any 620 620 * cache entries for the kernels virtual memory range are written 621 621 * back to the page. On ARMv6 and later, the cache coherency is handled via 622 - * the set_pte_at() function. 622 + * the set_ptes() function. 623 623 */ 624 624 #if __LINUX_ARM_ARCH__ < 6 625 - extern void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, 626 - pte_t *ptep); 625 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma, 626 + unsigned long addr, pte_t *ptep, unsigned int nr); 627 627 #else 628 - static inline void update_mmu_cache(struct vm_area_struct *vma, 629 - unsigned long addr, pte_t *ptep) 628 + static inline void update_mmu_cache_range(struct vm_fault *vmf, 629 + struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, 630 + unsigned int nr) 630 631 { 631 632 } 632 633 #endif 634 + 635 + #define update_mmu_cache(vma, addr, ptep) \ 636 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 633 637 634 638 #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0) 635 639
+3 -2
arch/arm/mm/copypage-v4mc.c
··· 64 64 void v4_mc_copy_user_highpage(struct page *to, struct page *from, 65 65 unsigned long vaddr, struct vm_area_struct *vma) 66 66 { 67 + struct folio *src = page_folio(from); 67 68 void *kto = kmap_atomic(to); 68 69 69 - if (!test_and_set_bit(PG_dcache_clean, &from->flags)) 70 - __flush_dcache_page(page_mapping_file(from), from); 70 + if (!test_and_set_bit(PG_dcache_clean, &src->flags)) 71 + __flush_dcache_folio(folio_flush_mapping(src), src); 71 72 72 73 raw_spin_lock(&minicache_lock); 73 74
+3 -2
arch/arm/mm/copypage-v6.c
··· 69 69 static void v6_copy_user_highpage_aliasing(struct page *to, 70 70 struct page *from, unsigned long vaddr, struct vm_area_struct *vma) 71 71 { 72 + struct folio *src = page_folio(from); 72 73 unsigned int offset = CACHE_COLOUR(vaddr); 73 74 unsigned long kfrom, kto; 74 75 75 - if (!test_and_set_bit(PG_dcache_clean, &from->flags)) 76 - __flush_dcache_page(page_mapping_file(from), from); 76 + if (!test_and_set_bit(PG_dcache_clean, &src->flags)) 77 + __flush_dcache_folio(folio_flush_mapping(src), src); 77 78 78 79 /* FIXME: not highmem safe */ 79 80 discard_old_kernel_data(page_address(to));
+3 -2
arch/arm/mm/copypage-xscale.c
··· 84 84 void xscale_mc_copy_user_highpage(struct page *to, struct page *from, 85 85 unsigned long vaddr, struct vm_area_struct *vma) 86 86 { 87 + struct folio *src = page_folio(from); 87 88 void *kto = kmap_atomic(to); 88 89 89 - if (!test_and_set_bit(PG_dcache_clean, &from->flags)) 90 - __flush_dcache_page(page_mapping_file(from), from); 90 + if (!test_and_set_bit(PG_dcache_clean, &src->flags)) 91 + __flush_dcache_folio(folio_flush_mapping(src), src); 91 92 92 93 raw_spin_lock(&minicache_lock); 93 94
+14 -12
arch/arm/mm/dma-mapping.c
···
709 709 	 * Mark the D-cache clean for these pages to avoid extra flushing.
710 710 	 */
711 711 	if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
712     - 		unsigned long pfn;
713     - 		size_t left = size;
    712 + 		struct folio *folio = pfn_folio(paddr / PAGE_SIZE);
    713 + 		size_t offset = offset_in_folio(folio, paddr);
714 714
715     - 		pfn = page_to_pfn(page) + off / PAGE_SIZE;
716     - 		off %= PAGE_SIZE;
717     - 		if (off) {
718     - 			pfn++;
719     - 			left -= PAGE_SIZE - off;
720     - 		}
721     - 		while (left >= PAGE_SIZE) {
722     - 			page = pfn_to_page(pfn++);
723     - 			set_bit(PG_dcache_clean, &page->flags);
724     - 			left -= PAGE_SIZE;
    715 + 		for (;;) {
    716 + 			size_t sz = folio_size(folio) - offset;
    717 +
    718 + 			if (size < sz)
    719 + 				break;
    720 + 			if (!offset)
    721 + 				set_bit(PG_dcache_clean, &folio->flags);
    722 + 			offset = 0;
    723 + 			size -= sz;
    724 + 			if (!size)
    725 + 				break;
    726 + 			folio = folio_next(folio);
725 727 		}
726 728 	}
727 729 }
+9 -10
arch/arm/mm/fault-armv.c
···
117 117 	 * must use the nested version.  This also means we need to
118 118 	 * open-code the spin-locking.
119 119 	 */
120     - 	pte = pte_offset_map(pmd, address);
    120 + 	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
121 121 	if (!pte)
122 122 		return 0;
123 123
124     - 	ptl = pte_lockptr(vma->vm_mm, pmd);
125 124 	do_pte_lock(ptl);
126 125
127 126 	ret = do_adjust_pte(vma, address, pfn, pte);
···
180 181  *
181 182  * Note that the pte lock will be held.
182 183  */
183     - void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr,
184     - 	pte_t *ptep)
    184 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
    185 + 	unsigned long addr, pte_t *ptep, unsigned int nr)
185 186 {
186 187 	unsigned long pfn = pte_pfn(*ptep);
187 188 	struct address_space *mapping;
188     - 	struct page *page;
    189 + 	struct folio *folio;
189 190
190 191 	if (!pfn_valid(pfn))
191 192 		return;
···
194 195 	 * The zero page is never written to, so never has any dirty
195 196 	 * cache lines, and therefore never needs to be flushed.
196 197 	 */
197     - 	page = pfn_to_page(pfn);
198     - 	if (page == ZERO_PAGE(0))
    198 + 	if (is_zero_pfn(pfn))
199 199 		return;
200 200
201     - 	mapping = page_mapping_file(page);
202     - 	if (!test_and_set_bit(PG_dcache_clean, &page->flags))
203     - 		__flush_dcache_page(mapping, page);
    201 + 	folio = page_folio(pfn_to_page(pfn));
    202 + 	mapping = folio_flush_mapping(folio);
    203 + 	if (!test_and_set_bit(PG_dcache_clean, &folio->flags))
    204 + 		__flush_dcache_folio(mapping, folio);
204 205 	if (mapping) {
205 206 		if (cache_is_vivt())
206 207 			make_coherent(mapping, vma, addr, ptep, pfn);
+60 -39
arch/arm/mm/flush.c
··· 95 95 __flush_icache_all(); 96 96 } 97 97 98 - void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsigned long pfn) 98 + void flush_cache_pages(struct vm_area_struct *vma, unsigned long user_addr, unsigned long pfn, unsigned int nr) 99 99 { 100 100 if (cache_is_vivt()) { 101 - vivt_flush_cache_page(vma, user_addr, pfn); 101 + vivt_flush_cache_pages(vma, user_addr, pfn, nr); 102 102 return; 103 103 } 104 104 ··· 196 196 #endif 197 197 } 198 198 199 - void __flush_dcache_page(struct address_space *mapping, struct page *page) 199 + void __flush_dcache_folio(struct address_space *mapping, struct folio *folio) 200 200 { 201 201 /* 202 202 * Writeback any data associated with the kernel mapping of this 203 203 * page. This ensures that data in the physical page is mutually 204 204 * coherent with the kernels mapping. 205 205 */ 206 - if (!PageHighMem(page)) { 207 - __cpuc_flush_dcache_area(page_address(page), page_size(page)); 206 + if (!folio_test_highmem(folio)) { 207 + __cpuc_flush_dcache_area(folio_address(folio), 208 + folio_size(folio)); 208 209 } else { 209 210 unsigned long i; 210 211 if (cache_is_vipt_nonaliasing()) { 211 - for (i = 0; i < compound_nr(page); i++) { 212 - void *addr = kmap_atomic(page + i); 212 + for (i = 0; i < folio_nr_pages(folio); i++) { 213 + void *addr = kmap_local_folio(folio, 214 + i * PAGE_SIZE); 213 215 __cpuc_flush_dcache_area(addr, PAGE_SIZE); 214 - kunmap_atomic(addr); 216 + kunmap_local(addr); 215 217 } 216 218 } else { 217 - for (i = 0; i < compound_nr(page); i++) { 218 - void *addr = kmap_high_get(page + i); 219 + for (i = 0; i < folio_nr_pages(folio); i++) { 220 + void *addr = kmap_high_get(folio_page(folio, i)); 219 221 if (addr) { 220 222 __cpuc_flush_dcache_area(addr, PAGE_SIZE); 221 - kunmap_high(page + i); 223 + kunmap_high(folio_page(folio, i)); 222 224 } 223 225 } 224 226 } ··· 232 230 * userspace colour, which is congruent with page->index. 
233 231 */ 234 232 if (mapping && cache_is_vipt_aliasing()) 235 - flush_pfn_alias(page_to_pfn(page), 236 - page->index << PAGE_SHIFT); 233 + flush_pfn_alias(folio_pfn(folio), folio_pos(folio)); 237 234 } 238 235 239 - static void __flush_dcache_aliases(struct address_space *mapping, struct page *page) 236 + static void __flush_dcache_aliases(struct address_space *mapping, struct folio *folio) 240 237 { 241 238 struct mm_struct *mm = current->active_mm; 242 - struct vm_area_struct *mpnt; 243 - pgoff_t pgoff; 239 + struct vm_area_struct *vma; 240 + pgoff_t pgoff, pgoff_end; 244 241 245 242 /* 246 243 * There are possible user space mappings of this page: ··· 247 246 * data in the current VM view associated with this page. 248 247 * - aliasing VIPT: we only need to find one mapping of this page. 249 248 */ 250 - pgoff = page->index; 249 + pgoff = folio->index; 250 + pgoff_end = pgoff + folio_nr_pages(folio) - 1; 251 251 252 252 flush_dcache_mmap_lock(mapping); 253 - vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) { 254 - unsigned long offset; 253 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff_end) { 254 + unsigned long start, offset, pfn; 255 + unsigned int nr; 255 256 256 257 /* 257 258 * If this VMA is not in our MM, we can ignore it. 
258 259 */ 259 - if (mpnt->vm_mm != mm) 260 + if (vma->vm_mm != mm) 260 261 continue; 261 - if (!(mpnt->vm_flags & VM_MAYSHARE)) 262 + if (!(vma->vm_flags & VM_MAYSHARE)) 262 263 continue; 263 - offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT; 264 - flush_cache_page(mpnt, mpnt->vm_start + offset, page_to_pfn(page)); 264 + 265 + start = vma->vm_start; 266 + pfn = folio_pfn(folio); 267 + nr = folio_nr_pages(folio); 268 + offset = pgoff - vma->vm_pgoff; 269 + if (offset > -nr) { 270 + pfn -= offset; 271 + nr += offset; 272 + } else { 273 + start += offset * PAGE_SIZE; 274 + } 275 + if (start + nr * PAGE_SIZE > vma->vm_end) 276 + nr = (vma->vm_end - start) / PAGE_SIZE; 277 + 278 + flush_cache_pages(vma, start, pfn, nr); 265 279 } 266 280 flush_dcache_mmap_unlock(mapping); 267 281 } ··· 285 269 void __sync_icache_dcache(pte_t pteval) 286 270 { 287 271 unsigned long pfn; 288 - struct page *page; 272 + struct folio *folio; 289 273 struct address_space *mapping; 290 274 291 275 if (cache_is_vipt_nonaliasing() && !pte_exec(pteval)) ··· 295 279 if (!pfn_valid(pfn)) 296 280 return; 297 281 298 - page = pfn_to_page(pfn); 282 + folio = page_folio(pfn_to_page(pfn)); 299 283 if (cache_is_vipt_aliasing()) 300 - mapping = page_mapping_file(page); 284 + mapping = folio_flush_mapping(folio); 301 285 else 302 286 mapping = NULL; 303 287 304 - if (!test_and_set_bit(PG_dcache_clean, &page->flags)) 305 - __flush_dcache_page(mapping, page); 288 + if (!test_and_set_bit(PG_dcache_clean, &folio->flags)) 289 + __flush_dcache_folio(mapping, folio); 306 290 307 291 if (pte_exec(pteval)) 308 292 __flush_icache_all(); ··· 328 312 * Note that we disable the lazy flush for SMP configurations where 329 313 * the cache maintenance operations are not automatically broadcasted. 
330 314 */ 331 - void flush_dcache_page(struct page *page) 315 + void flush_dcache_folio(struct folio *folio) 332 316 { 333 317 struct address_space *mapping; 334 318 ··· 336 320 * The zero page is never written to, so never has any dirty 337 321 * cache lines, and therefore never needs to be flushed. 338 322 */ 339 - if (page == ZERO_PAGE(0)) 323 + if (is_zero_pfn(folio_pfn(folio))) 340 324 return; 341 325 342 326 if (!cache_ops_need_broadcast() && cache_is_vipt_nonaliasing()) { 343 - if (test_bit(PG_dcache_clean, &page->flags)) 344 - clear_bit(PG_dcache_clean, &page->flags); 327 + if (test_bit(PG_dcache_clean, &folio->flags)) 328 + clear_bit(PG_dcache_clean, &folio->flags); 345 329 return; 346 330 } 347 331 348 - mapping = page_mapping_file(page); 332 + mapping = folio_flush_mapping(folio); 349 333 350 334 if (!cache_ops_need_broadcast() && 351 - mapping && !page_mapcount(page)) 352 - clear_bit(PG_dcache_clean, &page->flags); 335 + mapping && !folio_mapped(folio)) 336 + clear_bit(PG_dcache_clean, &folio->flags); 353 337 else { 354 - __flush_dcache_page(mapping, page); 338 + __flush_dcache_folio(mapping, folio); 355 339 if (mapping && cache_is_vivt()) 356 - __flush_dcache_aliases(mapping, page); 340 + __flush_dcache_aliases(mapping, folio); 357 341 else if (mapping) 358 342 __flush_icache_all(); 359 - set_bit(PG_dcache_clean, &page->flags); 343 + set_bit(PG_dcache_clean, &folio->flags); 360 344 } 361 345 } 362 - EXPORT_SYMBOL(flush_dcache_page); 346 + EXPORT_SYMBOL(flush_dcache_folio); 363 347 348 + void flush_dcache_page(struct page *page) 349 + { 350 + flush_dcache_folio(page_folio(page)); 351 + } 352 + EXPORT_SYMBOL(flush_dcache_page); 364 353 /* 365 354 * Flush an anonymous page so that users of get_user_pages() 366 355 * can safely access the data. The expected sequence is:
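The trickiest part of the flush.c conversion above is the unsigned clamp arithmetic in `__flush_dcache_aliases()`, which trims a folio's pfn range to the portion actually mapped by a VMA: `offset = pgoff - vma->vm_pgoff` wraps to a huge value when the folio starts before the VMA, and `offset > -nr` then detects a partial left overlap. A standalone model of that arithmetic (names like `clamp_flush` are illustrative; `nr` is widened to `unsigned long` here to keep the wraparound in one type):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

struct clamp { unsigned long start, pfn, nr; };

/*
 * Mirror the clamp in __flush_dcache_aliases(): a VMA spans
 * [vm_start, vm_end) mapping file pages from vm_pgoff; a folio of nr
 * pages starts at file offset pgoff with physical frame pfn.  Compute
 * the user address, first pfn and page count that fall inside the VMA.
 * All arithmetic is unsigned, as in the kernel code.
 */
static struct clamp clamp_flush(unsigned long vm_start, unsigned long vm_end,
				unsigned long vm_pgoff, unsigned long pgoff,
				unsigned long pfn, unsigned long nr)
{
	struct clamp c;
	unsigned long offset = pgoff - vm_pgoff;

	c.start = vm_start;
	c.pfn = pfn;
	c.nr = nr;
	if (offset > -nr) {		/* folio starts before the VMA */
		c.pfn -= offset;	/* i.e. pfn += (vm_pgoff - pgoff) */
		c.nr += offset;		/* drop the pages before vm_start */
	} else {			/* folio starts at or after vm_start */
		c.start += offset * PAGE_SIZE;
	}
	if (c.start + c.nr * PAGE_SIZE > vm_end)	/* trim the right end */
		c.nr = (vm_end - c.start) / PAGE_SIZE;
	return c;
}
```

For a 16-page VMA at file offset 0x100, a folio two pages before the VMA keeps only its last two pages; a folio near the VMA's end is trimmed on the right.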
+1 -1
arch/arm/mm/mm.h
··· 45 45 46 46 const struct mem_type *get_mem_type(unsigned int type); 47 47 48 - extern void __flush_dcache_page(struct address_space *mapping, struct page *page); 48 + void __flush_dcache_folio(struct address_space *mapping, struct folio *folio); 49 49 50 50 /* 51 51 * ARM specific vm_struct->flags bits.
+14 -7
arch/arm/mm/mmu.c
··· 737 737 738 738 static void *__init late_alloc(unsigned long sz) 739 739 { 740 - void *ptr = (void *)__get_free_pages(GFP_PGTABLE_KERNEL, get_order(sz)); 740 + void *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, 741 + get_order(sz)); 741 742 742 - if (!ptr || !pgtable_pte_page_ctor(virt_to_page(ptr))) 743 + if (!ptdesc || !pagetable_pte_ctor(ptdesc)) 743 744 BUG(); 744 - return ptr; 745 + return ptdesc_to_virt(ptdesc); 745 746 } 746 747 747 748 static pte_t * __init arm_pte_alloc(pmd_t *pmd, unsigned long addr, ··· 1789 1788 bootmem_init(); 1790 1789 1791 1790 empty_zero_page = virt_to_page(zero_page); 1792 - __flush_dcache_page(NULL, empty_zero_page); 1791 + __flush_dcache_folio(NULL, page_folio(empty_zero_page)); 1793 1792 } 1794 1793 1795 1794 void __init early_mm_init(const struct machine_desc *mdesc) ··· 1798 1797 early_paging_init(mdesc); 1799 1798 } 1800 1799 1801 - void set_pte_at(struct mm_struct *mm, unsigned long addr, 1802 - pte_t *ptep, pte_t pteval) 1800 + void set_ptes(struct mm_struct *mm, unsigned long addr, 1801 + pte_t *ptep, pte_t pteval, unsigned int nr) 1803 1802 { 1804 1803 unsigned long ext = 0; 1805 1804 ··· 1809 1808 ext |= PTE_EXT_NG; 1810 1809 } 1811 1810 1812 - set_pte_ext(ptep, pteval, ext); 1811 + for (;;) { 1812 + set_pte_ext(ptep, pteval, ext); 1813 + if (--nr == 0) 1814 + break; 1815 + ptep++; 1816 + pte_val(pteval) += PAGE_SIZE; 1817 + } 1813 1818 }
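The `set_ptes()` loop introduced above (on both arm here and arm64 later in this series) installs `nr` consecutive PTEs from one call, bumping the physical address embedded in the pte value by `PAGE_SIZE` each step. A minimal model, with PTEs mocked as plain integers (`set_ptes_sketch` is an illustrative name, not kernel API):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/*
 * Sketch of the set_ptes() contract: write pteval into *ptep, then for
 * each further page advance both the pte pointer and the address bits
 * of the value by one page, exactly nr times in total.
 */
static void set_ptes_sketch(unsigned long *ptep, unsigned long pteval,
			    unsigned int nr)
{
	for (;;) {
		*ptep = pteval;		/* stands in for set_pte_ext() */
		if (--nr == 0)
			break;
		ptep++;
		pteval += PAGE_SIZE;	/* pte_val(pteval) += PAGE_SIZE */
	}
}
```

After `set_ptes_sketch(ptes, 0x100000, 4)`, entry `i` maps physical address `0x100000 + i * PAGE_SIZE`, which is the batching that lets `filemap_map_pages()` and friends map a whole folio with one call.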
+6
arch/arm/mm/nommu.c
··· 180 180 { 181 181 } 182 182 183 + void flush_dcache_folio(struct folio *folio) 184 + { 185 + __cpuc_flush_dcache_area(folio_address(folio), folio_size(folio)); 186 + } 187 + EXPORT_SYMBOL(flush_dcache_folio); 188 + 183 189 void flush_dcache_page(struct page *page) 184 190 { 185 191 __cpuc_flush_dcache_area(page_address(page), PAGE_SIZE);
+3 -3
arch/arm/mm/pageattr.c
··· 25 25 return 0; 26 26 } 27 27 28 - static bool in_range(unsigned long start, unsigned long size, 28 + static bool range_in_range(unsigned long start, unsigned long size, 29 29 unsigned long range_start, unsigned long range_end) 30 30 { 31 31 return start >= range_start && start < range_end && ··· 63 63 if (!size) 64 64 return 0; 65 65 66 - if (!in_range(start, size, MODULES_VADDR, MODULES_END) && 67 - !in_range(start, size, VMALLOC_START, VMALLOC_END)) 66 + if (!range_in_range(start, size, MODULES_VADDR, MODULES_END) && 67 + !range_in_range(start, size, VMALLOC_START, VMALLOC_END)) 68 68 return -EINVAL; 69 69 70 70 return __change_memory_common(start, size, set_mask, clear_mask);
+2 -3
arch/arm64/Kconfig
··· 78 78 select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION 79 79 select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION 80 80 select ARCH_KEEP_MEMBLOCK 81 + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE 81 82 select ARCH_USE_CMPXCHG_LOCKREF 82 83 select ARCH_USE_GNU_PROPERTY 83 84 select ARCH_USE_MEMTEST ··· 97 96 select ARCH_SUPPORTS_NUMA_BALANCING 98 97 select ARCH_SUPPORTS_PAGE_TABLE_CHECK 99 98 select ARCH_SUPPORTS_PER_VMA_LOCK 99 + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH 100 100 select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT 101 101 select ARCH_WANT_DEFAULT_BPF_JIT 102 102 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT ··· 348 346 def_bool y 349 347 350 348 config GENERIC_CALIBRATE_DELAY 351 - def_bool y 352 - 353 - config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE 354 349 def_bool y 355 350 356 351 config SMP
+3 -1
arch/arm64/include/asm/cacheflush.h
··· 114 114 #define copy_to_user_page copy_to_user_page 115 115 116 116 /* 117 - * flush_dcache_page is used when the kernel has written to the page 117 + * flush_dcache_folio is used when the kernel has written to the page 118 118 * cache page at virtual address page->virtual. 119 119 * 120 120 * If this page isn't mapped (ie, page_mapping == NULL), or it might ··· 127 127 */ 128 128 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 129 129 extern void flush_dcache_page(struct page *); 130 + void flush_dcache_folio(struct folio *); 131 + #define flush_dcache_folio flush_dcache_folio 130 132 131 133 static __always_inline void icache_inval_all_pou(void) 132 134 {
+16
arch/arm64/include/asm/hugetlb.h
··· 10 10 #ifndef __ASM_HUGETLB_H 11 11 #define __ASM_HUGETLB_H 12 12 13 + #include <asm/cacheflush.h> 13 14 #include <asm/page.h> 14 15 15 16 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION ··· 60 59 pte_t old_pte, pte_t new_pte); 61 60 62 61 #include <asm-generic/hugetlb.h> 62 + 63 + #define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE 64 + static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma, 65 + unsigned long start, 66 + unsigned long end) 67 + { 68 + unsigned long stride = huge_page_size(hstate_vma(vma)); 69 + 70 + if (stride == PMD_SIZE) 71 + __flush_tlb_range(vma, start, end, stride, false, 2); 72 + else if (stride == PUD_SIZE) 73 + __flush_tlb_range(vma, start, end, stride, false, 1); 74 + else 75 + __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 0); 76 + } 63 77 64 78 #endif /* __ASM_HUGETLB_H */
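The new `flush_hugetlb_tlb_range()` above picks the TTL (translation table level) hint for `__flush_tlb_range()` from the huge page stride: level 2 for PMD-sized pages, level 1 for PUD-sized, no hint otherwise. That selection can be isolated as a small function (shift values assume the arm64 4 KiB-granule layout; `tlb_level_for_stride` is an illustrative name):

```c
#include <assert.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define PMD_SIZE	(1UL << PMD_SHIFT)	/* 2 MiB */
#define PUD_SIZE	(1UL << PUD_SHIFT)	/* 1 GiB */

/*
 * Mirror the level choice in flush_hugetlb_tlb_range(): the TTL hint
 * tells the TLBI which translation table level holds the leaf entry,
 * letting hardware skip walks at other levels.  0 means "unknown",
 * used for contiguous-bit sizes that are not exactly PMD/PUD sized.
 */
static int tlb_level_for_stride(unsigned long stride)
{
	if (stride == PMD_SIZE)
		return 2;
	if (stride == PUD_SIZE)
		return 1;
	return 0;
}
```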
+1 -2
arch/arm64/include/asm/io.h
··· 139 139 * I/O memory mapping functions. 140 140 */ 141 141 142 - bool ioremap_allowed(phys_addr_t phys_addr, size_t size, unsigned long prot); 143 - #define ioremap_allowed ioremap_allowed 142 + #define ioremap_prot ioremap_prot 144 143 145 144 #define _PAGE_IOREMAP PROT_DEVICE_nGnRE 146 145
+2 -2
arch/arm64/include/asm/mte.h
··· 90 90 } 91 91 92 92 void mte_zero_clear_page_tags(void *addr); 93 - void mte_sync_tags(pte_t old_pte, pte_t pte); 93 + void mte_sync_tags(pte_t pte); 94 94 void mte_copy_page_tags(void *kto, const void *kfrom); 95 95 void mte_thread_init_user(void); 96 96 void mte_thread_switch(struct task_struct *next); ··· 122 122 static inline void mte_zero_clear_page_tags(void *addr) 123 123 { 124 124 } 125 - static inline void mte_sync_tags(pte_t old_pte, pte_t pte) 125 + static inline void mte_sync_tags(pte_t pte) 126 126 { 127 127 } 128 128 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
+25 -23
arch/arm64/include/asm/pgtable.h
··· 338 338 * don't expose tags (instruction fetches don't check tags). 339 339 */ 340 340 if (system_supports_mte() && pte_access_permitted(pte, false) && 341 - !pte_special(pte)) { 342 - pte_t old_pte = READ_ONCE(*ptep); 343 - /* 344 - * We only need to synchronise if the new PTE has tags enabled 345 - * or if swapping in (in which case another mapping may have 346 - * set tags in the past even if this PTE isn't tagged). 347 - * (!pte_none() && !pte_present()) is an open coded version of 348 - * is_swap_pte() 349 - */ 350 - if (pte_tagged(pte) || (!pte_none(old_pte) && !pte_present(old_pte))) 351 - mte_sync_tags(old_pte, pte); 352 - } 341 + !pte_special(pte) && pte_tagged(pte)) 342 + mte_sync_tags(pte); 353 343 354 344 __check_safe_pte_update(mm, ptep, pte); 355 345 356 346 set_pte(ptep, pte); 357 347 } 358 348 359 - static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, 360 - pte_t *ptep, pte_t pte) 349 + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, 350 + pte_t *ptep, pte_t pte, unsigned int nr) 361 351 { 362 - page_table_check_pte_set(mm, addr, ptep, pte); 363 - return __set_pte_at(mm, addr, ptep, pte); 352 + page_table_check_ptes_set(mm, ptep, pte, nr); 353 + 354 + for (;;) { 355 + __set_pte_at(mm, addr, ptep, pte); 356 + if (--nr == 0) 357 + break; 358 + ptep++; 359 + addr += PAGE_SIZE; 360 + pte_val(pte) += PAGE_SIZE; 361 + } 364 362 } 363 + #define set_ptes set_ptes 365 364 366 365 /* 367 366 * Huge pte definitions. 
··· 534 535 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, 535 536 pmd_t *pmdp, pmd_t pmd) 536 537 { 537 - page_table_check_pmd_set(mm, addr, pmdp, pmd); 538 + page_table_check_pmd_set(mm, pmdp, pmd); 538 539 return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd)); 539 540 } 540 541 541 542 static inline void set_pud_at(struct mm_struct *mm, unsigned long addr, 542 543 pud_t *pudp, pud_t pud) 543 544 { 544 - page_table_check_pud_set(mm, addr, pudp, pud); 545 + page_table_check_pud_set(mm, pudp, pud); 545 546 return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud)); 546 547 } 547 548 ··· 939 940 { 940 941 pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0)); 941 942 942 - page_table_check_pte_clear(mm, address, pte); 943 + page_table_check_pte_clear(mm, pte); 943 944 944 945 return pte; 945 946 } ··· 951 952 { 952 953 pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0)); 953 954 954 - page_table_check_pmd_clear(mm, address, pmd); 955 + page_table_check_pmd_clear(mm, pmd); 955 956 956 957 return pmd; 957 958 } ··· 987 988 static inline pmd_t pmdp_establish(struct vm_area_struct *vma, 988 989 unsigned long address, pmd_t *pmdp, pmd_t pmd) 989 990 { 990 - page_table_check_pmd_set(vma->vm_mm, address, pmdp, pmd); 991 + page_table_check_pmd_set(vma->vm_mm, pmdp, pmd); 991 992 return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd))); 992 993 } 993 994 #endif ··· 1060 1061 /* 1061 1062 * On AArch64, the cache coherency is handled via the set_pte_at() function. 
1062 1063 */ 1063 - static inline void update_mmu_cache(struct vm_area_struct *vma, 1064 - unsigned long addr, pte_t *ptep) 1064 + static inline void update_mmu_cache_range(struct vm_fault *vmf, 1065 + struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, 1066 + unsigned int nr) 1065 1067 { 1066 1068 /* 1067 1069 * We don't do anything here, so there's a very small chance of ··· 1071 1071 */ 1072 1072 } 1073 1073 1074 + #define update_mmu_cache(vma, addr, ptep) \ 1075 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 1074 1076 #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0) 1075 1077 1076 1078 #ifdef CONFIG_ARM64_PA_BITS_52
+8 -6
arch/arm64/include/asm/tlb.h
··· 75 75 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, 76 76 unsigned long addr) 77 77 { 78 - pgtable_pte_page_dtor(pte); 79 - tlb_remove_table(tlb, pte); 78 + struct ptdesc *ptdesc = page_ptdesc(pte); 79 + 80 + pagetable_pte_dtor(ptdesc); 81 + tlb_remove_ptdesc(tlb, ptdesc); 80 82 } 81 83 82 84 #if CONFIG_PGTABLE_LEVELS > 2 83 85 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, 84 86 unsigned long addr) 85 87 { 86 - struct page *page = virt_to_page(pmdp); 88 + struct ptdesc *ptdesc = virt_to_ptdesc(pmdp); 87 89 88 - pgtable_pmd_page_dtor(page); 89 - tlb_remove_table(tlb, page); 90 + pagetable_pmd_dtor(ptdesc); 91 + tlb_remove_ptdesc(tlb, ptdesc); 90 92 } 91 93 #endif 92 94 ··· 96 94 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp, 97 95 unsigned long addr) 98 96 { 99 - tlb_remove_table(tlb, virt_to_page(pudp)); 97 + tlb_remove_ptdesc(tlb, virt_to_ptdesc(pudp)); 100 98 } 101 99 #endif 102 100
+12
arch/arm64/include/asm/tlbbatch.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _ARCH_ARM64_TLBBATCH_H 3 + #define _ARCH_ARM64_TLBBATCH_H 4 + 5 + struct arch_tlbflush_unmap_batch { 6 + /* 7 + * For arm64, HW can do tlb shootdown, so we don't 8 + * need to record cpumask for sending IPI 9 + */ 10 + }; 11 + 12 + #endif /* _ARCH_ARM64_TLBBATCH_H */
+64 -6
arch/arm64/include/asm/tlbflush.h
··· 13 13 #include <linux/bitfield.h> 14 14 #include <linux/mm_types.h> 15 15 #include <linux/sched.h> 16 + #include <linux/mmu_notifier.h> 16 17 #include <asm/cputype.h> 17 18 #include <asm/mmu.h> 18 19 ··· 253 252 __tlbi(aside1is, asid); 254 253 __tlbi_user(aside1is, asid); 255 254 dsb(ish); 255 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); 256 + } 257 + 258 + static inline void __flush_tlb_page_nosync(struct mm_struct *mm, 259 + unsigned long uaddr) 260 + { 261 + unsigned long addr; 262 + 263 + dsb(ishst); 264 + addr = __TLBI_VADDR(uaddr, ASID(mm)); 265 + __tlbi(vale1is, addr); 266 + __tlbi_user(vale1is, addr); 267 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, 268 + (uaddr & PAGE_MASK) + PAGE_SIZE); 256 269 } 257 270 258 271 static inline void flush_tlb_page_nosync(struct vm_area_struct *vma, 259 272 unsigned long uaddr) 260 273 { 261 - unsigned long addr; 262 - 263 - dsb(ishst); 264 - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm)); 265 - __tlbi(vale1is, addr); 266 - __tlbi_user(vale1is, addr); 274 + return __flush_tlb_page_nosync(vma->vm_mm, uaddr); 267 275 } 268 276 269 277 static inline void flush_tlb_page(struct vm_area_struct *vma, 270 278 unsigned long uaddr) 271 279 { 272 280 flush_tlb_page_nosync(vma, uaddr); 281 + dsb(ish); 282 + } 283 + 284 + static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) 285 + { 286 + #ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI 287 + /* 288 + * TLB flush deferral is not required on systems which are affected by 289 + * ARM64_WORKAROUND_REPEAT_TLBI, as __tlbi()/__tlbi_user() implementation 290 + * will have two consecutive TLBI instructions with a dsb(ish) in between 291 + * defeating the purpose (i.e save overall 'dsb ish' cost). 
292 + */ 293 + if (unlikely(cpus_have_const_cap(ARM64_WORKAROUND_REPEAT_TLBI))) 294 + return false; 295 + #endif 296 + return true; 297 + } 298 + 299 + static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, 300 + struct mm_struct *mm, 301 + unsigned long uaddr) 302 + { 303 + __flush_tlb_page_nosync(mm, uaddr); 304 + } 305 + 306 + /* 307 + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to 308 + * synchronise all the TLBI issued with a DSB to avoid the race mentioned in 309 + * flush_tlb_batched_pending(). 310 + */ 311 + static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm) 312 + { 313 + dsb(ish); 314 + } 315 + 316 + /* 317 + * To support TLB batched flush for multiple pages unmapping, we only send 318 + * the TLBI for each page in arch_tlbbatch_add_pending() and wait for the 319 + * completion at the end in arch_tlbbatch_flush(). Since we've already issued 320 + * TLBI for each page so only a DSB is needed to synchronise its effect on the 321 + * other CPUs. 322 + * 323 + * This will save the time waiting on DSB comparing issuing a TLBI;DSB sequence 324 + * for each page. 325 + */ 326 + static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) 327 + { 273 328 dsb(ish); 274 329 } 275 330 ··· 415 358 scale++; 416 359 } 417 360 dsb(ish); 361 + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end); 418 362 } 419 363 420 364 static inline void flush_tlb_range(struct vm_area_struct *vma,
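The batching contract these arm64 hooks implement — a per-page `TLBI` with no completion wait in `arch_tlbbatch_add_pending()`, then a single `DSB` for the whole batch in `arch_tlbbatch_flush()` — can be sketched with mock counters standing in for the hardware operations (all names here are illustrative, not arm64 code):

```c
#include <assert.h>

/* Counters standing in for hardware effects: each TLBI broadcasts an
 * invalidate, each DSB waits for all outstanding TLBIs to complete. */
static int tlbi_count, dsb_count;

static void mock_tlbi(unsigned long uaddr) { (void)uaddr; tlbi_count++; }
static void mock_dsb(void) { dsb_count++; }

/* arch_tlbbatch_add_pending(): invalidate one page, don't wait */
static void batch_add_pending(unsigned long uaddr)
{
	mock_tlbi(uaddr);
}

/* arch_tlbbatch_flush(): one barrier covers every TLBI issued so far */
static void batch_flush(void)
{
	mock_dsb();
}

/* Unmapping n pages costs n TLBIs but only one DSB, instead of the
 * n TLBI;DSB pairs an unbatched flush_tlb_page() loop would issue. */
static void unmap_pages_batched(const unsigned long *uaddrs, int n)
{
	for (int i = 0; i < n; i++)
		batch_add_pending(uaddrs[i]);
	batch_flush();
}
```

This is also why the `ARM64_WORKAROUND_REPEAT_TLBI` check above disables deferral: on those systems each TLBI already carries its own barrier, so batching saves nothing.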
+7 -30
arch/arm64/kernel/mte.c
··· 35 35 EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode); 36 36 #endif 37 37 38 - static void mte_sync_page_tags(struct page *page, pte_t old_pte, 39 - bool check_swap, bool pte_is_tagged) 40 - { 41 - if (check_swap && is_swap_pte(old_pte)) { 42 - swp_entry_t entry = pte_to_swp_entry(old_pte); 43 - 44 - if (!non_swap_entry(entry)) 45 - mte_restore_tags(entry, page); 46 - } 47 - 48 - if (!pte_is_tagged) 49 - return; 50 - 51 - if (try_page_mte_tagging(page)) { 52 - mte_clear_page_tags(page_address(page)); 53 - set_page_mte_tagged(page); 54 - } 55 - } 56 - 57 - void mte_sync_tags(pte_t old_pte, pte_t pte) 38 + void mte_sync_tags(pte_t pte) 58 39 { 59 40 struct page *page = pte_page(pte); 60 41 long i, nr_pages = compound_nr(page); 61 - bool check_swap = nr_pages == 1; 62 - bool pte_is_tagged = pte_tagged(pte); 63 - 64 - /* Early out if there's nothing to do */ 65 - if (!check_swap && !pte_is_tagged) 66 - return; 67 42 68 43 /* if PG_mte_tagged is set, tags have already been initialised */ 69 - for (i = 0; i < nr_pages; i++, page++) 70 - if (!page_mte_tagged(page)) 71 - mte_sync_page_tags(page, old_pte, check_swap, 72 - pte_is_tagged); 44 + for (i = 0; i < nr_pages; i++, page++) { 45 + if (try_page_mte_tagging(page)) { 46 + mte_clear_page_tags(page_address(page)); 47 + set_page_mte_tagged(page); 48 + } 49 + } 73 50 74 51 /* ensure the tags are visible before the PTE is set */ 75 52 smp_wmb();
+2 -3
arch/arm64/mm/fault.c
··· 587 587 588 588 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); 589 589 590 - #ifdef CONFIG_PER_VMA_LOCK 591 590 if (!(mm_flags & FAULT_FLAG_USER)) 592 591 goto lock_mmap; 593 592 ··· 599 600 goto lock_mmap; 600 601 } 601 602 fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs); 602 - vma_end_read(vma); 603 + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) 604 + vma_end_read(vma); 603 605 604 606 if (!(fault & VM_FAULT_RETRY)) { 605 607 count_vm_vma_lock_event(VMA_LOCK_SUCCESS); ··· 615 615 return 0; 616 616 } 617 617 lock_mmap: 618 - #endif /* CONFIG_PER_VMA_LOCK */ 619 618 620 619 retry: 621 620 vma = lock_mm_and_find_vma(mm, addr, regs);
+14 -22
arch/arm64/mm/flush.c
··· 51 51 52 52 void __sync_icache_dcache(pte_t pte) 53 53 { 54 - struct page *page = pte_page(pte); 54 + struct folio *folio = page_folio(pte_page(pte)); 55 55 56 - /* 57 - * HugeTLB pages are always fully mapped, so only setting head page's 58 - * PG_dcache_clean flag is enough. 59 - */ 60 - if (PageHuge(page)) 61 - page = compound_head(page); 62 - 63 - if (!test_bit(PG_dcache_clean, &page->flags)) { 64 - sync_icache_aliases((unsigned long)page_address(page), 65 - (unsigned long)page_address(page) + 66 - page_size(page)); 67 - set_bit(PG_dcache_clean, &page->flags); 56 + if (!test_bit(PG_dcache_clean, &folio->flags)) { 57 + sync_icache_aliases((unsigned long)folio_address(folio), 58 + (unsigned long)folio_address(folio) + 59 + folio_size(folio)); 60 + set_bit(PG_dcache_clean, &folio->flags); 68 61 } 69 62 } 70 63 EXPORT_SYMBOL_GPL(__sync_icache_dcache); ··· 67 74 * it as dirty for later flushing when mapped in user space (if executable, 68 75 * see __sync_icache_dcache). 69 76 */ 77 + void flush_dcache_folio(struct folio *folio) 78 + { 79 + if (test_bit(PG_dcache_clean, &folio->flags)) 80 + clear_bit(PG_dcache_clean, &folio->flags); 81 + } 82 + EXPORT_SYMBOL(flush_dcache_folio); 83 + 70 84 void flush_dcache_page(struct page *page) 71 85 { 72 - /* 73 - * HugeTLB pages are always fully mapped and only head page will be 74 - * set PG_dcache_clean (see comments in __sync_icache_dcache()). 75 - */ 76 - if (PageHuge(page)) 77 - page = compound_head(page); 78 - 79 - if (test_bit(PG_dcache_clean, &page->flags)) 80 - clear_bit(PG_dcache_clean, &page->flags); 86 + flush_dcache_folio(page_folio(page)); 81 87 } 82 88 EXPORT_SYMBOL(flush_dcache_page); 83 89
+1 -1
arch/arm64/mm/hugetlbpage.c
··· 236 236 unsigned long i, saddr = addr; 237 237 238 238 for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) 239 - pte_clear(mm, addr, ptep); 239 + ptep_clear(mm, addr, ptep); 240 240 241 241 flush_tlb_range(&vma, saddr, addr); 242 242 }
+6 -4
arch/arm64/mm/ioremap.c
··· 3 3 #include <linux/mm.h> 4 4 #include <linux/io.h> 5 5 6 - bool ioremap_allowed(phys_addr_t phys_addr, size_t size, unsigned long prot) 6 + void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, 7 + unsigned long prot) 7 8 { 8 9 unsigned long last_addr = phys_addr + size - 1; 9 10 10 11 /* Don't allow outside PHYS_MASK */ 11 12 if (last_addr & ~PHYS_MASK) 12 - return false; 13 + return NULL; 13 14 14 15 /* Don't allow RAM to be mapped. */ 15 16 if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr)))) 16 - return false; 17 + return NULL; 17 18 18 - return true; 19 + return generic_ioremap_prot(phys_addr, size, __pgprot(prot)); 19 20 } 21 + EXPORT_SYMBOL(ioremap_prot); 20 22 21 23 /* 22 24 * Must be called after early_fixmap_init
+4 -3
arch/arm64/mm/mmu.c
··· 426 426 static phys_addr_t pgd_pgtable_alloc(int shift) 427 427 { 428 428 phys_addr_t pa = __pgd_pgtable_alloc(shift); 429 + struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa)); 429 430 430 431 /* 431 432 * Call proper page table ctor in case later we need to ··· 434 433 * this pre-allocated page table. 435 434 * 436 435 * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is 437 - * folded, and if so pgtable_pmd_page_ctor() becomes nop. 436 + * folded, and if so pagetable_pte_ctor() becomes nop. 438 437 */ 439 438 if (shift == PAGE_SHIFT) 440 - BUG_ON(!pgtable_pte_page_ctor(phys_to_page(pa))); 439 + BUG_ON(!pagetable_pte_ctor(ptdesc)); 441 440 else if (shift == PMD_SHIFT) 442 - BUG_ON(!pgtable_pmd_page_ctor(phys_to_page(pa))); 441 + BUG_ON(!pagetable_pmd_ctor(ptdesc)); 443 442 444 443 return pa; 445 444 }
+3 -2
arch/arm64/mm/mteswap.c
··· 33 33 34 34 mte_save_page_tags(page_address(page), tag_storage); 35 35 36 - /* page_private contains the swap entry.val set in do_swap_page */ 37 - ret = xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL); 36 + /* lookup the swap entry.val from the page */ 37 + ret = xa_store(&mte_pages, page_swap_entry(page).val, tag_storage, 38 + GFP_KERNEL); 38 39 if (WARN(xa_is_err(ret), "Failed to store MTE tags")) { 39 40 mte_free_tag_storage(tag_storage); 40 41 return xa_err(ret);
+19 -13
arch/csky/abiv1/cacheflush.c
··· 15 15 16 16 #define PG_dcache_clean PG_arch_1 17 17 18 - void flush_dcache_page(struct page *page) 18 + void flush_dcache_folio(struct folio *folio) 19 19 { 20 20 struct address_space *mapping; 21 21 22 - if (page == ZERO_PAGE(0)) 22 + if (is_zero_pfn(folio_pfn(folio))) 23 23 return; 24 24 25 - mapping = page_mapping_file(page); 25 + mapping = folio_flush_mapping(folio); 26 26 27 - if (mapping && !page_mapcount(page)) 28 - clear_bit(PG_dcache_clean, &page->flags); 27 + if (mapping && !folio_mapped(folio)) 28 + clear_bit(PG_dcache_clean, &folio->flags); 29 29 else { 30 30 dcache_wbinv_all(); 31 31 if (mapping) 32 32 icache_inv_all(); 33 - set_bit(PG_dcache_clean, &page->flags); 33 + set_bit(PG_dcache_clean, &folio->flags); 34 34 } 35 + } 36 + EXPORT_SYMBOL(flush_dcache_folio); 37 + 38 + void flush_dcache_page(struct page *page) 39 + { 40 + flush_dcache_folio(page_folio(page)); 35 41 } 36 42 EXPORT_SYMBOL(flush_dcache_page); 37 43 38 - void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, 39 - pte_t *ptep) 44 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma, 45 + unsigned long addr, pte_t *ptep, unsigned int nr) 40 46 { 41 47 unsigned long pfn = pte_pfn(*ptep); 42 - struct page *page; 48 + struct folio *folio; 43 49 44 50 flush_tlb_page(vma, addr); 45 51 46 52 if (!pfn_valid(pfn)) 47 53 return; 48 54 49 - page = pfn_to_page(pfn); 50 - if (page == ZERO_PAGE(0)) 55 + if (is_zero_pfn(pfn)) 51 56 return; 52 57 53 - if (!test_and_set_bit(PG_dcache_clean, &page->flags)) 58 + folio = page_folio(pfn_to_page(pfn)); 59 + if (!test_and_set_bit(PG_dcache_clean, &folio->flags)) 54 60 dcache_wbinv_all(); 55 61 56 - if (page_mapping_file(page)) { 62 + if (folio_flush_mapping(folio)) { 57 63 if (vma->vm_flags & VM_EXEC) 58 64 icache_inv_all(); 59 65 }
+2 -1
arch/csky/abiv1/inc/abi/cacheflush.h
··· 9 9 10 10 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 11 11 extern void flush_dcache_page(struct page *); 12 + void flush_dcache_folio(struct folio *); 13 + #define flush_dcache_folio flush_dcache_folio 12 14 13 15 #define flush_cache_mm(mm) dcache_wbinv_all() 14 16 #define flush_cache_page(vma, page, pfn) cache_wbinv_all() ··· 45 43 #define flush_cache_vmap(start, end) cache_wbinv_all() 46 44 #define flush_cache_vunmap(start, end) cache_wbinv_all() 47 45 48 - #define flush_icache_page(vma, page) do {} while (0); 49 46 #define flush_icache_range(start, end) cache_wbinv_range(start, end) 50 47 #define flush_icache_mm_range(mm, start, end) cache_wbinv_range(start, end) 51 48 #define flush_icache_deferred(mm) do {} while (0);
+18 -15
arch/csky/abiv2/cacheflush.c
··· 7 7 #include <asm/cache.h> 8 8 #include <asm/tlbflush.h> 9 9 10 - void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, 11 - pte_t *pte) 10 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma, 11 + unsigned long address, pte_t *pte, unsigned int nr) 12 12 { 13 - unsigned long addr; 14 - struct page *page; 13 + unsigned long pfn = pte_pfn(*pte); 14 + struct folio *folio; 15 + unsigned int i; 15 16 16 17 flush_tlb_page(vma, address); 17 18 18 - if (!pfn_valid(pte_pfn(*pte))) 19 + if (!pfn_valid(pfn)) 19 20 return; 20 21 21 - page = pfn_to_page(pte_pfn(*pte)); 22 - if (page == ZERO_PAGE(0)) 22 + folio = page_folio(pfn_to_page(pfn)); 23 + 24 + if (test_and_set_bit(PG_dcache_clean, &folio->flags)) 23 25 return; 24 26 25 - if (test_and_set_bit(PG_dcache_clean, &page->flags)) 26 - return; 27 + icache_inv_range(address, address + nr*PAGE_SIZE); 28 + for (i = 0; i < folio_nr_pages(folio); i++) { 29 + unsigned long addr = (unsigned long) kmap_local_folio(folio, 30 + i * PAGE_SIZE); 27 31 28 - addr = (unsigned long) kmap_atomic(page); 29 - 30 - icache_inv_range(address, address + PAGE_SIZE); 31 - dcache_wb_range(addr, addr + PAGE_SIZE); 32 - 33 - kunmap_atomic((void *) addr); 32 + dcache_wb_range(addr, addr + PAGE_SIZE); 33 + if (vma->vm_flags & VM_EXEC) 34 + icache_inv_range(addr, addr + PAGE_SIZE); 35 + kunmap_local((void *) addr); 36 + } 34 37 } 35 38 36 39 void flush_icache_deferred(struct mm_struct *mm)
+8 -3
arch/csky/abiv2/inc/abi/cacheflush.h
··· 18 18 19 19 #define PG_dcache_clean PG_arch_1 20 20 21 + static inline void flush_dcache_folio(struct folio *folio) 22 + { 23 + if (test_bit(PG_dcache_clean, &folio->flags)) 24 + clear_bit(PG_dcache_clean, &folio->flags); 25 + } 26 + #define flush_dcache_folio flush_dcache_folio 27 + 21 28 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 22 29 static inline void flush_dcache_page(struct page *page) 23 30 { 24 - if (test_bit(PG_dcache_clean, &page->flags)) 25 - clear_bit(PG_dcache_clean, &page->flags); 31 + flush_dcache_folio(page_folio(page)); 26 32 } 27 33 28 34 #define flush_dcache_mmap_lock(mapping) do { } while (0) 29 35 #define flush_dcache_mmap_unlock(mapping) do { } while (0) 30 - #define flush_icache_page(vma, page) do { } while (0) 31 36 32 37 #define flush_icache_range(start, end) cache_wbinv_range(start, end) 33 38
+2 -2
arch/csky/include/asm/pgalloc.h
··· 63 63 64 64 #define __pte_free_tlb(tlb, pte, address) \ 65 65 do { \ 66 - pgtable_pte_page_dtor(pte); \ 67 - tlb_remove_page(tlb, pte); \ 66 + pagetable_pte_dtor(page_ptdesc(pte)); \ 67 + tlb_remove_page_ptdesc(tlb, page_ptdesc(pte)); \ 68 68 } while (0) 69 69 70 70 extern void pagetable_init(void);
+5 -3
arch/csky/include/asm/pgtable.h
··· 28 28 #define pgd_ERROR(e) \ 29 29 pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e)) 30 30 31 + #define PFN_PTE_SHIFT PAGE_SHIFT 31 32 #define pmd_pfn(pmd) (pmd_phys(pmd) >> PAGE_SHIFT) 32 33 #define pmd_page(pmd) (pfn_to_page(pmd_phys(pmd) >> PAGE_SHIFT)) 33 34 #define pte_clear(mm, addr, ptep) set_pte((ptep), \ ··· 91 90 /* prevent out of order excution */ 92 91 smp_mb(); 93 92 } 94 - #define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval) 95 93 96 94 static inline pte_t *pmd_page_vaddr(pmd_t pmd) 97 95 { ··· 263 263 extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; 264 264 extern void paging_init(void); 265 265 266 - void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, 267 - pte_t *pte); 266 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma, 267 + unsigned long address, pte_t *pte, unsigned int nr); 268 + #define update_mmu_cache(vma, addr, ptep) \ 269 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 268 270 269 271 #define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \ 270 272 remap_pfn_range(vma, vaddr, pfn, size, prot)
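The `PFN_PTE_SHIFT` definitions added across these architectures (csky here, and hexagon, loongarch, m68k, microblaze below) exist so that the generic MM code can build the PTE for page N+1 from the PTE for page N itself, which is what lets a single `set_ptes()`/`update_mmu_cache_range()` call cover a run of pages without an arch callback per page. A rough userspace model of that arithmetic — the 12-bit shift and flat pte encoding are illustrative only, not the real csky layout:

```c
#include <assert.h>

/* Userspace model of the arithmetic behind the PFN_PTE_SHIFT exports in
 * this series: once the architecture publishes the shift, generic code
 * such as set_ptes() can derive the PTE for page N+1 from the PTE for
 * page N without an arch callback per page. The 12-bit shift and the
 * flat pte encoding here are illustrative, not the real csky layout. */
#define DEMO_PFN_PTE_SHIFT 12

typedef unsigned long demo_pte_t;

/* Build a pte from a page frame number plus protection bits. */
static demo_pte_t demo_pfn_pte(unsigned long pfn, unsigned long prot)
{
	return (pfn << DEMO_PFN_PTE_SHIFT) | prot;
}

/* Model of the generic set_ptes() loop: write nr consecutive ptes for
 * physically consecutive pages, advancing the pfn field each step. */
static void demo_set_ptes(demo_pte_t *ptep, demo_pte_t pte, unsigned int nr)
{
	for (;;) {
		*ptep = pte;
		if (--nr == 0)
			break;
		ptep++;
		pte += 1UL << DEMO_PFN_PTE_SHIFT;	/* next pfn, same prot */
	}
}
```

This is also why the per-arch `set_pte_at()` wrappers are deleted in these hunks: the generic code now supplies them on top of `set_pte()`.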
+1
arch/hexagon/Kconfig
··· 25 25 select NEED_SG_DMA_LENGTH 26 26 select NO_IOPORT_MAP 27 27 select GENERIC_IOMAP 28 + select GENERIC_IOREMAP 28 29 select GENERIC_SMP_IDLE_THREAD 29 30 select STACKTRACE_SUPPORT 30 31 select GENERIC_CLOCKEVENTS_BROADCAST
+7 -3
arch/hexagon/include/asm/cacheflush.h
··· 18 18 * - flush_cache_range(vma, start, end) flushes a range of pages 19 19 * - flush_icache_range(start, end) flush a range of instructions 20 20 * - flush_dcache_page(pg) flushes(wback&invalidates) a page for dcache 21 - * - flush_icache_page(vma, pg) flushes(invalidates) a page for icache 21 + * - flush_icache_pages(vma, pg, nr) flushes(invalidates) nr pages for icache 22 22 * 23 23 * Need to doublecheck which one is really needed for ptrace stuff to work. 24 24 */ ··· 58 58 * clean the cache when the PTE is set. 59 59 * 60 60 */ 61 - static inline void update_mmu_cache(struct vm_area_struct *vma, 62 - unsigned long address, pte_t *ptep) 61 + static inline void update_mmu_cache_range(struct vm_fault *vmf, 62 + struct vm_area_struct *vma, unsigned long address, 63 + pte_t *ptep, unsigned int nr) 63 64 { 64 65 /* generic_ptrace_pokedata doesn't wind up here, does it? */ 65 66 } 67 + 68 + #define update_mmu_cache(vma, addr, ptep) \ 69 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 66 70 67 71 void copy_to_user_page(struct vm_area_struct *vma, struct page *page, 68 72 unsigned long vaddr, void *dst, void *src, int len);
+7 -4
arch/hexagon/include/asm/io.h
··· 27 27 extern int remap_area_pages(unsigned long start, unsigned long phys_addr, 28 28 unsigned long end, unsigned long flags); 29 29 30 - extern void iounmap(const volatile void __iomem *addr); 31 - 32 30 /* Defined in lib/io.c, needed for smc91x driver. */ 33 31 extern void __raw_readsw(const void __iomem *addr, void *data, int wordlen); 34 32 extern void __raw_writesw(void __iomem *addr, const void *data, int wordlen); ··· 168 170 #define writew_relaxed __raw_writew 169 171 #define writel_relaxed __raw_writel 170 172 171 - void __iomem *ioremap(unsigned long phys_addr, unsigned long size); 172 - #define ioremap_uc(X, Y) ioremap((X), (Y)) 173 + /* 174 + * I/O memory mapping functions. 175 + */ 176 + #define _PAGE_IOREMAP (_PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE | \ 177 + (__HEXAGON_C_DEV << 6)) 178 + 179 + #define ioremap_uc(addr, size) ioremap((addr), (size)) 173 180 174 181 175 182 #define __raw_writel writel
+4 -4
arch/hexagon/include/asm/pgalloc.h
··· 87 87 max_kernel_seg = pmdindex; 88 88 } 89 89 90 - #define __pte_free_tlb(tlb, pte, addr) \ 91 - do { \ 92 - pgtable_pte_page_dtor((pte)); \ 93 - tlb_remove_page((tlb), (pte)); \ 90 + #define __pte_free_tlb(tlb, pte, addr) \ 91 + do { \ 92 + pagetable_pte_dtor((page_ptdesc(pte))); \ 93 + tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte))); \ 94 94 } while (0) 95 95 96 96 #endif
+1 -8
arch/hexagon/include/asm/pgtable.h
··· 338 338 /* __swp_entry_to_pte - extract PTE from swap entry */ 339 339 #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) 340 340 341 + #define PFN_PTE_SHIFT PAGE_SHIFT 341 342 /* pfn_pte - convert page number and protection value to page table entry */ 342 343 #define pfn_pte(pfn, pgprot) __pte((pfn << PAGE_SHIFT) | pgprot_val(pgprot)) 343 344 344 345 /* pte_pfn - convert pte to page frame number */ 345 346 #define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT) 346 347 #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval)) 347 - 348 - /* 349 - * set_pte_at - update page table and do whatever magic may be 350 - * necessary to make the underlying hardware/firmware take note. 351 - * 352 - * VM may require a virtual instruction to alert the MMU. 353 - */ 354 - #define set_pte_at(mm, addr, ptep, pte) set_pte(ptep, pte) 355 348 356 349 static inline unsigned long pmd_page_vaddr(pmd_t pmd) 357 350 {
-2
arch/hexagon/kernel/hexagon_ksyms.c
··· 14 14 EXPORT_SYMBOL(__clear_user_hexagon); 15 15 EXPORT_SYMBOL(raw_copy_from_user); 16 16 EXPORT_SYMBOL(raw_copy_to_user); 17 - EXPORT_SYMBOL(iounmap); 18 17 EXPORT_SYMBOL(__vmgetie); 19 18 EXPORT_SYMBOL(__vmsetie); 20 19 EXPORT_SYMBOL(__vmyield); 21 20 EXPORT_SYMBOL(empty_zero_page); 22 - EXPORT_SYMBOL(ioremap); 23 21 EXPORT_SYMBOL(memcpy); 24 22 EXPORT_SYMBOL(memset); 25 23
+1 -1
arch/hexagon/mm/Makefile
··· 3 3 # Makefile for Hexagon memory management subsystem 4 4 # 5 5 6 - obj-y := init.o ioremap.o uaccess.o vm_fault.o cache.o 6 + obj-y := init.o uaccess.o vm_fault.o cache.o 7 7 obj-y += copy_to_user.o copy_from_user.o vm_tlb.o
-44
arch/hexagon/mm/ioremap.c
··· 1 - // SPDX-License-Identifier: GPL-2.0-only 2 - /* 3 - * I/O remap functions for Hexagon 4 - * 5 - * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved. 6 - */ 7 - 8 - #include <linux/io.h> 9 - #include <linux/vmalloc.h> 10 - #include <linux/mm.h> 11 - 12 - void __iomem *ioremap(unsigned long phys_addr, unsigned long size) 13 - { 14 - unsigned long last_addr, addr; 15 - unsigned long offset = phys_addr & ~PAGE_MASK; 16 - struct vm_struct *area; 17 - 18 - pgprot_t prot = __pgprot(_PAGE_PRESENT|_PAGE_READ|_PAGE_WRITE 19 - |(__HEXAGON_C_DEV << 6)); 20 - 21 - last_addr = phys_addr + size - 1; 22 - 23 - /* Wrapping not allowed */ 24 - if (!size || (last_addr < phys_addr)) 25 - return NULL; 26 - 27 - /* Rounds up to next page size, including whole-page offset */ 28 - size = PAGE_ALIGN(offset + size); 29 - 30 - area = get_vm_area(size, VM_IOREMAP); 31 - addr = (unsigned long)area->addr; 32 - 33 - if (ioremap_page_range(addr, addr+size, phys_addr, prot)) { 34 - vunmap((void *)addr); 35 - return NULL; 36 - } 37 - 38 - return (void __iomem *) (offset + addr); 39 - } 40 - 41 - void iounmap(const volatile void __iomem *addr) 42 - { 43 - vunmap((void *) ((unsigned long) addr & PAGE_MASK)); 44 - }
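With `GENERIC_IOREMAP` selected, the sub-page offset handling that this deleted hexagon `ioremap()` did by hand moves into the common `generic_ioremap_prot()`: map whole pages, then return the caller's offset into the first one. A small sketch of just that address arithmetic, with no real MMU work — the struct and function names here are hypothetical:

```c
/* Sketch of the sub-page offset handling that the deleted ioremap()
 * performed by hand and that generic_ioremap_prot() now does centrally:
 * map whole pages, then hand the caller a pointer offset into the first
 * one. No real MMU work here, just the address arithmetic; the struct
 * and function names are hypothetical. */
#define DEMO_PAGE_SIZE	4096UL
#define DEMO_PAGE_MASK	(~(DEMO_PAGE_SIZE - 1))
#define DEMO_PAGE_ALIGN(x) (((x) + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK)

struct demo_mapping {
	unsigned long phys_base;	/* page-aligned physical base */
	unsigned long map_len;		/* whole pages to map */
	unsigned long ret_offset;	/* offset returned to the caller */
};

static int demo_ioremap_layout(unsigned long phys_addr, unsigned long size,
			       struct demo_mapping *m)
{
	unsigned long offset = phys_addr & ~DEMO_PAGE_MASK;
	unsigned long last_addr = phys_addr + size - 1;

	/* Reject zero-sized requests and ranges that wrap the address space. */
	if (!size || last_addr < phys_addr)
		return -1;

	m->phys_base = phys_addr & DEMO_PAGE_MASK;
	m->map_len = DEMO_PAGE_ALIGN(offset + size);
	m->ret_offset = offset;
	return 0;
}
```

The architecture then only has to supply the page protection to use, which is what the new `_PAGE_IOREMAP` definitions in the hexagon and ia64 headers provide.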
+1
arch/ia64/Kconfig
··· 47 47 select GENERIC_IRQ_LEGACY 48 48 select ARCH_HAVE_NMI_SAFE_CMPXCHG 49 49 select GENERIC_IOMAP 50 + select GENERIC_IOREMAP 50 51 select GENERIC_SMP_IDLE_THREAD 51 52 select ARCH_TASK_STRUCT_ON_STACK 52 53 select ARCH_TASK_STRUCT_ALLOCATOR
+18 -10
arch/ia64/hp/common/sba_iommu.c
··· 798 798 #endif 799 799 800 800 #ifdef ENABLE_MARK_CLEAN 801 - /** 801 + /* 802 802 * Since DMA is i-cache coherent, any (complete) pages that were written via 803 803 * DMA can be marked as "clean" so that lazy_mmu_prot_update() doesn't have to 804 804 * flush them when they get mapped into an executable vm-area. 805 805 */ 806 - static void 807 - mark_clean (void *addr, size_t size) 806 + static void mark_clean(void *addr, size_t size) 808 807 { 809 - unsigned long pg_addr, end; 808 + struct folio *folio = virt_to_folio(addr); 809 + ssize_t left = size; 810 + size_t offset = offset_in_folio(folio, addr); 810 811 811 - pg_addr = PAGE_ALIGN((unsigned long) addr); 812 - end = (unsigned long) addr + size; 813 - while (pg_addr + PAGE_SIZE <= end) { 814 - struct page *page = virt_to_page((void *)pg_addr); 815 - set_bit(PG_arch_1, &page->flags); 816 - pg_addr += PAGE_SIZE; 812 + if (offset) { 813 + left -= folio_size(folio) - offset; 814 + if (left <= 0) 815 + return; 816 + folio = folio_next(folio); 817 + } 818 + 819 + while (left >= folio_size(folio)) { 820 + left -= folio_size(folio); 821 + set_bit(PG_arch_1, &folio->flags); 822 + if (!left) 823 + break; 824 + folio = folio_next(folio); 817 825 } 818 826 } 819 827 #endif
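The reworked `mark_clean()` walks folios instead of pages: skip any partial leading folio, then mark only folios that lie entirely inside the written range, since a partially written folio cannot be assumed clean. A minimal userspace sketch of the same walk, assuming fixed-size folios (real folios vary in size, which is why the kernel loop re-reads `folio_size()` every iteration):

```c
#include <stddef.h>

/* Minimal sketch of the folio walk used by the new mark_clean() above:
 * skip any partial leading folio, then count only folios that lie
 * entirely inside [addr, addr + size), since only fully written folios
 * may be marked clean. Folios are assumed to be a fixed 4 KiB here;
 * real folios vary in size, hence the folio_size() calls in the kernel. */
#define DEMO_FOLIO_SIZE 4096UL

static unsigned int count_clean_folios(unsigned long addr, size_t size)
{
	long left = (long)size;
	unsigned long offset = addr & (DEMO_FOLIO_SIZE - 1);
	unsigned int n = 0;

	if (offset) {			/* partial leading folio: skip it */
		left -= DEMO_FOLIO_SIZE - offset;
		if (left <= 0)
			return 0;
	}
	while (left >= (long)DEMO_FOLIO_SIZE) {	/* fully covered folios */
		left -= DEMO_FOLIO_SIZE;
		n++;
	}
	return n;
}
```

The `arch_dma_mark_clean()` hunk in `arch/ia64/mm/init.c` below applies the identical offset/remainder pattern.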
+10 -4
arch/ia64/include/asm/cacheflush.h
··· 13 13 #include <asm/page.h> 14 14 15 15 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 16 - #define flush_dcache_page(page) \ 17 - do { \ 18 - clear_bit(PG_arch_1, &(page)->flags); \ 19 - } while (0) 16 + static inline void flush_dcache_folio(struct folio *folio) 17 + { 18 + clear_bit(PG_arch_1, &folio->flags); 19 + } 20 + #define flush_dcache_folio flush_dcache_folio 21 + 22 + static inline void flush_dcache_page(struct page *page) 23 + { 24 + flush_dcache_folio(page_folio(page)); 25 + } 20 26 21 27 extern void flush_icache_range(unsigned long start, unsigned long end); 22 28 #define flush_icache_range flush_icache_range
+5 -8
arch/ia64/include/asm/io.h
··· 243 243 244 244 # ifdef __KERNEL__ 245 245 246 - extern void __iomem * ioremap(unsigned long offset, unsigned long size); 246 + #define _PAGE_IOREMAP pgprot_val(PAGE_KERNEL) 247 + 247 248 extern void __iomem * ioremap_uc(unsigned long offset, unsigned long size); 248 - extern void iounmap (volatile void __iomem *addr); 249 - static inline void __iomem * ioremap_cache (unsigned long phys_addr, unsigned long size) 250 - { 251 - return ioremap(phys_addr, size); 252 - } 253 - #define ioremap ioremap 254 - #define ioremap_cache ioremap_cache 249 + 250 + #define ioremap_prot ioremap_prot 251 + #define ioremap_cache ioremap 255 252 #define ioremap_uc ioremap_uc 256 253 #define iounmap iounmap 257 254
+2 -2
arch/ia64/include/asm/pgtable.h
··· 206 206 #define RGN_MAP_SHIFT (PGDIR_SHIFT + PTRS_PER_PGD_SHIFT - 3) 207 207 #define RGN_MAP_LIMIT ((1UL << RGN_MAP_SHIFT) - PAGE_SIZE) /* per region addr limit */ 208 208 209 + #define PFN_PTE_SHIFT PAGE_SHIFT 209 210 /* 210 211 * Conversion functions: convert page frame number (pfn) and a protection value to a page 211 212 * table entry (pte). ··· 304 303 *ptep = pteval; 305 304 } 306 305 307 - #define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval) 308 - 309 306 /* 310 307 * Make page protection values cacheable, uncacheable, or write- 311 308 * combining. Note that "protection" is really a misnomer here as the ··· 395 396 return pte_val(a) == pte_val(b); 396 397 } 397 398 399 + #define update_mmu_cache_range(vmf, vma, address, ptep, nr) do { } while (0) 398 400 #define update_mmu_cache(vma, address, ptep) do { } while (0) 399 401 400 402 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
+23 -9
arch/ia64/mm/init.c
··· 50 50 __ia64_sync_icache_dcache (pte_t pte) 51 51 { 52 52 unsigned long addr; 53 - struct page *page; 53 + struct folio *folio; 54 54 55 - page = pte_page(pte); 56 - addr = (unsigned long) page_address(page); 55 + folio = page_folio(pte_page(pte)); 56 + addr = (unsigned long)folio_address(folio); 57 57 58 - if (test_bit(PG_arch_1, &page->flags)) 58 + if (test_bit(PG_arch_1, &folio->flags)) 59 59 return; /* i-cache is already coherent with d-cache */ 60 60 61 - flush_icache_range(addr, addr + page_size(page)); 62 - set_bit(PG_arch_1, &page->flags); /* mark page as clean */ 61 + flush_icache_range(addr, addr + folio_size(folio)); 62 + set_bit(PG_arch_1, &folio->flags); /* mark page as clean */ 63 63 } 64 64 65 65 /* 66 - * Since DMA is i-cache coherent, any (complete) pages that were written via 66 + * Since DMA is i-cache coherent, any (complete) folios that were written via 67 67 * DMA can be marked as "clean" so that lazy_mmu_prot_update() doesn't have to 68 68 * flush them when they get mapped into an executable vm-area. 69 69 */ 70 70 void arch_dma_mark_clean(phys_addr_t paddr, size_t size) 71 71 { 72 72 unsigned long pfn = PHYS_PFN(paddr); 73 + struct folio *folio = page_folio(pfn_to_page(pfn)); 74 + ssize_t left = size; 75 + size_t offset = offset_in_folio(folio, paddr); 73 76 74 - do { 77 + if (offset) { 78 + left -= folio_size(folio) - offset; 79 + if (left <= 0) 80 + return; 81 + folio = folio_next(folio); 82 + } 83 + 84 + while (left >= (ssize_t)folio_size(folio)) { 85 + left -= folio_size(folio); 75 86 set_bit(PG_arch_1, &pfn_to_page(pfn)->flags); 76 - } while (++pfn <= PHYS_PFN(paddr + size - 1)); 87 + if (!left) 88 + break; 89 + folio = folio_next(folio); 90 + } 77 91 } 78 92 79 93 inline void
+6 -35
arch/ia64/mm/ioremap.c
··· 29 29 return __ioremap_uc(phys_addr); 30 30 } 31 31 32 - void __iomem * 33 - ioremap (unsigned long phys_addr, unsigned long size) 32 + void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, 33 + unsigned long flags) 34 34 { 35 - void __iomem *addr; 36 - struct vm_struct *area; 37 - unsigned long offset; 38 - pgprot_t prot; 39 35 u64 attr; 40 36 unsigned long gran_base, gran_size; 41 37 unsigned long page_base; ··· 64 68 */ 65 69 page_base = phys_addr & PAGE_MASK; 66 70 size = PAGE_ALIGN(phys_addr + size) - page_base; 67 - if (efi_mem_attribute(page_base, size) & EFI_MEMORY_WB) { 68 - prot = PAGE_KERNEL; 69 - 70 - /* 71 - * Mappings have to be page-aligned 72 - */ 73 - offset = phys_addr & ~PAGE_MASK; 74 - phys_addr &= PAGE_MASK; 75 - 76 - /* 77 - * Ok, go for it.. 78 - */ 79 - area = get_vm_area(size, VM_IOREMAP); 80 - if (!area) 81 - return NULL; 82 - 83 - area->phys_addr = phys_addr; 84 - addr = (void __iomem *) area->addr; 85 - if (ioremap_page_range((unsigned long) addr, 86 - (unsigned long) addr + size, phys_addr, prot)) { 87 - vunmap((void __force *) addr); 88 - return NULL; 89 - } 90 - 91 - return (void __iomem *) (offset + (char __iomem *)addr); 92 - } 71 + if (efi_mem_attribute(page_base, size) & EFI_MEMORY_WB) 72 + return generic_ioremap_prot(phys_addr, size, __pgprot(flags)); 93 73 94 74 return __ioremap_uc(phys_addr); 95 75 } 96 - EXPORT_SYMBOL(ioremap); 76 + EXPORT_SYMBOL(ioremap_prot); 97 77 98 78 void __iomem * 99 79 ioremap_uc(unsigned long phys_addr, unsigned long size) ··· 86 114 { 87 115 } 88 116 89 - void 90 - iounmap (volatile void __iomem *addr) 117 + void iounmap(volatile void __iomem *addr) 91 118 { 92 119 if (REGION_NUMBER(addr) == RGN_GATE) 93 120 vunmap((void *) ((unsigned long) addr & PAGE_MASK));
+1 -1
arch/loongarch/Kconfig
··· 60 60 select ARCH_USE_QUEUED_SPINLOCKS 61 61 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT 62 62 select ARCH_WANT_LD_ORPHAN_WARN 63 - select ARCH_WANT_OPTIMIZE_VMEMMAP 63 + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP 64 64 select ARCH_WANTS_NO_INSTR 65 65 select BUILDTIME_TABLE_SORT 66 66 select COMMON_CLK
-1
arch/loongarch/include/asm/cacheflush.h
··· 46 46 #define flush_cache_page(vma, vmaddr, pfn) do { } while (0) 47 47 #define flush_cache_vmap(start, end) do { } while (0) 48 48 #define flush_cache_vunmap(start, end) do { } while (0) 49 - #define flush_icache_page(vma, page) do { } while (0) 50 49 #define flush_icache_user_page(vma, page, addr, len) do { } while (0) 51 50 #define flush_dcache_page(page) do { } while (0) 52 51 #define flush_dcache_mmap_lock(mapping) do { } while (0)
-2
arch/loongarch/include/asm/io.h
··· 5 5 #ifndef _ASM_IO_H 6 6 #define _ASM_IO_H 7 7 8 - #define ARCH_HAS_IOREMAP_WC 9 - 10 8 #include <linux/kernel.h> 11 9 #include <linux/types.h> 12 10
+15 -12
arch/loongarch/include/asm/pgalloc.h
··· 45 45 extern pgd_t *pgd_alloc(struct mm_struct *mm); 46 46 47 47 #define __pte_free_tlb(tlb, pte, address) \ 48 - do { \ 49 - pgtable_pte_page_dtor(pte); \ 50 - tlb_remove_page((tlb), pte); \ 48 + do { \ 49 + pagetable_pte_dtor(page_ptdesc(pte)); \ 50 + tlb_remove_page_ptdesc((tlb), page_ptdesc(pte)); \ 51 51 } while (0) 52 52 53 53 #ifndef __PAGETABLE_PMD_FOLDED ··· 55 55 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) 56 56 { 57 57 pmd_t *pmd; 58 - struct page *pg; 58 + struct ptdesc *ptdesc; 59 59 60 - pg = alloc_page(GFP_KERNEL_ACCOUNT); 61 - if (!pg) 60 + ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, 0); 61 + if (!ptdesc) 62 62 return NULL; 63 63 64 - if (!pgtable_pmd_page_ctor(pg)) { 65 - __free_page(pg); 64 + if (!pagetable_pmd_ctor(ptdesc)) { 65 + pagetable_free(ptdesc); 66 66 return NULL; 67 67 } 68 68 69 - pmd = (pmd_t *)page_address(pg); 69 + pmd = ptdesc_address(ptdesc); 70 70 pmd_init(pmd); 71 71 return pmd; 72 72 } ··· 80 80 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address) 81 81 { 82 82 pud_t *pud; 83 + struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); 83 84 84 - pud = (pud_t *) __get_free_page(GFP_KERNEL); 85 - if (pud) 86 - pud_init(pud); 85 + if (!ptdesc) 86 + return NULL; 87 + pud = ptdesc_address(ptdesc); 88 + 89 + pud_init(pud); 87 90 return pud; 88 91 } 89 92
+2 -2
arch/loongarch/include/asm/pgtable-bits.h
··· 50 50 #define _PAGE_NO_EXEC (_ULCAST_(1) << _PAGE_NO_EXEC_SHIFT) 51 51 #define _PAGE_RPLV (_ULCAST_(1) << _PAGE_RPLV_SHIFT) 52 52 #define _CACHE_MASK (_ULCAST_(3) << _CACHE_SHIFT) 53 - #define _PFN_SHIFT (PAGE_SHIFT - 12 + _PAGE_PFN_SHIFT) 53 + #define PFN_PTE_SHIFT (PAGE_SHIFT - 12 + _PAGE_PFN_SHIFT) 54 54 55 55 #define _PAGE_USER (PLV_USER << _PAGE_PLV_SHIFT) 56 56 #define _PAGE_KERN (PLV_KERN << _PAGE_PLV_SHIFT) 57 57 58 - #define _PFN_MASK (~((_ULCAST_(1) << (_PFN_SHIFT)) - 1) & \ 58 + #define _PFN_MASK (~((_ULCAST_(1) << (PFN_PTE_SHIFT)) - 1) & \ 59 59 ((_ULCAST_(1) << (_PAGE_PFN_END_SHIFT)) - 1)) 60 60 61 61 /*
+18 -15
arch/loongarch/include/asm/pgtable.h
··· 237 237 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd); 238 238 239 239 #define pte_page(x) pfn_to_page(pte_pfn(x)) 240 - #define pte_pfn(x) ((unsigned long)(((x).pte & _PFN_MASK) >> _PFN_SHIFT)) 241 - #define pfn_pte(pfn, prot) __pte(((pfn) << _PFN_SHIFT) | pgprot_val(prot)) 242 - #define pfn_pmd(pfn, prot) __pmd(((pfn) << _PFN_SHIFT) | pgprot_val(prot)) 240 + #define pte_pfn(x) ((unsigned long)(((x).pte & _PFN_MASK) >> PFN_PTE_SHIFT)) 241 + #define pfn_pte(pfn, prot) __pte(((pfn) << PFN_PTE_SHIFT) | pgprot_val(prot)) 242 + #define pfn_pmd(pfn, prot) __pmd(((pfn) << PFN_PTE_SHIFT) | pgprot_val(prot)) 243 243 244 244 /* 245 245 * Initialize a new pgd / pud / pmd table with invalid pointers. ··· 334 334 } 335 335 337 - static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, 338 - pte_t *ptep, pte_t pteval) 339 - { 340 - set_pte(ptep, pteval); 341 - } 342 - 343 337 static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) 344 338 { 345 339 /* Preserve global status for the pair */ 346 340 if (pte_val(*ptep_buddy(ptep)) & _PAGE_GLOBAL) 347 341 - set_pte_at(mm, addr, ptep, __pte(_PAGE_GLOBAL)); 341 + set_pte(ptep, __pte(_PAGE_GLOBAL)); 348 342 else 349 343 - set_pte_at(mm, addr, ptep, __pte(0)); 343 + set_pte(ptep, __pte(0)); 350 344 } 351 345 352 346 #define PGD_T_LOG2 (__builtin_ffs(sizeof(pgd_t)) - 1) ··· 439 445 extern void __update_tlb(struct vm_area_struct *vma, 440 446 unsigned long address, pte_t *ptep); 441 447 442 - static inline void update_mmu_cache(struct vm_area_struct *vma, 443 - unsigned long address, pte_t *ptep) 448 + static inline void update_mmu_cache_range(struct vm_fault *vmf, 449 + struct vm_area_struct *vma, unsigned long address, 450 + pte_t *ptep, unsigned int nr) 444 451 { 445 - __update_tlb(vma, address, ptep); 452 + for (;;) { 453 + __update_tlb(vma, address, ptep); 454 + if (--nr == 0) 455 + break; 456 + address += PAGE_SIZE; 457 + ptep++; 458 + } 446 459 } 460 + #define update_mmu_cache(vma, addr, ptep) \ 461 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 447 462 448 463 #define __HAVE_ARCH_UPDATE_MMU_TLB 449 464 #define update_mmu_tlb update_mmu_cache ··· 465 462 466 463 static inline unsigned long pmd_pfn(pmd_t pmd) 467 464 { 468 - return (pmd_val(pmd) & _PFN_MASK) >> _PFN_SHIFT; 465 + return (pmd_val(pmd) & _PFN_MASK) >> PFN_PTE_SHIFT; 469 466 } 470 467 471 468 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+5 -4
arch/loongarch/mm/pgtable.c
··· 11 11 12 12 pgd_t *pgd_alloc(struct mm_struct *mm) 13 13 { 14 - pgd_t *ret, *init; 14 + pgd_t *init, *ret = NULL; 15 + struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); 15 16 16 - ret = (pgd_t *) __get_free_page(GFP_KERNEL); 17 - if (ret) { 17 + if (ptdesc) { 18 + ret = (pgd_t *)ptdesc_address(ptdesc); 18 19 init = pgd_offset(&init_mm, 0UL); 19 20 pgd_init(ret); 20 21 memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD, ··· 108 107 { 109 108 pmd_t pmd; 110 109 111 - pmd_val(pmd) = (page_to_pfn(page) << _PFN_SHIFT) | pgprot_val(prot); 110 + pmd_val(pmd) = (page_to_pfn(page) << PFN_PTE_SHIFT) | pgprot_val(prot); 112 111 113 112 return pmd; 114 113 }
+1 -1
arch/loongarch/mm/tlb.c
··· 252 252 pr_define("_PAGE_WRITE_SHIFT %d\n", _PAGE_WRITE_SHIFT); 253 253 pr_define("_PAGE_NO_READ_SHIFT %d\n", _PAGE_NO_READ_SHIFT); 254 254 pr_define("_PAGE_NO_EXEC_SHIFT %d\n", _PAGE_NO_EXEC_SHIFT); 255 - pr_define("_PFN_SHIFT %d\n", _PFN_SHIFT); 255 + pr_define("PFN_PTE_SHIFT %d\n", PFN_PTE_SHIFT); 256 256 pr_debug("\n"); 257 257 } 258 258
+17 -9
arch/m68k/include/asm/cacheflush_mm.h
··· 220 220 221 221 /* Push the page at kernel virtual address and clear the icache */ 222 222 /* RZ: use cpush %bc instead of cpush %dc, cinv %ic */ 223 - static inline void __flush_page_to_ram(void *vaddr) 223 + static inline void __flush_pages_to_ram(void *vaddr, unsigned int nr) 224 224 { 225 225 if (CPU_IS_COLDFIRE) { 226 226 unsigned long addr, start, end; 227 227 addr = ((unsigned long) vaddr) & ~(PAGE_SIZE - 1); 228 228 start = addr & ICACHE_SET_MASK; 229 - end = (addr + PAGE_SIZE - 1) & ICACHE_SET_MASK; 229 + end = (addr + nr * PAGE_SIZE - 1) & ICACHE_SET_MASK; 230 230 if (start > end) { 231 231 flush_cf_bcache(0, end); 232 232 end = ICACHE_MAX_ADDR; 233 233 } 234 234 flush_cf_bcache(start, end); 235 235 } else if (CPU_IS_040_OR_060) { 236 - __asm__ __volatile__("nop\n\t" 237 - ".chip 68040\n\t" 238 - "cpushp %%bc,(%0)\n\t" 239 - ".chip 68k" 240 - : : "a" (__pa(vaddr))); 236 + unsigned long paddr = __pa(vaddr); 237 + 238 + do { 239 + __asm__ __volatile__("nop\n\t" 240 + ".chip 68040\n\t" 241 + "cpushp %%bc,(%0)\n\t" 242 + ".chip 68k" 243 + : : "a" (paddr)); 244 + paddr += PAGE_SIZE; 245 + } while (--nr); 241 246 } else { 242 247 unsigned long _tmp; 243 248 __asm__ __volatile__("movec %%cacr,%0\n\t" ··· 254 249 } 255 250 256 251 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 257 - #define flush_dcache_page(page) __flush_page_to_ram(page_address(page)) 252 + #define flush_dcache_page(page) __flush_pages_to_ram(page_address(page), 1) 253 + #define flush_dcache_folio(folio) \ 254 + __flush_pages_to_ram(folio_address(folio), folio_nr_pages(folio)) 258 255 #define flush_dcache_mmap_lock(mapping) do { } while (0) 259 256 #define flush_dcache_mmap_unlock(mapping) do { } while (0) 260 - #define flush_icache_page(vma, page) __flush_page_to_ram(page_address(page)) 257 + #define flush_icache_pages(vma, page, nr) \ 258 + __flush_pages_to_ram(page_address(page), nr) 261 259 262 260 extern void flush_icache_user_page(struct vm_area_struct *vma, struct page *page, 263 261 unsigned long addr, int len);
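In the ColdFire branch of `__flush_pages_to_ram()` above, the icache is operated on by set index, so the virtual range is reduced modulo `ICACHE_SET_MASK` and can wrap; a wrapped range becomes two flushes. A userspace sketch of just that range computation, with an assumed 16-bit set mask and 4 KiB pages (the real `ICACHE_SET_MASK` is ColdFire-specific):

```c
/* Userspace sketch of the ColdFire range computation above: the icache
 * is flushed by set index, so the virtual range is reduced modulo the
 * set mask and may wrap, in which case it becomes two flushes. The
 * 16-bit set mask and page size are assumed values, not the real
 * ColdFire ICACHE_SET_MASK. */
#define DEMO_PAGE_SIZE		4096UL
#define DEMO_ICACHE_SET_MASK	0xffffUL
#define DEMO_ICACHE_MAX_ADDR	DEMO_ICACHE_SET_MASK

struct demo_flush {
	unsigned long start[2], end[2];
	int n;			/* number of flush ranges (1 or 2) */
};

static void demo_plan_flush(unsigned long vaddr, unsigned int nr,
			    struct demo_flush *f)
{
	unsigned long addr = vaddr & ~(DEMO_PAGE_SIZE - 1);
	unsigned long start = addr & DEMO_ICACHE_SET_MASK;
	unsigned long end = (addr + nr * DEMO_PAGE_SIZE - 1) &
			    DEMO_ICACHE_SET_MASK;

	f->n = 0;
	if (start > end) {	/* range wrapped the set-index space */
		f->start[f->n] = 0;
		f->end[f->n++] = end;
		end = DEMO_ICACHE_MAX_ADDR;
	}
	f->start[f->n] = start;
	f->end[f->n++] = end;
}
```

Multi-page ranges (`nr > 1`) make the wrap more likely than the old single-page version did, which is why the `start > end` fallback matters in the converted code.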
-2
arch/m68k/include/asm/io_mm.h
··· 26 26 #include <asm/virtconvert.h> 27 27 #include <asm/kmap.h> 28 28 29 - #include <asm-generic/iomap.h> 30 - 31 29 #ifdef CONFIG_ATARI 32 30 #define atari_readb raw_inb 33 31 #define atari_writeb raw_outb
-2
arch/m68k/include/asm/kmap.h
··· 4 4 5 5 #ifdef CONFIG_MMU 6 6 7 - #define ARCH_HAS_IOREMAP_WT 8 - 9 7 /* Values for nocacheflag and cmode */ 10 8 #define IOMAP_FULL_CACHING 0 11 9 #define IOMAP_NOCACHE_SER 1
+24 -23
arch/m68k/include/asm/mcf_pgalloc.h
··· 5 5 #include <asm/tlb.h> 6 6 #include <asm/tlbflush.h> 7 7 8 - extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) 8 + static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) 9 9 { 10 - free_page((unsigned long) pte); 10 + pagetable_free(virt_to_ptdesc(pte)); 11 11 } 12 12 13 13 extern const char bad_pmd_string[]; 14 14 15 - extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) 15 + static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) 16 16 { 17 - unsigned long page = __get_free_page(GFP_DMA); 17 + struct ptdesc *ptdesc = pagetable_alloc((GFP_DMA | __GFP_ZERO) & 18 + ~__GFP_HIGHMEM, 0); 18 19 19 - if (!page) 20 + if (!ptdesc) 20 21 return NULL; 21 22 22 - memset((void *)page, 0, PAGE_SIZE); 23 - return (pte_t *) (page); 23 + return ptdesc_address(ptdesc); 24 24 } 25 25 26 26 extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address) ··· 35 35 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable, 36 36 unsigned long address) 37 37 { 38 - struct page *page = virt_to_page(pgtable); 38 + struct ptdesc *ptdesc = virt_to_ptdesc(pgtable); 39 39 40 - pgtable_pte_page_dtor(page); 41 - __free_page(page); 40 + pagetable_pte_dtor(ptdesc); 41 + pagetable_free(ptdesc); 42 42 } 43 43 44 44 static inline pgtable_t pte_alloc_one(struct mm_struct *mm) 45 45 { 46 - struct page *page = alloc_pages(GFP_DMA, 0); 46 + struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | __GFP_ZERO, 0); 47 47 pte_t *pte; 48 48 49 - if (!page) 49 + if (!ptdesc) 50 50 return NULL; 51 - if (!pgtable_pte_page_ctor(page)) { 52 - __free_page(page); 51 + if (!pagetable_pte_ctor(ptdesc)) { 52 + pagetable_free(ptdesc); 53 53 return NULL; 54 54 } 55 55 56 - pte = page_address(page); 57 - clear_page(pte); 58 - 56 + pte = ptdesc_address(ptdesc); 59 57 return pte; 60 58 } 61 59 62 60 static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable) 63 61 { 64 62 struct ptdesc *ptdesc = virt_to_ptdesc(pgtable); 65 63 66 - pgtable_pte_page_dtor(page); 67 - __free_page(page); 64 + pagetable_pte_dtor(ptdesc); 65 + pagetable_free(ptdesc); 68 66 } 69 67 70 68 /* ··· 73 75 74 76 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 75 77 { 76 - free_page((unsigned long) pgd); 78 + pagetable_free(virt_to_ptdesc(pgd)); 77 79 } 78 80 79 81 static inline pgd_t *pgd_alloc(struct mm_struct *mm) 80 82 { 81 83 pgd_t *new_pgd; 84 + struct ptdesc *ptdesc = pagetable_alloc((GFP_DMA | __GFP_NOWARN) & 85 + ~__GFP_HIGHMEM, 0); 82 86 83 - new_pgd = (pgd_t *)__get_free_page(GFP_DMA | __GFP_NOWARN); 84 - if (!new_pgd) 87 - if (!ptdesc) 85 88 return NULL; 89 + new_pgd = ptdesc_address(ptdesc); 90 + 86 91 memcpy(new_pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t)); 87 92 memset(new_pgd, 0, PAGE_OFFSET >> PGDIR_SHIFT); 88 93 return new_pgd;
+1
arch/m68k/include/asm/mcf_pgtable.h
··· 291 291 return pte; 292 292 } 293 293 294 + #define PFN_PTE_SHIFT PAGE_SHIFT 294 295 #define pmd_pfn(pmd) (pmd_val(pmd) >> PAGE_SHIFT) 295 296 #define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)) 296 297
+1
arch/m68k/include/asm/motorola_pgtable.h
··· 112 112 #define pte_present(pte) (pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROTNONE)) 113 113 #define pte_clear(mm,addr,ptep) ({ pte_val(*(ptep)) = 0; }) 114 114 115 + #define PFN_PTE_SHIFT PAGE_SHIFT 115 116 #define pte_page(pte) virt_to_page(__va(pte_val(pte))) 116 117 #define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT) 117 118 #define pfn_pte(pfn, prot) __pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot))
+6 -4
arch/m68k/include/asm/pgtable_mm.h
··· 31 31 do{ \ 32 32 *(pteptr) = (pteval); \ 33 33 } while(0) 34 - #define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval) 35 - 36 34 37 35 /* PMD_SHIFT determines the size of the area a second-level page table can map */ 38 36 #if CONFIG_PGTABLE_LEVELS == 3 ··· 136 138 * tables contain all the necessary information. The Sun3 does, but 137 139 * they are updated on demand. 138 140 */ 139 - static inline void update_mmu_cache(struct vm_area_struct *vma, 140 - unsigned long address, pte_t *ptep) 141 + static inline void update_mmu_cache_range(struct vm_fault *vmf, 142 + struct vm_area_struct *vma, unsigned long address, 143 + pte_t *ptep, unsigned int nr) 141 144 { 142 145 } 146 + 147 + #define update_mmu_cache(vma, addr, ptep) \ 148 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 143 149 144 150 #endif /* !__ASSEMBLY__ */ 145 151
+4 -4
arch/m68k/include/asm/sun3_pgalloc.h
··· 17 17 18 18 extern const char bad_pmd_string[]; 19 19 20 - #define __pte_free_tlb(tlb,pte,addr) \ 21 - do { \ 22 - pgtable_pte_page_dtor(pte); \ 23 - tlb_remove_page((tlb), pte); \ 20 + #define __pte_free_tlb(tlb, pte, addr) \ 21 + do { \ 22 + pagetable_pte_dtor(page_ptdesc(pte)); \ 23 + tlb_remove_page_ptdesc((tlb), page_ptdesc(pte)); \ 24 24 } while (0) 25 25 26 26 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
+1
arch/m68k/include/asm/sun3_pgtable.h
··· 105 105 pte_val (*ptep) = 0; 106 106 } 107 107 108 + #define PFN_PTE_SHIFT 0 108 109 #define pte_pfn(pte) (pte_val(pte) & SUN3_PAGE_PGNUM_MASK) 109 110 #define pfn_pte(pfn, pgprot) \ 110 111 ({ pte_t __pte; pte_val(__pte) = pfn | pgprot_val(pgprot); __pte; })
+3 -3
arch/m68k/mm/motorola.c
··· 81 81 82 82 void mmu_page_ctor(void *page) 83 83 { 84 - __flush_page_to_ram(page); 84 + __flush_pages_to_ram(page, 1); 85 85 flush_tlb_kernel_page(page); 86 86 nocache_page(page); 87 87 } ··· 161 161 * m68k doesn't have SPLIT_PTE_PTLOCKS for not having 162 162 * SMP. 163 163 */ 164 - pgtable_pte_page_ctor(virt_to_page(page)); 164 + pagetable_pte_ctor(virt_to_ptdesc(page)); 165 165 } 166 166 167 167 mmu_page_ctor(page); ··· 201 201 list_del(dp); 202 202 mmu_page_dtor((void *)page); 203 203 if (type == TABLE_PTE) 204 - pgtable_pte_page_dtor(virt_to_page((void *)page)); 204 + pagetable_pte_dtor(virt_to_ptdesc((void *)page)); 205 205 free_page (page); 206 206 return 1; 207 207 } else if (ptable_list[type].next != dp) {
+8
arch/microblaze/include/asm/cacheflush.h
··· 74 74 flush_dcache_range((unsigned) (addr), (unsigned) (addr) + PAGE_SIZE); \ 75 75 } while (0); 76 76 77 + static inline void flush_dcache_folio(struct folio *folio) 78 + { 79 + unsigned long addr = folio_pfn(folio) << PAGE_SHIFT; 80 + 81 + flush_dcache_range(addr, addr + folio_size(folio)); 82 + } 83 + #define flush_dcache_folio flush_dcache_folio 84 + 77 85 #define flush_cache_page(vma, vmaddr, pfn) \ 78 86 flush_dcache_range(pfn << PAGE_SHIFT, (pfn << PAGE_SHIFT) + PAGE_SIZE); 79 87
+4 -11
arch/microblaze/include/asm/pgtable.h
··· 230 230 231 231 #define pte_page(x) (mem_map + (unsigned long) \ 232 232 ((pte_val(x) - memory_start) >> PAGE_SHIFT)) 233 - #define PFN_SHIFT_OFFSET (PAGE_SHIFT) 233 + #define PFN_PTE_SHIFT PAGE_SHIFT 234 234 235 - #define pte_pfn(x) (pte_val(x) >> PFN_SHIFT_OFFSET) 235 + #define pte_pfn(x) (pte_val(x) >> PFN_PTE_SHIFT) 236 236 237 237 #define pfn_pte(pfn, prot) \ 238 - __pte(((pte_basic_t)(pfn) << PFN_SHIFT_OFFSET) | pgprot_val(prot)) 238 + __pte(((pte_basic_t)(pfn) << PFN_PTE_SHIFT) | pgprot_val(prot)) 239 239 240 240 #ifndef __ASSEMBLY__ 241 241 /* ··· 330 330 /* 331 331 * set_pte stores a linux PTE into the linux page table. 332 332 */ 333 - static inline void set_pte(struct mm_struct *mm, unsigned long addr, 334 - pte_t *ptep, pte_t pte) 335 - { 336 - *ptep = pte; 337 - } 338 - 339 - static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, 340 - pte_t *ptep, pte_t pte) 333 + static inline void set_pte(pte_t *ptep, pte_t pte) 341 334 { 342 335 *ptep = pte; 343 336 }
+3 -1
arch/microblaze/include/asm/tlbflush.h
··· 33 33 34 34 #define flush_tlb_kernel_range(start, end) do { } while (0) 35 35 36 - #define update_mmu_cache(vma, addr, ptep) do { } while (0) 36 + #define update_mmu_cache_range(vmf, vma, addr, ptep, nr) do { } while (0) 37 + #define update_mmu_cache(vma, addr, pte) \ 38 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 37 39 38 40 #define flush_tlb_all local_flush_tlb_all 39 41 #define flush_tlb_mm local_flush_tlb_mm
+1 -1
arch/mips/bcm47xx/prom.c
···
 #if defined(CONFIG_BCM47XX_BCMA) && defined(CONFIG_HIGHMEM)
 
 #define EXTVBASE	0xc0000000
-#define ENTRYLO(x)	((pte_val(pfn_pte((x) >> _PFN_SHIFT, PAGE_KERNEL_UNCACHED)) >> 6) | 1)
+#define ENTRYLO(x)	((pte_val(pfn_pte((x) >> PFN_PTE_SHIFT, PAGE_KERNEL_UNCACHED)) >> 6) | 1)
 
 #include <asm/tlbflush.h>
+18 -14
arch/mips/include/asm/cacheflush.h
···
  */
 #define PG_dcache_dirty			PG_arch_1
 
-#define Page_dcache_dirty(page)		\
-	test_bit(PG_dcache_dirty, &(page)->flags)
-#define SetPageDcacheDirty(page)	\
-	set_bit(PG_dcache_dirty, &(page)->flags)
-#define ClearPageDcacheDirty(page)	\
-	clear_bit(PG_dcache_dirty, &(page)->flags)
+#define folio_test_dcache_dirty(folio)	\
+	test_bit(PG_dcache_dirty, &(folio)->flags)
+#define folio_set_dcache_dirty(folio)	\
+	set_bit(PG_dcache_dirty, &(folio)->flags)
+#define folio_clear_dcache_dirty(folio)	\
+	clear_bit(PG_dcache_dirty, &(folio)->flags)
 
 extern void (*flush_cache_all)(void);
 extern void (*__flush_cache_all)(void);
···
 extern void (*flush_cache_range)(struct vm_area_struct *vma,
 	unsigned long start, unsigned long end);
 extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn);
-extern void __flush_dcache_page(struct page *page);
+extern void __flush_dcache_pages(struct page *page, unsigned int nr);
 
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
+static inline void flush_dcache_folio(struct folio *folio)
+{
+	if (cpu_has_dc_aliases)
+		__flush_dcache_pages(&folio->page, folio_nr_pages(folio));
+	else if (!cpu_has_ic_fills_f_dc)
+		folio_set_dcache_dirty(folio);
+}
+#define flush_dcache_folio flush_dcache_folio
+
 static inline void flush_dcache_page(struct page *page)
 {
 	if (cpu_has_dc_aliases)
-		__flush_dcache_page(page);
+		__flush_dcache_pages(page, 1);
 	else if (!cpu_has_ic_fills_f_dc)
-		SetPageDcacheDirty(page);
+		folio_set_dcache_dirty(page_folio(page));
 }
 
 #define flush_dcache_mmap_lock(mapping)		do { } while (0)
···
 {
 	if (cpu_has_dc_aliases && PageAnon(page))
 		__flush_anon_page(page, vmaddr);
-}
-
-static inline void flush_icache_page(struct vm_area_struct *vma,
-	struct page *page)
-{
 }
 
 extern void (*flush_icache_range)(unsigned long start, unsigned long end);
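The MIPS hunk above moves the arch-private dcache-dirty state from the page to the folio, still aliased to `PG_arch_1` in `folio->flags`. A minimal userspace sketch of the three helpers' semantics (the `model_` names and the bit position are illustrative, not the kernel's; the real helpers use the atomic `test_bit`/`set_bit`/`clear_bit`):

```c
#include <assert.h>

/* Illustrative bit position; the kernel aliases PG_dcache_dirty to PG_arch_1. */
#define MODEL_PG_DCACHE_DIRTY	1UL

struct model_folio {
	unsigned long flags;	/* per-folio flag word, as in struct folio */
};

static int model_test_dcache_dirty(const struct model_folio *folio)
{
	return !!(folio->flags & (1UL << MODEL_PG_DCACHE_DIRTY));
}

static void model_set_dcache_dirty(struct model_folio *folio)
{
	folio->flags |= 1UL << MODEL_PG_DCACHE_DIRTY;
}

static void model_clear_dcache_dirty(struct model_folio *folio)
{
	folio->flags &= ~(1UL << MODEL_PG_DCACHE_DIRTY);
}
```

The point of the conversion is that the dirty bit is now tracked once per folio rather than once per page, so a large folio is flushed (or marked) as a unit.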
+2 -3
arch/mips/include/asm/io.h
···
 #ifndef _ASM_IO_H
 #define _ASM_IO_H
 
-#define ARCH_HAS_IOREMAP_WC
-
 #include <linux/compiler.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
···
 #include <asm/byteorder.h>
 #include <asm/cpu.h>
 #include <asm/cpu-features.h>
-#include <asm-generic/iomap.h>
 #include <asm/page.h>
 #include <asm/pgtable-bits.h>
 #include <asm/processor.h>
···
  */
 #define ioremap_wc(offset, size)	\
 	ioremap_prot((offset), (size), boot_cpu_data.writecombine)
+
+#include <asm-generic/iomap.h>
 
 #if defined(CONFIG_CPU_CAVIUM_OCTEON)
 #define war_io_reorder_wmb()		wmb()
+18 -14
arch/mips/include/asm/pgalloc.h
···
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
-	free_pages((unsigned long)pgd, PGD_TABLE_ORDER);
+	pagetable_free(virt_to_ptdesc(pgd));
 }
 
-#define __pte_free_tlb(tlb,pte,address)			\
-do {							\
-	pgtable_pte_page_dtor(pte);			\
-	tlb_remove_page((tlb), pte);			\
+#define __pte_free_tlb(tlb, pte, address)			\
+do {								\
+	pagetable_pte_dtor(page_ptdesc(pte));			\
+	tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));	\
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
···
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
-	struct page *pg;
+	struct ptdesc *ptdesc;
 
-	pg = alloc_pages(GFP_KERNEL_ACCOUNT, PMD_TABLE_ORDER);
-	if (!pg)
+	ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, PMD_TABLE_ORDER);
+	if (!ptdesc)
 		return NULL;
 
-	if (!pgtable_pmd_page_ctor(pg)) {
-		__free_pages(pg, PMD_TABLE_ORDER);
+	if (!pagetable_pmd_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
 		return NULL;
 	}
 
-	pmd = (pmd_t *)page_address(pg);
+	pmd = ptdesc_address(ptdesc);
 	pmd_init(pmd);
 	return pmd;
 }
···
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pud_t *pud;
+	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM,
+			PUD_TABLE_ORDER);
 
-	pud = (pud_t *) __get_free_pages(GFP_KERNEL, PUD_TABLE_ORDER);
-	if (pud)
-		pud_init(pud);
+	if (!ptdesc)
+		return NULL;
+	pud = ptdesc_address(ptdesc);
+
+	pud_init(pud);
 	return pud;
 }
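The `pmd_alloc_one()` rewrite above follows the fixed ptdesc pattern: allocate a ptdesc, run the constructor, and free the ptdesc (not raw pages) when the constructor fails. A hedged userspace model of that control flow, with stub allocator and a `ctor_ok` flag standing in for `pagetable_pmd_ctor()` (all `model_` names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for struct ptdesc: a descriptor owning the table memory. */
struct model_ptdesc {
	void *table;
};

/* Stand-in for pagetable_alloc(): descriptor plus 2^order pages of 4 KiB. */
static struct model_ptdesc *model_pagetable_alloc(unsigned int order)
{
	struct model_ptdesc *ptdesc = malloc(sizeof(*ptdesc));

	if (ptdesc)
		ptdesc->table = calloc(1UL << order, 4096);
	return ptdesc;
}

/* Stand-in for pagetable_free(): releases descriptor and table together. */
static void model_pagetable_free(struct model_ptdesc *ptdesc)
{
	free(ptdesc->table);
	free(ptdesc);
}

/* The pmd_alloc_one() shape: alloc -> ctor -> free-on-ctor-failure. */
static void *model_pmd_alloc_one(int ctor_ok)
{
	struct model_ptdesc *ptdesc = model_pagetable_alloc(0);

	if (!ptdesc)
		return NULL;
	if (!ctor_ok) {			/* pagetable_pmd_ctor() failed */
		model_pagetable_free(ptdesc);
		return NULL;
	}
	return ptdesc->table;		/* ptdesc_address() */
}
```

The design point of the series is that the error path frees through the same descriptor API that allocated, so bookkeeping attached to the ptdesc cannot be leaked.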
+5 -5
arch/mips/include/asm/pgtable-32.h
···
 #if defined(CONFIG_XPA)
 
 #define MAX_POSSIBLE_PHYSMEM_BITS 40
-#define pte_pfn(x)		(((unsigned long)((x).pte_high >> _PFN_SHIFT)) | (unsigned long)((x).pte_low << _PAGE_PRESENT_SHIFT))
+#define pte_pfn(x)		(((unsigned long)((x).pte_high >> PFN_PTE_SHIFT)) | (unsigned long)((x).pte_low << _PAGE_PRESENT_SHIFT))
 static inline pte_t
 pfn_pte(unsigned long pfn, pgprot_t prot)
 {
···
 
 	pte.pte_low = (pfn >> _PAGE_PRESENT_SHIFT) |
 				(pgprot_val(prot) & ~_PFNX_MASK);
-	pte.pte_high = (pfn << _PFN_SHIFT) |
+	pte.pte_high = (pfn << PFN_PTE_SHIFT) |
 				(pgprot_val(prot) & ~_PFN_MASK);
 	return pte;
 }
···
 #else
 
 #define MAX_POSSIBLE_PHYSMEM_BITS 32
-#define pte_pfn(x)		((unsigned long)((x).pte >> _PFN_SHIFT))
-#define pfn_pte(pfn, prot)	__pte(((unsigned long long)(pfn) << _PFN_SHIFT) | pgprot_val(prot))
-#define pfn_pmd(pfn, prot)	__pmd(((unsigned long long)(pfn) << _PFN_SHIFT) | pgprot_val(prot))
+#define pte_pfn(x)		((unsigned long)((x).pte >> PFN_PTE_SHIFT))
+#define pfn_pte(pfn, prot)	__pte(((unsigned long long)(pfn) << PFN_PTE_SHIFT) | pgprot_val(prot))
+#define pfn_pmd(pfn, prot)	__pmd(((unsigned long long)(pfn) << PFN_PTE_SHIFT) | pgprot_val(prot))
 #endif /* defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32) */
 
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
+3 -3
arch/mips/include/asm/pgtable-64.h
···
 
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 
-#define pte_pfn(x)		((unsigned long)((x).pte >> _PFN_SHIFT))
-#define pfn_pte(pfn, prot)	__pte(((pfn) << _PFN_SHIFT) | pgprot_val(prot))
-#define pfn_pmd(pfn, prot)	__pmd(((pfn) << _PFN_SHIFT) | pgprot_val(prot))
+#define pte_pfn(x)		((unsigned long)((x).pte >> PFN_PTE_SHIFT))
+#define pfn_pte(pfn, prot)	__pte(((pfn) << PFN_PTE_SHIFT) | pgprot_val(prot))
+#define pfn_pmd(pfn, prot)	__pmd(((pfn) << PFN_PTE_SHIFT) | pgprot_val(prot))
 
 #ifndef __PAGETABLE_PMD_FOLDED
 static inline pmd_t *pud_pgtable(pud_t pud)
+3 -3
arch/mips/include/asm/pgtable-bits.h
···
 #if defined(CONFIG_CPU_R3K_TLB)
 # define _CACHE_UNCACHED	(1 << _CACHE_UNCACHED_SHIFT)
 # define _CACHE_MASK		_CACHE_UNCACHED
-# define _PFN_SHIFT		PAGE_SHIFT
+# define PFN_PTE_SHIFT		PAGE_SHIFT
 #else
 # define _CACHE_MASK		(7 << _CACHE_SHIFT)
-# define _PFN_SHIFT		(PAGE_SHIFT - 12 + _CACHE_SHIFT + 3)
+# define PFN_PTE_SHIFT		(PAGE_SHIFT - 12 + _CACHE_SHIFT + 3)
 #endif
 
 #ifndef _PAGE_NO_EXEC
···
 #define _PAGE_SILENT_READ	_PAGE_VALID
 #define _PAGE_SILENT_WRITE	_PAGE_DIRTY
 
-#define _PFN_MASK		(~((1 << (_PFN_SHIFT)) - 1))
+#define _PFN_MASK		(~((1 << (PFN_PTE_SHIFT)) - 1))
 
 /*
  * The final layouts of the PTE bits are:
+40 -21
arch/mips/include/asm/pgtable.h
···
 
 static inline unsigned long pmd_pfn(pmd_t pmd)
 {
-	return pmd_val(pmd) >> _PFN_SHIFT;
+	return pmd_val(pmd) >> PFN_PTE_SHIFT;
 }
 
 #ifndef CONFIG_MIPS_HUGE_TLB_SUPPORT
···
 		local_irq_restore(__flags);				\
 	}								\
 } while(0)
-
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
-	pte_t *ptep, pte_t pteval);
 
 #if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
 
···
 		null.pte_low = null.pte_high = _PAGE_GLOBAL;
 	}
 
-	set_pte_at(mm, addr, ptep, null);
+	set_pte(ptep, null);
 	htw_start();
 }
 #else
···
 #if !defined(CONFIG_CPU_R3K_TLB)
 	/* Preserve global status for the pair */
 	if (pte_val(*ptep_buddy(ptep)) & _PAGE_GLOBAL)
-		set_pte_at(mm, addr, ptep, __pte(_PAGE_GLOBAL));
+		set_pte(ptep, __pte(_PAGE_GLOBAL));
 	else
 #endif
-		set_pte_at(mm, addr, ptep, __pte(0));
+		set_pte(ptep, __pte(0));
 	htw_start();
 }
 #endif
 
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
-	pte_t *ptep, pte_t pteval)
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+		pte_t *ptep, pte_t pte, unsigned int nr)
 {
+	unsigned int i;
+	bool do_sync = false;
 
-	if (!pte_present(pteval))
-		goto cache_sync_done;
+	for (i = 0; i < nr; i++) {
+		if (!pte_present(pte))
+			continue;
+		if (pte_present(ptep[i]) &&
+		    (pte_pfn(ptep[i]) == pte_pfn(pte)))
+			continue;
+		do_sync = true;
+	}
 
-	if (pte_present(*ptep) && (pte_pfn(*ptep) == pte_pfn(pteval)))
-		goto cache_sync_done;
+	if (do_sync)
+		__update_cache(addr, pte);
 
-	__update_cache(addr, pteval);
-cache_sync_done:
-	set_pte(ptep, pteval);
+	for (;;) {
+		set_pte(ptep, pte);
+		if (--nr == 0)
+			break;
+		ptep++;
+		pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+	}
 }
+#define set_ptes set_ptes
 
 /*
  * (pmds are folded into puds so this doesn't get actually called,
···
 					pte_t entry, int dirty)
 {
 	if (!pte_same(*ptep, entry))
-		set_pte_at(vma->vm_mm, address, ptep, entry);
+		set_pte(ptep, entry);
 	/*
 	 * update_mmu_cache will unconditionally execute, handling both
 	 * the case that the PTE changed and the spurious fault case.
···
 extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
 	pte_t pte);
 
-static inline void update_mmu_cache(struct vm_area_struct *vma,
-	unsigned long address, pte_t *ptep)
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+		struct vm_area_struct *vma, unsigned long address,
+		pte_t *ptep, unsigned int nr)
 {
-	pte_t pte = *ptep;
-	__update_tlb(vma, address, pte);
+	for (;;) {
+		pte_t pte = *ptep;
+		__update_tlb(vma, address, pte);
+		if (--nr == 0)
+			break;
+		ptep++;
+		address += PAGE_SIZE;
+	}
 }
+#define update_mmu_cache(vma, address, ptep) \
+	update_mmu_cache_range(NULL, vma, address, ptep, 1)
 
 #define	__HAVE_ARCH_UPDATE_MMU_TLB
 #define update_mmu_tlb	update_mmu_cache
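The new `set_ptes()` above installs `nr` consecutive PTEs for adjacent pages of one folio: because the pfn sits at bit `PFN_PTE_SHIFT` in a MIPS PTE, stepping to the next page is a single add of `1 << PFN_PTE_SHIFT`. A userspace model of just that loop (the shift value and `model_` names are illustrative, not any real configuration's):

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PFN_PTE_SHIFT 12	/* illustrative; the real shift is per-config */

typedef uint64_t model_pte_t;

/* Model of the set_ptes() store loop: write nr PTEs, bumping the
 * encoded pfn by one page each iteration, low protection bits unchanged. */
static void model_set_ptes(model_pte_t *ptep, model_pte_t pte, unsigned int nr)
{
	for (;;) {
		*ptep = pte;
		if (--nr == 0)
			break;
		ptep++;
		pte += 1UL << MODEL_PFN_PTE_SHIFT;	/* next pfn, same prot */
	}
}
```

This is why the cache-sync check was hoisted out of the per-PTE path: the sync decision is made once over the whole range, then the stores run as a tight loop.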
+3 -2
arch/mips/mm/c-r4k.c
···
 	if ((mm == current->active_mm) && (pte_val(*ptep) & _PAGE_VALID))
 		vaddr = NULL;
 	else {
+		struct folio *folio = page_folio(page);
 		/*
 		 * Use kmap_coherent or kmap_atomic to do flushes for
 		 * another ASID than the current one.
 		 */
 		map_coherent = (cpu_has_dc_aliases &&
-				page_mapcount(page) &&
-				!Page_dcache_dirty(page));
+				folio_mapped(folio) &&
+				!folio_test_dcache_dirty(folio));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, addr);
 		else
+27 -27
arch/mips/mm/cache.c
···
 	return 0;
 }
 
-void __flush_dcache_page(struct page *page)
+void __flush_dcache_pages(struct page *page, unsigned int nr)
 {
-	struct address_space *mapping = page_mapping_file(page);
+	struct folio *folio = page_folio(page);
+	struct address_space *mapping = folio_flush_mapping(folio);
 	unsigned long addr;
+	unsigned int i;
 
 	if (mapping && !mapping_mapped(mapping)) {
-		SetPageDcacheDirty(page);
+		folio_set_dcache_dirty(folio);
 		return;
 	}
 
···
 	 * case is for exec env/arg pages and those are %99 certainly going to
 	 * get faulted into the tlb (and thus flushed) anyways.
 	 */
-	if (PageHighMem(page))
-		addr = (unsigned long)kmap_atomic(page);
-	else
-		addr = (unsigned long)page_address(page);
-
-	flush_data_cache_page(addr);
-
-	if (PageHighMem(page))
-		kunmap_atomic((void *)addr);
+	for (i = 0; i < nr; i++) {
+		addr = (unsigned long)kmap_local_page(page + i);
+		flush_data_cache_page(addr);
+		kunmap_local((void *)addr);
+	}
 }
-
-EXPORT_SYMBOL(__flush_dcache_page);
+EXPORT_SYMBOL(__flush_dcache_pages);
 
 void __flush_anon_page(struct page *page, unsigned long vmaddr)
 {
 	unsigned long addr = (unsigned long) page_address(page);
+	struct folio *folio = page_folio(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (page_mapcount(page) && !Page_dcache_dirty(page)) {
+		if (folio_mapped(folio) && !folio_test_dcache_dirty(folio)) {
 			void *kaddr;
 
 			kaddr = kmap_coherent(page, vmaddr);
···
 
 void __update_cache(unsigned long address, pte_t pte)
 {
-	struct page *page;
+	struct folio *folio;
 	unsigned long pfn, addr;
 	int exec = !pte_no_exec(pte) && !cpu_has_ic_fills_f_dc;
+	unsigned int i;
 
 	pfn = pte_pfn(pte);
 	if (unlikely(!pfn_valid(pfn)))
 		return;
-	page = pfn_to_page(pfn);
-	if (Page_dcache_dirty(page)) {
-		if (PageHighMem(page))
-			addr = (unsigned long)kmap_atomic(page);
-		else
-			addr = (unsigned long)page_address(page);
 
-		if (exec || pages_do_alias(addr, address & PAGE_MASK))
-			flush_data_cache_page(addr);
+	folio = page_folio(pfn_to_page(pfn));
+	address &= PAGE_MASK;
+	address -= offset_in_folio(folio, pfn << PAGE_SHIFT);
 
-		if (PageHighMem(page))
-			kunmap_atomic((void *)addr);
+	if (folio_test_dcache_dirty(folio)) {
+		for (i = 0; i < folio_nr_pages(folio); i++) {
+			addr = (unsigned long)kmap_local_folio(folio, i);
 
-		ClearPageDcacheDirty(page);
+			if (exec || pages_do_alias(addr, address))
+				flush_data_cache_page(addr);
+			kunmap_local((void *)addr);
+			address += PAGE_SIZE;
+		}
+		folio_clear_dcache_dirty(folio);
 	}
 }
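The reworked `__update_cache()` above first rewinds the faulting address to the start of the folio: mask to a page boundary, then subtract the byte offset of the faulting pfn within the folio, so the flush loop can walk every page with matching virtual addresses. The arithmetic, modelled in userspace (the 4 KiB `PAGE_SHIFT` and `model_` names are illustrative):

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)
#define MODEL_PAGE_MASK		(~(MODEL_PAGE_SIZE - 1))

/* Byte offset of pfn within a folio whose first pfn is folio_pfn,
 * i.e. what offset_in_folio(folio, pfn << PAGE_SHIFT) yields. */
static unsigned long model_offset_in_folio(unsigned long folio_pfn,
					   unsigned long pfn)
{
	return (pfn - folio_pfn) << MODEL_PAGE_SHIFT;
}

/* Virtual address of the folio's first page, given a fault address
 * that landed somewhere inside the folio. */
static unsigned long model_folio_start(unsigned long address,
				       unsigned long folio_pfn,
				       unsigned long pfn)
{
	address &= MODEL_PAGE_MASK;
	address -= model_offset_in_folio(folio_pfn, pfn);
	return address;
}
```

With the start address in hand, the kernel loop then advances `address` by `PAGE_SIZE` per page while deciding whether each page aliases and needs a flush.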
+13 -8
arch/mips/mm/init.c
···
 	pte_t pte;
 	int tlbidx;
 
-	BUG_ON(Page_dcache_dirty(page));
+	BUG_ON(folio_test_dcache_dirty(page_folio(page)));
 
 	preempt_disable();
 	pagefault_disable();
···
 void copy_user_highpage(struct page *to, struct page *from,
 	unsigned long vaddr, struct vm_area_struct *vma)
 {
+	struct folio *src = page_folio(from);
 	void *vfrom, *vto;
 
 	vto = kmap_atomic(to);
 	if (cpu_has_dc_aliases &&
-	    page_mapcount(from) && !Page_dcache_dirty(from)) {
+	    folio_mapped(src) && !folio_test_dcache_dirty(src)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
 		kunmap_coherent();
···
 	struct page *page, unsigned long vaddr, void *dst, const void *src,
 	unsigned long len)
 {
+	struct folio *folio = page_folio(page);
+
 	if (cpu_has_dc_aliases &&
-	    page_mapcount(page) && !Page_dcache_dirty(page)) {
+	    folio_mapped(folio) && !folio_test_dcache_dirty(folio)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
 		kunmap_coherent();
 	} else {
 		memcpy(dst, src, len);
 		if (cpu_has_dc_aliases)
-			SetPageDcacheDirty(page);
+			folio_set_dcache_dirty(folio);
 	}
 	if (vma->vm_flags & VM_EXEC)
 		flush_cache_page(vma, vaddr, page_to_pfn(page));
···
 	struct page *page, unsigned long vaddr, void *dst, const void *src,
 	unsigned long len)
 {
+	struct folio *folio = page_folio(page);
+
 	if (cpu_has_dc_aliases &&
-	    page_mapcount(page) && !Page_dcache_dirty(page)) {
+	    folio_mapped(folio) && !folio_test_dcache_dirty(folio)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
 		kunmap_coherent();
 	} else {
 		memcpy(dst, src, len);
 		if (cpu_has_dc_aliases)
-			SetPageDcacheDirty(page);
+			folio_set_dcache_dirty(folio);
 	}
 }
 EXPORT_SYMBOL_GPL(copy_from_user_page);
···
 void __init mem_init(void)
 {
 	/*
-	 * When _PFN_SHIFT is greater than PAGE_SHIFT we won't have enough PTE
+	 * When PFN_PTE_SHIFT is greater than PAGE_SHIFT we won't have enough PTE
 	 * bits to hold a full 32b physical address on MIPS32 systems.
 	 */
-	BUILD_BUG_ON(IS_ENABLED(CONFIG_32BIT) && (_PFN_SHIFT > PAGE_SHIFT));
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_32BIT) && (PFN_PTE_SHIFT > PAGE_SHIFT));
 
 #ifdef CONFIG_HIGHMEM
 	max_mapnr = highend_pfn ? highend_pfn : max_low_pfn;
+1 -1
arch/mips/mm/pgtable-32.c
···
 {
 	pmd_t pmd;
 
-	pmd_val(pmd) = (page_to_pfn(page) << _PFN_SHIFT) | pgprot_val(prot);
+	pmd_val(pmd) = (page_to_pfn(page) << PFN_PTE_SHIFT) | pgprot_val(prot);
 
 	return pmd;
 }
+1 -1
arch/mips/mm/pgtable-64.c
···
 {
 	pmd_t pmd;
 
-	pmd_val(pmd) = (page_to_pfn(page) << _PFN_SHIFT) | pgprot_val(prot);
+	pmd_val(pmd) = (page_to_pfn(page) << PFN_PTE_SHIFT) | pgprot_val(prot);
 
 	return pmd;
 }
+5 -3
arch/mips/mm/pgtable.c
···
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *ret, *init;
+	pgd_t *init, *ret = NULL;
+	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM,
+			PGD_TABLE_ORDER);
 
-	ret = (pgd_t *) __get_free_pages(GFP_KERNEL, PGD_TABLE_ORDER);
-	if (ret) {
+	if (ptdesc) {
+		ret = ptdesc_address(ptdesc);
 		init = pgd_offset(&init_mm, 0UL);
 		pgd_init(ret);
 		memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
+1 -1
arch/mips/mm/tlbex.c
···
 	pr_define("_PAGE_GLOBAL_SHIFT %d\n", _PAGE_GLOBAL_SHIFT);
 	pr_define("_PAGE_VALID_SHIFT %d\n", _PAGE_VALID_SHIFT);
 	pr_define("_PAGE_DIRTY_SHIFT %d\n", _PAGE_DIRTY_SHIFT);
-	pr_define("_PFN_SHIFT %d\n", _PFN_SHIFT);
+	pr_define("PFN_PTE_SHIFT %d\n", PFN_PTE_SHIFT);
 	pr_debug("\n");
 }
+9 -1
arch/nios2/include/asm/cacheflush.h
···
 			    unsigned long pfn);
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 void flush_dcache_page(struct page *page);
+void flush_dcache_folio(struct folio *folio);
+#define flush_dcache_folio flush_dcache_folio
 
 extern void flush_icache_range(unsigned long start, unsigned long end);
-extern void flush_icache_page(struct vm_area_struct *vma, struct page *page);
+void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
+		unsigned int nr);
+#define flush_icache_pages flush_icache_pages
 
 #define flush_cache_vmap(start, end)		flush_dcache_range(start, end)
 #define flush_cache_vunmap(start, end)		flush_dcache_range(start, end)
···
 
 #define flush_dcache_mmap_lock(mapping)		xa_lock_irq(&mapping->i_pages)
 #define flush_dcache_mmap_unlock(mapping)	xa_unlock_irq(&mapping->i_pages)
+#define flush_dcache_mmap_lock_irqsave(mapping, flags)		\
+	xa_lock_irqsave(&mapping->i_pages, flags)
+#define flush_dcache_mmap_unlock_irqrestore(mapping, flags)	\
+	xa_unlock_irqrestore(&mapping->i_pages, flags)
 
 #endif /* _ASM_NIOS2_CACHEFLUSH_H */
+4 -4
arch/nios2/include/asm/pgalloc.h
···
 
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr)				\
-	do {							\
-		pgtable_pte_page_dtor(pte);			\
-		tlb_remove_page((tlb), (pte));			\
+#define __pte_free_tlb(tlb, pte, addr)					\
+	do {								\
+		pagetable_pte_dtor(page_ptdesc(pte));			\
+		tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));	\
 	} while (0)
 
 #endif /* _ASM_NIOS2_PGALLOC_H */
+18 -8
arch/nios2/include/asm/pgtable.h
···
 	*ptep = pteval;
 }
 
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
-			      pte_t *ptep, pte_t pteval)
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+		pte_t *ptep, pte_t pte, unsigned int nr)
 {
-	unsigned long paddr = (unsigned long)page_to_virt(pte_page(pteval));
+	unsigned long paddr = (unsigned long)page_to_virt(pte_page(pte));
 
-	flush_dcache_range(paddr, paddr + PAGE_SIZE);
-	set_pte(ptep, pteval);
+	flush_dcache_range(paddr, paddr + nr * PAGE_SIZE);
+	for (;;) {
+		set_pte(ptep, pte);
+		if (--nr == 0)
+			break;
+		ptep++;
+		pte_val(pte) += 1;
+	}
 }
+#define set_ptes set_ptes
 
 static inline int pmd_none(pmd_t pmd)
 {
···
 
 	pte_val(null) = (addr >> PAGE_SHIFT) & 0xf;
 
-	set_pte_at(mm, addr, ptep, null);
+	set_pte(ptep, null);
 }
 
 /*
···
 extern void __init paging_init(void);
 extern void __init mmu_init(void);
 
-extern void update_mmu_cache(struct vm_area_struct *vma,
-			     unsigned long address, pte_t *pte);
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long address, pte_t *ptep, unsigned int nr);
+
+#define update_mmu_cache(vma, addr, ptep) \
+	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 
 #endif /* _ASM_NIOS2_PGTABLE_H */
+46 -38
arch/nios2/mm/cacheflush.c
···
 	__asm__ __volatile(" flushp\n");
 }
 
-static void flush_aliases(struct address_space *mapping, struct page *page)
+static void flush_aliases(struct address_space *mapping, struct folio *folio)
 {
 	struct mm_struct *mm = current->active_mm;
-	struct vm_area_struct *mpnt;
+	struct vm_area_struct *vma;
+	unsigned long flags;
 	pgoff_t pgoff;
+	unsigned long nr = folio_nr_pages(folio);
 
-	pgoff = page->index;
+	pgoff = folio->index;
 
-	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
-		unsigned long offset;
+	flush_dcache_mmap_lock_irqsave(mapping, flags);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + nr - 1) {
+		unsigned long start;
 
-		if (mpnt->vm_mm != mm)
+		if (vma->vm_mm != mm)
 			continue;
-		if (!(mpnt->vm_flags & VM_MAYSHARE))
+		if (!(vma->vm_flags & VM_MAYSHARE))
 			continue;
 
-		offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
-		flush_cache_page(mpnt, mpnt->vm_start + offset,
-			page_to_pfn(page));
+		start = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+		flush_cache_range(vma, start, start + nr * PAGE_SIZE);
 	}
-	flush_dcache_mmap_unlock(mapping);
+	flush_dcache_mmap_unlock_irqrestore(mapping, flags);
 }
 
 void flush_cache_all(void)
···
 	__flush_icache(start, end);
 }
 
-void flush_icache_page(struct vm_area_struct *vma, struct page *page)
+void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
+		unsigned int nr)
 {
 	unsigned long start = (unsigned long) page_address(page);
-	unsigned long end = start + PAGE_SIZE;
+	unsigned long end = start + nr * PAGE_SIZE;
 
 	__flush_dcache(start, end);
 	__flush_icache(start, end);
···
 	__flush_icache(start, end);
 }
 
-void __flush_dcache_page(struct address_space *mapping, struct page *page)
+static void __flush_dcache_folio(struct folio *folio)
 {
 	/*
 	 * Writeback any data associated with the kernel mapping of this
 	 * page.  This ensures that data in the physical page is mutually
 	 * coherent with the kernels mapping.
 	 */
-	unsigned long start = (unsigned long)page_address(page);
+	unsigned long start = (unsigned long)folio_address(folio);
 
-	__flush_dcache(start, start + PAGE_SIZE);
+	__flush_dcache(start, start + folio_size(folio));
 }
 
-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
 {
 	struct address_space *mapping;
 
···
 	 * The zero page is never written to, so never has any dirty
 	 * cache lines, and therefore never needs to be flushed.
 	 */
-	if (page == ZERO_PAGE(0))
+	if (is_zero_pfn(folio_pfn(folio)))
 		return;
 
-	mapping = page_mapping_file(page);
+	mapping = folio_flush_mapping(folio);
 
 	/* Flush this page if there are aliases. */
 	if (mapping && !mapping_mapped(mapping)) {
-		clear_bit(PG_dcache_clean, &page->flags);
+		clear_bit(PG_dcache_clean, &folio->flags);
 	} else {
-		__flush_dcache_page(mapping, page);
+		__flush_dcache_folio(folio);
 		if (mapping) {
-			unsigned long start = (unsigned long)page_address(page);
-			flush_aliases(mapping, page);
-			flush_icache_range(start, start + PAGE_SIZE);
+			unsigned long start = (unsigned long)folio_address(folio);
+			flush_aliases(mapping, folio);
+			flush_icache_range(start, start + folio_size(folio));
 		}
-		set_bit(PG_dcache_clean, &page->flags);
+		set_bit(PG_dcache_clean, &folio->flags);
 	}
+}
+EXPORT_SYMBOL(flush_dcache_folio);
+
+void flush_dcache_page(struct page *page)
+{
+	flush_dcache_folio(page_folio(page));
 }
 EXPORT_SYMBOL(flush_dcache_page);
 
-void update_mmu_cache(struct vm_area_struct *vma,
-		      unsigned long address, pte_t *ptep)
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long address, pte_t *ptep, unsigned int nr)
 {
 	pte_t pte = *ptep;
 	unsigned long pfn = pte_pfn(pte);
-	struct page *page;
+	struct folio *folio;
 	struct address_space *mapping;
 
 	reload_tlb_page(vma, address, pte);
···
 	 * The zero page is never written to, so never has any dirty
 	 * cache lines, and therefore never needs to be flushed.
 	 */
-	page = pfn_to_page(pfn);
-	if (page == ZERO_PAGE(0))
+	if (is_zero_pfn(pfn))
 		return;
 
-	mapping = page_mapping_file(page);
-	if (!test_and_set_bit(PG_dcache_clean, &page->flags))
-		__flush_dcache_page(mapping, page);
+	folio = page_folio(pfn_to_page(pfn));
+	if (!test_and_set_bit(PG_dcache_clean, &folio->flags))
+		__flush_dcache_folio(folio);
 
-	if(mapping)
-	{
-		flush_aliases(mapping, page);
+	mapping = folio_flush_mapping(folio);
+	if (mapping) {
+		flush_aliases(mapping, folio);
 		if (vma->vm_flags & VM_EXEC)
-			flush_icache_page(vma, page);
+			flush_icache_pages(vma, &folio->page,
+					folio_nr_pages(folio));
 	}
 }
+1
arch/openrisc/Kconfig
···
 	select GENERIC_IRQ_PROBE
 	select GENERIC_IRQ_SHOW
 	select GENERIC_PCI_IOMAP
+	select GENERIC_IOREMAP
 	select GENERIC_CPU_DEVICES
 	select HAVE_PCI
 	select HAVE_UID16
+7 -1
arch/openrisc/include/asm/cacheflush.h
···
  */
 #define PG_dc_clean	PG_arch_1
 
+static inline void flush_dcache_folio(struct folio *folio)
+{
+	clear_bit(PG_dc_clean, &folio->flags);
+}
+#define flush_dcache_folio flush_dcache_folio
+
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 static inline void flush_dcache_page(struct page *page)
 {
-	clear_bit(PG_dc_clean, &page->flags);
+	flush_dcache_folio(page_folio(page));
 }
 
 #define flush_icache_user_page(vma, page, addr, len)	\
+6 -5
arch/openrisc/include/asm/io.h
···
 #define __ASM_OPENRISC_IO_H
 
 #include <linux/types.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
 
 /*
  * PCI: We do not use IO ports in OpenRISC
···
 #define PIO_OFFSET		0
 #define PIO_MASK		0
 
-#define ioremap ioremap
-void __iomem *ioremap(phys_addr_t offset, unsigned long size);
-
-#define iounmap iounmap
-extern void iounmap(volatile void __iomem *addr);
+/*
+ * I/O memory mapping functions.
+ */
+#define _PAGE_IOREMAP (pgprot_val(PAGE_KERNEL) | _PAGE_CI)
 
 #include <asm-generic/io.h>
+4 -4
arch/openrisc/include/asm/pgalloc.h
···
 
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr)				\
-do {								\
-	pgtable_pte_page_dtor(pte);				\
-	tlb_remove_page((tlb), (pte));				\
+#define __pte_free_tlb(tlb, pte, addr)					\
+do {									\
+	pagetable_pte_dtor(page_ptdesc(pte));				\
+	tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));		\
 } while (0)
 
 #endif
+10 -5
arch/openrisc/include/asm/pgtable.h
···
  * hook is made available.
  */
 #define set_pte(pteptr, pteval) ((*(pteptr)) = (pteval))
-#define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)
+
 /*
  * (pmds are folded into pgds so this doesn't get actually called,
  * but the define is needed for a generic inline function.)
···
 #define __pmd_offset(address) \
 	(((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
 #define pte_pfn(x)		((unsigned long)(((x).pte)) >> PAGE_SHIFT)
 #define pfn_pte(pfn, prot)	__pte((((pfn) << PAGE_SHIFT)) | pgprot_val(prot))
 
···
 extern void update_cache(struct vm_area_struct *vma,
 	unsigned long address, pte_t *pte);
 
-static inline void update_mmu_cache(struct vm_area_struct *vma,
-	unsigned long address, pte_t *pte)
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+		struct vm_area_struct *vma, unsigned long address,
+		pte_t *ptep, unsigned int nr)
 {
-	update_tlb(vma, address, pte);
-	update_cache(vma, address, pte);
+	update_tlb(vma, address, ptep);
+	update_cache(vma, address, ptep);
 }
+
+#define update_mmu_cache(vma, addr, ptep) \
+	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 
 /* __PHX__ FIXME, SWAP, this probably doesn't work */
+8 -4
arch/openrisc/mm/cache.c
···
 	pte_t *pte)
 {
 	unsigned long pfn = pte_val(*pte) >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(pfn);
-	int dirty = !test_and_set_bit(PG_dc_clean, &page->flags);
+	struct folio *folio = page_folio(pfn_to_page(pfn));
+	int dirty = !test_and_set_bit(PG_dc_clean, &folio->flags);
 
 	/*
 	 * Since icaches do not snoop for updated data on OpenRISC, we
 	 * must write back and invalidate any dirty pages manually. We
 	 * can skip data pages, since they will not end up in icaches.
 	 */
-	if ((vma->vm_flags & VM_EXEC) && dirty)
-		sync_icache_dcache(page);
+	if ((vma->vm_flags & VM_EXEC) && dirty) {
+		unsigned int nr = folio_nr_pages(folio);
+
+		while (nr--)
+			sync_icache_dcache(folio_page(folio, nr));
+	}
 }
-82
arch/openrisc/mm/ioremap.c
···
 
 extern int mem_init_done;
 
-static unsigned int fixmaps_used __initdata;
-
-/*
- * Remap an arbitrary physical address space into the kernel virtual
- * address space. Needed when the kernel wants to access high addresses
- * directly.
- *
- * NOTE! We need to allow non-page-aligned mappings too: we will obviously
- * have to convert them into an offset in a page-aligned mapping, but the
- * caller shouldn't need to know that small detail.
- */
-void __iomem *__ref ioremap(phys_addr_t addr, unsigned long size)
-{
-	phys_addr_t p;
-	unsigned long v;
-	unsigned long offset, last_addr;
-	struct vm_struct *area = NULL;
-
-	/* Don't allow wraparound or zero size */
-	last_addr = addr + size - 1;
-	if (!size || last_addr < addr)
-		return NULL;
-
-	/*
-	 * Mappings have to be page-aligned
-	 */
-	offset = addr & ~PAGE_MASK;
-	p = addr & PAGE_MASK;
-	size = PAGE_ALIGN(last_addr + 1) - p;
-
-	if (likely(mem_init_done)) {
-		area = get_vm_area(size, VM_IOREMAP);
-		if (!area)
-			return NULL;
-		v = (unsigned long)area->addr;
-	} else {
-		if ((fixmaps_used + (size >> PAGE_SHIFT)) > FIX_N_IOREMAPS)
-			return NULL;
-		v = fix_to_virt(FIX_IOREMAP_BEGIN + fixmaps_used);
-		fixmaps_used += (size >> PAGE_SHIFT);
-	}
-
-	if (ioremap_page_range(v, v + size, p,
-			__pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_CI))) {
-		if (likely(mem_init_done))
-			vfree(area->addr);
-		else
-			fixmaps_used -= (size >> PAGE_SHIFT);
-		return NULL;
-	}
-
-	return (void __iomem *)(offset + (char *)v);
-}
-EXPORT_SYMBOL(ioremap);
-
-void iounmap(volatile void __iomem *addr)
-{
-	/* If the page is from the fixmap pool then we just clear out
-	 * the fixmap mapping.
-	 */
-	if (unlikely((unsigned long)addr > FIXADDR_START)) {
-		/* This is a bit broken... we don't really know
-		 * how big the area is so it's difficult to know
-		 * how many fixed pages to invalidate...
-		 * just flush tlb and hope for the best...
-		 * consider this a FIXME
-		 *
-		 * Really we should be clearing out one or more page
-		 * table entries for these virtual addresses so that
-		 * future references cause a page fault... for now, we
-		 * rely on two things:
-		 *   i) this code never gets called on known boards
-		 *   ii) invalid accesses to the freed areas aren't made
-		 */
-		flush_tlb_all();
-		return;
-	}
-
-	return vfree((void *)(PAGE_MASK & (unsigned long)addr));
-}
-EXPORT_SYMBOL(iounmap);
-
 /**
  * OK, this one's a bit tricky... ioremap can get called before memory is
  * initialized (early serial console does this) and will want to alloc a page
+1
arch/parisc/Kconfig
··· 36 36 select GENERIC_ATOMIC64 if !64BIT 37 37 select GENERIC_IRQ_PROBE 38 38 select GENERIC_PCI_IOMAP 39 + select GENERIC_IOREMAP 39 40 select ARCH_HAVE_NMI_SAFE_CMPXCHG 40 41 select GENERIC_SMP_IDLE_THREAD 41 42 select GENERIC_ARCH_TOPOLOGY if SMP
+9 -5
arch/parisc/include/asm/cacheflush.h
··· 43 43 #define flush_cache_vmap(start, end) flush_cache_all() 44 44 #define flush_cache_vunmap(start, end) flush_cache_all() 45 45 46 + void flush_dcache_folio(struct folio *folio); 47 + #define flush_dcache_folio flush_dcache_folio 46 48 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 47 - void flush_dcache_page(struct page *page); 49 + static inline void flush_dcache_page(struct page *page) 50 + { 51 + flush_dcache_folio(page_folio(page)); 52 + } 48 53 49 54 #define flush_dcache_mmap_lock(mapping) xa_lock_irq(&mapping->i_pages) 50 55 #define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&mapping->i_pages) ··· 58 53 #define flush_dcache_mmap_unlock_irqrestore(mapping, flags) \ 59 54 xa_unlock_irqrestore(&mapping->i_pages, flags) 60 55 61 - #define flush_icache_page(vma,page) do { \ 62 - flush_kernel_dcache_page_addr(page_address(page)); \ 63 - flush_kernel_icache_page(page_address(page)); \ 64 - } while (0) 56 + void flush_icache_pages(struct vm_area_struct *vma, struct page *page, 57 + unsigned int nr); 58 + #define flush_icache_pages flush_icache_pages 65 59 66 60 #define flush_icache_range(s,e) do { \ 67 61 flush_kernel_dcache_range_asm(s,e); \
+10 -5
arch/parisc/include/asm/io.h
··· 125 125 /* 126 126 * The standard PCI ioremap interfaces 127 127 */ 128 - void __iomem *ioremap(unsigned long offset, unsigned long size); 129 - #define ioremap_wc ioremap 130 - #define ioremap_uc ioremap 131 - #define pci_iounmap pci_iounmap 128 + #define ioremap_prot ioremap_prot 132 129 133 - extern void iounmap(const volatile void __iomem *addr); 130 + #define _PAGE_IOREMAP (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | \ 131 + _PAGE_ACCESSED | _PAGE_NO_CACHE) 132 + 133 + #define ioremap_wc(addr, size) \ 134 + ioremap_prot((addr), (size), _PAGE_IOREMAP) 135 + #define ioremap_uc(addr, size) \ 136 + ioremap_prot((addr), (size), _PAGE_IOREMAP) 137 + 138 + #define pci_iounmap pci_iounmap 134 139 135 140 void memset_io(volatile void __iomem *addr, unsigned char val, int count); 136 141 void memcpy_fromio(void *dst, const volatile void __iomem *src, int count);
+23 -14
arch/parisc/include/asm/pgtable.h
··· 73 73 mb(); \ 74 74 } while(0) 75 75 76 - #define set_pte_at(mm, addr, pteptr, pteval) \ 77 - do { \ 78 - if (pte_present(pteval) && \ 79 - pte_user(pteval)) \ 80 - __update_cache(pteval); \ 81 - *(pteptr) = (pteval); \ 82 - purge_tlb_entries(mm, addr); \ 83 - } while (0) 84 - 85 76 #endif /* !__ASSEMBLY__ */ 86 77 87 78 #define pte_ERROR(e) \ ··· 276 285 #define pte_none(x) (pte_val(x) == 0) 277 286 #define pte_present(x) (pte_val(x) & _PAGE_PRESENT) 278 287 #define pte_user(x) (pte_val(x) & _PAGE_USER) 279 - #define pte_clear(mm, addr, xp) set_pte_at(mm, addr, xp, __pte(0)) 288 + #define pte_clear(mm, addr, xp) set_pte(xp, __pte(0)) 280 289 281 290 #define pmd_flag(x) (pmd_val(x) & PxD_FLAG_MASK) 282 291 #define pmd_address(x) ((unsigned long)(pmd_val(x) &~ PxD_FLAG_MASK) << PxD_VALUE_SHIFT) ··· 382 391 383 392 extern void paging_init (void); 384 393 394 + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, 395 + pte_t *ptep, pte_t pte, unsigned int nr) 396 + { 397 + if (pte_present(pte) && pte_user(pte)) 398 + __update_cache(pte); 399 + for (;;) { 400 + *ptep = pte; 401 + purge_tlb_entries(mm, addr); 402 + if (--nr == 0) 403 + break; 404 + ptep++; 405 + pte_val(pte) += 1 << PFN_PTE_SHIFT; 406 + addr += PAGE_SIZE; 407 + } 408 + } 409 + #define set_ptes set_ptes 410 + 385 411 /* Used for deferring calls to flush_dcache_page() */ 386 412 387 413 #define PG_dcache_dirty PG_arch_1 388 414 389 - #define update_mmu_cache(vms,addr,ptep) __update_cache(*ptep) 415 + #define update_mmu_cache_range(vmf, vma, addr, ptep, nr) __update_cache(*ptep) 416 + #define update_mmu_cache(vma, addr, ptep) __update_cache(*ptep) 390 417 391 418 /* 392 419 * Encode/decode swap entries and swap PTEs. 
Swap PTEs are all PTEs that ··· 459 450 if (!pte_young(pte)) { 460 451 return 0; 461 452 } 462 - set_pte_at(vma->vm_mm, addr, ptep, pte_mkold(pte)); 453 + set_pte(ptep, pte_mkold(pte)); 463 454 return 1; 464 455 } 465 456 ··· 469 460 pte_t old_pte; 470 461 471 462 old_pte = *ptep; 472 - set_pte_at(mm, addr, ptep, __pte(0)); 463 + set_pte(ptep, __pte(0)); 473 464 474 465 return old_pte; 475 466 } 476 467 477 468 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) 478 469 { 479 - set_pte_at(mm, addr, ptep, pte_wrprotect(*ptep)); 470 + set_pte(ptep, pte_wrprotect(*ptep)); 480 471 } 481 472 482 473 #define pte_same(A,B) (pte_val(A) == pte_val(B))
+73 -34
arch/parisc/kernel/cache.c
··· 94 94 /* Kernel virtual address of pfn. */ 95 95 #define pfn_va(pfn) __va(PFN_PHYS(pfn)) 96 96 97 - void 98 - __update_cache(pte_t pte) 97 + void __update_cache(pte_t pte) 99 98 { 100 99 unsigned long pfn = pte_pfn(pte); 101 - struct page *page; 100 + struct folio *folio; 101 + unsigned int nr; 102 102 103 103 /* We don't have pte special. As a result, we can be called with 104 104 an invalid pfn and we don't need to flush the kernel dcache page. ··· 106 106 if (!pfn_valid(pfn)) 107 107 return; 108 108 109 - page = pfn_to_page(pfn); 110 - if (page_mapping_file(page) && 111 - test_bit(PG_dcache_dirty, &page->flags)) { 112 - flush_kernel_dcache_page_addr(pfn_va(pfn)); 113 - clear_bit(PG_dcache_dirty, &page->flags); 109 + folio = page_folio(pfn_to_page(pfn)); 110 + pfn = folio_pfn(folio); 111 + nr = folio_nr_pages(folio); 112 + if (folio_flush_mapping(folio) && 113 + test_bit(PG_dcache_dirty, &folio->flags)) { 114 + while (nr--) 115 + flush_kernel_dcache_page_addr(pfn_va(pfn + nr)); 116 + clear_bit(PG_dcache_dirty, &folio->flags); 114 117 } else if (parisc_requires_coherency()) 115 - flush_kernel_dcache_page_addr(pfn_va(pfn)); 118 + while (nr--) 119 + flush_kernel_dcache_page_addr(pfn_va(pfn + nr)); 116 120 } 117 121 118 122 void ··· 370 366 preempt_enable(); 371 367 } 372 368 369 + void flush_icache_pages(struct vm_area_struct *vma, struct page *page, 370 + unsigned int nr) 371 + { 372 + void *kaddr = page_address(page); 373 + 374 + for (;;) { 375 + flush_kernel_dcache_page_addr(kaddr); 376 + flush_kernel_icache_page(kaddr); 377 + if (--nr == 0) 378 + break; 379 + kaddr += PAGE_SIZE; 380 + } 381 + } 382 + 373 383 static inline pte_t *get_ptep(struct mm_struct *mm, unsigned long addr) 374 384 { 375 385 pte_t *ptep = NULL; ··· 412 394 == (_PAGE_PRESENT | _PAGE_ACCESSED); 413 395 } 414 396 415 - void flush_dcache_page(struct page *page) 397 + void flush_dcache_folio(struct folio *folio) 416 398 { 417 - struct address_space *mapping = page_mapping_file(page); 418 - 
struct vm_area_struct *mpnt; 419 - unsigned long offset; 399 + struct address_space *mapping = folio_flush_mapping(folio); 400 + struct vm_area_struct *vma; 420 401 unsigned long addr, old_addr = 0; 402 + void *kaddr; 421 403 unsigned long count = 0; 422 - unsigned long flags; 404 + unsigned long i, nr, flags; 423 405 pgoff_t pgoff; 424 406 425 407 if (mapping && !mapping_mapped(mapping)) { 426 - set_bit(PG_dcache_dirty, &page->flags); 408 + set_bit(PG_dcache_dirty, &folio->flags); 427 409 return; 428 410 } 429 411 430 - flush_kernel_dcache_page_addr(page_address(page)); 412 + nr = folio_nr_pages(folio); 413 + kaddr = folio_address(folio); 414 + for (i = 0; i < nr; i++) 415 + flush_kernel_dcache_page_addr(kaddr + i * PAGE_SIZE); 431 416 432 417 if (!mapping) 433 418 return; 434 419 435 - pgoff = page->index; 420 + pgoff = folio->index; 436 421 437 422 /* 438 423 * We have carefully arranged in arch_get_unmapped_area() that ··· 445 424 * on machines that support equivalent aliasing 446 425 */ 447 426 flush_dcache_mmap_lock_irqsave(mapping, flags); 448 - vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) { 449 - offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT; 450 - addr = mpnt->vm_start + offset; 451 - if (parisc_requires_coherency()) { 452 - bool needs_flush = false; 453 - pte_t *ptep; 427 + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + nr - 1) { 428 + unsigned long offset = pgoff - vma->vm_pgoff; 429 + unsigned long pfn = folio_pfn(folio); 454 430 455 - ptep = get_ptep(mpnt->vm_mm, addr); 456 - if (ptep) { 457 - needs_flush = pte_needs_flush(*ptep); 431 + addr = vma->vm_start; 432 + nr = folio_nr_pages(folio); 433 + if (offset > -nr) { 434 + pfn -= offset; 435 + nr += offset; 436 + } else { 437 + addr += offset * PAGE_SIZE; 438 + } 439 + if (addr + nr * PAGE_SIZE > vma->vm_end) 440 + nr = (vma->vm_end - addr) / PAGE_SIZE; 441 + 442 + if (parisc_requires_coherency()) { 443 + for (i = 0; i < nr; i++) { 444 + pte_t *ptep = 
get_ptep(vma->vm_mm, 445 + addr + i * PAGE_SIZE); 446 + if (!ptep) 447 + continue; 448 + if (pte_needs_flush(*ptep)) 449 + flush_user_cache_page(vma, 450 + addr + i * PAGE_SIZE); 451 + /* Optimise accesses to the same table? */ 458 452 pte_unmap(ptep); 459 453 } 460 - if (needs_flush) 461 - flush_user_cache_page(mpnt, addr); 462 454 } else { 463 455 /* 464 456 * The TLB is the engine of coherence on parisc: ··· 484 450 * in (until the user or kernel specifically 485 451 * accesses it, of course) 486 452 */ 487 - flush_tlb_page(mpnt, addr); 453 + for (i = 0; i < nr; i++) 454 + flush_tlb_page(vma, addr + i * PAGE_SIZE); 488 455 if (old_addr == 0 || (old_addr & (SHM_COLOUR - 1)) 489 456 != (addr & (SHM_COLOUR - 1))) { 490 - __flush_cache_page(mpnt, addr, page_to_phys(page)); 457 + for (i = 0; i < nr; i++) 458 + __flush_cache_page(vma, 459 + addr + i * PAGE_SIZE, 460 + (pfn + i) * PAGE_SIZE); 491 461 /* 492 462 * Software is allowed to have any number 493 463 * of private mappings to a page. 494 464 */ 495 - if (!(mpnt->vm_flags & VM_SHARED)) 465 + if (!(vma->vm_flags & VM_SHARED)) 496 466 continue; 497 467 if (old_addr) 498 468 pr_err("INEQUIVALENT ALIASES 0x%lx and 0x%lx in file %pD\n", 499 - old_addr, addr, mpnt->vm_file); 500 - old_addr = addr; 469 + old_addr, addr, vma->vm_file); 470 + if (nr == folio_nr_pages(folio)) 471 + old_addr = addr; 501 472 } 502 473 } 503 474 WARN_ON(++count == 4096); 504 475 } 505 476 flush_dcache_mmap_unlock_irqrestore(mapping, flags); 506 477 } 507 - EXPORT_SYMBOL(flush_dcache_page); 478 + EXPORT_SYMBOL(flush_dcache_folio); 508 479 509 480 /* Defined in arch/parisc/kernel/pacache.S */ 510 481 EXPORT_SYMBOL(flush_kernel_dcache_range_asm);
+4 -57
arch/parisc/mm/ioremap.c
··· 13 13 #include <linux/io.h> 14 14 #include <linux/mm.h> 15 15 16 - /* 17 - * Generic mapping function (not visible outside): 18 - */ 19 - 20 - /* 21 - * Remap an arbitrary physical address space into the kernel virtual 22 - * address space. 23 - * 24 - * NOTE! We need to allow non-page-aligned mappings too: we will obviously 25 - * have to convert them into an offset in a page-aligned mapping, but the 26 - * caller shouldn't need to know that small detail. 27 - */ 28 - void __iomem *ioremap(unsigned long phys_addr, unsigned long size) 16 + void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, 17 + unsigned long prot) 29 18 { 30 - uintptr_t addr; 31 - struct vm_struct *area; 32 - unsigned long offset, last_addr; 33 - pgprot_t pgprot; 34 - 35 19 #ifdef CONFIG_EISA 36 20 unsigned long end = phys_addr + size - 1; 37 21 /* Support EISA addresses */ ··· 23 39 (phys_addr >= 0x00500000 && end < 0x03bfffff)) 24 40 phys_addr |= F_EXTEND(0xfc000000); 25 41 #endif 26 - 27 - /* Don't allow wraparound or zero size */ 28 - last_addr = phys_addr + size - 1; 29 - if (!size || last_addr < phys_addr) 30 - return NULL; 31 42 32 43 /* 33 44 * Don't allow anybody to remap normal RAM that we're using.. ··· 41 62 } 42 63 } 43 64 44 - pgprot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | 45 - _PAGE_ACCESSED | _PAGE_NO_CACHE); 46 - 47 - /* 48 - * Mappings have to be page-aligned 49 - */ 50 - offset = phys_addr & ~PAGE_MASK; 51 - phys_addr &= PAGE_MASK; 52 - size = PAGE_ALIGN(last_addr + 1) - phys_addr; 53 - 54 - /* 55 - * Ok, go for it.. 
56 - */ 57 - area = get_vm_area(size, VM_IOREMAP); 58 - if (!area) 59 - return NULL; 60 - 61 - addr = (uintptr_t) area->addr; 62 - if (ioremap_page_range(addr, addr + size, phys_addr, pgprot)) { 63 - vunmap(area->addr); 64 - return NULL; 65 - } 66 - 67 - return (void __iomem *) (offset + (char __iomem *)addr); 65 + return generic_ioremap_prot(phys_addr, size, __pgprot(prot)); 68 66 } 69 - EXPORT_SYMBOL(ioremap); 70 - 71 - void iounmap(const volatile void __iomem *io_addr) 72 - { 73 - unsigned long addr = (unsigned long)io_addr & PAGE_MASK; 74 - 75 - if (is_vmalloc_addr((void *)addr)) 76 - vunmap((void *)addr); 77 - } 78 - EXPORT_SYMBOL(iounmap); 67 + EXPORT_SYMBOL(ioremap_prot);
+3
arch/powerpc/Kconfig
··· 157 157 select ARCH_HAS_UBSAN_SANITIZE_ALL 158 158 select ARCH_HAVE_NMI_SAFE_CMPXCHG 159 159 select ARCH_KEEP_MEMBLOCK 160 + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU 160 161 select ARCH_MIGHT_HAVE_PC_PARPORT 161 162 select ARCH_MIGHT_HAVE_PC_SERIO 162 163 select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX ··· 175 174 select ARCH_WANT_IPC_PARSE_VERSION 176 175 select ARCH_WANT_IRQS_OFF_ACTIVATE_MM 177 176 select ARCH_WANT_LD_ORPHAN_WARN 177 + select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if PPC_RADIX_MMU 178 178 select ARCH_WANTS_MODULES_DATA_IN_VMALLOC if PPC_BOOK3S_32 || PPC_8xx 179 179 select ARCH_WEAK_RELEASE_ACQUIRE 180 180 select BINFMT_ELF ··· 195 193 select GENERIC_CPU_VULNERABILITIES if PPC_BARRIER_NOSPEC 196 194 select GENERIC_EARLY_IOREMAP 197 195 select GENERIC_GETTIMEOFDAY 196 + select GENERIC_IOREMAP 198 197 select GENERIC_IRQ_SHOW 199 198 select GENERIC_IRQ_SHOW_LEVEL 200 199 select GENERIC_PCI_IOMAP if PCI
-5
arch/powerpc/include/asm/book3s/32/pgtable.h
··· 462 462 pgprot_val(pgprot)); 463 463 } 464 464 465 - static inline unsigned long pte_pfn(pte_t pte) 466 - { 467 - return pte_val(pte) >> PTE_RPN_SHIFT; 468 - } 469 - 470 465 /* Generic modifiers for PTE bits */ 471 466 static inline pte_t pte_wrprotect(pte_t pte) 472 467 {
+9
arch/powerpc/include/asm/book3s/64/hash.h
··· 138 138 } 139 139 140 140 #define hash__pmd_bad(pmd) (pmd_val(pmd) & H_PMD_BAD_BITS) 141 + 142 + /* 143 + * pud comparison that will work with both pte and page table pointer. 144 + */ 145 + static inline int hash__pud_same(pud_t pud_a, pud_t pud_b) 146 + { 147 + return (((pud_raw(pud_a) ^ pud_raw(pud_b)) & ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); 148 + } 141 149 #define hash__pud_bad(pud) (pud_val(pud) & H_PUD_BAD_BITS) 150 + 142 151 static inline int hash__p4d_bad(p4d_t p4d) 143 152 { 144 153 return (p4d_val(p4d) == 0);
+145 -16
arch/powerpc/include/asm/book3s/64/pgtable.h
··· 104 104 * and every thing below PAGE_SHIFT; 105 105 */ 106 106 #define PTE_RPN_MASK (((1UL << _PAGE_PA_MAX) - 1) & (PAGE_MASK)) 107 + #define PTE_RPN_SHIFT PAGE_SHIFT 107 108 /* 108 109 * set of bits not changed in pmd_modify. Even though we have hash specific bits 109 110 * in here, on radix we expect them to be zero. ··· 570 569 return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot) | _PAGE_PTE); 571 570 } 572 571 573 - static inline unsigned long pte_pfn(pte_t pte) 574 - { 575 - return (pte_val(pte) & PTE_RPN_MASK) >> PAGE_SHIFT; 576 - } 577 - 578 572 /* Generic modifiers for PTE bits */ 579 573 static inline pte_t pte_wrprotect(pte_t pte) 580 574 { ··· 917 921 { 918 922 return __pud_raw(pte_raw(pte)); 919 923 } 924 + 925 + static inline pte_t *pudp_ptep(pud_t *pud) 926 + { 927 + return (pte_t *)pud; 928 + } 929 + 930 + #define pud_pfn(pud) pte_pfn(pud_pte(pud)) 931 + #define pud_dirty(pud) pte_dirty(pud_pte(pud)) 932 + #define pud_young(pud) pte_young(pud_pte(pud)) 933 + #define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud))) 934 + #define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud))) 935 + #define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud))) 936 + #define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud))) 937 + #define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud))) 938 + #define pud_mkwrite(pud) pte_pud(pte_mkwrite(pud_pte(pud))) 920 939 #define pud_write(pud) pte_write(pud_pte(pud)) 940 + 941 + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY 942 + #define pud_soft_dirty(pmd) pte_soft_dirty(pud_pte(pud)) 943 + #define pud_mksoft_dirty(pmd) pte_pud(pte_mksoft_dirty(pud_pte(pud))) 944 + #define pud_clear_soft_dirty(pmd) pte_pud(pte_clear_soft_dirty(pud_pte(pud))) 945 + #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ 921 946 922 947 static inline int pud_bad(pud_t pud) 923 948 { ··· 1132 1115 1133 1116 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 1134 1117 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); 1118 + extern pud_t pfn_pud(unsigned long 
pfn, pgprot_t pgprot); 1135 1119 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); 1136 1120 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); 1137 1121 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, 1138 1122 pmd_t *pmdp, pmd_t pmd); 1123 + extern void set_pud_at(struct mm_struct *mm, unsigned long addr, 1124 + pud_t *pudp, pud_t pud); 1125 + 1139 1126 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, 1140 1127 unsigned long addr, pmd_t *pmd) 1128 + { 1129 + } 1130 + 1131 + static inline void update_mmu_cache_pud(struct vm_area_struct *vma, 1132 + unsigned long addr, pud_t *pud) 1141 1133 { 1142 1134 } 1143 1135 ··· 1159 1133 } 1160 1134 #define has_transparent_hugepage has_transparent_hugepage 1161 1135 1136 + static inline int has_transparent_pud_hugepage(void) 1137 + { 1138 + if (radix_enabled()) 1139 + return radix__has_transparent_pud_hugepage(); 1140 + return 0; 1141 + } 1142 + #define has_transparent_pud_hugepage has_transparent_pud_hugepage 1143 + 1162 1144 static inline unsigned long 1163 1145 pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, 1164 1146 unsigned long clr, unsigned long set) ··· 1176 1142 return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set); 1177 1143 } 1178 1144 1145 + static inline unsigned long 1146 + pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp, 1147 + unsigned long clr, unsigned long set) 1148 + { 1149 + if (radix_enabled()) 1150 + return radix__pud_hugepage_update(mm, addr, pudp, clr, set); 1151 + BUG(); 1152 + return pud_val(*pudp); 1153 + } 1154 + 1179 1155 /* 1180 1156 * returns true for pmd migration entries, THP, devmap, hugetlb 1181 1157 * But compile time dependent on THP config ··· 1193 1149 static inline int pmd_large(pmd_t pmd) 1194 1150 { 1195 1151 return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE)); 1152 + } 1153 + 1154 + static inline int pud_large(pud_t pud) 1155 + { 1156 + return !!(pud_raw(pud) & 
cpu_to_be64(_PAGE_PTE)); 1196 1157 } 1197 1158 1198 1159 /* ··· 1215 1166 return ((old & _PAGE_ACCESSED) != 0); 1216 1167 } 1217 1168 1169 + static inline int __pudp_test_and_clear_young(struct mm_struct *mm, 1170 + unsigned long addr, pud_t *pudp) 1171 + { 1172 + unsigned long old; 1173 + 1174 + if ((pud_raw(*pudp) & cpu_to_be64(_PAGE_ACCESSED | H_PAGE_HASHPTE)) == 0) 1175 + return 0; 1176 + old = pud_hugepage_update(mm, addr, pudp, _PAGE_ACCESSED, 0); 1177 + return ((old & _PAGE_ACCESSED) != 0); 1178 + } 1179 + 1218 1180 #define __HAVE_ARCH_PMDP_SET_WRPROTECT 1219 1181 static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr, 1220 1182 pmd_t *pmdp) 1221 1183 { 1222 1184 if (pmd_write(*pmdp)) 1223 1185 pmd_hugepage_update(mm, addr, pmdp, _PAGE_WRITE, 0); 1186 + } 1187 + 1188 + #define __HAVE_ARCH_PUDP_SET_WRPROTECT 1189 + static inline void pudp_set_wrprotect(struct mm_struct *mm, unsigned long addr, 1190 + pud_t *pudp) 1191 + { 1192 + if (pud_write(*pudp)) 1193 + pud_hugepage_update(mm, addr, pudp, _PAGE_WRITE, 0); 1224 1194 } 1225 1195 1226 1196 /* ··· 1263 1195 return hash__pmd_trans_huge(pmd); 1264 1196 } 1265 1197 1198 + static inline int pud_trans_huge(pud_t pud) 1199 + { 1200 + if (!pud_present(pud)) 1201 + return false; 1202 + 1203 + if (radix_enabled()) 1204 + return radix__pud_trans_huge(pud); 1205 + return 0; 1206 + } 1207 + 1208 + 1266 1209 #define __HAVE_ARCH_PMD_SAME 1267 1210 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) 1268 1211 { ··· 1282 1203 return hash__pmd_same(pmd_a, pmd_b); 1283 1204 } 1284 1205 1206 + #define pud_same pud_same 1207 + static inline int pud_same(pud_t pud_a, pud_t pud_b) 1208 + { 1209 + if (radix_enabled()) 1210 + return radix__pud_same(pud_a, pud_b); 1211 + return hash__pud_same(pud_a, pud_b); 1212 + } 1213 + 1214 + 1285 1215 static inline pmd_t __pmd_mkhuge(pmd_t pmd) 1286 1216 { 1287 1217 if (radix_enabled()) 1288 1218 return radix__pmd_mkhuge(pmd); 1289 1219 return hash__pmd_mkhuge(pmd); 1220 
+ } 1221 + 1222 + static inline pud_t __pud_mkhuge(pud_t pud) 1223 + { 1224 + if (radix_enabled()) 1225 + return radix__pud_mkhuge(pud); 1226 + BUG(); 1227 + return pud; 1290 1228 } 1291 1229 1292 1230 /* ··· 1321 1225 return pmd; 1322 1226 } 1323 1227 1228 + static inline pud_t pud_mkhuge(pud_t pud) 1229 + { 1230 + #ifdef CONFIG_DEBUG_VM 1231 + if (radix_enabled()) 1232 + WARN_ON((pud_raw(pud) & cpu_to_be64(_PAGE_PTE)) == 0); 1233 + else 1234 + WARN_ON(1); 1235 + #endif 1236 + return pud; 1237 + } 1238 + 1239 + 1324 1240 #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS 1325 1241 extern int pmdp_set_access_flags(struct vm_area_struct *vma, 1326 1242 unsigned long address, pmd_t *pmdp, 1327 1243 pmd_t entry, int dirty); 1244 + #define __HAVE_ARCH_PUDP_SET_ACCESS_FLAGS 1245 + extern int pudp_set_access_flags(struct vm_area_struct *vma, 1246 + unsigned long address, pud_t *pudp, 1247 + pud_t entry, int dirty); 1328 1248 1329 1249 #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG 1330 1250 extern int pmdp_test_and_clear_young(struct vm_area_struct *vma, 1331 1251 unsigned long address, pmd_t *pmdp); 1252 + #define __HAVE_ARCH_PUDP_TEST_AND_CLEAR_YOUNG 1253 + extern int pudp_test_and_clear_young(struct vm_area_struct *vma, 1254 + unsigned long address, pud_t *pudp); 1255 + 1332 1256 1333 1257 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR 1334 1258 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, ··· 1357 1241 if (radix_enabled()) 1358 1242 return radix__pmdp_huge_get_and_clear(mm, addr, pmdp); 1359 1243 return hash__pmdp_huge_get_and_clear(mm, addr, pmdp); 1244 + } 1245 + 1246 + #define __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR 1247 + static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm, 1248 + unsigned long addr, pud_t *pudp) 1249 + { 1250 + if (radix_enabled()) 1251 + return radix__pudp_huge_get_and_clear(mm, addr, pudp); 1252 + BUG(); 1253 + return *pudp; 1360 1254 } 1361 1255 1362 1256 static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, ··· 
1382 1256 pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, 1383 1257 unsigned long addr, 1384 1258 pmd_t *pmdp, int full); 1259 + 1260 + #define __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL 1261 + pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma, 1262 + unsigned long addr, 1263 + pud_t *pudp, int full); 1385 1264 1386 1265 #define __HAVE_ARCH_PGTABLE_DEPOSIT 1387 1266 static inline void pgtable_trans_huge_deposit(struct mm_struct *mm, ··· 1436 1305 return hash__pmd_mkdevmap(pmd); 1437 1306 } 1438 1307 1308 + static inline pud_t pud_mkdevmap(pud_t pud) 1309 + { 1310 + if (radix_enabled()) 1311 + return radix__pud_mkdevmap(pud); 1312 + BUG(); 1313 + return pud; 1314 + } 1315 + 1439 1316 static inline int pmd_devmap(pmd_t pmd) 1440 1317 { 1441 1318 return pte_devmap(pmd_pte(pmd)); ··· 1451 1312 1452 1313 static inline int pud_devmap(pud_t pud) 1453 1314 { 1454 - return 0; 1315 + return pte_devmap(pud_pte(pud)); 1455 1316 } 1456 1317 1457 1318 static inline int pgd_devmap(pgd_t pgd) ··· 1460 1321 } 1461 1322 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 1462 1323 1463 - static inline int pud_pfn(pud_t pud) 1464 - { 1465 - /* 1466 - * Currently all calls to pud_pfn() are gated around a pud_devmap() 1467 - * check so this should never be used. If it grows another user we 1468 - * want to know about it. 1469 - */ 1470 - BUILD_BUG(); 1471 - return 0; 1472 - } 1473 1324 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION 1474 1325 pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *); 1475 1326 void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
+49
arch/powerpc/include/asm/book3s/64/radix.h
··· 250 250 return !!(pud_val(pud) & RADIX_PUD_BAD_BITS); 251 251 } 252 252 253 + static inline int radix__pud_same(pud_t pud_a, pud_t pud_b) 254 + { 255 + return ((pud_raw(pud_a) ^ pud_raw(pud_b)) == 0); 256 + } 253 257 254 258 static inline int radix__p4d_bad(p4d_t p4d) 255 259 { ··· 272 268 return __pmd(pmd_val(pmd) | _PAGE_PTE); 273 269 } 274 270 271 + static inline int radix__pud_trans_huge(pud_t pud) 272 + { 273 + return (pud_val(pud) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE; 274 + } 275 + 276 + static inline pud_t radix__pud_mkhuge(pud_t pud) 277 + { 278 + return __pud(pud_val(pud) | _PAGE_PTE); 279 + } 280 + 275 281 extern unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, 276 282 pmd_t *pmdp, unsigned long clr, 277 283 unsigned long set); 284 + extern unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long addr, 285 + pud_t *pudp, unsigned long clr, 286 + unsigned long set); 278 287 extern pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, 279 288 unsigned long address, pmd_t *pmdp); 280 289 extern void radix__pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp, ··· 295 278 extern pgtable_t radix__pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp); 296 279 extern pmd_t radix__pmdp_huge_get_and_clear(struct mm_struct *mm, 297 280 unsigned long addr, pmd_t *pmdp); 281 + pud_t radix__pudp_huge_get_and_clear(struct mm_struct *mm, 282 + unsigned long addr, pud_t *pudp); 283 + 298 284 static inline int radix__has_transparent_hugepage(void) 299 285 { 300 286 /* For radix 2M at PMD level means thp */ 301 287 if (mmu_psize_defs[MMU_PAGE_2M].shift == PMD_SHIFT) 288 + return 1; 289 + return 0; 290 + } 291 + 292 + static inline int radix__has_transparent_pud_hugepage(void) 293 + { 294 + /* For radix 1G at PUD level means pud hugepage support */ 295 + if (mmu_psize_defs[MMU_PAGE_1G].shift == PUD_SHIFT) 302 296 return 1; 303 297 return 0; 304 298 } ··· 320 292 return __pmd(pmd_val(pmd) 
| (_PAGE_PTE | _PAGE_DEVMAP)); 321 293 } 322 294 295 + static inline pud_t radix__pud_mkdevmap(pud_t pud) 296 + { 297 + return __pud(pud_val(pud) | (_PAGE_PTE | _PAGE_DEVMAP)); 298 + } 299 + 300 + struct vmem_altmap; 301 + struct dev_pagemap; 323 302 extern int __meminit radix__vmemmap_create_mapping(unsigned long start, 324 303 unsigned long page_size, 325 304 unsigned long phys); 305 + int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, 306 + int node, struct vmem_altmap *altmap); 307 + void __ref radix__vmemmap_free(unsigned long start, unsigned long end, 308 + struct vmem_altmap *altmap); 326 309 extern void radix__vmemmap_remove_mapping(unsigned long start, 327 310 unsigned long page_size); 328 311 ··· 364 325 365 326 void radix__kernel_map_pages(struct page *page, int numpages, int enable); 366 327 328 + #ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP 329 + #define vmemmap_can_optimize vmemmap_can_optimize 330 + bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap); 331 + #endif 332 + 333 + #define vmemmap_populate_compound_pages vmemmap_populate_compound_pages 334 + int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn, 335 + unsigned long start, 336 + unsigned long end, int node, 337 + struct dev_pagemap *pgmap); 367 338 #endif /* __ASSEMBLY__ */ 368 339 #endif
+2
arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
··· 68 68 unsigned long end, int psize); 69 69 extern void radix__flush_pmd_tlb_range(struct vm_area_struct *vma, 70 70 unsigned long start, unsigned long end); 71 + extern void radix__flush_pud_tlb_range(struct vm_area_struct *vma, 72 + unsigned long start, unsigned long end); 71 73 extern void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start, 72 74 unsigned long end); 73 75 extern void radix__flush_tlb_kernel_range(unsigned long start, unsigned long end);
+9
arch/powerpc/include/asm/book3s/64/tlbflush.h
··· 5 5 #define MMU_NO_CONTEXT ~0UL 6 6 7 7 #include <linux/mm_types.h> 8 + #include <linux/mmu_notifier.h> 8 9 #include <asm/book3s/64/tlbflush-hash.h> 9 10 #include <asm/book3s/64/tlbflush-radix.h> 10 11 ··· 49 48 { 50 49 if (radix_enabled()) 51 50 radix__flush_pmd_tlb_range(vma, start, end); 51 + } 52 + 53 + #define __HAVE_ARCH_FLUSH_PUD_TLB_RANGE 54 + static inline void flush_pud_tlb_range(struct vm_area_struct *vma, 55 + unsigned long start, unsigned long end) 56 + { 57 + if (radix_enabled()) 58 + radix__flush_pud_tlb_range(vma, start, end); 52 59 } 53 60 54 61 #define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
+3 -8
arch/powerpc/include/asm/book3s/pgtable.h
··· 9 9 #endif 10 10 11 11 #ifndef __ASSEMBLY__ 12 - /* Insert a PTE, top-level function is out of line. It uses an inline 13 - * low level function in the respective pgtable-* files 14 - */ 15 - extern void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, 16 - pte_t pte); 17 - 18 - 19 12 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS 20 13 extern int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address, 21 14 pte_t *ptep, pte_t entry, int dirty); ··· 29 36 * corresponding HPTE into the hash table ahead of time, instead of 30 37 * waiting for the inevitable extra hash-table miss exception. 31 38 */ 32 - static inline void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) 39 + static inline void update_mmu_cache_range(struct vm_fault *vmf, 40 + struct vm_area_struct *vma, unsigned long address, 41 + pte_t *ptep, unsigned int nr) 33 42 { 34 43 if (IS_ENABLED(CONFIG_PPC32) && !mmu_has_feature(MMU_FTR_HPTE_TABLE)) 35 44 return;
+10 -4
arch/powerpc/include/asm/cacheflush.h
··· 35 35 * It just marks the page as not i-cache clean. We do the i-cache 36 36 * flush later when the page is given to a user process, if necessary. 37 37 */ 38 - static inline void flush_dcache_page(struct page *page) 38 + static inline void flush_dcache_folio(struct folio *folio) 39 39 { 40 40 if (cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) 41 41 return; 42 42 /* avoid an atomic op if possible */ 43 - if (test_bit(PG_dcache_clean, &page->flags)) 44 - clear_bit(PG_dcache_clean, &page->flags); 43 + if (test_bit(PG_dcache_clean, &folio->flags)) 44 + clear_bit(PG_dcache_clean, &folio->flags); 45 + } 46 + #define flush_dcache_folio flush_dcache_folio 47 + 48 + static inline void flush_dcache_page(struct page *page) 49 + { 50 + flush_dcache_folio(page_folio(page)); 45 51 } 46 52 47 53 void flush_icache_range(unsigned long start, unsigned long stop); ··· 57 51 unsigned long addr, int len); 58 52 #define flush_icache_user_page flush_icache_user_page 59 53 60 - void flush_dcache_icache_page(struct page *page); 54 + void flush_dcache_icache_folio(struct folio *folio); 61 55 62 56 /** 63 57 * flush_dcache_range(): Write any modified data cache blocks out to memory and
+4 -13
arch/powerpc/include/asm/io.h
···
 #define _ASM_POWERPC_IO_H
 #ifdef __KERNEL__

-#define ARCH_HAS_IOREMAP_WC
-#ifdef CONFIG_PPC32
-#define ARCH_HAS_IOREMAP_WT
-#endif
-
 /*
  */
···
 #define writel_relaxed(v, addr)	writel(v, addr)
 #define writeq_relaxed(v, addr)	writeq(v, addr)

-#ifdef CONFIG_GENERIC_IOMAP
-#include <asm-generic/iomap.h>
-#else
+#ifndef CONFIG_GENERIC_IOMAP
 /*
  * Here comes the implementation of the IOMAP interfaces.
  */
···
  *
  */
 extern void __iomem *ioremap(phys_addr_t address, unsigned long size);
-extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size,
-				  unsigned long flags);
+#define ioremap ioremap
+#define ioremap_prot ioremap_prot
 extern void __iomem *ioremap_wc(phys_addr_t address, unsigned long size);
 #define ioremap_wc ioremap_wc
···
 #define ioremap_cache(addr, size) \
 	ioremap_prot((addr), (size), pgprot_val(PAGE_KERNEL))

-extern void iounmap(volatile void __iomem *addr);
+#define iounmap iounmap

 void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size);

 int early_ioremap_range(unsigned long ea, phys_addr_t pa,
 			unsigned long size, pgprot_t prot);
-void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size,
-			 pgprot_t prot, void *caller);

 extern void __iomem *__ioremap_caller(phys_addr_t, unsigned long size,
 				      pgprot_t prot, void *caller);
+5 -5
arch/powerpc/include/asm/kvm_ppc.h
···
 static inline void kvmppc_mmu_flush_icache(kvm_pfn_t pfn)
 {
-	struct page *page;
+	struct folio *folio;
 	/*
 	 * We can only access pages that the kernel maps
 	 * as memory. Bail out for unmapped ones.
···
 		return;

 	/* Clear i-cache for new pages */
-	page = pfn_to_page(pfn);
-	if (!test_bit(PG_dcache_clean, &page->flags)) {
-		flush_dcache_icache_page(page);
-		set_bit(PG_dcache_clean, &page->flags);
+	folio = page_folio(pfn_to_page(pfn));
+	if (!test_bit(PG_dcache_clean, &folio->flags)) {
+		flush_dcache_icache_folio(folio);
+		set_bit(PG_dcache_clean, &folio->flags);
 	}
 }
+5 -11
arch/powerpc/include/asm/nohash/pgtable.h
···
 static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) {
 	return __pte(((pte_basic_t)(pfn) << PTE_RPN_SHIFT) |
 		     pgprot_val(pgprot)); }
-static inline unsigned long pte_pfn(pte_t pte) {
-	return pte_val(pte) >> PTE_RPN_SHIFT; }

 /* Generic modifiers for PTE bits */
 static inline pte_t pte_exprotect(pte_t pte)
···
 {
 	return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
 }
-
-/* Insert a PTE, top-level function is out of line. It uses an inline
- * low level function in the respective pgtable-* files
- */
-extern void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
-		       pte_t pte);

 /* This low level function performs the actual PTE insertion
  * Setting the PTE depends on the MMU type and other factors. It's
···
  * for the page which has just been mapped in.
  */
 #if defined(CONFIG_PPC_E500) && defined(CONFIG_HUGETLB_PAGE)
-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep);
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long address, pte_t *ptep, unsigned int nr);
 #else
-static inline
-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) {}
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+		struct vm_area_struct *vma, unsigned long address,
+		pte_t *ptep, unsigned int nr) {}
 #endif

 #endif /* __ASSEMBLY__ */
+4
arch/powerpc/include/asm/pgalloc.h
···
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }

+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
+34 -5
arch/powerpc/include/asm/pgtable.h
···
 #ifndef __ASSEMBLY__

+void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		pte_t pte, unsigned int nr);
+#define set_ptes set_ptes
+#define update_mmu_cache(vma, addr, ptep) \
+	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
+
 #ifndef MAX_PTRS_PER_PGD
 #define MAX_PTRS_PER_PGD PTRS_PER_PGD
 #endif
···
 /* Keep these as a macros to avoid include dependency mess */
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
 #define mk_pte(page, pgprot)	pfn_pte(page_to_pfn(page), (pgprot))
+
+static inline unsigned long pte_pfn(pte_t pte)
+{
+	return (pte_val(pte) & PTE_RPN_MASK) >> PTE_RPN_SHIFT;
+}
+
 /*
  * Select all bits except the pfn
  */
···
 }

 #ifdef CONFIG_PPC64
-#define is_ioremap_addr is_ioremap_addr
-static inline bool is_ioremap_addr(const void *x)
+int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size);
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+			   unsigned long page_size);
+/*
+ * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
+ * some of the restrictions. We don't check for PMD_SIZE because our
+ * vmemmap allocation code can fallback correctly. The pageblock
+ * alignment requirement is met using altmap->reserve blocks.
+ */
+#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
+static inline bool arch_supports_memmap_on_memory(unsigned long vmemmap_size)
 {
-	unsigned long addr = (unsigned long)x;
-
-	return addr >= IOREMAP_BASE && addr < IOREMAP_END;
+	if (!radix_enabled())
+		return false;
+	/*
+	 * With 4K page size and 2M PMD_SIZE, we can align
+	 * things better with memory block size value
+	 * starting from 128MB. Hence align things with PMD_SIZE.
+	 */
+	if (IS_ENABLED(CONFIG_PPC_4K_PAGES))
+		return IS_ALIGNED(vmemmap_size, PMD_SIZE);
+	return true;
 }
+
 #endif /* CONFIG_PPC64 */

 #endif /* __ASSEMBLY__ */
+1
arch/powerpc/kvm/book3s_hv_uvmem.c
···
 		ret = H_STATE;
 		break;
 	}
+	vma_start_write(vma);
 	/* Copy vm_flags to avoid partial modifications in ksm_madvise */
 	vm_flags = vma->vm_flags;
 	ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
+1 -1
arch/powerpc/mm/book3s64/hash_pgtable.c
···
 	old = be64_to_cpu(old_be);

-	trace_hugepage_update(addr, old, clr, set);
+	trace_hugepage_update_pmd(addr, old, clr, set);
 	if (old & H_PAGE_HASHPTE)
 		hpte_do_hugepage_flush(mm, addr, pmdp, old);
 	return old;
+6 -5
arch/powerpc/mm/book3s64/hash_utils.c
···
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-	struct page *page;
+	struct folio *folio;

 	if (!pfn_valid(pte_pfn(pte)))
 		return pp;

-	page = pte_page(pte);
+	folio = page_folio(pte_page(pte));

 	/* page is dirty */
-	if (!test_bit(PG_dcache_clean, &page->flags) && !PageReserved(page)) {
+	if (!test_bit(PG_dcache_clean, &folio->flags) &&
+	    !folio_test_reserved(folio)) {
 		if (trap == INTERRUPT_INST_STORAGE) {
-			flush_dcache_icache_page(page);
-			set_bit(PG_dcache_clean, &page->flags);
+			flush_dcache_icache_folio(folio);
+			set_bit(PG_dcache_clean, &folio->flags);
 		} else
 			pp |= HPTE_R_N;
 	}
+5 -5
arch/powerpc/mm/book3s64/mmu_context.c
···
 static void pmd_frag_destroy(void *pmd_frag)
 {
 	int count;
-	struct page *page;
+	struct ptdesc *ptdesc;

-	page = virt_to_page(pmd_frag);
+	ptdesc = virt_to_ptdesc(pmd_frag);
 	/* drop all the pending references */
 	count = ((unsigned long)pmd_frag & ~PAGE_MASK) >> PMD_FRAG_SIZE_SHIFT;
 	/* We allow PTE_FRAG_NR fragments from a PTE page */
-	if (atomic_sub_and_test(PMD_FRAG_NR - count, &page->pt_frag_refcount)) {
-		pgtable_pmd_page_dtor(page);
-		__free_page(page);
+	if (atomic_sub_and_test(PMD_FRAG_NR - count, &ptdesc->pt_frag_refcount)) {
+		pagetable_pmd_dtor(ptdesc);
+		pagetable_free(ptdesc);
 	}
 }
+94 -16
arch/powerpc/mm/book3s64/pgtable.c
···
 	return changed;
 }

+int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
+			  pud_t *pudp, pud_t entry, int dirty)
+{
+	int changed;
+#ifdef CONFIG_DEBUG_VM
+	WARN_ON(!pud_devmap(*pudp));
+	assert_spin_locked(pud_lockptr(vma->vm_mm, pudp));
+#endif
+	changed = !pud_same(*(pudp), entry);
+	if (changed) {
+		/*
+		 * We can use MMU_PAGE_1G here, because only radix
+		 * path look at the psize.
+		 */
+		__ptep_set_access_flags(vma, pudp_ptep(pudp),
+					pud_pte(entry), address, MMU_PAGE_1G);
+	}
+	return changed;
+}
+
+
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long address, pmd_t *pmdp)
 {
 	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
 }
+
+int pudp_test_and_clear_young(struct vm_area_struct *vma,
+			      unsigned long address, pud_t *pudp)
+{
+	return __pudp_test_and_clear_young(vma->vm_mm, address, pudp);
+}
+
 /*
  * set a new huge pmd. We should not be called for updating
  * an existing pmd entry. That should go via pmd_hugepage_update.
···
 #endif
 	trace_hugepage_set_pmd(addr, pmd_val(pmd));
 	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
+}
+
+void set_pud_at(struct mm_struct *mm, unsigned long addr,
+		pud_t *pudp, pud_t pud)
+{
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * Make sure hardware valid bit is not set. We don't do
+	 * tlb flush for this update.
+	 */
+
+	WARN_ON(pte_hw_valid(pud_pte(*pudp)));
+	assert_spin_locked(pud_lockptr(mm, pudp));
+	WARN_ON(!(pud_large(pud)));
+#endif
+	trace_hugepage_set_pud(addr, pud_val(pud));
+	return set_pte_at(mm, addr, pudp_ptep(pudp), pud_pte(pud));
 }

 static void do_serialize(void *arg)
···
 	return pmd;
 }

+pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
+				   unsigned long addr, pud_t *pudp, int full)
+{
+	pud_t pud;
+
+	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
+	VM_BUG_ON((pud_present(*pudp) && !pud_devmap(*pudp)) ||
+		  !pud_present(*pudp));
+	pud = pudp_huge_get_and_clear(vma->vm_mm, addr, pudp);
+	/*
+	 * if it not a fullmm flush, then we can possibly end up converting
+	 * this PMD pte entry to a regular level 0 PTE by a parallel page fault.
+	 * Make sure we flush the tlb in this case.
+	 */
+	if (!full)
+		flush_pud_tlb_range(vma, addr, addr + HPAGE_PUD_SIZE);
+	return pud;
+}
+
 static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
 {
 	return __pmd(pmd_val(pmd) | pgprot_val(pgprot));
+}
+
+static pud_t pud_set_protbits(pud_t pud, pgprot_t pgprot)
+{
+	return __pud(pud_val(pud) | pgprot_val(pgprot));
 }

 /*
···
 	pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK;

 	return __pmd_mkhuge(pmd_set_protbits(__pmd(pmdv), pgprot));
+}
+
+pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot)
+{
+	unsigned long pudv;
+
+	pudv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK;
+
+	return __pud_mkhuge(pud_set_protbits(__pud(pudv), pgprot));
 }

 pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
···
 static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 {
 	void *ret = NULL;
-	struct page *page;
+	struct ptdesc *ptdesc;
 	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;

 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
-	page = alloc_page(gfp);
-	if (!page)
+	ptdesc = pagetable_alloc(gfp, 0);
+	if (!ptdesc)
 		return NULL;
-	if (!pgtable_pmd_page_ctor(page)) {
-		__free_pages(page, 0);
+	if (!pagetable_pmd_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
 		return NULL;
 	}

-	atomic_set(&page->pt_frag_refcount, 1);
+	atomic_set(&ptdesc->pt_frag_refcount, 1);

-	ret = page_address(page);
+	ret = ptdesc_address(ptdesc);
 	/*
 	 * if we support only one fragment just return the
 	 * allocated page.
···

 	spin_lock(&mm->page_table_lock);
 	/*
-	 * If we find pgtable_page set, we return
+	 * If we find ptdesc_page set, we return
 	 * the allocated page with single fragment
 	 * count.
 	 */
 	if (likely(!mm->context.pmd_frag)) {
-		atomic_set(&page->pt_frag_refcount, PMD_FRAG_NR);
+		atomic_set(&ptdesc->pt_frag_refcount, PMD_FRAG_NR);
 		mm->context.pmd_frag = ret + PMD_FRAG_SIZE;
 	}
 	spin_unlock(&mm->page_table_lock);
···

 void pmd_fragment_free(unsigned long *pmd)
 {
-	struct page *page = virt_to_page(pmd);
+	struct ptdesc *ptdesc = virt_to_ptdesc(pmd);

-	if (PageReserved(page))
-		return free_reserved_page(page);
+	if (pagetable_is_reserved(ptdesc))
+		return free_reserved_ptdesc(ptdesc);

-	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
-	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-		pgtable_pmd_page_dtor(page);
-		__free_page(page);
+	BUG_ON(atomic_read(&ptdesc->pt_frag_refcount) <= 0);
+	if (atomic_dec_and_test(&ptdesc->pt_frag_refcount)) {
+		pagetable_pmd_dtor(ptdesc);
+		pagetable_free(ptdesc);
 	}
 }
+1
arch/powerpc/mm/book3s64/radix_hugetlbpage.c
···
 		radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
 	else
 		radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }

 void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
+535 -39
arch/powerpc/mm/book3s64/radix_pgtable.c
··· 601 601 #else 602 602 mmu_virtual_psize = MMU_PAGE_4K; 603 603 #endif 604 - 605 - #ifdef CONFIG_SPARSEMEM_VMEMMAP 606 - /* vmemmap mapping */ 607 - if (mmu_psize_defs[MMU_PAGE_2M].shift) { 608 - /* 609 - * map vmemmap using 2M if available 610 - */ 611 - mmu_vmemmap_psize = MMU_PAGE_2M; 612 - } else 613 - mmu_vmemmap_psize = mmu_virtual_psize; 614 - #endif 615 604 #endif 616 605 /* 617 606 * initialize page table size ··· 733 744 p4d_clear(p4d); 734 745 } 735 746 736 - static void remove_pte_table(pte_t *pte_start, unsigned long addr, 737 - unsigned long end, bool direct) 747 + #ifdef CONFIG_SPARSEMEM_VMEMMAP 748 + static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long end) 749 + { 750 + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE); 751 + 752 + return !vmemmap_populated(start, PMD_SIZE); 753 + } 754 + 755 + static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long end) 756 + { 757 + unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE); 758 + 759 + return !vmemmap_populated(start, PAGE_SIZE); 760 + 761 + } 762 + #endif 763 + 764 + static void __meminit free_vmemmap_pages(struct page *page, 765 + struct vmem_altmap *altmap, 766 + int order) 767 + { 768 + unsigned int nr_pages = 1 << order; 769 + 770 + if (altmap) { 771 + unsigned long alt_start, alt_end; 772 + unsigned long base_pfn = page_to_pfn(page); 773 + 774 + /* 775 + * with 2M vmemmap mmaping we can have things setup 776 + * such that even though atlmap is specified we never 777 + * used altmap. 
778 + */ 779 + alt_start = altmap->base_pfn; 780 + alt_end = altmap->base_pfn + altmap->reserve + altmap->free; 781 + 782 + if (base_pfn >= alt_start && base_pfn < alt_end) { 783 + vmem_altmap_free(altmap, nr_pages); 784 + return; 785 + } 786 + } 787 + 788 + if (PageReserved(page)) { 789 + /* allocated from memblock */ 790 + while (nr_pages--) 791 + free_reserved_page(page++); 792 + } else 793 + free_pages((unsigned long)page_address(page), order); 794 + } 795 + 796 + static void __meminit remove_pte_table(pte_t *pte_start, unsigned long addr, 797 + unsigned long end, bool direct, 798 + struct vmem_altmap *altmap) 738 799 { 739 800 unsigned long next, pages = 0; 740 801 pte_t *pte; ··· 798 759 if (!pte_present(*pte)) 799 760 continue; 800 761 801 - if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) { 802 - /* 803 - * The vmemmap_free() and remove_section_mapping() 804 - * codepaths call us with aligned addresses. 805 - */ 806 - WARN_ONCE(1, "%s: unaligned range\n", __func__); 807 - continue; 762 + if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) { 763 + if (!direct) 764 + free_vmemmap_pages(pte_page(*pte), altmap, 0); 765 + pte_clear(&init_mm, addr, pte); 766 + pages++; 808 767 } 809 - 810 - pte_clear(&init_mm, addr, pte); 811 - pages++; 768 + #ifdef CONFIG_SPARSEMEM_VMEMMAP 769 + else if (!direct && vmemmap_page_is_unused(addr, next)) { 770 + free_vmemmap_pages(pte_page(*pte), altmap, 0); 771 + pte_clear(&init_mm, addr, pte); 772 + } 773 + #endif 812 774 } 813 775 if (direct) 814 776 update_page_count(mmu_virtual_psize, -pages); 815 777 } 816 778 817 779 static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr, 818 - unsigned long end, bool direct) 780 + unsigned long end, bool direct, 781 + struct vmem_altmap *altmap) 819 782 { 820 783 unsigned long next, pages = 0; 821 784 pte_t *pte_base; ··· 831 790 continue; 832 791 833 792 if (pmd_is_leaf(*pmd)) { 834 - if (!IS_ALIGNED(addr, PMD_SIZE) || 835 - !IS_ALIGNED(next, PMD_SIZE)) { 836 - WARN_ONCE(1, 
"%s: unaligned range\n", __func__); 837 - continue; 793 + if (IS_ALIGNED(addr, PMD_SIZE) && 794 + IS_ALIGNED(next, PMD_SIZE)) { 795 + if (!direct) 796 + free_vmemmap_pages(pmd_page(*pmd), altmap, get_order(PMD_SIZE)); 797 + pte_clear(&init_mm, addr, (pte_t *)pmd); 798 + pages++; 838 799 } 839 - pte_clear(&init_mm, addr, (pte_t *)pmd); 840 - pages++; 800 + #ifdef CONFIG_SPARSEMEM_VMEMMAP 801 + else if (!direct && vmemmap_pmd_is_unused(addr, next)) { 802 + free_vmemmap_pages(pmd_page(*pmd), altmap, get_order(PMD_SIZE)); 803 + pte_clear(&init_mm, addr, (pte_t *)pmd); 804 + } 805 + #endif 841 806 continue; 842 807 } 843 808 844 809 pte_base = (pte_t *)pmd_page_vaddr(*pmd); 845 - remove_pte_table(pte_base, addr, next, direct); 810 + remove_pte_table(pte_base, addr, next, direct, altmap); 846 811 free_pte_table(pte_base, pmd); 847 812 } 848 813 if (direct) ··· 856 809 } 857 810 858 811 static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr, 859 - unsigned long end, bool direct) 812 + unsigned long end, bool direct, 813 + struct vmem_altmap *altmap) 860 814 { 861 815 unsigned long next, pages = 0; 862 816 pmd_t *pmd_base; ··· 882 834 } 883 835 884 836 pmd_base = pud_pgtable(*pud); 885 - remove_pmd_table(pmd_base, addr, next, direct); 837 + remove_pmd_table(pmd_base, addr, next, direct, altmap); 886 838 free_pmd_table(pmd_base, pud); 887 839 } 888 840 if (direct) 889 841 update_page_count(MMU_PAGE_1G, -pages); 890 842 } 891 843 892 - static void __meminit remove_pagetable(unsigned long start, unsigned long end, 893 - bool direct) 844 + static void __meminit 845 + remove_pagetable(unsigned long start, unsigned long end, bool direct, 846 + struct vmem_altmap *altmap) 894 847 { 895 848 unsigned long addr, next; 896 849 pud_t *pud_base; ··· 920 871 } 921 872 922 873 pud_base = p4d_pgtable(*p4d); 923 - remove_pud_table(pud_base, addr, next, direct); 874 + remove_pud_table(pud_base, addr, next, direct, altmap); 924 875 free_pud_table(pud_base, p4d); 925 876 
} 926 877 ··· 943 894 944 895 int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end) 945 896 { 946 - remove_pagetable(start, end, true); 897 + remove_pagetable(start, end, true, NULL); 947 898 return 0; 948 899 } 949 900 #endif /* CONFIG_MEMORY_HOTPLUG */ ··· 975 926 return 0; 976 927 } 977 928 929 + 930 + bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap) 931 + { 932 + if (radix_enabled()) 933 + return __vmemmap_can_optimize(altmap, pgmap); 934 + 935 + return false; 936 + } 937 + 938 + int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node, 939 + unsigned long addr, unsigned long next) 940 + { 941 + int large = pmd_large(*pmdp); 942 + 943 + if (large) 944 + vmemmap_verify(pmdp_ptep(pmdp), node, addr, next); 945 + 946 + return large; 947 + } 948 + 949 + void __meminit vmemmap_set_pmd(pmd_t *pmdp, void *p, int node, 950 + unsigned long addr, unsigned long next) 951 + { 952 + pte_t entry; 953 + pte_t *ptep = pmdp_ptep(pmdp); 954 + 955 + VM_BUG_ON(!IS_ALIGNED(addr, PMD_SIZE)); 956 + entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL); 957 + set_pte_at(&init_mm, addr, ptep, entry); 958 + asm volatile("ptesync": : :"memory"); 959 + 960 + vmemmap_verify(ptep, node, addr, next); 961 + } 962 + 963 + static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmdp, unsigned long addr, 964 + int node, 965 + struct vmem_altmap *altmap, 966 + struct page *reuse) 967 + { 968 + pte_t *pte = pte_offset_kernel(pmdp, addr); 969 + 970 + if (pte_none(*pte)) { 971 + pte_t entry; 972 + void *p; 973 + 974 + if (!reuse) { 975 + /* 976 + * make sure we don't create altmap mappings 977 + * covering things outside the device. 
978 + */ 979 + if (altmap && altmap_cross_boundary(altmap, addr, PAGE_SIZE)) 980 + altmap = NULL; 981 + 982 + p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap); 983 + if (!p && altmap) 984 + p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL); 985 + if (!p) 986 + return NULL; 987 + pr_debug("PAGE_SIZE vmemmap mapping\n"); 988 + } else { 989 + /* 990 + * When a PTE/PMD entry is freed from the init_mm 991 + * there's a free_pages() call to this page allocated 992 + * above. Thus this get_page() is paired with the 993 + * put_page_testzero() on the freeing path. 994 + * This can only called by certain ZONE_DEVICE path, 995 + * and through vmemmap_populate_compound_pages() when 996 + * slab is available. 997 + */ 998 + get_page(reuse); 999 + p = page_to_virt(reuse); 1000 + pr_debug("Tail page reuse vmemmap mapping\n"); 1001 + } 1002 + 1003 + VM_BUG_ON(!PAGE_ALIGNED(addr)); 1004 + entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL); 1005 + set_pte_at(&init_mm, addr, pte, entry); 1006 + asm volatile("ptesync": : :"memory"); 1007 + } 1008 + return pte; 1009 + } 1010 + 1011 + static inline pud_t *vmemmap_pud_alloc(p4d_t *p4dp, int node, 1012 + unsigned long address) 1013 + { 1014 + pud_t *pud; 1015 + 1016 + /* All early vmemmap mapping to keep simple do it at PAGE_SIZE */ 1017 + if (unlikely(p4d_none(*p4dp))) { 1018 + if (unlikely(!slab_is_available())) { 1019 + pud = early_alloc_pgtable(PAGE_SIZE, node, 0, 0); 1020 + p4d_populate(&init_mm, p4dp, pud); 1021 + /* go to the pud_offset */ 1022 + } else 1023 + return pud_alloc(&init_mm, p4dp, address); 1024 + } 1025 + return pud_offset(p4dp, address); 1026 + } 1027 + 1028 + static inline pmd_t *vmemmap_pmd_alloc(pud_t *pudp, int node, 1029 + unsigned long address) 1030 + { 1031 + pmd_t *pmd; 1032 + 1033 + /* All early vmemmap mapping to keep simple do it at PAGE_SIZE */ 1034 + if (unlikely(pud_none(*pudp))) { 1035 + if (unlikely(!slab_is_available())) { 1036 + pmd = early_alloc_pgtable(PAGE_SIZE, node, 0, 0); 1037 + 
pud_populate(&init_mm, pudp, pmd); 1038 + } else 1039 + return pmd_alloc(&init_mm, pudp, address); 1040 + } 1041 + return pmd_offset(pudp, address); 1042 + } 1043 + 1044 + static inline pte_t *vmemmap_pte_alloc(pmd_t *pmdp, int node, 1045 + unsigned long address) 1046 + { 1047 + pte_t *pte; 1048 + 1049 + /* All early vmemmap mapping to keep simple do it at PAGE_SIZE */ 1050 + if (unlikely(pmd_none(*pmdp))) { 1051 + if (unlikely(!slab_is_available())) { 1052 + pte = early_alloc_pgtable(PAGE_SIZE, node, 0, 0); 1053 + pmd_populate(&init_mm, pmdp, pte); 1054 + } else 1055 + return pte_alloc_kernel(pmdp, address); 1056 + } 1057 + return pte_offset_kernel(pmdp, address); 1058 + } 1059 + 1060 + 1061 + 1062 + int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node, 1063 + struct vmem_altmap *altmap) 1064 + { 1065 + unsigned long addr; 1066 + unsigned long next; 1067 + pgd_t *pgd; 1068 + p4d_t *p4d; 1069 + pud_t *pud; 1070 + pmd_t *pmd; 1071 + pte_t *pte; 1072 + 1073 + for (addr = start; addr < end; addr = next) { 1074 + next = pmd_addr_end(addr, end); 1075 + 1076 + pgd = pgd_offset_k(addr); 1077 + p4d = p4d_offset(pgd, addr); 1078 + pud = vmemmap_pud_alloc(p4d, node, addr); 1079 + if (!pud) 1080 + return -ENOMEM; 1081 + pmd = vmemmap_pmd_alloc(pud, node, addr); 1082 + if (!pmd) 1083 + return -ENOMEM; 1084 + 1085 + if (pmd_none(READ_ONCE(*pmd))) { 1086 + void *p; 1087 + 1088 + /* 1089 + * keep it simple by checking addr PMD_SIZE alignment 1090 + * and verifying the device boundary condition. 1091 + * For us to use a pmd mapping, both addr and pfn should 1092 + * be aligned. We skip if addr is not aligned and for 1093 + * pfn we hope we have extra area in the altmap that 1094 + * can help to find an aligned block. This can result 1095 + * in altmap block allocation failures, in which case 1096 + * we fallback to RAM for vmemmap allocation. 
1097 + */ 1098 + if (altmap && (!IS_ALIGNED(addr, PMD_SIZE) || 1099 + altmap_cross_boundary(altmap, addr, PMD_SIZE))) { 1100 + /* 1101 + * make sure we don't create altmap mappings 1102 + * covering things outside the device. 1103 + */ 1104 + goto base_mapping; 1105 + } 1106 + 1107 + p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); 1108 + if (p) { 1109 + vmemmap_set_pmd(pmd, p, node, addr, next); 1110 + pr_debug("PMD_SIZE vmemmap mapping\n"); 1111 + continue; 1112 + } else if (altmap) { 1113 + /* 1114 + * A vmemmap block allocation can fail due to 1115 + * alignment requirements and we trying to align 1116 + * things aggressively there by running out of 1117 + * space. Try base mapping on failure. 1118 + */ 1119 + goto base_mapping; 1120 + } 1121 + } else if (vmemmap_check_pmd(pmd, node, addr, next)) { 1122 + /* 1123 + * If a huge mapping exist due to early call to 1124 + * vmemmap_populate, let's try to use that. 1125 + */ 1126 + continue; 1127 + } 1128 + base_mapping: 1129 + /* 1130 + * Not able allocate higher order memory to back memmap 1131 + * or we found a pointer to pte page. 
Allocate base page 1132 + * size vmemmap 1133 + */ 1134 + pte = vmemmap_pte_alloc(pmd, node, addr); 1135 + if (!pte) 1136 + return -ENOMEM; 1137 + 1138 + pte = radix__vmemmap_pte_populate(pmd, addr, node, altmap, NULL); 1139 + if (!pte) 1140 + return -ENOMEM; 1141 + 1142 + vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); 1143 + next = addr + PAGE_SIZE; 1144 + } 1145 + return 0; 1146 + } 1147 + 1148 + static pte_t * __meminit radix__vmemmap_populate_address(unsigned long addr, int node, 1149 + struct vmem_altmap *altmap, 1150 + struct page *reuse) 1151 + { 1152 + pgd_t *pgd; 1153 + p4d_t *p4d; 1154 + pud_t *pud; 1155 + pmd_t *pmd; 1156 + pte_t *pte; 1157 + 1158 + pgd = pgd_offset_k(addr); 1159 + p4d = p4d_offset(pgd, addr); 1160 + pud = vmemmap_pud_alloc(p4d, node, addr); 1161 + if (!pud) 1162 + return NULL; 1163 + pmd = vmemmap_pmd_alloc(pud, node, addr); 1164 + if (!pmd) 1165 + return NULL; 1166 + if (pmd_leaf(*pmd)) 1167 + /* 1168 + * The second page is mapped as a hugepage due to a nearby request. 
1169 + * Force our mapping to page size without deduplication 1170 + */ 1171 + return NULL; 1172 + pte = vmemmap_pte_alloc(pmd, node, addr); 1173 + if (!pte) 1174 + return NULL; 1175 + radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL); 1176 + vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); 1177 + 1178 + return pte; 1179 + } 1180 + 1181 + static pte_t * __meminit vmemmap_compound_tail_page(unsigned long addr, 1182 + unsigned long pfn_offset, int node) 1183 + { 1184 + pgd_t *pgd; 1185 + p4d_t *p4d; 1186 + pud_t *pud; 1187 + pmd_t *pmd; 1188 + pte_t *pte; 1189 + unsigned long map_addr; 1190 + 1191 + /* the second vmemmap page which we use for duplication */ 1192 + map_addr = addr - pfn_offset * sizeof(struct page) + PAGE_SIZE; 1193 + pgd = pgd_offset_k(map_addr); 1194 + p4d = p4d_offset(pgd, map_addr); 1195 + pud = vmemmap_pud_alloc(p4d, node, map_addr); 1196 + if (!pud) 1197 + return NULL; 1198 + pmd = vmemmap_pmd_alloc(pud, node, map_addr); 1199 + if (!pmd) 1200 + return NULL; 1201 + if (pmd_leaf(*pmd)) 1202 + /* 1203 + * The second page is mapped as a hugepage due to a nearby request. 1204 + * Force our mapping to page size without deduplication 1205 + */ 1206 + return NULL; 1207 + pte = vmemmap_pte_alloc(pmd, node, map_addr); 1208 + if (!pte) 1209 + return NULL; 1210 + /* 1211 + * Check if there exist a mapping to the left 1212 + */ 1213 + if (pte_none(*pte)) { 1214 + /* 1215 + * Populate the head page vmemmap page. 
1216 + * It can fall in different pmd, hence 1217 + * vmemmap_populate_address() 1218 + */ 1219 + pte = radix__vmemmap_populate_address(map_addr - PAGE_SIZE, node, NULL, NULL); 1220 + if (!pte) 1221 + return NULL; 1222 + /* 1223 + * Populate the tail pages vmemmap page 1224 + */ 1225 + pte = radix__vmemmap_pte_populate(pmd, map_addr, node, NULL, NULL); 1226 + if (!pte) 1227 + return NULL; 1228 + vmemmap_verify(pte, node, map_addr, map_addr + PAGE_SIZE); 1229 + return pte; 1230 + } 1231 + return pte; 1232 + } 1233 + 1234 + int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn, 1235 + unsigned long start, 1236 + unsigned long end, int node, 1237 + struct dev_pagemap *pgmap) 1238 + { 1239 + /* 1240 + * we want to map things as base page size mapping so that 1241 + * we can save space in vmemmap. We could have huge mapping 1242 + * covering out both edges. 1243 + */ 1244 + unsigned long addr; 1245 + unsigned long addr_pfn = start_pfn; 1246 + unsigned long next; 1247 + pgd_t *pgd; 1248 + p4d_t *p4d; 1249 + pud_t *pud; 1250 + pmd_t *pmd; 1251 + pte_t *pte; 1252 + 1253 + for (addr = start; addr < end; addr = next) { 1254 + 1255 + pgd = pgd_offset_k(addr); 1256 + p4d = p4d_offset(pgd, addr); 1257 + pud = vmemmap_pud_alloc(p4d, node, addr); 1258 + if (!pud) 1259 + return -ENOMEM; 1260 + pmd = vmemmap_pmd_alloc(pud, node, addr); 1261 + if (!pmd) 1262 + return -ENOMEM; 1263 + 1264 + if (pmd_leaf(READ_ONCE(*pmd))) { 1265 + /* existing huge mapping. Skip the range */ 1266 + addr_pfn += (PMD_SIZE >> PAGE_SHIFT); 1267 + next = pmd_addr_end(addr, end); 1268 + continue; 1269 + } 1270 + pte = vmemmap_pte_alloc(pmd, node, addr); 1271 + if (!pte) 1272 + return -ENOMEM; 1273 + if (!pte_none(*pte)) { 1274 + /* 1275 + * This could be because we already have a compound 1276 + * page whose VMEMMAP_RESERVE_NR pages were mapped and 1277 + * this request fall in those pages. 
1278 + */ 1279 + addr_pfn += 1; 1280 + next = addr + PAGE_SIZE; 1281 + continue; 1282 + } else { 1283 + unsigned long nr_pages = pgmap_vmemmap_nr(pgmap); 1284 + unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages); 1285 + pte_t *tail_page_pte; 1286 + 1287 + /* 1288 + * if the address is aligned to huge page size it is the 1289 + * head mapping. 1290 + */ 1291 + if (pfn_offset == 0) { 1292 + /* Populate the head page vmemmap page */ 1293 + pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL); 1294 + if (!pte) 1295 + return -ENOMEM; 1296 + vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); 1297 + 1298 + /* 1299 + * Populate the tail pages vmemmap page 1300 + * It can fall in different pmd, hence 1301 + * vmemmap_populate_address() 1302 + */ 1303 + pte = radix__vmemmap_populate_address(addr + PAGE_SIZE, node, NULL, NULL); 1304 + if (!pte) 1305 + return -ENOMEM; 1306 + 1307 + addr_pfn += 2; 1308 + next = addr + 2 * PAGE_SIZE; 1309 + continue; 1310 + } 1311 + /* 1312 + * get the 2nd mapping details 1313 + * Also create it if that doesn't exist 1314 + */ 1315 + tail_page_pte = vmemmap_compound_tail_page(addr, pfn_offset, node); 1316 + if (!tail_page_pte) { 1317 + 1318 + pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL); 1319 + if (!pte) 1320 + return -ENOMEM; 1321 + vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); 1322 + 1323 + addr_pfn += 1; 1324 + next = addr + PAGE_SIZE; 1325 + continue; 1326 + } 1327 + 1328 + pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, pte_page(*tail_page_pte)); 1329 + if (!pte) 1330 + return -ENOMEM; 1331 + vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); 1332 + 1333 + addr_pfn += 1; 1334 + next = addr + PAGE_SIZE; 1335 + continue; 1336 + } 1337 + } 1338 + return 0; 1339 + } 1340 + 1341 + 978 1342 #ifdef CONFIG_MEMORY_HOTPLUG 979 1343 void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size) 980 1344 { 981 - remove_pagetable(start, start + page_size, false); 
1345 + remove_pagetable(start, start + page_size, true, NULL); 1346 + } 1347 + 1348 + void __ref radix__vmemmap_free(unsigned long start, unsigned long end, 1349 + struct vmem_altmap *altmap) 1350 + { 1351 + remove_pagetable(start, end, false, altmap); 982 1352 } 983 1353 #endif 984 1354 #endif ··· 1430 962 #endif 1431 963 1432 964 old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1); 1433 - trace_hugepage_update(addr, old, clr, set); 965 + trace_hugepage_update_pmd(addr, old, clr, set); 966 + 967 + return old; 968 + } 969 + 970 + unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long addr, 971 + pud_t *pudp, unsigned long clr, 972 + unsigned long set) 973 + { 974 + unsigned long old; 975 + 976 + #ifdef CONFIG_DEBUG_VM 977 + WARN_ON(!pud_devmap(*pudp)); 978 + assert_spin_locked(pud_lockptr(mm, pudp)); 979 + #endif 980 + 981 + old = radix__pte_update(mm, addr, pudp_ptep(pudp), clr, set, 1); 982 + trace_hugepage_update_pud(addr, old, clr, set); 1434 983 1435 984 return old; 1436 985 } ··· 1526 1041 old = radix__pmd_hugepage_update(mm, addr, pmdp, ~0UL, 0); 1527 1042 old_pmd = __pmd(old); 1528 1043 return old_pmd; 1044 + } 1045 + 1046 + pud_t radix__pudp_huge_get_and_clear(struct mm_struct *mm, 1047 + unsigned long addr, pud_t *pudp) 1048 + { 1049 + pud_t old_pud; 1050 + unsigned long old; 1051 + 1052 + old = radix__pud_hugepage_update(mm, addr, pudp, ~0UL, 0); 1053 + old_pud = __pud(old); 1054 + return old_pud; 1529 1055 } 1530 1056 1531 1057 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
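The `vmemmap_populate_compound_pages()` path above deduplicates the vmemmap for compound device pages: within each compound unit only the head slot and the first tail slot get freshly populated vmemmap pages, and every later tail slot is wired to the same physical page as the first tail. A minimal userspace sketch of that sharing, where `populate_compound`, `backing[]`, and `next_page` are invented stand-ins for the real pte plumbing, not kernel names:

```c
#include <assert.h>

#define NSLOTS 16

/* backing[i] identifies the physical page backing vmemmap slot i */
static int backing[NSLOTS];
static int next_page = 1;   /* 0 means "not populated" */

/*
 * Populate vmemmap slots for compound units of nr_pages slots each.
 * Slot 0 (head) and slot 1 get fresh pages; slots 2.. within a unit
 * reuse slot 1's page, mirroring the head/tail split in
 * vmemmap_populate_compound_pages() on radix.
 */
static void populate_compound(int nslots, int nr_pages)
{
    for (int i = 0; i < nslots; i++) {
        int off = i % nr_pages;     /* pfn_offset within the unit */

        if (off == 0)
            backing[i] = next_page++;           /* head vmemmap page */
        else if (off == 1)
            backing[i] = next_page++;           /* first tail page */
        else
            backing[i] = backing[i - off + 1];  /* share the tail page */
    }
}

static int pages_used(void)
{
    return next_page - 1;
}
```

With 8-page compound units, 16 slots consume only 4 backing pages instead of 16, which is the memory saving the hunk is after.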
+11
arch/powerpc/mm/book3s64/radix_tlb.c
··· 987 987 } 988 988 } 989 989 preempt_enable(); 990 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); 990 991 } 991 992 EXPORT_SYMBOL(radix__flush_tlb_mm); 992 993 ··· 1021 1020 _tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL); 1022 1021 } 1023 1022 preempt_enable(); 1023 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); 1024 1024 } 1025 1025 1026 1026 void radix__flush_all_mm(struct mm_struct *mm) ··· 1230 1228 } 1231 1229 out: 1232 1230 preempt_enable(); 1231 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end); 1233 1232 } 1234 1233 1235 1234 void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start, ··· 1395 1392 } 1396 1393 out: 1397 1394 preempt_enable(); 1395 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end); 1398 1396 } 1399 1397 1400 1398 void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start, ··· 1464 1460 radix__flush_tlb_range_psize(vma->vm_mm, start, end, MMU_PAGE_2M); 1465 1461 } 1466 1462 EXPORT_SYMBOL(radix__flush_pmd_tlb_range); 1463 + 1464 + void radix__flush_pud_tlb_range(struct vm_area_struct *vma, 1465 + unsigned long start, unsigned long end) 1466 + { 1467 + radix__flush_tlb_range_psize(vma->vm_mm, start, end, MMU_PAGE_1G); 1468 + } 1469 + EXPORT_SYMBOL(radix__flush_pud_tlb_range); 1467 1470 1468 1471 void radix__flush_tlb_all(void) 1469 1472 {
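Each radix TLB flush path above now ends by calling `mmu_notifier_arch_invalidate_secondary_tlbs()` over the same range (or `[0, -1UL]` for a whole-mm flush), so secondary MMUs sharing the page tables are invalidated alongside the CPU TLB. A toy model of that pattern, with `flush_tlb_hw` and `invalidate_secondary_tlbs` as invented stand-ins:

```c
#include <assert.h>

static unsigned long notified_start, notified_end;
static int notifier_calls;

/* Stand-in for mmu_notifier_arch_invalidate_secondary_tlbs(). */
static void invalidate_secondary_tlbs(unsigned long start, unsigned long end)
{
    notifier_calls++;
    notified_start = start;
    notified_end = end;
}

/* Stand-in for the actual tlbie/tlbiel sequences. */
static void flush_tlb_hw(unsigned long start, unsigned long end)
{
    (void)start;
    (void)end;
}

/*
 * The hunk's pattern: after the CPU TLB flush, tell secondary MMUs
 * (e.g. IOMMU contexts sharing the page tables) about the same range.
 */
static void flush_tlb_range(unsigned long start, unsigned long end)
{
    flush_tlb_hw(start, end);
    invalidate_secondary_tlbs(start, end);
}

/* A full-mm flush covers the whole address space: [0, -1UL]. */
static void flush_tlb_mm(void)
{
    flush_tlb_hw(0, ~0UL);
    invalidate_secondary_tlbs(0, ~0UL);
}
```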
+14 -27
arch/powerpc/mm/cacheflush.c
··· 148 148 invalidate_icache_range(addr, addr + PAGE_SIZE); 149 149 } 150 150 151 - static void flush_dcache_icache_hugepage(struct page *page) 151 + void flush_dcache_icache_folio(struct folio *folio) 152 152 { 153 - int i; 154 - int nr = compound_nr(page); 153 + unsigned int i, nr = folio_nr_pages(folio); 155 154 156 - if (!PageHighMem(page)) { 155 + if (flush_coherent_icache()) 156 + return; 157 + 158 + if (!folio_test_highmem(folio)) { 159 + void *addr = folio_address(folio); 157 160 for (i = 0; i < nr; i++) 158 - __flush_dcache_icache(lowmem_page_address(page + i)); 159 - } else { 161 + __flush_dcache_icache(addr + i * PAGE_SIZE); 162 + } else if (IS_ENABLED(CONFIG_BOOKE) || sizeof(phys_addr_t) > sizeof(void *)) { 160 163 for (i = 0; i < nr; i++) { 161 - void *start = kmap_local_page(page + i); 164 + void *start = kmap_local_folio(folio, i * PAGE_SIZE); 162 165 163 166 __flush_dcache_icache(start); 164 167 kunmap_local(start); 165 168 } 166 - } 167 - } 168 - 169 - void flush_dcache_icache_page(struct page *page) 170 - { 171 - if (flush_coherent_icache()) 172 - return; 173 - 174 - if (PageCompound(page)) 175 - return flush_dcache_icache_hugepage(page); 176 - 177 - if (!PageHighMem(page)) { 178 - __flush_dcache_icache(lowmem_page_address(page)); 179 - } else if (IS_ENABLED(CONFIG_BOOKE) || sizeof(phys_addr_t) > sizeof(void *)) { 180 - void *start = kmap_local_page(page); 181 - 182 - __flush_dcache_icache(start); 183 - kunmap_local(start); 184 169 } else { 185 - flush_dcache_icache_phys(page_to_phys(page)); 170 + unsigned long pfn = folio_pfn(folio); 171 + for (i = 0; i < nr; i++) 172 + flush_dcache_icache_phys((pfn + i) * PAGE_SIZE); 186 173 } 187 174 } 188 - EXPORT_SYMBOL(flush_dcache_icache_page); 175 + EXPORT_SYMBOL(flush_dcache_icache_folio); 189 176 190 177 void clear_user_page(void *page, unsigned long vaddr, struct page *pg) 191 178 {
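The rewritten `flush_dcache_icache_folio()` folds the old page and hugepage variants into one routine: bail out early when the icache is coherent, otherwise flush each of the folio's pages, stepping `PAGE_SIZE` through the folio's base address in the lowmem case. A userspace sketch of that lowmem loop, with `flush_one_page` and `coherent_icache` as invented stand-ins:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

static int flushes;
static int coherent_icache;   /* CPU keeps I/D caches coherent */

/* Stand-in for __flush_dcache_icache() on one page. */
static void flush_one_page(unsigned long addr)
{
    (void)addr;
    flushes++;
}

/*
 * Model of flush_dcache_icache_folio() for a !highmem folio: a folio
 * is one virtually contiguous run, so each of its nr pages is reached
 * by adding i * PAGE_SIZE to the folio's base address.
 */
static void flush_folio(unsigned long base, unsigned int nr)
{
    if (coherent_icache)
        return;                       /* flush_coherent_icache() */
    for (unsigned int i = 0; i < nr; i++)
        flush_one_page(base + i * PAGE_SIZE);
}
```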
+2 -5
arch/powerpc/mm/fault.c
··· 469 469 if (is_exec) 470 470 flags |= FAULT_FLAG_INSTRUCTION; 471 471 472 - #ifdef CONFIG_PER_VMA_LOCK 473 472 if (!(flags & FAULT_FLAG_USER)) 474 473 goto lock_mmap; 475 474 ··· 488 489 } 489 490 490 491 fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); 491 - vma_end_read(vma); 492 + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) 493 + vma_end_read(vma); 492 494 493 495 if (!(fault & VM_FAULT_RETRY)) { 494 496 count_vm_vma_lock_event(VMA_LOCK_SUCCESS); ··· 501 501 return user_mode(regs) ? 0 : SIGBUS; 502 502 503 503 lock_mmap: 504 - #endif /* CONFIG_PER_VMA_LOCK */ 505 504 506 505 /* When running in the kernel we expect faults to occur only to 507 506 * addresses in user space. All other faults represent errors in the ··· 550 551 551 552 mmap_read_unlock(current->mm); 552 553 553 - #ifdef CONFIG_PER_VMA_LOCK 554 554 done: 555 - #endif 556 555 if (unlikely(fault & VM_FAULT_ERROR)) 557 556 return mm_fault_error(regs, address, fault); 558 557
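The fault.c hunk changes the per-VMA-lock convention: when `handle_mm_fault()` returns `VM_FAULT_RETRY` or `VM_FAULT_COMPLETED` it has already dropped the VMA read lock itself, so the caller must only call `vma_end_read()` in the other cases. A small model of that lock handoff, using a counter in place of the real lock (all names below except the `VM_FAULT_*` flags are mine):

```c
#include <assert.h>

#define VM_FAULT_RETRY     0x1
#define VM_FAULT_COMPLETED 0x2

static int read_locks;   /* outstanding VMA read locks */

static void vma_start_read(void)
{
    read_locks++;
}

static void vma_end_read(void)
{
    assert(read_locks > 0);   /* a double unlock would trip here */
    read_locks--;
}

/*
 * Model fault handler: for RETRY/COMPLETED it has already dropped the
 * per-VMA lock itself, which is the rule the hunk encodes.
 */
static int handle_fault(int result)
{
    if (result & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))
        vma_end_read();
    return result;
}

/* Caller side, as rewritten in the powerpc do_page_fault() path. */
static void fault_path(int result)
{
    vma_start_read();
    int fault = handle_fault(result);

    if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
        vma_end_read();
}
```

Dropping the condition around the final `vma_end_read()` would double-unlock on retry, which is exactly the bug class the conditional avoids.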
+30 -7
arch/powerpc/mm/init_64.c
··· 92 92 * a page table lookup here because with the hash translation we don't keep 93 93 * vmemmap details in linux page table. 94 94 */ 95 - static int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size) 95 + int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size) 96 96 { 97 97 struct page *start; 98 98 unsigned long vmemmap_end = vmemmap_addr + vmemmap_map_size; ··· 183 183 return 0; 184 184 } 185 185 186 - static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, 187 - unsigned long page_size) 186 + bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, 187 + unsigned long page_size) 188 188 { 189 189 unsigned long nr_pfn = page_size / sizeof(struct page); 190 190 unsigned long start_pfn = page_to_pfn((struct page *)start); ··· 198 198 return false; 199 199 } 200 200 201 - int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, 202 - struct vmem_altmap *altmap) 201 + static int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node, 202 + struct vmem_altmap *altmap) 203 203 { 204 204 bool altmap_alloc; 205 205 unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; ··· 272 272 return 0; 273 273 } 274 274 275 + int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, 276 + struct vmem_altmap *altmap) 277 + { 278 + 279 + #ifdef CONFIG_PPC_BOOK3S_64 280 + if (radix_enabled()) 281 + return radix__vmemmap_populate(start, end, node, altmap); 282 + #endif 283 + 284 + return __vmemmap_populate(start, end, node, altmap); 285 + } 286 + 275 287 #ifdef CONFIG_MEMORY_HOTPLUG 276 288 static unsigned long vmemmap_list_free(unsigned long start) 277 289 { ··· 315 303 return vmem_back->phys; 316 304 } 317 305 318 - void __ref vmemmap_free(unsigned long start, unsigned long end, 319 - struct vmem_altmap *altmap) 306 + static void __ref __vmemmap_free(unsigned long start, unsigned long end, 307 + 
struct vmem_altmap *altmap) 320 308 { 321 309 unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; 322 310 unsigned long page_order = get_order(page_size); ··· 373 361 vmemmap_remove_mapping(start, page_size); 374 362 } 375 363 } 364 + 365 + void __ref vmemmap_free(unsigned long start, unsigned long end, 366 + struct vmem_altmap *altmap) 367 + { 368 + #ifdef CONFIG_PPC_BOOK3S_64 369 + if (radix_enabled()) 370 + return radix__vmemmap_free(start, end, altmap); 371 + #endif 372 + return __vmemmap_free(start, end, altmap); 373 + } 374 + 376 375 #endif 377 376 void register_page_bootmem_memmap(unsigned long section_nr, 378 377 struct page *start_page, unsigned long size)
+1 -25
arch/powerpc/mm/ioremap.c
··· 41 41 return __ioremap_caller(addr, size, prot, caller); 42 42 } 43 43 44 - void __iomem *ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags) 44 + void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long flags) 45 45 { 46 46 pte_t pte = __pte(flags); 47 47 void *caller = __builtin_return_address(0); ··· 73 73 } 74 74 75 75 return 0; 76 - } 77 - 78 - void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size, 79 - pgprot_t prot, void *caller) 80 - { 81 - struct vm_struct *area; 82 - int ret; 83 - unsigned long va; 84 - 85 - area = __get_vm_area_caller(size, VM_IOREMAP, IOREMAP_START, IOREMAP_END, caller); 86 - if (area == NULL) 87 - return NULL; 88 - 89 - area->phys_addr = pa; 90 - va = (unsigned long)area->addr; 91 - 92 - ret = ioremap_page_range(va, va + size, pa, prot); 93 - if (!ret) 94 - return (void __iomem *)area->addr + offset; 95 - 96 - vunmap_range(va, va + size); 97 - free_vm_area(area); 98 - 99 - return NULL; 100 76 }
+9 -10
arch/powerpc/mm/ioremap_32.c
··· 22 22 int err; 23 23 24 24 /* 25 + * If the address lies within the first 16 MB, assume it's in ISA 26 + * memory space 27 + */ 28 + if (addr < SZ_16M) 29 + addr += _ISA_MEM_BASE; 30 + 31 + /* 25 32 * Choose an address to map it to. 26 33 * Once the vmalloc system is running, we use it. 27 34 * Before then, we use space going down from IOREMAP_TOP ··· 37 30 p = addr & PAGE_MASK; 38 31 offset = addr & ~PAGE_MASK; 39 32 size = PAGE_ALIGN(addr + size) - p; 40 - 41 - /* 42 - * If the address lies within the first 16 MB, assume it's in ISA 43 - * memory space 44 - */ 45 - if (p < 16 * 1024 * 1024) 46 - p += _ISA_MEM_BASE; 47 33 48 34 #ifndef CONFIG_CRASH_DUMP 49 35 /* ··· 63 63 return (void __iomem *)v + offset; 64 64 65 65 if (slab_is_available()) 66 - return do_ioremap(p, offset, size, prot, caller); 66 + return generic_ioremap_prot(addr, size, prot); 67 67 68 68 /* 69 69 * Should check if it is a candidate for a BAT mapping ··· 87 87 if (v_block_mapped((unsigned long)addr)) 88 88 return; 89 89 90 - if (addr > high_memory && (unsigned long)addr < ioremap_bot) 91 - vunmap((void *)(PAGE_MASK & (unsigned long)addr)); 90 + generic_iounmap(addr); 92 91 } 93 92 EXPORT_SYMBOL(iounmap);
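The ioremap_32.c hunk moves the ISA rebase (`addr += _ISA_MEM_BASE` for addresses under 16 MB) ahead of the page-alignment arithmetic, so the base/offset/size are derived from the final physical address and the later `generic_ioremap_prot(addr, ...)` call sees the rebased value. A sketch of that ordering, with invented helper names and an assumed `_ISA_MEM_BASE` value for illustration:

```c
#include <assert.h>

#define PAGE_MASK     (~0xFFFUL)
#define ISA_MEM_BASE  0xF0000000UL   /* illustrative value only */
#define SZ_16M        (16UL << 20)

/* Rebase ISA-space addresses first, as the hunk now does. */
static unsigned long isa_fixup(unsigned long addr)
{
    if (addr < SZ_16M)
        addr += ISA_MEM_BASE;
    return addr;
}

/* Page-aligned mapping base, derived from the *rebased* address. */
static unsigned long map_base(unsigned long addr)
{
    addr = isa_fixup(addr);   /* must happen before alignment */
    return addr & PAGE_MASK;
}

/* Byte offset into the mapped page, also from the rebased address. */
static unsigned long map_offset(unsigned long addr)
{
    return isa_fixup(addr) & ~PAGE_MASK;
}
```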
+2 -10
arch/powerpc/mm/ioremap_64.c
··· 29 29 return NULL; 30 30 31 31 if (slab_is_available()) 32 - return do_ioremap(paligned, offset, size, prot, caller); 32 + return generic_ioremap_prot(addr, size, prot); 33 33 34 34 pr_warn("ioremap() called early from %pS. Use early_ioremap() instead\n", caller); 35 35 ··· 49 49 */ 50 50 void iounmap(volatile void __iomem *token) 51 51 { 52 - void *addr; 53 - 54 52 if (!slab_is_available()) 55 53 return; 56 54 57 - addr = (void *)((unsigned long __force)PCI_FIX_ADDR(token) & PAGE_MASK); 58 - 59 - if ((unsigned long)addr < ioremap_bot) { 60 - pr_warn("Attempt to iounmap early bolted mapping at 0x%p\n", addr); 61 - return; 62 - } 63 - vunmap(addr); 55 + generic_iounmap(PCI_FIX_ADDR(token)); 64 56 } 65 57 EXPORT_SYMBOL(iounmap);
+2 -1
arch/powerpc/mm/nohash/e500_hugetlbpage.c
··· 178 178 * 179 179 * This must always be called with the pte lock held. 180 180 */ 181 - void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) 181 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma, 182 + unsigned long address, pte_t *ptep, unsigned int nr) 182 183 { 183 184 if (is_vm_hugetlb_page(vma)) 184 185 book3e_hugetlb_preload(vma, address, *ptep);
+47 -24
arch/powerpc/mm/pgtable-frag.c
··· 18 18 void pte_frag_destroy(void *pte_frag) 19 19 { 20 20 int count; 21 - struct page *page; 21 + struct ptdesc *ptdesc; 22 22 23 - page = virt_to_page(pte_frag); 23 + ptdesc = virt_to_ptdesc(pte_frag); 24 24 /* drop all the pending references */ 25 25 count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT; 26 26 /* We allow PTE_FRAG_NR fragments from a PTE page */ 27 - if (atomic_sub_and_test(PTE_FRAG_NR - count, &page->pt_frag_refcount)) { 28 - pgtable_pte_page_dtor(page); 29 - __free_page(page); 27 + if (atomic_sub_and_test(PTE_FRAG_NR - count, &ptdesc->pt_frag_refcount)) { 28 + pagetable_pte_dtor(ptdesc); 29 + pagetable_free(ptdesc); 30 30 } 31 31 } 32 32 ··· 55 55 static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel) 56 56 { 57 57 void *ret = NULL; 58 - struct page *page; 58 + struct ptdesc *ptdesc; 59 59 60 60 if (!kernel) { 61 - page = alloc_page(PGALLOC_GFP | __GFP_ACCOUNT); 62 - if (!page) 61 + ptdesc = pagetable_alloc(PGALLOC_GFP | __GFP_ACCOUNT, 0); 62 + if (!ptdesc) 63 63 return NULL; 64 - if (!pgtable_pte_page_ctor(page)) { 65 - __free_page(page); 64 + if (!pagetable_pte_ctor(ptdesc)) { 65 + pagetable_free(ptdesc); 66 66 return NULL; 67 67 } 68 68 } else { 69 - page = alloc_page(PGALLOC_GFP); 70 - if (!page) 69 + ptdesc = pagetable_alloc(PGALLOC_GFP, 0); 70 + if (!ptdesc) 71 71 return NULL; 72 72 } 73 73 74 - atomic_set(&page->pt_frag_refcount, 1); 74 + atomic_set(&ptdesc->pt_frag_refcount, 1); 75 75 76 - ret = page_address(page); 76 + ret = ptdesc_address(ptdesc); 77 77 /* 78 78 * if we support only one fragment just return the 79 79 * allocated page. ··· 82 82 return ret; 83 83 spin_lock(&mm->page_table_lock); 84 84 /* 85 - * If we find pgtable_page set, we return 85 + * If we find ptdesc_page set, we return 86 86 * the allocated page with single fragment 87 87 * count. 
88 88 */ 89 89 if (likely(!pte_frag_get(&mm->context))) { 90 - atomic_set(&page->pt_frag_refcount, PTE_FRAG_NR); 90 + atomic_set(&ptdesc->pt_frag_refcount, PTE_FRAG_NR); 91 91 pte_frag_set(&mm->context, ret + PTE_FRAG_SIZE); 92 92 } 93 93 spin_unlock(&mm->page_table_lock); ··· 106 106 return __alloc_for_ptecache(mm, kernel); 107 107 } 108 108 109 + static void pte_free_now(struct rcu_head *head) 110 + { 111 + struct ptdesc *ptdesc; 112 + 113 + ptdesc = container_of(head, struct ptdesc, pt_rcu_head); 114 + pagetable_pte_dtor(ptdesc); 115 + pagetable_free(ptdesc); 116 + } 117 + 109 118 void pte_fragment_free(unsigned long *table, int kernel) 110 119 { 111 - struct page *page = virt_to_page(table); 120 + struct ptdesc *ptdesc = virt_to_ptdesc(table); 112 121 113 - if (PageReserved(page)) 114 - return free_reserved_page(page); 122 + if (pagetable_is_reserved(ptdesc)) 123 + return free_reserved_ptdesc(ptdesc); 115 124 116 - BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0); 117 - if (atomic_dec_and_test(&page->pt_frag_refcount)) { 118 - if (!kernel) 119 - pgtable_pte_page_dtor(page); 120 - __free_page(page); 125 + BUG_ON(atomic_read(&ptdesc->pt_frag_refcount) <= 0); 126 + if (atomic_dec_and_test(&ptdesc->pt_frag_refcount)) { 127 + if (kernel) 128 + pagetable_free(ptdesc); 129 + else if (folio_test_clear_active(ptdesc_folio(ptdesc))) 130 + call_rcu(&ptdesc->pt_rcu_head, pte_free_now); 131 + else 132 + pte_free_now(&ptdesc->pt_rcu_head); 121 133 } 122 134 } 135 + 136 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 137 + void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) 138 + { 139 + struct page *page; 140 + 141 + page = virt_to_page(pgtable); 142 + SetPageActive(page); 143 + pte_fragment_free((unsigned long *)pgtable, 0); 144 + } 145 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
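The pgtable-frag.c conversion keeps the fragment scheme (one page carved into `PTE_FRAG_NR` pte fragments, freed when `pt_frag_refcount` drops to zero) and adds `pte_free_defer()`: mark the backing folio active, drop a reference, and let the last free go through RCU. A userspace model of that lifecycle, with simplified names and `call_rcu` modeled as an immediate call plus a flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define FRAG_NR 4   /* fragments per page, like PTE_FRAG_NR */

struct frag_page {
    int  refcount;    /* models pt_frag_refcount */
    bool active;      /* models the folio Active flag */
    bool rcu_freed;   /* set when freed via (modeled) call_rcu */
    bool freed;
};

/* Hand out a page pre-loaded with FRAG_NR fragment references. */
static struct frag_page *alloc_frag_page(void)
{
    struct frag_page *pt = calloc(1, sizeof(*pt));

    pt->refcount = FRAG_NR;
    return pt;
}

static void pte_free_now(struct frag_page *pt)
{
    pt->freed = true;
}

/*
 * Model of pte_fragment_free(): the last reference frees the page,
 * through RCU when a deferred free was requested (here the RCU grace
 * period is collapsed into an immediate call for simplicity).
 */
static void frag_free(struct frag_page *pt)
{
    assert(pt->refcount > 0);
    if (--pt->refcount == 0) {
        if (pt->active) {           /* folio_test_clear_active() */
            pt->active = false;
            pt->rcu_freed = true;   /* call_rcu(..., pte_free_now) */
        }
        pte_free_now(pt);
    }
}

/* Model of pte_free_defer(): mark active, then drop one reference. */
static void frag_free_defer(struct frag_page *pt)
{
    pt->active = true;              /* SetPageActive() */
    frag_free(pt);
}
```

The RCU detour is what lets the shmem/file THP retract path (from Hugh Dickins' series in this merge) free retracted page tables safely against lockless GUP walkers.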
+37 -24
arch/powerpc/mm/pgtable.c
··· 58 58 return 0; 59 59 } 60 60 61 - static struct page *maybe_pte_to_page(pte_t pte) 61 + static struct folio *maybe_pte_to_folio(pte_t pte) 62 62 { 63 63 unsigned long pfn = pte_pfn(pte); 64 64 struct page *page; ··· 68 68 page = pfn_to_page(pfn); 69 69 if (PageReserved(page)) 70 70 return NULL; 71 - return page; 71 + return page_folio(page); 72 72 } 73 73 74 74 #ifdef CONFIG_PPC_BOOK3S ··· 84 84 pte = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS); 85 85 if (pte_looks_normal(pte) && !(cpu_has_feature(CPU_FTR_COHERENT_ICACHE) || 86 86 cpu_has_feature(CPU_FTR_NOEXECUTE))) { 87 - struct page *pg = maybe_pte_to_page(pte); 88 - if (!pg) 87 + struct folio *folio = maybe_pte_to_folio(pte); 88 + if (!folio) 89 89 return pte; 90 - if (!test_bit(PG_dcache_clean, &pg->flags)) { 91 - flush_dcache_icache_page(pg); 92 - set_bit(PG_dcache_clean, &pg->flags); 90 + if (!test_bit(PG_dcache_clean, &folio->flags)) { 91 + flush_dcache_icache_folio(folio); 92 + set_bit(PG_dcache_clean, &folio->flags); 93 93 } 94 94 } 95 95 return pte; ··· 107 107 */ 108 108 static inline pte_t set_pte_filter(pte_t pte) 109 109 { 110 - struct page *pg; 110 + struct folio *folio; 111 111 112 112 if (radix_enabled()) 113 113 return pte; ··· 120 120 return pte; 121 121 122 122 /* If you set _PAGE_EXEC on weird pages you're on your own */ 123 - pg = maybe_pte_to_page(pte); 124 - if (unlikely(!pg)) 123 + folio = maybe_pte_to_folio(pte); 124 + if (unlikely(!folio)) 125 125 return pte; 126 126 127 127 /* If the page clean, we move on */ 128 - if (test_bit(PG_dcache_clean, &pg->flags)) 128 + if (test_bit(PG_dcache_clean, &folio->flags)) 129 129 return pte; 130 130 131 131 /* If it's an exec fault, we flush the cache and make it clean */ 132 132 if (is_exec_fault()) { 133 - flush_dcache_icache_page(pg); 134 - set_bit(PG_dcache_clean, &pg->flags); 133 + flush_dcache_icache_folio(folio); 134 + set_bit(PG_dcache_clean, &folio->flags); 135 135 return pte; 136 136 } 137 137 ··· 142 142 static pte_t 
set_access_flags_filter(pte_t pte, struct vm_area_struct *vma, 143 143 int dirty) 144 144 { 145 - struct page *pg; 145 + struct folio *folio; 146 146 147 147 if (IS_ENABLED(CONFIG_PPC_BOOK3S_64)) 148 148 return pte; ··· 168 168 #endif /* CONFIG_DEBUG_VM */ 169 169 170 170 /* If you set _PAGE_EXEC on weird pages you're on your own */ 171 - pg = maybe_pte_to_page(pte); 172 - if (unlikely(!pg)) 171 + folio = maybe_pte_to_folio(pte); 172 + if (unlikely(!folio)) 173 173 goto bail; 174 174 175 175 /* If the page is already clean, we move on */ 176 - if (test_bit(PG_dcache_clean, &pg->flags)) 176 + if (test_bit(PG_dcache_clean, &folio->flags)) 177 177 goto bail; 178 178 179 179 /* Clean the page and set PG_dcache_clean */ 180 - flush_dcache_icache_page(pg); 181 - set_bit(PG_dcache_clean, &pg->flags); 180 + flush_dcache_icache_folio(folio); 181 + set_bit(PG_dcache_clean, &folio->flags); 182 182 183 183 bail: 184 184 return pte_mkexec(pte); ··· 187 187 /* 188 188 * set_pte stores a linux PTE into the linux page table. 189 189 */ 190 - void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, 191 - pte_t pte) 190 + void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, 191 + pte_t pte, unsigned int nr) 192 192 { 193 193 /* 194 194 * Make sure hardware valid bit is not set. 
We don't do ··· 203 203 pte = set_pte_filter(pte); 204 204 205 205 /* Perform the setting of the PTE */ 206 - __set_pte_at(mm, addr, ptep, pte, 0); 206 + arch_enter_lazy_mmu_mode(); 207 + for (;;) { 208 + __set_pte_at(mm, addr, ptep, pte, 0); 209 + if (--nr == 0) 210 + break; 211 + ptep++; 212 + pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT)); 213 + addr += PAGE_SIZE; 214 + } 215 + arch_leave_lazy_mmu_mode(); 207 216 } 208 217 209 218 void unmap_kernel_page(unsigned long va) ··· 320 311 p4d_t *p4d; 321 312 pud_t *pud; 322 313 pmd_t *pmd; 314 + pte_t *pte; 315 + spinlock_t *ptl; 323 316 324 317 if (mm == &init_mm) 325 318 return; ··· 340 329 */ 341 330 if (pmd_none(*pmd)) 342 331 return; 343 - BUG_ON(!pmd_present(*pmd)); 344 - assert_spin_locked(pte_lockptr(mm, pmd)); 332 + pte = pte_offset_map_nolock(mm, pmd, addr, &ptl); 333 + BUG_ON(!pte); 334 + assert_spin_locked(ptl); 335 + pte_unmap(pte); 345 336 } 346 337 #endif /* CONFIG_DEBUG_VM */ 347 338
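The `set_pte_at()` to `set_ptes()` conversion above installs a batch of `nr` consecutive PTEs from one template entry, advancing the PFN field by one page per slot (`pte_val(pte) + (1UL << PTE_RPN_SHIFT)` on powerpc). A userspace sketch of that batching loop over a toy page table, with `PFN_SHIFT` and the flat `ptes[]` array as invented simplifications:

```c
#include <assert.h>

#define PFN_SHIFT    12      /* stand-in for PTE_RPN_SHIFT */
#define FLAG_PRESENT 0x1UL

static unsigned long ptes[8];   /* toy page table */

static unsigned long mk_pte(unsigned long pfn, unsigned long flags)
{
    return (pfn << PFN_SHIFT) | flags;
}

/*
 * Model of the new set_ptes(): write nr consecutive entries, bumping
 * the PFN by one page each step while keeping the permission bits,
 * using the same for(;;)/--nr loop shape as the hunk.
 */
static void set_ptes(unsigned int idx, unsigned long pte, unsigned int nr)
{
    for (;;) {
        ptes[idx] = pte;
        if (--nr == 0)
            break;
        idx++;
        pte += 1UL << PFN_SHIFT;
    }
}
```

Batching lets the arch enter/leave lazy MMU mode once per folio instead of once per page, which is the point of the conversion (the riscv and s390 hunks later in this merge follow the same shape).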
+1
arch/powerpc/platforms/Kconfig.cputype
··· 94 94 select PPC_FPU 95 95 select PPC_HAVE_PMU_SUPPORT 96 96 select HAVE_ARCH_TRANSPARENT_HUGEPAGE 97 + select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 97 98 select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION 98 99 select ARCH_ENABLE_SPLIT_PMD_PTLOCK 99 100 select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+1 -1
arch/powerpc/platforms/pseries/hotplug-memory.c
··· 637 637 nid = first_online_node; 638 638 639 639 /* Add the memory */ 640 - rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE); 640 + rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_MEMMAP_ON_MEMORY); 641 641 if (rc) { 642 642 invalidate_lmb_associativity_index(lmb); 643 643 return rc;
+1 -1
arch/powerpc/xmon/xmon.c
··· 1084 1084 memzcan(); 1085 1085 break; 1086 1086 case 'i': 1087 - show_mem(0, NULL); 1087 + show_mem(); 1088 1088 break; 1089 1089 default: 1090 1090 termch = cmd;
+1 -1
arch/riscv/Kconfig
··· 53 53 select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT 54 54 select ARCH_WANT_HUGE_PMD_SHARE if 64BIT 55 55 select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL 56 - select ARCH_WANT_OPTIMIZE_VMEMMAP 56 + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP 57 57 select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE 58 58 select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU 59 59 select BUILDTIME_TABLE_SORT if MMU
+9 -10
arch/riscv/include/asm/cacheflush.h
··· 15 15 16 16 #define PG_dcache_clean PG_arch_1 17 17 18 + static inline void flush_dcache_folio(struct folio *folio) 19 + { 20 + if (test_bit(PG_dcache_clean, &folio->flags)) 21 + clear_bit(PG_dcache_clean, &folio->flags); 22 + } 23 + #define flush_dcache_folio flush_dcache_folio 24 + #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 25 + 18 26 static inline void flush_dcache_page(struct page *page) 19 27 { 20 - /* 21 - * HugeTLB pages are always fully mapped and only head page will be 22 - * set PG_dcache_clean (see comments in flush_icache_pte()). 23 - */ 24 - if (PageHuge(page)) 25 - page = compound_head(page); 26 - 27 - if (test_bit(PG_dcache_clean, &page->flags)) 28 - clear_bit(PG_dcache_clean, &page->flags); 28 + flush_dcache_folio(page_folio(page)); 29 29 } 30 - #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 31 30 32 31 /* 33 32 * RISC-V doesn't have an instruction to flush parts of the instruction cache,
+1
arch/riscv/include/asm/hugetlb.h
··· 2 2 #ifndef _ASM_RISCV_HUGETLB_H 3 3 #define _ASM_RISCV_HUGETLB_H 4 4 5 + #include <asm/cacheflush.h> 5 6 #include <asm/page.h> 6 7 7 8 static inline void arch_clear_hugepage_flags(struct page *page)
+4 -4
arch/riscv/include/asm/pgalloc.h
··· 153 153 154 154 #endif /* __PAGETABLE_PMD_FOLDED */ 155 155 156 - #define __pte_free_tlb(tlb, pte, buf) \ 157 - do { \ 158 - pgtable_pte_page_dtor(pte); \ 159 - tlb_remove_page((tlb), pte); \ 156 + #define __pte_free_tlb(tlb, pte, buf) \ 157 + do { \ 158 + pagetable_pte_dtor(page_ptdesc(pte)); \ 159 + tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\ 160 160 } while (0) 161 161 #endif /* CONFIG_MMU */ 162 162
+29 -18
arch/riscv/include/asm/pgtable.h
··· 447 447 448 448 449 449 /* Commit new configuration to MMU hardware */ 450 - static inline void update_mmu_cache(struct vm_area_struct *vma, 451 - unsigned long address, pte_t *ptep) 450 + static inline void update_mmu_cache_range(struct vm_fault *vmf, 451 + struct vm_area_struct *vma, unsigned long address, 452 + pte_t *ptep, unsigned int nr) 452 453 { 453 454 /* 454 455 * The kernel assumes that TLBs don't cache invalid entries, but ··· 458 457 * Relying on flush_tlb_fix_spurious_fault would suffice, but 459 458 * the extra traps reduce performance. So, eagerly SFENCE.VMA. 460 459 */ 461 - local_flush_tlb_page(address); 460 + while (nr--) 461 + local_flush_tlb_page(address + nr * PAGE_SIZE); 462 462 } 463 + #define update_mmu_cache(vma, addr, ptep) \ 464 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 463 465 464 466 #define __HAVE_ARCH_UPDATE_MMU_TLB 465 467 #define update_mmu_tlb update_mmu_cache ··· 493 489 494 490 void flush_icache_pte(pte_t pte); 495 491 496 - static inline void __set_pte_at(struct mm_struct *mm, 497 - unsigned long addr, pte_t *ptep, pte_t pteval) 492 + static inline void __set_pte_at(pte_t *ptep, pte_t pteval) 498 493 { 499 494 if (pte_present(pteval) && pte_exec(pteval)) 500 495 flush_icache_pte(pteval); ··· 501 498 set_pte(ptep, pteval); 502 499 } 503 500 504 - static inline void set_pte_at(struct mm_struct *mm, 505 - unsigned long addr, pte_t *ptep, pte_t pteval) 501 + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, 502 + pte_t *ptep, pte_t pteval, unsigned int nr) 506 503 { 507 - page_table_check_pte_set(mm, addr, ptep, pteval); 508 - __set_pte_at(mm, addr, ptep, pteval); 504 + page_table_check_ptes_set(mm, ptep, pteval, nr); 505 + 506 + for (;;) { 507 + __set_pte_at(ptep, pteval); 508 + if (--nr == 0) 509 + break; 510 + ptep++; 511 + pte_val(pteval) += 1 << _PAGE_PFN_SHIFT; 512 + } 509 513 } 514 + #define set_ptes set_ptes 510 515 511 516 static inline void pte_clear(struct mm_struct *mm, 512 517 unsigned 
long addr, pte_t *ptep) 513 518 { 514 - __set_pte_at(mm, addr, ptep, __pte(0)); 519 + __set_pte_at(ptep, __pte(0)); 515 520 } 516 521 517 522 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS ··· 528 517 pte_t entry, int dirty) 529 518 { 530 519 if (!pte_same(*ptep, entry)) 531 - set_pte_at(vma->vm_mm, address, ptep, entry); 520 + __set_pte_at(ptep, entry); 532 521 /* 533 522 * update_mmu_cache will unconditionally execute, handling both 534 523 * the case that the PTE changed and the spurious fault case. ··· 542 531 { 543 532 pte_t pte = __pte(atomic_long_xchg((atomic_long_t *)ptep, 0)); 544 533 545 - page_table_check_pte_clear(mm, address, pte); 534 + page_table_check_pte_clear(mm, pte); 546 535 547 536 return pte; 548 537 } ··· 700 689 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, 701 690 pmd_t *pmdp, pmd_t pmd) 702 691 { 703 - page_table_check_pmd_set(mm, addr, pmdp, pmd); 704 - return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd)); 692 + page_table_check_pmd_set(mm, pmdp, pmd); 693 + return __set_pte_at((pte_t *)pmdp, pmd_pte(pmd)); 705 694 } 706 695 707 696 static inline void set_pud_at(struct mm_struct *mm, unsigned long addr, 708 697 pud_t *pudp, pud_t pud) 709 698 { 710 - page_table_check_pud_set(mm, addr, pudp, pud); 711 - return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud)); 699 + page_table_check_pud_set(mm, pudp, pud); 700 + return __set_pte_at((pte_t *)pudp, pud_pte(pud)); 712 701 } 713 702 714 703 #ifdef CONFIG_PAGE_TABLE_CHECK ··· 755 744 { 756 745 pmd_t pmd = __pmd(atomic_long_xchg((atomic_long_t *)pmdp, 0)); 757 746 758 - page_table_check_pmd_clear(mm, address, pmd); 747 + page_table_check_pmd_clear(mm, pmd); 759 748 760 749 return pmd; 761 750 } ··· 771 760 static inline pmd_t pmdp_establish(struct vm_area_struct *vma, 772 761 unsigned long address, pmd_t *pmdp, pmd_t pmd) 773 762 { 774 - page_table_check_pmd_set(vma->vm_mm, address, pmdp, pmd); 763 + page_table_check_pmd_set(vma->vm_mm, pmdp, pmd); 775 764 return 
__pmd(atomic_long_xchg((atomic_long_t *)pmdp, pmd_val(pmd))); 776 765 } 777 766
+3 -10
arch/riscv/mm/cacheflush.c
··· 82 82 #ifdef CONFIG_MMU 83 83 void flush_icache_pte(pte_t pte) 84 84 { 85 - struct page *page = pte_page(pte); 85 + struct folio *folio = page_folio(pte_page(pte)); 86 86 87 - /* 88 - * HugeTLB pages are always fully mapped, so only setting head page's 89 - * PG_dcache_clean flag is enough. 90 - */ 91 - if (PageHuge(page)) 92 - page = compound_head(page); 93 - 94 - if (!test_bit(PG_dcache_clean, &page->flags)) { 87 + if (!test_bit(PG_dcache_clean, &folio->flags)) { 95 88 flush_icache_all(); 96 - set_bit(PG_dcache_clean, &page->flags); 89 + set_bit(PG_dcache_clean, &folio->flags); 97 90 } 98 91 } 99 92 #endif /* CONFIG_MMU */
+2 -5
arch/riscv/mm/fault.c
··· 283 283 flags |= FAULT_FLAG_WRITE; 284 284 else if (cause == EXC_INST_PAGE_FAULT) 285 285 flags |= FAULT_FLAG_INSTRUCTION; 286 - #ifdef CONFIG_PER_VMA_LOCK 287 286 if (!(flags & FAULT_FLAG_USER)) 288 287 goto lock_mmap; 289 288 ··· 296 297 } 297 298 298 299 fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); 299 - vma_end_read(vma); 300 + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) 301 + vma_end_read(vma); 300 302 301 303 if (!(fault & VM_FAULT_RETRY)) { 302 304 count_vm_vma_lock_event(VMA_LOCK_SUCCESS); ··· 311 311 return; 312 312 } 313 313 lock_mmap: 314 - #endif /* CONFIG_PER_VMA_LOCK */ 315 314 316 315 retry: 317 316 vma = lock_mm_and_find_vma(mm, addr, regs); ··· 367 368 368 369 mmap_read_unlock(mm); 369 370 370 - #ifdef CONFIG_PER_VMA_LOCK 371 371 done: 372 - #endif 373 372 if (unlikely(fault & VM_FAULT_ERROR)) { 374 373 tsk->thread.bad_cause = cause; 375 374 mm_fault_error(regs, addr, fault);
+6 -10
arch/riscv/mm/init.c
··· 359 359 360 360 static phys_addr_t __init alloc_pte_late(uintptr_t va) 361 361 { 362 - unsigned long vaddr; 362 + struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); 363 363 364 - vaddr = __get_free_page(GFP_KERNEL); 365 - BUG_ON(!vaddr || !pgtable_pte_page_ctor(virt_to_page((void *)vaddr))); 366 - 367 - return __pa(vaddr); 364 + BUG_ON(!ptdesc || !pagetable_pte_ctor(ptdesc)); 365 + return __pa((pte_t *)ptdesc_address(ptdesc)); 368 366 } 369 367 370 368 static void __init create_pte_mapping(pte_t *ptep, ··· 440 442 441 443 static phys_addr_t __init alloc_pmd_late(uintptr_t va) 442 444 { 443 - unsigned long vaddr; 445 + struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); 444 446 445 - vaddr = __get_free_page(GFP_KERNEL); 446 - BUG_ON(!vaddr || !pgtable_pmd_page_ctor(virt_to_page((void *)vaddr))); 447 - 448 - return __pa(vaddr); 447 + BUG_ON(!ptdesc || !pagetable_pmd_ctor(ptdesc)); 448 + return __pa((pmd_t *)ptdesc_address(ptdesc)); 449 449 } 450 450 451 451 static void __init create_pmd_mapping(pmd_t *pmdp,
+2 -1
arch/s390/Kconfig
··· 127 127 select ARCH_WANTS_NO_INSTR 128 128 select ARCH_WANT_DEFAULT_BPF_JIT 129 129 select ARCH_WANT_IPC_PARSE_VERSION 130 - select ARCH_WANT_OPTIMIZE_VMEMMAP 130 + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP 131 131 select BUILDTIME_TABLE_SORT 132 132 select CLONE_BACKWARDS2 133 133 select DMA_OPS if PCI ··· 143 143 select GENERIC_SMP_IDLE_THREAD 144 144 select GENERIC_TIME_VSYSCALL 145 145 select GENERIC_VDSO_TIME_NS 146 + select GENERIC_IOREMAP if PCI 146 147 select HAVE_ALIGNED_STRUCT_PAGE if SLUB 147 148 select HAVE_ARCH_AUDITSYSCALL 148 149 select HAVE_ARCH_JUMP_LABEL
+12 -9
arch/s390/include/asm/io.h
··· 22 22 23 23 #define IO_SPACE_LIMIT 0 24 24 25 - void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot); 26 - void __iomem *ioremap(phys_addr_t addr, size_t size); 27 - void __iomem *ioremap_wc(phys_addr_t addr, size_t size); 28 - void __iomem *ioremap_wt(phys_addr_t addr, size_t size); 29 - void iounmap(volatile void __iomem *addr); 25 + /* 26 + * I/O memory mapping functions. 27 + */ 28 + #define ioremap_prot ioremap_prot 29 + #define iounmap iounmap 30 + 31 + #define _PAGE_IOREMAP pgprot_val(PAGE_KERNEL) 32 + 33 + #define ioremap_wc(addr, size) \ 34 + ioremap_prot((addr), (size), pgprot_val(pgprot_writecombine(PAGE_KERNEL))) 35 + #define ioremap_wt(addr, size) \ 36 + ioremap_prot((addr), (size), pgprot_val(pgprot_writethrough(PAGE_KERNEL))) 30 37 31 38 static inline void __iomem *ioport_map(unsigned long port, unsigned int nr) 32 39 { ··· 57 50 #define pci_iounmap pci_iounmap 58 51 #define pci_iomap_wc pci_iomap_wc 59 52 #define pci_iomap_wc_range pci_iomap_wc_range 60 - 61 - #define ioremap ioremap 62 - #define ioremap_wt ioremap_wt 63 - #define ioremap_wc ioremap_wc 64 53 65 54 #define memcpy_fromio(dst, src, count) zpci_memcpy_fromio(dst, src, count) 66 55 #define memcpy_toio(dst, src, count) zpci_memcpy_toio(dst, src, count)
+6 -2
arch/s390/include/asm/pgalloc.h
··· 86 86 if (!table) 87 87 return NULL; 88 88 crst_table_init(table, _SEGMENT_ENTRY_EMPTY); 89 - if (!pgtable_pmd_page_ctor(virt_to_page(table))) { 89 + if (!pagetable_pmd_ctor(virt_to_ptdesc(table))) { 90 90 crst_table_free(mm, table); 91 91 return NULL; 92 92 } ··· 97 97 { 98 98 if (mm_pmd_folded(mm)) 99 99 return; 100 - pgtable_pmd_page_dtor(virt_to_page(pmd)); 100 + pagetable_pmd_dtor(virt_to_ptdesc(pmd)); 101 101 crst_table_free(mm, (unsigned long *) pmd); 102 102 } 103 103 ··· 142 142 143 143 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte) 144 144 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte) 145 + 146 + /* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */ 147 + #define pte_free_defer pte_free_defer 148 + void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable); 145 149 146 150 void vmem_map_init(void); 147 151 void *vmem_crst_alloc(unsigned long val);
+24 -9
arch/s390/include/asm/pgtable.h
··· 47 47 * tables contain all the necessary information. 48 48 */ 49 49 #define update_mmu_cache(vma, address, ptep) do { } while (0) 50 + #define update_mmu_cache_range(vmf, vma, addr, ptep, nr) do { } while (0) 50 51 #define update_mmu_cache_pmd(vma, address, ptep) do { } while (0) 51 52 52 53 /* ··· 1315 1314 pgprot_t pgprot_writethrough(pgprot_t prot); 1316 1315 1317 1316 /* 1318 - * Certain architectures need to do special things when PTEs 1319 - * within a page table are directly modified. Thus, the following 1320 - * hook is made available. 1317 + * Set multiple PTEs to consecutive pages with a single call. All PTEs 1318 + * are within the same folio, PMD and VMA. 1321 1319 */ 1322 - static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, 1323 - pte_t *ptep, pte_t entry) 1320 + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, 1321 + pte_t *ptep, pte_t entry, unsigned int nr) 1324 1322 { 1325 1323 if (pte_present(entry)) 1326 1324 entry = clear_pte_bit(entry, __pgprot(_PAGE_UNUSED)); 1327 - if (mm_has_pgste(mm)) 1328 - ptep_set_pte_at(mm, addr, ptep, entry); 1329 - else 1330 - set_pte(ptep, entry); 1325 + if (mm_has_pgste(mm)) { 1326 + for (;;) { 1327 + ptep_set_pte_at(mm, addr, ptep, entry); 1328 + if (--nr == 0) 1329 + break; 1330 + ptep++; 1331 + entry = __pte(pte_val(entry) + PAGE_SIZE); 1332 + addr += PAGE_SIZE; 1333 + } 1334 + } else { 1335 + for (;;) { 1336 + set_pte(ptep, entry); 1337 + if (--nr == 0) 1338 + break; 1339 + ptep++; 1340 + entry = __pte(pte_val(entry) + PAGE_SIZE); 1341 + } 1342 + } 1331 1343 } 1344 + #define set_ptes set_ptes 1332 1345 1333 1346 /* 1334 1347 * Conversion functions: convert a page and protection to a page entry,
+2 -2
arch/s390/include/asm/tlb.h
··· 89 89 { 90 90 if (mm_pmd_folded(tlb->mm)) 91 91 return; 92 - pgtable_pmd_page_dtor(virt_to_page(pmd)); 92 + pagetable_pmd_dtor(virt_to_ptdesc(pmd)); 93 93 __tlb_adjust_range(tlb, address, PAGE_SIZE); 94 94 tlb->mm->context.flush_mm = 1; 95 95 tlb->freed_tables = 1; 96 96 tlb->cleared_puds = 1; 97 - tlb_remove_table(tlb, pmd); 97 + tlb_remove_ptdesc(tlb, pmd); 98 98 } 99 99 100 100 /*
+2 -3
arch/s390/mm/fault.c
··· 405 405 access = VM_WRITE; 406 406 if (access == VM_WRITE) 407 407 flags |= FAULT_FLAG_WRITE; 408 - #ifdef CONFIG_PER_VMA_LOCK 409 408 if (!(flags & FAULT_FLAG_USER)) 410 409 goto lock_mmap; 411 410 vma = lock_vma_under_rcu(mm, address); ··· 415 416 goto lock_mmap; 416 417 } 417 418 fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); 418 - vma_end_read(vma); 419 + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) 420 + vma_end_read(vma); 419 421 if (!(fault & VM_FAULT_RETRY)) { 420 422 count_vm_vma_lock_event(VMA_LOCK_SUCCESS); 421 423 if (likely(!(fault & VM_FAULT_ERROR))) ··· 430 430 goto out; 431 431 } 432 432 lock_mmap: 433 - #endif /* CONFIG_PER_VMA_LOCK */ 434 433 mmap_read_lock(mm); 435 434 436 435 gmap = NULL;
+117 -59
arch/s390/mm/pgalloc.c
··· 43 43 44 44 unsigned long *crst_table_alloc(struct mm_struct *mm) 45 45 { 46 - struct page *page = alloc_pages(GFP_KERNEL, CRST_ALLOC_ORDER); 46 + struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, CRST_ALLOC_ORDER); 47 47 48 - if (!page) 48 + if (!ptdesc) 49 49 return NULL; 50 - arch_set_page_dat(page, CRST_ALLOC_ORDER); 51 - return (unsigned long *) page_to_virt(page); 50 + arch_set_page_dat(ptdesc_page(ptdesc), CRST_ALLOC_ORDER); 51 + return (unsigned long *) ptdesc_to_virt(ptdesc); 52 52 } 53 53 54 54 void crst_table_free(struct mm_struct *mm, unsigned long *table) 55 55 { 56 - free_pages((unsigned long)table, CRST_ALLOC_ORDER); 56 + pagetable_free(virt_to_ptdesc(table)); 57 57 } 58 58 59 59 static void __crst_table_upgrade(void *arg) ··· 140 140 141 141 struct page *page_table_alloc_pgste(struct mm_struct *mm) 142 142 { 143 - struct page *page; 143 + struct ptdesc *ptdesc; 144 144 u64 *table; 145 145 146 - page = alloc_page(GFP_KERNEL); 147 - if (page) { 148 - table = (u64 *)page_to_virt(page); 146 + ptdesc = pagetable_alloc(GFP_KERNEL, 0); 147 + if (ptdesc) { 148 + table = (u64 *)ptdesc_to_virt(ptdesc); 149 149 memset64(table, _PAGE_INVALID, PTRS_PER_PTE); 150 150 memset64(table + PTRS_PER_PTE, 0, PTRS_PER_PTE); 151 151 } 152 - return page; 152 + return ptdesc_page(ptdesc); 153 153 } 154 154 155 155 void page_table_free_pgste(struct page *page) 156 156 { 157 - __free_page(page); 157 + pagetable_free(page_ptdesc(page)); 158 158 } 159 159 160 160 #endif /* CONFIG_PGSTE */ ··· 229 229 * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable 230 230 * while the PP bits are never used, nor such a page is added to or removed 231 231 * from mm_context_t::pgtable_list. 232 + * 233 + * pte_free_defer() overrides those rules: it takes the page off pgtable_list, 234 + * and prevents both 2K fragments from being reused. 
pte_free_defer() has to 235 + * guarantee that its pgtable cannot be reused before the RCU grace period 236 + * has elapsed (which page_table_free_rcu() does not actually guarantee). 237 + * But for simplicity, because page->rcu_head overlays page->lru, and because 238 + * the RCU callback might not be called before the mm_context_t has been freed, 239 + * pte_free_defer() in this implementation prevents both fragments from being 240 + * reused, and delays making the call to RCU until both fragments are freed. 232 241 */ 233 242 unsigned long *page_table_alloc(struct mm_struct *mm) 234 243 { 235 244 unsigned long *table; 236 - struct page *page; 245 + struct ptdesc *ptdesc; 237 246 unsigned int mask, bit; 238 247 239 248 /* Try to get a fragment of a 4K page as a 2K page table */ ··· 250 241 table = NULL; 251 242 spin_lock_bh(&mm->context.lock); 252 243 if (!list_empty(&mm->context.pgtable_list)) { 253 - page = list_first_entry(&mm->context.pgtable_list, 254 - struct page, lru); 255 - mask = atomic_read(&page->_refcount) >> 24; 244 + ptdesc = list_first_entry(&mm->context.pgtable_list, 245 + struct ptdesc, pt_list); 246 + mask = atomic_read(&ptdesc->_refcount) >> 24; 256 247 /* 257 248 * The pending removal bits must also be checked. 
258 249 * Failure to do so might lead to an impossible ··· 264 255 */ 265 256 mask = (mask | (mask >> 4)) & 0x03U; 266 257 if (mask != 0x03U) { 267 - table = (unsigned long *) page_to_virt(page); 258 + table = (unsigned long *) ptdesc_to_virt(ptdesc); 268 259 bit = mask & 1; /* =1 -> second 2K */ 269 260 if (bit) 270 261 table += PTRS_PER_PTE; 271 - atomic_xor_bits(&page->_refcount, 262 + atomic_xor_bits(&ptdesc->_refcount, 272 263 0x01U << (bit + 24)); 273 - list_del(&page->lru); 264 + list_del_init(&ptdesc->pt_list); 274 265 } 275 266 } 276 267 spin_unlock_bh(&mm->context.lock); ··· 278 269 return table; 279 270 } 280 271 /* Allocate a fresh page */ 281 - page = alloc_page(GFP_KERNEL); 282 - if (!page) 272 + ptdesc = pagetable_alloc(GFP_KERNEL, 0); 273 + if (!ptdesc) 283 274 return NULL; 284 - if (!pgtable_pte_page_ctor(page)) { 285 - __free_page(page); 275 + if (!pagetable_pte_ctor(ptdesc)) { 276 + pagetable_free(ptdesc); 286 277 return NULL; 287 278 } 288 - arch_set_page_dat(page, 0); 279 + arch_set_page_dat(ptdesc_page(ptdesc), 0); 289 280 /* Initialize page table */ 290 - table = (unsigned long *) page_to_virt(page); 281 + table = (unsigned long *) ptdesc_to_virt(ptdesc); 291 282 if (mm_alloc_pgste(mm)) { 292 283 /* Return 4K page table with PGSTEs */ 293 - atomic_xor_bits(&page->_refcount, 0x03U << 24); 284 + INIT_LIST_HEAD(&ptdesc->pt_list); 285 + atomic_xor_bits(&ptdesc->_refcount, 0x03U << 24); 294 286 memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE); 295 287 memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE); 296 288 } else { 297 289 /* Return the first 2K fragment of the page */ 298 - atomic_xor_bits(&page->_refcount, 0x01U << 24); 290 + atomic_xor_bits(&ptdesc->_refcount, 0x01U << 24); 299 291 memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE); 300 292 spin_lock_bh(&mm->context.lock); 301 - list_add(&page->lru, &mm->context.pgtable_list); 293 + list_add(&ptdesc->pt_list, &mm->context.pgtable_list); 302 294 spin_unlock_bh(&mm->context.lock); 
303 295 } 304 296 return table; ··· 310 300 { 311 301 char msg[128]; 312 302 313 - if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask) 303 + if (!IS_ENABLED(CONFIG_DEBUG_VM)) 304 + return; 305 + if (!mask && list_empty(&page->lru)) 314 306 return; 315 307 snprintf(msg, sizeof(msg), 316 308 "Invalid pgtable %p release half 0x%02x mask 0x%02x", ··· 320 308 dump_page(page, msg); 321 309 } 322 310 311 + static void pte_free_now(struct rcu_head *head) 312 + { 313 + struct ptdesc *ptdesc; 314 + 315 + ptdesc = container_of(head, struct ptdesc, pt_rcu_head); 316 + pagetable_pte_dtor(ptdesc); 317 + pagetable_free(ptdesc); 318 + } 319 + 323 320 void page_table_free(struct mm_struct *mm, unsigned long *table) 324 321 { 325 322 unsigned int mask, bit, half; 326 - struct page *page; 323 + struct ptdesc *ptdesc = virt_to_ptdesc(table); 327 324 328 - page = virt_to_page(table); 329 325 if (!mm_alloc_pgste(mm)) { 330 326 /* Free 2K page table fragment of a 4K page */ 331 327 bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t)); ··· 343 323 * will happen outside of the critical section from this 344 324 * function or from __tlb_remove_table() 345 325 */ 346 - mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24)); 326 + mask = atomic_xor_bits(&ptdesc->_refcount, 0x11U << (bit + 24)); 347 327 mask >>= 24; 348 - if (mask & 0x03U) 349 - list_add(&page->lru, &mm->context.pgtable_list); 350 - else 351 - list_del(&page->lru); 328 + if ((mask & 0x03U) && !folio_test_active(ptdesc_folio(ptdesc))) { 329 + /* 330 + * Other half is allocated, and neither half has had 331 + * its free deferred: add page to head of list, to make 332 + * this freed half available for immediate reuse. 333 + */ 334 + list_add(&ptdesc->pt_list, &mm->context.pgtable_list); 335 + } else { 336 + /* If page is on list, now remove it. 
*/ 337 + list_del_init(&ptdesc->pt_list); 338 + } 352 339 spin_unlock_bh(&mm->context.lock); 353 - mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24)); 340 + mask = atomic_xor_bits(&ptdesc->_refcount, 0x10U << (bit + 24)); 354 341 mask >>= 24; 355 342 if (mask != 0x00U) 356 343 return; 357 344 half = 0x01U << bit; 358 345 } else { 359 346 half = 0x03U; 360 - mask = atomic_xor_bits(&page->_refcount, 0x03U << 24); 347 + mask = atomic_xor_bits(&ptdesc->_refcount, 0x03U << 24); 361 348 mask >>= 24; 362 349 } 363 350 364 - page_table_release_check(page, table, half, mask); 365 - pgtable_pte_page_dtor(page); 366 - __free_page(page); 351 + page_table_release_check(ptdesc_page(ptdesc), table, half, mask); 352 + if (folio_test_clear_active(ptdesc_folio(ptdesc))) 353 + call_rcu(&ptdesc->pt_rcu_head, pte_free_now); 354 + else 355 + pte_free_now(&ptdesc->pt_rcu_head); 367 356 } 368 357 369 358 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table, 370 359 unsigned long vmaddr) 371 360 { 372 361 struct mm_struct *mm; 373 - struct page *page; 374 362 unsigned int bit, mask; 363 + struct ptdesc *ptdesc = virt_to_ptdesc(table); 375 364 376 365 mm = tlb->mm; 377 - page = virt_to_page(table); 378 366 if (mm_alloc_pgste(mm)) { 379 367 gmap_unlink(mm, table, vmaddr); 380 368 table = (unsigned long *) ((unsigned long)table | 0x03U); 381 - tlb_remove_table(tlb, table); 369 + tlb_remove_ptdesc(tlb, table); 382 370 return; 383 371 } 384 372 bit = ((unsigned long) table & ~PAGE_MASK) / (PTRS_PER_PTE*sizeof(pte_t)); ··· 396 368 * outside of the critical section from __tlb_remove_table() or from 397 369 * page_table_free() 398 370 */ 399 - mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24)); 371 + mask = atomic_xor_bits(&ptdesc->_refcount, 0x11U << (bit + 24)); 400 372 mask >>= 24; 401 - if (mask & 0x03U) 402 - list_add_tail(&page->lru, &mm->context.pgtable_list); 403 - else 404 - list_del(&page->lru); 373 + if ((mask & 0x03U) && 
!folio_test_active(ptdesc_folio(ptdesc))) { 374 + /* 375 + * Other half is allocated, and neither half has had 376 + * its free deferred: add page to end of list, to make 377 + * this freed half available for reuse once its pending 378 + * bit has been cleared by __tlb_remove_table(). 379 + */ 380 + list_add_tail(&ptdesc->pt_list, &mm->context.pgtable_list); 381 + } else { 382 + /* If page is on list, now remove it. */ 383 + list_del_init(&ptdesc->pt_list); 384 + } 405 385 spin_unlock_bh(&mm->context.lock); 406 386 table = (unsigned long *) ((unsigned long) table | (0x01U << bit)); 407 387 tlb_remove_table(tlb, table); ··· 419 383 { 420 384 unsigned int mask = (unsigned long) _table & 0x03U, half = mask; 421 385 void *table = (void *)((unsigned long) _table ^ mask); 422 - struct page *page = virt_to_page(table); 386 + struct ptdesc *ptdesc = virt_to_ptdesc(table); 423 387 424 388 switch (half) { 425 389 case 0x00U: /* pmd, pud, or p4d */ 426 - free_pages((unsigned long)table, CRST_ALLOC_ORDER); 390 + pagetable_free(ptdesc); 427 391 return; 428 392 case 0x01U: /* lower 2K of a 4K page table */ 429 393 case 0x02U: /* higher 2K of a 4K page table */ 430 - mask = atomic_xor_bits(&page->_refcount, mask << (4 + 24)); 394 + mask = atomic_xor_bits(&ptdesc->_refcount, mask << (4 + 24)); 431 395 mask >>= 24; 432 396 if (mask != 0x00U) 433 397 return; 434 398 break; 435 399 case 0x03U: /* 4K page table with pgstes */ 436 - mask = atomic_xor_bits(&page->_refcount, 0x03U << 24); 400 + mask = atomic_xor_bits(&ptdesc->_refcount, 0x03U << 24); 437 401 mask >>= 24; 438 402 break; 439 403 } 440 404 441 - page_table_release_check(page, table, half, mask); 442 - pgtable_pte_page_dtor(page); 443 - __free_page(page); 405 + page_table_release_check(ptdesc_page(ptdesc), table, half, mask); 406 + if (folio_test_clear_active(ptdesc_folio(ptdesc))) 407 + call_rcu(&ptdesc->pt_rcu_head, pte_free_now); 408 + else 409 + pte_free_now(&ptdesc->pt_rcu_head); 444 410 } 411 + 412 + #ifdef 
CONFIG_TRANSPARENT_HUGEPAGE 413 + void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) 414 + { 415 + struct page *page; 416 + 417 + page = virt_to_page(pgtable); 418 + SetPageActive(page); 419 + page_table_free(mm, (unsigned long *)pgtable); 420 + /* 421 + * page_table_free() does not do the pgste gmap_unlink() which 422 + * page_table_free_rcu() does: warn us if pgste ever reaches here. 423 + */ 424 + WARN_ON_ONCE(mm_has_pgste(mm)); 425 + } 426 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 445 427 446 428 /* 447 429 * Base infrastructure required to generate basic asces, region, segment, ··· 486 432 static unsigned long *base_crst_alloc(unsigned long val) 487 433 { 488 434 unsigned long *table; 435 + struct ptdesc *ptdesc; 489 436 490 - table = (unsigned long *)__get_free_pages(GFP_KERNEL, CRST_ALLOC_ORDER); 491 - if (table) 492 - crst_table_init(table, val); 437 + ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, CRST_ALLOC_ORDER); 438 + if (!ptdesc) 439 + return NULL; 440 + table = ptdesc_address(ptdesc); 441 + 442 + crst_table_init(table, val); 493 443 return table; 494 444 } 495 445 496 446 static void base_crst_free(unsigned long *table) 497 447 { 498 - free_pages((unsigned long)table, CRST_ALLOC_ORDER); 448 + pagetable_free(virt_to_ptdesc(table)); 499 449 } 500 450 501 451 #define BASE_ADDR_END_FUNC(NAME, SIZE) \
+10 -47
arch/s390/pci/pci.c
··· 244 244 zpci_memcpy_toio(to, from, count); 245 245 } 246 246 247 - static void __iomem *__ioremap(phys_addr_t addr, size_t size, pgprot_t prot) 247 + void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, 248 + unsigned long prot) 248 249 { 249 - unsigned long offset, vaddr; 250 - struct vm_struct *area; 251 - phys_addr_t last_addr; 252 - 253 - last_addr = addr + size - 1; 254 - if (!size || last_addr < addr) 255 - return NULL; 256 - 250 + /* 251 + * When PCI MIO instructions are unavailable the "physical" address 252 + * encodes a hint for accessing the PCI memory space it represents. 253 + * Just pass it unchanged such that ioread/iowrite can decode it. 254 + */ 257 255 if (!static_branch_unlikely(&have_mio)) 258 - return (void __iomem *) addr; 256 + return (void __iomem *)phys_addr; 259 257 260 - offset = addr & ~PAGE_MASK; 261 - addr &= PAGE_MASK; 262 - size = PAGE_ALIGN(size + offset); 263 - area = get_vm_area(size, VM_IOREMAP); 264 - if (!area) 265 - return NULL; 266 - 267 - vaddr = (unsigned long) area->addr; 268 - if (ioremap_page_range(vaddr, vaddr + size, addr, prot)) { 269 - free_vm_area(area); 270 - return NULL; 271 - } 272 - return (void __iomem *) ((unsigned long) area->addr + offset); 273 - } 274 - 275 - void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot) 276 - { 277 - return __ioremap(addr, size, __pgprot(prot)); 258 + return generic_ioremap_prot(phys_addr, size, __pgprot(prot)); 278 259 } 279 260 EXPORT_SYMBOL(ioremap_prot); 280 - 281 - void __iomem *ioremap(phys_addr_t addr, size_t size) 282 - { 283 - return __ioremap(addr, size, PAGE_KERNEL); 284 - } 285 - EXPORT_SYMBOL(ioremap); 286 - 287 - void __iomem *ioremap_wc(phys_addr_t addr, size_t size) 288 - { 289 - return __ioremap(addr, size, pgprot_writecombine(PAGE_KERNEL)); 290 - } 291 - EXPORT_SYMBOL(ioremap_wc); 292 - 293 - void __iomem *ioremap_wt(phys_addr_t addr, size_t size) 294 - { 295 - return __ioremap(addr, size, pgprot_writethrough(PAGE_KERNEL)); 
296 - } 297 - EXPORT_SYMBOL(ioremap_wt); 298 261 299 262 void iounmap(volatile void __iomem *addr) 300 263 { 301 264 if (static_branch_likely(&have_mio)) 302 - vunmap((__force void *) ((unsigned long) addr & PAGE_MASK)); 265 + generic_iounmap(addr); 303 266 } 304 267 EXPORT_SYMBOL(iounmap); 305 268
+1
arch/sh/Kconfig
··· 29 29 select GENERIC_SMP_IDLE_THREAD 30 30 select GUP_GET_PXX_LOW_HIGH if X2TLB 31 31 select HAS_IOPORT if HAS_IOPORT_MAP 32 + select GENERIC_IOREMAP if MMU 32 33 select HAVE_ARCH_AUDITSYSCALL 33 34 select HAVE_ARCH_KGDB 34 35 select HAVE_ARCH_SECCOMP_FILTER
+14 -7
arch/sh/include/asm/cacheflush.h
··· 13 13 * - flush_cache_page(mm, vmaddr, pfn) flushes a single page 14 14 * - flush_cache_range(vma, start, end) flushes a range of pages 15 15 * 16 - * - flush_dcache_page(pg) flushes(wback&invalidates) a page for dcache 16 + * - flush_dcache_folio(folio) flushes(wback&invalidates) a folio for dcache 17 17 * - flush_icache_range(start, end) flushes(invalidates) a range for icache 18 - * - flush_icache_page(vma, pg) flushes(invalidates) a page for icache 18 + * - flush_icache_pages(vma, pg, nr) flushes(invalidates) pages for icache 19 19 * - flush_cache_sigtramp(vaddr) flushes the signal trampoline 20 20 */ 21 21 extern void (*local_flush_cache_all)(void *args); ··· 23 23 extern void (*local_flush_cache_dup_mm)(void *args); 24 24 extern void (*local_flush_cache_page)(void *args); 25 25 extern void (*local_flush_cache_range)(void *args); 26 - extern void (*local_flush_dcache_page)(void *args); 26 + extern void (*local_flush_dcache_folio)(void *args); 27 27 extern void (*local_flush_icache_range)(void *args); 28 - extern void (*local_flush_icache_page)(void *args); 28 + extern void (*local_flush_icache_folio)(void *args); 29 29 extern void (*local_flush_cache_sigtramp)(void *args); 30 30 31 31 static inline void cache_noop(void *args) { } ··· 42 42 extern void flush_cache_range(struct vm_area_struct *vma, 43 43 unsigned long start, unsigned long end); 44 44 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 45 - void flush_dcache_page(struct page *page); 45 + void flush_dcache_folio(struct folio *folio); 46 + #define flush_dcache_folio flush_dcache_folio 47 + static inline void flush_dcache_page(struct page *page) 48 + { 49 + flush_dcache_folio(page_folio(page)); 50 + } 51 + 46 52 extern void flush_icache_range(unsigned long start, unsigned long end); 47 53 #define flush_icache_user_range flush_icache_range 48 - extern void flush_icache_page(struct vm_area_struct *vma, 49 - struct page *page); 54 + void flush_icache_pages(struct vm_area_struct *vma, struct page *page, 55 
+ unsigned int nr); 56 + #define flush_icache_pages flush_icache_pages 50 57 extern void flush_cache_sigtramp(unsigned long address); 51 58 52 59 struct flusher_data {
+57 -32
arch/sh/include/asm/io.h
··· 119 119 120 120 __BUILD_MEMORY_STRING(__raw_, q, u64) 121 121 122 + #define ioport_map ioport_map 123 + #define ioport_unmap ioport_unmap 124 + #define pci_iounmap pci_iounmap 125 + 126 + #define ioread8 ioread8 127 + #define ioread16 ioread16 128 + #define ioread16be ioread16be 129 + #define ioread32 ioread32 130 + #define ioread32be ioread32be 131 + 132 + #define iowrite8 iowrite8 133 + #define iowrite16 iowrite16 134 + #define iowrite16be iowrite16be 135 + #define iowrite32 iowrite32 136 + #define iowrite32be iowrite32be 137 + 138 + #define ioread8_rep ioread8_rep 139 + #define ioread16_rep ioread16_rep 140 + #define ioread32_rep ioread32_rep 141 + 142 + #define iowrite8_rep iowrite8_rep 143 + #define iowrite16_rep iowrite16_rep 144 + #define iowrite32_rep iowrite32_rep 145 + 122 146 #ifdef CONFIG_HAS_IOPORT_MAP 123 147 124 148 /* ··· 245 221 246 222 #endif 247 223 224 + #define inb(addr) inb(addr) 225 + #define inw(addr) inw(addr) 226 + #define inl(addr) inl(addr) 227 + #define outb(x, addr) outb((x), (addr)) 228 + #define outw(x, addr) outw((x), (addr)) 229 + #define outl(x, addr) outl((x), (addr)) 230 + 231 + #define inb_p(addr) inb(addr) 232 + #define inw_p(addr) inw(addr) 233 + #define inl_p(addr) inl(addr) 234 + #define outb_p(x, addr) outb((x), (addr)) 235 + #define outw_p(x, addr) outw((x), (addr)) 236 + #define outl_p(x, addr) outl((x), (addr)) 237 + 238 + #define insb insb 239 + #define insw insw 240 + #define insl insl 241 + #define outsb outsb 242 + #define outsw outsw 243 + #define outsl outsl 248 244 249 245 #define IO_SPACE_LIMIT 0xffffffff 250 246 251 247 /* We really want to try and get these to memcpy etc */ 248 + #define memset_io memset_io 249 + #define memcpy_fromio memcpy_fromio 250 + #define memcpy_toio memcpy_toio 252 251 void memcpy_fromio(void *, const volatile void __iomem *, unsigned long); 253 252 void memcpy_toio(volatile void __iomem *, const void *, unsigned long); 254 253 void memset_io(volatile void __iomem *, int, unsigned 
long); ··· 290 243 #endif 291 244 292 245 #ifdef CONFIG_MMU 293 - void iounmap(void __iomem *addr); 294 - void __iomem *__ioremap_caller(phys_addr_t offset, unsigned long size, 295 - pgprot_t prot, void *caller); 246 + /* 247 + * I/O memory mapping functions. 248 + */ 249 + #define ioremap_prot ioremap_prot 250 + #define iounmap iounmap 296 251 297 - static inline void __iomem *ioremap(phys_addr_t offset, unsigned long size) 298 - { 299 - return __ioremap_caller(offset, size, PAGE_KERNEL_NOCACHE, 300 - __builtin_return_address(0)); 301 - } 252 + #define _PAGE_IOREMAP pgprot_val(PAGE_KERNEL_NOCACHE) 302 253 303 - static inline void __iomem * 304 - ioremap_cache(phys_addr_t offset, unsigned long size) 305 - { 306 - return __ioremap_caller(offset, size, PAGE_KERNEL, 307 - __builtin_return_address(0)); 308 - } 309 - #define ioremap_cache ioremap_cache 310 - 311 - #ifdef CONFIG_HAVE_IOREMAP_PROT 312 - static inline void __iomem *ioremap_prot(phys_addr_t offset, unsigned long size, 313 - unsigned long flags) 314 - { 315 - return __ioremap_caller(offset, size, __pgprot(flags), 316 - __builtin_return_address(0)); 317 - } 318 - #endif /* CONFIG_HAVE_IOREMAP_PROT */ 319 - 320 - #else /* CONFIG_MMU */ 321 - static inline void __iomem *ioremap(phys_addr_t offset, size_t size) 322 - { 323 - return (void __iomem *)(unsigned long)offset; 324 - } 325 - 326 - static inline void iounmap(volatile void __iomem *addr) { } 254 + #define ioremap_cache(addr, size) \ 255 + ioremap_prot((addr), (size), pgprot_val(PAGE_KERNEL)) 327 256 #endif /* CONFIG_MMU */ 328 257 329 258 #define ioremap_uc ioremap ··· 310 287 */ 311 288 #define xlate_dev_mem_ptr(p) __va(p) 312 289 #define unxlate_dev_mem_ptr(p, v) do { } while (0) 290 + 291 + #include <asm-generic/io.h> 313 292 314 293 #define ARCH_HAS_VALID_PHYS_ADDR_RANGE 315 294 int valid_phys_addr_range(phys_addr_t addr, size_t size);
-7
arch/sh/include/asm/io_noioport.h
··· 46 46 BUG(); 47 47 } 48 48 49 - #define inb_p(addr) inb(addr) 50 - #define inw_p(addr) inw(addr) 51 - #define inl_p(addr) inl(addr) 52 - #define outb_p(x, addr) outb((x), (addr)) 53 - #define outw_p(x, addr) outw((x), (addr)) 54 - #define outl_p(x, addr) outl((x), (addr)) 55 - 56 49 static inline void insb(unsigned long port, void *dst, unsigned long count) 57 50 { 58 51 BUG();
+5 -4
arch/sh/include/asm/pgalloc.h
··· 2 2 #ifndef __ASM_SH_PGALLOC_H 3 3 #define __ASM_SH_PGALLOC_H 4 4 5 + #include <linux/mm.h> 5 6 #include <asm/page.h> 6 7 7 8 #define __HAVE_ARCH_PMD_ALLOC_ONE ··· 32 31 set_pmd(pmd, __pmd((unsigned long)page_address(pte))); 33 32 } 34 33 35 - #define __pte_free_tlb(tlb,pte,addr) \ 36 - do { \ 37 - pgtable_pte_page_dtor(pte); \ 38 - tlb_remove_page((tlb), (pte)); \ 34 + #define __pte_free_tlb(tlb, pte, addr) \ 35 + do { \ 36 + pagetable_pte_dtor(page_ptdesc(pte)); \ 37 + tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte))); \ 39 38 } while (0) 40 39 41 40 #endif /* __ASM_SH_PGALLOC_H */
+5 -2
arch/sh/include/asm/pgtable.h
··· 102 102 extern void __update_tlb(struct vm_area_struct *vma, 103 103 unsigned long address, pte_t pte); 104 104 105 - static inline void 106 - update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) 105 + static inline void update_mmu_cache_range(struct vm_fault *vmf, 106 + struct vm_area_struct *vma, unsigned long address, 107 + pte_t *ptep, unsigned int nr) 107 108 { 108 109 pte_t pte = *ptep; 109 110 __update_cache(vma, address, pte); 110 111 __update_tlb(vma, address, pte); 111 112 } 113 + #define update_mmu_cache(vma, addr, ptep) \ 114 + update_mmu_cache_range(NULL, vma, addr, ptep, 1) 112 115 113 116 extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; 114 117 extern void paging_init(void);
+2 -3
arch/sh/include/asm/pgtable_32.h
··· 307 307 #define set_pte(pteptr, pteval) (*(pteptr) = pteval) 308 308 #endif 309 309 310 - #define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval) 311 - 312 310 /* 313 311 * (pmds are folded into pgds so this doesn't get actually called, 314 312 * but the define is needed for a generic inline function.) 315 313 */ 316 314 #define set_pmd(pmdptr, pmdval) (*(pmdptr) = pmdval) 317 315 316 + #define PFN_PTE_SHIFT PAGE_SHIFT 318 317 #define pfn_pte(pfn, prot) \ 319 318 __pte(((unsigned long long)(pfn) << PAGE_SHIFT) | pgprot_val(prot)) 320 319 #define pfn_pmd(pfn, prot) \ ··· 322 323 #define pte_none(x) (!pte_val(x)) 323 324 #define pte_present(x) ((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE)) 324 325 325 - #define pte_clear(mm,addr,xp) do { set_pte_at(mm, addr, xp, __pte(0)); } while (0) 326 + #define pte_clear(mm, addr, ptep) set_pte(ptep, __pte(0)) 326 327 327 328 #define pmd_none(x) (!pmd_val(x)) 328 329 #define pmd_present(x) (pmd_val(x))
+2 -2
arch/sh/mm/cache-j2.c
··· 55 55 local_flush_cache_dup_mm = j2_flush_both; 56 56 local_flush_cache_page = j2_flush_both; 57 57 local_flush_cache_range = j2_flush_both; 58 - local_flush_dcache_page = j2_flush_dcache; 58 + local_flush_dcache_folio = j2_flush_dcache; 59 59 local_flush_icache_range = j2_flush_icache; 60 - local_flush_icache_page = j2_flush_icache; 60 + local_flush_icache_folio = j2_flush_icache; 61 61 local_flush_cache_sigtramp = j2_flush_icache; 62 62 63 63 pr_info("Initial J2 CCR is %.8x\n", __raw_readl(j2_ccr_base));
+18 -8
arch/sh/mm/cache-sh4.c
··· 107 107 * Write back & invalidate the D-cache of the page. 108 108 * (To avoid "alias" issues) 109 109 */ 110 - static void sh4_flush_dcache_page(void *arg) 110 + static void sh4_flush_dcache_folio(void *arg) 111 111 { 112 - struct page *page = arg; 113 - unsigned long addr = (unsigned long)page_address(page); 112 + struct folio *folio = arg; 114 113 #ifndef CONFIG_SMP 115 - struct address_space *mapping = page_mapping_file(page); 114 + struct address_space *mapping = folio_flush_mapping(folio); 116 115 117 116 if (mapping && !mapping_mapped(mapping)) 118 - clear_bit(PG_dcache_clean, &page->flags); 117 + clear_bit(PG_dcache_clean, &folio->flags); 119 118 else 120 119 #endif 121 - flush_cache_one(CACHE_OC_ADDRESS_ARRAY | 122 - (addr & shm_align_mask), page_to_phys(page)); 120 + { 121 + unsigned long pfn = folio_pfn(folio); 122 + unsigned long addr = (unsigned long)folio_address(folio); 123 + unsigned int i, nr = folio_nr_pages(folio); 124 + 125 + for (i = 0; i < nr; i++) { 126 + flush_cache_one(CACHE_OC_ADDRESS_ARRAY | 127 + (addr & shm_align_mask), 128 + pfn * PAGE_SIZE); 129 + addr += PAGE_SIZE; 130 + pfn++; 131 + } 132 + } 123 133 124 134 wmb(); 125 135 } ··· 389 379 __raw_readl(CCN_PRR)); 390 380 391 381 local_flush_icache_range = sh4_flush_icache_range; 392 - local_flush_dcache_page = sh4_flush_dcache_page; 382 + local_flush_dcache_folio = sh4_flush_dcache_folio; 393 383 local_flush_cache_all = sh4_flush_cache_all; 394 384 local_flush_cache_mm = sh4_flush_cache_mm; 395 385 local_flush_cache_dup_mm = sh4_flush_cache_mm;
+16 -10
arch/sh/mm/cache-sh7705.c
··· 132 132 * Write back & invalidate the D-cache of the page. 133 133 * (To avoid "alias" issues) 134 134 */ 135 - static void sh7705_flush_dcache_page(void *arg) 135 + static void sh7705_flush_dcache_folio(void *arg) 136 136 { 137 - struct page *page = arg; 138 - struct address_space *mapping = page_mapping_file(page); 137 + struct folio *folio = arg; 138 + struct address_space *mapping = folio_flush_mapping(folio); 139 139 140 140 if (mapping && !mapping_mapped(mapping)) 141 - clear_bit(PG_dcache_clean, &page->flags); 142 - else 143 - __flush_dcache_page(__pa(page_address(page))); 141 + clear_bit(PG_dcache_clean, &folio->flags); 142 + else { 143 + unsigned long pfn = folio_pfn(folio); 144 + unsigned int i, nr = folio_nr_pages(folio); 145 + 146 + for (i = 0; i < nr; i++) 147 + __flush_dcache_page((pfn + i) * PAGE_SIZE); 148 + } 144 149 } 145 150 146 151 static void sh7705_flush_cache_all(void *args) ··· 181 176 * Not entirely sure why this is necessary on SH3 with 32K cache but 182 177 * without it we get occasional "Memory fault" when loading a program. 
183 178 */ 184 - static void sh7705_flush_icache_page(void *page) 179 + static void sh7705_flush_icache_folio(void *arg) 185 180 { 186 - __flush_purge_region(page_address(page), PAGE_SIZE); 181 + struct folio *folio = arg; 182 + __flush_purge_region(folio_address(folio), folio_size(folio)); 187 183 } 188 184 189 185 void __init sh7705_cache_init(void) 190 186 { 191 187 local_flush_icache_range = sh7705_flush_icache_range; 192 - local_flush_dcache_page = sh7705_flush_dcache_page; 188 + local_flush_dcache_folio = sh7705_flush_dcache_folio; 193 189 local_flush_cache_all = sh7705_flush_cache_all; 194 190 local_flush_cache_mm = sh7705_flush_cache_all; 195 191 local_flush_cache_dup_mm = sh7705_flush_cache_all; 196 192 local_flush_cache_range = sh7705_flush_cache_all; 197 193 local_flush_cache_page = sh7705_flush_cache_page; 198 - local_flush_icache_page = sh7705_flush_icache_page; 194 + local_flush_icache_folio = sh7705_flush_icache_folio; 199 195 }
+30 -22
arch/sh/mm/cache.c
arch/sh/mm/cache.c
···
 void (*local_flush_cache_dup_mm)(void *args) = cache_noop;
 void (*local_flush_cache_page)(void *args) = cache_noop;
 void (*local_flush_cache_range)(void *args) = cache_noop;
-void (*local_flush_dcache_page)(void *args) = cache_noop;
+void (*local_flush_dcache_folio)(void *args) = cache_noop;
 void (*local_flush_icache_range)(void *args) = cache_noop;
-void (*local_flush_icache_page)(void *args) = cache_noop;
+void (*local_flush_icache_folio)(void *args) = cache_noop;
 void (*local_flush_cache_sigtramp)(void *args) = cache_noop;
 
 void (*__flush_wback_region)(void *start, int size);
···
 		       unsigned long vaddr, void *dst, const void *src,
 		       unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
-	    test_bit(PG_dcache_clean, &page->flags)) {
+	struct folio *folio = page_folio(page);
+
+	if (boot_cpu_data.dcache.n_aliases && folio_mapped(folio) &&
+	    test_bit(PG_dcache_clean, &folio->flags)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
 		kunmap_coherent(vto);
 	} else {
 		memcpy(dst, src, len);
 		if (boot_cpu_data.dcache.n_aliases)
-			clear_bit(PG_dcache_clean, &page->flags);
+			clear_bit(PG_dcache_clean, &folio->flags);
 	}
 
 	if (vma->vm_flags & VM_EXEC)
···
 			 unsigned long vaddr, void *dst, const void *src,
 			 unsigned long len)
 {
+	struct folio *folio = page_folio(page);
+
 	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
-	    test_bit(PG_dcache_clean, &page->flags)) {
+	    test_bit(PG_dcache_clean, &folio->flags)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
 		kunmap_coherent(vfrom);
 	} else {
 		memcpy(dst, src, len);
 		if (boot_cpu_data.dcache.n_aliases)
-			clear_bit(PG_dcache_clean, &page->flags);
+			clear_bit(PG_dcache_clean, &folio->flags);
 	}
 }
 
 void copy_user_highpage(struct page *to, struct page *from,
 			unsigned long vaddr, struct vm_area_struct *vma)
 {
+	struct folio *src = page_folio(from);
 	void *vfrom, *vto;
 
 	vto = kmap_atomic(to);
 
-	if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
-	    test_bit(PG_dcache_clean, &from->flags)) {
+	if (boot_cpu_data.dcache.n_aliases && folio_mapped(src) &&
+	    test_bit(PG_dcache_clean, &src->flags)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
 		kunmap_coherent(vfrom);
···
 void __update_cache(struct vm_area_struct *vma,
 		    unsigned long address, pte_t pte)
 {
-	struct page *page;
 	unsigned long pfn = pte_pfn(pte);
 
 	if (!boot_cpu_data.dcache.n_aliases)
 		return;
 
-	page = pfn_to_page(pfn);
 	if (pfn_valid(pfn)) {
-		int dirty = !test_and_set_bit(PG_dcache_clean, &page->flags);
+		struct folio *folio = page_folio(pfn_to_page(pfn));
+		int dirty = !test_and_set_bit(PG_dcache_clean, &folio->flags);
 		if (dirty)
-			__flush_purge_region(page_address(page), PAGE_SIZE);
+			__flush_purge_region(folio_address(folio),
+						folio_size(folio));
 	}
 }
 
 void __flush_anon_page(struct page *page, unsigned long vmaddr)
 {
+	struct folio *folio = page_folio(page);
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
-		    test_bit(PG_dcache_clean, &page->flags)) {
+		if (boot_cpu_data.dcache.n_aliases && folio_mapped(folio) &&
+		    test_bit(PG_dcache_clean, &folio->flags)) {
 			void *kaddr;
 
 			kaddr = kmap_coherent(page, vmaddr);
···
 			/* __flush_purge_region((void *)kaddr, PAGE_SIZE); */
 			kunmap_coherent(kaddr);
 		} else
-			__flush_purge_region((void *)addr, PAGE_SIZE);
+			__flush_purge_region(folio_address(folio),
+						folio_size(folio));
 	}
 }
···
 }
 EXPORT_SYMBOL(flush_cache_range);
 
-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
 {
-	cacheop_on_each_cpu(local_flush_dcache_page, page, 1);
+	cacheop_on_each_cpu(local_flush_dcache_folio, folio, 1);
 }
-EXPORT_SYMBOL(flush_dcache_page);
+EXPORT_SYMBOL(flush_dcache_folio);
 
 void flush_icache_range(unsigned long start, unsigned long end)
 {
···
 }
 EXPORT_SYMBOL(flush_icache_range);
 
-void flush_icache_page(struct vm_area_struct *vma, struct page *page)
+void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
+		unsigned int nr)
 {
-	/* Nothing uses the VMA, so just pass the struct page along */
-	cacheop_on_each_cpu(local_flush_icache_page, page, 1);
+	/* Nothing uses the VMA, so just pass the folio along */
+	cacheop_on_each_cpu(local_flush_icache_folio, page_folio(page), 1);
 }
 
 void flush_cache_sigtramp(unsigned long address)
arch/sh/mm/ioremap.c (+11 -54)
···
 #define __ioremap_29bit(offset, size, prot)	NULL
 #endif /* CONFIG_29BIT */
 
-/*
- * Remap an arbitrary physical address space into the kernel virtual
- * address space. Needed when the kernel wants to access high addresses
- * directly.
- *
- * NOTE! We need to allow non-page-aligned mappings too: we will obviously
- * have to convert them into an offset in a page-aligned mapping, but the
- * caller shouldn't need to know that small detail.
- */
-void __iomem * __ref
-__ioremap_caller(phys_addr_t phys_addr, unsigned long size,
-		 pgprot_t pgprot, void *caller)
+void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
+			   unsigned long prot)
 {
-	struct vm_struct *area;
-	unsigned long offset, last_addr, addr, orig_addr;
 	void __iomem *mapped;
+	pgprot_t pgprot = __pgprot(prot);
 
 	mapped = __ioremap_trapped(phys_addr, size);
 	if (mapped)
···
 	mapped = __ioremap_29bit(phys_addr, size, pgprot);
 	if (mapped)
 		return mapped;
-
-	/* Don't allow wraparound or zero size */
-	last_addr = phys_addr + size - 1;
-	if (!size || last_addr < phys_addr)
-		return NULL;
 
 	/*
 	 * If we can't yet use the regular approach, go the fixmap route.
···
 	 * First try to remap through the PMB.
 	 * PMB entries are all pre-faulted.
 	 */
-	mapped = pmb_remap_caller(phys_addr, size, pgprot, caller);
+	mapped = pmb_remap_caller(phys_addr, size, pgprot,
+			__builtin_return_address(0));
 	if (mapped && !IS_ERR(mapped))
 		return mapped;
 
-	/*
-	 * Mappings have to be page-aligned
-	 */
-	offset = phys_addr & ~PAGE_MASK;
-	phys_addr &= PAGE_MASK;
-	size = PAGE_ALIGN(last_addr+1) - phys_addr;
-
-	/*
-	 * Ok, go for it..
-	 */
-	area = get_vm_area_caller(size, VM_IOREMAP, caller);
-	if (!area)
-		return NULL;
-	area->phys_addr = phys_addr;
-	orig_addr = addr = (unsigned long)area->addr;
-
-	if (ioremap_page_range(addr, addr + size, phys_addr, pgprot)) {
-		vunmap((void *)orig_addr);
-		return NULL;
-	}
-
-	return (void __iomem *)(offset + (char *)orig_addr);
+	return generic_ioremap_prot(phys_addr, size, pgprot);
 }
-EXPORT_SYMBOL(__ioremap_caller);
+EXPORT_SYMBOL(ioremap_prot);
 
 /*
  * Simple checks for non-translatable mappings.
···
 	return 0;
 }
 
-void iounmap(void __iomem *addr)
+void iounmap(volatile void __iomem *addr)
 {
 	unsigned long vaddr = (unsigned long __force)addr;
-	struct vm_struct *p;
 
 	/*
 	 * Nothing to do if there is no translatable mapping.
···
 	/*
 	 * There's no VMA if it's from an early fixed mapping.
 	 */
-	if (iounmap_fixed(addr) == 0)
+	if (iounmap_fixed((void __iomem *)addr) == 0)
 		return;
 
 	/*
 	 * If the PMB handled it, there's nothing else to do.
 	 */
-	if (pmb_unmap(addr) == 0)
+	if (pmb_unmap((void __iomem *)addr) == 0)
 		return;
 
-	p = remove_vm_area((void *)(vaddr & PAGE_MASK));
-	if (!p) {
-		printk(KERN_ERR "%s: bad address %p\n", __func__, addr);
-		return;
-	}
-
-	kfree(p);
+	generic_iounmap(addr);
 }
 EXPORT_SYMBOL(iounmap);
arch/sh/mm/kmap.c (+2 -1)
···
 
 void *kmap_coherent(struct page *page, unsigned long addr)
 {
+	struct folio *folio = page_folio(page);
 	enum fixed_addresses idx;
 	unsigned long vaddr;
 
-	BUG_ON(!test_bit(PG_dcache_clean, &page->flags));
+	BUG_ON(!test_bit(PG_dcache_clean, &folio->flags));
 
 	preempt_disable();
 	pagefault_disable();
arch/sparc/include/asm/cacheflush_32.h (+7 -3)
···
 #ifndef _SPARC_CACHEFLUSH_H
 #define _SPARC_CACHEFLUSH_H
 
+#include <linux/page-flags.h>
 #include <asm/cachetlb_32.h>
 
 #define flush_cache_all() \
···
 #define flush_cache_page(vma,addr,pfn) \
 	sparc32_cachetlb_ops->cache_page(vma, addr)
 #define flush_icache_range(start, end)		do { } while (0)
-#define flush_icache_page(vma, pg)		do { } while (0)
 
 #define copy_to_user_page(vma, page, vaddr, dst, src, len) \
 	do { \
···
 #define flush_page_for_dma(addr) \
 	sparc32_cachetlb_ops->page_for_dma(addr)
 
-struct page;
 void sparc_flush_page_to_ram(struct page *page);
+void sparc_flush_folio_to_ram(struct folio *folio);
 
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE	1
-#define flush_dcache_page(page)			sparc_flush_page_to_ram(page)
+#define flush_dcache_folio(folio)		sparc_flush_folio_to_ram(folio)
+static inline void flush_dcache_page(struct page *page)
+{
+	flush_dcache_folio(page_folio(page));
+}
 #define flush_dcache_mmap_lock(mapping)		do { } while (0)
 #define flush_dcache_mmap_unlock(mapping)	do { } while (0)
 
arch/sparc/include/asm/cacheflush_64.h (+11 -8)
···
 void __flush_icache_page(unsigned long);
 
 void __flush_dcache_page(void *addr, int flush_icache);
-void flush_dcache_page_impl(struct page *page);
+void flush_dcache_folio_impl(struct folio *folio);
 #ifdef CONFIG_SMP
-void smp_flush_dcache_page_impl(struct page *page, int cpu);
-void flush_dcache_page_all(struct mm_struct *mm, struct page *page);
+void smp_flush_dcache_folio_impl(struct folio *folio, int cpu);
+void flush_dcache_folio_all(struct mm_struct *mm, struct folio *folio);
 #else
-#define smp_flush_dcache_page_impl(page,cpu) flush_dcache_page_impl(page)
-#define flush_dcache_page_all(mm,page) flush_dcache_page_impl(page)
+#define smp_flush_dcache_folio_impl(folio, cpu) flush_dcache_folio_impl(folio)
+#define flush_dcache_folio_all(mm, folio) flush_dcache_folio_impl(folio)
 #endif
 
 void __flush_dcache_range(unsigned long start, unsigned long end);
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE	1
-void flush_dcache_page(struct page *page);
-
-#define flush_icache_page(vma, pg)	do { } while(0)
+void flush_dcache_folio(struct folio *folio);
+#define flush_dcache_folio flush_dcache_folio
+static inline void flush_dcache_page(struct page *page)
+{
+	flush_dcache_folio(page_folio(page));
+}
 
 void flush_ptrace_access(struct vm_area_struct *, struct page *,
 			 unsigned long uaddr, void *kaddr,
arch/sparc/include/asm/pgalloc_64.h (+4)
···
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)
 
arch/sparc/include/asm/pgtable_32.h (+4 -4)
···
 	srmmu_swap((unsigned long *)ptep, pte_val(pteval));
 }
 
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
-
 static inline int srmmu_device_memory(unsigned long x)
 {
 	return ((x & 0xF0000000) != 0);
···
 	return __pte(pte_val(pte) | SRMMU_REF);
 }
 
+#define PFN_PTE_SHIFT			(PAGE_SHIFT - 4)
 #define pfn_pte(pfn, prot)		mk_pte(pfn_to_page(pfn), prot)
 
 static inline unsigned long pte_pfn(pte_t pte)
···
 		 */
 		return ~0UL;
 	}
-	return (pte_val(pte) & SRMMU_PTE_PMASK) >> (PAGE_SHIFT-4);
+	return (pte_val(pte) & SRMMU_PTE_PMASK) >> PFN_PTE_SHIFT;
 }
 
 #define pte_page(pte)	pfn_to_page(pte_pfn(pte))
···
 #define FAULT_CODE_USER		0x4
 
 #define update_mmu_cache(vma, address, ptep) do { } while (0)
+#define update_mmu_cache_range(vmf, vma, address, ptep, nr) do { } while (0)
 
 void srmmu_mapiorange(unsigned int bus, unsigned long xpa,
 		      unsigned long xva, unsigned int len);
···
 ({ \
 	int __changed = !pte_same(*(__ptep), __entry); \
 	if (__changed) { \
-		set_pte_at((__vma)->vm_mm, (__address), __ptep, __entry); \
+		set_pte(__ptep, __entry); \
 		flush_tlb_page(__vma, __address); \
 	} \
 	__changed; \
 })
arch/sparc/include/asm/pgtable_64.h (+22 -7)
···
 #define vmemmap			((struct page *)VMEMMAP_BASE)
 
 #include <linux/sched.h>
+#include <asm/tlbflush.h>
 
 bool kern_addr_valid(unsigned long addr);
···
 	maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }
 
-#define set_pte_at(mm,addr,ptep,pte)	\
-	__set_pte_at((mm), (addr), (ptep), (pte), 0)
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+		pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	arch_enter_lazy_mmu_mode();
+	for (;;) {
+		__set_pte_at(mm, addr, ptep, pte, 0);
+		if (--nr == 0)
+			break;
+		ptep++;
+		pte_val(pte) += PAGE_SIZE;
+		addr += PAGE_SIZE;
+	}
+	arch_leave_lazy_mmu_mode();
+}
+#define set_ptes set_ptes
 
 #define pte_clear(mm,addr,ptep)		\
 	set_pte_at((mm), (addr), (ptep), __pte(0UL))
···
 	\
 	if (pfn_valid(this_pfn) &&				\
 	    (((old_addr) ^ (new_addr)) & (1 << 13)))		\
-		flush_dcache_page_all(current->mm,		\
-				      pfn_to_page(this_pfn));	\
+		flush_dcache_folio_all(current->mm,		\
+			page_folio(pfn_to_page(this_pfn)));	\
 	}							\
 	newpte;							\
 })
···
 void mmu_info(struct seq_file *);
 
 struct vm_area_struct;
-void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t *);
+void update_mmu_cache_range(struct vm_fault *, struct vm_area_struct *,
+		unsigned long addr, pte_t *ptep, unsigned int nr);
+#define update_mmu_cache(vma, addr, ptep) \
+	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd);
···
 	return (pte_val(pte) & (prot | _PAGE_SPECIAL)) == prot;
 }
 #define pte_access_permitted pte_access_permitted
-
-#include <asm/tlbflush.h>
 
 /* We provide our own get_unmapped_area to cope with VA holes and
  * SHM area cache aliasing for userland.
arch/sparc/kernel/setup_32.c (+1 -1)
···
 			     "nop\n\t" : : "r" (&trapbase));
 
 	prom_printf("PROM SYNC COMMAND...\n");
-	show_free_areas(0, NULL);
+	show_mem();
 	if (!is_idle_task(current)) {
 		local_irq_enable();
 		ksys_sync();
arch/sparc/kernel/smp_64.c (+36 -20)
···
 #endif
 extern unsigned long xcall_flush_dcache_page_spitfire;
 
-static inline void __local_flush_dcache_page(struct page *page)
+static inline void __local_flush_dcache_folio(struct folio *folio)
 {
+	unsigned int i, nr = folio_nr_pages(folio);
+
 #ifdef DCACHE_ALIASING_POSSIBLE
-	__flush_dcache_page(page_address(page),
+	for (i = 0; i < nr; i++)
+		__flush_dcache_page(folio_address(folio) + i * PAGE_SIZE,
 			    ((tlb_type == spitfire) &&
-			     page_mapping_file(page) != NULL));
+			     folio_flush_mapping(folio) != NULL));
 #else
-	if (page_mapping_file(page) != NULL &&
-	    tlb_type == spitfire)
-		__flush_icache_page(__pa(page_address(page)));
+	if (folio_flush_mapping(folio) != NULL &&
+	    tlb_type == spitfire) {
+		unsigned long pfn = folio_pfn(folio);
+		for (i = 0; i < nr; i++)
+			__flush_icache_page((pfn + i) * PAGE_SIZE);
+	}
 #endif
 }
 
-void smp_flush_dcache_page_impl(struct page *page, int cpu)
+void smp_flush_dcache_folio_impl(struct folio *folio, int cpu)
 {
 	int this_cpu;
 
···
 	this_cpu = get_cpu();
 
 	if (cpu == this_cpu) {
-		__local_flush_dcache_page(page);
+		__local_flush_dcache_folio(folio);
 	} else if (cpu_online(cpu)) {
-		void *pg_addr = page_address(page);
+		void *pg_addr = folio_address(folio);
 		u64 data0 = 0;
 
 		if (tlb_type == spitfire) {
 			data0 = ((u64)&xcall_flush_dcache_page_spitfire);
-			if (page_mapping_file(page) != NULL)
+			if (folio_flush_mapping(folio) != NULL)
 				data0 |= ((u64)1 << 32);
 		} else if (tlb_type == cheetah || tlb_type == cheetah_plus) {
 #ifdef DCACHE_ALIASING_POSSIBLE
···
 #endif
 		}
 		if (data0) {
-			xcall_deliver(data0, __pa(pg_addr),
-				      (u64) pg_addr, cpumask_of(cpu));
+			unsigned int i, nr = folio_nr_pages(folio);
+
+			for (i = 0; i < nr; i++) {
+				xcall_deliver(data0, __pa(pg_addr),
+					      (u64) pg_addr, cpumask_of(cpu));
 #ifdef CONFIG_DEBUG_DCFLUSH
-			atomic_inc(&dcpage_flushes_xcall);
+				atomic_inc(&dcpage_flushes_xcall);
 #endif
+				pg_addr += PAGE_SIZE;
+			}
 		}
 	}
 
 	put_cpu();
 }
 
-void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
+void flush_dcache_folio_all(struct mm_struct *mm, struct folio *folio)
 {
 	void *pg_addr;
 	u64 data0;
···
 	atomic_inc(&dcpage_flushes);
 #endif
 	data0 = 0;
-	pg_addr = page_address(page);
+	pg_addr = folio_address(folio);
 	if (tlb_type == spitfire) {
 		data0 = ((u64)&xcall_flush_dcache_page_spitfire);
-		if (page_mapping_file(page) != NULL)
+		if (folio_flush_mapping(folio) != NULL)
 			data0 |= ((u64)1 << 32);
 	} else if (tlb_type == cheetah || tlb_type == cheetah_plus) {
 #ifdef DCACHE_ALIASING_POSSIBLE
···
 #endif
 	}
 	if (data0) {
-		xcall_deliver(data0, __pa(pg_addr),
-			      (u64) pg_addr, cpu_online_mask);
+		unsigned int i, nr = folio_nr_pages(folio);
+
+		for (i = 0; i < nr; i++) {
+			xcall_deliver(data0, __pa(pg_addr),
+				      (u64) pg_addr, cpu_online_mask);
 #ifdef CONFIG_DEBUG_DCFLUSH
-		atomic_inc(&dcpage_flushes_xcall);
+			atomic_inc(&dcpage_flushes_xcall);
 #endif
+			pg_addr += PAGE_SIZE;
+		}
 	}
-	__local_flush_dcache_page(page);
+	__local_flush_dcache_folio(folio);
 
 	preempt_enable();
 }
arch/sparc/mm/init_32.c (+11 -2)
···
 {
 	unsigned long vaddr = (unsigned long)page_address(page);
 
-	if (vaddr)
-		__flush_page_to_ram(vaddr);
+	__flush_page_to_ram(vaddr);
 }
 EXPORT_SYMBOL(sparc_flush_page_to_ram);
+
+void sparc_flush_folio_to_ram(struct folio *folio)
+{
+	unsigned long vaddr = (unsigned long)folio_address(folio);
+	unsigned int i, nr = folio_nr_pages(folio);
+
+	for (i = 0; i < nr; i++)
+		__flush_page_to_ram(vaddr + i * PAGE_SIZE);
+}
+EXPORT_SYMBOL(sparc_flush_folio_to_ram);
 
 static const pgprot_t protection_map[16] = {
 	[VM_NONE]	= PAGE_NONE,
arch/sparc/mm/init_64.c (+71 -40)
···
 #endif
 #endif
 
-inline void flush_dcache_page_impl(struct page *page)
+inline void flush_dcache_folio_impl(struct folio *folio)
 {
+	unsigned int i, nr = folio_nr_pages(folio);
+
 	BUG_ON(tlb_type == hypervisor);
 #ifdef CONFIG_DEBUG_DCFLUSH
 	atomic_inc(&dcpage_flushes);
 #endif
 
 #ifdef DCACHE_ALIASING_POSSIBLE
-	__flush_dcache_page(page_address(page),
-			    ((tlb_type == spitfire) &&
-			     page_mapping_file(page) != NULL));
+	for (i = 0; i < nr; i++)
+		__flush_dcache_page(folio_address(folio) + i * PAGE_SIZE,
+				    ((tlb_type == spitfire) &&
+				     folio_flush_mapping(folio) != NULL));
 #else
-	if (page_mapping_file(page) != NULL &&
-	    tlb_type == spitfire)
-		__flush_icache_page(__pa(page_address(page)));
+	if (folio_flush_mapping(folio) != NULL &&
+	    tlb_type == spitfire) {
+		unsigned long pfn = folio_pfn(folio);
+		for (i = 0; i < nr; i++)
+			__flush_icache_page((pfn + i) * PAGE_SIZE);
+	}
 #endif
 }
···
 #define PG_dcache_cpu_mask	\
 	((1UL<<ilog2(roundup_pow_of_two(NR_CPUS)))-1UL)
 
-#define dcache_dirty_cpu(page) \
-	(((page)->flags >> PG_dcache_cpu_shift) & PG_dcache_cpu_mask)
+#define dcache_dirty_cpu(folio) \
+	(((folio)->flags >> PG_dcache_cpu_shift) & PG_dcache_cpu_mask)
 
-static inline void set_dcache_dirty(struct page *page, int this_cpu)
+static inline void set_dcache_dirty(struct folio *folio, int this_cpu)
 {
 	unsigned long mask = this_cpu;
 	unsigned long non_cpu_bits;
···
 			     "bne,pn	%%xcc, 1b\n\t"
 			     " nop"
 			     : /* no outputs */
-			     : "r" (mask), "r" (non_cpu_bits), "r" (&page->flags)
+			     : "r" (mask), "r" (non_cpu_bits), "r" (&folio->flags)
 			     : "g1", "g7");
 }
 
-static inline void clear_dcache_dirty_cpu(struct page *page, unsigned long cpu)
+static inline void clear_dcache_dirty_cpu(struct folio *folio, unsigned long cpu)
 {
 	unsigned long mask = (1UL << PG_dcache_dirty);
···
 			     " nop\n"
 			     "2:"
 			     : /* no outputs */
-			     : "r" (cpu), "r" (mask), "r" (&page->flags),
+			     : "r" (cpu), "r" (mask), "r" (&folio->flags),
 			       "i" (PG_dcache_cpu_mask),
 			       "i" (PG_dcache_cpu_shift)
 			     : "g1", "g7");
···
 
 	page = pfn_to_page(pfn);
 	if (page) {
+		struct folio *folio = page_folio(page);
 		unsigned long pg_flags;
 
-		pg_flags = page->flags;
+		pg_flags = folio->flags;
 		if (pg_flags & (1UL << PG_dcache_dirty)) {
 			int cpu = ((pg_flags >> PG_dcache_cpu_shift) &
 				   PG_dcache_cpu_mask);
···
 			 * in the SMP case.
 			 */
 			if (cpu == this_cpu)
-				flush_dcache_page_impl(page);
+				flush_dcache_folio_impl(folio);
 			else
-				smp_flush_dcache_page_impl(page, cpu);
+				smp_flush_dcache_folio_impl(folio, cpu);
 
-			clear_dcache_dirty_cpu(page, cpu);
+			clear_dcache_dirty_cpu(folio, cpu);
 
 			put_cpu();
 		}
···
 }
 #endif	/* CONFIG_HUGETLB_PAGE */
 
-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep)
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long address, pte_t *ptep, unsigned int nr)
 {
 	struct mm_struct *mm;
 	unsigned long flags;
 	bool is_huge_tsb;
 	pte_t pte = *ptep;
+	unsigned int i;
 
 	if (tlb_type != hypervisor) {
 		unsigned long pfn = pte_pfn(pte);
···
 		}
 	}
 #endif
-	if (!is_huge_tsb)
-		__update_mmu_tsb_insert(mm, MM_TSB_BASE, PAGE_SHIFT,
-					address, pte_val(pte));
+	if (!is_huge_tsb) {
+		for (i = 0; i < nr; i++) {
+			__update_mmu_tsb_insert(mm, MM_TSB_BASE, PAGE_SHIFT,
+						address, pte_val(pte));
+			address += PAGE_SIZE;
+			pte_val(pte) += PAGE_SIZE;
+		}
+	}
 
 	spin_unlock_irqrestore(&mm->context.lock, flags);
 }
 
-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
 {
+	unsigned long pfn = folio_pfn(folio);
 	struct address_space *mapping;
 	int this_cpu;
 
···
 	 * is merely the zero page.  The 'bigcore' testcase in GDB
 	 * causes this case to run millions of times.
 	 */
-	if (page == ZERO_PAGE(0))
+	if (is_zero_pfn(pfn))
 		return;
 
 	this_cpu = get_cpu();
 
-	mapping = page_mapping_file(page);
+	mapping = folio_flush_mapping(folio);
 	if (mapping && !mapping_mapped(mapping)) {
-		int dirty = test_bit(PG_dcache_dirty, &page->flags);
+		bool dirty = test_bit(PG_dcache_dirty, &folio->flags);
 		if (dirty) {
-			int dirty_cpu = dcache_dirty_cpu(page);
+			int dirty_cpu = dcache_dirty_cpu(folio);
 
 			if (dirty_cpu == this_cpu)
 				goto out;
-			smp_flush_dcache_page_impl(page, dirty_cpu);
+			smp_flush_dcache_folio_impl(folio, dirty_cpu);
 		}
-		set_dcache_dirty(page, this_cpu);
+		set_dcache_dirty(folio, this_cpu);
 	} else {
 		/* We could delay the flush for the !page_mapping
 		 * case too.  But that case is for exec env/arg
 		 * pages and those are %99 certainly going to get
 		 * faulted into the tlb (and thus flushed) anyways.
 		 */
-		flush_dcache_page_impl(page);
+		flush_dcache_folio_impl(folio);
 	}
 
 out:
 	put_cpu();
 }
-EXPORT_SYMBOL(flush_dcache_page);
+EXPORT_SYMBOL(flush_dcache_folio);
 
 void __kprobes flush_icache_range(unsigned long start, unsigned long end)
 {
···
 	setup_page_offset();
 
 	/* These build time checkes make sure that the dcache_dirty_cpu()
-	 * page->flags usage will work.
+	 * folio->flags usage will work.
 	 *
 	 * When a page gets marked as dcache-dirty, we store the
-	 * cpu number starting at bit 32 in the page->flags.  Also,
+	 * cpu number starting at bit 32 in the folio->flags.  Also,
 	 * functions like clear_dcache_dirty_cpu use the cpu mask
 	 * in 13-bit signed-immediate instruction fields.
 	 */
···
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-	if (!page)
+	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 0);
+
+	if (!ptdesc)
 		return NULL;
-	if (!pgtable_pte_page_ctor(page)) {
-		__free_page(page);
+	if (!pagetable_pte_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
 		return NULL;
 	}
-	return (pte_t *) page_address(page);
+	return ptdesc_address(ptdesc);
 }
 
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
···
 
 static void __pte_free(pgtable_t pte)
 {
-	struct page *page = virt_to_page(pte);
+	struct ptdesc *ptdesc = virt_to_ptdesc(pte);
 
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	pagetable_pte_dtor(ptdesc);
+	pagetable_free(ptdesc);
 }
 
 void pte_free(struct mm_struct *mm, pgtable_t pte)
···
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	__pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
arch/sparc/mm/srmmu.c (+3 -2)
···
 		return NULL;
 	page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
 	spin_lock(&mm->page_table_lock);
-	if (page_ref_inc_return(page) == 2 && !pgtable_pte_page_ctor(page)) {
+	if (page_ref_inc_return(page) == 2 &&
+	    !pagetable_pte_ctor(page_ptdesc(page))) {
 		page_ref_dec(page);
 		ptep = NULL;
 	}
···
 	page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
 	spin_lock(&mm->page_table_lock);
 	if (page_ref_dec_return(page) == 1)
-		pgtable_pte_page_dtor(page);
+		pagetable_pte_dtor(page_ptdesc(page));
 	spin_unlock(&mm->page_table_lock);
 
 	srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
arch/sparc/mm/tlb.c (+3 -2)
···
 	unsigned long paddr, pfn = pte_pfn(orig);
 	struct address_space *mapping;
 	struct page *page;
+	struct folio *folio;
 
 	if (!pfn_valid(pfn))
 		goto no_cache_flush;
···
 		goto no_cache_flush;
 
 	/* A real file page? */
-	mapping = page_mapping_file(page);
+	mapping = folio_flush_mapping(folio);
 	if (!mapping)
 		goto no_cache_flush;
 
 	paddr = (unsigned long) page_address(page);
 	if ((paddr ^ vaddr) & (1 << 13))
-		flush_dcache_page_all(mm, page);
+		flush_dcache_folio_all(mm, folio);
 }
 
 no_cache_flush:
arch/um/include/asm/pgalloc.h (+9 -9)
···
  */
 extern pgd_t *pgd_alloc(struct mm_struct *);
 
-#define __pte_free_tlb(tlb,pte, address)		\
-do {							\
-	pgtable_pte_page_dtor(pte);			\
-	tlb_remove_page((tlb),(pte));			\
+#define __pte_free_tlb(tlb, pte, address)			\
+do {								\
+	pagetable_pte_dtor(page_ptdesc(pte));			\
+	tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));	\
 } while (0)
 
 #ifdef CONFIG_3_LEVEL_PGTABLES
 
-#define __pmd_free_tlb(tlb, pmd, address)		\
-do {							\
-	pgtable_pmd_page_dtor(virt_to_page(pmd));	\
-	tlb_remove_page((tlb),virt_to_page(pmd));	\
-} while (0)					\
+#define __pmd_free_tlb(tlb, pmd, address)			\
+do {								\
+	pagetable_pmd_dtor(virt_to_ptdesc(pmd));		\
+	tlb_remove_page_ptdesc((tlb), virt_to_ptdesc(pmd));	\
+} while (0)
 
 #endif
 
arch/um/include/asm/pgtable.h (+2 -5)
···
 	if(pte_present(*pteptr)) *pteptr = pte_mknewprot(*pteptr);
 }
 
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
-			      pte_t *pteptr, pte_t pteval)
-{
-	set_pte(pteptr, pteval);
-}
+#define PFN_PTE_SHIFT		PAGE_SHIFT
 
 #define __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
···
 extern pte_t *virt_to_pte(struct mm_struct *mm, unsigned long addr);
 
 #define update_mmu_cache(vma,address,ptep) do {} while (0)
+#define update_mmu_cache_range(vmf, vma, address, ptep, nr) do {} while (0)
 
 /*
  * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
arch/x86/Kconfig (+3 -4)
···
 	select ARCH_HAS_DEBUG_WX
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
···
 	select ARCH_WANT_GENERAL_HUGETLB
 	select ARCH_WANT_HUGE_PMD_SHARE
 	select ARCH_WANT_LD_ORPHAN_WARN
-	select ARCH_WANT_OPTIMIZE_VMEMMAP	if X86_64
+	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP	if X86_64
+	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
 	select ARCH_WANTS_THP_SWAP	if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select BUILDTIME_TABLE_SORT
···
 config ARCH_HAS_ADD_PAGES
 	def_bool y
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
-
-config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
-	def_bool y
 
 menu "Power management and ACPI options"
 
arch/x86/include/asm/io.h (-5)
···
  *	- Arnaldo Carvalho de Melo <acme@conectiva.com.br>
  */
 
-#define ARCH_HAS_IOREMAP_WC
-#define ARCH_HAS_IOREMAP_WT
-
 #include <linux/string.h>
 #include <linux/compiler.h>
 #include <linux/cc_platform.h>
···
 #define memcpy_fromio memcpy_fromio
 #define memcpy_toio memcpy_toio
 #define memset_io memset_io
-
-#include <asm-generic/iomap.h>
 
 /*
  * ISA space is 'always mapped' on a typical x86 system, no need to
arch/x86/include/asm/pgtable.h (+14 -14)
···
 
 static inline u64 protnone_mask(u64 val);
 
+#define PFN_PTE_SHIFT	PAGE_SHIFT
+
 static inline unsigned long pte_pfn(pte_t pte)
 {
 	phys_addr_t pfn = pte_val(pte);
···
 	return res;
 }
 
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
-			      pte_t *ptep, pte_t pte)
-{
-	page_table_check_pte_set(mm, addr, ptep, pte);
-	set_pte(ptep, pte);
-}
-
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t pmd)
 {
-	page_table_check_pmd_set(mm, addr, pmdp, pmd);
+	page_table_check_pmd_set(mm, pmdp, pmd);
 	set_pmd(pmdp, pmd);
 }
 
 static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
 			      pud_t *pudp, pud_t pud)
 {
-	page_table_check_pud_set(mm, addr, pudp, pud);
+	page_table_check_pud_set(mm, pudp, pud);
 	native_set_pud(pudp, pud);
 }
···
 				       pte_t *ptep)
 {
 	pte_t pte = native_ptep_get_and_clear(ptep);
-	page_table_check_pte_clear(mm, addr, pte);
+	page_table_check_pte_clear(mm, pte);
 	return pte;
 }
···
 		 * care about updates and native needs no locking
 		 */
 		pte = native_local_ptep_get_and_clear(ptep);
-		page_table_check_pte_clear(mm, addr, pte);
+		page_table_check_pte_clear(mm, pte);
 	} else {
 		pte = ptep_get_and_clear(mm, addr, ptep);
 	}
···
 {
 	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
 
-	page_table_check_pmd_clear(mm, addr, pmd);
+	page_table_check_pmd_clear(mm, pmd);
 
 	return pmd;
 }
···
 {
 	pud_t pud = native_pudp_get_and_clear(pudp);
 
-	page_table_check_pud_clear(mm, addr, pud);
+	page_table_check_pud_clear(mm, pud);
 
 	return pud;
 }
···
 static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmdp, pmd_t pmd)
 {
-	page_table_check_pmd_set(vma->vm_mm, address, pmdp, pmd);
+	page_table_check_pmd_set(vma->vm_mm, pmdp, pmd);
 	if (IS_ENABLED(CONFIG_SMP)) {
 		return xchg(pmdp, pmd);
 	} else {
···
  */
 static inline void update_mmu_cache(struct vm_area_struct *vma,
 		unsigned long addr, pte_t *ptep)
 {
 }
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+		struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, unsigned int nr)
+{
+}
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
arch/x86/include/asm/tlbflush.h (+22 -2)
··· 3 3 #define _ASM_X86_TLBFLUSH_H 4 4 5 5 #include <linux/mm_types.h> 6 + #include <linux/mmu_notifier.h> 6 7 #include <linux/sched.h> 7 8 8 9 #include <asm/processor.h> ··· 254 253 flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false); 255 254 } 256 255 256 + static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) 257 + { 258 + bool should_defer = false; 259 + 260 + /* If remote CPUs need to be flushed then defer batch the flush */ 261 + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) 262 + should_defer = true; 263 + put_cpu(); 264 + 265 + return should_defer; 266 + } 267 + 257 268 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm) 258 269 { 259 270 /* ··· 277 264 return atomic64_inc_return(&mm->context.tlb_gen); 278 265 } 279 266 280 - static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, 281 - struct mm_struct *mm) 267 + static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, 268 + struct mm_struct *mm, 269 + unsigned long uaddr) 282 270 { 283 271 inc_mm_tlb_gen(mm); 284 272 cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm)); 273 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); 274 + } 275 + 276 + static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm) 277 + { 278 + flush_tlb_mm(mm); 285 279 } 286 280 287 281 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
arch/x86/mm/fault.c (+2 -5)
··· 1328 1328 } 1329 1329 #endif 1330 1330 1331 - #ifdef CONFIG_PER_VMA_LOCK 1332 1331 if (!(flags & FAULT_FLAG_USER)) 1333 1332 goto lock_mmap; 1334 1333 ··· 1340 1341 goto lock_mmap; 1341 1342 } 1342 1343 fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); 1343 - vma_end_read(vma); 1344 + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) 1345 + vma_end_read(vma); 1344 1346 1345 1347 if (!(fault & VM_FAULT_RETRY)) { 1346 1348 count_vm_vma_lock_event(VMA_LOCK_SUCCESS); ··· 1358 1358 return; 1359 1359 } 1360 1360 lock_mmap: 1361 - #endif /* CONFIG_PER_VMA_LOCK */ 1362 1361 1363 1362 retry: 1364 1363 vma = lock_mm_and_find_vma(mm, address, regs); ··· 1417 1418 } 1418 1419 1419 1420 mmap_read_unlock(mm); 1420 - #ifdef CONFIG_PER_VMA_LOCK 1421 1421 done: 1422 - #endif 1423 1422 if (likely(!(fault & VM_FAULT_ERROR))) 1424 1423 return; 1425 1424
arch/x86/mm/pgtable.c (+28 -19)
··· 52 52 53 53 void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte) 54 54 { 55 - pgtable_pte_page_dtor(pte); 55 + pagetable_pte_dtor(page_ptdesc(pte)); 56 56 paravirt_release_pte(page_to_pfn(pte)); 57 57 paravirt_tlb_remove_table(tlb, pte); 58 58 } ··· 60 60 #if CONFIG_PGTABLE_LEVELS > 2 61 61 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd) 62 62 { 63 - struct page *page = virt_to_page(pmd); 63 + struct ptdesc *ptdesc = virt_to_ptdesc(pmd); 64 64 paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT); 65 65 /* 66 66 * NOTE! For PAE, any changes to the top page-directory-pointer-table ··· 69 69 #ifdef CONFIG_X86_PAE 70 70 tlb->need_flush_all = 1; 71 71 #endif 72 - pgtable_pmd_page_dtor(page); 73 - paravirt_tlb_remove_table(tlb, page); 72 + pagetable_pmd_dtor(ptdesc); 73 + paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc)); 74 74 } 75 75 76 76 #if CONFIG_PGTABLE_LEVELS > 3 ··· 92 92 93 93 static inline void pgd_list_add(pgd_t *pgd) 94 94 { 95 - struct page *page = virt_to_page(pgd); 95 + struct ptdesc *ptdesc = virt_to_ptdesc(pgd); 96 96 97 - list_add(&page->lru, &pgd_list); 97 + list_add(&ptdesc->pt_list, &pgd_list); 98 98 } 99 99 100 100 static inline void pgd_list_del(pgd_t *pgd) 101 101 { 102 - struct page *page = virt_to_page(pgd); 102 + struct ptdesc *ptdesc = virt_to_ptdesc(pgd); 103 103 104 - list_del(&page->lru); 104 + list_del(&ptdesc->pt_list); 105 105 } 106 106 107 107 #define UNSHARED_PTRS_PER_PGD \ ··· 112 112 113 113 static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm) 114 114 { 115 - virt_to_page(pgd)->pt_mm = mm; 115 + virt_to_ptdesc(pgd)->pt_mm = mm; 116 116 } 117 117 118 118 struct mm_struct *pgd_page_get_mm(struct page *page) 119 119 { 120 - return page->pt_mm; 120 + return page_ptdesc(page)->pt_mm; 121 121 } 122 122 123 123 static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd) ··· 213 213 static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count) 214 214 { 215 215 int i; 216 + struct ptdesc *ptdesc; 216 217 217 218 for 
(i = 0; i < count; i++) 218 219 if (pmds[i]) { 219 - pgtable_pmd_page_dtor(virt_to_page(pmds[i])); 220 - free_page((unsigned long)pmds[i]); 220 + ptdesc = virt_to_ptdesc(pmds[i]); 221 + 222 + pagetable_pmd_dtor(ptdesc); 223 + pagetable_free(ptdesc); 221 224 mm_dec_nr_pmds(mm); 222 225 } 223 226 } ··· 233 230 234 231 if (mm == &init_mm) 235 232 gfp &= ~__GFP_ACCOUNT; 233 + gfp &= ~__GFP_HIGHMEM; 236 234 237 235 for (i = 0; i < count; i++) { 238 - pmd_t *pmd = (pmd_t *)__get_free_page(gfp); 239 - if (!pmd) 236 + pmd_t *pmd = NULL; 237 + struct ptdesc *ptdesc = pagetable_alloc(gfp, 0); 238 + 239 + if (!ptdesc) 240 240 failed = true; 241 - if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) { 242 - free_page((unsigned long)pmd); 243 - pmd = NULL; 241 + if (ptdesc && !pagetable_pmd_ctor(ptdesc)) { 242 + pagetable_free(ptdesc); 243 + ptdesc = NULL; 244 244 failed = true; 245 245 } 246 - if (pmd) 246 + if (ptdesc) { 247 247 mm_inc_nr_pmds(mm); 248 + pmd = ptdesc_address(ptdesc); 249 + } 250 + 248 251 pmds[i] = pmd; 249 252 } 250 253 ··· 839 830 840 831 free_page((unsigned long)pmd_sv); 841 832 842 - pgtable_pmd_page_dtor(virt_to_page(pmd)); 833 + pagetable_pmd_dtor(virt_to_ptdesc(pmd)); 843 834 free_page((unsigned long)pmd); 844 835 845 836 return 1;
arch/x86/mm/tlb.c (+2)
··· 10 10 #include <linux/debugfs.h> 11 11 #include <linux/sched/smt.h> 12 12 #include <linux/task_work.h> 13 + #include <linux/mmu_notifier.h> 13 14 14 15 #include <asm/tlbflush.h> 15 16 #include <asm/mmu_context.h> ··· 1037 1036 1038 1037 put_flush_tlb_info(); 1039 1038 put_cpu(); 1039 + mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end); 1040 1040 } 1041 1041 1042 1042
arch/x86/xen/mmu_pv.c (+1 -1)
··· 667 667 spinlock_t *ptl = NULL; 668 668 669 669 #if USE_SPLIT_PTE_PTLOCKS 670 - ptl = ptlock_ptr(page); 670 + ptl = ptlock_ptr(page_ptdesc(page)); 671 671 spin_lock_nest_lock(ptl, &mm->page_table_lock); 672 672 #endif 673 673
arch/xtensa/Kconfig (+1)
··· 28 28 select GENERIC_LIB_UCMPDI2 29 29 select GENERIC_PCI_IOMAP 30 30 select GENERIC_SCHED_CLOCK 31 + select GENERIC_IOREMAP if MMU 31 32 select HAVE_ARCH_AUDITSYSCALL 32 33 select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL 33 34 select HAVE_ARCH_KASAN if MMU && !XIP_KERNEL
arch/xtensa/include/asm/cacheflush.h (+7 -4)
··· 119 119 #define flush_cache_vmap(start,end) flush_cache_all() 120 120 #define flush_cache_vunmap(start,end) flush_cache_all() 121 121 122 + void flush_dcache_folio(struct folio *folio); 123 + #define flush_dcache_folio flush_dcache_folio 124 + 122 125 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 123 - void flush_dcache_page(struct page *); 126 + static inline void flush_dcache_page(struct page *page) 127 + { 128 + flush_dcache_folio(page_folio(page)); 129 + } 124 130 125 131 void local_flush_cache_range(struct vm_area_struct *vma, 126 132 unsigned long start, unsigned long end); ··· 159 153 __flush_dcache_range(start, (end) - (start)); \ 160 154 __invalidate_icache_range(start,(end) - (start)); \ 161 155 } while (0) 162 - 163 - /* This is not required, see Documentation/core-api/cachetlb.rst */ 164 - #define flush_icache_page(vma,page) do { } while (0) 165 156 166 157 #define flush_dcache_mmap_lock(mapping) do { } while (0) 167 158 #define flush_dcache_mmap_unlock(mapping) do { } while (0)
arch/xtensa/include/asm/io.h (+12 -20)
··· 16 16 #include <asm/vectors.h> 17 17 #include <linux/bug.h> 18 18 #include <linux/kernel.h> 19 + #include <linux/pgtable.h> 19 20 20 21 #include <linux/types.h> 21 22 ··· 25 24 #define PCI_IOBASE ((void __iomem *)XCHAL_KIO_BYPASS_VADDR) 26 25 27 26 #ifdef CONFIG_MMU 28 - 29 - void __iomem *xtensa_ioremap_nocache(unsigned long addr, unsigned long size); 30 - void __iomem *xtensa_ioremap_cache(unsigned long addr, unsigned long size); 31 - void xtensa_iounmap(volatile void __iomem *addr); 32 - 33 27 /* 34 - * Return the virtual address for the specified bus memory. 28 + * I/O memory mapping functions. 35 29 */ 30 + void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, 31 + unsigned long prot); 32 + #define ioremap_prot ioremap_prot 33 + #define iounmap iounmap 34 + 36 35 static inline void __iomem *ioremap(unsigned long offset, unsigned long size) 37 36 { 38 37 if (offset >= XCHAL_KIO_PADDR 39 38 && offset - XCHAL_KIO_PADDR < XCHAL_KIO_SIZE) 40 39 return (void*)(offset-XCHAL_KIO_PADDR+XCHAL_KIO_BYPASS_VADDR); 41 40 else 42 - return xtensa_ioremap_nocache(offset, size); 41 + return ioremap_prot(offset, size, 42 + pgprot_val(pgprot_noncached(PAGE_KERNEL))); 43 43 } 44 + #define ioremap ioremap 44 45 45 46 static inline void __iomem *ioremap_cache(unsigned long offset, 46 47 unsigned long size) ··· 51 48 && offset - XCHAL_KIO_PADDR < XCHAL_KIO_SIZE) 52 49 return (void*)(offset-XCHAL_KIO_PADDR+XCHAL_KIO_CACHED_VADDR); 53 50 else 54 - return xtensa_ioremap_cache(offset, size); 51 + return ioremap_prot(offset, size, pgprot_val(PAGE_KERNEL)); 52 + 55 53 } 56 54 #define ioremap_cache ioremap_cache 57 - 58 - static inline void iounmap(volatile void __iomem *addr) 59 - { 60 - unsigned long va = (unsigned long) addr; 61 - 62 - if (!(va >= XCHAL_KIO_CACHED_VADDR && 63 - va - XCHAL_KIO_CACHED_VADDR < XCHAL_KIO_SIZE) && 64 - !(va >= XCHAL_KIO_BYPASS_VADDR && 65 - va - XCHAL_KIO_BYPASS_VADDR < XCHAL_KIO_SIZE)) 66 - xtensa_iounmap(addr); 67 - } 68 - 69 55 #endif /* 
CONFIG_MMU */ 70 56 71 57 #include <asm-generic/io.h>
arch/xtensa/include/asm/pgtable.h (+8 -10)
··· 274 274 * and a page entry and page directory to the page they refer to. 275 275 */ 276 276 277 + #define PFN_PTE_SHIFT PAGE_SHIFT 277 278 #define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT) 278 279 #define pte_same(a,b) (pte_val(a) == pte_val(b)) 279 280 #define pte_page(x) pfn_to_page(pte_pfn(x)) ··· 302 301 303 302 struct mm_struct; 304 303 305 - static inline void 306 - set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval) 304 + static inline void set_pte(pte_t *ptep, pte_t pte) 307 305 { 308 - update_pte(ptep, pteval); 309 - } 310 - 311 - static inline void set_pte(pte_t *ptep, pte_t pteval) 312 - { 313 - update_pte(ptep, pteval); 306 + update_pte(ptep, pte); 314 307 } 315 308 316 309 static inline void ··· 402 407 403 408 #else 404 409 405 - extern void update_mmu_cache(struct vm_area_struct * vma, 406 - unsigned long address, pte_t *ptep); 410 + struct vm_fault; 411 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma, 412 + unsigned long address, pte_t *ptep, unsigned int nr); 413 + #define update_mmu_cache(vma, address, ptep) \ 414 + update_mmu_cache_range(NULL, vma, address, ptep, 1) 407 415 408 416 typedef pte_t *pte_addr_t; 409 417
arch/xtensa/mm/cache.c (+48 -37)
··· 121 121 * 122 122 */ 123 123 124 - void flush_dcache_page(struct page *page) 124 + void flush_dcache_folio(struct folio *folio) 125 125 { 126 - struct address_space *mapping = page_mapping_file(page); 126 + struct address_space *mapping = folio_flush_mapping(folio); 127 127 128 128 /* 129 129 * If we have a mapping but the page is not mapped to user-space ··· 132 132 */ 133 133 134 134 if (mapping && !mapping_mapped(mapping)) { 135 - if (!test_bit(PG_arch_1, &page->flags)) 136 - set_bit(PG_arch_1, &page->flags); 135 + if (!test_bit(PG_arch_1, &folio->flags)) 136 + set_bit(PG_arch_1, &folio->flags); 137 137 return; 138 138 139 139 } else { 140 - 141 - unsigned long phys = page_to_phys(page); 142 - unsigned long temp = page->index << PAGE_SHIFT; 140 + unsigned long phys = folio_pfn(folio) * PAGE_SIZE; 141 + unsigned long temp = folio_pos(folio); 142 + unsigned int i, nr = folio_nr_pages(folio); 143 143 unsigned long alias = !(DCACHE_ALIAS_EQ(temp, phys)); 144 144 unsigned long virt; 145 145 ··· 154 154 return; 155 155 156 156 preempt_disable(); 157 - virt = TLBTEMP_BASE_1 + (phys & DCACHE_ALIAS_MASK); 158 - __flush_invalidate_dcache_page_alias(virt, phys); 159 - 160 - virt = TLBTEMP_BASE_1 + (temp & DCACHE_ALIAS_MASK); 161 - 162 - if (alias) 157 + for (i = 0; i < nr; i++) { 158 + virt = TLBTEMP_BASE_1 + (phys & DCACHE_ALIAS_MASK); 163 159 __flush_invalidate_dcache_page_alias(virt, phys); 164 160 165 - if (mapping) 166 - __invalidate_icache_page_alias(virt, phys); 161 + virt = TLBTEMP_BASE_1 + (temp & DCACHE_ALIAS_MASK); 162 + 163 + if (alias) 164 + __flush_invalidate_dcache_page_alias(virt, phys); 165 + 166 + if (mapping) 167 + __invalidate_icache_page_alias(virt, phys); 168 + phys += PAGE_SIZE; 169 + temp += PAGE_SIZE; 170 + } 167 171 preempt_enable(); 168 172 } 169 173 170 174 /* There shouldn't be an entry in the cache for this page anymore. 
*/ 171 175 } 172 - EXPORT_SYMBOL(flush_dcache_page); 176 + EXPORT_SYMBOL(flush_dcache_folio); 173 177 174 178 /* 175 179 * For now, flush the whole cache. FIXME?? ··· 211 207 212 208 #endif /* DCACHE_WAY_SIZE > PAGE_SIZE */ 213 209 214 - void 215 - update_mmu_cache(struct vm_area_struct * vma, unsigned long addr, pte_t *ptep) 210 + void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma, 211 + unsigned long addr, pte_t *ptep, unsigned int nr) 216 212 { 217 213 unsigned long pfn = pte_pfn(*ptep); 218 - struct page *page; 214 + struct folio *folio; 215 + unsigned int i; 219 216 220 217 if (!pfn_valid(pfn)) 221 218 return; 222 219 223 - page = pfn_to_page(pfn); 220 + folio = page_folio(pfn_to_page(pfn)); 224 221 225 - /* Invalidate old entry in TLBs */ 226 - 227 - flush_tlb_page(vma, addr); 222 + /* Invalidate old entries in TLBs */ 223 + for (i = 0; i < nr; i++) 224 + flush_tlb_page(vma, addr + i * PAGE_SIZE); 225 + nr = folio_nr_pages(folio); 228 226 229 227 #if (DCACHE_WAY_SIZE > PAGE_SIZE) 230 228 231 - if (!PageReserved(page) && test_bit(PG_arch_1, &page->flags)) { 232 - unsigned long phys = page_to_phys(page); 229 + if (!folio_test_reserved(folio) && test_bit(PG_arch_1, &folio->flags)) { 230 + unsigned long phys = folio_pfn(folio) * PAGE_SIZE; 233 231 unsigned long tmp; 234 232 235 233 preempt_disable(); 236 - tmp = TLBTEMP_BASE_1 + (phys & DCACHE_ALIAS_MASK); 237 - __flush_invalidate_dcache_page_alias(tmp, phys); 238 - tmp = TLBTEMP_BASE_1 + (addr & DCACHE_ALIAS_MASK); 239 - __flush_invalidate_dcache_page_alias(tmp, phys); 240 - __invalidate_icache_page_alias(tmp, phys); 234 + for (i = 0; i < nr; i++) { 235 + tmp = TLBTEMP_BASE_1 + (phys & DCACHE_ALIAS_MASK); 236 + __flush_invalidate_dcache_page_alias(tmp, phys); 237 + tmp = TLBTEMP_BASE_1 + (addr & DCACHE_ALIAS_MASK); 238 + __flush_invalidate_dcache_page_alias(tmp, phys); 239 + __invalidate_icache_page_alias(tmp, phys); 240 + phys += PAGE_SIZE; 241 + } 241 242 preempt_enable(); 242 243 243 
- clear_bit(PG_arch_1, &page->flags); 244 + clear_bit(PG_arch_1, &folio->flags); 244 245 } 245 246 #else 246 - if (!PageReserved(page) && !test_bit(PG_arch_1, &page->flags) 247 + if (!folio_test_reserved(folio) && !test_bit(PG_arch_1, &folio->flags) 247 248 && (vma->vm_flags & VM_EXEC) != 0) { 248 - unsigned long paddr = (unsigned long)kmap_atomic(page); 249 - __flush_dcache_page(paddr); 250 - __invalidate_icache_page(paddr); 251 - set_bit(PG_arch_1, &page->flags); 252 - kunmap_atomic((void *)paddr); 249 + for (i = 0; i < nr; i++) { 250 + void *paddr = kmap_local_folio(folio, i * PAGE_SIZE); 251 + __flush_dcache_page((unsigned long)paddr); 252 + __invalidate_icache_page((unsigned long)paddr); 253 + kunmap_local(paddr); 254 + } 255 + set_bit(PG_arch_1, &folio->flags); 253 256 } 254 257 #endif 255 258 }
arch/xtensa/mm/ioremap.c (+14 -44)
··· 6 6 */ 7 7 8 8 #include <linux/io.h> 9 - #include <linux/vmalloc.h> 10 9 #include <linux/pgtable.h> 11 10 #include <asm/cacheflush.h> 12 11 #include <asm/io.h> 13 12 14 - static void __iomem *xtensa_ioremap(unsigned long paddr, unsigned long size, 15 - pgprot_t prot) 13 + void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, 14 + unsigned long prot) 16 15 { 17 - unsigned long offset = paddr & ~PAGE_MASK; 18 - unsigned long pfn = __phys_to_pfn(paddr); 19 - struct vm_struct *area; 20 - unsigned long vaddr; 21 - int err; 22 - 23 - paddr &= PAGE_MASK; 24 - 16 + unsigned long pfn = __phys_to_pfn((phys_addr)); 25 17 WARN_ON(pfn_valid(pfn)); 26 18 27 - size = PAGE_ALIGN(offset + size); 28 - 29 - area = get_vm_area(size, VM_IOREMAP); 30 - if (!area) 31 - return NULL; 32 - 33 - vaddr = (unsigned long)area->addr; 34 - area->phys_addr = paddr; 35 - 36 - err = ioremap_page_range(vaddr, vaddr + size, paddr, prot); 37 - 38 - if (err) { 39 - vunmap((void *)vaddr); 40 - return NULL; 41 - } 42 - 43 - flush_cache_vmap(vaddr, vaddr + size); 44 - return (void __iomem *)(offset + vaddr); 19 + return generic_ioremap_prot(phys_addr, size, __pgprot(prot)); 45 20 } 21 + EXPORT_SYMBOL(ioremap_prot); 46 22 47 - void __iomem *xtensa_ioremap_nocache(unsigned long addr, unsigned long size) 23 + void iounmap(volatile void __iomem *addr) 48 24 { 49 - return xtensa_ioremap(addr, size, pgprot_noncached(PAGE_KERNEL)); 50 - } 51 - EXPORT_SYMBOL(xtensa_ioremap_nocache); 25 + unsigned long va = (unsigned long) addr; 52 26 53 - void __iomem *xtensa_ioremap_cache(unsigned long addr, unsigned long size) 54 - { 55 - return xtensa_ioremap(addr, size, PAGE_KERNEL); 56 - } 57 - EXPORT_SYMBOL(xtensa_ioremap_cache); 27 + if ((va >= XCHAL_KIO_CACHED_VADDR && 28 + va - XCHAL_KIO_CACHED_VADDR < XCHAL_KIO_SIZE) || 29 + (va >= XCHAL_KIO_BYPASS_VADDR && 30 + va - XCHAL_KIO_BYPASS_VADDR < XCHAL_KIO_SIZE)) 31 + return; 58 32 59 - void xtensa_iounmap(volatile void __iomem *io_addr) 60 - { 61 - void *addr = 
(void *)(PAGE_MASK & (unsigned long)io_addr); 62 - 63 - vunmap(addr); 33 + generic_iounmap(addr); 64 34 } 65 - EXPORT_SYMBOL(xtensa_iounmap); 35 + EXPORT_SYMBOL(iounmap);
drivers/acpi/acpi_memhotplug.c (+1 -2)
··· 211 211 if (!info->length) 212 212 continue; 213 213 214 - if (mhp_supports_memmap_on_memory(info->length)) 215 - mhp_flags |= MHP_MEMMAP_ON_MEMORY; 214 + mhp_flags |= MHP_MEMMAP_ON_MEMORY; 216 215 result = __add_memory(mgid, info->start_addr, info->length, 217 216 mhp_flags); 218 217
drivers/base/memory.c (+17 -10)
··· 105 105 static void memory_block_release(struct device *dev) 106 106 { 107 107 struct memory_block *mem = to_memory_block(dev); 108 - 108 + /* Verify that the altmap is freed */ 109 + WARN_ON(mem->altmap); 109 110 kfree(mem); 110 111 } 111 112 ··· 184 183 { 185 184 unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); 186 185 unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; 187 - unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; 186 + unsigned long nr_vmemmap_pages = 0; 188 187 struct zone *zone; 189 188 int ret; 190 189 ··· 201 200 * stage helps to keep accounting easier to follow - e.g vmemmaps 202 201 * belong to the same zone as the memory they backed. 203 202 */ 203 + if (mem->altmap) 204 + nr_vmemmap_pages = mem->altmap->free; 205 + 204 206 if (nr_vmemmap_pages) { 205 207 ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone); 206 208 if (ret) ··· 234 230 { 235 231 unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); 236 232 unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; 237 - unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; 233 + unsigned long nr_vmemmap_pages = 0; 238 234 int ret; 239 235 240 236 if (!mem->zone) ··· 244 240 * Unaccount before offlining, such that unpopulated zone and kthreads 245 241 * can properly be torn down in offline_pages(). 
246 242 */ 243 + if (mem->altmap) 244 + nr_vmemmap_pages = mem->altmap->free; 245 + 247 246 if (nr_vmemmap_pages) 248 247 adjust_present_page_count(pfn_to_page(start_pfn), mem->group, 249 248 -nr_vmemmap_pages); ··· 733 726 #endif 734 727 735 728 static int add_memory_block(unsigned long block_id, unsigned long state, 736 - unsigned long nr_vmemmap_pages, 729 + struct vmem_altmap *altmap, 737 730 struct memory_group *group) 738 731 { 739 732 struct memory_block *mem; ··· 751 744 mem->start_section_nr = block_id * sections_per_block; 752 745 mem->state = state; 753 746 mem->nid = NUMA_NO_NODE; 754 - mem->nr_vmemmap_pages = nr_vmemmap_pages; 747 + mem->altmap = altmap; 755 748 INIT_LIST_HEAD(&mem->group_next); 756 749 757 750 #ifndef CONFIG_NUMA ··· 790 783 if (section_count == 0) 791 784 return 0; 792 785 return add_memory_block(memory_block_id(base_section_nr), 793 - MEM_ONLINE, 0, NULL); 786 + MEM_ONLINE, NULL, NULL); 794 787 } 795 788 796 789 static int add_hotplug_memory_block(unsigned long block_id, 797 - unsigned long nr_vmemmap_pages, 790 + struct vmem_altmap *altmap, 798 791 struct memory_group *group) 799 792 { 800 - return add_memory_block(block_id, MEM_OFFLINE, nr_vmemmap_pages, group); 793 + return add_memory_block(block_id, MEM_OFFLINE, altmap, group); 801 794 } 802 795 803 796 static void remove_memory_block(struct memory_block *memory) ··· 825 818 * Called under device_hotplug_lock. 826 819 */ 827 820 int create_memory_block_devices(unsigned long start, unsigned long size, 828 - unsigned long vmemmap_pages, 821 + struct vmem_altmap *altmap, 829 822 struct memory_group *group) 830 823 { 831 824 const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start)); ··· 839 832 return -EINVAL; 840 833 841 834 for (block_id = start_block_id; block_id != end_block_id; block_id++) { 842 - ret = add_hotplug_memory_block(block_id, vmemmap_pages, group); 835 + ret = add_hotplug_memory_block(block_id, altmap, group); 843 836 if (ret) 844 837 break; 845 838 }
drivers/base/node.c (+2 -2)
··· 446 446 "Node %d AnonHugePages: %8lu kB\n" 447 447 "Node %d ShmemHugePages: %8lu kB\n" 448 448 "Node %d ShmemPmdMapped: %8lu kB\n" 449 - "Node %d FileHugePages: %8lu kB\n" 450 - "Node %d FilePmdMapped: %8lu kB\n" 449 + "Node %d FileHugePages: %8lu kB\n" 450 + "Node %d FilePmdMapped: %8lu kB\n" 451 451 #endif 452 452 #ifdef CONFIG_UNACCEPTED_MEMORY 453 453 "Node %d Unaccepted: %8lu kB\n"
drivers/dax/device.c (+8 -14)
··· 228 228 } 229 229 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 230 230 231 - static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, 232 - enum page_entry_size pe_size) 231 + static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order) 233 232 { 234 233 struct file *filp = vmf->vma->vm_file; 235 234 vm_fault_t rc = VM_FAULT_SIGBUS; 236 235 int id; 237 236 struct dev_dax *dev_dax = filp->private_data; 238 237 239 - dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm, 238 + dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm, 240 239 (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read", 241 - vmf->vma->vm_start, vmf->vma->vm_end, pe_size); 240 + vmf->vma->vm_start, vmf->vma->vm_end, order); 242 241 243 242 id = dax_read_lock(); 244 - switch (pe_size) { 245 - case PE_SIZE_PTE: 243 + if (order == 0) 246 244 rc = __dev_dax_pte_fault(dev_dax, vmf); 247 - break; 248 - case PE_SIZE_PMD: 245 + else if (order == PMD_ORDER) 249 246 rc = __dev_dax_pmd_fault(dev_dax, vmf); 250 - break; 251 - case PE_SIZE_PUD: 247 + else if (order == PUD_ORDER) 252 248 rc = __dev_dax_pud_fault(dev_dax, vmf); 253 - break; 254 - default: 249 + else 255 250 rc = VM_FAULT_SIGBUS; 256 - } 257 251 258 252 dax_read_unlock(id); 259 253 ··· 256 262 257 263 static vm_fault_t dev_dax_fault(struct vm_fault *vmf) 258 264 { 259 - return dev_dax_huge_fault(vmf, PE_SIZE_PTE); 265 + return dev_dax_huge_fault(vmf, 0); 260 266 } 261 267 262 268 static int dev_dax_may_split(struct vm_area_struct *vma, unsigned long addr)
drivers/dax/kmem.c (+2 -2)
··· 264 264 return rc; 265 265 266 266 error_dax_driver: 267 - destroy_memory_type(dax_slowmem_type); 267 + put_memory_type(dax_slowmem_type); 268 268 err_dax_slowmem_type: 269 269 kfree_const(kmem_name); 270 270 return rc; ··· 275 275 dax_driver_unregister(&device_dax_kmem_driver); 276 276 if (!any_hotremove_failed) 277 277 kfree_const(kmem_name); 278 - destroy_memory_type(dax_slowmem_type); 278 + put_memory_type(dax_slowmem_type); 279 279 } 280 280 281 281 MODULE_AUTHOR("Intel Corporation");
drivers/gpu/drm/amd/amdkfd/kfd_svm.c (+1 -4)
··· 2621 2621 return -EFAULT; 2622 2622 } 2623 2623 2624 - *is_heap_stack = (vma->vm_start <= vma->vm_mm->brk && 2625 - vma->vm_end >= vma->vm_mm->start_brk) || 2626 - (vma->vm_start <= vma->vm_mm->start_stack && 2627 - vma->vm_end >= vma->vm_mm->start_stack); 2624 + *is_heap_stack = vma_is_initial_heap(vma) || vma_is_initial_stack(vma); 2628 2625 2629 2626 start_limit = max(vma->vm_start >> PAGE_SHIFT, 2630 2627 (unsigned long)ALIGN_DOWN(addr, 2UL << 8));
drivers/gpu/drm/arm/display/include/malidp_utils.h (+1 -1)
··· 35 35 rg->end = end; 36 36 } 37 37 38 - static inline bool in_range(struct malidp_range *rg, u32 v) 38 + static inline bool malidp_in_range(struct malidp_range *rg, u32 v) 39 39 { 40 40 return (v >= rg->start) && (v <= rg->end); 41 41 }
drivers/gpu/drm/arm/display/komeda/komeda_pipeline_state.c (+12 -12)
··· 305 305 if (komeda_fb_check_src_coords(kfb, src_x, src_y, src_w, src_h)) 306 306 return -EINVAL; 307 307 308 - if (!in_range(&layer->hsize_in, src_w)) { 308 + if (!malidp_in_range(&layer->hsize_in, src_w)) { 309 309 DRM_DEBUG_ATOMIC("invalidate src_w %d.\n", src_w); 310 310 return -EINVAL; 311 311 } 312 312 313 - if (!in_range(&layer->vsize_in, src_h)) { 313 + if (!malidp_in_range(&layer->vsize_in, src_h)) { 314 314 DRM_DEBUG_ATOMIC("invalidate src_h %d.\n", src_h); 315 315 return -EINVAL; 316 316 } ··· 452 452 hsize_out = dflow->out_w; 453 453 vsize_out = dflow->out_h; 454 454 455 - if (!in_range(&scaler->hsize, hsize_in) || 456 - !in_range(&scaler->hsize, hsize_out)) { 455 + if (!malidp_in_range(&scaler->hsize, hsize_in) || 456 + !malidp_in_range(&scaler->hsize, hsize_out)) { 457 457 DRM_DEBUG_ATOMIC("Invalid horizontal sizes"); 458 458 return -EINVAL; 459 459 } 460 460 461 - if (!in_range(&scaler->vsize, vsize_in) || 462 - !in_range(&scaler->vsize, vsize_out)) { 461 + if (!malidp_in_range(&scaler->vsize, vsize_in) || 462 + !malidp_in_range(&scaler->vsize, vsize_out)) { 463 463 DRM_DEBUG_ATOMIC("Invalid vertical sizes"); 464 464 return -EINVAL; 465 465 } ··· 574 574 return -EINVAL; 575 575 } 576 576 577 - if (!in_range(&splitter->hsize, dflow->in_w)) { 577 + if (!malidp_in_range(&splitter->hsize, dflow->in_w)) { 578 578 DRM_DEBUG_ATOMIC("split in_w:%d is out of the acceptable range.\n", 579 579 dflow->in_w); 580 580 return -EINVAL; 581 581 } 582 582 583 - if (!in_range(&splitter->vsize, dflow->in_h)) { 583 + if (!malidp_in_range(&splitter->vsize, dflow->in_h)) { 584 584 DRM_DEBUG_ATOMIC("split in_h: %d exceeds the acceptable range.\n", 585 585 dflow->in_h); 586 586 return -EINVAL; ··· 624 624 return -EINVAL; 625 625 } 626 626 627 - if (!in_range(&merger->hsize_merged, output->out_w)) { 627 + if (!malidp_in_range(&merger->hsize_merged, output->out_w)) { 628 628 DRM_DEBUG_ATOMIC("merged_w: %d is out of the accepted range.\n", 629 629 output->out_w); 630 630 
return -EINVAL; 631 631 } 632 632 633 - if (!in_range(&merger->vsize_merged, output->out_h)) { 633 + if (!malidp_in_range(&merger->vsize_merged, output->out_h)) { 634 634 DRM_DEBUG_ATOMIC("merged_h: %d is out of the accepted range.\n", 635 635 output->out_h); 636 636 return -EINVAL; ··· 866 866 * input/output range. 867 867 */ 868 868 if (dflow->en_scaling && scaler) 869 - dflow->en_split = !in_range(&scaler->hsize, dflow->in_w) || 870 - !in_range(&scaler->hsize, dflow->out_w); 869 + dflow->en_split = !malidp_in_range(&scaler->hsize, dflow->in_w) || 870 + !malidp_in_range(&scaler->hsize, dflow->out_w); 871 871 } 872 872 873 873 static bool merger_is_available(struct komeda_pipeline *pipe,
drivers/gpu/drm/msm/adreno/a6xx_gmu.c (-6)
··· 676 676 u32 data[]; 677 677 }; 678 678 679 - /* this should be a general kernel helper */ 680 - static int in_range(u32 addr, u32 start, u32 size) 681 - { 682 - return addr >= start && addr < start + size; 683 - } 684 - 685 679 static bool fw_block_mem(struct a6xx_gmu_bo *bo, const struct block_header *blk) 686 680 { 687 681 if (!in_range(blk->addr, bo->iova, bo->size))
drivers/iommu/amd/iommu_v2.c (+5 -5)
··· 355 355 return container_of(mn, struct pasid_state, mn); 356 356 } 357 357 358 - static void mn_invalidate_range(struct mmu_notifier *mn, 359 - struct mm_struct *mm, 360 - unsigned long start, unsigned long end) 358 + static void mn_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn, 359 + struct mm_struct *mm, 360 + unsigned long start, unsigned long end) 361 361 { 362 362 struct pasid_state *pasid_state; 363 363 struct device_state *dev_state; ··· 391 391 } 392 392 393 393 static const struct mmu_notifier_ops iommu_mn = { 394 - .release = mn_release, 395 - .invalidate_range = mn_invalidate_range, 394 + .release = mn_release, 395 + .arch_invalidate_secondary_tlbs = mn_arch_invalidate_secondary_tlbs, 396 396 }; 397 397 398 398 static void set_pri_tag_status(struct pasid_state *pasid_state,
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c (+20 -9)
··· 186 186 } 187 187 } 188 188 189 - static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn, 190 - struct mm_struct *mm, 191 - unsigned long start, unsigned long end) 189 + static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn, 190 + struct mm_struct *mm, 191 + unsigned long start, 192 + unsigned long end) 192 193 { 193 194 struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn); 194 195 struct arm_smmu_domain *smmu_domain = smmu_mn->domain; ··· 201 200 * range. So do a simple translation here by calculating size correctly. 202 201 */ 203 202 size = end - start; 203 + if (size == ULONG_MAX) 204 + size = 0; 204 205 205 - if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM)) 206 - arm_smmu_tlb_inv_range_asid(start, size, smmu_mn->cd->asid, 207 - PAGE_SIZE, false, smmu_domain); 206 + if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM)) { 207 + if (!size) 208 + arm_smmu_tlb_inv_asid(smmu_domain->smmu, 209 + smmu_mn->cd->asid); 210 + else 211 + arm_smmu_tlb_inv_range_asid(start, size, 212 + smmu_mn->cd->asid, 213 + PAGE_SIZE, false, 214 + smmu_domain); 215 + } 216 + 208 217 arm_smmu_atc_inv_domain(smmu_domain, mm->pasid, start, size); 209 218 } 210 219 ··· 248 237 } 249 238 250 239 static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = { 251 - .invalidate_range = arm_smmu_mm_invalidate_range, 252 - .release = arm_smmu_mm_release, 253 - .free_notifier = arm_smmu_mmu_notifier_free, 240 + .arch_invalidate_secondary_tlbs = arm_smmu_mm_arch_invalidate_secondary_tlbs, 241 + .release = arm_smmu_mm_release, 242 + .free_notifier = arm_smmu_mmu_notifier_free, 254 243 }; 255 244 256 245 /* Allocate or get existing MMU notifier for this {domain, mm} pair */
drivers/iommu/intel/svm.c (+4 -4)
··· 219 219 } 220 220 221 221 /* Pages have been freed at this point */ 222 - static void intel_invalidate_range(struct mmu_notifier *mn, 223 - struct mm_struct *mm, 224 - unsigned long start, unsigned long end) 222 + static void intel_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn, 223 + struct mm_struct *mm, 224 + unsigned long start, unsigned long end) 225 225 { 226 226 struct intel_svm *svm = container_of(mn, struct intel_svm, notifier); 227 227 ··· 256 256 257 257 static const struct mmu_notifier_ops intel_mmuops = { 258 258 .release = intel_mm_release, 259 - .invalidate_range = intel_invalidate_range, 259 + .arch_invalidate_secondary_tlbs = intel_arch_invalidate_secondary_tlbs, 260 260 }; 261 261 262 262 static DEFINE_MUTEX(pasid_mutex);
+4 -4
drivers/misc/ocxl/link.c
··· 491 491 } 492 492 EXPORT_SYMBOL_GPL(ocxl_link_release); 493 493 494 - static void invalidate_range(struct mmu_notifier *mn, 495 - struct mm_struct *mm, 496 - unsigned long start, unsigned long end) 494 + static void arch_invalidate_secondary_tlbs(struct mmu_notifier *mn, 495 + struct mm_struct *mm, 496 + unsigned long start, unsigned long end) 497 497 { 498 498 struct pe_data *pe_data = container_of(mn, struct pe_data, mmu_notifier); 499 499 struct ocxl_link *link = pe_data->link; ··· 509 509 } 510 510 511 511 static const struct mmu_notifier_ops ocxl_mmu_notifier_ops = { 512 - .invalidate_range = invalidate_range, 512 + .arch_invalidate_secondary_tlbs = arch_invalidate_secondary_tlbs, 513 513 }; 514 514 515 515 static u64 calculate_cfg_state(bool kernel)
+9 -9
drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
··· 2126 2126 .set_link_ksettings = set_link_ksettings, 2127 2127 }; 2128 2128 2129 - static int in_range(int val, int lo, int hi) 2129 + static int cxgb_in_range(int val, int lo, int hi) 2130 2130 { 2131 2131 return val < 0 || (val <= hi && val >= lo); 2132 2132 } ··· 2162 2162 return -EINVAL; 2163 2163 if (t.qset_idx >= SGE_QSETS) 2164 2164 return -EINVAL; 2165 - if (!in_range(t.intr_lat, 0, M_NEWTIMER) || 2166 - !in_range(t.cong_thres, 0, 255) || 2167 - !in_range(t.txq_size[0], MIN_TXQ_ENTRIES, 2165 + if (!cxgb_in_range(t.intr_lat, 0, M_NEWTIMER) || 2166 + !cxgb_in_range(t.cong_thres, 0, 255) || 2167 + !cxgb_in_range(t.txq_size[0], MIN_TXQ_ENTRIES, 2168 2168 MAX_TXQ_ENTRIES) || 2169 - !in_range(t.txq_size[1], MIN_TXQ_ENTRIES, 2169 + !cxgb_in_range(t.txq_size[1], MIN_TXQ_ENTRIES, 2170 2170 MAX_TXQ_ENTRIES) || 2171 - !in_range(t.txq_size[2], MIN_CTRL_TXQ_ENTRIES, 2171 + !cxgb_in_range(t.txq_size[2], MIN_CTRL_TXQ_ENTRIES, 2172 2172 MAX_CTRL_TXQ_ENTRIES) || 2173 - !in_range(t.fl_size[0], MIN_FL_ENTRIES, 2173 + !cxgb_in_range(t.fl_size[0], MIN_FL_ENTRIES, 2174 2174 MAX_RX_BUFFERS) || 2175 - !in_range(t.fl_size[1], MIN_FL_ENTRIES, 2175 + !cxgb_in_range(t.fl_size[1], MIN_FL_ENTRIES, 2176 2176 MAX_RX_JUMBO_BUFFERS) || 2177 - !in_range(t.rspq_size, MIN_RSPQ_ENTRIES, 2177 + !cxgb_in_range(t.rspq_size, MIN_RSPQ_ENTRIES, 2178 2178 MAX_RSPQ_ENTRIES)) 2179 2179 return -EINVAL; 2180 2180
+1 -1
drivers/net/ethernet/sfc/io.h
··· 46 46 */ 47 47 #ifdef CONFIG_X86_64 48 48 /* PIO is a win only if write-combining is possible */ 49 - #ifdef ARCH_HAS_IOREMAP_WC 49 + #ifdef ioremap_wc 50 50 #define EFX_USE_PIO 1 51 51 #endif 52 52 #endif
+1 -1
drivers/net/ethernet/sfc/siena/io.h
··· 70 70 */ 71 71 #ifdef CONFIG_X86_64 72 72 /* PIO is a win only if write-combining is possible */ 73 - #ifdef ARCH_HAS_IOREMAP_WC 73 + #ifdef ioremap_wc 74 74 #define EFX_USE_PIO 1 75 75 #endif 76 76 #endif
+1 -1
drivers/nvdimm/pfn_devs.c
··· 100 100 101 101 if (has_transparent_hugepage()) { 102 102 alignments[1] = HPAGE_PMD_SIZE; 103 - if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)) 103 + if (has_transparent_pud_hugepage()) 104 104 alignments[2] = HPAGE_PUD_SIZE; 105 105 } 106 106
+1 -1
drivers/tty/sysrq.c
··· 342 342 343 343 static void sysrq_handle_showmem(int key) 344 344 { 345 - show_mem(0, NULL); 345 + show_mem(); 346 346 } 347 347 static const struct sysrq_key_op sysrq_showmem_op = { 348 348 .handler = sysrq_handle_showmem,
+1 -1
drivers/tty/vt/keyboard.c
··· 606 606 607 607 static void fn_show_mem(struct vc_data *vc) 608 608 { 609 - show_mem(0, NULL); 609 + show_mem(); 610 610 } 611 611 612 612 static void fn_show_state(struct vc_data *vc)
+2 -2
drivers/virt/acrn/ioreq.c
··· 351 351 return is_handled; 352 352 } 353 353 354 - static bool in_range(struct acrn_ioreq_range *range, 354 + static bool acrn_in_range(struct acrn_ioreq_range *range, 355 355 struct acrn_io_request *req) 356 356 { 357 357 bool ret = false; ··· 389 389 list_for_each_entry(client, &vm->ioreq_clients, list) { 390 390 read_lock_bh(&client->range_lock); 391 391 list_for_each_entry(range, &client->range_list, list) { 392 - if (in_range(range, req)) { 392 + if (acrn_in_range(range, req)) { 393 393 found = client; 394 394 break; 395 395 }
+2
fs/9p/cache.c
··· 68 68 &path, sizeof(path), 69 69 &version, sizeof(version), 70 70 i_size_read(&v9inode->netfs.inode)); 71 + if (v9inode->netfs.cache) 72 + mapping_set_release_always(inode->i_mapping); 71 73 72 74 p9_debug(P9_DEBUG_FSC, "inode %p get cookie %p\n", 73 75 inode, v9fs_inode_cookie(v9inode));
+3 -4
fs/Kconfig
··· 169 169 config TMPFS 170 170 bool "Tmpfs virtual memory file system support (former shm fs)" 171 171 depends on SHMEM 172 + select MEMFD_CREATE 172 173 help 173 174 Tmpfs is a file system which keeps all files in virtual memory. 174 175 ··· 253 252 bool "HugeTLB file system support" 254 253 depends on X86 || IA64 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN 255 254 depends on (SYSFS || SYSCTL) 255 + select MEMFD_CREATE 256 256 help 257 257 hugetlbfs is a filesystem backing for HugeTLB pages, based on 258 258 ramfs. For architectures that support it, say Y here and read ··· 266 264 267 265 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP 268 266 def_bool HUGETLB_PAGE 269 - depends on ARCH_WANT_OPTIMIZE_VMEMMAP 267 + depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP 270 268 depends on SPARSEMEM_VMEMMAP 271 269 272 270 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON ··· 277 275 The HugeTLB VmemmapvOptimization (HVO) defaults to off. Say Y here to 278 276 enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off 279 277 (boot command line) or hugetlb_optimize_vmemmap (sysctl). 280 - 281 - config MEMFD_CREATE 282 - def_bool TMPFS || HUGETLBFS 283 278 284 279 config ARCH_HAS_GIGANTIC_PAGE 285 280 bool
+38 -39
fs/affs/file.c
··· 523 523 return ERR_PTR(err); 524 524 } 525 525 526 - static int 527 - affs_do_readpage_ofs(struct page *page, unsigned to, int create) 526 + static int affs_do_read_folio_ofs(struct folio *folio, size_t to, int create) 528 527 { 529 - struct inode *inode = page->mapping->host; 528 + struct inode *inode = folio->mapping->host; 530 529 struct super_block *sb = inode->i_sb; 531 530 struct buffer_head *bh; 532 - unsigned pos = 0; 533 - u32 bidx, boff, bsize; 531 + size_t pos = 0; 532 + size_t bidx, boff, bsize; 534 533 u32 tmp; 535 534 536 - pr_debug("%s(%lu, %ld, 0, %d)\n", __func__, inode->i_ino, 537 - page->index, to); 538 - BUG_ON(to > PAGE_SIZE); 535 + pr_debug("%s(%lu, %ld, 0, %zu)\n", __func__, inode->i_ino, 536 + folio->index, to); 537 + BUG_ON(to > folio_size(folio)); 539 538 bsize = AFFS_SB(sb)->s_data_blksize; 540 - tmp = page->index << PAGE_SHIFT; 539 + tmp = folio_pos(folio); 541 540 bidx = tmp / bsize; 542 541 boff = tmp % bsize; 543 542 ··· 546 547 return PTR_ERR(bh); 547 548 tmp = min(bsize - boff, to - pos); 548 549 BUG_ON(pos + tmp > to || tmp > bsize); 549 - memcpy_to_page(page, pos, AFFS_DATA(bh) + boff, tmp); 550 + memcpy_to_folio(folio, pos, AFFS_DATA(bh) + boff, tmp); 550 551 affs_brelse(bh); 551 552 bidx++; 552 553 pos += tmp; ··· 626 627 return PTR_ERR(bh); 627 628 } 628 629 629 - static int 630 - affs_read_folio_ofs(struct file *file, struct folio *folio) 630 + static int affs_read_folio_ofs(struct file *file, struct folio *folio) 631 631 { 632 - struct page *page = &folio->page; 633 - struct inode *inode = page->mapping->host; 634 - u32 to; 632 + struct inode *inode = folio->mapping->host; 633 + size_t to; 635 634 int err; 636 635 637 - pr_debug("%s(%lu, %ld)\n", __func__, inode->i_ino, page->index); 638 - to = PAGE_SIZE; 639 - if (((page->index + 1) << PAGE_SHIFT) > inode->i_size) { 640 - to = inode->i_size & ~PAGE_MASK; 641 - memset(page_address(page) + to, 0, PAGE_SIZE - to); 636 + pr_debug("%s(%lu, %ld)\n", __func__, inode->i_ino, folio->index); 637 + to = folio_size(folio); 638 + if (folio_pos(folio) + to > inode->i_size) { 639 + to = inode->i_size - folio_pos(folio); 640 + folio_zero_segment(folio, to, folio_size(folio)); 642 641 } 643 642 644 - err = affs_do_readpage_ofs(page, to, 0); 643 + err = affs_do_read_folio_ofs(folio, to, 0); 645 644 if (!err) 646 - SetPageUptodate(page); 647 - unlock_page(page); 645 + folio_mark_uptodate(folio); 646 + folio_unlock(folio); 648 647 return err; 649 648 } 650 649 ··· 651 654 struct page **pagep, void **fsdata) 652 655 { 653 656 struct inode *inode = mapping->host; 654 - struct page *page; 657 + struct folio *folio; 655 658 pgoff_t index; 656 659 int err = 0; 657 660 ··· 667 670 } 668 671 669 672 index = pos >> PAGE_SHIFT; 670 - page = grab_cache_page_write_begin(mapping, index); 671 - if (!page) 672 - return -ENOMEM; 673 - *pagep = page; 673 + folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN, 674 + mapping_gfp_mask(mapping)); 675 + if (IS_ERR(folio)) 676 + return PTR_ERR(folio); 677 + *pagep = &folio->page; 674 678 675 - if (PageUptodate(page)) 679 + if (folio_test_uptodate(folio)) 676 680 return 0; 677 681 678 682 /* XXX: inefficient but safe in the face of short writes */ 679 - err = affs_do_readpage_ofs(page, PAGE_SIZE, 1); 683 + err = affs_do_read_folio_ofs(folio, folio_size(folio), 1); 680 684 if (err) { 681 - unlock_page(page); 682 - put_page(page); 685 + folio_unlock(folio); 686 + folio_put(folio); 683 687 } 684 688 return err; 685 689 } ··· 689 691 loff_t pos, unsigned len, unsigned copied, 690 692 struct page *page, void *fsdata) 691 693 { 694 + struct folio *folio = page_folio(page); 692 695 struct inode *inode = mapping->host; 693 696 struct super_block *sb = inode->i_sb; 694 697 struct buffer_head *bh, *prev_bh; ··· 703 704 to = from + len; 704 705 /* 705 706 * XXX: not sure if this can handle short copies (len < copied), but 706 - * we don't have to, because the page should always be uptodate here, 707 + * we don't have to, because the folio should always be uptodate here, 707 708 * due to write_begin. 708 709 */ 709 710 710 711 pr_debug("%s(%lu, %llu, %llu)\n", __func__, inode->i_ino, pos, 711 712 pos + len); 712 713 bsize = AFFS_SB(sb)->s_data_blksize; 713 - data = page_address(page); 714 + data = folio_address(folio); 714 715 715 716 bh = NULL; 716 717 written = 0; 717 - tmp = (page->index << PAGE_SHIFT) + from; 718 + tmp = (folio->index << PAGE_SHIFT) + from; 718 719 bidx = tmp / bsize; 719 720 boff = tmp % bsize; 720 721 if (boff) { ··· 806 807 from += tmp; 807 808 bidx++; 808 809 } 809 - SetPageUptodate(page); 810 + folio_mark_uptodate(folio); 810 811 811 812 done: 812 813 affs_brelse(bh); 813 - tmp = (page->index << PAGE_SHIFT) + from; 814 + tmp = (folio->index << PAGE_SHIFT) + from; 814 815 if (tmp > inode->i_size) 815 816 inode->i_size = AFFS_I(inode)->mmu_private = tmp; 816 817 ··· 821 822 } 822 823 823 824 err_first_bh: 824 - unlock_page(page); 825 - put_page(page); 825 + folio_unlock(folio); 826 + folio_put(folio); 826 827 827 828 return written; 828 829
+5 -7
fs/affs/symlink.c
··· 13 13 14 14 static int affs_symlink_read_folio(struct file *file, struct folio *folio) 15 15 { 16 - struct page *page = &folio->page; 17 16 struct buffer_head *bh; 18 - struct inode *inode = page->mapping->host; 19 - char *link = page_address(page); 17 + struct inode *inode = folio->mapping->host; 18 + char *link = folio_address(folio); 20 19 struct slink_front *lf; 21 20 int i, j; 22 21 char c; ··· 57 58 } 58 59 link[i] = '\0'; 59 60 affs_brelse(bh); 60 - SetPageUptodate(page); 61 - unlock_page(page); 61 + folio_mark_uptodate(folio); 62 + folio_unlock(folio); 62 63 return 0; 63 64 fail: 64 - SetPageError(page); 65 - unlock_page(page); 65 + folio_unlock(folio); 66 66 return -EIO; 67 67 } 68 68
+2
fs/afs/internal.h
··· 681 681 { 682 682 #ifdef CONFIG_AFS_FSCACHE 683 683 vnode->netfs.cache = cookie; 684 + if (cookie) 685 + mapping_set_release_always(vnode->netfs.inode.i_mapping); 684 686 #endif 685 687 } 686 688
-2
fs/btrfs/misc.h
··· 8 8 #include <linux/math64.h> 9 9 #include <linux/rbtree.h> 10 10 11 - #define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len)) 12 - 13 11 /* 14 12 * Enumerate bits using enum autoincrement. Define the @name as the n-th bit. 15 13 */
+8 -28
fs/buffer.c
··· 1539 1539 bh_lru_unlock(); 1540 1540 } 1541 1541 1542 - void set_bh_page(struct buffer_head *bh, 1543 - struct page *page, unsigned long offset) 1544 - { 1545 - bh->b_page = page; 1546 - BUG_ON(offset >= PAGE_SIZE); 1547 - if (PageHighMem(page)) 1548 - /* 1549 - * This catches illegal uses and preserves the offset: 1550 - */ 1551 - bh->b_data = (char *)(0 + offset); 1552 - else 1553 - bh->b_data = page_address(page) + offset; 1554 - } 1555 - EXPORT_SYMBOL(set_bh_page); 1556 - 1557 1542 void folio_set_bh(struct buffer_head *bh, struct folio *folio, 1558 1543 unsigned long offset) 1559 1544 { ··· 2165 2180 } 2166 2181 EXPORT_SYMBOL(__block_write_begin); 2167 2182 2168 - static int __block_commit_write(struct inode *inode, struct folio *folio, 2169 - size_t from, size_t to) 2183 + static void __block_commit_write(struct folio *folio, size_t from, size_t to) 2170 2184 { 2171 2185 size_t block_start, block_end; 2172 2186 bool partial = false; ··· 2200 2216 */ 2201 2217 if (!partial) 2202 2218 folio_mark_uptodate(folio); 2203 - return 0; 2204 2219 } 2205 2220 2206 2221 /* ··· 2236 2253 struct page *page, void *fsdata) 2237 2254 { 2238 2255 struct folio *folio = page_folio(page); 2239 - struct inode *inode = mapping->host; 2240 2256 size_t start = pos - folio_pos(folio); 2241 2257 2242 2258 if (unlikely(copied < len)) { ··· 2259 2277 flush_dcache_folio(folio); 2260 2278 2261 2279 /* This could be a short (even 0-length) commit */ 2262 - __block_commit_write(inode, folio, start, start + copied); 2280 + __block_commit_write(folio, start, start + copied); 2263 2281 2264 2282 return copied; 2265 2283 } ··· 2580 2598 } 2581 2599 EXPORT_SYMBOL(cont_write_begin); 2582 2600 2583 - int block_commit_write(struct page *page, unsigned from, unsigned to) 2601 + void block_commit_write(struct page *page, unsigned from, unsigned to) 2584 2602 { 2585 2603 struct folio *folio = page_folio(page); 2586 - struct inode *inode = folio->mapping->host; 2587 - __block_commit_write(inode, folio, from, to); 2588 - return 0; 2604 + __block_commit_write(folio, from, to); 2589 2605 } 2590 2606 EXPORT_SYMBOL(block_commit_write); 2591 2607 ··· 2629 2649 end = size - folio_pos(folio); 2630 2650 2631 2651 ret = __block_write_begin_int(folio, 0, end, get_block, NULL); 2632 - if (!ret) 2633 - ret = __block_commit_write(inode, folio, 0, end); 2634 - 2635 - if (unlikely(ret < 0)) 2652 + if (unlikely(ret)) 2636 2653 goto out_unlock; 2654 + 2655 + __block_commit_write(folio, 0, end); 2656 + 2637 2657 folio_mark_dirty(folio); 2638 2658 folio_wait_stable(folio); 2639 2659 return 0;
+2
fs/cachefiles/namei.c
··· 585 585 if (ret < 0) 586 586 goto check_failed; 587 587 588 + clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &object->cookie->flags); 589 + 588 590 object->file = file; 589 591 590 592 /* Always update the atime on an object we've just looked up (this is
+2
fs/ceph/cache.c
··· 36 36 &ci->i_vino, sizeof(ci->i_vino), 37 37 &ci->i_version, sizeof(ci->i_version), 38 38 i_size_read(inode)); 39 + if (ci->netfs.cache) 40 + mapping_set_release_always(inode->i_mapping); 39 41 } 40 42 41 43 void ceph_fscache_unregister_inode_cookie(struct ceph_inode_info *ci)
+8 -25
fs/dax.c
··· 30 30 #define CREATE_TRACE_POINTS 31 31 #include <trace/events/fs_dax.h> 32 32 33 - static inline unsigned int pe_order(enum page_entry_size pe_size) 34 - { 35 - if (pe_size == PE_SIZE_PTE) 36 - return PAGE_SHIFT - PAGE_SHIFT; 37 - if (pe_size == PE_SIZE_PMD) 38 - return PMD_SHIFT - PAGE_SHIFT; 39 - if (pe_size == PE_SIZE_PUD) 40 - return PUD_SHIFT - PAGE_SHIFT; 41 - return ~0; 42 - } 43 - 44 33 /* We choose 4096 entries - same as per-zone page wait tables */ 45 34 #define DAX_WAIT_TABLE_BITS 12 46 35 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS) ··· 37 48 /* The 'colour' (ie low bits) within a PMD of a page offset. */ 38 49 #define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1) 39 50 #define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT) 40 - 41 - /* The order of a PMD entry */ 42 - #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) 43 51 44 52 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES]; 45 53 ··· 1894 1908 /** 1895 1909 * dax_iomap_fault - handle a page fault on a DAX file 1896 1910 * @vmf: The description of the fault 1897 - * @pe_size: Size of the page to fault in 1911 + * @order: Order of the page to fault in 1898 1912 * @pfnp: PFN to insert for synchronous faults if fsync is required 1899 1913 * @iomap_errp: Storage for detailed error code in case of error 1900 1914 * @ops: Iomap ops passed from the file system ··· 1904 1918 * has done all the necessary locking for page fault to proceed successfully. 1906 1920 */ 1907 - vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size, 1921 + vm_fault_t dax_iomap_fault(struct vm_fault *vmf, unsigned int order, 1908 1922 pfn_t *pfnp, int *iomap_errp, const struct iomap_ops *ops) 1909 1923 { 1910 - switch (pe_size) { 1911 - case PE_SIZE_PTE: 1924 + if (order == 0) 1912 1925 return dax_iomap_pte_fault(vmf, pfnp, iomap_errp, ops); 1913 - case PE_SIZE_PMD: 1926 + else if (order == PMD_ORDER) 1914 1927 return dax_iomap_pmd_fault(vmf, pfnp, ops); 1915 - default: 1928 + else 1916 1929 return VM_FAULT_FALLBACK; 1917 - } 1918 1930 } 1919 1931 EXPORT_SYMBOL_GPL(dax_iomap_fault); 1920 1932 ··· 1963 1979 /** 1964 1980 * dax_finish_sync_fault - finish synchronous page fault 1965 1981 * @vmf: The description of the fault 1966 - * @pe_size: Size of entry to be inserted 1982 + * @order: Order of entry to be inserted 1967 1983 * @pfn: PFN to insert 1968 1984 * 1969 1985 * This function ensures that the file range touched by the page fault is 1970 1986 * stored persistently on the media and handles inserting of appropriate page 1971 1987 * table entry. 1972 1988 */ 1973 - vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, 1974 - enum page_entry_size pe_size, pfn_t pfn) 1989 + vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, unsigned int order, 1990 + pfn_t pfn) 1975 1991 { 1976 1992 int err; 1977 1993 loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT; 1978 - unsigned int order = pe_order(pe_size); 1979 1994 size_t len = PAGE_SIZE << order; 1980 1995 1981 1996 err = vfs_fsync_range(vmf->vma->vm_file, start, start + len - 1, 1);
+2
fs/drop_caches.c
··· 10 10 #include <linux/writeback.h> 11 11 #include <linux/sysctl.h> 12 12 #include <linux/gfp.h> 13 + #include <linux/swap.h> 13 14 #include "internal.h" 14 15 15 16 /* A global variable is a bit ugly, but it keeps the code simple */ ··· 60 59 static int stfu; 61 60 62 61 if (sysctl_drop_caches & 1) { 62 + lru_add_drain_all(); 63 63 iterate_supers(drop_pagecache_sb, NULL); 64 64 count_vm_event(DROP_PAGECACHE); 65 65 }
+3 -3
fs/erofs/data.c
··· 413 413 414 414 #ifdef CONFIG_FS_DAX 415 415 static vm_fault_t erofs_dax_huge_fault(struct vm_fault *vmf, 416 - enum page_entry_size pe_size) 416 + unsigned int order) 417 417 { 418 - return dax_iomap_fault(vmf, pe_size, NULL, NULL, &erofs_iomap_ops); 418 + return dax_iomap_fault(vmf, order, NULL, NULL, &erofs_iomap_ops); 419 419 } 420 420 421 421 static vm_fault_t erofs_dax_fault(struct vm_fault *vmf) 422 422 { 423 - return erofs_dax_huge_fault(vmf, PE_SIZE_PTE); 423 + return erofs_dax_huge_fault(vmf, 0); 424 424 } 425 425 426 426 static const struct vm_operations_struct erofs_dax_vm_ops = {
+1
fs/exec.c
··· 701 701 if (vma != vma_next(&vmi)) 702 702 return -EFAULT; 703 703 704 + vma_iter_prev_range(&vmi); 704 705 /* 705 706 * cover the whole range: [new_start, old_end) 706 707 */
-2
fs/ext2/balloc.c
··· 36 36 */ 37 37 38 38 39 - #define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1) 40 - 41 39 struct ext2_group_desc * ext2_get_group_desc(struct super_block * sb, 42 40 unsigned int block_group, 43 41 struct buffer_head ** bh)
+1 -1
fs/ext2/file.c
··· 103 103 } 104 104 filemap_invalidate_lock_shared(inode->i_mapping); 105 105 106 - ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, &ext2_iomap_ops); 106 + ret = dax_iomap_fault(vmf, 0, NULL, NULL, &ext2_iomap_ops); 107 107 108 108 filemap_invalidate_unlock_shared(inode->i_mapping); 109 109 if (write)
-2
fs/ext4/ext4.h
··· 3780 3780 set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state); 3781 3781 } 3782 3782 3783 - #define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1) 3784 - 3785 3783 /* For ioend & aio unwritten conversion wait queues */ 3786 3784 #define EXT4_WQ_HASH_SZ 37 3787 3785 #define ext4_ioend_wq(v) (&ext4__ioend_wq[((unsigned long)(v)) %\
+5 -6
fs/ext4/file.c
··· 723 723 } 724 724 725 725 #ifdef CONFIG_FS_DAX 726 - static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf, 727 - enum page_entry_size pe_size) 726 + static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf, unsigned int order) 728 727 { 729 728 int error = 0; 730 729 vm_fault_t result; ··· 739 740 * read-only. 740 741 * 741 742 * We check for VM_SHARED rather than vmf->cow_page since the latter is 742 - * unset for pe_size != PE_SIZE_PTE (i.e. only in do_cow_fault); for 743 + * unset for order != 0 (i.e. only in do_cow_fault); for 743 744 * other sizes, dax_iomap_fault will handle splitting / fallback so that 744 745 * we eventually come back with a COW page. 745 746 */ ··· 763 764 } else { 764 765 filemap_invalidate_lock_shared(mapping); 765 766 } 766 - result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_iomap_ops); 767 + result = dax_iomap_fault(vmf, order, &pfn, &error, &ext4_iomap_ops); 767 768 if (write) { 768 769 ext4_journal_stop(handle); 769 770 ··· 772 773 goto retry; 773 774 /* Handling synchronous page fault? */ 774 775 if (result & VM_FAULT_NEEDDSYNC) 775 - result = dax_finish_sync_fault(vmf, pe_size, pfn); 776 + result = dax_finish_sync_fault(vmf, order, pfn); 776 777 filemap_invalidate_unlock_shared(mapping); 777 778 sb_end_pagefault(sb); 778 779 } else { ··· 784 785 785 786 static vm_fault_t ext4_dax_fault(struct vm_fault *vmf) 786 787 { 787 - return ext4_dax_huge_fault(vmf, PE_SIZE_PTE); 788 + return ext4_dax_huge_fault(vmf, 0); 788 789 } 789 790 790 791 static const struct vm_operations_struct ext4_dax_vm_ops = {
+2 -2
fs/ext4/inode.c
··· 1569 1569 1570 1570 if (folio->index < mpd->first_page) 1571 1571 continue; 1572 - if (folio->index + folio_nr_pages(folio) - 1 > end) 1572 + if (folio_next_index(folio) - 1 > end) 1573 1573 continue; 1574 1574 BUG_ON(!folio_test_locked(folio)); 1575 1575 BUG_ON(folio_test_writeback(folio)); ··· 2455 2455 2456 2456 if (mpd->map.m_len == 0) 2457 2457 mpd->first_page = folio->index; 2458 - mpd->next_page = folio->index + folio_nr_pages(folio); 2458 + mpd->next_page = folio_next_index(folio); 2459 2459 /* 2460 2460 * Writeout when we cannot modify metadata is simple. 2461 2461 * Just submit the page. For data=journal mode we
+6 -13
fs/ext4/move_extent.c
··· 340 340 ext4_double_up_write_data_sem(orig_inode, donor_inode); 341 341 goto data_copy; 342 342 } 343 - if ((folio_has_private(folio[0]) && 344 - !filemap_release_folio(folio[0], 0)) || 345 - (folio_has_private(folio[1]) && 346 - !filemap_release_folio(folio[1], 0))) { 343 + if (!filemap_release_folio(folio[0], 0) || 344 + !filemap_release_folio(folio[1], 0)) { 347 345 *err = -EBUSY; 348 346 goto drop_data_sem; 349 347 } ··· 360 362 361 363 /* At this point all buffers in range are uptodate, old mapping layout 362 364 * is no longer required, try to drop it now. */ 363 - if ((folio_has_private(folio[0]) && 364 - !filemap_release_folio(folio[0], 0)) || 365 - (folio_has_private(folio[1]) && 366 - !filemap_release_folio(folio[1], 0))) { 365 + if (!filemap_release_folio(folio[0], 0) || 366 + !filemap_release_folio(folio[1], 0)) { 367 367 *err = -EBUSY; 368 368 goto unlock_folios; 369 369 } ··· 388 392 for (i = 0; i < block_len_in_page; i++) { 389 393 *err = ext4_get_block(orig_inode, orig_blk_offset + i, bh, 0); 390 394 if (*err < 0) 391 - break; 395 + goto repair_branches; 392 396 bh = bh->b_this_page; 393 397 } 394 - if (!*err) 395 - *err = block_commit_write(&folio[0]->page, from, from + replaced_size); 396 398 397 - if (unlikely(*err < 0)) 398 - goto repair_branches; 399 + block_commit_write(&folio[0]->page, from, from + replaced_size); 399 400 400 401 /* Even in case of data=writeback it is reasonable to pin 401 402 * inode to transaction, to prevent unexpected data loss */
+9 -11
fs/fuse/dax.c
··· 784 784 return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc); 785 785 } 786 786 787 - static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf, 788 - enum page_entry_size pe_size, bool write) 787 + static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf, unsigned int order, 788 + bool write) 789 789 { 790 790 vm_fault_t ret; 791 791 struct inode *inode = file_inode(vmf->vma->vm_file); ··· 809 809 * to populate page cache or access memory we are trying to free. 810 810 */ 811 811 filemap_invalidate_lock_shared(inode->i_mapping); 812 - ret = dax_iomap_fault(vmf, pe_size, &pfn, &error, &fuse_iomap_ops); 812 + ret = dax_iomap_fault(vmf, order, &pfn, &error, &fuse_iomap_ops); 813 813 if ((ret & VM_FAULT_ERROR) && error == -EAGAIN) { 814 814 error = 0; 815 815 retry = true; ··· 818 818 } 819 819 820 820 if (ret & VM_FAULT_NEEDDSYNC) 821 - ret = dax_finish_sync_fault(vmf, pe_size, pfn); 821 + ret = dax_finish_sync_fault(vmf, order, pfn); 822 822 filemap_invalidate_unlock_shared(inode->i_mapping); 823 823 824 824 if (write) ··· 829 829 830 830 static vm_fault_t fuse_dax_fault(struct vm_fault *vmf) 831 831 { 832 - return __fuse_dax_fault(vmf, PE_SIZE_PTE, 833 - vmf->flags & FAULT_FLAG_WRITE); 832 + return __fuse_dax_fault(vmf, 0, vmf->flags & FAULT_FLAG_WRITE); 834 833 } 835 834 836 - static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf, 837 - enum page_entry_size pe_size) 835 + static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf, unsigned int order) 838 836 { 839 - return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE); 837 + return __fuse_dax_fault(vmf, order, vmf->flags & FAULT_FLAG_WRITE); 840 838 } 841 839 842 840 static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf) 843 841 { 844 - return __fuse_dax_fault(vmf, PE_SIZE_PTE, true); 842 + return __fuse_dax_fault(vmf, 0, true); 845 843 } 846 844 847 845 static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf) 848 846 { 849 - return __fuse_dax_fault(vmf, PE_SIZE_PTE, true); 847 + return __fuse_dax_fault(vmf, 0, true); 850 848 } 851 849 852 850 static const struct vm_operations_struct fuse_dax_vm_ops = {
+51 -6
fs/hugetlbfs/inode.c
··· 283 283 #endif 284 284 285 285 /* 286 + * Someone wants to read @bytes from a HWPOISON hugetlb @page from @offset. 287 + * Returns the maximum number of bytes one can read without touching the 1st raw 288 + * HWPOISON subpage. 289 + * 290 + * The implementation borrows the iteration logic from copy_page_to_iter*. 291 + */ 292 + static size_t adjust_range_hwpoison(struct page *page, size_t offset, size_t bytes) 293 + { 294 + size_t n = 0; 295 + size_t res = 0; 296 + 297 + /* First subpage to start the loop. */ 298 + page += offset / PAGE_SIZE; 299 + offset %= PAGE_SIZE; 300 + while (1) { 301 + if (is_raw_hwpoison_page_in_hugepage(page)) 302 + break; 303 + 304 + /* Safe to read n bytes without touching HWPOISON subpage. */ 305 + n = min(bytes, (size_t)PAGE_SIZE - offset); 306 + res += n; 307 + bytes -= n; 308 + if (!bytes || !n) 309 + break; 310 + offset += n; 311 + if (offset == PAGE_SIZE) { 312 + page++; 313 + offset = 0; 314 + } 315 + } 316 + 317 + return res; 318 + } 319 + 320 + /* 286 321 * Support for read() - Find the page attached to f_mapping and copy out the 287 322 * data. This provides functionality similar to filemap_read(). 288 323 */ ··· 335 300 336 301 while (iov_iter_count(to)) { 337 302 struct page *page; 338 - size_t nr, copied; 303 + size_t nr, copied, want; 339 304 340 305 /* nr is the maximum number of bytes to copy from this page */ 341 306 nr = huge_page_size(h); ··· 363 328 } else { 364 329 unlock_page(page); 365 330 366 - if (PageHWPoison(page)) { 367 - put_page(page); 368 - retval = -EIO; 369 - break; 331 + if (!PageHWPoison(page)) 332 + want = nr; 333 + else { 334 + /* 335 + * Adjust how many bytes safe to read without 336 + * touching the 1st raw HWPOISON subpage after 337 + * offset. 338 + */ 339 + want = adjust_range_hwpoison(page, offset, nr); 340 + if (want == 0) { 341 + put_page(page); 342 + retval = -EIO; 343 + break; 344 + } 370 345 } 371 346 372 347 /* 373 348 * We have the page, copy it to user space buffer. 374 349 */ 375 350 copied = copy_page_to_iter(page, offset, want, to); 376 351 put_page(page); 377 352 } 378 353 offset += copied;
+16 -19
fs/jbd2/journal.c
··· 341 341 int do_escape = 0; 342 342 char *mapped_data; 343 343 struct buffer_head *new_bh; 344 - struct page *new_page; 344 + struct folio *new_folio; 345 345 unsigned int new_offset; 346 346 struct buffer_head *bh_in = jh2bh(jh_in); 347 347 journal_t *journal = transaction->t_journal; ··· 370 370 */ 371 371 if (jh_in->b_frozen_data) { 372 372 done_copy_out = 1; 373 - new_page = virt_to_page(jh_in->b_frozen_data); 374 - new_offset = offset_in_page(jh_in->b_frozen_data); 373 + new_folio = virt_to_folio(jh_in->b_frozen_data); 374 + new_offset = offset_in_folio(new_folio, jh_in->b_frozen_data); 375 375 } else { 376 - new_page = jh2bh(jh_in)->b_page; 377 - new_offset = offset_in_page(jh2bh(jh_in)->b_data); 376 + new_folio = jh2bh(jh_in)->b_folio; 377 + new_offset = offset_in_folio(new_folio, jh2bh(jh_in)->b_data); 378 378 } 379 379 380 - mapped_data = kmap_atomic(new_page); 380 + mapped_data = kmap_local_folio(new_folio, new_offset); 381 381 /* 382 382 * Fire data frozen trigger if data already wasn't frozen. Do this 383 383 * before checking for escaping, as the trigger may modify the magic ··· 385 385 * data in the buffer. 386 386 */ 387 387 if (!done_copy_out) 388 - jbd2_buffer_frozen_trigger(jh_in, mapped_data + new_offset, 388 + jbd2_buffer_frozen_trigger(jh_in, mapped_data, 389 389 jh_in->b_triggers); 390 390 391 391 /* 392 392 * Check for escaping 393 393 */ 394 - if (*((__be32 *)(mapped_data + new_offset)) == 395 - cpu_to_be32(JBD2_MAGIC_NUMBER)) { 394 + if (*((__be32 *)mapped_data) == cpu_to_be32(JBD2_MAGIC_NUMBER)) { 396 395 need_copy_out = 1; 397 396 do_escape = 1; 398 397 } 399 - kunmap_atomic(mapped_data); 398 + kunmap_local(mapped_data); 400 399 401 400 /* 402 401 * Do we need to do a data copy? ··· 416 417 } 417 418 418 419 jh_in->b_frozen_data = tmp; 419 - mapped_data = kmap_atomic(new_page); 420 - memcpy(tmp, mapped_data + new_offset, bh_in->b_size); 421 - kunmap_atomic(mapped_data); 420 + memcpy_from_folio(tmp, new_folio, new_offset, bh_in->b_size); 422 421 423 - new_page = virt_to_page(tmp); 424 - new_offset = offset_in_page(tmp); 422 + new_folio = virt_to_folio(tmp); 423 + new_offset = offset_in_folio(new_folio, tmp); 425 424 done_copy_out = 1; 426 425 427 426 /* ··· 435 438 * copying, we can finally do so. 436 439 */ 437 440 if (do_escape) { 438 - mapped_data = kmap_atomic(new_page); 439 - *((unsigned int *)(mapped_data + new_offset)) = 0; 440 - kunmap_atomic(mapped_data); 441 + mapped_data = kmap_local_folio(new_folio, new_offset); 442 + *((unsigned int *)mapped_data) = 0; 443 + kunmap_local(mapped_data); 441 444 } 442 445 443 - set_bh_page(new_bh, new_page, new_offset); 446 + folio_set_bh(new_bh, new_folio, new_offset); 444 447 new_bh->b_size = bh_in->b_size; 445 448 new_bh->b_bdev = journal->j_dev; 446 449 new_bh->b_blocknr = blocknr;
+3
fs/nfs/fscache.c
··· 180 180 &auxdata, /* aux_data */ 181 181 sizeof(auxdata), 182 182 i_size_read(inode)); 183 + 184 + if (netfs_inode(inode)->cache) 185 + mapping_set_release_always(inode->i_mapping); 183 186 } 184 187 185 188 /*
+5 -5
fs/ntfs3/inode.c
··· 556 556 struct super_block *sb = inode->i_sb; 557 557 struct ntfs_sb_info *sbi = sb->s_fs_info; 558 558 struct ntfs_inode *ni = ntfs_i(inode); 559 - struct page *page = bh->b_page; 559 + struct folio *folio = bh->b_folio; 560 560 u8 cluster_bits = sbi->cluster_bits; 561 561 u32 block_size = sb->s_blocksize; 562 562 u64 bytes, lbo, valid; ··· 571 571 572 572 if (is_resident(ni)) { 573 573 ni_lock(ni); 574 - err = attr_data_read_resident(ni, page); 574 + err = attr_data_read_resident(ni, &folio->page); 575 575 ni_unlock(ni); 576 576 577 577 if (!err) ··· 644 644 */ 645 645 bytes = block_size; 646 646 647 - if (page) { 647 + if (folio) { 648 648 u32 voff = valid - vbo; 649 649 650 650 bh->b_size = block_size; 651 651 off = vbo & (PAGE_SIZE - 1); 652 - set_bh_page(bh, page, off); 652 + folio_set_bh(bh, folio, off); 653 653 654 654 err = bh_read(bh, 0); 655 655 if (err < 0) 656 656 goto out; 657 - zero_user_segment(page, off + voff, off + block_size); 657 + folio_zero_segment(folio, off + voff, off + block_size); 658 658 } 659 659 } 660 660
+1 -6
fs/ocfs2/file.c
··· 810 810 811 811 812 812 /* must not update i_size! */ 813 - ret = block_commit_write(page, block_start + 1, 814 - block_start + 1); 815 - if (ret < 0) 816 - mlog_errno(ret); 817 - else 818 - ret = 0; 813 + block_commit_write(page, block_start + 1, block_start + 1); 819 814 } 820 815 821 816 /*
+1
fs/proc/base.c
··· 3207 3207 mm = get_task_mm(task); 3208 3208 if (mm) { 3209 3209 seq_printf(m, "ksm_rmap_items %lu\n", mm->ksm_rmap_items); 3210 + seq_printf(m, "ksm_zero_pages %lu\n", mm->ksm_zero_pages); 3210 3211 seq_printf(m, "ksm_merging_pages %lu\n", mm->ksm_merging_pages); 3211 3212 seq_printf(m, "ksm_process_profit %ld\n", ksm_process_profit(mm)); 3212 3213 mmput(mm);
+2 -11
fs/proc/meminfo.c
··· 17 17 #ifdef CONFIG_CMA 18 18 #include <linux/cma.h> 19 19 #endif 20 + #include <linux/zswap.h> 20 21 #include <asm/page.h> 21 22 #include "internal.h" 22 23 ··· 133 132 show_val_kb(m, "VmallocChunk: ", 0ul); 134 133 show_val_kb(m, "Percpu: ", pcpu_nr_pages()); 135 134 136 - #ifdef CONFIG_MEMTEST 137 - if (early_memtest_done) { 138 - unsigned long early_memtest_bad_size_kb; 139 - 140 - early_memtest_bad_size_kb = early_memtest_bad_size>>10; 141 - if (early_memtest_bad_size && !early_memtest_bad_size_kb) 142 - early_memtest_bad_size_kb = 1; 143 - /* When 0 is reported, it means there actually was a successful test */ 144 - seq_printf(m, "EarlyMemtestBad: %5lu kB\n", early_memtest_bad_size_kb); 145 - } 146 - #endif 135 + memtest_report_meminfo(m); 147 136 148 137 #ifdef CONFIG_MEMORY_FAILURE 149 138 seq_printf(m, "HardwareCorrupted: %5lu kB\n",
+5 -21
fs/proc/task_mmu.c
··· 236 236 sizeof(struct proc_maps_private)); 237 237 } 238 238 239 - /* 240 - * Indicate if the VMA is a stack for the given task; for 241 - * /proc/PID/maps that is the stack of the main task. 242 - */ 243 - static int is_stack(struct vm_area_struct *vma) 244 - { 245 - /* 246 - * We make no effort to guess what a given thread considers to be 247 - * its "stack". It's not even well-defined for programs written 248 - * languages like Go. 249 - */ 250 - return vma->vm_start <= vma->vm_mm->start_stack && 251 - vma->vm_end >= vma->vm_mm->start_stack; 252 - } 253 - 254 239 static void show_vma_header_prefix(struct seq_file *m, 255 240 unsigned long start, unsigned long end, 256 241 vm_flags_t flags, unsigned long long pgoff, ··· 312 327 goto done; 313 328 } 314 329 315 - if (vma->vm_start <= mm->brk && 316 - vma->vm_end >= mm->start_brk) { 330 + if (vma_is_initial_heap(vma)) { 317 331 name = "[heap]"; 318 332 goto done; 319 333 } 320 334 321 - if (is_stack(vma)) { 335 + if (vma_is_initial_stack(vma)) { 322 336 name = "[stack]"; 323 337 goto done; 324 338 } ··· 855 871 856 872 __show_smap(m, &mss, false); 857 873 858 - seq_printf(m, "THPeligible: %d\n", 874 + seq_printf(m, "THPeligible: %8u\n", 859 875 hugepage_vma_check(vma, vma->vm_flags, true, false, true)); 860 876 861 877 if (arch_pkeys_enabled()) ··· 1959 1975 if (file) { 1960 1976 seq_puts(m, " file="); 1961 1977 seq_file_path(m, file, "\n\t= "); 1962 - } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { 1978 + } else if (vma_is_initial_heap(vma)) { 1963 1979 seq_puts(m, " heap"); 1964 - } else if (is_stack(vma)) { 1980 + } else if (vma_is_initial_stack(vma)) { 1965 1981 seq_puts(m, " stack"); 1966 1982 } 1967 1983
+1 -14
fs/proc/task_nommu.c
··· 121 121 return size; 122 122 } 123 123 124 - static int is_stack(struct vm_area_struct *vma) 125 - { 126 - struct mm_struct *mm = vma->vm_mm; 127 - 128 - /* 129 - * We make no effort to guess what a given thread considers to be 130 - * its "stack". It's not even well-defined for programs written 131 - * languages like Go. 132 - */ 133 - return vma->vm_start <= mm->start_stack && 134 - vma->vm_end >= mm->start_stack; 135 - } 136 - 137 124 /* 138 125 * display a single VMA to a sequenced file 139 126 */ ··· 158 171 if (file) { 159 172 seq_pad(m, ' '); 160 173 seq_file_path(m, file, ""); 161 - } else if (mm && is_stack(vma)) { 174 + } else if (mm && vma_is_initial_stack(vma)) { 162 175 seq_pad(m, ' '); 163 176 seq_puts(m, "[stack]"); 164 177 }
+2
fs/smb/client/fscache.c
··· 108 108 &cifsi->uniqueid, sizeof(cifsi->uniqueid), 109 109 &cd, sizeof(cd), 110 110 i_size_read(&cifsi->netfs.inode)); 111 + if (cifsi->netfs.cache) 112 + mapping_set_release_always(inode->i_mapping); 111 113 } 112 114 113 115 void cifs_fscache_unuse_inode_cookie(struct inode *inode, bool update)
+1 -2
fs/splice.c
··· 83 83 */ 84 84 folio_wait_writeback(folio); 85 85 86 - if (folio_has_private(folio) && 87 - !filemap_release_folio(folio, GFP_KERNEL)) 86 + if (!filemap_release_folio(folio, GFP_KERNEL)) 88 87 goto out_unlock; 89 88 90 89 /*
+3 -3
fs/udf/file.c
··· 63 63 else 64 64 end = PAGE_SIZE; 65 65 err = __block_write_begin(page, 0, end, udf_get_block); 66 - if (!err) 67 - err = block_commit_write(page, 0, end); 68 - if (err < 0) { 66 + if (err) { 69 67 unlock_page(page); 70 68 ret = block_page_mkwrite_return(err); 71 69 goto out_unlock; 72 70 } 71 + 72 + block_commit_write(page, 0, end); 73 73 out_dirty: 74 74 set_page_dirty(page); 75 75 wait_for_stable_page(page);
-6
fs/ufs/util.h
··· 11 11 #include <linux/fs.h> 12 12 #include "swab.h" 13 13 14 - 15 - /* 16 - * some useful macros 17 - */ 18 - #define in_range(b,first,len) ((b)>=(first)&&(b)<(first)+(len)) 19 - 20 14 /* 21 15 * functions used for retyping 22 16 */
+98 -42
fs/userfaultfd.c
··· 277 277 * hugepmd ranges. 278 278 */ 279 279 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, 280 - struct vm_area_struct *vma, 281 - unsigned long address, 282 - unsigned long flags, 283 - unsigned long reason) 280 + struct vm_fault *vmf, 281 + unsigned long reason) 284 282 { 283 + struct vm_area_struct *vma = vmf->vma; 285 284 pte_t *ptep, pte; 286 285 bool ret = true; 287 286 288 - mmap_assert_locked(ctx->mm); 287 + assert_fault_locked(vmf); 289 288 290 - ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma)); 289 + ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma)); 291 290 if (!ptep) 292 291 goto out; 293 292 ··· 307 308 } 308 309 #else 309 310 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, 310 - struct vm_area_struct *vma, 311 - unsigned long address, 312 - unsigned long flags, 313 - unsigned long reason) 311 + struct vm_fault *vmf, 312 + unsigned long reason) 314 313 { 315 314 return false; /* should never get here */ 316 315 } ··· 322 325 * threads. 323 326 */ 324 327 static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, 325 - unsigned long address, 326 - unsigned long flags, 328 + struct vm_fault *vmf, 327 329 unsigned long reason) 328 330 { 329 331 struct mm_struct *mm = ctx->mm; 332 + unsigned long address = vmf->address; 330 333 pgd_t *pgd; 331 334 p4d_t *p4d; 332 335 pud_t *pud; ··· 335 338 pte_t ptent; 336 339 bool ret = true; 337 340 338 - mmap_assert_locked(mm); 341 + assert_fault_locked(vmf); 339 342 340 343 pgd = pgd_offset(mm, address); 341 344 if (!pgd_present(*pgd)) ··· 424 427 * 425 428 * We also don't do userfault handling during 426 429 * coredumping. 
hugetlbfs has the special 427 - * follow_hugetlb_page() to skip missing pages in the 430 + * hugetlb_follow_page_mask() to skip missing pages in the 428 431 * FOLL_DUMP case, anon memory also checks for FOLL_DUMP with 429 432 * the no_page_table() helper in follow_page_mask(), but the 430 433 * shmem_vm_ops->fault method is invoked even during 431 - * coredumping without mmap_lock and it ends up here. 434 + * coredumping and it ends up here. 432 435 */ 433 436 if (current->flags & (PF_EXITING|PF_DUMPCORE)) 434 437 goto out; 435 438 436 - /* 437 - * Coredumping runs without mmap_lock so we can only check that 438 - * the mmap_lock is held, if PF_DUMPCORE was not set. 439 - */ 440 - mmap_assert_locked(mm); 439 + assert_fault_locked(vmf); 441 440 442 441 ctx = vma->vm_userfaultfd_ctx.ctx; 443 442 if (!ctx) ··· 549 556 spin_unlock_irq(&ctx->fault_pending_wqh.lock); 550 557 551 558 if (!is_vm_hugetlb_page(vma)) 552 - must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags, 553 - reason); 559 + must_wait = userfaultfd_must_wait(ctx, vmf, reason); 554 560 else 555 - must_wait = userfaultfd_huge_must_wait(ctx, vma, 556 - vmf->address, 557 - vmf->flags, reason); 561 + must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason); 558 562 if (is_vm_hugetlb_page(vma)) 559 563 hugetlb_vma_unlock_read(vma); 560 - mmap_read_unlock(mm); 564 + release_fault_lock(vmf); 561 565 562 566 if (likely(must_wait && !READ_ONCE(ctx->released))) { 563 567 wake_up_poll(&ctx->fd_wqh, EPOLLIN); ··· 657 667 mmap_write_lock(mm); 658 668 for_each_vma(vmi, vma) { 659 669 if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) { 670 + vma_start_write(vma); 660 671 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; 661 672 userfaultfd_set_vm_flags(vma, 662 673 vma->vm_flags & ~__VM_UFFD_FLAGS); ··· 693 702 694 703 octx = vma->vm_userfaultfd_ctx.ctx; 695 704 if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) { 705 + vma_start_write(vma); 696 706 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; 697 707 
userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS); 698 708 return 0; ··· 775 783 atomic_inc(&ctx->mmap_changing); 776 784 } else { 777 785 /* Drop uffd context if remap feature not enabled */ 786 + vma_start_write(vma); 778 787 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; 779 788 userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS); 780 789 } ··· 933 940 prev = vma; 934 941 } 935 942 943 + vma_start_write(vma); 936 944 userfaultfd_set_vm_flags(vma, new_flags); 937 945 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; 938 946 } ··· 1283 1289 __wake_userfault(ctx, range); 1284 1290 } 1285 1291 1286 - static __always_inline int validate_range(struct mm_struct *mm, 1287 - __u64 start, __u64 len) 1292 + static __always_inline int validate_unaligned_range( 1293 + struct mm_struct *mm, __u64 start, __u64 len) 1288 1294 { 1289 1295 __u64 task_size = mm->task_size; 1290 1296 1291 - if (start & ~PAGE_MASK) 1292 - return -EINVAL; 1293 1297 if (len & ~PAGE_MASK) 1294 1298 return -EINVAL; 1295 1299 if (!len) ··· 1298 1306 return -EINVAL; 1299 1307 if (len > task_size - start) 1300 1308 return -EINVAL; 1309 + if (start + len <= start) 1310 + return -EINVAL; 1301 1311 return 0; 1312 + } 1313 + 1314 + static __always_inline int validate_range(struct mm_struct *mm, 1315 + __u64 start, __u64 len) 1316 + { 1317 + if (start & ~PAGE_MASK) 1318 + return -EINVAL; 1319 + 1320 + return validate_unaligned_range(mm, start, len); 1302 1321 } 1303 1322 1304 1323 static int userfaultfd_register(struct userfaultfd_ctx *ctx, ··· 1505 1502 * the next vma was merged into the current one and 1506 1503 * the current one has not been updated yet. 1507 1504 */ 1505 + vma_start_write(vma); 1508 1506 userfaultfd_set_vm_flags(vma, new_flags); 1509 1507 vma->vm_userfaultfd_ctx.ctx = ctx; 1510 1508 ··· 1689 1685 * the next vma was merged into the current one and 1690 1686 * the current one has not been updated yet. 
1691 1687 */ 1688 + vma_start_write(vma); 1692 1689 userfaultfd_set_vm_flags(vma, new_flags); 1693 1690 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; 1694 1691 ··· 1762 1757 sizeof(uffdio_copy)-sizeof(__s64))) 1763 1758 goto out; 1764 1759 1760 + ret = validate_unaligned_range(ctx->mm, uffdio_copy.src, 1761 + uffdio_copy.len); 1762 + if (ret) 1763 + goto out; 1765 1764 ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len); 1766 1765 if (ret) 1767 1766 goto out; 1768 - /* 1769 - * double check for wraparound just in case. copy_from_user() 1770 - * will later check uffdio_copy.src + uffdio_copy.len to fit 1771 - * in the userland range. 1772 - */ 1767 + 1773 1768 ret = -EINVAL; 1774 - if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src) 1775 - goto out; 1776 1769 if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP)) 1777 1770 goto out; 1778 1771 if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP) ··· 1930 1927 goto out; 1931 1928 1932 1929 ret = -EINVAL; 1933 - /* double check for wraparound just in case. */ 1934 - if (uffdio_continue.range.start + uffdio_continue.range.len <= 1935 - uffdio_continue.range.start) { 1936 - goto out; 1937 - } 1938 1930 if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE | 1939 1931 UFFDIO_CONTINUE_MODE_WP)) 1940 1932 goto out; ··· 1958 1960 wake_userfault(ctx, &range); 1959 1961 } 1960 1962 ret = range.len == uffdio_continue.range.len ? 
0 : -EAGAIN; 1963 + 1964 + out: 1965 + return ret; 1966 + } 1967 + 1968 + static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long arg) 1969 + { 1970 + __s64 ret; 1971 + struct uffdio_poison uffdio_poison; 1972 + struct uffdio_poison __user *user_uffdio_poison; 1973 + struct userfaultfd_wake_range range; 1974 + 1975 + user_uffdio_poison = (struct uffdio_poison __user *)arg; 1976 + 1977 + ret = -EAGAIN; 1978 + if (atomic_read(&ctx->mmap_changing)) 1979 + goto out; 1980 + 1981 + ret = -EFAULT; 1982 + if (copy_from_user(&uffdio_poison, user_uffdio_poison, 1983 + /* don't copy the output fields */ 1984 + sizeof(uffdio_poison) - (sizeof(__s64)))) 1985 + goto out; 1986 + 1987 + ret = validate_range(ctx->mm, uffdio_poison.range.start, 1988 + uffdio_poison.range.len); 1989 + if (ret) 1990 + goto out; 1991 + 1992 + ret = -EINVAL; 1993 + if (uffdio_poison.mode & ~UFFDIO_POISON_MODE_DONTWAKE) 1994 + goto out; 1995 + 1996 + if (mmget_not_zero(ctx->mm)) { 1997 + ret = mfill_atomic_poison(ctx->mm, uffdio_poison.range.start, 1998 + uffdio_poison.range.len, 1999 + &ctx->mmap_changing, 0); 2000 + mmput(ctx->mm); 2001 + } else { 2002 + return -ESRCH; 2003 + } 2004 + 2005 + if (unlikely(put_user(ret, &user_uffdio_poison->updated))) 2006 + return -EFAULT; 2007 + if (ret < 0) 2008 + goto out; 2009 + 2010 + /* len == 0 would wake all */ 2011 + BUG_ON(!ret); 2012 + range.len = ret; 2013 + if (!(uffdio_poison.mode & UFFDIO_POISON_MODE_DONTWAKE)) { 2014 + range.start = uffdio_poison.range.start; 2015 + wake_userfault(ctx, &range); 2016 + } 2017 + ret = range.len == uffdio_poison.range.len ? 0 : -EAGAIN; 1961 2018 1962 2019 out: 1963 2020 return ret; ··· 2118 2065 break; 2119 2066 case UFFDIO_CONTINUE: 2120 2067 ret = userfaultfd_continue(ctx, arg); 2068 + break; 2069 + case UFFDIO_POISON: 2070 + ret = userfaultfd_poison(ctx, arg); 2121 2071 break; 2122 2072 } 2123 2073 return ret;
+12 -12
fs/xfs/xfs_file.c
··· 1287 1287 static inline vm_fault_t 1288 1288 xfs_dax_fault( 1289 1289 struct vm_fault *vmf, 1290 - enum page_entry_size pe_size, 1290 + unsigned int order, 1291 1291 bool write_fault, 1292 1292 pfn_t *pfn) 1293 1293 { 1294 - return dax_iomap_fault(vmf, pe_size, pfn, NULL, 1294 + return dax_iomap_fault(vmf, order, pfn, NULL, 1295 1295 (write_fault && !vmf->cow_page) ? 1296 1296 &xfs_dax_write_iomap_ops : 1297 1297 &xfs_read_iomap_ops); ··· 1300 1300 static inline vm_fault_t 1301 1301 xfs_dax_fault( 1302 1302 struct vm_fault *vmf, 1303 - enum page_entry_size pe_size, 1303 + unsigned int order, 1304 1304 bool write_fault, 1305 1305 pfn_t *pfn) 1306 1306 { ··· 1322 1322 static vm_fault_t 1323 1323 __xfs_filemap_fault( 1324 1324 struct vm_fault *vmf, 1325 - enum page_entry_size pe_size, 1325 + unsigned int order, 1326 1326 bool write_fault) 1327 1327 { 1328 1328 struct inode *inode = file_inode(vmf->vma->vm_file); 1329 1329 struct xfs_inode *ip = XFS_I(inode); 1330 1330 vm_fault_t ret; 1331 1331 1332 - trace_xfs_filemap_fault(ip, pe_size, write_fault); 1332 + trace_xfs_filemap_fault(ip, order, write_fault); 1333 1333 1334 1334 if (write_fault) { 1335 1335 sb_start_pagefault(inode->i_sb); ··· 1340 1340 pfn_t pfn; 1341 1341 1342 1342 xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); 1343 - ret = xfs_dax_fault(vmf, pe_size, write_fault, &pfn); 1343 + ret = xfs_dax_fault(vmf, order, write_fault, &pfn); 1344 1344 if (ret & VM_FAULT_NEEDDSYNC) 1345 - ret = dax_finish_sync_fault(vmf, pe_size, pfn); 1345 + ret = dax_finish_sync_fault(vmf, order, pfn); 1346 1346 xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); 1347 1347 } else { 1348 1348 if (write_fault) { ··· 1373 1373 struct vm_fault *vmf) 1374 1374 { 1375 1375 /* DAX can shortcut the normal fault path on write faults! 
*/ 1376 - return __xfs_filemap_fault(vmf, PE_SIZE_PTE, 1376 + return __xfs_filemap_fault(vmf, 0, 1377 1377 IS_DAX(file_inode(vmf->vma->vm_file)) && 1378 1378 xfs_is_write_fault(vmf)); 1379 1379 } ··· 1381 1381 static vm_fault_t 1382 1382 xfs_filemap_huge_fault( 1383 1383 struct vm_fault *vmf, 1384 - enum page_entry_size pe_size) 1384 + unsigned int order) 1385 1385 { 1386 1386 if (!IS_DAX(file_inode(vmf->vma->vm_file))) 1387 1387 return VM_FAULT_FALLBACK; 1388 1388 1389 1389 /* DAX can shortcut the normal fault path on write faults! */ 1390 - return __xfs_filemap_fault(vmf, pe_size, 1390 + return __xfs_filemap_fault(vmf, order, 1391 1391 xfs_is_write_fault(vmf)); 1392 1392 } 1393 1393 ··· 1395 1395 xfs_filemap_page_mkwrite( 1396 1396 struct vm_fault *vmf) 1397 1397 { 1398 - return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); 1398 + return __xfs_filemap_fault(vmf, 0, true); 1399 1399 } 1400 1400 1401 1401 /* ··· 1408 1408 struct vm_fault *vmf) 1409 1409 { 1410 1410 1411 - return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); 1411 + return __xfs_filemap_fault(vmf, 0, true); 1412 1412 } 1413 1413 1414 1414 static const struct vm_operations_struct xfs_file_vm_ops = {
+6 -14
fs/xfs/xfs_trace.h
··· 802 802 * ring buffer. Somehow this was only worth mentioning in the ftrace sample 803 803 * code. 804 804 */ 805 - TRACE_DEFINE_ENUM(PE_SIZE_PTE); 806 - TRACE_DEFINE_ENUM(PE_SIZE_PMD); 807 - TRACE_DEFINE_ENUM(PE_SIZE_PUD); 808 - 809 805 TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_SHARED); 810 806 TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_COW); 811 807 812 808 TRACE_EVENT(xfs_filemap_fault, 813 - TP_PROTO(struct xfs_inode *ip, enum page_entry_size pe_size, 814 - bool write_fault), 815 - TP_ARGS(ip, pe_size, write_fault), 809 + TP_PROTO(struct xfs_inode *ip, unsigned int order, bool write_fault), 810 + TP_ARGS(ip, order, write_fault), 816 811 TP_STRUCT__entry( 817 812 __field(dev_t, dev) 818 813 __field(xfs_ino_t, ino) 819 - __field(enum page_entry_size, pe_size) 814 + __field(unsigned int, order) 820 815 __field(bool, write_fault) 821 816 ), 822 817 TP_fast_assign( 823 818 __entry->dev = VFS_I(ip)->i_sb->s_dev; 824 819 __entry->ino = ip->i_ino; 825 - __entry->pe_size = pe_size; 820 + __entry->order = order; 826 821 __entry->write_fault = write_fault; 827 822 ), 828 - TP_printk("dev %d:%d ino 0x%llx %s write_fault %d", 823 + TP_printk("dev %d:%d ino 0x%llx order %u write_fault %d", 829 824 MAJOR(__entry->dev), MINOR(__entry->dev), 830 825 __entry->ino, 831 - __print_symbolic(__entry->pe_size, 832 - { PE_SIZE_PTE, "PTE" }, 833 - { PE_SIZE_PMD, "PMD" }, 834 - { PE_SIZE_PUD, "PUD" }), 826 + __entry->order, 835 827 __entry->write_fault) 836 828 ) 837 829
-7
include/asm-generic/cacheflush.h
··· 77 77 #define flush_icache_user_range flush_icache_range 78 78 #endif 79 79 80 - #ifndef flush_icache_page 81 - static inline void flush_icache_page(struct vm_area_struct *vma, 82 - struct page *page) 83 - { 84 - } 85 - #endif 86 - 87 80 #ifndef flush_icache_user_page 88 81 static inline void flush_icache_user_page(struct vm_area_struct *vma, 89 82 struct page *page,
+6 -25
include/asm-generic/io.h
··· 1047 1047 #elif defined(CONFIG_GENERIC_IOREMAP) 1048 1048 #include <linux/pgtable.h> 1049 1049 1050 - /* 1051 - * Arch code can implement the following two hooks when using GENERIC_IOREMAP 1052 - * ioremap_allowed() return a bool, 1053 - * - true means continue to remap 1054 - * - false means skip remap and return directly 1055 - * iounmap_allowed() return a bool, 1056 - * - true means continue to vunmap 1057 - * - false means skip vunmap and return directly 1058 - */ 1059 - #ifndef ioremap_allowed 1060 - #define ioremap_allowed ioremap_allowed 1061 - static inline bool ioremap_allowed(phys_addr_t phys_addr, size_t size, 1062 - unsigned long prot) 1063 - { 1064 - return true; 1065 - } 1066 - #endif 1067 - 1068 - #ifndef iounmap_allowed 1069 - #define iounmap_allowed iounmap_allowed 1070 - static inline bool iounmap_allowed(void *addr) 1071 - { 1072 - return true; 1073 - } 1074 - #endif 1050 + void __iomem *generic_ioremap_prot(phys_addr_t phys_addr, size_t size, 1051 + pgprot_t prot); 1075 1052 1076 1053 void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size, 1077 1054 unsigned long prot); 1078 1055 void iounmap(volatile void __iomem *addr); 1056 + void generic_iounmap(volatile void __iomem *addr); 1079 1057 1058 + #ifndef ioremap 1059 + #define ioremap ioremap 1080 1060 static inline void __iomem *ioremap(phys_addr_t addr, size_t size) 1081 1061 { 1082 1062 /* _PAGE_IOREMAP needs to be supplied by the architecture */ 1083 1063 return ioremap_prot(addr, size, _PAGE_IOREMAP); 1084 1064 } 1065 + #endif 1085 1066 #endif /* !CONFIG_MMU || CONFIG_GENERIC_IOREMAP */ 1086 1067 1087 1068 #ifndef ioremap_wc
+3 -3
include/asm-generic/iomap.h
··· 93 93 extern void ioport_unmap(void __iomem *); 94 94 #endif 95 95 96 - #ifndef ARCH_HAS_IOREMAP_WC 96 + #ifndef ioremap_wc 97 97 #define ioremap_wc ioremap 98 98 #endif 99 99 100 - #ifndef ARCH_HAS_IOREMAP_WT 100 + #ifndef ioremap_wt 101 101 #define ioremap_wt ioremap 102 102 #endif 103 103 104 - #ifndef ARCH_HAS_IOREMAP_NP 104 + #ifndef ioremap_np 105 105 /* See the comment in asm-generic/io.h about ioremap_np(). */ 106 106 #define ioremap_np ioremap_np 107 107 static inline void __iomem *ioremap_np(phys_addr_t offset, size_t size)
+52 -36
include/asm-generic/pgalloc.h
··· 8 8 #define GFP_PGTABLE_USER (GFP_PGTABLE_KERNEL | __GFP_ACCOUNT) 9 9 10 10 /** 11 - * __pte_alloc_one_kernel - allocate a page for PTE-level kernel page table 11 + * __pte_alloc_one_kernel - allocate memory for a PTE-level kernel page table 12 12 * @mm: the mm_struct of the current context 13 13 * 14 14 * This function is intended for architectures that need ··· 18 18 */ 19 19 static inline pte_t *__pte_alloc_one_kernel(struct mm_struct *mm) 20 20 { 21 - return (pte_t *)__get_free_page(GFP_PGTABLE_KERNEL); 21 + struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & 22 + ~__GFP_HIGHMEM, 0); 23 + 24 + if (!ptdesc) 25 + return NULL; 26 + return ptdesc_address(ptdesc); 22 27 } 23 28 24 29 #ifndef __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL 25 30 /** 26 - * pte_alloc_one_kernel - allocate a page for PTE-level kernel page table 31 + * pte_alloc_one_kernel - allocate memory for a PTE-level kernel page table 27 32 * @mm: the mm_struct of the current context 28 33 * 29 34 * Return: pointer to the allocated memory or %NULL on error ··· 40 35 #endif 41 36 42 37 /** 43 - * pte_free_kernel - free PTE-level kernel page table page 38 + * pte_free_kernel - free PTE-level kernel page table memory 44 39 * @mm: the mm_struct of the current context 45 40 * @pte: pointer to the memory containing the page table 46 41 */ 47 42 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) 48 43 { 49 - free_page((unsigned long)pte); 44 + pagetable_free(virt_to_ptdesc(pte)); 50 45 } 51 46 52 47 /** 53 - * __pte_alloc_one - allocate a page for PTE-level user page table 48 + * __pte_alloc_one - allocate memory for a PTE-level user page table 54 49 * @mm: the mm_struct of the current context 55 50 * @gfp: GFP flags to use for the allocation 56 51 * 57 - * Allocates a page and runs the pgtable_pte_page_ctor(). 52 + * Allocate memory for a page table and ptdesc and runs pagetable_pte_ctor(). 
58 53 * 59 54 * This function is intended for architectures that need 60 55 * anything beyond simple page allocation or must have custom GFP flags. 61 56 * 62 - * Return: `struct page` initialized as page table or %NULL on error 57 + * Return: `struct page` referencing the ptdesc or %NULL on error 63 58 */ 64 59 static inline pgtable_t __pte_alloc_one(struct mm_struct *mm, gfp_t gfp) 65 60 { 66 - struct page *pte; 61 + struct ptdesc *ptdesc; 67 62 68 - pte = alloc_page(gfp); 69 - if (!pte) 63 + ptdesc = pagetable_alloc(gfp, 0); 64 + if (!ptdesc) 70 65 return NULL; 71 - if (!pgtable_pte_page_ctor(pte)) { 72 - __free_page(pte); 66 + if (!pagetable_pte_ctor(ptdesc)) { 67 + pagetable_free(ptdesc); 73 68 return NULL; 74 69 } 75 70 76 - return pte; 71 + return ptdesc_page(ptdesc); 77 72 } 78 73 79 74 #ifndef __HAVE_ARCH_PTE_ALLOC_ONE ··· 81 76 * pte_alloc_one - allocate a page for PTE-level user page table 82 77 * @mm: the mm_struct of the current context 83 78 * 84 - * Allocates a page and runs the pgtable_pte_page_ctor(). 79 + * Allocate memory for a page table and ptdesc and runs pagetable_pte_ctor(). 
85 80 * 86 - * Return: `struct page` initialized as page table or %NULL on error 81 + * Return: `struct page` referencing the ptdesc or %NULL on error 87 82 */ 88 83 static inline pgtable_t pte_alloc_one(struct mm_struct *mm) 89 84 { ··· 97 92 */ 98 93 99 94 /** 100 - * pte_free - free PTE-level user page table page 95 + * pte_free - free PTE-level user page table memory 101 96 * @mm: the mm_struct of the current context 102 - * @pte_page: the `struct page` representing the page table 97 + * @pte_page: the `struct page` referencing the ptdesc 103 98 */ 104 99 static inline void pte_free(struct mm_struct *mm, struct page *pte_page) 105 100 { 106 - pgtable_pte_page_dtor(pte_page); 107 - __free_page(pte_page); 101 + struct ptdesc *ptdesc = page_ptdesc(pte_page); 102 + 103 + pagetable_pte_dtor(ptdesc); 104 + pagetable_free(ptdesc); 108 105 } 109 106 110 107 ··· 114 107 115 108 #ifndef __HAVE_ARCH_PMD_ALLOC_ONE 116 109 /** 117 - * pmd_alloc_one - allocate a page for PMD-level page table 110 + * pmd_alloc_one - allocate memory for a PMD-level page table 118 111 * @mm: the mm_struct of the current context 119 112 * 120 - * Allocates a page and runs the pgtable_pmd_page_ctor(). 113 + * Allocate memory for a page table and ptdesc and runs pagetable_pmd_ctor(). 114 + * 121 115 * Allocations use %GFP_PGTABLE_USER in user context and 122 116 * %GFP_PGTABLE_KERNEL in kernel context. 
123 117 * ··· 126 118 */ 127 119 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) 128 120 { 129 - struct page *page; 121 + struct ptdesc *ptdesc; 130 122 gfp_t gfp = GFP_PGTABLE_USER; 131 123 132 124 if (mm == &init_mm) 133 125 gfp = GFP_PGTABLE_KERNEL; 134 - page = alloc_page(gfp); 135 - if (!page) 126 + ptdesc = pagetable_alloc(gfp, 0); 127 + if (!ptdesc) 136 128 return NULL; 137 - if (!pgtable_pmd_page_ctor(page)) { 138 - __free_page(page); 129 + if (!pagetable_pmd_ctor(ptdesc)) { 130 + pagetable_free(ptdesc); 139 131 return NULL; 140 132 } 141 - return (pmd_t *)page_address(page); 133 + return ptdesc_address(ptdesc); 142 134 } 143 135 #endif 144 136 145 137 #ifndef __HAVE_ARCH_PMD_FREE 146 138 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 147 139 { 140 + struct ptdesc *ptdesc = virt_to_ptdesc(pmd); 141 + 148 142 BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); 149 - pgtable_pmd_page_dtor(virt_to_page(pmd)); 150 - free_page((unsigned long)pmd); 143 + pagetable_pmd_dtor(ptdesc); 144 + pagetable_free(ptdesc); 151 145 } 152 146 #endif 153 147 ··· 160 150 static inline pud_t *__pud_alloc_one(struct mm_struct *mm, unsigned long addr) 161 151 { 162 152 gfp_t gfp = GFP_PGTABLE_USER; 153 + struct ptdesc *ptdesc; 163 154 164 155 if (mm == &init_mm) 165 156 gfp = GFP_PGTABLE_KERNEL; 166 - return (pud_t *)get_zeroed_page(gfp); 157 + gfp &= ~__GFP_HIGHMEM; 158 + 159 + ptdesc = pagetable_alloc(gfp, 0); 160 + if (!ptdesc) 161 + return NULL; 162 + return ptdesc_address(ptdesc); 167 163 } 168 164 169 165 #ifndef __HAVE_ARCH_PUD_ALLOC_ONE 170 166 /** 171 - * pud_alloc_one - allocate a page for PUD-level page table 167 + * pud_alloc_one - allocate memory for a PUD-level page table 172 168 * @mm: the mm_struct of the current context 173 169 * 174 - * Allocates a page using %GFP_PGTABLE_USER for user context and 175 - * %GFP_PGTABLE_KERNEL for kernel context. 
170 + * Allocate memory for a page table using %GFP_PGTABLE_USER for user context 171 + * and %GFP_PGTABLE_KERNEL for kernel context. 176 172 * 177 173 * Return: pointer to the allocated memory or %NULL on error 178 174 */ ··· 191 175 static inline void __pud_free(struct mm_struct *mm, pud_t *pud) 192 176 { 193 177 BUG_ON((unsigned long)pud & (PAGE_SIZE-1)); 194 - free_page((unsigned long)pud); 178 + pagetable_free(virt_to_ptdesc(pud)); 195 179 } 196 180 197 181 #ifndef __HAVE_ARCH_PUD_FREE ··· 206 190 #ifndef __HAVE_ARCH_PGD_FREE 207 191 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 208 192 { 209 - free_page((unsigned long)pgd); 193 + pagetable_free(virt_to_ptdesc(pgd)); 210 194 } 211 195 #endif 212 196
+11 -1
include/asm-generic/tlb.h
··· 456 456 return; 457 457 458 458 tlb_flush(tlb); 459 - mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end); 460 459 __tlb_reset_range(tlb); 461 460 } 462 461 ··· 478 479 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) 479 480 { 480 481 return tlb_remove_page_size(tlb, page, PAGE_SIZE); 482 + } 483 + 484 + static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt) 485 + { 486 + tlb_remove_table(tlb, pt); 487 + } 488 + 489 + /* Like tlb_remove_ptdesc, but for page-like page directories. */ 490 + static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt) 491 + { 492 + tlb_remove_page(tlb, ptdesc_page(pt)); 481 493 } 482 494 483 495 static inline void tlb_change_page_size(struct mmu_gather *tlb,
-1
include/linux/backing-dev.h
··· 46 46 extern struct list_head bdi_list; 47 47 48 48 extern struct workqueue_struct *bdi_wq; 49 - extern struct workqueue_struct *bdi_async_bio_wq; 50 49 51 50 static inline bool wb_has_dirty_io(struct bdi_writeback *wb) 52 51 {
+5
include/linux/bio.h
··· 253 253 return bio_first_bvec_all(bio)->bv_page; 254 254 } 255 255 256 + static inline struct folio *bio_first_folio_all(struct bio *bio) 257 + { 258 + return page_folio(bio_first_page_all(bio)); 259 + } 260 + 256 261 static inline struct bio_vec *bio_last_bvec_all(struct bio *bio) 257 262 { 258 263 WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+1 -3
include/linux/buffer_head.h
··· 194 194 void mark_buffer_dirty(struct buffer_head *bh); 195 195 void mark_buffer_write_io_error(struct buffer_head *bh); 196 196 void touch_buffer(struct buffer_head *bh); 197 - void set_bh_page(struct buffer_head *bh, 198 - struct page *page, unsigned long offset); 199 197 void folio_set_bh(struct buffer_head *bh, struct folio *folio, 200 198 unsigned long offset); 201 199 bool try_to_free_buffers(struct folio *); ··· 286 288 unsigned, struct page **, void **, 287 289 get_block_t *, loff_t *); 288 290 int generic_cont_expand_simple(struct inode *inode, loff_t size); 289 - int block_commit_write(struct page *page, unsigned from, unsigned to); 291 + void block_commit_write(struct page *page, unsigned int from, unsigned int to); 290 292 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, 291 293 get_block_t get_block); 292 294 /* Convert errno to return value from ->page_mkwrite() call */
+11 -2
include/linux/cacheflush.h
··· 7 7 struct folio; 8 8 9 9 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 10 - #ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO 10 + #ifndef flush_dcache_folio 11 11 void flush_dcache_folio(struct folio *folio); 12 12 #endif 13 13 #else 14 14 static inline void flush_dcache_folio(struct folio *folio) 15 15 { 16 16 } 17 - #define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO 0 17 + #define flush_dcache_folio flush_dcache_folio 18 18 #endif /* ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE */ 19 + 20 + #ifndef flush_icache_pages 21 + static inline void flush_icache_pages(struct vm_area_struct *vma, 22 + struct page *page, unsigned int nr) 23 + { 24 + } 25 + #endif 26 + 27 + #define flush_icache_page(vma, page) flush_icache_pages(vma, page, 1) 19 28 20 29 #endif /* _LINUX_CACHEFLUSH_H */
+22 -6
include/linux/damon.h
··· 226 226 * enum damos_filter_type - Type of memory for &struct damos_filter 227 227 * @DAMOS_FILTER_TYPE_ANON: Anonymous pages. 228 228 * @DAMOS_FILTER_TYPE_MEMCG: Specific memcg's pages. 229 + * @DAMOS_FILTER_TYPE_ADDR: Address range. 230 + * @DAMOS_FILTER_TYPE_TARGET: Data Access Monitoring target. 229 231 * @NR_DAMOS_FILTER_TYPES: Number of filter types. 230 232 * 231 - * The support of each filter type is up to running &struct damon_operations. 232 - * &enum DAMON_OPS_PADDR is supporting all filter types, while 233 - * &enum DAMON_OPS_VADDR and &enum DAMON_OPS_FVADDR are not supporting any 234 - * filter types. 233 + * The anon pages type and memcg type filters are handled by underlying 234 + * &struct damon_operations as a part of scheme action trying, and therefore 235 + * accounted as 'tried'. In contrast, other types are handled by core layer 236 + * before trying of the action and therefore not accounted as 'tried'. 237 + * 238 + * The support of the filters that handled by &struct damon_operations depend 239 + * on the running &struct damon_operations. 240 + * &enum DAMON_OPS_PADDR supports both anon pages type and memcg type filters, 241 + * while &enum DAMON_OPS_VADDR and &enum DAMON_OPS_FVADDR don't support any of 242 + * the two types. 235 243 */ 236 244 enum damos_filter_type { 237 245 DAMOS_FILTER_TYPE_ANON, 238 246 DAMOS_FILTER_TYPE_MEMCG, 247 + DAMOS_FILTER_TYPE_ADDR, 248 + DAMOS_FILTER_TYPE_TARGET, 239 249 NR_DAMOS_FILTER_TYPES, 240 250 }; 241 251 ··· 254 244 * @type: Type of the page. 255 245 * @matching: If the matching page should filtered out or in. 256 246 * @memcg_id: Memcg id of the question if @type is DAMOS_FILTER_MEMCG. 247 + * @addr_range: Address range if @type is DAMOS_FILTER_TYPE_ADDR. 248 + * @target_idx: Index of the &struct damon_target of 249 + * &damon_ctx->adaptive_targets if @type is 250 + * DAMOS_FILTER_TYPE_TARGET. 257 251 * @list: List head for siblings. 
258 252 * 259 253 * Before applying the &damos->action to a memory region, DAMOS checks if each 260 254 * page of the region matches to this and avoid applying the action if so. 261 - * Note that the check support is up to &struct damon_operations 262 - * implementation. 255 + * Support of each filter type depends on the running &struct damon_operations 256 + * and the type. Refer to &enum damos_filter_type for more detail. 263 257 */ 264 258 struct damos_filter { 265 259 enum damos_filter_type type; 266 260 bool matching; 267 261 union { 268 262 unsigned short memcg_id; 263 + struct damon_addr_range addr_range; 264 + int target_idx; 269 265 }; 270 266 struct list_head list; 271 267 };
+2 -2
include/linux/dax.h
··· 241 241 242 242 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter, 243 243 const struct iomap_ops *ops); 244 - vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size, 244 + vm_fault_t dax_iomap_fault(struct vm_fault *vmf, unsigned int order, 245 245 pfn_t *pfnp, int *errp, const struct iomap_ops *ops); 246 246 vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, 247 - enum page_entry_size pe_size, pfn_t pfn); 247 + unsigned int order, pfn_t pfn); 248 248 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index); 249 249 int dax_invalidate_mapping_entry_sync(struct address_space *mapping, 250 250 pgoff_t index);
-91
include/linux/frontswap.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 */ 2 - #ifndef _LINUX_FRONTSWAP_H 3 - #define _LINUX_FRONTSWAP_H 4 - 5 - #include <linux/swap.h> 6 - #include <linux/mm.h> 7 - #include <linux/bitops.h> 8 - #include <linux/jump_label.h> 9 - 10 - struct frontswap_ops { 11 - void (*init)(unsigned); /* this swap type was just swapon'ed */ 12 - int (*store)(unsigned, pgoff_t, struct page *); /* store a page */ 13 - int (*load)(unsigned, pgoff_t, struct page *, bool *); /* load a page */ 14 - void (*invalidate_page)(unsigned, pgoff_t); /* page no longer needed */ 15 - void (*invalidate_area)(unsigned); /* swap type just swapoff'ed */ 16 - }; 17 - 18 - int frontswap_register_ops(const struct frontswap_ops *ops); 19 - 20 - extern void frontswap_init(unsigned type, unsigned long *map); 21 - extern int __frontswap_store(struct page *page); 22 - extern int __frontswap_load(struct page *page); 23 - extern void __frontswap_invalidate_page(unsigned, pgoff_t); 24 - extern void __frontswap_invalidate_area(unsigned); 25 - 26 - #ifdef CONFIG_FRONTSWAP 27 - extern struct static_key_false frontswap_enabled_key; 28 - 29 - static inline bool frontswap_enabled(void) 30 - { 31 - return static_branch_unlikely(&frontswap_enabled_key); 32 - } 33 - 34 - static inline void frontswap_map_set(struct swap_info_struct *p, 35 - unsigned long *map) 36 - { 37 - p->frontswap_map = map; 38 - } 39 - 40 - static inline unsigned long *frontswap_map_get(struct swap_info_struct *p) 41 - { 42 - return p->frontswap_map; 43 - } 44 - #else 45 - /* all inline routines become no-ops and all externs are ignored */ 46 - 47 - static inline bool frontswap_enabled(void) 48 - { 49 - return false; 50 - } 51 - 52 - static inline void frontswap_map_set(struct swap_info_struct *p, 53 - unsigned long *map) 54 - { 55 - } 56 - 57 - static inline unsigned long *frontswap_map_get(struct swap_info_struct *p) 58 - { 59 - return NULL; 60 - } 61 - #endif 62 - 63 - static inline int frontswap_store(struct page *page) 64 - { 65 - if 
(frontswap_enabled()) 66 - return __frontswap_store(page); 67 - 68 - return -1; 69 - } 70 - 71 - static inline int frontswap_load(struct page *page) 72 - { 73 - if (frontswap_enabled()) 74 - return __frontswap_load(page); 75 - 76 - return -1; 77 - } 78 - 79 - static inline void frontswap_invalidate_page(unsigned type, pgoff_t offset) 80 - { 81 - if (frontswap_enabled()) 82 - __frontswap_invalidate_page(type, offset); 83 - } 84 - 85 - static inline void frontswap_invalidate_area(unsigned type) 86 - { 87 - if (frontswap_enabled()) 88 - __frontswap_invalidate_area(type); 89 - } 90 - 91 - #endif /* _LINUX_FRONTSWAP_H */
+1 -1
include/linux/fs.h
··· 478 478 atomic_t nr_thps; 479 479 #endif 480 480 struct rb_root_cached i_mmap; 481 - struct rw_semaphore i_mmap_rwsem; 482 481 unsigned long nrpages; 483 482 pgoff_t writeback_index; 484 483 const struct address_space_operations *a_ops; 485 484 unsigned long flags; 485 + struct rw_semaphore i_mmap_rwsem; 486 486 errseq_t wb_err; 487 487 spinlock_t private_lock; 488 488 struct list_head private_list;
+44
include/linux/highmem.h
··· 439 439 kunmap_local(addr); 440 440 } 441 441 442 + static inline void memcpy_from_folio(char *to, struct folio *folio, 443 + size_t offset, size_t len) 444 + { 445 + VM_BUG_ON(offset + len > folio_size(folio)); 446 + 447 + do { 448 + const char *from = kmap_local_folio(folio, offset); 449 + size_t chunk = len; 450 + 451 + if (folio_test_highmem(folio) && 452 + chunk > PAGE_SIZE - offset_in_page(offset)) 453 + chunk = PAGE_SIZE - offset_in_page(offset); 454 + memcpy(to, from, chunk); 455 + kunmap_local(from); 456 + 457 + from += chunk; 458 + offset += chunk; 459 + len -= chunk; 460 + } while (len > 0); 461 + } 462 + 463 + static inline void memcpy_to_folio(struct folio *folio, size_t offset, 464 + const char *from, size_t len) 465 + { 466 + VM_BUG_ON(offset + len > folio_size(folio)); 467 + 468 + do { 469 + char *to = kmap_local_folio(folio, offset); 470 + size_t chunk = len; 471 + 472 + if (folio_test_highmem(folio) && 473 + chunk > PAGE_SIZE - offset_in_page(offset)) 474 + chunk = PAGE_SIZE - offset_in_page(offset); 475 + memcpy(to, from, chunk); 476 + kunmap_local(to); 477 + 478 + from += chunk; 479 + offset += chunk; 480 + len -= chunk; 481 + } while (len > 0); 482 + 483 + flush_dcache_folio(folio); 484 + } 485 + 442 486 /** 443 487 * memcpy_from_file_folio - Copy some bytes from a file folio. 444 488 * @to: The destination buffer.
+2 -4
include/linux/huge_mm.h
··· 140 140 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, 141 141 unsigned long len, unsigned long pgoff, unsigned long flags); 142 142 143 - void prep_transhuge_page(struct page *page); 144 - void free_transhuge_page(struct page *page); 145 - 143 + void folio_prep_large_rmappable(struct folio *folio); 146 144 bool can_split_folio(struct folio *folio, int *pextra_pins); 147 145 int split_huge_page_to_list(struct page *page, struct list_head *list); 148 146 static inline int split_huge_page(struct page *page) ··· 280 282 return false; 281 283 } 282 284 283 - static inline void prep_transhuge_page(struct page *page) {} 285 + static inline void folio_prep_large_rmappable(struct folio *folio) {} 284 286 285 287 #define transparent_hugepage_flags 0UL 286 288
+12 -26
include/linux/hugetlb.h
··· 26 26 #define __hugepd(x) ((hugepd_t) { (x) }) 27 27 #endif 28 28 29 + void free_huge_folio(struct folio *folio); 30 + 29 31 #ifdef CONFIG_HUGETLB_PAGE 30 32 31 33 #include <linux/mempolicy.h> ··· 133 131 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, 134 132 struct vm_area_struct *, struct vm_area_struct *); 135 133 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma, 136 - unsigned long address, unsigned int flags); 137 - long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, 138 - struct page **, unsigned long *, unsigned long *, 139 - long, unsigned int, int *); 134 + unsigned long address, unsigned int flags, 135 + unsigned int *page_mask); 140 136 void unmap_hugepage_range(struct vm_area_struct *, 141 137 unsigned long, unsigned long, struct page *, 142 138 zap_flags_t); ··· 167 167 bool *migratable_cleared); 168 168 void folio_putback_active_hugetlb(struct folio *folio); 169 169 void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int reason); 170 - void free_huge_page(struct page *page); 171 170 void hugetlb_fix_reserve_counts(struct inode *inode); 172 171 extern struct mutex *hugetlb_fault_mutex_table; 173 172 u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx); ··· 296 297 { 297 298 } 298 299 299 - static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma, 300 - unsigned long address, unsigned int flags) 300 + static inline struct page *hugetlb_follow_page_mask( 301 + struct vm_area_struct *vma, unsigned long address, unsigned int flags, 302 + unsigned int *page_mask) 301 303 { 302 304 BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/ 303 - } 304 - 305 - static inline long follow_hugetlb_page(struct mm_struct *mm, 306 - struct vm_area_struct *vma, struct page **pages, 307 - unsigned long *position, unsigned long *nr_pages, 308 - long i, unsigned int flags, int *nonblocking) 309 - { 310 - BUG(); 311 - return 0; 312 305 } 
313 306 314 307 static inline int copy_hugetlb_page_range(struct mm_struct *dst, ··· 842 851 return size_to_hstate(folio_size(folio)); 843 852 } 844 853 845 - static inline struct hstate *page_hstate(struct page *page) 846 - { 847 - return folio_hstate(page_folio(page)); 848 - } 849 - 850 854 static inline unsigned hstate_index_to_shift(unsigned index) 851 855 { 852 856 return hstates[index].order + PAGE_SHIFT; ··· 993 1007 void hugetlb_unregister_node(struct node *node); 994 1008 #endif 995 1009 1010 + /* 1011 + * Check if a given raw @page in a hugepage is HWPOISON. 1012 + */ 1013 + bool is_raw_hwpoison_page_in_hugepage(struct page *page); 1014 + 996 1015 #else /* CONFIG_HUGETLB_PAGE */ 997 1016 struct hstate {}; 998 1017 ··· 1054 1063 } 1055 1064 1056 1065 static inline struct hstate *folio_hstate(struct folio *folio) 1057 - { 1058 - return NULL; 1059 - } 1060 - 1061 - static inline struct hstate *page_hstate(struct page *page) 1062 1066 { 1063 1067 return NULL; 1064 1068 }
+30
include/linux/ioremap.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _LINUX_IOREMAP_H 3 + #define _LINUX_IOREMAP_H 4 + 5 + #include <linux/kasan.h> 6 + #include <asm/pgtable.h> 7 + 8 + #if defined(CONFIG_HAS_IOMEM) || defined(CONFIG_GENERIC_IOREMAP) 9 + /* 10 + * Ioremap often, but not always, uses the generic vmalloc area. E.g. on 11 + * the Power architecture, it can have a different ioremap space. 12 + */ 13 + #ifndef IOREMAP_START 14 + #define IOREMAP_START VMALLOC_START 15 + #define IOREMAP_END VMALLOC_END 16 + #endif 17 + static inline bool is_ioremap_addr(const void *x) 18 + { 19 + unsigned long addr = (unsigned long)kasan_reset_tag(x); 20 + 21 + return addr >= IOREMAP_START && addr < IOREMAP_END; 22 + } 23 + #else 24 + static inline bool is_ioremap_addr(const void *x) 25 + { 26 + return false; 27 + } 28 + #endif 29 + 30 + #endif /* _LINUX_IOREMAP_H */
+6 -5
include/linux/kfence.h
··· 59 59 } 60 60 61 61 /** 62 - * kfence_alloc_pool() - allocate the KFENCE pool via memblock 62 + * kfence_alloc_pool_and_metadata() - allocate the KFENCE pool and KFENCE 63 + * metadata via memblock 63 64 */ 64 - void __init kfence_alloc_pool(void); 65 + void __init kfence_alloc_pool_and_metadata(void); 65 66 66 67 /** 67 68 * kfence_init() - perform KFENCE initialization at boot time 68 69 * 69 - * Requires that kfence_alloc_pool() was called before. This sets up the 70 - * allocation gate timer, and requires that workqueues are available. 70 + * Requires that kfence_alloc_pool_and_metadata() was called before. This sets 71 + * up the allocation gate timer, and requires that workqueues are available. 71 72 */ 72 73 void __init kfence_init(void); 73 74 ··· 224 223 #else /* CONFIG_KFENCE */ 225 224 226 225 static inline bool is_kfence_address(const void *addr) { return false; } 227 - static inline void kfence_alloc_pool(void) { } 226 + static inline void kfence_alloc_pool_and_metadata(void) { } 228 227 static inline void kfence_init(void) { } 229 228 static inline void kfence_shutdown_cache(struct kmem_cache *s) { } 230 229 static inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) { return NULL; }
+20
include/linux/ksm.h
··· 26 26 27 27 int __ksm_enter(struct mm_struct *mm); 28 28 void __ksm_exit(struct mm_struct *mm); 29 + /* 30 + * To identify zeropages that were mapped by KSM, we reuse the dirty bit 31 + * in the PTE. If the PTE is dirty, the zeropage was mapped by KSM when 32 + * deduplicating memory. 33 + */ 34 + #define is_ksm_zero_pte(pte) (is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte)) 35 + 36 + extern unsigned long ksm_zero_pages; 37 + 38 + static inline void ksm_might_unmap_zero_page(struct mm_struct *mm, pte_t pte) 39 + { 40 + if (is_ksm_zero_pte(pte)) { 41 + ksm_zero_pages--; 42 + mm->ksm_zero_pages--; 43 + } 44 + } 29 45 30 46 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) 31 47 { ··· 108 92 } 109 93 110 94 static inline void ksm_exit(struct mm_struct *mm) 95 + { 96 + } 97 + 98 + static inline void ksm_might_unmap_zero_page(struct mm_struct *mm, pte_t pte) 111 99 { 112 100 } 113 101
+35 -11
include/linux/maple_tree.h
··· 29 29 #define MAPLE_NODE_SLOTS 31 /* 256 bytes including ->parent */ 30 30 #define MAPLE_RANGE64_SLOTS 16 /* 256 bytes */ 31 31 #define MAPLE_ARANGE64_SLOTS 10 /* 240 bytes */ 32 - #define MAPLE_ARANGE64_META_MAX 15 /* Out of range for metadata */ 33 32 #define MAPLE_ALLOC_SLOTS (MAPLE_NODE_SLOTS - 1) 34 33 #else 35 34 /* 32bit sizes */ 36 35 #define MAPLE_NODE_SLOTS 63 /* 256 bytes including ->parent */ 37 36 #define MAPLE_RANGE64_SLOTS 32 /* 256 bytes */ 38 37 #define MAPLE_ARANGE64_SLOTS 21 /* 240 bytes */ 39 - #define MAPLE_ARANGE64_META_MAX 31 /* Out of range for metadata */ 40 38 #define MAPLE_ALLOC_SLOTS (MAPLE_NODE_SLOTS - 2) 41 39 #endif /* defined(CONFIG_64BIT) || defined(BUILD_VDSO32_64) */ 42 40 ··· 182 184 183 185 #ifdef CONFIG_LOCKDEP 184 186 typedef struct lockdep_map *lockdep_map_p; 185 - #define mt_lock_is_held(mt) lock_is_held(mt->ma_external_lock) 187 + #define mt_lock_is_held(mt) \ 188 + (!(mt)->ma_external_lock || lock_is_held((mt)->ma_external_lock)) 189 + 190 + #define mt_write_lock_is_held(mt) \ 191 + (!(mt)->ma_external_lock || \ 192 + lock_is_held_type((mt)->ma_external_lock, 0)) 193 + 186 194 #define mt_set_external_lock(mt, lock) \ 187 195 (mt)->ma_external_lock = &(lock)->dep_map 196 + 197 + #define mt_on_stack(mt) (mt).ma_external_lock = NULL 188 198 #else 189 199 typedef struct { /* nothing */ } lockdep_map_p; 190 - #define mt_lock_is_held(mt) 1 200 + #define mt_lock_is_held(mt) 1 201 + #define mt_write_lock_is_held(mt) 1 191 202 #define mt_set_external_lock(mt, lock) do { } while (0) 203 + #define mt_on_stack(mt) do { } while (0) 192 204 #endif 193 205 194 206 /* ··· 220 212 spinlock_t ma_lock; 221 213 lockdep_map_p ma_external_lock; 222 214 }; 223 - void __rcu *ma_root; 224 215 unsigned int ma_flags; 216 + void __rcu *ma_root; 225 217 }; 226 218 227 219 /** ··· 466 458 void *mas_find_range(struct ma_state *mas, unsigned long max); 467 459 void *mas_find_rev(struct ma_state *mas, unsigned long min); 468 460 void 
*mas_find_range_rev(struct ma_state *mas, unsigned long max); 469 - int mas_preallocate(struct ma_state *mas, gfp_t gfp); 461 + int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp); 470 462 bool mas_is_err(struct ma_state *mas); 471 463 472 464 bool mas_nomem(struct ma_state *mas, gfp_t gfp); ··· 539 531 */ 540 532 #define mas_for_each(__mas, __entry, __max) \ 541 533 while (((__entry) = mas_find((__mas), (__max))) != NULL) 534 + /** 535 + * __mas_set_range() - Set up Maple Tree operation state to a sub-range of the 536 + * current location. 537 + * @mas: Maple Tree operation state. 538 + * @start: New start of range in the Maple Tree. 539 + * @last: New end of range in the Maple Tree. 540 + * 541 + * set the internal maple state values to a sub-range. 542 + * Please use mas_set_range() if you do not know where you are in the tree. 543 + */ 544 + static inline void __mas_set_range(struct ma_state *mas, unsigned long start, 545 + unsigned long last) 546 + { 547 + mas->index = start; 548 + mas->last = last; 549 + } 542 550 543 551 /** 544 552 * mas_set_range() - Set up Maple Tree operation state for a different index. ··· 569 545 static inline 570 546 void mas_set_range(struct ma_state *mas, unsigned long start, unsigned long last) 571 547 { 572 - mas->index = start; 573 - mas->last = last; 574 - mas->node = MAS_START; 548 + __mas_set_range(mas, start, last); 549 + mas->node = MAS_START; 575 550 } 576 551 577 552 /** ··· 685 662 * mt_for_each - Iterate over each entry starting at index until max. 686 663 * @__tree: The Maple Tree 687 664 * @__entry: The current entry 688 - * @__index: The index to update to track the location in the tree 665 + * @__index: The index to start the search from. Subsequently used as iterator. 689 666 * @__max: The maximum limit for @index 690 667 * 691 - * Note: Will not return the zero entry. 668 + * This iterator skips all entries, which resolve to a NULL pointer, 669 + * e.g. 
entries which have been reserved with XA_ZERO_ENTRY. 692 670 */ 693 671 #define mt_for_each(__tree, __entry, __index, __max) \ 694 672 for (__entry = mt_find(__tree, &(__index), __max); \
+5 -9
include/linux/memblock.h
··· 581 581 unsigned long high_limit); 582 582 583 583 #define HASH_EARLY 0x00000001 /* Allocating during early boot? */ 584 - #define HASH_SMALL 0x00000002 /* sub-page allocation allowed, min 585 - * shift passed via *_hash_shift */ 586 - #define HASH_ZERO 0x00000004 /* Zero allocated hash table */ 584 + #define HASH_ZERO 0x00000002 /* Zero allocated hash table */ 587 585 588 586 /* Only NUMA needs hash distribution. 64bit NUMA architectures have 589 587 * sufficient vmalloc space. ··· 594 596 #endif 595 597 596 598 #ifdef CONFIG_MEMTEST 597 - extern phys_addr_t early_memtest_bad_size; /* Size of faulty ram found by memtest */ 598 - extern bool early_memtest_done; /* Was early memtest done? */ 599 - extern void early_memtest(phys_addr_t start, phys_addr_t end); 599 + void early_memtest(phys_addr_t start, phys_addr_t end); 600 + void memtest_report_meminfo(struct seq_file *m); 600 601 #else 601 - static inline void early_memtest(phys_addr_t start, phys_addr_t end) 602 - { 603 - } 602 + static inline void early_memtest(phys_addr_t start, phys_addr_t end) { } 603 + static inline void memtest_report_meminfo(struct seq_file *m) { } 604 604 #endif 605 605 606 606
+8 -10
include/linux/memcontrol.h
··· 61 61 #ifdef CONFIG_MEMCG 62 62 63 63 #define MEM_CGROUP_ID_SHIFT 16 64 - #define MEM_CGROUP_ID_MAX USHRT_MAX 65 64 66 65 struct mem_cgroup_id { 67 66 int id; ··· 110 111 struct lruvec_stats { 111 112 /* Aggregated (CPU and subtree) state */ 112 113 long state[NR_VM_NODE_STAT_ITEMS]; 114 + 115 + /* Non-hierarchical (CPU aggregated) state */ 116 + long state_local[NR_VM_NODE_STAT_ITEMS]; 113 117 114 118 /* Pending child counts during tree propagation */ 115 119 long state_pending[NR_VM_NODE_STAT_ITEMS]; ··· 590 588 /* 591 589 * There is no reclaim protection applied to a targeted reclaim. 592 590 * We are special casing this specific case here because 593 - * mem_cgroup_protected calculation is not robust enough to keep 591 + * mem_cgroup_calculate_protection is not robust enough to keep 594 592 * the protection invariant for calculated effective values for 595 593 * parallel reclaimers with different reclaim target. This is 596 594 * especially a problem for tail memcgs (as they have pages on LRU) ··· 868 866 * parent_mem_cgroup - find the accounting parent of a memcg 869 867 * @memcg: memcg whose parent to find 870 868 * 871 - * Returns the parent memcg, or NULL if this is the root or the memory 872 - * controller is in legacy no-hierarchy mode. 869 + * Returns the parent memcg, or NULL if this is the root. 
873 870 */ 874 871 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) 875 872 { ··· 1026 1025 { 1027 1026 struct mem_cgroup_per_node *pn; 1028 1027 long x = 0; 1029 - int cpu; 1030 1028 1031 1029 if (mem_cgroup_disabled()) 1032 1030 return node_page_state(lruvec_pgdat(lruvec), idx); 1033 1031 1034 1032 pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 1035 - for_each_possible_cpu(cpu) 1036 - x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu); 1033 + x = READ_ONCE(pn->lruvec_stats.state_local[idx]); 1037 1034 #ifdef CONFIG_SMP 1038 1035 if (x < 0) 1039 1036 x = 0; ··· 1162 1163 #else /* CONFIG_MEMCG */ 1163 1164 1164 1165 #define MEM_CGROUP_ID_SHIFT 0 1165 - #define MEM_CGROUP_ID_MAX 0 1166 1166 1167 1167 static inline struct mem_cgroup *folio_memcg(struct folio *folio) 1168 1168 { ··· 1764 1766 void __memcg_kmem_uncharge_page(struct page *page, int order); 1765 1767 1766 1768 struct obj_cgroup *get_obj_cgroup_from_current(void); 1767 - struct obj_cgroup *get_obj_cgroup_from_page(struct page *page); 1769 + struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio); 1768 1770 1769 1771 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size); 1770 1772 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size); ··· 1848 1850 { 1849 1851 } 1850 1852 1851 - static inline struct obj_cgroup *get_obj_cgroup_from_page(struct page *page) 1853 + static inline struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio) 1852 1854 { 1853 1855 return NULL; 1854 1856 }
+2 -2
include/linux/memory-tiers.h
··· 33 33 #ifdef CONFIG_NUMA 34 34 extern bool numa_demotion_enabled; 35 35 struct memory_dev_type *alloc_memory_type(int adistance); 36 - void destroy_memory_type(struct memory_dev_type *memtype); 36 + void put_memory_type(struct memory_dev_type *memtype); 37 37 void init_node_memory_type(int node, struct memory_dev_type *default_type); 38 38 void clear_node_memory_type(int node, struct memory_dev_type *memtype); 39 39 #ifdef CONFIG_MIGRATION ··· 68 68 return NULL; 69 69 } 70 70 71 - static inline void destroy_memory_type(struct memory_dev_type *memtype) 71 + static inline void put_memory_type(struct memory_dev_type *memtype) 72 72 { 73 73 74 74 }
+2 -6
include/linux/memory.h
··· 77 77 */ 78 78 struct zone *zone; 79 79 struct device dev; 80 - /* 81 - * Number of vmemmap pages. These pages 82 - * lay at the beginning of the memory block. 83 - */ 84 - unsigned long nr_vmemmap_pages; 80 + struct vmem_altmap *altmap; 85 81 struct memory_group *group; /* group (if any) for this block */ 86 82 struct list_head group_next; /* next block inside memory group */ 87 83 #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_MEMORY_HOTPLUG) ··· 143 147 extern int register_memory_notifier(struct notifier_block *nb); 144 148 extern void unregister_memory_notifier(struct notifier_block *nb); 145 149 int create_memory_block_devices(unsigned long start, unsigned long size, 146 - unsigned long vmemmap_pages, 150 + struct vmem_altmap *altmap, 147 151 struct memory_group *group); 148 152 void remove_memory_block_devices(unsigned long start, unsigned long size); 149 153 extern void memory_dev_init(void);
+2 -1
include/linux/memory_hotplug.h
··· 97 97 * To do so, we will use the beginning of the hot-added range to build 98 98 * the page tables for the memmap array that describes the entire range. 99 99 * Only selected architectures support it with SPARSE_VMEMMAP. 100 + * This is only a hint; the core kernel can decide not to do this based on 101 + * different alignment checks. 100 102 */ 101 103 #define MHP_MEMMAP_ON_MEMORY ((__force mhp_t)BIT(1)) 102 104 /* ··· 356 354 extern int arch_create_linear_mapping(int nid, u64 start, u64 size, 357 355 struct mhp_params *params); 358 356 void arch_remove_linear_mapping(u64 start, u64 size); 359 - extern bool mhp_supports_memmap_on_memory(unsigned long size); 360 357 #endif /* CONFIG_MEMORY_HOTPLUG */ 361 358 362 359 #endif /* __LINUX_MEMORY_HOTPLUG_H */
+27
include/linux/minmax.h
··· 3 3 #define _LINUX_MINMAX_H 4 4 5 5 #include <linux/const.h> 6 + #include <linux/types.h> 6 7 7 8 /* 8 9 * min()/max()/clamp() macros must accomplish three things: ··· 158 157 * integer type. 159 158 */ 160 159 #define clamp_val(val, lo, hi) clamp_t(typeof(val), val, lo, hi) 160 + 161 + static inline bool in_range64(u64 val, u64 start, u64 len) 162 + { 163 + return (val - start) < len; 164 + } 165 + 166 + static inline bool in_range32(u32 val, u32 start, u32 len) 167 + { 168 + return (val - start) < len; 169 + } 170 + 171 + /** 172 + * in_range - Determine if a value lies within a range. 173 + * @val: Value to test. 174 + * @start: First value in range. 175 + * @len: Number of values in range. 176 + * 177 + * This is more efficient than "if (start <= val && val < (start + len))". 178 + * It also gives a different answer if @start + @len overflows the size of 179 + * the type by a sufficient amount to encompass @val. Decide for yourself 180 + * which behaviour you want, or prove that start + len never overflow. 181 + * Do not blindly replace one form with the other. 182 + */ 183 + #define in_range(val, start, len) \ 184 + ((sizeof(start) | sizeof(len) | sizeof(val)) <= sizeof(u32) ? \ 185 + in_range32(val, start, len) : in_range64(val, start, len)) 161 186 162 187 /** 163 188 * swap - swap values of @a and @b
+229 -138
include/linux/mm.h
··· 532 532 */ 533 533 }; 534 534 535 - /* page entry size for vm->huge_fault() */ 536 - enum page_entry_size { 537 - PE_SIZE_PTE = 0, 538 - PE_SIZE_PMD, 539 - PE_SIZE_PUD, 540 - }; 541 - 542 535 /* 543 536 * These are the virtual MM functions - opening of an area, closing and 544 537 * unmapping it (needed to keep files on disk up-to-date etc), pointer ··· 555 562 int (*mprotect)(struct vm_area_struct *vma, unsigned long start, 556 563 unsigned long end, unsigned long newflags); 557 564 vm_fault_t (*fault)(struct vm_fault *vmf); 558 - vm_fault_t (*huge_fault)(struct vm_fault *vmf, 559 - enum page_entry_size pe_size); 565 + vm_fault_t (*huge_fault)(struct vm_fault *vmf, unsigned int order); 560 566 vm_fault_t (*map_pages)(struct vm_fault *vmf, 561 567 pgoff_t start_pgoff, pgoff_t end_pgoff); 562 568 unsigned long (*pagesize)(struct vm_area_struct * area); ··· 671 679 rcu_read_unlock(); 672 680 } 673 681 682 + /* WARNING! Can only be used if mmap_lock is expected to be write-locked */ 674 683 static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq) 675 684 { 676 685 mmap_assert_write_locked(vma->vm_mm); ··· 684 691 return (vma->vm_lock_seq == *mm_lock_seq); 685 692 } 686 693 694 + /* 695 + * Begin writing to a VMA. 696 + * Exclude concurrent readers under the per-VMA lock until the currently 697 + * write-locked mmap_lock is dropped or downgraded. 
698 + */ 687 699 static inline void vma_start_write(struct vm_area_struct *vma) 688 700 { 689 701 int mm_lock_seq; ··· 707 709 up_write(&vma->vm_lock->lock); 708 710 } 709 711 710 - static inline bool vma_try_start_write(struct vm_area_struct *vma) 711 - { 712 - int mm_lock_seq; 713 - 714 - if (__is_vma_write_locked(vma, &mm_lock_seq)) 715 - return true; 716 - 717 - if (!down_write_trylock(&vma->vm_lock->lock)) 718 - return false; 719 - 720 - WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); 721 - up_write(&vma->vm_lock->lock); 722 - return true; 723 - } 724 - 725 712 static inline void vma_assert_write_locked(struct vm_area_struct *vma) 726 713 { 727 714 int mm_lock_seq; 728 715 729 716 VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma); 717 + } 718 + 719 + static inline void vma_assert_locked(struct vm_area_struct *vma) 720 + { 721 + if (!rwsem_is_locked(&vma->vm_lock->lock)) 722 + vma_assert_write_locked(vma); 730 723 } 731 724 732 725 static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached) ··· 726 737 if (detached) 727 738 vma_assert_write_locked(vma); 728 739 vma->detached = detached; 740 + } 741 + 742 + static inline void release_fault_lock(struct vm_fault *vmf) 743 + { 744 + if (vmf->flags & FAULT_FLAG_VMA_LOCK) 745 + vma_end_read(vmf->vma); 746 + else 747 + mmap_read_unlock(vmf->vma->vm_mm); 748 + } 749 + 750 + static inline void assert_fault_locked(struct vm_fault *vmf) 751 + { 752 + if (vmf->flags & FAULT_FLAG_VMA_LOCK) 753 + vma_assert_locked(vmf->vma); 754 + else 755 + mmap_assert_locked(vmf->vma->vm_mm); 729 756 } 730 757 731 758 struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, ··· 753 748 { return false; } 754 749 static inline void vma_end_read(struct vm_area_struct *vma) {} 755 750 static inline void vma_start_write(struct vm_area_struct *vma) {} 756 - static inline bool vma_try_start_write(struct vm_area_struct *vma) 757 - { return true; } 758 - static inline void vma_assert_write_locked(struct 
vm_area_struct *vma) {} 751 + static inline void vma_assert_write_locked(struct vm_area_struct *vma) 752 + { mmap_assert_write_locked(vma->vm_mm); } 759 753 static inline void vma_mark_detached(struct vm_area_struct *vma, 760 754 bool detached) {} 761 755 756 + static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, 757 + unsigned long address) 758 + { 759 + return NULL; 760 + } 761 + 762 + static inline void release_fault_lock(struct vm_fault *vmf) 763 + { 764 + mmap_read_unlock(vmf->vma->vm_mm); 765 + } 766 + 767 + static inline void assert_fault_locked(struct vm_fault *vmf) 768 + { 769 + mmap_assert_locked(vmf->vma->vm_mm); 770 + } 771 + 762 772 #endif /* CONFIG_PER_VMA_LOCK */ 773 + 774 + extern const struct vm_operations_struct vma_dummy_vm_ops; 763 775 764 776 /* 765 777 * WARNING: vma_init does not initialize vma->vm_lock. ··· 784 762 */ 785 763 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) 786 764 { 787 - static const struct vm_operations_struct dummy_vm_ops = {}; 788 - 789 765 memset(vma, 0, sizeof(*vma)); 790 766 vma->vm_mm = mm; 791 - vma->vm_ops = &dummy_vm_ops; 767 + vma->vm_ops = &vma_dummy_vm_ops; 792 768 INIT_LIST_HEAD(&vma->anon_vma_chain); 793 769 vma_mark_detached(vma, false); 794 770 vma_numab_state_init(vma); ··· 799 779 ACCESS_PRIVATE(vma, __vm_flags) = flags; 800 780 } 801 781 802 - /* Use when VMA is part of the VMA tree and modifications need coordination */ 782 + /* 783 + * Use when VMA is part of the VMA tree and modifications need coordination 784 + * Note: vm_flags_reset and vm_flags_reset_once do not lock the vma and 785 + * it should be locked explicitly beforehand. 
786 + */ 803 787 static inline void vm_flags_reset(struct vm_area_struct *vma, 804 788 vm_flags_t flags) 805 789 { 806 - vma_start_write(vma); 790 + vma_assert_write_locked(vma); 807 791 vm_flags_init(vma, flags); 808 792 } 809 793 810 794 static inline void vm_flags_reset_once(struct vm_area_struct *vma, 811 795 vm_flags_t flags) 812 796 { 813 - vma_start_write(vma); 797 + vma_assert_write_locked(vma); 814 798 WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags); 815 799 } 816 800 ··· 861 837 static inline bool vma_is_anonymous(struct vm_area_struct *vma) 862 838 { 863 839 return !vma->vm_ops; 840 + } 841 + 842 + /* 843 + * Indicate if the VMA is a heap for the given task; for 844 + * /proc/PID/maps that is the heap of the main task. 845 + */ 846 + static inline bool vma_is_initial_heap(const struct vm_area_struct *vma) 847 + { 848 + return vma->vm_start <= vma->vm_mm->brk && 849 + vma->vm_end >= vma->vm_mm->start_brk; 850 + } 851 + 852 + /* 853 + * Indicate if the VMA is a stack for the given task; for 854 + * /proc/PID/maps that is the stack of the main task. 855 + */ 856 + static inline bool vma_is_initial_stack(const struct vm_area_struct *vma) 857 + { 858 + /* 859 + * We make no effort to guess what a given thread considers to be 860 + * its "stack". It's not even well-defined for programs written 861 + * languages like Go. 862 + */ 863 + return vma->vm_start <= vma->vm_mm->start_stack && 864 + vma->vm_end >= vma->vm_mm->start_stack; 864 865 } 865 866 866 867 static inline bool vma_is_temporary_stack(struct vm_area_struct *vma) ··· 1025 976 * compound_order() can be called without holding a reference, which means 1026 977 * that niceties like page_folio() don't work. These callers should be 1027 978 * prepared to handle wild return values. For example, PG_head may be 1028 - * set before _folio_order is initialised, or this may be a tail page. 979 + * set before the order is initialised, or this may be a tail page. 
1029 980 * See compaction.c for some good examples. 1030 981 */ 1031 982 static inline unsigned int compound_order(struct page *page) ··· 1034 985 1035 986 if (!test_bit(PG_head, &folio->flags)) 1036 987 return 0; 1037 - return folio->_folio_order; 988 + return folio->_flags_1 & 0xff; 1038 989 } 1039 990 1040 991 /** ··· 1050 1001 { 1051 1002 if (!folio_test_large(folio)) 1052 1003 return 0; 1053 - return folio->_folio_order; 1004 + return folio->_flags_1 & 0xff; 1054 1005 } 1055 1006 1056 1007 #include <linux/huge_mm.h> ··· 1121 1072 * On nommu, vmalloc/vfree wrap through kmalloc/kfree directly, so there 1122 1073 * is no special casing required. 1123 1074 */ 1124 - 1125 - #ifndef is_ioremap_addr 1126 - #define is_ioremap_addr(x) is_vmalloc_addr(x) 1127 - #endif 1128 - 1129 1075 #ifdef CONFIG_MMU 1130 1076 extern bool is_vmalloc_addr(const void *x); 1131 1077 extern int is_vmalloc_or_module_addr(const void *x); ··· 1264 1220 1265 1221 unsigned long nr_free_buffer_pages(void); 1266 1222 1267 - /* 1268 - * Compound pages have a destructor function. Provide a 1269 - * prototype for that function and accessor functions. 1270 - * These are _only_ valid on the head of a compound page. 
1271 - */ 1272 - typedef void compound_page_dtor(struct page *); 1273 - 1274 - /* Keep the enum in sync with compound_page_dtors array in mm/page_alloc.c */ 1275 - enum compound_dtor_id { 1276 - NULL_COMPOUND_DTOR, 1277 - COMPOUND_PAGE_DTOR, 1278 - #ifdef CONFIG_HUGETLB_PAGE 1279 - HUGETLB_PAGE_DTOR, 1280 - #endif 1281 - #ifdef CONFIG_TRANSPARENT_HUGEPAGE 1282 - TRANSHUGE_PAGE_DTOR, 1283 - #endif 1284 - NR_COMPOUND_DTORS, 1285 - }; 1286 - 1287 - static inline void folio_set_compound_dtor(struct folio *folio, 1288 - enum compound_dtor_id compound_dtor) 1289 - { 1290 - VM_BUG_ON_FOLIO(compound_dtor >= NR_COMPOUND_DTORS, folio); 1291 - folio->_folio_dtor = compound_dtor; 1292 - } 1293 - 1294 1223 void destroy_large_folio(struct folio *folio); 1295 1224 1296 1225 /* Returns the number of bytes in this potentially compound page. */ ··· 1299 1282 return PAGE_SIZE << thp_order(page); 1300 1283 } 1301 1284 1302 - void free_compound_page(struct page *page); 1303 - 1304 1285 #ifdef CONFIG_MMU 1305 1286 /* 1306 1287 * Do pte_mkwrite, but only if the vma says VM_WRITE. 
We do this when ··· 1314 1299 } 1315 1300 1316 1301 vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); 1317 - void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr); 1302 + void set_pte_range(struct vm_fault *vmf, struct folio *folio, 1303 + struct page *page, unsigned int nr, unsigned long addr); 1318 1304 1319 1305 vm_fault_t finish_fault(struct vm_fault *vmf); 1320 1306 vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); ··· 2022 2006 #ifdef CONFIG_64BIT 2023 2007 return folio->_folio_nr_pages; 2024 2008 #else 2025 - return 1L << folio->_folio_order; 2009 + return 1L << (folio->_flags_1 & 0xff); 2026 2010 #endif 2027 2011 } 2028 2012 ··· 2040 2024 #ifdef CONFIG_64BIT 2041 2025 return folio->_folio_nr_pages; 2042 2026 #else 2043 - return 1L << folio->_folio_order; 2027 + return 1L << (folio->_flags_1 & 0xff); 2044 2028 #endif 2045 2029 } 2046 2030 ··· 2186 2170 return page_address(&folio->page); 2187 2171 } 2188 2172 2189 - extern void *page_rmapping(struct page *page); 2190 2173 extern pgoff_t __page_file_index(struct page *page); 2191 2174 2192 2175 /* ··· 2251 2236 #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) 2252 2237 #define offset_in_thp(page, p) ((unsigned long)(p) & (thp_size(page) - 1)) 2253 2238 #define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1)) 2254 - 2255 - /* 2256 - * Flags passed to show_mem() and show_free_areas() to suppress output in 2257 - * various contexts. 2258 - */ 2259 - #define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ 2260 - 2261 - extern void __show_free_areas(unsigned int flags, nodemask_t *nodemask, int max_zone_idx); 2262 - static void __maybe_unused show_free_areas(unsigned int flags, nodemask_t *nodemask) 2263 - { 2264 - __show_free_areas(flags, nodemask, MAX_NR_ZONES - 1); 2265 - } 2266 2239 2267 2240 /* 2268 2241 * Parameter block passed down to zap_pte_range in exceptional cases. 
··· 2320 2317 zap_page_range_single(vma, vma->vm_start, 2321 2318 vma->vm_end - vma->vm_start, NULL); 2322 2319 } 2323 - void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt, 2320 + void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas, 2324 2321 struct vm_area_struct *start_vma, unsigned long start, 2325 - unsigned long end, bool mm_wr_locked); 2322 + unsigned long end, unsigned long tree_end, bool mm_wr_locked); 2326 2323 2327 2324 struct mmu_notifier_range; 2328 2325 ··· 2769 2766 } 2770 2767 #endif /* CONFIG_MMU */ 2771 2768 2769 + static inline struct ptdesc *virt_to_ptdesc(const void *x) 2770 + { 2771 + return page_ptdesc(virt_to_page(x)); 2772 + } 2773 + 2774 + static inline void *ptdesc_to_virt(const struct ptdesc *pt) 2775 + { 2776 + return page_to_virt(ptdesc_page(pt)); 2777 + } 2778 + 2779 + static inline void *ptdesc_address(const struct ptdesc *pt) 2780 + { 2781 + return folio_address(ptdesc_folio(pt)); 2782 + } 2783 + 2784 + static inline bool pagetable_is_reserved(struct ptdesc *pt) 2785 + { 2786 + return folio_test_reserved(ptdesc_folio(pt)); 2787 + } 2788 + 2789 + /** 2790 + * pagetable_alloc - Allocate pagetables 2791 + * @gfp: GFP flags 2792 + * @order: desired pagetable order 2793 + * 2794 + * pagetable_alloc allocates memory for page tables as well as a page table 2795 + * descriptor to describe that memory. 2796 + * 2797 + * Return: The ptdesc describing the allocated page tables. 2798 + */ 2799 + static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order) 2800 + { 2801 + struct page *page = alloc_pages(gfp | __GFP_COMP, order); 2802 + 2803 + return page_ptdesc(page); 2804 + } 2805 + 2806 + /** 2807 + * pagetable_free - Free pagetables 2808 + * @pt: The page table descriptor 2809 + * 2810 + * pagetable_free frees the memory of all page tables described by a page 2811 + * table descriptor and the memory for the descriptor itself. 
2812 + */ 2813 + static inline void pagetable_free(struct ptdesc *pt) 2814 + { 2815 + struct page *page = ptdesc_page(pt); 2816 + 2817 + __free_pages(page, compound_order(page)); 2818 + } 2819 + 2772 2820 #if USE_SPLIT_PTE_PTLOCKS 2773 2821 #if ALLOC_SPLIT_PTLOCKS 2774 2822 void __init ptlock_cache_init(void); 2775 - extern bool ptlock_alloc(struct page *page); 2776 - extern void ptlock_free(struct page *page); 2823 + bool ptlock_alloc(struct ptdesc *ptdesc); 2824 + void ptlock_free(struct ptdesc *ptdesc); 2777 2825 2778 - static inline spinlock_t *ptlock_ptr(struct page *page) 2826 + static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc) 2779 2827 { 2780 - return page->ptl; 2828 + return ptdesc->ptl; 2781 2829 } 2782 2830 #else /* ALLOC_SPLIT_PTLOCKS */ 2783 2831 static inline void ptlock_cache_init(void) 2784 2832 { 2785 2833 } 2786 2834 2787 - static inline bool ptlock_alloc(struct page *page) 2835 + static inline bool ptlock_alloc(struct ptdesc *ptdesc) 2788 2836 { 2789 2837 return true; 2790 2838 } 2791 2839 2792 - static inline void ptlock_free(struct page *page) 2840 + static inline void ptlock_free(struct ptdesc *ptdesc) 2793 2841 { 2794 2842 } 2795 2843 2796 - static inline spinlock_t *ptlock_ptr(struct page *page) 2844 + static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc) 2797 2845 { 2798 - return &page->ptl; 2846 + return &ptdesc->ptl; 2799 2847 } 2800 2848 #endif /* ALLOC_SPLIT_PTLOCKS */ 2801 2849 2802 2850 static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd) 2803 2851 { 2804 - return ptlock_ptr(pmd_page(*pmd)); 2852 + return ptlock_ptr(page_ptdesc(pmd_page(*pmd))); 2805 2853 } 2806 2854 2807 - static inline bool ptlock_init(struct page *page) 2855 + static inline bool ptlock_init(struct ptdesc *ptdesc) 2808 2856 { 2809 2857 /* 2810 2858 * prep_new_page() initialize page->private (and therefore page->ptl) ··· 2864 2810 * It can happen if arch try to use slab for page table allocation: 2865 2811 * slab code uses 
page->slab_cache, which share storage with page->ptl. 2866 2812 */ 2867 - VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page); 2868 - if (!ptlock_alloc(page)) 2813 + VM_BUG_ON_PAGE(*(unsigned long *)&ptdesc->ptl, ptdesc_page(ptdesc)); 2814 + if (!ptlock_alloc(ptdesc)) 2869 2815 return false; 2870 - spin_lock_init(ptlock_ptr(page)); 2816 + spin_lock_init(ptlock_ptr(ptdesc)); 2871 2817 return true; 2872 2818 } 2873 2819 ··· 2880 2826 return &mm->page_table_lock; 2881 2827 } 2882 2828 static inline void ptlock_cache_init(void) {} 2883 - static inline bool ptlock_init(struct page *page) { return true; } 2884 - static inline void ptlock_free(struct page *page) {} 2829 + static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; } 2830 + static inline void ptlock_free(struct ptdesc *ptdesc) {} 2885 2831 #endif /* USE_SPLIT_PTE_PTLOCKS */ 2886 2832 2887 - static inline bool pgtable_pte_page_ctor(struct page *page) 2833 + static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc) 2888 2834 { 2889 - if (!ptlock_init(page)) 2835 + struct folio *folio = ptdesc_folio(ptdesc); 2836 + 2837 + if (!ptlock_init(ptdesc)) 2890 2838 return false; 2891 - __SetPageTable(page); 2892 - inc_lruvec_page_state(page, NR_PAGETABLE); 2839 + __folio_set_pgtable(folio); 2840 + lruvec_stat_add_folio(folio, NR_PAGETABLE); 2893 2841 return true; 2894 2842 } 2895 2843 2896 - static inline void pgtable_pte_page_dtor(struct page *page) 2844 + static inline void pagetable_pte_dtor(struct ptdesc *ptdesc) 2897 2845 { 2898 - ptlock_free(page); 2899 - __ClearPageTable(page); 2900 - dec_lruvec_page_state(page, NR_PAGETABLE); 2846 + struct folio *folio = ptdesc_folio(ptdesc); 2847 + 2848 + ptlock_free(ptdesc); 2849 + __folio_clear_pgtable(folio); 2850 + lruvec_stat_sub_folio(folio, NR_PAGETABLE); 2901 2851 } 2902 2852 2903 2853 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp); ··· 2950 2892 return virt_to_page((void *)((unsigned long) pmd & mask)); 2951 2893 } 2952 2894 2895 
+ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd) 2896 + { 2897 + return page_ptdesc(pmd_pgtable_page(pmd)); 2898 + } 2899 + 2953 2900 static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd) 2954 2901 { 2955 - return ptlock_ptr(pmd_pgtable_page(pmd)); 2902 + return ptlock_ptr(pmd_ptdesc(pmd)); 2956 2903 } 2957 2904 2958 - static inline bool pmd_ptlock_init(struct page *page) 2905 + static inline bool pmd_ptlock_init(struct ptdesc *ptdesc) 2959 2906 { 2960 2907 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 2961 - page->pmd_huge_pte = NULL; 2908 + ptdesc->pmd_huge_pte = NULL; 2962 2909 #endif 2963 - return ptlock_init(page); 2910 + return ptlock_init(ptdesc); 2964 2911 } 2965 2912 2966 - static inline void pmd_ptlock_free(struct page *page) 2913 + static inline void pmd_ptlock_free(struct ptdesc *ptdesc) 2967 2914 { 2968 2915 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 2969 - VM_BUG_ON_PAGE(page->pmd_huge_pte, page); 2916 + VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte, ptdesc_page(ptdesc)); 2970 2917 #endif 2971 - ptlock_free(page); 2918 + ptlock_free(ptdesc); 2972 2919 } 2973 2920 2974 - #define pmd_huge_pte(mm, pmd) (pmd_pgtable_page(pmd)->pmd_huge_pte) 2921 + #define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte) 2975 2922 2976 2923 #else 2977 2924 ··· 2985 2922 return &mm->page_table_lock; 2986 2923 } 2987 2924 2988 - static inline bool pmd_ptlock_init(struct page *page) { return true; } 2989 - static inline void pmd_ptlock_free(struct page *page) {} 2925 + static inline bool pmd_ptlock_init(struct ptdesc *ptdesc) { return true; } 2926 + static inline void pmd_ptlock_free(struct ptdesc *ptdesc) {} 2990 2927 2991 2928 #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte) 2992 2929 ··· 2999 2936 return ptl; 3000 2937 } 3001 2938 3002 - static inline bool pgtable_pmd_page_ctor(struct page *page) 2939 + static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc) 3003 2940 { 3004 - if (!pmd_ptlock_init(page)) 2941 + struct folio *folio = ptdesc_folio(ptdesc); 2942 + 2943 
+ if (!pmd_ptlock_init(ptdesc)) 3005 2944 return false; 3006 - __SetPageTable(page); 3007 - inc_lruvec_page_state(page, NR_PAGETABLE); 2945 + __folio_set_pgtable(folio); 2946 + lruvec_stat_add_folio(folio, NR_PAGETABLE); 3008 2947 return true; 3009 2948 } 3010 2949 3011 - static inline void pgtable_pmd_page_dtor(struct page *page) 2950 + static inline void pagetable_pmd_dtor(struct ptdesc *ptdesc) 3012 2951 { 3013 - pmd_ptlock_free(page); 3014 - __ClearPageTable(page); 3015 - dec_lruvec_page_state(page, NR_PAGETABLE); 2952 + struct folio *folio = ptdesc_folio(ptdesc); 2953 + 2954 + pmd_ptlock_free(ptdesc); 2955 + __folio_clear_pgtable(folio); 2956 + lruvec_stat_sub_folio(folio, NR_PAGETABLE); 3016 2957 } 3017 2958 3018 2959 /* ··· 3069 3002 { 3070 3003 SetPageReserved(page); 3071 3004 adjust_managed_page_count(page, -1); 3005 + } 3006 + 3007 + static inline void free_reserved_ptdesc(struct ptdesc *pt) 3008 + { 3009 + free_reserved_page(ptdesc_page(pt)); 3072 3010 } 3073 3011 3074 3012 /* ··· 3141 3069 extern void __init mmap_init(void); 3142 3070 3143 3071 extern void __show_mem(unsigned int flags, nodemask_t *nodemask, int max_zone_idx); 3144 - static inline void show_mem(unsigned int flags, nodemask_t *nodemask) 3072 + static inline void show_mem(void) 3145 3073 { 3146 - __show_mem(flags, nodemask, MAX_NR_ZONES - 1); 3074 + __show_mem(0, NULL, MAX_NR_ZONES - 1); 3147 3075 } 3148 3076 extern long si_mem_available(void); 3149 3077 extern void si_meminfo(struct sysinfo * val); ··· 3581 3509 } 3582 3510 3583 3511 /* 3584 - * For use in fast paths after init_debug_pagealloc() has run, or when a 3585 - * false negative result is not harmful when called too early. 3512 + * For use in fast paths after mem_debugging_and_hardening_init() has run, 3513 + * or when a false negative result is not harmful when called too early. 
3586 3514 */ 3587 3515 static inline bool debug_pagealloc_enabled_static(void) 3588 3516 { ··· 3737 3665 struct vmem_altmap *altmap); 3738 3666 #endif 3739 3667 3740 - #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP 3741 - static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap, 3742 - struct dev_pagemap *pgmap) 3668 + #define VMEMMAP_RESERVE_NR 2 3669 + #ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP 3670 + static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap, 3671 + struct dev_pagemap *pgmap) 3743 3672 { 3744 - return is_power_of_2(sizeof(struct page)) && 3745 - pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap; 3673 + unsigned long nr_pages; 3674 + unsigned long nr_vmemmap_pages; 3675 + 3676 + if (!pgmap || !is_power_of_2(sizeof(struct page))) 3677 + return false; 3678 + 3679 + nr_pages = pgmap_vmemmap_nr(pgmap); 3680 + nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT); 3681 + /* 3682 + * For vmemmap optimization with DAX we need minimum 2 vmemmap 3683 + * pages. See layout diagram in Documentation/mm/vmemmap_dedup.rst 3684 + */ 3685 + return !altmap && (nr_vmemmap_pages > VMEMMAP_RESERVE_NR); 3746 3686 } 3687 + /* 3688 + * If we don't have an architecture override, use the generic rule 3689 + */ 3690 + #ifndef vmemmap_can_optimize 3691 + #define vmemmap_can_optimize __vmemmap_can_optimize 3692 + #endif 3693 + 3747 3694 #else 3748 3695 static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap, 3749 3696 struct dev_pagemap *pgmap)
+21
include/linux/mm_inline.h
··· 523 523 return atomic_read(&mm->tlb_flush_pending) > 1; 524 524 } 525 525 526 + #ifdef CONFIG_MMU 527 + /* 528 + * Computes the pte marker to copy from the given source entry into dst_vma. 529 + * If no marker should be copied, returns 0. 530 + * The caller should insert a new pte created with make_pte_marker(). 531 + */ 532 + static inline pte_marker copy_pte_marker( 533 + swp_entry_t entry, struct vm_area_struct *dst_vma) 534 + { 535 + pte_marker srcm = pte_marker_get(entry); 536 + /* Always copy error entries. */ 537 + pte_marker dstm = srcm & PTE_MARKER_POISONED; 538 + 539 + /* Only copy PTE markers if UFFD register matches. */ 540 + if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma)) 541 + dstm |= PTE_MARKER_UFFD_WP; 542 + 543 + return dstm; 544 + } 545 + #endif 546 + 526 547 /* 527 548 * If this pte is wr-protected by uffd-wp in any form, arm the special pte to 528 549 * replace a none pte. NOTE! This should only be called when *pte is already
+104 -31
include/linux/mm_types.h
··· 141 141 struct { /* Tail pages of compound page */ 142 142 unsigned long compound_head; /* Bit zero is set */ 143 143 }; 144 - struct { /* Page table pages */ 145 - unsigned long _pt_pad_1; /* compound_head */ 146 - pgtable_t pmd_huge_pte; /* protected by page->ptl */ 147 - unsigned long _pt_pad_2; /* mapping */ 148 - union { 149 - struct mm_struct *pt_mm; /* x86 pgds only */ 150 - atomic_t pt_frag_refcount; /* powerpc */ 151 - }; 152 - #if ALLOC_SPLIT_PTLOCKS 153 - spinlock_t *ptl; 154 - #else 155 - spinlock_t ptl; 156 - #endif 157 - }; 158 144 struct { /* ZONE_DEVICE pages */ 159 145 /** @pgmap: Points to the hosting device page map. */ 160 146 struct dev_pagemap *pgmap; ··· 248 262 return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page); 249 263 } 250 264 265 + /* 266 + * A swap entry has to fit into a "unsigned long", as the entry is hidden 267 + * in the "index" field of the swapper address space. 268 + */ 269 + typedef struct { 270 + unsigned long val; 271 + } swp_entry_t; 272 + 251 273 /** 252 274 * struct folio - Represents a contiguous set of bytes. 253 275 * @flags: Identical to the page flags. ··· 266 272 * @index: Offset within the file, in units of pages. For anonymous memory, 267 273 * this is the index from the beginning of the mmap. 268 274 * @private: Filesystem per-folio data (see folio_attach_private()). 269 - * Used for swp_entry_t if folio_test_swapcache(). 275 + * @swap: Used for swp_entry_t if folio_test_swapcache(). 270 276 * @_mapcount: Do not access this member directly. Use folio_mapcount() to 271 277 * find out how many times this folio is mapped by userspace. 272 278 * @_refcount: Do not access this member directly. Use folio_ref_count() 273 279 * to find how many references there are to this folio. 274 280 * @memcg_data: Memory Control Group data. 275 - * @_folio_dtor: Which destructor to use for this folio. 276 - * @_folio_order: Do not use directly, call folio_order(). 
277 281 * @_entire_mapcount: Do not use directly, call folio_entire_mapcount(). 278 282 * @_nr_pages_mapped: Do not use directly, call folio_mapcount(). 279 283 * @_pincount: Do not use directly, call folio_maybe_dma_pinned(). ··· 309 317 }; 310 318 struct address_space *mapping; 311 319 pgoff_t index; 312 - void *private; 320 + union { 321 + void *private; 322 + swp_entry_t swap; 323 + }; 313 324 atomic_t _mapcount; 314 325 atomic_t _refcount; 315 326 #ifdef CONFIG_MEMCG ··· 326 331 struct { 327 332 unsigned long _flags_1; 328 333 unsigned long _head_1; 334 + unsigned long _folio_avail; 329 335 /* public: */ 330 - unsigned char _folio_dtor; 331 - unsigned char _folio_order; 332 336 atomic_t _entire_mapcount; 333 337 atomic_t _nr_pages_mapped; 334 338 atomic_t _pincount; ··· 385 391 offsetof(struct page, pg) + 2 * sizeof(struct page)) 386 392 FOLIO_MATCH(flags, _flags_2); 387 393 FOLIO_MATCH(compound_head, _head_2); 394 + FOLIO_MATCH(flags, _flags_2a); 395 + FOLIO_MATCH(compound_head, _head_2a); 388 396 #undef FOLIO_MATCH 397 + 398 + /** 399 + * struct ptdesc - Memory descriptor for page tables. 400 + * @__page_flags: Same as page flags. Unused for page tables. 401 + * @pt_rcu_head: For freeing page table pages. 402 + * @pt_list: List of used page tables. Used for s390 and x86. 403 + * @_pt_pad_1: Padding that aliases with page's compound head. 404 + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs. 405 + * @__page_mapping: Aliases with page->mapping. Unused for page tables. 406 + * @pt_mm: Used for x86 pgds. 407 + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 only. 408 + * @_pt_pad_2: Padding to ensure proper alignment. 409 + * @ptl: Lock for the page table. 410 + * @__page_type: Same as page->page_type. Unused for page tables. 411 + * @_refcount: Same as page refcount. Used for s390 page tables. 412 + * @pt_memcg_data: Memcg data. Tracked for page tables here. 413 + * 414 + * This struct overlays struct page for now. 
Do not modify without a good 415 + * understanding of the issues. 416 + */ 417 + struct ptdesc { 418 + unsigned long __page_flags; 419 + 420 + union { 421 + struct rcu_head pt_rcu_head; 422 + struct list_head pt_list; 423 + struct { 424 + unsigned long _pt_pad_1; 425 + pgtable_t pmd_huge_pte; 426 + }; 427 + }; 428 + unsigned long __page_mapping; 429 + 430 + union { 431 + struct mm_struct *pt_mm; 432 + atomic_t pt_frag_refcount; 433 + }; 434 + 435 + union { 436 + unsigned long _pt_pad_2; 437 + #if ALLOC_SPLIT_PTLOCKS 438 + spinlock_t *ptl; 439 + #else 440 + spinlock_t ptl; 441 + #endif 442 + }; 443 + unsigned int __page_type; 444 + atomic_t _refcount; 445 + #ifdef CONFIG_MEMCG 446 + unsigned long pt_memcg_data; 447 + #endif 448 + }; 449 + 450 + #define TABLE_MATCH(pg, pt) \ 451 + static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt)) 452 + TABLE_MATCH(flags, __page_flags); 453 + TABLE_MATCH(compound_head, pt_list); 454 + TABLE_MATCH(compound_head, _pt_pad_1); 455 + TABLE_MATCH(mapping, __page_mapping); 456 + TABLE_MATCH(rcu_head, pt_rcu_head); 457 + TABLE_MATCH(page_type, __page_type); 458 + TABLE_MATCH(_refcount, _refcount); 459 + #ifdef CONFIG_MEMCG 460 + TABLE_MATCH(memcg_data, pt_memcg_data); 461 + #endif 462 + #undef TABLE_MATCH 463 + static_assert(sizeof(struct ptdesc) <= sizeof(struct page)); 464 + 465 + #define ptdesc_page(pt) (_Generic((pt), \ 466 + const struct ptdesc *: (const struct page *)(pt), \ 467 + struct ptdesc *: (struct page *)(pt))) 468 + 469 + #define ptdesc_folio(pt) (_Generic((pt), \ 470 + const struct ptdesc *: (const struct folio *)(pt), \ 471 + struct ptdesc *: (struct folio *)(pt))) 472 + 473 + #define page_ptdesc(p) (_Generic((p), \ 474 + const struct page *: (const struct ptdesc *)(p), \ 475 + struct page *: (struct ptdesc *)(p))) 389 476 390 477 /* 391 478 * Used for sizing the vmemmap region on some architectures ··· 887 812 #ifdef CONFIG_KSM 888 813 /* 889 814 * Represent how many pages of this process are involved 
in KSM 890 - * merging. 815 + * merging (not including ksm_zero_pages). 891 816 */ 892 817 unsigned long ksm_merging_pages; 893 818 /* ··· 895 820 * including merged and not merged. 896 821 */ 897 822 unsigned long ksm_rmap_items; 898 - #endif 823 + /* 824 + * Represent how many empty pages are merged with kernel zero 825 + * pages when enabling KSM use_zero_pages. 826 + */ 827 + unsigned long ksm_zero_pages; 828 + #endif /* CONFIG_KSM */ 899 829 #ifdef CONFIG_LRU_GEN 900 830 struct { 901 831 /* this mm_struct is on lru_gen_mm_list */ ··· 1185 1105 { VM_FAULT_RETRY, "RETRY" }, \ 1186 1106 { VM_FAULT_FALLBACK, "FALLBACK" }, \ 1187 1107 { VM_FAULT_DONE_COW, "DONE_COW" }, \ 1188 - { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" } 1108 + { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \ 1109 + { VM_FAULT_COMPLETED, "COMPLETED" } 1189 1110 1190 1111 struct vm_special_mapping { 1191 1112 const char *name; /* The name, e.g. "[vdso]". */ ··· 1219 1138 TLB_REMOTE_SEND_IPI, 1220 1139 NR_TLB_FLUSH_REASONS, 1221 1140 }; 1222 - 1223 - /* 1224 - * A swap entry has to fit into a "unsigned long", as the entry is hidden 1225 - * in the "index" field of the swapper address space. 1226 - */ 1227 - typedef struct { 1228 - unsigned long val; 1229 - } swp_entry_t; 1230 1141 1231 1142 /** 1232 1143 * enum fault_flag - Fault flag definitions.
+2 -2
include/linux/mm_types_task.h
··· 52 52 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH 53 53 /* 54 54 * The arch code makes the following promise: generic code can modify a 55 - * PTE, then call arch_tlbbatch_add_mm() (which internally provides all 56 - * needed barriers), then call arch_tlbbatch_flush(), and the entries 55 + * PTE, then call arch_tlbbatch_add_pending() (which internally provides 56 + * all needed barriers), then call arch_tlbbatch_flush(), and the entries 57 57 * will be flushed on all CPUs by the time that arch_tlbbatch_flush() 58 58 * returns. 59 59 */
+8 -10
include/linux/mmap_lock.h
··· 73 73 } 74 74 75 75 #ifdef CONFIG_PER_VMA_LOCK 76 + /* 77 + * Drop all currently-held per-VMA locks. 78 + * This is called from the mmap_lock implementation directly before releasing 79 + * a write-locked mmap_lock (or downgrading it to read-locked). 80 + * This should normally NOT be called manually from other places. 81 + * If you want to call this manually anyway, keep in mind that this will release 82 + * *all* VMA write locks, including ones from further up the stack. 83 + */ 76 84 static inline void vma_end_write_all(struct mm_struct *mm) 77 85 { 78 86 mmap_assert_write_locked(mm); ··· 123 115 __mmap_lock_trace_start_locking(mm, true); 124 116 ret = down_write_killable(&mm->mmap_lock); 125 117 __mmap_lock_trace_acquire_returned(mm, true, ret == 0); 126 - return ret; 127 - } 128 - 129 - static inline bool mmap_write_trylock(struct mm_struct *mm) 130 - { 131 - bool ret; 132 - 133 - __mmap_lock_trace_start_locking(mm, true); 134 - ret = down_write_trylock(&mm->mmap_lock) != 0; 135 - __mmap_lock_trace_acquire_returned(mm, true, ret); 136 118 return ret; 137 119 } 138 120
+26 -78
include/linux/mmu_notifier.h
··· 187 187 const struct mmu_notifier_range *range); 188 188 189 189 /* 190 - * invalidate_range() is either called between 191 - * invalidate_range_start() and invalidate_range_end() when the 192 - * VM has to free pages that where unmapped, but before the 193 - * pages are actually freed, or outside of _start()/_end() when 194 - * a (remote) TLB is necessary. 190 + * arch_invalidate_secondary_tlbs() is used to manage a non-CPU TLB 191 + * which shares page-tables with the CPU. The 192 + * invalidate_range_start()/end() callbacks should not be implemented as 193 + * invalidate_secondary_tlbs() already catches the points in time when 194 + * an external TLB needs to be flushed. 195 195 * 196 - * If invalidate_range() is used to manage a non-CPU TLB with 197 - * shared page-tables, it not necessary to implement the 198 - * invalidate_range_start()/end() notifiers, as 199 - * invalidate_range() already catches the points in time when an 200 - * external TLB range needs to be flushed. For more in depth 201 - * discussion on this see Documentation/mm/mmu_notifier.rst 196 + * This requires arch_invalidate_secondary_tlbs() to be called while 197 + * holding the ptl spin-lock and therefore this callback is not allowed 198 + * to sleep. 202 199 * 203 - * Note that this function might be called with just a sub-range 204 - * of what was passed to invalidate_range_start()/end(), if 205 - * called between those functions. 200 + * This is called by architecture code whenever invalidating a TLB 201 + * entry. It is assumed that any secondary TLB has the same rules for 202 + * when invalidations are required. If this is not the case architecture 203 + * code will need to call this explicitly when required for secondary 204 + * TLB invalidation. 
206 205 */ 207 - void (*invalidate_range)(struct mmu_notifier *subscription, 208 - struct mm_struct *mm, 209 - unsigned long start, 210 - unsigned long end); 206 + void (*arch_invalidate_secondary_tlbs)( 207 + struct mmu_notifier *subscription, 208 + struct mm_struct *mm, 209 + unsigned long start, 210 + unsigned long end); 211 211 212 212 /* 213 213 * These callbacks are used with the get/put interface to manage the ··· 395 395 extern void __mmu_notifier_change_pte(struct mm_struct *mm, 396 396 unsigned long address, pte_t pte); 397 397 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r); 398 - extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r, 399 - bool only_end); 400 - extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, 401 - unsigned long start, unsigned long end); 398 + extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r); 399 + extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm, 400 + unsigned long start, unsigned long end); 402 401 extern bool 403 402 mmu_notifier_range_update_to_read_only(const struct mmu_notifier_range *range); 404 403 ··· 480 481 might_sleep(); 481 482 482 483 if (mm_has_notifiers(range->mm)) 483 - __mmu_notifier_invalidate_range_end(range, false); 484 + __mmu_notifier_invalidate_range_end(range); 484 485 } 485 486 486 - static inline void 487 - mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range) 488 - { 489 - if (mm_has_notifiers(range->mm)) 490 - __mmu_notifier_invalidate_range_end(range, true); 491 - } 492 - 493 - static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, 494 - unsigned long start, unsigned long end) 487 + static inline void mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm, 488 + unsigned long start, unsigned long end) 495 489 { 496 490 if (mm_has_notifiers(mm)) 497 - __mmu_notifier_invalidate_range(mm, start, end); 491 + 
__mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end); 498 492 } 499 493 500 494 static inline void mmu_notifier_subscriptions_init(struct mm_struct *mm) ··· 572 580 __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \ 573 581 ___address + PMD_SIZE); \ 574 582 __young; \ 575 - }) 576 - 577 - #define ptep_clear_flush_notify(__vma, __address, __ptep) \ 578 - ({ \ 579 - unsigned long ___addr = __address & PAGE_MASK; \ 580 - struct mm_struct *___mm = (__vma)->vm_mm; \ 581 - pte_t ___pte; \ 582 - \ 583 - ___pte = ptep_clear_flush(__vma, __address, __ptep); \ 584 - mmu_notifier_invalidate_range(___mm, ___addr, \ 585 - ___addr + PAGE_SIZE); \ 586 - \ 587 - ___pte; \ 588 - }) 589 - 590 - #define pmdp_huge_clear_flush_notify(__vma, __haddr, __pmd) \ 591 - ({ \ 592 - unsigned long ___haddr = __haddr & HPAGE_PMD_MASK; \ 593 - struct mm_struct *___mm = (__vma)->vm_mm; \ 594 - pmd_t ___pmd; \ 595 - \ 596 - ___pmd = pmdp_huge_clear_flush(__vma, __haddr, __pmd); \ 597 - mmu_notifier_invalidate_range(___mm, ___haddr, \ 598 - ___haddr + HPAGE_PMD_SIZE); \ 599 - \ 600 - ___pmd; \ 601 - }) 602 - 603 - #define pudp_huge_clear_flush_notify(__vma, __haddr, __pud) \ 604 - ({ \ 605 - unsigned long ___haddr = __haddr & HPAGE_PUD_MASK; \ 606 - struct mm_struct *___mm = (__vma)->vm_mm; \ 607 - pud_t ___pud; \ 608 - \ 609 - ___pud = pudp_huge_clear_flush(__vma, __haddr, __pud); \ 610 - mmu_notifier_invalidate_range(___mm, ___haddr, \ 611 - ___haddr + HPAGE_PUD_SIZE); \ 612 - \ 613 - ___pud; \ 614 583 }) 615 584 616 585 /* ··· 664 711 { 665 712 } 666 713 667 - static inline void 668 - mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range) 669 - { 670 - } 671 - 672 - static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, 714 + static inline void mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm, 673 715 unsigned long start, unsigned long end) 674 716 { 675 717 }
-1
include/linux/mmzone.h
··· 676 676 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) 677 677 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) 678 678 679 - /* Fields and list protected by pagesets local_lock in page_alloc.c */ 680 679 struct per_cpu_pages { 681 680 spinlock_t lock; /* Protects lists field */ 682 681 int count; /* number of pages in the list */
-17
include/linux/net_mm.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0-or-later */ 2 - #ifdef CONFIG_MMU 3 - 4 - #ifdef CONFIG_INET 5 - extern const struct vm_operations_struct tcp_vm_ops; 6 - static inline bool vma_is_tcp(const struct vm_area_struct *vma) 7 - { 8 - return vma->vm_ops == &tcp_vm_ops; 9 - } 10 - #else 11 - static inline bool vma_is_tcp(const struct vm_area_struct *vma) 12 - { 13 - return false; 14 - } 15 - #endif /* CONFIG_INET*/ 16 - 17 - #endif /* CONFIG_MMU */
+65 -25
include/linux/page-flags.h
··· 99 99 */ 100 100 enum pageflags { 101 101 PG_locked, /* Page is locked. Don't touch. */ 102 + PG_writeback, /* Page is under writeback */ 102 103 PG_referenced, 103 104 PG_uptodate, 104 105 PG_dirty, 105 106 PG_lru, 107 + PG_head, /* Must be in bit 6 */ 108 + PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */ 106 109 PG_active, 107 110 PG_workingset, 108 - PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */ 109 111 PG_error, 110 112 PG_slab, 111 113 PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ ··· 115 113 PG_reserved, 116 114 PG_private, /* If pagecache, has fs-private data */ 117 115 PG_private_2, /* If pagecache, has fs aux data */ 118 - PG_writeback, /* Page is under writeback */ 119 - PG_head, /* A head page */ 120 116 PG_mappedtodisk, /* Has blocks allocated on-disk */ 121 117 PG_reclaim, /* To be reclaimed asap */ 122 118 PG_swapbacked, /* Page is backed by RAM/swap */ ··· 171 171 /* Remapped by swiotlb-xen. */ 172 172 PG_xen_remapped = PG_owner_priv_1, 173 173 174 - #ifdef CONFIG_MEMORY_FAILURE 175 - /* 176 - * Compound pages. Stored in first tail page's flags. 177 - * Indicates that at least one subpage is hwpoisoned in the 178 - * THP. 179 - */ 180 - PG_has_hwpoisoned = PG_error, 181 - #endif 182 - 183 174 /* non-lru isolated movable page */ 184 175 PG_isolated = PG_reclaim, 185 176 ··· 181 190 /* For self-hosted memmap pages */ 182 191 PG_vmemmap_self_hosted = PG_owner_priv_1, 183 192 #endif 193 + 194 + /* 195 + * Flags only valid for compound pages. Stored in first tail page's 196 + * flags word. Cannot use the first 8 flags or any flag marked as 197 + * PF_ANY. 
198 + */ 199 + 200 + /* At least one page in this folio has the hwpoison flag set */ 201 + PG_has_hwpoisoned = PG_error, 202 + PG_hugetlb = PG_active, 203 + PG_large_rmappable = PG_workingset, /* anon or file-backed */ 184 204 }; 185 205 186 206 #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1) ··· 808 806 BUG_ON(!PageHead(page)); 809 807 ClearPageHead(page); 810 808 } 809 + PAGEFLAG(LargeRmappable, large_rmappable, PF_SECOND) 810 + #else 811 + TESTPAGEFLAG_FALSE(LargeRmappable, large_rmappable) 811 812 #endif 812 813 813 814 #define PG_head_mask ((1UL << PG_head)) 814 815 815 816 #ifdef CONFIG_HUGETLB_PAGE 816 817 int PageHuge(struct page *page); 817 - bool folio_test_hugetlb(struct folio *folio); 818 + SETPAGEFLAG(HugeTLB, hugetlb, PF_SECOND) 819 + CLEARPAGEFLAG(HugeTLB, hugetlb, PF_SECOND) 820 + 821 + /** 822 + * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs 823 + * @folio: The folio to test. 824 + * 825 + * Context: Any context. Caller should have a reference on the folio to 826 + * prevent it from being turned into a tail page. 827 + * Return: True for hugetlbfs folios, false for anon folios or folios 828 + * belonging to other filesystems. 
829 + */ 830 + static inline bool folio_test_hugetlb(struct folio *folio) 831 + { 832 + return folio_test_large(folio) && 833 + test_bit(PG_hugetlb, folio_flags(folio, 1)); 834 + } 818 835 #else 819 836 TESTPAGEFLAG_FALSE(Huge, hugetlb) 820 837 #endif ··· 851 830 { 852 831 VM_BUG_ON_PAGE(PageTail(page), page); 853 832 return PageHead(page); 854 - } 855 - 856 - static inline bool folio_test_transhuge(struct folio *folio) 857 - { 858 - return folio_test_head(folio); 859 833 } 860 834 861 835 /* ··· 924 908 925 909 #define PageType(page, flag) \ 926 910 ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) 911 + #define folio_test_type(folio, flag) \ 912 + ((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) 927 913 928 914 static inline int page_type_has_type(unsigned int page_type) 929 915 { ··· 937 919 return page_type_has_type(page->page_type); 938 920 } 939 921 940 - #define PAGE_TYPE_OPS(uname, lname) \ 941 - static __always_inline int Page##uname(struct page *page) \ 922 + #define PAGE_TYPE_OPS(uname, lname, fname) \ 923 + static __always_inline int Page##uname(const struct page *page) \ 942 924 { \ 943 925 return PageType(page, PG_##lname); \ 926 + } \ 927 + static __always_inline int folio_test_##fname(const struct folio *folio)\ 928 + { \ 929 + return folio_test_type(folio, PG_##lname); \ 944 930 } \ 945 931 static __always_inline void __SetPage##uname(struct page *page) \ 946 932 { \ 947 933 VM_BUG_ON_PAGE(!PageType(page, 0), page); \ 948 934 page->page_type &= ~PG_##lname; \ 949 935 } \ 936 + static __always_inline void __folio_set_##fname(struct folio *folio) \ 937 + { \ 938 + VM_BUG_ON_FOLIO(!folio_test_type(folio, 0), folio); \ 939 + folio->page.page_type &= ~PG_##lname; \ 940 + } \ 950 941 static __always_inline void __ClearPage##uname(struct page *page) \ 951 942 { \ 952 943 VM_BUG_ON_PAGE(!Page##uname(page), page); \ 953 944 page->page_type |= PG_##lname; \ 954 - } 945 + } \ 946 + static __always_inline void 
__folio_clear_##fname(struct folio *folio) \ 947 + { \ 948 + VM_BUG_ON_FOLIO(!folio_test_##fname(folio), folio); \ 949 + folio->page.page_type |= PG_##lname; \ 950 + } \ 955 951 956 952 /* 957 953 * PageBuddy() indicates that the page is free and in the buddy system 958 954 * (see mm/page_alloc.c). 959 955 */ 960 - PAGE_TYPE_OPS(Buddy, buddy) 956 + PAGE_TYPE_OPS(Buddy, buddy, buddy) 961 957 962 958 /* 963 959 * PageOffline() indicates that the page is logically offline although the ··· 995 963 * pages should check PageOffline() and synchronize with such drivers using 996 964 * page_offline_freeze()/page_offline_thaw(). 997 965 */ 998 - PAGE_TYPE_OPS(Offline, offline) 966 + PAGE_TYPE_OPS(Offline, offline, offline) 999 967 1000 968 extern void page_offline_freeze(void); 1001 969 extern void page_offline_thaw(void); ··· 1005 973 /* 1006 974 * Marks pages in use as page tables. 1007 975 */ 1008 - PAGE_TYPE_OPS(Table, table) 976 + PAGE_TYPE_OPS(Table, table, pgtable) 1009 977 1010 978 /* 1011 979 * Marks guardpages used with debug_pagealloc. 1012 980 */ 1013 - PAGE_TYPE_OPS(Guard, guard) 981 + PAGE_TYPE_OPS(Guard, guard, guard) 1014 982 1015 983 extern bool is_free_buddy_page(struct page *page); 1016 984 ··· 1071 1039 */ 1072 1040 #define PAGE_FLAGS_CHECK_AT_PREP \ 1073 1041 ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK) 1042 + 1043 + /* 1044 + * Flags stored in the second page of a compound page. They may overlap 1045 + * the CHECK_AT_FREE flags above, so need to be cleared. 1046 + */ 1047 + #define PAGE_FLAGS_SECOND \ 1048 + (0xffUL /* order */ | 1UL << PG_has_hwpoisoned | \ 1049 + 1UL << PG_hugetlb | 1UL << PG_large_rmappable) 1074 1050 1075 1051 #define PAGE_FLAGS_PRIVATE \ 1076 1052 (1UL << PG_private | 1UL << PG_private_2)
+7 -2
include/linux/page_ext.h
··· 8 8 9 9 struct pglist_data; 10 10 11 + #ifdef CONFIG_PAGE_EXTENSION 11 12 /** 12 13 * struct page_ext_operations - per page_ext client operations 13 14 * @offset: Offset to the client's data within page_ext. Offset is returned to ··· 29 28 void (*init)(void); 30 29 bool need_shared_flags; 31 30 }; 32 - 33 - #ifdef CONFIG_PAGE_EXTENSION 34 31 35 32 /* 36 33 * The page_ext_flags users must set need_shared_flags to true. ··· 80 81 81 82 extern struct page_ext *page_ext_get(struct page *page); 82 83 extern void page_ext_put(struct page_ext *page_ext); 84 + 85 + static inline void *page_ext_data(struct page_ext *page_ext, 86 + struct page_ext_operations *ops) 87 + { 88 + return (void *)(page_ext) + ops->offset; 89 + } 83 90 84 91 static inline struct page_ext *page_ext_next(struct page_ext *curr) 85 92 {
-5
include/linux/page_idle.h
··· 144 144 { 145 145 folio_set_idle(page_folio(page)); 146 146 } 147 - 148 - static inline void clear_page_idle(struct page *page) 149 - { 150 - folio_clear_idle(page_folio(page)); 151 - } 152 147 #endif /* _LINUX_MM_PAGE_IDLE_H */
+27 -44
include/linux/page_table_check.h
··· 14 14 extern struct page_ext_operations page_table_check_ops; 15 15 16 16 void __page_table_check_zero(struct page *page, unsigned int order); 17 - void __page_table_check_pte_clear(struct mm_struct *mm, unsigned long addr, 18 - pte_t pte); 19 - void __page_table_check_pmd_clear(struct mm_struct *mm, unsigned long addr, 20 - pmd_t pmd); 21 - void __page_table_check_pud_clear(struct mm_struct *mm, unsigned long addr, 22 - pud_t pud); 23 - void __page_table_check_pte_set(struct mm_struct *mm, unsigned long addr, 24 - pte_t *ptep, pte_t pte); 25 - void __page_table_check_pmd_set(struct mm_struct *mm, unsigned long addr, 26 - pmd_t *pmdp, pmd_t pmd); 27 - void __page_table_check_pud_set(struct mm_struct *mm, unsigned long addr, 28 - pud_t *pudp, pud_t pud); 17 + void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte); 18 + void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd); 19 + void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud); 20 + void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte, 21 + unsigned int nr); 22 + void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd); 23 + void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud); 29 24 void __page_table_check_pte_clear_range(struct mm_struct *mm, 30 25 unsigned long addr, 31 26 pmd_t pmd); ··· 41 46 __page_table_check_zero(page, order); 42 47 } 43 48 44 - static inline void page_table_check_pte_clear(struct mm_struct *mm, 45 - unsigned long addr, pte_t pte) 49 + static inline void page_table_check_pte_clear(struct mm_struct *mm, pte_t pte) 46 50 { 47 51 if (static_branch_likely(&page_table_check_disabled)) 48 52 return; 49 53 50 - __page_table_check_pte_clear(mm, addr, pte); 54 + __page_table_check_pte_clear(mm, pte); 51 55 } 52 56 53 - static inline void page_table_check_pmd_clear(struct mm_struct *mm, 54 - unsigned long addr, pmd_t pmd) 57 + static inline void page_table_check_pmd_clear(struct 
mm_struct *mm, pmd_t pmd) 55 58 { 56 59 if (static_branch_likely(&page_table_check_disabled)) 57 60 return; 58 61 59 - __page_table_check_pmd_clear(mm, addr, pmd); 62 + __page_table_check_pmd_clear(mm, pmd); 60 63 } 61 64 62 - static inline void page_table_check_pud_clear(struct mm_struct *mm, 63 - unsigned long addr, pud_t pud) 65 + static inline void page_table_check_pud_clear(struct mm_struct *mm, pud_t pud) 64 66 { 65 67 if (static_branch_likely(&page_table_check_disabled)) 66 68 return; 67 69 68 - __page_table_check_pud_clear(mm, addr, pud); 70 + __page_table_check_pud_clear(mm, pud); 69 71 } 70 72 71 - static inline void page_table_check_pte_set(struct mm_struct *mm, 72 - unsigned long addr, pte_t *ptep, 73 - pte_t pte) 73 + static inline void page_table_check_ptes_set(struct mm_struct *mm, 74 + pte_t *ptep, pte_t pte, unsigned int nr) 74 75 { 75 76 if (static_branch_likely(&page_table_check_disabled)) 76 77 return; 77 78 78 - __page_table_check_pte_set(mm, addr, ptep, pte); 79 + __page_table_check_ptes_set(mm, ptep, pte, nr); 79 80 } 80 81 81 - static inline void page_table_check_pmd_set(struct mm_struct *mm, 82 - unsigned long addr, pmd_t *pmdp, 82 + static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, 83 83 pmd_t pmd) 84 84 { 85 85 if (static_branch_likely(&page_table_check_disabled)) 86 86 return; 87 87 88 - __page_table_check_pmd_set(mm, addr, pmdp, pmd); 88 + __page_table_check_pmd_set(mm, pmdp, pmd); 89 89 } 90 90 91 - static inline void page_table_check_pud_set(struct mm_struct *mm, 92 - unsigned long addr, pud_t *pudp, 91 + static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, 93 92 pud_t pud) 94 93 { 95 94 if (static_branch_likely(&page_table_check_disabled)) 96 95 return; 97 96 98 - __page_table_check_pud_set(mm, addr, pudp, pud); 97 + __page_table_check_pud_set(mm, pudp, pud); 99 98 } 100 99 101 100 static inline void page_table_check_pte_clear_range(struct mm_struct *mm, ··· 112 123 { 113 124 } 
114 125 115 - static inline void page_table_check_pte_clear(struct mm_struct *mm, 116 - unsigned long addr, pte_t pte) 126 + static inline void page_table_check_pte_clear(struct mm_struct *mm, pte_t pte) 117 127 { 118 128 } 119 129 120 - static inline void page_table_check_pmd_clear(struct mm_struct *mm, 121 - unsigned long addr, pmd_t pmd) 130 + static inline void page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd) 122 131 { 123 132 } 124 133 125 - static inline void page_table_check_pud_clear(struct mm_struct *mm, 126 - unsigned long addr, pud_t pud) 134 + static inline void page_table_check_pud_clear(struct mm_struct *mm, pud_t pud) 127 135 { 128 136 } 129 137 130 - static inline void page_table_check_pte_set(struct mm_struct *mm, 131 - unsigned long addr, pte_t *ptep, 132 - pte_t pte) 138 + static inline void page_table_check_ptes_set(struct mm_struct *mm, 139 + pte_t *ptep, pte_t pte, unsigned int nr) 133 140 { 134 141 } 135 142 136 - static inline void page_table_check_pmd_set(struct mm_struct *mm, 137 - unsigned long addr, pmd_t *pmdp, 143 + static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, 138 144 pmd_t pmd) 139 145 { 140 146 } 141 147 142 - static inline void page_table_check_pud_set(struct mm_struct *mm, 143 - unsigned long addr, pud_t *pudp, 148 + static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, 144 149 pud_t pud) 145 150 { 146 151 }
+42 -22
include/linux/pagemap.h
··· 203 203 /* writeback related tags are not used */ 204 204 AS_NO_WRITEBACK_TAGS = 5, 205 205 AS_LARGE_FOLIO_SUPPORT = 6, 206 + AS_RELEASE_ALWAYS, /* Call ->release_folio(), even if no private data */ 206 207 }; 207 208 208 209 /** ··· 272 271 static inline int mapping_use_writeback_tags(struct address_space *mapping) 273 272 { 274 273 return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags); 274 + } 275 + 276 + static inline bool mapping_release_always(const struct address_space *mapping) 277 + { 278 + return test_bit(AS_RELEASE_ALWAYS, &mapping->flags); 279 + } 280 + 281 + static inline void mapping_set_release_always(struct address_space *mapping) 282 + { 283 + set_bit(AS_RELEASE_ALWAYS, &mapping->flags); 284 + } 285 + 286 + static inline void mapping_clear_release_always(struct address_space *mapping) 287 + { 288 + clear_bit(AS_RELEASE_ALWAYS, &mapping->flags); 275 289 } 276 290 277 291 static inline gfp_t mapping_gfp_mask(struct address_space * mapping) ··· 389 373 return folio->mapping; 390 374 } 391 375 376 + /** 377 + * folio_flush_mapping - Find the file mapping this folio belongs to. 378 + * @folio: The folio. 379 + * 380 + * For folios which are in the page cache, return the mapping that this 381 + * page belongs to. Anonymous folios return NULL, even if they're in 382 + * the swap cache. Other kinds of folio also return NULL. 383 + * 384 + * This is ONLY used by architecture cache flushing code. If you aren't 385 + * writing cache flushing code, you want either folio_mapping() or 386 + * folio_file_mapping(). 
387 + */ 388 + static inline struct address_space *folio_flush_mapping(struct folio *folio) 389 + { 390 + if (unlikely(folio_test_swapcache(folio))) 391 + return NULL; 392 + 393 + return folio_mapping(folio); 394 + } 395 + 392 396 static inline struct address_space *page_file_mapping(struct page *page) 393 397 { 394 398 return folio_file_mapping(page_folio(page)); 395 - } 396 - 397 - /* 398 - * For file cache pages, return the address_space, otherwise return NULL 399 - */ 400 - static inline struct address_space *page_mapping_file(struct page *page) 401 - { 402 - struct folio *folio = page_folio(page); 403 - 404 - if (unlikely(folio_test_swapcache(folio))) 405 - return NULL; 406 - return folio_mapping(folio); 407 399 } 408 400 409 401 /** ··· 984 960 985 961 void __folio_lock(struct folio *folio); 986 962 int __folio_lock_killable(struct folio *folio); 987 - bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm, 988 - unsigned int flags); 963 + vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf); 989 964 void unlock_page(struct page *page); 990 965 void folio_unlock(struct folio *folio); 991 966 ··· 1088 1065 * Return value and mmap_lock implications depend on flags; see 1089 1066 * __folio_lock_or_retry(). 
1090 1067 */ 1091 - static inline bool folio_lock_or_retry(struct folio *folio, 1092 - struct mm_struct *mm, unsigned int flags) 1068 + static inline vm_fault_t folio_lock_or_retry(struct folio *folio, 1069 + struct vm_fault *vmf) 1093 1070 { 1094 1071 might_sleep(); 1095 - return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags); 1072 + if (!folio_trylock(folio)) 1073 + return __folio_lock_or_retry(folio, vmf); 1074 + return 0; 1096 1075 } 1097 1076 1098 1077 /* ··· 1127 1102 static inline void wait_on_page_locked(struct page *page) 1128 1103 { 1129 1104 folio_wait_locked(page_folio(page)); 1130 - } 1131 - 1132 - static inline int wait_on_page_locked_killable(struct page *page) 1133 - { 1134 - return folio_wait_locked_killable(page_folio(page)); 1135 1105 } 1136 1106 1137 1107 void wait_on_page_writeback(struct page *page);
+88 -35
include/linux/pgtable.h
··· 5 5 #include <linux/pfn.h> 6 6 #include <asm/pgtable.h> 7 7 8 + #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) 9 + #define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT) 10 + 8 11 #ifndef __ASSEMBLY__ 9 12 #ifdef CONFIG_MMU 10 13 ··· 66 63 { 67 64 return (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1); 68 65 } 69 - #define pte_index pte_index 70 66 71 67 #ifndef pmd_index 72 68 static inline unsigned long pmd_index(unsigned long address) ··· 101 99 ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address))) 102 100 #define pte_unmap(pte) do { \ 103 101 kunmap_local((pte)); \ 104 - /* rcu_read_unlock() to be added later */ \ 102 + rcu_read_unlock(); \ 105 103 } while (0) 106 104 #else 107 105 static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address) ··· 110 108 } 111 109 static inline void pte_unmap(pte_t *pte) 112 110 { 113 - /* rcu_read_unlock() to be added later */ 111 + rcu_read_unlock(); 114 112 } 115 113 #endif 114 + 115 + void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable); 116 116 117 117 /* Find an entry in the second-level page table.. */ 118 118 #ifndef pmd_offset ··· 183 179 return 0; 184 180 } 185 181 #endif 182 + 183 + /* 184 + * A facility to provide lazy MMU batching. This allows PTE updates and 185 + * page invalidations to be delayed until a call to leave lazy MMU mode 186 + * is issued. Some architectures may benefit from doing this, and it is 187 + * beneficial for both shadow and direct mode hypervisors, which may batch 188 + * the PTE updates which happen during this window. Note that using this 189 + * interface requires that read hazards be removed from the code. A read 190 + * hazard could result in the direct mode hypervisor case, since the actual 191 + * write to the page tables may not yet have taken place, so reads though 192 + * a raw PTE pointer after it has been modified are not guaranteed to be 193 + * up to date. 
This mode can only be entered and left under the protection of 194 + * the page table locks for all page tables which may be modified. In the UP 195 + * case, this is required so that preemption is disabled, and in the SMP case, 196 + * it must synchronize the delayed page table writes properly on other CPUs. 197 + */ 198 + #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE 199 + #define arch_enter_lazy_mmu_mode() do {} while (0) 200 + #define arch_leave_lazy_mmu_mode() do {} while (0) 201 + #define arch_flush_lazy_mmu_mode() do {} while (0) 202 + #endif 203 + 204 + #ifndef set_ptes 205 + /** 206 + * set_ptes - Map consecutive pages to a contiguous range of addresses. 207 + * @mm: Address space to map the pages into. 208 + * @addr: Address to map the first page at. 209 + * @ptep: Page table pointer for the first entry. 210 + * @pte: Page table entry for the first page. 211 + * @nr: Number of pages to map. 212 + * 213 + * May be overridden by the architecture, or the architecture can define 214 + * set_pte() and PFN_PTE_SHIFT. 215 + * 216 + * Context: The caller holds the page table lock. The pages all belong 217 + * to the same folio. The PTEs are all in the same PMD. 
218 + */ 219 + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, 220 + pte_t *ptep, pte_t pte, unsigned int nr) 221 + { 222 + page_table_check_ptes_set(mm, ptep, pte, nr); 223 + 224 + arch_enter_lazy_mmu_mode(); 225 + for (;;) { 226 + set_pte(ptep, pte); 227 + if (--nr == 0) 228 + break; 229 + ptep++; 230 + pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT)); 231 + } 232 + arch_leave_lazy_mmu_mode(); 233 + } 234 + #endif 235 + #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1) 186 236 187 237 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS 188 238 extern int ptep_set_access_flags(struct vm_area_struct *vma, ··· 378 320 { 379 321 pte_t pte = ptep_get(ptep); 380 322 pte_clear(mm, address, ptep); 381 - page_table_check_pte_clear(mm, address, pte); 323 + page_table_check_pte_clear(mm, pte); 382 324 return pte; 383 325 } 384 326 #endif ··· 448 390 return pmd; 449 391 } 450 392 #define pmdp_get_lockless pmdp_get_lockless 393 + #define pmdp_get_lockless_sync() tlb_remove_table_sync_one() 451 394 #endif /* CONFIG_PGTABLE_LEVELS > 2 */ 452 395 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */ 453 396 ··· 467 408 { 468 409 return pmdp_get(pmdp); 469 410 } 411 + static inline void pmdp_get_lockless_sync(void) 412 + { 413 + } 470 414 #endif 471 415 472 416 #ifdef CONFIG_TRANSPARENT_HUGEPAGE ··· 481 419 pmd_t pmd = *pmdp; 482 420 483 421 pmd_clear(pmdp); 484 - page_table_check_pmd_clear(mm, address, pmd); 422 + page_table_check_pmd_clear(mm, pmd); 485 423 486 424 return pmd; 487 425 } ··· 494 432 pud_t pud = *pudp; 495 433 496 434 pud_clear(pudp); 497 - page_table_check_pud_clear(mm, address, pud); 435 + page_table_check_pud_clear(mm, pud); 498 436 499 437 return pud; 500 438 } ··· 512 450 #endif 513 451 514 452 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL 515 - static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm, 453 + static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma, 516 454 unsigned long address, 
pud_t *pudp, 517 455 int full) 518 456 { 519 - return pudp_huge_get_and_clear(mm, address, pudp); 457 + return pudp_huge_get_and_clear(vma->vm_mm, address, pudp); 520 458 } 521 459 #endif 522 460 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ ··· 620 558 #endif 621 559 #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT 622 560 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 561 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 623 562 static inline void pudp_set_wrprotect(struct mm_struct *mm, 624 563 unsigned long address, pud_t *pudp) 625 564 { ··· 634 571 { 635 572 BUILD_BUG(); 636 573 } 574 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 637 575 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 638 576 #endif 639 577 ··· 757 693 { 758 694 return pmd_val(pmd_a) == pmd_val(pmd_b); 759 695 } 696 + #endif 760 697 698 + #ifndef pud_same 761 699 static inline int pud_same(pud_t pud_a, pud_t pud_b) 762 700 { 763 701 return pud_val(pud_a) == pud_val(pud_b); 764 702 } 703 + #define pud_same pud_same 765 704 #endif 766 705 767 706 #ifndef __HAVE_ARCH_P4D_SAME ··· 1108 1041 #endif 1109 1042 1110 1043 /* 1111 - * A facility to provide lazy MMU batching. This allows PTE updates and 1112 - * page invalidations to be delayed until a call to leave lazy MMU mode 1113 - * is issued. Some architectures may benefit from doing this, and it is 1114 - * beneficial for both shadow and direct mode hypervisors, which may batch 1115 - * the PTE updates which happen during this window. Note that using this 1116 - * interface requires that read hazards be removed from the code. A read 1117 - * hazard could result in the direct mode hypervisor case, since the actual 1118 - * write to the page tables may not yet have taken place, so reads though 1119 - * a raw PTE pointer after it has been modified are not guaranteed to be 1120 - * up to date. This mode can only be entered and left under the protection of 1121 - * the page table locks for all page tables which may be modified. 
In the UP 1122 - * case, this is required so that preemption is disabled, and in the SMP case, 1123 - * it must synchronize the delayed page table writes properly on other CPUs. 1124 - */ 1125 - #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE 1126 - #define arch_enter_lazy_mmu_mode() do {} while (0) 1127 - #define arch_leave_lazy_mmu_mode() do {} while (0) 1128 - #define arch_flush_lazy_mmu_mode() do {} while (0) 1129 - #endif 1130 - 1131 - /* 1132 1044 * A facility to provide batching of the reload of page tables and 1133 1045 * other process state with the actual context switch code for 1134 1046 * paravirtualized guests. By convention, only one of the batched ··· 1368 1322 1369 1323 #ifndef CONFIG_NUMA_BALANCING 1370 1324 /* 1371 - * Technically a PTE can be PROTNONE even when not doing NUMA balancing but 1372 - * the only case the kernel cares is for NUMA balancing and is only ever set 1373 - * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked 1374 - * _PAGE_PROTNONE so by default, implement the helper as "always no". It 1375 - * is the responsibility of the caller to distinguish between PROT_NONE 1376 - * protections and NUMA hinting fault protections. 1325 + * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is 1326 + * perfectly valid to indicate "no" in that case, which is why our default 1327 + * implementation defaults to "always no". 1328 + * 1329 + * In an accessible VMA, however, pte_protnone() reliably indicates PROT_NONE 1330 + * page protection due to NUMA hinting. NUMA hinting faults only apply in 1331 + * accessible VMAs. 1332 + * 1333 + * So, to reliably identify PROT_NONE PTEs that require a NUMA hinting fault, 1334 + * looking at the VMA accessibility is sufficient. 
1377 1335 */ 1378 1336 static inline int pte_protnone(pte_t pte) 1379 1337 { ··· 1549 1499 #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE) 1550 1500 #endif 1551 1501 1502 + #ifndef has_transparent_pud_hugepage 1503 + #define has_transparent_pud_hugepage() IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) 1504 + #endif 1552 1505 /* 1553 1506 * On some architectures it depends on the mm if the p4d/pud or pmd 1554 1507 * layer of the page table hierarchy is folded or not.
+26 -13
include/linux/pid_namespace.h
··· 17 17 struct fs_pin; 18 18 19 19 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) 20 - /* 21 - * sysctl for vm.memfd_noexec 22 - * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL 23 - * acts like MFD_EXEC was set. 24 - * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL 25 - * acts like MFD_NOEXEC_SEAL was set. 26 - * 2: memfd_create() without MFD_NOEXEC_SEAL will be 27 - * rejected. 28 - */ 29 - #define MEMFD_NOEXEC_SCOPE_EXEC 0 30 - #define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 31 - #define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 20 + /* modes for vm.memfd_noexec sysctl */ 21 + #define MEMFD_NOEXEC_SCOPE_EXEC 0 /* MFD_EXEC implied if unset */ 22 + #define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 /* MFD_NOEXEC_SEAL implied if unset */ 23 + #define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 /* same as 1, except MFD_EXEC rejected */ 32 24 #endif 33 25 34 26 struct pid_namespace { ··· 39 47 int reboot; /* group exit code if this pidns was rebooted */ 40 48 struct ns_common ns; 41 49 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) 42 - /* sysctl for vm.memfd_noexec */ 43 50 int memfd_noexec_scope; 44 51 #endif 45 52 } __randomize_layout; ··· 55 64 return ns; 56 65 } 57 66 67 + #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) 68 + static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns) 69 + { 70 + int scope = MEMFD_NOEXEC_SCOPE_EXEC; 71 + 72 + for (; ns; ns = ns->parent) 73 + scope = max(scope, READ_ONCE(ns->memfd_noexec_scope)); 74 + 75 + return scope; 76 + } 77 + #else 78 + static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns) 79 + { 80 + return 0; 81 + } 82 + #endif 83 + 58 84 extern struct pid_namespace *copy_pid_ns(unsigned long flags, 59 85 struct user_namespace *user_ns, struct pid_namespace *ns); 60 86 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns); ··· 84 76 static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns) 85 77 { 86 78 return ns; 79 + } 80 + 81 + static inline int 
pidns_memfd_noexec_scope(struct pid_namespace *ns) 82 + { 83 + return 0; 87 84 } 88 85 89 86 static inline struct pid_namespace *copy_pid_ns(unsigned long flags,
+2
include/linux/rmap.h
··· 198 198 unsigned long address); 199 199 void page_add_file_rmap(struct page *, struct vm_area_struct *, 200 200 bool compound); 201 + void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr, 202 + struct vm_area_struct *, bool compound); 201 203 void page_remove_rmap(struct page *, struct vm_area_struct *, 202 204 bool compound); 203 205
+7 -8
include/linux/secretmem.h
··· 6 6 7 7 extern const struct address_space_operations secretmem_aops; 8 8 9 - static inline bool page_is_secretmem(struct page *page) 9 + static inline bool folio_is_secretmem(struct folio *folio) 10 10 { 11 11 struct address_space *mapping; 12 12 13 13 /* 14 - * Using page_mapping() is quite slow because of the actual call 15 - * instruction and repeated compound_head(page) inside the 16 - * page_mapping() function. 14 + * Using folio_mapping() is quite slow because of the actual call 15 + * instruction. 17 16 * We know that secretmem pages are not compound and LRU so we can 18 17 * save a couple of cycles here. 19 18 */ 20 - if (PageCompound(page) || !PageLRU(page)) 19 + if (folio_test_large(folio) || !folio_test_lru(folio)) 21 20 return false; 22 21 23 22 mapping = (struct address_space *) 24 - ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); 23 + ((unsigned long)folio->mapping & ~PAGE_MAPPING_FLAGS); 25 24 26 - if (!mapping || mapping != page->mapping) 25 + if (!mapping || mapping != folio->mapping) 27 26 return false; 28 27 29 28 return mapping->a_ops == &secretmem_aops; ··· 38 39 return false; 39 40 } 40 41 41 - static inline bool page_is_secretmem(struct page *page) 42 + static inline bool folio_is_secretmem(struct folio *folio) 42 43 { 43 44 return false; 44 45 }
+5 -16
include/linux/swap.h
··· 302 302 struct file *swap_file; /* seldom referenced */ 303 303 unsigned int old_block_size; /* seldom referenced */ 304 304 struct completion comp; /* seldom referenced */ 305 - #ifdef CONFIG_FRONTSWAP 306 - unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ 307 - atomic_t frontswap_pages; /* frontswap pages in-use counter */ 308 - #endif 309 305 spinlock_t lock; /* 310 306 * protect map scan related fields like 311 307 * swap_map, lowest_bit, highest_bit, ··· 333 337 */ 334 338 }; 335 339 336 - static inline swp_entry_t folio_swap_entry(struct folio *folio) 340 + static inline swp_entry_t page_swap_entry(struct page *page) 337 341 { 338 - swp_entry_t entry = { .val = page_private(&folio->page) }; 339 - return entry; 340 - } 342 + struct folio *folio = page_folio(page); 343 + swp_entry_t entry = folio->swap; 341 344 342 - static inline void folio_set_swap_entry(struct folio *folio, swp_entry_t entry) 343 - { 344 - folio->private = (void *)entry.val; 345 + entry.val += folio_page_idx(folio, page); 346 + return entry; 345 347 } 346 348 347 349 /* linux/mm/workingset.c */ ··· 622 628 { 623 629 return READ_ONCE(vm_swappiness); 624 630 } 625 - #endif 626 - 627 - #ifdef CONFIG_ZSWAP 628 - extern u64 zswap_pool_total_size; 629 - extern atomic_t zswap_stored_pages; 630 631 #endif 631 632 632 633 #if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
-5
include/linux/swapfile.h
··· 2 2 #ifndef _LINUX_SWAPFILE_H 3 3 #define _LINUX_SWAPFILE_H 4 4 5 - /* 6 - * these were static in swapfile.c but frontswap.c needs them and we don't 7 - * want to expose them to the dozens of source files that include swap.h 8 - */ 9 - extern struct swap_info_struct *swap_info[]; 10 5 extern unsigned long generic_max_swapfile_size(void); 11 6 unsigned long arch_max_swapfile_size(void); 12 7
+10 -5
include/linux/swapops.h
··· 393 393 typedef unsigned long pte_marker; 394 394 395 395 #define PTE_MARKER_UFFD_WP BIT(0) 396 - #define PTE_MARKER_SWAPIN_ERROR BIT(1) 396 + /* 397 + * "Poisoned" here is meant in the very general sense of "future accesses are 398 + * invalid", instead of referring very specifically to hardware memory errors. 399 + * This marker is meant to represent any of various different causes of this. 400 + */ 401 + #define PTE_MARKER_POISONED BIT(1) 397 402 #define PTE_MARKER_MASK (BIT(2) - 1) 398 403 399 404 static inline swp_entry_t make_pte_marker_entry(pte_marker marker) ··· 426 421 return swp_entry_to_pte(make_pte_marker_entry(marker)); 427 422 } 428 423 429 - static inline swp_entry_t make_swapin_error_entry(void) 424 + static inline swp_entry_t make_poisoned_swp_entry(void) 430 425 { 431 - return make_pte_marker_entry(PTE_MARKER_SWAPIN_ERROR); 426 + return make_pte_marker_entry(PTE_MARKER_POISONED); 432 427 } 433 428 434 - static inline int is_swapin_error_entry(swp_entry_t entry) 429 + static inline int is_poisoned_swp_entry(swp_entry_t entry) 435 430 { 436 431 return is_pte_marker_entry(entry) && 437 - (pte_marker_get(entry) & PTE_MARKER_SWAPIN_ERROR); 432 + (pte_marker_get(entry) & PTE_MARKER_POISONED); 438 433 } 439 434 440 435 /*
+4
include/linux/userfaultfd_k.h
··· 46 46 MFILL_ATOMIC_COPY, 47 47 MFILL_ATOMIC_ZEROPAGE, 48 48 MFILL_ATOMIC_CONTINUE, 49 + MFILL_ATOMIC_POISON, 49 50 NR_MFILL_ATOMIC_MODES, 50 51 }; 51 52 ··· 84 83 extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start, 85 84 unsigned long len, atomic_t *mmap_changing, 86 85 uffd_flags_t flags); 86 + extern ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start, 87 + unsigned long len, atomic_t *mmap_changing, 88 + uffd_flags_t flags); 87 89 extern int mwriteprotect_range(struct mm_struct *dst_mm, 88 90 unsigned long start, unsigned long len, 89 91 bool enable_wp, atomic_t *mmap_changing);
+37
include/linux/zswap.h
···
 1 +  /* SPDX-License-Identifier: GPL-2.0 */
 2 +  #ifndef _LINUX_ZSWAP_H
 3 +  #define _LINUX_ZSWAP_H
 4 +  
 5 +  #include <linux/types.h>
 6 +  #include <linux/mm_types.h>
 7 +  
 8 +  extern u64 zswap_pool_total_size;
 9 +  extern atomic_t zswap_stored_pages;
10 +  
11 +  #ifdef CONFIG_ZSWAP
12 +  
13 +  bool zswap_store(struct folio *folio);
14 +  bool zswap_load(struct folio *folio);
15 +  void zswap_invalidate(int type, pgoff_t offset);
16 +  void zswap_swapon(int type);
17 +  void zswap_swapoff(int type);
18 +  
19 +  #else
20 +  
21 +  static inline bool zswap_store(struct folio *folio)
22 +  {
23 +  	return false;
24 +  }
25 +  
26 +  static inline bool zswap_load(struct folio *folio)
27 +  {
28 +  	return false;
29 +  }
30 +  
31 +  static inline void zswap_invalidate(int type, pgoff_t offset) {}
32 +  static inline void zswap_swapon(int type) {}
33 +  static inline void zswap_swapoff(int type) {}
34 +  
35 +  #endif
36 +  
37 +  #endif /* _LINUX_ZSWAP_H */
-1
include/net/tcp.h
···
45 45  #include <linux/memcontrol.h>
46 46  #include <linux/bpf-cgroup.h>
47 47  #include <linux/siphash.h>
48 -  #include <linux/net_mm.h>
49 48  
50 49  extern struct inet_hashinfo tcp_hashinfo;
51 50  
+26 -7
include/trace/events/thp.h
··· 8 8 #include <linux/types.h> 9 9 #include <linux/tracepoint.h> 10 10 11 - TRACE_EVENT(hugepage_set_pmd, 11 + DECLARE_EVENT_CLASS(hugepage_set, 12 12 13 - TP_PROTO(unsigned long addr, unsigned long pmd), 14 - TP_ARGS(addr, pmd), 13 + TP_PROTO(unsigned long addr, unsigned long pte), 14 + TP_ARGS(addr, pte), 15 15 TP_STRUCT__entry( 16 16 __field(unsigned long, addr) 17 - __field(unsigned long, pmd) 17 + __field(unsigned long, pte) 18 18 ), 19 19 20 20 TP_fast_assign( 21 21 __entry->addr = addr; 22 - __entry->pmd = pmd; 22 + __entry->pte = pte; 23 23 ), 24 24 25 - TP_printk("Set pmd with 0x%lx with 0x%lx", __entry->addr, __entry->pmd) 25 + TP_printk("Set page table entry with 0x%lx with 0x%lx", __entry->addr, __entry->pte) 26 26 ); 27 27 28 + DEFINE_EVENT(hugepage_set, hugepage_set_pmd, 29 + TP_PROTO(unsigned long addr, unsigned long pmd), 30 + TP_ARGS(addr, pmd) 31 + ); 28 32 29 - TRACE_EVENT(hugepage_update, 33 + DEFINE_EVENT(hugepage_set, hugepage_set_pud, 34 + TP_PROTO(unsigned long addr, unsigned long pud), 35 + TP_ARGS(addr, pud) 36 + ); 37 + 38 + DECLARE_EVENT_CLASS(hugepage_update, 30 39 31 40 TP_PROTO(unsigned long addr, unsigned long pte, unsigned long clr, unsigned long set), 32 41 TP_ARGS(addr, pte, clr, set), ··· 55 46 ), 56 47 57 48 TP_printk("hugepage update at addr 0x%lx and pte = 0x%lx clr = 0x%lx, set = 0x%lx", __entry->addr, __entry->pte, __entry->clr, __entry->set) 49 + ); 50 + 51 + DEFINE_EVENT(hugepage_update, hugepage_update_pmd, 52 + TP_PROTO(unsigned long addr, unsigned long pmd, unsigned long clr, unsigned long set), 53 + TP_ARGS(addr, pmd, clr, set) 54 + ); 55 + 56 + DEFINE_EVENT(hugepage_update, hugepage_update_pud, 57 + TP_PROTO(unsigned long addr, unsigned long pud, unsigned long clr, unsigned long set), 58 + TP_ARGS(addr, pud, clr, set) 58 59 ); 59 60 60 61 DECLARE_EVENT_CLASS(migration_pmd,
+22 -3
include/uapi/linux/userfaultfd.h
··· 39 39 UFFD_FEATURE_MINOR_SHMEM | \ 40 40 UFFD_FEATURE_EXACT_ADDRESS | \ 41 41 UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \ 42 - UFFD_FEATURE_WP_UNPOPULATED) 42 + UFFD_FEATURE_WP_UNPOPULATED | \ 43 + UFFD_FEATURE_POISON) 43 44 #define UFFD_API_IOCTLS \ 44 45 ((__u64)1 << _UFFDIO_REGISTER | \ 45 46 (__u64)1 << _UFFDIO_UNREGISTER | \ ··· 50 49 (__u64)1 << _UFFDIO_COPY | \ 51 50 (__u64)1 << _UFFDIO_ZEROPAGE | \ 52 51 (__u64)1 << _UFFDIO_WRITEPROTECT | \ 53 - (__u64)1 << _UFFDIO_CONTINUE) 52 + (__u64)1 << _UFFDIO_CONTINUE | \ 53 + (__u64)1 << _UFFDIO_POISON) 54 54 #define UFFD_API_RANGE_IOCTLS_BASIC \ 55 55 ((__u64)1 << _UFFDIO_WAKE | \ 56 56 (__u64)1 << _UFFDIO_COPY | \ 57 + (__u64)1 << _UFFDIO_WRITEPROTECT | \ 57 58 (__u64)1 << _UFFDIO_CONTINUE | \ 58 - (__u64)1 << _UFFDIO_WRITEPROTECT) 59 + (__u64)1 << _UFFDIO_POISON) 59 60 60 61 /* 61 62 * Valid ioctl command number range with this API is from 0x00 to ··· 74 71 #define _UFFDIO_ZEROPAGE (0x04) 75 72 #define _UFFDIO_WRITEPROTECT (0x06) 76 73 #define _UFFDIO_CONTINUE (0x07) 74 + #define _UFFDIO_POISON (0x08) 77 75 #define _UFFDIO_API (0x3F) 78 76 79 77 /* userfaultfd ioctl ids */ ··· 95 91 struct uffdio_writeprotect) 96 92 #define UFFDIO_CONTINUE _IOWR(UFFDIO, _UFFDIO_CONTINUE, \ 97 93 struct uffdio_continue) 94 + #define UFFDIO_POISON _IOWR(UFFDIO, _UFFDIO_POISON, \ 95 + struct uffdio_poison) 98 96 99 97 /* read() structure */ 100 98 struct uffd_msg { ··· 231 225 #define UFFD_FEATURE_EXACT_ADDRESS (1<<11) 232 226 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12) 233 227 #define UFFD_FEATURE_WP_UNPOPULATED (1<<13) 228 + #define UFFD_FEATURE_POISON (1<<14) 234 229 __u64 features; 235 230 236 231 __u64 ioctls; ··· 326 319 * the copy_from_user will not read past here. 
327 320 */ 328 321 __s64 mapped; 322 + }; 323 + 324 + struct uffdio_poison { 325 + struct uffdio_range range; 326 + #define UFFDIO_POISON_MODE_DONTWAKE ((__u64)1<<0) 327 + __u64 mode; 328 + 329 + /* 330 + * Fields below here are written by the ioctl and must be at the end: 331 + * the copy_from_user will not read past here. 332 + */ 333 + __s64 updated; 329 334 }; 330 335 331 336 /*
+1 -1
init/initramfs.c
···
61 61  }
62 62  
63 63  #define panic_show_mem(fmt, ...) \
64 -  	({ show_mem(0, NULL); panic(fmt, ##__VA_ARGS__); })
64 +  	({ show_mem(); panic(fmt, ##__VA_ARGS__); })
65 65  
66 66  /* link hash */
67 67  
+1 -5
io_uring/io_uring.c
···
2643 2643  
2644 2644  static void io_mem_free(void *ptr)
2645 2645  {
2646 -  	struct page *page;
2647 -  
2648 2646  	if (!ptr)
2649 2647  		return;
2650 2648  
2651 -  	page = virt_to_head_page(ptr);
2652 -  	if (put_page_testzero(page))
2653 -  		free_compound_page(page);
2649 +  	folio_put(virt_to_folio(ptr));
2654 2650  }
2655 2651  
2656 2652  static void io_pages_free(struct page ***pages, int npages)
+1 -5
io_uring/kbuf.c
···
218 218  	if (bl->is_mapped) {
219 219  		i = bl->buf_ring->tail - bl->head;
220 220  		if (bl->is_mmap) {
221 -  			struct page *page;
222 -  
223 -  			page = virt_to_head_page(bl->buf_ring);
224 -  			if (put_page_testzero(page))
225 -  				free_compound_page(page);
221 +  			folio_put(virt_to_folio(bl->buf_ring));
226 222  			bl->buf_ring = NULL;
227 223  			bl->is_mmap = 0;
228 224  		} else if (bl->buf_nr_pages) {
+1 -3
kernel/crash_core.c
···
455 455  	VMCOREINFO_OFFSET(page, lru);
456 456  	VMCOREINFO_OFFSET(page, _mapcount);
457 457  	VMCOREINFO_OFFSET(page, private);
458 -  	VMCOREINFO_OFFSET(folio, _folio_dtor);
459 -  	VMCOREINFO_OFFSET(folio, _folio_order);
460 458  	VMCOREINFO_OFFSET(page, compound_head);
461 459  	VMCOREINFO_OFFSET(pglist_data, node_zones);
462 460  	VMCOREINFO_OFFSET(pglist_data, nr_zones);
···
488 490  #define PAGE_BUDDY_MAPCOUNT_VALUE	(~PG_buddy)
489 491  	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
490 492  #ifdef CONFIG_HUGETLB_PAGE
491 -  	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
493 +  	VMCOREINFO_NUMBER(PG_hugetlb);
492 494  #define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
493 495  	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
494 496  #endif
+11 -22
kernel/events/core.c
··· 8631 8631 unsigned int size; 8632 8632 char tmp[16]; 8633 8633 char *buf = NULL; 8634 - char *name; 8634 + char *name = NULL; 8635 8635 8636 8636 if (vma->vm_flags & VM_READ) 8637 8637 prot |= PROT_READ; ··· 8678 8678 8679 8679 goto got_name; 8680 8680 } else { 8681 - if (vma->vm_ops && vma->vm_ops->name) { 8681 + if (vma->vm_ops && vma->vm_ops->name) 8682 8682 name = (char *) vma->vm_ops->name(vma); 8683 - if (name) 8684 - goto cpy_name; 8683 + if (!name) 8684 + name = (char *)arch_vma_name(vma); 8685 + if (!name) { 8686 + if (vma_is_initial_heap(vma)) 8687 + name = "[heap]"; 8688 + else if (vma_is_initial_stack(vma)) 8689 + name = "[stack]"; 8690 + else 8691 + name = "//anon"; 8685 8692 } 8686 - 8687 - name = (char *)arch_vma_name(vma); 8688 - if (name) 8689 - goto cpy_name; 8690 - 8691 - if (vma->vm_start <= vma->vm_mm->start_brk && 8692 - vma->vm_end >= vma->vm_mm->brk) { 8693 - name = "[heap]"; 8694 - goto cpy_name; 8695 - } 8696 - if (vma->vm_start <= vma->vm_mm->start_stack && 8697 - vma->vm_end >= vma->vm_mm->start_stack) { 8698 - name = "[stack]"; 8699 - goto cpy_name; 8700 - } 8701 - 8702 - name = "//anon"; 8703 - goto cpy_name; 8704 8693 } 8705 8694 8706 8695 cpy_name:
+1 -1
kernel/events/uprobes.c
···
193 193  	}
194 194  
195 195  	flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte)));
196 -  	ptep_clear_flush_notify(vma, addr, pvmw.pte);
196 +  	ptep_clear_flush(vma, addr, pvmw.pte);
197 197  	if (new_page)
198 198  		set_pte_at_notify(mm, addr, pvmw.pte,
199 199  				  mk_pte(new_page, vma->vm_page_prot));
+1 -2
kernel/futex/core.c
···
1132 1132  #endif
1133 1133  
1134 1134  	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
1135 -  					       futex_hashsize, 0,
1136 -  					       futex_hashsize < 256 ? HASH_SMALL : 0,
1135 +  					       futex_hashsize, 0, 0,
1137 1136  					       &futex_shift, NULL,
1138 1137  					       futex_hashsize, futex_hashsize);
1139 1138  	futex_hashsize = 1UL << futex_shift;
+5 -8
kernel/iomem.c
···
 3  3  #include <linux/types.h>
 4  4  #include <linux/io.h>
 5  5  #include <linux/mm.h>
 6  -  
 7  -  #ifndef ioremap_cache
 8  -  /* temporary while we convert existing ioremap_cache users to memremap */
 9  -  __weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
10  -  {
11  -  	return ioremap(offset, size);
12  -  }
13  -  #endif
 6 +  #include <linux/ioremap.h>
14  7  
15  8  #ifndef arch_memremap_wb
16  9  static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
17 10  {
11 +  #ifdef ioremap_cache
18 12  	return (__force void *)ioremap_cache(offset, size);
13 +  #else
14 +  	return (__force void *)ioremap(offset, size);
15 +  #endif
19 16  }
20 17  #endif
21 18  
+1 -1
kernel/panic.c
···
216 216  		show_state();
217 217  
218 218  	if (panic_print & PANIC_PRINT_MEM_INFO)
219 -  		show_mem(0, NULL);
219 +  		show_mem();
220 220  
221 221  	if (panic_print & PANIC_PRINT_TIMER_INFO)
222 222  		sysrq_timer_list_show();
+3
kernel/pid.c
···
83 83  #ifdef CONFIG_PID_NS
84 84  	.ns.ops = &pidns_operations,
85 85  #endif
86 +  #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
87 +  	.memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
88 +  #endif
86 89  };
87 90  EXPORT_SYMBOL_GPL(init_pid_ns);
88 91  
+3 -3
kernel/pid_namespace.c
···
110 110  	ns->user_ns = get_user_ns(user_ns);
111 111  	ns->ucounts = ucounts;
112 112  	ns->pid_allocated = PIDNS_ADDING;
113 -  
114 -  	initialize_memfd_noexec_scope(ns);
115 -  
113 +  #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
114 +  	ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
115 +  #endif
116 116  	return ns;
117 117  
118 118  out_free_idr:
+12 -16
kernel/pid_sysctl.h
··· 5 5 #include <linux/pid_namespace.h> 6 6 7 7 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) 8 - static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) 9 - { 10 - ns->memfd_noexec_scope = 11 - task_active_pid_ns(current)->memfd_noexec_scope; 12 - } 13 - 14 8 static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, 15 9 int write, void *buf, size_t *lenp, loff_t *ppos) 16 10 { 17 11 struct pid_namespace *ns = task_active_pid_ns(current); 18 12 struct ctl_table table_copy; 13 + int err, scope, parent_scope; 19 14 20 15 if (write && !ns_capable(ns->user_ns, CAP_SYS_ADMIN)) 21 16 return -EPERM; 22 17 23 18 table_copy = *table; 24 - if (ns != &init_pid_ns) 25 - table_copy.data = &ns->memfd_noexec_scope; 26 19 27 - /* 28 - * set minimum to current value, the effect is only bigger 29 - * value is accepted. 30 - */ 31 - if (*(int *)table_copy.data > *(int *)table_copy.extra1) 32 - table_copy.extra1 = table_copy.data; 20 + /* You cannot set a lower enforcement value than your parent. */ 21 + parent_scope = pidns_memfd_noexec_scope(ns->parent); 22 + /* Equivalent to pidns_memfd_noexec_scope(ns). */ 23 + scope = max(READ_ONCE(ns->memfd_noexec_scope), parent_scope); 33 24 34 - return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); 25 + table_copy.data = &scope; 26 + table_copy.extra1 = &parent_scope; 27 + 28 + err = proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); 29 + if (!err && write) 30 + WRITE_ONCE(ns->memfd_noexec_scope, scope); 31 + return err; 35 32 } 36 33 37 34 static struct ctl_table pid_ns_ctl_table_vm[] = { ··· 48 51 register_sysctl("vm", pid_ns_ctl_table_vm); 49 52 } 50 53 #else 51 - static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) {} 52 54 static inline void register_pid_ns_sysctl_table_vm(void) {} 53 55 #endif 54 56
-3
lib/logic_pio.c
···
20 20  static LIST_HEAD(io_range_list);
21 21  static DEFINE_MUTEX(io_range_mutex);
22 22  
23 -  /* Consider a kernel general helper for this */
24 -  #define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))
25 -  
26 23  /**
27 24   * logic_pio_register_range - register logical PIO range for a host
28 25   * @new_range: pointer to the IO range to be registered.
+496 -616
lib/maple_tree.c
··· 75 75 #define MA_STATE_PREALLOC 4 76 76 77 77 #define ma_parent_ptr(x) ((struct maple_pnode *)(x)) 78 + #define mas_tree_parent(x) ((unsigned long)(x->tree) | MA_ROOT_PARENT) 78 79 #define ma_mnode_ptr(x) ((struct maple_node *)(x)) 79 80 #define ma_enode_ptr(x) ((struct maple_enode *)(x)) 80 81 static struct kmem_cache *maple_node_cache; ··· 730 729 } 731 730 732 731 /* 733 - * mas_logical_pivot() - Get the logical pivot of a given offset. 734 - * @mas: The maple state 735 - * @pivots: The pointer to the maple node pivots 736 - * @offset: The offset into the pivot array 737 - * @type: The maple node type 738 - * 739 - * When there is no value at a pivot (beyond the end of the data), then the 740 - * pivot is actually @mas->max. 741 - * 742 - * Return: the logical pivot of a given @offset. 743 - */ 744 - static inline unsigned long 745 - mas_logical_pivot(struct ma_state *mas, unsigned long *pivots, 746 - unsigned char offset, enum maple_type type) 747 - { 748 - unsigned long lpiv = mas_safe_pivot(mas, pivots, offset, type); 749 - 750 - if (likely(lpiv)) 751 - return lpiv; 752 - 753 - if (likely(offset)) 754 - return mas->max; 755 - 756 - return lpiv; 757 - } 758 - 759 - /* 760 732 * mte_set_pivot() - Set a pivot to a value in an encoded maple node. 761 733 * @mn: The encoded maple node 762 734 * @piv: The pivot offset ··· 778 804 } 779 805 } 780 806 807 + static inline bool mt_write_locked(const struct maple_tree *mt) 808 + { 809 + return mt_external_lock(mt) ? mt_write_lock_is_held(mt) : 810 + lockdep_is_held(&mt->ma_lock); 811 + } 812 + 781 813 static inline bool mt_locked(const struct maple_tree *mt) 782 814 { 783 815 return mt_external_lock(mt) ? 
mt_lock_is_held(mt) : ··· 799 819 static inline void *mt_slot_locked(struct maple_tree *mt, void __rcu **slots, 800 820 unsigned char offset) 801 821 { 802 - return rcu_dereference_protected(slots[offset], mt_locked(mt)); 822 + return rcu_dereference_protected(slots[offset], mt_write_locked(mt)); 803 823 } 804 824 /* 805 825 * mas_slot_locked() - Get the slot value when holding the maple tree lock. ··· 842 862 843 863 static inline void *mt_root_locked(struct maple_tree *mt) 844 864 { 845 - return rcu_dereference_protected(mt->ma_root, mt_locked(mt)); 865 + return rcu_dereference_protected(mt->ma_root, mt_write_locked(mt)); 846 866 } 847 867 848 868 /* ··· 982 1002 mat->tail = dead_enode; 983 1003 } 984 1004 985 - static void mte_destroy_walk(struct maple_enode *, struct maple_tree *); 986 - static inline void mas_free(struct ma_state *mas, struct maple_enode *used); 987 - 988 - /* 989 - * mas_mat_free() - Free all nodes in a dead list. 990 - * @mas - the maple state 991 - * @mat - the ma_topiary linked list of dead nodes to free. 992 - * 993 - * Free walk a dead list. 994 - */ 995 - static void mas_mat_free(struct ma_state *mas, struct ma_topiary *mat) 996 - { 997 - struct maple_enode *next; 998 - 999 - while (mat->head) { 1000 - next = mte_to_mat(mat->head)->next; 1001 - mas_free(mas, mat->head); 1002 - mat->head = next; 1003 - } 1004 - } 1005 - 1005 + static void mt_free_walk(struct rcu_head *head); 1006 + static void mt_destroy_walk(struct maple_enode *enode, struct maple_tree *mt, 1007 + bool free); 1006 1008 /* 1007 1009 * mas_mat_destroy() - Free all nodes and subtrees in a dead list. 
1008 1010 * @mas - the maple state ··· 995 1033 static void mas_mat_destroy(struct ma_state *mas, struct ma_topiary *mat) 996 1034 { 997 1035 struct maple_enode *next; 1036 + struct maple_node *node; 1037 + bool in_rcu = mt_in_rcu(mas->tree); 998 1038 999 1039 while (mat->head) { 1000 1040 next = mte_to_mat(mat->head)->next; 1001 - mte_destroy_walk(mat->head, mat->mtree); 1041 + node = mte_to_node(mat->head); 1042 + mt_destroy_walk(mat->head, mas->tree, !in_rcu); 1043 + if (in_rcu) 1044 + call_rcu(&node->rcu, mt_free_walk); 1002 1045 mat->head = next; 1003 1046 } 1004 1047 } ··· 1577 1610 * mas_max_gap() - find the largest gap in a non-leaf node and set the slot. 1578 1611 * @mas: The maple state. 1579 1612 * 1580 - * If the metadata gap is set to MAPLE_ARANGE64_META_MAX, there is no gap. 1581 - * 1582 1613 * Return: The gap value. 1583 1614 */ 1584 1615 static inline unsigned long mas_max_gap(struct ma_state *mas) ··· 1593 1628 node = mas_mn(mas); 1594 1629 MAS_BUG_ON(mas, mt != maple_arange_64); 1595 1630 offset = ma_meta_gap(node, mt); 1596 - if (offset == MAPLE_ARANGE64_META_MAX) 1597 - return 0; 1598 - 1599 1631 gaps = ma_gaps(node, mt); 1600 1632 return gaps[offset]; 1601 1633 } ··· 1624 1662 ascend: 1625 1663 MAS_BUG_ON(mas, pmt != maple_arange_64); 1626 1664 meta_offset = ma_meta_gap(pnode, pmt); 1627 - if (meta_offset == MAPLE_ARANGE64_META_MAX) 1628 - meta_gap = 0; 1629 - else 1630 - meta_gap = pgaps[meta_offset]; 1665 + meta_gap = pgaps[meta_offset]; 1631 1666 1632 1667 pgaps[offset] = new; 1633 1668 ··· 1637 1678 1638 1679 ma_set_meta_gap(pnode, pmt, offset); 1639 1680 } else if (new < meta_gap) { 1640 - meta_offset = 15; 1641 1681 new = ma_max_gap(pnode, pgaps, pmt, &meta_offset); 1642 1682 ma_set_meta_gap(pnode, pmt, meta_offset); 1643 1683 } ··· 1689 1731 struct maple_enode *parent) 1690 1732 { 1691 1733 enum maple_type type = mte_node_type(parent); 1692 - struct maple_node *node = mas_mn(mas); 1734 + struct maple_node *node = mte_to_node(parent); 
1693 1735 void __rcu **slots = ma_slots(node, type); 1694 1736 unsigned long *pivots = ma_pivots(node, type); 1695 1737 struct maple_enode *child; ··· 1703 1745 } 1704 1746 1705 1747 /* 1706 - * mas_replace() - Replace a maple node in the tree with mas->node. Uses the 1707 - * parent encoding to locate the maple node in the tree. 1708 - * @mas - the ma_state to use for operations. 1709 - * @advanced - boolean to adopt the child nodes and free the old node (false) or 1710 - * leave the node (true) and handle the adoption and free elsewhere. 1748 + * mas_put_in_tree() - Put a new node in the tree, smp_wmb(), and mark the old 1749 + * node as dead. 1750 + * @mas - the maple state with the new node 1751 + * @old_enode - The old maple encoded node to replace. 1711 1752 */ 1712 - static inline void mas_replace(struct ma_state *mas, bool advanced) 1753 + static inline void mas_put_in_tree(struct ma_state *mas, 1754 + struct maple_enode *old_enode) 1713 1755 __must_hold(mas->tree->ma_lock) 1714 1756 { 1715 - struct maple_node *mn = mas_mn(mas); 1716 - struct maple_enode *old_enode; 1717 - unsigned char offset = 0; 1718 - void __rcu **slots = NULL; 1719 - 1720 - if (ma_is_root(mn)) { 1721 - old_enode = mas_root_locked(mas); 1722 - } else { 1723 - offset = mte_parent_slot(mas->node); 1724 - slots = ma_slots(mte_parent(mas->node), 1725 - mas_parent_type(mas, mas->node)); 1726 - old_enode = mas_slot_locked(mas, slots, offset); 1727 - } 1728 - 1729 - if (!advanced && !mte_is_leaf(mas->node)) 1730 - mas_adopt_children(mas, mas->node); 1757 + unsigned char offset; 1758 + void __rcu **slots; 1731 1759 1732 1760 if (mte_is_root(mas->node)) { 1733 - mn->parent = ma_parent_ptr( 1734 - ((unsigned long)mas->tree | MA_ROOT_PARENT)); 1761 + mas_mn(mas)->parent = ma_parent_ptr(mas_tree_parent(mas)); 1735 1762 rcu_assign_pointer(mas->tree->ma_root, mte_mk_root(mas->node)); 1736 1763 mas_set_height(mas); 1737 1764 } else { 1765 + 1766 + offset = mte_parent_slot(mas->node); 1767 + slots = 
ma_slots(mte_parent(mas->node), 1768 + mas_parent_type(mas, mas->node)); 1738 1769 rcu_assign_pointer(slots[offset], mas->node); 1739 1770 } 1740 1771 1741 - if (!advanced) { 1742 - mte_set_node_dead(old_enode); 1743 - mas_free(mas, old_enode); 1744 - } 1772 + mte_set_node_dead(old_enode); 1745 1773 } 1746 1774 1747 1775 /* 1748 - * mas_new_child() - Find the new child of a node. 1749 - * @mas: the maple state 1776 + * mas_replace_node() - Replace a node by putting it in the tree, marking it 1777 + * dead, and freeing it. 1778 + * the parent encoding to locate the maple node in the tree. 1779 + * @mas - the ma_state with @mas->node pointing to the new node. 1780 + * @old_enode - The old maple encoded node. 1781 + */ 1782 + static inline void mas_replace_node(struct ma_state *mas, 1783 + struct maple_enode *old_enode) 1784 + __must_hold(mas->tree->ma_lock) 1785 + { 1786 + mas_put_in_tree(mas, old_enode); 1787 + mas_free(mas, old_enode); 1788 + } 1789 + 1790 + /* 1791 + * mas_find_child() - Find a child who has the parent @mas->node. 1792 + * @mas: the maple state with the parent. 1750 1793 * @child: the maple state to store the child. 1751 1794 */ 1752 - static inline bool mas_new_child(struct ma_state *mas, struct ma_state *child) 1795 + static inline bool mas_find_child(struct ma_state *mas, struct ma_state *child) 1753 1796 __must_hold(mas->tree->ma_lock) 1754 1797 { 1755 1798 enum maple_type mt; ··· 2035 2076 end = j - 1; 2036 2077 if (likely(!ma_is_leaf(mt) && mt_is_alloc(mas->tree))) { 2037 2078 unsigned long max_gap = 0; 2038 - unsigned char offset = 15; 2079 + unsigned char offset = 0; 2039 2080 2040 2081 gaps = ma_gaps(node, mt); 2041 2082 do { ··· 2049 2090 ma_set_meta(node, mt, offset, end); 2050 2091 } else { 2051 2092 mas_leaf_set_meta(mas, node, pivots, mt, end); 2052 - } 2053 - } 2054 - 2055 - /* 2056 - * mas_descend_adopt() - Descend through a sub-tree and adopt children. 2057 - * @mas: the maple state with the maple encoded node of the sub-tree. 
2058 - * 2059 - * Descend through a sub-tree and adopt children who do not have the correct 2060 - * parents set. Follow the parents which have the correct parents as they are 2061 - * the new entries which need to be followed to find other incorrectly set 2062 - * parents. 2063 - */ 2064 - static inline void mas_descend_adopt(struct ma_state *mas) 2065 - { 2066 - struct ma_state list[3], next[3]; 2067 - int i, n; 2068 - 2069 - /* 2070 - * At each level there may be up to 3 correct parent pointers which indicates 2071 - * the new nodes which need to be walked to find any new nodes at a lower level. 2072 - */ 2073 - 2074 - for (i = 0; i < 3; i++) { 2075 - list[i] = *mas; 2076 - list[i].offset = 0; 2077 - next[i].offset = 0; 2078 - } 2079 - next[0] = *mas; 2080 - 2081 - while (!mte_is_leaf(list[0].node)) { 2082 - n = 0; 2083 - for (i = 0; i < 3; i++) { 2084 - if (mas_is_none(&list[i])) 2085 - continue; 2086 - 2087 - if (i && list[i-1].node == list[i].node) 2088 - continue; 2089 - 2090 - while ((n < 3) && (mas_new_child(&list[i], &next[n]))) 2091 - n++; 2092 - 2093 - mas_adopt_children(&list[i], list[i].node); 2094 - } 2095 - 2096 - while (n < 3) 2097 - next[n++].node = MAS_NONE; 2098 - 2099 - /* descend by setting the list to the children */ 2100 - for (i = 0; i < 3; i++) 2101 - list[i] = next[i]; 2102 2093 } 2103 2094 } 2104 2095 ··· 2120 2211 goto b_end; 2121 2212 2122 2213 /* Handle new range ending before old range ends */ 2123 - piv = mas_logical_pivot(mas, wr_mas->pivots, offset_end, wr_mas->type); 2214 + piv = mas_safe_pivot(mas, wr_mas->pivots, offset_end, wr_mas->type); 2124 2215 if (piv > mas->last) { 2125 2216 if (piv == ULONG_MAX) 2126 2217 mas_bulk_rebalance(mas, b_node->b_end, wr_mas->type); ··· 2242 2333 } 2243 2334 2244 2335 /* 2245 - * mas_topiary_range() - Add a range of slots to the topiary. 
2246 - * @mas: The maple state 2247 - * @destroy: The topiary to add the slots (usually destroy) 2248 - * @start: The starting slot inclusively 2249 - * @end: The end slot inclusively 2250 - */ 2251 - static inline void mas_topiary_range(struct ma_state *mas, 2252 - struct ma_topiary *destroy, unsigned char start, unsigned char end) 2253 - { 2254 - void __rcu **slots; 2255 - unsigned char offset; 2256 - 2257 - MAS_BUG_ON(mas, mte_is_leaf(mas->node)); 2258 - 2259 - slots = ma_slots(mas_mn(mas), mte_node_type(mas->node)); 2260 - for (offset = start; offset <= end; offset++) { 2261 - struct maple_enode *enode = mas_slot_locked(mas, slots, offset); 2262 - 2263 - if (mte_dead_node(enode)) 2264 - continue; 2265 - 2266 - mat_add(destroy, enode); 2267 - } 2268 - } 2269 - 2270 - /* 2271 - * mast_topiary() - Add the portions of the tree to the removal list; either to 2272 - * be freed or discarded (destroy walk). 2273 - * @mast: The maple_subtree_state. 2274 - */ 2275 - static inline void mast_topiary(struct maple_subtree_state *mast) 2276 - { 2277 - MA_WR_STATE(wr_mas, mast->orig_l, NULL); 2278 - unsigned char r_start, r_end; 2279 - unsigned char l_start, l_end; 2280 - void __rcu **l_slots, **r_slots; 2281 - 2282 - wr_mas.type = mte_node_type(mast->orig_l->node); 2283 - mast->orig_l->index = mast->orig_l->last; 2284 - mas_wr_node_walk(&wr_mas); 2285 - l_start = mast->orig_l->offset + 1; 2286 - l_end = mas_data_end(mast->orig_l); 2287 - r_start = 0; 2288 - r_end = mast->orig_r->offset; 2289 - 2290 - if (r_end) 2291 - r_end--; 2292 - 2293 - l_slots = ma_slots(mas_mn(mast->orig_l), 2294 - mte_node_type(mast->orig_l->node)); 2295 - 2296 - r_slots = ma_slots(mas_mn(mast->orig_r), 2297 - mte_node_type(mast->orig_r->node)); 2298 - 2299 - if ((l_start < l_end) && 2300 - mte_dead_node(mas_slot_locked(mast->orig_l, l_slots, l_start))) { 2301 - l_start++; 2302 - } 2303 - 2304 - if (mte_dead_node(mas_slot_locked(mast->orig_r, r_slots, r_end))) { 2305 - if (r_end) 2306 - r_end--; 2307 - 
} 2308 - 2309 - if ((l_start > r_end) && (mast->orig_l->node == mast->orig_r->node)) 2310 - return; 2311 - 2312 - /* At the node where left and right sides meet, add the parts between */ 2313 - if (mast->orig_l->node == mast->orig_r->node) { 2314 - return mas_topiary_range(mast->orig_l, mast->destroy, 2315 - l_start, r_end); 2316 - } 2317 - 2318 - /* mast->orig_r is different and consumed. */ 2319 - if (mte_is_leaf(mast->orig_r->node)) 2320 - return; 2321 - 2322 - if (mte_dead_node(mas_slot_locked(mast->orig_l, l_slots, l_end))) 2323 - l_end--; 2324 - 2325 - 2326 - if (l_start <= l_end) 2327 - mas_topiary_range(mast->orig_l, mast->destroy, l_start, l_end); 2328 - 2329 - if (mte_dead_node(mas_slot_locked(mast->orig_r, r_slots, r_start))) 2330 - r_start++; 2331 - 2332 - if (r_start <= r_end) 2333 - mas_topiary_range(mast->orig_r, mast->destroy, 0, r_end); 2334 - } 2335 - 2336 - /* 2337 2336 * mast_rebalance_next() - Rebalance against the next node 2338 2337 * @mast: The maple subtree state 2339 2338 * @old_r: The encoded maple node to the right (next node). ··· 2276 2459 /* 2277 2460 * mast_spanning_rebalance() - Rebalance nodes with nearest neighbour favouring 2278 2461 * the node to the right. Checking the nodes to the right then the left at each 2279 - * level upwards until root is reached. Free and destroy as needed. 2462 + * level upwards until root is reached. 2280 2463 * Data is copied into the @mast->bn. 2281 2464 * @mast: The maple_subtree_state. 
2282 2465 */ ··· 2285 2468 { 2286 2469 struct ma_state r_tmp = *mast->orig_r; 2287 2470 struct ma_state l_tmp = *mast->orig_l; 2288 - struct maple_enode *ancestor = NULL; 2289 - unsigned char start, end; 2290 2471 unsigned char depth = 0; 2291 2472 2292 2473 r_tmp = *mast->orig_r; ··· 2293 2478 mas_ascend(mast->orig_r); 2294 2479 mas_ascend(mast->orig_l); 2295 2480 depth++; 2296 - if (!ancestor && 2297 - (mast->orig_r->node == mast->orig_l->node)) { 2298 - ancestor = mast->orig_r->node; 2299 - end = mast->orig_r->offset - 1; 2300 - start = mast->orig_l->offset + 1; 2301 - } 2302 - 2303 2481 if (mast->orig_r->offset < mas_data_end(mast->orig_r)) { 2304 - if (!ancestor) { 2305 - ancestor = mast->orig_r->node; 2306 - start = 0; 2307 - } 2308 - 2309 2482 mast->orig_r->offset++; 2310 2483 do { 2311 2484 mas_descend(mast->orig_r); 2312 2485 mast->orig_r->offset = 0; 2313 - depth--; 2314 - } while (depth); 2486 + } while (--depth); 2315 2487 2316 2488 mast_rebalance_next(mast); 2317 - do { 2318 - unsigned char l_off = 0; 2319 - struct maple_enode *child = r_tmp.node; 2320 - 2321 - mas_ascend(&r_tmp); 2322 - if (ancestor == r_tmp.node) 2323 - l_off = start; 2324 - 2325 - if (r_tmp.offset) 2326 - r_tmp.offset--; 2327 - 2328 - if (l_off < r_tmp.offset) 2329 - mas_topiary_range(&r_tmp, mast->destroy, 2330 - l_off, r_tmp.offset); 2331 - 2332 - if (l_tmp.node != child) 2333 - mat_add(mast->free, child); 2334 - 2335 - } while (r_tmp.node != ancestor); 2336 - 2337 2489 *mast->orig_l = l_tmp; 2338 2490 return true; 2339 - 2340 2491 } else if (mast->orig_l->offset != 0) { 2341 - if (!ancestor) { 2342 - ancestor = mast->orig_l->node; 2343 - end = mas_data_end(mast->orig_l); 2344 - } 2345 - 2346 2492 mast->orig_l->offset--; 2347 2493 do { 2348 2494 mas_descend(mast->orig_l); 2349 2495 mast->orig_l->offset = 2350 2496 mas_data_end(mast->orig_l); 2351 - depth--; 2352 - } while (depth); 2497 + } while (--depth); 2353 2498 2354 2499 mast_rebalance_prev(mast); 2355 - do { 2356 - unsigned 
char r_off; 2357 - struct maple_enode *child = l_tmp.node; 2358 - 2359 - mas_ascend(&l_tmp); 2360 - if (ancestor == l_tmp.node) 2361 - r_off = end; 2362 - else 2363 - r_off = mas_data_end(&l_tmp); 2364 - 2365 - if (l_tmp.offset < r_off) 2366 - l_tmp.offset++; 2367 - 2368 - if (l_tmp.offset < r_off) 2369 - mas_topiary_range(&l_tmp, mast->destroy, 2370 - l_tmp.offset, r_off); 2371 - 2372 - if (r_tmp.node != child) 2373 - mat_add(mast->free, child); 2374 - 2375 - } while (l_tmp.node != ancestor); 2376 - 2377 2500 *mast->orig_r = r_tmp; 2378 2501 return true; 2379 2502 } ··· 2323 2570 } 2324 2571 2325 2572 /* 2326 - * mast_ascend_free() - Add current original maple state nodes to the free list 2327 - * and ascend. 2573 + * mast_ascend() - Ascend the original left and right maple states. 2328 2574 * @mast: the maple subtree state. 2329 2575 * 2330 - * Ascend the original left and right sides and add the previous nodes to the 2331 - * free list. Set the slots to point to the correct location in the new nodes. 2576 + * Ascend the original left and right sides. Set the offsets to point to the 2577 + * data already in the new tree (@mast->l and @mast->r). 
  */
-static inline void
-mast_ascend_free(struct maple_subtree_state *mast)
+static inline void mast_ascend(struct maple_subtree_state *mast)
 {
 	MA_WR_STATE(wr_mas, mast->orig_r, NULL);
-	struct maple_enode *left = mast->orig_l->node;
-	struct maple_enode *right = mast->orig_r->node;
-
 	mas_ascend(mast->orig_l);
 	mas_ascend(mast->orig_r);
-	mat_add(mast->free, left);
-
-	if (left != right)
-		mat_add(mast->free, right);
 
 	mast->orig_r->offset = 0;
 	mast->orig_r->index = mast->r->max;
 	/* last should be larger than or equal to index */
 	if (mast->orig_r->last < mast->orig_r->index)
 		mast->orig_r->last = mast->orig_r->index;
-	/*
-	 * The node may not contain the value so set slot to ensure all
-	 * of the nodes contents are freed or destroyed.
-	 */
+
 	wr_mas.type = mte_node_type(mast->orig_r->node);
 	mas_wr_node_walk(&wr_mas);
 	/* Set up the left side of things */
···
 }
 
 /*
+ * mas_topiary_node() - Dispose of a single node
+ * @mas: The maple state for pushing nodes
+ * @enode: The encoded maple node
+ * @in_rcu: If the tree is in rcu mode
+ *
+ * The node will either be RCU freed or pushed back on the maple state.
+ */
+static inline void mas_topiary_node(struct ma_state *mas,
+		struct maple_enode *enode, bool in_rcu)
+{
+	struct maple_node *tmp;
+
+	if (enode == MAS_NONE)
+		return;
+
+	tmp = mte_to_node(enode);
+	mte_set_node_dead(enode);
+	if (in_rcu)
+		ma_free_rcu(tmp);
+	else
+		mas_push_node(mas, tmp);
+}
+
+/*
+ * mas_topiary_replace() - Replace the data with new data, then repair the
+ * parent links within the new tree.  Iterate over the dead sub-tree and collect
+ * the dead subtrees and topiary the nodes that are no longer of use.
+ *
+ * The new tree will have up to three children with the correct parent.  Keep
+ * track of the new entries as they need to be followed to find the next level
+ * of new entries.
+ *
+ * The old tree will have up to three children with the old parent.  Keep track
+ * of the old entries as they may have more nodes below replaced.  Nodes within
+ * [index, last] are dead subtrees, others need to be freed and followed.
+ *
+ * @mas: The maple state pointing at the new data
+ * @old_enode: The maple encoded node being replaced
+ *
+ */
+static inline void mas_topiary_replace(struct ma_state *mas,
+		struct maple_enode *old_enode)
+{
+	struct ma_state tmp[3], tmp_next[3];
+	MA_TOPIARY(subtrees, mas->tree);
+	bool in_rcu;
+	int i, n;
+
+	/* Place data in tree & then mark node as old */
+	mas_put_in_tree(mas, old_enode);
+
+	/* Update the parent pointers in the tree */
+	tmp[0] = *mas;
+	tmp[0].offset = 0;
+	tmp[1].node = MAS_NONE;
+	tmp[2].node = MAS_NONE;
+	while (!mte_is_leaf(tmp[0].node)) {
+		n = 0;
+		for (i = 0; i < 3; i++) {
+			if (mas_is_none(&tmp[i]))
+				continue;
+
+			while (n < 3) {
+				if (!mas_find_child(&tmp[i], &tmp_next[n]))
+					break;
+				n++;
+			}
+
+			mas_adopt_children(&tmp[i], tmp[i].node);
+		}
+
+		if (MAS_WARN_ON(mas, n == 0))
+			break;
+
+		while (n < 3)
+			tmp_next[n++].node = MAS_NONE;
+
+		for (i = 0; i < 3; i++)
+			tmp[i] = tmp_next[i];
+	}
+
+	/* Collect the old nodes that need to be discarded */
+	if (mte_is_leaf(old_enode))
+		return mas_free(mas, old_enode);
+
+	tmp[0] = *mas;
+	tmp[0].offset = 0;
+	tmp[0].node = old_enode;
+	tmp[1].node = MAS_NONE;
+	tmp[2].node = MAS_NONE;
+	in_rcu = mt_in_rcu(mas->tree);
+	do {
+		n = 0;
+		for (i = 0; i < 3; i++) {
+			if (mas_is_none(&tmp[i]))
+				continue;
+
+			while (n < 3) {
+				if (!mas_find_child(&tmp[i], &tmp_next[n]))
+					break;
+
+				if ((tmp_next[n].min >= tmp_next->index) &&
+				    (tmp_next[n].max <= tmp_next->last)) {
+					mat_add(&subtrees, tmp_next[n].node);
+					tmp_next[n].node = MAS_NONE;
+				} else {
+					n++;
+				}
+			}
+		}
+
+		if (MAS_WARN_ON(mas, n == 0))
+			break;
+
+		while (n < 3)
+			tmp_next[n++].node = MAS_NONE;
+
+		for (i = 0; i < 3; i++) {
+			mas_topiary_node(mas, tmp[i].node, in_rcu);
+			tmp[i] = tmp_next[i];
+		}
+	} while (!mte_is_leaf(tmp[0].node));
+
+	for (i = 0; i < 3; i++)
+		mas_topiary_node(mas, tmp[i].node, in_rcu);
+
+	mas_mat_destroy(mas, &subtrees);
+}
+
+/*
  * mas_wmb_replace() - Write memory barrier and replace
  * @mas: The maple state
- * @free: the maple topiary list of nodes to free
- * @destroy: The maple topiary list of nodes to destroy (walk and free)
+ * @old: The old maple encoded node that is being replaced.
  *
  * Updates gap as necessary.
  */
 static inline void mas_wmb_replace(struct ma_state *mas,
-		struct ma_topiary *free,
-		struct ma_topiary *destroy)
+		struct maple_enode *old_enode)
 {
-	/* All nodes must see old data as dead prior to replacing that data */
-	smp_wmb(); /* Needed for RCU */
-
 	/* Insert the new data in the tree */
-	mas_replace(mas, true);
-
-	if (!mte_is_leaf(mas->node))
-		mas_descend_adopt(mas);
-
-	mas_mat_free(mas, free);
-
-	if (destroy)
-		mas_mat_destroy(mas, destroy);
+	mas_topiary_replace(mas, old_enode);
 
 	if (mte_is_leaf(mas->node))
 		return;
 
 	mas_update_gap(mas);
-}
-
-/*
- * mast_new_root() - Set a new tree root during subtree creation
- * @mast: The maple subtree state
- * @mas: The maple state
- */
-static inline void mast_new_root(struct maple_subtree_state *mast,
-		struct ma_state *mas)
-{
-	mas_mn(mast->l)->parent =
-		ma_parent_ptr(((unsigned long)mas->tree | MA_ROOT_PARENT));
-	if (!mte_dead_node(mast->orig_l->node) &&
-	    !mte_is_root(mast->orig_l->node)) {
-		do {
-			mast_ascend_free(mast);
-			mast_topiary(mast);
-		} while (!mte_is_root(mast->orig_l->node));
-	}
-	if ((mast->orig_l->node != mas->node) &&
-	    (mast->l->depth > mas_mt_height(mas))) {
-		mat_add(mast->free, mas->node);
-	}
 }
 
 /*
···
 	unsigned char split, mid_split;
 	unsigned char slot = 0;
 	struct maple_enode *left = NULL, *middle = NULL, *right = NULL;
+	struct maple_enode *old_enode;
 
 	MA_STATE(l_mas, mas->tree, mas->index, mas->index);
 	MA_STATE(r_mas, mas->tree, mas->index, mas->last);
 	MA_STATE(m_mas, mas->tree, mas->index, mas->index);
-	MA_TOPIARY(free, mas->tree);
-	MA_TOPIARY(destroy, mas->tree);
 
 	/*
	 * The tree needs to be rebalanced and leaves need to be kept at the same level.
···
 	mast->l = &l_mas;
 	mast->m = &m_mas;
 	mast->r = &r_mas;
-	mast->free = &free;
-	mast->destroy = &destroy;
 	l_mas.node = r_mas.node = m_mas.node = MAS_NONE;
 
 	/* Check if this is not root and has sufficient data.  */
···
 	    unlikely(mast->bn->b_end <= mt_min_slots[mast->bn->type]))
 		mast_spanning_rebalance(mast);
 
-	mast->orig_l->depth = 0;
+	l_mas.depth = 0;
 
 	/*
	 * Each level of the tree is examined and balanced, pushing data to the left or
···
	 * original tree and the partially new tree.  To remedy the parent pointers in
	 * the old tree, the new data is swapped into the active tree and a walk down
	 * the tree is performed and the parent pointers are updated.
-	 * See mas_descend_adopt() for more information.
+	 * See mas_topiary_replace() for more information.
	 */
 	while (count--) {
 		mast->bn->b_end--;
···
 		 */
 		memset(mast->bn, 0, sizeof(struct maple_big_node));
 		mast->bn->type = mte_node_type(left);
-		mast->orig_l->depth++;
+		l_mas.depth++;
 
 		/* Root already stored in l->node.  */
 		if (mas_is_root_limits(mast->l))
 			goto new_root;
 
-		mast_ascend_free(mast);
+		mast_ascend(mast);
 		mast_combine_cp_left(mast);
 		l_mas.offset = mast->bn->b_end;
 		mab_set_b_end(mast->bn, &l_mas, left);
···
 
 		/* Copy anything necessary out of the right node. */
 		mast_combine_cp_right(mast);
-		mast_topiary(mast);
 		mast->orig_l->last = mast->orig_l->max;
 
 		if (mast_sufficient(mast))
···
 
 	l_mas.node = mt_mk_node(ma_mnode_ptr(mas_pop_node(mas)),
 				mte_node_type(mast->orig_l->node));
-	mast->orig_l->depth++;
+	l_mas.depth++;
 	mab_mas_cp(mast->bn, 0, mt_slots[mast->bn->type] - 1, &l_mas, true);
 	mas_set_parent(mas, left, l_mas.node, slot);
 	if (middle)
···
 
 	if (mas_is_root_limits(mast->l)) {
 new_root:
-		mast_new_root(mast, mas);
+		mas_mn(mast->l)->parent = ma_parent_ptr(mas_tree_parent(mas));
+		while (!mte_is_root(mast->orig_l->node))
+			mast_ascend(mast);
 	} else {
 		mas_mn(&l_mas)->parent = mas_mn(mast->orig_l)->parent;
 	}
 
-	if (!mte_dead_node(mast->orig_l->node))
-		mat_add(&free, mast->orig_l->node);
-
-	mas->depth = mast->orig_l->depth;
-	*mast->orig_l = l_mas;
-	mte_set_node_dead(mas->node);
-
-	/* Set up mas for insertion. */
-	mast->orig_l->depth = mas->depth;
-	mast->orig_l->alloc = mas->alloc;
-	*mas = *mast->orig_l;
-	mas_wmb_replace(mas, &free, &destroy);
+	old_enode = mast->orig_l->node;
+	mas->depth = l_mas.depth;
+	mas->node = l_mas.node;
+	mas->min = l_mas.min;
+	mas->max = l_mas.max;
+	mas->offset = l_mas.offset;
+	mas_wmb_replace(mas, old_enode);
 	mtree_range_walk(mas);
 	return mast->bn->b_end;
 }
···
	 * tries to combine the data in the same way.  If one node contains the
	 * entire range of the tree, then that node is used as a new root node.
	 */
-	mas_node_count(mas, 1 + empty_count * 3);
+	mas_node_count(mas, empty_count * 2 - 1);
 	if (mas_is_err(mas))
 		return 0;
 
···
 {
 	enum maple_type mt = mte_node_type(mas->node);
 	struct maple_node reuse, *newnode, *parent, *new_left, *left, *node;
-	struct maple_enode *eparent;
+	struct maple_enode *eparent, *old_eparent;
 	unsigned char offset, tmp, split = mt_slots[mt] / 2;
 	void __rcu **l_slots, **slots;
 	unsigned long *l_pivs, *pivs, gap;
···
 
 	l_mas.max = l_pivs[split];
 	mas->min = l_mas.max + 1;
-	eparent = mt_mk_node(mte_parent(l_mas.node),
+	old_eparent = mt_mk_node(mte_parent(l_mas.node),
 			     mas_parent_type(&l_mas, l_mas.node));
 	tmp += end;
 	if (!in_rcu) {
···
 
 		memcpy(node, newnode, sizeof(struct maple_node));
 		ma_set_meta(node, mt, 0, tmp - 1);
-		mte_set_pivot(eparent, mte_parent_slot(l_mas.node),
+		mte_set_pivot(old_eparent, mte_parent_slot(l_mas.node),
 			      l_pivs[split]);
 
 		/* Remove data from l_pivs. */
···
 		memset(l_pivs + tmp, 0, sizeof(unsigned long) * (max_p - tmp));
 		memset(l_slots + tmp, 0, sizeof(void *) * (max_s - tmp));
 		ma_set_meta(left, mt, 0, split);
+		eparent = old_eparent;
 
 		goto done;
 	}
···
 	parent = mas_pop_node(mas);
 	slots = ma_slots(parent, mt);
 	pivs = ma_pivots(parent, mt);
-	memcpy(parent, mte_to_node(eparent), sizeof(struct maple_node));
+	memcpy(parent, mte_to_node(old_eparent), sizeof(struct maple_node));
 	rcu_assign_pointer(slots[offset], mas->node);
 	rcu_assign_pointer(slots[offset - 1], l_mas.node);
 	pivs[offset - 1] = l_mas.max;
···
 	mte_set_gap(eparent, mte_parent_slot(l_mas.node), gap);
 	mas_ascend(mas);
 
-	if (in_rcu)
-		mas_replace(mas, false);
+	if (in_rcu) {
+		mas_replace_node(mas, old_eparent);
+		mas_adopt_children(mas, mas->node);
+	}
 
 	mas_update_gap(mas);
 }
···
 		unsigned char skip)
 {
 	bool cp = true;
-	struct maple_enode *old = mas->node;
 	unsigned char split;
 
 	memset(mast->bn->gap, 0, sizeof(unsigned long) * ARRAY_SIZE(mast->bn->gap));
···
 		cp = false;
 	} else {
 		mas_ascend(mas);
-		mat_add(mast->free, old);
 		mas->offset = mte_parent_slot(mas->node);
 	}
 
···
 	split = mt_slots[mast->bn->type] - 2;
 	if (left) {
 		/* Switch mas to prev node  */
-		mat_add(mast->free, mas->node);
 		*mas = tmp_mas;
 		/* Start using mast->l for the left side. */
 		tmp_mas.node = mast->l->node;
 		*mast->l = tmp_mas;
 	} else {
-		mat_add(mast->free, tmp_mas.node);
 		tmp_mas.node = mast->r->node;
 		*mast->r = tmp_mas;
 		split = slot_total - split;
···
 	struct maple_subtree_state mast;
 	int height = 0;
 	unsigned char mid_split, split = 0;
+	struct maple_enode *old;
 
 	/*
	 * Splitting is handled differently from any other B-tree; the Maple
···
 	MA_STATE(r_mas, mas->tree, mas->index, mas->last);
 	MA_STATE(prev_l_mas, mas->tree, mas->index, mas->last);
 	MA_STATE(prev_r_mas, mas->tree, mas->index, mas->last);
-	MA_TOPIARY(mat, mas->tree);
 
 	trace_ma_op(__func__, mas);
 	mas->depth = mas_mt_height(mas);
···
 	mast.r = &r_mas;
 	mast.orig_l = &prev_l_mas;
 	mast.orig_r = &prev_r_mas;
-	mast.free = &mat;
 	mast.bn = b_node;
 
 	while (height++ <= mas->depth) {
···
 	}
 
 	/* Set the original node as dead */
-	mat_add(mast.free, mas->node);
+	old = mas->node;
 	mas->node = l_mas.node;
-	mas_wmb_replace(mas, mast.free, NULL);
+	mas_wmb_replace(mas, old);
 	mtree_range_walk(mas);
 	return 1;
 }
···
 		struct maple_big_node *b_node, unsigned char end)
 {
 	struct maple_node *node;
+	struct maple_enode *old_enode;
 	unsigned char b_end = b_node->b_end;
 	enum maple_type b_type = b_node->type;
 
+	old_enode = wr_mas->mas->node;
 	if ((b_end < mt_min_slots[b_type]) &&
-	    (!mte_is_root(wr_mas->mas->node)) &&
+	    (!mte_is_root(old_enode)) &&
 	    (mas_mt_height(wr_mas->mas) > 1))
 		return mas_rebalance(wr_mas->mas, b_node);
···
 	node->parent = mas_mn(wr_mas->mas)->parent;
 	wr_mas->mas->node = mt_mk_node(node, b_type);
 	mab_mas_cp(b_node, 0, b_end, wr_mas->mas, false);
-	mas_replace(wr_mas->mas, false);
+	mas_replace_node(wr_mas->mas, old_enode);
 reuse_node:
 	mas_update_gap(wr_mas->mas);
 	return 1;
···
 	node = mas_pop_node(mas);
 	pivots = ma_pivots(node, type);
 	slots = ma_slots(node, type);
-	node->parent = ma_parent_ptr(
-			((unsigned long)mas->tree | MA_ROOT_PARENT));
+	node->parent = ma_parent_ptr(mas_tree_parent(mas));
 	mas->node = mt_mk_node(node, type);
 
 	if (mas->index) {
···
 	return NULL;
 }
 
+static void mte_destroy_walk(struct maple_enode *, struct maple_tree *);
 /*
  * mas_new_root() - Create a new root node that only contains the entry passed
  * in.
···
 	node = mas_pop_node(mas);
 	pivots = ma_pivots(node, type);
 	slots = ma_slots(node, type);
-	node->parent = ma_parent_ptr(
-			((unsigned long)mas->tree | MA_ROOT_PARENT));
+	node->parent = ma_parent_ptr(mas_tree_parent(mas));
 	mas->node = mt_mk_node(node, type);
 	rcu_assign_pointer(slots[0], entry);
 	pivots[0] = mas->last;
···
 	/* Left and Right side of spanning store */
 	MA_STATE(l_mas, NULL, 0, 0);
 	MA_STATE(r_mas, NULL, 0, 0);
-
 	MA_WR_STATE(r_wr_mas, &r_mas, wr_mas->entry);
 	MA_WR_STATE(l_wr_mas, &l_mas, wr_mas->entry);
···
 done:
 	mas_leaf_set_meta(mas, newnode, dst_pivots, maple_leaf_64, new_end);
 	if (in_rcu) {
-		mte_set_node_dead(mas->node);
+		struct maple_enode *old_enode = mas->node;
+
 		mas->node = mt_mk_node(newnode, wr_mas->type);
-		mas_replace(mas, false);
+		mas_replace_node(mas, old_enode);
 	} else {
 		memcpy(wr_mas->node, newnode, sizeof(struct maple_node));
 	}
···
 {
 	struct ma_state *mas = wr_mas->mas;
 	unsigned char offset = mas->offset;
+	void __rcu **slots = wr_mas->slots;
 	bool gap = false;
 
-	if (wr_mas->offset_end - offset != 1)
-		return false;
+	gap |= !mt_slot_locked(mas->tree, slots, offset);
+	gap |= !mt_slot_locked(mas->tree, slots, offset + 1);
 
-	gap |= !mt_slot_locked(mas->tree, wr_mas->slots, offset);
-	gap |= !mt_slot_locked(mas->tree, wr_mas->slots, offset + 1);
-
-	if (mas->index == wr_mas->r_min) {
-		/* Overwriting the range and over a part of the next range. */
-		rcu_assign_pointer(wr_mas->slots[offset], wr_mas->entry);
-		wr_mas->pivots[offset] = mas->last;
-	} else {
-		/* Overwriting a part of the range and over the next range */
-		rcu_assign_pointer(wr_mas->slots[offset + 1], wr_mas->entry);
+	if (wr_mas->offset_end - offset == 1) {
+		if (mas->index == wr_mas->r_min) {
+			/* Overwriting the range and a part of the next one */
+			rcu_assign_pointer(slots[offset], wr_mas->entry);
+			wr_mas->pivots[offset] = mas->last;
+		} else {
+			/* Overwriting a part of the range and the next one */
+			rcu_assign_pointer(slots[offset + 1], wr_mas->entry);
+			wr_mas->pivots[offset] = mas->index - 1;
+			mas->offset++; /* Keep mas accurate. */
+		}
+	} else if (!mt_in_rcu(mas->tree)) {
+		/*
+		 * Expand the range, only partially overwriting the previous and
+		 * next ranges
+		 */
+		gap |= !mt_slot_locked(mas->tree, slots, offset + 2);
+		rcu_assign_pointer(slots[offset + 1], wr_mas->entry);
 		wr_mas->pivots[offset] = mas->index - 1;
+		wr_mas->pivots[offset + 1] = mas->last;
 		mas->offset++; /* Keep mas accurate. */
+	} else {
+		return false;
 	}
 
 	trace_ma_write(__func__, mas, 0, wr_mas->entry);
···
 	mas_update_gap(mas);
 
 	return true;
-}
-
-static inline void mas_wr_end_piv(struct ma_wr_state *wr_mas)
-{
-	while ((wr_mas->offset_end < wr_mas->node_end) &&
-	       (wr_mas->mas->last > wr_mas->pivots[wr_mas->offset_end]))
-		wr_mas->offset_end++;
-
-	if (wr_mas->offset_end < wr_mas->node_end)
-		wr_mas->end_piv = wr_mas->pivots[wr_mas->offset_end];
-	else
-		wr_mas->end_piv = wr_mas->mas->max;
 }
 
 static inline void mas_wr_extend_null(struct ma_wr_state *wr_mas)
···
 	}
 }
 
+static inline void mas_wr_end_piv(struct ma_wr_state *wr_mas)
+{
+	while ((wr_mas->offset_end < wr_mas->node_end) &&
+	       (wr_mas->mas->last > wr_mas->pivots[wr_mas->offset_end]))
+		wr_mas->offset_end++;
+
+	if (wr_mas->offset_end < wr_mas->node_end)
+		wr_mas->end_piv = wr_mas->pivots[wr_mas->offset_end];
+	else
+		wr_mas->end_piv = wr_mas->mas->max;
+
+	if (!wr_mas->entry)
+		mas_wr_extend_null(wr_mas);
+}
+
 static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas)
 {
 	struct ma_state *mas = wr_mas->mas;
···
 /*
  * mas_wr_append: Attempt to append
  * @wr_mas: the maple write state
+ * @new_end: The end of the node after the modification
  *
  * This is currently unsafe in rcu mode since the end of the node may be cached
  * by readers while the node contents may be updated which could result in
···
  *
  * Return: True if appended, false otherwise
  */
-static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
+static inline bool mas_wr_append(struct ma_wr_state *wr_mas,
+		unsigned char new_end)
 {
-	unsigned char end = wr_mas->node_end;
-	unsigned char new_end = end + 1;
-	struct ma_state *mas = wr_mas->mas;
-	unsigned char node_pivots = mt_pivots[wr_mas->type];
+	struct ma_state *mas;
+	void __rcu **slots;
+	unsigned char end;
 
+	mas = wr_mas->mas;
 	if (mt_in_rcu(mas->tree))
 		return false;
 
 	if (mas->offset != wr_mas->node_end)
 		return false;
 
-	if (new_end < node_pivots) {
+	end = wr_mas->node_end;
+	if (mas->offset != end)
+		return false;
+
+	if (new_end < mt_pivots[wr_mas->type]) {
 		wr_mas->pivots[new_end] = wr_mas->pivots[end];
-		ma_set_meta(wr_mas->node, maple_leaf_64, 0, new_end);
+		ma_set_meta(wr_mas->node, wr_mas->type, 0, new_end);
 	}
 
-	if (mas->last == wr_mas->r_max) {
-		/* Append to end of range */
-		rcu_assign_pointer(wr_mas->slots[new_end], wr_mas->entry);
-		wr_mas->pivots[end] = mas->index - 1;
-		mas->offset = new_end;
+	slots = wr_mas->slots;
+	if (new_end == end + 1) {
+		if (mas->last == wr_mas->r_max) {
+			/* Append to end of range */
+			rcu_assign_pointer(slots[new_end], wr_mas->entry);
+			wr_mas->pivots[end] = mas->index - 1;
+			mas->offset = new_end;
+		} else {
+			/* Append to start of range */
+			rcu_assign_pointer(slots[new_end], wr_mas->content);
+			wr_mas->pivots[end] = mas->last;
+			rcu_assign_pointer(slots[end], wr_mas->entry);
+		}
 	} else {
-		/* Append to start of range */
-		rcu_assign_pointer(wr_mas->slots[new_end], wr_mas->content);
-		wr_mas->pivots[end] = mas->last;
-		rcu_assign_pointer(wr_mas->slots[end], wr_mas->entry);
+		/* Append to the range without touching any boundaries. */
+		rcu_assign_pointer(slots[new_end], wr_mas->content);
+		wr_mas->pivots[end + 1] = mas->last;
+		rcu_assign_pointer(slots[end + 1], wr_mas->entry);
+		wr_mas->pivots[end] = mas->index - 1;
+		mas->offset = end + 1;
 	}
 
 	if (!wr_mas->content || !wr_mas->entry)
 		mas_update_gap(mas);
 
+	trace_ma_write(__func__, mas, new_end, wr_mas->entry);
 	return true;
 }
···
 		goto slow_path;
 
 	/* Attempt to append */
-	if (new_end == wr_mas->node_end + 1 && mas_wr_append(wr_mas))
+	if (mas_wr_append(wr_mas, new_end))
 		return;
 
 	if (new_end == wr_mas->node_end && mas_wr_slot_store(wr_mas))
···
 
 	/* At this point, we are at the leaf node that needs to be altered. */
 	mas_wr_end_piv(wr_mas);
-
-	if (!wr_mas->entry)
-		mas_wr_extend_null(wr_mas);
-
 	/* New root for a single pointer */
 	if (unlikely(!mas->index && mas->last == ULONG_MAX)) {
 		mas_new_root(mas, wr_mas->entry);
···
 	min = mas_safe_min(mas, pivots, offset);
 	data_end = ma_data_end(node, type, pivots, mas->max);
 	for (; offset <= data_end; offset++) {
-		pivot = mas_logical_pivot(mas, pivots, offset, type);
+		pivot = mas_safe_pivot(mas, pivots, offset, type);
 
 		/* Not within lower bounds */
 		if (mas->index > pivot)
···
 
 static void mas_wr_store_setup(struct ma_wr_state *wr_mas)
 {
-	if (unlikely(mas_is_paused(wr_mas->mas)))
-		mas_reset(wr_mas->mas);
+	if (mas_is_start(wr_mas->mas))
+		return;
 
-	if (!mas_is_start(wr_mas->mas)) {
-		if (mas_is_none(wr_mas->mas)) {
-			mas_reset(wr_mas->mas);
-		} else {
-			wr_mas->r_max = wr_mas->mas->max;
-			wr_mas->type = mte_node_type(wr_mas->mas->node);
-			if (mas_is_span_wr(wr_mas))
-				mas_reset(wr_mas->mas);
-		}
-	}
+	if (unlikely(mas_is_paused(wr_mas->mas)))
+		goto reset;
+
+	if (unlikely(mas_is_none(wr_mas->mas)))
+		goto reset;
+
+	/*
+	 * A less strict version of mas_is_span_wr() where we allow spanning
+	 * writes within this node.  This is to stop partial walks in
+	 * mas_prealloc() from being reset.
+	 */
+	if (wr_mas->mas->last > wr_mas->mas->max)
+		goto reset;
+
+	if (wr_mas->entry)
+		return;
+
+	if (mte_is_leaf(wr_mas->mas->node) &&
+	    wr_mas->mas->last == wr_mas->mas->max)
+		goto reset;
+
+	return;
+
+reset:
+	mas_reset(wr_mas->mas);
 }
 
 /* Interface */
···
 /**
  * mas_preallocate() - Preallocate enough nodes for a store operation
  * @mas: The maple state
+ * @entry: The entry that will be stored
  * @gfp: The GFP_FLAGS to use for allocations.
  *
  * Return: 0 on success, -ENOMEM if memory could not be allocated.
  */
-int mas_preallocate(struct ma_state *mas, gfp_t gfp)
+int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 {
+	MA_WR_STATE(wr_mas, mas, entry);
+	unsigned char node_size;
+	int request = 1;
 	int ret;
 
-	mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
+
+	if (unlikely(!mas->index && mas->last == ULONG_MAX))
+		goto ask_now;
+
+	mas_wr_store_setup(&wr_mas);
+	wr_mas.content = mas_start(mas);
+	/* Root expand */
+	if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
+		goto ask_now;
+
+	if (unlikely(!mas_wr_walk(&wr_mas))) {
+		/* Spanning store, use worst case for now */
+		request = 1 + mas_mt_height(mas) * 3;
+		goto ask_now;
+	}
+
+	/* At this point, we are at the leaf node that needs to be altered. */
+	/* Exact fit, no nodes needed. */
+	if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
+		return 0;
+
+	mas_wr_end_piv(&wr_mas);
+	node_size = mas_wr_new_end(&wr_mas);
+	if (node_size >= mt_slots[wr_mas.type]) {
+		/* Split, worst case for now. */
+		request = 1 + mas_mt_height(mas) * 2;
+		goto ask_now;
+	}
+
+	/* New root needs a single node */
+	if (unlikely(mte_is_root(mas->node)))
+		goto ask_now;
+
+	/* Potential spanning rebalance collapsing a node, use worst-case */
+	if (node_size - 1 <= mt_min_slots[wr_mas.type])
+		request = mas_mt_height(mas) * 2 - 1;
+
+	/* node store, slot store needs one node */
+ask_now:
+	mas_node_count_gfp(mas, request, gfp);
 	mas->mas_flags |= MA_STATE_PREALLOC;
 	if (likely(!mas_is_err(mas)))
 		return 0;
···
  * @index: The start index
  * @max: The maximum index to check
  *
- * Return: The entry at @index or higher, or %NULL if nothing is found.
+ * Takes RCU read lock internally to protect the search, which does not
+ * protect the returned pointer after dropping RCU read lock.
+ * See also: Documentation/core-api/maple_tree.rst
+ *
+ * Return: The entry higher than @index or %NULL if nothing is found.
  */
 void *mt_next(struct maple_tree *mt, unsigned long index, unsigned long max)
 {
···
  * @index: The start index
  * @min: The minimum index to check
  *
- * Return: The entry at @index or lower, or %NULL if nothing is found.
+ * Takes RCU read lock internally to protect the search, which does not
+ * protect the returned pointer after dropping RCU read lock.
+ * See also: Documentation/core-api/maple_tree.rst
+ *
+ * Return: The entry before @index or %NULL if nothing is found.
  */
 void *mt_prev(struct maple_tree *mt, unsigned long index, unsigned long min)
 {
···
 EXPORT_SYMBOL(mtree_store);
 
 /**
- * mtree_insert_range() - Insert an entry at a give range if there is no value.
+ * mtree_insert_range() - Insert an entry at a given range if there is no value.
  * @mt: The maple tree
  * @first: The start of the range
  * @last: The end of the range
···
 EXPORT_SYMBOL(mtree_insert_range);
 
 /**
- * mtree_insert() - Insert an entry at a give index if there is no value.
+ * mtree_insert() - Insert an entry at a given index if there is no value.
  * @mt: The maple tree
  * @index : The index to store the value
  * @entry: The entry to store
- * @gfp: The FGP_FLAGS to use for allocations.
+ * @gfp: The GFP_FLAGS to use for allocations.
  *
  * Return: 0 on success, -EEXISTS if the range is occupied, -EINVAL on invalid
  * request, -ENOMEM if memory could not be allocated.
···
  * mt_find() - Search from the start up until an entry is found.
  * @mt: The maple tree
  * @index: Pointer which contains the start location of the search
- * @max: The maximum value to check
+ * @max: The maximum value of the search range
  *
- * Handles locking.  @index will be incremented to one beyond the range.
+ * Takes RCU read lock internally to protect the search, which does not
+ * protect the returned pointer after dropping RCU read lock.
+ * See also: Documentation/core-api/maple_tree.rst
+ *
+ * In case that an entry is found, @index is updated to point to the next
+ * possible entry, independent of whether the found entry occupies a
+ * single index or a range of indices.
  *
  * Return: The entry at or after the @index or %NULL
  */
···
  * @index: Pointer which contains the start location of the search
  * @max: The maximum value to check
  *
- * Handles locking, detects wrapping on index == 0
+ * Same as mt_find() except that it checks @index for 0 before
+ * searching.  If @index == 0, the search is aborted.  This covers a wrap
+ * around of @index to 0 in an iterator loop.
  *
  * Return: The entry at or after the @index or %NULL
···
 {
 	return mas_slot(mas, ma_slots(mas_mn(mas), mte_node_type(mas->node)),
 			offset);
-}
-
-
-/*
- * mas_first_entry() - Go the first leaf and find the first entry.
- * @mas: the maple state.
- * @limit: the maximum index to check.
- * @*r_start: Pointer to set to the range start.
- *
- * Sets mas->offset to the offset of the entry, r_start to the range minimum.
- *
- * Return: The first entry or MAS_NONE.
- */
-static inline void *mas_first_entry(struct ma_state *mas, struct maple_node *mn,
-		unsigned long limit, enum maple_type mt)
-
-{
-	unsigned long max;
-	unsigned long *pivots;
-	void __rcu **slots;
-	void *entry = NULL;
-
-	mas->index = mas->min;
-	if (mas->index > limit)
-		goto none;
-
-	max = mas->max;
-	mas->offset = 0;
-	while (likely(!ma_is_leaf(mt))) {
-		MAS_WARN_ON(mas, mte_dead_node(mas->node));
-		slots = ma_slots(mn, mt);
-		entry = mas_slot(mas, slots, 0);
-		pivots = ma_pivots(mn, mt);
-		if (unlikely(ma_dead_node(mn)))
-			return NULL;
-		max = pivots[0];
-		mas->node = entry;
-		mn = mas_mn(mas);
-		mt = mte_node_type(mas->node);
-	}
-	MAS_WARN_ON(mas, mte_dead_node(mas->node));
-
-	mas->max = max;
-	slots = ma_slots(mn, mt);
-	entry = mas_slot(mas, slots, 0);
-	if (unlikely(ma_dead_node(mn)))
-		return NULL;
-
-	/* Slot 0 or 1 must be set */
-	if (mas->index > limit)
-		goto none;
-
-	if (likely(entry))
-		return entry;
-
-	mas->offset = 1;
-	entry = mas_slot(mas, slots, 1);
-	pivots = ma_pivots(mn, mt);
-	if (unlikely(ma_dead_node(mn)))
-		return NULL;
-
-	mas->index = pivots[0] + 1;
-	if (mas->index > limit)
-		goto none;
-
-	if (likely(entry))
-		return entry;
-
-none:
-	if (likely(!ma_dead_node(mn)))
-		mas->node = MAS_NONE;
-	return NULL;
 }
 
 /* Depth first search, post-order */
···
 	int i;
 
 	pr_cont(" contents: ");
-	for (i = 0; i < MAPLE_ARANGE64_SLOTS; i++)
-		pr_cont("%lu ", node->gap[i]);
+	for (i = 0; i < MAPLE_ARANGE64_SLOTS; i++) {
+		switch (format) {
+		case mt_dump_hex:
+			pr_cont("%lx ", node->gap[i]);
+			break;
+		default:
+		case mt_dump_dec:
+			pr_cont("%lu ", node->gap[i]);
+		}
+	}
 	pr_cont("| %02X %02X| ", node->meta.end, node->meta.gap);
-	for (i = 0; i < MAPLE_ARANGE64_SLOTS - 1; i++)
-		pr_cont("%p %lu ", node->slot[i], node->pivot[i]);
+	for (i = 0; i < MAPLE_ARANGE64_SLOTS - 1; i++) {
+		switch (format) {
+		case mt_dump_hex:
+			pr_cont("%p %lX ", node->slot[i], node->pivot[i]);
+			break;
+		default:
+		case mt_dump_dec:
+			pr_cont("%p %lu ", node->slot[i], node->pivot[i]);
+		}
+	}
 	pr_cont("%p\n", node->slot[i]);
 	for (i = 0; i < MAPLE_ARANGE64_SLOTS; i++) {
 		unsigned long last = max;
···
 static void mas_validate_gaps(struct ma_state *mas)
 {
 	struct maple_enode *mte = mas->node;
-	struct maple_node *p_mn;
+	struct maple_node *p_mn, *node = mte_to_node(mte);
+	enum maple_type mt = mte_node_type(mas->node);
 	unsigned long gap = 0, max_gap = 0;
 	unsigned long p_end, p_start = mas->min;
-	unsigned char p_slot;
+	unsigned char p_slot, offset;
 	unsigned long *gaps = NULL;
-	unsigned long *pivots = ma_pivots(mte_to_node(mte), mte_node_type(mte));
-	int i;
+	unsigned long *pivots = ma_pivots(node, mt);
+	unsigned int i;
 
-	if (ma_is_dense(mte_node_type(mte))) {
+	if (ma_is_dense(mt)) {
 		for (i = 0; i < mt_slot_count(mte); i++) {
 			if (mas_get_slot(mas, i)) {
 				if (gap > max_gap)
···
 		goto counted;
 	}
 
-	gaps = ma_gaps(mte_to_node(mte), mte_node_type(mte));
+	gaps = ma_gaps(node, mt);
 	for (i = 0; i < mt_slot_count(mte); i++) {
-		p_end = mas_logical_pivot(mas, pivots, i, mte_node_type(mte));
+		p_end = mas_safe_pivot(mas, pivots, i, mt);
 
 		if (!gaps) {
-			if (mas_get_slot(mas, i)) {
-				gap = 0;
-				goto not_empty;
-			}
-
-			gap += p_end - p_start + 1;
+			if (!mas_get_slot(mas, i))
+				gap = p_end - p_start + 1;
 		} else {
 			void *entry = mas_get_slot(mas, i);
 
 			gap = gaps[i];
-			if (!entry) {
-				if (gap != p_end - p_start + 1) {
-					pr_err("%p[%u] -> %p %lu != %lu - %lu + 1\n",
-						mas_mn(mas), i,
-						mas_get_slot(mas, i), gap,
-						p_end, p_start);
-					mt_dump(mas->tree, mt_dump_hex);
+			MT_BUG_ON(mas->tree, !entry);
 
-					MT_BUG_ON(mas->tree,
-						gap != p_end - p_start + 1);
-				}
-			} else {
-				if (gap > p_end - p_start + 1) {
-					pr_err("%p[%u] %lu >= %lu - %lu + 1 (%lu)\n",
-					mas_mn(mas), i, gap, p_end, p_start,
-					p_end - p_start + 1);
-					MT_BUG_ON(mas->tree,
-						gap > p_end - p_start + 1);
-				}
+			if (gap > p_end - p_start + 1) {
+				pr_err("%p[%u] %lu >= %lu - %lu + 1 (%lu)\n",
+				       mas_mn(mas), i, gap, p_end, p_start,
+				       p_end - p_start + 1);
+				MT_BUG_ON(mas->tree, gap > p_end - p_start + 1);
 			}
 		}
 
 		if (gap > max_gap)
 			max_gap = gap;
-not_empty:
+
 		p_start = p_end + 1;
 		if (p_end >= mas->max)
 			break;
 	}
 
 counted:
+	if (mt == maple_arange_64) {
+		offset = ma_meta_gap(node, mt);
+		if (offset > i) {
+			pr_err("gap offset %p[%u] is invalid\n", node, offset);
+			MT_BUG_ON(mas->tree, 1);
+		}
+
+		if (gaps[offset] != max_gap) {
+			pr_err("gap %p[%u] is not the largest gap %lu\n",
+			       node, offset, max_gap);
+			MT_BUG_ON(mas->tree, 1);
+		}
+
+		MT_BUG_ON(mas->tree, !gaps);
+		for (i++ ; i < mt_slot_count(mte); i++) {
+			if (gaps[i] != 0) {
+				pr_err("gap %p[%u] beyond node limit != 0\n",
+				       node, i);
+				MT_BUG_ON(mas->tree, 1);
+			}
+		}
+	}
+
 	if (mte_is_root(mte))
 		return;
···
 	if (ma_gaps(p_mn, mas_parent_type(mas, mte))[p_slot] != max_gap) {
6892 7011 pr_err("gap %p[%u] != %lu\n", p_mn, p_slot, max_gap); 6893 7012 mt_dump(mas->tree, mt_dump_hex); 7013 + MT_BUG_ON(mas->tree, 1); 6894 7014 } 6895 - 6896 - MT_BUG_ON(mas->tree, 6897 - ma_gaps(p_mn, mas_parent_type(mas, mte))[p_slot] != max_gap); 6898 7015 } 6899 7016 6900 7017 static void mas_validate_parent_slot(struct ma_state *mas) ··· 6943 7064 6944 7065 for (i = 0; i < mt_slots[type]; i++) { 6945 7066 child = mas_slot(mas, slots, i); 6946 - if (!pivots[i] || pivots[i] == mas->max) 6947 - break; 6948 7067 6949 - if (!child) 6950 - break; 7068 + if (!child) { 7069 + pr_err("Non-leaf node lacks child at %p[%u]\n", 7070 + mas_mn(mas), i); 7071 + MT_BUG_ON(mas->tree, 1); 7072 + } 6951 7073 6952 7074 if (mte_parent_slot(child) != i) { 6953 7075 pr_err("Slot error at %p[%u]: child %p has pslot %u\n", ··· 6963 7083 mte_to_node(mas->node)); 6964 7084 MT_BUG_ON(mas->tree, 1); 6965 7085 } 7086 + 7087 + if (i < mt_pivots[type] && pivots[i] == mas->max) 7088 + break; 6966 7089 } 6967 7090 } 6968 7091 6969 7092 /* 6970 - * Validate all pivots are within mas->min and mas->max. 7093 + * Validate all pivots are within mas->min and mas->max, check metadata ends 7094 + * where the maximum ends and ensure there is no slots or pivots set outside of 7095 + * the end of the data. 6971 7096 */ 6972 7097 static void mas_validate_limits(struct ma_state *mas) 6973 7098 { ··· 6982 7097 void __rcu **slots = ma_slots(mte_to_node(mas->node), type); 6983 7098 unsigned long *pivots = ma_pivots(mas_mn(mas), type); 6984 7099 6985 - /* all limits are fine here. 
*/ 6986 - if (mte_is_root(mas->node)) 6987 - return; 6988 - 6989 7100 for (i = 0; i < mt_slots[type]; i++) { 6990 7101 unsigned long piv; 6991 7102 6992 7103 piv = mas_safe_pivot(mas, pivots, i, type); 6993 7104 6994 - if (!piv && (i != 0)) 6995 - break; 6996 - 6997 - if (!mte_is_leaf(mas->node)) { 6998 - void *entry = mas_slot(mas, slots, i); 6999 - 7000 - if (!entry) 7001 - pr_err("%p[%u] cannot be null\n", 7002 - mas_mn(mas), i); 7003 - 7004 - MT_BUG_ON(mas->tree, !entry); 7105 + if (!piv && (i != 0)) { 7106 + pr_err("Missing node limit pivot at %p[%u]", 7107 + mas_mn(mas), i); 7108 + MAS_WARN_ON(mas, 1); 7005 7109 } 7006 7110 7007 7111 if (prev_piv > piv) { ··· 7013 7139 if (piv == mas->max) 7014 7140 break; 7015 7141 } 7142 + 7143 + if (mas_data_end(mas) != i) { 7144 + pr_err("node%p: data_end %u != the last slot offset %u\n", 7145 + mas_mn(mas), mas_data_end(mas), i); 7146 + MT_BUG_ON(mas->tree, 1); 7147 + } 7148 + 7016 7149 for (i += 1; i < mt_slots[type]; i++) { 7017 7150 void *entry = mas_slot(mas, slots, i); 7018 7151 ··· 7094 7213 if (!mas_searchable(&mas)) 7095 7214 goto done; 7096 7215 7097 - mas_first_entry(&mas, mas_mn(&mas), ULONG_MAX, mte_node_type(mas.node)); 7216 + while (!mte_is_leaf(mas.node)) 7217 + mas_descend(&mas); 7218 + 7098 7219 while (!mas_is_none(&mas)) { 7099 7220 MAS_WARN_ON(&mas, mte_dead_node(mas.node)); 7100 - if (!mte_is_root(mas.node)) { 7101 - end = mas_data_end(&mas); 7102 - if (MAS_WARN_ON(&mas, 7103 - (end < mt_min_slot_count(mas.node)) && 7104 - (mas.max != ULONG_MAX))) { 7105 - pr_err("Invalid size %u of %p\n", end, 7106 - mas_mn(&mas)); 7107 - } 7221 + end = mas_data_end(&mas); 7222 + if (MAS_WARN_ON(&mas, (end < mt_min_slot_count(mas.node)) && 7223 + (mas.max != ULONG_MAX))) { 7224 + pr_err("Invalid size %u of %p\n", end, mas_mn(&mas)); 7108 7225 } 7226 + 7109 7227 mas_validate_parent_slot(&mas); 7110 - mas_validate_child_slot(&mas); 7111 7228 mas_validate_limits(&mas); 7229 + mas_validate_child_slot(&mas); 7112 7230 if 
(mt_is_alloc(mt)) 7113 7231 mas_validate_gaps(&mas); 7114 7232 mas_dfs_postorder(&mas, ULONG_MAX);
+141
lib/test_maple_tree.c
··· 44 44 /* #define BENCH_WALK */ 45 45 /* #define BENCH_MT_FOR_EACH */ 46 46 /* #define BENCH_FORK */ 47 + /* #define BENCH_MAS_FOR_EACH */ 48 + /* #define BENCH_MAS_PREV */ 47 49 48 50 #ifdef __KERNEL__ 49 51 #define mt_set_non_kernel(x) do {} while (0) ··· 1159 1157 MT_BUG_ON(mt, !mt_height(mt)); 1160 1158 mtree_destroy(mt); 1161 1159 1160 + /* Check in-place modifications */ 1161 + mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE); 1162 + /* Append to the start of last range */ 1163 + mt_set_non_kernel(50); 1164 + for (i = 0; i <= 500; i++) { 1165 + val = i * 5 + 1; 1166 + val2 = val + 4; 1167 + check_store_range(mt, val, val2, xa_mk_value(val), 0); 1168 + } 1169 + 1170 + /* Append to the last range without touching any boundaries */ 1171 + for (i = 0; i < 10; i++) { 1172 + val = val2 + 5; 1173 + val2 = val + 4; 1174 + check_store_range(mt, val, val2, xa_mk_value(val), 0); 1175 + } 1176 + 1177 + /* Append to the end of last range */ 1178 + val = val2; 1179 + for (i = 0; i < 10; i++) { 1180 + val += 5; 1181 + MT_BUG_ON(mt, mtree_test_store_range(mt, val, ULONG_MAX, 1182 + xa_mk_value(val)) != 0); 1183 + } 1184 + 1185 + /* Overwriting the range and over a part of the next range */ 1186 + for (i = 10; i < 30; i += 2) { 1187 + val = i * 5 + 1; 1188 + val2 = val + 5; 1189 + check_store_range(mt, val, val2, xa_mk_value(val), 0); 1190 + } 1191 + 1192 + /* Overwriting a part of the range and over the next range */ 1193 + for (i = 50; i < 70; i += 2) { 1194 + val2 = i * 5; 1195 + val = val2 - 5; 1196 + check_store_range(mt, val, val2, xa_mk_value(val), 0); 1197 + } 1198 + 1199 + /* 1200 + * Expand the range, only partially overwriting the previous and 1201 + * next ranges 1202 + */ 1203 + for (i = 100; i < 130; i += 3) { 1204 + val = i * 5 - 5; 1205 + val2 = i * 5 + 1; 1206 + check_store_range(mt, val, val2, xa_mk_value(val), 0); 1207 + } 1208 + 1209 + /* 1210 + * Expand the range, only partially overwriting the previous and 1211 + * next ranges, in RCU mode 1212 + */ 1213 + 
mt_set_in_rcu(mt); 1214 + for (i = 150; i < 180; i += 3) { 1215 + val = i * 5 - 5; 1216 + val2 = i * 5 + 1; 1217 + check_store_range(mt, val, val2, xa_mk_value(val), 0); 1218 + } 1219 + 1220 + MT_BUG_ON(mt, !mt_height(mt)); 1221 + mt_validate(mt); 1222 + mt_set_non_kernel(0); 1223 + mtree_destroy(mt); 1224 + 1162 1225 /* Test rebalance gaps */ 1163 1226 mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE); 1164 1227 mt_set_non_kernel(50); ··· 1772 1705 } 1773 1706 #endif 1774 1707 1708 + #if defined(BENCH_MAS_FOR_EACH) 1709 + static noinline void __init bench_mas_for_each(struct maple_tree *mt) 1710 + { 1711 + int i, count = 1000000; 1712 + unsigned long max = 2500; 1713 + void *entry; 1714 + MA_STATE(mas, mt, 0, 0); 1715 + 1716 + for (i = 0; i < max; i += 5) { 1717 + int gap = 4; 1718 + 1719 + if (i % 30 == 0) 1720 + gap = 3; 1721 + mtree_store_range(mt, i, i + gap, xa_mk_value(i), GFP_KERNEL); 1722 + } 1723 + 1724 + rcu_read_lock(); 1725 + for (i = 0; i < count; i++) { 1726 + unsigned long j = 0; 1727 + 1728 + mas_for_each(&mas, entry, max) { 1729 + MT_BUG_ON(mt, entry != xa_mk_value(j)); 1730 + j += 5; 1731 + } 1732 + mas_set(&mas, 0); 1733 + } 1734 + rcu_read_unlock(); 1735 + 1736 + } 1737 + #endif 1738 + #if defined(BENCH_MAS_PREV) 1739 + static noinline void __init bench_mas_prev(struct maple_tree *mt) 1740 + { 1741 + int i, count = 1000000; 1742 + unsigned long max = 2500; 1743 + void *entry; 1744 + MA_STATE(mas, mt, 0, 0); 1745 + 1746 + for (i = 0; i < max; i += 5) { 1747 + int gap = 4; 1748 + 1749 + if (i % 30 == 0) 1750 + gap = 3; 1751 + mtree_store_range(mt, i, i + gap, xa_mk_value(i), GFP_KERNEL); 1752 + } 1753 + 1754 + rcu_read_lock(); 1755 + for (i = 0; i < count; i++) { 1756 + unsigned long j = 2495; 1757 + 1758 + mas_set(&mas, ULONG_MAX); 1759 + while ((entry = mas_prev(&mas, 0)) != NULL) { 1760 + MT_BUG_ON(mt, entry != xa_mk_value(j)); 1761 + j -= 5; 1762 + } 1763 + } 1764 + rcu_read_unlock(); 1765 + 1766 + } 1767 + #endif 1775 1768 /* check_forking - simulate 
the kernel forking sequence with the tree. */ 1776 1769 static noinline void __init check_forking(struct maple_tree *mt) 1777 1770 { ··· 3557 3430 #define BENCH 3558 3431 mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); 3559 3432 bench_mt_for_each(&tree); 3433 + mtree_destroy(&tree); 3434 + goto skip; 3435 + #endif 3436 + #if defined(BENCH_MAS_FOR_EACH) 3437 + #define BENCH 3438 + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); 3439 + bench_mas_for_each(&tree); 3440 + mtree_destroy(&tree); 3441 + goto skip; 3442 + #endif 3443 + #if defined(BENCH_MAS_PREV) 3444 + #define BENCH 3445 + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); 3446 + bench_mas_prev(&tree); 3560 3447 mtree_destroy(&tree); 3561 3448 goto skip; 3562 3449 #endif
+1 -1
lib/test_meminit.c
··· 93 93 int failures = 0, num_tests = 0; 94 94 int i; 95 95 96 - for (i = 0; i < 10; i++) 96 + for (i = 0; i <= MAX_ORDER; i++) 97 97 num_tests += do_alloc_pages_order(i, &failures); 98 98 99 99 REPORT_FAILURES_IN_FN();
+10 -5
mm/Kconfig
··· 25 25 config ZSWAP 26 26 bool "Compressed cache for swap pages" 27 27 depends on SWAP 28 - select FRONTSWAP 29 28 select CRYPTO 30 29 select ZPOOL 31 30 help ··· 503 504 # Select this config option from the architecture Kconfig, if it is preferred 504 505 # to enable the feature of HugeTLB/dev_dax vmemmap optimization. 505 506 # 506 - config ARCH_WANT_OPTIMIZE_VMEMMAP 507 + config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP 508 + bool 509 + 510 + config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP 507 511 bool 508 512 509 513 config HAVE_MEMBLOCK_PHYS_MAP ··· 587 585 depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE 588 586 589 587 endif # MEMORY_HOTPLUG 588 + 589 + config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE 590 + bool 590 591 591 592 # Heavily threaded applications may benefit from splitting the mm-wide 592 593 # page_table_lock, so that faults on different parts of the user address ··· 892 887 config HAVE_SETUP_PER_CPU_AREA 893 888 bool 894 889 895 - config FRONTSWAP 896 - bool 897 - 898 890 config CMA 899 891 bool "Contiguous Memory Allocator" 900 892 depends on MMU ··· 1162 1160 # struct io_mapping based helper. Selected by drivers that need them 1163 1161 config IO_MAPPING 1164 1162 bool 1163 + 1164 + config MEMFD_CREATE 1165 + bool "Enable memfd_create() system call" if EXPERT 1165 1166 1166 1167 config SECRETMEM 1167 1168 default y
-1
mm/Makefile
··· 72 72 endif 73 73 74 74 obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o swap_slots.o 75 - obj-$(CONFIG_FRONTSWAP) += frontswap.o 76 75 obj-$(CONFIG_ZSWAP) += zswap.o 77 76 obj-$(CONFIG_HAS_DMA) += dmapool.o 78 77 obj-$(CONFIG_HUGETLBFS) += hugetlb.o
+1 -5
mm/backing-dev.c
··· 16 16 #include <linux/writeback.h> 17 17 #include <linux/device.h> 18 18 #include <trace/events/writeback.h> 19 + #include "internal.h" 19 20 20 21 struct backing_dev_info noop_backing_dev_info; 21 22 EXPORT_SYMBOL_GPL(noop_backing_dev_info); ··· 34 33 35 34 /* bdi_wq serves all asynchronous writeback tasks */ 36 35 struct workqueue_struct *bdi_wq; 37 - 38 - #define K(x) ((x) << (PAGE_SHIFT - 10)) 39 36 40 37 #ifdef CONFIG_DEBUG_FS 41 38 #include <linux/debugfs.h> ··· 731 732 struct bdi_writeback *wb; 732 733 733 734 might_alloc(gfp); 734 - 735 - if (!memcg_css->parent) 736 - return &bdi->wb; 737 735 738 736 do { 739 737 wb = wb_get_lookup(bdi, memcg_css);
+2 -2
mm/cma.c
··· 436 436 if (!cma || !cma->count || !cma->bitmap) 437 437 goto out; 438 438 439 - pr_debug("%s(cma %p, count %lu, align %d)\n", __func__, (void *)cma, 440 - count, align); 439 + pr_debug("%s(cma %p, name: %s, count %lu, align %d)\n", __func__, 440 + (void *)cma, cma->name, count, align); 441 441 442 442 if (!count) 443 443 goto out;
+67 -38
mm/compaction.c
··· 249 249 250 250 return 0; 251 251 } 252 + 253 + /* 254 + * If the PFN falls into an offline section, return the end PFN of the 255 + * next online section in reverse. If the PFN falls into an online section 256 + * or if there is no next online section in reverse, return 0. 257 + */ 258 + static unsigned long skip_offline_sections_reverse(unsigned long start_pfn) 259 + { 260 + unsigned long start_nr = pfn_to_section_nr(start_pfn); 261 + 262 + if (!start_nr || online_section_nr(start_nr)) 263 + return 0; 264 + 265 + while (start_nr-- > 0) { 266 + if (online_section_nr(start_nr)) 267 + return section_nr_to_pfn(start_nr) + PAGES_PER_SECTION; 268 + } 269 + 270 + return 0; 271 + } 252 272 #else 253 273 static unsigned long skip_offline_sections(unsigned long start_pfn) 274 + { 275 + return 0; 276 + } 277 + 278 + static unsigned long skip_offline_sections_reverse(unsigned long start_pfn) 254 279 { 255 280 return 0; 256 281 } ··· 463 438 { 464 439 struct zone *zone = cc->zone; 465 440 466 - pfn = pageblock_end_pfn(pfn); 467 - 468 441 /* Set for isolation rather than compaction */ 469 442 if (cc->no_set_skip_hint) 470 443 return; 471 444 445 + pfn = pageblock_end_pfn(pfn); 446 + 447 + /* Update where async and sync compaction should restart */ 472 448 if (pfn > zone->compact_cached_migrate_pfn[0]) 473 449 zone->compact_cached_migrate_pfn[0] = pfn; 474 450 if (cc->mode != MIGRATE_ASYNC && ··· 491 465 492 466 set_pageblock_skip(page); 493 467 494 - /* Update where async and sync compaction should restart */ 495 468 if (pfn < zone->compact_cached_free_pfn) 496 469 zone->compact_cached_free_pfn = pfn; 497 470 } ··· 589 564 bool strict) 590 565 { 591 566 int nr_scanned = 0, total_isolated = 0; 592 - struct page *cursor; 567 + struct page *page; 593 568 unsigned long flags = 0; 594 569 bool locked = false; 595 570 unsigned long blockpfn = *start_pfn; ··· 599 574 if (strict) 600 575 stride = 1; 601 576 602 - cursor = pfn_to_page(blockpfn); 577 + page = pfn_to_page(blockpfn); 
603 578 604 579 /* Isolate free pages. */ 605 - for (; blockpfn < end_pfn; blockpfn += stride, cursor += stride) { 580 + for (; blockpfn < end_pfn; blockpfn += stride, page += stride) { 606 581 int isolated; 607 - struct page *page = cursor; 608 582 609 583 /* 610 584 * Periodically drop the lock (if held) regardless of its ··· 628 604 629 605 if (likely(order <= MAX_ORDER)) { 630 606 blockpfn += (1UL << order) - 1; 631 - cursor += (1UL << order) - 1; 607 + page += (1UL << order) - 1; 632 608 nr_scanned += (1UL << order) - 1; 633 609 } 634 610 goto isolate_fail; ··· 665 641 } 666 642 /* Advance to the end of split page */ 667 643 blockpfn += isolated - 1; 668 - cursor += isolated - 1; 644 + page += isolated - 1; 669 645 continue; 670 646 671 647 isolate_fail: 672 648 if (strict) 673 649 break; 674 - else 675 - continue; 676 650 677 651 } 678 652 ··· 737 715 /* Protect pfn from changing by isolate_freepages_block */ 738 716 unsigned long isolate_start_pfn = pfn; 739 717 740 - block_end_pfn = min(block_end_pfn, end_pfn); 741 - 742 718 /* 743 719 * pfn could pass the block_end_pfn if isolated freepage 744 720 * is more than pageblock order. In this case, we adjust ··· 745 725 if (pfn >= block_end_pfn) { 746 726 block_start_pfn = pageblock_start_pfn(pfn); 747 727 block_end_pfn = pageblock_end_pfn(pfn); 748 - block_end_pfn = min(block_end_pfn, end_pfn); 749 728 } 729 + 730 + block_end_pfn = min(block_end_pfn, end_pfn); 750 731 751 732 if (!pageblock_pfn_to_page(block_start_pfn, 752 733 block_end_pfn, cc->zone)) ··· 1097 1076 bool migrate_dirty; 1098 1077 1099 1078 /* 1100 - * Only pages without mappings or that have a 1101 - * ->migrate_folio callback are possible to migrate 1102 - * without blocking. However, we can be racing with 1103 - * truncation so it's necessary to lock the page 1104 - * to stabilise the mapping as truncation holds 1105 - * the page lock until after the page is removed 1106 - * from the page cache. 
1079 + * Only folios without mappings or that have 1080 + * a ->migrate_folio callback are possible to 1081 + * migrate without blocking. However, we may 1082 + * be racing with truncation, which can free 1083 + * the mapping. Truncation holds the folio lock 1084 + * until after the folio is removed from the page 1085 + * cache so holding it ourselves is sufficient. 1107 1086 */ 1108 1087 if (!folio_trylock(folio)) 1109 1088 goto isolate_fail_put; ··· 1141 1120 skip_updated = true; 1142 1121 if (test_and_set_skip(cc, valid_page) && 1143 1122 !cc->finish_pageblock) { 1123 + low_pfn = end_pfn; 1144 1124 goto isolate_abort; 1145 1125 } 1146 1126 } ··· 1443 1421 isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false); 1444 1422 1445 1423 /* Skip this pageblock in the future as it's full or nearly full */ 1446 - if (start_pfn == end_pfn) 1424 + if (start_pfn == end_pfn && !cc->no_set_skip_hint) 1447 1425 set_pageblock_skip(page); 1448 - 1449 - return; 1450 1426 } 1451 1427 1452 1428 /* Search orders in round-robin fashion */ ··· 1521 1501 1522 1502 spin_lock_irqsave(&cc->zone->lock, flags); 1523 1503 freelist = &area->free_list[MIGRATE_MOVABLE]; 1524 - list_for_each_entry_reverse(freepage, freelist, lru) { 1504 + list_for_each_entry_reverse(freepage, freelist, buddy_list) { 1525 1505 unsigned long pfn; 1526 1506 1527 1507 order_scanned++; ··· 1550 1530 break; 1551 1531 } 1552 1532 1553 - /* Use a minimum pfn if a preferred one was not found */ 1533 + /* Use a maximum candidate pfn if a preferred one was not found */ 1554 1534 if (!page && high_pfn) { 1555 1535 page = pfn_to_page(high_pfn); 1556 1536 ··· 1689 1669 1690 1670 page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn, 1691 1671 zone); 1692 - if (!page) 1672 + if (!page) { 1673 + unsigned long next_pfn; 1674 + 1675 + next_pfn = skip_offline_sections_reverse(block_start_pfn); 1676 + if (next_pfn) 1677 + block_start_pfn = max(next_pfn, low_pfn); 1678 + 1693 1679 continue; 1680 + } 1694 
1681 1695 1682 /* Check the block is suitable for migration */ 1696 1683 if (!suitable_migration_target(cc, page)) ··· 1713 1686 1714 1687 /* Update the skip hint if the full pageblock was scanned */ 1715 1688 if (isolate_start_pfn == block_end_pfn) 1716 - update_pageblock_skip(cc, page, block_start_pfn); 1689 + update_pageblock_skip(cc, page, block_start_pfn - 1690 + pageblock_nr_pages); 1717 1691 1718 1692 /* Are enough freepages isolated? */ 1719 1693 if (cc->nr_freepages >= cc->nr_migratepages) { ··· 1912 1884 1913 1885 spin_lock_irqsave(&cc->zone->lock, flags); 1914 1886 freelist = &area->free_list[MIGRATE_MOVABLE]; 1915 - list_for_each_entry(freepage, freelist, lru) { 1887 + list_for_each_entry(freepage, freelist, buddy_list) { 1916 1888 unsigned long free_pfn; 1917 1889 1918 1890 if (nr_scanned++ >= limit) { ··· 1986 1958 block_start_pfn = cc->zone->zone_start_pfn; 1987 1959 1988 1960 /* 1989 - * fast_find_migrateblock marks a pageblock skipped so to avoid 1990 - * the isolation_suitable check below, check whether the fast 1991 - * search was successful. 1961 + * fast_find_migrateblock() has already ensured the pageblock is not 1962 + * set with a skipped flag, so to avoid the isolation_suitable check 1963 + * below again, check whether the fast search was successful. 
1992 1964 */ 1993 1965 fast_find_block = low_pfn != cc->migrate_pfn && !cc->fast_search_fail; 1994 1966 ··· 2142 2114 return score; 2143 2115 } 2144 2116 2145 - static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low) 2117 + static unsigned int fragmentation_score_wmark(bool low) 2146 2118 { 2147 2119 unsigned int wmark_low; 2148 2120 ··· 2162 2134 if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat)) 2163 2135 return false; 2164 2136 2165 - wmark_high = fragmentation_score_wmark(pgdat, false); 2137 + wmark_high = fragmentation_score_wmark(false); 2166 2138 return fragmentation_score_node(pgdat) > wmark_high; 2167 2139 } 2168 2140 ··· 2201 2173 return COMPACT_PARTIAL_SKIPPED; 2202 2174 2203 2175 score = fragmentation_score_zone(cc->zone); 2204 - wmark_low = fragmentation_score_wmark(pgdat, true); 2176 + wmark_low = fragmentation_score_wmark(true); 2205 2177 2206 2178 if (score > wmark_low) 2207 2179 ret = COMPACT_CONTINUE; ··· 2508 2480 goto check_drain; 2509 2481 case ISOLATE_SUCCESS: 2510 2482 update_cached = false; 2511 - last_migrated_pfn = iteration_start_pfn; 2483 + last_migrated_pfn = max(cc->zone->zone_start_pfn, 2484 + pageblock_start_pfn(cc->migrate_pfn - 1)); 2512 2485 } 2513 2486 2514 2487 err = migrate_pages(&cc->migratepages, compaction_alloc, ··· 2532 2503 } 2533 2504 /* 2534 2505 * If an ASYNC or SYNC_LIGHT fails to migrate a page 2535 - * within the current order-aligned block and 2506 + * within the pageblock_order-aligned block and 2536 2507 * fast_find_migrateblock may be used then scan the 2537 2508 * remainder of the pageblock. This will mark the 2538 2509 * pageblock "skip" to avoid rescanning in the near ··· 2898 2869 2899 2870 void compaction_unregister_node(struct node *node) 2900 2871 { 2901 - return device_remove_file(&node->dev, &dev_attr_compact); 2872 + device_remove_file(&node->dev, &dev_attr_compact); 2902 2873 } 2903 2874 #endif /* CONFIG_SYSFS && CONFIG_NUMA */ 2904 2875
+74
mm/damon/core-test.h
··· 341 341 KUNIT_EXPECT_EQ(test, damon_set_attrs(c, &invalid_attrs), -EINVAL); 342 342 } 343 343 344 + static void damos_test_new_filter(struct kunit *test) 345 + { 346 + struct damos_filter *filter; 347 + 348 + filter = damos_new_filter(DAMOS_FILTER_TYPE_ANON, true); 349 + KUNIT_EXPECT_EQ(test, filter->type, DAMOS_FILTER_TYPE_ANON); 350 + KUNIT_EXPECT_EQ(test, filter->matching, true); 351 + KUNIT_EXPECT_PTR_EQ(test, filter->list.prev, &filter->list); 352 + KUNIT_EXPECT_PTR_EQ(test, filter->list.next, &filter->list); 353 + damos_destroy_filter(filter); 354 + } 355 + 356 + static void damos_test_filter_out(struct kunit *test) 357 + { 358 + struct damon_target *t; 359 + struct damon_region *r, *r2; 360 + struct damos_filter *f; 361 + 362 + f = damos_new_filter(DAMOS_FILTER_TYPE_ADDR, true); 363 + f->addr_range = (struct damon_addr_range){ 364 + .start = DAMON_MIN_REGION * 2, .end = DAMON_MIN_REGION * 6}; 365 + 366 + t = damon_new_target(); 367 + r = damon_new_region(DAMON_MIN_REGION * 3, DAMON_MIN_REGION * 5); 368 + damon_add_region(r, t); 369 + 370 + /* region in the range */ 371 + KUNIT_EXPECT_TRUE(test, __damos_filter_out(NULL, t, r, f)); 372 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 1); 373 + 374 + /* region before the range */ 375 + r->ar.start = DAMON_MIN_REGION * 1; 376 + r->ar.end = DAMON_MIN_REGION * 2; 377 + KUNIT_EXPECT_FALSE(test, __damos_filter_out(NULL, t, r, f)); 378 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 1); 379 + 380 + /* region after the range */ 381 + r->ar.start = DAMON_MIN_REGION * 6; 382 + r->ar.end = DAMON_MIN_REGION * 8; 383 + KUNIT_EXPECT_FALSE(test, __damos_filter_out(NULL, t, r, f)); 384 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 1); 385 + 386 + /* region started before the range */ 387 + r->ar.start = DAMON_MIN_REGION * 1; 388 + r->ar.end = DAMON_MIN_REGION * 4; 389 + KUNIT_EXPECT_FALSE(test, __damos_filter_out(NULL, t, r, f)); 390 + /* filter should have split the region */ 391 + KUNIT_EXPECT_EQ(test, r->ar.start, 
DAMON_MIN_REGION * 1); 392 + KUNIT_EXPECT_EQ(test, r->ar.end, DAMON_MIN_REGION * 2); 393 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 2); 394 + r2 = damon_next_region(r); 395 + KUNIT_EXPECT_EQ(test, r2->ar.start, DAMON_MIN_REGION * 2); 396 + KUNIT_EXPECT_EQ(test, r2->ar.end, DAMON_MIN_REGION * 4); 397 + damon_destroy_region(r2, t); 398 + 399 + /* region started in the range */ 400 + r->ar.start = DAMON_MIN_REGION * 2; 401 + r->ar.end = DAMON_MIN_REGION * 8; 402 + KUNIT_EXPECT_TRUE(test, __damos_filter_out(NULL, t, r, f)); 403 + /* filter should have split the region */ 404 + KUNIT_EXPECT_EQ(test, r->ar.start, DAMON_MIN_REGION * 2); 405 + KUNIT_EXPECT_EQ(test, r->ar.end, DAMON_MIN_REGION * 6); 406 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 2); 407 + r2 = damon_next_region(r); 408 + KUNIT_EXPECT_EQ(test, r2->ar.start, DAMON_MIN_REGION * 6); 409 + KUNIT_EXPECT_EQ(test, r2->ar.end, DAMON_MIN_REGION * 8); 410 + damon_destroy_region(r2, t); 411 + 412 + damon_free_target(t); 413 + damos_free_filter(f); 414 + } 415 + 344 416 static struct kunit_case damon_test_cases[] = { 345 417 KUNIT_CASE(damon_test_target), 346 418 KUNIT_CASE(damon_test_regions), ··· 425 353 KUNIT_CASE(damon_test_set_regions), 426 354 KUNIT_CASE(damon_test_update_monitoring_result), 427 355 KUNIT_CASE(damon_test_set_attrs), 356 + KUNIT_CASE(damos_test_new_filter), 357 + KUNIT_CASE(damos_test_filter_out), 428 358 {}, 429 359 }; 430 360
+62
mm/damon/core.c
··· 878 878 s->stat.sz_applied += sz_applied; 879 879 } 880 880 881 + static bool __damos_filter_out(struct damon_ctx *ctx, struct damon_target *t, 882 + struct damon_region *r, struct damos_filter *filter) 883 + { 884 + bool matched = false; 885 + struct damon_target *ti; 886 + int target_idx = 0; 887 + unsigned long start, end; 888 + 889 + switch (filter->type) { 890 + case DAMOS_FILTER_TYPE_TARGET: 891 + damon_for_each_target(ti, ctx) { 892 + if (ti == t) 893 + break; 894 + target_idx++; 895 + } 896 + matched = target_idx == filter->target_idx; 897 + break; 898 + case DAMOS_FILTER_TYPE_ADDR: 899 + start = ALIGN_DOWN(filter->addr_range.start, DAMON_MIN_REGION); 900 + end = ALIGN_DOWN(filter->addr_range.end, DAMON_MIN_REGION); 901 + 902 + /* inside the range */ 903 + if (start <= r->ar.start && r->ar.end <= end) { 904 + matched = true; 905 + break; 906 + } 907 + /* outside of the range */ 908 + if (r->ar.end <= start || end <= r->ar.start) { 909 + matched = false; 910 + break; 911 + } 912 + /* start before the range and overlap */ 913 + if (r->ar.start < start) { 914 + damon_split_region_at(t, r, start - r->ar.start); 915 + matched = false; 916 + break; 917 + } 918 + /* start inside the range */ 919 + damon_split_region_at(t, r, end - r->ar.start); 920 + matched = true; 921 + break; 922 + default: 923 + break; 924 + } 925 + 926 + return matched == filter->matching; 927 + } 928 + 929 + static bool damos_filter_out(struct damon_ctx *ctx, struct damon_target *t, 930 + struct damon_region *r, struct damos *s) 931 + { 932 + struct damos_filter *filter; 933 + 934 + damos_for_each_filter(filter, s) { 935 + if (__damos_filter_out(ctx, t, r, filter)) 936 + return true; 937 + } 938 + return false; 939 + } 940 + 881 941 static void damos_apply_scheme(struct damon_ctx *c, struct damon_target *t, 882 942 struct damon_region *r, struct damos *s) 883 943 { ··· 955 895 goto update_stat; 956 896 damon_split_region_at(t, r, sz); 957 897 } 898 + if (damos_filter_out(c, t, r, s)) 899 
+ return; 958 900 ktime_get_coarse_ts64(&begin); 959 901 if (c->callback.before_damos_apply) 960 902 err = c->callback.before_damos_apply(c, t, r, s);
+1 -1
mm/damon/ops-common.c
··· 54 54 void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr) 55 55 { 56 56 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 57 - struct folio *folio = damon_get_folio(pmd_pfn(*pmd)); 57 + struct folio *folio = damon_get_folio(pmd_pfn(pmdp_get(pmd))); 58 58 59 59 if (!folio) 60 60 return;
+1 -1
mm/damon/paddr.c
··· 94 94 mmu_notifier_test_young(vma->vm_mm, addr); 95 95 } else { 96 96 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 97 - *accessed = pmd_young(*pvmw.pmd) || 97 + *accessed = pmd_young(pmdp_get(pvmw.pmd)) || 98 98 !folio_test_idle(folio) || 99 99 mmu_notifier_test_young(vma->vm_mm, addr); 100 100 #else
+1 -1
mm/damon/sysfs-common.h
··· 47 47 48 48 int damon_sysfs_schemes_update_regions_start( 49 49 struct damon_sysfs_schemes *sysfs_schemes, 50 - struct damon_ctx *ctx); 50 + struct damon_ctx *ctx, bool total_bytes_only); 51 51 52 52 int damon_sysfs_schemes_update_regions_stop(struct damon_ctx *ctx); 53 53
+106 -1
mm/damon/sysfs-schemes.c
··· 117 117 struct kobject kobj; 118 118 struct list_head regions_list; 119 119 int nr_regions; 120 + unsigned long total_bytes; 120 121 }; 121 122 122 123 static struct damon_sysfs_scheme_regions * ··· 129 128 regions->kobj = (struct kobject){}; 130 129 INIT_LIST_HEAD(&regions->regions_list); 131 130 regions->nr_regions = 0; 131 + regions->total_bytes = 0; 132 132 return regions; 133 + } 134 + 135 + static ssize_t total_bytes_show(struct kobject *kobj, 136 + struct kobj_attribute *attr, char *buf) 137 + { 138 + struct damon_sysfs_scheme_regions *regions = container_of(kobj, 139 + struct damon_sysfs_scheme_regions, kobj); 140 + 141 + return sysfs_emit(buf, "%lu\n", regions->total_bytes); 133 142 } 134 143 135 144 static void damon_sysfs_scheme_regions_rm_dirs( ··· 159 148 kfree(container_of(kobj, struct damon_sysfs_scheme_regions, kobj)); 160 149 } 161 150 151 + static struct kobj_attribute damon_sysfs_scheme_regions_total_bytes_attr = 152 + __ATTR_RO_MODE(total_bytes, 0400); 153 + 162 154 static struct attribute *damon_sysfs_scheme_regions_attrs[] = { 155 + &damon_sysfs_scheme_regions_total_bytes_attr.attr, 163 156 NULL, 164 157 }; 165 158 ATTRIBUTE_GROUPS(damon_sysfs_scheme_regions); ··· 282 267 enum damos_filter_type type; 283 268 bool matching; 284 269 char *memcg_path; 270 + struct damon_addr_range addr_range; 271 + int target_idx; 285 272 }; 286 273 287 274 static struct damon_sysfs_scheme_filter *damon_sysfs_scheme_filter_alloc(void) ··· 295 278 static const char * const damon_sysfs_scheme_filter_type_strs[] = { 296 279 "anon", 297 280 "memcg", 281 + "addr", 282 + "target", 298 283 }; 299 284 300 285 static ssize_t type_show(struct kobject *kobj, ··· 377 358 return count; 378 359 } 379 360 361 + static ssize_t addr_start_show(struct kobject *kobj, 362 + struct kobj_attribute *attr, char *buf) 363 + { 364 + struct damon_sysfs_scheme_filter *filter = container_of(kobj, 365 + struct damon_sysfs_scheme_filter, kobj); 366 + 367 + return sysfs_emit(buf, "%lu\n", 
filter->addr_range.start); 368 + } 369 + 370 + static ssize_t addr_start_store(struct kobject *kobj, 371 + struct kobj_attribute *attr, const char *buf, size_t count) 372 + { 373 + struct damon_sysfs_scheme_filter *filter = container_of(kobj, 374 + struct damon_sysfs_scheme_filter, kobj); 375 + int err = kstrtoul(buf, 0, &filter->addr_range.start); 376 + 377 + return err ? err : count; 378 + } 379 + 380 + static ssize_t addr_end_show(struct kobject *kobj, 381 + struct kobj_attribute *attr, char *buf) 382 + { 383 + struct damon_sysfs_scheme_filter *filter = container_of(kobj, 384 + struct damon_sysfs_scheme_filter, kobj); 385 + 386 + return sysfs_emit(buf, "%lu\n", filter->addr_range.end); 387 + } 388 + 389 + static ssize_t addr_end_store(struct kobject *kobj, 390 + struct kobj_attribute *attr, const char *buf, size_t count) 391 + { 392 + struct damon_sysfs_scheme_filter *filter = container_of(kobj, 393 + struct damon_sysfs_scheme_filter, kobj); 394 + int err = kstrtoul(buf, 0, &filter->addr_range.end); 395 + 396 + return err ? err : count; 397 + } 398 + 399 + static ssize_t damon_target_idx_show(struct kobject *kobj, 400 + struct kobj_attribute *attr, char *buf) 401 + { 402 + struct damon_sysfs_scheme_filter *filter = container_of(kobj, 403 + struct damon_sysfs_scheme_filter, kobj); 404 + 405 + return sysfs_emit(buf, "%d\n", filter->target_idx); 406 + } 407 + 408 + static ssize_t damon_target_idx_store(struct kobject *kobj, 409 + struct kobj_attribute *attr, const char *buf, size_t count) 410 + { 411 + struct damon_sysfs_scheme_filter *filter = container_of(kobj, 412 + struct damon_sysfs_scheme_filter, kobj); 413 + int err = kstrtoint(buf, 0, &filter->target_idx); 414 + 415 + return err ? 
err : count; 416 + } 417 + 380 418 static void damon_sysfs_scheme_filter_release(struct kobject *kobj) 381 419 { 382 420 struct damon_sysfs_scheme_filter *filter = container_of(kobj, ··· 452 376 static struct kobj_attribute damon_sysfs_scheme_filter_memcg_path_attr = 453 377 __ATTR_RW_MODE(memcg_path, 0600); 454 378 379 + static struct kobj_attribute damon_sysfs_scheme_filter_addr_start_attr = 380 + __ATTR_RW_MODE(addr_start, 0600); 381 + 382 + static struct kobj_attribute damon_sysfs_scheme_filter_addr_end_attr = 383 + __ATTR_RW_MODE(addr_end, 0600); 384 + 385 + static struct kobj_attribute damon_sysfs_scheme_filter_damon_target_idx_attr = 386 + __ATTR_RW_MODE(damon_target_idx, 0600); 387 + 455 388 static struct attribute *damon_sysfs_scheme_filter_attrs[] = { 456 389 &damon_sysfs_scheme_filter_type_attr.attr, 457 390 &damon_sysfs_scheme_filter_matching_attr.attr, 458 391 &damon_sysfs_scheme_filter_memcg_path_attr.attr, 392 + &damon_sysfs_scheme_filter_addr_start_attr.attr, 393 + &damon_sysfs_scheme_filter_addr_end_attr.attr, 394 + &damon_sysfs_scheme_filter_damon_target_idx_attr.attr, 459 395 NULL, 460 396 }; 461 397 ATTRIBUTE_GROUPS(damon_sysfs_scheme_filter); ··· 1557 1469 damos_destroy_filter(filter); 1558 1470 return err; 1559 1471 } 1472 + } else if (filter->type == DAMOS_FILTER_TYPE_ADDR) { 1473 + if (sysfs_filter->addr_range.end < 1474 + sysfs_filter->addr_range.start) { 1475 + damos_destroy_filter(filter); 1476 + return -EINVAL; 1477 + } 1478 + filter->addr_range = sysfs_filter->addr_range; 1479 + } else if (filter->type == DAMOS_FILTER_TYPE_TARGET) { 1480 + filter->target_idx = sysfs_filter->target_idx; 1560 1481 } 1482 + 1561 1483 damos_add_filter(scheme, filter); 1562 1484 } 1563 1485 return 0; ··· 1718 1620 */ 1719 1621 static struct damon_sysfs_schemes *damon_sysfs_schemes_for_damos_callback; 1720 1622 static int damon_sysfs_schemes_region_idx; 1623 + static bool damos_regions_upd_total_bytes_only; 1721 1624 1722 1625 /* 1723 1626 * DAMON callback 
that called before damos apply. While this callback is ··· 1747 1648 return 0; 1748 1649 1749 1650 sysfs_regions = sysfs_schemes->schemes_arr[schemes_idx]->tried_regions; 1651 + sysfs_regions->total_bytes += r->ar.end - r->ar.start; 1652 + if (damos_regions_upd_total_bytes_only) 1653 + return 0; 1654 + 1750 1655 region = damon_sysfs_scheme_region_alloc(r); 1751 1656 list_add_tail(&region->list, &sysfs_regions->regions_list); 1752 1657 sysfs_regions->nr_regions++; ··· 1781 1678 sysfs_scheme = sysfs_schemes->schemes_arr[schemes_idx++]; 1782 1679 damon_sysfs_scheme_regions_rm_dirs( 1783 1680 sysfs_scheme->tried_regions); 1681 + sysfs_scheme->tried_regions->total_bytes = 0; 1784 1682 } 1785 1683 return 0; 1786 1684 } ··· 1789 1685 /* Called from damon_sysfs_cmd_request_callback under damon_sysfs_lock */ 1790 1686 int damon_sysfs_schemes_update_regions_start( 1791 1687 struct damon_sysfs_schemes *sysfs_schemes, 1792 - struct damon_ctx *ctx) 1688 + struct damon_ctx *ctx, bool total_bytes_only) 1793 1689 { 1794 1690 damon_sysfs_schemes_clear_regions(sysfs_schemes, ctx); 1795 1691 damon_sysfs_schemes_for_damos_callback = sysfs_schemes; 1692 + damos_regions_upd_total_bytes_only = total_bytes_only; 1796 1693 ctx->callback.before_damos_apply = damon_sysfs_before_damos_apply; 1797 1694 return 0; 1798 1695 }
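The new "addr" filter type above validates its range before the filter is attached to a scheme: an end address below the start is rejected with -EINVAL and the filter is destroyed. A minimal userspace sketch of that check (struct and function names here are illustrative, not the kernel's):

```c
#include <errno.h>

/* Hypothetical userspace analogue (not kernel code) of the range check
 * the series adds for DAMOS "addr" filters: a filter whose end lies
 * below its start is rejected with -EINVAL. */
struct addr_range {
	unsigned long start;
	unsigned long end;
};

static int validate_addr_filter(const struct addr_range *r)
{
	/* mirrors the damos_destroy_filter()/-EINVAL path above */
	if (r->end < r->start)
		return -EINVAL;
	return 0;
}
```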
+20 -6
mm/damon/sysfs.c
··· 1000 1000 */ 1001 1001 DAMON_SYSFS_CMD_UPDATE_SCHEMES_STATS, 1002 1002 /* 1003 + * @DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_BYTES: Update 1004 + * tried_regions/total_bytes sysfs files for each scheme. 1005 + */ 1006 + DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_BYTES, 1007 + /* 1003 1008 * @DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_REGIONS: Update schemes tried 1004 1009 * regions 1005 1010 */ ··· 1026 1021 "off", 1027 1022 "commit", 1028 1023 "update_schemes_stats", 1024 + "update_schemes_tried_bytes", 1029 1025 "update_schemes_tried_regions", 1030 1026 "clear_schemes_tried_regions", 1031 1027 }; ··· 1212 1206 { 1213 1207 struct damon_target *t, *next; 1214 1208 struct damon_sysfs_kdamond *kdamond; 1209 + enum damon_sysfs_cmd cmd; 1215 1210 1216 1211 /* damon_sysfs_schemes_update_regions_stop() might not yet called */ 1217 1212 kdamond = damon_sysfs_cmd_request.kdamond; 1218 - if (kdamond && damon_sysfs_cmd_request.cmd == 1219 - DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_REGIONS && 1220 - ctx == kdamond->damon_ctx) { 1213 + cmd = damon_sysfs_cmd_request.cmd; 1214 + if (kdamond && ctx == kdamond->damon_ctx && 1215 + (cmd == DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_REGIONS || 1216 + cmd == DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_BYTES)) { 1221 1217 damon_sysfs_schemes_update_regions_stop(ctx); 1222 1218 mutex_unlock(&damon_sysfs_lock); 1223 1219 } ··· 1256 1248 } 1257 1249 1258 1250 static int damon_sysfs_upd_schemes_regions_start( 1259 - struct damon_sysfs_kdamond *kdamond) 1251 + struct damon_sysfs_kdamond *kdamond, bool total_bytes_only) 1260 1252 { 1261 1253 struct damon_ctx *ctx = kdamond->damon_ctx; 1262 1254 1263 1255 if (!ctx) 1264 1256 return -EINVAL; 1265 1257 return damon_sysfs_schemes_update_regions_start( 1266 - kdamond->contexts->contexts_arr[0]->schemes, ctx); 1258 + kdamond->contexts->contexts_arr[0]->schemes, ctx, 1259 + total_bytes_only); 1267 1260 } 1268 1261 1269 1262 static int damon_sysfs_upd_schemes_regions_stop( ··· 1341 1332 { 1342 1333 struct damon_sysfs_kdamond 
*kdamond; 1343 1334 static bool damon_sysfs_schemes_regions_updating; 1335 + bool total_bytes_only = false; 1344 1336 int err = 0; 1345 1337 1346 1338 /* avoid deadlock due to concurrent state_store('off') */ ··· 1358 1348 case DAMON_SYSFS_CMD_COMMIT: 1359 1349 err = damon_sysfs_commit_input(kdamond); 1360 1350 break; 1351 + case DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_BYTES: 1352 + total_bytes_only = true; 1353 + fallthrough; 1361 1354 case DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_REGIONS: 1362 1355 if (!damon_sysfs_schemes_regions_updating) { 1363 - err = damon_sysfs_upd_schemes_regions_start(kdamond); 1356 + err = damon_sysfs_upd_schemes_regions_start(kdamond, 1357 + total_bytes_only); 1364 1358 if (!err) { 1365 1359 damon_sysfs_schemes_regions_updating = true; 1366 1360 goto keep_lock_out;
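The new DAMON_SYSFS_CMD_UPDATE_SCHEMES_TRIED_BYTES command above deliberately shares the tried_regions code path: it only sets total_bytes_only and falls through. A plain-C sketch of that switch shape (enum and function names are invented for illustration):

```c
#include <stdbool.h>

/* Illustrative sketch (not kernel code) of the fallthrough used in the
 * command dispatch above: the tried_bytes command reuses the
 * tried_regions start path, but flags that only total_bytes counters
 * should be collected. */
enum sketch_cmd {
	CMD_UPDATE_TRIED_BYTES,
	CMD_UPDATE_TRIED_REGIONS,
	CMD_OTHER,
};

static bool total_bytes_only_for(enum sketch_cmd cmd)
{
	bool total_bytes_only = false;

	switch (cmd) {
	case CMD_UPDATE_TRIED_BYTES:
		total_bytes_only = true;
		/* fall through: same update path as tried_regions */
	case CMD_UPDATE_TRIED_REGIONS:
		break;
	default:
		break;
	}
	return total_bytes_only;
}
```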
+15 -8
mm/damon/vaddr.c
··· 301 301 unsigned long next, struct mm_walk *walk) 302 302 { 303 303 pte_t *pte; 304 + pmd_t pmde; 304 305 spinlock_t *ptl; 305 306 306 - if (pmd_trans_huge(*pmd)) { 307 + if (pmd_trans_huge(pmdp_get(pmd))) { 307 308 ptl = pmd_lock(walk->mm, pmd); 308 - if (!pmd_present(*pmd)) { 309 + pmde = pmdp_get(pmd); 310 + 311 + if (!pmd_present(pmde)) { 309 312 spin_unlock(ptl); 310 313 return 0; 311 314 } 312 315 313 - if (pmd_trans_huge(*pmd)) { 316 + if (pmd_trans_huge(pmde)) { 314 317 damon_pmdp_mkold(pmd, walk->vma, addr); 315 318 spin_unlock(ptl); 316 319 return 0; ··· 443 440 struct damon_young_walk_private *priv = walk->private; 444 441 445 442 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 446 - if (pmd_trans_huge(*pmd)) { 443 + if (pmd_trans_huge(pmdp_get(pmd))) { 444 + pmd_t pmde; 445 + 447 446 ptl = pmd_lock(walk->mm, pmd); 448 - if (!pmd_present(*pmd)) { 447 + pmde = pmdp_get(pmd); 448 + 449 + if (!pmd_present(pmde)) { 449 450 spin_unlock(ptl); 450 451 return 0; 451 452 } 452 453 453 - if (!pmd_trans_huge(*pmd)) { 454 + if (!pmd_trans_huge(pmde)) { 454 455 spin_unlock(ptl); 455 456 goto regular_page; 456 457 } 457 - folio = damon_get_folio(pmd_pfn(*pmd)); 458 + folio = damon_get_folio(pmd_pfn(pmde)); 458 459 if (!folio) 459 460 goto huge_out; 460 - if (pmd_young(*pmd) || !folio_test_idle(folio) || 461 + if (pmd_young(pmde) || !folio_test_idle(folio) || 461 462 mmu_notifier_test_young(walk->mm, 462 463 addr)) 463 464 priv->young = true;
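The vaddr.c hunks switch from dereferencing *pmd repeatedly to taking one pmdp_get() snapshot and testing the local copy, so the present and trans-huge checks all see the same value even if the slot changes underneath. A userspace analogue of that pattern (the entry layout and flag bits below are invented for illustration):

```c
/* One read of the slot; every later check uses this stable copy, the
 * way the code above uses pmde = pmdp_get(pmd) under the PMD lock.
 * SK_* names are stand-ins, not kernel symbols. */
#define SK_PRESENT 0x1UL
#define SK_HUGE    0x2UL

struct sk_entry {
	unsigned long val;
};

/* 0: not present, 1: regular mapping, 2: huge mapping */
static int sk_classify(const struct sk_entry *slot)
{
	unsigned long e = slot->val;	/* single snapshot */

	if (!(e & SK_PRESENT))
		return 0;
	if (e & SK_HUGE)
		return 2;
	return 1;
}
```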
+8 -10
mm/debug_vm_pgtable.c
··· 302 302 unsigned long val = idx, *ptr = &val; 303 303 pud_t pud; 304 304 305 - if (!has_transparent_hugepage()) 305 + if (!has_transparent_pud_hugepage()) 306 306 return; 307 307 308 308 pr_debug("Validating PUD basic (%pGv)\n", ptr); ··· 343 343 unsigned long vaddr = args->vaddr; 344 344 pud_t pud; 345 345 346 - if (!has_transparent_hugepage()) 346 + if (!has_transparent_pud_hugepage()) 347 347 return; 348 348 349 349 page = (args->pud_pfn != ULONG_MAX) ? pfn_to_page(args->pud_pfn) : NULL; ··· 385 385 WARN_ON(!(pud_write(pud) && pud_dirty(pud))); 386 386 387 387 #ifndef __PAGETABLE_PMD_FOLDED 388 - pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1); 388 + pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1); 389 389 pud = READ_ONCE(*args->pudp); 390 390 WARN_ON(!pud_none(pud)); 391 391 #endif /* __PAGETABLE_PMD_FOLDED */ ··· 405 405 { 406 406 pud_t pud; 407 407 408 - if (!has_transparent_hugepage()) 408 + if (!has_transparent_pud_hugepage()) 409 409 return; 410 410 411 411 pr_debug("Validating PUD leaf\n"); ··· 732 732 { 733 733 pud_t pud; 734 734 735 - if (!has_transparent_hugepage()) 735 + if (!has_transparent_pud_hugepage()) 736 736 return; 737 737 738 738 pr_debug("Validating PUD devmap\n"); ··· 981 981 { 982 982 pud_t pud; 983 983 984 - if (!has_transparent_hugepage()) 984 + if (!has_transparent_pud_hugepage()) 985 985 return; 986 986 987 987 pr_debug("Validating PUD based THP\n"); ··· 1022 1022 1023 1023 /* Free (huge) page */ 1024 1024 if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && 1025 - IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && 1026 - has_transparent_hugepage() && 1025 + has_transparent_pud_hugepage() && 1027 1026 args->pud_pfn != ULONG_MAX) { 1028 1027 if (args->is_contiguous_page) { 1029 1028 free_contig_range(args->pud_pfn, ··· 1273 1274 * if we fail to allocate (huge) pages. 
1274 1275 */ 1275 1276 if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && 1276 - IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && 1277 - has_transparent_hugepage()) { 1277 + has_transparent_pud_hugepage()) { 1278 1278 page = debug_vm_pgtable_alloc_huge_page(args, 1279 1279 HPAGE_PUD_SHIFT - PAGE_SHIFT); 1280 1280 if (page) {
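The debug_vm_pgtable.c hunks replace each has_transparent_hugepage() + IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) pair with the single has_transparent_pud_hugepage() predicate. A sketch of why folding both conditions into one helper keeps call sites honest (the flag and probe below are stand-ins, not kernel symbols):

```c
#include <stdbool.h>

#define SK_ARCH_HAS_PUD_THP 1	/* stand-in for the Kconfig option */

static bool sk_cpu_has_thp(void)
{
	return true;		/* stand-in for the runtime capability probe */
}

/* Call sites test one predicate instead of pairing a config check with
 * a runtime check, so neither half can be forgotten. */
static bool sk_has_pud_hugepage(void)
{
	return SK_ARCH_HAS_PUD_THP && sk_cpu_has_thp();
}
```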
+97 -78
mm/filemap.c
··· 1669 1669 1670 1670 /* 1671 1671 * Return values: 1672 - * true - folio is locked; mmap_lock is still held. 1673 - * false - folio is not locked. 1674 - * mmap_lock has been released (mmap_read_unlock(), unless flags had both 1675 - * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in 1676 - * which case mmap_lock is still held. 1672 + * 0 - folio is locked. 1673 + * non-zero - folio is not locked. 1674 + * mmap_lock or per-VMA lock has been released (mmap_read_unlock() or 1675 + * vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and 1676 + * FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held. 1677 1677 * 1678 - * If neither ALLOW_RETRY nor KILLABLE are set, will always return true 1679 - * with the folio locked and the mmap_lock unperturbed. 1678 + * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0 1679 + * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed. 1680 1680 */ 1681 - bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm, 1682 - unsigned int flags) 1681 + vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf) 1683 1682 { 1683 + unsigned int flags = vmf->flags; 1684 + 1684 1685 if (fault_flag_allow_retry_first(flags)) { 1685 1686 /* 1686 - * CAUTION! In this case, mmap_lock is not released 1687 - * even though return 0. 1687 + * CAUTION! In this case, mmap_lock/per-VMA lock is not 1688 + * released even though returning VM_FAULT_RETRY. 
1688 1689 */ 1689 1690 if (flags & FAULT_FLAG_RETRY_NOWAIT) 1690 - return false; 1691 + return VM_FAULT_RETRY; 1691 1692 1692 - mmap_read_unlock(mm); 1693 + release_fault_lock(vmf); 1693 1694 if (flags & FAULT_FLAG_KILLABLE) 1694 1695 folio_wait_locked_killable(folio); 1695 1696 else 1696 1697 folio_wait_locked(folio); 1697 - return false; 1698 + return VM_FAULT_RETRY; 1698 1699 } 1699 1700 if (flags & FAULT_FLAG_KILLABLE) { 1700 1701 bool ret; 1701 1702 1702 1703 ret = __folio_lock_killable(folio); 1703 1704 if (ret) { 1704 - mmap_read_unlock(mm); 1705 - return false; 1705 + release_fault_lock(vmf); 1706 + return VM_FAULT_RETRY; 1706 1707 } 1707 1708 } else { 1708 1709 __folio_lock(folio); 1709 1710 } 1710 1711 1711 - return true; 1712 + return 0; 1712 1713 } 1713 1714 1714 1715 /** ··· 2081 2080 if (!xa_is_value(folio)) { 2082 2081 if (folio->index < *start) 2083 2082 goto put; 2084 - if (folio->index + folio_nr_pages(folio) - 1 > end) 2083 + if (folio_next_index(folio) - 1 > end) 2085 2084 goto put; 2086 2085 if (!folio_trylock(folio)) 2087 2086 goto put; ··· 2173 2172 } 2174 2173 EXPORT_SYMBOL(filemap_get_folios); 2175 2174 2176 - static inline 2177 - bool folio_more_pages(struct folio *folio, pgoff_t index, pgoff_t max) 2178 - { 2179 - if (!folio_test_large(folio) || folio_test_hugetlb(folio)) 2180 - return false; 2181 - if (index >= max) 2182 - return false; 2183 - return index < folio->index + folio_nr_pages(folio) - 1; 2184 - } 2185 - 2186 2175 /** 2187 2176 * filemap_get_folios_contig - Get a batch of contiguous folios 2188 2177 * @mapping: The address_space to search ··· 2238 2247 if (folio_test_hugetlb(folio)) 2239 2248 *start = folio->index + 1; 2240 2249 else 2241 - *start = folio->index + folio_nr_pages(folio); 2250 + *start = folio_next_index(folio); 2242 2251 } 2243 2252 out: 2244 2253 rcu_read_unlock(); ··· 2355 2364 break; 2356 2365 if (folio_test_readahead(folio)) 2357 2366 break; 2358 - xas_advance(&xas, folio->index + folio_nr_pages(folio) - 
1); 2367 + xas_advance(&xas, folio_next_index(folio) - 1); 2359 2368 continue; 2360 2369 put_folio: 2361 2370 folio_put(folio); ··· 2628 2637 int i, error = 0; 2629 2638 bool writably_mapped; 2630 2639 loff_t isize, end_offset; 2640 + loff_t last_pos = ra->prev_pos; 2631 2641 2632 2642 if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes)) 2633 2643 return 0; ··· 2679 2687 * When a read accesses the same folio several times, only 2680 2688 * mark it as accessed the first time. 2681 2689 */ 2682 - if (!pos_same_folio(iocb->ki_pos, ra->prev_pos - 1, 2683 - fbatch.folios[0])) 2690 + if (!pos_same_folio(iocb->ki_pos, last_pos - 1, 2691 + fbatch.folios[0])) 2684 2692 folio_mark_accessed(fbatch.folios[0]); 2685 2693 2686 2694 for (i = 0; i < folio_batch_count(&fbatch); i++) { ··· 2707 2715 2708 2716 already_read += copied; 2709 2717 iocb->ki_pos += copied; 2710 - ra->prev_pos = iocb->ki_pos; 2718 + last_pos = iocb->ki_pos; 2711 2719 2712 2720 if (copied < bytes) { 2713 2721 error = -EFAULT; ··· 2721 2729 } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error); 2722 2730 2723 2731 file_accessed(filp); 2724 - 2732 + ra->prev_pos = last_pos; 2725 2733 return already_read ? already_read : error; 2726 2734 } 2727 2735 EXPORT_SYMBOL_GPL(filemap_read); ··· 3431 3439 return false; 3432 3440 } 3433 3441 3434 - static struct folio *next_uptodate_page(struct folio *folio, 3435 - struct address_space *mapping, 3436 - struct xa_state *xas, pgoff_t end_pgoff) 3442 + static struct folio *next_uptodate_folio(struct xa_state *xas, 3443 + struct address_space *mapping, pgoff_t end_pgoff) 3437 3444 { 3445 + struct folio *folio = xas_next_entry(xas, end_pgoff); 3438 3446 unsigned long max_idx; 3439 3447 3440 3448 do { ··· 3472 3480 return NULL; 3473 3481 } 3474 3482 3475 - static inline struct folio *first_map_page(struct address_space *mapping, 3476 - struct xa_state *xas, 3477 - pgoff_t end_pgoff) 3483 + /* 3484 + * Map page range [start_page, start_page + nr_pages) of folio. 
3485 + * start_page is gotten from start by folio_page(folio, start) 3486 + */ 3487 + static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf, 3488 + struct folio *folio, unsigned long start, 3489 + unsigned long addr, unsigned int nr_pages) 3478 3490 { 3479 - return next_uptodate_page(xas_find(xas, end_pgoff), 3480 - mapping, xas, end_pgoff); 3481 - } 3491 + vm_fault_t ret = 0; 3492 + struct vm_area_struct *vma = vmf->vma; 3493 + struct file *file = vma->vm_file; 3494 + struct page *page = folio_page(folio, start); 3495 + unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); 3496 + unsigned int count = 0; 3497 + pte_t *old_ptep = vmf->pte; 3482 3498 3483 - static inline struct folio *next_map_page(struct address_space *mapping, 3484 - struct xa_state *xas, 3485 - pgoff_t end_pgoff) 3486 - { 3487 - return next_uptodate_page(xas_next_entry(xas, end_pgoff), 3488 - mapping, xas, end_pgoff); 3499 + do { 3500 + if (PageHWPoison(page + count)) 3501 + goto skip; 3502 + 3503 + if (mmap_miss > 0) 3504 + mmap_miss--; 3505 + 3506 + /* 3507 + * NOTE: If there're PTE markers, we'll leave them to be 3508 + * handled in the specific fault path, and it'll prohibit the 3509 + * fault-around logic. 
3510 + */ 3511 + if (!pte_none(vmf->pte[count])) 3512 + goto skip; 3513 + 3514 + count++; 3515 + continue; 3516 + skip: 3517 + if (count) { 3518 + set_pte_range(vmf, folio, page, count, addr); 3519 + folio_ref_add(folio, count); 3520 + if (in_range(vmf->address, addr, count)) 3521 + ret = VM_FAULT_NOPAGE; 3522 + } 3523 + 3524 + count++; 3525 + page += count; 3526 + vmf->pte += count; 3527 + addr += count * PAGE_SIZE; 3528 + count = 0; 3529 + } while (--nr_pages > 0); 3530 + 3531 + if (count) { 3532 + set_pte_range(vmf, folio, page, count, addr); 3533 + folio_ref_add(folio, count); 3534 + if (in_range(vmf->address, addr, count)) 3535 + ret = VM_FAULT_NOPAGE; 3536 + } 3537 + 3538 + vmf->pte = old_ptep; 3539 + WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); 3540 + 3541 + return ret; 3489 3542 } 3490 3543 3491 3544 vm_fault_t filemap_map_pages(struct vm_fault *vmf, ··· 3543 3506 unsigned long addr; 3544 3507 XA_STATE(xas, &mapping->i_pages, start_pgoff); 3545 3508 struct folio *folio; 3546 - struct page *page; 3547 - unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); 3548 3509 vm_fault_t ret = 0; 3510 + int nr_pages = 0; 3549 3511 3550 3512 rcu_read_lock(); 3551 - folio = first_map_page(mapping, &xas, end_pgoff); 3513 + folio = next_uptodate_folio(&xas, mapping, end_pgoff); 3552 3514 if (!folio) 3553 3515 goto out; 3554 3516 ··· 3564 3528 goto out; 3565 3529 } 3566 3530 do { 3567 - again: 3568 - page = folio_file_page(folio, xas.xa_index); 3569 - if (PageHWPoison(page)) 3570 - goto unlock; 3571 - 3572 - if (mmap_miss > 0) 3573 - mmap_miss--; 3531 + unsigned long end; 3574 3532 3575 3533 addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT; 3576 3534 vmf->pte += xas.xa_index - last_pgoff; 3577 3535 last_pgoff = xas.xa_index; 3536 + end = folio->index + folio_nr_pages(folio) - 1; 3537 + nr_pages = min(end, end_pgoff) - xas.xa_index + 1; 3578 3538 3579 3539 /* 3580 3540 * NOTE: If there're PTE markers, we'll leave them to be ··· 3580 3548 if 
(!pte_none(ptep_get(vmf->pte))) 3581 3549 goto unlock; 3582 3550 3583 - /* We're about to handle the fault */ 3584 - if (vmf->address == addr) 3585 - ret = VM_FAULT_NOPAGE; 3551 + ret |= filemap_map_folio_range(vmf, folio, 3552 + xas.xa_index - folio->index, addr, nr_pages); 3586 3553 3587 - do_set_pte(vmf, page, addr); 3588 - /* no need to invalidate: a not-present page won't be cached */ 3589 - update_mmu_cache(vma, addr, vmf->pte); 3590 - if (folio_more_pages(folio, xas.xa_index, end_pgoff)) { 3591 - xas.xa_index++; 3592 - folio_ref_inc(folio); 3593 - goto again; 3594 - } 3595 - folio_unlock(folio); 3596 - continue; 3597 3554 unlock: 3598 - if (folio_more_pages(folio, xas.xa_index, end_pgoff)) { 3599 - xas.xa_index++; 3600 - goto again; 3601 - } 3602 3555 folio_unlock(folio); 3603 3556 folio_put(folio); 3604 - } while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL); 3557 + folio = next_uptodate_folio(&xas, mapping, end_pgoff); 3558 + } while (folio); 3605 3559 pte_unmap_unlock(vmf->pte, vmf->ptl); 3606 3560 out: 3607 3561 rcu_read_unlock(); 3608 - WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); 3609 3562 return ret; 3610 3563 } 3611 3564 EXPORT_SYMBOL(filemap_map_pages); ··· 4094 4077 struct address_space * const mapping = folio->mapping; 4095 4078 4096 4079 BUG_ON(!folio_test_locked(folio)); 4080 + if (!folio_needs_release(folio)) 4081 + return true; 4097 4082 if (folio_test_writeback(folio)) 4098 4083 return false; 4099 4084
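filemap_map_folio_range() above replaces the per-page do_set_pte() loop with batched runs: consecutive mappable pages are flushed as one set_pte_range() call, and a run is broken at HWPoison pages or non-none PTEs. A small userspace sketch of that run-batching shape, with set_pte_range() replaced by a counter (names are illustrative):

```c
#include <stdbool.h>

/* Count how many range operations a batched walk performs: each maximal
 * run of consecutive "mappable" slots costs one call, mirroring one
 * set_pte_range() per run in the code above. */
static int sk_count_range_ops(const bool *mappable, int nr)
{
	int count = 0, ops = 0;

	for (int i = 0; i < nr; i++) {
		if (mappable[i]) {
			count++;
			continue;
		}
		if (count) {	/* flush the run before skipping */
			ops++;
			count = 0;
		}
	}
	if (count)		/* trailing run */
		ops++;
	return ops;
}
```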
-283
mm/frontswap.c
··· 1 - // SPDX-License-Identifier: GPL-2.0-only 2 - /* 3 - * Frontswap frontend 4 - * 5 - * This code provides the generic "frontend" layer to call a matching 6 - * "backend" driver implementation of frontswap. See 7 - * Documentation/mm/frontswap.rst for more information. 8 - * 9 - * Copyright (C) 2009-2012 Oracle Corp. All rights reserved. 10 - * Author: Dan Magenheimer 11 - */ 12 - 13 - #include <linux/mman.h> 14 - #include <linux/swap.h> 15 - #include <linux/swapops.h> 16 - #include <linux/security.h> 17 - #include <linux/module.h> 18 - #include <linux/debugfs.h> 19 - #include <linux/frontswap.h> 20 - #include <linux/swapfile.h> 21 - 22 - DEFINE_STATIC_KEY_FALSE(frontswap_enabled_key); 23 - 24 - /* 25 - * frontswap_ops are added by frontswap_register_ops, and provide the 26 - * frontswap "backend" implementation functions. Multiple implementations 27 - * may be registered, but implementations can never deregister. This 28 - * is a simple singly-linked list of all registered implementations. 29 - */ 30 - static const struct frontswap_ops *frontswap_ops __read_mostly; 31 - 32 - #ifdef CONFIG_DEBUG_FS 33 - /* 34 - * Counters available via /sys/kernel/debug/frontswap (if debugfs is 35 - * properly configured). These are for information only so are not protected 36 - * against increment races. 
37 - */ 38 - static u64 frontswap_loads; 39 - static u64 frontswap_succ_stores; 40 - static u64 frontswap_failed_stores; 41 - static u64 frontswap_invalidates; 42 - 43 - static inline void inc_frontswap_loads(void) 44 - { 45 - data_race(frontswap_loads++); 46 - } 47 - static inline void inc_frontswap_succ_stores(void) 48 - { 49 - data_race(frontswap_succ_stores++); 50 - } 51 - static inline void inc_frontswap_failed_stores(void) 52 - { 53 - data_race(frontswap_failed_stores++); 54 - } 55 - static inline void inc_frontswap_invalidates(void) 56 - { 57 - data_race(frontswap_invalidates++); 58 - } 59 - #else 60 - static inline void inc_frontswap_loads(void) { } 61 - static inline void inc_frontswap_succ_stores(void) { } 62 - static inline void inc_frontswap_failed_stores(void) { } 63 - static inline void inc_frontswap_invalidates(void) { } 64 - #endif 65 - 66 - /* 67 - * Due to the asynchronous nature of the backends loading potentially 68 - * _after_ the swap system has been activated, we have chokepoints 69 - * on all frontswap functions to not call the backend until the backend 70 - * has registered. 71 - * 72 - * This would not guards us against the user deciding to call swapoff right as 73 - * we are calling the backend to initialize (so swapon is in action). 74 - * Fortunately for us, the swapon_mutex has been taken by the callee so we are 75 - * OK. The other scenario where calls to frontswap_store (called via 76 - * swap_writepage) is racing with frontswap_invalidate_area (called via 77 - * swapoff) is again guarded by the swap subsystem. 78 - * 79 - * While no backend is registered all calls to frontswap_[store|load| 80 - * invalidate_area|invalidate_page] are ignored or fail. 81 - * 82 - * The time between the backend being registered and the swap file system 83 - * calling the backend (via the frontswap_* functions) is indeterminate as 84 - * frontswap_ops is not atomic_t (or a value guarded by a spinlock). 
85 - * That is OK as we are comfortable missing some of these calls to the newly 86 - * registered backend. 87 - * 88 - * Obviously the opposite (unloading the backend) must be done after all 89 - * the frontswap_[store|load|invalidate_area|invalidate_page] start 90 - * ignoring or failing the requests. However, there is currently no way 91 - * to unload a backend once it is registered. 92 - */ 93 - 94 - /* 95 - * Register operations for frontswap 96 - */ 97 - int frontswap_register_ops(const struct frontswap_ops *ops) 98 - { 99 - if (frontswap_ops) 100 - return -EINVAL; 101 - 102 - frontswap_ops = ops; 103 - static_branch_inc(&frontswap_enabled_key); 104 - return 0; 105 - } 106 - 107 - /* 108 - * Called when a swap device is swapon'd. 109 - */ 110 - void frontswap_init(unsigned type, unsigned long *map) 111 - { 112 - struct swap_info_struct *sis = swap_info[type]; 113 - 114 - VM_BUG_ON(sis == NULL); 115 - 116 - /* 117 - * p->frontswap is a bitmap that we MUST have to figure out which page 118 - * has gone in frontswap. Without it there is no point of continuing. 119 - */ 120 - if (WARN_ON(!map)) 121 - return; 122 - /* 123 - * Irregardless of whether the frontswap backend has been loaded 124 - * before this function or it will be later, we _MUST_ have the 125 - * p->frontswap set to something valid to work properly. 
126 - */ 127 - frontswap_map_set(sis, map); 128 - 129 - if (!frontswap_enabled()) 130 - return; 131 - frontswap_ops->init(type); 132 - } 133 - 134 - static bool __frontswap_test(struct swap_info_struct *sis, 135 - pgoff_t offset) 136 - { 137 - if (sis->frontswap_map) 138 - return test_bit(offset, sis->frontswap_map); 139 - return false; 140 - } 141 - 142 - static inline void __frontswap_set(struct swap_info_struct *sis, 143 - pgoff_t offset) 144 - { 145 - set_bit(offset, sis->frontswap_map); 146 - atomic_inc(&sis->frontswap_pages); 147 - } 148 - 149 - static inline void __frontswap_clear(struct swap_info_struct *sis, 150 - pgoff_t offset) 151 - { 152 - clear_bit(offset, sis->frontswap_map); 153 - atomic_dec(&sis->frontswap_pages); 154 - } 155 - 156 - /* 157 - * "Store" data from a page to frontswap and associate it with the page's 158 - * swaptype and offset. Page must be locked and in the swap cache. 159 - * If frontswap already contains a page with matching swaptype and 160 - * offset, the frontswap implementation may either overwrite the data and 161 - * return success or invalidate the page from frontswap and return failure. 162 - */ 163 - int __frontswap_store(struct page *page) 164 - { 165 - int ret = -1; 166 - swp_entry_t entry = { .val = page_private(page), }; 167 - int type = swp_type(entry); 168 - struct swap_info_struct *sis = swap_info[type]; 169 - pgoff_t offset = swp_offset(entry); 170 - 171 - VM_BUG_ON(!frontswap_ops); 172 - VM_BUG_ON(!PageLocked(page)); 173 - VM_BUG_ON(sis == NULL); 174 - 175 - /* 176 - * If a dup, we must remove the old page first; we can't leave the 177 - * old page no matter if the store of the new page succeeds or fails, 178 - * and we can't rely on the new page replacing the old page as we may 179 - * not store to the same implementation that contains the old page. 
180 - */ 181 - if (__frontswap_test(sis, offset)) { 182 - __frontswap_clear(sis, offset); 183 - frontswap_ops->invalidate_page(type, offset); 184 - } 185 - 186 - ret = frontswap_ops->store(type, offset, page); 187 - if (ret == 0) { 188 - __frontswap_set(sis, offset); 189 - inc_frontswap_succ_stores(); 190 - } else { 191 - inc_frontswap_failed_stores(); 192 - } 193 - 194 - return ret; 195 - } 196 - 197 - /* 198 - * "Get" data from frontswap associated with swaptype and offset that were 199 - * specified when the data was put to frontswap and use it to fill the 200 - * specified page with data. Page must be locked and in the swap cache. 201 - */ 202 - int __frontswap_load(struct page *page) 203 - { 204 - int ret = -1; 205 - swp_entry_t entry = { .val = page_private(page), }; 206 - int type = swp_type(entry); 207 - struct swap_info_struct *sis = swap_info[type]; 208 - pgoff_t offset = swp_offset(entry); 209 - bool exclusive = false; 210 - 211 - VM_BUG_ON(!frontswap_ops); 212 - VM_BUG_ON(!PageLocked(page)); 213 - VM_BUG_ON(sis == NULL); 214 - 215 - if (!__frontswap_test(sis, offset)) 216 - return -1; 217 - 218 - /* Try loading from each implementation, until one succeeds. */ 219 - ret = frontswap_ops->load(type, offset, page, &exclusive); 220 - if (ret == 0) { 221 - inc_frontswap_loads(); 222 - if (exclusive) { 223 - SetPageDirty(page); 224 - __frontswap_clear(sis, offset); 225 - } 226 - } 227 - return ret; 228 - } 229 - 230 - /* 231 - * Invalidate any data from frontswap associated with the specified swaptype 232 - * and offset so that a subsequent "get" will fail. 
233 - */ 234 - void __frontswap_invalidate_page(unsigned type, pgoff_t offset) 235 - { 236 - struct swap_info_struct *sis = swap_info[type]; 237 - 238 - VM_BUG_ON(!frontswap_ops); 239 - VM_BUG_ON(sis == NULL); 240 - 241 - if (!__frontswap_test(sis, offset)) 242 - return; 243 - 244 - frontswap_ops->invalidate_page(type, offset); 245 - __frontswap_clear(sis, offset); 246 - inc_frontswap_invalidates(); 247 - } 248 - 249 - /* 250 - * Invalidate all data from frontswap associated with all offsets for the 251 - * specified swaptype. 252 - */ 253 - void __frontswap_invalidate_area(unsigned type) 254 - { 255 - struct swap_info_struct *sis = swap_info[type]; 256 - 257 - VM_BUG_ON(!frontswap_ops); 258 - VM_BUG_ON(sis == NULL); 259 - 260 - if (sis->frontswap_map == NULL) 261 - return; 262 - 263 - frontswap_ops->invalidate_area(type); 264 - atomic_set(&sis->frontswap_pages, 0); 265 - bitmap_zero(sis->frontswap_map, sis->max); 266 - } 267 - 268 - static int __init init_frontswap(void) 269 - { 270 - #ifdef CONFIG_DEBUG_FS 271 - struct dentry *root = debugfs_create_dir("frontswap", NULL); 272 - if (root == NULL) 273 - return -ENXIO; 274 - debugfs_create_u64("loads", 0444, root, &frontswap_loads); 275 - debugfs_create_u64("succ_stores", 0444, root, &frontswap_succ_stores); 276 - debugfs_create_u64("failed_stores", 0444, root, 277 - &frontswap_failed_stores); 278 - debugfs_create_u64("invalidates", 0444, root, &frontswap_invalidates); 279 - #endif 280 - return 0; 281 - } 282 - 283 - module_init(init_frontswap);
+49 -43
mm/gup.c
··· 811 811 struct follow_page_context *ctx) 812 812 { 813 813 pgd_t *pgd; 814 - struct page *page; 815 814 struct mm_struct *mm = vma->vm_mm; 816 815 817 816 ctx->page_mask = 0; ··· 819 820 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use 820 821 * special hugetlb page table walking code. This eliminates the 821 822 * need to check for hugetlb entries in the general walking code. 822 - * 823 - * hugetlb_follow_page_mask is only for follow_page() handling here. 824 - * Ordinary GUP uses follow_hugetlb_page for hugetlb processing. 825 823 */ 826 - if (is_vm_hugetlb_page(vma)) { 827 - page = hugetlb_follow_page_mask(vma, address, flags); 828 - if (!page) 829 - page = no_page_table(vma, flags); 830 - return page; 831 - } 824 + if (is_vm_hugetlb_page(vma)) 825 + return hugetlb_follow_page_mask(vma, address, flags, 826 + &ctx->page_mask); 832 827 833 828 pgd = pgd_offset(mm, address); 834 829 ··· 1208 1215 if (!vma && in_gate_area(mm, start)) { 1209 1216 ret = get_gate_page(mm, start & PAGE_MASK, 1210 1217 gup_flags, &vma, 1211 - pages ? &pages[i] : NULL); 1218 + pages ? &page : NULL); 1212 1219 if (ret) 1213 1220 goto out; 1214 1221 ctx.page_mask = 0; ··· 1222 1229 ret = check_vma_flags(vma, gup_flags); 1223 1230 if (ret) 1224 1231 goto out; 1225 - 1226 - if (is_vm_hugetlb_page(vma)) { 1227 - i = follow_hugetlb_page(mm, vma, pages, 1228 - &start, &nr_pages, i, 1229 - gup_flags, locked); 1230 - if (!*locked) { 1231 - /* 1232 - * We've got a VM_FAULT_RETRY 1233 - * and we've lost mmap_lock. 1234 - * We must stop here. 
1235 - */ 1236 - BUG_ON(gup_flags & FOLL_NOWAIT); 1237 - goto out; 1238 - } 1239 - continue; 1240 - } 1241 1232 } 1242 1233 retry: 1243 1234 /* ··· 1262 1285 ret = PTR_ERR(page); 1263 1286 goto out; 1264 1287 } 1265 - 1266 - goto next_page; 1267 1288 } else if (IS_ERR(page)) { 1268 1289 ret = PTR_ERR(page); 1269 1290 goto out; 1270 - } 1271 - if (pages) { 1272 - pages[i] = page; 1273 - flush_anon_page(vma, page, start); 1274 - flush_dcache_page(page); 1275 - ctx.page_mask = 0; 1276 1291 } 1277 1292 next_page: 1278 1293 page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask); 1279 1294 if (page_increm > nr_pages) 1280 1295 page_increm = nr_pages; 1296 + 1297 + if (pages) { 1298 + struct page *subpage; 1299 + unsigned int j; 1300 + 1301 + /* 1302 + * This must be a large folio (and doesn't need to 1303 + * be the whole folio; it can be part of it), do 1304 + * the refcount work for all the subpages too. 1305 + * 1306 + * NOTE: here the page may not be the head page 1307 + * e.g. when start addr is not thp-size aligned. 1308 + * try_grab_folio() should have taken care of tail 1309 + * pages. 1310 + */ 1311 + if (page_increm > 1) { 1312 + struct folio *folio; 1313 + 1314 + /* 1315 + * Since we already hold refcount on the 1316 + * large folio, this should never fail. 1317 + */ 1318 + folio = try_grab_folio(page, page_increm - 1, 1319 + foll_flags); 1320 + if (WARN_ON_ONCE(!folio)) { 1321 + /* 1322 + * Release the 1st page ref if the 1323 + * folio is problematic, fail hard. 
1324 + */ 1325 + gup_put_folio(page_folio(page), 1, 1326 + foll_flags); 1327 + ret = -EFAULT; 1328 + goto out; 1329 + } 1330 + } 1331 + 1332 + for (j = 0; j < page_increm; j++) { 1333 + subpage = nth_page(page, j); 1334 + pages[i + j] = subpage; 1335 + flush_anon_page(vma, subpage, start + j * PAGE_SIZE); 1336 + flush_dcache_page(subpage); 1337 + } 1338 + } 1339 + 1281 1340 i += page_increm; 1282 1341 start += page_increm * PAGE_SIZE; 1283 1342 nr_pages -= page_increm; ··· 2244 2231 gup_flags |= FOLL_UNLOCKABLE; 2245 2232 } 2246 2233 2247 - /* 2248 - * For now, always trigger NUMA hinting faults. Some GUP users like 2249 - * KVM require the hint to be as the calling context of GUP is 2250 - * functionally similar to a memory reference from task context. 2251 - */ 2252 - gup_flags |= FOLL_HONOR_NUMA_FAULT; 2253 - 2254 2234 /* FOLL_GET and FOLL_PIN are mutually exclusive. */ 2255 2235 if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) == 2256 2236 (FOLL_PIN | FOLL_GET))) ··· 2600 2594 if (!folio) 2601 2595 goto pte_unmap; 2602 2596 2603 - if (unlikely(page_is_secretmem(page))) { 2597 + if (unlikely(folio_is_secretmem(folio))) { 2604 2598 gup_put_folio(folio, 1, flags); 2605 2599 goto pte_unmap; 2606 2600 }
+52 -75
mm/huge_memory.c
··· 577 577 } 578 578 #endif 579 579 580 - void prep_transhuge_page(struct page *page) 580 + void folio_prep_large_rmappable(struct folio *folio) 581 581 { 582 - struct folio *folio = (struct folio *)page; 583 - 584 582 VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); 585 583 INIT_LIST_HEAD(&folio->_deferred_list); 586 - folio_set_compound_dtor(folio, TRANSHUGE_PAGE_DTOR); 584 + folio_set_large_rmappable(folio); 587 585 } 588 586 589 - static inline bool is_transparent_hugepage(struct page *page) 587 + static inline bool is_transparent_hugepage(struct folio *folio) 590 588 { 591 - struct folio *folio; 592 - 593 - if (!PageCompound(page)) 589 + if (!folio_test_large(folio)) 594 590 return false; 595 591 596 - folio = page_folio(page); 597 592 return is_huge_zero_page(&folio->page) || 598 - folio->_folio_dtor == TRANSHUGE_PAGE_DTOR; 593 + folio_test_large_rmappable(folio); 599 594 } 600 595 601 596 static unsigned long __thp_get_unmapped_area(struct file *filp, ··· 1975 1980 if (!ptl) 1976 1981 return 0; 1977 1982 1978 - pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); 1983 + pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm); 1979 1984 tlb_remove_pud_tlb_entry(tlb, pud, addr); 1980 1985 if (vma_is_special_huge(vma)) { 1981 1986 spin_unlock(ptl); ··· 1997 2002 1998 2003 count_vm_event(THP_SPLIT_PUD); 1999 2004 2000 - pudp_huge_clear_flush_notify(vma, haddr, pud); 2005 + pudp_huge_clear_flush(vma, haddr, pud); 2001 2006 } 2002 2007 2003 2008 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, ··· 2017 2022 2018 2023 out: 2019 2024 spin_unlock(ptl); 2020 - /* 2021 - * No need to double call mmu_notifier->invalidate_range() callback as 2022 - * the above pudp_huge_clear_flush_notify() did already call it. 
2023 - */ 2024 - mmu_notifier_invalidate_range_only_end(&range); 2025 + mmu_notifier_invalidate_range_end(&range); 2025 2026 } 2026 2027 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 2027 2028 ··· 2084 2093 count_vm_event(THP_SPLIT_PMD); 2085 2094 2086 2095 if (!vma_is_anonymous(vma)) { 2087 - old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd); 2096 + old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd); 2088 2097 /* 2089 2098 * We are going to unmap this huge page. So 2090 2099 * just go ahead and zap it ··· 2114 2123 if (is_huge_zero_pmd(*pmd)) { 2115 2124 /* 2116 2125 * FIXME: Do we want to invalidate secondary mmu by calling 2117 - * mmu_notifier_invalidate_range() see comments below inside 2118 - * __split_huge_pmd() ? 2126 + * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below 2127 + * inside __split_huge_pmd() ? 2119 2128 * 2120 2129 * We are going from a zero huge page write protected to zero 2121 2130 * small page also write protected so it does not seems useful ··· 2245 2254 entry = pte_mksoft_dirty(entry); 2246 2255 if (uffd_wp) 2247 2256 entry = pte_mkuffd_wp(entry); 2248 - page_add_anon_rmap(page + i, vma, addr, false); 2257 + page_add_anon_rmap(page + i, vma, addr, RMAP_NONE); 2249 2258 } 2250 2259 VM_BUG_ON(!pte_none(ptep_get(pte))); 2251 2260 set_pte_at(mm, addr, pte, entry); ··· 2294 2303 2295 2304 out: 2296 2305 spin_unlock(ptl); 2297 - /* 2298 - * No need to double call mmu_notifier->invalidate_range() callback. 2299 - * They are 3 cases to consider inside __split_huge_pmd_locked(): 2300 - * 1) pmdp_huge_clear_flush_notify() call invalidate_range() obvious 2301 - * 2) __split_huge_zero_page_pmd() read only zero page and any write 2302 - * fault will trigger a flush_notify before pointing to a new page 2303 - * (it is fine if the secondary mmu keeps pointing to the old zero 2304 - * page in the meantime) 2305 - * 3) Split a huge pmd into pte pointing to the same page. 
No need 2306 - * to invalidate secondary tlb entry they are all still valid. 2307 - * any further changes to individual pte will notify. So no need 2308 - * to call mmu_notifier->invalidate_range() 2309 - */ 2310 - mmu_notifier_invalidate_range_only_end(&range); 2306 + mmu_notifier_invalidate_range_end(&range); 2311 2307 } 2312 2308 2313 2309 void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, ··· 2401 2423 } 2402 2424 } 2403 2425 2404 - static void __split_huge_page_tail(struct page *head, int tail, 2426 + static void __split_huge_page_tail(struct folio *folio, int tail, 2405 2427 struct lruvec *lruvec, struct list_head *list) 2406 2428 { 2429 + struct page *head = &folio->page; 2407 2430 struct page *page_tail = head + tail; 2431 + /* 2432 + * Careful: new_folio is not a "real" folio before we cleared PageTail. 2433 + * Don't pass it around before clear_compound_head(). 2434 + */ 2435 + struct folio *new_folio = (struct folio *)page_tail; 2408 2436 2409 2437 VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail); 2410 2438 ··· 2452 2468 page_tail->index = head->index + tail; 2453 2469 2454 2470 /* 2455 - * page->private should not be set in tail pages with the exception 2456 - * of swap cache pages that store the swp_entry_t in tail pages. 2457 - * Fix up and warn once if private is unexpectedly set. 2458 - * 2459 - * What of 32-bit systems, on which folio->_pincount overlays 2460 - * head[1].private? No problem: THP_SWAP is not enabled on 32-bit, and 2461 - * pincount must be 0 for folio_ref_freeze() to have succeeded. 2471 + * page->private should not be set in tail pages. Fix up and warn once 2472 + * if private is unexpectedly set. 
2462 2473 */ 2463 - if (!folio_test_swapcache(page_folio(head))) { 2464 - VM_WARN_ON_ONCE_PAGE(page_tail->private != 0, page_tail); 2474 + if (unlikely(page_tail->private)) { 2475 + VM_WARN_ON_ONCE_PAGE(true, page_tail); 2465 2476 page_tail->private = 0; 2466 2477 } 2478 + if (folio_test_swapcache(folio)) 2479 + new_folio->swap.val = folio->swap.val + tail; 2467 2480 2468 2481 /* Page flags must be visible before we make the page non-compound. */ 2469 2482 smp_wmb(); ··· 2506 2525 /* complete memcg works before add pages to LRU */ 2507 2526 split_page_memcg(head, nr); 2508 2527 2509 - if (PageAnon(head) && PageSwapCache(head)) { 2510 - swp_entry_t entry = { .val = page_private(head) }; 2511 - 2512 - offset = swp_offset(entry); 2513 - swap_cache = swap_address_space(entry); 2528 + if (folio_test_anon(folio) && folio_test_swapcache(folio)) { 2529 + offset = swp_offset(folio->swap); 2530 + swap_cache = swap_address_space(folio->swap); 2514 2531 xa_lock(&swap_cache->i_pages); 2515 2532 } 2516 2533 ··· 2518 2539 ClearPageHasHWPoisoned(head); 2519 2540 2520 2541 for (i = nr - 1; i >= 1; i--) { 2521 - __split_huge_page_tail(head, i, lruvec, list); 2542 + __split_huge_page_tail(folio, i, lruvec, list); 2522 2543 /* Some pages can be beyond EOF: drop them from page cache */ 2523 2544 if (head[i].index >= end) { 2524 2545 struct folio *tail = page_folio(head + i); ··· 2565 2586 shmem_uncharge(head->mapping->host, nr_dropped); 2566 2587 remap_page(folio, nr); 2567 2588 2568 - if (PageSwapCache(head)) { 2569 - swp_entry_t entry = { .val = page_private(head) }; 2570 - 2571 - split_swap_cluster(entry); 2572 - } 2589 + if (folio_test_swapcache(folio)) 2590 + split_swap_cluster(folio->swap); 2573 2591 2574 2592 for (i = 0; i < nr; i++) { 2575 2593 struct page *subpage = head + i; ··· 2674 2698 gfp = current_gfp_context(mapping_gfp_mask(mapping) & 2675 2699 GFP_RECLAIM_MASK); 2676 2700 2677 - if (folio_test_private(folio) && 2678 - !filemap_release_folio(folio, gfp)) { 2701 + if 
(!filemap_release_folio(folio, gfp)) { 2679 2702 ret = -EBUSY; 2680 2703 goto out; 2681 2704 } ··· 2771 2796 return ret; 2772 2797 } 2773 2798 2774 - void free_transhuge_page(struct page *page) 2799 + void folio_undo_large_rmappable(struct folio *folio) 2775 2800 { 2776 - struct folio *folio = (struct folio *)page; 2777 - struct deferred_split *ds_queue = get_deferred_split_queue(folio); 2801 + struct deferred_split *ds_queue; 2778 2802 unsigned long flags; 2779 2803 2780 2804 /* ··· 2781 2807 * deferred_list. If folio is not in deferred_list, it's safe 2782 2808 * to check without acquiring the split_queue_lock. 2783 2809 */ 2784 - if (data_race(!list_empty(&folio->_deferred_list))) { 2785 - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); 2786 - if (!list_empty(&folio->_deferred_list)) { 2787 - ds_queue->split_queue_len--; 2788 - list_del(&folio->_deferred_list); 2789 - } 2790 - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); 2810 + if (data_race(list_empty(&folio->_deferred_list))) 2811 + return; 2812 + 2813 + ds_queue = get_deferred_split_queue(folio); 2814 + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); 2815 + if (!list_empty(&folio->_deferred_list)) { 2816 + ds_queue->split_queue_len--; 2817 + list_del(&folio->_deferred_list); 2791 2818 } 2792 - free_compound_page(page); 2819 + spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); 2793 2820 } 2794 2821 2795 2822 void deferred_split_folio(struct folio *folio) ··· 3009 3034 for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) { 3010 3035 struct vm_area_struct *vma = vma_lookup(mm, addr); 3011 3036 struct page *page; 3037 + struct folio *folio; 3012 3038 3013 3039 if (!vma) 3014 3040 break; ··· 3026 3050 if (IS_ERR_OR_NULL(page)) 3027 3051 continue; 3028 3052 3029 - if (!is_transparent_hugepage(page)) 3053 + folio = page_folio(page); 3054 + if (!is_transparent_hugepage(folio)) 3030 3055 goto next; 3031 3056 3032 3057 total++; 3033 - if (!can_split_folio(page_folio(page), 
NULL)) 3058 + if (!can_split_folio(folio, NULL)) 3034 3059 goto next; 3035 3060 3036 - if (!trylock_page(page)) 3061 + if (!folio_trylock(folio)) 3037 3062 goto next; 3038 3063 3039 - if (!split_huge_page(page)) 3064 + if (!split_folio(folio)) 3040 3065 split++; 3041 3066 3042 - unlock_page(page); 3067 + folio_unlock(folio); 3043 3068 next: 3044 - put_page(page); 3069 + folio_put(folio); 3045 3070 cond_resched(); 3046 3071 } 3047 3072 mmap_read_unlock(mm);
+126 -335
mm/hugetlb.c
··· 34 34 #include <linux/nospec.h> 35 35 #include <linux/delayacct.h> 36 36 #include <linux/memory.h> 37 + #include <linux/mm_inline.h> 37 38 38 39 #include <asm/page.h> 39 40 #include <asm/pgalloc.h> ··· 968 967 } 969 968 EXPORT_SYMBOL_GPL(linear_hugepage_index); 970 969 971 - /* 972 - * Return the size of the pages allocated when backing a VMA. In the majority 973 - * cases this will be same size as used by the page table entries. 970 + /** 971 + * vma_kernel_pagesize - Page size granularity for this VMA. 972 + * @vma: The user mapping. 973 + * 974 + * Folios in this VMA will be aligned to, and at least the size of the 975 + * number of bytes returned by this function. 976 + * 977 + * Return: The default size of the folios allocated when backing a VMA. 974 978 */ 975 979 unsigned long vma_kernel_pagesize(struct vm_area_struct *vma) 976 980 { ··· 1489 1483 1490 1484 for (i = 1; i < nr_pages; i++) { 1491 1485 p = folio_page(folio, i); 1486 + p->flags &= ~PAGE_FLAGS_CHECK_AT_FREE; 1492 1487 p->mapping = NULL; 1493 1488 clear_compound_head(p); 1494 1489 if (!demote) ··· 1591 1584 { 1592 1585 lockdep_assert_held(&hugetlb_lock); 1593 1586 1594 - /* 1595 - * Very subtle 1596 - * 1597 - * For non-gigantic pages set the destructor to the normal compound 1598 - * page dtor. This is needed in case someone takes an additional 1599 - * temporary ref to the page, and freeing is delayed until they drop 1600 - * their reference. 1601 - * 1602 - * For gigantic pages set the destructor to the null dtor. This 1603 - * destructor will never be called. Before freeing the gigantic 1604 - * page destroy_compound_gigantic_folio will turn the folio into a 1605 - * simple group of pages. After this the destructor does not 1606 - * apply. 
1607 - * 1608 - */ 1609 - if (hstate_is_gigantic(h)) 1610 - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); 1611 - else 1612 - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); 1587 + folio_clear_hugetlb(folio); 1613 1588 } 1614 1589 1615 1590 /* ··· 1678 1689 h->surplus_huge_pages_node[nid]++; 1679 1690 } 1680 1691 1681 - folio_set_compound_dtor(folio, HUGETLB_PAGE_DTOR); 1692 + folio_set_hugetlb(folio); 1682 1693 folio_change_private(folio, NULL); 1683 1694 /* 1684 1695 * We have to set hugetlb_vmemmap_optimized again as above ··· 1694 1705 zeroed = folio_put_testzero(folio); 1695 1706 if (unlikely(!zeroed)) 1696 1707 /* 1697 - * It is VERY unlikely soneone else has taken a ref on 1698 - * the page. In this case, we simply return as the 1699 - * hugetlb destructor (free_huge_page) will be called 1700 - * when this other ref is dropped. 1708 + * It is VERY unlikely soneone else has taken a ref 1709 + * on the folio. In this case, we simply return as 1710 + * free_huge_folio() will be called when this other ref 1711 + * is dropped. 
1701 1712 */ 1702 1713 return; 1703 1714 ··· 1708 1719 static void __update_and_free_hugetlb_folio(struct hstate *h, 1709 1720 struct folio *folio) 1710 1721 { 1711 - int i; 1712 - struct page *subpage; 1713 1722 bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); 1714 1723 1715 1724 if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) ··· 1747 1760 spin_lock_irq(&hugetlb_lock); 1748 1761 __clear_hugetlb_destructor(h, folio); 1749 1762 spin_unlock_irq(&hugetlb_lock); 1750 - } 1751 - 1752 - for (i = 0; i < pages_per_huge_page(h); i++) { 1753 - subpage = folio_page(folio, i); 1754 - subpage->flags &= ~(1 << PG_locked | 1 << PG_error | 1755 - 1 << PG_referenced | 1 << PG_dirty | 1756 - 1 << PG_active | 1 << PG_private | 1757 - 1 << PG_writeback); 1758 1763 } 1759 1764 1760 1765 /* ··· 1790 1811 node = node->next; 1791 1812 page->mapping = NULL; 1792 1813 /* 1793 - * The VM_BUG_ON_PAGE(!PageHuge(page), page) in page_hstate() 1794 - * is going to trigger because a previous call to 1795 - * remove_hugetlb_folio() will call folio_set_compound_dtor 1796 - * (folio, NULL_COMPOUND_DTOR), so do not use page_hstate() 1797 - * directly. 1814 + * The VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio) in 1815 + * folio_hstate() is going to trigger because a previous call to 1816 + * remove_hugetlb_folio() will clear the hugetlb bit, so do 1817 + * not use folio_hstate() directly. 1798 1818 */ 1799 1819 h = size_to_hstate(page_size(page)); 1800 1820 ··· 1852 1874 return NULL; 1853 1875 } 1854 1876 1855 - void free_huge_page(struct page *page) 1877 + void free_huge_folio(struct folio *folio) 1856 1878 { 1857 1879 /* 1858 1880 * Can't pass hstate in here because it is called from the 1859 1881 * compound page destructor. 
1860 1882 */ 1861 - struct folio *folio = page_folio(page); 1862 1883 struct hstate *h = folio_hstate(folio); 1863 1884 int nid = folio_nid(folio); 1864 1885 struct hugepage_subpool *spool = hugetlb_folio_subpool(folio); ··· 1912 1935 spin_unlock_irqrestore(&hugetlb_lock, flags); 1913 1936 update_and_free_hugetlb_folio(h, folio, true); 1914 1937 } else { 1915 - arch_clear_hugepage_flags(page); 1938 + arch_clear_hugepage_flags(&folio->page); 1916 1939 enqueue_hugetlb_folio(h, folio); 1917 1940 spin_unlock_irqrestore(&hugetlb_lock, flags); 1918 1941 } ··· 1932 1955 { 1933 1956 hugetlb_vmemmap_optimize(h, &folio->page); 1934 1957 INIT_LIST_HEAD(&folio->lru); 1935 - folio_set_compound_dtor(folio, HUGETLB_PAGE_DTOR); 1958 + folio_set_hugetlb(folio); 1936 1959 hugetlb_set_folio_subpool(folio, NULL); 1937 1960 set_hugetlb_cgroup(folio, NULL); 1938 1961 set_hugetlb_cgroup_rsvd(folio, NULL); ··· 2047 2070 if (!PageCompound(page)) 2048 2071 return 0; 2049 2072 folio = page_folio(page); 2050 - return folio->_folio_dtor == HUGETLB_PAGE_DTOR; 2073 + return folio_test_hugetlb(folio); 2051 2074 } 2052 2075 EXPORT_SYMBOL_GPL(PageHuge); 2053 - 2054 - /** 2055 - * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs 2056 - * @folio: The folio to test. 2057 - * 2058 - * Context: Any context. Caller should have a reference on the folio to 2059 - * prevent it from being turned into a tail page. 2060 - * Return: True for hugetlbfs folios, false for anon folios or folios 2061 - * belonging to other filesystems. 2062 - */ 2063 - bool folio_test_hugetlb(struct folio *folio) 2064 - { 2065 - if (!folio_test_large(folio)) 2066 - return false; 2067 - 2068 - return folio->_folio_dtor == HUGETLB_PAGE_DTOR; 2069 - } 2070 - EXPORT_SYMBOL_GPL(folio_test_hugetlb); 2071 2076 2072 2077 /* 2073 2078 * Find and lock address space (mapping) in write mode. 
··· 2204 2245 folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node, 2205 2246 nodes_allowed, node_alloc_noretry); 2206 2247 if (folio) { 2207 - free_huge_page(&folio->page); /* free it into the hugepage allocator */ 2248 + free_huge_folio(folio); /* free it into the hugepage allocator */ 2208 2249 return 1; 2209 2250 } 2210 2251 } ··· 2387 2428 * We could have raced with the pool size change. 2388 2429 * Double check that and simply deallocate the new page 2389 2430 * if we would end up overcommiting the surpluses. Abuse 2390 - * temporary page to workaround the nasty free_huge_page 2431 + * temporary page to workaround the nasty free_huge_folio 2391 2432 * codeflow 2392 2433 */ 2393 2434 if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) { 2394 2435 folio_set_hugetlb_temporary(folio); 2395 2436 spin_unlock_irq(&hugetlb_lock); 2396 - free_huge_page(&folio->page); 2437 + free_huge_folio(folio); 2397 2438 return NULL; 2398 2439 } 2399 2440 ··· 2505 2546 __must_hold(&hugetlb_lock) 2506 2547 { 2507 2548 LIST_HEAD(surplus_list); 2508 - struct folio *folio; 2509 - struct page *page, *tmp; 2549 + struct folio *folio, *tmp; 2510 2550 int ret; 2511 2551 long i; 2512 2552 long needed, allocated; ··· 2565 2607 ret = 0; 2566 2608 2567 2609 /* Free the needed pages to the hugetlb pool */ 2568 - list_for_each_entry_safe(page, tmp, &surplus_list, lru) { 2610 + list_for_each_entry_safe(folio, tmp, &surplus_list, lru) { 2569 2611 if ((--needed) < 0) 2570 2612 break; 2571 2613 /* Add the page to the hugetlb allocator */ 2572 - enqueue_hugetlb_folio(h, page_folio(page)); 2614 + enqueue_hugetlb_folio(h, folio); 2573 2615 } 2574 2616 free: 2575 2617 spin_unlock_irq(&hugetlb_lock); 2576 2618 2577 2619 /* 2578 2620 * Free unnecessary surplus pages to the buddy allocator. 2579 - * Pages have no ref count, call free_huge_page directly. 2621 + * Pages have no ref count, call free_huge_folio directly. 
2580 2622 */ 2581 - list_for_each_entry_safe(page, tmp, &surplus_list, lru) 2582 - free_huge_page(page); 2623 + list_for_each_entry_safe(folio, tmp, &surplus_list, lru) 2624 + free_huge_folio(folio); 2583 2625 spin_lock_irq(&hugetlb_lock); 2584 2626 2585 2627 return ret; ··· 2793 2835 * 2) No reservation was in place for the page, so hugetlb_restore_reserve is 2794 2836 * not set. However, alloc_hugetlb_folio always updates the reserve map. 2795 2837 * 2796 - * In case 1, free_huge_page later in the error path will increment the 2797 - * global reserve count. But, free_huge_page does not have enough context 2838 + * In case 1, free_huge_folio later in the error path will increment the 2839 + * global reserve count. But, free_huge_folio does not have enough context 2798 2840 * to adjust the reservation map. This case deals primarily with private 2799 2841 * mappings. Adjust the reserve map here to be consistent with global 2800 - * reserve count adjustments to be made by free_huge_page. Make sure the 2842 + * reserve count adjustments to be made by free_huge_folio. Make sure the 2801 2843 * reserve map indicates there is a reservation present. 2802 2844 * 2803 2845 * In case 2, simply undo reserve map modifications done by alloc_hugetlb_folio. ··· 2813 2855 * Rare out of memory condition in reserve map 2814 2856 * manipulation. Clear hugetlb_restore_reserve so 2815 2857 * that global reserve count will not be incremented 2816 - * by free_huge_page. This will make it appear 2858 + * by free_huge_folio. This will make it appear 2817 2859 * as though the reservation for this folio was 2818 2860 * consumed. This may prevent the task from 2819 2861 * faulting in the folio at a later time. 
This ··· 3189 3231 if (prep_compound_gigantic_folio(folio, huge_page_order(h))) { 3190 3232 WARN_ON(folio_test_reserved(folio)); 3191 3233 prep_new_hugetlb_folio(h, folio, folio_nid(folio)); 3192 - free_huge_page(page); /* add to the hugepage allocator */ 3234 + free_huge_folio(folio); /* add to the hugepage allocator */ 3193 3235 } else { 3194 3236 /* VERY unlikely inflated ref count on a tail page */ 3195 3237 free_gigantic_folio(folio, huge_page_order(h)); ··· 3221 3263 &node_states[N_MEMORY], NULL); 3222 3264 if (!folio) 3223 3265 break; 3224 - free_huge_page(&folio->page); /* free it into the hugepage allocator */ 3266 + free_huge_folio(folio); /* free it into the hugepage allocator */ 3225 3267 } 3226 3268 cond_resched(); 3227 3269 } ··· 3499 3541 while (count > persistent_huge_pages(h)) { 3500 3542 /* 3501 3543 * If this allocation races such that we no longer need the 3502 - * page, free_huge_page will handle it by freeing the page 3544 + * page, free_huge_folio will handle it by freeing the page 3503 3545 * and reducing the surplus. 3504 3546 */ 3505 3547 spin_unlock_irq(&hugetlb_lock); ··· 3615 3657 prep_compound_page(subpage, target_hstate->order); 3616 3658 folio_change_private(inner_folio, NULL); 3617 3659 prep_new_hugetlb_folio(target_hstate, inner_folio, nid); 3618 - free_huge_page(subpage); 3660 + free_huge_folio(inner_folio); 3619 3661 } 3620 3662 mutex_unlock(&target_hstate->resize_lock); 3621 3663 ··· 4732 4774 void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm) 4733 4775 { 4734 4776 seq_printf(m, "HugetlbPages:\t%8lu kB\n", 4735 - atomic_long_read(&mm->hugetlb_usage) << (PAGE_SHIFT - 10)); 4777 + K(atomic_long_read(&mm->hugetlb_usage))); 4736 4778 } 4737 4779 4738 4780 /* Return the number pages of memory we physically have, in PAGE_SIZE units. 
*/ ··· 5013 5055 src_vma->vm_start, 5014 5056 src_vma->vm_end); 5015 5057 mmu_notifier_invalidate_range_start(&range); 5016 - mmap_assert_write_locked(src); 5058 + vma_assert_write_locked(src_vma); 5017 5059 raw_write_seqcount_begin(&src->write_protect_seq); 5018 5060 } else { 5019 5061 /* ··· 5086 5128 entry = huge_pte_clear_uffd_wp(entry); 5087 5129 set_huge_pte_at(dst, addr, dst_pte, entry); 5088 5130 } else if (unlikely(is_pte_marker(entry))) { 5089 - /* No swap on hugetlb */ 5090 - WARN_ON_ONCE( 5091 - is_swapin_error_entry(pte_to_swp_entry(entry))); 5092 - /* 5093 - * We copy the pte marker only if the dst vma has 5094 - * uffd-wp enabled. 5095 - */ 5096 - if (userfaultfd_wp(dst_vma)) 5097 - set_huge_pte_at(dst, addr, dst_pte, entry); 5131 + pte_marker marker = copy_pte_marker( 5132 + pte_to_swp_entry(entry), dst_vma); 5133 + 5134 + if (marker) 5135 + set_huge_pte_at(dst, addr, dst_pte, 5136 + make_pte_marker(marker)); 5098 5137 } else { 5099 5138 entry = huge_ptep_get(src_pte); 5100 5139 pte_folio = page_folio(pte_page(entry)); ··· 5263 5308 } 5264 5309 5265 5310 if (shared_pmd) 5266 - flush_tlb_range(vma, range.start, range.end); 5311 + flush_hugetlb_tlb_range(vma, range.start, range.end); 5267 5312 else 5268 - flush_tlb_range(vma, old_end - len, old_end); 5313 + flush_hugetlb_tlb_range(vma, old_end - len, old_end); 5269 5314 mmu_notifier_invalidate_range_end(&range); 5270 5315 i_mmap_unlock_write(mapping); 5271 5316 hugetlb_vma_unlock_write(vma); ··· 5672 5717 5673 5718 /* Break COW or unshare */ 5674 5719 huge_ptep_clear_flush(vma, haddr, ptep); 5675 - mmu_notifier_invalidate_range(mm, range.start, range.end); 5676 5720 page_remove_rmap(&old_folio->page, vma, true); 5677 5721 hugepage_add_new_anon_rmap(new_folio, vma, haddr); 5678 5722 if (huge_pte_uffd_wp(pte)) ··· 5702 5748 5703 5749 /* 5704 5750 * Return whether there is a pagecache page to back given address within VMA. 
5705 - * Caller follow_hugetlb_page() holds page_table_lock so we cannot lock_page. 5706 5751 */ 5707 5752 static bool hugetlbfs_pagecache_present(struct hstate *h, 5708 5753 struct vm_area_struct *vma, unsigned long address) ··· 6046 6093 int need_wait_lock = 0; 6047 6094 unsigned long haddr = address & huge_page_mask(h); 6048 6095 6096 + /* TODO: Handle faults under the VMA lock */ 6097 + if (flags & FAULT_FLAG_VMA_LOCK) { 6098 + vma_end_read(vma); 6099 + return VM_FAULT_RETRY; 6100 + } 6101 + 6049 6102 /* 6050 6103 * Serialize hugepage allocation and instantiation, so that we don't 6051 6104 * get spurious allocation failures if two CPUs race to instantiate ··· 6076 6117 } 6077 6118 6078 6119 entry = huge_ptep_get(ptep); 6079 - /* PTE markers should be handled the same way as none pte */ 6080 - if (huge_pte_none_mostly(entry)) 6120 + if (huge_pte_none_mostly(entry)) { 6121 + if (is_pte_marker(entry)) { 6122 + pte_marker marker = 6123 + pte_marker_get(pte_to_swp_entry(entry)); 6124 + 6125 + if (marker & PTE_MARKER_POISONED) { 6126 + ret = VM_FAULT_HWPOISON_LARGE; 6127 + goto out_mutex; 6128 + } 6129 + } 6130 + 6081 6131 /* 6132 + * Other PTE markers should be handled the same way as none PTE. 6133 + * 6082 6134 * hugetlb_no_page will drop vma lock and hugetlb fault 6083 6135 * mutex internally, which make us return immediately. 
6084 6136 */ 6085 6137 return hugetlb_no_page(mm, vma, mapping, idx, address, ptep, 6086 6138 entry, flags); 6139 + } 6087 6140 6088 6141 ret = 0; 6089 6142 ··· 6250 6279 struct folio *folio; 6251 6280 int writable; 6252 6281 bool folio_in_pagecache = false; 6282 + 6283 + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) { 6284 + ptl = huge_pte_lock(h, dst_mm, dst_pte); 6285 + 6286 + /* Don't overwrite any existing PTEs (even markers) */ 6287 + if (!huge_pte_none(huge_ptep_get(dst_pte))) { 6288 + spin_unlock(ptl); 6289 + return -EEXIST; 6290 + } 6291 + 6292 + _dst_pte = make_pte_marker(PTE_MARKER_POISONED); 6293 + set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); 6294 + 6295 + /* No need to invalidate - it was non-present before */ 6296 + update_mmu_cache(dst_vma, dst_addr, dst_pte); 6297 + 6298 + spin_unlock(ptl); 6299 + return 0; 6300 + } 6253 6301 6254 6302 if (is_continue) { 6255 6303 ret = -EFAULT; ··· 6439 6449 } 6440 6450 #endif /* CONFIG_USERFAULTFD */ 6441 6451 6442 - static void record_subpages(struct page *page, struct vm_area_struct *vma, 6443 - int refs, struct page **pages) 6444 - { 6445 - int nr; 6446 - 6447 - for (nr = 0; nr < refs; nr++) { 6448 - if (likely(pages)) 6449 - pages[nr] = nth_page(page, nr); 6450 - } 6451 - } 6452 - 6453 - static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma, 6454 - unsigned int flags, pte_t *pte, 6455 - bool *unshare) 6456 - { 6457 - pte_t pteval = huge_ptep_get(pte); 6458 - 6459 - *unshare = false; 6460 - if (is_swap_pte(pteval)) 6461 - return true; 6462 - if (huge_pte_write(pteval)) 6463 - return false; 6464 - if (flags & FOLL_WRITE) 6465 - return true; 6466 - if (gup_must_unshare(vma, flags, pte_page(pteval))) { 6467 - *unshare = true; 6468 - return true; 6469 - } 6470 - return false; 6471 - } 6472 - 6473 6452 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma, 6474 - unsigned long address, unsigned int flags) 6453 + unsigned long address, unsigned int flags, 6454 + unsigned 
int *page_mask) 6475 6455 { 6476 6456 struct hstate *h = hstate_vma(vma); 6477 6457 struct mm_struct *mm = vma->vm_mm; ··· 6449 6489 struct page *page = NULL; 6450 6490 spinlock_t *ptl; 6451 6491 pte_t *pte, entry; 6452 - 6453 - /* 6454 - * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via 6455 - * follow_hugetlb_page(). 6456 - */ 6457 - if (WARN_ON_ONCE(flags & FOLL_PIN)) 6458 - return NULL; 6492 + int ret; 6459 6493 6460 6494 hugetlb_vma_lock_read(vma); 6461 6495 pte = hugetlb_walk(vma, haddr, huge_page_size(h)); ··· 6459 6505 ptl = huge_pte_lock(h, mm, pte); 6460 6506 entry = huge_ptep_get(pte); 6461 6507 if (pte_present(entry)) { 6462 - page = pte_page(entry) + 6463 - ((address & ~huge_page_mask(h)) >> PAGE_SHIFT); 6508 + page = pte_page(entry); 6509 + 6510 + if (!huge_pte_write(entry)) { 6511 + if (flags & FOLL_WRITE) { 6512 + page = NULL; 6513 + goto out; 6514 + } 6515 + 6516 + if (gup_must_unshare(vma, flags, page)) { 6517 + /* Tell the caller to do unsharing */ 6518 + page = ERR_PTR(-EMLINK); 6519 + goto out; 6520 + } 6521 + } 6522 + 6523 + page += ((address & ~huge_page_mask(h)) >> PAGE_SHIFT); 6524 + 6464 6525 /* 6465 6526 * Note that page may be a sub-page, and with vmemmap 6466 6527 * optimizations the page struct may be read only. ··· 6485 6516 * try_grab_page() should always be able to get the page here, 6486 6517 * because we hold the ptl lock and have verified pte_present(). 
6487 6518 */ 6488 - if (try_grab_page(page, flags)) { 6489 - page = NULL; 6519 + ret = try_grab_page(page, flags); 6520 + 6521 + if (WARN_ON_ONCE(ret)) { 6522 + page = ERR_PTR(ret); 6490 6523 goto out; 6491 6524 } 6525 + 6526 + *page_mask = (1U << huge_page_order(h)) - 1; 6492 6527 } 6493 6528 out: 6494 6529 spin_unlock(ptl); 6495 6530 out_unlock: 6496 6531 hugetlb_vma_unlock_read(vma); 6497 - return page; 6498 - } 6499 6532 6500 - long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, 6501 - struct page **pages, unsigned long *position, 6502 - unsigned long *nr_pages, long i, unsigned int flags, 6503 - int *locked) 6504 - { 6505 - unsigned long pfn_offset; 6506 - unsigned long vaddr = *position; 6507 - unsigned long remainder = *nr_pages; 6508 - struct hstate *h = hstate_vma(vma); 6509 - int err = -EFAULT, refs; 6510 - 6511 - while (vaddr < vma->vm_end && remainder) { 6512 - pte_t *pte; 6513 - spinlock_t *ptl = NULL; 6514 - bool unshare = false; 6515 - int absent; 6516 - struct page *page; 6517 - 6518 - /* 6519 - * If we have a pending SIGKILL, don't keep faulting pages and 6520 - * potentially allocating memory. 6521 - */ 6522 - if (fatal_signal_pending(current)) { 6523 - remainder = 0; 6524 - break; 6525 - } 6526 - 6527 - hugetlb_vma_lock_read(vma); 6528 - /* 6529 - * Some archs (sparc64, sh*) have multiple pte_ts to 6530 - * each hugepage. We have to make sure we get the 6531 - * first, for the page indexing below to work. 6532 - * 6533 - * Note that page table lock is not held when pte is null. 6534 - */ 6535 - pte = hugetlb_walk(vma, vaddr & huge_page_mask(h), 6536 - huge_page_size(h)); 6537 - if (pte) 6538 - ptl = huge_pte_lock(h, mm, pte); 6539 - absent = !pte || huge_pte_none(huge_ptep_get(pte)); 6540 - 6541 - /* 6542 - * When coredumping, it suits get_dump_page if we just return 6543 - * an error where there's an empty slot with no huge pagecache 6544 - * to back it. 
This way, we avoid allocating a hugepage, and 6545 - * the sparse dumpfile avoids allocating disk blocks, but its 6546 - * huge holes still show up with zeroes where they need to be. 6547 - */ 6548 - if (absent && (flags & FOLL_DUMP) && 6549 - !hugetlbfs_pagecache_present(h, vma, vaddr)) { 6550 - if (pte) 6551 - spin_unlock(ptl); 6552 - hugetlb_vma_unlock_read(vma); 6553 - remainder = 0; 6554 - break; 6555 - } 6556 - 6557 - /* 6558 - * We need call hugetlb_fault for both hugepages under migration 6559 - * (in which case hugetlb_fault waits for the migration,) and 6560 - * hwpoisoned hugepages (in which case we need to prevent the 6561 - * caller from accessing to them.) In order to do this, we use 6562 - * here is_swap_pte instead of is_hugetlb_entry_migration and 6563 - * is_hugetlb_entry_hwpoisoned. This is because it simply covers 6564 - * both cases, and because we can't follow correct pages 6565 - * directly from any kind of swap entries. 6566 - */ 6567 - if (absent || 6568 - __follow_hugetlb_must_fault(vma, flags, pte, &unshare)) { 6569 - vm_fault_t ret; 6570 - unsigned int fault_flags = 0; 6571 - 6572 - if (pte) 6573 - spin_unlock(ptl); 6574 - hugetlb_vma_unlock_read(vma); 6575 - 6576 - if (flags & FOLL_WRITE) 6577 - fault_flags |= FAULT_FLAG_WRITE; 6578 - else if (unshare) 6579 - fault_flags |= FAULT_FLAG_UNSHARE; 6580 - if (locked) { 6581 - fault_flags |= FAULT_FLAG_ALLOW_RETRY | 6582 - FAULT_FLAG_KILLABLE; 6583 - if (flags & FOLL_INTERRUPTIBLE) 6584 - fault_flags |= FAULT_FLAG_INTERRUPTIBLE; 6585 - } 6586 - if (flags & FOLL_NOWAIT) 6587 - fault_flags |= FAULT_FLAG_ALLOW_RETRY | 6588 - FAULT_FLAG_RETRY_NOWAIT; 6589 - if (flags & FOLL_TRIED) { 6590 - /* 6591 - * Note: FAULT_FLAG_ALLOW_RETRY and 6592 - * FAULT_FLAG_TRIED can co-exist 6593 - */ 6594 - fault_flags |= FAULT_FLAG_TRIED; 6595 - } 6596 - ret = hugetlb_fault(mm, vma, vaddr, fault_flags); 6597 - if (ret & VM_FAULT_ERROR) { 6598 - err = vm_fault_to_errno(ret, flags); 6599 - remainder = 0; 6600 - 
break; 6601 - } 6602 - if (ret & VM_FAULT_RETRY) { 6603 - if (locked && 6604 - !(fault_flags & FAULT_FLAG_RETRY_NOWAIT)) 6605 - *locked = 0; 6606 - *nr_pages = 0; 6607 - /* 6608 - * VM_FAULT_RETRY must not return an 6609 - * error, it will return zero 6610 - * instead. 6611 - * 6612 - * No need to update "position" as the 6613 - * caller will not check it after 6614 - * *nr_pages is set to 0. 6615 - */ 6616 - return i; 6617 - } 6618 - continue; 6619 - } 6620 - 6621 - pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT; 6622 - page = pte_page(huge_ptep_get(pte)); 6623 - 6624 - VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) && 6625 - !PageAnonExclusive(page), page); 6626 - 6627 - /* 6628 - * If subpage information not requested, update counters 6629 - * and skip the same_page loop below. 6630 - */ 6631 - if (!pages && !pfn_offset && 6632 - (vaddr + huge_page_size(h) < vma->vm_end) && 6633 - (remainder >= pages_per_huge_page(h))) { 6634 - vaddr += huge_page_size(h); 6635 - remainder -= pages_per_huge_page(h); 6636 - i += pages_per_huge_page(h); 6637 - spin_unlock(ptl); 6638 - hugetlb_vma_unlock_read(vma); 6639 - continue; 6640 - } 6641 - 6642 - /* vaddr may not be aligned to PAGE_SIZE */ 6643 - refs = min3(pages_per_huge_page(h) - pfn_offset, remainder, 6644 - (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT); 6645 - 6646 - if (pages) 6647 - record_subpages(nth_page(page, pfn_offset), 6648 - vma, refs, 6649 - likely(pages) ? pages + i : NULL); 6650 - 6651 - if (pages) { 6652 - /* 6653 - * try_grab_folio() should always succeed here, 6654 - * because: a) we hold the ptl lock, and b) we've just 6655 - * checked that the huge page is present in the page 6656 - * tables. If the huge page is present, then the tail 6657 - * pages must also be present. The ptl prevents the 6658 - * head page and tail pages from being rearranged in 6659 - * any way. As this is hugetlb, the pages will never 6660 - * be p2pdma or not longterm pinable. 
So this page 6661 - * must be available at this point, unless the page 6662 - * refcount overflowed: 6663 - */ 6664 - if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs, 6665 - flags))) { 6666 - spin_unlock(ptl); 6667 - hugetlb_vma_unlock_read(vma); 6668 - remainder = 0; 6669 - err = -ENOMEM; 6670 - break; 6671 - } 6672 - } 6673 - 6674 - vaddr += (refs << PAGE_SHIFT); 6675 - remainder -= refs; 6676 - i += refs; 6677 - 6678 - spin_unlock(ptl); 6679 - hugetlb_vma_unlock_read(vma); 6680 - } 6681 - *nr_pages = remainder; 6682 6533 /* 6683 - * setting position is actually required only if remainder is 6684 - * not zero but it's faster not to add a "if (remainder)" 6685 - * branch. 6534 + * Fixup retval for dump requests: if pagecache doesn't exist, 6535 + * don't try to allocate a new page but just skip it. 6686 6536 */ 6687 - *position = vaddr; 6537 + if (!page && (flags & FOLL_DUMP) && 6538 + !hugetlbfs_pagecache_present(h, vma, address)) 6539 + page = ERR_PTR(-EFAULT); 6688 6540 6689 - return i ? i : err; 6541 + return page; 6690 6542 } 6691 6543 6692 6544 long hugetlb_change_protection(struct vm_area_struct *vma, ··· 6639 6849 else 6640 6850 flush_hugetlb_tlb_range(vma, start, end); 6641 6851 /* 6642 - * No need to call mmu_notifier_invalidate_range() we are downgrading 6643 - * page table protection not changing it to point to a new page. 6852 + * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are 6853 + * downgrading page table protection not changing it to point to a new 6854 + * page. 6644 6855 * 6645 6856 * See Documentation/mm/mmu_notifier.rst 6646 6857 */ ··· 7285 7494 i_mmap_unlock_write(vma->vm_file->f_mapping); 7286 7495 hugetlb_vma_unlock_write(vma); 7287 7496 /* 7288 - * No need to call mmu_notifier_invalidate_range(), see 7497 + * No need to call mmu_notifier_arch_invalidate_secondary_tlbs(), see 7289 7498 * Documentation/mm/mmu_notifier.rst. 7290 7499 */ 7291 7500 mmu_notifier_invalidate_range_end(&range);
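The hugetlb hunk above switches the failure path from returning NULL to returning `ERR_PTR(ret)` from `try_grab_page()`, so callers can tell "no page" apart from a hard error. A minimal userspace sketch of the kernel's error-pointer convention (the names mirror `include/linux/err.h`, but this is a standalone illustration, not kernel code; `lookup_page()` is hypothetical):

```c
#include <assert.h>

/* Kernel-style error pointers: encode a small negative errno in an
 * otherwise-invalid pointer value, so one return slot carries both
 * "success: pointer" and "failure: error code". */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)      { return (void *)error; }
static inline long  PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int   IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical lookup: valid pointer on success, error pointer on failure. */
static void *lookup_page(int simulate_failure)
{
	static int the_page;

	if (simulate_failure)
		return ERR_PTR(-12 /* -ENOMEM */);
	return &the_page;
}
```

A caller then checks `IS_ERR()` before dereferencing, exactly as the reworked `goto out` path expects.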
+14 -20
mm/hugetlb_vmemmap.c
···
36 36  struct list_head *vmemmap_pages;
37 37  };
38 38
39  -  static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
39  +  static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
40 40  {
41 41  pmd_t __pmd;
42 42  int i;
43 43  unsigned long addr = start;
44  -  struct page *page = pmd_page(*pmd);
45  -  pte_t *pgtable = pte_alloc_one_kernel(&init_mm);
44  +  struct page *head;
45  +  pte_t *pgtable;
46 46
47  +  spin_lock(&init_mm.page_table_lock);
48  +  head = pmd_leaf(*pmd) ? pmd_page(*pmd) : NULL;
49  +  spin_unlock(&init_mm.page_table_lock);
50  +
51  +  if (!head)
52  +  return 0;
53  +
54  +  pgtable = pte_alloc_one_kernel(&init_mm);
47 55  if (!pgtable)
48 56  return -ENOMEM;
49 57
···
61 53  pte_t entry, *pte;
62 54  pgprot_t pgprot = PAGE_KERNEL;
64  -  entry = mk_pte(page + i, pgprot);
56  +  entry = mk_pte(head + i, pgprot);
65 57  pte = pte_offset_kernel(&__pmd, addr);
66 58  set_pte_at(&init_mm, addr, pte, entry);
67 59  }
···
73 65  * be treated as independent small pages (as they can be freed
74 66  * individually).
75 67  */
76  -  if (!PageReserved(page))
77  -  split_page(page, get_order(PMD_SIZE));
68  +  if (!PageReserved(head))
69  +  split_page(head, get_order(PMD_SIZE));
78 70
79 71  /* Make pte visible before pmd. See comment in pmd_install(). */
80 72  smp_wmb();
···
86 78  spin_unlock(&init_mm.page_table_lock);
87 79
88 80  return 0;
89  -  }
90  -
91  -  static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
92  -  {
93  -  int leaf;
94  -
95  -  spin_lock(&init_mm.page_table_lock);
96  -  leaf = pmd_leaf(*pmd);
97  -  spin_unlock(&init_mm.page_table_lock);
98  -
99  -  if (!leaf)
100  -  return 0;
101  -
102  -  return __split_vmemmap_huge_pmd(pmd, start);
103 81  }
104 82
105 83  static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
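The hunk above remaps one PMD leaf as individual PTEs and calls `split_page(head, get_order(PMD_SIZE))`. The arithmetic behind that, how many base pages a PMD covers and which order is passed to `split_page()`, can be sketched standalone; the 4 KiB page / 2 MiB PMD geometry is an assumption matching a typical x86-64 config, not taken from this diff:

```c
#include <assert.h>

/* Assumed geometry: 4 KiB base pages, 2 MiB PMD leaf (x86-64 defaults). */
#define PAGE_SHIFT 12UL
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PMD_SIZE   (512UL * PAGE_SIZE)

/* get_order(): smallest order such that (PAGE_SIZE << order) >= size. */
static unsigned int get_order(unsigned long size)
{
	unsigned int order = 0;

	size = (size - 1) >> PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* Number of PTEs written when one PMD leaf is split. */
static unsigned long ptes_per_pmd(void)
{
	return PMD_SIZE / PAGE_SIZE;
}
```

With these assumed sizes a split writes 512 PTEs and `split_page()` receives order 9.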
+2
mm/init-mm.c
···
17 17  #define INIT_MM_CONTEXT(name)
18 18  #endif
19 19
20  +  const struct vm_operations_struct vma_dummy_vm_ops;
21  +
20 22  /*
21 23  * For dynamically allocated mm_structs, there is a dynamically sized cpumask
22 24  * at the end of the structure, the size of which depends on the maximum CPU
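The two added lines give the kernel a shared, zero-filled `vma_dummy_vm_ops` object. The underlying idiom is a "null object": a const, zero-initialized ops table lets code hold a valid pointer by default and test each callback before invoking it. A hedged sketch (the struct and field names here are illustrative, not the kernel's `vm_operations_struct`):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative ops table; individual callbacks may be NULL. */
struct vm_ops {
	int (*fault)(void *ctx);
	void (*close)(void *ctx);
};

/* Zero-initialized dummy: a safe default so users never carry a NULL
 * ops pointer, only NULL callbacks. */
static const struct vm_ops dummy_vm_ops;

static int call_fault(const struct vm_ops *ops, void *ctx)
{
	/* Check the callback, not the table, before calling. */
	if (ops->fault)
		return ops->fault(ctx);
	return -1; /* no handler installed */
}
```

Callers can also compare against the dummy's address to detect "no real ops installed" without a NULL check at every site.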
+48 -13
mm/internal.h
··· 62 62 #define FOLIO_PAGES_MAPPED (COMPOUND_MAPPED - 1) 63 63 64 64 /* 65 + * Flags passed to __show_mem() and show_free_areas() to suppress output in 66 + * various contexts. 67 + */ 68 + #define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ 69 + 70 + /* 65 71 * How many individual pages have an elevated _mapcount. Excludes 66 72 * the folio's entire_mapcount. 67 73 */ ··· 109 103 void deactivate_file_folio(struct folio *folio); 110 104 void folio_activate(struct folio *folio); 111 105 112 - void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt, 106 + void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, 113 107 struct vm_area_struct *start_vma, unsigned long floor, 114 108 unsigned long ceiling, bool mm_wr_locked); 115 109 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); ··· 174 168 VM_BUG_ON_PAGE(PageTail(page), page); 175 169 VM_BUG_ON_PAGE(page_ref_count(page), page); 176 170 set_page_count(page, 1); 171 + } 172 + 173 + /* 174 + * Return true if a folio needs ->release_folio() calling upon it. 
175 + */ 176 + static inline bool folio_needs_release(struct folio *folio) 177 + { 178 + struct address_space *mapping = folio_mapping(folio); 179 + 180 + return folio_has_private(folio) || 181 + (mapping && mapping_release_always(mapping)); 177 182 } 178 183 179 184 extern unsigned long highest_memmap_pfn; ··· 407 390 if (WARN_ON_ONCE(!order || !folio_test_large(folio))) 408 391 return; 409 392 410 - folio->_folio_order = order; 393 + folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order; 411 394 #ifdef CONFIG_64BIT 412 395 folio->_folio_nr_pages = 1U << order; 413 396 #endif 414 397 } 415 398 399 + void folio_undo_large_rmappable(struct folio *folio); 400 + 416 401 static inline void prep_compound_head(struct page *page, unsigned int order) 417 402 { 418 403 struct folio *folio = (struct folio *)page; 419 404 420 - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); 421 405 folio_set_order(folio, order); 422 406 atomic_set(&folio->_entire_mapcount, -1); 423 407 atomic_set(&folio->_nr_pages_mapped, 0); ··· 707 689 if (fault_flag_allow_retry_first(flags) && 708 690 !(flags & FAULT_FLAG_RETRY_NOWAIT)) { 709 691 fpin = get_file(vmf->vma->vm_file); 710 - mmap_read_unlock(vmf->vma->vm_mm); 692 + release_fault_lock(vmf); 711 693 } 712 694 return fpin; 713 695 } ··· 1040 1022 } 1041 1023 1042 1024 extern bool mirrored_kernelcore; 1025 + extern bool memblock_has_mirror(void); 1043 1026 1044 1027 static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma) 1045 1028 { ··· 1060 1041 return !(vma->vm_flags & VM_SOFTDIRTY); 1061 1042 } 1062 1043 1044 + static inline void vma_iter_config(struct vma_iterator *vmi, 1045 + unsigned long index, unsigned long last) 1046 + { 1047 + MAS_BUG_ON(&vmi->mas, vmi->mas.node != MAS_START && 1048 + (vmi->mas.index > index || vmi->mas.last < index)); 1049 + __mas_set_range(&vmi->mas, index, last - 1); 1050 + } 1051 + 1063 1052 /* 1064 1053 * VMA Iterator functions shared between nommu and mmap 1065 1054 */ 1066 - static inline int 
vma_iter_prealloc(struct vma_iterator *vmi) 1055 + static inline int vma_iter_prealloc(struct vma_iterator *vmi, 1056 + struct vm_area_struct *vma) 1067 1057 { 1068 - return mas_preallocate(&vmi->mas, GFP_KERNEL); 1058 + return mas_preallocate(&vmi->mas, vma, GFP_KERNEL); 1069 1059 } 1070 1060 1071 - static inline void vma_iter_clear(struct vma_iterator *vmi, 1072 - unsigned long start, unsigned long end) 1061 + static inline void vma_iter_clear(struct vma_iterator *vmi) 1073 1062 { 1074 - mas_set_range(&vmi->mas, start, end - 1); 1075 1063 mas_store_prealloc(&vmi->mas, NULL); 1064 + } 1065 + 1066 + static inline int vma_iter_clear_gfp(struct vma_iterator *vmi, 1067 + unsigned long start, unsigned long end, gfp_t gfp) 1068 + { 1069 + __mas_set_range(&vmi->mas, start, end - 1); 1070 + mas_store_gfp(&vmi->mas, NULL, gfp); 1071 + if (unlikely(mas_is_err(&vmi->mas))) 1072 + return -ENOMEM; 1073 + 1074 + return 0; 1076 1075 } 1077 1076 1078 1077 static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi) ··· 1122 1085 ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) 1123 1086 vma_iter_invalidate(vmi); 1124 1087 1125 - vmi->mas.index = vma->vm_start; 1126 - vmi->mas.last = vma->vm_end - 1; 1088 + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); 1127 1089 mas_store_prealloc(&vmi->mas, vma); 1128 1090 } 1129 1091 ··· 1133 1097 ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) 1134 1098 vma_iter_invalidate(vmi); 1135 1099 1136 - vmi->mas.index = vma->vm_start; 1137 - vmi->mas.last = vma->vm_end - 1; 1100 + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); 1138 1101 mas_store_gfp(&vmi->mas, vma, gfp); 1139 1102 if (unlikely(mas_is_err(&vmi->mas))) 1140 1103 return -ENOMEM;
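In the mm/internal.h hunk, `folio_set_order()` now stores the order in the low byte of `_flags_1` via `(folio->_flags_1 & ~0xffUL) | order` instead of a dedicated `_folio_order` field, with `_folio_nr_pages` derived as `1 << order`. The bit manipulation can be sketched standalone (a sketch of the masking only, not the real folio layout):

```c
#include <assert.h>

/* Pack a compound-page order into the low byte of a flags word,
 * preserving the upper bits (mirrors the ~0xffUL mask in the hunk). */
static unsigned long pack_order(unsigned long flags1, unsigned int order)
{
	return (flags1 & ~0xffUL) | order;
}

/* Recover the order from the low byte. */
static unsigned int unpack_order(unsigned long flags1)
{
	return flags1 & 0xff;
}

/* Pages covered by a folio of the given order. */
static unsigned long nr_pages(unsigned int order)
{
	return 1UL << order;
}
```

Any stale order bits are cleared by the mask before the new order is OR-ed in, so repeated calls do not accumulate garbage.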
+28 -15
mm/ioremap.c
···
10 10  #include <linux/mm.h>
11 11  #include <linux/io.h>
12 12  #include <linux/export.h>
13  +  #include <linux/ioremap.h>
13 14
14  -  void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
15  -  unsigned long prot)
15  +  void __iomem *generic_ioremap_prot(phys_addr_t phys_addr, size_t size,
16  +  pgprot_t prot)
16 17  {
17 18  unsigned long offset, vaddr;
18 19  phys_addr_t last_addr;
19 20  struct vm_struct *area;
21  +
22  +  /* An early platform driver might end up here */
23  +  if (WARN_ON_ONCE(!slab_is_available()))
24  +  return NULL;
20 25
21 26  /* Disallow wrap-around or zero size */
22 27  last_addr = phys_addr + size - 1;
···
33 28  phys_addr -= offset;
34 29  size = PAGE_ALIGN(size + offset);
35 30
36  -  if (!ioremap_allowed(phys_addr, size, prot))
37  -  return NULL;
38  -
39  -  area = get_vm_area_caller(size, VM_IOREMAP,
40  -  __builtin_return_address(0));
31  +  area = __get_vm_area_caller(size, VM_IOREMAP, IOREMAP_START,
32  +  IOREMAP_END, __builtin_return_address(0));
41 33  if (!area)
42 34  return NULL;
43 35  vaddr = (unsigned long)area->addr;
44 36  area->phys_addr = phys_addr;
45 37
46  -  if (ioremap_page_range(vaddr, vaddr + size, phys_addr,
47  -  __pgprot(prot))) {
38  +  if (ioremap_page_range(vaddr, vaddr + size, phys_addr, prot)) {
48 39  free_vm_area(area);
49 40  return NULL;
50 41  }
51 42
52 43  return (void __iomem *)(vaddr + offset);
53 44  }
54  -  EXPORT_SYMBOL(ioremap_prot);
55 45
56  -  void iounmap(volatile void __iomem *addr)
46  +  #ifndef ioremap_prot
47  +  void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
48  +  unsigned long prot)
49  +  {
50  +  return generic_ioremap_prot(phys_addr, size, __pgprot(prot));
51  +  }
52  +  EXPORT_SYMBOL(ioremap_prot);
53  +  #endif
54  +
55  +  void generic_iounmap(volatile void __iomem *addr)
57 56  {
58 57  void *vaddr = (void *)((unsigned long)addr & PAGE_MASK);
59 58
60  -  if (!iounmap_allowed(vaddr))
61  -  return;
62  -
63  -  if (is_vmalloc_addr(vaddr))
59  +  if (is_ioremap_addr(vaddr))
64 60  vunmap(vaddr);
65 61  }
62  +
63  +  #ifndef iounmap
64  +  void iounmap(volatile void __iomem *addr)
65  +  {
66  +  generic_iounmap(addr);
67  +  }
66 68  EXPORT_SYMBOL(iounmap);
69  +  #endif
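`generic_ioremap_prot()` above first rounds the request to page boundaries: it remembers the sub-page offset, aligns `phys_addr` down, page-aligns the grown size, and later re-adds the offset to the returned vaddr. That fixup can be sketched in isolation (4 KiB pages assumed; `struct map_request` is an illustrative container, not a kernel type):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

struct map_request {
	unsigned long phys;   /* page-aligned physical base */
	unsigned long size;   /* page-aligned mapping length */
	unsigned long offset; /* sub-page offset to re-add to the vaddr */
};

/* Mirror the offset/alignment dance in generic_ioremap_prot(). */
static struct map_request fixup(unsigned long phys_addr, unsigned long size)
{
	struct map_request r;

	r.offset = phys_addr & ~PAGE_MASK;  /* bits below the page boundary */
	r.phys = phys_addr - r.offset;      /* align the base down */
	r.size = PAGE_ALIGN(size + r.offset); /* grow and round up the length */
	return r;
}
```

Growing `size` by `offset` before aligning is the subtle part: a small mapping that straddles a page boundary must still cover the last byte of the original request.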
+86 -37
mm/kfence/core.c
··· 116 116 * backing pages (in __kfence_pool). 117 117 */ 118 118 static_assert(CONFIG_KFENCE_NUM_OBJECTS > 0); 119 - struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS]; 119 + struct kfence_metadata *kfence_metadata __read_mostly; 120 + 121 + /* 122 + * If kfence_metadata is not NULL, it may be accessed by kfence_shutdown_cache(). 123 + * So introduce kfence_metadata_init to initialize metadata, and then make 124 + * kfence_metadata visible after initialization is successful. This prevents 125 + * potential UAF or access to uninitialized metadata. 126 + */ 127 + static struct kfence_metadata *kfence_metadata_init __read_mostly; 120 128 121 129 /* Freelist with available objects. */ 122 130 static struct list_head kfence_freelist = LIST_HEAD_INIT(kfence_freelist); ··· 599 591 600 592 __folio_set_slab(slab_folio(slab)); 601 593 #ifdef CONFIG_MEMCG 602 - slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg | 594 + slab->memcg_data = (unsigned long)&kfence_metadata_init[i / 2 - 1].objcg | 603 595 MEMCG_DATA_OBJCGS; 604 596 #endif 605 597 } ··· 618 610 } 619 611 620 612 for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) { 621 - struct kfence_metadata *meta = &kfence_metadata[i]; 613 + struct kfence_metadata *meta = &kfence_metadata_init[i]; 622 614 623 615 /* Initialize metadata. */ 624 616 INIT_LIST_HEAD(&meta->list); ··· 634 626 addr += 2 * PAGE_SIZE; 635 627 } 636 628 629 + /* 630 + * Make kfence_metadata visible only when initialization is successful. 631 + * Otherwise, if the initialization fails and kfence_metadata is freed, 632 + * it may cause UAF in kfence_shutdown_cache(). 
633 + */ 634 + smp_store_release(&kfence_metadata, kfence_metadata_init); 637 635 return 0; 638 636 639 637 reset_slab: ··· 686 672 */ 687 673 memblock_free_late(__pa(addr), KFENCE_POOL_SIZE - (addr - (unsigned long)__kfence_pool)); 688 674 __kfence_pool = NULL; 689 - return false; 690 - } 691 675 692 - static bool kfence_init_pool_late(void) 693 - { 694 - unsigned long addr, free_size; 676 + memblock_free_late(__pa(kfence_metadata_init), KFENCE_METADATA_SIZE); 677 + kfence_metadata_init = NULL; 695 678 696 - addr = kfence_init_pool(); 697 - 698 - if (!addr) 699 - return true; 700 - 701 - /* Same as above. */ 702 - free_size = KFENCE_POOL_SIZE - (addr - (unsigned long)__kfence_pool); 703 - #ifdef CONFIG_CONTIG_ALLOC 704 - free_contig_range(page_to_pfn(virt_to_page((void *)addr)), free_size / PAGE_SIZE); 705 - #else 706 - free_pages_exact((void *)addr, free_size); 707 - #endif 708 - __kfence_pool = NULL; 709 679 return false; 710 680 } 711 681 ··· 839 841 840 842 /* === Public interface ===================================================== */ 841 843 842 - void __init kfence_alloc_pool(void) 844 + void __init kfence_alloc_pool_and_metadata(void) 843 845 { 844 846 if (!kfence_sample_interval) 845 847 return; 846 848 847 - /* if the pool has already been initialized by arch, skip the below. */ 848 - if (__kfence_pool) 849 - return; 850 - 851 - __kfence_pool = memblock_alloc(KFENCE_POOL_SIZE, PAGE_SIZE); 852 - 849 + /* 850 + * If the pool has already been initialized by arch, there is no need to 851 + * re-allocate the memory pool. 852 + */ 853 853 if (!__kfence_pool) 854 + __kfence_pool = memblock_alloc(KFENCE_POOL_SIZE, PAGE_SIZE); 855 + 856 + if (!__kfence_pool) { 854 857 pr_err("failed to allocate pool\n"); 858 + return; 859 + } 860 + 861 + /* The memory allocated by memblock has been zeroed out. 
*/ 862 + kfence_metadata_init = memblock_alloc(KFENCE_METADATA_SIZE, PAGE_SIZE); 863 + if (!kfence_metadata_init) { 864 + pr_err("failed to allocate metadata\n"); 865 + memblock_free(__kfence_pool, KFENCE_POOL_SIZE); 866 + __kfence_pool = NULL; 867 + } 855 868 } 856 869 857 870 static void kfence_init_enable(void) ··· 904 895 905 896 static int kfence_init_late(void) 906 897 { 907 - const unsigned long nr_pages = KFENCE_POOL_SIZE / PAGE_SIZE; 898 + const unsigned long nr_pages_pool = KFENCE_POOL_SIZE / PAGE_SIZE; 899 + const unsigned long nr_pages_meta = KFENCE_METADATA_SIZE / PAGE_SIZE; 900 + unsigned long addr = (unsigned long)__kfence_pool; 901 + unsigned long free_size = KFENCE_POOL_SIZE; 902 + int err = -ENOMEM; 903 + 908 904 #ifdef CONFIG_CONTIG_ALLOC 909 905 struct page *pages; 910 906 911 - pages = alloc_contig_pages(nr_pages, GFP_KERNEL, first_online_node, NULL); 907 + pages = alloc_contig_pages(nr_pages_pool, GFP_KERNEL, first_online_node, 908 + NULL); 912 909 if (!pages) 913 910 return -ENOMEM; 911 + 914 912 __kfence_pool = page_to_virt(pages); 913 + pages = alloc_contig_pages(nr_pages_meta, GFP_KERNEL, first_online_node, 914 + NULL); 915 + if (pages) 916 + kfence_metadata_init = page_to_virt(pages); 915 917 #else 916 - if (nr_pages > MAX_ORDER_NR_PAGES) { 918 + if (nr_pages_pool > MAX_ORDER_NR_PAGES || 919 + nr_pages_meta > MAX_ORDER_NR_PAGES) { 917 920 pr_warn("KFENCE_NUM_OBJECTS too large for buddy allocator\n"); 918 921 return -EINVAL; 919 922 } 923 + 920 924 __kfence_pool = alloc_pages_exact(KFENCE_POOL_SIZE, GFP_KERNEL); 921 925 if (!__kfence_pool) 922 926 return -ENOMEM; 927 + 928 + kfence_metadata_init = alloc_pages_exact(KFENCE_METADATA_SIZE, GFP_KERNEL); 923 929 #endif 924 930 925 - if (!kfence_init_pool_late()) { 926 - pr_err("%s failed\n", __func__); 927 - return -EBUSY; 931 + if (!kfence_metadata_init) 932 + goto free_pool; 933 + 934 + memzero_explicit(kfence_metadata_init, KFENCE_METADATA_SIZE); 935 + addr = kfence_init_pool(); 936 + if 
(!addr) { 937 + kfence_init_enable(); 938 + kfence_debugfs_init(); 939 + return 0; 928 940 } 929 941 930 - kfence_init_enable(); 931 - kfence_debugfs_init(); 942 + pr_err("%s failed\n", __func__); 943 + free_size = KFENCE_POOL_SIZE - (addr - (unsigned long)__kfence_pool); 944 + err = -EBUSY; 932 945 933 - return 0; 946 + #ifdef CONFIG_CONTIG_ALLOC 947 + free_contig_range(page_to_pfn(virt_to_page((void *)kfence_metadata_init)), 948 + nr_pages_meta); 949 + free_pool: 950 + free_contig_range(page_to_pfn(virt_to_page((void *)addr)), 951 + free_size / PAGE_SIZE); 952 + #else 953 + free_pages_exact((void *)kfence_metadata_init, KFENCE_METADATA_SIZE); 954 + free_pool: 955 + free_pages_exact((void *)addr, free_size); 956 + #endif 957 + 958 + kfence_metadata_init = NULL; 959 + __kfence_pool = NULL; 960 + return err; 934 961 } 935 962 936 963 static int kfence_enable_late(void) ··· 985 940 unsigned long flags; 986 941 struct kfence_metadata *meta; 987 942 int i; 943 + 944 + /* Pairs with release in kfence_init_pool(). */ 945 + if (!smp_load_acquire(&kfence_metadata)) 946 + return; 988 947 989 948 for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) { 990 949 bool in_use;
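The KFENCE change publishes `kfence_metadata` with `smp_store_release()` only after initialization succeeds, and `kfence_shutdown_cache()` pairs that with `smp_load_acquire()`. The same publish-after-init pattern can be written in portable C11 atomics (a sketch of the idiom, not the kernel's primitives or data structures):

```c
#include <assert.h>
#include <stdatomic.h>

struct metadata {
	int objects;
};

static struct metadata backing;
static _Atomic(struct metadata *) published; /* NULL until init succeeds */

/* Writer: fill in the struct, then release-publish the pointer so any
 * reader that acquire-loads it also observes the initialized contents. */
static void publish(int nobjects)
{
	backing.objects = nobjects;
	atomic_store_explicit(&published, &backing, memory_order_release);
}

/* Reader: returns NULL if initialization has not completed yet. */
static struct metadata *lookup(void)
{
	return atomic_load_explicit(&published, memory_order_acquire);
}
```

Readers that see NULL simply bail out, which is exactly what the added early return in `kfence_shutdown_cache()` does.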
+4 -1
mm/kfence/kfence.h
···
102 102  #endif
103 103  };
104 104
105  -  extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
105  +  #define KFENCE_METADATA_SIZE PAGE_ALIGN(sizeof(struct kfence_metadata) * \
106  +  CONFIG_KFENCE_NUM_OBJECTS)
107  +
108  +  extern struct kfence_metadata *kfence_metadata;
106 109
107 110  static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
108 111  {
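The header change sizes the metadata dynamically: `KFENCE_METADATA_SIZE` is `sizeof(struct kfence_metadata) * CONFIG_KFENCE_NUM_OBJECTS` rounded up to a whole page, so it can be handed to the page-granular allocators used in core.c. The footprint arithmetic, sketched with a stand-in struct (the real `kfence_metadata` layout is larger) and an assumed 4 KiB page:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Stand-in for struct kfence_metadata; illustrative fields only. */
struct meta {
	unsigned long addr, size;
	int state;
};

/* Page-aligned footprint for n per-object metadata slots, mirroring
 * the KFENCE_METADATA_SIZE macro in the hunk above. */
static unsigned long metadata_size(unsigned long n)
{
	return PAGE_ALIGN(sizeof(struct meta) * n);
}
```

Rounding up to a page matters because the result is freed back with page-based helpers (`memblock_free_late()`, `free_pages_exact()`), which operate on whole pages.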
+191 -317
mm/khugepaged.c
··· 19 19 #include <linux/page_table_check.h> 20 20 #include <linux/swapops.h> 21 21 #include <linux/shmem_fs.h> 22 + #include <linux/ksm.h> 22 23 23 24 #include <asm/tlb.h> 24 25 #include <asm/pgalloc.h> ··· 93 92 94 93 static struct kmem_cache *mm_slot_cache __read_mostly; 95 94 96 - #define MAX_PTE_MAPPED_THP 8 97 - 98 95 struct collapse_control { 99 96 bool is_khugepaged; 100 97 ··· 106 107 /** 107 108 * struct khugepaged_mm_slot - khugepaged information per mm that is being scanned 108 109 * @slot: hash lookup from mm to mm_slot 109 - * @nr_pte_mapped_thp: number of pte mapped THP 110 - * @pte_mapped_thp: address array corresponding pte mapped THP 111 110 */ 112 111 struct khugepaged_mm_slot { 113 112 struct mm_slot slot; 114 - 115 - /* pte-mapped THP in this mm */ 116 - int nr_pte_mapped_thp; 117 - unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP]; 118 113 }; 119 114 120 115 /** ··· 702 709 spin_lock(ptl); 703 710 ptep_clear(vma->vm_mm, address, _pte); 704 711 spin_unlock(ptl); 712 + ksm_might_unmap_zero_page(vma->vm_mm, pteval); 705 713 } 706 714 } else { 707 715 src_page = pte_page(pteval); ··· 896 902 return false; 897 903 } 898 904 899 - prep_transhuge_page(*hpage); 905 + folio_prep_large_rmappable((struct folio *)*hpage); 900 906 count_vm_event(THP_COLLAPSE_ALLOC); 901 907 return true; 902 908 } ··· 1433 1439 } 1434 1440 1435 1441 #ifdef CONFIG_SHMEM 1436 - /* 1437 - * Notify khugepaged that given addr of the mm is pte-mapped THP. Then 1438 - * khugepaged should try to collapse the page table. 1439 - * 1440 - * Note that following race exists: 1441 - * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A, 1442 - * emptying the A's ->pte_mapped_thp[] array. 1443 - * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and 1444 - * retract_page_tables() finds a VMA in mm_struct A mapping the same extent 1445 - * (at virtual address X) and adds an entry (for X) into mm_struct A's 1446 - * ->pte-mapped_thp[] array. 
1447 - * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X, 1448 - * sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry 1449 - * (for X) into mm_struct A's ->pte-mapped_thp[] array. 1450 - * Thus, it's possible the same address is added multiple times for the same 1451 - * mm_struct. Should this happen, we'll simply attempt 1452 - * collapse_pte_mapped_thp() multiple times for the same address, under the same 1453 - * exclusive mmap_lock, and assuming the first call is successful, subsequent 1454 - * attempts will return quickly (without grabbing any additional locks) when 1455 - * a huge pmd is found in find_pmd_or_thp_or_none(). Since this is a cheap 1456 - * check, and since this is a rare occurrence, the cost of preventing this 1457 - * "multiple-add" is thought to be more expensive than just handling it, should 1458 - * it occur. 1459 - */ 1460 - static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm, 1461 - unsigned long addr) 1462 - { 1463 - struct khugepaged_mm_slot *mm_slot; 1464 - struct mm_slot *slot; 1465 - bool ret = false; 1466 - 1467 - VM_BUG_ON(addr & ~HPAGE_PMD_MASK); 1468 - 1469 - spin_lock(&khugepaged_mm_lock); 1470 - slot = mm_slot_lookup(mm_slots_hash, mm); 1471 - mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot); 1472 - if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) { 1473 - mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr; 1474 - ret = true; 1475 - } 1476 - spin_unlock(&khugepaged_mm_lock); 1477 - return ret; 1478 - } 1479 - 1480 - /* hpage must be locked, and mmap_lock must be held in write */ 1442 + /* hpage must be locked, and mmap_lock must be held */ 1481 1443 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, 1482 1444 pmd_t *pmdp, struct page *hpage) 1483 1445 { ··· 1445 1495 }; 1446 1496 1447 1497 VM_BUG_ON(!PageTransHuge(hpage)); 1448 - mmap_assert_write_locked(vma->vm_mm); 1498 + mmap_assert_locked(vma->vm_mm); 1449 
1499 1450 1500 if (do_set_pmd(&vmf, hpage)) 1451 1501 return SCAN_FAIL; 1452 1502 1453 1503 get_page(hpage); 1454 1504 return SCAN_SUCCEED; 1455 - } 1456 - 1457 - /* 1458 - * A note about locking: 1459 - * Trying to take the page table spinlocks would be useless here because those 1460 - * are only used to synchronize: 1461 - * 1462 - * - modifying terminal entries (ones that point to a data page, not to another 1463 - * page table) 1464 - * - installing *new* non-terminal entries 1465 - * 1466 - * Instead, we need roughly the same kind of protection as free_pgtables() or 1467 - * mm_take_all_locks() (but only for a single VMA): 1468 - * The mmap lock together with this VMA's rmap locks covers all paths towards 1469 - * the page table entries we're messing with here, except for hardware page 1470 - * table walks and lockless_pages_from_mm(). 1471 - */ 1472 - static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma, 1473 - unsigned long addr, pmd_t *pmdp) 1474 - { 1475 - pmd_t pmd; 1476 - struct mmu_notifier_range range; 1477 - 1478 - mmap_assert_write_locked(mm); 1479 - if (vma->vm_file) 1480 - lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem); 1481 - /* 1482 - * All anon_vmas attached to the VMA have the same root and are 1483 - * therefore locked by the same lock. 
1484 - */ 1485 - if (vma->anon_vma) 1486 - lockdep_assert_held_write(&vma->anon_vma->root->rwsem); 1487 - 1488 - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr, 1489 - addr + HPAGE_PMD_SIZE); 1490 - mmu_notifier_invalidate_range_start(&range); 1491 - pmd = pmdp_collapse_flush(vma, addr, pmdp); 1492 - tlb_remove_table_sync_one(); 1493 - mmu_notifier_invalidate_range_end(&range); 1494 - mm_dec_nr_ptes(mm); 1495 - page_table_check_pte_clear_range(mm, addr, pmd); 1496 - pte_free(mm, pmd_pgtable(pmd)); 1497 1505 } 1498 1506 1499 1507 /** ··· 1469 1561 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, 1470 1562 bool install_pmd) 1471 1563 { 1564 + struct mmu_notifier_range range; 1565 + bool notified = false; 1472 1566 unsigned long haddr = addr & HPAGE_PMD_MASK; 1473 1567 struct vm_area_struct *vma = vma_lookup(mm, haddr); 1474 1568 struct page *hpage; 1475 1569 pte_t *start_pte, *pte; 1476 - pmd_t *pmd; 1477 - spinlock_t *ptl; 1478 - int count = 0, result = SCAN_FAIL; 1570 + pmd_t *pmd, pgt_pmd; 1571 + spinlock_t *pml = NULL, *ptl; 1572 + int nr_ptes = 0, result = SCAN_FAIL; 1479 1573 int i; 1480 1574 1481 - mmap_assert_write_locked(mm); 1575 + mmap_assert_locked(mm); 1576 + 1577 + /* First check VMA found, in case page tables are being torn down */ 1578 + if (!vma || !vma->vm_file || 1579 + !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE)) 1580 + return SCAN_VMA_CHECK; 1482 1581 1483 1582 /* Fast check before locking page if already PMD-mapped */ 1484 1583 result = find_pmd_or_thp_or_none(mm, haddr, &pmd); 1485 1584 if (result == SCAN_PMD_MAPPED) 1486 1585 return result; 1487 - 1488 - if (!vma || !vma->vm_file || 1489 - !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE)) 1490 - return SCAN_VMA_CHECK; 1491 1586 1492 1587 /* 1493 1588 * If we are here, we've succeeded in replacing all the native pages ··· 1521 1610 goto drop_hpage; 1522 1611 } 1523 1612 1613 + result = find_pmd_or_thp_or_none(mm, haddr, &pmd); 1524 1614 switch (result) 
{ 1525 1615 case SCAN_SUCCEED: 1526 1616 break; 1527 1617 case SCAN_PMD_NONE: 1528 1618 /* 1529 - * In MADV_COLLAPSE path, possible race with khugepaged where 1530 - * all pte entries have been removed and pmd cleared. If so, 1531 - * skip all the pte checks and just update the pmd mapping. 1619 + * All pte entries have been removed and pmd cleared. 1620 + * Skip all the pte checks and just update the pmd mapping. 1532 1621 */ 1533 1622 goto maybe_install_pmd; 1534 1623 default: 1535 1624 goto drop_hpage; 1536 1625 } 1537 1626 1538 - /* Lock the vma before taking i_mmap and page table locks */ 1539 - vma_start_write(vma); 1540 - 1541 - /* 1542 - * We need to lock the mapping so that from here on, only GUP-fast and 1543 - * hardware page walks can access the parts of the page tables that 1544 - * we're operating on. 1545 - * See collapse_and_free_pmd(). 1546 - */ 1547 - i_mmap_lock_write(vma->vm_file->f_mapping); 1548 - 1549 - /* 1550 - * This spinlock should be unnecessary: Nobody else should be accessing 1551 - * the page tables under spinlock protection here, only 1552 - * lockless_pages_from_mm() and the hardware page walker can access page 1553 - * tables while all the high-level locks are held in write mode. 
1554 - */ 1555 1627 result = SCAN_FAIL; 1556 1628 start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl); 1557 - if (!start_pte) 1558 - goto drop_immap; 1629 + if (!start_pte) /* mmap_lock + page lock should prevent this */ 1630 + goto drop_hpage; 1559 1631 1560 1632 /* step 1: check all mapped PTEs are to the right huge page */ 1561 1633 for (i = 0, addr = haddr, pte = start_pte; ··· 1565 1671 */ 1566 1672 if (hpage + i != page) 1567 1673 goto abort; 1568 - count++; 1569 1674 } 1570 1675 1571 - /* step 2: adjust rmap */ 1676 + pte_unmap_unlock(start_pte, ptl); 1677 + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, 1678 + haddr, haddr + HPAGE_PMD_SIZE); 1679 + mmu_notifier_invalidate_range_start(&range); 1680 + notified = true; 1681 + 1682 + /* 1683 + * pmd_lock covers a wider range than ptl, and (if split from mm's 1684 + * page_table_lock) ptl nests inside pml. The less time we hold pml, 1685 + * the better; but userfaultfd's mfill_atomic_pte() on a private VMA 1686 + * inserts a valid as-if-COWed PTE without even looking up page cache. 1687 + * So page lock of hpage does not protect from it, so we must not drop 1688 + * ptl before pgt_pmd is removed, so uffd private needs pml taken now. 
+	 */
+	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
+		pml = pmd_lock(mm, pmd);
+
+	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+	if (!start_pte)		/* mmap_lock + page lock should prevent this */
+		goto abort;
+	if (!pml)
+		spin_lock(ptl);
+	else if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
 		struct page *page;
···
 
 		if (pte_none(ptent))
 			continue;
-		page = vm_normal_page(vma, addr, ptent);
-		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
+		/*
+		 * We dropped ptl after the first scan, to do the mmu_notifier:
+		 * page lock stops more PTEs of the hpage being faulted in, but
+		 * does not stop write faults COWing anon copies from existing
+		 * PTEs; and does not stop those being swapped out or migrated.
+		 */
+		if (!pte_present(ptent)) {
+			result = SCAN_PTE_NON_PRESENT;
 			goto abort;
+		}
+		page = vm_normal_page(vma, addr, ptent);
+		if (hpage + i != page)
+			goto abort;
+
+		/*
+		 * Must clear entry, or a racing truncate may re-remove it.
+		 * TLB flush can be left until pmdp_collapse_flush() does it.
+		 * PTE dirty? Shmem page is already dirty; file is read-only.
+		 */
+		ptep_clear(mm, addr, pte);
 		page_remove_rmap(page, vma, false);
+		nr_ptes++;
 	}
 
-	pte_unmap_unlock(start_pte, ptl);
+	pte_unmap(start_pte);
+	if (!pml)
+		spin_unlock(ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
-	if (count) {
-		page_ref_sub(hpage, count);
-		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+	if (nr_ptes) {
+		page_ref_sub(hpage, nr_ptes);
+		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
 	}
 
-	/* step 4: remove pte entries */
-	/* we make no change to anon, but protect concurrent anon page lookup */
-	if (vma->anon_vma)
-		anon_vma_lock_write(vma->anon_vma);
+	/* step 4: remove empty page table */
+	if (!pml) {
+		pml = pmd_lock(mm, pmd);
+		if (ptl != pml)
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	}
+	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+	pmdp_get_lockless_sync();
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
 
-	collapse_and_free_pmd(mm, vma, haddr, pmd);
+	mmu_notifier_invalidate_range_end(&range);
 
-	if (vma->anon_vma)
-		anon_vma_unlock_write(vma->anon_vma);
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	mm_dec_nr_ptes(mm);
+	page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
+	pte_free_defer(mm, pmd_pgtable(pgt_pmd));
 
 maybe_install_pmd:
 	/* step 5: install pmd entry */
 	result = install_pmd
 			? set_huge_pmd(vma, haddr, pmd, hpage)
 			: SCAN_SUCCEED;
-
+	goto drop_hpage;
+abort:
+	if (nr_ptes) {
+		flush_tlb_mm(mm);
+		page_ref_sub(hpage, nr_ptes);
+		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+	}
+	if (start_pte)
+		pte_unmap_unlock(start_pte, ptl);
+	if (pml && pml != ptl)
+		spin_unlock(pml);
+	if (notified)
+		mmu_notifier_invalidate_range_end(&range);
 drop_hpage:
 	unlock_page(hpage);
 	put_page(hpage);
 	return result;
-
-abort:
-	pte_unmap_unlock(start_pte, ptl);
-drop_immap:
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	goto drop_hpage;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-	struct mm_slot *slot = &mm_slot->slot;
-	struct mm_struct *mm = slot->mm;
-	int i;
-
-	if (likely(mm_slot->nr_pte_mapped_thp == 0))
-		return;
-
-	if (!mmap_write_trylock(mm))
-		return;
-
-	if (unlikely(hpage_collapse_test_exit(mm)))
-		goto out;
-
-	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-	mm_slot->nr_pte_mapped_thp = 0;
-	mmap_write_unlock(mm);
-}
-
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-			       struct mm_struct *target_mm,
-			       unsigned long target_addr, struct page *hpage,
-			       struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
-	int target_result = SCAN_FAIL;
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		int result = SCAN_FAIL;
-		struct mm_struct *mm = NULL;
-		unsigned long addr = 0;
-		pmd_t *pmd;
-		bool is_target = false;
+		struct mmu_notifier_range range;
+		struct mm_struct *mm;
+		unsigned long addr;
+		pmd_t *pmd, pgt_pmd;
+		spinlock_t *pml;
+		spinlock_t *ptl;
+		bool skipped_uffd = false;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth investing
-		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
-		 * later.
-		 *
-		 * Note that vma->anon_vma check is racy: it can be set up after
-		 * the check but before we took mmap_lock by the fault path.
-		 * But page lock would prevent establishing any new ptes of the
-		 * page, so we are safe.
-		 *
-		 * An alternative would be drop the check, but check that page
-		 * table is clear before calling pmdp_collapse_flush() under
-		 * ptl. It has higher chance to recover THP for the VMA, but
-		 * has higher cost too. It would also probably require locking
-		 * the anon_vma.
+		 * got written to. These VMAs are likely not worth removing
+		 * page tables from, as PMD-mapping is likely to be split later.
 		 */
-		if (READ_ONCE(vma->anon_vma)) {
-			result = SCAN_PAGE_ANON;
-			goto next;
-		}
+		if (READ_ONCE(vma->anon_vma))
+			continue;
+
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		if (addr & ~HPAGE_PMD_MASK ||
-		    vma->vm_end < addr + HPAGE_PMD_SIZE) {
-			result = SCAN_VMA_CHECK;
-			goto next;
-		}
-		mm = vma->vm_mm;
-		is_target = mm == target_mm && addr == target_addr;
-		result = find_pmd_or_thp_or_none(mm, addr, &pmd);
-		if (result != SCAN_SUCCEED)
-			goto next;
-		/*
-		 * We need exclusive mmap_lock to retract page table.
-		 *
-		 * We use trylock due to lock inversion: we need to acquire
-		 * mmap_lock while holding page lock. Fault path does it in
-		 * reverse order. Trylock is a way to avoid deadlock.
-		 *
-		 * Also, it's not MADV_COLLAPSE's job to collapse other
-		 * mappings - let khugepaged take care of them later.
-		 */
-		result = SCAN_PTE_MAPPED_HUGEPAGE;
-		if ((cc->is_khugepaged || is_target) &&
-		    mmap_write_trylock(mm)) {
-			/* trylock for the same lock inversion as above */
-			if (!vma_try_start_write(vma))
-				goto unlock_next;
-
-			/*
-			 * Re-check whether we have an ->anon_vma, because
-			 * collapse_and_free_pmd() requires that either no
-			 * ->anon_vma exists or the anon_vma is locked.
-			 * We already checked ->anon_vma above, but that check
-			 * is racy because ->anon_vma can be populated under the
-			 * mmap lock in read mode.
-			 */
-			if (vma->anon_vma) {
-				result = SCAN_PAGE_ANON;
-				goto unlock_next;
-			}
-			/*
-			 * When a vma is registered with uffd-wp, we can't
-			 * recycle the pmd pgtable because there can be pte
-			 * markers installed. Skip it only, so the rest mm/vma
-			 * can still have the same file mapped hugely, however
-			 * it'll always mapped in small page size for uffd-wp
-			 * registered ranges.
-			 */
-			if (hpage_collapse_test_exit(mm)) {
-				result = SCAN_ANY_PROCESS;
-				goto unlock_next;
-			}
-			if (userfaultfd_wp(vma)) {
-				result = SCAN_PTE_UFFD_WP;
-				goto unlock_next;
-			}
-			collapse_and_free_pmd(mm, vma, addr, pmd);
-			if (!cc->is_khugepaged && is_target)
-				result = set_huge_pmd(vma, addr, pmd, hpage);
-			else
-				result = SCAN_SUCCEED;
-
-unlock_next:
-			mmap_write_unlock(mm);
-			goto next;
-		}
-		/*
-		 * Calling context will handle target mm/addr. Otherwise, let
-		 * khugepaged try again later.
-		 */
-		if (!is_target) {
-			khugepaged_add_pte_mapped_thp(mm, addr);
+		    vma->vm_end < addr + HPAGE_PMD_SIZE)
 			continue;
+
+		mm = vma->vm_mm;
+		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
+			continue;
+
+		if (hpage_collapse_test_exit(mm))
+			continue;
+		/*
+		 * When a vma is registered with uffd-wp, we cannot recycle
+		 * the page table because there may be pte markers installed.
+		 * Other vmas can still have the same file mapped hugely, but
+		 * skip this one: it will always be mapped in small page size
+		 * for uffd-wp registered ranges.
+		 */
+		if (userfaultfd_wp(vma))
+			continue;
+
+		/* PTEs were notified when unmapped; but now for the PMD? */
+		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+					addr, addr + HPAGE_PMD_SIZE);
+		mmu_notifier_invalidate_range_start(&range);
+
+		pml = pmd_lock(mm, pmd);
+		ptl = pte_lockptr(mm, pmd);
+		if (ptl != pml)
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+		/*
+		 * Huge page lock is still held, so normally the page table
+		 * must remain empty; and we have already skipped anon_vma
+		 * and userfaultfd_wp() vmas. But since the mmap_lock is not
+		 * held, it is still possible for a racing userfaultfd_ioctl()
+		 * to have inserted ptes or markers. Now that we hold ptlock,
+		 * repeating the anon_vma check protects from one category,
+		 * and repeating the userfaultfd_wp() check from another.
+		 */
+		if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
+			skipped_uffd = true;
+		} else {
+			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
+			pmdp_get_lockless_sync();
 		}
-next:
-		if (is_target)
-			target_result = result;
+
+		if (ptl != pml)
+			spin_unlock(ptl);
+		spin_unlock(pml);
+
+		mmu_notifier_invalidate_range_end(&range);
+
+		if (!skipped_uffd) {
+			mm_dec_nr_ptes(mm);
+			page_table_check_pte_clear_range(mm, addr, pgt_pmd);
+			pte_free_defer(mm, pmd_pgtable(pgt_pmd));
+		}
 	}
-	i_mmap_unlock_write(mapping);
-	return target_result;
+	i_mmap_unlock_read(mapping);
 }
 
 /**
···
 		goto out_unlock;
 	}
 
-	if (folio_has_private(folio) &&
-	    !filemap_release_folio(folio, GFP_KERNEL)) {
+	if (!filemap_release_folio(folio, GFP_KERNEL)) {
 		result = SCAN_PAGE_HAS_PRIVATE;
 		folio_putback_lru(folio);
 		goto out_unlock;
···
 
 	/*
 	 * Remove pte page tables, so we can re-fault the page as huge.
+	 * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
 	 */
-	result = retract_page_tables(mapping, start, mm, addr, hpage,
-				     cc);
+	retract_page_tables(mapping, start);
+	if (cc && !cc->is_khugepaged)
+		result = SCAN_PTE_MAPPED_HUGEPAGE;
 	unlock_page(hpage);
 
 	/*
···
 {
 	BUILD_BUG();
 }
-
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-}
-
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	return false;
-}
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
···
 		khugepaged_scan.mm_slot = mm_slot;
 	}
 	spin_unlock(&khugepaged_mm_lock);
-	khugepaged_collapse_pte_mapped_thps(mm_slot);
 
 	mm = slot->mm;
 	/*
···
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				*result = hpage_collapse_scan_file(mm,
-								   khugepaged_scan.address,
-								   file, pgoff, cc);
 				mmap_locked = false;
+				*result = hpage_collapse_scan_file(mm,
+					khugepaged_scan.address, file, pgoff, cc);
 				fput(file);
+				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
+					mmap_read_lock(mm);
+					if (hpage_collapse_test_exit(mm))
+						goto breakouterloop;
+					*result = collapse_pte_mapped_thp(mm,
+						khugepaged_scan.address, false);
+					if (*result == SCAN_PMD_MAPPED)
+						*result = SCAN_SUCCEED;
+					mmap_read_unlock(mm);
+				}
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
-								  khugepaged_scan.address,
-								  &mmap_locked,
-								  cc);
+					khugepaged_scan.address, &mmap_locked, cc);
 			}
-			switch (*result) {
-			case SCAN_PTE_MAPPED_HUGEPAGE: {
-				pmd_t *pmd;
 
-				*result = find_pmd_or_thp_or_none(mm,
-								  khugepaged_scan.address,
-								  &pmd);
-				if (*result != SCAN_SUCCEED)
-					break;
-				if (!khugepaged_add_pte_mapped_thp(mm,
-								   khugepaged_scan.address))
-					break;
-			} fallthrough;
-			case SCAN_SUCCEED:
+			if (*result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
-				break;
-			default:
-				break;
-			}
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
···
 	case SCAN_PTE_MAPPED_HUGEPAGE:
 		BUG_ON(mmap_locked);
 		BUG_ON(*prev);
-		mmap_write_lock(mm);
+		mmap_read_lock(mm);
 		result = collapse_pte_mapped_thp(mm, addr, true);
-		mmap_write_unlock(mm);
+		mmap_read_unlock(mm);
 		goto handle_result;
 	/* Whitelisted set of results where continuing OK */
 	case SCAN_PMD_NULL:
+10 -5
mm/kmemleak.c
···
 /* same as above but only for the kmemleak_free() callback */
 static int kmemleak_free_enabled = 1;
 /* set in the late_initcall if there were no errors */
-static int kmemleak_initialized;
+static int kmemleak_late_initialized;
 /* set if a kmemleak warning was issued */
 static int kmemleak_warning;
 /* set if a fatal kmemleak error has occurred */
···
 	unsigned long entries[MAX_TRACE];
 	unsigned int nr_entries;
 
-	if (!kmemleak_initialized)
+	/*
+	 * Use object_cache to determine whether kmemleak_init() has
+	 * been invoked. stack_depot_early_init() is called before
+	 * kmemleak_init() in mm_core_init().
+	 */
+	if (!object_cache)
 		return 0;
 	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 3);
 	trace_handle = stack_depot_save(entries, nr_entries, GFP_NOWAIT);
···
 	kmemleak_enabled = 0;
 
 	/* check whether it is too early for a kernel thread */
-	if (kmemleak_initialized)
+	if (kmemleak_late_initialized)
 		schedule_work(&cleanup_work);
 	else
 		kmemleak_free_enabled = 0;
···
  */
 static int __init kmemleak_late_init(void)
 {
-	kmemleak_initialized = 1;
+	kmemleak_late_initialized = 1;
 
 	debugfs_create_file("kmemleak", 0644, NULL, NULL, &kmemleak_fops);
 
···
 	/*
 	 * Some error occurred and kmemleak was disabled. There is a
 	 * small chance that kmemleak_disable() was called immediately
-	 * after setting kmemleak_initialized and we may end up with
+	 * after setting kmemleak_late_initialized and we may end up with
 	 * two clean-up threads but serialized by scan_mutex.
 	 */
 	schedule_work(&cleanup_work);
+2 -2
mm/kmsan/hooks.c
···
 	page = virt_to_head_page((void *)ptr);
 	KMSAN_WARN_ON(ptr != page_address(page));
 	kmsan_internal_poison_memory((void *)ptr,
-				     PAGE_SIZE << compound_order(page),
+				     page_size(page),
 				     GFP_KERNEL,
 				     KMSAN_POISON_CHECK | KMSAN_POISON_FREE);
 	kmsan_leave_runtime();
···
 	 * internal KMSAN checks.
 	 */
 	while (size > 0) {
-		page_offset = addr % PAGE_SIZE;
+		page_offset = offset_in_page(addr);
 		to_go = min(PAGE_SIZE - page_offset, (u64)size);
 		kmsan_handle_dma_page((void *)addr, to_go, dir);
 		addr += to_go;
+4 -4
mm/kmsan/shadow.c
···
 		return NULL;
 	if (!page_has_metadata(page))
 		return NULL;
-	off = addr % PAGE_SIZE;
+	off = offset_in_page(addr);
 
 	return (is_origin ? origin_ptr_for(page) : shadow_ptr_for(page)) + off;
 }
···
 		return;
 	kmsan_enter_runtime();
 	kmsan_internal_poison_memory(page_address(page),
-				     PAGE_SIZE << compound_order(page),
+				     page_size(page),
 				     GFP_KERNEL,
 				     KMSAN_POISON_CHECK | KMSAN_POISON_FREE);
 	kmsan_leave_runtime();
···
 	struct page *page;
 	u64 size;
 
-	start = (void *)ALIGN_DOWN((u64)start, PAGE_SIZE);
-	size = ALIGN((u64)end - (u64)start, PAGE_SIZE);
+	start = (void *)PAGE_ALIGN_DOWN((u64)start);
+	size = PAGE_ALIGN((u64)end - (u64)start);
 	shadow = memblock_alloc(size, PAGE_SIZE);
 	origin = memblock_alloc(size, PAGE_SIZE);
 	for (u64 addr = 0; addr < size; addr += PAGE_SIZE) {
+38 -6
mm/ksm.c
···
 static struct kmem_cache *stable_node_cache;
 static struct kmem_cache *mm_slot_cache;
 
+/* The number of pages scanned */
+static unsigned long ksm_pages_scanned;
+
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
···
 /* Whether to merge empty (zeroed) pages with actual zero pages */
 static bool ksm_use_zero_pages __read_mostly;
+
+/* The number of zero pages which is placed by KSM */
+unsigned long ksm_zero_pages;
 
 #ifdef CONFIG_NUMA
 /* Zeroed when merging across nodes is not allowed */
···
 		if (is_migration_entry(entry))
 			page = pfn_swap_entry_to_page(entry);
 	}
-	ret = page && PageKsm(page);
+	/* return 1 if the page is an normal ksm page or KSM-placed zero page */
+	ret = (page && PageKsm(page)) || is_ksm_zero_pte(*pte);
 	pte_unmap_unlock(pte, ptl);
 	return ret;
 }
···
 		page_add_anon_rmap(kpage, vma, addr, RMAP_NONE);
 		newpte = mk_pte(kpage, vma->vm_page_prot);
 	} else {
-		newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
-					       vma->vm_page_prot));
+		/*
+		 * Use pte_mkdirty to mark the zero page mapped by KSM, and then
+		 * we can easily track all KSM-placed zero pages by checking if
+		 * the dirty bit in zero page's PTE is set.
+		 */
+		newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
+		ksm_zero_pages++;
+		mm->ksm_zero_pages++;
 		/*
 		 * We're replacing an anonymous page with a zero page, which is
 		 * not anonymous. We need to do proper accounting otherwise we
···
 {
 	struct ksm_rmap_item *rmap_item;
 	struct page *page;
+	unsigned int npages = scan_npages;
 
-	while (scan_npages-- && likely(!freezing(current))) {
+	while (npages-- && likely(!freezing(current))) {
 		cond_resched();
 		rmap_item = scan_get_next_rmap_item(&page);
 		if (!rmap_item)
···
 		cmp_and_merge_page(page, rmap_item);
 		put_page(page);
 	}
+
+	ksm_pages_scanned += scan_npages - npages;
 }
 
 static int ksmd_should_run(void)
···
 #ifdef CONFIG_PROC_FS
 long ksm_process_profit(struct mm_struct *mm)
 {
-	return mm->ksm_merging_pages * PAGE_SIZE -
+	return (long)(mm->ksm_merging_pages + mm->ksm_zero_pages) * PAGE_SIZE -
 		mm->ksm_rmap_items * sizeof(struct ksm_rmap_item);
 }
 #endif /* CONFIG_PROC_FS */
···
 }
 KSM_ATTR(max_page_sharing);
 
+static ssize_t pages_scanned_show(struct kobject *kobj,
+				  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%lu\n", ksm_pages_scanned);
+}
+KSM_ATTR_RO(pages_scanned);
+
 static ssize_t pages_shared_show(struct kobject *kobj,
 				 struct kobj_attribute *attr, char *buf)
 {
···
 }
 KSM_ATTR_RO(pages_volatile);
 
+static ssize_t ksm_zero_pages_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%ld\n", ksm_zero_pages);
+}
+KSM_ATTR_RO(ksm_zero_pages);
+
 static ssize_t general_profit_show(struct kobject *kobj,
 				   struct kobj_attribute *attr, char *buf)
 {
 	long general_profit;
 
-	general_profit = ksm_pages_sharing * PAGE_SIZE -
+	general_profit = (ksm_pages_sharing + ksm_zero_pages) * PAGE_SIZE -
 			 ksm_rmap_items * sizeof(struct ksm_rmap_item);
 
 	return sysfs_emit(buf, "%ld\n", general_profit);
···
 	&sleep_millisecs_attr.attr,
 	&pages_to_scan_attr.attr,
 	&run_attr.attr,
+	&pages_scanned_attr.attr,
 	&pages_shared_attr.attr,
 	&pages_sharing_attr.attr,
 	&pages_unshared_attr.attr,
 	&pages_volatile_attr.attr,
+	&ksm_zero_pages_attr.attr,
 	&full_scans_attr.attr,
 #ifdef CONFIG_NUMA
 	&merge_across_nodes_attr.attr,
+9 -6
mm/madvise.c
···
 }
 
 success:
-	/*
-	 * vm_flags is protected by the mmap_lock held in write mode.
-	 */
+	/* vm_flags is protected by the mmap_lock held in write mode. */
+	vma_start_write(vma);
 	vm_flags_reset(vma, new_flags);
 	if (!vma->vm_file || vma_is_anon_shmem(vma)) {
 		error = replace_anon_vma_name(vma, anon_name);
···
 		ptep = NULL;
 
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
-					     vma, addr, false, &splug);
+					     vma, addr, &splug);
 		if (page)
 			put_page(page);
 	}
···
 	rcu_read_unlock();
 
 	page = read_swap_cache_async(entry, mapping_gfp_mask(mapping),
-				     vma, addr, false, &splug);
+				     vma, addr, &splug);
 	if (page)
 		put_page(page);
 
···
 
 		folio_clear_referenced(folio);
 		folio_test_clear_young(folio);
+		if (folio_test_active(folio))
+			folio_set_workingset(folio);
 		if (pageout) {
 			if (folio_isolate_lru(folio)) {
 				if (folio_test_unevictable(folio))
···
 		 */
 		folio_clear_referenced(folio);
 		folio_test_clear_young(folio);
+		if (folio_test_active(folio))
+			folio_set_workingset(folio);
 		if (pageout) {
 			if (folio_isolate_lru(folio)) {
 				if (folio_test_unevictable(folio))
···
 			free_swap_and_cache(entry);
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 		} else if (is_hwpoison_entry(entry) ||
-			   is_swapin_error_entry(entry)) {
+			   is_poisoned_swp_entry(entry)) {
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 		}
 		continue;
+6 -5
mm/mapping_dirty_helpers.c
···
  * @end: Pointer to the number of the last set bit in @bitmap.
  * none set. The value is modified as new bits are set by the function.
  *
- * Note: When this function returns there is no guarantee that a CPU has
+ * When this function returns there is no guarantee that a CPU has
  * not already dirtied new ptes. However it will not clean any ptes not
  * reported in the bitmap. The guarantees are as follows:
- * a) All ptes dirty when the function starts executing will end up recorded
- *    in the bitmap.
- * b) All ptes dirtied after that will either remain dirty, be recorded in the
- *    bitmap or both.
+ *
+ * * All ptes dirty when the function starts executing will end up recorded
+ *   in the bitmap.
+ * * All ptes dirtied after that will either remain dirty, be recorded in the
+ *   bitmap or both.
  *
  * If a caller needs to make sure all dirty ptes are picked up and none
  * additional are added, it first needs to write-protect the address-space
+5
mm/memblock.c
···
 static int memblock_memory_in_slab __initdata_memblock;
 static int memblock_reserved_in_slab __initdata_memblock;
 
+bool __init_memblock memblock_has_mirror(void)
+{
+	return system_has_some_mirror;
+}
+
 static enum memblock_flags __init_memblock choose_memblock_flags(void)
 {
 	return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
+63 -73
mm/memcontrol.c
···
 };
 
 /*
- * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
+ * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
  * limit reclaim to prevent infinite loops, if they ever occur.
  */
 #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100
···
 	long state[MEMCG_NR_STAT];
 	unsigned long events[NR_MEMCG_EVENTS];
 
+	/* Non-hierarchical (CPU aggregated) page state & events */
+	long state_local[MEMCG_NR_STAT];
+	unsigned long events_local[NR_MEMCG_EVENTS];
+
 	/* Pending child counts during tree propagation */
 	long state_pending[MEMCG_NR_STAT];
 	unsigned long events_pending[NR_MEMCG_EVENTS];
···
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
-	long x = 0;
-	int cpu;
+	long x = READ_ONCE(memcg->vmstats->state_local[idx]);
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
···
 
 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 {
-	long x = 0;
-	int cpu;
 	int index = memcg_events_index(event);
 
 	if (index < 0)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->events[index], cpu);
-	return x;
+	return READ_ONCE(memcg->vmstats->events_local[index]);
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
···
 	WARN_ON_ONCE(seq_buf_has_overflowed(s));
 }
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
 /**
  * mem_cgroup_print_oom_context: Print OOM information relevant to
  * memory controller.
···
 	return objcg;
 }
 
-struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
+struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
 	struct obj_cgroup *objcg;
 
 	if (!memcg_kmem_online())
 		return NULL;
 
-	if (PageMemcgKmem(page)) {
-		objcg = __folio_objcg(page_folio(page));
+	if (folio_memcg_kmem(folio)) {
+		objcg = __folio_objcg(folio);
 		obj_cgroup_get(objcg);
 	} else {
 		struct mem_cgroup *memcg;
 
 		rcu_read_lock();
-		memcg = __folio_memcg(page_folio(page));
+		memcg = __folio_memcg(folio);
 		if (memcg)
 			objcg = __get_obj_cgroup_from_memcg(memcg);
 		else
···
 		break;
 	case _MEMSWAP:
 		ret = mem_cgroup_resize_max(memcg, nr_pages, true);
-		break;
-	case _KMEM:
-		/* kmem.limit_in_bytes is deprecated. */
-		ret = -EOPNOTSUPP;
 		break;
 	case _TCP:
 		ret = memcg_update_tcp_max(memcg, nr_pages);
···
 	},
 #endif
 	{
-		.name = "kmem.limit_in_bytes",
-		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
-		.write = mem_cgroup_write,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
 		.name = "kmem.usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
 		.read_u64 = mem_cgroup_read_u64,
···
  * those references are manageable from userspace.
  */
 
+#define MEM_CGROUP_ID_MAX	((1UL << MEM_CGROUP_ID_SHIFT) - 1)
 static DEFINE_IDR(mem_cgroup_idr);
 
 static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
···
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct memcg_vmstats_percpu *statc;
-	long delta, v;
+	long delta, delta_cpu, v;
 	int i, nid;
 
 	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
···
 			memcg->vmstats->state_pending[i] = 0;
 
 		/* Add CPU changes on this level since the last flush */
+		delta_cpu = 0;
 		v = READ_ONCE(statc->state[i]);
 		if (v != statc->state_prev[i]) {
-			delta += v - statc->state_prev[i];
+			delta_cpu = v - statc->state_prev[i];
+			delta += delta_cpu;
 			statc->state_prev[i] = v;
 		}
 
-		if (!delta)
-			continue;
-
 		/* Aggregate counts on this level and propagate upwards */
-		memcg->vmstats->state[i] += delta;
-		if (parent)
-			parent->vmstats->state_pending[i] += delta;
+		if (delta_cpu)
+			memcg->vmstats->state_local[i] += delta_cpu;
+
+		if (delta) {
+			memcg->vmstats->state[i] += delta;
+			if (parent)
+				parent->vmstats->state_pending[i] += delta;
+		}
 	}
 
 	for (i = 0; i < NR_MEMCG_EVENTS; i++) {
···
 		if (delta)
 			memcg->vmstats->events_pending[i] = 0;
 
+		delta_cpu = 0;
 		v = READ_ONCE(statc->events[i]);
 		if (v != statc->events_prev[i]) {
-			delta += v - statc->events_prev[i];
+			delta_cpu = v - statc->events_prev[i];
+			delta += delta_cpu;
 			statc->events_prev[i] = v;
 		}
 
-		if (!delta)
-			continue;
+		if (delta_cpu)
+			memcg->vmstats->events_local[i] += delta_cpu;
 
-		memcg->vmstats->events[i] += delta;
-		if (parent)
-			parent->vmstats->events_pending[i] += delta;
+		if (delta) {
+			memcg->vmstats->events[i] += delta;
+			if (parent)
+				parent->vmstats->events_pending[i] += delta;
+		}
 	}
 
 	for_each_node_state(nid, N_MEMORY) {
···
 			if (delta)
 				pn->lruvec_stats.state_pending[i] = 0;
 
+			delta_cpu = 0;
 			v = READ_ONCE(lstatc->state[i]);
 			if (v != lstatc->state_prev[i]) {
-				delta += v - lstatc->state_prev[i];
+				delta_cpu = v - lstatc->state_prev[i];
+				delta += delta_cpu;
 				lstatc->state_prev[i] = v;
 			}
 
-			if (!delta)
-				continue;
+			if (delta_cpu)
+				pn->lruvec_stats.state_local[i] += delta_cpu;
 
-			pn->lruvec_stats.state[i] += delta;
-			if (ppn)
-				ppn->lruvec_stats.state_pending[i] += delta;
+			if (delta) {
+				pn->lruvec_stats.state[i] += delta;
+				if (ppn)
+					ppn->lruvec_stats.state_pending[i] += delta;
+			}
 		}
 	}
 }
···
 {
 	struct page *page = vm_normal_page(vma, addr, ptent);
 
-	if (!page || !page_mapped(page))
+	if (!page)
 		return NULL;
 	if (PageAnon(page)) {
 		if (!(mc.flags & MOVE_ANON))
···
 		if (!(mc.flags & MOVE_FILE))
 			return NULL;
 	}
-	if (!get_page_unless_zero(page))
-		return NULL;
+	get_page(page);
 
 	return page;
 }
···
 	if (folio_mapped(folio)) {
 		__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
 		__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
-		if (folio_test_transhuge(folio)) {
+		if (folio_test_pmd_mappable(folio)) {
 			__mod_lruvec_state(from_vec, NR_ANON_THPS,
 					   -nr_pages);
 			__mod_lruvec_state(to_vec, NR_ANON_THPS,
···
  * @ptent: the pte to be checked
  * @target: the pointer the target page or swap ent will be stored(can be NULL)
  *
- * Returns
- *   0(MC_TARGET_NONE): if the pte is not a target for move charge.
- *   1(MC_TARGET_PAGE): if the page corresponding to this pte is a target for
- *     move charge. if @target is not NULL, the page is stored in target->page
- *     with extra refcnt got(Callers should handle it).
- *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
- *     target for charge migration. if @target is not NULL, the entry is stored
- *     in target->ent.
- *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is device memory and
- *     thus not on the lru.
- *     For now we such page is charge like a regular page would be as for all
- *     intent and purposes it is just special memory taking the place of a
- *     regular page.
- *
- *     See Documentations/vm/hmm.txt and include/linux/hmm.h
- *
- * Called with pte lock held.
+ * Context: Called with pte lock held.
+ * Return:
+ * * MC_TARGET_NONE - If the pte is not a target for move charge.
+ * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for
+ *   move charge. If @target is not NULL, the page is stored in target->page
+ *   with extra refcnt taken (Caller should release it).
+ * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a
+ *   target for charge migration. If @target is not NULL, the entry is
+ *   stored in target->ent.
+ * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and
+ *   thus not on the lru. For now such page is charged like a regular page
+ *   would be as it is just special memory taking the place of a regular page.
+ *   See Documentations/vm/hmm.txt and include/linux/hmm.h
  */
-
 static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		unsigned long addr, pte_t ptent, union mc_target *target)
 {
···
 		lru_add_drain_all();
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					nr_to_reclaim - nr_reclaimed,
-					GFP_KERNEL, reclaim_options);
+					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+					GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
···
 	struct mem_cgroup *memcg;
 	unsigned short id;
 
-	if (mem_cgroup_disabled())
-		return;
-
 	id = swap_cgroup_record(entry, 0, nr_pages);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
···
  * @objcg: the object cgroup
  * @size: size of compressed object
  *
- * This forces the charge after obj_cgroup_may_swap() allowed
+ * This forces the charge after obj_cgroup_may_zswap() allowed
  * compression and storage in zwap for this cgroup to go ahead.
  */
 void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
+30 -28
mm/memfd.c
··· 268 268
269 269 #define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
270 270
271 + static int check_sysctl_memfd_noexec(unsigned int *flags)
272 + {
273 + #ifdef CONFIG_SYSCTL
274 + 	struct pid_namespace *ns = task_active_pid_ns(current);
275 + 	int sysctl = pidns_memfd_noexec_scope(ns);
276 +
277 + 	if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
278 + 		if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
279 + 			*flags |= MFD_NOEXEC_SEAL;
280 + 		else
281 + 			*flags |= MFD_EXEC;
282 + 	}
283 +
284 + 	if (!(*flags & MFD_NOEXEC_SEAL) && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
285 + 		pr_err_ratelimited(
286 + 			"%s[%d]: memfd_create() requires MFD_NOEXEC_SEAL with vm.memfd_noexec=%d\n",
287 + 			current->comm, task_pid_nr(current), sysctl);
288 + 		return -EACCES;
289 + 	}
290 + #endif
291 + 	return 0;
292 + }
293 +
271 294 SYSCALL_DEFINE2(memfd_create,
272 295 		const char __user *, uname,
273 296 		unsigned int, flags)
274 297 {
275 - 	char comm[TASK_COMM_LEN];
276 298 	unsigned int *file_seals;
277 299 	struct file *file;
278 300 	int fd, error;
··· 316 294 		return -EINVAL;
317 295
318 296 	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
319 - #ifdef CONFIG_SYSCTL
320 - 		int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
321 - 		struct pid_namespace *ns;
322 -
323 - 		ns = task_active_pid_ns(current);
324 - 		if (ns)
325 - 			sysctl = ns->memfd_noexec_scope;
326 -
327 - 		switch (sysctl) {
328 - 		case MEMFD_NOEXEC_SCOPE_EXEC:
329 - 			flags |= MFD_EXEC;
330 - 			break;
331 - 		case MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL:
332 - 			flags |= MFD_NOEXEC_SEAL;
333 - 			break;
334 - 		default:
335 - 			pr_warn_once(
336 - 				"memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n",
337 - 				task_pid_nr(current), get_task_comm(comm, current));
338 - 			return -EINVAL;
339 - 		}
340 - #else
341 - 		flags |= MFD_EXEC;
342 - #endif
343 - 		pr_warn_once(
344 - 			"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n",
345 - 			task_pid_nr(current), get_task_comm(comm, current));
297 + 		pr_info_ratelimited(
298 + 			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
299 + 			current->comm, task_pid_nr(current));
346 300 	}
301 +
302 + 	error = check_sysctl_memfd_noexec(&flags);
303 + 	if (error < 0)
304 + 		return error;
347 305
348 306 	/* length includes terminating zero */
349 307 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
+83 -50
mm/memory-failure.c
··· 39 39 #include <linux/kernel.h> 40 40 #include <linux/mm.h> 41 41 #include <linux/page-flags.h> 42 - #include <linux/kernel-page-flags.h> 43 42 #include <linux/sched/signal.h> 44 43 #include <linux/sched/task.h> 45 44 #include <linux/dax.h> ··· 49 50 #include <linux/swap.h> 50 51 #include <linux/backing-dev.h> 51 52 #include <linux/migrate.h> 52 - #include <linux/suspend.h> 53 53 #include <linux/slab.h> 54 54 #include <linux/swapops.h> 55 55 #include <linux/hugetlb.h> ··· 57 59 #include <linux/memremap.h> 58 60 #include <linux/kfifo.h> 59 61 #include <linux/ratelimit.h> 60 - #include <linux/page-isolation.h> 61 62 #include <linux/pagewalk.h> 62 63 #include <linux/shmem_fs.h> 63 64 #include <linux/sysctl.h> ··· 72 75 73 76 static bool hw_memory_failure __read_mostly = false; 74 77 75 - inline void num_poisoned_pages_inc(unsigned long pfn) 78 + static DEFINE_MUTEX(mf_mutex); 79 + 80 + void num_poisoned_pages_inc(unsigned long pfn) 76 81 { 77 82 atomic_long_inc(&num_poisoned_pages); 78 83 memblk_nr_poison_inc(pfn); 79 84 } 80 85 81 - inline void num_poisoned_pages_sub(unsigned long pfn, long i) 86 + void num_poisoned_pages_sub(unsigned long pfn, long i) 82 87 { 83 88 atomic_long_sub(i, &num_poisoned_pages); 84 89 if (pfn != -1UL) ··· 362 363 { 363 364 if (PageHuge(p)) 364 365 return; 365 - 366 - if (!PageSlab(p)) { 367 - lru_add_drain_all(); 368 - if (PageLRU(p) || is_free_buddy_page(p)) 369 - return; 370 - } 371 - 372 366 /* 373 367 * TODO: Could shrink slab caches here if a lightweight range-based 374 368 * shrinker will be available. 
375 369 */ 370 + if (PageSlab(p)) 371 + return; 372 + 373 + lru_add_drain_all(); 376 374 } 377 375 EXPORT_SYMBOL_GPL(shake_page); 378 376 ··· 610 614 611 615 pgoff = page_to_pgoff(page); 612 616 read_lock(&tasklist_lock); 613 - for_each_process (tsk) { 617 + for_each_process(tsk) { 614 618 struct anon_vma_chain *vmac; 615 619 struct task_struct *t = task_early_kill(tsk, force_early); 616 620 ··· 654 658 /* 655 659 * Send early kill signal to tasks where a vma covers 656 660 * the page but the corrupted page is not necessarily 657 - * mapped it in its pte. 661 + * mapped in its pte. 658 662 * Assume applications who requested early kill want 659 663 * to be informed of all such data corruptions. 660 664 */ ··· 936 940 struct folio *folio = page_folio(p); 937 941 int err = mapping->a_ops->error_remove_page(mapping, p); 938 942 939 - if (err != 0) { 943 + if (err != 0) 940 944 pr_info("%#lx: Failed to punch page: %d\n", pfn, err); 941 - } else if (folio_has_private(folio) && 942 - !filemap_release_folio(folio, GFP_NOIO)) { 945 + else if (!filemap_release_folio(folio, GFP_NOIO)) 943 946 pr_info("%#lx: failed to release buffers\n", pfn); 944 - } else { 947 + else 945 948 ret = MF_RECOVERED; 946 - } 947 949 } else { 948 950 /* 949 951 * If the file system doesn't support it just invalidate ··· 1187 1193 struct address_space *mapping; 1188 1194 bool extra_pins = false; 1189 1195 1190 - if (!PageHuge(hpage)) 1191 - return MF_DELAYED; 1192 - 1193 1196 mapping = page_mapping(hpage); 1194 1197 if (mapping) { 1195 1198 res = truncate_error_page(hpage, page_to_pfn(p), mapping); ··· 1386 1395 bool hugetlb = false; 1387 1396 1388 1397 ret = get_hwpoison_hugetlb_folio(folio, &hugetlb, false); 1389 - if (hugetlb) 1390 - return ret; 1398 + if (hugetlb) { 1399 + /* Make sure hugetlb demotion did not happen from under us. 
*/ 1400 + if (folio == page_folio(page)) 1401 + return ret; 1402 + if (ret > 0) { 1403 + folio_put(folio); 1404 + folio = page_folio(page); 1405 + } 1406 + } 1391 1407 1392 1408 /* 1393 1409 * This check prevents from calling folio_try_get() for any ··· 1483 1485 bool hugetlb = false; 1484 1486 1485 1487 ret = get_hwpoison_hugetlb_folio(folio, &hugetlb, true); 1486 - if (hugetlb) 1487 - return ret; 1488 + if (hugetlb) { 1489 + /* Make sure hugetlb demotion did not happen from under us. */ 1490 + if (folio == page_folio(page)) 1491 + return ret; 1492 + if (ret > 0) 1493 + folio_put(folio); 1494 + } 1488 1495 1489 1496 /* 1490 1497 * PageHWPoisonTakenOff pages are not only marked as PG_hwpoison, ··· 1817 1814 #endif /* CONFIG_FS_DAX */ 1818 1815 1819 1816 #ifdef CONFIG_HUGETLB_PAGE 1817 + 1820 1818 /* 1821 1819 * Struct raw_hwp_page represents information about "raw error page", 1822 1820 * constructing singly linked list from ->_hugetlb_hwpoison field of folio. ··· 1832 1828 return (struct llist_head *)&folio->_hugetlb_hwpoison; 1833 1829 } 1834 1830 1831 + bool is_raw_hwpoison_page_in_hugepage(struct page *page) 1832 + { 1833 + struct llist_head *raw_hwp_head; 1834 + struct raw_hwp_page *p; 1835 + struct folio *folio = page_folio(page); 1836 + bool ret = false; 1837 + 1838 + if (!folio_test_hwpoison(folio)) 1839 + return false; 1840 + 1841 + if (!folio_test_hugetlb(folio)) 1842 + return PageHWPoison(page); 1843 + 1844 + /* 1845 + * When RawHwpUnreliable is set, kernel lost track of which subpages 1846 + * are HWPOISON. So return as if ALL subpages are HWPOISONed. 
1847 + */ 1848 + if (folio_test_hugetlb_raw_hwp_unreliable(folio)) 1849 + return true; 1850 + 1851 + mutex_lock(&mf_mutex); 1852 + 1853 + raw_hwp_head = raw_hwp_list_head(folio); 1854 + llist_for_each_entry(p, raw_hwp_head->first, node) { 1855 + if (page == p->page) { 1856 + ret = true; 1857 + break; 1858 + } 1859 + } 1860 + 1861 + mutex_unlock(&mf_mutex); 1862 + 1863 + return ret; 1864 + } 1865 + 1835 1866 static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) 1836 1867 { 1837 - struct llist_head *head; 1838 - struct llist_node *t, *tnode; 1868 + struct llist_node *head; 1869 + struct raw_hwp_page *p, *next; 1839 1870 unsigned long count = 0; 1840 1871 1841 - head = raw_hwp_list_head(folio); 1842 - llist_for_each_safe(tnode, t, head->first) { 1843 - struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); 1844 - 1872 + head = llist_del_all(raw_hwp_list_head(folio)); 1873 + llist_for_each_entry_safe(p, next, head, node) { 1845 1874 if (move_flag) 1846 1875 SetPageHWPoison(p->page); 1847 1876 else ··· 1882 1845 kfree(p); 1883 1846 count++; 1884 1847 } 1885 - llist_del_all(head); 1886 1848 return count; 1887 1849 } 1888 1850 ··· 1889 1853 { 1890 1854 struct llist_head *head; 1891 1855 struct raw_hwp_page *raw_hwp; 1892 - struct llist_node *t, *tnode; 1856 + struct raw_hwp_page *p, *next; 1893 1857 int ret = folio_test_set_hwpoison(folio) ? 
-EHWPOISON : 0; 1894 1858 1895 1859 /* ··· 1900 1864 if (folio_test_hugetlb_raw_hwp_unreliable(folio)) 1901 1865 return -EHWPOISON; 1902 1866 head = raw_hwp_list_head(folio); 1903 - llist_for_each_safe(tnode, t, head->first) { 1904 - struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); 1905 - 1867 + llist_for_each_entry_safe(p, next, head->first, node) { 1906 1868 if (p->page == page) 1907 1869 return -EHWPOISON; 1908 1870 } ··· 1950 1916 void folio_clear_hugetlb_hwpoison(struct folio *folio) 1951 1917 { 1952 1918 if (folio_test_hugetlb_raw_hwp_unreliable(folio)) 1919 + return; 1920 + if (folio_test_hugetlb_vmemmap_optimized(folio)) 1953 1921 return; 1954 1922 folio_clear_hwpoison(folio); 1955 1923 folio_free_raw_hwp(folio, true); ··· 2117 2081 { 2118 2082 int rc = -ENXIO; 2119 2083 2120 - put_ref_page(pfn, flags); 2121 - 2122 2084 /* device metadata space is not recoverable */ 2123 2085 if (!pgmap_pfn_valid(pgmap, pfn)) 2124 2086 goto out; ··· 2139 2105 out: 2140 2106 /* drop pgmap ref acquired in caller */ 2141 2107 put_dev_pagemap(pgmap); 2142 - action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED); 2108 + if (rc != -EOPNOTSUPP) 2109 + action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED); 2143 2110 return rc; 2144 2111 } 2145 - 2146 - static DEFINE_MUTEX(mf_mutex); 2147 2112 2148 2113 /** 2149 2114 * memory_failure - Handle memory failure of a page. ··· 2159 2126 * detected by a background scrubber) 2160 2127 * 2161 2128 * Must run in process context (e.g. a work queue) with interrupts 2162 - * enabled and no spinlocks hold. 2129 + * enabled and no spinlocks held. 
2163 2130 * 2164 2131 * Return: 0 for successfully handled the memory error, 2165 2132 * -EOPNOTSUPP for hwpoison_filter() filtered the error event, ··· 2191 2158 2192 2159 if (pfn_valid(pfn)) { 2193 2160 pgmap = get_dev_pagemap(pfn, NULL); 2161 + put_ref_page(pfn, flags); 2194 2162 if (pgmap) { 2195 2163 res = memory_failure_dev_pagemap(pfn, flags, 2196 2164 pgmap); ··· 2217 2183 put_page(p); 2218 2184 goto unlock_mutex; 2219 2185 } 2220 - 2221 - hpage = compound_head(p); 2222 2186 2223 2187 /* 2224 2188 * We need/can do nothing about count=0 pages. ··· 2256 2224 } 2257 2225 } 2258 2226 2227 + hpage = compound_head(p); 2259 2228 if (PageTransHuge(hpage)) { 2260 2229 /* 2261 2230 * The flag must be set after the refcount is bumped 2262 2231 * otherwise it may race with THP split. 2263 2232 * And the flag can't be set in get_hwpoison_page() since 2264 2233 * it is called by soft offline too and it is just called 2265 - * for !MF_COUNT_INCREASE. So here seems to be the best 2234 + * for !MF_COUNT_INCREASED. So here seems to be the best 2266 2235 * place. 2267 2236 * 2268 2237 * Don't need care about the above error handling paths for ··· 2623 2590 2624 2591 /* 2625 2592 * If we succeed to isolate the page, we grabbed another refcount on 2626 - * the page, so we can safely drop the one we got from get_any_pages(). 2593 + * the page, so we can safely drop the one we got from get_any_page(). 2627 2594 * If we failed to isolate the page, it means that we cannot go further 2628 2595 * and we will return an error, so drop the reference we got from 2629 - * get_any_pages() as well. 2596 + * get_any_page() as well. 
2630 2597 */ 2631 2598 put_page(page); 2632 2599 return isolated; ··· 2659 2626 } 2660 2627 2661 2628 lock_page(page); 2662 - if (!PageHuge(page)) 2629 + if (!huge) 2663 2630 wait_on_page_writeback(page); 2664 2631 if (PageHWPoison(page)) { 2665 2632 unlock_page(page); ··· 2668 2635 return 0; 2669 2636 } 2670 2637 2671 - if (!PageHuge(page) && PageLRU(page) && !PageSwapCache(page)) 2638 + if (!huge && PageLRU(page) && !PageSwapCache(page)) 2672 2639 /* 2673 2640 * Try to invalidate first. This should work for 2674 2641 * non dirty unmapped page cache pages.
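Among the memory-failure.c changes above, __folio_free_raw_hwp() is reworked to detach the whole raw_hwp_page list up front (the kernel uses llist_del_all()) and only then walk and free the now-private entries, instead of freeing while the published head still points at them. A minimal userspace model of that detach-then-free pattern with a plain singly linked list; all toy_* names are illustrative, not kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy counterpart of struct raw_hwp_page on the folio's llist. */
struct toy_node {
	struct toy_node *next;
	unsigned long pfn;
};

/* Detach: caller now owns everything reachable from the returned head. */
static struct toy_node *toy_del_all(struct toy_node **head)
{
	struct toy_node *first = *head;

	*head = NULL;
	return first;
}

/*
 * Mirrors __folio_free_raw_hwp(): detach first, then free the private
 * copy, so no walk ever races with the still-published list head.
 */
static unsigned long toy_free_all(struct toy_node **head)
{
	struct toy_node *p, *next;
	unsigned long count = 0;

	for (p = toy_del_all(head); p; p = next) {
		next = p->next;	/* save before freeing, as in _safe walks */
		free(p);
		count++;
	}
	return count;
}

static void toy_push(struct toy_node **head, unsigned long pfn)
{
	struct toy_node *n = malloc(sizeof(*n));

	n->pfn = pfn;
	n->next = *head;
	*head = n;
}
```

In the kernel the detach is a single atomic llist operation; the toy pointer swap only models the ownership transfer, not the lock-free behaviour.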
+9 -10
mm/memory-tiers.c
··· 560 560 }
561 561 EXPORT_SYMBOL_GPL(alloc_memory_type);
562 562
563 - void destroy_memory_type(struct memory_dev_type *memtype)
563 + void put_memory_type(struct memory_dev_type *memtype)
564 564 {
565 565 	kref_put(&memtype->kref, release_memtype);
566 566 }
567 - EXPORT_SYMBOL_GPL(destroy_memory_type);
567 + EXPORT_SYMBOL_GPL(put_memory_type);
568 568
569 569 void init_node_memory_type(int node, struct memory_dev_type *memtype)
570 570 {
··· 586 586 	 */
587 587 	if (!node_memory_types[node].map_count) {
588 588 		node_memory_types[node].memtype = NULL;
589 - 		kref_put(&memtype->kref, release_memtype);
589 + 		put_memory_type(memtype);
590 590 	}
591 591 	mutex_unlock(&memory_tier_lock);
592 592 }
··· 672 672
673 673 #ifdef CONFIG_MIGRATION
674 674 #ifdef CONFIG_SYSFS
675 - static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
676 - 		struct kobj_attribute *attr, char *buf)
675 + static ssize_t demotion_enabled_show(struct kobject *kobj,
676 + 		struct kobj_attribute *attr, char *buf)
677 677 {
678 678 	return sysfs_emit(buf, "%s\n",
679 679 		numa_demotion_enabled ? "true" : "false");
680 680 }
681 681
682 - static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
683 - 		struct kobj_attribute *attr,
684 - 		const char *buf, size_t count)
682 + static ssize_t demotion_enabled_store(struct kobject *kobj,
683 + 		struct kobj_attribute *attr,
684 + 		const char *buf, size_t count)
685 685 {
686 686 	ssize_t ret;
687 687
··· 693 693 }
694 694
695 695 static struct kobj_attribute numa_demotion_enabled_attr =
696 - 	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
697 - 		numa_demotion_enabled_store);
696 + 	__ATTR_RW(demotion_enabled);
698 697
699 698 static struct attribute *numa_attrs[] = {
700 699 	&numa_demotion_enabled_attr.attr,
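The memory-tiers.c rename above (destroy_memory_type → put_memory_type) is purely about honesty of naming: the function drops one kref and only destroys the object when the count reaches zero, so "put" describes it and "destroy" did not. A small userspace model of that put semantics; the toy_* names and the released flag are illustrative, not the kernel's kref API:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a kref-counted memory_dev_type. */
struct toy_memtype {
	int refcount;
	int *released;	/* set to 1 by the release path, for observation */
};

static struct toy_memtype *toy_alloc(int *released)
{
	struct toy_memtype *t = malloc(sizeof(*t));

	t->refcount = 1;	/* allocation holds the first reference */
	t->released = released;
	return t;
}

static void toy_get(struct toy_memtype *t)
{
	t->refcount++;
}

/*
 * Like kref_put(): drop one reference, release only at zero. This is
 * why put_memory_type() is the honest name -- calling it does not
 * necessarily destroy anything.
 */
static void toy_put(struct toy_memtype *t)
{
	if (--t->refcount == 0) {
		*t->released = 1;
		free(t);
	}
}
```

The same hunk also makes clear_node_memory_type() call the helper instead of open-coding kref_put(), keeping all release logic behind one entry point.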
+190 -151
mm/memory.c
··· 77 77 #include <linux/ptrace.h> 78 78 #include <linux/vmalloc.h> 79 79 #include <linux/sched/sysctl.h> 80 - #include <linux/net_mm.h> 81 80 82 81 #include <trace/events/kmem.h> 83 82 ··· 360 361 } while (pgd++, addr = next, addr != end); 361 362 } 362 363 363 - void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt, 364 + void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, 364 365 struct vm_area_struct *vma, unsigned long floor, 365 366 unsigned long ceiling, bool mm_wr_locked) 366 367 { 367 - MA_STATE(mas, mt, vma->vm_end, vma->vm_end); 368 - 369 368 do { 370 369 unsigned long addr = vma->vm_start; 371 370 struct vm_area_struct *next; ··· 372 375 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may 373 376 * be 0. This will underflow and is okay. 374 377 */ 375 - next = mas_find(&mas, ceiling - 1); 378 + next = mas_find(mas, ceiling - 1); 376 379 377 380 /* 378 381 * Hide vma from rmap and truncate_pagecache before freeing ··· 393 396 while (next && next->vm_start <= vma->vm_end + PMD_SIZE 394 397 && !is_vm_hugetlb_page(next)) { 395 398 vma = next; 396 - next = mas_find(&mas, ceiling - 1); 399 + next = mas_find(mas, ceiling - 1); 397 400 if (mm_wr_locked) 398 401 vma_start_write(vma); 399 402 unlink_anon_vmas(vma); ··· 857 860 return -EBUSY; 858 861 return -ENOENT; 859 862 } else if (is_pte_marker_entry(entry)) { 860 - if (is_swapin_error_entry(entry) || userfaultfd_wp(dst_vma)) 861 - set_pte_at(dst_mm, addr, dst_pte, pte); 863 + pte_marker marker = copy_pte_marker(entry, dst_vma); 864 + 865 + if (marker) 866 + set_pte_at(dst_mm, addr, dst_pte, 867 + make_pte_marker(marker)); 862 868 return 0; 863 869 } 864 870 if (!userfaultfd_wp(dst_vma)) ··· 1312 1312 * Use the raw variant of the seqcount_t write API to avoid 1313 1313 * lockdep complaining about preemptibility. 
1314 1314 */ 1315 - mmap_assert_write_locked(src_mm); 1315 + vma_assert_write_locked(src_vma); 1316 1316 raw_write_seqcount_begin(&src_mm->write_protect_seq); 1317 1317 } 1318 1318 ··· 1433 1433 tlb_remove_tlb_entry(tlb, pte, addr); 1434 1434 zap_install_uffd_wp_if_needed(vma, addr, pte, details, 1435 1435 ptent); 1436 - if (unlikely(!page)) 1436 + if (unlikely(!page)) { 1437 + ksm_might_unmap_zero_page(mm, ptent); 1437 1438 continue; 1439 + } 1438 1440 1439 1441 delay_rmap = 0; 1440 1442 if (!PageAnon(page)) { ··· 1502 1500 !zap_drop_file_uffd_wp(details)) 1503 1501 continue; 1504 1502 } else if (is_hwpoison_entry(entry) || 1505 - is_swapin_error_entry(entry)) { 1503 + is_poisoned_swp_entry(entry)) { 1506 1504 if (!should_zap_cows(details)) 1507 1505 continue; 1508 1506 } else { ··· 1693 1691 /** 1694 1692 * unmap_vmas - unmap a range of memory covered by a list of vma's 1695 1693 * @tlb: address of the caller's struct mmu_gather 1696 - * @mt: the maple tree 1694 + * @mas: the maple state 1697 1695 * @vma: the starting vma 1698 1696 * @start_addr: virtual address at which to start unmapping 1699 1697 * @end_addr: virtual address at which to end unmapping 1698 + * @tree_end: The maximum index to check 1699 + * @mm_wr_locked: lock flag 1700 1700 * 1701 1701 * Unmap all pages in the vma list. 1702 1702 * ··· 1711 1707 * ensure that any thus-far unmapped pages are flushed before unmap_vmas() 1712 1708 * drops the lock and schedules. 1713 1709 */ 1714 - void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt, 1710 + void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas, 1715 1711 struct vm_area_struct *vma, unsigned long start_addr, 1716 - unsigned long end_addr, bool mm_wr_locked) 1712 + unsigned long end_addr, unsigned long tree_end, 1713 + bool mm_wr_locked) 1717 1714 { 1718 1715 struct mmu_notifier_range range; 1719 1716 struct zap_details details = { ··· 1722 1717 /* Careful - we need to zap private pages too! 
*/ 1723 1718 .even_cows = true, 1724 1719 }; 1725 - MA_STATE(mas, mt, vma->vm_end, vma->vm_end); 1726 1720 1727 1721 mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm, 1728 1722 start_addr, end_addr); ··· 1729 1725 do { 1730 1726 unmap_single_vma(tlb, vma, start_addr, end_addr, &details, 1731 1727 mm_wr_locked); 1732 - } while ((vma = mas_find(&mas, end_addr - 1)) != NULL); 1728 + } while ((vma = mas_find(mas, tree_end - 1)) != NULL); 1733 1729 mmu_notifier_invalidate_range_end(&range); 1734 1730 } 1735 1731 ··· 1869 1865 return retval; 1870 1866 } 1871 1867 1872 - #ifdef pte_index 1873 1868 static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte, 1874 1869 unsigned long addr, struct page *page, pgprot_t prot) 1875 1870 { ··· 1883 1880 } 1884 1881 1885 1882 /* insert_pages() amortizes the cost of spinlock operations 1886 - * when inserting pages in a loop. Arch *must* define pte_index. 1883 + * when inserting pages in a loop. 1887 1884 */ 1888 1885 static int insert_pages(struct vm_area_struct *vma, unsigned long addr, 1889 1886 struct page **pages, unsigned long *num, pgprot_t prot) ··· 1942 1939 *num = remaining_pages_total; 1943 1940 return ret; 1944 1941 } 1945 - #endif /* ifdef pte_index */ 1946 1942 1947 1943 /** 1948 1944 * vm_insert_pages - insert multiple pages into user vma, batching the pmd lock. ··· 1961 1959 int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, 1962 1960 struct page **pages, unsigned long *num) 1963 1961 { 1964 - #ifdef pte_index 1965 1962 const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1; 1966 1963 1967 1964 if (addr < vma->vm_start || end_addr >= vma->vm_end) ··· 1972 1971 } 1973 1972 /* Defer page refcount checking till we're about to map that page. 
*/ 1974 1973 return insert_pages(vma, addr, pages, num, vma->vm_page_prot); 1975 - #else 1976 - unsigned long idx = 0, pgcount = *num; 1977 - int err = -EINVAL; 1978 - 1979 - for (; idx < pgcount; ++idx) { 1980 - err = vm_insert_page(vma, addr + (PAGE_SIZE * idx), pages[idx]); 1981 - if (err) 1982 - break; 1983 - } 1984 - *num = pgcount - idx; 1985 - return err; 1986 - #endif /* ifdef pte_index */ 1987 1974 } 1988 1975 EXPORT_SYMBOL(vm_insert_pages); 1989 1976 ··· 2847 2858 2848 2859 entry = pte_mkyoung(vmf->orig_pte); 2849 2860 if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) 2850 - update_mmu_cache(vma, addr, vmf->pte); 2861 + update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1); 2851 2862 } 2852 2863 2853 2864 /* ··· 2916 2927 * 2917 2928 * We do this without the lock held, so that it can sleep if it needs to. 2918 2929 */ 2919 - static vm_fault_t do_page_mkwrite(struct vm_fault *vmf) 2930 + static vm_fault_t do_page_mkwrite(struct vm_fault *vmf, struct folio *folio) 2920 2931 { 2921 2932 vm_fault_t ret; 2922 - struct page *page = vmf->page; 2923 2933 unsigned int old_flags = vmf->flags; 2924 2934 2925 2935 vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; ··· 2933 2945 if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) 2934 2946 return ret; 2935 2947 if (unlikely(!(ret & VM_FAULT_LOCKED))) { 2936 - lock_page(page); 2937 - if (!page->mapping) { 2938 - unlock_page(page); 2948 + folio_lock(folio); 2949 + if (!folio->mapping) { 2950 + folio_unlock(folio); 2939 2951 return 0; /* retry */ 2940 2952 } 2941 2953 ret |= VM_FAULT_LOCKED; 2942 2954 } else 2943 - VM_BUG_ON_PAGE(!PageLocked(page), page); 2955 + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); 2944 2956 return ret; 2945 2957 } 2946 2958 ··· 2953 2965 { 2954 2966 struct vm_area_struct *vma = vmf->vma; 2955 2967 struct address_space *mapping; 2956 - struct page *page = vmf->page; 2968 + struct folio *folio = page_folio(vmf->page); 2957 2969 bool dirtied; 2958 2970 bool page_mkwrite = vma->vm_ops 
&& vma->vm_ops->page_mkwrite; 2959 2971 2960 - dirtied = set_page_dirty(page); 2961 - VM_BUG_ON_PAGE(PageAnon(page), page); 2972 + dirtied = folio_mark_dirty(folio); 2973 + VM_BUG_ON_FOLIO(folio_test_anon(folio), folio); 2962 2974 /* 2963 - * Take a local copy of the address_space - page.mapping may be zeroed 2964 - * by truncate after unlock_page(). The address_space itself remains 2965 - * pinned by vma->vm_file's reference. We rely on unlock_page()'s 2975 + * Take a local copy of the address_space - folio.mapping may be zeroed 2976 + * by truncate after folio_unlock(). The address_space itself remains 2977 + * pinned by vma->vm_file's reference. We rely on folio_unlock()'s 2966 2978 * release semantics to prevent the compiler from undoing this copying. 2967 2979 */ 2968 - mapping = page_rmapping(page); 2969 - unlock_page(page); 2980 + mapping = folio_raw_mapping(folio); 2981 + folio_unlock(folio); 2970 2982 2971 2983 if (!page_mkwrite) 2972 2984 file_update_time(vma->vm_file); ··· 3024 3036 entry = pte_mkyoung(vmf->orig_pte); 3025 3037 entry = maybe_mkwrite(pte_mkdirty(entry), vma); 3026 3038 if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1)) 3027 - update_mmu_cache(vma, vmf->address, vmf->pte); 3039 + update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); 3028 3040 pte_unmap_unlock(vmf->pte, vmf->ptl); 3029 3041 count_vm_event(PGREUSE); 3030 3042 } ··· 3116 3128 inc_mm_counter(mm, MM_ANONPAGES); 3117 3129 } 3118 3130 } else { 3131 + ksm_might_unmap_zero_page(mm, vmf->orig_pte); 3119 3132 inc_mm_counter(mm, MM_ANONPAGES); 3120 3133 } 3121 3134 flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); ··· 3138 3149 * that left a window where the new PTE could be loaded into 3139 3150 * some TLBs while the old PTE remains in others. 
3140 3151 */ 3141 - ptep_clear_flush_notify(vma, vmf->address, vmf->pte); 3152 + ptep_clear_flush(vma, vmf->address, vmf->pte); 3142 3153 folio_add_new_anon_rmap(new_folio, vma, vmf->address); 3143 3154 folio_add_lru_vma(new_folio, vma); 3144 3155 /* ··· 3148 3159 */ 3149 3160 BUG_ON(unshare && pte_write(entry)); 3150 3161 set_pte_at_notify(mm, vmf->address, vmf->pte, entry); 3151 - update_mmu_cache(vma, vmf->address, vmf->pte); 3162 + update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); 3152 3163 if (old_folio) { 3153 3164 /* 3154 3165 * Only after switching the pte to the new page may ··· 3184 3195 pte_unmap_unlock(vmf->pte, vmf->ptl); 3185 3196 } 3186 3197 3187 - /* 3188 - * No need to double call mmu_notifier->invalidate_range() callback as 3189 - * the above ptep_clear_flush_notify() did already call it. 3190 - */ 3191 - mmu_notifier_invalidate_range_only_end(&range); 3198 + mmu_notifier_invalidate_range_end(&range); 3192 3199 3193 3200 if (new_folio) 3194 3201 folio_put(new_folio); ··· 3254 3269 vm_fault_t ret; 3255 3270 3256 3271 pte_unmap_unlock(vmf->pte, vmf->ptl); 3272 + if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 3273 + vma_end_read(vmf->vma); 3274 + return VM_FAULT_RETRY; 3275 + } 3276 + 3257 3277 vmf->flags |= FAULT_FLAG_MKWRITE; 3258 3278 ret = vma->vm_ops->pfn_mkwrite(vmf); 3259 3279 if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)) ··· 3269 3279 return 0; 3270 3280 } 3271 3281 3272 - static vm_fault_t wp_page_shared(struct vm_fault *vmf) 3282 + static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio) 3273 3283 __releases(vmf->ptl) 3274 3284 { 3275 3285 struct vm_area_struct *vma = vmf->vma; 3276 3286 vm_fault_t ret = 0; 3277 3287 3278 - get_page(vmf->page); 3288 + folio_get(folio); 3279 3289 3280 3290 if (vma->vm_ops && vma->vm_ops->page_mkwrite) { 3281 3291 vm_fault_t tmp; 3282 3292 3283 3293 pte_unmap_unlock(vmf->pte, vmf->ptl); 3284 - tmp = do_page_mkwrite(vmf); 3294 + if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 3295 + 
folio_put(folio); 3296 + vma_end_read(vmf->vma); 3297 + return VM_FAULT_RETRY; 3298 + } 3299 + 3300 + tmp = do_page_mkwrite(vmf, folio); 3285 3301 if (unlikely(!tmp || (tmp & 3286 3302 (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) { 3287 - put_page(vmf->page); 3303 + folio_put(folio); 3288 3304 return tmp; 3289 3305 } 3290 3306 tmp = finish_mkwrite_fault(vmf); 3291 3307 if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) { 3292 - unlock_page(vmf->page); 3293 - put_page(vmf->page); 3308 + folio_unlock(folio); 3309 + folio_put(folio); 3294 3310 return tmp; 3295 3311 } 3296 3312 } else { 3297 3313 wp_page_reuse(vmf); 3298 - lock_page(vmf->page); 3314 + folio_lock(folio); 3299 3315 } 3300 3316 ret |= fault_dirty_shared_page(vmf); 3301 - put_page(vmf->page); 3317 + folio_put(folio); 3302 3318 3303 3319 return ret; 3304 3320 } ··· 3355 3359 3356 3360 vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); 3357 3361 3362 + if (vmf->page) 3363 + folio = page_folio(vmf->page); 3364 + 3358 3365 /* 3359 3366 * Shared mapping: we are guaranteed to have VM_WRITE and 3360 3367 * FAULT_FLAG_WRITE set at this point. ··· 3372 3373 */ 3373 3374 if (!vmf->page) 3374 3375 return wp_pfn_shared(vmf); 3375 - return wp_page_shared(vmf); 3376 + return wp_page_shared(vmf, folio); 3376 3377 } 3377 - 3378 - if (vmf->page) 3379 - folio = page_folio(vmf->page); 3380 3378 3381 3379 /* 3382 3380 * Private mapping: create an exclusive anonymous page copy if reuse ··· 3428 3432 return 0; 3429 3433 } 3430 3434 copy: 3435 + if ((vmf->flags & FAULT_FLAG_VMA_LOCK) && !vma->anon_vma) { 3436 + pte_unmap_unlock(vmf->pte, vmf->ptl); 3437 + vma_end_read(vmf->vma); 3438 + return VM_FAULT_RETRY; 3439 + } 3440 + 3431 3441 /* 3432 3442 * Ok, we need to copy. Oh, well.. 
3433 3443 */ ··· 3497 3495 VM_BUG_ON(!folio_test_locked(folio)); 3498 3496 3499 3497 first_index = folio->index; 3500 - last_index = folio->index + folio_nr_pages(folio) - 1; 3498 + last_index = folio_next_index(folio) - 1; 3501 3499 3502 3500 details.even_cows = false; 3503 3501 details.single_folio = folio; ··· 3584 3582 struct folio *folio = page_folio(vmf->page); 3585 3583 struct vm_area_struct *vma = vmf->vma; 3586 3584 struct mmu_notifier_range range; 3585 + vm_fault_t ret; 3587 3586 3588 3587 /* 3589 3588 * We need a reference to lock the folio because we don't hold ··· 3597 3594 if (!folio_try_get(folio)) 3598 3595 return 0; 3599 3596 3600 - if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) { 3597 + ret = folio_lock_or_retry(folio, vmf); 3598 + if (ret) { 3601 3599 folio_put(folio); 3602 - return VM_FAULT_RETRY; 3600 + return ret; 3603 3601 } 3604 3602 mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, 3605 3603 vma->vm_mm, vmf->address & PAGE_MASK, ··· 3651 3647 * none pte. Otherwise it means the pte could have changed, so retry. 3652 3648 * 3653 3649 * This should also cover the case where e.g. the pte changed 3654 - * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_SWAPIN_ERROR. 3650 + * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_POISONED. 3655 3651 * So is_pte_marker() check is not enough to safely drop the pte. 
3656 3652 */ 3657 3653 if (pte_same(vmf->orig_pte, ptep_get(vmf->pte))) ··· 3697 3693 return VM_FAULT_SIGBUS; 3698 3694 3699 3695 /* Higher priority than uffd-wp when data corrupted */ 3700 - if (marker & PTE_MARKER_SWAPIN_ERROR) 3701 - return VM_FAULT_SIGBUS; 3696 + if (marker & PTE_MARKER_POISONED) 3697 + return VM_FAULT_HWPOISON; 3702 3698 3703 3699 if (pte_marker_entry_uffd_wp(entry)) 3704 3700 return pte_marker_handle_uffd_wp(vmf); ··· 3725 3721 bool exclusive = false; 3726 3722 swp_entry_t entry; 3727 3723 pte_t pte; 3728 - int locked; 3729 3724 vm_fault_t ret = 0; 3730 3725 void *shadow = NULL; 3731 3726 3732 3727 if (!pte_unmap_same(vmf)) 3733 3728 goto out; 3734 - 3735 - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 3736 - ret = VM_FAULT_RETRY; 3737 - goto out; 3738 - } 3739 3729 3740 3730 entry = pte_to_swp_entry(vmf->orig_pte); 3741 3731 if (unlikely(non_swap_entry(entry))) { ··· 3740 3742 vmf->page = pfn_swap_entry_to_page(entry); 3741 3743 ret = remove_device_exclusive_entry(vmf); 3742 3744 } else if (is_device_private_entry(entry)) { 3745 + if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 3746 + /* 3747 + * migrate_to_ram is not yet ready to operate 3748 + * under VMA lock. 
3749 + */ 3750 + vma_end_read(vma); 3751 + ret = VM_FAULT_RETRY; 3752 + goto out; 3753 + } 3754 + 3743 3755 vmf->page = pfn_swap_entry_to_page(entry); 3744 3756 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 3745 3757 vmf->address, &vmf->ptl); ··· 3813 3805 folio_add_lru(folio); 3814 3806 3815 3807 /* To provide entry to swap_readpage() */ 3816 - folio_set_swap_entry(folio, entry); 3808 + folio->swap = entry; 3817 3809 swap_readpage(page, true, NULL); 3818 3810 folio->private = NULL; 3819 3811 } ··· 3851 3843 goto out_release; 3852 3844 } 3853 3845 3854 - locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags); 3855 - 3856 - if (!locked) { 3857 - ret |= VM_FAULT_RETRY; 3846 + ret |= folio_lock_or_retry(folio, vmf); 3847 + if (ret & VM_FAULT_RETRY) 3858 3848 goto out_release; 3859 - } 3860 3849 3861 3850 if (swapcache) { 3862 3851 /* ··· 3864 3859 * changed. 3865 3860 */ 3866 3861 if (unlikely(!folio_test_swapcache(folio) || 3867 - page_private(page) != entry.val)) 3862 + page_swap_entry(page).val != entry.val)) 3868 3863 goto out_page; 3869 3864 3870 3865 /* ··· 4031 4026 } 4032 4027 4033 4028 /* No need to invalidate - it was non-present before */ 4034 - update_mmu_cache(vma, vmf->address, vmf->pte); 4029 + update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); 4035 4030 unlock: 4036 4031 if (vmf->pte) 4037 4032 pte_unmap_unlock(vmf->pte, vmf->ptl); ··· 4155 4150 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); 4156 4151 4157 4152 /* No need to invalidate - it was non-present before */ 4158 - update_mmu_cache(vma, vmf->address, vmf->pte); 4153 + update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); 4159 4154 unlock: 4160 4155 if (vmf->pte) 4161 4156 pte_unmap_unlock(vmf->pte, vmf->ptl); ··· 4250 4245 bool write = vmf->flags & FAULT_FLAG_WRITE; 4251 4246 unsigned long haddr = vmf->address & HPAGE_PMD_MASK; 4252 4247 pmd_t entry; 4253 - int i; 4254 4248 vm_fault_t ret = VM_FAULT_FALLBACK; 4255 4249 4256 4250 if 
(!transhuge_vma_suitable(vma, haddr)) ··· 4282 4278 if (unlikely(!pmd_none(*vmf->pmd))) 4283 4279 goto out; 4284 4280 4285 - for (i = 0; i < HPAGE_PMD_NR; i++) 4286 - flush_icache_page(vma, page + i); 4281 + flush_icache_pages(vma, page, HPAGE_PMD_NR); 4287 4282 4288 4283 entry = mk_huge_pmd(page, vma->vm_page_prot); 4289 4284 if (write) ··· 4315 4312 } 4316 4313 #endif 4317 4314 4318 - void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) 4315 + /** 4316 + * set_pte_range - Set a range of PTEs to point to pages in a folio. 4317 + * @vmf: Fault description. 4318 + * @folio: The folio that contains @page. 4319 + * @page: The first page to create a PTE for. 4320 + * @nr: The number of PTEs to create. 4321 + * @addr: The first address to create a PTE for. 4322 + */ 4323 + void set_pte_range(struct vm_fault *vmf, struct folio *folio, 4324 + struct page *page, unsigned int nr, unsigned long addr) 4319 4325 { 4320 4326 struct vm_area_struct *vma = vmf->vma; 4321 4327 bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); 4322 4328 bool write = vmf->flags & FAULT_FLAG_WRITE; 4323 - bool prefault = vmf->address != addr; 4329 + bool prefault = in_range(vmf->address, addr, nr * PAGE_SIZE); 4324 4330 pte_t entry; 4325 4331 4326 - flush_icache_page(vma, page); 4332 + flush_icache_pages(vma, page, nr); 4327 4333 entry = mk_pte(page, vma->vm_page_prot); 4328 4334 4329 4335 if (prefault && arch_wants_old_prefaulted_pte()) ··· 4346 4334 entry = pte_mkuffd_wp(entry); 4347 4335 /* copy-on-write page */ 4348 4336 if (write && !(vma->vm_flags & VM_SHARED)) { 4349 - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); 4350 - page_add_new_anon_rmap(page, vma, addr); 4351 - lru_cache_add_inactive_or_unevictable(page, vma); 4337 + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr); 4338 + VM_BUG_ON_FOLIO(nr != 1, folio); 4339 + folio_add_new_anon_rmap(folio, vma, addr); 4340 + folio_add_lru_vma(folio, vma); 4352 4341 } else { 4353 - inc_mm_counter(vma->vm_mm, mm_counter_file(page)); 4354 -
page_add_file_rmap(page, vma, false); 4342 + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr); 4343 + folio_add_file_rmap_range(folio, page, nr, vma, false); 4355 4344 } 4356 - set_pte_at(vma->vm_mm, addr, vmf->pte, entry); 4345 + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr); 4346 + 4347 + /* no need to invalidate: a not-present page won't be cached */ 4348 + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr); 4357 4349 } 4358 4350 4359 4351 static bool vmf_pte_changed(struct vm_fault *vmf) ··· 4425 4409 4426 4410 /* Re-check under ptl */ 4427 4411 if (likely(!vmf_pte_changed(vmf))) { 4428 - do_set_pte(vmf, page, vmf->address); 4412 + struct folio *folio = page_folio(page); 4429 4413 4430 - /* no need to invalidate: a not-present page won't be cached */ 4431 - update_mmu_cache(vma, vmf->address, vmf->pte); 4432 - 4414 + set_pte_range(vmf, folio, page, 1, vmf->address); 4433 4415 ret = 0; 4434 4416 } else { 4435 4417 update_mmu_tlb(vma, vmf->address, vmf->pte); ··· 4546 4532 static vm_fault_t do_read_fault(struct vm_fault *vmf) 4547 4533 { 4548 4534 vm_fault_t ret = 0; 4535 + struct folio *folio; 4549 4536 4550 4537 /* 4551 4538 * Let's call ->map_pages() first and use ->fault() as fallback ··· 4559 4544 return ret; 4560 4545 } 4561 4546 4547 + if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 4548 + vma_end_read(vmf->vma); 4549 + return VM_FAULT_RETRY; 4550 + } 4551 + 4562 4552 ret = __do_fault(vmf); 4563 4553 if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) 4564 4554 return ret; 4565 4555 4566 4556 ret |= finish_fault(vmf); 4567 - unlock_page(vmf->page); 4557 + folio = page_folio(vmf->page); 4558 + folio_unlock(folio); 4568 4559 if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) 4569 - put_page(vmf->page); 4560 + folio_put(folio); 4570 4561 return ret; 4571 4562 } 4572 4563 ··· 4580 4559 { 4581 4560 struct vm_area_struct *vma = vmf->vma; 4582 4561 vm_fault_t ret; 4562 + 4563 + if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 
4564 + vma_end_read(vma); 4565 + return VM_FAULT_RETRY; 4566 + } 4583 4567 4584 4568 if (unlikely(anon_vma_prepare(vma))) 4585 4569 return VM_FAULT_OOM; ··· 4624 4598 { 4625 4599 struct vm_area_struct *vma = vmf->vma; 4626 4600 vm_fault_t ret, tmp; 4601 + struct folio *folio; 4602 + 4603 + if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 4604 + vma_end_read(vma); 4605 + return VM_FAULT_RETRY; 4606 + } 4627 4607 4628 4608 ret = __do_fault(vmf); 4629 4609 if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) 4630 4610 return ret; 4611 + 4612 + folio = page_folio(vmf->page); 4631 4613 4632 4614 /* 4633 4615 * Check if the backing address space wants to know that the page is 4634 4616 * about to become writable 4635 4617 */ 4636 4618 if (vma->vm_ops->page_mkwrite) { 4637 - unlock_page(vmf->page); 4638 - tmp = do_page_mkwrite(vmf); 4619 + folio_unlock(folio); 4620 + tmp = do_page_mkwrite(vmf, folio); 4639 4621 if (unlikely(!tmp || 4640 4622 (tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) { 4641 - put_page(vmf->page); 4623 + folio_put(folio); 4642 4624 return tmp; 4643 4625 } 4644 4626 } ··· 4654 4620 ret |= finish_fault(vmf); 4655 4621 if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | 4656 4622 VM_FAULT_RETRY))) { 4657 - unlock_page(vmf->page); 4658 - put_page(vmf->page); 4623 + folio_unlock(folio); 4624 + folio_put(folio); 4659 4625 return ret; 4660 4626 } 4661 4627 ··· 4844 4810 if (writable) 4845 4811 pte = pte_mkwrite(pte); 4846 4812 ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); 4847 - update_mmu_cache(vma, vmf->address, vmf->pte); 4813 + update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); 4848 4814 pte_unmap_unlock(vmf->pte, vmf->ptl); 4849 4815 goto out; 4850 4816 } 4851 4817 4852 4818 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) 4853 4819 { 4854 - if (vma_is_anonymous(vmf->vma)) 4820 + struct vm_area_struct *vma = vmf->vma; 4821 + if (vma_is_anonymous(vma)) 4855 4822 return 
do_huge_pmd_anonymous_page(vmf); 4856 - if (vmf->vma->vm_ops->huge_fault) 4857 - return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); 4823 + if (vma->vm_ops->huge_fault) 4824 + return vma->vm_ops->huge_fault(vmf, PMD_ORDER); 4858 4825 return VM_FAULT_FALLBACK; 4859 4826 } 4860 4827 4861 4828 /* `inline' is required to avoid gcc 4.1.2 build error */ 4862 4829 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) 4863 4830 { 4831 + struct vm_area_struct *vma = vmf->vma; 4864 4832 const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; 4865 4833 vm_fault_t ret; 4866 4834 4867 - if (vma_is_anonymous(vmf->vma)) { 4835 + if (vma_is_anonymous(vma)) { 4868 4836 if (likely(!unshare) && 4869 - userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) 4837 + userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd)) 4870 4838 return handle_userfault(vmf, VM_UFFD_WP); 4871 4839 return do_huge_pmd_wp_page(vmf); 4872 4840 } 4873 4841 4874 - if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { 4875 - if (vmf->vma->vm_ops->huge_fault) { 4876 - ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); 4842 + if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { 4843 + if (vma->vm_ops->huge_fault) { 4844 + ret = vma->vm_ops->huge_fault(vmf, PMD_ORDER); 4877 4845 if (!(ret & VM_FAULT_FALLBACK)) 4878 4846 return ret; 4879 4847 } 4880 4848 } 4881 4849 4882 4850 /* COW or write-notify handled on pte level: split pmd. 
*/ 4883 - __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL); 4851 + __split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL); 4884 4852 4885 4853 return VM_FAULT_FALLBACK; 4886 4854 } ··· 4891 4855 { 4892 4856 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ 4893 4857 defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) 4858 + struct vm_area_struct *vma = vmf->vma; 4894 4859 /* No support for anonymous transparent PUD pages yet */ 4895 - if (vma_is_anonymous(vmf->vma)) 4860 + if (vma_is_anonymous(vma)) 4896 4861 return VM_FAULT_FALLBACK; 4897 - if (vmf->vma->vm_ops->huge_fault) 4898 - return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); 4862 + if (vma->vm_ops->huge_fault) 4863 + return vma->vm_ops->huge_fault(vmf, PUD_ORDER); 4899 4864 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 4900 4865 return VM_FAULT_FALLBACK; 4901 4866 } ··· 4905 4868 { 4906 4869 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ 4907 4870 defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) 4871 + struct vm_area_struct *vma = vmf->vma; 4908 4872 vm_fault_t ret; 4909 4873 4910 4874 /* No support for anonymous transparent PUD pages yet */ 4911 - if (vma_is_anonymous(vmf->vma)) 4875 + if (vma_is_anonymous(vma)) 4912 4876 goto split; 4913 - if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { 4914 - if (vmf->vma->vm_ops->huge_fault) { 4915 - ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); 4877 + if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { 4878 + if (vma->vm_ops->huge_fault) { 4879 + ret = vma->vm_ops->huge_fault(vmf, PUD_ORDER); 4916 4880 if (!(ret & VM_FAULT_FALLBACK)) 4917 4881 return ret; 4918 4882 } 4919 4883 } 4920 4884 split: 4921 4885 /* COW or write-notify not handled on PUD level: split pud.*/ 4922 - __split_huge_pud(vmf->vma, vmf->pud, vmf->address); 4886 + __split_huge_pud(vma, vmf->pud, vmf->address); 4923 4887 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 4924 4888 return VM_FAULT_FALLBACK; 4925 4889 } ··· 4997 4959 entry = 
pte_mkyoung(entry); 4998 4960 if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry, 4999 4961 vmf->flags & FAULT_FLAG_WRITE)) { 5000 - update_mmu_cache(vmf->vma, vmf->address, vmf->pte); 4962 + update_mmu_cache_range(vmf, vmf->vma, vmf->address, 4963 + vmf->pte, 1); 5001 4964 } else { 5002 4965 /* Skip spurious TLB flush for retried page fault */ 5003 4966 if (vmf->flags & FAULT_FLAG_TRIED) ··· 5019 4980 } 5020 4981 5021 4982 /* 5022 - * By the time we get here, we already hold the mm semaphore 5023 - * 5024 - * The mmap_lock may have been released depending on flags and our 5025 - * return value. See filemap_fault() and __folio_lock_or_retry(). 4983 + * On entry, we hold either the VMA lock or the mmap_lock 4984 + * (FAULT_FLAG_VMA_LOCK tells you which). If VM_FAULT_RETRY is set in 4985 + * the result, the mmap_lock is not held on exit. See filemap_fault() 4986 + * and __folio_lock_or_retry(). 5026 4987 */ 5027 4988 static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, 5028 4989 unsigned long address, unsigned int flags) ··· 5120 5081 5121 5082 /** 5122 5083 * mm_account_fault - Do page fault accounting 5123 - * 5084 + * @mm: mm from which memcg should be extracted. It can be NULL. 5124 5085 * @regs: the pt_regs struct pointer. When set to NULL, will skip accounting 5125 5086 * of perf event counters, but we'll still do the per-task accounting to 5126 5087 * the task who triggered this page fault. ··· 5228 5189 !is_cow_mapping(vma->vm_flags))) 5229 5190 return VM_FAULT_SIGSEGV; 5230 5191 } 5192 + #ifdef CONFIG_PER_VMA_LOCK 5193 + /* 5194 + * Per-VMA locks can't be used with FAULT_FLAG_RETRY_NOWAIT because of 5195 + * the assumption that lock is dropped on VM_FAULT_RETRY. 
5196 + */ 5197 + if (WARN_ON_ONCE((*flags & 5198 + (FAULT_FLAG_VMA_LOCK | FAULT_FLAG_RETRY_NOWAIT)) == 5199 + (FAULT_FLAG_VMA_LOCK | FAULT_FLAG_RETRY_NOWAIT))) 5200 + return VM_FAULT_SIGSEGV; 5201 + #endif 5202 + 5231 5203 return 0; 5232 5204 } 5233 5205 ··· 5436 5386 if (!vma) 5437 5387 goto inval; 5438 5388 5439 - /* Only anonymous and tcp vmas are supported for now */ 5440 - if (!vma_is_anonymous(vma) && !vma_is_tcp(vma)) 5441 - goto inval; 5442 - 5443 5389 if (!vma_start_read(vma)) 5444 5390 goto inval; 5445 5391 ··· 5445 5399 * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA 5446 5400 * from its anon_vma. 5447 5401 */ 5448 - if (unlikely(!vma->anon_vma && !vma_is_tcp(vma))) 5449 - goto inval_end_read; 5450 - 5451 - /* 5452 - * Due to the possibility of userfault handler dropping mmap_lock, avoid 5453 - * it for now and fall back to page fault handling under mmap_lock. 5454 - */ 5455 - if (userfaultfd_armed(vma)) 5402 + if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) 5456 5403 goto inval_end_read; 5457 5404 5458 5405 /* Check since vm_start/vm_end might change before we lock the VMA */ ··· 6098 6059 SLAB_PANIC, NULL); 6099 6060 } 6100 6061 6101 - bool ptlock_alloc(struct page *page) 6062 + bool ptlock_alloc(struct ptdesc *ptdesc) 6102 6063 { 6103 6064 spinlock_t *ptl; 6104 6065 6105 6066 ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL); 6106 6067 if (!ptl) 6107 6068 return false; 6108 - page->ptl = ptl; 6069 + ptdesc->ptl = ptl; 6109 6070 return true; 6110 6071 } 6111 6072 6112 - void ptlock_free(struct page *page) 6073 + void ptlock_free(struct ptdesc *ptdesc) 6113 6074 { 6114 - kmem_cache_free(page_ptl_cachep, page->ptl); 6075 + kmem_cache_free(page_ptl_cachep, ptdesc->ptl); 6115 6076 } 6116 6077 #endif
+154 -38
mm/memory_hotplug.c
··· 41 41 #include "internal.h" 42 42 #include "shuffle.h" 43 43 44 + enum { 45 + MEMMAP_ON_MEMORY_DISABLE = 0, 46 + MEMMAP_ON_MEMORY_ENABLE, 47 + MEMMAP_ON_MEMORY_FORCE, 48 + }; 49 + 50 + static int memmap_mode __read_mostly = MEMMAP_ON_MEMORY_DISABLE; 51 + 52 + static inline unsigned long memory_block_memmap_size(void) 53 + { 54 + return PHYS_PFN(memory_block_size_bytes()) * sizeof(struct page); 55 + } 56 + 57 + static inline unsigned long memory_block_memmap_on_memory_pages(void) 58 + { 59 + unsigned long nr_pages = PFN_UP(memory_block_memmap_size()); 60 + 61 + /* 62 + * In "forced" memmap_on_memory mode, we add extra pages to align the 63 + * vmemmap size to cover full pageblocks. That way, we can add memory 64 + * even if the vmemmap size is not properly aligned, however, we might waste 65 + * memory. 66 + */ 67 + if (memmap_mode == MEMMAP_ON_MEMORY_FORCE) 68 + return pageblock_align(nr_pages); 69 + return nr_pages; 70 + } 71 + 44 72 #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY 45 73 /* 46 74 * memory_hotplug.memmap_on_memory parameter 47 75 */ 48 - static bool memmap_on_memory __ro_after_init; 49 - module_param(memmap_on_memory, bool, 0444); 50 - MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug"); 76 + static int set_memmap_mode(const char *val, const struct kernel_param *kp) 77 + { 78 + int ret, mode; 79 + bool enabled; 80 + 81 + if (sysfs_streq(val, "force") || sysfs_streq(val, "FORCE")) { 82 + mode = MEMMAP_ON_MEMORY_FORCE; 83 + } else { 84 + ret = kstrtobool(val, &enabled); 85 + if (ret < 0) 86 + return ret; 87 + if (enabled) 88 + mode = MEMMAP_ON_MEMORY_ENABLE; 89 + else 90 + mode = MEMMAP_ON_MEMORY_DISABLE; 91 + } 92 + *((int *)kp->arg) = mode; 93 + if (mode == MEMMAP_ON_MEMORY_FORCE) { 94 + unsigned long memmap_pages = memory_block_memmap_on_memory_pages(); 95 + 96 + pr_info_once("Memory hotplug will waste %ld pages in each memory block\n", 97 + memmap_pages - PFN_UP(memory_block_memmap_size())); 98 + } 99 + return 0; 100 + } 101 + 
102 + static int get_memmap_mode(char *buffer, const struct kernel_param *kp) 103 + { 104 + if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_FORCE) 105 + return sprintf(buffer, "force\n"); 106 + return param_get_bool(buffer, kp); 107 + } 108 + 109 + static const struct kernel_param_ops memmap_mode_ops = { 110 + .set = set_memmap_mode, 111 + .get = get_memmap_mode, 112 + }; 113 + module_param_cb(memmap_on_memory, &memmap_mode_ops, &memmap_mode, 0444); 114 + MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug\n" 115 + "With value \"force\" it could result in memory wastage due " 116 + "to memmap size limitations (Y/N/force)"); 51 117 52 118 static inline bool mhp_memmap_on_memory(void) 53 119 { 54 - return memmap_on_memory; 120 + return memmap_mode != MEMMAP_ON_MEMORY_DISABLE; 55 121 } 56 122 #else 57 123 static inline bool mhp_memmap_on_memory(void) ··· 1313 1247 return device_online(&mem->dev); 1314 1248 } 1315 1249 1316 - bool mhp_supports_memmap_on_memory(unsigned long size) 1250 + #ifndef arch_supports_memmap_on_memory 1251 + static inline bool arch_supports_memmap_on_memory(unsigned long vmemmap_size) 1317 1252 { 1318 - unsigned long nr_vmemmap_pages = size / PAGE_SIZE; 1319 - unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); 1320 - unsigned long remaining_size = size - vmemmap_size; 1253 + /* 1254 + * As default, we want the vmemmap to span a complete PMD such that we 1255 + * can map the vmemmap using a single PMD if supported by the 1256 + * architecture. 
1257 + */ 1258 + return IS_ALIGNED(vmemmap_size, PMD_SIZE); 1259 + } 1260 + #endif 1261 + 1262 + static bool mhp_supports_memmap_on_memory(unsigned long size) 1263 + { 1264 + unsigned long vmemmap_size = memory_block_memmap_size(); 1265 + unsigned long memmap_pages = memory_block_memmap_on_memory_pages(); 1321 1266 1322 1267 /* 1323 1268 * Besides having arch support and the feature enabled at runtime, we ··· 1356 1279 * altmap as an alternative source of memory, and we do not exactly 1357 1280 * populate a single PMD. 1358 1281 */ 1359 - return mhp_memmap_on_memory() && 1360 - size == memory_block_size_bytes() && 1361 - IS_ALIGNED(vmemmap_size, PMD_SIZE) && 1362 - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); 1282 + if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) 1283 + return false; 1284 + 1285 + /* 1286 + * Make sure the vmemmap allocation is fully contained 1287 + * so that we always allocate vmemmap memory from altmap area. 1288 + */ 1289 + if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE)) 1290 + return false; 1291 + 1292 + /* 1293 + * start pfn should be pageblock_nr_pages aligned for correctly 1294 + * setting migrate types 1295 + */ 1296 + if (!pageblock_aligned(memmap_pages)) 1297 + return false; 1298 + 1299 + if (memmap_pages == PHYS_PFN(memory_block_size_bytes())) 1300 + /* No effective hotplugged memory doesn't make sense. 
*/ 1301 + return false; 1302 + 1303 + return arch_supports_memmap_on_memory(vmemmap_size); 1363 1304 } 1364 1305 1365 1306 /* ··· 1390 1295 { 1391 1296 struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; 1392 1297 enum memblock_flags memblock_flags = MEMBLOCK_NONE; 1393 - struct vmem_altmap mhp_altmap = {}; 1298 + struct vmem_altmap mhp_altmap = { 1299 + .base_pfn = PHYS_PFN(res->start), 1300 + .end_pfn = PHYS_PFN(res->end), 1301 + }; 1394 1302 struct memory_group *group = NULL; 1395 1303 u64 start, size; 1396 1304 bool new_node = false; ··· 1437 1339 * Self hosted memmap array 1438 1340 */ 1439 1341 if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { 1440 - if (!mhp_supports_memmap_on_memory(size)) { 1441 - ret = -EINVAL; 1442 - goto error; 1342 + if (mhp_supports_memmap_on_memory(size)) { 1343 + mhp_altmap.free = memory_block_memmap_on_memory_pages(); 1344 + params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL); 1345 + if (!params.altmap) { 1346 + ret = -ENOMEM; 1347 + goto error; 1348 + } 1349 + 1350 + memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap)); 1443 1351 } 1444 - mhp_altmap.free = PHYS_PFN(size); 1445 - mhp_altmap.base_pfn = PHYS_PFN(start); 1446 - params.altmap = &mhp_altmap; 1352 + /* fallback to not using altmap */ 1447 1353 } 1448 1354 1449 1355 /* call arch's memory hotadd */ 1450 1356 ret = arch_add_memory(nid, start, size, &params); 1451 1357 if (ret < 0) 1452 - goto error; 1358 + goto error_free; 1453 1359 1454 1360 /* create memory block devices after memory was added */ 1455 - ret = create_memory_block_devices(start, size, mhp_altmap.alloc, 1456 - group); 1361 + ret = create_memory_block_devices(start, size, params.altmap, group); 1457 1362 if (ret) { 1458 1363 arch_remove_memory(start, size, NULL); 1459 - goto error; 1364 + goto error_free; 1460 1365 } 1461 1366 1462 1367 if (new_node) { ··· 1496 1395 walk_memory_blocks(start, size, NULL, online_memory_block); 1497 1396 1498 1397 return ret; 1398 + error_free: 1399 + 
kfree(params.altmap); 1499 1400 error: 1500 1401 if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) 1501 1402 memblock_remove(start, size); ··· 1946 1843 do { 1947 1844 pfn = start_pfn; 1948 1845 do { 1846 + /* 1847 + * Historically we always checked for any signal and 1848 + * can't limit it to fatal signals without eventually 1849 + * breaking user space. 1850 + */ 1949 1851 if (signal_pending(current)) { 1950 1852 ret = -EINTR; 1951 1853 reason = "signal backoff"; ··· 2064 1956 return 0; 2065 1957 } 2066 1958 2067 - static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg) 1959 + static int test_has_altmap_cb(struct memory_block *mem, void *arg) 2068 1960 { 1961 + struct memory_block **mem_ptr = (struct memory_block **)arg; 2069 1962 /* 2070 - * If not set, continue with the next block. 1963 + * return the memblock if we have altmap 1964 + * and break callback. 2071 1965 */ 2072 - return mem->nr_vmemmap_pages; 1966 + if (mem->altmap) { 1967 + *mem_ptr = mem; 1968 + return 1; 1969 + } 1970 + return 0; 2073 1971 } 2074 1972 2075 1973 static int check_cpu_on_node(int nid) ··· 2150 2036 2151 2037 static int __ref try_remove_memory(u64 start, u64 size) 2152 2038 { 2153 - struct vmem_altmap mhp_altmap = {}; 2154 - struct vmem_altmap *altmap = NULL; 2155 - unsigned long nr_vmemmap_pages; 2039 + struct memory_block *mem; 2156 2040 int rc = 0, nid = NUMA_NO_NODE; 2041 + struct vmem_altmap *altmap = NULL; 2157 2042 2158 2043 BUG_ON(check_hotplug_memory_range(start, size)); 2159 2044 ··· 2174 2061 * the same granularity it was added - a single memory block. 
2175 2062 */ 2176 2063 if (mhp_memmap_on_memory()) { 2177 - nr_vmemmap_pages = walk_memory_blocks(start, size, NULL, 2178 - get_nr_vmemmap_pages_cb); 2179 - if (nr_vmemmap_pages) { 2064 + rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb); 2065 + if (rc) { 2180 2066 if (size != memory_block_size_bytes()) { 2181 2067 pr_warn("Refuse to remove %#llx - %#llx," 2182 2068 "wrong granularity\n", 2183 2069 start, start + size); 2184 2070 return -EINVAL; 2185 2071 } 2186 - 2072 + altmap = mem->altmap; 2187 2073 /* 2188 - * Let remove_pmd_table->free_hugepage_table do the 2189 - * right thing if we used vmem_altmap when hot-adding 2190 - * the range. 2074 + * Mark altmap NULL so that we can add a debug 2075 + * check on memblock free. 2191 2076 */ 2192 - mhp_altmap.alloc = nr_vmemmap_pages; 2193 - altmap = &mhp_altmap; 2077 + mem->altmap = NULL; 2194 2078 } 2195 2079 } 2196 2080 ··· 2203 2093 mem_hotplug_begin(); 2204 2094 2205 2095 arch_remove_memory(start, size, altmap); 2096 + 2097 + /* Verify that all vmemmap pages have actually been freed. */ 2098 + if (altmap) { 2099 + WARN(altmap->alloc, "Altmap not fully unmapped"); 2100 + kfree(altmap); 2101 + } 2206 2102 2207 2103 if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) { 2208 2104 memblock_phys_free(start, size);
+8 -7
mm/mempolicy.c
··· 2195 2195 mpol_cond_put(pol); 2196 2196 gfp |= __GFP_COMP; 2197 2197 page = alloc_page_interleave(gfp, order, nid); 2198 - if (page && order > 1) 2199 - prep_transhuge_page(page); 2200 2198 folio = (struct folio *)page; 2199 + if (folio && order > 1) 2200 + folio_prep_large_rmappable(folio); 2201 2201 goto out; 2202 2202 } 2203 2203 ··· 2208 2208 gfp |= __GFP_COMP; 2209 2209 page = alloc_pages_preferred_many(gfp, order, node, pol); 2210 2210 mpol_cond_put(pol); 2211 - if (page && order > 1) 2212 - prep_transhuge_page(page); 2213 2211 folio = (struct folio *)page; 2212 + if (folio && order > 1) 2213 + folio_prep_large_rmappable(folio); 2214 2214 goto out; 2215 2215 } 2216 2216 ··· 2306 2306 struct folio *folio_alloc(gfp_t gfp, unsigned order) 2307 2307 { 2308 2308 struct page *page = alloc_pages(gfp | __GFP_COMP, order); 2309 + struct folio *folio = (struct folio *)page; 2309 2310 2310 - if (page && order > 1) 2311 - prep_transhuge_page(page); 2312 - return (struct folio *)page; 2311 + if (folio && order > 1) 2312 + folio_prep_large_rmappable(folio); 2313 + return folio; 2313 2314 } 2314 2315 EXPORT_SYMBOL(folio_alloc); 2315 2316
+20 -2
mm/memtest.c
··· 3 3 #include <linux/types.h> 4 4 #include <linux/init.h> 5 5 #include <linux/memblock.h> 6 + #include <linux/seq_file.h> 6 7 7 - bool early_memtest_done; 8 - phys_addr_t early_memtest_bad_size; 8 + static bool early_memtest_done; 9 + static phys_addr_t early_memtest_bad_size; 9 10 10 11 static u64 patterns[] __initdata = { 11 12 /* The first entry has to be 0 to leave memtest with zeroed memory */ ··· 117 116 idx = i % ARRAY_SIZE(patterns); 118 117 do_one_pass(patterns[idx], start, end); 119 118 } 119 + } 120 + 121 + void memtest_report_meminfo(struct seq_file *m) 122 + { 123 + unsigned long early_memtest_bad_size_kb; 124 + 125 + if (!IS_ENABLED(CONFIG_PROC_FS)) 126 + return; 127 + 128 + if (!early_memtest_done) 129 + return; 130 + 131 + early_memtest_bad_size_kb = early_memtest_bad_size >> 10; 132 + if (early_memtest_bad_size && !early_memtest_bad_size_kb) 133 + early_memtest_bad_size_kb = 1; 134 + /* When 0 is reported, it means there actually was a successful test */ 135 + seq_printf(m, "EarlyMemtestBad: %5lu kB\n", early_memtest_bad_size_kb); 120 136 }
+2 -3
mm/migrate.c
··· 773 773 774 774 bh = head; 775 775 do { 776 - set_bh_page(bh, &dst->page, bh_offset(bh)); 776 + folio_set_bh(bh, dst, bh_offset(bh)); 777 777 bh = bh->b_this_page; 778 778 } while (bh != head); 779 779 ··· 922 922 * Buffers may be managed in a filesystem specific way. 923 923 * We must have no buffers or drop them. 924 924 */ 925 - if (folio_test_private(src) && 926 - !filemap_release_folio(src, GFP_KERNEL)) 925 + if (!filemap_release_folio(src, GFP_KERNEL)) 927 926 return mode == MIGRATE_SYNC ? -EAGAIN : -EBUSY; 928 927 929 928 return migrate_folio(mapping, dst, src, mode);
+17 -13
mm/migrate_device.c
··· 659 659 660 660 if (flush) { 661 661 flush_cache_page(vma, addr, pte_pfn(orig_pte)); 662 - ptep_clear_flush_notify(vma, addr, ptep); 662 + ptep_clear_flush(vma, addr, ptep); 663 663 set_pte_at_notify(mm, addr, ptep, entry); 664 664 update_mmu_cache(vma, addr, ptep); 665 665 } else { ··· 728 728 729 729 if (is_device_private_page(newpage) || 730 730 is_device_coherent_page(newpage)) { 731 - /* 732 - * For now only support anonymous memory migrating to 733 - * device private or coherent memory. 734 - */ 735 731 if (mapping) { 736 - src_pfns[i] &= ~MIGRATE_PFN_MIGRATE; 737 - continue; 732 + struct folio *folio; 733 + 734 + folio = page_folio(page); 735 + 736 + /* 737 + * For now only support anonymous memory migrating to 738 + * device private or coherent memory. 739 + * 740 + * Try to get rid of swap cache if possible. 741 + */ 742 + if (!folio_test_anon(folio) || 743 + !folio_free_swap(folio)) { 744 + src_pfns[i] &= ~MIGRATE_PFN_MIGRATE; 745 + continue; 746 + } 738 747 } 739 748 } else if (is_zone_device_page(newpage)) { 740 749 /* ··· 764 755 src_pfns[i] &= ~MIGRATE_PFN_MIGRATE; 765 756 } 766 757 767 - /* 768 - * No need to double call mmu_notifier->invalidate_range() callback as 769 - * the above ptep_clear_flush_notify() inside migrate_vma_insert_page() 770 - * did already call it. 771 - */ 772 758 if (notified) 773 - mmu_notifier_invalidate_range_only_end(&range); 759 + mmu_notifier_invalidate_range_end(&range); 774 760 } 775 761 776 762 /**
+2 -1
mm/mlock.c
··· 387 387 */ 388 388 if (newflags & VM_LOCKED) 389 389 newflags |= VM_IO; 390 + vma_start_write(vma); 390 391 vm_flags_reset_once(vma, newflags); 391 392 392 393 lru_add_drain(); ··· 462 461 * It's okay if try_to_unmap_one unmaps a page just after we 463 462 * set VM_LOCKED, populate_vma_page_range will bring it back. 464 463 */ 465 - 466 464 if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) { 467 465 /* No work to do, and mlocking twice would be wrong */ 466 + vma_start_write(vma); 468 467 vm_flags_reset(vma, newflags); 469 468 } else { 470 469 mlock_vma_pages_range(vma, start, end, newflags);
+15 -22
mm/mm_init.c
··· 79 79 int shift, width; 80 80 unsigned long or_mask, add_mask; 81 81 82 - shift = 8 * sizeof(unsigned long); 82 + shift = BITS_PER_LONG; 83 83 width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH 84 84 - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH; 85 85 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", ··· 154 154 #endif /* CONFIG_DEBUG_MEMORY_INIT */ 155 155 156 156 struct kobject *mm_kobj; 157 - EXPORT_SYMBOL_GPL(mm_kobj); 158 157 159 158 #ifdef CONFIG_SMP 160 159 s32 vm_committed_as_batch = 32; ··· 375 376 */ 376 377 if (mirrored_kernelcore) { 377 378 bool mem_below_4gb_not_mirrored = false; 379 + 380 + if (!memblock_has_mirror()) { 381 + pr_warn("The system has no mirror memory, ignore kernelcore=mirror.\n"); 382 + goto out; 383 + } 378 384 379 385 for_each_mem_region(r) { 380 386 if (memblock_is_mirror(r)) ··· 1024 1020 if (!vmemmap_can_optimize(altmap, pgmap)) 1025 1021 return pgmap_vmemmap_nr(pgmap); 1026 1022 1027 - return 2 * (PAGE_SIZE / sizeof(struct page)); 1023 + return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page)); 1028 1024 } 1029 1025 1030 1026 static void __ref memmap_init_compound(struct page *head, ··· 1109 1105 */ 1110 1106 static void __init adjust_zone_range_for_zone_movable(int nid, 1111 1107 unsigned long zone_type, 1112 - unsigned long node_start_pfn, 1113 1108 unsigned long node_end_pfn, 1114 1109 unsigned long *zone_start_pfn, 1115 1110 unsigned long *zone_end_pfn) ··· 1225 1222 /* Get the start and end of the zone */ 1226 1223 *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high); 1227 1224 *zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high); 1228 - adjust_zone_range_for_zone_movable(nid, zone_type, 1229 - node_start_pfn, node_end_pfn, 1230 - zone_start_pfn, zone_end_pfn); 1225 + adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn, 1226 + zone_start_pfn, zone_end_pfn); 1231 1227 1232 1228 /* Check that this node has pages within the zone's required range */ 1233 
1229 if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn) ··· 1426 1424 usemapsize = roundup(zonesize, pageblock_nr_pages); 1427 1425 usemapsize = usemapsize >> pageblock_order; 1428 1426 usemapsize *= NR_PAGEBLOCK_BITS; 1429 - usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long)); 1427 + usemapsize = roundup(usemapsize, BITS_PER_LONG); 1430 1428 1431 - return usemapsize / 8; 1429 + return usemapsize / BITS_PER_BYTE; 1432 1430 } 1433 1431 1434 1432 static void __ref setup_usemap(struct zone *zone) ··· 1683 1681 * 1684 1682 * It returns the start and end page frame of a node based on information 1685 1683 * provided by memblock_set_node(). If called for a node 1686 - * with no available memory, a warning is printed and the start and end 1687 - * PFNs will be 0. 1684 + * with no available memory, the start and end PFNs will be 0. 1688 1685 */ 1689 1686 void __init get_pfn_range_for_nid(unsigned int nid, 1690 1687 unsigned long *start_pfn, unsigned long *end_pfn) ··· 1738 1737 } 1739 1738 1740 1739 /* Any regular or high memory on that node ? */ 1741 - static void check_for_memory(pg_data_t *pgdat) 1740 + static void __init check_for_memory(pg_data_t *pgdat) 1742 1741 { 1743 1742 enum zone_type zone_type; 1744 1743 ··· 2491 2490 else 2492 2491 numentries <<= (PAGE_SHIFT - scale); 2493 2492 2494 - /* Make sure we've got at least a 0-order allocation.. 
*/ 2495 - if (unlikely(flags & HASH_SMALL)) { 2496 - /* Makes no sense without HASH_EARLY */ 2497 - WARN_ON(!(flags & HASH_EARLY)); 2498 - if (!(numentries >> *_hash_shift)) { 2499 - numentries = 1UL << *_hash_shift; 2500 - BUG_ON(!numentries); 2501 - } 2502 - } else if (unlikely((numentries * bucketsize) < PAGE_SIZE)) 2493 + if (unlikely((numentries * bucketsize) < PAGE_SIZE)) 2503 2494 numentries = PAGE_SIZE / bucketsize; 2504 2495 } 2505 2496 numentries = roundup_pow_of_two(numentries); ··· 2771 2778 */ 2772 2779 page_ext_init_flatmem(); 2773 2780 mem_debugging_and_hardening_init(); 2774 - kfence_alloc_pool(); 2781 + kfence_alloc_pool_and_metadata(); 2775 2782 report_meminit(); 2776 2783 kmsan_init_shadow(); 2777 2784 stack_depot_early_init();
+131 -124
mm/mmap.c
··· 76 76 static bool ignore_rlimit_data; 77 77 core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644); 78 78 79 - static void unmap_region(struct mm_struct *mm, struct maple_tree *mt, 79 + static void unmap_region(struct mm_struct *mm, struct ma_state *mas, 80 80 struct vm_area_struct *vma, struct vm_area_struct *prev, 81 81 struct vm_area_struct *next, unsigned long start, 82 - unsigned long end, bool mm_wr_locked); 82 + unsigned long end, unsigned long tree_end, bool mm_wr_locked); 83 83 84 84 static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags) 85 85 { ··· 152 152 unsigned long min) 153 153 { 154 154 return mas_prev(&vmi->mas, min); 155 - } 156 - 157 - static inline int vma_iter_clear_gfp(struct vma_iterator *vmi, 158 - unsigned long start, unsigned long end, gfp_t gfp) 159 - { 160 - vmi->mas.index = start; 161 - vmi->mas.last = end - 1; 162 - mas_store_gfp(&vmi->mas, NULL, gfp); 163 - if (unlikely(mas_is_err(&vmi->mas))) 164 - return -ENOMEM; 165 - 166 - return 0; 167 155 } 168 156 169 157 /* ··· 397 409 VMA_ITERATOR(vmi, mm, 0); 398 410 struct address_space *mapping = NULL; 399 411 400 - if (vma_iter_prealloc(&vmi)) 412 + vma_iter_config(&vmi, vma->vm_start, vma->vm_end); 413 + if (vma_iter_prealloc(&vmi, vma)) 401 414 return -ENOMEM; 415 + 416 + vma_start_write(vma); 417 + 418 + vma_iter_store(&vmi, vma); 402 419 403 420 if (vma->vm_file) { 404 421 mapping = vma->vm_file->f_mapping; 405 422 i_mmap_lock_write(mapping); 406 - } 407 - 408 - vma_iter_store(&vmi, vma); 409 - 410 - if (mapping) { 411 423 __vma_link_file(vma, mapping); 412 424 i_mmap_unlock_write(mapping); 413 425 } ··· 462 474 */ 463 475 static inline void vma_prepare(struct vma_prepare *vp) 464 476 { 465 - vma_start_write(vp->vma); 466 - if (vp->adj_next) 467 - vma_start_write(vp->adj_next); 468 - /* vp->insert is always a newly created VMA, no need for locking */ 469 - if (vp->remove) 470 - vma_start_write(vp->remove); 471 - if (vp->remove2) 472 - 
vma_start_write(vp->remove2); 473 - 474 477 if (vp->file) { 475 478 uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end); 476 479 ··· 576 597 } 577 598 if (vp->insert && vp->file) 578 599 uprobe_mmap(vp->insert); 600 + validate_mm(mm); 579 601 } 580 602 581 603 /* ··· 595 615 * anon pages imported. 596 616 */ 597 617 if (src->anon_vma && !dst->anon_vma) { 598 - vma_start_write(dst); 618 + vma_assert_write_locked(dst); 599 619 dst->anon_vma = src->anon_vma; 600 620 return anon_vma_clone(dst, src); 601 621 } ··· 627 647 bool remove_next = false; 628 648 struct vma_prepare vp; 629 649 650 + vma_start_write(vma); 630 651 if (next && (vma != next) && (end == next->vm_end)) { 631 652 int ret; 632 653 633 654 remove_next = true; 655 + vma_start_write(next); 634 656 ret = dup_anon_vma(vma, next); 635 657 if (ret) 636 658 return ret; ··· 645 663 /* Only handles expanding */ 646 664 VM_WARN_ON(vma->vm_start < start || vma->vm_end > end); 647 665 648 - if (vma_iter_prealloc(vmi)) 666 + /* Note: vma iterator must be pointing to 'start' */ 667 + vma_iter_config(vmi, start, end); 668 + if (vma_iter_prealloc(vmi, vma)) 649 669 goto nomem; 650 670 651 671 vma_prepare(&vp); 652 672 vma_adjust_trans_huge(vma, start, end, 0); 653 - /* VMA iterator points to previous, so set to start if necessary */ 654 - if (vma_iter_addr(vmi) != start) 655 - vma_iter_set(vmi, start); 656 - 657 673 vma->vm_start = start; 658 674 vma->vm_end = end; 659 675 vma->vm_pgoff = pgoff; 660 - /* Note: mas must be pointing to the expanding VMA */ 661 676 vma_iter_store(vmi, vma); 662 677 663 678 vma_complete(&vp, vmi, vma->vm_mm); 664 - validate_mm(vma->vm_mm); 665 679 return 0; 666 680 667 681 nomem: ··· 680 702 681 703 WARN_ON((vma->vm_start != start) && (vma->vm_end != end)); 682 704 683 - if (vma_iter_prealloc(vmi)) 705 + if (vma->vm_start < start) 706 + vma_iter_config(vmi, vma->vm_start, start); 707 + else 708 + vma_iter_config(vmi, end, vma->vm_end); 709 + 710 + if (vma_iter_prealloc(vmi, NULL)) 
684 711 return -ENOMEM; 712 + 713 + vma_start_write(vma); 685 714 686 715 init_vma_prep(&vp, vma); 687 716 vma_prepare(&vp); 688 717 vma_adjust_trans_huge(vma, start, end, 0); 689 718 690 - if (vma->vm_start < start) 691 - vma_iter_clear(vmi, vma->vm_start, start); 692 - 693 - if (vma->vm_end > end) 694 - vma_iter_clear(vmi, end, vma->vm_end); 695 - 719 + vma_iter_clear(vmi); 696 720 vma->vm_start = start; 697 721 vma->vm_end = end; 698 722 vma->vm_pgoff = pgoff; 699 723 vma_complete(&vp, vmi, vma->vm_mm); 700 - validate_mm(vma->vm_mm); 701 724 return 0; 702 725 } 703 726 ··· 871 892 pgoff_t pglen = (end - addr) >> PAGE_SHIFT; 872 893 long adj_start = 0; 873 894 874 - validate_mm(mm); 875 895 /* 876 896 * We later require that vma->vm_flags == vm_flags, 877 897 * so this tests vma->vm_flags & VM_SPECIAL, too. ··· 915 937 if (!merge_prev && !merge_next) 916 938 return NULL; /* Not mergeable. */ 917 939 940 + if (merge_prev) 941 + vma_start_write(prev); 942 + 918 943 res = vma = prev; 919 944 remove = remove2 = adjust = NULL; 920 945 921 946 /* Can we merge both the predecessor and the successor? 
*/ 922 947 if (merge_prev && merge_next && 923 948 is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) { 949 + vma_start_write(next); 924 950 remove = next; /* case 1 */ 925 951 vma_end = next->vm_end; 926 952 err = dup_anon_vma(prev, next); 927 953 if (curr) { /* case 6 */ 954 + vma_start_write(curr); 928 955 remove = curr; 929 956 remove2 = next; 930 957 if (!next->anon_vma) ··· 937 954 } 938 955 } else if (merge_prev) { /* case 2 */ 939 956 if (curr) { 957 + vma_start_write(curr); 940 958 err = dup_anon_vma(prev, curr); 941 959 if (end == curr->vm_end) { /* case 7 */ 942 960 remove = curr; ··· 947 963 } 948 964 } 949 965 } else { /* merge_next */ 966 + vma_start_write(next); 950 967 res = next; 951 968 if (prev && addr < prev->vm_end) { /* case 4 */ 969 + vma_start_write(prev); 952 970 vma_end = addr; 953 971 adjust = next; 954 972 adj_start = -(prev->vm_end - addr); ··· 966 980 vma_pgoff = next->vm_pgoff - pglen; 967 981 if (curr) { /* case 8 */ 968 982 vma_pgoff = curr->vm_pgoff; 983 + vma_start_write(curr); 969 984 remove = curr; 970 985 err = dup_anon_vma(next, curr); 971 986 } ··· 977 990 if (err) 978 991 return NULL; 979 992 980 - if (vma_iter_prealloc(vmi)) 993 + if (vma_start < vma->vm_start || vma_end > vma->vm_end) 994 + vma_expanded = true; 995 + 996 + if (vma_expanded) { 997 + vma_iter_config(vmi, vma_start, vma_end); 998 + } else { 999 + vma_iter_config(vmi, adjust->vm_start + adj_start, 1000 + adjust->vm_end); 1001 + } 1002 + 1003 + if (vma_iter_prealloc(vmi, vma)) 981 1004 return NULL; 982 1005 983 1006 init_multi_vma_prep(&vp, vma, adjust, remove, remove2); ··· 996 999 997 1000 vma_prepare(&vp); 998 1001 vma_adjust_trans_huge(vma, vma_start, vma_end, adj_start); 999 - if (vma_start < vma->vm_start || vma_end > vma->vm_end) 1000 - vma_expanded = true; 1001 1002 1002 1003 vma->vm_start = vma_start; 1003 1004 vma->vm_end = vma_end; ··· 1014 1019 } 1015 1020 1016 1021 vma_complete(&vp, vmi, mm); 1017 - vma_iter_free(vmi); 1018 - 
validate_mm(mm); 1019 1022 khugepaged_enter_vma(res, vm_flags); 1020 - 1021 1023 return res; 1022 1024 } 1023 1025 ··· 1189 1197 vm_flags_t vm_flags; 1190 1198 int pkey = 0; 1191 1199 1192 - validate_mm(mm); 1193 1200 *populate = 0; 1194 1201 1195 1202 if (!len) ··· 1935 1944 struct vm_area_struct *next; 1936 1945 unsigned long gap_addr; 1937 1946 int error = 0; 1938 - MA_STATE(mas, &mm->mm_mt, 0, 0); 1947 + MA_STATE(mas, &mm->mm_mt, vma->vm_start, address); 1939 1948 1940 1949 if (!(vma->vm_flags & VM_GROWSUP)) 1941 1950 return -EFAULT; ··· 1960 1969 /* Check that both stack segments have the same anon_vma? */ 1961 1970 } 1962 1971 1963 - if (mas_preallocate(&mas, GFP_KERNEL)) 1972 + if (next) 1973 + mas_prev_range(&mas, address); 1974 + 1975 + __mas_set_range(&mas, vma->vm_start, address - 1); 1976 + if (mas_preallocate(&mas, vma, GFP_KERNEL)) 1964 1977 return -ENOMEM; 1965 1978 1966 1979 /* We must make sure the anon_vma is allocated. */ ··· 2009 2014 anon_vma_interval_tree_pre_update_vma(vma); 2010 2015 vma->vm_end = address; 2011 2016 /* Overwrite old entry in mtree. */ 2012 - mas_set_range(&mas, vma->vm_start, address - 1); 2013 2017 mas_store_prealloc(&mas, vma); 2014 2018 anon_vma_interval_tree_post_update_vma(vma); 2015 2019 spin_unlock(&mm->page_table_lock); ··· 2020 2026 anon_vma_unlock_write(vma->anon_vma); 2021 2027 khugepaged_enter_vma(vma, vma->vm_flags); 2022 2028 mas_destroy(&mas); 2029 + validate_mm(mm); 2023 2030 return error; 2024 2031 } 2025 2032 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */ ··· 2053 2058 return -ENOMEM; 2054 2059 } 2055 2060 2056 - if (mas_preallocate(&mas, GFP_KERNEL)) 2061 + if (prev) 2062 + mas_next_range(&mas, vma->vm_start); 2063 + 2064 + __mas_set_range(&mas, address, vma->vm_end - 1); 2065 + if (mas_preallocate(&mas, vma, GFP_KERNEL)) 2057 2066 return -ENOMEM; 2058 2067 2059 2068 /* We must make sure the anon_vma is allocated. 
*/ ··· 2103 2104 vma->vm_start = address; 2104 2105 vma->vm_pgoff -= grow; 2105 2106 /* Overwrite old entry in mtree. */ 2106 - mas_set_range(&mas, address, vma->vm_end - 1); 2107 2107 mas_store_prealloc(&mas, vma); 2108 2108 anon_vma_interval_tree_post_update_vma(vma); 2109 2109 spin_unlock(&mm->page_table_lock); ··· 2114 2116 anon_vma_unlock_write(vma->anon_vma); 2115 2117 khugepaged_enter_vma(vma, vma->vm_flags); 2116 2118 mas_destroy(&mas); 2119 + validate_mm(mm); 2117 2120 return error; 2118 2121 } 2119 2122 ··· 2292 2293 remove_vma(vma, false); 2293 2294 } 2294 2295 vm_unacct_memory(nr_accounted); 2295 - validate_mm(mm); 2296 2296 } 2297 2297 2298 2298 /* ··· 2299 2301 * 2300 2302 * Called with the mm semaphore held. 2301 2303 */ 2302 - static void unmap_region(struct mm_struct *mm, struct maple_tree *mt, 2304 + static void unmap_region(struct mm_struct *mm, struct ma_state *mas, 2303 2305 struct vm_area_struct *vma, struct vm_area_struct *prev, 2304 - struct vm_area_struct *next, 2305 - unsigned long start, unsigned long end, bool mm_wr_locked) 2306 + struct vm_area_struct *next, unsigned long start, 2307 + unsigned long end, unsigned long tree_end, bool mm_wr_locked) 2306 2308 { 2307 2309 struct mmu_gather tlb; 2310 + unsigned long mt_start = mas->index; 2308 2311 2309 2312 lru_add_drain(); 2310 2313 tlb_gather_mmu(&tlb, mm); 2311 2314 update_hiwater_rss(mm); 2312 - unmap_vmas(&tlb, mt, vma, start, end, mm_wr_locked); 2313 - free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, 2315 + unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked); 2316 + mas_set(mas, mt_start); 2317 + free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, 2314 2318 next ? 
next->vm_start : USER_PGTABLES_CEILING, 2315 2319 mm_wr_locked); 2316 2320 tlb_finish_mmu(&tlb); ··· 2330 2330 struct vm_area_struct *new; 2331 2331 int err; 2332 2332 2333 - validate_mm(vma->vm_mm); 2334 - 2335 2333 WARN_ON(vma->vm_start >= addr); 2336 2334 WARN_ON(vma->vm_end <= addr); 2337 2335 ··· 2343 2345 if (!new) 2344 2346 return -ENOMEM; 2345 2347 2346 - err = -ENOMEM; 2347 - if (vma_iter_prealloc(vmi)) 2348 - goto out_free_vma; 2349 - 2350 2348 if (new_below) { 2351 2349 new->vm_end = addr; 2352 2350 } else { 2353 2351 new->vm_start = addr; 2354 2352 new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT); 2355 2353 } 2354 + 2355 + err = -ENOMEM; 2356 + vma_iter_config(vmi, new->vm_start, new->vm_end); 2357 + if (vma_iter_prealloc(vmi, new)) 2358 + goto out_free_vma; 2356 2359 2357 2360 err = vma_dup_policy(vma, new); 2358 2361 if (err) ··· 2368 2369 2369 2370 if (new->vm_ops && new->vm_ops->open) 2370 2371 new->vm_ops->open(new); 2372 + 2373 + vma_start_write(vma); 2374 + vma_start_write(new); 2371 2375 2372 2376 init_vma_prep(&vp, vma); 2373 2377 vp.insert = new; ··· 2390 2388 /* Success. */ 2391 2389 if (new_below) 2392 2390 vma_next(vmi); 2393 - validate_mm(vma->vm_mm); 2394 2391 return 0; 2395 2392 2396 2393 out_free_mpol: ··· 2398 2397 vma_iter_free(vmi); 2399 2398 out_free_vma: 2400 2399 vm_area_free(new); 2401 - validate_mm(vma->vm_mm); 2402 2400 return err; 2403 2401 } 2404 2402 ··· 2440 2440 unsigned long locked_vm = 0; 2441 2441 MA_STATE(mas_detach, &mt_detach, 0, 0); 2442 2442 mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); 2443 - mt_set_external_lock(&mt_detach, &mm->mmap_lock); 2443 + mt_on_stack(mt_detach); 2444 2444 2445 2445 /* 2446 2446 * If we need to split any vma, do it now to save pain later. 
··· 2461 2461 if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) 2462 2462 goto map_count_exceeded; 2463 2463 2464 - error = __split_vma(vmi, vma, start, 0); 2464 + error = __split_vma(vmi, vma, start, 1); 2465 2465 if (error) 2466 2466 goto start_split_failed; 2467 - 2468 - vma = vma_iter_load(vmi); 2469 2467 } 2470 - 2471 - prev = vma_prev(vmi); 2472 - if (unlikely((!prev))) 2473 - vma_iter_set(vmi, start); 2474 2468 2475 2469 /* 2476 2470 * Detach a range of VMAs from the mm. Using next as a temp variable as 2477 2471 * it is always overwritten. 2478 2472 */ 2479 - for_each_vma_range(*vmi, next, end) { 2473 + next = vma; 2474 + do { 2480 2475 /* Does it split the end? */ 2481 2476 if (next->vm_end > end) { 2482 2477 error = __split_vma(vmi, next, end, 0); ··· 2479 2484 goto end_split_failed; 2480 2485 } 2481 2486 vma_start_write(next); 2482 - mas_set_range(&mas_detach, next->vm_start, next->vm_end - 1); 2487 + mas_set(&mas_detach, count); 2483 2488 error = mas_store_gfp(&mas_detach, next, GFP_KERNEL); 2484 2489 if (error) 2485 2490 goto munmap_gather_failed; ··· 2507 2512 BUG_ON(next->vm_start < start); 2508 2513 BUG_ON(next->vm_start > end); 2509 2514 #endif 2510 - } 2511 - 2512 - if (vma_iter_end(vmi) > end) 2513 - next = vma_iter_load(vmi); 2514 - 2515 - if (!next) 2516 - next = vma_next(vmi); 2515 + } for_each_vma_range(*vmi, next, end); 2517 2516 2518 2517 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE) 2519 2518 /* Make sure no VMAs are about to be lost. 
*/ 2520 2519 { 2521 - MA_STATE(test, &mt_detach, start, end - 1); 2520 + MA_STATE(test, &mt_detach, 0, 0); 2522 2521 struct vm_area_struct *vma_mas, *vma_test; 2523 2522 int test_count = 0; 2524 2523 2525 2524 vma_iter_set(vmi, start); 2526 2525 rcu_read_lock(); 2527 - vma_test = mas_find(&test, end - 1); 2526 + vma_test = mas_find(&test, count - 1); 2528 2527 for_each_vma_range(*vmi, vma_mas, end) { 2529 2528 BUG_ON(vma_mas != vma_test); 2530 2529 test_count++; 2531 - vma_test = mas_next(&test, end - 1); 2530 + vma_test = mas_next(&test, count - 1); 2532 2531 } 2533 2532 rcu_read_unlock(); 2534 2533 BUG_ON(count != test_count); 2535 2534 } 2536 2535 #endif 2537 - vma_iter_set(vmi, start); 2536 + 2537 + while (vma_iter_addr(vmi) > start) 2538 + vma_iter_prev_range(vmi); 2539 + 2538 2540 error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL); 2539 2541 if (error) 2540 2542 goto clear_tree_failed; ··· 2542 2550 if (unlock) 2543 2551 mmap_write_downgrade(mm); 2544 2552 2553 + prev = vma_iter_prev_range(vmi); 2554 + next = vma_next(vmi); 2555 + if (next) 2556 + vma_iter_prev_range(vmi); 2557 + 2545 2558 /* 2546 2559 * We can free page tables without write-locking mmap_lock because VMAs 2547 2560 * were isolated before we downgraded mmap_lock. 
2548 2561 */ 2549 - unmap_region(mm, &mt_detach, vma, prev, next, start, end, !unlock); 2562 + mas_set(&mas_detach, 1); 2563 + unmap_region(mm, &mas_detach, vma, prev, next, start, end, count, 2564 + !unlock); 2550 2565 /* Statistics and freeing VMAs */ 2551 - mas_set(&mas_detach, start); 2566 + mas_set(&mas_detach, 0); 2552 2567 remove_mt(mm, &mas_detach); 2553 - __mt_destroy(&mt_detach); 2554 2568 validate_mm(mm); 2555 2569 if (unlock) 2556 2570 mmap_read_unlock(mm); 2557 2571 2572 + __mt_destroy(&mt_detach); 2558 2573 return 0; 2559 2574 2560 2575 clear_tree_failed: ··· 2685 2686 2686 2687 next = vma_next(&vmi); 2687 2688 prev = vma_prev(&vmi); 2688 - if (vm_flags & VM_SPECIAL) 2689 + if (vm_flags & VM_SPECIAL) { 2690 + if (prev) 2691 + vma_iter_next_range(&vmi); 2689 2692 goto cannot_expand; 2693 + } 2690 2694 2691 2695 /* Attempt to expand an old mapping */ 2692 2696 /* Check next */ ··· 2710 2708 merge_start = prev->vm_start; 2711 2709 vma = prev; 2712 2710 vm_pgoff = prev->vm_pgoff; 2711 + } else if (prev) { 2712 + vma_iter_next_range(&vmi); 2713 2713 } 2714 - 2715 2714 2716 2715 /* Actually expand, if possible */ 2717 2716 if (vma && ··· 2721 2718 goto expanded; 2722 2719 } 2723 2720 2721 + if (vma == prev) 2722 + vma_iter_set(&vmi, addr); 2724 2723 cannot_expand: 2725 - if (prev) 2726 - vma_iter_next_range(&vmi); 2727 2724 2728 2725 /* 2729 2726 * Determine the object being mapped and call the appropriate ··· 2736 2733 goto unacct_error; 2737 2734 } 2738 2735 2739 - vma_iter_set(&vmi, addr); 2736 + vma_iter_config(&vmi, addr, end); 2740 2737 vma->vm_start = addr; 2741 2738 vma->vm_end = end; 2742 2739 vm_flags_init(vma, vm_flags); ··· 2763 2760 if (WARN_ON((addr != vma->vm_start))) 2764 2761 goto close_and_free_vma; 2765 2762 2766 - vma_iter_set(&vmi, addr); 2763 + vma_iter_config(&vmi, addr, end); 2767 2764 /* 2768 2765 * If vm_flags changed after call_mmap(), we should try merge 2769 2766 * vma again as we may succeed this time. 
··· 2810 2807 goto close_and_free_vma; 2811 2808 2812 2809 error = -ENOMEM; 2813 - if (vma_iter_prealloc(&vmi)) 2810 + if (vma_iter_prealloc(&vmi, vma)) 2814 2811 goto close_and_free_vma; 2815 2812 2816 2813 /* Lock the VMA since it is modified after insertion into VMA tree */ 2817 2814 vma_start_write(vma); 2818 - if (vma->vm_file) 2819 - i_mmap_lock_write(vma->vm_file->f_mapping); 2820 - 2821 2815 vma_iter_store(&vmi, vma); 2822 2816 mm->map_count++; 2823 2817 if (vma->vm_file) { 2818 + i_mmap_lock_write(vma->vm_file->f_mapping); 2824 2819 if (vma->vm_flags & VM_SHARED) 2825 2820 mapping_allow_writable(vma->vm_file->f_mapping); 2826 2821 ··· 2879 2878 fput(vma->vm_file); 2880 2879 vma->vm_file = NULL; 2881 2880 2881 + vma_iter_set(&vmi, vma->vm_end); 2882 2882 /* Undo any partial mapping done by a device driver. */ 2883 - unmap_region(mm, &mm->mm_mt, vma, prev, next, vma->vm_start, 2884 - vma->vm_end, true); 2883 + unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start, 2884 + vma->vm_end, vma->vm_end, true); 2885 2885 } 2886 2886 if (file && (vm_flags & VM_SHARED)) 2887 2887 mapping_unmap_writable(file->f_mapping); ··· 3052 3050 struct mm_struct *mm = current->mm; 3053 3051 struct vma_prepare vp; 3054 3052 3055 - validate_mm(mm); 3056 3053 /* 3057 3054 * Check against address space limits by the changed size 3058 3055 * Note: This happens *after* clearing old mappings in some code paths. 
··· 3073 3072 if (vma && vma->vm_end == addr && !vma_policy(vma) && 3074 3073 can_vma_merge_after(vma, flags, NULL, NULL, 3075 3074 addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL)) { 3076 - if (vma_iter_prealloc(vmi)) 3075 + vma_iter_config(vmi, vma->vm_start, addr + len); 3076 + if (vma_iter_prealloc(vmi, vma)) 3077 3077 goto unacct_fail; 3078 + 3079 + vma_start_write(vma); 3078 3080 3079 3081 init_vma_prep(&vp, vma); 3080 3082 vma_prepare(&vp); ··· 3091 3087 goto out; 3092 3088 } 3093 3089 3090 + if (vma) 3091 + vma_iter_next_range(vmi); 3094 3092 /* create a vma struct for an anonymous mapping */ 3095 3093 vma = vm_area_alloc(mm); 3096 3094 if (!vma) ··· 3104 3098 vma->vm_pgoff = addr >> PAGE_SHIFT; 3105 3099 vm_flags_init(vma, flags); 3106 3100 vma->vm_page_prot = vm_get_page_prot(flags); 3101 + vma_start_write(vma); 3107 3102 if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) 3108 3103 goto mas_store_fail; 3109 3104 3110 3105 mm->map_count++; 3106 + validate_mm(mm); 3111 3107 ksm_add_vma(vma); 3112 3108 out: 3113 3109 perf_event_mmap(vma); ··· 3118 3110 if (flags & VM_LOCKED) 3119 3111 mm->locked_vm += (len >> PAGE_SHIFT); 3120 3112 vm_flags_set(vma, VM_SOFTDIRTY); 3121 - validate_mm(mm); 3122 3113 return 0; 3123 3114 3124 3115 mas_store_fail: ··· 3207 3200 tlb_gather_mmu_fullmm(&tlb, mm); 3208 3201 /* update_hiwater_rss(mm) here? 
but nobody should be looking */ 3209 3202 /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */ 3210 - unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false); 3203 + unmap_vmas(&tlb, &mas, vma, 0, ULONG_MAX, ULONG_MAX, false); 3211 3204 mmap_read_unlock(mm); 3212 3205 3213 3206 /* ··· 3217 3210 set_bit(MMF_OOM_SKIP, &mm->flags); 3218 3211 mmap_write_lock(mm); 3219 3212 mt_clear_in_rcu(&mm->mm_mt); 3220 - free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, 3213 + mas_set(&mas, vma->vm_end); 3214 + free_pgtables(&tlb, &mas, vma, FIRST_USER_ADDRESS, 3221 3215 USER_PGTABLES_CEILING, true); 3222 3216 tlb_finish_mmu(&tlb); 3223 3217 ··· 3227 3219 * enabled, without holding any MM locks besides the unreachable 3228 3220 * mmap_write_lock. 3229 3221 */ 3222 + mas_set(&mas, vma->vm_end); 3230 3223 do { 3231 3224 if (vma->vm_flags & VM_ACCOUNT) 3232 3225 nr_accounted += vma_pages(vma); ··· 3300 3291 bool faulted_in_anon_vma = true; 3301 3292 VMA_ITERATOR(vmi, mm, addr); 3302 3293 3303 - validate_mm(mm); 3304 3294 /* 3305 3295 * If anonymous vma has not yet been faulted, update new pgoff 3306 3296 * to match new location, to increase its chance of merging. 
··· 3353 3345 get_file(new_vma->vm_file); 3354 3346 if (new_vma->vm_ops && new_vma->vm_ops->open) 3355 3347 new_vma->vm_ops->open(new_vma); 3356 - vma_start_write(new_vma); 3357 3348 if (vma_link(mm, new_vma)) 3358 3349 goto out_vma_link; 3359 3350 *need_rmap_locks = false; 3360 3351 } 3361 - validate_mm(mm); 3362 3352 return new_vma; 3363 3353 3364 3354 out_vma_link: ··· 3372 3366 out_free_vma: 3373 3367 vm_area_free(new_vma); 3374 3368 out: 3375 - validate_mm(mm); 3376 3369 return NULL; 3377 3370 } 3378 3371 ··· 3508 3503 int ret; 3509 3504 struct vm_area_struct *vma; 3510 3505 3511 - validate_mm(mm); 3512 3506 vma = vm_area_alloc(mm); 3513 3507 if (unlikely(vma == NULL)) 3514 3508 return ERR_PTR(-ENOMEM); ··· 3530 3526 3531 3527 perf_event_mmap(vma); 3532 3528 3533 - validate_mm(mm); 3534 3529 return vma; 3535 3530 3536 3531 out: 3537 3532 vm_area_free(vma); 3538 - validate_mm(mm); 3539 3533 return ERR_PTR(ret); 3540 3534 } 3541 3535 ··· 3665 3663 3666 3664 mutex_lock(&mm_all_locks_mutex); 3667 3665 3666 + /* 3667 + * vma_start_write() does not have a complement in mm_drop_all_locks() 3668 + * because vma_start_write() is always asymmetrical; it marks a VMA as 3669 + * being written to until mmap_write_unlock() or mmap_write_downgrade() 3670 + * is reached. 
3671 + */ 3668 3672 mas_for_each(&mas, vma, ULONG_MAX) { 3669 3673 if (signal_pending(current)) 3670 3674 goto out_unlock; ··· 3767 3759 if (vma->vm_file && vma->vm_file->f_mapping) 3768 3760 vm_unlock_mapping(vma->vm_file->f_mapping); 3769 3761 } 3770 - vma_end_write_all(mm); 3771 3762 3772 3763 mutex_unlock(&mm_all_locks_mutex); 3773 3764 } ··· 3796 3789 { 3797 3790 unsigned long free_kbytes; 3798 3791 3799 - free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3792 + free_kbytes = K(global_zone_page_state(NR_FREE_PAGES)); 3800 3793 3801 3794 sysctl_user_reserve_kbytes = min(free_kbytes / 32, 1UL << 17); 3802 3795 return 0; ··· 3817 3810 { 3818 3811 unsigned long free_kbytes; 3819 3812 3820 - free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3813 + free_kbytes = K(global_zone_page_state(NR_FREE_PAGES)); 3821 3814 3822 3815 sysctl_admin_reserve_kbytes = min(free_kbytes / 32, 1UL << 13); 3823 3816 return 0; ··· 3861 3854 3862 3855 break; 3863 3856 case MEM_OFFLINE: 3864 - free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3857 + free_kbytes = K(global_zone_page_state(NR_FREE_PAGES)); 3865 3858 3866 3859 if (sysctl_user_reserve_kbytes > free_kbytes) { 3867 3860 init_user_reserve();
+1
mm/mmu_gather.c
··· 63 63 /** 64 64 * tlb_flush_rmaps - do pending rmap removals after we have flushed the TLB 65 65 * @tlb: the current mmu_gather 66 + * @vma: The memory area from which the pages are being removed. 66 67 * 67 68 * Note that because of how tlb_next_batch() above works, we will 68 69 * never start multiple new batches with pending delayed rmaps, so
+21 -29
mm/mmu_notifier.c
··· 199 199 * invalidate_start/end and is colliding. 200 200 * 201 201 * The locking looks broadly like this: 202 - * mn_tree_invalidate_start(): mmu_interval_read_begin(): 202 + * mn_itree_inv_start(): mmu_interval_read_begin(): 203 203 * spin_lock 204 204 * seq = READ_ONCE(interval_sub->invalidate_seq); 205 205 * seq == subs->invalidate_seq ··· 207 207 * spin_lock 208 208 * seq = ++subscriptions->invalidate_seq 209 209 * spin_unlock 210 - * op->invalidate_range(): 210 + * op->invalidate(): 211 211 * user_lock 212 212 * mmu_interval_set_seq() 213 213 * interval_sub->invalidate_seq = seq ··· 551 551 552 552 static void 553 553 mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions, 554 - struct mmu_notifier_range *range, bool only_end) 554 + struct mmu_notifier_range *range) 555 555 { 556 556 struct mmu_notifier *subscription; 557 557 int id; ··· 559 559 id = srcu_read_lock(&srcu); 560 560 hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist, 561 561 srcu_read_lock_held(&srcu)) { 562 - /* 563 - * Call invalidate_range here too to avoid the need for the 564 - * subsystem of having to register an invalidate_range_end 565 - * call-back when there is invalidate_range already. Usually a 566 - * subsystem registers either invalidate_range_start()/end() or 567 - * invalidate_range(), so this will be no additional overhead 568 - * (besides the pointer check). 569 - * 570 - * We skip call to invalidate_range() if we know it is safe ie 571 - * call site use mmu_notifier_invalidate_range_only_end() which 572 - * is safe to do when we know that a call to invalidate_range() 573 - * already happen under page table lock. 
574 - */ 575 - if (!only_end && subscription->ops->invalidate_range) 576 - subscription->ops->invalidate_range(subscription, 577 - range->mm, 578 - range->start, 579 - range->end); 580 562 if (subscription->ops->invalidate_range_end) { 581 563 if (!mmu_notifier_range_blockable(range)) 582 564 non_block_start(); ··· 571 589 srcu_read_unlock(&srcu, id); 572 590 } 573 591 574 - void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range, 575 - bool only_end) 592 + void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range) 576 593 { 577 594 struct mmu_notifier_subscriptions *subscriptions = 578 595 range->mm->notifier_subscriptions; ··· 581 600 mn_itree_inv_end(subscriptions); 582 601 583 602 if (!hlist_empty(&subscriptions->list)) 584 - mn_hlist_invalidate_end(subscriptions, range, only_end); 603 + mn_hlist_invalidate_end(subscriptions, range); 585 604 lock_map_release(&__mmu_notifier_invalidate_range_start_map); 586 605 } 587 606 588 - void __mmu_notifier_invalidate_range(struct mm_struct *mm, 589 - unsigned long start, unsigned long end) 607 + void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm, 608 + unsigned long start, unsigned long end) 590 609 { 591 610 struct mmu_notifier *subscription; 592 611 int id; ··· 595 614 hlist_for_each_entry_rcu(subscription, 596 615 &mm->notifier_subscriptions->list, hlist, 597 616 srcu_read_lock_held(&srcu)) { 598 - if (subscription->ops->invalidate_range) 599 - subscription->ops->invalidate_range(subscription, mm, 600 - start, end); 617 + if (subscription->ops->arch_invalidate_secondary_tlbs) 618 + subscription->ops->arch_invalidate_secondary_tlbs( 619 + subscription, mm, 620 + start, end); 601 621 } 602 622 srcu_read_unlock(&srcu, id); 603 623 } ··· 616 634 617 635 mmap_assert_write_locked(mm); 618 636 BUG_ON(atomic_read(&mm->mm_users) <= 0); 637 + 638 + /* 639 + * Subsystems should only register for invalidate_secondary_tlbs() or 640 + * invalidate_range_start()/end() callbacks, 
not both. 641 + */ 642 + if (WARN_ON_ONCE(subscription && 643 + (subscription->ops->arch_invalidate_secondary_tlbs && 644 + (subscription->ops->invalidate_range_start || 645 + subscription->ops->invalidate_range_end)))) 646 + return -EINVAL; 619 647 620 648 if (!mm->notifier_subscriptions) { 621 649 /*
+4 -3
mm/mprotect.c
··· 213 213 } else if (is_writable_device_private_entry(entry)) { 214 214 /* 215 215 * We do not preserve soft-dirtiness. See 216 - * copy_one_pte() for explanation. 216 + * copy_nonpresent_pte() for explanation. 217 217 */ 218 218 entry = make_readable_device_private_entry( 219 219 swp_offset(entry)); ··· 230 230 newpte = pte_swp_mkuffd_wp(newpte); 231 231 } else if (is_pte_marker_entry(entry)) { 232 232 /* 233 - * Ignore swapin errors unconditionally, 233 + * Ignore error swap entries unconditionally, 234 234 * because any access should sigbus anyway. 235 235 */ 236 - if (is_swapin_error_entry(entry)) 236 + if (is_poisoned_swp_entry(entry)) 237 237 continue; 238 238 /* 239 239 * If this is uffd-wp pte marker and we'd like ··· 657 657 * vm_flags and vm_page_prot are protected by the mmap_lock 658 658 * held in write mode. 659 659 */ 660 + vma_start_write(vma); 660 661 vm_flags_reset(vma, newflags); 661 662 if (vma_wants_manual_pte_write_upgrade(vma)) 662 663 mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
+1 -1
mm/mremap.c
··· 349 349 } 350 350 #endif 351 351 352 - #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 352 + #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) 353 353 static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr, 354 354 unsigned long new_addr, pud_t *old_pud, pud_t *new_pud) 355 355 {
+23 -32
mm/nommu.c
··· 583 583 { 584 584 VMA_ITERATOR(vmi, vma->vm_mm, vma->vm_start); 585 585 586 - if (vma_iter_prealloc(&vmi)) { 586 + vma_iter_config(&vmi, vma->vm_start, vma->vm_end); 587 + if (vma_iter_prealloc(&vmi, vma)) { 587 588 pr_warn("Allocation of vma tree for process %d failed\n", 588 589 current->pid); 589 590 return -ENOMEM; ··· 592 591 cleanup_vma_from_mm(vma); 593 592 594 593 /* remove from the MM's tree and list */ 595 - vma_iter_clear(&vmi, vma->vm_start, vma->vm_end); 594 + vma_iter_clear(&vmi); 596 595 return 0; 597 596 } 598 597 /* ··· 1004 1003 enomem: 1005 1004 pr_err("Allocation of length %lu from process %d (%s) failed\n", 1006 1005 len, current->pid, current->comm); 1007 - show_free_areas(0, NULL); 1006 + show_mem(); 1008 1007 return -ENOMEM; 1009 1008 } 1010 1009 ··· 1054 1053 vma = vm_area_alloc(current->mm); 1055 1054 if (!vma) 1056 1055 goto error_getting_vma; 1057 - 1058 - if (vma_iter_prealloc(&vmi)) 1059 - goto error_vma_iter_prealloc; 1060 1056 1061 1057 region->vm_usage = 1; 1062 1058 region->vm_flags = vm_flags; ··· 1196 1198 1197 1199 share: 1198 1200 BUG_ON(!vma->vm_region); 1201 + vma_iter_config(&vmi, vma->vm_start, vma->vm_end); 1202 + if (vma_iter_prealloc(&vmi, vma)) 1203 + goto error_just_free; 1204 + 1199 1205 setup_vma_to_mm(vma, current->mm); 1200 1206 current->mm->map_count++; 1201 1207 /* add the VMA to the tree */ ··· 1238 1236 kmem_cache_free(vm_region_jar, region); 1239 1237 pr_warn("Allocation of vma for %lu byte allocation from process %d failed\n", 1240 1238 len, current->pid); 1241 - show_free_areas(0, NULL); 1239 + show_mem(); 1242 1240 return -ENOMEM; 1243 1241 1244 1242 error_getting_region: 1245 1243 pr_warn("Allocation of vm region for %lu byte allocation from process %d failed\n", 1246 1244 len, current->pid); 1247 - show_free_areas(0, NULL); 1245 + show_mem(); 1248 1246 return -ENOMEM; 1249 - 1250 - error_vma_iter_prealloc: 1251 - kmem_cache_free(vm_region_jar, region); 1252 - vm_area_free(vma); 1253 - 
pr_warn("Allocation of vma tree for process %d failed\n", current->pid); 1254 - show_free_areas(0, NULL); 1255 - return -ENOMEM; 1256 - 1257 1247 } 1258 1248 1259 1249 unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, ··· 1330 1336 if (!new) 1331 1337 goto err_vma_dup; 1332 1338 1333 - if (vma_iter_prealloc(vmi)) { 1334 - pr_warn("Allocation of vma tree for process %d failed\n", 1335 - current->pid); 1336 - goto err_vmi_preallocate; 1337 - } 1338 - 1339 1339 /* most fields are the same, copy all, and then fixup */ 1340 1340 *region = *vma->vm_region; 1341 1341 new->vm_region = region; ··· 1341 1353 } else { 1342 1354 region->vm_start = new->vm_start = addr; 1343 1355 region->vm_pgoff = new->vm_pgoff += npages; 1356 + } 1357 + 1358 + vma_iter_config(vmi, new->vm_start, new->vm_end); 1359 + if (vma_iter_prealloc(vmi, vma)) { 1360 + pr_warn("Allocation of vma tree for process %d failed\n", 1361 + current->pid); 1362 + goto err_vmi_preallocate; 1344 1363 } 1345 1364 1346 1365 if (new->vm_ops && new->vm_ops->open) ··· 1391 1396 1392 1397 /* adjust the VMA's pointers, which may reposition it in the MM's tree 1393 1398 * and list */ 1394 - if (vma_iter_prealloc(vmi)) { 1395 - pr_warn("Allocation of vma tree for process %d failed\n", 1396 - current->pid); 1397 - return -ENOMEM; 1398 - } 1399 - 1400 1399 if (from > vma->vm_start) { 1401 - vma_iter_clear(vmi, from, vma->vm_end); 1400 + if (vma_iter_clear_gfp(vmi, from, vma->vm_end, GFP_KERNEL)) 1401 + return -ENOMEM; 1402 1402 vma->vm_end = from; 1403 1403 } else { 1404 - vma_iter_clear(vmi, vma->vm_start, to); 1404 + if (vma_iter_clear_gfp(vmi, vma->vm_start, to, GFP_KERNEL)) 1405 + return -ENOMEM; 1405 1406 vma->vm_start = to; 1406 1407 } 1407 1408 ··· 1800 1809 { 1801 1810 unsigned long free_kbytes; 1802 1811 1803 - free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 1812 + free_kbytes = K(global_zone_page_state(NR_FREE_PAGES)); 1804 1813 1805 1814 sysctl_user_reserve_kbytes = 
min(free_kbytes / 32, 1UL << 17); 1806 1815 return 0; ··· 1821 1830 { 1822 1831 unsigned long free_kbytes; 1823 1832 1824 - free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 1833 + free_kbytes = K(global_zone_page_state(NR_FREE_PAGES)); 1825 1834 1826 1835 sysctl_admin_reserve_kbytes = min(free_kbytes / 32, 1UL << 13); 1827 1836 return 0;
-3
mm/oom_kill.c
··· 479 479 480 480 static bool oom_killer_disabled __read_mostly; 481 481 482 - #define K(x) ((x) << (PAGE_SHIFT-10)) 483 - 484 482 /* 485 483 * task->mm can be NULL if the task is the exited group leader. So to 486 484 * determine whether the task is using a particular mm, we examine all the ··· 992 994 mmdrop(mm); 993 995 put_task_struct(victim); 994 996 } 995 - #undef K 996 997 997 998 /* 998 999 * Kill provided task unless it's secured by setting
+55 -95
mm/page_alloc.c
··· 284 284 #endif 285 285 }; 286 286 287 - static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = { 288 - [NULL_COMPOUND_DTOR] = NULL, 289 - [COMPOUND_PAGE_DTOR] = free_compound_page, 290 - #ifdef CONFIG_HUGETLB_PAGE 291 - [HUGETLB_PAGE_DTOR] = free_huge_page, 292 - #endif 293 - #ifdef CONFIG_TRANSPARENT_HUGEPAGE 294 - [TRANSHUGE_PAGE_DTOR] = free_transhuge_page, 295 - #endif 296 - }; 297 - 298 287 int min_free_kbytes = 1024; 299 288 int user_min_free_kbytes = -1; 300 289 static int watermark_boost_factor __read_mostly = 15000; ··· 360 371 return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; 361 372 } 362 373 363 - static __always_inline 364 - unsigned long __get_pfnblock_flags_mask(const struct page *page, 365 - unsigned long pfn, 366 - unsigned long mask) 374 + /** 375 + * get_pfnblock_flags_mask - Return the requested group of flags for the pageblock_nr_pages block of pages 376 + * @page: The page within the block of interest 377 + * @pfn: The target page frame number 378 + * @mask: mask of bits that the caller is interested in 379 + * 380 + * Return: pageblock_bits flags 381 + */ 382 + unsigned long get_pfnblock_flags_mask(const struct page *page, 383 + unsigned long pfn, unsigned long mask) 367 384 { 368 385 unsigned long *bitmap; 369 386 unsigned long bitidx, word_bitidx; ··· 388 393 return (word >> bitidx) & mask; 389 394 } 390 395 391 - /** 392 - * get_pfnblock_flags_mask - Return the requested group of flags for the pageblock_nr_pages block of pages 393 - * @page: The page within the block of interest 394 - * @pfn: The target page frame number 395 - * @mask: mask of bits that the caller is interested in 396 - * 397 - * Return: pageblock_bits flags 398 - */ 399 - unsigned long get_pfnblock_flags_mask(const struct page *page, 400 - unsigned long pfn, unsigned long mask) 401 - { 402 - return __get_pfnblock_flags_mask(page, pfn, mask); 403 - } 404 - 405 396 static __always_inline int get_pfnblock_migratetype(const struct page *page, 406 
397 unsigned long pfn) 407 398 { 408 - return __get_pfnblock_flags_mask(page, pfn, MIGRATETYPE_MASK); 399 + return get_pfnblock_flags_mask(page, pfn, MIGRATETYPE_MASK); 409 400 } 410 401 411 402 /** ··· 440 459 #ifdef CONFIG_DEBUG_VM 441 460 static int page_outside_zone_boundaries(struct zone *zone, struct page *page) 442 461 { 443 - int ret = 0; 462 + int ret; 444 463 unsigned seq; 445 464 unsigned long pfn = page_to_pfn(page); 446 465 unsigned long sp, start_pfn; ··· 449 468 seq = zone_span_seqbegin(zone); 450 469 start_pfn = zone->zone_start_pfn; 451 470 sp = zone->spanned_pages; 452 - if (!zone_spans_pfn(zone, pfn)) 453 - ret = 1; 471 + ret = !zone_spans_pfn(zone, pfn); 454 472 } while (zone_span_seqretry(zone, seq)); 455 473 456 474 if (ret) ··· 519 539 520 540 static inline unsigned int order_to_pindex(int migratetype, int order) 521 541 { 522 - int base = order; 523 - 524 542 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 525 543 if (order > PAGE_ALLOC_COSTLY_ORDER) { 526 544 VM_BUG_ON(order != pageblock_order); ··· 528 550 VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); 529 551 #endif 530 552 531 - return (MIGRATE_PCPTYPES * base) + migratetype; 553 + return (MIGRATE_PCPTYPES * order) + migratetype; 532 554 } 533 555 534 556 static inline int pindex_to_order(unsigned int pindex) ··· 572 594 * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded 573 595 * in bit 0 of page->compound_head. The rest of bits is pointer to head page. 574 596 * 575 - * The first tail page's ->compound_dtor holds the offset in array of compound 576 - * page destructors. See compound_page_dtors. 577 - * 578 597 * The first tail page's ->compound_order holds the order of allocation. 579 598 * This usage means that zero-order pages may not be compound. 
580 599 */ 581 - 582 - void free_compound_page(struct page *page) 583 - { 584 - mem_cgroup_uncharge(page_folio(page)); 585 - free_the_page(page, compound_order(page)); 586 - } 587 600 588 601 void prep_compound_page(struct page *page, unsigned int order) 589 602 { ··· 590 621 591 622 void destroy_large_folio(struct folio *folio) 592 623 { 593 - enum compound_dtor_id dtor = folio->_folio_dtor; 624 + if (folio_test_hugetlb(folio)) { 625 + free_huge_folio(folio); 626 + return; 627 + } 594 628 595 - VM_BUG_ON_FOLIO(dtor >= NR_COMPOUND_DTORS, folio); 596 - compound_page_dtors[dtor](&folio->page); 629 + if (folio_test_large_rmappable(folio)) 630 + folio_undo_large_rmappable(folio); 631 + 632 + mem_cgroup_uncharge(folio); 633 + free_the_page(&folio->page, folio_order(folio)); 597 634 } 598 635 599 636 static inline void set_buddy_order(struct page *page, unsigned int order) ··· 799 824 * pageblock isolation could cause incorrect freepage or CMA 800 825 * accounting or HIGHATOMIC accounting. 801 826 */ 802 - int buddy_mt = get_pageblock_migratetype(buddy); 827 + int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn); 803 828 804 829 if (migratetype != buddy_mt 805 830 && (!migratetype_is_mergeable(migratetype) || ··· 875 900 goto out; 876 901 } 877 902 878 - mt = get_pageblock_migratetype(free_page); 903 + mt = get_pfnblock_migratetype(free_page, free_page_pfn); 879 904 if (likely(!is_migrate_isolate(mt))) 880 905 __mod_zone_freepage_state(zone, -(1UL << order), mt); 881 906 ··· 1107 1132 VM_BUG_ON_PAGE(compound && compound_order(page) != order, page); 1108 1133 1109 1134 if (compound) 1110 - ClearPageHasHWPoisoned(page); 1135 + page[1].flags &= ~PAGE_FLAGS_SECOND; 1111 1136 for (i = 1; i < (1 << order); i++) { 1112 1137 if (compound) 1113 1138 bad += free_tail_page_prepare(page, page + i); ··· 1185 1210 int pindex) 1186 1211 { 1187 1212 unsigned long flags; 1188 - int min_pindex = 0; 1189 - int max_pindex = NR_PCP_LISTS - 1; 1190 1213 unsigned int order; 1191 1214 bool 
isolated_pageblocks; 1192 1215 struct page *page; ··· 1207 1234 1208 1235 /* Remove pages from lists in a round-robin fashion. */ 1209 1236 do { 1210 - if (++pindex > max_pindex) 1211 - pindex = min_pindex; 1237 + if (++pindex > NR_PCP_LISTS - 1) 1238 + pindex = 0; 1212 1239 list = &pcp->lists[pindex]; 1213 - if (!list_empty(list)) 1214 - break; 1215 - 1216 - if (pindex == max_pindex) 1217 - max_pindex--; 1218 - if (pindex == min_pindex) 1219 - min_pindex++; 1220 - } while (1); 1240 + } while (list_empty(list)); 1221 1241 1222 1242 order = pindex_to_order(pindex); 1223 1243 nr_pages = 1 << order; ··· 1800 1834 1801 1835 free_pages = move_freepages_block(zone, page, start_type, 1802 1836 &movable_pages); 1837 + /* moving whole block can fail due to zone boundary conditions */ 1838 + if (!free_pages) 1839 + goto single_page; 1840 + 1803 1841 /* 1804 1842 * Determine how many pages are compatible with our allocation. 1805 1843 * For movable allocation, it's the number of movable pages which ··· 1825 1855 else 1826 1856 alike_pages = 0; 1827 1857 } 1828 - 1829 - /* moving whole block can fail due to zone boundary conditions */ 1830 - if (!free_pages) 1831 - goto single_page; 1832 - 1833 1858 /* 1834 1859 * If a sufficient number of pages in the block are either free or of 1835 - * comparable migratability as our allocation, claim the whole block. 1860 + * compatible migratability as our allocation, claim the whole block. 
1836 1861 */ 1837 1862 if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || 1838 1863 page_group_by_mobility_disabled) ··· 1877 1912 * Reserve a pageblock for exclusive use of high-order atomic allocations if 1878 1913 * there are no empty page blocks that contain a page with a suitable order 1879 1914 */ 1880 - static void reserve_highatomic_pageblock(struct page *page, struct zone *zone, 1881 - unsigned int alloc_order) 1915 + static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) 1882 1916 { 1883 1917 int mt; 1884 1918 unsigned long max_managed, flags; ··· 2317 2353 return true; 2318 2354 } 2319 2355 2320 - static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch, 2321 - bool free_high) 2356 + static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) 2322 2357 { 2323 2358 int min_nr_free, max_nr_free; 2359 + int batch = READ_ONCE(pcp->batch); 2324 2360 2325 2361 /* Free everything if batch freeing high-order pages. */ 2326 2362 if (unlikely(free_high)) ··· 2387 2423 2388 2424 high = nr_pcp_high(pcp, zone, free_high); 2389 2425 if (pcp->count >= high) { 2390 - int batch = READ_ONCE(pcp->batch); 2391 - 2392 - free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex); 2426 + free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex); 2393 2427 } 2394 2428 } 2395 2429 ··· 3187 3225 * if the pageblock should be reserved for the future 3188 3226 */ 3189 3227 if (unlikely(alloc_flags & ALLOC_HIGHATOMIC)) 3190 - reserve_highatomic_pageblock(page, zone, order); 3228 + reserve_highatomic_pageblock(page, zone); 3191 3229 3192 3230 return page; 3193 3231 } else { ··· 4470 4508 { 4471 4509 struct page *page = __alloc_pages(gfp | __GFP_COMP, order, 4472 4510 preferred_nid, nodemask); 4511 + struct folio *folio = (struct folio *)page; 4473 4512 4474 - if (page && order > 1) 4475 - prep_transhuge_page(page); 4476 - return (struct folio *)page; 4513 + if (folio && order > 1) 4514 
+ folio_prep_large_rmappable(folio); 4515 + return folio; 4477 4516 } 4478 4517 EXPORT_SYMBOL(__folio_alloc); 4479 4518 ··· 5102 5139 unsigned long flags; 5103 5140 5104 5141 /* 5105 - * Explicitly disable this CPU's interrupts before taking seqlock 5106 - * to prevent any IRQ handler from calling into the page allocator 5107 - * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock. 5142 + * The zonelist_update_seq must be acquired with irqsave because the 5143 + * reader can be invoked from IRQ with GFP_ATOMIC. 5108 5144 */ 5109 - local_irq_save(flags); 5145 + write_seqlock_irqsave(&zonelist_update_seq, flags); 5110 5146 /* 5111 - * Explicitly disable this CPU's synchronous printk() before taking 5112 - * seqlock to prevent any printk() from trying to hold port->lock, for 5147 + * Also disable synchronous printk() to prevent any printk() from 5148 + * trying to hold port->lock, for 5113 5149 * tty_insert_flip_string_and_push_buffer() on other CPU might be 5114 5150 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. 
5115 5151 */ 5116 5152 printk_deferred_enter(); 5117 - write_seqlock(&zonelist_update_seq); 5118 5153 5119 5154 #ifdef CONFIG_NUMA 5120 5155 memset(node_load, 0, sizeof(node_load)); ··· 5149 5188 #endif 5150 5189 } 5151 5190 5152 - write_sequnlock(&zonelist_update_seq); 5153 5191 printk_deferred_exit(); 5154 - local_irq_restore(flags); 5192 + write_sequnlock_irqrestore(&zonelist_update_seq, flags); 5155 5193 } 5156 5194 5157 5195 static noinline void __init ··· 5654 5694 struct zone *zone; 5655 5695 unsigned long flags; 5656 5696 5657 - /* Calculate total number of !ZONE_HIGHMEM pages */ 5697 + /* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */ 5658 5698 for_each_zone(zone) { 5659 - if (!is_highmem(zone)) 5699 + if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE) 5660 5700 lowmem_pages += zone_managed_pages(zone); 5661 5701 } 5662 5702 ··· 5666 5706 spin_lock_irqsave(&zone->lock, flags); 5667 5707 tmp = (u64)pages_min * zone_managed_pages(zone); 5668 5708 do_div(tmp, lowmem_pages); 5669 - if (is_highmem(zone)) { 5709 + if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) { 5670 5710 /* 5671 5711 * __GFP_HIGH and PF_MEMALLOC allocations usually don't 5672 - * need highmem pages, so cap pages_min to a small 5673 - * value here. 5712 + * need highmem and movable zones pages, so cap pages_min 5713 + * to a small value here. 5674 5714 * 5675 5715 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN) 5676 5716 * deltas control async page reclaim, and so should 5677 - * not be capped for highmem. 5717 + * not be capped for highmem and movable zones. 5678 5718 */ 5679 5719 unsigned long min_pages; 5680 5720
+49 -54
mm/page_ext.c
··· 90 90 unsigned long page_ext_size; 91 91 92 92 static unsigned long total_usage; 93 - static struct page_ext *lookup_page_ext(const struct page *page); 94 93 95 94 bool early_page_ext __meminitdata; 96 95 static int __init setup_early_page_ext(char *str) ··· 136 137 } 137 138 } 138 139 139 - #ifndef CONFIG_SPARSEMEM 140 - void __init page_ext_init_flatmem_late(void) 141 - { 142 - invoke_init_callbacks(); 143 - } 144 - #endif 145 - 146 140 static inline struct page_ext *get_entry(void *base, unsigned long index) 147 141 { 148 142 return base + page_ext_size * index; 149 143 } 150 144 151 - /** 152 - * page_ext_get() - Get the extended information for a page. 153 - * @page: The page we're interested in. 154 - * 155 - * Ensures that the page_ext will remain valid until page_ext_put() 156 - * is called. 157 - * 158 - * Return: NULL if no page_ext exists for this page. 159 - * Context: Any context. Caller may not sleep until they have called 160 - * page_ext_put(). 161 - */ 162 - struct page_ext *page_ext_get(struct page *page) 163 - { 164 - struct page_ext *page_ext; 165 - 166 - rcu_read_lock(); 167 - page_ext = lookup_page_ext(page); 168 - if (!page_ext) { 169 - rcu_read_unlock(); 170 - return NULL; 171 - } 172 - 173 - return page_ext; 174 - } 175 - 176 - /** 177 - * page_ext_put() - Working with page extended information is done. 178 - * @page_ext: Page extended information received from page_ext_get(). 179 - * 180 - * The page extended information of the page may not be valid after this 181 - * function is called. 182 - * 183 - * Return: None. 184 - * Context: Any context with corresponding page_ext_get() is called. 
185 - */ 186 - void page_ext_put(struct page_ext *page_ext) 187 - { 188 - if (unlikely(!page_ext)) 189 - return; 190 - 191 - rcu_read_unlock(); 192 - } 193 145 #ifndef CONFIG_SPARSEMEM 194 - 146 + void __init page_ext_init_flatmem_late(void) 147 + { 148 + invoke_init_callbacks(); 149 + } 195 150 196 151 void __meminit pgdat_page_ext_init(struct pglist_data *pgdat) 197 152 { ··· 377 424 return 0; 378 425 379 426 /* rollback */ 427 + end = pfn - PAGES_PER_SECTION; 380 428 for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) 381 429 __free_page_ext(pfn); 382 430 383 431 return -ENOMEM; 384 432 } 385 433 386 - static int __meminit offline_page_ext(unsigned long start_pfn, 434 + static void __meminit offline_page_ext(unsigned long start_pfn, 387 435 unsigned long nr_pages) 388 436 { 389 437 unsigned long start, end, pfn; ··· 408 454 409 455 for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) 410 456 __free_page_ext(pfn); 411 - return 0; 412 - 413 457 } 414 458 415 459 static int __meminit page_ext_callback(struct notifier_block *self, ··· 489 537 } 490 538 491 539 #endif 540 + 541 + /** 542 + * page_ext_get() - Get the extended information for a page. 543 + * @page: The page we're interested in. 544 + * 545 + * Ensures that the page_ext will remain valid until page_ext_put() 546 + * is called. 547 + * 548 + * Return: NULL if no page_ext exists for this page. 549 + * Context: Any context. Caller may not sleep until they have called 550 + * page_ext_put(). 551 + */ 552 + struct page_ext *page_ext_get(struct page *page) 553 + { 554 + struct page_ext *page_ext; 555 + 556 + rcu_read_lock(); 557 + page_ext = lookup_page_ext(page); 558 + if (!page_ext) { 559 + rcu_read_unlock(); 560 + return NULL; 561 + } 562 + 563 + return page_ext; 564 + } 565 + 566 + /** 567 + * page_ext_put() - Working with page extended information is done. 568 + * @page_ext: Page extended information received from page_ext_get(). 
569 + * 570 + * The page extended information of the page may not be valid after this 571 + * function is called. 572 + * 573 + * Return: None. 574 + * Context: Any context with corresponding page_ext_get() is called. 575 + */ 576 + void page_ext_put(struct page_ext *page_ext) 577 + { 578 + if (unlikely(!page_ext)) 579 + return; 580 + 581 + rcu_read_unlock(); 582 + }
+39 -41
mm/page_io.c
··· 19 19 #include <linux/bio.h> 20 20 #include <linux/swapops.h> 21 21 #include <linux/writeback.h> 22 - #include <linux/frontswap.h> 23 22 #include <linux/blkdev.h> 24 23 #include <linux/psi.h> 25 24 #include <linux/uio.h> 26 25 #include <linux/sched/task.h> 27 26 #include <linux/delayacct.h> 27 + #include <linux/zswap.h> 28 28 #include "swap.h" 29 29 30 30 static void __end_swap_bio_write(struct bio *bio) 31 31 { 32 - struct page *page = bio_first_page_all(bio); 32 + struct folio *folio = bio_first_folio_all(bio); 33 33 34 34 if (bio->bi_status) { 35 - SetPageError(page); 36 35 /* 37 36 * We failed to write the page out to swap-space. 38 37 * Re-dirty the page in order to avoid it being reclaimed. ··· 40 41 * 41 42 * Also clear PG_reclaim to avoid folio_rotate_reclaimable() 42 43 */ 43 - set_page_dirty(page); 44 + folio_mark_dirty(folio); 44 45 pr_alert_ratelimited("Write-error on swap-device (%u:%u:%llu)\n", 45 46 MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)), 46 47 (unsigned long long)bio->bi_iter.bi_sector); 47 - ClearPageReclaim(page); 48 + folio_clear_reclaim(folio); 48 49 } 49 - end_page_writeback(page); 50 + folio_end_writeback(folio); 50 51 } 51 52 52 53 static void end_swap_bio_write(struct bio *bio) ··· 57 58 58 59 static void __end_swap_bio_read(struct bio *bio) 59 60 { 60 - struct page *page = bio_first_page_all(bio); 61 + struct folio *folio = bio_first_folio_all(bio); 61 62 62 63 if (bio->bi_status) { 63 - SetPageError(page); 64 - ClearPageUptodate(page); 65 64 pr_alert_ratelimited("Read-error on swap-device (%u:%u:%llu)\n", 66 65 MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)), 67 66 (unsigned long long)bio->bi_iter.bi_sector); 68 67 } else { 69 - SetPageUptodate(page); 68 + folio_mark_uptodate(folio); 70 69 } 71 - unlock_page(page); 70 + folio_unlock(folio); 72 71 } 73 72 74 73 static void end_swap_bio_read(struct bio *bio) ··· 195 198 folio_unlock(folio); 196 199 return ret; 197 200 } 198 - if (frontswap_store(&folio->page) == 0) { 201 + if 
(zswap_store(folio)) { 199 202 folio_start_writeback(folio); 200 203 folio_unlock(folio); 201 204 folio_end_writeback(folio); ··· 205 208 return 0; 206 209 } 207 210 208 - static inline void count_swpout_vm_event(struct page *page) 211 + static inline void count_swpout_vm_event(struct folio *folio) 209 212 { 210 213 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 211 - if (unlikely(PageTransHuge(page))) 214 + if (unlikely(folio_test_pmd_mappable(folio))) 212 215 count_vm_event(THP_SWPOUT); 213 216 #endif 214 - count_vm_events(PSWPOUT, thp_nr_pages(page)); 217 + count_vm_events(PSWPOUT, folio_nr_pages(folio)); 215 218 } 216 219 217 220 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) 218 - static void bio_associate_blkg_from_page(struct bio *bio, struct page *page) 221 + static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio) 219 222 { 220 223 struct cgroup_subsys_state *css; 221 224 struct mem_cgroup *memcg; 222 225 223 - memcg = page_memcg(page); 226 + memcg = folio_memcg(folio); 224 227 if (!memcg) 225 228 return; 226 229 ··· 230 233 rcu_read_unlock(); 231 234 } 232 235 #else 233 - #define bio_associate_blkg_from_page(bio, page) do { } while (0) 236 + #define bio_associate_blkg_from_page(bio, folio) do { } while (0) 234 237 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */ 235 238 236 239 struct swap_iocb { ··· 280 283 } 281 284 } else { 282 285 for (p = 0; p < sio->pages; p++) 283 - count_swpout_vm_event(sio->bvec[p].bv_page); 286 + count_swpout_vm_event(page_folio(sio->bvec[p].bv_page)); 284 287 } 285 288 286 289 for (p = 0; p < sio->pages; p++) ··· 331 334 { 332 335 struct bio_vec bv; 333 336 struct bio bio; 337 + struct folio *folio = page_folio(page); 334 338 335 339 bio_init(&bio, sis->bdev, &bv, 1, 336 340 REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc)); 337 341 bio.bi_iter.bi_sector = swap_page_sector(page); 338 342 __bio_add_page(&bio, page, thp_size(page), 0); 339 343 340 - bio_associate_blkg_from_page(&bio, page); 341 - 
count_swpout_vm_event(page); 344 + bio_associate_blkg_from_page(&bio, folio); 345 + count_swpout_vm_event(folio); 342 346 343 - set_page_writeback(page); 344 - unlock_page(page); 347 + folio_start_writeback(folio); 348 + folio_unlock(folio); 345 349 346 350 submit_bio_wait(&bio); 347 351 __end_swap_bio_write(&bio); ··· 352 354 struct writeback_control *wbc, struct swap_info_struct *sis) 353 355 { 354 356 struct bio *bio; 357 + struct folio *folio = page_folio(page); 355 358 356 359 bio = bio_alloc(sis->bdev, 1, 357 360 REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc), ··· 361 362 bio->bi_end_io = end_swap_bio_write; 362 363 __bio_add_page(bio, page, thp_size(page), 0); 363 364 364 - bio_associate_blkg_from_page(bio, page); 365 - count_swpout_vm_event(page); 366 - set_page_writeback(page); 367 - unlock_page(page); 365 + bio_associate_blkg_from_page(bio, folio); 366 + count_swpout_vm_event(folio); 367 + folio_start_writeback(folio); 368 + folio_unlock(folio); 368 369 submit_bio(bio); 369 370 } 370 371 ··· 405 406 406 407 if (ret == sio->len) { 407 408 for (p = 0; p < sio->pages; p++) { 408 - struct page *page = sio->bvec[p].bv_page; 409 + struct folio *folio = page_folio(sio->bvec[p].bv_page); 409 410 410 - SetPageUptodate(page); 411 - unlock_page(page); 411 + folio_mark_uptodate(folio); 412 + folio_unlock(folio); 412 413 } 413 414 count_vm_events(PSWPIN, sio->pages); 414 415 } else { 415 416 for (p = 0; p < sio->pages; p++) { 416 - struct page *page = sio->bvec[p].bv_page; 417 + struct folio *folio = page_folio(sio->bvec[p].bv_page); 417 418 418 - SetPageError(page); 419 - ClearPageUptodate(page); 420 - unlock_page(page); 419 + folio_unlock(folio); 421 420 } 422 421 pr_alert_ratelimited("Read-error on swap-device\n"); 423 422 } ··· 492 495 493 496 void swap_readpage(struct page *page, bool synchronous, struct swap_iocb **plug) 494 497 { 498 + struct folio *folio = page_folio(page); 495 499 struct swap_info_struct *sis = page_swap_info(page); 496 - bool workingset = 
PageWorkingset(page); 500 + bool workingset = folio_test_workingset(folio); 497 501 unsigned long pflags; 498 502 bool in_thrashing; 499 503 500 - VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page); 501 - VM_BUG_ON_PAGE(!PageLocked(page), page); 502 - VM_BUG_ON_PAGE(PageUptodate(page), page); 504 + VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio); 505 + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); 506 + VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio); 503 507 504 508 /* 505 509 * Count submission time as memory stall and delay. When the device ··· 513 515 } 514 516 delayacct_swapin_start(); 515 517 516 - if (frontswap_load(page) == 0) { 517 - SetPageUptodate(page); 518 - unlock_page(page); 518 + if (zswap_load(folio)) { 519 + folio_mark_uptodate(folio); 520 + folio_unlock(folio); 519 521 } else if (data_race(sis->flags & SWP_FS_OPS)) { 520 522 swap_readpage_fs(page, plug); 521 523 } else if (synchronous || (sis->flags & SWP_SYNCHRONOUS_IO)) {
+4 -4
mm/page_isolation.c
··· 79 79 * handle each tail page individually in migration. 80 80 */ 81 81 if (PageHuge(page) || PageTransCompound(page)) { 82 - struct page *head = compound_head(page); 82 + struct folio *folio = page_folio(page); 83 83 unsigned int skip_pages; 84 84 85 85 if (PageHuge(page)) { 86 - if (!hugepage_migration_supported(page_hstate(head))) 86 + if (!hugepage_migration_supported(folio_hstate(folio))) 87 87 return page; 88 - } else if (!PageLRU(head) && !__PageMovable(head)) { 88 + } else if (!folio_test_lru(folio) && !__folio_test_movable(folio)) { 89 89 return page; 90 90 } 91 91 92 - skip_pages = compound_nr(head) - (page - head); 92 + skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page); 93 93 pfn += skip_pages - 1; 94 94 continue; 95 95 }
+1 -1
mm/page_owner.c
··· 104 104 105 105 static inline struct page_owner *get_page_owner(struct page_ext *page_ext) 106 106 { 107 - return (void *)page_ext + page_owner_ops.offset; 107 + return page_ext_data(page_ext, &page_owner_ops); 108 108 } 109 109 110 110 static noinline depot_stack_handle_t save_stack(gfp_t flags)
-1
mm/page_poison.c
··· 4 4 #include <linux/mm.h> 5 5 #include <linux/mmdebug.h> 6 6 #include <linux/highmem.h> 7 - #include <linux/page_ext.h> 8 7 #include <linux/poison.h> 9 8 #include <linux/ratelimit.h> 10 9 #include <linux/kasan.h>
+25 -37
mm/page_table_check.c
··· 51 51 static struct page_table_check *get_page_table_check(struct page_ext *page_ext) 52 52 { 53 53 BUG_ON(!page_ext); 54 - return (void *)(page_ext) + page_table_check_ops.offset; 54 + return page_ext_data(page_ext, &page_table_check_ops); 55 55 } 56 56 57 57 /* 58 58 * An entry is removed from the page table, decrement the counters for that page 59 59 * verify that it is of correct type and counters do not become negative. 60 60 */ 61 - static void page_table_check_clear(struct mm_struct *mm, unsigned long addr, 62 - unsigned long pfn, unsigned long pgcnt) 61 + static void page_table_check_clear(unsigned long pfn, unsigned long pgcnt) 63 62 { 64 63 struct page_ext *page_ext; 65 64 struct page *page; ··· 94 95 * verify that it is of correct type and is not being mapped with a different 95 96 * type to a different process. 96 97 */ 97 - static void page_table_check_set(struct mm_struct *mm, unsigned long addr, 98 - unsigned long pfn, unsigned long pgcnt, 98 + static void page_table_check_set(unsigned long pfn, unsigned long pgcnt, 99 99 bool rw) 100 100 { 101 101 struct page_ext *page_ext; ··· 149 151 page_ext_put(page_ext); 150 152 } 151 153 152 - void __page_table_check_pte_clear(struct mm_struct *mm, unsigned long addr, 153 - pte_t pte) 154 + void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte) 154 155 { 155 156 if (&init_mm == mm) 156 157 return; 157 158 158 159 if (pte_user_accessible_page(pte)) { 159 - page_table_check_clear(mm, addr, pte_pfn(pte), 160 - PAGE_SIZE >> PAGE_SHIFT); 160 + page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT); 161 161 } 162 162 } 163 163 EXPORT_SYMBOL(__page_table_check_pte_clear); 164 164 165 - void __page_table_check_pmd_clear(struct mm_struct *mm, unsigned long addr, 166 - pmd_t pmd) 165 + void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd) 167 166 { 168 167 if (&init_mm == mm) 169 168 return; 170 169 171 170 if (pmd_user_accessible_page(pmd)) { 172 - page_table_check_clear(mm, addr, 
pmd_pfn(pmd), 173 - PMD_SIZE >> PAGE_SHIFT); 171 + page_table_check_clear(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT); 174 172 } 175 173 } 176 174 EXPORT_SYMBOL(__page_table_check_pmd_clear); 177 175 178 - void __page_table_check_pud_clear(struct mm_struct *mm, unsigned long addr, 179 - pud_t pud) 176 + void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud) 180 177 { 181 178 if (&init_mm == mm) 182 179 return; 183 180 184 181 if (pud_user_accessible_page(pud)) { 185 - page_table_check_clear(mm, addr, pud_pfn(pud), 186 - PUD_SIZE >> PAGE_SHIFT); 182 + page_table_check_clear(pud_pfn(pud), PUD_SIZE >> PAGE_SHIFT); 187 183 } 188 184 } 189 185 EXPORT_SYMBOL(__page_table_check_pud_clear); 190 186 191 - void __page_table_check_pte_set(struct mm_struct *mm, unsigned long addr, 192 - pte_t *ptep, pte_t pte) 187 + void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte, 188 + unsigned int nr) 193 189 { 190 + unsigned int i; 191 + 194 192 if (&init_mm == mm) 195 193 return; 196 194 197 - __page_table_check_pte_clear(mm, addr, ptep_get(ptep)); 198 - if (pte_user_accessible_page(pte)) { 199 - page_table_check_set(mm, addr, pte_pfn(pte), 200 - PAGE_SIZE >> PAGE_SHIFT, 201 - pte_write(pte)); 202 - } 195 + for (i = 0; i < nr; i++) 196 + __page_table_check_pte_clear(mm, ptep_get(ptep + i)); 197 + if (pte_user_accessible_page(pte)) 198 + page_table_check_set(pte_pfn(pte), nr, pte_write(pte)); 203 199 } 204 - EXPORT_SYMBOL(__page_table_check_pte_set); 200 + EXPORT_SYMBOL(__page_table_check_ptes_set); 205 201 206 - void __page_table_check_pmd_set(struct mm_struct *mm, unsigned long addr, 207 - pmd_t *pmdp, pmd_t pmd) 202 + void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd) 208 203 { 209 204 if (&init_mm == mm) 210 205 return; 211 206 212 - __page_table_check_pmd_clear(mm, addr, *pmdp); 207 + __page_table_check_pmd_clear(mm, *pmdp); 213 208 if (pmd_user_accessible_page(pmd)) { 214 - page_table_check_set(mm, addr, pmd_pfn(pmd), 215 
- PMD_SIZE >> PAGE_SHIFT, 209 + page_table_check_set(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT, 216 210 pmd_write(pmd)); 217 211 } 218 212 } 219 213 EXPORT_SYMBOL(__page_table_check_pmd_set); 220 214 221 - void __page_table_check_pud_set(struct mm_struct *mm, unsigned long addr, 222 - pud_t *pudp, pud_t pud) 215 + void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud) 223 216 { 224 217 if (&init_mm == mm) 225 218 return; 226 219 227 - __page_table_check_pud_clear(mm, addr, *pudp); 220 + __page_table_check_pud_clear(mm, *pudp); 228 221 if (pud_user_accessible_page(pud)) { 229 - page_table_check_set(mm, addr, pud_pfn(pud), 230 - PUD_SIZE >> PAGE_SHIFT, 222 + page_table_check_set(pud_pfn(pud), PUD_SIZE >> PAGE_SHIFT, 231 223 pud_write(pud)); 232 224 } 233 225 } ··· 237 249 if (WARN_ON(!ptep)) 238 250 return; 239 251 for (i = 0; i < PTRS_PER_PTE; i++) { 240 - __page_table_check_pte_clear(mm, addr, ptep_get(ptep)); 252 + __page_table_check_pte_clear(mm, ptep_get(ptep)); 241 253 addr += PAGE_SIZE; 242 254 ptep++; 243 255 }
+7 -5
mm/page_vma_mapped.c
··· 73 73 } 74 74 75 75 /** 76 - * check_pte - check if @pvmw->page is mapped at the @pvmw->pte 77 - * @pvmw: page_vma_mapped_walk struct, includes a pair pte and page for checking 76 + * check_pte - check if [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) is 77 + * mapped at the @pvmw->pte 78 + * @pvmw: page_vma_mapped_walk struct, includes a pair pte and pfn range 79 + * for checking 78 80 * 79 - * page_vma_mapped_walk() found a place where @pvmw->page is *potentially* 81 + * page_vma_mapped_walk() found a place where pfn range is *potentially* 80 82 * mapped. check_pte() has to validate this. 81 83 * 82 84 * pvmw->pte may point to empty PTE, swap PTE or PTE pointing to 83 85 * arbitrary page. 84 86 * 85 87 * If PVMW_MIGRATION flag is set, returns true if @pvmw->pte contains migration 86 - * entry that points to @pvmw->page or any subpage in case of THP. 88 + * entry that points to [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) 87 89 * 88 90 * If PVMW_MIGRATION flag is not set, returns true if pvmw->pte points to 89 - * pvmw->page or any subpage in case of THP. 91 + * [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) 90 92 * 91 93 * Otherwise, return false. 92 94 *
+95 -2
mm/pgtable-generic.c
··· 13 13 #include <linux/swap.h> 14 14 #include <linux/swapops.h> 15 15 #include <linux/mm_inline.h> 16 + #include <asm/pgalloc.h> 16 17 #include <asm/tlb.h> 17 18 18 19 /* ··· 231 230 return pmd; 232 231 } 233 232 #endif 233 + 234 + /* arch define pte_free_defer in asm/pgalloc.h for its own implementation */ 235 + #ifndef pte_free_defer 236 + static void pte_free_now(struct rcu_head *head) 237 + { 238 + struct page *page; 239 + 240 + page = container_of(head, struct page, rcu_head); 241 + pte_free(NULL /* mm not passed and not used */, (pgtable_t)page); 242 + } 243 + 244 + void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) 245 + { 246 + struct page *page; 247 + 248 + page = pgtable; 249 + call_rcu(&page->rcu_head, pte_free_now); 250 + } 251 + #endif /* pte_free_defer */ 234 252 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 253 + 254 + #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \ 255 + (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU)) 256 + /* 257 + * See the comment above ptep_get_lockless() in include/linux/pgtable.h: 258 + * the barriers in pmdp_get_lockless() cannot guarantee that the value in 259 + * pmd_high actually belongs with the value in pmd_low; but holding interrupts 260 + * off blocks the TLB flush between present updates, which guarantees that a 261 + * successful __pte_offset_map() points to a page from matched halves. 
262 + */ 263 + static unsigned long pmdp_get_lockless_start(void) 264 + { 265 + unsigned long irqflags; 266 + 267 + local_irq_save(irqflags); 268 + return irqflags; 269 + } 270 + static void pmdp_get_lockless_end(unsigned long irqflags) 271 + { 272 + local_irq_restore(irqflags); 273 + } 274 + #else 275 + static unsigned long pmdp_get_lockless_start(void) { return 0; } 276 + static void pmdp_get_lockless_end(unsigned long irqflags) { } 277 + #endif 235 278 236 279 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) 237 280 { 281 + unsigned long irqflags; 238 282 pmd_t pmdval; 239 283 240 - /* rcu_read_lock() to be added later */ 284 + rcu_read_lock(); 285 + irqflags = pmdp_get_lockless_start(); 241 286 pmdval = pmdp_get_lockless(pmd); 287 + pmdp_get_lockless_end(irqflags); 288 + 242 289 if (pmdvalp) 243 290 *pmdvalp = pmdval; 244 291 if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval))) ··· 299 250 } 300 251 return __pte_map(&pmdval, addr); 301 252 nomap: 302 - /* rcu_read_unlock() to be added later */ 253 + rcu_read_unlock(); 303 254 return NULL; 304 255 } 305 256 ··· 315 266 return pte; 316 267 } 317 268 269 + /* 270 + * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation 271 + * __pte_offset_map_lock() below, is usually called with the pmd pointer for 272 + * addr, reached by walking down the mm's pgd, p4d, pud for addr: either while 273 + * holding mmap_lock or vma lock for read or for write; or in truncate or rmap 274 + * context, while holding file's i_mmap_lock or anon_vma lock for read (or for 275 + * write). In a few cases, it may be used with pmd pointing to a pmd_t already 276 + * copied to or constructed on the stack. 
277 + * 278 + * When successful, it returns the pte pointer for addr, with its page table 279 + * kmapped if necessary (when CONFIG_HIGHPTE), and locked against concurrent 280 + * modification by software, with a pointer to that spinlock in ptlp (in some 281 + * configs mm->page_table_lock, in SPLIT_PTLOCK configs a spinlock in table's 282 + * struct page). pte_unmap_unlock(pte, ptl) to unlock and unmap afterwards. 283 + * 284 + * But it is unsuccessful, returning NULL with *ptlp unchanged, if there is no 285 + * page table at *pmd: if, for example, the page table has just been removed, 286 + * or replaced by the huge pmd of a THP. (When successful, *pmd is rechecked 287 + * after acquiring the ptlock, and retried internally if it changed: so that a 288 + * page table can be safely removed or replaced by THP while holding its lock.) 289 + * 290 + * pte_offset_map(pmd, addr), and its internal helper __pte_offset_map() above, 291 + * just returns the pte pointer for addr, its page table kmapped if necessary; 292 + * or NULL if there is no page table at *pmd. It does not attempt to lock the 293 + * page table, so cannot normally be used when the page table is to be updated, 294 + * or when entries read must be stable. But it does take rcu_read_lock(): so 295 + * that even when page table is racily removed, it remains a valid though empty 296 + * and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unlock()s 297 + * afterwards. 298 + * 299 + * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map(); 300 + * but when successful, it also outputs a pointer to the spinlock in ptlp - as 301 + * pte_offset_map_lock() does, but in this case without locking it. This helps 302 + * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time 303 + * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock 304 + * pointer for the page table that it returns. 
In principle, the caller should 305 + * recheck *pmd once the lock is taken; in practice, no callsite needs that - 306 + * either the mmap_lock for write, or pte_same() check on contents, is enough. 307 + * 308 + * Note that free_pgtables(), used after unmapping detached vmas, or when 309 + * exiting the whole mm, does not take page table lock before freeing a page 310 + * table, and may not use RCU at all: "outsiders" like khugepaged should avoid 311 + * pte_offset_map() and co once the vma is detached from mm or mm_users is zero. 312 + */ 318 313 pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, 319 314 unsigned long addr, spinlock_t **ptlp) 320 315 {
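The fallback pte_free_defer() added in this hunk recovers the struct page from its embedded rcu_head with container_of() and hands the actual free to call_rcu(), so the page table is only freed after a grace period. A user-space sketch of that deferred-free pattern (cb_head, fake_page, call_cb and grace_period are hypothetical stand-ins; the "grace period" is simulated, not real RCU):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Same macro the kernel uses: recover the enclosing struct from a
 * pointer to one of its embedded members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct cb_head {                      /* stand-in for struct rcu_head */
    void (*func)(struct cb_head *);
};

struct fake_page {                    /* stand-in for struct page */
    int order;
    struct cb_head rcu_head;          /* embedded, as in struct page */
};

static struct cb_head *pending;       /* one-slot "callback queue" */
static int freed;                     /* deferred frees that have run */

static void page_free_now(struct cb_head *head)
{
    /* Recover the page exactly as pte_free_now() does. */
    struct fake_page *page = container_of(head, struct fake_page, rcu_head);

    free(page);
    freed++;
}

/* Like call_rcu(): queue the callback; it must not run until the
 * simulated grace period below has elapsed. */
static void call_cb(struct cb_head *head, void (*func)(struct cb_head *))
{
    head->func = func;
    pending = head;
}

static void grace_period(void)        /* stand-in for an RCU grace period */
{
    if (pending) {
        pending->func(pending);
        pending = NULL;
    }
}
```

This is why readers under rcu_read_lock() in __pte_offset_map() can keep using a page table that is concurrently removed: it stays allocated, though empty, until they drop the read lock.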
+67 -78
mm/rmap.c
··· 642 642 #define TLB_FLUSH_BATCH_PENDING_LARGE \ 643 643 (TLB_FLUSH_BATCH_PENDING_MASK / 2) 644 644 645 - static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval) 645 + static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval, 646 + unsigned long uaddr) 646 647 { 647 648 struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc; 648 649 int batch; ··· 652 651 if (!pte_accessible(mm, pteval)) 653 652 return; 654 653 655 - arch_tlbbatch_add_mm(&tlb_ubc->arch, mm); 654 + arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr); 656 655 tlb_ubc->flush_required = true; 657 656 658 657 /* ··· 689 688 */ 690 689 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) 691 690 { 692 - bool should_defer = false; 693 - 694 691 if (!(flags & TTU_BATCH_FLUSH)) 695 692 return false; 696 693 697 - /* If remote CPUs need to be flushed then defer batch the flush */ 698 - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) 699 - should_defer = true; 700 - put_cpu(); 701 - 702 - return should_defer; 694 + return arch_tlbbatch_should_defer(mm); 703 695 } 704 696 705 697 /* ··· 717 723 int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT; 718 724 719 725 if (pending != flushed) { 720 - flush_tlb_mm(mm); 726 + arch_flush_tlb_batched_pending(mm); 721 727 /* 722 728 * If the new TLB flushing is pending during flushing, leave 723 729 * mm->tlb_flush_batched as is, to avoid losing flushing. ··· 727 733 } 728 734 } 729 735 #else 730 - static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval) 736 + static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval, 737 + unsigned long uaddr) 731 738 { 732 739 } 733 740 ··· 985 990 #endif 986 991 } 987 992 988 - /* 989 - * No need to call mmu_notifier_invalidate_range() as we are 990 - * downgrading page table protection not changing it to point 991 - * to a new page. 
992 - * 993 - * See Documentation/mm/mmu_notifier.rst 994 - */ 995 993 if (ret) 996 994 cleaned++; 997 995 } ··· 1163 1175 1164 1176 /** 1165 1177 * __page_check_anon_rmap - sanity check anonymous rmap addition 1166 - * @page: the page to add the mapping to 1178 + * @folio: The folio containing @page. 1179 + * @page: the page to check the mapping of 1167 1180 * @vma: the vm area in which the mapping is added 1168 1181 * @address: the user virtual address mapped 1169 1182 */ 1170 - static void __page_check_anon_rmap(struct page *page, 1183 + static void __page_check_anon_rmap(struct folio *folio, struct page *page, 1171 1184 struct vm_area_struct *vma, unsigned long address) 1172 1185 { 1173 - struct folio *folio = page_folio(page); 1174 1186 /* 1175 1187 * The page's anon-rmap details (mapping and index) are guaranteed to 1176 1188 * be set up correctly at this point. ··· 1250 1262 __page_set_anon_rmap(folio, page, vma, address, 1251 1263 !!(flags & RMAP_EXCLUSIVE)); 1252 1264 else 1253 - __page_check_anon_rmap(page, vma, address); 1265 + __page_check_anon_rmap(folio, page, vma, address); 1254 1266 } 1255 1267 1256 1268 mlock_vma_folio(folio, vma, compound); ··· 1294 1306 } 1295 1307 1296 1308 /** 1297 - * page_add_file_rmap - add pte mapping to a file page 1298 - * @page: the page to add the mapping to 1309 + * folio_add_file_rmap_range - add pte mapping to page range of a folio 1310 + * @folio: The folio to add the mapping to 1311 + * @page: The first page to add 1312 + * @nr_pages: The number of pages which will be mapped 1299 1313 * @vma: the vm area in which the mapping is added 1300 1314 * @compound: charge the page as compound or small page 1301 1315 * 1316 + * The page range of folio is defined by [first_page, first_page + nr_pages) 1317 + * 1302 1318 * The caller needs to hold the pte lock. 
1303 1319 */ 1304 - void page_add_file_rmap(struct page *page, struct vm_area_struct *vma, 1305 - bool compound) 1320 + void folio_add_file_rmap_range(struct folio *folio, struct page *page, 1321 + unsigned int nr_pages, struct vm_area_struct *vma, 1322 + bool compound) 1306 1323 { 1307 - struct folio *folio = page_folio(page); 1308 1324 atomic_t *mapped = &folio->_nr_pages_mapped; 1309 - int nr = 0, nr_pmdmapped = 0; 1310 - bool first; 1325 + unsigned int nr_pmdmapped = 0, first; 1326 + int nr = 0; 1311 1327 1312 - VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page); 1328 + VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); 1313 1329 1314 1330 /* Is page being mapped by PTE? Is this its first map to be added? */ 1315 1331 if (likely(!compound)) { 1316 - first = atomic_inc_and_test(&page->_mapcount); 1317 - nr = first; 1318 - if (first && folio_test_large(folio)) { 1319 - nr = atomic_inc_return_relaxed(mapped); 1320 - nr = (nr < COMPOUND_MAPPED); 1321 - } 1332 + do { 1333 + first = atomic_inc_and_test(&page->_mapcount); 1334 + if (first && folio_test_large(folio)) { 1335 + first = atomic_inc_return_relaxed(mapped); 1336 + first = (first < COMPOUND_MAPPED); 1337 + } 1338 + 1339 + if (first) 1340 + nr++; 1341 + } while (page++, --nr_pages > 0); 1322 1342 } else if (folio_test_pmd_mappable(folio)) { 1323 1343 /* That test is redundant: it's for safety or to optimize out */ 1324 1344 ··· 1353 1357 __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr); 1354 1358 1355 1359 mlock_vma_folio(folio, vma, compound); 1360 + } 1361 + 1362 + /** 1363 + * page_add_file_rmap - add pte mapping to a file page 1364 + * @page: the page to add the mapping to 1365 + * @vma: the vm area in which the mapping is added 1366 + * @compound: charge the page as compound or small page 1367 + * 1368 + * The caller needs to hold the pte lock. 
1369 + */ 1370 + void page_add_file_rmap(struct page *page, struct vm_area_struct *vma, 1371 + bool compound) 1372 + { 1373 + struct folio *folio = page_folio(page); 1374 + unsigned int nr_pages; 1375 + 1376 + VM_WARN_ON_ONCE_PAGE(compound && !PageTransHuge(page), page); 1377 + 1378 + if (likely(!compound)) 1379 + nr_pages = 1; 1380 + else 1381 + nr_pages = folio_nr_pages(folio); 1382 + 1383 + folio_add_file_rmap_range(folio, page, nr_pages, vma, compound); 1356 1384 } 1357 1385 1358 1386 /** ··· 1574 1554 hugetlb_vma_unlock_write(vma); 1575 1555 flush_tlb_range(vma, 1576 1556 range.start, range.end); 1577 - mmu_notifier_invalidate_range(mm, 1578 - range.start, range.end); 1579 1557 /* 1580 1558 * The ref count of the PMD page was 1581 1559 * dropped which is part of the way map ··· 1604 1586 */ 1605 1587 pteval = ptep_get_and_clear(mm, address, pvmw.pte); 1606 1588 1607 - set_tlb_ubc_flush_pending(mm, pteval); 1589 + set_tlb_ubc_flush_pending(mm, pteval, address); 1608 1590 } else { 1609 1591 pteval = ptep_clear_flush(vma, address, pvmw.pte); 1610 1592 } ··· 1646 1628 * copied pages. 1647 1629 */ 1648 1630 dec_mm_counter(mm, mm_counter(&folio->page)); 1649 - /* We have to invalidate as we cleared the pte */ 1650 - mmu_notifier_invalidate_range(mm, address, 1651 - address + PAGE_SIZE); 1652 1631 } else if (folio_test_anon(folio)) { 1653 - swp_entry_t entry = { .val = page_private(subpage) }; 1632 + swp_entry_t entry = page_swap_entry(subpage); 1654 1633 pte_t swp_pte; 1655 1634 /* 1656 1635 * Store the swap location in the pte. 
··· 1657 1642 folio_test_swapcache(folio))) { 1658 1643 WARN_ON_ONCE(1); 1659 1644 ret = false; 1660 - /* We have to invalidate as we cleared the pte */ 1661 - mmu_notifier_invalidate_range(mm, address, 1662 - address + PAGE_SIZE); 1663 1645 page_vma_mapped_walk_done(&pvmw); 1664 1646 break; 1665 1647 } ··· 1687 1675 */ 1688 1676 if (ref_count == 1 + map_count && 1689 1677 !folio_test_dirty(folio)) { 1690 - /* Invalidate as we cleared the pte */ 1691 - mmu_notifier_invalidate_range(mm, 1692 - address, address + PAGE_SIZE); 1693 1678 dec_mm_counter(mm, MM_ANONPAGES); 1694 1679 goto discard; 1695 1680 } ··· 1741 1732 if (pte_uffd_wp(pteval)) 1742 1733 swp_pte = pte_swp_mkuffd_wp(swp_pte); 1743 1734 set_pte_at(mm, address, pvmw.pte, swp_pte); 1744 - /* Invalidate as we cleared the pte */ 1745 - mmu_notifier_invalidate_range(mm, address, 1746 - address + PAGE_SIZE); 1747 1735 } else { 1748 1736 /* 1749 1737 * This is a locked file-backed folio, ··· 1756 1750 dec_mm_counter(mm, mm_counter_file(&folio->page)); 1757 1751 } 1758 1752 discard: 1759 - /* 1760 - * No need to call mmu_notifier_invalidate_range() it has be 1761 - * done above for all cases requiring it to happen under page 1762 - * table lock before mmu_notifier_invalidate_range_end() 1763 - * 1764 - * See Documentation/mm/mmu_notifier.rst 1765 - */ 1766 1753 page_remove_rmap(subpage, vma, folio_test_hugetlb(folio)); 1767 1754 if (vma->vm_flags & VM_LOCKED) 1768 1755 mlock_drain_local(); ··· 1934 1935 hugetlb_vma_unlock_write(vma); 1935 1936 flush_tlb_range(vma, 1936 1937 range.start, range.end); 1937 - mmu_notifier_invalidate_range(mm, 1938 - range.start, range.end); 1939 1938 1940 1939 /* 1941 1940 * The ref count of the PMD page was ··· 1966 1969 */ 1967 1970 pteval = ptep_get_and_clear(mm, address, pvmw.pte); 1968 1971 1969 - set_tlb_ubc_flush_pending(mm, pteval); 1972 + set_tlb_ubc_flush_pending(mm, pteval, address); 1970 1973 } else { 1971 1974 pteval = ptep_clear_flush(vma, address, pvmw.pte); 1972 1975 
} ··· 2038 2041 * copied pages. 2039 2042 */ 2040 2043 dec_mm_counter(mm, mm_counter(&folio->page)); 2041 - /* We have to invalidate as we cleared the pte */ 2042 - mmu_notifier_invalidate_range(mm, address, 2043 - address + PAGE_SIZE); 2044 2044 } else { 2045 2045 swp_entry_t entry; 2046 2046 pte_t swp_pte; ··· 2101 2107 */ 2102 2108 } 2103 2109 2104 - /* 2105 - * No need to call mmu_notifier_invalidate_range() it has be 2106 - * done above for all cases requiring it to happen under page 2107 - * table lock before mmu_notifier_invalidate_range_end() 2108 - * 2109 - * See Documentation/mm/mmu_notifier.rst 2110 - */ 2111 2110 page_remove_rmap(subpage, vma, folio_test_hugetlb(folio)); 2112 2111 if (vma->vm_flags & VM_LOCKED) 2113 2112 mlock_drain_local(); ··· 2389 2402 /* 2390 2403 * rmap_walk_anon - do something to anonymous page using the object-based 2391 2404 * rmap method 2392 - * @page: the page to be handled 2405 + * @folio: the folio to be handled 2393 2406 * @rwc: control variable according to each walk type 2407 + * @locked: caller holds relevant rmap lock 2394 2408 * 2395 - * Find all the mappings of a page using the mapping pointer and the vma chains 2396 - * contained in the anon_vma struct it points to. 2409 + * Find all the mappings of a folio using the mapping pointer and the vma 2410 + * chains contained in the anon_vma struct it points to. 
2397 2411 */ 2398 2412 static void rmap_walk_anon(struct folio *folio, 2399 2413 struct rmap_walk_control *rwc, bool locked) ··· 2438 2450 2439 2451 /* 2440 2452 * rmap_walk_file - do something to file page using the object-based rmap method 2441 - * @page: the page to be handled 2453 + * @folio: the folio to be handled 2442 2454 * @rwc: control variable according to each walk type 2455 + * @locked: caller holds relevant rmap lock 2443 2456 * 2444 - * Find all the mappings of a page using the mapping pointer and the vma chains 2457 + * Find all the mappings of a folio using the mapping pointer and the vma chains 2445 2458 * contained in the address_space struct it points to. 2446 2459 */ 2447 2460 static void rmap_walk_file(struct folio *folio,
+8 -6
mm/secretmem.c
··· 55 55 gfp_t gfp = vmf->gfp_mask; 56 56 unsigned long addr; 57 57 struct page *page; 58 + struct folio *folio; 58 59 vm_fault_t ret; 59 60 int err; 60 61 ··· 67 66 retry: 68 67 page = find_lock_page(mapping, offset); 69 68 if (!page) { 70 - page = alloc_page(gfp | __GFP_ZERO); 71 - if (!page) { 69 + folio = folio_alloc(gfp | __GFP_ZERO, 0); 70 + if (!folio) { 72 71 ret = VM_FAULT_OOM; 73 72 goto out; 74 73 } 75 74 75 + page = &folio->page; 76 76 err = set_direct_map_invalid_noflush(page); 77 77 if (err) { 78 - put_page(page); 78 + folio_put(folio); 79 79 ret = vmf_error(err); 80 80 goto out; 81 81 } 82 82 83 - __SetPageUptodate(page); 84 - err = add_to_page_cache_lru(page, mapping, offset, gfp); 83 + __folio_mark_uptodate(folio); 84 + err = filemap_add_folio(mapping, folio, offset, gfp); 85 85 if (unlikely(err)) { 86 - put_page(page); 86 + folio_put(folio); 87 87 /* 88 88 * If a split of large page was required, it 89 89 * already happened when we marked the page invalid
+7 -8
mm/shmem.c
··· 1038 1038 same_folio = lend < folio_pos(folio) + folio_size(folio); 1039 1039 folio_mark_dirty(folio); 1040 1040 if (!truncate_inode_partial_folio(folio, lstart, lend)) { 1041 - start = folio->index + folio_nr_pages(folio); 1041 + start = folio_next_index(folio); 1042 1042 if (same_folio) 1043 1043 end = folio->index; 1044 1044 } ··· 1720 1720 int error; 1721 1721 1722 1722 old = *foliop; 1723 - entry = folio_swap_entry(old); 1723 + entry = old->swap; 1724 1724 swap_index = swp_offset(entry); 1725 1725 swap_mapping = swap_address_space(entry); 1726 1726 ··· 1741 1741 __folio_set_locked(new); 1742 1742 __folio_set_swapbacked(new); 1743 1743 folio_mark_uptodate(new); 1744 - folio_set_swap_entry(new, entry); 1744 + new->swap = entry; 1745 1745 folio_set_swapcache(new); 1746 1746 1747 1747 /* ··· 1786 1786 swp_entry_t swapin_error; 1787 1787 void *old; 1788 1788 1789 - swapin_error = make_swapin_error_entry(); 1789 + swapin_error = make_poisoned_swp_entry(); 1790 1790 old = xa_cmpxchg_irq(&mapping->i_pages, index, 1791 1791 swp_to_radix_entry(swap), 1792 1792 swp_to_radix_entry(swapin_error), 0); ··· 1827 1827 swap = radix_to_swp_entry(*foliop); 1828 1828 *foliop = NULL; 1829 1829 1830 - if (is_swapin_error_entry(swap)) 1830 + if (is_poisoned_swp_entry(swap)) 1831 1831 return -EIO; 1832 1832 1833 1833 si = get_swap_device(swap); ··· 1858 1858 /* We have to do this with folio locked to prevent races */ 1859 1859 folio_lock(folio); 1860 1860 if (!folio_test_swapcache(folio) || 1861 - folio_swap_entry(folio).val != swap.val || 1861 + folio->swap.val != swap.val || 1862 1862 !shmem_confirm_swap(mapping, index, swap)) { 1863 1863 error = -EEXIST; 1864 1864 goto unlock; ··· 4200 4200 struct mempolicy *mpol; 4201 4201 4202 4202 if (sbinfo->max_blocks != shmem_default_max_blocks()) 4203 - seq_printf(seq, ",size=%luk", 4204 - sbinfo->max_blocks << (PAGE_SHIFT - 10)); 4203 + seq_printf(seq, ",size=%luk", K(sbinfo->max_blocks)); 4205 4204 if (sbinfo->max_inodes != 
shmem_default_max_inodes()) 4206 4205 seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes); 4207 4206 if (sbinfo->mode != (0777 | S_ISVTX))
+5 -5
mm/show_mem.c
··· 186 186 * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's 187 187 * cpuset. 188 188 */ 189 - void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) 189 + static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) 190 190 { 191 191 unsigned long free_pcp = 0; 192 192 int cpu, nid; ··· 251 251 " writeback:%lukB" 252 252 " shmem:%lukB" 253 253 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 254 - " shmem_thp: %lukB" 255 - " shmem_pmdmapped: %lukB" 256 - " anon_thp: %lukB" 254 + " shmem_thp:%lukB" 255 + " shmem_pmdmapped:%lukB" 256 + " anon_thp:%lukB" 257 257 #endif 258 258 " writeback_tmp:%lukB" 259 259 " kernel_stack:%lukB" ··· 406 406 struct zone *zone; 407 407 408 408 printk("Mem-Info:\n"); 409 - __show_free_areas(filter, nodemask, max_zone_idx); 409 + show_free_areas(filter, nodemask, max_zone_idx); 410 410 411 411 for_each_populated_zone(zone) { 412 412
+3
mm/sparse-vmemmap.c
··· 358 358 return 0; 359 359 } 360 360 361 + #ifndef vmemmap_populate_compound_pages 361 362 /* 362 363 * For compound pages bigger than section size (e.g. x86 1G compound 363 364 * pages with 2M subsection size) fill the rest of sections as tail ··· 446 445 447 446 return 0; 448 447 } 448 + 449 + #endif 449 450 450 451 struct page * __meminit __populate_section_memmap(unsigned long pfn, 451 452 unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
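Wrapping the generic vmemmap_populate_compound_pages() in #ifndef follows the kernel's usual override convention, the same one the pte_free_defer hunk uses: an architecture that supplies its own implementation also defines a same-named macro in its headers, so the generic fallback compiles out. A compressed sketch of the convention (populate_compound and both implementations are invented names):

```c
#include <assert.h>

/* "Arch header": supply an implementation and announce it with a
 * same-named macro, so generic code can test for its presence. */
static int arch_populate(int nr) { return nr * 2; }
#define populate_compound arch_populate

/* "Generic code": build the fallback only when no override exists.
 * With the override above in place, this block compiles out. */
#ifndef populate_compound
static int generic_populate(int nr) { return nr; }
#define populate_compound generic_populate
#endif
```

The test for an override happens at preprocessing time, so the unused fallback costs nothing in the final image.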
+1 -2
mm/sparse.c
··· 172 172 173 173 #define for_each_present_section_nr(start, section_nr) \ 174 174 for (section_nr = next_present_section_nr(start-1); \ 175 - ((section_nr != -1) && \ 176 - (section_nr <= __highest_present_section_nr)); \ 175 + section_nr != -1; \ 177 176 section_nr = next_present_section_nr(section_nr)) 178 177 179 178 static inline unsigned long first_present_section_nr(void)
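The simplified for_each_present_section_nr() leans on next_present_section_nr() returning -1 once past the last present section, which made the extra comparison against __highest_present_section_nr redundant. A toy model of that sentinel-terminated walk (next_present, the present[] array and NR are invented for the sketch):

```c
#include <assert.h>

#define NR 8
static const int present[NR] = {0, 1, 0, 1, 1, 0, 0, 1};

/* Smallest present index greater than x, or -1 when none remain. */
static int next_present(int x)
{
    for (int i = x + 1; i < NR; i++)
        if (present[i])
            return i;
    return -1;
}

/* The -1 sentinel alone terminates the walk; no upper-bound test is
 * needed, mirroring the simplified for_each_present_section_nr(). */
#define for_each_present(nr) \
    for (nr = next_present(-1); nr != -1; nr = next_present(nr))
```

Dropping the second condition is safe because the sentinel is returned exactly when the upper bound would have been exceeded, so both tests always agreed.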
-1
mm/swap.h
··· 46 46 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, 47 47 struct vm_area_struct *vma, 48 48 unsigned long addr, 49 - bool do_poll, 50 49 struct swap_iocb **plug); 51 50 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, 52 51 struct vm_area_struct *vma,
+10 -13
mm/swap_state.c
··· 63 63 void show_swap_cache_info(void) 64 64 { 65 65 printk("%lu pages in swap cache\n", total_swapcache_pages()); 66 - printk("Free swap = %ldkB\n", 67 - get_nr_swap_pages() << (PAGE_SHIFT - 10)); 68 - printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); 66 + printk("Free swap = %ldkB\n", K(get_nr_swap_pages())); 67 + printk("Total swap = %lukB\n", K(total_swap_pages)); 69 68 } 70 69 71 70 void *get_shadow_from_swap_cache(swp_entry_t entry) ··· 100 101 101 102 folio_ref_add(folio, nr); 102 103 folio_set_swapcache(folio); 104 + folio->swap = entry; 103 105 104 106 do { 105 107 xas_lock_irq(&xas); ··· 114 114 if (shadowp) 115 115 *shadowp = old; 116 116 } 117 - set_page_private(folio_page(folio, i), entry.val + i); 118 117 xas_store(&xas, folio); 119 118 xas_next(&xas); 120 119 } ··· 154 155 for (i = 0; i < nr; i++) { 155 156 void *entry = xas_store(&xas, shadow); 156 157 VM_BUG_ON_PAGE(entry != folio, entry); 157 - set_page_private(folio_page(folio, i), 0); 158 158 xas_next(&xas); 159 159 } 160 + folio->swap.val = 0; 160 161 folio_clear_swapcache(folio); 161 162 address_space->nrpages -= nr; 162 163 __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr); ··· 232 233 */ 233 234 void delete_from_swap_cache(struct folio *folio) 234 235 { 235 - swp_entry_t entry = folio_swap_entry(folio); 236 + swp_entry_t entry = folio->swap; 236 237 struct address_space *address_space = swap_address_space(entry); 237 238 238 239 xa_lock_irq(&address_space->i_pages); ··· 526 527 */ 527 528 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, 528 529 struct vm_area_struct *vma, 529 - unsigned long addr, bool do_poll, 530 - struct swap_iocb **plug) 530 + unsigned long addr, struct swap_iocb **plug) 531 531 { 532 532 bool page_was_allocated; 533 533 struct page *retpage = __read_swap_cache_async(entry, gfp_mask, 534 534 vma, addr, &page_was_allocated); 535 535 536 536 if (page_was_allocated) 537 - swap_readpage(retpage, do_poll, plug); 537 + 
swap_readpage(retpage, false, plug); 538 538 539 539 return retpage; 540 540 } ··· 628 630 struct swap_info_struct *si = swp_swap_info(entry); 629 631 struct blk_plug plug; 630 632 struct swap_iocb *splug = NULL; 631 - bool do_poll = true, page_allocated; 633 + bool page_allocated; 632 634 struct vm_area_struct *vma = vmf->vma; 633 635 unsigned long addr = vmf->address; 634 636 ··· 636 638 if (!mask) 637 639 goto skip; 638 640 639 - do_poll = false; 640 641 /* Read a page_cluster sized and aligned cluster around offset. */ 641 642 start_offset = offset & ~mask; 642 643 end_offset = offset | mask; ··· 667 670 lru_add_drain(); /* Push any new pages onto the LRU now */ 668 671 skip: 669 672 /* The page was likely read above, so no need for plugging here */ 670 - return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll, NULL); 673 + return read_swap_cache_async(entry, gfp_mask, vma, addr, NULL); 671 674 } 672 675 673 676 int init_swap_address_space(unsigned int type, unsigned long nr_pages) ··· 835 838 skip: 836 839 /* The page was likely read above, so no need for plugging here */ 837 840 return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address, 838 - ra_info.win == 1, NULL); 841 + NULL); 839 842 } 840 843 841 844 /**
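Several reporting sites in these hunks switch from open-coded `x << (PAGE_SHIFT - 10)` to the K() helper that mm/internal.h provides for converting a page count to kilobytes (the swapfile.c hunk adds the `#include "internal.h"` that makes it visible). A standalone sketch of the conversion (a PAGE_SHIFT of 12, i.e. 4 KiB pages, is assumed):

```c
#include <assert.h>

#define PAGE_SHIFT 12                       /* assume 4 KiB pages */

/* As in mm/internal.h: convert a page count to kilobytes. */
#define K(x) ((x) << (PAGE_SHIFT - 10))
```

With 4 KiB pages, `K(256)` reports 256 pages as 1024 kB, matching what the old shift expression printed while being harder to get wrong at each call site.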
+32 -37
mm/swapfile.c
··· 35 35 #include <linux/memcontrol.h> 36 36 #include <linux/poll.h> 37 37 #include <linux/oom.h> 38 - #include <linux/frontswap.h> 39 38 #include <linux/swapfile.h> 40 39 #include <linux/export.h> 41 40 #include <linux/swap_slots.h> 42 41 #include <linux/sort.h> 43 42 #include <linux/completion.h> 44 43 #include <linux/suspend.h> 44 + #include <linux/zswap.h> 45 45 46 46 #include <asm/tlbflush.h> 47 47 #include <linux/swapops.h> 48 48 #include <linux/swap_cgroup.h> 49 + #include "internal.h" 49 50 #include "swap.h" 50 51 51 52 static bool swap_count_continued(struct swap_info_struct *, pgoff_t, ··· 96 95 static struct plist_head *swap_avail_heads; 97 96 static DEFINE_SPINLOCK(swap_avail_lock); 98 97 99 - struct swap_info_struct *swap_info[MAX_SWAPFILES]; 98 + static struct swap_info_struct *swap_info[MAX_SWAPFILES]; 100 99 101 100 static DEFINE_MUTEX(swapon_mutex); 102 101 ··· 715 714 int nid; 716 715 717 716 spin_lock(&swap_avail_lock); 718 - for_each_node(nid) { 719 - WARN_ON(!plist_node_empty(&p->avail_lists[nid])); 717 + for_each_node(nid) 720 718 plist_add(&p->avail_lists[nid], &swap_avail_heads[nid]); 721 - } 722 719 spin_unlock(&swap_avail_lock); 723 720 } 724 721 ··· 745 746 swap_slot_free_notify = NULL; 746 747 while (offset <= end) { 747 748 arch_swap_invalidate_page(si->type, offset); 748 - frontswap_invalidate_page(si->type, offset); 749 + zswap_invalidate(si->type, offset); 749 750 if (swap_slot_free_notify) 750 751 swap_slot_free_notify(si->bdev, offset); 751 752 offset++; ··· 1536 1537 1537 1538 static bool folio_swapped(struct folio *folio) 1538 1539 { 1539 - swp_entry_t entry = folio_swap_entry(folio); 1540 + swp_entry_t entry = folio->swap; 1540 1541 struct swap_info_struct *si = _swap_info_get(entry); 1541 1542 1542 1543 if (!si) ··· 1772 1773 swp_entry = make_hwpoison_entry(swapcache); 1773 1774 page = swapcache; 1774 1775 } else { 1775 - swp_entry = make_swapin_error_entry(); 1776 + swp_entry = make_poisoned_swp_entry(); 1776 1777 } 1777 1778 
new_pte = swp_entry_to_pte(swp_entry); 1778 1779 ret = 0; 1779 1780 goto setpte; 1780 1781 } 1782 + 1783 + /* 1784 + * Some architectures may have to restore extra metadata to the page 1785 + * when reading from swap. This metadata may be indexed by swap entry 1786 + * so this must be called before swap_free(). 1787 + */ 1788 + arch_swap_restore(entry, page_folio(page)); 1781 1789 1782 1790 /* See do_swap_page() */ 1783 1791 BUG_ON(!PageAnon(page) && PageMappedToDisk(page)); ··· 2336 2330 * swap_info_struct. 2337 2331 */ 2338 2332 plist_add(&p->list, &swap_active_head); 2339 - add_to_avail_list(p); 2333 + 2334 + /* add to available list iff swap device is not full */ 2335 + if (p->highest_bit) 2336 + add_to_avail_list(p); 2340 2337 } 2341 2338 2342 2339 static void enable_swap_info(struct swap_info_struct *p, int prio, 2343 2340 unsigned char *swap_map, 2344 - struct swap_cluster_info *cluster_info, 2345 - unsigned long *frontswap_map) 2341 + struct swap_cluster_info *cluster_info) 2346 2342 { 2347 - if (IS_ENABLED(CONFIG_FRONTSWAP)) 2348 - frontswap_init(p->type, frontswap_map); 2343 + zswap_swapon(p->type); 2344 + 2349 2345 spin_lock(&swap_lock); 2350 2346 spin_lock(&p->lock); 2351 2347 setup_swap_info(p, prio, swap_map, cluster_info); ··· 2390 2382 struct swap_info_struct *p = NULL; 2391 2383 unsigned char *swap_map; 2392 2384 struct swap_cluster_info *cluster_info; 2393 - unsigned long *frontswap_map; 2394 2385 struct file *swap_file, *victim; 2395 2386 struct address_space *mapping; 2396 2387 struct inode *inode; ··· 2514 2507 p->swap_map = NULL; 2515 2508 cluster_info = p->cluster_info; 2516 2509 p->cluster_info = NULL; 2517 - frontswap_map = frontswap_map_get(p); 2518 2510 spin_unlock(&p->lock); 2519 2511 spin_unlock(&swap_lock); 2520 2512 arch_swap_invalidate_area(p->type); 2521 - frontswap_invalidate_area(p->type); 2522 - frontswap_map_set(p, NULL); 2513 + zswap_swapoff(p->type); 2523 2514 mutex_unlock(&swapon_mutex); 2524 2515 
free_percpu(p->percpu_cluster); 2525 2516 p->percpu_cluster = NULL; ··· 2525 2520 p->cluster_next_cpu = NULL; 2526 2521 vfree(swap_map); 2527 2522 kvfree(cluster_info); 2528 - kvfree(frontswap_map); 2529 2523 /* Destroy swap account information */ 2530 2524 swap_cgroup_swapoff(p->type); 2531 2525 exit_swap_address_space(p->type); ··· 2636 2632 return 0; 2637 2633 } 2638 2634 2639 - bytes = si->pages << (PAGE_SHIFT - 10); 2640 - inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10); 2635 + bytes = K(si->pages); 2636 + inuse = K(READ_ONCE(si->inuse_pages)); 2641 2637 2642 2638 file = si->swap_file; 2643 2639 len = seq_file_path(swap, file, " \t\n\\"); ··· 2862 2858 } 2863 2859 if (last_page > maxpages) { 2864 2860 pr_warn("Truncating oversized swap area, only using %luk out of %luk\n", 2865 - maxpages << (PAGE_SHIFT - 10), 2866 - last_page << (PAGE_SHIFT - 10)); 2861 + K(maxpages), K(last_page)); 2867 2862 } 2868 2863 if (maxpages > last_page) { 2869 2864 maxpages = last_page + 1; ··· 2990 2987 unsigned long maxpages; 2991 2988 unsigned char *swap_map = NULL; 2992 2989 struct swap_cluster_info *cluster_info = NULL; 2993 - unsigned long *frontswap_map = NULL; 2994 2990 struct page *page = NULL; 2995 2991 struct inode *inode = NULL; 2996 2992 bool inced_nr_rotate_swap = false; ··· 3129 3127 error = nr_extents; 3130 3128 goto bad_swap_unlock_inode; 3131 3129 } 3132 - /* frontswap enabled? 
set up bit-per-page map for frontswap */ 3133 - if (IS_ENABLED(CONFIG_FRONTSWAP)) 3134 - frontswap_map = kvcalloc(BITS_TO_LONGS(maxpages), 3135 - sizeof(long), 3136 - GFP_KERNEL); 3137 3130 3138 3131 if ((swap_flags & SWAP_FLAG_DISCARD) && 3139 3132 p->bdev && bdev_max_discard_sectors(p->bdev)) { ··· 3181 3184 if (swap_flags & SWAP_FLAG_PREFER) 3182 3185 prio = 3183 3186 (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT; 3184 - enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map); 3187 + enable_swap_info(p, prio, swap_map, cluster_info); 3185 3188 3186 - pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s%s\n", 3187 - p->pages<<(PAGE_SHIFT-10), name->name, p->prio, 3188 - nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10), 3189 + pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n", 3190 + K(p->pages), name->name, p->prio, nr_extents, 3191 + K((unsigned long long)span), 3189 3192 (p->flags & SWP_SOLIDSTATE) ? "SS" : "", 3190 3193 (p->flags & SWP_DISCARDABLE) ? "D" : "", 3191 3194 (p->flags & SWP_AREA_DISCARD) ? "s" : "", 3192 - (p->flags & SWP_PAGE_DISCARD) ? "c" : "", 3193 - (frontswap_map) ? "FS" : ""); 3195 + (p->flags & SWP_PAGE_DISCARD) ? 
"c" : ""); 3194 3196 3195 3197 mutex_unlock(&swapon_mutex); 3196 3198 atomic_inc(&proc_poll_event); ··· 3219 3223 spin_unlock(&swap_lock); 3220 3224 vfree(swap_map); 3221 3225 kvfree(cluster_info); 3222 - kvfree(frontswap_map); 3223 3226 if (inced_nr_rotate_swap) 3224 3227 atomic_dec(&nr_rotate_swap); 3225 3228 if (swap_file) ··· 3369 3374 3370 3375 struct swap_info_struct *page_swap_info(struct page *page) 3371 3376 { 3372 - swp_entry_t entry = { .val = page_private(page) }; 3377 + swp_entry_t entry = page_swap_entry(page); 3373 3378 return swp_swap_info(entry); 3374 3379 } 3375 3380 ··· 3384 3389 3385 3390 pgoff_t __page_file_index(struct page *page) 3386 3391 { 3387 - swp_entry_t swap = { .val = page_private(page) }; 3392 + swp_entry_t swap = page_swap_entry(page); 3388 3393 return swp_offset(swap); 3389 3394 } 3390 3395 EXPORT_SYMBOL_GPL(__page_file_index);
+3 -5
mm/truncate.c
··· 19 19 #include <linux/highmem.h> 20 20 #include <linux/pagevec.h> 21 21 #include <linux/task_io_accounting_ops.h> 22 - #include <linux/buffer_head.h> /* grr. try_to_release_page */ 23 22 #include <linux/shmem_fs.h> 24 23 #include <linux/rmap.h> 25 24 #include "internal.h" ··· 275 276 if (folio_ref_count(folio) > 276 277 folio_nr_pages(folio) + folio_has_private(folio) + 1) 277 278 return 0; 278 - if (folio_has_private(folio) && !filemap_release_folio(folio, 0)) 279 + if (!filemap_release_folio(folio, 0)) 279 280 return 0; 280 281 281 282 return remove_mapping(mapping, folio); ··· 377 378 if (!IS_ERR(folio)) { 378 379 same_folio = lend < folio_pos(folio) + folio_size(folio); 379 380 if (!truncate_inode_partial_folio(folio, lstart, lend)) { 380 - start = folio->index + folio_nr_pages(folio); 381 + start = folio_next_index(folio); 381 382 if (same_folio) 382 383 end = folio->index; 383 384 } ··· 572 573 if (folio->mapping != mapping) 573 574 return 0; 574 575 575 - if (folio_has_private(folio) && 576 - !filemap_release_folio(folio, GFP_KERNEL)) 576 + if (!filemap_release_folio(folio, GFP_KERNEL)) 577 577 return 0; 578 578 579 579 spin_lock(&mapping->host->i_lock);
+69 -18
mm/userfaultfd.c
··· 45 45 return dst_vma; 46 46 } 47 47 48 + /* Check if dst_addr is outside of file's size. Must be called with ptl held. */ 49 + static bool mfill_file_over_size(struct vm_area_struct *dst_vma, 50 + unsigned long dst_addr) 51 + { 52 + struct inode *inode; 53 + pgoff_t offset, max_off; 54 + 55 + if (!dst_vma->vm_file) 56 + return false; 57 + 58 + inode = dst_vma->vm_file->f_inode; 59 + offset = linear_page_index(dst_vma, dst_addr); 60 + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 61 + return offset >= max_off; 62 + } 63 + 48 64 /* 49 65 * Install PTEs, to map dst_addr (within dst_vma) to page. 50 66 * ··· 80 64 bool page_in_cache = page_mapping(page); 81 65 spinlock_t *ptl; 82 66 struct folio *folio; 83 - struct inode *inode; 84 - pgoff_t offset, max_off; 85 67 86 68 _dst_pte = mk_pte(page, dst_vma->vm_page_prot); 87 69 _dst_pte = pte_mkdirty(_dst_pte); ··· 95 81 if (!dst_pte) 96 82 goto out; 97 83 98 - if (vma_is_shmem(dst_vma)) { 99 - /* serialize against truncate with the page table lock */ 100 - inode = dst_vma->vm_file->f_inode; 101 - offset = linear_page_index(dst_vma, dst_addr); 102 - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 84 + if (mfill_file_over_size(dst_vma, dst_addr)) { 103 85 ret = -EFAULT; 104 - if (unlikely(offset >= max_off)) 105 - goto out_unlock; 86 + goto out_unlock; 106 87 } 107 88 108 89 ret = -EEXIST; ··· 220 211 pte_t _dst_pte, *dst_pte; 221 212 spinlock_t *ptl; 222 213 int ret; 223 - pgoff_t offset, max_off; 224 - struct inode *inode; 225 214 226 215 _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr), 227 216 dst_vma->vm_page_prot)); ··· 227 220 dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl); 228 221 if (!dst_pte) 229 222 goto out; 230 - if (dst_vma->vm_file) { 231 - /* the shmem MAP_PRIVATE case requires checking the i_size */ 232 - inode = dst_vma->vm_file->f_inode; 233 - offset = linear_page_index(dst_vma, dst_addr); 234 - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 223 + 
if (mfill_file_over_size(dst_vma, dst_addr)) { 235 224 ret = -EFAULT; 236 - if (unlikely(offset >= max_off)) 237 - goto out_unlock; 225 + goto out_unlock; 238 226 } 239 227 ret = -EEXIST; 240 228 if (!pte_none(ptep_get(dst_pte))) ··· 286 284 folio_unlock(folio); 287 285 folio_put(folio); 288 286 goto out; 287 + } 288 + 289 + /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */ 290 + static int mfill_atomic_pte_poison(pmd_t *dst_pmd, 291 + struct vm_area_struct *dst_vma, 292 + unsigned long dst_addr, 293 + uffd_flags_t flags) 294 + { 295 + int ret; 296 + struct mm_struct *dst_mm = dst_vma->vm_mm; 297 + pte_t _dst_pte, *dst_pte; 298 + spinlock_t *ptl; 299 + 300 + _dst_pte = make_pte_marker(PTE_MARKER_POISONED); 301 + ret = -EAGAIN; 302 + dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); 303 + if (!dst_pte) 304 + goto out; 305 + 306 + if (mfill_file_over_size(dst_vma, dst_addr)) { 307 + ret = -EFAULT; 308 + goto out_unlock; 309 + } 310 + 311 + ret = -EEXIST; 312 + /* Refuse to overwrite any PTE, even a PTE marker (e.g. UFFD WP). 
*/ 313 + if (!pte_none(*dst_pte)) 314 + goto out_unlock; 315 + 316 + set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); 317 + 318 + /* No need to invalidate - it was non-present before */ 319 + update_mmu_cache(dst_vma, dst_addr, dst_pte); 320 + ret = 0; 321 + out_unlock: 322 + pte_unmap_unlock(dst_pte, ptl); 323 + out: 324 + return ret; 289 325 } 290 326 291 327 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) ··· 521 481 if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) { 522 482 return mfill_atomic_pte_continue(dst_pmd, dst_vma, 523 483 dst_addr, flags); 484 + } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) { 485 + return mfill_atomic_pte_poison(dst_pmd, dst_vma, 486 + dst_addr, flags); 524 487 } 525 488 526 489 /* ··· 743 700 { 744 701 return mfill_atomic(dst_mm, start, 0, len, mmap_changing, 745 702 uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE)); 703 + } 704 + 705 + ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start, 706 + unsigned long len, atomic_t *mmap_changing, 707 + uffd_flags_t flags) 708 + { 709 + return mfill_atomic(dst_mm, start, 0, len, mmap_changing, 710 + uffd_flags_set_mode(flags, MFILL_ATOMIC_POISON)); 746 711 } 747 712 748 713 long uffd_wp_range(struct vm_area_struct *dst_vma,
+2 -8
mm/util.c
··· 737 737 } 738 738 EXPORT_SYMBOL(vcalloc); 739 739 740 - /* Neutral page->mapping pointer to address_space or anon_vma or other */ 741 - void *page_rmapping(struct page *page) 742 - { 743 - return folio_raw_mapping(page_folio(page)); 744 - } 745 - 746 740 struct anon_vma *folio_anon_vma(struct folio *folio) 747 741 { 748 742 unsigned long mapping = (unsigned long)folio->mapping; ··· 767 773 return NULL; 768 774 769 775 if (unlikely(folio_test_swapcache(folio))) 770 - return swap_address_space(folio_swap_entry(folio)); 776 + return swap_address_space(folio->swap); 771 777 772 778 mapping = folio->mapping; 773 779 if ((unsigned long)mapping & PAGE_MAPPING_FLAGS) ··· 1122 1128 } 1123 1129 EXPORT_SYMBOL(page_offline_end); 1124 1130 1125 - #ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO 1131 + #ifndef flush_dcache_folio 1126 1132 void flush_dcache_folio(struct folio *folio) 1127 1133 { 1128 1134 long i, nr = folio_nr_pages(folio);
+28 -16
mm/vmscan.c
··· 1423 1423 } 1424 1424 1425 1425 if (folio_test_swapcache(folio)) { 1426 - swp_entry_t swap = folio_swap_entry(folio); 1426 + swp_entry_t swap = folio->swap; 1427 1427 1428 1428 if (reclaimed && !mapping_exiting(mapping)) 1429 1429 shadow = workingset_eviction(folio, target_memcg); ··· 2064 2064 * (refcount == 1) it can be freed. Otherwise, leave 2065 2065 * the folio on the LRU so it is swappable. 2066 2066 */ 2067 - if (folio_has_private(folio)) { 2067 + if (folio_needs_release(folio)) { 2068 2068 if (!filemap_release_folio(folio, sc->gfp_mask)) 2069 2069 goto activate_locked; 2070 2070 if (!mapping && folio_ref_count(folio) == 1) { ··· 2729 2729 } 2730 2730 2731 2731 if (unlikely(buffer_heads_over_limit)) { 2732 - if (folio_test_private(folio) && folio_trylock(folio)) { 2733 - if (folio_test_private(folio)) 2734 - filemap_release_folio(folio, 0); 2732 + if (folio_needs_release(folio) && 2733 + folio_trylock(folio)) { 2734 + filemap_release_folio(folio, 0); 2735 2735 folio_unlock(folio); 2736 2736 } 2737 2737 } ··· 4440 4440 int prev, next; 4441 4441 int type, zone; 4442 4442 struct lru_gen_folio *lrugen = &lruvec->lrugen; 4443 - 4443 + restart: 4444 4444 spin_lock_irq(&lruvec->lru_lock); 4445 4445 4446 4446 VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); ··· 4451 4451 4452 4452 VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); 4453 4453 4454 - while (!inc_min_seq(lruvec, type, can_swap)) { 4455 - spin_unlock_irq(&lruvec->lru_lock); 4456 - cond_resched(); 4457 - spin_lock_irq(&lruvec->lru_lock); 4458 - } 4454 + if (inc_min_seq(lruvec, type, can_swap)) 4455 + continue; 4456 + 4457 + spin_unlock_irq(&lruvec->lru_lock); 4458 + cond_resched(); 4459 + goto restart; 4459 4460 } 4460 4461 4461 4462 /* ··· 4657 4656 pte_t *pte = pvmw->pte; 4658 4657 unsigned long addr = pvmw->address; 4659 4658 struct folio *folio = pfn_folio(pvmw->pfn); 4659 + bool can_swap = !folio_is_file_lru(folio); 4660 4660 struct mem_cgroup *memcg = folio_memcg(folio); 4661 4661 
struct pglist_data *pgdat = folio_pgdat(folio); 4662 4662 struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); ··· 4706 4704 if (!pte_young(ptent)) 4707 4705 continue; 4708 4706 4709 - folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap); 4707 + folio = get_pfn_folio(pfn, memcg, pgdat, can_swap); 4710 4708 if (!folio) 4711 4709 continue; 4712 4710 ··· 4893 4891 * the eviction 4894 4892 ******************************************************************************/ 4895 4893 4896 - static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) 4894 + static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc, 4895 + int tier_idx) 4897 4896 { 4898 4897 bool success; 4899 4898 int gen = folio_lru_gen(folio); ··· 4941 4938 4942 4939 WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 4943 4940 lrugen->protected[hist][type][tier - 1] + delta); 4941 + return true; 4942 + } 4943 + 4944 + /* ineligible */ 4945 + if (zone > sc->reclaim_idx || skip_cma(folio, sc)) { 4946 + gen = folio_inc_gen(lruvec, folio, false); 4947 + list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]); 4944 4948 return true; 4945 4949 } 4946 4950 ··· 4999 4989 static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, 5000 4990 int type, int tier, struct list_head *list) 5001 4991 { 5002 - int gen, zone; 4992 + int i; 4993 + int gen; 5003 4994 enum vm_event_item item; 5004 4995 int sorted = 0; 5005 4996 int scanned = 0; ··· 5016 5005 5017 5006 gen = lru_gen_from_seq(lrugen->min_seq[type]); 5018 5007 5019 - for (zone = sc->reclaim_idx; zone >= 0; zone--) { 5008 + for (i = MAX_NR_ZONES; i > 0; i--) { 5020 5009 LIST_HEAD(moved); 5021 5010 int skipped = 0; 5011 + int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES; 5022 5012 struct list_head *head = &lrugen->folios[gen][type][zone]; 5023 5013 5024 5014 while (!list_empty(head)) { ··· 5033 5021 5034 5022 scanned += delta; 5035 5023 5036 - if (sort_folio(lruvec, folio, tier)) 
5024 + if (sort_folio(lruvec, folio, sc, tier)) 5037 5025 sorted += delta; 5038 5026 else if (isolate_folio(lruvec, folio, sc)) { 5039 5027 list_add(&folio->lru, list);
-1
mm/vmstat.c
··· 26 26 #include <linux/writeback.h> 27 27 #include <linux/compaction.h> 28 28 #include <linux/mm_inline.h> 29 - #include <linux/page_ext.h> 30 29 #include <linux/page_owner.h> 31 30 #include <linux/sched/isolation.h> 32 31
+1
mm/workingset.c
··· 664 664 struct lruvec *lruvec; 665 665 int i; 666 666 667 + mem_cgroup_flush_stats(); 667 668 lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); 668 669 for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) 669 670 pages += lruvec_page_state_local(lruvec,
+17 -10
mm/z3fold.c
··· 133 133 * @stale: list of pages marked for freeing 134 134 * @pages_nr: number of z3fold pages in the pool. 135 135 * @c_handle: cache for z3fold_buddy_slots allocation 136 - * @zpool: zpool driver 137 - * @zpool_ops: zpool operations structure with an evict callback 138 136 * @compact_wq: workqueue for page layout background optimization 139 137 * @release_wq: workqueue for safe page release 140 138 * @work: work_struct for safe page release ··· 478 480 __release_z3fold_page(zhdr, true); 479 481 } 480 482 483 + static inline int put_z3fold_locked(struct z3fold_header *zhdr) 484 + { 485 + return kref_put(&zhdr->refcount, release_z3fold_page_locked); 486 + } 487 + 488 + static inline int put_z3fold_locked_list(struct z3fold_header *zhdr) 489 + { 490 + return kref_put(&zhdr->refcount, release_z3fold_page_locked_list); 491 + } 492 + 481 493 static void free_pages_work(struct work_struct *w) 482 494 { 483 495 struct z3fold_pool *pool = container_of(w, struct z3fold_pool, work); ··· 674 666 return new_zhdr; 675 667 676 668 out_fail: 677 - if (new_zhdr && !kref_put(&new_zhdr->refcount, release_z3fold_page_locked)) { 669 + if (new_zhdr && !put_z3fold_locked(new_zhdr)) { 678 670 add_to_unbuddied(pool, new_zhdr); 679 671 z3fold_page_unlock(new_zhdr); 680 672 } ··· 749 741 list_del_init(&zhdr->buddy); 750 742 spin_unlock(&pool->lock); 751 743 752 - if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) 744 + if (put_z3fold_locked(zhdr)) 753 745 return; 754 746 755 747 if (test_bit(PAGE_STALE, &page->private) || ··· 760 752 761 753 if (!zhdr->foreign_handles && buddy_single(zhdr) && 762 754 zhdr->mapped_count == 0 && compact_single_buddy(zhdr)) { 763 - if (!kref_put(&zhdr->refcount, release_z3fold_page_locked)) { 755 + if (!put_z3fold_locked(zhdr)) { 764 756 clear_bit(PAGE_CLAIMED, &page->private); 765 757 z3fold_page_unlock(zhdr); 766 758 } ··· 886 878 return zhdr; 887 879 888 880 out_fail: 889 - if (!kref_put(&zhdr->refcount, release_z3fold_page_locked)) { 881 + if 
(!put_z3fold_locked(zhdr)) { 890 882 add_to_unbuddied(pool, zhdr); 891 883 z3fold_page_unlock(zhdr); 892 884 } ··· 1020 1012 if (zhdr) { 1021 1013 bud = get_free_buddy(zhdr, chunks); 1022 1014 if (bud == HEADLESS) { 1023 - if (!kref_put(&zhdr->refcount, 1024 - release_z3fold_page_locked)) 1015 + if (!put_z3fold_locked(zhdr)) 1025 1016 z3fold_page_unlock(zhdr); 1026 1017 pr_err("No free chunks in unbuddied\n"); 1027 1018 WARN_ON(1); ··· 1136 1129 1137 1130 if (!page_claimed) 1138 1131 free_handle(handle, zhdr); 1139 - if (kref_put(&zhdr->refcount, release_z3fold_page_locked_list)) 1132 + if (put_z3fold_locked_list(zhdr)) 1140 1133 return; 1141 1134 if (page_claimed) { 1142 1135 /* the page has not been claimed by us */ ··· 1353 1346 if (!list_empty(&zhdr->buddy)) 1354 1347 list_del_init(&zhdr->buddy); 1355 1348 INIT_LIST_HEAD(&page->lru); 1356 - if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) 1349 + if (put_z3fold_locked(zhdr)) 1357 1350 return; 1358 1351 if (list_empty(&zhdr->buddy)) 1359 1352 add_to_unbuddied(pool, zhdr);
+27 -52
mm/zsmalloc.c
··· 795 795 return *(unsigned long *)handle; 796 796 } 797 797 798 - static bool obj_tagged(struct page *page, void *obj, unsigned long *phandle, 799 - int tag) 798 + static inline bool obj_allocated(struct page *page, void *obj, 799 + unsigned long *phandle) 800 800 { 801 801 unsigned long handle; 802 802 struct zspage *zspage = get_zspage(page); ··· 807 807 } else 808 808 handle = *(unsigned long *)obj; 809 809 810 - if (!(handle & tag)) 810 + if (!(handle & OBJ_ALLOCATED_TAG)) 811 811 return false; 812 812 813 813 /* Clear all tags before returning the handle */ 814 814 *phandle = handle & ~OBJ_TAG_MASK; 815 815 return true; 816 - } 817 - 818 - static inline bool obj_allocated(struct page *page, void *obj, unsigned long *phandle) 819 - { 820 - return obj_tagged(page, obj, phandle, OBJ_ALLOCATED_TAG); 821 816 } 822 817 823 818 static void reset_page(struct page *page) ··· 1140 1145 static bool zspage_full(struct size_class *class, struct zspage *zspage) 1141 1146 { 1142 1147 return get_zspage_inuse(zspage) == class->objs_per_zspage; 1148 + } 1149 + 1150 + static bool zspage_empty(struct zspage *zspage) 1151 + { 1152 + return get_zspage_inuse(zspage) == 0; 1143 1153 } 1144 1154 1145 1155 /** ··· 1546 1546 } 1547 1547 1548 1548 /* 1549 - * Find object with a certain tag in zspage from index object and 1549 + * Find alloced object in zspage from index object and 1550 1550 * return handle. 
1551 1551 */ 1552 - static unsigned long find_tagged_obj(struct size_class *class, 1553 - struct page *page, int *obj_idx, int tag) 1552 + static unsigned long find_alloced_obj(struct size_class *class, 1553 + struct page *page, int *obj_idx) 1554 1554 { 1555 1555 unsigned int offset; 1556 1556 int index = *obj_idx; ··· 1561 1561 offset += class->size * index; 1562 1562 1563 1563 while (offset < PAGE_SIZE) { 1564 - if (obj_tagged(page, addr + offset, &handle, tag)) 1564 + if (obj_allocated(page, addr + offset, &handle)) 1565 1565 break; 1566 1566 1567 1567 offset += class->size; ··· 1575 1575 return handle; 1576 1576 } 1577 1577 1578 - /* 1579 - * Find alloced object in zspage from index object and 1580 - * return handle. 1581 - */ 1582 - static unsigned long find_alloced_obj(struct size_class *class, 1583 - struct page *page, int *obj_idx) 1584 - { 1585 - return find_tagged_obj(class, page, obj_idx, OBJ_ALLOCATED_TAG); 1586 - } 1587 - 1588 - struct zs_compact_control { 1589 - /* Source spage for migration which could be a subpage of zspage */ 1590 - struct page *s_page; 1591 - /* Destination page for migration which should be a first page 1592 - * of zspage. */ 1593 - struct page *d_page; 1594 - /* Starting object index within @s_page which used for live object 1595 - * in the subpage. 
*/ 1596 - int obj_idx; 1597 - }; 1598 - 1599 - static void migrate_zspage(struct zs_pool *pool, struct size_class *class, 1600 - struct zs_compact_control *cc) 1578 + static void migrate_zspage(struct zs_pool *pool, struct zspage *src_zspage, 1579 + struct zspage *dst_zspage) 1601 1580 { 1602 1581 unsigned long used_obj, free_obj; 1603 1582 unsigned long handle; 1604 - struct page *s_page = cc->s_page; 1605 - struct page *d_page = cc->d_page; 1606 - int obj_idx = cc->obj_idx; 1583 + int obj_idx = 0; 1584 + struct page *s_page = get_first_page(src_zspage); 1585 + struct size_class *class = pool->size_class[src_zspage->class]; 1607 1586 1608 1587 while (1) { 1609 1588 handle = find_alloced_obj(class, s_page, &obj_idx); ··· 1594 1615 continue; 1595 1616 } 1596 1617 1597 - /* Stop if there is no more space */ 1598 - if (zspage_full(class, get_zspage(d_page))) 1599 - break; 1600 - 1601 1618 used_obj = handle_to_obj(handle); 1602 - free_obj = obj_malloc(pool, get_zspage(d_page), handle); 1619 + free_obj = obj_malloc(pool, dst_zspage, handle); 1603 1620 zs_object_copy(class, free_obj, used_obj); 1604 1621 obj_idx++; 1605 1622 record_obj(handle, free_obj); 1606 1623 obj_free(class->size, used_obj); 1607 - } 1608 1624 1609 - /* Remember last position in this iteration */ 1610 - cc->s_page = s_page; 1611 - cc->obj_idx = obj_idx; 1625 + /* Stop if there is no more space */ 1626 + if (zspage_full(class, dst_zspage)) 1627 + break; 1628 + 1629 + /* Stop if there are no more objects to migrate */ 1630 + if (zspage_empty(src_zspage)) 1631 + break; 1632 + } 1612 1633 } 1613 1634 1614 1635 static struct zspage *isolate_src_zspage(struct size_class *class) ··· 1987 2008 static unsigned long __zs_compact(struct zs_pool *pool, 1988 2009 struct size_class *class) 1989 2010 { 1990 - struct zs_compact_control cc; 1991 2011 struct zspage *src_zspage = NULL; 1992 2012 struct zspage *dst_zspage = NULL; 1993 2013 unsigned long pages_freed = 0; ··· 2004 2026 if (!dst_zspage) 2005 2027 break; 
2006 2028 migrate_write_lock(dst_zspage); 2007 - cc.d_page = get_first_page(dst_zspage); 2008 2029 } 2009 2030 2010 2031 src_zspage = isolate_src_zspage(class); ··· 2012 2035 2013 2036 migrate_write_lock_nested(src_zspage); 2014 2037 2015 - cc.obj_idx = 0; 2016 - cc.s_page = get_first_page(src_zspage); 2017 - migrate_zspage(pool, class, &cc); 2038 + migrate_zspage(pool, src_zspage, dst_zspage); 2018 2039 fg = putback_zspage(class, src_zspage); 2019 2040 migrate_write_unlock(src_zspage); 2020 2041
+182 -223
mm/zswap.c
··· 2 2 /* 3 3 * zswap.c - zswap driver file 4 4 * 5 - * zswap is a backend for frontswap that takes pages that are in the process 5 + * zswap is a cache that takes pages that are in the process 6 6 * of being swapped out and attempts to compress and store them in a 7 7 * RAM-based memory pool. This can result in a significant I/O reduction on 8 8 * the swap device and, in the case where decompressing from RAM is faster ··· 20 20 #include <linux/spinlock.h> 21 21 #include <linux/types.h> 22 22 #include <linux/atomic.h> 23 - #include <linux/frontswap.h> 24 23 #include <linux/rbtree.h> 25 24 #include <linux/swap.h> 26 25 #include <linux/crypto.h> ··· 27 28 #include <linux/mempool.h> 28 29 #include <linux/zpool.h> 29 30 #include <crypto/acompress.h> 30 - 31 + #include <linux/zswap.h> 31 32 #include <linux/mm_types.h> 32 33 #include <linux/page-flags.h> 33 34 #include <linux/swapops.h> ··· 141 142 CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON); 142 143 module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644); 143 144 145 + /* Number of zpools in zswap_pool (empirically determined for scalability) */ 146 + #define ZSWAP_NR_ZPOOLS 32 147 + 144 148 /********************************* 145 149 * data structures 146 150 **********************************/ ··· 163 161 * needs to be verified that it's still valid in the tree. 164 162 */ 165 163 struct zswap_pool { 166 - struct zpool *zpool; 164 + struct zpool *zpools[ZSWAP_NR_ZPOOLS]; 167 165 struct crypto_acomp_ctx __percpu *acomp_ctx; 168 166 struct kref kref; 169 167 struct list_head list; ··· 182 180 * page within zswap. 183 181 * 184 182 * rbnode - links the entry into red-black tree for the appropriate swap type 185 - * offset - the swap offset for the entry. Index into the red-black tree. 183 + * swpentry - associated swap entry, the offset indexes into the red-black tree 186 184 * refcount - the number of outstanding reference to the entry. 
This is needed 187 185 * to protect against premature freeing of the entry by code 188 186 * concurrent calls to load, invalidate, and writeback. The lock ··· 195 193 * pool - the zswap_pool the entry's data is in 196 194 * handle - zpool allocation handle that stores the compressed page data 197 195 * value - value of the same-value filled pages which have same content 196 + * objcg - the obj_cgroup that the compressed memory is charged to 198 197 * lru - handle to the pool's lru used to evict pages. 199 198 */ 200 199 struct zswap_entry { ··· 251 248 252 249 #define zswap_pool_debug(msg, p) \ 253 250 pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \ 254 - zpool_get_type((p)->zpool)) 251 + zpool_get_type((p)->zpools[0])) 255 252 256 253 static int zswap_writeback_entry(struct zswap_entry *entry, 257 254 struct zswap_tree *tree); ··· 275 272 { 276 273 struct zswap_pool *pool; 277 274 u64 total = 0; 275 + int i; 278 276 279 277 rcu_read_lock(); 280 278 281 279 list_for_each_entry_rcu(pool, &zswap_pools, list) 282 - total += zpool_get_total_size(pool->zpool); 280 + for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) 281 + total += zpool_get_total_size(pool->zpools[i]); 283 282 284 283 rcu_read_unlock(); 285 284 ··· 370 365 return false; 371 366 } 372 367 368 + static struct zpool *zswap_find_zpool(struct zswap_entry *entry) 369 + { 370 + int i = 0; 371 + 372 + if (ZSWAP_NR_ZPOOLS > 1) 373 + i = hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS)); 374 + 375 + return entry->pool->zpools[i]; 376 + } 377 + 373 378 /* 374 379 * Carries out the common pattern of freeing and entry's zpool allocation, 375 380 * freeing the entry itself, and decrementing the number of stored pages. 
··· 396 381 spin_lock(&entry->pool->lru_lock); 397 382 list_del(&entry->lru); 398 383 spin_unlock(&entry->pool->lru_lock); 399 - zpool_free(entry->pool->zpool, entry->handle); 384 + zpool_free(zswap_find_zpool(entry), entry->handle); 400 385 zswap_pool_put(entry->pool); 401 386 } 402 387 zswap_entry_cache_free(entry); ··· 418 403 { 419 404 int refcount = --entry->refcount; 420 405 421 - BUG_ON(refcount < 0); 406 + WARN_ON_ONCE(refcount < 0); 422 407 if (refcount == 0) { 423 - zswap_rb_erase(&tree->rbroot, entry); 408 + WARN_ON_ONCE(!RB_EMPTY_NODE(&entry->rbnode)); 424 409 zswap_free_entry(entry); 425 410 } 426 411 } ··· 605 590 list_for_each_entry_rcu(pool, &zswap_pools, list) { 606 591 if (strcmp(pool->tfm_name, compressor)) 607 592 continue; 608 - if (strcmp(zpool_get_type(pool->zpool), type)) 593 + /* all zpools share the same type */ 594 + if (strcmp(zpool_get_type(pool->zpools[0]), type)) 609 595 continue; 610 596 /* if we can't get it, it's about to be destroyed */ 611 597 if (!zswap_pool_get(pool)) ··· 711 695 712 696 static struct zswap_pool *zswap_pool_create(char *type, char *compressor) 713 697 { 698 + int i; 714 699 struct zswap_pool *pool; 715 700 char name[38]; /* 'zswap' + 32 char (max) num + \0 */ 716 701 gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM; ··· 732 715 if (!pool) 733 716 return NULL; 734 717 735 - /* unique name for each pool specifically required by zsmalloc */ 736 - snprintf(name, 38, "zswap%x", atomic_inc_return(&zswap_pools_count)); 718 + for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) { 719 + /* unique name for each pool specifically required by zsmalloc */ 720 + snprintf(name, 38, "zswap%x", 721 + atomic_inc_return(&zswap_pools_count)); 737 722 738 - pool->zpool = zpool_create_pool(type, name, gfp); 739 - if (!pool->zpool) { 740 - pr_err("%s zpool not available\n", type); 741 - goto error; 723 + pool->zpools[i] = zpool_create_pool(type, name, gfp); 724 + if (!pool->zpools[i]) { 725 + pr_err("%s zpool not available\n", type); 
726 + goto error; 727 + } 742 728 } 743 - pr_debug("using %s zpool\n", zpool_get_type(pool->zpool)); 729 + pr_debug("using %s zpool\n", zpool_get_type(pool->zpools[0])); 744 730 745 731 strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name)); 746 732 ··· 775 755 error: 776 756 if (pool->acomp_ctx) 777 757 free_percpu(pool->acomp_ctx); 778 - if (pool->zpool) 779 - zpool_destroy_pool(pool->zpool); 758 + while (i--) 759 + zpool_destroy_pool(pool->zpools[i]); 780 760 kfree(pool); 781 761 return NULL; 782 762 } ··· 825 805 826 806 static void zswap_pool_destroy(struct zswap_pool *pool) 827 807 { 808 + int i; 809 + 828 810 zswap_pool_debug("destroying", pool); 829 811 830 812 cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); 831 813 free_percpu(pool->acomp_ctx); 832 - zpool_destroy_pool(pool->zpool); 814 + for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) 815 + zpool_destroy_pool(pool->zpools[i]); 833 816 kfree(pool); 834 817 } 835 818 ··· 1040 1017 /********************************* 1041 1018 * writeback code 1042 1019 **********************************/ 1043 - /* return enum for zswap_get_swap_cache_page */ 1044 - enum zswap_get_swap_ret { 1045 - ZSWAP_SWAPCACHE_NEW, 1046 - ZSWAP_SWAPCACHE_EXIST, 1047 - ZSWAP_SWAPCACHE_FAIL, 1048 - }; 1049 - 1050 - /* 1051 - * zswap_get_swap_cache_page 1052 - * 1053 - * This is an adaption of read_swap_cache_async() 1054 - * 1055 - * This function tries to find a page with the given swap entry 1056 - * in the swapper_space address space (the swap cache). If the page 1057 - * is found, it is returned in retpage. Otherwise, a page is allocated, 1058 - * added to the swap cache, and returned in retpage. 
1059 - * 1060 - * If success, the swap cache page is returned in retpage 1061 - * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache 1062 - * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, 1063 - * the new page is added to swapcache and locked 1064 - * Returns ZSWAP_SWAPCACHE_FAIL on error 1065 - */ 1066 - static int zswap_get_swap_cache_page(swp_entry_t entry, 1067 - struct page **retpage) 1068 - { 1069 - bool page_was_allocated; 1070 - 1071 - *retpage = __read_swap_cache_async(entry, GFP_KERNEL, 1072 - NULL, 0, &page_was_allocated); 1073 - if (page_was_allocated) 1074 - return ZSWAP_SWAPCACHE_NEW; 1075 - if (!*retpage) 1076 - return ZSWAP_SWAPCACHE_FAIL; 1077 - return ZSWAP_SWAPCACHE_EXIST; 1078 - } 1079 - 1080 1020 /* 1081 1021 * Attempts to free an entry by adding a page to the swap cache, 1082 1022 * decompressing the entry data into the page, and issuing a ··· 1047 1061 * 1048 1062 * This can be thought of as a "resumed writeback" of the page 1049 1063 * to the swap device. We are basically resuming the same swap 1050 - * writeback path that was intercepted with the frontswap_store() 1064 + * writeback path that was intercepted with the zswap_store() 1051 1065 * in the first place. After the page has been decompressed into 1052 1066 * the swap cache, the compressed version stored by zswap can be 1053 1067 * freed. 
··· 1059 1073 struct page *page; 1060 1074 struct scatterlist input, output; 1061 1075 struct crypto_acomp_ctx *acomp_ctx; 1062 - struct zpool *pool = entry->pool->zpool; 1063 - 1076 + struct zpool *pool = zswap_find_zpool(entry); 1077 + bool page_was_allocated; 1064 1078 u8 *src, *tmp = NULL; 1065 1079 unsigned int dlen; 1066 1080 int ret; ··· 1075 1089 } 1076 1090 1077 1091 /* try to allocate swap cache page */ 1078 - switch (zswap_get_swap_cache_page(swpentry, &page)) { 1079 - case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */ 1092 + page = __read_swap_cache_async(swpentry, GFP_KERNEL, NULL, 0, 1093 + &page_was_allocated); 1094 + if (!page) { 1080 1095 ret = -ENOMEM; 1081 1096 goto fail; 1097 + } 1082 1098 1083 - case ZSWAP_SWAPCACHE_EXIST: 1084 - /* page is already in the swap cache, ignore for now */ 1099 + /* Found an existing page, we raced with load/swapin */ 1100 + if (!page_was_allocated) { 1085 1101 put_page(page); 1086 1102 ret = -EEXIST; 1087 1103 goto fail; 1088 - 1089 - case ZSWAP_SWAPCACHE_NEW: /* page is locked */ 1090 - /* 1091 - * Having a local reference to the zswap entry doesn't exclude 1092 - * swapping from invalidating and recycling the swap slot. Once 1093 - * the swapcache is secured against concurrent swapping to and 1094 - * from the slot, recheck that the entry is still current before 1095 - * writing. 
1096 - */ 1097 - spin_lock(&tree->lock); 1098 - if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) { 1099 - spin_unlock(&tree->lock); 1100 - delete_from_swap_cache(page_folio(page)); 1101 - ret = -ENOMEM; 1102 - goto fail; 1103 - } 1104 - spin_unlock(&tree->lock); 1105 - 1106 - /* decompress */ 1107 - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); 1108 - dlen = PAGE_SIZE; 1109 - 1110 - src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO); 1111 - if (!zpool_can_sleep_mapped(pool)) { 1112 - memcpy(tmp, src, entry->length); 1113 - src = tmp; 1114 - zpool_unmap_handle(pool, entry->handle); 1115 - } 1116 - 1117 - mutex_lock(acomp_ctx->mutex); 1118 - sg_init_one(&input, src, entry->length); 1119 - sg_init_table(&output, 1); 1120 - sg_set_page(&output, page, PAGE_SIZE, 0); 1121 - acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen); 1122 - ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); 1123 - dlen = acomp_ctx->req->dlen; 1124 - mutex_unlock(acomp_ctx->mutex); 1125 - 1126 - if (!zpool_can_sleep_mapped(pool)) 1127 - kfree(tmp); 1128 - else 1129 - zpool_unmap_handle(pool, entry->handle); 1130 - 1131 - BUG_ON(ret); 1132 - BUG_ON(dlen != PAGE_SIZE); 1133 - 1134 - /* page is up to date */ 1135 - SetPageUptodate(page); 1136 1104 } 1105 + 1106 + /* 1107 + * Page is locked, and the swapcache is now secured against 1108 + * concurrent swapping to and from the slot. Verify that the 1109 + * swap entry hasn't been invalidated and recycled behind our 1110 + * backs (our zswap_entry reference doesn't prevent that), to 1111 + * avoid overwriting a new swap page with old compressed data. 
1112 + */ 1113 + spin_lock(&tree->lock); 1114 + if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) { 1115 + spin_unlock(&tree->lock); 1116 + delete_from_swap_cache(page_folio(page)); 1117 + ret = -ENOMEM; 1118 + goto fail; 1119 + } 1120 + spin_unlock(&tree->lock); 1121 + 1122 + /* decompress */ 1123 + acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); 1124 + dlen = PAGE_SIZE; 1125 + 1126 + src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO); 1127 + if (!zpool_can_sleep_mapped(pool)) { 1128 + memcpy(tmp, src, entry->length); 1129 + src = tmp; 1130 + zpool_unmap_handle(pool, entry->handle); 1131 + } 1132 + 1133 + mutex_lock(acomp_ctx->mutex); 1134 + sg_init_one(&input, src, entry->length); 1135 + sg_init_table(&output, 1); 1136 + sg_set_page(&output, page, PAGE_SIZE, 0); 1137 + acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen); 1138 + ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); 1139 + dlen = acomp_ctx->req->dlen; 1140 + mutex_unlock(acomp_ctx->mutex); 1141 + 1142 + if (!zpool_can_sleep_mapped(pool)) 1143 + kfree(tmp); 1144 + else 1145 + zpool_unmap_handle(pool, entry->handle); 1146 + 1147 + BUG_ON(ret); 1148 + BUG_ON(dlen != PAGE_SIZE); 1149 + 1150 + /* page is up to date */ 1151 + SetPageUptodate(page); 1137 1152 1138 1153 /* move it to the tail of the inactive list after end_writeback */ 1139 1154 SetPageReclaim(page); ··· 1145 1158 zswap_written_back_pages++; 1146 1159 1147 1160 return ret; 1161 + 1148 1162 fail: 1149 1163 if (!zpool_can_sleep_mapped(pool)) 1150 1164 kfree(tmp); 1151 1165 1152 1166 /* 1153 - * if we get here due to ZSWAP_SWAPCACHE_EXIST 1154 - * a load may be happening concurrently. 1155 - * it is safe and okay to not free the entry. 1156 - * it is also okay to return !0 1157 - */ 1167 + * If we get here because the page is already in swapcache, a 1168 + * load may be happening concurrently. It is safe and okay to 1169 + * not free the entry. 
It is also okay to return !0. 1170 + */ 1158 1171 return ret; 1159 1172 } 1160 1173 ··· 1188 1201 memset_l(page, value, PAGE_SIZE / sizeof(unsigned long)); 1189 1202 } 1190 1203 1191 - /********************************* 1192 - * frontswap hooks 1193 - **********************************/ 1194 - /* attempts to compress and store an single page */ 1195 - static int zswap_frontswap_store(unsigned type, pgoff_t offset, 1196 - struct page *page) 1204 + bool zswap_store(struct folio *folio) 1197 1205 { 1206 + swp_entry_t swp = folio->swap; 1207 + int type = swp_type(swp); 1208 + pgoff_t offset = swp_offset(swp); 1209 + struct page *page = &folio->page; 1198 1210 struct zswap_tree *tree = zswap_trees[type]; 1199 1211 struct zswap_entry *entry, *dupentry; 1200 1212 struct scatterlist input, output; 1201 1213 struct crypto_acomp_ctx *acomp_ctx; 1202 1214 struct obj_cgroup *objcg = NULL; 1203 1215 struct zswap_pool *pool; 1204 - int ret; 1216 + struct zpool *zpool; 1205 1217 unsigned int dlen = PAGE_SIZE; 1206 1218 unsigned long handle, value; 1207 1219 char *buf; 1208 1220 u8 *src, *dst; 1209 1221 gfp_t gfp; 1222 + int ret; 1210 1223 1211 - /* THP isn't supported */ 1212 - if (PageTransHuge(page)) { 1213 - ret = -EINVAL; 1214 - goto reject; 1215 - } 1224 + VM_WARN_ON_ONCE(!folio_test_locked(folio)); 1225 + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); 1216 1226 1217 - if (!zswap_enabled || !tree) { 1218 - ret = -ENODEV; 1219 - goto reject; 1220 - } 1227 + /* Large folios aren't supported */ 1228 + if (folio_test_large(folio)) 1229 + return false; 1230 + 1231 + if (!zswap_enabled || !tree) 1232 + return false; 1221 1233 1222 1234 /* 1223 1235 * XXX: zswap reclaim does not work with cgroups yet. Without a 1224 1236 * cgroup-aware entry LRU, we will push out entries system-wide based on 1225 1237 * local cgroup limits. 
1226 1238 */ 1227 - objcg = get_obj_cgroup_from_page(page); 1228 - if (objcg && !obj_cgroup_may_zswap(objcg)) { 1229 - ret = -ENOMEM; 1239 + objcg = get_obj_cgroup_from_folio(folio); 1240 + if (objcg && !obj_cgroup_may_zswap(objcg)) 1230 1241 goto reject; 1231 - } 1232 1242 1233 1243 /* reclaim space if needed */ 1234 1244 if (zswap_is_full()) { ··· 1235 1251 } 1236 1252 1237 1253 if (zswap_pool_reached_full) { 1238 - if (!zswap_can_accept()) { 1239 - ret = -ENOMEM; 1254 + if (!zswap_can_accept()) 1240 1255 goto shrink; 1241 - } else 1256 + else 1242 1257 zswap_pool_reached_full = false; 1243 1258 } 1244 1259 ··· 1245 1262 entry = zswap_entry_cache_alloc(GFP_KERNEL); 1246 1263 if (!entry) { 1247 1264 zswap_reject_kmemcache_fail++; 1248 - ret = -ENOMEM; 1249 1265 goto reject; 1250 1266 } 1251 1267 ··· 1261 1279 kunmap_atomic(src); 1262 1280 } 1263 1281 1264 - if (!zswap_non_same_filled_pages_enabled) { 1265 - ret = -EINVAL; 1282 + if (!zswap_non_same_filled_pages_enabled) 1266 1283 goto freepage; 1267 - } 1268 1284 1269 1285 /* if entry is successfully added, it keeps the reference */ 1270 1286 entry->pool = zswap_pool_current_get(); 1271 - if (!entry->pool) { 1272 - ret = -EINVAL; 1287 + if (!entry->pool) 1273 1288 goto freepage; 1274 - } 1275 1289 1276 1290 /* compress */ 1277 1291 acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); ··· 1287 1309 * synchronous in fact. 1288 1310 * Theoretically, acomp supports users send multiple acomp requests in one 1289 1311 * acomp instance, then get those requests done simultaneously. but in this 1290 - * case, frontswap actually does store and load page by page, there is no 1312 + * case, zswap actually does store and load page by page, there is no 1291 1313 * existing method to send the second page before the first page is done 1292 - * in one thread doing frontswap. 1314 + * in one thread doing zswap. 
1293 1315 * but in different threads running on different cpu, we have different 1294 1316 * acomp instance, so multiple threads can do (de)compression in parallel. 1295 1317 */ 1296 1318 ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait); 1297 1319 dlen = acomp_ctx->req->dlen; 1298 1320 1299 - if (ret) { 1300 - ret = -EINVAL; 1321 + if (ret) 1301 1322 goto put_dstmem; 1302 - } 1303 1323 1304 1324 /* store */ 1325 + zpool = zswap_find_zpool(entry); 1305 1326 gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM; 1306 - if (zpool_malloc_support_movable(entry->pool->zpool)) 1327 + if (zpool_malloc_support_movable(zpool)) 1307 1328 gfp |= __GFP_HIGHMEM | __GFP_MOVABLE; 1308 - ret = zpool_malloc(entry->pool->zpool, dlen, gfp, &handle); 1329 + ret = zpool_malloc(zpool, dlen, gfp, &handle); 1309 1330 if (ret == -ENOSPC) { 1310 1331 zswap_reject_compress_poor++; 1311 1332 goto put_dstmem; ··· 1313 1336 zswap_reject_alloc_fail++; 1314 1337 goto put_dstmem; 1315 1338 } 1316 - buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_WO); 1339 + buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO); 1317 1340 memcpy(buf, dst, dlen); 1318 - zpool_unmap_handle(entry->pool->zpool, handle); 1341 + zpool_unmap_handle(zpool, handle); 1319 1342 mutex_unlock(acomp_ctx->mutex); 1320 1343 1321 1344 /* populate entry */ ··· 1333 1356 1334 1357 /* map */ 1335 1358 spin_lock(&tree->lock); 1336 - do { 1337 - ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry); 1338 - if (ret == -EEXIST) { 1339 - zswap_duplicate_entry++; 1340 - /* remove from rbtree */ 1341 - zswap_rb_erase(&tree->rbroot, dupentry); 1342 - zswap_entry_put(tree, dupentry); 1343 - } 1344 - } while (ret == -EEXIST); 1359 + while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) { 1360 + zswap_duplicate_entry++; 1361 + zswap_invalidate_entry(tree, dupentry); 1362 + } 1345 1363 if (entry->length) { 1346 1364 spin_lock(&entry->pool->lru_lock); 1347 1365 list_add(&entry->lru, 
&entry->pool->lru); ··· 1349 1377 zswap_update_total_size(); 1350 1378 count_vm_event(ZSWPOUT); 1351 1379 1352 - return 0; 1380 + return true; 1353 1381 1354 1382 put_dstmem: 1355 1383 mutex_unlock(acomp_ctx->mutex); ··· 1359 1387 reject: 1360 1388 if (objcg) 1361 1389 obj_cgroup_put(objcg); 1362 - return ret; 1390 + return false; 1363 1391 1364 1392 shrink: 1365 1393 pool = zswap_pool_last_get(); 1366 1394 if (pool) 1367 1395 queue_work(shrink_wq, &pool->shrink_work); 1368 - ret = -ENOMEM; 1369 1396 goto reject; 1370 1397 } 1371 1398 1372 - /* 1373 - * returns 0 if the page was successfully decompressed 1374 - * return -1 on entry not found or error 1375 - */ 1376 - static int zswap_frontswap_load(unsigned type, pgoff_t offset, 1377 - struct page *page, bool *exclusive) 1399 + bool zswap_load(struct folio *folio) 1378 1400 { 1401 + swp_entry_t swp = folio->swap; 1402 + int type = swp_type(swp); 1403 + pgoff_t offset = swp_offset(swp); 1404 + struct page *page = &folio->page; 1379 1405 struct zswap_tree *tree = zswap_trees[type]; 1380 1406 struct zswap_entry *entry; 1381 1407 struct scatterlist input, output; 1382 1408 struct crypto_acomp_ctx *acomp_ctx; 1383 1409 u8 *src, *dst, *tmp; 1410 + struct zpool *zpool; 1384 1411 unsigned int dlen; 1385 - int ret; 1412 + bool ret; 1413 + 1414 + VM_WARN_ON_ONCE(!folio_test_locked(folio)); 1386 1415 1387 1416 /* find */ 1388 1417 spin_lock(&tree->lock); 1389 1418 entry = zswap_entry_find_get(&tree->rbroot, offset); 1390 1419 if (!entry) { 1391 - /* entry was written back */ 1392 1420 spin_unlock(&tree->lock); 1393 - return -1; 1421 + return false; 1394 1422 } 1395 1423 spin_unlock(&tree->lock); 1396 1424 ··· 1398 1426 dst = kmap_atomic(page); 1399 1427 zswap_fill_page(dst, entry->value); 1400 1428 kunmap_atomic(dst); 1401 - ret = 0; 1429 + ret = true; 1402 1430 goto stats; 1403 1431 } 1404 1432 1405 - if (!zpool_can_sleep_mapped(entry->pool->zpool)) { 1433 + zpool = zswap_find_zpool(entry); 1434 + if 
(!zpool_can_sleep_mapped(zpool)) { 1406 1435 tmp = kmalloc(entry->length, GFP_KERNEL); 1407 1436 if (!tmp) { 1408 - ret = -ENOMEM; 1437 + ret = false; 1409 1438 goto freeentry; 1410 1439 } 1411 1440 } 1412 1441 1413 1442 /* decompress */ 1414 1443 dlen = PAGE_SIZE; 1415 - src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO); 1444 + src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO); 1416 1445 1417 - if (!zpool_can_sleep_mapped(entry->pool->zpool)) { 1446 + if (!zpool_can_sleep_mapped(zpool)) { 1418 1447 memcpy(tmp, src, entry->length); 1419 1448 src = tmp; 1420 - zpool_unmap_handle(entry->pool->zpool, entry->handle); 1449 + zpool_unmap_handle(zpool, entry->handle); 1421 1450 } 1422 1451 1423 1452 acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); ··· 1427 1454 sg_init_table(&output, 1); 1428 1455 sg_set_page(&output, page, PAGE_SIZE, 0); 1429 1456 acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen); 1430 - ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); 1457 + if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)) 1458 + WARN_ON(1); 1431 1459 mutex_unlock(acomp_ctx->mutex); 1432 1460 1433 - if (zpool_can_sleep_mapped(entry->pool->zpool)) 1434 - zpool_unmap_handle(entry->pool->zpool, entry->handle); 1461 + if (zpool_can_sleep_mapped(zpool)) 1462 + zpool_unmap_handle(zpool, entry->handle); 1435 1463 else 1436 1464 kfree(tmp); 1437 1465 1438 - BUG_ON(ret); 1466 + ret = true; 1439 1467 stats: 1440 1468 count_vm_event(ZSWPIN); 1441 1469 if (entry->objcg) 1442 1470 count_objcg_event(entry->objcg, ZSWPIN); 1443 1471 freeentry: 1444 1472 spin_lock(&tree->lock); 1445 - if (!ret && zswap_exclusive_loads_enabled) { 1473 + if (ret && zswap_exclusive_loads_enabled) { 1446 1474 zswap_invalidate_entry(tree, entry); 1447 - *exclusive = true; 1475 + folio_mark_dirty(folio); 1448 1476 } else if (entry->length) { 1449 1477 spin_lock(&entry->pool->lru_lock); 1450 1478 
list_move(&entry->lru, &entry->pool->lru); ··· 1457 1483 return ret; 1458 1484 } 1459 1485 1460 - /* frees an entry in zswap */ 1461 - static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset) 1486 + void zswap_invalidate(int type, pgoff_t offset) 1462 1487 { 1463 1488 struct zswap_tree *tree = zswap_trees[type]; 1464 1489 struct zswap_entry *entry; ··· 1474 1501 spin_unlock(&tree->lock); 1475 1502 } 1476 1503 1477 - /* frees all zswap entries for the given swap type */ 1478 - static void zswap_frontswap_invalidate_area(unsigned type) 1504 + void zswap_swapon(int type) 1505 + { 1506 + struct zswap_tree *tree; 1507 + 1508 + tree = kzalloc(sizeof(*tree), GFP_KERNEL); 1509 + if (!tree) { 1510 + pr_err("alloc failed, zswap disabled for swap type %d\n", type); 1511 + return; 1512 + } 1513 + 1514 + tree->rbroot = RB_ROOT; 1515 + spin_lock_init(&tree->lock); 1516 + zswap_trees[type] = tree; 1517 + } 1518 + 1519 + void zswap_swapoff(int type) 1479 1520 { 1480 1521 struct zswap_tree *tree = zswap_trees[type]; 1481 1522 struct zswap_entry *entry, *n; ··· 1506 1519 kfree(tree); 1507 1520 zswap_trees[type] = NULL; 1508 1521 } 1509 - 1510 - static void zswap_frontswap_init(unsigned type) 1511 - { 1512 - struct zswap_tree *tree; 1513 - 1514 - tree = kzalloc(sizeof(*tree), GFP_KERNEL); 1515 - if (!tree) { 1516 - pr_err("alloc failed, zswap disabled for swap type %d\n", type); 1517 - return; 1518 - } 1519 - 1520 - tree->rbroot = RB_ROOT; 1521 - spin_lock_init(&tree->lock); 1522 - zswap_trees[type] = tree; 1523 - } 1524 - 1525 - static const struct frontswap_ops zswap_frontswap_ops = { 1526 - .store = zswap_frontswap_store, 1527 - .load = zswap_frontswap_load, 1528 - .invalidate_page = zswap_frontswap_invalidate_page, 1529 - .invalidate_area = zswap_frontswap_invalidate_area, 1530 - .init = zswap_frontswap_init 1531 - }; 1532 1522 1533 1523 /********************************* 1534 1524 * debugfs functions ··· 1583 1619 pool = __zswap_pool_create_fallback(); 1584 
1620 if (pool) { 1585 1621 pr_info("loaded using pool %s/%s\n", pool->tfm_name, 1586 - zpool_get_type(pool->zpool)); 1622 + zpool_get_type(pool->zpools[0])); 1587 1623 list_add(&pool->list, &zswap_pools); 1588 1624 zswap_has_pool = true; 1589 1625 } else { ··· 1595 1631 if (!shrink_wq) 1596 1632 goto fallback_fail; 1597 1633 1598 - ret = frontswap_register_ops(&zswap_frontswap_ops); 1599 - if (ret) 1600 - goto destroy_wq; 1601 1634 if (zswap_debugfs_init()) 1602 1635 pr_warn("debugfs initialization failed\n"); 1603 1636 zswap_init_state = ZSWAP_INIT_SUCCEED; 1604 1637 return 0; 1605 1638 1606 - destroy_wq: 1607 - destroy_workqueue(shrink_wq); 1608 1639 fallback_fail: 1609 1640 if (pool) 1610 1641 zswap_pool_destroy(pool);
+4 -7
net/ipv4/tcp.c
··· 1742 1742 } 1743 1743 1744 1744 #ifdef CONFIG_MMU 1745 - const struct vm_operations_struct tcp_vm_ops = { 1745 + static const struct vm_operations_struct tcp_vm_ops = { 1746 1746 }; 1747 1747 1748 1748 int tcp_mmap(struct file *file, struct socket *sock, ··· 2045 2045 unsigned long address, 2046 2046 bool *mmap_locked) 2047 2047 { 2048 - struct vm_area_struct *vma = NULL; 2048 + struct vm_area_struct *vma = lock_vma_under_rcu(mm, address); 2049 2049 2050 - #ifdef CONFIG_PER_VMA_LOCK 2051 - vma = lock_vma_under_rcu(mm, address); 2052 - #endif 2053 2050 if (vma) { 2054 - if (!vma_is_tcp(vma)) { 2051 + if (vma->vm_ops != &tcp_vm_ops) { 2055 2052 vma_end_read(vma); 2056 2053 return NULL; 2057 2054 } ··· 2058 2061 2059 2062 mmap_read_lock(mm); 2060 2063 vma = vma_lookup(mm, address); 2061 - if (!vma || !vma_is_tcp(vma)) { 2064 + if (!vma || vma->vm_ops != &tcp_vm_ops) { 2062 2065 mmap_read_unlock(mm); 2063 2066 return NULL; 2064 2067 }
+3 -3
net/netfilter/nf_nat_core.c
··· 327 327 /* If we source map this tuple so reply looks like reply_tuple, will 328 328 * that meet the constraints of range. 329 329 */ 330 - static int in_range(const struct nf_conntrack_tuple *tuple, 330 + static int nf_in_range(const struct nf_conntrack_tuple *tuple, 331 331 const struct nf_nat_range2 *range) 332 332 { 333 333 /* If we are supposed to map IPs, then we must be in the ··· 376 376 &ct->tuplehash[IP_CT_DIR_REPLY].tuple); 377 377 result->dst = tuple->dst; 378 378 379 - if (in_range(result, range)) 379 + if (nf_in_range(result, range)) 380 380 return 1; 381 381 } 382 382 } ··· 607 607 if (maniptype == NF_NAT_MANIP_SRC && 608 608 !(range->flags & NF_NAT_RANGE_PROTO_RANDOM_ALL)) { 609 609 /* try the original tuple first */ 610 - if (in_range(orig_tuple, range)) { 610 + if (nf_in_range(orig_tuple, range)) { 611 611 if (!nf_nat_used_tuple(orig_tuple, ct)) { 612 612 *tuple = *orig_tuple; 613 613 return;
+1 -1
net/tipc/core.h
··· 197 197 return less_eq(left, right) && (mod(right) != mod(left)); 198 198 } 199 199 200 - static inline int in_range(u16 val, u16 min, u16 max) 200 + static inline int tipc_in_range(u16 val, u16 min, u16 max) 201 201 { 202 202 return !less(val, min) && !more(val, max); 203 203 }
+5 -5
net/tipc/link.c
··· 1623 1623 last_ga->bgack_cnt); 1624 1624 } 1625 1625 /* Check against the last Gap ACK block */ 1626 - if (in_range(seqno, start, end)) 1626 + if (tipc_in_range(seqno, start, end)) 1627 1627 continue; 1628 1628 /* Update/release the packet peer is acking */ 1629 1629 bc_has_acked = true; ··· 2251 2251 strncpy(if_name, data, TIPC_MAX_IF_NAME); 2252 2252 2253 2253 /* Update own tolerance if peer indicates a non-zero value */ 2254 - if (in_range(peers_tol, TIPC_MIN_LINK_TOL, TIPC_MAX_LINK_TOL)) { 2254 + if (tipc_in_range(peers_tol, TIPC_MIN_LINK_TOL, TIPC_MAX_LINK_TOL)) { 2255 2255 l->tolerance = peers_tol; 2256 2256 l->bc_rcvlink->tolerance = peers_tol; 2257 2257 } 2258 2258 /* Update own priority if peer's priority is higher */ 2259 - if (in_range(peers_prio, l->priority + 1, TIPC_MAX_LINK_PRI)) 2259 + if (tipc_in_range(peers_prio, l->priority + 1, TIPC_MAX_LINK_PRI)) 2260 2260 l->priority = peers_prio; 2261 2261 2262 2262 /* If peer is going down we want full re-establish cycle */ ··· 2299 2299 l->rcv_nxt_state = msg_seqno(hdr) + 1; 2300 2300 2301 2301 /* Update own tolerance if peer indicates a non-zero value */ 2302 - if (in_range(peers_tol, TIPC_MIN_LINK_TOL, TIPC_MAX_LINK_TOL)) { 2302 + if (tipc_in_range(peers_tol, TIPC_MIN_LINK_TOL, TIPC_MAX_LINK_TOL)) { 2303 2303 l->tolerance = peers_tol; 2304 2304 l->bc_rcvlink->tolerance = peers_tol; 2305 2305 } 2306 2306 /* Update own prio if peer indicates a different value */ 2307 2307 if ((peers_prio != l->priority) && 2308 - in_range(peers_prio, 1, TIPC_MAX_LINK_PRI)) { 2308 + tipc_in_range(peers_prio, 1, TIPC_MAX_LINK_PRI)) { 2309 2309 l->priority = peers_prio; 2310 2310 rc = tipc_link_fsm_evt(l, LINK_FAILURE_EVT); 2311 2311 }
+2 -5
security/selinux/hooks.c
··· 3783 3783 if (default_noexec && 3784 3784 (prot & PROT_EXEC) && !(vma->vm_flags & VM_EXEC)) { 3785 3785 int rc = 0; 3786 - if (vma->vm_start >= vma->vm_mm->start_brk && 3787 - vma->vm_end <= vma->vm_mm->brk) { 3786 + if (vma_is_initial_heap(vma)) { 3788 3787 rc = avc_has_perm(sid, sid, SECCLASS_PROCESS, 3789 3788 PROCESS__EXECHEAP, NULL); 3790 - } else if (!vma->vm_file && 3791 - ((vma->vm_start <= vma->vm_mm->start_stack && 3792 - vma->vm_end >= vma->vm_mm->start_stack) || 3789 + } else if (!vma->vm_file && (vma_is_initial_stack(vma) || 3793 3790 vma_is_stack_for_current(vma))) { 3794 3791 rc = avc_has_perm(sid, sid, SECCLASS_PROCESS, 3795 3792 PROCESS__EXECSTACK, NULL);
+110 -30
tools/testing/radix-tree/maple.c
··· 45 45 unsigned long last[RCU_RANGE_COUNT]; 46 46 }; 47 47 48 + struct rcu_test_struct3 { 49 + struct maple_tree *mt; 50 + unsigned long index; 51 + unsigned long last; 52 + bool stop; 53 + }; 54 + 48 55 struct rcu_reader_struct { 49 56 unsigned int id; 50 57 int mod; ··· 34961 34954 MT_BUG_ON(mt, !vals->seen_entry2); 34962 34955 } 34963 34956 34957 + static void *rcu_slot_store_reader(void *ptr) 34958 + { 34959 + struct rcu_test_struct3 *test = ptr; 34960 + MA_STATE(mas, test->mt, test->index, test->index); 34961 + 34962 + rcu_register_thread(); 34963 + 34964 + rcu_read_lock(); 34965 + while (!test->stop) { 34966 + mas_walk(&mas); 34967 + /* The length of growth to both sides must be equal. */ 34968 + RCU_MT_BUG_ON(test, (test->index - mas.index) != 34969 + (mas.last - test->last)); 34970 + } 34971 + rcu_read_unlock(); 34972 + 34973 + rcu_unregister_thread(); 34974 + return NULL; 34975 + } 34976 + 34977 + static noinline void run_check_rcu_slot_store(struct maple_tree *mt) 34978 + { 34979 + pthread_t readers[20]; 34980 + int range_cnt = 200, i, limit = 10000; 34981 + unsigned long len = ULONG_MAX / range_cnt, start, end; 34982 + struct rcu_test_struct3 test = {.stop = false, .mt = mt}; 34983 + 34984 + start = range_cnt / 2 * len; 34985 + end = start + len - 1; 34986 + test.index = start; 34987 + test.last = end; 34988 + 34989 + for (i = 0; i < range_cnt; i++) { 34990 + mtree_store_range(mt, i * len, i * len + len - 1, 34991 + xa_mk_value(i * 100), GFP_KERNEL); 34992 + } 34993 + 34994 + mt_set_in_rcu(mt); 34995 + MT_BUG_ON(mt, !mt_in_rcu(mt)); 34996 + 34997 + for (i = 0; i < ARRAY_SIZE(readers); i++) { 34998 + if (pthread_create(&readers[i], NULL, rcu_slot_store_reader, 34999 + &test)) { 35000 + perror("creating reader thread"); 35001 + exit(1); 35002 + } 35003 + } 35004 + 35005 + usleep(5); 35006 + 35007 + while (limit--) { 35008 + /* Step by step, expand the most middle range to both sides. 
*/ 35009 + mtree_store_range(mt, --start, ++end, xa_mk_value(100), 35010 + GFP_KERNEL); 35011 + } 35012 + 35013 + test.stop = true; 35014 + 35015 + while (i--) 35016 + pthread_join(readers[i], NULL); 35017 + 35018 + mt_validate(mt); 35019 + } 35020 + 34964 35021 static noinline 34965 35022 void run_check_rcu_slowread(struct maple_tree *mt, struct rcu_test_struct *vals) 34966 35023 { ··· 35277 35206 run_check_rcu(mt, &vals); 35278 35207 mtree_destroy(mt); 35279 35208 35209 + /* Check expanding range in RCU mode */ 35210 + mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE); 35211 + run_check_rcu_slot_store(mt); 35212 + mtree_destroy(mt); 35280 35213 35281 35214 /* Forward writer for rcu stress */ 35282 35215 mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE); ··· 35458 35383 for (i = 0; i <= max; i++) 35459 35384 mtree_test_store_range(mt, i * 10, i * 10 + 5, &i); 35460 35385 35461 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35386 + /* Spanning store */ 35387 + mas_set_range(&mas, 470, 500); 35388 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35462 35389 allocated = mas_allocated(&mas); 35463 35390 height = mas_mt_height(&mas); 35464 35391 MT_BUG_ON(mt, allocated == 0); ··· 35469 35392 allocated = mas_allocated(&mas); 35470 35393 MT_BUG_ON(mt, allocated != 0); 35471 35394 35472 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35395 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35473 35396 allocated = mas_allocated(&mas); 35474 35397 height = mas_mt_height(&mas); 35475 35398 MT_BUG_ON(mt, allocated == 0); 35476 35399 MT_BUG_ON(mt, allocated != 1 + height * 3); 35477 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35400 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35478 35401 mas_destroy(&mas); 35479 35402 allocated = mas_allocated(&mas); 35480 35403 MT_BUG_ON(mt, allocated != 0); 35481 35404 35482 35405 35483 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35406 + MT_BUG_ON(mt, mas_preallocate(&mas, 
ptr, GFP_KERNEL) != 0); 35484 35407 allocated = mas_allocated(&mas); 35485 35408 height = mas_mt_height(&mas); 35486 - MT_BUG_ON(mt, allocated == 0); 35487 35409 MT_BUG_ON(mt, allocated != 1 + height * 3); 35488 35410 mn = mas_pop_node(&mas); 35489 35411 MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1); 35490 35412 mn->parent = ma_parent_ptr(mn); 35491 35413 ma_free_rcu(mn); 35492 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35414 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35493 35415 mas_destroy(&mas); 35494 35416 allocated = mas_allocated(&mas); 35495 35417 MT_BUG_ON(mt, allocated != 0); 35496 35418 35497 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35419 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35498 35420 allocated = mas_allocated(&mas); 35499 35421 height = mas_mt_height(&mas); 35500 - MT_BUG_ON(mt, allocated == 0); 35501 35422 MT_BUG_ON(mt, allocated != 1 + height * 3); 35502 35423 mn = mas_pop_node(&mas); 35503 35424 MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1); 35504 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35425 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35505 35426 mas_destroy(&mas); 35506 35427 allocated = mas_allocated(&mas); 35507 35428 MT_BUG_ON(mt, allocated != 0); 35508 35429 mn->parent = ma_parent_ptr(mn); 35509 35430 ma_free_rcu(mn); 35510 35431 35511 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35432 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35512 35433 allocated = mas_allocated(&mas); 35513 35434 height = mas_mt_height(&mas); 35514 - MT_BUG_ON(mt, allocated == 0); 35515 35435 MT_BUG_ON(mt, allocated != 1 + height * 3); 35516 35436 mn = mas_pop_node(&mas); 35517 35437 MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1); 35518 35438 mas_push_node(&mas, mn); 35519 35439 MT_BUG_ON(mt, mas_allocated(&mas) != allocated); 35520 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35440 + MT_BUG_ON(mt, 
mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35521 35441 mas_destroy(&mas); 35522 35442 allocated = mas_allocated(&mas); 35523 35443 MT_BUG_ON(mt, allocated != 0); 35524 35444 35525 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35445 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35526 35446 allocated = mas_allocated(&mas); 35527 35447 height = mas_mt_height(&mas); 35528 - MT_BUG_ON(mt, allocated == 0); 35529 35448 MT_BUG_ON(mt, allocated != 1 + height * 3); 35530 35449 mas_store_prealloc(&mas, ptr); 35531 35450 MT_BUG_ON(mt, mas_allocated(&mas) != 0); 35532 35451 35533 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35452 + /* Slot store does not need allocations */ 35453 + mas_set_range(&mas, 6, 9); 35454 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35534 35455 allocated = mas_allocated(&mas); 35535 - height = mas_mt_height(&mas); 35536 - MT_BUG_ON(mt, allocated == 0); 35537 - MT_BUG_ON(mt, allocated != 1 + height * 3); 35456 + MT_BUG_ON(mt, allocated != 0); 35538 35457 mas_store_prealloc(&mas, ptr); 35539 35458 MT_BUG_ON(mt, mas_allocated(&mas) != 0); 35540 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35541 - allocated = mas_allocated(&mas); 35542 - height = mas_mt_height(&mas); 35543 - MT_BUG_ON(mt, allocated == 0); 35544 - MT_BUG_ON(mt, allocated != 1 + height * 3); 35545 - mas_store_prealloc(&mas, ptr); 35546 35459 35547 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35460 + mas_set_range(&mas, 6, 10); 35461 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35548 35462 allocated = mas_allocated(&mas); 35549 35463 height = mas_mt_height(&mas); 35550 - MT_BUG_ON(mt, allocated == 0); 35551 - MT_BUG_ON(mt, allocated != 1 + height * 3); 35464 + MT_BUG_ON(mt, allocated != 1); 35465 + mas_store_prealloc(&mas, ptr); 35466 + MT_BUG_ON(mt, mas_allocated(&mas) != 0); 35467 + 35468 + /* Split */ 35469 + mas_set_range(&mas, 54, 54); 35470 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, 
GFP_KERNEL) != 0); 35471 + allocated = mas_allocated(&mas); 35472 + height = mas_mt_height(&mas); 35473 + MT_BUG_ON(mt, allocated != 1 + height * 2); 35552 35474 mas_store_prealloc(&mas, ptr); 35553 35475 MT_BUG_ON(mt, mas_allocated(&mas) != 0); 35554 35476 mt_set_non_kernel(1); 35555 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL & GFP_NOWAIT) == 0); 35477 + /* Spanning store */ 35478 + mas_set_range(&mas, 1, 100); 35479 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL & GFP_NOWAIT) == 0); 35556 35480 allocated = mas_allocated(&mas); 35557 35481 height = mas_mt_height(&mas); 35558 35482 MT_BUG_ON(mt, allocated != 0); 35559 35483 mas_destroy(&mas); 35560 35484 35561 35485 35562 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0); 35486 + /* Spanning store */ 35487 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0); 35563 35488 allocated = mas_allocated(&mas); 35564 35489 height = mas_mt_height(&mas); 35565 35490 MT_BUG_ON(mt, allocated == 0); 35566 35491 MT_BUG_ON(mt, allocated != 1 + height * 3); 35567 35492 mas_store_prealloc(&mas, ptr); 35568 35493 MT_BUG_ON(mt, mas_allocated(&mas) != 0); 35494 + mas_set_range(&mas, 0, 200); 35569 35495 mt_set_non_kernel(1); 35570 - MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL & GFP_NOWAIT) == 0); 35496 + MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL & GFP_NOWAIT) == 0); 35571 35497 allocated = mas_allocated(&mas); 35572 35498 height = mas_mt_height(&mas); 35573 35499 MT_BUG_ON(mt, allocated != 0);
+2 -2
tools/testing/selftests/bpf/progs/get_branch_snapshot.c
··· 15 15 #define ENTRY_CNT 32 16 16 struct perf_branch_entry entries[ENTRY_CNT] = {}; 17 17 18 - static inline bool in_range(__u64 val) 18 + static inline bool gbs_in_range(__u64 val) 19 19 { 20 20 return (val >= address_low) && (val < address_high); 21 21 } ··· 31 31 for (i = 0; i < ENTRY_CNT; i++) { 32 32 if (i >= total_entries) 33 33 break; 34 - if (in_range(entries[i].from) && in_range(entries[i].to)) 34 + if (gbs_in_range(entries[i].from) && gbs_in_range(entries[i].to)) 35 35 test1_hits++; 36 36 else if (!test1_hits) 37 37 wasted_entries++;
+1
tools/testing/selftests/cgroup/.gitignore
··· 5 5 test_kmem 6 6 test_kill 7 7 test_cpu 8 + test_zswap 8 9 wait_inotify
+2
tools/testing/selftests/cgroup/Makefile
··· 12 12 TEST_GEN_PROGS += test_freezer 13 13 TEST_GEN_PROGS += test_kill 14 14 TEST_GEN_PROGS += test_cpu 15 + TEST_GEN_PROGS += test_zswap 15 16 16 17 LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h 17 18 ··· 24 23 $(OUTPUT)/test_freezer: cgroup_util.c 25 24 $(OUTPUT)/test_kill: cgroup_util.c 26 25 $(OUTPUT)/test_cpu: cgroup_util.c 26 + $(OUTPUT)/test_zswap: cgroup_util.c
+7 -14
tools/testing/selftests/cgroup/test_kmem.c
··· 162 162 * allocates some slab memory (mostly negative dentries) using 2 * NR_CPUS 163 163 * threads. Then it checks the sanity of numbers on the parent level: 164 164 * the total size of the cgroups should be roughly equal to 165 - * anon + file + slab + kernel_stack. 165 + * anon + file + kernel + sock. 166 166 */ 167 167 static int test_kmem_memcg_deletion(const char *root) 168 168 { 169 - long current, slab, anon, file, kernel_stack, pagetables, percpu, sock, sum; 169 + long current, anon, file, kernel, sock, sum; 170 170 int ret = KSFT_FAIL; 171 171 char *parent; 172 172 ··· 184 184 goto cleanup; 185 185 186 186 current = cg_read_long(parent, "memory.current"); 187 - slab = cg_read_key_long(parent, "memory.stat", "slab "); 188 187 anon = cg_read_key_long(parent, "memory.stat", "anon "); 189 188 file = cg_read_key_long(parent, "memory.stat", "file "); 190 - kernel_stack = cg_read_key_long(parent, "memory.stat", "kernel_stack "); 191 - pagetables = cg_read_key_long(parent, "memory.stat", "pagetables "); 192 - percpu = cg_read_key_long(parent, "memory.stat", "percpu "); 189 + kernel = cg_read_key_long(parent, "memory.stat", "kernel "); 193 190 sock = cg_read_key_long(parent, "memory.stat", "sock "); 194 - if (current < 0 || slab < 0 || anon < 0 || file < 0 || 195 - kernel_stack < 0 || pagetables < 0 || percpu < 0 || sock < 0) 191 + if (current < 0 || anon < 0 || file < 0 || kernel < 0 || sock < 0) 196 192 goto cleanup; 197 193 198 - sum = slab + anon + file + kernel_stack + pagetables + percpu + sock; 194 + sum = anon + file + kernel + sock; 199 195 if (abs(sum - current) < MAX_VMSTAT_ERROR) { 200 196 ret = KSFT_PASS; 201 197 } else { 202 198 printf("memory.current = %ld\n", current); 203 - printf("slab + anon + file + kernel_stack = %ld\n", sum); 204 - printf("slab = %ld\n", slab); 199 + printf("anon + file + kernel + sock = %ld\n", sum); 205 200 printf("anon = %ld\n", anon); 206 201 printf("file = %ld\n", file); 207 - printf("kernel_stack = %ld\n", 
kernel_stack); 208 - printf("pagetables = %ld\n", pagetables); 209 - printf("percpu = %ld\n", percpu); 202 + printf("kernel = %ld\n", kernel); 210 203 printf("sock = %ld\n", sock); 211 204 } 212 205
+286
tools/testing/selftests/cgroup/test_zswap.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #define _GNU_SOURCE 3 + 4 + #include <linux/limits.h> 5 + #include <unistd.h> 6 + #include <stdio.h> 7 + #include <signal.h> 8 + #include <sys/sysinfo.h> 9 + #include <string.h> 10 + #include <sys/wait.h> 11 + #include <sys/mman.h> 12 + 13 + #include "../kselftest.h" 14 + #include "cgroup_util.h" 15 + 16 + static int read_int(const char *path, size_t *value) 17 + { 18 + FILE *file; 19 + int ret = 0; 20 + 21 + file = fopen(path, "r"); 22 + if (!file) 23 + return -1; 24 + if (fscanf(file, "%ld", value) != 1) 25 + ret = -1; 26 + fclose(file); 27 + return ret; 28 + } 29 + 30 + static int set_min_free_kb(size_t value) 31 + { 32 + FILE *file; 33 + int ret; 34 + 35 + file = fopen("/proc/sys/vm/min_free_kbytes", "w"); 36 + if (!file) 37 + return -1; 38 + ret = fprintf(file, "%ld\n", value); 39 + fclose(file); 40 + return ret; 41 + } 42 + 43 + static int read_min_free_kb(size_t *value) 44 + { 45 + return read_int("/proc/sys/vm/min_free_kbytes", value); 46 + } 47 + 48 + static int get_zswap_stored_pages(size_t *value) 49 + { 50 + return read_int("/sys/kernel/debug/zswap/stored_pages", value); 51 + } 52 + 53 + static int get_zswap_written_back_pages(size_t *value) 54 + { 55 + return read_int("/sys/kernel/debug/zswap/written_back_pages", value); 56 + } 57 + 58 + static int allocate_bytes(const char *cgroup, void *arg) 59 + { 60 + size_t size = (size_t)arg; 61 + char *mem = (char *)malloc(size); 62 + 63 + if (!mem) 64 + return -1; 65 + for (int i = 0; i < size; i += 4095) 66 + mem[i] = 'a'; 67 + free(mem); 68 + return 0; 69 + } 70 + 71 + /* 72 + * When trying to store a memcg page in zswap, if the memcg hits its memory 73 + * limit in zswap, writeback should not be triggered. 74 + * 75 + * This was fixed with commit 0bdf0efa180a("zswap: do not shrink if cgroup may 76 + * not zswap"). Needs to be revised when a per memcg writeback mechanism is 77 + * implemented. 
78 + */ 79 + static int test_no_invasive_cgroup_shrink(const char *root) 80 + { 81 + size_t written_back_before, written_back_after; 82 + int ret = KSFT_FAIL; 83 + char *test_group; 84 + 85 + /* Set up */ 86 + test_group = cg_name(root, "no_shrink_test"); 87 + if (!test_group) 88 + goto out; 89 + if (cg_create(test_group)) 90 + goto out; 91 + if (cg_write(test_group, "memory.max", "1M")) 92 + goto out; 93 + if (cg_write(test_group, "memory.zswap.max", "10K")) 94 + goto out; 95 + if (get_zswap_written_back_pages(&written_back_before)) 96 + goto out; 97 + 98 + /* Allocate 10x memory.max to push memory into zswap */ 99 + if (cg_run(test_group, allocate_bytes, (void *)MB(10))) 100 + goto out; 101 + 102 + /* Verify that no writeback happened because of the memcg allocation */ 103 + if (get_zswap_written_back_pages(&written_back_after)) 104 + goto out; 105 + if (written_back_after == written_back_before) 106 + ret = KSFT_PASS; 107 + out: 108 + cg_destroy(test_group); 109 + free(test_group); 110 + return ret; 111 + } 112 + 113 + struct no_kmem_bypass_child_args { 114 + size_t target_alloc_bytes; 115 + size_t child_allocated; 116 + }; 117 + 118 + static int no_kmem_bypass_child(const char *cgroup, void *arg) 119 + { 120 + struct no_kmem_bypass_child_args *values = arg; 121 + void *allocation; 122 + 123 + allocation = malloc(values->target_alloc_bytes); 124 + if (!allocation) { 125 + values->child_allocated = true; 126 + return -1; 127 + } 128 + for (long i = 0; i < values->target_alloc_bytes; i += 4095) 129 + ((char *)allocation)[i] = 'a'; 130 + values->child_allocated = true; 131 + pause(); 132 + free(allocation); 133 + return 0; 134 + } 135 + 136 + /* 137 + * When pages owned by a memcg are pushed to zswap by kswapd, they should be 138 + * charged to that cgroup. This wasn't the case before commit 139 + * cd08d80ecdac("mm: correctly charge compressed memory to its memcg"). 
140 + * 141 + * The test first allocates memory in a memcg, then raises min_free_kbytes to 142 + * a very high value so that the allocation falls below low wm, then makes 143 + * another allocation to trigger kswapd that should push the memcg-owned pages 144 + * to zswap and verifies that the zswap pages are correctly charged. 145 + * 146 + * To be run on a VM with at most 4G of memory. 147 + */ 148 + static int test_no_kmem_bypass(const char *root) 149 + { 150 + size_t min_free_kb_high, min_free_kb_low, min_free_kb_original; 151 + struct no_kmem_bypass_child_args *values; 152 + size_t trigger_allocation_size; 153 + int wait_child_iteration = 0; 154 + long stored_pages_threshold; 155 + struct sysinfo sys_info; 156 + int ret = KSFT_FAIL; 157 + int child_status; 158 + char *test_group; 159 + pid_t child_pid; 160 + 161 + /* Read sys info and compute test values accordingly */ 162 + if (sysinfo(&sys_info) != 0) 163 + return KSFT_FAIL; 164 + if (sys_info.totalram > 5000000000) 165 + return KSFT_SKIP; 166 + values = mmap(0, sizeof(struct no_kmem_bypass_child_args), PROT_READ | 167 + PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); 168 + if (values == MAP_FAILED) 169 + return KSFT_FAIL; 170 + if (read_min_free_kb(&min_free_kb_original)) 171 + return KSFT_FAIL; 172 + min_free_kb_high = sys_info.totalram / 2000; 173 + min_free_kb_low = sys_info.totalram / 500000; 174 + values->target_alloc_bytes = (sys_info.totalram - min_free_kb_high * 1000) + 175 + sys_info.totalram * 5 / 100; 176 + stored_pages_threshold = sys_info.totalram / 5 / 4096; 177 + trigger_allocation_size = sys_info.totalram / 20; 178 + 179 + /* Set up test memcg */ 180 + if (cg_write(root, "cgroup.subtree_control", "+memory")) 181 + goto out; 182 + test_group = cg_name(root, "kmem_bypass_test"); 183 + if (!test_group) 184 + goto out; 185 + 186 + /* Spawn memcg child and wait for it to allocate */ 187 + set_min_free_kb(min_free_kb_low); 188 + if (cg_create(test_group)) 189 + goto out; 190 + 
values->child_allocated = false; 191 + child_pid = cg_run_nowait(test_group, no_kmem_bypass_child, values); 192 + if (child_pid < 0) 193 + goto out; 194 + while (!values->child_allocated && wait_child_iteration++ < 10000) 195 + usleep(1000); 196 + 197 + /* Try to wakeup kswapd and let it push child memory to zswap */ 198 + set_min_free_kb(min_free_kb_high); 199 + for (int i = 0; i < 20; i++) { 200 + size_t stored_pages; 201 + char *trigger_allocation = malloc(trigger_allocation_size); 202 + 203 + if (!trigger_allocation) 204 + break; 205 + for (int i = 0; i < trigger_allocation_size; i += 4095) 206 + trigger_allocation[i] = 'b'; 207 + usleep(100000); 208 + free(trigger_allocation); 209 + if (get_zswap_stored_pages(&stored_pages)) 210 + break; 211 + if (stored_pages < 0) 212 + break; 213 + /* If memory was pushed to zswap, verify it belongs to memcg */ 214 + if (stored_pages > stored_pages_threshold) { 215 + int zswapped = cg_read_key_long(test_group, "memory.stat", "zswapped "); 216 + int delta = stored_pages * 4096 - zswapped; 217 + int result_ok = delta < stored_pages * 4096 / 4; 218 + 219 + ret = result_ok ? 
KSFT_PASS : KSFT_FAIL; 220 + break; 221 + } 222 + } 223 + 224 + kill(child_pid, SIGTERM); 225 + waitpid(child_pid, &child_status, 0); 226 + out: 227 + set_min_free_kb(min_free_kb_original); 228 + cg_destroy(test_group); 229 + free(test_group); 230 + return ret; 231 + } 232 + 233 + #define T(x) { x, #x } 234 + struct zswap_test { 235 + int (*fn)(const char *root); 236 + const char *name; 237 + } tests[] = { 238 + T(test_no_kmem_bypass), 239 + T(test_no_invasive_cgroup_shrink), 240 + }; 241 + #undef T 242 + 243 + static bool zswap_configured(void) 244 + { 245 + return access("/sys/module/zswap", F_OK) == 0; 246 + } 247 + 248 + int main(int argc, char **argv) 249 + { 250 + char root[PATH_MAX]; 251 + int i, ret = EXIT_SUCCESS; 252 + 253 + if (cg_find_unified_root(root, sizeof(root))) 254 + ksft_exit_skip("cgroup v2 isn't mounted\n"); 255 + 256 + if (!zswap_configured()) 257 + ksft_exit_skip("zswap isn't configured\n"); 258 + 259 + /* 260 + * Check that memory controller is available: 261 + * memory is listed in cgroup.controllers 262 + */ 263 + if (cg_read_strstr(root, "cgroup.controllers", "memory")) 264 + ksft_exit_skip("memory controller isn't available\n"); 265 + 266 + if (cg_read_strstr(root, "cgroup.subtree_control", "memory")) 267 + if (cg_write(root, "cgroup.subtree_control", "+memory")) 268 + ksft_exit_skip("Failed to set memory controller\n"); 269 + 270 + for (i = 0; i < ARRAY_SIZE(tests); i++) { 271 + switch (tests[i].fn(root)) { 272 + case KSFT_PASS: 273 + ksft_test_result_pass("%s\n", tests[i].name); 274 + break; 275 + case KSFT_SKIP: 276 + ksft_test_result_skip("%s\n", tests[i].name); 277 + break; 278 + default: 279 + ret = EXIT_FAILURE; 280 + ksft_test_result_fail("%s\n", tests[i].name); 281 + break; 282 + } 283 + } 284 + 285 + return ret; 286 + }
+6
tools/testing/selftests/damon/sysfs.sh
···
 {
 	tried_regions_dir=$1
 	ensure_dir "$tried_regions_dir" "exist"
+	ensure_file "$tried_regions_dir/total_bytes" "exist" "400"
 }
 
 test_stats()
···
 	ensure_file "$filter_dir/type" "exist" "600"
 	ensure_write_succ "$filter_dir/type" "anon" "valid input"
 	ensure_write_succ "$filter_dir/type" "memcg" "valid input"
+	ensure_write_succ "$filter_dir/type" "addr" "valid input"
+	ensure_write_succ "$filter_dir/type" "target" "valid input"
 	ensure_write_fail "$filter_dir/type" "foo" "invalid input"
 	ensure_file "$filter_dir/matching" "exist" "600"
 	ensure_file "$filter_dir/memcg_path" "exist" "600"
+	ensure_file "$filter_dir/addr_start" "exist" "600"
+	ensure_file "$filter_dir/addr_end" "exist" "600"
+	ensure_file "$filter_dir/damon_target_idx" "exist" "600"
 }
 
 test_filters()
+9
tools/testing/selftests/kselftest.h
···
 
 static inline void ksft_print_header(void)
 {
+	/*
+	 * Force line buffering; If stdout is not connected to a terminal, it
+	 * will otherwise default to fully buffered, which can cause output
+	 * duplication if there is content in the buffer when fork()ing. If
+	 * there is a crash, line buffering also means the most recent output
+	 * line will be visible.
+	 */
+	setvbuf(stdout, NULL, _IOLBF, 0);
+
 	if (!(getenv("KSFT_TAP_LEVEL")))
 		printf("TAP version 13\n");
 }
+5 -2
tools/testing/selftests/kselftest/runner.sh
···
 		echo "# Warning: file $TEST is missing!"
 		echo "not ok $test_num $TEST_HDR_MSG"
 	else
+		if [ -x /usr/bin/stdbuf ]; then
+			stdbuf="/usr/bin/stdbuf --output=L "
+		fi
 		eval kselftest_cmd_args="\$${kselftest_cmd_args_ref:-}"
-		cmd="./$BASENAME_TEST $kselftest_cmd_args"
+		cmd="$stdbuf ./$BASENAME_TEST $kselftest_cmd_args"
 		if [ ! -x "$TEST" ]; then
 			echo "# Warning: file $TEST is not executable"
 
 			if [ $(head -n 1 "$TEST" | cut -c -2) = "#!" ]
 			then
 				interpreter=$(head -n 1 "$TEST" | cut -c 3-)
-				cmd="$interpreter ./$BASENAME_TEST"
+				cmd="$stdbuf $interpreter ./$BASENAME_TEST"
 			else
 				echo "not ok $test_num $TEST_HDR_MSG"
 				return
+275 -72
tools/testing/selftests/memfd/memfd_test.c
··· 18 18 #include <sys/syscall.h> 19 19 #include <sys/wait.h> 20 20 #include <unistd.h> 21 + #include <ctype.h> 21 22 22 23 #include "common.h" 23 24 ··· 44 43 */ 45 44 static size_t mfd_def_size = MFD_DEF_SIZE; 46 45 static const char *memfd_str = MEMFD_STR; 47 - static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *)); 48 46 static int newpid_thread_fn2(void *arg); 49 47 static void join_newpid_thread(pid_t pid); 50 48 ··· 96 96 int fd = open("/proc/sys/vm/memfd_noexec", O_WRONLY | O_CLOEXEC); 97 97 98 98 if (fd < 0) { 99 - printf("open sysctl failed\n"); 99 + printf("open sysctl failed: %m\n"); 100 100 abort(); 101 101 } 102 102 103 103 if (write(fd, val, strlen(val)) < 0) { 104 - printf("write sysctl failed\n"); 104 + printf("write sysctl %s failed: %m\n", val); 105 105 abort(); 106 106 } 107 107 } ··· 111 111 int fd = open("/proc/sys/vm/memfd_noexec", O_WRONLY | O_CLOEXEC); 112 112 113 113 if (fd < 0) { 114 - printf("open sysctl failed\n"); 114 + printf("open sysctl failed: %m\n"); 115 115 abort(); 116 116 } 117 117 118 118 if (write(fd, val, strlen(val)) >= 0) { 119 119 printf("write sysctl %s succeeded, but failure expected\n", 120 120 val); 121 + abort(); 122 + } 123 + } 124 + 125 + static void sysctl_assert_equal(const char *val) 126 + { 127 + char *p, buf[128] = {}; 128 + int fd = open("/proc/sys/vm/memfd_noexec", O_RDONLY | O_CLOEXEC); 129 + 130 + if (fd < 0) { 131 + printf("open sysctl failed: %m\n"); 132 + abort(); 133 + } 134 + 135 + if (read(fd, buf, sizeof(buf)) < 0) { 136 + printf("read sysctl failed: %m\n"); 137 + abort(); 138 + } 139 + 140 + /* Strip trailing whitespace. 
*/ 141 + p = buf; 142 + while (!isspace(*p)) 143 + p++; 144 + *p = '\0'; 145 + 146 + if (strcmp(buf, val) != 0) { 147 + printf("unexpected sysctl value: expected %s, got %s\n", val, buf); 121 148 abort(); 122 149 } 123 150 } ··· 763 736 return 0; 764 737 } 765 738 766 - static pid_t spawn_idle_thread(unsigned int flags) 739 + static pid_t spawn_thread(unsigned int flags, int (*fn)(void *), void *arg) 767 740 { 768 741 uint8_t *stack; 769 742 pid_t pid; ··· 774 747 abort(); 775 748 } 776 749 777 - pid = clone(idle_thread_fn, 778 - stack + STACK_SIZE, 779 - SIGCHLD | flags, 780 - NULL); 750 + pid = clone(fn, stack + STACK_SIZE, SIGCHLD | flags, arg); 781 751 if (pid < 0) { 782 752 printf("clone() failed: %m\n"); 783 753 abort(); 784 754 } 785 755 786 756 return pid; 757 + } 758 + 759 + static void join_thread(pid_t pid) 760 + { 761 + int wstatus; 762 + 763 + if (waitpid(pid, &wstatus, 0) < 0) { 764 + printf("newpid thread: waitpid() failed: %m\n"); 765 + abort(); 766 + } 767 + 768 + if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) != 0) { 769 + printf("newpid thread: exited with non-zero error code %d\n", 770 + WEXITSTATUS(wstatus)); 771 + abort(); 772 + } 773 + 774 + if (WIFSIGNALED(wstatus)) { 775 + printf("newpid thread: killed by signal %d\n", 776 + WTERMSIG(wstatus)); 777 + abort(); 778 + } 779 + } 780 + 781 + static pid_t spawn_idle_thread(unsigned int flags) 782 + { 783 + return spawn_thread(flags, idle_thread_fn, NULL); 787 784 } 788 785 789 786 static void join_idle_thread(pid_t pid) ··· 1162 1111 close(fd); 1163 1112 } 1164 1113 1165 - static void test_sysctl_child(void) 1114 + static void test_sysctl_sysctl0(void) 1166 1115 { 1167 1116 int fd; 1168 - int pid; 1169 1117 1170 - printf("%s sysctl 0\n", memfd_str); 1171 - sysctl_assert_write("0"); 1172 - fd = mfd_assert_new("kern_memfd_sysctl_0", 1118 + sysctl_assert_equal("0"); 1119 + 1120 + fd = mfd_assert_new("kern_memfd_sysctl_0_dfl", 1173 1121 mfd_def_size, 1174 1122 MFD_CLOEXEC | MFD_ALLOW_SEALING); 1123 + 
mfd_assert_mode(fd, 0777); 1124 + mfd_assert_has_seals(fd, 0); 1125 + mfd_assert_chmod(fd, 0644); 1126 + close(fd); 1127 + } 1175 1128 1129 + static void test_sysctl_set_sysctl0(void) 1130 + { 1131 + sysctl_assert_write("0"); 1132 + test_sysctl_sysctl0(); 1133 + } 1134 + 1135 + static void test_sysctl_sysctl1(void) 1136 + { 1137 + int fd; 1138 + 1139 + sysctl_assert_equal("1"); 1140 + 1141 + fd = mfd_assert_new("kern_memfd_sysctl_1_dfl", 1142 + mfd_def_size, 1143 + MFD_CLOEXEC | MFD_ALLOW_SEALING); 1144 + mfd_assert_mode(fd, 0666); 1145 + mfd_assert_has_seals(fd, F_SEAL_EXEC); 1146 + mfd_fail_chmod(fd, 0777); 1147 + close(fd); 1148 + 1149 + fd = mfd_assert_new("kern_memfd_sysctl_1_exec", 1150 + mfd_def_size, 1151 + MFD_CLOEXEC | MFD_EXEC | MFD_ALLOW_SEALING); 1176 1152 mfd_assert_mode(fd, 0777); 1177 1153 mfd_assert_has_seals(fd, 0); 1178 1154 mfd_assert_chmod(fd, 0644); 1179 1155 close(fd); 1180 1156 1181 - printf("%s sysctl 1\n", memfd_str); 1182 - sysctl_assert_write("1"); 1183 - fd = mfd_assert_new("kern_memfd_sysctl_1", 1157 + fd = mfd_assert_new("kern_memfd_sysctl_1_noexec", 1184 1158 mfd_def_size, 1185 - MFD_CLOEXEC | MFD_ALLOW_SEALING); 1186 - 1187 - printf("%s child ns\n", memfd_str); 1188 - pid = spawn_newpid_thread(CLONE_NEWPID, newpid_thread_fn2); 1189 - join_newpid_thread(pid); 1190 - 1159 + MFD_CLOEXEC | MFD_NOEXEC_SEAL | MFD_ALLOW_SEALING); 1191 1160 mfd_assert_mode(fd, 0666); 1192 1161 mfd_assert_has_seals(fd, F_SEAL_EXEC); 1193 1162 mfd_fail_chmod(fd, 0777); 1194 - sysctl_fail_write("0"); 1195 1163 close(fd); 1196 - 1197 - printf("%s sysctl 2\n", memfd_str); 1198 - sysctl_assert_write("2"); 1199 - mfd_fail_new("kern_memfd_sysctl_2", 1200 - MFD_CLOEXEC | MFD_ALLOW_SEALING); 1201 - sysctl_fail_write("0"); 1202 - sysctl_fail_write("1"); 1203 1164 } 1204 1165 1205 - static int newpid_thread_fn(void *arg) 1166 + static void test_sysctl_set_sysctl1(void) 1206 1167 { 1207 - test_sysctl_child(); 1208 - return 0; 1168 + sysctl_assert_write("1"); 1169 + 
test_sysctl_sysctl1(); 1209 1170 } 1210 1171 1211 - static void test_sysctl_child2(void) 1172 + static void test_sysctl_sysctl2(void) 1212 1173 { 1213 1174 int fd; 1214 1175 1215 - sysctl_fail_write("0"); 1216 - fd = mfd_assert_new("kern_memfd_sysctl_1", 1176 + sysctl_assert_equal("2"); 1177 + 1178 + fd = mfd_assert_new("kern_memfd_sysctl_2_dfl", 1217 1179 mfd_def_size, 1218 1180 MFD_CLOEXEC | MFD_ALLOW_SEALING); 1181 + mfd_assert_mode(fd, 0666); 1182 + mfd_assert_has_seals(fd, F_SEAL_EXEC); 1183 + mfd_fail_chmod(fd, 0777); 1184 + close(fd); 1219 1185 1186 + mfd_fail_new("kern_memfd_sysctl_2_exec", 1187 + MFD_CLOEXEC | MFD_EXEC | MFD_ALLOW_SEALING); 1188 + 1189 + fd = mfd_assert_new("kern_memfd_sysctl_2_noexec", 1190 + mfd_def_size, 1191 + MFD_CLOEXEC | MFD_NOEXEC_SEAL | MFD_ALLOW_SEALING); 1220 1192 mfd_assert_mode(fd, 0666); 1221 1193 mfd_assert_has_seals(fd, F_SEAL_EXEC); 1222 1194 mfd_fail_chmod(fd, 0777); 1223 1195 close(fd); 1224 1196 } 1225 1197 1226 - static int newpid_thread_fn2(void *arg) 1198 + static void test_sysctl_set_sysctl2(void) 1227 1199 { 1228 - test_sysctl_child2(); 1200 + sysctl_assert_write("2"); 1201 + test_sysctl_sysctl2(); 1202 + } 1203 + 1204 + static int sysctl_simple_child(void *arg) 1205 + { 1206 + int fd; 1207 + int pid; 1208 + 1209 + printf("%s sysctl 0\n", memfd_str); 1210 + test_sysctl_set_sysctl0(); 1211 + 1212 + printf("%s sysctl 1\n", memfd_str); 1213 + test_sysctl_set_sysctl1(); 1214 + 1215 + printf("%s sysctl 0\n", memfd_str); 1216 + test_sysctl_set_sysctl0(); 1217 + 1218 + printf("%s sysctl 2\n", memfd_str); 1219 + test_sysctl_set_sysctl2(); 1220 + 1221 + printf("%s sysctl 1\n", memfd_str); 1222 + test_sysctl_set_sysctl1(); 1223 + 1224 + printf("%s sysctl 0\n", memfd_str); 1225 + test_sysctl_set_sysctl0(); 1226 + 1229 1227 return 0; 1230 - } 1231 - static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *)) 1232 - { 1233 - uint8_t *stack; 1234 - pid_t pid; 1235 - 1236 - stack = malloc(STACK_SIZE); 1237 - if 
(!stack) { 1238 - printf("malloc(STACK_SIZE) failed: %m\n"); 1239 - abort(); 1240 - } 1241 - 1242 - pid = clone(fn, 1243 - stack + STACK_SIZE, 1244 - SIGCHLD | flags, 1245 - NULL); 1246 - if (pid < 0) { 1247 - printf("clone() failed: %m\n"); 1248 - abort(); 1249 - } 1250 - 1251 - return pid; 1252 - } 1253 - 1254 - static void join_newpid_thread(pid_t pid) 1255 - { 1256 - waitpid(pid, NULL, 0); 1257 1228 } 1258 1229 1259 1230 /* 1260 1231 * Test sysctl 1261 - * A very basic sealing test to see whether setting/retrieving seals works. 1232 + * A very basic test to make sure the core sysctl semantics work. 1262 1233 */ 1263 - static void test_sysctl(void) 1234 + static void test_sysctl_simple(void) 1264 1235 { 1265 - int pid = spawn_newpid_thread(CLONE_NEWPID, newpid_thread_fn); 1236 + int pid = spawn_thread(CLONE_NEWPID, sysctl_simple_child, NULL); 1266 1237 1267 - join_newpid_thread(pid); 1238 + join_thread(pid); 1239 + } 1240 + 1241 + static int sysctl_nested(void *arg) 1242 + { 1243 + void (*fn)(void) = arg; 1244 + 1245 + fn(); 1246 + return 0; 1247 + } 1248 + 1249 + static int sysctl_nested_wait(void *arg) 1250 + { 1251 + /* Wait for a SIGCONT. */ 1252 + kill(getpid(), SIGSTOP); 1253 + return sysctl_nested(arg); 1254 + } 1255 + 1256 + static void test_sysctl_sysctl1_failset(void) 1257 + { 1258 + sysctl_fail_write("0"); 1259 + test_sysctl_sysctl1(); 1260 + } 1261 + 1262 + static void test_sysctl_sysctl2_failset(void) 1263 + { 1264 + sysctl_fail_write("1"); 1265 + test_sysctl_sysctl2(); 1266 + 1267 + sysctl_fail_write("0"); 1268 + test_sysctl_sysctl2(); 1269 + } 1270 + 1271 + static int sysctl_nested_child(void *arg) 1272 + { 1273 + int fd; 1274 + int pid; 1275 + 1276 + printf("%s nested sysctl 0\n", memfd_str); 1277 + sysctl_assert_write("0"); 1278 + /* A further nested pidns works the same. 
*/ 1279 + pid = spawn_thread(CLONE_NEWPID, sysctl_simple_child, NULL); 1280 + join_thread(pid); 1281 + 1282 + printf("%s nested sysctl 1\n", memfd_str); 1283 + sysctl_assert_write("1"); 1284 + /* Child inherits our setting. */ 1285 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested, test_sysctl_sysctl1); 1286 + join_thread(pid); 1287 + /* Child cannot raise the setting. */ 1288 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested, 1289 + test_sysctl_sysctl1_failset); 1290 + join_thread(pid); 1291 + /* Child can lower the setting. */ 1292 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested, 1293 + test_sysctl_set_sysctl2); 1294 + join_thread(pid); 1295 + /* Child lowering the setting has no effect on our setting. */ 1296 + test_sysctl_sysctl1(); 1297 + 1298 + printf("%s nested sysctl 2\n", memfd_str); 1299 + sysctl_assert_write("2"); 1300 + /* Child inherits our setting. */ 1301 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested, test_sysctl_sysctl2); 1302 + join_thread(pid); 1303 + /* Child cannot raise the setting. */ 1304 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested, 1305 + test_sysctl_sysctl2_failset); 1306 + join_thread(pid); 1307 + 1308 + /* Verify that the rules are actually inherited after fork. */ 1309 + printf("%s nested sysctl 0 -> 1 after fork\n", memfd_str); 1310 + sysctl_assert_write("0"); 1311 + 1312 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait, 1313 + test_sysctl_sysctl1_failset); 1314 + sysctl_assert_write("1"); 1315 + kill(pid, SIGCONT); 1316 + join_thread(pid); 1317 + 1318 + printf("%s nested sysctl 0 -> 2 after fork\n", memfd_str); 1319 + sysctl_assert_write("0"); 1320 + 1321 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait, 1322 + test_sysctl_sysctl2_failset); 1323 + sysctl_assert_write("2"); 1324 + kill(pid, SIGCONT); 1325 + join_thread(pid); 1326 + 1327 + /* 1328 + * Verify that the current effective setting is saved on fork, meaning 1329 + * that the parent lowering the sysctl doesn't affect already-forked 1330 + * children. 
1331 + */ 1332 + printf("%s nested sysctl 2 -> 1 after fork\n", memfd_str); 1333 + sysctl_assert_write("2"); 1334 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait, 1335 + test_sysctl_sysctl2); 1336 + sysctl_assert_write("1"); 1337 + kill(pid, SIGCONT); 1338 + join_thread(pid); 1339 + 1340 + printf("%s nested sysctl 2 -> 0 after fork\n", memfd_str); 1341 + sysctl_assert_write("2"); 1342 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait, 1343 + test_sysctl_sysctl2); 1344 + sysctl_assert_write("0"); 1345 + kill(pid, SIGCONT); 1346 + join_thread(pid); 1347 + 1348 + printf("%s nested sysctl 1 -> 0 after fork\n", memfd_str); 1349 + sysctl_assert_write("1"); 1350 + pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait, 1351 + test_sysctl_sysctl1); 1352 + sysctl_assert_write("0"); 1353 + kill(pid, SIGCONT); 1354 + join_thread(pid); 1355 + 1356 + return 0; 1357 + } 1358 + 1359 + /* 1360 + * Test sysctl with nested pid namespaces 1361 + * Make sure that the sysctl nesting semantics work correctly. 1362 + */ 1363 + static void test_sysctl_nested(void) 1364 + { 1365 + int pid = spawn_thread(CLONE_NEWPID, sysctl_nested_child, NULL); 1366 + 1367 + join_thread(pid); 1268 1368 } 1269 1369 1270 1370 /* ··· 1601 1399 test_seal_grow(); 1602 1400 test_seal_resize(); 1603 1401 1402 + test_sysctl_simple(); 1403 + test_sysctl_nested(); 1404 + 1604 1405 test_share_dup("SHARE-DUP", ""); 1605 1406 test_share_mmap("SHARE-MMAP", ""); 1606 1407 test_share_open("SHARE-OPEN", ""); ··· 1617 1412 test_share_open("SHARE-OPEN", SHARED_FT_STR); 1618 1413 test_share_fork("SHARE-FORK", SHARED_FT_STR); 1619 1414 join_idle_thread(pid); 1620 - 1621 - test_sysctl(); 1622 1415 1623 1416 printf("memfd: DONE\n"); 1624 1417
+1
tools/testing/selftests/mm/.gitignore
···
 hugepage-shm
 hugepage-vmemmap
 hugetlb-madvise
+hugetlb-read-hwpoison
 khugepaged
 map_hugetlb
 map_populate
+43 -38
tools/testing/selftests/mm/Makefile
··· 35 35 CFLAGS = -Wall -I $(top_srcdir) $(EXTRA_CFLAGS) $(KHDR_INCLUDES) 36 36 LDLIBS = -lrt -lpthread 37 37 38 - TEST_GEN_PROGS = cow 39 - TEST_GEN_PROGS += compaction_test 40 - TEST_GEN_PROGS += gup_longterm 41 - TEST_GEN_PROGS += gup_test 42 - TEST_GEN_PROGS += hmm-tests 43 - TEST_GEN_PROGS += hugetlb-madvise 44 - TEST_GEN_PROGS += hugepage-mmap 45 - TEST_GEN_PROGS += hugepage-mremap 46 - TEST_GEN_PROGS += hugepage-shm 47 - TEST_GEN_PROGS += hugepage-vmemmap 48 - TEST_GEN_PROGS += khugepaged 49 - TEST_GEN_PROGS += madv_populate 50 - TEST_GEN_PROGS += map_fixed_noreplace 51 - TEST_GEN_PROGS += map_hugetlb 52 - TEST_GEN_PROGS += map_populate 53 - TEST_GEN_PROGS += memfd_secret 54 - TEST_GEN_PROGS += migration 55 - TEST_GEN_PROGS += mkdirty 56 - TEST_GEN_PROGS += mlock-random-test 57 - TEST_GEN_PROGS += mlock2-tests 58 - TEST_GEN_PROGS += mrelease_test 59 - TEST_GEN_PROGS += mremap_dontunmap 60 - TEST_GEN_PROGS += mremap_test 61 - TEST_GEN_PROGS += on-fault-limit 62 - TEST_GEN_PROGS += thuge-gen 63 - TEST_GEN_PROGS += transhuge-stress 64 - TEST_GEN_PROGS += uffd-stress 65 - TEST_GEN_PROGS += uffd-unit-tests 38 + TEST_GEN_FILES = cow 39 + TEST_GEN_FILES += compaction_test 40 + TEST_GEN_FILES += gup_longterm 41 + TEST_GEN_FILES += gup_test 42 + TEST_GEN_FILES += hmm-tests 43 + TEST_GEN_FILES += hugetlb-madvise 44 + TEST_GEN_FILES += hugetlb-read-hwpoison 45 + TEST_GEN_FILES += hugepage-mmap 46 + TEST_GEN_FILES += hugepage-mremap 47 + TEST_GEN_FILES += hugepage-shm 48 + TEST_GEN_FILES += hugepage-vmemmap 49 + TEST_GEN_FILES += khugepaged 50 + TEST_GEN_FILES += madv_populate 51 + TEST_GEN_FILES += map_fixed_noreplace 52 + TEST_GEN_FILES += map_hugetlb 53 + TEST_GEN_FILES += map_populate 54 + TEST_GEN_FILES += memfd_secret 55 + TEST_GEN_FILES += migration 56 + TEST_GEN_FILES += mkdirty 57 + TEST_GEN_FILES += mlock-random-test 58 + TEST_GEN_FILES += mlock2-tests 59 + TEST_GEN_FILES += mrelease_test 60 + TEST_GEN_FILES += mremap_dontunmap 61 + TEST_GEN_FILES += 
mremap_test 62 + TEST_GEN_FILES += on-fault-limit 63 + TEST_GEN_FILES += thuge-gen 64 + TEST_GEN_FILES += transhuge-stress 65 + TEST_GEN_FILES += uffd-stress 66 + TEST_GEN_FILES += uffd-unit-tests 67 + TEST_GEN_FILES += split_huge_page_test 68 + TEST_GEN_FILES += ksm_tests 69 + TEST_GEN_FILES += ksm_functional_tests 70 + TEST_GEN_FILES += mdwe_test 71 + 72 + ifneq ($(ARCH),arm64) 66 73 TEST_GEN_PROGS += soft-dirty 67 - TEST_GEN_PROGS += split_huge_page_test 68 - TEST_GEN_PROGS += ksm_tests 69 - TEST_GEN_PROGS += ksm_functional_tests 70 - TEST_GEN_PROGS += mdwe_test 74 + endif 71 75 72 76 ifeq ($(ARCH),x86_64) 73 77 CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_32bit_program.c -m32) ··· 87 83 endif 88 84 89 85 ifeq ($(CAN_BUILD_I386),1) 90 - TEST_GEN_PROGS += $(BINARIES_32) 86 + TEST_GEN_FILES += $(BINARIES_32) 91 87 endif 92 88 93 89 ifeq ($(CAN_BUILD_X86_64),1) 94 - TEST_GEN_PROGS += $(BINARIES_64) 90 + TEST_GEN_FILES += $(BINARIES_64) 95 91 endif 96 92 else 97 93 98 94 ifneq (,$(findstring $(ARCH),ppc64)) 99 - TEST_GEN_PROGS += protection_keys 95 + TEST_GEN_FILES += protection_keys 100 96 endif 101 97 102 98 endif 103 99 104 100 ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sparc64 x86_64)) 105 - TEST_GEN_PROGS += va_high_addr_switch 106 - TEST_GEN_PROGS += virtual_address_range 107 - TEST_GEN_PROGS += write_to_hugetlbfs 101 + TEST_GEN_FILES += va_high_addr_switch 102 + TEST_GEN_FILES += virtual_address_range 103 + TEST_GEN_FILES += write_to_hugetlbfs 108 104 endif 109 105 110 106 TEST_PROGS := run_vmtests.sh ··· 116 112 include ../lib.mk 117 113 118 114 $(TEST_GEN_PROGS): vm_util.c 115 + $(TEST_GEN_FILES): vm_util.c 119 116 120 117 $(OUTPUT)/uffd-stress: uffd-common.c 121 118 $(OUTPUT)/uffd-unit-tests: uffd-common.c
+322
tools/testing/selftests/mm/hugetlb-read-hwpoison.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #define _GNU_SOURCE 4 + #include <stdlib.h> 5 + #include <stdio.h> 6 + #include <string.h> 7 + 8 + #include <linux/magic.h> 9 + #include <sys/mman.h> 10 + #include <sys/statfs.h> 11 + #include <errno.h> 12 + #include <stdbool.h> 13 + 14 + #include "../kselftest.h" 15 + 16 + #define PREFIX " ... " 17 + #define ERROR_PREFIX " !!! " 18 + 19 + #define MAX_WRITE_READ_CHUNK_SIZE (getpagesize() * 16) 20 + #define MAX(a, b) (((a) > (b)) ? (a) : (b)) 21 + 22 + enum test_status { 23 + TEST_PASSED = 0, 24 + TEST_FAILED = 1, 25 + TEST_SKIPPED = 2, 26 + }; 27 + 28 + static char *status_to_str(enum test_status status) 29 + { 30 + switch (status) { 31 + case TEST_PASSED: 32 + return "TEST_PASSED"; 33 + case TEST_FAILED: 34 + return "TEST_FAILED"; 35 + case TEST_SKIPPED: 36 + return "TEST_SKIPPED"; 37 + default: 38 + return "TEST_???"; 39 + } 40 + } 41 + 42 + static int setup_filemap(char *filemap, size_t len, size_t wr_chunk_size) 43 + { 44 + char iter = 0; 45 + 46 + for (size_t offset = 0; offset < len; 47 + offset += wr_chunk_size) { 48 + iter++; 49 + memset(filemap + offset, iter, wr_chunk_size); 50 + } 51 + 52 + return 0; 53 + } 54 + 55 + static bool verify_chunk(char *buf, size_t len, char val) 56 + { 57 + size_t i; 58 + 59 + for (i = 0; i < len; ++i) { 60 + if (buf[i] != val) { 61 + printf(PREFIX ERROR_PREFIX "check fail: buf[%lu] = %u != %u\n", 62 + i, buf[i], val); 63 + return false; 64 + } 65 + } 66 + 67 + return true; 68 + } 69 + 70 + static bool seek_read_hugepage_filemap(int fd, size_t len, size_t wr_chunk_size, 71 + off_t offset, size_t expected) 72 + { 73 + char buf[MAX_WRITE_READ_CHUNK_SIZE]; 74 + ssize_t ret_count = 0; 75 + ssize_t total_ret_count = 0; 76 + char val = offset / wr_chunk_size + offset % wr_chunk_size; 77 + 78 + printf(PREFIX PREFIX "init val=%u with offset=0x%lx\n", val, offset); 79 + printf(PREFIX PREFIX "expect to read 0x%lx bytes of data in total\n", 80 + expected); 81 + if (lseek(fd, offset, 
SEEK_SET) < 0) { 82 + perror(PREFIX ERROR_PREFIX "seek failed"); 83 + return false; 84 + } 85 + 86 + while (offset + total_ret_count < len) { 87 + ret_count = read(fd, buf, wr_chunk_size); 88 + if (ret_count == 0) { 89 + printf(PREFIX PREFIX "read reach end of the file\n"); 90 + break; 91 + } else if (ret_count < 0) { 92 + perror(PREFIX ERROR_PREFIX "read failed"); 93 + break; 94 + } 95 + ++val; 96 + if (!verify_chunk(buf, ret_count, val)) 97 + return false; 98 + 99 + total_ret_count += ret_count; 100 + } 101 + printf(PREFIX PREFIX "actually read 0x%lx bytes of data in total\n", 102 + total_ret_count); 103 + 104 + return total_ret_count == expected; 105 + } 106 + 107 + static bool read_hugepage_filemap(int fd, size_t len, 108 + size_t wr_chunk_size, size_t expected) 109 + { 110 + char buf[MAX_WRITE_READ_CHUNK_SIZE]; 111 + ssize_t ret_count = 0; 112 + ssize_t total_ret_count = 0; 113 + char val = 0; 114 + 115 + printf(PREFIX PREFIX "expect to read 0x%lx bytes of data in total\n", 116 + expected); 117 + while (total_ret_count < len) { 118 + ret_count = read(fd, buf, wr_chunk_size); 119 + if (ret_count == 0) { 120 + printf(PREFIX PREFIX "read reach end of the file\n"); 121 + break; 122 + } else if (ret_count < 0) { 123 + perror(PREFIX ERROR_PREFIX "read failed"); 124 + break; 125 + } 126 + ++val; 127 + if (!verify_chunk(buf, ret_count, val)) 128 + return false; 129 + 130 + total_ret_count += ret_count; 131 + } 132 + printf(PREFIX PREFIX "actually read 0x%lx bytes of data in total\n", 133 + total_ret_count); 134 + 135 + return total_ret_count == expected; 136 + } 137 + 138 + static enum test_status 139 + test_hugetlb_read(int fd, size_t len, size_t wr_chunk_size) 140 + { 141 + enum test_status status = TEST_SKIPPED; 142 + char *filemap = NULL; 143 + 144 + if (ftruncate(fd, len) < 0) { 145 + perror(PREFIX ERROR_PREFIX "ftruncate failed"); 146 + return status; 147 + } 148 + 149 + filemap = mmap(NULL, len, PROT_READ | PROT_WRITE, 150 + MAP_SHARED | MAP_POPULATE, fd, 0); 
151 + if (filemap == MAP_FAILED) { 152 + perror(PREFIX ERROR_PREFIX "mmap for primary mapping failed"); 153 + goto done; 154 + } 155 + 156 + setup_filemap(filemap, len, wr_chunk_size); 157 + status = TEST_FAILED; 158 + 159 + if (read_hugepage_filemap(fd, len, wr_chunk_size, len)) 160 + status = TEST_PASSED; 161 + 162 + munmap(filemap, len); 163 + done: 164 + if (ftruncate(fd, 0) < 0) { 165 + perror(PREFIX ERROR_PREFIX "ftruncate back to 0 failed"); 166 + status = TEST_FAILED; 167 + } 168 + 169 + return status; 170 + } 171 + 172 + static enum test_status 173 + test_hugetlb_read_hwpoison(int fd, size_t len, size_t wr_chunk_size, 174 + bool skip_hwpoison_page) 175 + { 176 + enum test_status status = TEST_SKIPPED; 177 + char *filemap = NULL; 178 + char *hwp_addr = NULL; 179 + const unsigned long pagesize = getpagesize(); 180 + 181 + if (ftruncate(fd, len) < 0) { 182 + perror(PREFIX ERROR_PREFIX "ftruncate failed"); 183 + return status; 184 + } 185 + 186 + filemap = mmap(NULL, len, PROT_READ | PROT_WRITE, 187 + MAP_SHARED | MAP_POPULATE, fd, 0); 188 + if (filemap == MAP_FAILED) { 189 + perror(PREFIX ERROR_PREFIX "mmap for primary mapping failed"); 190 + goto done; 191 + } 192 + 193 + setup_filemap(filemap, len, wr_chunk_size); 194 + status = TEST_FAILED; 195 + 196 + /* 197 + * Poisoned hugetlb page layout (assume hugepagesize=2MB): 198 + * |<---------------------- 1MB ---------------------->| 199 + * |<---- healthy page ---->|<---- HWPOISON page ----->| 200 + * |<------------------- (1MB - 8KB) ----------------->| 201 + */ 202 + hwp_addr = filemap + len / 2 + pagesize; 203 + if (madvise(hwp_addr, pagesize, MADV_HWPOISON) < 0) { 204 + perror(PREFIX ERROR_PREFIX "MADV_HWPOISON failed"); 205 + goto unmap; 206 + } 207 + 208 + if (!skip_hwpoison_page) { 209 + /* 210 + * Userspace should be able to read (1MB + 1 page) from 211 + * the beginning of the HWPOISONed hugepage. 
212 + */ 213 + if (read_hugepage_filemap(fd, len, wr_chunk_size, 214 + len / 2 + pagesize)) 215 + status = TEST_PASSED; 216 + } else { 217 + /* 218 + * Userspace should be able to read (1MB - 2 pages) from 219 + * HWPOISONed hugepage. 220 + */ 221 + if (seek_read_hugepage_filemap(fd, len, wr_chunk_size, 222 + len / 2 + MAX(2 * pagesize, wr_chunk_size), 223 + len / 2 - MAX(2 * pagesize, wr_chunk_size))) 224 + status = TEST_PASSED; 225 + } 226 + 227 + unmap: 228 + munmap(filemap, len); 229 + done: 230 + if (ftruncate(fd, 0) < 0) { 231 + perror(PREFIX ERROR_PREFIX "ftruncate back to 0 failed"); 232 + status = TEST_FAILED; 233 + } 234 + 235 + return status; 236 + } 237 + 238 + static int create_hugetlbfs_file(struct statfs *file_stat) 239 + { 240 + int fd; 241 + 242 + fd = memfd_create("hugetlb_tmp", MFD_HUGETLB); 243 + if (fd < 0) { 244 + perror(PREFIX ERROR_PREFIX "could not open hugetlbfs file"); 245 + return -1; 246 + } 247 + 248 + memset(file_stat, 0, sizeof(*file_stat)); 249 + if (fstatfs(fd, file_stat)) { 250 + perror(PREFIX ERROR_PREFIX "fstatfs failed"); 251 + goto close; 252 + } 253 + if (file_stat->f_type != HUGETLBFS_MAGIC) { 254 + printf(PREFIX ERROR_PREFIX "not hugetlbfs file\n"); 255 + goto close; 256 + } 257 + 258 + return fd; 259 + close: 260 + close(fd); 261 + return -1; 262 + } 263 + 264 + int main(void) 265 + { 266 + int fd; 267 + struct statfs file_stat; 268 + enum test_status status; 269 + /* Test read() in different granularity. 
*/ 270 + size_t wr_chunk_sizes[] = { 271 + getpagesize() / 2, getpagesize(), 272 + getpagesize() * 2, getpagesize() * 4 273 + }; 274 + size_t i; 275 + 276 + for (i = 0; i < ARRAY_SIZE(wr_chunk_sizes); ++i) { 277 + printf("Write/read chunk size=0x%lx\n", 278 + wr_chunk_sizes[i]); 279 + 280 + fd = create_hugetlbfs_file(&file_stat); 281 + if (fd < 0) 282 + goto create_failure; 283 + printf(PREFIX "HugeTLB read regression test...\n"); 284 + status = test_hugetlb_read(fd, file_stat.f_bsize, 285 + wr_chunk_sizes[i]); 286 + printf(PREFIX "HugeTLB read regression test...%s\n", 287 + status_to_str(status)); 288 + close(fd); 289 + if (status == TEST_FAILED) 290 + return -1; 291 + 292 + fd = create_hugetlbfs_file(&file_stat); 293 + if (fd < 0) 294 + goto create_failure; 295 + printf(PREFIX "HugeTLB read HWPOISON test...\n"); 296 + status = test_hugetlb_read_hwpoison(fd, file_stat.f_bsize, 297 + wr_chunk_sizes[i], false); 298 + printf(PREFIX "HugeTLB read HWPOISON test...%s\n", 299 + status_to_str(status)); 300 + close(fd); 301 + if (status == TEST_FAILED) 302 + return -1; 303 + 304 + fd = create_hugetlbfs_file(&file_stat); 305 + if (fd < 0) 306 + goto create_failure; 307 + printf(PREFIX "HugeTLB seek then read HWPOISON test...\n"); 308 + status = test_hugetlb_read_hwpoison(fd, file_stat.f_bsize, 309 + wr_chunk_sizes[i], true); 310 + printf(PREFIX "HugeTLB seek then read HWPOISON test...%s\n", 311 + status_to_str(status)); 312 + close(fd); 313 + if (status == TEST_FAILED) 314 + return -1; 315 + } 316 + 317 + return 0; 318 + 319 + create_failure: 320 + printf(ERROR_PREFIX "Abort test: failed to create hugetlbfs file\n"); 321 + return -1; 322 + }
+194 -6
tools/testing/selftests/mm/ksm_functional_tests.c
··· 27 27 #define KiB 1024u 28 28 #define MiB (1024 * KiB) 29 29 30 + static int mem_fd; 30 31 static int ksm_fd; 31 32 static int ksm_full_scans_fd; 33 + static int proc_self_ksm_stat_fd; 34 + static int proc_self_ksm_merging_pages_fd; 35 + static int ksm_use_zero_pages_fd; 32 36 static int pagemap_fd; 33 37 static size_t pagesize; 34 38 ··· 61 57 } 62 58 } 63 59 return false; 60 + } 61 + 62 + static long get_my_ksm_zero_pages(void) 63 + { 64 + char buf[200]; 65 + char *substr_ksm_zero; 66 + size_t value_pos; 67 + ssize_t read_size; 68 + unsigned long my_ksm_zero_pages; 69 + 70 + if (!proc_self_ksm_stat_fd) 71 + return 0; 72 + 73 + read_size = pread(proc_self_ksm_stat_fd, buf, sizeof(buf) - 1, 0); 74 + if (read_size < 0) 75 + return -errno; 76 + 77 + buf[read_size] = 0; 78 + 79 + substr_ksm_zero = strstr(buf, "ksm_zero_pages"); 80 + if (!substr_ksm_zero) 81 + return 0; 82 + 83 + value_pos = strcspn(substr_ksm_zero, "0123456789"); 84 + my_ksm_zero_pages = strtol(substr_ksm_zero + value_pos, NULL, 10); 85 + 86 + return my_ksm_zero_pages; 87 + } 88 + 89 + static long get_my_merging_pages(void) 90 + { 91 + char buf[10]; 92 + ssize_t ret; 93 + 94 + if (proc_self_ksm_merging_pages_fd < 0) 95 + return proc_self_ksm_merging_pages_fd; 96 + 97 + ret = pread(proc_self_ksm_merging_pages_fd, buf, sizeof(buf) - 1, 0); 98 + if (ret <= 0) 99 + return -errno; 100 + buf[ret] = 0; 101 + 102 + return strtol(buf, NULL, 10); 64 103 } 65 104 66 105 static long ksm_get_full_scans(void) ··· 138 91 return 0; 139 92 } 140 93 141 - static char *mmap_and_merge_range(char val, unsigned long size, bool use_prctl) 94 + static int ksm_unmerge(void) 95 + { 96 + if (write(ksm_fd, "2", 1) != 1) 97 + return -errno; 98 + return 0; 99 + } 100 + 101 + static char *mmap_and_merge_range(char val, unsigned long size, int prot, 102 + bool use_prctl) 142 103 { 143 104 char *map; 144 105 int ret; 106 + 107 + /* Stabilize accounting by disabling KSM completely. 
*/ 108 + if (ksm_unmerge()) { 109 + ksft_test_result_fail("Disabling (unmerging) KSM failed\n"); 110 + goto unmap; 111 + } 112 + 113 + if (get_my_merging_pages() > 0) { 114 + ksft_test_result_fail("Still pages merged\n"); 115 + goto unmap; 116 + } 145 117 146 118 map = mmap(NULL, size, PROT_READ|PROT_WRITE, 147 119 MAP_PRIVATE|MAP_ANON, -1, 0); ··· 177 111 178 112 /* Make sure each page contains the same values to merge them. */ 179 113 memset(map, val, size); 114 + 115 + if (mprotect(map, size, prot)) { 116 + ksft_test_result_skip("mprotect() failed\n"); 117 + goto unmap; 118 + } 180 119 181 120 if (use_prctl) { 182 121 ret = prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0); ··· 202 131 ksft_test_result_fail("Running KSM failed\n"); 203 132 goto unmap; 204 133 } 134 + 135 + /* 136 + * Check if anything was merged at all. Ignore the zero page that is 137 + * accounted differently (depending on kernel support). 138 + */ 139 + if (val && !get_my_merging_pages()) { 140 + ksft_test_result_fail("No pages got merged\n"); 141 + goto unmap; 142 + } 143 + 205 144 return map; 206 145 unmap: 207 146 munmap(map, size); ··· 225 144 226 145 ksft_print_msg("[RUN] %s\n", __func__); 227 146 228 - map = mmap_and_merge_range(0xcf, size, false); 147 + map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false); 229 148 if (map == MAP_FAILED) 230 149 return; 231 150 ··· 240 159 munmap(map, size); 241 160 } 242 161 162 + static void test_unmerge_zero_pages(void) 163 + { 164 + const unsigned int size = 2 * MiB; 165 + char *map; 166 + unsigned int offs; 167 + unsigned long pages_expected; 168 + 169 + ksft_print_msg("[RUN] %s\n", __func__); 170 + 171 + if (proc_self_ksm_stat_fd < 0) { 172 + ksft_test_result_skip("open(\"/proc/self/ksm_stat\") failed\n"); 173 + return; 174 + } 175 + if (ksm_use_zero_pages_fd < 0) { 176 + ksft_test_result_skip("open \"/sys/kernel/mm/ksm/use_zero_pages\" failed\n"); 177 + return; 178 + } 179 + if (write(ksm_use_zero_pages_fd, "1", 1) != 1) { 180 + 
ksft_test_result_skip("write \"/sys/kernel/mm/ksm/use_zero_pages\" failed\n"); 181 + return; 182 + } 183 + 184 + /* Let KSM deduplicate zero pages. */ 185 + map = mmap_and_merge_range(0x00, size, PROT_READ | PROT_WRITE, false); 186 + if (map == MAP_FAILED) 187 + return; 188 + 189 + /* Check if ksm_zero_pages is updated correctly after KSM merging */ 190 + pages_expected = size / pagesize; 191 + if (pages_expected != get_my_ksm_zero_pages()) { 192 + ksft_test_result_fail("'ksm_zero_pages' updated after merging\n"); 193 + goto unmap; 194 + } 195 + 196 + /* Try to unmerge half of the region */ 197 + if (madvise(map, size / 2, MADV_UNMERGEABLE)) { 198 + ksft_test_result_fail("MADV_UNMERGEABLE failed\n"); 199 + goto unmap; 200 + } 201 + 202 + /* Check if ksm_zero_pages is updated correctly after unmerging */ 203 + pages_expected /= 2; 204 + if (pages_expected != get_my_ksm_zero_pages()) { 205 + ksft_test_result_fail("'ksm_zero_pages' updated after unmerging\n"); 206 + goto unmap; 207 + } 208 + 209 + /* Trigger unmerging of the other half by writing to the pages. */ 210 + for (offs = size / 2; offs < size; offs += pagesize) 211 + *((unsigned int *)&map[offs]) = offs; 212 + 213 + /* Now we should have no zeropages remaining. 
*/ 214 + if (get_my_ksm_zero_pages()) { 215 + ksft_test_result_fail("'ksm_zero_pages' updated after write fault\n"); 216 + goto unmap; 217 + } 218 + 219 + /* Check if ksm zero pages are really unmerged */ 220 + ksft_test_result(!range_maps_duplicates(map, size), 221 + "KSM zero pages were unmerged\n"); 222 + unmap: 223 + munmap(map, size); 224 + } 225 + 243 226 static void test_unmerge_discarded(void) 244 227 { 245 228 const unsigned int size = 2 * MiB; ··· 311 166 312 167 ksft_print_msg("[RUN] %s\n", __func__); 313 168 314 - map = mmap_and_merge_range(0xcf, size, false); 169 + map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false); 315 170 if (map == MAP_FAILED) 316 171 return; 317 172 ··· 343 198 344 199 ksft_print_msg("[RUN] %s\n", __func__); 345 200 346 - map = mmap_and_merge_range(0xcf, size, false); 201 + map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false); 347 202 if (map == MAP_FAILED) 348 203 return; 349 204 ··· 486 341 487 342 ksft_print_msg("[RUN] %s\n", __func__); 488 343 489 - map = mmap_and_merge_range(0xcf, size, true); 344 + map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, true); 490 345 if (map == MAP_FAILED) 491 346 return; 492 347 ··· 501 356 munmap(map, size); 502 357 } 503 358 359 + static void test_prot_none(void) 360 + { 361 + const unsigned int size = 2 * MiB; 362 + char *map; 363 + int i; 364 + 365 + ksft_print_msg("[RUN] %s\n", __func__); 366 + 367 + map = mmap_and_merge_range(0x11, size, PROT_NONE, false); 368 + if (map == MAP_FAILED) 369 + goto unmap; 370 + 371 + /* Store a unique value in each page on one half using ptrace */ 372 + for (i = 0; i < size / 2; i += pagesize) { 373 + lseek(mem_fd, (uintptr_t) map + i, SEEK_SET); 374 + if (write(mem_fd, &i, sizeof(i)) != sizeof(i)) { 375 + ksft_test_result_fail("ptrace write failed\n"); 376 + goto unmap; 377 + } 378 + } 379 + 380 + /* Trigger unsharing on the other half. 
*/ 381 + if (madvise(map + size / 2, size / 2, MADV_UNMERGEABLE)) { 382 + ksft_test_result_fail("MADV_UNMERGEABLE failed\n"); 383 + goto unmap; 384 + } 385 + 386 + ksft_test_result(!range_maps_duplicates(map, size), 387 + "Pages were unmerged\n"); 388 + unmap: 389 + munmap(map, size); 390 + } 391 + 504 392 int main(int argc, char **argv) 505 393 { 506 - unsigned int tests = 5; 394 + unsigned int tests = 7; 507 395 int err; 508 396 509 397 #ifdef __NR_userfaultfd ··· 548 370 549 371 pagesize = getpagesize(); 550 372 373 + mem_fd = open("/proc/self/mem", O_RDWR); 374 + if (mem_fd < 0) 375 + ksft_exit_fail_msg("opening /proc/self/mem failed\n"); 551 376 ksm_fd = open("/sys/kernel/mm/ksm/run", O_RDWR); 552 377 if (ksm_fd < 0) 553 378 ksft_exit_skip("open(\"/sys/kernel/mm/ksm/run\") failed\n"); ··· 560 379 pagemap_fd = open("/proc/self/pagemap", O_RDONLY); 561 380 if (pagemap_fd < 0) 562 381 ksft_exit_skip("open(\"/proc/self/pagemap\") failed\n"); 382 + proc_self_ksm_stat_fd = open("/proc/self/ksm_stat", O_RDONLY); 383 + proc_self_ksm_merging_pages_fd = open("/proc/self/ksm_merging_pages", 384 + O_RDONLY); 385 + ksm_use_zero_pages_fd = open("/sys/kernel/mm/ksm/use_zero_pages", O_RDWR); 563 386 564 387 test_unmerge(); 388 + test_unmerge_zero_pages(); 565 389 test_unmerge_discarded(); 566 390 #ifdef __NR_userfaultfd 567 391 test_unmerge_uffd_wp(); 568 392 #endif 393 + 394 + test_prot_none(); 569 395 570 396 test_prctl(); 571 397 test_prctl_fork();
+24 -2
tools/testing/selftests/mm/madv_populate.c
··· 264 264 munmap(addr, SIZE); 265 265 } 266 266 267 + static int system_has_softdirty(void) 268 + { 269 + /* 270 + * There is no way to check if the kernel supports soft-dirty, other 271 + * than by writing to a page and seeing if the bit was set. But the 272 + * tests are intended to check that the bit gets set when it should, so 273 + * doing that check would turn a potentially legitimate fail into a 274 + * skip. Fortunately, we know for sure that arm64 does not support 275 + * soft-dirty. So for now, let's just use the arch as a coarse guide. 276 + */ 277 + #if defined(__aarch64__) 278 + return 0; 279 + #else 280 + return 1; 281 + #endif 282 + } 283 + 267 284 int main(int argc, char **argv) 268 285 { 286 + int nr_tests = 16; 269 287 int err; 270 288 271 289 pagesize = getpagesize(); 272 290 291 + if (system_has_softdirty()) 292 + nr_tests += 5; 293 + 273 294 ksft_print_header(); 274 - ksft_set_plan(21); 295 + ksft_set_plan(nr_tests); 275 296 276 297 sense_support(); 277 298 test_prot_read(); ··· 300 279 test_holes(); 301 280 test_populate_read(); 302 281 test_populate_write(); 303 - test_softdirty(); 282 + if (system_has_softdirty()) 283 + test_softdirty(); 304 284 305 285 err = ksft_get_fail_cnt(); 306 286 if (err)
+1 -1
tools/testing/selftests/mm/map_populate.c
··· 77 77 unsigned long *smap; 78 78 79 79 ftmp = tmpfile(); 80 - BUG_ON(ftmp == 0, "tmpfile()"); 80 + BUG_ON(!ftmp, "tmpfile()"); 81 81 82 82 ret = ftruncate(fileno(ftmp), MMAP_SZ); 83 83 BUG_ON(ret, "ftruncate()");
+9 -3
tools/testing/selftests/mm/migration.c
··· 10 10 #include <numa.h> 11 11 #include <numaif.h> 12 12 #include <sys/mman.h> 13 + #include <sys/prctl.h> 13 14 #include <sys/types.h> 14 15 #include <signal.h> 15 16 #include <time.h> 16 17 17 18 #define TWOMEG (2<<20) 18 - #define RUNTIME (60) 19 + #define RUNTIME (20) 19 20 20 21 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) 21 22 ··· 156 155 memset(ptr, 0xde, TWOMEG); 157 156 for (i = 0; i < self->nthreads - 1; i++) { 158 157 pid = fork(); 159 - if (!pid) 158 + if (!pid) { 159 + prctl(PR_SET_PDEATHSIG, SIGHUP); 160 + /* Parent may have died before prctl so check now. */ 161 + if (getppid() == 1) 162 + kill(getpid(), SIGHUP); 160 163 access_mem(ptr); 161 - else 164 + } else { 162 165 self->pids[i] = pid; 166 + } 163 167 } 164 168 165 169 ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+1
tools/testing/selftests/mm/mrelease_test.c
··· 7 7 #include <stdbool.h> 8 8 #include <stdio.h> 9 9 #include <stdlib.h> 10 + #include <sys/syscall.h> 10 11 #include <sys/wait.h> 11 12 #include <unistd.h> 12 13 #include <asm-generic/unistd.h>
+71 -9
tools/testing/selftests/mm/run_vmtests.sh
··· 12 12 13 13 usage() { 14 14 cat <<EOF 15 - usage: ${BASH_SOURCE[0]:-$0} [ -h | -t "<categories>"] 15 + usage: ${BASH_SOURCE[0]:-$0} [ options ] 16 + 17 + -a: run all tests, including extra ones 16 18 -t: specify specific categories to tests to run 17 19 -h: display this message 18 20 19 - The default behavior is to run all tests. 21 + The default behavior is to run required tests only. If -a is specified, 22 + will run all tests. 20 23 21 24 Alternatively, specific groups tests can be run by passing a string 22 25 to the -t argument containing one or more of the following categories ··· 58 55 test soft dirty page bit semantics 59 56 - cow 60 57 test copy-on-write semantics 58 + - thp 59 + test transparent huge pages 60 + - migration 61 + invoke move_pages(2) to exercise the migration entry code 62 + paths in the kernel 63 + - mkdirty 64 + test handling of code that might set PTE/PMD dirty in 65 + read-only VMAs 66 + - mdwe 67 + test prctl(PR_SET_MDWE, ...) 68 + 61 69 example: ./run_vmtests.sh -t "hmm mmap ksm" 62 70 EOF 63 71 exit 0 64 72 } 65 73 74 + RUN_ALL=false 66 75 67 - while getopts "ht:" OPT; do 76 + while getopts "aht:" OPT; do 68 77 case ${OPT} in 78 + "a") RUN_ALL=true ;; 69 79 "h") usage ;; 70 80 "t") VM_SELFTEST_ITEMS=${OPTARG} ;; 71 81 esac ··· 99 83 else 100 84 return 1 101 85 fi 86 + } 87 + 88 + run_gup_matrix() { 89 + # -t: thp=on, -T: thp=off, -H: hugetlb=on 90 + local hugetlb_mb=$(( needmem_KB / 1024 )) 91 + 92 + for huge in -t -T "-H -m $hugetlb_mb"; do 93 + # -u: gup-fast, -U: gup-basic, -a: pin-fast, -b: pin-basic, -L: pin-longterm 94 + for test_cmd in -u -U -a -b -L; do 95 + # -w: write=1, -W: write=0 96 + for write in -w -W; do 97 + # -S: shared 98 + for share in -S " "; do 99 + # -n: How many pages to fetch together? 
512 is special 100 + # because it's default thp size (or 2M on x86), 123 to 101 + # just test partial gup when hit a huge in whatever form 102 + for num in "-n 1" "-n 512" "-n 123"; do 103 + CATEGORY="gup_test" run_test ./gup_test \ 104 + $huge $test_cmd $write $share $num 105 + done 106 + done 107 + done 108 + done 109 + done 102 110 } 103 111 104 112 # get huge pagesize and freepages from /proc/meminfo ··· 229 189 230 190 CATEGORY="mmap" run_test ./map_fixed_noreplace 231 191 232 - # get_user_pages_fast() benchmark 233 - CATEGORY="gup_test" run_test ./gup_test -u 234 - # pin_user_pages_fast() benchmark 235 - CATEGORY="gup_test" run_test ./gup_test -a 192 + if $RUN_ALL; then 193 + run_gup_matrix 194 + else 195 + # get_user_pages_fast() benchmark 196 + CATEGORY="gup_test" run_test ./gup_test -u 197 + # pin_user_pages_fast() benchmark 198 + CATEGORY="gup_test" run_test ./gup_test -a 199 + fi 236 200 # Dump pages 0, 19, and 4096, using pin_user_pages: 237 201 CATEGORY="gup_test" run_test ./gup_test -ct -F 0x1 0 19 0x1000 238 - 239 202 CATEGORY="gup_test" run_test ./gup_longterm 240 203 241 204 CATEGORY="userfaultfd" run_test ./uffd-unit-tests ··· 305 262 306 263 CATEGORY="memfd_secret" run_test ./memfd_secret 307 264 265 + # KSM KSM_MERGE_TIME_HUGE_PAGES test with size of 100 266 + CATEGORY="ksm" run_test ./ksm_tests -H -s 100 267 + # KSM KSM_MERGE_TIME test with size of 100 268 + CATEGORY="ksm" run_test ./ksm_tests -P -s 100 308 269 # KSM MADV_MERGEABLE test with 10 identical pages 309 270 CATEGORY="ksm" run_test ./ksm_tests -M -p 10 310 271 # KSM unmerge test ··· 337 290 CATEGORY="pkey" run_test ./protection_keys_64 338 291 fi 339 292 340 - CATEGORY="soft_dirty" run_test ./soft-dirty 293 + if [ -x ./soft-dirty ] 294 + then 295 + CATEGORY="soft_dirty" run_test ./soft-dirty 296 + fi 341 297 342 298 # COW tests 343 299 CATEGORY="cow" run_test ./cow 300 + 301 + CATEGORY="thp" run_test ./khugepaged 302 + 303 + CATEGORY="thp" run_test ./transhuge-stress -d 20 304 + 305 + 
CATEGORY="thp" run_test ./split_huge_page_test 306 + 307 + CATEGORY="migration" run_test ./migration 308 + 309 + CATEGORY="mkdirty" run_test ./mkdirty 310 + 311 + CATEGORY="mdwe" run_test ./mdwe_test 344 312 345 313 echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" 346 314
+1 -1
tools/testing/selftests/mm/settings
··· 1 - timeout=45 1 + timeout=180
+2 -2
tools/testing/selftests/mm/thuge-gen.c
··· 139 139 before, after, before - after, size); 140 140 assert(size == getpagesize() || (before - after) == NUM_PAGES); 141 141 show(size); 142 - err = munmap(map, size); 142 + err = munmap(map, size * NUM_PAGES); 143 143 assert(!err); 144 144 } 145 145 ··· 222 222 test_mmap(ps, MAP_HUGETLB | arg); 223 223 } 224 224 printf("Testing default huge mmap\n"); 225 - test_mmap(default_hps, SHM_HUGETLB); 225 + test_mmap(default_hps, MAP_HUGETLB); 226 226 227 227 puts("Testing non-huge shmget"); 228 228 test_shmget(getpagesize(), 0);
+10 -2
tools/testing/selftests/mm/transhuge-stress.c
··· 25 25 { 26 26 size_t ram, len; 27 27 void *ptr, *p; 28 - struct timespec a, b; 28 + struct timespec start, a, b; 29 29 int i = 0; 30 30 char *name = NULL; 31 31 double s; 32 32 uint8_t *map; 33 33 size_t map_len; 34 34 int pagemap_fd; 35 + int duration = 0; 35 36 36 37 ram = sysconf(_SC_PHYS_PAGES); 37 38 if (ram > SIZE_MAX / psize() / 4) ··· 43 42 44 43 while (++i < argc) { 45 44 if (!strcmp(argv[i], "-h")) 46 - errx(1, "usage: %s [size in MiB]", argv[0]); 45 + errx(1, "usage: %s [-f <filename>] [-d <duration>] [size in MiB]", argv[0]); 47 46 else if (!strcmp(argv[i], "-f")) 48 47 name = argv[++i]; 48 + else if (!strcmp(argv[i], "-d")) 49 + duration = atoi(argv[++i]); 49 50 else 50 51 len = atoll(argv[i]) << 20; 51 52 } ··· 80 77 map = malloc(map_len); 81 78 if (!map) 82 79 errx(2, "map malloc"); 80 + 81 + clock_gettime(CLOCK_MONOTONIC, &start); 83 82 84 83 while (1) { 85 84 int nr_succeed = 0, nr_failed = 0, nr_pages = 0; ··· 123 118 "%4d succeed, %4d failed, %4d different pages", 124 119 s, s * 1000 / (len >> HPAGE_SHIFT), len / s / (1 << 20), 125 120 nr_succeed, nr_failed, nr_pages); 121 + 122 + if (duration > 0 && b.tv_sec - start.tv_sec >= duration) 123 + return 0; 126 124 } 127 125 }
+4 -1
tools/testing/selftests/mm/uffd-common.c
··· 499 499 int ret; 500 500 char tmp_chr; 501 501 502 + if (!args->handle_fault) 503 + args->handle_fault = uffd_handle_page_fault; 504 + 502 505 pollfd[0].fd = uffd; 503 506 pollfd[0].events = POLLIN; 504 507 pollfd[1].fd = pipefd[cpu*2]; ··· 530 527 err("unexpected msg event %u\n", msg.event); 531 528 break; 532 529 case UFFD_EVENT_PAGEFAULT: 533 - uffd_handle_page_fault(&msg, args); 530 + args->handle_fault(&msg, args); 534 531 break; 535 532 case UFFD_EVENT_FORK: 536 533 close(uffd);
+3
tools/testing/selftests/mm/uffd-common.h
··· 77 77 unsigned long missing_faults; 78 78 unsigned long wp_faults; 79 79 unsigned long minor_faults; 80 + 81 + /* A custom fault handler; defaults to uffd_handle_page_fault. */ 82 + void (*handle_fault)(struct uffd_msg *msg, struct uffd_args *args); 80 83 }; 81 84 82 85 struct uffd_test_ops {
+16 -16
tools/testing/selftests/mm/uffd-stress.c
··· 53 53 do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0) 54 54 55 55 const char *examples = 56 - "# Run anonymous memory test on 100MiB region with 99999 bounces:\n" 57 - "./userfaultfd anon 100 99999\n\n" 58 - "# Run share memory test on 1GiB region with 99 bounces:\n" 59 - "./userfaultfd shmem 1000 99\n\n" 60 - "# Run hugetlb memory test on 256MiB region with 50 bounces:\n" 61 - "./userfaultfd hugetlb 256 50\n\n" 62 - "# Run the same hugetlb test but using private file:\n" 63 - "./userfaultfd hugetlb-private 256 50\n\n" 64 - "# 10MiB-~6GiB 999 bounces anonymous test, " 65 - "continue forever unless an error triggers\n" 66 - "while ./userfaultfd anon $[RANDOM % 6000 + 10] 999; do true; done\n\n"; 56 + "# Run anonymous memory test on 100MiB region with 99999 bounces:\n" 57 + "./uffd-stress anon 100 99999\n\n" 58 + "# Run share memory test on 1GiB region with 99 bounces:\n" 59 + "./uffd-stress shmem 1000 99\n\n" 60 + "# Run hugetlb memory test on 256MiB region with 50 bounces:\n" 61 + "./uffd-stress hugetlb 256 50\n\n" 62 + "# Run the same hugetlb test but using private file:\n" 63 + "./uffd-stress hugetlb-private 256 50\n\n" 64 + "# 10MiB-~6GiB 999 bounces anonymous test, " 65 + "continue forever unless an error triggers\n" 66 + "while ./uffd-stress anon $[RANDOM % 6000 + 10] 999; do true; done\n\n"; 67 67 68 68 static void usage(void) 69 69 { 70 - fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces>\n\n"); 70 + fprintf(stderr, "\nUsage: ./uffd-stress <test type> <MiB> <bounces>\n\n"); 71 71 fprintf(stderr, "Supported <test type>: anon, hugetlb, " 72 72 "hugetlb-private, shmem, shmem-private\n\n"); 73 73 fprintf(stderr, "Examples:\n\n"); ··· 189 189 locking_thread, (void *)cpu)) 190 190 return 1; 191 191 if (bounces & BOUNCE_POLL) { 192 - if (pthread_create(&uffd_threads[cpu], &attr, 193 - uffd_poll_thread, 194 - (void *)&args[cpu])) 195 - return 1; 192 + if (pthread_create(&uffd_threads[cpu], &attr, uffd_poll_thread, &args[cpu])) 
193 + err("uffd_poll_thread create"); 196 194 } else { 197 195 if (pthread_create(&uffd_threads[cpu], &attr, 198 196 uffd_read_thread, ··· 247 249 unsigned long nr; 248 250 struct uffd_args args[nr_cpus]; 249 251 uint64_t mem_size = nr_pages * page_size; 252 + 253 + memset(args, 0, sizeof(struct uffd_args) * nr_cpus); 250 254 251 255 if (uffd_test_ctx_init(UFFD_FEATURE_WP_UNPOPULATED, NULL)) 252 256 err("context init failed");
+117
tools/testing/selftests/mm/uffd-unit-tests.c
··· 951 951 uffd_test_pass(); 952 952 } 953 953 954 + static void uffd_register_poison(int uffd, void *addr, uint64_t len) 955 + { 956 + uint64_t ioctls = 0; 957 + uint64_t expected = (1 << _UFFDIO_COPY) | (1 << _UFFDIO_POISON); 958 + 959 + if (uffd_register_with_ioctls(uffd, addr, len, true, 960 + false, false, &ioctls)) 961 + err("poison register fail"); 962 + 963 + if ((ioctls & expected) != expected) 964 + err("registered area doesn't support COPY and POISON ioctls"); 965 + } 966 + 967 + static void do_uffdio_poison(int uffd, unsigned long offset) 968 + { 969 + struct uffdio_poison uffdio_poison = { 0 }; 970 + int ret; 971 + __s64 res; 972 + 973 + uffdio_poison.range.start = (unsigned long) area_dst + offset; 974 + uffdio_poison.range.len = page_size; 975 + uffdio_poison.mode = 0; 976 + ret = ioctl(uffd, UFFDIO_POISON, &uffdio_poison); 977 + res = uffdio_poison.updated; 978 + 979 + if (ret) 980 + err("UFFDIO_POISON error: %"PRId64, (int64_t)res); 981 + else if (res != page_size) 982 + err("UFFDIO_POISON unexpected size: %"PRId64, (int64_t)res); 983 + } 984 + 985 + static void uffd_poison_handle_fault( 986 + struct uffd_msg *msg, struct uffd_args *args) 987 + { 988 + unsigned long offset; 989 + 990 + if (msg->event != UFFD_EVENT_PAGEFAULT) 991 + err("unexpected msg event %u", msg->event); 992 + 993 + if (msg->arg.pagefault.flags & 994 + (UFFD_PAGEFAULT_FLAG_WP | UFFD_PAGEFAULT_FLAG_MINOR)) 995 + err("unexpected fault type %llu", msg->arg.pagefault.flags); 996 + 997 + offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst; 998 + offset &= ~(page_size-1); 999 + 1000 + /* Odd pages -> copy zeroed page; even pages -> poison. 
*/ 1001 + if (offset & page_size) 1002 + copy_page(uffd, offset, false); 1003 + else 1004 + do_uffdio_poison(uffd, offset); 1005 + } 1006 + 1007 + static void uffd_poison_test(uffd_test_args_t *targs) 1008 + { 1009 + pthread_t uffd_mon; 1010 + char c; 1011 + struct uffd_args args = { 0 }; 1012 + struct sigaction act = { 0 }; 1013 + unsigned long nr_sigbus = 0; 1014 + unsigned long nr; 1015 + 1016 + fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); 1017 + 1018 + uffd_register_poison(uffd, area_dst, nr_pages * page_size); 1019 + memset(area_src, 0, nr_pages * page_size); 1020 + 1021 + args.handle_fault = uffd_poison_handle_fault; 1022 + if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &args)) 1023 + err("uffd_poll_thread create"); 1024 + 1025 + sigbuf = &jbuf; 1026 + act.sa_sigaction = sighndl; 1027 + act.sa_flags = SA_SIGINFO; 1028 + if (sigaction(SIGBUS, &act, 0)) 1029 + err("sigaction"); 1030 + 1031 + for (nr = 0; nr < nr_pages; ++nr) { 1032 + unsigned long offset = nr * page_size; 1033 + const char *bytes = (const char *) area_dst + offset; 1034 + const char *i; 1035 + 1036 + if (sigsetjmp(*sigbuf, 1)) { 1037 + /* 1038 + * Access below triggered a SIGBUS, which was caught by 1039 + * sighndl, which then jumped here. Count this SIGBUS, 1040 + * and move on to next page. 1041 + */ 1042 + ++nr_sigbus; 1043 + continue; 1044 + } 1045 + 1046 + for (i = bytes; i < bytes + page_size; ++i) { 1047 + if (*i) 1048 + err("nonzero byte in area_dst (%p) at %p: %u", 1049 + area_dst, i, *i); 1050 + } 1051 + } 1052 + 1053 + if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) 1054 + err("pipe write"); 1055 + if (pthread_join(uffd_mon, NULL)) 1056 + err("pthread_join()"); 1057 + 1058 + if (nr_sigbus != nr_pages / 2) 1059 + err("expected to receive %lu SIGBUS, actually received %lu", 1060 + nr_pages / 2, nr_sigbus); 1061 + 1062 + uffd_test_pass(); 1063 + } 1064 + 954 1065 /* 955 1066 * Test the returned uffdio_register.ioctls with different register modes. 
956 1067 * Note that _UFFDIO_ZEROPAGE is tested separately in the zeropage test. ··· 1236 1125 UFFD_FEATURE_EVENT_REMAP | UFFD_FEATURE_EVENT_REMOVE | 1237 1126 UFFD_FEATURE_PAGEFAULT_FLAG_WP | 1238 1127 UFFD_FEATURE_WP_HUGETLBFS_SHMEM, 1128 + }, 1129 + { 1130 + .name = "poison", 1131 + .uffd_fn = uffd_poison_test, 1132 + .mem_targets = MEM_ALL, 1133 + .uffd_feature_required = UFFD_FEATURE_POISON, 1239 1134 }, 1240 1135 }; 1241 1136
+1 -1
tools/testing/selftests/mm/va_high_addr_switch.c
··· 292 292 #elif defined(__x86_64__) 293 293 return 1; 294 294 #elif defined(__aarch64__) 295 - return 1; 295 + return getpagesize() == PAGE_SIZE; 296 296 #else 297 297 return 0; 298 298 #endif
+2 -2
tools/testing/selftests/proc/proc-empty-vm.c
··· 77 77 "Swap: 0 kB\n" 78 78 "SwapPss: 0 kB\n" 79 79 "Locked: 0 kB\n" 80 - "THPeligible: 0\n" 80 + "THPeligible: 0\n" 81 81 /* 82 82 * "ProtectionKey:" field is conditional. It is possible to check it as well, 83 83 * but I don't have such machine. ··· 107 107 "Swap: 0 kB\n" 108 108 "SwapPss: 0 kB\n" 109 109 "Locked: 0 kB\n" 110 - "THPeligible: 0\n" 110 + "THPeligible: 0\n" 111 111 /* 112 112 * "ProtectionKey:" field is conditional. It is possible to check it as well, 113 113 * but I'm too tired.
+12 -1
virt/kvm/kvm_main.c
··· 2517 2517 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, 2518 2518 bool interruptible, bool *writable, kvm_pfn_t *pfn) 2519 2519 { 2520 - unsigned int flags = FOLL_HWPOISON; 2520 + /* 2521 + * When a VCPU accesses a page that is not mapped into the secondary 2522 + * MMU, we lookup the page using GUP to map it, so the guest VCPU can 2523 + * make progress. We always want to honor NUMA hinting faults in that 2524 + * case, because GUP usage corresponds to memory accesses from the VCPU. 2525 + * Otherwise, we'd not trigger NUMA hinting faults once a page is 2526 + * mapped into the secondary MMU and gets accessed by a VCPU. 2527 + * 2528 + * Note that get_user_page_fast_only() and FOLL_WRITE for now 2529 + * implicitly honor NUMA hinting faults and don't need this flag. 2530 + */ 2531 + unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT; 2521 2532 struct page *page; 2522 2533 int npages; 2523 2534