
mm/zone_device: support large zone device private folios

Patch series "mm: support device-private THP", v7.

This patch series introduces support for Transparent Huge Page (THP)
migration in zone device-private memory. The implementation enables
efficient migration of large folios between system memory and
device-private memory.

Background

The current zone device-private memory implementation supports only
PAGE_SIZE granularity, leading to:
- Increased TLB pressure
- Inefficient migration between CPU and device memory

This series extends the existing zone device-private infrastructure to
support THP, leading to:
- Reduced page table overhead
- Improved memory bandwidth utilization
- Seamless fallback to base pages when needed

In my local testing (using lib/test_hmm) and a throughput test, the series
shows a 350% improvement in data transfer throughput and an 80% improvement
in latency.

These patches build on the earlier posts by Ralph Campbell [1].

Two new flags are added to migrate_vma to select and mark compound
pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as an argument.

The series also adds zone device awareness to (m)THP pages, along with
fault handling of large zone device-private pages. The page vma walk and
the rmap code are also zone device aware. Support has also been added for
folios that need to be split in the middle of migration (when the src and
dst do not agree on MIGRATE_PFN_COMPOUND); this occurs when the src side
of the migration can migrate large pages but the destination has not been
able to allocate large pages. The code also supports folio_split() when
migrating THP pages; this is used when MIGRATE_VMA_SELECT_COMPOUND is not
passed as an argument to migrate_vma_setup().
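
The interplay of the two flags can be sketched in plain C. The bit values
below are placeholders for illustration (the real definitions live in
include/linux/migrate.h), and the helpers model the decisions described
above rather than the kernel's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions are assumed for illustration only; see
 * include/linux/migrate.h for the real definitions. */
#define MIGRATE_PFN_COMPOUND		(1UL << 4)	/* per-entry flag */
#define MIGRATE_VMA_SELECT_COMPOUND	(1UL << 3)	/* setup-time flag */

/* A folio must be split in the middle of migration when the source side
 * collected a compound entry but the destination could not allocate a
 * large page, i.e. src and dst disagree on MIGRATE_PFN_COMPOUND. */
static bool needs_midway_split(unsigned long src_mpfn, unsigned long dst_mpfn)
{
	return (src_mpfn & MIGRATE_PFN_COMPOUND) &&
	       !(dst_mpfn & MIGRATE_PFN_COMPOUND);
}

/* When the caller does not pass MIGRATE_VMA_SELECT_COMPOUND to
 * migrate_vma_setup(), a THP source is split up front via folio_split()
 * and migrated as base pages. */
static bool split_before_migrate(unsigned long select_flags, bool src_is_thp)
{
	return src_is_thp && !(select_flags & MIGRATE_VMA_SELECT_COMPOUND);
}
```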

The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and to test the folio split path. A new
throughput test has been added as well.

The nouveau dmem code has been enhanced to use the new THP migration
capability.

mTHP support:

The patches hard-code HPAGE_PMD_NR in a few places, but the code has been
kept generic to support various order sizes. With additional refactoring,
support for different order sizes should be possible.

The future plan is to post enhancements to support mTHP with a rough
design as follows:

1. Add the notion of allowable THP orders to the HMM-based test driver
2. For non-PMD-based THP paths in migrate_device.c, check whether a
   suitable order is found and supported by the driver
3. Iterate across orders to find the highest supported order for migration
4. Migrate and finalize

The mTHP patches can be built on top of this series; the key design
elements that need to be worked out are infrastructure and driver support
for multiple page orders and their migration.
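
Step 3 of the plan above (iterating across orders) could look roughly like
the sketch below; the capability bitmask and `highest_supported_order()`
are hypothetical, illustrating one way a driver might advertise the orders
it supports:

```c
#include <assert.h>

/* Hypothetical driver capability mask: bit n set means the driver can
 * allocate destination folios of order n (order 0, i.e. base pages, is
 * always expected to be set). */
static int highest_supported_order(unsigned long supported, int max_order)
{
	int order;

	/* Walk down from the largest order the faulting range allows and
	 * pick the first one the driver supports; fall back to order 0. */
	for (order = max_order; order > 0; order--)
		if (supported & (1UL << order))
			return order;
	return 0;
}
```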

HMM support for large folios was added in 10b9feee2d0d ("mm/hmm:
populate PFNs from PMD swap entry").


This patch (of 16)

Add routines to support allocation of large-order zone device folios,
along with helper functions for zone device folios: to check whether a
folio is device private and to set zone device data.

When large folios are used, the existing page_free() callback in pgmap is
called when the folio is freed; this is true for both PAGE_SIZE and
higher-order pages.
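
The reference accounting in this patch pairs zone_device_page_init(),
which now takes 1 << order pgmap references via percpu_ref_tryget_many(),
with free_zone_device_folio(), which puts folio_nr_pages() references back
via percpu_ref_put_many(). A minimal userspace model of that bookkeeping,
with a plain counter standing in for the kernel's percpu_ref:

```c
#include <assert.h>

/* Plain counter standing in for pgmap->ref (a percpu_ref in the kernel). */
struct pgmap_model { long ref; };

/* Models zone_device_page_init(): take 1 << order references up front,
 * one per page in the (possibly compound) allocation. */
static void model_page_init(struct pgmap_model *pgmap, unsigned int order)
{
	pgmap->ref += 1L << order;
}

/* Models free_zone_device_folio(): put folio_nr_pages() references back,
 * balancing the init-time tryget for any folio order. */
static void model_folio_free(struct pgmap_model *pgmap, unsigned long nr_pages)
{
	pgmap->ref -= (long)nr_pages;
}
```

For a PMD-sized folio (order 9, 512 pages), init adds 512 references and
free removes 512, so the pgmap reference count stays balanced regardless of
order.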

Zone device private large folios do not support the deferred split and
shrinker scan used by normal THP folios.

Link: https://lkml.kernel.org/r/20251001065707.920170-1-balbirs@nvidia.com
Link: https://lkml.kernel.org/r/20251001065707.920170-2-balbirs@nvidia.com
Link: https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/ [1]
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Balbir Singh, committed by Andrew Morton
d245f9b4 14524684

+34 -18
+1 -1
arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -723,7 +723,7 @@

 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	zone_device_page_init(dpage);
+	zone_device_page_init(dpage, 0);
 	return dpage;
 out_clear:
 	spin_lock(&kvmppc_uvmem_bitmap_lock);
+1 -1
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -218,7 +218,7 @@
 	page = pfn_to_page(pfn);
 	svm_range_bo_ref(prange->svm_bo);
 	page->zone_device_data = prange->svm_bo;
-	zone_device_page_init(page);
+	zone_device_page_init(page, 0);
 }

 static void
+1 -1
drivers/gpu/drm/drm_pagemap.c
@@ -196,7 +196,7 @@
 			   struct drm_pagemap_zdd *zdd)
 {
 	page->zone_device_data = drm_pagemap_zdd_get(zdd);
-	zone_device_page_init(page);
+	zone_device_page_init(page, 0);
 }

 /**
+1 -1
drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -318,7 +318,7 @@
 		return NULL;
 	}

-	zone_device_page_init(page);
+	zone_device_page_init(page, 0);
 	return page;
 }

+9 -1
include/linux/memremap.h
@@ -206,7 +206,7 @@
 }

 #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void zone_device_page_init(struct page *page, unsigned int order);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -215,6 +215,14 @@
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);

 unsigned long memremap_compat_align(void);
+
+static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
+{
+	zone_device_page_init(&folio->page, order);
+	if (order)
+		folio_set_large_rmappable(folio);
+}
+
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 					struct dev_pagemap *pgmap)
+1 -1
lib/test_hmm.c
@@ -627,7 +627,7 @@
 		goto error;
 	}

-	zone_device_page_init(dpage);
+	zone_device_page_init(dpage, 0);
 	dpage->zone_device_data = rpage;
 	return dpage;

+15 -11
mm/memremap.c
@@ -416,20 +416,19 @@
 void free_zone_device_folio(struct folio *folio)
 {
 	struct dev_pagemap *pgmap = folio->pgmap;
+	unsigned long nr = folio_nr_pages(folio);
+	int i;

 	if (WARN_ON_ONCE(!pgmap))
 		return;

 	mem_cgroup_uncharge(folio);

-	/*
-	 * Note: we don't expect anonymous compound pages yet. Once supported
-	 * and we could PTE-map them similar to THP, we'd have to clear
-	 * PG_anon_exclusive on all tail pages.
-	 */
 	if (folio_test_anon(folio)) {
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-		__ClearPageAnonExclusive(folio_page(folio, 0));
+		for (i = 0; i < nr; i++)
+			__ClearPageAnonExclusive(folio_page(folio, i));
+	} else {
+		VM_WARN_ON_ONCE(folio_test_large(folio));
 	}

 	/*
@@ -455,8 +456,8 @@
 	case MEMORY_DEVICE_COHERENT:
 		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
 			break;
-		pgmap->ops->page_free(folio_page(folio, 0));
-		put_dev_pagemap(pgmap);
+		pgmap->ops->page_free(&folio->page);
+		percpu_ref_put_many(&folio->pgmap->ref, nr);
 		break;

 	case MEMORY_DEVICE_GENERIC:
@@ -479,14 +480,19 @@
 	}
 }

-void zone_device_page_init(struct page *page)
+void zone_device_page_init(struct page *page, unsigned int order)
 {
+	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
-	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
+	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
 	set_page_count(page, 1);
 	lock_page(page);
+
+	if (order)
+		prep_compound_page(page, order);
 }
 EXPORT_SYMBOL_GPL(zone_device_page_init);
+5 -1
mm/rmap.c
@@ -1733,9 +1733,13 @@
 	 * the folio is unmapped and at least one page is still mapped.
 	 *
 	 * Check partially_mapped first to ensure it is a large folio.
+	 *
+	 * Device private folios do not support deferred splitting and
+	 * shrinker based scanning of the folios to free.
 	 */
 	if (partially_mapped && folio_test_anon(folio) &&
-	    !folio_test_partially_mapped(folio))
+	    !folio_test_partially_mapped(folio) &&
+	    !folio_is_device_private(folio))
 		deferred_split_folio(folio, true);

 	__folio_mod_stat(folio, -nr, -nr_pmdmapped);