
mm: remap unused subpages to shared zeropage when splitting isolated thp

Patch series "mm: split underused THPs", v5.

The current upstream default policy for THP is "always". However, Meta uses
"madvise" in production because the THP=always policy vastly overprovisions
THPs in sparsely accessed memory areas, resulting in excessive memory
pressure and premature OOM killing. Using madvise + relying on khugepaged
has certain drawbacks over THP=always. Using madvise hints means THPs
aren't "transparent" and require userspace changes. Waiting for khugepaged
to scan memory and collapse pages into a THP can be slow and unpredictable
in terms of performance (i.e. you don't know when the collapse will
happen), while production environments require predictable performance. If
there is enough memory available, it is better for both performance and
predictability to have a THP from fault time, i.e. THP=always, rather than
wait for khugepaged to collapse it, and to deal with sparsely populated
THPs when the system is running out of memory.

This patch series is an attempt to mitigate the issue of running out of
memory when THP is always enabled. During runtime, whenever a THP is
faulted in or collapsed by khugepaged, the THP is added to a list.
Whenever memory reclaim happens, the kernel runs the deferred_split
shrinker, which goes through the list and checks whether the THP is
underused, i.e. how many of the base 4K pages of the entire THP are
zero-filled. If this number exceeds a certain threshold, the shrinker
attempts to split that THP. Then at remap time, the pages that were
zero-filled are mapped to the shared zeropage, hence saving memory. This
method avoids the downside of wasting memory in areas where THP is
sparsely filled when THP is always enabled, while still providing the
upsides of THPs, such as reduced TLB misses, without having to use
madvise.

Meta production workloads that were CPU bound (>99% CPU utilization) were
tested with the THP shrinker. The results after 2 hours are as follows:

                        | THP=madvise | THP=always    | THP=always
                        |             |               | + shrinker series
                        |             |               | + max_ptes_none=409
------------------------------------------------------------------------
Performance improvement |      -      |    +1.8%      |    +1.7%
(over THP=madvise)      |             |               |
------------------------------------------------------------------------
Memory usage            |    54.6G    | 58.8G (+7.7%) | 55.9G (+2.4%)
------------------------------------------------------------------------
max_ptes_none=409 means that any THP with more than 409 of its 512 base
pages (80%) zero-filled will be split.

To test out the patches: without the shrinker, the commands below invoke
the OOM killer immediately and kill stress; with the shrinker, they do not
fail:

echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K


This patch (of 5):

Here, "unused" means the subpage contains only zeros and is inaccessible
to userspace. When splitting an isolated THP under reclaim or migration,
the unused subpages can be mapped to the shared zeropage, hence saving
memory. This is particularly helpful when the internal fragmentation of a
THP is high, i.e. it has many untouched subpages.

This is also a prerequisite for the THP low-utilization shrinker, which
will be introduced in later patches, where underutilized THPs are split
and the zero-filled pages are freed, saving memory.

Link: https://lkml.kernel.org/r/20240830100438.3623486-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20240830100438.3623486-3-usamaarif642@gmail.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Shuang Zhai <zhais@google.com>
Cc: Alexander Zhu <alexlzhu@fb.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuang Zhai <szhai2@cs.rochester.edu>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Yu Zhao, committed by Andrew Morton (b1f20206 903edea6)
4 files changed: +77 -18
include/linux/rmap.h (+6 -1):

--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -745,7 +745,12 @@
 int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 		      struct vm_area_struct *vma);
 
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
+enum rmp_flags {
+	RMP_LOCKED		= 1 << 0,
+	RMP_USE_SHARED_ZEROPAGE	= 1 << 1,
+};
+
+void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
 
 /*
  * rmap_walk_control: To control rmap traversing for specific needs
mm/huge_memory.c (+4 -4):

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3004,7 +3004,7 @@
 	return false;
 }
 
-static void remap_page(struct folio *folio, unsigned long nr)
+static void remap_page(struct folio *folio, unsigned long nr, int flags)
 {
 	int i = 0;
 
@@ -3012,7 +3012,7 @@
 	if (!folio_test_anon(folio))
 		return;
 	for (;;) {
-		remove_migration_ptes(folio, folio, true);
+		remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
 		i += folio_nr_pages(folio);
 		if (i >= nr)
 			break;
@@ -3222,7 +3222,7 @@
 
 	if (nr_dropped)
 		shmem_uncharge(folio->mapping->host, nr_dropped);
-	remap_page(folio, nr);
+	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
 
 	/*
 	 * set page to its compound_head when split to non order-0 pages, so
@@ -3498,7 +3498,7 @@
 		if (mapping)
 			xas_unlock(&xas);
 		local_irq_enable();
-		remap_page(folio, folio_nr_pages(folio));
+		remap_page(folio, folio_nr_pages(folio), 0);
 		ret = -EAGAIN;
 	}
 
mm/migrate.c (+65 -11):

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -204,13 +204,57 @@
 	return true;
 }
 
+static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
+					  struct folio *folio,
+					  unsigned long idx)
+{
+	struct page *page = folio_page(folio, idx);
+	bool contains_data;
+	pte_t newpte;
+	void *addr;
+
+	VM_BUG_ON_PAGE(PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
+
+	if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) ||
+	    mm_forbids_zeropage(pvmw->vma->vm_mm))
+		return false;
+
+	/*
+	 * The pmd entry mapping the old thp was flushed and the pte mapping
+	 * this subpage has been non present. If the subpage is only zero-filled
+	 * then map it to the shared zeropage.
+	 */
+	addr = kmap_local_page(page);
+	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
+	kunmap_local(addr);
+
+	if (contains_data)
+		return false;
+
+	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
+				       pvmw->vma->vm_page_prot));
+	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
+
+	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
+	return true;
+}
+
+struct rmap_walk_arg {
+	struct folio *folio;
+	bool map_unused_to_zeropage;
+};
+
 /*
  * Restore a potential migration pte to a working pte entry
  */
 static bool remove_migration_pte(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long addr, void *old)
+		struct vm_area_struct *vma, unsigned long addr, void *arg)
 {
-	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+	struct rmap_walk_arg *rmap_walk_arg = arg;
+	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		rmap_t rmap_flags = RMAP_NONE;
@@ -234,6 +278,9 @@
 			continue;
 		}
 #endif
+		if (rmap_walk_arg->map_unused_to_zeropage &&
+		    try_to_map_unused_to_zeropage(&pvmw, folio, idx))
+			continue;
 
 		folio_get(folio);
 		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
@@ -312,14 +359,21 @@
  * Get rid of all migration entries and replace them by
  * references to the indicated page.
  */
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
+void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
 {
-	struct rmap_walk_control rwc = {
-		.rmap_one = remove_migration_pte,
-		.arg = src,
+	struct rmap_walk_arg rmap_walk_arg = {
+		.folio = src,
+		.map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE,
 	};
 
-	if (locked)
+	struct rmap_walk_control rwc = {
+		.rmap_one = remove_migration_pte,
+		.arg = &rmap_walk_arg,
+	};
+
+	VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src);
+
+	if (flags & RMP_LOCKED)
 		rmap_walk_locked(dst, &rwc);
 	else
 		rmap_walk(dst, &rwc);
@@ -934,7 +988,7 @@
 	 * At this point we know that the migration attempt cannot
 	 * be successful.
 	 */
-	remove_migration_ptes(folio, folio, false);
+	remove_migration_ptes(folio, folio, 0);
 
 	rc = mapping->a_ops->writepage(&folio->page, &wbc);
 
@@ -1098,7 +1152,7 @@
 		struct list_head *ret)
 {
 	if (page_was_mapped)
-		remove_migration_ptes(src, src, false);
+		remove_migration_ptes(src, src, 0);
 	/* Drop an anon_vma reference if we took one */
 	if (anon_vma)
 		put_anon_vma(anon_vma);
@@ -1336,7 +1390,7 @@
 	lru_add_drain();
 
 	if (old_page_state & PAGE_WAS_MAPPED)
-		remove_migration_ptes(src, dst, false);
+		remove_migration_ptes(src, dst, 0);
 
 out_unlock_both:
 	folio_unlock(dst);
@@ -1474,7 +1528,7 @@
 
 	if (page_was_mapped)
 		remove_migration_ptes(src,
-				rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+				rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
 
 unlock_put_anon:
 	folio_unlock(dst);
mm/migrate_device.c (+2 -2):

--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -424,7 +424,7 @@
 			continue;
 
 		folio = page_folio(page);
-		remove_migration_ptes(folio, folio, false);
+		remove_migration_ptes(folio, folio, 0);
 
 		src_pfns[i] = 0;
 		folio_unlock(folio);
@@ -840,7 +840,7 @@
 		dst = src;
 	}
 
-	remove_migration_ptes(src, dst, false);
+	remove_migration_ptes(src, dst, 0);
 	folio_unlock(src);
 
 	if (folio_is_zone_device(src))