Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: Add support for unaccepted memory

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

1. Accept all the memory during boot. It is easy to implement and it
has no runtime cost once the system is booted. The downside is a
very long boot time.

Acceptance can be parallelized across multiple CPUs to keep it
manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
saturate memory bandwidth and does not scale beyond that point.

2. Accept a block of memory on the first use. It requires more
infrastructure and changes in the page allocator to make it work, but
it provides good boot time.

On-demand memory acceptance means latency spikes every time the kernel
steps onto a new memory block. The spikes go away once the workload's
data set size stabilizes or all memory has been accepted.

3. Accept all memory in the background. Introduce a thread (or multiple)
that gets memory accepted proactively. It minimizes the time the
system experiences latency spikes on memory allocation while keeping
boot time low.

This approach cannot function on its own. It is an extension of #2:
background memory acceptance requires a functional scheduler, but the
page allocator may need to tap into unaccepted memory before that.

The downside of the approach is that these threads also steal CPU
cycles and memory bandwidth from the user's workload and may hurt
user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager on the kernel command line. #3 can be
implemented later based on user demand.
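The difference between #1 and #2 can be illustrated with a small
userspace sketch. It is a toy model, not kernel code: every toy_* name
and the block count are hypothetical, and acceptance is reduced to
flipping a flag and counting how often the (expensive) operation ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NR_BLOCKS 8	/* hypothetical number of memory blocks */

enum toy_mode { TOY_LAZY, TOY_EAGER };

struct toy_mem {
	bool accepted[TOY_NR_BLOCKS];
	size_t accept_calls;	/* how many blocks actually had to be accepted */
};

/* Parse the value of accept_memory=; returns 0 on success, -1 otherwise. */
static int toy_parse_mode(const char *p, enum toy_mode *mode)
{
	if (!strcmp(p, "lazy")) {
		*mode = TOY_LAZY;
		return 0;
	}
	if (!strcmp(p, "eager")) {
		*mode = TOY_EAGER;
		return 0;
	}
	return -1;
}

/* Accept one block; a block that is already accepted costs nothing. */
static void toy_accept(struct toy_mem *m, size_t block)
{
	assert(block < TOY_NR_BLOCKS);
	if (!m->accepted[block]) {
		m->accepted[block] = true;
		m->accept_calls++;
	}
}

/* Boot: eager mode accepts everything up front, lazy mode accepts nothing. */
static void toy_boot(struct toy_mem *m, enum toy_mode mode)
{
	memset(m, 0, sizeof(*m));
	if (mode == TOY_EAGER)
		for (size_t i = 0; i < TOY_NR_BLOCKS; i++)
			toy_accept(m, i);
}

/* First use of a block: lazy mode pays the acceptance cost here. */
static void toy_first_use(struct toy_mem *m, size_t block)
{
	toy_accept(m, block);
}
```

In this model eager boot pays for all blocks before any workload runs,
while lazy boot pays nothing up front and one acceptance per newly
touched block, which is where the latency spikes described in #2 come
from.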

Support of unaccepted memory requires a few changes in core-mm code:

- memblock accepts memory on allocation. It serves early boot memory
allocations and doesn't limit them to a pre-accepted pool of memory.

- page allocator accepts memory on the first allocation of the page.
When the kernel runs out of accepted memory, it accepts memory until
the high watermark is reached. It helps to minimize fragmentation.
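The "accept until the high watermark" rule can be modelled in a few
lines of userspace C. This is a simplified sketch, not the kernel
function: the toy_* names are hypothetical, the block size of 1024
pages assumes MAX_ORDER == 10, and the only "unusable" free pages
considered are the unaccepted ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed pages per MAX_ORDER block (1 << 10); hypothetical constant. */
#define TOY_BLOCK_PAGES 1024L

struct toy_zone {
	long high_wmark;	/* target number of usable free pages */
	long free_pages;	/* free page count, unaccepted pages included */
	long unaccepted_pages;	/* part of free_pages that is not usable yet */
};

/* Accept one block, if any is left; free_pages itself does not change. */
static bool toy_accept_one(struct toy_zone *z)
{
	if (z->unaccepted_pages < TOY_BLOCK_PAGES)
		return false;
	z->unaccepted_pages -= TOY_BLOCK_PAGES;
	return true;
}

/* Returns the number of blocks accepted by this call. */
static int toy_try_to_accept(struct toy_zone *z)
{
	/* How far are the usable free pages below the high watermark? */
	long to_accept = z->high_wmark -
			 (z->free_pages - z->unaccepted_pages);
	int accepted = 0;

	/* Accept at least one block, then continue until the gap is closed. */
	do {
		if (!toy_accept_one(z))
			break;
		accepted++;
		to_accept -= TOY_BLOCK_PAGES;
	} while (to_accept > 0);

	return accepted;
}
```

With high_wmark = 3000, free_pages = 10000 and unaccepted_pages = 9500,
only 500 pages are usable, so the gap is 2500 pages and the loop accepts
three blocks of 1024 pages. A zone that is already above the watermark
still gets one block accepted, because the loop body runs before the
condition is checked.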

EFI code will provide two helpers if the platform supports unaccepted
memory:

- accept_memory() makes a range of physical addresses accepted.

- range_contains_unaccepted_memory() checks whether anything within the
range of physical addresses requires acceptance.
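The contract of the two helpers can be sketched with a toy bitmap in
userspace C. This is an assumption-laden illustration, not the EFI
implementation: the toy_* names, the 2 MiB unit size, the 64-unit
address space and the use of a bitmap are all hypothetical, and the
platform-specific acceptance call is replaced by clearing a bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_UNIT_SIZE 0x200000ULL	/* assumed 2 MiB per tracked unit */
#define TOY_NR_UNITS  64		/* toy address space: 64 units */

/* One bit per unit; a set bit means "still unaccepted". */
static uint64_t toy_unaccepted = UINT64_MAX;

/* Does any unit overlapping [start, end) still need acceptance? */
static bool toy_range_contains_unaccepted(uint64_t start, uint64_t end)
{
	assert(start <= end);
	if (start == end)
		return false;

	for (uint64_t u = start / TOY_UNIT_SIZE;
	     u < TOY_NR_UNITS && u * TOY_UNIT_SIZE < end; u++) {
		if (toy_unaccepted & (UINT64_C(1) << u))
			return true;
	}
	return false;
}

/* Mark every unit overlapping [start, end) as accepted. */
static void toy_accept_memory(uint64_t start, uint64_t end)
{
	assert(start <= end);
	if (start == end)
		return;

	for (uint64_t u = start / TOY_UNIT_SIZE;
	     u < TOY_NR_UNITS && u * TOY_UNIT_SIZE < end; u++) {
		/* A real platform would run its acceptance protocol here. */
		toy_unaccepted &= ~(UINT64_C(1) << u);
	}
}
```

Both helpers take a half-open physical range, matching the
accept_memory(found, found + size) call pattern used by the callers in
this patch. In this toy model a request that covers only part of a unit
accepts the whole unit.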

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com

Authored by Kirill A. Shutemov and committed by Borislav Petkov (AMD)
dcdfdd40 9561de3a

+231
+7
drivers/base/node.c
···
 			     "Node %d FileHugePages: %8lu kB\n"
 			     "Node %d FilePmdMapped: %8lu kB\n"
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+			     "Node %d Unaccepted: %8lu kB\n"
+#endif
 			     ,
 			     nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
 			     nid, K(node_page_state(pgdat, NR_WRITEBACK)),
···
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
 			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED))
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+			     ,
+			     nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
 #endif
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
+5
fs/proc/meminfo.c
···
 		    global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif

+#ifdef CONFIG_UNACCEPTED_MEMORY
+	show_val_kb(m, "Unaccepted: ",
+		    global_zone_page_state(NR_UNACCEPTED));
+#endif
+
 	hugetlb_report_meminfo(m);

 	arch_report_meminfo(m);
+19
include/linux/mm.h
···
 }
 #endif

+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool range_contains_unaccepted_memory(phys_addr_t start,
+						    phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+
+#endif
+
 #endif /* _LINUX_MM_H */
+8
include/linux/mmzone.h
···
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
 	NR_FREE_CMA_PAGES,
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	NR_UNACCEPTED,
+#endif
 	NR_VM_ZONE_STAT_ITEMS };

 enum node_stat_item {
···

 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER + 1];
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* Pages to be accepted. All pages on the list are MAX_ORDER */
+	struct list_head	unaccepted_pages;
+#endif

 	/* zone flags, see below */
 	unsigned long		flags;
+9
mm/memblock.c
···
 	 */
 	kmemleak_alloc_phys(found, size, 0);

+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }

+7
mm/mm_init.c
···
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
 	}
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	INIT_LIST_HEAD(&zone->unaccepted_pages);
+#endif
 }

 void __meminit init_currently_empty_zone(struct zone *zone,
···
 		__free_pages_core(page, MAX_ORDER);
 		return;
 	}
+
+	/* Accept chunks smaller than MAX_ORDER upfront */
+	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));

 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if (pageblock_aligned(pfn))
+173
mm/page_alloc.c
···
 EXPORT_SYMBOL(nr_online_nodes);
 #endif

+static bool page_contains_unaccepted(struct page *page, unsigned int order);
+static void accept_page(struct page *page, unsigned int order);
+static bool try_to_accept_memory(struct zone *zone, unsigned int order);
+static inline bool has_unaccepted_memory(void);
+static bool __free_unaccepted(struct page *page);
+
 int page_group_by_mobility_disabled __read_mostly;

 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
···
 	set_page_count(p, 0);

 	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
+
+	if (page_contains_unaccepted(page, order)) {
+		if (order == MAX_ORDER && __free_unaccepted(page))
+			return;
+
+		accept_page(page, order);
+	}

 	/*
 	 * Bypass PCP and place fresh pages right to the tail, primarily
···
 	if (!(alloc_flags & ALLOC_CMA))
 		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	unusable_free += zone_page_state(z, NR_UNACCEPTED);
+#endif

 	return unusable_free;
 }
···
 				       gfp_mask)) {
 			int ret;

+			if (has_unaccepted_memory()) {
+				if (try_to_accept_memory(zone, order))
+					goto try_this_zone;
+			}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/*
 			 * Watermark failed for this zone, but see if we can
···

 			return page;
 		} else {
+			if (has_unaccepted_memory()) {
+				if (try_to_accept_memory(zone, order))
+					goto try_this_zone;
+			}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/* Try again if zone has deferred pages */
 			if (deferred_pages_enabled()) {
···
 	return false;
 }
 #endif /* CONFIG_ZONE_DMA */
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+/* Counts number of zones with unaccepted pages. */
+static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);
+
+static bool lazy_accept = true;
+
+static int __init accept_memory_parse(char *p)
+{
+	if (!strcmp(p, "lazy")) {
+		lazy_accept = true;
+		return 0;
+	} else if (!strcmp(p, "eager")) {
+		lazy_accept = false;
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+}
+early_param("accept_memory", accept_memory_parse);
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	phys_addr_t end = start + (PAGE_SIZE << order);
+
+	return range_contains_unaccepted_memory(start, end);
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+}
+
+static bool try_to_accept_memory_one(struct zone *zone)
+{
+	unsigned long flags;
+	struct page *page;
+	bool last;
+
+	if (list_empty(&zone->unaccepted_pages))
+		return false;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	page = list_first_entry_or_null(&zone->unaccepted_pages,
+					struct page, lru);
+	if (!page) {
+		spin_unlock_irqrestore(&zone->lock, flags);
+		return false;
+	}
+
+	list_del(&page->lru);
+	last = list_empty(&zone->unaccepted_pages);
+
+	__mod_zone_freepage_state(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+	__mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	accept_page(page, MAX_ORDER);
+
+	__free_pages_ok(page, MAX_ORDER, FPI_TO_TAIL);
+
+	if (last)
+		static_branch_dec(&zones_with_unaccepted_pages);
+
+	return true;
+}
+
+static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+	long to_accept;
+	int ret = false;
+
+	/* How much to accept to get to high watermark? */
+	to_accept = high_wmark_pages(zone) -
+		    (zone_page_state(zone, NR_FREE_PAGES) -
+		    __zone_watermark_unusable_free(zone, order, 0));
+
+	/* Accept at least one page */
+	do {
+		if (!try_to_accept_memory_one(zone))
+			break;
+		ret = true;
+		to_accept -= MAX_ORDER_NR_PAGES;
+	} while (to_accept > 0);
+
+	return ret;
+}
+
+static inline bool has_unaccepted_memory(void)
+{
+	return static_branch_unlikely(&zones_with_unaccepted_pages);
+}
+
+static bool __free_unaccepted(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+	bool first = false;
+
+	if (!lazy_accept)
+		return false;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	first = list_empty(&zone->unaccepted_pages);
+	list_add_tail(&page->lru, &zone->unaccepted_pages);
+	__mod_zone_freepage_state(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+	__mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	if (first)
+		static_branch_inc(&zones_with_unaccepted_pages);
+
+	return true;
+}
+
+#else
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	return false;
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+}
+
+static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+	return false;
+}
+
+static inline bool has_unaccepted_memory(void)
+{
+	return false;
+}
+
+static bool __free_unaccepted(struct page *page)
+{
+	BUILD_BUG();
+	return false;
+}
+
+#endif /* CONFIG_UNACCEPTED_MEMORY */
+3
mm/vmstat.c
···
 	"nr_zspages",
 #endif
 	"nr_free_cma",
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	"nr_unaccepted",
+#endif

 	/* enum numa_stat_item counters */
 #ifdef CONFIG_NUMA