Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/page_alloc.c: broken deferred calculation

In reset_deferred_meminit() we determine the number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
the pages for reserved memory in this node.
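
As a rough illustration of the sizing described above, here is a minimal
userspace sketch. It assumes 4 KiB pages, and the sketch_* names are
hypothetical helpers for this example, not kernel functions. The 1/256
share of the node is taken from the diff in this commit.

```c
#include <assert.h>

/* Assumption: 4 KiB pages (PAGE_SHIFT == 12); other configurations differ. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Hypothetical helper: initial estimate of non-deferred pages, the larger of
 * 2 GiB worth of pages and 1/256 of the pages spanned by the node.
 */
static unsigned long sketch_initial_pgcnt(unsigned long node_spanned_pages)
{
	unsigned long two_gib_pages = 2UL << (30 - SKETCH_PAGE_SHIFT);
	unsigned long node_share = node_spanned_pages >> 8;

	assert(node_spanned_pages > 0); /* an empty node makes no sense here */
	return two_gib_pages > node_share ? two_gib_pages : node_share;
}

/*
 * Hypothetical helper: add the reserved pages (already expressed as a page
 * count) and clamp the result to the size of the node.
 */
static unsigned long sketch_static_init_pgcnt(unsigned long node_spanned_pages,
					      unsigned long reserved_pages)
{
	unsigned long pgcnt = sketch_initial_pgcnt(node_spanned_pages) +
			      reserved_pages;

	return pgcnt < node_spanned_pages ? pgcnt : node_spanned_pages;
}
```

Under these assumptions a 4 GiB node (1UL << 20 pages) with no reservations
keeps 524288 pages (2 GiB) non-deferred, while a node smaller than 2 GiB is
initialized entirely during early boot.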

The reserved memory is determined by memblock_reserved_memory_within(),
which operates on physical addresses and returns a size in bytes. However,
reset_deferred_meminit() assumes that this function operates on pfns and
returns a page count.

The result is that in the best case the machine boots slower than expected
because more pages than needed are initialized in a single thread, and in
the worst case it panics because fewer pages than needed are initialized
early.
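
Both outcomes can be reproduced with a small userspace model. This is only
a sketch: it assumes 4 KiB pages, sketch_reserved_within() is a hypothetical
stand-in for memblock_reserved_memory_within() that takes byte addresses and
returns bytes, and the two reservation layouts are made-up values chosen for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* One reserved physical range [base, base + size), both in bytes. */
struct sketch_region {
	uint64_t base;
	uint64_t size;
};

/* Made-up layouts: 1 MiB reserved at address 0, and 128 MiB at 256 MiB. */
static const struct sketch_region sketch_low_reservation[] = {
	{ 0, UINT64_C(1) << 20 },
};
static const struct sketch_region sketch_crash_reservation[] = {
	{ UINT64_C(256) << 20, UINT64_C(128) << 20 },
};

/* Hypothetical stand-in: reserved bytes overlapping [start, end) in bytes. */
static uint64_t sketch_reserved_within(const struct sketch_region *r, int n,
				       uint64_t start, uint64_t end)
{
	uint64_t total = 0;

	assert(start <= end);
	for (int i = 0; i < n; i++) {
		uint64_t lo = r[i].base > start ? r[i].base : start;
		uint64_t r_end = r[i].base + r[i].size;
		uint64_t hi = r_end < end ? r_end : end;

		if (lo < hi)
			total += hi - lo;
	}
	return total;
}

/* Old behaviour: pfns passed as byte addresses, bytes added as pages. */
static uint64_t sketch_pgcnt_buggy(const struct sketch_region *r, int n,
				   uint64_t start_pfn, uint64_t max_pgcnt)
{
	return max_pgcnt +
	       sketch_reserved_within(r, n, start_pfn, start_pfn + max_pgcnt);
}

/* New behaviour: convert pfns to bytes going in, bytes to pages coming out. */
static uint64_t sketch_pgcnt_fixed(const struct sketch_region *r, int n,
				   uint64_t start_pfn, uint64_t max_pgcnt)
{
	uint64_t start = start_pfn << SKETCH_PAGE_SHIFT;
	uint64_t end = (start_pfn + max_pgcnt) << SKETCH_PAGE_SHIFT;

	return max_pgcnt +
	       (sketch_reserved_within(r, n, start, end) >> SKETCH_PAGE_SHIFT);
}
```

With a node starting at pfn 0 and an initial estimate of 524288 pages, the
low reservation makes the old calculation count 524288 extra pages where 256
are correct (slower boot). The crash-kernel-like reservation is missed
entirely by the old calculation, which adds 0 pages where 32768 are needed
(too few pages initialized early).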

Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Pavel Tatashin and committed by Linus Torvalds
d135e575 400e2249

+20 -10
+2 -1
include/linux/mmzone.h
@@ -700,7 +700,8 @@
 	 * is the first PFN that needs to be initialised.
 	 */
 	unsigned long first_deferred_pfn;
-	unsigned long static_init_size;
+	/* Number of non-deferred pages */
+	unsigned long static_init_pgcnt;
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+18 -9
mm/page_alloc.c
@@ -291,28 +291,37 @@
 int page_group_by_mobility_disabled __read_mostly;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
 static inline void reset_deferred_meminit(pg_data_t *pgdat)
 {
-	unsigned long max_initialise;
-	unsigned long reserved_lowmem;
+	phys_addr_t start_addr, end_addr;
+	unsigned long max_pgcnt;
+	unsigned long reserved;
 
 	/*
 	 * Initialise at least 2G of a node but also take into account that
 	 * two large system hashes that can take up 1GB for 0.25TB/node.
 	 */
-	max_initialise = max(2UL << (30 - PAGE_SHIFT),
-		(pgdat->node_spanned_pages >> 8));
+	max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+			(pgdat->node_spanned_pages >> 8));
 
 	/*
 	 * Compensate the all the memblock reservations (e.g. crash kernel)
 	 * from the initial estimation to make sure we will initialize enough
 	 * memory to boot.
 	 */
-	reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
-			pgdat->node_start_pfn + max_initialise);
-	max_initialise += reserved_lowmem;
+	start_addr = PFN_PHYS(pgdat->node_start_pfn);
+	end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+	reserved = memblock_reserved_memory_within(start_addr, end_addr);
+	max_pgcnt += PHYS_PFN(reserved);
 
-	pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+	pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
 	pgdat->first_deferred_pfn = ULONG_MAX;
 }
 
@@ -339,7 +348,7 @@
 	if (zone_end < pgdat_end_pfn(pgdat))
 		return true;
 	(*nr_initialised)++;
-	if ((*nr_initialised > pgdat->static_init_size) &&
+	if ((*nr_initialised > pgdat->static_init_pgcnt) &&
 	    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
 		pgdat->first_deferred_pfn = pfn;
 		return false;