Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: memmap defer init doesn't work as expected

VMware observed a performance regression during memmap init on their
platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
iterate over memblock regions rather that check each PFN") causing it.

Before the commit:

[0.033176] Normal zone: 1445888 pages used for memmap
[0.033176] Normal zone: 89391104 pages, LIFO batch:63
[0.035851] ACPI: PM-Timer IO Port: 0x448

With the commit:

[0.026874] Normal zone: 1445888 pages used for memmap
[0.026875] Normal zone: 89391104 pages, LIFO batch:63
[2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is that the current memmap defer init doesn't work as expected.

Before, memmap_init_zone() was used to do memmap init of one whole zone:
it fully initialized all low zones of one numa node, but deferred the
memmap init of the last zone in that numa node. However, since commit
73a6e474cb376, memmap_init() iterates over the memblock regions inside
one zone, then calls memmap_init_zone() to do memmap init for each
region separately.

E.g., on VMware's system, the memory layout is as below; there are two
memory regions in node 2. The current code mistakenly initializes the
whole 1st region [mem 0xab00000000-0xfcffffffff], then defers on the 2nd
region [mem 0x10000000000-0x1033fffffff] after initializing only one
memory section of it. In fact, only one memory section's memmap should
be initialized early for the whole node. That's why memmap init takes so
much longer.

[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
[ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
[ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
[ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn, so that defer_init() can use it to judge
whether deferring should be applied zone-wide.

Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
Fixes: 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Rahul Gopakumar <gopakumarr@vmware.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Baoquan He, committed by Linus Torvalds
dc2da7b4 6d87d0ec

11 insertions(+), 8 deletions(-)
arch/ia64/mm/init.c (+2 -2)

@@ -536,7 +536,7 @@

 	if (map_start < map_end)
 		memmap_init_zone((unsigned long)(map_end - map_start),
-				 args->nid, args->zone, page_to_pfn(map_start),
+				 args->nid, args->zone, page_to_pfn(map_start), page_to_pfn(map_end),
 				 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 	return 0;
 }
@@ -546,7 +546,7 @@
 		unsigned long start_pfn)
 {
 	if (!vmem_map) {
-		memmap_init_zone(size, nid, zone, start_pfn,
+		memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
 				 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 	} else {
 		struct page *start;
include/linux/mm.h (+3 -2)

@@ -2439,8 +2439,9 @@
 #endif

 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
-		enum meminit_context, struct vmem_altmap *, int migratetype);
+extern void memmap_init_zone(unsigned long, int, unsigned long,
+		unsigned long, unsigned long, enum meminit_context,
+		struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
mm/memory_hotplug.c (+1 -1)

@@ -713,7 +713,7 @@
 	 * expects the zone spans the pfn range. All the pages in the range
 	 * are reserved so nobody should be touching them so we should be safe
 	 */
-	memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
+	memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
			 MEMINIT_HOTPLUG, altmap, migratetype);

 	set_zone_contiguous(zone);
mm/page_alloc.c (+5 -3)

@@ -423,6 +423,8 @@
 	if (end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
 		return false;

+	if (NODE_DATA(nid)->first_deferred_pfn != ULONG_MAX)
+		return true;
 	/*
 	 * We start only with one section of pages, more pages are added as
 	 * needed until the rest of deferred pages are initialized.
@@ -6118,7 +6116,7 @@
  * zone stats (e.g., nr_isolate_pageblock) are touched.
  */
 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-		unsigned long start_pfn,
+		unsigned long start_pfn, unsigned long zone_end_pfn,
 		enum meminit_context context,
 		struct vmem_altmap *altmap, int migratetype)
 {
@@ -6154,7 +6152,7 @@
 		if (context == MEMINIT_EARLY) {
 			if (overlap_memmap_init(zone, &pfn))
 				continue;
-			if (defer_init(nid, pfn, end_pfn))
+			if (defer_init(nid, pfn, zone_end_pfn))
 				break;
 		}
@@ -6268,7 +6266,7 @@

 	if (end_pfn > start_pfn) {
 		size = end_pfn - start_pfn;
-		memmap_init_zone(size, nid, zone, start_pfn,
+		memmap_init_zone(size, nid, zone, start_pfn, range_end_pfn,
 				 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 	}
 }