Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

x86/mm/numa: Drop dead code and rename setup_node_data() to alloc_node_data()

The setup_node_data() function allocates a pg_data_t object,
inserts it into the node_data[] array and initializes the
following fields: node_id, node_start_pfn and
node_spanned_pages.

However, a few function calls later during the kernel boot,
free_area_init_node() re-initializes those fields, possibly with
a different value, so the initialization done by
setup_node_data() is not used.
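The overwrite can be sketched in a small userspace model (plain C with
hypothetical names; this is an illustration of the dead store, not the
kernel code itself):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical model of the three pg_data_t fields involved. */
struct pgdat_model {
	int      node_id;
	uint64_t node_start_pfn;
	uint64_t node_spanned_pages;
};

/* What setup_node_data() stored, derived from the SRAT byte range. */
static void model_setup_node_data(struct pgdat_model *p, int nid,
				  uint64_t start, uint64_t end)
{
	p->node_id = nid;
	p->node_start_pfn = start >> PAGE_SHIFT;
	p->node_spanned_pages = (end - start) >> PAGE_SHIFT;
}

/* free_area_init_node() later overwrites the same fields from the pfn
 * range it computes itself, so the values stored above are never read. */
static void model_free_area_init_node(struct pgdat_model *p, int nid,
				      uint64_t start_pfn, uint64_t end_pfn)
{
	p->node_id = nid;
	p->node_start_pfn = start_pfn;
	p->node_spanned_pages = end_pfn - start_pfn;
}
```

Whatever the first call stores, only the second call's values survive,
which is what makes the first initialization dead code.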

This causes a small glitch when running Linux as a Hyper-V NUMA
guest:

SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
Initmem setup node 0 [mem 0x00000000-0x7fffffff]
NODE_DATA [mem 0x7ffdc000-0x7ffeffff]
Initmem setup node 1 [mem 0x80800000-0x1081fffff]
NODE_DATA [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7

In this dmesg, the SRAT table reports that node 1's memory range
starts at 0x80200000. However, the line starting with "Initmem"
reports that it starts at 0x80800000. The "Initmem" line is
printed by setup_node_data() and is wrong, because the kernel
ends up using the range as reported in the SRAT table.
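The 0x80800000 value most likely comes from the ZONE_ALIGN round-up in the
removed code path (`start = roundup(start, ZONE_ALIGN)`, with ZONE_ALIGN
defined as `1UL << (MAX_ORDER+PAGE_SHIFT)` in the hunk removed from numa.h).
A quick sketch of that arithmetic, assuming the then-common x86 defaults of
MAX_ORDER = 11 and PAGE_SHIFT = 12 (those constants are assumptions here,
not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86 defaults at the time; illustrative only. */
#define PAGE_SHIFT 12
#define MAX_ORDER  11
/* Mirrors the removed numa.h definition: 1UL << (MAX_ORDER+PAGE_SHIFT). */
#define ZONE_ALIGN (1ULL << (MAX_ORDER + PAGE_SHIFT))

/* Round x up to the next multiple of align (align is a power of two here,
 * but the generic form works for any non-zero align). */
static uint64_t roundup_u64(uint64_t x, uint64_t align)
{
	return ((x + align - 1) / align) * align;
}
```

With these values ZONE_ALIGN is 0x800000, and rounding node 1's SRAT start
of 0x80200000 up to that alignment yields exactly the bogus 0x80800000 seen
in the "Initmem" line.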

This commit drops all that dead code from setup_node_data(),
renames it to alloc_node_data() and adds a printk() to
free_area_init_node() so that we report a node's memory range
accurately.
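The new printk derives the byte range directly from the pfn range that
free_area_init_node() actually uses, so it cannot disagree with the SRAT
data. A minimal sketch of that pfn-to-byte-range arithmetic (helper names
are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* First byte covered by a pfn range: start_pfn << PAGE_SHIFT. */
static uint64_t pfn_range_start(uint64_t start_pfn)
{
	return start_pfn << PAGE_SHIFT;
}

/* Last byte covered: (end_pfn << PAGE_SHIFT) - 1, since end_pfn is
 * exclusive. */
static uint64_t pfn_range_end(uint64_t end_pfn)
{
	return (end_pfn << PAGE_SHIFT) - 1;
}
```

For node 1's pfn range [0x80200, 0x108200) this produces the range
[mem 0x80200000-0x1081fffff], matching both the SRAT table and the
"Initmem" line in the fixed dmesg below.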

Here's the same dmesg section with this patch applied:

SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff]
NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
Initmem setup node 0 [mem 0x00001000-0x7ffeffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
Initmem setup node 1 [mem 0x80200000-0x1081fffff]
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7

This commit was tested on a two-node bare-metal NUMA machine and
with Linux as a NUMA guest on Hyper-V and qemu/kvm.

PS: The wrong memory range reported by setup_node_data() seems to be
harmless in the current kernel because it's just not used. However,
that bad range is used in kernel 2.6.32 to initialize the old boot
memory allocator, which causes a crash during boot.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Authored by Luiz Capitulino, committed by Ingo Molnar
8b375f64 9661d5bc

3 files changed: +16 -21
arch/x86/include/asm/numa.h (-1)

···
 #ifdef CONFIG_NUMA
 
 #define NR_NODE_MEMBLKS		(MAX_NUMNODES*2)
-#define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
 
 /*
  * Too small node sizes may confuse the VM badly. Usually they
arch/x86/mm/numa.c (+14 -20)

···
 	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
 }
 
-/* Initialize NODE_DATA for a node on the local memory */
-static void __init setup_node_data(int nid, u64 start, u64 end)
+/* Allocate NODE_DATA for a node on the local memory */
+static void __init alloc_node_data(int nid)
 {
 	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
 	u64 nd_pa;
 	void *nd;
 	int tnid;
-
-	/*
-	 * Don't confuse VM with a node that doesn't have the
-	 * minimum amount of memory:
-	 */
-	if (end && (end - start) < NODE_MIN_SIZE)
-		return;
-
-	start = roundup(start, ZONE_ALIGN);
-
-	printk(KERN_INFO "Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
-	       nid, start, end - 1);
 
 	/*
 	 * Allocate node data. Try node-local memory and then any node.
···
 	nd = __va(nd_pa);
 
 	/* report and initialize */
-	printk(KERN_INFO "  NODE_DATA [mem %#010Lx-%#010Lx]\n",
+	printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
 	       nd_pa, nd_pa + nd_size - 1);
 	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
 	if (tnid != nid)
···
 
 	node_data[nid] = nd;
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-	NODE_DATA(nid)->node_id = nid;
-	NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
-	NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
 
 	node_set_online(nid);
 }
···
 			end = max(mi->blk[i].end, end);
 		}
 
-		if (start < end)
-			setup_node_data(nid, start, end);
+		if (start >= end)
+			continue;
+
+		/*
+		 * Don't confuse VM with a node that doesn't have the
+		 * minimum amount of memory:
+		 */
+		if (end && (end - start) < NODE_MIN_SIZE)
+			continue;
+
+		alloc_node_data(nid);
 	}
 
 	/* Dump memblock with node info and return. */
mm/page_alloc.c (+2)

···
 	pgdat->node_start_pfn = node_start_pfn;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+	printk(KERN_INFO "Initmem setup node %d [mem %#010Lx-%#010Lx]\n", nid,
+		(u64) start_pfn << PAGE_SHIFT, (u64) (end_pfn << PAGE_SHIFT) - 1);
 #endif
 	calculate_node_totalpages(pgdat, start_pfn, end_pfn,
 				  zones_size, zholes_size);