Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

powerpc/topology: Get topology for shared processors at boot

On a shared LPAR, Phyp will not update the CPU associativity at boot
time. Just after boot, the system does recognize itself as a shared
LPAR and triggers a request for the correct CPU associativity. But by
then the scheduler would have already created/destroyed its sched
domains.

This causes:
- Broken load balancing across nodes, causing islands of cores.
- Performance degradation, especially if the system is lightly loaded.
- dmesg to wrongly report all CPUs to be in Node 0.
- Messages in dmesg saying "arch topology borken".
- With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
node sched domain"), RCU stalls at boot up.

The sched_domains_numa_masks table which is used to generate cpumasks
is only created at boot time just before creating sched domains and
never updated. Hence, it's better to get the topology correct before
the sched domains are created.
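
The staleness problem can be illustrated with a small userspace sketch. It is not kernel code: `cpu_to_node_map`, `numa_masks_snapshot`, `build_numa_masks_once()` and `late_topology_update()` are hypothetical names, 64-bit integers stand in for cpumasks, and the sizes are arbitrary. It only shows that a table snapshotted once does not follow a later change to the map it was built from.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS  16	/* illustrative sizes, not the LPAR in this log */
#define NR_NODES 4

/* Live CPU-to-node map; zero-initialised, so every CPU starts in node 0. */
static int cpu_to_node_map[NR_CPUS];

/* Per-node CPU masks, built once (stand-in for sched_domains_numa_masks). */
static uint64_t numa_masks_snapshot[NR_NODES];

/* Build the snapshot from the current map; nothing rebuilds it later. */
static void build_numa_masks_once(void)
{
	for (int node = 0; node < NR_NODES; node++)
		numa_masks_snapshot[node] = 0;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		numa_masks_snapshot[cpu_to_node_map[cpu]] |= UINT64_C(1) << cpu;
}

/* A late associativity update changes the live map only. */
static void late_topology_update(int cpu, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(node >= 0 && node < NR_NODES);
	cpu_to_node_map[cpu] = node;
}
```

Calling `build_numa_masks_once()` and then `late_topology_update(8, 1)` leaves CPU 8 in the node 0 mask of the snapshot even though the live map now places it in node 1.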

For example, on a 64-core POWER8 shared LPAR, dmesg reports:

Brought up 512 CPUs
Node 0 CPUs: 0-511
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
Node 4 CPUs:
Node 5 CPUs:
Node 6 CPUs:
Node 7 CPUs:
Node 8 CPUs:
Node 9 CPUs:
Node 10 CPUs:
Node 11 CPUs:
...
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain

numactl/lscpu output will still be correct with cores spreading across
all nodes:

Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455

Currently on this LPAR, the scheduler detects 2 levels of NUMA and
creates NUMA sched domains for all CPUs, but it finds a single DIE
domain consisting of all CPUs. Hence it deletes all NUMA sched
domains.
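
The "not a subset" message in the log above reflects a containment check between a child domain's CPU span and its parent's span. The following is a minimal userspace sketch of such a check, assuming 64-bit masks in place of the kernel's cpumask type; `mask_subset()` and `cpu_range()` are illustrative names, not kernel functions, and the CPU ranges are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if every CPU set in child is also set in parent. */
static bool mask_subset(uint64_t child, uint64_t parent)
{
	return (child & ~parent) == 0;
}

/* Hypothetical helper: mask with CPUs first..last (inclusive) set. */
static uint64_t cpu_range(int first, int last)
{
	uint64_t mask = 0;

	assert(first >= 0 && first <= last && last < 64);
	for (int cpu = first; cpu <= last; cpu++)
		mask |= UINT64_C(1) << cpu;
	return mask;
}
```

With a DIE span of `cpu_range(0, 15)` and a NUMA span of `cpu_range(0, 7)`, `mask_subset()` returns false, which is the inconsistent shape described above. Once each DIE span covers only the CPUs of one node, the check passes.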

To address this, detect the shared processor and update the topology
soon after the CPUs are set up, so that the correct topology is in
place just before the scheduler creates sched domains.
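
As a rough userspace model of this boot step (not the kernel implementation): on a shared LPAR every CPU is flagged as changed and the topology is re-read once before anything consumes it. `query_node()` is a stand-in for the hypervisor associativity request, and its layout of 4 CPUs per node is an assumption made only for the example. The other names are hypothetical as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 16	/* illustrative size */

static int cpu_node[NR_CPUS];	/* boot-time view: all CPUs in node 0 */
static uint64_t changes_mask;	/* CPUs whose associativity must be re-read */

/* Stand-in for the hypervisor query; 4 CPUs per node is an assumption. */
static int query_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu / 4;
}

/* Re-read the node of every flagged CPU; returns how many CPUs moved. */
static int update_cpu_topology(void)
{
	int changed = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int node;

		if (!(changes_mask & (UINT64_C(1) << cpu)))
			continue;
		node = query_node(cpu);
		if (node != cpu_node[cpu]) {
			cpu_node[cpu] = node;
			changed++;
		}
	}
	changes_mask = 0;
	return changed;
}

/* Model of the new step: only a shared LPAR forces a full update. */
static int shared_proc_topology_init_model(bool shared_lpar)
{
	if (!shared_lpar)
		return 0;
	changes_mask = (UINT64_C(1) << NR_CPUS) - 1;	/* flag all CPUs */
	return update_cpu_topology();
}
```

In this model a dedicated LPAR leaves the map untouched, while a shared LPAR moves every CPU outside node 0 to its queried node in a single pass.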

With the fix, dmesg reports:

numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
numa: Node 11 CPUs: 160-167 256-263 352-359 448-455

and lscpu also reports:

Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455

Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Trim / format change log]
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Authored by Srikar Dronamraju and committed by Michael Ellerman
2ea62630 d6ee76d3

+20 -10
+5
arch/powerpc/include/asm/topology.h
···
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
+extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
···
 {
 	return 0;
 }
+
+#ifdef CONFIG_SMP
+static inline void shared_proc_topology_init(void) {}
+#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include <asm-generic/topology.h>
+5
arch/powerpc/kernel/smp.c
···
 	if (smp_ops && smp_ops->bringup_done)
 		smp_ops->bringup_done();
 
+	/*
+	 * On a shared LPAR, associativity needs to be requested.
+	 * Hence, get numa topology before dumping cpu topology
+	 */
+	shared_proc_topology_init();
 	dump_numa_cpu_topology();
 
 	/*
+10 -10
arch/powerpc/mm/numa.c
···
 static void reset_topology_timer(void);
 static int topology_timer_secs = 1;
 static int topology_inited;
-static int topology_update_needed;
 
 /*
  * Change polling interval for associativity changes.
···
 	struct device *dev;
 	int weight, new_nid, i = 0;
 
-	if (!prrn_enabled && !vphn_enabled) {
-		if (!topology_inited)
-			topology_update_needed = 1;
+	if (!prrn_enabled && !vphn_enabled && topology_inited)
 		return 0;
-	}
 
 	weight = cpumask_weight(&cpu_associativity_changes_mask);
 	if (!weight)
···
 
 out:
 	kfree(updates);
-	topology_update_needed = 0;
 	return changed;
 }
 
···
 	return prrn_enabled;
 }
 
+void __init shared_proc_topology_init(void)
+{
+	if (lppaca_shared_proc(get_lppaca())) {
+		bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
+			    nr_cpumask_bits);
+		numa_update_cpu_topology(false);
+	}
+}
+
 static int topology_read(struct seq_file *file, void *v)
 {
 	if (vphn_enabled || prrn_enabled)
···
 		return -ENOMEM;
 
 	topology_inited = 1;
-	if (topology_update_needed)
-		bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-			    nr_cpumask_bits);
-
 	return 0;
 }
 device_initcall(topology_update_init);