Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have one instance
per LLC and without binding, the results are

                             5.17.0-rc0             5.17.0-rc0
                                vanilla       sched-numaimb-v6
MB/sec copy-16     162596.94 (   0.00%)   580559.74 ( 257.05%)
MB/sec scale-16    136901.28 (   0.00%)   374450.52 ( 173.52%)
MB/sec add-16      157300.70 (   0.00%)   564113.76 ( 258.62%)
MB/sec triad-16    151446.88 (   0.00%)   564304.24 ( 272.61%)

STREAM can use directives to force the spread if the OpenMP version is new
enough, but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.
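
As an illustration of the kind of directive meant here, a STREAM-style triad loop can request spreading with the proc_bind(spread) clause from OpenMP 4.0. This is a minimal sketch: the function name triad_spread is hypothetical, the commit does not say which directive was used, and the placement only takes effect when the code is built with OpenMP enabled (for example -fopenmp) and suitable places are defined, e.g. via OMP_PLACES.

```c
#include <stddef.h>

/*
 * STREAM-style triad: a[i] = b[i] + scalar * c[i].
 * proc_bind(spread) asks the OpenMP runtime to spread the threads of this
 * parallel region across the available places. Without OpenMP support the
 * pragma is ignored and the loop runs serially with the same result.
 */
void triad_spread(double *a, const double *b, const double *c,
		  double scalar, size_t n)
{
	size_t i;

	#pragma omp parallel for proc_bind(spread)
	for (i = 0; i < n; i++)
		a[i] = b[i] + scalar * c[i];
}
```

This only helps when the source can be annotated and the thread count is known to the runtime, which is the limitation the paragraph above points out.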

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch:

                             5.17.0-rc0             5.17.0-rc0
                                vanilla       sched-numaimb-v5
Min      Score-16  368239.36 (   0.00%)   389816.06 (   5.86%)
Hmean    Score-16  388607.33 (   0.00%)   427877.08 *  10.11%*
Max      Score-16  408945.69 (   0.00%)   481022.17 (  17.62%)
Stddev   Score-16   15247.04 (   0.00%)    24966.82 ( -63.75%)
CoeffVar Score-16       3.92 (   0.00%)        5.82 ( -48.48%)
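
The percentage columns in these tables are consistent with a relative change against the vanilla baseline, with the sign flipped for metrics where lower is better (Stddev, CoeffVar) so that a positive number always means an improvement. A minimal sketch of that arithmetic, assuming this reading of the tables; gain_pct is a hypothetical helper, not part of the benchmark tooling:

```c
#include <stdbool.h>

/*
 * Relative change of `patched` against `baseline`, in percent, signed so
 * that a positive value is an improvement. Which metrics count as
 * "lower is better" is an assumption made by the caller.
 */
double gain_pct(double baseline, double patched, bool lower_is_better)
{
	double delta = lower_is_better ? baseline - patched
				       : patched - baseline;

	return 100.0 * delta / baseline;
}
```

For example, gain_pct(162596.94, 580559.74, false) gives roughly 257.05 for the STREAM copy row, and gain_pct(15247.04, 24966.82, true) gives roughly -63.75 for the Coremark Stddev row.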

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.

                             5.17.0-rc0             5.17.0-rc0
                                vanilla       sched-numaimb-v6
Hmean  tput-1       71631.55 (   0.00%)    73065.57 (   2.00%)
Hmean  tput-8      582758.78 (   0.00%)   556777.23 (  -4.46%)
Hmean  tput-16    1020372.75 (   0.00%)  1009995.26 (  -1.02%)
Hmean  tput-24    1416430.67 (   0.00%)  1398700.11 (  -1.25%)
Hmean  tput-32    1687702.72 (   0.00%)  1671357.04 (  -0.97%)
Hmean  tput-40    1798094.90 (   0.00%)  2015616.46 *  12.10%*
Hmean  tput-48    1972731.77 (   0.00%)  2333233.72 (  18.27%)
Hmean  tput-56    2386872.38 (   0.00%)  2759483.38 (  15.61%)
Hmean  tput-64    2909475.33 (   0.00%)  2925074.69 (   0.54%)
Hmean  tput-72    2585071.36 (   0.00%)  2962443.97 (  14.60%)
Hmean  tput-80    2994387.24 (   0.00%)  3015980.59 (   0.72%)
Hmean  tput-88    3061408.57 (   0.00%)  3010296.16 (  -1.67%)
Hmean  tput-96    3052394.82 (   0.00%)  2784743.41 (  -8.77%)
Hmean  tput-104   2997814.76 (   0.00%)  2758184.50 (  -7.99%)
Hmean  tput-112   2955353.29 (   0.00%)  2859705.09 (  -3.24%)
Hmean  tput-120   2889770.71 (   0.00%)  2764478.46 (  -4.34%)
Hmean  tput-128   2871713.84 (   0.00%)  2750136.73 (  -4.23%)
Stddev tput-1        5325.93 (   0.00%)     2002.53 (  62.40%)
Stddev tput-8        6630.54 (   0.00%)    10905.00 ( -64.47%)
Stddev tput-16      25608.58 (   0.00%)     6851.16 (  73.25%)
Stddev tput-24      12117.69 (   0.00%)     4227.79 (  65.11%)
Stddev tput-32      27577.16 (   0.00%)     8761.05 (  68.23%)
Stddev tput-40      59505.86 (   0.00%)     2048.49 (  96.56%)
Stddev tput-48     168330.30 (   0.00%)    93058.08 (  44.72%)
Stddev tput-56     219540.39 (   0.00%)    30687.02 (  86.02%)
Stddev tput-64     121750.35 (   0.00%)     9617.36 (  92.10%)
Stddev tput-72     223387.05 (   0.00%)    34081.13 (  84.74%)
Stddev tput-80     128198.46 (   0.00%)    22565.19 (  82.40%)
Stddev tput-88     136665.36 (   0.00%)    27905.97 (  79.58%)
Stddev tput-96     111925.81 (   0.00%)    99615.79 (  11.00%)
Stddev tput-104    146455.96 (   0.00%)    28861.98 (  80.29%)
Stddev tput-112     88740.49 (   0.00%)    58288.23 (  34.32%)
Stddev tput-120    186384.86 (   0.00%)    45812.03 (  75.42%)
Stddev tput-128     78761.09 (   0.00%)    57418.48 (  27.10%)

Similarly, for embarrassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.

                                vanilla       sched-numaimb-v6
Min      ep.D          31.79 (   0.00%)       26.11 (  17.87%)
Amean    ep.D          31.86 (   0.00%)       26.17 *  17.86%*
Stddev   ep.D           0.07 (   0.00%)        0.05 (  24.41%)
CoeffVar ep.D           0.22 (   0.00%)        0.20 (   7.97%)
Max      ep.D          31.93 (   0.00%)       26.21 (  17.91%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net

Authored by Mel Gorman and committed by Peter Zijlstra
e496132e 2cfb7a1b

+66 -10
+1
include/linux/sched/topology.h
···
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */

 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
+12 -10
kernel/sched/fair.c
···

 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;

 	struct numa_stats src_stats, dst_stats;

···
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int imb_numa_nr);

 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
···
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);

 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
···
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();

 	/*
···
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool
-allow_numa_imbalance(unsigned int running, unsigned int weight)
+static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
 {
-	return (running < (weight >> 2));
+	return running <= imb_numa_nr;
 }

 /*
···
 			 * allowed. If there is a real need of migration,
 			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
 				return NULL;
 		}

···
 #define NUMA_IMBALANCE_MIN 2

 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;

 	/*
···
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				local->sum_nr_running + 1, local->group_weight);
+				local->sum_nr_running + 1, env->sd->imb_numa_nr);
 		}

 		return;
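
To see what the changed check means in practice, the old and new predicates can be compared side by side. The sketch below copies the two versions of allow_numa_imbalance() from the diff under the hypothetical names allow_numa_imbalance_old and allow_numa_imbalance_new so both can live in one file. The topology in the note that follows (a 128-CPU node with 8 LLCs) is an assumed example, not a figure from the commit.

```c
#include <stdbool.h>

/* Check before the patch: tolerate an imbalance below 25% of the CPUs. */
bool allow_numa_imbalance_old(unsigned int running, unsigned int weight)
{
	return running < (weight >> 2);
}

/* Check after the patch: compare against the per-domain imb_numa_nr. */
bool allow_numa_imbalance_new(int running, int imb_numa_nr)
{
	return running <= imb_numa_nr;
}
```

On the assumed node, the old check tolerates an imbalance while fewer than 32 tasks (128 >> 2) would be running, whereas the new check with imb_numa_nr = 8 stops tolerating it once a ninth task would be placed.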
+53
kernel/sched/topology.c
···
 		}
 	}

+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top, *top_p;
+				unsigned int nr_llcs;
+
+				/*
+				 * For a single LLC per node, allow an
+				 * imbalance up to 25% of the node. This is an
+				 * arbitrary cutoff based on SMT-2 to balance
+				 * between memory bandwidth and avoiding
+				 * premature sharing of HT resources and SMT-4
+				 * or SMT-8 *may* benefit from a different
+				 * cutoff.
+				 *
+				 * For multiple LLCs, allow an imbalance
+				 * until multiple tasks would share an LLC
+				 * on one node while LLCs on another node
+				 * remain idle.
+				 */
+				nr_llcs = sd->span_weight / child->span_weight;
+				if (nr_llcs == 1)
+					imb = sd->span_weight >> 2;
+				else
+					imb = nr_llcs;
+				sd->imb_numa_nr = imb;
+
+				/* Set span based on the first NUMA domain. */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && !(top_p->flags & SD_NUMA)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
+			} else {
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
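
The loop added to kernel/sched/topology.c can be exercised outside the kernel with a small userspace model. This is a sketch under stated assumptions: struct sd_model is a hypothetical stand-in for struct sched_domain that keeps only the fields the loop reads, the flag values are placeholders and not the kernel's definitions, and set_imb_numa_nr is a hypothetical wrapper around the per-CPU walk.

```c
#include <stddef.h>

/* Placeholder flag values for this sketch; not the kernel's definitions. */
#define SD_SHARE_PKG_RESOURCES	0x1
#define SD_NUMA			0x2

/* Hypothetical stand-in for struct sched_domain (only the fields used). */
struct sd_model {
	struct sd_model *parent;
	struct sd_model *child;
	int flags;
	unsigned int span_weight;
	unsigned int imb_numa_nr;
};

/* Walk one CPU's domains from the lowest level up, as the patch does. */
void set_imb_numa_nr(struct sd_model *lowest)
{
	unsigned int imb = 0;
	unsigned int imb_span = 1;
	struct sd_model *sd;

	for (sd = lowest; sd; sd = sd->parent) {
		struct sd_model *child = sd->child;

		/* First domain above the LLC: derive imb from the LLC count. */
		if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
		    (child->flags & SD_SHARE_PKG_RESOURCES)) {
			struct sd_model *top = sd, *top_p = sd->parent;
			unsigned int nr_llcs = sd->span_weight / child->span_weight;

			imb = (nr_llcs == 1) ? sd->span_weight >> 2 : nr_llcs;
			sd->imb_numa_nr = imb;

			/* Use the span of the first NUMA domain as the unit. */
			while (top_p && !(top_p->flags & SD_NUMA)) {
				top = top->parent;
				top_p = top->parent;
			}
			imb_span = top_p ? top_p->span_weight : sd->span_weight;
		} else {
			/* Other levels: scale imb by span, at least by 1. */
			unsigned int factor = sd->span_weight / imb_span;

			if (factor < 1)
				factor = 1;
			sd->imb_numa_nr = imb * factor;
		}
	}
}
```

For an assumed hierarchy of SMT (2 CPUs), LLC (16 CPUs), package (128 CPUs) and one NUMA level (256 CPUs), the walk leaves imb_numa_nr at 0 for the two cache-sharing levels and sets it to 8, the number of LLCs per package, for the package and NUMA levels.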