commit 198e2f181163233b379dc7ce8a6d7516b84042e7 · tjh.dev/kernel

[PATCH] scheduler cache-hot-autodetect

From: Ingo Molnar <mingo@elte.hu>

This is the latest version of the scheduler cache-hot-auto-tune patch.

The first problem was that detection time scaled with O(N^2), which is
unacceptable on larger SMP and NUMA systems. To solve this:

- I've added a 'domain distance' function, which is used to cache
measurement results. Each distance is only measured once. This means
that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT
distances 0 and 1, and on SMP distance 0 is measured. The code walks
the domain tree to determine the distance, so it automatically follows
whatever hierarchy an architecture sets up. This cuts down on the boot
time significantly and removes the O(N^2) limit. The only assumption
is that migration costs can be expressed as a function of domain
distance - this covers the overwhelming majority of existing systems,
and is a good guess even for more asymmetric systems.

[ People hacking systems that have asymmetries that break this
assumption (e.g. different CPU speeds) should experiment a bit with
the cpu_distance() function. Adding a ->migration_distance factor to
the domain structure would be one possible solution - but let's first
see the problem systems, if they exist at all. Let's not overdesign. ]
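
The caching idea above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code: `struct cpu_topo`, `struct cost_cache`, `calibrate()` and `fake_measure()` are hypothetical names, the topology is a flat array of per-level domain ids rather than a real sched-domain tree, and `MAX_DISTANCE` is an arbitrary small bound. All CPU pairs are still visited, but a measurement is taken only for a distance that has no cached result yet.

```c
#include <assert.h>

#define MAX_DISTANCE 8	/* hypothetical bound for this sketch */

/* Hypothetical topology record: domain_id[level] names the domain this
 * CPU belongs to at each level, innermost (e.g. HT siblings) first. */
struct cpu_topo {
	int domain_id[MAX_DISTANCE];
};

/* One cached measurement per distance. */
struct cost_cache {
	long long cost[MAX_DISTANCE];
	int valid[MAX_DISTANCE];
};

typedef long long (*measure_fn)(int cpu1, int cpu2);

/* Number of levels to climb before both CPUs sit in the same domain. */
static int cpu_distance(const struct cpu_topo *a, const struct cpu_topo *b)
{
	int level;

	for (level = 0; level < MAX_DISTANCE - 1; level++)
		if (a->domain_id[level] == b->domain_id[level])
			return level;
	return MAX_DISTANCE - 1;
}

/* Visit every CPU pair, but measure each distance only once.
 * Returns the number of measurements actually taken. */
static int calibrate(const struct cpu_topo *cpus, int ncpus,
		     struct cost_cache *cache, measure_fn measure)
{
	int i, j, taken = 0;

	assert(ncpus >= 0);
	for (i = 0; i < ncpus; i++) {
		for (j = 0; j < ncpus; j++) {
			int d;

			if (i == j)
				continue;
			d = cpu_distance(&cpus[i], &cpus[j]);
			if (!cache->valid[d]) {
				cache->cost[d] = measure(i, j);
				cache->valid[d] = 1;
				taken++;
			}
		}
	}
	return taken;
}

/* Placeholder measurement; a real one would time cache refills. */
static long long fake_measure(int cpu1, int cpu2)
{
	return 1000LL * (cpu1 + cpu2 + 1);
}
```

For a dual HT layout where CPUs 0/2 and 1/3 are sibling pairs, this takes two measurements (distance 0 and distance 1) instead of one per pair.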

Another problem was that only a single cache-size was used for measuring
the cost of migration, and most architectures didn't set that variable
up. Furthermore, a single cache-size does not fit NUMA hierarchies with
L3 caches and does not fit HT setups, where different CPUs will often
have different 'effective cache sizes'. To solve this problem:

- Instead of relying on a single cache-size provided by the platform and
sticking to it, the code now auto-detects the 'effective migration
cost' between two measured CPUs, via iterating through a wide range of
cachesizes. The code searches for the maximum migration cost, which
occurs when the working set of the test-workload falls just below the
'effective cache size'. I.e. an optimized search is done for the
maximum migration cost, measured between two real CPUs.

This, amongst other things, has the positive effect that if e.g. two
CPUs share an L2/L3 cache, a different (and accurate) migration cost
will be found than between two CPUs on the same system that don't share
any caches.

(The reliable measurement of migration costs is tricky - see the source
for details.)
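
As a rough illustration of the maximum search described above, here is a user-space sketch. The names `search_max_cost()` and `toy_cost()` are hypothetical, and the toy cost model (cost grows with the working set until it exceeds a 512K "cache", then drops) is an assumption standing in for a real timed measurement. Only the 5% size step (`size * 20 / 19`) is taken from the patch.

```c
#include <assert.h>

typedef long long (*cost_fn)(unsigned int size);

struct search_result {
	unsigned int size_found;	/* working-set size with the highest cost */
	long long max_cost;
};

/* Step the working-set size up in ~5% increments and remember where
 * the measured cost peaks. */
static struct search_result
search_max_cost(unsigned int min_size, unsigned int max_size, cost_fn cost_at)
{
	struct search_result r = { 0, 0 };
	unsigned int size;

	assert(min_size >= 19);	/* size * 20 / 19 must make progress */
	for (size = min_size; size <= max_size; size = size * 20 / 19) {
		long long cost = cost_at(size);

		if (cost > r.max_cost) {
			r.max_cost = cost;
			r.size_found = size;
		}
	}
	return r;
}

#define TOY_CACHE_SIZE (512 * 1024U)	/* assumed effective cache size */

/* Toy model: cost rises while the working set fits the cache and
 * falls off once it no longer does. */
static long long toy_cost(unsigned int size)
{
	return size <= TOY_CACHE_SIZE ? (long long)size : TOY_CACHE_SIZE / 4;
}
```

With the toy model, a search from 64K to 2MB settles on the last step that still fits in the assumed cache, i.e. a size just below 512K.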

Furthermore, I've added various boot-time options to override/tune
migration behavior.

Firstly, there's a blanket override for autodetection:

migration_cost=1000,2000,3000

will override the depth 0/1/2 values with 1msec/2msec/3msec values.
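
A minimal user-space sketch of how such a comma-separated override could be turned into per-distance costs follows. `parse_migration_cost()` is a hypothetical helper built on `strtol()`; the patch itself uses the kernel's `get_options()`. The microsecond-to-nanosecond conversion (times 1000) matches what the patch stores in its `migration_cost[]` array.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse "usecs,usecs,..." into cost_ns[], indexed by domain distance.
 * Values are converted from microseconds to nanoseconds.
 * Returns the number of entries filled in. */
static int parse_migration_cost(const char *str,
				unsigned long long *cost_ns, int max_entries)
{
	int n = 0;

	assert(str && cost_ns);
	while (*str && n < max_entries) {
		char *end;
		long usecs = strtol(str, &end, 10);

		if (end == str || usecs < 0)
			break;		/* not a usable number: stop parsing */
		cost_ns[n++] = (unsigned long long)usecs * 1000;
		if (*end != ',')
			break;		/* end of the list */
		str = end + 1;
	}
	return n;
}
```

Parsing "1000,2000,3000" this way fills distances 0, 1 and 2 with 1, 2 and 3 msec expressed in nanoseconds.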

Secondly, there's a global factor that can be used to increase (or
decrease) the autodetected values:

migration_factor=120

will increase the autodetected values by 20%. This option is useful to
tune things in a workload-dependent way - e.g. if a workload is
cache-insensitive then CPU utilization can be maximized by specifying
migration_factor=0.
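
The scaling arithmetic can be worked through in a few lines. The sketch below mirrors the two expressions found in the patch's kernel/sched.c (the percent value is rescaled from base 100 to base 128, and the final cutoff is twice the measured maximum times that factor); the function names `scale_factor()` and `cache_hot_time()` are hypothetical. Because of integer division, 120 becomes 153/128, so the increase is slightly under 20%.

```c
#include <assert.h>

#define FACTOR_SCALE 128	/* same base as MIGRATION_FACTOR_SCALE in the patch */

/* Rescale a percent value (100 == unchanged) to a base-128 factor. */
static unsigned int scale_factor(unsigned int percent)
{
	return percent * FACTOR_SCALE / 100;
}

/* Cutoff in nsec: twice the worst-case migration cost, scaled. */
static unsigned long long cache_hot_time(unsigned long long max_cost_ns,
					 unsigned int percent)
{
	/* keep the intermediate product well inside 64 bits */
	assert(max_cost_ns < (1ULL << 40) && percent <= 100000);
	return 2 * max_cost_ns * scale_factor(percent) / FACTOR_SCALE;
}
```

For a measured maximum of 1 msec, a factor of 100 gives 2,000,000 nsec, a factor of 120 gives 2,390,625 nsec, and a factor of 0 gives 0.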

I've tested the autodetection code quite extensively on x86, on 3
P3/Xeon/2MB, and the autodetected values look pretty good:

Dual Celeron (128K L2 cache):

---------------------
migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
---------------------
        [00]    [01]
[00]:   -       1.7(1)
[01]:   1.7(1)  -
---------------------
cacheflush times [2]: 0.0 (0) 1.7 (1784008)
---------------------

Here the slow memory subsystem dominates system performance, and even
though caches are small, the migration cost is 1.7 msecs.

Dual HT P4 (512K L2 cache):

---------------------
migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
---------------------
        [00]    [01]    [02]    [03]
[00]:   -       0.4(1)  0.0(0)  0.4(1)
[01]:   0.4(1)  -       0.4(1)  0.0(0)
[02]:   0.0(0)  0.4(1)  -       0.4(1)
[03]:   0.4(1)  0.0(0)  0.4(1)  -
---------------------
cacheflush times [2]: 0.0 (33900) 0.4 (448514)
---------------------

Here it can be seen that there is no migration cost between two HT
siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory
system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.

8-way P3/Xeon [2MB L2 cache]:

---------------------
migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
---------------------
        [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
[00]:   -       19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
[01]:   19.2(1) -       19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
[02]:   19.2(1) 19.2(1) -       19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
[03]:   19.2(1) 19.2(1) 19.2(1) -       19.2(1) 19.2(1) 19.2(1) 19.2(1)
[04]:   19.2(1) 19.2(1) 19.2(1) 19.2(1) -       19.2(1) 19.2(1) 19.2(1)
[05]:   19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) -       19.2(1) 19.2(1)
[06]:   19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) -       19.2(1)
[07]:   19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) -
---------------------
cacheflush times [2]: 0.0 (0) 19.2 (19281756)
---------------------

This one has huge caches and a relatively slow memory subsystem - so the
migration cost is 19 msecs.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: <wilder@us.ibm.com>
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

authored by akpm@osdl.org and committed by Linus Torvalds 20 years ago 198e2f18 4dc7a0bb

+527 -10

12 changed files

Documentation/kernel-parameters.txt
arch/i386/kernel/smpboot.c
arch/ia64/kernel/setup.c
arch/ppc/xmon/xmon.c
include/asm-i386/topology.h
include/asm-ia64/topology.h
include/asm-mips/mach-ip27/topology.h
include/asm-powerpc/topology.h
include/asm-x86_64/topology.h
include/linux/sched.h
include/linux/topology.h
kernel/sched.c

+43

Documentation/kernel-parameters.txt

···
 
 	mga=		[HW,DRM]
 
+	migration_cost=
+			[KNL,SMP] debug: override scheduler migration costs
+			Format: <level-1-usecs>,<level-2-usecs>,...
+			This debugging option can be used to override the
+			default scheduler migration cost matrix. The numbers
+			are indexed by 'CPU domain distance'.
+			E.g. migration_cost=1000,2000,3000 on an SMT NUMA
+			box will set up an intra-core migration cost of
+			1 msec, an inter-core migration cost of 2 msecs,
+			and an inter-node migration cost of 3 msecs.
+
+			WARNING: using the wrong values here can break
+			scheduler performance, so it's only for scheduler
+			development purposes, not production environments.
+
+	migration_debug=
+			[KNL,SMP] migration cost auto-detect verbosity
+			Format=<0|1|2>
+			If a system's migration matrix reported at bootup
+			seems erroneous then this option can be used to
+			increase verbosity of the detection process.
+			We default to 0 (no extra messages), 1 will print
+			some more information, and 2 will be really
+			verbose (probably only useful if you also have a
+			serial console attached to the system).
+
+	migration_factor=
+			[KNL,SMP] multiply/divide migration costs by a factor
+			Format=<percent>
+			This debug option can be used to proportionally
+			increase or decrease the auto-detected migration
+			costs for all entries of the migration matrix.
+			E.g. migration_factor=150 will increase migration
+			costs by 50%. (and thus the scheduler will be less
+			eager migrating cache-hot tasks)
+			migration_factor=80 will decrease migration costs
+			by 20%. (thus the scheduler will be more eager to
+			migrate tasks)
+
+			WARNING: using the wrong values here can break
+			scheduler performance, so it's only for scheduler
+			development purposes, not production environments.
+
 	mousedev.tap_time=
 			[MOUSE] Maximum time between finger touching and
 			leaving touchpad surface for touch to be considered

arch/i386/kernel/smpboot.c

···
 			cachesize = 16; /* Pentiums, 2x8kB cache */
 			bandwidth = 100;
 		}
+		max_cache_size = cachesize * 1024;
 	}
 }
 

arch/ia64/kernel/setup.c

···
 get_max_cacheline_size (void)
 {
 	unsigned long line_size, max = 1;
+	unsigned int cache_size = 0;
 	u64 l, levels, unique_caches;
 	pal_cache_config_info_t cci;
 	s64 status;
···
 		line_size = 1 << cci.pcci_line_size;
 		if (line_size > max)
 			max = line_size;
+		if (cache_size < cci.pcci_cache_size)
+			cache_size = cci.pcci_cache_size;
 		if (!cci.pcci_unified) {
 			status = ia64_pal_cache_config_info(l,
 						    /* cache_type (instruction)= */ 1,
···
 			ia64_i_cache_stride_shift = cci.pcci_stride;
 	}
   out:
+#ifdef CONFIG_SMP
+	max_cache_size = max(max_cache_size, cache_size);
+#endif
 	if (max > ia64_max_cacheline_size)
 		ia64_max_cacheline_size = max;
 }

+1 -1

arch/ppc/xmon/xmon.c

···
 static void insert_bpts(void);
 static struct bpt *at_breakpoint(unsigned pc);
 static void bpt_cmds(void);
-static void cacheflush(void);
+void cacheflush(void);
 #ifdef CONFIG_SMP
 static void cpu_cmd(void);
 #endif /* CONFIG_SMP */

-1

include/asm-i386/topology.h

···
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 1,			\

-2

include/asm-ia64/topology.h

···
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.per_cpu_gain		= 100,			\
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 2,			\
···
 	.max_interval		= 8*(min(num_online_cpus(), 32)), \
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 2,			\

-1

include/asm-mips/mach-ip27/topology.h

···
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\

-1

include/asm-powerpc/topology.h

···
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.busy_idx		= 3,			\

-1

include/asm-x86_64/topology.h

···
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 2,			\

+8 -1

include/linux/sched.h

···
 
 extern void partition_sched_domains(cpumask_t *partition1,
 				    cpumask_t *partition2);
-#endif	/* CONFIG_SMP */
+
+/*
+ * Maximum cache size the migration-costs auto-tuning code will
+ * search from:
+ */
+extern unsigned int max_cache_size;
+
+#endif /* CONFIG_SMP */
 
 
 struct io_context;			/* See blkdev.h */

-2

include/linux/topology.h

···
 	.max_interval		= 2,			\
 	.busy_factor		= 8,			\
 	.imbalance_pct		= 110,			\
-	.cache_hot_time		= 0,			\
 	.cache_nice_tries	= 0,			\
 	.per_cpu_gain		= 25,			\
 	.busy_idx		= 0,			\
···
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.busy_idx		= 2,			\

+468

kernel/sched.c

···
 #include <linux/notifier.h>
 #include <linux/profile.h>
 #include <linux/suspend.h>
+#include <linux/vmalloc.h>
 #include <linux/blkdev.h>
 #include <linux/delay.h>
 #include <linux/smp.h>
···
 
 #define SD_NODES_PER_DOMAIN 16
 
+/*
+ * Self-tuning task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the cost of manipulating buffers of varying
+ * sizes. For a given buffer-size here are the steps that are taken:
+ *
+ * 1) the source CPU reads+dirties a shared buffer
+ * 2) the target CPU reads+dirties the same shared buffer
+ *
+ * We measure how long they take, in the following 4 scenarios:
+ *
+ *  - source: CPU1, target: CPU2 | cost1
+ *  - source: CPU2, target: CPU1 | cost2
+ *  - source: CPU1, target: CPU1 | cost3
+ *  - source: CPU2, target: CPU2 | cost4
+ *
+ * We then calculate the cost3+cost4-cost1-cost2 difference - this is
+ * the cost of migration.
+ *
+ * We then start off from a small buffer-size and iterate up to larger
+ * buffer sizes, in 5% steps - measuring each buffer-size separately, and
+ * doing a maximum search for the cost. (The maximum cost for a migration
+ * normally occurs when the working set size is around the effective cache
+ * size.)
+ */
+#define SEARCH_SCOPE		2
+#define MIN_CACHE_SIZE		(64*1024U)
+#define DEFAULT_CACHE_SIZE	(5*1024*1024U)
+#define ITERATIONS		2
+#define SIZE_THRESH		130
+#define COST_THRESH		130
+
+/*
+ * The migration cost is a function of 'domain distance'. Domain
+ * distance is the number of steps a CPU has to iterate down its
+ * domain tree to share a domain with the other CPU. The farther
+ * two CPUs are from each other, the larger the distance gets.
+ *
+ * Note that we use the distance only to cache measurement results,
+ * the distance value is not used numerically otherwise. When two
+ * CPUs have the same distance it is assumed that the migration
+ * cost is the same. (this is a simplification but quite practical)
+ */
+#define MAX_DOMAIN_DISTANCE 32
+
+static unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
+		{ [ 0 ... MAX_DOMAIN_DISTANCE-1 ] = -1LL };
+
+/*
+ * Allow override of migration cost - in units of microseconds.
+ * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost
+ * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs:
+ */
+static int __init migration_cost_setup(char *str)
+{
+	int ints[MAX_DOMAIN_DISTANCE+1], i;
+
+	str = get_options(str, ARRAY_SIZE(ints), ints);
+
+	printk("#ints: %d\n", ints[0]);
+	for (i = 1; i <= ints[0]; i++) {
+		migration_cost[i-1] = (unsigned long long)ints[i]*1000;
+		printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]);
+	}
+	return 1;
+}
+
+__setup ("migration_cost=", migration_cost_setup);
+
+/*
+ * Global multiplier (divisor) for migration-cutoff values,
+ * in percentiles. E.g. use a value of 150 to get 1.5 times
+ * longer cache-hot cutoff times.
+ *
+ * (We scale it from 100 to 128 to long long handling easier.)
+ */
+
+#define MIGRATION_FACTOR_SCALE 128
+
+static unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+
+static int __init setup_migration_factor(char *str)
+{
+	get_option(&str, &migration_factor);
+	migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100;
+	return 1;
+}
+
+__setup("migration_factor=", setup_migration_factor);
+
+/*
+ * Estimated distance of two CPUs, measured via the number of domains
+ * we have to pass for the two CPUs to be in the same span:
+ */
+static unsigned long domain_distance(int cpu1, int cpu2)
+{
+	unsigned long distance = 0;
+	struct sched_domain *sd;
+
+	for_each_domain(cpu1, sd) {
+		WARN_ON(!cpu_isset(cpu1, sd->span));
+		if (cpu_isset(cpu2, sd->span))
+			return distance;
+		distance++;
+	}
+	if (distance >= MAX_DOMAIN_DISTANCE) {
+		WARN_ON(1);
+		distance = MAX_DOMAIN_DISTANCE-1;
+	}
+
+	return distance;
+}
+
+static unsigned int migration_debug;
+
+static int __init setup_migration_debug(char *str)
+{
+	get_option(&str, &migration_debug);
+	return 1;
+}
+
+__setup("migration_debug=", setup_migration_debug);
+
+/*
+ * Maximum cache-size that the scheduler should try to measure.
+ * Architectures with larger caches should tune this up during
+ * bootup. Gets used in the domain-setup code (i.e. during SMP
+ * bootup).
+ */
+unsigned int max_cache_size;
+
+static int __init setup_max_cache_size(char *str)
+{
+	get_option(&str, &max_cache_size);
+	return 1;
+}
+
+__setup("max_cache_size=", setup_max_cache_size);
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+static void touch_cache(void *__cache, unsigned long __size)
+{
+	unsigned long size = __size/sizeof(long), chunk1 = size/3,
+			chunk2 = 2*size/3;
+	unsigned long *cache = __cache;
+	int i;
+
+	for (i = 0; i < size/6; i += 8) {
+		switch (i % 6) {
+			case 0: cache[i]++;
+			case 1: cache[size-1-i]++;
+			case 2: cache[chunk1-i]++;
+			case 3: cache[chunk1+i]++;
+			case 4: cache[chunk2-i]++;
+			case 5: cache[chunk2+i]++;
+		}
+	}
+}
+
+/*
+ * Measure the cache-cost of one task migration. Returns in units of nsec.
+ */
+static unsigned long long measure_one(void *cache, unsigned long size,
+				      int source, int target)
+{
+	cpumask_t mask, saved_mask;
+	unsigned long long t0, t1, t2, t3, cost;
+
+	saved_mask = current->cpus_allowed;
+
+	/*
+	 * Flush source caches to RAM and invalidate them:
+	 */
+	sched_cacheflush();
+
+	/*
+	 * Migrate to the source CPU:
+	 */
+	mask = cpumask_of_cpu(source);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != source);
+
+	/*
+	 * Dirty the working set:
+	 */
+	t0 = sched_clock();
+	touch_cache(cache, size);
+	t1 = sched_clock();
+
+	/*
+	 * Migrate to the target CPU, dirty the L2 cache and access
+	 * the shared buffer. (which represents the working set
+	 * of a migrated task.)
+	 */
+	mask = cpumask_of_cpu(target);
+	set_cpus_allowed(current, mask);
+	WARN_ON(smp_processor_id() != target);
+
+	t2 = sched_clock();
+	touch_cache(cache, size);
+	t3 = sched_clock();
+
+	cost = t1-t0 + t3-t2;
+
+	if (migration_debug >= 2)
+		printk("[%d->%d]: %8Ld %8Ld %8Ld => %10Ld.\n",
+			source, target, t1-t0, t1-t0, t3-t2, cost);
+	/*
+	 * Flush target caches to RAM and invalidate them:
+	 */
+	sched_cacheflush();
+
+	set_cpus_allowed(current, saved_mask);
+
+	return cost;
+}
+
+/*
+ * Measure a series of task migrations and return the average
+ * result. Since this code runs early during bootup the system
+ * is 'undisturbed' and the average latency makes sense.
+ *
+ * The algorithm in essence auto-detects the relevant cache-size,
+ * so it will properly detect different cachesizes for different
+ * cache-hierarchies, depending on how the CPUs are connected.
+ *
+ * Architectures can prime the upper limit of the search range via
+ * max_cache_size, otherwise the search range defaults to 20MB...64K.
+ */
+static unsigned long long
+measure_cost(int cpu1, int cpu2, void *cache, unsigned int size)
+{
+	unsigned long long cost1, cost2;
+	int i;
+
+	/*
+	 * Measure the migration cost of 'size' bytes, over an
+	 * average of 10 runs:
+	 *
+	 * (We perturb the cache size by a small (0..4k)
+	 *  value to compensate size/alignment related artifacts.
+	 *  We also subtract the cost of the operation done on
+	 *  the same CPU.)
+	 */
+	cost1 = 0;
+
+	/*
+	 * dry run, to make sure we start off cache-cold on cpu1,
+	 * and to get any vmalloc pagefaults in advance:
+	 */
+	measure_one(cache, size, cpu1, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu1, cpu2);
+
+	measure_one(cache, size, cpu2, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost1 += measure_one(cache, size - i*1024, cpu2, cpu1);
+
+	/*
+	 * (We measure the non-migrating [cached] cost on both
+	 *  cpu1 and cpu2, to handle CPUs with different speeds)
+	 */
+	cost2 = 0;
+
+	measure_one(cache, size, cpu1, cpu1);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu1, cpu1);
+
+	measure_one(cache, size, cpu2, cpu2);
+	for (i = 0; i < ITERATIONS; i++)
+		cost2 += measure_one(cache, size - i*1024, cpu2, cpu2);
+
+	/*
+	 * Get the per-iteration migration cost:
+	 */
+	do_div(cost1, 2*ITERATIONS);
+	do_div(cost2, 2*ITERATIONS);
+
+	return cost1 - cost2;
+}
+
+static unsigned long long measure_migration_cost(int cpu1, int cpu2)
+{
+	unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0;
+	unsigned int max_size, size, size_found = 0;
+	long long cost = 0, prev_cost;
+	void *cache;
+
+	/*
+	 * Search from max_cache_size*5 down to 64K - the real relevant
+	 * cachesize has to lie somewhere inbetween.
+	 */
+	if (max_cache_size) {
+		max_size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE);
+		size = max(max_cache_size / SEARCH_SCOPE, MIN_CACHE_SIZE);
+	} else {
+		/*
+		 * Since we have no estimation about the relevant
+		 * search range
+		 */
+		max_size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE;
+		size = MIN_CACHE_SIZE;
+	}
+
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+
+	/*
+	 * Allocate the working set:
+	 */
+	cache = vmalloc(max_size);
+	if (!cache) {
+		printk("could not vmalloc %d bytes for cache!\n", 2*max_size);
+		return 1000000; // return 1 msec on very small boxen
+	}
+
+	while (size <= max_size) {
+		prev_cost = cost;
+		cost = measure_cost(cpu1, cpu2, cache, size);
+
+		/*
+		 * Update the max:
+		 */
+		if (cost > 0) {
+			if (max_cost < cost) {
+				max_cost = cost;
+				size_found = size;
+			}
+		}
+		/*
+		 * Calculate average fluctuation, we use this to prevent
+		 * noise from triggering an early break out of the loop:
+		 */
+		fluct = abs(cost - prev_cost);
+		avg_fluct = (avg_fluct + fluct)/2;
+
+		if (migration_debug)
+			printk("-> [%d][%d][%7d] %3ld.%ld [%3ld.%ld] (%ld): (%8Ld %8Ld)\n",
+				cpu1, cpu2, size,
+				(long)cost / 1000000,
+				((long)cost / 100000) % 10,
+				(long)max_cost / 1000000,
+				((long)max_cost / 100000) % 10,
+				domain_distance(cpu1, cpu2),
+				cost, avg_fluct);
+
+		/*
+		 * If we iterated at least 20% past the previous maximum,
+		 * and the cost has dropped by more than 20% already,
+		 * (taking fluctuations into account) then we assume to
+		 * have found the maximum and break out of the loop early:
+		 */
+		if (size_found && (size*100 > size_found*SIZE_THRESH))
+			if (cost+avg_fluct <= 0 ||
+				max_cost*100 > (cost+avg_fluct)*COST_THRESH) {
+
+				if (migration_debug)
+					printk("-> found max.\n");
+				break;
+			}
+		/*
+		 * Increase the cachesize in 5% steps:
+		 */
+		size = size * 20 / 19;
+	}
+
+	if (migration_debug)
+		printk("[%d][%d] working set size found: %d, cost: %Ld\n",
+			cpu1, cpu2, size_found, max_cost);
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 *
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 */
+	return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE;
+}
+
+static void calibrate_migration_costs(const cpumask_t *cpu_map)
+{
+	int cpu1 = -1, cpu2 = -1, cpu, orig_cpu = raw_smp_processor_id();
+	unsigned long j0, j1, distance, max_distance = 0;
+	struct sched_domain *sd;
+
+	j0 = jiffies;
+
+	/*
+	 * First pass - calculate the cacheflush times:
+	 */
+	for_each_cpu_mask(cpu1, *cpu_map) {
+		for_each_cpu_mask(cpu2, *cpu_map) {
+			if (cpu1 == cpu2)
+				continue;
+			distance = domain_distance(cpu1, cpu2);
+			max_distance = max(max_distance, distance);
+			/*
+			 * No result cached yet?
+			 */
+			if (migration_cost[distance] == -1LL)
+				migration_cost[distance] =
+					measure_migration_cost(cpu1, cpu2);
+		}
+	}
+	/*
+	 * Second pass - update the sched domain hierarchy with
+	 * the new cache-hot-time estimations:
+	 */
+	for_each_cpu_mask(cpu, *cpu_map) {
+		distance = 0;
+		for_each_domain(cpu, sd) {
+			sd->cache_hot_time = migration_cost[distance];
+			distance++;
+		}
+	}
+	/*
+	 * Print the matrix:
+	 */
+	if (migration_debug)
+		printk("migration: max_cache_size: %d, cpu: %d MHz:\n",
+			max_cache_size,
+#ifdef CONFIG_X86
+			cpu_khz/1000
+#else
+			-1
+#endif
+		);
+	printk("migration_cost=");
+	for (distance = 0; distance <= max_distance; distance++) {
+		if (distance)
+			printk(",");
+		printk("%ld", (long)migration_cost[distance] / 1000);
+	}
+	printk("\n");
+	j1 = jiffies;
+	if (migration_debug)
+		printk("migration: %ld seconds\n", (j1-j0)/HZ);
+
+	/*
+	 * Move back to the original CPU. NUMA-Q gets confused
+	 * if we migrate to another quad during bootup.
+	 */
+	if (raw_smp_processor_id() != orig_cpu) {
+		cpumask_t mask = cpumask_of_cpu(orig_cpu),
+			saved_mask = current->cpus_allowed;
+
+		set_cpus_allowed(current, mask);
+		set_cpus_allowed(current, saved_mask);
+	}
+}
+
 #ifdef CONFIG_NUMA
+
 /**
  * find_next_best_node - find the next node to include in a sched_domain
  * @node: node whose sched_domain we're building
···
 #endif
 		cpu_attach_domain(sd, i);
 	}
+	/*
+	 * Tune cache-hot values:
+	 */
+	calibrate_migration_costs(cpu_map);
 }
 /*
  * Set up scheduler domains and groups. Callers must hold the hotplug lock.
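
The early break-out test in measure_migration_cost() can be isolated as a small predicate for experimentation. The sketch below is a user-space restatement under assumptions: `found_max()` is a hypothetical name, the arguments are passed in explicitly instead of living in loop state, and the size comparison is widened to 64 bits. The two thresholds of 130 are the patch's SIZE_THRESH and COST_THRESH values, i.e. the search stops once the size is more than 30% past the size of the current maximum and the maximum exceeds the fluctuation-adjusted cost by more than 30%.

```c
#include <assert.h>

#define SIZE_THRESH 130	/* values taken from the patch */
#define COST_THRESH 130

/* Returns 1 if the search may stop: we are well past the size that
 * produced the maximum, and the cost has clearly dropped since. */
static int found_max(unsigned int size, unsigned int size_found,
		     long long cost, long long max_cost, long long avg_fluct)
{
	assert(max_cost >= 0 && avg_fluct >= 0);

	/* no maximum seen yet, or not far enough past it */
	if (!size_found ||
	    (unsigned long long)size * 100 <=
	    (unsigned long long)size_found * SIZE_THRESH)
		return 0;

	return cost + avg_fluct <= 0 ||
	       max_cost * 100 > (cost + avg_fluct) * COST_THRESH;
}
```

For example, with a maximum of 1000 found at 500000 bytes, a cost of 500 at 700000 bytes ends the search, while the same cost at 600000 bytes (only 20% past the maximum) does not.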