Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/cpufreq: Consider reduced CPU capacity in energy calculation

Energy Aware Scheduling (EAS) needs to predict the decisions made by
SchedUtil. map_util_freq() exists to do that.

There are corner cases where the max allowed frequency might be reduced
(due to thermal capping). SchedUtil, as a CPUFreq governor, is aware of
that, but EAS is not. This patch addresses it.

SchedUtil stores the maximum allowed frequency in the
'sugov_policy::next_freq' field. EAS has to predict that value, which is
the frequency actually used. That value is produced after a call to
cpufreq_driver_resolve_freq(), which clamps it to the CPUFreq policy
limits. In the existing code, EAS is not able to predict that real
frequency, which leads to energy estimation errors.

To avoid wrong energy estimation in EAS (due to frequency misprediction),
make sure that the step which calculates the Performance Domain frequency
is also aware of the allowed CPU capacity.

Furthermore, modify map_util_freq() so that it no longer extends the
frequency value. Instead, use map_util_perf() to extend the util value in
both places, SchedUtil and EAS, but for EAS clamp it to the max allowed
CPU capacity. In the end, both subsystems get the same desirable behavior
and agree on the real CPU frequency.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part)
Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
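
The misprediction described in the commit message can be illustrated with a small userspace sketch. This is not kernel code: eas_freq_before() and eas_freq_after() are hypothetical helper names that only mirror the frequency-prediction step of em_cpu_energy() before and after the patch, and the body of map_util_perf() is assumed to add 25% headroom to the utilization (the diff only shows its signature).

```c
#include <assert.h>

/* Userspace copy of the post-patch map_util_freq(): plain linear scaling. */
static unsigned long map_util_freq(unsigned long util,
				   unsigned long freq, unsigned long cap)
{
	assert(cap != 0);	/* guard added for this sketch only */
	return freq * util / cap;
}

/* Assumed body: 25% headroom applied to the utilization. */
static unsigned long map_util_perf(unsigned long util)
{
	return util + (util >> 2);
}

/* Hypothetical helper: pre-patch prediction, headroom on the frequency. */
static unsigned long eas_freq_before(unsigned long max_util,
				     unsigned long max_freq,
				     unsigned long scale_cpu)
{
	assert(scale_cpu != 0);
	return (max_freq + (max_freq >> 2)) * max_util / scale_cpu;
}

/*
 * Hypothetical helper: post-patch prediction. The utilization gets the
 * headroom first and is then clamped to the allowed CPU capacity, so a
 * thermally reduced capacity lowers the predicted frequency.
 */
static unsigned long eas_freq_after(unsigned long max_util,
				    unsigned long max_freq,
				    unsigned long scale_cpu,
				    unsigned long allowed_cpu_cap)
{
	max_util = map_util_perf(max_util);
	if (max_util > allowed_cpu_cap)
		max_util = allowed_cpu_cap;
	return map_util_freq(max_util, max_freq, scale_cpu);
}
```

With illustrative values (max frequency 2000000 kHz, CPU capacity 1024, max utilization 800), eas_freq_before() predicts 1953125 kHz regardless of any capping. eas_freq_after() gives the same 1953125 kHz when the allowed capacity is the full 1024, but 1500000 kHz when the allowed capacity is reduced to 768.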

Authored by Lukasz Luba and committed by Peter Zijlstra
8f1b971b 489f1645

+16 -5
+13 -3
include/linux/energy_model.h
···
  * @pd		: performance domain for which energy has to be estimated
  * @max_util	: highest utilization among CPUs of the domain
  * @sum_util	: sum of the utilization of all CPUs in the domain
+ * @allowed_cpu_cap	: maximum allowed CPU capacity for the @pd, which
+			  might reflect reduced frequency (due to thermal)
  *
  * This function must be used only for CPU devices. There is no validation,
  * i.e. if the EM is a CPU type and has cpumask allocated. It is called from
···
  * a capacity state satisfying the max utilization of the domain.
  */
 static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
-				unsigned long max_util, unsigned long sum_util)
+				unsigned long max_util, unsigned long sum_util,
+				unsigned long allowed_cpu_cap)
 {
 	unsigned long freq, scale_cpu;
 	struct em_perf_state *ps;
···
 	/*
 	 * In order to predict the performance state, map the utilization of
 	 * the most utilized CPU of the performance domain to a requested
-	 * frequency, like schedutil.
+	 * frequency, like schedutil. Take also into account that the real
+	 * frequency might be set lower (due to thermal capping). Thus, clamp
+	 * max utilization to the allowed CPU capacity before calculating
+	 * effective frequency.
 	 */
 	cpu = cpumask_first(to_cpumask(pd->cpus));
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ps = &pd->table[pd->nr_perf_states - 1];
+
+	max_util = map_util_perf(max_util);
+	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
 
 	/*
···
 	return NULL;
 }
 static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
-			unsigned long max_util, unsigned long sum_util)
+			unsigned long max_util, unsigned long sum_util,
+			unsigned long allowed_cpu_cap)
 {
 	return 0;
 }
+1 -1
include/linux/sched/cpufreq.h
···
 static inline unsigned long map_util_freq(unsigned long util,
 					unsigned long freq, unsigned long cap)
 {
-	return (freq + (freq >> 2)) * util / cap;
+	return freq * util / cap;
 }
 
 static inline unsigned long map_util_perf(unsigned long util)
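
The hunk above moves the 25% headroom out of map_util_freq(), so callers are expected to apply it to the utilization instead. A minimal userspace sketch of the two orderings follows. margin_on_freq() and margin_on_util() are hypothetical names used only for this comparison, and the utilization headroom is assumed to be util + (util >> 2), matching what map_util_perf() is expected to do.

```c
#include <assert.h>

/* Pre-patch arithmetic: the headroom is applied to the frequency. */
static unsigned long margin_on_freq(unsigned long util,
				    unsigned long freq, unsigned long cap)
{
	assert(cap != 0);	/* guard added for this sketch only */
	return (freq + (freq >> 2)) * util / cap;
}

/*
 * Post-patch arithmetic: the headroom is applied to the utilization
 * (assumed map_util_perf() behavior), then scaled linearly.
 */
static unsigned long margin_on_util(unsigned long util,
				    unsigned long freq, unsigned long cap)
{
	unsigned long perf = util + (util >> 2);

	assert(cap != 0);	/* guard added for this sketch only */
	return freq * perf / cap;
}
```

For util = 800, freq = 2000000 and cap = 1024, both orderings return 1953125. Because the shifts truncate, the results can differ slightly when a value is not a multiple of 4: for util = 801 the first returns 1955566 and the second 1955078. The practical gain of the second ordering is that the extended utilization can be clamped to a capacity limit before it is mapped to a frequency.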
+1
kernel/sched/cpufreq_schedutil.c
···
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
+	util = map_util_perf(util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
+1 -1
kernel/sched/fair.c
···
 		max_util = max(max_util, min(cpu_util, _cpu_cap));
 	}
 
-	return em_cpu_energy(pd->em_pd, max_util, sum_util);
+	return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap);
 }
 
 /*