Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge back earlier cpuidle material for 6.15

+128 -89
+16 -11
Documentation/admin-guide/pm/cpuidle.rst
···
 and variance of them. If the variance is small (smaller than 400 square
 milliseconds) or it is small relative to the average (the average is greater
 that 6 times the standard deviation), the average is regarded as the "typical
-interval" value. Otherwise, the longest of the saved observed idle duration
+interval" value. Otherwise, either the longest or the shortest (depending on
+which one is farther from the average) of the saved observed idle duration
 values is discarded and the computation is repeated for the remaining ones.
+
 Again, if the variance of them is small (in the above sense), the average is
 taken as the "typical interval" value and so on, until either the "typical
-interval" is determined or too many data points are disregarded, in which case
-the "typical interval" is assumed to equal "infinity" (the maximum unsigned
-integer value).
+interval" is determined or too many data points are disregarded. In the latter
+case, if the size of the set of data points still under consideration is
+sufficiently large, the next idle duration is not likely to be above the largest
+idle duration value still in that set, so that value is taken as the predicted
+next idle duration. Finally, if the set of data points still under
+consideration is too small, no prediction is made.

-If the "typical interval" computed this way is long enough, the governor obtains
-the time until the closest timer event with the assumption that the scheduler
-tick will be stopped. That time, referred to as the *sleep length* in what follows,
-is the upper bound on the time before the next CPU wakeup. It is used to determine
-the sleep length range, which in turn is needed to get the sleep length correction
-factor.
+If the preliminary prediction of the next idle duration computed this way is
+long enough, the governor obtains the time until the closest timer event with
+the assumption that the scheduler tick will be stopped. That time, referred to
+as the *sleep length* in what follows, is the upper bound on the time before the
+next CPU wakeup. It is used to determine the sleep length range, which in turn
+is needed to get the sleep length correction factor.

 The ``menu`` governor maintains an array containing several correction factor
 values that correspond to different sleep length ranges organized so that each
···
 The sleep length is multiplied by the correction factor for the range that it
 falls into to obtain an approximation of the predicted idle duration that is
 compared to the "typical interval" determined previously and the minimum of
-the two is taken as the idle duration prediction.
+the two is taken as the final idle duration prediction.

 If the "typical interval" value is small, which means that the CPU is likely
 to be woken up soon enough, the sleep length computation is skipped as it may
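The outlier-discarding estimator described in the updated documentation can be sketched as a standalone user-space program. This is a simplified model, not the kernel code: the function name `typical_interval`, the use of plain 64-bit arithmetic, and the assumption that all samples are nonzero are illustration choices; the kernel version (shown later in this commit) uses `s64`/`u64` types and `do_div()`.

```c
#include <assert.h>
#include <stdint.h>

#define INTERVALS 8

/*
 * Sketch of the "typical interval" estimation: compute the average and
 * variance of the saved idle durations; while the spread is too large,
 * discard whichever extreme (min or max) lies farther from the average.
 * If too many points get dropped but at least half remain, predict the
 * largest remaining value; otherwise make no prediction (UINT64_MAX).
 */
static uint64_t typical_interval(const uint64_t *intervals)
{
	uint64_t min_thresh = 0, max_thresh = UINT64_MAX;

	for (;;) {
		uint64_t avg = 0, var = 0, min = UINT64_MAX, max = 0;
		int divisor = 0;

		for (int i = 0; i < INTERVALS; i++) {
			uint64_t v = intervals[i];

			/* Skip samples outside the current thresholds. */
			if (v <= min_thresh || v >= max_thresh)
				continue;
			divisor++;
			avg += v;
			var += v * v;
			if (v > max)
				max = v;
			if (v < min)
				min = v;
		}
		if (!divisor)
			return UINT64_MAX;
		avg /= divisor;
		var = var / divisor - avg * avg;	/* E[x^2] - E[x]^2 */

		/* Small spread: stddev below avg/6, or variance <= 400. */
		if ((avg * avg > 36 * var && divisor * 4 >= INTERVALS * 3) ||
		    var <= 400)
			return avg;

		/* Too many points dropped: predict max, or give up. */
		if (divisor * 4 <= INTERVALS * 3)
			return divisor >= INTERVALS / 2 ? max : UINT64_MAX;

		/* Drop the extreme farther from the average. */
		if (avg - min > max - avg)
			min_thresh = min;
		else
			max_thresh = max;
	}
}
```

With seven samples of 100 us and one 5000 us outlier, the first pass rejects the spread, the second pass (outlier excluded) returns 100.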
+13 -5
Documentation/admin-guide/pm/intel_idle.rst
···
 Documentation/admin-guide/pm/cpuidle.rst).
 Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.

-The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
-if the kernel has been configured with ACPI support) can be set to make the
-driver ignore the system's ACPI tables entirely or use them for all of the
-recognized processor models, respectively (they both are unset by default and
-``use_acpi`` has no effect if ``no_acpi`` is set).
+The ``no_acpi``, ``use_acpi`` and ``no_native`` module parameters are
+recognized by ``intel_idle`` if the kernel has been configured with ACPI
+support. In the case that ACPI is not configured these flags have no impact
+on functionality.
+
+``no_acpi`` - Do not use ACPI at all. Only native mode is available, no
+ACPI mode.
+
+``use_acpi`` - No-op in ACPI mode, the driver will consult ACPI tables for
+C-states on/off status in native mode.
+
+``no_native`` - Work only in ACPI mode, no native mode available (ignore
+all custom tables).

 The value of the ``states_off`` module parameter (0 by default) represents a
 list of idle states to be disabled by default in the form of a bitmask.
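Since ``intel_idle`` is built into the kernel rather than loaded as a module, these parameters are passed on the kernel command line using the standard ``module.param=value`` convention; the lines below are a sketch following that convention, not taken from this commit:

```
# ACPI-only operation: skip the driver's custom (native) C-state tables
intel_idle.no_native=1

# Native-only operation: never consult the ACPI tables
intel_idle.no_acpi=1

# The 0444 permissions make the parameters readable at run time, e.g.:
#   cat /sys/module/intel_idle/parameters/no_native
```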
+5 -3
MAINTAINERS
···
 F:	drivers/crypto/intel/iaa/*

 INTEL IDLE DRIVER
-M:	Jacob Pan <jacob.jun.pan@linux.intel.com>
-M:	Len Brown <lenb@kernel.org>
+M:	Rafael J. Wysocki <rafael@kernel.org>
+M:	Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
+M:	Artem Bityutskiy <dedekind1@gmail.com>
+R:	Len Brown <lenb@kernel.org>
 L:	linux-pm@vger.kernel.org
 S:	Supported
 B:	https://bugzilla.kernel.org
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git
 F:	drivers/idle/intel_idle.c

 INTEL IDXD DRIVER
+67 -62
drivers/cpuidle/governors/menu.c
···
  * the C state is required to actually break even on this cost. CPUIDLE
  * provides us this duration in the "target_residency" field. So all that we
  * need is a good prediction of how long we'll be idle. Like the traditional
- * menu governor, we start with the actual known "next timer event" time.
+ * menu governor, we take the actual known "next timer event" time.
  *
  * Since there are other source of wakeups (interrupts for example) than
  * the next timer event, this estimation is rather optimistic. To get a
···
  * duration always was 50% of the next timer tick, the correction factor will
  * be 0.5.
  *
- * menu uses a running average for this correction factor, however it uses a
- * set of factors, not just a single factor. This stems from the realization
- * that the ratio is dependent on the order of magnitude of the expected
- * duration; if we expect 500 milliseconds of idle time the likelihood of
- * getting an interrupt very early is much higher than if we expect 50 micro
- * seconds of idle time. A second independent factor that has big impact on
- * the actual factor is if there is (disk) IO outstanding or not.
- * (as a special twist, we consider every sleep longer than 50 milliseconds
- * as perfect; there are no power gains for sleeping longer than this)
- *
- * For these two reasons we keep an array of 12 independent factors, that gets
- * indexed based on the magnitude of the expected duration as well as the
- * "is IO outstanding" property.
+ * menu uses a running average for this correction factor, but it uses a set of
+ * factors, not just a single factor. This stems from the realization that the
+ * ratio is dependent on the order of magnitude of the expected duration; if we
+ * expect 500 milliseconds of idle time the likelihood of getting an interrupt
+ * very early is much higher than if we expect 50 micro seconds of idle time.
+ * For this reason, menu keeps an array of 6 independent factors, that gets
+ * indexed based on the magnitude of the expected duration.
  *
  * Repeatable-interval-detector
  * ----------------------------
  * There are some cases where "next timer" is a completely unusable predictor:
  * Those cases where the interval is fixed, for example due to hardware
- * interrupt mitigation, but also due to fixed transfer rate devices such as
- * mice.
+ * interrupt mitigation, but also due to fixed transfer rate devices like mice.
  * For this, we use a different predictor: We track the duration of the last 8
- * intervals and if the stand deviation of these 8 intervals is below a
- * threshold value, we use the average of these intervals as prediction.
- *
+ * intervals and use them to estimate the duration of the next one.
  */

 struct menu_device {
···
  */
 static unsigned int get_typical_interval(struct menu_device *data)
 {
-	int i, divisor;
-	unsigned int min, max, thresh, avg;
-	uint64_t sum, variance;
-
-	thresh = INT_MAX; /* Discard outliers above this value */
+	s64 value, min_thresh = -1, max_thresh = UINT_MAX;
+	unsigned int max, min, divisor;
+	u64 avg, variance, avg_sq;
+	int i;

 again:
-
-	/* First calculate the average of past intervals */
-	min = UINT_MAX;
+	/* Compute the average and variance of past intervals. */
 	max = 0;
-	sum = 0;
+	min = UINT_MAX;
+	avg = 0;
+	variance = 0;
 	divisor = 0;
 	for (i = 0; i < INTERVALS; i++) {
-		unsigned int value = data->intervals[i];
-		if (value <= thresh) {
-			sum += value;
-			divisor++;
-			if (value > max)
-				max = value;
+		value = data->intervals[i];
+		/*
+		 * Discard the samples outside the interval between the min and
+		 * max thresholds.
+		 */
+		if (value <= min_thresh || value >= max_thresh)
+			continue;

-			if (value < min)
-				min = value;
-		}
+		divisor++;
+
+		avg += value;
+		variance += value * value;
+
+		if (value > max)
+			max = value;
+
+		if (value < min)
+			min = value;
 	}

 	if (!max)
 		return UINT_MAX;

-	if (divisor == INTERVALS)
-		avg = sum >> INTERVAL_SHIFT;
-	else
-		avg = div_u64(sum, divisor);
-
-	/* Then try to determine variance */
-	variance = 0;
-	for (i = 0; i < INTERVALS; i++) {
-		unsigned int value = data->intervals[i];
-		if (value <= thresh) {
-			int64_t diff = (int64_t)value - avg;
-			variance += diff * diff;
-		}
-	}
-	if (divisor == INTERVALS)
+	if (divisor == INTERVALS) {
+		avg >>= INTERVAL_SHIFT;
 		variance >>= INTERVAL_SHIFT;
-	else
+	} else {
+		do_div(avg, divisor);
 		do_div(variance, divisor);
+	}
+
+	avg_sq = avg * avg;
+	variance -= avg_sq;

 	/*
 	 * The typical interval is obtained when standard deviation is
···
	 * Use this result only if there is no timer to wake us up sooner.
 	 */
 	if (likely(variance <= U64_MAX/36)) {
-		if ((((u64)avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3))
-							|| variance <= 400) {
+		if ((avg_sq > variance * 36 && divisor * 4 >= INTERVALS * 3) ||
+		    variance <= 400)
 			return avg;
-		}
 	}

 	/*
-	 * If we have outliers to the upside in our distribution, discard
-	 * those by setting the threshold to exclude these outliers, then
+	 * If there are outliers, discard them by setting thresholds to exclude
+	 * data points at a large enough distance from the average, then
 	 * calculate the average and standard deviation again. Once we get
-	 * down to the bottom 3/4 of our samples, stop excluding samples.
+	 * down to the last 3/4 of our samples, stop excluding samples.
 	 *
 	 * This can deal with workloads that have long pauses interspersed
 	 * with sporadic activity with a bunch of short pauses.
 	 */
-	if ((divisor * 4) <= INTERVALS * 3)
-		return UINT_MAX;
+	if (divisor * 4 <= INTERVALS * 3) {
+		/*
+		 * If there are sufficiently many data points still under
+		 * consideration after the outliers have been eliminated,
+		 * returning without a prediction would be a mistake because it
+		 * is likely that the next interval will not exceed the current
+		 * maximum, so return the latter in that case.
+		 */
+		if (divisor >= INTERVALS / 2)
+			return max;

-	thresh = max - 1;
+		return UINT_MAX;
+	}
+
+	/* Update the thresholds for the next round. */
+	if (avg - min > max - avg)
+		min_thresh = min;
+	else
+		max_thresh = max;
+
 	goto again;
 }
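One detail of the menu.c rewrite worth spelling out: the old code computed the variance with a second pass over the samples (sum of squared differences from the average), while the new code accumulates the sum of squares in the same loop and uses the identity Var(x) = E[x²] − E[x]². A minimal user-space demonstration of the equivalence (with samples chosen so that the integer divisions are exact; with truncating division the two forms can differ slightly, which the governor tolerates):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two-pass form, as in the old code: mean first, then squared diffs. */
static uint64_t variance_two_pass(const uint64_t *v, size_t n)
{
	uint64_t sum = 0, var = 0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	uint64_t avg = sum / n;

	for (size_t i = 0; i < n; i++) {
		int64_t d = (int64_t)v[i] - (int64_t)avg;
		var += (uint64_t)(d * d);
	}
	return var / n;
}

/* One-pass form, as in the new code: E[x^2] - E[x]^2. */
static uint64_t variance_one_pass(const uint64_t *v, size_t n)
{
	uint64_t avg = 0, var = 0;

	for (size_t i = 0; i < n; i++) {
		avg += v[i];
		var += v[i] * v[i];
	}
	avg /= n;
	var /= n;
	return var - avg * avg;
}
```

For the classic sample set {2, 4, 4, 4, 5, 5, 7, 9} both forms yield a variance of 4 (mean 5).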
+27 -8
drivers/idle/intel_idle.c
···
 	 * Indicate which enable bits to clear here.
 	 */
 	unsigned long auto_demotion_disable_flags;
-	bool byt_auto_demotion_disable_flag;
 	bool disable_promotion_to_c1e;
 	bool use_acpi;
 };
···
 static const struct idle_cpu idle_cpu_byt __initconst = {
 	.state_table = byt_cstates,
 	.disable_promotion_to_c1e = true,
-	.byt_auto_demotion_disable_flag = true,
 };

 static const struct idle_cpu idle_cpu_cht __initconst = {
 	.state_table = cht_cstates,
 	.disable_promotion_to_c1e = true,
-	.byt_auto_demotion_disable_flag = true,
 };

 static const struct idle_cpu idle_cpu_ivb __initconst = {
···
 module_param_named(use_acpi, force_use_acpi, bool, 0444);
 MODULE_PARM_DESC(use_acpi, "Use ACPI _CST for building the idle states list");

+static bool no_native __read_mostly; /* No effect if no_acpi is set. */
+module_param_named(no_native, no_native, bool, 0444);
+MODULE_PARM_DESC(no_native, "Ignore cpu specific (native) idle states in lieu of ACPI idle states");
+
 static struct acpi_processor_power acpi_state_table __initdata;

 /**
···
 	}
 	return true;
 }
+
+static inline bool ignore_native(void)
+{
+	return no_native && !no_acpi;
+}
 #else /* !CONFIG_ACPI_PROCESSOR_CSTATE */
 #define force_use_acpi	(false)
···
 {
 	return false;
 }
+static inline bool ignore_native(void) { return false; }
 #endif /* !CONFIG_ACPI_PROCESSOR_CSTATE */

 /**
···
 	}
 }

+/**
+ * byt_cht_auto_demotion_disable - Disable Bay/Cherry Trail auto-demotion.
+ */
+static void __init byt_cht_auto_demotion_disable(void)
+{
+	wrmsrl(MSR_CC6_DEMOTION_POLICY_CONFIG, 0);
+	wrmsrl(MSR_MC6_DEMOTION_POLICY_CONFIG, 0);
+}
+
 static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
 {
 	unsigned int mwait_cstate = (MWAIT_HINT2CSTATE(mwait_hint) + 1) &
···
 	case INTEL_ATOM_GRACEMONT:
 		adl_idle_state_table_update();
 		break;
+	case INTEL_ATOM_SILVERMONT:
+	case INTEL_ATOM_AIRMONT:
+		byt_cht_auto_demotion_disable();
+		break;
 	}

 	for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) {
···
 		state->flags |= CPUIDLE_FLAG_TIMER_STOP;

 		drv->state_count++;
-	}
-
-	if (icpu->byt_auto_demotion_disable_flag) {
-		wrmsrl(MSR_CC6_DEMOTION_POLICY_CONFIG, 0);
-		wrmsrl(MSR_MC6_DEMOTION_POLICY_CONFIG, 0);
 	}
 }
···
 	pr_debug("MWAIT substates: 0x%x\n", mwait_substates);

 	icpu = (const struct idle_cpu *)id->driver_data;
+	if (icpu && ignore_native()) {
+		pr_debug("ignoring native CPU idle states\n");
+		icpu = NULL;
+	}
 	if (icpu) {
 		if (icpu->state_table)
 			cpuidle_state_table = icpu->state_table;
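The interaction between the two ACPI-related flags in the intel_idle diff is easy to get backwards, so here it is as a tiny user-space truth table. This mirrors the new `ignore_native()` helper, but with the flags passed in explicitly instead of being read from module parameters (a standalone sketch, not driver code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * no_native switches the driver to ACPI-only operation, but it is
 * ignored when no_acpi is also set, since there would then be no ACPI
 * mode to fall back to.
 */
static bool ignore_native(bool no_native, bool no_acpi)
{
	return no_native && !no_acpi;
}
```

So native tables are skipped only in the single case where `no_native` is set and `no_acpi` is not.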