Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Revert "cpuidle: Quickly notice prediction failure for repeat mode"

Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for
repeat mode), because it has been identified as the source of a
significant performance regression in v3.8 and later as explained by
Jeremy Eder:

We believe we've identified a particular commit to the cpuidle code
that seems to be impacting the performance of a variety of workloads.
The simplest way to reproduce is using netperf TCP_RR test, so
we're using that, on a pair of Sandy Bridge based servers. We also
have data from a large database setup where performance is also
measurably and positively impacted, though that test data isn't easily
shareable.

Included below are test results from 3 test kernels:

kernel reverts
-----------------------------------------------------------
1) vanilla upstream (no reverts)

2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c

3) test reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
e11538d1f03914eb92af5a1a378375c05ae8520c

In summary, netperf TCP_RR numbers improve by approximately 4%
after reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. When
69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency
never seems to get above 40%. Taking that patch out gets C0 near
100% quite often, and performance increases.

The data below are histograms of %c0 residency, sampled at 1-second
intervals (using turbostat), while under the netperf test.

- If you look at the first 4 histograms, you can see %c0 residency
almost entirely in the 30-40% bin.
- The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
shows %c0 largely in the 80-100% bins.

Below each kernel name are the netperf TCP_RR trans/s numbers for that
kernel that can be disclosed publicly, comparing the 3 test kernels. We
also ran a 4th test with the vanilla kernel where we set
/dev/cpu_dma_latency=0 to show the overall impact: it boosts
single-threaded TCP_RR performance more than 11% above baseline.

3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):
TCP_RR trans/s 54323.78

-----------------------------------------------------------
3.10-rc2 vanilla RX (no reverts)
TCP_RR trans/s 48192.47

Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 59]: ***********************************************************
40.0000 - 50.0000 [ 1]: *
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:

Sender %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 11]: ***********
40.0000 - 50.0000 [ 49]: *************************************************
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:

-----------------------------------------------------------
3.10-rc2 perfteam2 RX (reverts commit
e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 49698.69

Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 1]: *
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 59]: ***********************************************************
40.0000 - 50.0000 [ 0]:
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:

Sender %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 2]: **
40.0000 - 50.0000 [ 58]: **********************************************************
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:

-----------------------------------------------------------
3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
and e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 47766.95

Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 1]: *
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 27]: ***************************
40.0000 - 50.0000 [ 2]: **
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 2]: **
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 28]: ****************************

Sender %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 11]: ***********
40.0000 - 50.0000 [ 0]:
50.0000 - 60.0000 [ 1]: *
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 3]: ***
80.0000 - 90.0000 [ 7]: *******
90.0000 - 100.0000 [ 38]: **************************************

These results demonstrate that reverting commit
69a37beabf1f0a6705c08e879bdd5d82ff6486c4 restores the CPU's tendency to
stay in the more responsive, performant C-states, and thus yields
measurably better performance.

Requested-by: Jeremy Eder <jeder@redhat.com>
Tested-by: Len Brown <len.brown@intel.com>
Cc: 3.8+ <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

3 files changed, 6 insertions(+), 82 deletions(-)

drivers/cpuidle/governors/menu.c (+4 -69)
···
 #define MAX_INTERESTING 50000
 #define STDDEV_THRESH 400

-/* 60 * 60 > STDDEV_THRESH * INTERVALS = 400 * 8 */
-#define MAX_DEVIATION 60
-
-static DEFINE_PER_CPU(struct hrtimer, menu_hrtimer);
-static DEFINE_PER_CPU(int, hrtimer_status);
-/* menu hrtimer mode */
-enum {MENU_HRTIMER_STOP, MENU_HRTIMER_REPEAT};

 /*
  * Concepts and ideas behind the menu governor
···
 	return div_u64(dividend + (divisor / 2), divisor);
 }

-/* Cancel the hrtimer if it is not triggered yet */
-void menu_hrtimer_cancel(void)
-{
-	int cpu = smp_processor_id();
-	struct hrtimer *hrtmr = &per_cpu(menu_hrtimer, cpu);
-
-	/* The timer is still not time out*/
-	if (per_cpu(hrtimer_status, cpu)) {
-		hrtimer_cancel(hrtmr);
-		per_cpu(hrtimer_status, cpu) = MENU_HRTIMER_STOP;
-	}
-}
-EXPORT_SYMBOL_GPL(menu_hrtimer_cancel);
-
-/* Call back for hrtimer is triggered */
-static enum hrtimer_restart menu_hrtimer_notify(struct hrtimer *hrtimer)
-{
-	int cpu = smp_processor_id();
-
-	per_cpu(hrtimer_status, cpu) = MENU_HRTIMER_STOP;
-
-	return HRTIMER_NORESTART;
-}
-
 /*
  * Try detecting repeating patterns by keeping track of the last 8
  * intervals, and checking if the standard deviation of that set
  * of points is below a threshold. If it is... then use the
  * average of these 8 points as the estimated value.
  */
-static u32 get_typical_interval(struct menu_device *data)
+static void get_typical_interval(struct menu_device *data)
 {
 	int i = 0, divisor = 0;
 	uint64_t max = 0, avg = 0, stddev = 0;
 	int64_t thresh = LLONG_MAX; /* Discard outliers above this value. */
-	unsigned int ret = 0;

 again:
···
 	if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
 						|| stddev <= 20) {
 		data->predicted_us = avg;
-		ret = 1;
-		return ret;
+		return;

 	} else if ((divisor * 4) > INTERVALS * 3) {
 		/* Exclude the max interval */
 		thresh = max - 1;
 		goto again;
 	}
-
-	return ret;
 }

 /**
···
 	int i;
 	int multiplier;
 	struct timespec t;
-	int repeat = 0, low_predicted = 0;
-	int cpu = smp_processor_id();
-	struct hrtimer *hrtmr = &per_cpu(menu_hrtimer, cpu);

 	if (data->needs_update) {
 		menu_update(drv, dev);
···
 	data->predicted_us = div_round64(data->expected_us * data->correction_factor[data->bucket],
 					 RESOLUTION * DECAY);

-	repeat = get_typical_interval(data);
+	get_typical_interval(data);

 	/*
 	 * We want to default to C1 (hlt), not to busy polling
···
 		if (s->disabled || su->disable)
 			continue;
-		if (s->target_residency > data->predicted_us) {
-			low_predicted = 1;
+		if (s->target_residency > data->predicted_us)
 			continue;
-		}
 		if (s->exit_latency > latency_req)
 			continue;
 		if (s->exit_latency * multiplier > data->predicted_us)
···
 		data->last_state_idx = i;
 		data->exit_us = s->exit_latency;
-	}
-
-	/* not deepest C-state chosen for low predicted residency */
-	if (low_predicted) {
-		unsigned int timer_us = 0;
-
-		/*
-		 * Set a timer to detect whether this sleep is much
-		 * longer than repeat mode predicted.  If the timer
-		 * triggers, the code will evaluate whether to put
-		 * the CPU into a deeper C-state.
-		 * The timer is cancelled on CPU wakeup.
-		 */
-		timer_us = 2 * (data->predicted_us + MAX_DEVIATION);
-
-		if (repeat && (4 * timer_us < data->expected_us)) {
-			RCU_NONIDLE(hrtimer_start(hrtmr,
-				ns_to_ktime(1000 * timer_us),
-				HRTIMER_MODE_REL_PINNED));
-			/* In repeat case, menu hrtimer is started */
-			per_cpu(hrtimer_status, cpu) = MENU_HRTIMER_REPEAT;
-		}
 	}

 	return data->last_state_idx;
···
 				struct cpuidle_device *dev)
 {
 	struct menu_device *data = &per_cpu(menu_devices, dev->cpu);
-	struct hrtimer *t = &per_cpu(menu_hrtimer, dev->cpu);
-	hrtimer_init(t, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	t->function = menu_hrtimer_notify;

 	memset(data, 0, sizeof(struct menu_device));
include/linux/tick.h (-6)
···
 #endif

-# ifdef CONFIG_CPU_IDLE_GOV_MENU
-extern void menu_hrtimer_cancel(void);
-# else
-static inline void menu_hrtimer_cancel(void) {}
-# endif /* CONFIG_CPU_IDLE_GOV_MENU */
-
 #endif
kernel/time/tick-sched.c (+2 -7)
···
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);

-	if (ts->inidle) {
-		/* Cancel the timer because CPU already waken up from the C-states*/
-		menu_hrtimer_cancel();
+	if (ts->inidle)
 		__tick_nohz_idle_enter(ts);
-	} else {
+	else
 		tick_nohz_full_stop_tick(ts);
-	}
 }

 /**
···
 	ts->inidle = 0;

-	/* Cancel the timer because CPU already waken up from the C-states*/
-	menu_hrtimer_cancel();
 	if (ts->idle_active || ts->tick_stopped)
 		now = ktime_get();