Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Documentation: admin-guide: PM: Add cpuidle document

Important information is missing from user/admin cpuidle documentation
available today, so add a new user/admin document for cpuidle containing
current and comprehensive information to admin-guide and drop the old
.txt documents it is replacing.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>

+615 -121
+614
Documentation/admin-guide/pm/cpuidle.rst
.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
.. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`

========================
CPU Idle Time Management
========================

::

 Copyright (c) 2018 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Concepts
========

Modern processors are generally able to enter states in which the execution of
a program is suspended and instructions belonging to it are not fetched from
memory or executed. Those states are the *idle* states of the processor.

Since part of the processor hardware is not used in idle states, entering them
generally allows power drawn by the processor to be reduced and, in
consequence, it is an opportunity to save energy.

CPU idle time management is an energy-efficiency feature concerned with using
the idle states of processors for this purpose.

Logical CPUs
------------

CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
is the part of the kernel responsible for the distribution of computational
work in the system). In its view, CPUs are *logical* units. That is, they
need not be separate physical entities and may just be interfaces appearing to
software as individual single-core processors. In other words, a CPU is an
entity which appears to be fetching instructions that belong to one sequence
(program) from memory and executing them, but it need not work this way
physically. Generally, three different cases can be considered here.

First, if the whole processor can only follow one sequence of instructions (one
program) at a time, it is a CPU. In that case, if the hardware is asked to
enter an idle state, that applies to the processor as a whole.

Second, if the processor is multi-core, each core in it is able to follow at
least one program at a time. The cores need not be entirely independent of
each other (for example, they may share caches), but still most of the time
they work physically in parallel with each other, so if each of them executes
only one program, those programs run mostly independently of each other at the
same time. The entire cores are CPUs in that case and if the hardware is
asked to enter an idle state, that applies to the core that asked for it in
the first place, but it also may apply to a larger unit (say a "package" or a
"cluster") that the core belongs to (in fact, it may apply to an entire
hierarchy of larger units containing the core). Namely, if all of the cores
in the larger unit except for one have been put into idle states at the "core
level" and the remaining core asks the processor to enter an idle state, that
may trigger it to put the whole larger unit into an idle state which also will
affect the other cores in that unit.

Finally, each core in a multi-core processor may be able to follow more than
one program in the same time frame (that is, each core may be able to fetch
instructions from multiple locations in memory and execute them in the same
time frame, but not necessarily entirely in parallel with each other). In
that case the cores present themselves to software as "bundles" each
consisting of multiple individual single-core "processors", referred to as
*hardware threads* (or hyper-threads specifically on Intel hardware), that
each can follow one sequence of instructions.
Then, the hardware threads
are CPUs from the CPU idle time management perspective and if the processor is
asked to enter an idle state by one of them, the hardware thread (or CPU) that
asked for it is stopped, but nothing more happens, unless all of the other
hardware threads within the same core also have asked the processor to enter
an idle state. In that situation, the core may be put into an idle state
individually or a larger unit containing it may be put into an idle state as a
whole (if the other cores within the larger unit are in idle states already).

Idle CPUs
---------

Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as
*idle* by the Linux kernel when there are no tasks to run on them except for
the special "idle" task.

Tasks are the CPU scheduler's representation of work. Each task consists of a
sequence of instructions to execute, or code, data to be manipulated while
running that code, and some context information that needs to be loaded into
the processor every time the task's code is run by a CPU. The CPU scheduler
distributes work by assigning tasks to run to the CPUs present in the system.

Tasks can be in various states. In particular, they are *runnable* if there
are no specific conditions preventing their code from being run by a CPU as
long as there is a CPU available for that (for example, they are not waiting
for any events to occur or similar). When a task becomes runnable, the CPU
scheduler assigns it to one of the available CPUs to run and if there are no
more runnable tasks assigned to it, the CPU will load the given task's context
and run its code (from the instruction following the last one executed so far,
possibly by another CPU).
[If there are multiple runnable tasks assigned
to one CPU simultaneously, they will be subject to prioritization and time
sharing in order to allow them to make some progress over time.]

The special "idle" task becomes runnable if there are no other runnable tasks
assigned to the given CPU and the CPU is then regarded as idle. In other
words, in Linux idle CPUs run the code of the "idle" task called *the idle
loop*. That code may cause the processor to be put into one of its idle
states, if they are supported, in order to save energy, but if the processor
does not support any idle states, or there is not enough time to spend in an
idle state before the next wakeup event, or there are strict latency
constraints preventing any of the available idle states from being used, the
CPU will simply execute more or less useless instructions in a loop until it
is assigned a new task to run.


.. _idle-loop:

The Idle Loop
=============

The idle loop code takes two major steps in every iteration. First, it calls
into a code module referred to as the *governor* that belongs to the CPU idle
time management subsystem called ``CPUIdle`` to select an idle state for the
CPU to ask the hardware to enter. Second, it invokes another code module from
the ``CPUIdle`` subsystem, called the *driver*, to actually ask the processor
hardware to enter the idle state selected by the governor.

The role of the governor is to find an idle state most suitable for the
conditions at hand. For this purpose, idle states that the hardware can be
asked to enter by logical CPUs are represented in an abstract way independent
of the platform or the processor architecture and organized in a
one-dimensional (linear) array.
That array has to be prepared and supplied
by the ``CPUIdle`` driver matching the platform the kernel is running on at
the initialization time. This allows ``CPUIdle`` governors to be independent
of the underlying hardware and to work with any platforms that the Linux
kernel can run on.

Each idle state present in that array is characterized by two parameters to be
taken into account by the governor, the *target residency* and the
(worst-case) *exit latency*. The target residency is the minimum time the
hardware must spend in the given state, including the time needed to enter it
(which may be substantial), in order to save more energy than it would save by
entering one of the shallower idle states instead. [The "depth" of an idle
state roughly corresponds to the power drawn by the processor in that state.]
The exit latency, in turn, is the maximum time it will take a CPU asking the
processor hardware to enter an idle state to start executing the first
instruction after a wakeup from that state. Note that in general the exit
latency also must cover the time needed to enter the given state in case the
wakeup occurs when the hardware is entering it and it must be entered
completely to be exited in an ordered manner.

There are two types of information that can influence the governor's
decisions. First of all, the governor knows the time until the closest timer
event. That time is known exactly, because the kernel programs timers and it
knows exactly when they will trigger, and it is the maximum time the hardware
that the given CPU depends on can spend in an idle state, including the time
necessary to enter and exit it. However, the CPU may be woken up by a
non-timer event at any time (in particular, before the closest timer triggers)
and it generally is not known when that may happen.
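
For illustration, such an array can be pictured as a small table; the state
names and numbers below are invented for this sketch, not taken from any real
``CPUIdle`` driver:

```python
from collections import namedtuple

# Abstract, hardware-independent view of an idle state as seen by a
# governor: just two parameters, both in microseconds.
IdleState = namedtuple("IdleState", ["name", "target_residency", "exit_latency"])

# A hypothetical table a CPUIdle driver could supply, shallowest state
# first; deeper states draw less power but need longer residencies and
# take longer to wake up from.
idle_states = [
    IdleState("STATE0", target_residency=1, exit_latency=1),
    IdleState("STATE1", target_residency=20, exit_latency=10),
    IdleState("STATE2", target_residency=800, exit_latency=200),
]
```

A governor never looks past these two numbers; everything hardware-specific
stays in the driver.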
The governor can only see how much time the CPU actually was idle after it has
been woken up (that time will be referred to as the *idle duration* from now
on) and it can use that information somehow along with the time until the
closest timer to estimate the idle duration in the future. How the governor
uses that information depends on what algorithm is implemented by it and that
is the primary reason for having more than one governor in the ``CPUIdle``
subsystem.

There are two ``CPUIdle`` governors available, ``menu`` and ``ladder``. Which
of them is used depends on the configuration of the kernel and in particular
on whether or not the scheduler tick can be `stopped by the idle loop
<idle-cpus-and-tick_>`_. It is possible to change the governor at run time if
the ``cpuidle_sysfs_switch`` command line parameter has been passed to the
kernel, but that is not safe in general, so it should not be done on
production systems (that may change in the future, though). The name of the
``CPUIdle`` governor currently used by the kernel can be read from the
:file:`current_governor_ro` (or :file:`current_governor` if
``cpuidle_sysfs_switch`` is present in the kernel command line) file under
:file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.

Which ``CPUIdle`` driver is used, on the other hand, usually depends on the
platform the kernel is running on, but there are platforms with more than one
matching driver. For example, there are two drivers that can work with the
majority of Intel platforms, ``intel_idle`` and ``acpi_idle``, one with
hardcoded idle states information and the other able to read that information
from the system's ACPI tables, respectively.
Still, even in those cases, the driver
chosen at the system initialization time cannot be replaced later, so the
decision on which one of them to use has to be made early (on Intel platforms
the ``acpi_idle`` driver will be used if ``intel_idle`` is disabled for some
reason or if it does not recognize the processor). The name of the
``CPUIdle`` driver currently used by the kernel can be read from the
:file:`current_driver` file under :file:`/sys/devices/system/cpu/cpuidle/` in
``sysfs``.


.. _idle-cpus-and-tick:

Idle CPUs and The Scheduler Tick
================================

The scheduler tick is a timer that triggers periodically in order to implement
the time sharing strategy of the CPU scheduler. Of course, if there are
multiple runnable tasks assigned to one CPU at the same time, the only way to
allow them to make reasonable progress in a given time frame is to make them
share the available CPU time. Namely, in rough approximation, each task is
given a slice of the CPU time to run its code, subject to the scheduling
class, prioritization and so on and when that time slice is used up, the CPU
should be switched over to running (the code of) another task. The currently
running task may not want to give the CPU away voluntarily, however, and the
scheduler tick is there to make the switch happen regardless. That is not the
only role of the tick, but it is the primary reason for using it.

The scheduler tick is problematic from the CPU idle time management
perspective, because it triggers periodically and relatively often (depending
on the kernel configuration, the length of the tick period is between 1 ms and
10 ms).
Thus, if the tick is allowed to trigger on idle CPUs, it will not
make sense for them to ask the hardware to enter idle states with target
residencies above the tick period length. Moreover, in that case the idle
duration of any CPU will never exceed the tick period length and the energy
used for entering and exiting idle states due to the tick wakeups on idle CPUs
will be wasted.

Fortunately, it is not really necessary to allow the tick to trigger on idle
CPUs, because (by definition) they have no tasks to run except for the special
"idle" one. In other words, from the CPU scheduler perspective, the only user
of the CPU time on them is the idle loop. Since the time of an idle CPU need
not be shared between multiple runnable tasks, the primary reason for using
the tick goes away if the given CPU is idle. Consequently, it is possible to
stop the scheduler tick entirely on idle CPUs in principle, even though that
may not always be worth the effort.

Whether or not it makes sense to stop the scheduler tick in the idle loop
depends on what is expected by the governor. First, if there is another
(non-tick) timer due to trigger within the tick range, stopping the tick
clearly would be a waste of time, even though the timer hardware may not need
to be reprogrammed in that case. Second, if the governor is expecting a
non-timer wakeup within the tick range, stopping the tick is not necessary and
it may even be harmful. Namely, in that case the governor will select an idle
state with the target residency within the time until the expected wakeup, so
that state is going to be relatively shallow. The governor really cannot
select a deep idle state then, as that would contradict its own expectation of
a wakeup in short order.
Now, if the wakeup really occurs shortly,
stopping the tick would be a waste of time and in this case the timer hardware
would need to be reprogrammed, which is expensive. On the other hand, if the
tick is stopped and the wakeup does not occur any time soon, the hardware may
spend an indefinite amount of time in the shallow idle state selected by the
governor, which will be a waste of energy. Hence, if the governor is
expecting a wakeup of any kind within the tick range, it is better to allow
the tick to trigger. Otherwise, however, the governor will select a
relatively deep idle state, so the tick should be stopped so that it does not
wake up the CPU too early.

In any case, the governor knows what it is expecting and the decision on
whether or not to stop the scheduler tick belongs to it. Still, if the tick
has been stopped already (in one of the previous iterations of the loop), it
is better to leave it as is and the governor needs to take that into account.

The kernel can be configured to disable stopping the scheduler tick in the
idle loop altogether. That can be done through the build-time configuration
of it (by unsetting the ``CONFIG_NO_HZ_IDLE`` configuration option) or by
passing ``nohz=off`` to it in the command line. In both cases, as the
stopping of the scheduler tick is disabled, the governor's decisions regarding
it are simply ignored by the idle loop code and the tick is never stopped.

The systems that run kernels configured to allow the scheduler tick to be
stopped on idle CPUs are referred to as *tickless* systems and they are
generally regarded as more energy-efficient than the systems running kernels
in which the tick cannot be stopped.
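
The trade-off described above can be condensed into a small decision sketch
(illustrative only, not the kernel's actual logic; all times in microseconds
and the function name is invented):

```python
def should_stop_tick(expected_wakeup_us, tick_period_us, tick_stopped):
    """Decide whether stopping the scheduler tick is worthwhile."""
    if tick_stopped:
        # The tick was stopped in a previous iteration of the idle loop
        # already; leave it as is and let the governor account for that.
        return True
    # A wakeup of any kind expected within the tick range means that a
    # relatively shallow state will be selected anyway, so keep the tick
    # running as a backstop against oversleeping in that state.
    return expected_wakeup_us >= tick_period_us
```

Only when no wakeup is expected within the tick range does stopping the tick
pay off, because only then may a deep idle state be selected.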
If the given system is tickless, it will use the ``menu`` governor by default
and if it is not tickless, the default ``CPUIdle`` governor on it will be
``ladder``.


The ``menu`` Governor
=====================

The ``menu`` governor is the default ``CPUIdle`` governor for tickless
systems. It is quite complex, but the basic principle of its design is
straightforward. Namely, when invoked to select an idle state for a CPU (i.e.
an idle state that the CPU will ask the processor hardware to enter), it
attempts to predict the idle duration and uses the predicted value for idle
state selection.

It first obtains the time until the closest timer event with the assumption
that the scheduler tick will be stopped. That time, referred to as the *sleep
length* in what follows, is the upper bound on the time before the next CPU
wakeup. It is used to determine the sleep length range, which in turn is
needed to get the sleep length correction factor.

The ``menu`` governor maintains two arrays of sleep length correction factors.
One of them is used when tasks previously running on the given CPU are waiting
for some I/O operations to complete and the other one is used when that is not
the case. Each array contains several correction factor values that
correspond to different sleep length ranges organized so that each range
represented in the array is approximately 10 times wider than the previous
one.

The correction factor for the given sleep length range (determined before
selecting the idle state for the CPU) is updated after the CPU has been woken
up and the closer the sleep length is to the observed idle duration, the
closer to 1 the correction factor becomes (it must fall between 0 and 1
inclusive).
The sleep length is multiplied by the correction factor for the
range that it falls into to obtain the first approximation of the predicted
idle duration.

Next, the governor uses a simple pattern recognition algorithm to refine its
idle duration prediction. Namely, it saves the last 8 observed idle duration
values and, when predicting the idle duration next time, it computes the
average and variance of them. If the variance is small (smaller than 400
square milliseconds) or it is small relative to the average (the average is
greater than 6 times the standard deviation), the average is regarded as the
"typical interval" value. Otherwise, the longest of the saved observed idle
duration values is discarded and the computation is repeated for the remaining
ones. Again, if the variance of them is small (in the above sense), the
average is taken as the "typical interval" value and so on, until either the
"typical interval" is determined or too many data points are disregarded, in
which case the "typical interval" is assumed to equal "infinity" (the maximum
unsigned integer value). The "typical interval" computed this way is compared
with the sleep length multiplied by the correction factor and the minimum of
the two is taken as the predicted idle duration.

Then, the governor computes an extra latency limit to help "interactive"
workloads. It uses the observation that if the exit latency of the selected
idle state is comparable with the predicted idle duration, the total time
spent in that state probably will be very short and the amount of energy to
save by entering it will be relatively small, so likely it is better to avoid
the overhead related to entering that state and exiting it. Thus selecting a
shallower state is likely to be a better option then.
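
The pattern recognition step described above can be sketched as follows (a
simplified model, with durations in milliseconds and the 400 square
milliseconds variance threshold from the text; the ``max_discard`` cap on how
many data points may be dropped is an invented detail of this sketch):

```python
def typical_interval(samples, max_discard=4):
    """Sketch of the variance-based pattern check (durations in ms)."""
    data = sorted(samples)
    while data and max_discard >= 0:
        avg = sum(data) / len(data)
        var = sum((x - avg) ** 2 for x in data) / len(data)
        # Accept the average if the variance is small in absolute terms
        # (under 400 square milliseconds) or relative to the average
        # (average greater than 6 standard deviations).
        if var < 400 or avg > 6 * var ** 0.5:
            return avg
        data.pop()        # discard the longest saved duration and retry
        max_discard -= 1
    # float("inf") stands in for the "maximum unsigned integer" sentinel.
    return float("inf")
```

A regular wakeup pattern with one outlier still yields the regular interval,
because the outlier is discarded on the second pass.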
The first approximation of the extra latency limit is the predicted idle
duration itself which additionally is divided by a value depending on the
number of tasks that previously ran on the given CPU and are now waiting for
I/O operations to complete. The result of that division is compared with the
latency limit coming from the power management quality of service, or `PM QoS
<cpu-pm-qos_>`_, framework and the minimum of the two is taken as the limit
for the idle states' exit latency.

Now, the governor is ready to walk the list of idle states and choose one of
them. For this purpose, it compares the target residency of each state with
the predicted idle duration and the exit latency of it with the computed
latency limit. It selects the state with the target residency closest to the
predicted idle duration, but still below it, and exit latency that does not
exceed the limit.

In the final step the governor may still need to refine the idle state
selection if it has not decided to `stop the scheduler tick
<idle-cpus-and-tick_>`_. That happens if the idle duration predicted by it is
less than the tick period and the tick has not been stopped already (in a
previous iteration of the idle loop). Then, the sleep length used in the
previous computations may not reflect the real time until the closest timer
event and if it really is greater than that time, the governor may need to
select a shallower state with a suitable target residency.
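
The walk over the idle-state array described above can be sketched like this
(the state table in the example is hypothetical; times in microseconds):

```python
def select_state(states, predicted_us, latency_limit_us):
    """states: (target_residency_us, exit_latency_us) tuples, shallow first.

    Returns the index of the state with the largest target residency that
    still fits under the predicted idle duration and the latency limit,
    or None if nothing fits.
    """
    chosen = None
    for index, (residency, latency) in enumerate(states):
        if residency > predicted_us or latency > latency_limit_us:
            continue
        chosen = index  # deeper states come later in the array
    return chosen

# Hypothetical state table: (target residency, exit latency) in us.
states = [(1, 1), (20, 10), (800, 200)]
```

With a predicted idle duration of 100 us and a 50 us latency limit, the
middle state wins: the deepest state's residency is too long for the
prediction and its exit latency breaks the limit.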


.. _idle-states-representation:

Representation of Idle States
=============================

For the CPU idle time management purposes all of the physical idle states
supported by the processor have to be represented as a one-dimensional array
of |struct cpuidle_state| objects each allowing an individual (logical) CPU to
ask the processor hardware to enter an idle state of certain properties. If
there is a hierarchy of units in the processor, one |struct cpuidle_state|
object can cover a combination of idle states supported by the units at
different levels of the hierarchy. In that case, the `target residency and
exit latency parameters of it <idle-loop_>`_ must reflect the properties of
the idle state at the deepest level (i.e. the idle state of the unit
containing all of the other units).

For example, take a processor with two cores in a larger unit referred to as a
"module" and suppose that asking the hardware to enter a specific idle state
(say "X") at the "core" level by one core will trigger the module to try to
enter a specific idle state of its own (say "MX") if the other core is in idle
state "X" already. In other words, asking for idle state "X" at the "core"
level gives the hardware a license to go as deep as to idle state "MX" at the
"module" level, but there is no guarantee that this is going to happen (the
core asking for idle state "X" may just end up in that state by itself
instead). Then, the target residency of the |struct cpuidle_state| object
representing idle state "X" must reflect the minimum time to spend in idle
state "MX" of the module (including the time needed to enter it), because that
is the minimum time the CPU needs to be idle to save any energy in case the
hardware enters that state.
Analogously, the exit latency parameter of
that object must cover the exit time of idle state "MX" of the module (and
usually its entry time too), because that is the maximum delay between a
wakeup signal and the time the CPU will start to execute the first new
instruction (assuming that both cores in the module will always be ready to
execute instructions as soon as the module becomes operational as a whole).

There are processors without direct coordination between different levels of
the hierarchy of units inside them, however. In those cases asking for an
idle state at the "core" level does not automatically affect the "module"
level, for example, in any way and the ``CPUIdle`` driver is responsible for
the entire handling of the hierarchy. Then, the definition of the idle state
objects is entirely up to the driver, but still the physical properties of the
idle state that the processor hardware finally goes into must always follow
the parameters used by the governor for idle state selection (for instance,
the actual exit latency of that idle state must not exceed the exit latency
parameter of the idle state object selected by the governor).

In addition to the target residency and exit latency idle state parameters
discussed above, the objects representing idle states each contain a few other
parameters describing the idle state and a pointer to the function to run in
order to ask the hardware to enter that state. Also, for each
|struct cpuidle_state| object, there is a corresponding
:c:type:`struct cpuidle_state_usage <cpuidle_state_usage>` one containing
usage statistics of the given idle state. That information is exposed by the
kernel via ``sysfs``.
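
For the two-core "module" example above, the parameters of the combined idle
state object could be derived as in this sketch (the numbers are invented and
the derivation is an assumption of this illustration, not a rule taken from
any driver):

```python
def combined_parameters(core_x, module_mx):
    """Each argument: dict with 'target_residency' and 'exit_latency' in us.

    The object advertised for idle state "X" must reflect the deepest
    reachable state "MX": the minimum worthwhile idle time of the module
    and the worst-case wakeup delay out of it.
    """
    return {
        "target_residency": max(core_x["target_residency"],
                                module_mx["target_residency"]),
        "exit_latency": max(core_x["exit_latency"],
                            module_mx["exit_latency"]),
    }

x = {"target_residency": 100, "exit_latency": 30}     # core-level "X"
mx = {"target_residency": 1500, "exit_latency": 400}  # module-level "MX"
```

Since "MX" is deeper than "X" in this example, the combined object simply
inherits the module-level numbers.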

For each CPU in the system, there is a
:file:`/sys/devices/system/cpu/cpu<N>/cpuidle/` directory in ``sysfs``, where
the number ``<N>`` is assigned to the given CPU at the initialization time.
That directory contains a set of subdirectories called :file:`state0`,
:file:`state1` and so on, up to the number of idle state objects defined for
the given CPU minus one. Each of these directories corresponds to one idle
state object and the larger the number in its name, the deeper the (effective)
idle state represented by it. Each of them contains a number of files
(attributes) representing the properties of the idle state object
corresponding to it, as follows:

``desc``
    Description of the idle state.

``disable``
    Whether or not this idle state is disabled.

``latency``
    Exit latency of the idle state in microseconds.

``name``
    Name of the idle state.

``power``
    Power drawn by hardware in this idle state in milliwatts (if specified,
    0 otherwise).

``residency``
    Target residency of the idle state in microseconds.

``time``
    Total time spent in this idle state by the given CPU (as measured by the
    kernel) in microseconds.

``usage``
    Total number of times the hardware has been asked by the given CPU to
    enter this idle state.

The :file:`desc` and :file:`name` files both contain strings. The difference
between them is that the name is expected to be more concise, while the
description may be longer and it may contain white space or special
characters. The other files listed above contain integer numbers.

The :file:`disable` attribute is the only writeable one.
If it contains 1, the given
idle state is disabled for this particular CPU, which means that the governor
will never select it for this particular CPU and the ``CPUIdle`` driver will
never ask the hardware to enter it for that CPU as a result. However,
disabling an idle state for one CPU does not prevent it from being asked for
by the other CPUs, so it must be disabled for all of them in order to never be
asked for by any of them. [Note that, due to the way the ``ladder`` governor
is implemented, disabling an idle state prevents that governor from selecting
any idle states deeper than the disabled one too.]

If the :file:`disable` attribute contains 0, the given idle state is enabled
for this particular CPU, but it still may be disabled for some or all of the
other CPUs in the system at the same time. Writing 1 to it causes the idle
state to be disabled for this particular CPU and writing 0 to it allows the
governor to take it into consideration for the given CPU and the driver to ask
for it, unless that state was disabled globally in the driver (in which case
it cannot be used at all).

The :file:`power` attribute is not defined very well, especially for idle
state objects representing combinations of idle states at different levels of
the hierarchy of units in the processor, and it generally is hard to obtain
idle state power numbers for complex hardware, so :file:`power` often contains
0 (not available) and if it contains a nonzero number, that number may not be
very accurate and it should not be relied on for anything meaningful.
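
Since an idle state must be disabled on every CPU to be truly unused, that is
easy to script; a hedged sketch (the ``base`` parameter exists only so the
function can be tried on a copy of the directory layout outside of
:file:`/sys`, and writing there requires root):

```python
from pathlib import Path

def disable_idle_state(index, base="/sys/devices/system/cpu"):
    """Write 1 to the disable attribute of state<index> for every CPU."""
    for disable in Path(base).glob(f"cpu[0-9]*/cpuidle/state{index}/disable"):
        disable.write_text("1")
```

Writing "0" back with the same pattern re-enables the state, unless it was
disabled globally in the driver.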

The number in the :file:`time` file generally may be greater than the total
time really spent by the given CPU in the given idle state, because it is
measured by the kernel and it may not cover the cases in which the hardware
refused to enter this idle state and entered a shallower one instead of it (or
even it did not enter any idle state at all). The kernel can only measure the
time span between asking the hardware to enter an idle state and the
subsequent wakeup of the CPU and it cannot say what really happened in the
meantime at the hardware level. Moreover, if the idle state object in
question represents a combination of idle states at different levels of the
hierarchy of units in the processor, the kernel can never say how deep the
hardware went down the hierarchy in any particular case. For these reasons,
the only reliable way to find out how much time has been spent by the hardware
in different idle states supported by it is to use idle state residency
counters in the hardware, if available.


.. _cpu-pm-qos:

Power Management Quality of Service for CPUs
============================================

The power management quality of service (PM QoS) framework in the Linux kernel
allows kernel code and user space processes to set constraints on various
energy-efficiency features of the kernel to prevent performance from dropping
below a required level. The PM QoS constraints can be set globally, in
predefined categories referred to as PM QoS classes, or against individual
devices.

CPU idle time management can be affected by PM QoS in two ways, through the
global constraint in the ``PM_QOS_CPU_DMA_LATENCY`` class and through the
resume latency constraints for individual CPUs. Kernel code (e.g. device
drivers) can set both of them with the help of special internal interfaces
provided by the PM QoS framework. User space can modify the former by opening
the :file:`cpu_dma_latency` special device file under :file:`/dev/` and
writing a binary value (interpreted as a signed 32-bit integer) to it. In
turn, the resume latency constraint for a CPU can be modified by user space by
writing a string (representing a signed 32-bit integer) to the
:file:`power/pm_qos_resume_latency_us` file under
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number
``<N>`` is allocated at the system initialization time. Negative values will
be rejected in both cases and, also in both cases, the written integer number
will be interpreted as a requested PM QoS constraint in microseconds.

The requested value is not automatically applied as a new constraint, however,
as it may be less restrictive (greater in this particular case) than another
constraint previously requested by someone else. For this reason, the PM QoS
framework maintains a list of requests that have been made so far in each
global class and for each device, aggregates them and applies the effective
(minimum in this particular case) value as the new constraint.

In fact, opening the :file:`cpu_dma_latency` special device file causes a new
PM QoS request to be created and added to the priority list of requests in the
``PM_QOS_CPU_DMA_LATENCY`` class and the file descriptor coming from the
"open" operation represents that request. If that file descriptor is then
used for writing, the number written to it will be associated with the PM QoS
request represented by it as a new requested constraint value.
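
For example, a user space process could request a latency limit as in the
sketch below; note that the value is written in binary as a signed 32-bit
integer and, as described above, the request only lasts while the file
descriptor stays open (the function name is invented):

```python
import os
import struct

def request_cpu_dma_latency(limit_us, path="/dev/cpu_dma_latency"):
    """Open the special device file and write a requested limit (in us)."""
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, struct.pack("=i", limit_us))  # native-order signed 32-bit
    return fd  # keep it open; closing the fd removes the request
```

Requesting a limit of 0 effectively forbids any idle state with a nonzero
exit latency for as long as the file descriptor is held open.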
Next, the 513 + priority list mechanism will be used to determine the new effective value of 514 + the entire list of requests and that effective value will be set as a new 515 + constraint. Thus setting a new requested constraint value will only change the 516 + real constraint if the effective "list" value is affected by it. In particular, 517 + for the ``PM_QOS_CPU_DMA_LATENCY`` class it only affects the real constraint if 518 + it is the minimum of the requested constraints in the list. The process holding 519 + a file descriptor obtained by opening the :file:`cpu_dma_latency` special device 520 + file controls the PM QoS request associated with that file descriptor, but it 521 + controls this particular PM QoS request only. 522 + 523 + Closing the :file:`cpu_dma_latency` special device file or, more precisely, the 524 + file descriptor obtained while opening it, causes the PM QoS request associated 525 + with that file descriptor to be removed from the ``PM_QOS_CPU_DMA_LATENCY`` 526 + class priority list and destroyed. If that happens, the priority list mechanism 527 + will be used, again, to determine the new effective value for the whole list 528 + and that value will become the new real constraint. 529 + 530 + In turn, for each CPU there is only one resume latency PM QoS request 531 + associated with the :file:`power/pm_qos_resume_latency_us` file under 532 + :file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes 533 + this single PM QoS request to be updated regardless of which user space 534 + process does that. In other words, this PM QoS request is shared by the entire 535 + user space, so access to the file associated with it needs to be arbitrated 536 + to avoid confusion. [Arguably, the only legitimate use of this mechanism in 537 + practice is to pin a process to the CPU in question and let it use the 538 + ``sysfs`` interface to control the resume latency constraint for it.] It 539 + is still only a request, however.
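For illustration, both user space interfaces described above can be driven from a short script. The sketch below is hedged: the device node and ``sysfs`` paths are the ones documented here, but the helper names are made up, actually opening :file:`/dev/cpu_dma_latency` or writing the ``sysfs`` file requires appropriate privileges, and this is not intended as a definitive implementation.

```python
import os
import struct

def open_cpu_dma_latency_request(max_latency_us):
    """Add a PM QoS request in the PM_QOS_CPU_DMA_LATENCY class.

    The request stays active only for as long as the returned file
    descriptor is kept open; closing it removes the request from the
    class priority list again.
    """
    fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
    # The kernel interprets the written bytes as a signed 32-bit
    # integer (microseconds); negative values are rejected.
    os.write(fd, struct.pack("i", max_latency_us))
    return fd

def set_resume_latency_us(cpu, value_us):
    """Update the shared resume latency PM QoS request for one CPU."""
    path = "/sys/devices/system/cpu/cpu%d/power/pm_qos_resume_latency_us" % cpu
    with open(path, "w") as f:
        # A string representing a signed 32-bit integer, in microseconds.
        f.write("%d" % value_us)
```

Note that the value requested this way only becomes the real constraint if it is the minimum over all requests in the list, as explained above.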
It is a member of a priority list used to 540 + determine the effective value to be set as the resume latency constraint for the 541 + CPU in question every time the list of requests is updated one way or another 542 + (there may be other requests coming from kernel code in that list). 543 + 544 + CPU idle time governors are expected to regard the minimum of the global 545 + effective ``PM_QOS_CPU_DMA_LATENCY`` class constraint and the effective 546 + resume latency constraint for the given CPU as the upper limit for the exit 547 + latency of the idle states they can select for that CPU. They should never 548 + select any idle states with exit latency beyond that limit. 549 + 550 + 551 + Idle States Control Via Kernel Command Line 552 + =========================================== 553 + 554 + In addition to the ``sysfs`` interface allowing individual idle states to be 555 + `disabled for individual CPUs <idle-states-representation_>`_, there are kernel 556 + command line parameters affecting CPU idle time management. 557 + 558 + The ``cpuidle.off=1`` kernel command line option can be used to disable 559 + CPU idle time management entirely. It does not prevent the idle loop from 560 + running on idle CPUs, but it prevents the CPU idle time governors and drivers 561 + from being invoked. If it is added to the kernel command line, the idle loop 562 + will ask the hardware to enter idle states on idle CPUs via the CPU architecture 563 + support code that is expected to provide a default mechanism for this purpose. 564 + That default mechanism usually is the least common denominator for all of the 565 + processors implementing the architecture (i.e. CPU instruction set) in question, 566 + however, so it is rather crude and not very energy-efficient. For this reason, 567 + it is not recommended for production use.
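One way to check whether ``CPUIdle`` is in use on a running system is to look for the ``current_driver`` attribute of its global ``sysfs`` interface. The helper below is only a sketch under that assumption (the attribute name is real, the function name is made up); if the subsystem is disabled, e.g. because the kernel was booted with ``cpuidle.off=1``, the file is expected to be absent.

```python
import os

def cpuidle_driver(sysfs_dir="/sys/devices/system/cpu/cpuidle"):
    """Return the name of the registered CPUIdle driver, or None.

    If CPU idle time management is disabled (or no driver could be
    registered), the global sysfs interface of the subsystem is absent.
    """
    path = os.path.join(sysfs_dir, "current_driver")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()
```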
568 + 569 + The other kernel command line parameters controlling CPU idle time management 570 + described below are only relevant for the *x86* architecture and some of 571 + them affect Intel processors only. 572 + 573 + The *x86* architecture support code recognizes three kernel command line 574 + options related to CPU idle time management: ``idle=poll``, ``idle=halt``, 575 + and ``idle=nomwait``. The first two of them disable the ``acpi_idle`` and 576 + ``intel_idle`` drivers altogether, which effectively causes the entire 577 + ``CPUIdle`` subsystem to be disabled and makes the idle loop invoke the 578 + architecture support code to deal with idle CPUs. How it does that depends on 579 + which of the two parameters is added to the kernel command line. In the 580 + ``idle=halt`` case, the architecture support code will use the ``HLT`` 581 + instruction of the CPUs (which, as a rule, suspends the execution of the program 582 + and causes the hardware to attempt to enter the shallowest available idle state) 583 + for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a 584 + more or less ``lightweight`` sequence of instructions in a tight loop. [Note 585 + that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle 586 + CPUs from saving almost any energy at all may not be the only effect of it. 587 + For example, on Intel hardware it effectively prevents CPUs from using 588 + P-states (see |cpufreq|) that require any number of CPUs in a package to be 589 + idle, so it very well may hurt the performance of single-threaded computations 590 + as well as energy-efficiency. Thus using it for performance reasons may not be 591 + a good idea at all.]
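To find out which of these options, if any, the running kernel was booted with, the kernel command line can be read back from :file:`/proc/cmdline` and scanned for an ``idle=`` token. A minimal sketch (the helper name is made up):

```python
def idle_option(cmdline):
    """Extract the value of the idle= kernel command line option, if present."""
    for token in cmdline.split():
        if token.startswith("idle="):
            return token[len("idle="):]
    return None
```

For example, ``idle_option(open("/proc/cmdline").read())`` would return ``"halt"`` on a system booted with ``idle=halt``, and ``None`` if no such option was passed.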
592 + 593 + The ``idle=nomwait`` option disables the ``intel_idle`` driver and causes 594 + ``acpi_idle`` to be used (as long as all of the information needed by it is 595 + present in the system's ACPI tables), but it is not allowed to use the 596 + ``MWAIT`` instruction of the CPUs to ask the hardware to enter idle states. 597 + 598 + In addition to the architecture-level kernel command line options affecting CPU 599 + idle time management, there are parameters affecting individual ``CPUIdle`` 600 + drivers that can be passed to them via the kernel command line. Specifically, 601 + the ``intel_idle.max_cstate=<n>`` and ``processor.max_cstate=<n>`` parameters, 602 + where ``<n>`` is an idle state index also used in the name of the given 603 + state's directory in ``sysfs`` (see 604 + `Representation of Idle States <idle-states-representation_>`_), cause the 605 + ``intel_idle`` and ``acpi_idle`` drivers, respectively, to discard all of the 606 + idle states deeper than idle state ``<n>``. In that case, they will never ask 607 + for any of those idle states or expose them to the governor. [The behavior of 608 + the two drivers is different for ``<n>`` equal to ``0``. Adding 609 + ``intel_idle.max_cstate=0`` to the kernel command line disables the 610 + ``intel_idle`` driver and allows ``acpi_idle`` to be used, whereas 611 + ``processor.max_cstate=0`` is equivalent to ``processor.max_cstate=1``. 612 + Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that 613 + can be loaded separately and ``max_cstate=<n>`` can be passed to it as a module 614 + parameter when it is loaded.]
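Since discarded idle states are not exposed through ``sysfs`` either, the effect of ``max_cstate=<n>`` can be observed by enumerating the per-CPU state directories. The sketch below assumes the ``sysfs`` layout described under `Representation of Idle States <idle-states-representation_>`_; the helper name is made up. On a system booted with, say, ``intel_idle.max_cstate=1``, no index greater than ``1`` should appear in its output.

```python
import os

def list_idle_states(cpu=0, base="/sys/devices/system/cpu"):
    """Map idle state index -> state name for the states exposed for a CPU."""
    cpuidle_dir = os.path.join(base, "cpu%d" % cpu, "cpuidle")
    states = {}
    if not os.path.isdir(cpuidle_dir):
        return states  # CPUIdle disabled or no driver registered
    for entry in os.listdir(cpuidle_dir):
        # State directories are named state0, state1, ... stateN.
        if entry.startswith("state") and entry[len("state"):].isdigit():
            with open(os.path.join(cpuidle_dir, entry, "name")) as f:
                states[int(entry[len("state"):])] = f.read().strip()
    return states
```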
+1
Documentation/admin-guide/pm/working-state.rst
··· 5 5 .. toctree:: 6 6 :maxdepth: 2 7 7 8 + cpuidle 8 9 cpufreq 9 10 intel_pstate
-23
Documentation/cpuidle/core.txt
··· 1 - 2 - Supporting multiple CPU idle levels in kernel 3 - 4 - cpuidle 5 - 6 - General Information: 7 - 8 - Various CPUs today support multiple idle levels that are differentiated 9 - by varying exit latencies and power consumption during idle. 10 - cpuidle is a generic in-kernel infrastructure that separates 11 - idle policy (governor) from idle mechanism (driver) and provides a 12 - standardized infrastructure to support independent development of 13 - governors and drivers. 14 - 15 - cpuidle resides under drivers/cpuidle. 16 - 17 - Boot options: 18 - "cpuidle_sysfs_switch" 19 - enables current_governor interface in /sys/devices/system/cpu/cpuidle/, 20 - which can be used to switch governors at run time. This boot option 21 - is meant for developer testing only. In normal usage, kernel picks the 22 - best governor based on governor ratings. 23 - SEE ALSO: sysfs.txt in this directory.
-98
Documentation/cpuidle/sysfs.txt
··· 1 - 2 - 3 - Supporting multiple CPU idle levels in kernel 4 - 5 - cpuidle sysfs 6 - 7 - System global cpuidle related information and tunables are under 8 - /sys/devices/system/cpu/cpuidle 9 - 10 - The current interfaces in this directory has self-explanatory names: 11 - * current_driver 12 - * current_governor_ro 13 - 14 - With cpuidle_sysfs_switch boot option (meant for developer testing) 15 - following objects are visible instead. 16 - * current_driver 17 - * available_governors 18 - * current_governor 19 - In this case users can switch the governor at run time by writing 20 - to current_governor. 21 - 22 - 23 - Per logical CPU specific cpuidle information are under 24 - /sys/devices/system/cpu/cpuX/cpuidle 25 - for each online cpu X 26 - 27 - -------------------------------------------------------------------------------- 28 - # ls -lR /sys/devices/system/cpu/cpu0/cpuidle/ 29 - /sys/devices/system/cpu/cpu0/cpuidle/: 30 - total 0 31 - drwxr-xr-x 2 root root 0 Feb 8 10:42 state0 32 - drwxr-xr-x 2 root root 0 Feb 8 10:42 state1 33 - drwxr-xr-x 2 root root 0 Feb 8 10:42 state2 34 - drwxr-xr-x 2 root root 0 Feb 8 10:42 state3 35 - 36 - /sys/devices/system/cpu/cpu0/cpuidle/state0: 37 - total 0 38 - -r--r--r-- 1 root root 4096 Feb 8 10:42 desc 39 - -rw-r--r-- 1 root root 4096 Feb 8 10:42 disable 40 - -r--r--r-- 1 root root 4096 Feb 8 10:42 latency 41 - -r--r--r-- 1 root root 4096 Feb 8 10:42 name 42 - -r--r--r-- 1 root root 4096 Feb 8 10:42 power 43 - -r--r--r-- 1 root root 4096 Feb 8 10:42 residency 44 - -r--r--r-- 1 root root 4096 Feb 8 10:42 time 45 - -r--r--r-- 1 root root 4096 Feb 8 10:42 usage 46 - 47 - /sys/devices/system/cpu/cpu0/cpuidle/state1: 48 - total 0 49 - -r--r--r-- 1 root root 4096 Feb 8 10:42 desc 50 - -rw-r--r-- 1 root root 4096 Feb 8 10:42 disable 51 - -r--r--r-- 1 root root 4096 Feb 8 10:42 latency 52 - -r--r--r-- 1 root root 4096 Feb 8 10:42 name 53 - -r--r--r-- 1 root root 4096 Feb 8 10:42 power 54 - -r--r--r-- 1 root root 4096 Feb 8 10:42 
residency 55 - -r--r--r-- 1 root root 4096 Feb 8 10:42 time 56 - -r--r--r-- 1 root root 4096 Feb 8 10:42 usage 57 - 58 - /sys/devices/system/cpu/cpu0/cpuidle/state2: 59 - total 0 60 - -r--r--r-- 1 root root 4096 Feb 8 10:42 desc 61 - -rw-r--r-- 1 root root 4096 Feb 8 10:42 disable 62 - -r--r--r-- 1 root root 4096 Feb 8 10:42 latency 63 - -r--r--r-- 1 root root 4096 Feb 8 10:42 name 64 - -r--r--r-- 1 root root 4096 Feb 8 10:42 power 65 - -r--r--r-- 1 root root 4096 Feb 8 10:42 residency 66 - -r--r--r-- 1 root root 4096 Feb 8 10:42 time 67 - -r--r--r-- 1 root root 4096 Feb 8 10:42 usage 68 - 69 - /sys/devices/system/cpu/cpu0/cpuidle/state3: 70 - total 0 71 - -r--r--r-- 1 root root 4096 Feb 8 10:42 desc 72 - -rw-r--r-- 1 root root 4096 Feb 8 10:42 disable 73 - -r--r--r-- 1 root root 4096 Feb 8 10:42 latency 74 - -r--r--r-- 1 root root 4096 Feb 8 10:42 name 75 - -r--r--r-- 1 root root 4096 Feb 8 10:42 power 76 - -r--r--r-- 1 root root 4096 Feb 8 10:42 residency 77 - -r--r--r-- 1 root root 4096 Feb 8 10:42 time 78 - -r--r--r-- 1 root root 4096 Feb 8 10:42 usage 79 - -------------------------------------------------------------------------------- 80 - 81 - 82 - * desc : Small description about the idle state (string) 83 - * disable : Option to disable this idle state (bool) -> see note below 84 - * latency : Latency to exit out of this idle state (in microseconds) 85 - * residency : Time after which a state becomes more effecient than any 86 - shallower state (in microseconds) 87 - * name : Name of the idle state (string) 88 - * power : Power consumed while in this idle state (in milliwatts) 89 - * time : Total time spent in this idle state (in microseconds) 90 - * usage : Number of times this state was entered (count) 91 - 92 - Note: 93 - The behavior and the effect of the disable variable depends on the 94 - implementation of a particular governor. In the ladder governor, for 95 - example, it is not coherent, i.e. 
if one is disabling a light state, 96 - then all deeper states are disabled as well, but the disable variable 97 - does not reflect it. Likewise, if one enables a deep state but a lighter 98 - state still is disabled, then this has no effect.