Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/documentation: Document the util clamp feature

Add a document explaining the util clamp feature: what it is and
how to use it. The new document hopefully covers everything one needs to
know about uclamp.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lore.kernel.org/r/20221216235716.201923-1-qyousef@layalina.io
Cc: Jonathan Corbet <corbet@lwn.net>

Authored by Qais Yousef, committed by Ingo Molnar
acbee592 ef90cf22

3 files changed, 745 insertions(+)

Documentation/admin-guide/cgroup-v2.rst (+3)

  ···
      and is an example of this type.

  +   .. _cgroupv2-limits-distributor:
  +
      Limits
      ------
  ···
      "io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
      on an IO device and is an example of this type.

  +   .. _cgroupv2-protections-distributor:

      Protections
      -----------
Documentation/scheduler/index.rst (+1)

  ···
      sched-capacity
      sched-energy
      schedutil
  +   sched-util-clamp
      sched-nice-design
      sched-rt-group
      sched-stats
Documentation/scheduler/sched-util-clamp.rst (+741)
.. SPDX-License-Identifier: GPL-2.0

====================
Utilization Clamping
====================

1. Introduction
===============

Utilization clamping, also known as util clamp or uclamp, is a scheduler
feature that allows user space to help in managing the performance requirement
of tasks. It was introduced in the v5.3 release. The cgroup support was merged
in v5.4.

Uclamp is a hinting mechanism that allows the scheduler to understand the
performance requirements and restrictions of the tasks, thus it helps the
scheduler to make a better decision. And when the schedutil cpufreq governor is
used, util clamp will influence the CPU frequency selection as well.

Since the scheduler and schedutil are both driven by PELT (util_avg) signals,
util clamp acts on that to achieve its goal by clamping the signal to a
certain point; hence the name. That is, by clamping utilization we are making
the system run at a certain performance point.

The right way to view util clamp is as a mechanism to make a request or hint
on performance constraints. It consists of two tunables:

* UCLAMP_MIN, which sets the lower bound.
* UCLAMP_MAX, which sets the upper bound.

These two bounds will ensure a task will operate within this performance range
of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies
capping a task.

One can tell the system (scheduler) that some tasks require a minimum
performance point to operate at to deliver the desired user experience. Or one
can tell the system that some tasks should be restricted from consuming too
many resources and should not go above a specific performance point. Viewing
the uclamp values as performance points rather than utilization is a better
abstraction from the user space point of view.
As an example, a game can use util clamp to form a feedback loop with its
perceived Frames Per Second (FPS). It can dynamically increase the minimum
performance point required by its display pipeline to ensure no frame is
dropped. It can also dynamically 'prime' up these tasks if it knows that in
the coming few hundred milliseconds a computationally intensive scene is about
to happen.

On mobile hardware where the capability of the devices varies a lot, this
dynamic feedback loop offers great flexibility to ensure the best user
experience given the capabilities of any system.

Of course a static configuration is possible too. The exact usage will depend
on the system, application and the desired outcome.

Another example is in Android where tasks are classified as background,
foreground, top-app, etc. Util clamp can be used to constrain how many
resources background tasks are consuming by capping the performance point they
can run at. This constraint helps reserve resources for important tasks, like
the ones belonging to the currently active app (top-app group). Besides, this
helps in limiting how much power they consume. This can be more obvious in
heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
background tasks to stay on the little cores, which will ensure that:

1. The big cores are free to run top-app tasks immediately. top-app
   tasks are the tasks the user is currently interacting with, hence
   the most important tasks in the system.
2. They don't run on a power hungry core and drain battery even if they
   are CPU intensive tasks.

.. note::
   **little cores**:
      CPUs with capacity < 1024

   **big cores**:
      CPUs with capacity = 1024

By making these uclamp performance requests, or rather hints, user space can
ensure system resources are used optimally to deliver the best possible user
experience.

Another use case is to help with **overcoming the ramp up latency inherent in
how the scheduler utilization signal is calculated**.

A busy task, for instance, that requires to run at the maximum performance
point will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) before the
scheduler realizes that. This is known to affect workloads like gaming on
mobile devices where frames will drop due to the slow response time in
selecting the higher frequency required for the tasks to finish their work in
time. Setting UCLAMP_MIN=1024 will ensure such tasks always see the highest
performance level when they start running.

The overall visible effect goes beyond better perceived user
experience/performance and stretches to help achieve a better overall
performance/watt if used effectively.

User space can form a feedback loop with the thermal subsystem too to ensure
the device doesn't heat up to the point where it will throttle.

Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints.

In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any
performance point rather than being tied to MAX frequency all the time, which
can be useful on general purpose systems that run on battery powered devices.

Note that by design RT tasks don't have a per-task PELT signal and must always
run at a constant frequency to combat nondeterministic DVFS rampup delays.

Note that using schedutil always implies a single delay to modify the
frequency when an RT task wakes up. This cost is unchanged by using uclamp.
Uclamp only helps picking what frequency to request instead of schedutil
always requesting MAX for all RT tasks.

See :ref:`section 3.4 <uclamp-default-values>` for default values and
:ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change the RT tasks'
default value.

2. Design
=========

Util clamp is a property of every task in the system. It sets the boundaries
of its utilization signal, acting as a bias mechanism that influences certain
decisions within the scheduler.

The actual utilization signal of a task is never clamped in reality. If you
inspect PELT signals at any point of time you should continue to see them as
they are, intact. Clamping happens only when needed, e.g.: when a task wakes
up and the scheduler needs to select a suitable CPU for it to run on.

Since the goal of util clamp is to allow requesting a minimum and maximum
performance point for a task to run on, it must be able to influence the
frequency selection as well as task placement to be most effective. Both of
these have implications on the utilization value at CPU runqueue (rq for
short) level, which brings us to the main design challenge.

When a task wakes up on an rq, the utilization signal of the rq will be
affected by the uclamp settings of all the tasks enqueued on it. For example,
if a task requests to run at UCLAMP_MIN = 512, then the util signal of the rq
needs to respect this request as well as all other requests from all of the
enqueued tasks.

To be able to aggregate the util clamp value of all the tasks attached to the
rq, uclamp must do some housekeeping at every enqueue/dequeue, which is in the
scheduler hot path. Hence care must be taken, since any slowdown will have a
significant impact on a lot of use cases and could hinder its usability in
practice.
The way this is handled is by dividing the utilization range into buckets
(struct uclamp_bucket) which allows us to reduce the search space from every
task on the rq to only a subset of tasks on the top-most bucket.

When a task is enqueued, the counter in the matching bucket is incremented,
and on dequeue it is decremented. This makes keeping track of the effective
uclamp value at rq level a lot easier.

As tasks are enqueued and dequeued, we keep track of the current effective
uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
how this works.

Later, any path that wants to identify the effective uclamp value of the rq
will simply need to read this effective uclamp value of the rq at the exact
moment in time it needs to take a decision.

For the task placement case, only Energy Aware and Capacity Aware Scheduling
(EAS/CAS) make use of uclamp for now, which implies that it is applied on
heterogeneous systems only. When a task wakes up, the scheduler will look at
the current effective uclamp value of every rq and compare it with the
potential new value if the task were to be enqueued there, favoring the rq
that will end up with the most energy efficient combination.

Similarly in schedutil, when it needs to make a frequency update it will look
at the current effective uclamp value of the rq, which is influenced by the
set of tasks currently enqueued there, and select the appropriate frequency
that will satisfy the constraints from the requests.

Other paths, like setting the overutilization state (which effectively
disables EAS), make use of uclamp as well. Such cases are considered necessary
housekeeping to allow the 2 main use cases above and will not be covered in
detail here as they could change with implementation details.

.. _uclamp-buckets:

2.1. Buckets
------------

::

                           [struct rq]

  (bottom)                                                    (top)

    0                                                          1024
    |                                                           |
    +-----------+-----------+-----------+----   ----+-----------+
    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
    +-----------+-----------+-----------+----   ----+-----------+
       :           :                                   :
       +- p0       +- p3                               +- p4
       :                                               :
       +- p1                                           +- p5
       :
       +- p2


.. note::
   The diagram above is an illustration rather than a true depiction of the
   internal data structure.

To reduce the search space when trying to decide the effective uclamp value of
an rq as tasks are enqueued/dequeued, the whole utilization range is divided
into N buckets where N is configured at compile time by setting
CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.

The rq has a bucket for each uclamp_id tunable: [UCLAMP_MIN, UCLAMP_MAX].

The range of each bucket is 1024/N. For example, for the default value of
5 there will be 5 buckets, each of which will cover the following range:

::

        DELTA = round_closest(1024/5) = 204.8 = 205

        Bucket 0: [0:204]
        Bucket 1: [205:409]
        Bucket 2: [410:614]
        Bucket 3: [615:819]
        Bucket 4: [820:1024]

When a task p with the following tunable parameters

::

        p->uclamp[UCLAMP_MIN] = 300
        p->uclamp[UCLAMP_MAX] = 1024

is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and
bucket 4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a
task in this range.

The rq then keeps track of its current effective uclamp value for each
uclamp_id.
When a task p is enqueued, the rq value changes to:

::

        // update bucket logic goes here
        rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])

        // repeat for UCLAMP_MAX

Similarly, when p is dequeued the rq value changes to:

::

        // update bucket logic goes here
        rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()

        // repeat for UCLAMP_MAX

When all buckets are empty, the rq uclamp values are reset to system defaults.
See :ref:`section 3.4 <uclamp-default-values>` for details on default values.


2.2. Max aggregation
--------------------

Util clamp is tuned to honour the request for the task that requires the
highest performance point.

When multiple tasks are attached to the same rq, then util clamp must make
sure the task that needs the highest performance point gets it even if there's
another task that doesn't need it or is disallowed from reaching this point.

For example, if there are multiple tasks attached to an rq with the following
values:

::

        p0->uclamp[UCLAMP_MIN] = 300
        p0->uclamp[UCLAMP_MAX] = 900

        p1->uclamp[UCLAMP_MIN] = 500
        p1->uclamp[UCLAMP_MAX] = 500

then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
and UCLAMP_MAX become:

::

        rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
        rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900

As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
aggregation is the cause of one of the limitations when using util clamp, in
particular for the UCLAMP_MAX hint when user space would like to save power.

2.3. Hierarchical aggregation
-----------------------------

As stated earlier, util clamp is a property of every task in the system.
But the actual applied (effective) value can be influenced by more than just
the request made by the task or another actor on its behalf (middleware
library).

The effective util clamp value of any task is restricted as follows:

1. By the uclamp settings defined by the cgroup CPU controller it is attached
   to, if any.
2. The restricted value in (1) is then further restricted by the system wide
   uclamp settings.

:ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand
further on that.

For now, suffice to say that if a task makes a request, its actual effective
value will have to adhere to some restrictions imposed by cgroup and system
wide settings.

The system will still accept the request even if it will effectively be beyond
the constraints, but as soon as the task moves to a different cgroup or a
sysadmin modifies the system settings, the request will be satisfied only if
it is within the new constraints.

In other words, this aggregation will not cause an error when a task changes
its uclamp values, but rather the system may not be able to satisfy the
requests based on those factors.

2.4. Range
----------

The uclamp performance request has a range of 0 to 1024 inclusive.

For the cgroup interface a percentage is used (that is, 0 to 100 inclusive).
Just like other cgroup interfaces, you can use 'max' instead of 100.

.. _uclamp-interfaces:

3. Interfaces
=============

3.1. Per task interface
-----------------------

The sched_setattr() syscall was extended to accept two new fields:

* sched_util_min: requests the minimum performance point the system should run
  at when this task is running. Or lower performance bound.
* sched_util_max: requests the maximum performance point the system should run
  at when this task is running. Or upper performance bound.

For example, the following scenario has utilization constraints of 40% to 80%:

::

        attr->sched_util_min = 40% * 1024;
        attr->sched_util_max = 80% * 1024;

When task @p is running, **the scheduler should try its best to ensure it
starts at the 40% performance level**. If the task runs for long enough so
that its actual utilization goes above 80%, the utilization, or performance
level, will be capped.

The special value -1 is used to reset the uclamp settings to the system
default.

Note that resetting the uclamp value to the system default using -1 is not the
same as manually setting the uclamp value to the system default. This
distinction is important because, as we shall see in the system interfaces,
the default value for RT could be changed. SCHED_NORMAL/OTHER might gain
similar knobs too in the future.

3.2. cgroup interface
---------------------

There are two uclamp related values in the CPU cgroup controller:

* cpu.uclamp.min
* cpu.uclamp.max

When a task is attached to a CPU controller, its uclamp values will be
impacted as follows:

* cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup
  v2 documentation <cgroupv2-protections-distributor>`.

  If a task's uclamp_min value is lower than cpu.uclamp.min, then the task
  will inherit the cgroup's cpu.uclamp.min value.

  In a cgroup hierarchy, the effective cpu.uclamp.min is the max of (child,
  parent).

* cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2
  documentation <cgroupv2-limits-distributor>`.
  If a task's uclamp_max value is higher than cpu.uclamp.max, then the task
  will inherit the cgroup's cpu.uclamp.max value.

  In a cgroup hierarchy, the effective cpu.uclamp.max is the min of (child,
  parent).

For example, given the following parameters:

::

        p0->uclamp[UCLAMP_MIN] = // system default;
        p0->uclamp[UCLAMP_MAX] = // system default;

        p1->uclamp[UCLAMP_MIN] = 40% * 1024;
        p1->uclamp[UCLAMP_MAX] = 50% * 1024;

        cgroup0->cpu.uclamp.min = 20% * 1024;
        cgroup0->cpu.uclamp.max = 60% * 1024;

        cgroup1->cpu.uclamp.min = 60% * 1024;
        cgroup1->cpu.uclamp.max = 100% * 1024;

when p0 and p1 are attached to cgroup0, the values become:

::

        p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024;
        p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024;

        p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact

when p0 and p1 are attached to cgroup1, these instead become:

::

        p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
        p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024;

        p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact

Note that the cgroup interface allows cpu.uclamp.max to be set lower than
cpu.uclamp.min. Other interfaces don't allow that.

3.3. System interface
---------------------

3.3.1 sched_util_clamp_min
--------------------------

System wide limit of the allowed UCLAMP_MIN range. By default it is set to
1024, which means the permitted effective UCLAMP_MIN range for tasks is
[0:1024]. By changing it to 512, for example, the range reduces to [0:512].
This is useful to restrict how much boosting tasks are allowed to acquire.
Requests from tasks to go above this knob value will still succeed, but
they won't be satisfied until this knob value becomes larger than
p->uclamp[UCLAMP_MIN].

The value must be smaller than or equal to sched_util_clamp_max.

3.3.2 sched_util_clamp_max
--------------------------

System wide limit of the allowed UCLAMP_MAX range. By default it is set to
1024, which means the permitted effective UCLAMP_MAX range for tasks is
[0:1024].

By changing it to 512, for example, the effective allowed range reduces to
[0:512]. This means that no task can run above 512, which implies that all
rqs are restricted too. IOW, the whole system is capped to half its
performance capacity.

This is useful to restrict the overall maximum performance point of the
system. For example, it can be handy to limit performance when running low on
battery, or when the system wants to limit access to the more energy hungry
performance levels when it's in an idle state or the screen is off.

Requests from tasks to go above this knob value will still succeed, but they
won't be satisfied until this knob value becomes larger than
p->uclamp[UCLAMP_MAX].

The value must be greater than or equal to sched_util_clamp_min.

.. _uclamp-default-values:

3.4. Default values
-------------------

By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:

::

        p_fair->uclamp[UCLAMP_MIN] = 0
        p_fair->uclamp[UCLAMP_MAX] = 1024

That is, by default they're neither boosted nor capped and can run at any
performance point of the system. This default cannot be changed at boot or
runtime. No argument was made yet as to why we should provide this ability,
but it can be added in the future.
For SCHED_FIFO/SCHED_RR tasks:

::

        p_rt->uclamp[UCLAMP_MIN] = 1024
        p_rt->uclamp[UCLAMP_MAX] = 1024

That is, by default they're boosted to run at the maximum performance point of
the system, which retains the historical behavior of RT tasks.

The RT tasks' default uclamp_min value can be modified at boot or runtime via
sysctl. See the section below.

.. _sched-util-clamp-min-rt-default:

3.4.1 sched_util_clamp_min_rt_default
-------------------------------------

Running RT tasks at the maximum performance point is expensive on battery
powered devices and not necessary. To allow system developers to offer good
performance guarantees for these tasks without pushing them all the way to the
maximum performance point, this sysctl knob allows tuning the best boost value
to address the system requirement without burning power running at the maximum
performance point all the time.

Application developers are encouraged to use the per task util clamp interface
to ensure they are performance and power aware. Ideally this knob should be
set to 0 by system designers, leaving the task of managing performance
requirements to the apps.

4. How to use util clamp
========================

Util clamp promotes the concept of user space assisted power and performance
management. At the scheduler level there is not enough information to always
make the best decision. However, with util clamp user space can hint to the
scheduler to make better decisions about task placement and frequency
selection.

Best results are achieved by not making any assumptions about the system the
application is running on and using it in conjunction with a feedback loop to
dynamically monitor and adjust. Ultimately this will allow for a better user
experience at a better perf/watt.
For some systems and use cases, a static setup will help to achieve good
results. Portability will be a problem in this case, though: how much work one
can do at 100, 200 or 1024 is different for each system. Unless there's a
specific target system, a static setup should be avoided.

There are enough possibilities to create a whole framework based on util clamp
or a self contained app that makes use of it directly.

4.1. Boost important and DVFS-latency-sensitive tasks
-----------------------------------------------------

A GUI task might not be busy enough to warrant driving the frequency high when
it wakes up. However, it requires to finish its work within a specific time
window to deliver the desired user experience. The right frequency it requires
at wakeup will be system dependent. On some underpowered systems it will be
high, on other overpowered ones it will be low or 0.

This task can increase its UCLAMP_MIN value every time it misses the deadline
to ensure on the next wake up it runs at a higher performance point. It should
try to approach the lowest UCLAMP_MIN value that allows it to meet its
deadline on any particular system to achieve the best possible perf/watt for
that system.

On heterogeneous systems, it might be important for this task to run on
a faster CPU.

**Generally it is advised to perceive the input as a performance level or
point, which will imply both task placement and frequency selection.**

4.2. Cap background tasks
-------------------------

As explained for the Android case in the introduction, any app can lower
UCLAMP_MAX for some background tasks that don't care about performance but
could end up being busy and consuming unnecessary system resources.

4.3. Powersave mode
-------------------

The sched_util_clamp_max system wide interface can be used to limit all tasks
from operating at the higher performance points, which are usually energy
inefficient.

This is not unique to uclamp, as one can achieve the same by reducing the max
frequency of the cpufreq governor. It can be considered a more convenient
alternative interface, though.

4.4. Per-app performance restriction
------------------------------------

Middleware/utilities can provide the user an option to set UCLAMP_MIN/MAX for
an app every time it is executed, to guarantee a minimum performance point
and/or limit it from draining system power at the cost of reduced performance
for these apps.

If you want to prevent your laptop from heating up from compiling the kernel
while on the go, and are happy to sacrifice performance to save power but
would still like to keep your browser performance intact, uclamp makes it
possible.

5. Limitations
==============

.. _uclamp-capping-fail:

5.1. Capping frequency with uclamp_max fails under certain conditions
---------------------------------------------------------------------

If task p0 is capped to run at 512:

::

        p0->uclamp[UCLAMP_MAX] = 512

and it shares the rq with p1 which is free to run at any performance point:

::

        p1->uclamp[UCLAMP_MAX] = 1024

then due to max aggregation the rq will be allowed to reach the max
performance point:

::

        rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024

Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
the rq will depend on the actual utilization value of the tasks.
If p1 is a small task but p0 is a CPU intensive task, then due to the fact
that both are running on the same rq, p1 will cause the frequency capping to
be lifted from the rq, although p1, which is allowed to run at any performance
point, doesn't actually need to run at that frequency.

5.2. UCLAMP_MAX can break PELT (util_avg) signal
------------------------------------------------

PELT assumes that frequency will always increase as the signals grow to ensure
there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency
increase will be prevented, which can lead to no idle time in some
circumstances. When there's no idle time, a task will be stuck in a busy loop,
which would result in util_avg being 1024.

Combined with the issue described below, this can lead to unwanted frequency
spikes when severely capped tasks share the rq with a small non capped task.

As an example, if task p0, which has:

::

        p0->util_avg = 300
        p0->uclamp[UCLAMP_MAX] = 0

wakes up on an idle CPU, then it will run at the min frequency (Fmin) this
CPU is capable of. The max CPU frequency (Fmax) matters here as well,
since it designates the shortest computational time to finish the task's
work on this CPU.

::

        rq->uclamp[UCLAMP_MAX] = 0

If the ratio of Fmax/Fmin is 3, then the maximum value will be:

::

        300 * (Fmax/Fmin) = 900

which indicates the CPU will still see idle time since 900 is < 1024. The
_actual_ util_avg will not be 900 though, but somewhere between 300 and 900.
As long as there's idle time, p0->util_avg updates will be off by some margin,
but not proportional to Fmax/Fmin.
::

        p0->util_avg = 300 + small_error

Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:

::

        300 * (Fmax/Fmin) = 1200

which is higher than 1024 and indicates that the CPU has no idle time. When
this happens, the _actual_ util_avg will become:

::

        p0->util_avg = 1024

If task p1, which has:

::

        p1->util_avg = 200
        p1->uclamp[UCLAMP_MAX] = 1024

wakes up on this CPU, then the effective UCLAMP_MAX for the CPU will be 1024
according to the max aggregation rule. But since the capped p0 task was
running and throttled severely, the rq->util_avg will be:

::

        p0->util_avg = 1024
        p1->util_avg = 200

        rq->util_avg = 1024
        rq->uclamp[UCLAMP_MAX] = 1024

Hence this leads to a frequency spike, since if p0 weren't throttled we would
get:

::

        p0->util_avg = 300
        p1->util_avg = 200

        rq->util_avg = 500

and run somewhere near the mid performance point of that CPU, not the Fmax we
get.

5.3. Schedutil response time issues
-----------------------------------

schedutil has three limitations:

1. Hardware takes non-zero time to respond to any frequency change
   request. On some platforms this can be in the order of a few ms.
2. Non fast-switch systems require a worker deadline thread to wake up
   and perform the frequency change, which adds measurable overhead.
3. schedutil rate_limit_us drops any requests made during the
   rate_limit_us window.

If a relatively small task is doing a critical job and requires a certain
performance point when it wakes up and starts running, then all these
limitations will prevent it from getting what it wants in the time scale it
expects.
This limitation is not only impactful when using uclamp, but it will be more
prevalent as we no longer gradually ramp up or down. We could easily be
jumping between frequencies depending on the order tasks wake up and their
respective uclamp values.

We regard that as a limitation of the capabilities of the underlying system
itself.

There is room to improve the behavior of schedutil rate_limit_us, but not much
can be done for 1 or 2. They are considered hard limitations of the system.