Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

- Improve uclamp performance by using a static key for the fast path

- Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
  better power efficiency of RT tasks on battery-powered devices.
  (The default is to maximize performance & reduce RT latencies.)

- Improve utime and stime tracking accuracy. The previous code had a
  fixed bound on the absolute error, which created larger and larger
  relative errors as the values became larger. This is now replaced with
  more precise arithmetic, using the new mul_u64_u64_div_u64() helper
  in math64.h.

- Improve the deadline scheduler, such as making it capacity aware

- Improve frequency-invariant scheduling

- Misc cleanups in energy/power aware scheduling

- Add sched_update_nr_running tracepoint to track changes to nr_running

- Documentation additions and updates

- Misc cleanups and smaller fixes

* tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
sched/doc: Document capacity aware scheduling
sched: Document arch_scale_*_capacity()
arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
Documentation/sysctl: Document uclamp sysctl knobs
sched/uclamp: Add a new sysctl to control RT default boost value
sched/uclamp: Fix a deadlock when enabling uclamp static key
sched: Remove duplicated tick_nohz_full_enabled() check
sched: Fix a typo in a comment
sched/uclamp: Remove unnecessary mutex_init()
arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
arch_topology, sched/core: Cleanup thermal pressure definition
trace/events/sched.h: fix duplicated word
linux/sched/mm.h: drop duplicated words in comments
smp: Fix a potential usage of stale nr_cpus
sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
sched: nohz: stop passing around unused "ticks" parameter.
sched: Better document ttwu()
sched: Add a tracepoint to track rq->nr_running
...

+1521 -332
+54
Documentation/admin-guide/sysctl/kernel.rst
···
 incurs a small amount of overhead in the scheduler but is
 useful for debugging and performance tuning.
 
+sched_util_clamp_min:
+=====================
+
+Max allowed *minimum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.min value cannot be greater than
+sched_util_clamp_min, i.e., it is restricted to the range
+[0:sched_util_clamp_min].
+
+sched_util_clamp_max:
+=====================
+
+Max allowed *maximum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.max value cannot be greater than
+sched_util_clamp_max, i.e., it is restricted to the range
+[0:sched_util_clamp_max].
+
+sched_util_clamp_min_rt_default:
+================================
+
+By default, Linux is tuned for performance, which means that RT tasks
+always run at the highest frequency and on the most capable (highest
+capacity) CPU (in heterogeneous systems).
+
+Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
+1024 by default, which effectively boosts the tasks to run at the highest
+frequency and biases them to run on the biggest CPU.
+
+This knob allows admins to change the default behavior when uclamp is being
+used. In battery-powered devices particularly, running at the maximum
+capacity and frequency will increase energy consumption and shorten the
+battery life.
+
+This knob only affects RT tasks whose requested uclamp.min value has not
+been modified by the user via the sched_setattr() syscall.
+
+This knob will not escape the range constraint imposed by
+sched_util_clamp_min defined above.
+
+For example, if
+
+	sched_util_clamp_min_rt_default = 800
+	sched_util_clamp_min = 600
+
+then the boost will be clamped to 600, because 800 is outside the
+permissible range of [0:600]. This could happen, for instance, if a
+powersave mode temporarily restricts all boosts by modifying
+sched_util_clamp_min. As soon as this restriction is lifted, the requested
+sched_util_clamp_min_rt_default will take effect.
 
 seccomp
 =======
+1
Documentation/scheduler/index.rst
··· 12 12 sched-deadline 13 13 sched-design-CFS 14 14 sched-domains 15 + sched-capacity 15 16 sched-energy 16 17 sched-nice-design 17 18 sched-rt-group
+439
Documentation/scheduler/sched-capacity.rst
=========================
Capacity Aware Scheduling
=========================

1. CPU Capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both. The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given amount of instructions attainable
per Hz, leading to::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)

1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` from
which some loss of available performance (e.g. time spent handling IRQs) is
subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``capacity_orig`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
brevity.

1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider a hypothetical dual-core asymmetric CPU capacity system where

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

  CPU0 work ^
            |     ____                ____               ____
            |    |    |              |    |             |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

  CPU1 work ^
            |     _________           _________          ____
            |    |         |         |         |        |
            +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.

1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, with each CPU running at its
maximum frequency, results in::

  CPU0 work ^
            |     ____                ____               ____
            |    |    |              |    |             |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

  CPU1 work ^
            |     ______________      ______________     ____
            |    |              |    |              |   |
            +----+----+----+----+----+----+----+----+----+----+-> time

1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on. Still, results using this
simple approach have been satisfactory for now.

2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regards to CPU capacity. Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complicate this somewhat; the following sections will
expand on these.

2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____               ____
           |    |    |              |    |             |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________           _________          ____
           |    |         |         |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above yields a frequency invariant
task utilization of 25%.

2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2., i.e.::

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at their maximum frequency would
result in::

  CPU0 work ^
            |     ____                ____               ____
            |    |    |              |    |             |    |
            +----+----+----+----+----+----+----+----+----+----+-> time

  CPU1 work ^
            |     ______________      ______________     ____
            |    |              |    |              |   |
            +----+----+----+----+----+----+----+----+----+----+-> time

IOW,

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system. Applying this formula to the above example yields a CPU invariant task
utilization of 25%.

2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal. The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                     curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                     max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.

Any mention of task utilization in the following sections will imply its
invariant form.

2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable. The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.

3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own; this information thus
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm and arm64 architectures directly map this to the arch_topology driver
CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
Documentation/devicetree/bindings/arm/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out at which frequency each CPU
has been running. One way to implement this is to leverage hardware counters
whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).

4. Scheduler topology
=====================

During the construction of the sched domains, the scheduler will figure out
whether the system exhibits asymmetric CPU capacities. Should that be the
case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY flag will be set at the lowest sched_domain level that
  spans all unique CPU capacity values.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
*system-wide*. Imagine the following setup using cpusets::

  capacity     C/2           C
             ________    ________
            /        \  /        \
  CPUs      0  1  2  3  4  5  6  7
            \__/ \______________/
  cpusets    cs0        cs1

Which could be created via:

.. code-block:: sh

  mkdir /sys/fs/cgroup/cpuset/cs0
  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems

  mkdir /sys/fs/cgroup/cpuset/cs1
  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems

  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Since there *is* CPU capacity asymmetry in the system, the
sched_asym_cpucapacity static key will be enabled. However, the sched_domain
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
set in that hierarchy, it describes an SMP island and should be treated as such.

Therefore, the 'canonical' pattern for protecting codepaths that cater to
asymmetric CPU capacities is to:

- Check the sched_asym_cpucapacity static key
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
  CPU or group thereof)

5. Capacity aware scheduling implementation
===========================================

5.1 CFS
-------

5.1.1 Capacity fitness
~~~~~~~~~~~~~~~~~~~~~~

The main capacity scheduling criterion of CFS is::

  task_util(p) < capacity(task_cpu(p))

This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
task "fits" on its CPU. If it is violated, the task will need to achieve more
work than what its CPU can provide: it will be CPU-bound.

Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
value for a task, either via sched_setattr() or via the cgroup interface (see
Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used
to clamp task_util() in the previous criterion.

5.1.2 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

CFS task wakeup CPU selection follows the capacity fitness criterion described
above. On top of that, uclamp is used to clamp the task utilization values,
which lets userspace have more leverage over the CPU selection of CFS
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::

  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)

By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
giving it a high uclamp.min value.

.. note::

  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
  (EAS), which is described in Documentation/scheduler/sched-energy.rst.

5.1.3 Load balancing
~~~~~~~~~~~~~~~~~~~~

A pathological case in the wakeup CPU selection occurs when a task rarely
sleeps, if at all - it thus rarely wakes up, if at all. Consider::

  w == wakeup event

  capacity(CPU0) = C
  capacity(CPU1) = C / 3

  workload on CPU0
  CPU work ^
           |     _________           _________          ____
           |    |         |         |         |        |
           +----+----+----+----+----+----+----+----+----+----+-> time
                          w                   w          w

  workload on CPU1
  CPU work ^
           |     ____________________________________________
           |    |
           +----+----+----+----+----+----+----+----+----+----+->
                w

This workload should run on CPU0, but if the task either:

- was improperly scheduled from the start (inaccurate initial
  utilization estimation)
- was properly scheduled from the start, but suddenly needs more
  processing power

then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
the CPU capacity scheduling criterion is violated, and there may not be any more
wakeup event to fix this up via wakeup CPU selection.

Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
put in place to handle this shares the same name. Misfit task migration
leverages the CFS load balancer, more specifically the active load balance part
(which caters to migrating currently running tasks). When load balance happens,
a misfit active load balance will be triggered if a misfit task can be migrated
to a CPU with more capacity than its current one.

5.2 RT
------

5.2.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

RT task wakeup CPU selection searches for a CPU that satisfies::

  task_uclamp_min(p) <= capacity(cpu)

while still following the usual priority constraints. If none of the candidate
CPUs can satisfy this capacity criterion, then strict priority based scheduling
is followed and CPU capacities are ignored.

5.3 DL
------

5.3.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

DL task wakeup CPU selection searches for a CPU that satisfies::

  task_bandwidth(p) < capacity(task_cpu(p))

while still respecting the usual bandwidth and deadline constraints. If
none of the candidate CPUs can satisfy this capacity criterion, then the
task will remain on its current CPU.
+2 -10
Documentation/scheduler/sched-energy.rst
···
 looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
 domains are built.
 
-The flag is set/cleared automatically by the scheduler topology code whenever
-there are CPUs with different capacities in a root domain. The capacities of
-CPUs are provided by arch-specific code through the arch_scale_cpu_capacity()
-callback. As an example, arm and arm64 share an implementation of this callback
-which uses a combination of CPUFreq data and device-tree bindings to compute the
-capacity of CPUs (see drivers/base/arch_topology.c for more details).
-
-So, in order to use EAS on your platform your architecture must implement the
-arch_scale_cpu_capacity() callback, and some of the CPUs must have a lower
-capacity than others.
+See Documentation/scheduler/sched-capacity.rst for requirements to be met for
+this flag to be set in the sched_domain hierarchy.
 
 Please note that EAS is not fundamentally incompatible with SMP, but no
 significant savings on SMP platforms have been observed yet. This restriction
+2 -1
arch/arm/include/asm/topology.h
···
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure retrieve API */
+/* Replace task scheduler's default thermal pressure API */
 #define arch_scale_thermal_pressure topology_get_thermal_pressure
+#define arch_set_thermal_pressure topology_set_thermal_pressure
 
 #else
 
+2 -1
arch/arm64/include/asm/topology.h
···
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure retrieve API */
+/* Replace task scheduler's default thermal pressure API */
 #define arch_scale_thermal_pressure topology_get_thermal_pressure
+#define arch_set_thermal_pressure topology_set_thermal_pressure
 
 #include <asm-generic/topology.h>
 
+12 -2
arch/x86/include/asm/div64.h
···
 #else
 # include <asm-generic/div64.h>
 
-static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
+/*
+ * Will generate an #DE when the result doesn't fit u64, could fix with an
+ * __ex_table[] entry when it becomes an issue.
+ */
+static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
 {
 	u64 q;
 
 	asm ("mulq %2; divq %3" : "=a" (q)
-	     : "a" (a), "rm" ((u64)mul), "rm" ((u64)div)
+	     : "a" (a), "rm" (mul), "rm" (div)
 	     : "rdx");
 
 	return q;
+}
+#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
+
+static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
+{
+	return mul_u64_u64_div_u64(a, mul, div);
 }
 #define mul_u64_u32_div mul_u64_u32_div
 
+1 -1
arch/x86/include/asm/topology.h
···
 }
 #endif /* CONFIG_SCHED_MC_PRIO */
 
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && defined(CONFIG_X86_64)
 #include <asm/cpufeature.h>
 
 DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
+41 -9
arch/x86/kernel/smpboot.c
···
 #include <linux/cpuidle.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
+#include <linux/overflow.h>
 
 #include <asm/acpi.h>
 #include <asm/desc.h>
···
 
 #endif
 
+#ifdef CONFIG_X86_64
 /*
  * APERF/MPERF frequency ratio computation.
  *
···
 static bool intel_set_max_freq_ratio(void)
 {
 	u64 base_freq, turbo_freq;
+	u64 turbo_ratio;
 
 	if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
 		goto out;
···
 	/*
 	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
 	 * but then fill all MSR's with zeroes.
+	 * Some CPUs have turbo boost but don't declare any turbo ratio
+	 * in MSR_TURBO_RATIO_LIMIT.
 	 */
-	if (!base_freq) {
-		pr_debug("Couldn't determine cpu base frequency, necessary for scale-invariant accounting.\n");
+	if (!base_freq || !turbo_freq) {
+		pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
 		return false;
 	}
 
-	arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
-					base_freq);
+	turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
+	if (!turbo_ratio) {
+		pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
+		return false;
+	}
+
+	arch_turbo_freq_ratio = turbo_ratio;
 	arch_set_max_freq_ratio(turbo_disabled());
+
 	return true;
 }
···
 	}
 }
 
+static void disable_freq_invariance_workfn(struct work_struct *work)
+{
+	static_branch_disable(&arch_scale_freq_key);
+}
+
+static DECLARE_WORK(disable_freq_invariance_work,
+		    disable_freq_invariance_workfn);
+
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 
 void arch_scale_freq_tick(void)
 {
-	u64 freq_scale;
+	u64 freq_scale = SCHED_CAPACITY_SCALE;
 	u64 aperf, mperf;
 	u64 acnt, mcnt;
···
 
 	acnt = aperf - this_cpu_read(arch_prev_aperf);
 	mcnt = mperf - this_cpu_read(arch_prev_mperf);
-	if (!mcnt)
-		return;
 
 	this_cpu_write(arch_prev_aperf, aperf);
 	this_cpu_write(arch_prev_mperf, mperf);
 
-	acnt <<= 2*SCHED_CAPACITY_SHIFT;
-	mcnt *= arch_max_freq_ratio;
+	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
+		goto error;
+
+	if (check_mul_overflow(mcnt, arch_max_freq_ratio, &mcnt) || !mcnt)
+		goto error;
 
 	freq_scale = div64_u64(acnt, mcnt);
+	if (!freq_scale)
+		goto error;
 
 	if (freq_scale > SCHED_CAPACITY_SCALE)
 		freq_scale = SCHED_CAPACITY_SCALE;
 
 	this_cpu_write(arch_freq_scale, freq_scale);
+	return;
+
+error:
+	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
+	schedule_work(&disable_freq_invariance_work);
 }
+#else
+static inline void init_freq_invariance(bool secondary)
+{
+}
+#endif /* CONFIG_X86_64 */
+11
drivers/base/arch_topology.c
···
 	per_cpu(cpu_scale, cpu) = capacity;
 }
 
+DEFINE_PER_CPU(unsigned long, thermal_pressure);
+
+void topology_set_thermal_pressure(const struct cpumask *cpus,
+				   unsigned long th_pressure)
+{
+	int cpu;
+
+	for_each_cpu(cpu, cpus)
+		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+}
+
 static ssize_t cpu_capacity_show(struct device *dev,
 				 struct device_attribute *attr,
 				 char *buf)
+4 -1
drivers/pci/pci-driver.c
···
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/sched/isolation.h>
 #include <linux/cpu.h>
 #include <linux/pm_runtime.h>
 #include <linux/suspend.h>
···
 			  const struct pci_device_id *id)
 {
 	int error, node, cpu;
+	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
 	/*
···
 	    pci_physfn_is_probed(dev))
 		cpu = nr_cpu_ids;
 	else
-		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(hk_flags));
 
 	if (cpu < nr_cpu_ids)
 		error = work_on_cpu(cpu, local_pci_probe, &ddi);
+22 -2
include/asm-generic/vmlinux.lds.h
···
 #endif
 
 /*
- * Align to a 32 byte boundary equal to the
- * alignment gcc 4.5 uses for a struct
+ * GCC 4.5 and later have a 32 bytes section alignment for structures.
+ * Except GCC 4.9, that feels the need to align on 64 bytes.
  */
+#if __GNUC__ == 4 && __GNUC_MINOR__ == 9
+#define STRUCT_ALIGNMENT 64
+#else
 #define STRUCT_ALIGNMENT 32
+#endif
 #define STRUCT_ALIGN() . = ALIGN(STRUCT_ALIGNMENT)
+
+/*
+ * The order of the sched class addresses are important, as they are
+ * used to determine the order of the priority of each sched class in
+ * relation to each other.
+ */
+#define SCHED_DATA				\
+	STRUCT_ALIGN();				\
+	__begin_sched_classes = .;		\
+	*(__idle_sched_class)			\
+	*(__fair_sched_class)			\
+	*(__rt_sched_class)			\
+	*(__dl_sched_class)			\
+	*(__stop_sched_class)			\
+	__end_sched_classes = .;
 
 /* The actual configuration determine if the init/exit sections
  * are handled as text/data or they can be discarded (which
···
 	.rodata : AT(ADDR(.rodata) - LOAD_OFFSET) {	\
 		__start_rodata = .;			\
 		*(.rodata) *(.rodata.*)			\
+		SCHED_DATA				\
 		RO_AFTER_INIT_DATA /* Read only after init */ \
 		. = ALIGN(8);				\
 		__start___tracepoints_ptrs = .;		\
+2 -2
include/linux/arch_topology.h
···
 	return per_cpu(thermal_pressure, cpu);
 }
 
-void arch_set_thermal_pressure(struct cpumask *cpus,
-			       unsigned long th_pressure);
+void topology_set_thermal_pressure(const struct cpumask *cpus,
+				   unsigned long th_pressure);
 
 struct cpu_topology {
 	int thread_id;
+2
include/linux/math64.h
···
 }
 #endif /* mul_u64_u32_div */
 
+u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
+
 #define DIV64_U64_ROUND_UP(ll, d)	\
 	({ u64 _tmp = (d); div64_u64((ll) + _tmp - 1, _tmp); })
 
include/linux/psi_types.h  (+4 -3)
···
	unsigned long avg[NR_PSI_STATES - 1][3];

	/* Monitor work control */
-	atomic_t poll_scheduled;
-	struct kthread_worker __rcu *poll_kworker;
-	struct kthread_delayed_work poll_work;
+	struct task_struct __rcu *poll_task;
+	struct timer_list poll_timer;
+	wait_queue_head_t poll_wait;
+	atomic_t poll_wakeup;

	/* Protects data used by the monitor */
	struct mutex trigger_lock;
include/linux/sched.h  (+16 -9)
···
  *
  *   for (;;) {
  *	set_current_state(TASK_UNINTERRUPTIBLE);
- *	if (!need_sleep)
- *		break;
+ *	if (CONDITION)
+ *	   break;
  *
  *	schedule();
  *   }
  *   __set_current_state(TASK_RUNNING);
  *
  * If the caller does not need such serialisation (because, for instance, the
- * condition test and condition change and wakeup are under the same lock) then
+ * CONDITION test and condition change and wakeup are under the same lock) then
  * use __set_current_state().
  *
  * The above is typically ordered against the wakeup, which does:
  *
- *   need_sleep = false;
+ *   CONDITION = 1;
  *   wake_up_state(p, TASK_UNINTERRUPTIBLE);
  *
- * where wake_up_state() executes a full memory barrier before accessing the
- * task state.
+ * where wake_up_state()/try_to_wake_up() executes a full memory barrier before
+ * accessing p->state.
  *
  * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
  * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
···
  * For cfs_rq, they are the aggregated values of all runnable and blocked
  * sched_entities.
  *
- * The load/runnable/util_avg doesn't direcly factor frequency scaling and CPU
+ * The load/runnable/util_avg doesn't directly factor frequency scaling and CPU
  * capacity scaling. The scaling is done through the rq_clock_pelt that is used
  * for computing those signals (see update_rq_clock_pelt())
  *
···
	struct sched_dl_entity		dl;

 #ifdef CONFIG_UCLAMP_TASK
-	/* Clamp values requested for a scheduling entity */
+	/*
+	 * Clamp values requested for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
	struct uclamp_se		uclamp_req[UCLAMP_CNT];
-	/* Effective clamp values used for a scheduling entity */
+	/*
+	 * Effective clamp values used for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
···
 const struct sched_avg *sched_trace_rq_avg_irq(struct rq *rq);

 int sched_trace_rq_cpu(struct rq *rq);
+int sched_trace_rq_nr_running(struct rq *rq);

 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
include/linux/sched/isolation.h  (+1)
···
	HK_FLAG_DOMAIN		= (1 << 5),
	HK_FLAG_WQ		= (1 << 6),
	HK_FLAG_MANAGED_IRQ	= (1 << 7),
+	HK_FLAG_KTHREAD		= (1 << 8),
 };

 #ifdef CONFIG_CPU_ISOLATION
include/linux/sched/loadavg.h  (+1 -1)
···
 #define LOAD_INT(x) ((x) >> FSHIFT)
 #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)

-extern void calc_global_load(unsigned long ticks);
+extern void calc_global_load(void);

 #endif /* _LINUX_SCHED_LOADAVG_H */
include/linux/sched/mm.h  (+3 -5)
···
  * will still exist later on and mmget_not_zero() has to be used before
  * accessing it.
  *
- * This is a preferred way to to pin @mm for a longer/unbounded amount
+ * This is a preferred way to pin @mm for a longer/unbounded amount
  * of time.
  *
  * Use mmdrop() to release the reference acquired by mmgrab().
···
	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
		__mmdrop(mm);
 }
-
-void mmdrop(struct mm_struct *mm);

 /*
  * This has to be called after a get_task_mm()/mmget_not_zero()
···
  * @flags: Flags to restore.
  *
  * Ends the implicit GFP_NOIO scope started by memalloc_noio_save function.
- * Always make sure that that the given flags is the return value from the
+ * Always make sure that the given flags is the return value from the
  * pairing memalloc_noio_save call.
  */
 static inline void memalloc_noio_restore(unsigned int flags)
···
  * @flags: Flags to restore.
  *
  * Ends the implicit GFP_NOFS scope started by memalloc_nofs_save function.
- * Always make sure that that the given flags is the return value from the
+ * Always make sure that the given flags is the return value from the
  * pairing memalloc_nofs_save call.
  */
 static inline void memalloc_nofs_restore(unsigned int flags)
include/linux/sched/sysctl.h  (+4)
···
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;

+extern unsigned int sysctl_sched_dl_period_max;
+extern unsigned int sysctl_sched_dl_period_min;
+
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 #endif

 #ifdef CONFIG_CFS_BANDWIDTH
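The new sysctl_sched_uclamp_util_min_rt_default knob declared here changes the default UCLAMP_MIN of RT tasks, but only for tasks that never set their own clamp. A minimal sketch of that intended behavior, with illustrative names (these are not the kernel's structures):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/* Hypothetical stand-in for the per-task uclamp request. */
struct uclamp_req {
	unsigned int value;
	int user_defined;
};

/* An RT task that trusts the default, and one with an explicit clamp. */
static const struct uclamp_req rt_default_req = { SCHED_CAPACITY_SCALE, 0 };
static const struct uclamp_req rt_user_req    = { 128, 1 };

/*
 * The sysctl only overrides requests where user_defined == false;
 * an explicit sched_setattr() value always wins.
 */
static unsigned int rt_effective_uclamp_min(const struct uclamp_req *req,
					    unsigned int sysctl_rt_default)
{
	return req->user_defined ? req->value : sysctl_rt_default;
}
```

Setting the sysctl to 0 on a battery-powered device therefore stops un-clamped RT tasks from pinning CPUs at maximum frequency, while explicitly boosted tasks are untouched.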
include/linux/sched/task.h  (+1)
···
 extern void init_idle(struct task_struct *idle, int cpu);

 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);

 void __noreturn do_task_dead(void);
include/linux/sched/topology.h  (+17)
···
 #endif	/* !CONFIG_SMP */

 #ifndef arch_scale_cpu_capacity
+/**
+ * arch_scale_cpu_capacity - get the capacity scale factor of a given CPU.
+ * @cpu: the CPU in question.
+ *
+ * Return: the CPU scale factor normalized against SCHED_CAPACITY_SCALE, i.e.
+ *
+ *             max_perf(cpu)
+ *      ----------------------------- * SCHED_CAPACITY_SCALE
+ *      max(max_perf(c) : c \in CPUs)
+ */
 static __always_inline
 unsigned long arch_scale_cpu_capacity(int cpu)
 {
···
 {
	return 0;
 }
+#endif
+
+#ifndef arch_set_thermal_pressure
+static __always_inline
+void arch_set_thermal_pressure(const struct cpumask *cpus,
+			       unsigned long th_pressure)
+{ }
 #endif

 static inline int task_node(const struct task_struct *p)
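The formula in the new arch_scale_cpu_capacity() kerneldoc above is a plain normalization: the most capable CPU in the system scores SCHED_CAPACITY_SCALE and every other CPU scales proportionally. A direct transcription, for illustration only (real values come from firmware, devicetree, or benchmarking, not from this helper):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Sketch of the documented normalization:
 *   max_perf(cpu) / max(max_perf(c) : c in CPUs) * SCHED_CAPACITY_SCALE
 */
static unsigned long scale_cpu_capacity(unsigned long max_perf,
					unsigned long system_max_perf)
{
	return max_perf * SCHED_CAPACITY_SCALE / system_max_perf;
}
```

On a big.LITTLE system where a little core delivers half the performance of a big core, the little core's capacity comes out as 512 against the big core's 1024, which is what the capacity-aware deadline scheduling in this series consumes.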
include/trace/events/sched.h  (+13 -1)
···

 /*
  * Tracepoint called when the task is actually woken; p->state == TASK_RUNNING.
- * It it not always called from the waking context.
+ * It is not always called from the waking context.
  */
 DEFINE_EVENT(sched_wakeup_template, sched_wakeup,
	     TP_PROTO(struct task_struct *p),
···
 DECLARE_TRACE(sched_overutilized_tp,
	TP_PROTO(struct root_domain *rd, bool overutilized),
	TP_ARGS(rd, overutilized));
+
+DECLARE_TRACE(sched_util_est_cfs_tp,
+	TP_PROTO(struct cfs_rq *cfs_rq),
+	TP_ARGS(cfs_rq));
+
+DECLARE_TRACE(sched_util_est_se_tp,
+	TP_PROTO(struct sched_entity *se),
+	TP_ARGS(se));
+
+DECLARE_TRACE(sched_update_nr_running_tp,
+	TP_PROTO(struct rq *rq, int change),
+	TP_ARGS(rq, change));

 #endif /* _TRACE_SCHED_H */
init/Kconfig  (+16 -1)
···
	depends on SMP

 config SCHED_THERMAL_PRESSURE
-	bool "Enable periodic averaging of thermal pressure"
+	bool
+	default y if ARM && ARM_CPU_TOPOLOGY
+	default y if ARM64
	depends on SMP
+	depends on CPU_FREQ_THERMAL
+	help
+	  Select this option to enable thermal pressure accounting in the
+	  scheduler. Thermal pressure is the value conveyed to the scheduler
+	  that reflects the reduction in CPU compute capacity resulting from
+	  thermal throttling. Thermal throttling occurs when the performance of
+	  a CPU is capped due to high operating temperatures.
+
+	  If selected, the scheduler will be able to balance tasks accordingly,
+	  i.e. put less load on throttled CPUs than on non/less throttled ones.
+
+	  This requires the architecture to implement
+	  arch_set_thermal_pressure() and arch_get_thermal_pressure().

 config BSD_PROCESS_ACCT
	bool "BSD Process Accounting"
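The Kconfig help text above describes thermal pressure as a capacity reduction handed to the load balancer. The relationship it implies can be sketched in a few lines; the helper name is made up, and real accounting uses a decayed (PELT-averaged) pressure signal rather than the instantaneous value shown here:

```c
#include <assert.h>

/*
 * Sketch only: a throttled CPU's usable capacity is its original
 * capacity (in SCHED_CAPACITY_SCALE units) minus the thermal pressure
 * reported for it, floored at zero.
 */
static unsigned long effective_capacity(unsigned long orig_capacity,
					unsigned long thermal_pressure)
{
	return thermal_pressure < orig_capacity ?
	       orig_capacity - thermal_pressure : 0;
}
```

With this view, "put less load on throttled CPUs" falls out naturally: the balancer compares effective capacities, and a CPU under heavy throttling simply looks smaller.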
kernel/fork.c  (+1)
···
	write_unlock_irq(&tasklist_lock);

	proc_fork_connector(p);
+	sched_post_fork(p);
	cgroup_post_fork(p, args);
	perf_event_fork(p);
kernel/kthread.c  (+4 -2)
···
 #include <linux/ptrace.h>
 #include <linux/uaccess.h>
 #include <linux/numa.h>
+#include <linux/sched/isolation.h>
 #include <trace/events/sched.h>

···
		 * The kernel thread should not inherit these properties.
		 */
		sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
-		set_cpus_allowed_ptr(task, cpu_all_mask);
+		set_cpus_allowed_ptr(task,
+				     housekeeping_cpumask(HK_FLAG_KTHREAD));
	}
	kfree(create);
	return task;
···
	/* Setup a clean context for our children to inherit. */
	set_task_comm(tsk, "kthreadd");
	ignore_signals(tsk);
-	set_cpus_allowed_ptr(tsk, cpu_all_mask);
+	set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_FLAG_KTHREAD));
	set_mems_allowed(node_states[N_MEMORY]);

	current->flags |= PF_NOFREEZE;
kernel/sched/core.c  (+396 -70)
···
  *
  *  Copyright (C) 1991-2002  Linus Torvalds
  */
+#define CREATE_TRACE_POINTS
+#include <trace/events/sched.h>
+#undef CREATE_TRACE_POINTS
+
 #include "sched.h"

 #include <linux/nospec.h>
···
 #include "pelt.h"
 #include "smp.h"

-#define CREATE_TRACE_POINTS
-#include <trace/events/sched.h>
-
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
  * associated with them) to allow external modules to probe them.
···
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);

 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
···
  * default: 0.95s
  */
 int sysctl_sched_rt_runtime = 950000;
+
+
+/*
+ * Serialization rules:
+ *
+ * Lock order:
+ *
+ *   p->pi_lock
+ *     rq->lock
+ *       hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
+ *
+ *  rq1->lock
+ *    rq2->lock  where: rq1 < rq2
+ *
+ * Regular state:
+ *
+ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
+ * local CPU's rq->lock, it optionally removes the task from the runqueue and
+ * always looks at the local rq data structures to find the most eligible task
+ * to run next.
+ *
+ * Task enqueue is also under rq->lock, possibly taken from another CPU.
+ * Wakeups from another LLC domain might use an IPI to transfer the enqueue to
+ * the local CPU to avoid bouncing the runqueue state around [ see
+ * ttwu_queue_wakelist() ]
+ *
+ * Task wakeup, specifically wakeups that involve migration, are horribly
+ * complicated to avoid having to take two rq->locks.
+ *
+ * Special state:
+ *
+ * System-calls and anything external will use task_rq_lock() which acquires
+ * both p->pi_lock and rq->lock. As a consequence the state they change is
+ * stable while holding either lock:
+ *
+ *  - sched_setaffinity()/
+ *    set_cpus_allowed_ptr():	p->cpus_ptr, p->nr_cpus_allowed
+ *  - set_user_nice():		p->se.load, p->*prio
+ *  - __sched_setscheduler():	p->sched_class, p->policy, p->*prio,
+ *				p->se.load, p->rt_priority,
+ *				p->dl.dl_{runtime, deadline, period, flags, bw, density}
+ *  - sched_setnuma():		p->numa_preferred_nid
+ *  - sched_move_task()/
+ *    cpu_cgroup_fork():	p->sched_task_group
+ *  - uclamp_update_active()	p->uclamp*
+ *
+ * p->state <- TASK_*:
+ *
+ *   is changed locklessly using set_current_state(), __set_current_state() or
+ *   set_special_state(), see their respective comments, or by
+ *   try_to_wake_up(). This latter uses p->pi_lock to serialize against
+ *   concurrent self.
+ *
+ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
+ *
+ *   is set by activate_task() and cleared by deactivate_task(), under
+ *   rq->lock. Non-zero indicates the task is runnable, the special
+ *   ON_RQ_MIGRATING state is used for migration without holding both
+ *   rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
+ *
+ * p->on_cpu <- { 0, 1 }:
+ *
+ *   is set by prepare_task() and cleared by finish_task() such that it will be
+ *   set before p is scheduled-in and cleared after p is scheduled-out, both
+ *   under rq->lock. Non-zero indicates the task is running on its CPU.
+ *
+ * [ The astute reader will observe that it is possible for two tasks on one
+ *   CPU to have ->on_cpu = 1 at the same time. ]
+ *
+ * task_cpu(p): is changed by set_task_cpu(), the rules are:
+ *
+ *  - Don't call set_task_cpu() on a blocked task:
+ *
+ *    We don't care what CPU we're not running on, this simplifies hotplug,
+ *    the CPU assignment of blocked tasks isn't required to be valid.
+ *
+ *  - for try_to_wake_up(), called under p->pi_lock:
+ *
+ *    This allows try_to_wake_up() to only take one rq->lock, see its comment.
+ *
+ *  - for migration called under rq->lock:
+ *    [ see task_on_rq_migrating() in task_rq_lock() ]
+ *
+ *    o move_queued_task()
+ *    o detach_task()
+ *
+ *  - for migration called under double_rq_lock():
+ *
+ *    o __migrate_swap_task()
+ *    o push_rt_task() / pull_rt_task()
+ *    o push_dl_task() / pull_dl_task()
+ *    o dl_task_offline_migration()
+ *
+ */

 /*
  * __task_rq_lock - lock the rq @p resides on.
···
 /* Max allowed maximum utilization */
 unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;

+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects RT tasks whose uclamp_se->user_defined == false.
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ */
+unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
+
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
+
+/*
+ * This static key is used to reduce the uclamp overhead in the fast path. It
+ * primarily disables the call to uclamp_rq_{inc, dec}() in
+ * enqueue/dequeue_task().
+ *
+ * This allows users to continue to enable uclamp in their kernel config with
+ * minimum uclamp overhead in the fast path.
+ *
+ * As soon as userspace modifies any of the uclamp knobs, the static key is
+ * enabled, since we have actual users that make use of uclamp
+ * functionality.
+ *
+ * The knobs that would enable this static key are:
+ *
+ *   * A task modifying its uclamp value with sched_setattr().
+ *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
+ *   * An admin modifying the cgroup cpu.uclamp.{min, max}
+ */
+DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

 /* Integer rounded range for each bucket */
 #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
···
	/* No tasks -- default clamp values */
	return uclamp_idle_value(rq, clamp_id, clamp_value);
+}
+
+static void __uclamp_update_util_min_rt_default(struct task_struct *p)
+{
+	unsigned int default_util_min;
+	struct uclamp_se *uc_se;
+
+	lockdep_assert_held(&p->pi_lock);
+
+	uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+	/* Only sync if user didn't override the default */
+	if (uc_se->user_defined)
+		return;
+
+	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
+	uclamp_se_set(uc_se, default_util_min, false);
+}
+
+static void uclamp_update_util_min_rt_default(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!rt_task(p))
+		return;
+
+	/* Protect updates to p->uclamp_* */
+	rq = task_rq_lock(p, &rf);
+	__uclamp_update_util_min_rt_default(p);
+	task_rq_unlock(rq, p, &rf);
+}
+
+static void uclamp_sync_util_min_rt_default(void)
+{
+	struct task_struct *g, *p;
+
+	/*
+	 * copy_process()			sysctl_uclamp
+	 *					  uclamp_min_rt = X;
+	 *   write_lock(&tasklist_lock)		  read_lock(&tasklist_lock)
+	 *   // link thread			  smp_mb__after_spinlock()
+	 *   write_unlock(&tasklist_lock)	  read_unlock(&tasklist_lock);
+	 *   sched_post_fork()			  for_each_process_thread()
+	 *     __uclamp_sync_rt()		    __uclamp_sync_rt()
+	 *
+	 * Ensures that either sched_post_fork() will observe the new
+	 * uclamp_min_rt or for_each_process_thread() will observe the new
+	 * task.
+	 */
+	read_lock(&tasklist_lock);
+	smp_mb__after_spinlock();
+	read_unlock(&tasklist_lock);
+
+	rcu_read_lock();
+	for_each_process_thread(g, p)
+		uclamp_update_util_min_rt_default(p);
+	rcu_read_unlock();
 }

 static inline struct uclamp_se
···
	lockdep_assert_held(&rq->lock);

+	/*
+	 * If sched_uclamp_used was enabled after task @p was enqueued,
+	 * we could end up with unbalanced call to uclamp_rq_dec_id().
+	 *
+	 * In this case the uc_se->active flag should be false since no uclamp
+	 * accounting was performed at enqueue time and we can just return
+	 * here.
+	 *
+	 * Need to be careful of the following enqueue/dequeue ordering
+	 * problem too
+	 *
+	 *	enqueue(taskA)
+	 *	// sched_uclamp_used gets enabled
+	 *	enqueue(taskB)
+	 *	dequeue(taskA)
+	 *	// Must not decrement bucket->tasks here
+	 *	dequeue(taskB)
+	 *
+	 * where we could end up with stale data in uc_se and
+	 * bucket[uc_se->bucket_id].
+	 *
+	 * The following check here eliminates the possibility of such race.
+	 */
+	if (unlikely(!uc_se->active))
+		return;
+
	bucket = &uc_rq->bucket[uc_se->bucket_id];
+
	SCHED_WARN_ON(!bucket->tasks);
	if (likely(bucket->tasks))
		bucket->tasks--;
+
	uc_se->active = false;

	/*
···
 {
	enum uclamp_id clamp_id;

+	/*
+	 * Avoid any overhead until uclamp is actually used by the userspace.
+	 *
+	 * The condition is constructed such that a NOP is generated when
+	 * sched_uclamp_used is disabled.
+	 */
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return;
+
	if (unlikely(!p->sched_class->uclamp_enabled))
		return;
···
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
	enum uclamp_id clamp_id;
+
+	/*
+	 * Avoid any overhead until uclamp is actually used by the userspace.
+	 *
+	 * The condition is constructed such that a NOP is generated when
+	 * sched_uclamp_used is disabled.
+	 */
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return;

	if (unlikely(!p->sched_class->uclamp_enabled))
		return;
···
				void *buffer, size_t *lenp, loff_t *ppos)
 {
	bool update_root_tg = false;
-	int old_min, old_max;
+	int old_min, old_max, old_min_rt;
	int result;

	mutex_lock(&uclamp_mutex);
	old_min = sysctl_sched_uclamp_util_min;
	old_max = sysctl_sched_uclamp_util_max;
+	old_min_rt = sysctl_sched_uclamp_util_min_rt_default;

	result = proc_dointvec(table, write, buffer, lenp, ppos);
	if (result)
···
		goto done;

	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
-	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE	||
+	    sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
+
		result = -EINVAL;
		goto undo;
	}
···
		update_root_tg = true;
	}

-	if (update_root_tg)
+	if (update_root_tg) {
+		static_branch_enable(&sched_uclamp_used);
		uclamp_update_root_tg();
+	}
+
+	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default) {
+		static_branch_enable(&sched_uclamp_used);
+		uclamp_sync_util_min_rt_default();
+	}

	/*
	 * We update all RUNNABLE tasks only when task groups are in use.
···
 undo:
	sysctl_sched_uclamp_util_min = old_min;
	sysctl_sched_uclamp_util_max = old_max;
+	sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
 done:
	mutex_unlock(&uclamp_mutex);
···
	if (upper_bound > SCHED_CAPACITY_SCALE)
		return -EINVAL;

+	/*
+	 * We have valid uclamp attributes; make sure uclamp is enabled.
+	 *
+	 * We need to do that here, because enabling static branches is a
+	 * blocking operation which obviously cannot be done while holding
+	 * scheduler locks.
+	 */
+	static_branch_enable(&sched_uclamp_used);
+
	return 0;
 }
···
	 */
	for_each_clamp_id(clamp_id) {
		struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
-		unsigned int clamp_value = uclamp_none(clamp_id);

		/* Keep using defined clamps across class changes */
		if (uc_se->user_defined)
			continue;

-		/* By default, RT tasks always get 100% boost */
+		/*
+		 * RT by default have a 100% boost value that could be modified
+		 * at runtime.
+		 */
		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+			__uclamp_update_util_min_rt_default(p);
+		else
+			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);

-		uclamp_se_set(uc_se, clamp_value, false);
	}

	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
···
 {
	enum uclamp_id clamp_id;

+	/*
+	 * We don't need to hold task_rq_lock() when updating p->uclamp_* here
+	 * as the task is still at its early fork stages.
+	 */
	for_each_clamp_id(clamp_id)
		p->uclamp[clamp_id].active = false;

···
	}
 }

+static void uclamp_post_fork(struct task_struct *p)
+{
+	uclamp_update_util_min_rt_default(p);
+}
+
+static void __init init_uclamp_rq(struct rq *rq)
+{
+	enum uclamp_id clamp_id;
+	struct uclamp_rq *uc_rq = rq->uclamp;
+
+	for_each_clamp_id(clamp_id) {
+		uc_rq[clamp_id] = (struct uclamp_rq) {
+			.value = uclamp_none(clamp_id)
+		};
+	}
+
+	rq->uclamp_flags = 0;
+}
+
 static void __init init_uclamp(void)
 {
	struct uclamp_se uc_max = {};
	enum uclamp_id clamp_id;
	int cpu;

-	mutex_init(&uclamp_mutex);
-
-	for_each_possible_cpu(cpu) {
-		memset(&cpu_rq(cpu)->uclamp, 0,
-				sizeof(struct uclamp_rq)*UCLAMP_CNT);
-		cpu_rq(cpu)->uclamp_flags = 0;
-	}
+	for_each_possible_cpu(cpu)
+		init_uclamp_rq(cpu_rq(cpu));

	for_each_clamp_id(clamp_id) {
		uclamp_se_set(&init_task.uclamp_req[clamp_id],
···
 static void __setscheduler_uclamp(struct task_struct *p,
				   const struct sched_attr *attr) { }
 static inline void uclamp_fork(struct task_struct *p) { }
+static inline void uclamp_post_fork(struct task_struct *p) { }
 static inline void init_uclamp(void) { }
 #endif /* CONFIG_UCLAMP_TASK */
···
 void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 {
-	const struct sched_class *class;
-
-	if (p->sched_class == rq->curr->sched_class) {
+	if (p->sched_class == rq->curr->sched_class)
		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
-	} else {
-		for_each_class(class) {
-			if (class == rq->curr->sched_class)
-				break;
-			if (class == p->sched_class) {
-				resched_curr(rq);
-				break;
-			}
-		}
-	}
+	else if (p->sched_class > rq->curr->sched_class)
+		resched_curr(rq);

	/*
	 * A queue event has occurred, and we're going to schedule.  In
···
 {
	lockdep_assert_held(&rq->lock);

-	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
-	dequeue_task(rq, p, DEQUEUE_NOCLOCK);
+	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
	set_task_cpu(p, new_cpu);
	rq_unlock(rq, rf);

···
	rq_lock(rq, rf);
	BUG_ON(task_cpu(p) != new_cpu);
-	enqueue_task(rq, p, 0);
-	p->on_rq = TASK_ON_RQ_QUEUED;
+	activate_task(rq, p, 0);
	check_preempt_curr(rq, p, 0);

	return rq;
···
 }

 /*
- * Called in case the task @p isn't fully descheduled from its runqueue,
- * in this case we must do a remote wakeup. Its a 'light' wakeup though,
- * since all we need to do is flip p->state to TASK_RUNNING, since
- * the task is still ->on_rq.
+ * Consider @p being inside a wait loop:
+ *
+ *   for (;;) {
+ *      set_current_state(TASK_UNINTERRUPTIBLE);
+ *
+ *      if (CONDITION)
+ *         break;
+ *
+ *      schedule();
+ *   }
+ *   __set_current_state(TASK_RUNNING);
+ *
+ * between set_current_state() and schedule(). In this case @p is still
+ * runnable, so all that needs doing is change p->state back to TASK_RUNNING in
+ * an atomic manner.
+ *
+ * By taking task_rq(p)->lock we serialize against schedule(), if @p->on_rq
+ * then schedule() must still happen and p->state can be changed to
+ * TASK_RUNNING. Otherwise we lost the race, schedule() has happened, and we
+ * need to do a full wakeup with enqueue.
+ *
+ * Returns: %true when the wakeup is done,
+ *          %false otherwise.
  */
-static int ttwu_remote(struct task_struct *p, int wake_flags)
+static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
	struct rq_flags rf;
	struct rq *rq;
···
	return false;
 }
+
+#else /* !CONFIG_SMP */
+
+static inline bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
+{
+	return false;
+}
+
 #endif /* CONFIG_SMP */

 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
···
	struct rq *rq = cpu_rq(cpu);
	struct rq_flags rf;

-#if defined(CONFIG_SMP)
	if (ttwu_queue_wakelist(p, cpu, wake_flags))
		return;
-#endif

	rq_lock(rq, &rf);
	update_rq_clock(rq);
···
  * migration. However the means are completely different as there is no lock
  * chain to provide order. Instead we do:
  *
- *   1) smp_store_release(X->on_cpu, 0)
- *   2) smp_cond_load_acquire(!X->on_cpu)
+ *   1) smp_store_release(X->on_cpu, 0)	-- finish_task()
+ *   2) smp_cond_load_acquire(!X->on_cpu)	-- try_to_wake_up()
  *
  * Example:
  *
···
  * @state: the mask of task states that can be woken
  * @wake_flags: wake modifier flags (WF_*)
  *
- * If (@state & @p->state) @p->state = TASK_RUNNING.
+ * Conceptually does:
+ *
+ *   If (@state & @p->state) @p->state = TASK_RUNNING.
  *
  * If the task was not queued/runnable, also place it back on a runqueue.
  *
- * Atomic against schedule() which would dequeue a task, also see
- * set_current_state().
+ * This function is atomic against schedule() which would dequeue the task.
  *
- * This function executes a full memory barrier before accessing the task
- * state; see set_current_state().
+ * It issues a full memory barrier before accessing @p->state, see the comment
+ * with set_current_state().
+ *
+ * Uses p->pi_lock to serialize against concurrent wake-ups.
+ *
+ * Relies on p->pi_lock stabilizing:
+ *  - p->sched_class
+ *  - p->cpus_ptr
+ *  - p->sched_task_group
+ * in order to do migration, see its use of select_task_rq()/set_task_cpu().
+ *
+ * Tries really hard to only take one task_rq(p)->lock for performance.
+ * Takes rq->lock in:
+ *  - ttwu_runnable()    -- old rq, unavoidable, see comment there;
+ *  - ttwu_queue()       -- new rq, for enqueue of the task;
+ *  - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
+ *
+ * As a consequence we race really badly with just about everything. See the
+ * many memory barriers and their comments for details.
  *
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
···
	/*
	 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
	 * == smp_processor_id()'. Together this means we can special
-	 * case the whole 'p->on_rq && ttwu_remote()' case below
+	 * case the whole 'p->on_rq && ttwu_runnable()' case below
	 * without taking any locks.
	 *
	 * In particular:
···
	/*
	 * If we are going to wake up a thread waiting for CONDITION we
	 * need to ensure that CONDITION=1 done by the caller can not be
-	 * reordered with p->state check below. This pairs with mb() in
-	 * set_current_state() the waiting thread does.
+	 * reordered with p->state check below. This pairs with smp_store_mb()
+	 * in set_current_state() that the waiting thread does.
	 */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();
···
	 * A similar smp_rmb() lives in try_invoke_on_locked_down_task().
	 */
	smp_rmb();
-	if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
+	if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
		goto unlock;

	if (p->in_iowait) {
···
	return 0;
 }

+void sched_post_fork(struct task_struct *p)
+{
+	uclamp_post_fork(p);
+}
+
 unsigned long to_ratio(u64 period, u64 runtime)
 {
	if (runtime == RUNTIME_INF)
···
	/*
	 * Claim the task as running, we do this before switching to it
	 * such that any running task will have this set.
+	 *
+	 * See the ttwu() WF_ON_CPU case and its ordering comment.
	 */
-	next->on_cpu = 1;
+	WRITE_ONCE(next->on_cpu, 1);
 #endif
 }
···
 {
 #ifdef CONFIG_SMP
	/*
-	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
-	 * We must ensure this doesn't happen until the switch is completely
+	 * This must be the very last reference to @prev from this CPU. After
+	 * p->on_cpu is cleared, the task can be moved to a different CPU. We
+	 * must ensure this doesn't happen until the switch is completely
	 * finished.
	 *
	 * In particular, the load of prev->state in finish_task_switch() must
···
	return ns;
 }

-DEFINE_PER_CPU(unsigned long, thermal_pressure);
-
-void arch_set_thermal_pressure(struct cpumask *cpus,
-			       unsigned long th_pressure)
-{
-	int cpu;
-
-	for_each_cpu(cpu, cpus)
-		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
-}
-
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
···
	 * higher scheduling class, because otherwise those lose the
	 * opportunity to pull in more work from other CPUs.
	 */
-	if (likely((prev->sched_class == &idle_sched_class ||
-		    prev->sched_class == &fair_sched_class) &&
+	if (likely(prev->sched_class <= &fair_sched_class &&
		   rq->nr_running == rq->cfs.h_nr_running)) {

		p = pick_next_task_fair(rq, prev, rf);
···
	kattr.sched_nice = task_nice(p);

 #ifdef CONFIG_UCLAMP_TASK
+	/*
+	 * This could race with another potential updater, but this is fine
+	 * because it'll correctly read the old or the new value. We don't need
+	 * to guarantee who wins the race as long as it doesn't return garbage.
+	 */
	kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
	kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
 #endif
···
	if (task_running(p_rq, p) || p->state)
		goto out_unlock;

-	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
+	yielded = curr->sched_class->yield_to_task(rq, p);
	if (yielded) {
		schedstat_inc(rq->yld_count);
		/*
···
	unsigned long ptr = 0;
	int i;

+	/* Make sure the linker didn't screw up */
+	BUG_ON(&idle_sched_class + 1 != &fair_sched_class ||
+	       &fair_sched_class + 1 != &rt_sched_class ||
+	       &rt_sched_class   + 1 != &dl_sched_class);
+#ifdef CONFIG_SMP
+	BUG_ON(&dl_sched_class + 1 != &stop_sched_class);
+#endif
+
	wait_bit_init();

 #ifdef CONFIG_FAIR_GROUP_SCHED
···
	if (req.ret)
		return req.ret;

+	static_branch_enable(&sched_uclamp_used);
+
	mutex_lock(&uclamp_mutex);
	rcu_read_lock();
···
	/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };

-#undef CREATE_TRACE_POINTS
+void call_trace_sched_update_nr_running(struct rq *rq, int count)
+{
+	trace_sched_update_nr_running_tp(rq, count);
+}
kernel/sched/cpudeadline.c (+24)
··· 121 121 122 122 if (later_mask && 123 123 cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) { 124 + unsigned long cap, max_cap = 0; 125 + int cpu, max_cpu = -1; 126 + 127 + if (!static_branch_unlikely(&sched_asym_cpucapacity)) 128 + return 1; 129 + 130 + /* Ensure the capacity of the CPUs fits the task. */ 131 + for_each_cpu(cpu, later_mask) { 132 + if (!dl_task_fits_capacity(p, cpu)) { 133 + cpumask_clear_cpu(cpu, later_mask); 134 + 135 + cap = capacity_orig_of(cpu); 136 + 137 + if (cap > max_cap || 138 + (cpu == task_cpu(p) && cap == max_cap)) { 139 + max_cap = cap; 140 + max_cpu = cpu; 141 + } 142 + } 143 + } 144 + 145 + if (cpumask_empty(later_mask)) 146 + cpumask_set_cpu(max_cpu, later_mask); 147 + 124 148 return 1; 125 149 } else { 126 150 int best_cpu = cpudl_maximum(cp);
kernel/sched/cpufreq_schedutil.c (+1 -1)
··· 210 210 unsigned long dl_util, util, irq; 211 211 struct rq *rq = cpu_rq(cpu); 212 212 213 - if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) && 213 + if (!uclamp_is_used() && 214 214 type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) { 215 215 return max; 216 216 }
kernel/sched/cputime.c (+1 -45)
··· 520 520 } 521 521 522 522 /* 523 - * Perform (stime * rtime) / total, but avoid multiplication overflow by 524 - * losing precision when the numbers are big. 525 - */ 526 - static u64 scale_stime(u64 stime, u64 rtime, u64 total) 527 - { 528 - u64 scaled; 529 - 530 - for (;;) { 531 - /* Make sure "rtime" is the bigger of stime/rtime */ 532 - if (stime > rtime) 533 - swap(rtime, stime); 534 - 535 - /* Make sure 'total' fits in 32 bits */ 536 - if (total >> 32) 537 - goto drop_precision; 538 - 539 - /* Does rtime (and thus stime) fit in 32 bits? */ 540 - if (!(rtime >> 32)) 541 - break; 542 - 543 - /* Can we just balance rtime/stime rather than dropping bits? */ 544 - if (stime >> 31) 545 - goto drop_precision; 546 - 547 - /* We can grow stime and shrink rtime and try to make them both fit */ 548 - stime <<= 1; 549 - rtime >>= 1; 550 - continue; 551 - 552 - drop_precision: 553 - /* We drop from rtime, it has more bits than stime */ 554 - rtime >>= 1; 555 - total >>= 1; 556 - } 557 - 558 - /* 559 - * Make sure gcc understands that this is a 32x32->64 multiply, 560 - * followed by a 64/32->64 divide. 561 - */ 562 - scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total); 563 - return scaled; 564 - } 565 - 566 - /* 567 523 * Adjust tick based cputime random precision against scheduler runtime 568 524 * accounting. 569 525 * ··· 578 622 goto update; 579 623 } 580 624 581 - stime = scale_stime(stime, rtime, stime + utime); 625 + stime = mul_u64_u64_div_u64(stime, rtime, stime + utime); 582 626 583 627 update: 584 628 /*
kernel/sched/deadline.c (+95 -23)
··· 54 54 static inline int dl_bw_cpus(int i) 55 55 { 56 56 struct root_domain *rd = cpu_rq(i)->rd; 57 - int cpus = 0; 57 + int cpus; 58 58 59 59 RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(), 60 60 "sched RCU must be held"); 61 + 62 + if (cpumask_subset(rd->span, cpu_active_mask)) 63 + return cpumask_weight(rd->span); 64 + 65 + cpus = 0; 66 + 61 67 for_each_cpu_and(i, rd->span, cpu_active_mask) 62 68 cpus++; 63 69 64 70 return cpus; 71 + } 72 + 73 + static inline unsigned long __dl_bw_capacity(int i) 74 + { 75 + struct root_domain *rd = cpu_rq(i)->rd; 76 + unsigned long cap = 0; 77 + 78 + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(), 79 + "sched RCU must be held"); 80 + 81 + for_each_cpu_and(i, rd->span, cpu_active_mask) 82 + cap += capacity_orig_of(i); 83 + 84 + return cap; 85 + } 86 + 87 + /* 88 + * XXX Fix: If 'rq->rd == def_root_domain' perform AC against capacity 89 + * of the CPU the task is running on rather rd's \Sum CPU capacity. 90 + */ 91 + static inline unsigned long dl_bw_capacity(int i) 92 + { 93 + if (!static_branch_unlikely(&sched_asym_cpucapacity) && 94 + capacity_orig_of(i) == SCHED_CAPACITY_SCALE) { 95 + return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT; 96 + } else { 97 + return __dl_bw_capacity(i); 98 + } 65 99 } 66 100 #else 67 101 static inline struct dl_bw *dl_bw_of(int i) ··· 106 72 static inline int dl_bw_cpus(int i) 107 73 { 108 74 return 1; 75 + } 76 + 77 + static inline unsigned long dl_bw_capacity(int i) 78 + { 79 + return SCHED_CAPACITY_SCALE; 109 80 } 110 81 #endif 111 82 ··· 1137 1098 * cannot use the runtime, and so it replenishes the task. This rule 1138 1099 * works fine for implicit deadline tasks (deadline == period), and the 1139 1100 * CBS was designed for implicit deadline tasks. However, a task with 1140 - * constrained deadline (deadine < period) might be awakened after the 1101 + * constrained deadline (deadline < period) might be awakened after the 1141 1102 * deadline, but before the next period. 
In this case, replenishing the 1142 1103 * task would allow it to run for runtime / deadline. As in this case 1143 1104 * deadline < period, CBS enables a task to run for more than the ··· 1643 1604 select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) 1644 1605 { 1645 1606 struct task_struct *curr; 1607 + bool select_rq; 1646 1608 struct rq *rq; 1647 1609 1648 1610 if (sd_flag != SD_BALANCE_WAKE) ··· 1663 1623 * other hand, if it has a shorter deadline, we 1664 1624 * try to make it stay here, it might be important. 1665 1625 */ 1666 - if (unlikely(dl_task(curr)) && 1667 - (curr->nr_cpus_allowed < 2 || 1668 - !dl_entity_preempt(&p->dl, &curr->dl)) && 1669 - (p->nr_cpus_allowed > 1)) { 1626 + select_rq = unlikely(dl_task(curr)) && 1627 + (curr->nr_cpus_allowed < 2 || 1628 + !dl_entity_preempt(&p->dl, &curr->dl)) && 1629 + p->nr_cpus_allowed > 1; 1630 + 1631 + /* 1632 + * Take the capacity of the CPU into account to 1633 + * ensure it fits the requirement of the task. 1634 + */ 1635 + if (static_branch_unlikely(&sched_asym_cpucapacity)) 1636 + select_rq |= !dl_task_fits_capacity(p, cpu); 1637 + 1638 + if (select_rq) { 1670 1639 int target = find_later_rq(p); 1671 1640 1672 1641 if (target != -1 && ··· 2479 2430 } 2480 2431 } 2481 2432 2482 - const struct sched_class dl_sched_class = { 2483 - .next = &rt_sched_class, 2433 + const struct sched_class dl_sched_class 2434 + __attribute__((section("__dl_sched_class"))) = { 2484 2435 .enqueue_task = enqueue_task_dl, 2485 2436 .dequeue_task = dequeue_task_dl, 2486 2437 .yield_task = yield_task_dl, ··· 2600 2551 int sched_dl_overflow(struct task_struct *p, int policy, 2601 2552 const struct sched_attr *attr) 2602 2553 { 2603 - struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); 2604 2554 u64 period = attr->sched_period ?: attr->sched_deadline; 2605 2555 u64 runtime = attr->sched_runtime; 2606 2556 u64 new_bw = dl_policy(policy) ? 
to_ratio(period, runtime) : 0; 2607 - int cpus, err = -1; 2557 + int cpus, err = -1, cpu = task_cpu(p); 2558 + struct dl_bw *dl_b = dl_bw_of(cpu); 2559 + unsigned long cap; 2608 2560 2609 2561 if (attr->sched_flags & SCHED_FLAG_SUGOV) 2610 2562 return 0; ··· 2620 2570 * allocated bandwidth of the container. 2621 2571 */ 2622 2572 raw_spin_lock(&dl_b->lock); 2623 - cpus = dl_bw_cpus(task_cpu(p)); 2573 + cpus = dl_bw_cpus(cpu); 2574 + cap = dl_bw_capacity(cpu); 2575 + 2624 2576 if (dl_policy(policy) && !task_has_dl_policy(p) && 2625 - !__dl_overflow(dl_b, cpus, 0, new_bw)) { 2577 + !__dl_overflow(dl_b, cap, 0, new_bw)) { 2626 2578 if (hrtimer_active(&p->dl.inactive_timer)) 2627 2579 __dl_sub(dl_b, p->dl.dl_bw, cpus); 2628 2580 __dl_add(dl_b, new_bw, cpus); 2629 2581 err = 0; 2630 2582 } else if (dl_policy(policy) && task_has_dl_policy(p) && 2631 - !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { 2583 + !__dl_overflow(dl_b, cap, p->dl.dl_bw, new_bw)) { 2632 2584 /* 2633 2585 * XXX this is slightly incorrect: when the task 2634 2586 * utilization decreases, we should delay the total ··· 2687 2635 } 2688 2636 2689 2637 /* 2638 + * Default limits for DL period; on the top end we guard against small util 2639 + * tasks still getting rediculous long effective runtimes, on the bottom end we 2640 + * guard against timer DoS. 2641 + */ 2642 + unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */ 2643 + unsigned int sysctl_sched_dl_period_min = 100; /* 100 us */ 2644 + 2645 + /* 2690 2646 * This function validates the new parameters of a -deadline task. 
2691 2647 * We ask for the deadline not being zero, and greater or equal 2692 2648 * than the runtime, as well as the period of being zero or ··· 2706 2646 */ 2707 2647 bool __checkparam_dl(const struct sched_attr *attr) 2708 2648 { 2649 + u64 period, max, min; 2650 + 2709 2651 /* special dl tasks don't actually use any parameter */ 2710 2652 if (attr->sched_flags & SCHED_FLAG_SUGOV) 2711 2653 return true; ··· 2731 2669 attr->sched_period & (1ULL << 63)) 2732 2670 return false; 2733 2671 2672 + period = attr->sched_period; 2673 + if (!period) 2674 + period = attr->sched_deadline; 2675 + 2734 2676 /* runtime <= deadline <= period (if period != 0) */ 2735 - if ((attr->sched_period != 0 && 2736 - attr->sched_period < attr->sched_deadline) || 2677 + if (period < attr->sched_deadline || 2737 2678 attr->sched_deadline < attr->sched_runtime) 2679 + return false; 2680 + 2681 + max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC; 2682 + min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC; 2683 + 2684 + if (period < min || period > max) 2738 2685 return false; 2739 2686 2740 2687 return true; ··· 2786 2715 #ifdef CONFIG_SMP 2787 2716 int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed) 2788 2717 { 2718 + unsigned long flags, cap; 2789 2719 unsigned int dest_cpu; 2790 2720 struct dl_bw *dl_b; 2791 2721 bool overflow; 2792 - int cpus, ret; 2793 - unsigned long flags; 2722 + int ret; 2794 2723 2795 2724 dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed); 2796 2725 2797 2726 rcu_read_lock_sched(); 2798 2727 dl_b = dl_bw_of(dest_cpu); 2799 2728 raw_spin_lock_irqsave(&dl_b->lock, flags); 2800 - cpus = dl_bw_cpus(dest_cpu); 2801 - overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw); 2729 + cap = dl_bw_capacity(dest_cpu); 2730 + overflow = __dl_overflow(dl_b, cap, 0, p->dl.dl_bw); 2802 2731 if (overflow) { 2803 2732 ret = -EBUSY; 2804 2733 } else { ··· 2808 2737 * We will free resources in the source root_domain 
2809 2738 * later on (see set_cpus_allowed_dl()). 2810 2739 */ 2740 + int cpus = dl_bw_cpus(dest_cpu); 2741 + 2811 2742 __dl_add(dl_b, p->dl.dl_bw, cpus); 2812 2743 ret = 0; 2813 2744 } ··· 2842 2769 2843 2770 bool dl_cpu_busy(unsigned int cpu) 2844 2771 { 2845 - unsigned long flags; 2772 + unsigned long flags, cap; 2846 2773 struct dl_bw *dl_b; 2847 2774 bool overflow; 2848 - int cpus; 2849 2775 2850 2776 rcu_read_lock_sched(); 2851 2777 dl_b = dl_bw_of(cpu); 2852 2778 raw_spin_lock_irqsave(&dl_b->lock, flags); 2853 - cpus = dl_bw_cpus(cpu); 2854 - overflow = __dl_overflow(dl_b, cpus, 0, 0); 2779 + cap = dl_bw_capacity(cpu); 2780 + overflow = __dl_overflow(dl_b, cap, 0, 0); 2855 2781 raw_spin_unlock_irqrestore(&dl_b->lock, flags); 2856 2782 rcu_read_unlock_sched(); 2857 2783
kernel/sched/fair.c (+59 -34)
··· 22 22 */ 23 23 #include "sched.h" 24 24 25 - #include <trace/events/sched.h> 26 - 27 25 /* 28 26 * Targeted preemption latency for CPU-bound tasks: 29 27 * ··· 3092 3094 3093 3095 #ifdef CONFIG_SMP 3094 3096 do { 3095 - u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib; 3097 + u32 divider = get_pelt_divider(&se->avg); 3096 3098 3097 3099 se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider); 3098 3100 } while (0); ··· 3438 3440 update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) 3439 3441 { 3440 3442 long delta = gcfs_rq->avg.util_avg - se->avg.util_avg; 3441 - /* 3442 - * cfs_rq->avg.period_contrib can be used for both cfs_rq and se. 3443 - * See ___update_load_avg() for details. 3444 - */ 3445 - u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib; 3443 + u32 divider; 3446 3444 3447 3445 /* Nothing to update */ 3448 3446 if (!delta) 3449 3447 return; 3448 + 3449 + /* 3450 + * cfs_rq->avg.period_contrib can be used for both cfs_rq and se. 3451 + * See ___update_load_avg() for details. 3452 + */ 3453 + divider = get_pelt_divider(&cfs_rq->avg); 3450 3454 3451 3455 /* Set new sched_entity's utilization */ 3452 3456 se->avg.util_avg = gcfs_rq->avg.util_avg; ··· 3463 3463 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) 3464 3464 { 3465 3465 long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg; 3466 - /* 3467 - * cfs_rq->avg.period_contrib can be used for both cfs_rq and se. 3468 - * See ___update_load_avg() for details. 3469 - */ 3470 - u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib; 3466 + u32 divider; 3471 3467 3472 3468 /* Nothing to update */ 3473 3469 if (!delta) 3474 3470 return; 3471 + 3472 + /* 3473 + * cfs_rq->avg.period_contrib can be used for both cfs_rq and se. 3474 + * See ___update_load_avg() for details. 
3475 + */ 3476 + divider = get_pelt_divider(&cfs_rq->avg); 3475 3477 3476 3478 /* Set new sched_entity's runnable */ 3477 3479 se->avg.runnable_avg = gcfs_rq->avg.runnable_avg; ··· 3502 3500 * cfs_rq->avg.period_contrib can be used for both cfs_rq and se. 3503 3501 * See ___update_load_avg() for details. 3504 3502 */ 3505 - divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib; 3503 + divider = get_pelt_divider(&cfs_rq->avg); 3506 3504 3507 3505 if (runnable_sum >= 0) { 3508 3506 /* ··· 3648 3646 3649 3647 if (cfs_rq->removed.nr) { 3650 3648 unsigned long r; 3651 - u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib; 3649 + u32 divider = get_pelt_divider(&cfs_rq->avg); 3652 3650 3653 3651 raw_spin_lock(&cfs_rq->removed.lock); 3654 3652 swap(cfs_rq->removed.util_avg, removed_util); ··· 3703 3701 * cfs_rq->avg.period_contrib can be used for both cfs_rq and se. 3704 3702 * See ___update_load_avg() for details. 3705 3703 */ 3706 - u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib; 3704 + u32 divider = get_pelt_divider(&cfs_rq->avg); 3707 3705 3708 3706 /* 3709 3707 * When we attach the @se to the @cfs_rq, we must align the decay ··· 3924 3922 enqueued = cfs_rq->avg.util_est.enqueued; 3925 3923 enqueued += _task_util_est(p); 3926 3924 WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued); 3925 + 3926 + trace_sched_util_est_cfs_tp(cfs_rq); 3927 3927 } 3928 3928 3929 3929 /* ··· 3955 3951 ue.enqueued = cfs_rq->avg.util_est.enqueued; 3956 3952 ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p)); 3957 3953 WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued); 3954 + 3955 + trace_sched_util_est_cfs_tp(cfs_rq); 3958 3956 3959 3957 /* 3960 3958 * Skip update of task's estimated utilization when the task has not ··· 4023 4017 ue.ewma >>= UTIL_EST_WEIGHT_SHIFT; 4024 4018 done: 4025 4019 WRITE_ONCE(p->se.avg.util_est, ue); 4020 + 4021 + trace_sched_util_est_se_tp(&p->se); 4026 4022 } 4027 4023 4028 4024 static inline int 
task_fits_capacity(struct task_struct *p, long capacity) ··· 5626 5618 5627 5619 } 5628 5620 5629 - dequeue_throttle: 5630 - if (!se) 5631 - sub_nr_running(rq, 1); 5621 + /* At this point se is NULL and we are at root level*/ 5622 + sub_nr_running(rq, 1); 5632 5623 5633 5624 /* balance early to pull high priority tasks */ 5634 5625 if (unlikely(!was_sched_idle && sched_idle_rq(rq))) 5635 5626 rq->next_balance = jiffies; 5636 5627 5628 + dequeue_throttle: 5637 5629 util_est_dequeue(&rq->cfs, p, task_sleep); 5638 5630 hrtick_update(rq); 5639 5631 } ··· 7169 7161 set_skip_buddy(se); 7170 7162 } 7171 7163 7172 - static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt) 7164 + static bool yield_to_task_fair(struct rq *rq, struct task_struct *p) 7173 7165 { 7174 7166 struct sched_entity *se = &p->se; 7175 7167 ··· 8057 8049 }; 8058 8050 } 8059 8051 8060 - static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu) 8052 + static unsigned long scale_rt_capacity(int cpu) 8061 8053 { 8062 8054 struct rq *rq = cpu_rq(cpu); 8063 8055 unsigned long max = arch_scale_cpu_capacity(cpu); ··· 8089 8081 8090 8082 static void update_cpu_capacity(struct sched_domain *sd, int cpu) 8091 8083 { 8092 - unsigned long capacity = scale_rt_capacity(sd, cpu); 8084 + unsigned long capacity = scale_rt_capacity(cpu); 8093 8085 struct sched_group *sdg = sd->groups; 8094 8086 8095 8087 cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu); ··· 8711 8703 8712 8704 case group_has_spare: 8713 8705 /* Select group with most idle CPUs */ 8714 - if (idlest_sgs->idle_cpus >= sgs->idle_cpus) 8706 + if (idlest_sgs->idle_cpus > sgs->idle_cpus) 8715 8707 return false; 8708 + 8709 + /* Select group with lowest group_util */ 8710 + if (idlest_sgs->idle_cpus == sgs->idle_cpus && 8711 + idlest_sgs->group_util <= sgs->group_util) 8712 + return false; 8713 + 8716 8714 break; 8717 8715 } 8718 8716 ··· 10041 10027 { 10042 10028 int ilb_cpu; 10043 10029 10044 - 
nohz.next_balance++; 10030 + /* 10031 + * Increase nohz.next_balance only when if full ilb is triggered but 10032 + * not if we only update stats. 10033 + */ 10034 + if (flags & NOHZ_BALANCE_KICK) 10035 + nohz.next_balance = jiffies+1; 10045 10036 10046 10037 ilb_cpu = find_new_ilb(); 10047 10038 ··· 10367 10348 } 10368 10349 } 10369 10350 10351 + /* 10352 + * next_balance will be updated only when there is a need. 10353 + * When the CPU is attached to null domain for ex, it will not be 10354 + * updated. 10355 + */ 10356 + if (likely(update_next_balance)) 10357 + nohz.next_balance = next_balance; 10358 + 10370 10359 /* Newly idle CPU doesn't need an update */ 10371 10360 if (idle != CPU_NEWLY_IDLE) { 10372 10361 update_blocked_averages(this_cpu); ··· 10394 10367 /* There is still blocked load, enable periodic update */ 10395 10368 if (has_blocked_load) 10396 10369 WRITE_ONCE(nohz.has_blocked, 1); 10397 - 10398 - /* 10399 - * next_balance will be updated only when there is a need. 10400 - * When the CPU is attached to null domain for ex, it will not be 10401 - * updated. 10402 - */ 10403 - if (likely(update_next_balance)) 10404 - nohz.next_balance = next_balance; 10405 10370 10406 10371 return ret; 10407 10372 } ··· 11137 11118 /* 11138 11119 * All the scheduling class methods: 11139 11120 */ 11140 - const struct sched_class fair_sched_class = { 11141 - .next = &idle_sched_class, 11121 + const struct sched_class fair_sched_class 11122 + __attribute__((section("__fair_sched_class"))) = { 11142 11123 .enqueue_task = enqueue_task_fair, 11143 11124 .dequeue_task = dequeue_task_fair, 11144 11125 .yield_task = yield_task_fair, ··· 11311 11292 #endif 11312 11293 } 11313 11294 EXPORT_SYMBOL_GPL(sched_trace_rd_span); 11295 + 11296 + int sched_trace_rq_nr_running(struct rq *rq) 11297 + { 11298 + return rq ? rq->nr_running : -1; 11299 + } 11300 + EXPORT_SYMBOL_GPL(sched_trace_rq_nr_running);
kernel/sched/idle.c (+2 -9)
··· 453 453 BUG(); 454 454 } 455 455 456 - static unsigned int get_rr_interval_idle(struct rq *rq, struct task_struct *task) 457 - { 458 - return 0; 459 - } 460 - 461 456 static void update_curr_idle(struct rq *rq) 462 457 { 463 458 } ··· 460 465 /* 461 466 * Simple, special scheduling class for the per-CPU idle tasks: 462 467 */ 463 - const struct sched_class idle_sched_class = { 464 - /* .next is NULL */ 468 + const struct sched_class idle_sched_class 469 + __attribute__((section("__idle_sched_class"))) = { 465 470 /* no enqueue/yield_task for idle tasks */ 466 471 467 472 /* dequeue is not valid, we print a debug message there: */ ··· 480 485 #endif 481 486 482 487 .task_tick = task_tick_idle, 483 - 484 - .get_rr_interval = get_rr_interval_idle, 485 488 486 489 .prio_changed = prio_changed_idle, 487 490 .switched_to = switched_to_idle,
kernel/sched/isolation.c (+2 -1)
··· 140 140 { 141 141 unsigned int flags; 142 142 143 - flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | HK_FLAG_MISC; 143 + flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | 144 + HK_FLAG_MISC | HK_FLAG_KTHREAD; 144 145 145 146 return housekeeping_setup(str, flags); 146 147 }
kernel/sched/loadavg.c (+1 -1)
··· 347 347 * 348 348 * Called from the global timer code. 349 349 */ 350 - void calc_global_load(unsigned long ticks) 350 + void calc_global_load(void) 351 351 { 352 352 unsigned long sample_window; 353 353 long active, delta;
kernel/sched/pelt.c (+1 -5)
··· 28 28 #include "sched.h" 29 29 #include "pelt.h" 30 30 31 - #include <trace/events/sched.h> 32 - 33 31 /* 34 32 * Approximate: 35 33 * val * y^n, where y^32 ~= 0.5 (~1 scheduling period) ··· 80 82 81 83 return c1 + c2 + c3; 82 84 } 83 - 84 - #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) 85 85 86 86 /* 87 87 * Accumulate the three separate parts of the sum; d1 the remainder ··· 260 264 static __always_inline void 261 265 ___update_load_avg(struct sched_avg *sa, unsigned long load) 262 266 { 263 - u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib; 267 + u32 divider = get_pelt_divider(sa); 264 268 265 269 /* 266 270 * Step 2: update *_avg.
kernel/sched/pelt.h (+5)
··· 37 37 } 38 38 #endif 39 39 40 + static inline u32 get_pelt_divider(struct sched_avg *avg) 41 + { 42 + return LOAD_AVG_MAX - 1024 + avg->period_contrib; 43 + } 44 + 40 45 /* 41 46 * When a task is dequeued, its estimated utilization should not be update if 42 47 * its util_avg has not been updated at least once.
kernel/sched/psi.c (+64 -49)
··· 190 190 INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work); 191 191 mutex_init(&group->avgs_lock); 192 192 /* Init trigger-related members */ 193 - atomic_set(&group->poll_scheduled, 0); 194 193 mutex_init(&group->trigger_lock); 195 194 INIT_LIST_HEAD(&group->triggers); 196 195 memset(group->nr_triggers, 0, sizeof(group->nr_triggers)); ··· 198 199 memset(group->polling_total, 0, sizeof(group->polling_total)); 199 200 group->polling_next_update = ULLONG_MAX; 200 201 group->polling_until = 0; 201 - rcu_assign_pointer(group->poll_kworker, NULL); 202 + rcu_assign_pointer(group->poll_task, NULL); 202 203 } 203 204 204 205 void __init psi_init(void) ··· 546 547 return now + group->poll_min_period; 547 548 } 548 549 549 - /* 550 - * Schedule polling if it's not already scheduled. It's safe to call even from 551 - * hotpath because even though kthread_queue_delayed_work takes worker->lock 552 - * spinlock that spinlock is never contended due to poll_scheduled atomic 553 - * preventing such competition. 554 - */ 550 + /* Schedule polling if it's not already scheduled. */ 555 551 static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay) 556 552 { 557 - struct kthread_worker *kworker; 553 + struct task_struct *task; 558 554 559 - /* Do not reschedule if already scheduled */ 560 - if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) != 0) 555 + /* 556 + * Do not reschedule if already scheduled. 557 + * Possible race with a timer scheduled after this check but before 558 + * mod_timer below can be tolerated because group->polling_next_update 559 + * will keep updates on schedule. 
560 + */ 561 + if (timer_pending(&group->poll_timer)) 561 562 return; 562 563 563 564 rcu_read_lock(); 564 565 565 - kworker = rcu_dereference(group->poll_kworker); 566 + task = rcu_dereference(group->poll_task); 566 567 /* 567 568 * kworker might be NULL in case psi_trigger_destroy races with 568 569 * psi_task_change (hotpath) which can't use locks 569 570 */ 570 - if (likely(kworker)) 571 - kthread_queue_delayed_work(kworker, &group->poll_work, delay); 572 - else 573 - atomic_set(&group->poll_scheduled, 0); 571 + if (likely(task)) 572 + mod_timer(&group->poll_timer, jiffies + delay); 574 573 575 574 rcu_read_unlock(); 576 575 } 577 576 578 - static void psi_poll_work(struct kthread_work *work) 577 + static void psi_poll_work(struct psi_group *group) 579 578 { 580 - struct kthread_delayed_work *dwork; 581 - struct psi_group *group; 582 579 u32 changed_states; 583 580 u64 now; 584 - 585 - dwork = container_of(work, struct kthread_delayed_work, work); 586 - group = container_of(dwork, struct psi_group, poll_work); 587 - 588 - atomic_set(&group->poll_scheduled, 0); 589 581 590 582 mutex_lock(&group->trigger_lock); 591 583 ··· 611 621 612 622 out: 613 623 mutex_unlock(&group->trigger_lock); 624 + } 625 + 626 + static int psi_poll_worker(void *data) 627 + { 628 + struct psi_group *group = (struct psi_group *)data; 629 + struct sched_param param = { 630 + .sched_priority = 1, 631 + }; 632 + 633 + sched_setscheduler_nocheck(current, SCHED_FIFO, &param); 634 + 635 + while (true) { 636 + wait_event_interruptible(group->poll_wait, 637 + atomic_cmpxchg(&group->poll_wakeup, 1, 0) || 638 + kthread_should_stop()); 639 + if (kthread_should_stop()) 640 + break; 641 + 642 + psi_poll_work(group); 643 + } 644 + return 0; 645 + } 646 + 647 + static void poll_timer_fn(struct timer_list *t) 648 + { 649 + struct psi_group *group = from_timer(group, t, poll_timer); 650 + 651 + atomic_set(&group->poll_wakeup, 1); 652 + wake_up_interruptible(&group->poll_wait); 614 653 } 615 654 616 655 
static void record_times(struct psi_group_cpu *groupc, int cpu, ··· 1118 1099 1119 1100 mutex_lock(&group->trigger_lock); 1120 1101 1121 - if (!rcu_access_pointer(group->poll_kworker)) { 1122 - struct sched_param param = { 1123 - .sched_priority = 1, 1124 - }; 1125 - struct kthread_worker *kworker; 1102 + if (!rcu_access_pointer(group->poll_task)) { 1103 + struct task_struct *task; 1126 1104 1127 - kworker = kthread_create_worker(0, "psimon"); 1128 - if (IS_ERR(kworker)) { 1105 + task = kthread_create(psi_poll_worker, group, "psimon"); 1106 + if (IS_ERR(task)) { 1129 1107 kfree(t); 1130 1108 mutex_unlock(&group->trigger_lock); 1131 - return ERR_CAST(kworker); 1109 + return ERR_CAST(task); 1132 1110 } 1133 - sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param); 1134 - kthread_init_delayed_work(&group->poll_work, 1135 - psi_poll_work); 1136 - rcu_assign_pointer(group->poll_kworker, kworker); 1111 + atomic_set(&group->poll_wakeup, 0); 1112 + init_waitqueue_head(&group->poll_wait); 1113 + wake_up_process(task); 1114 + timer_setup(&group->poll_timer, poll_timer_fn, 0); 1115 + rcu_assign_pointer(group->poll_task, task); 1137 1116 } 1138 1117 1139 1118 list_add(&t->node, &group->triggers); ··· 1149 1132 { 1150 1133 struct psi_trigger *t = container_of(ref, struct psi_trigger, refcount); 1151 1134 struct psi_group *group = t->group; 1152 - struct kthread_worker *kworker_to_destroy = NULL; 1135 + struct task_struct *task_to_destroy = NULL; 1153 1136 1154 1137 if (static_branch_likely(&psi_disabled)) 1155 1138 return; ··· 1175 1158 period = min(period, div_u64(tmp->win.size, 1176 1159 UPDATES_PER_WINDOW)); 1177 1160 group->poll_min_period = period; 1178 - /* Destroy poll_kworker when the last trigger is destroyed */ 1161 + /* Destroy poll_task when the last trigger is destroyed */ 1179 1162 if (group->poll_states == 0) { 1180 1163 group->polling_until = 0; 1181 - kworker_to_destroy = rcu_dereference_protected( 1182 - group->poll_kworker, 1164 + task_to_destroy = 
rcu_dereference_protected( 1165 + group->poll_task, 1183 1166 lockdep_is_held(&group->trigger_lock)); 1184 - rcu_assign_pointer(group->poll_kworker, NULL); 1167 + rcu_assign_pointer(group->poll_task, NULL); 1185 1168 } 1186 1169 } 1187 1170 ··· 1189 1172 1190 1173 /* 1191 1174 * Wait for both *trigger_ptr from psi_trigger_replace and 1192 - * poll_kworker RCUs to complete their read-side critical sections 1193 - * before destroying the trigger and optionally the poll_kworker 1175 + * poll_task RCUs to complete their read-side critical sections 1176 + * before destroying the trigger and optionally the poll_task 1194 1177 */ 1195 1178 synchronize_rcu(); 1196 1179 /* 1197 1180 * Destroy the kworker after releasing trigger_lock to prevent a 1198 1181 * deadlock while waiting for psi_poll_work to acquire trigger_lock 1199 1182 */ 1200 - if (kworker_to_destroy) { 1183 + if (task_to_destroy) { 1201 1184 /* 1202 1185 * After the RCU grace period has expired, the worker 1203 - * can no longer be found through group->poll_kworker. 1186 + * can no longer be found through group->poll_task. 1204 1187 * But it might have been already scheduled before 1205 1188 * that - deschedule it cleanly before destroying it. 1206 1189 */ 1207 - kthread_cancel_delayed_work_sync(&group->poll_work); 1208 - atomic_set(&group->poll_scheduled, 0); 1209 - 1210 - kthread_destroy_worker(kworker_to_destroy); 1190 + del_timer_sync(&group->poll_timer); 1191 + kthread_stop(task_to_destroy); 1211 1192 } 1212 1193 kfree(t); 1213 1194 }
kernel/sched/rt.c (+2 -2)
··· 2429 2429 return 0; 2430 2430 } 2431 2431 2432 - const struct sched_class rt_sched_class = { 2433 - .next = &fair_sched_class, 2432 + const struct sched_class rt_sched_class 2433 + __attribute__((section("__rt_sched_class"))) = { 2434 2434 .enqueue_task = enqueue_task_rt, 2435 2435 .dequeue_task = dequeue_task_rt, 2436 2436 .yield_task = yield_task_rt,
kernel/sched/sched.h (+105 -21)
···
 #include <linux/tsacct_kern.h>
 
 #include <asm/tlb.h>
+#include <asm-generic/vmlinux.lds.h>
 
 #ifdef CONFIG_PARAVIRT
 # include <asm/paravirt.h>
···
 #include "cpupri.h"
 #include "cpudeadline.h"
+
+#include <trace/events/sched.h>
 
 #ifdef CONFIG_SCHED_DEBUG
 # define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
···
 extern void calc_global_load_tick(struct rq *this_rq);
 extern long calc_load_fold_active(struct rq *this_rq, long adjust);
 
+extern void call_trace_sched_update_nr_running(struct rq *rq, int count);
 /*
  * Helpers for converting nanosecond timing to jiffy resolution
  */
···
 	__dl_update(dl_b, -((s32)tsk_bw / cpus));
 }
 
-static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
+				 u64 old_bw, u64 new_bw)
 {
 	return dl_b->bw != -1 &&
-	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
+}
+
+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the
+ * CPU original capacity and the runtime/deadline ratio of the task.
+ *
+ * The function will return true if the CPU original capacity of the
+ * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
+ * task and false otherwise.
+ */
+static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
+{
+	unsigned long cap = arch_scale_cpu_capacity(cpu);
+
+	return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
 }
 
 extern void init_dl_bw(struct dl_bw *dl_b);
···
 	unsigned int value;
 	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
 };
+
+DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 
 /*
···
 #endif
 };
 
+/*
+ * Lockdep annotation that avoids accidental unlocks; it's like a
+ * sticky/continuous lockdep_assert_held().
+ *
+ * This avoids code that has access to 'struct rq *rq' (basically everything in
+ * the scheduler) from accidentally unlocking the rq if they do not also have a
+ * copy of the (on-stack) 'struct rq_flags rf'.
+ *
+ * Also see Documentation/locking/lockdep-design.rst.
+ */
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
 	rf->cookie = lockdep_pin_lock(&rq->lock);
···
 #define RETRY_TASK		((void *)-1UL)
 
 struct sched_class {
-	const struct sched_class *next;
 
 #ifdef CONFIG_UCLAMP_TASK
 	int uclamp_enabled;
···
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task)   (struct rq *rq);
-	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
+	bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
 
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
···
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_change_group)(struct task_struct *p, int type);
 #endif
-};
+} __aligned(STRUCT_ALIGNMENT); /* STRUCT_ALIGN(), vmlinux.lds.h */
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
···
 	next->sched_class->set_next_task(rq, next, false);
 }
 
-#ifdef CONFIG_SMP
-#define sched_class_highest (&stop_sched_class)
-#else
-#define sched_class_highest (&dl_sched_class)
-#endif
+/* Defined in include/asm-generic/vmlinux.lds.h */
+extern struct sched_class __begin_sched_classes[];
+extern struct sched_class __end_sched_classes[];
+
+#define sched_class_highest (__end_sched_classes - 1)
+#define sched_class_lowest  (__begin_sched_classes - 1)
 
 #define for_class_range(class, _from, _to) \
-	for (class = (_from); class != (_to); class = class->next)
+	for (class = (_from); class != (_to); class--)
 
 #define for_each_class(class) \
-	for_class_range(class, sched_class_highest, NULL)
+	for_class_range(class, sched_class_highest, sched_class_lowest)
 
 extern const struct sched_class stop_sched_class;
 extern const struct sched_class dl_sched_class;
···
  */
 static inline void sched_update_tick_dependency(struct rq *rq)
 {
-	int cpu;
-
-	if (!tick_nohz_full_enabled())
-		return;
-
-	cpu = cpu_of(rq);
+	int cpu = cpu_of(rq);
 
 	if (!tick_nohz_full_cpu(cpu))
 		return;
···
 	unsigned prev_nr = rq->nr_running;
 
 	rq->nr_running = prev_nr + count;
+	if (trace_sched_update_nr_running_tp_enabled()) {
+		call_trace_sched_update_nr_running(rq, count);
+	}
 
 #ifdef CONFIG_SMP
 	if (prev_nr < 2 && rq->nr_running >= 2) {
···
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
 	rq->nr_running -= count;
+	if (trace_sched_update_nr_running_tp_enabled()) {
+		call_trace_sched_update_nr_running(rq, count);
+	}
+
 	/* Check if we still need preemption */
 	sched_update_tick_dependency(rq);
 }
···
 #endif
 
 #ifndef arch_scale_freq_capacity
+/**
+ * arch_scale_freq_capacity - get the frequency scale factor of a given CPU.
+ * @cpu: the CPU in question.
+ *
+ * Return: the frequency scale factor normalized against SCHED_CAPACITY_SCALE, i.e.
+ *
+ *     f_curr
+ *     ------ * SCHED_CAPACITY_SCALE
+ *     f_max
+ */
 static __always_inline
 unsigned long arch_scale_freq_capacity(int cpu)
 {
···
 #ifdef CONFIG_UCLAMP_TASK
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
 
+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq:   The rq to clamp against. Must not be NULL.
+ * @util: The util value to clamp.
+ * @p:    The task to clamp against. Can be NULL if you want to clamp
+ *        against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
 static __always_inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 				  struct task_struct *p)
 {
-	unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
-	unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+	unsigned long min_util;
+	unsigned long max_util;
+
+	if (!static_branch_likely(&sched_uclamp_used))
+		return util;
+
+	min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+	max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
 
 	if (p) {
 		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
···
 
 	return clamp(util, min_util, max_util);
 }
+
+/*
+ * When uclamp is compiled in, the aggregation at rq level is 'turned off'
+ * by default in the fast path and only gets turned on once userspace performs
+ * an operation that requires it.
+ *
+ * Returns true if userspace opted-in to use uclamp and aggregation at rq level
+ * hence is active.
+ */
+static inline bool uclamp_is_used(void)
+{
+	return static_branch_likely(&sched_uclamp_used);
+}
 #else /* CONFIG_UCLAMP_TASK */
 static inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 				  struct task_struct *p)
 {
 	return util;
+}
+
+static inline bool uclamp_is_used(void)
+{
+	return false;
 }
 #endif /* CONFIG_UCLAMP_TASK */
+2 -10
kernel/sched/stop_task.c
···
 	BUG(); /* how!?, what priority? */
 }
 
-static unsigned int
-get_rr_interval_stop(struct rq *rq, struct task_struct *task)
-{
-	return 0;
-}
-
 static void update_curr_stop(struct rq *rq)
 {
 }
···
 /*
  * Simple, special scheduling class for the per-CPU stop tasks:
  */
-const struct sched_class stop_sched_class = {
-	.next			= &dl_sched_class,
+const struct sched_class stop_sched_class
+	__attribute__((section("__stop_sched_class"))) = {
 
 	.enqueue_task		= enqueue_task_stop,
 	.dequeue_task		= dequeue_task_stop,
···
 #endif
 
 	.task_tick		= task_tick_stop,
-
-	.get_rr_interval	= get_rr_interval_stop,
 
 	.prio_changed		= prio_changed_stop,
 	.switched_to		= switched_to_stop,
+1 -1
kernel/sched/topology.c
···
 	sd_flags = (*tl->sd_flags)();
 	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
 			"wrong sd_flags in topology description\n"))
-		sd_flags &= ~TOPOLOGY_SD_FLAGS;
+		sd_flags &= TOPOLOGY_SD_FLAGS;
 
 	/* Apply detected topology flags */
 	sd_flags |= dflags;
+1 -2
kernel/smp.c
···
 {
 	int nr_cpus;
 
-	get_option(&str, &nr_cpus);
-	if (nr_cpus > 0 && nr_cpus < nr_cpu_ids)
+	if (get_option(&str, &nr_cpus) && nr_cpus > 0 && nr_cpus < nr_cpu_ids)
 		nr_cpu_ids = nr_cpus;
 
 	return 0;
+21
kernel/sysctl.c
···
 		.proc_handler	= sched_rt_handler,
 	},
 	{
+		.procname	= "sched_deadline_period_max_us",
+		.data		= &sysctl_sched_dl_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_deadline_period_min_us",
+		.data		= &sysctl_sched_dl_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_rr_timeslice_ms",
 		.data		= &sysctl_sched_rr_timeslice,
 		.maxlen		= sizeof(int),
···
 	{
 		.procname	= "sched_util_clamp_max",
 		.data		= &sysctl_sched_uclamp_util_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_handler,
+	},
+	{
+		.procname	= "sched_util_clamp_min_rt_default",
+		.data		= &sysctl_sched_uclamp_util_min_rt_default,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
+1 -1
kernel/time/timekeeping.c
···
 void do_timer(unsigned long ticks)
 {
 	jiffies_64 += ticks;
-	calc_global_load(ticks);
+	calc_global_load();
 }
 
 /**
+11 -5
lib/cpumask.c
···
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
+#include <linux/sched/isolation.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
···
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-	int cpu;
+	int cpu, hk_flags;
+	const struct cpumask *mask;
 
+	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+	mask = housekeeping_cpumask(hk_flags);
 	/* Wrap: we always want a cpu. */
-	i %= num_online_cpus();
+	i %= cpumask_weight(mask);
 
 	if (node == NUMA_NO_NODE) {
-		for_each_cpu(cpu, cpu_online_mask)
+		for_each_cpu(cpu, mask) {
 			if (i-- == 0)
 				return cpu;
+		}
 	} else {
 		/* NUMA first. */
-		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
 			if (i-- == 0)
 				return cpu;
+		}
 
-		for_each_cpu(cpu, cpu_online_mask) {
+		for_each_cpu(cpu, mask) {
 			/* Skip NUMA nodes, done above. */
 			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
 				continue;
+41
lib/math/div64.c
···
 	return __iter_div_u64_rem(dividend, divisor, remainder);
 }
 EXPORT_SYMBOL(iter_div_u64_rem);
+
+#ifndef mul_u64_u64_div_u64
+u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
+{
+	u64 res = 0, div, rem;
+	int shift;
+
+	/* can a * b overflow ? */
+	if (ilog2(a) + ilog2(b) > 62) {
+		/*
+		 * (b * a) / c is equal to
+		 *
+		 *      (b / c) * a +
+		 *      (b % c) * a / c
+		 *
+		 * if nothing overflows. Can the 1st multiplication
+		 * overflow? Yes, but we do not care: this can only
+		 * happen if the end result can't fit in u64 anyway.
+		 *
+		 * So the code below does
+		 *
+		 *      res = (b / c) * a;
+		 *      b = b % c;
+		 */
+		div = div64_u64_rem(b, c, &rem);
+		res = div * a;
+		b = rem;
+
+		shift = ilog2(a) + ilog2(b) - 62;
+		if (shift > 0) {
+			/* drop precision */
+			b >>= shift;
+			c >>= shift;
+			if (!c)
+				return res;
+		}
+	}
+
+	return res + div64_u64(a * b, c);
+}
+#endif
+9 -1
net/core/net-sysfs.c
···
 #include <linux/if_arp.h>
 #include <linux/slab.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/isolation.h>
 #include <linux/nsproxy.h>
 #include <net/sock.h>
 #include <net/net_namespace.h>
···
 {
 	struct rps_map *old_map, *map;
 	cpumask_var_t mask;
-	int err, cpu, i;
+	int err, cpu, i, hk_flags;
 	static DEFINE_MUTEX(rps_map_mutex);
 
 	if (!capable(CAP_NET_ADMIN))
···
 	if (err) {
 		free_cpumask_var(mask);
 		return err;
+	}
+
+	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+	cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
+	if (cpumask_empty(mask)) {
+		free_cpumask_var(mask);
+		return -EINVAL;
 	}
 
 	map = kzalloc(max_t(unsigned int,