Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

- Add cpufreq pressure feedback for the scheduler

- Rework misfit load-balancing wrt affinity restrictions

- Clean up and simplify the code around ::overutilized and
::overload access.

- Simplify sched_balance_newidle()

- Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
handling that changed the output.

- Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()

- Reorganize, clean up and unify most of the higher level
scheduler balancing function names around the sched_balance_*()
prefix

- Simplify the balancing flag code (sched_balance_running)

- Miscellaneous cleanups & fixes

* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
sched/pelt: Remove shift of thermal clock
sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
thermal/cpufreq: Remove arch_update_thermal_pressure()
sched/cpufreq: Take cpufreq feedback into account
cpufreq: Add a cpufreq pressure feedback for the scheduler
sched/fair: Fix update of rd->sg_overutilized
sched/vtime: Do not include <asm/vtime.h> header
s390/irq,nmi: Include <asm/vtime.h> header directly
s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
sched/vtime: Get rid of generic vtime_task_switch() implementation
sched/vtime: Remove confusing arch_vtime_task_switch() declaration
sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
sched/fair: Rename root_domain::overload to ::overloaded
sched/fair: Use helper functions to access root_domain::overload
sched/fair: Check root_domain::overload value before update
sched/fair: Combine EAS check with root_domain::overutilized access
sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
...

+549 -440
+1
Documentation/admin-guide/kernel-parameters.txt
···
 			but is useful for debugging and performance tuning.
 
 	sched_thermal_decay_shift=
+			[Deprecated]
 			[KNL, SMP] Set a decay shift for scheduler thermal
 			pressure signal. Thermal pressure signal follows the
 			default decay period of other scheduler pelt
+6 -6
Documentation/scheduler/sched-domains.rst
···
 load of each of its member CPUs, and only when the load of a group becomes
 out of balance are tasks moved between groups.
 
-In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
-through scheduler_tick(). It raises a softirq after the next regularly scheduled
+In kernel/sched/core.c, sched_balance_trigger() is run periodically on each CPU
+through sched_tick(). It raises a softirq after the next regularly scheduled
 rebalancing event for the current runqueue has arrived. The actual load
-balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
+balancing workhorse, sched_balance_softirq()->sched_balance_domains(), is then run
 in softirq context (SCHED_SOFTIRQ).
 
 The latter function takes two arguments: the runqueue of current CPU and whether
-the CPU was idle at the time the scheduler_tick() happened and iterates over all
+the CPU was idle at the time the sched_tick() happened and iterates over all
 sched domains our CPU is on, starting from its base domain and going up the ->parent
 chain. While doing that, it checks to see if the current domain has exhausted its
-rebalance interval. If so, it runs load_balance() on that domain. It then checks
+rebalance interval. If so, it runs sched_balance_rq() on that domain. It then checks
 the parent sched_domain (if it exists), and the parent of the parent and so
 forth.
 
-Initially, load_balance() finds the busiest group in the current sched domain.
+Initially, sched_balance_rq() finds the busiest group in the current sched domain.
 If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
 that group. If it manages to find such a runqueue, it locks both our initial
 CPU's runqueue and the newly found busiest one and starts moving tasks from it
+21 -16
Documentation/scheduler/sched-stats.rst
···
 Scheduler Statistics
 ====================
 
+Version 16 of schedstats changed the order of definitions within
+'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
+columns in show_schedstat(). In particular the position of CPU_IDLE
+and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
+
 Version 15 of schedstats dropped counters for some sched_yield:
 yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
 identical to version 14.
···
 
 The first field is a bit mask indicating what cpus this domain operates over.
 
-The next 24 are a variety of load_balance() statistics in grouped into types
+The next 24 are a variety of sched_balance_rq() statistics in grouped into types
 of idleness (idle, busy, and newly idle):
 
-    1)  # of times in this domain load_balance() was called when the
+    1)  # of times in this domain sched_balance_rq() was called when the
         cpu was idle
-    2)  # of times in this domain load_balance() checked but found
+    2)  # of times in this domain sched_balance_rq() checked but found
         the load did not require balancing when the cpu was idle
-    3)  # of times in this domain load_balance() tried to move one or
+    3)  # of times in this domain sched_balance_rq() tried to move one or
         more tasks and failed, when the cpu was idle
     4)  sum of imbalances discovered (if any) with each call to
-        load_balance() in this domain when the cpu was idle
+        sched_balance_rq() in this domain when the cpu was idle
     5)  # of times in this domain pull_task() was called when the cpu
         was idle
     6)  # of times in this domain pull_task() was called even though
         the target task was cache-hot when idle
-    7)  # of times in this domain load_balance() was called but did
+    7)  # of times in this domain sched_balance_rq() was called but did
         not find a busier queue while the cpu was idle
     8)  # of times in this domain a busier queue was found while the
         cpu was idle but no busier group was found
-    9)  # of times in this domain load_balance() was called when the
+    9)  # of times in this domain sched_balance_rq() was called when the
         cpu was busy
-    10) # of times in this domain load_balance() checked but found the
+    10) # of times in this domain sched_balance_rq() checked but found the
         load did not require balancing when busy
-    11) # of times in this domain load_balance() tried to move one or
+    11) # of times in this domain sched_balance_rq() tried to move one or
         more tasks and failed, when the cpu was busy
     12) sum of imbalances discovered (if any) with each call to
-        load_balance() in this domain when the cpu was busy
+        sched_balance_rq() in this domain when the cpu was busy
     13) # of times in this domain pull_task() was called when busy
     14) # of times in this domain pull_task() was called even though the
         target task was cache-hot when busy
-    15) # of times in this domain load_balance() was called but did not
+    15) # of times in this domain sched_balance_rq() was called but did not
         find a busier queue while the cpu was busy
     16) # of times in this domain a busier queue was found while the cpu
         was busy but no busier group was found
 
-    17) # of times in this domain load_balance() was called when the
+    17) # of times in this domain sched_balance_rq() was called when the
         cpu was just becoming idle
-    18) # of times in this domain load_balance() checked but found the
+    18) # of times in this domain sched_balance_rq() checked but found the
         load did not require balancing when the cpu was just becoming idle
-    19) # of times in this domain load_balance() tried to move one or more
+    19) # of times in this domain sched_balance_rq() tried to move one or more
         tasks and failed, when the cpu was just becoming idle
     20) sum of imbalances discovered (if any) with each call to
-        load_balance() in this domain when the cpu was just becoming idle
+        sched_balance_rq() in this domain when the cpu was just becoming idle
     21) # of times in this domain pull_task() was called when newly idle
     22) # of times in this domain pull_task() was called even though the
         target task was cache-hot when just becoming idle
-    23) # of times in this domain load_balance() was called but did not
+    23) # of times in this domain sched_balance_rq() was called but did not
        find a busier queue while the cpu was just becoming idle
     24) # of times in this domain a busier queue was found while the cpu
         was just becoming idle but no busier group was found
+5 -5
Documentation/translations/zh_CN/scheduler/sched-domains.rst
···
 调度域中的负载均衡发生在调度组中。也就是说,每个组被视为一个实体。组的负载被定义为它
 管辖的每个CPU的负载之和。仅当组的负载不均衡后,任务才在组之间发生迁移。
 
-在kernel/sched/core.c中,trigger_load_balance()在每个CPU上通过scheduler_tick()
+在kernel/sched/core.c中,sched_balance_trigger()在每个CPU上通过sched_tick()
 周期执行。在当前运行队列下一个定期调度再平衡事件到达后,它引发一个软中断。负载均衡真正
-的工作由run_rebalance_domains()->rebalance_domains()完成,在软中断上下文中执行
+的工作由sched_balance_softirq()->sched_balance_domains()完成,在软中断上下文中执行
 (SCHED_SOFTIRQ)。
 
-后一个函数有两个入参:当前CPU的运行队列、它在scheduler_tick()调用时是否空闲。函数会从
+后一个函数有两个入参:当前CPU的运行队列、它在sched_tick()调用时是否空闲。函数会从
 当前CPU所在的基调度域开始迭代执行,并沿着parent指针链向上进入更高层级的调度域。在迭代
 过程中,函数会检查当前调度域是否已经耗尽了再平衡的时间间隔,如果是,它在该调度域运行
-load_balance()。接下来它检查父调度域(如果存在),再后来父调度域的父调度域,以此类推。
+sched_balance_rq()。接下来它检查父调度域(如果存在),再后来父调度域的父调度域,以此类推。
 
-起初,load_balance()查找当前调度域中最繁忙的调度组。如果成功,在该调度组管辖的全部CPU
+起初,sched_balance_rq()查找当前调度域中最繁忙的调度组。如果成功,在该调度组管辖的全部CPU
 的运行队列中找出最繁忙的运行队列。如能找到,对当前的CPU运行队列和新找到的最繁忙运行
 队列均加锁,并把任务从最繁忙队列中迁移到当前CPU上。被迁移的任务数量等于在先前迭代执行
 中计算出的该调度域的调度组的不均衡值。
+15 -15
Documentation/translations/zh_CN/scheduler/sched-stats.rst
···
 繁忙,新空闲):
 
-    1)  当CPU空闲时,load_balance()在这个调度域中被调用了#次
-    2)  当CPU空闲时,load_balance()在这个调度域中被调用,但是发现负载无需
+    1)  当CPU空闲时,sched_balance_rq()在这个调度域中被调用了#次
+    2)  当CPU空闲时,sched_balance_rq()在这个调度域中被调用,但是发现负载无需
         均衡#次
-    3)  当CPU空闲时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+    3)  当CPU空闲时,sched_balance_rq()在这个调度域中被调用,试图迁移1个或更多
         任务且失败了#次
-    4)  当CPU空闲时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+    4)  当CPU空闲时,sched_balance_rq()在这个调度域中被调用,发现不均衡(如果有)
         #次
     5)  当CPU空闲时,pull_task()在这个调度域中被调用#次
     6)  当CPU空闲时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
-    7)  当CPU空闲时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+    7)  当CPU空闲时,sched_balance_rq()在这个调度域中被调用,未能找到更繁忙的
         队列#次
     8)  当CPU空闲时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
         #次
-    9)  当CPU繁忙时,load_balance()在这个调度域中被调用了#次
-    10) 当CPU繁忙时,load_balance()在这个调度域中被调用,但是发现负载无需
+    9)  当CPU繁忙时,sched_balance_rq()在这个调度域中被调用了#次
+    10) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用,但是发现负载无需
         均衡#次
-    11) 当CPU繁忙时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+    11) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用,试图迁移1个或更多
         任务且失败了#次
-    12) 当CPU繁忙时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+    12) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用,发现不均衡(如果有)
         #次
     13) 当CPU繁忙时,pull_task()在这个调度域中被调用#次
     14) 当CPU繁忙时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
-    15) 当CPU繁忙时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+    15) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用,未能找到更繁忙的
         队列#次
     16) 当CPU繁忙时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
         #次
-    17) 当CPU新空闲时,load_balance()在这个调度域中被调用了#次
-    18) 当CPU新空闲时,load_balance()在这个调度域中被调用,但是发现负载无需
+    17) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用了#次
+    18) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用,但是发现负载无需
         均衡#次
-    19) 当CPU新空闲时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+    19) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用,试图迁移1个或更多
         任务且失败了#次
-    20) 当CPU新空闲时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+    20) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用,发现不均衡(如果有)
         #次
     21) 当CPU新空闲时,pull_task()在这个调度域中被调用#次
     22) 当CPU新空闲时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
-    23) 当CPU新空闲时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+    23) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用,未能找到更繁忙的
         队列#次
     24) 当CPU新空闲时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
         #次
+3 -3
arch/arm/include/asm/topology.h
···
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure topology_update_thermal_pressure
+/* Replace task scheduler's default HW pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressure topology_update_hw_pressure
 
 #else
 
+1 -1
arch/arm/kernel/topology.c
···
  * can take this difference into account during load balance. A per cpu
  * structure is preferred because each CPU updates its own cpu_capacity field
  * during the load balance except for idle cores. One idle core is selected
- * to run the rebalance_domains for all idle cores and the cpu_capacity can be
+ * to run the sched_balance_domains for all idle cores and the cpu_capacity can be
  * updated during this sequence.
  */
 
+3 -3
arch/arm64/include/asm/topology.h
···
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure topology_update_thermal_pressure
+/* Replace task scheduler's default HW pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressure topology_update_hw_pressure
 
 #include <asm-generic/topology.h>
 
-1
arch/powerpc/include/asm/Kbuild
···
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
-generic-y += vtime.h
 generic-y += early_ioremap.h
-13
arch/powerpc/include/asm/cputime.h
···
 #ifdef CONFIG_PPC64
 #define get_accounting(tsk)	(&get_paca()->accounting)
 #define raw_get_accounting(tsk)	(&local_paca->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 
 #else
 #define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
 #define raw_get_accounting(tsk)	get_accounting(tsk)
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
-{
-	struct cpu_accounting_data *acct = get_accounting(current);
-	struct cpu_accounting_data *acct0 = get_accounting(prev);
-
-	acct->starttime = acct0->starttime;
-}
 #endif
 
 /*
+22
arch/powerpc/kernel/time.c
···
 	acct->hardirq_time = 0;
 	acct->softirq_time = 0;
 }
+
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+void vtime_task_switch(struct task_struct *prev)
+{
+	if (is_idle_task(prev))
+		vtime_account_idle(prev);
+	else
+		vtime_account_kernel(prev);
+
+	vtime_flush(prev);
+
+	if (!IS_ENABLED(CONFIG_PPC64)) {
+		struct cpu_accounting_data *acct = get_accounting(current);
+		struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+		acct->starttime = acct0->starttime;
+	}
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 void __no_kcsan __delay(unsigned long loops)
-2
arch/s390/include/asm/vtime.h
···
 #ifndef _S390_VTIME_H
 #define _S390_VTIME_H
 
-#define __ARCH_HAS_VTIME_TASK_SWITCH
-
 static inline void update_timer_sys(void)
 {
 	S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
+1
arch/s390/kernel/irq.c
···
 #include <asm/hw_irq.h>
 #include <asm/stacktrace.h>
 #include <asm/softirq_stack.h>
+#include <asm/vtime.h>
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
+1
arch/s390/kernel/nmi.c
···
 #include <asm/crw.h>
 #include <asm/asm-offsets.h>
 #include <asm/pai.h>
+#include <asm/vtime.h>
 
 struct mcck_struct {
 	unsigned int kill_task : 1;
+13 -13
drivers/base/arch_topology.c
···
 #include <linux/units.h>
 
 #define CREATE_TRACE_POINTS
-#include <trace/events/thermal_pressure.h>
+#include <trace/events/hw_pressure.h>
 
 static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
 static struct cpumask scale_freq_counters_mask;
···
 	per_cpu(cpu_scale, cpu) = capacity;
 }
 
-DEFINE_PER_CPU(unsigned long, thermal_pressure);
+DEFINE_PER_CPU(unsigned long, hw_pressure);
 
 /**
- * topology_update_thermal_pressure() - Update thermal pressure for CPUs
+ * topology_update_hw_pressure() - Update HW pressure for CPUs
  * @cpus        : The related CPUs for which capacity has been reduced
  * @capped_freq : The maximum allowed frequency that CPUs can run at
  *
- * Update the value of thermal pressure for all @cpus in the mask. The
+ * Update the value of HW pressure for all @cpus in the mask. The
  * cpumask should include all (online+offline) affected CPUs, to avoid
  * operating on stale data when hot-plug is used for some CPUs. The
  * @capped_freq reflects the currently allowed max CPUs frequency due to
- * thermal capping. It might be also a boost frequency value, which is bigger
+ * HW capping. It might be also a boost frequency value, which is bigger
  * than the internal 'capacity_freq_ref' max frequency. In such case the
  * pressure value should simply be removed, since this is an indication that
- * there is no thermal throttling. The @capped_freq must be provided in kHz.
+ * there is no HW throttling. The @capped_freq must be provided in kHz.
  */
-void topology_update_thermal_pressure(const struct cpumask *cpus,
+void topology_update_hw_pressure(const struct cpumask *cpus,
 				      unsigned long capped_freq)
 {
-	unsigned long max_capacity, capacity, th_pressure;
+	unsigned long max_capacity, capacity, hw_pressure;
 	u32 max_freq;
 	int cpu;
 
···
 
 	/*
 	 * Handle properly the boost frequencies, which should simply clean
-	 * the thermal pressure value.
+	 * the HW pressure value.
 	 */
 	if (max_freq <= capped_freq)
 		capacity = max_capacity;
 	else
 		capacity = mult_frac(max_capacity, capped_freq, max_freq);
 
-	th_pressure = max_capacity - capacity;
+	hw_pressure = max_capacity - capacity;
 
-	trace_thermal_pressure_update(cpu, th_pressure);
+	trace_hw_pressure_update(cpu, hw_pressure);
 
 	for_each_cpu(cpu, cpus)
-		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+		WRITE_ONCE(per_cpu(hw_pressure, cpu), hw_pressure);
 }
-EXPORT_SYMBOL_GPL(topology_update_thermal_pressure);
+EXPORT_SYMBOL_GPL(topology_update_hw_pressure);
 
 static ssize_t cpu_capacity_show(struct device *dev,
 				 struct device_attribute *attr,
+36
drivers/cpufreq/cpufreq.c
···
 }
 EXPORT_SYMBOL(cpufreq_get_policy);
 
+DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
+
+/**
+ * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
+ * @policy: cpufreq policy of the CPUs.
+ *
+ * Update the value of cpufreq pressure for all @cpus in the policy.
+ */
+static void cpufreq_update_pressure(struct cpufreq_policy *policy)
+{
+	unsigned long max_capacity, capped_freq, pressure;
+	u32 max_freq;
+	int cpu;
+
+	cpu = cpumask_first(policy->related_cpus);
+	max_freq = arch_scale_freq_ref(cpu);
+	capped_freq = policy->max;
+
+	/*
+	 * Handle properly the boost frequencies, which should simply clean
+	 * the cpufreq pressure value.
+	 */
+	if (max_freq <= capped_freq) {
+		pressure = 0;
+	} else {
+		max_capacity = arch_scale_cpu_capacity(cpu);
+		pressure = max_capacity -
+			   mult_frac(max_capacity, capped_freq, max_freq);
+	}
+
+	for_each_cpu(cpu, policy->related_cpus)
+		WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
+}
+
 /**
  * cpufreq_set_policy - Modify cpufreq policy parameters.
  * @policy: Policy object to modify.
···
 	policy->min = __resolve_freq(policy, policy->min, CPUFREQ_RELATION_L);
 	policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
 	trace_cpu_frequency_limits(policy);
+
+	cpufreq_update_pressure(policy);
 
 	policy->cached_target_freq = UINT_MAX;
 
+2 -2
drivers/cpufreq/qcom-cpufreq-hw.c
···
 
 	throttled_freq = freq_hz / HZ_PER_KHZ;
 
-	/* Update thermal pressure (the boost frequencies are accepted) */
-	arch_update_thermal_pressure(policy->related_cpus, throttled_freq);
+	/* Update HW pressure (the boost frequencies are accepted) */
+	arch_update_hw_pressure(policy->related_cpus, throttled_freq);
 
 	/*
 	 * In the unlikely case policy is unregistered do not enable
-3
drivers/thermal/cpufreq_cooling.c
···
 				 unsigned long state)
 {
 	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
-	struct cpumask *cpus;
 	unsigned int frequency;
 	int ret;
 
···
 	ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
 	if (ret >= 0) {
 		cpufreq_cdev->cpufreq_state = state;
-		cpus = cpufreq_cdev->policy->related_cpus;
-		arch_update_thermal_pressure(cpus, frequency);
 		ret = 0;
 	}
 
-1
include/asm-generic/vtime.h
···
-/* no content, but patch(1) dislikes empty files */
+4 -4
include/linux/arch_topology.h
···
 void topology_set_scale_freq_source(struct scale_freq_data *data, const struct cpumask *cpus);
 void topology_clear_scale_freq_source(enum scale_freq_source source, const struct cpumask *cpus);
 
-DECLARE_PER_CPU(unsigned long, thermal_pressure);
+DECLARE_PER_CPU(unsigned long, hw_pressure);
 
-static inline unsigned long topology_get_thermal_pressure(int cpu)
+static inline unsigned long topology_get_hw_pressure(int cpu)
 {
-	return per_cpu(thermal_pressure, cpu);
+	return per_cpu(hw_pressure, cpu);
 }
 
-void topology_update_thermal_pressure(const struct cpumask *cpus,
+void topology_update_hw_pressure(const struct cpumask *cpus,
 				      unsigned long capped_freq);
 
 struct cpu_topology {
+10
include/linux/cpufreq.h
···
 void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
 void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
 bool has_target_index(void);
+
+DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
+static inline unsigned long cpufreq_get_pressure(int cpu)
+{
+	return READ_ONCE(per_cpu(cpufreq_pressure, cpu));
+}
 #else
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
···
 }
 static inline void disable_cpufreq(void) { }
 static inline void cpufreq_update_limits(unsigned int cpu) { }
+static inline unsigned long cpufreq_get_pressure(int cpu)
+{
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_CPU_FREQ_STAT
+2 -1
include/linux/sched.h
···
 	TASK_COMM_LEN = 16,
 };
 
-extern void scheduler_tick(void);
+extern void sched_tick(void);
 
 #define MAX_SCHEDULE_TIMEOUT		LONG_MAX
 
···
 #endif
 
 	unsigned int			policy;
+	unsigned long			max_allowed_capacity;
 	int				nr_cpus_allowed;
 	const cpumask_t			*cpus_ptr;
 	cpumask_t			*user_cpus_ptr;
+1 -1
include/linux/sched/idle.h
···
 #include <linux/sched.h>
 
 enum cpu_idle_type {
+	__CPU_NOT_IDLE = 0,
 	CPU_IDLE,
-	CPU_NOT_IDLE,
 	CPU_NEWLY_IDLE,
 	CPU_MAX_IDLE_TYPES
 };
+5 -5
include/linux/sched/topology.h
···
 	unsigned long last_decay_max_lb_cost;
 
 #ifdef CONFIG_SCHEDSTATS
-	/* load_balance() stats */
+	/* sched_balance_rq() stats */
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
···
 }
 #endif
 
-#ifndef arch_scale_thermal_pressure
+#ifndef arch_scale_hw_pressure
 static __always_inline
-unsigned long arch_scale_thermal_pressure(int cpu)
+unsigned long arch_scale_hw_pressure(int cpu)
 {
 	return 0;
 }
 #endif
 
-#ifndef arch_update_thermal_pressure
+#ifndef arch_update_hw_pressure
 static __always_inline
-void arch_update_thermal_pressure(const struct cpumask *cpus,
+void arch_update_hw_pressure(const struct cpumask *cpus,
 				  unsigned long capped_frequency)
 { }
 #endif
-5
include/linux/vtime.h
···
 #include <linux/context_tracking_state.h>
 #include <linux/sched.h>
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-#include <asm/vtime.h>
-#endif
-
 /*
  * Common vtime APIs
  */
···
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void arch_vtime_task_switch(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
 extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
+1 -1
include/trace/events/sched.h
···
 	TP_PROTO(struct rq *rq),
 	TP_ARGS(rq));
 
-DECLARE_TRACE(pelt_thermal_tp,
+DECLARE_TRACE(pelt_hw_tp,
 	TP_PROTO(struct rq *rq),
 	TP_ARGS(rq));
 
+7 -7
include/trace/events/{thermal_pressure.h => hw_pressure.h}
···
 /* SPDX-License-Identifier: GPL-2.0 */
 #undef TRACE_SYSTEM
-#define TRACE_SYSTEM thermal_pressure
+#define TRACE_SYSTEM hw_pressure
 
 #if !defined(_TRACE_THERMAL_PRESSURE_H) || defined(TRACE_HEADER_MULTI_READ)
 #define _TRACE_THERMAL_PRESSURE_H
 
 #include <linux/tracepoint.h>
 
-TRACE_EVENT(thermal_pressure_update,
-	TP_PROTO(int cpu, unsigned long thermal_pressure),
-	TP_ARGS(cpu, thermal_pressure),
+TRACE_EVENT(hw_pressure_update,
+	TP_PROTO(int cpu, unsigned long hw_pressure),
+	TP_ARGS(cpu, hw_pressure),
 
 	TP_STRUCT__entry(
-		__field(unsigned long, thermal_pressure)
+		__field(unsigned long, hw_pressure)
 		__field(int, cpu)
 	),
 
 	TP_fast_assign(
-		__entry->thermal_pressure = thermal_pressure;
+		__entry->hw_pressure = hw_pressure;
 		__entry->cpu = cpu;
 	),
 
-	TP_printk("cpu=%d thermal_pressure=%lu", __entry->cpu, __entry->thermal_pressure)
+	TP_printk("cpu=%d hw_pressure=%lu", __entry->cpu, __entry->hw_pressure)
 );
 #endif /* _TRACE_THERMAL_PRESSURE_H */
 
+6 -6
init/Kconfig
···
 	depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
 	depends on SMP
 
-config SCHED_THERMAL_PRESSURE
+config SCHED_HW_PRESSURE
 	bool
 	default y if ARM && ARM_CPU_TOPOLOGY
 	default y if ARM64
 	depends on SMP
 	depends on CPU_FREQ_THERMAL
 	help
-	  Select this option to enable thermal pressure accounting in the
-	  scheduler. Thermal pressure is the value conveyed to the scheduler
+	  Select this option to enable HW pressure accounting in the
+	  scheduler. HW pressure is the value conveyed to the scheduler
 	  that reflects the reduction in CPU compute capacity resulted from
-	  thermal throttling. Thermal throttling occurs when the performance of
-	  a CPU is capped due to high operating temperatures.
+	  HW throttling. HW throttling occurs when the performance of
+	  a CPU is capped due to high operating temperatures as an example.
 
 	  If selected, the scheduler will be able to balance tasks accordingly,
 	  i.e. put less load on throttled CPUs than on non/less throttled ones.
 
 	  This requires the architecture to implement
-	  arch_update_thermal_pressure() and arch_scale_thermal_pressure().
+	  arch_update_hw_pressure() and arch_scale_thermal_pressure().
 
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
+1
init/init_task.c
···
 	.cpus_ptr	= &init_task.cpus_mask,
 	.user_cpus_ptr	= NULL,
 	.cpus_mask	= CPU_MASK_ALL,
+	.max_allowed_capacity	= SCHED_CAPACITY_SCALE,
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+7 -7
kernel/sched/core.c
···
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_dl_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_se_tp);
-EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_thermal_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_hw_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_cpu_capacity_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
···
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
  */
-void scheduler_tick(void)
+void sched_tick(void)
 {
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
 	struct task_struct *curr = rq->curr;
 	struct rq_flags rf;
-	unsigned long thermal_pressure;
+	unsigned long hw_pressure;
 	u64 resched_latency;
 
 	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
···
 	rq_lock(rq, &rf);
 
 	update_rq_clock(rq);
-	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
-	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
+	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
+	update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
 	curr->sched_class->task_tick(rq, curr, 0);
 	if (sched_feat(LATENCY_WARN))
 		resched_latency = cpu_resched_latency(rq);
···
 
 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
-	trigger_load_balance(rq);
+	sched_balance_trigger(rq);
 #endif
 }
 
···
  * paths. For example, see arch/x86/entry_64.S.
  *
  * To drive preemption between tasks, the scheduler sets the flag in timer
- * interrupt handler scheduler_tick().
+ * interrupt handler sched_tick().
  *
  * 3. Wakeups don't really cause entry into schedule(). They add a
  *    task to the run-queue and that's it.
-13
kernel/sched/cputime.c
···
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
-# ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_task_switch(struct task_struct *prev)
-{
-	if (is_idle_task(prev))
-		vtime_account_idle(prev);
-	else
-		vtime_account_kernel(prev);
-
-	vtime_flush(prev);
-	arch_vtime_task_switch(prev);
-}
-# endif
-
 void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
 {
 	unsigned int pc = irq_count() - offset;
+287 -214
kernel/sched/fair.c
··· 78 78 79 79 const_debug unsigned int sysctl_sched_migration_cost = 500000UL; 80 80 81 - int sched_thermal_decay_shift; 82 81 static int __init setup_sched_thermal_decay_shift(char *str) 83 82 { 84 - int _shift = 0; 85 - 86 - if (kstrtoint(str, 0, &_shift)) 87 - pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n"); 88 - 89 - sched_thermal_decay_shift = clamp(_shift, 0, 10); 83 + pr_warn("Ignoring the deprecated sched_thermal_decay_shift= option\n"); 90 84 return 1; 91 85 } 92 86 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); ··· 382 388 383 389 /* 384 390 * With cfs_rq being unthrottled/throttled during an enqueue, 385 - * it can happen the tmp_alone_branch points the a leaf that 386 - * we finally want to del. In this case, tmp_alone_branch moves 391 + * it can happen the tmp_alone_branch points to the leaf that 392 + * we finally want to delete. In this case, tmp_alone_branch moves 387 393 * to the prev element but it will point to rq->leaf_cfs_rq_list 388 394 * at the end of the enqueue. 389 395 */ ··· 400 406 SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); 401 407 } 402 408 403 - /* Iterate thr' all leaf cfs_rq's on a runqueue */ 409 + /* Iterate through all leaf cfs_rq's on a runqueue */ 404 410 #define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ 405 411 list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ 406 412 leaf_cfs_rq_list) ··· 589 595 * 590 596 * [[ NOTE: this is only equal to the ideal scheduler under the condition 591 597 * that join/leave operations happen at lag_i = 0, otherwise the 592 - * virtual time has non-continguous motion equivalent to: 598 + * virtual time has non-contiguous motion equivalent to: 593 599 * 594 600 * V +-= lag_i / W 595 601 * 596 602 * Also see the comment in place_entity() that deals with this. 
]] 597 603 * 598 - * However, since v_i is u64, and the multiplcation could easily overflow 604 + * However, since v_i is u64, and the multiplication could easily overflow 599 605 * transform it into a relative form that uses smaller quantities: 600 606 * 601 607 * Substitute: v_i == (v_i - v0) + v0 ··· 665 671 } 666 672 667 673 if (load) { 668 - /* sign flips effective floor / ceil */ 674 + /* sign flips effective floor / ceiling */ 669 675 if (avg < 0) 670 676 avg -= (load - 1); 671 677 avg = div_s64(avg, load); ··· 721 727 * 722 728 * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i) 723 729 * 724 - * Note: using 'avg_vruntime() > se->vruntime' is inacurate due 730 + * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due 725 731 * to the loss in precision caused by the division. 726 732 */ 727 733 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime) ··· 1024 1030 if (entity_is_task(se)) 1025 1031 sa->load_avg = scale_load_down(se->load.weight); 1026 1032 1027 - /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */ 1033 + /* when this task is enqueued, it will contribute to its cfs_rq's load_avg */ 1028 1034 } 1029 1035 1030 1036 /* ··· 1616 1622 max_dist = READ_ONCE(sched_max_numa_distance); 1617 1623 /* 1618 1624 * This code is called for each node, introducing N^2 complexity, 1619 - * which should be ok given the number of nodes rarely exceeds 8. 1625 + * which should be OK given the number of nodes rarely exceeds 8. 1620 1626 */ 1621 1627 for_each_online_node(node) { 1622 1628 unsigned long faults; ··· 3290 3296 /* 3291 3297 * Shared library pages mapped by multiple processes are not 3292 3298 * migrated as it is expected they are cache replicated. Avoid 3293 - * hinting faults in read-only file-backed mappings or the vdso 3299 + * hinting faults in read-only file-backed mappings or the vDSO 3294 3300 * as migrating the pages will be of marginal benefit. 
3295 3301 */ 3296 3302 if (!vma->vm_mm || ··· 3301 3307 3302 3308 /* 3303 3309 * Skip inaccessible VMAs to avoid any confusion between 3304 - * PROT_NONE and NUMA hinting ptes 3310 + * PROT_NONE and NUMA hinting PTEs 3305 3311 */ 3306 3312 if (!vma_is_accessible(vma)) { 3307 3313 trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_INACCESSIBLE); ··· 3333 3339 } 3334 3340 3335 3341 /* 3336 - * Scanning the VMA's of short lived tasks add more overhead. So 3342 + * Scanning the VMAs of short lived tasks add more overhead. So 3337 3343 * delay the scan for new VMAs. 3338 3344 */ 3339 3345 if (mm->numa_scan_seq && time_before(jiffies, ··· 3377 3383 /* 3378 3384 * Try to scan sysctl_numa_balancing_size worth of 3379 3385 * hpages that have at least one present PTE that 3380 - * is not already pte-numa. If the VMA contains 3386 + * is not already PTE-numa. If the VMA contains 3381 3387 * areas that are unused or already full of prot_numa 3382 3388 * PTEs, scan up to virtpages, to skip through those 3383 3389 * areas faster. ··· 3684 3690 3685 3691 /* 3686 3692 * VRUNTIME 3687 - * ======== 3693 + * -------- 3688 3694 * 3689 3695 * COROLLARY #1: The virtual runtime of the entity needs to be 3690 3696 * adjusted if re-weight at !0-lag point. ··· 3767 3773 3768 3774 /* 3769 3775 * DEADLINE 3770 - * ======== 3776 + * -------- 3771 3777 * 3772 3778 * When the weight changes, the virtual time slope changes and 3773 3779 * we should adjust the relative virtual deadline accordingly. 
··· 4739 4745 4740 4746 /* 4741 4747 * Track task load average for carrying it to new CPU after migrated, and 4742 - * track group sched_entity load average for task_h_load calc in migration 4748 + * track group sched_entity load average for task_h_load calculation in migration 4743 4749 */ 4744 4750 if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) 4745 4751 __update_load_avg_se(now, cfs_rq, se); ··· 4822 4828 return cfs_rq->avg.load_avg; 4823 4829 } 4824 4830 4825 - static int newidle_balance(struct rq *this_rq, struct rq_flags *rf); 4831 + static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf); 4826 4832 4827 4833 static inline unsigned long task_util(struct task_struct *p) 4828 4834 { ··· 4965 4971 trace_sched_util_est_se_tp(&p->se); 4966 4972 } 4967 4973 4974 + static inline unsigned long get_actual_cpu_capacity(int cpu) 4975 + { 4976 + unsigned long capacity = arch_scale_cpu_capacity(cpu); 4977 + 4978 + capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu)); 4979 + 4980 + return capacity; 4981 + } 4982 + 4968 4983 static inline int util_fits_cpu(unsigned long util, 4969 4984 unsigned long uclamp_min, 4970 4985 unsigned long uclamp_max, 4971 4986 int cpu) 4972 4987 { 4973 - unsigned long capacity_orig, capacity_orig_thermal; 4974 4988 unsigned long capacity = capacity_of(cpu); 4989 + unsigned long capacity_orig; 4975 4990 bool fits, uclamp_max_fits; 4976 4991 4977 4992 /* ··· 5002 4999 * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it 5003 5000 * should fit a little cpu even if there's some pressure. 5004 5001 * 5005 - * Only exception is for thermal pressure since it has a direct impact 5002 + * Only exception is for HW or cpufreq pressure since it has a direct impact 5006 5003 * on available OPP of the system. 5007 5004 * 5008 5005 * We honour it for uclamp_min only as a drop in performance level ··· 5012 5009 * goal is to cap the task. So it's okay if it's getting less. 
5013 5010 */ 5014 5011 capacity_orig = arch_scale_cpu_capacity(cpu); 5015 - capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu); 5016 5012 5017 5013 /* 5018 5014 * We want to force a task to fit a cpu as implied by uclamp_max. ··· 5028 5026 * | | | | | | | 5029 5027 * | | | | | | | 5030 5028 * +---------------------------------------- 5031 - * cpu0 cpu1 cpu2 5029 + * CPU0 CPU1 CPU2 5032 5030 * 5033 5031 * In the above example if a task is capped to a specific performance 5034 5032 * point, y, then when: 5035 5033 * 5036 - * * util = 80% of x then it does not fit on cpu0 and should migrate 5037 - * to cpu1 5038 - * * util = 80% of y then it is forced to fit on cpu1 to honour 5034 + * * util = 80% of x then it does not fit on CPU0 and should migrate 5035 + * to CPU1 5036 + * * util = 80% of y then it is forced to fit on CPU1 to honour 5039 5037 * uclamp_max request. 5040 5038 * 5041 5039 * which is what we're enforcing here. A task always fits if ··· 5066 5064 * | | | | | | | 5067 5065 * | | | | | | | (region c, boosted, util < uclamp_min) 5068 5066 * +---------------------------------------- 5069 - * cpu0 cpu1 cpu2 5067 + * CPU0 CPU1 CPU2 5070 5068 * 5071 5069 * a) If util > uclamp_max, then we're capped, we don't care about 5072 5070 * actual fitness value here. We only care if uclamp_max fits ··· 5086 5084 * handle the case uclamp_min > uclamp_max. 
5087 5085 */ 5088 5086 uclamp_min = min(uclamp_min, uclamp_max); 5089 - if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal)) 5087 + if (fits && (util < uclamp_min) && 5088 + (uclamp_min > get_actual_cpu_capacity(cpu))) 5090 5089 return -1; 5091 5090 5092 5091 return fits; ··· 5107 5104 5108 5105 static inline void update_misfit_status(struct task_struct *p, struct rq *rq) 5109 5106 { 5107 + int cpu = cpu_of(rq); 5108 + 5110 5109 if (!sched_asym_cpucap_active()) 5111 5110 return; 5112 5111 5113 - if (!p || p->nr_cpus_allowed == 1) { 5114 - rq->misfit_task_load = 0; 5115 - return; 5116 - } 5112 + /* 5113 + * Affinity allows us to go somewhere higher? Or are we on biggest 5114 + * available CPU already? Or do we fit into this CPU ? 5115 + */ 5116 + if (!p || (p->nr_cpus_allowed == 1) || 5117 + (arch_scale_cpu_capacity(cpu) == p->max_allowed_capacity) || 5118 + task_fits_cpu(p, cpu)) { 5117 5119 5118 - if (task_fits_cpu(p, cpu_of(rq))) { 5119 5120 rq->misfit_task_load = 0; 5120 5121 return; 5121 5122 } ··· 5155 5148 static inline void 5156 5149 detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} 5157 5150 5158 - static inline int newidle_balance(struct rq *rq, struct rq_flags *rf) 5151 + static inline int sched_balance_newidle(struct rq *rq, struct rq_flags *rf) 5159 5152 { 5160 5153 return 0; 5161 5154 } ··· 5261 5254 se->vruntime = vruntime - lag; 5262 5255 5263 5256 /* 5264 - * When joining the competition; the exisiting tasks will be, 5257 + * When joining the competition; the existing tasks will be, 5265 5258 * on average, halfway through their slice, as such start tasks 5266 5259 * off with half a slice to ease into the competition. 5267 5260 */ ··· 5410 5403 * Now advance min_vruntime if @se was the entity holding it back, 5411 5404 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be 5412 5405 * put back on, and if we advance min_vruntime, we'll be placed back 5413 - * further than we started -- ie. 
we'll be penalized. 5406 + * further than we started -- i.e. we'll be penalized. 5414 5407 */ 5415 5408 if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE) 5416 5409 update_min_vruntime(cfs_rq); ··· 5446 5439 5447 5440 /* 5448 5441 * Track our maximum slice length, if the CPU's load is at 5449 - * least twice that of our own weight (i.e. dont track it 5442 + * least twice that of our own weight (i.e. don't track it 5450 5443 * when there are only lesser-weight tasks around): 5451 5444 */ 5452 5445 if (schedstat_enabled() && ··· 6682 6675 #ifdef CONFIG_SMP 6683 6676 static inline bool cpu_overutilized(int cpu) 6684 6677 { 6685 - unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN); 6686 - unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX); 6678 + unsigned long rq_util_min, rq_util_max; 6679 + 6680 + if (!sched_energy_enabled()) 6681 + return false; 6682 + 6683 + rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN); 6684 + rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX); 6687 6685 6688 6686 /* Return true only if the utilization doesn't fit CPU's capacity */ 6689 6687 return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu); 6690 6688 } 6691 6689 6692 - static inline void update_overutilized_status(struct rq *rq) 6690 + /* 6691 + * overutilized value make sense only if EAS is enabled 6692 + */ 6693 + static inline bool is_rd_overutilized(struct root_domain *rd) 6693 6694 { 6694 - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) { 6695 - WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED); 6696 - trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED); 6697 - } 6695 + return !sched_energy_enabled() || READ_ONCE(rd->overutilized); 6696 + } 6697 + 6698 + static inline void set_rd_overutilized(struct root_domain *rd, bool flag) 6699 + { 6700 + if (!sched_energy_enabled()) 6701 + return; 6702 + 6703 + WRITE_ONCE(rd->overutilized, flag); 6704 + trace_sched_overutilized_tp(rd, flag); 6705 + } 6706 + 
6707 + static inline void check_update_overutilized_status(struct rq *rq) 6708 + { 6709 + /* 6710 + * overutilized field is used for load balancing decisions only 6711 + * if energy aware scheduler is being used 6712 + */ 6713 + 6714 + if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu)) 6715 + set_rd_overutilized(rq->rd, 1); 6698 6716 } 6699 6717 #else 6700 - static inline void update_overutilized_status(struct rq *rq) { } 6718 + static inline void check_update_overutilized_status(struct rq *rq) { } 6701 6719 #endif 6702 6720 6703 6721 /* Runqueue only has SCHED_IDLE tasks enqueued */ ··· 6823 6791 * and the following generally works well enough in practice. 6824 6792 */ 6825 6793 if (!task_new) 6826 - update_overutilized_status(rq); 6794 + check_update_overutilized_status(rq); 6827 6795 6828 6796 enqueue_throttle: 6829 6797 assert_list_leaf_cfs_rq(rq); ··· 6910 6878 6911 6879 #ifdef CONFIG_SMP 6912 6880 6913 - /* Working cpumask for: load_balance, load_balance_newidle. */ 6881 + /* Working cpumask for: sched_balance_rq(), sched_balance_newidle(). */ 6914 6882 static DEFINE_PER_CPU(cpumask_var_t, load_balance_mask); 6915 6883 static DEFINE_PER_CPU(cpumask_var_t, select_rq_mask); 6916 6884 static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask); ··· 7142 7110 } 7143 7111 7144 7112 static struct sched_group * 7145 - find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu); 7113 + sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int this_cpu); 7146 7114 7147 7115 /* 7148 - * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group. 7116 + * sched_balance_find_dst_group_cpu - find the idlest CPU among the CPUs in the group. 
7149 7117 */ 7150 7118 static int 7151 - find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) 7119 + sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) 7152 7120 { 7153 7121 unsigned long load, min_load = ULONG_MAX; 7154 7122 unsigned int min_exit_latency = UINT_MAX; ··· 7204 7172 return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu; 7205 7173 } 7206 7174 7207 - static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, 7175 + static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct task_struct *p, 7208 7176 int cpu, int prev_cpu, int sd_flag) 7209 7177 { 7210 7178 int new_cpu = cpu; ··· 7229 7197 continue; 7230 7198 } 7231 7199 7232 - group = find_idlest_group(sd, p, cpu); 7200 + group = sched_balance_find_dst_group(sd, p, cpu); 7233 7201 if (!group) { 7234 7202 sd = sd->child; 7235 7203 continue; 7236 7204 } 7237 7205 7238 - new_cpu = find_idlest_group_cpu(group, p, cpu); 7206 + new_cpu = sched_balance_find_dst_group_cpu(group, p, cpu); 7239 7207 if (new_cpu == cpu) { 7240 7208 /* Now try balancing at a lower domain level of 'cpu': */ 7241 7209 sd = sd->child; ··· 7503 7471 * Look for the CPU with best capacity. 7504 7472 */ 7505 7473 else if (fits < 0) 7506 - cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu)); 7474 + cpu_cap = get_actual_cpu_capacity(cpu); 7507 7475 7508 7476 /* 7509 7477 * First, select CPU which fits better (-1 being better than 0). ··· 7547 7515 7548 7516 /* 7549 7517 * On asymmetric system, update task utilization because we will check 7550 - * that the task fits with cpu's capacity. 7518 + * that the task fits with CPU's capacity. 
7551 7519 */ 7552 7520 if (sched_asym_cpucap_active()) { 7553 7521 sync_entity_load_avg(&p->se); ··· 7980 7948 * NOTE: Forkees are not accepted in the energy-aware wake-up path because 7981 7949 * they don't have any useful utilization data yet and it's not possible to 7982 7950 * forecast their impact on energy consumption. Consequently, they will be 7983 - * placed by find_idlest_cpu() on the least loaded CPU, which might turn out 7951 + * placed by sched_balance_find_dst_cpu() on the least loaded CPU, which might turn out 7984 7952 * to be energy-inefficient in some use-cases. The alternative would be to 7985 7953 * bias new tasks towards specific types of CPUs first, or to try to infer 7986 7954 * their util_avg from the parent task, but those heuristics could hurt ··· 7996 7964 struct root_domain *rd = this_rq()->rd; 7997 7965 int cpu, best_energy_cpu, target = -1; 7998 7966 int prev_fits = -1, best_fits = -1; 7999 - unsigned long best_thermal_cap = 0; 8000 - unsigned long prev_thermal_cap = 0; 7967 + unsigned long best_actual_cap = 0; 7968 + unsigned long prev_actual_cap = 0; 8001 7969 struct sched_domain *sd; 8002 7970 struct perf_domain *pd; 8003 7971 struct energy_env eenv; 8004 7972 8005 7973 rcu_read_lock(); 8006 7974 pd = rcu_dereference(rd->pd); 8007 - if (!pd || READ_ONCE(rd->overutilized)) 7975 + if (!pd) 8008 7976 goto unlock; 8009 7977 8010 7978 /* ··· 8027 7995 8028 7996 for (; pd; pd = pd->next) { 8029 7997 unsigned long util_min = p_util_min, util_max = p_util_max; 8030 - unsigned long cpu_cap, cpu_thermal_cap, util; 7998 + unsigned long cpu_cap, cpu_actual_cap, util; 8031 7999 long prev_spare_cap = -1, max_spare_cap = -1; 8032 8000 unsigned long rq_util_min, rq_util_max; 8033 8001 unsigned long cur_delta, base_energy; ··· 8039 8007 if (cpumask_empty(cpus)) 8040 8008 continue; 8041 8009 8042 - /* Account thermal pressure for the energy estimation */ 8010 + /* Account external pressure for the energy estimation */ 8043 8011 cpu = 
cpumask_first(cpus); 8044 - cpu_thermal_cap = arch_scale_cpu_capacity(cpu); 8045 - cpu_thermal_cap -= arch_scale_thermal_pressure(cpu); 8012 + cpu_actual_cap = get_actual_cpu_capacity(cpu); 8046 8013 8047 - eenv.cpu_cap = cpu_thermal_cap; 8014 + eenv.cpu_cap = cpu_actual_cap; 8048 8015 eenv.pd_cap = 0; 8049 8016 8050 8017 for_each_cpu(cpu, cpus) { 8051 8018 struct rq *rq = cpu_rq(cpu); 8052 8019 8053 - eenv.pd_cap += cpu_thermal_cap; 8020 + eenv.pd_cap += cpu_actual_cap; 8054 8021 8055 8022 if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) 8056 8023 continue; ··· 8070 8039 if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) { 8071 8040 /* 8072 8041 * Open code uclamp_rq_util_with() except for 8073 - * the clamp() part. Ie: apply max aggregation 8042 + * the clamp() part. I.e.: apply max aggregation 8074 8043 * only. util_fits_cpu() logic requires to 8075 8044 * operate on non clamped util but must use the 8076 8045 * max-aggregated uclamp_{min, max}. ··· 8120 8089 if (prev_delta < base_energy) 8121 8090 goto unlock; 8122 8091 prev_delta -= base_energy; 8123 - prev_thermal_cap = cpu_thermal_cap; 8092 + prev_actual_cap = cpu_actual_cap; 8124 8093 best_delta = min(best_delta, prev_delta); 8125 8094 } 8126 8095 ··· 8135 8104 * but best energy cpu has better capacity. 
8136 8105 */ 8137 8106 if ((max_fits < 0) && 8138 - (cpu_thermal_cap <= best_thermal_cap)) 8107 + (cpu_actual_cap <= best_actual_cap)) 8139 8108 continue; 8140 8109 8141 8110 cur_delta = compute_energy(&eenv, pd, cpus, p, ··· 8156 8125 best_delta = cur_delta; 8157 8126 best_energy_cpu = max_spare_cap_cpu; 8158 8127 best_fits = max_fits; 8159 - best_thermal_cap = cpu_thermal_cap; 8128 + best_actual_cap = cpu_actual_cap; 8160 8129 } 8161 8130 } 8162 8131 rcu_read_unlock(); 8163 8132 8164 8133 if ((best_fits > prev_fits) || 8165 8134 ((best_fits > 0) && (best_delta < prev_delta)) || 8166 - ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap))) 8135 + ((best_fits < 0) && (best_actual_cap > prev_actual_cap))) 8167 8136 target = best_energy_cpu; 8168 8137 8169 8138 return target; ··· 8206 8175 cpumask_test_cpu(cpu, p->cpus_ptr)) 8207 8176 return cpu; 8208 8177 8209 - if (sched_energy_enabled()) { 8178 + if (!is_rd_overutilized(this_rq()->rd)) { 8210 8179 new_cpu = find_energy_efficient_cpu(p, prev_cpu); 8211 8180 if (new_cpu >= 0) 8212 8181 return new_cpu; ··· 8244 8213 8245 8214 if (unlikely(sd)) { 8246 8215 /* Slow path */ 8247 - new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag); 8216 + new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag); 8248 8217 } else if (wake_flags & WF_TTWU) { /* XXX always ? */ 8249 8218 /* Fast path */ 8250 8219 new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); ··· 8290 8259 remove_entity_load_avg(&p->se); 8291 8260 } 8292 8261 8262 + /* 8263 + * Set the max capacity the task is allowed to run at for misfit detection. 
8264 + */ 8265 + static void set_task_max_allowed_capacity(struct task_struct *p) 8266 + { 8267 + struct asym_cap_data *entry; 8268 + 8269 + if (!sched_asym_cpucap_active()) 8270 + return; 8271 + 8272 + rcu_read_lock(); 8273 + list_for_each_entry_rcu(entry, &asym_cap_list, link) { 8274 + cpumask_t *cpumask; 8275 + 8276 + cpumask = cpu_capacity_span(entry); 8277 + if (!cpumask_intersects(p->cpus_ptr, cpumask)) 8278 + continue; 8279 + 8280 + p->max_allowed_capacity = entry->capacity; 8281 + break; 8282 + } 8283 + rcu_read_unlock(); 8284 + } 8285 + 8286 + static void set_cpus_allowed_fair(struct task_struct *p, struct affinity_context *ctx) 8287 + { 8288 + set_cpus_allowed_common(p, ctx); 8289 + set_task_max_allowed_capacity(p); 8290 + } 8291 + 8293 8292 static int 8294 8293 balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 8295 8294 { 8296 8295 if (rq->nr_running) 8297 8296 return 1; 8298 8297 8299 - return newidle_balance(rq, rf) != 0; 8298 + return sched_balance_newidle(rq, rf) != 0; 8300 8299 } 8300 + #else 8301 + static inline void set_task_max_allowed_capacity(struct task_struct *p) {} 8301 8302 #endif /* CONFIG_SMP */ 8302 8303 8303 8304 static void set_next_buddy(struct sched_entity *se) ··· 8580 8517 if (!rf) 8581 8518 return NULL; 8582 8519 8583 - new_tasks = newidle_balance(rq, rf); 8520 + new_tasks = sched_balance_newidle(rq, rf); 8584 8521 8585 8522 /* 8586 - * Because newidle_balance() releases (and re-acquires) rq->lock, it is 8523 + * Because sched_balance_newidle() releases (and re-acquires) rq->lock, it is 8587 8524 * possible for any higher priority task to appear. In that case we 8588 8525 * must re-start the pick_next_entity() loop. 8589 8526 */ ··· 8661 8598 if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se))) 8662 8599 return false; 8663 8600 8664 - /* Tell the scheduler that we'd really like pse to run next. */ 8601 + /* Tell the scheduler that we'd really like se to run next. 
*/ 8665 8602 set_next_buddy(se); 8666 8603 8667 8604 yield_task_fair(rq); ··· 8999 8936 if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) 9000 8937 return 0; 9001 8938 9002 - /* Disregard pcpu kthreads; they are where they need to be. */ 8939 + /* Disregard percpu kthreads; they are where they need to be. */ 9003 8940 if (kthread_is_per_cpu(p)) 9004 8941 return 0; 9005 8942 ··· 9145 9082 * We don't want to steal all, otherwise we may be treated likewise, 9146 9083 * which could at worst lead to a livelock crash. 9147 9084 */ 9148 - if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) 9085 + if (env->idle && env->src_rq->nr_running <= 1) 9149 9086 break; 9150 9087 9151 9088 env->loop++; ··· 9324 9261 if (cpu_util_dl(rq)) 9325 9262 return true; 9326 9263 9327 - if (thermal_load_avg(rq)) 9264 + if (hw_load_avg(rq)) 9328 9265 return true; 9329 9266 9330 9267 if (cpu_util_irq(rq)) ··· 9354 9291 { 9355 9292 const struct sched_class *curr_class; 9356 9293 u64 now = rq_clock_pelt(rq); 9357 - unsigned long thermal_pressure; 9294 + unsigned long hw_pressure; 9358 9295 bool decayed; 9359 9296 9360 9297 /* ··· 9363 9300 */ 9364 9301 curr_class = rq->curr->sched_class; 9365 9302 9366 - thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); 9303 + hw_pressure = arch_scale_hw_pressure(cpu_of(rq)); 9367 9304 9368 9305 decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | 9369 9306 update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | 9370 - update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) | 9307 + update_hw_load_avg(now, rq, hw_pressure) | 9371 9308 update_irq_load_avg(rq, 0); 9372 9309 9373 9310 if (others_have_blocked(rq)) ··· 9486 9423 } 9487 9424 #endif 9488 9425 9489 - static void update_blocked_averages(int cpu) 9426 + static void sched_balance_update_blocked_averages(int cpu) 9490 9427 { 9491 9428 bool decayed = false, done = true; 9492 9429 struct rq *rq = cpu_rq(cpu); ··· 9505 9442 
rq_unlock_irqrestore(rq, &rf); 9506 9443 } 9507 9444 9508 - /********** Helpers for find_busiest_group ************************/ 9445 + /********** Helpers for sched_balance_find_src_group ************************/ 9509 9446 9510 9447 /* 9511 - * sg_lb_stats - stats of a sched_group required for load_balancing 9448 + * sg_lb_stats - stats of a sched_group required for load-balancing: 9512 9449 */ 9513 9450 struct sg_lb_stats { 9514 - unsigned long avg_load; /*Avg load across the CPUs of the group */ 9515 - unsigned long group_load; /* Total load over the CPUs of the group */ 9516 - unsigned long group_capacity; 9517 - unsigned long group_util; /* Total utilization over the CPUs of the group */ 9518 - unsigned long group_runnable; /* Total runnable time over the CPUs of the group */ 9519 - unsigned int sum_nr_running; /* Nr of tasks running in the group */ 9520 - unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */ 9521 - unsigned int idle_cpus; 9451 + unsigned long avg_load; /* Avg load over the CPUs of the group */ 9452 + unsigned long group_load; /* Total load over the CPUs of the group */ 9453 + unsigned long group_capacity; /* Capacity over the CPUs of the group */ 9454 + unsigned long group_util; /* Total utilization over the CPUs of the group */ 9455 + unsigned long group_runnable; /* Total runnable time over the CPUs of the group */ 9456 + unsigned int sum_nr_running; /* Nr of all tasks running in the group */ 9457 + unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */ 9458 + unsigned int idle_cpus; /* Nr of idle CPUs in the group */ 9522 9459 unsigned int group_weight; 9523 9460 enum group_type group_type; 9524 - unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */ 9525 - unsigned int group_smt_balance; /* Task on busy SMT be moved */ 9526 - unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ 9461 + unsigned int group_asym_packing; /* Tasks should be moved 
to preferred CPU */ 9462 + unsigned int group_smt_balance; /* Task on busy SMT be moved */ 9463 + unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ 9527 9464 #ifdef CONFIG_NUMA_BALANCING 9528 9465 unsigned int nr_numa_running; 9529 9466 unsigned int nr_preferred_running; ··· 9531 9468 }; 9532 9469 9533 9470 /* 9534 - * sd_lb_stats - Structure to store the statistics of a sched_domain 9535 - * during load balancing. 9471 + * sd_lb_stats - stats of a sched_domain required for load-balancing: 9536 9472 */ 9537 9473 struct sd_lb_stats { 9538 - struct sched_group *busiest; /* Busiest group in this sd */ 9539 - struct sched_group *local; /* Local group in this sd */ 9540 - unsigned long total_load; /* Total load of all groups in sd */ 9541 - unsigned long total_capacity; /* Total capacity of all groups in sd */ 9542 - unsigned long avg_load; /* Average load across all groups in sd */ 9543 - unsigned int prefer_sibling; /* tasks should go to sibling first */ 9474 + struct sched_group *busiest; /* Busiest group in this sd */ 9475 + struct sched_group *local; /* Local group in this sd */ 9476 + unsigned long total_load; /* Total load of all groups in sd */ 9477 + unsigned long total_capacity; /* Total capacity of all groups in sd */ 9478 + unsigned long avg_load; /* Average load across all groups in sd */ 9479 + unsigned int prefer_sibling; /* Tasks should go to sibling first */ 9544 9480 9545 - struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ 9546 - struct sg_lb_stats local_stat; /* Statistics of the local group */ 9481 + struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */ 9482 + struct sg_lb_stats local_stat; /* Statistics of the local group */ 9547 9483 }; 9548 9484 9549 9485 static inline void init_sd_lb_stats(struct sd_lb_stats *sds) ··· 9568 9506 9569 9507 static unsigned long scale_rt_capacity(int cpu) 9570 9508 { 9509 + unsigned long max = get_actual_cpu_capacity(cpu); 9571 9510 struct rq 
*rq = cpu_rq(cpu); 9572 - unsigned long max = arch_scale_cpu_capacity(cpu); 9573 9511 unsigned long used, free; 9574 9512 unsigned long irq; 9575 9513 ··· 9581 9519 /* 9582 9520 * avg_rt.util_avg and avg_dl.util_avg track binary signals 9583 9521 * (running and not running) with weights 0 and 1024 respectively. 9584 - * avg_thermal.load_avg tracks thermal pressure and the weighted 9585 - * average uses the actual delta max capacity(load). 9586 9522 */ 9587 9523 used = cpu_util_rt(rq); 9588 9524 used += cpu_util_dl(rq); 9589 - used += thermal_load_avg(rq); 9590 9525 9591 9526 if (unlikely(used >= max)) 9592 9527 return 1; ··· 9676 9617 (arch_scale_cpu_capacity(cpu_of(rq)) * 100)); 9677 9618 } 9678 9619 9679 - /* 9680 - * Check whether a rq has a misfit task and if it looks like we can actually 9681 - * help that task: we can migrate the task to a CPU of higher capacity, or 9682 - * the task's current CPU is heavily pressured. 9683 - */ 9684 - static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd) 9620 + /* Check if the rq has a misfit task */ 9621 + static inline bool check_misfit_status(struct rq *rq) 9685 9622 { 9686 - return rq->misfit_task_load && 9687 - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity || 9688 - check_cpu_capacity(rq, sd)); 9623 + return rq->misfit_task_load; 9689 9624 } 9690 9625 9691 9626 /* ··· 9703 9650 * 9704 9651 * When this is so detected; this group becomes a candidate for busiest; see 9705 9652 * update_sd_pick_busiest(). And calculate_imbalance() and 9706 - * find_busiest_group() avoid some of the usual balance conditions to allow it 9653 + * sched_balance_find_src_group() avoid some of the usual balance conditions to allow it 9707 9654 * to create an effective group imbalance. 
9708 9655 * 9709 9656 * This is a somewhat tricky proposition since the next run might not find the ··· 9868 9815 static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs, 9869 9816 struct sched_group *group) 9870 9817 { 9871 - if (env->idle == CPU_NOT_IDLE) 9818 + if (!env->idle) 9872 9819 return false; 9873 9820 9874 9821 /* ··· 9892 9839 int ncores_busiest, ncores_local; 9893 9840 long imbalance; 9894 9841 9895 - if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running) 9842 + if (!env->idle || !busiest->sum_nr_running) 9896 9843 return 0; 9897 9844 9898 9845 ncores_busiest = sds->busiest->cores; ··· 9938 9885 * @sds: Load-balancing data with statistics of the local group. 9939 9886 * @group: sched_group whose statistics are to be updated. 9940 9887 * @sgs: variable to hold the statistics for this group. 9941 - * @sg_status: Holds flag indicating the status of the sched_group 9888 + * @sg_overloaded: sched_group is overloaded 9889 + * @sg_overutilized: sched_group is overutilized 9942 9890 */ 9943 9891 static inline void update_sg_lb_stats(struct lb_env *env, 9944 9892 struct sd_lb_stats *sds, 9945 9893 struct sched_group *group, 9946 9894 struct sg_lb_stats *sgs, 9947 - int *sg_status) 9895 + bool *sg_overloaded, 9896 + bool *sg_overutilized) 9948 9897 { 9949 9898 int i, nr_running, local_group; 9950 9899 ··· 9967 9912 sgs->sum_nr_running += nr_running; 9968 9913 9969 9914 if (nr_running > 1) 9970 - *sg_status |= SG_OVERLOAD; 9915 + *sg_overloaded = 1; 9971 9916 9972 9917 if (cpu_overutilized(i)) 9973 - *sg_status |= SG_OVERUTILIZED; 9918 + *sg_overutilized = 1; 9974 9919 9975 9920 #ifdef CONFIG_NUMA_BALANCING 9976 9921 sgs->nr_numa_running += rq->nr_numa_running; ··· 9992 9937 /* Check for a misfit task on the cpu */ 9993 9938 if (sgs->group_misfit_task_load < rq->misfit_task_load) { 9994 9939 sgs->group_misfit_task_load = rq->misfit_task_load; 9995 - *sg_status |= SG_OVERLOAD; 9940 + *sg_overloaded = 1; 9996 9941 } 9997 - } else if 
((env->idle != CPU_NOT_IDLE) && 9998 - sched_reduced_capacity(rq, env->sd)) { 9942 + } else if (env->idle && sched_reduced_capacity(rq, env->sd)) { 9999 9943 /* Check for a task running on a CPU with reduced capacity */ 10000 9944 if (sgs->group_misfit_task_load < load) 10001 9945 sgs->group_misfit_task_load = load; ··· 10006 9952 sgs->group_weight = group->group_weight; 10007 9953 10008 9954 /* Check if dst CPU is idle and preferred to this group */ 10009 - if (!local_group && env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running && 9955 + if (!local_group && env->idle && sgs->sum_h_nr_running && 10010 9956 sched_group_asym(env, sgs, group)) 10011 9957 sgs->group_asym_packing = 1; 10012 9958 ··· 10144 10090 has_spare: 10145 10091 10146 10092 /* 10147 - * Select not overloaded group with lowest number of idle cpus 10093 + * Select not overloaded group with lowest number of idle CPUs 10148 10094 * and highest number of running tasks. We could also compare 10149 10095 * the spare capacity which is more stable but it can end up 10150 10096 * that the group has less spare capacity but finally more idle ··· 10364 10310 } 10365 10311 10366 10312 /* 10367 - * find_idlest_group() finds and returns the least busy CPU group within the 10313 + * sched_balance_find_dst_group() finds and returns the least busy CPU group within the 10368 10314 * domain. 10369 10315 * 10370 10316 * Assumes p is allowed on at least one CPU in sd. 
10371 10317 */ 10372 10318 static struct sched_group * 10373 - find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) 10319 + sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) 10374 10320 { 10375 10321 struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups; 10376 10322 struct sg_lb_stats local_sgs, tmp_sgs; ··· 10618 10564 struct sg_lb_stats *local = &sds->local_stat; 10619 10565 struct sg_lb_stats tmp_sgs; 10620 10566 unsigned long sum_util = 0; 10621 - int sg_status = 0; 10567 + bool sg_overloaded = 0, sg_overutilized = 0; 10622 10568 10623 10569 do { 10624 10570 struct sg_lb_stats *sgs = &tmp_sgs; ··· 10634 10580 update_group_capacity(env->sd, env->dst_cpu); 10635 10581 } 10636 10582 10637 - update_sg_lb_stats(env, sds, sg, sgs, &sg_status); 10583 + update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized); 10638 10584 10639 10585 if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) { 10640 10586 sds->busiest = sg; ··· 10662 10608 env->fbq_type = fbq_classify_group(&sds->busiest_stat); 10663 10609 10664 10610 if (!env->sd->parent) { 10665 - struct root_domain *rd = env->dst_rq->rd; 10666 - 10667 10611 /* update overload indicator if we are at root domain */ 10668 - WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD); 10612 + set_rd_overloaded(env->dst_rq->rd, sg_overloaded); 10669 10613 10670 10614 /* Update over-utilization (tipping point, U >= 0) indicator */ 10671 - WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED); 10672 - trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED); 10673 - } else if (sg_status & SG_OVERUTILIZED) { 10674 - struct root_domain *rd = env->dst_rq->rd; 10675 - 10676 - WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED); 10677 - trace_sched_overutilized_tp(rd, SG_OVERUTILIZED); 10615 + set_rd_overutilized(env->dst_rq->rd, sg_overutilized); 10616 + } else if (sg_overutilized) { 10617 + set_rd_overutilized(env->dst_rq->rd, 
sg_overutilized); 10678 10618 } 10679 10619 10680 10620 update_idle_cpu_scan(env, sum_util); ··· 10758 10710 * waiting task in this overloaded busiest group. Let's 10759 10711 * try to pull it. 10760 10712 */ 10761 - if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) { 10713 + if (env->idle && env->imbalance == 0) { 10762 10714 env->migration_type = migrate_task; 10763 10715 env->imbalance = 1; 10764 10716 } ··· 10777 10729 10778 10730 /* 10779 10731 * If there is no overload, we just want to even the number of 10780 - * idle cpus. 10732 + * idle CPUs. 10781 10733 */ 10782 10734 env->migration_type = migrate_task; 10783 10735 env->imbalance = max_t(long, 0, ··· 10850 10802 ) / SCHED_CAPACITY_SCALE; 10851 10803 } 10852 10804 10853 - /******* find_busiest_group() helpers end here *********************/ 10805 + /******* sched_balance_find_src_group() helpers end here *********************/ 10854 10806 10855 10807 /* 10856 10808 * Decision matrix according to the local and busiest group type: ··· 10873 10825 */ 10874 10826 10875 10827 /** 10876 - * find_busiest_group - Returns the busiest group within the sched_domain 10828 + * sched_balance_find_src_group - Returns the busiest group within the sched_domain 10877 10829 * if there is an imbalance. 10878 10830 * @env: The load balancing environment. 10879 10831 * ··· 10882 10834 * 10883 10835 * Return: - The busiest group if imbalance exists. 
10884 10836 */ 10885 - static struct sched_group *find_busiest_group(struct lb_env *env) 10837 + static struct sched_group *sched_balance_find_src_group(struct lb_env *env) 10886 10838 { 10887 10839 struct sg_lb_stats *local, *busiest; 10888 10840 struct sd_lb_stats sds; ··· 10905 10857 if (busiest->group_type == group_misfit_task) 10906 10858 goto force_balance; 10907 10859 10908 - if (sched_energy_enabled()) { 10909 - struct root_domain *rd = env->dst_rq->rd; 10910 - 10911 - if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized)) 10912 - goto out_balanced; 10913 - } 10860 + if (!is_rd_overutilized(env->dst_rq->rd) && 10861 + rcu_dereference(env->dst_rq->rd->pd)) 10862 + goto out_balanced; 10914 10863 10915 10864 /* ASYM feature bypasses nice load balance check */ 10916 10865 if (busiest->group_type == group_asym_packing) ··· 10970 10925 goto force_balance; 10971 10926 10972 10927 if (busiest->group_type != group_overloaded) { 10973 - if (env->idle == CPU_NOT_IDLE) { 10928 + if (!env->idle) { 10974 10929 /* 10975 10930 * If the busiest group is not overloaded (and as a 10976 10931 * result the local one too) but this CPU is already ··· 11018 10973 } 11019 10974 11020 10975 /* 11021 - * find_busiest_queue - find the busiest runqueue among the CPUs in the group. 10976 + * sched_balance_find_src_rq - find the busiest runqueue among the CPUs in the group. 11022 10977 */ 11023 - static struct rq *find_busiest_queue(struct lb_env *env, 10978 + static struct rq *sched_balance_find_src_rq(struct lb_env *env, 11024 10979 struct sched_group *group) 11025 10980 { 11026 10981 struct rq *busiest = NULL, *rq; ··· 11178 11133 * the lower priority @env::dst_cpu help it. Do not follow 11179 11134 * CPU priority. 
11180 11135 */ 11181 - return env->idle != CPU_NOT_IDLE && sched_use_asym_prio(env->sd, env->dst_cpu) && 11136 + return env->idle && sched_use_asym_prio(env->sd, env->dst_cpu) && 11182 11137 (sched_asym_prefer(env->dst_cpu, env->src_cpu) || 11183 11138 !sched_use_asym_prio(env->sd, env->src_cpu)); 11184 11139 } ··· 11216 11171 * because of other sched_class or IRQs if more capacity stays 11217 11172 * available on dst_cpu. 11218 11173 */ 11219 - if ((env->idle != CPU_NOT_IDLE) && 11174 + if (env->idle && 11220 11175 (env->src_rq->cfs.h_nr_running == 1)) { 11221 11176 if ((check_cpu_capacity(env->src_rq, sd)) && 11222 11177 (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100)) ··· 11301 11256 * Check this_cpu to ensure it is balanced within domain. Attempt to move 11302 11257 * tasks if there is an imbalance. 11303 11258 */ 11304 - static int load_balance(int this_cpu, struct rq *this_rq, 11259 + static int sched_balance_rq(int this_cpu, struct rq *this_rq, 11305 11260 struct sched_domain *sd, enum cpu_idle_type idle, 11306 11261 int *continue_balancing) 11307 11262 { ··· 11333 11288 goto out_balanced; 11334 11289 } 11335 11290 11336 - group = find_busiest_group(&env); 11291 + group = sched_balance_find_src_group(&env); 11337 11292 if (!group) { 11338 11293 schedstat_inc(sd->lb_nobusyg[idle]); 11339 11294 goto out_balanced; 11340 11295 } 11341 11296 11342 - busiest = find_busiest_queue(&env, group); 11297 + busiest = sched_balance_find_src_rq(&env, group); 11343 11298 if (!busiest) { 11344 11299 schedstat_inc(sd->lb_nobusyq[idle]); 11345 11300 goto out_balanced; ··· 11357 11312 env.flags |= LBF_ALL_PINNED; 11358 11313 if (busiest->nr_running > 1) { 11359 11314 /* 11360 - * Attempt to move tasks. If find_busiest_group has found 11315 + * Attempt to move tasks. If sched_balance_find_src_group has found 11361 11316 * an imbalance but busiest->nr_running <= 1, the group is 11362 11317 * still unbalanced. 
ld_moved simply stays zero, so it is 11363 11318 * correctly treated as an imbalance. ··· 11472 11427 * We do not want newidle balance, which can be very 11473 11428 * frequent, pollute the failure counter causing 11474 11429 * excessive cache_hot migrations and active balances. 11430 + * 11431 + * Similarly for migration_misfit which is not related to 11432 + * load/util migration, don't pollute nr_balance_failed. 11475 11433 */ 11476 - if (idle != CPU_NEWLY_IDLE) 11434 + if (idle != CPU_NEWLY_IDLE && 11435 + env.migration_type != migrate_misfit) 11477 11436 sd->nr_balance_failed++; 11478 11437 11479 11438 if (need_active_balance(&env)) { ··· 11556 11507 ld_moved = 0; 11557 11508 11558 11509 /* 11559 - * newidle_balance() disregards balance intervals, so we could 11510 + * sched_balance_newidle() disregards balance intervals, so we could 11560 11511 * repeatedly reach this code, which would lead to balance_interval 11561 11512 * skyrocketing in a short amount of time. Skip the balance_interval 11562 11513 * increase logic to avoid that. 11514 + * 11515 + * Similarly misfit migration which is not necessarily an indication of 11516 + * the system being busy and requires lb to backoff to let it settle 11517 + * down. 11563 11518 */ 11564 - if (env.idle == CPU_NEWLY_IDLE) 11519 + if (env.idle == CPU_NEWLY_IDLE || 11520 + env.migration_type == migrate_misfit) 11565 11521 goto out; 11566 11522 11567 11523 /* tune up the balancing interval */ ··· 11699 11645 return 0; 11700 11646 } 11701 11647 11702 - static DEFINE_SPINLOCK(balancing); 11648 + /* 11649 + * This flag serializes load-balancing passes over large domains 11650 + * (above the NODE topology level) - only one load-balancing instance 11651 + * may run at a time, to reduce overhead on very large systems with 11652 + * lots of CPUs and large NUMA distances. 11653 + * 11654 + * - Note that load-balancing passes triggered while another one 11655 + * is executing are skipped and not re-tried. 
11656 + * 11657 + * - Also note that this does not serialize rebalance_domains() 11658 + * execution, as non-SD_SERIALIZE domains will still be 11659 + * load-balanced in parallel. 11660 + */ 11661 + static atomic_t sched_balance_running = ATOMIC_INIT(0); 11703 11662 11704 11663 /* 11705 - * Scale the max load_balance interval with the number of CPUs in the system. 11664 + * Scale the max sched_balance_rq interval with the number of CPUs in the system. 11706 11665 * This trades load-balance latency on larger machines for less cross talk. 11707 11666 */ 11708 11667 void update_max_interval(void) ··· 11753 11686 * 11754 11687 * Balancing parameters are set up in init_sched_domains. 11755 11688 */ 11756 - static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle) 11689 + static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle) 11757 11690 { 11758 11691 int continue_balancing = 1; 11759 11692 int cpu = rq->cpu; ··· 11790 11723 11791 11724 need_serialize = sd->flags & SD_SERIALIZE; 11792 11725 if (need_serialize) { 11793 - if (!spin_trylock(&balancing)) 11726 + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) 11794 11727 goto out; 11795 11728 } 11796 11729 11797 11730 if (time_after_eq(jiffies, sd->last_balance + interval)) { 11798 - if (load_balance(cpu, rq, sd, idle, &continue_balancing)) { 11731 + if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) { 11799 11732 /* 11800 11733 * The LBF_DST_PINNED logic could have changed 11801 11734 * env->dst_cpu, so we can't know our idle 11802 11735 * state even if we migrated tasks. Update it. 11803 11736 */ 11804 - idle = idle_cpu(cpu) ? 
CPU_IDLE : CPU_NOT_IDLE; 11805 - busy = idle != CPU_IDLE && !sched_idle_cpu(cpu); 11737 + idle = idle_cpu(cpu); 11738 + busy = !idle && !sched_idle_cpu(cpu); 11806 11739 } 11807 11740 sd->last_balance = jiffies; 11808 11741 interval = get_sd_balance_interval(sd, busy); 11809 11742 } 11810 11743 if (need_serialize) 11811 - spin_unlock(&balancing); 11744 + atomic_set_release(&sched_balance_running, 0); 11812 11745 out: 11813 11746 if (time_after(next_balance, sd->last_balance + interval)) { 11814 11747 next_balance = sd->last_balance + interval; ··· 11968 11901 * currently idle; in which case, kick the ILB to move tasks 11969 11902 * around. 11970 11903 * 11971 - * When balancing betwen cores, all the SMT siblings of the 11904 + * When balancing between cores, all the SMT siblings of the 11972 11905 * preferred CPU must be idle. 11973 11906 */ 11974 11907 for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) { ··· 11985 11918 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU 11986 11919 * to run the misfit task on. 11987 11920 */ 11988 - if (check_misfit_status(rq, sd)) { 11921 + if (check_misfit_status(rq)) { 11989 11922 flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK; 11990 11923 goto unlock; 11991 11924 } ··· 12129 12062 out: 12130 12063 /* 12131 12064 * Each time a cpu enter idle, we assume that it has blocked load and 12132 - * enable the periodic update of the load of idle cpus 12065 + * enable the periodic update of the load of idle CPUs 12133 12066 */ 12134 12067 WRITE_ONCE(nohz.has_blocked, 1); 12135 12068 } ··· 12147 12080 if (!time_after(jiffies, READ_ONCE(rq->last_blocked_load_update_tick))) 12148 12081 return true; 12149 12082 12150 - update_blocked_averages(cpu); 12083 + sched_balance_update_blocked_averages(cpu); 12151 12084 12152 12085 return rq->has_blocked_load; 12153 12086 } 12154 12087 12155 12088 /* 12156 - * Internal function that runs load balance for all idle cpus. 
The load balance 12089 + * Internal function that runs load balance for all idle CPUs. The load balance 12157 12090 * can be a simple update of blocked load or a complete load balance with 12158 12091 * tasks movement depending of flags. 12159 12092 */ ··· 12229 12162 rq_unlock_irqrestore(rq, &rf); 12230 12163 12231 12164 if (flags & NOHZ_BALANCE_KICK) 12232 - rebalance_domains(rq, CPU_IDLE); 12165 + sched_balance_domains(rq, CPU_IDLE); 12233 12166 } 12234 12167 12235 12168 if (time_after(next_balance, rq->next_balance)) { ··· 12258 12191 12259 12192 /* 12260 12193 * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the 12261 - * rebalancing for all the cpus for whom scheduler ticks are stopped. 12194 + * rebalancing for all the CPUs for whom scheduler ticks are stopped. 12262 12195 */ 12263 12196 static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) 12264 12197 { ··· 12289 12222 * called from this function on (this) CPU that's not yet in the mask. That's 12290 12223 * OK because the goal of nohz_run_idle_balance() is to run ILB only for 12291 12224 * updating the blocked load of already idle CPUs without waking up one of 12292 - * those idle CPUs and outside the preempt disable / irq off phase of the local 12225 + * those idle CPUs and outside the preempt disable / IRQ off phase of the local 12293 12226 * cpu about to enter idle, because it can take a long time. 12294 12227 */ 12295 12228 void nohz_run_idle_balance(int cpu) ··· 12300 12233 12301 12234 /* 12302 12235 * Update the blocked load only if no SCHED_SOFTIRQ is about to happen 12303 - * (ie NOHZ_STATS_KICK set) and will do the same. 12236 + * (i.e. NOHZ_STATS_KICK set) and will do the same. 
12304 12237 */ 12305 12238 if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) 12306 12239 _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); ··· 12345 12278 #endif /* CONFIG_NO_HZ_COMMON */ 12346 12279 12347 12280 /* 12348 - * newidle_balance is called by schedule() if this_cpu is about to become 12281 + * sched_balance_newidle is called by schedule() if this_cpu is about to become 12349 12282 * idle. Attempts to pull tasks from other CPUs. 12350 12283 * 12351 12284 * Returns: ··· 12353 12286 * 0 - failed, no new tasks 12354 12287 * > 0 - success, new (fair) tasks present 12355 12288 */ 12356 - static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) 12289 + static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf) 12357 12290 { 12358 12291 unsigned long next_balance = jiffies + HZ; 12359 12292 int this_cpu = this_rq->cpu; 12293 + int continue_balancing = 1; 12360 12294 u64 t0, t1, curr_cost = 0; 12361 12295 struct sched_domain *sd; 12362 12296 int pulled_task = 0; ··· 12372 12304 return 0; 12373 12305 12374 12306 /* 12375 - * We must set idle_stamp _before_ calling idle_balance(), such that we 12376 - * measure the duration of idle_balance() as idle time. 12307 + * We must set idle_stamp _before_ calling sched_balance_rq() 12308 + * for CPU_NEWLY_IDLE, such that we measure this duration 12309 + * as idle time.
12377 12310 */ 12378 12311 this_rq->idle_stamp = rq_clock(this_rq); 12379 12312 ··· 12395 12326 rcu_read_lock(); 12396 12327 sd = rcu_dereference_check_sched_domain(this_rq->sd); 12397 12328 12398 - if (!READ_ONCE(this_rq->rd->overload) || 12329 + if (!get_rd_overloaded(this_rq->rd) || 12399 12330 (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) { 12400 12331 12401 12332 if (sd) ··· 12409 12340 raw_spin_rq_unlock(this_rq); 12410 12341 12411 12342 t0 = sched_clock_cpu(this_cpu); 12412 - update_blocked_averages(this_cpu); 12343 + sched_balance_update_blocked_averages(this_cpu); 12413 12344 12414 12345 rcu_read_lock(); 12415 12346 for_each_domain(this_cpu, sd) { 12416 - int continue_balancing = 1; 12417 12347 u64 domain_cost; 12418 12348 12419 12349 update_next_balance(sd, &next_balance); ··· 12422 12354 12423 12355 if (sd->flags & SD_BALANCE_NEWIDLE) { 12424 12356 12425 - pulled_task = load_balance(this_cpu, this_rq, 12357 + pulled_task = sched_balance_rq(this_cpu, this_rq, 12426 12358 sd, CPU_NEWLY_IDLE, 12427 12359 &continue_balancing); 12428 12360 ··· 12438 12370 * Stop searching for tasks to pull if there are 12439 12371 * now runnable tasks on this rq. 12440 12372 */ 12441 - if (pulled_task || this_rq->nr_running > 0 || 12442 - this_rq->ttwu_pending) 12373 + if (pulled_task || !continue_balancing) 12443 12374 break; 12444 12375 } 12445 12376 rcu_read_unlock(); ··· 12476 12409 } 12477 12410 12478 12411 /* 12479 - * run_rebalance_domains is triggered when needed from the scheduler tick. 12480 - * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
12412 + * This softirq handler is triggered via SCHED_SOFTIRQ from two places: 12413 + * 12414 + * - directly from the local scheduler_tick() for periodic load balancing 12415 + * 12416 + * - indirectly from a remote scheduler_tick() for NOHZ idle balancing 12417 + * through the SMP cross-call nohz_csd_func() 12481 12418 */ 12482 - static __latent_entropy void run_rebalance_domains(struct softirq_action *h) 12419 + static __latent_entropy void sched_balance_softirq(struct softirq_action *h) 12483 12420 { 12484 12421 struct rq *this_rq = this_rq(); 12485 - enum cpu_idle_type idle = this_rq->idle_balance ? 12486 - CPU_IDLE : CPU_NOT_IDLE; 12487 - 12422 + enum cpu_idle_type idle = this_rq->idle_balance; 12488 12423 /* 12489 - * If this CPU has a pending nohz_balance_kick, then do the 12424 + * If this CPU has a pending NOHZ_BALANCE_KICK, then do the 12490 12425 * balancing on behalf of the other idle CPUs whose ticks are 12491 - * stopped. Do nohz_idle_balance *before* rebalance_domains to 12426 + * stopped. Do nohz_idle_balance *before* sched_balance_domains to 12492 12427 * give the idle CPUs a chance to load balance. Else we may 12493 12428 * load balance only within the local sched_domain hierarchy 12494 12429 * and abort nohz_idle_balance altogether if we pull some load. ··· 12499 12430 return; 12500 12431 12501 12432 /* normal load balance */ 12502 - update_blocked_averages(this_rq->cpu); 12503 - rebalance_domains(this_rq, idle); 12433 + sched_balance_update_blocked_averages(this_rq->cpu); 12434 + sched_balance_domains(this_rq, idle); 12504 12435 } 12505 12436 12506 12437 /* 12507 12438 * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing. 
12508 12439 */ 12509 - void trigger_load_balance(struct rq *rq) 12440 + void sched_balance_trigger(struct rq *rq) 12510 12441 { 12511 12442 /* 12512 12443 * Don't need to rebalance while attached to NULL domain or ··· 12690 12621 task_tick_numa(rq, curr); 12691 12622 12692 12623 update_misfit_status(curr, rq); 12693 - update_overutilized_status(task_rq(curr)); 12624 + check_update_overutilized_status(task_rq(curr)); 12694 12625 12695 12626 task_tick_core(rq, curr); 12696 12627 } ··· 12709 12640 12710 12641 rq_lock(rq, &rf); 12711 12642 update_rq_clock(rq); 12643 + 12644 + set_task_max_allowed_capacity(p); 12712 12645 12713 12646 cfs_rq = task_cfs_rq(current); 12714 12647 curr = cfs_rq->curr; ··· 12834 12763 static void switched_to_fair(struct rq *rq, struct task_struct *p) 12835 12764 { 12836 12765 attach_task_cfs_rq(p); 12766 + 12767 + set_task_max_allowed_capacity(p); 12837 12768 12838 12769 if (task_on_rq_queued(p)) { 12839 12770 /* ··· 13208 13135 .rq_offline = rq_offline_fair, 13209 13136 13210 13137 .task_dead = task_dead_fair, 13211 - .set_cpus_allowed = set_cpus_allowed_common, 13138 + .set_cpus_allowed = set_cpus_allowed_fair, 13212 13139 #endif 13213 13140 13214 13141 .task_tick = task_tick_fair, ··· 13288 13215 #endif 13289 13216 } 13290 13217 13291 - open_softirq(SCHED_SOFTIRQ, run_rebalance_domains); 13218 + open_softirq(SCHED_SOFTIRQ, sched_balance_softirq); 13292 13219 13293 13220 #ifdef CONFIG_NO_HZ_COMMON 13294 13221 nohz.next_balance = jiffies;
+1 -1
kernel/sched/loadavg.c
··· 379 379 } 380 380 381 381 /* 382 - * Called from scheduler_tick() to periodically update this CPU's 382 + * Called from sched_tick() to periodically update this CPU's 383 383 * active count. 384 384 */ 385 385 void calc_global_load_tick(struct rq *this_rq)
+11 -11
kernel/sched/pelt.c
··· 208 208 * se has been already dequeued but cfs_rq->curr still points to it. 209 209 * This means that weight will be 0 but not running for a sched_entity 210 210 * but also for a cfs_rq if the latter becomes idle. As an example, 211 - * this happens during idle_balance() which calls 212 - * update_blocked_averages(). 211 + * this happens during sched_balance_newidle() which calls 212 + * sched_balance_update_blocked_averages(). 213 213 * 214 214 * Also see the comment in accumulate_sum(). 215 215 */ ··· 384 384 return 0; 385 385 } 386 386 387 - #ifdef CONFIG_SCHED_THERMAL_PRESSURE 387 + #ifdef CONFIG_SCHED_HW_PRESSURE 388 388 /* 389 - * thermal: 389 + * hardware: 390 390 * 391 391 * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked 392 392 * 393 393 * util_avg and runnable_load_avg are not supported and meaningless. 394 394 * 395 395 * Unlike rt/dl utilization tracking that track time spent by a cpu 396 - * running a rt/dl task through util_avg, the average thermal pressure is 397 - * tracked through load_avg. This is because thermal pressure signal is 396 + * running a rt/dl task through util_avg, the average HW pressure is 397 + * tracked through load_avg. This is because HW pressure signal is 398 398 * time weighted "delta" capacity unlike util_avg which is binary. 399 399 * "delta capacity" = actual capacity - 400 - * capped capacity a cpu due to a thermal event. 400 + * capped capacity a cpu due to a HW event. 401 401 */ 402 402 403 - int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity) 403 + int update_hw_load_avg(u64 now, struct rq *rq, u64 capacity) 404 404 { 405 - if (___update_load_sum(now, &rq->avg_thermal, 405 + if (___update_load_sum(now, &rq->avg_hw, 406 406 capacity, 407 407 capacity, 408 408 capacity)) { 409 - ___update_load_avg(&rq->avg_thermal, 1); 410 - trace_pelt_thermal_tp(rq); 409 + ___update_load_avg(&rq->avg_hw, 1); 410 + trace_pelt_hw_tp(rq); 411 411 return 1; 412 412 } 413 413
+8 -8
kernel/sched/pelt.h
··· 7 7 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running); 8 8 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running); 9 9 10 - #ifdef CONFIG_SCHED_THERMAL_PRESSURE 11 - int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity); 10 + #ifdef CONFIG_SCHED_HW_PRESSURE 11 + int update_hw_load_avg(u64 now, struct rq *rq, u64 capacity); 12 12 13 - static inline u64 thermal_load_avg(struct rq *rq) 13 + static inline u64 hw_load_avg(struct rq *rq) 14 14 { 15 - return READ_ONCE(rq->avg_thermal.load_avg); 15 + return READ_ONCE(rq->avg_hw.load_avg); 16 16 } 17 17 #else 18 18 static inline int 19 - update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity) 19 + update_hw_load_avg(u64 now, struct rq *rq, u64 capacity) 20 20 { 21 21 return 0; 22 22 } 23 23 24 - static inline u64 thermal_load_avg(struct rq *rq) 24 + static inline u64 hw_load_avg(struct rq *rq) 25 25 { 26 26 return 0; 27 27 } ··· 202 202 } 203 203 204 204 static inline int 205 - update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity) 205 + update_hw_load_avg(u64 now, struct rq *rq, u64 capacity) 206 206 { 207 207 return 0; 208 208 } 209 209 210 - static inline u64 thermal_load_avg(struct rq *rq) 210 + static inline u64 hw_load_avg(struct rq *rq) 211 211 { 212 212 return 0; 213 213 }
+35 -36
kernel/sched/sched.h
··· 112 112 extern int sched_rr_timeslice; 113 113 114 114 /* 115 + * Asymmetric CPU capacity bits 116 + */ 117 + struct asym_cap_data { 118 + struct list_head link; 119 + struct rcu_head rcu; 120 + unsigned long capacity; 121 + unsigned long cpus[]; 122 + }; 123 + 124 + extern struct list_head asym_cap_list; 125 + 126 + #define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus) 127 + 128 + /* 115 129 * Helpers for converting nanosecond timing to jiffy resolution 116 130 */ 117 131 #define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ)) ··· 715 701 } highest_prio; 716 702 #endif 717 703 #ifdef CONFIG_SMP 718 - int overloaded; 704 + bool overloaded; 719 705 struct plist_head pushable_tasks; 720 706 721 707 #endif /* CONFIG_SMP */ ··· 759 745 u64 next; 760 746 } earliest_dl; 761 747 762 - int overloaded; 748 + bool overloaded; 763 749 764 750 /* 765 751 * Tasks on this rq that can be pushed away. They are kept in ··· 852 838 struct rcu_head rcu; 853 839 }; 854 840 855 - /* Scheduling group status flags */ 856 - #define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */ 857 - #define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */ 858 - 859 841 /* 860 842 * We add the notion of a root-domain which will be used to define per-domain 861 843 * variables. Each exclusive cpuset essentially defines an island domain by ··· 872 862 * - More than one runnable task 873 863 * - Running task is misfit 874 864 */ 875 - int overload; 865 + bool overloaded; 876 866 877 867 /* Indicate one or more cpus over-utilized (tipping point) */ 878 - int overutilized; 868 + bool overutilized; 879 869 880 870 /* 881 871 * The bit corresponding to a CPU gets set here if such CPU has more ··· 915 905 cpumask_var_t rto_mask; 916 906 struct cpupri cpupri; 917 907 918 - unsigned long max_cpu_capacity; 919 - 920 908 /* 921 909 * NULL-terminated list of performance domains intersecting with the 922 910 * CPUs of the rd. Protected by RCU. 
··· 927 919 extern void rq_attach_root(struct rq *rq, struct root_domain *rd); 928 920 extern void sched_get_rd(struct root_domain *rd); 929 921 extern void sched_put_rd(struct root_domain *rd); 922 + 923 + static inline int get_rd_overloaded(struct root_domain *rd) 924 + { 925 + return READ_ONCE(rd->overloaded); 926 + } 927 + 928 + static inline void set_rd_overloaded(struct root_domain *rd, int status) 929 + { 930 + if (get_rd_overloaded(rd) != status) 931 + WRITE_ONCE(rd->overloaded, status); 932 + } 930 933 931 934 #ifdef HAVE_RT_PUSH_IPI 932 935 extern void rto_push_irq_work_func(struct irq_work *work); ··· 1110 1091 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ 1111 1092 struct sched_avg avg_irq; 1112 1093 #endif 1113 - #ifdef CONFIG_SCHED_THERMAL_PRESSURE 1114 - struct sched_avg avg_thermal; 1094 + #ifdef CONFIG_SCHED_HW_PRESSURE 1095 + struct sched_avg avg_hw; 1115 1096 #endif 1116 1097 u64 idle_stamp; 1117 1098 u64 avg_idle; ··· 1550 1531 assert_clock_updated(rq); 1551 1532 1552 1533 return rq->clock_task; 1553 - } 1554 - 1555 - /** 1556 - * By default the decay is the default pelt decay period. 1557 - * The decay shift can change the decay period in 1558 - * multiples of 32. 
1559 - * Decay shift Decay period(ms) 1560 - * 0 32 1561 - * 1 64 1562 - * 2 128 1563 - * 3 256 1564 - * 4 512 1565 - */ 1566 - extern int sched_thermal_decay_shift; 1567 - 1568 - static inline u64 rq_clock_thermal(struct rq *rq) 1569 - { 1570 - return rq_clock_task(rq) >> sched_thermal_decay_shift; 1571 1534 } 1572 1535 1573 1536 static inline void rq_clock_skip_update(struct rq *rq) ··· 2400 2399 2401 2400 extern void update_group_capacity(struct sched_domain *sd, int cpu); 2402 2401 2403 - extern void trigger_load_balance(struct rq *rq); 2402 + extern void sched_balance_trigger(struct rq *rq); 2404 2403 2405 2404 extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx); 2406 2405 ··· 2520 2519 } 2521 2520 2522 2521 #ifdef CONFIG_SMP 2523 - if (prev_nr < 2 && rq->nr_running >= 2) { 2524 - if (!READ_ONCE(rq->rd->overload)) 2525 - WRITE_ONCE(rq->rd->overload, 1); 2526 - } 2522 + if (prev_nr < 2 && rq->nr_running >= 2) 2523 + set_rd_overloaded(rq->rd, 1); 2527 2524 #endif 2528 2525 2529 2526 sched_update_tick_dependency(rq); ··· 2905 2906 #define NOHZ_NEWILB_KICK_BIT 2 2906 2907 #define NOHZ_NEXT_KICK_BIT 3 2907 2908 2908 - /* Run rebalance_domains() */ 2909 + /* Run sched_balance_domains() */ 2909 2910 #define NOHZ_BALANCE_KICK BIT(NOHZ_BALANCE_KICK_BIT) 2910 2911 /* Update blocked load */ 2911 2912 #define NOHZ_STATS_KICK BIT(NOHZ_STATS_KICK_BIT)
+2 -3
kernel/sched/stats.c
··· 113 113 * Bump this up when changing the output format or the meaning of an existing 114 114 * format, so that tools can adapt (or abort) 115 115 */ 116 - #define SCHEDSTAT_VERSION 15 116 + #define SCHEDSTAT_VERSION 16 117 117 118 118 static int show_schedstat(struct seq_file *seq, void *v) 119 119 { ··· 150 150 151 151 seq_printf(seq, "domain%d %*pb", dcount++, 152 152 cpumask_pr_args(sched_domain_span(sd))); 153 - for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES; 154 - itype++) { 153 + for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) { 155 154 seq_printf(seq, " %u %u %u %u %u %u %u %u", 156 155 sd->lb_count[itype], 157 156 sd->lb_balanced[itype],
+27 -29
kernel/sched/topology.c
··· 1330 1330 } 1331 1331 1332 1332 /* 1333 - * Asymmetric CPU capacity bits 1334 - */ 1335 - struct asym_cap_data { 1336 - struct list_head link; 1337 - unsigned long capacity; 1338 - unsigned long cpus[]; 1339 - }; 1340 - 1341 - /* 1342 1333 * Set of available CPUs grouped by their corresponding capacities 1343 1334 * Each list entry contains a CPU mask reflecting CPUs that share the same 1344 1335 * capacity. 1345 1336 * The lifespan of data is unlimited. 1346 1337 */ 1347 - static LIST_HEAD(asym_cap_list); 1348 - 1349 - #define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus) 1338 + LIST_HEAD(asym_cap_list); 1350 1339 1351 1340 /* 1352 1341 * Verify whether there is any CPU capacity asymmetry in a given sched domain. ··· 1375 1386 1376 1387 } 1377 1388 1389 + static void free_asym_cap_entry(struct rcu_head *head) 1390 + { 1391 + struct asym_cap_data *entry = container_of(head, struct asym_cap_data, rcu); 1392 + kfree(entry); 1393 + } 1394 + 1378 1395 static inline void asym_cpu_capacity_update_data(int cpu) 1379 1396 { 1380 1397 unsigned long capacity = arch_scale_cpu_capacity(cpu); 1381 - struct asym_cap_data *entry = NULL; 1398 + struct asym_cap_data *insert_entry = NULL; 1399 + struct asym_cap_data *entry; 1382 1400 1401 + /* 1402 + * Search if the capacity already exists. If not, track the entry 1403 + * after which we should insert, to keep the list sorted in descending order.
1404 + */ 1383 1405 list_for_each_entry(entry, &asym_cap_list, link) { 1384 1406 if (capacity == entry->capacity) 1385 1407 goto done; 1408 + else if (!insert_entry && capacity > entry->capacity) 1409 + insert_entry = list_prev_entry(entry, link); 1386 1410 } 1387 1411 1388 1412 entry = kzalloc(sizeof(*entry) + cpumask_size(), GFP_KERNEL); 1389 1413 if (WARN_ONCE(!entry, "Failed to allocate memory for asymmetry data\n")) 1390 1414 return; 1391 1415 entry->capacity = capacity; 1392 - list_add(&entry->link, &asym_cap_list); 1416 + 1417 + /* If NULL then the new capacity is the smallest, add last. */ 1418 + if (!insert_entry) 1419 + list_add_tail_rcu(&entry->link, &asym_cap_list); 1420 + else 1421 + list_add_rcu(&entry->link, &insert_entry->link); 1393 1422 done: 1394 1423 __cpumask_set_cpu(cpu, cpu_capacity_span(entry)); 1395 1424 } ··· 1430 1423 1431 1424 list_for_each_entry_safe(entry, next, &asym_cap_list, link) { 1432 1425 if (cpumask_empty(cpu_capacity_span(entry))) { 1433 - list_del(&entry->link); 1434 - kfree(entry); 1426 + list_del_rcu(&entry->link); 1427 + call_rcu(&entry->rcu, free_asym_cap_entry); 1435 1428 } 1436 1429 } 1437 1430 ··· 1441 1434 */ 1442 1435 if (list_is_singular(&asym_cap_list)) { 1443 1436 entry = list_first_entry(&asym_cap_list, typeof(*entry), link); 1444 - list_del(&entry->link); 1445 - kfree(entry); 1437 + list_del_rcu(&entry->link); 1438 + call_rcu(&entry->rcu, free_asym_cap_entry); 1446 1439 } 1447 1440 } 1448 1441 ··· 2514 2507 /* Attach the domains */ 2515 2508 rcu_read_lock(); 2516 2509 for_each_cpu(i, cpu_map) { 2517 - unsigned long capacity; 2518 - 2519 2510 rq = cpu_rq(i); 2520 2511 sd = *per_cpu_ptr(d.sd, i); 2521 - 2522 - capacity = arch_scale_cpu_capacity(i); 2523 - /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */ 2524 - if (capacity > READ_ONCE(d.rd->max_cpu_capacity)) 2525 - WRITE_ONCE(d.rd->max_cpu_capacity, capacity); 2526 2512 2527 2513 cpu_attach_domain(sd, d.rd, i); 2528 2514 ··· 2530 2530 if 
(has_cluster) 2531 2531 static_branch_inc_cpuslocked(&sched_cluster_active); 2532 2532 2533 - if (rq && sched_debug_verbose) { 2534 - pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n", 2535 - cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity); 2536 - } 2533 + if (rq && sched_debug_verbose) 2534 + pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map)); 2537 2535 2538 2536 ret = 0; 2539 2537 error:
kernel/time/timer.c (+1 -1)
···
 	if (in_irq())
 		irq_work_tick();
 #endif
-	scheduler_tick();
+	sched_tick();
 	if (IS_ENABLED(CONFIG_POSIX_TIMERS))
 		run_posix_cpu_timers();
 }
kernel/workqueue.c (+1 -1)
···
  * wq_worker_tick - a scheduler tick occurred while a kworker is running
  * @task: task currently running
  *
- * Called from scheduler_tick(). We're in the IRQ context and the current
+ * Called from sched_tick(). We're in the IRQ context and the current
  * worker's fields which follow the 'K' locking rule can be accessed safely.
  */
 void wq_worker_tick(struct task_struct *task)
lib/Kconfig.debug (+1 -1)
···
 
 config SCHEDSTATS
 	bool "Collect scheduler statistics"
-	depends on DEBUG_KERNEL && PROC_FS
+	depends on PROC_FS
 	select SCHED_INFO
 	help
 	  If you say Y here, additional code will be inserted into the
tools/testing/selftests/ftrace/test.d/ftrace/func_set_ftrace_file.tc (+1 -1)
···
 
 FILTER=set_ftrace_filter
 FUNC1="schedule"
-FUNC2="scheduler_tick"
+FUNC2="sched_tick"
 
 ALL_FUNCS="#### all functions enabled ####"
 