Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"Load-balancing improvements:

- Improve NUMA balancing on AMD Zen systems for affine workloads.

- Improve the handling of reduced-capacity CPUs in load-balancing.

- Energy Model improvements: fix & refine all the energy fairness
metrics (PELT), and remove the conservative threshold requiring 6%
energy savings to migrate a task. Doing this improves power
efficiency for most workloads, and also increases the reliability
of energy-efficiency scheduling.

- Optimize/tweak select_idle_cpu() to spend (much) less time
searching for an idle CPU on overloaded systems. There are reports of
several milliseconds spent there on large systems with large
workloads ...

[ Since the search logic changed, there might be behavioral side
effects. ]

- Improve NUMA imbalance behavior. On certain systems with spare
capacity, initial placement of tasks is non-deterministic, and such
an artificial placement imbalance can persist for a long time,
hurting (and sometimes helping) performance.

The fix is to make fork-time task placement consistent with runtime
NUMA balancing placement.

Note that some performance regressions were reported against this,
caused by workloads that are not memory bandwidth limited, which
benefit from the artificial locality of the placement bug(s). Mel
Gorman's conclusion, with which we concur, was that consistency is
better than random workload benefits from non-deterministic bugs:

"Given there is no crystal ball and it's a tradeoff, I think
it's better to be consistent and use similar logic at both fork
time and runtime even if it doesn't have universal benefit."

- Improve core scheduling by fixing a bug in
sched_core_update_cookie() that caused unnecessary forced idling.

- Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
for newly woken tasks.

- Fix a newidle balancing bug that introduced unnecessary wakeup
latencies.

ABI improvements/fixes:

- Do not check capabilities and do not issue capability check denial
messages when a scheduler syscall doesn't require privileges. (Such
as increasing niceness.)

- Add forced-idle accounting to cgroups too.

- Fix/improve the RSEQ ABI to not just silently accept unknown flags.
(No existing tooling is known to have learned to rely on the
previous behavior.)

- Deprecate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.

Optimizations:

- Optimize & simplify leaf_cfs_rq_list()

- Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().

Misc fixes & cleanups:

- Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.

- Fix a full-NOHZ bug that can in some cases result in the tick not
being re-enabled when the last SCHED_RT task is gone from a
runqueue but there are still SCHED_OTHER tasks around.

- Various PREEMPT_RT related fixes.

- Misc cleanups & smaller fixes"

* tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
rseq: Kill process when unknown flags are encountered in ABI structures
rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
sched/core: Fix the bug that task won't enqueue into core tree when update cookie
nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
sched/core: Always flush pending blk_plug
sched/fair: fix case with reduced capacity CPU
sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
sched/core: add forced idle accounting for cgroups
sched/fair: Remove the energy margin in feec()
sched/fair: Remove task_util from effective utilization in feec()
sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
sched/fair: Rename select_idle_mask to select_rq_mask
sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
sched/fair: Decay task PELT values during wakeup migration
sched/fair: Provide u64 read for 32-bits arch helper
sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
sched: only perform capability check on privileged operation
sched: Remove unused function group_first_cpu()
sched/fair: Remove redundant word " *"
selftests/rseq: check if libc rseq support is registered
...

+894 -517
+9 -24
drivers/powercap/dtpm_cpu.c
···
 static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
 {
-	unsigned long max = 0, sum_util = 0;
+	unsigned long max, sum_util = 0;
 	int cpu;
 
-	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-
-		/*
-		 * The capacity is the same for all CPUs belonging to
-		 * the same perf domain, so a single call to
-		 * arch_scale_cpu_capacity() is enough. However, we
-		 * need the CPU parameter to be initialized by the
-		 * loop, so the call ends up in this block.
-		 *
-		 * We can initialize 'max' with a cpumask_first() call
-		 * before the loop but the bits computation is not
-		 * worth given the arch_scale_cpu_capacity() just
-		 * returns a value where the resulting assembly code
-		 * will be optimized by the compiler.
-		 */
-		max = arch_scale_cpu_capacity(cpu);
-		sum_util += sched_cpu_util(cpu, max);
-	}
-
 	/*
-	 * In the improbable case where all the CPUs of the perf
-	 * domain are offline, 'max' will be zero and will lead to an
-	 * illegal operation with a zero division.
+	 * The capacity is the same for all CPUs belonging to
+	 * the same perf domain.
 	 */
-	return max ? (power * ((sum_util << 10) / max)) >> 10 : 0;
+	max = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+
+	for_each_cpu_and(cpu, pd_mask, cpu_online_mask)
+		sum_util += sched_cpu_util(cpu);
+
+	return (power * ((sum_util << 10) / max)) >> 10;
 }
 
 static u64 get_pd_power_uw(struct dtpm *dtpm)
+2 -4
drivers/thermal/cpufreq_cooling.c
···
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
 		    int cpu_idx)
 {
-	unsigned long max = arch_scale_cpu_capacity(cpu);
-	unsigned long util;
+	unsigned long util = sched_cpu_util(cpu);
 
-	util = sched_cpu_util(cpu, max);
-	return (util * 100) / max;
+	return (util * 100) / arch_scale_cpu_capacity(cpu);
 }
 #else /* !CONFIG_SMP */
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
+4
include/linux/cgroup-defs.h
···
 
 struct cgroup_base_stat {
 	struct task_cputime cputime;
+
+#ifdef CONFIG_SCHED_CORE
+	u64 forceidle_sum;
+#endif
 };
 
 /*
+7
include/linux/kernel_stat.h
···
 	CPUTIME_STEAL,
 	CPUTIME_GUEST,
 	CPUTIME_GUEST_NICE,
+#ifdef CONFIG_SCHED_CORE
+	CPUTIME_FORCEIDLE,
+#endif
 	NR_STATS,
 };
···
 #endif
 
 extern void account_idle_ticks(unsigned long ticks);
+
+#ifdef CONFIG_SCHED_CORE
+extern void __account_forceidle_time(struct task_struct *tsk, u64 delta);
+#endif
 
 #endif /* _LINUX_KERNEL_STAT_H */
+1 -1
include/linux/sched.h
···
 }
 
 /* Returns effective CPU energy utilization, as seen by the scheduler */
-unsigned long sched_cpu_util(int cpu, unsigned long max);
+unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_RSEQ
-8
include/linux/sched/rt.h
···
 }
 extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return tsk->pi_blocked_on != NULL;
-}
 #else
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
 }
 # define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return false;
-}
 #endif
 
 extern void normalize_rt_tasks(void);
+1
include/linux/sched/topology.h
···
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 };
 
 struct sched_domain {
+38 -6
kernel/cgroup/rstat.c
···
 	dst_bstat->cputime.utime += src_bstat->cputime.utime;
 	dst_bstat->cputime.stime += src_bstat->cputime.stime;
 	dst_bstat->cputime.sum_exec_runtime += src_bstat->cputime.sum_exec_runtime;
+#ifdef CONFIG_SCHED_CORE
+	dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
+#endif
 }
 
 static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
···
 	dst_bstat->cputime.utime -= src_bstat->cputime.utime;
 	dst_bstat->cputime.stime -= src_bstat->cputime.stime;
 	dst_bstat->cputime.sum_exec_runtime -= src_bstat->cputime.sum_exec_runtime;
+#ifdef CONFIG_SCHED_CORE
+	dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
+#endif
 }
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
···
 	case CPUTIME_SOFTIRQ:
 		rstatc->bstat.cputime.stime += delta_exec;
 		break;
+#ifdef CONFIG_SCHED_CORE
+	case CPUTIME_FORCEIDLE:
+		rstatc->bstat.forceidle_sum += delta_exec;
+		break;
+#endif
 	default:
 		break;
 	}
···
  * with how it is done by __cgroup_account_cputime_field for each bit of
  * cpu time attributed to a cgroup.
  */
-static void root_cgroup_cputime(struct task_cputime *cputime)
+static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 {
+	struct task_cputime *cputime = &bstat->cputime;
 	int i;
 
 	cputime->stime = 0;
···
 		cputime->sum_exec_runtime += user;
 		cputime->sum_exec_runtime += sys;
 		cputime->sum_exec_runtime += cpustat[CPUTIME_STEAL];
+
+#ifdef CONFIG_SCHED_CORE
+		bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
+#endif
 	}
 }
···
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 	u64 usage, utime, stime;
-	struct task_cputime cputime;
+	struct cgroup_base_stat bstat;
+#ifdef CONFIG_SCHED_CORE
+	u64 forceidle_time;
+#endif
 
 	if (cgroup_parent(cgrp)) {
 		cgroup_rstat_flush_hold(cgrp);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
+#ifdef CONFIG_SCHED_CORE
+		forceidle_time = cgrp->bstat.forceidle_sum;
+#endif
 		cgroup_rstat_flush_release();
 	} else {
-		root_cgroup_cputime(&cputime);
-		usage = cputime.sum_exec_runtime;
-		utime = cputime.utime;
-		stime = cputime.stime;
+		root_cgroup_cputime(&bstat);
+		usage = bstat.cputime.sum_exec_runtime;
+		utime = bstat.cputime.utime;
+		stime = bstat.cputime.stime;
+#ifdef CONFIG_SCHED_CORE
+		forceidle_time = bstat.forceidle_sum;
+#endif
 	}
 
 	do_div(usage, NSEC_PER_USEC);
 	do_div(utime, NSEC_PER_USEC);
 	do_div(stime, NSEC_PER_USEC);
+#ifdef CONFIG_SCHED_CORE
+	do_div(forceidle_time, NSEC_PER_USEC);
+#endif
 
 	seq_printf(seq, "usage_usec %llu\n"
 		   "user_usec %llu\n"
 		   "system_usec %llu\n",
 		   usage, utime, stime);
+
+#ifdef CONFIG_SCHED_CORE
+	seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
+#endif
 }
+8 -15
kernel/rseq.c
···
 #define CREATE_TRACE_POINTS
 #include <trace/events/rseq.h>
 
-#define RSEQ_CS_PREEMPT_MIGRATE_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | \
-				       RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)
+#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
+				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
+				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
 
 /*
  *
···
 	u32 flags, event_mask;
 	int ret;
 
+	if (WARN_ON_ONCE(cs_flags & RSEQ_CS_NO_RESTART_FLAGS) || cs_flags)
+		return -EINVAL;
+
 	/* Get thread flags. */
 	ret = get_user(flags, &t->rseq->flags);
 	if (ret)
 		return ret;
 
-	/* Take critical section flags into account. */
-	flags |= cs_flags;
-
-	/*
-	 * Restart on signal can only be inhibited when restart on
-	 * preempt and restart on migrate are inhibited too. Otherwise,
-	 * a preempted signal handler could fail to restart the prior
-	 * execution context on sigreturn.
-	 */
-	if (unlikely((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
-		     (flags & RSEQ_CS_PREEMPT_MIGRATE_FLAGS) !=
-		     RSEQ_CS_PREEMPT_MIGRATE_FLAGS))
+	if (WARN_ON_ONCE(flags & RSEQ_CS_NO_RESTART_FLAGS) || flags)
 		return -EINVAL;
 
 	/*
···
 	t->rseq_event_mask = 0;
 	preempt_enable();
 
-	return !!(event_mask & ~flags);
+	return !!event_mask;
 }
 
 static int clear_rseq_cs(struct task_struct *t)
+125 -92
kernel/sched/core.c
···
 ({								\
 	typeof(ptr) _ptr = (ptr);				\
 	typeof(mask) _mask = (mask);				\
-	typeof(*_ptr) _old, _val = *_ptr;			\
+	typeof(*_ptr) _val = *_ptr;				\
 								\
-	for (;;) {						\
-		_old = cmpxchg(_ptr, _val, _val | _mask);	\
-		if (_old == _val)				\
-			break;					\
-		_val = _old;					\
-	}							\
-	_old;							\
+	do {							\
+	} while (!try_cmpxchg(_ptr, &_val, _val | _mask));	\
+	_val;							\
 })
 
 #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
···
  * this avoids any races wrt polling state changes and thereby avoids
  * spurious IPIs.
  */
-static bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p)
 {
 	struct thread_info *ti = task_thread_info(p);
 	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
···
 static bool set_nr_if_polling(struct task_struct *p)
 {
 	struct thread_info *ti = task_thread_info(p);
-	typeof(ti->flags) old, val = READ_ONCE(ti->flags);
+	typeof(ti->flags) val = READ_ONCE(ti->flags);
 
 	for (;;) {
 		if (!(val & _TIF_POLLING_NRFLAG))
 			return false;
 		if (val & _TIF_NEED_RESCHED)
 			return true;
-		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
-		if (old == val)
+		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
 			break;
-		val = old;
 	}
 	return true;
 }
 
 #else
-static bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p)
 {
 	set_tsk_need_resched(p);
 	return true;
 }
 
 #ifdef CONFIG_SMP
-static bool set_nr_if_polling(struct task_struct *p)
+static inline bool set_nr_if_polling(struct task_struct *p)
 {
 	return false;
 }
···
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
-static inline bool ttwu_queue_cond(int cpu, int wake_flags)
+static inline bool ttwu_queue_cond(int cpu)
 {
 	/*
 	 * Do not complicate things with the async wake_list while the CPU is
···
 	if (!cpus_share_cache(smp_processor_id(), cpu))
 		return true;
 
+	if (cpu == smp_processor_id())
+		return false;
+
 	/*
-	 * If the task is descheduling and the only running task on the
-	 * CPU then use the wakelist to offload the task activation to
-	 * the soon-to-be-idle CPU as the current CPU is likely busy.
-	 * nr_running is checked to avoid unnecessary task stacking.
+	 * If the wakee cpu is idle, or the task is descheduling and the
+	 * only running task on the CPU, then use the wakelist to offload
+	 * the task activation to the idle (or soon-to-be-idle) CPU as
+	 * the current CPU is likely busy. nr_running is checked to
+	 * avoid unnecessary task stacking.
+	 *
+	 * Note that we can only get here with (wakee) p->on_rq=0,
+	 * p->on_cpu can be whatever, we've done the dequeue, so
+	 * the wakee has been accounted out of ->nr_running.
 	 */
-	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
+	if (!cpu_rq(cpu)->nr_running)
 		return true;
 
 	return false;
···
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
-		if (WARN_ON_ONCE(cpu == smp_processor_id()))
-			return false;
-
+	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu)) {
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 		__ttwu_queue_wakelist(p, cpu, wake_flags);
 		return true;
···
 	 * scheduling.
 	 */
 	if (smp_load_acquire(&p->on_cpu) &&
-	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
+	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
 		goto unlock;
 
 	/*
···
 	 * Claim the task as running, we do this before switching to it
 	 * such that any running task will have this set.
 	 *
-	 * See the ttwu() WF_ON_CPU case and its ordering comment.
+	 * See the smp_load_acquire(&p->on_cpu) case in ttwu() and
+	 * its ordering comment.
 	 */
 	WRITE_ONCE(next->on_cpu, 1);
 #endif
···
 		io_wq_worker_sleeping(tsk);
 	}
 
-	if (tsk_is_pi_blocked(tsk))
-		return;
+	/*
+	 * spinlock and rwlock must not flush block requests. This will
+	 * deadlock if the callback attempts to acquire a lock which is
+	 * already acquired.
+	 */
+	SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
 
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
···
 EXPORT_SYMBOL(set_user_nice);
 
 /*
+ * is_nice_reduction - check if nice value is an actual reduction
+ *
+ * Similar to can_nice() but does not perform a capability check.
+ *
+ * @p: task
+ * @nice: nice value
+ */
+static bool is_nice_reduction(const struct task_struct *p, const int nice)
+{
+	/* Convert nice value [19,-20] to rlimit style value [1,40]: */
+	int nice_rlim = nice_to_rlimit(nice);
+
+	return (nice_rlim <= task_rlimit(p, RLIMIT_NICE));
+}
+
+/*
  * can_nice - check if a task can reduce its nice value
  * @p: task
  * @nice: nice value
  */
 int can_nice(const struct task_struct *p, const int nice)
 {
-	/* Convert nice value [19,-20] to rlimit style value [1,40]: */
-	int nice_rlim = nice_to_rlimit(nice);
-
-	return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
-		capable(CAP_SYS_NICE));
+	return is_nice_reduction(p, nice) || capable(CAP_SYS_NICE);
 }
 
 #ifdef __ARCH_WANT_SYS_NICE
···
  * required to meet deadlines.
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long max, enum cpu_util_type type,
+				 enum cpu_util_type type,
 				 struct task_struct *p)
 {
-	unsigned long dl_util, util, irq;
+	unsigned long dl_util, util, irq, max;
 	struct rq *rq = cpu_rq(cpu);
+
+	max = arch_scale_cpu_capacity(cpu);
 
 	if (!uclamp_is_used() &&
 	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
···
 	return min(max, util);
 }
 
-unsigned long sched_cpu_util(int cpu, unsigned long max)
+unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), max,
-				  ENERGY_UTIL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
 }
 #endif /* CONFIG_SMP */
···
 	return match;
 }
 
+/*
+ * Allow unprivileged RT tasks to decrease priority.
+ * Only issue a capable test if needed and only once to avoid an audit
+ * event on permitted non-privileged operations:
+ */
+static int user_check_sched_setscheduler(struct task_struct *p,
+					 const struct sched_attr *attr,
+					 int policy, int reset_on_fork)
+{
+	if (fair_policy(policy)) {
+		if (attr->sched_nice < task_nice(p) &&
+		    !is_nice_reduction(p, attr->sched_nice))
+			goto req_priv;
+	}
+
+	if (rt_policy(policy)) {
+		unsigned long rlim_rtprio = task_rlimit(p, RLIMIT_RTPRIO);
+
+		/* Can't set/change the rt policy: */
+		if (policy != p->policy && !rlim_rtprio)
+			goto req_priv;
+
+		/* Can't increase priority: */
+		if (attr->sched_priority > p->rt_priority &&
+		    attr->sched_priority > rlim_rtprio)
+			goto req_priv;
+	}
+
+	/*
+	 * Can't set/change SCHED_DEADLINE policy at all for now
+	 * (safest behavior); in the future we would like to allow
+	 * unprivileged DL tasks to increase their relative deadline
+	 * or reduce their runtime (both ways reducing utilization)
+	 */
+	if (dl_policy(policy))
+		goto req_priv;
+
+	/*
+	 * Treat SCHED_IDLE as nice 20. Only allow a switch to
+	 * SCHED_NORMAL if the RLIMIT_NICE would normally permit it.
+	 */
+	if (task_has_idle_policy(p) && !idle_policy(policy)) {
+		if (!is_nice_reduction(p, task_nice(p)))
+			goto req_priv;
+	}
+
+	/* Can't change other user's priorities: */
+	if (!check_same_owner(p))
+		goto req_priv;
+
+	/* Normal users shall not reset the sched_reset_on_fork flag: */
+	if (p->sched_reset_on_fork && !reset_on_fork)
+		goto req_priv;
+
+	return 0;
+
+req_priv:
+	if (!capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	return 0;
+}
+
 static int __sched_setscheduler(struct task_struct *p,
 				const struct sched_attr *attr,
 				bool user, bool pi)
···
 	    (rt_policy(policy) != (attr->sched_priority != 0)))
 		return -EINVAL;
 
-	/*
-	 * Allow unprivileged RT tasks to decrease priority:
-	 */
-	if (user && !capable(CAP_SYS_NICE)) {
-		if (fair_policy(policy)) {
-			if (attr->sched_nice < task_nice(p) &&
-			    !can_nice(p, attr->sched_nice))
-				return -EPERM;
-		}
-
-		if (rt_policy(policy)) {
-			unsigned long rlim_rtprio =
-					task_rlimit(p, RLIMIT_RTPRIO);
-
-			/* Can't set/change the rt policy: */
-			if (policy != p->policy && !rlim_rtprio)
-				return -EPERM;
-
-			/* Can't increase priority: */
-			if (attr->sched_priority > p->rt_priority &&
-			    attr->sched_priority > rlim_rtprio)
-				return -EPERM;
-		}
-
-		/*
-		 * Can't set/change SCHED_DEADLINE policy at all for now
-		 * (safest behavior); in the future we would like to allow
-		 * unprivileged DL tasks to increase their relative deadline
-		 * or reduce their runtime (both ways reducing utilization)
-		 */
-		if (dl_policy(policy))
-			return -EPERM;
-
-		/*
-		 * Treat SCHED_IDLE as nice 20. Only allow a switch to
-		 * SCHED_NORMAL if the RLIMIT_NICE would normally permit it.
-		 */
-		if (task_has_idle_policy(p) && !idle_policy(policy)) {
-			if (!can_nice(p, task_nice(p)))
-				return -EPERM;
-		}
-
-		/* Can't change other user's priorities: */
-		if (!check_same_owner(p))
-			return -EPERM;
-
-		/* Normal users shall not reset the sched_reset_on_fork flag: */
-		if (p->sched_reset_on_fork && !reset_on_fork)
-			return -EPERM;
-	}
-
 	if (user) {
+		retval = user_check_sched_setscheduler(p, attr, policy, reset_on_fork);
+		if (retval)
+			return retval;
+
 		if (attr->sched_flags & SCHED_FLAG_SUGOV)
 			return -EINVAL;
···
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
-DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_rq_mask);
 
 void __init sched_init(void)
 {
···
 	for_each_possible_cpu(i) {
 		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
-		per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+		per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
 			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
 	}
 #endif /* CONFIG_CPUMASK_OFFSTACK */
+10 -5
kernel/sched/core_sched.c
···
 	unsigned long old_cookie;
 	struct rq_flags rf;
 	struct rq *rq;
-	bool enqueued;
 
 	rq = task_rq_lock(p, &rf);
···
 	 */
 	SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));
 
-	enqueued = sched_core_enqueued(p);
-	if (enqueued)
+	if (sched_core_enqueued(p))
 		sched_core_dequeue(rq, p, DEQUEUE_SAVE);
 
 	old_cookie = p->core_cookie;
 	p->core_cookie = cookie;
 
-	if (enqueued)
+	/*
+	 * Consider the cases: !prev_cookie and !cookie.
+	 */
+	if (cookie && task_on_rq_queued(p))
 		sched_core_enqueue(rq, p);
 
 	/*
···
 		if (p == rq_i->idle)
 			continue;
 
-		__schedstat_add(p->stats.core_forceidle_sum, delta);
+		/*
+		 * Note: this will account forceidle to the current cpu, even
+		 * if it comes from our SMT sibling.
+		 */
+		__account_forceidle_time(p, delta);
 	}
 }
+2 -3
kernel/sched/cpufreq_schedutil.c
···
 static void sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
 
-	sg_cpu->max = max;
+	sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu);
 	sg_cpu->bw_dl = cpu_bw_dl(rq);
-	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
+	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu),
 					  FREQUENCY_UTIL, NULL);
 }
+15
kernel/sched/cputime.c
···
 	cpustat[CPUTIME_IDLE] += cputime;
 }
 
+
+#ifdef CONFIG_SCHED_CORE
+/*
+ * Account for forceidle time due to core scheduling.
+ *
+ * REQUIRES: schedstat is enabled.
+ */
+void __account_forceidle_time(struct task_struct *p, u64 delta)
+{
+	__schedstat_add(p->stats.core_forceidle_sum, delta);
+
+	task_group_account_field(p, CPUTIME_FORCEIDLE, delta);
+}
+#endif
+
 /*
  * When a guest is interrupted for a longer amount of time, missed clock
  * ticks are not redelivered later. Due to that, this function may on
+4 -2
kernel/sched/deadline.c
···
 		.data		= &sysctl_sched_dl_period_max,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= (void *)&sysctl_sched_dl_period_min,
 	},
 	{
 		.procname	= "sched_deadline_period_min_us",
 		.data		= &sysctl_sched_dl_period_min,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra2		= (void *)&sysctl_sched_dl_period_max,
 	},
 	{}
 };
+529 -295
kernel/sched/fair.c
··· 612 612 } 613 613 614 614 /* ensure we never gain time by being placed backwards. */ 615 - cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime); 616 - #ifndef CONFIG_64BIT 617 - smp_wmb(); 618 - cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime; 619 - #endif 615 + u64_u32_store(cfs_rq->min_vruntime, 616 + max_vruntime(cfs_rq->min_vruntime, vruntime)); 620 617 } 621 618 622 619 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b) ··· 1051 1054 /************************************************** 1052 1055 * Scheduling class queueing methods: 1053 1056 */ 1057 + 1058 + #ifdef CONFIG_NUMA 1059 + #define NUMA_IMBALANCE_MIN 2 1060 + 1061 + static inline long 1062 + adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr) 1063 + { 1064 + /* 1065 + * Allow a NUMA imbalance if busy CPUs is less than the maximum 1066 + * threshold. Above this threshold, individual tasks may be contending 1067 + * for both memory bandwidth and any shared HT resources. This is an 1068 + * approximation as the number of running tasks may not be related to 1069 + * the number of busy CPUs due to sched_setaffinity. 1070 + */ 1071 + if (dst_running > imb_numa_nr) 1072 + return imbalance; 1073 + 1074 + /* 1075 + * Allow a small imbalance based on a simple pair of communicating 1076 + * tasks that remain local when the destination is lightly loaded. 
1077 + */ 1078 + if (imbalance <= NUMA_IMBALANCE_MIN) 1079 + return 0; 1080 + 1081 + return imbalance; 1082 + } 1083 + #endif /* CONFIG_NUMA */ 1054 1084 1055 1085 #ifdef CONFIG_NUMA_BALANCING 1056 1086 /* ··· 1572 1548 1573 1549 static unsigned long cpu_load(struct rq *rq); 1574 1550 static unsigned long cpu_runnable(struct rq *rq); 1575 - static inline long adjust_numa_imbalance(int imbalance, 1576 - int dst_running, int imb_numa_nr); 1577 1551 1578 1552 static inline enum 1579 1553 numa_type numa_classify(unsigned int imbalance_pct, ··· 1812 1790 */ 1813 1791 cur_ng = rcu_dereference(cur->numa_group); 1814 1792 if (cur_ng == p_ng) { 1793 + /* 1794 + * Do not swap within a group or between tasks that have 1795 + * no group if there is spare capacity. Swapping does 1796 + * not address the load imbalance and helps one task at 1797 + * the cost of punishing another. 1798 + */ 1799 + if (env->dst_stats.node_type == node_has_spare) 1800 + goto unlock; 1801 + 1815 1802 imp = taskimp + task_weight(cur, env->src_nid, dist) - 1816 1803 task_weight(cur, env->dst_nid, dist); 1817 1804 /* ··· 2916 2885 p->node_stamp = 0; 2917 2886 p->numa_scan_seq = mm ? mm->numa_scan_seq : 0; 2918 2887 p->numa_scan_period = sysctl_numa_balancing_scan_delay; 2888 + p->numa_migrate_retry = 0; 2919 2889 /* Protect against double add, see task_tick_numa and task_numa_work */ 2920 2890 p->numa_work.next = &p->numa_work; 2921 2891 p->numa_faults = NULL; ··· 3176 3144 load->inv_weight = sched_prio_to_wmult[prio]; 3177 3145 } 3178 3146 3147 + static inline int throttled_hierarchy(struct cfs_rq *cfs_rq); 3148 + 3179 3149 #ifdef CONFIG_FAIR_GROUP_SCHED 3180 3150 #ifdef CONFIG_SMP 3181 3151 /* ··· 3288 3254 } 3289 3255 #endif /* CONFIG_SMP */ 3290 3256 3291 - static inline int throttled_hierarchy(struct cfs_rq *cfs_rq); 3292 - 3293 3257 /* 3294 3258 * Recomputes the group entity based on the current state of its group 3295 3259 * runqueue. 
··· 3345 3313 } 3346 3314 3347 3315 #ifdef CONFIG_SMP 3316 + static inline bool load_avg_is_decayed(struct sched_avg *sa) 3317 + { 3318 + if (sa->load_sum) 3319 + return false; 3320 + 3321 + if (sa->util_sum) 3322 + return false; 3323 + 3324 + if (sa->runnable_sum) 3325 + return false; 3326 + 3327 + /* 3328 + * _avg must be null when _sum are null because _avg = _sum / divider 3329 + * Make sure that rounding and/or propagation of PELT values never 3330 + * break this. 3331 + */ 3332 + SCHED_WARN_ON(sa->load_avg || 3333 + sa->util_avg || 3334 + sa->runnable_avg); 3335 + 3336 + return true; 3337 + } 3338 + 3339 + static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq) 3340 + { 3341 + return u64_u32_load_copy(cfs_rq->avg.last_update_time, 3342 + cfs_rq->last_update_time_copy); 3343 + } 3348 3344 #ifdef CONFIG_FAIR_GROUP_SCHED 3349 3345 /* 3350 3346 * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list ··· 3405 3345 if (cfs_rq->load.weight) 3406 3346 return false; 3407 3347 3408 - if (cfs_rq->avg.load_sum) 3409 - return false; 3410 - 3411 - if (cfs_rq->avg.util_sum) 3412 - return false; 3413 - 3414 - if (cfs_rq->avg.runnable_sum) 3348 + if (!load_avg_is_decayed(&cfs_rq->avg)) 3415 3349 return false; 3416 3350 3417 3351 if (child_cfs_rq_on_list(cfs_rq)) 3418 3352 return false; 3419 - 3420 - /* 3421 - * _avg must be null when _sum are null because _avg = _sum / divider 3422 - * Make sure that rounding and/or propagation of PELT values never 3423 - * break this. 
3424 - */ 3425 - SCHED_WARN_ON(cfs_rq->avg.load_avg || 3426 - cfs_rq->avg.util_avg || 3427 - cfs_rq->avg.runnable_avg); 3428 3353 3429 3354 return true; 3430 3355 } ··· 3468 3423 if (!(se->avg.last_update_time && prev)) 3469 3424 return; 3470 3425 3471 - #ifndef CONFIG_64BIT 3472 - { 3473 - u64 p_last_update_time_copy; 3474 - u64 n_last_update_time_copy; 3426 + p_last_update_time = cfs_rq_last_update_time(prev); 3427 + n_last_update_time = cfs_rq_last_update_time(next); 3475 3428 3476 - do { 3477 - p_last_update_time_copy = prev->load_last_update_time_copy; 3478 - n_last_update_time_copy = next->load_last_update_time_copy; 3479 - 3480 - smp_rmb(); 3481 - 3482 - p_last_update_time = prev->avg.last_update_time; 3483 - n_last_update_time = next->avg.last_update_time; 3484 - 3485 - } while (p_last_update_time != p_last_update_time_copy || 3486 - n_last_update_time != n_last_update_time_copy); 3487 - } 3488 - #else 3489 - p_last_update_time = prev->avg.last_update_time; 3490 - n_last_update_time = next->avg.last_update_time; 3491 - #endif 3492 3429 __update_load_avg_blocked_se(p_last_update_time, se); 3493 3430 se->avg.last_update_time = n_last_update_time; 3494 3431 } ··· 3749 3722 3750 3723 #endif /* CONFIG_FAIR_GROUP_SCHED */ 3751 3724 3725 + #ifdef CONFIG_NO_HZ_COMMON 3726 + static inline void migrate_se_pelt_lag(struct sched_entity *se) 3727 + { 3728 + u64 throttled = 0, now, lut; 3729 + struct cfs_rq *cfs_rq; 3730 + struct rq *rq; 3731 + bool is_idle; 3732 + 3733 + if (load_avg_is_decayed(&se->avg)) 3734 + return; 3735 + 3736 + cfs_rq = cfs_rq_of(se); 3737 + rq = rq_of(cfs_rq); 3738 + 3739 + rcu_read_lock(); 3740 + is_idle = is_idle_task(rcu_dereference(rq->curr)); 3741 + rcu_read_unlock(); 3742 + 3743 + /* 3744 + * The lag estimation comes with a cost we don't want to pay all the 3745 + * time. Hence, limiting to the case where the source CPU is idle and 3746 + * we know we are at the greatest risk to have an outdated clock. 
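The open-coded `last_update_time_copy` retry loops removed here are factored into `u64_u32_load_copy()`/`u64_u32_store_copy()`. The pattern exists because a 64-bit load tears into two 32-bit halves on 32-bit architectures; a userspace single-threaded sketch of the idea (the real macros pair the writes/reads with `smp_wmb()`/`smp_rmb()`):

```c
#include <assert.h>
#include <stdint.h>

/* a 64-bit value plus its shadow copy, as kept by the kernel on !CONFIG_64BIT */
struct u64_split {
	uint64_t val;
	uint64_t copy;
};

void u64_u32_store_copy(struct u64_split *s, uint64_t v)
{
	s->val = v;	/* on 32-bit this is two 32-bit stores */
	/* smp_wmb() in the kernel orders val before copy */
	s->copy = v;
}

uint64_t u64_u32_load_copy(const struct u64_split *s)
{
	uint64_t v, c;

	do {
		c = s->copy;
		/* smp_rmb() in the kernel orders copy before val */
		v = s->val;
	} while (v != c);	/* retry if a writer raced between the halves */
	return v;
}

/* convenience helper for demonstration only */
uint64_t roundtrip(uint64_t v)
{
	struct u64_split s;

	u64_u32_store_copy(&s, v);
	return u64_u32_load_copy(&s);
}
```

On 64-bit kernels both macros collapse to a plain load/store, which is why the old code was wrapped in `#ifndef CONFIG_64BIT`.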
3747 + */ 3748 + if (!is_idle) 3749 + return; 3750 + 3751 + /* 3752 + * Estimated "now" is: last_update_time + cfs_idle_lag + rq_idle_lag, where: 3753 + * 3754 + * last_update_time (the cfs_rq's last_update_time) 3755 + * = cfs_rq_clock_pelt()@cfs_rq_idle 3756 + * = rq_clock_pelt()@cfs_rq_idle 3757 + * - cfs->throttled_clock_pelt_time@cfs_rq_idle 3758 + * 3759 + * cfs_idle_lag (delta between rq's update and cfs_rq's update) 3760 + * = rq_clock_pelt()@rq_idle - rq_clock_pelt()@cfs_rq_idle 3761 + * 3762 + * rq_idle_lag (delta between now and rq's update) 3763 + * = sched_clock_cpu() - rq_clock()@rq_idle 3764 + * 3765 + * We can then write: 3766 + * 3767 + * now = rq_clock_pelt()@rq_idle - cfs->throttled_clock_pelt_time + 3768 + * sched_clock_cpu() - rq_clock()@rq_idle 3769 + * Where: 3770 + * rq_clock_pelt()@rq_idle is rq->clock_pelt_idle 3771 + * rq_clock()@rq_idle is rq->clock_idle 3772 + * cfs->throttled_clock_pelt_time@cfs_rq_idle 3773 + * is cfs_rq->throttled_pelt_idle 3774 + */ 3775 + 3776 + #ifdef CONFIG_CFS_BANDWIDTH 3777 + throttled = u64_u32_load(cfs_rq->throttled_pelt_idle); 3778 + /* The clock has been stopped for throttling */ 3779 + if (throttled == U64_MAX) 3780 + return; 3781 + #endif 3782 + now = u64_u32_load(rq->clock_pelt_idle); 3783 + /* 3784 + * Paired with _update_idle_rq_clock_pelt(). It ensures at the worst case 3785 + * is observed the old clock_pelt_idle value and the new clock_idle, 3786 + * which lead to an underestimation. The opposite would lead to an 3787 + * overestimation. 3788 + */ 3789 + smp_rmb(); 3790 + lut = cfs_rq_last_update_time(cfs_rq); 3791 + 3792 + now -= throttled; 3793 + if (now < lut) 3794 + /* 3795 + * cfs_rq->avg.last_update_time is more recent than our 3796 + * estimation, let's use it. 
3797 + */ 3798 + now = lut; 3799 + else 3800 + now += sched_clock_cpu(cpu_of(rq)) - u64_u32_load(rq->clock_idle); 3801 + 3802 + __update_load_avg_blocked_se(now, se); 3803 + } 3804 + #else 3805 + static void migrate_se_pelt_lag(struct sched_entity *se) {} 3806 + #endif 3807 + 3752 3808 /** 3753 3809 * update_cfs_rq_load_avg - update the cfs_rq's load/util averages 3754 3810 * @now: current time, as per cfs_rq_clock_pelt() ··· 3906 3796 } 3907 3797 3908 3798 decayed |= __update_load_avg_cfs_rq(now, cfs_rq); 3909 - 3910 - #ifndef CONFIG_64BIT 3911 - smp_wmb(); 3912 - cfs_rq->load_last_update_time_copy = sa->last_update_time; 3913 - #endif 3914 - 3799 + u64_u32_store_copy(sa->last_update_time, 3800 + cfs_rq->last_update_time_copy, 3801 + sa->last_update_time); 3915 3802 return decayed; 3916 3803 } 3917 3804 ··· 4039 3932 update_tg_load_avg(cfs_rq); 4040 3933 } 4041 3934 } 4042 - 4043 - #ifndef CONFIG_64BIT 4044 - static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq) 4045 - { 4046 - u64 last_update_time_copy; 4047 - u64 last_update_time; 4048 - 4049 - do { 4050 - last_update_time_copy = cfs_rq->load_last_update_time_copy; 4051 - smp_rmb(); 4052 - last_update_time = cfs_rq->avg.last_update_time; 4053 - } while (last_update_time != last_update_time_copy); 4054 - 4055 - return last_update_time; 4056 - } 4057 - #else 4058 - static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq) 4059 - { 4060 - return cfs_rq->avg.last_update_time; 4061 - } 4062 - #endif 4063 3935 4064 3936 /* 4065 3937 * Synchronize entity load avg of dequeued entity without locking ··· 4454 4368 __enqueue_entity(cfs_rq, se); 4455 4369 se->on_rq = 1; 4456 4370 4457 - /* 4458 - * When bandwidth control is enabled, cfs might have been removed 4459 - * because of a parent been throttled but cfs->nr_running > 1. Try to 4460 - * add it unconditionally. 
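The clock estimation that `migrate_se_pelt_lag()` documents above can be condensed into a userspace sketch (hypothetical helper name; plain integer counters stand in for the rq/cfs_rq snapshots):

```c
#include <assert.h>
#include <stdint.h>

/* now = rq_clock_pelt()@rq_idle - throttled_pelt_idle
 *       + sched_clock_cpu() - rq_clock()@rq_idle,
 * clamped below by the cfs_rq's own last_update_time */
uint64_t estimate_now(uint64_t clock_pelt_idle,	  /* rq_clock_pelt()@rq_idle */
		      uint64_t throttled,	  /* cfs_rq->throttled_pelt_idle */
		      uint64_t last_update_time,  /* cfs_rq's last PELT update */
		      uint64_t sched_clock_now,	  /* sched_clock_cpu() */
		      uint64_t clock_idle)	  /* rq_clock()@rq_idle */
{
	uint64_t now = clock_pelt_idle - throttled;

	/* last_update_time is more recent than the estimate: use it */
	if (now < last_update_time)
		return last_update_time;

	/* otherwise add the time the rq spent idle since its last update */
	return now + (sched_clock_now - clock_idle);
}
```

This is what lets a task migrated off a long-idle CPU decay its blocked PELT signal instead of carrying an inflated utilization onto the destination rq.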
4461 - */ 4462 - if (cfs_rq->nr_running == 1 || cfs_bandwidth_used()) 4463 - list_add_leaf_cfs_rq(cfs_rq); 4464 - 4465 - if (cfs_rq->nr_running == 1) 4371 + if (cfs_rq->nr_running == 1) { 4466 4372 check_enqueue_throttle(cfs_rq); 4373 + if (!throttled_hierarchy(cfs_rq)) 4374 + list_add_leaf_cfs_rq(cfs_rq); 4375 + } 4467 4376 } 4468 4377 4469 4378 static void __clear_buddies_last(struct sched_entity *se) ··· 4558 4477 */ 4559 4478 if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE) 4560 4479 update_min_vruntime(cfs_rq); 4480 + 4481 + if (cfs_rq->nr_running == 0) 4482 + update_idle_cfs_rq_clock_pelt(cfs_rq); 4561 4483 } 4562 4484 4563 4485 /* ··· 5076 4992 /* update hierarchical throttle state */ 5077 4993 walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); 5078 4994 5079 - /* Nothing to run but something to decay (on_list)? Complete the branch */ 5080 4995 if (!cfs_rq->load.weight) { 5081 - if (cfs_rq->on_list) 5082 - goto unthrottle_throttle; 5083 - return; 4996 + if (!cfs_rq->on_list) 4997 + return; 4998 + /* 4999 + * Nothing to run but something to decay (on_list)? 5000 + * Complete the branch. 5001 + */ 5002 + for_each_sched_entity(se) { 5003 + if (list_add_leaf_cfs_rq(cfs_rq_of(se))) 5004 + break; 5005 + } 5006 + goto unthrottle_throttle; 5084 5007 } 5085 5008 5086 5009 task_delta = cfs_rq->h_nr_running; ··· 5125 5034 /* end evaluation on encountering a throttled cfs_rq */ 5126 5035 if (cfs_rq_throttled(qcfs_rq)) 5127 5036 goto unthrottle_throttle; 5128 - 5129 - /* 5130 - * One parent has been throttled and cfs_rq removed from the 5131 - * list. Add it back to not break the leaf list. 
5132 - */ 5133 - if (throttled_hierarchy(qcfs_rq)) 5134 - list_add_leaf_cfs_rq(qcfs_rq); 5135 5037 } 5136 5038 5137 5039 /* At this point se is NULL and we are at root level*/ 5138 5040 add_nr_running(rq, task_delta); 5139 5041 5140 5042 unthrottle_throttle: 5141 - /* 5142 - * The cfs_rq_throttled() breaks in the above iteration can result in 5143 - * incomplete leaf list maintenance, resulting in triggering the 5144 - * assertion below. 5145 - */ 5146 - for_each_sched_entity(se) { 5147 - struct cfs_rq *qcfs_rq = cfs_rq_of(se); 5148 - 5149 - if (list_add_leaf_cfs_rq(qcfs_rq)) 5150 - break; 5151 - } 5152 - 5153 5043 assert_list_leaf_cfs_rq(rq); 5154 5044 5155 5045 /* Determine whether we need to wake up potentially idle CPU: */ ··· 5785 5713 /* end evaluation on encountering a throttled cfs_rq */ 5786 5714 if (cfs_rq_throttled(cfs_rq)) 5787 5715 goto enqueue_throttle; 5788 - 5789 - /* 5790 - * One parent has been throttled and cfs_rq removed from the 5791 - * list. Add it back to not break the leaf list. 5792 - */ 5793 - if (throttled_hierarchy(cfs_rq)) 5794 - list_add_leaf_cfs_rq(cfs_rq); 5795 5716 } 5796 5717 5797 5718 /* At this point se is NULL and we are at root level*/ ··· 5808 5743 update_overutilized_status(rq); 5809 5744 5810 5745 enqueue_throttle: 5811 - if (cfs_bandwidth_used()) { 5812 - /* 5813 - * When bandwidth control is enabled; the cfs_rq_throttled() 5814 - * breaks in the above iteration can result in incomplete 5815 - * leaf list maintenance, resulting in triggering the assertion 5816 - * below. 5817 - */ 5818 - for_each_sched_entity(se) { 5819 - cfs_rq = cfs_rq_of(se); 5820 - 5821 - if (list_add_leaf_cfs_rq(cfs_rq)) 5822 - break; 5823 - } 5824 - } 5825 - 5826 5746 assert_list_leaf_cfs_rq(rq); 5827 5747 5828 5748 hrtick_update(rq); ··· 5894 5844 5895 5845 /* Working cpumask for: load_balance, load_balance_newidle. 
*/ 5896 5846 DEFINE_PER_CPU(cpumask_var_t, load_balance_mask); 5897 - DEFINE_PER_CPU(cpumask_var_t, select_idle_mask); 5847 + DEFINE_PER_CPU(cpumask_var_t, select_rq_mask); 5898 5848 5899 5849 #ifdef CONFIG_NO_HZ_COMMON 5900 5850 ··· 6384 6334 */ 6385 6335 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target) 6386 6336 { 6387 - struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); 6337 + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); 6388 6338 int i, cpu, idle_cpu = -1, nr = INT_MAX; 6339 + struct sched_domain_shared *sd_share; 6389 6340 struct rq *this_rq = this_rq(); 6390 6341 int this = smp_processor_id(); 6391 6342 struct sched_domain *this_sd; ··· 6424 6373 nr = 4; 6425 6374 6426 6375 time = cpu_clock(this); 6376 + } 6377 + 6378 + if (sched_feat(SIS_UTIL)) { 6379 + sd_share = rcu_dereference(per_cpu(sd_llc_shared, target)); 6380 + if (sd_share) { 6381 + /* because !--nr is the condition to stop scan */ 6382 + nr = READ_ONCE(sd_share->nr_idle_scan) + 1; 6383 + /* overloaded LLC is unlikely to have idle cpu/core */ 6384 + if (nr == 1) 6385 + return -1; 6386 + } 6427 6387 } 6428 6388 6429 6389 for_each_cpu_wrap(cpu, cpus, target + 1) { ··· 6482 6420 int cpu, best_cpu = -1; 6483 6421 struct cpumask *cpus; 6484 6422 6485 - cpus = this_cpu_cpumask_var_ptr(select_idle_mask); 6423 + cpus = this_cpu_cpumask_var_ptr(select_rq_mask); 6486 6424 cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); 6487 6425 6488 6426 task_util = uclamp_task_util(p); ··· 6532 6470 } 6533 6471 6534 6472 /* 6535 - * per-cpu select_idle_mask usage 6473 + * per-cpu select_rq_mask usage 6536 6474 */ 6537 6475 lockdep_assert_irqs_disabled(); 6538 6476 ··· 6702 6640 } 6703 6641 6704 6642 /* 6705 - * compute_energy(): Estimates the energy that @pd would consume if @p was 6706 - * migrated to @dst_cpu. 
compute_energy() predicts what will be the utilization 6707 - * landscape of @pd's CPUs after the task migration, and uses the Energy Model 6708 - * to compute what would be the energy if we decided to actually migrate that 6709 - * task. 6643 + * energy_env - Utilization landscape for energy estimation. 6644 + * @task_busy_time: Utilization contribution by the task for which we test the 6645 + * placement. Given by eenv_task_busy_time(). 6646 + * @pd_busy_time: Utilization of the whole perf domain without the task 6647 + * contribution. Given by eenv_pd_busy_time(). 6648 + * @cpu_cap: Maximum CPU capacity for the perf domain. 6649 + * @pd_cap: Entire perf domain capacity. (pd->nr_cpus * cpu_cap). 6710 6650 */ 6711 - static long 6712 - compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd) 6651 + struct energy_env { 6652 + unsigned long task_busy_time; 6653 + unsigned long pd_busy_time; 6654 + unsigned long cpu_cap; 6655 + unsigned long pd_cap; 6656 + }; 6657 + 6658 + /* 6659 + * Compute the task busy time for compute_energy(). This time cannot be 6660 + * injected directly into effective_cpu_util() because of the IRQ scaling. 6661 + * The latter only makes sense with the most recent CPUs where the task has 6662 + * run. 
6663 + */ 6664 + static inline void eenv_task_busy_time(struct energy_env *eenv, 6665 + struct task_struct *p, int prev_cpu) 6713 6666 { 6714 - struct cpumask *pd_mask = perf_domain_span(pd); 6715 - unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask)); 6716 - unsigned long max_util = 0, sum_util = 0; 6717 - unsigned long _cpu_cap = cpu_cap; 6667 + unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu); 6668 + unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu)); 6669 + 6670 + if (unlikely(irq >= max_cap)) 6671 + busy_time = max_cap; 6672 + else 6673 + busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap); 6674 + 6675 + eenv->task_busy_time = busy_time; 6676 + } 6677 + 6678 + /* 6679 + * Compute the perf_domain (PD) busy time for compute_energy(). Based on the 6680 + * utilization for each @pd_cpus, it however doesn't take into account 6681 + * clamping since the ratio (utilization / cpu_capacity) is already enough to 6682 + * scale the EM reported power consumption at the (eventually clamped) 6683 + * cpu_capacity. 6684 + * 6685 + * The contribution of the task @p for which we want to estimate the 6686 + * energy cost is removed (by cpu_util_next()) and must be calculated 6687 + * separately (see eenv_task_busy_time). This ensures: 6688 + * 6689 + * - A stable PD utilization, no matter which CPU of that PD we want to place 6690 + * the task on. 6691 + * 6692 + * - A fair comparison between CPUs as the task contribution (task_util()) 6693 + * will always be the same no matter which CPU utilization we rely on 6694 + * (util_avg or util_est). 6695 + * 6696 + * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't 6697 + * exceed @eenv->pd_cap. 
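The IRQ scaling that keeps the task's busy time out of `effective_cpu_util()` follows the kernel's `scale_irq_capacity()` formula, `util * (max - irq) / max`. A userspace sketch of `eenv_task_busy_time()` built on it (hypothetical standalone function names):

```c
#include <assert.h>

/* the CPU time left for tasks shrinks by the share consumed by IRQ handling:
 * util * (max - irq) / max */
unsigned long scale_irq_capacity(unsigned long util, unsigned long irq,
				 unsigned long max)
{
	util *= (max - irq);
	util /= max;
	return util;
}

/* sketch of eenv_task_busy_time(): the task's estimated utilization,
 * scaled by the IRQ pressure on the CPU it last ran on */
unsigned long task_busy_time(unsigned long task_util_est, unsigned long irq,
			     unsigned long max_cap)
{
	if (irq >= max_cap)	/* IRQ already saturates the CPU */
		return max_cap;
	return scale_irq_capacity(task_util_est, irq, max_cap);
}
```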
6698 + */ 6699 + static inline void eenv_pd_busy_time(struct energy_env *eenv, 6700 + struct cpumask *pd_cpus, 6701 + struct task_struct *p) 6702 + { 6703 + unsigned long busy_time = 0; 6718 6704 int cpu; 6719 6705 6720 - _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask)); 6706 + for_each_cpu(cpu, pd_cpus) { 6707 + unsigned long util = cpu_util_next(cpu, p, -1); 6721 6708 6722 - /* 6723 - * The capacity state of CPUs of the current rd can be driven by CPUs 6724 - * of another rd if they belong to the same pd. So, account for the 6725 - * utilization of these CPUs too by masking pd with cpu_online_mask 6726 - * instead of the rd span. 6727 - * 6728 - * If an entire pd is outside of the current rd, it will not appear in 6729 - * its pd list and will not be accounted by compute_energy(). 6730 - */ 6731 - for_each_cpu_and(cpu, pd_mask, cpu_online_mask) { 6732 - unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu); 6733 - unsigned long cpu_util, util_running = util_freq; 6734 - struct task_struct *tsk = NULL; 6709 + busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL); 6710 + } 6735 6711 6736 - /* 6737 - * When @p is placed on @cpu: 6738 - * 6739 - * util_running = max(cpu_util, cpu_util_est) + 6740 - * max(task_util, _task_util_est) 6741 - * 6742 - * while cpu_util_next is: max(cpu_util + task_util, 6743 - * cpu_util_est + _task_util_est) 6744 - */ 6745 - if (cpu == dst_cpu) { 6746 - tsk = p; 6747 - util_running = 6748 - cpu_util_next(cpu, p, -1) + task_util_est(p); 6749 - } 6712 + eenv->pd_busy_time = min(eenv->pd_cap, busy_time); 6713 + } 6750 6714 6751 - /* 6752 - * Busy time computation: utilization clamping is not 6753 - * required since the ratio (sum_util / cpu_capacity) 6754 - * is already enough to scale the EM reported power 6755 - * consumption at the (eventually clamped) cpu_capacity. 
6756 - */ 6757 - cpu_util = effective_cpu_util(cpu, util_running, cpu_cap, 6758 - ENERGY_UTIL, NULL); 6715 + /* 6716 + * Compute the maximum utilization for compute_energy() when the task @p 6717 + * is placed on the cpu @dst_cpu. 6718 + * 6719 + * Returns the maximum utilization among @eenv->cpus. This utilization can't 6720 + * exceed @eenv->cpu_cap. 6721 + */ 6722 + static inline unsigned long 6723 + eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus, 6724 + struct task_struct *p, int dst_cpu) 6725 + { 6726 + unsigned long max_util = 0; 6727 + int cpu; 6759 6728 6760 - sum_util += min(cpu_util, _cpu_cap); 6729 + for_each_cpu(cpu, pd_cpus) { 6730 + struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL; 6731 + unsigned long util = cpu_util_next(cpu, p, dst_cpu); 6732 + unsigned long cpu_util; 6761 6733 6762 6734 /* 6763 6735 * Performance domain frequency: utilization clamping ··· 6800 6704 * NOTE: in case RT tasks are running, by default the 6801 6705 * FREQUENCY_UTIL's utilization can be max OPP. 6802 6706 */ 6803 - cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap, 6804 - FREQUENCY_UTIL, tsk); 6805 - max_util = max(max_util, min(cpu_util, _cpu_cap)); 6707 + cpu_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk); 6708 + max_util = max(max_util, cpu_util); 6806 6709 } 6807 6710 6808 - return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap); 6711 + return min(max_util, eenv->cpu_cap); 6712 + } 6713 + 6714 + /* 6715 + * compute_energy(): Use the Energy Model to estimate the energy that @pd would 6716 + * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task 6717 + * contribution is ignored. 
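The busy-time half of the new `compute_energy()` is a one-liner worth spelling out: the task contribution is added only when a destination CPU inside the pd is being evaluated, and the total is clamped at the pd capacity. A sketch using scalars in place of `struct energy_env` fields:

```c
#include <assert.h>

/* sketch of the busy-time accounting in compute_energy(): with dst_cpu < 0
 * the "base" energy of the pd without the task is evaluated */
unsigned long eenv_busy_time(unsigned long pd_busy_time,
			     unsigned long task_busy_time,
			     unsigned long pd_cap, int dst_cpu)
{
	unsigned long busy = pd_busy_time;

	if (dst_cpu >= 0) {
		busy += task_busy_time;
		if (busy > pd_cap)	/* min(pd_cap, busy + task) */
			busy = pd_cap;
	}
	return busy;
}
```

The result is then fed to `em_cpu_energy()` together with `eenv_pd_max_util()`'s clamped maximum utilization, which selects the operating point.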
6718 + */ 6719 + static inline unsigned long 6720 + compute_energy(struct energy_env *eenv, struct perf_domain *pd, 6721 + struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu) 6722 + { 6723 + unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu); 6724 + unsigned long busy_time = eenv->pd_busy_time; 6725 + 6726 + if (dst_cpu >= 0) 6727 + busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time); 6728 + 6729 + return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap); 6809 6730 } 6810 6731 6811 6732 /* ··· 6866 6753 */ 6867 6754 static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) 6868 6755 { 6756 + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); 6869 6757 unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX; 6870 - struct root_domain *rd = cpu_rq(smp_processor_id())->rd; 6871 - int cpu, best_energy_cpu = prev_cpu, target = -1; 6872 - unsigned long cpu_cap, util, base_energy = 0; 6758 + struct root_domain *rd = this_rq()->rd; 6759 + int cpu, best_energy_cpu, target = -1; 6873 6760 struct sched_domain *sd; 6874 6761 struct perf_domain *pd; 6762 + struct energy_env eenv; 6875 6763 6876 6764 rcu_read_lock(); 6877 6765 pd = rcu_dereference(rd->pd); ··· 6895 6781 if (!task_util_est(p)) 6896 6782 goto unlock; 6897 6783 6898 - for (; pd; pd = pd->next) { 6899 - unsigned long cur_delta, spare_cap, max_spare_cap = 0; 6900 - bool compute_prev_delta = false; 6901 - unsigned long base_energy_pd; 6902 - int max_spare_cap_cpu = -1; 6784 + eenv_task_busy_time(&eenv, p, prev_cpu); 6903 6785 6904 - for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) { 6786 + for (; pd; pd = pd->next) { 6787 + unsigned long cpu_cap, cpu_thermal_cap, util; 6788 + unsigned long cur_delta, max_spare_cap = 0; 6789 + bool compute_prev_delta = false; 6790 + int max_spare_cap_cpu = -1; 6791 + unsigned long base_energy; 6792 + 6793 + cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); 6794 + 6795 + if 
(cpumask_empty(cpus)) 6796 + continue; 6797 + 6798 + /* Account thermal pressure for the energy estimation */ 6799 + cpu = cpumask_first(cpus); 6800 + cpu_thermal_cap = arch_scale_cpu_capacity(cpu); 6801 + cpu_thermal_cap -= arch_scale_thermal_pressure(cpu); 6802 + 6803 + eenv.cpu_cap = cpu_thermal_cap; 6804 + eenv.pd_cap = 0; 6805 + 6806 + for_each_cpu(cpu, cpus) { 6807 + eenv.pd_cap += cpu_thermal_cap; 6808 + 6809 + if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) 6810 + continue; 6811 + 6905 6812 if (!cpumask_test_cpu(cpu, p->cpus_ptr)) 6906 6813 continue; 6907 6814 6908 6815 util = cpu_util_next(cpu, p, cpu); 6909 6816 cpu_cap = capacity_of(cpu); 6910 - spare_cap = cpu_cap; 6911 - lsub_positive(&spare_cap, util); 6912 6817 6913 6818 /* 6914 6819 * Skip CPUs that cannot satisfy the capacity request. ··· 6940 6807 if (!fits_capacity(util, cpu_cap)) 6941 6808 continue; 6942 6809 6810 + lsub_positive(&cpu_cap, util); 6811 + 6943 6812 if (cpu == prev_cpu) { 6944 6813 /* Always use prev_cpu as a candidate. */ 6945 6814 compute_prev_delta = true; 6946 - } else if (spare_cap > max_spare_cap) { 6815 + } else if (cpu_cap > max_spare_cap) { 6947 6816 /* 6948 6817 * Find the CPU with the maximum spare capacity 6949 6818 * in the performance domain. 6950 6819 */ 6951 - max_spare_cap = spare_cap; 6820 + max_spare_cap = cpu_cap; 6952 6821 max_spare_cap_cpu = cpu; 6953 6822 } 6954 6823 } ··· 6958 6823 if (max_spare_cap_cpu < 0 && !compute_prev_delta) 6959 6824 continue; 6960 6825 6826 + eenv_pd_busy_time(&eenv, cpus, p); 6961 6827 /* Compute the 'base' energy of the pd, without @p */ 6962 - base_energy_pd = compute_energy(p, -1, pd); 6963 - base_energy += base_energy_pd; 6828 + base_energy = compute_energy(&eenv, pd, cpus, p, -1); 6964 6829 6965 6830 /* Evaluate the energy impact of using prev_cpu. 
*/ 6966 6831 if (compute_prev_delta) { 6967 - prev_delta = compute_energy(p, prev_cpu, pd); 6968 - if (prev_delta < base_energy_pd) 6832 + prev_delta = compute_energy(&eenv, pd, cpus, p, 6833 + prev_cpu); 6834 + /* CPU utilization has changed */ 6835 + if (prev_delta < base_energy) 6969 6836 goto unlock; 6970 - prev_delta -= base_energy_pd; 6837 + prev_delta -= base_energy; 6971 6838 best_delta = min(best_delta, prev_delta); 6972 6839 } 6973 6840 6974 6841 /* Evaluate the energy impact of using max_spare_cap_cpu. */ 6975 6842 if (max_spare_cap_cpu >= 0) { 6976 - cur_delta = compute_energy(p, max_spare_cap_cpu, pd); 6977 - if (cur_delta < base_energy_pd) 6843 + cur_delta = compute_energy(&eenv, pd, cpus, p, 6844 + max_spare_cap_cpu); 6845 + /* CPU utilization has changed */ 6846 + if (cur_delta < base_energy) 6978 6847 goto unlock; 6979 - cur_delta -= base_energy_pd; 6848 + cur_delta -= base_energy; 6980 6849 if (cur_delta < best_delta) { 6981 6850 best_delta = cur_delta; 6982 6851 best_energy_cpu = max_spare_cap_cpu; ··· 6989 6850 } 6990 6851 rcu_read_unlock(); 6991 6852 6992 - /* 6993 - * Pick the best CPU if prev_cpu cannot be used, or if it saves at 6994 - * least 6% of the energy used by prev_cpu. 6995 - */ 6996 - if ((prev_delta == ULONG_MAX) || 6997 - (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4)) 6853 + if (best_delta < prev_delta) 6998 6854 target = best_energy_cpu; 6999 6855 7000 6856 return target; ··· 7085 6951 */ 7086 6952 static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) 7087 6953 { 6954 + struct sched_entity *se = &p->se; 6955 + 7088 6956 /* 7089 6957 * As blocked tasks retain absolute vruntime the migration needs to 7090 6958 * deal with this by subtracting the old and adding the new ··· 7094 6958 * the task on the new runqueue. 
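The removal of the conservative migration threshold mentioned in the pull summary is visible in the final comparison of `find_energy_efficient_cpu()`. A side-by-side sketch of the old and new conditions (hypothetical function names; `base_energy` in the old code was the summed base energy of all pds):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* before: migrate away from prev_cpu only when the best candidate saves
 * more than ~6% ((x) >> 4 is 6.25%) of the energy around prev_cpu */
bool pick_best_old(unsigned long prev_delta, unsigned long best_delta,
		   unsigned long base_energy)
{
	return prev_delta == ULONG_MAX ||
	       (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4);
}

/* after: any strictly lower energy delta wins */
bool pick_best_new(unsigned long prev_delta, unsigned long best_delta)
{
	return best_delta < prev_delta;
}
```

With the refined PELT metrics the margin was deemed unnecessary, so a 5% saving that the old code would have ignored now triggers a migration.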
7095 6959 */ 7096 6960 if (READ_ONCE(p->__state) == TASK_WAKING) { 7097 - struct sched_entity *se = &p->se; 7098 6961 struct cfs_rq *cfs_rq = cfs_rq_of(se); 7099 - u64 min_vruntime; 7100 6962 7101 - #ifndef CONFIG_64BIT 7102 - u64 min_vruntime_copy; 7103 - 7104 - do { 7105 - min_vruntime_copy = cfs_rq->min_vruntime_copy; 7106 - smp_rmb(); 7107 - min_vruntime = cfs_rq->min_vruntime; 7108 - } while (min_vruntime != min_vruntime_copy); 7109 - #else 7110 - min_vruntime = cfs_rq->min_vruntime; 7111 - #endif 7112 - 7113 - se->vruntime -= min_vruntime; 6963 + se->vruntime -= u64_u32_load(cfs_rq->min_vruntime); 7114 6964 } 7115 6965 7116 6966 if (p->on_rq == TASK_ON_RQ_MIGRATING) { ··· 7105 6983 * rq->lock and can modify state directly. 7106 6984 */ 7107 6985 lockdep_assert_rq_held(task_rq(p)); 7108 - detach_entity_cfs_rq(&p->se); 6986 + detach_entity_cfs_rq(se); 7109 6987 7110 6988 } else { 6989 + remove_entity_load_avg(se); 6990 + 7111 6991 /* 7112 - * We are supposed to update the task to "current" time, then 7113 - * its up to date and ready to go to new CPU/cfs_rq. But we 7114 - * have difficulty in getting what current time is, so simply 7115 - * throw away the out-of-date time. This will result in the 7116 - * wakee task is less decayed, but giving the wakee more load 7117 - * sounds not bad. 6992 + * Here, the task's PELT values have been updated according to 6993 + * the current rq's clock. But if that clock hasn't been 6994 + * updated in a while, a substantial idle time will be missed, 6995 + * leading to an inflation after wake-up on the new rq. 6996 + * 6997 + * Estimate the missing time from the cfs_rq last_update_time 6998 + * and update sched_avg to improve the PELT continuity after 6999 + * migration. 
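The `se->vruntime -= u64_u32_load(cfs_rq->min_vruntime)` step above makes a waking task's vruntime relative; the new runqueue's `min_vruntime` is added back at enqueue. A toy sketch of the net effect (hypothetical helper; the two halves run on different CPUs in reality):

```c
#include <assert.h>
#include <stdint.h>

/* blocked tasks retain absolute vruntime, so migration subtracts the old
 * cfs_rq's min_vruntime and enqueue adds the new one, preserving the
 * task's position relative to its peers */
uint64_t rebase_vruntime(uint64_t vruntime, uint64_t old_min, uint64_t new_min)
{
	return (vruntime - old_min) + new_min;
}
```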
7118 7000 */ 7119 - remove_entity_load_avg(&p->se); 7001 + migrate_se_pelt_lag(se); 7120 7002 } 7121 7003 7122 7004 /* Tell new CPU we are migrated */ 7123 - p->se.avg.last_update_time = 0; 7005 + se->avg.last_update_time = 0; 7124 7006 7125 7007 /* We have migrated, no longer consider this task hot */ 7126 - p->se.exec_start = 0; 7008 + se->exec_start = 0; 7127 7009 7128 7010 update_scan_period(p, new_cpu); 7129 7011 } ··· 7711 7585 */ 7712 7586 group_fully_busy, 7713 7587 /* 7714 - * SD_ASYM_CPUCAPACITY only: One task doesn't fit with CPU's capacity 7715 - * and must be migrated to a more powerful CPU. 7588 + * One task doesn't fit with CPU's capacity and must be migrated to a 7589 + * more powerful CPU. 7716 7590 */ 7717 7591 group_misfit_task, 7718 7592 /* ··· 8293 8167 if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) { 8294 8168 update_tg_load_avg(cfs_rq); 8295 8169 8170 + if (cfs_rq->nr_running == 0) 8171 + update_idle_cfs_rq_clock_pelt(cfs_rq); 8172 + 8296 8173 if (cfs_rq == &rq->cfs) 8297 8174 decayed = true; 8298 8175 } ··· 8629 8500 /* 8630 8501 * group_has_capacity returns true if the group has spare capacity that could 8631 8502 * be used by some tasks. 8632 - * We consider that a group has spare capacity if the * number of task is 8503 + * We consider that a group has spare capacity if the number of task is 8633 8504 * smaller than the number of CPUs or if the utilization is lower than the 8634 8505 * available capacity for CFS tasks. 
8635 8506 * For the latter, we use a threshold to stabilize the state, to take into ··· 8798 8669 return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu); 8799 8670 } 8800 8671 8672 + static inline bool 8673 + sched_reduced_capacity(struct rq *rq, struct sched_domain *sd) 8674 + { 8675 + /* 8676 + * When there is more than 1 task, the group_overloaded case already 8677 + * takes care of cpu with reduced capacity 8678 + */ 8679 + if (rq->cfs.h_nr_running != 1) 8680 + return false; 8681 + 8682 + return check_cpu_capacity(rq, sd); 8683 + } 8684 + 8801 8685 /** 8802 8686 * update_sg_lb_stats - Update sched_group's statistics for load balancing. 8803 8687 * @env: The load balancing environment. ··· 8833 8691 8834 8692 for_each_cpu_and(i, sched_group_span(group), env->cpus) { 8835 8693 struct rq *rq = cpu_rq(i); 8694 + unsigned long load = cpu_load(rq); 8836 8695 8837 - sgs->group_load += cpu_load(rq); 8696 + sgs->group_load += load; 8838 8697 sgs->group_util += cpu_util_cfs(i); 8839 8698 sgs->group_runnable += cpu_runnable(rq); 8840 8699 sgs->sum_h_nr_running += rq->cfs.h_nr_running; ··· 8865 8722 if (local_group) 8866 8723 continue; 8867 8724 8868 - /* Check for a misfit task on the cpu */ 8869 - if (env->sd->flags & SD_ASYM_CPUCAPACITY && 8870 - sgs->group_misfit_task_load < rq->misfit_task_load) { 8871 - sgs->group_misfit_task_load = rq->misfit_task_load; 8872 - *sg_status |= SG_OVERLOAD; 8725 + if (env->sd->flags & SD_ASYM_CPUCAPACITY) { 8726 + /* Check for a misfit task on the cpu */ 8727 + if (sgs->group_misfit_task_load < rq->misfit_task_load) { 8728 + sgs->group_misfit_task_load = rq->misfit_task_load; 8729 + *sg_status |= SG_OVERLOAD; 8730 + } 8731 + } else if ((env->idle != CPU_NOT_IDLE) && 8732 + sched_reduced_capacity(rq, env->sd)) { 8733 + /* Check for a task running on a CPU with reduced capacity */ 8734 + if (sgs->group_misfit_task_load < load) 8735 + sgs->group_misfit_task_load = load; 8873 8736 } 8874 8737 } 8875 8738 ··· 8928 8779 * CPUs in the 
group should either be possible to resolve 8929 8780 * internally or be covered by avg_load imbalance (eventually). 8930 8781 */ 8931 - if (sgs->group_type == group_misfit_task && 8782 + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) && 8783 + (sgs->group_type == group_misfit_task) && 8932 8784 (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) || 8933 8785 sds->local_stat.group_type != group_has_spare)) 8934 8786 return false; ··· 9208 9058 } 9209 9059 9210 9060 /* 9211 - * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain. 9212 - * This is an approximation as the number of running tasks may not be 9213 - * related to the number of busy CPUs due to sched_setaffinity. 9214 - */ 9215 - static inline bool allow_numa_imbalance(int running, int imb_numa_nr) 9216 - { 9217 - return running <= imb_numa_nr; 9218 - } 9219 - 9220 - /* 9221 9061 * find_idlest_group() finds and returns the least busy CPU group within the 9222 9062 * domain. 9223 9063 * ··· 9323 9183 break; 9324 9184 9325 9185 case group_has_spare: 9186 + #ifdef CONFIG_NUMA 9326 9187 if (sd->flags & SD_NUMA) { 9188 + int imb_numa_nr = sd->imb_numa_nr; 9327 9189 #ifdef CONFIG_NUMA_BALANCING 9328 9190 int idlest_cpu; 9329 9191 /* ··· 9338 9196 idlest_cpu = cpumask_first(sched_group_span(idlest)); 9339 9197 if (cpu_to_node(idlest_cpu) == p->numa_preferred_nid) 9340 9198 return idlest; 9341 - #endif 9199 + #endif /* CONFIG_NUMA_BALANCING */ 9342 9200 /* 9343 9201 * Otherwise, keep the task close to the wakeup source 9344 9202 * and improve locality if the number of running tasks 9345 9203 * would remain below threshold where an imbalance is 9346 - * allowed. If there is a real need of migration, 9347 - * periodic load balance will take care of it. 9204 + * allowed while accounting for the possibility the 9205 + * task is pinned to a subset of CPUs. If there is a 9206 + * real need of migration, periodic load balance will 9207 + * take care of it. 
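The pinning-aware clamp added to `find_idlest_group()` above can be isolated into a small sketch (hypothetical function; `nr_online_cpus` stands in for the kernel's compile-time `NR_CPUS`, and `allowed_in_local_group` for the weight of `p->cpus_ptr` intersected with the local group's span):

```c
#include <assert.h>

/* a task allowed on only a few CPUs of the local group cannot justify a
 * NUMA imbalance larger than that number of CPUs */
int effective_imb_numa_nr(int allowed_in_local_group, int nr_cpus_allowed,
			  int nr_online_cpus, int imb_numa_nr)
{
	if (nr_cpus_allowed != nr_online_cpus)
		return allowed_in_local_group < imb_numa_nr ?
		       allowed_in_local_group : imb_numa_nr;
	return imb_numa_nr;
}
```

The resulting threshold is then passed to `adjust_numa_imbalance()` together with `abs(local_idle_cpus - idlest_idle_cpus)`, making the fork-time decision match the runtime balancing logic.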
9348 9208 */ 9349 - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr)) 9209 + if (p->nr_cpus_allowed != NR_CPUS) { 9210 + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); 9211 + 9212 + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr); 9213 + imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr); 9214 + } 9215 + 9216 + imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus); 9217 + if (!adjust_numa_imbalance(imbalance, 9218 + local_sgs.sum_nr_running + 1, 9219 + imb_numa_nr)) { 9350 9220 return NULL; 9221 + } 9351 9222 } 9223 + #endif /* CONFIG_NUMA */ 9352 9224 9353 9225 /* 9354 9226 * Select group with highest number of idle CPUs. We could also ··· 9378 9222 return idlest; 9379 9223 } 9380 9224 9225 + static void update_idle_cpu_scan(struct lb_env *env, 9226 + unsigned long sum_util) 9227 + { 9228 + struct sched_domain_shared *sd_share; 9229 + int llc_weight, pct; 9230 + u64 x, y, tmp; 9231 + /* 9232 + * Update the number of CPUs to scan in LLC domain, which could 9233 + * be used as a hint in select_idle_cpu(). The update of sd_share 9234 + * could be expensive because it is within a shared cache line. 9235 + * So the write of this hint only occurs during periodic load 9236 + * balancing, rather than CPU_NEWLY_IDLE, because the latter 9237 + * can fire way more frequently than the former. 9238 + */ 9239 + if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE) 9240 + return; 9241 + 9242 + llc_weight = per_cpu(sd_llc_size, env->dst_cpu); 9243 + if (env->sd->span_weight != llc_weight) 9244 + return; 9245 + 9246 + sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu)); 9247 + if (!sd_share) 9248 + return; 9249 + 9250 + /* 9251 + * The number of CPUs to search drops as sum_util increases, when 9252 + * sum_util hits 85% or above, the scan stops. 9253 + * The reason to choose 85% as the threshold is because this is the 9254 + * imbalance_pct(117) when a LLC sched group is overloaded. 
9255 + * 9256 + * let y = SCHED_CAPACITY_SCALE - p * x^2 [1] 9257 + * and y'= y / SCHED_CAPACITY_SCALE 9258 + * 9259 + * x is the ratio of sum_util compared to the CPU capacity: 9260 + * x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE) 9261 + * y' is the ratio of CPUs to be scanned in the LLC domain, 9262 + * and the number of CPUs to scan is calculated by: 9263 + * 9264 + * nr_scan = llc_weight * y' [2] 9265 + * 9266 + * When x hits the threshold of overloaded, AKA, when 9267 + * x = 100 / pct, y drops to 0. According to [1], 9268 + * p should be SCHED_CAPACITY_SCALE * pct^2 / 10000 9269 + * 9270 + * Scale x by SCHED_CAPACITY_SCALE: 9271 + * x' = sum_util / llc_weight; [3] 9272 + * 9273 + * and finally [1] becomes: 9274 + * y = SCHED_CAPACITY_SCALE - 9275 + * x'^2 * pct^2 / (10000 * SCHED_CAPACITY_SCALE) [4] 9276 + * 9277 + */ 9278 + /* equation [3] */ 9279 + x = sum_util; 9280 + do_div(x, llc_weight); 9281 + 9282 + /* equation [4] */ 9283 + pct = env->sd->imbalance_pct; 9284 + tmp = x * x * pct * pct; 9285 + do_div(tmp, 10000 * SCHED_CAPACITY_SCALE); 9286 + tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE); 9287 + y = SCHED_CAPACITY_SCALE - tmp; 9288 + 9289 + /* equation [2] */ 9290 + y *= llc_weight; 9291 + do_div(y, SCHED_CAPACITY_SCALE); 9292 + if ((int)y != sd_share->nr_idle_scan) 9293 + WRITE_ONCE(sd_share->nr_idle_scan, (int)y); 9294 + } 9295 + 9381 9296 /** 9382 9297 * update_sd_lb_stats - Update sched_domain's statistics for load balancing. 9383 9298 * @env: The load balancing environment. 
··· 9461 9234 struct sched_group *sg = env->sd->groups; 9462 9235 struct sg_lb_stats *local = &sds->local_stat; 9463 9236 struct sg_lb_stats tmp_sgs; 9237 + unsigned long sum_util = 0; 9464 9238 int sg_status = 0; 9465 9239 9466 9240 do { ··· 9494 9266 sds->total_load += sgs->group_load; 9495 9267 sds->total_capacity += sgs->group_capacity; 9496 9268 9269 + sum_util += sgs->group_util; 9497 9270 sg = sg->next; 9498 9271 } while (sg != env->sd->groups); 9499 9272 ··· 9520 9291 WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED); 9521 9292 trace_sched_overutilized_tp(rd, SG_OVERUTILIZED); 9522 9293 } 9523 - } 9524 9294 9525 - #define NUMA_IMBALANCE_MIN 2 9526 - 9527 - static inline long adjust_numa_imbalance(int imbalance, 9528 - int dst_running, int imb_numa_nr) 9529 - { 9530 - if (!allow_numa_imbalance(dst_running, imb_numa_nr)) 9531 - return imbalance; 9532 - 9533 - /* 9534 - * Allow a small imbalance based on a simple pair of communicating 9535 - * tasks that remain local when the destination is lightly loaded. 9536 - */ 9537 - if (imbalance <= NUMA_IMBALANCE_MIN) 9538 - return 0; 9539 - 9540 - return imbalance; 9295 + update_idle_cpu_scan(env, sum_util); 9541 9296 } 9542 9297 9543 9298 /** ··· 9538 9325 busiest = &sds->busiest_stat; 9539 9326 9540 9327 if (busiest->group_type == group_misfit_task) { 9541 - /* Set imbalance to allow misfit tasks to be balanced. */ 9542 - env->migration_type = migrate_misfit; 9543 - env->imbalance = 1; 9328 + if (env->sd->flags & SD_ASYM_CPUCAPACITY) { 9329 + /* Set imbalance to allow misfit tasks to be balanced. */ 9330 + env->migration_type = migrate_misfit; 9331 + env->imbalance = 1; 9332 + } else { 9333 + /* 9334 + * Set load imbalance to allow moving task from cpu 9335 + * with reduced capacity. 
9336 + */ 9337 + env->migration_type = migrate_load; 9338 + env->imbalance = busiest->group_misfit_task_load; 9339 + } 9544 9340 return; 9545 9341 } 9546 9342 ··· 9617 9395 */ 9618 9396 env->migration_type = migrate_task; 9619 9397 lsub_positive(&nr_diff, local->sum_nr_running); 9620 - env->imbalance = nr_diff >> 1; 9398 + env->imbalance = nr_diff; 9621 9399 } else { 9622 9400 9623 9401 /* ··· 9625 9403 * idle cpus. 9626 9404 */ 9627 9405 env->migration_type = migrate_task; 9628 - env->imbalance = max_t(long, 0, (local->idle_cpus - 9629 - busiest->idle_cpus) >> 1); 9406 + env->imbalance = max_t(long, 0, 9407 + (local->idle_cpus - busiest->idle_cpus)); 9630 9408 } 9631 9409 9410 + #ifdef CONFIG_NUMA 9632 9411 /* Consider allowing a small imbalance between NUMA groups */ 9633 9412 if (env->sd->flags & SD_NUMA) { 9634 9413 env->imbalance = adjust_numa_imbalance(env->imbalance, 9635 - local->sum_nr_running + 1, env->sd->imb_numa_nr); 9414 + local->sum_nr_running + 1, 9415 + env->sd->imb_numa_nr); 9636 9416 } 9417 + #endif 9418 + 9419 + /* Number of tasks to move to restore balance */ 9420 + env->imbalance >>= 1; 9637 9421 9638 9422 return; 9639 9423 } ··· 10062 9834 /* 10063 9835 * In the newly idle case, we will allow all the CPUs 10064 9836 * to do the newly idle load balance. 9837 + * 9838 + * However, we bail out if we already have tasks or a wakeup pending, 9839 + * to optimize wakeup latency. 
10065 9840 */ 10066 - if (env->idle == CPU_NEWLY_IDLE) 9841 + if (env->idle == CPU_NEWLY_IDLE) { 9842 + if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending) 9843 + return 0; 10067 9844 return 1; 9845 + } 10068 9846 10069 9847 /* Try to find first idle CPU */ 10070 9848 for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) { ··· 11521 11287 */ 11522 11288 static void propagate_entity_cfs_rq(struct sched_entity *se) 11523 11289 { 11524 - struct cfs_rq *cfs_rq; 11290 + struct cfs_rq *cfs_rq = cfs_rq_of(se); 11525 11291 11526 - list_add_leaf_cfs_rq(cfs_rq_of(se)); 11292 + if (cfs_rq_throttled(cfs_rq)) 11293 + return; 11294 + 11295 + if (!throttled_hierarchy(cfs_rq)) 11296 + list_add_leaf_cfs_rq(cfs_rq); 11527 11297 11528 11298 /* Start to propagate at parent */ 11529 11299 se = se->parent; ··· 11535 11297 for_each_sched_entity(se) { 11536 11298 cfs_rq = cfs_rq_of(se); 11537 11299 11538 - if (!cfs_rq_throttled(cfs_rq)){ 11539 - update_load_avg(cfs_rq, se, UPDATE_TG); 11540 - list_add_leaf_cfs_rq(cfs_rq); 11541 - continue; 11542 - } 11300 + update_load_avg(cfs_rq, se, UPDATE_TG); 11543 11301 11544 - if (list_add_leaf_cfs_rq(cfs_rq)) 11302 + if (cfs_rq_throttled(cfs_rq)) 11545 11303 break; 11304 + 11305 + if (!throttled_hierarchy(cfs_rq)) 11306 + list_add_leaf_cfs_rq(cfs_rq); 11546 11307 } 11547 11308 } 11548 11309 #else ··· 11659 11422 void init_cfs_rq(struct cfs_rq *cfs_rq) 11660 11423 { 11661 11424 cfs_rq->tasks_timeline = RB_ROOT_CACHED; 11662 - cfs_rq->min_vruntime = (u64)(-(1LL << 20)); 11663 - #ifndef CONFIG_64BIT 11664 - cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime; 11665 - #endif 11425 + u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20))); 11666 11426 #ifdef CONFIG_SMP 11667 11427 raw_spin_lock_init(&cfs_rq->removed.lock); 11668 11428 #endif
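The `update_idle_cpu_scan()` hunk above derives the number of CPUs to scan from equations [2]–[4] in its comment. As a quick sanity check, here is a user-space sketch of that arithmetic; `SCHED_CAPACITY_SCALE` (1024) and the 117 `imbalance_pct` match the kernel's values, but the helper name `nr_idle_scan()` is ours:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Sketch of the calculation in update_idle_cpu_scan():
 *   x' = sum_util / llc_weight                                        [3]
 *   y  = SCHED_CAPACITY_SCALE - x'^2 * pct^2 / (10000 * SCHED_CAPACITY_SCALE)  [4]
 *   nr_scan = llc_weight * y / SCHED_CAPACITY_SCALE                   [2]
 * The scan width shrinks quadratically with utilization and reaches 0
 * around 85% (the overload point for imbalance_pct == 117).
 */
static int nr_idle_scan(unsigned long sum_util, int llc_weight, int pct)
{
	unsigned long long x = sum_util / llc_weight;		/* [3] */
	unsigned long long tmp = x * x * pct * pct /
				 (10000ULL * SCHED_CAPACITY_SCALE); /* [4] */

	if (tmp > SCHED_CAPACITY_SCALE)
		tmp = SCHED_CAPACITY_SCALE;
	/* [2]: scale the remaining capacity ratio back to a CPU count */
	return (int)((SCHED_CAPACITY_SCALE - tmp) * llc_weight /
		     SCHED_CAPACITY_SCALE);
}
```

For a 16-CPU LLC this scans all 16 CPUs when idle, about 10 at 50% utilization, and none once utilization passes ~85%.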
+2 -1
kernel/sched/features.h
··· 60 60 /* 61 61 * When doing wakeups, attempt to limit superfluous scans of the LLC domain. 62 62 */ 63 - SCHED_FEAT(SIS_PROP, true) 63 + SCHED_FEAT(SIS_PROP, false) 64 + SCHED_FEAT(SIS_UTIL, true) 64 65 65 66 /* 66 67 * Issue a WARN when we do multiple update_rq_clock() calls
+35 -9
kernel/sched/pelt.h
··· 61 61 WRITE_ONCE(avg->util_est.enqueued, enqueued); 62 62 } 63 63 64 + static inline u64 rq_clock_pelt(struct rq *rq) 65 + { 66 + lockdep_assert_rq_held(rq); 67 + assert_clock_updated(rq); 68 + 69 + return rq->clock_pelt - rq->lost_idle_time; 70 + } 71 + 72 + /* The rq is idle, we can sync to clock_task */ 73 + static inline void _update_idle_rq_clock_pelt(struct rq *rq) 74 + { 75 + rq->clock_pelt = rq_clock_task(rq); 76 + 77 + u64_u32_store(rq->clock_idle, rq_clock(rq)); 78 + /* Paired with smp_rmb in migrate_se_pelt_lag() */ 79 + smp_wmb(); 80 + u64_u32_store(rq->clock_pelt_idle, rq_clock_pelt(rq)); 81 + } 82 + 64 83 /* 65 84 * The clock_pelt scales the time to reflect the effective amount of 66 85 * computation done during the running delta time but then sync back to ··· 95 76 static inline void update_rq_clock_pelt(struct rq *rq, s64 delta) 96 77 { 97 78 if (unlikely(is_idle_task(rq->curr))) { 98 - /* The rq is idle, we can sync to clock_task */ 99 - rq->clock_pelt = rq_clock_task(rq); 79 + _update_idle_rq_clock_pelt(rq); 100 80 return; 101 81 } 102 82 ··· 148 130 */ 149 131 if (util_sum >= divider) 150 132 rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt; 151 - } 152 133 153 - static inline u64 rq_clock_pelt(struct rq *rq) 154 - { 155 - lockdep_assert_rq_held(rq); 156 - assert_clock_updated(rq); 157 - 158 - return rq->clock_pelt - rq->lost_idle_time; 134 + _update_idle_rq_clock_pelt(rq); 159 135 } 160 136 161 137 #ifdef CONFIG_CFS_BANDWIDTH 138 + static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) 139 + { 140 + u64 throttled; 141 + 142 + if (unlikely(cfs_rq->throttle_count)) 143 + throttled = U64_MAX; 144 + else 145 + throttled = cfs_rq->throttled_clock_pelt_time; 146 + 147 + u64_u32_store(cfs_rq->throttled_pelt_idle, throttled); 148 + } 149 + 162 150 /* rq->task_clock normalized against any time this cfs_rq has spent throttled */ 163 151 static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) 164 152 { ··· 174 150 return 
rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time; 175 151 } 176 152 #else 153 + static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { } 177 154 static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) 178 155 { 179 156 return rq_clock_pelt(rq_of(cfs_rq)); ··· 229 204 static inline void 230 205 update_idle_rq_clock_pelt(struct rq *rq) { } 231 206 207 + static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { } 232 208 #endif 233 209 234 210
+9 -6
kernel/sched/rt.c
··· 480 480 #endif /* CONFIG_SMP */ 481 481 482 482 static void enqueue_top_rt_rq(struct rt_rq *rt_rq); 483 - static void dequeue_top_rt_rq(struct rt_rq *rt_rq); 483 + static void dequeue_top_rt_rq(struct rt_rq *rt_rq, unsigned int count); 484 484 485 485 static inline int on_rt_rq(struct sched_rt_entity *rt_se) 486 486 { ··· 601 601 rt_se = rt_rq->tg->rt_se[cpu]; 602 602 603 603 if (!rt_se) { 604 - dequeue_top_rt_rq(rt_rq); 604 + dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running); 605 605 /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ 606 606 cpufreq_update_util(rq_of_rt_rq(rt_rq), 0); 607 607 } ··· 687 687 688 688 static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq) 689 689 { 690 - dequeue_top_rt_rq(rt_rq); 690 + dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running); 691 691 } 692 692 693 693 static inline int rt_rq_throttled(struct rt_rq *rt_rq) ··· 1089 1089 } 1090 1090 1091 1091 static void 1092 - dequeue_top_rt_rq(struct rt_rq *rt_rq) 1092 + dequeue_top_rt_rq(struct rt_rq *rt_rq, unsigned int count) 1093 1093 { 1094 1094 struct rq *rq = rq_of_rt_rq(rt_rq); 1095 1095 ··· 1100 1100 1101 1101 BUG_ON(!rq->nr_running); 1102 1102 1103 - sub_nr_running(rq, rt_rq->rt_nr_running); 1103 + sub_nr_running(rq, count); 1104 1104 rt_rq->rt_queued = 0; 1105 1105 1106 1106 } ··· 1486 1486 static void dequeue_rt_stack(struct sched_rt_entity *rt_se, unsigned int flags) 1487 1487 { 1488 1488 struct sched_rt_entity *back = NULL; 1489 + unsigned int rt_nr_running; 1489 1490 1490 1491 for_each_sched_rt_entity(rt_se) { 1491 1492 rt_se->back = back; 1492 1493 back = rt_se; 1493 1494 } 1494 1495 1495 - dequeue_top_rt_rq(rt_rq_of_se(back)); 1496 + rt_nr_running = rt_rq_of_se(back)->rt_nr_running; 1496 1497 1497 1498 for (rt_se = back; rt_se; rt_se = rt_se->back) { 1498 1499 if (on_rt_rq(rt_se)) 1499 1500 __dequeue_rt_entity(rt_se, flags); 1500 1501 } 1502 + 1503 + dequeue_top_rt_rq(rt_rq_of_se(back), rt_nr_running); 1501 1504 } 1502 1505 1503 1506 static void 
enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
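The `dequeue_rt_stack()` change above snapshots `rt_nr_running` *before* the per-entity dequeue loop, because the loop itself drains the counter; reading it afterwards would under-account in `sub_nr_running()`. A toy model of that snapshot-then-subtract pattern (the struct and function names here are illustrative, not the kernel's):

```c
#include <assert.h>

/* Toy model: a root-level counter that must be decremented by the
 * number of entities queued *before* the dequeue loop runs. */
struct toy_rq {
	unsigned int nr_running;	/* root-level counter */
	unsigned int rt_nr_running;	/* per-rt_rq counter */
};

static void dequeue_all(struct toy_rq *rq)
{
	unsigned int count = rq->rt_nr_running;	/* snapshot first */

	while (rq->rt_nr_running)		/* per-entity dequeues */
		rq->rt_nr_running--;

	/* mirrors sub_nr_running(rq, count) with the saved count */
	rq->nr_running -= count;
}
```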
+51 -12
kernel/sched/sched.h
··· 520 520 521 521 #endif /* CONFIG_CGROUP_SCHED */ 522 522 523 + /* 524 + * u64_u32_load/u64_u32_store 525 + * 526 + * Use a copy of a u64 value to protect against data race. This is only 527 + * applicable for 32-bits architectures. 528 + */ 529 + #ifdef CONFIG_64BIT 530 + # define u64_u32_load_copy(var, copy) var 531 + # define u64_u32_store_copy(var, copy, val) (var = val) 532 + #else 533 + # define u64_u32_load_copy(var, copy) \ 534 + ({ \ 535 + u64 __val, __val_copy; \ 536 + do { \ 537 + __val_copy = copy; \ 538 + /* \ 539 + * paired with u64_u32_store_copy(), ordering access \ 540 + * to var and copy. \ 541 + */ \ 542 + smp_rmb(); \ 543 + __val = var; \ 544 + } while (__val != __val_copy); \ 545 + __val; \ 546 + }) 547 + # define u64_u32_store_copy(var, copy, val) \ 548 + do { \ 549 + typeof(val) __val = (val); \ 550 + var = __val; \ 551 + /* \ 552 + * paired with u64_u32_load_copy(), ordering access to var and \ 553 + * copy. \ 554 + */ \ 555 + smp_wmb(); \ 556 + copy = __val; \ 557 + } while (0) 558 + #endif 559 + # define u64_u32_load(var) u64_u32_load_copy(var, var##_copy) 560 + # define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val) 561 + 523 562 /* CFS-related fields in a runqueue */ 524 563 struct cfs_rq { 525 564 struct load_weight load; ··· 599 560 */ 600 561 struct sched_avg avg; 601 562 #ifndef CONFIG_64BIT 602 - u64 load_last_update_time_copy; 563 + u64 last_update_time_copy; 603 564 #endif 604 565 struct { 605 566 raw_spinlock_t lock ____cacheline_aligned; ··· 648 609 int runtime_enabled; 649 610 s64 runtime_remaining; 650 611 612 + u64 throttled_pelt_idle; 613 + #ifndef CONFIG_64BIT 614 + u64 throttled_pelt_idle_copy; 615 + #endif 651 616 u64 throttled_clock; 652 617 u64 throttled_clock_pelt; 653 618 u64 throttled_clock_pelt_time; ··· 1024 981 u64 clock_task ____cacheline_aligned; 1025 982 u64 clock_pelt; 1026 983 unsigned long lost_idle_time; 984 + u64 clock_pelt_idle; 985 + u64 clock_idle; 986 + #ifndef CONFIG_64BIT 987 + 
u64 clock_pelt_idle_copy; 988 + u64 clock_idle_copy; 989 + #endif 1027 990 1028 991 atomic_t nr_iowait; 1029 992 ··· 1864 1815 return to_cpumask(sg->sgc->cpumask); 1865 1816 } 1866 1817 1867 - /** 1868 - * group_first_cpu - Returns the first CPU in the cpumask of a sched_group. 1869 - * @group: The group whose first CPU is to be returned. 1870 - */ 1871 - static inline unsigned int group_first_cpu(struct sched_group *group) 1872 - { 1873 - return cpumask_first(sched_group_span(group)); 1874 - } 1875 - 1876 1818 extern int group_balance_cpu(struct sched_group *sg); 1877 1819 1878 1820 #ifdef CONFIG_SCHED_DEBUG ··· 2084 2044 2085 2045 #define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */ 2086 2046 #define WF_MIGRATED 0x20 /* Internal use, task got migrated */ 2087 - #define WF_ON_CPU 0x40 /* Wakee is on_cpu */ 2088 2047 2089 2048 #ifdef CONFIG_SMP 2090 2049 static_assert(WF_EXEC == SD_BALANCE_EXEC); ··· 2891 2852 }; 2892 2853 2893 2854 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, 2894 - unsigned long max, enum cpu_util_type type, 2855 + enum cpu_util_type type, 2895 2856 struct task_struct *p); 2896 2857 2897 2858 static inline unsigned long cpu_bw_dl(struct rq *rq)
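The `u64_u32_load_copy()`/`u64_u32_store_copy()` macros added above protect 64-bit values against store tearing on 32-bit architectures: the writer publishes the value twice with a write barrier in between, and the reader retries until both copies agree. A minimal user-space sketch of the same pattern, with C11 fences standing in for `smp_wmb()`/`smp_rmb()` (the struct and function forms here are ours; the kernel versions are macros over a `var`/`var##_copy` pair):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct u64_copy {
	uint64_t var;
	uint64_t copy;
};

static void u64_u32_store(struct u64_copy *v, uint64_t val)
{
	v->var = val;
	/* order var before copy, as smp_wmb() does in the kernel */
	atomic_thread_fence(memory_order_release);
	v->copy = val;
}

static uint64_t u64_u32_load(const struct u64_copy *v)
{
	uint64_t val, val_copy;

	do {
		val_copy = v->copy;
		/* order copy before var, as smp_rmb() does */
		atomic_thread_fence(memory_order_acquire);
		val = v->var;
	} while (val != val_copy);	/* retry if a store raced us */
	return val;
}
```

On 64-bit builds the kernel compiles the pair down to a plain load/store; the copy machinery only exists when a 64-bit access can actually tear.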
+15 -8
kernel/sched/topology.c
··· 2316 2316 2317 2317 /* 2318 2318 * For a single LLC per node, allow an 2319 - * imbalance up to 25% of the node. This is an 2320 - * arbitrary cutoff based on SMT-2 to balance 2321 - * between memory bandwidth and avoiding 2322 - * premature sharing of HT resources and SMT-4 2323 - * or SMT-8 *may* benefit from a different 2324 - * cutoff. 2319 + * imbalance up to 12.5% of the node. This is an 2320 + * arbitrary cutoff based on two factors -- SMT and 2321 + * memory channels. For SMT-2, the intent is to 2322 + * avoid premature sharing of HT resources but 2323 + * SMT-4 or SMT-8 *may* benefit from a different 2324 + * cutoff. For memory channels, this is a very 2325 + * rough estimate of how many channels may be 2326 + * active and is based on recent CPUs with 2327 + * many cores. 2325 2328 * 2326 2329 * For multiple LLCs, allow an imbalance 2327 2330 * until multiple tasks would share an LLC 2328 2331 * on one node while LLCs on another node 2329 2332 * remain idle. This assumes that there are 2333 + * enough logical CPUs per LLC to avoid SMT 2334 + * factors and that there is a correlation 2335 + * between LLCs and memory channels. 2330 2336 */ 2331 2337 nr_llcs = sd->span_weight / child->span_weight; 2332 2338 if (nr_llcs == 1) 2333 - imb = sd->span_weight >> 2; 2339 + imb = sd->span_weight >> 3; 2334 2340 else 2335 2341 imb = nr_llcs; 2342 + imb = max(1U, imb); 2336 2343 sd->imb_numa_nr = imb; 2337 2344 2338 2345 /* Set span based on the first NUMA domain. */
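The `imb_numa_nr` hunk above lowers the single-LLC NUMA imbalance cutoff from 25% (`>> 2`) to 12.5% (`>> 3`) of the node, keeps one-task-per-LLC for multi-LLC nodes, and clamps the result to at least 1. Expressed as a standalone helper (the function name and parameters are ours, for illustration):

```c
#include <assert.h>

/*
 * Sketch of the sched_domain imb_numa_nr heuristic: sd_span and
 * child_span stand in for sd->span_weight and child->span_weight.
 */
static unsigned int imb_numa_nr(unsigned int sd_span, unsigned int child_span)
{
	unsigned int nr_llcs = sd_span / child_span;
	unsigned int imb;

	if (nr_llcs == 1)
		imb = sd_span >> 3;	/* 12.5% of the node */
	else
		imb = nr_llcs;		/* one task per LLC */

	return imb > 1 ? imb : 1;	/* max(1U, imb) */
}
```

The `max(1U, imb)` clamp matters on small nodes: a 4-CPU node with one LLC would otherwise compute `4 >> 3 == 0` and forbid any imbalance at all.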
+25 -25
tools/testing/selftests/rseq/rseq-riscv.h
··· 86 86 87 87 #define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \ 88 88 RSEQ_INJECT_ASM(1) \ 89 - "la "RSEQ_ASM_TMP_REG_1 ", " __rseq_str(cs_label) "\n" \ 89 + "la " RSEQ_ASM_TMP_REG_1 ", " __rseq_str(cs_label) "\n" \ 90 90 REG_S RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(rseq_cs) "]\n" \ 91 91 __rseq_str(label) ":\n" 92 92 ··· 103 103 104 104 #define RSEQ_ASM_OP_CMPEQ(var, expect, label) \ 105 105 REG_L RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(var) "]\n" \ 106 - "bne "RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(expect) "] ," \ 106 + "bne " RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(expect) "] ," \ 107 107 __rseq_str(label) "\n" 108 108 109 109 #define RSEQ_ASM_OP_CMPEQ32(var, expect, label) \ 110 - "lw "RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(var) "]\n" \ 111 - "bne "RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(expect) "] ," \ 110 + "lw " RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(var) "]\n" \ 111 + "bne " RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(expect) "] ," \ 112 112 __rseq_str(label) "\n" 113 113 114 114 #define RSEQ_ASM_OP_CMPNE(var, expect, label) \ 115 115 REG_L RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(var) "]\n" \ 116 - "beq "RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(expect) "] ," \ 116 + "beq " RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(expect) "] ," \ 117 117 __rseq_str(label) "\n" 118 118 119 119 #define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \ ··· 127 127 REG_S RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(var) "]\n" 128 128 129 129 #define RSEQ_ASM_OP_R_LOAD_OFF(offset) \ 130 - "add "RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(offset) "], " \ 130 + "add " RSEQ_ASM_TMP_REG_1 ", %[" __rseq_str(offset) "], " \ 131 131 RSEQ_ASM_TMP_REG_1 "\n" \ 132 132 REG_L RSEQ_ASM_TMP_REG_1 ", (" RSEQ_ASM_TMP_REG_1 ")\n" 133 133 134 134 #define RSEQ_ASM_OP_R_ADD(count) \ 135 - "add "RSEQ_ASM_TMP_REG_1 ", " RSEQ_ASM_TMP_REG_1 \ 135 + "add " RSEQ_ASM_TMP_REG_1 ", " RSEQ_ASM_TMP_REG_1 \ 136 136 ", %[" __rseq_str(count) "]\n" 137 137 138 138 #define RSEQ_ASM_OP_FINAL_STORE(value, var, post_commit_label) \ ··· 194 194 
RSEQ_ASM_DEFINE_ABORT(4, abort) 195 195 : /* gcc asm goto does not allow outputs */ 196 196 : [cpu_id] "r" (cpu), 197 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 198 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 197 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 198 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 199 199 [v] "m" (*v), 200 200 [expect] "r" (expect), 201 201 [newv] "r" (newv) ··· 251 251 RSEQ_ASM_DEFINE_ABORT(4, abort) 252 252 : /* gcc asm goto does not allow outputs */ 253 253 : [cpu_id] "r" (cpu), 254 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 255 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 254 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 255 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 256 256 [v] "m" (*v), 257 257 [expectnot] "r" (expectnot), 258 258 [load] "m" (*load), ··· 301 301 RSEQ_ASM_DEFINE_ABORT(4, abort) 302 302 : /* gcc asm goto does not allow outputs */ 303 303 : [cpu_id] "r" (cpu), 304 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 305 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 304 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 305 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 306 306 [v] "m" (*v), 307 307 [count] "r" (count) 308 308 RSEQ_INJECT_INPUT ··· 352 352 RSEQ_ASM_DEFINE_ABORT(4, abort) 353 353 : /* gcc asm goto does not allow outputs */ 354 354 : [cpu_id] "r" (cpu), 355 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 356 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 355 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 356 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 357 357 [expect] "r" (expect), 358 358 [v] "m" (*v), 359 359 [newv] "r" (newv), ··· 411 411 RSEQ_ASM_DEFINE_ABORT(4, abort) 412 412 : /* gcc asm goto does not allow outputs */ 413 413 : [cpu_id] "r" (cpu), 414 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 415 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 414 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 415 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 416 416 [expect] "r" (expect), 417 417 [v] "m" (*v), 418 418 
[newv] "r" (newv), ··· 472 472 RSEQ_ASM_DEFINE_ABORT(4, abort) 473 473 : /* gcc asm goto does not allow outputs */ 474 474 : [cpu_id] "r" (cpu), 475 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 476 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 475 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 476 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 477 477 [v] "m" (*v), 478 478 [expect] "r" (expect), 479 479 [v2] "m" (*v2), ··· 532 532 RSEQ_ASM_DEFINE_ABORT(4, abort) 533 533 : /* gcc asm goto does not allow outputs */ 534 534 : [cpu_id] "r" (cpu), 535 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 536 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 535 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 536 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 537 537 [expect] "r" (expect), 538 538 [v] "m" (*v), 539 539 [newv] "r" (newv), ··· 593 593 RSEQ_ASM_DEFINE_ABORT(4, abort) 594 594 : /* gcc asm goto does not allow outputs */ 595 595 : [cpu_id] "r" (cpu), 596 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 597 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 596 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 597 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 598 598 [expect] "r" (expect), 599 599 [v] "m" (*v), 600 600 [newv] "r" (newv), ··· 651 651 RSEQ_ASM_DEFINE_ABORT(4, abort) 652 652 : /* gcc asm goto does not allow outputs */ 653 653 : [cpu_id] "r" (cpu), 654 - [current_cpu_id] "m" (__rseq_abi.cpu_id), 655 - [rseq_cs] "m" (__rseq_abi.rseq_cs), 654 + [current_cpu_id] "m" (rseq_get_abi()->cpu_id), 655 + [rseq_cs] "m" (rseq_get_abi()->rseq_cs.arch.ptr), 656 656 [ptr] "r" (ptr), 657 657 [off] "er" (off), 658 658 [inc] "er" (inc)
+2 -1
tools/testing/selftests/rseq/rseq.c
··· 111 111 libc_rseq_offset_p = dlsym(RTLD_NEXT, "__rseq_offset"); 112 112 libc_rseq_size_p = dlsym(RTLD_NEXT, "__rseq_size"); 113 113 libc_rseq_flags_p = dlsym(RTLD_NEXT, "__rseq_flags"); 114 - if (libc_rseq_size_p && libc_rseq_offset_p && libc_rseq_flags_p) { 114 + if (libc_rseq_size_p && libc_rseq_offset_p && libc_rseq_flags_p && 115 + *libc_rseq_size_p != 0) { 115 116 /* rseq registration owned by glibc */ 116 117 rseq_offset = *libc_rseq_offset_p; 117 118 rseq_size = *libc_rseq_size_p;