Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_FAIR) enhancements:

- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)

- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Remove unused cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function

- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
Chourasia)

- Cleanups:
- Clean up in migrate_degrades_locality() to improve readability
(Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej
Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)

Deadline scheduler (SCHED_DL) enhancements:

- Restore dl_server bandwidth on non-destructive root domain changes
(Juri Lelli)

- Correctly account for allocated bandwidth during hotplug (Juri
Lelli)

- Check bandwidth overflow earlier for hotplug (Juri Lelli)

- Clean up goto label in pick_earliest_pushable_dl_task() (John
Stultz)

- Consolidate timer cancellation (Wander Lairson Costa)

Load-balancer enhancements:

- Improve performance by prioritizing migrating eligible tasks in
sched_balance_rq() (Hao Jia)

- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)

- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)

Generic scheduling code enhancements:

- Use READ_ONCE() in task_on_rq_queued(), to consistently use the
WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)

Isolated CPUs support enhancements: (Waiman Long)

- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE

RSEQ enhancements:

- Validate read-only fields under DEBUG_RSEQ config (Mathieu
Desnoyers)

PSI enhancements:

- Fix race when task wakes up before psi_sched_switch() adjusts flags
(Chengming Zhou)

IRQ time accounting performance enhancements: (Yafang Shao)

- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled

Virtual machine scheduling enhancements:

- Don't try to catch up excess steal time (Suleiman Souhlal)

Heterogeneous x86 CPU scheduling enhancements: (K Prateek Nayak)

- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally

Debugging code & instrumentation enhancements:

- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter
Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat
(Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)"

* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rseq: Fix rseq unregistration regression
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched: Don't account irq time if sched_clock_irqtime is disabled
sched: Define sched_clock_irqtime as static key
sched/fair: Do not compute overloaded status unnecessarily during lb
sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
x86/itmt: Use guard() for itmt_update_mutex
x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
sched/debug: Change need_resched warnings to pr_err
sched/fair: Encapsulate set custom slice in a __setparam_fair() function
sched: Fix race between yield_to() and try_to_wake_up()
docs: Update Schedstat version to 17
sched/stats: Print domain name in /proc/schedstat
sched: Move sched domain name out of CONFIG_SCHED_DEBUG
sched: Report the different kinds of imbalances in /proc/schedstat
...

+721 -479 total

Documentation/admin-guide/kernel-parameters.txt | +3 -1

···
 			specified in the flag list (default: domain):
 
 			nohz
-			  Disable the tick when a single task runs.
+			  Disable the tick when a single task runs as well as
+			  disabling other kernel noises like having RCU callbacks
+			  offloaded. This is equivalent to the nohz_full parameter.
 
 			A residual 1Hz tick is offloaded to workqueues, which you
 			need to affine to housekeeping through the global
Documentation/scheduler/sched-stats.rst | +76 -52

···
 Scheduler Statistics
 ====================
 
+Version 17 of schedstats removed 'lb_imbalance' field as it has no
+significance anymore and instead added more relevant fields namely
+'lb_imbalance_load', 'lb_imbalance_util', 'lb_imbalance_task' and
+'lb_imbalance_misfit'. The domain field prints the name of the
+corresponding sched domain from this version onwards.
+
 Version 16 of schedstats changed the order of definitions within
 'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
 columns in show_schedstat(). In particular the position of CPU_IDLE
···
 Version 15 of schedstats dropped counters for some sched_yield:
 yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
-identical to version 14.
+identical to version 14. Details are available at
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-stats.txt?id=1e1dbb259c79b
 
 Version 14 of schedstats includes support for sched_domains, which hit the
 mainline kernel in 2.6.20 although it is identical to the stats from version
···
 sometimes balancing only between pairs of cpus. At this time, there
 are no architectures which need more than three domain levels. The first
 field in the domain stats is a bit map indicating which cpus are affected
-by that domain.
+by that domain. Details are available at
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/sched-stats.txt?id=b762f3ffb797c
+
+The schedstat documentation is maintained version 10 onwards and is not
+updated for version 11 and 12. The details for version 10 are available at
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/sched-stats.txt?id=1da177e4c3f4
 
 These fields are counters, and only increment. Programs which make use
 of these will need to start with a baseline observation and then calculate
···
 -----------------
 One of these is produced per domain for each cpu described. (Note that if
 CONFIG_SMP is not defined, *no* domains are utilized and these lines
-will not appear in the output.)
+will not appear in the output. <name> is an extension to the domain field
+that prints the name of the corresponding sched domain. It can appear in
+schedstat version 17 and above, and requires CONFIG_SCHED_DEBUG.)
 
-domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
+domain<N> <name> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 
 The first field is a bit mask indicating what cpus this domain operates over.
 
-The next 24 are a variety of sched_balance_rq() statistics in grouped into types
-of idleness (idle, busy, and newly idle):
+The next 33 are a variety of sched_balance_rq() statistics in grouped into types
+of idleness (busy, idle and newly idle):
 
     1)  # of times in this domain sched_balance_rq() was called when the
-        cpu was idle
-    2)  # of times in this domain sched_balance_rq() checked but found
-        the load did not require balancing when the cpu was idle
-    3)  # of times in this domain sched_balance_rq() tried to move one or
-        more tasks and failed, when the cpu was idle
-    4)  sum of imbalances discovered (if any) with each call to
-        sched_balance_rq() in this domain when the cpu was idle
-    5)  # of times in this domain pull_task() was called when the cpu
-        was idle
-    6)  # of times in this domain pull_task() was called even though
-        the target task was cache-hot when idle
-    7)  # of times in this domain sched_balance_rq() was called but did
-        not find a busier queue while the cpu was idle
-    8)  # of times in this domain a busier queue was found while the
-        cpu was idle but no busier group was found
-    9)  # of times in this domain sched_balance_rq() was called when the
        cpu was busy
-   10)  # of times in this domain sched_balance_rq() checked but found the
+    2)  # of times in this domain sched_balance_rq() checked but found the
        load did not require balancing when busy
-   11)  # of times in this domain sched_balance_rq() tried to move one or
+    3)  # of times in this domain sched_balance_rq() tried to move one or
        more tasks and failed, when the cpu was busy
-   12)  sum of imbalances discovered (if any) with each call to
-        sched_balance_rq() in this domain when the cpu was busy
-   13)  # of times in this domain pull_task() was called when busy
-   14)  # of times in this domain pull_task() was called even though the
+    4)  Total imbalance in load when the cpu was busy
+    5)  Total imbalance in utilization when the cpu was busy
+    6)  Total imbalance in number of tasks when the cpu was busy
+    7)  Total imbalance due to misfit tasks when the cpu was busy
+    8)  # of times in this domain pull_task() was called when busy
+    9)  # of times in this domain pull_task() was called even though the
        target task was cache-hot when busy
-   15)  # of times in this domain sched_balance_rq() was called but did not
+   10)  # of times in this domain sched_balance_rq() was called but did not
        find a busier queue while the cpu was busy
-   16)  # of times in this domain a busier queue was found while the cpu
+   11)  # of times in this domain a busier queue was found while the cpu
        was busy but no busier group was found
 
-   17)  # of times in this domain sched_balance_rq() was called when the
-        cpu was just becoming idle
-   18)  # of times in this domain sched_balance_rq() checked but found the
+   12)  # of times in this domain sched_balance_rq() was called when the
+        cpu was idle
+   13)  # of times in this domain sched_balance_rq() checked but found
+        the load did not require balancing when the cpu was idle
+   14)  # of times in this domain sched_balance_rq() tried to move one or
+        more tasks and failed, when the cpu was idle
+   15)  Total imbalance in load when the cpu was idle
+   16)  Total imbalance in utilization when the cpu was idle
+   17)  Total imbalance in number of tasks when the cpu was idle
+   18)  Total imbalance due to misfit tasks when the cpu was idle
+   19)  # of times in this domain pull_task() was called when the cpu
+        was idle
+   20)  # of times in this domain pull_task() was called even though
+        the target task was cache-hot when idle
+   21)  # of times in this domain sched_balance_rq() was called but did
+        not find a busier queue while the cpu was idle
+   22)  # of times in this domain a busier queue was found while the
+        cpu was idle but no busier group was found
+
+   23)  # of times in this domain sched_balance_rq() was called when the
+        was just becoming idle
+   24)  # of times in this domain sched_balance_rq() checked but found the
        load did not require balancing when the cpu was just becoming idle
-   19)  # of times in this domain sched_balance_rq() tried to move one or more
+   25)  # of times in this domain sched_balance_rq() tried to move one or more
        tasks and failed, when the cpu was just becoming idle
-   20)  sum of imbalances discovered (if any) with each call to
-        sched_balance_rq() in this domain when the cpu was just becoming idle
-   21)  # of times in this domain pull_task() was called when newly idle
-   22)  # of times in this domain pull_task() was called even though the
+   26)  Total imbalance in load when the cpu was just becoming idle
+   27)  Total imbalance in utilization when the cpu was just becoming idle
+   28)  Total imbalance in number of tasks when the cpu was just becoming idle
+   29)  Total imbalance due to misfit tasks when the cpu was just becoming idle
+   30)  # of times in this domain pull_task() was called when newly idle
+   31)  # of times in this domain pull_task() was called even though the
        target task was cache-hot when just becoming idle
-   23)  # of times in this domain sched_balance_rq() was called but did not
+   32)  # of times in this domain sched_balance_rq() was called but did not
        find a busier queue while the cpu was just becoming idle
-   24)  # of times in this domain a busier queue was found while the cpu
+   33)  # of times in this domain a busier queue was found while the cpu
        was just becoming idle but no busier group was found
 
    Next three are active_load_balance() statistics:
 
-   25)  # of times active_load_balance() was called
-   26)  # of times active_load_balance() tried to move a task and failed
-   27)  # of times active_load_balance() successfully moved a task
+   34)  # of times active_load_balance() was called
+   35)  # of times active_load_balance() tried to move a task and failed
+   36)  # of times active_load_balance() successfully moved a task
 
    Next three are sched_balance_exec() statistics:
 
-   28)  sbe_cnt is not used
-   29)  sbe_balanced is not used
-   30)  sbe_pushed is not used
+   37)  sbe_cnt is not used
+   38)  sbe_balanced is not used
+   39)  sbe_pushed is not used
 
    Next three are sched_balance_fork() statistics:
 
-   31)  sbf_cnt is not used
-   32)  sbf_balanced is not used
-   33)  sbf_pushed is not used
+   40)  sbf_cnt is not used
+   41)  sbf_balanced is not used
+   42)  sbf_pushed is not used
 
    Next three are try_to_wake_up() statistics:
 
-   34)  # of times in this domain try_to_wake_up() awoke a task that
+   43)  # of times in this domain try_to_wake_up() awoke a task that
        last ran on a different cpu in this domain
-   35)  # of times in this domain try_to_wake_up() moved a task to the
+   44)  # of times in this domain try_to_wake_up() moved a task to the
        waking cpu because it was cache-cold on its own cpu anyway
-   36)  # of times in this domain try_to_wake_up() started passive balancing
+   45)  # of times in this domain try_to_wake_up() started passive balancing
 
 /proc/<pid>/schedstat
 ---------------------
arch/x86/include/asm/topology.h | +2 -2

···
 #include <asm/percpu.h>
 
 DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
-extern unsigned int __read_mostly sysctl_sched_itmt_enabled;
+extern bool __read_mostly sysctl_sched_itmt_enabled;
 
 /* Interface to set priority of a cpu */
 void sched_set_itmt_core_prio(int prio, int core_cpu);
···
 
 #else /* CONFIG_SCHED_MC_PRIO */
 
-#define sysctl_sched_itmt_enabled	0
+#define sysctl_sched_itmt_enabled	false
 static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
 {
 }
arch/x86/kernel/itmt.c | +33 -48

···
 #include <linux/sched.h>
 #include <linux/cpumask.h>
 #include <linux/cpuset.h>
+#include <linux/debugfs.h>
 #include <linux/mutex.h>
 #include <linux/sysctl.h>
 #include <linux/nodemask.h>
···
  * of higher turbo frequency for cpus supporting Intel Turbo Boost Max
  * Technology 3.0.
  *
- * It can be set via /proc/sys/kernel/sched_itmt_enabled
+ * It can be set via /sys/kernel/debug/x86/sched_itmt_enabled
  */
-unsigned int __read_mostly sysctl_sched_itmt_enabled;
+bool __read_mostly sysctl_sched_itmt_enabled;
 
-static int sched_itmt_update_handler(const struct ctl_table *table, int write,
-				     void *buffer, size_t *lenp, loff_t *ppos)
+static ssize_t sched_itmt_enabled_write(struct file *filp,
+					const char __user *ubuf,
+					size_t cnt, loff_t *ppos)
 {
-	unsigned int old_sysctl;
-	int ret;
+	ssize_t result;
+	bool orig;
 
-	mutex_lock(&itmt_update_mutex);
+	guard(mutex)(&itmt_update_mutex);
 
-	if (!sched_itmt_capable) {
-		mutex_unlock(&itmt_update_mutex);
-		return -EINVAL;
-	}
+	orig = sysctl_sched_itmt_enabled;
+	result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
 
-	old_sysctl = sysctl_sched_itmt_enabled;
-	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
-
-	if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
+	if (sysctl_sched_itmt_enabled != orig) {
 		x86_topology_update = true;
 		rebuild_sched_domains();
 	}
 
-	mutex_unlock(&itmt_update_mutex);
-
-	return ret;
+	return result;
 }
 
-static struct ctl_table itmt_kern_table[] = {
-	{
-		.procname	= "sched_itmt_enabled",
-		.data		= &sysctl_sched_itmt_enabled,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= sched_itmt_update_handler,
-		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
-	},
+static const struct file_operations dfs_sched_itmt_fops = {
+	.read =		debugfs_read_file_bool,
+	.write =	sched_itmt_enabled_write,
+	.open =		simple_open,
+	.llseek =	default_llseek,
 };
 
-static struct ctl_table_header *itmt_sysctl_header;
+static struct dentry *dfs_sched_itmt;
 
 /**
  * sched_set_itmt_support() - Indicate platform supports ITMT
···
  */
 int sched_set_itmt_support(void)
 {
-	mutex_lock(&itmt_update_mutex);
+	guard(mutex)(&itmt_update_mutex);
 
-	if (sched_itmt_capable) {
-		mutex_unlock(&itmt_update_mutex);
+	if (sched_itmt_capable)
 		return 0;
-	}
 
-	itmt_sysctl_header = register_sysctl("kernel", itmt_kern_table);
-	if (!itmt_sysctl_header) {
-		mutex_unlock(&itmt_update_mutex);
+	dfs_sched_itmt = debugfs_create_file_unsafe("sched_itmt_enabled",
+						    0644,
+						    arch_debugfs_dir,
+						    &sysctl_sched_itmt_enabled,
+						    &dfs_sched_itmt_fops);
+	if (IS_ERR_OR_NULL(dfs_sched_itmt)) {
+		dfs_sched_itmt = NULL;
 		return -ENOMEM;
 	}
···
 
 	x86_topology_update = true;
 	rebuild_sched_domains();
-
-	mutex_unlock(&itmt_update_mutex);
 
 	return 0;
 }
···
  */
 void sched_clear_itmt_support(void)
 {
-	mutex_lock(&itmt_update_mutex);
+	guard(mutex)(&itmt_update_mutex);
 
-	if (!sched_itmt_capable) {
-		mutex_unlock(&itmt_update_mutex);
+	if (!sched_itmt_capable)
 		return;
-	}
+
 	sched_itmt_capable = false;
 
-	if (itmt_sysctl_header) {
-		unregister_sysctl_table(itmt_sysctl_header);
-		itmt_sysctl_header = NULL;
-	}
+	debugfs_remove(dfs_sched_itmt);
+	dfs_sched_itmt = NULL;
 
 	if (sysctl_sched_itmt_enabled) {
 		/* disable sched_itmt if we are no longer ITMT capable */
···
 		x86_topology_update = true;
 		rebuild_sched_domains();
 	}
-
-	mutex_unlock(&itmt_update_mutex);
 }
 
 int arch_asym_cpu_priority(int cpu)
arch/x86/kernel/smpboot.c | +2 -17

···
 	return cpu_core_flags() | x86_sched_itmt_flags();
 }
 #endif
-#ifdef CONFIG_SCHED_SMT
-static int x86_smt_flags(void)
-{
-	return cpu_smt_flags();
-}
-#endif
 #ifdef CONFIG_SCHED_CLUSTER
 static int x86_cluster_flags(void)
 {
 	return cpu_cluster_flags() | x86_sched_itmt_flags();
 }
 #endif
-
-static int x86_die_flags(void)
-{
-	if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU) ||
-	    cpu_feature_enabled(X86_FEATURE_AMD_HETEROGENEOUS_CORES))
-		return x86_sched_itmt_flags();
-
-	return 0;
-}
 
 /*
  * Set if a package/die has multiple NUMA nodes inside.
···
 #ifdef CONFIG_SCHED_SMT
 	x86_topology[i++] = (struct sched_domain_topology_level){
-		cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT)
+		cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT)
 	};
 #endif
 #ifdef CONFIG_SCHED_CLUSTER
···
 	 */
 	if (!x86_has_numa_in_package) {
 		x86_topology[i++] = (struct sched_domain_topology_level){
-			cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(PKG)
+			cpu_cpu_mask, x86_sched_itmt_flags, SD_INIT_NAME(PKG)
 		};
 	}
include/linux/sched.h | +10

···
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
+	unsigned			sched_task_hot:1;
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
···
 	 * with respect to preemption.
 	 */
 	unsigned long rseq_event_mask;
+# ifdef CONFIG_DEBUG_RSEQ
+	/*
+	 * This is a place holder to save a copy of the rseq fields for
+	 * validation of read-only fields. The struct rseq has a
+	 * variable-length array at the end, so it cannot be used
+	 * directly. Reserve a size large enough for the known fields.
+	 */
+	char				rseq_fields[sizeof(struct rseq)];
+# endif
 #endif
 
 #ifdef CONFIG_SCHED_MM_CID
include/linux/sched/isolation.h | +13 -8

···
 #include <linux/tick.h>
 
 enum hk_type {
-	HK_TYPE_TIMER,
-	HK_TYPE_RCU,
-	HK_TYPE_MISC,
-	HK_TYPE_SCHED,
-	HK_TYPE_TICK,
 	HK_TYPE_DOMAIN,
-	HK_TYPE_WQ,
 	HK_TYPE_MANAGED_IRQ,
-	HK_TYPE_KTHREAD,
-	HK_TYPE_MAX
+	HK_TYPE_KERNEL_NOISE,
+	HK_TYPE_MAX,
+
+	/*
+	 * The following housekeeping types are only set by the nohz_full
+	 * boot commandline option. So they can share the same value.
+	 */
+	HK_TYPE_TICK	= HK_TYPE_KERNEL_NOISE,
+	HK_TYPE_TIMER	= HK_TYPE_KERNEL_NOISE,
+	HK_TYPE_RCU	= HK_TYPE_KERNEL_NOISE,
+	HK_TYPE_MISC	= HK_TYPE_KERNEL_NOISE,
+	HK_TYPE_WQ	= HK_TYPE_KERNEL_NOISE,
+	HK_TYPE_KTHREAD	= HK_TYPE_KERNEL_NOISE
 };
 
 #ifdef CONFIG_CPU_ISOLATION
include/linux/sched/topology.h | +4 -9

···
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
-	unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_load[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
···
 	unsigned int ttwu_move_affine;
 	unsigned int ttwu_move_balance;
 #endif
-#ifdef CONFIG_SCHED_DEBUG
 	char *name;
-#endif
 	union {
 		void *private;		/* used during construction */
 		struct rcu_head rcu;	/* used during destruction */
···
 	int		    flags;
 	int		    numa_level;
 	struct sd_data      data;
-#ifdef CONFIG_SCHED_DEBUG
 	char                *name;
-#endif
 };
 
 extern void __init set_sched_topology(struct sched_domain_topology_level *tl);
 
-#ifdef CONFIG_SCHED_DEBUG
 # define SD_INIT_NAME(type)		.name = #type
-#else
-# define SD_INIT_NAME(type)
-#endif
 
 #else /* CONFIG_SMP */
kernel/rseq.c | +98

···
 #include <linux/syscalls.h>
 #include <linux/rseq.h>
 #include <linux/types.h>
+#include <linux/ratelimit.h>
 #include <asm/ptrace.h>
 
 #define CREATE_TRACE_POINTS
···
 #define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
 				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
 				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+
+#ifdef CONFIG_DEBUG_RSEQ
+static struct rseq *rseq_kernel_fields(struct task_struct *t)
+{
+	return (struct rseq *) t->rseq_fields;
+}
+
+static int rseq_validate_ro_fields(struct task_struct *t)
+{
+	static DEFINE_RATELIMIT_STATE(_rs,
+				      DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+	u32 cpu_id_start, cpu_id, node_id, mm_cid;
+	struct rseq __user *rseq = t->rseq;
+
+	/*
+	 * Validate fields which are required to be read-only by
+	 * user-space.
+	 */
+	if (!user_read_access_begin(rseq, t->rseq_len))
+		goto efault;
+	unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
+	unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
+	unsafe_get_user(node_id, &rseq->node_id, efault_end);
+	unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
+	user_read_access_end();
+
+	if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
+	     cpu_id != rseq_kernel_fields(t)->cpu_id ||
+	     node_id != rseq_kernel_fields(t)->node_id ||
+	     mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
+
+		pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
+			"\tcpu_id_start: %u ?= %u\n"
+			"\tcpu_id: %u ?= %u\n"
+			"\tnode_id: %u ?= %u\n"
+			"\tmm_cid: %u ?= %u\n",
+			t->pid, t->comm,
+			cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
+			cpu_id, rseq_kernel_fields(t)->cpu_id,
+			node_id, rseq_kernel_fields(t)->node_id,
+			mm_cid, rseq_kernel_fields(t)->mm_cid);
+	}
+
+	/* For now, only print a console warning on mismatch. */
+	return 0;
+
+efault_end:
+	user_read_access_end();
+efault:
+	return -EFAULT;
+}
+
+static void rseq_set_ro_fields(struct task_struct *t, u32 cpu_id_start, u32 cpu_id,
+			       u32 node_id, u32 mm_cid)
+{
+	rseq_kernel_fields(t)->cpu_id_start = cpu_id;
+	rseq_kernel_fields(t)->cpu_id = cpu_id;
+	rseq_kernel_fields(t)->node_id = node_id;
+	rseq_kernel_fields(t)->mm_cid = mm_cid;
+}
+#else
+static int rseq_validate_ro_fields(struct task_struct *t)
+{
+	return 0;
+}
+
+static void rseq_set_ro_fields(struct task_struct *t, u32 cpu_id_start, u32 cpu_id,
+			       u32 node_id, u32 mm_cid)
+{
+}
+#endif
 
 /*
  *
···
 	u32 node_id = cpu_to_node(cpu_id);
 	u32 mm_cid = task_mm_cid(t);
 
+	/*
+	 * Validate read-only rseq fields.
+	 */
+	if (rseq_validate_ro_fields(t))
+		goto efault;
 	WARN_ON_ONCE((int) mm_cid < 0);
 	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
···
 	 * t->rseq_len != ORIG_RSEQ_SIZE.
 	 */
 	user_write_access_end();
+	rseq_set_ro_fields(t, cpu_id, cpu_id, node_id, mm_cid);
 	trace_rseq_update(t);
 	return 0;
···
 	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
 	    mm_cid = 0;
 
+	/*
+	 * Validate read-only rseq fields.
+	 */
+	if (rseq_validate_ro_fields(t))
+		return -EFAULT;
 	/*
 	 * Reset cpu_id_start to its initial state (0).
 	 */
···
 	 */
 	if (put_user(mm_cid, &t->rseq->mm_cid))
 		return -EFAULT;
+
+	rseq_set_ro_fields(t, cpu_id_start, cpu_id, node_id, mm_cid);
+
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally reset only if
···
 	current->rseq = rseq;
 	current->rseq_len = rseq_len;
 	current->rseq_sig = sig;
+#ifdef CONFIG_DEBUG_RSEQ
+	/*
+	 * Initialize the in-kernel rseq fields copy for validation of
+	 * read-only fields.
+	 */
+	if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
+	    get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
+	    get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
+	    get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
+		return -EFAULT;
+#endif
 	/*
 	 * If rseq was previously inactive, and has just been
 	 * registered, ensure the cpu_id_start and cpu_id fields
kernel/sched/core.c | +45 -49

···
 	s64 __maybe_unused steal = 0, irq_delta = 0;
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+	if (irqtime_enabled()) {
+		irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
-	/*
-	 * Since irq_time is only updated on {soft,}irq_exit, we might run into
-	 * this case when a previous update_rq_clock() happened inside a
-	 * {soft,}IRQ region.
-	 *
-	 * When this happens, we stop ->clock_task and only update the
-	 * prev_irq_time stamp to account for the part that fit, so that a next
-	 * update will consume the rest. This ensures ->clock_task is
-	 * monotonic.
-	 *
-	 * It does however cause some slight miss-attribution of {soft,}IRQ
-	 * time, a more accurate solution would be to update the irq_time using
-	 * the current rq->clock timestamp, except that would require using
-	 * atomic ops.
-	 */
-	if (irq_delta > delta)
-		irq_delta = delta;
+		/*
+		 * Since irq_time is only updated on {soft,}irq_exit, we might run into
+		 * this case when a previous update_rq_clock() happened inside a
+		 * {soft,}IRQ region.
+		 *
+		 * When this happens, we stop ->clock_task and only update the
+		 * prev_irq_time stamp to account for the part that fit, so that a next
+		 * update will consume the rest. This ensures ->clock_task is
+		 * monotonic.
+		 *
+		 * It does however cause some slight miss-attribution of {soft,}IRQ
+		 * time, a more accurate solution would be to update the irq_time using
+		 * the current rq->clock timestamp, except that would require using
+		 * atomic ops.
+		 */
+		if (irq_delta > delta)
+			irq_delta = delta;
 
-	rq->prev_irq_time += irq_delta;
-	delta -= irq_delta;
-	delayacct_irq(rq->curr, irq_delta);
+		rq->prev_irq_time += irq_delta;
+		delta -= irq_delta;
+		delayacct_irq(rq->curr, irq_delta);
+	}
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
-		steal = paravirt_steal_clock(cpu_of(rq));
+		u64 prev_steal;
+
+		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
 		steal -= rq->prev_steal_time_rq;
 
 		if (unlikely(steal > delta))
 			steal = delta;
 
-		rq->prev_steal_time_rq += steal;
+		rq->prev_steal_time_rq = prev_steal;
 		delta -= steal;
 	}
 #endif
···
 	struct sched_domain *sd;
 	const struct cpumask *hk_mask;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TIMER)) {
+	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE)) {
 		if (!idle_cpu(cpu))
 			return cpu;
 		default_cpu = cpu;
 	}
 
-	hk_mask = housekeeping_cpumask(HK_TYPE_TIMER);
+	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
 
 	guard(rcu)();
···
 	}
 
 	if (default_cpu == -1)
-		default_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+		default_cpu = housekeeping_any_cpu(HK_TYPE_KERNEL_NOISE);
 
 	return default_cpu;
 }
···
 	if (scx_enabled() && !scx_can_stop_tick(rq))
 		return false;
 
-	if (rq->cfs.h_nr_running > 1)
+	if (rq->cfs.h_nr_queued > 1)
 		return false;
 
 	/*
···
 	unsigned long hw_pressure;
 	u64 resched_latency;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
+	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		arch_scale_freq_tick();
 
 	sched_clock_tick();
···
 	int os;
 	struct tick_work *twork;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
+	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		return;
 
 	WARN_ON_ONCE(!tick_work_cpu);
···
 	struct tick_work *twork;
 	int os;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
+	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
 		return;
 
 	WARN_ON_ONCE(!tick_work_cpu);
···
 	 * opportunity to pull in more work from other CPUs.
 	 */
 	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
-		   rq->nr_running == rq->cfs.h_nr_running)) {
+		   rq->nr_running == rq->cfs.h_nr_queued)) {
 
 		p = pick_next_task_fair(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
···
 	 * as a preemption by schedule_debug() and RCU.
 	 */
 	bool preempt = sched_mode > SM_NONE;
-	bool block = false;
 	unsigned long *switch_count;
 	unsigned long prev_state;
 	struct rq_flags rf;
···
 			goto picked;
 		}
 	} else if (!preempt && prev_state) {
-		block = try_to_block_task(rq, prev, prev_state);
+		try_to_block_task(rq, prev, prev_state);
 		switch_count = &prev->nvcsw;
 	}
···
 
 	migrate_disable_switch(rq, prev);
 	psi_account_irqtime(rq, prev, next);
-	psi_sched_switch(prev, next, block);
+	psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
+				     prev->se.sched_delayed);
 
 	trace_sched_switch(preempt, prev, next, prev_state);
···
 	cpuset_update_active_cpus();
 }
 
-static int cpuset_cpu_inactive(unsigned int cpu)
+static void cpuset_cpu_inactive(unsigned int cpu)
 {
 	if (!cpuhp_tasks_frozen) {
-		int ret = dl_bw_check_overflow(cpu);
-
-		if (ret)
-			return ret;
 		cpuset_update_active_cpus();
} else { 8196 8188 num_cpus_frozen++; 8197 8189 partition_sched_domains(1, NULL, NULL); 8198 8190 } 8199 - return 0; 8200 8191 } 8201 8192 8202 8193 static inline void sched_smt_present_inc(int cpu) ··· 8253 8254 struct rq *rq = cpu_rq(cpu); 8254 8255 int ret; 8255 8256 8257 + ret = dl_bw_deactivate(cpu); 8258 + 8259 + if (ret) 8260 + return ret; 8261 + 8256 8262 /* 8257 8263 * Remove CPU from nohz.idle_cpus_mask to prevent participating in 8258 8264 * load balancing when not active ··· 8303 8299 return 0; 8304 8300 8305 8301 sched_update_numa(cpu, false); 8306 - ret = cpuset_cpu_inactive(cpu); 8307 - if (ret) { 8308 - sched_smt_present_inc(cpu); 8309 - sched_set_rq_online(rq, cpu); 8310 - balance_push_set(cpu, false); 8311 - set_cpu_active(cpu, true); 8312 - sched_update_numa(cpu, true); 8313 - return ret; 8314 - } 8302 + cpuset_cpu_inactive(cpu); 8315 8303 sched_domains_numa_masks_clear(cpu); 8316 8304 return 0; 8317 8305 }
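The steal-time hunk in the core.c changes above swaps `rq->prev_steal_time_rq += steal` (clamped amount) for `rq->prev_steal_time_rq = prev_steal` (full steal clock snapshot). The observable difference: steal time in excess of the rq clock delta used to be deferred and charged against later deltas; now it is dropped. A minimal userspace sketch of the two schemes — `steal_old`/`steal_new` are illustrative names, not kernel functions, and `clock`/`delta`/`*prev` stand in for `paravirt_steal_clock()`, the rq clock delta, and `prev_steal_time_rq`:

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: prev only advances by the clamped amount, so steal
 * time in excess of delta carries over into later updates. */
uint64_t steal_old(uint64_t *prev, uint64_t clock, uint64_t delta)
{
	uint64_t steal = clock - *prev;

	if (steal > delta)
		steal = delta;

	*prev += steal;		/* excess is deferred */
	return steal;
}

/* New scheme (this merge): prev snapshots the full steal clock, so
 * steal beyond the current delta is discarded, not deferred. */
uint64_t steal_new(uint64_t *prev, uint64_t clock, uint64_t delta)
{
	uint64_t steal = clock - *prev;

	if (steal > delta)
		steal = delta;

	*prev = clock;		/* excess is dropped */
	return steal;
}
```

With a burst of 100 units of steal against a 10-unit delta, the old scheme keeps charging the remainder on the next update, while the new scheme accounts only what fit in the first delta.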
+7 -9
kernel/sched/cputime.c
··· 9 9 10 10 #ifdef CONFIG_IRQ_TIME_ACCOUNTING 11 11 12 + DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime); 13 + 12 14 /* 13 15 * There are no locks covering percpu hardirq/softirq time. 14 16 * They are only modified in vtime_account, on corresponding CPU ··· 24 22 */ 25 23 DEFINE_PER_CPU(struct irqtime, cpu_irqtime); 26 24 27 - static int sched_clock_irqtime; 28 - 29 25 void enable_sched_clock_irqtime(void) 30 26 { 31 - sched_clock_irqtime = 1; 27 + static_branch_enable(&sched_clock_irqtime); 32 28 } 33 29 34 30 void disable_sched_clock_irqtime(void) 35 31 { 36 - sched_clock_irqtime = 0; 32 + static_branch_disable(&sched_clock_irqtime); 37 33 } 38 34 39 35 static void irqtime_account_delta(struct irqtime *irqtime, u64 delta, ··· 57 57 s64 delta; 58 58 int cpu; 59 59 60 - if (!sched_clock_irqtime) 60 + if (!irqtime_enabled()) 61 61 return; 62 62 63 63 cpu = smp_processor_id(); ··· 89 89 } 90 90 91 91 #else /* CONFIG_IRQ_TIME_ACCOUNTING */ 92 - 93 - #define sched_clock_irqtime (0) 94 92 95 93 static u64 irqtime_tick_accounted(u64 dummy) 96 94 { ··· 476 478 if (vtime_accounting_enabled_this_cpu()) 477 479 return; 478 480 479 - if (sched_clock_irqtime) { 481 + if (irqtime_enabled()) { 480 482 irqtime_account_process_tick(p, user_tick, 1); 481 483 return; 482 484 } ··· 505 507 { 506 508 u64 cputime, steal; 507 509 508 - if (sched_clock_irqtime) { 510 + if (irqtime_enabled()) { 509 511 irqtime_account_idle_ticks(ticks); 510 512 return; 511 513 }
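The cputime.c hunks above convert `sched_clock_irqtime` from a plain `static int` into a static key (`DEFINE_STATIC_KEY_FALSE` plus `static_branch_enable()`/`static_branch_disable()`), so the disabled fast path becomes a patched-out jump rather than a memory load and branch on every IRQ entry. The sketch below models only the gating semantics with a plain `bool` — the code-patching aspect has no equivalent in portable userspace C, and `irqtime_total()` is an illustrative accessor, not a kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool sched_clock_irqtime;	/* models the static key's state */
static uint64_t irq_time_accounted;	/* models per-CPU irqtime */

void enable_sched_clock_irqtime(void)  { sched_clock_irqtime = true; }
void disable_sched_clock_irqtime(void) { sched_clock_irqtime = false; }

/* Mirrors the irqtime_enabled() early-out in irqtime_account_irq():
 * when accounting is off, the hot path does nothing. */
void irqtime_account_irq(uint64_t delta)
{
	if (!sched_clock_irqtime)
		return;

	irq_time_accounted += delta;
}

uint64_t irqtime_total(void) { return irq_time_accounted; }
```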
+89 -30
kernel/sched/deadline.c
··· 342 342 __add_rq_bw(new_bw, &rq->dl); 343 343 } 344 344 345 + static __always_inline 346 + void cancel_dl_timer(struct sched_dl_entity *dl_se, struct hrtimer *timer) 347 + { 348 + /* 349 + * If the timer callback was running (hrtimer_try_to_cancel == -1), 350 + * it will eventually call put_task_struct(). 351 + */ 352 + if (hrtimer_try_to_cancel(timer) == 1 && !dl_server(dl_se)) 353 + put_task_struct(dl_task_of(dl_se)); 354 + } 355 + 356 + static __always_inline 357 + void cancel_replenish_timer(struct sched_dl_entity *dl_se) 358 + { 359 + cancel_dl_timer(dl_se, &dl_se->dl_timer); 360 + } 361 + 362 + static __always_inline 363 + void cancel_inactive_timer(struct sched_dl_entity *dl_se) 364 + { 365 + cancel_dl_timer(dl_se, &dl_se->inactive_timer); 366 + } 367 + 345 368 static void dl_change_utilization(struct task_struct *p, u64 new_bw) 346 369 { 347 370 WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV); ··· 518 495 * will not touch the rq's active utilization, 519 496 * so we are still safe. 520 497 */ 521 - if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) { 522 - if (!dl_server(dl_se)) 523 - put_task_struct(dl_task_of(dl_se)); 524 - } 498 + cancel_inactive_timer(dl_se); 525 499 } else { 526 500 /* 527 501 * Since "dl_non_contending" is not set, the ··· 2135 2115 * The replenish timer needs to be canceled. No 2136 2116 * problem if it fires concurrently: boosted threads 2137 2117 * are ignored in dl_task_timer(). 2138 - * 2139 - * If the timer callback was running (hrtimer_try_to_cancel == -1), 2140 - * it will eventually call put_task_struct(). 2141 2118 */ 2142 - if (hrtimer_try_to_cancel(&p->dl.dl_timer) == 1 && 2143 - !dl_server(&p->dl)) 2144 - put_task_struct(p); 2119 + cancel_replenish_timer(&p->dl); 2145 2120 p->dl.dl_throttled = 0; 2146 2121 } 2147 2122 } else if (!dl_prio(p->normal_prio)) { ··· 2304 2289 * will not touch the rq's active utilization, 2305 2290 * so we are still safe. 
2306 2291 */ 2307 - if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1) 2308 - put_task_struct(p); 2292 + cancel_inactive_timer(&p->dl); 2309 2293 } 2310 2294 sub_rq_bw(&p->dl, &rq->dl); 2311 2295 rq_unlock(rq, &rf); ··· 2520 2506 return NULL; 2521 2507 2522 2508 next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root); 2523 - 2524 - next_node: 2525 - if (next_node) { 2509 + while (next_node) { 2526 2510 p = __node_2_pdl(next_node); 2527 2511 2528 2512 if (task_is_pushable(rq, p, cpu)) 2529 2513 return p; 2530 2514 2531 2515 next_node = rb_next(next_node); 2532 - goto next_node; 2533 2516 } 2534 2517 2535 2518 return NULL; ··· 2975 2964 2976 2965 void dl_clear_root_domain(struct root_domain *rd) 2977 2966 { 2978 - unsigned long flags; 2967 + int i; 2979 2968 2980 - raw_spin_lock_irqsave(&rd->dl_bw.lock, flags); 2969 + guard(raw_spinlock_irqsave)(&rd->dl_bw.lock); 2981 2970 rd->dl_bw.total_bw = 0; 2982 - raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags); 2971 + 2972 + /* 2973 + * dl_server bandwidth is only restored when CPUs are attached to root 2974 + * domains (after domains are created or CPUs moved back to the 2975 + * default root domain). 
2976 + */ 2977 + for_each_cpu(i, rd->span) { 2978 + struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server; 2979 + 2980 + if (dl_server(dl_se) && cpu_active(i)) 2981 + rd->dl_bw.total_bw += dl_se->dl_bw; 2982 + } 2983 2983 } 2984 2984 2985 2985 #endif /* CONFIG_SMP */ ··· 3051 3029 */ 3052 3030 static void switched_to_dl(struct rq *rq, struct task_struct *p) 3053 3031 { 3054 - if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1) 3055 - put_task_struct(p); 3032 + cancel_inactive_timer(&p->dl); 3056 3033 3057 3034 /* 3058 3035 * In case a task is setscheduled to SCHED_DEADLINE we need to keep ··· 3474 3453 } 3475 3454 3476 3455 enum dl_bw_request { 3477 - dl_bw_req_check_overflow = 0, 3456 + dl_bw_req_deactivate = 0, 3478 3457 dl_bw_req_alloc, 3479 3458 dl_bw_req_free 3480 3459 }; 3481 3460 3482 3461 static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw) 3483 3462 { 3484 - unsigned long flags; 3463 + unsigned long flags, cap; 3485 3464 struct dl_bw *dl_b; 3486 3465 bool overflow = 0; 3466 + u64 fair_server_bw = 0; 3487 3467 3488 3468 rcu_read_lock_sched(); 3489 3469 dl_b = dl_bw_of(cpu); 3490 3470 raw_spin_lock_irqsave(&dl_b->lock, flags); 3491 3471 3492 - if (req == dl_bw_req_free) { 3472 + cap = dl_bw_capacity(cpu); 3473 + switch (req) { 3474 + case dl_bw_req_free: 3493 3475 __dl_sub(dl_b, dl_bw, dl_bw_cpus(cpu)); 3494 - } else { 3495 - unsigned long cap = dl_bw_capacity(cpu); 3496 - 3476 + break; 3477 + case dl_bw_req_alloc: 3497 3478 overflow = __dl_overflow(dl_b, cap, 0, dl_bw); 3498 3479 3499 - if (req == dl_bw_req_alloc && !overflow) { 3480 + if (!overflow) { 3500 3481 /* 3501 3482 * We reserve space in the destination 3502 3483 * root_domain, as we can't fail after this point. 
··· 3507 3484 */ 3508 3485 __dl_add(dl_b, dl_bw, dl_bw_cpus(cpu)); 3509 3486 } 3487 + break; 3488 + case dl_bw_req_deactivate: 3489 + /* 3490 + * cpu is not off yet, but we need to do the math by 3491 + * considering it off already (i.e., what would happen if we 3492 + * turn cpu off?). 3493 + */ 3494 + cap -= arch_scale_cpu_capacity(cpu); 3495 + 3496 + /* 3497 + * cpu is going offline and NORMAL tasks will be moved away 3498 + * from it. We can thus discount dl_server bandwidth 3499 + * contribution as it won't need to be servicing tasks after 3500 + * the cpu is off. 3501 + */ 3502 + if (cpu_rq(cpu)->fair_server.dl_server) 3503 + fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw; 3504 + 3505 + /* 3506 + * Not much to check if no DEADLINE bandwidth is present. 3507 + * dl_servers we can discount, as tasks will be moved out the 3508 + * offlined CPUs anyway. 3509 + */ 3510 + if (dl_b->total_bw - fair_server_bw > 0) { 3511 + /* 3512 + * Leaving at least one CPU for DEADLINE tasks seems a 3513 + * wise thing to do. As said above, cpu is not offline 3514 + * yet, so account for that. 3515 + */ 3516 + if (dl_bw_cpus(cpu) - 1) 3517 + overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0); 3518 + else 3519 + overflow = 1; 3520 + } 3521 + 3522 + break; 3510 3523 } 3511 3524 3512 3525 raw_spin_unlock_irqrestore(&dl_b->lock, flags); ··· 3551 3492 return overflow ? -EBUSY : 0; 3552 3493 } 3553 3494 3554 - int dl_bw_check_overflow(int cpu) 3495 + int dl_bw_deactivate(int cpu) 3555 3496 { 3556 - return dl_bw_manage(dl_bw_req_check_overflow, cpu, 0); 3497 + return dl_bw_manage(dl_bw_req_deactivate, cpu, 0); 3557 3498 } 3558 3499 3559 3500 int dl_bw_alloc(int cpu, u64 dl_bw)
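The `dl_bw_req_deactivate` case above does the admission math as if the departing CPU were already gone: subtract its capacity, discount the fair server's own reservation (its tasks migrate away with the CPU), and refuse if the remaining capacity cannot hold the reserved DEADLINE bandwidth. A simplified userspace sketch under stated assumptions — `dl_overflow()` follows the shape of the kernel's `__dl_overflow()`, but `cpu_deactivate_overflows()` is a hypothetical wrapper with capacities passed in explicitly rather than the kernel's `dl_bw_manage()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10

struct dl_bw {
	uint64_t bw;		/* allowed bw per unit capacity (<<10) */
	uint64_t total_bw;	/* bandwidth currently reserved */
};

static uint64_t cap_scale(uint64_t bw, uint64_t cap)
{
	return (bw * cap) >> SCHED_CAPACITY_SHIFT;
}

/* Overflow test: does total_bw - old_bw + new_bw exceed what the
 * given capacity can serve at the configured bw limit? */
static bool dl_overflow(struct dl_bw *dl_b, uint64_t cap,
			uint64_t old_bw, uint64_t new_bw)
{
	return dl_b->bw != (uint64_t)-1 &&
	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}

/* Would taking one CPU of capacity cpu_cap offline overflow the
 * reserved DEADLINE bandwidth? Compute as if the CPU were gone and
 * discount the fair server's reservation. */
bool cpu_deactivate_overflows(struct dl_bw *dl_b, uint64_t total_cap,
			      uint64_t cpu_cap, uint64_t fair_server_bw,
			      int remaining_cpus)
{
	if (dl_b->total_bw - fair_server_bw == 0)
		return false;	/* no DEADLINE bandwidth to protect */

	if (remaining_cpus == 0)
		return true;	/* keep at least one CPU for DEADLINE */

	return dl_overflow(dl_b, total_cap - cpu_cap, fair_server_bw, 0);
}
```

For example, with four CPUs of capacity 1024, a 95% limit (`bw = 972`) and `total_bw = 3000` reserved, dropping one CPU leaves capacity for only `972 * 3 = 2916`, so the deactivation is refused.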
+12 -13
kernel/sched/debug.c
··· 379 379 return -EINVAL; 380 380 } 381 381 382 - if (rq->cfs.h_nr_running) { 382 + if (rq->cfs.h_nr_queued) { 383 383 update_rq_clock(rq); 384 384 dl_server_stop(&rq->fair_server); 385 385 } ··· 392 392 printk_deferred("Fair server disabled in CPU %d, system may crash due to starvation.\n", 393 393 cpu_of(rq)); 394 394 395 - if (rq->cfs.h_nr_running) 395 + if (rq->cfs.h_nr_queued) 396 396 dl_server_start(&rq->fair_server); 397 397 } 398 398 ··· 843 843 SPLIT_NS(right_vruntime)); 844 844 spread = right_vruntime - left_vruntime; 845 845 SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread)); 846 - SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running); 847 - SEQ_printf(m, " .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running); 848 - SEQ_printf(m, " .%-30s: %d\n", "h_nr_delayed", cfs_rq->h_nr_delayed); 849 - SEQ_printf(m, " .%-30s: %d\n", "idle_nr_running", 850 - cfs_rq->idle_nr_running); 851 - SEQ_printf(m, " .%-30s: %d\n", "idle_h_nr_running", 852 - cfs_rq->idle_h_nr_running); 846 + SEQ_printf(m, " .%-30s: %d\n", "nr_queued", cfs_rq->nr_queued); 847 + SEQ_printf(m, " .%-30s: %d\n", "h_nr_runnable", cfs_rq->h_nr_runnable); 848 + SEQ_printf(m, " .%-30s: %d\n", "h_nr_queued", cfs_rq->h_nr_queued); 849 + SEQ_printf(m, " .%-30s: %d\n", "h_nr_idle", cfs_rq->h_nr_idle); 853 850 SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight); 854 851 #ifdef CONFIG_SMP 855 852 SEQ_printf(m, " .%-30s: %lu\n", "load_avg", ··· 1292 1295 { 1293 1296 static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 1); 1294 1297 1295 - WARN(__ratelimit(&latency_check_ratelimit), 1296 - "sched: CPU %d need_resched set for > %llu ns (%d ticks) " 1297 - "without schedule\n", 1298 - cpu, latency, cpu_rq(cpu)->ticks_without_resched); 1298 + if (likely(!__ratelimit(&latency_check_ratelimit))) 1299 + return; 1300 + 1301 + pr_err("sched: CPU %d need_resched set for > %llu ns (%d ticks) without schedule\n", 1302 + cpu, latency, cpu_rq(cpu)->ticks_without_resched); 
1303 + dump_stack(); 1299 1304 }
+257 -187
kernel/sched/fair.c
··· 37 37 #include <linux/sched/cputime.h> 38 38 #include <linux/sched/isolation.h> 39 39 #include <linux/sched/nohz.h> 40 + #include <linux/sched/prio.h> 40 41 41 42 #include <linux/cpuidle.h> 42 43 #include <linux/interrupt.h> ··· 51 50 #include <linux/rbtree_augmented.h> 52 51 53 52 #include <asm/switch_to.h> 53 + 54 + #include <uapi/linux/sched/types.h> 54 55 55 56 #include "sched.h" 56 57 #include "stats.h" ··· 526 523 * Scheduling class tree data structure manipulation methods: 527 524 */ 528 525 529 - static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime) 526 + static inline __maybe_unused u64 max_vruntime(u64 max_vruntime, u64 vruntime) 530 527 { 531 528 s64 delta = (s64)(vruntime - max_vruntime); 532 529 if (delta > 0) ··· 535 532 return max_vruntime; 536 533 } 537 534 538 - static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime) 535 + static inline __maybe_unused u64 min_vruntime(u64 min_vruntime, u64 vruntime) 539 536 { 540 537 s64 delta = (s64)(vruntime - min_vruntime); 541 538 if (delta < 0) ··· 913 910 * We can safely skip eligibility check if there is only one entity 914 911 * in this cfs_rq, saving some cycles. 915 912 */ 916 - if (cfs_rq->nr_running == 1) 913 + if (cfs_rq->nr_queued == 1) 917 914 return curr && curr->on_rq ? 
curr : se; 918 915 919 916 if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) ··· 1248 1245 1249 1246 account_cfs_rq_runtime(cfs_rq, delta_exec); 1250 1247 1251 - if (cfs_rq->nr_running == 1) 1248 + if (cfs_rq->nr_queued == 1) 1252 1249 return; 1253 1250 1254 1251 if (resched || did_preempt_short(cfs_rq, curr)) { ··· 2129 2126 ns->load += cpu_load(rq); 2130 2127 ns->runnable += cpu_runnable(rq); 2131 2128 ns->util += cpu_util_cfs(cpu); 2132 - ns->nr_running += rq->cfs.h_nr_running; 2129 + ns->nr_running += rq->cfs.h_nr_runnable; 2133 2130 ns->compute_capacity += capacity_of(cpu); 2134 2131 2135 2132 if (find_idle && idle_core < 0 && !rq->nr_running && idle_cpu(cpu)) { ··· 3680 3677 list_add(&se->group_node, &rq->cfs_tasks); 3681 3678 } 3682 3679 #endif 3683 - cfs_rq->nr_running++; 3684 - if (se_is_idle(se)) 3685 - cfs_rq->idle_nr_running++; 3680 + cfs_rq->nr_queued++; 3686 3681 } 3687 3682 3688 3683 static void ··· 3693 3692 list_del_init(&se->group_node); 3694 3693 } 3695 3694 #endif 3696 - cfs_rq->nr_running--; 3697 - if (se_is_idle(se)) 3698 - cfs_rq->idle_nr_running--; 3695 + cfs_rq->nr_queued--; 3699 3696 } 3700 3697 3701 3698 /* ··· 5127 5128 5128 5129 static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) 5129 5130 { 5130 - return !cfs_rq->nr_running; 5131 + return !cfs_rq->nr_queued; 5131 5132 } 5132 5133 5133 5134 #define UPDATE_TG 0x0 ··· 5165 5166 5166 5167 #endif /* CONFIG_SMP */ 5167 5168 5169 + void __setparam_fair(struct task_struct *p, const struct sched_attr *attr) 5170 + { 5171 + struct sched_entity *se = &p->se; 5172 + 5173 + p->static_prio = NICE_TO_PRIO(attr->sched_nice); 5174 + if (attr->sched_runtime) { 5175 + se->custom_slice = 1; 5176 + se->slice = clamp_t(u64, attr->sched_runtime, 5177 + NSEC_PER_MSEC/10, /* HZ=1000 * 10 */ 5178 + NSEC_PER_MSEC*100); /* HZ=100 / 10 */ 5179 + } else { 5180 + se->custom_slice = 0; 5181 + se->slice = sysctl_sched_base_slice; 5182 + } 5183 + } 5184 + 5168 5185 static void 5169 5186 
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 5170 5187 { ··· 5199 5184 * 5200 5185 * EEVDF: placement strategy #1 / #2 5201 5186 */ 5202 - if (sched_feat(PLACE_LAG) && cfs_rq->nr_running && se->vlag) { 5187 + if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) { 5203 5188 struct sched_entity *curr = cfs_rq->curr; 5204 5189 unsigned long load; 5205 5190 ··· 5292 5277 static void check_enqueue_throttle(struct cfs_rq *cfs_rq); 5293 5278 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq); 5294 5279 5295 - static inline bool cfs_bandwidth_used(void); 5296 - 5297 5280 static void 5298 5281 requeue_delayed_entity(struct sched_entity *se); 5299 5282 ··· 5313 5300 * When enqueuing a sched_entity, we must: 5314 5301 * - Update loads to have both entity and cfs_rq synced with now. 5315 5302 * - For group_entity, update its runnable_weight to reflect the new 5316 - * h_nr_running of its group cfs_rq. 5303 + * h_nr_runnable of its group cfs_rq. 5317 5304 * - For group_entity, update its weight to reflect the new share of 5318 5305 * its group cfs_rq 5319 5306 * - Add its new weight to cfs_rq->load.weight ··· 5346 5333 __enqueue_entity(cfs_rq, se); 5347 5334 se->on_rq = 1; 5348 5335 5349 - if (cfs_rq->nr_running == 1) { 5336 + if (cfs_rq->nr_queued == 1) { 5350 5337 check_enqueue_throttle(cfs_rq); 5351 5338 if (!throttled_hierarchy(cfs_rq)) { 5352 5339 list_add_leaf_cfs_rq(cfs_rq); ··· 5388 5375 for_each_sched_entity(se) { 5389 5376 struct cfs_rq *cfs_rq = cfs_rq_of(se); 5390 5377 5391 - cfs_rq->h_nr_delayed++; 5378 + cfs_rq->h_nr_runnable--; 5392 5379 if (cfs_rq_throttled(cfs_rq)) 5393 5380 break; 5394 5381 } ··· 5400 5387 for_each_sched_entity(se) { 5401 5388 struct cfs_rq *cfs_rq = cfs_rq_of(se); 5402 5389 5403 - cfs_rq->h_nr_delayed--; 5390 + cfs_rq->h_nr_runnable++; 5404 5391 if (cfs_rq_throttled(cfs_rq)) 5405 5392 break; 5406 5393 } ··· 5417 5404 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 5418 5405 { 
5419 5406 bool sleep = flags & DEQUEUE_SLEEP; 5407 + int action = UPDATE_TG; 5420 5408 5421 5409 update_curr(cfs_rq); 5422 5410 clear_buddies(cfs_rq, se); ··· 5443 5429 } 5444 5430 } 5445 5431 5446 - int action = UPDATE_TG; 5447 5432 if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) 5448 5433 action |= DO_DETACH; 5449 5434 ··· 5450 5437 * When dequeuing a sched_entity, we must: 5451 5438 * - Update loads to have both entity and cfs_rq synced with now. 5452 5439 * - For group_entity, update its runnable_weight to reflect the new 5453 - * h_nr_running of its group cfs_rq. 5440 + * h_nr_runnable of its group cfs_rq. 5454 5441 * - Subtract its previous weight from cfs_rq->load.weight. 5455 5442 * - For group entity, update its weight to reflect the new share 5456 5443 * of its group cfs_rq. ··· 5488 5475 if (flags & DEQUEUE_DELAYED) 5489 5476 finish_delayed_dequeue_entity(se); 5490 5477 5491 - if (cfs_rq->nr_running == 0) 5478 + if (cfs_rq->nr_queued == 0) 5492 5479 update_idle_cfs_rq_clock_pelt(cfs_rq); 5493 5480 5494 5481 return true; ··· 5550 5537 static struct sched_entity * 5551 5538 pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) 5552 5539 { 5540 + struct sched_entity *se; 5541 + 5553 5542 /* 5554 - * Enabling NEXT_BUDDY will affect latency but not fairness. 5543 + * Picking the ->next buddy will affect latency but not fairness. 
5555 5544 */ 5556 - if (sched_feat(NEXT_BUDDY) && 5545 + if (sched_feat(PICK_BUDDY) && 5557 5546 cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { 5558 5547 /* ->next will never be delayed */ 5559 5548 SCHED_WARN_ON(cfs_rq->next->sched_delayed); 5560 5549 return cfs_rq->next; 5561 5550 } 5562 5551 5563 - struct sched_entity *se = pick_eevdf(cfs_rq); 5552 + se = pick_eevdf(cfs_rq); 5564 5553 if (se->sched_delayed) { 5565 5554 dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); 5566 5555 /* ··· 5838 5823 list_del_leaf_cfs_rq(cfs_rq); 5839 5824 5840 5825 SCHED_WARN_ON(cfs_rq->throttled_clock_self); 5841 - if (cfs_rq->nr_running) 5826 + if (cfs_rq->nr_queued) 5842 5827 cfs_rq->throttled_clock_self = rq_clock(rq); 5843 5828 } 5844 5829 cfs_rq->throttle_count++; ··· 5851 5836 struct rq *rq = rq_of(cfs_rq); 5852 5837 struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); 5853 5838 struct sched_entity *se; 5854 - long task_delta, idle_task_delta, delayed_delta, dequeue = 1; 5855 - long rq_h_nr_running = rq->cfs.h_nr_running; 5839 + long queued_delta, runnable_delta, idle_delta, dequeue = 1; 5840 + long rq_h_nr_queued = rq->cfs.h_nr_queued; 5856 5841 5857 5842 raw_spin_lock(&cfs_b->lock); 5858 5843 /* This will start the period timer if necessary */ ··· 5882 5867 walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq); 5883 5868 rcu_read_unlock(); 5884 5869 5885 - task_delta = cfs_rq->h_nr_running; 5886 - idle_task_delta = cfs_rq->idle_h_nr_running; 5887 - delayed_delta = cfs_rq->h_nr_delayed; 5870 + queued_delta = cfs_rq->h_nr_queued; 5871 + runnable_delta = cfs_rq->h_nr_runnable; 5872 + idle_delta = cfs_rq->h_nr_idle; 5888 5873 for_each_sched_entity(se) { 5889 5874 struct cfs_rq *qcfs_rq = cfs_rq_of(se); 5890 5875 int flags; ··· 5904 5889 dequeue_entity(qcfs_rq, se, flags); 5905 5890 5906 5891 if (cfs_rq_is_idle(group_cfs_rq(se))) 5907 - idle_task_delta = cfs_rq->h_nr_running; 5892 + idle_delta = cfs_rq->h_nr_queued; 5908 5893 5909 - 
qcfs_rq->h_nr_running -= task_delta; 5910 - qcfs_rq->idle_h_nr_running -= idle_task_delta; 5911 - qcfs_rq->h_nr_delayed -= delayed_delta; 5894 + qcfs_rq->h_nr_queued -= queued_delta; 5895 + qcfs_rq->h_nr_runnable -= runnable_delta; 5896 + qcfs_rq->h_nr_idle -= idle_delta; 5912 5897 5913 5898 if (qcfs_rq->load.weight) { 5914 5899 /* Avoid re-evaluating load for this entity: */ ··· 5927 5912 se_update_runnable(se); 5928 5913 5929 5914 if (cfs_rq_is_idle(group_cfs_rq(se))) 5930 - idle_task_delta = cfs_rq->h_nr_running; 5915 + idle_delta = cfs_rq->h_nr_queued; 5931 5916 5932 - qcfs_rq->h_nr_running -= task_delta; 5933 - qcfs_rq->idle_h_nr_running -= idle_task_delta; 5934 - qcfs_rq->h_nr_delayed -= delayed_delta; 5917 + qcfs_rq->h_nr_queued -= queued_delta; 5918 + qcfs_rq->h_nr_runnable -= runnable_delta; 5919 + qcfs_rq->h_nr_idle -= idle_delta; 5935 5920 } 5936 5921 5937 5922 /* At this point se is NULL and we are at root level*/ 5938 - sub_nr_running(rq, task_delta); 5923 + sub_nr_running(rq, queued_delta); 5939 5924 5940 5925 /* Stop the fair server if throttling resulted in no runnable tasks */ 5941 - if (rq_h_nr_running && !rq->cfs.h_nr_running) 5926 + if (rq_h_nr_queued && !rq->cfs.h_nr_queued) 5942 5927 dl_server_stop(&rq->fair_server); 5943 5928 done: 5944 5929 /* ··· 5947 5932 */ 5948 5933 cfs_rq->throttled = 1; 5949 5934 SCHED_WARN_ON(cfs_rq->throttled_clock); 5950 - if (cfs_rq->nr_running) 5935 + if (cfs_rq->nr_queued) 5951 5936 cfs_rq->throttled_clock = rq_clock(rq); 5952 5937 return true; 5953 5938 } ··· 5957 5942 struct rq *rq = rq_of(cfs_rq); 5958 5943 struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); 5959 5944 struct sched_entity *se; 5960 - long task_delta, idle_task_delta, delayed_delta; 5961 - long rq_h_nr_running = rq->cfs.h_nr_running; 5945 + long queued_delta, runnable_delta, idle_delta; 5946 + long rq_h_nr_queued = rq->cfs.h_nr_queued; 5962 5947 5963 5948 se = cfs_rq->tg->se[cpu_of(rq)]; 5964 5949 ··· 5991 5976 goto unthrottle_throttle; 
5992 5977 } 5993 5978 5994 - task_delta = cfs_rq->h_nr_running; 5995 - idle_task_delta = cfs_rq->idle_h_nr_running; 5996 - delayed_delta = cfs_rq->h_nr_delayed; 5979 + queued_delta = cfs_rq->h_nr_queued; 5980 + runnable_delta = cfs_rq->h_nr_runnable; 5981 + idle_delta = cfs_rq->h_nr_idle; 5997 5982 for_each_sched_entity(se) { 5998 5983 struct cfs_rq *qcfs_rq = cfs_rq_of(se); 5999 5984 ··· 6007 5992 enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP); 6008 5993 6009 5994 if (cfs_rq_is_idle(group_cfs_rq(se))) 6010 - idle_task_delta = cfs_rq->h_nr_running; 5995 + idle_delta = cfs_rq->h_nr_queued; 6011 5996 6012 - qcfs_rq->h_nr_running += task_delta; 6013 - qcfs_rq->idle_h_nr_running += idle_task_delta; 6014 - qcfs_rq->h_nr_delayed += delayed_delta; 5997 + qcfs_rq->h_nr_queued += queued_delta; 5998 + qcfs_rq->h_nr_runnable += runnable_delta; 5999 + qcfs_rq->h_nr_idle += idle_delta; 6015 6000 6016 6001 /* end evaluation on encountering a throttled cfs_rq */ 6017 6002 if (cfs_rq_throttled(qcfs_rq)) ··· 6025 6010 se_update_runnable(se); 6026 6011 6027 6012 if (cfs_rq_is_idle(group_cfs_rq(se))) 6028 - idle_task_delta = cfs_rq->h_nr_running; 6013 + idle_delta = cfs_rq->h_nr_queued; 6029 6014 6030 - qcfs_rq->h_nr_running += task_delta; 6031 - qcfs_rq->idle_h_nr_running += idle_task_delta; 6032 - qcfs_rq->h_nr_delayed += delayed_delta; 6015 + qcfs_rq->h_nr_queued += queued_delta; 6016 + qcfs_rq->h_nr_runnable += runnable_delta; 6017 + qcfs_rq->h_nr_idle += idle_delta; 6033 6018 6034 6019 /* end evaluation on encountering a throttled cfs_rq */ 6035 6020 if (cfs_rq_throttled(qcfs_rq)) ··· 6037 6022 } 6038 6023 6039 6024 /* Start the fair server if un-throttling resulted in new runnable tasks */ 6040 - if (!rq_h_nr_running && rq->cfs.h_nr_running) 6025 + if (!rq_h_nr_queued && rq->cfs.h_nr_queued) 6041 6026 dl_server_start(&rq->fair_server); 6042 6027 6043 6028 /* At this point se is NULL and we are at root level*/ 6044 - add_nr_running(rq, task_delta); 6029 + add_nr_running(rq, 
queued_delta); 6045 6030 6046 6031 unthrottle_throttle: 6047 6032 assert_list_leaf_cfs_rq(rq); 6048 6033 6049 6034 /* Determine whether we need to wake up potentially idle CPU: */ 6050 - if (rq->curr == rq->idle && rq->cfs.nr_running) 6035 + if (rq->curr == rq->idle && rq->cfs.nr_queued) 6051 6036 resched_curr(rq); 6052 6037 } 6053 6038 ··· 6348 6333 if (!cfs_bandwidth_used()) 6349 6334 return; 6350 6335 6351 - if (!cfs_rq->runtime_enabled || cfs_rq->nr_running) 6336 + if (!cfs_rq->runtime_enabled || cfs_rq->nr_queued) 6352 6337 return; 6353 6338 6354 6339 __return_cfs_rq_runtime(cfs_rq); ··· 6619 6604 6620 6605 lockdep_assert_rq_held(rq); 6621 6606 6607 + // Do not unthrottle for an active CPU 6608 + if (cpumask_test_cpu(cpu_of(rq), cpu_active_mask)) 6609 + return; 6610 + 6622 6611 /* 6623 6612 * The rq clock has already been updated in the 6624 6613 * set_rq_offline(), so we should skip updating ··· 6638 6619 continue; 6639 6620 6640 6621 /* 6641 - * clock_task is not advancing so we just need to make sure 6642 - * there's some valid quota amount 6643 - */ 6644 - cfs_rq->runtime_remaining = 1; 6645 - /* 6646 6622 * Offline rq is schedulable till CPU is completely disabled 6647 6623 * in take_cpu_down(), so we prevent new cfs throttling here. 
6648 6624 */ 6649 6625 cfs_rq->runtime_enabled = 0; 6650 6626 6651 - if (cfs_rq_throttled(cfs_rq)) 6652 - unthrottle_cfs_rq(cfs_rq); 6627 + if (!cfs_rq_throttled(cfs_rq)) 6628 + continue; 6629 + 6630 + /* 6631 + * clock_task is not advancing so we just need to make sure 6632 + * there's some valid quota amount 6633 + */ 6634 + cfs_rq->runtime_remaining = 1; 6635 + unthrottle_cfs_rq(cfs_rq); 6653 6636 } 6654 6637 rcu_read_unlock(); 6655 6638 ··· 6699 6678 #endif 6700 6679 6701 6680 #else /* CONFIG_CFS_BANDWIDTH */ 6702 - 6703 - static inline bool cfs_bandwidth_used(void) 6704 - { 6705 - return false; 6706 - } 6707 6681 6708 6682 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {} 6709 6683 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; } ··· 6757 6741 6758 6742 SCHED_WARN_ON(task_rq(p) != rq); 6759 6743 6760 - if (rq->cfs.h_nr_running > 1) { 6744 + if (rq->cfs.h_nr_queued > 1) { 6761 6745 u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime; 6762 6746 u64 slice = se->slice; 6763 6747 s64 delta = slice - ran; ··· 6845 6829 /* Runqueue only has SCHED_IDLE tasks enqueued */ 6846 6830 static int sched_idle_rq(struct rq *rq) 6847 6831 { 6848 - return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running && 6832 + return unlikely(rq->nr_running == rq->cfs.h_nr_idle && 6849 6833 rq->nr_running); 6850 6834 } 6851 6835 ··· 6872 6856 if (sched_feat(DELAY_ZERO)) { 6873 6857 update_entity_lag(cfs_rq, se); 6874 6858 if (se->vlag > 0) { 6875 - cfs_rq->nr_running--; 6859 + cfs_rq->nr_queued--; 6876 6860 if (se != cfs_rq->curr) 6877 6861 __dequeue_entity(cfs_rq, se); 6878 6862 se->vlag = 0; 6879 6863 place_entity(cfs_rq, se, 0); 6880 6864 if (se != cfs_rq->curr) 6881 6865 __enqueue_entity(cfs_rq, se); 6882 - cfs_rq->nr_running++; 6866 + cfs_rq->nr_queued++; 6883 6867 } 6884 6868 } 6885 6869 ··· 6897 6881 { 6898 6882 struct cfs_rq *cfs_rq; 6899 6883 struct sched_entity *se = &p->se; 6900 - int idle_h_nr_running = 
task_has_idle_policy(p); 6901 - int h_nr_delayed = 0; 6884 + int h_nr_idle = task_has_idle_policy(p); 6885 + int h_nr_runnable = 1; 6902 6886 int task_new = !(flags & ENQUEUE_WAKEUP); 6903 - int rq_h_nr_running = rq->cfs.h_nr_running; 6887 + int rq_h_nr_queued = rq->cfs.h_nr_queued; 6904 6888 u64 slice = 0; 6905 6889 6906 6890 /* ··· 6925 6909 if (p->in_iowait) 6926 6910 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT); 6927 6911 6928 - if (task_new) 6929 - h_nr_delayed = !!se->sched_delayed; 6912 + if (task_new && se->sched_delayed) 6913 + h_nr_runnable = 0; 6930 6914 6931 6915 for_each_sched_entity(se) { 6932 6916 if (se->on_rq) { ··· 6948 6932 enqueue_entity(cfs_rq, se, flags); 6949 6933 slice = cfs_rq_min_slice(cfs_rq); 6950 6934 6951 - cfs_rq->h_nr_running++; 6952 - cfs_rq->idle_h_nr_running += idle_h_nr_running; 6953 - cfs_rq->h_nr_delayed += h_nr_delayed; 6935 + cfs_rq->h_nr_runnable += h_nr_runnable; 6936 + cfs_rq->h_nr_queued++; 6937 + cfs_rq->h_nr_idle += h_nr_idle; 6954 6938 6955 6939 if (cfs_rq_is_idle(cfs_rq)) 6956 - idle_h_nr_running = 1; 6940 + h_nr_idle = 1; 6957 6941 6958 6942 /* end evaluation on encountering a throttled cfs_rq */ 6959 6943 if (cfs_rq_throttled(cfs_rq)) ··· 6972 6956 se->slice = slice; 6973 6957 slice = cfs_rq_min_slice(cfs_rq); 6974 6958 6975 - cfs_rq->h_nr_running++; 6976 - cfs_rq->idle_h_nr_running += idle_h_nr_running; 6977 - cfs_rq->h_nr_delayed += h_nr_delayed; 6959 + cfs_rq->h_nr_runnable += h_nr_runnable; 6960 + cfs_rq->h_nr_queued++; 6961 + cfs_rq->h_nr_idle += h_nr_idle; 6978 6962 6979 6963 if (cfs_rq_is_idle(cfs_rq)) 6980 - idle_h_nr_running = 1; 6964 + h_nr_idle = 1; 6981 6965 6982 6966 /* end evaluation on encountering a throttled cfs_rq */ 6983 6967 if (cfs_rq_throttled(cfs_rq)) 6984 6968 goto enqueue_throttle; 6985 6969 } 6986 6970 6987 - if (!rq_h_nr_running && rq->cfs.h_nr_running) { 6971 + if (!rq_h_nr_queued && rq->cfs.h_nr_queued) { 6988 6972 /* Account for idle runtime */ 6989 6973 if (!rq->nr_running) 6990 6974 
dl_server_update_idle_time(rq, rq->curr); ··· 7031 7015 static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) 7032 7016 { 7033 7017 bool was_sched_idle = sched_idle_rq(rq); 7034 - int rq_h_nr_running = rq->cfs.h_nr_running; 7018 + int rq_h_nr_queued = rq->cfs.h_nr_queued; 7035 7019 bool task_sleep = flags & DEQUEUE_SLEEP; 7036 7020 bool task_delayed = flags & DEQUEUE_DELAYED; 7037 7021 struct task_struct *p = NULL; 7038 - int idle_h_nr_running = 0; 7039 - int h_nr_running = 0; 7040 - int h_nr_delayed = 0; 7022 + int h_nr_idle = 0; 7023 + int h_nr_queued = 0; 7024 + int h_nr_runnable = 0; 7041 7025 struct cfs_rq *cfs_rq; 7042 7026 u64 slice = 0; 7043 7027 7044 7028 if (entity_is_task(se)) { 7045 7029 p = task_of(se); 7046 - h_nr_running = 1; 7047 - idle_h_nr_running = task_has_idle_policy(p); 7048 - if (!task_sleep && !task_delayed) 7049 - h_nr_delayed = !!se->sched_delayed; 7030 + h_nr_queued = 1; 7031 + h_nr_idle = task_has_idle_policy(p); 7032 + if (task_sleep || task_delayed || !se->sched_delayed) 7033 + h_nr_runnable = 1; 7050 7034 } else { 7051 7035 cfs_rq = group_cfs_rq(se); 7052 7036 slice = cfs_rq_min_slice(cfs_rq); ··· 7062 7046 break; 7063 7047 } 7064 7048 7065 - cfs_rq->h_nr_running -= h_nr_running; 7066 - cfs_rq->idle_h_nr_running -= idle_h_nr_running; 7067 - cfs_rq->h_nr_delayed -= h_nr_delayed; 7049 + cfs_rq->h_nr_runnable -= h_nr_runnable; 7050 + cfs_rq->h_nr_queued -= h_nr_queued; 7051 + cfs_rq->h_nr_idle -= h_nr_idle; 7068 7052 7069 7053 if (cfs_rq_is_idle(cfs_rq)) 7070 - idle_h_nr_running = h_nr_running; 7054 + h_nr_idle = h_nr_queued; 7071 7055 7072 7056 /* end evaluation on encountering a throttled cfs_rq */ 7073 7057 if (cfs_rq_throttled(cfs_rq)) ··· 7101 7085 se->slice = slice; 7102 7086 slice = cfs_rq_min_slice(cfs_rq); 7103 7087 7104 - cfs_rq->h_nr_running -= h_nr_running; 7105 - cfs_rq->idle_h_nr_running -= idle_h_nr_running; 7106 - cfs_rq->h_nr_delayed -= h_nr_delayed; 7088 + cfs_rq->h_nr_runnable -= h_nr_runnable; 
7089 + cfs_rq->h_nr_queued -= h_nr_queued; 7090 + cfs_rq->h_nr_idle -= h_nr_idle; 7107 7091 7108 7092 if (cfs_rq_is_idle(cfs_rq)) 7109 - idle_h_nr_running = h_nr_running; 7093 + h_nr_idle = h_nr_queued; 7110 7094 7111 7095 /* end evaluation on encountering a throttled cfs_rq */ 7112 7096 if (cfs_rq_throttled(cfs_rq)) 7113 7097 return 0; 7114 7098 } 7115 7099 7116 - sub_nr_running(rq, h_nr_running); 7100 + sub_nr_running(rq, h_nr_queued); 7117 7101 7118 - if (rq_h_nr_running && !rq->cfs.h_nr_running) 7102 + if (rq_h_nr_queued && !rq->cfs.h_nr_queued) 7119 7103 dl_server_stop(&rq->fair_server); 7120 7104 7121 7105 /* balance early to pull high priority tasks */ ··· 8804 8788 8805 8789 again: 8806 8790 cfs_rq = &rq->cfs; 8807 - if (!cfs_rq->nr_running) 8791 + if (!cfs_rq->nr_queued) 8808 8792 return NULL; 8809 8793 8810 8794 do { ··· 8921 8905 8922 8906 static bool fair_server_has_tasks(struct sched_dl_entity *dl_se) 8923 8907 { 8924 - return !!dl_se->rq->cfs.nr_running; 8908 + return !!dl_se->rq->cfs.nr_queued; 8925 8909 } 8926 8910 8927 8911 static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se) ··· 9252 9236 9253 9237 #ifdef CONFIG_NUMA_BALANCING 9254 9238 /* 9255 - * Returns 1, if task migration degrades locality 9256 - * Returns 0, if task migration improves locality i.e migration preferred. 9257 - * Returns -1, if task migration is not affected by locality. 9239 + * Returns a positive value, if task migration degrades locality. 9240 + * Returns 0, if task migration is not affected by locality. 9241 + * Returns a negative value, if task migration improves locality i.e migration preferred. 
9258 9242   */
9259      - static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
     9243 + static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
9260 9244  {
9261 9245      struct numa_group *numa_group = rcu_dereference(p->numa_group);
9262 9246      unsigned long src_weight, dst_weight;
9263 9247      int src_nid, dst_nid, dist;
9264 9248
9265 9249      if (!static_branch_likely(&sched_numa_balancing))
9266      -        return -1;
     9250 +        return 0;
9267 9251
9268 9252      if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
9269      -        return -1;
     9253 +        return 0;
9270 9254
9271 9255      src_nid = cpu_to_node(env->src_cpu);
9272 9256      dst_nid = cpu_to_node(env->dst_cpu);
9273 9257
9274 9258      if (src_nid == dst_nid)
9275      -        return -1;
     9259 +        return 0;
9276 9260
9277 9261      /* Migrating away from the preferred node is always bad. */
9278 9262      if (src_nid == p->numa_preferred_nid) {
9279 9263          if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
9280 9264              return 1;
9281 9265          else
9282      -            return -1;
     9266 +            return 0;
9283 9267      }
9284 9268
9285 9269      /* Encourage migration to the preferred node. */
9286 9270      if (dst_nid == p->numa_preferred_nid)
9287      -        return 0;
     9271 +        return -1;
9288 9272
9289 9273      /* Leaving a core idle is often worse than degrading locality.
          */
9290 9274      if (env->idle == CPU_IDLE)
9291      -        return -1;
     9275 +        return 0;
9292 9276
9293 9277      dist = node_distance(src_nid, dst_nid);
9294 9278      if (numa_group) {
···
9299 9283          dst_weight = task_weight(p, dst_nid, dist);
9300 9284      }
9301 9285
9302      -    return dst_weight < src_weight;
     9286 +    return src_weight - dst_weight;
9303 9287  }
9304 9288
9305 9289  #else
9306      - static inline int migrate_degrades_locality(struct task_struct *p,
     9290 + static inline long migrate_degrades_locality(struct task_struct *p,
9307 9291                          struct lb_env *env)
9308 9292  {
9309      -    return -1;
     9293 +    return 0;
9310 9294  }
9311 9295  #endif
     9296 +
     9297 + /*
     9298 +  * Check whether the task is ineligible on the destination cpu
     9299 +  *
     9300 +  * When the PLACE_LAG scheduling feature is enabled and
     9301 +  * dst_cfs_rq->nr_queued is greater than 1, if the task
     9302 +  * is ineligible, it will also be ineligible when
     9303 +  * it is migrated to the destination cpu.
     9304 +  */
     9305 + static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_cpu)
     9306 + {
     9307 +    struct cfs_rq *dst_cfs_rq;
     9308 +
     9309 + #ifdef CONFIG_FAIR_GROUP_SCHED
     9310 +    dst_cfs_rq = task_group(p)->cfs_rq[dest_cpu];
     9311 + #else
     9312 +    dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
     9313 + #endif
     9314 +    if (sched_feat(PLACE_LAG) && dst_cfs_rq->nr_queued &&
     9315 +        !entity_eligible(task_cfs_rq(p), &p->se))
     9316 +        return 1;
     9317 +
     9318 +    return 0;
     9319 + }
9312 9320
9313 9321  /*
9314 9322   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
··· 9340 9300 static 9341 9301 int can_migrate_task(struct task_struct *p, struct lb_env *env) 9342 9302 { 9343 - int tsk_cache_hot; 9303 + long degrades, hot; 9344 9304 9345 9305 lockdep_assert_rq_held(env->src_rq); 9306 + if (p->sched_task_hot) 9307 + p->sched_task_hot = 0; 9346 9308 9347 9309 /* 9348 9310 * We do not migrate tasks that are: 9349 - * 1) throttled_lb_pair, or 9350 - * 2) cannot be migrated to this CPU due to cpus_ptr, or 9351 - * 3) running (obviously), or 9352 - * 4) are cache-hot on their current CPU. 9311 + * 1) delayed dequeued unless we migrate load, or 9312 + * 2) throttled_lb_pair, or 9313 + * 3) cannot be migrated to this CPU due to cpus_ptr, or 9314 + * 4) running (obviously), or 9315 + * 5) are cache-hot on their current CPU. 9353 9316 */ 9317 + if ((p->se.sched_delayed) && (env->migration_type != migrate_load)) 9318 + return 0; 9319 + 9354 9320 if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) 9321 + return 0; 9322 + 9323 + /* 9324 + * We want to prioritize the migration of eligible tasks. 9325 + * For ineligible tasks we soft-limit them and only allow 9326 + * them to migrate when nr_balance_failed is non-zero to 9327 + * avoid load-balancing trying very hard to balance the load. 9328 + */ 9329 + if (!env->sd->nr_balance_failed && 9330 + task_is_ineligible_on_dst_cpu(p, env->dst_cpu)) 9355 9331 return 0; 9356 9332 9357 9333 /* Disregard percpu kthreads; they are where they need to be. 
*/ ··· 9425 9369 if (env->flags & LBF_ACTIVE_LB) 9426 9370 return 1; 9427 9371 9428 - tsk_cache_hot = migrate_degrades_locality(p, env); 9429 - if (tsk_cache_hot == -1) 9430 - tsk_cache_hot = task_hot(p, env); 9372 + degrades = migrate_degrades_locality(p, env); 9373 + if (!degrades) 9374 + hot = task_hot(p, env); 9375 + else 9376 + hot = degrades > 0; 9431 9377 9432 - if (tsk_cache_hot <= 0 || 9433 - env->sd->nr_balance_failed > env->sd->cache_nice_tries) { 9434 - if (tsk_cache_hot == 1) { 9435 - schedstat_inc(env->sd->lb_hot_gained[env->idle]); 9436 - schedstat_inc(p->stats.nr_forced_migrations); 9437 - } 9378 + if (!hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) { 9379 + if (hot) 9380 + p->sched_task_hot = 1; 9438 9381 return 1; 9439 9382 } 9440 9383 ··· 9447 9392 static void detach_task(struct task_struct *p, struct lb_env *env) 9448 9393 { 9449 9394 lockdep_assert_rq_held(env->src_rq); 9395 + 9396 + if (p->sched_task_hot) { 9397 + p->sched_task_hot = 0; 9398 + schedstat_inc(env->sd->lb_hot_gained[env->idle]); 9399 + schedstat_inc(p->stats.nr_forced_migrations); 9400 + } 9450 9401 9451 9402 deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK); 9452 9403 set_task_cpu(p, env->dst_cpu); ··· 9614 9553 9615 9554 continue; 9616 9555 next: 9556 + if (p->sched_task_hot) 9557 + schedstat_inc(p->stats.nr_failed_migrations_hot); 9558 + 9617 9559 list_move(&p->se.group_node, tasks); 9618 9560 } 9619 9561 ··· 9759 9695 if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) { 9760 9696 update_tg_load_avg(cfs_rq); 9761 9697 9762 - if (cfs_rq->nr_running == 0) 9698 + if (cfs_rq->nr_queued == 0) 9763 9699 update_idle_cfs_rq_clock_pelt(cfs_rq); 9764 9700 9765 9701 if (cfs_rq == &rq->cfs) ··· 10291 10227 * When there is more than 1 task, the group_overloaded case already 10292 10228 * takes care of cpu with reduced capacity 10293 10229 */ 10294 - if (rq->cfs.h_nr_running != 1) 10230 + if (rq->cfs.h_nr_runnable != 1) 10295 10231 return false; 10296 10232 10297 
10233 return check_cpu_capacity(rq, sd); ··· 10313 10249 bool *sg_overloaded, 10314 10250 bool *sg_overutilized) 10315 10251 { 10316 - int i, nr_running, local_group; 10252 + int i, nr_running, local_group, sd_flags = env->sd->flags; 10253 + bool balancing_at_rd = !env->sd->parent; 10317 10254 10318 10255 memset(sgs, 0, sizeof(*sgs)); 10319 10256 ··· 10327 10262 sgs->group_load += load; 10328 10263 sgs->group_util += cpu_util_cfs(i); 10329 10264 sgs->group_runnable += cpu_runnable(rq); 10330 - sgs->sum_h_nr_running += rq->cfs.h_nr_running; 10265 + sgs->sum_h_nr_running += rq->cfs.h_nr_runnable; 10331 10266 10332 10267 nr_running = rq->nr_running; 10333 10268 sgs->sum_nr_running += nr_running; 10334 10269 10335 - if (nr_running > 1) 10336 - *sg_overloaded = 1; 10337 - 10338 10270 if (cpu_overutilized(i)) 10339 10271 *sg_overutilized = 1; 10340 10272 10341 - #ifdef CONFIG_NUMA_BALANCING 10342 - sgs->nr_numa_running += rq->nr_numa_running; 10343 - sgs->nr_preferred_running += rq->nr_preferred_running; 10344 - #endif 10345 10273 /* 10346 10274 * No need to call idle_cpu() if nr_running is not 0 10347 10275 */ ··· 10344 10286 continue; 10345 10287 } 10346 10288 10289 + /* Overload indicator is only updated at root domain */ 10290 + if (balancing_at_rd && nr_running > 1) 10291 + *sg_overloaded = 1; 10292 + 10293 + #ifdef CONFIG_NUMA_BALANCING 10294 + /* Only fbq_classify_group() uses this to classify NUMA groups */ 10295 + if (sd_flags & SD_NUMA) { 10296 + sgs->nr_numa_running += rq->nr_numa_running; 10297 + sgs->nr_preferred_running += rq->nr_preferred_running; 10298 + } 10299 + #endif 10347 10300 if (local_group) 10348 10301 continue; 10349 10302 10350 - if (env->sd->flags & SD_ASYM_CPUCAPACITY) { 10303 + if (sd_flags & SD_ASYM_CPUCAPACITY) { 10351 10304 /* Check for a misfit task on the cpu */ 10352 10305 if (sgs->group_misfit_task_load < rq->misfit_task_load) { 10353 10306 sgs->group_misfit_task_load = rq->misfit_task_load; ··· 10646 10577 sgs->group_util += 
cpu_util_without(i, p); 10647 10578 sgs->group_runnable += cpu_runnable_without(rq, p); 10648 10579 local = task_running_on_cpu(i, p); 10649 - sgs->sum_h_nr_running += rq->cfs.h_nr_running - local; 10580 + sgs->sum_h_nr_running += rq->cfs.h_nr_runnable - local; 10650 10581 10651 10582 nr_running = rq->nr_running - local; 10652 10583 sgs->sum_nr_running += nr_running; ··· 11428 11359 if (rt > env->fbq_type) 11429 11360 continue; 11430 11361 11431 - nr_running = rq->cfs.h_nr_running; 11362 + nr_running = rq->cfs.h_nr_runnable; 11432 11363 if (!nr_running) 11433 11364 continue; 11434 11365 ··· 11587 11518 * available on dst_cpu. 11588 11519 */ 11589 11520 if (env->idle && 11590 - (env->src_rq->cfs.h_nr_running == 1)) { 11521 + (env->src_rq->cfs.h_nr_runnable == 1)) { 11591 11522 if ((check_cpu_capacity(env->src_rq, sd)) && 11592 11523 (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100)) 11593 11524 return 1; ··· 11667 11598 return group_balance_cpu(sg) == env->dst_cpu; 11668 11599 } 11669 11600 11601 + static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd, 11602 + enum cpu_idle_type idle) 11603 + { 11604 + if (!schedstat_enabled()) 11605 + return; 11606 + 11607 + switch (env->migration_type) { 11608 + case migrate_load: 11609 + __schedstat_add(sd->lb_imbalance_load[idle], env->imbalance); 11610 + break; 11611 + case migrate_util: 11612 + __schedstat_add(sd->lb_imbalance_util[idle], env->imbalance); 11613 + break; 11614 + case migrate_task: 11615 + __schedstat_add(sd->lb_imbalance_task[idle], env->imbalance); 11616 + break; 11617 + case migrate_misfit: 11618 + __schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance); 11619 + break; 11620 + } 11621 + } 11622 + 11670 11623 /* 11671 11624 * Check this_cpu to ensure it is balanced within domain. Attempt to move 11672 11625 * tasks if there is an imbalance. 
··· 11739 11648 11740 11649 WARN_ON_ONCE(busiest == env.dst_rq); 11741 11650 11742 - schedstat_add(sd->lb_imbalance[idle], env.imbalance); 11651 + update_lb_imbalance_stat(&env, sd, idle); 11743 11652 11744 11653 env.src_cpu = busiest->cpu; 11745 11654 env.src_rq = busiest; ··· 12237 12146 * - When one of the busy CPUs notices that there may be an idle rebalancing 12238 12147 * needed, they will kick the idle load balancer, which then does idle 12239 12148 * load balancing for all the idle CPUs. 12240 - * 12241 - * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED is not set 12242 - * anywhere yet. 12243 12149 */ 12244 12150 static inline int find_new_ilb(void) 12245 12151 { 12246 12152 const struct cpumask *hk_mask; 12247 12153 int ilb_cpu; 12248 12154 12249 - hk_mask = housekeeping_cpumask(HK_TYPE_MISC); 12155 + hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE); 12250 12156 12251 12157 for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) { 12252 12158 ··· 12261 12173 * Kick a CPU to do the NOHZ balancing, if it is time for it, via a cross-CPU 12262 12174 * SMP function call (IPI). 12263 12175 * 12264 - * We pick the first idle CPU in the HK_TYPE_MISC housekeeping set (if there is one). 12176 + * We pick the first idle CPU in the HK_TYPE_KERNEL_NOISE housekeeping set 12177 + * (if there is one). 
12265 12178 */ 12266 12179 static void kick_ilb(unsigned int flags) 12267 12180 { ··· 12350 12261 * If there's a runnable CFS task and the current CPU has reduced 12351 12262 * capacity, kick the ILB to see if there's a better CPU to run on: 12352 12263 */ 12353 - if (rq->cfs.h_nr_running >= 1 && check_cpu_capacity(rq, sd)) { 12264 + if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) { 12354 12265 flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK; 12355 12266 goto unlock; 12356 12267 } ··· 12480 12391 12481 12392 /* If this CPU is going down, then nothing needs to be done: */ 12482 12393 if (!cpu_active(cpu)) 12483 - return; 12484 - 12485 - /* Spare idle load balancing on CPUs that don't want to be disturbed: */ 12486 - if (!housekeeping_cpu(cpu, HK_TYPE_SCHED)) 12487 12394 return; 12488 12395 12489 12396 /* ··· 12701 12616 { 12702 12617 int this_cpu = this_rq->cpu; 12703 12618 12704 - /* 12705 - * This CPU doesn't want to be disturbed by scheduler 12706 - * housekeeping 12707 - */ 12708 - if (!housekeeping_cpu(this_cpu, HK_TYPE_SCHED)) 12709 - return; 12710 - 12711 12619 /* Will wake up very soon. No time for doing anything else*/ 12712 12620 if (this_rq->avg_idle < sysctl_sched_migration_cost) 12713 12621 return; ··· 12837 12759 * have been enqueued in the meantime. Since we're not going idle, 12838 12760 * pretend we pulled a task. 12839 12761 */ 12840 - if (this_rq->cfs.h_nr_running && !pulled_task) 12762 + if (this_rq->cfs.h_nr_queued && !pulled_task) 12841 12763 pulled_task = 1; 12842 12764 12843 12765 /* Is there a task of a high priority class? 
*/ 12844 - if (this_rq->nr_running != this_rq->cfs.h_nr_running) 12766 + if (this_rq->nr_running != this_rq->cfs.h_nr_queued) 12845 12767 pulled_task = -1; 12846 12768 12847 12769 out: ··· 12862 12784 /* 12863 12785 * This softirq handler is triggered via SCHED_SOFTIRQ from two places: 12864 12786 * 12865 - * - directly from the local scheduler_tick() for periodic load balancing 12787 + * - directly from the local sched_tick() for periodic load balancing 12866 12788 * 12867 - * - indirectly from a remote scheduler_tick() for NOHZ idle balancing 12789 + * - indirectly from a remote sched_tick() for NOHZ idle balancing 12868 12790 * through the SMP cross-call nohz_csd_func() 12869 12791 */ 12870 12792 static __latent_entropy void sched_balance_softirq(void) ··· 12955 12877 * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check 12956 12878 * if we need to give up the CPU. 12957 12879 */ 12958 - if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 && 12880 + if (rq->core->core_forceidle_count && rq->cfs.nr_queued == 1 && 12959 12881 __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE)) 12960 12882 resched_curr(rq); 12961 12883 } ··· 13099 13021 if (!task_on_rq_queued(p)) 13100 13022 return; 13101 13023 13102 - if (rq->cfs.nr_running == 1) 13024 + if (rq->cfs.nr_queued == 1) 13103 13025 return; 13104 13026 13105 13027 /* ··· 13509 13431 for_each_possible_cpu(i) { 13510 13432 struct rq *rq = cpu_rq(i); 13511 13433 struct sched_entity *se = tg->se[i]; 13512 - struct cfs_rq *parent_cfs_rq, *grp_cfs_rq = tg->cfs_rq[i]; 13434 + struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i]; 13513 13435 bool was_idle = cfs_rq_is_idle(grp_cfs_rq); 13514 13436 long idle_task_delta; 13515 13437 struct rq_flags rf; ··· 13520 13442 if (WARN_ON_ONCE(was_idle == cfs_rq_is_idle(grp_cfs_rq))) 13521 13443 goto next_cpu; 13522 13444 13523 - if (se->on_rq) { 13524 - parent_cfs_rq = cfs_rq_of(se); 13525 - if (cfs_rq_is_idle(grp_cfs_rq)) 13526 - parent_cfs_rq->idle_nr_running++; 
13527 - else 13528 - parent_cfs_rq->idle_nr_running--; 13529 - } 13530 - 13531 - idle_task_delta = grp_cfs_rq->h_nr_running - 13532 - grp_cfs_rq->idle_h_nr_running; 13445 + idle_task_delta = grp_cfs_rq->h_nr_queued - 13446 + grp_cfs_rq->h_nr_idle; 13533 13447 if (!cfs_rq_is_idle(grp_cfs_rq)) 13534 13448 idle_task_delta *= -1; 13535 13449 ··· 13531 13461 if (!se->on_rq) 13532 13462 break; 13533 13463 13534 - cfs_rq->idle_h_nr_running += idle_task_delta; 13464 + cfs_rq->h_nr_idle += idle_task_delta; 13535 13465 13536 13466 /* Already accounted at parent level and above. */ 13537 13467 if (cfs_rq_is_idle(cfs_rq))
+9
kernel/sched/features.h
···
 32  32  SCHED_FEAT(NEXT_BUDDY, false)
 33  33
 34  34  /*
     35 +  * Allow completely ignoring cfs_rq->next; which can be set from various
     36 +  * places:
     37 +  *   - NEXT_BUDDY (wakeup preemption)
     38 +  *   - yield_to_task()
     39 +  *   - cgroup dequeue / pick
     40 +  */
     41 + SCHED_FEAT(PICK_BUDDY, true)
     42 +
     43 + /*
 35  44   * Consider buddies to be cache hot, decreases the likeliness of a
 36  45   * cache buddy being migrated away, increases cache locality.
 37  46   */
+9 -13
kernel/sched/isolation.c
···
  9   9   */
 10  10
 11  11  enum hk_flags {
 12      -    HK_FLAG_TIMER       = BIT(HK_TYPE_TIMER),
 13      -    HK_FLAG_RCU         = BIT(HK_TYPE_RCU),
 14      -    HK_FLAG_MISC        = BIT(HK_TYPE_MISC),
 15      -    HK_FLAG_SCHED       = BIT(HK_TYPE_SCHED),
 16      -    HK_FLAG_TICK        = BIT(HK_TYPE_TICK),
 17  12      HK_FLAG_DOMAIN      = BIT(HK_TYPE_DOMAIN),
 18      -    HK_FLAG_WQ          = BIT(HK_TYPE_WQ),
 19  13      HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
 20      -    HK_FLAG_KTHREAD     = BIT(HK_TYPE_KTHREAD),
      14 +    HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
 21  15  };
 22  16
 23  17  DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
···
 91  97
 92  98      static_branch_enable(&housekeeping_overridden);
 93  99
 94      -    if (housekeeping.flags & HK_FLAG_TICK)
     100 +    if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
 95 101          sched_tick_offload_init();
 96 102
 97 103      for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
···
115 121      unsigned int first_cpu;
116 122      int err = 0;
117 123
118     -    if ((flags & HK_FLAG_TICK) && !(housekeeping.flags & HK_FLAG_TICK)) {
    124 +    if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE)) {
119 125          if (!IS_ENABLED(CONFIG_NO_HZ_FULL)) {
120 126              pr_warn("Housekeeping: nohz unsupported."
121 127                  " Build with CONFIG_NO_HZ_FULL\n");
···
171 177          housekeeping_setup_type(type, housekeeping_staging);
172 178      }
173 179
174     -    if ((flags & HK_FLAG_TICK) && !(housekeeping.flags & HK_FLAG_TICK))
    180 +    if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE))
175 181          tick_nohz_full_setup(non_housekeeping_mask);
176 182
177 183      housekeeping.flags |= flags;
···
189 195  {
190 196      unsigned long flags;
191 197
192     -    flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
193     -            HK_FLAG_MISC | HK_FLAG_KTHREAD;
    198 +    flags = HK_FLAG_KERNEL_NOISE;
194 199
195 200      return housekeeping_setup(str, flags);
196 201  }
···
203 210      int len;
204 211
205 212      while (isalpha(*str)) {
    213 +        /*
    214 +         * isolcpus=nohz is equivalent to nohz_full.
    215 +         */
206 216          if (!strncmp(str, "nohz,", 5)) {
207 217              str += 5;
208     -            flags |= HK_FLAG_TICK;
    218 +            flags |= HK_FLAG_KERNEL_NOISE;
209 219              continue;
210 220          }
211 221
+2 -2
kernel/sched/pelt.c
···
275 275   *
276 276   * group: [ see update_cfs_group() ]
277 277   *   se_weight()   = tg->weight * grq->load_avg / tg->load_avg
278     - *   se_runnable() = grq->h_nr_running
    278 + *   se_runnable() = grq->h_nr_runnable
279 279   *
280 280   *   runnable_sum = se_runnable() * runnable = grq->runnable_sum
281 281   *   runnable_avg = runnable_sum
···
321 321  {
322 322      if (___update_load_sum(now, &cfs_rq->avg,
323 323                  scale_load_down(cfs_rq->load.weight),
324     -            cfs_rq->h_nr_running - cfs_rq->h_nr_delayed,
    324 +            cfs_rq->h_nr_runnable,
325 325              cfs_rq->curr != NULL)) {
326 326
327 327          ___update_load_avg(&cfs_rq->avg, 1);
+6 -1
kernel/sched/psi.c
···
 998  998      s64 delta;
 999  999      u64 irq;
1000 1000
1001      -    if (static_branch_likely(&psi_disabled))
     1001 +    if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
1002 1002          return;
1003 1003
1004 1004      if (!curr->pid)
···
1239 1239
1240 1240      if (static_branch_likely(&psi_disabled))
1241 1241          return -EOPNOTSUPP;
     1242 +
     1243 + #ifdef CONFIG_IRQ_TIME_ACCOUNTING
     1244 +    if (!irqtime_enabled() && res == PSI_IRQ)
     1245 +        return -EOPNOTSUPP;
     1246 + #endif
1242 1247
1243 1248      /* Update averages before reporting them */
1244 1249      mutex_lock(&group->avgs_lock);
+24 -13
kernel/sched/sched.h
···
 362  362  extern bool __checkparam_dl(const struct sched_attr *attr);
 363  363  extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
 364  364  extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 365      - extern int dl_bw_check_overflow(int cpu);
      365 + extern int dl_bw_deactivate(int cpu);
 366  366  extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec);
 367  367  /*
 368  368   * SCHED_DEADLINE supports servers (nested scheduling) with the following
···
 650  650  /* CFS-related fields in a runqueue */
 651  651  struct cfs_rq {
 652  652      struct load_weight  load;
 653      -    unsigned int        nr_running;
 654      -    unsigned int        h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 655      -    unsigned int        idle_nr_running;   /* SCHED_IDLE */
 656      -    unsigned int        idle_h_nr_running; /* SCHED_IDLE */
 657      -    unsigned int        h_nr_delayed;
      653 +    unsigned int        nr_queued;
      654 +    unsigned int        h_nr_queued;       /* SCHED_{NORMAL,BATCH,IDLE} */
      655 +    unsigned int        h_nr_runnable;     /* SCHED_{NORMAL,BATCH,IDLE} */
      656 +    unsigned int        h_nr_idle;         /* SCHED_IDLE */
 658  657
 659  658      s64         avg_vruntime;
 660  659      u64         avg_load;
···
 903  904
 904  905  static inline void se_update_runnable(struct sched_entity *se)
 905  906  {
 906      -    if (!entity_is_task(se)) {
 907      -        struct cfs_rq *cfs_rq = se->my_q;
 908      -
 909      -        se->runnable_weight = cfs_rq->h_nr_running - cfs_rq->h_nr_delayed;
 910      -    }
      907 +    if (!entity_is_task(se))
      908 +        se->runnable_weight = se->my_q->h_nr_runnable;
 911  909  }
 912  910
 913  911  static inline long se_runnable(struct sched_entity *se)
···
2276 2280
2277 2281  static inline int task_on_rq_queued(struct task_struct *p)
2278 2282  {
2279      -    return p->on_rq == TASK_ON_RQ_QUEUED;
     2283 +    return READ_ONCE(p->on_rq) == TASK_ON_RQ_QUEUED;
2280 2284  }
2281 2285
2282 2286  static inline int task_on_rq_migrating(struct task_struct *p)
···
2570 2574
2571 2575  static inline bool sched_fair_runnable(struct rq *rq)
2572 2576  {
2573      -    return rq->cfs.nr_running > 0;
     2577 +    return rq->cfs.nr_queued > 0;
2574 2578  }
2575 2579
2576 2580  extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
···
3238 3242  };
3239 3243
3240 3244  DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
     3245 + DECLARE_STATIC_KEY_FALSE(sched_clock_irqtime);
     3246 +
     3247 + static inline int irqtime_enabled(void)
     3248 + {
     3249 +    return static_branch_likely(&sched_clock_irqtime);
     3250 + }
3241 3251
3242 3252  /*
3243 3253   * Returns the irqtime minus the softirq time computed by ksoftirqd.
···
3262 3260      } while (__u64_stats_fetch_retry(&irqtime->sync, seq));
3263 3261
3264 3262      return total;
     3263 + }
     3264 +
     3265 + #else
     3266 +
     3267 + static inline int irqtime_enabled(void)
     3268 + {
     3269 +    return 0;
3265 3270  }
3266 3271
3267 3272  #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
···
3517 3508  }
3518 3509
3519 3510  #endif /* !CONFIG_HAVE_SCHED_AVG_IRQ */
     3511 +
     3512 + extern void __setparam_fair(struct task_struct *p, const struct sched_attr *attr);
3520 3513
3521 3514  #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
3522 3515
+7 -4
kernel/sched/stats.c
···
103 103   * Bump this up when changing the output format or the meaning of an existing
104 104   * format, so that tools can adapt (or abort)
105 105   */
106     - #define SCHEDSTAT_VERSION 16
    106 + #define SCHEDSTAT_VERSION 17
107 107
108 108  static int show_schedstat(struct seq_file *seq, void *v)
109 109  {
···
138 138          for_each_domain(cpu, sd) {
139 139              enum cpu_idle_type itype;
140 140
141     -            seq_printf(seq, "domain%d %*pb", dcount++,
    141 +            seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name,
142 142                         cpumask_pr_args(sched_domain_span(sd)));
143 143              for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
144     -                seq_printf(seq, " %u %u %u %u %u %u %u %u",
    144 +                seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
145 145                             sd->lb_count[itype],
146 146                             sd->lb_balanced[itype],
147 147                             sd->lb_failed[itype],
148     -                           sd->lb_imbalance[itype],
    148 +                           sd->lb_imbalance_load[itype],
    149 +                           sd->lb_imbalance_util[itype],
    150 +                           sd->lb_imbalance_task[itype],
    151 +                           sd->lb_imbalance_misfit[itype],
149 152                             sd->lb_gained[itype],
150 153                             sd->lb_hot_gained[itype],
151 154                             sd->lb_nobusyq[itype],
+4
kernel/sched/stats.h
···
138 138      if (flags & ENQUEUE_RESTORE)
139 139          return;
140 140
    141 +    /* psi_sched_switch() will handle the flags */
    142 +    if (task_on_cpu(task_rq(p), p))
    143 +        return;
    144 +
141 145      if (p->se.sched_delayed) {
142 146          /* CPU migration of "sleeping" task */
143 147          SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
+4 -14
kernel/sched/syscalls.c
···
 300  300
 301  301      p->policy = policy;
 302  302
 303      -    if (dl_policy(policy)) {
      303 +    if (dl_policy(policy))
 304  304          __setparam_dl(p, attr);
 305      -    } else if (fair_policy(policy)) {
 306      -        p->static_prio = NICE_TO_PRIO(attr->sched_nice);
 307      -        if (attr->sched_runtime) {
 308      -            p->se.custom_slice = 1;
 309      -            p->se.slice = clamp_t(u64, attr->sched_runtime,
 310      -                          NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
 311      -                          NSEC_PER_MSEC*100); /* HZ=100 / 10 */
 312      -        } else {
 313      -            p->se.custom_slice = 0;
 314      -            p->se.slice = sysctl_sched_base_slice;
 315      -        }
 316      -    }
      305 +    else if (fair_policy(policy))
      306 +        __setparam_fair(p, attr);
 317  307
 318  308      /* rt-policy tasks do not have a timerslack */
 319  309      if (rt_or_dl_task_policy(p)) {
···
1423 1433      struct rq *rq, *p_rq;
1424 1434      int yielded = 0;
1425 1435
1426      -    scoped_guard (irqsave) {
     1436 +    scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
1427 1437          rq = this_rq();
1428 1438
1429 1439  again:
+5 -7
kernel/sched/topology.c
···
1635 1635          .max_newidle_lb_cost    = 0,
1636 1636          .last_decay_max_lb_cost = jiffies,
1637 1637          .child                  = child,
1638      - #ifdef CONFIG_SCHED_DEBUG
1639 1638          .name                   = tl->name,
1640      - #endif
1641 1639      };
1642 1640
1643 1641      sd_span = sched_domain_span(sd);
···
2336 2338      if (!cpumask_subset(sched_domain_span(child),
2337 2339                  sched_domain_span(sd))) {
2338 2340          pr_err("BUG: arch topology borken\n");
2339      - #ifdef CONFIG_SCHED_DEBUG
2340 2341          pr_err("     the %s domain not a subset of the %s domain\n",
2341 2342                 child->name, sd->name);
2342      - #endif
2343 2343          /* Fixup, ensure @sd has at least @child CPUs. */
2344 2344          cpumask_or(sched_domain_span(sd),
2345 2345                 sched_domain_span(sd),
···
2717 2721
2718 2722      /*
2719 2723       * This domain won't be destroyed and as such
2720      -     * its dl_bw->total_bw needs to be cleared. It
2721      -     * will be recomputed in function
2722      -     * update_tasks_root_domain().
     2724 +     * its dl_bw->total_bw needs to be cleared.
     2725 +     * Tasks contribution will be then recomputed
     2726 +     * in function dl_update_tasks_root_domain(),
     2727 +     * dl_servers contribution in function
     2728 +     * dl_restore_server_root_domain().
2723 2729       */
2724 2730      rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
2725 2731      dl_clear_root_domain(rd);