Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/debug: Make schedstats a runtime tunable that is disabled by default

schedstats is very useful during debugging and performance tuning but it
incurs overhead to calculate the stats. As such, even though it can be
disabled at build time, it is often enabled as the information is useful.

This patch adds a kernel command-line and sysctl tunable to enable or
disable schedstats on demand (when it's built in). It is disabled
by default as someone who knows they need it can also learn to enable
it when necessary.

The benefit depends on how scheduler-intensive the workload is. If it is
scheduler-intensive, the patch reduces the number of cycles spent
calculating the stats, with a small additional benefit from reducing the
cache footprint of the scheduler.

These measurements were taken on a 2-socket, 48-core machine with
Xeon(R) E5-2670 v3 CPUs, although the patch was also tested on a
single-socket, 8-core machine with an Intel i7-3770 processor.

netperf-tcp
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Hmean 64 560.45 ( 0.00%) 575.98 ( 2.77%)
Hmean 128 766.66 ( 0.00%) 795.79 ( 3.80%)
Hmean 256 950.51 ( 0.00%) 981.50 ( 3.26%)
Hmean 1024 1433.25 ( 0.00%) 1466.51 ( 2.32%)
Hmean 2048 2810.54 ( 0.00%) 2879.75 ( 2.46%)
Hmean 3312 4618.18 ( 0.00%) 4682.09 ( 1.38%)
Hmean 4096 5306.42 ( 0.00%) 5346.39 ( 0.75%)
Hmean 8192 10581.44 ( 0.00%) 10698.15 ( 1.10%)
Hmean 16384 18857.70 ( 0.00%) 18937.61 ( 0.42%)

Small gains here; UDP_STREAM showed nothing interesting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.

tbench4
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Hmean mb/sec-1 500.85 ( 0.00%) 522.43 ( 4.31%)
Hmean mb/sec-2 984.66 ( 0.00%) 1018.19 ( 3.41%)
Hmean mb/sec-4 1827.91 ( 0.00%) 1847.78 ( 1.09%)
Hmean mb/sec-8 3561.36 ( 0.00%) 3611.28 ( 1.40%)
Hmean mb/sec-16 5824.52 ( 0.00%) 5929.03 ( 1.79%)
Hmean mb/sec-32 10943.10 ( 0.00%) 10802.83 ( -1.28%)
Hmean mb/sec-64 15950.81 ( 0.00%) 16211.31 ( 1.63%)
Hmean mb/sec-128 15302.17 ( 0.00%) 15445.11 ( 0.93%)
Hmean mb/sec-256 14866.18 ( 0.00%) 15088.73 ( 1.50%)
Hmean mb/sec-512 15223.31 ( 0.00%) 15373.69 ( 0.99%)
Hmean mb/sec-1024 14574.25 ( 0.00%) 14598.02 ( 0.16%)
Hmean mb/sec-2048 13569.02 ( 0.00%) 13733.86 ( 1.21%)
Hmean mb/sec-3072 12865.98 ( 0.00%) 13209.23 ( 2.67%)

Small gains of 2-4% at low thread counts and otherwise flat. The
gains on the 8-core machine were slightly different:

tbench4 on 8-core i7-3770 single socket machine
Hmean mb/sec-1 442.59 ( 0.00%) 448.73 ( 1.39%)
Hmean mb/sec-2 796.68 ( 0.00%) 794.39 ( -0.29%)
Hmean mb/sec-4 1322.52 ( 0.00%) 1343.66 ( 1.60%)
Hmean mb/sec-8 2611.65 ( 0.00%) 2694.86 ( 3.19%)
Hmean mb/sec-16 2537.07 ( 0.00%) 2609.34 ( 2.85%)
Hmean mb/sec-32 2506.02 ( 0.00%) 2578.18 ( 2.88%)
Hmean mb/sec-64 2511.06 ( 0.00%) 2569.16 ( 2.31%)
Hmean mb/sec-128 2313.38 ( 0.00%) 2395.50 ( 3.55%)
Hmean mb/sec-256 2110.04 ( 0.00%) 2177.45 ( 3.19%)
Hmean mb/sec-512 2072.51 ( 0.00%) 2053.97 ( -0.89%)

In contrast, this shows a relatively steady 2-3% gain at higher thread
counts. Given the nature of the patch and the type of workload, it is
not a surprise that the result depends on the CPU used.

hackbench-pipes
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Amean 1 0.0637 ( 0.00%) 0.0660 ( -3.59%)
Amean 4 0.1229 ( 0.00%) 0.1181 ( 3.84%)
Amean 7 0.1921 ( 0.00%) 0.1911 ( 0.52%)
Amean 12 0.3117 ( 0.00%) 0.2923 ( 6.23%)
Amean 21 0.4050 ( 0.00%) 0.3899 ( 3.74%)
Amean 30 0.4586 ( 0.00%) 0.4433 ( 3.33%)
Amean 48 0.5910 ( 0.00%) 0.5694 ( 3.65%)
Amean 79 0.8663 ( 0.00%) 0.8626 ( 0.43%)
Amean 110 1.1543 ( 0.00%) 1.1517 ( 0.22%)
Amean 141 1.4457 ( 0.00%) 1.4290 ( 1.16%)
Amean 172 1.7090 ( 0.00%) 1.6924 ( 0.97%)
Amean 192 1.9126 ( 0.00%) 1.9089 ( 0.19%)

Some small gains and losses; while the variance data is not included,
the results are close to the noise. The UMA machine did not show anything
particularly different.

pipetest
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v2r2
Min Time 4.13 ( 0.00%) 3.99 ( 3.39%)
1st-qrtle Time 4.38 ( 0.00%) 4.27 ( 2.51%)
2nd-qrtle Time 4.46 ( 0.00%) 4.39 ( 1.57%)
3rd-qrtle Time 4.56 ( 0.00%) 4.51 ( 1.10%)
Max-90% Time 4.67 ( 0.00%) 4.60 ( 1.50%)
Max-93% Time 4.71 ( 0.00%) 4.65 ( 1.27%)
Max-95% Time 4.74 ( 0.00%) 4.71 ( 0.63%)
Max-99% Time 4.88 ( 0.00%) 4.79 ( 1.84%)
Max Time 4.93 ( 0.00%) 4.83 ( 2.03%)
Mean Time 4.48 ( 0.00%) 4.39 ( 1.91%)
Best99%Mean Time 4.47 ( 0.00%) 4.39 ( 1.91%)
Best95%Mean Time 4.46 ( 0.00%) 4.38 ( 1.93%)
Best90%Mean Time 4.45 ( 0.00%) 4.36 ( 1.98%)
Best50%Mean Time 4.36 ( 0.00%) 4.25 ( 2.49%)
Best10%Mean Time 4.23 ( 0.00%) 4.10 ( 3.13%)
Best5%Mean Time 4.19 ( 0.00%) 4.06 ( 3.20%)
Best1%Mean Time 4.13 ( 0.00%) 4.00 ( 3.39%)

Small improvement and similar gains were seen on the UMA machine.

The gain is small but it stands to reason that doing less work in the
scheduler is a good thing. The downside is that the lack of schedstats and
tracepoints may surprise experts doing performance analysis until they
discover the schedstats= kernel parameter or the schedstats sysctl.
Schedstats will be automatically enabled for latencytop and sleep
profiling to alleviate the problem. For tracepoints, a simple warning is
printed instead, as it is not safe to enable schedstats from the context
in which it becomes known that the tracepoint may be wanted but is
unavailable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Authored by Mel Gorman, committed by Ingo Molnar
cb251765 a6e4491c

+256 -90
+5
Documentation/kernel-parameters.txt
···
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
+	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
+			Allowed values are enable and disable. This feature
+			incurs a small amount of overhead in the scheduler
+			but is useful for debugging and performance tuning.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
+8
Documentation/sysctl/kernel.txt
···
 ==============================================================
 
+sched_schedstats:
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+==============================================================
+
 sg-big-buff:
 
 This file shows the size of the generic SCSI (sg) buffer.
+3
include/linux/latencytop.h
···
 void clear_all_latency_tracing(struct task_struct *p);
 
+extern int sysctl_latencytop(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos);
+
 #else
 
 static inline void
+4
include/linux/sched.h
···
 #endif
 }
 
+#ifdef CONFIG_SCHEDSTATS
+void force_schedstat_enabled(void);
+#endif
+
 enum cpu_idle_type {
 	CPU_IDLE,
 	CPU_NOT_IDLE,
+4
include/linux/sched/sysctl.h
···
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+extern int sysctl_schedstats(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
+13 -1
kernel/latencytop.c
···
  * of times)
  */
 
-#include <linux/latencytop.h>
 #include <linux/kallsyms.h>
 #include <linux/seq_file.h>
 #include <linux/notifier.h>
 #include <linux/spinlock.h>
 #include <linux/proc_fs.h>
+#include <linux/latencytop.h>
 #include <linux/export.h>
 #include <linux/sched.h>
 #include <linux/list.h>
···
 {
 	proc_create("latency_stats", 0644, NULL, &lstats_fops);
 	return 0;
+}
+
+int sysctl_latencytop(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	err = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (latencytop_enabled)
+		force_schedstat_enabled();
+
+	return err;
 }
 device_initcall(init_lstats_procfs);
+1
kernel/profile.c
···
 
 	if (!strncmp(str, sleepstr, strlen(sleepstr))) {
 #ifdef CONFIG_SCHEDSTATS
+		force_schedstat_enabled();
 		prof_on = SLEEP_PROFILING;
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;
+68 -2
kernel/sched/core.c
···
 
 	ttwu_queue(p, cpu);
 stat:
-	ttwu_stat(p, cpu, wake_flags);
+	if (schedstat_enabled())
+		ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
···
 	ttwu_activate(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_do_wakeup(rq, p, 0);
-	ttwu_stat(p, smp_processor_id(), 0);
+	if (schedstat_enabled())
+		ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);
 }
···
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
+	/* Even if schedstat is disabled, there should not be garbage */
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
···
 		return err;
 	if (write)
 		set_numabalancing_state(state);
+	return err;
+}
+#endif
+#endif
+
+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+
+#ifdef CONFIG_SCHEDSTATS
+static void set_schedstats(bool enabled)
+{
+	if (enabled)
+		static_branch_enable(&sched_schedstats);
+	else
+		static_branch_disable(&sched_schedstats);
+}
+
+void force_schedstat_enabled(void)
+{
+	if (!schedstat_enabled()) {
+		pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
+		static_branch_enable(&sched_schedstats);
+	}
+}
+
+static int __init setup_schedstats(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+
+	if (!strcmp(str, "enable")) {
+		set_schedstats(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_schedstats(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		pr_warn("Unable to parse schedstats=\n");
+
+	return ret;
+}
+__setup("schedstats=", setup_schedstats);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_schedstats(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_schedstats);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write)
+		set_schedstats(state);
 	return err;
 }
 #endif
+54 -48
kernel/sched/debug.c
···
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
 #ifdef CONFIG_SCHEDSTATS
-	PN(se->statistics.wait_start);
-	PN(se->statistics.sleep_start);
-	PN(se->statistics.block_start);
-	PN(se->statistics.sleep_max);
-	PN(se->statistics.block_max);
-	PN(se->statistics.exec_max);
-	PN(se->statistics.slice_max);
-	PN(se->statistics.wait_max);
-	PN(se->statistics.wait_sum);
-	P(se->statistics.wait_count);
+	if (schedstat_enabled()) {
+		PN(se->statistics.wait_start);
+		PN(se->statistics.sleep_start);
+		PN(se->statistics.block_start);
+		PN(se->statistics.sleep_max);
+		PN(se->statistics.block_max);
+		PN(se->statistics.exec_max);
+		PN(se->statistics.slice_max);
+		PN(se->statistics.wait_max);
+		PN(se->statistics.wait_sum);
+		P(se->statistics.wait_count);
+	}
 #endif
 	P(se->load.weight);
 #ifdef CONFIG_SMP
···
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
-		SPLIT_NS(p->se.statistics.wait_sum),
-		SPLIT_NS(p->se.sum_exec_runtime),
-		SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	if (schedstat_enabled()) {
+		SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+			SPLIT_NS(p->se.statistics.wait_sum),
+			SPLIT_NS(p->se.sum_exec_runtime),
+			SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	}
 #else
 	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
 		0LL, 0L,
···
 #define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
 #define P64(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, rq->n);
 
-	P(yld_count);
-
-	P(sched_count);
-	P(sched_goidle);
 #ifdef CONFIG_SMP
 	P64(avg_idle);
 	P64(max_idle_balance_cost);
 #endif
 
-	P(ttwu_count);
-	P(ttwu_local);
+	if (schedstat_enabled()) {
+		P(yld_count);
+		P(sched_count);
+		P(sched_goidle);
+		P(ttwu_count);
+		P(ttwu_local);
+	}
 
 #undef P
 #undef P64
···
 	nr_switches = p->nvcsw + p->nivcsw;
 
 #ifdef CONFIG_SCHEDSTATS
-	PN(se.statistics.sum_sleep_runtime);
-	PN(se.statistics.wait_start);
-	PN(se.statistics.sleep_start);
-	PN(se.statistics.block_start);
-	PN(se.statistics.sleep_max);
-	PN(se.statistics.block_max);
-	PN(se.statistics.exec_max);
-	PN(se.statistics.slice_max);
-	PN(se.statistics.wait_max);
-	PN(se.statistics.wait_sum);
-	P(se.statistics.wait_count);
-	PN(se.statistics.iowait_sum);
-	P(se.statistics.iowait_count);
 	P(se.nr_migrations);
-	P(se.statistics.nr_migrations_cold);
-	P(se.statistics.nr_failed_migrations_affine);
-	P(se.statistics.nr_failed_migrations_running);
-	P(se.statistics.nr_failed_migrations_hot);
-	P(se.statistics.nr_forced_migrations);
-	P(se.statistics.nr_wakeups);
-	P(se.statistics.nr_wakeups_sync);
-	P(se.statistics.nr_wakeups_migrate);
-	P(se.statistics.nr_wakeups_local);
-	P(se.statistics.nr_wakeups_remote);
-	P(se.statistics.nr_wakeups_affine);
-	P(se.statistics.nr_wakeups_affine_attempts);
-	P(se.statistics.nr_wakeups_passive);
-	P(se.statistics.nr_wakeups_idle);
 
-	{
+	if (schedstat_enabled()) {
 		u64 avg_atom, avg_per_cpu;
+
+		PN(se.statistics.sum_sleep_runtime);
+		PN(se.statistics.wait_start);
+		PN(se.statistics.sleep_start);
+		PN(se.statistics.block_start);
+		PN(se.statistics.sleep_max);
+		PN(se.statistics.block_max);
+		PN(se.statistics.exec_max);
+		PN(se.statistics.slice_max);
+		PN(se.statistics.wait_max);
+		PN(se.statistics.wait_sum);
+		P(se.statistics.wait_count);
+		PN(se.statistics.iowait_sum);
+		P(se.statistics.iowait_count);
+		P(se.statistics.nr_migrations_cold);
+		P(se.statistics.nr_failed_migrations_affine);
+		P(se.statistics.nr_failed_migrations_running);
+		P(se.statistics.nr_failed_migrations_hot);
+		P(se.statistics.nr_forced_migrations);
+		P(se.statistics.nr_wakeups);
+		P(se.statistics.nr_wakeups_sync);
+		P(se.statistics.nr_wakeups_migrate);
+		P(se.statistics.nr_wakeups_local);
+		P(se.statistics.nr_wakeups_remote);
+		P(se.statistics.nr_wakeups_affine);
+		P(se.statistics.nr_wakeups_affine_attempts);
+		P(se.statistics.nr_wakeups_passive);
+		P(se.statistics.nr_wakeups_idle);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
+78 -35
kernel/sched/fair.c
···
  * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
 
-#include <linux/latencytop.h>
 #include <linux/sched.h>
+#include <linux/latencytop.h>
 #include <linux/cpumask.h>
 #include <linux/cpuidle.h>
 #include <linux/slab.h>
···
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct task_struct *p;
-	u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+	u64 delta;
+
+	delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
···
 	se->statistics.wait_sum += delta;
 	se->statistics.wait_start = 0;
 }
-#else
-static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-
-static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-#endif
 
 /*
  * Task is being enqueued - update stats:
  */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	/*
 	 * Are we enqueueing a waiting task? (for current tasks
···
 }
 
 static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	/*
 	 * Mark the end of the wait period if dequeueing a
···
 	 */
 	if (se != cfs_rq->curr)
 		update_stats_wait_end(cfs_rq, se);
+
+	if (flags & DEQUEUE_SLEEP) {
+		if (entity_is_task(se)) {
+			struct task_struct *tsk = task_of(se);
+
+			if (tsk->state & TASK_INTERRUPTIBLE)
+				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
+			if (tsk->state & TASK_UNINTERRUPTIBLE)
+				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
+		}
+	}
+
 }
+#else
+static inline void
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+}
+#endif
 
 /*
  * We are picking a new current task - update its stats:
···
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 
+static inline void check_schedstat_required(void)
+{
+#ifdef CONFIG_SCHEDSTATS
+	if (schedstat_enabled())
+		return;
+
+	/* Force schedstat enabled if a dependent tracepoint is active */
+	if (trace_sched_stat_wait_enabled()    ||
+			trace_sched_stat_sleep_enabled()   ||
+			trace_sched_stat_iowait_enabled()  ||
+			trace_sched_stat_blocked_enabled() ||
+			trace_sched_stat_runtime_enabled())  {
+		pr_warn_once("Scheduler tracepoints stat_sleep, stat_iowait, "
+			     "stat_blocked and stat_runtime require the "
+			     "kernel parameter schedstats=enabled or "
+			     "kernel.sched_schedstats=1\n");
+	}
+#endif
+}
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
···
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
-		enqueue_sleeper(cfs_rq, se);
+		if (schedstat_enabled())
+			enqueue_sleeper(cfs_rq, se);
 	}
 
-	update_stats_enqueue(cfs_rq, se);
-	check_spread(cfs_rq, se);
+	check_schedstat_required();
+	if (schedstat_enabled()) {
+		update_stats_enqueue(cfs_rq, se);
+		check_spread(cfs_rq, se);
+	}
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
···
 	update_curr(cfs_rq);
 	dequeue_entity_load_avg(cfs_rq, se);
 
-	update_stats_dequeue(cfs_rq, se);
-	if (flags & DEQUEUE_SLEEP) {
-#ifdef CONFIG_SCHEDSTATS
-		if (entity_is_task(se)) {
-			struct task_struct *tsk = task_of(se);
-
-			if (tsk->state & TASK_INTERRUPTIBLE)
-				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
-			if (tsk->state & TASK_UNINTERRUPTIBLE)
-				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
-		}
-#endif
-	}
+	if (schedstat_enabled())
+		update_stats_dequeue(cfs_rq, se, flags);
 
 	clear_buddies(cfs_rq, se);
···
 		 * a CPU. So account for the time it spent waiting on the
 		 * runqueue.
 		 */
-		update_stats_wait_end(cfs_rq, se);
+		if (schedstat_enabled())
+			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
 		update_load_avg(se, 1);
 	}
···
 	 * least twice that of our own weight (i.e. dont track it
 	 * when there are only lesser-weight tasks around):
 	 */
-	if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
+	if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
 		se->statistics.slice_max = max(se->statistics.slice_max,
 			se->sum_exec_runtime - se->prev_sum_exec_runtime);
 	}
···
 	/* throttle cfs_rqs exceeding runtime */
 	check_cfs_rq_runtime(cfs_rq);
 
-	check_spread(cfs_rq, prev);
+	if (schedstat_enabled()) {
+		check_spread(cfs_rq, prev);
+		if (prev->on_rq)
+			update_stats_wait_start(cfs_rq, prev);
+	}
+
 	if (prev->on_rq) {
-		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
+1
kernel/sched/sched.h
···
 #endif	/* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
 extern struct static_key_false sched_numa_balancing;
+extern struct static_key_false sched_schedstats;
 
 static inline u64 global_rt_period(void)
 {
+5 -3
kernel/sched/stats.h
···
 	if (rq)
 		rq->rq_sched_info.run_delay += delta;
 }
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
-# define schedstat_set(var, val)	do { var = (val); } while (0)
+# define schedstat_enabled()		static_branch_unlikely(&sched_schedstats)
+# define schedstat_inc(rq, field)	do { if (schedstat_enabled()) { (rq)->field++; } } while (0)
+# define schedstat_add(rq, field, amt)	do { if (schedstat_enabled()) { (rq)->field += (amt); } } while (0)
+# define schedstat_set(var, val)	do { if (schedstat_enabled()) { var = (val); } } while (0)
 #else /* !CONFIG_SCHEDSTATS */
 static inline void
 rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
···
 static inline void
 rq_sched_info_depart(struct rq *rq, unsigned long long delta)
 {}
+# define schedstat_enabled()		0
 # define schedstat_inc(rq, field)	do { } while (0)
 # define schedstat_add(rq, field, amt)	do { } while (0)
 # define schedstat_set(var, val)	do { } while (0)
+12 -1
kernel/sysctl.c
···
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHEDSTATS
+	{
+		.procname	= "sched_schedstats",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_schedstats,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif /* CONFIG_SCHEDSTATS */
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
···
 		.data		= &latencytop_enabled,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= sysctl_latencytop,
 	},
 #endif
 #ifdef CONFIG_BLK_DEV_INITRD