Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull 'full dynticks' support from Ingo Molnar:
"This tree from Frederic Weisbecker adds a new, (exciting! :-) core
kernel feature to the timer and scheduler subsystems: 'full dynticks',
or CONFIG_NO_HZ_FULL=y.

This feature extends the nohz variable-size timer tick feature from
idle to busy CPUs (running at most one task) as well, potentially
reducing the number of timer interrupts significantly.

This feature got motivated by real-time folks and the -rt tree, but
the general utility and motivation of full-dynticks runs wider than
that:

- HPC workloads get faster: CPUs running a single task should be able
to utilize a maximum amount of CPU power. A periodic timer tick at
HZ=1000 can cause a constant overhead of up to 1.0%. This feature
removes that overhead - and speeds up the system by 0.5%-1.0% on
typical distro configs even on modern systems.

- Real-time workload latency reduction: CPUs running critical tasks
should experience as little jitter as possible. The last remaining
source of kernel-related jitter was the periodic timer tick.

- A single task executing on a CPU is a pretty common situation,
especially with an increasing number of cores/CPUs, so this feature
helps desktop and mobile workloads as well.

The cost of the feature is mainly related to increased timer
reprogramming overhead when a CPU switches its tick period, and thus
slightly longer to-idle and from-idle latency.

Configuration-wise a third mode of operation is added to the existing
two NOHZ kconfig modes:

- CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
as a config option. This is the traditional Linux periodic tick
design: there's a HZ tick going on all the time, regardless of
whether a CPU is idle or not.

- CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
periodic tick when a CPU enters idle mode.

- CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
tick when a CPU is idle, also slows the tick down to 1 Hz (one
timer interrupt per second) when only a single task is running on a
CPU.

The .config behavior is compatible: existing !CONFIG_NO_HZ and
CONFIG_NO_HZ=y settings get translated to the new values, without the
user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
default.

This feature is based on a lot of infrastructure work that has been
steadily going upstream in the last 2-3 cycles: related RCU support
and non-periodic cputime support in particular is upstream already.

This tree adds the final pieces and activates the feature. The pull
request is marked RFC because:

- it's marked 64-bit only at the moment - the 32-bit support patch is
small but did not get ready in time.

- it has a number of fresh commits that came in after the merge
window. The overwhelming majority of commits are from before the
merge window, but still some aspects of the tree are fresh and so I
marked it RFC.

- it's a pretty wide-reaching feature with lots of effects - and
while the components have been in testing for some time, the full
combination is still not very widely used. That it's default-off
should reduce its regression abilities and obviously there are no
known regressions with CONFIG_NO_HZ_FULL=y enabled either.

- the feature is not completely idempotent: there is no 100%
equivalent replacement for a periodic scheduler/timer tick. In
particular there's ongoing work to map out and reduce its effects
on scheduler load-balancing and statistics. This should not impact
correctness though, there are no known regressions related to this
feature at this point.

- it's a pretty ambitious feature that with time will likely be
enabled by most Linux distros, and we'd like you to make input on
its design/implementation, if you dislike some aspect we missed.
Without flaming us to crisp! :-)

Future plans:

- there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
the periodic tick altogether when there's a single busy task on a
CPU. We'd first like 1 Hz to be exposed more widely before we go
for the 0 Hz target though.

- once we reach 0 Hz we can remove the periodic tick assumption from
nr_running>=2 as well, by essentially interrupting busy tasks only
as frequently as the sched_latency constraints require us to do -
once every 4-40 msecs, depending on nr_running.

I am personally leaning towards biting the bullet and doing this in
v3.10, like the -rt tree this effort has been going on for too long -
but the final word is up to you as usual.

More technical details can be found in Documentation/timers/NO_HZ.txt"
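
The "up to 1.0%" overhead quoted above for HZ=1000 can be checked with simple arithmetic. The sketch below is illustrative only: the per-tick cost of 10 microseconds is a hypothetical figure chosen to reproduce the 1.0% number, not a measurement from this pull request, and tick_overhead_percent() is a made-up helper name.

```c
/*
 * Percentage of one CPU-second spent servicing the periodic tick.
 *
 * hz:           ticks per second (CONFIG_HZ, or 1 in the residual-tick case)
 * tick_cost_ns: hypothetical cost of one tick in nanoseconds (an assumption,
 *               not a value taken from the commit)
 */
static double tick_overhead_percent(unsigned int hz, unsigned int tick_cost_ns)
{
	const double ns_per_sec = 1e9;

	return 100.0 * (double)hz * (double)tick_cost_ns / ns_per_sec;
}
```

With these assumed numbers, tick_overhead_percent(1000, 10000) evaluates to 1.0, while the 1 Hz residual tick costs about 0.001% under the same per-tick cost.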

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
sched: Keep at least 1 tick per second for active dynticks tasks
rcu: Fix full dynticks' dependency on wide RCU nocb mode
nohz: Protect smp_processor_id() in tick_nohz_task_switch()
nohz_full: Add documentation.
cputime_nsecs: use math64.h for nsec resolution conversion helpers
nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
nohz: Reduce overhead under high-freq idling patterns
nohz: Remove full dynticks' superfluous dependency on RCU tree
nohz: Fix unavailable tick_stop tracepoint in dynticks idle
nohz: Add basic tracing
nohz: Select wide RCU nocb for full dynticks
nohz: Disable the tick when irq resume in full dynticks CPU
nohz: Re-evaluate the tick for the new task after a context switch
nohz: Prepare to stop the tick on irq exit
nohz: Implement full dynticks kick
nohz: Re-evaluate the tick from the scheduler IPI
sched: New helper to prevent from stopping the tick in full dynticks
sched: Kick full dynticks CPU that have more than one task enqueued.
perf: New helper to prevent full dynticks CPUs from stopping tick
perf: Kick full dynticks CPU if events rotation is needed
...

+990 -119
+1 -1
Documentation/RCU/stallwarn.txt
@@ -191,7 +191,7 @@
 o	A hardware or software issue shuts off the scheduler-clock
 	interrupt on a CPU that is not in dyntick-idle mode. This
 	problem really has happened, and seems to be most likely to
-	result in RCU CPU stall warnings for CONFIG_NO_HZ=n kernels.
+	result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
 
 o	A bug in the RCU implementation.
 
+2 -2
Documentation/cpu-freq/governors.txt
@@ -131,8 +131,8 @@
 The sampling rate is limited by the HW transition latency:
 transition_latency * 100
 Or by kernel restrictions:
-If CONFIG_NO_HZ is set, the limit is 10ms fixed.
-If CONFIG_NO_HZ is not set or nohz=off boot parameter is used, the
+If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed.
+If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is used, the
 limits depend on the CONFIG_HZ option:
 HZ=1000: min=20000us (20ms)
 HZ=250: min=80000us (80ms)
+8
Documentation/kernel-parameters.txt
@@ -1964,6 +1964,14 @@
 			Valid arguments: on, off
 			Default: on
 
+	nohz_full=	[KNL,BOOT]
+			In kernels built with CONFIG_NO_HZ_FULL=y, set
+			the specified list of CPUs whose tick will be stopped
+			whenever possible. The boot CPU will be forced outside
+			the range to maintain the timekeeping.
+			The CPUs in this range must also be included in the
+			rcu_nocbs= set.
+
 	noiotrap	[SH] Disables trapped I/O port accesses.
 
 	noirqdebug	[X86-32] Disables the code which attempts to detect and
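
The rules in the new nohz_full= entry (a CPU list, the boot CPU forced out of it, and the requirement that the remaining CPUs also appear in rcu_nocbs=) can be illustrated with a small userspace sketch. This is not the kernel's parser: parse_cpu_list(), drop_boot_cpu() and covered_by_rcu_nocbs() are hypothetical helpers, and the sketch assumes at most 64 CPUs so that a single 64-bit mask suffices.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a list such as "1,6-8" into a bitmask (bit n set = CPU n listed).
 * Sketch limit: CPUs 0..63. Returns 0 on success, -1 on malformed input.
 */
static int parse_cpu_list(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	while (*s) {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi > 63)
			return -1;
		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			m |= UINT64_C(1) << cpu;
		if (*end == ',') {
			end++;
			if (*end == '\0')	/* reject a trailing comma */
				return -1;
		} else if (*end != '\0') {
			return -1;
		}
		s = end;
	}
	*mask = m;
	return 0;
}

/* The boot CPU keeps its tick for timekeeping, so remove it from the mask. */
static uint64_t drop_boot_cpu(uint64_t nohz_full, unsigned int boot_cpu)
{
	return nohz_full & ~(UINT64_C(1) << boot_cpu);
}

/* True if every nohz_full CPU is also present in the rcu_nocbs mask. */
static int covered_by_rcu_nocbs(uint64_t nohz_full, uint64_t rcu_nocbs)
{
	return (nohz_full & ~rcu_nocbs) == 0;
}
```

For example, parsing "1,6-8" yields a mask with bits 1, 6, 7 and 8 set, and a boot line that passes the same list to both nohz_full= and rcu_nocbs= satisfies the coverage check.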
+273
Documentation/timers/NO_HZ.txt
@@ -0,0 +1,273 @@
+		NO_HZ: Reducing Scheduling-Clock Ticks
+
+
+This document describes Kconfig options and boot parameters that can
+reduce the number of scheduling-clock interrupts, thereby improving energy
+efficiency and reducing OS jitter. Reducing OS jitter is important for
+some types of computationally intensive high-performance computing (HPC)
+applications and for real-time applications.
+
+There are two main contexts in which the number of scheduling-clock
+interrupts can be reduced compared to the old-school approach of sending
+a scheduling-clock interrupt to all CPUs every jiffy whether they need
+it or not (CONFIG_HZ_PERIODIC=y or CONFIG_NO_HZ=n for older kernels):
+
+1.	Idle CPUs (CONFIG_NO_HZ_IDLE=y or CONFIG_NO_HZ=y for older kernels).
+
+2.	CPUs having only one runnable task (CONFIG_NO_HZ_FULL=y).
+
+These two cases are described in the following two sections, followed
+by a third section on RCU-specific considerations and a fourth and final
+section listing known issues.
+
+
+IDLE CPUs
+
+If a CPU is idle, there is little point in sending it a scheduling-clock
+interrupt. After all, the primary purpose of a scheduling-clock interrupt
+is to force a busy CPU to shift its attention among multiple duties,
+and an idle CPU has no duties to shift its attention among.
+
+The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending
+scheduling-clock interrupts to idle CPUs, which is critically important
+both to battery-powered devices and to highly virtualized mainframes.
+A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would
+drain its battery very quickly, easily 2-3 times as fast as would the
+same device running a CONFIG_NO_HZ_IDLE=y kernel. A mainframe running
+1,500 OS instances might find that half of its CPU time was consumed by
+unnecessary scheduling-clock interrupts. In these situations, there
+is strong motivation to avoid sending scheduling-clock interrupts to
+idle CPUs. That said, dyntick-idle mode is not free:
+
+1.	It increases the number of instructions executed on the path
+	to and from the idle loop.
+
+2.	On many architectures, dyntick-idle mode also increases the
+	number of expensive clock-reprogramming operations.
+
+Therefore, systems with aggressive real-time response constraints often
+run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels)
+in order to avoid degrading from-idle transition latencies.
+
+An idle CPU that is not receiving scheduling-clock interrupts is said to
+be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
+tickless". The remainder of this document will use "dyntick-idle mode".
+
+There is also a boot parameter "nohz=" that can be used to disable
+dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off".
+By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling
+dyntick-idle mode.
+
+
+CPUs WITH ONLY ONE RUNNABLE TASK
+
+If a CPU has only one runnable task, there is little point in sending it
+a scheduling-clock interrupt because there is no other task to switch to.
+
+The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
+sending scheduling-clock interrupts to CPUs with a single runnable task,
+and such CPUs are said to be "adaptive-ticks CPUs". This is important
+for applications with aggressive real-time response constraints because
+it allows them to improve their worst-case response times by the maximum
+duration of a scheduling-clock interrupt. It is also important for
+computationally intensive short-iteration workloads: If any CPU is
+delayed during a given iteration, all the other CPUs will be forced to
+wait idle while the delayed CPU finishes. Thus, the delay is multiplied
+by one less than the number of CPUs. In these situations, there is
+again strong motivation to avoid sending scheduling-clock interrupts.
+
+By default, no CPU will be an adaptive-ticks CPU. The "nohz_full="
+boot parameter specifies the adaptive-ticks CPUs. For example,
+"nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks
+CPUs. Note that you are prohibited from marking all of the CPUs as
+adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
+online to handle timekeeping tasks in order to ensure that system calls
+like gettimeofday() returns accurate values on adaptive-tick CPUs.
+(This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no
+running user processes to observe slight drifts in clock rate.)
+Therefore, the boot CPU is prohibited from entering adaptive-ticks
+mode. Specifying a "nohz_full=" mask that includes the boot CPU will
+result in a boot-time error message, and the boot CPU will be removed
+from the mask.
+
+Alternatively, the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter specifies
+that all CPUs other than the boot CPU are adaptive-ticks CPUs. This
+Kconfig parameter will be overridden by the "nohz_full=" boot parameter,
+so that if both the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter and
+the "nohz_full=1" boot parameter is specified, the boot parameter will
+prevail so that only CPU 1 will be an adaptive-ticks CPU.
+
+Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
+This is covered in the "RCU IMPLICATIONS" section below.
+
+Normally, a CPU remains in adaptive-ticks mode as long as possible.
+In particular, transitioning to kernel mode does not automatically change
+the mode. Instead, the CPU will exit adaptive-ticks mode only if needed,
+for example, if that CPU enqueues an RCU callback.
+
+Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
+not come for free:
+
+1.	CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run
+	adaptive ticks without also running dyntick idle. This dependency
+	extends down into the implementation, so that all of the costs
+	of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL.
+
+2.	The user/kernel transitions are slightly more expensive due
+	to the need to inform kernel subsystems (such as RCU) about
+	the change in mode.
+
+3.	POSIX CPU timers on adaptive-tick CPUs may miss their deadlines
+	(perhaps indefinitely) because they currently rely on
+	scheduling-tick interrupts. This will likely be fixed in
+	one of two ways: (1) Prevent CPUs with POSIX CPU timers from
+	entering adaptive-tick mode, or (2) Use hrtimers or other
+	adaptive-ticks-immune mechanism to cause the POSIX CPU timer to
+	fire properly.
+
+4.	If there are more perf events pending than the hardware can
+	accommodate, they are normally round-robined so as to collect
+	all of them over time. Adaptive-tick mode may prevent this
+	round-robining from happening. This will likely be fixed by
+	preventing CPUs with large numbers of perf events pending from
+	entering adaptive-tick mode.
+
+5.	Scheduler statistics for adaptive-tick CPUs may be computed
+	slightly differently than those for non-adaptive-tick CPUs.
+	This might in turn perturb load-balancing of real-time tasks.
+
+6.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
+
+Although improvements are expected over time, adaptive ticks is quite
+useful for many types of real-time and compute-intensive applications.
+However, the drawbacks listed above mean that adaptive ticks should not
+(yet) be enabled by default.
+
+
+RCU IMPLICATIONS
+
+There are situations in which idle CPUs cannot be permitted to
+enter either dyntick-idle mode or adaptive-tick mode, the most
+common being when that CPU has RCU callbacks pending.
+
+The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs
+to enter dyntick-idle mode or adaptive-tick mode anyway. In this case,
+a timer will awaken these CPUs every four jiffies in order to ensure
+that the RCU callbacks are processed in a timely fashion.
+
+Another approach is to offload RCU callback processing to "rcuo" kthreads
+using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to
+offload may be selected via several methods:
+
+1.	One of three mutually exclusive Kconfig options specify a
+	build-time default for the CPUs to offload:
+
+	a.	The CONFIG_RCU_NOCB_CPU_NONE=y Kconfig option results in
+		no CPUs being offloaded.
+
+	b.	The CONFIG_RCU_NOCB_CPU_ZERO=y Kconfig option causes
+		CPU 0 to be offloaded.
+
+	c.	The CONFIG_RCU_NOCB_CPU_ALL=y Kconfig option causes all
+		CPUs to be offloaded. Note that the callbacks will be
+		offloaded to "rcuo" kthreads, and that those kthreads
+		will in fact run on some CPU. However, this approach
+		gives fine-grained control on exactly which CPUs the
+		callbacks run on, along with their scheduling priority
+		(including the default of SCHED_OTHER), and it further
+		allows this control to be varied dynamically at runtime.
+
+2.	The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated
+	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1,
+	3, 4, and 5. The specified CPUs will be offloaded in addition to
+	any CPUs specified as offloaded by CONFIG_RCU_NOCB_CPU_ZERO=y or
+	CONFIG_RCU_NOCB_CPU_ALL=y. This means that the "rcu_nocbs=" boot
+	parameter has no effect for kernels built with RCU_NOCB_CPU_ALL=y.
+
+The offloaded CPUs will never queue RCU callbacks, and therefore RCU
+never prevents offloaded CPUs from entering either dyntick-idle mode
+or adaptive-tick mode. That said, note that it is up to userspace to
+pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the
+scheduler will decide where to run them, which might or might not be
+where you want them to run.
+
+
+KNOWN ISSUES
+
+o	Dyntick-idle slows transitions to and from idle slightly.
+	In practice, this has not been a problem except for the most
+	aggressive real-time workloads, which have the option of disabling
+	dyntick-idle mode, an option that most of them take. However,
+	some workloads will no doubt want to use adaptive ticks to
+	eliminate scheduling-clock interrupt latencies. Here are some
+	options for these workloads:
+
+	a.	Use PMQOS from userspace to inform the kernel of your
+		latency requirements (preferred).
+
+	b.	On x86 systems, use the "idle=mwait" boot parameter.
+
+	c.	On x86 systems, use the "intel_idle.max_cstate=" to limit
+`		the maximum C-state depth.
+
+	d.	On x86 systems, use the "idle=poll" boot parameter.
+		However, please note that use of this parameter can cause
+		your CPU to overheat, which may cause thermal throttling
+		to degrade your latencies -- and that this degradation can
+		be even worse than that of dyntick-idle. Furthermore,
+		this parameter effectively disables Turbo Mode on Intel
+		CPUs, which can significantly reduce maximum performance.
+
+o	Adaptive-ticks slows user/kernel transitions slightly.
+	This is not expected to be a problem for computationally intensive
+	workloads, which have few such transitions. Careful benchmarking
+	will be required to determine whether or not other workloads
+	are significantly affected by this effect.
+
+o	Adaptive-ticks does not do anything unless there is only one
+	runnable task for a given CPU, even though there are a number
+	of other situations where the scheduling-clock tick is not
+	needed. To give but one example, consider a CPU that has one
+	runnable high-priority SCHED_FIFO task and an arbitrary number
+	of low-priority SCHED_OTHER tasks. In this case, the CPU is
+	required to run the SCHED_FIFO task until it either blocks or
+	some other higher-priority task awakens on (or is assigned to)
+	this CPU, so there is no point in sending a scheduling-clock
+	interrupt to this CPU. However, the current implementation
+	nevertheless sends scheduling-clock interrupts to CPUs having a
+	single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
+	tasks, even though these interrupts are unnecessary.
+
+	Better handling of these sorts of situations is future work.
+
+o	A reboot is required to reconfigure both adaptive idle and RCU
+	callback offloading. Runtime reconfiguration could be provided
+	if needed, however, due to the complexity of reconfiguring RCU at
+	runtime, there would need to be an earthshakingly good reason.
+	Especially given that you have the straightforward option of
+	simply offloading RCU callbacks from all CPUs and pinning them
+	where you want them whenever you want them pinned.
+
+o	Additional configuration is required to deal with other sources
+	of OS jitter, including interrupts and system-utility tasks
+	and processes. This configuration normally involves binding
+	interrupts and tasks to particular CPUs.
+
+o	Some sources of OS jitter can currently be eliminated only by
+	constraining the workload. For example, the only way to eliminate
+	OS jitter due to global TLB shootdowns is to avoid the unmapping
+	operations (such as kernel module unload operations) that
+	result in these shootdowns. For another example, page faults
+	and TLB misses can be reduced (and in some cases eliminated) by
+	using huge pages and by constraining the amount of memory used
+	by the application. Pre-faulting the working set can also be
+	helpful, especially when combined with the mlock() and mlockall()
+	system calls.
+
+o	Unless all CPUs are idle, at least one CPU must keep the
+	scheduling-clock interrupt going in order to support accurate
+	timekeeping.
+
+o	If there are adaptive-ticks CPUs, there will be at least one
+	CPU keeping the scheduling-clock interrupt going, even if all
+	CPUs are otherwise idle.
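
The KNOWN ISSUES list in the new document suggests pre-faulting the working set and combining that with mlock()/mlockall() to reduce page-fault jitter. A minimal userspace sketch of that idea follows; prefault() and lock_all_memory() are hypothetical helper names, and the 4096-byte fallback page size is an assumption used only if sysconf() fails.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write one byte per page so that every page of the buffer is faulted in
 * now rather than in the latency-critical path. The write stores zero, so
 * call this before the buffer holds useful data.
 * Returns the number of pages touched.
 */
static size_t prefault(volatile unsigned char *buf, size_t len)
{
	long ps = sysconf(_SC_PAGESIZE);
	size_t page = ps > 0 ? (size_t)ps : 4096;	/* fallback is an assumption */
	size_t touched = 0;

	for (size_t off = 0; off < len; off += page) {
		buf[off] = 0;
		touched++;
	}
	return touched;
}

/*
 * Lock all current and future mappings into RAM.
 * Returns 0 on success, -1 on failure (for example when RLIMIT_MEMLOCK
 * is too low for an unprivileged process).
 */
static int lock_all_memory(void)
{
	return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

A latency-sensitive task would typically call lock_all_memory() once at startup, allocate its buffers, run prefault() over them, and only then enter its critical loop on an adaptive-ticks CPU.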
+2 -2
arch/um/include/shared/common-offsets.h
@@ -30,8 +30,8 @@
 #ifdef CONFIG_PRINTK
 DEFINE(UML_CONFIG_PRINTK, CONFIG_PRINTK);
 #endif
-#ifdef CONFIG_NO_HZ
-DEFINE(UML_CONFIG_NO_HZ, CONFIG_NO_HZ);
+#ifdef CONFIG_NO_HZ_COMMON
+DEFINE(UML_CONFIG_NO_HZ_COMMON, CONFIG_NO_HZ_COMMON);
 #endif
 #ifdef CONFIG_UML_X86
 DEFINE(UML_CONFIG_UML_X86, CONFIG_UML_X86);
+1 -1
arch/um/os-Linux/time.c
@@ -79,7 +79,7 @@
 	return timeval_to_ns(&tv);
 }
 
-#ifdef UML_CONFIG_NO_HZ
+#ifdef UML_CONFIG_NO_HZ_COMMON
 static int after_sleep_interval(struct timespec *ts)
 {
 	return 0;
+19 -9
include/asm-generic/cputime_nsecs.h
@@ -16,21 +16,27 @@
 #ifndef _ASM_GENERIC_CPUTIME_NSECS_H
 #define _ASM_GENERIC_CPUTIME_NSECS_H
 
+#include <linux/math64.h>
+
 typedef u64 __nocast cputime_t;
 typedef u64 __nocast cputime64_t;
 
 #define cputime_one_jiffy jiffies_to_cputime(1)
 
+#define cputime_div(__ct, divisor) div_u64((__force u64)__ct, divisor)
+#define cputime_div_rem(__ct, divisor, remainder) \
+	div_u64_rem((__force u64)__ct, divisor, remainder);
+
 /*
  * Convert cputime <-> jiffies (HZ)
  */
 #define cputime_to_jiffies(__ct) \
-	((__force u64)(__ct) / (NSEC_PER_SEC / HZ))
+	cputime_div(__ct, NSEC_PER_SEC / HZ)
 #define cputime_to_scaled(__ct) (__ct)
 #define jiffies_to_cputime(__jif) \
 	(__force cputime_t)((__jif) * (NSEC_PER_SEC / HZ))
 #define cputime64_to_jiffies64(__ct) \
-	((__force u64)(__ct) / (NSEC_PER_SEC / HZ))
+	cputime_div(__ct, NSEC_PER_SEC / HZ)
 #define jiffies64_to_cputime64(__jif) \
 	(__force cputime64_t)((__jif) * (NSEC_PER_SEC / HZ))
 
@@ -45,7 +51,7 @@
  * Convert cputime <-> microseconds
  */
 #define cputime_to_usecs(__ct) \
-	((__force u64)(__ct) / NSEC_PER_USEC)
+	cputime_div(__ct, NSEC_PER_USEC)
 #define usecs_to_cputime(__usecs) \
 	(__force cputime_t)((__usecs) * NSEC_PER_USEC)
 #define usecs_to_cputime64(__usecs) \
@@ -55,7 +61,7 @@
  * Convert cputime <-> seconds
  */
 #define cputime_to_secs(__ct) \
-	((__force u64)(__ct) / NSEC_PER_SEC)
+	cputime_div(__ct, NSEC_PER_SEC)
 #define secs_to_cputime(__secs) \
 	(__force cputime_t)((__secs) * NSEC_PER_SEC)
 
@@ -69,8 +75,10 @@
 }
 static inline void cputime_to_timespec(const cputime_t ct, struct timespec *val)
 {
-	val->tv_sec = (__force u64) ct / NSEC_PER_SEC;
-	val->tv_nsec = (__force u64) ct % NSEC_PER_SEC;
+	u32 rem;
+
+	val->tv_sec = cputime_div_rem(ct, NSEC_PER_SEC, &rem);
+	val->tv_nsec = rem;
 }
 
 /*
@@ -83,15 +91,17 @@
 }
 static inline void cputime_to_timeval(const cputime_t ct, struct timeval *val)
 {
-	val->tv_sec = (__force u64) ct / NSEC_PER_SEC;
-	val->tv_usec = ((__force u64) ct % NSEC_PER_SEC) / NSEC_PER_USEC;
+	u32 rem;
+
+	val->tv_sec = cputime_div_rem(ct, NSEC_PER_SEC, &rem);
+	val->tv_usec = rem / NSEC_PER_USEC;
 }
 
 /*
  * Convert cputime <-> clock (USER_HZ)
  */
 #define cputime_to_clock_t(__ct) \
-	((__force u64)(__ct) / (NSEC_PER_SEC / USER_HZ))
+	cputime_div(__ct, (NSEC_PER_SEC / USER_HZ))
 #define clock_t_to_cputime(__x) \
 	(__force cputime_t)((__x) * (NSEC_PER_SEC / USER_HZ))
 
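
The conversions above replace open-coded 64-bit `/` and `%` with the math64.h helpers; on 32-bit builds a plain 64-bit division can otherwise turn into a call to a compiler support routine that the kernel does not provide. The userspace sketch below mimics the quotient-plus-remainder split used in cputime_to_timespec(). The names div_u64_rem_sketch(), ns_to_timespec_sketch() and struct timespec_sketch are hypothetical stand-ins, not kernel APIs.

```c
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000u

/* Stand-in for struct timespec so the sketch has no libc dependency. */
struct timespec_sketch {
	int64_t tv_sec;
	long tv_nsec;
};

/*
 * Userspace stand-in with the same shape as the kernel's div_u64_rem():
 * 64-bit dividend, 32-bit divisor, quotient returned, remainder stored.
 */
static uint64_t div_u64_rem_sketch(uint64_t dividend, uint32_t divisor,
				   uint32_t *remainder)
{
	*remainder = (uint32_t)(dividend % divisor);
	return dividend / divisor;
}

/* Split a nanosecond count into whole seconds and leftover nanoseconds. */
static void ns_to_timespec_sketch(uint64_t ns, struct timespec_sketch *ts)
{
	uint32_t rem;

	ts->tv_sec = (int64_t)div_u64_rem_sketch(ns, NSEC_PER_SEC_SKETCH, &rem);
	ts->tv_nsec = (long)rem;
}
```

One division yields both fields, which is why the patched helpers declare a local `u32 rem` instead of performing a separate modulo.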
+6
include/linux/perf_event.h
@@ -788,6 +788,12 @@
 static inline void perf_event_task_tick(void) { }
 #endif
 
+#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
+extern bool perf_event_can_stop_tick(void);
+#else
+static inline bool perf_event_can_stop_tick(void) { return true; }
+#endif
+
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
 extern void perf_restore_debug_store(void);
 #else
+2
include/linux/posix-timers.h
@@ -123,6 +123,8 @@
 void posix_cpu_timers_exit(struct task_struct *task);
 void posix_cpu_timers_exit_group(struct task_struct *task);
 
+bool posix_cpu_timers_can_stop_tick(struct task_struct *tsk);
+
 void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
 			   cputime_t *newval, cputime_t *oldval);
 
+7
include/linux/rcupdate.h
@@ -1000,4 +1000,11 @@
 #define kfree_rcu(ptr, rcu_head) \
 	__kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
 
+#ifdef CONFIG_RCU_NOCB_CPU
+extern bool rcu_is_nocb_cpu(int cpu);
+#else
+static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
+#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
+
+
 #endif /* __LINUX_RCUPDATE_H */
+13 -6
include/linux/sched.h
@@ -231,7 +231,7 @@
 
 extern int runqueue_is_locked(int cpu);
 
-#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern void set_cpu_sd_state_idle(void);
 extern int get_nohz_timer_target(void);
@@ -1764,13 +1764,13 @@
 }
 #endif
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 void calc_load_enter_idle(void);
 void calc_load_exit_idle(void);
 #else
 static inline void calc_load_enter_idle(void) { }
 static inline void calc_load_exit_idle(void) { }
-#endif /* CONFIG_NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
 
 #ifndef CONFIG_CPUMASK_OFFSTACK
 static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
@@ -1856,10 +1856,17 @@
 static inline void idle_task_exit(void) {}
 #endif
 
-#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-extern void wake_up_idle_cpu(int cpu);
+#if defined(CONFIG_NO_HZ_COMMON) && defined(CONFIG_SMP)
+extern void wake_up_nohz_cpu(int cpu);
 #else
-static inline void wake_up_idle_cpu(int cpu) { }
+static inline void wake_up_nohz_cpu(int cpu) { }
+#endif
+
+#ifdef CONFIG_NO_HZ_FULL
+extern bool sched_can_stop_tick(void);
+extern u64 scheduler_tick_max_deferment(void);
+#else
+static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
+21 -4
include/linux/tick.h
@@ -82,7 +82,7 @@
 extern void tick_setup_sched_timer(void);
 # endif
 
-# if defined CONFIG_NO_HZ || defined CONFIG_HIGH_RES_TIMERS
+# if defined CONFIG_NO_HZ_COMMON || defined CONFIG_HIGH_RES_TIMERS
 extern void tick_cancel_sched_timer(int cpu);
 # else
 static inline void tick_cancel_sched_timer(int cpu) { }
@@ -123,7 +123,7 @@
 static inline int tick_oneshot_mode_active(void) { return 0; }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
 
-# ifdef CONFIG_NO_HZ
+# ifdef CONFIG_NO_HZ_COMMON
 DECLARE_PER_CPU(struct tick_sched, tick_cpu_sched);
 
 static inline int tick_nohz_tick_stopped(void)
@@ -138,7 +138,7 @@
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 
-# else /* !CONFIG_NO_HZ */
+# else /* !CONFIG_NO_HZ_COMMON */
 static inline int tick_nohz_tick_stopped(void)
 {
 	return 0;
@@ -155,7 +155,24 @@
 }
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
-# endif /* !NO_HZ */
+# endif /* !CONFIG_NO_HZ_COMMON */
+
+#ifdef CONFIG_NO_HZ_FULL
+extern void tick_nohz_init(void);
+extern int tick_nohz_full_cpu(int cpu);
+extern void tick_nohz_full_check(void);
+extern void tick_nohz_full_kick(void);
+extern void tick_nohz_full_kick_all(void);
+extern void tick_nohz_task_switch(struct task_struct *tsk);
+#else
+static inline void tick_nohz_init(void) { }
+static inline int tick_nohz_full_cpu(int cpu) { return 0; }
+static inline void tick_nohz_full_check(void) { }
+static inline void tick_nohz_full_kick(void) { }
+static inline void tick_nohz_full_kick_all(void) { }
+static inline void tick_nohz_task_switch(struct task_struct *tsk) { }
+#endif
+
 
 # ifdef CONFIG_CPU_IDLE_GOV_MENU
 extern void menu_hrtimer_cancel(void);
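
The headers touched so far each export a "can stop tick" predicate: sched_can_stop_tick(), posix_cpu_timers_can_stop_tick() and perf_event_can_stop_tick(). The code that consults them is not part of this section, so the following is only a userspace sketch of how such predicates might be combined, on the assumption that the tick may stop only when every subsystem agrees. All names ending in _sketch and the fields of struct cpu_state_sketch are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the per-CPU state that each subsystem inspects. */
struct cpu_state_sketch {
	unsigned int nr_running;	/* runnable tasks on this CPU */
	bool posix_cpu_timer_armed;	/* current task has a CPU timer pending */
	bool perf_rotation_pending;	/* perf events still need round-robin */
};

/* Scheduler view: more than one runnable task needs the tick for preemption. */
static bool sched_can_stop_tick_sketch(const struct cpu_state_sketch *s)
{
	return s->nr_running <= 1;
}

/* POSIX CPU timers rely on the tick to notice expiry. */
static bool posix_timers_can_stop_tick_sketch(const struct cpu_state_sketch *s)
{
	return !s->posix_cpu_timer_armed;
}

/* perf rotates over-committed events from the tick. */
static bool perf_can_stop_tick_sketch(const struct cpu_state_sketch *s)
{
	return !s->perf_rotation_pending;
}

/* The tick may stop only if no subsystem objects. */
static bool can_stop_full_tick_sketch(const struct cpu_state_sketch *s)
{
	return sched_can_stop_tick_sketch(s) &&
	       posix_timers_can_stop_tick_sketch(s) &&
	       perf_can_stop_tick_sketch(s);
}
```

Note that the !CONFIG_NO_HZ_FULL stubs in the headers differ in their defaults: the sched.h stub returns false while the perf_event.h stub returns true, so the sketch should not be read as describing the stubbed-out configuration.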
+21
include/trace/events/timer.h
@@ -323,6 +323,27 @@
 		  (int) __entry->pid, (unsigned long long)__entry->now)
 );
 
+#ifdef CONFIG_NO_HZ_COMMON
+TRACE_EVENT(tick_stop,
+
+	TP_PROTO(int success, char *error_msg),
+
+	TP_ARGS(success, error_msg),
+
+	TP_STRUCT__entry(
+		__field( int , success )
+		__string( msg, error_msg )
+	),
+
+	TP_fast_assign(
+		__entry->success = success;
+		__assign_str(msg, error_msg);
+	),
+
+	TP_printk("success=%s msg=%s", __entry->success ? "yes" : "no", __get_str(msg))
+);
+#endif
+
 #endif /* _TRACE_TIMER_H */
 
 /* This part must be outside protection */
+6 -6
init/Kconfig
@@ -302,7 +302,7 @@
 # Kind of a stub config for the pure tick based cputime accounting
 config TICK_CPU_ACCOUNTING
 	bool "Simple tick based cputime accounting"
-	depends on !S390
+	depends on !S390 && !NO_HZ_FULL
 	help
 	  This is the basic tick based cputime accounting that maintains
 	  statistics about user, system and idle time spent on per jiffies
@@ -312,7 +312,7 @@
 
 config VIRT_CPU_ACCOUNTING_NATIVE
 	bool "Deterministic task and CPU time accounting"
-	depends on HAVE_VIRT_CPU_ACCOUNTING
+	depends on HAVE_VIRT_CPU_ACCOUNTING && !NO_HZ_FULL
 	select VIRT_CPU_ACCOUNTING
 	help
 	  Select this option to enable more accurate task and CPU time
@@ -342,7 +342,7 @@
 
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
-	depends on HAVE_IRQ_TIME_ACCOUNTING
+	depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
 	help
 	  Select this option to enable fine granularity task irq time
 	  accounting. This is done by reading a timestamp on each
@@ -576,7 +576,7 @@
 
 config RCU_FAST_NO_HZ
 	bool "Accelerate last non-dyntick-idle CPU's grace periods"
-	depends on NO_HZ && SMP
+	depends on NO_HZ_COMMON && SMP
 	default n
 	help
 	  This option permits CPUs to enter dynticks-idle state even if
@@ -687,7 +687,7 @@
 
 config RCU_NOCB_CPU_NONE
 	bool "No build_forced no-CBs CPUs"
-	depends on RCU_NOCB_CPU
+	depends on RCU_NOCB_CPU && !NO_HZ_FULL
 	help
 	  This option does not force any of the CPUs to be no-CBs CPUs.
 	  Only CPUs designated by the rcu_nocbs= boot parameter will be
@@ -695,7 +695,7 @@
 
 config RCU_NOCB_CPU_ZERO
 	bool "CPU 0 is a build_forced no-CBs CPU"
-	depends on RCU_NOCB_CPU
+	depends on RCU_NOCB_CPU && !NO_HZ_FULL
 	help
 	  This option forces CPU 0 to be a no-CBs CPU. Additional CPUs
 	  may be designated as no-CBs CPUs using the rcu_nocbs= boot
+1
init/main.c
···
 	idr_init_cache();
 	perf_event_init();
 	rcu_init();
+	tick_nohz_init();
 	radix_tree_init();
 	/* init some links before init_ISA_irqs() */
 	early_irq_init();
+16 -1
kernel/events/core.c
···
 #include <linux/poll.h>
 #include <linux/slab.h>
 #include <linux/hash.h>
+#include <linux/tick.h>
 #include <linux/sysfs.h>
 #include <linux/dcache.h>
 #include <linux/percpu.h>
···
 
 	WARN_ON(!irqs_disabled());
 
-	if (list_empty(&cpuctx->rotation_list))
+	if (list_empty(&cpuctx->rotation_list)) {
+		int was_empty = list_empty(head);
 		list_add(&cpuctx->rotation_list, head);
+		if (was_empty)
+			tick_nohz_full_kick();
+	}
 }
 
 static void get_ctx(struct perf_event_context *ctx)
···
 	if (remove)
 		list_del_init(&cpuctx->rotation_list);
 }
+
+#ifdef CONFIG_NO_HZ_FULL
+bool perf_event_can_stop_tick(void)
+{
+	if (list_empty(&__get_cpu_var(rotation_list)))
+		return true;
+	else
+		return false;
+}
+#endif
 
 void perf_event_task_tick(void)
 {
+2 -2
kernel/hrtimer.c
···
  */
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
 		return get_nohz_timer_target();
 #endif
···
 }
 EXPORT_SYMBOL_GPL(hrtimer_get_remaining);
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /**
  * hrtimer_get_next_event - get the time until next expiry event
  *
+60 -16
kernel/posix-cpu-timers.c
···
 #include <linux/kernel_stat.h>
 #include <trace/events/timer.h>
 #include <linux/random.h>
+#include <linux/tick.h>
+#include <linux/workqueue.h>
 
 /*
  * Called after updating RLIMIT_CPU to run cpu timer and update
···
 			delta -= incr;
 		}
 	}
+}
+
+/**
+ * task_cputime_zero - Check a task_cputime struct for all zero fields.
+ *
+ * @cputime: The struct to compare.
+ *
+ * Checks @cputime to see if all fields are zero. Returns true if all fields
+ * are zero, false if any field is nonzero.
+ */
+static inline int task_cputime_zero(const struct task_cputime *cputime)
+{
+	if (!cputime->utime && !cputime->stime && !cputime->sum_exec_runtime)
+		return 1;
+	return 0;
 }
 
 static inline cputime_t prof_ticks(struct task_struct *p)
···
 	return 0;
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static void nohz_kick_work_fn(struct work_struct *work)
+{
+	tick_nohz_full_kick_all();
+}
+
+static DECLARE_WORK(nohz_kick_work, nohz_kick_work_fn);
+
+/*
+ * We need the IPIs to be sent from sane process context.
+ * The posix cpu timers are always set with irqs disabled.
+ */
+static void posix_cpu_timer_kick_nohz(void)
+{
+	schedule_work(&nohz_kick_work);
+}
+
+bool posix_cpu_timers_can_stop_tick(struct task_struct *tsk)
+{
+	if (!task_cputime_zero(&tsk->cputime_expires))
+		return false;
+
+	if (tsk->signal->cputimer.running)
+		return false;
+
+	return true;
+}
+#else
+static inline void posix_cpu_timer_kick_nohz(void) { }
+#endif
+
 /*
  * Guts of sys_timer_settime for CPU timers.
  * This is called with the timer locked and interrupts disabled.
···
 		sample_to_timespec(timer->it_clock,
 				   old_incr, &old->it_interval);
 	}
+	if (!ret)
+		posix_cpu_timer_kick_nohz();
 	return ret;
 }
 
···
 	if (it->expires && (!*expires || it->expires < *expires)) {
 		*expires = it->expires;
 	}
-}
-
-/**
- * task_cputime_zero - Check a task_cputime struct for all zero fields.
- *
- * @cputime: The struct to compare.
- *
- * Checks @cputime to see if all fields are zero. Returns true if all fields
- * are zero, false if any field is nonzero.
- */
-static inline int task_cputime_zero(const struct task_cputime *cputime)
-{
-	if (!cputime->utime && !cputime->stime && !cputime->sum_exec_runtime)
-		return 1;
-	return 0;
 }
 
 /*
···
 			cpu_timer_fire(timer);
 		spin_unlock(&timer->it_lock);
 	}
+
+	/*
+	 * In case some timers were rescheduled after the queue got emptied,
+	 * wake up full dynticks CPUs.
+	 */
+	if (tsk->signal->cputimer.running)
+		posix_cpu_timer_kick_nohz();
 }
 
 /*
···
 		}
 
 		if (!*newval)
-			return;
+			goto out;
 		*newval += now.cpu;
 	}
 
···
 			tsk->signal->cputime_expires.virt_exp = *newval;
 		break;
 	}
+out:
+	posix_cpu_timer_kick_nohz();
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
+13 -3
kernel/rcutree.c
···
 		rdp->offline_fqs++;
 		return 1;
 	}
+
+	/*
+	 * There is a possibility that a CPU in adaptive-ticks state
+	 * might run in the kernel with the scheduling-clock tick disabled
+	 * for an extended time period. Invoke rcu_kick_nohz_cpu() to
+	 * force the CPU to restart the scheduling-clock tick in this
+	 * CPU is in this state.
+	 */
+	rcu_kick_nohz_cpu(rdp->cpu);
+
 	return 0;
 }
 
···
 			  struct rcu_node *rnp, struct rcu_data *rdp)
 {
 	/* No-CBs CPUs do not have orphanable callbacks. */
-	if (is_nocb_cpu(rdp->cpu))
+	if (rcu_is_nocb_cpu(rdp->cpu))
 		return;
 
 	/*
···
 	 * corresponding CPU's preceding callbacks have been invoked.
 	 */
 	for_each_possible_cpu(cpu) {
-		if (!cpu_online(cpu) && !is_nocb_cpu(cpu))
+		if (!cpu_online(cpu) && !rcu_is_nocb_cpu(cpu))
 			continue;
 		rdp = per_cpu_ptr(rsp->rda, cpu);
-		if (is_nocb_cpu(cpu)) {
+		if (rcu_is_nocb_cpu(cpu)) {
 			_rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
 					   rsp->n_barrier_done);
 			atomic_inc(&rsp->barrier_cpu_count);
+1 -1
kernel/rcutree.h
···
 static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
 static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
-static bool is_nocb_cpu(int cpu);
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
 			    bool lazy);
 static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
 				      struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
+static void rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
+23 -10
kernel/rcutree_plugin.h
···
 #include <linux/gfp.h>
 #include <linux/oom.h>
 #include <linux/smpboot.h>
+#include <linux/tick.h>
 
 #define RCU_KTHREAD_PRIO 1
 
···
 		return;
 
 	/* If this is a no-CBs CPU, no callbacks, just return. */
-	if (is_nocb_cpu(cpu))
+	if (rcu_is_nocb_cpu(cpu))
 		return;
 
 	/*
···
 	struct rcu_data *rdp;
 	struct rcu_state *rsp;
 
-	if (is_nocb_cpu(cpu))
+	if (rcu_is_nocb_cpu(cpu))
 		return;
 	rcu_try_advance_all_cbs();
 	for_each_rcu_flavor(rsp) {
···
 }
 
 /* Is the specified CPU a no-CPUs CPU? */
-static bool is_nocb_cpu(int cpu)
+bool rcu_is_nocb_cpu(int cpu)
 {
 	if (have_rcu_nocb_mask)
 		return cpumask_test_cpu(cpu, rcu_nocb_mask);
···
 			    bool lazy)
 {
 
-	if (!is_nocb_cpu(rdp->cpu))
+	if (!rcu_is_nocb_cpu(rdp->cpu))
 		return 0;
 	__call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy);
 	if (__is_kfree_rcu_offset((unsigned long)rhp->func))
···
 	long qll = rsp->qlen_lazy;
 
 	/* If this is not a no-CBs CPU, tell the caller to do it the old way. */
-	if (!is_nocb_cpu(smp_processor_id()))
+	if (!rcu_is_nocb_cpu(smp_processor_id()))
 		return 0;
 	rsp->qlen = 0;
 	rsp->qlen_lazy = 0;
···
 {
 }
 
-static bool is_nocb_cpu(int cpu)
-{
-	return false;
-}
-
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
 			    bool lazy)
 {
···
 }
 
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
+
+/*
+ * An adaptive-ticks CPU can potentially execute in kernel mode for an
+ * arbitrarily long period of time with the scheduling-clock tick turned
+ * off. RCU will be paying attention to this CPU because it is in the
+ * kernel, but the CPU cannot be guaranteed to be executing the RCU state
+ * machine because the scheduling-clock tick has been disabled. Therefore,
+ * if an adaptive-ticks CPU is failing to respond to the current grace
+ * period and has not be idle from an RCU perspective, kick it.
+ */
+static void rcu_kick_nohz_cpu(int cpu)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	if (tick_nohz_full_cpu(cpu))
+		smp_send_reschedule(cpu);
+#endif /* #ifdef CONFIG_NO_HZ_FULL */
+}
+81 -11
kernel/sched/core.c
···
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * In the semi idle case, use the nearest busy cpu for migrating timers
  * from an idle cpu. This is good for power-savings.
···
  * account when the CPU goes back to idle and evaluates the timer
  * wheel for the next timer event.
  */
-void wake_up_idle_cpu(int cpu)
+static void wake_up_idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
···
 	smp_send_reschedule(cpu);
 }
 
+static bool wake_up_full_nohz_cpu(int cpu)
+{
+	if (tick_nohz_full_cpu(cpu)) {
+		if (cpu != smp_processor_id() ||
+		    tick_nohz_tick_stopped())
+			smp_send_reschedule(cpu);
+		return true;
+	}
+
+	return false;
+}
+
+void wake_up_nohz_cpu(int cpu)
+{
+	if (!wake_up_full_nohz_cpu(cpu))
+		wake_up_idle_cpu(cpu);
+}
+
 static inline bool got_nohz_idle_kick(void)
 {
 	int cpu = smp_processor_id();
 	return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
 }
 
-#else /* CONFIG_NO_HZ */
+#else /* CONFIG_NO_HZ_COMMON */
 
 static inline bool got_nohz_idle_kick(void)
 {
 	return false;
 }
 
-#endif /* CONFIG_NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
+
+#ifdef CONFIG_NO_HZ_FULL
+bool sched_can_stop_tick(void)
+{
+	struct rq *rq;
+
+	rq = this_rq();
+
+	/* Make sure rq->nr_running update is visible after the IPI */
+	smp_rmb();
+
+	/* More than one running task need preemption */
+	if (rq->nr_running > 1)
+		return false;
+
+	return true;
+}
+#endif /* CONFIG_NO_HZ_FULL */
 
 void sched_avg_update(struct rq *rq)
 {
···
 
 void scheduler_ipi(void)
 {
-	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
+	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
+	    && !tick_nohz_full_cpu(smp_processor_id()))
 		return;
 
 	/*
···
 	 * somewhat pessimize the simple resched case.
 	 */
 	irq_enter();
+	tick_nohz_full_check();
 	sched_ttwu_pending();
 
 	/*
···
 		kprobe_flush_task(prev);
 		put_task_struct(prev);
 	}
+
+	tick_nohz_task_switch(current);
 }
 
 #ifdef CONFIG_SMP
···
 	return load >> FSHIFT;
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * Handle NO_HZ for the global load-average.
  *
···
 	smp_wmb();
 	calc_load_idx++;
 }
-#else /* !CONFIG_NO_HZ */
+#else /* !CONFIG_NO_HZ_COMMON */
 
 static inline long calc_load_fold_idle(void) { return 0; }
 static inline void calc_global_nohz(void) { }
 
-#endif /* CONFIG_NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
 
 /*
  * calc_load - update the avenrun load estimates 10 ticks after the
···
 	sched_avg_update(this_rq);
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * There is no sane way to deal with nohz on smp when using jiffies because the
  * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading
···
 	}
 	raw_spin_unlock(&this_rq->lock);
 }
-#endif /* CONFIG_NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
 
 /*
  * Called from scheduler_tick()
···
 	rq->idle_balance = idle_cpu(cpu);
 	trigger_load_balance(rq, cpu);
 #endif
+	rq_last_tick_reset(rq);
 }
+
+#ifdef CONFIG_NO_HZ_FULL
+/**
+ * scheduler_tick_max_deferment
+ *
+ * Keep at least one tick per second when a single
+ * active task is running because the scheduler doesn't
+ * yet completely support full dynticks environment.
+ *
+ * This makes sure that uptime, CFS vruntime, load
+ * balancing, etc... continue to move forward, even
+ * with a very low granularity.
+ */
+u64 scheduler_tick_max_deferment(void)
+{
+	struct rq *rq = this_rq();
+	unsigned long next, now = ACCESS_ONCE(jiffies);
+
+	next = rq->last_sched_tick + HZ;
+
+	if (time_before_eq(next, now))
+		return 0;
+
+	return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
+}
+#endif
 
 notrace unsigned long get_parent_ip(unsigned long addr)
 {
···
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
 		rq_attach_root(rq, &def_root_domain);
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 		rq->nohz_flags = 0;
+#endif
+#ifdef CONFIG_NO_HZ_FULL
+		rq->last_sched_tick = 0;
 #endif
 #endif
 		init_rq_hrtick(rq);
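The deferment arithmetic in the scheduler_tick_max_deferment() hunk can be tried out in user space. What follows is a minimal sketch under stated assumptions, not kernel code: HZ_ASSUMED, sketch_time_before_eq() and sketch_tick_max_deferment_ns() are hypothetical stand-ins for the kernel's HZ, time_before_eq() and the new function, and the jiffies-to-nanoseconds conversion assumes HZ divides one second evenly.

```c
#include <stdint.h>

/* Assumption for illustration only: 1000 ticks per second.
 * In the kernel, HZ is a build-time configuration value. */
#define HZ_ASSUMED 1000UL
#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Wraparound-safe "a <= b" on a free-running tick counter,
 * in the spirit of the kernel's time_before_eq(). */
static int sketch_time_before_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) <= 0;
}

/*
 * Hypothetical user-space analogue of scheduler_tick_max_deferment():
 * given the jiffy of the last scheduler tick and the current jiffy,
 * return how many nanoseconds the next tick may still be deferred
 * so that at least one tick happens per second.
 */
static uint64_t sketch_tick_max_deferment_ns(unsigned long last_sched_tick,
					     unsigned long now)
{
	unsigned long next = last_sched_tick + HZ_ASSUMED;

	/* A full second (or more) has already elapsed: no deferment left. */
	if (sketch_time_before_eq(next, now))
		return 0;

	return (uint64_t)(next - now) * (SKETCH_NSEC_PER_SEC / HZ_ASSUMED);
}
```

Because the comparison is done on the signed difference, the sketch keeps working when the tick counter wraps around, which is the reason the hunk uses time_before_eq() rather than a plain `<=`.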
+5 -5
kernel/sched/fair.c
···
 	return 0;
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * idle load balancing details
  * - When one of the busy CPUs notice that there may be an idle rebalancing
···
 	rq->next_balance = next_balance;
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
- * In CONFIG_NO_HZ case, the idle balance kickee will do the
+ * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
 static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
···
 	if (time_after_eq(jiffies, rq->next_balance) &&
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 	if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
 		nohz_balancer_kick(cpu);
 #endif
···
 #ifdef CONFIG_SMP
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 	nohz.next_balance = jiffies;
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
 	cpu_notifier(sched_ilb_notifier, 0);
+1
kernel/sched/idle_task.c
···
 static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
 {
 	idle_exit_fair(rq);
+	rq_last_tick_reset(rq);
 }
 
 static void post_schedule_idle(struct rq *rq)
+23 -2
kernel/sched/sched.h
···
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
+#include <linux/tick.h>
 
 #include "cpupri.h"
 #include "cpuacct.h"
···
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 	unsigned long last_load_update_tick;
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 	u64 nohz_stamp;
 	unsigned long nohz_flags;
+#endif
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned long last_sched_tick;
 #endif
 	int skip_clock_update;
 
···
 static inline void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
+
+#ifdef CONFIG_NO_HZ_FULL
+	if (rq->nr_running == 2) {
+		if (tick_nohz_full_cpu(rq->cpu)) {
+			/* Order rq->nr_running write against the IPI */
+			smp_wmb();
+			smp_send_reschedule(rq->cpu);
+		}
+	}
+#endif
 }
 
 static inline void dec_nr_running(struct rq *rq)
 {
 	rq->nr_running--;
+}
+
+static inline void rq_last_tick_reset(struct rq *rq)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	rq->last_sched_tick = jiffies;
+#endif
 }
 
 extern void update_rq_clock(struct rq *rq);
···
 
 extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 enum rq_nohz_flag_bits {
 	NOHZ_TICK_STOPPED,
 	NOHZ_BALANCE_KICK,
+14 -5
kernel/softirq.c
···
 		wakeup_softirqd();
 }
 
+static inline void tick_irq_exit(void)
+{
+#ifdef CONFIG_NO_HZ_COMMON
+	int cpu = smp_processor_id();
+
+	/* Make sure that timer wheel updates are propagated */
+	if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
+		if (!in_interrupt())
+			tick_nohz_irq_exit();
+	}
+#endif
+}
+
 /*
  * Exit an interrupt context. Process softirqs if needed and possible:
  */
···
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
 
-#ifdef CONFIG_NO_HZ
-	/* Make sure that timer wheel updates are propagated */
-	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
-		tick_nohz_irq_exit();
-#endif
+	tick_irq_exit();
 	rcu_irq_exit();
 }
 
+74 -6
kernel/time/Kconfig
···
 if GENERIC_CLOCKEVENTS
 menu "Timers subsystem"
 
-# Core internal switch. Selected by NO_HZ / HIGH_RES_TIMERS. This is
+# Core internal switch. Selected by NO_HZ_COMMON / HIGH_RES_TIMERS. This is
 # only related to the tick functionality. Oneshot clockevent devices
 # are supported independ of this.
 config TICK_ONESHOT
 	bool
 
-config NO_HZ
-	bool "Tickless System (Dynamic Ticks)"
+config NO_HZ_COMMON
+	bool
 	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
 	select TICK_ONESHOT
+
+choice
+	prompt "Timer tick handling"
+	default NO_HZ_IDLE if NO_HZ
+
+config HZ_PERIODIC
+	bool "Periodic timer ticks (constant rate, no dynticks)"
 	help
-	  This option enables a tickless system: timer interrupts will
-	  only trigger on an as-needed basis both when the system is
-	  busy and when the system is idle.
+	  This option keeps the tick running periodically at a constant
+	  rate, even when the CPU doesn't need it.
+
+config NO_HZ_IDLE
+	bool "Idle dynticks system (tickless idle)"
+	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
+	select NO_HZ_COMMON
+	help
+	  This option enables a tickless idle system: timer interrupts
+	  will only trigger on an as-needed basis when the system is idle.
+	  This is usually interesting for energy saving.
+
+	  Most of the time you want to say Y here.
+
+config NO_HZ_FULL
+	bool "Full dynticks system (tickless)"
+	# NO_HZ_COMMON dependency
+	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
+	# We need at least one periodic CPU for timekeeping
+	depends on SMP
+	# RCU_USER_QS dependency
+	depends on HAVE_CONTEXT_TRACKING
+	# VIRT_CPU_ACCOUNTING_GEN dependency
+	depends on 64BIT
+	select NO_HZ_COMMON
+	select RCU_USER_QS
+	select RCU_NOCB_CPU
+	select VIRT_CPU_ACCOUNTING_GEN
+	select CONTEXT_TRACKING_FORCE
+	select IRQ_WORK
+	help
+	  Adaptively try to shutdown the tick whenever possible, even when
+	  the CPU is running tasks. Typically this requires running a single
+	  task on the CPU. Chances for running tickless are maximized when
+	  the task mostly runs in userspace and has few kernel activity.
+
+	  You need to fill up the nohz_full boot parameter with the
+	  desired range of dynticks CPUs.
+
+	  This is implemented at the expense of some overhead in user <-> kernel
+	  transitions: syscalls, exceptions and interrupts. Even when it's
+	  dynamically off.
+
+	  Say N.
+
+endchoice
+
+config NO_HZ_FULL_ALL
+	bool "Full dynticks system on all CPUs by default"
+	depends on NO_HZ_FULL
+	help
+	  If the user doesn't pass the nohz_full boot option to
+	  define the range of full dynticks CPUs, consider that all
+	  CPUs in the system are full dynticks by default.
+	  Note the boot CPU will still be kept outside the range to
+	  handle the timekeeping duty.
+
+config NO_HZ
+	bool "Old Idle dynticks config"
+	depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
+	help
+	  This is the old config entry that enables dynticks idle.
+	  We keep it around for a little while to enforce backward
+	  compatibility with older config files.
 
 config HIGH_RES_TIMERS
 	bool "High Resolution Timer Support"
+2 -1
kernel/time/tick-broadcast.c
···
 		bc->event_handler = tick_handle_oneshot_broadcast;
 
 		/* Take the do_timer update */
-		tick_do_timer_cpu = cpu;
+		if (!tick_nohz_full_cpu(cpu))
+			tick_do_timer_cpu = cpu;
 
 		/*
 		 * We must be careful here. There might be other CPUs
+4 -1
kernel/time/tick-common.c
···
 		 * this cpu:
 		 */
 		if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
-			tick_do_timer_cpu = cpu;
+			if (!tick_nohz_full_cpu(cpu))
+				tick_do_timer_cpu = cpu;
+			else
+				tick_do_timer_cpu = TICK_DO_TIMER_NONE;
 			tick_next_period = ktime_get();
 			tick_period = ktime_set(0, NSEC_PER_SEC / HZ);
 		}
+280 -16
kernel/time/tick-sched.c
··· 21 21 #include <linux/sched.h> 22 22 #include <linux/module.h> 23 23 #include <linux/irq_work.h> 24 + #include <linux/posix-timers.h> 25 + #include <linux/perf_event.h> 24 26 25 27 #include <asm/irq_regs.h> 26 28 27 29 #include "tick-internal.h" 30 + 31 + #include <trace/events/timer.h> 28 32 29 33 /* 30 34 * Per cpu nohz control structure ··· 108 104 { 109 105 int cpu = smp_processor_id(); 110 106 111 - #ifdef CONFIG_NO_HZ 107 + #ifdef CONFIG_NO_HZ_COMMON 112 108 /* 113 109 * Check if the do_timer duty was dropped. We don't care about 114 110 * concurrency: This happens only when the cpu in charge went ··· 116 112 * this duty, then the jiffies update is still serialized by 117 113 * jiffies_lock. 118 114 */ 119 - if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) 115 + if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE) 116 + && !tick_nohz_full_cpu(cpu)) 120 117 tick_do_timer_cpu = cpu; 121 118 #endif 122 119 ··· 128 123 129 124 static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs) 130 125 { 131 - #ifdef CONFIG_NO_HZ 126 + #ifdef CONFIG_NO_HZ_COMMON 132 127 /* 133 128 * When we are idle and the tick is stopped, we have to touch 134 129 * the watchdog as we might not schedule for a really long ··· 147 142 profile_tick(CPU_PROFILING); 148 143 } 149 144 145 + #ifdef CONFIG_NO_HZ_FULL 146 + static cpumask_var_t nohz_full_mask; 147 + bool have_nohz_full_mask; 148 + 149 + static bool can_stop_full_tick(void) 150 + { 151 + WARN_ON_ONCE(!irqs_disabled()); 152 + 153 + if (!sched_can_stop_tick()) { 154 + trace_tick_stop(0, "more than 1 task in runqueue\n"); 155 + return false; 156 + } 157 + 158 + if (!posix_cpu_timers_can_stop_tick(current)) { 159 + trace_tick_stop(0, "posix timers running\n"); 160 + return false; 161 + } 162 + 163 + if (!perf_event_can_stop_tick()) { 164 + trace_tick_stop(0, "perf events running\n"); 165 + return false; 166 + } 167 + 168 + /* sched_clock_tick() needs us? 
*/ 169 + #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 170 + /* 171 + * TODO: kick full dynticks CPUs when 172 + * sched_clock_stable is set. 173 + */ 174 + if (!sched_clock_stable) { 175 + trace_tick_stop(0, "unstable sched clock\n"); 176 + return false; 177 + } 178 + #endif 179 + 180 + return true; 181 + } 182 + 183 + static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now); 184 + 185 + /* 186 + * Re-evaluate the need for the tick on the current CPU 187 + * and restart it if necessary. 188 + */ 189 + void tick_nohz_full_check(void) 190 + { 191 + struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched); 192 + 193 + if (tick_nohz_full_cpu(smp_processor_id())) { 194 + if (ts->tick_stopped && !is_idle_task(current)) { 195 + if (!can_stop_full_tick()) 196 + tick_nohz_restart_sched_tick(ts, ktime_get()); 197 + } 198 + } 199 + } 200 + 201 + static void nohz_full_kick_work_func(struct irq_work *work) 202 + { 203 + tick_nohz_full_check(); 204 + } 205 + 206 + static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = { 207 + .func = nohz_full_kick_work_func, 208 + }; 209 + 210 + /* 211 + * Kick the current CPU if it's full dynticks in order to force it to 212 + * re-evaluate its dependency on the tick and restart it if necessary. 213 + */ 214 + void tick_nohz_full_kick(void) 215 + { 216 + if (tick_nohz_full_cpu(smp_processor_id())) 217 + irq_work_queue(&__get_cpu_var(nohz_full_kick_work)); 218 + } 219 + 220 + static void nohz_full_kick_ipi(void *info) 221 + { 222 + tick_nohz_full_check(); 223 + } 224 + 225 + /* 226 + * Kick all full dynticks CPUs in order to force these to re-evaluate 227 + * their dependency on the tick and restart it if necessary. 
228 + */ 229 + void tick_nohz_full_kick_all(void) 230 + { 231 + if (!have_nohz_full_mask) 232 + return; 233 + 234 + preempt_disable(); 235 + smp_call_function_many(nohz_full_mask, 236 + nohz_full_kick_ipi, NULL, false); 237 + preempt_enable(); 238 + } 239 + 240 + /* 241 + * Re-evaluate the need for the tick as we switch the current task. 242 + * It might need the tick due to per task/process properties: 243 + * perf events, posix cpu timers, ... 244 + */ 245 + void tick_nohz_task_switch(struct task_struct *tsk) 246 + { 247 + unsigned long flags; 248 + 249 + local_irq_save(flags); 250 + 251 + if (!tick_nohz_full_cpu(smp_processor_id())) 252 + goto out; 253 + 254 + if (tick_nohz_tick_stopped() && !can_stop_full_tick()) 255 + tick_nohz_full_kick(); 256 + 257 + out: 258 + local_irq_restore(flags); 259 + } 260 + 261 + int tick_nohz_full_cpu(int cpu) 262 + { 263 + if (!have_nohz_full_mask) 264 + return 0; 265 + 266 + return cpumask_test_cpu(cpu, nohz_full_mask); 267 + } 268 + 269 + /* Parse the boot-time nohz CPU list from the kernel parameters. 
*/ 270 + static int __init tick_nohz_full_setup(char *str) 271 + { 272 + int cpu; 273 + 274 + alloc_bootmem_cpumask_var(&nohz_full_mask); 275 + if (cpulist_parse(str, nohz_full_mask) < 0) { 276 + pr_warning("NOHZ: Incorrect nohz_full cpumask\n"); 277 + return 1; 278 + } 279 + 280 + cpu = smp_processor_id(); 281 + if (cpumask_test_cpu(cpu, nohz_full_mask)) { 282 + pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu); 283 + cpumask_clear_cpu(cpu, nohz_full_mask); 284 + } 285 + have_nohz_full_mask = true; 286 + 287 + return 1; 288 + } 289 + __setup("nohz_full=", tick_nohz_full_setup); 290 + 291 + static int __cpuinit tick_nohz_cpu_down_callback(struct notifier_block *nfb, 292 + unsigned long action, 293 + void *hcpu) 294 + { 295 + unsigned int cpu = (unsigned long)hcpu; 296 + 297 + switch (action & ~CPU_TASKS_FROZEN) { 298 + case CPU_DOWN_PREPARE: 299 + /* 300 + * If we handle the timekeeping duty for full dynticks CPUs, 301 + * we can't safely shutdown that CPU. 302 + */ 303 + if (have_nohz_full_mask && tick_do_timer_cpu == cpu) 304 + return -EINVAL; 305 + break; 306 + } 307 + return NOTIFY_OK; 308 + } 309 + 310 + /* 311 + * Worst case string length in chunks of CPU range seems 2 steps 312 + * separations: 0,2,4,6,... 
+ * This is NR_CPUS + sizeof('\0')
+ */
+static char __initdata nohz_full_buf[NR_CPUS + 1];
+
+static int tick_nohz_init_all(void)
+{
+	int err = -1;
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!alloc_cpumask_var(&nohz_full_mask, GFP_KERNEL)) {
+		pr_err("NO_HZ: Can't allocate full dynticks cpumask\n");
+		return err;
+	}
+	err = 0;
+	cpumask_setall(nohz_full_mask);
+	cpumask_clear_cpu(smp_processor_id(), nohz_full_mask);
+	have_nohz_full_mask = true;
+#endif
+	return err;
+}
+
+void __init tick_nohz_init(void)
+{
+	int cpu;
+
+	if (!have_nohz_full_mask) {
+		if (tick_nohz_init_all() < 0)
+			return;
+	}
+
+	cpu_notifier(tick_nohz_cpu_down_callback, 0);
+
+	/* Make sure full dynticks CPU are also RCU nocbs */
+	for_each_cpu(cpu, nohz_full_mask) {
+		if (!rcu_is_nocb_cpu(cpu)) {
+			pr_warning("NO_HZ: CPU %d is not RCU nocb: "
+				   "cleared from nohz_full range", cpu);
+			cpumask_clear_cpu(cpu, nohz_full_mask);
+		}
+	}
+
+	cpulist_scnprintf(nohz_full_buf, sizeof(nohz_full_buf), nohz_full_mask);
+	pr_info("NO_HZ: Full dynticks CPUs: %s.\n", nohz_full_buf);
+}
+#else
+#define have_nohz_full_mask (0)
+#endif
+
 /*
  * NOHZ - aka dynamic tick functionality
  */
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * NO HZ enabled ?
  */
···
 			delta_jiffies = rcu_delta_jiffies;
 		}
 	}
+
 	/*
-	 * Do not stop the tick, if we are only one off
-	 * or if the cpu is required for rcu
+	 * Do not stop the tick, if we are only one off (or less)
+	 * or if the cpu is required for RCU:
 	 */
-	if (!ts->tick_stopped && delta_jiffies == 1)
+	if (!ts->tick_stopped && delta_jiffies <= 1)
 		goto out;
 
 	/* Schedule the tick, if we are at least one jiffie off */
···
 		} else if (!ts->do_timer_last) {
 			time_delta = KTIME_MAX;
 		}
+
+#ifdef CONFIG_NO_HZ_FULL
+		if (!ts->inidle) {
+			time_delta = min(time_delta,
+					 scheduler_tick_max_deferment());
+		}
+#endif
 
 		/*
 		 * calculate the expiry time for the next timer wheel
···
 
 			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
+			trace_tick_stop(1, " ");
 		}
 
 		/*
···
 	return ret;
 }
 
+static void tick_nohz_full_stop_tick(struct tick_sched *ts)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	int cpu = smp_processor_id();
+
+	if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
+		return;
+
+	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+		return;
+
+	if (!can_stop_full_tick())
+		return;
+
+	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+#endif
+}
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
···
 			ratelimit++;
 		}
 		return false;
+	}
+
+	if (have_nohz_full_mask) {
+		/*
+		 * Keep the tick alive to guarantee timekeeping progression
+		 * if there are full dynticks CPUs around
+		 */
+		if (tick_do_timer_cpu == cpu)
+			return false;
+		/*
+		 * Boot safety: make sure the timekeeping duty has been
+		 * assigned before entering dyntick-idle mode,
+		 */
+		if (tick_do_timer_cpu == TICK_DO_TIMER_NONE)
+			return false;
 	}
 
 	return true;
···
 {
 	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
-	if (!ts->inidle)
-		return;
-
-	/* Cancel the timer because CPU already waken up from the C-states*/
-	menu_hrtimer_cancel();
-	__tick_nohz_idle_enter(ts);
+	if (ts->inidle) {
+		/* Cancel the timer because CPU already waken up from the C-states*/
+		menu_hrtimer_cancel();
+		__tick_nohz_idle_enter(ts);
+	} else {
+		tick_nohz_full_stop_tick(ts);
+	}
 }
 
 /**
···
 static inline void tick_nohz_switch_to_nohz(void) { }
 static inline void tick_check_nohz(int cpu) { }
 
-#endif /* NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
 
 /*
  * Called from irq_enter to notify about the possible interruption of idle()
···
 		now = ktime_get();
 	}
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 	if (tick_nohz_enabled)
 		ts->nohz_mode = NOHZ_MODE_HIGHRES;
 #endif
 }
 #endif /* HIGH_RES_TIMERS */
 
-#if defined CONFIG_NO_HZ || defined CONFIG_HIGH_RES_TIMERS
+#if defined CONFIG_NO_HZ_COMMON || defined CONFIG_HIGH_RES_TIMERS
 void tick_cancel_sched_timer(int cpu)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
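Note on the can_stop_idle_tick() hunk above: the added guard can be modelled in userspace as a pure function. This is only an illustrative sketch, not kernel code. The names model_can_stop_idle_tick and MODEL_TICK_DO_TIMER_NONE are hypothetical, the sentinel value -1 is an assumption, and the kernel globals (tick_do_timer_cpu, have_nohz_full_mask) are passed in as parameters so the logic can be exercised in isolation. The softirq check that precedes the guard in the real function is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's TICK_DO_TIMER_NONE sentinel;
 * the value -1 is an assumption made for this model only. */
#define MODEL_TICK_DO_TIMER_NONE (-1)

/*
 * Userspace model of the guard added to can_stop_idle_tick().
 *
 * cpu           - the CPU that wants to enter dyntick-idle mode
 * timer_cpu     - models tick_do_timer_cpu (owner of the timekeeping duty)
 * have_full     - models have_nohz_full_mask
 *
 * Returns true when the idle tick may be stopped as far as this guard
 * is concerned.
 */
static bool model_can_stop_idle_tick(int cpu, int timer_cpu, bool have_full)
{
	if (have_full) {
		/* The timekeeping CPU must keep ticking while full
		 * dynticks CPUs are around. */
		if (timer_cpu == cpu)
			return false;

		/* Boot safety: nobody owns the timekeeping duty yet. */
		if (timer_cpu == MODEL_TICK_DO_TIMER_NONE)
			return false;
	}

	return true;
}
```

With a full dynticks mask present, model_can_stop_idle_tick(0, 0, true) is false (CPU 0 owns timekeeping) while model_can_stop_idle_tick(1, 0, true) is true. Without a mask the guard never vetoes stopping the tick.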
+8 -8
kernel/timer.c
···

 	cpu = smp_processor_id();

-#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
+#if defined(CONFIG_NO_HZ_COMMON) && defined(CONFIG_SMP)
 	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
 		cpu = get_nohz_timer_target();
 #endif
···
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
 	/*
-	 * Check whether the other CPU is idle and needs to be
-	 * triggered to reevaluate the timer wheel when nohz is
-	 * active. We are protected against the other CPU fiddling
+	 * Check whether the other CPU is in dynticks mode and needs
+	 * to be triggered to reevaluate the timer wheel.
+	 * We are protected against the other CPU fiddling
 	 * with the timer by holding the timer base lock. This also
-	 * makes sure that a CPU on the way to idle can not evaluate
-	 * the timer wheel.
+	 * makes sure that a CPU on the way to stop its tick can not
+	 * evaluate the timer wheel.
 	 */
-	wake_up_idle_cpu(cpu);
+	wake_up_nohz_cpu(cpu);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
···
 	spin_unlock_irq(&base->lock);
 }

-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * Find out when the next timer event is due to happen. This
  * is used on S/390 to stop all activity when a CPU is idle.