Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

[PATCH] hrtimers: move and add documentation

Move the initial hrtimers.txt document to the new directory
"Documentation/hrtimers"

Add design notes for the high resolution timer and dynamic tick functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Thomas Gleixner and committed by Linus Torvalds
dd3629b5 5cfb6de7

+249
Documentation/hrtimers.txt => Documentation/hrtimers/hrtimers.txt
Documentation/hrtimers/highres.txt +249
High resolution timers and dynamic ticks design notes
-----------------------------------------------------

Further information can be found in the paper of the OLS 2006 talk "hrtimers
and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can
be found on the OLS website:
http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf

The slides to this talk are available from:
http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf

The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the
changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the
design of the Linux time(r) system before hrtimers and other building blocks
got merged into mainline.

Note: the paper and the slides are talking about "clock event source", while we
switched to the name "clock event devices" in the meantime.

The design contains the following basic building blocks:

- hrtimer base infrastructure
- timeofday and clock source management
- clock event management
- high resolution timer functionality
- dynamic ticks


hrtimer base infrastructure
---------------------------

The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of
the base implementation are covered in Documentation/hrtimers/hrtimers.txt. See
also figure #2 (OLS slides p.
15)

The main differences from the timer wheel, which holds the armed timer_list
type timers, are:
- time ordered enqueueing into a rb-tree
- independent of ticks (the processing is based on nanoseconds)


timeofday and clock source management
-------------------------------------

John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of
code out of the architecture-specific areas into a generic management
framework, as illustrated in figure #3 (OLS slides p. 18). The architecture
specific portion is reduced to the low level hardware details of the clock
sources, which are registered in the framework and selected on a quality based
decision. The low level code provides hardware setup and readout routines and
initializes data structures, which are used by the generic time keeping code to
convert the clock ticks to nanosecond based time values. All other time keeping
related functionality is moved into the generic code. The GTOD base patch got
merged into the 2.6.18 kernel.

Further information about the Generic Time Of Day framework is available in the
OLS 2005 Proceedings Volume 1:
http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf

The paper "We Are Not Getting Any Younger: A New Approach to Time and
Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan.

Figure #3 (OLS slides p.18) illustrates the transformation.


clock event management
----------------------

While clock sources provide read access to the monotonically increasing time
value, clock event devices are used to schedule the next event
interrupt(s). The next event is currently defined to be periodic, with its
period defined at compile time. The setup and selection of the event device
for various event driven functionalities is hardwired into the architecture
dependent code.
This results in duplicated code across all architectures and
makes it extremely difficult to change the configuration of the system to use
event interrupt devices other than those already built into the
architecture. Another implication of the current design is that it is necessary
to touch all the architecture-specific implementations in order to provide new
functionality like high resolution timers or dynamic ticks.

The clock events subsystem tries to address this problem by providing a generic
solution to manage clock event devices and their usage for the various clock
event driven kernel functionalities. The goal of the clock event subsystem is
to minimize the clock event related architecture dependent code to the pure
hardware related handling and to allow easy addition and utilization of new
clock event devices. It also minimizes the duplicated code across the
architectures as it provides generic functionality down to the interrupt
service handler, which is almost inherently hardware dependent.

Clock event devices are registered either by the architecture dependent boot
code or at module insertion time. Each clock event device fills a data
structure with clock-specific property parameters and callback functions. The
clock event management decides, by using the specified property parameters, the
set of system functions a clock event device will be used to support. This
includes the distinction of per-CPU and per-system global event devices.

System-level global event devices are used for the Linux periodic tick. Per-CPU
event devices are used to provide local CPU functionality such as process
accounting, profiling, and high resolution timers.
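
The registration and selection scheme described above can be sketched in plain
userspace C. This is only an illustration under stated assumptions, not the
kernel's clock event code: every name with a demo_ or DEMO_ prefix is
hypothetical, the property parameters are reduced to two flags and a CPU number,
and only one global and one per-CPU device are accepted. The point it shows is
that the device supplies properties and a hardware callback, while the
management layer decides which function the device serves and stores the
handler pointer that the hardware level interrupt code calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; they are not the kernel's definitions. */
#define DEMO_FEAT_PERIODIC 0x1u  /* device can fire at a fixed period */
#define DEMO_FEAT_ONESHOT  0x2u  /* device can be armed for one next event */
#define DEMO_CPU_GLOBAL    (-1)  /* device is not bound to a single CPU */
#define DEMO_MAX_CPUS      4

struct demo_event_device {
    const char *name;
    unsigned int features;   /* property parameters filled in by the device */
    int cpu;                 /* owning CPU, or DEMO_CPU_GLOBAL */
    uint64_t programmed_ns;  /* last delta handed to the (fake) hardware */
    /* Hardware callback: arm the next interrupt delta_ns from now. */
    int (*set_next_event)(struct demo_event_device *dev, uint64_t delta_ns);
    /* Stored by the management layer, called by the hardware handler. */
    void (*event_handler)(struct demo_event_device *dev);
};

unsigned long demo_jiffies;                      /* global tick counter */
unsigned long demo_local_events[DEMO_MAX_CPUS];  /* per-CPU event counters */

static struct demo_event_device *demo_global_dev;
static struct demo_event_device *demo_cpu_dev[DEMO_MAX_CPUS];

/* Function assigned to the system global device: the periodic tick. */
static void demo_periodic_tick(struct demo_event_device *dev)
{
    (void)dev;
    demo_jiffies++;
}

/* Function assigned to a per-CPU device: the local next event interrupt. */
static void demo_next_event(struct demo_event_device *dev)
{
    demo_local_events[dev->cpu]++;
}

/* Example hardware callback that only records the requested delta. */
int demo_hw_set_next_event(struct demo_event_device *dev, uint64_t delta_ns)
{
    dev->programmed_ns = delta_ns;
    return 0;
}

/*
 * Management layer: choose the device's function from its properties.
 * Returns 0 if the device was put to use, -1 if it was not.
 */
int demo_register_device(struct demo_event_device *dev)
{
    assert(dev != NULL && dev->set_next_event != NULL);

    if (dev->cpu == DEMO_CPU_GLOBAL) {
        if (!(dev->features & DEMO_FEAT_PERIODIC) || demo_global_dev)
            return -1;
        demo_global_dev = dev;
        dev->event_handler = demo_periodic_tick;
        return 0;
    }
    if (dev->cpu < 0 || dev->cpu >= DEMO_MAX_CPUS)
        return -1;
    if (!(dev->features & DEMO_FEAT_ONESHOT) || demo_cpu_dev[dev->cpu])
        return -1;
    demo_cpu_dev[dev->cpu] = dev;
    dev->event_handler = demo_next_event;
    return 0;
}

/* Hardware level interrupt handler: it only calls the stored pointer. */
void demo_hw_interrupt(struct demo_event_device *dev)
{
    if (dev->event_handler)
        dev->event_handler(dev);
}
```

A caller would fill in one struct demo_event_device per timer chip, pass it to
demo_register_device() and call demo_hw_interrupt() from its interrupt entry.
The device code never needs to know whether it ends up driving the global tick
or a CPU local event.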

The management layer assigns one or more of the following functions to a clock
event device:
- system global periodic tick (jiffies update)
- cpu local update_process_times
- cpu local profiling
- cpu local next event interrupt (non periodic mode)

The clock event device delegates the selection of those timer interrupt related
functions completely to the management layer. The clock management layer stores
a function pointer in the device description structure, which has to be called
from the hardware level handler. This removes a lot of duplicated code from the
architecture specific timer interrupt handlers and hands the control over the
clock event devices and the assignment of timer interrupt related functionality
to the core code.

The clock event layer API is rather small. Aside from the clock event device
registration interface it provides functions to schedule the next event
interrupt, clock event device notification service and support for suspend and
resume.

The framework adds about 700 lines of code which results in a 2KB increase of
the kernel binary size. The conversion of i386 removes about 100 lines of
code. The binary size decrease is in the range of 400 bytes. We believe that the
increase of flexibility and the avoidance of duplicated code across
architectures justifies the slight increase of the binary size.

The conversion of an architecture has no functional impact, but allows the
high resolution and dynamic tick functionalities to be used without any change
to the clock event device and timer interrupt code.
After the conversion the
enabling of high resolution timers and dynamic ticks is simply provided by
adding the kernel/time/Kconfig file to the architecture specific Kconfig and
adding the dynamic tick specific calls to the idle routine (a total of 3 lines
added to the idle function and the Kconfig file).

Figure #4 (OLS slides p.20) illustrates the transformation.


high resolution timer functionality
-----------------------------------

During system boot it is not possible to use the high resolution timer
functionality; making it possible would be difficult and would serve no
useful purpose. The initialization of the clock event device framework, the
clock source framework (GTOD) and hrtimers itself has to be done and
appropriate clock sources and clock event devices have to be registered before
the high resolution functionality can work. Up to the point where hrtimers are
initialized, the system works in the usual low resolution periodic mode. The
clock source and the clock event device layers provide notification functions
which inform hrtimers about availability of new hardware. hrtimers validates
the usability of the registered clock sources and clock event devices before
switching to high resolution mode. This also ensures that a kernel which is
configured for high resolution timers can run on a system which lacks the
necessary hardware support.

The high resolution timer code does not support SMP machines which have only
global clock event devices. The support of such hardware would involve IPI
calls when an interrupt happens. The overhead would be much larger than the
benefit. This is the reason why we currently disable high resolution and
dynamic ticks on i386 SMP systems which stop the local APIC in C3 power
state.
A workaround is available as an idea, but the problem has not been
tackled yet.

The time ordered insertion of timers provides all the infrastructure to decide
whether the event device has to be reprogrammed when a timer is added. The
decision is made per timer base and synchronized across per-cpu timer bases in
a support function. The design allows the system to utilize separate per-CPU
clock event devices for the per-CPU timer bases, but currently only one
reprogrammable clock event device per-CPU is utilized.

When the timer interrupt happens, the next event interrupt handler is called
from the clock event distribution code and moves expired timers from the
red-black tree to a separate doubly linked list and invokes the softirq
handler. An additional mode field in the hrtimer structure allows the system to
execute callback functions directly from the next event interrupt handler. This
is restricted to code which can safely be executed in the hard interrupt
context. This applies, for example, to the common case of a wakeup function as
used by nanosleep. The advantage of executing the handler in the interrupt
context is the avoidance of up to two context switches - from the interrupted
context to the softirq and to the task which is woken up by the expired
timer.

Once a system has switched to high resolution mode, the periodic tick is
switched off. This disables the per system global periodic clock event device -
e.g. the PIT on i386 SMP systems.

The periodic tick functionality is provided by a per-cpu hrtimer. The callback
function is executed in the next event interrupt context and updates jiffies
and calls update_process_times and profiling. The implementation of the hrtimer
based periodic tick is designed to be extended with dynamic tick functionality.
This allows a single clock event device to be used to schedule high resolution
timer and periodic events (jiffies tick, profiling, process accounting) on UP
systems. This has been proven to work with the PIT on i386 and the Incrementer
on PPC.

The softirq for running the hrtimer queues and executing the callbacks has been
separated from the tick bound timer softirq to allow accurate delivery of high
resolution timer signals which are used by itimer and POSIX interval
timers. The execution of this softirq can still be delayed by other softirqs,
but the overall latencies have been significantly improved by this separation.

Figure #5 (OLS slides p.22) illustrates the transformation.


dynamic ticks
-------------

Dynamic ticks are the logical consequence of the hrtimer based periodic tick
replacement (sched_tick). The functionality of the sched_tick hrtimer is
extended by three functions:

- hrtimer_stop_sched_tick
- hrtimer_restart_sched_tick
- hrtimer_update_jiffies

hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code
evaluates the next scheduled timer event (from both hrtimers and the timer
wheel) and, if the next event is further away than the next tick, it
reprograms the sched_tick to this future event, to allow longer idle sleeps
without needless interruption by the periodic tick. The function is also
called when an interrupt happens during the idle period, which does not cause a
reschedule. The call is necessary as the interrupt handler might have armed a
new timer whose expiry time is before the time which was identified as the
nearest event in the previous call to hrtimer_stop_sched_tick.

hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before
it calls schedule().
hrtimer_restart_sched_tick() resumes the periodic tick,
which is kept active until the next call to hrtimer_stop_sched_tick().

hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens
in the idle period to make sure that jiffies are up to date and the interrupt
handler does not have to deal with a possibly stale jiffies value.

The dynamic tick feature provides statistical values which are exported to
userspace via /proc/stat and can be made available for enhanced power
management control.

The implementation leaves room for further development like full tickless
systems, where the time slice is controlled by the scheduler, variable
frequency profiling, and a complete removal of jiffies in the future.


Aside from the current initial submission of i386 support, the patchset has
been extended to x86_64 and ARM already. Initial (work in progress) support is
also available for MIPS and PowerPC.

          Thomas, Ingo
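
As an illustration of the dynamic ticks section, the decision made on idle
entry and exit can be sketched in userspace C. This is a simplified model under
stated assumptions, not the patchset's implementation: all demo_ names are
hypothetical, time is a plain nanosecond counter, and a 250 Hz tick
(DEMO_TICK_NS) is assumed purely for the arithmetic. It shows the three steps
the text names: defer the tick when the nearest timer is further away than the
next tick, realign the tick when idle ends, and catch up jiffies for the ticks
that were skipped.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TICK_NS 4000000ULL  /* assumed 250 Hz tick, hypothetical value */

struct demo_sched_tick {
    uint64_t next_tick_ns;   /* expiry of the regular periodic tick */
    uint64_t programmed_ns;  /* expiry currently programmed for the tick */
    int stopped;             /* nonzero while the periodic tick is deferred */
};

/*
 * Idle entry, or an interrupt during idle that caused no reschedule:
 * look at the nearest event from hrtimers and the timer wheel and defer
 * the tick if that event lies beyond the next regular tick.
 */
void demo_stop_sched_tick(struct demo_sched_tick *t,
                          uint64_t next_hrtimer_ns, uint64_t next_wheel_ns)
{
    uint64_t next_event = next_hrtimer_ns < next_wheel_ns ?
                          next_hrtimer_ns : next_wheel_ns;

    assert(t != 0);
    if (next_event > t->next_tick_ns) {
        t->programmed_ns = next_event;       /* sleep past the tick */
        t->stopped = 1;
    } else {
        t->programmed_ns = t->next_tick_ns;  /* tick is needed anyway */
        t->stopped = 0;
    }
}

/* Idle exit: resume the periodic tick on the first boundary after now. */
void demo_restart_sched_tick(struct demo_sched_tick *t, uint64_t now_ns)
{
    assert(t != 0);
    if (!t->stopped)
        return;
    if (t->next_tick_ns <= now_ns) {
        uint64_t missed = (now_ns - t->next_tick_ns) / DEMO_TICK_NS + 1;
        t->next_tick_ns += missed * DEMO_TICK_NS;
    }
    t->programmed_ns = t->next_tick_ns;
    t->stopped = 0;
}

/* Interrupt during idle: account for the ticks that did not fire. */
unsigned long demo_update_jiffies(unsigned long jiffies,
                                  uint64_t last_update_ns, uint64_t now_ns)
{
    return jiffies + (unsigned long)((now_ns - last_update_ns) / DEMO_TICK_NS);
}
```

Calling demo_stop_sched_tick() a second time with an earlier expiry models the
case where an interrupt handler armed a new timer during idle: the programmed
expiry moves forward to the new nearest event while the tick stays deferred.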