Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Documentation: Add real-time to core-api

The documents explain the design concepts behind PREEMPT_RT and highlight the
key differences necessary to achieve it. They also include a list of
requirements that must be fulfilled to support PREEMPT_RT on a given
architecture.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[jc: tweaked "how they differ" section head]
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250815093858.930751-4-bigeasy@linutronix.de

Authored by Sebastian Andrzej Siewior, committed by Jonathan Corbet
f51fe3b7 f41c808c

+484
+1
Documentation/core-api/index.rst
     printk-index
     symbol-namespaces
     asm-annotations
+    real-time/index

 Data structures and low-level utilities
 =======================================
+109
Documentation/core-api/real-time/architecture-porting.rst
.. SPDX-License-Identifier: GPL-2.0

=============================================
Porting an architecture to support PREEMPT_RT
=============================================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

This list outlines the architecture-specific requirements that must be
implemented in order to enable PREEMPT_RT. Once all required features are
implemented, ARCH_SUPPORTS_RT can be selected in the architecture's Kconfig
to make PREEMPT_RT selectable. Many prerequisites (genirq support, for
example) are enforced by the common code and are omitted here.

The optional features are not strictly required, but they are worth
considering.

Requirements
------------

Forced threaded interrupts
  CONFIG_IRQ_FORCED_THREADING must be selected. Any interrupts that must
  remain in hard-IRQ context must be marked with IRQF_NO_THREAD. This
  requirement applies, for instance, to clocksource event interrupts, perf
  interrupts, and cascading interrupt-controller handlers.

PREEMPTION support
  Kernel preemption must be supported, which requires that
  CONFIG_ARCH_NO_PREEMPT remain unselected. Scheduling requests, such as
  those issued from an interrupt or other exception handler, must be
  processed immediately.

POSIX CPU timers and KVM
  POSIX CPU timers must expire from thread context rather than directly
  within the timer interrupt. This behavior is enabled by setting the
  configuration option CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK.
  When KVM is enabled, CONFIG_KVM_XFER_TO_GUEST_WORK must also be set to
  ensure that any pending work, such as POSIX timer expiration, is handled
  before transitioning into guest mode.

Hard-IRQ and Soft-IRQ stacks
  Soft interrupts are handled in the thread context in which they are
  raised. If a soft interrupt is triggered from hard-IRQ context, its
  execution is deferred to the ksoftirqd thread. Preemption is never
  disabled during soft-interrupt handling, which makes soft interrupts
  preemptible.
  If an architecture provides a custom __do_softirq() implementation that
  uses a separate stack, it must select CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK.
  The functionality should only be enabled when CONFIG_SOFTIRQ_ON_OWN_STACK
  is set.

FPU and SIMD access in kernel mode
  FPU and SIMD registers are typically not used in kernel mode and are
  therefore not saved during kernel preemption. As a result, any kernel code
  that uses these registers must be enclosed within a kernel_fpu_begin() and
  kernel_fpu_end() section.
  The kernel_fpu_begin() function usually invokes local_bh_disable() to
  prevent interruptions from softirqs and to disable regular preemption.
  This allows the protected code to run safely in both thread and softirq
  contexts.
  On PREEMPT_RT kernels, however, kernel_fpu_begin() must not call
  local_bh_disable(). Instead, it should use preempt_disable(), since
  softirqs are always handled in thread context under PREEMPT_RT. In this
  case, disabling preemption alone is sufficient.
  The crypto subsystem operates on memory pages and requires users to "walk
  and map" these pages while processing a request. This operation must occur
  outside the kernel_fpu_begin()/kernel_fpu_end() section because it
  requires preemption to be enabled. These preemption points are generally
  sufficient to avoid excessive scheduling latency.

Exception handlers
  Exception handlers, such as the page-fault handler, typically enable
  interrupts early, before invoking any generic code to process the
  exception. This is necessary because handling a page fault may involve
  operations that can sleep.
  Enabling interrupts is especially important on PREEMPT_RT, where certain
  locks, such as spinlock_t, become sleepable. For example, handling an
  invalid opcode may result in sending a SIGILL signal to the user task, and
  a debug exception will send a SIGTRAP signal. In both cases, if the
  exception occurred in user space, it is safe to enable interrupts early.
  Sending a signal requires both interrupts and kernel preemption to be
  enabled.

Optional features
-----------------

Timer and clocksource
  A high-resolution clocksource and clockevents device are recommended. The
  clockevents device should support the CLOCK_EVT_FEAT_ONESHOT feature for
  optimal timer behavior. In most cases, microsecond-level accuracy is
  sufficient.

Lazy preemption
  This mechanism allows an in-kernel scheduling request for non-real-time
  tasks to be delayed until the task is about to return to user space. It
  helps avoid preempting a task that holds a sleeping lock at the time of
  the scheduling request.
  With CONFIG_GENERIC_IRQ_ENTRY enabled, supporting this feature requires
  defining a bit for TIF_NEED_RESCHED_LAZY, preferably near
  TIF_NEED_RESCHED.

Serial console with NBCON
  With PREEMPT_RT enabled, all console output is handled by a dedicated
  thread rather than directly from the context in which printk() is
  invoked. This design allows printk() to be safely used in atomic contexts.
  However, it also means that if the kernel crashes and cannot switch to the
  printing thread, no output will be visible, preventing the system from
  printing its final messages.
  There are exceptions for immediate output, such as during panic()
  handling. To support this, the console driver must implement new-style
  lock handling. This involves setting the CON_NBCON flag in console::flags
  and providing implementations for the write_atomic, write_thread,
  device_lock, and device_unlock callbacks.
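Once the requirements above are met, advertising RT support comes down to the
architecture's Kconfig symbol selecting ARCH_SUPPORTS_RT. A rough,
illustrative sketch for a hypothetical architecture (CONFIG_MYARCH is made
up; the exact select list varies per architecture):

```
config MYARCH
	bool
	default y
	# Advertise that all PREEMPT_RT prerequisites are implemented
	select ARCH_SUPPORTS_RT
	# POSIX CPU timers expire from task work, not the timer interrupt
	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
	# Interrupt handlers may be forced into threads
	select IRQ_FORCED_THREADING
```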
+242
Documentation/core-api/real-time/differences.rst
.. SPDX-License-Identifier: GPL-2.0

===========================
How realtime kernels differ
===========================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

With forced-threaded interrupts and sleeping spin locks, code paths that
previously caused long scheduling latencies have been made preemptible and
moved into process context. This allows the scheduler to manage them more
effectively and respond to higher-priority tasks with reduced latency.

The following chapters provide an overview of key differences between a
PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.

Locking
=======

Spinning locks such as spinlock_t are used to provide synchronization for
data structures accessed from both interrupt context and process context. For
this reason, locking functions are also available with the _irq() or
_irqsave() suffixes, which disable interrupts before acquiring the lock. This
ensures that the lock can be safely acquired in process context when
interrupts are enabled.

However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
run in hard IRQ context. As a result, there is no need to disable interrupts
as part of the locking procedure when using spinlock_t.

For low-level core components such as interrupt handling, the scheduler, or
the timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
traditional semantics: it disables preemption and, when used with _irq() or
_irqsave(), also disables interrupts. This ensures proper synchronization in
critical sections that must remain non-preemptible or run with interrupts
disabled.

Execution context
=================

Interrupt handling in a PREEMPT_RT system is invoked in process context
through the use of threaded interrupts. Other parts of the kernel also shift
their execution into threaded context by different mechanisms. The goal is to
keep execution paths preemptible, allowing the scheduler to interrupt them
when a higher-priority task needs to run.

Below is an overview of the kernel subsystems involved in this transition to
threaded, preemptible execution.

Interrupt handling
------------------

All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
IRQF_ONESHOT flags.

The IRQF_ONESHOT flag is used together with threaded interrupts, meaning
those registered using request_threaded_irq() and providing only a threaded
handler. Its purpose is to keep the interrupt line masked until the threaded
handler has completed.

If a primary handler is also provided in this case, it is essential that the
handler does not acquire any sleeping locks, as it will not be threaded. The
handler should be minimal and must avoid introducing delays, such as
busy-waiting on hardware registers.

Soft interrupts, bottom half handling
-------------------------------------

Soft interrupts are raised by the interrupt handler and are executed after
the handler returns. Since they run in thread context, they can be preempted
by other threads. Do not assume that softirq context runs with preemption
disabled. This means you must not rely on mechanisms like local_bh_disable()
in process context to protect per-CPU variables. Because softirq handlers are
preemptible under PREEMPT_RT, this approach does not provide reliable
synchronization.

If this kind of protection is required for performance reasons, consider
using local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep
to verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
necessary locking to ensure proper protection.

Using local_lock_nested_bh() also makes the locking scope explicit and easier
for readers and maintainers to understand.

per-CPU variables
-----------------

Protecting access to per-CPU variables solely by using preempt_disable()
should be avoided, especially if the critical section has unbounded runtime
or may call APIs that can sleep.

If using a spinlock_t is considered too costly for performance reasons,
consider using local_lock_t. On non-PREEMPT_RT configurations, this
introduces no runtime overhead when lockdep is disabled. With lockdep
enabled, it verifies that the lock is only acquired in process context and
never from softirq or hard IRQ context.

On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU
spinlock_t, which provides safe local protection for per-CPU data while
keeping the system preemptible.

Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be
used to protect per-CPU data by relying on implicit preemption disabling. If
this inherited preemption disabling is essential, and if local_lock_t cannot
be used due to performance constraints, brevity of the code, or abstraction
boundaries within an API, then preempt_disable_nested() may be a suitable
alternative. On non-PREEMPT_RT kernels, it verifies with lockdep that
preemption is already disabled. On PREEMPT_RT, it explicitly disables
preemption.

Timers
------

By default, an hrtimer is executed in hard interrupt context. The exception
is timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
softirq context.

On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
softirq context by default, typically within the ktimersd thread. This thread
runs at the lowest real-time priority, ensuring it executes before any
SCHED_OTHER tasks but does not interfere with higher-priority real-time
threads. To explicitly request execution in hard interrupt context on
PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.

Memory allocation
-----------------

The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it
is necessary to use GFP_ATOMIC when allocating memory from interrupt context
or from sections where preemption is disabled. This is because the allocator
must not sleep in these contexts while waiting for memory to become
available.

However, this approach does not work on PREEMPT_RT kernels. The memory
allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
acquired when preemption is disabled. Fortunately, this is generally not a
problem, because PREEMPT_RT moves most contexts that would traditionally run
with preemption or interrupts disabled into threaded context, where sleeping
is allowed.

What remains problematic is code that explicitly disables preemption or
interrupts. In such cases, memory allocation must be performed outside the
critical section.

This restriction also applies to memory deallocation routines such as kfree()
and free_pages(), which may also involve internal locking and must not be
called from non-preemptible contexts.

IRQ work
--------

The irq_work API provides a mechanism to schedule a callback in interrupt
context. It is designed for use in contexts where traditional scheduling is
not possible, such as from within NMI handlers or from inside the scheduler,
where using a workqueue would be unsafe.

On non-PREEMPT_RT systems, all irq_work items are executed immediately in
interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the
next timer tick but are still executed in interrupt context.

On PREEMPT_RT systems, the execution model changes. Because irq_work
callbacks may acquire sleeping locks or have unbounded execution time, they
are handled in thread context by a per-CPU irq_work kernel thread. This
thread runs at the lowest real-time priority, ensuring it executes before any
SCHED_OTHER tasks but does not interfere with higher-priority real-time
threads.

The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still
executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
deferred until the next timer tick and are also executed by the per-CPU
irq_work thread.

RCU callbacks
-------------

RCU callbacks are invoked by default in softirq context. Their execution is
important because, depending on the use case, they either free memory or
ensure progress in state transitions. Running these callbacks as part of the
softirq chain can lead to undesired situations, such as contention for CPU
resources with other SCHED_OTHER tasks when executed within ksoftirqd.

To avoid running callbacks in softirq context, the RCU subsystem provides a
mechanism to execute them in process context instead. This behavior can be
enabled by setting the boot command-line parameter rcutree.use_softirq=0.
This setting is enforced in kernels configured with PREEMPT_RT.

Spin until ready
================

The "spin until ready" pattern involves repeatedly checking (spinning on) the
state of a data structure until it becomes available. This pattern assumes
that preemption, soft interrupts, or interrupts are disabled. If the data
structure is marked busy, it is presumed to be in use by another CPU, and
spinning should eventually succeed as that CPU makes progress.

Some examples are hrtimer_cancel() and timer_delete_sync(). These functions
cancel timers that execute with interrupts or soft interrupts disabled. If a
thread attempts to cancel a timer and finds it active, spinning until the
callback completes is safe because the callback can only run on another CPU
and will eventually finish.

On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
introduces a challenge: a higher-priority thread attempting to cancel the
timer may preempt the timer callback thread. Since the scheduler cannot
migrate the callback thread to another CPU due to affinity constraints,
spinning can result in livelock even on multiprocessor systems.

To avoid this, both the canceling and callback sides must use a handshake
mechanism that supports priority inheritance. This allows the canceling
thread to suspend until the callback completes, ensuring forward progress
without risking livelock.

To solve the problem at the API level, the sequence locks were extended to
allow a proper handover between the spinning reader and the possibly blocked
writer.

Sequence locks
--------------

Sequence counters and sequential locks are documented in
Documentation/locking/seqlock.rst.

The interface has been extended to ensure proper preemption states for the
writer and spinning reader contexts. This is achieved by embedding the writer
serialization lock directly into the sequence counter type, resulting in
composite types such as seqcount_spinlock_t or seqcount_mutex_t.

These composite types allow readers to detect an ongoing write and actively
boost the writer's priority to help it complete its update, instead of
spinning and waiting for its completion.

If the plain seqcount_t is used, extra care must be taken to synchronize the
reader with the writer during updates. The writer must ensure its update is
serialized and non-preemptible relative to the reader. This cannot be
achieved using a regular spinlock_t because spinlock_t on PREEMPT_RT does not
disable preemption. In such cases, using seqcount_spinlock_t is the preferred
solution.

However, if there is no spinning involved, i.e., if the reader only needs to
detect whether a write has started rather than serialize against it, then
using seqcount_t is reasonable.
+16
Documentation/core-api/real-time/index.rst
.. SPDX-License-Identifier: GPL-2.0

====================
Real-time preemption
====================

This documentation is intended for Linux kernel developers and contributors
interested in the inner workings of PREEMPT_RT. It explains key concepts and
the required changes compared to a non-PREEMPT_RT configuration.

.. toctree::
   :maxdepth: 2

   theory
   differences
   architecture-porting
+116
Documentation/core-api/real-time/theory.rst
.. SPDX-License-Identifier: GPL-2.0

===================
Theory of operation
===================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

PREEMPT_RT transforms the Linux kernel into a real-time kernel. It achieves
this by replacing locking primitives, such as spinlock_t, with a preemptible
and priority-inheritance aware implementation known as rtmutex, and by
enforcing the use of threaded interrupts. As a result, the kernel becomes
fully preemptible, with the exception of a few critical code paths, including
entry code, the scheduler, and low-level interrupt handling routines.

This transformation places the majority of kernel execution contexts under
the control of the scheduler and significantly increases the number of
preemption points. Consequently, it reduces the latency between a
high-priority task becoming runnable and its actual execution on the CPU.

Scheduling
==========

The core principles of Linux scheduling and the associated user-space API are
documented in the `sched(7) <https://man7.org/linux/man-pages/man7/sched.7.html>`_
man page. By default, the Linux kernel uses the SCHED_OTHER scheduling
policy. Under this policy, a task is preempted when the scheduler determines
that it has consumed a fair share of CPU time relative to other runnable
tasks. However, the policy does not guarantee immediate preemption when a new
SCHED_OTHER task becomes runnable; the currently running task may continue
executing.

This behavior differs from that of real-time scheduling policies such as
SCHED_FIFO. When a task with a real-time policy becomes runnable, the
scheduler immediately selects it for execution if it has a higher priority
than the currently running task. The task continues to run until it
voluntarily yields the CPU, typically by blocking on an event.

Sleeping spin locks
===================

The various lock types and their behavior under real-time configurations are
described in detail in Documentation/locking/locktypes.rst.
In a non-PREEMPT_RT configuration, a spinlock_t is acquired by first
disabling preemption and then actively spinning until the lock becomes
available. Once the lock is released, preemption is enabled. From a real-time
perspective, this approach is undesirable because disabling preemption
prevents the scheduler from switching to a higher-priority task, potentially
increasing latency.

To address this, PREEMPT_RT replaces spinning locks with sleeping spin locks
that do not disable preemption. On PREEMPT_RT, spinlock_t is implemented
using rtmutex. Instead of spinning, a task attempting to acquire a contended
lock disables CPU migration, donates its priority to the lock owner (priority
inheritance), and voluntarily schedules out while waiting for the lock to
become available.

Disabling CPU migration provides the same effect as disabling preemption,
while still allowing preemption and ensuring that the task continues to run
on the same CPU while holding a sleeping lock.

Priority inheritance
====================

Lock types such as spinlock_t and mutex_t in a PREEMPT_RT enabled kernel are
implemented on top of rtmutex, which provides support for priority
inheritance (PI). When a task blocks on such a lock, the PI mechanism
temporarily propagates the blocked task's scheduling parameters to the lock
owner.

For example, if a SCHED_FIFO task A blocks on a lock currently held by a
SCHED_OTHER task B, task A's scheduling policy and priority are temporarily
inherited by task B. After this inheritance, task A is put to sleep while
waiting for the lock, and task B effectively becomes the highest-priority
task in the system. This allows B to continue executing, make progress, and
eventually release the lock.

Once B releases the lock, it reverts to its original scheduling parameters,
and task A can resume execution.

Threaded interrupts
===================

Interrupt handlers are another source of code that executes with preemption
disabled and outside the control of the scheduler. To bring interrupt
handling under scheduler control, PREEMPT_RT enforces threaded interrupt
handlers.

With forced threading, interrupt handling is split into two stages. The first
stage, the primary handler, is executed in IRQ context with interrupts
disabled. Its sole responsibility is to wake the associated threaded handler.
The second stage, the threaded handler, is the function passed to
request_irq() as the interrupt handler. It runs in process context, scheduled
by the kernel.

From waking the interrupt thread until threaded handling is completed, the
interrupt source is masked in the interrupt controller. This ensures that the
device interrupt remains pending but does not retrigger the CPU, allowing the
system to exit IRQ context and handle the interrupt in a scheduled thread.

By default, the threaded handler executes with the SCHED_FIFO scheduling
policy and a priority of 50 (MAX_RT_PRIO / 2), which is midway between the
minimum and maximum real-time priorities.

If the threaded interrupt handler raises any soft interrupts during its
execution, those soft interrupt routines are invoked after the threaded
handler completes, within the same thread. Preemption remains enabled during
the execution of the soft interrupt handler.
Summary
=======

By using sleeping locks and forced-threaded interrupts, PREEMPT_RT
significantly reduces the sections of code where interrupts or preemption are
disabled, allowing the scheduler to preempt the current execution context and
switch to a higher-priority task.