
Documentation: Fill the gaps about entry/noinstr constraints

The entry/exit handling for exceptions, interrupts, syscalls and KVM is
not really documented except for some comments.

Fill the gaps.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

---

Changes since v3:
- s/nointr/noinstr/

Changes since v2:
- No big content changes, just style corrections, so it should be
pretty clean at this stage. In the light of this, I kept Mark's
Reviewed-by.
- Paul's style and paragraph re-writes
- Randy's style comments
- Add links to transition type sections

Documentation/core-api/entry.rst | 261 +++++++++++++++++++++++++++++++
Documentation/core-api/index.rst | 8 +
2 files changed, 269 insertions(+)
create mode 100644 Documentation/core-api/entry.rst

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220110105044.94423-1-nsaenzju@redhat.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

Authored by Thomas Gleixner, committed by Jonathan Corbet (bf026e2e dd774a07)

Documentation/core-api/entry.rst
Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

* Lockdep
* RCU / Context tracking
* Preemption counter
* Tracing
* Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
	...

	instrumentation_begin();
	handle_context();   // <-- instrumentable code
	instrumentation_end();

	...
	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.
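For instance, with a hypothetical instrumentable helper trace_hw_state()
(the name is illustrative only), a direct call from noinstr code is exactly
what this verification would flag, while the same call inside a flagged
range is legitimate:

.. code-block:: c

  noinstr void entry(void)
  {
	trace_hw_state();	// <-- wrong: instrumentable call in noinstr code

	instrumentation_begin();
	trace_hw_state();	// <-- correct: inside an instrumentable range
	instrumentation_end();
  }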
Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
	arch_syscall_enter(regs);
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	if (!invoke_syscall(regs, nr) && nr != -1)
		result_reg(regs) = __sys_ni_syscall(regs);
	instrumentation_end();

	syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

* Lockdep
* RCU / Context tracking
* Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc.
After that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

* Tracing
* RCU / Context tracking
* Lockdep

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine-grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.


KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guest mode at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscall
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.
The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
	arch_interrupt_enter(regs);
	state = irqentry_enter(regs);

	instrumentation_begin();

	irq_enter_rcu();
	invoke_irq_handler(regs, nr);
	irq_exit_rcu();

	instrumentation_end();

	irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked, in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.
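Schematically, the view of in_hardirq() across this sequence can be sketched
as follows (annotated sketch only, using the same hypothetical helpers as the
example above):

.. code-block:: c

	state = irqentry_enter(regs);	// in_hardirq() still returns false

	irq_enter_rcu();		// adds HARDIRQ_OFFSET: in_hardirq() == true
	invoke_irq_handler(regs, nr);	// handler runs in hardirq context
	irq_exit_rcu();			// removes HARDIRQ_OFFSET, may run softirqs

	irqentry_exit(regs, state);	// may schedule: requires !in_hardirq()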
NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

* Preemption counter
* Lockdep
* RCU / Context tracking
* Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();
	nmi_handler(regs);
	instrumentation_end();

	irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
	arch_nmi_enter(regs);

	debug_regs = save_debug_regs();

	if (user_mode(regs)) {
		state = irqentry_enter(regs);

		instrumentation_begin();
		user_mode_debug_handler(regs, debug_regs);
		instrumentation_end();

		irqentry_exit(regs, state);
	} else {
		state = irqentry_nmi_enter(regs);

		instrumentation_begin();
		kernel_mode_debug_handler(regs, debug_regs);
		instrumentation_end();

		irqentry_nmi_exit(regs, state);
	}
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.
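For completeness, the guest transition described in the `KVM`_ section can be
sketched in the same style. The arch_run_guest() helper and the loop structure
are illustrative only; the functions from the KVM section are real:

.. code-block:: c

  noinstr void kvm_vcpu_run_loop(struct kvm_vcpu *vcpu)
  {
	for (;;) {
		instrumentation_begin();
		xfer_to_guest_mode_handle_work(vcpu);	// subset of return-to-user work
		instrumentation_end();

		kvm_guest_enter_irqoff();	// KVM variant of exit_to_user_mode()
		arch_run_guest(vcpu);		// CPU executes in guest mode
		kvm_guest_exit_irqoff();	// KVM variant of enter_from_user_mode()
	}
  }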
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -44,6 +44,14 @@
    timekeeping
    errseq
 
+Low level entry and exit
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   entry
+
 Concurrency primitives
 ======================
 