Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

docs: networking: document NAPI

Add basic documentation about NAPI. We can stop linking to the ancient
doc on the LF wiki.

Link: https://lore.kernel.org/all/20230315223044.471002-1-kuba@kernel.org/
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> # for ctucanfd-driver.rst
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20230322053848.198452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

+270 -14
+1 -2
Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
···
 enabling interrupts, handling an incoming IRQ in ISR, re-enabling the
 softirq and switching context back to softirq.
 
-More detailed documentation of NAPI may be found on the pages of Linux
-Foundation `<https://wiki.linuxfoundation.org/networking/napi>`_.
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
 
 Integrating the core to Xilinx Zynq
 -----------------------------------
+1 -2
Documentation/networking/device_drivers/ethernet/intel/e100.rst
···
 
 NAPI (Rx polling mode) is supported in the e100 driver.
 
-See https://wiki.linuxfoundation.org/networking/napi for more
-information on NAPI.
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
 
 Multiple Interfaces on Same Ethernet Broadcast Network
 ------------------------------------------------------
+2 -2
Documentation/networking/device_drivers/ethernet/intel/i40e.rst
···
 NAPI
 ----
 NAPI (Rx polling mode) is supported in the i40e driver.
-For more information on NAPI, see
-https://wiki.linuxfoundation.org/networking/napi
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
 
 Flow Control
 ------------
+3 -3
Documentation/networking/device_drivers/ethernet/intel/ice.rst
···
 
 NAPI
 ----
-This driver supports NAPI (Rx polling mode).
-For more information on NAPI, see
-https://wiki.linuxfoundation.org/networking/napi
 
+This driver supports NAPI (Rx polling mode).
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
 
 MACVLAN
 -------
+1
Documentation/networking/index.rst
···
 mpls-sysctl
 mptcp-sysctl
 multiqueue
+napi
 netconsole
 netdev-features
 netdevices
+254
Documentation/networking/napi.rst
···
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+.. _napi:
+
+====
+NAPI
+====
+
+NAPI is the event handling mechanism used by the Linux networking stack.
+The name NAPI no longer stands for anything in particular [#]_.
+
+In basic operation the device notifies the host about new events
+via an interrupt.
+The host then schedules a NAPI instance to process the events.
+The device may also be polled for events via NAPI without receiving
+interrupts first (:ref:`busy polling<poll>`).
+
+NAPI processing usually happens in the software interrupt context,
+but there is an option to use :ref:`separate kernel threads<threaded>`
+for NAPI processing.
+
+All in all NAPI abstracts away from the drivers the context and configuration
+of event (packet Rx and Tx) processing.
+
+Driver API
+==========
+
+The two most important elements of NAPI are the struct napi_struct
+and the associated poll method. struct napi_struct holds the state
+of the NAPI instance while the method is the driver-specific event
+handler. The method will typically free Tx packets that have been
+transmitted and process newly received packets.
+
+.. _drv_ctrl:
+
+Control API
+-----------
+
+netif_napi_add() and netif_napi_del() add/remove a NAPI instance
+from the system. The instances are attached to the netdevice passed
+as argument (and will be deleted automatically when the netdevice is
+unregistered). Instances are added in a disabled state.
+
+napi_enable() and napi_disable() manage the disabled state.
+A disabled NAPI can't be scheduled and its poll method is guaranteed
+to not be invoked. napi_disable() waits for ownership of the NAPI
+instance to be released.
+
+The control APIs are not idempotent. Control API calls are safe against
+concurrent use of datapath APIs but an incorrect sequence of control API
+calls may result in crashes, deadlocks, or race conditions. For example,
+calling napi_disable() multiple times in a row will deadlock.
+
+Datapath API
+------------
+
+napi_schedule() is the basic method of scheduling a NAPI poll.
+Drivers should call this function in their interrupt handler
+(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
+will take ownership of the NAPI instance.
+
+Later, after NAPI is scheduled, the driver's poll method will be
+called to process the events/packets. The method takes a ``budget``
+argument - drivers can process completions for any number of Tx
+packets but should only process up to ``budget`` number of
+Rx packets. Rx processing is usually much more expensive.
+
+In other words, it is recommended to ignore the budget argument when
+performing Tx buffer reclamation to ensure that the reclamation is not
+arbitrarily bounded; however, it is required to honor the budget argument
+for Rx processing.
+
+.. warning::
+
+   The ``budget`` argument may be 0 if the core tries to process
+   only Tx completions and no Rx packets.
+
+The poll method returns the amount of work done. If the driver still
+has outstanding work to do (e.g. ``budget`` was exhausted)
+the poll method should return exactly ``budget``. In that case,
+the NAPI instance will be serviced/polled again (without the
+need to be scheduled).
+
+If event processing has been completed (all outstanding packets
+processed) the poll method should call napi_complete_done()
+before returning. napi_complete_done() releases the ownership
+of the instance.
+
+.. warning::
+
+   The case of finishing all events and using exactly ``budget``
+   must be handled carefully. There is no way to report this
+   (rare) condition to the stack, so the driver must either
+   not call napi_complete_done() and wait to be called again,
+   or return ``budget - 1``.
+
+   If the ``budget`` is 0 napi_complete_done() should never be called.
+
+Call sequence
+-------------
+
+Drivers should not make assumptions about the exact sequencing
+of calls. The poll method may be called without the driver scheduling
+the instance (unless the instance is disabled). Similarly,
+it's not guaranteed that the poll method will be called, even
+if napi_schedule() succeeded (e.g. if the instance gets disabled).
+
+As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
+calls to the poll method only wait for the ownership of the instance
+to be released, not for the poll method to exit. This means that
+drivers should avoid accessing any data structures after calling
+napi_complete_done().
+
+.. _drv_sched:
+
+Scheduling and IRQ masking
+--------------------------
+
+Drivers should keep the interrupts masked after scheduling
+the NAPI instance - until NAPI polling finishes, any further
+interrupts are unnecessary.
+
+Drivers which have to mask the interrupts explicitly (as opposed
+to the IRQ being auto-masked by the device) should use the napi_schedule_prep()
+and __napi_schedule() calls:
+
+.. code-block:: c
+
+  if (napi_schedule_prep(&v->napi)) {
+      mydrv_mask_rxtx_irq(v->idx);
+      /* schedule after masking to avoid races */
+      __napi_schedule(&v->napi);
+  }
+
+The IRQ should only be unmasked after a successful call to napi_complete_done():
+
+.. code-block:: c
+
+  if (budget && napi_complete_done(&v->napi, work_done)) {
+      mydrv_unmask_rxtx_irq(v->idx);
+      return min(work_done, budget - 1);
+  }
+
+napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
+of guarantees given by being invoked in IRQ context (no need to
+mask interrupts). Note that PREEMPT_RT forces all interrupts
+to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD``
+to avoid issues on real-time kernel configurations.
+
+Instance to queue mapping
+-------------------------
+
+Modern devices have multiple NAPI instances (struct napi_struct) per
+interface. There is no strong requirement on how the instances are
+mapped to queues and interrupts. NAPI is primarily a polling/processing
+abstraction without specific user-facing semantics. That said, most networking
+devices end up using NAPI in fairly similar ways.
+
+NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
+(queue pair is a set of a single Rx and single Tx queue).
+
+In less common cases a NAPI instance may be used for multiple queues
+or Rx and Tx queues can be serviced by separate NAPI instances on a single
+core. Regardless of the queue assignment, however, there is usually still
+a 1:1 mapping between NAPI instances and interrupts.
+
+It's worth noting that the ethtool API uses a "channel" terminology where
+each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
+what constitutes a channel; the recommended interpretation is to understand
+a channel as an IRQ/NAPI which services queues of a given type. For example,
+a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
+to utilize 3 interrupts, 2 Rx and 2 Tx queues.
+
+User API
+========
+
+User interactions with NAPI depend on the NAPI instance ID. The instance IDs
+are only visible to the user through the ``SO_INCOMING_NAPI_ID`` socket option.
+It's not currently possible to query the IDs used by a given device.
+
+Software IRQ coalescing
+-----------------------
+
+NAPI does not perform any explicit event coalescing by default.
+In most scenarios batching happens due to IRQ coalescing which is done
+by the device. There are cases where software coalescing is helpful.
+
+NAPI can be configured to arm a repoll timer instead of unmasking
+the hardware interrupts as soon as all packets are processed.
+The ``gro_flush_timeout`` sysfs configuration of the netdevice
+is reused to control the delay of the timer, while
+``napi_defer_hard_irqs`` controls the number of consecutive empty polls
+before NAPI gives up and goes back to using hardware IRQs.
+
+.. _poll:
+
+Busy polling
+------------
+
+Busy polling allows a user process to check for incoming packets before
+the device interrupt fires. As is the case with any busy polling it trades
+off CPU cycles for lower latency (production uses of NAPI busy polling
+are not well known).
+
+Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
+selected sockets or using the global ``net.core.busy_poll`` and
+``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
+also exists.
+
+IRQ mitigation
+--------------
+
+While busy polling is supposed to be used by low latency applications,
+a similar mechanism can be used for IRQ mitigation.
+
+Very high request-per-second applications (especially routing/forwarding
+applications and especially applications using AF_XDP sockets) may not
+want to be interrupted until they finish processing a request or a batch
+of packets.
+
+Such applications can pledge to the kernel that they will perform a busy
+polling operation periodically, and the driver should keep the device IRQs
+permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
+socket option. To avoid system misbehavior the pledge is revoked
+if ``gro_flush_timeout`` passes without any busy poll call.
+
+The NAPI budget for busy polling is lower than the default (which makes
+sense given the low latency intention of normal busy polling). This is
+not the case with IRQ mitigation, however, so the budget can be adjusted
+with the ``SO_BUSY_POLL_BUDGET`` socket option.
+
+.. _threaded:
+
+Threaded NAPI
+-------------
+
+Threaded NAPI is an operating mode that uses dedicated kernel
+threads rather than software IRQ context for NAPI processing.
+The configuration is per netdevice and will affect all
+NAPI instances of that device. Each NAPI instance will spawn a separate
+thread (called ``napi/${ifc-name}-${napi-id}``).
+
+It is recommended to pin each kernel thread to a single CPU, the same
+CPU as the CPU which services the interrupt. Note that the mapping
+between IRQs and NAPI instances may not be trivial (and is driver
+dependent). The NAPI instance IDs will be assigned in the opposite
+order to the process IDs of the kernel threads.
+
+Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
+the netdev's sysfs directory.
+
+.. rubric:: Footnotes
+
+.. [#] NAPI was originally referred to as New API in 2.4 Linux.
+8 -5
include/linux/netdevice.h
··· 509 509 return false; 510 510 } 511 511 512 - bool napi_complete_done(struct napi_struct *n, int work_done); 513 512 /** 514 - * napi_complete - NAPI processing complete 515 - * @n: NAPI context 513 + * napi_complete_done - NAPI processing complete 514 + * @n: NAPI context 515 + * @work_done: number of packets processed 516 516 * 517 - * Mark NAPI processing as complete. 518 - * Consider using napi_complete_done() instead. 517 + * Mark NAPI processing as complete. Should only be called if poll budget 518 + * has not been completely consumed. 519 + * Prefer over napi_complete(). 519 520 * Return false if device should avoid rearming interrupts. 520 521 */ 522 + bool napi_complete_done(struct napi_struct *n, int work_done); 523 + 521 524 static inline bool napi_complete(struct napi_struct *n) 522 525 { 523 526 return napi_complete_done(n, 0);
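For reference, the per-netdevice knobs mentioned in the new document (software IRQ coalescing and threaded NAPI) are plain sysfs files; a configuration sketch, assuming a hypothetical interface ``eth0``, illustrative values, and root privileges:

```shell
# Software IRQ coalescing: instead of re-enabling hardware IRQs, repoll
# after 50 microseconds (gro_flush_timeout is in nanoseconds) and give up
# after 2 consecutive empty polls.
echo 50000 > /sys/class/net/eth0/gro_flush_timeout
echo 2 > /sys/class/net/eth0/napi_defer_hard_irqs

# Threaded NAPI: move all NAPI instances of eth0 to kernel threads.
echo 1 > /sys/class/net/eth0/threaded
```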