
net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockopt

This patch provides a setsockopt method that lets applications adjust
how many descriptors are handled at most in one send syscall. It
mitigates the situation where the default value (32) is too small,
leading to a higher frequency of send syscalls.

Considering the variety and complexity of applications, there is no
single ideal value fitting all cases, so keep 32 as the default value
as before.

The patch does the following things:
- Add the XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase, as a
per-socket control.
- Restrict max_tx_budget to the range [32, xs->tx->nentries].
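As a sketch of how an application might use the new option, the helper below wraps the setsockopt call and mirrors the kernel's range check. The wrapper name `set_tx_budget` and the helper `budget_in_range` are illustrative, not part of any API; the option number 9 and the [32, nentries] bounds come from this patch.

```c
#include <sys/socket.h>

#ifndef SOL_XDP
#define SOL_XDP 283              /* from <linux/socket.h> */
#endif
#ifndef XDP_MAX_TX_SKB_BUDGET
#define XDP_MAX_TX_SKB_BUDGET 9  /* option number added by this patch */
#endif
#define TX_BATCH_SIZE 32         /* kernel default and lower bound */

/* Mirrors the kernel's validation: budget must lie in
 * [TX_BATCH_SIZE, xs->tx->nentries], else setsockopt fails. */
static int budget_in_range(unsigned int budget, unsigned int tx_nentries)
{
	return budget >= TX_BATCH_SIZE && budget <= tx_nentries;
}

/* Hypothetical wrapper: fd must be an AF_XDP socket whose tx ring is
 * already configured, since the kernel checks xs->tx->nentries. */
static int set_tx_budget(int fd, unsigned int budget)
{
	return setsockopt(fd, SOL_XDP, XDP_MAX_TX_SKB_BUDGET,
			  &budget, sizeof(budget));
}
```

Note that the call must happen after the tx ring is set up; on a socket without a tx ring the kernel rejects the option.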

The idea behind this comes out of real workloads in production. We use a
user-space stack with XSK support to accelerate sending packets and
minimize syscalls. When packets are aggregated, it is not hard to hit
the upper bound (namely, 32). The moment the user-space stack sees the
-EAGAIN error from sendto(), it loops and tries again until all the
expected descriptors in the tx ring have been sent out to the driver.
Enlarging XDP_MAX_TX_SKB_BUDGET reduces the frequency of sendto() calls
and raises throughput/PPS.

Here is what I did in production, along with some numbers:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations with the default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) the socket sndbuf of 212992 determined by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, I counted how many
descriptors are transmitted to the driver per sendto() call (based on
patch [1]) and then calculated the probability of hitting the upper
bound. I finally chose 128 as a suitable value because 1) it covers
most of the cases, and 2) a higher number would not bring evident
gains. After tuning the parameters, I observed a stable improvement of
around 4% in both PPS and throughput, and lower resource consumption
as reported by strace -c -p xxx:
1) %time decreased by 7.8%
2) the error counter decreased from 18367 to 572
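The effect of the budget on syscall frequency can be seen with back-of-envelope arithmetic (this function is illustrative, not from the commit): draining N aggregated descriptors at B descriptors per sendto() costs ceil(N / B) syscalls, so raising the budget from 32 to 128 cuts the call count by up to 4x.

```c
/* Syscalls needed to drain `ndescs` tx descriptors when each
 * sendto() handles at most `budget` of them: ceil(ndescs / budget). */
static unsigned int sendto_calls(unsigned int ndescs, unsigned int budget)
{
	return (ndescs + budget - 1) / budget;
}
```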

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Authored by Jason Xing, committed by Paolo Abeni
45e359be c65d3429
+31 -2
Documentation/networking/af_xdp.rst (+9)
···
 Once the option is set, kernel will refuse attempts to bind that socket
 to a different interface. Updating the value requires CAP_NET_RAW.

+XDP_MAX_TX_SKB_BUDGET setsockopt
+--------------------------------
+
+This setsockopt sets the maximum number of descriptors that can be handled
+and passed to the driver at one send syscall. It is applied in the copy
+mode to allow application to tune the per-socket maximum iteration for
+better throughput and less frequency of send syscall.
+Allowed range is [32, xs->tx->nentries].
+
 XDP_STATISTICS getsockopt
 -------------------------
···
include/net/xdp_sock.h (+1)
···
 	struct list_head map_list;
 	/* Protects map_list */
 	spinlock_t map_list_lock;
+	u32 max_tx_budget;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 	struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
···
include/uapi/linux/if_xdp.h (+1)
···
 #define XDP_UMEM_COMPLETION_RING 6
 #define XDP_STATISTICS 7
 #define XDP_OPTIONS 8
+#define XDP_MAX_TX_SKB_BUDGET 9

 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
···
net/xdp/xsk.c (+19 -2)
···
 #include "xsk.h"

 #define TX_BATCH_SIZE 32
-#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
+#define MAX_PER_SOCKET_BUDGET 32

 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
···
 static int __xsk_generic_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
-	u32 max_batch = TX_BATCH_SIZE;
 	bool sent_frame = false;
 	struct xdp_desc desc;
 	struct sk_buff *skb;
+	u32 max_batch;
 	int err = 0;

 	mutex_lock(&xs->mutex);
···
 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;

+	max_batch = READ_ONCE(xs->max_tx_budget);
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
 		if (max_batch-- == 0) {
 			err = -EAGAIN;
···
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_MAX_TX_SKB_BUDGET:
+	{
+		unsigned int budget;
+
+		if (optlen != sizeof(budget))
+			return -EINVAL;
+		if (copy_from_sockptr(&budget, optval, sizeof(budget)))
+			return -EFAULT;
+		if (!xs->tx ||
+		    budget < TX_BATCH_SIZE || budget > xs->tx->nentries)
+			return -EACCES;
+
+		WRITE_ONCE(xs->max_tx_budget, budget);
+		return 0;
+	}
 	default:
 		break;
 	}
···
 	xs = xdp_sk(sk);
 	xs->state = XSK_READY;
+	xs->max_tx_budget = TX_BATCH_SIZE;
 	mutex_init(&xs->mutex);

 	INIT_LIST_HEAD(&xs->map_list);
···
tools/include/uapi/linux/if_xdp.h (+1)
···
 #define XDP_UMEM_COMPLETION_RING 6
 #define XDP_STATISTICS 7
 #define XDP_OPTIONS 8
+#define XDP_MAX_TX_SKB_BUDGET 9

 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
···