Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'xdp-preferred-busy-polling'

Björn Töpel says:

====================
This series introduces three new features:

1. A new "heavy traffic" busy-polling variant that works in concert
with the existing napi_defer_hard_irqs and gro_flush_timeout knobs.

2. A new socket option that lets a user change the busy-polling NAPI
budget.

3. Allow busy-polling to be performed on XDP sockets.

The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic. That means that if the NAPI context is not scheduled,
the busy-polling logic will poll it. If, after busy-polling, the
budget is exceeded, the busy-polling logic will schedule the NAPI onto
the regular softirq handling.

One implication of the behavior above is that a busy/heavily loaded
NAPI context will never enter/allow for busy-polling. Some
applications prefer that most NAPI processing be done by busy-polling.

This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature"), and allow a user to defer the re-enabling of hard
interrupts and instead schedule the NAPI context from a watchdog
timer. When a user enables SO_PREFER_BUSY_POLL, again with the other
knobs enabled, and the NAPI context is being processed by a softirq,
the softirq NAPI processing will exit early to allow the busy-polling
to be performed.

If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will time out, and
regular softirq handling will resume.

In summary: heavy-traffic applications that prefer busy-polling over
softirq processing should use this option.

Patch 6 touches a lot of drivers, so the Cc: list is grossly long.

Example usage:

$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout

Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will time out and fall back to regular
softirq processing.

Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.

Performance simple UDP ping-pong:

A packet generator blasts UDP packets to a certain {src,dst}IP/port,
so a dedicated ksoftirqd will be busy handling the packets on a
certain core.

A simple UDP test program that does recvfrom/sendto is running at the
host end. Throughput in pps and RTT latency are measured at the
packet generator.

/proc/sys/net/core/busy_read is set to 20.

Min Max Avg (usec)

1. Blocking, 2-cores: 490Kpps
1218.192 1335.427 1271.083

2. Blocking, 1-core: 155Kpps
1327.195 17294.855 4761.367

3. Non-blocking, 2-cores: 475Kpps
1221.197 1330.465 1270.740

4. Non-blocking, 1-core: 3Kpps
29006.482 37260.465 33128.367

5. Non-blocking, prefer busy-poll, 1-core: 420Kpps
1202.535 5494.052 4885.443

Scenarios 2 and 5 show when the new option should be used. Throughput
goes from 155 to 420Kpps, average latency is similar, but the tail
latencies are much better for the latter.

Performance XDP sockets:

Again, a packet generator blasts UDP packets to a certain
{src,dst}IP/port.

Today, when running the XDP sockets sample on the same core as the
softirq handling, performance tanks, mainly because we do not yield to
user space when the XDP socket Rx queue is full.

# taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r
Rx: 64Kpps

# # preferred busy-polling, budget 8
# taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 8
Rx: 9.9Mpps
# # preferred busy-polling, budget 64
# taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 64
Rx: 19.3Mpps
# # preferred busy-polling, budget 256
# taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 256
Rx: 21.4Mpps
# # preferred busy-polling, budget 512
# taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 512
Rx: 21.7Mpps

Compared to the two-core case:
# taskset -c 4 ./xdpsock -i ens785f1 -q 20 -n 1 -r
Rx: 20.7Mpps

We're getting better single-core performance than two-core, for this
naïve drop scenario.

Performance netperf UDP_RR:

Note that netperf UDP_RR is not a heavy-traffic test, and preferred
busy-polling is not typically something we want to use here.

$ echo 20 | sudo tee /proc/sys/net/core/busy_read
$ netperf -H 192.168.1.1 -l 30 -t UDP_RR -v 2 -- \
-o min_latency,mean_latency,max_latency,stddev_latency,transaction_rate

busy-polling blocking sockets: 12,13.33,224,0.63,74731.177

I hacked netperf to use non-blocking sockets and re-ran:

busy-polling non-blocking sockets: 12,13.46,218,0.72,73991.172
prefer busy-polling non-blocking sockets: 12,13.62,221,0.59,73138.448

Using the preferred busy-polling mode does not impact performance.

The above tests were done for the 'ice' driver.

Thanks to Jakub for suggesting this busy-polling addition [1], and
Eric for all input/review!

Changes:

rfc-v1 [2] -> rfc-v2:
* Changed name from bias to prefer.
* Base the work on Eric's/Luigi's defer irq/gro timeout work.
* Proper GRO flushing.
* Build issues for some XDP drivers.

rfc-v2 [3] -> v1:
* Fixed broken qlogic build.
* Do not trigger an IPI (XDP socket wakeup) when busy-polling is
enabled.

v1 [4] -> v2:
* Added napi_id to socionext driver, and added Ilias' Acked-by. (Ilias)
* Added a samples patch to improve busy-polling for xdpsock/l2fwd.
* Correctly mark atomic operations with {WRITE,READ}_ONCE, to make
KCSAN and the code readers happy. (Eric)
* Check NAPI budget not to exceed U16_MAX. (Eric)
* Added kdoc.

v2 [5] -> v3:
* Collected Acked-by.
* Check NAPI disable prior to prefer busy-polling. (Jakub)
* Added napi_id registration for virtio-net. (Michael)
* Added napi_id registration for veth.

v3 [6] -> v4:
* Collected Acked-by/Reviewed-by.

[1] https://lore.kernel.org/netdev/20200925120652.10b8d7c5@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
[2] https://lore.kernel.org/bpf/20201028133437.212503-1-bjorn.topel@gmail.com/
[3] https://lore.kernel.org/bpf/20201105102812.152836-1-bjorn.topel@gmail.com/
[4] https://lore.kernel.org/bpf/20201112114041.131998-1-bjorn.topel@gmail.com/
[5] https://lore.kernel.org/bpf/20201116110416.10719-1-bjorn.topel@gmail.com/
[6] https://lore.kernel.org/bpf/20201119083024.119566-1-bjorn.topel@gmail.com/
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

+301 -108
+3
arch/alpha/include/uapi/asm/socket.h
···
124 124
125 125 #define SO_DETACH_REUSEPORT_BPF 68
126 126
127 + #define SO_PREFER_BUSY_POLL 69
128 + #define SO_BUSY_POLL_BUDGET 70
129 +
127 130 #if !defined(__KERNEL__)
128 131
129 132 #if __BITS_PER_LONG == 64
+3
arch/mips/include/uapi/asm/socket.h
···
135 135
136 136 #define SO_DETACH_REUSEPORT_BPF 68
137 137
138 + #define SO_PREFER_BUSY_POLL 69
139 + #define SO_BUSY_POLL_BUDGET 70
140 +
138 141 #if !defined(__KERNEL__)
139 142
140 143 #if __BITS_PER_LONG == 64
+3
arch/parisc/include/uapi/asm/socket.h
···
116 116
117 117 #define SO_DETACH_REUSEPORT_BPF 0x4042
118 118
119 + #define SO_PREFER_BUSY_POLL 0x4043
120 + #define SO_BUSY_POLL_BUDGET 0x4044
121 +
119 122 #if !defined(__KERNEL__)
120 123
121 124 #if __BITS_PER_LONG == 64
+3
arch/sparc/include/uapi/asm/socket.h
···
117 117
118 118 #define SO_DETACH_REUSEPORT_BPF 0x0047
119 119
120 + #define SO_PREFER_BUSY_POLL 0x0048
121 + #define SO_BUSY_POLL_BUDGET 0x0049
122 +
120 123 #if !defined(__KERNEL__)
121 124
122 125
+1 -1
drivers/net/ethernet/amazon/ena/ena_netdev.c
···
416 416 {
417 417 	int rc;
418 418
419 - 	rc = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev, rx_ring->qid);
419 + 	rc = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev, rx_ring->qid, 0);
420 420
421 421 	if (rc) {
422 422 		netif_err(rx_ring->adapter, ifup, rx_ring->netdev,
+1 -1
drivers/net/ethernet/broadcom/bnxt/bnxt.c
···
2884 2884 	if (rc)
2885 2885 		return rc;
2886 2886
2887 - 	rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
2887 + 	rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i, 0);
2888 2888 	if (rc < 0)
2889 2889 		return rc;
2890 2890
+1 -1
drivers/net/ethernet/cavium/thunder/nicvf_queues.c
···
770 770 	rq->caching = 1;
771 771
772 772 	/* Driver have no proper error path for failed XDP RX-queue info reg */
773 - 	WARN_ON(xdp_rxq_info_reg(&rq->xdp_rxq, nic->netdev, qidx) < 0);
773 + 	WARN_ON(xdp_rxq_info_reg(&rq->xdp_rxq, nic->netdev, qidx, 0) < 0);
774 774
775 775 	/* Send a mailbox msg to PF to config RQ */
776 776 	mbx.rq.msg = NIC_MBOX_MSG_RQ_CFG;
+1 -1
drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
···
3334 3334 		return 0;
3335 3335
3336 3336 	err = xdp_rxq_info_reg(&fq->channel->xdp_rxq, priv->net_dev,
3337 - 			       fq->flowid);
3337 + 			       fq->flowid, 0);
3338 3338 	if (err) {
3339 3339 		dev_err(dev, "xdp_rxq_info_reg failed\n");
3340 3340 		return err;
+1 -1
drivers/net/ethernet/intel/i40e/i40e_txrx.c
···
1447 1447 	/* XDP RX-queue info only needed for RX rings exposed to XDP */
1448 1448 	if (rx_ring->vsi->type == I40E_VSI_MAIN) {
1449 1449 		err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
1450 - 				       rx_ring->queue_index);
1450 + 				       rx_ring->queue_index, rx_ring->q_vector->napi.napi_id);
1451 1451 		if (err < 0)
1452 1452 			return err;
1453 1453 	}
+2 -2
drivers/net/ethernet/intel/ice/ice_base.c
···
306 306 	if (!xdp_rxq_info_is_reg(&ring->xdp_rxq))
307 307 		/* coverity[check_return] */
308 308 		xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
309 - 				 ring->q_index);
309 + 				 ring->q_index, ring->q_vector->napi.napi_id);
310 310
311 311 	ring->xsk_pool = ice_xsk_pool(ring);
312 312 	if (ring->xsk_pool) {
···
333 333 			/* coverity[check_return] */
334 334 			xdp_rxq_info_reg(&ring->xdp_rxq,
335 335 					 ring->netdev,
336 - 					 ring->q_index);
336 + 					 ring->q_index, ring->q_vector->napi.napi_id);
337 337
338 338 		err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
339 339 						 MEM_TYPE_PAGE_SHARED,
+1 -1
drivers/net/ethernet/intel/ice/ice_txrx.c
···
483 483 	if (rx_ring->vsi->type == ICE_VSI_PF &&
484 484 	    !xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
485 485 		if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
486 - 				     rx_ring->q_index))
486 + 				     rx_ring->q_index, rx_ring->q_vector->napi.napi_id))
487 487 			goto err;
488 488 	return 0;
489 489
+1 -1
drivers/net/ethernet/intel/igb/igb_main.c
···
4352 4352
4353 4353 	/* XDP RX-queue info */
4354 4354 	if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
4355 - 			     rx_ring->queue_index) < 0)
4355 + 			     rx_ring->queue_index, 0) < 0)
4356 4356 		goto err;
4357 4357
4358 4358 	return 0;
+1 -1
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
···
6577 6577
6578 6578 	/* XDP RX-queue info */
6579 6579 	if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, adapter->netdev,
6580 - 			     rx_ring->queue_index) < 0)
6580 + 			     rx_ring->queue_index, rx_ring->q_vector->napi.napi_id) < 0)
6581 6581 		goto err;
6582 6582
6583 6583 	rx_ring->xdp_prog = adapter->xdp_prog;
+1 -1
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
···
3493 3493
3494 3494 	/* XDP RX-queue info */
3495 3495 	if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, adapter->netdev,
3496 - 			     rx_ring->queue_index) < 0)
3496 + 			     rx_ring->queue_index, 0) < 0)
3497 3497 		goto err;
3498 3498
3499 3499 	rx_ring->xdp_prog = adapter->xdp_prog;
+1 -1
drivers/net/ethernet/marvell/mvneta.c
···
3227 3227 		return err;
3228 3228 	}
3229 3229
3230 - 	err = xdp_rxq_info_reg(&rxq->xdp_rxq, pp->dev, rxq->id);
3230 + 	err = xdp_rxq_info_reg(&rxq->xdp_rxq, pp->dev, rxq->id, 0);
3231 3231 	if (err < 0)
3232 3232 		goto err_free_pp;
3233 3233
+2 -2
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
···
2614 2614 	mvpp2_rxq_status_update(port, rxq->id, 0, rxq->size);
2615 2615
2616 2616 	if (priv->percpu_pools) {
2617 - 		err = xdp_rxq_info_reg(&rxq->xdp_rxq_short, port->dev, rxq->id);
2617 + 		err = xdp_rxq_info_reg(&rxq->xdp_rxq_short, port->dev, rxq->id, 0);
2618 2618 		if (err < 0)
2619 2619 			goto err_free_dma;
2620 2620
2621 - 		err = xdp_rxq_info_reg(&rxq->xdp_rxq_long, port->dev, rxq->id);
2621 + 		err = xdp_rxq_info_reg(&rxq->xdp_rxq_long, port->dev, rxq->id, 0);
2622 2622 		if (err < 0)
2623 2623 			goto err_unregister_rxq_short;
2624 2624
+1 -1
drivers/net/ethernet/mellanox/mlx4/en_rx.c
···
283 283 	ring->log_stride = ffs(ring->stride) - 1;
284 284 	ring->buf_size = ring->size * ring->stride + TXBB_SIZE;
285 285
286 - 	if (xdp_rxq_info_reg(&ring->xdp_rxq, priv->dev, queue_index) < 0)
286 + 	if (xdp_rxq_info_reg(&ring->xdp_rxq, priv->dev, queue_index, 0) < 0)
287 287 		goto err_ring;
288 288
289 289 	tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
+1 -1
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
···
434 434 	rq_xdp_ix = rq->ix;
435 435 	if (xsk)
436 436 		rq_xdp_ix += params->num_channels * MLX5E_RQ_GROUP_XSK;
437 - 	err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix);
437 + 	err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix, 0);
438 438 	if (err < 0)
439 439 		goto err_rq_xdp_prog;
440 440
+1 -1
drivers/net/ethernet/netronome/nfp/nfp_net_common.c
···
2533 2533
2534 2534 	if (dp->netdev) {
2535 2535 		err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, dp->netdev,
2536 - 				       rx_ring->idx);
2536 + 				       rx_ring->idx, rx_ring->r_vec->napi.napi_id);
2537 2537 		if (err < 0)
2538 2538 			return err;
2539 2539 	}
+1 -1
drivers/net/ethernet/qlogic/qede/qede_main.c
···
1762 1762
1763 1763 	/* Driver have no error path from here */
1764 1764 	WARN_ON(xdp_rxq_info_reg(&fp->rxq->xdp_rxq, edev->ndev,
1765 - 				 fp->rxq->rxq_id) < 0);
1765 + 				 fp->rxq->rxq_id, 0) < 0);
1766 1766
1767 1767 	if (xdp_rxq_info_reg_mem_model(&fp->rxq->xdp_rxq,
1768 1768 				       MEM_TYPE_PAGE_ORDER0,
+1 -1
drivers/net/ethernet/sfc/rx_common.c
···
262 262
263 263 	/* Initialise XDP queue information */
264 264 	rc = xdp_rxq_info_reg(&rx_queue->xdp_rxq_info, efx->net_dev,
265 - 			      rx_queue->core_index);
265 + 			      rx_queue->core_index, 0);
266 266
267 267 	if (rc) {
268 268 		netif_err(efx, rx_err, efx->net_dev,
+1 -1
drivers/net/ethernet/socionext/netsec.c
···
1304 1304 		goto err_out;
1305 1305 	}
1306 1306
1307 - 	err = xdp_rxq_info_reg(&dring->xdp_rxq, priv->ndev, 0);
1307 + 	err = xdp_rxq_info_reg(&dring->xdp_rxq, priv->ndev, 0, priv->napi.napi_id);
1308 1308 	if (err)
1309 1309 		goto err_out;
1310 1310
+1 -1
drivers/net/ethernet/ti/cpsw_priv.c
···
1186 1186 	pool = cpsw->page_pool[ch];
1187 1187 	rxq = &priv->xdp_rxq[ch];
1188 1188
1189 - 	ret = xdp_rxq_info_reg(rxq, priv->ndev, ch);
1189 + 	ret = xdp_rxq_info_reg(rxq, priv->ndev, ch, 0);
1190 1190 	if (ret)
1191 1191 		return ret;
1192 1192
+1 -1
drivers/net/hyperv/netvsc.c
···
1499 1499 		u64_stats_init(&nvchan->tx_stats.syncp);
1500 1500 		u64_stats_init(&nvchan->rx_stats.syncp);
1501 1501
1502 - 		ret = xdp_rxq_info_reg(&nvchan->xdp_rxq, ndev, i);
1502 + 		ret = xdp_rxq_info_reg(&nvchan->xdp_rxq, ndev, i, 0);
1503 1503
1504 1504 		if (ret) {
1505 1505 			netdev_err(ndev, "xdp_rxq_info_reg fail: %d\n", ret);
+1 -1
drivers/net/tun.c
···
780 780 	} else {
781 781 		/* Setup XDP RX-queue info, for new tfile getting attached */
782 782 		err = xdp_rxq_info_reg(&tfile->xdp_rxq,
783 - 				       tun->dev, tfile->queue_index);
783 + 				       tun->dev, tfile->queue_index, 0);
784 784 		if (err < 0)
785 785 			goto out;
786 786 		err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
+8 -4
drivers/net/veth.c
···
884 884 	for (i = 0; i < dev->real_num_rx_queues; i++) {
885 885 		struct veth_rq *rq = &priv->rq[i];
886 886
887 - 		netif_napi_add(dev, &rq->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
888 887 		napi_enable(&rq->xdp_napi);
889 888 	}
···
925 926 	for (i = 0; i < dev->real_num_rx_queues; i++) {
926 927 		struct veth_rq *rq = &priv->rq[i];
927 928
928 - 		err = xdp_rxq_info_reg(&rq->xdp_rxq, dev, i);
929 + 		netif_napi_add(dev, &rq->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
930 + 		err = xdp_rxq_info_reg(&rq->xdp_rxq, dev, i, rq->xdp_napi.napi_id);
929 931 		if (err < 0)
930 932 			goto err_rxq_reg;
···
952 952 err_reg_mem:
953 953 	xdp_rxq_info_unreg(&priv->rq[i].xdp_rxq);
954 954 err_rxq_reg:
955 - 	for (i--; i >= 0; i--)
956 - 		xdp_rxq_info_unreg(&priv->rq[i].xdp_rxq);
955 + 	for (i--; i >= 0; i--) {
956 + 		struct veth_rq *rq = &priv->rq[i];
957 +
958 + 		xdp_rxq_info_unreg(&rq->xdp_rxq);
959 + 		netif_napi_del(&rq->xdp_napi);
960 + 	}
957 961
958 962 	return err;
959 963 }
+1 -1
drivers/net/virtio_net.c
···
1485 1485 		if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
1486 1486 			schedule_delayed_work(&vi->refill, 0);
1487 1487
1488 - 		err = xdp_rxq_info_reg(&vi->rq[i].xdp_rxq, dev, i);
1488 + 		err = xdp_rxq_info_reg(&vi->rq[i].xdp_rxq, dev, i, vi->rq[i].napi.napi_id);
1489 1489 		if (err < 0)
1490 1490 			return err;
1491 1491
+1 -1
drivers/net/xen-netfront.c
···
2014 2014 	}
2015 2015
2016 2016 	err = xdp_rxq_info_reg(&queue->xdp_rxq, queue->info->netdev,
2017 - 			       queue->id);
2017 + 			       queue->id, 0);
2018 2018 	if (err) {
2019 2019 		netdev_err(queue->info->netdev, "xdp_rxq_info_reg failed\n");
2020 2020 		goto err_free_pp;
+2 -1
fs/eventpoll.c
···
397 397 	unsigned int napi_id = READ_ONCE(ep->napi_id);
398 398
399 399 	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on())
400 - 		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep);
400 + 		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep, false,
401 + 			       BUSY_POLL_BUDGET);
401 402 }
402 403
403 404 static inline void ep_reset_busy_poll_napi_id(struct eventpoll *ep)
+21 -14
include/linux/netdevice.h
···
350 350 };
351 351
352 352 enum {
353 - 	NAPI_STATE_SCHED,	/* Poll is scheduled */
354 - 	NAPI_STATE_MISSED,	/* reschedule a napi */
355 - 	NAPI_STATE_DISABLE,	/* Disable pending */
356 - 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
357 - 	NAPI_STATE_LISTED,	/* NAPI added to system lists */
358 - 	NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
359 - 	NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
353 + 	NAPI_STATE_SCHED,		/* Poll is scheduled */
354 + 	NAPI_STATE_MISSED,		/* reschedule a napi */
355 + 	NAPI_STATE_DISABLE,		/* Disable pending */
356 + 	NAPI_STATE_NPSVC,		/* Netpoll - don't dequeue from poll_list */
357 + 	NAPI_STATE_LISTED,		/* NAPI added to system lists */
358 + 	NAPI_STATE_NO_BUSY_POLL,	/* Do not add in napi_hash, no busy polling */
359 + 	NAPI_STATE_IN_BUSY_POLL,	/* sk_busy_loop() owns this NAPI */
360 + 	NAPI_STATE_PREFER_BUSY_POLL,	/* prefer busy-polling over softirq processing*/
360 361 };
361 362
362 363 enum {
363 - 	NAPIF_STATE_SCHED	 = BIT(NAPI_STATE_SCHED),
364 - 	NAPIF_STATE_MISSED	 = BIT(NAPI_STATE_MISSED),
365 - 	NAPIF_STATE_DISABLE	 = BIT(NAPI_STATE_DISABLE),
366 - 	NAPIF_STATE_NPSVC	 = BIT(NAPI_STATE_NPSVC),
367 - 	NAPIF_STATE_LISTED	 = BIT(NAPI_STATE_LISTED),
368 - 	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
369 - 	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
364 + 	NAPIF_STATE_SCHED		= BIT(NAPI_STATE_SCHED),
365 + 	NAPIF_STATE_MISSED		= BIT(NAPI_STATE_MISSED),
366 + 	NAPIF_STATE_DISABLE		= BIT(NAPI_STATE_DISABLE),
367 + 	NAPIF_STATE_NPSVC		= BIT(NAPI_STATE_NPSVC),
368 + 	NAPIF_STATE_LISTED		= BIT(NAPI_STATE_LISTED),
369 + 	NAPIF_STATE_NO_BUSY_POLL	= BIT(NAPI_STATE_NO_BUSY_POLL),
370 + 	NAPIF_STATE_IN_BUSY_POLL	= BIT(NAPI_STATE_IN_BUSY_POLL),
371 + 	NAPIF_STATE_PREFER_BUSY_POLL	= BIT(NAPI_STATE_PREFER_BUSY_POLL),
370 372 };
371 373
372 374 enum gro_result {
···
437 435 static inline bool napi_disable_pending(struct napi_struct *n)
438 436 {
439 437 	return test_bit(NAPI_STATE_DISABLE, &n->state);
438 + }
439 +
440 + static inline bool napi_prefer_busy_poll(struct napi_struct *n)
441 + {
442 + 	return test_bit(NAPI_STATE_PREFER_BUSY_POLL, &n->state);
440 443 }
441 444
442 445 bool napi_schedule_prep(struct napi_struct *n);
+21 -6
include/net/busy_poll.h
···
23 23  */
24 24 #define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
25 25
26 + #define BUSY_POLL_BUDGET 8
27 +
26 28 #ifdef CONFIG_NET_RX_BUSY_POLL
27 29
28 30 struct napi_struct;
···
45 43
46 44 void napi_busy_loop(unsigned int napi_id,
47 45 		    bool (*loop_end)(void *, unsigned long),
48 - 		    void *loop_end_arg);
46 + 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget);
49 47
50 48 #else /* CONFIG_NET_RX_BUSY_POLL */
51 49 static inline unsigned long net_busy_loop_on(void)
···
107 105 	unsigned int napi_id = READ_ONCE(sk->sk_napi_id);
108 106
109 107 	if (napi_id >= MIN_NAPI_ID)
110 - 		napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk);
108 + 		napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk,
109 + 			       READ_ONCE(sk->sk_prefer_busy_poll),
110 + 			       READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
111 111 #endif
112 112 }
···
135 131 	sk_rx_queue_set(sk, skb);
136 132 }
···
134 + static inline void __sk_mark_napi_id_once_xdp(struct sock *sk, unsigned int napi_id)
135 + {
136 + #ifdef CONFIG_NET_RX_BUSY_POLL
137 + 	if (!READ_ONCE(sk->sk_napi_id))
138 + 		WRITE_ONCE(sk->sk_napi_id, napi_id);
139 + #endif
140 + }
141 +
138 142 /* variant used for unconnected sockets */
139 143 static inline void sk_mark_napi_id_once(struct sock *sk,
140 144 					const struct sk_buff *skb)
141 145 {
142 - #ifdef CONFIG_NET_RX_BUSY_POLL
143 - 	if (!READ_ONCE(sk->sk_napi_id))
144 - 		WRITE_ONCE(sk->sk_napi_id, skb->napi_id);
145 - #endif
146 + 	__sk_mark_napi_id_once_xdp(sk, skb->napi_id);
147 + }
148 +
149 + static inline void sk_mark_napi_id_once_xdp(struct sock *sk,
150 + 					    const struct xdp_buff *xdp)
151 + {
152 + 	__sk_mark_napi_id_once_xdp(sk, xdp->rxq->napi_id);
146 153 }
147 154
148 155 #endif /* _LINUX_NET_BUSY_POLL_H */
+6
include/net/sock.h
···
301 301  *	@sk_ack_backlog: current listen backlog
302 302  *	@sk_max_ack_backlog: listen backlog set in listen()
303 303  *	@sk_uid: user id of owner
304 +  *	@sk_prefer_busy_poll: prefer busypolling over softirq processing
305 +  *	@sk_busy_poll_budget: napi processing budget when busypolling
304 306  *	@sk_priority: %SO_PRIORITY setting
305 307  *	@sk_type: socket type (%SOCK_STREAM, etc)
306 308  *	@sk_protocol: which protocol this socket belongs in this network family
···
481 479 	u32 sk_ack_backlog;
482 480 	u32 sk_max_ack_backlog;
483 481 	kuid_t sk_uid;
482 + #ifdef CONFIG_NET_RX_BUSY_POLL
483 + 	u8 sk_prefer_busy_poll;
484 + 	u16 sk_busy_poll_budget;
485 + #endif
484 486 	struct pid *sk_peer_pid;
485 487 	const struct cred *sk_peer_cred;
486 488 	long sk_rcvtimeo;
+2 -1
include/net/xdp.h
···
59 59 	u32 queue_index;
60 60 	u32 reg_state;
61 61 	struct xdp_mem_info mem;
62 + 	unsigned int napi_id;
62 63 } ____cacheline_aligned; /* perf critical, avoid false-sharing */
63 64
64 65 struct xdp_txq_info {
···
227 226 }
228 227
229 228 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
230 - 		     struct net_device *dev, u32 queue_index);
229 + 		     struct net_device *dev, u32 queue_index, unsigned int napi_id);
231 230 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
232 231 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
233 232 bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
+3
include/uapi/asm-generic/socket.h
···
119 119
120 120 #define SO_DETACH_REUSEPORT_BPF 68
121 121
122 + #define SO_PREFER_BUSY_POLL 69
123 + #define SO_BUSY_POLL_BUDGET 70
124 +
122 125 #if !defined(__KERNEL__)
123 126
124 127 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
+69 -22
net/core/dev.c
···
6458 6458
6459 6459 	WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
6460 6460
6461 - 	new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
6461 + 	new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
6462 + 		      NAPIF_STATE_PREFER_BUSY_POLL);
6462 6463
6463 6464 	/* If STATE_MISSED was set, leave STATE_SCHED set,
6464 6465 	 * because we will call napi->poll() one more time.
···
6496 6495
6497 6496 #if defined(CONFIG_NET_RX_BUSY_POLL)
6498 6497
6499 - #define BUSY_POLL_BUDGET 8
6500 -
6501 - static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
6498 + static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
6502 6499 {
6500 + 	if (!skip_schedule) {
6501 + 		gro_normal_list(napi);
6502 + 		__napi_schedule(napi);
6503 + 		return;
6504 + 	}
6505 +
6506 + 	if (napi->gro_bitmask) {
6507 + 		/* flush too old packets
6508 + 		 * If HZ < 1000, flush all packets.
6509 + 		 */
6510 + 		napi_gro_flush(napi, HZ >= 1000);
6511 + 	}
6512 +
6513 + 	gro_normal_list(napi);
6514 + 	clear_bit(NAPI_STATE_SCHED, &napi->state);
6515 + }
6516 +
6517 + static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, bool prefer_busy_poll,
6518 + 			   u16 budget)
6519 + {
6520 + 	bool skip_schedule = false;
6521 + 	unsigned long timeout;
6503 6522 	int rc;
6504 6523
6505 6524 	/* Busy polling means there is a high chance device driver hard irq
···
6536 6515
6537 6516 	local_bh_disable();
6518 + 	if (prefer_busy_poll) {
6519 + 		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
6520 + 		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
6521 + 		if (napi->defer_hard_irqs_count && timeout) {
6522 + 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
6523 + 			skip_schedule = true;
6524 + 		}
6525 + 	}
6526 +
6539 6527 	/* All we really want here is to re-enable device interrupts.
6540 6528 	 * Ideally, a new ndo_busy_poll_stop() could avoid another round.
6541 6529 	 */
6542 - 	rc = napi->poll(napi, BUSY_POLL_BUDGET);
6530 + 	rc = napi->poll(napi, budget);
6543 6531 	/* We can't gro_normal_list() here, because napi->poll() might have
6544 6532 	 * rearmed the napi (napi_complete_done()) in which case it could
6545 6533 	 * already be running on another CPU.
6546 6534 	 */
6547 - 	trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
6535 + 	trace_napi_poll(napi, rc, budget);
6548 6536 	netpoll_poll_unlock(have_poll_lock);
6549 - 	if (rc == BUSY_POLL_BUDGET) {
6550 - 		/* As the whole budget was spent, we still own the napi so can
6551 - 		 * safely handle the rx_list.
6552 - 		 */
6553 - 		gro_normal_list(napi);
6554 - 		__napi_schedule(napi);
6555 - 	}
6537 + 	if (rc == budget)
6538 + 		__busy_poll_stop(napi, skip_schedule);
6556 6539 	local_bh_enable();
6557 6540 }
6558 6541
6559 6542 void napi_busy_loop(unsigned int napi_id,
6560 6543 		    bool (*loop_end)(void *, unsigned long),
6561 - 		    void *loop_end_arg)
6544 + 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget)
6562 6545 {
6563 6546 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
6564 6547 	int (*napi_poll)(struct napi_struct *napi, int budget);
···
6590 6565 		 * we avoid dirtying napi->state as much as we can.
6591 6566 		 */
6592 6567 		if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
6593 - 			   NAPIF_STATE_IN_BUSY_POLL))
6568 + 			   NAPIF_STATE_IN_BUSY_POLL)) {
6569 + 			if (prefer_busy_poll)
6570 + 				set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
6594 6571 			goto count;
6572 + 		}
6595 6573 		if (cmpxchg(&napi->state, val,
6596 6574 			    val | NAPIF_STATE_IN_BUSY_POLL |
6597 - 			    NAPIF_STATE_SCHED) != val)
6575 + 			    NAPIF_STATE_SCHED) != val) {
6576 + 			if (prefer_busy_poll)
6577 + 				set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
6598 6578 			goto count;
6579 + 		}
6599 6580 		have_poll_lock = netpoll_poll_lock(napi);
6600 6581 		napi_poll = napi->poll;
6601 6582 	}
6602 - 	work = napi_poll(napi, BUSY_POLL_BUDGET);
6603 - 	trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
6583 + 	work = napi_poll(napi, budget);
6584 + 	trace_napi_poll(napi, work, budget);
6604 6585 	gro_normal_list(napi);
6605 6586 count:
6606 6587 	if (work > 0)
···
6619 6588
6620 6589 	if (unlikely(need_resched())) {
6621 6590 		if (napi_poll)
6622 - 			busy_poll_stop(napi, have_poll_lock);
6591 + 			busy_poll_stop(napi, have_poll_lock, prefer_busy_poll, budget);
6623 6592 		preempt_enable();
6624 6593 		rcu_read_unlock();
6625 6594 		cond_resched();
···
6630 6599 		cpu_relax();
6631 6600 	}
6632 6601 	if (napi_poll)
6633 - 		busy_poll_stop(napi, have_poll_lock);
6602 + 		busy_poll_stop(napi, have_poll_lock, prefer_busy_poll, budget);
6634 6603 	preempt_enable();
6635 6604 out:
6636 6605 	rcu_read_unlock();
···
6681 6650 	 * NAPI_STATE_MISSED, since we do not react to a device IRQ.
6682 6651 	 */
6683 6652 	if (!napi_disable_pending(napi) &&
6684 - 	    !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
6653 + 	    !test_and_set_bit(NAPI_STATE_SCHED, &napi->state)) {
6654 + 		clear_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
6685 6655 		__napi_schedule_irqoff(napi);
6656 + 	}
6686 6657
6687 6658 	return HRTIMER_NORESTART;
6688 6659 }
···
6742 6709
6743 6710 	hrtimer_cancel(&n->timer);
6744 6711
6712 + 	clear_bit(NAPI_STATE_PREFER_BUSY_POLL, &n->state);
6745 6713 	clear_bit(NAPI_STATE_DISABLE, &n->state);
6746 6714 }
6747 6715 EXPORT_SYMBOL(napi_disable);
···
6812 6778 	 */
6813 6779 	if (unlikely(napi_disable_pending(n))) {
6814 6780 		napi_complete(n);
6781 + 		goto out_unlock;
6782 + 	}
6783 +
6784 + 	/* The NAPI context has more processing work, but busy-polling
6785 + 	 * is preferred. Exit early.
6786 + 	 */
6787 + 	if (napi_prefer_busy_poll(n)) {
6788 + 		if (napi_complete_done(n, work)) {
6789 + 			/* If timeout is not set, we need to make sure
6790 + 			 * that the NAPI is re-scheduled.
6791 + 			 */
6792 + 			napi_schedule(n);
6793 + 		}
6815 6794 		goto out_unlock;
6816 6795 	}
6817 6796
···
9810 9763 		rx[i].dev = dev;
9811 9764
9812 9765 		/* XDP RX-queue setup */
9813 - 		err = xdp_rxq_info_reg(&rx[i].xdp_rxq, dev, i);
9766 + 		err = xdp_rxq_info_reg(&rx[i].xdp_rxq, dev, i, 0);
9814 9767 		if (err < 0)
9815 9768 			goto err_rxq_info;
9816 9769 	}
+19
net/core/sock.c
···
1159 1159 			sk->sk_ll_usec = val;
1160 1160 		}
1161 1161 		break;
1162 + 	case SO_PREFER_BUSY_POLL:
1163 + 		if (valbool && !capable(CAP_NET_ADMIN))
1164 + 			ret = -EPERM;
1165 + 		else
1166 + 			WRITE_ONCE(sk->sk_prefer_busy_poll, valbool);
1167 + 		break;
1168 + 	case SO_BUSY_POLL_BUDGET:
1169 + 		if (val > READ_ONCE(sk->sk_busy_poll_budget) && !capable(CAP_NET_ADMIN)) {
1170 + 			ret = -EPERM;
1171 + 		} else {
1172 + 			if (val < 0 || val > U16_MAX)
1173 + 				ret = -EINVAL;
1174 + 			else
1175 + 				WRITE_ONCE(sk->sk_busy_poll_budget, val);
1176 + 		}
1177 + 		break;
1162 1178 #endif
1163 1179
1164 1180 	case SO_MAX_PACING_RATE:
···
1538 1522 #ifdef CONFIG_NET_RX_BUSY_POLL
1539 1523 	case SO_BUSY_POLL:
1540 1524 		v.val = sk->sk_ll_usec;
1525 + 		break;
1526 + 	case SO_PREFER_BUSY_POLL:
1527 + 		v.val = READ_ONCE(sk->sk_prefer_busy_poll);
1541 1528 		break;
1542 1529 #endif
1543 1530
+2 -1
net/core/xdp.c
···
158 158
159 159 /* Returns 0 on success, negative on failure */
160 160 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
161 - 		     struct net_device *dev, u32 queue_index)
161 + 		     struct net_device *dev, u32 queue_index, unsigned int napi_id)
162 162 {
163 163 	if (xdp_rxq->reg_state == REG_STATE_UNUSED) {
164 164 		WARN(1, "Driver promised not to register this");
···
179 179 	xdp_rxq_info_init(xdp_rxq);
180 180 	xdp_rxq->dev = dev;
181 181 	xdp_rxq->queue_index = queue_index;
182 + 	xdp_rxq->napi_id = napi_id;
182 183
183 184 	xdp_rxq->reg_state = REG_STATE_REGISTERED;
184 185 	return 0;
+51 -2
net/xdp/xsk.c
···
23 23 #include <linux/netdevice.h>
24 24 #include <linux/rculist.h>
25 25 #include <net/xdp_sock_drv.h>
26 + #include <net/busy_poll.h>
26 27 #include <net/xdp.h>
27 28
28 29 #include "xsk_queue.h"
···
233 232 	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
234 233 		return -EINVAL;
235 234
235 + 	sk_mark_napi_id_once_xdp(&xs->sk, xdp);
236 236 	len = xdp->data_end - xdp->data;
237 237
238 238 	return xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL ?
···
519 517 	return xs->zc ? xsk_zc_xmit(xs) : xsk_generic_xmit(sk);
520 518 }
521 519
520 + static bool xsk_no_wakeup(struct sock *sk)
521 + {
522 + #ifdef CONFIG_NET_RX_BUSY_POLL
523 + 	/* Prefer busy-polling, skip the wakeup. */
524 + 	return READ_ONCE(sk->sk_prefer_busy_poll) && READ_ONCE(sk->sk_ll_usec) &&
525 + 	       READ_ONCE(sk->sk_napi_id) >= MIN_NAPI_ID;
526 + #else
527 + 	return false;
528 + #endif
529 + }
530 +
522 531 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
523 532 {
524 533 	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
525 534 	struct sock *sk = sock->sk;
526 535 	struct xdp_sock *xs = xdp_sk(sk);
536 + 	struct xsk_buff_pool *pool;
527 537
528 538 	if (unlikely(!xsk_is_bound(xs)))
529 539 		return -ENXIO;
530 540 	if (unlikely(need_wait))
531 541 		return -EOPNOTSUPP;
532 542
533 - 	return __xsk_sendmsg(sk);
543 + 	if (sk_can_busy_loop(sk))
544 + 		sk_busy_loop(sk, 1); /* only support non-blocking sockets */
545 +
546 + 	if (xsk_no_wakeup(sk))
547 + 		return 0;
548 +
549 + 	pool = xs->pool;
550 + 	if (pool->cached_need_wakeup & XDP_WAKEUP_TX)
551 + 		return __xsk_sendmsg(sk);
552 + 	return 0;
553 + }
554 +
555 + static int xsk_recvmsg(struct socket *sock, struct msghdr *m, size_t len, int flags)
556 + {
557 + 	bool need_wait = !(flags & MSG_DONTWAIT);
558 + 	struct sock *sk = sock->sk;
559 + 	struct xdp_sock *xs = xdp_sk(sk);
560 +
561 + 	if (unlikely(!(xs->dev->flags & IFF_UP)))
562 + 		return -ENETDOWN;
563 + 	if (unlikely(!xs->rx))
564 + 		return -ENOBUFS;
565 + 	if (unlikely(!xsk_is_bound(xs)))
566 + 		return -ENXIO;
567 + 	if (unlikely(need_wait))
568 + 		return -EOPNOTSUPP;
569 +
570 + 	if (sk_can_busy_loop(sk))
571 + 		sk_busy_loop(sk, 1); /* only support non-blocking sockets */
572 +
573 + 	if (xsk_no_wakeup(sk))
574 + 		return 0;
575 +
576 + 	if (xs->pool->cached_need_wakeup & XDP_WAKEUP_RX && xs->zc)
577 + 		return xsk_wakeup(xs, XDP_WAKEUP_RX);
578 + 	return 0;
534 579 }
535 580
536 581 static __poll_t xsk_poll(struct file *file, struct socket *sock,
···
1240 1191 	.setsockopt = xsk_setsockopt,
1241 1192 	.getsockopt = xsk_getsockopt,
1242 1193 	.sendmsg = xsk_sendmsg,
1243 - 	.recvmsg = sock_no_recvmsg,
1194 + 	.recvmsg = xsk_recvmsg,
1244 1195 	.mmap = xsk_mmap,
1245 1196 	.sendpage = sock_no_sendpage,
1246 1197 };
+6 -7
net/xdp/xsk_buff_pool.c
···
 	if (err)
 		return err;
 
-	if (flags & XDP_USE_NEED_WAKEUP) {
+	if (flags & XDP_USE_NEED_WAKEUP)
 		pool->uses_need_wakeup = true;
-		/* Tx needs to be explicitly woken up the first time.
-		 * Also for supporting drivers that do not implement this
-		 * feature. They will always have to call sendto().
-		 */
-		pool->cached_need_wakeup = XDP_WAKEUP_TX;
-	}
+	/* Tx needs to be explicitly woken up the first time. Also
+	 * for supporting drivers that do not implement this
+	 * feature. They will always have to call sendto() or poll().
+	 */
+	pool->cached_need_wakeup = XDP_WAKEUP_TX;
 
 	dev_hold(netdev);
+54 -25
samples/bpf/xdpsock_user.c
···
 static bool opt_need_wakeup = true;
 static u32 opt_num_xsks = 1;
 static u32 prog_id;
+static bool opt_busy_poll;
 
 struct xsk_ring_stats {
 	unsigned long rx_npkts;
···
 	{"quiet", no_argument, 0, 'Q'},
 	{"app-stats", no_argument, 0, 'a'},
 	{"irq-string", no_argument, 0, 'I'},
+	{"busy-poll", no_argument, 0, 'B'},
 	{0, 0, 0, 0}
 };
···
 	"  -Q, --quiet		Do not display any stats.\n"
 	"  -a, --app-stats	Display application (syscall) statistics.\n"
 	"  -I, --irq-string	Display driver interrupt statistics for interface associated with irq-string.\n"
+	"  -B, --busy-poll	Busy poll.\n"
 	"\n";
 	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE,
 		opt_batch_size, MIN_PKT_SIZE, MIN_PKT_SIZE,
···
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:pSNn:czf:muMd:b:C:s:P:xQaI:",
+		c = getopt_long(argc, argv, "Frtli:q:pSNn:czf:muMd:b:C:s:P:xQaI:B",
 				long_options, &option_index);
 		if (c == -1)
 			break;
···
 			fprintf(stderr, "ERROR: Failed to get irqs for %s\n", opt_irq_str);
 			usage(basename(argv[0]));
 		}
-
+		break;
+	case 'B':
+		opt_busy_poll = 1;
 		break;
 	default:
 		usage(basename(argv[0]));
···
 	exit_with_error(errno);
 }
 
-static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
-				     struct pollfd *fds)
+static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
 {
 	struct xsk_umem_info *umem = xsk->umem;
 	u32 idx_cq = 0, idx_fq = 0;
···
 	while (ret != rcvd) {
 		if (ret < 0)
 			exit_with_error(-ret);
-		if (xsk_ring_prod__needs_wakeup(&umem->fq)) {
+		if (opt_busy_poll || xsk_ring_prod__needs_wakeup(&umem->fq)) {
 			xsk->app_stats.fill_fail_polls++;
-			ret = poll(fds, num_socks, opt_timeout);
+			recvfrom(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL,
+				 NULL);
 		}
 		ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
 	}
···
 	}
 }
 
-static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
+static void rx_drop(struct xsk_socket_info *xsk)
 {
 	unsigned int rcvd, i;
 	u32 idx_rx = 0, idx_fq = 0;
···
 	rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
 	if (!rcvd) {
-		if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq)) {
+		if (opt_busy_poll || xsk_ring_prod__needs_wakeup(&xsk->umem->fq)) {
 			xsk->app_stats.rx_empty_polls++;
-			ret = poll(fds, num_socks, opt_timeout);
+			recvfrom(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, NULL);
 		}
 		return;
 	}
···
 	while (ret != rcvd) {
 		if (ret < 0)
 			exit_with_error(-ret);
-		if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq)) {
+		if (opt_busy_poll || xsk_ring_prod__needs_wakeup(&xsk->umem->fq)) {
 			xsk->app_stats.fill_fail_polls++;
-			ret = poll(fds, num_socks, opt_timeout);
+			recvfrom(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, NULL);
 		}
 		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
 	}
···
 	}
 
 	for (i = 0; i < num_socks; i++)
-		rx_drop(xsks[i], fds);
+		rx_drop(xsks[i]);
 
 	if (benchmark_done)
 		break;
···
 	complete_tx_only_all();
 }
 
-static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
+static void l2fwd(struct xsk_socket_info *xsk)
 {
 	unsigned int rcvd, i;
 	u32 idx_rx = 0, idx_tx = 0;
 	int ret;
 
-	complete_tx_l2fwd(xsk, fds);
+	complete_tx_l2fwd(xsk);
 
 	rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
 	if (!rcvd) {
-		if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq)) {
+		if (opt_busy_poll || xsk_ring_prod__needs_wakeup(&xsk->umem->fq)) {
 			xsk->app_stats.rx_empty_polls++;
-			ret = poll(fds, num_socks, opt_timeout);
+			recvfrom(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, NULL);
 		}
 		return;
 	}
···
 	while (ret != rcvd) {
 		if (ret < 0)
 			exit_with_error(-ret);
-		complete_tx_l2fwd(xsk, fds);
-		if (xsk_ring_prod__needs_wakeup(&xsk->tx)) {
+		complete_tx_l2fwd(xsk);
+		if (opt_busy_poll || xsk_ring_prod__needs_wakeup(&xsk->tx)) {
 			xsk->app_stats.tx_wakeup_sendtos++;
 			kick_tx(xsk);
 		}
···
 	struct pollfd fds[MAX_SOCKS] = {};
 	int i, ret;
 
-	for (i = 0; i < num_socks; i++) {
-		fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
-		fds[i].events = POLLOUT | POLLIN;
-	}
-
 	for (;;) {
 		if (opt_poll) {
-			for (i = 0; i < num_socks; i++)
+			for (i = 0; i < num_socks; i++) {
+				fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
+				fds[i].events = POLLOUT | POLLIN;
 				xsks[i]->app_stats.opt_polls++;
+			}
 			ret = poll(fds, num_socks, opt_timeout);
 			if (ret <= 0)
 				continue;
 		}
 
 		for (i = 0; i < num_socks; i++)
-			l2fwd(xsks[i], fds);
+			l2fwd(xsks[i]);
 
 		if (benchmark_done)
 			break;
···
 	}
 }
 
+static void apply_setsockopt(struct xsk_socket_info *xsk)
+{
+	int sock_opt;
+
+	if (!opt_busy_poll)
+		return;
+
+	sock_opt = 1;
+	if (setsockopt(xsk_socket__fd(xsk->xsk), SOL_SOCKET, SO_PREFER_BUSY_POLL,
+		       (void *)&sock_opt, sizeof(sock_opt)) < 0)
+		exit_with_error(errno);
+
+	sock_opt = 20;
+	if (setsockopt(xsk_socket__fd(xsk->xsk), SOL_SOCKET, SO_BUSY_POLL,
+		       (void *)&sock_opt, sizeof(sock_opt)) < 0)
+		exit_with_error(errno);
+
+	sock_opt = opt_batch_size;
+	if (setsockopt(xsk_socket__fd(xsk->xsk), SOL_SOCKET, SO_BUSY_POLL_BUDGET,
+		       (void *)&sock_opt, sizeof(sock_opt)) < 0)
+		exit_with_error(errno);
+}
+
 int main(int argc, char **argv)
 {
 	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
···
 		tx = true;
 	for (i = 0; i < opt_num_xsks; i++)
 		xsks[num_socks++] = xsk_configure_socket(umem, rx, tx);
+
+	for (i = 0; i < opt_num_xsks; i++)
+		apply_setsockopt(xsks[i]);
 
 	if (opt_bench == BENCH_TXONLY) {
 		gen_eth_hdr_data();