Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

veth: more robust handling of race to avoid txq getting stuck

Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") introduced a race condition that can lead to a permanently
stalled TXQ. This was observed in production on ARM64 systems (Ampere Altra
Max).

The race occurs in veth_xmit(). The producer observes a full ptr_ring and
stops the queue (netif_tx_stop_queue()). The subsequent conditional logic,
intended to re-wake the queue if the consumer had just emptied it (if
(__ptr_ring_empty(...)) netif_tx_wake_queue()), can fail. This leads to a
"lost wakeup" where the TXQ remains stopped (QUEUE_STATE_DRV_XOFF) and
traffic halts.

This failure is caused by an incorrect use of the __ptr_ring_empty() API
from the producer side. As noted in kernel comments, this check is not
guaranteed to be correct if a consumer is operating on another CPU. The
empty test is based on ptr_ring->consumer_head, making it reliable only for
the consumer. Using this check from the producer side is fundamentally racy.

This patch fixes the race by adopting the more robust logic from an earlier
version (V4) of the patchset, which always flushed the peer:

(1) In veth_xmit(), the racy conditional wake-up logic and its memory barrier
are removed. Instead, after stopping the queue, we unconditionally call
__veth_xdp_flush(rq). This guarantees that the NAPI consumer is scheduled,
making it solely responsible for re-waking the TXQ.
This handles the race where veth_poll() consumes all packets and completes
NAPI *before* veth_xmit() on the producer side has called
netif_tx_stop_queue(). In that case, __veth_xdp_flush(rq) observes that
rx_notify_masked is false and schedules NAPI.

(2) On the consumer side, the logic for waking the peer TXQ is moved out of
veth_xdp_rcv() and placed at the end of the veth_poll() function. This
placement is part of fixing the race, as the netif_tx_queue_stopped() check
must occur after rx_notify_masked is potentially set to false during NAPI
completion.
This handles the race where veth_poll() has consumed all packets but has not
yet completed NAPI (rx_notify_masked is still true). The producer veth_xmit()
stops the TXQ, and __veth_xdp_flush(rq) observes that rx_notify_masked is
true, so it does not schedule NAPI. veth_poll() then sets rx_notify_masked to
false and completes NAPI. Before exiting, veth_poll() observes that the TXQ
is stopped and wakes it up.

Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops")
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/176295323282.307447.14790015927673763094.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Authored by Jesper Dangaard Brouer, committed by Jakub Kicinski
5442a9da dfe28c41

+20 -18
drivers/net/veth.c
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -392,14 +392,12 @@
 	}
 	/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
 	__skb_push(skb, ETH_HLEN);
-	/* Depend on prior success packets started NAPI consumer via
-	 * __veth_xdp_flush(). Cancel TXQ stop if consumer stopped,
-	 * paired with empty check in veth_poll().
-	 */
 	netif_tx_stop_queue(txq);
-	smp_mb__after_atomic();
-	if (unlikely(__ptr_ring_empty(&rq->xdp_ring)))
-		netif_tx_wake_queue(txq);
+	/* Makes sure NAPI peer consumer runs. Consumer is responsible
+	 * for starting txq again, until then ndo_start_xmit (this
+	 * function) will not be invoked by the netstack again.
+	 */
+	__veth_xdp_flush(rq);
 	break;
 case NET_RX_DROP: /* same as NET_XMIT_DROP */
 drop:
@@ -898,16 +900,8 @@
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats)
 {
-	struct veth_priv *priv = netdev_priv(rq->dev);
-	int queue_idx = rq->xdp_rxq.queue_index;
-	struct netdev_queue *peer_txq;
-	struct net_device *peer_dev;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
-
-	/* NAPI functions as RCU section */
-	peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
-	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
 
 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -949,8 +959,5 @@
 	rq->stats.vs.xdp_packets += done;
 	u64_stats_update_end(&rq->stats.syncp);
 
-	if (peer_txq && unlikely(netif_tx_queue_stopped(peer_txq)))
-		netif_tx_wake_queue(peer_txq);
-
 	return done;
 }
@@ -956,11 +969,19 @@
 {
 	struct veth_rq *rq =
 		container_of(napi, struct veth_rq, xdp_napi);
+	struct veth_priv *priv = netdev_priv(rq->dev);
+	int queue_idx = rq->xdp_rxq.queue_index;
+	struct netdev_queue *peer_txq;
 	struct veth_stats stats = {};
+	struct net_device *peer_dev;
 	struct veth_xdp_tx_bq bq;
 	int done;
 
 	bq.count = 0;
+
+	/* NAPI functions as RCU section */
+	peer_dev = rcu_dereference_check(priv->peer, rcu_read_lock_bh_held());
+	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
 
 	xdp_set_return_frame_no_direct();
 	done = veth_xdp_rcv(rq, budget, &bq, &stats);
@@ -990,6 +995,13 @@
 	if (stats.xdp_tx > 0)
 		veth_xdp_flush(rq, &bq);
 	xdp_clear_return_frame_no_direct();
+
+	/* Release backpressure per NAPI poll */
+	smp_rmb(); /* Paired with netif_tx_stop_queue set_bit */
+	if (peer_txq && netif_tx_queue_stopped(peer_txq)) {
+		txq_trans_cond_update(peer_txq);
+		netif_tx_wake_queue(peer_txq);
+	}
 
 	return done;
 }