Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

packet: use percpu mmap tx frame pending refcount

In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
and one atomic_dec() call in skb destructor and use a percpu
reference count instead in order to determine if packets are
still pending to be sent out. Micro-benchmark with [1] that has
been slightly modified (that is, protocol = 0 in socket(2) and
bind(2)), example on a rather crappy testing machine; I expect
it to scale and have even better results on bigger machines:

./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

With patch: 4,022,015 cyc
Without patch: 4,812,994 cyc

time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

With patch:
real 1m32.241s
user 0m0.287s
sys 1m29.316s

Without patch:
real 1m38.386s
user 0m0.265s
sys 1m35.572s
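The counting scheme the patch switches to can be sketched in plain C
(a userspace toy, not the kernel code: a fixed array stands in for the
percpu area, and the cpu indices are picked by hand):

```c
#include <assert.h>

#define NR_CPUS 4

/* One counter slot per "CPU"; increments and decrements may land on
 * different slots, but the sum over all slots is the number of tx
 * frames still pending. */
static unsigned int pending[NR_CPUS];

static void inc_pending(int cpu) { pending[cpu]++; }
static void dec_pending(int cpu) { pending[cpu]--; }

static unsigned int read_pending(void)
{
	unsigned int sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += pending[cpu];
	return sum;
}
```

Each inc/dec touches only a single CPU-local slot with no cross-CPU
atomic operation, while read_pending() has to walk all slots; that
walk is the slow path that, as argued below, we rarely take.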

In function tpacket_snd(), it is okay to use packet_read_pending()
since in fast-path we short-circuit the condition already with
ph != NULL, that is, we still have frames to process. In case we have
MSG_DONTWAIT, we also do not execute this path as need_wait is
false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
okay to call packet_read_pending(), because whenever we reach
that path, we're done processing outgoing frames anyway and only
look if there are skbs still outstanding to be orphaned. We can
stay lockless in this percpu counter since it's acceptable when we
reach this path for the sum to be imprecise first, but we'll level
out at 0 after all pending frames have reached the skb destructor
eventually through tx reclaim. When a tx process is pinned to
particular CPUs, we expect overflows in the per-CPU reference
counters: one CPU sees heavy increments, while the decrements are
distributed across all CPUs through ksoftirqd, for example. As
David Laight points out, since the C language doesn't define the
result of signed int overflow (i.e. rather than wrap, it is
allowed to saturate as a possible outcome), we have to use
unsigned int as the reference count type. The sum over all CPUs when tx
is complete will result in 0 again.
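The modular-arithmetic argument can be checked with a toy two-CPU
scenario (hypothetical, not kernel code): all increments land on one
slot, all decrements on another, the second slot wraps below zero,
yet the unsigned sum is exact:

```c
#include <assert.h>

/* cpu0 runs the pinned tx task (all increments); cpu1 runs tx reclaim
 * (all decrements).  cpu1's slot wraps below zero, which is
 * well-defined for unsigned int (arithmetic mod 2^32), so the sum
 * still comes out at the true pending count. */
static unsigned int tx_complete_sum(void)
{
	unsigned int cpu0 = 0, cpu1 = 0;
	int i;

	for (i = 0; i < 5; i++)
		cpu0++;		/* 5 frames sent */
	for (i = 0; i < 5; i++)
		cpu1--;		/* 5 frames reclaimed; wraps to UINT_MAX - 4 */

	return cpu0 + cpu1;	/* modular sum: 0, all frames done */
}
```

With a signed counter the same wrap would be undefined behavior, which
is exactly why the patch uses unsigned int.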

The BUG_ON() in tpacket_destruct_skb() can be removed as well. The
destructor can _only_ be set from inside the tpacket_snd() path, and
we made sure to increase tx_ring.pending in any case before calling
po->xmit(skb).
So testing for tx_ring.pending == 0 is not too useful. Instead, it
would rather have been useful to test if lower layers didn't orphan
the skb so that we're missing ring slots being put back to
TP_STATUS_AVAILABLE. But such a bug will be caught in user space
already as we end up realizing that we do not have any
TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.
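The ordering argument behind dropping the BUG_ON() can be made
concrete with a hypothetical single-threaded sketch (the real
completion runs asynchronously via the skb destructor, but the
increment-before-xmit ordering is the same):

```c
#include <assert.h>

static unsigned int pending;

/* stands in for tpacket_destruct_skb(): runs only after a send */
static void frame_done(void)
{
	/* the old BUG_ON(pending == 0) can never fire here, because
	 * send_frame() increments strictly before handing off the frame */
	assert(pending != 0);
	pending--;
}

/* stands in for the tpacket_snd() loop body */
static void send_frame(void)
{
	pending++;	/* bump pending before "transmission" */
	frame_done();	/* completion can only observe pending >= 1 */
}
```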

Btw, in the RX_RING path, we do not make use of the pending
member, therefore we also don't need to use up any percpu memory
here. Also note that __alloc_percpu() already returns a zero-filled
percpu area, so initialization is done already.

[1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Daniel Borkmann, committed by David S. Miller
b0138408 87a2fd28

+62 -7 total

net/packet/af_packet.c (+60 -6)
···
 #include <linux/errqueue.h>
 #include <linux/net_tstamp.h>
 #include <linux/reciprocal_div.h>
+#include <linux/percpu.h>
 #ifdef CONFIG_INET
 #include <net/inet_common.h>
 #endif
···
 	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
 }
 
+static void packet_inc_pending(struct packet_ring_buffer *rb)
+{
+	this_cpu_inc(*rb->pending_refcnt);
+}
+
+static void packet_dec_pending(struct packet_ring_buffer *rb)
+{
+	this_cpu_dec(*rb->pending_refcnt);
+}
+
+static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
+{
+	unsigned int refcnt = 0;
+	int cpu;
+
+	/* We don't use pending refcount in rx_ring. */
+	if (rb->pending_refcnt == NULL)
+		return 0;
+
+	for_each_possible_cpu(cpu)
+		refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);
+
+	return refcnt;
+}
+
+static int packet_alloc_pending(struct packet_sock *po)
+{
+	po->rx_ring.pending_refcnt = NULL;
+
+	po->tx_ring.pending_refcnt = alloc_percpu(unsigned int);
+	if (unlikely(po->tx_ring.pending_refcnt == NULL))
+		return -ENOBUFS;
+
+	return 0;
+}
+
+static void packet_free_pending(struct packet_sock *po)
+{
+	free_percpu(po->tx_ring.pending_refcnt);
+}
+
 static bool packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 {
 	struct sock *sk = &po->sk;
···
 	__u32 ts;
 
 	ph = skb_shinfo(skb)->destructor_arg;
-	BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
-	atomic_dec(&po->tx_ring.pending);
+	packet_dec_pending(&po->tx_ring);
 
 	ts = __packet_set_timestamp(po, ph, skb);
 	__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
···
 	skb_set_queue_mapping(skb, packet_pick_tx_queue(dev));
 	skb->destructor = tpacket_destruct_skb;
 	__packet_set_status(po, ph, TP_STATUS_SENDING);
-	atomic_inc(&po->tx_ring.pending);
+	packet_inc_pending(&po->tx_ring);
 
 	status = TP_STATUS_SEND_REQUEST;
 	err = po->xmit(skb);
···
 		}
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
-	} while (likely((ph != NULL) || (need_wait &&
-		 atomic_read(&po->tx_ring.pending))));
+	} while (likely((ph != NULL) ||
+		 /* Note: packet_read_pending() might be slow if we have
+		  * to call it as it's per_cpu variable, but in fast-path
+		  * we already short-circuit the loop with the first
+		  * condition, and luckily don't have to go that path
+		  * anyway.
+		  */
+		 (need_wait && packet_read_pending(&po->tx_ring))));
 
 	err = len_sum;
 	goto out_put;
···
 	/* Purge queues */
 
 	skb_queue_purge(&sk->sk_receive_queue);
+	packet_free_pending(po);
 	sk_refcnt_debug_release(sk);
 
 	sock_put(sk);
···
 	po->num = proto;
 	po->xmit = dev_queue_xmit;
 
+	err = packet_alloc_pending(po);
+	if (err)
+		goto out2;
+
 	packet_cached_dev_reset(po);
 
 	sk->sk_destruct = packet_sock_destruct;
···
 	preempt_enable();
 
 	return 0;
+out2:
+	sk_free(sk);
 out:
 	return err;
 }
···
 	if (!closing) {
 		if (atomic_read(&po->mapped))
 			goto out;
-		if (atomic_read(&rb->pending))
+		if (packet_read_pending(rb))
 			goto out;
 	}
net/packet/diag.c (+1)
···
 #include <linux/net.h>
 #include <linux/netdevice.h>
 #include <linux/packet_diag.h>
+#include <linux/percpu.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
net/packet/internal.h (+1 -1)
···
 	unsigned int		pg_vec_pages;
 	unsigned int		pg_vec_len;
 
-	atomic_t		pending;
+	unsigned int __percpu	*pending_refcnt;
 
 	struct tpacket_kbdq_core prb_bdqc;
 };