
net: Consistent skb timestamping

With the inclusion of RPS, skb timestamping is no longer consistent in the RX path.

If netif_receive_skb() is used, it is deferred until after RPS dispatch.

If netif_rx() is used, it is done before RPS dispatch.

This can produce strange tcpdump timestamp results.

I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (i.e. timestamps taken at the time the packet
was delivered by the NIC driver to our stack), even if NAPI can already
defer timestamping a bit (RPS can help reduce the gap).

Tom Herbert prefers to sample timestamps after RPS dispatch. When
sampling is expensive (HPET/acpi_pm clocksources on x86), this makes sense.

Let admins switch between the two modes using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue

Its default value (1) means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.

Setting it to 0 permits sampling timestamps when processing the backlog,
after RPS dispatch, lowering the load on the pre-RPS CPU.
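As a usage sketch (hypothetical admin session; requires root and a kernel
carrying this patch):

```shell
# Show the current mode; 1 (the default) means packets are stamped
# as early as possible, before backlog queueing.
cat /proc/sys/net/core/netdev_tstamp_prequeue

# Defer timestamping until after RPS dispatch, on the target CPU,
# to avoid expensive clock reads on the pre-RPS CPU.
sysctl -w net.core.netdev_tstamp_prequeue=0

# Restore the default, accurate pre-queue timestamps.
sysctl -w net.core.netdev_tstamp_prequeue=1
```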

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Eric Dumazet, committed by David S. Miller
3b098e2d a1aa3483

49 insertions(+), 19 deletions(-)

Documentation/sysctl/net.txt (+10)

···
 Maximum number of packets, queued on the INPUT side, when the interface
 receives packets faster than kernel can process them.
 
+netdev_tstamp_prequeue
+----------------------
+
+If set to 0, RX packet timestamps can be sampled after RPS processing, when
+the target CPU processes packets. It might give some delay on timestamps, but
+permit to distribute the load on several cpus.
+
+If set to 1 (default), timestamps are sampled as soon as possible, before
+queueing.
+
 optmem_max
 ----------
include/linux/netdevice.h (+1)

···
 extern void dev_txq_stats_fold(const struct net_device *dev, struct net_device_stats *stats);
 
 extern int netdev_max_backlog;
+extern int netdev_tstamp_prequeue;
 extern int weight_p;
 extern int netdev_set_master(struct net_device *dev, struct net_device *master);
 extern int skb_checksum_help(struct sk_buff *skb);
net/core/dev.c (+31 -19)

···
 }
 EXPORT_SYMBOL(net_disable_timestamp);
 
-static inline void net_timestamp(struct sk_buff *skb)
+static inline void net_timestamp_set(struct sk_buff *skb)
 {
 	if (atomic_read(&netstamp_needed))
 		__net_timestamp(skb);
 	else
 		skb->tstamp.tv64 = 0;
+}
+
+static inline void net_timestamp_check(struct sk_buff *skb)
+{
+	if (!skb->tstamp.tv64 && atomic_read(&netstamp_needed))
+		__net_timestamp(skb);
 }
 
 /**
···
 
 #ifdef CONFIG_NET_CLS_ACT
 	if (!(skb->tstamp.tv64 && (G_TC_FROM(skb->tc_verd) & AT_INGRESS)))
-		net_timestamp(skb);
+		net_timestamp_set(skb);
 #else
-	net_timestamp(skb);
+	net_timestamp_set(skb);
 #endif
 
 	rcu_read_lock();
···
 =======================================================================*/
 
 int netdev_max_backlog __read_mostly = 1000;
+int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 int weight_p __read_mostly = 64;	/* old backlog weight */
 
···
 	if (netpoll_rx(skb))
 		return NET_RX_DROP;
 
-	if (!skb->tstamp.tv64)
-		net_timestamp(skb);
+	if (netdev_tstamp_prequeue)
+		net_timestamp_check(skb);
 
 #ifdef CONFIG_RPS
 	{
···
 	int ret = NET_RX_DROP;
 	__be16 type;
 
-	if (!skb->tstamp.tv64)
-		net_timestamp(skb);
+	if (!netdev_tstamp_prequeue)
+		net_timestamp_check(skb);
 
 	if (vlan_tx_tag_present(skb) && vlan_hwaccel_do_receive(skb))
 		return NET_RX_SUCCESS;
···
  */
 int netif_receive_skb(struct sk_buff *skb)
 {
+	if (netdev_tstamp_prequeue)
+		net_timestamp_check(skb);
+
 #ifdef CONFIG_RPS
-	struct rps_dev_flow voidflow, *rflow = &voidflow;
-	int cpu, ret;
+	{
+		struct rps_dev_flow voidflow, *rflow = &voidflow;
+		int cpu, ret;
 
-	rcu_read_lock();
+		rcu_read_lock();
 
-	cpu = get_rps_cpu(skb->dev, skb, &rflow);
+		cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-	if (cpu >= 0) {
-		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
-		rcu_read_unlock();
-	} else {
-		rcu_read_unlock();
-		ret = __netif_receive_skb(skb);
+		if (cpu >= 0) {
+			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+			rcu_read_unlock();
+		} else {
+			rcu_read_unlock();
+			ret = __netif_receive_skb(skb);
+		}
+
+		return ret;
 	}
-
-	return ret;
 #else
 	return __netif_receive_skb(skb);
 #endif
net/core/sysctl_net_core.c (+7)

···
 		.proc_handler = proc_dointvec
 	},
 	{
+		.procname = "netdev_tstamp_prequeue",
+		.data = &netdev_tstamp_prequeue,
+		.maxlen = sizeof(int),
+		.mode = 0644,
+		.proc_handler = proc_dointvec
+	},
+	{
 		.procname = "message_cost",
 		.data = &net_ratelimit_state.interval,
 		.maxlen = sizeof(int),