Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-02-20

We've added 19 non-merge commits during the last 8 day(s) which contain
a total of 35 files changed, 1126 insertions(+), 53 deletions(-).

The main changes are:

1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing

2) Add network TX timestamping support to BPF sock_ops, from Jason Xing

3) Add TX metadata Launch Time support, from Song Yoong Siang

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
igc: Add launch time support to XDP ZC
igc: Refactor empty frame insertion for launch time support
net: stmmac: Add launch time support to XDP ZC
selftests/bpf: Add launch time request to xdp_hw_metadata
xsk: Add launch time hardware offload support to XDP Tx metadata
selftests/bpf: Add simple bpf tests in the tx path for timestamping feature
bpf: Support selective sampling for bpf timestamping
bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback
bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
bpf: Disable unsafe helpers in TX timestamping callbacks
bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
bpf: Add networking timestamping support to bpf_get/setsockopt()
selftests/bpf: Add rto max for bpf_setsockopt test
bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt
====================

Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

+1126 -53
+4
Documentation/netlink/specs/netdev.yaml
···
         name: tx-checksum
         doc:
           L3 checksum HW offload is supported by the driver.
+      -
+        name: tx-launch-time-fifo
+        doc:
+          Launch time HW offload is supported by the driver.
       -
         name: queue-type
         type: enum
+62
Documentation/networking/xsk-tx-metadata.rst
···
     checksum. ``csum_start`` specifies byte offset of where the checksumming
     should start and ``csum_offset`` specifies byte offset where the
     device should store the computed checksum.
+  - ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
+    packet for transmission at a pre-determined time called launch time. The
+    value of launch time is indicated by ``launch_time`` field of
+    ``union xsk_tx_metadata``.
 
 Besides the flags above, in order to trigger the offloads, the first
 packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
···
 is calculated on the CPU. Do not enable this option in production because
 it will negatively affect performance.
 
+Launch Time
+===========
+
+The value of the requested launch time should be based on the device's PTP
+Hardware Clock (PHC) to ensure accuracy. AF_XDP takes a different data path
+compared to the ETF queuing discipline, which organizes packets and delays
+their transmission. Instead, AF_XDP immediately hands off the packets to
+the device driver without rearranging their order or holding them prior to
+transmission. Since the driver maintains FIFO behavior and does not perform
+packet reordering, a packet with a launch time request will block other
+packets in the same Tx Queue until it is sent. Therefore, it is recommended
+to allocate a separate queue for scheduling traffic that is intended for
+future transmission.
+
+In scenarios where the launch time offload feature is disabled, the device
+driver is expected to disregard the launch time request. For correct
+interpretation and meaningful operation, the launch time should never be
+set to a value larger than the farthest programmable time in the future
+(the horizon). Different devices have different hardware limitations on the
+launch time offload feature.
+
+stmmac driver
+-------------
+
+For stmmac, TSO and launch time (TBS) features are mutually exclusive for
+each individual Tx Queue. By default, the driver configures Tx Queue 0 to
+support TSO and the rest of the Tx Queues to support TBS. The launch time
+hardware offload feature can be enabled or disabled by using the tc-etf
+command to call the driver's ndo_setup_tc() callback.
+
+The value of the launch time that is programmed in the Enhanced Normal
+Transmit Descriptors is a 32-bit value, where the most significant 8 bits
+represent the time in seconds and the remaining 24 bits represent the time
+in 256 ns increments. The programmed launch time is compared against the
+PTP time (bits[39:8]) and rolls over after 256 seconds. Therefore, the
+horizon of the launch time for dwmac4 and dwxlgmac2 is 128 seconds in the
+future.
+
+igc driver
+----------
+
+For igc, all four Tx Queues support the launch time feature. The launch
+time hardware offload feature can be enabled or disabled by using the
+tc-etf command to call the driver's ndo_setup_tc() callback. When entering
+TSN mode, the igc driver will reset the device and create a default Qbv
+schedule with a 1-second cycle time, with all Tx Queues open at all times.
+
+The value of the launch time that is programmed in the Advanced Transmit
+Context Descriptor is a relative offset to the starting time of the Qbv
+transmission window of the queue. The Frst flag of the descriptor can be
+set to schedule the packet for the next Qbv cycle. Therefore, the horizon
+of the launch time for i225 and i226 is the ending time of the next cycle
+of the Qbv transmission window of the queue. For example, when the Qbv
+cycle time is set to 1 second, the horizon of the launch time ranges
+from 1 second to 2 seconds, depending on where the Qbv cycle is currently
+running.
+
 Querying Device Capabilities
 ============================
···
 - ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
 - ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
+- ``tx-launch-time-fifo``: device supports ``XDP_TXMD_FLAGS_LAUNCH_TIME``
 
 See ``tools/net/ynl/samples/netdev.c`` on how to query this information.
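To make the stmmac descriptor arithmetic above concrete, here is a small editorial sketch (plain userspace C, not driver code) of the described encoding: the most significant 8 bits carry whole seconds, the low 24 bits carry the nanosecond remainder in 256 ns units. The helper name is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Editorial sketch of the TBS launch-time encoding described above:
 * top 8 bits hold whole seconds (so the value rolls over every 256 s),
 * low 24 bits hold the nanosecond remainder in 256 ns units.
 * The helper name is hypothetical, not taken from the driver. */
static uint32_t tbs_encode_launch_time(uint64_t launch_ns)
{
	uint32_t sec = (uint32_t)(launch_ns / 1000000000ULL) & 0xff;
	uint32_t ns_256 = (uint32_t)((launch_ns % 1000000000ULL) >> 8);

	return (sec << 24) | (ns_256 & 0xffffff);
}
```

The 256-second rollover in this encoding is what limits the usable horizon to 128 seconds, as the documentation notes.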
+1
drivers/net/ethernet/intel/igc/igc.h
···
 	struct xsk_tx_metadata *meta;
 	struct igc_ring *tx_ring;
 	u32 cmd_type;
+	u16 used_desc;
 };
 
 struct igc_q_vector {
+109 -34
drivers/net/ethernet/intel/igc/igc_main.c
···
 
 	dma = dma_map_single(ring->dev, skb->data, size, DMA_TO_DEVICE);
 	if (dma_mapping_error(ring->dev, dma)) {
-		netdev_err_once(ring->netdev, "Failed to map DMA for TX\n");
+		net_err_ratelimited("%s: DMA mapping error for empty frame\n",
+				    netdev_name(ring->netdev));
 		return -ENOMEM;
 	}
 
···
 	return 0;
 }
 
-static int igc_init_tx_empty_descriptor(struct igc_ring *ring,
-					struct sk_buff *skb,
-					struct igc_tx_buffer *first)
+static void igc_init_tx_empty_descriptor(struct igc_ring *ring,
+					 struct sk_buff *skb,
+					 struct igc_tx_buffer *first)
 {
 	union igc_adv_tx_desc *desc;
 	u32 cmd_type, olinfo_status;
-	int err;
-
-	if (!igc_desc_unused(ring))
-		return -EBUSY;
-
-	err = igc_init_empty_frame(ring, first, skb);
-	if (err)
-		return err;
 
 	cmd_type = IGC_ADVTXD_DTYP_DATA | IGC_ADVTXD_DCMD_DEXT |
 		   IGC_ADVTXD_DCMD_IFCS | IGC_TXD_DCMD |
···
 	ring->next_to_use++;
 	if (ring->next_to_use == ring->count)
 		ring->next_to_use = 0;
-
-	return 0;
 }
 
 #define IGC_EMPTY_FRAME_SIZE 60
···
 	return false;
 }
 
+static int igc_insert_empty_frame(struct igc_ring *tx_ring)
+{
+	struct igc_tx_buffer *empty_info;
+	struct sk_buff *empty_skb;
+	void *data;
+	int ret;
+
+	empty_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
+	empty_skb = alloc_skb(IGC_EMPTY_FRAME_SIZE, GFP_ATOMIC);
+	if (unlikely(!empty_skb)) {
+		net_err_ratelimited("%s: skb alloc error for empty frame\n",
+				    netdev_name(tx_ring->netdev));
+		return -ENOMEM;
+	}
+
+	data = skb_put(empty_skb, IGC_EMPTY_FRAME_SIZE);
+	memset(data, 0, IGC_EMPTY_FRAME_SIZE);
+
+	/* Prepare DMA mapping and Tx buffer information */
+	ret = igc_init_empty_frame(tx_ring, empty_info, empty_skb);
+	if (unlikely(ret)) {
+		dev_kfree_skb_any(empty_skb);
+		return ret;
+	}
+
+	/* Prepare advanced context descriptor for empty packet */
+	igc_tx_ctxtdesc(tx_ring, 0, false, 0, 0, 0);
+
+	/* Prepare advanced data descriptor for empty packet */
+	igc_init_tx_empty_descriptor(tx_ring, empty_skb, empty_info);
+
+	return 0;
+}
+
 static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
 				       struct igc_ring *tx_ring)
 {
···
 	 * + 1 desc for skb_headlen/IGC_MAX_DATA_PER_TXD,
 	 * + 2 desc gap to keep tail from touching head,
 	 * + 1 desc for context descriptor,
+	 * + 2 desc for inserting an empty packet for launch time,
 	 * otherwise try next time
 	 */
 	for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
···
 	launch_time = igc_tx_launchtime(tx_ring, txtime, &first_flag, &insert_empty);
 
 	if (insert_empty) {
-		struct igc_tx_buffer *empty_info;
-		struct sk_buff *empty;
-		void *data;
-
-		empty_info = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
-		empty = alloc_skb(IGC_EMPTY_FRAME_SIZE, GFP_ATOMIC);
-		if (!empty)
-			goto done;
-
-		data = skb_put(empty, IGC_EMPTY_FRAME_SIZE);
-		memset(data, 0, IGC_EMPTY_FRAME_SIZE);
-
-		igc_tx_ctxtdesc(tx_ring, 0, false, 0, 0, 0);
-
-		if (igc_init_tx_empty_descriptor(tx_ring,
-						 empty,
-						 empty_info) < 0)
-			dev_kfree_skb_any(empty);
+		/* Reset the launch time if the required empty frame fails to
+		 * be inserted. However, this packet is not dropped, so it
+		 * "dirties" the current Qbv cycle. This ensures that the
+		 * upcoming packet, which is scheduled in the next Qbv cycle,
+		 * does not require an empty frame. This way, the launch time
+		 * continues to function correctly despite the current failure
+		 * to insert the empty frame.
+		 */
+		if (igc_insert_empty_frame(tx_ring))
+			launch_time = 0;
 	}
 
 done:
···
 	return *(u64 *)_priv;
 }
 
+static void igc_xsk_request_launch_time(u64 launch_time, void *_priv)
+{
+	struct igc_metadata_request *meta_req = _priv;
+	struct igc_ring *tx_ring = meta_req->tx_ring;
+	__le32 launch_time_offset;
+	bool insert_empty = false;
+	bool first_flag = false;
+	u16 used_desc = 0;
+
+	if (!tx_ring->launchtime_enable)
+		return;
+
+	launch_time_offset = igc_tx_launchtime(tx_ring,
+					       ns_to_ktime(launch_time),
+					       &first_flag, &insert_empty);
+	if (insert_empty) {
+		/* Disregard the launch time request if the required empty frame
+		 * fails to be inserted.
+		 */
+		if (igc_insert_empty_frame(tx_ring))
+			return;
+
+		meta_req->tx_buffer =
+			&tx_ring->tx_buffer_info[tx_ring->next_to_use];
+		/* Inserting an empty packet requires two descriptors:
+		 * one data descriptor and one context descriptor.
+		 */
+		used_desc += 2;
+	}
+
+	/* Use one context descriptor to specify launch time and first flag. */
+	igc_tx_ctxtdesc(tx_ring, launch_time_offset, first_flag, 0, 0, 0);
+	used_desc += 1;
+
+	/* Update the number of used descriptors in this request */
+	meta_req->used_desc += used_desc;
+}
+
 const struct xsk_tx_metadata_ops igc_xsk_tx_metadata_ops = {
 	.tmo_request_timestamp	= igc_xsk_request_timestamp,
 	.tmo_fill_timestamp	= igc_xsk_fill_timestamp,
+	.tmo_request_launch_time = igc_xsk_request_launch_time,
 };
 
 static void igc_xdp_xmit_zc(struct igc_ring *ring)
···
 	ntu = ring->next_to_use;
 	budget = igc_desc_unused(ring);
 
-	while (xsk_tx_peek_desc(pool, &xdp_desc) && budget--) {
+	/* Packets with launch time require one data descriptor and one context
+	 * descriptor. When the launch time falls into the next Qbv cycle, we
+	 * may need to insert an empty packet, which requires two more
+	 * descriptors. Therefore, to be safe, we always ensure we have at least
+	 * 4 descriptors available.
+	 */
+	while (xsk_tx_peek_desc(pool, &xdp_desc) && budget >= 4) {
 		struct igc_metadata_request meta_req;
 		struct xsk_tx_metadata *meta = NULL;
 		struct igc_tx_buffer *bi;
···
 		meta_req.tx_ring = ring;
 		meta_req.tx_buffer = bi;
 		meta_req.meta = meta;
+		meta_req.used_desc = 0;
 		xsk_tx_metadata_request(meta, &igc_xsk_tx_metadata_ops,
 					&meta_req);
+
+		/* xsk_tx_metadata_request() may have updated next_to_use */
+		ntu = ring->next_to_use;
+
+		/* xsk_tx_metadata_request() may have updated Tx buffer info */
+		bi = meta_req.tx_buffer;
+
+		/* xsk_tx_metadata_request() may use a few descriptors */
+		budget -= meta_req.used_desc;
 
 		tx_desc = IGC_TX_DESC(ring, ntu);
 		tx_desc->read.cmd_type_len = cpu_to_le32(meta_req.cmd_type);
···
 		ntu++;
 		if (ntu == ring->count)
 			ntu = 0;
+
+		ring->next_to_use = ntu;
+		budget--;
 	}
 
-	ring->next_to_use = ntu;
 	if (tx_desc) {
 		igc_flush_tx_descriptors(ring);
 		xsk_tx_release(pool);
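The `budget >= 4` check above follows from a worst-case descriptor count per frame. As a hedged editorial sketch (not driver code, names invented), the accounting described in the diff's comment looks like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Editorial sketch of the worst-case descriptor accounting behind the
 * "budget >= 4" check: one data descriptor for the frame itself, one
 * context descriptor when a launch time is programmed, and two more
 * (data + context) when an empty frame must be inserted because the
 * launch time falls into the next Qbv cycle. */
static int xsk_descs_needed(bool launchtime, bool insert_empty)
{
	int n = 1;			/* data descriptor for the frame */

	if (launchtime) {
		n += 1;			/* context descriptor for launch time */
		if (insert_empty)
			n += 2;		/* empty frame: data + context */
	}
	return n;
}
```

The maximum of this function is 4, which is why the loop reserves that many descriptors up front instead of decrementing by one per iteration.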
+2
drivers/net/ethernet/stmicro/stmmac/stmmac.h
···
 	struct stmmac_priv *priv;
 	struct dma_desc *tx_desc;
 	bool *set_ic;
+	struct dma_edesc *edesc;
+	int tbs;
 };
 
 struct stmmac_xsk_tx_complete {
+13
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
···
 	return 0;
 }
 
+static void stmmac_xsk_request_launch_time(u64 launch_time, void *_priv)
+{
+	struct timespec64 ts = ns_to_timespec64(launch_time);
+	struct stmmac_metadata_request *meta_req = _priv;
+
+	if (meta_req->tbs & STMMAC_TBS_EN)
+		stmmac_set_desc_tbs(meta_req->priv, meta_req->edesc, ts.tv_sec,
+				    ts.tv_nsec);
+}
+
 static const struct xsk_tx_metadata_ops stmmac_xsk_tx_metadata_ops = {
 	.tmo_request_timestamp = stmmac_xsk_request_timestamp,
 	.tmo_fill_timestamp = stmmac_xsk_fill_timestamp,
+	.tmo_request_launch_time = stmmac_xsk_request_launch_time,
 };
 
 static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
···
 		meta_req.priv = priv;
 		meta_req.tx_desc = tx_desc;
 		meta_req.set_ic = &set_ic;
+		meta_req.tbs = tx_q->tbs;
+		meta_req.edesc = &tx_q->dma_entx[entry];
 		xsk_tx_metadata_request(meta, &stmmac_xsk_tx_metadata_ops,
 					&meta_req);
 		if (set_ic) {
+1
include/linux/filter.h
···
 	void *skb_data_end;
 	u8 op;
 	u8 is_fullsock;
+	u8 is_locked_tcp_sock;
 	u8 remaining_opt_len;
 	u64 temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
+9 -3
include/linux/skbuff.h
···
 /* Definitions for tx_flags in struct skb_shared_info */
 enum {
 	/* generate hardware time stamp */
-	SKBTX_HW_TSTAMP = 1 << 0,
+	SKBTX_HW_TSTAMP_NOBPF = 1 << 0,
 
 	/* generate software time stamp when queueing packet to NIC */
 	SKBTX_SW_TSTAMP = 1 << 1,
···
 
 	/* generate software time stamp when entering packet scheduling */
 	SKBTX_SCHED_TSTAMP = 1 << 6,
+
+	/* used for bpf extension when a bpf program is loaded */
+	SKBTX_BPF = 1 << 7,
 };
 
+#define SKBTX_HW_TSTAMP		(SKBTX_HW_TSTAMP_NOBPF | SKBTX_BPF)
+
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
-				 SKBTX_SCHED_TSTAMP)
+				 SKBTX_SCHED_TSTAMP | \
+				 SKBTX_BPF)
 #define SKBTX_ANY_TSTAMP	(SKBTX_HW_TSTAMP | \
 				 SKBTX_HW_TSTAMP_USE_CYCLES | \
 				 SKBTX_ANY_SW_TSTAMP)
···
 static inline void skb_tx_timestamp(struct sk_buff *skb)
 {
 	skb_clone_tx_timestamp(skb);
-	if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
+	if (skb_shinfo(skb)->tx_flags & (SKBTX_SW_TSTAMP | SKBTX_BPF))
 		skb_tstamp_tx(skb, NULL);
 }
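The key property of this flag split is that the old `SKBTX_HW_TSTAMP` name now covers both the SO_TIMESTAMPING path and the new BPF path, while `SKBTX_HW_TSTAMP_NOBPF` isolates the former. A quick sanity check of the layout, using a local mirror of the bit definitions from the diff above (copies for illustration, not the kernel headers):

```c
#include <assert.h>

/* Local mirror of the tx_flags bits from the diff above. */
#define SKBTX_HW_TSTAMP_NOBPF	(1 << 0)
#define SKBTX_SW_TSTAMP		(1 << 1)
#define SKBTX_SCHED_TSTAMP	(1 << 6)
#define SKBTX_BPF		(1 << 7)
#define SKBTX_HW_TSTAMP		(SKBTX_HW_TSTAMP_NOBPF | SKBTX_BPF)
#define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP | SKBTX_SCHED_TSTAMP | SKBTX_BPF)
```

So any existing test against `SKBTX_HW_TSTAMP` also fires for BPF-requested timestamps, and sites that must not see BPF-only requests (such as the DSA change below in this series) switch to `SKBTX_HW_TSTAMP_NOBPF`.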
+10
include/net/sock.h
···
  *	@sk_stamp: time stamp of last packet received
  *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
  *	@sk_tsflags: SO_TIMESTAMPING flags
+ *	@sk_bpf_cb_flags: used in bpf_setsockopt()
  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
  *			   Sockets that can be used under memory reclaim should
  *			   set this to false.
···
 	u8			sk_txtime_deadline_mode : 1,
 				sk_txtime_report_errors : 1,
 				sk_txtime_unused : 6;
+#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
+	u8			sk_bpf_cb_flags;
 
 	void			*sk_user_data;
 #ifdef CONFIG_SECURITY
···
 			   struct so_timestamping timestamping);
 
 void sock_enable_timestamps(struct sock *sk);
+#if defined(CONFIG_CGROUP_BPF)
+void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op);
+#else
+static inline void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
+{
+}
+#endif
 void sock_no_linger(struct sock *sk);
 void sock_set_keepalive(struct sock *sk);
 void sock_set_priority(struct sock *sk, u32 priority);
+5 -2
include/net/tcp.h
···
 
 	__u8		sacked;		/* State flags for SACK.	*/
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
-	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
+#define TSTAMP_ACK_SK	0x1
+#define TSTAMP_ACK_BPF	0x2
+	__u8		txstamp_ack:2,	/* Record TX timestamp for ack? */
 			eor:1,		/* Is skb MSG_EOR marked? */
 			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
-			unused:5;
+			unused:4;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct {
···
 	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
 	if (sk_fullsock(sk)) {
 		sock_ops.is_fullsock = 1;
+		sock_ops.is_locked_tcp_sock = 1;
 		sock_owned_by_me(sk);
 	}
 
+10
include/net/xdp_sock.h
···
  *	indicates position where checksumming should start.
  *	csum_offset indicates position where checksum should be stored.
  *
+ * void (*tmo_request_launch_time)(u64 launch_time, void *priv)
+ *	Called when AF_XDP frame requested launch time HW offload support.
+ *	launch_time indicates the PTP time at which the device can schedule the
+ *	packet for transmission.
  */
 struct xsk_tx_metadata_ops {
 	void	(*tmo_request_timestamp)(void *priv);
 	u64	(*tmo_fill_timestamp)(void *priv);
 	void	(*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
+	void	(*tmo_request_launch_time)(u64 launch_time, void *priv);
 };
 
 #ifdef CONFIG_XDP_SOCKETS
···
 {
 	if (!meta)
 		return;
+
+	if (ops->tmo_request_launch_time)
+		if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
+			ops->tmo_request_launch_time(meta->request.launch_time,
+						     priv);
 
 	if (ops->tmo_request_timestamp)
 		if (meta->flags & XDP_TXMD_FLAGS_TIMESTAMP)
+1
include/net/xdp_sock_drv.h
···
 #define XDP_TXMD_FLAGS_VALID ( \
 		XDP_TXMD_FLAGS_TIMESTAMP | \
 		XDP_TXMD_FLAGS_CHECKSUM | \
+		XDP_TXMD_FLAGS_LAUNCH_TIME | \
 	0)
 
 static inline bool
+30
include/uapi/linux/bpf.h
···
 	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
 };
 
+enum {
+	SK_BPF_CB_TX_TIMESTAMPING	= 1<<0,
+	SK_BPF_CB_MASK			= (SK_BPF_CB_TX_TIMESTAMPING - 1) |
+					   SK_BPF_CB_TX_TIMESTAMPING
+};
+
 /* List of known BPF sock_ops operators.
  * New entries can only be added at the end
  */
···
 					 * by the kernel or the
 					 * earlier bpf-progs.
 					 */
+	BPF_SOCK_OPS_TSTAMP_SCHED_CB,	/* Called when skb is passing
+					 * through dev layer when
+					 * SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
+	BPF_SOCK_OPS_TSTAMP_SND_SW_CB,	/* Called when skb is about to send
+					 * to the nic when SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
+	BPF_SOCK_OPS_TSTAMP_SND_HW_CB,	/* Called in hardware phase when
+					 * SK_BPF_CB_TX_TIMESTAMPING feature
+					 * is on.
+					 */
+	BPF_SOCK_OPS_TSTAMP_ACK_CB,	/* Called when all the skbs in the
+					 * same sendmsg call are acked
+					 * when SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
+	BPF_SOCK_OPS_TSTAMP_SENDMSG_CB,	/* Called when every sendmsg syscall
+					 * is triggered. It's used to correlate
+					 * sendmsg timestamp with corresponding
+					 * tskey.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
···
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
+	SK_BPF_CB_FLAGS		= 1009, /* Get or set sock ops flags in socket */
 };
 
 enum {
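The `SK_BPF_CB_MASK` construction guarantees that every currently defined bit is accepted and everything above is rejected, which is exactly the check `bpf_setsockopt(SK_BPF_CB_FLAGS)` performs later in this series. A userspace sketch of that validation, mirroring the enum values from the diff above:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirror of the uapi values from the diff above, for illustration. */
#define SK_BPF_CB_TX_TIMESTAMPING	(1u << 0)
#define SK_BPF_CB_MASK			((SK_BPF_CB_TX_TIMESTAMPING - 1) | \
					 SK_BPF_CB_TX_TIMESTAMPING)

/* Sketch of the validation done by sk_bpf_set_get_cb_flags():
 * any bit outside SK_BPF_CB_MASK is rejected with -EINVAL. */
static int check_sk_bpf_cb_flags(uint32_t flags)
{
	if (flags & ~SK_BPF_CB_MASK)
		return -EINVAL;
	return 0;
}
```

When a new flag is added before `SK_BPF_CB_MASK` in the enum, the `(highest - 1) | highest` pattern automatically widens the mask.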
+10
include/uapi/linux/if_xdp.h
···
  */
 #define XDP_TXMD_FLAGS_CHECKSUM			(1 << 1)
 
+/* Request launch time hardware offload. The device will schedule the packet for
+ * transmission at a pre-determined time called launch time. The value of
+ * launch time is communicated via launch_time field of struct xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_LAUNCH_TIME		(1 << 2)
+
 /* AF_XDP offloads request. 'request' union member is consumed by the driver
  * when the packet is being transmitted. 'completion' union member is
  * filled by the driver when the transmit completion arrives.
···
 			__u16 csum_start;
 			/* Offset from csum_start where checksum should be stored. */
 			__u16 csum_offset;
+
+			/* XDP_TXMD_FLAGS_LAUNCH_TIME */
+			/* Launch time in nanosecond against the PTP HW Clock */
+			__u64 launch_time;
 		} request;
 
 		struct {
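From the application side, requesting a launch time amounts to setting the flag and the absolute PHC time in the per-packet metadata area. A hedged sketch using a simplified local mirror of the uapi layout (just a flags word plus the request union's `launch_time` field; build against `<linux/if_xdp.h>` for the real definitions, and note the helper name here is invented):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XDP_TXMD_FLAGS_LAUNCH_TIME	(1u << 2)

/* Simplified stand-in for union xsk_tx_metadata, for illustration only. */
struct xsk_tx_metadata_mirror {
	uint64_t flags;
	uint64_t launch_time;	/* ns against the PTP HW clock */
};

/* Schedule transmission delay_ns after the given PHC reading. */
static void request_launch_time(struct xsk_tx_metadata_mirror *meta,
				uint64_t phc_now_ns, uint64_t delay_ns)
{
	memset(meta, 0, sizeof(*meta));
	meta->flags |= XDP_TXMD_FLAGS_LAUNCH_TIME;
	meta->launch_time = phc_now_ns + delay_ns;
}
```

As the documentation change notes, the value must come from the device's PHC and stay within the device's horizon, and the descriptor must also carry `XDP_TX_METADATA` for the request to be honored.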
+3
include/uapi/linux/netdev.h
···
  *   by the driver.
  * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
  *   driver.
+ * @NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO: Launch time HW offload is supported
+ *   by the driver.
  */
 enum netdev_xsk_flags {
 	NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
 	NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+	NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO = 4,
 };
 
 enum netdev_queue_type {
+1
kernel/bpf/btf.c
···
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_SOCK_OPS:
 		return BTF_KFUNC_HOOK_CGROUP;
 	case BPF_PROG_TYPE_SCHED_ACT:
 		return BTF_KFUNC_HOOK_SCHED_ACT;
+2 -1
net/core/dev.c
···
 	skb_reset_mac_header(skb);
 	skb_assert_len(skb);
 
-	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
+	if (unlikely(skb_shinfo(skb)->tx_flags &
+		     (SKBTX_SCHED_TSTAMP | SKBTX_BPF)))
 		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
 
 	/* Disable soft irqs for various locks below. Also
+75 -5
net/core/filter.c
···
 	.arg1_type	= ARG_PTR_TO_CTX,
 };
 
+static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
+{
+	u32 sk_bpf_cb_flags;
+
+	if (getopt) {
+		*(u32 *)optval = sk->sk_bpf_cb_flags;
+		return 0;
+	}
+
+	sk_bpf_cb_flags = *(u32 *)optval;
+
+	if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
+		return -EINVAL;
+
+	sk->sk_bpf_cb_flags = sk_bpf_cb_flags;
+
+	return 0;
+}
+
 static int sol_socket_sockopt(struct sock *sk, int optname,
 			      char *optval, int *optlen,
 			      bool getopt)
···
 	case SO_MAX_PACING_RATE:
 	case SO_BINDTOIFINDEX:
 	case SO_TXREHASH:
+	case SK_BPF_CB_FLAGS:
 		if (*optlen != sizeof(int))
 			return -EINVAL;
 		break;
···
 	default:
 		return -EINVAL;
 	}
+
+	if (optname == SK_BPF_CB_FLAGS)
+		return sk_bpf_set_get_cb_flags(sk, optval, getopt);
 
 	if (getopt) {
 		if (optname == SO_BINDTODEVICE)
···
 	case TCP_USER_TIMEOUT:
 	case TCP_NOTSENT_LOWAT:
 	case TCP_SAVE_SYN:
+	case TCP_RTO_MAX_MS:
 		if (*optlen != sizeof(int))
 			return -EINVAL;
 		break;
···
 		return sol_tcp_sockopt(sk, optname, optval, &optlen, false);
 
 	return -EINVAL;
+}
+
+static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
+{
+	return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
 }
 
 static int _bpf_setsockopt(struct sock *sk, int level, int optname,
···
 BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
+	if (!is_locked_tcp_sock_ops(bpf_sock))
+		return -EOPNOTSUPP;
+
 	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }
 
···
 BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
+	if (!is_locked_tcp_sock_ops(bpf_sock))
+		return -EOPNOTSUPP;
+
 	if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
 	    optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
 		int ret, copy_len = 0;
···
 {
 	struct sock *sk = bpf_sock->sk;
 	int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+
+	if (!is_locked_tcp_sock_ops(bpf_sock))
+		return -EOPNOTSUPP;
 
 	if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
 		return -EINVAL;
···
 	const u8 *op, *opend, *magic, *search = search_res;
 	u8 search_kind, search_len, copy_len, magic_len;
 	int ret;
+
+	if (!is_locked_tcp_sock_ops(bpf_sock))
+		return -EOPNOTSUPP;
 
 	/* 2 byte is the minimal option len except TCPOPT_NOP and
 	 * TCPOPT_EOL which are useless for the bpf prog to learn
···
 	}								      \
 	*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(				      \
 				struct bpf_sock_ops_kern,		      \
-				is_fullsock),				      \
+				is_locked_tcp_sock),			      \
 			      fullsock_reg, si->src_reg,		      \
 			      offsetof(struct bpf_sock_ops_kern,	      \
-				       is_fullsock));			      \
+				       is_locked_tcp_sock));		      \
 	*insn++ = BPF_JMP_IMM(BPF_JEQ, fullsock_reg, 0, jmp);		      \
 	if (si->dst_reg == si->src_reg)					      \
 		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg,		      \
···
 					    temp));			      \
 	*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(				      \
 				struct bpf_sock_ops_kern,		      \
-				is_fullsock),				      \
+				is_locked_tcp_sock),			      \
 			      reg, si->dst_reg,				      \
 			      offsetof(struct bpf_sock_ops_kern,	      \
-				       is_fullsock));			      \
+				       is_locked_tcp_sock));		      \
 	*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);			      \
 	*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(				      \
 					struct bpf_sock_ops_kern, sk),	      \
···
 #endif
 }
 
+__bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
+					      u64 flags)
+{
+	struct sk_buff *skb;
+
+	if (skops->op != BPF_SOCK_OPS_TSTAMP_SENDMSG_CB)
+		return -EOPNOTSUPP;
+
+	if (flags)
+		return -EINVAL;
+
+	skb = skops->skb;
+	skb_shinfo(skb)->tx_flags |= SKBTX_BPF;
+	TCP_SKB_CB(skb)->txstamp_ack |= TSTAMP_ACK_BPF;
+	skb_shinfo(skb)->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+
+	return 0;
+}
+
 __bpf_kfunc_end_defs();
 
 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
···
 BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
 
+BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
+BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
···
 	.set = &bpf_kfunc_check_set_tcp_reqsk,
 };
 
+static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_sock_ops,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
···
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 					       &bpf_kfunc_set_sock_addr);
-	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
 }
 late_initcall(bpf_kfunc_init);
 
+2
net/core/netdev-genl.c
···
 			xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
 		if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
 			xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
+		if (netdev->xsk_tx_metadata_ops->tmo_request_launch_time)
+			xsk_features |= NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO;
 	}
 
 	if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
+53
net/core/skbuff.c
···
 }
 EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
 
+static bool skb_tstamp_tx_report_so_timestamping(struct sk_buff *skb,
+						 struct skb_shared_hwtstamps *hwtstamps,
+						 int tstype)
+{
+	switch (tstype) {
+	case SCM_TSTAMP_SCHED:
+		return skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP;
+	case SCM_TSTAMP_SND:
+		return skb_shinfo(skb)->tx_flags & (hwtstamps ? SKBTX_HW_TSTAMP_NOBPF :
+						    SKBTX_SW_TSTAMP);
+	case SCM_TSTAMP_ACK:
+		return TCP_SKB_CB(skb)->txstamp_ack & TSTAMP_ACK_SK;
+	}
+
+	return false;
+}
+
+static void skb_tstamp_tx_report_bpf_timestamping(struct sk_buff *skb,
+						  struct skb_shared_hwtstamps *hwtstamps,
+						  struct sock *sk,
+						  int tstype)
+{
+	int op;
+
+	switch (tstype) {
+	case SCM_TSTAMP_SCHED:
+		op = BPF_SOCK_OPS_TSTAMP_SCHED_CB;
+		break;
+	case SCM_TSTAMP_SND:
+		if (hwtstamps) {
+			op = BPF_SOCK_OPS_TSTAMP_SND_HW_CB;
+			*skb_hwtstamps(skb) = *hwtstamps;
+		} else {
+			op = BPF_SOCK_OPS_TSTAMP_SND_SW_CB;
+		}
+		break;
+	case SCM_TSTAMP_ACK:
+		op = BPF_SOCK_OPS_TSTAMP_ACK_CB;
+		break;
+	default:
+		return;
+	}
+
+	bpf_skops_tx_timestamping(sk, skb, op);
+}
+
 void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		     const struct sk_buff *ack_skb,
 		     struct skb_shared_hwtstamps *hwtstamps,
···
 	u32 tsflags;
 
 	if (!sk)
+		return;
+
+	if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
+		skb_tstamp_tx_report_bpf_timestamping(orig_skb, hwtstamps,
+						      sk, tstype);
+
+	if (!skb_tstamp_tx_report_so_timestamping(orig_skb, hwtstamps, tstype))
 		return;
 
 	tsflags = READ_ONCE(sk->sk_tsflags);
+14
net/core/sock.c
··· 949 949 return 0; 950 950 } 951 951 952 + #if defined(CONFIG_CGROUP_BPF) 953 + void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op) 954 + { 955 + struct bpf_sock_ops_kern sock_ops; 956 + 957 + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); 958 + sock_ops.op = op; 959 + sock_ops.is_fullsock = 1; 960 + sock_ops.sk = sk; 961 + bpf_skops_init_skb(&sock_ops, skb, 0); 962 + __cgroup_bpf_run_filter_sock_ops(sk, &sock_ops, CGROUP_SOCK_OPS); 963 + } 964 + #endif 965 + 952 966 void sock_set_keepalive(struct sock *sk) 953 967 { 954 968 lock_sock(sk);
+1 -1
net/dsa/user.c
··· 897 897 { 898 898 struct dsa_switch *ds = p->dp->ds; 899 899 900 - if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) 900 + if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP_NOBPF)) 901 901 return; 902 902 903 903 if (!ds->ops->port_txtstamp)
+5 -1
net/ipv4/tcp.c
··· 492 492 493 493 sock_tx_timestamp(sk, sockc, &shinfo->tx_flags); 494 494 if (tsflags & SOF_TIMESTAMPING_TX_ACK) 495 - tcb->txstamp_ack = 1; 495 + tcb->txstamp_ack |= TSTAMP_ACK_SK; 496 496 if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK) 497 497 shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1; 498 498 } 499 + 500 + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) && 501 + SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) 502 + bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TSTAMP_SENDMSG_CB); 499 503 } 500 504 501 505 static bool tcp_stream_is_readable(struct sock *sk, int target)
+2
net/ipv4/tcp_input.c
··· 169 169 memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); 170 170 sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB; 171 171 sock_ops.is_fullsock = 1; 172 + sock_ops.is_locked_tcp_sock = 1; 172 173 sock_ops.sk = sk; 173 174 bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb)); 174 175 ··· 186 185 memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); 187 186 sock_ops.op = bpf_op; 188 187 sock_ops.is_fullsock = 1; 188 + sock_ops.is_locked_tcp_sock = 1; 189 189 sock_ops.sk = sk; 190 190 /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */ 191 191 if (skb)
+2
net/ipv4/tcp_output.c
··· 525 525 sock_owned_by_me(sk); 526 526 527 527 sock_ops.is_fullsock = 1; 528 + sock_ops.is_locked_tcp_sock = 1; 528 529 sock_ops.sk = sk; 529 530 } 530 531 ··· 571 570 sock_owned_by_me(sk); 572 571 573 572 sock_ops.is_fullsock = 1; 573 + sock_ops.is_locked_tcp_sock = 1; 574 574 sock_ops.sk = sk; 575 575 } 576 576
+1 -1
net/socket.c
··· 681 681 u8 flags = *tx_flags; 682 682 683 683 if (tsflags & SOF_TIMESTAMPING_TX_HARDWARE) { 684 - flags |= SKBTX_HW_TSTAMP; 684 + flags |= SKBTX_HW_TSTAMP_NOBPF; 685 685 686 686 /* PTP hardware clocks can provide a free running cycle counter 687 687 * as a time base for virtual clocks. Tell driver to use the
+3
net/xdp/xsk.c
··· 742 742 goto free_err; 743 743 } 744 744 } 745 + 746 + if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME) 747 + skb->skb_mstamp_ns = meta->request.launch_time; 745 748 } 746 749 } 747 750
+30
tools/include/uapi/linux/bpf.h
··· 6913 6913 BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F, 6914 6914 }; 6915 6915 6916 + enum { 6917 + SK_BPF_CB_TX_TIMESTAMPING = 1<<0, 6918 + SK_BPF_CB_MASK = (SK_BPF_CB_TX_TIMESTAMPING - 1) | 6919 + SK_BPF_CB_TX_TIMESTAMPING 6920 + }; 6921 + 6916 6922 /* List of known BPF sock_ops operators. 6917 6923 * New entries can only be added at the end 6918 6924 */ ··· 7031 7025 * by the kernel or the 7032 7026 * earlier bpf-progs. 7033 7027 */ 7028 + BPF_SOCK_OPS_TSTAMP_SCHED_CB, /* Called when skb is passing 7029 + * through dev layer when 7030 + * SK_BPF_CB_TX_TIMESTAMPING 7031 + * feature is on. 7032 + */ 7033 + BPF_SOCK_OPS_TSTAMP_SND_SW_CB, /* Called when skb is about to send 7034 + * to the nic when SK_BPF_CB_TX_TIMESTAMPING 7035 + * feature is on. 7036 + */ 7037 + BPF_SOCK_OPS_TSTAMP_SND_HW_CB, /* Called in hardware phase when 7038 + * SK_BPF_CB_TX_TIMESTAMPING feature 7039 + * is on. 7040 + */ 7041 + BPF_SOCK_OPS_TSTAMP_ACK_CB, /* Called when all the skbs in the 7042 + * same sendmsg call are acked 7043 + * when SK_BPF_CB_TX_TIMESTAMPING 7044 + * feature is on. 7045 + */ 7046 + BPF_SOCK_OPS_TSTAMP_SENDMSG_CB, /* Called when every sendmsg syscall 7047 + * is triggered. It's used to correlate 7048 + * sendmsg timestamp with corresponding 7049 + * tskey. 7050 + */ 7034 7051 }; 7035 7052 7036 7053 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect ··· 7120 7091 TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ 7121 7092 TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */ 7122 7093 TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */ 7094 + SK_BPF_CB_FLAGS = 1009, /* Get or set sock ops flags in socket */ 7123 7095 }; 7124 7096 7125 7097 enum {
+10
tools/include/uapi/linux/if_xdp.h
··· 127 127 */ 128 128 #define XDP_TXMD_FLAGS_CHECKSUM (1 << 1) 129 129 130 + /* Request launch time hardware offload. The device will schedule the packet for 131 + * transmission at a pre-determined time called launch time. The value of 132 + * launch time is communicated via launch_time field of struct xsk_tx_metadata. 133 + */ 134 + #define XDP_TXMD_FLAGS_LAUNCH_TIME (1 << 2) 135 + 130 136 /* AF_XDP offloads request. 'request' union member is consumed by the driver 131 137 * when the packet is being transmitted. 'completion' union member is 132 138 * filled by the driver when the transmit completion arrives. ··· 148 142 __u16 csum_start; 149 143 /* Offset from csum_start where checksum should be stored. */ 150 144 __u16 csum_offset; 145 + 146 + /* XDP_TXMD_FLAGS_LAUNCH_TIME */ 147 + /* Launch time in nanosecond against the PTP HW Clock */ 148 + __u64 launch_time; 151 149 } request; 152 150 153 151 struct {
+3
tools/include/uapi/linux/netdev.h
··· 59 59 * by the driver. 60 60 * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the 61 61 * driver. 62 + * @NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO: Launch time HW offload is supported 63 + * by the driver. 62 64 */ 63 65 enum netdev_xsk_flags { 64 66 NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1, 65 67 NETDEV_XSK_FLAGS_TX_CHECKSUM = 2, 68 + NETDEV_XSK_FLAGS_TX_LAUNCH_TIME_FIFO = 4, 66 69 }; 67 70 68 71 enum netdev_queue_type {
+239
tools/testing/selftests/bpf/prog_tests/net_timestamping.c
··· 1 + #include <linux/net_tstamp.h> 2 + #include <sys/time.h> 3 + #include <linux/errqueue.h> 4 + #include "test_progs.h" 5 + #include "network_helpers.h" 6 + #include "net_timestamping.skel.h" 7 + 8 + #define CG_NAME "/net-timestamping-test" 9 + #define NSEC_PER_SEC 1000000000LL 10 + 11 + static const char addr4_str[] = "127.0.0.1"; 12 + static const char addr6_str[] = "::1"; 13 + static struct net_timestamping *skel; 14 + static const int cfg_payload_len = 30; 15 + static struct timespec usr_ts; 16 + static u64 delay_tolerance_nsec = 10000000000; /* 10 seconds */ 17 + int SK_TS_SCHED; 18 + int SK_TS_TXSW; 19 + int SK_TS_ACK; 20 + 21 + static int64_t timespec_to_ns64(struct timespec *ts) 22 + { 23 + return ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec; 24 + } 25 + 26 + static void validate_key(int tskey, int tstype) 27 + { 28 + static int expected_tskey = -1; 29 + 30 + if (tstype == SCM_TSTAMP_SCHED) 31 + expected_tskey = cfg_payload_len - 1; 32 + 33 + ASSERT_EQ(expected_tskey, tskey, "tskey mismatch"); 34 + 35 + expected_tskey = tskey; 36 + } 37 + 38 + static void validate_timestamp(struct timespec *cur, struct timespec *prev) 39 + { 40 + int64_t cur_ns, prev_ns; 41 + 42 + cur_ns = timespec_to_ns64(cur); 43 + prev_ns = timespec_to_ns64(prev); 44 + 45 + ASSERT_LT(cur_ns - prev_ns, delay_tolerance_nsec, "latency"); 46 + } 47 + 48 + static void test_socket_timestamp(struct scm_timestamping *tss, int tstype, 49 + int tskey) 50 + { 51 + static struct timespec prev_ts; 52 + 53 + validate_key(tskey, tstype); 54 + 55 + switch (tstype) { 56 + case SCM_TSTAMP_SCHED: 57 + validate_timestamp(&tss->ts[0], &usr_ts); 58 + SK_TS_SCHED += 1; 59 + break; 60 + case SCM_TSTAMP_SND: 61 + validate_timestamp(&tss->ts[0], &prev_ts); 62 + SK_TS_TXSW += 1; 63 + break; 64 + case SCM_TSTAMP_ACK: 65 + validate_timestamp(&tss->ts[0], &prev_ts); 66 + SK_TS_ACK += 1; 67 + break; 68 + } 69 + 70 + prev_ts = tss->ts[0]; 71 + } 72 + 73 + static void test_recv_errmsg_cmsg(struct msghdr *msg) 74 + { 75 + 
struct sock_extended_err *serr = NULL; 76 + struct scm_timestamping *tss = NULL; 77 + struct cmsghdr *cm; 78 + 79 + for (cm = CMSG_FIRSTHDR(msg); 80 + cm && cm->cmsg_len; 81 + cm = CMSG_NXTHDR(msg, cm)) { 82 + if (cm->cmsg_level == SOL_SOCKET && 83 + cm->cmsg_type == SCM_TIMESTAMPING) { 84 + tss = (void *)CMSG_DATA(cm); 85 + } else if ((cm->cmsg_level == SOL_IP && 86 + cm->cmsg_type == IP_RECVERR) || 87 + (cm->cmsg_level == SOL_IPV6 && 88 + cm->cmsg_type == IPV6_RECVERR) || 89 + (cm->cmsg_level == SOL_PACKET && 90 + cm->cmsg_type == PACKET_TX_TIMESTAMP)) { 91 + serr = (void *)CMSG_DATA(cm); 92 + ASSERT_EQ(serr->ee_origin, SO_EE_ORIGIN_TIMESTAMPING, 93 + "cmsg type"); 94 + } 95 + 96 + if (serr && tss) 97 + test_socket_timestamp(tss, serr->ee_info, 98 + serr->ee_data); 99 + } 100 + } 101 + 102 + static bool socket_recv_errmsg(int fd) 103 + { 104 + static char ctrl[1024 /* overprovision*/]; 105 + char data[cfg_payload_len]; 106 + static struct msghdr msg; 107 + struct iovec entry; 108 + int n = 0; 109 + 110 + memset(&msg, 0, sizeof(msg)); 111 + memset(&entry, 0, sizeof(entry)); 112 + memset(ctrl, 0, sizeof(ctrl)); 113 + 114 + entry.iov_base = data; 115 + entry.iov_len = cfg_payload_len; 116 + msg.msg_iov = &entry; 117 + msg.msg_iovlen = 1; 118 + msg.msg_name = NULL; 119 + msg.msg_namelen = 0; 120 + msg.msg_control = ctrl; 121 + msg.msg_controllen = sizeof(ctrl); 122 + 123 + n = recvmsg(fd, &msg, MSG_ERRQUEUE); 124 + if (n == -1) 125 + ASSERT_EQ(errno, EAGAIN, "recvmsg MSG_ERRQUEUE"); 126 + 127 + if (n >= 0) 128 + test_recv_errmsg_cmsg(&msg); 129 + 130 + return n == -1; 131 + } 132 + 133 + static void test_socket_timestamping(int fd) 134 + { 135 + while (!socket_recv_errmsg(fd)); 136 + 137 + ASSERT_EQ(SK_TS_SCHED, 1, "SCM_TSTAMP_SCHED"); 138 + ASSERT_EQ(SK_TS_TXSW, 1, "SCM_TSTAMP_SND"); 139 + ASSERT_EQ(SK_TS_ACK, 1, "SCM_TSTAMP_ACK"); 140 + 141 + SK_TS_SCHED = 0; 142 + SK_TS_TXSW = 0; 143 + SK_TS_ACK = 0; 144 + } 145 + 146 + static void test_tcp(int family, bool 
enable_socket_timestamping) 147 + { 148 + struct net_timestamping__bss *bss; 149 + char buf[cfg_payload_len]; 150 + int sfd = -1, cfd = -1; 151 + unsigned int sock_opt; 152 + struct netns_obj *ns; 153 + int cg_fd; 154 + int ret; 155 + 156 + cg_fd = test__join_cgroup(CG_NAME); 157 + if (!ASSERT_OK_FD(cg_fd, "join cgroup")) 158 + return; 159 + 160 + ns = netns_new("net_timestamping_ns", true); 161 + if (!ASSERT_OK_PTR(ns, "create ns")) 162 + goto out; 163 + 164 + skel = net_timestamping__open_and_load(); 165 + if (!ASSERT_OK_PTR(skel, "open and load skel")) 166 + goto out; 167 + 168 + if (!ASSERT_OK(net_timestamping__attach(skel), "attach skel")) 169 + goto out; 170 + 171 + skel->links.skops_sockopt = 172 + bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd); 173 + if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup")) 174 + goto out; 175 + 176 + bss = skel->bss; 177 + memset(bss, 0, sizeof(*bss)); 178 + 179 + skel->bss->monitored_pid = getpid(); 180 + 181 + sfd = start_server(family, SOCK_STREAM, 182 + family == AF_INET6 ? 
addr6_str : addr4_str, 0, 0); 183 + if (!ASSERT_OK_FD(sfd, "start_server")) 184 + goto out; 185 + 186 + cfd = connect_to_fd(sfd, 0); 187 + if (!ASSERT_OK_FD(cfd, "connect_to_fd_server")) 188 + goto out; 189 + 190 + if (enable_socket_timestamping) { 191 + sock_opt = SOF_TIMESTAMPING_SOFTWARE | 192 + SOF_TIMESTAMPING_OPT_ID | 193 + SOF_TIMESTAMPING_TX_SCHED | 194 + SOF_TIMESTAMPING_TX_SOFTWARE | 195 + SOF_TIMESTAMPING_TX_ACK; 196 + ret = setsockopt(cfd, SOL_SOCKET, SO_TIMESTAMPING, 197 + (char *) &sock_opt, sizeof(sock_opt)); 198 + if (!ASSERT_OK(ret, "setsockopt SO_TIMESTAMPING")) 199 + goto out; 200 + 201 + ret = clock_gettime(CLOCK_REALTIME, &usr_ts); 202 + if (!ASSERT_OK(ret, "get user time")) 203 + goto out; 204 + } 205 + 206 + ret = write(cfd, buf, sizeof(buf)); 207 + if (!ASSERT_EQ(ret, sizeof(buf), "send to server")) 208 + goto out; 209 + 210 + if (enable_socket_timestamping) 211 + test_socket_timestamping(cfd); 212 + 213 + ASSERT_EQ(bss->nr_active, 1, "nr_active"); 214 + ASSERT_EQ(bss->nr_snd, 2, "nr_snd"); 215 + ASSERT_EQ(bss->nr_sched, 1, "nr_sched"); 216 + ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw"); 217 + ASSERT_EQ(bss->nr_ack, 1, "nr_ack"); 218 + 219 + out: 220 + if (sfd >= 0) 221 + close(sfd); 222 + if (cfd >= 0) 223 + close(cfd); 224 + net_timestamping__destroy(skel); 225 + netns_free(ns); 226 + close(cg_fd); 227 + } 228 + 229 + void test_net_timestamping(void) 230 + { 231 + if (test__start_subtest("INET4: bpf timestamping")) 232 + test_tcp(AF_INET, false); 233 + if (test__start_subtest("INET4: bpf and socket timestamping")) 234 + test_tcp(AF_INET, true); 235 + if (test__start_subtest("INET6: bpf timestamping")) 236 + test_tcp(AF_INET6, false); 237 + if (test__start_subtest("INET6: bpf and socket timestamping")) 238 + test_tcp(AF_INET6, true); 239 + }
+1
tools/testing/selftests/bpf/progs/bpf_tracing_net.h
··· 49 49 #define TCP_SAVED_SYN 28 50 50 #define TCP_CA_NAME_MAX 16 51 51 #define TCP_NAGLE_OFF 1 52 + #define TCP_RTO_MAX_MS 44 52 53 53 54 #define TCP_ECN_OK 1 54 55 #define TCP_ECN_QUEUE_CWR 2
+248
tools/testing/selftests/bpf/progs/net_timestamping.c
··· 1 + #include "vmlinux.h" 2 + #include "bpf_tracing_net.h" 3 + #include <bpf/bpf_helpers.h> 4 + #include <bpf/bpf_tracing.h> 5 + #include "bpf_misc.h" 6 + #include "bpf_kfuncs.h" 7 + #include <errno.h> 8 + 9 + __u32 monitored_pid = 0; 10 + 11 + int nr_active; 12 + int nr_snd; 13 + int nr_passive; 14 + int nr_sched; 15 + int nr_txsw; 16 + int nr_ack; 17 + 18 + struct sk_stg { 19 + __u64 sendmsg_ns; /* record ts when sendmsg is called */ 20 + }; 21 + 22 + struct sk_tskey { 23 + u64 cookie; 24 + u32 tskey; 25 + }; 26 + 27 + struct delay_info { 28 + u64 sendmsg_ns; /* record ts when sendmsg is called */ 29 + u32 sched_delay; /* SCHED_CB - sendmsg_ns */ 30 + u32 snd_sw_delay; /* SND_SW_CB - SCHED_CB */ 31 + u32 ack_delay; /* ACK_CB - SND_SW_CB */ 32 + }; 33 + 34 + struct { 35 + __uint(type, BPF_MAP_TYPE_SK_STORAGE); 36 + __uint(map_flags, BPF_F_NO_PREALLOC); 37 + __type(key, int); 38 + __type(value, struct sk_stg); 39 + } sk_stg_map SEC(".maps"); 40 + 41 + struct { 42 + __uint(type, BPF_MAP_TYPE_HASH); 43 + __type(key, struct sk_tskey); 44 + __type(value, struct delay_info); 45 + __uint(max_entries, 1024); 46 + } time_map SEC(".maps"); 47 + 48 + static u64 delay_tolerance_nsec = 10000000000; /* 10 second as an example */ 49 + 50 + extern int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops, u64 flags) __ksym; 51 + 52 + static int bpf_test_sockopt(void *ctx, const struct sock *sk, int expected) 53 + { 54 + int tmp, new = SK_BPF_CB_TX_TIMESTAMPING; 55 + int opt = SK_BPF_CB_FLAGS; 56 + int level = SOL_SOCKET; 57 + 58 + if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)) != expected) 59 + return 1; 60 + 61 + if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) != expected || 62 + (!expected && tmp != new)) 63 + return 1; 64 + 65 + return 0; 66 + } 67 + 68 + static bool bpf_test_access_sockopt(void *ctx, const struct sock *sk) 69 + { 70 + if (bpf_test_sockopt(ctx, sk, -EOPNOTSUPP)) 71 + return true; 72 + return false; 73 + } 74 + 75 + static bool 
bpf_test_access_load_hdr_opt(struct bpf_sock_ops *skops) 76 + { 77 + u8 opt[3] = {0}; 78 + int load_flags = 0; 79 + int ret; 80 + 81 + ret = bpf_load_hdr_opt(skops, opt, sizeof(opt), load_flags); 82 + if (ret != -EOPNOTSUPP) 83 + return true; 84 + 85 + return false; 86 + } 87 + 88 + static bool bpf_test_access_cb_flags_set(struct bpf_sock_ops *skops) 89 + { 90 + int ret; 91 + 92 + ret = bpf_sock_ops_cb_flags_set(skops, 0); 93 + if (ret != -EOPNOTSUPP) 94 + return true; 95 + 96 + return false; 97 + } 98 + 99 + /* In the timestamping callbacks, we're not allowed to call the following 100 + * BPF CALLs for the safety concern. Return false if expected. 101 + */ 102 + static bool bpf_test_access_bpf_calls(struct bpf_sock_ops *skops, 103 + const struct sock *sk) 104 + { 105 + if (bpf_test_access_sockopt(skops, sk)) 106 + return true; 107 + 108 + if (bpf_test_access_load_hdr_opt(skops)) 109 + return true; 110 + 111 + if (bpf_test_access_cb_flags_set(skops)) 112 + return true; 113 + 114 + return false; 115 + } 116 + 117 + static bool bpf_test_delay(struct bpf_sock_ops *skops, const struct sock *sk) 118 + { 119 + struct bpf_sock_ops_kern *skops_kern; 120 + u64 timestamp = bpf_ktime_get_ns(); 121 + struct skb_shared_info *shinfo; 122 + struct delay_info dinfo = {0}; 123 + struct sk_tskey key = {0}; 124 + struct delay_info *val; 125 + struct sk_buff *skb; 126 + struct sk_stg *stg; 127 + u64 prior_ts, delay; 128 + 129 + if (bpf_test_access_bpf_calls(skops, sk)) 130 + return false; 131 + 132 + skops_kern = bpf_cast_to_kern_ctx(skops); 133 + skb = skops_kern->skb; 134 + shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info); 135 + 136 + key.cookie = bpf_get_socket_cookie(skops); 137 + if (!key.cookie) 138 + return false; 139 + 140 + if (skops->op == BPF_SOCK_OPS_TSTAMP_SENDMSG_CB) { 141 + stg = bpf_sk_storage_get(&sk_stg_map, (void *)sk, 0, 0); 142 + if (!stg) 143 + return false; 144 + dinfo.sendmsg_ns = stg->sendmsg_ns; 145 + 
bpf_sock_ops_enable_tx_tstamp(skops_kern, 0); 146 + key.tskey = shinfo->tskey; 147 + if (!key.tskey) 148 + return false; 149 + bpf_map_update_elem(&time_map, &key, &dinfo, BPF_ANY); 150 + return true; 151 + } 152 + 153 + key.tskey = shinfo->tskey; 154 + if (!key.tskey) 155 + return false; 156 + 157 + val = bpf_map_lookup_elem(&time_map, &key); 158 + if (!val) 159 + return false; 160 + 161 + switch (skops->op) { 162 + case BPF_SOCK_OPS_TSTAMP_SCHED_CB: 163 + val->sched_delay = timestamp - val->sendmsg_ns; 164 + delay = val->sched_delay; 165 + break; 166 + case BPF_SOCK_OPS_TSTAMP_SND_SW_CB: 167 + prior_ts = val->sched_delay + val->sendmsg_ns; 168 + val->snd_sw_delay = timestamp - prior_ts; 169 + delay = val->snd_sw_delay; 170 + break; 171 + case BPF_SOCK_OPS_TSTAMP_ACK_CB: 172 + prior_ts = val->snd_sw_delay + val->sched_delay + val->sendmsg_ns; 173 + val->ack_delay = timestamp - prior_ts; 174 + delay = val->ack_delay; 175 + break; 176 + } 177 + 178 + if (delay >= delay_tolerance_nsec) 179 + return false; 180 + 181 + /* Since it's the last one, remove from the map after latency check */ 182 + if (skops->op == BPF_SOCK_OPS_TSTAMP_ACK_CB) 183 + bpf_map_delete_elem(&time_map, &key); 184 + 185 + return true; 186 + } 187 + 188 + SEC("fentry/tcp_sendmsg_locked") 189 + int BPF_PROG(trace_tcp_sendmsg_locked, struct sock *sk, struct msghdr *msg, 190 + size_t size) 191 + { 192 + __u32 pid = bpf_get_current_pid_tgid() >> 32; 193 + u64 timestamp = bpf_ktime_get_ns(); 194 + u32 flag = sk->sk_bpf_cb_flags; 195 + struct sk_stg *stg; 196 + 197 + if (pid != monitored_pid || !flag) 198 + return 0; 199 + 200 + stg = bpf_sk_storage_get(&sk_stg_map, sk, 0, 201 + BPF_SK_STORAGE_GET_F_CREATE); 202 + if (!stg) 203 + return 0; 204 + 205 + stg->sendmsg_ns = timestamp; 206 + nr_snd += 1; 207 + return 0; 208 + } 209 + 210 + SEC("sockops") 211 + int skops_sockopt(struct bpf_sock_ops *skops) 212 + { 213 + struct bpf_sock *bpf_sk = skops->sk; 214 + const struct sock *sk; 215 + 216 + if (!bpf_sk) 
217 + return 1; 218 + 219 + sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk); 220 + if (!sk) 221 + return 1; 222 + 223 + switch (skops->op) { 224 + case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: 225 + nr_active += !bpf_test_sockopt(skops, sk, 0); 226 + break; 227 + case BPF_SOCK_OPS_TSTAMP_SENDMSG_CB: 228 + if (bpf_test_delay(skops, sk)) 229 + nr_snd += 1; 230 + break; 231 + case BPF_SOCK_OPS_TSTAMP_SCHED_CB: 232 + if (bpf_test_delay(skops, sk)) 233 + nr_sched += 1; 234 + break; 235 + case BPF_SOCK_OPS_TSTAMP_SND_SW_CB: 236 + if (bpf_test_delay(skops, sk)) 237 + nr_txsw += 1; 238 + break; 239 + case BPF_SOCK_OPS_TSTAMP_ACK_CB: 240 + if (bpf_test_delay(skops, sk)) 241 + nr_ack += 1; 242 + break; 243 + } 244 + 245 + return 1; 246 + } 247 + 248 + char _license[] SEC("license") = "GPL";
+1
tools/testing/selftests/bpf/progs/setget_sockopt.c
··· 61 61 { .opt = TCP_NOTSENT_LOWAT, .new = 1314, .expected = 1314, }, 62 62 { .opt = TCP_BPF_SOCK_OPS_CB_FLAGS, .new = BPF_SOCK_OPS_ALL_CB_FLAGS, 63 63 .expected = BPF_SOCK_OPS_ALL_CB_FLAGS, }, 64 + { .opt = TCP_RTO_MAX_MS, .new = 2000, .expected = 2000, }, 64 65 { .opt = 0, }, 65 66 }; 66 67
+163 -5
tools/testing/selftests/bpf/xdp_hw_metadata.c
··· 13 13 * - UDP 9091 packets trigger TX reply 14 14 * - TX HW timestamp is requested and reported back upon completion 15 15 * - TX checksum is requested 16 + * - TX launch time HW offload is requested for transmission 16 17 */ 17 18 18 19 #include <test_progs.h> ··· 38 37 #include <time.h> 39 38 #include <unistd.h> 40 39 #include <libgen.h> 40 + #include <stdio.h> 41 + #include <stdlib.h> 42 + #include <string.h> 43 + #include <sys/ioctl.h> 44 + #include <linux/pkt_sched.h> 45 + #include <linux/pkt_cls.h> 46 + #include <linux/ethtool.h> 47 + #include <sys/socket.h> 48 + #include <arpa/inet.h> 41 49 42 50 #include "xdp_metadata.h" 43 51 ··· 74 64 bool skip_tx; 75 65 __u64 last_hw_rx_timestamp; 76 66 __u64 last_xdp_rx_timestamp; 67 + __u64 last_launch_time; 68 + __u64 launch_time_delta_to_hw_rx_timestamp; 69 + int launch_time_queue; 70 + 71 + #define run_command(cmd, ...) \ 72 + ({ \ 73 + char command[1024]; \ 74 + memset(command, 0, sizeof(command)); \ 75 + snprintf(command, sizeof(command), cmd, ##__VA_ARGS__); \ 76 + fprintf(stderr, "Running: %s\n", command); \ 77 + system(command); \ 78 + }) 77 79 78 80 void test__fail(void) { /* for network_helpers.c */ } 79 81 ··· 320 298 if (meta->completion.tx_timestamp) { 321 299 __u64 ref_tstamp = gettime(clock_id); 322 300 301 + if (launch_time_delta_to_hw_rx_timestamp) { 302 + print_tstamp_delta("HW Launch-time", 303 + "HW TX-complete-time", 304 + last_launch_time, 305 + meta->completion.tx_timestamp); 306 + } 323 307 print_tstamp_delta("HW TX-complete-time", "User TX-complete-time", 324 308 meta->completion.tx_timestamp, ref_tstamp); 325 309 print_tstamp_delta("XDP RX-time", "User TX-complete-time", ··· 423 395 xsk, ntohs(udph->check), ntohs(want_csum), 424 396 meta->request.csum_start, meta->request.csum_offset); 425 397 398 + /* Set the value of launch time */ 399 + if (launch_time_delta_to_hw_rx_timestamp) { 400 + meta->flags |= XDP_TXMD_FLAGS_LAUNCH_TIME; 401 + meta->request.launch_time = last_hw_rx_timestamp + 
402 + launch_time_delta_to_hw_rx_timestamp; 403 + last_launch_time = meta->request.launch_time; 404 + print_tstamp_delta("HW RX-time", "HW Launch-time", 405 + last_hw_rx_timestamp, 406 + meta->request.launch_time); 407 + } 408 + 426 409 memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */ 427 410 tx_desc->options |= XDP_TX_METADATA; 428 411 tx_desc->len = len; ··· 446 407 const struct xdp_desc *rx_desc; 447 408 struct pollfd fds[rxq + 1]; 448 409 __u64 comp_addr; 410 + __u64 deadline; 449 411 __u64 addr; 450 412 __u32 idx = 0; 451 413 int ret; ··· 517 477 if (ret) 518 478 printf("kick_tx ret=%d\n", ret); 519 479 520 - for (int j = 0; j < 500; j++) { 480 + /* wait 1 second + cover launch time */ 481 + deadline = gettime(clock_id) + 482 + NANOSEC_PER_SEC + 483 + launch_time_delta_to_hw_rx_timestamp; 484 + while (true) { 521 485 if (complete_tx(xsk, clock_id)) 486 + break; 487 + if (gettime(clock_id) >= deadline) 522 488 break; 523 489 usleep(10); 524 490 } ··· 654 608 " -h Display this help and exit\n\n" 655 609 " -m Enable multi-buffer XDP for larger MTU\n" 656 610 " -r Don't generate AF_XDP reply (rx metadata only)\n" 611 + " -l Delta of launch time relative to HW RX-time in ns\n" 612 + " default: 0 ns (launch time request is disabled)\n" 613 + " -L Tx Queue to be enabled with launch time offload\n" 614 + " default: 0 (Tx Queue 0)\n" 657 615 "Generate test packets on the other machine with:\n" 658 616 " echo -n xdp | nc -u -q1 <dst_ip> 9091\n"; 659 617 ··· 668 618 { 669 619 int opt; 670 620 671 - while ((opt = getopt(argc, argv, "chmr")) != -1) { 621 + while ((opt = getopt(argc, argv, "chmrl:L:")) != -1) { 672 622 switch (opt) { 673 623 case 'c': 674 624 bind_flags &= ~XDP_USE_NEED_WAKEUP; ··· 683 633 break; 684 634 case 'r': 685 635 skip_tx = true; 636 + break; 637 + case 'l': 638 + launch_time_delta_to_hw_rx_timestamp = atoll(optarg); 639 + break; 640 + case 'L': 641 + launch_time_queue = atoll(optarg); 686 642 break; 687 643 case '?': 688 
644 if (isprint(optopt)) ··· 713 657 error(-1, errno, "Invalid interface name"); 714 658 } 715 659 660 + void clean_existing_configurations(void) 661 + { 662 + /* Check and delete root qdisc if exists */ 663 + if (run_command("sudo tc qdisc show dev %s | grep -q 'qdisc mqprio 8001:'", ifname) == 0) 664 + run_command("sudo tc qdisc del dev %s root", ifname); 665 + 666 + /* Check and delete ingress qdisc if exists */ 667 + if (run_command("sudo tc qdisc show dev %s | grep -q 'qdisc ingress ffff:'", ifname) == 0) 668 + run_command("sudo tc qdisc del dev %s ingress", ifname); 669 + 670 + /* Check and delete ethtool filters if any exist */ 671 + if (run_command("sudo ethtool -n %s | grep -q 'Filter:'", ifname) == 0) { 672 + run_command("sudo ethtool -n %s | grep 'Filter:' | awk '{print $2}' | xargs -n1 sudo ethtool -N %s delete >&2", 673 + ifname, ifname); 674 + } 675 + } 676 + 677 + #define MAX_TC 16 678 + 716 679 int main(int argc, char *argv[]) 717 680 { 718 681 clockid_t clock_id = CLOCK_TAI; 682 + struct bpf_program *prog; 719 683 int server_fd = -1; 684 + size_t map_len = 0; 685 + size_t que_len = 0; 686 + char *buf = NULL; 687 + char *map = NULL; 688 + char *que = NULL; 689 + char *tmp = NULL; 690 + int tc = 0; 720 691 int ret; 721 692 int i; 722 - 723 - struct bpf_program *prog; 724 693 725 694 read_args(argc, argv); 726 695 727 696 rxq = rxq_num(ifname); 728 - 729 697 printf("rxq: %d\n", rxq); 730 698 699 + if (launch_time_queue >= rxq || launch_time_queue < 0) 700 + error(1, 0, "Invalid launch_time_queue."); 701 + 702 + clean_existing_configurations(); 703 + sleep(1); 704 + 705 + /* Enable tx and rx hardware timestamping */ 731 706 hwtstamp_enable(ifname); 707 + 708 + /* Prepare priority to traffic class map for tc-mqprio */ 709 + for (i = 0; i < MAX_TC; i++) { 710 + if (i < rxq) 711 + tc = i; 712 + 713 + if (asprintf(&buf, "%d ", tc) == -1) { 714 + printf("Failed to malloc buf for tc map.\n"); 715 + goto free_mem; 716 + } 717 + 718 + map_len += strlen(buf); 
719 + tmp = realloc(map, map_len + 1); 720 + if (!tmp) { 721 + printf("Failed to realloc tc map.\n"); 722 + goto free_mem; 723 + } 724 + map = tmp; 725 + strcat(map, buf); 726 + free(buf); 727 + buf = NULL; 728 + } 729 + 730 + /* Prepare traffic class to hardware queue map for tc-mqprio */ 731 + for (i = 0; i <= tc; i++) { 732 + if (asprintf(&buf, "1@%d ", i) == -1) { 733 + printf("Failed to malloc buf for tc queues.\n"); 734 + goto free_mem; 735 + } 736 + 737 + que_len += strlen(buf); 738 + tmp = realloc(que, que_len + 1); 739 + if (!tmp) { 740 + printf("Failed to realloc tc queues.\n"); 741 + goto free_mem; 742 + } 743 + que = tmp; 744 + strcat(que, buf); 745 + free(buf); 746 + buf = NULL; 747 + } 748 + 749 + /* Add mqprio qdisc */ 750 + run_command("sudo tc qdisc add dev %s handle 8001: parent root mqprio num_tc %d map %squeues %shw 0", 751 + ifname, tc + 1, map, que); 752 + 753 + /* To test launch time, send UDP packet with VLAN priority 1 to port 9091 */ 754 + if (launch_time_delta_to_hw_rx_timestamp) { 755 + /* Enable launch time hardware offload on launch_time_queue */ 756 + run_command("sudo tc qdisc replace dev %s parent 8001:%d etf offload clockid CLOCK_TAI delta 500000", 757 + ifname, launch_time_queue + 1); 758 + sleep(1); 759 + 760 + /* Route incoming packet with VLAN priority 1 into launch_time_queue */ 761 + if (run_command("sudo ethtool -N %s flow-type ether vlan 0x2000 vlan-mask 0x1FFF action %d", 762 + ifname, launch_time_queue)) { 763 + run_command("sudo tc qdisc add dev %s ingress", ifname); 764 + run_command("sudo tc filter add dev %s parent ffff: protocol 802.1Q flower vlan_prio 1 hw_tc %d", 765 + ifname, launch_time_queue); 766 + } 767 + 768 + /* Enable VLAN tag stripping offload */ 769 + run_command("sudo ethtool -K %s rxvlan on", ifname); 770 + } 732 771 733 772 rx_xsk = malloc(sizeof(struct xsk) * rxq); 734 773 if (!rx_xsk) ··· 884 733 cleanup(); 885 734 if (ret) 886 735 error(1, -ret, "verify_metadata"); 736 + 737 + 
clean_existing_configurations(); 738 + 739 + free_mem: 740 + free(buf); 741 + free(map); 742 + free(que); 887 743 }