Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-03

The following pull-request contains BPF updates for your *net-next* tree.

There is a minor merge conflict in mlx5 due to 8960b38932be ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but the resolution is not that bad ... getting the
current bpf-next out now before more mlx5 changes arrive. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:

** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

<<<<<<< HEAD
static int mlx5e_open_cq(struct mlx5e_channel *c,
			 struct dim_cq_moder moder,
			 struct mlx5e_cq_param *param,
			 struct mlx5e_cq *cq)
=======
int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
		  struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497

Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...

drivers/net/ethernet/mellanox/mlx5/core/en.h +977

... and in mlx5e_open_xsk() ...

drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64

... needs the same rename from net_dim_cq_moder into dim_cq_moder.

** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

<<<<<<< HEAD
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
struct dim_cq_moder icocq_moder = {0, 0};
struct net_device *netdev = priv->netdev;
struct mlx5e_channel *c;
unsigned int irq;
=======
struct net_dim_cq_moder icocq_moder = {0, 0};
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497

Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.

Let me know if you run into any issues. Anyway, the main changes are:

1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.

2) Addition of two new per-cgroup BPF hooks for getsockopt and
setsockopt along with a new sockopt program type which allows more
fine-grained pass/reject settings for containers. Also add a sock_ops
callback that can be selectively enabled on a per-socket basis and is
executed on every RTT to help track TCP statistics, both features
from Stanislav.

3) Follow-up fix for loops in precision tracking which were not propagating
precision marks; as a result the verifier assumed that some branches were
not taken and therefore wrongly removed them as dead code, from Alexei.

4) Fix a BPF cgroup release synchronization race which could lead to a
double-free if a leaf's cgroup_bpf object is released while a new BPF
program is attached to one of the ancestor cgroups in parallel, from Roman.

5) Support for bulking XDP_TX on veth devices which improves performance
in some cases by around 9%, from Toshiaki.

6) Allow for lookups into BPF devmap and improve feedback when calling into
bpf_redirect_map() as lookup is now performed right away in the helper
itself, from Toke.

7) Add support for fq's Earliest Departure Time to the Host Bandwidth
Manager (HBM) sample BPF program, from Lawrence.

8) Various cleanups and minor fixes all over the place from many others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+6225 -869

+1
Documentation/bpf/index.rst

···
 .. toctree::
    :maxdepth: 1
 
+   prog_cgroup_sockopt
    prog_cgroup_sysctl
    prog_flow_dissector
+93
Documentation/bpf/prog_cgroup_sockopt.rst (new file)

···
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+BPF_PROG_TYPE_CGROUP_SOCKOPT
+============================
+
+``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
+cgroup hooks:
+
+* ``BPF_CGROUP_GETSOCKOPT`` - called every time process executes ``getsockopt``
+  system call.
+* ``BPF_CGROUP_SETSOCKOPT`` - called every time process executes ``setsockopt``
+  system call.
+
+The context (``struct bpf_sockopt``) has associated socket (``sk``) and
+all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.
+
+BPF_CGROUP_SETSOCKOPT
+=====================
+
+``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handling of
+sockopt and it has writable context: it can modify the supplied arguments
+before passing them down to the kernel. This hook has access to the cgroup
+and socket local storage.
+
+If the BPF program sets ``optlen`` to -1, control will be returned
+back to the userspace after all other BPF programs in the cgroup
+chain finish (i.e. kernel ``setsockopt`` handling will *not* be executed).
+
+Note that ``optlen`` can not be increased beyond the user-supplied
+value. It can only be decreased or set to -1. Any other value will
+trigger ``EFAULT``.
+
+Return Type
+-----------
+
+* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
+* ``1`` - success, continue with next BPF program in the cgroup chain.
+
+BPF_CGROUP_GETSOCKOPT
+=====================
+
+``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handling of
+sockopt. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
+if it's interested in whatever the kernel has returned. The BPF hook can
+override the values above, adjust ``optlen`` and reset ``retval`` to 0.
+If ``optlen`` has been increased above the initial ``getsockopt`` value
+(i.e. the userspace buffer is too small), ``EFAULT`` is returned.
+
+This hook has access to the cgroup and socket local storage.
+
+Note that the only acceptable values to set to ``retval`` are 0 and the
+original value that the kernel returned. Any other value will trigger
+``EFAULT``.
+
+Return Type
+-----------
+
+* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
+* ``1`` - success: copy ``optval`` and ``optlen`` to userspace, return
+  ``retval`` from the syscall (note that this can be overwritten by
+  the BPF program from the parent cgroup).
+
+Cgroup Inheritance
+==================
+
+Suppose there is the following cgroup hierarchy where each cgroup
+has ``BPF_CGROUP_GETSOCKOPT`` attached at each level with the
+``BPF_F_ALLOW_MULTI`` flag::
+
+  A (root, parent)
+   \
+    B (child)
+
+When the application calls the ``getsockopt`` syscall from cgroup B,
+the programs are executed from the bottom up: B, A. The first program
+(B) sees the result of the kernel's ``getsockopt``. It can optionally
+adjust ``optval``, ``optlen`` and reset ``retval`` to 0. After that
+control will be passed to the second (A) program which will see the
+same context as B including any potential modifications.
+
+Same for ``BPF_CGROUP_SETSOCKOPT``: if the program is attached to
+A and B, the trigger order is B, then A. If B makes any changes
+to the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
+then the next program in the chain (A) will see those changes,
+*not* the original input ``setsockopt`` arguments. The potentially
+modified values will then be passed down to the kernel.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
+of a BPF program that handles socket options.
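The ``optlen`` rule for ``BPF_CGROUP_SETSOCKOPT`` described above (only decrease or set to -1, anything else is ``EFAULT``) can be sketched as a small standalone model. ``check_optlen`` is a hypothetical helper name for illustration, not the kernel's actual code path:

```c
#include <assert.h>
#include <errno.h>

/* Model of the documented optlen constraint: after a
 * BPF_CGROUP_SETSOCKOPT program runs, optlen may only have been
 * decreased or set to -1; any other change yields -EFAULT.
 * Returns 1 for "bypass kernel setsockopt handling" (optlen == -1),
 * 0 for "continue to the kernel handler". */
static int check_optlen(int user_optlen, int bpf_optlen)
{
	if (bpf_optlen == -1)
		return 1; /* kernel setsockopt handling is skipped */
	if (bpf_optlen < 0 || bpf_optlen > user_optlen)
		return -EFAULT; /* increased or otherwise invalid */
	return 0; /* unchanged or decreased: proceed */
}
```

The return codes here are for the model only; they capture the documented constraint, not the kernel's internal control flow.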
+15 -1
Documentation/networking/af_xdp.rst

···
 In order to use AF_XDP sockets there are two parts needed. The
 user-space application and the XDP program. For a complete setup and
 usage example, please refer to the sample application. The user-space
-side is xdpsock_user.c and the XDP side xdpsock_kern.c.
+side is xdpsock_user.c and the XDP side is part of libbpf.
+
+The XDP code sample included in tools/lib/bpf/xsk.c is the following::
+
+   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+   {
+       int index = ctx->rx_queue_index;
+
+       // A set entry here means that the corresponding queue_id
+       // has an active AF_XDP socket bound to it.
+       if (bpf_map_lookup_elem(&xsks_map, &index))
+           return bpf_redirect_map(&xsks_map, index, 0);
+
+       return XDP_PASS;
+   }
 
 Naive ring dequeue and enqueue could look like this::
+7 -5
drivers/net/ethernet/intel/i40e/i40e_xsk.c

···
 	struct i40e_tx_desc *tx_desc = NULL;
 	struct i40e_tx_buffer *tx_bi;
 	bool work_done = true;
+	struct xdp_desc desc;
 	dma_addr_t dma;
-	u32 len;
 
 	while (budget-- > 0) {
 		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
···
 			break;
 		}
 
-		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
 			break;
 
-		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+		dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
 					   DMA_BIDIRECTIONAL);
 
 		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
-		tx_bi->bytecount = len;
+		tx_bi->bytecount = desc.len;
 
 		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
 		tx_desc->buffer_addr = cpu_to_le64(dma);
 		tx_desc->cmd_type_offset_bsz =
 			build_ctob(I40E_TX_DESC_CMD_ICRC
 				   | I40E_TX_DESC_CMD_EOP,
-				   0, len, 0);
+				   0, desc.len, 0);
 
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
+9 -6
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c

···
 	union ixgbe_adv_tx_desc *tx_desc = NULL;
 	struct ixgbe_tx_buffer *tx_bi;
 	bool work_done = true;
-	u32 len, cmd_type;
+	struct xdp_desc desc;
 	dma_addr_t dma;
+	u32 cmd_type;
 
 	while (budget-- > 0) {
 		if (unlikely(!ixgbe_desc_unused(xdp_ring)) ||
···
 			break;
 		}
 
-		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
 			break;
 
-		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+		dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
 					   DMA_BIDIRECTIONAL);
 
 		tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use];
-		tx_bi->bytecount = len;
+		tx_bi->bytecount = desc.len;
 		tx_bi->xdpf = NULL;
 		tx_bi->gso_segs = 1;
 
···
 		cmd_type = IXGBE_ADVTXD_DTYP_DATA |
 			   IXGBE_ADVTXD_DCMD_DEXT |
 			   IXGBE_ADVTXD_DCMD_IFCS;
-		cmd_type |= len | IXGBE_TXD_CMD;
+		cmd_type |= desc.len | IXGBE_TXD_CMD;
 		tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
 		tx_desc->read.olinfo_status =
-			cpu_to_le32(len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+			cpu_to_le32(desc.len << IXGBE_ADVTXD_PAYLEN_SHIFT);
 
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
+1 -1
drivers/net/ethernet/mellanox/mlx5/core/Makefile

···
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
 		en_selftest.o en/port.o en/monitor_stats.o en/reporter_tx.o \
-		en/params.o
+		en/params.o en/xsk/umem.o en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o
 
 #
 # Netdev extra
+140 -15
drivers/net/ethernet/mellanox/mlx5/core/en.h

···
 #define MLX5E_MAX_NUM_CHANNELS (MLX5E_INDIR_RQT_SIZE >> 1)
 #define MLX5E_MAX_NUM_SQS (MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC)
 #define MLX5E_TX_CQ_POLL_BUDGET 128
+#define MLX5E_TX_XSK_POLL_BUDGET 64
 #define MLX5E_SQ_RECOVER_MIN_INTERVAL 500 /* msecs */
 
 #define MLX5E_UMR_WQE_INLINE_SZ \
···
 		##__VA_ARGS__); \
 } while (0)
 
+enum mlx5e_rq_group {
+	MLX5E_RQ_GROUP_REGULAR,
+	MLX5E_RQ_GROUP_XSK,
+	MLX5E_NUM_RQ_GROUPS /* Keep last. */
+};
 
 static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
 {
···
 /* Use this function to get max num channels after netdev was created */
 static inline int mlx5e_get_netdev_max_channels(struct net_device *netdev)
 {
-	return min_t(unsigned int, netdev->num_rx_queues,
+	return min_t(unsigned int,
+		     netdev->num_rx_queues / MLX5E_NUM_RQ_GROUPS,
 		     netdev->num_tx_queues);
 }
···
 	u32 lro_timeout;
 	u32 pflags;
 	struct bpf_prog *xdp_prog;
+	struct mlx5e_xsk *xsk;
 	unsigned int sw_mtu;
 	int hard_mtu;
 };
···
 struct mlx5e_sq_wqe_info {
 	u8 opcode;
+
+	/* Auxiliary data for different opcodes. */
+	union {
+		struct {
+			struct mlx5e_rq *rq;
+		} umr;
+	};
 };
 
 struct mlx5e_txqsq {
···
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_dma_info {
-	struct page *page;
-	dma_addr_t addr;
+	dma_addr_t addr;
+	union {
+		struct page *page;
+		struct {
+			u64 handle;
+			void *data;
+		} xsk;
+	};
+};
+
+/* XDP packets can be transmitted in different ways. On completion, we need to
+ * distinguish between them to clean up things in a proper way.
+ */
+enum mlx5e_xdp_xmit_mode {
+	/* An xdp_frame was transmitted due to either XDP_REDIRECT from another
+	 * device or XDP_TX from an XSK RQ. The frame has to be unmapped and
+	 * returned.
+	 */
+	MLX5E_XDP_XMIT_MODE_FRAME,
+
+	/* The xdp_frame was created in place as a result of XDP_TX from a
+	 * regular RQ. No DMA remapping happened, and the page belongs to us.
+	 */
+	MLX5E_XDP_XMIT_MODE_PAGE,
+
+	/* No xdp_frame was created at all, the transmit happened from a UMEM
+	 * page. The UMEM Completion Ring producer pointer has to be increased.
+	 */
+	MLX5E_XDP_XMIT_MODE_XSK,
 };
 
 struct mlx5e_xdp_info {
-	struct xdp_frame *xdpf;
-	dma_addr_t dma_addr;
-	struct mlx5e_dma_info di;
+	enum mlx5e_xdp_xmit_mode mode;
+	union {
+		struct {
+			struct xdp_frame *xdpf;
+			dma_addr_t dma_addr;
+		} frame;
+		struct {
+			struct mlx5e_rq *rq;
+			struct mlx5e_dma_info di;
+		} page;
+	};
+};
+
+struct mlx5e_xdp_xmit_data {
+	dma_addr_t dma_addr;
+	void *data;
+	u32 len;
 };
 
 struct mlx5e_xdp_info_fifo {
···
 };
 
 struct mlx5e_xdpsq;
-typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq*,
-					struct mlx5e_xdp_info*);
+typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
+typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
+					struct mlx5e_xdp_xmit_data *,
+					struct mlx5e_xdp_info *,
+					int);
+
 struct mlx5e_xdpsq {
 	/* data path */
···
 	struct mlx5e_cq cq;
 
 	/* read only */
+	struct xdp_umem *umem;
 	struct mlx5_wq_cyc wq;
 	struct mlx5e_xdpsq_stats *stats;
+	mlx5e_fp_xmit_xdp_frame_check xmit_xdp_frame_check;
 	mlx5e_fp_xmit_xdp_frame xmit_xdp_frame;
 	struct {
 		struct mlx5e_xdp_wqe_info *wqe_info;
···
 			u8 log_stride_sz;
 			u8 umr_in_progress;
 			u8 umr_last_bulk;
+			u8 umr_completed;
 		} mpwqe;
 	};
 	struct {
+		u16 umem_headroom;
 		u16 headroom;
 		u8 map_dir; /* dma map direction */
 	} buff;
···
 	/* XDP */
 	struct bpf_prog *xdp_prog;
-	struct mlx5e_xdpsq xdpsq;
+	struct mlx5e_xdpsq *xdpsq;
 	DECLARE_BITMAP(flags, 8);
 	struct page_pool *page_pool;
+
+	/* AF_XDP zero-copy */
+	struct zero_copy_allocator zca;
+	struct xdp_umem *umem;
 
 	/* control */
 	struct mlx5_wq_ctrl wq_ctrl;
···
 	struct xdp_rxq_info xdp_rxq;
 } ____cacheline_aligned_in_smp;
 
+enum mlx5e_channel_state {
+	MLX5E_CHANNEL_STATE_XSK,
+	MLX5E_CHANNEL_NUM_STATES
+};
+
 struct mlx5e_channel {
 	/* data path */
 	struct mlx5e_rq rq;
+	struct mlx5e_xdpsq rq_xdpsq;
 	struct mlx5e_txqsq sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_icosq icosq; /* internal control operations */
 	bool xdp;
···
 	/* XDP_REDIRECT */
 	struct mlx5e_xdpsq xdpsq;
 
+	/* AF_XDP zero-copy */
+	struct mlx5e_rq xskrq;
+	struct mlx5e_xdpsq xsksq;
+	struct mlx5e_icosq xskicosq;
+	/* xskicosq can be accessed from any CPU - the spinlock protects it. */
+	spinlock_t xskicosq_lock;
+
 	/* data path - accessed per napi poll */
 	struct irq_desc *irq_desc;
 	struct mlx5e_ch_stats *stats;
···
 	struct mlx5e_priv *priv;
 	struct mlx5_core_dev *mdev;
 	struct hwtstamp_config *tstamp;
+	DECLARE_BITMAP(state, MLX5E_CHANNEL_NUM_STATES);
 	int ix;
 	int cpu;
 	cpumask_var_t xps_cpumask;
···
 	struct mlx5e_ch_stats ch;
 	struct mlx5e_sq_stats sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_rq_stats rq;
+	struct mlx5e_rq_stats xskrq;
 	struct mlx5e_xdpsq_stats rq_xdpsq;
 	struct mlx5e_xdpsq_stats xdpsq;
+	struct mlx5e_xdpsq_stats xsksq;
 } ____cacheline_aligned_in_smp;
 
 enum {
 	MLX5E_STATE_OPENED,
 	MLX5E_STATE_DESTROYING,
 	MLX5E_STATE_XDP_TX_ENABLED,
+	MLX5E_STATE_XDP_OPEN,
 };
 
 struct mlx5e_rqt {
···
 	int rl_index;
 };
 
+struct mlx5e_xsk {
+	/* UMEMs are stored separately from channels, because we don't want to
+	 * lose them when channels are recreated. The kernel also stores UMEMs,
+	 * but it doesn't distinguish between zero-copy and non-zero-copy UMEMs,
+	 * so rely on our mechanism.
+	 */
+	struct xdp_umem **umems;
+	u16 refcnt;
+	bool ever_used;
+};
+
 struct mlx5e_priv {
 	/* priv data path fields - start */
 	struct mlx5e_txqsq *txq2sq[MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC];
···
 	struct mlx5e_tir indir_tir[MLX5E_NUM_INDIR_TIRS];
 	struct mlx5e_tir inner_indir_tir[MLX5E_NUM_INDIR_TIRS];
 	struct mlx5e_tir direct_tir[MLX5E_MAX_NUM_CHANNELS];
+	struct mlx5e_tir xsk_tir[MLX5E_MAX_NUM_CHANNELS];
 	struct mlx5e_rss_params rss_params;
 	u32 tx_rates[MLX5E_MAX_NUM_SQS];
···
 	struct mlx5e_tls *tls;
 #endif
 	struct devlink_health_reporter *tx_reporter;
+	struct mlx5e_xsk xsk;
 };
 
 struct mlx5e_profile {
···
 			    struct mlx5e_params *params);
 
 void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info);
-void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
-			bool recycle);
+void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
+				struct mlx5e_dma_info *dma_info,
+				bool recycle);
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
+void mlx5e_poll_ico_cq(struct mlx5e_cq *cq);
 bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq);
 void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
···
 			       void *tirc, bool inner);
 void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen);
 struct mlx5e_tirc_config mlx5e_tirc_get_default_config(enum mlx5e_traffic_types tt);
+
+struct mlx5e_xsk_param;
+
+struct mlx5e_rq_param;
+int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_params *params,
+		  struct mlx5e_rq_param *param, struct mlx5e_xsk_param *xsk,
+		  struct xdp_umem *umem, struct mlx5e_rq *rq);
+int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time);
+void mlx5e_deactivate_rq(struct mlx5e_rq *rq);
+void mlx5e_close_rq(struct mlx5e_rq *rq);
+
+struct mlx5e_sq_param;
+int mlx5e_open_icosq(struct mlx5e_channel *c, struct mlx5e_params *params,
+		     struct mlx5e_sq_param *param, struct mlx5e_icosq *sq);
+void mlx5e_close_icosq(struct mlx5e_icosq *sq);
+int mlx5e_open_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params,
+		     struct mlx5e_sq_param *param, struct xdp_umem *umem,
+		     struct mlx5e_xdpsq *sq, bool is_redirect);
+void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq);
+
+struct mlx5e_cq_param;
+int mlx5e_open_cq(struct mlx5e_channel *c, struct dim_cq_moder moder,
+		  struct mlx5e_cq_param *param, struct mlx5e_cq *cq);
+void mlx5e_close_cq(struct mlx5e_cq *cq);
 
 int mlx5e_open_locked(struct net_device *netdev);
 int mlx5e_close_locked(struct net_device *netdev);
···
 int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc);
 void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc);
 
-int mlx5e_create_direct_rqts(struct mlx5e_priv *priv);
-void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv);
-int mlx5e_create_direct_tirs(struct mlx5e_priv *priv);
-void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv);
+int mlx5e_create_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
+void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
+int mlx5e_create_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
+void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
 void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt);
 
 int mlx5e_create_tis(struct mlx5_core_dev *mdev, int tc,
···
 void mlx5e_destroy_netdev(struct mlx5e_priv *priv);
 void mlx5e_set_netdev_mtu_boundaries(struct mlx5e_priv *priv);
 void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
+			    struct mlx5e_xsk *xsk,
 			    struct mlx5e_rss_params *rss_params,
 			    struct mlx5e_params *params,
 			    u16 max_channels, u16 mtu);
+71 -37
drivers/net/ethernet/mellanox/mlx5/core/en/params.c

···
 
 #include "en/params.h"
 
-u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params)
+static inline bool mlx5e_rx_is_xdp(struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk)
 {
-	u16 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
-	u16 linear_rq_headroom = params->xdp_prog ?
-		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
-	u32 frag_sz;
+	return params->xdp_prog || xsk;
+}
 
-	linear_rq_headroom += NET_IP_ALIGN;
+u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
+				 struct mlx5e_xsk_param *xsk)
+{
+	u16 headroom = NET_IP_ALIGN;
 
-	frag_sz = MLX5_SKB_FRAG_SZ(linear_rq_headroom + hw_mtu);
+	if (mlx5e_rx_is_xdp(params, xsk)) {
+		headroom += XDP_PACKET_HEADROOM;
+		if (xsk)
+			headroom += xsk->headroom;
+	} else {
+		headroom += MLX5_RX_HEADROOM;
+	}
 
-	if (params->xdp_prog && frag_sz < PAGE_SIZE)
-		frag_sz = PAGE_SIZE;
+	return headroom;
+}
+
+u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk)
+{
+	u32 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
+	u16 linear_rq_headroom = mlx5e_get_linear_rq_headroom(params, xsk);
+	u32 frag_sz = linear_rq_headroom + hw_mtu;
+
+	/* AF_XDP doesn't build SKBs in place. */
+	if (!xsk)
+		frag_sz = MLX5_SKB_FRAG_SZ(frag_sz);
+
+	/* XDP in mlx5e doesn't support multiple packets per page. */
+	if (mlx5e_rx_is_xdp(params, xsk))
+		frag_sz = max_t(u32, frag_sz, PAGE_SIZE);
+
+	/* Even if we can go with a smaller fragment size, we must not put
+	 * multiple packets into a single frame.
+	 */
+	if (xsk)
+		frag_sz = max_t(u32, frag_sz, xsk->chunk_size);
 
 	return frag_sz;
 }
 
-u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params)
+u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk)
 {
-	u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params, xsk);
 
 	return MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(linear_frag_sz);
 }
 
-bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params)
+bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params,
+			    struct mlx5e_xsk_param *xsk)
 {
-	u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	/* AF_XDP allocates SKBs on XDP_PASS - ensure they don't occupy more
+	 * than one page. For this, check both with and without xsk.
+	 */
+	u32 linear_frag_sz = max(mlx5e_rx_get_linear_frag_sz(params, xsk),
+				 mlx5e_rx_get_linear_frag_sz(params, NULL));
 
-	return !params->lro_en && frag_sz <= PAGE_SIZE;
+	return !params->lro_en && linear_frag_sz <= PAGE_SIZE;
 }
 
 #define MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ ((BIT(__mlx5_bit_sz(wq, log_wqe_stride_size)) - 1) + \
 					  MLX5_MPWQE_LOG_STRIDE_SZ_BASE)
 bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev,
-				  struct mlx5e_params *params)
+				  struct mlx5e_params *params,
+				  struct mlx5e_xsk_param *xsk)
 {
-	u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params, xsk);
 	s8 signed_log_num_strides_param;
 	u8 log_num_strides;
 
-	if (!mlx5e_rx_is_linear_skb(params))
+	if (!mlx5e_rx_is_linear_skb(params, xsk))
 		return false;
 
-	if (order_base_2(frag_sz) > MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ)
+	if (order_base_2(linear_frag_sz) > MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ)
 		return false;
 
 	if (MLX5_CAP_GEN(mdev, ext_stride_num_range))
 		return true;
 
-	log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(frag_sz);
+	log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(linear_frag_sz);
 	signed_log_num_strides_param =
 		(s8)log_num_strides - MLX5_MPWQE_LOG_NUM_STRIDES_BASE;
 
 	return signed_log_num_strides_param >= 0;
 }
 
-u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params)
+u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params,
+			       struct mlx5e_xsk_param *xsk)
 {
-	u8 log_pkts_per_wqe = mlx5e_mpwqe_log_pkts_per_wqe(params);
+	u8 log_pkts_per_wqe = mlx5e_mpwqe_log_pkts_per_wqe(params, xsk);
 
 	/* Numbers are unsigned, don't subtract to avoid underflow. */
 	if (params->log_rq_mtu_frames <
···
 }
 
 u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
-				   struct mlx5e_params *params)
+				   struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk)
 {
-	if (mlx5e_rx_mpwqe_is_linear_skb(mdev, params))
-		return order_base_2(mlx5e_rx_get_linear_frag_sz(params));
+	if (mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk))
+		return order_base_2(mlx5e_rx_get_linear_frag_sz(params, xsk));
 
 	return MLX5_MPWRQ_DEF_LOG_STRIDE_SZ(mdev);
 }
 
 u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
-				   struct mlx5e_params *params)
+				   struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk)
 {
 	return MLX5_MPWRQ_LOG_WQE_SZ -
-		mlx5e_mpwqe_get_log_stride_size(mdev, params);
+		mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk);
 }
 
 u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
-			  struct mlx5e_params *params)
+			  struct mlx5e_params *params,
+			  struct mlx5e_xsk_param *xsk)
 {
-	u16 linear_rq_headroom = params->xdp_prog ?
-		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
-	bool is_linear_skb;
+	bool is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
+		mlx5e_rx_is_linear_skb(params, xsk) :
+		mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk);
 
-	linear_rq_headroom += NET_IP_ALIGN;
-
-	is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
-		mlx5e_rx_is_linear_skb(params) :
-		mlx5e_rx_mpwqe_is_linear_skb(mdev, params);
-
-	return is_linear_skb ? linear_rq_headroom : 0;
+	return is_linear_skb ? mlx5e_get_linear_rq_headroom(params, xsk) : 0;
 }
+110 -8
drivers/net/ethernet/mellanox/mlx5/core/en/params.h

···
 
 #include "en.h"
 
-u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params);
-u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params);
-bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params);
+struct mlx5e_xsk_param {
+	u16 headroom;
+	u16 chunk_size;
+};
+
+struct mlx5e_rq_param {
+	u32 rqc[MLX5_ST_SZ_DW(rqc)];
+	struct mlx5_wq_param wq;
+	struct mlx5e_rq_frags_info frags_info;
+};
+
+struct mlx5e_sq_param {
+	u32 sqc[MLX5_ST_SZ_DW(sqc)];
+	struct mlx5_wq_param wq;
+	bool is_mpw;
+};
+
+struct mlx5e_cq_param {
+	u32 cqc[MLX5_ST_SZ_DW(cqc)];
+	struct mlx5_wq_param wq;
+	u16 eq_ix;
+	u8 cq_period_mode;
+};
+
+struct mlx5e_channel_param {
+	struct mlx5e_rq_param rq;
+	struct mlx5e_sq_param sq;
+	struct mlx5e_sq_param xdp_sq;
+	struct mlx5e_sq_param icosq;
+	struct mlx5e_cq_param rx_cq;
+	struct mlx5e_cq_param tx_cq;
+	struct mlx5e_cq_param icosq_cq;
+};
+
+static inline bool mlx5e_qid_get_ch_if_in_group(struct mlx5e_params *params,
+						u16 qid,
+						enum mlx5e_rq_group group,
+						u16 *ix)
+{
+	int nch = params->num_channels;
+	int ch = qid - nch * group;
+
+	if (ch < 0 || ch >= nch)
+		return false;
+
+	*ix = ch;
+	return true;
+}
+
+static inline void mlx5e_qid_get_ch_and_group(struct mlx5e_params *params,
+					      u16 qid,
+					      u16 *ix,
+					      enum mlx5e_rq_group *group)
+{
+	u16 nch = params->num_channels;
+
+	*ix = qid % nch;
+	*group = qid / nch;
+}
+
+static inline bool mlx5e_qid_validate(struct mlx5e_params *params, u64 qid)
+{
+	return qid < params->num_channels * MLX5E_NUM_RQ_GROUPS;
+}
+
+/* Parameter calculations */
+
+u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
+				 struct mlx5e_xsk_param *xsk);
+u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk);
+u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk);
+bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params,
+			    struct mlx5e_xsk_param *xsk);
 bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev,
-				  struct mlx5e_params *params);
-u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params);
+				  struct mlx5e_params *params,
+				  struct mlx5e_xsk_param *xsk);
+u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params,
+			       struct mlx5e_xsk_param *xsk);
 u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
-				   struct mlx5e_params *params);
+				   struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk);
 u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
-				   struct mlx5e_params *params);
+				   struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk);
 u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
-			  struct mlx5e_params *params);
+			  struct mlx5e_params *params,
+			  struct mlx5e_xsk_param *xsk);
+
+/* Build queue parameters */
+
+void mlx5e_build_rq_param(struct mlx5e_priv *priv,
+			  struct mlx5e_params *params,
+			  struct mlx5e_xsk_param *xsk,
+			  struct mlx5e_rq_param *param);
+void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
+				 struct mlx5e_sq_param *param);
+void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_xsk_param *xsk,
+			     struct mlx5e_cq_param *param);
+void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_cq_param *param);
+void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
+			      u8 log_wq_size,
+			      struct mlx5e_cq_param *param);
+void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
+			     u8 log_wq_size,
+			     struct mlx5e_sq_param *param);
+void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_sq_param *param);
 
 #endif /* __MLX5_EN_PARAMS_H__ */
+175 -62
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
···
 */
 
 #include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
 #include "en/xdp.h"
+#include "en/params.h"
 
-int mlx5e_xdp_max_mtu(struct mlx5e_params *params)
+int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
 {
-    int hr = NET_IP_ALIGN + XDP_PACKET_HEADROOM;
+    int hr = mlx5e_get_linear_rq_headroom(params, xsk);
 
     /* Let S := SKB_DATA_ALIGN(sizeof(struct skb_shared_info)).
      * The condition checked in mlx5e_rx_is_linear_skb is:
···
 }
 
 static inline bool
-mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_dma_info *di,
-                    struct xdp_buff *xdp)
+mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
+                    struct mlx5e_dma_info *di, struct xdp_buff *xdp)
 {
+    struct mlx5e_xdp_xmit_data xdptxd;
     struct mlx5e_xdp_info xdpi;
+    struct xdp_frame *xdpf;
+    dma_addr_t dma_addr;
 
-    xdpi.xdpf = convert_to_xdp_frame(xdp);
-    if (unlikely(!xdpi.xdpf))
+    xdpf = convert_to_xdp_frame(xdp);
+    if (unlikely(!xdpf))
         return false;
-    xdpi.dma_addr = di->addr + (xdpi.xdpf->data - (void *)xdpi.xdpf);
-    dma_sync_single_for_device(sq->pdev, xdpi.dma_addr,
-                               xdpi.xdpf->len, PCI_DMA_TODEVICE);
-    xdpi.di = *di;
 
-    return sq->xmit_xdp_frame(sq, &xdpi);
+    xdptxd.data = xdpf->data;
+    xdptxd.len = xdpf->len;
+
+    if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) {
+        /* The xdp_buff was in the UMEM and was copied into a newly
+         * allocated page. The UMEM page was returned via the ZCA, and
+         * this new page has to be mapped at this point and has to be
+         * unmapped and returned via xdp_return_frame on completion.
+         */
+
+        /* Prevent double recycling of the UMEM page. Even in case this
+         * function returns false, the xdp_buff shouldn't be recycled,
+         * as it was already done in xdp_convert_zc_to_xdp_frame.
+         */
+        __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags); /* non-atomic */
+
+        xdpi.mode = MLX5E_XDP_XMIT_MODE_FRAME;
+
+        dma_addr = dma_map_single(sq->pdev, xdptxd.data, xdptxd.len,
+                                  DMA_TO_DEVICE);
+        if (dma_mapping_error(sq->pdev, dma_addr)) {
+            xdp_return_frame(xdpf);
+            return false;
+        }
+
+        xdptxd.dma_addr = dma_addr;
+        xdpi.frame.xdpf = xdpf;
+        xdpi.frame.dma_addr = dma_addr;
+    } else {
+        /* Driver assumes that convert_to_xdp_frame returns an xdp_frame
+         * that points to the same memory region as the original
+         * xdp_buff. It allows to map the memory only once and to use
+         * the DMA_BIDIRECTIONAL mode.
+         */
+
+        xdpi.mode = MLX5E_XDP_XMIT_MODE_PAGE;
+
+        dma_addr = di->addr + (xdpf->data - (void *)xdpf);
+        dma_sync_single_for_device(sq->pdev, dma_addr, xdptxd.len,
+                                   DMA_TO_DEVICE);
+
+        xdptxd.dma_addr = dma_addr;
+        xdpi.page.rq = rq;
+        xdpi.page.di = *di;
+    }
+
+    return sq->xmit_xdp_frame(sq, &xdptxd, &xdpi, 0);
 }
 
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
-                      void *va, u16 *rx_headroom, u32 *len)
+                      void *va, u16 *rx_headroom, u32 *len, bool xsk)
 {
     struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
     struct xdp_buff xdp;
···
     xdp_set_data_meta_invalid(&xdp);
     xdp.data_end = xdp.data + *len;
     xdp.data_hard_start = va;
+    if (xsk)
+        xdp.handle = di->xsk.handle;
     xdp.rxq = &rq->xdp_rxq;
 
     act = bpf_prog_run_xdp(prog, &xdp);
+    if (xsk)
+        xdp.handle += xdp.data - xdp.data_hard_start;
     switch (act) {
     case XDP_PASS:
         *rx_headroom = xdp.data - xdp.data_hard_start;
         *len = xdp.data_end - xdp.data;
         return false;
     case XDP_TX:
-        if (unlikely(!mlx5e_xmit_xdp_buff(&rq->xdpsq, di, &xdp)))
+        if (unlikely(!mlx5e_xmit_xdp_buff(rq->xdpsq, rq, di, &xdp)))
             goto xdp_abort;
         __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags); /* non-atomic */
         return true;
···
             goto xdp_abort;
         __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags);
         __set_bit(MLX5E_RQ_FLAG_XDP_REDIRECT, rq->flags);
-        mlx5e_page_dma_unmap(rq, di);
+        if (!xsk)
+            mlx5e_page_dma_unmap(rq, di);
         rq->stats->xdp_redirect++;
         return true;
     default:
···
     stats->mpwqe++;
 }
 
-static void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq)
+void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq)
 {
     struct mlx5_wq_cyc *wq = &sq->wq;
     struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
···
     session->wqe = NULL; /* Close session */
 }
 
-static bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
-                                       struct mlx5e_xdp_info *xdpi)
+enum {
+    MLX5E_XDP_CHECK_OK = 1,
+    MLX5E_XDP_CHECK_START_MPWQE = 2,
+};
+
+static int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq *sq)
 {
-    struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
-    struct mlx5e_xdpsq_stats *stats = sq->stats;
-
-    struct xdp_frame *xdpf = xdpi->xdpf;
-
-    if (unlikely(sq->hw_mtu < xdpf->len)) {
-        stats->err++;
-        return false;
-    }
-
-    if (unlikely(!session->wqe)) {
+    if (unlikely(!sq->mpwqe.wqe)) {
         if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc,
                                              MLX5_SEND_WQE_MAX_WQEBBS))) {
             /* SQ is full, ring doorbell */
             mlx5e_xmit_xdp_doorbell(sq);
-            stats->full++;
-            return false;
+            sq->stats->full++;
+            return -EBUSY;
         }
 
+        return MLX5E_XDP_CHECK_START_MPWQE;
+    }
+
+    return MLX5E_XDP_CHECK_OK;
+}
+
+static bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
+                                       struct mlx5e_xdp_xmit_data *xdptxd,
+                                       struct mlx5e_xdp_info *xdpi,
+                                       int check_result)
+{
+    struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
+    struct mlx5e_xdpsq_stats *stats = sq->stats;
+
+    if (unlikely(xdptxd->len > sq->hw_mtu)) {
+        stats->err++;
+        return false;
+    }
+
+    if (!check_result)
+        check_result = mlx5e_xmit_xdp_frame_check_mpwqe(sq);
+    if (unlikely(check_result < 0))
+        return false;
+
+    if (check_result == MLX5E_XDP_CHECK_START_MPWQE) {
+        /* Start the session when nothing can fail, so it's guaranteed
+         * that if there is an active session, it has at least one dseg,
+         * and it's safe to complete it at any time.
+         */
         mlx5e_xdp_mpwqe_session_start(sq);
     }
 
-    mlx5e_xdp_mpwqe_add_dseg(sq, xdpi, stats);
+    mlx5e_xdp_mpwqe_add_dseg(sq, xdptxd, stats);
 
     if (unlikely(session->complete ||
                  session->ds_count == session->max_ds_count))
···
     return true;
 }
 
-static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi)
+static int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq)
+{
+    if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, 1))) {
+        /* SQ is full, ring doorbell */
+        mlx5e_xmit_xdp_doorbell(sq);
+        sq->stats->full++;
+        return -EBUSY;
+    }
+
+    return MLX5E_XDP_CHECK_OK;
+}
+
+static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
+                                 struct mlx5e_xdp_xmit_data *xdptxd,
+                                 struct mlx5e_xdp_info *xdpi,
+                                 int check_result)
 {
     struct mlx5_wq_cyc *wq = &sq->wq;
     u16 pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
···
     struct mlx5_wqe_eth_seg *eseg = &wqe->eth;
     struct mlx5_wqe_data_seg *dseg = wqe->data;
 
-    struct xdp_frame *xdpf = xdpi->xdpf;
-    dma_addr_t dma_addr = xdpi->dma_addr;
-    unsigned int dma_len = xdpf->len;
+    dma_addr_t dma_addr = xdptxd->dma_addr;
+    u32 dma_len = xdptxd->len;
 
     struct mlx5e_xdpsq_stats *stats = sq->stats;
 
···
         return false;
     }
 
-    if (unlikely(!mlx5e_wqc_has_room_for(wq, sq->cc, sq->pc, 1))) {
-        /* SQ is full, ring doorbell */
-        mlx5e_xmit_xdp_doorbell(sq);
-        stats->full++;
+    if (!check_result)
+        check_result = mlx5e_xmit_xdp_frame_check(sq);
+    if (unlikely(check_result < 0))
         return false;
-    }
 
     cseg->fm_ce_se = 0;
 
     /* copy the inline part if required */
     if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {
-        memcpy(eseg->inline_hdr.start, xdpf->data, MLX5E_XDP_MIN_INLINE);
+        memcpy(eseg->inline_hdr.start, xdptxd->data, MLX5E_XDP_MIN_INLINE);
         eseg->inline_hdr.sz = cpu_to_be16(MLX5E_XDP_MIN_INLINE);
         dma_len -= MLX5E_XDP_MIN_INLINE;
         dma_addr += MLX5E_XDP_MIN_INLINE;
···
 
 static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
                                   struct mlx5e_xdp_wqe_info *wi,
-                                  struct mlx5e_rq *rq,
+                                  u32 *xsk_frames,
                                   bool recycle)
 {
     struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
···
     for (i = 0; i < wi->num_pkts; i++) {
         struct mlx5e_xdp_info xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 
-        if (rq) {
-            /* XDP_TX */
-            mlx5e_page_release(rq, &xdpi.di, recycle);
-        } else {
-            /* XDP_REDIRECT */
-            dma_unmap_single(sq->pdev, xdpi.dma_addr,
-                             xdpi.xdpf->len, DMA_TO_DEVICE);
-            xdp_return_frame(xdpi.xdpf);
+        switch (xdpi.mode) {
+        case MLX5E_XDP_XMIT_MODE_FRAME:
+            /* XDP_TX from the XSK RQ and XDP_REDIRECT */
+            dma_unmap_single(sq->pdev, xdpi.frame.dma_addr,
+                             xdpi.frame.xdpf->len, DMA_TO_DEVICE);
+            xdp_return_frame(xdpi.frame.xdpf);
+            break;
+        case MLX5E_XDP_XMIT_MODE_PAGE:
+            /* XDP_TX from the regular RQ */
+            mlx5e_page_release_dynamic(xdpi.page.rq, &xdpi.page.di, recycle);
+            break;
+        case MLX5E_XDP_XMIT_MODE_XSK:
+            /* AF_XDP send */
+            (*xsk_frames)++;
+            break;
+        default:
+            WARN_ON_ONCE(true);
         }
     }
 }
 
-bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
+bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
 {
     struct mlx5e_xdpsq *sq;
     struct mlx5_cqe64 *cqe;
+    u32 xsk_frames = 0;
     u16 sqcc;
     int i;
 
···
 
             sqcc += wi->num_wqebbs;
 
-            mlx5e_free_xdpsq_desc(sq, wi, rq, true);
+            mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, true);
         } while (!last_wqe);
     } while ((++i < MLX5E_TX_CQ_POLL_BUDGET) && (cqe = mlx5_cqwq_get_cqe(&cq->wq)));
+
+    if (xsk_frames)
+        xsk_umem_complete_tx(sq->umem, xsk_frames);
 
     sq->stats->cqes += i;
 
···
     return (i == MLX5E_TX_CQ_POLL_BUDGET);
 }
 
-void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq)
+void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
 {
+    u32 xsk_frames = 0;
+
     while (sq->cc != sq->pc) {
         struct mlx5e_xdp_wqe_info *wi;
         u16 ci;
···
 
         sq->cc += wi->num_wqebbs;
 
-        mlx5e_free_xdpsq_desc(sq, wi, rq, false);
+        mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, false);
     }
+
+    if (xsk_frames)
+        xsk_umem_complete_tx(sq->umem, xsk_frames);
 }
 
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
···
 
     for (i = 0; i < n; i++) {
         struct xdp_frame *xdpf = frames[i];
+        struct mlx5e_xdp_xmit_data xdptxd;
         struct mlx5e_xdp_info xdpi;
 
-        xdpi.dma_addr = dma_map_single(sq->pdev, xdpf->data, xdpf->len,
-                                       DMA_TO_DEVICE);
-        if (unlikely(dma_mapping_error(sq->pdev, xdpi.dma_addr))) {
+        xdptxd.data = xdpf->data;
+        xdptxd.len = xdpf->len;
+        xdptxd.dma_addr = dma_map_single(sq->pdev, xdptxd.data,
+                                         xdptxd.len, DMA_TO_DEVICE);
+
+        if (unlikely(dma_mapping_error(sq->pdev, xdptxd.dma_addr))) {
             xdp_return_frame_rx_napi(xdpf);
             drops++;
             continue;
         }
 
-        xdpi.xdpf = xdpf;
+        xdpi.mode = MLX5E_XDP_XMIT_MODE_FRAME;
+        xdpi.frame.xdpf = xdpf;
+        xdpi.frame.dma_addr = xdptxd.dma_addr;
 
-        if (unlikely(!sq->xmit_xdp_frame(sq, &xdpi))) {
-            dma_unmap_single(sq->pdev, xdpi.dma_addr,
-                             xdpf->len, DMA_TO_DEVICE);
+        if (unlikely(!sq->xmit_xdp_frame(sq, &xdptxd, &xdpi, 0))) {
+            dma_unmap_single(sq->pdev, xdptxd.dma_addr,
+                             xdptxd.len, DMA_TO_DEVICE);
             xdp_return_frame_rx_napi(xdpf);
             drops++;
         }
···
 
 void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq)
 {
-    struct mlx5e_xdpsq *xdpsq = &rq->xdpsq;
+    struct mlx5e_xdpsq *xdpsq = rq->xdpsq;
 
     if (xdpsq->mpwqe.wqe)
         mlx5e_xdp_mpwqe_complete(xdpsq);
···
 
 void mlx5e_set_xmit_fp(struct mlx5e_xdpsq *sq, bool is_mpw)
 {
+    sq->xmit_xdp_frame_check = is_mpw ?
+        mlx5e_xmit_xdp_frame_check_mpwqe : mlx5e_xmit_xdp_frame_check;
     sq->xmit_xdp_frame = is_mpw ?
         mlx5e_xmit_xdp_frame_mpwqe : mlx5e_xmit_xdp_frame;
 }
+26 -10
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
···
     (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
 #define MLX5E_XDP_TX_DS_COUNT (MLX5E_XDP_TX_EMPTY_DS_COUNT + 1 /* SG DS */)
 
-int mlx5e_xdp_max_mtu(struct mlx5e_params *params);
+struct mlx5e_xsk_param;
+int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
-                      void *va, u16 *rx_headroom, u32 *len);
-bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq);
-void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq);
+                      void *va, u16 *rx_headroom, u32 *len, bool xsk);
+void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq);
+bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
+void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
 void mlx5e_set_xmit_fp(struct mlx5e_xdpsq *sq, bool is_mpw);
 void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
···
 static inline bool mlx5e_xdp_tx_is_enabled(struct mlx5e_priv *priv)
 {
     return test_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state);
+}
+
+static inline void mlx5e_xdp_set_open(struct mlx5e_priv *priv)
+{
+    set_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
+}
+
+static inline void mlx5e_xdp_set_closed(struct mlx5e_priv *priv)
+{
+    clear_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
+}
+
+static inline bool mlx5e_xdp_is_open(struct mlx5e_priv *priv)
+{
+    return test_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
 }
 
 static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
···
 }
 
 static inline void
-mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi,
+mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq,
+                         struct mlx5e_xdp_xmit_data *xdptxd,
                          struct mlx5e_xdpsq_stats *stats)
 {
     struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
-    dma_addr_t dma_addr = xdpi->dma_addr;
-    struct xdp_frame *xdpf = xdpi->xdpf;
     struct mlx5_wqe_data_seg *dseg =
         (struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
-    u16 dma_len = xdpf->len;
+    u32 dma_len = xdptxd->len;
 
     session->pkt_count++;
 
···
     }
 
         inline_dseg->byte_count = cpu_to_be32(dma_len | MLX5_INLINE_SEG);
-        memcpy(inline_dseg->data, xdpf->data, dma_len);
+        memcpy(inline_dseg->data, xdptxd->data, dma_len);
 
         session->ds_count += ds_cnt;
         stats->inlnw++;
···
     }
 
 no_inline:
-    dseg->addr = cpu_to_be64(dma_addr);
+    dseg->addr = cpu_to_be64(xdptxd->dma_addr);
     dseg->byte_count = cpu_to_be32(dma_len);
     dseg->lkey = sq->mkey_be;
     session->ds_count++;
+1
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/Makefile
···
+subdir-ccflags-y += -I$(src)/../..
+192
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
···
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "rx.h"
+#include "en/xdp.h"
+#include <net/xdp_sock.h>
+
+/* RX data path */
+
+bool mlx5e_xsk_pages_enough_umem(struct mlx5e_rq *rq, int count)
+{
+    /* Check in advance that we have enough frames, instead of allocating
+     * one-by-one, failing and moving frames to the Reuse Ring.
+     */
+    return xsk_umem_has_addrs_rq(rq->umem, count);
+}
+
+int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
+                              struct mlx5e_dma_info *dma_info)
+{
+    struct xdp_umem *umem = rq->umem;
+    u64 handle;
+
+    if (!xsk_umem_peek_addr_rq(umem, &handle))
+        return -ENOMEM;
+
+    dma_info->xsk.handle = handle + rq->buff.umem_headroom;
+    dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);
+
+    /* No need to add headroom to the DMA address. In striding RQ case, we
+     * just provide pages for UMR, and headroom is counted at the setup
+     * stage when creating a WQE. In non-striding RQ case, headroom is
+     * accounted in mlx5e_alloc_rx_wqe.
+     */
+    dma_info->addr = xdp_umem_get_dma(umem, handle);
+
+    xsk_umem_discard_addr_rq(umem);
+
+    dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
+                               DMA_BIDIRECTIONAL);
+
+    return 0;
+}
+
+static inline void mlx5e_xsk_recycle_frame(struct mlx5e_rq *rq, u64 handle)
+{
+    xsk_umem_fq_reuse(rq->umem, handle & rq->umem->chunk_mask);
+}
+
+/* XSKRQ uses pages from UMEM, they must not be released. They are returned to
+ * the userspace if possible, and if not, this function is called to reuse them
+ * in the driver.
+ */
+void mlx5e_xsk_page_release(struct mlx5e_rq *rq,
+                            struct mlx5e_dma_info *dma_info)
+{
+    mlx5e_xsk_recycle_frame(rq, dma_info->xsk.handle);
+}
+
+/* Return a frame back to the hardware to fill in again. It is used by XDP when
+ * the XDP program returns XDP_TX or XDP_REDIRECT not to an XSKMAP.
+ */
+void mlx5e_xsk_zca_free(struct zero_copy_allocator *zca, unsigned long handle)
+{
+    struct mlx5e_rq *rq = container_of(zca, struct mlx5e_rq, zca);
+
+    mlx5e_xsk_recycle_frame(rq, handle);
+}
+
+static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, void *data,
+                                               u32 cqe_bcnt)
+{
+    struct sk_buff *skb;
+
+    skb = napi_alloc_skb(rq->cq.napi, cqe_bcnt);
+    if (unlikely(!skb)) {
+        rq->stats->buff_alloc_err++;
+        return NULL;
+    }
+
+    skb_put_data(skb, data, cqe_bcnt);
+
+    return skb;
+}
+
+struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
+                                                    struct mlx5e_mpw_info *wi,
+                                                    u16 cqe_bcnt,
+                                                    u32 head_offset,
+                                                    u32 page_idx)
+{
+    struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
+    u16 rx_headroom = rq->buff.headroom - rq->buff.umem_headroom;
+    u32 cqe_bcnt32 = cqe_bcnt;
+    void *va, *data;
+    u32 frag_size;
+    bool consumed;
+
+    /* Check packet size. Note LRO doesn't use linear SKB */
+    if (unlikely(cqe_bcnt > rq->hw_mtu)) {
+        rq->stats->oversize_pkts_sw_drop++;
+        return NULL;
+    }
+
+    /* head_offset is not used in this function, because di->xsk.data and
+     * di->addr point directly to the necessary place. Furthermore, in the
+     * current implementation, one page = one packet = one frame, so
+     * head_offset should always be 0.
+     */
+    WARN_ON_ONCE(head_offset);
+
+    va = di->xsk.data;
+    data = va + rx_headroom;
+    frag_size = rq->buff.headroom + cqe_bcnt32;
+
+    dma_sync_single_for_cpu(rq->pdev, di->addr, frag_size, DMA_BIDIRECTIONAL);
+    prefetch(data);
+
+    rcu_read_lock();
+    consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, true);
+    rcu_read_unlock();
+
+    /* Possible flows:
+     * - XDP_REDIRECT to XSKMAP:
+     *   The page is owned by the userspace from now.
+     * - XDP_TX and other XDP_REDIRECTs:
+     *   The page was returned by ZCA and recycled.
+     * - XDP_DROP:
+     *   Recycle the page.
+     * - XDP_PASS:
+     *   Allocate an SKB, copy the data and recycle the page.
+     *
+     * Pages to be recycled go to the Reuse Ring on MPWQE deallocation. Its
+     * size is the same as the Driver RX Ring's size, and pages for WQEs are
+     * allocated first from the Reuse Ring, so it has enough space.
+     */
+
+    if (likely(consumed)) {
+        if (likely(__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)))
+            __set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
+        return NULL; /* page/packet was consumed by XDP */
+    }
+
+    /* XDP_PASS: copy the data from the UMEM to a new SKB and reuse the
+     * frame. On SKB allocation failure, NULL is returned.
+     */
+    return mlx5e_xsk_construct_skb(rq, data, cqe_bcnt32);
+}
+
+struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
+                                              struct mlx5_cqe64 *cqe,
+                                              struct mlx5e_wqe_frag_info *wi,
+                                              u32 cqe_bcnt)
+{
+    struct mlx5e_dma_info *di = wi->di;
+    u16 rx_headroom = rq->buff.headroom - rq->buff.umem_headroom;
+    void *va, *data;
+    bool consumed;
+    u32 frag_size;
+
+    /* wi->offset is not used in this function, because di->xsk.data and
+     * di->addr point directly to the necessary place. Furthermore, in the
+     * current implementation, one page = one packet = one frame, so
+     * wi->offset should always be 0.
+     */
+    WARN_ON_ONCE(wi->offset);
+
+    va = di->xsk.data;
+    data = va + rx_headroom;
+    frag_size = rq->buff.headroom + cqe_bcnt;
+
+    dma_sync_single_for_cpu(rq->pdev, di->addr, frag_size, DMA_BIDIRECTIONAL);
+    prefetch(data);
+
+    if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_RESP_SEND)) {
+        rq->stats->wqe_err++;
+        return NULL;
+    }
+
+    rcu_read_lock();
+    consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, true);
+    rcu_read_unlock();
+
+    if (likely(consumed))
+        return NULL; /* page/packet was consumed by XDP */
+
+    /* XDP_PASS: copy the data from the UMEM to a new SKB. The frame reuse
+     * will be handled by mlx5e_put_rx_frag.
+     * On SKB allocation failure, NULL is returned.
+     */
+    return mlx5e_xsk_construct_skb(rq, data, cqe_bcnt);
+}
+27
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
···
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_XSK_RX_H__
+#define __MLX5_EN_XSK_RX_H__
+
+#include "en.h"
+
+/* RX data path */
+
+bool mlx5e_xsk_pages_enough_umem(struct mlx5e_rq *rq, int count);
+int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
+                              struct mlx5e_dma_info *dma_info);
+void mlx5e_xsk_page_release(struct mlx5e_rq *rq,
+                            struct mlx5e_dma_info *dma_info);
+void mlx5e_xsk_zca_free(struct zero_copy_allocator *zca, unsigned long handle);
+struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
+                                                    struct mlx5e_mpw_info *wi,
+                                                    u16 cqe_bcnt,
+                                                    u32 head_offset,
+                                                    u32 page_idx);
+struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
+                                              struct mlx5_cqe64 *cqe,
+                                              struct mlx5e_wqe_frag_info *wi,
+                                              u32 cqe_bcnt);
+
+#endif /* __MLX5_EN_XSK_RX_H__ */
+223
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
···
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "setup.h"
+#include "en/params.h"
+
+bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
+                              struct mlx5e_xsk_param *xsk,
+                              struct mlx5_core_dev *mdev)
+{
+    /* AF_XDP doesn't support frames larger than PAGE_SIZE, and the current
+     * mlx5e XDP implementation doesn't support multiple packets per page.
+     */
+    if (xsk->chunk_size != PAGE_SIZE)
+        return false;
+
+    /* Current MTU and XSK headroom don't allow packets to fit the frames. */
+    if (mlx5e_rx_get_linear_frag_sz(params, xsk) > xsk->chunk_size)
+        return false;
+
+    /* frag_sz is different for regular and XSK RQs, so ensure that linear
+     * SKB mode is possible.
+     */
+    switch (params->rq_wq_type) {
+    case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+        return mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk);
+    default: /* MLX5_WQ_TYPE_CYCLIC */
+        return mlx5e_rx_is_linear_skb(params, xsk);
+    }
+}
+
+static void mlx5e_build_xskicosq_param(struct mlx5e_priv *priv,
+                                       u8 log_wq_size,
+                                       struct mlx5e_sq_param *param)
+{
+    void *sqc = param->sqc;
+    void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+    mlx5e_build_sq_param_common(priv, param);
+
+    MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
+}
+
+static void mlx5e_build_xsk_cparam(struct mlx5e_priv *priv,
+                                   struct mlx5e_params *params,
+                                   struct mlx5e_xsk_param *xsk,
+                                   struct mlx5e_channel_param *cparam)
+{
+    const u8 xskicosq_size = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
+
+    mlx5e_build_rq_param(priv, params, xsk, &cparam->rq);
+    mlx5e_build_xdpsq_param(priv, params, &cparam->xdp_sq);
+    mlx5e_build_xskicosq_param(priv, xskicosq_size, &cparam->icosq);
+    mlx5e_build_rx_cq_param(priv, params, xsk, &cparam->rx_cq);
+    mlx5e_build_tx_cq_param(priv, params, &cparam->tx_cq);
+    mlx5e_build_ico_cq_param(priv, xskicosq_size, &cparam->icosq_cq);
+}
+
+int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params,
+                   struct mlx5e_xsk_param *xsk, struct xdp_umem *umem,
+                   struct mlx5e_channel *c)
+{
+    struct mlx5e_channel_param cparam = {};
+    struct dim_cq_moder icocq_moder = {};
+    int err;
+
+    if (!mlx5e_validate_xsk_param(params, xsk, priv->mdev))
+        return -EINVAL;
+
+    mlx5e_build_xsk_cparam(priv, params, xsk, &cparam);
+
+    err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam.rx_cq, &c->xskrq.cq);
+    if (unlikely(err))
+        return err;
+
+    err = mlx5e_open_rq(c, params, &cparam.rq, xsk, umem, &c->xskrq);
+    if (unlikely(err))
+        goto err_close_rx_cq;
+
+    err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam.tx_cq, &c->xsksq.cq);
+    if (unlikely(err))
+        goto err_close_rq;
+
+    /* Create a separate SQ, so that when the UMEM is disabled, we could
+     * close this SQ safely and stop receiving CQEs. In other case, e.g., if
+     * the XDPSQ was used instead, we might run into trouble when the UMEM
+     * is disabled and then reenabled, but the SQ continues receiving CQEs
+     * from the old UMEM.
+     */
+    err = mlx5e_open_xdpsq(c, params, &cparam.xdp_sq, umem, &c->xsksq, true);
+    if (unlikely(err))
+        goto err_close_tx_cq;
+
+    err = mlx5e_open_cq(c, icocq_moder, &cparam.icosq_cq, &c->xskicosq.cq);
+    if (unlikely(err))
+        goto err_close_sq;
+
+    /* Create a dedicated SQ for posting NOPs whenever we need an IRQ to be
+     * triggered and NAPI to be called on the correct CPU.
+     */
+    err = mlx5e_open_icosq(c, params, &cparam.icosq, &c->xskicosq);
+    if (unlikely(err))
+        goto err_close_icocq;
+
+    spin_lock_init(&c->xskicosq_lock);
+
+    set_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
+
+    return 0;
+
+err_close_icocq:
+    mlx5e_close_cq(&c->xskicosq.cq);
+
+err_close_sq:
+    mlx5e_close_xdpsq(&c->xsksq);
+
+err_close_tx_cq:
+    mlx5e_close_cq(&c->xsksq.cq);
+
+err_close_rq:
+    mlx5e_close_rq(&c->xskrq);
+
+err_close_rx_cq:
+    mlx5e_close_cq(&c->xskrq.cq);
+
+    return err;
+}
+
+void mlx5e_close_xsk(struct mlx5e_channel *c)
+{
+    clear_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
+    napi_synchronize(&c->napi);
+
+    mlx5e_close_rq(&c->xskrq);
+    mlx5e_close_cq(&c->xskrq.cq);
+    mlx5e_close_icosq(&c->xskicosq);
+    mlx5e_close_cq(&c->xskicosq.cq);
+    mlx5e_close_xdpsq(&c->xsksq);
+    mlx5e_close_cq(&c->xsksq.cq);
+}
+
+void mlx5e_activate_xsk(struct mlx5e_channel *c)
+{
+    set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
+    /* TX queue is created active. */
+    mlx5e_trigger_irq(&c->xskicosq);
+}
+
+void mlx5e_deactivate_xsk(struct mlx5e_channel *c)
+{
+    mlx5e_deactivate_rq(&c->xskrq);
+    /* TX queue is disabled on close. */
+}
+
+static int mlx5e_redirect_xsk_rqt(struct mlx5e_priv *priv, u16 ix, u32 rqn)
+{
+    struct mlx5e_redirect_rqt_param direct_rrp = {
+        .is_rss = false,
+        {
+            .rqn = rqn,
+        },
+    };
+
+    u32 rqtn = priv->xsk_tir[ix].rqt.rqtn;
+
+    return mlx5e_redirect_rqt(priv, rqtn, 1, direct_rrp);
+}
+
+int mlx5e_xsk_redirect_rqt_to_channel(struct mlx5e_priv *priv, struct mlx5e_channel *c)
+{
+    return mlx5e_redirect_xsk_rqt(priv, c->ix, c->xskrq.rqn);
+}
+
+int mlx5e_xsk_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix)
+{
+    return mlx5e_redirect_xsk_rqt(priv, ix, priv->drop_rq.rqn);
+}
+
+int mlx5e_xsk_redirect_rqts_to_channels(struct mlx5e_priv *priv, struct mlx5e_channels *chs)
+{
+    int err, i;
+
+    if (!priv->xsk.refcnt)
+        return 0;
+
+    for (i = 0; i < chs->num; i++) {
+        struct mlx5e_channel *c = chs->c[i];
+
+        if (!test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
+            continue;
+
+        err = mlx5e_xsk_redirect_rqt_to_channel(priv, c);
+        if (unlikely(err))
+            goto err_stop;
+    }
+
+    return 0;
+
+err_stop:
+    for (i--; i >= 0; i--) {
+        if (!test_bit(MLX5E_CHANNEL_STATE_XSK, chs->c[i]->state))
+            continue;
+
+        mlx5e_xsk_redirect_rqt_to_drop(priv, i);
+    }
+
+    return err;
+}
+
+void mlx5e_xsk_redirect_rqts_to_drop(struct mlx5e_priv *priv, struct mlx5e_channels *chs)
+{
+    int i;
+
+    if (!priv->xsk.refcnt)
+        return;
+
+    for (i = 0; i < chs->num; i++) {
+        if (!test_bit(MLX5E_CHANNEL_STATE_XSK, chs->c[i]->state))
+            continue;
+
+        mlx5e_xsk_redirect_rqt_to_drop(priv, i);
+    }
+}
+25
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h
···
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_XSK_SETUP_H__
+#define __MLX5_EN_XSK_SETUP_H__
+
+#include "en.h"
+
+struct mlx5e_xsk_param;
+
+bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
+                              struct mlx5e_xsk_param *xsk,
+                              struct mlx5_core_dev *mdev);
+int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params,
+                   struct mlx5e_xsk_param *xsk, struct xdp_umem *umem,
+                   struct mlx5e_channel *c);
+void mlx5e_close_xsk(struct mlx5e_channel *c);
+void mlx5e_activate_xsk(struct mlx5e_channel *c);
+void mlx5e_deactivate_xsk(struct mlx5e_channel *c);
+int mlx5e_xsk_redirect_rqt_to_channel(struct mlx5e_priv *priv, struct mlx5e_channel *c);
+int mlx5e_xsk_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix);
+int mlx5e_xsk_redirect_rqts_to_channels(struct mlx5e_priv *priv, struct mlx5e_channels *chs);
+void mlx5e_xsk_redirect_rqts_to_drop(struct mlx5e_priv *priv, struct mlx5e_channels *chs);
+
+#endif /* __MLX5_EN_XSK_SETUP_H__ */
+111
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
···
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "tx.h"
+#include "umem.h"
+#include "en/xdp.h"
+#include "en/params.h"
+#include <net/xdp_sock.h>
+
+int mlx5e_xsk_async_xmit(struct net_device *dev, u32 qid)
+{
+    struct mlx5e_priv *priv = netdev_priv(dev);
+    struct mlx5e_params *params = &priv->channels.params;
+    struct mlx5e_channel *c;
+    u16 ix;
+
+    if (unlikely(!mlx5e_xdp_is_open(priv)))
+        return -ENETDOWN;
+
+    if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid, MLX5E_RQ_GROUP_XSK, &ix)))
+        return -EINVAL;
+
+    c = priv->channels.c[ix];
+
+    if (unlikely(!test_bit(MLX5E_CHANNEL_STATE_XSK, c->state)))
+        return -ENXIO;
+
+    if (!napi_if_scheduled_mark_missed(&c->napi)) {
+        spin_lock(&c->xskicosq_lock);
+        mlx5e_trigger_irq(&c->xskicosq);
+        spin_unlock(&c->xskicosq_lock);
+    }
+
+    return 0;
+}
+
+/* When TX fails (because of the size of the packet), we need to get completions
+ * in order, so post a NOP to get a CQE. Since AF_XDP doesn't distinguish
+ * between successful TX and errors, handling in mlx5e_poll_xdpsq_cq is the
+ * same.
+ */
+static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
+                                  struct mlx5e_xdp_info *xdpi)
+{
+    u16 pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
+    struct mlx5e_xdp_wqe_info *wi = &sq->db.wqe_info[pi];
+    struct mlx5e_tx_wqe *nopwqe;
+
+    wi->num_wqebbs = 1;
+    wi->num_pkts = 1;
+
+    nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
+    mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
+    sq->doorbell_cseg = &nopwqe->ctrl;
+}
+
+bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
+{
+    struct xdp_umem *umem = sq->umem;
+    struct mlx5e_xdp_info xdpi;
+    struct mlx5e_xdp_xmit_data xdptxd;
+    bool work_done = true;
+    bool flush = false;
+
+    xdpi.mode = MLX5E_XDP_XMIT_MODE_XSK;
+
+    for (; budget; budget--) {
+        int check_result = sq->xmit_xdp_frame_check(sq);
+        struct xdp_desc desc;
+
+        if (unlikely(check_result < 0)) {
+            work_done = false;
+            break;
+        }
+
+        if (!xsk_umem_consume_tx(umem, &desc)) {
+            /* TX will get stuck until something wakes it up by
+             * triggering NAPI. Currently it's expected that the
+             * application calls sendto() if there are consumed, but
+             * not completed frames.
+             */
+            break;
+        }
+
+        xdptxd.dma_addr = xdp_umem_get_dma(umem, desc.addr);
+        xdptxd.data = xdp_umem_get_data(umem, desc.addr);
+        xdptxd.len = desc.len;
+
+        dma_sync_single_for_device(sq->pdev, xdptxd.dma_addr,
+                                   xdptxd.len, DMA_BIDIRECTIONAL);
+
+        if (unlikely(!sq->xmit_xdp_frame(sq, &xdptxd, &xdpi, check_result))) {
+            if (sq->mpwqe.wqe)
+                mlx5e_xdp_mpwqe_complete(sq);
+
+            mlx5e_xsk_tx_post_err(sq, &xdpi);
+        }
+
+        flush = true;
+    }
+
+    if (flush) {
+        if (sq->mpwqe.wqe)
+            mlx5e_xdp_mpwqe_complete(sq);
+        mlx5e_xmit_xdp_doorbell(sq);
+
+        xsk_umem_consume_tx_done(umem);
+    }
+
+    return !(budget && work_done);
+}
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h (+15)
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_XSK_TX_H__
+#define __MLX5_EN_XSK_TX_H__
+
+#include "en.h"
+
+/* TX data path */
+
+int mlx5e_xsk_async_xmit(struct net_device *dev, u32 qid);
+
+bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget);
+
+#endif /* __MLX5_EN_XSK_TX_H__ */
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c (+267)
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include <net/xdp_sock.h>
+#include "umem.h"
+#include "setup.h"
+#include "en/params.h"
+
+static int mlx5e_xsk_map_umem(struct mlx5e_priv *priv,
+			      struct xdp_umem *umem)
+{
+	struct device *dev = priv->mdev->device;
+	u32 i;
+
+	for (i = 0; i < umem->npgs; i++) {
+		dma_addr_t dma = dma_map_page(dev, umem->pgs[i], 0, PAGE_SIZE,
+					      DMA_BIDIRECTIONAL);
+
+		if (unlikely(dma_mapping_error(dev, dma)))
+			goto err_unmap;
+		umem->pages[i].dma = dma;
+	}
+
+	return 0;
+
+err_unmap:
+	while (i--) {
+		dma_unmap_page(dev, umem->pages[i].dma, PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+		umem->pages[i].dma = 0;
+	}
+
+	return -ENOMEM;
+}
+
+static void mlx5e_xsk_unmap_umem(struct mlx5e_priv *priv,
+				 struct xdp_umem *umem)
+{
+	struct device *dev = priv->mdev->device;
+	u32 i;
+
+	for (i = 0; i < umem->npgs; i++) {
+		dma_unmap_page(dev, umem->pages[i].dma, PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+		umem->pages[i].dma = 0;
+	}
+}
+
+static int mlx5e_xsk_get_umems(struct mlx5e_xsk *xsk)
+{
+	if (!xsk->umems) {
+		xsk->umems = kcalloc(MLX5E_MAX_NUM_CHANNELS,
+				     sizeof(*xsk->umems), GFP_KERNEL);
+		if (unlikely(!xsk->umems))
+			return -ENOMEM;
+	}
+
+	xsk->refcnt++;
+	xsk->ever_used = true;
+
+	return 0;
+}
+
+static void mlx5e_xsk_put_umems(struct mlx5e_xsk *xsk)
+{
+	if (!--xsk->refcnt) {
+		kfree(xsk->umems);
+		xsk->umems = NULL;
+	}
+}
+
+static int mlx5e_xsk_add_umem(struct mlx5e_xsk *xsk, struct xdp_umem *umem, u16 ix)
+{
+	int err;
+
+	err = mlx5e_xsk_get_umems(xsk);
+	if (unlikely(err))
+		return err;
+
+	xsk->umems[ix] = umem;
+	return 0;
+}
+
+static void mlx5e_xsk_remove_umem(struct mlx5e_xsk *xsk, u16 ix)
+{
+	xsk->umems[ix] = NULL;
+
+	mlx5e_xsk_put_umems(xsk);
+}
+
+static bool mlx5e_xsk_is_umem_sane(struct xdp_umem *umem)
+{
+	return umem->headroom <= 0xffff && umem->chunk_size_nohr <= 0xffff;
+}
+
+void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk)
+{
+	xsk->headroom = umem->headroom;
+	xsk->chunk_size = umem->chunk_size_nohr + umem->headroom;
+}
+
+static int mlx5e_xsk_enable_locked(struct mlx5e_priv *priv,
+				   struct xdp_umem *umem, u16 ix)
+{
+	struct mlx5e_params *params = &priv->channels.params;
+	struct mlx5e_xsk_param xsk;
+	struct mlx5e_channel *c;
+	int err;
+
+	if (unlikely(mlx5e_xsk_get_umem(&priv->channels.params, &priv->xsk, ix)))
+		return -EBUSY;
+
+	if (unlikely(!mlx5e_xsk_is_umem_sane(umem)))
+		return -EINVAL;
+
+	err = mlx5e_xsk_map_umem(priv, umem);
+	if (unlikely(err))
+		return err;
+
+	err = mlx5e_xsk_add_umem(&priv->xsk, umem, ix);
+	if (unlikely(err))
+		goto err_unmap_umem;
+
+	mlx5e_build_xsk_param(umem, &xsk);
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+		/* XSK objects will be created on open. */
+		goto validate_closed;
+	}
+
+	if (!params->xdp_prog) {
+		/* XSK objects will be created when an XDP program is set,
+		 * and the channels are reopened.
+		 */
+		goto validate_closed;
+	}
+
+	c = priv->channels.c[ix];
+
+	err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+	if (unlikely(err))
+		goto err_remove_umem;
+
+	mlx5e_activate_xsk(c);
+
+	/* Don't wait for WQEs, because the newer xdpsock sample doesn't provide
+	 * any Fill Ring entries at the setup stage.
+	 */
+
+	err = mlx5e_xsk_redirect_rqt_to_channel(priv, priv->channels.c[ix]);
+	if (unlikely(err))
+		goto err_deactivate;
+
+	return 0;
+
+err_deactivate:
+	mlx5e_deactivate_xsk(c);
+	mlx5e_close_xsk(c);
+
+err_remove_umem:
+	mlx5e_xsk_remove_umem(&priv->xsk, ix);
+
+err_unmap_umem:
+	mlx5e_xsk_unmap_umem(priv, umem);
+
+	return err;
+
+validate_closed:
+	/* Check the configuration in advance, rather than fail at a later stage
+	 * (in mlx5e_xdp_set or on open) and end up with no channels.
+	 */
+	if (!mlx5e_validate_xsk_param(params, &xsk, priv->mdev)) {
+		err = -EINVAL;
+		goto err_remove_umem;
+	}
+
+	return 0;
+}
+
+static int mlx5e_xsk_disable_locked(struct mlx5e_priv *priv, u16 ix)
+{
+	struct xdp_umem *umem = mlx5e_xsk_get_umem(&priv->channels.params,
+						   &priv->xsk, ix);
+	struct mlx5e_channel *c;
+
+	if (unlikely(!umem))
+		return -EINVAL;
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
+		goto remove_umem;
+
+	/* XSK RQ and SQ are only created if XDP program is set. */
+	if (!priv->channels.params.xdp_prog)
+		goto remove_umem;
+
+	c = priv->channels.c[ix];
+	mlx5e_xsk_redirect_rqt_to_drop(priv, ix);
+	mlx5e_deactivate_xsk(c);
+	mlx5e_close_xsk(c);
+
+remove_umem:
+	mlx5e_xsk_remove_umem(&priv->xsk, ix);
+	mlx5e_xsk_unmap_umem(priv, umem);
+
+	return 0;
+}
+
+static int mlx5e_xsk_enable_umem(struct mlx5e_priv *priv, struct xdp_umem *umem,
+				 u16 ix)
+{
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	err = mlx5e_xsk_enable_locked(priv, umem, ix);
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+static int mlx5e_xsk_disable_umem(struct mlx5e_priv *priv, u16 ix)
+{
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	err = mlx5e_xsk_disable_locked(priv, ix);
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5e_params *params = &priv->channels.params;
+	u16 ix;
+
+	if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid, MLX5E_RQ_GROUP_XSK, &ix)))
+		return -EINVAL;
+
+	return umem ? mlx5e_xsk_enable_umem(priv, umem, ix) :
+		      mlx5e_xsk_disable_umem(priv, ix);
+}
+
+int mlx5e_xsk_resize_reuseq(struct xdp_umem *umem, u32 nentries)
+{
+	struct xdp_umem_fq_reuse *reuseq;
+
+	reuseq = xsk_reuseq_prepare(nentries);
+	if (unlikely(!reuseq))
+		return -ENOMEM;
+	xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
+
+	return 0;
+}
+
+u16 mlx5e_xsk_first_unused_channel(struct mlx5e_params *params, struct mlx5e_xsk *xsk)
+{
+	u16 res = xsk->refcnt ? params->num_channels : 0;
+
+	while (res) {
+		if (mlx5e_xsk_get_umem(params, xsk, res - 1))
+			break;
+		--res;
+	}
+
+	return res;
+}
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h (+31)
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_XSK_UMEM_H__
+#define __MLX5_EN_XSK_UMEM_H__
+
+#include "en.h"
+
+static inline struct xdp_umem *mlx5e_xsk_get_umem(struct mlx5e_params *params,
+						  struct mlx5e_xsk *xsk, u16 ix)
+{
+	if (!xsk || !xsk->umems)
+		return NULL;
+
+	if (unlikely(ix >= params->num_channels))
+		return NULL;
+
+	return xsk->umems[ix];
+}
+
+struct mlx5e_xsk_param;
+void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk);
+
+/* .ndo_bpf callback. */
+int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid);
+
+int mlx5e_xsk_resize_reuseq(struct xdp_umem *umem, u32 nentries);
+
+u16 mlx5e_xsk_first_unused_channel(struct mlx5e_params *params, struct mlx5e_xsk *xsk);
+
+#endif /* __MLX5_EN_XSK_UMEM_H__ */
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c (+23 -2)
···
 
 #include "en.h"
 #include "en/port.h"
+#include "en/xsk/umem.h"
 #include "lib/clock.h"
 
 void mlx5e_ethtool_get_drvinfo(struct mlx5e_priv *priv,
···
 void mlx5e_ethtool_get_channels(struct mlx5e_priv *priv,
 				struct ethtool_channels *ch)
 {
+	mutex_lock(&priv->state_lock);
+
 	ch->max_combined = mlx5e_get_netdev_max_channels(priv->netdev);
 	ch->combined_count = priv->channels.params.num_channels;
+	if (priv->xsk.refcnt) {
+		/* The upper half are XSK queues. */
+		ch->max_combined *= 2;
+		ch->combined_count *= 2;
+	}
+
+	mutex_unlock(&priv->state_lock);
 }
 
 static void mlx5e_get_channels(struct net_device *dev,
···
 int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv,
 			       struct ethtool_channels *ch)
 {
+	struct mlx5e_params *cur_params = &priv->channels.params;
 	unsigned int count = ch->combined_count;
 	struct mlx5e_channels new_channels = {};
 	bool arfs_enabled;
···
 		return -EINVAL;
 	}
 
-	if (priv->channels.params.num_channels == count)
+	if (cur_params->num_channels == count)
 		return 0;
 
 	mutex_lock(&priv->state_lock);
+
+	/* Don't allow changing the number of channels if there is an active
+	 * XSK, because the numeration of the XSK and regular RQs will change.
+	 */
+	if (priv->xsk.refcnt) {
+		err = -EINVAL;
+		netdev_err(priv->netdev, "%s: AF_XDP is active, cannot change the number of channels\n",
+			   __func__);
+		goto out;
+	}
 
 	new_channels.params = priv->channels.params;
 	new_channels.params.num_channels = count;
 
 	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
-		priv->channels.params = new_channels.params;
+		*cur_params = new_channels.params;
 		if (!netif_is_rxfh_configured(priv->netdev))
 			mlx5e_build_default_indir_rqt(priv->rss_params.indirection_rqt,
 						      MLX5E_INDIR_RQT_SIZE, count);
drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c (+14 -4)
···
 
 #include <linux/mlx5/fs.h>
 #include "en.h"
+#include "en/params.h"
+#include "en/xsk/umem.h"
 
 struct mlx5e_ethtool_rule {
 	struct list_head list;
···
 	if (fs->ring_cookie == RX_CLS_FLOW_DISC) {
 		flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
 	} else {
+		struct mlx5e_params *params = &priv->channels.params;
+		enum mlx5e_rq_group group;
+		struct mlx5e_tir *tir;
+		u16 ix;
+
+		mlx5e_qid_get_ch_and_group(params, fs->ring_cookie, &ix, &group);
+		tir = group == MLX5E_RQ_GROUP_XSK ? priv->xsk_tir : priv->direct_tir;
+
 		dst = kzalloc(sizeof(*dst), GFP_KERNEL);
 		if (!dst) {
 			err = -ENOMEM;
···
 		}
 
 		dst->type = MLX5_FLOW_DESTINATION_TYPE_TIR;
-		dst->tir_num = priv->direct_tir[fs->ring_cookie].tirn;
+		dst->tir_num = tir[ix].tirn;
 		flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 	}
 
···
 	if (fs->location >= MAX_NUM_OF_ETHTOOL_RULES)
 		return -ENOSPC;
 
-	if (fs->ring_cookie >= priv->channels.params.num_channels &&
-	    fs->ring_cookie != RX_CLS_FLOW_DISC)
-		return -EINVAL;
+	if (fs->ring_cookie != RX_CLS_FLOW_DISC)
+		if (!mlx5e_qid_validate(&priv->channels.params, fs->ring_cookie))
+			return -EINVAL;
 
 	switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
 	case ETHER_FLOW:
drivers/net/ethernet/mellanox/mlx5/core/en_main.c (+498 -282)
··· 38 38 #include <linux/bpf.h> 39 39 #include <linux/if_bridge.h> 40 40 #include <net/page_pool.h> 41 + #include <net/xdp_sock.h> 41 42 #include "eswitch.h" 42 43 #include "en.h" 43 44 #include "en_tc.h" ··· 57 56 #include "en/monitor_stats.h" 58 57 #include "en/reporter.h" 59 58 #include "en/params.h" 60 - 61 - struct mlx5e_rq_param { 62 - u32 rqc[MLX5_ST_SZ_DW(rqc)]; 63 - struct mlx5_wq_param wq; 64 - struct mlx5e_rq_frags_info frags_info; 65 - }; 66 - 67 - struct mlx5e_sq_param { 68 - u32 sqc[MLX5_ST_SZ_DW(sqc)]; 69 - struct mlx5_wq_param wq; 70 - bool is_mpw; 71 - }; 72 - 73 - struct mlx5e_cq_param { 74 - u32 cqc[MLX5_ST_SZ_DW(cqc)]; 75 - struct mlx5_wq_param wq; 76 - u16 eq_ix; 77 - u8 cq_period_mode; 78 - }; 79 - 80 - struct mlx5e_channel_param { 81 - struct mlx5e_rq_param rq; 82 - struct mlx5e_sq_param sq; 83 - struct mlx5e_sq_param xdp_sq; 84 - struct mlx5e_sq_param icosq; 85 - struct mlx5e_cq_param rx_cq; 86 - struct mlx5e_cq_param tx_cq; 87 - struct mlx5e_cq_param icosq_cq; 88 - }; 59 + #include "en/xsk/umem.h" 60 + #include "en/xsk/setup.h" 61 + #include "en/xsk/rx.h" 62 + #include "en/xsk/tx.h" 89 63 90 64 bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev) 91 65 { ··· 90 114 mlx5_core_info(mdev, "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n", 91 115 params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ, 92 116 params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ ? 
93 - BIT(mlx5e_mpwqe_get_log_rq_size(params)) : 117 + BIT(mlx5e_mpwqe_get_log_rq_size(params, NULL)) : 94 118 BIT(params->log_rq_mtu_frames), 95 - BIT(mlx5e_mpwqe_get_log_stride_size(mdev, params)), 119 + BIT(mlx5e_mpwqe_get_log_stride_size(mdev, params, NULL)), 96 120 MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS)); 97 121 } 98 122 99 123 bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev, 100 124 struct mlx5e_params *params) 101 125 { 102 - return mlx5e_check_fragmented_striding_rq_cap(mdev) && 103 - !MLX5_IPSEC_DEV(mdev) && 104 - !(params->xdp_prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params)); 126 + if (!mlx5e_check_fragmented_striding_rq_cap(mdev)) 127 + return false; 128 + 129 + if (MLX5_IPSEC_DEV(mdev)) 130 + return false; 131 + 132 + if (params->xdp_prog) { 133 + /* XSK params are not considered here. If striding RQ is in use, 134 + * and an XSK is being opened, mlx5e_rx_mpwqe_is_linear_skb will 135 + * be called with the known XSK params. 136 + */ 137 + if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL)) 138 + return false; 139 + } 140 + 141 + return true; 105 142 } 106 143 107 144 void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params) ··· 383 394 384 395 static int mlx5e_alloc_rq(struct mlx5e_channel *c, 385 396 struct mlx5e_params *params, 397 + struct mlx5e_xsk_param *xsk, 398 + struct xdp_umem *umem, 386 399 struct mlx5e_rq_param *rqp, 387 400 struct mlx5e_rq *rq) 388 401 { ··· 392 401 struct mlx5_core_dev *mdev = c->mdev; 393 402 void *rqc = rqp->rqc; 394 403 void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq); 404 + u32 num_xsk_frames = 0; 405 + u32 rq_xdp_ix; 395 406 u32 pool_size; 396 407 int wq_sz; 397 408 int err; ··· 410 417 rq->ix = c->ix; 411 418 rq->mdev = mdev; 412 419 rq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu); 413 - rq->stats = &c->priv->channel_stats[c->ix].rq; 420 + rq->xdpsq = &c->rq_xdpsq; 421 + rq->umem = umem; 422 + 423 + if (rq->umem) 424 + rq->stats = &c->priv->channel_stats[c->ix].xskrq; 
425 + else 426 + rq->stats = &c->priv->channel_stats[c->ix].rq; 414 427 415 428 rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL; 416 429 if (IS_ERR(rq->xdp_prog)) { ··· 425 426 goto err_rq_wq_destroy; 426 427 } 427 428 428 - err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix); 429 + rq_xdp_ix = rq->ix; 430 + if (xsk) 431 + rq_xdp_ix += params->num_channels * MLX5E_RQ_GROUP_XSK; 432 + err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix); 429 433 if (err < 0) 430 434 goto err_rq_wq_destroy; 431 435 432 436 rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE; 433 - rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params); 437 + rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk); 438 + rq->buff.umem_headroom = xsk ? xsk->headroom : 0; 434 439 pool_size = 1 << params->log_rq_mtu_frames; 435 440 436 441 switch (rq->wq_type) { ··· 448 445 449 446 wq_sz = mlx5_wq_ll_get_size(&rq->mpwqe.wq); 450 447 451 - pool_size = MLX5_MPWRQ_PAGES_PER_WQE << mlx5e_mpwqe_get_log_rq_size(params); 448 + if (xsk) 449 + num_xsk_frames = wq_sz << 450 + mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk); 451 + 452 + pool_size = MLX5_MPWRQ_PAGES_PER_WQE << 453 + mlx5e_mpwqe_get_log_rq_size(params, xsk); 452 454 453 455 rq->post_wqes = mlx5e_post_rx_mpwqes; 454 456 rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe; ··· 472 464 goto err_rq_wq_destroy; 473 465 } 474 466 475 - rq->mpwqe.skb_from_cqe_mpwrq = 476 - mlx5e_rx_mpwqe_is_linear_skb(mdev, params) ? 477 - mlx5e_skb_from_cqe_mpwrq_linear : 478 - mlx5e_skb_from_cqe_mpwrq_nonlinear; 479 - rq->mpwqe.log_stride_sz = mlx5e_mpwqe_get_log_stride_size(mdev, params); 480 - rq->mpwqe.num_strides = BIT(mlx5e_mpwqe_get_log_num_strides(mdev, params)); 467 + rq->mpwqe.skb_from_cqe_mpwrq = xsk ? 468 + mlx5e_xsk_skb_from_cqe_mpwrq_linear : 469 + mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL) ? 
470 + mlx5e_skb_from_cqe_mpwrq_linear : 471 + mlx5e_skb_from_cqe_mpwrq_nonlinear; 472 + 473 + rq->mpwqe.log_stride_sz = mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk); 474 + rq->mpwqe.num_strides = 475 + BIT(mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk)); 481 476 482 477 err = mlx5e_create_rq_umr_mkey(mdev, rq); 483 478 if (err) ··· 501 490 502 491 wq_sz = mlx5_wq_cyc_get_size(&rq->wqe.wq); 503 492 493 + if (xsk) 494 + num_xsk_frames = wq_sz << rq->wqe.info.log_num_frags; 495 + 504 496 rq->wqe.info = rqp->frags_info; 505 497 rq->wqe.frags = 506 498 kvzalloc_node(array_size(sizeof(*rq->wqe.frags), ··· 517 503 err = mlx5e_init_di_list(rq, wq_sz, c->cpu); 518 504 if (err) 519 505 goto err_free; 506 + 520 507 rq->post_wqes = mlx5e_post_rx_wqes; 521 508 rq->dealloc_wqe = mlx5e_dealloc_rx_wqe; 522 509 ··· 533 518 goto err_free; 534 519 } 535 520 536 - rq->wqe.skb_from_cqe = mlx5e_rx_is_linear_skb(params) ? 537 - mlx5e_skb_from_cqe_linear : 538 - mlx5e_skb_from_cqe_nonlinear; 521 + rq->wqe.skb_from_cqe = xsk ? 522 + mlx5e_xsk_skb_from_cqe_linear : 523 + mlx5e_rx_is_linear_skb(params, NULL) ? 524 + mlx5e_skb_from_cqe_linear : 525 + mlx5e_skb_from_cqe_nonlinear; 539 526 rq->mkey_be = c->mkey_be; 540 527 } 541 528 542 - /* Create a page_pool and register it with rxq */ 543 - pp_params.order = 0; 544 - pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ 545 - pp_params.pool_size = pool_size; 546 - pp_params.nid = cpu_to_node(c->cpu); 547 - pp_params.dev = c->pdev; 548 - pp_params.dma_dir = rq->buff.map_dir; 529 + if (xsk) { 530 + err = mlx5e_xsk_resize_reuseq(umem, num_xsk_frames); 531 + if (unlikely(err)) { 532 + mlx5_core_err(mdev, "Unable to allocate the Reuse Ring for %u frames\n", 533 + num_xsk_frames); 534 + goto err_free; 535 + } 549 536 550 - /* page_pool can be used even when there is no rq->xdp_prog, 551 - * given page_pool does not handle DMA mapping there is no 552 - * required state to clear. 
And page_pool gracefully handle 553 - * elevated refcnt. 554 - */ 555 - rq->page_pool = page_pool_create(&pp_params); 556 - if (IS_ERR(rq->page_pool)) { 557 - err = PTR_ERR(rq->page_pool); 558 - rq->page_pool = NULL; 559 - goto err_free; 537 + rq->zca.free = mlx5e_xsk_zca_free; 538 + err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, 539 + MEM_TYPE_ZERO_COPY, 540 + &rq->zca); 541 + } else { 542 + /* Create a page_pool and register it with rxq */ 543 + pp_params.order = 0; 544 + pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ 545 + pp_params.pool_size = pool_size; 546 + pp_params.nid = cpu_to_node(c->cpu); 547 + pp_params.dev = c->pdev; 548 + pp_params.dma_dir = rq->buff.map_dir; 549 + 550 + /* page_pool can be used even when there is no rq->xdp_prog, 551 + * given page_pool does not handle DMA mapping there is no 552 + * required state to clear. And page_pool gracefully handle 553 + * elevated refcnt. 554 + */ 555 + rq->page_pool = page_pool_create(&pp_params); 556 + if (IS_ERR(rq->page_pool)) { 557 + err = PTR_ERR(rq->page_pool); 558 + rq->page_pool = NULL; 559 + goto err_free; 560 + } 561 + err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, 562 + MEM_TYPE_PAGE_POOL, rq->page_pool); 563 + if (err) 564 + page_pool_free(rq->page_pool); 560 565 } 561 - err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, 562 - MEM_TYPE_PAGE_POOL, rq->page_pool); 563 - if (err) { 564 - page_pool_free(rq->page_pool); 566 + if (err) 565 567 goto err_free; 566 - } 567 568 568 569 for (i = 0; i < wq_sz; i++) { 569 570 if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) { ··· 670 639 i = (i + 1) & (MLX5E_CACHE_SIZE - 1)) { 671 640 struct mlx5e_dma_info *dma_info = &rq->page_cache.page_cache[i]; 672 641 673 - mlx5e_page_release(rq, dma_info, false); 642 + /* With AF_XDP, page_cache is not used, so this loop is not 643 + * entered, and it's safe to call mlx5e_page_release_dynamic 644 + * directly. 
645 + */ 646 + mlx5e_page_release_dynamic(rq, dma_info, false); 674 647 } 675 648 676 649 xdp_rxq_info_unreg(&rq->xdp_rxq); ··· 811 776 mlx5_core_destroy_rq(rq->mdev, rq->rqn); 812 777 } 813 778 814 - static int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time) 779 + int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time) 815 780 { 816 781 unsigned long exp_time = jiffies + msecs_to_jiffies(wait_time); 817 782 struct mlx5e_channel *c = rq->channel; ··· 869 834 870 835 } 871 836 872 - static int mlx5e_open_rq(struct mlx5e_channel *c, 873 - struct mlx5e_params *params, 874 - struct mlx5e_rq_param *param, 875 - struct mlx5e_rq *rq) 837 + int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_params *params, 838 + struct mlx5e_rq_param *param, struct mlx5e_xsk_param *xsk, 839 + struct xdp_umem *umem, struct mlx5e_rq *rq) 876 840 { 877 841 int err; 878 842 879 - err = mlx5e_alloc_rq(c, params, param, rq); 843 + err = mlx5e_alloc_rq(c, params, xsk, umem, param, rq); 880 844 if (err) 881 845 return err; 882 846 ··· 913 879 mlx5e_trigger_irq(&rq->channel->icosq); 914 880 } 915 881 916 - static void mlx5e_deactivate_rq(struct mlx5e_rq *rq) 882 + void mlx5e_deactivate_rq(struct mlx5e_rq *rq) 917 883 { 918 884 clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state); 919 885 napi_synchronize(&rq->channel->napi); /* prevent mlx5e_post_rx_wqes */ 920 886 } 921 887 922 - static void mlx5e_close_rq(struct mlx5e_rq *rq) 888 + void mlx5e_close_rq(struct mlx5e_rq *rq) 923 889 { 924 890 cancel_work_sync(&rq->dim.work); 925 891 mlx5e_destroy_rq(rq); ··· 972 938 973 939 static int mlx5e_alloc_xdpsq(struct mlx5e_channel *c, 974 940 struct mlx5e_params *params, 941 + struct xdp_umem *umem, 975 942 struct mlx5e_sq_param *param, 976 943 struct mlx5e_xdpsq *sq, 977 944 bool is_redirect) ··· 988 953 sq->uar_map = mdev->mlx5e_res.bfreg.map; 989 954 sq->min_inline_mode = params->tx_min_inline_mode; 990 955 sq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu); 991 - sq->stats = 
is_redirect ? 992 - &c->priv->channel_stats[c->ix].xdpsq : 993 - &c->priv->channel_stats[c->ix].rq_xdpsq; 956 + sq->umem = umem; 957 + 958 + sq->stats = sq->umem ? 959 + &c->priv->channel_stats[c->ix].xsksq : 960 + is_redirect ? 961 + &c->priv->channel_stats[c->ix].xdpsq : 962 + &c->priv->channel_stats[c->ix].rq_xdpsq; 994 963 995 964 param->wq.db_numa_node = cpu_to_node(c->cpu); 996 965 err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, wq, &sq->wq_ctrl); ··· 1374 1335 mlx5e_tx_reporter_err_cqe(sq); 1375 1336 } 1376 1337 1377 - static int mlx5e_open_icosq(struct mlx5e_channel *c, 1378 - struct mlx5e_params *params, 1379 - struct mlx5e_sq_param *param, 1380 - struct mlx5e_icosq *sq) 1338 + int mlx5e_open_icosq(struct mlx5e_channel *c, struct mlx5e_params *params, 1339 + struct mlx5e_sq_param *param, struct mlx5e_icosq *sq) 1381 1340 { 1382 1341 struct mlx5e_create_sq_param csp = {}; 1383 1342 int err; ··· 1401 1364 return err; 1402 1365 } 1403 1366 1404 - static void mlx5e_close_icosq(struct mlx5e_icosq *sq) 1367 + void mlx5e_close_icosq(struct mlx5e_icosq *sq) 1405 1368 { 1406 1369 struct mlx5e_channel *c = sq->channel; 1407 1370 ··· 1412 1375 mlx5e_free_icosq(sq); 1413 1376 } 1414 1377 1415 - static int mlx5e_open_xdpsq(struct mlx5e_channel *c, 1416 - struct mlx5e_params *params, 1417 - struct mlx5e_sq_param *param, 1418 - struct mlx5e_xdpsq *sq, 1419 - bool is_redirect) 1378 + int mlx5e_open_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params, 1379 + struct mlx5e_sq_param *param, struct xdp_umem *umem, 1380 + struct mlx5e_xdpsq *sq, bool is_redirect) 1420 1381 { 1421 1382 struct mlx5e_create_sq_param csp = {}; 1422 1383 int err; 1423 1384 1424 - err = mlx5e_alloc_xdpsq(c, params, param, sq, is_redirect); 1385 + err = mlx5e_alloc_xdpsq(c, params, umem, param, sq, is_redirect); 1425 1386 if (err) 1426 1387 return err; 1427 1388 ··· 1473 1438 return err; 1474 1439 } 1475 1440 1476 - static void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq) 
1441 + void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq) 1477 1442 { 1478 1443 struct mlx5e_channel *c = sq->channel; 1479 1444 ··· 1481 1446 napi_synchronize(&c->napi); 1482 1447 1483 1448 mlx5e_destroy_sq(c->mdev, sq->sqn); 1484 - mlx5e_free_xdpsq_descs(sq, rq); 1449 + mlx5e_free_xdpsq_descs(sq); 1485 1450 mlx5e_free_xdpsq(sq); 1486 1451 } 1487 1452 ··· 1602 1567 mlx5_core_destroy_cq(cq->mdev, &cq->mcq); 1603 1568 } 1604 1569 1605 - static int mlx5e_open_cq(struct mlx5e_channel *c, 1606 - struct dim_cq_moder moder, 1607 - struct mlx5e_cq_param *param, 1608 - struct mlx5e_cq *cq) 1570 + int mlx5e_open_cq(struct mlx5e_channel *c, struct dim_cq_moder moder, 1571 + struct mlx5e_cq_param *param, struct mlx5e_cq *cq) 1609 1572 { 1610 1573 struct mlx5_core_dev *mdev = c->mdev; 1611 1574 int err; ··· 1626 1593 return err; 1627 1594 } 1628 1595 1629 - static void mlx5e_close_cq(struct mlx5e_cq *cq) 1596 + void mlx5e_close_cq(struct mlx5e_cq *cq) 1630 1597 { 1631 1598 mlx5e_destroy_cq(cq); 1632 1599 mlx5e_free_cq(cq); ··· 1800 1767 free_cpumask_var(c->xps_cpumask); 1801 1768 } 1802 1769 1770 + static int mlx5e_open_queues(struct mlx5e_channel *c, 1771 + struct mlx5e_params *params, 1772 + struct mlx5e_channel_param *cparam) 1773 + { 1774 + struct dim_cq_moder icocq_moder = {0, 0}; 1775 + int err; 1776 + 1777 + err = mlx5e_open_cq(c, icocq_moder, &cparam->icosq_cq, &c->icosq.cq); 1778 + if (err) 1779 + return err; 1780 + 1781 + err = mlx5e_open_tx_cqs(c, params, cparam); 1782 + if (err) 1783 + goto err_close_icosq_cq; 1784 + 1785 + err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam->tx_cq, &c->xdpsq.cq); 1786 + if (err) 1787 + goto err_close_tx_cqs; 1788 + 1789 + err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam->rx_cq, &c->rq.cq); 1790 + if (err) 1791 + goto err_close_xdp_tx_cqs; 1792 + 1793 + /* XDP SQ CQ params are same as normal TXQ sq CQ params */ 1794 + err = c->xdp ? 
mlx5e_open_cq(c, params->tx_cq_moderation,
+				  &cparam->tx_cq, &c->rq_xdpsq.cq) : 0;
+	if (err)
+		goto err_close_rx_cq;
+
+	napi_enable(&c->napi);
+
+	err = mlx5e_open_icosq(c, params, &cparam->icosq, &c->icosq);
+	if (err)
+		goto err_disable_napi;
+
+	err = mlx5e_open_sqs(c, params, cparam);
+	if (err)
+		goto err_close_icosq;
+
+	if (c->xdp) {
+		err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, NULL,
+				       &c->rq_xdpsq, false);
+		if (err)
+			goto err_close_sqs;
+	}
+
+	err = mlx5e_open_rq(c, params, &cparam->rq, NULL, NULL, &c->rq);
+	if (err)
+		goto err_close_xdp_sq;
+
+	err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, NULL, &c->xdpsq, true);
+	if (err)
+		goto err_close_rq;
+
+	return 0;
+
+err_close_rq:
+	mlx5e_close_rq(&c->rq);
+
+err_close_xdp_sq:
+	if (c->xdp)
+		mlx5e_close_xdpsq(&c->rq_xdpsq);
+
+err_close_sqs:
+	mlx5e_close_sqs(c);
+
+err_close_icosq:
+	mlx5e_close_icosq(&c->icosq);
+
+err_disable_napi:
+	napi_disable(&c->napi);
+
+	if (c->xdp)
+		mlx5e_close_cq(&c->rq_xdpsq.cq);
+
+err_close_rx_cq:
+	mlx5e_close_cq(&c->rq.cq);
+
+err_close_xdp_tx_cqs:
+	mlx5e_close_cq(&c->xdpsq.cq);
+
+err_close_tx_cqs:
+	mlx5e_close_tx_cqs(c);
+
+err_close_icosq_cq:
+	mlx5e_close_cq(&c->icosq.cq);
+
+	return err;
+}
+
+static void mlx5e_close_queues(struct mlx5e_channel *c)
+{
+	mlx5e_close_xdpsq(&c->xdpsq);
+	mlx5e_close_rq(&c->rq);
+	if (c->xdp)
+		mlx5e_close_xdpsq(&c->rq_xdpsq);
+	mlx5e_close_sqs(c);
+	mlx5e_close_icosq(&c->icosq);
+	napi_disable(&c->napi);
+	if (c->xdp)
+		mlx5e_close_cq(&c->rq_xdpsq.cq);
+	mlx5e_close_cq(&c->rq.cq);
+	mlx5e_close_cq(&c->xdpsq.cq);
+	mlx5e_close_tx_cqs(c);
+	mlx5e_close_cq(&c->icosq.cq);
+}
+
 static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			      struct mlx5e_params *params,
 			      struct mlx5e_channel_param *cparam,
+			      struct xdp_umem *umem,
 			      struct mlx5e_channel **cp)
 {
 	int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
-	struct dim_cq_moder icocq_moder = {0, 0};
 	struct net_device *netdev = priv->netdev;
+	struct mlx5e_xsk_param xsk;
 	struct mlx5e_channel *c;
 	unsigned int irq;
 	int err;
···
 	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
 
-	err = mlx5e_open_cq(c, icocq_moder, &cparam->icosq_cq, &c->icosq.cq);
-	if (err)
+	err = mlx5e_open_queues(c, params, cparam);
+	if (unlikely(err))
 		goto err_napi_del;
 
-	err = mlx5e_open_tx_cqs(c, params, cparam);
-	if (err)
-		goto err_close_icosq_cq;
-
-	err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam->tx_cq, &c->xdpsq.cq);
-	if (err)
-		goto err_close_tx_cqs;
-
-	err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam->rx_cq, &c->rq.cq);
-	if (err)
-		goto err_close_xdp_tx_cqs;
-
-	/* XDP SQ CQ params are same as normal TXQ sq CQ params */
-	err = c->xdp ? mlx5e_open_cq(c, params->tx_cq_moderation,
-				     &cparam->tx_cq, &c->rq.xdpsq.cq) : 0;
-	if (err)
-		goto err_close_rx_cq;
-
-	napi_enable(&c->napi);
-
-	err = mlx5e_open_icosq(c, params, &cparam->icosq, &c->icosq);
-	if (err)
-		goto err_disable_napi;
-
-	err = mlx5e_open_sqs(c, params, cparam);
-	if (err)
-		goto err_close_icosq;
-
-	err = c->xdp ?
-		mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, &c->rq.xdpsq, false) : 0;
-	if (err)
-		goto err_close_sqs;
-
-	err = mlx5e_open_rq(c, params, &cparam->rq, &c->rq);
-	if (err)
-		goto err_close_xdp_sq;
-
-	err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, &c->xdpsq, true);
-	if (err)
-		goto err_close_rq;
+	if (umem) {
+		mlx5e_build_xsk_param(umem, &xsk);
+		err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+		if (unlikely(err))
+			goto err_close_queues;
+	}
 
 	*cp = c;
 
 	return 0;
 
-err_close_rq:
-	mlx5e_close_rq(&c->rq);
-
-err_close_xdp_sq:
-	if (c->xdp)
-		mlx5e_close_xdpsq(&c->rq.xdpsq, &c->rq);
-
-err_close_sqs:
-	mlx5e_close_sqs(c);
-
-err_close_icosq:
-	mlx5e_close_icosq(&c->icosq);
-
-err_disable_napi:
-	napi_disable(&c->napi);
-	if (c->xdp)
-		mlx5e_close_cq(&c->rq.xdpsq.cq);
-
-err_close_rx_cq:
-	mlx5e_close_cq(&c->rq.cq);
-
-err_close_xdp_tx_cqs:
-	mlx5e_close_cq(&c->xdpsq.cq);
-
-err_close_tx_cqs:
-	mlx5e_close_tx_cqs(c);
-
-err_close_icosq_cq:
-	mlx5e_close_cq(&c->icosq.cq);
+err_close_queues:
+	mlx5e_close_queues(c);
 
 err_napi_del:
 	netif_napi_del(&c->napi);
···
 	mlx5e_activate_txqsq(&c->sq[tc]);
 	mlx5e_activate_rq(&c->rq);
 	netif_set_xps_queue(c->netdev, c->xps_cpumask, c->ix);
+
+	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
+		mlx5e_activate_xsk(c);
 }
 
 static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
 {
 	int tc;
+
+	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
+		mlx5e_deactivate_xsk(c);
 
 	mlx5e_deactivate_rq(&c->rq);
 	for (tc = 0; tc < c->num_tc; tc++)
···
 
 static void
 mlx5e_close_channel(struct mlx5e_channel *c)
 {
-	mlx5e_close_xdpsq(&c->xdpsq, NULL);
-	mlx5e_close_rq(&c->rq);
-	if (c->xdp)
-		mlx5e_close_xdpsq(&c->rq.xdpsq, &c->rq);
-	mlx5e_close_sqs(c);
-	mlx5e_close_icosq(&c->icosq);
-	napi_disable(&c->napi);
-	if (c->xdp)
-		mlx5e_close_cq(&c->rq.xdpsq.cq);
-	mlx5e_close_cq(&c->rq.cq);
-	mlx5e_close_cq(&c->xdpsq.cq);
-	mlx5e_close_tx_cqs(c);
-	mlx5e_close_cq(&c->icosq.cq);
+	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
+		mlx5e_close_xsk(c);
+	mlx5e_close_queues(c);
 	netif_napi_del(&c->napi);
 	mlx5e_free_xps_cpumask(c);
 
···
 
 static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 				      struct mlx5e_params *params,
+				      struct mlx5e_xsk_param *xsk,
 				      struct mlx5e_rq_frags_info *info)
 {
 	u32 byte_count = MLX5E_SW2HW_MTU(params, params->sw_mtu);
···
 	byte_count += MLX5E_METADATA_ETHER_LEN;
 #endif
 
-	if (mlx5e_rx_is_linear_skb(params)) {
+	if (mlx5e_rx_is_linear_skb(params, xsk)) {
 		int frag_stride;
 
-		frag_stride = mlx5e_rx_get_linear_frag_sz(params);
+		frag_stride = mlx5e_rx_get_linear_frag_sz(params, xsk);
 		frag_stride = roundup_pow_of_two(frag_stride);
 
 		info->arr[0].frag_size = byte_count;
···
 	return MLX5_GET(wq, wq, log_wq_sz);
 }
 
-static void mlx5e_build_rq_param(struct mlx5e_priv *priv,
-				 struct mlx5e_params *params,
-				 struct mlx5e_rq_param *param)
+void mlx5e_build_rq_param(struct mlx5e_priv *priv,
+			  struct mlx5e_params *params,
+			  struct mlx5e_xsk_param *xsk,
+			  struct mlx5e_rq_param *param)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
 	void *rqc = param->rqc;
···
 	switch (params->rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		MLX5_SET(wq, wq, log_wqe_num_of_strides,
-			 mlx5e_mpwqe_get_log_num_strides(mdev, params) -
+			 mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk) -
 			 MLX5_MPWQE_LOG_NUM_STRIDES_BASE);
 		MLX5_SET(wq, wq, log_wqe_stride_size,
-			 mlx5e_mpwqe_get_log_stride_size(mdev, params) -
+			 mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk) -
 			 MLX5_MPWQE_LOG_STRIDE_SZ_BASE);
-		MLX5_SET(wq, wq, log_wq_sz, mlx5e_mpwqe_get_log_rq_size(params));
+		MLX5_SET(wq, wq, log_wq_sz, mlx5e_mpwqe_get_log_rq_size(params, xsk));
 		break;
 	default: /* MLX5_WQ_TYPE_CYCLIC */
 		MLX5_SET(wq, wq, log_wq_sz, params->log_rq_mtu_frames);
-		mlx5e_build_rq_frags_info(mdev, params, &param->frags_info);
+		mlx5e_build_rq_frags_info(mdev, params, xsk, &param->frags_info);
 		ndsegs = param->frags_info.num_frags;
 	}
 
···
 	param->wq.buf_numa_node = dev_to_node(mdev->device);
 }
 
-static void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
-					struct mlx5e_sq_param *param)
+void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
+				 struct mlx5e_sq_param *param)
 {
 	void *sqc = param->sqc;
 	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
···
 	MLX5_SET(cqc, cqc, cqe_sz, CQE_STRIDE_128_PAD);
 }
 
-static void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
-				    struct mlx5e_params *params,
-				    struct mlx5e_cq_param *param)
+void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_xsk_param *xsk,
+			     struct mlx5e_cq_param *param)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
 	void *cqc = param->cqc;
···
 
 	switch (params->rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-		log_cq_size = mlx5e_mpwqe_get_log_rq_size(params) +
-			mlx5e_mpwqe_get_log_num_strides(mdev, params);
+		log_cq_size = mlx5e_mpwqe_get_log_rq_size(params, xsk) +
+			mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk);
 		break;
 	default: /* MLX5_WQ_TYPE_CYCLIC */
 		log_cq_size = params->log_rq_mtu_frames;
···
 	param->cq_period_mode = params->rx_cq_moderation.cq_period_mode;
 }
 
-static void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
-				    struct mlx5e_params *params,
-				    struct mlx5e_cq_param *param)
+void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_cq_param *param)
 {
 	void *cqc = param->cqc;
 
···
 	param->cq_period_mode = params->tx_cq_moderation.cq_period_mode;
 }
 
-static void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
-				     u8 log_wq_size,
-				     struct mlx5e_cq_param *param)
+void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
+			      u8 log_wq_size,
+			      struct mlx5e_cq_param *param)
 {
 	void *cqc = param->cqc;
 
···
 	param->cq_period_mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
 }
 
-static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
-				    u8 log_wq_size,
-				    struct mlx5e_sq_param *param)
+void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
+			     u8 log_wq_size,
+			     struct mlx5e_sq_param *param)
 {
 	void *sqc = param->sqc;
 	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
···
 	MLX5_SET(sqc, sqc, reg_umr, MLX5_CAP_ETH(priv->mdev, reg_umr_sq));
 }
 
-static void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
-				    struct mlx5e_params *params,
-				    struct mlx5e_sq_param *param)
+void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_sq_param *param)
 {
 	void *sqc = param->sqc;
 	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
···
 {
 	u8 icosq_log_wq_sz;
 
-	mlx5e_build_rq_param(priv, params, &cparam->rq);
+	mlx5e_build_rq_param(priv, params, NULL, &cparam->rq);
 
 	icosq_log_wq_sz = mlx5e_build_icosq_log_wq_sz(params, &cparam->rq);
 
 	mlx5e_build_sq_param(priv, params, &cparam->sq);
 	mlx5e_build_xdpsq_param(priv, params, &cparam->xdp_sq);
 	mlx5e_build_icosq_param(priv, icosq_log_wq_sz, &cparam->icosq);
-	mlx5e_build_rx_cq_param(priv, params, &cparam->rx_cq);
+	mlx5e_build_rx_cq_param(priv, params, NULL, &cparam->rx_cq);
 	mlx5e_build_tx_cq_param(priv, params, &cparam->tx_cq);
 	mlx5e_build_ico_cq_param(priv, icosq_log_wq_sz, &cparam->icosq_cq);
 }
···
 
 	mlx5e_build_channel_param(priv, &chs->params, cparam);
 	for (i = 0; i < chs->num; i++) {
-		err = mlx5e_open_channel(priv, i, &chs->params, cparam, &chs->c[i]);
+		struct xdp_umem *umem = NULL;
+
+		if (chs->params.xdp_prog)
+			umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, i);
+
+		err = mlx5e_open_channel(priv, i, &chs->params, cparam, umem, &chs->c[i]);
 		if (err)
 			goto err_close_channels;
 	}
···
 		int timeout = err ? 0 : MLX5E_RQ_WQES_TIMEOUT;
 
 		err |= mlx5e_wait_for_min_rx_wqes(&chs->c[i]->rq, timeout);
+
+		/* Don't wait on the XSK RQ, because the newer xdpsock sample
+		 * doesn't provide any Fill Ring entries at the setup stage.
+		 */
 	}
 
 	return err ? -ETIMEDOUT : 0;
···
 	return err;
 }
 
-int mlx5e_create_direct_rqts(struct mlx5e_priv *priv)
+int mlx5e_create_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
 {
-	struct mlx5e_rqt *rqt;
+	const int max_nch = mlx5e_get_netdev_max_channels(priv->netdev);
 	int err;
 	int ix;
 
-	for (ix = 0; ix < mlx5e_get_netdev_max_channels(priv->netdev); ix++) {
-		rqt = &priv->direct_tir[ix].rqt;
-		err = mlx5e_create_rqt(priv, 1 /*size */, rqt);
-		if (err)
+	for (ix = 0; ix < max_nch; ix++) {
+		err = mlx5e_create_rqt(priv, 1 /*size */, &tirs[ix].rqt);
+		if (unlikely(err))
 			goto err_destroy_rqts;
 	}
 
 	return 0;
 
 err_destroy_rqts:
-	mlx5_core_warn(priv->mdev, "create direct rqts failed, %d\n", err);
+	mlx5_core_warn(priv->mdev, "create rqts failed, %d\n", err);
 	for (ix--; ix >= 0; ix--)
-		mlx5e_destroy_rqt(priv, &priv->direct_tir[ix].rqt);
+		mlx5e_destroy_rqt(priv, &tirs[ix].rqt);
 
 	return err;
 }
 
-void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv)
+void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
 {
+	const int max_nch = mlx5e_get_netdev_max_channels(priv->netdev);
 	int i;
 
-	for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev); i++)
-		mlx5e_destroy_rqt(priv, &priv->direct_tir[i].rqt);
+	for (i = 0; i < max_nch; i++)
+		mlx5e_destroy_rqt(priv, &tirs[i].rqt);
 }
 
 static int mlx5e_rx_hash_fn(int hfunc)
···
 void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
 {
 	int num_txqs = priv->channels.num * priv->channels.params.num_tc;
+	int num_rxqs = priv->channels.num * MLX5E_NUM_RQ_GROUPS;
 	struct net_device *netdev = priv->netdev;
 
 	mlx5e_netdev_set_tcs(netdev);
 	netif_set_real_num_tx_queues(netdev, num_txqs);
-	netif_set_real_num_rx_queues(netdev, priv->channels.num);
+	netif_set_real_num_rx_queues(netdev, num_rxqs);
 
 	mlx5e_build_tx2sq_maps(priv);
 	mlx5e_activate_channels(&priv->channels);
···
 
 	mlx5e_wait_channels_min_rx_wqes(&priv->channels);
 	mlx5e_redirect_rqts_to_channels(priv, &priv->channels);
+
+	mlx5e_xsk_redirect_rqts_to_channels(priv, &priv->channels);
 }
 
 void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
 {
+	mlx5e_xsk_redirect_rqts_to_drop(priv, &priv->channels);
+
 	mlx5e_redirect_rqts_to_drop(priv);
 
 	if (mlx5e_is_vport_rep(priv))
···
 int mlx5e_open_locked(struct net_device *netdev)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
+	bool is_xdp = priv->channels.params.xdp_prog;
 	int err;
 
 	set_bit(MLX5E_STATE_OPENED, &priv->state);
+	if (is_xdp)
+		mlx5e_xdp_set_open(priv);
 
 	err = mlx5e_open_channels(priv, &priv->channels);
 	if (err)
···
 
 	return 0;
 
 err_clear_state_opened_flag:
+	if (is_xdp)
+		mlx5e_xdp_set_closed(priv);
 	clear_bit(MLX5E_STATE_OPENED, &priv->state);
 	return err;
 }
···
 	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
 		return 0;
 
+	if (priv->channels.params.xdp_prog)
+		mlx5e_xdp_set_closed(priv);
 	clear_bit(MLX5E_STATE_OPENED, &priv->state);
 
 	netif_carrier_off(priv->netdev);
···
 	return err;
 }
 
-int mlx5e_create_direct_tirs(struct mlx5e_priv *priv)
+int mlx5e_create_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
 {
-	int nch = mlx5e_get_netdev_max_channels(priv->netdev);
+	const int max_nch = mlx5e_get_netdev_max_channels(priv->netdev);
 	struct mlx5e_tir *tir;
 	void *tirc;
 	int inlen;
-	int err;
+	int err = 0;
 	u32 *in;
 	int ix;
 
···
 	if (!in)
 		return -ENOMEM;
 
-	for (ix = 0; ix < nch; ix++) {
+	for (ix = 0; ix < max_nch; ix++) {
 		memset(in, 0, inlen);
-		tir = &priv->direct_tir[ix];
+		tir = &tirs[ix];
 		tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
-		mlx5e_build_direct_tir_ctx(priv, priv->direct_tir[ix].rqt.rqtn, tirc);
+		mlx5e_build_direct_tir_ctx(priv, tir->rqt.rqtn, tirc);
 		err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
-		if (err)
+		if (unlikely(err))
 			goto err_destroy_ch_tirs;
 	}
 
-	kvfree(in);
-
-	return 0;
+	goto out;
 
 err_destroy_ch_tirs:
-	mlx5_core_warn(priv->mdev, "create direct tirs failed, %d\n", err);
+	mlx5_core_warn(priv->mdev, "create tirs failed, %d\n", err);
 	for (ix--; ix >= 0; ix--)
-		mlx5e_destroy_tir(priv->mdev, &priv->direct_tir[ix]);
+		mlx5e_destroy_tir(priv->mdev, &tirs[ix]);
 
+out:
 	kvfree(in);
 
 	return err;
···
 		mlx5e_destroy_tir(priv->mdev, &priv->inner_indir_tir[i]);
 }
 
-void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv)
+void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
 {
-	int nch = mlx5e_get_netdev_max_channels(priv->netdev);
+	const int max_nch = mlx5e_get_netdev_max_channels(priv->netdev);
 	int i;
 
-	for (i = 0; i < nch; i++)
-		mlx5e_destroy_tir(priv->mdev, &priv->direct_tir[i]);
+	for (i = 0; i < max_nch; i++)
+		mlx5e_destroy_tir(priv->mdev, &tirs[i]);
 }
 
 static int mlx5e_modify_channels_scatter_fcs(struct mlx5e_channels *chs, bool enable)
···
 
 	for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev); i++) {
 		struct mlx5e_channel_stats *channel_stats = &priv->channel_stats[i];
+		struct mlx5e_rq_stats *xskrq_stats = &channel_stats->xskrq;
 		struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
 		int j;
 
-		s->rx_packets += rq_stats->packets;
-		s->rx_bytes   += rq_stats->bytes;
+		s->rx_packets += rq_stats->packets + xskrq_stats->packets;
+		s->rx_bytes   += rq_stats->bytes + xskrq_stats->bytes;
 
 		for (j = 0; j < priv->max_opened_tc; j++) {
 			struct mlx5e_sq_stats *sq_stats = &channel_stats->sq[j];
···
 
 	mutex_lock(&priv->state_lock);
 
+	if (enable && priv->xsk.refcnt) {
+		netdev_warn(netdev, "LRO is incompatible with AF_XDP (%hu XSKs are active)\n",
+			    priv->xsk.refcnt);
+		err = -EINVAL;
+		goto out;
+	}
+
 	old_params = &priv->channels.params;
 	if (enable && !MLX5E_GET_PFLAG(old_params, MLX5E_PFLAG_RX_STRIDING_RQ)) {
 		netdev_warn(netdev, "can't set LRO with legacy RQ\n");
···
 	new_channels.params.lro_en = enable;
 
 	if (old_params->rq_wq_type != MLX5_WQ_TYPE_CYCLIC) {
-		if (mlx5e_rx_mpwqe_is_linear_skb(mdev, old_params) ==
-		    mlx5e_rx_mpwqe_is_linear_skb(mdev, &new_channels.params))
+		if (mlx5e_rx_mpwqe_is_linear_skb(mdev, old_params, NULL) ==
+		    mlx5e_rx_mpwqe_is_linear_skb(mdev, &new_channels.params, NULL))
 			reset = false;
 	}
 
···
 	return features;
 }
 
+static bool mlx5e_xsk_validate_mtu(struct net_device *netdev,
+				   struct mlx5e_channels *chs,
+				   struct mlx5e_params *new_params,
+				   struct mlx5_core_dev *mdev)
+{
+	u16 ix;
+
+	for (ix = 0; ix < chs->params.num_channels; ix++) {
+		struct xdp_umem *umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, ix);
+		struct mlx5e_xsk_param xsk;
+
+		if (!umem)
+			continue;
+
+		mlx5e_build_xsk_param(umem, &xsk);
+
+		if (!mlx5e_validate_xsk_param(new_params, &xsk, mdev)) {
+			u32 hr = mlx5e_get_linear_rq_headroom(new_params, &xsk);
+			int max_mtu_frame, max_mtu_page, max_mtu;
+
+			/* Two criteria must be met:
+			 * 1. HW MTU + all headrooms <= XSK frame size.
+			 * 2. Size of SKBs allocated on XDP_PASS <= PAGE_SIZE.
+			 */
+			max_mtu_frame = MLX5E_HW2SW_MTU(new_params, xsk.chunk_size - hr);
+			max_mtu_page = mlx5e_xdp_max_mtu(new_params, &xsk);
+			max_mtu = min(max_mtu_frame, max_mtu_page);
+
+			netdev_err(netdev, "MTU %d is too big for an XSK running on channel %hu. Try MTU <= %d\n",
+				   new_params->sw_mtu, ix, max_mtu);
+			return false;
+		}
+	}
+
+	return true;
+}
+
 int mlx5e_change_mtu(struct net_device *netdev, int new_mtu,
 		     change_hw_mtu_cb set_mtu_cb)
 {
···
 	new_channels.params.sw_mtu = new_mtu;
 
 	if (params->xdp_prog &&
-	    !mlx5e_rx_is_linear_skb(&new_channels.params)) {
+	    !mlx5e_rx_is_linear_skb(&new_channels.params, NULL)) {
 		netdev_err(netdev, "MTU(%d) > %d is not allowed while XDP enabled\n",
-			   new_mtu, mlx5e_xdp_max_mtu(params));
+			   new_mtu, mlx5e_xdp_max_mtu(params, NULL));
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (priv->xsk.refcnt &&
+	    !mlx5e_xsk_validate_mtu(netdev, &priv->channels,
+				    &new_channels.params, priv->mdev)) {
 		err = -EINVAL;
 		goto out;
 	}
 
 	if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
-		bool is_linear = mlx5e_rx_mpwqe_is_linear_skb(priv->mdev, &new_channels.params);
-		u8 ppw_old = mlx5e_mpwqe_log_pkts_per_wqe(params);
-		u8 ppw_new = mlx5e_mpwqe_log_pkts_per_wqe(&new_channels.params);
+		bool is_linear = mlx5e_rx_mpwqe_is_linear_skb(priv->mdev,
+							      &new_channels.params,
+							      NULL);
+		u8 ppw_old = mlx5e_mpwqe_log_pkts_per_wqe(params, NULL);
+		u8 ppw_new = mlx5e_mpwqe_log_pkts_per_wqe(&new_channels.params, NULL);
 
+		/* If XSK is active, XSK RQs are linear. */
+		is_linear |= priv->xsk.refcnt;
+
+		/* Always reset in linear mode - hw_mtu is used in data path. */
 		reset = reset && (is_linear || (ppw_old != ppw_new));
 	}
 
···
 	new_channels.params = priv->channels.params;
 	new_channels.params.xdp_prog = prog;
 
-	if (!mlx5e_rx_is_linear_skb(&new_channels.params)) {
+	/* No XSK params: AF_XDP can't be enabled yet at the point of setting
+	 * the XDP program.
+	 */
+	if (!mlx5e_rx_is_linear_skb(&new_channels.params, NULL)) {
 		netdev_warn(netdev, "XDP is not allowed with MTU(%d) > %d\n",
 			    new_channels.params.sw_mtu,
-			    mlx5e_xdp_max_mtu(&new_channels.params));
+			    mlx5e_xdp_max_mtu(&new_channels.params, NULL));
 		return -EINVAL;
 	}
+
+	return 0;
+}
+
+static int mlx5e_xdp_update_state(struct mlx5e_priv *priv)
+{
+	if (priv->channels.params.xdp_prog)
+		mlx5e_xdp_set_open(priv);
+	else
+		mlx5e_xdp_set_closed(priv);
 
 	return 0;
 }
···
 	/* no need for full reset when exchanging programs */
 	reset = (!priv->channels.params.xdp_prog || !prog);
 
-	if (was_opened && reset)
-		mlx5e_close_locked(netdev);
 	if (was_opened && !reset) {
 		/* num_channels is invariant here, so we can take the
 		 * batched reference right upfront.
···
 		}
 	}
 
-	/* exchange programs, extra prog reference we got from caller
-	 * as long as we don't fail from this point onwards.
-	 */
-	old_prog = xchg(&priv->channels.params.xdp_prog, prog);
+	if (was_opened && reset) {
+		struct mlx5e_channels new_channels = {};
+
+		new_channels.params = priv->channels.params;
+		new_channels.params.xdp_prog = prog;
+		mlx5e_set_rq_type(priv->mdev, &new_channels.params);
+		old_prog = priv->channels.params.xdp_prog;
+
+		err = mlx5e_safe_switch_channels(priv, &new_channels, mlx5e_xdp_update_state);
+		if (err)
+			goto unlock;
+	} else {
+		/* exchange programs, extra prog reference we got from caller
+		 * as long as we don't fail from this point onwards.
+		 */
+		old_prog = xchg(&priv->channels.params.xdp_prog, prog);
+	}
+
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
-	if (reset) /* change RQ type according to priv->xdp_prog */
+	if (!was_opened && reset) /* change RQ type according to priv->xdp_prog */
 		mlx5e_set_rq_type(priv->mdev, &priv->channels.params);
 
-	if (was_opened && reset)
-		err = mlx5e_open_locked(netdev);
-
-	if (!test_bit(MLX5E_STATE_OPENED, &priv->state) || reset)
+	if (!was_opened || reset)
 		goto unlock;
 
 	/* exchanging programs w/o reset, we update ref counts on behalf
···
 	 */
 	for (i = 0; i < priv->channels.num; i++) {
 		struct mlx5e_channel *c = priv->channels.c[i];
+		bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
 
 		clear_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
+		if (xsk_open)
+			clear_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
 		napi_synchronize(&c->napi);
 		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
 
 		old_prog = xchg(&c->rq.xdp_prog, prog);
-
-		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
-		/* napi_schedule in case we have missed anything */
-		napi_schedule(&c->napi);
-
 		if (old_prog)
 			bpf_prog_put(old_prog);
+
+		if (xsk_open) {
+			old_prog = xchg(&c->xskrq.xdp_prog, prog);
+			if (old_prog)
+				bpf_prog_put(old_prog);
+		}
+
+		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
+		if (xsk_open)
+			set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
+		/* napi_schedule in case we have missed anything */
+		napi_schedule(&c->napi);
 	}
 
 unlock:
···
 	case XDP_QUERY_PROG:
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		return 0;
+	case XDP_SETUP_XSK_UMEM:
+		return mlx5e_xsk_setup_umem(dev, xdp->xsk.umem,
+					    xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
···
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
 	.ndo_bpf                 = mlx5e_xdp,
 	.ndo_xdp_xmit            = mlx5e_xdp_xmit,
+	.ndo_xsk_async_xmit      = mlx5e_xsk_async_xmit,
 #ifdef CONFIG_MLX5_EN_ARFS
 	.ndo_rx_flow_steer       = mlx5e_rx_flow_steer,
 #endif
···
 	 * - Striding RQ configuration is not possible/supported.
 	 * - Slow PCI heuristic.
 	 * - Legacy RQ would use linear SKB while Striding RQ would use non-linear.
+	 *
+	 * No XSK params: checking the availability of striding RQ in general.
 	 */
 	if (!slow_pci_heuristic(mdev) &&
 	    mlx5e_striding_rq_possible(mdev, params) &&
-	    (mlx5e_rx_mpwqe_is_linear_skb(mdev, params) ||
-	     !mlx5e_rx_is_linear_skb(params)))
+	    (mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL) ||
+	     !mlx5e_rx_is_linear_skb(params, NULL)))
 		MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ, true);
 	mlx5e_set_rq_type(mdev, params);
 	mlx5e_init_rq_type_params(mdev, params);
···
 }
 
 void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
+			    struct mlx5e_xsk *xsk,
 			    struct mlx5e_rss_params *rss_params,
 			    struct mlx5e_params *params,
 			    u16 max_channels, u16 mtu)
···
 	/* HW LRO */
 
 	/* TODO: && MLX5_CAP_ETH(mdev, lro_cap) */
-	if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
-		if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params))
+	if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
+		/* No XSK params: checking the availability of striding RQ in general. */
+		if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL))
 			params->lro_en = !slow_pci_heuristic(mdev);
+	}
 	params->lro_timeout = mlx5e_choose_lro_timeout(mdev, MLX5E_DEFAULT_LRO_TIMEOUT);
 
···
 	mlx5e_build_rss_params(rss_params, params->num_channels);
 	params->tunneled_offload_en =
 		mlx5e_tunnel_inner_ft_supported(mdev);
+
+	/* AF_XDP */
+	params->xsk = xsk;
 }
 
 static void mlx5e_set_netdev_dev_addr(struct net_device *netdev)
···
 	if (err)
 		return err;
 
-	mlx5e_build_nic_params(mdev, rss, &priv->channels.params,
+	mlx5e_build_nic_params(mdev, &priv->xsk, rss, &priv->channels.params,
 			       mlx5e_get_netdev_max_channels(netdev),
 			       netdev->mtu);
···
 	if (err)
 		goto err_close_drop_rq;
 
-	err = mlx5e_create_direct_rqts(priv);
+	err = mlx5e_create_direct_rqts(priv, priv->direct_tir);
 	if (err)
 		goto err_destroy_indirect_rqts;
···
 	if (err)
 		goto err_destroy_direct_rqts;
 
-	err = mlx5e_create_direct_tirs(priv);
+	err = mlx5e_create_direct_tirs(priv, priv->direct_tir);
 	if (err)
 		goto err_destroy_indirect_tirs;
+
+	err = mlx5e_create_direct_rqts(priv, priv->xsk_tir);
+	if (unlikely(err))
+		goto err_destroy_direct_tirs;
+
+	err = mlx5e_create_direct_tirs(priv, priv->xsk_tir);
+	if (unlikely(err))
+		goto err_destroy_xsk_rqts;
 
 	err = mlx5e_create_flow_steering(priv);
 	if (err) {
 		mlx5_core_warn(mdev, "create flow steering failed, %d\n", err);
-		goto err_destroy_direct_tirs;
+		goto err_destroy_xsk_tirs;
 	}
 
 	err = mlx5e_tc_nic_init(priv);
···
 
 err_destroy_flow_steering:
 	mlx5e_destroy_flow_steering(priv);
+err_destroy_xsk_tirs:
+	mlx5e_destroy_direct_tirs(priv, priv->xsk_tir);
+err_destroy_xsk_rqts:
+	mlx5e_destroy_direct_rqts(priv, priv->xsk_tir);
 err_destroy_direct_tirs:
-	mlx5e_destroy_direct_tirs(priv);
+	mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
 err_destroy_indirect_tirs:
 	mlx5e_destroy_indirect_tirs(priv, true);
 err_destroy_direct_rqts:
-	mlx5e_destroy_direct_rqts(priv);
+	mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
 err_destroy_indirect_rqts:
 	mlx5e_destroy_rqt(priv, &priv->indir_rqt);
 err_close_drop_rq:
···
 {
 	mlx5e_tc_nic_cleanup(priv);
 	mlx5e_destroy_flow_steering(priv);
-	mlx5e_destroy_direct_tirs(priv);
+	mlx5e_destroy_direct_tirs(priv, priv->xsk_tir);
+	mlx5e_destroy_direct_rqts(priv, priv->xsk_tir);
+	mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
 	mlx5e_destroy_indirect_tirs(priv, true);
-	mlx5e_destroy_direct_rqts(priv);
+	mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
 	mlx5e_destroy_rqt(priv, &priv->indir_rqt);
 	mlx5e_close_drop_rq(&priv->drop_rq);
 	mlx5e_destroy_q_counters(priv);
···
 
 	netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv),
 				    nch * profile->max_tc,
-				    nch);
+				    nch * MLX5E_NUM_RQ_GROUPS);
 	if (!netdev) {
 		mlx5_core_err(mdev, "alloc_etherdev_mqs() failed\n");
 		return NULL;
+6 -6
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
···
 	if (err)
 		goto err_close_drop_rq;
 
-	err = mlx5e_create_direct_rqts(priv);
+	err = mlx5e_create_direct_rqts(priv, priv->direct_tir);
 	if (err)
 		goto err_destroy_indirect_rqts;
···
 	if (err)
 		goto err_destroy_direct_rqts;
 
-	err = mlx5e_create_direct_tirs(priv);
+	err = mlx5e_create_direct_tirs(priv, priv->direct_tir);
 	if (err)
 		goto err_destroy_indirect_tirs;
···
 err_destroy_ttc_table:
 	mlx5e_destroy_ttc_table(priv, &priv->fs.ttc);
 err_destroy_direct_tirs:
-	mlx5e_destroy_direct_tirs(priv);
+	mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
 err_destroy_indirect_tirs:
 	mlx5e_destroy_indirect_tirs(priv, false);
 err_destroy_direct_rqts:
-	mlx5e_destroy_direct_rqts(priv);
+	mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
 err_destroy_indirect_rqts:
 	mlx5e_destroy_rqt(priv, &priv->indir_rqt);
 err_close_drop_rq:
···
 
 	mlx5_del_flow_rules(rpriv->vport_rx_rule);
 	mlx5e_destroy_ttc_table(priv, &priv->fs.ttc);
-	mlx5e_destroy_direct_tirs(priv);
+	mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
 	mlx5e_destroy_indirect_tirs(priv, false);
-	mlx5e_destroy_direct_rqts(priv);
+	mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
 	mlx5e_destroy_rqt(priv, &priv->indir_rqt);
 	mlx5e_close_drop_rq(&priv->drop_rq);
 }
+76 -28
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
···
 #include "en_accel/tls_rxtx.h"
 #include "lib/clock.h"
 #include "en/xdp.h"
+#include "en/xsk/rx.h"
 
 static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
 {
···
 	return true;
 }
 
-static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
-					  struct mlx5e_dma_info *dma_info)
+static inline int mlx5e_page_alloc_pool(struct mlx5e_rq *rq,
+					struct mlx5e_dma_info *dma_info)
 {
 	if (mlx5e_rx_cache_get(rq, dma_info))
 		return 0;
···
 	return 0;
 }
 
+static inline int mlx5e_page_alloc(struct mlx5e_rq *rq,
+				   struct mlx5e_dma_info *dma_info)
+{
+	if (rq->umem)
+		return mlx5e_xsk_page_alloc_umem(rq, dma_info);
+	else
+		return mlx5e_page_alloc_pool(rq, dma_info);
+}
+
 void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
 {
 	dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, rq->buff.map_dir);
 }
 
-void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
-			bool recycle)
+void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
+				struct mlx5e_dma_info *dma_info,
+				bool recycle)
 {
 	if (likely(recycle)) {
 		if (mlx5e_rx_cache_put(rq, dma_info))
···
 	}
 }
 
+static inline void mlx5e_page_release(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info,
+				      bool recycle)
+{
+	if (rq->umem)
+		/* The `recycle` parameter is ignored, and the page is always
+		 * put into the Reuse Ring, because there is no way to return
+		 * the page to the userspace when the interface goes down.
+		 */
+		mlx5e_xsk_page_release(rq, dma_info);
+	else
+		mlx5e_page_release_dynamic(rq, dma_info, recycle);
+}
+
 static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
 				    struct mlx5e_wqe_frag_info *frag)
 {
···
 	 * offset) should just use the new one without replenishing again
 	 * by themselves.
 	 */
-	err = mlx5e_page_alloc_mapped(rq, frag->di);
+	err = mlx5e_page_alloc(rq, frag->di);
 
 	return err;
 }
···
 	int err;
 	int i;
 
+	if (rq->umem) {
+		int pages_desired = wqe_bulk << rq->wqe.info.log_num_frags;
+
+		if (unlikely(!mlx5e_xsk_pages_enough_umem(rq, pages_desired)))
+			return -ENOMEM;
+	}
+
 	for (i = 0; i < wqe_bulk; i++) {
 		struct mlx5e_rx_wqe_cyc *wqe = mlx5_wq_cyc_get_wqe(wq, ix + i);
 
···
 static void
 mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, bool recycle)
 {
-	const bool no_xdp_xmit =
-		bitmap_empty(wi->xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE);
+	bool no_xdp_xmit;
 	struct mlx5e_dma_info *dma_info = wi->umr.dma_info;
 	int i;
+
+	/* A common case for AF_XDP. */
+	if (bitmap_full(wi->xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE))
+		return;
+
+	no_xdp_xmit = bitmap_empty(wi->xdp_xmit_bitmap,
+				   MLX5_MPWRQ_PAGES_PER_WQE);
 
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++)
 		if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
···
 	dma_wmb();
 
 	mlx5_wq_ll_update_db_record(wq);
-}
-
-static inline u16 mlx5e_icosq_wrap_cnt(struct mlx5e_icosq *sq)
-{
-	return mlx5_wq_cyc_get_ctr_wrap_cnt(&sq->wq, sq->pc);
 }
 
 static inline void mlx5e_fill_icosq_frag_edge(struct mlx5e_icosq *sq,
···
 	int err;
 	int i;
 
+	if (rq->umem &&
+	    unlikely(!mlx5e_xsk_pages_enough_umem(rq, MLX5_MPWRQ_PAGES_PER_WQE))) {
+		err = -ENOMEM;
+		goto err;
+	}
+
 	pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
 	contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
 	if (unlikely(contig_wqebbs_room < MLX5E_UMR_WQEBBS)) {
···
 	}
 
 	umr_wqe = mlx5_wq_cyc_get_wqe(wq, pi);
-	if (unlikely(mlx5e_icosq_wrap_cnt(sq) < 2))
-		memcpy(umr_wqe, &rq->mpwqe.umr_wqe,
-		       offsetof(struct mlx5e_umr_wqe, inline_mtts));
+	memcpy(umr_wqe, &rq->mpwqe.umr_wqe, offsetof(struct mlx5e_umr_wqe, inline_mtts));
 
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++, dma_info++) {
-		err = mlx5e_page_alloc_mapped(rq, dma_info);
+		err = mlx5e_page_alloc(rq, dma_info);
 		if (unlikely(err))
 			goto err_unmap;
 		umr_wqe->inline_mtts[i].ptag = cpu_to_be64(dma_info->addr | MLX5_EN_WR);
···
 	umr_wqe->uctrl.xlt_offset = cpu_to_be16(xlt_offset);
 
 	sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR;
+	sq->db.ico_wqe[pi].umr.rq = rq;
 	sq->pc += MLX5E_UMR_WQEBBS;
 
 	sq->doorbell_cseg = &umr_wqe->ctrl;
···
 		dma_info--;
 		mlx5e_page_release(rq, dma_info, true);
 	}
+
+err:
 	rq->stats->buff_alloc_err++;
505 return err; ··· 584 544 return !!err; 585 545 } 586 546 587 - static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq) 547 + void mlx5e_poll_ico_cq(struct mlx5e_cq *cq) 588 548 { 589 549 struct mlx5e_icosq *sq = container_of(cq, struct mlx5e_icosq, cq); 590 550 struct mlx5_cqe64 *cqe; 591 - u8 completed_umr = 0; 592 551 u16 sqcc; 593 552 int i; 594 553 ··· 628 589 629 590 if (likely(wi->opcode == MLX5_OPCODE_UMR)) { 630 591 sqcc += MLX5E_UMR_WQEBBS; 631 - completed_umr++; 592 + wi->umr.rq->mpwqe.umr_completed++; 632 593 } else if (likely(wi->opcode == MLX5_OPCODE_NOP)) { 633 594 sqcc++; 634 595 } else { ··· 644 605 sq->cc = sqcc; 645 606 646 607 mlx5_cqwq_update_db_record(&cq->wq); 647 - 648 - if (likely(completed_umr)) { 649 - mlx5e_post_rx_mpwqe(rq, completed_umr); 650 - rq->mpwqe.umr_in_progress -= completed_umr; 651 - } 652 608 } 653 609 654 610 bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq) 655 611 { 656 612 struct mlx5e_icosq *sq = &rq->channel->icosq; 657 613 struct mlx5_wq_ll *wq = &rq->mpwqe.wq; 614 + u8 umr_completed = rq->mpwqe.umr_completed; 615 + int alloc_err = 0; 658 616 u8 missing, i; 659 617 u16 head; 660 618 661 619 if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state))) 662 620 return false; 663 621 664 - mlx5e_poll_ico_cq(&sq->cq, rq); 622 + if (umr_completed) { 623 + mlx5e_post_rx_mpwqe(rq, umr_completed); 624 + rq->mpwqe.umr_in_progress -= umr_completed; 625 + rq->mpwqe.umr_completed = 0; 626 + } 665 627 666 628 missing = mlx5_wq_ll_missing(wq) - rq->mpwqe.umr_in_progress; 667 629 ··· 676 636 head = rq->mpwqe.actual_wq_head; 677 637 i = missing; 678 638 do { 679 - if (unlikely(mlx5e_alloc_rx_mpwqe(rq, head))) 639 + alloc_err = mlx5e_alloc_rx_mpwqe(rq, head); 640 + 641 + if (unlikely(alloc_err)) 680 642 break; 681 643 head = mlx5_wq_ll_get_wqe_next_ix(wq, head); 682 644 } while (--i); ··· 691 649 692 650 rq->mpwqe.umr_in_progress += rq->mpwqe.umr_last_bulk; 693 651 rq->mpwqe.actual_wq_head = head; 652 + 653 + /* If XSK 
Fill Ring doesn't have enough frames, busy poll by 654 + * rescheduling the NAPI poll. 655 + */ 656 + if (unlikely(alloc_err == -ENOMEM && rq->umem)) 657 + return true; 694 658 695 659 return false; 696 660 } ··· 1066 1018 } 1067 1019 1068 1020 rcu_read_lock(); 1069 - consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt); 1021 + consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, false); 1070 1022 rcu_read_unlock(); 1071 1023 if (consumed) 1072 1024 return NULL; /* page/packet was consumed by XDP */ ··· 1283 1235 prefetch(data); 1284 1236 1285 1237 rcu_read_lock(); 1286 - consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32); 1238 + consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, false); 1287 1239 rcu_read_unlock(); 1288 1240 if (consumed) { 1289 1241 if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
+112 -3
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
··· 104 104 { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_poll) }, 105 105 { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_arm) }, 106 106 { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_aff_change) }, 107 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_force_irq) }, 107 108 { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_eq_rearm) }, 109 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_packets) }, 110 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_bytes) }, 111 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_csum_complete) }, 112 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_csum_unnecessary) }, 113 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_csum_unnecessary_inner) }, 114 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_csum_none) }, 115 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_ecn_mark) }, 116 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_removed_vlan_packets) }, 117 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_xdp_drop) }, 118 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_xdp_redirect) }, 119 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_wqe_err) }, 120 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_mpwqe_filler_cqes) }, 121 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_mpwqe_filler_strides) }, 122 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_oversize_pkts_sw_drop) }, 123 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_buff_alloc_err) }, 124 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_cqe_compress_blks) }, 125 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_cqe_compress_pkts) }, 126 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_congst_umr) }, 127 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_arfs_err) }, 128 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_xmit) }, 129 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_mpwqe) }, 130 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_inlnw) }, 131 + { MLX5E_DECLARE_STAT(struct 
mlx5e_sw_stats, tx_xsk_full) }, 132 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_err) }, 133 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_cqes) }, 108 134 }; 109 135 110 136 #define NUM_SW_COUNTERS ARRAY_SIZE(sw_stats_desc) ··· 170 144 &priv->channel_stats[i]; 171 145 struct mlx5e_xdpsq_stats *xdpsq_red_stats = &channel_stats->xdpsq; 172 146 struct mlx5e_xdpsq_stats *xdpsq_stats = &channel_stats->rq_xdpsq; 147 + struct mlx5e_xdpsq_stats *xsksq_stats = &channel_stats->xsksq; 148 + struct mlx5e_rq_stats *xskrq_stats = &channel_stats->xskrq; 173 149 struct mlx5e_rq_stats *rq_stats = &channel_stats->rq; 174 150 struct mlx5e_ch_stats *ch_stats = &channel_stats->ch; 175 151 int j; ··· 214 186 s->ch_poll += ch_stats->poll; 215 187 s->ch_arm += ch_stats->arm; 216 188 s->ch_aff_change += ch_stats->aff_change; 189 + s->ch_force_irq += ch_stats->force_irq; 217 190 s->ch_eq_rearm += ch_stats->eq_rearm; 218 191 /* xdp redirect */ 219 192 s->tx_xdp_xmit += xdpsq_red_stats->xmit; ··· 223 194 s->tx_xdp_full += xdpsq_red_stats->full; 224 195 s->tx_xdp_err += xdpsq_red_stats->err; 225 196 s->tx_xdp_cqes += xdpsq_red_stats->cqes; 197 + /* AF_XDP zero-copy */ 198 + s->rx_xsk_packets += xskrq_stats->packets; 199 + s->rx_xsk_bytes += xskrq_stats->bytes; 200 + s->rx_xsk_csum_complete += xskrq_stats->csum_complete; 201 + s->rx_xsk_csum_unnecessary += xskrq_stats->csum_unnecessary; 202 + s->rx_xsk_csum_unnecessary_inner += xskrq_stats->csum_unnecessary_inner; 203 + s->rx_xsk_csum_none += xskrq_stats->csum_none; 204 + s->rx_xsk_ecn_mark += xskrq_stats->ecn_mark; 205 + s->rx_xsk_removed_vlan_packets += xskrq_stats->removed_vlan_packets; 206 + s->rx_xsk_xdp_drop += xskrq_stats->xdp_drop; 207 + s->rx_xsk_xdp_redirect += xskrq_stats->xdp_redirect; 208 + s->rx_xsk_wqe_err += xskrq_stats->wqe_err; 209 + s->rx_xsk_mpwqe_filler_cqes += xskrq_stats->mpwqe_filler_cqes; 210 + s->rx_xsk_mpwqe_filler_strides += xskrq_stats->mpwqe_filler_strides; 211 + s->rx_xsk_oversize_pkts_sw_drop += 
xskrq_stats->oversize_pkts_sw_drop; 212 + s->rx_xsk_buff_alloc_err += xskrq_stats->buff_alloc_err; 213 + s->rx_xsk_cqe_compress_blks += xskrq_stats->cqe_compress_blks; 214 + s->rx_xsk_cqe_compress_pkts += xskrq_stats->cqe_compress_pkts; 215 + s->rx_xsk_congst_umr += xskrq_stats->congst_umr; 216 + s->rx_xsk_arfs_err += xskrq_stats->arfs_err; 217 + s->tx_xsk_xmit += xsksq_stats->xmit; 218 + s->tx_xsk_mpwqe += xsksq_stats->mpwqe; 219 + s->tx_xsk_inlnw += xsksq_stats->inlnw; 220 + s->tx_xsk_full += xsksq_stats->full; 221 + s->tx_xsk_err += xsksq_stats->err; 222 + s->tx_xsk_cqes += xsksq_stats->cqes; 226 223 227 224 for (j = 0; j < priv->max_opened_tc; j++) { 228 225 struct mlx5e_sq_stats *sq_stats = &channel_stats->sq[j]; ··· 1321 1266 { MLX5E_DECLARE_XDPSQ_STAT(struct mlx5e_xdpsq_stats, cqes) }, 1322 1267 }; 1323 1268 1269 + static const struct counter_desc xskrq_stats_desc[] = { 1270 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, packets) }, 1271 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, bytes) }, 1272 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, csum_complete) }, 1273 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, csum_unnecessary) }, 1274 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, csum_unnecessary_inner) }, 1275 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, csum_none) }, 1276 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, ecn_mark) }, 1277 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, removed_vlan_packets) }, 1278 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, xdp_drop) }, 1279 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, xdp_redirect) }, 1280 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, wqe_err) }, 1281 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, mpwqe_filler_cqes) }, 1282 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, mpwqe_filler_strides) }, 1283 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, oversize_pkts_sw_drop) }, 1284 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, 
buff_alloc_err) }, 1285 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, cqe_compress_blks) }, 1286 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) }, 1287 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, congst_umr) }, 1288 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, arfs_err) }, 1289 + }; 1290 + 1291 + static const struct counter_desc xsksq_stats_desc[] = { 1292 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, xmit) }, 1293 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, mpwqe) }, 1294 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, inlnw) }, 1295 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, full) }, 1296 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, err) }, 1297 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, cqes) }, 1298 + }; 1299 + 1324 1300 static const struct counter_desc ch_stats_desc[] = { 1325 1301 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, events) }, 1326 1302 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, poll) }, 1327 1303 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, arm) }, 1328 1304 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, aff_change) }, 1305 + { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, force_irq) }, 1329 1306 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, eq_rearm) }, 1330 1307 }; 1331 1308 ··· 1365 1278 #define NUM_SQ_STATS ARRAY_SIZE(sq_stats_desc) 1366 1279 #define NUM_XDPSQ_STATS ARRAY_SIZE(xdpsq_stats_desc) 1367 1280 #define NUM_RQ_XDPSQ_STATS ARRAY_SIZE(rq_xdpsq_stats_desc) 1281 + #define NUM_XSKRQ_STATS ARRAY_SIZE(xskrq_stats_desc) 1282 + #define NUM_XSKSQ_STATS ARRAY_SIZE(xsksq_stats_desc) 1368 1283 #define NUM_CH_STATS ARRAY_SIZE(ch_stats_desc) 1369 1284 1370 1285 static int mlx5e_grp_channels_get_num_stats(struct mlx5e_priv *priv) ··· 1377 1288 (NUM_CH_STATS * max_nch) + 1378 1289 (NUM_SQ_STATS * max_nch * priv->max_opened_tc) + 1379 1290 (NUM_RQ_XDPSQ_STATS * max_nch) + 1380 - (NUM_XDPSQ_STATS * max_nch); 1291 + (NUM_XDPSQ_STATS * max_nch) 
+ 1292 + (NUM_XSKRQ_STATS * max_nch * priv->xsk.ever_used) + 1293 + (NUM_XSKSQ_STATS * max_nch * priv->xsk.ever_used); 1381 1294 } 1382 1295 1383 1296 static int mlx5e_grp_channels_fill_strings(struct mlx5e_priv *priv, u8 *data, 1384 1297 int idx) 1385 1298 { 1386 1299 int max_nch = mlx5e_get_netdev_max_channels(priv->netdev); 1300 + bool is_xsk = priv->xsk.ever_used; 1387 1301 int i, j, tc; 1388 1302 1389 1303 for (i = 0; i < max_nch; i++) ··· 1398 1306 for (j = 0; j < NUM_RQ_STATS; j++) 1399 1307 sprintf(data + (idx++) * ETH_GSTRING_LEN, 1400 1308 rq_stats_desc[j].format, i); 1309 + for (j = 0; j < NUM_XSKRQ_STATS * is_xsk; j++) 1310 + sprintf(data + (idx++) * ETH_GSTRING_LEN, 1311 + xskrq_stats_desc[j].format, i); 1401 1312 for (j = 0; j < NUM_RQ_XDPSQ_STATS; j++) 1402 1313 sprintf(data + (idx++) * ETH_GSTRING_LEN, 1403 1314 rq_xdpsq_stats_desc[j].format, i); ··· 1413 1318 sq_stats_desc[j].format, 1414 1319 priv->channel_tc2txq[i][tc]); 1415 1320 1416 - for (i = 0; i < max_nch; i++) 1321 + for (i = 0; i < max_nch; i++) { 1322 + for (j = 0; j < NUM_XSKSQ_STATS * is_xsk; j++) 1323 + sprintf(data + (idx++) * ETH_GSTRING_LEN, 1324 + xsksq_stats_desc[j].format, i); 1417 1325 for (j = 0; j < NUM_XDPSQ_STATS; j++) 1418 1326 sprintf(data + (idx++) * ETH_GSTRING_LEN, 1419 1327 xdpsq_stats_desc[j].format, i); 1328 + } 1420 1329 1421 1330 return idx; 1422 1331 } ··· 1429 1330 int idx) 1430 1331 { 1431 1332 int max_nch = mlx5e_get_netdev_max_channels(priv->netdev); 1333 + bool is_xsk = priv->xsk.ever_used; 1432 1334 int i, j, tc; 1433 1335 1434 1336 for (i = 0; i < max_nch; i++) ··· 1443 1343 data[idx++] = 1444 1344 MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].rq, 1445 1345 rq_stats_desc, j); 1346 + for (j = 0; j < NUM_XSKRQ_STATS * is_xsk; j++) 1347 + data[idx++] = 1348 + MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].xskrq, 1349 + xskrq_stats_desc, j); 1446 1350 for (j = 0; j < NUM_RQ_XDPSQ_STATS; j++) 1447 1351 data[idx++] = 1448 1352 
MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].rq_xdpsq, ··· 1460 1356 MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].sq[tc], 1461 1357 sq_stats_desc, j); 1462 1358 1463 - for (i = 0; i < max_nch; i++) 1359 + for (i = 0; i < max_nch; i++) { 1360 + for (j = 0; j < NUM_XSKSQ_STATS * is_xsk; j++) 1361 + data[idx++] = 1362 + MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].xsksq, 1363 + xsksq_stats_desc, j); 1464 1364 for (j = 0; j < NUM_XDPSQ_STATS; j++) 1465 1365 data[idx++] = 1466 1366 MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].xdpsq, 1467 1367 xdpsq_stats_desc, j); 1368 + } 1468 1369 1469 1370 return idx; 1470 1371 }
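A small pattern worth noting in the en_stats.c hunks: the new XSK counters are made conditional without any `if` by multiplying array sizes by the `priv->xsk.ever_used` boolean, both in the total (`NUM_XSKRQ_STATS * max_nch * priv->xsk.ever_used`) and in the per-channel loop bounds (`j < NUM_XSKRQ_STATS * is_xsk`). A hedged sketch of that arithmetic, with invented counter counts:

```c
/* ever_used is 0 or 1; when 0, the XSK term (and the matching loops
 * in fill_strings/fill_stats) vanish without a conditional. */
static int channels_num_stats(int max_nch, int num_rq_stats,
                              int num_xskrq_stats, int ever_used)
{
    return num_rq_stats * max_nch +
           num_xskrq_stats * max_nch * ever_used;
}
```

Once a channel has ever used XSK, the counters stay exposed, which keeps the ethtool string/value layouts consistent across reads.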
+30
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
··· 46 46 #define MLX5E_DECLARE_TX_STAT(type, fld) "tx%d_"#fld, offsetof(type, fld) 47 47 #define MLX5E_DECLARE_XDPSQ_STAT(type, fld) "tx%d_xdp_"#fld, offsetof(type, fld) 48 48 #define MLX5E_DECLARE_RQ_XDPSQ_STAT(type, fld) "rx%d_xdp_tx_"#fld, offsetof(type, fld) 49 + #define MLX5E_DECLARE_XSKRQ_STAT(type, fld) "rx%d_xsk_"#fld, offsetof(type, fld) 50 + #define MLX5E_DECLARE_XSKSQ_STAT(type, fld) "tx%d_xsk_"#fld, offsetof(type, fld) 49 51 #define MLX5E_DECLARE_CH_STAT(type, fld) "ch%d_"#fld, offsetof(type, fld) 50 52 51 53 struct counter_desc { ··· 118 116 u64 ch_poll; 119 117 u64 ch_arm; 120 118 u64 ch_aff_change; 119 + u64 ch_force_irq; 121 120 u64 ch_eq_rearm; 122 121 123 122 #ifdef CONFIG_MLX5_EN_TLS 124 123 u64 tx_tls_ooo; 125 124 u64 tx_tls_resync_bytes; 126 125 #endif 126 + 127 + u64 rx_xsk_packets; 128 + u64 rx_xsk_bytes; 129 + u64 rx_xsk_csum_complete; 130 + u64 rx_xsk_csum_unnecessary; 131 + u64 rx_xsk_csum_unnecessary_inner; 132 + u64 rx_xsk_csum_none; 133 + u64 rx_xsk_ecn_mark; 134 + u64 rx_xsk_removed_vlan_packets; 135 + u64 rx_xsk_xdp_drop; 136 + u64 rx_xsk_xdp_redirect; 137 + u64 rx_xsk_wqe_err; 138 + u64 rx_xsk_mpwqe_filler_cqes; 139 + u64 rx_xsk_mpwqe_filler_strides; 140 + u64 rx_xsk_oversize_pkts_sw_drop; 141 + u64 rx_xsk_buff_alloc_err; 142 + u64 rx_xsk_cqe_compress_blks; 143 + u64 rx_xsk_cqe_compress_pkts; 144 + u64 rx_xsk_congst_umr; 145 + u64 rx_xsk_arfs_err; 146 + u64 tx_xsk_xmit; 147 + u64 tx_xsk_mpwqe; 148 + u64 tx_xsk_inlnw; 149 + u64 tx_xsk_full; 150 + u64 tx_xsk_err; 151 + u64 tx_xsk_cqes; 127 152 }; 128 153 129 154 struct mlx5e_qcounter_stats { ··· 285 256 u64 poll; 286 257 u64 arm; 287 258 u64 aff_change; 259 + u64 force_irq; 288 260 u64 eq_rearm; 289 261 }; 290 262
+38 -4
drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
··· 33 33 #include <linux/irq.h> 34 34 #include "en.h" 35 35 #include "en/xdp.h" 36 + #include "en/xsk/tx.h" 36 37 37 38 static inline bool mlx5e_channel_no_affinity_change(struct mlx5e_channel *c) 38 39 { ··· 86 85 struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel, 87 86 napi); 88 87 struct mlx5e_ch_stats *ch_stats = c->stats; 88 + struct mlx5e_xdpsq *xsksq = &c->xsksq; 89 + struct mlx5e_rq *xskrq = &c->xskrq; 89 90 struct mlx5e_rq *rq = &c->rq; 91 + bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state); 92 + bool aff_change = false; 93 + bool busy_xsk = false; 90 94 bool busy = false; 91 95 int work_done = 0; 92 96 int i; ··· 101 95 for (i = 0; i < c->num_tc; i++) 102 96 busy |= mlx5e_poll_tx_cq(&c->sq[i].cq, budget); 103 97 104 - busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq, NULL); 98 + busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq); 105 99 106 100 if (c->xdp) 107 - busy |= mlx5e_poll_xdpsq_cq(&rq->xdpsq.cq, rq); 101 + busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq); 108 102 109 103 if (likely(budget)) { /* budget=0 means: don't poll rx rings */ 110 - work_done = mlx5e_poll_rx_cq(&rq->cq, budget); 104 + if (xsk_open) 105 + work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget); 106 + 107 + if (likely(budget - work_done)) 108 + work_done += mlx5e_poll_rx_cq(&rq->cq, budget - work_done); 109 + 111 110 busy |= work_done == budget; 112 111 } 113 112 114 - busy |= c->rq.post_wqes(rq); 113 + mlx5e_poll_ico_cq(&c->icosq.cq); 114 + 115 + busy |= rq->post_wqes(rq); 116 + if (xsk_open) { 117 + mlx5e_poll_ico_cq(&c->xskicosq.cq); 118 + busy |= mlx5e_poll_xdpsq_cq(&xsksq->cq); 119 + busy_xsk |= mlx5e_xsk_tx(xsksq, MLX5E_TX_XSK_POLL_BUDGET); 120 + busy_xsk |= xskrq->post_wqes(xskrq); 121 + } 122 + 123 + busy |= busy_xsk; 115 124 116 125 if (busy) { 117 126 if (likely(mlx5e_channel_no_affinity_change(c))) 118 127 return budget; 119 128 ch_stats->aff_change++; 129 + aff_change = true; 120 130 if (budget && work_done == budget) 121 131 work_done--; 122 132 } ··· 152 130 
mlx5e_cq_arm(&rq->cq); 153 131 mlx5e_cq_arm(&c->icosq.cq); 154 132 mlx5e_cq_arm(&c->xdpsq.cq); 133 + 134 + if (xsk_open) { 135 + mlx5e_handle_rx_dim(xskrq); 136 + mlx5e_cq_arm(&c->xskicosq.cq); 137 + mlx5e_cq_arm(&xsksq->cq); 138 + mlx5e_cq_arm(&xskrq->cq); 139 + } 140 + 141 + if (unlikely(aff_change && busy_xsk)) { 142 + mlx5e_trigger_irq(&c->icosq); 143 + ch_stats->force_irq++; 144 + } 155 145 156 146 return work_done; 157 147 }
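In the en_txrx.c hunk, `mlx5e_napi_poll()` now polls the XSK RQ first and hands only the leftover budget to the regular RQ, reporting the channel busy when the combined work consumed the whole budget. The split can be modeled as below (stand-in function with invented pending counts; the real code polls CQs, not counters):

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* Returns total completions; sets *busy when the budget was exhausted,
 * mirroring `busy |= work_done == budget` in the diff. */
static int poll_rx_budget(int budget, int xsk_pending, int rq_pending, int *busy)
{
    int work_done = 0;

    if (budget) {   /* budget=0 means: don't poll rx rings */
        work_done  = min_int(xsk_pending, budget);            /* XSK RQ first */
        work_done += min_int(rq_pending, budget - work_done); /* then regular RQ */
        *busy = (work_done == budget);
    }
    return work_done;
}
```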
+7 -7
drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
··· 87 87 mlx5e_set_netdev_mtu_boundaries(priv); 88 88 netdev->mtu = netdev->max_mtu; 89 89 90 - mlx5e_build_nic_params(mdev, &priv->rss_params, &priv->channels.params, 90 + mlx5e_build_nic_params(mdev, NULL, &priv->rss_params, &priv->channels.params, 91 91 mlx5e_get_netdev_max_channels(netdev), 92 92 netdev->mtu); 93 93 mlx5i_build_nic_params(mdev, &priv->channels.params); ··· 365 365 if (err) 366 366 goto err_close_drop_rq; 367 367 368 - err = mlx5e_create_direct_rqts(priv); 368 + err = mlx5e_create_direct_rqts(priv, priv->direct_tir); 369 369 if (err) 370 370 goto err_destroy_indirect_rqts; 371 371 ··· 373 373 if (err) 374 374 goto err_destroy_direct_rqts; 375 375 376 - err = mlx5e_create_direct_tirs(priv); 376 + err = mlx5e_create_direct_tirs(priv, priv->direct_tir); 377 377 if (err) 378 378 goto err_destroy_indirect_tirs; 379 379 ··· 384 384 return 0; 385 385 386 386 err_destroy_direct_tirs: 387 - mlx5e_destroy_direct_tirs(priv); 387 + mlx5e_destroy_direct_tirs(priv, priv->direct_tir); 388 388 err_destroy_indirect_tirs: 389 389 mlx5e_destroy_indirect_tirs(priv, true); 390 390 err_destroy_direct_rqts: 391 - mlx5e_destroy_direct_rqts(priv); 391 + mlx5e_destroy_direct_rqts(priv, priv->direct_tir); 392 392 err_destroy_indirect_rqts: 393 393 mlx5e_destroy_rqt(priv, &priv->indir_rqt); 394 394 err_close_drop_rq: ··· 401 401 static void mlx5i_cleanup_rx(struct mlx5e_priv *priv) 402 402 { 403 403 mlx5i_destroy_flow_steering(priv); 404 - mlx5e_destroy_direct_tirs(priv); 404 + mlx5e_destroy_direct_tirs(priv, priv->direct_tir); 405 405 mlx5e_destroy_indirect_tirs(priv, true); 406 - mlx5e_destroy_direct_rqts(priv); 406 + mlx5e_destroy_direct_rqts(priv, priv->direct_tir); 407 407 mlx5e_destroy_rqt(priv, &priv->indir_rqt); 408 408 mlx5e_close_drop_rq(&priv->drop_rq); 409 409 mlx5e_destroy_q_counters(priv);
-5
drivers/net/ethernet/mellanox/mlx5/core/wq.h
··· 134 134 *wq->db = cpu_to_be32(wq->wqe_ctr); 135 135 } 136 136 137 - static inline u16 mlx5_wq_cyc_get_ctr_wrap_cnt(struct mlx5_wq_cyc *wq, u16 ctr) 138 - { 139 - return ctr >> wq->fbc.log_sz; 140 - } 141 - 142 137 static inline u16 mlx5_wq_cyc_ctr2ix(struct mlx5_wq_cyc *wq, u16 ctr) 143 138 { 144 139 return ctr & wq->fbc.sz_m1;
+48 -12
drivers/net/veth.c
··· 38 38 #define VETH_XDP_TX BIT(0) 39 39 #define VETH_XDP_REDIR BIT(1) 40 40 41 + #define VETH_XDP_TX_BULK_SIZE 16 42 + 41 43 struct veth_rq_stats { 42 44 u64 xdp_packets; 43 45 u64 xdp_bytes; ··· 64 62 struct bpf_prog *_xdp_prog; 65 63 struct veth_rq *rq; 66 64 unsigned int requested_headroom; 65 + }; 66 + 67 + struct veth_xdp_tx_bq { 68 + struct xdp_frame *q[VETH_XDP_TX_BULK_SIZE]; 69 + unsigned int count; 67 70 }; 68 71 69 72 /* ··· 449 442 return ret; 450 443 } 451 444 452 - static void veth_xdp_flush(struct net_device *dev) 445 + static void veth_xdp_flush_bq(struct net_device *dev, struct veth_xdp_tx_bq *bq) 446 + { 447 + int sent, i, err = 0; 448 + 449 + sent = veth_xdp_xmit(dev, bq->count, bq->q, 0); 450 + if (sent < 0) { 451 + err = sent; 452 + sent = 0; 453 + for (i = 0; i < bq->count; i++) 454 + xdp_return_frame(bq->q[i]); 455 + } 456 + trace_xdp_bulk_tx(dev, sent, bq->count - sent, err); 457 + 458 + bq->count = 0; 459 + } 460 + 461 + static void veth_xdp_flush(struct net_device *dev, struct veth_xdp_tx_bq *bq) 453 462 { 454 463 struct veth_priv *rcv_priv, *priv = netdev_priv(dev); 455 464 struct net_device *rcv; 456 465 struct veth_rq *rq; 457 466 458 467 rcu_read_lock(); 468 + veth_xdp_flush_bq(dev, bq); 459 469 rcv = rcu_dereference(priv->peer); 460 470 if (unlikely(!rcv)) 461 471 goto out; ··· 488 464 rcu_read_unlock(); 489 465 } 490 466 491 - static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp) 467 + static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp, 468 + struct veth_xdp_tx_bq *bq) 492 469 { 493 470 struct xdp_frame *frame = convert_to_xdp_frame(xdp); 494 471 495 472 if (unlikely(!frame)) 496 473 return -EOVERFLOW; 497 474 498 - return veth_xdp_xmit(dev, 1, &frame, 0); 475 + if (unlikely(bq->count == VETH_XDP_TX_BULK_SIZE)) 476 + veth_xdp_flush_bq(dev, bq); 477 + 478 + bq->q[bq->count++] = frame; 479 + 480 + return 0; 499 481 } 500 482 501 483 static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, 502 484 
struct xdp_frame *frame, 503 - unsigned int *xdp_xmit) 485 + unsigned int *xdp_xmit, 486 + struct veth_xdp_tx_bq *bq) 504 487 { 505 488 void *hard_start = frame->data - frame->headroom; 506 489 void *head = hard_start - sizeof(struct xdp_frame); ··· 540 509 orig_frame = *frame; 541 510 xdp.data_hard_start = head; 542 511 xdp.rxq->mem = frame->mem; 543 - if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) { 512 + if (unlikely(veth_xdp_tx(rq->dev, &xdp, bq) < 0)) { 544 513 trace_xdp_exception(rq->dev, xdp_prog, act); 545 514 frame = &orig_frame; 546 515 goto err_xdp; ··· 591 560 } 592 561 593 562 static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, 594 - unsigned int *xdp_xmit) 563 + unsigned int *xdp_xmit, 564 + struct veth_xdp_tx_bq *bq) 595 565 { 596 566 u32 pktlen, headroom, act, metalen; 597 567 void *orig_data, *orig_data_end; ··· 668 636 get_page(virt_to_page(xdp.data)); 669 637 consume_skb(skb); 670 638 xdp.rxq->mem = rq->xdp_mem; 671 - if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) { 639 + if (unlikely(veth_xdp_tx(rq->dev, &xdp, bq) < 0)) { 672 640 trace_xdp_exception(rq->dev, xdp_prog, act); 673 641 goto err_xdp; 674 642 } ··· 723 691 return NULL; 724 692 } 725 693 726 - static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit) 694 + static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit, 695 + struct veth_xdp_tx_bq *bq) 727 696 { 728 697 int i, done = 0, drops = 0, bytes = 0; 729 698 ··· 740 707 struct xdp_frame *frame = veth_ptr_to_xdp(ptr); 741 708 742 709 bytes += frame->len; 743 - skb = veth_xdp_rcv_one(rq, frame, &xdp_xmit_one); 710 + skb = veth_xdp_rcv_one(rq, frame, &xdp_xmit_one, bq); 744 711 } else { 745 712 skb = ptr; 746 713 bytes += skb->len; 747 - skb = veth_xdp_rcv_skb(rq, skb, &xdp_xmit_one); 714 + skb = veth_xdp_rcv_skb(rq, skb, &xdp_xmit_one, bq); 748 715 } 749 716 *xdp_xmit |= xdp_xmit_one; 750 717 ··· 770 737 struct veth_rq *rq = 771 738 container_of(napi, struct veth_rq, 
xdp_napi); 772 739 unsigned int xdp_xmit = 0; 740 + struct veth_xdp_tx_bq bq; 773 741 int done; 774 742 743 + bq.count = 0; 744 + 775 745 xdp_set_return_frame_no_direct(); 776 - done = veth_xdp_rcv(rq, budget, &xdp_xmit); 746 + done = veth_xdp_rcv(rq, budget, &xdp_xmit, &bq); 777 747 778 748 if (done < budget && napi_complete_done(napi, done)) { 779 749 /* Write rx_notify_masked before reading ptr_ring */ ··· 788 752 } 789 753 790 754 if (xdp_xmit & VETH_XDP_TX) 791 - veth_xdp_flush(rq->dev); 755 + veth_xdp_flush(rq->dev, &bq); 792 756 if (xdp_xmit & VETH_XDP_REDIR) 793 757 xdp_do_flush_map(); 794 758 xdp_clear_return_frame_no_direct();
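The veth.c change replaces one `ndo_xdp_xmit` call per XDP_TX frame with a 16-entry per-NAPI bulk queue that is flushed when it fills up and once more at the end of the poll. A userspace sketch of that accumulate-and-flush shape — the flush here only counts batches instead of transmitting frames:

```c
#define BULK_SIZE 16

struct tx_bq {
    int q[BULK_SIZE];
    unsigned int count;
    unsigned int flushes;   /* instrumentation for the sketch only */
};

static void bq_flush(struct tx_bq *bq)
{
    if (bq->count)
        bq->flushes++;      /* a real driver would xmit q[0..count) here */
    bq->count = 0;
}

/* Mirrors veth_xdp_tx(): flush first when the queue is already full,
 * then enqueue; frames are never sent one at a time. */
static void bq_enqueue(struct tx_bq *bq, int frame)
{
    if (bq->count == BULK_SIZE)
        bq_flush(bq);
    bq->q[bq->count++] = frame;
}
```

Batching amortizes the peer lookup and ptr_ring locking in `veth_xdp_xmit()` across up to 16 frames, which is the point of the patch.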
+45
include/linux/bpf-cgroup.h
··· 124 124 loff_t *ppos, void **new_buf, 125 125 enum bpf_attach_type type); 126 126 127 + int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int *level, 128 + int *optname, char __user *optval, 129 + int *optlen, char **kernel_optval); 130 + int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, 131 + int optname, char __user *optval, 132 + int __user *optlen, int max_optlen, 133 + int retval); 134 + 127 135 static inline enum bpf_cgroup_storage_type cgroup_storage_type( 128 136 struct bpf_map *map) 129 137 { ··· 294 286 __ret; \ 295 287 }) 296 288 289 + #define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \ 290 + kernel_optval) \ 291 + ({ \ 292 + int __ret = 0; \ 293 + if (cgroup_bpf_enabled) \ 294 + __ret = __cgroup_bpf_run_filter_setsockopt(sock, level, \ 295 + optname, optval, \ 296 + optlen, \ 297 + kernel_optval); \ 298 + __ret; \ 299 + }) 300 + 301 + #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) \ 302 + ({ \ 303 + int __ret = 0; \ 304 + if (cgroup_bpf_enabled) \ 305 + get_user(__ret, optlen); \ 306 + __ret; \ 307 + }) 308 + 309 + #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen, \ 310 + max_optlen, retval) \ 311 + ({ \ 312 + int __ret = retval; \ 313 + if (cgroup_bpf_enabled) \ 314 + __ret = __cgroup_bpf_run_filter_getsockopt(sock, level, \ 315 + optname, optval, \ 316 + optlen, max_optlen, \ 317 + retval); \ 318 + __ret; \ 319 + }) 320 + 297 321 int cgroup_bpf_prog_attach(const union bpf_attr *attr, 298 322 enum bpf_prog_type ptype, struct bpf_prog *prog); 299 323 int cgroup_bpf_prog_detach(const union bpf_attr *attr, ··· 397 357 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; }) 398 358 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; }) 399 359 #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; }) 360 + #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) ({ 0; }) 361 + #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, 
optval, \ 362 + optlen, max_optlen, retval) ({ retval; }) 363 + #define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \ 364 + kernel_optval) ({ 0; }) 400 365 401 366 #define for_each_cgroup_storage_type(stype) for (; false; ) 402 367
+2
include/linux/bpf.h
··· 518 518 struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags); 519 519 void bpf_prog_array_free(struct bpf_prog_array *progs); 520 520 int bpf_prog_array_length(struct bpf_prog_array *progs); 521 + bool bpf_prog_array_is_empty(struct bpf_prog_array *array); 521 522 int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs, 522 523 __u32 __user *prog_ids, u32 cnt); 523 524 ··· 1052 1051 extern const struct bpf_func_proto bpf_get_local_storage_proto; 1053 1052 extern const struct bpf_func_proto bpf_strtol_proto; 1054 1053 extern const struct bpf_func_proto bpf_strtoul_proto; 1054 + extern const struct bpf_func_proto bpf_tcp_sock_proto; 1055 1055 1056 1056 /* Shared helpers among cBPF and eBPF. */ 1057 1057 void bpf_user_rnd_init_once(void);
+1
include/linux/bpf_types.h
··· 30 30 #ifdef CONFIG_CGROUP_BPF 31 31 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) 32 32 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl) 33 + BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt) 33 34 #endif 34 35 #ifdef CONFIG_BPF_LIRC_MODE2 35 36 BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
+12 -1
include/linux/filter.h
··· 578 578 }; 579 579 580 580 struct bpf_redirect_info { 581 - u32 ifindex; 582 581 u32 flags; 582 + u32 tgt_index; 583 + void *tgt_value; 583 584 struct bpf_map *map; 584 585 struct bpf_map *map_to_flush; 585 586 u32 kern_flags; ··· 1198 1197 loff_t *ppos; 1199 1198 /* Temporary "register" for indirect stores to ppos. */ 1200 1199 u64 tmp_reg; 1200 + }; 1201 + 1202 + struct bpf_sockopt_kern { 1203 + struct sock *sk; 1204 + u8 *optval; 1205 + u8 *optval_end; 1206 + s32 level; 1207 + s32 optname; 1208 + s32 optlen; 1209 + s32 retval; 1201 1210 }; 1202 1211 1203 1212 #endif /* __LINUX_FILTER_H__ */
include/linux/list.h (+14)

···
 	WRITE_ONCE(prev->next, next);
 }

+/*
+ * Delete a list entry and clear the 'prev' pointer.
+ *
+ * This is a special-purpose list clearing method used in the networking code
+ * for lists allocated as per-cpu, where we don't want to incur the extra
+ * WRITE_ONCE() overhead of a regular list_del_init(). The code that uses this
+ * needs to check the node 'prev' pointer instead of calling list_empty().
+ */
+static inline void __list_del_clearprev(struct list_head *entry)
+{
+	__list_del(entry->prev, entry->next);
+	entry->prev = NULL;
+}
+
 /**
  * list_del - deletes entry from list.
  * @entry: the element to delete from the list.
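The new helper's contract is easy to model outside the kernel: a node whose `prev` pointer is NULL is "not on any list", so callers test that pointer instead of `list_empty()`. Below is a minimal userspace sketch of the pattern (an assumption-laden simplification: plain pointers, no `WRITE_ONCE()` barriers, and `INIT_LIST_HEAD`/`list_add`/`__list_del` are re-implemented locally rather than taken from kernel headers).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model of the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void __list_del(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

/* The helper added by this series: unlink the entry and mark it as
 * "not queued" by clearing ->prev. */
static void __list_del_clearprev(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->prev = NULL;
}

/* Callers check ->prev instead of list_empty(), as bq_enqueue() does
 * below in the cpumap/devmap changes. */
static int on_flush_list(const struct list_head *node)
{
	return node->prev != NULL;
}
```

The cost saved is small per operation, but these nodes sit on per-cpu flush lists touched on every XDP bulk enqueue, which is why the cheaper variant exists.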
include/net/tcp.h (+8)

···
 	return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }

+static inline void tcp_bpf_rtt(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTT_CB_FLAG))
+		tcp_call_bpf(sk, BPF_SOCK_OPS_RTT_CB, 0, NULL);
+}
+
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
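The RTT hook is opt-in: it only fires for sockets whose sock_ops program enabled `BPF_SOCK_OPS_RTT_CB_FLAG` via `bpf_sock_ops_cb_flags_set()`. A userspace model of the flag gate (flag values copied from the uapi additions below; `tcp_call_bpf()` is stubbed with a counter, since the real call dispatches into the attached BPF program):

```c
#include <assert.h>

/* Callback flag values, matching include/uapi/linux/bpf.h after this
 * series. */
#define BPF_SOCK_OPS_RTO_CB_FLAG	(1 << 0)
#define BPF_SOCK_OPS_RETRANS_CB_FLAG	(1 << 1)
#define BPF_SOCK_OPS_STATE_CB_FLAG	(1 << 2)
#define BPF_SOCK_OPS_RTT_CB_FLAG	(1 << 3)
#define BPF_SOCK_OPS_ALL_CB_FLAGS	0xF

static int rtt_cb_calls;

static void tcp_call_bpf_stub(void)	/* stands in for tcp_call_bpf() */
{
	rtt_cb_calls++;
}

/* Mirrors the logic of the new tcp_bpf_rtt() helper: call out only when
 * the program asked for RTT callbacks. */
static void tcp_bpf_rtt_model(unsigned int cb_flags)
{
	if (cb_flags & BPF_SOCK_OPS_RTT_CB_FLAG)
		tcp_call_bpf_stub();
}
```

Gating on a per-socket flag keeps the per-RTT cost at a single bit test for sockets that never opted in.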
include/net/xdp_sock.h (+24 -3)

···
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 /* Used from netdev driver */
+bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
 void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
-bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
+bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc);
 void xsk_umem_consume_tx_done(struct xdp_umem *umem);
 struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
 struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
···
 }

 /* Reuse-queue aware version of FILL queue helpers */
+static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem, u32 cnt)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	if (rq->length >= cnt)
+		return true;
+
+	return xsk_umem_has_addrs(umem, cnt - rq->length);
+}
+
 static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
 {
 	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
···
 	return false;
 }

+static inline bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt)
+{
+	return false;
+}
+
 static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
 {
 	return NULL;
···
 {
 }

-static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
-				       u32 *len)
+static inline bool xsk_umem_consume_tx(struct xdp_umem *umem,
+				       struct xdp_desc *desc)
 {
 	return false;
 }
···
 static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 {
 	return 0;
 }
+
+static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem,
+					 u32 cnt)
+{
+	return false;
+}

 static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
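The reuse-queue aware variant first counts addresses parked in the reuse queue and only asks the FILL ring for the shortfall. That arithmetic can be sketched in isolation (a simplified model: the FILL ring availability check `xsk_umem_has_addrs()` is stubbed as a plain counter comparison, and the struct names here are illustrative, not the kernel's):

```c
#include <assert.h>

/* Toy stand-in for the umem state that xsk_umem_has_addrs_rq() consults. */
struct umem_model {
	unsigned int fill_entries;	/* addresses available in the FILL ring */
	unsigned int rq_length;		/* addresses parked in the reuse queue */
};

static int has_addrs(const struct umem_model *u, unsigned int cnt)
{
	/* stands in for xsk_umem_has_addrs() */
	return u->fill_entries >= cnt;
}

/* Mirrors the new xsk_umem_has_addrs_rq(): satisfy the request from the
 * reuse queue first, then check the ring for the remainder. */
static int has_addrs_rq(const struct umem_model *u, unsigned int cnt)
{
	if (u->rq_length >= cnt)
		return 1;

	return has_addrs(u, cnt - u->rq_length);
}
```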
include/trace/events/xdp.h (+31 -3)

···
 		  __entry->ifindex)
 );

+TRACE_EVENT(xdp_bulk_tx,
+
+	TP_PROTO(const struct net_device *dev,
+		 int sent, int drops, int err),
+
+	TP_ARGS(dev, sent, drops, err),
+
+	TP_STRUCT__entry(
+		__field(int, ifindex)
+		__field(u32, act)
+		__field(int, drops)
+		__field(int, sent)
+		__field(int, err)
+	),
+
+	TP_fast_assign(
+		__entry->ifindex	= dev->ifindex;
+		__entry->act		= XDP_TX;
+		__entry->drops		= drops;
+		__entry->sent		= sent;
+		__entry->err		= err;
+	),
+
+	TP_printk("ifindex=%d action=%s sent=%d drops=%d err=%d",
+		  __entry->ifindex,
+		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
+		  __entry->sent, __entry->drops, __entry->err)
+);
+
 DECLARE_EVENT_CLASS(xdp_redirect_template,

 	TP_PROTO(const struct net_device *dev,
···
 #endif /* __DEVMAP_OBJ_TYPE */

 #define devmap_ifindex(fwd, map)				\
-	(!fwd ? 0 :						\
-	 ((map->map_type == BPF_MAP_TYPE_DEVMAP) ?		\
-	  ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0))
+	((map->map_type == BPF_MAP_TYPE_DEVMAP) ?		\
+	 ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)

 #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx)	\
	trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map),	\
include/uapi/linux/bpf.h (+30 -3)

···
 	BPF_PROG_TYPE_FLOW_DISSECTOR,
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 };

 enum bpf_attach_type {
···
 	BPF_CGROUP_SYSCTL,
 	BPF_CGROUP_UDP4_RECVMSG,
 	BPF_CGROUP_UDP6_RECVMSG,
+	BPF_CGROUP_GETSOCKOPT,
+	BPF_CGROUP_SETSOCKOPT,
 	__MAX_BPF_ATTACH_TYPE
 };

···
 *		but this is only implemented for native XDP (with driver
 *		support) as of this writing).
 *
-*		All values for *flags* are reserved for future usage, and must
-*		be left at zero.
+*		The lower two bits of *flags* are used as the return code if
+*		the map lookup fails. This is so that the return value can be
+*		one of the XDP program return codes up to XDP_TX, as chosen by
+*		the caller. Any higher bits in the *flags* argument must be
+*		unset.
 *
 *		When used to redirect packets to net devices, this helper
 *		provides a high performance increase over **bpf_redirect**\ ().
···
 *		  * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out)
 *		  * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
 *		  * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
+*		  * **BPF_SOCK_OPS_RTT_CB_FLAG** (every RTT)
 *
 *		Therefore, this function can be used to clear a callback flag by
 *		setting the appropriate bit to zero. e.g. to disable the RTO
···
 				 * sum(delta(snd_una)), or how many bytes
 				 * were acked.
 				 */
+	__u32 dsack_dups;	/* RFC4898 tcpEStatsStackDSACKDups
+				 * total number of DSACK blocks received
+				 */
+	__u32 delivered;	/* Total data packets delivered incl. rexmits */
+	__u32 delivered_ce;	/* Like the above but only ECE marked packets */
+	__u32 icsk_retransmits;	/* Number of unrecovered [RTO] timeouts */
 };

 struct bpf_sock_tuple {
···
 #define BPF_SOCK_OPS_RTO_CB_FLAG	(1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG	(1<<1)
 #define BPF_SOCK_OPS_STATE_CB_FLAG	(1<<2)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS	0x7		/* Mask of all currently
+#define BPF_SOCK_OPS_RTT_CB_FLAG	(1<<3)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS	0xF		/* Mask of all currently
 							 * supported cb flags
 							 */
···
 	 */
 	BPF_SOCK_OPS_TCP_LISTEN_CB,	/* Called on listen(2), right after
					 * socket transition to LISTEN state.
					 */
+	BPF_SOCK_OPS_RTT_CB,		/* Called on every RTT.
+					 */
 };

···
 	__u32 file_pos;		/* Sysctl file position to read from, write to.
				 * Allows 1,2,4-byte read an 4-byte write.
				 */
+};
+
+struct bpf_sockopt {
+	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(void *, optval);
+	__bpf_md_ptr(void *, optval_end);
+
+	__s32	level;
+	__s32	optname;
+	__s32	optlen;
+	__s32	retval;
 };

 #endif /* _UAPI__LINUX_BPF_H__ */
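The new `bpf_redirect_map()` flags contract can be captured in a few lines: on map lookup failure the helper returns the action encoded in the lower two bits, so the caller can choose any fallback up to `XDP_TX`. A simplified userspace model of that dispatch (an assumption-laden sketch, not the kernel helper; the real implementation also performs the lookup and redirect bookkeeping):

```c
#include <assert.h>

/* XDP return codes; values match include/uapi/linux/bpf.h. */
enum xdp_action {
	XDP_ABORTED = 0,
	XDP_DROP,
	XDP_PASS,
	XDP_TX,
	XDP_REDIRECT,
};

/* Model of the new flags semantics: higher bits must be unset, and on a
 * failed lookup the lower two bits become the caller-chosen fallback
 * action. */
static int redirect_map_result(int lookup_ok, unsigned int flags)
{
	if (flags & ~0x3u)		/* higher bits must be unset */
		return XDP_ABORTED;
	if (!lookup_ok)
		return (int)(flags & 0x3u);	/* fallback, up to XDP_TX */
	return XDP_REDIRECT;
}
```

A program can thus write `bpf_redirect_map(&map, idx, XDP_PASS)` and fall through to the stack when the index is empty, instead of aborting the packet.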
include/uapi/linux/if_xdp.h (+8)

···
 #define XDP_UMEM_FILL_RING		5
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
+#define XDP_OPTIONS			8

 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
···
 	__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
 	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
 };
+
+struct xdp_options {
+	__u32 flags;
+};
+
+/* Flags for the flags field of struct xdp_options */
+#define XDP_OPTIONS_ZEROCOPY (1 << 0)

 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			0
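With `XDP_OPTIONS`, an AF_XDP application can ask via `getsockopt(fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen)` whether the kernel actually negotiated zero-copy mode. The check itself is a bit test on the returned flags; a self-contained sketch (the struct and flag are copied from the uapi above, the socket call is omitted so the snippet stays standalone):

```c
#include <assert.h>

/* Mirrors the new uapi additions in include/uapi/linux/if_xdp.h. */
struct xdp_options {
	unsigned int flags;	/* __u32 in the uapi */
};

#define XDP_OPTIONS_ZEROCOPY (1 << 0)

/* After a successful getsockopt(fd, SOL_XDP, XDP_OPTIONS, ...), test
 * whether zero-copy mode is active on this socket. */
static int xsk_is_zerocopy(const struct xdp_options *opts)
{
	return !!(opts->flags & XDP_OPTIONS_ZEROCOPY);
}
```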
kernel/bpf/cgroup.c (+351 -1)

···
 #include <linux/bpf.h>
 #include <linux/bpf-cgroup.h>
 #include <net/sock.h>
+#include <net/bpf_sk_storage.h>
+
+#include "../cgroup/cgroup-internal.h"

 DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
 EXPORT_SYMBOL(cgroup_bpf_enabled_key);
···
 	struct bpf_prog_array *old_array;
 	unsigned int type;

+	mutex_lock(&cgroup_mutex);
+
 	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
 		struct list_head *progs = &cgrp->bpf.progs[type];
 		struct bpf_prog_list *pl, *tmp;
···
 		}
 		old_array = rcu_dereference_protected(
 				cgrp->bpf.effective[type],
-				percpu_ref_is_dying(&cgrp->bpf.refcnt));
+				lockdep_is_held(&cgroup_mutex));
 		bpf_prog_array_free(old_array);
 	}
+
+	mutex_unlock(&cgroup_mutex);

 	percpu_ref_exit(&cgrp->bpf.refcnt);
 	cgroup_put(cgrp);
···
 	css_for_each_descendant_pre(css, &cgrp->self) {
 		struct cgroup *desc = container_of(css, struct cgroup, self);

+		if (percpu_ref_is_zero(&desc->bpf.refcnt))
+			continue;
+
 		err = compute_effective_progs(desc, type, &desc->bpf.inactive);
 		if (err)
 			goto cleanup;
···
 	/* all allocations were successful. Activate all prog arrays */
 	css_for_each_descendant_pre(css, &cgrp->self) {
 		struct cgroup *desc = container_of(css, struct cgroup, self);
+
+		if (percpu_ref_is_zero(&desc->bpf.refcnt)) {
+			if (unlikely(desc->bpf.inactive)) {
+				bpf_prog_array_free(desc->bpf.inactive);
+				desc->bpf.inactive = NULL;
+			}
+			continue;
+		}

 		activate_effective_progs(desc, type, desc->bpf.inactive);
 		desc->bpf.inactive = NULL;
···
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);

+static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
+					     enum bpf_attach_type attach_type)
+{
+	struct bpf_prog_array *prog_array;
+	bool empty;
+
+	rcu_read_lock();
+	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
+	empty = bpf_prog_array_is_empty(prog_array);
+	rcu_read_unlock();
+
+	return empty;
+}
+
+static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
+{
+	if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0)
+		return -EINVAL;
+
+	ctx->optval = kzalloc(max_optlen, GFP_USER);
+	if (!ctx->optval)
+		return -ENOMEM;
+
+	ctx->optval_end = ctx->optval + max_optlen;
+	ctx->optlen = max_optlen;
+
+	return 0;
+}
+
+static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
+{
+	kfree(ctx->optval);
+}
+
+int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
+				       int *optname, char __user *optval,
+				       int *optlen, char **kernel_optval)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_sockopt_kern ctx = {
+		.sk = sk,
+		.level = *level,
+		.optname = *optname,
+	};
+	int ret;
+
+	/* Opportunistic check to see whether we have any BPF program
+	 * attached to the hook so we don't waste time allocating
+	 * memory and locking the socket.
+	 */
+	if (!cgroup_bpf_enabled ||
+	    __cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_SETSOCKOPT))
+		return 0;
+
+	ret = sockopt_alloc_buf(&ctx, *optlen);
+	if (ret)
+		return ret;
+
+	if (copy_from_user(ctx.optval, optval, *optlen) != 0) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	lock_sock(sk);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
+				 &ctx, BPF_PROG_RUN);
+	release_sock(sk);
+
+	if (!ret) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	if (ctx.optlen == -1) {
+		/* optlen set to -1, bypass kernel */
+		ret = 1;
+	} else if (ctx.optlen > *optlen || ctx.optlen < -1) {
+		/* optlen is out of bounds */
+		ret = -EFAULT;
+	} else {
+		/* optlen within bounds, run kernel handler */
+		ret = 0;
+
+		/* export any potential modifications */
+		*level = ctx.level;
+		*optname = ctx.optname;
+		*optlen = ctx.optlen;
+		*kernel_optval = ctx.optval;
+	}
+
+out:
+	if (ret)
+		sockopt_free_buf(&ctx);
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
+
+int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
+				       int optname, char __user *optval,
+				       int __user *optlen, int max_optlen,
+				       int retval)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_sockopt_kern ctx = {
+		.sk = sk,
+		.level = level,
+		.optname = optname,
+		.retval = retval,
+	};
+	int ret;
+
+	/* Opportunistic check to see whether we have any BPF program
+	 * attached to the hook so we don't waste time allocating
+	 * memory and locking the socket.
+	 */
+	if (!cgroup_bpf_enabled ||
+	    __cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_GETSOCKOPT))
+		return retval;
+
+	ret = sockopt_alloc_buf(&ctx, max_optlen);
+	if (ret)
+		return ret;
+
+	if (!retval) {
+		/* If kernel getsockopt finished successfully,
+		 * copy whatever was returned to the user back
+		 * into our temporary buffer. Set optlen to the
+		 * one that kernel returned as well to let
+		 * BPF programs inspect the value.
+		 */
+
+		if (get_user(ctx.optlen, optlen)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		if (ctx.optlen > max_optlen)
+			ctx.optlen = max_optlen;
+
+		if (copy_from_user(ctx.optval, optval, ctx.optlen) != 0) {
+			ret = -EFAULT;
+			goto out;
+		}
+	}
+
+	lock_sock(sk);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
+				 &ctx, BPF_PROG_RUN);
+	release_sock(sk);
+
+	if (!ret) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	if (ctx.optlen > max_optlen) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* BPF programs only allowed to set retval to 0, not some
+	 * arbitrary value.
+	 */
+	if (ctx.retval != 0 && ctx.retval != retval) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (copy_to_user(optval, ctx.optval, ctx.optlen) ||
+	    put_user(ctx.optlen, optlen)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = ctx.retval;
+
+out:
+	sockopt_free_buf(&ctx);
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
+
 static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
			      size_t *lenp)
 {
···
 };

 const struct bpf_prog_ops cg_sysctl_prog_ops = {
+};
+
+static const struct bpf_func_proto *
+cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_sk_storage_get:
+		return &bpf_sk_storage_get_proto;
+	case BPF_FUNC_sk_storage_delete:
+		return &bpf_sk_storage_delete_proto;
+#ifdef CONFIG_INET
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
+#endif
+	default:
+		return cgroup_base_func_proto(func_id, prog);
+	}
+}
+
+static bool cg_sockopt_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       const struct bpf_prog *prog,
+				       struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+
+	if (off < 0 || off >= sizeof(struct bpf_sockopt))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct bpf_sockopt, retval):
+			if (size != size_default)
+				return false;
+			return prog->expected_attach_type ==
+				BPF_CGROUP_GETSOCKOPT;
+		case offsetof(struct bpf_sockopt, optname):
+			/* fallthrough */
+		case offsetof(struct bpf_sockopt, level):
+			if (size != size_default)
+				return false;
+			return prog->expected_attach_type ==
+				BPF_CGROUP_SETSOCKOPT;
+		case offsetof(struct bpf_sockopt, optlen):
+			return size == size_default;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case offsetof(struct bpf_sockopt, sk):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_SOCKET;
+		break;
+	case offsetof(struct bpf_sockopt, optval):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct bpf_sockopt, optval_end):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	case offsetof(struct bpf_sockopt, retval):
+		if (size != size_default)
+			return false;
+		return prog->expected_attach_type == BPF_CGROUP_GETSOCKOPT;
+	default:
+		if (size != size_default)
+			return false;
+		break;
+	}
+	return true;
+}
+
+#define CG_SOCKOPT_ACCESS_FIELD(T, F)					\
+	T(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, F),			\
+	  si->dst_reg, si->src_reg,					\
+	  offsetof(struct bpf_sockopt_kern, F))
+
+static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
+					 const struct bpf_insn *si,
+					 struct bpf_insn *insn_buf,
+					 struct bpf_prog *prog,
+					 u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sockopt, sk):
+		*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, sk);
+		break;
+	case offsetof(struct bpf_sockopt, level):
+		if (type == BPF_WRITE)
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_STX_MEM, level);
+		else
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, level);
+		break;
+	case offsetof(struct bpf_sockopt, optname):
+		if (type == BPF_WRITE)
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_STX_MEM, optname);
+		else
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optname);
+		break;
+	case offsetof(struct bpf_sockopt, optlen):
+		if (type == BPF_WRITE)
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_STX_MEM, optlen);
+		else
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optlen);
+		break;
+	case offsetof(struct bpf_sockopt, retval):
+		if (type == BPF_WRITE)
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_STX_MEM, retval);
+		else
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, retval);
+		break;
+	case offsetof(struct bpf_sockopt, optval):
+		*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optval);
+		break;
+	case offsetof(struct bpf_sockopt, optval_end):
+		*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optval_end);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
+				   bool direct_write,
+				   const struct bpf_prog *prog)
+{
+	/* Nothing to do for sockopt argument. The data is kzalloc'ated.
+	 */
+	return 0;
+}
+
+const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
+	.get_func_proto		= cg_sockopt_func_proto,
+	.is_valid_access	= cg_sockopt_is_valid_access,
+	.convert_ctx_access	= cg_sockopt_convert_ctx_access,
+	.gen_prologue		= cg_sockopt_get_prologue,
+};
+
+const struct bpf_prog_ops cg_sockopt_prog_ops = {
 };
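The setsockopt path gives the BPF program three possible verdicts through the `optlen` it leaves behind: `-1` bypasses the kernel handler, a value within the user-supplied bounds lets the (possibly rewritten) option run through the kernel, and anything else is rejected. A userspace model of just that verdict logic (simplified: the real function also exports level/optname/optval and handles allocation and locking):

```c
#include <assert.h>
#include <errno.h>

/* Mirrors the optlen contract enforced by
 * __cgroup_bpf_run_filter_setsockopt() after the BPF program ran.
 * Returns 1 to bypass the kernel handler, 0 to run it, and -EFAULT when
 * the program left optlen out of bounds. */
static int setsockopt_optlen_verdict(int prog_optlen, int user_optlen)
{
	if (prog_optlen == -1)
		return 1;			/* bypass kernel */
	if (prog_optlen > user_optlen || prog_optlen < -1)
		return -EFAULT;			/* out of bounds */
	return 0;				/* run kernel handler */
}
```

Note the asymmetry with getsockopt, where the program may only shrink `optlen` and may only set `retval` back to 0, never invent a new error code.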
kernel/bpf/core.c (+10)

···
 	return cnt;
 }

+bool bpf_prog_array_is_empty(struct bpf_prog_array *array)
+{
+	struct bpf_prog_array_item *item;
+
+	for (item = array->items; item->prog; item++)
+		if (item->prog != &dummy_bpf_prog.prog)
+			return false;
+	return true;
+}

 static bool bpf_prog_array_copy_core(struct bpf_prog_array *array,
				     u32 *prog_ids,
···
 #include <linux/bpf_trace.h>

 EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
+EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx);
kernel/bpf/cpumap.c (+48 -57)

···

 /* General idea: XDP packets getting XDP redirected to another CPU,
  * will maximum be stored/queued for one driver ->poll() call. It is
- * guaranteed that setting flush bit and flush operation happen on
+ * guaranteed that queueing the frame and the flush operation happen on
  * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
  * which queue in bpf_cpu_map_entry contains packets.
  */

 #define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
+struct bpf_cpu_map_entry;
+struct bpf_cpu_map;
+
 struct xdp_bulk_queue {
 	void *q[CPU_MAP_BULK_SIZE];
+	struct list_head flush_node;
+	struct bpf_cpu_map_entry *obj;
 	unsigned int count;
 };
···

 	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
 	struct xdp_bulk_queue __percpu *bulkq;
+
+	struct bpf_cpu_map *cmap;

 	/* Queue with potential multi-producers, and single-consumer kthread */
 	struct ptr_ring *queue;
···
 	struct bpf_map map;
 	/* Below members specific for map type */
 	struct bpf_cpu_map_entry **cpu_map;
-	unsigned long __percpu *flush_needed;
+	struct list_head __percpu *flush_list;
 };

-static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
-			     struct xdp_bulk_queue *bq, bool in_napi_ctx);
-
-static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
-{
-	return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
-}
+static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx);

 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_cpu_map *cmap;
 	int err = -ENOMEM;
+	int ret, cpu;
 	u64 cost;
-	int ret;

 	if (!capable(CAP_SYS_ADMIN))
 		return ERR_PTR(-EPERM);
···
 	/* make sure page count doesn't overflow */
 	cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
-	cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
+	cost += sizeof(struct list_head) * num_possible_cpus();

 	/* Notice returns -EPERM on if map size is larger than memlock limit */
 	ret = bpf_map_charge_init(&cmap->map.memory, cost);
···
 		goto free_cmap;
 	}

-	/* A per cpu bitfield with a bit per possible CPU in map  */
-	cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
-					    __alignof__(unsigned long));
-	if (!cmap->flush_needed)
+	cmap->flush_list = alloc_percpu(struct list_head);
+	if (!cmap->flush_list)
 		goto free_charge;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(cmap->flush_list, cpu));

 	/* Alloc array for possible remote "destination" CPUs */
 	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
···

 	return &cmap->map;
 free_percpu:
-	free_percpu(cmap->flush_needed);
+	free_percpu(cmap->flush_list);
 free_charge:
 	bpf_map_charge_finish(&cmap->map.memory);
 free_cmap:
···
 {
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
 	struct bpf_cpu_map_entry *rcpu;
-	int numa, err;
+	struct xdp_bulk_queue *bq;
+	int numa, err, i;

 	/* Have map->numa_node, but choose node of redirect target CPU */
 	numa = cpu_to_node(cpu);
···
 					   sizeof(void *), gfp);
 	if (!rcpu->bulkq)
 		goto free_rcu;
+
+	for_each_possible_cpu(i) {
+		bq = per_cpu_ptr(rcpu->bulkq, i);
+		bq->obj = rcpu;
+	}

 	/* Alloc queue */
 	rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
···
 		struct xdp_bulk_queue *bq = per_cpu_ptr(rcpu->bulkq, cpu);

 		/* No concurrent bq_enqueue can run at this point */
-		bq_flush_to_queue(rcpu, bq, false);
+		bq_flush_to_queue(bq, false);
 	}
 	free_percpu(rcpu->bulkq);
 	/* Cannot kthread_stop() here, last put free rcpu resources */
···
 		rcpu = __cpu_map_entry_alloc(qsize, key_cpu, map->id);
 		if (!rcpu)
 			return -ENOMEM;
+		rcpu->cmap = cmap;
 	}
 	rcu_read_lock();
 	__cpu_map_entry_replace(cmap, key_cpu, rcpu);
···
 	synchronize_rcu();

 	/* To ensure all pending flush operations have completed wait for flush
-	 * bitmap to indicate all flush_needed bits to be zero on _all_ cpus.
-	 * Because the above synchronize_rcu() ensures the map is disconnected
-	 * from the program we can assume no new bits will be set.
+	 * list be empty on _all_ cpus. Because the above synchronize_rcu()
+	 * ensures the map is disconnected from the program we can assume no new
+	 * items will be added to the list.
 	 */
 	for_each_online_cpu(cpu) {
-		unsigned long *bitmap = per_cpu_ptr(cmap->flush_needed, cpu);
+		struct list_head *flush_list = per_cpu_ptr(cmap->flush_list, cpu);

-		while (!bitmap_empty(bitmap, cmap->map.max_entries))
+		while (!list_empty(flush_list))
 			cond_resched();
 	}
···
 		/* bq flush and cleanup happens after RCU graze-period */
 		__cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */
 	}
-	free_percpu(cmap->flush_needed);
+	free_percpu(cmap->flush_list);
 	bpf_map_area_free(cmap->cpu_map);
 	kfree(cmap);
 }
···
 	.map_check_btf	= map_check_no_btf,
 };

-static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
-			     struct xdp_bulk_queue *bq, bool in_napi_ctx)
+static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx)
 {
+	struct bpf_cpu_map_entry *rcpu = bq->obj;
 	unsigned int processed = 0, drops = 0;
 	const int to_cpu = rcpu->cpu;
 	struct ptr_ring *q;
···
 	bq->count = 0;
 	spin_unlock(&q->producer_lock);

+	__list_del_clearprev(&bq->flush_node);
+
 	/* Feedback loop via tracepoints */
 	trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu);
 	return 0;
···
  */
 static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
+	struct list_head *flush_list = this_cpu_ptr(rcpu->cmap->flush_list);
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);

 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
-		bq_flush_to_queue(rcpu, bq, true);
+		bq_flush_to_queue(bq, true);

 	/* Notice, xdp_buff/page MUST be queued here, long enough for
 	 * driver to code invoking us to finished, due to driver
···
 	 * operation, when completing napi->poll call.
 	 */
 	bq->q[bq->count++] = xdpf;
+
+	if (!bq->flush_node.prev)
+		list_add(&bq->flush_node, flush_list);
+
 	return 0;
 }
···
 	return 0;
 }

-void __cpu_map_insert_ctx(struct bpf_map *map, u32 bit)
-{
-	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
-	unsigned long *bitmap = this_cpu_ptr(cmap->flush_needed);
-
-	__set_bit(bit, bitmap);
-}
-
 void __cpu_map_flush(struct bpf_map *map)
 {
 	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
-	unsigned long *bitmap = this_cpu_ptr(cmap->flush_needed);
-	u32 bit;
+	struct list_head *flush_list = this_cpu_ptr(cmap->flush_list);
+	struct xdp_bulk_queue *bq, *tmp;

-	/* The napi->poll softirq makes sure __cpu_map_insert_ctx()
-	 * and __cpu_map_flush() happen on same CPU. Thus, the percpu
-	 * bitmap indicate which percpu bulkq have packets.
-	 */
-	for_each_set_bit(bit, bitmap, map->max_entries) {
-		struct bpf_cpu_map_entry *rcpu = READ_ONCE(cmap->cpu_map[bit]);
-		struct xdp_bulk_queue *bq;
-
-		/* This is possible if entry is removed by user space
-		 * between xdp redirect and flush op.
-		 */
-		if (unlikely(!rcpu))
-			continue;
-
-		__clear_bit(bit, bitmap);
-
-		/* Flush all frames in bulkq to real queue */
-		bq = this_cpu_ptr(rcpu->bulkq);
-		bq_flush_to_queue(rcpu, bq, true);
+	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+		bq_flush_to_queue(bq, true);

 		/* If already running, costs spin_lock_irqsave + smb_mb */
-		wake_up_process(rcpu->kthread);
+		wake_up_process(bq->obj->kthread);
 	}
 }
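The cpumap (and, below, devmap) change replaces the per-cpu "flush needed" bitmap with a per-cpu list of bulk queues: a queue parks itself on the list on its first enqueue and is unlinked when flushed, so the flush walk touches only queues that actually hold frames. A self-contained userspace sketch of that pattern (assumptions: single "cpu", a singly linked stand-in for `flush_node`, and a drain counter in place of the real `ptr_ring` hand-off):

```c
#include <assert.h>
#include <stddef.h>

#define BULK_SIZE 8

struct bulk_queue {
	void *q[BULK_SIZE];
	struct bulk_queue *next;	/* stand-in for flush_node */
	int on_list;			/* stand-in for flush_node.prev != NULL */
	unsigned int count;
	unsigned int flushed;		/* total items drained, for inspection */
};

struct flush_list {
	struct bulk_queue *head;
};

static void bq_drain(struct bulk_queue *bq)
{
	bq->flushed += bq->count;	/* real code pushes into a ptr_ring */
	bq->count = 0;
}

/* Mirrors bq_enqueue(): drain when full, then park the queue on the
 * flush list the first time it receives an item. */
static void bq_enqueue(struct flush_list *fl, struct bulk_queue *bq, void *item)
{
	if (bq->count == BULK_SIZE)
		bq_drain(bq);

	bq->q[bq->count++] = item;

	if (!bq->on_list) {
		bq->next = fl->head;
		fl->head = bq;
		bq->on_list = 1;
	}
}

/* Mirrors __cpu_map_flush(): walk only queues that hold frames and
 * unlink them, as __list_del_clearprev() does in the kernel. */
static void flush_all(struct flush_list *fl)
{
	struct bulk_queue *bq = fl->head;

	while (bq) {
		struct bulk_queue *next = bq->next;

		bq_drain(bq);
		bq->on_list = 0;
		bq = next;
	}
	fl->head = NULL;
}
```

Compared with the old bitmap, the flush walk no longer scans `max_entries` bits per map; its cost is proportional to the number of queues with pending frames.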
kernel/bpf/devmap.c (+52 -60)

···
 * datapath always has a valid copy. However, the datapath does a "flush"
 * operation that pushes any pending packets in the driver outside the RCU
 * critical section. Each bpf_dtab_netdev tracks these pending operations using
- * an atomic per-cpu bitmap. The bpf_dtab_netdev object will not be destroyed
- * until all bits are cleared indicating outstanding flush operations have
- * completed.
+ * a per-cpu flush list. The bpf_dtab_netdev object will not be destroyed until
+ * this list is empty, indicating outstanding flush operations have completed.
 *
 * BPF syscalls may race with BPF program calls on any of the update, delete
 * or lookup operations. As noted above the xchg() operation also keep the
···
	 (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)

 #define DEV_MAP_BULK_SIZE 16
+struct bpf_dtab_netdev;
+
 struct xdp_bulk_queue {
 	struct xdp_frame *q[DEV_MAP_BULK_SIZE];
+	struct list_head flush_node;
 	struct net_device *dev_rx;
+	struct bpf_dtab_netdev *obj;
 	unsigned int count;
 };
···
 struct bpf_dtab {
 	struct bpf_map map;
 	struct bpf_dtab_netdev **netdev_map;
-	unsigned long __percpu *flush_needed;
+	struct list_head __percpu *flush_list;
 	struct list_head list;
 };

 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);

-static u64 dev_map_bitmap_size(const union bpf_attr *attr)
-{
-	return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long);
-}
-
 static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_dtab *dtab;
+	int err, cpu;
 	u64 cost;
-	int err;

 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
···
	    attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);

+	/* Lookup returns a pointer straight to dev->ifindex, so make sure the
+	 * verifier prevents writes from the BPF side
+	 */
+	attr->map_flags |= BPF_F_RDONLY_PROG;
+
 	dtab = kzalloc(sizeof(*dtab), GFP_USER);
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
···
 	/* make sure page count doesn't overflow */
 	cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
-	cost += dev_map_bitmap_size(attr) * num_possible_cpus();
+	cost += sizeof(struct list_head) * num_possible_cpus();

 	/* if map size is larger than memlock limit, reject it */
 	err = bpf_map_charge_init(&dtab->map.memory, cost);
···
 	err = -ENOMEM;

-	/* A per cpu bitfield with a bit per possible net device */
-	dtab->flush_needed = __alloc_percpu_gfp(dev_map_bitmap_size(attr),
-						__alignof__(unsigned long),
-						GFP_KERNEL | __GFP_NOWARN);
-	if (!dtab->flush_needed)
+	dtab->flush_list = alloc_percpu(struct list_head);
+	if (!dtab->flush_list)
 		goto free_charge;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(dtab->flush_list, cpu));

 	dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
					      sizeof(struct bpf_dtab_netdev *),
					      dtab->map.numa_node);
 	if (!dtab->netdev_map)
-		goto free_charge;
+		goto free_percpu;

 	spin_lock(&dev_map_lock);
 	list_add_tail_rcu(&dtab->list, &dev_map_list);
 	spin_unlock(&dev_map_lock);

 	return &dtab->map;
+
+free_percpu:
+	free_percpu(dtab->flush_list);
 free_charge:
 	bpf_map_charge_finish(&dtab->map.memory);
 free_dtab:
-	free_percpu(dtab->flush_needed);
 	kfree(dtab);
 	return ERR_PTR(err);
 }
···
 	rcu_barrier();

 	/* To ensure all pending flush operations have completed wait for flush
-	 * bitmap to indicate all flush_needed bits to be zero on _all_ cpus.
+	 * list to empty on _all_ cpus.
 	 * Because the above synchronize_rcu() ensures the map is disconnected
-	 * from the program we can assume no new bits will be set.
+	 * from the program we can assume no new items will be added.
 	 */
 	for_each_online_cpu(cpu) {
-		unsigned long *bitmap = per_cpu_ptr(dtab->flush_needed, cpu);
+		struct list_head *flush_list = per_cpu_ptr(dtab->flush_list, cpu);

-		while (!bitmap_empty(bitmap, dtab->map.max_entries))
+		while (!list_empty(flush_list))
 			cond_resched();
 	}
···
 		kfree(dev);
 	}

-	free_percpu(dtab->flush_needed);
+	free_percpu(dtab->flush_list);
 	bpf_map_area_free(dtab->netdev_map);
 	kfree(dtab);
 }
···
 	return 0;
 }

-void __dev_map_insert_ctx(struct bpf_map *map, u32 bit)
-{
-	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
-	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
-
-	__set_bit(bit, bitmap);
-}
-
-static int bq_xmit_all(struct bpf_dtab_netdev *obj,
-		       struct xdp_bulk_queue *bq, u32 flags,
+static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags,
		       bool in_napi_ctx)
 {
+	struct bpf_dtab_netdev *obj = bq->obj;
 	struct net_device *dev = obj->dev;
 	int sent = 0, drops = 0, err = 0;
 	int i;
···
 	trace_xdp_devmap_xmit(&obj->dtab->map, obj->bit,
			      sent, drops, bq->dev_rx, dev, err);
 	bq->dev_rx = NULL;
+	__list_del_clearprev(&bq->flush_node);
 	return 0;
 error:
 	/* If ndo_xdp_xmit fails with an errno, no frames have been
···
 * from the driver before returning from its napi->poll() routine. The poll()
 * routine is called either from busy_poll context or net_rx_action signaled
 * from NET_RX_SOFTIRQ. Either way the poll routine must complete before the
- * net device can be torn down. On devmap tear down we ensure the ctx bitmap
- * is zeroed before completing to ensure all flush operations have completed.
+ * net device can be torn down. On devmap tear down we ensure the flush list
+ * is empty before completing to ensure all flush operations have completed.
 */
 void __dev_map_flush(struct bpf_map *map)
 {
 	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
-	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
-	u32 bit;
+	struct list_head *flush_list = this_cpu_ptr(dtab->flush_list);
+	struct xdp_bulk_queue *bq, *tmp;

 	rcu_read_lock();
-	for_each_set_bit(bit, bitmap, map->max_entries) {
-		struct bpf_dtab_netdev *dev = READ_ONCE(dtab->netdev_map[bit]);
-		struct xdp_bulk_queue *bq;
-
-		/* This is possible if the dev entry is removed by user space
-		 * between xdp redirect and flush op.
-		 */
-		if (unlikely(!dev))
-			continue;
-
-		bq = this_cpu_ptr(dev->bulkq);
-		bq_xmit_all(dev, bq, XDP_XMIT_FLUSH, true);
-
-		__clear_bit(bit, bitmap);
-	}
+	list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
+		bq_xmit_all(bq, XDP_XMIT_FLUSH, true);
 	rcu_read_unlock();
 }
···
		      struct net_device *dev_rx)

 {
+	struct list_head *flush_list = this_cpu_ptr(obj->dtab->flush_list);
 	struct xdp_bulk_queue *bq = this_cpu_ptr(obj->bulkq);

 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
-		bq_xmit_all(obj, bq, 0, true);
+		bq_xmit_all(bq, 0, true);

 	/* Ingress dev_rx will be the same for all xdp_frame's in
 	 * bulk_queue, because bq stored per-CPU and must be flushed
···
 	bq->dev_rx = dev_rx;

 	bq->q[bq->count++] = xdpf;
+
+	if (!bq->flush_node.prev)
+		list_add(&bq->flush_node, flush_list);
+
 	return 0;
 }
···
 {
 	if
(dev->dev->netdev_ops->ndo_xdp_xmit) { 369 379 struct xdp_bulk_queue *bq; 370 - unsigned long *bitmap; 371 - 372 380 int cpu; 373 381 374 382 rcu_read_lock(); 375 383 for_each_online_cpu(cpu) { 376 - bitmap = per_cpu_ptr(dev->dtab->flush_needed, cpu); 377 - __clear_bit(dev->bit, bitmap); 378 - 379 384 bq = per_cpu_ptr(dev->bulkq, cpu); 380 - bq_xmit_all(dev, bq, XDP_XMIT_FLUSH, false); 385 + bq_xmit_all(bq, XDP_XMIT_FLUSH, false); 381 386 } 382 387 rcu_read_unlock(); 383 388 } ··· 419 434 struct net *net = current->nsproxy->net_ns; 420 435 gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN; 421 436 struct bpf_dtab_netdev *dev, *old_dev; 422 - u32 i = *(u32 *)key; 423 437 u32 ifindex = *(u32 *)value; 438 + struct xdp_bulk_queue *bq; 439 + u32 i = *(u32 *)key; 440 + int cpu; 424 441 425 442 if (unlikely(map_flags > BPF_EXIST)) 426 443 return -EINVAL; ··· 443 456 if (!dev->bulkq) { 444 457 kfree(dev); 445 458 return -ENOMEM; 459 + } 460 + 461 + for_each_possible_cpu(cpu) { 462 + bq = per_cpu_ptr(dev->bulkq, cpu); 463 + bq->obj = dev; 446 464 } 447 465 448 466 dev->dev = dev_get_by_index(net, ifindex);
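The devmap hunks above replace the per-map flush bitmap with a per-CPU list of bulk queues: a queue links itself onto the flush list the first time a frame lands in it (a NULL `->prev` pointer marks "not queued"), and the flush path unlinks it again with `__list_del_clearprev()`. A minimal userspace sketch of that pattern, with simplified stand-ins for `list_head`, the per-CPU list, and the kernel helpers (all names here are illustrative, not the kernel's implementation):

```c
#include <stddef.h>

/* Minimal intrusive list, modeled loosely on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h->prev = h; }

static void list_add_sketch(struct list_head *n, struct list_head *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}

/* Unlink and clear ->prev so "already on a flush list" stays cheap to test. */
static void list_del_clearprev_sketch(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = NULL;
}

struct bulk_queue {
	struct list_head flush_node;	/* prev == NULL means "not queued" */
	int count;
};

static struct list_head flush_list;	/* stands in for the per-CPU list */

static void bq_enqueue_sketch(struct bulk_queue *bq)
{
	bq->count++;
	/* Only link the queue the first time a frame lands in it. */
	if (!bq->flush_node.prev)
		list_add_sketch(&bq->flush_node, &flush_list);
}

static int flush_all_sketch(void)
{
	int flushed = 0;

	while (flush_list.next != &flush_list) {
		struct bulk_queue *bq = (struct bulk_queue *)
			((char *)flush_list.next -
			 offsetof(struct bulk_queue, flush_node));
		bq->count = 0;	/* real code would xmit the queued frames */
		list_del_clearprev_sketch(&bq->flush_node);
		flushed++;
	}
	return flushed;
}
```

This is why `dev_map_free()` can simply spin until each per-CPU list is empty: an empty list means no bulk queue still has a pending flush.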
+19
kernel/bpf/syscall.c
··· 1590 1590 default: 1591 1591 return -EINVAL; 1592 1592 } 1593 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 1594 + switch (expected_attach_type) { 1595 + case BPF_CGROUP_SETSOCKOPT: 1596 + case BPF_CGROUP_GETSOCKOPT: 1597 + return 0; 1598 + default: 1599 + return -EINVAL; 1600 + } 1593 1601 default: 1594 1602 return 0; 1595 1603 } ··· 1848 1840 switch (prog->type) { 1849 1841 case BPF_PROG_TYPE_CGROUP_SOCK: 1850 1842 case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: 1843 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 1851 1844 return attach_type == prog->expected_attach_type ? 0 : -EINVAL; 1852 1845 case BPF_PROG_TYPE_CGROUP_SKB: 1853 1846 return prog->enforce_expected_attach_type && ··· 1920 1911 break; 1921 1912 case BPF_CGROUP_SYSCTL: 1922 1913 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL; 1914 + break; 1915 + case BPF_CGROUP_GETSOCKOPT: 1916 + case BPF_CGROUP_SETSOCKOPT: 1917 + ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT; 1923 1918 break; 1924 1919 default: 1925 1920 return -EINVAL; ··· 2008 1995 case BPF_CGROUP_SYSCTL: 2009 1996 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL; 2010 1997 break; 1998 + case BPF_CGROUP_GETSOCKOPT: 1999 + case BPF_CGROUP_SETSOCKOPT: 2000 + ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT; 2001 + break; 2011 2002 default: 2012 2003 return -EINVAL; 2013 2004 } ··· 2048 2031 case BPF_CGROUP_SOCK_OPS: 2049 2032 case BPF_CGROUP_DEVICE: 2050 2033 case BPF_CGROUP_SYSCTL: 2034 + case BPF_CGROUP_GETSOCKOPT: 2035 + case BPF_CGROUP_SETSOCKOPT: 2051 2036 break; 2052 2037 case BPF_LIRC_MODE2: 2053 2038 return lirc_prog_query(attr, uattr);
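The syscall.c hunks above wire the new BPF_PROG_TYPE_CGROUP_SOCKOPT program type into attach, detach and query, routing both the get- and setsockopt attach points to the one program type. A small restatement of that mapping, using stand-in enum values rather than the real UAPI constants:

```c
/* Illustrative only: these enums are stand-ins for the UAPI constants. */
enum cg_attach { CG_ATTACH_GETSOCKOPT, CG_ATTACH_SETSOCKOPT,
		 CG_ATTACH_SYSCTL, CG_ATTACH_OTHER };
enum cg_prog   { CG_PROG_SOCKOPT, CG_PROG_SYSCTL, CG_PROG_INVALID };

static enum cg_prog attach_to_prog(enum cg_attach a)
{
	switch (a) {
	case CG_ATTACH_GETSOCKOPT:
	case CG_ATTACH_SETSOCKOPT:
		return CG_PROG_SOCKOPT;	/* both sockopt hooks share one type */
	case CG_ATTACH_SYSCTL:
		return CG_PROG_SYSCTL;
	default:
		return CG_PROG_INVALID;
	}
}
```

The same shape appears three times in the patch (attach, detach, query), which is why each hunk adds the identical pair of `case` labels.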
+117 -19
kernel/bpf/verifier.c
··· 1659 1659 } 1660 1660 } 1661 1661 1662 - static int mark_chain_precision(struct bpf_verifier_env *env, int regno) 1662 + static int __mark_chain_precision(struct bpf_verifier_env *env, int regno, 1663 + int spi) 1663 1664 { 1664 1665 struct bpf_verifier_state *st = env->cur_state; 1665 1666 int first_idx = st->first_insn_idx; 1666 1667 int last_idx = env->insn_idx; 1667 1668 struct bpf_func_state *func; 1668 1669 struct bpf_reg_state *reg; 1669 - u32 reg_mask = 1u << regno; 1670 - u64 stack_mask = 0; 1670 + u32 reg_mask = regno >= 0 ? 1u << regno : 0; 1671 + u64 stack_mask = spi >= 0 ? 1ull << spi : 0; 1671 1672 bool skip_first = true; 1673 + bool new_marks = false; 1672 1674 int i, err; 1673 1675 1674 1676 if (!env->allow_ptr_leaks) ··· 1678 1676 return 0; 1679 1677 1680 1678 func = st->frame[st->curframe]; 1681 - reg = &func->regs[regno]; 1682 - if (reg->type != SCALAR_VALUE) { 1683 - WARN_ONCE(1, "backtracing misuse"); 1684 - return -EFAULT; 1679 + if (regno >= 0) { 1680 + reg = &func->regs[regno]; 1681 + if (reg->type != SCALAR_VALUE) { 1682 + WARN_ONCE(1, "backtracing misuse"); 1683 + return -EFAULT; 1684 + } 1685 + if (!reg->precise) 1686 + new_marks = true; 1687 + else 1688 + reg_mask = 0; 1689 + reg->precise = true; 1685 1690 } 1686 - if (reg->precise) 1687 - return 0; 1688 - func->regs[regno].precise = true; 1689 1691 1692 + while (spi >= 0) { 1693 + if (func->stack[spi].slot_type[0] != STACK_SPILL) { 1694 + stack_mask = 0; 1695 + break; 1696 + } 1697 + reg = &func->stack[spi].spilled_ptr; 1698 + if (reg->type != SCALAR_VALUE) { 1699 + stack_mask = 0; 1700 + break; 1701 + } 1702 + if (!reg->precise) 1703 + new_marks = true; 1704 + else 1705 + stack_mask = 0; 1706 + reg->precise = true; 1707 + break; 1708 + } 1709 + 1710 + if (!new_marks) 1711 + return 0; 1712 + if (!reg_mask && !stack_mask) 1713 + return 0; 1690 1714 for (;;) { 1691 1715 DECLARE_BITMAP(mask, 64); 1692 - bool new_marks = false; 1693 1716 u32 history = st->jmp_history_cnt; 1694 1717 1695 
1718 if (env->log.level & BPF_LOG_LEVEL) ··· 1757 1730 if (!st) 1758 1731 break; 1759 1732 1733 + new_marks = false; 1760 1734 func = st->frame[st->curframe]; 1761 1735 bitmap_from_u64(mask, reg_mask); 1762 1736 for_each_set_bit(i, mask, 32) { 1763 1737 reg = &func->regs[i]; 1764 - if (reg->type != SCALAR_VALUE) 1738 + if (reg->type != SCALAR_VALUE) { 1739 + reg_mask &= ~(1u << i); 1765 1740 continue; 1741 + } 1766 1742 if (!reg->precise) 1767 1743 new_marks = true; 1768 1744 reg->precise = true; ··· 1786 1756 return -EFAULT; 1787 1757 } 1788 1758 1789 - if (func->stack[i].slot_type[0] != STACK_SPILL) 1759 + if (func->stack[i].slot_type[0] != STACK_SPILL) { 1760 + stack_mask &= ~(1ull << i); 1790 1761 continue; 1762 + } 1791 1763 reg = &func->stack[i].spilled_ptr; 1792 - if (reg->type != SCALAR_VALUE) 1764 + if (reg->type != SCALAR_VALUE) { 1765 + stack_mask &= ~(1ull << i); 1793 1766 continue; 1767 + } 1794 1768 if (!reg->precise) 1795 1769 new_marks = true; 1796 1770 reg->precise = true; ··· 1806 1772 reg_mask, stack_mask); 1807 1773 } 1808 1774 1775 + if (!reg_mask && !stack_mask) 1776 + break; 1809 1777 if (!new_marks) 1810 1778 break; 1811 1779 ··· 1817 1781 return 0; 1818 1782 } 1819 1783 1784 + static int mark_chain_precision(struct bpf_verifier_env *env, int regno) 1785 + { 1786 + return __mark_chain_precision(env, regno, -1); 1787 + } 1788 + 1789 + static int mark_chain_precision_stack(struct bpf_verifier_env *env, int spi) 1790 + { 1791 + return __mark_chain_precision(env, -1, spi); 1792 + } 1820 1793 1821 1794 static bool is_spillable_regtype(enum bpf_reg_type type) 1822 1795 { ··· 2260 2215 2261 2216 env->seen_direct_write = true; 2262 2217 return true; 2218 + 2219 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 2220 + if (t == BPF_WRITE) 2221 + env->seen_direct_write = true; 2222 + 2223 + return true; 2224 + 2263 2225 default: 2264 2226 return false; 2265 2227 } ··· 3459 3407 if (func_id != BPF_FUNC_get_local_storage) 3460 3408 goto error; 3461 3409 break; 3462 - 
/* devmap returns a pointer to a live net_device ifindex that we cannot 3463 - * allow to be modified from bpf side. So do not allow lookup elements 3464 - * for now. 3465 - */ 3466 3410 case BPF_MAP_TYPE_DEVMAP: 3467 - if (func_id != BPF_FUNC_redirect_map) 3411 + if (func_id != BPF_FUNC_redirect_map && 3412 + func_id != BPF_FUNC_map_lookup_elem) 3468 3413 goto error; 3469 3414 break; 3470 3415 /* Restrict bpf side of cpumap and xskmap, open when use-cases ··· 6115 6066 case BPF_PROG_TYPE_SOCK_OPS: 6116 6067 case BPF_PROG_TYPE_CGROUP_DEVICE: 6117 6068 case BPF_PROG_TYPE_CGROUP_SYSCTL: 6069 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 6118 6070 break; 6119 6071 default: 6120 6072 return 0; ··· 7156 7106 return 0; 7157 7107 } 7158 7108 7109 + /* find precise scalars in the previous equivalent state and 7110 + * propagate them into the current state 7111 + */ 7112 + static int propagate_precision(struct bpf_verifier_env *env, 7113 + const struct bpf_verifier_state *old) 7114 + { 7115 + struct bpf_reg_state *state_reg; 7116 + struct bpf_func_state *state; 7117 + int i, err = 0; 7118 + 7119 + state = old->frame[old->curframe]; 7120 + state_reg = state->regs; 7121 + for (i = 0; i < BPF_REG_FP; i++, state_reg++) { 7122 + if (state_reg->type != SCALAR_VALUE || 7123 + !state_reg->precise) 7124 + continue; 7125 + if (env->log.level & BPF_LOG_LEVEL2) 7126 + verbose(env, "propagating r%d\n", i); 7127 + err = mark_chain_precision(env, i); 7128 + if (err < 0) 7129 + return err; 7130 + } 7131 + 7132 + for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) { 7133 + if (state->stack[i].slot_type[0] != STACK_SPILL) 7134 + continue; 7135 + state_reg = &state->stack[i].spilled_ptr; 7136 + if (state_reg->type != SCALAR_VALUE || 7137 + !state_reg->precise) 7138 + continue; 7139 + if (env->log.level & BPF_LOG_LEVEL2) 7140 + verbose(env, "propagating fp%d\n", 7141 + (-i - 1) * BPF_REG_SIZE); 7142 + err = mark_chain_precision_stack(env, i); 7143 + if (err < 0) 7144 + return err; 7145 + } 7146 
+ return 0; 7147 + } 7148 + 7159 7149 static bool states_maybe_looping(struct bpf_verifier_state *old, 7160 7150 struct bpf_verifier_state *cur) 7161 7151 { ··· 7288 7198 * this state and will pop a new one. 7289 7199 */ 7290 7200 err = propagate_liveness(env, &sl->state, cur); 7201 + 7202 + /* if previous state reached the exit with precision and 7203 + * current state is equivalent to it (except precision marks) 7204 + * the precision needs to be propagated back in 7205 + * the current state. 7206 + */ 7207 + err = err ? : push_jmp_history(env, cur); 7208 + err = err ? : propagate_precision(env, &sl->state); 7291 7209 if (err) 7292 7210 return err; 7293 7211 return 1;
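The reworked `__mark_chain_precision()` above tracks which registers and spill slots still need precision marks with a u32 register mask and a u64 stack-slot mask, dropping a bit as soon as the corresponding slot turns out not to hold a scalar, and stopping when both masks are empty. A small sketch of just that bookkeeping (struct and function names are illustrative):

```c
#include <stdint.h>

/* One bit per register (u32) and one per stack spill slot (u64),
 * mirroring reg_mask/stack_mask in the hunk above. */
struct precision_masks {
	uint32_t reg_mask;
	uint64_t stack_mask;
};

static struct precision_masks masks_init(int regno, int spi)
{
	struct precision_masks m = {
		.reg_mask   = regno >= 0 ? 1u << regno : 0,
		.stack_mask = spi   >= 0 ? 1ull << spi : 0,
	};
	return m;
}

/* A slot that turns out not to hold a scalar is dropped from tracking. */
static void masks_drop_slot(struct precision_masks *m, int spi)
{
	m->stack_mask &= ~(1ull << spi);
}

static int masks_done(const struct precision_masks *m)
{
	return !m->reg_mask && !m->stack_mask;
}
```

Passing `regno = -1` or `spi = -1` yields an empty mask, which is how the two thin wrappers `mark_chain_precision()` and `mark_chain_precision_stack()` select register-only or stack-only backtracking.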
+1 -2
kernel/bpf/xskmap.c
··· 145 145 146 146 list_for_each_entry_safe(xs, tmp, flush_list, flush_node) { 147 147 xsk_flush(xs); 148 - __list_del(xs->flush_node.prev, xs->flush_node.next); 149 - xs->flush_node.prev = NULL; 148 + __list_del_clearprev(&xs->flush_node); 150 149 } 151 150 } 152 151
+14 -13
kernel/trace/bpf_trace.c
··· 1431 1431 return err; 1432 1432 } 1433 1433 1434 + static int __init send_signal_irq_work_init(void) 1435 + { 1436 + int cpu; 1437 + struct send_signal_irq_work *work; 1438 + 1439 + for_each_possible_cpu(cpu) { 1440 + work = per_cpu_ptr(&send_signal_work, cpu); 1441 + init_irq_work(&work->irq_work, do_bpf_send_signal); 1442 + } 1443 + return 0; 1444 + } 1445 + 1446 + subsys_initcall(send_signal_irq_work_init); 1447 + 1434 1448 #ifdef CONFIG_MODULES 1435 1449 static int bpf_event_notify(struct notifier_block *nb, unsigned long op, 1436 1450 void *module) ··· 1492 1478 return 0; 1493 1479 } 1494 1480 1495 - static int __init send_signal_irq_work_init(void) 1496 - { 1497 - int cpu; 1498 - struct send_signal_irq_work *work; 1499 - 1500 - for_each_possible_cpu(cpu) { 1501 - work = per_cpu_ptr(&send_signal_work, cpu); 1502 - init_irq_work(&work->irq_work, do_bpf_send_signal); 1503 - } 1504 - return 0; 1505 - } 1506 - 1507 1481 fs_initcall(bpf_event_init); 1508 - subsys_initcall(send_signal_irq_work_init); 1509 1482 #endif /* CONFIG_MODULES */
+185 -84
net/core/filter.c
··· 2158 2158 if (unlikely(flags & ~(BPF_F_INGRESS))) 2159 2159 return TC_ACT_SHOT; 2160 2160 2161 - ri->ifindex = ifindex; 2162 2161 ri->flags = flags; 2162 + ri->tgt_index = ifindex; 2163 2163 2164 2164 return TC_ACT_REDIRECT; 2165 2165 } ··· 2169 2169 struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); 2170 2170 struct net_device *dev; 2171 2171 2172 - dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); 2173 - ri->ifindex = 0; 2172 + dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->tgt_index); 2173 + ri->tgt_index = 0; 2174 2174 if (unlikely(!dev)) { 2175 2175 kfree_skb(skb); 2176 2176 return -EINVAL; ··· 3488 3488 struct bpf_prog *xdp_prog, struct bpf_redirect_info *ri) 3489 3489 { 3490 3490 struct net_device *fwd; 3491 - u32 index = ri->ifindex; 3491 + u32 index = ri->tgt_index; 3492 3492 int err; 3493 3493 3494 3494 fwd = dev_get_by_index_rcu(dev_net(dev), index); 3495 - ri->ifindex = 0; 3495 + ri->tgt_index = 0; 3496 3496 if (unlikely(!fwd)) { 3497 3497 err = -EINVAL; 3498 3498 goto err; ··· 3523 3523 err = dev_map_enqueue(dst, xdp, dev_rx); 3524 3524 if (unlikely(err)) 3525 3525 return err; 3526 - __dev_map_insert_ctx(map, index); 3527 3526 break; 3528 3527 } 3529 3528 case BPF_MAP_TYPE_CPUMAP: { ··· 3531 3532 err = cpu_map_enqueue(rcpu, xdp, dev_rx); 3532 3533 if (unlikely(err)) 3533 3534 return err; 3534 - __cpu_map_insert_ctx(map, index); 3535 3535 break; 3536 3536 } 3537 3537 case BPF_MAP_TYPE_XSKMAP: { ··· 3604 3606 struct bpf_prog *xdp_prog, struct bpf_map *map, 3605 3607 struct bpf_redirect_info *ri) 3606 3608 { 3607 - u32 index = ri->ifindex; 3608 - void *fwd = NULL; 3609 + u32 index = ri->tgt_index; 3610 + void *fwd = ri->tgt_value; 3609 3611 int err; 3610 3612 3611 - ri->ifindex = 0; 3613 + ri->tgt_index = 0; 3614 + ri->tgt_value = NULL; 3612 3615 WRITE_ONCE(ri->map, NULL); 3613 3616 3614 - fwd = __xdp_map_lookup_elem(map, index); 3615 - if (unlikely(!fwd)) { 3616 - err = -EINVAL; 3617 - goto err; 3618 - } 3619 3617 if 
(ri->map_to_flush && unlikely(ri->map_to_flush != map)) 3620 3618 xdp_do_flush_map(); 3621 3619 ··· 3647 3653 struct bpf_map *map) 3648 3654 { 3649 3655 struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); 3650 - u32 index = ri->ifindex; 3651 - void *fwd = NULL; 3656 + u32 index = ri->tgt_index; 3657 + void *fwd = ri->tgt_value; 3652 3658 int err = 0; 3653 3659 3654 - ri->ifindex = 0; 3660 + ri->tgt_index = 0; 3661 + ri->tgt_value = NULL; 3655 3662 WRITE_ONCE(ri->map, NULL); 3656 - 3657 - fwd = __xdp_map_lookup_elem(map, index); 3658 - if (unlikely(!fwd)) { 3659 - err = -EINVAL; 3660 - goto err; 3661 - } 3662 3663 3663 3664 if (map->map_type == BPF_MAP_TYPE_DEVMAP) { 3664 3665 struct bpf_dtab_netdev *dst = fwd; ··· 3686 3697 { 3687 3698 struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); 3688 3699 struct bpf_map *map = READ_ONCE(ri->map); 3689 - u32 index = ri->ifindex; 3700 + u32 index = ri->tgt_index; 3690 3701 struct net_device *fwd; 3691 3702 int err = 0; 3692 3703 3693 3704 if (map) 3694 3705 return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog, 3695 3706 map); 3696 - ri->ifindex = 0; 3707 + ri->tgt_index = 0; 3697 3708 fwd = dev_get_by_index_rcu(dev_net(dev), index); 3698 3709 if (unlikely(!fwd)) { 3699 3710 err = -EINVAL; ··· 3721 3732 if (unlikely(flags)) 3722 3733 return XDP_ABORTED; 3723 3734 3724 - ri->ifindex = ifindex; 3725 3735 ri->flags = flags; 3736 + ri->tgt_index = ifindex; 3737 + ri->tgt_value = NULL; 3726 3738 WRITE_ONCE(ri->map, NULL); 3727 3739 3728 3740 return XDP_REDIRECT; ··· 3742 3752 { 3743 3753 struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); 3744 3754 3745 - if (unlikely(flags)) 3755 + /* Lower bits of the flags are used as return code on lookup failure */ 3756 + if (unlikely(flags > XDP_TX)) 3746 3757 return XDP_ABORTED; 3747 3758 3748 - ri->ifindex = ifindex; 3759 + ri->tgt_value = __xdp_map_lookup_elem(map, ifindex); 3760 + if (unlikely(!ri->tgt_value)) { 3761 + /* If the lookup fails we 
want to clear out the state in the 3762 + * redirect_info struct completely, so that if an eBPF program 3763 + * performs multiple lookups, the last one always takes 3764 + * precedence. 3765 + */ 3766 + WRITE_ONCE(ri->map, NULL); 3767 + return flags; 3768 + } 3769 + 3749 3770 ri->flags = flags; 3771 + ri->tgt_index = ifindex; 3750 3772 WRITE_ONCE(ri->map, map); 3751 3773 3752 3774 return XDP_REDIRECT; ··· 5194 5192 }; 5195 5193 #endif /* CONFIG_IPV6_SEG6_BPF */ 5196 5194 5197 - #define CONVERT_COMMON_TCP_SOCK_FIELDS(md_type, CONVERT) \ 5198 - do { \ 5199 - switch (si->off) { \ 5200 - case offsetof(md_type, snd_cwnd): \ 5201 - CONVERT(snd_cwnd); break; \ 5202 - case offsetof(md_type, srtt_us): \ 5203 - CONVERT(srtt_us); break; \ 5204 - case offsetof(md_type, snd_ssthresh): \ 5205 - CONVERT(snd_ssthresh); break; \ 5206 - case offsetof(md_type, rcv_nxt): \ 5207 - CONVERT(rcv_nxt); break; \ 5208 - case offsetof(md_type, snd_nxt): \ 5209 - CONVERT(snd_nxt); break; \ 5210 - case offsetof(md_type, snd_una): \ 5211 - CONVERT(snd_una); break; \ 5212 - case offsetof(md_type, mss_cache): \ 5213 - CONVERT(mss_cache); break; \ 5214 - case offsetof(md_type, ecn_flags): \ 5215 - CONVERT(ecn_flags); break; \ 5216 - case offsetof(md_type, rate_delivered): \ 5217 - CONVERT(rate_delivered); break; \ 5218 - case offsetof(md_type, rate_interval_us): \ 5219 - CONVERT(rate_interval_us); break; \ 5220 - case offsetof(md_type, packets_out): \ 5221 - CONVERT(packets_out); break; \ 5222 - case offsetof(md_type, retrans_out): \ 5223 - CONVERT(retrans_out); break; \ 5224 - case offsetof(md_type, total_retrans): \ 5225 - CONVERT(total_retrans); break; \ 5226 - case offsetof(md_type, segs_in): \ 5227 - CONVERT(segs_in); break; \ 5228 - case offsetof(md_type, data_segs_in): \ 5229 - CONVERT(data_segs_in); break; \ 5230 - case offsetof(md_type, segs_out): \ 5231 - CONVERT(segs_out); break; \ 5232 - case offsetof(md_type, data_segs_out): \ 5233 - CONVERT(data_segs_out); break; \ 5234 - case 
offsetof(md_type, lost_out): \ 5235 - CONVERT(lost_out); break; \ 5236 - case offsetof(md_type, sacked_out): \ 5237 - CONVERT(sacked_out); break; \ 5238 - case offsetof(md_type, bytes_received): \ 5239 - CONVERT(bytes_received); break; \ 5240 - case offsetof(md_type, bytes_acked): \ 5241 - CONVERT(bytes_acked); break; \ 5242 - } \ 5243 - } while (0) 5244 - 5245 5195 #ifdef CONFIG_INET 5246 5196 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, 5247 5197 int dif, int sdif, u8 family, u8 proto) ··· 5544 5590 bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, 5545 5591 struct bpf_insn_access_aux *info) 5546 5592 { 5547 - if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, bytes_acked)) 5593 + if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, 5594 + icsk_retransmits)) 5548 5595 return false; 5549 5596 5550 5597 if (off % size != 0) ··· 5576 5621 offsetof(struct tcp_sock, FIELD)); \ 5577 5622 } while (0) 5578 5623 5579 - CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_tcp_sock, 5580 - BPF_TCP_SOCK_GET_COMMON); 5624 + #define BPF_INET_SOCK_GET_COMMON(FIELD) \ 5625 + do { \ 5626 + BUILD_BUG_ON(FIELD_SIZEOF(struct inet_connection_sock, \ 5627 + FIELD) > \ 5628 + FIELD_SIZEOF(struct bpf_tcp_sock, FIELD)); \ 5629 + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ 5630 + struct inet_connection_sock, \ 5631 + FIELD), \ 5632 + si->dst_reg, si->src_reg, \ 5633 + offsetof( \ 5634 + struct inet_connection_sock, \ 5635 + FIELD)); \ 5636 + } while (0) 5581 5637 5582 5638 if (insn > insn_buf) 5583 5639 return insn - insn_buf; ··· 5604 5638 offsetof(struct tcp_sock, rtt_min) + 5605 5639 offsetof(struct minmax_sample, v)); 5606 5640 break; 5641 + case offsetof(struct bpf_tcp_sock, snd_cwnd): 5642 + BPF_TCP_SOCK_GET_COMMON(snd_cwnd); 5643 + break; 5644 + case offsetof(struct bpf_tcp_sock, srtt_us): 5645 + BPF_TCP_SOCK_GET_COMMON(srtt_us); 5646 + break; 5647 + case offsetof(struct bpf_tcp_sock, snd_ssthresh): 5648 + 
BPF_TCP_SOCK_GET_COMMON(snd_ssthresh); 5649 + break; 5650 + case offsetof(struct bpf_tcp_sock, rcv_nxt): 5651 + BPF_TCP_SOCK_GET_COMMON(rcv_nxt); 5652 + break; 5653 + case offsetof(struct bpf_tcp_sock, snd_nxt): 5654 + BPF_TCP_SOCK_GET_COMMON(snd_nxt); 5655 + break; 5656 + case offsetof(struct bpf_tcp_sock, snd_una): 5657 + BPF_TCP_SOCK_GET_COMMON(snd_una); 5658 + break; 5659 + case offsetof(struct bpf_tcp_sock, mss_cache): 5660 + BPF_TCP_SOCK_GET_COMMON(mss_cache); 5661 + break; 5662 + case offsetof(struct bpf_tcp_sock, ecn_flags): 5663 + BPF_TCP_SOCK_GET_COMMON(ecn_flags); 5664 + break; 5665 + case offsetof(struct bpf_tcp_sock, rate_delivered): 5666 + BPF_TCP_SOCK_GET_COMMON(rate_delivered); 5667 + break; 5668 + case offsetof(struct bpf_tcp_sock, rate_interval_us): 5669 + BPF_TCP_SOCK_GET_COMMON(rate_interval_us); 5670 + break; 5671 + case offsetof(struct bpf_tcp_sock, packets_out): 5672 + BPF_TCP_SOCK_GET_COMMON(packets_out); 5673 + break; 5674 + case offsetof(struct bpf_tcp_sock, retrans_out): 5675 + BPF_TCP_SOCK_GET_COMMON(retrans_out); 5676 + break; 5677 + case offsetof(struct bpf_tcp_sock, total_retrans): 5678 + BPF_TCP_SOCK_GET_COMMON(total_retrans); 5679 + break; 5680 + case offsetof(struct bpf_tcp_sock, segs_in): 5681 + BPF_TCP_SOCK_GET_COMMON(segs_in); 5682 + break; 5683 + case offsetof(struct bpf_tcp_sock, data_segs_in): 5684 + BPF_TCP_SOCK_GET_COMMON(data_segs_in); 5685 + break; 5686 + case offsetof(struct bpf_tcp_sock, segs_out): 5687 + BPF_TCP_SOCK_GET_COMMON(segs_out); 5688 + break; 5689 + case offsetof(struct bpf_tcp_sock, data_segs_out): 5690 + BPF_TCP_SOCK_GET_COMMON(data_segs_out); 5691 + break; 5692 + case offsetof(struct bpf_tcp_sock, lost_out): 5693 + BPF_TCP_SOCK_GET_COMMON(lost_out); 5694 + break; 5695 + case offsetof(struct bpf_tcp_sock, sacked_out): 5696 + BPF_TCP_SOCK_GET_COMMON(sacked_out); 5697 + break; 5698 + case offsetof(struct bpf_tcp_sock, bytes_received): 5699 + BPF_TCP_SOCK_GET_COMMON(bytes_received); 5700 + break; 5701 + case 
offsetof(struct bpf_tcp_sock, bytes_acked): 5702 + BPF_TCP_SOCK_GET_COMMON(bytes_acked); 5703 + break; 5704 + case offsetof(struct bpf_tcp_sock, dsack_dups): 5705 + BPF_TCP_SOCK_GET_COMMON(dsack_dups); 5706 + break; 5707 + case offsetof(struct bpf_tcp_sock, delivered): 5708 + BPF_TCP_SOCK_GET_COMMON(delivered); 5709 + break; 5710 + case offsetof(struct bpf_tcp_sock, delivered_ce): 5711 + BPF_TCP_SOCK_GET_COMMON(delivered_ce); 5712 + break; 5713 + case offsetof(struct bpf_tcp_sock, icsk_retransmits): 5714 + BPF_INET_SOCK_GET_COMMON(icsk_retransmits); 5715 + break; 5607 5716 } 5608 5717 5609 5718 return insn - insn_buf; ··· 5692 5651 return (unsigned long)NULL; 5693 5652 } 5694 5653 5695 - static const struct bpf_func_proto bpf_tcp_sock_proto = { 5654 + const struct bpf_func_proto bpf_tcp_sock_proto = { 5696 5655 .func = bpf_tcp_sock, 5697 5656 .gpl_only = false, 5698 5657 .ret_type = RET_PTR_TO_TCP_SOCK_OR_NULL, ··· 7952 7911 SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ); \ 7953 7912 } while (0) 7954 7913 7955 - CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_sock_ops, 7956 - SOCK_OPS_GET_TCP_SOCK_FIELD); 7957 - 7958 7914 if (insn > insn_buf) 7959 7915 return insn - insn_buf; 7960 7916 ··· 8120 8082 case offsetof(struct bpf_sock_ops, sk_txhash): 8121 8083 SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash, 8122 8084 struct sock, type); 8085 + break; 8086 + case offsetof(struct bpf_sock_ops, snd_cwnd): 8087 + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_cwnd); 8088 + break; 8089 + case offsetof(struct bpf_sock_ops, srtt_us): 8090 + SOCK_OPS_GET_TCP_SOCK_FIELD(srtt_us); 8091 + break; 8092 + case offsetof(struct bpf_sock_ops, snd_ssthresh): 8093 + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_ssthresh); 8094 + break; 8095 + case offsetof(struct bpf_sock_ops, rcv_nxt): 8096 + SOCK_OPS_GET_TCP_SOCK_FIELD(rcv_nxt); 8097 + break; 8098 + case offsetof(struct bpf_sock_ops, snd_nxt): 8099 + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_nxt); 8100 + break; 8101 + case offsetof(struct bpf_sock_ops, snd_una): 8102 + 
SOCK_OPS_GET_TCP_SOCK_FIELD(snd_una); 8103 + break; 8104 + case offsetof(struct bpf_sock_ops, mss_cache): 8105 + SOCK_OPS_GET_TCP_SOCK_FIELD(mss_cache); 8106 + break; 8107 + case offsetof(struct bpf_sock_ops, ecn_flags): 8108 + SOCK_OPS_GET_TCP_SOCK_FIELD(ecn_flags); 8109 + break; 8110 + case offsetof(struct bpf_sock_ops, rate_delivered): 8111 + SOCK_OPS_GET_TCP_SOCK_FIELD(rate_delivered); 8112 + break; 8113 + case offsetof(struct bpf_sock_ops, rate_interval_us): 8114 + SOCK_OPS_GET_TCP_SOCK_FIELD(rate_interval_us); 8115 + break; 8116 + case offsetof(struct bpf_sock_ops, packets_out): 8117 + SOCK_OPS_GET_TCP_SOCK_FIELD(packets_out); 8118 + break; 8119 + case offsetof(struct bpf_sock_ops, retrans_out): 8120 + SOCK_OPS_GET_TCP_SOCK_FIELD(retrans_out); 8121 + break; 8122 + case offsetof(struct bpf_sock_ops, total_retrans): 8123 + SOCK_OPS_GET_TCP_SOCK_FIELD(total_retrans); 8124 + break; 8125 + case offsetof(struct bpf_sock_ops, segs_in): 8126 + SOCK_OPS_GET_TCP_SOCK_FIELD(segs_in); 8127 + break; 8128 + case offsetof(struct bpf_sock_ops, data_segs_in): 8129 + SOCK_OPS_GET_TCP_SOCK_FIELD(data_segs_in); 8130 + break; 8131 + case offsetof(struct bpf_sock_ops, segs_out): 8132 + SOCK_OPS_GET_TCP_SOCK_FIELD(segs_out); 8133 + break; 8134 + case offsetof(struct bpf_sock_ops, data_segs_out): 8135 + SOCK_OPS_GET_TCP_SOCK_FIELD(data_segs_out); 8136 + break; 8137 + case offsetof(struct bpf_sock_ops, lost_out): 8138 + SOCK_OPS_GET_TCP_SOCK_FIELD(lost_out); 8139 + break; 8140 + case offsetof(struct bpf_sock_ops, sacked_out): 8141 + SOCK_OPS_GET_TCP_SOCK_FIELD(sacked_out); 8142 + break; 8143 + case offsetof(struct bpf_sock_ops, bytes_received): 8144 + SOCK_OPS_GET_TCP_SOCK_FIELD(bytes_received); 8145 + break; 8146 + case offsetof(struct bpf_sock_ops, bytes_acked): 8147 + SOCK_OPS_GET_TCP_SOCK_FIELD(bytes_acked); 8123 8148 break; 8124 8149 case offsetof(struct bpf_sock_ops, sk): 8125 8150 *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
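The filter.c changes above move the devmap/xskmap lookup from flush time into `bpf_redirect_map()` itself: the result is cached in `ri->tgt_value`, and on a lookup miss the lower bits of the flags argument are returned directly as the XDP action, so a program can fall back to e.g. XDP_PASS without a second map access. A stripped-down sketch of that new contract (the lookup and the `redirect_info` struct are simplified stand-ins; the kernel also clears `ri->map` on a miss, which this sketch omits):

```c
#include <stdint.h>
#include <stddef.h>

enum { XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

struct redirect_info_sketch {
	uint32_t tgt_index;
	void *tgt_value;
};

/* Trivial stand-in for __xdp_map_lookup_elem(). */
static void *map_lookup_sketch(void **slots, int nslots, uint32_t key)
{
	return key < (uint32_t)nslots ? slots[key] : NULL;
}

static int bpf_redirect_map_sketch(struct redirect_info_sketch *ri,
				   void **slots, int nslots,
				   uint32_t key, uint32_t flags)
{
	/* Only action codes up to XDP_TX are allowed in the flags. */
	if (flags > XDP_TX)
		return XDP_ABORTED;

	ri->tgt_value = map_lookup_sketch(slots, nslots, key);
	if (!ri->tgt_value)
		return flags;	/* miss: fall back to the action in flags */

	ri->tgt_index = key;
	return XDP_REDIRECT;
}
```

This is also why `xdp_do_redirect_map()` above no longer needs its own lookup-and-`-EINVAL` path: by the time it runs, `tgt_value` is already known to be valid.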
+1 -1
net/core/xdp.c
··· 85 85 kfree(xa); 86 86 } 87 87 88 - bool __mem_id_disconnect(int id, bool force) 88 + static bool __mem_id_disconnect(int id, bool force) 89 89 { 90 90 struct xdp_mem_allocator *xa; 91 91 bool safe_to_remove = true;
+4
net/ipv4/tcp_input.c
··· 778 778 tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2; 779 779 tp->rtt_seq = tp->snd_nxt; 780 780 tp->mdev_max_us = tcp_rto_min_us(sk); 781 + 782 + tcp_bpf_rtt(sk); 781 783 } 782 784 } else { 783 785 /* no previous measure. */ ··· 788 786 tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk)); 789 787 tp->mdev_max_us = tp->rttvar_us; 790 788 tp->rtt_seq = tp->snd_nxt; 789 + 790 + tcp_bpf_rtt(sk); 791 791 } 792 792 tp->srtt_us = max(1U, srtt); 793 793 }
+30
net/socket.c
··· 2050 2050 static int __sys_setsockopt(int fd, int level, int optname, 2051 2051 char __user *optval, int optlen) 2052 2052 { 2053 + mm_segment_t oldfs = get_fs(); 2054 + char *kernel_optval = NULL; 2053 2055 int err, fput_needed; 2054 2056 struct socket *sock; 2055 2057 ··· 2064 2062 if (err) 2065 2063 goto out_put; 2066 2064 2065 + err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, &level, 2066 + &optname, optval, &optlen, 2067 + &kernel_optval); 2068 + 2069 + if (err < 0) { 2070 + goto out_put; 2071 + } else if (err > 0) { 2072 + err = 0; 2073 + goto out_put; 2074 + } 2075 + 2076 + if (kernel_optval) { 2077 + set_fs(KERNEL_DS); 2078 + optval = (char __user __force *)kernel_optval; 2079 + } 2080 + 2067 2081 if (level == SOL_SOCKET) 2068 2082 err = 2069 2083 sock_setsockopt(sock, level, optname, optval, ··· 2088 2070 err = 2089 2071 sock->ops->setsockopt(sock, level, optname, optval, 2090 2072 optlen); 2073 + 2074 + if (kernel_optval) { 2075 + set_fs(oldfs); 2076 + kfree(kernel_optval); 2077 + } 2091 2078 out_put: 2092 2079 fput_light(sock->file, fput_needed); 2093 2080 } ··· 2115 2092 { 2116 2093 int err, fput_needed; 2117 2094 struct socket *sock; 2095 + int max_optlen; 2118 2096 2119 2097 sock = sockfd_lookup_light(fd, &err, &fput_needed); 2120 2098 if (sock != NULL) { 2121 2099 err = security_socket_getsockopt(sock, level, optname); 2122 2100 if (err) 2123 2101 goto out_put; 2102 + 2103 + max_optlen = BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen); 2124 2104 2125 2105 if (level == SOL_SOCKET) 2126 2106 err = ··· 2133 2107 err = 2134 2108 sock->ops->getsockopt(sock, level, optname, optval, 2135 2109 optlen); 2110 + 2111 + err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname, 2112 + optval, optlen, 2113 + max_optlen, err); 2136 2114 out_put: 2137 2115 fput_light(sock->file, fput_needed); 2138 2116 }
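The `__sys_setsockopt()` hunk above gives the cgroup hook a three-way contract: a negative return aborts the syscall, a positive return means the BPF program consumed the call (the kernel path is skipped and 0 is reported to userspace), and zero falls through to the normal path, optionally with a kernel-allocated replacement buffer. A stripped-down sketch of just that control flow (function and variable names are hypothetical):

```c
/* hook_ret: result of the (hypothetical) cgroup BPF hook.
 * *ran_kernel_path reports whether the normal setsockopt path ran. */
static int setsockopt_dispatch(int hook_ret, int *ran_kernel_path)
{
	*ran_kernel_path = 0;
	if (hook_ret < 0)
		return hook_ret;	/* hook rejected the call outright */
	if (hook_ret > 0)
		return 0;		/* hook consumed it; report success */
	*ran_kernel_path = 1;		/* fall through to the kernel path */
	return 0;
}
```

The `set_fs(KERNEL_DS)` dance in the real hunk exists because the replacement buffer is kernel memory being passed through an interface that expects a `__user` pointer.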
+29 -7
net/xdp/xsk.c
··· 37 37 READ_ONCE(xs->umem->fq); 38 38 } 39 39 40 + bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt) 41 + { 42 + return xskq_has_addrs(umem->fq, cnt); 43 + } 44 + EXPORT_SYMBOL(xsk_umem_has_addrs); 45 + 40 46 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr) 41 47 { 42 48 return xskq_peek_addr(umem->fq, addr); ··· 172 166 } 173 167 EXPORT_SYMBOL(xsk_umem_consume_tx_done); 174 168 175 - bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len) 169 + bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc) 176 170 { 177 - struct xdp_desc desc; 178 171 struct xdp_sock *xs; 179 172 180 173 rcu_read_lock(); 181 174 list_for_each_entry_rcu(xs, &umem->xsk_list, list) { 182 - if (!xskq_peek_desc(xs->tx, &desc)) 175 + if (!xskq_peek_desc(xs->tx, desc)) 183 176 continue; 184 177 185 - if (xskq_produce_addr_lazy(umem->cq, desc.addr)) 178 + if (xskq_produce_addr_lazy(umem->cq, desc->addr)) 186 179 goto out; 187 - 188 - *dma = xdp_umem_get_dma(umem, desc.addr); 189 - *len = desc.len; 190 180 191 181 xskq_discard_desc(xs->tx); 192 182 rcu_read_unlock(); ··· 640 638 641 639 len = sizeof(off); 642 640 if (copy_to_user(optval, &off, len)) 641 + return -EFAULT; 642 + if (put_user(len, optlen)) 643 + return -EFAULT; 644 + 645 + return 0; 646 + } 647 + case XDP_OPTIONS: 648 + { 649 + struct xdp_options opts = {}; 650 + 651 + if (len < sizeof(opts)) 652 + return -EINVAL; 653 + 654 + mutex_lock(&xs->mutex); 655 + if (xs->zc) 656 + opts.flags |= XDP_OPTIONS_ZEROCOPY; 657 + mutex_unlock(&xs->mutex); 658 + 659 + len = sizeof(opts); 660 + if (copy_to_user(optval, &opts, len)) 643 661 return -EFAULT; 644 662 if (put_user(len, optlen)) 645 663 return -EFAULT;
+14
net/xdp/xsk_queue.h
··· 117 117 return q->nentries - (producer - q->cons_tail); 118 118 } 119 119 120 + static inline bool xskq_has_addrs(struct xsk_queue *q, u32 cnt) 121 + { 122 + u32 entries = q->prod_tail - q->cons_tail; 123 + 124 + if (entries >= cnt) 125 + return true; 126 + 127 + /* Refresh the local pointer. */ 128 + q->prod_tail = READ_ONCE(q->ring->producer); 129 + entries = q->prod_tail - q->cons_tail; 130 + 131 + return entries >= cnt; 132 + } 133 + 120 134 /* UMEM queue */ 121 135 122 136 static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
+3
samples/bpf/Makefile
··· 154 154 always += tcp_clamp_kern.o 155 155 always += tcp_basertt_kern.o 156 156 always += tcp_tos_reflect_kern.o 157 + always += tcp_dumpstats_kern.o 157 158 always += xdp_redirect_kern.o 158 159 always += xdp_redirect_map_kern.o 159 160 always += xdp_redirect_cpu_kern.o ··· 169 168 always += xdp_sample_pkts_kern.o 170 169 always += ibumad_kern.o 171 170 always += hbm_out_kern.o 171 + always += hbm_edt_kern.o 172 172 173 173 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include 174 174 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/ ··· 274 272 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h 275 273 $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h 276 274 $(obj)/hbm.o: $(src)/hbm.h 275 + $(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h 277 276 278 277 # asm/sysreg.h - inline assembly used by it is incompatible with llvm. 279 278 # But, there is no easy way to fix it, so just exclude it since it is
+12 -10
samples/bpf/do_hbm_test.sh
··· 14 14 echo "loads. The output is the goodput in Mbps (unless -D was used)." 15 15 echo "" 16 16 echo "USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>]" 17 - echo " [-D] [-d=<delay>|--delay=<delay>] [--debug] [-E]" 17 + echo " [-D] [-d=<delay>|--delay=<delay>] [--debug] [-E] [--edt]" 18 18 echo " [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id >]" 19 19 echo " [-l] [-N] [--no_cn] [-p=<port>|--port=<port>] [-P]" 20 20 echo " [-q=<qdisc>] [-R] [-s=<server>|--server=<server]" ··· 30 30 echo " other detailed information. This information is" 31 31 echo " test dependent (i.e. iperf3 or netperf)." 32 32 echo " -E enable ECN (not required for dctcp)" 33 + echo " --edt use fq's Earliest Departure Time (requires fq)" 33 34 echo " -f or --flows number of concurrent flows (default=1)" 34 35 echo " -i or --id cgroup id (an integer, default is 1)" 35 36 echo " -N use netperf instead of iperf3" ··· 131 130 details=1 132 131 ;; 133 132 -E) 134 - ecn=1 133 + ecn=1 134 + ;; 135 + --edt) 136 + flags="$flags --edt" 137 + qdisc="fq" 135 138 ;; 136 - # Support for upcomming fq Early Departure Time egress rate limiting 137 - #--edt) 138 - # prog="hbm_out_edt_kern.o" 139 - # qdisc="fq" 140 - # ;; 141 139 -f=*|--flows=*) 142 140 flows="${i#*=}" 143 141 ;; ··· 228 228 tc qdisc del dev lo root > /dev/null 2>&1 229 229 tc qdisc add dev lo root netem delay $netem\ms > /dev/null 2>&1 230 230 elif [ "$qdisc" != "" ] ; then 231 - tc qdisc del dev lo root > /dev/null 2>&1 232 - tc qdisc add dev lo root $qdisc > /dev/null 2>&1 231 + tc qdisc del dev eth0 root > /dev/null 2>&1 232 + tc qdisc add dev eth0 root $qdisc > /dev/null 2>&1 233 233 fi 234 234 235 235 n=0 ··· 399 399 if [ "$netem" -ne "0" ] ; then 400 400 tc qdisc del dev lo root > /dev/null 2>&1 401 401 fi 402 - 402 + if [ "$qdisc" != "" ] ; then 403 + tc qdisc del dev eth0 root > /dev/null 2>&1 404 + fi 403 405 sleep 2 404 406 405 407 hbmPid=`ps ax | grep "hbm " | grep --invert-match "grep" | awk '{ print $1 }'`
+15 -3
samples/bpf/hbm.c
··· 62 62 bool debugFlag; 63 63 bool work_conserving_flag; 64 64 bool no_cn_flag; 65 + bool edt_flag; 65 66 66 67 static void Usage(void); 67 68 static void read_trace_pipe2(void); ··· 373 372 fprintf(fout, "avg rtt:%d\n", 374 373 (int)(qstats.sum_rtt / (qstats.pkts_total + 1))); 375 374 // Average credit 376 - fprintf(fout, "avg credit:%d\n", 377 - (int)(qstats.sum_credit / 378 - (1500 * ((int)qstats.pkts_total) + 1))); 375 + if (edt_flag) 376 + fprintf(fout, "avg credit_ms:%.03f\n", 377 + (qstats.sum_credit / 378 + (qstats.pkts_total + 1.0)) / 1000000.0); 379 + else 380 + fprintf(fout, "avg credit:%d\n", 381 + (int)(qstats.sum_credit / 382 + (1500 * ((int)qstats.pkts_total ) + 1))); 379 383 380 384 // Return values stats 381 385 for (k = 0; k < RET_VAL_COUNT; k++) { ··· 414 408 " Where:\n" 415 409 " -o indicates egress direction (default)\n" 416 410 " -d print BPF trace debug buffer\n" 411 + " --edt use fq's Earliest Departure Time\n" 417 412 " -l also limit flows using loopback\n" 418 413 " -n <#> to create cgroup \"/hbm#\" and attach prog\n" 419 414 " Default is /hbm1\n" ··· 440 433 char *optstring = "iodln:r:st:wh"; 441 434 struct option loptions[] = { 442 435 {"no_cn", 0, NULL, 1}, 436 + {"edt", 0, NULL, 2}, 443 437 {NULL, 0, NULL, 0} 444 438 }; 445 439 ··· 448 440 switch (k) { 449 441 case 1: 450 442 no_cn_flag = true; 443 + break; 444 + case 2: 445 + prog = "hbm_edt_kern.o"; 446 + edt_flag = true; 451 447 break; 452 448 case'o': 453 449 break;
+168
samples/bpf/hbm_edt_kern.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Copyright (c) 2019 Facebook 3 + * 4 + * This program is free software; you can redistribute it and/or 5 + * modify it under the terms of version 2 of the GNU General Public 6 + * License as published by the Free Software Foundation. 7 + * 8 + * Sample Host Bandwidth Manager (HBM) BPF program. 9 + * 10 + * A cgroup skb BPF egress program to limit cgroup output bandwidth. 11 + * It uses a modified virtual token bucket queue to limit average 12 + * egress bandwidth. The implementation uses credits instead of tokens. 13 + * Negative credits imply that queueing would have happened (this is 14 + * a virtual queue, so no queueing is done by it. However, queueing may 15 + * occur at the actual qdisc (which is not used for rate limiting). 16 + * 17 + * This implementation uses 3 thresholds, one to start marking packets and 18 + * the other two to drop packets: 19 + * CREDIT 20 + * - <--------------------------|------------------------> + 21 + * | | | 0 22 + * | Large pkt | 23 + * | drop thresh | 24 + * Small pkt drop Mark threshold 25 + * thresh 26 + * 27 + * The effect of marking depends on the type of packet: 28 + * a) If the packet is ECN enabled and it is a TCP packet, then the packet 29 + * is ECN marked. 30 + * b) If the packet is a TCP packet, then we probabilistically call tcp_cwr 31 + * to reduce the congestion window. The current implementation uses a linear 32 + * distribution (0% probability at marking threshold, 100% probability 33 + * at drop threshold). 34 + * c) If the packet is not a TCP packet, then it is dropped. 35 + * 36 + * If the credit is below the drop threshold, the packet is dropped. If it 37 + * is a TCP packet, then it also calls tcp_cwr since packets dropped by 38 + * by a cgroup skb BPF program do not automatically trigger a call to 39 + * tcp_cwr in the current kernel code. 
40 + * 41 + * This BPF program actually uses 2 drop thresholds, one threshold 42 + * for larger packets (>= 120 bytes) and another for smaller packets. This 43 + * protects smaller packets such as SYNs, ACKs, etc. 44 + * 45 + * The default bandwidth limit is set at 1Gbps but this can be changed by 46 + * a user program through a shared BPF map. In addition, by default this BPF 47 + * program does not limit connections using loopback. This behavior can be 48 + * overwritten by the user program. There is also an option to calculate 49 + * some statistics, such as percent of packets marked or dropped, which 50 + * a user program, such as hbm, can access. 51 + */ 52 + 53 + #include "hbm_kern.h" 54 + 55 + SEC("cgroup_skb/egress") 56 + int _hbm_out_cg(struct __sk_buff *skb) 57 + { 58 + long long delta = 0, delta_send; 59 + unsigned long long curtime, sendtime; 60 + struct hbm_queue_stats *qsp = NULL; 61 + unsigned int queue_index = 0; 62 + bool congestion_flag = false; 63 + bool ecn_ce_flag = false; 64 + struct hbm_pkt_info pkti = {}; 65 + struct hbm_vqueue *qdp; 66 + bool drop_flag = false; 67 + bool cwr_flag = false; 68 + int len = skb->len; 69 + int rv = ALLOW_PKT; 70 + 71 + qsp = bpf_map_lookup_elem(&queue_stats, &queue_index); 72 + 73 + // Check if we should ignore loopback traffic 74 + if (qsp != NULL && !qsp->loopback && (skb->ifindex == 1)) 75 + return ALLOW_PKT; 76 + 77 + hbm_get_pkt_info(skb, &pkti); 78 + 79 + // We may want to account for the length of headers in len 80 + // calculation, like ETH header + overhead, specially if it 81 + // is a gso packet. But I am not doing it right now. 
82 + 83 + qdp = bpf_get_local_storage(&queue_state, 0); 84 + if (!qdp) 85 + return ALLOW_PKT; 86 + if (qdp->lasttime == 0) 87 + hbm_init_edt_vqueue(qdp, 1024); 88 + 89 + curtime = bpf_ktime_get_ns(); 90 + 91 + // Begin critical section 92 + bpf_spin_lock(&qdp->lock); 93 + delta = qdp->lasttime - curtime; 94 + // bound bursts to 100us 95 + if (delta < -BURST_SIZE_NS) { 96 + // negative delta is a credit that allows bursts 97 + qdp->lasttime = curtime - BURST_SIZE_NS; 98 + delta = -BURST_SIZE_NS; 99 + } 100 + sendtime = qdp->lasttime; 101 + delta_send = BYTES_TO_NS(len, qdp->rate); 102 + __sync_add_and_fetch(&(qdp->lasttime), delta_send); 103 + bpf_spin_unlock(&qdp->lock); 104 + // End critical section 105 + 106 + // Set EDT of packet 107 + skb->tstamp = sendtime; 108 + 109 + // Check if we should update rate 110 + if (qsp != NULL && (qsp->rate * 128) != qdp->rate) 111 + qdp->rate = qsp->rate * 128; 112 + 113 + // Set flags (drop, congestion, cwr) 114 + // last packet will be sent in the future, bound latency 115 + if (delta > DROP_THRESH_NS || (delta > LARGE_PKT_DROP_THRESH_NS && 116 + len > LARGE_PKT_THRESH)) { 117 + drop_flag = true; 118 + if (pkti.is_tcp && pkti.ecn == 0) 119 + cwr_flag = true; 120 + } else if (delta > MARK_THRESH_NS) { 121 + if (pkti.is_tcp) 122 + congestion_flag = true; 123 + else 124 + drop_flag = true; 125 + } 126 + 127 + if (congestion_flag) { 128 + if (bpf_skb_ecn_set_ce(skb)) { 129 + ecn_ce_flag = true; 130 + } else { 131 + if (pkti.is_tcp) { 132 + unsigned int rand = bpf_get_prandom_u32(); 133 + 134 + if (delta >= MARK_THRESH_NS + 135 + (rand % MARK_REGION_SIZE_NS)) { 136 + // Do congestion control 137 + cwr_flag = true; 138 + } 139 + } else if (len > LARGE_PKT_THRESH) { 140 + // Problem if too many small packets? 
141 + drop_flag = true; 142 + congestion_flag = false; 143 + } 144 + } 145 + } 146 + 147 + if (pkti.is_tcp && drop_flag && pkti.packets_out <= 1) { 148 + drop_flag = false; 149 + cwr_flag = true; 150 + congestion_flag = false; 151 + } 152 + 153 + if (qsp != NULL && qsp->no_cn) 154 + cwr_flag = false; 155 + 156 + hbm_update_stats(qsp, len, curtime, congestion_flag, drop_flag, 157 + cwr_flag, ecn_ce_flag, &pkti, (int) delta); 158 + 159 + if (drop_flag) { 160 + __sync_add_and_fetch(&(qdp->lasttime), -delta_send); 161 + rv = DROP_PKT; 162 + } 163 + 164 + if (cwr_flag) 165 + rv |= CWR; 166 + return rv; 167 + } 168 + char _license[] SEC("license") = "GPL";
+34 -6
samples/bpf/hbm_kern.h
··· 29 29 #define DROP_PKT 0 30 30 #define ALLOW_PKT 1 31 31 #define TCP_ECN_OK 1 32 + #define CWR 2 32 33 33 34 #ifndef HBM_DEBUG // Define HBM_DEBUG to enable debugging 34 35 #undef bpf_printk ··· 46 45 #define MAX_CREDIT (100 * MAX_BYTES_PER_PACKET) 47 46 #define INIT_CREDIT (INITIAL_CREDIT_PACKETS * MAX_BYTES_PER_PACKET) 48 47 48 + // Time base accounting for fq's EDT 49 + #define BURST_SIZE_NS 100000 // 100us 50 + #define MARK_THRESH_NS 50000 // 50us 51 + #define DROP_THRESH_NS 500000 // 500us 52 + // Reserve 20us of queuing for small packets (less than 120 bytes) 53 + #define LARGE_PKT_DROP_THRESH_NS (DROP_THRESH_NS - 20000) 54 + #define MARK_REGION_SIZE_NS (LARGE_PKT_DROP_THRESH_NS - MARK_THRESH_NS) 55 + 49 56 // rate in bytes per ns << 20 50 57 #define CREDIT_PER_NS(delta, rate) ((((u64)(delta)) * (rate)) >> 20) 58 + #define BYTES_PER_NS(delta, rate) ((((u64)(delta)) * (rate)) >> 20) 59 + #define BYTES_TO_NS(bytes, rate) div64_u64(((u64)(bytes)) << 20, (u64)(rate)) 51 60 52 61 struct bpf_map_def SEC("maps") queue_state = { 53 62 .type = BPF_MAP_TYPE_CGROUP_STORAGE, ··· 78 67 struct hbm_pkt_info { 79 68 int cwnd; 80 69 int rtt; 70 + int packets_out; 81 71 bool is_ip; 82 72 bool is_tcp; 83 73 short ecn; ··· 98 86 if (tp) { 99 87 pkti->cwnd = tp->snd_cwnd; 100 88 pkti->rtt = tp->srtt_us >> 3; 89 + pkti->packets_out = tp->packets_out; 101 90 return 0; 102 91 } 103 92 } 104 93 } 105 94 } 95 + pkti->cwnd = 0; 96 + pkti->rtt = 0; 97 + pkti->packets_out = 0; 106 98 return 1; 107 99 } 108 100 109 - static __always_inline void hbm_get_pkt_info(struct __sk_buff *skb, 110 - struct hbm_pkt_info *pkti) 101 + static void hbm_get_pkt_info(struct __sk_buff *skb, 102 + struct hbm_pkt_info *pkti) 111 103 { 112 104 struct iphdr iph; 113 105 struct ipv6hdr *ip6h; ··· 139 123 140 124 static __always_inline void hbm_init_vqueue(struct hbm_vqueue *qdp, int rate) 141 125 { 142 - bpf_printk("Initializing queue_state, rate:%d\n", rate * 128); 143 - qdp->lasttime = bpf_ktime_get_ns(); 
144 - qdp->credit = INIT_CREDIT; 145 - qdp->rate = rate * 128; 126 + bpf_printk("Initializing queue_state, rate:%d\n", rate * 128); 127 + qdp->lasttime = bpf_ktime_get_ns(); 128 + qdp->credit = INIT_CREDIT; 129 + qdp->rate = rate * 128; 130 + } 131 + 132 + static __always_inline void hbm_init_edt_vqueue(struct hbm_vqueue *qdp, 133 + int rate) 134 + { 135 + unsigned long long curtime; 136 + 137 + curtime = bpf_ktime_get_ns(); 138 + bpf_printk("Initializing queue_state, rate:%d\n", rate * 128); 139 + qdp->lasttime = curtime - BURST_SIZE_NS; // support initial burst 140 + qdp->credit = 0; // not used 141 + qdp->rate = rate * 128; 146 142 } 147 143 148 144 static __always_inline void hbm_update_stats(struct hbm_queue_stats *qsp,
+6 -12
samples/bpf/ibumad_kern.c
··· 31 31 }; 32 32 33 33 #undef DEBUG 34 - #ifdef DEBUG 35 - #define bpf_debug(fmt, ...) \ 36 - ({ \ 37 - char ____fmt[] = fmt; \ 38 - bpf_trace_printk(____fmt, sizeof(____fmt), \ 39 - ##__VA_ARGS__); \ 40 - }) 41 - #else 42 - #define bpf_debug(fmt, ...) 34 + #ifndef DEBUG 35 + #undef bpf_printk 36 + #define bpf_printk(fmt, ...) 43 37 #endif 44 38 45 39 /* Taken from the current format defined in ··· 80 86 u64 zero = 0, *val; 81 87 u8 class = ctx->mgmt_class; 82 88 83 - bpf_debug("ib_umad read recv : class 0x%x\n", class); 89 + bpf_printk("ib_umad read recv : class 0x%x\n", class); 84 90 85 91 val = bpf_map_lookup_elem(&read_count, &class); 86 92 if (!val) { ··· 100 106 u64 zero = 0, *val; 101 107 u8 class = ctx->mgmt_class; 102 108 103 - bpf_debug("ib_umad read send : class 0x%x\n", class); 109 + bpf_printk("ib_umad read send : class 0x%x\n", class); 104 110 105 111 val = bpf_map_lookup_elem(&read_count, &class); 106 112 if (!val) { ··· 120 126 u64 zero = 0, *val; 121 127 u8 class = ctx->mgmt_class; 122 128 123 - bpf_debug("ib_umad write : class 0x%x\n", class); 129 + bpf_printk("ib_umad write : class 0x%x\n", class); 124 130 125 131 val = bpf_map_lookup_elem(&write_count, &class); 126 132 if (!val) {
+1 -1
samples/bpf/tcp_bpf.readme
··· 25 25 26 26 To remove (unattach) a socket_ops BPF program from a cgroupv2: 27 27 28 - bpftool cgroup attach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog 28 + bpftool cgroup detach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog
+68
samples/bpf/tcp_dumpstats_kern.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Refer to samples/bpf/tcp_bpf.readme for the instructions on 3 + * how to run this sample program. 4 + */ 5 + #include <linux/bpf.h> 6 + 7 + #include "bpf_helpers.h" 8 + #include "bpf_endian.h" 9 + 10 + #define INTERVAL 1000000000ULL 11 + 12 + int _version SEC("version") = 1; 13 + char _license[] SEC("license") = "GPL"; 14 + 15 + struct { 16 + __u32 type; 17 + __u32 map_flags; 18 + int *key; 19 + __u64 *value; 20 + } bpf_next_dump SEC(".maps") = { 21 + .type = BPF_MAP_TYPE_SK_STORAGE, 22 + .map_flags = BPF_F_NO_PREALLOC, 23 + }; 24 + 25 + SEC("sockops") 26 + int _sockops(struct bpf_sock_ops *ctx) 27 + { 28 + struct bpf_tcp_sock *tcp_sk; 29 + struct bpf_sock *sk; 30 + __u64 *next_dump; 31 + __u64 now; 32 + 33 + switch (ctx->op) { 34 + case BPF_SOCK_OPS_TCP_CONNECT_CB: 35 + bpf_sock_ops_cb_flags_set(ctx, BPF_SOCK_OPS_RTT_CB_FLAG); 36 + return 1; 37 + case BPF_SOCK_OPS_RTT_CB: 38 + break; 39 + default: 40 + return 1; 41 + } 42 + 43 + sk = ctx->sk; 44 + if (!sk) 45 + return 1; 46 + 47 + next_dump = bpf_sk_storage_get(&bpf_next_dump, sk, 0, 48 + BPF_SK_STORAGE_GET_F_CREATE); 49 + if (!next_dump) 50 + return 1; 51 + 52 + now = bpf_ktime_get_ns(); 53 + if (now < *next_dump) 54 + return 1; 55 + 56 + tcp_sk = bpf_tcp_sock(sk); 57 + if (!tcp_sk) 58 + return 1; 59 + 60 + *next_dump = now + INTERVAL; 61 + 62 + bpf_printk("dsack_dups=%u delivered=%u\n", 63 + tcp_sk->dsack_dups, tcp_sk->delivered); 64 + bpf_printk("delivered_ce=%u icsk_retransmits=%u\n", 65 + tcp_sk->delivered_ce, tcp_sk->icsk_retransmits); 66 + 67 + return 1; 68 + }
+10 -2
samples/bpf/xdp_adjust_tail_user.c
··· 13 13 #include <stdio.h> 14 14 #include <stdlib.h> 15 15 #include <string.h> 16 + #include <net/if.h> 16 17 #include <sys/resource.h> 17 18 #include <arpa/inet.h> 18 19 #include <netinet/ether.h> ··· 70 69 printf("Start a XDP prog which send ICMP \"packet too big\" \n" 71 70 "messages if ingress packet is bigger then MAX_SIZE bytes\n"); 72 71 printf("Usage: %s [...]\n", cmd); 73 - printf(" -i <ifindex> Interface Index\n"); 72 + printf(" -i <ifname|ifindex> Interface\n"); 74 73 printf(" -T <stop-after-X-seconds> Default: 0 (forever)\n"); 75 74 printf(" -S use skb-mode\n"); 76 75 printf(" -N enforce native mode\n"); ··· 103 102 104 103 switch (opt) { 105 104 case 'i': 106 - ifindex = atoi(optarg); 105 + ifindex = if_nametoindex(optarg); 106 + if (!ifindex) 107 + ifindex = atoi(optarg); 107 108 break; 108 109 case 'T': 109 110 kill_after_s = atoi(optarg); ··· 136 133 137 134 if (setrlimit(RLIMIT_MEMLOCK, &r)) { 138 135 perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)"); 136 + return 1; 137 + } 138 + 139 + if (!ifindex) { 140 + fprintf(stderr, "Invalid ifname\n"); 139 141 return 1; 140 142 } 141 143
+11 -4
samples/bpf/xdp_redirect_map_user.c
··· 10 10 #include <stdlib.h> 11 11 #include <stdbool.h> 12 12 #include <string.h> 13 + #include <net/if.h> 13 14 #include <unistd.h> 14 15 #include <libgen.h> 15 16 #include <sys/resource.h> ··· 86 85 static void usage(const char *prog) 87 86 { 88 87 fprintf(stderr, 89 - "usage: %s [OPTS] IFINDEX_IN IFINDEX_OUT\n\n" 88 + "usage: %s [OPTS] <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n\n" 90 89 "OPTS:\n" 91 90 " -S use skb-mode\n" 92 91 " -N enforce native mode\n" ··· 128 127 } 129 128 130 129 if (optind == argc) { 131 - printf("usage: %s IFINDEX_IN IFINDEX_OUT\n", argv[0]); 130 + printf("usage: %s <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n", argv[0]); 132 131 return 1; 133 132 } 134 133 ··· 137 136 return 1; 138 137 } 139 138 140 - ifindex_in = strtoul(argv[optind], NULL, 0); 141 - ifindex_out = strtoul(argv[optind + 1], NULL, 0); 139 + ifindex_in = if_nametoindex(argv[optind]); 140 + if (!ifindex_in) 141 + ifindex_in = strtoul(argv[optind], NULL, 0); 142 + 143 + ifindex_out = if_nametoindex(argv[optind + 1]); 144 + if (!ifindex_out) 145 + ifindex_out = strtoul(argv[optind + 1], NULL, 0); 146 + 142 147 printf("input: %d output: %d\n", ifindex_in, ifindex_out); 143 148 144 149 snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+11 -4
samples/bpf/xdp_redirect_user.c
··· 10 10 #include <stdlib.h> 11 11 #include <stdbool.h> 12 12 #include <string.h> 13 + #include <net/if.h> 13 14 #include <unistd.h> 14 15 #include <libgen.h> 15 16 #include <sys/resource.h> ··· 86 85 static void usage(const char *prog) 87 86 { 88 87 fprintf(stderr, 89 - "usage: %s [OPTS] IFINDEX_IN IFINDEX_OUT\n\n" 88 + "usage: %s [OPTS] <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n\n" 90 89 "OPTS:\n" 91 90 " -S use skb-mode\n" 92 91 " -N enforce native mode\n" ··· 129 128 } 130 129 131 130 if (optind == argc) { 132 - printf("usage: %s IFINDEX_IN IFINDEX_OUT\n", argv[0]); 131 + printf("usage: %s <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n", argv[0]); 133 132 return 1; 134 133 } 135 134 ··· 138 137 return 1; 139 138 } 140 139 141 - ifindex_in = strtoul(argv[optind], NULL, 0); 142 - ifindex_out = strtoul(argv[optind + 1], NULL, 0); 140 + ifindex_in = if_nametoindex(argv[optind]); 141 + if (!ifindex_in) 142 + ifindex_in = strtoul(argv[optind], NULL, 0); 143 + 144 + ifindex_out = if_nametoindex(argv[optind + 1]); 145 + if (!ifindex_out) 146 + ifindex_out = strtoul(argv[optind + 1], NULL, 0); 147 + 143 148 printf("input: %d output: %d\n", ifindex_in, ifindex_out); 144 149 145 150 snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+10 -2
samples/bpf/xdp_tx_iptunnel_user.c
··· 9 9 #include <stdio.h> 10 10 #include <stdlib.h> 11 11 #include <string.h> 12 + #include <net/if.h> 12 13 #include <sys/resource.h> 13 14 #include <arpa/inet.h> 14 15 #include <netinet/ether.h> ··· 84 83 "in an IPv4/v6 header and XDP_TX it out. The dst <VIP:PORT>\n" 85 84 "is used to select packets to encapsulate\n\n"); 86 85 printf("Usage: %s [...]\n", cmd); 87 - printf(" -i <ifindex> Interface Index\n"); 86 + printf(" -i <ifname|ifindex> Interface\n"); 88 87 printf(" -a <vip-service-address> IPv4 or IPv6\n"); 89 88 printf(" -p <vip-service-port> A port range (e.g. 433-444) is also allowed\n"); 90 89 printf(" -s <source-ip> Used in the IPTunnel header\n"); ··· 182 181 183 182 switch (opt) { 184 183 case 'i': 185 - ifindex = atoi(optarg); 184 + ifindex = if_nametoindex(optarg); 185 + if (!ifindex) 186 + ifindex = atoi(optarg); 186 187 break; 187 188 case 'a': 188 189 vip.family = parse_ipstr(optarg, vip.daddr.v6); ··· 253 250 254 251 if (setrlimit(RLIMIT_MEMLOCK, &r)) { 255 252 perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)"); 253 + return 1; 254 + } 255 + 256 + if (!ifindex) { 257 + fprintf(stderr, "Invalid ifname\n"); 256 258 return 1; 257 259 } 258 260
+28 -16
samples/bpf/xdpsock_user.c
··· 68 68 static int opt_poll; 69 69 static int opt_interval = 1; 70 70 static u32 opt_xdp_bind_flags; 71 + static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE; 71 72 static __u32 prog_id; 72 73 73 74 struct xsk_umem_info { ··· 277 276 static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size) 278 277 { 279 278 struct xsk_umem_info *umem; 279 + struct xsk_umem_config cfg = { 280 + .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, 281 + .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, 282 + .frame_size = opt_xsk_frame_size, 283 + .frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM, 284 + }; 280 285 int ret; 281 286 282 287 umem = calloc(1, sizeof(*umem)); ··· 290 283 exit_with_error(errno); 291 284 292 285 ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq, 293 - NULL); 286 + &cfg); 294 287 if (ret) 295 288 exit_with_error(-ret); 296 289 ··· 330 323 &idx); 331 324 if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS) 332 325 exit_with_error(-ret); 333 - for (i = 0; 334 - i < XSK_RING_PROD__DEFAULT_NUM_DESCS * 335 - XSK_UMEM__DEFAULT_FRAME_SIZE; 336 - i += XSK_UMEM__DEFAULT_FRAME_SIZE) 337 - *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = i; 326 + for (i = 0; i < XSK_RING_PROD__DEFAULT_NUM_DESCS; i++) 327 + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = 328 + i * opt_xsk_frame_size; 338 329 xsk_ring_prod__submit(&xsk->umem->fq, 339 330 XSK_RING_PROD__DEFAULT_NUM_DESCS); 340 331 ··· 351 346 {"interval", required_argument, 0, 'n'}, 352 347 {"zero-copy", no_argument, 0, 'z'}, 353 348 {"copy", no_argument, 0, 'c'}, 349 + {"frame-size", required_argument, 0, 'f'}, 354 350 {0, 0, 0, 0} 355 351 }; 356 352 ··· 371 365 " -n, --interval=n Specify statistics update interval (default 1 sec).\n" 372 366 " -z, --zero-copy Force zero-copy mode.\n" 373 367 " -c, --copy Force copy mode.\n" 368 + " -f, --frame-size=n Set the frame size (must be a power of two, default is %d).\n" 374 369 "\n"; 375 - fprintf(stderr, str, prog); 370 + fprintf(stderr, 
str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE); 376 371 exit(EXIT_FAILURE); 377 372 } 378 373 ··· 384 377 opterr = 0; 385 378 386 379 for (;;) { 387 - c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options, 380 + c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:", long_options, 388 381 &option_index); 389 382 if (c == -1) 390 383 break; ··· 427 420 case 'F': 428 421 opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST; 429 422 break; 423 + case 'f': 424 + opt_xsk_frame_size = atoi(optarg); 425 + break; 430 426 default: 431 427 usage(basename(argv[0])); 432 428 } ··· 442 432 usage(basename(argv[0])); 443 433 } 444 434 435 + if (opt_xsk_frame_size & (opt_xsk_frame_size - 1)) { 436 + fprintf(stderr, "--frame-size=%d is not a power of two\n", 437 + opt_xsk_frame_size); 438 + usage(basename(argv[0])); 439 + } 445 440 } 446 441 447 442 static void kick_tx(struct xsk_socket_info *xsk) ··· 598 583 599 584 for (i = 0; i < BATCH_SIZE; i++) { 600 585 xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr 601 - = (frame_nb + i) << 602 - XSK_UMEM__DEFAULT_FRAME_SHIFT; 586 + = (frame_nb + i) * opt_xsk_frame_size; 603 587 xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len = 604 588 sizeof(pkt_data) - 1; 605 589 } ··· 675 661 } 676 662 677 663 ret = posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */ 678 - NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE); 664 + NUM_FRAMES * opt_xsk_frame_size); 679 665 if (ret) 680 666 exit_with_error(ret); 681 667 682 668 /* Create sockets... 
*/ 683 - umem = xsk_configure_umem(bufs, 684 - NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE); 669 + umem = xsk_configure_umem(bufs, NUM_FRAMES * opt_xsk_frame_size); 685 670 xsks[num_socks++] = xsk_configure_socket(umem); 686 671 687 672 if (opt_bench == BENCH_TXONLY) { 688 673 int i; 689 674 690 - for (i = 0; i < NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE; 691 - i += XSK_UMEM__DEFAULT_FRAME_SIZE) 692 - (void)gen_eth_frame(umem, i); 675 + for (i = 0; i < NUM_FRAMES; i++) 676 + (void)gen_eth_frame(umem, i * opt_xsk_frame_size); 693 677 } 694 678 695 679 signal(SIGINT, int_exit);
+5 -2
tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
··· 29 29 | *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* } 30 30 | *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | **sock_ops** | **device** | 31 31 | **bind4** | **bind6** | **post_bind4** | **post_bind6** | **connect4** | **connect6** | 32 - | **sendmsg4** | **sendmsg6** | **recvmsg4** | **recvmsg6** | **sysctl** } 32 + | **sendmsg4** | **sendmsg6** | **recvmsg4** | **recvmsg6** | **sysctl** | 33 + | **getsockopt** | **setsockopt** } 33 34 | *ATTACH_FLAGS* := { **multi** | **override** } 34 35 35 36 DESCRIPTION ··· 91 90 an unconnected udp4 socket (since 5.2); 92 91 **recvmsg6** call to recvfrom(2), recvmsg(2), recvmmsg(2) for 93 92 an unconnected udp6 socket (since 5.2); 94 - **sysctl** sysctl access (since 5.2). 93 + **sysctl** sysctl access (since 5.2); 94 + **getsockopt** call to getsockopt (since 5.3); 95 + **setsockopt** call to setsockopt (since 5.3). 95 96 96 97 **bpftool cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG* 97 98 Detach *PROG* from the cgroup *CGROUP* and attach type
+2 -1
tools/bpf/bpftool/Documentation/bpftool-prog.rst
··· 40 40 | **lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | **lirc_mode2** | 41 41 | **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** | 42 42 | **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** | 43 - | **cgroup/recvmsg4** | **cgroup/recvmsg6** | **cgroup/sysctl** 43 + | **cgroup/recvmsg4** | **cgroup/recvmsg6** | **cgroup/sysctl** | 44 + | **cgroup/getsockopt** | **cgroup/setsockopt** 44 45 | } 45 46 | *ATTACH_TYPE* := { 46 47 | **msg_verdict** | **stream_verdict** | **stream_parser** | **flow_dissector**
+6 -3
tools/bpf/bpftool/bash-completion/bpftool
··· 379 379 cgroup/sendmsg4 cgroup/sendmsg6 \ 380 380 cgroup/recvmsg4 cgroup/recvmsg6 \ 381 381 cgroup/post_bind4 cgroup/post_bind6 \ 382 - cgroup/sysctl" -- \ 382 + cgroup/sysctl cgroup/getsockopt \ 383 + cgroup/setsockopt" -- \ 383 384 "$cur" ) ) 384 385 return 0 385 386 ;; ··· 690 689 attach|detach) 691 690 local ATTACH_TYPES='ingress egress sock_create sock_ops \ 692 691 device bind4 bind6 post_bind4 post_bind6 connect4 \ 693 - connect6 sendmsg4 sendmsg6 recvmsg4 recvmsg6 sysctl' 692 + connect6 sendmsg4 sendmsg6 recvmsg4 recvmsg6 sysctl \ 693 + getsockopt setsockopt' 694 694 local ATTACH_FLAGS='multi override' 695 695 local PROG_TYPE='id pinned tag' 696 696 case $prev in ··· 701 699 ;; 702 700 ingress|egress|sock_create|sock_ops|device|bind4|bind6|\ 703 701 post_bind4|post_bind6|connect4|connect6|sendmsg4|\ 704 - sendmsg6|recvmsg4|recvmsg6|sysctl) 702 + sendmsg6|recvmsg4|recvmsg6|sysctl|getsockopt|\ 703 + setsockopt) 705 704 COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \ 706 705 "$cur" ) ) 707 706 return 0
+4 -1
tools/bpf/bpftool/cgroup.c
··· 26 26 " sock_ops | device | bind4 | bind6 |\n" \ 27 27 " post_bind4 | post_bind6 | connect4 |\n" \ 28 28 " connect6 | sendmsg4 | sendmsg6 |\n" \ 29 - " recvmsg4 | recvmsg6 | sysctl }" 29 + " recvmsg4 | recvmsg6 | sysctl |\n" \ 30 + " getsockopt | setsockopt }" 30 31 31 32 static const char * const attach_type_strings[] = { 32 33 [BPF_CGROUP_INET_INGRESS] = "ingress", ··· 46 45 [BPF_CGROUP_SYSCTL] = "sysctl", 47 46 [BPF_CGROUP_UDP4_RECVMSG] = "recvmsg4", 48 47 [BPF_CGROUP_UDP6_RECVMSG] = "recvmsg6", 48 + [BPF_CGROUP_GETSOCKOPT] = "getsockopt", 49 + [BPF_CGROUP_SETSOCKOPT] = "setsockopt", 49 50 [__MAX_BPF_ATTACH_TYPE] = NULL, 50 51 }; 51 52
+1
tools/bpf/bpftool/main.h
··· 74 74 [BPF_PROG_TYPE_SK_REUSEPORT] = "sk_reuseport", 75 75 [BPF_PROG_TYPE_FLOW_DISSECTOR] = "flow_dissector", 76 76 [BPF_PROG_TYPE_CGROUP_SYSCTL] = "cgroup_sysctl", 77 + [BPF_PROG_TYPE_CGROUP_SOCKOPT] = "cgroup_sockopt", 77 78 }; 78 79 79 80 extern const char * const map_type_name[];
+2 -1
tools/bpf/bpftool/prog.c
··· 1071 1071 " cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n" 1072 1072 " cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n" 1073 1073 " cgroup/sendmsg4 | cgroup/sendmsg6 | cgroup/recvmsg4 |\n" 1074 - " cgroup/recvmsg6 }\n" 1074 + " cgroup/recvmsg6 | cgroup/getsockopt |\n" 1075 + " cgroup/setsockopt }\n" 1075 1076 " ATTACH_TYPE := { msg_verdict | stream_verdict | stream_parser |\n" 1076 1077 " flow_dissector }\n" 1077 1078 " " HELP_SPEC_OPTIONS "\n"
+25 -1
tools/include/uapi/linux/bpf.h
··· 170 170 BPF_PROG_TYPE_FLOW_DISSECTOR, 171 171 BPF_PROG_TYPE_CGROUP_SYSCTL, 172 172 BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, 173 + BPF_PROG_TYPE_CGROUP_SOCKOPT, 173 174 }; 174 175 175 176 enum bpf_attach_type { ··· 195 194 BPF_CGROUP_SYSCTL, 196 195 BPF_CGROUP_UDP4_RECVMSG, 197 196 BPF_CGROUP_UDP6_RECVMSG, 197 + BPF_CGROUP_GETSOCKOPT, 198 + BPF_CGROUP_SETSOCKOPT, 198 199 __MAX_BPF_ATTACH_TYPE 199 200 }; 200 201 ··· 1767 1764 * * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out) 1768 1765 * * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission) 1769 1766 * * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change) 1767 + * * **BPF_SOCK_OPS_RTT_CB_FLAG** (every RTT) 1770 1768 * 1771 1769 * Therefore, this function can be used to clear a callback flag by 1772 1770 * setting the appropriate bit to zero. e.g. to disable the RTO ··· 3070 3066 * sum(delta(snd_una)), or how many bytes 3071 3067 * were acked. 3072 3068 */ 3069 + __u32 dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups 3070 + * total number of DSACK blocks received 3071 + */ 3072 + __u32 delivered; /* Total data packets delivered incl. rexmits */ 3073 + __u32 delivered_ce; /* Like the above but only ECE marked packets */ 3074 + __u32 icsk_retransmits; /* Number of unrecovered [RTO] timeouts */ 3073 3075 }; 3074 3076 3075 3077 struct bpf_sock_tuple { ··· 3318 3308 #define BPF_SOCK_OPS_RTO_CB_FLAG (1<<0) 3319 3309 #define BPF_SOCK_OPS_RETRANS_CB_FLAG (1<<1) 3320 3310 #define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2) 3321 - #define BPF_SOCK_OPS_ALL_CB_FLAGS 0x7 /* Mask of all currently 3311 + #define BPF_SOCK_OPS_RTT_CB_FLAG (1<<3) 3312 + #define BPF_SOCK_OPS_ALL_CB_FLAGS 0xF /* Mask of all currently 3322 3313 * supported cb flags 3323 3314 */ 3324 3315 ··· 3373 3362 */ 3374 3363 BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after 3375 3364 * socket transition to LISTEN state. 3365 + */ 3366 + BPF_SOCK_OPS_RTT_CB, /* Called on every RTT. 
3376 3367 */ 3377 3368 }; 3378 3369 ··· 3552 3539 __u32 file_pos; /* Sysctl file position to read from, write to. 3553 3540 * Allows 1,2,4-byte read an 4-byte write. 3554 3541 */ 3542 + }; 3543 + 3544 + struct bpf_sockopt { 3545 + __bpf_md_ptr(struct bpf_sock *, sk); 3546 + __bpf_md_ptr(void *, optval); 3547 + __bpf_md_ptr(void *, optval_end); 3548 + 3549 + __s32 level; 3550 + __s32 optname; 3551 + __s32 optlen; 3552 + __s32 retval; 3555 3553 }; 3556 3554 3557 3555 #endif /* _UAPI__LINUX_BPF_H__ */
+8
tools/include/uapi/linux/if_xdp.h
··· 46 46 #define XDP_UMEM_FILL_RING 5 47 47 #define XDP_UMEM_COMPLETION_RING 6 48 48 #define XDP_STATISTICS 7 49 + #define XDP_OPTIONS 8 49 50 50 51 struct xdp_umem_reg { 51 52 __u64 addr; /* Start of packet data area */ ··· 60 59 __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ 61 60 __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ 62 61 }; 62 + 63 + struct xdp_options { 64 + __u32 flags; 65 + }; 66 + 67 + /* Flags for the flags field of struct xdp_options */ 68 + #define XDP_OPTIONS_ZEROCOPY (1 << 0) 63 69 64 70 /* Pgoff for mmaping the rings */ 65 71 #define XDP_PGOFF_RX_RING 0
+14 -9
tools/lib/bpf/libbpf.c
··· 778 778 if (obj->nr_maps < obj->maps_cap) 779 779 return &obj->maps[obj->nr_maps++]; 780 780 781 - new_cap = max(4ul, obj->maps_cap * 3 / 2); 781 + new_cap = max((size_t)4, obj->maps_cap * 3 / 2); 782 782 new_maps = realloc(obj->maps, new_cap * sizeof(*obj->maps)); 783 783 if (!new_maps) { 784 784 pr_warning("alloc maps for object failed\n"); ··· 1169 1169 pr_debug("map '%s': found key_size = %u.\n", 1170 1170 map_name, sz); 1171 1171 if (map->def.key_size && map->def.key_size != sz) { 1172 - pr_warning("map '%s': conflictling key size %u != %u.\n", 1172 + pr_warning("map '%s': conflicting key size %u != %u.\n", 1173 1173 map_name, map->def.key_size, sz); 1174 1174 return -EINVAL; 1175 1175 } ··· 1197 1197 pr_debug("map '%s': found key [%u], sz = %lld.\n", 1198 1198 map_name, t->type, sz); 1199 1199 if (map->def.key_size && map->def.key_size != sz) { 1200 - pr_warning("map '%s': conflictling key size %u != %lld.\n", 1200 + pr_warning("map '%s': conflicting key size %u != %lld.\n", 1201 1201 map_name, map->def.key_size, sz); 1202 1202 return -EINVAL; 1203 1203 } ··· 1212 1212 pr_debug("map '%s': found value_size = %u.\n", 1213 1213 map_name, sz); 1214 1214 if (map->def.value_size && map->def.value_size != sz) { 1215 - pr_warning("map '%s': conflictling value size %u != %u.\n", 1215 + pr_warning("map '%s': conflicting value size %u != %u.\n", 1216 1216 map_name, map->def.value_size, sz); 1217 1217 return -EINVAL; 1218 1218 } ··· 1240 1240 pr_debug("map '%s': found value [%u], sz = %lld.\n", 1241 1241 map_name, t->type, sz); 1242 1242 if (map->def.value_size && map->def.value_size != sz) { 1243 - pr_warning("map '%s': conflictling value size %u != %lld.\n", 1243 + pr_warning("map '%s': conflicting value size %u != %lld.\n", 1244 1244 map_name, map->def.value_size, sz); 1245 1245 return -EINVAL; 1246 1246 } ··· 2646 2646 case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE: 2647 2647 case BPF_PROG_TYPE_PERF_EVENT: 2648 2648 case BPF_PROG_TYPE_CGROUP_SYSCTL: 2649 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 2649 2650 return false; 2650 2651 case BPF_PROG_TYPE_KPROBE: 2651 2652 default: ··· 3605 3604 BPF_CGROUP_UDP6_RECVMSG), 3606 3605 BPF_EAPROG_SEC("cgroup/sysctl", BPF_PROG_TYPE_CGROUP_SYSCTL, 3607 3606 BPF_CGROUP_SYSCTL), 3607 + BPF_EAPROG_SEC("cgroup/getsockopt", BPF_PROG_TYPE_CGROUP_SOCKOPT, 3608 + BPF_CGROUP_GETSOCKOPT), 3609 + BPF_EAPROG_SEC("cgroup/setsockopt", BPF_PROG_TYPE_CGROUP_SOCKOPT, 3610 + BPF_CGROUP_SETSOCKOPT), 3608 3611 }; 3609 3612 3610 3613 #undef BPF_PROG_SEC_IMPL ··· 3872 3867 int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr, 3873 3868 struct bpf_object **pobj, int *prog_fd) 3874 3869 { 3875 - struct bpf_object_open_attr open_attr = { 3876 - .file = attr->file, 3877 - .prog_type = attr->prog_type, 3878 - }; 3870 + struct bpf_object_open_attr open_attr = {}; 3879 3871 struct bpf_program *prog, *first_prog = NULL; 3880 3872 enum bpf_attach_type expected_attach_type; 3881 3873 enum bpf_prog_type prog_type; ··· 3884 3882 return -EINVAL; 3885 3883 if (!attr->file) 3886 3884 return -EINVAL; 3885 + 3886 + open_attr.file = attr->file; 3887 + open_attr.prog_type = attr->prog_type; 3887 3888 3888 3889 obj = bpf_object__open_xattr(&open_attr); 3889 3890 if (IS_ERR_OR_NULL(obj))
+1
tools/lib/bpf/libbpf_probes.c
··· 101 101 case BPF_PROG_TYPE_SK_REUSEPORT: 102 102 case BPF_PROG_TYPE_FLOW_DISSECTOR: 103 103 case BPF_PROG_TYPE_CGROUP_SYSCTL: 104 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 104 105 default: 105 106 break; 106 107 }
+14 -1
tools/lib/bpf/xsk.c
··· 65 65 int xsks_map_fd; 66 66 __u32 queue_id; 67 67 char ifname[IFNAMSIZ]; 68 + bool zc; 68 69 }; 69 70 70 71 struct xsk_nl_info { ··· 327 326 328 327 channels.cmd = ETHTOOL_GCHANNELS; 329 328 ifr.ifr_data = (void *)&channels; 330 - strncpy(ifr.ifr_name, xsk->ifname, IFNAMSIZ); 329 + strncpy(ifr.ifr_name, xsk->ifname, IFNAMSIZ - 1); 330 + ifr.ifr_name[IFNAMSIZ - 1] = '\0'; 331 331 err = ioctl(fd, SIOCETHTOOL, &ifr); 332 332 if (err && errno != EOPNOTSUPP) { 333 333 ret = -errno; ··· 482 480 void *rx_map = NULL, *tx_map = NULL; 483 481 struct sockaddr_xdp sxdp = {}; 484 482 struct xdp_mmap_offsets off; 483 + struct xdp_options opts; 485 484 struct xsk_socket *xsk; 486 485 socklen_t optlen; 487 486 int err; ··· 600 597 } 601 598 602 599 xsk->prog_fd = -1; 600 + 601 + optlen = sizeof(opts); 602 + err = getsockopt(xsk->fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen); 603 + if (err) { 604 + err = -errno; 605 + goto out_mmap_tx; 606 + } 607 + 608 + xsk->zc = opts.flags & XDP_OPTIONS_ZEROCOPY; 609 + 603 610 if (!(xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)) { 604 611 err = xsk_setup_xdp_prog(xsk); 605 612 if (err)
+1 -1
tools/lib/bpf/xsk.h
··· 167 167 168 168 #define XSK_RING_CONS__DEFAULT_NUM_DESCS 2048 169 169 #define XSK_RING_PROD__DEFAULT_NUM_DESCS 2048 170 - #define XSK_UMEM__DEFAULT_FRAME_SHIFT 11 /* 2048 bytes */ 170 + #define XSK_UMEM__DEFAULT_FRAME_SHIFT 12 /* 4096 bytes */ 171 171 #define XSK_UMEM__DEFAULT_FRAME_SIZE (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT) 172 172 #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0 173 173
+3
tools/testing/selftests/bpf/.gitignore
··· 39 39 test_hashmap 40 40 test_btf_dump 41 41 xdping 42 + test_sockopt 43 + test_sockopt_sk 44 + test_sockopt_multi
+8 -2
tools/testing/selftests/bpf/Makefile
··· 15 15 LLVM_OBJCOPY ?= llvm-objcopy 16 16 LLVM_READELF ?= llvm-readelf 17 17 BTF_PAHOLE ?= pahole 18 - CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include \ 18 + CFLAGS += -g -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include \ 19 19 -Dbpf_prog_load=bpf_prog_test_load \ 20 20 -Dbpf_load_program=bpf_test_load_program 21 21 LDLIBS += -lcap -lelf -lrt -lpthread ··· 26 26 test_sock test_btf test_sockmap get_cgroup_id_user test_socket_cookie \ 27 27 test_cgroup_storage test_select_reuseport test_section_names \ 28 28 test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \ 29 - test_btf_dump test_cgroup_attach xdping 29 + test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \ 30 + test_sockopt_multi test_tcp_rtt 30 31 31 32 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c))) 32 33 TEST_GEN_FILES = $(BPF_OBJ_FILES) ··· 47 46 test_libbpf.sh \ 48 47 test_xdp_redirect.sh \ 49 48 test_xdp_meta.sh \ 49 + test_xdp_veth.sh \ 50 50 test_offload.py \ 51 51 test_sock_addr.sh \ 52 52 test_tunnel.sh \ ··· 104 102 $(OUTPUT)/test_sock_fields: cgroup_helpers.c 105 103 $(OUTPUT)/test_sysctl: cgroup_helpers.c 106 104 $(OUTPUT)/test_cgroup_attach: cgroup_helpers.c 105 + $(OUTPUT)/test_sockopt: cgroup_helpers.c 106 + $(OUTPUT)/test_sockopt_sk: cgroup_helpers.c 107 + $(OUTPUT)/test_sockopt_multi: cgroup_helpers.c 108 + $(OUTPUT)/test_tcp_rtt: cgroup_helpers.c 107 109 108 110 .PHONY: force 109 111
+4 -5
tools/testing/selftests/bpf/progs/pyperf.h
··· 75 75 void* co_name; // PyCodeObject.co_name 76 76 } FrameData; 77 77 78 - static inline __attribute__((__always_inline__)) void* 79 - get_thread_state(void* tls_base, PidData* pidData) 78 + static __always_inline void *get_thread_state(void *tls_base, PidData *pidData) 80 79 { 81 80 void* thread_state; 82 81 int key; ··· 86 87 return thread_state; 87 88 } 88 89 89 - static inline __attribute__((__always_inline__)) bool 90 - get_frame_data(void* frame_ptr, PidData* pidData, FrameData* frame, Symbol* symbol) 90 + static __always_inline bool get_frame_data(void *frame_ptr, PidData *pidData, 91 + FrameData *frame, Symbol *symbol) 91 92 { 92 93 // read data from PyFrameObject 93 94 bpf_probe_read(&frame->f_back, ··· 160 161 .max_elem = 1000, 161 162 }; 162 163 163 - static inline __attribute__((__always_inline__)) int __on_event(struct pt_regs *ctx) 164 + static __always_inline int __on_event(struct pt_regs *ctx) 164 165 { 165 166 uint64_t pid_tgid = bpf_get_current_pid_tgid(); 166 167 pid_t pid = (pid_t)(pid_tgid >> 32);
+71
tools/testing/selftests/bpf/progs/sockopt_multi.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <netinet/in.h> 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + char _license[] SEC("license") = "GPL"; 7 + __u32 _version SEC("version") = 1; 8 + 9 + SEC("cgroup/getsockopt/child") 10 + int _getsockopt_child(struct bpf_sockopt *ctx) 11 + { 12 + __u8 *optval_end = ctx->optval_end; 13 + __u8 *optval = ctx->optval; 14 + 15 + if (ctx->level != SOL_IP || ctx->optname != IP_TOS) 16 + return 1; 17 + 18 + if (optval + 1 > optval_end) 19 + return 0; /* EPERM, bounds check */ 20 + 21 + if (optval[0] != 0x80) 22 + return 0; /* EPERM, unexpected optval from the kernel */ 23 + 24 + ctx->retval = 0; /* Reset system call return value to zero */ 25 + 26 + optval[0] = 0x90; 27 + ctx->optlen = 1; 28 + 29 + return 1; 30 + } 31 + 32 + SEC("cgroup/getsockopt/parent") 33 + int _getsockopt_parent(struct bpf_sockopt *ctx) 34 + { 35 + __u8 *optval_end = ctx->optval_end; 36 + __u8 *optval = ctx->optval; 37 + 38 + if (ctx->level != SOL_IP || ctx->optname != IP_TOS) 39 + return 1; 40 + 41 + if (optval + 1 > optval_end) 42 + return 0; /* EPERM, bounds check */ 43 + 44 + if (optval[0] != 0x90) 45 + return 0; /* EPERM, unexpected optval from the kernel */ 46 + 47 + ctx->retval = 0; /* Reset system call return value to zero */ 48 + 49 + optval[0] = 0xA0; 50 + ctx->optlen = 1; 51 + 52 + return 1; 53 + } 54 + 55 + SEC("cgroup/setsockopt") 56 + int _setsockopt(struct bpf_sockopt *ctx) 57 + { 58 + __u8 *optval_end = ctx->optval_end; 59 + __u8 *optval = ctx->optval; 60 + 61 + if (ctx->level != SOL_IP || ctx->optname != IP_TOS) 62 + return 1; 63 + 64 + if (optval + 1 > optval_end) 65 + return 0; /* EPERM, bounds check */ 66 + 67 + optval[0] += 0x10; 68 + ctx->optlen = 1; 69 + 70 + return 1; 71 + }
+111
tools/testing/selftests/bpf/progs/sockopt_sk.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <netinet/in.h> 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + char _license[] SEC("license") = "GPL"; 7 + __u32 _version SEC("version") = 1; 8 + 9 + #define SOL_CUSTOM 0xdeadbeef 10 + 11 + struct sockopt_sk { 12 + __u8 val; 13 + }; 14 + 15 + struct bpf_map_def SEC("maps") socket_storage_map = { 16 + .type = BPF_MAP_TYPE_SK_STORAGE, 17 + .key_size = sizeof(int), 18 + .value_size = sizeof(struct sockopt_sk), 19 + .map_flags = BPF_F_NO_PREALLOC, 20 + }; 21 + BPF_ANNOTATE_KV_PAIR(socket_storage_map, int, struct sockopt_sk); 22 + 23 + SEC("cgroup/getsockopt") 24 + int _getsockopt(struct bpf_sockopt *ctx) 25 + { 26 + __u8 *optval_end = ctx->optval_end; 27 + __u8 *optval = ctx->optval; 28 + struct sockopt_sk *storage; 29 + 30 + if (ctx->level == SOL_IP && ctx->optname == IP_TOS) 31 + /* Not interested in SOL_IP:IP_TOS; 32 + * let next BPF program in the cgroup chain or kernel 33 + * handle it. 34 + */ 35 + return 1; 36 + 37 + if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) { 38 + /* Not interested in SOL_SOCKET:SO_SNDBUF; 39 + * let next BPF program in the cgroup chain or kernel 40 + * handle it. 41 + */ 42 + return 1; 43 + } 44 + 45 + if (ctx->level != SOL_CUSTOM) 46 + return 0; /* EPERM, deny everything except custom level */ 47 + 48 + if (optval + 1 > optval_end) 49 + return 0; /* EPERM, bounds check */ 50 + 51 + storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0, 52 + BPF_SK_STORAGE_GET_F_CREATE); 53 + if (!storage) 54 + return 0; /* EPERM, couldn't get sk storage */ 55 + 56 + if (!ctx->retval) 57 + return 0; /* EPERM, kernel should not have handled 58 + * SOL_CUSTOM, something is wrong! 
59 + */ 60 + ctx->retval = 0; /* Reset system call return value to zero */ 61 + 62 + optval[0] = storage->val; 63 + ctx->optlen = 1; 64 + 65 + return 1; 66 + } 67 + 68 + SEC("cgroup/setsockopt") 69 + int _setsockopt(struct bpf_sockopt *ctx) 70 + { 71 + __u8 *optval_end = ctx->optval_end; 72 + __u8 *optval = ctx->optval; 73 + struct sockopt_sk *storage; 74 + 75 + if (ctx->level == SOL_IP && ctx->optname == IP_TOS) 76 + /* Not interested in SOL_IP:IP_TOS; 77 + * let next BPF program in the cgroup chain or kernel 78 + * handle it. 79 + */ 80 + return 1; 81 + 82 + if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) { 83 + /* Overwrite SO_SNDBUF value */ 84 + 85 + if (optval + sizeof(__u32) > optval_end) 86 + return 0; /* EPERM, bounds check */ 87 + 88 + *(__u32 *)optval = 0x55AA; 89 + ctx->optlen = 4; 90 + 91 + return 1; 92 + } 93 + 94 + if (ctx->level != SOL_CUSTOM) 95 + return 0; /* EPERM, deny everything except custom level */ 96 + 97 + if (optval + 1 > optval_end) 98 + return 0; /* EPERM, bounds check */ 99 + 100 + storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0, 101 + BPF_SK_STORAGE_GET_F_CREATE); 102 + if (!storage) 103 + return 0; /* EPERM, couldn't get sk storage */ 104 + 105 + storage->val = optval[0]; 106 + ctx->optlen = -1; /* BPF has consumed this option, don't call kernel 107 + * setsockopt handler. 108 + */ 109 + 110 + return 1; 111 + }
+19 -17
tools/testing/selftests/bpf/progs/strobemeta.h
··· 266 266 uint64_t offset; 267 267 }; 268 268 269 - static inline __attribute__((always_inline)) 270 - void *calc_location(struct strobe_value_loc *loc, void *tls_base) 269 + static __always_inline void *calc_location(struct strobe_value_loc *loc, 270 + void *tls_base) 271 271 { 272 272 /* 273 273 * tls_mode value is: ··· 327 327 : NULL; 328 328 } 329 329 330 - static inline __attribute__((always_inline)) 331 - void read_int_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base, 332 - struct strobe_value_generic *value, 333 - struct strobemeta_payload *data) 330 + static __always_inline void read_int_var(struct strobemeta_cfg *cfg, 331 + size_t idx, void *tls_base, 332 + struct strobe_value_generic *value, 333 + struct strobemeta_payload *data) 334 334 { 335 335 void *location = calc_location(&cfg->int_locs[idx], tls_base); 336 336 if (!location) ··· 342 342 data->int_vals_set_mask |= (1 << idx); 343 343 } 344 344 345 - static inline __attribute__((always_inline)) 346 - uint64_t read_str_var(struct strobemeta_cfg* cfg, size_t idx, void *tls_base, 347 - struct strobe_value_generic *value, 348 - struct strobemeta_payload *data, void *payload) 345 + static __always_inline uint64_t read_str_var(struct strobemeta_cfg *cfg, 346 + size_t idx, void *tls_base, 347 + struct strobe_value_generic *value, 348 + struct strobemeta_payload *data, 349 + void *payload) 349 350 { 350 351 void *location; 351 352 uint32_t len; ··· 372 371 return len; 373 372 } 374 373 375 - static inline __attribute__((always_inline)) 376 - void *read_map_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base, 377 - struct strobe_value_generic *value, 378 - struct strobemeta_payload* data, void *payload) 374 + static __always_inline void *read_map_var(struct strobemeta_cfg *cfg, 375 + size_t idx, void *tls_base, 376 + struct strobe_value_generic *value, 377 + struct strobemeta_payload *data, 378 + void *payload) 379 379 { 380 380 struct strobe_map_descr* descr = &data->map_descrs[idx]; 381 381 struct strobe_map_raw map; ··· 437 435 * read_strobe_meta returns NULL, if no metadata was read; otherwise returns 438 436 * pointer to *right after* payload ends 439 437 */ 440 - static inline __attribute__((always_inline)) 441 - void *read_strobe_meta(struct task_struct* task, 442 - struct strobemeta_payload* data) { 438 + static __always_inline void *read_strobe_meta(struct task_struct *task, 439 + struct strobemeta_payload *data) 440 + { 443 441 pid_t pid = bpf_get_current_pid_tgid() >> 32; 444 442 struct strobe_value_generic value = {0}; 445 443 struct strobemeta_cfg *cfg;
+61
tools/testing/selftests/bpf/progs/tcp_rtt.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <linux/bpf.h> 3 + #include "bpf_helpers.h" 4 + 5 + char _license[] SEC("license") = "GPL"; 6 + __u32 _version SEC("version") = 1; 7 + 8 + struct tcp_rtt_storage { 9 + __u32 invoked; 10 + __u32 dsack_dups; 11 + __u32 delivered; 12 + __u32 delivered_ce; 13 + __u32 icsk_retransmits; 14 + }; 15 + 16 + struct bpf_map_def SEC("maps") socket_storage_map = { 17 + .type = BPF_MAP_TYPE_SK_STORAGE, 18 + .key_size = sizeof(int), 19 + .value_size = sizeof(struct tcp_rtt_storage), 20 + .map_flags = BPF_F_NO_PREALLOC, 21 + }; 22 + BPF_ANNOTATE_KV_PAIR(socket_storage_map, int, struct tcp_rtt_storage); 23 + 24 + SEC("sockops") 25 + int _sockops(struct bpf_sock_ops *ctx) 26 + { 27 + struct tcp_rtt_storage *storage; 28 + struct bpf_tcp_sock *tcp_sk; 29 + int op = (int) ctx->op; 30 + struct bpf_sock *sk; 31 + 32 + sk = ctx->sk; 33 + if (!sk) 34 + return 1; 35 + 36 + storage = bpf_sk_storage_get(&socket_storage_map, sk, 0, 37 + BPF_SK_STORAGE_GET_F_CREATE); 38 + if (!storage) 39 + return 1; 40 + 41 + if (op == BPF_SOCK_OPS_TCP_CONNECT_CB) { 42 + bpf_sock_ops_cb_flags_set(ctx, BPF_SOCK_OPS_RTT_CB_FLAG); 43 + return 1; 44 + } 45 + 46 + if (op != BPF_SOCK_OPS_RTT_CB) 47 + return 1; 48 + 49 + tcp_sk = bpf_tcp_sock(sk); 50 + if (!tcp_sk) 51 + return 1; 52 + 53 + storage->invoked++; 54 + 55 + storage->dsack_dups = tcp_sk->dsack_dups; 56 + storage->delivered = tcp_sk->delivered; 57 + storage->delivered_ce = tcp_sk->delivered_ce; 58 + storage->icsk_retransmits = tcp_sk->icsk_retransmits; 59 + 60 + return 1; 61 + }
+2 -1
tools/testing/selftests/bpf/progs/test_jhash.h
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 // Copyright (c) 2019 Facebook 3 + #include <features.h> 3 4 4 5 typedef unsigned int u32; 5 6 6 - static __attribute__((always_inline)) u32 rol32(u32 word, unsigned int shift) 7 + static __always_inline u32 rol32(u32 word, unsigned int shift) 7 8 { 8 9 return (word << shift) | (word >> ((-shift) & 31)); 9 10 }
+12 -11
tools/testing/selftests/bpf/progs/test_seg6_loop.c
··· 54 54 unsigned char value[0]; 55 55 } BPF_PACKET_HEADER; 56 56 57 - static __attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb) 57 + static __always_inline struct ip6_srh_t *get_srh(struct __sk_buff *skb) 58 58 { 59 59 void *cursor, *data_end; 60 60 struct ip6_srh_t *srh; ··· 88 88 return srh; 89 89 } 90 90 91 - static __attribute__((always_inline)) 92 - int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad, 93 - uint32_t old_pad, uint32_t pad_off) 91 + static __always_inline int update_tlv_pad(struct __sk_buff *skb, 92 + uint32_t new_pad, uint32_t old_pad, 93 + uint32_t pad_off) 94 94 { 95 95 int err; 96 96 ··· 118 118 return 0; 119 119 } 120 120 121 - static __attribute__((always_inline)) 122 - int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh, 123 - uint32_t *tlv_off, uint32_t *pad_size, 124 - uint32_t *pad_off) 121 + static __always_inline int is_valid_tlv_boundary(struct __sk_buff *skb, 122 + struct ip6_srh_t *srh, 123 + uint32_t *tlv_off, 124 + uint32_t *pad_size, 125 + uint32_t *pad_off) 125 126 { 126 127 uint32_t srh_off, cur_off; 127 128 int offset_valid = 0; ··· 178 177 return 0; 179 178 } 180 179 181 - static __attribute__((always_inline)) 182 - int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off, 183 - struct sr6_tlv_t *itlv, uint8_t tlv_size) 180 + static __always_inline int add_tlv(struct __sk_buff *skb, 181 + struct ip6_srh_t *srh, uint32_t tlv_off, 182 + struct sr6_tlv_t *itlv, uint8_t tlv_size) 184 183 { 185 184 uint32_t srh_off = (char *)srh - (char *)(long)skb->data; 186 185 uint8_t len_remaining, new_pad;
+1 -1
tools/testing/selftests/bpf/progs/test_verif_scale2.c
··· 2 2 // Copyright (c) 2019 Facebook 3 3 #include <linux/bpf.h> 4 4 #include "bpf_helpers.h" 5 - #define ATTR __attribute__((always_inline)) 5 + #define ATTR __always_inline 6 6 #include "test_jhash.h" 7 7 8 8 SEC("scale90_inline")
+31
tools/testing/selftests/bpf/progs/xdp_redirect_map.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + struct bpf_map_def SEC("maps") tx_port = { 7 + .type = BPF_MAP_TYPE_DEVMAP, 8 + .key_size = sizeof(int), 9 + .value_size = sizeof(int), 10 + .max_entries = 8, 11 + }; 12 + 13 + SEC("redirect_map_0") 14 + int xdp_redirect_map_0(struct xdp_md *xdp) 15 + { 16 + return bpf_redirect_map(&tx_port, 0, 0); 17 + } 18 + 19 + SEC("redirect_map_1") 20 + int xdp_redirect_map_1(struct xdp_md *xdp) 21 + { 22 + return bpf_redirect_map(&tx_port, 1, 0); 23 + } 24 + 25 + SEC("redirect_map_2") 26 + int xdp_redirect_map_2(struct xdp_md *xdp) 27 + { 28 + return bpf_redirect_map(&tx_port, 2, 0); 29 + } 30 + 31 + char _license[] SEC("license") = "GPL";
+12
tools/testing/selftests/bpf/progs/xdp_tx.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + SEC("tx") 7 + int xdp_tx(struct xdp_md *xdp) 8 + { 9 + return XDP_TX; 10 + } 11 + 12 + char _license[] SEC("license") = "GPL";
+10
tools/testing/selftests/bpf/test_section_names.c
··· 134 134 {0, BPF_PROG_TYPE_CGROUP_SYSCTL, BPF_CGROUP_SYSCTL}, 135 135 {0, BPF_CGROUP_SYSCTL}, 136 136 }, 137 + { 138 + "cgroup/getsockopt", 139 + {0, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT}, 140 + {0, BPF_CGROUP_GETSOCKOPT}, 141 + }, 142 + { 143 + "cgroup/setsockopt", 144 + {0, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT}, 145 + {0, BPF_CGROUP_SETSOCKOPT}, 146 + }, 137 147 }; 138 148 139 149 static int test_prog_type_by_name(const struct sec_name_test *test)
+1021
tools/testing/selftests/bpf/test_sockopt.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <errno.h> 4 + #include <stdio.h> 5 + #include <unistd.h> 6 + #include <sys/types.h> 7 + #include <sys/socket.h> 8 + #include <netinet/in.h> 9 + 10 + #include <linux/filter.h> 11 + #include <bpf/bpf.h> 12 + #include <bpf/libbpf.h> 13 + 14 + #include "bpf_rlimit.h" 15 + #include "bpf_util.h" 16 + #include "cgroup_helpers.h" 17 + 18 + #define CG_PATH "/sockopt" 19 + 20 + static char bpf_log_buf[4096]; 21 + static bool verbose; 22 + 23 + enum sockopt_test_error { 24 + OK = 0, 25 + DENY_LOAD, 26 + DENY_ATTACH, 27 + EPERM_GETSOCKOPT, 28 + EFAULT_GETSOCKOPT, 29 + EPERM_SETSOCKOPT, 30 + EFAULT_SETSOCKOPT, 31 + }; 32 + 33 + static struct sockopt_test { 34 + const char *descr; 35 + const struct bpf_insn insns[64]; 36 + enum bpf_attach_type attach_type; 37 + enum bpf_attach_type expected_attach_type; 38 + 39 + int set_optname; 40 + int set_level; 41 + const char set_optval[64]; 42 + socklen_t set_optlen; 43 + 44 + int get_optname; 45 + int get_level; 46 + const char get_optval[64]; 47 + socklen_t get_optlen; 48 + socklen_t get_optlen_ret; 49 + 50 + enum sockopt_test_error error; 51 + } tests[] = { 52 + 53 + /* ==================== getsockopt ==================== */ 54 + 55 + { 56 + .descr = "getsockopt: no expected_attach_type", 57 + .insns = { 58 + /* return 1 */ 59 + BPF_MOV64_IMM(BPF_REG_0, 1), 60 + BPF_EXIT_INSN(), 61 + 62 + }, 63 + .attach_type = BPF_CGROUP_GETSOCKOPT, 64 + .expected_attach_type = 0, 65 + .error = DENY_LOAD, 66 + }, 67 + { 68 + .descr = "getsockopt: wrong expected_attach_type", 69 + .insns = { 70 + /* return 1 */ 71 + BPF_MOV64_IMM(BPF_REG_0, 1), 72 + BPF_EXIT_INSN(), 73 + 74 + }, 75 + .attach_type = BPF_CGROUP_GETSOCKOPT, 76 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 77 + .error = DENY_ATTACH, 78 + }, 79 + { 80 + .descr = "getsockopt: bypass bpf hook", 81 + .insns = { 82 + /* return 1 */ 83 + BPF_MOV64_IMM(BPF_REG_0, 1), 84 + BPF_EXIT_INSN(), 85 + }, 86 + .attach_type = BPF_CGROUP_GETSOCKOPT, 87 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 88 + 89 + .get_level = SOL_IP, 90 + .set_level = SOL_IP, 91 + 92 + .get_optname = IP_TOS, 93 + .set_optname = IP_TOS, 94 + 95 + .set_optval = { 1 << 3 }, 96 + .set_optlen = 1, 97 + 98 + .get_optval = { 1 << 3 }, 99 + .get_optlen = 1, 100 + }, 101 + { 102 + .descr = "getsockopt: return EPERM from bpf hook", 103 + .insns = { 104 + BPF_MOV64_IMM(BPF_REG_0, 0), 105 + BPF_EXIT_INSN(), 106 + }, 107 + .attach_type = BPF_CGROUP_GETSOCKOPT, 108 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 109 + 110 + .get_level = SOL_IP, 111 + .get_optname = IP_TOS, 112 + 113 + .get_optlen = 1, 114 + .error = EPERM_GETSOCKOPT, 115 + }, 116 + { 117 + .descr = "getsockopt: no optval bounds check, deny loading", 118 + .insns = { 119 + /* r6 = ctx->optval */ 120 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 121 + offsetof(struct bpf_sockopt, optval)), 122 + 123 + /* ctx->optval[0] = 0x80 */ 124 + BPF_MOV64_IMM(BPF_REG_0, 0x80), 125 + BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_0, 0), 126 + 127 + /* return 1 */ 128 + BPF_MOV64_IMM(BPF_REG_0, 1), 129 + BPF_EXIT_INSN(), 130 + }, 131 + .attach_type = BPF_CGROUP_GETSOCKOPT, 132 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 133 + .error = DENY_LOAD, 134 + }, 135 + { 136 + .descr = "getsockopt: read ctx->level", 137 + .insns = { 138 + /* r6 = ctx->level */ 139 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 140 + offsetof(struct bpf_sockopt, level)), 141 + 142 + /* if (ctx->level == 123) { */ 143 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 4), 144 + /* ctx->retval = 0 */ 145 + BPF_MOV64_IMM(BPF_REG_0, 0), 146 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 147 + offsetof(struct bpf_sockopt, retval)), 148 + /* return 1 */ 149 + BPF_MOV64_IMM(BPF_REG_0, 1), 150 + BPF_JMP_A(1), 151 + /* } else { */ 152 + /* return 0 */ 153 + BPF_MOV64_IMM(BPF_REG_0, 0), 154 + /* } */ 155 + BPF_EXIT_INSN(), 156 + }, 157 + .attach_type = BPF_CGROUP_GETSOCKOPT, 158 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT,
159 + 160 + .get_level = 123, 161 + 162 + .get_optlen = 1, 163 + }, 164 + { 165 + .descr = "getsockopt: deny writing to ctx->level", 166 + .insns = { 167 + /* ctx->level = 1 */ 168 + BPF_MOV64_IMM(BPF_REG_0, 1), 169 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 170 + offsetof(struct bpf_sockopt, level)), 171 + BPF_EXIT_INSN(), 172 + }, 173 + .attach_type = BPF_CGROUP_GETSOCKOPT, 174 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 175 + 176 + .error = DENY_LOAD, 177 + }, 178 + { 179 + .descr = "getsockopt: read ctx->optname", 180 + .insns = { 181 + /* r6 = ctx->optname */ 182 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 183 + offsetof(struct bpf_sockopt, optname)), 184 + 185 + /* if (ctx->optname == 123) { */ 186 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 4), 187 + /* ctx->retval = 0 */ 188 + BPF_MOV64_IMM(BPF_REG_0, 0), 189 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 190 + offsetof(struct bpf_sockopt, retval)), 191 + /* return 1 */ 192 + BPF_MOV64_IMM(BPF_REG_0, 1), 193 + BPF_JMP_A(1), 194 + /* } else { */ 195 + /* return 0 */ 196 + BPF_MOV64_IMM(BPF_REG_0, 0), 197 + /* } */ 198 + BPF_EXIT_INSN(), 199 + }, 200 + .attach_type = BPF_CGROUP_GETSOCKOPT, 201 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 202 + 203 + .get_optname = 123, 204 + 205 + .get_optlen = 1, 206 + }, 207 + { 208 + .descr = "getsockopt: read ctx->retval", 209 + .insns = { 210 + /* r6 = ctx->retval */ 211 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 212 + offsetof(struct bpf_sockopt, retval)), 213 + 214 + /* return 1 */ 215 + BPF_MOV64_IMM(BPF_REG_0, 1), 216 + BPF_EXIT_INSN(), 217 + }, 218 + .attach_type = BPF_CGROUP_GETSOCKOPT, 219 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 220 + 221 + .get_level = SOL_IP, 222 + .get_optname = IP_TOS, 223 + .get_optlen = 1, 224 + }, 225 + { 226 + .descr = "getsockopt: deny writing to ctx->optname", 227 + .insns = { 228 + /* ctx->optname = 1 */ 229 + BPF_MOV64_IMM(BPF_REG_0, 1), 230 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 231 + offsetof(struct bpf_sockopt, optname)), 232 + BPF_EXIT_INSN(), 233 + }, 234 + .attach_type = BPF_CGROUP_GETSOCKOPT, 235 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 236 + 237 + .error = DENY_LOAD, 238 + }, 239 + { 240 + .descr = "getsockopt: read ctx->optlen", 241 + .insns = { 242 + /* r6 = ctx->optlen */ 243 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 244 + offsetof(struct bpf_sockopt, optlen)), 245 + 246 + /* if (ctx->optlen == 64) { */ 247 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 64, 4), 248 + /* ctx->retval = 0 */ 249 + BPF_MOV64_IMM(BPF_REG_0, 0), 250 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 251 + offsetof(struct bpf_sockopt, retval)), 252 + /* return 1 */ 253 + BPF_MOV64_IMM(BPF_REG_0, 1), 254 + BPF_JMP_A(1), 255 + /* } else { */ 256 + /* return 0 */ 257 + BPF_MOV64_IMM(BPF_REG_0, 0), 258 + /* } */ 259 + BPF_EXIT_INSN(), 260 + }, 261 + .attach_type = BPF_CGROUP_GETSOCKOPT, 262 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 263 + 264 + .get_optlen = 64, 265 + }, 266 + { 267 + .descr = "getsockopt: deny bigger ctx->optlen", 268 + .insns = { 269 + /* ctx->optlen = 65 */ 270 + BPF_MOV64_IMM(BPF_REG_0, 65), 271 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 272 + offsetof(struct bpf_sockopt, optlen)), 273 + 274 + /* ctx->retval = 0 */ 275 + BPF_MOV64_IMM(BPF_REG_0, 0), 276 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 277 + offsetof(struct bpf_sockopt, retval)), 278 + 279 + /* return 1 */ 280 + BPF_MOV64_IMM(BPF_REG_0, 1), 281 + BPF_EXIT_INSN(), 282 + }, 283 + .attach_type = BPF_CGROUP_GETSOCKOPT, 284 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 285 + 286 + .get_optlen = 64, 287 + 288 + .error = EFAULT_GETSOCKOPT, 289 + }, 290 + { 291 + .descr = "getsockopt: deny arbitrary ctx->retval", 292 + .insns = { 293 + /* ctx->retval = 123 */ 294 + BPF_MOV64_IMM(BPF_REG_0, 123), 295 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 296 + offsetof(struct bpf_sockopt, retval)), 297 + 298 + /* return 1 */ 299 + BPF_MOV64_IMM(BPF_REG_0, 1), 300 + BPF_EXIT_INSN(), 301 + }, 302 + .attach_type = BPF_CGROUP_GETSOCKOPT,
303 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 304 + 305 + .get_optlen = 64, 306 + 307 + .error = EFAULT_GETSOCKOPT, 308 + }, 309 + { 310 + .descr = "getsockopt: support smaller ctx->optlen", 311 + .insns = { 312 + /* ctx->optlen = 32 */ 313 + BPF_MOV64_IMM(BPF_REG_0, 32), 314 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 315 + offsetof(struct bpf_sockopt, optlen)), 316 + /* ctx->retval = 0 */ 317 + BPF_MOV64_IMM(BPF_REG_0, 0), 318 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 319 + offsetof(struct bpf_sockopt, retval)), 320 + /* return 1 */ 321 + BPF_MOV64_IMM(BPF_REG_0, 1), 322 + BPF_EXIT_INSN(), 323 + }, 324 + .attach_type = BPF_CGROUP_GETSOCKOPT, 325 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 326 + 327 + .get_optlen = 64, 328 + .get_optlen_ret = 32, 329 + }, 330 + { 331 + .descr = "getsockopt: deny writing to ctx->optval", 332 + .insns = { 333 + /* ctx->optval = 1 */ 334 + BPF_MOV64_IMM(BPF_REG_0, 1), 335 + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 336 + offsetof(struct bpf_sockopt, optval)), 337 + BPF_EXIT_INSN(), 338 + }, 339 + .attach_type = BPF_CGROUP_GETSOCKOPT, 340 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 341 + 342 + .error = DENY_LOAD, 343 + }, 344 + { 345 + .descr = "getsockopt: deny writing to ctx->optval_end", 346 + .insns = { 347 + /* ctx->optval_end = 1 */ 348 + BPF_MOV64_IMM(BPF_REG_0, 1), 349 + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 350 + offsetof(struct bpf_sockopt, optval_end)), 351 + BPF_EXIT_INSN(), 352 + }, 353 + .attach_type = BPF_CGROUP_GETSOCKOPT, 354 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 355 + 356 + .error = DENY_LOAD, 357 + }, 358 + { 359 + .descr = "getsockopt: rewrite value", 360 + .insns = { 361 + /* r6 = ctx->optval */ 362 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 363 + offsetof(struct bpf_sockopt, optval)), 364 + /* r2 = ctx->optval */ 365 + BPF_MOV64_REG(BPF_REG_2, BPF_REG_6), 366 + /* r6 = ctx->optval + 1 */ 367 + BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 1), 368 + 369 + /* r7 = ctx->optval_end */ 370 + BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_1, 371 + offsetof(struct bpf_sockopt, optval_end)), 372 + 373 + /* if (ctx->optval + 1 <= ctx->optval_end) { */ 374 + BPF_JMP_REG(BPF_JGT, BPF_REG_6, BPF_REG_7, 1), 375 + /* ctx->optval[0] = 0xF0 */ 376 + BPF_ST_MEM(BPF_B, BPF_REG_2, 0, 0xF0), 377 + /* } */ 378 + 379 + /* ctx->retval = 0 */ 380 + BPF_MOV64_IMM(BPF_REG_0, 0), 381 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 382 + offsetof(struct bpf_sockopt, retval)), 383 + 384 + /* return 1*/ 385 + BPF_MOV64_IMM(BPF_REG_0, 1), 386 + BPF_EXIT_INSN(), 387 + }, 388 + .attach_type = BPF_CGROUP_GETSOCKOPT, 389 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 390 + 391 + .get_level = SOL_IP, 392 + .get_optname = IP_TOS, 393 + 394 + .get_optval = { 0xF0 }, 395 + .get_optlen = 1, 396 + }, 397 + 398 + /* ==================== setsockopt ==================== */ 399 + 400 + { 401 + .descr = "setsockopt: no expected_attach_type", 402 + .insns = { 403 + /* return 1 */ 404 + BPF_MOV64_IMM(BPF_REG_0, 1), 405 + BPF_EXIT_INSN(), 406 + 407 + }, 408 + .attach_type = BPF_CGROUP_SETSOCKOPT, 409 + .expected_attach_type = 0, 410 + .error = DENY_LOAD, 411 + }, 412 + { 413 + .descr = "setsockopt: wrong expected_attach_type", 414 + .insns = { 415 + /* return 1 */ 416 + BPF_MOV64_IMM(BPF_REG_0, 1), 417 + BPF_EXIT_INSN(), 418 + 419 + }, 420 + .attach_type = BPF_CGROUP_SETSOCKOPT, 421 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 422 + .error = DENY_ATTACH, 423 + }, 424 + { 425 + .descr = "setsockopt: bypass bpf hook", 426 + .insns = { 427 + /* return 1 */ 428 + BPF_MOV64_IMM(BPF_REG_0, 1), 429 + BPF_EXIT_INSN(), 430 + }, 431 + .attach_type = BPF_CGROUP_SETSOCKOPT, 432 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 433 + 434 + .get_level = SOL_IP, 435 + .set_level = SOL_IP, 436 + 437 + .get_optname = IP_TOS, 438 + .set_optname = IP_TOS, 439 + 440 + .set_optval = { 1 << 3 }, 441 + .set_optlen = 1, 442 + 443 + .get_optval = { 1 << 3 }, 444 + .get_optlen = 1, 445 + }, 446 + { 447 + .descr = "setsockopt: return EPERM from bpf hook", 448 + .insns = { 449 + /* return 0 */ 450 + BPF_MOV64_IMM(BPF_REG_0, 0), 451 + BPF_EXIT_INSN(), 452 + }, 453 + .attach_type = BPF_CGROUP_SETSOCKOPT, 454 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 455 + 456 + .set_level = SOL_IP, 457 + .set_optname = IP_TOS, 458 + 459 + .set_optlen = 1, 460 + .error = EPERM_SETSOCKOPT, 461 + }, 462 + { 463 + .descr = "setsockopt: no optval bounds check, deny loading", 464 + .insns = { 465 + /* r6 = ctx->optval */ 466 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 467 + offsetof(struct bpf_sockopt, optval)), 468 + 469 + /* r0 = ctx->optval[0] */ 470 + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0), 471 + 472 + /* return 1 */ 473 + BPF_MOV64_IMM(BPF_REG_0, 1), 474 + BPF_EXIT_INSN(), 475 + }, 476 + .attach_type = BPF_CGROUP_SETSOCKOPT, 477 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 478 + .error = DENY_LOAD, 479 + }, 480 + { 481 + .descr = "setsockopt: read ctx->level", 482 + .insns = { 483 + /* r6 = ctx->level */ 484 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 485 + offsetof(struct bpf_sockopt, level)), 486 + 487 + /* if (ctx->level == 123) { */ 488 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 4), 489 + /* ctx->optlen = -1 */ 490 + BPF_MOV64_IMM(BPF_REG_0, -1), 491 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 492 + offsetof(struct bpf_sockopt, optlen)), 493 + /* return 1 */ 494 + BPF_MOV64_IMM(BPF_REG_0, 1), 495 + BPF_JMP_A(1), 496 + /* } else { */ 497 + /* return 0 */ 498 + BPF_MOV64_IMM(BPF_REG_0, 0), 499 + /* } */ 500 + BPF_EXIT_INSN(), 501 + }, 502 + .attach_type = BPF_CGROUP_SETSOCKOPT, 503 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 504 + 505 + .set_level = 123, 506 + 507 + .set_optlen = 1, 508 + }, 509 + { 510 + .descr = "setsockopt: allow changing ctx->level", 511 + .insns = { 512 + /* ctx->level = SOL_IP */ 513 + BPF_MOV64_IMM(BPF_REG_0, SOL_IP), 514 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 515 + offsetof(struct bpf_sockopt, level)), 516 + /* return 1 */ 517 + BPF_MOV64_IMM(BPF_REG_0, 1), 518 +
BPF_EXIT_INSN(), 519 + }, 520 + .attach_type = BPF_CGROUP_SETSOCKOPT, 521 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 522 + 523 + .get_level = SOL_IP, 524 + .set_level = 234, /* should be rewritten to SOL_IP */ 525 + 526 + .get_optname = IP_TOS, 527 + .set_optname = IP_TOS, 528 + 529 + .set_optval = { 1 << 3 }, 530 + .set_optlen = 1, 531 + .get_optval = { 1 << 3 }, 532 + .get_optlen = 1, 533 + }, 534 + { 535 + .descr = "setsockopt: read ctx->optname", 536 + .insns = { 537 + /* r6 = ctx->optname */ 538 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 539 + offsetof(struct bpf_sockopt, optname)), 540 + 541 + /* if (ctx->optname == 123) { */ 542 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 4), 543 + /* ctx->optlen = -1 */ 544 + BPF_MOV64_IMM(BPF_REG_0, -1), 545 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 546 + offsetof(struct bpf_sockopt, optlen)), 547 + /* return 1 */ 548 + BPF_MOV64_IMM(BPF_REG_0, 1), 549 + BPF_JMP_A(1), 550 + /* } else { */ 551 + /* return 0 */ 552 + BPF_MOV64_IMM(BPF_REG_0, 0), 553 + /* } */ 554 + BPF_EXIT_INSN(), 555 + }, 556 + .attach_type = BPF_CGROUP_SETSOCKOPT, 557 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 558 + 559 + .set_optname = 123, 560 + 561 + .set_optlen = 1, 562 + }, 563 + { 564 + .descr = "setsockopt: allow changing ctx->optname", 565 + .insns = { 566 + /* ctx->optname = IP_TOS */ 567 + BPF_MOV64_IMM(BPF_REG_0, IP_TOS), 568 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 569 + offsetof(struct bpf_sockopt, optname)), 570 + /* return 1 */ 571 + BPF_MOV64_IMM(BPF_REG_0, 1), 572 + BPF_EXIT_INSN(), 573 + }, 574 + .attach_type = BPF_CGROUP_SETSOCKOPT, 575 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 576 + 577 + .get_level = SOL_IP, 578 + .set_level = SOL_IP, 579 + 580 + .get_optname = IP_TOS, 581 + .set_optname = 456, /* should be rewritten to IP_TOS */ 582 + 583 + .set_optval = { 1 << 3 }, 584 + .set_optlen = 1, 585 + .get_optval = { 1 << 3 }, 586 + .get_optlen = 1, 587 + }, 588 + { 589 + .descr = "setsockopt: read ctx->optlen", 590 + 
.insns = { 591 + /* r6 = ctx->optlen */ 592 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 593 + offsetof(struct bpf_sockopt, optlen)), 594 + 595 + /* if (ctx->optlen == 64) { */ 596 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 64, 4), 597 + /* ctx->optlen = -1 */ 598 + BPF_MOV64_IMM(BPF_REG_0, -1), 599 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 600 + offsetof(struct bpf_sockopt, optlen)), 601 + /* return 1 */ 602 + BPF_MOV64_IMM(BPF_REG_0, 1), 603 + BPF_JMP_A(1), 604 + /* } else { */ 605 + /* return 0 */ 606 + BPF_MOV64_IMM(BPF_REG_0, 0), 607 + /* } */ 608 + BPF_EXIT_INSN(), 609 + }, 610 + .attach_type = BPF_CGROUP_SETSOCKOPT, 611 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 612 + 613 + .set_optlen = 64, 614 + }, 615 + { 616 + .descr = "setsockopt: ctx->optlen == -1 is ok", 617 + .insns = { 618 + /* ctx->optlen = -1 */ 619 + BPF_MOV64_IMM(BPF_REG_0, -1), 620 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 621 + offsetof(struct bpf_sockopt, optlen)), 622 + /* return 1 */ 623 + BPF_MOV64_IMM(BPF_REG_0, 1), 624 + BPF_EXIT_INSN(), 625 + }, 626 + .attach_type = BPF_CGROUP_SETSOCKOPT, 627 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 628 + 629 + .set_optlen = 64, 630 + }, 631 + { 632 + .descr = "setsockopt: deny ctx->optlen < 0 (except -1)", 633 + .insns = { 634 + /* ctx->optlen = -2 */ 635 + BPF_MOV64_IMM(BPF_REG_0, -2), 636 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 637 + offsetof(struct bpf_sockopt, optlen)), 638 + /* return 1 */ 639 + BPF_MOV64_IMM(BPF_REG_0, 1), 640 + BPF_EXIT_INSN(), 641 + }, 642 + .attach_type = BPF_CGROUP_SETSOCKOPT, 643 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 644 + 645 + .set_optlen = 4, 646 + 647 + .error = EFAULT_SETSOCKOPT, 648 + }, 649 + { 650 + .descr = "setsockopt: deny ctx->optlen > input optlen", 651 + .insns = { 652 + /* ctx->optlen = 65 */ 653 + BPF_MOV64_IMM(BPF_REG_0, 65), 654 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 655 + offsetof(struct bpf_sockopt, optlen)), 656 + BPF_MOV64_IMM(BPF_REG_0, 1), 657 + BPF_EXIT_INSN(), 658 + }, 659 + 
.attach_type = BPF_CGROUP_SETSOCKOPT, 660 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 661 + 662 + .set_optlen = 64, 663 + 664 + .error = EFAULT_SETSOCKOPT, 665 + }, 666 + { 667 + .descr = "setsockopt: allow changing ctx->optlen within bounds", 668 + .insns = { 669 + /* r6 = ctx->optval */ 670 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 671 + offsetof(struct bpf_sockopt, optval)), 672 + /* r2 = ctx->optval */ 673 + BPF_MOV64_REG(BPF_REG_2, BPF_REG_6), 674 + /* r6 = ctx->optval + 1 */ 675 + BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 1), 676 + 677 + /* r7 = ctx->optval_end */ 678 + BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_1, 679 + offsetof(struct bpf_sockopt, optval_end)), 680 + 681 + /* if (ctx->optval + 1 <= ctx->optval_end) { */ 682 + BPF_JMP_REG(BPF_JGT, BPF_REG_6, BPF_REG_7, 1), 683 + /* ctx->optval[0] = 1 << 3 */ 684 + BPF_ST_MEM(BPF_B, BPF_REG_2, 0, 1 << 3), 685 + /* } */ 686 + 687 + /* ctx->optlen = 1 */ 688 + BPF_MOV64_IMM(BPF_REG_0, 1), 689 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 690 + offsetof(struct bpf_sockopt, optlen)), 691 + 692 + /* return 1*/ 693 + BPF_MOV64_IMM(BPF_REG_0, 1), 694 + BPF_EXIT_INSN(), 695 + }, 696 + .attach_type = BPF_CGROUP_SETSOCKOPT, 697 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 698 + 699 + .get_level = SOL_IP, 700 + .set_level = SOL_IP, 701 + 702 + .get_optname = IP_TOS, 703 + .set_optname = IP_TOS, 704 + 705 + .set_optval = { 1, 1, 1, 1 }, 706 + .set_optlen = 4, 707 + .get_optval = { 1 << 3 }, 708 + .get_optlen = 1, 709 + }, 710 + { 711 + .descr = "setsockopt: deny write ctx->retval", 712 + .insns = { 713 + /* ctx->retval = 0 */ 714 + BPF_MOV64_IMM(BPF_REG_0, 0), 715 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 716 + offsetof(struct bpf_sockopt, retval)), 717 + 718 + /* return 1 */ 719 + BPF_MOV64_IMM(BPF_REG_0, 1), 720 + BPF_EXIT_INSN(), 721 + }, 722 + .attach_type = BPF_CGROUP_SETSOCKOPT, 723 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 724 + 725 + .error = DENY_LOAD, 726 + }, 727 + { 728 + .descr = "setsockopt: deny read 
ctx->retval", 729 + .insns = { 730 + /* r6 = ctx->retval */ 731 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 732 + offsetof(struct bpf_sockopt, retval)), 733 + 734 + /* return 1 */ 735 + BPF_MOV64_IMM(BPF_REG_0, 1), 736 + BPF_EXIT_INSN(), 737 + }, 738 + .attach_type = BPF_CGROUP_SETSOCKOPT, 739 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 740 + 741 + .error = DENY_LOAD, 742 + }, 743 + { 744 + .descr = "setsockopt: deny writing to ctx->optval", 745 + .insns = { 746 + /* ctx->optval = 1 */ 747 + BPF_MOV64_IMM(BPF_REG_0, 1), 748 + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 749 + offsetof(struct bpf_sockopt, optval)), 750 + BPF_EXIT_INSN(), 751 + }, 752 + .attach_type = BPF_CGROUP_SETSOCKOPT, 753 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 754 + 755 + .error = DENY_LOAD, 756 + }, 757 + { 758 + .descr = "setsockopt: deny writing to ctx->optval_end", 759 + .insns = { 760 + /* ctx->optval_end = 1 */ 761 + BPF_MOV64_IMM(BPF_REG_0, 1), 762 + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 763 + offsetof(struct bpf_sockopt, optval_end)), 764 + BPF_EXIT_INSN(), 765 + }, 766 + .attach_type = BPF_CGROUP_SETSOCKOPT, 767 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 768 + 769 + .error = DENY_LOAD, 770 + }, 771 + { 772 + .descr = "setsockopt: allow IP_TOS <= 128", 773 + .insns = { 774 + /* r6 = ctx->optval */ 775 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 776 + offsetof(struct bpf_sockopt, optval)), 777 + /* r7 = ctx->optval + 1 */ 778 + BPF_MOV64_REG(BPF_REG_7, BPF_REG_6), 779 + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 1), 780 + 781 + /* r8 = ctx->optval_end */ 782 + BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_1, 783 + offsetof(struct bpf_sockopt, optval_end)), 784 + 785 + /* if (ctx->optval + 1 <= ctx->optval_end) { */ 786 + BPF_JMP_REG(BPF_JGT, BPF_REG_7, BPF_REG_8, 4), 787 + 788 + /* r9 = ctx->optval[0] */ 789 + BPF_LDX_MEM(BPF_B, BPF_REG_9, BPF_REG_6, 0), 790 + 791 + /* if (ctx->optval[0] < 128) */ 792 + BPF_JMP_IMM(BPF_JGT, BPF_REG_9, 128, 2), 793 + BPF_MOV64_IMM(BPF_REG_0, 1), 794 
+ BPF_JMP_A(1), 795 + /* } */ 796 + 797 + /* } else { */ 798 + BPF_MOV64_IMM(BPF_REG_0, 0), 799 + /* } */ 800 + 801 + BPF_EXIT_INSN(), 802 + }, 803 + .attach_type = BPF_CGROUP_SETSOCKOPT, 804 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 805 + 806 + .get_level = SOL_IP, 807 + .set_level = SOL_IP, 808 + 809 + .get_optname = IP_TOS, 810 + .set_optname = IP_TOS, 811 + 812 + .set_optval = { 0x80 }, 813 + .set_optlen = 1, 814 + .get_optval = { 0x80 }, 815 + .get_optlen = 1, 816 + }, 817 + { 818 + .descr = "setsockopt: deny IP_TOS > 128", 819 + .insns = { 820 + /* r6 = ctx->optval */ 821 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 822 + offsetof(struct bpf_sockopt, optval)), 823 + /* r7 = ctx->optval + 1 */ 824 + BPF_MOV64_REG(BPF_REG_7, BPF_REG_6), 825 + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 1), 826 + 827 + /* r8 = ctx->optval_end */ 828 + BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_1, 829 + offsetof(struct bpf_sockopt, optval_end)), 830 + 831 + /* if (ctx->optval + 1 <= ctx->optval_end) { */ 832 + BPF_JMP_REG(BPF_JGT, BPF_REG_7, BPF_REG_8, 4), 833 + 834 + /* r9 = ctx->optval[0] */ 835 + BPF_LDX_MEM(BPF_B, BPF_REG_9, BPF_REG_6, 0), 836 + 837 + /* if (ctx->optval[0] < 128) */ 838 + BPF_JMP_IMM(BPF_JGT, BPF_REG_9, 128, 2), 839 + BPF_MOV64_IMM(BPF_REG_0, 1), 840 + BPF_JMP_A(1), 841 + /* } */ 842 + 843 + /* } else { */ 844 + BPF_MOV64_IMM(BPF_REG_0, 0), 845 + /* } */ 846 + 847 + BPF_EXIT_INSN(), 848 + }, 849 + .attach_type = BPF_CGROUP_SETSOCKOPT, 850 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 851 + 852 + .get_level = SOL_IP, 853 + .set_level = SOL_IP, 854 + 855 + .get_optname = IP_TOS, 856 + .set_optname = IP_TOS, 857 + 858 + .set_optval = { 0x81 }, 859 + .set_optlen = 1, 860 + .get_optval = { 0x00 }, 861 + .get_optlen = 1, 862 + 863 + .error = EPERM_SETSOCKOPT, 864 + }, 865 + }; 866 + 867 + static int load_prog(const struct bpf_insn *insns, 868 + enum bpf_attach_type expected_attach_type) 869 + { 870 + struct bpf_load_program_attr attr = { 871 + .prog_type = 
BPF_PROG_TYPE_CGROUP_SOCKOPT, 872 + .expected_attach_type = expected_attach_type, 873 + .insns = insns, 874 + .license = "GPL", 875 + .log_level = 2, 876 + }; 877 + int fd; 878 + 879 + for (; 880 + insns[attr.insns_cnt].code != (BPF_JMP | BPF_EXIT); 881 + attr.insns_cnt++) { 882 + } 883 + attr.insns_cnt++; 884 + 885 + fd = bpf_load_program_xattr(&attr, bpf_log_buf, sizeof(bpf_log_buf)); 886 + if (verbose && fd < 0) 887 + fprintf(stderr, "%s\n", bpf_log_buf); 888 + 889 + return fd; 890 + } 891 + 892 + static int run_test(int cgroup_fd, struct sockopt_test *test) 893 + { 894 + int sock_fd, err, prog_fd; 895 + void *optval = NULL; 896 + int ret = 0; 897 + 898 + prog_fd = load_prog(test->insns, test->expected_attach_type); 899 + if (prog_fd < 0) { 900 + if (test->error == DENY_LOAD) 901 + return 0; 902 + 903 + log_err("Failed to load BPF program"); 904 + return -1; 905 + } 906 + 907 + err = bpf_prog_attach(prog_fd, cgroup_fd, test->attach_type, 0); 908 + if (err < 0) { 909 + if (test->error == DENY_ATTACH) 910 + goto close_prog_fd; 911 + 912 + log_err("Failed to attach BPF program"); 913 + ret = -1; 914 + goto close_prog_fd; 915 + } 916 + 917 + sock_fd = socket(AF_INET, SOCK_STREAM, 0); 918 + if (sock_fd < 0) { 919 + log_err("Failed to create AF_INET socket"); 920 + ret = -1; 921 + goto detach_prog; 922 + } 923 + 924 + if (test->set_optlen) { 925 + err = setsockopt(sock_fd, test->set_level, test->set_optname, 926 + test->set_optval, test->set_optlen); 927 + if (err) { 928 + if (errno == EPERM && test->error == EPERM_SETSOCKOPT) 929 + goto close_sock_fd; 930 + if (errno == EFAULT && test->error == EFAULT_SETSOCKOPT) 931 + goto free_optval; 932 + 933 + log_err("Failed to call setsockopt"); 934 + ret = -1; 935 + goto close_sock_fd; 936 + } 937 + } 938 + 939 + if (test->get_optlen) { 940 + optval = malloc(test->get_optlen); 941 + socklen_t optlen = test->get_optlen; 942 + socklen_t expected_get_optlen = test->get_optlen_ret ?: 943 + test->get_optlen; 944 + 945 + err = 
getsockopt(sock_fd, test->get_level, test->get_optname, 946 + optval, &optlen); 947 + if (err) { 948 + if (errno == EPERM && test->error == EPERM_GETSOCKOPT) 949 + goto free_optval; 950 + if (errno == EFAULT && test->error == EFAULT_GETSOCKOPT) 951 + goto free_optval; 952 + 953 + log_err("Failed to call getsockopt"); 954 + ret = -1; 955 + goto free_optval; 956 + } 957 + 958 + if (optlen != expected_get_optlen) { 959 + errno = 0; 960 + log_err("getsockopt returned unexpected optlen"); 961 + ret = -1; 962 + goto free_optval; 963 + } 964 + 965 + if (memcmp(optval, test->get_optval, optlen) != 0) { 966 + errno = 0; 967 + log_err("getsockopt returned unexpected optval"); 968 + ret = -1; 969 + goto free_optval; 970 + } 971 + } 972 + 973 + ret = test->error != OK; 974 + 975 + free_optval: 976 + free(optval); 977 + close_sock_fd: 978 + close(sock_fd); 979 + detach_prog: 980 + bpf_prog_detach2(prog_fd, cgroup_fd, test->attach_type); 981 + close_prog_fd: 982 + close(prog_fd); 983 + return ret; 984 + } 985 + 986 + int main(int args, char **argv) 987 + { 988 + int err = EXIT_FAILURE, error_cnt = 0; 989 + int cgroup_fd, i; 990 + 991 + if (setup_cgroup_environment()) 992 + goto cleanup_obj; 993 + 994 + cgroup_fd = create_and_get_cgroup(CG_PATH); 995 + if (cgroup_fd < 0) 996 + goto cleanup_cgroup_env; 997 + 998 + if (join_cgroup(CG_PATH)) 999 + goto cleanup_cgroup; 1000 + 1001 + for (i = 0; i < ARRAY_SIZE(tests); i++) { 1002 + int err = run_test(cgroup_fd, &tests[i]); 1003 + 1004 + if (err) 1005 + error_cnt++; 1006 + 1007 + printf("#%d %s: %s\n", i, err ? "FAIL" : "PASS", 1008 + tests[i].descr); 1009 + } 1010 + 1011 + printf("Summary: %ld PASSED, %d FAILED\n", 1012 + ARRAY_SIZE(tests) - error_cnt, error_cnt); 1013 + err = error_cnt ? EXIT_FAILURE : EXIT_SUCCESS; 1014 + 1015 + cleanup_cgroup: 1016 + close(cgroup_fd); 1017 + cleanup_cgroup_env: 1018 + cleanup_cgroup_environment(); 1019 + cleanup_obj: 1020 + return err; 1021 + }
tools/testing/selftests/bpf/test_sockopt_multi.c (new file, +374 lines)
// SPDX-License-Identifier: GPL-2.0

#include <error.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#include <linux/filter.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "bpf_rlimit.h"
#include "bpf_util.h"
#include "cgroup_helpers.h"

static int prog_attach(struct bpf_object *obj, int cgroup_fd, const char *title)
{
	enum bpf_attach_type attach_type;
	enum bpf_prog_type prog_type;
	struct bpf_program *prog;
	int err;

	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
	if (err) {
		log_err("Failed to deduce types for %s BPF program", title);
		return -1;
	}

	prog = bpf_object__find_program_by_title(obj, title);
	if (!prog) {
		log_err("Failed to find %s BPF program", title);
		return -1;
	}

	err = bpf_prog_attach(bpf_program__fd(prog), cgroup_fd,
			      attach_type, BPF_F_ALLOW_MULTI);
	if (err) {
		log_err("Failed to attach %s BPF program", title);
		return -1;
	}

	return 0;
}

static int prog_detach(struct bpf_object *obj, int cgroup_fd, const char *title)
{
	enum bpf_attach_type attach_type;
	enum bpf_prog_type prog_type;
	struct bpf_program *prog;
	int err;

	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
	if (err)
		return -1;

	prog = bpf_object__find_program_by_title(obj, title);
	if (!prog)
		return -1;

	err = bpf_prog_detach2(bpf_program__fd(prog), cgroup_fd,
			       attach_type);
	if (err)
		return -1;

	return 0;
}

static int run_getsockopt_test(struct bpf_object *obj, int cg_parent,
			       int cg_child, int sock_fd)
{
	socklen_t optlen;
	__u8 buf;
	int err;

	/* Set IP_TOS to the expected value (0x80). */

	buf = 0x80;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x80) {
		log_err("Unexpected getsockopt 0x%x != 0x80 without BPF", buf);
		err = -1;
		goto detach;
	}

	/* Attach child program and make sure it returns new value:
	 * - kernel:      -> 0x80
	 * - child:  0x80 -> 0x90
	 */

	err = prog_attach(obj, cg_child, "cgroup/getsockopt/child");
	if (err)
		goto detach;

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x90) {
		log_err("Unexpected getsockopt 0x%x != 0x90", buf);
		err = -1;
		goto detach;
	}

	/* Attach parent program and make sure it returns new value:
	 * - kernel:      -> 0x80
	 * - child:  0x80 -> 0x90
	 * - parent: 0x90 -> 0xA0
	 */

	err = prog_attach(obj, cg_parent, "cgroup/getsockopt/parent");
	if (err)
		goto detach;

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0xA0) {
		log_err("Unexpected getsockopt 0x%x != 0xA0", buf);
		err = -1;
		goto detach;
	}

	/* Setting unexpected initial sockopt should return EPERM:
	 * - kernel: -> 0x40
	 * - child:  unexpected 0x40, EPERM
	 * - parent: unexpected 0x40, EPERM
	 */

	buf = 0x40;
	if (setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1) < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (!err) {
		log_err("Unexpected success from getsockopt(IP_TOS)");
		err = -1;
		goto detach;
	}

	/* Detach child program and make sure we still get EPERM:
	 * - kernel: -> 0x40
	 * - parent: unexpected 0x40, EPERM
	 */

	err = prog_detach(obj, cg_child, "cgroup/getsockopt/child");
	if (err) {
		log_err("Failed to detach child program");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (!err) {
		log_err("Unexpected success from getsockopt(IP_TOS)");
		err = -1;
		goto detach;
	}

	/* Set initial value to the one the parent program expects:
	 * - kernel:      -> 0x90
	 * - parent: 0x90 -> 0xA0
	 */

	buf = 0x90;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0xA0) {
		log_err("Unexpected getsockopt 0x%x != 0xA0", buf);
		err = -1;
		goto detach;
	}

detach:
	prog_detach(obj, cg_child, "cgroup/getsockopt/child");
	prog_detach(obj, cg_parent, "cgroup/getsockopt/parent");

	return err;
}

static int run_setsockopt_test(struct bpf_object *obj, int cg_parent,
			       int cg_child, int sock_fd)
{
	socklen_t optlen;
	__u8 buf;
	int err;

	/* Set IP_TOS to the expected value (0x80). */

	buf = 0x80;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x80) {
		log_err("Unexpected getsockopt 0x%x != 0x80 without BPF", buf);
		err = -1;
		goto detach;
	}

	/* Attach child program and make sure it adds 0x10. */

	err = prog_attach(obj, cg_child, "cgroup/setsockopt");
	if (err)
		goto detach;

	buf = 0x80;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x80 + 0x10) {
		log_err("Unexpected getsockopt 0x%x != 0x80 + 0x10", buf);
		err = -1;
		goto detach;
	}

	/* Attach parent program and make sure it adds another 0x10. */

	err = prog_attach(obj, cg_parent, "cgroup/setsockopt");
	if (err)
		goto detach;

	buf = 0x80;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x80 + 2 * 0x10) {
		log_err("Unexpected getsockopt 0x%x != 0x80 + 2 * 0x10", buf);
		err = -1;
		goto detach;
	}

detach:
	prog_detach(obj, cg_child, "cgroup/setsockopt");
	prog_detach(obj, cg_parent, "cgroup/setsockopt");

	return err;
}

int main(int argc, char **argv)
{
	struct bpf_prog_load_attr attr = {
		.file = "./sockopt_multi.o",
	};
	int cg_parent = -1, cg_child = -1;
	struct bpf_object *obj = NULL;
	int sock_fd = -1;
	int err = -1;
	int ignored;

	if (setup_cgroup_environment()) {
		log_err("Failed to setup cgroup environment\n");
		goto out;
	}

	cg_parent = create_and_get_cgroup("/parent");
	if (cg_parent < 0) {
		log_err("Failed to create cgroup /parent\n");
		goto out;
	}

	cg_child = create_and_get_cgroup("/parent/child");
	if (cg_child < 0) {
		log_err("Failed to create cgroup /parent/child\n");
		goto out;
	}

	if (join_cgroup("/parent/child")) {
		log_err("Failed to join cgroup /parent/child\n");
		goto out;
	}

	err = bpf_prog_load_xattr(&attr, &obj, &ignored);
	if (err) {
		log_err("Failed to load BPF object");
		goto out;
	}

	sock_fd = socket(AF_INET, SOCK_STREAM, 0);
	if (sock_fd < 0) {
		log_err("Failed to create socket");
		goto out;
	}

	if (run_getsockopt_test(obj, cg_parent, cg_child, sock_fd))
		err = -1;
	printf("test_sockopt_multi: getsockopt %s\n",
	       err ? "FAILED" : "PASSED");

	if (run_setsockopt_test(obj, cg_parent, cg_child, sock_fd))
		err = -1;
	printf("test_sockopt_multi: setsockopt %s\n",
	       err ? "FAILED" : "PASSED");

out:
	close(sock_fd);
	bpf_object__close(obj);
	close(cg_child);
	close(cg_parent);

	printf("test_sockopt_multi: %s\n", err ? "FAILED" : "PASSED");
	return err ? EXIT_FAILURE : EXIT_SUCCESS;
}
tools/testing/selftests/bpf/test_sockopt_sk.c (new file, +211 lines)
// SPDX-License-Identifier: GPL-2.0

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#include <linux/filter.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "bpf_rlimit.h"
#include "bpf_util.h"
#include "cgroup_helpers.h"

#define CG_PATH		"/sockopt"

#define SOL_CUSTOM	0xdeadbeef

static int getsetsockopt(void)
{
	int fd, err;
	union {
		char u8[4];
		__u32 u32;
	} buf = {};
	socklen_t optlen;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		log_err("Failed to create socket");
		return -1;
	}

	/* IP_TOS - BPF bypass */

	buf.u8[0] = 0x08;
	err = setsockopt(fd, SOL_IP, IP_TOS, &buf, 1);
	if (err) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto err;
	}

	buf.u8[0] = 0x00;
	optlen = 1;
	err = getsockopt(fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto err;
	}

	if (buf.u8[0] != 0x08) {
		log_err("Unexpected getsockopt(IP_TOS) buf[0] 0x%02x != 0x08",
			buf.u8[0]);
		goto err;
	}

	/* IP_TTL - EPERM */

	buf.u8[0] = 1;
	err = setsockopt(fd, SOL_IP, IP_TTL, &buf, 1);
	if (!err || errno != EPERM) {
		log_err("Unexpected success from setsockopt(IP_TTL)");
		goto err;
	}

	/* SOL_CUSTOM - handled by BPF */

	buf.u8[0] = 0x01;
	err = setsockopt(fd, SOL_CUSTOM, 0, &buf, 1);
	if (err) {
		log_err("Failed to call setsockopt");
		goto err;
	}

	buf.u32 = 0x00;
	optlen = 4;
	err = getsockopt(fd, SOL_CUSTOM, 0, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt");
		goto err;
	}

	if (optlen != 1) {
		log_err("Unexpected optlen %d != 1", optlen);
		goto err;
	}
	if (buf.u8[0] != 0x01) {
		log_err("Unexpected buf[0] 0x%02x != 0x01", buf.u8[0]);
		goto err;
	}

	/* SO_SNDBUF is overwritten */

	buf.u32 = 0x01010101;
	err = setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf, 4);
	if (err) {
		log_err("Failed to call setsockopt(SO_SNDBUF)");
		goto err;
	}

	buf.u32 = 0x00;
	optlen = 4;
	err = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(SO_SNDBUF)");
		goto err;
	}

	if (buf.u32 != 0x55AA * 2) {
		log_err("Unexpected getsockopt(SO_SNDBUF) 0x%x != 0x55AA*2",
			buf.u32);
		goto err;
	}

	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}

static int prog_attach(struct bpf_object *obj, int cgroup_fd, const char *title)
{
	enum bpf_attach_type attach_type;
	enum bpf_prog_type prog_type;
	struct bpf_program *prog;
	int err;

	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
	if (err) {
		log_err("Failed to deduce types for %s BPF program", title);
		return -1;
	}

	prog = bpf_object__find_program_by_title(obj, title);
	if (!prog) {
		log_err("Failed to find %s BPF program", title);
		return -1;
	}

	err = bpf_prog_attach(bpf_program__fd(prog), cgroup_fd,
			      attach_type, 0);
	if (err) {
		log_err("Failed to attach %s BPF program", title);
		return -1;
	}

	return 0;
}

static int run_test(int cgroup_fd)
{
	struct bpf_prog_load_attr attr = {
		.file = "./sockopt_sk.o",
	};
	struct bpf_object *obj;
	int ignored;
	int err;

	err = bpf_prog_load_xattr(&attr, &obj, &ignored);
	if (err) {
		log_err("Failed to load BPF object");
		return -1;
	}

	err = prog_attach(obj, cgroup_fd, "cgroup/getsockopt");
	if (err)
		goto close_bpf_object;

	err = prog_attach(obj, cgroup_fd, "cgroup/setsockopt");
	if (err)
		goto close_bpf_object;

	err = getsetsockopt();

close_bpf_object:
	bpf_object__close(obj);
	return err;
}

int main(int argc, char **argv)
{
	int cgroup_fd;
	int err = EXIT_FAILURE;

	if (setup_cgroup_environment())
		goto cleanup_obj;

	cgroup_fd = create_and_get_cgroup(CG_PATH);
	if (cgroup_fd < 0)
		goto cleanup_cgroup_env;

	if (join_cgroup(CG_PATH))
		goto cleanup_cgroup;

	err = run_test(cgroup_fd) ? EXIT_FAILURE : EXIT_SUCCESS;

	printf("test_sockopt_sk: %s\n",
	       err == EXIT_SUCCESS ? "PASSED" : "FAILED");

cleanup_cgroup:
	close(cgroup_fd);
cleanup_cgroup_env:
	cleanup_cgroup_environment();
cleanup_obj:
	return err;
}
+254
tools/testing/selftests/bpf/test_tcp_rtt.c
// SPDX-License-Identifier: GPL-2.0
#include <error.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>

#include <linux/filter.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "bpf_rlimit.h"
#include "bpf_util.h"
#include "cgroup_helpers.h"

#define CG_PATH "/tcp_rtt"

struct tcp_rtt_storage {
	__u32 invoked;
	__u32 dsack_dups;
	__u32 delivered;
	__u32 delivered_ce;
	__u32 icsk_retransmits;
};

static void send_byte(int fd)
{
	char b = 0x55;

	if (write(fd, &b, sizeof(b)) != 1)
		error(1, errno, "Failed to send single byte");
}

static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
		     __u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
		     __u32 icsk_retransmits)
{
	int err = 0;
	struct tcp_rtt_storage val;

	if (bpf_map_lookup_elem(map_fd, &client_fd, &val) < 0)
		error(1, errno, "Failed to read socket storage");

	if (val.invoked != invoked) {
		log_err("%s: unexpected bpf_tcp_sock.invoked %d != %d",
			msg, val.invoked, invoked);
		err++;
	}

	if (val.dsack_dups != dsack_dups) {
		log_err("%s: unexpected bpf_tcp_sock.dsack_dups %d != %d",
			msg, val.dsack_dups, dsack_dups);
		err++;
	}

	if (val.delivered != delivered) {
		log_err("%s: unexpected bpf_tcp_sock.delivered %d != %d",
			msg, val.delivered, delivered);
		err++;
	}

	if (val.delivered_ce != delivered_ce) {
		log_err("%s: unexpected bpf_tcp_sock.delivered_ce %d != %d",
			msg, val.delivered_ce, delivered_ce);
		err++;
	}

	if (val.icsk_retransmits != icsk_retransmits) {
		log_err("%s: unexpected bpf_tcp_sock.icsk_retransmits %d != %d",
			msg, val.icsk_retransmits, icsk_retransmits);
		err++;
	}

	return err;
}

static int connect_to_server(int server_fd)
{
	struct sockaddr_storage addr;
	socklen_t len = sizeof(addr);
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		log_err("Failed to create client socket");
		return -1;
	}

	if (getsockname(server_fd, (struct sockaddr *)&addr, &len)) {
		log_err("Failed to get server addr");
		goto out;
	}

	if (connect(fd, (const struct sockaddr *)&addr, len) < 0) {
		log_err("Failed to connect to server");
		goto out;
	}

	return fd;

out:
	close(fd);
	return -1;
}

static int run_test(int cgroup_fd, int server_fd)
{
	struct bpf_prog_load_attr attr = {
		.prog_type = BPF_PROG_TYPE_SOCK_OPS,
		.file = "./tcp_rtt.o",
		.expected_attach_type = BPF_CGROUP_SOCK_OPS,
	};
	struct bpf_object *obj;
	struct bpf_map *map;
	int client_fd;
	int prog_fd;
	int map_fd;
	int err;

	err = bpf_prog_load_xattr(&attr, &obj, &prog_fd);
	if (err) {
		log_err("Failed to load BPF object");
		return -1;
	}

	map = bpf_map__next(NULL, obj);
	map_fd = bpf_map__fd(map);

	err = bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_SOCK_OPS, 0);
	if (err) {
		log_err("Failed to attach BPF program");
		goto close_bpf_object;
	}

	client_fd = connect_to_server(server_fd);
	if (client_fd < 0) {
		err = -1;
		goto close_bpf_object;
	}

	err += verify_sk(map_fd, client_fd, "syn-ack",
			 /*invoked=*/1,
			 /*dsack_dups=*/0,
			 /*delivered=*/1,
			 /*delivered_ce=*/0,
			 /*icsk_retransmits=*/0);

	send_byte(client_fd);

	err += verify_sk(map_fd, client_fd, "first payload byte",
			 /*invoked=*/2,
			 /*dsack_dups=*/0,
			 /*delivered=*/2,
			 /*delivered_ce=*/0,
			 /*icsk_retransmits=*/0);

	close(client_fd);

close_bpf_object:
	bpf_object__close(obj);
	return err;
}

static int start_server(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		log_err("Failed to create server socket");
		return -1;
	}

	if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
		log_err("Failed to bind socket");
		close(fd);
		return -1;
	}

	return fd;
}

static void *server_thread(void *arg)
{
	struct sockaddr_storage addr;
	socklen_t len = sizeof(addr);
	int fd = *(int *)arg;
	int client_fd;

	if (listen(fd, 1) < 0)
		error(1, errno, "Failed to listen on socket");

	client_fd = accept(fd, (struct sockaddr *)&addr, &len);
	if (client_fd < 0)
		error(1, errno, "Failed to accept client");

	/* Wait for the next connection (that never arrives)
	 * to keep this thread alive to prevent calling
	 * close() on client_fd.
	 */
	if (accept(fd, (struct sockaddr *)&addr, &len) >= 0)
		error(1, errno, "Unexpected success in second accept");

	close(client_fd);

	return NULL;
}

int main(int argc, char **argv)
{
	int server_fd, cgroup_fd;
	int err = EXIT_SUCCESS;
	pthread_t tid;

	if (setup_cgroup_environment())
		goto cleanup_obj;

	cgroup_fd = create_and_get_cgroup(CG_PATH);
	if (cgroup_fd < 0)
		goto cleanup_cgroup_env;

	if (join_cgroup(CG_PATH))
		goto cleanup_cgroup;

	server_fd = start_server();
	if (server_fd < 0) {
		err = EXIT_FAILURE;
		goto cleanup_cgroup;
	}

	pthread_create(&tid, NULL, server_thread, (void *)&server_fd);

	if (run_test(cgroup_fd, server_fd))
		err = EXIT_FAILURE;

	close(server_fd);

	printf("test_tcp_rtt: %s\n",
	       err == EXIT_SUCCESS ? "PASSED" : "FAILED");

cleanup_cgroup:
	close(cgroup_fd);
cleanup_cgroup_env:
	cleanup_cgroup_environment();
cleanup_obj:
	return err;
}
+118
tools/testing/selftests/bpf/test_xdp_veth.sh
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
#
# Create 3 namespaces with 3 veth peers, and
# forward packets in-between using native XDP
#
#                        XDP_TX
# NS1(veth11)        NS2(veth22)        NS3(veth33)
#      |                  |                  |
#      |                  |                  |
#   (veth1,            (veth2,            (veth3,
#   id:111)            id:122)            id:133)
#     ^ |                ^ |                ^ |
#     | |  XDP_REDIRECT  | |  XDP_REDIRECT  | |
#     | ------------------ ------------------ |
#     -----------------------------------------
#                    XDP_REDIRECT

# Kselftest framework requirement - SKIP code is 4.
ksft_skip=4

TESTNAME=xdp_veth
BPF_FS=$(awk '$3 == "bpf" {print $2; exit}' /proc/mounts)
BPF_DIR=$BPF_FS/test_$TESTNAME

_cleanup()
{
	set +e
	ip link del veth1 2> /dev/null
	ip link del veth2 2> /dev/null
	ip link del veth3 2> /dev/null
	ip netns del ns1 2> /dev/null
	ip netns del ns2 2> /dev/null
	ip netns del ns3 2> /dev/null
	rm -rf $BPF_DIR 2> /dev/null
}

cleanup_skip()
{
	echo "selftests: $TESTNAME [SKIP]"
	_cleanup

	exit $ksft_skip
}

cleanup()
{
	if [ "$?" = 0 ]; then
		echo "selftests: $TESTNAME [PASS]"
	else
		echo "selftests: $TESTNAME [FAILED]"
	fi
	_cleanup
}

if [ $(id -u) -ne 0 ]; then
	echo "selftests: $TESTNAME [SKIP] Need root privileges"
	exit $ksft_skip
fi

if ! ip link set dev lo xdp off > /dev/null 2>&1; then
	echo "selftests: $TESTNAME [SKIP] Could not run test without ip xdp support"
	exit $ksft_skip
fi

if [ -z "$BPF_FS" ]; then
	echo "selftests: $TESTNAME [SKIP] Could not run test without bpffs mounted"
	exit $ksft_skip
fi

if ! bpftool version > /dev/null 2>&1; then
	echo "selftests: $TESTNAME [SKIP] Could not run test without bpftool"
	exit $ksft_skip
fi

set -e

trap cleanup_skip EXIT

ip netns add ns1
ip netns add ns2
ip netns add ns3

ip link add veth1 index 111 type veth peer name veth11 netns ns1
ip link add veth2 index 122 type veth peer name veth22 netns ns2
ip link add veth3 index 133 type veth peer name veth33 netns ns3

ip link set veth1 up
ip link set veth2 up
ip link set veth3 up

ip -n ns1 addr add 10.1.1.11/24 dev veth11
ip -n ns3 addr add 10.1.1.33/24 dev veth33

ip -n ns1 link set dev veth11 up
ip -n ns2 link set dev veth22 up
ip -n ns3 link set dev veth33 up

mkdir $BPF_DIR
bpftool prog loadall \
	xdp_redirect_map.o $BPF_DIR/progs type xdp \
	pinmaps $BPF_DIR/maps
bpftool map update pinned $BPF_DIR/maps/tx_port key 0 0 0 0 value 122 0 0 0
bpftool map update pinned $BPF_DIR/maps/tx_port key 1 0 0 0 value 133 0 0 0
bpftool map update pinned $BPF_DIR/maps/tx_port key 2 0 0 0 value 111 0 0 0
ip link set dev veth1 xdp pinned $BPF_DIR/progs/redirect_map_0
ip link set dev veth2 xdp pinned $BPF_DIR/progs/redirect_map_1
ip link set dev veth3 xdp pinned $BPF_DIR/progs/redirect_map_2

ip -n ns1 link set dev veth11 xdp obj xdp_dummy.o sec xdp_dummy
ip -n ns2 link set dev veth22 xdp obj xdp_tx.o sec tx
ip -n ns3 link set dev veth33 xdp obj xdp_dummy.o sec xdp_dummy

trap cleanup EXIT

ip netns exec ns1 ping -c 1 -W 1 10.1.1.33

exit 0