Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-03

The following pull-request contains BPF updates for your *net-next* tree.

There is a minor merge conflict in mlx5 due to 8960b38932be ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but the resolution is not that bad ... getting the
current bpf-next out now before more mlx5 changes arrive. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:

** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

<<<<<<< HEAD
static int mlx5e_open_cq(struct mlx5e_channel *c,
			 struct dim_cq_moder moder,
			 struct mlx5e_cq_param *param,
			 struct mlx5e_cq *cq)
=======
int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
		  struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497

Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...

drivers/net/ethernet/mellanox/mlx5/core/en.h +977

... and in mlx5e_open_xsk() ...

drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64

... needs the same rename from net_dim_cq_moder into dim_cq_moder.

** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

<<<<<<< HEAD
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
struct dim_cq_moder icocq_moder = {0, 0};
struct net_device *netdev = priv->netdev;
struct mlx5e_channel *c;
unsigned int irq;
=======
struct net_dim_cq_moder icocq_moder = {0, 0};
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497

Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.

Let me know if you run into any issues. Anyway, the main changes are:

1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.

2) Addition of two new per-cgroup BPF hooks for getsockopt and
setsockopt along with a new sockopt program type which allows more
fine-grained pass/reject settings for containers. Also add a sock_ops
callback that can be selectively enabled on a per-socket basis and is
executed on every RTT to help track TCP statistics, both features
from Stanislav.

3) Follow-up fix for loops in precision tracking which were not propagating
precision marks; as a result the verifier assumed that some branches were
not taken and therefore wrongly removed them as dead code, from Alexei.

4) Fix a BPF cgroup release synchronization race which could lead to a
double-free if a leaf's cgroup_bpf object is released while a new BPF
program is attached to one of the ancestor cgroups in parallel, from Roman.

5) Support for bulking XDP_TX on veth devices which improves performance
in some cases by around 9%, from Toshiaki.

6) Allow for lookups into BPF devmap and improve feedback when calling into
bpf_redirect_map() as lookup is now performed right away in the helper
itself, from Toke.

7) Add support for fq's Earliest Departure Time to the Host Bandwidth
Manager (HBM) sample BPF program, from Lawrence.

8) Various cleanups and minor fixes all over the place from many others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+6225 -869

+1
Documentation/bpf/index.rst

···
 .. toctree::
    :maxdepth: 1
 
+   prog_cgroup_sockopt
    prog_cgroup_sysctl
    prog_flow_dissector
+93
Documentation/bpf/prog_cgroup_sockopt.rst (new file)

···
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+BPF_PROG_TYPE_CGROUP_SOCKOPT
+============================
+
+``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
+cgroup hooks:
+
+* ``BPF_CGROUP_GETSOCKOPT`` - called every time process executes ``getsockopt``
+  system call.
+* ``BPF_CGROUP_SETSOCKOPT`` - called every time process executes ``setsockopt``
+  system call.
+
+The context (``struct bpf_sockopt``) has associated socket (``sk``) and
+all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.
+
+BPF_CGROUP_SETSOCKOPT
+=====================
+
+``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handling of
+sockopt and it has writable context: it can modify the supplied arguments
+before passing them down to the kernel. This hook has access to the cgroup
+and socket local storage.
+
+If the BPF program sets ``optlen`` to -1, control will be returned
+back to the userspace after all other BPF programs in the cgroup
+chain finish (i.e. kernel ``setsockopt`` handling will *not* be executed).
+
+Note that ``optlen`` can not be increased beyond the user-supplied
+value. It can only be decreased or set to -1. Any other value will
+trigger ``EFAULT``.
+
+Return Type
+-----------
+
+* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
+* ``1`` - success, continue with next BPF program in the cgroup chain.
+
+BPF_CGROUP_GETSOCKOPT
+=====================
+
+``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handling of
+sockopt. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
+if it's interested in whatever the kernel has returned. The BPF hook can
+override the values above, adjust ``optlen`` and reset ``retval`` to 0.
+If ``optlen`` has been increased above the initial ``getsockopt`` value
+(i.e. the userspace buffer is too small), ``EFAULT`` is returned.
+
+This hook has access to the cgroup and socket local storage.
+
+Note that the only acceptable values to set to ``retval`` are 0 and the
+original value that the kernel returned. Any other value will trigger
+``EFAULT``.
+
+Return Type
+-----------
+
+* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
+* ``1`` - success: copy ``optval`` and ``optlen`` to userspace, return
+  ``retval`` from the syscall (note that this can be overwritten by
+  the BPF program from the parent cgroup).
+
+Cgroup Inheritance
+==================
+
+Suppose there is the following cgroup hierarchy where each cgroup
+has ``BPF_CGROUP_GETSOCKOPT`` attached at each level with the
+``BPF_F_ALLOW_MULTI`` flag::
+
+  A (root, parent)
+   \
+    B (child)
+
+When the application calls the ``getsockopt`` syscall from cgroup B,
+the programs are executed from the bottom up: B, A. The first program
+(B) sees the result of the kernel's ``getsockopt``. It can optionally
+adjust ``optval``, ``optlen`` and reset ``retval`` to 0. After that
+control will be passed to the second (A) program which will see the
+same context as B including any potential modifications.
+
+Same for ``BPF_CGROUP_SETSOCKOPT``: if the program is attached to
+A and B, the trigger order is B, then A. If B makes any changes
+to the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
+then the next program in the chain (A) will see those changes,
+*not* the original input ``setsockopt`` arguments. The potentially
+modified values will then be passed down to the kernel.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
+of a BPF program that handles socket options.
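The ``optlen`` rule for ``BPF_CGROUP_SETSOCKOPT`` described above (only decrease or set to -1, anything else is ``EFAULT``) can be sketched as a small standalone model. ``check_optlen`` is a hypothetical helper name for illustration, not the kernel's actual code path:

```c
#include <assert.h>
#include <errno.h>

/* Model of the documented optlen constraint: after a
 * BPF_CGROUP_SETSOCKOPT program runs, optlen may only have been
 * decreased or set to -1; any other change yields -EFAULT.
 * Returns 1 for "bypass kernel setsockopt handling" (optlen == -1),
 * 0 for "continue to the kernel handler". */
static int check_optlen(int user_optlen, int bpf_optlen)
{
	if (bpf_optlen == -1)
		return 1; /* kernel setsockopt handling is skipped */
	if (bpf_optlen < 0 || bpf_optlen > user_optlen)
		return -EFAULT; /* increased or otherwise invalid */
	return 0; /* unchanged or decreased: proceed */
}
```

The return codes here are for the model only; they capture the documented constraint, not the kernel's internal control flow.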
+15 -1
Documentation/networking/af_xdp.rst

···
 In order to use AF_XDP sockets there are two parts needed. The
 user-space application and the XDP program. For a complete setup and
 usage example, please refer to the sample application. The user-space
-side is xdpsock_user.c and the XDP side xdpsock_kern.c.
+side is xdpsock_user.c and the XDP side is part of libbpf.
+
+The XDP code sample included in tools/lib/bpf/xsk.c is the following::
+
+   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+   {
+       int index = ctx->rx_queue_index;
+
+       // A set entry here means that the corresponding queue_id
+       // has an active AF_XDP socket bound to it.
+       if (bpf_map_lookup_elem(&xsks_map, &index))
+           return bpf_redirect_map(&xsks_map, index, 0);
+
+       return XDP_PASS;
+   }
 
 Naive ring dequeue and enqueue could look like this::
+7 -5
drivers/net/ethernet/intel/i40e/i40e_xsk.c

···
 	struct i40e_tx_desc *tx_desc = NULL;
 	struct i40e_tx_buffer *tx_bi;
 	bool work_done = true;
+	struct xdp_desc desc;
 	dma_addr_t dma;
-	u32 len;
 
 	while (budget-- > 0) {
 		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
···
 			break;
 		}
 
-		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
 			break;
 
-		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+		dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
 					   DMA_BIDIRECTIONAL);
 
 		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
-		tx_bi->bytecount = len;
+		tx_bi->bytecount = desc.len;
 
 		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
 		tx_desc->buffer_addr = cpu_to_le64(dma);
 		tx_desc->cmd_type_offset_bsz =
 			build_ctob(I40E_TX_DESC_CMD_ICRC
 				   | I40E_TX_DESC_CMD_EOP,
-				   0, len, 0);
+				   0, desc.len, 0);
 
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
+9 -6
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c

···
 	union ixgbe_adv_tx_desc *tx_desc = NULL;
 	struct ixgbe_tx_buffer *tx_bi;
 	bool work_done = true;
-	u32 len, cmd_type;
+	struct xdp_desc desc;
 	dma_addr_t dma;
+	u32 cmd_type;
 
 	while (budget-- > 0) {
 		if (unlikely(!ixgbe_desc_unused(xdp_ring)) ||
···
 			break;
 		}
 
-		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
 			break;
 
-		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+		dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
 					   DMA_BIDIRECTIONAL);
 
 		tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use];
-		tx_bi->bytecount = len;
+		tx_bi->bytecount = desc.len;
 		tx_bi->xdpf = NULL;
 		tx_bi->gso_segs = 1;
 
···
 		cmd_type = IXGBE_ADVTXD_DTYP_DATA |
 			   IXGBE_ADVTXD_DCMD_DEXT |
 			   IXGBE_ADVTXD_DCMD_IFCS;
-		cmd_type |= len | IXGBE_TXD_CMD;
+		cmd_type |= desc.len | IXGBE_TXD_CMD;
 		tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
 		tx_desc->read.olinfo_status =
-			cpu_to_le32(len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+			cpu_to_le32(desc.len << IXGBE_ADVTXD_PAYLEN_SHIFT);
 
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
+1 -1
drivers/net/ethernet/mellanox/mlx5/core/Makefile

···
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
 		en_selftest.o en/port.o en/monitor_stats.o en/reporter_tx.o \
-		en/params.o
+		en/params.o en/xsk/umem.o en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o
 
 #
 # Netdev extra
+140 -15
drivers/net/ethernet/mellanox/mlx5/core/en.h

···
 #define MLX5E_MAX_NUM_CHANNELS (MLX5E_INDIR_RQT_SIZE >> 1)
 #define MLX5E_MAX_NUM_SQS (MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC)
 #define MLX5E_TX_CQ_POLL_BUDGET 128
+#define MLX5E_TX_XSK_POLL_BUDGET 64
 #define MLX5E_SQ_RECOVER_MIN_INTERVAL 500 /* msecs */
 
 #define MLX5E_UMR_WQE_INLINE_SZ \
···
 		##__VA_ARGS__); \
 } while (0)
 
+enum mlx5e_rq_group {
+	MLX5E_RQ_GROUP_REGULAR,
+	MLX5E_RQ_GROUP_XSK,
+	MLX5E_NUM_RQ_GROUPS /* Keep last. */
+};
 
 static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
 {
···
 /* Use this function to get max num channels after netdev was created */
 static inline int mlx5e_get_netdev_max_channels(struct net_device *netdev)
 {
-	return min_t(unsigned int, netdev->num_rx_queues,
+	return min_t(unsigned int,
+		     netdev->num_rx_queues / MLX5E_NUM_RQ_GROUPS,
 		     netdev->num_tx_queues);
 }
···
 	u32 lro_timeout;
 	u32 pflags;
 	struct bpf_prog *xdp_prog;
+	struct mlx5e_xsk *xsk;
 	unsigned int sw_mtu;
 	int hard_mtu;
 };
···
 struct mlx5e_sq_wqe_info {
 	u8 opcode;
+
+	/* Auxiliary data for different opcodes. */
+	union {
+		struct {
+			struct mlx5e_rq *rq;
+		} umr;
+	};
 };
 
 struct mlx5e_txqsq {
···
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_dma_info {
-	struct page *page;
-	dma_addr_t addr;
+	dma_addr_t addr;
+	union {
+		struct page *page;
+		struct {
+			u64 handle;
+			void *data;
+		} xsk;
+	};
+};
+
+/* XDP packets can be transmitted in different ways. On completion, we need to
+ * distinguish between them to clean up things in a proper way.
+ */
+enum mlx5e_xdp_xmit_mode {
+	/* An xdp_frame was transmitted due to either XDP_REDIRECT from another
+	 * device or XDP_TX from an XSK RQ. The frame has to be unmapped and
+	 * returned.
+	 */
+	MLX5E_XDP_XMIT_MODE_FRAME,
+
+	/* The xdp_frame was created in place as a result of XDP_TX from a
+	 * regular RQ. No DMA remapping happened, and the page belongs to us.
+	 */
+	MLX5E_XDP_XMIT_MODE_PAGE,
+
+	/* No xdp_frame was created at all, the transmit happened from a UMEM
+	 * page. The UMEM Completion Ring producer pointer has to be increased.
+	 */
+	MLX5E_XDP_XMIT_MODE_XSK,
 };
 
 struct mlx5e_xdp_info {
-	struct xdp_frame *xdpf;
-	dma_addr_t dma_addr;
-	struct mlx5e_dma_info di;
+	enum mlx5e_xdp_xmit_mode mode;
+	union {
+		struct {
+			struct xdp_frame *xdpf;
+			dma_addr_t dma_addr;
+		} frame;
+		struct {
+			struct mlx5e_rq *rq;
+			struct mlx5e_dma_info di;
+		} page;
+	};
+};
+
+struct mlx5e_xdp_xmit_data {
+	dma_addr_t dma_addr;
+	void *data;
+	u32 len;
 };
 
 struct mlx5e_xdp_info_fifo {
···
 };
 
 struct mlx5e_xdpsq;
-typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq*,
-					struct mlx5e_xdp_info*);
+typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
+typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
+					struct mlx5e_xdp_xmit_data *,
+					struct mlx5e_xdp_info *,
+					int);
+
 struct mlx5e_xdpsq {
 	/* data path */
···
 	struct mlx5e_cq cq;
 
 	/* read only */
+	struct xdp_umem *umem;
 	struct mlx5_wq_cyc wq;
 	struct mlx5e_xdpsq_stats *stats;
+	mlx5e_fp_xmit_xdp_frame_check xmit_xdp_frame_check;
 	mlx5e_fp_xmit_xdp_frame xmit_xdp_frame;
 	struct {
 		struct mlx5e_xdp_wqe_info *wqe_info;
···
 			u8 log_stride_sz;
 			u8 umr_in_progress;
 			u8 umr_last_bulk;
+			u8 umr_completed;
 		} mpwqe;
 	};
 	struct {
+		u16 umem_headroom;
 		u16 headroom;
 		u8 map_dir; /* dma map direction */
 	} buff;
···
 	/* XDP */
 	struct bpf_prog *xdp_prog;
-	struct mlx5e_xdpsq xdpsq;
+	struct mlx5e_xdpsq *xdpsq;
 	DECLARE_BITMAP(flags, 8);
 	struct page_pool *page_pool;
+
+	/* AF_XDP zero-copy */
+	struct zero_copy_allocator zca;
+	struct xdp_umem *umem;
 
 	/* control */
 	struct mlx5_wq_ctrl wq_ctrl;
···
 	struct xdp_rxq_info xdp_rxq;
 } ____cacheline_aligned_in_smp;
 
+enum mlx5e_channel_state {
+	MLX5E_CHANNEL_STATE_XSK,
+	MLX5E_CHANNEL_NUM_STATES
+};
+
 struct mlx5e_channel {
 	/* data path */
 	struct mlx5e_rq rq;
+	struct mlx5e_xdpsq rq_xdpsq;
 	struct mlx5e_txqsq sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_icosq icosq; /* internal control operations */
 	bool xdp;
···
 	/* XDP_REDIRECT */
 	struct mlx5e_xdpsq xdpsq;
 
+	/* AF_XDP zero-copy */
+	struct mlx5e_rq xskrq;
+	struct mlx5e_xdpsq xsksq;
+	struct mlx5e_icosq xskicosq;
+	/* xskicosq can be accessed from any CPU - the spinlock protects it. */
+	spinlock_t xskicosq_lock;
+
 	/* data path - accessed per napi poll */
 	struct irq_desc *irq_desc;
 	struct mlx5e_ch_stats *stats;
···
 	struct mlx5e_priv *priv;
 	struct mlx5_core_dev *mdev;
 	struct hwtstamp_config *tstamp;
+	DECLARE_BITMAP(state, MLX5E_CHANNEL_NUM_STATES);
 	int ix;
 	int cpu;
 	cpumask_var_t xps_cpumask;
···
 	struct mlx5e_ch_stats ch;
 	struct mlx5e_sq_stats sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_rq_stats rq;
+	struct mlx5e_rq_stats xskrq;
 	struct mlx5e_xdpsq_stats rq_xdpsq;
 	struct mlx5e_xdpsq_stats xdpsq;
+	struct mlx5e_xdpsq_stats xsksq;
 } ____cacheline_aligned_in_smp;
 
 enum {
 	MLX5E_STATE_OPENED,
 	MLX5E_STATE_DESTROYING,
 	MLX5E_STATE_XDP_TX_ENABLED,
+	MLX5E_STATE_XDP_OPEN,
 };
 
 struct mlx5e_rqt {
···
 	int rl_index;
 };
 
+struct mlx5e_xsk {
+	/* UMEMs are stored separately from channels, because we don't want to
+	 * lose them when channels are recreated. The kernel also stores UMEMs,
+	 * but it doesn't distinguish between zero-copy and non-zero-copy UMEMs,
+	 * so rely on our mechanism.
+	 */
+	struct xdp_umem **umems;
+	u16 refcnt;
+	bool ever_used;
+};
+
 struct mlx5e_priv {
 	/* priv data path fields - start */
 	struct mlx5e_txqsq *txq2sq[MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC];
···
 	struct mlx5e_tir indir_tir[MLX5E_NUM_INDIR_TIRS];
 	struct mlx5e_tir inner_indir_tir[MLX5E_NUM_INDIR_TIRS];
 	struct mlx5e_tir direct_tir[MLX5E_MAX_NUM_CHANNELS];
+	struct mlx5e_tir xsk_tir[MLX5E_MAX_NUM_CHANNELS];
 	struct mlx5e_rss_params rss_params;
 	u32 tx_rates[MLX5E_MAX_NUM_SQS];
···
 	struct mlx5e_tls *tls;
 #endif
 	struct devlink_health_reporter *tx_reporter;
+	struct mlx5e_xsk xsk;
 };
 
 struct mlx5e_profile {
···
 			    struct mlx5e_params *params);
 
 void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info);
-void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
-			bool recycle);
+void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
+				struct mlx5e_dma_info *dma_info,
+				bool recycle);
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
+void mlx5e_poll_ico_cq(struct mlx5e_cq *cq);
 bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq);
 void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
···
 			       void *tirc, bool inner);
 void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen);
 struct mlx5e_tirc_config mlx5e_tirc_get_default_config(enum mlx5e_traffic_types tt);
+
+struct mlx5e_xsk_param;
+
+struct mlx5e_rq_param;
+int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_params *params,
+		  struct mlx5e_rq_param *param, struct mlx5e_xsk_param *xsk,
+		  struct xdp_umem *umem, struct mlx5e_rq *rq);
+int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time);
+void mlx5e_deactivate_rq(struct mlx5e_rq *rq);
+void mlx5e_close_rq(struct mlx5e_rq *rq);
+
+struct mlx5e_sq_param;
+int mlx5e_open_icosq(struct mlx5e_channel *c, struct mlx5e_params *params,
+		     struct mlx5e_sq_param *param, struct mlx5e_icosq *sq);
+void mlx5e_close_icosq(struct mlx5e_icosq *sq);
+int mlx5e_open_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params,
+		     struct mlx5e_sq_param *param, struct xdp_umem *umem,
+		     struct mlx5e_xdpsq *sq, bool is_redirect);
+void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq);
+
+struct mlx5e_cq_param;
+int mlx5e_open_cq(struct mlx5e_channel *c, struct dim_cq_moder moder,
+		  struct mlx5e_cq_param *param, struct mlx5e_cq *cq);
+void mlx5e_close_cq(struct mlx5e_cq *cq);
 
 int mlx5e_open_locked(struct net_device *netdev);
 int mlx5e_close_locked(struct net_device *netdev);
···
 int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc);
 void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc);
 
-int mlx5e_create_direct_rqts(struct mlx5e_priv *priv);
-void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv);
-int mlx5e_create_direct_tirs(struct mlx5e_priv *priv);
-void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv);
+int mlx5e_create_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
+void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
+int mlx5e_create_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
+void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
 void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt);
 
 int mlx5e_create_tis(struct mlx5_core_dev *mdev, int tc,
···
 void mlx5e_destroy_netdev(struct mlx5e_priv *priv);
 void mlx5e_set_netdev_mtu_boundaries(struct mlx5e_priv *priv);
 void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
+			    struct mlx5e_xsk *xsk,
 			    struct mlx5e_rss_params *rss_params,
 			    struct mlx5e_params *params,
 			    u16 max_channels, u16 mtu);
+71 -37
drivers/net/ethernet/mellanox/mlx5/core/en/params.c

···
 
 #include "en/params.h"
 
-u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params)
+static inline bool mlx5e_rx_is_xdp(struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk)
 {
-	u16 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
-	u16 linear_rq_headroom = params->xdp_prog ?
-		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
-	u32 frag_sz;
+	return params->xdp_prog || xsk;
+}
 
-	linear_rq_headroom += NET_IP_ALIGN;
+u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
+				 struct mlx5e_xsk_param *xsk)
+{
+	u16 headroom = NET_IP_ALIGN;
 
-	frag_sz = MLX5_SKB_FRAG_SZ(linear_rq_headroom + hw_mtu);
+	if (mlx5e_rx_is_xdp(params, xsk)) {
+		headroom += XDP_PACKET_HEADROOM;
+		if (xsk)
+			headroom += xsk->headroom;
+	} else {
+		headroom += MLX5_RX_HEADROOM;
+	}
 
-	if (params->xdp_prog && frag_sz < PAGE_SIZE)
-		frag_sz = PAGE_SIZE;
+	return headroom;
+}
+
+u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk)
+{
+	u32 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
+	u16 linear_rq_headroom = mlx5e_get_linear_rq_headroom(params, xsk);
+	u32 frag_sz = linear_rq_headroom + hw_mtu;
+
+	/* AF_XDP doesn't build SKBs in place. */
+	if (!xsk)
+		frag_sz = MLX5_SKB_FRAG_SZ(frag_sz);
+
+	/* XDP in mlx5e doesn't support multiple packets per page. */
+	if (mlx5e_rx_is_xdp(params, xsk))
+		frag_sz = max_t(u32, frag_sz, PAGE_SIZE);
+
+	/* Even if we can go with a smaller fragment size, we must not put
+	 * multiple packets into a single frame.
+	 */
+	if (xsk)
+		frag_sz = max_t(u32, frag_sz, xsk->chunk_size);
 
 	return frag_sz;
 }
 
-u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params)
+u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk)
 {
-	u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params, xsk);
 
 	return MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(linear_frag_sz);
 }
 
-bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params)
+bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params,
+			    struct mlx5e_xsk_param *xsk)
 {
-	u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	/* AF_XDP allocates SKBs on XDP_PASS - ensure they don't occupy more
+	 * than one page. For this, check both with and without xsk.
+	 */
+	u32 linear_frag_sz = max(mlx5e_rx_get_linear_frag_sz(params, xsk),
+				 mlx5e_rx_get_linear_frag_sz(params, NULL));
 
-	return !params->lro_en && frag_sz <= PAGE_SIZE;
+	return !params->lro_en && linear_frag_sz <= PAGE_SIZE;
 }
 
 #define MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ ((BIT(__mlx5_bit_sz(wq, log_wqe_stride_size)) - 1) + \
 					  MLX5_MPWQE_LOG_STRIDE_SZ_BASE)
 bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev,
-				  struct mlx5e_params *params)
+				  struct mlx5e_params *params,
+				  struct mlx5e_xsk_param *xsk)
 {
-	u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params, xsk);
 	s8 signed_log_num_strides_param;
 	u8 log_num_strides;
 
-	if (!mlx5e_rx_is_linear_skb(params))
+	if (!mlx5e_rx_is_linear_skb(params, xsk))
 		return false;
 
-	if (order_base_2(frag_sz) > MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ)
+	if (order_base_2(linear_frag_sz) > MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ)
 		return false;
 
 	if (MLX5_CAP_GEN(mdev, ext_stride_num_range))
 		return true;
 
-	log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(frag_sz);
+	log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(linear_frag_sz);
 	signed_log_num_strides_param =
 		(s8)log_num_strides - MLX5_MPWQE_LOG_NUM_STRIDES_BASE;
 
 	return signed_log_num_strides_param >= 0;
 }
 
-u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params)
+u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params,
+			       struct mlx5e_xsk_param *xsk)
 {
-	u8 log_pkts_per_wqe = mlx5e_mpwqe_log_pkts_per_wqe(params);
+	u8 log_pkts_per_wqe = mlx5e_mpwqe_log_pkts_per_wqe(params, xsk);
 
 	/* Numbers are unsigned, don't subtract to avoid underflow. */
 	if (params->log_rq_mtu_frames <
···
 }
 
 u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
-				   struct mlx5e_params *params)
+				   struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk)
 {
-	if (mlx5e_rx_mpwqe_is_linear_skb(mdev, params))
-		return order_base_2(mlx5e_rx_get_linear_frag_sz(params));
+	if (mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk))
+		return order_base_2(mlx5e_rx_get_linear_frag_sz(params, xsk));
 
 	return MLX5_MPWRQ_DEF_LOG_STRIDE_SZ(mdev);
 }
 
 u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
-				   struct mlx5e_params *params)
+				   struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk)
 {
 	return MLX5_MPWRQ_LOG_WQE_SZ -
-		mlx5e_mpwqe_get_log_stride_size(mdev, params);
+		mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk);
 }
 
 u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
-			  struct mlx5e_params *params)
+			  struct mlx5e_params *params,
+			  struct mlx5e_xsk_param *xsk)
 {
-	u16 linear_rq_headroom = params->xdp_prog ?
-		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
-	bool is_linear_skb;
+	bool is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
+		mlx5e_rx_is_linear_skb(params, xsk) :
+		mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk);
 
-	linear_rq_headroom += NET_IP_ALIGN;
-
-	is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
-		mlx5e_rx_is_linear_skb(params) :
-		mlx5e_rx_mpwqe_is_linear_skb(mdev, params);
-
-	return is_linear_skb ? linear_rq_headroom : 0;
+	return is_linear_skb ? mlx5e_get_linear_rq_headroom(params, xsk) : 0;
 }
+110 -8
drivers/net/ethernet/mellanox/mlx5/core/en/params.h

···
 
 #include "en.h"
 
-u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params);
-u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params);
-bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params);
+struct mlx5e_xsk_param {
+	u16 headroom;
+	u16 chunk_size;
+};
+
+struct mlx5e_rq_param {
+	u32 rqc[MLX5_ST_SZ_DW(rqc)];
+	struct mlx5_wq_param wq;
+	struct mlx5e_rq_frags_info frags_info;
+};
+
+struct mlx5e_sq_param {
+	u32 sqc[MLX5_ST_SZ_DW(sqc)];
+	struct mlx5_wq_param wq;
+	bool is_mpw;
+};
+
+struct mlx5e_cq_param {
+	u32 cqc[MLX5_ST_SZ_DW(cqc)];
+	struct mlx5_wq_param wq;
+	u16 eq_ix;
+	u8 cq_period_mode;
+};
+
+struct mlx5e_channel_param {
+	struct mlx5e_rq_param rq;
+	struct mlx5e_sq_param sq;
+	struct mlx5e_sq_param xdp_sq;
+	struct mlx5e_sq_param icosq;
+	struct mlx5e_cq_param rx_cq;
+	struct mlx5e_cq_param tx_cq;
+	struct mlx5e_cq_param icosq_cq;
+};
+
+static inline bool mlx5e_qid_get_ch_if_in_group(struct mlx5e_params *params,
+						u16 qid,
+						enum mlx5e_rq_group group,
+						u16 *ix)
+{
+	int nch = params->num_channels;
+	int ch = qid - nch * group;
+
+	if (ch < 0 || ch >= nch)
+		return false;
+
+	*ix = ch;
+	return true;
+}
+
+static inline void mlx5e_qid_get_ch_and_group(struct mlx5e_params *params,
+					      u16 qid,
+					      u16 *ix,
+					      enum mlx5e_rq_group *group)
+{
+	u16 nch = params->num_channels;
+
+	*ix = qid % nch;
+	*group = qid / nch;
+}
+
+static inline bool mlx5e_qid_validate(struct mlx5e_params *params, u64 qid)
+{
+	return qid < params->num_channels * MLX5E_NUM_RQ_GROUPS;
+}
+
+/* Parameter calculations */
+
+u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
+				 struct mlx5e_xsk_param *xsk);
+u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk);
+u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk);
+bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params,
+			    struct mlx5e_xsk_param *xsk);
 bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev,
-				  struct mlx5e_params *params);
-u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params);
+				  struct mlx5e_params *params,
+				  struct mlx5e_xsk_param *xsk);
+u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params,
+			       struct mlx5e_xsk_param *xsk);
 u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
-				   struct mlx5e_params *params);
+				   struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk);
 u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
-				   struct mlx5e_params *params);
+				   struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk);
 u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
-			  struct mlx5e_params *params);
+			  struct mlx5e_params *params,
+			  struct mlx5e_xsk_param *xsk);
+
+/* Build queue parameters */
+
+void mlx5e_build_rq_param(struct mlx5e_priv *priv,
+			  struct mlx5e_params *params,
+			  struct mlx5e_xsk_param *xsk,
+			  struct mlx5e_rq_param *param);
+void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
+				 struct mlx5e_sq_param *param);
+void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_xsk_param *xsk,
+			     struct mlx5e_cq_param *param);
+void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_cq_param *param);
+void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
+			      u8 log_wq_size,
+			      struct mlx5e_cq_param *param);
+void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
+			     u8 log_wq_size,
+			     struct mlx5e_sq_param *param);
+void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_sq_param *param);
 
 #endif /* __MLX5_EN_PARAMS_H__ */
+175 -62
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
···
 */
 
 #include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
 #include "en/xdp.h"
+#include "en/params.h"
 
-int mlx5e_xdp_max_mtu(struct mlx5e_params *params)
+int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
 {
-    int hr = NET_IP_ALIGN + XDP_PACKET_HEADROOM;
+    int hr = mlx5e_get_linear_rq_headroom(params, xsk);
 
     /* Let S := SKB_DATA_ALIGN(sizeof(struct skb_shared_info)).
      * The condition checked in mlx5e_rx_is_linear_skb is:
···
 }
 
 static inline bool
-mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_dma_info *di,
-                    struct xdp_buff *xdp)
+mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
+                    struct mlx5e_dma_info *di, struct xdp_buff *xdp)
 {
+    struct mlx5e_xdp_xmit_data xdptxd;
     struct mlx5e_xdp_info xdpi;
+    struct xdp_frame *xdpf;
+    dma_addr_t dma_addr;
 
-    xdpi.xdpf = convert_to_xdp_frame(xdp);
-    if (unlikely(!xdpi.xdpf))
+    xdpf = convert_to_xdp_frame(xdp);
+    if (unlikely(!xdpf))
         return false;
-    xdpi.dma_addr = di->addr + (xdpi.xdpf->data - (void *)xdpi.xdpf);
-    dma_sync_single_for_device(sq->pdev, xdpi.dma_addr,
-                               xdpi.xdpf->len, PCI_DMA_TODEVICE);
-    xdpi.di = *di;
 
-    return sq->xmit_xdp_frame(sq, &xdpi);
+    xdptxd.data = xdpf->data;
+    xdptxd.len = xdpf->len;
+
+    if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) {
+        /* The xdp_buff was in the UMEM and was copied into a newly
+         * allocated page. The UMEM page was returned via the ZCA, and
+         * this new page has to be mapped at this point and has to be
+         * unmapped and returned via xdp_return_frame on completion.
+         */
+
+        /* Prevent double recycling of the UMEM page. Even in case this
+         * function returns false, the xdp_buff shouldn't be recycled,
+         * as it was already done in xdp_convert_zc_to_xdp_frame.
+         */
+        __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags); /* non-atomic */
+
+        xdpi.mode = MLX5E_XDP_XMIT_MODE_FRAME;
+
+        dma_addr = dma_map_single(sq->pdev, xdptxd.data, xdptxd.len,
+                                  DMA_TO_DEVICE);
+        if (dma_mapping_error(sq->pdev, dma_addr)) {
+            xdp_return_frame(xdpf);
+            return false;
+        }
+
+        xdptxd.dma_addr = dma_addr;
+        xdpi.frame.xdpf = xdpf;
+        xdpi.frame.dma_addr = dma_addr;
+    } else {
+        /* Driver assumes that convert_to_xdp_frame returns an xdp_frame
+         * that points to the same memory region as the original
+         * xdp_buff. It allows to map the memory only once and to use
+         * the DMA_BIDIRECTIONAL mode.
+         */
+
+        xdpi.mode = MLX5E_XDP_XMIT_MODE_PAGE;
+
+        dma_addr = di->addr + (xdpf->data - (void *)xdpf);
+        dma_sync_single_for_device(sq->pdev, dma_addr, xdptxd.len,
+                                   DMA_TO_DEVICE);
+
+        xdptxd.dma_addr = dma_addr;
+        xdpi.page.rq = rq;
+        xdpi.page.di = *di;
+    }
+
+    return sq->xmit_xdp_frame(sq, &xdptxd, &xdpi, 0);
 }
 
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
-                      void *va, u16 *rx_headroom, u32 *len)
+                      void *va, u16 *rx_headroom, u32 *len, bool xsk)
 {
     struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
     struct xdp_buff xdp;
···
     xdp_set_data_meta_invalid(&xdp);
     xdp.data_end = xdp.data + *len;
     xdp.data_hard_start = va;
+    if (xsk)
+        xdp.handle = di->xsk.handle;
     xdp.rxq = &rq->xdp_rxq;
 
     act = bpf_prog_run_xdp(prog, &xdp);
+    if (xsk)
+        xdp.handle += xdp.data - xdp.data_hard_start;
     switch (act) {
     case XDP_PASS:
         *rx_headroom = xdp.data - xdp.data_hard_start;
         *len = xdp.data_end - xdp.data;
         return false;
     case XDP_TX:
-        if (unlikely(!mlx5e_xmit_xdp_buff(&rq->xdpsq, di, &xdp)))
+        if (unlikely(!mlx5e_xmit_xdp_buff(rq->xdpsq, rq, di, &xdp)))
             goto xdp_abort;
         __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags); /* non-atomic */
         return true;
···
             goto xdp_abort;
         __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags);
         __set_bit(MLX5E_RQ_FLAG_XDP_REDIRECT, rq->flags);
-        mlx5e_page_dma_unmap(rq, di);
+        if (!xsk)
+            mlx5e_page_dma_unmap(rq, di);
         rq->stats->xdp_redirect++;
         return true;
     default:
···
     stats->mpwqe++;
 }
 
-static void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq)
+void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq)
 {
     struct mlx5_wq_cyc *wq = &sq->wq;
     struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
···
     session->wqe = NULL; /* Close session */
 }
 
-static bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
-                                       struct mlx5e_xdp_info *xdpi)
+enum {
+    MLX5E_XDP_CHECK_OK = 1,
+    MLX5E_XDP_CHECK_START_MPWQE = 2,
+};
+
+static int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq *sq)
 {
-    struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
-    struct mlx5e_xdpsq_stats *stats = sq->stats;
-
-    struct xdp_frame *xdpf = xdpi->xdpf;
-
-    if (unlikely(sq->hw_mtu < xdpf->len)) {
-        stats->err++;
-        return false;
-    }
-
-    if (unlikely(!session->wqe)) {
+    if (unlikely(!sq->mpwqe.wqe)) {
         if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc,
                                              MLX5_SEND_WQE_MAX_WQEBBS))) {
             /* SQ is full, ring doorbell */
             mlx5e_xmit_xdp_doorbell(sq);
-            stats->full++;
-            return false;
+            sq->stats->full++;
+            return -EBUSY;
         }
 
+        return MLX5E_XDP_CHECK_START_MPWQE;
+    }
+
+    return MLX5E_XDP_CHECK_OK;
+}
+
+static bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
+                                       struct mlx5e_xdp_xmit_data *xdptxd,
+                                       struct mlx5e_xdp_info *xdpi,
+                                       int check_result)
+{
+    struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
+    struct mlx5e_xdpsq_stats *stats = sq->stats;
+
+    if (unlikely(xdptxd->len > sq->hw_mtu)) {
+        stats->err++;
+        return false;
+    }
+
+    if (!check_result)
+        check_result = mlx5e_xmit_xdp_frame_check_mpwqe(sq);
+    if (unlikely(check_result < 0))
+        return false;
+
+    if (check_result == MLX5E_XDP_CHECK_START_MPWQE) {
+        /* Start the session when nothing can fail, so it's guaranteed
+         * that if there is an active session, it has at least one dseg,
+         * and it's safe to complete it at any time.
+         */
         mlx5e_xdp_mpwqe_session_start(sq);
     }
 
-    mlx5e_xdp_mpwqe_add_dseg(sq, xdpi, stats);
+    mlx5e_xdp_mpwqe_add_dseg(sq, xdptxd, stats);
 
     if (unlikely(session->complete ||
                  session->ds_count == session->max_ds_count))
···
     return true;
 }
 
-static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi)
+static int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq)
+{
+    if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, 1))) {
+        /* SQ is full, ring doorbell */
+        mlx5e_xmit_xdp_doorbell(sq);
+        sq->stats->full++;
+        return -EBUSY;
+    }
+
+    return MLX5E_XDP_CHECK_OK;
+}
+
+static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
+                                 struct mlx5e_xdp_xmit_data *xdptxd,
+                                 struct mlx5e_xdp_info *xdpi,
+                                 int check_result)
 {
     struct mlx5_wq_cyc *wq = &sq->wq;
     u16 pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
···
     struct mlx5_wqe_eth_seg *eseg = &wqe->eth;
     struct mlx5_wqe_data_seg *dseg = wqe->data;
 
-    struct xdp_frame *xdpf = xdpi->xdpf;
-    dma_addr_t dma_addr = xdpi->dma_addr;
-    unsigned int dma_len = xdpf->len;
+    dma_addr_t dma_addr = xdptxd->dma_addr;
+    u32 dma_len = xdptxd->len;
 
     struct mlx5e_xdpsq_stats *stats = sq->stats;
 
···
         return false;
     }
 
-    if (unlikely(!mlx5e_wqc_has_room_for(wq, sq->cc, sq->pc, 1))) {
-        /* SQ is full, ring doorbell */
-        mlx5e_xmit_xdp_doorbell(sq);
-        stats->full++;
+    if (!check_result)
+        check_result = mlx5e_xmit_xdp_frame_check(sq);
+    if (unlikely(check_result < 0))
         return false;
-    }
 
     cseg->fm_ce_se = 0;
 
     /* copy the inline part if required */
     if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {
-        memcpy(eseg->inline_hdr.start, xdpf->data, MLX5E_XDP_MIN_INLINE);
+        memcpy(eseg->inline_hdr.start, xdptxd->data, MLX5E_XDP_MIN_INLINE);
         eseg->inline_hdr.sz = cpu_to_be16(MLX5E_XDP_MIN_INLINE);
         dma_len -= MLX5E_XDP_MIN_INLINE;
         dma_addr += MLX5E_XDP_MIN_INLINE;
···
 
 static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
                                   struct mlx5e_xdp_wqe_info *wi,
-                                  struct mlx5e_rq *rq,
+                                  u32 *xsk_frames,
                                   bool recycle)
 {
     struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
···
     for (i = 0; i < wi->num_pkts; i++) {
         struct mlx5e_xdp_info xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 
-        if (rq) {
-            /* XDP_TX */
-            mlx5e_page_release(rq, &xdpi.di, recycle);
-        } else {
-            /* XDP_REDIRECT */
-            dma_unmap_single(sq->pdev, xdpi.dma_addr,
-                             xdpi.xdpf->len, DMA_TO_DEVICE);
-            xdp_return_frame(xdpi.xdpf);
+        switch (xdpi.mode) {
+        case MLX5E_XDP_XMIT_MODE_FRAME:
+            /* XDP_TX from the XSK RQ and XDP_REDIRECT */
+            dma_unmap_single(sq->pdev, xdpi.frame.dma_addr,
+                             xdpi.frame.xdpf->len, DMA_TO_DEVICE);
+            xdp_return_frame(xdpi.frame.xdpf);
+            break;
+        case MLX5E_XDP_XMIT_MODE_PAGE:
+            /* XDP_TX from the regular RQ */
+            mlx5e_page_release_dynamic(xdpi.page.rq, &xdpi.page.di, recycle);
+            break;
+        case MLX5E_XDP_XMIT_MODE_XSK:
+            /* AF_XDP send */
+            (*xsk_frames)++;
+            break;
+        default:
+            WARN_ON_ONCE(true);
         }
     }
 }
 
-bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
+bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
 {
     struct mlx5e_xdpsq *sq;
     struct mlx5_cqe64 *cqe;
+    u32 xsk_frames = 0;
     u16 sqcc;
     int i;
 
···
 
             sqcc += wi->num_wqebbs;
 
-            mlx5e_free_xdpsq_desc(sq, wi, rq, true);
+            mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, true);
         } while (!last_wqe);
     } while ((++i < MLX5E_TX_CQ_POLL_BUDGET) && (cqe = mlx5_cqwq_get_cqe(&cq->wq)));
+
+    if (xsk_frames)
+        xsk_umem_complete_tx(sq->umem, xsk_frames);
 
     sq->stats->cqes += i;
 
···
     return (i == MLX5E_TX_CQ_POLL_BUDGET);
 }
 
-void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq)
+void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
 {
+    u32 xsk_frames = 0;
+
     while (sq->cc != sq->pc) {
         struct mlx5e_xdp_wqe_info *wi;
         u16 ci;
···
 
         sq->cc += wi->num_wqebbs;
 
-        mlx5e_free_xdpsq_desc(sq, wi, rq, false);
+        mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, false);
     }
+
+    if (xsk_frames)
+        xsk_umem_complete_tx(sq->umem, xsk_frames);
 }
 
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
···
 
     for (i = 0; i < n; i++) {
         struct xdp_frame *xdpf = frames[i];
+        struct mlx5e_xdp_xmit_data xdptxd;
         struct mlx5e_xdp_info xdpi;
 
-        xdpi.dma_addr = dma_map_single(sq->pdev, xdpf->data, xdpf->len,
-                                       DMA_TO_DEVICE);
-        if (unlikely(dma_mapping_error(sq->pdev, xdpi.dma_addr))) {
+        xdptxd.data = xdpf->data;
+        xdptxd.len = xdpf->len;
+        xdptxd.dma_addr = dma_map_single(sq->pdev, xdptxd.data,
+                                         xdptxd.len, DMA_TO_DEVICE);
+
+        if (unlikely(dma_mapping_error(sq->pdev, xdptxd.dma_addr))) {
             xdp_return_frame_rx_napi(xdpf);
             drops++;
             continue;
         }
 
-        xdpi.xdpf = xdpf;
+        xdpi.mode = MLX5E_XDP_XMIT_MODE_FRAME;
+        xdpi.frame.xdpf = xdpf;
+        xdpi.frame.dma_addr = xdptxd.dma_addr;
 
-        if (unlikely(!sq->xmit_xdp_frame(sq, &xdpi))) {
-            dma_unmap_single(sq->pdev, xdpi.dma_addr,
-                             xdpf->len, DMA_TO_DEVICE);
+        if (unlikely(!sq->xmit_xdp_frame(sq, &xdptxd, &xdpi, 0))) {
+            dma_unmap_single(sq->pdev, xdptxd.dma_addr,
+                             xdptxd.len, DMA_TO_DEVICE);
             xdp_return_frame_rx_napi(xdpf);
             drops++;
         }
···
 
 void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq)
 {
-    struct mlx5e_xdpsq *xdpsq = &rq->xdpsq;
+    struct mlx5e_xdpsq *xdpsq = rq->xdpsq;
 
     if (xdpsq->mpwqe.wqe)
         mlx5e_xdp_mpwqe_complete(xdpsq);
···
 
 void mlx5e_set_xmit_fp(struct mlx5e_xdpsq *sq, bool is_mpw)
 {
+    sq->xmit_xdp_frame_check = is_mpw ?
+        mlx5e_xmit_xdp_frame_check_mpwqe : mlx5e_xmit_xdp_frame_check;
     sq->xmit_xdp_frame = is_mpw ?
         mlx5e_xmit_xdp_frame_mpwqe : mlx5e_xmit_xdp_frame;
 }
+26 -10
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
···
     (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
 #define MLX5E_XDP_TX_DS_COUNT (MLX5E_XDP_TX_EMPTY_DS_COUNT + 1 /* SG DS */)
 
-int mlx5e_xdp_max_mtu(struct mlx5e_params *params);
+struct mlx5e_xsk_param;
+int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
-                      void *va, u16 *rx_headroom, u32 *len);
-bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq);
-void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq);
+                      void *va, u16 *rx_headroom, u32 *len, bool xsk);
+void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq);
+bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
+void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
 void mlx5e_set_xmit_fp(struct mlx5e_xdpsq *sq, bool is_mpw);
 void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
···
 static inline bool mlx5e_xdp_tx_is_enabled(struct mlx5e_priv *priv)
 {
     return test_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state);
+}
+
+static inline void mlx5e_xdp_set_open(struct mlx5e_priv *priv)
+{
+    set_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
+}
+
+static inline void mlx5e_xdp_set_closed(struct mlx5e_priv *priv)
+{
+    clear_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
+}
+
+static inline bool mlx5e_xdp_is_open(struct mlx5e_priv *priv)
+{
+    return test_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
 }
 
 static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
···
 }
 
 static inline void
-mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi,
+mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq,
+                         struct mlx5e_xdp_xmit_data *xdptxd,
                          struct mlx5e_xdpsq_stats *stats)
 {
     struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
-    dma_addr_t dma_addr = xdpi->dma_addr;
-    struct xdp_frame *xdpf = xdpi->xdpf;
     struct mlx5_wqe_data_seg *dseg =
         (struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
-    u16 dma_len = xdpf->len;
+    u32 dma_len = xdptxd->len;
 
     session->pkt_count++;
 
···
     }
 
         inline_dseg->byte_count = cpu_to_be32(dma_len | MLX5_INLINE_SEG);
-        memcpy(inline_dseg->data, xdpf->data, dma_len);
+        memcpy(inline_dseg->data, xdptxd->data, dma_len);
 
         session->ds_count += ds_cnt;
         stats->inlnw++;
···
     }
 
 no_inline:
-    dseg->addr = cpu_to_be64(dma_addr);
+    dseg->addr = cpu_to_be64(xdptxd->dma_addr);
     dseg->byte_count = cpu_to_be32(dma_len);
     dseg->lkey = sq->mkey_be;
     session->ds_count++;
+1
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/Makefile
···
+subdir-ccflags-y += -I$(src)/../..
+192
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
···
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "rx.h"
+#include "en/xdp.h"
+#include <net/xdp_sock.h>
+
+/* RX data path */
+
+bool mlx5e_xsk_pages_enough_umem(struct mlx5e_rq *rq, int count)
+{
+    /* Check in advance that we have enough frames, instead of allocating
+     * one-by-one, failing and moving frames to the Reuse Ring.
+     */
+    return xsk_umem_has_addrs_rq(rq->umem, count);
+}
+
+int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
+                              struct mlx5e_dma_info *dma_info)
+{
+    struct xdp_umem *umem = rq->umem;
+    u64 handle;
+
+    if (!xsk_umem_peek_addr_rq(umem, &handle))
+        return -ENOMEM;
+
+    dma_info->xsk.handle = handle + rq->buff.umem_headroom;
+    dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);
+
+    /* No need to add headroom to the DMA address. In striding RQ case, we
+     * just provide pages for UMR, and headroom is counted at the setup
+     * stage when creating a WQE. In non-striding RQ case, headroom is
+     * accounted in mlx5e_alloc_rx_wqe.
+     */
+    dma_info->addr = xdp_umem_get_dma(umem, handle);
+
+    xsk_umem_discard_addr_rq(umem);
+
+    dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
+                               DMA_BIDIRECTIONAL);
+
+    return 0;
+}
+
+static inline void mlx5e_xsk_recycle_frame(struct mlx5e_rq *rq, u64 handle)
+{
+    xsk_umem_fq_reuse(rq->umem, handle & rq->umem->chunk_mask);
+}
+
+/* XSKRQ uses pages from UMEM, they must not be released. They are returned to
+ * the userspace if possible, and if not, this function is called to reuse them
+ * in the driver.
+ */
+void mlx5e_xsk_page_release(struct mlx5e_rq *rq,
+                            struct mlx5e_dma_info *dma_info)
+{
+    mlx5e_xsk_recycle_frame(rq, dma_info->xsk.handle);
+}
+
+/* Return a frame back to the hardware to fill in again. It is used by XDP when
+ * the XDP program returns XDP_TX or XDP_REDIRECT not to an XSKMAP.
+ */
+void mlx5e_xsk_zca_free(struct zero_copy_allocator *zca, unsigned long handle)
+{
+    struct mlx5e_rq *rq = container_of(zca, struct mlx5e_rq, zca);
+
+    mlx5e_xsk_recycle_frame(rq, handle);
+}
+
+static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, void *data,
+                                               u32 cqe_bcnt)
+{
+    struct sk_buff *skb;
+
+    skb = napi_alloc_skb(rq->cq.napi, cqe_bcnt);
+    if (unlikely(!skb)) {
+        rq->stats->buff_alloc_err++;
+        return NULL;
+    }
+
+    skb_put_data(skb, data, cqe_bcnt);
+
+    return skb;
+}
+
+struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
+                                                    struct mlx5e_mpw_info *wi,
+                                                    u16 cqe_bcnt,
+                                                    u32 head_offset,
+                                                    u32 page_idx)
+{
+    struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
+    u16 rx_headroom = rq->buff.headroom - rq->buff.umem_headroom;
+    u32 cqe_bcnt32 = cqe_bcnt;
+    void *va, *data;
+    u32 frag_size;
+    bool consumed;
+
+    /* Check packet size. Note LRO doesn't use linear SKB */
+    if (unlikely(cqe_bcnt > rq->hw_mtu)) {
+        rq->stats->oversize_pkts_sw_drop++;
+        return NULL;
+    }
+
+    /* head_offset is not used in this function, because di->xsk.data and
+     * di->addr point directly to the necessary place. Furthermore, in the
+     * current implementation, one page = one packet = one frame, so
+     * head_offset should always be 0.
+     */
+    WARN_ON_ONCE(head_offset);
+
+    va = di->xsk.data;
+    data = va + rx_headroom;
+    frag_size = rq->buff.headroom + cqe_bcnt32;
+
+    dma_sync_single_for_cpu(rq->pdev, di->addr, frag_size, DMA_BIDIRECTIONAL);
+    prefetch(data);
+
+    rcu_read_lock();
+    consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, true);
+    rcu_read_unlock();
+
+    /* Possible flows:
+     * - XDP_REDIRECT to XSKMAP:
+     *   The page is owned by the userspace from now.
+     * - XDP_TX and other XDP_REDIRECTs:
+     *   The page was returned by ZCA and recycled.
+     * - XDP_DROP:
+     *   Recycle the page.
+     * - XDP_PASS:
+     *   Allocate an SKB, copy the data and recycle the page.
+     *
+     * Pages to be recycled go to the Reuse Ring on MPWQE deallocation. Its
+     * size is the same as the Driver RX Ring's size, and pages for WQEs are
+     * allocated first from the Reuse Ring, so it has enough space.
+     */
+
+    if (likely(consumed)) {
+        if (likely(__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)))
+            __set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
+        return NULL; /* page/packet was consumed by XDP */
+    }
+
+    /* XDP_PASS: copy the data from the UMEM to a new SKB and reuse the
+     * frame. On SKB allocation failure, NULL is returned.
+     */
+    return mlx5e_xsk_construct_skb(rq, data, cqe_bcnt32);
+}
+
+struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
+                                              struct mlx5_cqe64 *cqe,
+                                              struct mlx5e_wqe_frag_info *wi,
+                                              u32 cqe_bcnt)
+{
+    struct mlx5e_dma_info *di = wi->di;
+    u16 rx_headroom = rq->buff.headroom - rq->buff.umem_headroom;
+    void *va, *data;
+    bool consumed;
+    u32 frag_size;
+
+    /* wi->offset is not used in this function, because di->xsk.data and
+     * di->addr point directly to the necessary place. Furthermore, in the
+     * current implementation, one page = one packet = one frame, so
+     * wi->offset should always be 0.
+     */
+    WARN_ON_ONCE(wi->offset);
+
+    va = di->xsk.data;
+    data = va + rx_headroom;
+    frag_size = rq->buff.headroom + cqe_bcnt;
+
+    dma_sync_single_for_cpu(rq->pdev, di->addr, frag_size, DMA_BIDIRECTIONAL);
+    prefetch(data);
+
+    if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_RESP_SEND)) {
+        rq->stats->wqe_err++;
+        return NULL;
+    }
+
+    rcu_read_lock();
+    consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, true);
+    rcu_read_unlock();
+
+    if (likely(consumed))
+        return NULL; /* page/packet was consumed by XDP */
+
+    /* XDP_PASS: copy the data from the UMEM to a new SKB. The frame reuse
+     * will be handled by mlx5e_put_rx_frag.
+     * On SKB allocation failure, NULL is returned.
+     */
+    return mlx5e_xsk_construct_skb(rq, data, cqe_bcnt);
+}
+27
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
···
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_XSK_RX_H__
+#define __MLX5_EN_XSK_RX_H__
+
+#include "en.h"
+
+/* RX data path */
+
+bool mlx5e_xsk_pages_enough_umem(struct mlx5e_rq *rq, int count);
+int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
+                              struct mlx5e_dma_info *dma_info);
+void mlx5e_xsk_page_release(struct mlx5e_rq *rq,
+                            struct mlx5e_dma_info *dma_info);
+void mlx5e_xsk_zca_free(struct zero_copy_allocator *zca, unsigned long handle);
+struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
+                                                    struct mlx5e_mpw_info *wi,
+                                                    u16 cqe_bcnt,
+                                                    u32 head_offset,
+                                                    u32 page_idx);
+struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
+                                              struct mlx5_cqe64 *cqe,
+                                              struct mlx5e_wqe_frag_info *wi,
+                                              u32 cqe_bcnt);
+
+#endif /* __MLX5_EN_XSK_RX_H__ */
+223
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
···
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "setup.h"
+#include "en/params.h"
+
+bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
+                              struct mlx5e_xsk_param *xsk,
+                              struct mlx5_core_dev *mdev)
+{
+    /* AF_XDP doesn't support frames larger than PAGE_SIZE, and the current
+     * mlx5e XDP implementation doesn't support multiple packets per page.
+     */
+    if (xsk->chunk_size != PAGE_SIZE)
+        return false;
+
+    /* Current MTU and XSK headroom don't allow packets to fit the frames. */
+    if (mlx5e_rx_get_linear_frag_sz(params, xsk) > xsk->chunk_size)
+        return false;
+
+    /* frag_sz is different for regular and XSK RQs, so ensure that linear
+     * SKB mode is possible.
+     */
+    switch (params->rq_wq_type) {
+    case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+        return mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk);
+    default: /* MLX5_WQ_TYPE_CYCLIC */
+        return mlx5e_rx_is_linear_skb(params, xsk);
+    }
+}
+
+static void mlx5e_build_xskicosq_param(struct mlx5e_priv *priv,
+                                       u8 log_wq_size,
+                                       struct mlx5e_sq_param *param)
+{
+    void *sqc = param->sqc;
+    void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+    mlx5e_build_sq_param_common(priv, param);
+
+    MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
+}
+
+static void mlx5e_build_xsk_cparam(struct mlx5e_priv *priv,
+                                   struct mlx5e_params *params,
+                                   struct mlx5e_xsk_param *xsk,
+                                   struct mlx5e_channel_param *cparam)
+{
+    const u8 xskicosq_size = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
+
+    mlx5e_build_rq_param(priv, params, xsk, &cparam->rq);
+    mlx5e_build_xdpsq_param(priv, params, &cparam->xdp_sq);
+    mlx5e_build_xskicosq_param(priv, xskicosq_size, &cparam->icosq);
+    mlx5e_build_rx_cq_param(priv, params, xsk, &cparam->rx_cq);
+    mlx5e_build_tx_cq_param(priv, params, &cparam->tx_cq);
+    mlx5e_build_ico_cq_param(priv, xskicosq_size, &cparam->icosq_cq);
+}
+
+int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params,
+                   struct mlx5e_xsk_param *xsk, struct xdp_umem *umem,
+                   struct mlx5e_channel *c)
+{
+    struct mlx5e_channel_param cparam = {};
+    struct dim_cq_moder icocq_moder = {};
+    int err;
+
+    if (!mlx5e_validate_xsk_param(params, xsk, priv->mdev))
+        return -EINVAL;
+
+    mlx5e_build_xsk_cparam(priv, params, xsk, &cparam);
+
+    err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam.rx_cq, &c->xskrq.cq);
+    if (unlikely(err))
+        return err;
+
+    err = mlx5e_open_rq(c, params, &cparam.rq, xsk, umem, &c->xskrq);
+    if (unlikely(err))
+        goto err_close_rx_cq;
+
+    err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam.tx_cq, &c->xsksq.cq);
+    if (unlikely(err))
+        goto err_close_rq;
+
+    /* Create a separate SQ, so that when the UMEM is disabled, we could
+     * close this SQ safely and stop receiving CQEs. In other case, e.g., if
+     * the XDPSQ was used instead, we might run into trouble when the UMEM
+     * is disabled and then reenabled, but the SQ continues receiving CQEs
+     * from the old UMEM.
+     */
+    err = mlx5e_open_xdpsq(c, params, &cparam.xdp_sq, umem, &c->xsksq, true);
+    if (unlikely(err))
+        goto err_close_tx_cq;
+
+    err = mlx5e_open_cq(c, icocq_moder, &cparam.icosq_cq, &c->xskicosq.cq);
+    if (unlikely(err))
+        goto err_close_sq;
+
+    /* Create a dedicated SQ for posting NOPs whenever we need an IRQ to be
+     * triggered and NAPI to be called on the correct CPU.
+     */
+    err = mlx5e_open_icosq(c, params, &cparam.icosq, &c->xskicosq);
+    if (unlikely(err))
+        goto err_close_icocq;
+
+    spin_lock_init(&c->xskicosq_lock);
+
+    set_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
+
+    return 0;
+
+err_close_icocq:
+    mlx5e_close_cq(&c->xskicosq.cq);
+
+err_close_sq:
+    mlx5e_close_xdpsq(&c->xsksq);
+
+err_close_tx_cq:
+    mlx5e_close_cq(&c->xsksq.cq);
+
+err_close_rq:
+    mlx5e_close_rq(&c->xskrq);
+
+err_close_rx_cq:
+    mlx5e_close_cq(&c->xskrq.cq);
+
+    return err;
+}
+
+void mlx5e_close_xsk(struct mlx5e_channel *c)
+{
+    clear_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
+    napi_synchronize(&c->napi);
+
+    mlx5e_close_rq(&c->xskrq);
+    mlx5e_close_cq(&c->xskrq.cq);
+    mlx5e_close_icosq(&c->xskicosq);
+    mlx5e_close_cq(&c->xskicosq.cq);
+    mlx5e_close_xdpsq(&c->xsksq);
+    mlx5e_close_cq(&c->xsksq.cq);
+}
+
+void mlx5e_activate_xsk(struct mlx5e_channel *c)
+{
+    set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
+    /* TX queue is created active. */
+    mlx5e_trigger_irq(&c->xskicosq);
+}
+
+void mlx5e_deactivate_xsk(struct mlx5e_channel *c)
+{
+    mlx5e_deactivate_rq(&c->xskrq);
+    /* TX queue is disabled on close. */
+}
+
+static int mlx5e_redirect_xsk_rqt(struct mlx5e_priv *priv, u16 ix, u32 rqn)
+{
+    struct mlx5e_redirect_rqt_param direct_rrp = {
+        .is_rss = false,
+        {
+            .rqn = rqn,
+        },
+    };
+
+    u32 rqtn = priv->xsk_tir[ix].rqt.rqtn;
+
+    return mlx5e_redirect_rqt(priv, rqtn, 1, direct_rrp);
+}
+
+int mlx5e_xsk_redirect_rqt_to_channel(struct mlx5e_priv *priv, struct mlx5e_channel *c)
+{
+    return mlx5e_redirect_xsk_rqt(priv, c->ix, c->xskrq.rqn);
+}
+
+int mlx5e_xsk_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix)
+{
+    return mlx5e_redirect_xsk_rqt(priv, ix, priv->drop_rq.rqn);
+}
+
+int mlx5e_xsk_redirect_rqts_to_channels(struct mlx5e_priv *priv, struct mlx5e_channels *chs)
+{
+    int err, i;
+
+    if (!priv->xsk.refcnt)
+        return 0;
+
+    for (i = 0; i < chs->num; i++) {
+        struct mlx5e_channel *c = chs->c[i];
+
+        if (!test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
+            continue;
+
+        err = mlx5e_xsk_redirect_rqt_to_channel(priv, c);
+        if (unlikely(err))
+            goto err_stop;
+    }
+
+    return 0;
+
+err_stop:
+    for (i--; i >= 0; i--) {
+        if (!test_bit(MLX5E_CHANNEL_STATE_XSK, chs->c[i]->state))
+            continue;
+
+        mlx5e_xsk_redirect_rqt_to_drop(priv, i);
+    }
+
+    return err;
+}
+
+void mlx5e_xsk_redirect_rqts_to_drop(struct mlx5e_priv *priv, struct mlx5e_channels *chs)
+{
+    int i;
+
+    if (!priv->xsk.refcnt)
+        return;
+
+    for (i = 0; i < chs->num; i++) {
+        if (!test_bit(MLX5E_CHANNEL_STATE_XSK, chs->c[i]->state))
+            continue;
+
+        mlx5e_xsk_redirect_rqt_to_drop(priv, i);
+    }
+}
+25
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h
···
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_XSK_SETUP_H__
+#define __MLX5_EN_XSK_SETUP_H__
+
+#include "en.h"
+
+struct mlx5e_xsk_param;
+
+bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
+                              struct mlx5e_xsk_param *xsk,
+                              struct mlx5_core_dev *mdev);
+int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params,
+                   struct mlx5e_xsk_param *xsk, struct xdp_umem *umem,
+                   struct mlx5e_channel *c);
+void mlx5e_close_xsk(struct mlx5e_channel *c);
+void mlx5e_activate_xsk(struct mlx5e_channel *c);
+void mlx5e_deactivate_xsk(struct mlx5e_channel *c);
+int mlx5e_xsk_redirect_rqt_to_channel(struct mlx5e_priv *priv, struct mlx5e_channel *c);
+int mlx5e_xsk_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix);
+int mlx5e_xsk_redirect_rqts_to_channels(struct mlx5e_priv *priv, struct mlx5e_channels *chs);
+void mlx5e_xsk_redirect_rqts_to_drop(struct mlx5e_priv *priv, struct mlx5e_channels *chs);
+
+#endif /* __MLX5_EN_XSK_SETUP_H__ */
+111
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
···
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "tx.h"
+#include "umem.h"
+#include "en/xdp.h"
+#include "en/params.h"
+#include <net/xdp_sock.h>
+
+int mlx5e_xsk_async_xmit(struct net_device *dev, u32 qid)
+{
+    struct mlx5e_priv *priv = netdev_priv(dev);
+    struct mlx5e_params *params = &priv->channels.params;
+    struct mlx5e_channel *c;
+    u16 ix;
+
+    if (unlikely(!mlx5e_xdp_is_open(priv)))
+        return -ENETDOWN;
+
+    if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid, MLX5E_RQ_GROUP_XSK, &ix)))
+        return -EINVAL;
+
+    c = priv->channels.c[ix];
+
+    if (unlikely(!test_bit(MLX5E_CHANNEL_STATE_XSK, c->state)))
+        return -ENXIO;
+
+    if (!napi_if_scheduled_mark_missed(&c->napi)) {
+        spin_lock(&c->xskicosq_lock);
+        mlx5e_trigger_irq(&c->xskicosq);
+        spin_unlock(&c->xskicosq_lock);
+    }
+
+    return 0;
+}
+
+/* When TX fails (because of the size of the packet), we need to get completions
+ * in order, so post a NOP to get a CQE. Since AF_XDP doesn't distinguish
+ * between successful TX and errors, handling in mlx5e_poll_xdpsq_cq is the
+ * same.
+ */
+static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
+                                  struct mlx5e_xdp_info *xdpi)
+{
+    u16 pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
+    struct mlx5e_xdp_wqe_info *wi = &sq->db.wqe_info[pi];
+    struct mlx5e_tx_wqe *nopwqe;
+
+    wi->num_wqebbs = 1;
+    wi->num_pkts = 1;
+
+    nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
+    mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
+    sq->doorbell_cseg = &nopwqe->ctrl;
+}
+
+bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
+{
+    struct xdp_umem *umem = sq->umem;
+    struct mlx5e_xdp_info xdpi;
+    struct mlx5e_xdp_xmit_data xdptxd;
+    bool work_done = true;
+    bool flush = false;
+
+    xdpi.mode = MLX5E_XDP_XMIT_MODE_XSK;
+
+    for (; budget; budget--) {
+        int check_result = sq->xmit_xdp_frame_check(sq);
+        struct xdp_desc desc;
+
+        if (unlikely(check_result < 0)) {
+            work_done = false;
+            break;
+        }
+
+        if (!xsk_umem_consume_tx(umem, &desc)) {
+            /* TX will get stuck until something wakes it up by
+             * triggering NAPI. Currently it's expected that the
+             * application calls sendto() if there are consumed, but
+             * not completed frames.
+             */
+            break;
+        }
+
+        xdptxd.dma_addr = xdp_umem_get_dma(umem, desc.addr);
+        xdptxd.data = xdp_umem_get_data(umem, desc.addr);
+        xdptxd.len = desc.len;
+
+        dma_sync_single_for_device(sq->pdev, xdptxd.dma_addr,
+                                   xdptxd.len, DMA_BIDIRECTIONAL);
+
+        if (unlikely(!sq->xmit_xdp_frame(sq, &xdptxd, &xdpi, check_result))) {
+            if (sq->mpwqe.wqe)
+                mlx5e_xdp_mpwqe_complete(sq);
+
+            mlx5e_xsk_tx_post_err(sq, &xdpi);
+        }
+
+        flush = true;
+    }
+
+    if (flush) {
+        if (sq->mpwqe.wqe)
+            mlx5e_xdp_mpwqe_complete(sq);
+        mlx5e_xmit_xdp_doorbell(sq);
+
+        xsk_umem_consume_tx_done(umem);
+    }
+
+    return !(budget && work_done);
+}
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h (+15)
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_XSK_TX_H__
+#define __MLX5_EN_XSK_TX_H__
+
+#include "en.h"
+
+/* TX data path */
+
+int mlx5e_xsk_async_xmit(struct net_device *dev, u32 qid);
+
+bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget);
+
+#endif /* __MLX5_EN_XSK_TX_H__ */
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c (+267)
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include <net/xdp_sock.h>
+#include "umem.h"
+#include "setup.h"
+#include "en/params.h"
+
+static int mlx5e_xsk_map_umem(struct mlx5e_priv *priv,
+			      struct xdp_umem *umem)
+{
+	struct device *dev = priv->mdev->device;
+	u32 i;
+
+	for (i = 0; i < umem->npgs; i++) {
+		dma_addr_t dma = dma_map_page(dev, umem->pgs[i], 0, PAGE_SIZE,
+					      DMA_BIDIRECTIONAL);
+
+		if (unlikely(dma_mapping_error(dev, dma)))
+			goto err_unmap;
+		umem->pages[i].dma = dma;
+	}
+
+	return 0;
+
+err_unmap:
+	while (i--) {
+		dma_unmap_page(dev, umem->pages[i].dma, PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+		umem->pages[i].dma = 0;
+	}
+
+	return -ENOMEM;
+}
+
+static void mlx5e_xsk_unmap_umem(struct mlx5e_priv *priv,
+				 struct xdp_umem *umem)
+{
+	struct device *dev = priv->mdev->device;
+	u32 i;
+
+	for (i = 0; i < umem->npgs; i++) {
+		dma_unmap_page(dev, umem->pages[i].dma, PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+		umem->pages[i].dma = 0;
+	}
+}
+
+static int mlx5e_xsk_get_umems(struct mlx5e_xsk *xsk)
+{
+	if (!xsk->umems) {
+		xsk->umems = kcalloc(MLX5E_MAX_NUM_CHANNELS,
+				     sizeof(*xsk->umems), GFP_KERNEL);
+		if (unlikely(!xsk->umems))
+			return -ENOMEM;
+	}
+
+	xsk->refcnt++;
+	xsk->ever_used = true;
+
+	return 0;
+}
+
+static void mlx5e_xsk_put_umems(struct mlx5e_xsk *xsk)
+{
+	if (!--xsk->refcnt) {
+		kfree(xsk->umems);
+		xsk->umems = NULL;
+	}
+}
+
+static int mlx5e_xsk_add_umem(struct mlx5e_xsk *xsk, struct xdp_umem *umem, u16 ix)
+{
+	int err;
+
+	err = mlx5e_xsk_get_umems(xsk);
+	if (unlikely(err))
+		return err;
+
+	xsk->umems[ix] = umem;
+	return 0;
+}
+
+static void mlx5e_xsk_remove_umem(struct mlx5e_xsk *xsk, u16 ix)
+{
+	xsk->umems[ix] = NULL;
+
+	mlx5e_xsk_put_umems(xsk);
+}
+
+static bool mlx5e_xsk_is_umem_sane(struct xdp_umem *umem)
+{
+	return umem->headroom <= 0xffff && umem->chunk_size_nohr <= 0xffff;
+}
+
+void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk)
+{
+	xsk->headroom = umem->headroom;
+	xsk->chunk_size = umem->chunk_size_nohr + umem->headroom;
+}
+
+static int mlx5e_xsk_enable_locked(struct mlx5e_priv *priv,
+				   struct xdp_umem *umem, u16 ix)
+{
+	struct mlx5e_params *params = &priv->channels.params;
+	struct mlx5e_xsk_param xsk;
+	struct mlx5e_channel *c;
+	int err;
+
+	if (unlikely(mlx5e_xsk_get_umem(&priv->channels.params, &priv->xsk, ix)))
+		return -EBUSY;
+
+	if (unlikely(!mlx5e_xsk_is_umem_sane(umem)))
+		return -EINVAL;
+
+	err = mlx5e_xsk_map_umem(priv, umem);
+	if (unlikely(err))
+		return err;
+
+	err = mlx5e_xsk_add_umem(&priv->xsk, umem, ix);
+	if (unlikely(err))
+		goto err_unmap_umem;
+
+	mlx5e_build_xsk_param(umem, &xsk);
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+		/* XSK objects will be created on open. */
+		goto validate_closed;
+	}
+
+	if (!params->xdp_prog) {
+		/* XSK objects will be created when an XDP program is set,
+		 * and the channels are reopened.
+		 */
+		goto validate_closed;
+	}
+
+	c = priv->channels.c[ix];
+
+	err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+	if (unlikely(err))
+		goto err_remove_umem;
+
+	mlx5e_activate_xsk(c);
+
+	/* Don't wait for WQEs, because the newer xdpsock sample doesn't provide
+	 * any Fill Ring entries at the setup stage.
+	 */
+
+	err = mlx5e_xsk_redirect_rqt_to_channel(priv, priv->channels.c[ix]);
+	if (unlikely(err))
+		goto err_deactivate;
+
+	return 0;
+
+err_deactivate:
+	mlx5e_deactivate_xsk(c);
+	mlx5e_close_xsk(c);
+
+err_remove_umem:
+	mlx5e_xsk_remove_umem(&priv->xsk, ix);
+
+err_unmap_umem:
+	mlx5e_xsk_unmap_umem(priv, umem);
+
+	return err;
+
+validate_closed:
+	/* Check the configuration in advance, rather than fail at a later stage
+	 * (in mlx5e_xdp_set or on open) and end up with no channels.
+	 */
+	if (!mlx5e_validate_xsk_param(params, &xsk, priv->mdev)) {
+		err = -EINVAL;
+		goto err_remove_umem;
+	}
+
+	return 0;
+}
+
+static int mlx5e_xsk_disable_locked(struct mlx5e_priv *priv, u16 ix)
+{
+	struct xdp_umem *umem = mlx5e_xsk_get_umem(&priv->channels.params,
+						   &priv->xsk, ix);
+	struct mlx5e_channel *c;
+
+	if (unlikely(!umem))
+		return -EINVAL;
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
+		goto remove_umem;
+
+	/* XSK RQ and SQ are only created if XDP program is set. */
+	if (!priv->channels.params.xdp_prog)
+		goto remove_umem;
+
+	c = priv->channels.c[ix];
+	mlx5e_xsk_redirect_rqt_to_drop(priv, ix);
+	mlx5e_deactivate_xsk(c);
+	mlx5e_close_xsk(c);
+
+remove_umem:
+	mlx5e_xsk_remove_umem(&priv->xsk, ix);
+	mlx5e_xsk_unmap_umem(priv, umem);
+
+	return 0;
+}
+
+static int mlx5e_xsk_enable_umem(struct mlx5e_priv *priv, struct xdp_umem *umem,
+				 u16 ix)
+{
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	err = mlx5e_xsk_enable_locked(priv, umem, ix);
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+static int mlx5e_xsk_disable_umem(struct mlx5e_priv *priv, u16 ix)
+{
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	err = mlx5e_xsk_disable_locked(priv, ix);
+	mutex_unlock(&priv->state_lock);
+
+	return err;
+}
+
+int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5e_params *params = &priv->channels.params;
+	u16 ix;
+
+	if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid, MLX5E_RQ_GROUP_XSK, &ix)))
+		return -EINVAL;
+
+	return umem ? mlx5e_xsk_enable_umem(priv, umem, ix) :
+		      mlx5e_xsk_disable_umem(priv, ix);
+}
+
+int mlx5e_xsk_resize_reuseq(struct xdp_umem *umem, u32 nentries)
+{
+	struct xdp_umem_fq_reuse *reuseq;
+
+	reuseq = xsk_reuseq_prepare(nentries);
+	if (unlikely(!reuseq))
+		return -ENOMEM;
+	xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
+
+	return 0;
+}
+
+u16 mlx5e_xsk_first_unused_channel(struct mlx5e_params *params, struct mlx5e_xsk *xsk)
+{
+	u16 res = xsk->refcnt ? params->num_channels : 0;
+
+	while (res) {
+		if (mlx5e_xsk_get_umem(params, xsk, res - 1))
+			break;
+		--res;
+	}
+
+	return res;
+}
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h (+31)
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#ifndef __MLX5_EN_XSK_UMEM_H__
+#define __MLX5_EN_XSK_UMEM_H__
+
+#include "en.h"
+
+static inline struct xdp_umem *mlx5e_xsk_get_umem(struct mlx5e_params *params,
+						  struct mlx5e_xsk *xsk, u16 ix)
+{
+	if (!xsk || !xsk->umems)
+		return NULL;
+
+	if (unlikely(ix >= params->num_channels))
+		return NULL;
+
+	return xsk->umems[ix];
+}
+
+struct mlx5e_xsk_param;
+void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk);
+
+/* .ndo_bpf callback. */
+int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid);
+
+int mlx5e_xsk_resize_reuseq(struct xdp_umem *umem, u32 nentries);
+
+u16 mlx5e_xsk_first_unused_channel(struct mlx5e_params *params, struct mlx5e_xsk *xsk);
+
+#endif /* __MLX5_EN_XSK_UMEM_H__ */
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c (+23 -2)
···
 
 #include "en.h"
 #include "en/port.h"
+#include "en/xsk/umem.h"
 #include "lib/clock.h"
 
 void mlx5e_ethtool_get_drvinfo(struct mlx5e_priv *priv,
···
 void mlx5e_ethtool_get_channels(struct mlx5e_priv *priv,
 				struct ethtool_channels *ch)
 {
+	mutex_lock(&priv->state_lock);
+
 	ch->max_combined = mlx5e_get_netdev_max_channels(priv->netdev);
 	ch->combined_count = priv->channels.params.num_channels;
+	if (priv->xsk.refcnt) {
+		/* The upper half are XSK queues. */
+		ch->max_combined *= 2;
+		ch->combined_count *= 2;
+	}
+
+	mutex_unlock(&priv->state_lock);
 }
 
 static void mlx5e_get_channels(struct net_device *dev,
···
 int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv,
 			       struct ethtool_channels *ch)
 {
+	struct mlx5e_params *cur_params = &priv->channels.params;
 	unsigned int count = ch->combined_count;
 	struct mlx5e_channels new_channels = {};
 	bool arfs_enabled;
···
 		return -EINVAL;
 	}
 
-	if (priv->channels.params.num_channels == count)
+	if (cur_params->num_channels == count)
 		return 0;
 
 	mutex_lock(&priv->state_lock);
+
+	/* Don't allow changing the number of channels if there is an active
+	 * XSK, because the numeration of the XSK and regular RQs will change.
+	 */
+	if (priv->xsk.refcnt) {
+		err = -EINVAL;
+		netdev_err(priv->netdev, "%s: AF_XDP is active, cannot change the number of channels\n",
+			   __func__);
+		goto out;
+	}
 
 	new_channels.params = priv->channels.params;
 	new_channels.params.num_channels = count;
 
 	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
-		priv->channels.params = new_channels.params;
+		*cur_params = new_channels.params;
 		if (!netif_is_rxfh_configured(priv->netdev))
 			mlx5e_build_default_indir_rqt(priv->rss_params.indirection_rqt,
 						      MLX5E_INDIR_RQT_SIZE, count);
drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c (+14 -4)
···
 
 #include <linux/mlx5/fs.h>
 #include "en.h"
+#include "en/params.h"
+#include "en/xsk/umem.h"
 
 struct mlx5e_ethtool_rule {
 	struct list_head list;
···
 	if (fs->ring_cookie == RX_CLS_FLOW_DISC) {
 		flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
 	} else {
+		struct mlx5e_params *params = &priv->channels.params;
+		enum mlx5e_rq_group group;
+		struct mlx5e_tir *tir;
+		u16 ix;
+
+		mlx5e_qid_get_ch_and_group(params, fs->ring_cookie, &ix, &group);
+		tir = group == MLX5E_RQ_GROUP_XSK ? priv->xsk_tir : priv->direct_tir;
+
 		dst = kzalloc(sizeof(*dst), GFP_KERNEL);
 		if (!dst) {
 			err = -ENOMEM;
···
 		}
 
 		dst->type = MLX5_FLOW_DESTINATION_TYPE_TIR;
-		dst->tir_num = priv->direct_tir[fs->ring_cookie].tirn;
+		dst->tir_num = tir[ix].tirn;
 		flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 	}
 
···
 	if (fs->location >= MAX_NUM_OF_ETHTOOL_RULES)
 		return -ENOSPC;
 
-	if (fs->ring_cookie >= priv->channels.params.num_channels &&
-	    fs->ring_cookie != RX_CLS_FLOW_DISC)
-		return -EINVAL;
+	if (fs->ring_cookie != RX_CLS_FLOW_DISC)
+		if (!mlx5e_qid_validate(&priv->channels.params, fs->ring_cookie))
+			return -EINVAL;
 
 	switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
 	case ETHER_FLOW:
drivers/net/ethernet/mellanox/mlx5/core/en_main.c (+498 -282)
··· 38 38 #include <linux/bpf.h> 39 39 #include <linux/if_bridge.h> 40 40 #include <net/page_pool.h> 41 + #include <net/xdp_sock.h> 41 42 #include "eswitch.h" 42 43 #include "en.h" 43 44 #include "en_tc.h" ··· 57 56 #include "en/monitor_stats.h" 58 57 #include "en/reporter.h" 59 58 #include "en/params.h" 60 - 61 - struct mlx5e_rq_param { 62 - u32 rqc[MLX5_ST_SZ_DW(rqc)]; 63 - struct mlx5_wq_param wq; 64 - struct mlx5e_rq_frags_info frags_info; 65 - }; 66 - 67 - struct mlx5e_sq_param { 68 - u32 sqc[MLX5_ST_SZ_DW(sqc)]; 69 - struct mlx5_wq_param wq; 70 - bool is_mpw; 71 - }; 72 - 73 - struct mlx5e_cq_param { 74 - u32 cqc[MLX5_ST_SZ_DW(cqc)]; 75 - struct mlx5_wq_param wq; 76 - u16 eq_ix; 77 - u8 cq_period_mode; 78 - }; 79 - 80 - struct mlx5e_channel_param { 81 - struct mlx5e_rq_param rq; 82 - struct mlx5e_sq_param sq; 83 - struct mlx5e_sq_param xdp_sq; 84 - struct mlx5e_sq_param icosq; 85 - struct mlx5e_cq_param rx_cq; 86 - struct mlx5e_cq_param tx_cq; 87 - struct mlx5e_cq_param icosq_cq; 88 - }; 59 + #include "en/xsk/umem.h" 60 + #include "en/xsk/setup.h" 61 + #include "en/xsk/rx.h" 62 + #include "en/xsk/tx.h" 89 63 90 64 bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev) 91 65 { ··· 90 114 mlx5_core_info(mdev, "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n", 91 115 params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ, 92 116 params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ ? 
93 - BIT(mlx5e_mpwqe_get_log_rq_size(params)) : 117 + BIT(mlx5e_mpwqe_get_log_rq_size(params, NULL)) : 94 118 BIT(params->log_rq_mtu_frames), 95 - BIT(mlx5e_mpwqe_get_log_stride_size(mdev, params)), 119 + BIT(mlx5e_mpwqe_get_log_stride_size(mdev, params, NULL)), 96 120 MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS)); 97 121 } 98 122 99 123 bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev, 100 124 struct mlx5e_params *params) 101 125 { 102 - return mlx5e_check_fragmented_striding_rq_cap(mdev) && 103 - !MLX5_IPSEC_DEV(mdev) && 104 - !(params->xdp_prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params)); 126 + if (!mlx5e_check_fragmented_striding_rq_cap(mdev)) 127 + return false; 128 + 129 + if (MLX5_IPSEC_DEV(mdev)) 130 + return false; 131 + 132 + if (params->xdp_prog) { 133 + /* XSK params are not considered here. If striding RQ is in use, 134 + * and an XSK is being opened, mlx5e_rx_mpwqe_is_linear_skb will 135 + * be called with the known XSK params. 136 + */ 137 + if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL)) 138 + return false; 139 + } 140 + 141 + return true; 105 142 } 106 143 107 144 void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params) ··· 383 394 384 395 static int mlx5e_alloc_rq(struct mlx5e_channel *c, 385 396 struct mlx5e_params *params, 397 + struct mlx5e_xsk_param *xsk, 398 + struct xdp_umem *umem, 386 399 struct mlx5e_rq_param *rqp, 387 400 struct mlx5e_rq *rq) 388 401 { ··· 392 401 struct mlx5_core_dev *mdev = c->mdev; 393 402 void *rqc = rqp->rqc; 394 403 void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq); 404 + u32 num_xsk_frames = 0; 405 + u32 rq_xdp_ix; 395 406 u32 pool_size; 396 407 int wq_sz; 397 408 int err; ··· 410 417 rq->ix = c->ix; 411 418 rq->mdev = mdev; 412 419 rq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu); 413 - rq->stats = &c->priv->channel_stats[c->ix].rq; 420 + rq->xdpsq = &c->rq_xdpsq; 421 + rq->umem = umem; 422 + 423 + if (rq->umem) 424 + rq->stats = &c->priv->channel_stats[c->ix].xskrq; 
425 + else 426 + rq->stats = &c->priv->channel_stats[c->ix].rq; 414 427 415 428 rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL; 416 429 if (IS_ERR(rq->xdp_prog)) { ··· 425 426 goto err_rq_wq_destroy; 426 427 } 427 428 428 - err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix); 429 + rq_xdp_ix = rq->ix; 430 + if (xsk) 431 + rq_xdp_ix += params->num_channels * MLX5E_RQ_GROUP_XSK; 432 + err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix); 429 433 if (err < 0) 430 434 goto err_rq_wq_destroy; 431 435 432 436 rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE; 433 - rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params); 437 + rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk); 438 + rq->buff.umem_headroom = xsk ? xsk->headroom : 0; 434 439 pool_size = 1 << params->log_rq_mtu_frames; 435 440 436 441 switch (rq->wq_type) { ··· 448 445 449 446 wq_sz = mlx5_wq_ll_get_size(&rq->mpwqe.wq); 450 447 451 - pool_size = MLX5_MPWRQ_PAGES_PER_WQE << mlx5e_mpwqe_get_log_rq_size(params); 448 + if (xsk) 449 + num_xsk_frames = wq_sz << 450 + mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk); 451 + 452 + pool_size = MLX5_MPWRQ_PAGES_PER_WQE << 453 + mlx5e_mpwqe_get_log_rq_size(params, xsk); 452 454 453 455 rq->post_wqes = mlx5e_post_rx_mpwqes; 454 456 rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe; ··· 472 464 goto err_rq_wq_destroy; 473 465 } 474 466 475 - rq->mpwqe.skb_from_cqe_mpwrq = 476 - mlx5e_rx_mpwqe_is_linear_skb(mdev, params) ? 477 - mlx5e_skb_from_cqe_mpwrq_linear : 478 - mlx5e_skb_from_cqe_mpwrq_nonlinear; 479 - rq->mpwqe.log_stride_sz = mlx5e_mpwqe_get_log_stride_size(mdev, params); 480 - rq->mpwqe.num_strides = BIT(mlx5e_mpwqe_get_log_num_strides(mdev, params)); 467 + rq->mpwqe.skb_from_cqe_mpwrq = xsk ? 468 + mlx5e_xsk_skb_from_cqe_mpwrq_linear : 469 + mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL) ? 
470 + mlx5e_skb_from_cqe_mpwrq_linear : 471 + mlx5e_skb_from_cqe_mpwrq_nonlinear; 472 + 473 + rq->mpwqe.log_stride_sz = mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk); 474 + rq->mpwqe.num_strides = 475 + BIT(mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk)); 481 476 482 477 err = mlx5e_create_rq_umr_mkey(mdev, rq); 483 478 if (err) ··· 501 490 502 491 wq_sz = mlx5_wq_cyc_get_size(&rq->wqe.wq); 503 492 493 + if (xsk) 494 + num_xsk_frames = wq_sz << rq->wqe.info.log_num_frags; 495 + 504 496 rq->wqe.info = rqp->frags_info; 505 497 rq->wqe.frags = 506 498 kvzalloc_node(array_size(sizeof(*rq->wqe.frags), ··· 517 503 err = mlx5e_init_di_list(rq, wq_sz, c->cpu); 518 504 if (err) 519 505 goto err_free; 506 + 520 507 rq->post_wqes = mlx5e_post_rx_wqes; 521 508 rq->dealloc_wqe = mlx5e_dealloc_rx_wqe; 522 509 ··· 533 518 goto err_free; 534 519 } 535 520 536 - rq->wqe.skb_from_cqe = mlx5e_rx_is_linear_skb(params) ? 537 - mlx5e_skb_from_cqe_linear : 538 - mlx5e_skb_from_cqe_nonlinear; 521 + rq->wqe.skb_from_cqe = xsk ? 522 + mlx5e_xsk_skb_from_cqe_linear : 523 + mlx5e_rx_is_linear_skb(params, NULL) ? 524 + mlx5e_skb_from_cqe_linear : 525 + mlx5e_skb_from_cqe_nonlinear; 539 526 rq->mkey_be = c->mkey_be; 540 527 } 541 528 542 - /* Create a page_pool and register it with rxq */ 543 - pp_params.order = 0; 544 - pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ 545 - pp_params.pool_size = pool_size; 546 - pp_params.nid = cpu_to_node(c->cpu); 547 - pp_params.dev = c->pdev; 548 - pp_params.dma_dir = rq->buff.map_dir; 529 + if (xsk) { 530 + err = mlx5e_xsk_resize_reuseq(umem, num_xsk_frames); 531 + if (unlikely(err)) { 532 + mlx5_core_err(mdev, "Unable to allocate the Reuse Ring for %u frames\n", 533 + num_xsk_frames); 534 + goto err_free; 535 + } 549 536 550 - /* page_pool can be used even when there is no rq->xdp_prog, 551 - * given page_pool does not handle DMA mapping there is no 552 - * required state to clear. 
And page_pool gracefully handle 553 - * elevated refcnt. 554 - */ 555 - rq->page_pool = page_pool_create(&pp_params); 556 - if (IS_ERR(rq->page_pool)) { 557 - err = PTR_ERR(rq->page_pool); 558 - rq->page_pool = NULL; 559 - goto err_free; 537 + rq->zca.free = mlx5e_xsk_zca_free; 538 + err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, 539 + MEM_TYPE_ZERO_COPY, 540 + &rq->zca); 541 + } else { 542 + /* Create a page_pool and register it with rxq */ 543 + pp_params.order = 0; 544 + pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ 545 + pp_params.pool_size = pool_size; 546 + pp_params.nid = cpu_to_node(c->cpu); 547 + pp_params.dev = c->pdev; 548 + pp_params.dma_dir = rq->buff.map_dir; 549 + 550 + /* page_pool can be used even when there is no rq->xdp_prog, 551 + * given page_pool does not handle DMA mapping there is no 552 + * required state to clear. And page_pool gracefully handle 553 + * elevated refcnt. 554 + */ 555 + rq->page_pool = page_pool_create(&pp_params); 556 + if (IS_ERR(rq->page_pool)) { 557 + err = PTR_ERR(rq->page_pool); 558 + rq->page_pool = NULL; 559 + goto err_free; 560 + } 561 + err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, 562 + MEM_TYPE_PAGE_POOL, rq->page_pool); 563 + if (err) 564 + page_pool_free(rq->page_pool); 560 565 } 561 - err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, 562 - MEM_TYPE_PAGE_POOL, rq->page_pool); 563 - if (err) { 564 - page_pool_free(rq->page_pool); 566 + if (err) 565 567 goto err_free; 566 - } 567 568 568 569 for (i = 0; i < wq_sz; i++) { 569 570 if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) { ··· 670 639 i = (i + 1) & (MLX5E_CACHE_SIZE - 1)) { 671 640 struct mlx5e_dma_info *dma_info = &rq->page_cache.page_cache[i]; 672 641 673 - mlx5e_page_release(rq, dma_info, false); 642 + /* With AF_XDP, page_cache is not used, so this loop is not 643 + * entered, and it's safe to call mlx5e_page_release_dynamic 644 + * directly. 
645 + */ 646 + mlx5e_page_release_dynamic(rq, dma_info, false); 674 647 } 675 648 676 649 xdp_rxq_info_unreg(&rq->xdp_rxq); ··· 811 776 mlx5_core_destroy_rq(rq->mdev, rq->rqn); 812 777 } 813 778 814 - static int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time) 779 + int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time) 815 780 { 816 781 unsigned long exp_time = jiffies + msecs_to_jiffies(wait_time); 817 782 struct mlx5e_channel *c = rq->channel; ··· 869 834 870 835 } 871 836 872 - static int mlx5e_open_rq(struct mlx5e_channel *c, 873 - struct mlx5e_params *params, 874 - struct mlx5e_rq_param *param, 875 - struct mlx5e_rq *rq) 837 + int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_params *params, 838 + struct mlx5e_rq_param *param, struct mlx5e_xsk_param *xsk, 839 + struct xdp_umem *umem, struct mlx5e_rq *rq) 876 840 { 877 841 int err; 878 842 879 - err = mlx5e_alloc_rq(c, params, param, rq); 843 + err = mlx5e_alloc_rq(c, params, xsk, umem, param, rq); 880 844 if (err) 881 845 return err; 882 846 ··· 913 879 mlx5e_trigger_irq(&rq->channel->icosq); 914 880 } 915 881 916 - static void mlx5e_deactivate_rq(struct mlx5e_rq *rq) 882 + void mlx5e_deactivate_rq(struct mlx5e_rq *rq) 917 883 { 918 884 clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state); 919 885 napi_synchronize(&rq->channel->napi); /* prevent mlx5e_post_rx_wqes */ 920 886 } 921 887 922 - static void mlx5e_close_rq(struct mlx5e_rq *rq) 888 + void mlx5e_close_rq(struct mlx5e_rq *rq) 923 889 { 924 890 cancel_work_sync(&rq->dim.work); 925 891 mlx5e_destroy_rq(rq); ··· 972 938 973 939 static int mlx5e_alloc_xdpsq(struct mlx5e_channel *c, 974 940 struct mlx5e_params *params, 941 + struct xdp_umem *umem, 975 942 struct mlx5e_sq_param *param, 976 943 struct mlx5e_xdpsq *sq, 977 944 bool is_redirect) ··· 988 953 sq->uar_map = mdev->mlx5e_res.bfreg.map; 989 954 sq->min_inline_mode = params->tx_min_inline_mode; 990 955 sq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu); 991 - sq->stats = 
is_redirect ? 992 - &c->priv->channel_stats[c->ix].xdpsq : 993 - &c->priv->channel_stats[c->ix].rq_xdpsq; 956 + sq->umem = umem; 957 + 958 + sq->stats = sq->umem ? 959 + &c->priv->channel_stats[c->ix].xsksq : 960 + is_redirect ? 961 + &c->priv->channel_stats[c->ix].xdpsq : 962 + &c->priv->channel_stats[c->ix].rq_xdpsq; 994 963 995 964 param->wq.db_numa_node = cpu_to_node(c->cpu); 996 965 err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, wq, &sq->wq_ctrl); ··· 1374 1335 mlx5e_tx_reporter_err_cqe(sq); 1375 1336 } 1376 1337 1377 - static int mlx5e_open_icosq(struct mlx5e_channel *c, 1378 - struct mlx5e_params *params, 1379 - struct mlx5e_sq_param *param, 1380 - struct mlx5e_icosq *sq) 1338 + int mlx5e_open_icosq(struct mlx5e_channel *c, struct mlx5e_params *params, 1339 + struct mlx5e_sq_param *param, struct mlx5e_icosq *sq) 1381 1340 { 1382 1341 struct mlx5e_create_sq_param csp = {}; 1383 1342 int err; ··· 1401 1364 return err; 1402 1365 } 1403 1366 1404 - static void mlx5e_close_icosq(struct mlx5e_icosq *sq) 1367 + void mlx5e_close_icosq(struct mlx5e_icosq *sq) 1405 1368 { 1406 1369 struct mlx5e_channel *c = sq->channel; 1407 1370 ··· 1412 1375 mlx5e_free_icosq(sq); 1413 1376 } 1414 1377 1415 - static int mlx5e_open_xdpsq(struct mlx5e_channel *c, 1416 - struct mlx5e_params *params, 1417 - struct mlx5e_sq_param *param, 1418 - struct mlx5e_xdpsq *sq, 1419 - bool is_redirect) 1378 + int mlx5e_open_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params, 1379 + struct mlx5e_sq_param *param, struct xdp_umem *umem, 1380 + struct mlx5e_xdpsq *sq, bool is_redirect) 1420 1381 { 1421 1382 struct mlx5e_create_sq_param csp = {}; 1422 1383 int err; 1423 1384 1424 - err = mlx5e_alloc_xdpsq(c, params, param, sq, is_redirect); 1385 + err = mlx5e_alloc_xdpsq(c, params, umem, param, sq, is_redirect); 1425 1386 if (err) 1426 1387 return err; 1427 1388 ··· 1473 1438 return err; 1474 1439 } 1475 1440 1476 - static void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq) 
1441 + void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq) 1477 1442 { 1478 1443 struct mlx5e_channel *c = sq->channel; 1479 1444 ··· 1481 1446 napi_synchronize(&c->napi); 1482 1447 1483 1448 mlx5e_destroy_sq(c->mdev, sq->sqn); 1484 - mlx5e_free_xdpsq_descs(sq, rq); 1449 + mlx5e_free_xdpsq_descs(sq); 1485 1450 mlx5e_free_xdpsq(sq); 1486 1451 } 1487 1452 ··· 1602 1567 mlx5_core_destroy_cq(cq->mdev, &cq->mcq); 1603 1568 } 1604 1569 1605 - static int mlx5e_open_cq(struct mlx5e_channel *c, 1606 - struct dim_cq_moder moder, 1607 - struct mlx5e_cq_param *param, 1608 - struct mlx5e_cq *cq) 1570 + int mlx5e_open_cq(struct mlx5e_channel *c, struct dim_cq_moder moder, 1571 + struct mlx5e_cq_param *param, struct mlx5e_cq *cq) 1609 1572 { 1610 1573 struct mlx5_core_dev *mdev = c->mdev; 1611 1574 int err; ··· 1626 1593 return err; 1627 1594 } 1628 1595 1629 - static void mlx5e_close_cq(struct mlx5e_cq *cq) 1596 + void mlx5e_close_cq(struct mlx5e_cq *cq) 1630 1597 { 1631 1598 mlx5e_destroy_cq(cq); 1632 1599 mlx5e_free_cq(cq); ··· 1800 1767 free_cpumask_var(c->xps_cpumask); 1801 1768 } 1802 1769 1770 + static int mlx5e_open_queues(struct mlx5e_channel *c, 1771 + struct mlx5e_params *params, 1772 + struct mlx5e_channel_param *cparam) 1773 + { 1774 + struct dim_cq_moder icocq_moder = {0, 0}; 1775 + int err; 1776 + 1777 + err = mlx5e_open_cq(c, icocq_moder, &cparam->icosq_cq, &c->icosq.cq); 1778 + if (err) 1779 + return err; 1780 + 1781 + err = mlx5e_open_tx_cqs(c, params, cparam); 1782 + if (err) 1783 + goto err_close_icosq_cq; 1784 + 1785 + err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam->tx_cq, &c->xdpsq.cq); 1786 + if (err) 1787 + goto err_close_tx_cqs; 1788 + 1789 + err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam->rx_cq, &c->rq.cq); 1790 + if (err) 1791 + goto err_close_xdp_tx_cqs; 1792 + 1793 + /* XDP SQ CQ params are same as normal TXQ sq CQ params */ 1794 + err = c->xdp ? 
mlx5e_open_cq(c, params->tx_cq_moderation,
+				  &cparam->tx_cq, &c->rq_xdpsq.cq) : 0;
+	if (err)
+		goto err_close_rx_cq;
+
+	napi_enable(&c->napi);
+
+	err = mlx5e_open_icosq(c, params, &cparam->icosq, &c->icosq);
+	if (err)
+		goto err_disable_napi;
+
+	err = mlx5e_open_sqs(c, params, cparam);
+	if (err)
+		goto err_close_icosq;
+
+	if (c->xdp) {
+		err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, NULL,
+				       &c->rq_xdpsq, false);
+		if (err)
+			goto err_close_sqs;
+	}
+
+	err = mlx5e_open_rq(c, params, &cparam->rq, NULL, NULL, &c->rq);
+	if (err)
+		goto err_close_xdp_sq;
+
+	err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, NULL, &c->xdpsq, true);
+	if (err)
+		goto err_close_rq;
+
+	return 0;
+
+err_close_rq:
+	mlx5e_close_rq(&c->rq);
+
+err_close_xdp_sq:
+	if (c->xdp)
+		mlx5e_close_xdpsq(&c->rq_xdpsq);
+
+err_close_sqs:
+	mlx5e_close_sqs(c);
+
+err_close_icosq:
+	mlx5e_close_icosq(&c->icosq);
+
+err_disable_napi:
+	napi_disable(&c->napi);
+
+	if (c->xdp)
+		mlx5e_close_cq(&c->rq_xdpsq.cq);
+
+err_close_rx_cq:
+	mlx5e_close_cq(&c->rq.cq);
+
+err_close_xdp_tx_cqs:
+	mlx5e_close_cq(&c->xdpsq.cq);
+
+err_close_tx_cqs:
+	mlx5e_close_tx_cqs(c);
+
+err_close_icosq_cq:
+	mlx5e_close_cq(&c->icosq.cq);
+
+	return err;
+}
+
+static void mlx5e_close_queues(struct mlx5e_channel *c)
+{
+	mlx5e_close_xdpsq(&c->xdpsq);
+	mlx5e_close_rq(&c->rq);
+	if (c->xdp)
+		mlx5e_close_xdpsq(&c->rq_xdpsq);
+	mlx5e_close_sqs(c);
+	mlx5e_close_icosq(&c->icosq);
+	napi_disable(&c->napi);
+	if (c->xdp)
+		mlx5e_close_cq(&c->rq_xdpsq.cq);
+	mlx5e_close_cq(&c->rq.cq);
+	mlx5e_close_cq(&c->xdpsq.cq);
+	mlx5e_close_tx_cqs(c);
+	mlx5e_close_cq(&c->icosq.cq);
+}
+
 static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			      struct mlx5e_params *params,
 			      struct mlx5e_channel_param *cparam,
+			      struct xdp_umem *umem,
 			      struct mlx5e_channel **cp)
 {
 	int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
-	struct dim_cq_moder icocq_moder = {0, 0};
 	struct net_device *netdev = priv->netdev;
+	struct mlx5e_xsk_param xsk;
 	struct mlx5e_channel *c;
 	unsigned int irq;
 	int err;
···
 	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
 
-	err = mlx5e_open_cq(c, icocq_moder, &cparam->icosq_cq, &c->icosq.cq);
-	if (err)
+	err = mlx5e_open_queues(c, params, cparam);
+	if (unlikely(err))
 		goto err_napi_del;
 
-	err = mlx5e_open_tx_cqs(c, params, cparam);
-	if (err)
-		goto err_close_icosq_cq;
-
-	err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam->tx_cq, &c->xdpsq.cq);
-	if (err)
-		goto err_close_tx_cqs;
-
-	err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam->rx_cq, &c->rq.cq);
-	if (err)
-		goto err_close_xdp_tx_cqs;
-
-	/* XDP SQ CQ params are same as normal TXQ sq CQ params */
-	err = c->xdp ? mlx5e_open_cq(c, params->tx_cq_moderation,
-				     &cparam->tx_cq, &c->rq.xdpsq.cq) : 0;
-	if (err)
-		goto err_close_rx_cq;
-
-	napi_enable(&c->napi);
-
-	err = mlx5e_open_icosq(c, params, &cparam->icosq, &c->icosq);
-	if (err)
-		goto err_disable_napi;
-
-	err = mlx5e_open_sqs(c, params, cparam);
-	if (err)
-		goto err_close_icosq;
-
-	err = c->xdp ?
-		mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, &c->rq.xdpsq, false) : 0;
-	if (err)
-		goto err_close_sqs;
-
-	err = mlx5e_open_rq(c, params, &cparam->rq, &c->rq);
-	if (err)
-		goto err_close_xdp_sq;
-
-	err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, &c->xdpsq, true);
-	if (err)
-		goto err_close_rq;
+	if (umem) {
+		mlx5e_build_xsk_param(umem, &xsk);
+		err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+		if (unlikely(err))
+			goto err_close_queues;
+	}
 
 	*cp = c;
 
 	return 0;
 
-err_close_rq:
-	mlx5e_close_rq(&c->rq);
-
-err_close_xdp_sq:
-	if (c->xdp)
-		mlx5e_close_xdpsq(&c->rq.xdpsq, &c->rq);
-
-err_close_sqs:
-	mlx5e_close_sqs(c);
-
-err_close_icosq:
-	mlx5e_close_icosq(&c->icosq);
-
-err_disable_napi:
-	napi_disable(&c->napi);
-	if (c->xdp)
-		mlx5e_close_cq(&c->rq.xdpsq.cq);
-
-err_close_rx_cq:
-	mlx5e_close_cq(&c->rq.cq);
-
-err_close_xdp_tx_cqs:
-	mlx5e_close_cq(&c->xdpsq.cq);
-
-err_close_tx_cqs:
-	mlx5e_close_tx_cqs(c);
-
-err_close_icosq_cq:
-	mlx5e_close_cq(&c->icosq.cq);
+err_close_queues:
+	mlx5e_close_queues(c);
 
 err_napi_del:
 	netif_napi_del(&c->napi);
···
 	mlx5e_activate_txqsq(&c->sq[tc]);
 	mlx5e_activate_rq(&c->rq);
 	netif_set_xps_queue(c->netdev, c->xps_cpumask, c->ix);
+
+	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
+		mlx5e_activate_xsk(c);
 }
 
 static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
 {
 	int tc;
+
+	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
+		mlx5e_deactivate_xsk(c);
 
 	mlx5e_deactivate_rq(&c->rq);
 	for (tc = 0; tc < c->num_tc; tc++)
···
 
 static void
 mlx5e_close_channel(struct mlx5e_channel *c)
 {
-	mlx5e_close_xdpsq(&c->xdpsq, NULL);
-	mlx5e_close_rq(&c->rq);
-	if (c->xdp)
-		mlx5e_close_xdpsq(&c->rq.xdpsq, &c->rq);
-	mlx5e_close_sqs(c);
-	mlx5e_close_icosq(&c->icosq);
-	napi_disable(&c->napi);
-	if (c->xdp)
-		mlx5e_close_cq(&c->rq.xdpsq.cq);
-	mlx5e_close_cq(&c->rq.cq);
-	mlx5e_close_cq(&c->xdpsq.cq);
-	mlx5e_close_tx_cqs(c);
-	mlx5e_close_cq(&c->icosq.cq);
+	if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
+		mlx5e_close_xsk(c);
+	mlx5e_close_queues(c);
 	netif_napi_del(&c->napi);
 	mlx5e_free_xps_cpumask(c);
 
···
 
 static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 				      struct mlx5e_params *params,
+				      struct mlx5e_xsk_param *xsk,
 				      struct mlx5e_rq_frags_info *info)
 {
 	u32 byte_count = MLX5E_SW2HW_MTU(params, params->sw_mtu);
···
 	byte_count += MLX5E_METADATA_ETHER_LEN;
 #endif
 
-	if (mlx5e_rx_is_linear_skb(params)) {
+	if (mlx5e_rx_is_linear_skb(params, xsk)) {
 		int frag_stride;
 
-		frag_stride = mlx5e_rx_get_linear_frag_sz(params);
+		frag_stride = mlx5e_rx_get_linear_frag_sz(params, xsk);
 		frag_stride = roundup_pow_of_two(frag_stride);
 
 		info->arr[0].frag_size = byte_count;
···
 	return MLX5_GET(wq, wq, log_wq_sz);
 }
 
-static void mlx5e_build_rq_param(struct mlx5e_priv *priv,
-				 struct mlx5e_params *params,
-				 struct mlx5e_rq_param *param)
+void mlx5e_build_rq_param(struct mlx5e_priv *priv,
+			  struct mlx5e_params *params,
+			  struct mlx5e_xsk_param *xsk,
+			  struct mlx5e_rq_param *param)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
 	void *rqc = param->rqc;
···
 	switch (params->rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		MLX5_SET(wq, wq, log_wqe_num_of_strides,
-			 mlx5e_mpwqe_get_log_num_strides(mdev, params) -
+			 mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk) -
 			 MLX5_MPWQE_LOG_NUM_STRIDES_BASE);
 		MLX5_SET(wq, wq, log_wqe_stride_size,
-			 mlx5e_mpwqe_get_log_stride_size(mdev, params) -
+			 mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk) -
 			 MLX5_MPWQE_LOG_STRIDE_SZ_BASE);
-		MLX5_SET(wq, wq, log_wq_sz, mlx5e_mpwqe_get_log_rq_size(params));
+		MLX5_SET(wq, wq, log_wq_sz, mlx5e_mpwqe_get_log_rq_size(params, xsk));
 		break;
 	default: /* MLX5_WQ_TYPE_CYCLIC */
 		MLX5_SET(wq, wq, log_wq_sz, params->log_rq_mtu_frames);
-		mlx5e_build_rq_frags_info(mdev, params, &param->frags_info);
+		mlx5e_build_rq_frags_info(mdev, params, xsk, &param->frags_info);
 		ndsegs = param->frags_info.num_frags;
 	}
 
···
 	param->wq.buf_numa_node = dev_to_node(mdev->device);
 }
 
-static void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
-					struct mlx5e_sq_param *param)
+void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
+				 struct mlx5e_sq_param *param)
 {
 	void *sqc = param->sqc;
 	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
···
 	MLX5_SET(cqc, cqc, cqe_sz, CQE_STRIDE_128_PAD);
 }
 
-static void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
-				    struct mlx5e_params *params,
-				    struct mlx5e_cq_param *param)
+void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_xsk_param *xsk,
+			     struct mlx5e_cq_param *param)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
 	void *cqc = param->cqc;
···
 
 	switch (params->rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-		log_cq_size = mlx5e_mpwqe_get_log_rq_size(params) +
-			mlx5e_mpwqe_get_log_num_strides(mdev, params);
+		log_cq_size = mlx5e_mpwqe_get_log_rq_size(params, xsk) +
+			mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk);
 		break;
 	default: /* MLX5_WQ_TYPE_CYCLIC */
 		log_cq_size = params->log_rq_mtu_frames;
···
 	param->cq_period_mode = params->rx_cq_moderation.cq_period_mode;
 }
 
-static void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
-				    struct mlx5e_params *params,
-				    struct mlx5e_cq_param *param)
+void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_cq_param *param)
 {
 	void *cqc = param->cqc;
 
···
 	param->cq_period_mode = params->tx_cq_moderation.cq_period_mode;
 }
 
-static void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
-				     u8 log_wq_size,
-				     struct mlx5e_cq_param *param)
+void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
+			      u8 log_wq_size,
+			      struct mlx5e_cq_param *param)
 {
 	void *cqc = param->cqc;
 
···
 	param->cq_period_mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
 }
 
-static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
-				    u8 log_wq_size,
-				    struct mlx5e_sq_param *param)
+void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
+			     u8 log_wq_size,
+			     struct mlx5e_sq_param *param)
 {
 	void *sqc = param->sqc;
 	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
···
 	MLX5_SET(sqc, sqc, reg_umr, MLX5_CAP_ETH(priv->mdev, reg_umr_sq));
 }
 
-static void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
-				    struct mlx5e_params *params,
-				    struct mlx5e_sq_param *param)
+void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
+			     struct mlx5e_params *params,
+			     struct mlx5e_sq_param *param)
 {
 	void *sqc = param->sqc;
 	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
···
 {
 	u8 icosq_log_wq_sz;
 
-	mlx5e_build_rq_param(priv, params, &cparam->rq);
+	mlx5e_build_rq_param(priv, params, NULL, &cparam->rq);
 
 	icosq_log_wq_sz = mlx5e_build_icosq_log_wq_sz(params, &cparam->rq);
 
 	mlx5e_build_sq_param(priv, params, &cparam->sq);
 	mlx5e_build_xdpsq_param(priv, params, &cparam->xdp_sq);
 	mlx5e_build_icosq_param(priv, icosq_log_wq_sz, &cparam->icosq);
-	mlx5e_build_rx_cq_param(priv, params, &cparam->rx_cq);
+	mlx5e_build_rx_cq_param(priv, params, NULL, &cparam->rx_cq);
 	mlx5e_build_tx_cq_param(priv, params, &cparam->tx_cq);
 	mlx5e_build_ico_cq_param(priv, icosq_log_wq_sz, &cparam->icosq_cq);
 }
···
 
 	mlx5e_build_channel_param(priv, &chs->params, cparam);
 	for (i = 0; i < chs->num; i++) {
-		err = mlx5e_open_channel(priv, i, &chs->params, cparam, &chs->c[i]);
+		struct xdp_umem *umem = NULL;
+
+		if (chs->params.xdp_prog)
+			umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, i);
+
+		err = mlx5e_open_channel(priv, i, &chs->params, cparam, umem, &chs->c[i]);
 		if (err)
 			goto err_close_channels;
 	}
···
 		int timeout = err ? 0 : MLX5E_RQ_WQES_TIMEOUT;
 
 		err |= mlx5e_wait_for_min_rx_wqes(&chs->c[i]->rq, timeout);
+
+		/* Don't wait on the XSK RQ, because the newer xdpsock sample
+		 * doesn't provide any Fill Ring entries at the setup stage.
+		 */
 	}
 
 	return err ? -ETIMEDOUT : 0;
···
 	return err;
 }
 
-int mlx5e_create_direct_rqts(struct mlx5e_priv *priv)
+int mlx5e_create_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
 {
-	struct mlx5e_rqt *rqt;
+	const int max_nch = mlx5e_get_netdev_max_channels(priv->netdev);
 	int err;
 	int ix;
 
-	for (ix = 0; ix < mlx5e_get_netdev_max_channels(priv->netdev); ix++) {
-		rqt = &priv->direct_tir[ix].rqt;
-		err = mlx5e_create_rqt(priv, 1 /*size */, rqt);
-		if (err)
+	for (ix = 0; ix < max_nch; ix++) {
+		err = mlx5e_create_rqt(priv, 1 /*size */, &tirs[ix].rqt);
+		if (unlikely(err))
 			goto err_destroy_rqts;
 	}
 
 	return 0;
 
 err_destroy_rqts:
-	mlx5_core_warn(priv->mdev, "create direct rqts failed, %d\n", err);
+	mlx5_core_warn(priv->mdev, "create rqts failed, %d\n", err);
 	for (ix--; ix >= 0; ix--)
-		mlx5e_destroy_rqt(priv, &priv->direct_tir[ix].rqt);
+		mlx5e_destroy_rqt(priv, &tirs[ix].rqt);
 
 	return err;
 }
 
-void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv)
+void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
 {
+	const int max_nch = mlx5e_get_netdev_max_channels(priv->netdev);
 	int i;
 
-	for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev); i++)
-		mlx5e_destroy_rqt(priv, &priv->direct_tir[i].rqt);
+	for (i = 0; i < max_nch; i++)
+		mlx5e_destroy_rqt(priv, &tirs[i].rqt);
 }
 
 static int mlx5e_rx_hash_fn(int hfunc)
···
 void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
 {
 	int num_txqs = priv->channels.num * priv->channels.params.num_tc;
+	int num_rxqs = priv->channels.num * MLX5E_NUM_RQ_GROUPS;
 	struct net_device *netdev = priv->netdev;
 
 	mlx5e_netdev_set_tcs(netdev);
 	netif_set_real_num_tx_queues(netdev, num_txqs);
-	netif_set_real_num_rx_queues(netdev, priv->channels.num);
+	netif_set_real_num_rx_queues(netdev, num_rxqs);
 
 	mlx5e_build_tx2sq_maps(priv);
 	mlx5e_activate_channels(&priv->channels);
···
 
 	mlx5e_wait_channels_min_rx_wqes(&priv->channels);
 	mlx5e_redirect_rqts_to_channels(priv, &priv->channels);
+
+	mlx5e_xsk_redirect_rqts_to_channels(priv, &priv->channels);
 }
 
 void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
 {
+	mlx5e_xsk_redirect_rqts_to_drop(priv, &priv->channels);
+
 	mlx5e_redirect_rqts_to_drop(priv);
 
 	if (mlx5e_is_vport_rep(priv))
···
 int mlx5e_open_locked(struct net_device *netdev)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
+	bool is_xdp = priv->channels.params.xdp_prog;
 	int err;
 
 	set_bit(MLX5E_STATE_OPENED, &priv->state);
+	if (is_xdp)
+		mlx5e_xdp_set_open(priv);
 
 	err = mlx5e_open_channels(priv, &priv->channels);
 	if (err)
···
 
 	return 0;
 
 err_clear_state_opened_flag:
+	if (is_xdp)
+		mlx5e_xdp_set_closed(priv);
 	clear_bit(MLX5E_STATE_OPENED, &priv->state);
 	return err;
 }
···
 	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
 		return 0;
 
+	if (priv->channels.params.xdp_prog)
+		mlx5e_xdp_set_closed(priv);
 	clear_bit(MLX5E_STATE_OPENED, &priv->state);
 
 	netif_carrier_off(priv->netdev);
···
 	return err;
 }
 
-int mlx5e_create_direct_tirs(struct mlx5e_priv *priv)
+int mlx5e_create_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
 {
-	int nch = mlx5e_get_netdev_max_channels(priv->netdev);
+	const int max_nch = mlx5e_get_netdev_max_channels(priv->netdev);
 	struct mlx5e_tir *tir;
 	void *tirc;
 	int inlen;
-	int err;
+	int err = 0;
 	u32 *in;
 	int ix;
 
···
 	if (!in)
 		return -ENOMEM;
 
-	for (ix = 0; ix < nch; ix++) {
+	for (ix = 0; ix < max_nch; ix++) {
 		memset(in, 0, inlen);
-		tir = &priv->direct_tir[ix];
+		tir = &tirs[ix];
 		tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
-		mlx5e_build_direct_tir_ctx(priv, priv->direct_tir[ix].rqt.rqtn, tirc);
+		mlx5e_build_direct_tir_ctx(priv, tir->rqt.rqtn, tirc);
 		err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
-		if (err)
+		if (unlikely(err))
 			goto err_destroy_ch_tirs;
 	}
 
-	kvfree(in);
-
-	return 0;
+	goto out;
 
 err_destroy_ch_tirs:
-	mlx5_core_warn(priv->mdev, "create direct tirs failed, %d\n", err);
+	mlx5_core_warn(priv->mdev, "create tirs failed, %d\n", err);
 	for (ix--; ix >= 0; ix--)
-		mlx5e_destroy_tir(priv->mdev, &priv->direct_tir[ix]);
+		mlx5e_destroy_tir(priv->mdev, &tirs[ix]);
 
+out:
 	kvfree(in);
 
 	return err;
···
 		mlx5e_destroy_tir(priv->mdev, &priv->inner_indir_tir[i]);
 }
 
-void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv)
+void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
 {
-	int nch = mlx5e_get_netdev_max_channels(priv->netdev);
+	const int max_nch = mlx5e_get_netdev_max_channels(priv->netdev);
 	int i;
 
-	for (i = 0; i < nch; i++)
-		mlx5e_destroy_tir(priv->mdev, &priv->direct_tir[i]);
+	for (i = 0; i < max_nch; i++)
+		mlx5e_destroy_tir(priv->mdev, &tirs[i]);
 }
 
 static int mlx5e_modify_channels_scatter_fcs(struct mlx5e_channels *chs, bool enable)
···
 
 	for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev); i++) {
 		struct mlx5e_channel_stats *channel_stats = &priv->channel_stats[i];
+		struct mlx5e_rq_stats *xskrq_stats = &channel_stats->xskrq;
 		struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
 		int j;
 
-		s->rx_packets += rq_stats->packets;
-		s->rx_bytes   += rq_stats->bytes;
+		s->rx_packets += rq_stats->packets + xskrq_stats->packets;
+		s->rx_bytes   += rq_stats->bytes + xskrq_stats->bytes;
 
 		for (j = 0; j < priv->max_opened_tc; j++) {
 			struct mlx5e_sq_stats *sq_stats = &channel_stats->sq[j];
···
 
 	mutex_lock(&priv->state_lock);
 
+	if (enable && priv->xsk.refcnt) {
+		netdev_warn(netdev, "LRO is incompatible with AF_XDP (%hu XSKs are active)\n",
+			    priv->xsk.refcnt);
+		err = -EINVAL;
+		goto out;
+	}
+
 	old_params = &priv->channels.params;
 	if (enable && !MLX5E_GET_PFLAG(old_params, MLX5E_PFLAG_RX_STRIDING_RQ)) {
 		netdev_warn(netdev, "can't set LRO with legacy RQ\n");
···
 	new_channels.params.lro_en = enable;
 
 	if (old_params->rq_wq_type != MLX5_WQ_TYPE_CYCLIC) {
-		if (mlx5e_rx_mpwqe_is_linear_skb(mdev, old_params) ==
-		    mlx5e_rx_mpwqe_is_linear_skb(mdev, &new_channels.params))
+		if (mlx5e_rx_mpwqe_is_linear_skb(mdev, old_params, NULL) ==
+		    mlx5e_rx_mpwqe_is_linear_skb(mdev, &new_channels.params, NULL))
 			reset = false;
 	}
 
···
 	return features;
 }
 
+static bool mlx5e_xsk_validate_mtu(struct net_device *netdev,
+				   struct mlx5e_channels *chs,
+				   struct mlx5e_params *new_params,
+				   struct mlx5_core_dev *mdev)
+{
+	u16 ix;
+
+	for (ix = 0; ix < chs->params.num_channels; ix++) {
+		struct xdp_umem *umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, ix);
+		struct mlx5e_xsk_param xsk;
+
+		if (!umem)
+			continue;
+
+		mlx5e_build_xsk_param(umem, &xsk);
+
+		if (!mlx5e_validate_xsk_param(new_params, &xsk, mdev)) {
+			u32 hr = mlx5e_get_linear_rq_headroom(new_params, &xsk);
+			int max_mtu_frame, max_mtu_page, max_mtu;
+
+			/* Two criteria must be met:
+			 * 1. HW MTU + all headrooms <= XSK frame size.
+			 * 2. Size of SKBs allocated on XDP_PASS <= PAGE_SIZE.
+			 */
+			max_mtu_frame = MLX5E_HW2SW_MTU(new_params, xsk.chunk_size - hr);
+			max_mtu_page = mlx5e_xdp_max_mtu(new_params, &xsk);
+			max_mtu = min(max_mtu_frame, max_mtu_page);
+
+			netdev_err(netdev, "MTU %d is too big for an XSK running on channel %hu. Try MTU <= %d\n",
+				   new_params->sw_mtu, ix, max_mtu);
+			return false;
+		}
+	}
+
+	return true;
+}
+
 int mlx5e_change_mtu(struct net_device *netdev, int new_mtu,
 		     change_hw_mtu_cb set_mtu_cb)
 {
···
 	new_channels.params.sw_mtu = new_mtu;
 
 	if (params->xdp_prog &&
-	    !mlx5e_rx_is_linear_skb(&new_channels.params)) {
+	    !mlx5e_rx_is_linear_skb(&new_channels.params, NULL)) {
 		netdev_err(netdev, "MTU(%d) > %d is not allowed while XDP enabled\n",
-			   new_mtu, mlx5e_xdp_max_mtu(params));
+			   new_mtu, mlx5e_xdp_max_mtu(params, NULL));
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (priv->xsk.refcnt &&
+	    !mlx5e_xsk_validate_mtu(netdev, &priv->channels,
+				    &new_channels.params, priv->mdev)) {
 		err = -EINVAL;
 		goto out;
 	}
 
 	if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
-		bool is_linear = mlx5e_rx_mpwqe_is_linear_skb(priv->mdev, &new_channels.params);
-		u8 ppw_old = mlx5e_mpwqe_log_pkts_per_wqe(params);
-		u8 ppw_new = mlx5e_mpwqe_log_pkts_per_wqe(&new_channels.params);
+		bool is_linear = mlx5e_rx_mpwqe_is_linear_skb(priv->mdev,
+							      &new_channels.params,
+							      NULL);
+		u8 ppw_old = mlx5e_mpwqe_log_pkts_per_wqe(params, NULL);
+		u8 ppw_new = mlx5e_mpwqe_log_pkts_per_wqe(&new_channels.params, NULL);
 
+		/* If XSK is active, XSK RQs are linear. */
+		is_linear |= priv->xsk.refcnt;
+
+		/* Always reset in linear mode - hw_mtu is used in data path. */
 		reset = reset && (is_linear || (ppw_old != ppw_new));
 	}
 
···
 	new_channels.params = priv->channels.params;
 	new_channels.params.xdp_prog = prog;
 
-	if (!mlx5e_rx_is_linear_skb(&new_channels.params)) {
+	/* No XSK params: AF_XDP can't be enabled yet at the point of setting
+	 * the XDP program.
+	 */
+	if (!mlx5e_rx_is_linear_skb(&new_channels.params, NULL)) {
 		netdev_warn(netdev, "XDP is not allowed with MTU(%d) > %d\n",
 			    new_channels.params.sw_mtu,
-			    mlx5e_xdp_max_mtu(&new_channels.params));
+			    mlx5e_xdp_max_mtu(&new_channels.params, NULL));
 		return -EINVAL;
 	}
+
+	return 0;
+}
+
+static int mlx5e_xdp_update_state(struct mlx5e_priv *priv)
+{
+	if (priv->channels.params.xdp_prog)
+		mlx5e_xdp_set_open(priv);
+	else
+		mlx5e_xdp_set_closed(priv);
 
 	return 0;
 }
···
 	/* no need for full reset when exchanging programs */
 	reset = (!priv->channels.params.xdp_prog || !prog);
 
-	if (was_opened && reset)
-		mlx5e_close_locked(netdev);
 	if (was_opened && !reset) {
 		/* num_channels is invariant here, so we can take the
 		 * batched reference right upfront.
···
 		}
 	}
 
-	/* exchange programs, extra prog reference we got from caller
-	 * as long as we don't fail from this point onwards.
-	 */
-	old_prog = xchg(&priv->channels.params.xdp_prog, prog);
+	if (was_opened && reset) {
+		struct mlx5e_channels new_channels = {};
+
+		new_channels.params = priv->channels.params;
+		new_channels.params.xdp_prog = prog;
+		mlx5e_set_rq_type(priv->mdev, &new_channels.params);
+		old_prog = priv->channels.params.xdp_prog;
+
+		err = mlx5e_safe_switch_channels(priv, &new_channels, mlx5e_xdp_update_state);
+		if (err)
+			goto unlock;
+	} else {
+		/* exchange programs, extra prog reference we got from caller
+		 * as long as we don't fail from this point onwards.
+		 */
+		old_prog = xchg(&priv->channels.params.xdp_prog, prog);
+	}
+
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
-	if (reset) /* change RQ type according to priv->xdp_prog */
+	if (!was_opened && reset) /* change RQ type according to priv->xdp_prog */
 		mlx5e_set_rq_type(priv->mdev, &priv->channels.params);
 
-	if (was_opened && reset)
-		err = mlx5e_open_locked(netdev);
-
-	if (!test_bit(MLX5E_STATE_OPENED, &priv->state) || reset)
+	if (!was_opened || reset)
 		goto unlock;
 
 	/* exchanging programs w/o reset, we update ref counts on behalf
···
 	 */
 	for (i = 0; i < priv->channels.num; i++) {
 		struct mlx5e_channel *c = priv->channels.c[i];
+		bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
 
 		clear_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
+		if (xsk_open)
+			clear_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
 		napi_synchronize(&c->napi);
 		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
 
 		old_prog = xchg(&c->rq.xdp_prog, prog);
-
-		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
-		/* napi_schedule in case we have missed anything */
-		napi_schedule(&c->napi);
-
 		if (old_prog)
 			bpf_prog_put(old_prog);
+
+		if (xsk_open) {
+			old_prog = xchg(&c->xskrq.xdp_prog, prog);
+			if (old_prog)
+				bpf_prog_put(old_prog);
+		}
+
+		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
+		if (xsk_open)
+			set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
+		/* napi_schedule in case we have missed anything */
+		napi_schedule(&c->napi);
 	}
 
 unlock:
···
 	case XDP_QUERY_PROG:
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		return 0;
+	case XDP_SETUP_XSK_UMEM:
+		return mlx5e_xsk_setup_umem(dev, xdp->xsk.umem,
+					    xdp->xsk.queue_id);
 	default:
 		return -EINVAL;
 	}
···
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
 	.ndo_bpf                 = mlx5e_xdp,
 	.ndo_xdp_xmit            = mlx5e_xdp_xmit,
+	.ndo_xsk_async_xmit      = mlx5e_xsk_async_xmit,
 #ifdef CONFIG_MLX5_EN_ARFS
 	.ndo_rx_flow_steer       = mlx5e_rx_flow_steer,
 #endif
···
 	 * - Striding RQ configuration is not possible/supported.
 	 * - Slow PCI heuristic.
 	 * - Legacy RQ would use linear SKB while Striding RQ would use non-linear.
+	 *
+	 * No XSK params: checking the availability of striding RQ in general.
 	 */
 	if (!slow_pci_heuristic(mdev) &&
 	    mlx5e_striding_rq_possible(mdev, params) &&
-	    (mlx5e_rx_mpwqe_is_linear_skb(mdev, params) ||
-	     !mlx5e_rx_is_linear_skb(params)))
+	    (mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL) ||
+	     !mlx5e_rx_is_linear_skb(params, NULL)))
 		MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ, true);
 	mlx5e_set_rq_type(mdev, params);
 	mlx5e_init_rq_type_params(mdev, params);
···
 }
 
 void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
+			    struct mlx5e_xsk *xsk,
 			    struct mlx5e_rss_params *rss_params,
 			    struct mlx5e_params *params,
 			    u16 max_channels, u16 mtu)
···
 	/* HW LRO */
 
 	/* TODO: && MLX5_CAP_ETH(mdev, lro_cap) */
-	if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
-		if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params))
+	if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
+		/* No XSK params: checking the availability of striding RQ in general. */
+		if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL))
 			params->lro_en = !slow_pci_heuristic(mdev);
+	}
 	params->lro_timeout = mlx5e_choose_lro_timeout(mdev, MLX5E_DEFAULT_LRO_TIMEOUT);
 
···
 	mlx5e_build_rss_params(rss_params, params->num_channels);
 	params->tunneled_offload_en =
 		mlx5e_tunnel_inner_ft_supported(mdev);
+
+	/* AF_XDP */
+	params->xsk = xsk;
 }
 
 static void mlx5e_set_netdev_dev_addr(struct net_device *netdev)
···
 	if (err)
 		return err;
 
-	mlx5e_build_nic_params(mdev, rss, &priv->channels.params,
+	mlx5e_build_nic_params(mdev, &priv->xsk, rss, &priv->channels.params,
 			       mlx5e_get_netdev_max_channels(netdev),
 			       netdev->mtu);
···
 	if (err)
 		goto err_close_drop_rq;
 
-	err = mlx5e_create_direct_rqts(priv);
+	err = mlx5e_create_direct_rqts(priv, priv->direct_tir);
 	if (err)
 		goto err_destroy_indirect_rqts;
···
 	if (err)
 		goto err_destroy_direct_rqts;
 
-	err = mlx5e_create_direct_tirs(priv);
+	err = mlx5e_create_direct_tirs(priv, priv->direct_tir);
 	if (err)
 		goto err_destroy_indirect_tirs;
+
+	err = mlx5e_create_direct_rqts(priv, priv->xsk_tir);
+	if (unlikely(err))
+		goto err_destroy_direct_tirs;
+
+	err = mlx5e_create_direct_tirs(priv, priv->xsk_tir);
+	if (unlikely(err))
+		goto err_destroy_xsk_rqts;
 
 	err = mlx5e_create_flow_steering(priv);
 	if (err) {
 		mlx5_core_warn(mdev, "create flow steering failed, %d\n", err);
-		goto err_destroy_direct_tirs;
+		goto err_destroy_xsk_tirs;
 	}
 
 	err = mlx5e_tc_nic_init(priv);
···
 
 err_destroy_flow_steering:
 	mlx5e_destroy_flow_steering(priv);
+err_destroy_xsk_tirs:
+	mlx5e_destroy_direct_tirs(priv, priv->xsk_tir);
+err_destroy_xsk_rqts:
+	mlx5e_destroy_direct_rqts(priv, priv->xsk_tir);
 err_destroy_direct_tirs:
-	mlx5e_destroy_direct_tirs(priv);
+	mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
 err_destroy_indirect_tirs:
 	mlx5e_destroy_indirect_tirs(priv, true);
 err_destroy_direct_rqts:
-	mlx5e_destroy_direct_rqts(priv);
+	mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
 err_destroy_indirect_rqts:
 	mlx5e_destroy_rqt(priv, &priv->indir_rqt);
 err_close_drop_rq:
···
 {
 	mlx5e_tc_nic_cleanup(priv);
 	mlx5e_destroy_flow_steering(priv);
-	mlx5e_destroy_direct_tirs(priv);
+	mlx5e_destroy_direct_tirs(priv, priv->xsk_tir);
+	mlx5e_destroy_direct_rqts(priv, priv->xsk_tir);
+	mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
 	mlx5e_destroy_indirect_tirs(priv, true);
-	mlx5e_destroy_direct_rqts(priv);
+	mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
 	mlx5e_destroy_rqt(priv, &priv->indir_rqt);
 	mlx5e_close_drop_rq(&priv->drop_rq);
 	mlx5e_destroy_q_counters(priv);
···
 
 	netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv),
 				    nch * profile->max_tc,
-				    nch);
+				    nch * MLX5E_NUM_RQ_GROUPS);
 	if (!netdev) {
 		mlx5_core_err(mdev, "alloc_etherdev_mqs() failed\n");
 		return NULL;
+6 -6
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
···
 	if (err)
 		goto err_close_drop_rq;
 
-	err = mlx5e_create_direct_rqts(priv);
+	err = mlx5e_create_direct_rqts(priv, priv->direct_tir);
 	if (err)
 		goto err_destroy_indirect_rqts;
···
 	if (err)
 		goto err_destroy_direct_rqts;
 
-	err = mlx5e_create_direct_tirs(priv);
+	err = mlx5e_create_direct_tirs(priv, priv->direct_tir);
 	if (err)
 		goto err_destroy_indirect_tirs;
···
 err_destroy_ttc_table:
 	mlx5e_destroy_ttc_table(priv, &priv->fs.ttc);
 err_destroy_direct_tirs:
-	mlx5e_destroy_direct_tirs(priv);
+	mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
 err_destroy_indirect_tirs:
 	mlx5e_destroy_indirect_tirs(priv, false);
 err_destroy_direct_rqts:
-	mlx5e_destroy_direct_rqts(priv);
+	mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
 err_destroy_indirect_rqts:
 	mlx5e_destroy_rqt(priv, &priv->indir_rqt);
 err_close_drop_rq:
···
 
 	mlx5_del_flow_rules(rpriv->vport_rx_rule);
 	mlx5e_destroy_ttc_table(priv, &priv->fs.ttc);
-	mlx5e_destroy_direct_tirs(priv);
+	mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
 	mlx5e_destroy_indirect_tirs(priv, false);
-	mlx5e_destroy_direct_rqts(priv);
+	mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
 	mlx5e_destroy_rqt(priv, &priv->indir_rqt);
 	mlx5e_close_drop_rq(&priv->drop_rq);
 }
+76 -28
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
···
 #include "en_accel/tls_rxtx.h"
 #include "lib/clock.h"
 #include "en/xdp.h"
+#include "en/xsk/rx.h"
 
 static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
 {
···
 	return true;
 }
 
-static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
-					  struct mlx5e_dma_info *dma_info)
+static inline int mlx5e_page_alloc_pool(struct mlx5e_rq *rq,
+					struct mlx5e_dma_info *dma_info)
 {
 	if (mlx5e_rx_cache_get(rq, dma_info))
 		return 0;
···
 	return 0;
 }
 
+static inline int mlx5e_page_alloc(struct mlx5e_rq *rq,
+				   struct mlx5e_dma_info *dma_info)
+{
+	if (rq->umem)
+		return mlx5e_xsk_page_alloc_umem(rq, dma_info);
+	else
+		return mlx5e_page_alloc_pool(rq, dma_info);
+}
+
 void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
 {
 	dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, rq->buff.map_dir);
 }
 
-void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
-			bool recycle)
+void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
+				struct mlx5e_dma_info *dma_info,
+				bool recycle)
 {
 	if (likely(recycle)) {
 		if (mlx5e_rx_cache_put(rq, dma_info))
···
 	}
 }
 
+static inline void mlx5e_page_release(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info,
+				      bool recycle)
+{
+	if (rq->umem)
+		/* The `recycle` parameter is ignored, and the page is always
+		 * put into the Reuse Ring, because there is no way to return
+		 * the page to the userspace when the interface goes down.
+		 */
+		mlx5e_xsk_page_release(rq, dma_info);
+	else
+		mlx5e_page_release_dynamic(rq, dma_info, recycle);
+}
+
 static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
 				    struct mlx5e_wqe_frag_info *frag)
 {
···
 	 * offset) should just use the new one without replenishing again
 	 * by themselves.
 	 */
-	err = mlx5e_page_alloc_mapped(rq, frag->di);
+	err = mlx5e_page_alloc(rq, frag->di);
 
 	return err;
 }
···
 	int err;
 	int i;
 
+	if (rq->umem) {
+		int pages_desired = wqe_bulk << rq->wqe.info.log_num_frags;
+
+		if (unlikely(!mlx5e_xsk_pages_enough_umem(rq, pages_desired)))
+			return -ENOMEM;
+	}
+
 	for (i = 0; i < wqe_bulk; i++) {
 		struct mlx5e_rx_wqe_cyc *wqe = mlx5_wq_cyc_get_wqe(wq, ix + i);
 
···
 static void
 mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, bool recycle)
 {
-	const bool no_xdp_xmit =
-		bitmap_empty(wi->xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE);
+	bool no_xdp_xmit;
 	struct mlx5e_dma_info *dma_info = wi->umr.dma_info;
 	int i;
+
+	/* A common case for AF_XDP. */
+	if (bitmap_full(wi->xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE))
+		return;
+
+	no_xdp_xmit = bitmap_empty(wi->xdp_xmit_bitmap,
+				   MLX5_MPWRQ_PAGES_PER_WQE);
 
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++)
 		if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
···
 	dma_wmb();
 
 	mlx5_wq_ll_update_db_record(wq);
-}
-
-static inline u16 mlx5e_icosq_wrap_cnt(struct mlx5e_icosq *sq)
-{
-	return mlx5_wq_cyc_get_ctr_wrap_cnt(&sq->wq, sq->pc);
 }
 
 static inline void mlx5e_fill_icosq_frag_edge(struct mlx5e_icosq *sq,
···
 	int err;
 	int i;
 
+	if (rq->umem &&
+	    unlikely(!mlx5e_xsk_pages_enough_umem(rq, MLX5_MPWRQ_PAGES_PER_WQE))) {
+		err = -ENOMEM;
+		goto err;
+	}
+
 	pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
 	contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
 	if (unlikely(contig_wqebbs_room < MLX5E_UMR_WQEBBS)) {
···
 	}
 
 	umr_wqe = mlx5_wq_cyc_get_wqe(wq, pi);
-	if (unlikely(mlx5e_icosq_wrap_cnt(sq) < 2))
-		memcpy(umr_wqe, &rq->mpwqe.umr_wqe,
-		       offsetof(struct mlx5e_umr_wqe, inline_mtts));
+	memcpy(umr_wqe, &rq->mpwqe.umr_wqe, offsetof(struct mlx5e_umr_wqe, inline_mtts));
 
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++, dma_info++) {
-		err = mlx5e_page_alloc_mapped(rq, dma_info);
+		err = mlx5e_page_alloc(rq, dma_info);
 		if (unlikely(err))
 			goto err_unmap;
 		umr_wqe->inline_mtts[i].ptag = cpu_to_be64(dma_info->addr | MLX5_EN_WR);
···
 	umr_wqe->uctrl.xlt_offset = cpu_to_be16(xlt_offset);
 
 	sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR;
+	sq->db.ico_wqe[pi].umr.rq = rq;
 	sq->pc += MLX5E_UMR_WQEBBS;
 
 	sq->doorbell_cseg = &umr_wqe->ctrl;
···
 		dma_info--;
 		mlx5e_page_release(rq, dma_info, true);
 	}
+
+err:
 	rq->stats->buff_alloc_err++;
505 return err; ··· 584 544 return !!err; 585 545 } 586 546 587 - static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq) 547 + void mlx5e_poll_ico_cq(struct mlx5e_cq *cq) 588 548 { 589 549 struct mlx5e_icosq *sq = container_of(cq, struct mlx5e_icosq, cq); 590 550 struct mlx5_cqe64 *cqe; 591 - u8 completed_umr = 0; 592 551 u16 sqcc; 593 552 int i; 594 553 ··· 628 589 629 590 if (likely(wi->opcode == MLX5_OPCODE_UMR)) { 630 591 sqcc += MLX5E_UMR_WQEBBS; 631 - completed_umr++; 592 + wi->umr.rq->mpwqe.umr_completed++; 632 593 } else if (likely(wi->opcode == MLX5_OPCODE_NOP)) { 633 594 sqcc++; 634 595 } else { ··· 644 605 sq->cc = sqcc; 645 606 646 607 mlx5_cqwq_update_db_record(&cq->wq); 647 - 648 - if (likely(completed_umr)) { 649 - mlx5e_post_rx_mpwqe(rq, completed_umr); 650 - rq->mpwqe.umr_in_progress -= completed_umr; 651 - } 652 608 } 653 609 654 610 bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq) 655 611 { 656 612 struct mlx5e_icosq *sq = &rq->channel->icosq; 657 613 struct mlx5_wq_ll *wq = &rq->mpwqe.wq; 614 + u8 umr_completed = rq->mpwqe.umr_completed; 615 + int alloc_err = 0; 658 616 u8 missing, i; 659 617 u16 head; 660 618 661 619 if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state))) 662 620 return false; 663 621 664 - mlx5e_poll_ico_cq(&sq->cq, rq); 622 + if (umr_completed) { 623 + mlx5e_post_rx_mpwqe(rq, umr_completed); 624 + rq->mpwqe.umr_in_progress -= umr_completed; 625 + rq->mpwqe.umr_completed = 0; 626 + } 665 627 666 628 missing = mlx5_wq_ll_missing(wq) - rq->mpwqe.umr_in_progress; 667 629 ··· 676 636 head = rq->mpwqe.actual_wq_head; 677 637 i = missing; 678 638 do { 679 - if (unlikely(mlx5e_alloc_rx_mpwqe(rq, head))) 639 + alloc_err = mlx5e_alloc_rx_mpwqe(rq, head); 640 + 641 + if (unlikely(alloc_err)) 680 642 break; 681 643 head = mlx5_wq_ll_get_wqe_next_ix(wq, head); 682 644 } while (--i); ··· 691 649 692 650 rq->mpwqe.umr_in_progress += rq->mpwqe.umr_last_bulk; 693 651 rq->mpwqe.actual_wq_head = head; 652 + 653 + /* If XSK 
Fill Ring doesn't have enough frames, busy poll by 654 + * rescheduling the NAPI poll. 655 + */ 656 + if (unlikely(alloc_err == -ENOMEM && rq->umem)) 657 + return true; 694 658 695 659 return false; 696 660 } ··· 1066 1018 } 1067 1019 1068 1020 rcu_read_lock(); 1069 - consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt); 1021 + consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, false); 1070 1022 rcu_read_unlock(); 1071 1023 if (consumed) 1072 1024 return NULL; /* page/packet was consumed by XDP */ ··· 1283 1235 prefetch(data); 1284 1236 1285 1237 rcu_read_lock(); 1286 - consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32); 1238 + consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, false); 1287 1239 rcu_read_unlock(); 1288 1240 if (consumed) { 1289 1241 if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
+112 -3
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
··· 104 104 { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_poll) }, 105 105 { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_arm) }, 106 106 { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_aff_change) }, 107 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_force_irq) }, 107 108 { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_eq_rearm) }, 109 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_packets) }, 110 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_bytes) }, 111 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_csum_complete) }, 112 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_csum_unnecessary) }, 113 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_csum_unnecessary_inner) }, 114 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_csum_none) }, 115 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_ecn_mark) }, 116 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_removed_vlan_packets) }, 117 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_xdp_drop) }, 118 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_xdp_redirect) }, 119 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_wqe_err) }, 120 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_mpwqe_filler_cqes) }, 121 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_mpwqe_filler_strides) }, 122 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_oversize_pkts_sw_drop) }, 123 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_buff_alloc_err) }, 124 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_cqe_compress_blks) }, 125 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_cqe_compress_pkts) }, 126 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_congst_umr) }, 127 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xsk_arfs_err) }, 128 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_xmit) }, 129 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_mpwqe) }, 130 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_inlnw) }, 131 + { MLX5E_DECLARE_STAT(struct 
mlx5e_sw_stats, tx_xsk_full) }, 132 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_err) }, 133 + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xsk_cqes) }, 108 134 }; 109 135 110 136 #define NUM_SW_COUNTERS ARRAY_SIZE(sw_stats_desc) ··· 170 144 &priv->channel_stats[i]; 171 145 struct mlx5e_xdpsq_stats *xdpsq_red_stats = &channel_stats->xdpsq; 172 146 struct mlx5e_xdpsq_stats *xdpsq_stats = &channel_stats->rq_xdpsq; 147 + struct mlx5e_xdpsq_stats *xsksq_stats = &channel_stats->xsksq; 148 + struct mlx5e_rq_stats *xskrq_stats = &channel_stats->xskrq; 173 149 struct mlx5e_rq_stats *rq_stats = &channel_stats->rq; 174 150 struct mlx5e_ch_stats *ch_stats = &channel_stats->ch; 175 151 int j; ··· 214 186 s->ch_poll += ch_stats->poll; 215 187 s->ch_arm += ch_stats->arm; 216 188 s->ch_aff_change += ch_stats->aff_change; 189 + s->ch_force_irq += ch_stats->force_irq; 217 190 s->ch_eq_rearm += ch_stats->eq_rearm; 218 191 /* xdp redirect */ 219 192 s->tx_xdp_xmit += xdpsq_red_stats->xmit; ··· 223 194 s->tx_xdp_full += xdpsq_red_stats->full; 224 195 s->tx_xdp_err += xdpsq_red_stats->err; 225 196 s->tx_xdp_cqes += xdpsq_red_stats->cqes; 197 + /* AF_XDP zero-copy */ 198 + s->rx_xsk_packets += xskrq_stats->packets; 199 + s->rx_xsk_bytes += xskrq_stats->bytes; 200 + s->rx_xsk_csum_complete += xskrq_stats->csum_complete; 201 + s->rx_xsk_csum_unnecessary += xskrq_stats->csum_unnecessary; 202 + s->rx_xsk_csum_unnecessary_inner += xskrq_stats->csum_unnecessary_inner; 203 + s->rx_xsk_csum_none += xskrq_stats->csum_none; 204 + s->rx_xsk_ecn_mark += xskrq_stats->ecn_mark; 205 + s->rx_xsk_removed_vlan_packets += xskrq_stats->removed_vlan_packets; 206 + s->rx_xsk_xdp_drop += xskrq_stats->xdp_drop; 207 + s->rx_xsk_xdp_redirect += xskrq_stats->xdp_redirect; 208 + s->rx_xsk_wqe_err += xskrq_stats->wqe_err; 209 + s->rx_xsk_mpwqe_filler_cqes += xskrq_stats->mpwqe_filler_cqes; 210 + s->rx_xsk_mpwqe_filler_strides += xskrq_stats->mpwqe_filler_strides; 211 + s->rx_xsk_oversize_pkts_sw_drop += 
xskrq_stats->oversize_pkts_sw_drop; 212 + s->rx_xsk_buff_alloc_err += xskrq_stats->buff_alloc_err; 213 + s->rx_xsk_cqe_compress_blks += xskrq_stats->cqe_compress_blks; 214 + s->rx_xsk_cqe_compress_pkts += xskrq_stats->cqe_compress_pkts; 215 + s->rx_xsk_congst_umr += xskrq_stats->congst_umr; 216 + s->rx_xsk_arfs_err += xskrq_stats->arfs_err; 217 + s->tx_xsk_xmit += xsksq_stats->xmit; 218 + s->tx_xsk_mpwqe += xsksq_stats->mpwqe; 219 + s->tx_xsk_inlnw += xsksq_stats->inlnw; 220 + s->tx_xsk_full += xsksq_stats->full; 221 + s->tx_xsk_err += xsksq_stats->err; 222 + s->tx_xsk_cqes += xsksq_stats->cqes; 226 223 227 224 for (j = 0; j < priv->max_opened_tc; j++) { 228 225 struct mlx5e_sq_stats *sq_stats = &channel_stats->sq[j]; ··· 1321 1266 { MLX5E_DECLARE_XDPSQ_STAT(struct mlx5e_xdpsq_stats, cqes) }, 1322 1267 }; 1323 1268 1269 + static const struct counter_desc xskrq_stats_desc[] = { 1270 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, packets) }, 1271 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, bytes) }, 1272 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, csum_complete) }, 1273 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, csum_unnecessary) }, 1274 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, csum_unnecessary_inner) }, 1275 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, csum_none) }, 1276 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, ecn_mark) }, 1277 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, removed_vlan_packets) }, 1278 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, xdp_drop) }, 1279 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, xdp_redirect) }, 1280 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, wqe_err) }, 1281 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, mpwqe_filler_cqes) }, 1282 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, mpwqe_filler_strides) }, 1283 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, oversize_pkts_sw_drop) }, 1284 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, 
buff_alloc_err) }, 1285 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, cqe_compress_blks) }, 1286 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) }, 1287 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, congst_umr) }, 1288 + { MLX5E_DECLARE_XSKRQ_STAT(struct mlx5e_rq_stats, arfs_err) }, 1289 + }; 1290 + 1291 + static const struct counter_desc xsksq_stats_desc[] = { 1292 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, xmit) }, 1293 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, mpwqe) }, 1294 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, inlnw) }, 1295 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, full) }, 1296 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, err) }, 1297 + { MLX5E_DECLARE_XSKSQ_STAT(struct mlx5e_xdpsq_stats, cqes) }, 1298 + }; 1299 + 1324 1300 static const struct counter_desc ch_stats_desc[] = { 1325 1301 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, events) }, 1326 1302 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, poll) }, 1327 1303 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, arm) }, 1328 1304 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, aff_change) }, 1305 + { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, force_irq) }, 1329 1306 { MLX5E_DECLARE_CH_STAT(struct mlx5e_ch_stats, eq_rearm) }, 1330 1307 }; 1331 1308 ··· 1365 1278 #define NUM_SQ_STATS ARRAY_SIZE(sq_stats_desc) 1366 1279 #define NUM_XDPSQ_STATS ARRAY_SIZE(xdpsq_stats_desc) 1367 1280 #define NUM_RQ_XDPSQ_STATS ARRAY_SIZE(rq_xdpsq_stats_desc) 1281 + #define NUM_XSKRQ_STATS ARRAY_SIZE(xskrq_stats_desc) 1282 + #define NUM_XSKSQ_STATS ARRAY_SIZE(xsksq_stats_desc) 1368 1283 #define NUM_CH_STATS ARRAY_SIZE(ch_stats_desc) 1369 1284 1370 1285 static int mlx5e_grp_channels_get_num_stats(struct mlx5e_priv *priv) ··· 1377 1288 (NUM_CH_STATS * max_nch) + 1378 1289 (NUM_SQ_STATS * max_nch * priv->max_opened_tc) + 1379 1290 (NUM_RQ_XDPSQ_STATS * max_nch) + 1380 - (NUM_XDPSQ_STATS * max_nch); 1291 + (NUM_XDPSQ_STATS * max_nch) 
+ 1292 + (NUM_XSKRQ_STATS * max_nch * priv->xsk.ever_used) + 1293 + (NUM_XSKSQ_STATS * max_nch * priv->xsk.ever_used); 1381 1294 } 1382 1295 1383 1296 static int mlx5e_grp_channels_fill_strings(struct mlx5e_priv *priv, u8 *data, 1384 1297 int idx) 1385 1298 { 1386 1299 int max_nch = mlx5e_get_netdev_max_channels(priv->netdev); 1300 + bool is_xsk = priv->xsk.ever_used; 1387 1301 int i, j, tc; 1388 1302 1389 1303 for (i = 0; i < max_nch; i++) ··· 1398 1306 for (j = 0; j < NUM_RQ_STATS; j++) 1399 1307 sprintf(data + (idx++) * ETH_GSTRING_LEN, 1400 1308 rq_stats_desc[j].format, i); 1309 + for (j = 0; j < NUM_XSKRQ_STATS * is_xsk; j++) 1310 + sprintf(data + (idx++) * ETH_GSTRING_LEN, 1311 + xskrq_stats_desc[j].format, i); 1401 1312 for (j = 0; j < NUM_RQ_XDPSQ_STATS; j++) 1402 1313 sprintf(data + (idx++) * ETH_GSTRING_LEN, 1403 1314 rq_xdpsq_stats_desc[j].format, i); ··· 1413 1318 sq_stats_desc[j].format, 1414 1319 priv->channel_tc2txq[i][tc]); 1415 1320 1416 - for (i = 0; i < max_nch; i++) 1321 + for (i = 0; i < max_nch; i++) { 1322 + for (j = 0; j < NUM_XSKSQ_STATS * is_xsk; j++) 1323 + sprintf(data + (idx++) * ETH_GSTRING_LEN, 1324 + xsksq_stats_desc[j].format, i); 1417 1325 for (j = 0; j < NUM_XDPSQ_STATS; j++) 1418 1326 sprintf(data + (idx++) * ETH_GSTRING_LEN, 1419 1327 xdpsq_stats_desc[j].format, i); 1328 + } 1420 1329 1421 1330 return idx; 1422 1331 } ··· 1429 1330 int idx) 1430 1331 { 1431 1332 int max_nch = mlx5e_get_netdev_max_channels(priv->netdev); 1333 + bool is_xsk = priv->xsk.ever_used; 1432 1334 int i, j, tc; 1433 1335 1434 1336 for (i = 0; i < max_nch; i++) ··· 1443 1343 data[idx++] = 1444 1344 MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].rq, 1445 1345 rq_stats_desc, j); 1346 + for (j = 0; j < NUM_XSKRQ_STATS * is_xsk; j++) 1347 + data[idx++] = 1348 + MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].xskrq, 1349 + xskrq_stats_desc, j); 1446 1350 for (j = 0; j < NUM_RQ_XDPSQ_STATS; j++) 1447 1351 data[idx++] = 1448 1352 
MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].rq_xdpsq, ··· 1460 1356 MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].sq[tc], 1461 1357 sq_stats_desc, j); 1462 1358 1463 - for (i = 0; i < max_nch; i++) 1359 + for (i = 0; i < max_nch; i++) { 1360 + for (j = 0; j < NUM_XSKSQ_STATS * is_xsk; j++) 1361 + data[idx++] = 1362 + MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].xsksq, 1363 + xsksq_stats_desc, j); 1464 1364 for (j = 0; j < NUM_XDPSQ_STATS; j++) 1465 1365 data[idx++] = 1466 1366 MLX5E_READ_CTR64_CPU(&priv->channel_stats[i].xdpsq, 1467 1367 xdpsq_stats_desc, j); 1368 + } 1468 1369 1469 1370 return idx; 1470 1371 }
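A small pattern worth noting in the en_stats.c hunks: the new XSK counters are made conditional without any `if` by multiplying array sizes by the `priv->xsk.ever_used` boolean, both in the total (`NUM_XSKRQ_STATS * max_nch * priv->xsk.ever_used`) and in the per-channel loop bounds (`j < NUM_XSKRQ_STATS * is_xsk`). A hedged sketch of that arithmetic, with invented counter counts:

```c
/* ever_used is 0 or 1; when 0, the XSK term (and the matching loops
 * in fill_strings/fill_stats) vanish without a conditional. */
static int channels_num_stats(int max_nch, int num_rq_stats,
                              int num_xskrq_stats, int ever_used)
{
    return num_rq_stats * max_nch +
           num_xskrq_stats * max_nch * ever_used;
}
```

Once a channel has ever used XSK, the counters stay exposed, which keeps the ethtool string/value layouts consistent across reads.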
+30
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
··· 46 46 #define MLX5E_DECLARE_TX_STAT(type, fld) "tx%d_"#fld, offsetof(type, fld) 47 47 #define MLX5E_DECLARE_XDPSQ_STAT(type, fld) "tx%d_xdp_"#fld, offsetof(type, fld) 48 48 #define MLX5E_DECLARE_RQ_XDPSQ_STAT(type, fld) "rx%d_xdp_tx_"#fld, offsetof(type, fld) 49 + #define MLX5E_DECLARE_XSKRQ_STAT(type, fld) "rx%d_xsk_"#fld, offsetof(type, fld) 50 + #define MLX5E_DECLARE_XSKSQ_STAT(type, fld) "tx%d_xsk_"#fld, offsetof(type, fld) 49 51 #define MLX5E_DECLARE_CH_STAT(type, fld) "ch%d_"#fld, offsetof(type, fld) 50 52 51 53 struct counter_desc { ··· 118 116 u64 ch_poll; 119 117 u64 ch_arm; 120 118 u64 ch_aff_change; 119 + u64 ch_force_irq; 121 120 u64 ch_eq_rearm; 122 121 123 122 #ifdef CONFIG_MLX5_EN_TLS 124 123 u64 tx_tls_ooo; 125 124 u64 tx_tls_resync_bytes; 126 125 #endif 126 + 127 + u64 rx_xsk_packets; 128 + u64 rx_xsk_bytes; 129 + u64 rx_xsk_csum_complete; 130 + u64 rx_xsk_csum_unnecessary; 131 + u64 rx_xsk_csum_unnecessary_inner; 132 + u64 rx_xsk_csum_none; 133 + u64 rx_xsk_ecn_mark; 134 + u64 rx_xsk_removed_vlan_packets; 135 + u64 rx_xsk_xdp_drop; 136 + u64 rx_xsk_xdp_redirect; 137 + u64 rx_xsk_wqe_err; 138 + u64 rx_xsk_mpwqe_filler_cqes; 139 + u64 rx_xsk_mpwqe_filler_strides; 140 + u64 rx_xsk_oversize_pkts_sw_drop; 141 + u64 rx_xsk_buff_alloc_err; 142 + u64 rx_xsk_cqe_compress_blks; 143 + u64 rx_xsk_cqe_compress_pkts; 144 + u64 rx_xsk_congst_umr; 145 + u64 rx_xsk_arfs_err; 146 + u64 tx_xsk_xmit; 147 + u64 tx_xsk_mpwqe; 148 + u64 tx_xsk_inlnw; 149 + u64 tx_xsk_full; 150 + u64 tx_xsk_err; 151 + u64 tx_xsk_cqes; 127 152 }; 128 153 129 154 struct mlx5e_qcounter_stats { ··· 285 256 u64 poll; 286 257 u64 arm; 287 258 u64 aff_change; 259 + u64 force_irq; 288 260 u64 eq_rearm; 289 261 }; 290 262
+38 -4
drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
··· 33 33 #include <linux/irq.h> 34 34 #include "en.h" 35 35 #include "en/xdp.h" 36 + #include "en/xsk/tx.h" 36 37 37 38 static inline bool mlx5e_channel_no_affinity_change(struct mlx5e_channel *c) 38 39 { ··· 86 85 struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel, 87 86 napi); 88 87 struct mlx5e_ch_stats *ch_stats = c->stats; 88 + struct mlx5e_xdpsq *xsksq = &c->xsksq; 89 + struct mlx5e_rq *xskrq = &c->xskrq; 89 90 struct mlx5e_rq *rq = &c->rq; 91 + bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state); 92 + bool aff_change = false; 93 + bool busy_xsk = false; 90 94 bool busy = false; 91 95 int work_done = 0; 92 96 int i; ··· 101 95 for (i = 0; i < c->num_tc; i++) 102 96 busy |= mlx5e_poll_tx_cq(&c->sq[i].cq, budget); 103 97 104 - busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq, NULL); 98 + busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq); 105 99 106 100 if (c->xdp) 107 - busy |= mlx5e_poll_xdpsq_cq(&rq->xdpsq.cq, rq); 101 + busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq); 108 102 109 103 if (likely(budget)) { /* budget=0 means: don't poll rx rings */ 110 - work_done = mlx5e_poll_rx_cq(&rq->cq, budget); 104 + if (xsk_open) 105 + work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget); 106 + 107 + if (likely(budget - work_done)) 108 + work_done += mlx5e_poll_rx_cq(&rq->cq, budget - work_done); 109 + 111 110 busy |= work_done == budget; 112 111 } 113 112 114 - busy |= c->rq.post_wqes(rq); 113 + mlx5e_poll_ico_cq(&c->icosq.cq); 114 + 115 + busy |= rq->post_wqes(rq); 116 + if (xsk_open) { 117 + mlx5e_poll_ico_cq(&c->xskicosq.cq); 118 + busy |= mlx5e_poll_xdpsq_cq(&xsksq->cq); 119 + busy_xsk |= mlx5e_xsk_tx(xsksq, MLX5E_TX_XSK_POLL_BUDGET); 120 + busy_xsk |= xskrq->post_wqes(xskrq); 121 + } 122 + 123 + busy |= busy_xsk; 115 124 116 125 if (busy) { 117 126 if (likely(mlx5e_channel_no_affinity_change(c))) 118 127 return budget; 119 128 ch_stats->aff_change++; 129 + aff_change = true; 120 130 if (budget && work_done == budget) 121 131 work_done--; 122 132 } ··· 152 130 
mlx5e_cq_arm(&rq->cq); 153 131 mlx5e_cq_arm(&c->icosq.cq); 154 132 mlx5e_cq_arm(&c->xdpsq.cq); 133 + 134 + if (xsk_open) { 135 + mlx5e_handle_rx_dim(xskrq); 136 + mlx5e_cq_arm(&c->xskicosq.cq); 137 + mlx5e_cq_arm(&xsksq->cq); 138 + mlx5e_cq_arm(&xskrq->cq); 139 + } 140 + 141 + if (unlikely(aff_change && busy_xsk)) { 142 + mlx5e_trigger_irq(&c->icosq); 143 + ch_stats->force_irq++; 144 + } 155 145 156 146 return work_done; 157 147 }
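In the en_txrx.c hunk, `mlx5e_napi_poll()` now polls the XSK RQ first and hands only the leftover budget to the regular RQ, reporting the channel busy when the combined work consumed the whole budget. The split can be modeled as below (stand-in function with invented pending counts; the real code polls CQs, not counters):

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* Returns total completions; sets *busy when the budget was exhausted,
 * mirroring `busy |= work_done == budget` in the diff. */
static int poll_rx_budget(int budget, int xsk_pending, int rq_pending, int *busy)
{
    int work_done = 0;

    if (budget) {   /* budget=0 means: don't poll rx rings */
        work_done  = min_int(xsk_pending, budget);            /* XSK RQ first */
        work_done += min_int(rq_pending, budget - work_done); /* then regular RQ */
        *busy = (work_done == budget);
    }
    return work_done;
}
```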
+7 -7
drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
··· 87 87 mlx5e_set_netdev_mtu_boundaries(priv); 88 88 netdev->mtu = netdev->max_mtu; 89 89 90 - mlx5e_build_nic_params(mdev, &priv->rss_params, &priv->channels.params, 90 + mlx5e_build_nic_params(mdev, NULL, &priv->rss_params, &priv->channels.params, 91 91 mlx5e_get_netdev_max_channels(netdev), 92 92 netdev->mtu); 93 93 mlx5i_build_nic_params(mdev, &priv->channels.params); ··· 365 365 if (err) 366 366 goto err_close_drop_rq; 367 367 368 - err = mlx5e_create_direct_rqts(priv); 368 + err = mlx5e_create_direct_rqts(priv, priv->direct_tir); 369 369 if (err) 370 370 goto err_destroy_indirect_rqts; 371 371 ··· 373 373 if (err) 374 374 goto err_destroy_direct_rqts; 375 375 376 - err = mlx5e_create_direct_tirs(priv); 376 + err = mlx5e_create_direct_tirs(priv, priv->direct_tir); 377 377 if (err) 378 378 goto err_destroy_indirect_tirs; 379 379 ··· 384 384 return 0; 385 385 386 386 err_destroy_direct_tirs: 387 - mlx5e_destroy_direct_tirs(priv); 387 + mlx5e_destroy_direct_tirs(priv, priv->direct_tir); 388 388 err_destroy_indirect_tirs: 389 389 mlx5e_destroy_indirect_tirs(priv, true); 390 390 err_destroy_direct_rqts: 391 - mlx5e_destroy_direct_rqts(priv); 391 + mlx5e_destroy_direct_rqts(priv, priv->direct_tir); 392 392 err_destroy_indirect_rqts: 393 393 mlx5e_destroy_rqt(priv, &priv->indir_rqt); 394 394 err_close_drop_rq: ··· 401 401 static void mlx5i_cleanup_rx(struct mlx5e_priv *priv) 402 402 { 403 403 mlx5i_destroy_flow_steering(priv); 404 - mlx5e_destroy_direct_tirs(priv); 404 + mlx5e_destroy_direct_tirs(priv, priv->direct_tir); 405 405 mlx5e_destroy_indirect_tirs(priv, true); 406 - mlx5e_destroy_direct_rqts(priv); 406 + mlx5e_destroy_direct_rqts(priv, priv->direct_tir); 407 407 mlx5e_destroy_rqt(priv, &priv->indir_rqt); 408 408 mlx5e_close_drop_rq(&priv->drop_rq); 409 409 mlx5e_destroy_q_counters(priv);
-5
drivers/net/ethernet/mellanox/mlx5/core/wq.h
··· 134 134 *wq->db = cpu_to_be32(wq->wqe_ctr); 135 135 } 136 136 137 - static inline u16 mlx5_wq_cyc_get_ctr_wrap_cnt(struct mlx5_wq_cyc *wq, u16 ctr) 138 - { 139 - return ctr >> wq->fbc.log_sz; 140 - } 141 - 142 137 static inline u16 mlx5_wq_cyc_ctr2ix(struct mlx5_wq_cyc *wq, u16 ctr) 143 138 { 144 139 return ctr & wq->fbc.sz_m1;
+48 -12
drivers/net/veth.c
··· 38 38 #define VETH_XDP_TX BIT(0) 39 39 #define VETH_XDP_REDIR BIT(1) 40 40 41 + #define VETH_XDP_TX_BULK_SIZE 16 42 + 41 43 struct veth_rq_stats { 42 44 u64 xdp_packets; 43 45 u64 xdp_bytes; ··· 64 62 struct bpf_prog *_xdp_prog; 65 63 struct veth_rq *rq; 66 64 unsigned int requested_headroom; 65 + }; 66 + 67 + struct veth_xdp_tx_bq { 68 + struct xdp_frame *q[VETH_XDP_TX_BULK_SIZE]; 69 + unsigned int count; 67 70 }; 68 71 69 72 /* ··· 449 442 return ret; 450 443 } 451 444 452 - static void veth_xdp_flush(struct net_device *dev) 445 + static void veth_xdp_flush_bq(struct net_device *dev, struct veth_xdp_tx_bq *bq) 446 + { 447 + int sent, i, err = 0; 448 + 449 + sent = veth_xdp_xmit(dev, bq->count, bq->q, 0); 450 + if (sent < 0) { 451 + err = sent; 452 + sent = 0; 453 + for (i = 0; i < bq->count; i++) 454 + xdp_return_frame(bq->q[i]); 455 + } 456 + trace_xdp_bulk_tx(dev, sent, bq->count - sent, err); 457 + 458 + bq->count = 0; 459 + } 460 + 461 + static void veth_xdp_flush(struct net_device *dev, struct veth_xdp_tx_bq *bq) 453 462 { 454 463 struct veth_priv *rcv_priv, *priv = netdev_priv(dev); 455 464 struct net_device *rcv; 456 465 struct veth_rq *rq; 457 466 458 467 rcu_read_lock(); 468 + veth_xdp_flush_bq(dev, bq); 459 469 rcv = rcu_dereference(priv->peer); 460 470 if (unlikely(!rcv)) 461 471 goto out; ··· 488 464 rcu_read_unlock(); 489 465 } 490 466 491 - static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp) 467 + static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp, 468 + struct veth_xdp_tx_bq *bq) 492 469 { 493 470 struct xdp_frame *frame = convert_to_xdp_frame(xdp); 494 471 495 472 if (unlikely(!frame)) 496 473 return -EOVERFLOW; 497 474 498 - return veth_xdp_xmit(dev, 1, &frame, 0); 475 + if (unlikely(bq->count == VETH_XDP_TX_BULK_SIZE)) 476 + veth_xdp_flush_bq(dev, bq); 477 + 478 + bq->q[bq->count++] = frame; 479 + 480 + return 0; 499 481 } 500 482 501 483 static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, 502 484 
struct xdp_frame *frame, 503 - unsigned int *xdp_xmit) 485 + unsigned int *xdp_xmit, 486 + struct veth_xdp_tx_bq *bq) 504 487 { 505 488 void *hard_start = frame->data - frame->headroom; 506 489 void *head = hard_start - sizeof(struct xdp_frame); ··· 540 509 orig_frame = *frame; 541 510 xdp.data_hard_start = head; 542 511 xdp.rxq->mem = frame->mem; 543 - if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) { 512 + if (unlikely(veth_xdp_tx(rq->dev, &xdp, bq) < 0)) { 544 513 trace_xdp_exception(rq->dev, xdp_prog, act); 545 514 frame = &orig_frame; 546 515 goto err_xdp; ··· 591 560 } 592 561 593 562 static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, 594 - unsigned int *xdp_xmit) 563 + unsigned int *xdp_xmit, 564 + struct veth_xdp_tx_bq *bq) 595 565 { 596 566 u32 pktlen, headroom, act, metalen; 597 567 void *orig_data, *orig_data_end; ··· 668 636 get_page(virt_to_page(xdp.data)); 669 637 consume_skb(skb); 670 638 xdp.rxq->mem = rq->xdp_mem; 671 - if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) { 639 + if (unlikely(veth_xdp_tx(rq->dev, &xdp, bq) < 0)) { 672 640 trace_xdp_exception(rq->dev, xdp_prog, act); 673 641 goto err_xdp; 674 642 } ··· 723 691 return NULL; 724 692 } 725 693 726 - static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit) 694 + static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit, 695 + struct veth_xdp_tx_bq *bq) 727 696 { 728 697 int i, done = 0, drops = 0, bytes = 0; 729 698 ··· 740 707 struct xdp_frame *frame = veth_ptr_to_xdp(ptr); 741 708 742 709 bytes += frame->len; 743 - skb = veth_xdp_rcv_one(rq, frame, &xdp_xmit_one); 710 + skb = veth_xdp_rcv_one(rq, frame, &xdp_xmit_one, bq); 744 711 } else { 745 712 skb = ptr; 746 713 bytes += skb->len; 747 - skb = veth_xdp_rcv_skb(rq, skb, &xdp_xmit_one); 714 + skb = veth_xdp_rcv_skb(rq, skb, &xdp_xmit_one, bq); 748 715 } 749 716 *xdp_xmit |= xdp_xmit_one; 750 717 ··· 770 737 struct veth_rq *rq = 771 738 container_of(napi, struct veth_rq, 
xdp_napi); 772 739 unsigned int xdp_xmit = 0; 740 + struct veth_xdp_tx_bq bq; 773 741 int done; 774 742 743 + bq.count = 0; 744 + 775 745 xdp_set_return_frame_no_direct(); 776 - done = veth_xdp_rcv(rq, budget, &xdp_xmit); 746 + done = veth_xdp_rcv(rq, budget, &xdp_xmit, &bq); 777 747 778 748 if (done < budget && napi_complete_done(napi, done)) { 779 749 /* Write rx_notify_masked before reading ptr_ring */ ··· 788 752 } 789 753 790 754 if (xdp_xmit & VETH_XDP_TX) 791 - veth_xdp_flush(rq->dev); 755 + veth_xdp_flush(rq->dev, &bq); 792 756 if (xdp_xmit & VETH_XDP_REDIR) 793 757 xdp_do_flush_map(); 794 758 xdp_clear_return_frame_no_direct();
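The veth.c change replaces one `ndo_xdp_xmit` call per XDP_TX frame with a 16-entry per-NAPI bulk queue that is flushed when it fills up and once more at the end of the poll. A userspace sketch of that accumulate-and-flush shape — the flush here only counts batches instead of transmitting frames:

```c
#define BULK_SIZE 16

struct tx_bq {
    int q[BULK_SIZE];
    unsigned int count;
    unsigned int flushes;   /* instrumentation for the sketch only */
};

static void bq_flush(struct tx_bq *bq)
{
    if (bq->count)
        bq->flushes++;      /* a real driver would xmit q[0..count) here */
    bq->count = 0;
}

/* Mirrors veth_xdp_tx(): flush first when the queue is already full,
 * then enqueue; frames are never sent one at a time. */
static void bq_enqueue(struct tx_bq *bq, int frame)
{
    if (bq->count == BULK_SIZE)
        bq_flush(bq);
    bq->q[bq->count++] = frame;
}
```

Batching amortizes the peer lookup and ptr_ring locking in `veth_xdp_xmit()` across up to 16 frames, which is the point of the patch.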
+45
include/linux/bpf-cgroup.h
··· 124 124 loff_t *ppos, void **new_buf, 125 125 enum bpf_attach_type type); 126 126 127 + int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int *level, 128 + int *optname, char __user *optval, 129 + int *optlen, char **kernel_optval); 130 + int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, 131 + int optname, char __user *optval, 132 + int __user *optlen, int max_optlen, 133 + int retval); 134 + 127 135 static inline enum bpf_cgroup_storage_type cgroup_storage_type( 128 136 struct bpf_map *map) 129 137 { ··· 294 286 __ret; \ 295 287 }) 296 288 289 + #define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \ 290 + kernel_optval) \ 291 + ({ \ 292 + int __ret = 0; \ 293 + if (cgroup_bpf_enabled) \ 294 + __ret = __cgroup_bpf_run_filter_setsockopt(sock, level, \ 295 + optname, optval, \ 296 + optlen, \ 297 + kernel_optval); \ 298 + __ret; \ 299 + }) 300 + 301 + #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) \ 302 + ({ \ 303 + int __ret = 0; \ 304 + if (cgroup_bpf_enabled) \ 305 + get_user(__ret, optlen); \ 306 + __ret; \ 307 + }) 308 + 309 + #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen, \ 310 + max_optlen, retval) \ 311 + ({ \ 312 + int __ret = retval; \ 313 + if (cgroup_bpf_enabled) \ 314 + __ret = __cgroup_bpf_run_filter_getsockopt(sock, level, \ 315 + optname, optval, \ 316 + optlen, max_optlen, \ 317 + retval); \ 318 + __ret; \ 319 + }) 320 + 297 321 int cgroup_bpf_prog_attach(const union bpf_attr *attr, 298 322 enum bpf_prog_type ptype, struct bpf_prog *prog); 299 323 int cgroup_bpf_prog_detach(const union bpf_attr *attr, ··· 397 357 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; }) 398 358 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; }) 399 359 #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; }) 360 + #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) ({ 0; }) 361 + #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, 
optval, \ 362 + optlen, max_optlen, retval) ({ retval; }) 363 + #define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \ 364 + kernel_optval) ({ 0; }) 400 365 401 366 #define for_each_cgroup_storage_type(stype) for (; false; ) 402 367
+2
include/linux/bpf.h
··· 518 518 struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags); 519 519 void bpf_prog_array_free(struct bpf_prog_array *progs); 520 520 int bpf_prog_array_length(struct bpf_prog_array *progs); 521 + bool bpf_prog_array_is_empty(struct bpf_prog_array *array); 521 522 int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs, 522 523 __u32 __user *prog_ids, u32 cnt); 523 524 ··· 1052 1051 extern const struct bpf_func_proto bpf_get_local_storage_proto; 1053 1052 extern const struct bpf_func_proto bpf_strtol_proto; 1054 1053 extern const struct bpf_func_proto bpf_strtoul_proto; 1054 + extern const struct bpf_func_proto bpf_tcp_sock_proto; 1055 1055 1056 1056 /* Shared helpers among cBPF and eBPF. */ 1057 1057 void bpf_user_rnd_init_once(void);
+1
include/linux/bpf_types.h
··· 30 30 #ifdef CONFIG_CGROUP_BPF 31 31 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) 32 32 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl) 33 + BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt) 33 34 #endif 34 35 #ifdef CONFIG_BPF_LIRC_MODE2 35 36 BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
+12 -1
include/linux/filter.h
··· 578 578 }; 579 579 580 580 struct bpf_redirect_info { 581 - u32 ifindex; 582 581 u32 flags; 582 + u32 tgt_index; 583 + void *tgt_value; 583 584 struct bpf_map *map; 584 585 struct bpf_map *map_to_flush; 585 586 u32 kern_flags; ··· 1198 1197 loff_t *ppos; 1199 1198 /* Temporary "register" for indirect stores to ppos. */ 1200 1199 u64 tmp_reg; 1200 + }; 1201 + 1202 + struct bpf_sockopt_kern { 1203 + struct sock *sk; 1204 + u8 *optval; 1205 + u8 *optval_end; 1206 + s32 level; 1207 + s32 optname; 1208 + s32 optlen; 1209 + s32 retval; 1201 1210 }; 1202 1211 1203 1212 #endif /* __LINUX_FILTER_H__ */
include/linux/list.h (+14)

···
 	WRITE_ONCE(prev->next, next);
 }

+/*
+ * Delete a list entry and clear the 'prev' pointer.
+ *
+ * This is a special-purpose list clearing method used in the networking code
+ * for lists allocated as per-cpu, where we don't want to incur the extra
+ * WRITE_ONCE() overhead of a regular list_del_init(). The code that uses this
+ * needs to check the node 'prev' pointer instead of calling list_empty().
+ */
+static inline void __list_del_clearprev(struct list_head *entry)
+{
+	__list_del(entry->prev, entry->next);
+	entry->prev = NULL;
+}
+
 /**
  * list_del - deletes entry from list.
  * @entry: the element to delete from the list.
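The new helper's contract is easy to model outside the kernel: a node whose `prev` pointer is NULL is "not on any list", so callers test that pointer instead of `list_empty()`. Below is a minimal userspace sketch of the pattern (an assumption-laden simplification: plain pointers, no `WRITE_ONCE()` barriers, and `INIT_LIST_HEAD`/`list_add`/`__list_del` are re-implemented locally rather than taken from kernel headers).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model of the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void __list_del(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

/* The helper added by this series: unlink the entry and mark it as
 * "not queued" by clearing ->prev. */
static void __list_del_clearprev(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->prev = NULL;
}

/* Callers check ->prev instead of list_empty(), as bq_enqueue() does
 * below in the cpumap/devmap changes. */
static int on_flush_list(const struct list_head *node)
{
	return node->prev != NULL;
}
```

The cost saved is small per operation, but these nodes sit on per-cpu flush lists touched on every XDP bulk enqueue, which is why the cheaper variant exists.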
include/net/tcp.h (+8)

···
 	return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }

+static inline void tcp_bpf_rtt(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTT_CB_FLAG))
+		tcp_call_bpf(sk, BPF_SOCK_OPS_RTT_CB, 0, NULL);
+}
+
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
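The RTT hook is opt-in: it only fires for sockets whose sock_ops program enabled `BPF_SOCK_OPS_RTT_CB_FLAG` via `bpf_sock_ops_cb_flags_set()`. A userspace model of the flag gate (flag values copied from the uapi additions below; `tcp_call_bpf()` is stubbed with a counter, since the real call dispatches into the attached BPF program):

```c
#include <assert.h>

/* Callback flag values, matching include/uapi/linux/bpf.h after this
 * series. */
#define BPF_SOCK_OPS_RTO_CB_FLAG	(1 << 0)
#define BPF_SOCK_OPS_RETRANS_CB_FLAG	(1 << 1)
#define BPF_SOCK_OPS_STATE_CB_FLAG	(1 << 2)
#define BPF_SOCK_OPS_RTT_CB_FLAG	(1 << 3)
#define BPF_SOCK_OPS_ALL_CB_FLAGS	0xF

static int rtt_cb_calls;

static void tcp_call_bpf_stub(void)	/* stands in for tcp_call_bpf() */
{
	rtt_cb_calls++;
}

/* Mirrors the logic of the new tcp_bpf_rtt() helper: call out only when
 * the program asked for RTT callbacks. */
static void tcp_bpf_rtt_model(unsigned int cb_flags)
{
	if (cb_flags & BPF_SOCK_OPS_RTT_CB_FLAG)
		tcp_call_bpf_stub();
}
```

Gating on a per-socket flag keeps the per-RTT cost at a single bit test for sockets that never opted in.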
include/net/xdp_sock.h (+24 -3)

···
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 /* Used from netdev driver */
+bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
 void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
-bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
+bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc);
 void xsk_umem_consume_tx_done(struct xdp_umem *umem);
 struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
 struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
···
 }

 /* Reuse-queue aware version of FILL queue helpers */
+static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem, u32 cnt)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	if (rq->length >= cnt)
+		return true;
+
+	return xsk_umem_has_addrs(umem, cnt - rq->length);
+}
+
 static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
 {
 	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
···
 	return false;
 }

+static inline bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt)
+{
+	return false;
+}
+
 static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
 {
 	return NULL;
···
 {
 }

-static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
-				       u32 *len)
+static inline bool xsk_umem_consume_tx(struct xdp_umem *umem,
+				       struct xdp_desc *desc)
 {
 	return false;
 }
···
 static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 {
 	return 0;
 }
+
+static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem,
+					 u32 cnt)
+{
+	return false;
+}

 static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
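The reuse-queue aware variant first counts addresses parked in the reuse queue and only asks the FILL ring for the shortfall. That arithmetic can be sketched in isolation (a simplified model: the FILL ring availability check `xsk_umem_has_addrs()` is stubbed as a plain counter comparison, and the struct names here are illustrative, not the kernel's):

```c
#include <assert.h>

/* Toy stand-in for the umem state that xsk_umem_has_addrs_rq() consults. */
struct umem_model {
	unsigned int fill_entries;	/* addresses available in the FILL ring */
	unsigned int rq_length;		/* addresses parked in the reuse queue */
};

static int has_addrs(const struct umem_model *u, unsigned int cnt)
{
	/* stands in for xsk_umem_has_addrs() */
	return u->fill_entries >= cnt;
}

/* Mirrors the new xsk_umem_has_addrs_rq(): satisfy the request from the
 * reuse queue first, then check the ring for the remainder. */
static int has_addrs_rq(const struct umem_model *u, unsigned int cnt)
{
	if (u->rq_length >= cnt)
		return 1;

	return has_addrs(u, cnt - u->rq_length);
}
```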
include/trace/events/xdp.h (+31 -3)

···
 		  __entry->ifindex)
 );

+TRACE_EVENT(xdp_bulk_tx,
+
+	TP_PROTO(const struct net_device *dev,
+		 int sent, int drops, int err),
+
+	TP_ARGS(dev, sent, drops, err),
+
+	TP_STRUCT__entry(
+		__field(int, ifindex)
+		__field(u32, act)
+		__field(int, drops)
+		__field(int, sent)
+		__field(int, err)
+	),
+
+	TP_fast_assign(
+		__entry->ifindex	= dev->ifindex;
+		__entry->act		= XDP_TX;
+		__entry->drops		= drops;
+		__entry->sent		= sent;
+		__entry->err		= err;
+	),
+
+	TP_printk("ifindex=%d action=%s sent=%d drops=%d err=%d",
+		  __entry->ifindex,
+		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
+		  __entry->sent, __entry->drops, __entry->err)
+);
+
 DECLARE_EVENT_CLASS(xdp_redirect_template,

 	TP_PROTO(const struct net_device *dev,
···
 #endif /* __DEVMAP_OBJ_TYPE */

 #define devmap_ifindex(fwd, map)				\
-	(!fwd ? 0 :						\
-	 ((map->map_type == BPF_MAP_TYPE_DEVMAP) ?		\
-	  ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0))
+	((map->map_type == BPF_MAP_TYPE_DEVMAP) ?		\
+	 ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)

 #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx)	\
	trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map),	\
include/uapi/linux/bpf.h (+30 -3)

···
 	BPF_PROG_TYPE_FLOW_DISSECTOR,
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 };

 enum bpf_attach_type {
···
 	BPF_CGROUP_SYSCTL,
 	BPF_CGROUP_UDP4_RECVMSG,
 	BPF_CGROUP_UDP6_RECVMSG,
+	BPF_CGROUP_GETSOCKOPT,
+	BPF_CGROUP_SETSOCKOPT,
 	__MAX_BPF_ATTACH_TYPE
 };

···
 *		but this is only implemented for native XDP (with driver
 *		support) as of this writing).
 *
-*		All values for *flags* are reserved for future usage, and must
-*		be left at zero.
+*		The lower two bits of *flags* are used as the return code if
+*		the map lookup fails. This is so that the return value can be
+*		one of the XDP program return codes up to XDP_TX, as chosen by
+*		the caller. Any higher bits in the *flags* argument must be
+*		unset.
 *
 *		When used to redirect packets to net devices, this helper
 *		provides a high performance increase over **bpf_redirect**\ ().
···
 *		  * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out)
 *		  * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
 *		  * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
+*		  * **BPF_SOCK_OPS_RTT_CB_FLAG** (every RTT)
 *
 *		Therefore, this function can be used to clear a callback flag by
 *		setting the appropriate bit to zero. e.g. to disable the RTO
···
 				 * sum(delta(snd_una)), or how many bytes
 				 * were acked.
 				 */
+	__u32 dsack_dups;	/* RFC4898 tcpEStatsStackDSACKDups
+				 * total number of DSACK blocks received
+				 */
+	__u32 delivered;	/* Total data packets delivered incl. rexmits */
+	__u32 delivered_ce;	/* Like the above but only ECE marked packets */
+	__u32 icsk_retransmits;	/* Number of unrecovered [RTO] timeouts */
 };

 struct bpf_sock_tuple {
···
 #define BPF_SOCK_OPS_RTO_CB_FLAG	(1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG	(1<<1)
 #define BPF_SOCK_OPS_STATE_CB_FLAG	(1<<2)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS	0x7		/* Mask of all currently
+#define BPF_SOCK_OPS_RTT_CB_FLAG	(1<<3)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS	0xF		/* Mask of all currently
 							 * supported cb flags
 							 */
···
 	 */
 	BPF_SOCK_OPS_TCP_LISTEN_CB,	/* Called on listen(2), right after
					 * socket transition to LISTEN state.
					 */
+	BPF_SOCK_OPS_RTT_CB,		/* Called on every RTT.
+					 */
 };

···
 	__u32 file_pos;		/* Sysctl file position to read from, write to.
				 * Allows 1,2,4-byte read an 4-byte write.
				 */
+};
+
+struct bpf_sockopt {
+	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(void *, optval);
+	__bpf_md_ptr(void *, optval_end);
+
+	__s32	level;
+	__s32	optname;
+	__s32	optlen;
+	__s32	retval;
 };

 #endif /* _UAPI__LINUX_BPF_H__ */
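The new `bpf_redirect_map()` flags contract can be captured in a few lines: on map lookup failure the helper returns the action encoded in the lower two bits, so the caller can choose any fallback up to `XDP_TX`. A simplified userspace model of that dispatch (an assumption-laden sketch, not the kernel helper; the real implementation also performs the lookup and redirect bookkeeping):

```c
#include <assert.h>

/* XDP return codes; values match include/uapi/linux/bpf.h. */
enum xdp_action {
	XDP_ABORTED = 0,
	XDP_DROP,
	XDP_PASS,
	XDP_TX,
	XDP_REDIRECT,
};

/* Model of the new flags semantics: higher bits must be unset, and on a
 * failed lookup the lower two bits become the caller-chosen fallback
 * action. */
static int redirect_map_result(int lookup_ok, unsigned int flags)
{
	if (flags & ~0x3u)		/* higher bits must be unset */
		return XDP_ABORTED;
	if (!lookup_ok)
		return (int)(flags & 0x3u);	/* fallback, up to XDP_TX */
	return XDP_REDIRECT;
}
```

A program can thus write `bpf_redirect_map(&map, idx, XDP_PASS)` and fall through to the stack when the index is empty, instead of aborting the packet.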
include/uapi/linux/if_xdp.h (+8)

···
 #define XDP_UMEM_FILL_RING		5
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
+#define XDP_OPTIONS			8

 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
···
 	__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
 	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
 };
+
+struct xdp_options {
+	__u32 flags;
+};
+
+/* Flags for the flags field of struct xdp_options */
+#define XDP_OPTIONS_ZEROCOPY (1 << 0)

 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			0
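With `XDP_OPTIONS`, an AF_XDP application can ask via `getsockopt(fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen)` whether the kernel actually negotiated zero-copy mode. The check itself is a bit test on the returned flags; a self-contained sketch (the struct and flag are copied from the uapi above, the socket call is omitted so the snippet stays standalone):

```c
#include <assert.h>

/* Mirrors the new uapi additions in include/uapi/linux/if_xdp.h. */
struct xdp_options {
	unsigned int flags;	/* __u32 in the uapi */
};

#define XDP_OPTIONS_ZEROCOPY (1 << 0)

/* After a successful getsockopt(fd, SOL_XDP, XDP_OPTIONS, ...), test
 * whether zero-copy mode is active on this socket. */
static int xsk_is_zerocopy(const struct xdp_options *opts)
{
	return !!(opts->flags & XDP_OPTIONS_ZEROCOPY);
}
```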
kernel/bpf/cgroup.c (+351 -1)

···
 #include <linux/bpf.h>
 #include <linux/bpf-cgroup.h>
 #include <net/sock.h>
+#include <net/bpf_sk_storage.h>
+
+#include "../cgroup/cgroup-internal.h"

 DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
 EXPORT_SYMBOL(cgroup_bpf_enabled_key);
···
 	struct bpf_prog_array *old_array;
 	unsigned int type;

+	mutex_lock(&cgroup_mutex);
+
 	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
 		struct list_head *progs = &cgrp->bpf.progs[type];
 		struct bpf_prog_list *pl, *tmp;
···
 		}
 		old_array = rcu_dereference_protected(
 				cgrp->bpf.effective[type],
-				percpu_ref_is_dying(&cgrp->bpf.refcnt));
+				lockdep_is_held(&cgroup_mutex));
 		bpf_prog_array_free(old_array);
 	}
+
+	mutex_unlock(&cgroup_mutex);

 	percpu_ref_exit(&cgrp->bpf.refcnt);
 	cgroup_put(cgrp);
···
 	css_for_each_descendant_pre(css, &cgrp->self) {
 		struct cgroup *desc = container_of(css, struct cgroup, self);

+		if (percpu_ref_is_zero(&desc->bpf.refcnt))
+			continue;
+
 		err = compute_effective_progs(desc, type, &desc->bpf.inactive);
 		if (err)
 			goto cleanup;
···
 	/* all allocations were successful. Activate all prog arrays */
 	css_for_each_descendant_pre(css, &cgrp->self) {
 		struct cgroup *desc = container_of(css, struct cgroup, self);
+
+		if (percpu_ref_is_zero(&desc->bpf.refcnt)) {
+			if (unlikely(desc->bpf.inactive)) {
+				bpf_prog_array_free(desc->bpf.inactive);
+				desc->bpf.inactive = NULL;
+			}
+			continue;
+		}

 		activate_effective_progs(desc, type, desc->bpf.inactive);
 		desc->bpf.inactive = NULL;
···
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);

+static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
+					     enum bpf_attach_type attach_type)
+{
+	struct bpf_prog_array *prog_array;
+	bool empty;
+
+	rcu_read_lock();
+	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
+	empty = bpf_prog_array_is_empty(prog_array);
+	rcu_read_unlock();
+
+	return empty;
+}
+
+static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
+{
+	if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0)
+		return -EINVAL;
+
+	ctx->optval = kzalloc(max_optlen, GFP_USER);
+	if (!ctx->optval)
+		return -ENOMEM;
+
+	ctx->optval_end = ctx->optval + max_optlen;
+	ctx->optlen = max_optlen;
+
+	return 0;
+}
+
+static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
+{
+	kfree(ctx->optval);
+}
+
+int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
+				       int *optname, char __user *optval,
+				       int *optlen, char **kernel_optval)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_sockopt_kern ctx = {
+		.sk = sk,
+		.level = *level,
+		.optname = *optname,
+	};
+	int ret;
+
+	/* Opportunistic check to see whether we have any BPF program
+	 * attached to the hook so we don't waste time allocating
+	 * memory and locking the socket.
+	 */
+	if (!cgroup_bpf_enabled ||
+	    __cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_SETSOCKOPT))
+		return 0;
+
+	ret = sockopt_alloc_buf(&ctx, *optlen);
+	if (ret)
+		return ret;
+
+	if (copy_from_user(ctx.optval, optval, *optlen) != 0) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	lock_sock(sk);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
+				 &ctx, BPF_PROG_RUN);
+	release_sock(sk);
+
+	if (!ret) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	if (ctx.optlen == -1) {
+		/* optlen set to -1, bypass kernel */
+		ret = 1;
+	} else if (ctx.optlen > *optlen || ctx.optlen < -1) {
+		/* optlen is out of bounds */
+		ret = -EFAULT;
+	} else {
+		/* optlen within bounds, run kernel handler */
+		ret = 0;
+
+		/* export any potential modifications */
+		*level = ctx.level;
+		*optname = ctx.optname;
+		*optlen = ctx.optlen;
+		*kernel_optval = ctx.optval;
+	}
+
+out:
+	if (ret)
+		sockopt_free_buf(&ctx);
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
+
+int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
+				       int optname, char __user *optval,
+				       int __user *optlen, int max_optlen,
+				       int retval)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_sockopt_kern ctx = {
+		.sk = sk,
+		.level = level,
+		.optname = optname,
+		.retval = retval,
+	};
+	int ret;
+
+	/* Opportunistic check to see whether we have any BPF program
+	 * attached to the hook so we don't waste time allocating
+	 * memory and locking the socket.
+	 */
+	if (!cgroup_bpf_enabled ||
+	    __cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_GETSOCKOPT))
+		return retval;
+
+	ret = sockopt_alloc_buf(&ctx, max_optlen);
+	if (ret)
+		return ret;
+
+	if (!retval) {
+		/* If kernel getsockopt finished successfully,
+		 * copy whatever was returned to the user back
+		 * into our temporary buffer. Set optlen to the
+		 * one that kernel returned as well to let
+		 * BPF programs inspect the value.
+		 */
+
+		if (get_user(ctx.optlen, optlen)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		if (ctx.optlen > max_optlen)
+			ctx.optlen = max_optlen;
+
+		if (copy_from_user(ctx.optval, optval, ctx.optlen) != 0) {
+			ret = -EFAULT;
+			goto out;
+		}
+	}
+
+	lock_sock(sk);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
+				 &ctx, BPF_PROG_RUN);
+	release_sock(sk);
+
+	if (!ret) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	if (ctx.optlen > max_optlen) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* BPF programs only allowed to set retval to 0, not some
+	 * arbitrary value.
+	 */
+	if (ctx.retval != 0 && ctx.retval != retval) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (copy_to_user(optval, ctx.optval, ctx.optlen) ||
+	    put_user(ctx.optlen, optlen)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = ctx.retval;
+
+out:
+	sockopt_free_buf(&ctx);
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
+
 static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
			      size_t *lenp)
 {
···
 };

 const struct bpf_prog_ops cg_sysctl_prog_ops = {
+};
+
+static const struct bpf_func_proto *
+cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_sk_storage_get:
+		return &bpf_sk_storage_get_proto;
+	case BPF_FUNC_sk_storage_delete:
+		return &bpf_sk_storage_delete_proto;
+#ifdef CONFIG_INET
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
+#endif
+	default:
+		return cgroup_base_func_proto(func_id, prog);
+	}
+}
+
+static bool cg_sockopt_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       const struct bpf_prog *prog,
+				       struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+
+	if (off < 0 || off >= sizeof(struct bpf_sockopt))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct bpf_sockopt, retval):
+			if (size != size_default)
+				return false;
+			return prog->expected_attach_type ==
+				BPF_CGROUP_GETSOCKOPT;
+		case offsetof(struct bpf_sockopt, optname):
+			/* fallthrough */
+		case offsetof(struct bpf_sockopt, level):
+			if (size != size_default)
+				return false;
+			return prog->expected_attach_type ==
+				BPF_CGROUP_SETSOCKOPT;
+		case offsetof(struct bpf_sockopt, optlen):
+			return size == size_default;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case offsetof(struct bpf_sockopt, sk):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_SOCKET;
+		break;
+	case offsetof(struct bpf_sockopt, optval):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct bpf_sockopt, optval_end):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	case offsetof(struct bpf_sockopt, retval):
+		if (size != size_default)
+			return false;
+		return prog->expected_attach_type == BPF_CGROUP_GETSOCKOPT;
+	default:
+		if (size != size_default)
+			return false;
+		break;
+	}
+	return true;
+}
+
+#define CG_SOCKOPT_ACCESS_FIELD(T, F)					\
+	T(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, F),			\
+	  si->dst_reg, si->src_reg,					\
+	  offsetof(struct bpf_sockopt_kern, F))
+
+static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
+					 const struct bpf_insn *si,
+					 struct bpf_insn *insn_buf,
+					 struct bpf_prog *prog,
+					 u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sockopt, sk):
+		*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, sk);
+		break;
+	case offsetof(struct bpf_sockopt, level):
+		if (type == BPF_WRITE)
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_STX_MEM, level);
+		else
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, level);
+		break;
+	case offsetof(struct bpf_sockopt, optname):
+		if (type == BPF_WRITE)
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_STX_MEM, optname);
+		else
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optname);
+		break;
+	case offsetof(struct bpf_sockopt, optlen):
+		if (type == BPF_WRITE)
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_STX_MEM, optlen);
+		else
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optlen);
+		break;
+	case offsetof(struct bpf_sockopt, retval):
+		if (type == BPF_WRITE)
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_STX_MEM, retval);
+		else
+			*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, retval);
+		break;
+	case offsetof(struct bpf_sockopt, optval):
+		*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optval);
+		break;
+	case offsetof(struct bpf_sockopt, optval_end):
+		*insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optval_end);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
+				   bool direct_write,
+				   const struct bpf_prog *prog)
+{
+	/* Nothing to do for sockopt argument. The data is kzalloc'ated.
+	 */
+	return 0;
+}
+
+const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
+	.get_func_proto		= cg_sockopt_func_proto,
+	.is_valid_access	= cg_sockopt_is_valid_access,
+	.convert_ctx_access	= cg_sockopt_convert_ctx_access,
+	.gen_prologue		= cg_sockopt_get_prologue,
+};
+
+const struct bpf_prog_ops cg_sockopt_prog_ops = {
 };
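The setsockopt path gives the BPF program three possible verdicts through the `optlen` it leaves behind: `-1` bypasses the kernel handler, a value within the user-supplied bounds lets the (possibly rewritten) option run through the kernel, and anything else is rejected. A userspace model of just that verdict logic (simplified: the real function also exports level/optname/optval and handles allocation and locking):

```c
#include <assert.h>
#include <errno.h>

/* Mirrors the optlen contract enforced by
 * __cgroup_bpf_run_filter_setsockopt() after the BPF program ran.
 * Returns 1 to bypass the kernel handler, 0 to run it, and -EFAULT when
 * the program left optlen out of bounds. */
static int setsockopt_optlen_verdict(int prog_optlen, int user_optlen)
{
	if (prog_optlen == -1)
		return 1;			/* bypass kernel */
	if (prog_optlen > user_optlen || prog_optlen < -1)
		return -EFAULT;			/* out of bounds */
	return 0;				/* run kernel handler */
}
```

Note the asymmetry with getsockopt, where the program may only shrink `optlen` and may only set `retval` back to 0, never invent a new error code.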
kernel/bpf/core.c (+10)

···
 	return cnt;
 }

+bool bpf_prog_array_is_empty(struct bpf_prog_array *array)
+{
+	struct bpf_prog_array_item *item;
+
+	for (item = array->items; item->prog; item++)
+		if (item->prog != &dummy_bpf_prog.prog)
+			return false;
+	return true;
+}

 static bool bpf_prog_array_copy_core(struct bpf_prog_array *array,
				     u32 *prog_ids,
···
 #include <linux/bpf_trace.h>

 EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
+EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx);
kernel/bpf/cpumap.c (+48 -57)

···

 /* General idea: XDP packets getting XDP redirected to another CPU,
  * will maximum be stored/queued for one driver ->poll() call. It is
- * guaranteed that setting flush bit and flush operation happen on
+ * guaranteed that queueing the frame and the flush operation happen on
  * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
  * which queue in bpf_cpu_map_entry contains packets.
  */

 #define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
+struct bpf_cpu_map_entry;
+struct bpf_cpu_map;
+
 struct xdp_bulk_queue {
 	void *q[CPU_MAP_BULK_SIZE];
+	struct list_head flush_node;
+	struct bpf_cpu_map_entry *obj;
 	unsigned int count;
 };
···

 	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
 	struct xdp_bulk_queue __percpu *bulkq;
+
+	struct bpf_cpu_map *cmap;

 	/* Queue with potential multi-producers, and single-consumer kthread */
 	struct ptr_ring *queue;
···
 	struct bpf_map map;
 	/* Below members specific for map type */
 	struct bpf_cpu_map_entry **cpu_map;
-	unsigned long __percpu *flush_needed;
+	struct list_head __percpu *flush_list;
 };

-static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
-			     struct xdp_bulk_queue *bq, bool in_napi_ctx);
-
-static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
-{
-	return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
-}
+static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx);

 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_cpu_map *cmap;
 	int err = -ENOMEM;
+	int ret, cpu;
 	u64 cost;
-	int ret;

 	if (!capable(CAP_SYS_ADMIN))
 		return ERR_PTR(-EPERM);
···
 	/* make sure page count doesn't overflow */
 	cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
-	cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
+	cost += sizeof(struct list_head) * num_possible_cpus();

 	/* Notice returns -EPERM on if map size is larger than memlock limit */
 	ret = bpf_map_charge_init(&cmap->map.memory, cost);
···
 		goto free_cmap;
 	}

-	/* A per cpu bitfield with a bit per possible CPU in map  */
-	cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
-					    __alignof__(unsigned long));
-	if (!cmap->flush_needed)
+	cmap->flush_list = alloc_percpu(struct list_head);
+	if (!cmap->flush_list)
 		goto free_charge;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(cmap->flush_list, cpu));

 	/* Alloc array for possible remote "destination" CPUs */
 	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
···

 	return &cmap->map;
 free_percpu:
-	free_percpu(cmap->flush_needed);
+	free_percpu(cmap->flush_list);
 free_charge:
 	bpf_map_charge_finish(&cmap->map.memory);
 free_cmap:
···
 {
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
 	struct bpf_cpu_map_entry *rcpu;
-	int numa, err;
+	struct xdp_bulk_queue *bq;
+	int numa, err, i;

 	/* Have map->numa_node, but choose node of redirect target CPU */
 	numa = cpu_to_node(cpu);
···
 					   sizeof(void *), gfp);
 	if (!rcpu->bulkq)
 		goto free_rcu;
+
+	for_each_possible_cpu(i) {
+		bq = per_cpu_ptr(rcpu->bulkq, i);
+		bq->obj = rcpu;
+	}

 	/* Alloc queue */
 	rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
···
 		struct xdp_bulk_queue *bq = per_cpu_ptr(rcpu->bulkq, cpu);

 		/* No concurrent bq_enqueue can run at this point */
-		bq_flush_to_queue(rcpu, bq, false);
+		bq_flush_to_queue(bq, false);
 	}
 	free_percpu(rcpu->bulkq);
 	/* Cannot kthread_stop() here, last put free rcpu resources */
···
 		rcpu = __cpu_map_entry_alloc(qsize, key_cpu, map->id);
 		if (!rcpu)
 			return -ENOMEM;
+		rcpu->cmap = cmap;
 	}
 	rcu_read_lock();
 	__cpu_map_entry_replace(cmap, key_cpu, rcpu);
···
 	synchronize_rcu();

 	/* To ensure all pending flush operations have completed wait for flush
-	 * bitmap to indicate all flush_needed bits to be zero on _all_ cpus.
-	 * Because the above synchronize_rcu() ensures the map is disconnected
-	 * from the program we can assume no new bits will be set.
+	 * list be empty on _all_ cpus. Because the above synchronize_rcu()
+	 * ensures the map is disconnected from the program we can assume no new
+	 * items will be added to the list.
 	 */
 	for_each_online_cpu(cpu) {
-		unsigned long *bitmap = per_cpu_ptr(cmap->flush_needed, cpu);
+		struct list_head *flush_list = per_cpu_ptr(cmap->flush_list, cpu);

-		while (!bitmap_empty(bitmap, cmap->map.max_entries))
+		while (!list_empty(flush_list))
 			cond_resched();
 	}
···
 		/* bq flush and cleanup happens after RCU graze-period */
 		__cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */
 	}
-	free_percpu(cmap->flush_needed);
+	free_percpu(cmap->flush_list);
 	bpf_map_area_free(cmap->cpu_map);
 	kfree(cmap);
 }
···
 	.map_check_btf	= map_check_no_btf,
 };

-static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
-			     struct xdp_bulk_queue *bq, bool in_napi_ctx)
+static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx)
 {
+	struct bpf_cpu_map_entry *rcpu = bq->obj;
 	unsigned int processed = 0, drops = 0;
 	const int to_cpu = rcpu->cpu;
 	struct ptr_ring *q;
···
 	bq->count = 0;
 	spin_unlock(&q->producer_lock);

+	__list_del_clearprev(&bq->flush_node);
+
 	/* Feedback loop via tracepoints */
 	trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu);
 	return 0;
···
  */
 static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
+	struct list_head *flush_list = this_cpu_ptr(rcpu->cmap->flush_list);
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);

 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
-		bq_flush_to_queue(rcpu, bq, true);
+		bq_flush_to_queue(bq, true);

 	/* Notice, xdp_buff/page MUST be queued here, long enough for
 	 * driver to code invoking us to finished, due to driver
···
 	 * operation, when completing napi->poll call.
 	 */
 	bq->q[bq->count++] = xdpf;
+
+	if (!bq->flush_node.prev)
+		list_add(&bq->flush_node, flush_list);
+
 	return 0;
 }
···
 	return 0;
 }

-void __cpu_map_insert_ctx(struct bpf_map *map, u32 bit)
-{
-	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
-	unsigned long *bitmap = this_cpu_ptr(cmap->flush_needed);
-
-	__set_bit(bit, bitmap);
-}
-
 void __cpu_map_flush(struct bpf_map *map)
 {
 	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
-	unsigned long *bitmap = this_cpu_ptr(cmap->flush_needed);
-	u32 bit;
+	struct list_head *flush_list = this_cpu_ptr(cmap->flush_list);
+	struct xdp_bulk_queue *bq, *tmp;

-	/* The napi->poll softirq makes sure __cpu_map_insert_ctx()
-	 * and __cpu_map_flush() happen on same CPU. Thus, the percpu
-	 * bitmap indicate which percpu bulkq have packets.
-	 */
-	for_each_set_bit(bit, bitmap, map->max_entries) {
-		struct bpf_cpu_map_entry *rcpu = READ_ONCE(cmap->cpu_map[bit]);
-		struct xdp_bulk_queue *bq;
-
-		/* This is possible if entry is removed by user space
-		 * between xdp redirect and flush op.
-		 */
-		if (unlikely(!rcpu))
-			continue;
-
-		__clear_bit(bit, bitmap);
-
-		/* Flush all frames in bulkq to real queue */
-		bq = this_cpu_ptr(rcpu->bulkq);
-		bq_flush_to_queue(rcpu, bq, true);
+	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+		bq_flush_to_queue(bq, true);

 		/* If already running, costs spin_lock_irqsave + smb_mb */
-		wake_up_process(rcpu->kthread);
+		wake_up_process(bq->obj->kthread);
 	}
 }
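The cpumap (and, below, devmap) change replaces the per-cpu "flush needed" bitmap with a per-cpu list of bulk queues: a queue parks itself on the list on its first enqueue and is unlinked when flushed, so the flush walk touches only queues that actually hold frames. A self-contained userspace sketch of that pattern (assumptions: single "cpu", a singly linked stand-in for `flush_node`, and a drain counter in place of the real `ptr_ring` hand-off):

```c
#include <assert.h>
#include <stddef.h>

#define BULK_SIZE 8

struct bulk_queue {
	void *q[BULK_SIZE];
	struct bulk_queue *next;	/* stand-in for flush_node */
	int on_list;			/* stand-in for flush_node.prev != NULL */
	unsigned int count;
	unsigned int flushed;		/* total items drained, for inspection */
};

struct flush_list {
	struct bulk_queue *head;
};

static void bq_drain(struct bulk_queue *bq)
{
	bq->flushed += bq->count;	/* real code pushes into a ptr_ring */
	bq->count = 0;
}

/* Mirrors bq_enqueue(): drain when full, then park the queue on the
 * flush list the first time it receives an item. */
static void bq_enqueue(struct flush_list *fl, struct bulk_queue *bq, void *item)
{
	if (bq->count == BULK_SIZE)
		bq_drain(bq);

	bq->q[bq->count++] = item;

	if (!bq->on_list) {
		bq->next = fl->head;
		fl->head = bq;
		bq->on_list = 1;
	}
}

/* Mirrors __cpu_map_flush(): walk only queues that hold frames and
 * unlink them, as __list_del_clearprev() does in the kernel. */
static void flush_all(struct flush_list *fl)
{
	struct bulk_queue *bq = fl->head;

	while (bq) {
		struct bulk_queue *next = bq->next;

		bq_drain(bq);
		bq->on_list = 0;
		bq = next;
	}
	fl->head = NULL;
}
```

Compared with the old bitmap, the flush walk no longer scans `max_entries` bits per map; its cost is proportional to the number of queues with pending frames.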
kernel/bpf/devmap.c (+52 -60)

···
 * datapath always has a valid copy. However, the datapath does a "flush"
 * operation that pushes any pending packets in the driver outside the RCU
 * critical section. Each bpf_dtab_netdev tracks these pending operations using
- * an atomic per-cpu bitmap. The bpf_dtab_netdev object will not be destroyed
- * until all bits are cleared indicating outstanding flush operations have
- * completed.
+ * a per-cpu flush list. The bpf_dtab_netdev object will not be destroyed until
+ * this list is empty, indicating outstanding flush operations have completed.
 *
 * BPF syscalls may race with BPF program calls on any of the update, delete
 * or lookup operations. As noted above the xchg() operation also keep the
···
	 (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)

 #define DEV_MAP_BULK_SIZE 16
+struct bpf_dtab_netdev;
+
 struct xdp_bulk_queue {
 	struct xdp_frame *q[DEV_MAP_BULK_SIZE];
+	struct list_head flush_node;
 	struct net_device *dev_rx;
+	struct bpf_dtab_netdev *obj;
 	unsigned int count;
 };
···
 struct bpf_dtab {
 	struct bpf_map map;
 	struct bpf_dtab_netdev **netdev_map;
-	unsigned long __percpu *flush_needed;
+	struct list_head __percpu *flush_list;
 	struct list_head list;
 };

 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);

-static u64 dev_map_bitmap_size(const union bpf_attr *attr)
-{
-	return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long);
-}
-
 static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_dtab *dtab;
+	int err, cpu;
 	u64 cost;
-	int err;

 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
···
	    attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);

+	/* Lookup returns a pointer straight to dev->ifindex, so make sure the
+	 * verifier prevents writes from the BPF side
+	 */
+	attr->map_flags |= BPF_F_RDONLY_PROG;
+
 	dtab = kzalloc(sizeof(*dtab), GFP_USER);
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
···
 	/* make sure page count doesn't overflow */
 	cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
-	cost += dev_map_bitmap_size(attr) * num_possible_cpus();
+	cost += sizeof(struct list_head) * num_possible_cpus();

 	/* if map size is larger than memlock limit, reject it */
 	err = bpf_map_charge_init(&dtab->map.memory, cost);
···
 	err = -ENOMEM;

-	/* A per cpu bitfield with a bit per possible net device */
-	dtab->flush_needed = __alloc_percpu_gfp(dev_map_bitmap_size(attr),
-						__alignof__(unsigned long),
-						GFP_KERNEL | __GFP_NOWARN);
-	if (!dtab->flush_needed)
+	dtab->flush_list = alloc_percpu(struct list_head);
+	if (!dtab->flush_list)
 		goto free_charge;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(dtab->flush_list, cpu));

 	dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
					      sizeof(struct bpf_dtab_netdev *),
					      dtab->map.numa_node);
 	if (!dtab->netdev_map)
-		goto free_charge;
+		goto free_percpu;

 	spin_lock(&dev_map_lock);
 	list_add_tail_rcu(&dtab->list, &dev_map_list);
 	spin_unlock(&dev_map_lock);

 	return &dtab->map;
+
+free_percpu:
+	free_percpu(dtab->flush_list);
 free_charge:
 	bpf_map_charge_finish(&dtab->map.memory);
 free_dtab:
-	free_percpu(dtab->flush_needed);
 	kfree(dtab);
 	return ERR_PTR(err);
 }
···
 	rcu_barrier();

 	/* To ensure all pending flush operations have completed wait for flush
-	 * bitmap to indicate all flush_needed bits to be zero on _all_ cpus.
+	 * list to empty on _all_ cpus.
 	 * Because the above synchronize_rcu() ensures the map is disconnected
-	 * from the program we can assume no new bits will be set.
+	 * from the program we can assume no new items will be added.
 	 */
 	for_each_online_cpu(cpu) {
-		unsigned long *bitmap = per_cpu_ptr(dtab->flush_needed, cpu);
+		struct list_head *flush_list = per_cpu_ptr(dtab->flush_list, cpu);

-		while (!bitmap_empty(bitmap, dtab->map.max_entries))
+		while (!list_empty(flush_list))
 			cond_resched();
 	}
···
 		kfree(dev);
 	}

-	free_percpu(dtab->flush_needed);
+	free_percpu(dtab->flush_list);
 	bpf_map_area_free(dtab->netdev_map);
 	kfree(dtab);
 }
···
 	return 0;
 }

-void __dev_map_insert_ctx(struct bpf_map *map, u32 bit)
-{
-	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
-	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
-
-	__set_bit(bit, bitmap);
-}
-
-static int bq_xmit_all(struct bpf_dtab_netdev *obj,
-		       struct xdp_bulk_queue *bq, u32 flags,
+static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags,
		       bool in_napi_ctx)
 {
+	struct bpf_dtab_netdev *obj = bq->obj;
 	struct net_device *dev = obj->dev;
 	int sent = 0, drops = 0, err = 0;
 	int i;
···
 	trace_xdp_devmap_xmit(&obj->dtab->map, obj->bit,
			      sent, drops, bq->dev_rx, dev, err);
 	bq->dev_rx = NULL;
+	__list_del_clearprev(&bq->flush_node);
 	return 0;
 error:
 	/* If ndo_xdp_xmit fails with an errno, no frames have been
···
 * from the driver before returning from its napi->poll() routine. The poll()
 * routine is called either from busy_poll context or net_rx_action signaled
 * from NET_RX_SOFTIRQ. Either way the poll routine must complete before the
- * net device can be torn down. On devmap tear down we ensure the ctx bitmap
- * is zeroed before completing to ensure all flush operations have completed.
+ * net device can be torn down. On devmap tear down we ensure the flush list
+ * is empty before completing to ensure all flush operations have completed.
 */
 void __dev_map_flush(struct bpf_map *map)
 {
 	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
-	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
-	u32 bit;
+	struct list_head *flush_list = this_cpu_ptr(dtab->flush_list);
+	struct xdp_bulk_queue *bq, *tmp;

 	rcu_read_lock();
-	for_each_set_bit(bit, bitmap, map->max_entries) {
-		struct bpf_dtab_netdev *dev = READ_ONCE(dtab->netdev_map[bit]);
-		struct xdp_bulk_queue *bq;
-
-		/* This is possible if the dev entry is removed by user space
-		 * between xdp redirect and flush op.
-		 */
-		if (unlikely(!dev))
-			continue;
-
-		bq = this_cpu_ptr(dev->bulkq);
-		bq_xmit_all(dev, bq, XDP_XMIT_FLUSH, true);
-
-		__clear_bit(bit, bitmap);
-	}
+	list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
+		bq_xmit_all(bq, XDP_XMIT_FLUSH, true);
 	rcu_read_unlock();
 }
···
		      struct net_device *dev_rx)

 {
+	struct list_head *flush_list = this_cpu_ptr(obj->dtab->flush_list);
 	struct xdp_bulk_queue *bq = this_cpu_ptr(obj->bulkq);

 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
-		bq_xmit_all(obj, bq, 0, true);
+		bq_xmit_all(bq, 0, true);

 	/* Ingress dev_rx will be the same for all xdp_frame's in
 	 * bulk_queue, because bq stored per-CPU and must be flushed
···
 	bq->dev_rx = dev_rx;

 	bq->q[bq->count++] = xdpf;
+
+	if (!bq->flush_node.prev)
+		list_add(&bq->flush_node, flush_list);
+
 	return 0;
 }
···
 {
 	if
(dev->dev->netdev_ops->ndo_xdp_xmit) { 369 379 struct xdp_bulk_queue *bq; 370 - unsigned long *bitmap; 371 - 372 380 int cpu; 373 381 374 382 rcu_read_lock(); 375 383 for_each_online_cpu(cpu) { 376 - bitmap = per_cpu_ptr(dev->dtab->flush_needed, cpu); 377 - __clear_bit(dev->bit, bitmap); 378 - 379 384 bq = per_cpu_ptr(dev->bulkq, cpu); 380 - bq_xmit_all(dev, bq, XDP_XMIT_FLUSH, false); 385 + bq_xmit_all(bq, XDP_XMIT_FLUSH, false); 381 386 } 382 387 rcu_read_unlock(); 383 388 } ··· 419 434 struct net *net = current->nsproxy->net_ns; 420 435 gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN; 421 436 struct bpf_dtab_netdev *dev, *old_dev; 422 - u32 i = *(u32 *)key; 423 437 u32 ifindex = *(u32 *)value; 438 + struct xdp_bulk_queue *bq; 439 + u32 i = *(u32 *)key; 440 + int cpu; 424 441 425 442 if (unlikely(map_flags > BPF_EXIST)) 426 443 return -EINVAL; ··· 443 456 if (!dev->bulkq) { 444 457 kfree(dev); 445 458 return -ENOMEM; 459 + } 460 + 461 + for_each_possible_cpu(cpu) { 462 + bq = per_cpu_ptr(dev->bulkq, cpu); 463 + bq->obj = dev; 446 464 } 447 465 448 466 dev->dev = dev_get_by_index(net, ifindex);
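The devmap hunks above replace the per-map flush bitmap with a per-CPU list of bulk queues: a queue links itself onto the flush list the first time a frame lands in it (a NULL `->prev` pointer marks "not queued"), and the flush path unlinks it again with `__list_del_clearprev()`. A minimal userspace sketch of that pattern, with simplified stand-ins for `list_head`, the per-CPU list, and the kernel helpers (all names here are illustrative, not the kernel's implementation):

```c
#include <stddef.h>

/* Minimal intrusive list, modeled loosely on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h->prev = h; }

static void list_add_sketch(struct list_head *n, struct list_head *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}

/* Unlink and clear ->prev so "already on a flush list" stays cheap to test. */
static void list_del_clearprev_sketch(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = NULL;
}

struct bulk_queue {
	struct list_head flush_node;	/* prev == NULL means "not queued" */
	int count;
};

static struct list_head flush_list;	/* stands in for the per-CPU list */

static void bq_enqueue_sketch(struct bulk_queue *bq)
{
	bq->count++;
	/* Only link the queue the first time a frame lands in it. */
	if (!bq->flush_node.prev)
		list_add_sketch(&bq->flush_node, &flush_list);
}

static int flush_all_sketch(void)
{
	int flushed = 0;

	while (flush_list.next != &flush_list) {
		struct bulk_queue *bq = (struct bulk_queue *)
			((char *)flush_list.next -
			 offsetof(struct bulk_queue, flush_node));
		bq->count = 0;	/* real code would xmit the queued frames */
		list_del_clearprev_sketch(&bq->flush_node);
		flushed++;
	}
	return flushed;
}
```

This is why `dev_map_free()` can simply spin until each per-CPU list is empty: an empty list means no bulk queue still has a pending flush.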
+19
kernel/bpf/syscall.c
··· 1590 1590 default: 1591 1591 return -EINVAL; 1592 1592 } 1593 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 1594 + switch (expected_attach_type) { 1595 + case BPF_CGROUP_SETSOCKOPT: 1596 + case BPF_CGROUP_GETSOCKOPT: 1597 + return 0; 1598 + default: 1599 + return -EINVAL; 1600 + } 1593 1601 default: 1594 1602 return 0; 1595 1603 } ··· 1848 1840 switch (prog->type) { 1849 1841 case BPF_PROG_TYPE_CGROUP_SOCK: 1850 1842 case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: 1843 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 1851 1844 return attach_type == prog->expected_attach_type ? 0 : -EINVAL; 1852 1845 case BPF_PROG_TYPE_CGROUP_SKB: 1853 1846 return prog->enforce_expected_attach_type && ··· 1920 1911 break; 1921 1912 case BPF_CGROUP_SYSCTL: 1922 1913 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL; 1914 + break; 1915 + case BPF_CGROUP_GETSOCKOPT: 1916 + case BPF_CGROUP_SETSOCKOPT: 1917 + ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT; 1923 1918 break; 1924 1919 default: 1925 1920 return -EINVAL; ··· 2008 1995 case BPF_CGROUP_SYSCTL: 2009 1996 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL; 2010 1997 break; 1998 + case BPF_CGROUP_GETSOCKOPT: 1999 + case BPF_CGROUP_SETSOCKOPT: 2000 + ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT; 2001 + break; 2011 2002 default: 2012 2003 return -EINVAL; 2013 2004 } ··· 2048 2031 case BPF_CGROUP_SOCK_OPS: 2049 2032 case BPF_CGROUP_DEVICE: 2050 2033 case BPF_CGROUP_SYSCTL: 2034 + case BPF_CGROUP_GETSOCKOPT: 2035 + case BPF_CGROUP_SETSOCKOPT: 2051 2036 break; 2052 2037 case BPF_LIRC_MODE2: 2053 2038 return lirc_prog_query(attr, uattr);
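The syscall.c hunks above wire the new BPF_PROG_TYPE_CGROUP_SOCKOPT program type into attach, detach and query, routing both the get- and setsockopt attach points to the one program type. A small restatement of that mapping, using stand-in enum values rather than the real UAPI constants:

```c
/* Illustrative only: these enums are stand-ins for the UAPI constants. */
enum cg_attach { CG_ATTACH_GETSOCKOPT, CG_ATTACH_SETSOCKOPT,
		 CG_ATTACH_SYSCTL, CG_ATTACH_OTHER };
enum cg_prog   { CG_PROG_SOCKOPT, CG_PROG_SYSCTL, CG_PROG_INVALID };

static enum cg_prog attach_to_prog(enum cg_attach a)
{
	switch (a) {
	case CG_ATTACH_GETSOCKOPT:
	case CG_ATTACH_SETSOCKOPT:
		return CG_PROG_SOCKOPT;	/* both sockopt hooks share one type */
	case CG_ATTACH_SYSCTL:
		return CG_PROG_SYSCTL;
	default:
		return CG_PROG_INVALID;
	}
}
```

The same shape appears three times in the patch (attach, detach, query), which is why each hunk adds the identical pair of `case` labels.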
+117 -19
kernel/bpf/verifier.c
··· 1659 1659 } 1660 1660 } 1661 1661 1662 - static int mark_chain_precision(struct bpf_verifier_env *env, int regno) 1662 + static int __mark_chain_precision(struct bpf_verifier_env *env, int regno, 1663 + int spi) 1663 1664 { 1664 1665 struct bpf_verifier_state *st = env->cur_state; 1665 1666 int first_idx = st->first_insn_idx; 1666 1667 int last_idx = env->insn_idx; 1667 1668 struct bpf_func_state *func; 1668 1669 struct bpf_reg_state *reg; 1669 - u32 reg_mask = 1u << regno; 1670 - u64 stack_mask = 0; 1670 + u32 reg_mask = regno >= 0 ? 1u << regno : 0; 1671 + u64 stack_mask = spi >= 0 ? 1ull << spi : 0; 1671 1672 bool skip_first = true; 1673 + bool new_marks = false; 1672 1674 int i, err; 1673 1675 1674 1676 if (!env->allow_ptr_leaks) ··· 1678 1676 return 0; 1679 1677 1680 1678 func = st->frame[st->curframe]; 1681 - reg = &func->regs[regno]; 1682 - if (reg->type != SCALAR_VALUE) { 1683 - WARN_ONCE(1, "backtracing misuse"); 1684 - return -EFAULT; 1679 + if (regno >= 0) { 1680 + reg = &func->regs[regno]; 1681 + if (reg->type != SCALAR_VALUE) { 1682 + WARN_ONCE(1, "backtracing misuse"); 1683 + return -EFAULT; 1684 + } 1685 + if (!reg->precise) 1686 + new_marks = true; 1687 + else 1688 + reg_mask = 0; 1689 + reg->precise = true; 1685 1690 } 1686 - if (reg->precise) 1687 - return 0; 1688 - func->regs[regno].precise = true; 1689 1691 1692 + while (spi >= 0) { 1693 + if (func->stack[spi].slot_type[0] != STACK_SPILL) { 1694 + stack_mask = 0; 1695 + break; 1696 + } 1697 + reg = &func->stack[spi].spilled_ptr; 1698 + if (reg->type != SCALAR_VALUE) { 1699 + stack_mask = 0; 1700 + break; 1701 + } 1702 + if (!reg->precise) 1703 + new_marks = true; 1704 + else 1705 + stack_mask = 0; 1706 + reg->precise = true; 1707 + break; 1708 + } 1709 + 1710 + if (!new_marks) 1711 + return 0; 1712 + if (!reg_mask && !stack_mask) 1713 + return 0; 1690 1714 for (;;) { 1691 1715 DECLARE_BITMAP(mask, 64); 1692 - bool new_marks = false; 1693 1716 u32 history = st->jmp_history_cnt; 1694 1717 1695 
1718 if (env->log.level & BPF_LOG_LEVEL) ··· 1757 1730 if (!st) 1758 1731 break; 1759 1732 1733 + new_marks = false; 1760 1734 func = st->frame[st->curframe]; 1761 1735 bitmap_from_u64(mask, reg_mask); 1762 1736 for_each_set_bit(i, mask, 32) { 1763 1737 reg = &func->regs[i]; 1764 - if (reg->type != SCALAR_VALUE) 1738 + if (reg->type != SCALAR_VALUE) { 1739 + reg_mask &= ~(1u << i); 1765 1740 continue; 1741 + } 1766 1742 if (!reg->precise) 1767 1743 new_marks = true; 1768 1744 reg->precise = true; ··· 1786 1756 return -EFAULT; 1787 1757 } 1788 1758 1789 - if (func->stack[i].slot_type[0] != STACK_SPILL) 1759 + if (func->stack[i].slot_type[0] != STACK_SPILL) { 1760 + stack_mask &= ~(1ull << i); 1790 1761 continue; 1762 + } 1791 1763 reg = &func->stack[i].spilled_ptr; 1792 - if (reg->type != SCALAR_VALUE) 1764 + if (reg->type != SCALAR_VALUE) { 1765 + stack_mask &= ~(1ull << i); 1793 1766 continue; 1767 + } 1794 1768 if (!reg->precise) 1795 1769 new_marks = true; 1796 1770 reg->precise = true; ··· 1806 1772 reg_mask, stack_mask); 1807 1773 } 1808 1774 1775 + if (!reg_mask && !stack_mask) 1776 + break; 1809 1777 if (!new_marks) 1810 1778 break; 1811 1779 ··· 1817 1781 return 0; 1818 1782 } 1819 1783 1784 + static int mark_chain_precision(struct bpf_verifier_env *env, int regno) 1785 + { 1786 + return __mark_chain_precision(env, regno, -1); 1787 + } 1788 + 1789 + static int mark_chain_precision_stack(struct bpf_verifier_env *env, int spi) 1790 + { 1791 + return __mark_chain_precision(env, -1, spi); 1792 + } 1820 1793 1821 1794 static bool is_spillable_regtype(enum bpf_reg_type type) 1822 1795 { ··· 2260 2215 2261 2216 env->seen_direct_write = true; 2262 2217 return true; 2218 + 2219 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 2220 + if (t == BPF_WRITE) 2221 + env->seen_direct_write = true; 2222 + 2223 + return true; 2224 + 2263 2225 default: 2264 2226 return false; 2265 2227 } ··· 3459 3407 if (func_id != BPF_FUNC_get_local_storage) 3460 3408 goto error; 3461 3409 break; 3462 - 
/* devmap returns a pointer to a live net_device ifindex that we cannot 3463 - * allow to be modified from bpf side. So do not allow lookup elements 3464 - * for now. 3465 - */ 3466 3410 case BPF_MAP_TYPE_DEVMAP: 3467 - if (func_id != BPF_FUNC_redirect_map) 3411 + if (func_id != BPF_FUNC_redirect_map && 3412 + func_id != BPF_FUNC_map_lookup_elem) 3468 3413 goto error; 3469 3414 break; 3470 3415 /* Restrict bpf side of cpumap and xskmap, open when use-cases ··· 6115 6066 case BPF_PROG_TYPE_SOCK_OPS: 6116 6067 case BPF_PROG_TYPE_CGROUP_DEVICE: 6117 6068 case BPF_PROG_TYPE_CGROUP_SYSCTL: 6069 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 6118 6070 break; 6119 6071 default: 6120 6072 return 0; ··· 7156 7106 return 0; 7157 7107 } 7158 7108 7109 + /* find precise scalars in the previous equivalent state and 7110 + * propagate them into the current state 7111 + */ 7112 + static int propagate_precision(struct bpf_verifier_env *env, 7113 + const struct bpf_verifier_state *old) 7114 + { 7115 + struct bpf_reg_state *state_reg; 7116 + struct bpf_func_state *state; 7117 + int i, err = 0; 7118 + 7119 + state = old->frame[old->curframe]; 7120 + state_reg = state->regs; 7121 + for (i = 0; i < BPF_REG_FP; i++, state_reg++) { 7122 + if (state_reg->type != SCALAR_VALUE || 7123 + !state_reg->precise) 7124 + continue; 7125 + if (env->log.level & BPF_LOG_LEVEL2) 7126 + verbose(env, "propagating r%d\n", i); 7127 + err = mark_chain_precision(env, i); 7128 + if (err < 0) 7129 + return err; 7130 + } 7131 + 7132 + for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) { 7133 + if (state->stack[i].slot_type[0] != STACK_SPILL) 7134 + continue; 7135 + state_reg = &state->stack[i].spilled_ptr; 7136 + if (state_reg->type != SCALAR_VALUE || 7137 + !state_reg->precise) 7138 + continue; 7139 + if (env->log.level & BPF_LOG_LEVEL2) 7140 + verbose(env, "propagating fp%d\n", 7141 + (-i - 1) * BPF_REG_SIZE); 7142 + err = mark_chain_precision_stack(env, i); 7143 + if (err < 0) 7144 + return err; 7145 + } 7146 
+ return 0; 7147 + } 7148 + 7159 7149 static bool states_maybe_looping(struct bpf_verifier_state *old, 7160 7150 struct bpf_verifier_state *cur) 7161 7151 { ··· 7288 7198 * this state and will pop a new one. 7289 7199 */ 7290 7200 err = propagate_liveness(env, &sl->state, cur); 7201 + 7202 + /* if previous state reached the exit with precision and 7203 + * current state is equivalent to it (except precision marks) 7204 + * the precision needs to be propagated back in 7205 + * the current state. 7206 + */ 7207 + err = err ? : push_jmp_history(env, cur); 7208 + err = err ? : propagate_precision(env, &sl->state); 7291 7209 if (err) 7292 7210 return err; 7293 7211 return 1;
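The reworked `__mark_chain_precision()` above tracks which registers and spill slots still need precision marks with a u32 register mask and a u64 stack-slot mask, dropping a bit as soon as the corresponding slot turns out not to hold a scalar, and stopping when both masks are empty. A small sketch of just that bookkeeping (struct and function names are illustrative):

```c
#include <stdint.h>

/* One bit per register (u32) and one per stack spill slot (u64),
 * mirroring reg_mask/stack_mask in the hunk above. */
struct precision_masks {
	uint32_t reg_mask;
	uint64_t stack_mask;
};

static struct precision_masks masks_init(int regno, int spi)
{
	struct precision_masks m = {
		.reg_mask   = regno >= 0 ? 1u << regno : 0,
		.stack_mask = spi   >= 0 ? 1ull << spi : 0,
	};
	return m;
}

/* A slot that turns out not to hold a scalar is dropped from tracking. */
static void masks_drop_slot(struct precision_masks *m, int spi)
{
	m->stack_mask &= ~(1ull << spi);
}

static int masks_done(const struct precision_masks *m)
{
	return !m->reg_mask && !m->stack_mask;
}
```

Passing `regno = -1` or `spi = -1` yields an empty mask, which is how the two thin wrappers `mark_chain_precision()` and `mark_chain_precision_stack()` select register-only or stack-only backtracking.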
+1 -2
kernel/bpf/xskmap.c
··· 145 145 146 146 list_for_each_entry_safe(xs, tmp, flush_list, flush_node) { 147 147 xsk_flush(xs); 148 - __list_del(xs->flush_node.prev, xs->flush_node.next); 149 - xs->flush_node.prev = NULL; 148 + __list_del_clearprev(&xs->flush_node); 150 149 } 151 150 } 152 151
+14 -13
kernel/trace/bpf_trace.c
··· 1431 1431 return err; 1432 1432 } 1433 1433 1434 + static int __init send_signal_irq_work_init(void) 1435 + { 1436 + int cpu; 1437 + struct send_signal_irq_work *work; 1438 + 1439 + for_each_possible_cpu(cpu) { 1440 + work = per_cpu_ptr(&send_signal_work, cpu); 1441 + init_irq_work(&work->irq_work, do_bpf_send_signal); 1442 + } 1443 + return 0; 1444 + } 1445 + 1446 + subsys_initcall(send_signal_irq_work_init); 1447 + 1434 1448 #ifdef CONFIG_MODULES 1435 1449 static int bpf_event_notify(struct notifier_block *nb, unsigned long op, 1436 1450 void *module) ··· 1492 1478 return 0; 1493 1479 } 1494 1480 1495 - static int __init send_signal_irq_work_init(void) 1496 - { 1497 - int cpu; 1498 - struct send_signal_irq_work *work; 1499 - 1500 - for_each_possible_cpu(cpu) { 1501 - work = per_cpu_ptr(&send_signal_work, cpu); 1502 - init_irq_work(&work->irq_work, do_bpf_send_signal); 1503 - } 1504 - return 0; 1505 - } 1506 - 1507 1481 fs_initcall(bpf_event_init); 1508 - subsys_initcall(send_signal_irq_work_init); 1509 1482 #endif /* CONFIG_MODULES */
+185 -84
net/core/filter.c
··· 2158 2158 if (unlikely(flags & ~(BPF_F_INGRESS))) 2159 2159 return TC_ACT_SHOT; 2160 2160 2161 - ri->ifindex = ifindex; 2162 2161 ri->flags = flags; 2162 + ri->tgt_index = ifindex; 2163 2163 2164 2164 return TC_ACT_REDIRECT; 2165 2165 } ··· 2169 2169 struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); 2170 2170 struct net_device *dev; 2171 2171 2172 - dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); 2173 - ri->ifindex = 0; 2172 + dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->tgt_index); 2173 + ri->tgt_index = 0; 2174 2174 if (unlikely(!dev)) { 2175 2175 kfree_skb(skb); 2176 2176 return -EINVAL; ··· 3488 3488 struct bpf_prog *xdp_prog, struct bpf_redirect_info *ri) 3489 3489 { 3490 3490 struct net_device *fwd; 3491 - u32 index = ri->ifindex; 3491 + u32 index = ri->tgt_index; 3492 3492 int err; 3493 3493 3494 3494 fwd = dev_get_by_index_rcu(dev_net(dev), index); 3495 - ri->ifindex = 0; 3495 + ri->tgt_index = 0; 3496 3496 if (unlikely(!fwd)) { 3497 3497 err = -EINVAL; 3498 3498 goto err; ··· 3523 3523 err = dev_map_enqueue(dst, xdp, dev_rx); 3524 3524 if (unlikely(err)) 3525 3525 return err; 3526 - __dev_map_insert_ctx(map, index); 3527 3526 break; 3528 3527 } 3529 3528 case BPF_MAP_TYPE_CPUMAP: { ··· 3531 3532 err = cpu_map_enqueue(rcpu, xdp, dev_rx); 3532 3533 if (unlikely(err)) 3533 3534 return err; 3534 - __cpu_map_insert_ctx(map, index); 3535 3535 break; 3536 3536 } 3537 3537 case BPF_MAP_TYPE_XSKMAP: { ··· 3604 3606 struct bpf_prog *xdp_prog, struct bpf_map *map, 3605 3607 struct bpf_redirect_info *ri) 3606 3608 { 3607 - u32 index = ri->ifindex; 3608 - void *fwd = NULL; 3609 + u32 index = ri->tgt_index; 3610 + void *fwd = ri->tgt_value; 3609 3611 int err; 3610 3612 3611 - ri->ifindex = 0; 3613 + ri->tgt_index = 0; 3614 + ri->tgt_value = NULL; 3612 3615 WRITE_ONCE(ri->map, NULL); 3613 3616 3614 - fwd = __xdp_map_lookup_elem(map, index); 3615 - if (unlikely(!fwd)) { 3616 - err = -EINVAL; 3617 - goto err; 3618 - } 3619 3617 if 
(ri->map_to_flush && unlikely(ri->map_to_flush != map)) 3620 3618 xdp_do_flush_map(); 3621 3619 ··· 3647 3653 struct bpf_map *map) 3648 3654 { 3649 3655 struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); 3650 - u32 index = ri->ifindex; 3651 - void *fwd = NULL; 3656 + u32 index = ri->tgt_index; 3657 + void *fwd = ri->tgt_value; 3652 3658 int err = 0; 3653 3659 3654 - ri->ifindex = 0; 3660 + ri->tgt_index = 0; 3661 + ri->tgt_value = NULL; 3655 3662 WRITE_ONCE(ri->map, NULL); 3656 - 3657 - fwd = __xdp_map_lookup_elem(map, index); 3658 - if (unlikely(!fwd)) { 3659 - err = -EINVAL; 3660 - goto err; 3661 - } 3662 3663 3663 3664 if (map->map_type == BPF_MAP_TYPE_DEVMAP) { 3664 3665 struct bpf_dtab_netdev *dst = fwd; ··· 3686 3697 { 3687 3698 struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); 3688 3699 struct bpf_map *map = READ_ONCE(ri->map); 3689 - u32 index = ri->ifindex; 3700 + u32 index = ri->tgt_index; 3690 3701 struct net_device *fwd; 3691 3702 int err = 0; 3692 3703 3693 3704 if (map) 3694 3705 return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog, 3695 3706 map); 3696 - ri->ifindex = 0; 3707 + ri->tgt_index = 0; 3697 3708 fwd = dev_get_by_index_rcu(dev_net(dev), index); 3698 3709 if (unlikely(!fwd)) { 3699 3710 err = -EINVAL; ··· 3721 3732 if (unlikely(flags)) 3722 3733 return XDP_ABORTED; 3723 3734 3724 - ri->ifindex = ifindex; 3725 3735 ri->flags = flags; 3736 + ri->tgt_index = ifindex; 3737 + ri->tgt_value = NULL; 3726 3738 WRITE_ONCE(ri->map, NULL); 3727 3739 3728 3740 return XDP_REDIRECT; ··· 3742 3752 { 3743 3753 struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); 3744 3754 3745 - if (unlikely(flags)) 3755 + /* Lower bits of the flags are used as return code on lookup failure */ 3756 + if (unlikely(flags > XDP_TX)) 3746 3757 return XDP_ABORTED; 3747 3758 3748 - ri->ifindex = ifindex; 3759 + ri->tgt_value = __xdp_map_lookup_elem(map, ifindex); 3760 + if (unlikely(!ri->tgt_value)) { 3761 + /* If the lookup fails we 
want to clear out the state in the 3762 + * redirect_info struct completely, so that if an eBPF program 3763 + * performs multiple lookups, the last one always takes 3764 + * precedence. 3765 + */ 3766 + WRITE_ONCE(ri->map, NULL); 3767 + return flags; 3768 + } 3769 + 3749 3770 ri->flags = flags; 3771 + ri->tgt_index = ifindex; 3750 3772 WRITE_ONCE(ri->map, map); 3751 3773 3752 3774 return XDP_REDIRECT; ··· 5194 5192 }; 5195 5193 #endif /* CONFIG_IPV6_SEG6_BPF */ 5196 5194 5197 - #define CONVERT_COMMON_TCP_SOCK_FIELDS(md_type, CONVERT) \ 5198 - do { \ 5199 - switch (si->off) { \ 5200 - case offsetof(md_type, snd_cwnd): \ 5201 - CONVERT(snd_cwnd); break; \ 5202 - case offsetof(md_type, srtt_us): \ 5203 - CONVERT(srtt_us); break; \ 5204 - case offsetof(md_type, snd_ssthresh): \ 5205 - CONVERT(snd_ssthresh); break; \ 5206 - case offsetof(md_type, rcv_nxt): \ 5207 - CONVERT(rcv_nxt); break; \ 5208 - case offsetof(md_type, snd_nxt): \ 5209 - CONVERT(snd_nxt); break; \ 5210 - case offsetof(md_type, snd_una): \ 5211 - CONVERT(snd_una); break; \ 5212 - case offsetof(md_type, mss_cache): \ 5213 - CONVERT(mss_cache); break; \ 5214 - case offsetof(md_type, ecn_flags): \ 5215 - CONVERT(ecn_flags); break; \ 5216 - case offsetof(md_type, rate_delivered): \ 5217 - CONVERT(rate_delivered); break; \ 5218 - case offsetof(md_type, rate_interval_us): \ 5219 - CONVERT(rate_interval_us); break; \ 5220 - case offsetof(md_type, packets_out): \ 5221 - CONVERT(packets_out); break; \ 5222 - case offsetof(md_type, retrans_out): \ 5223 - CONVERT(retrans_out); break; \ 5224 - case offsetof(md_type, total_retrans): \ 5225 - CONVERT(total_retrans); break; \ 5226 - case offsetof(md_type, segs_in): \ 5227 - CONVERT(segs_in); break; \ 5228 - case offsetof(md_type, data_segs_in): \ 5229 - CONVERT(data_segs_in); break; \ 5230 - case offsetof(md_type, segs_out): \ 5231 - CONVERT(segs_out); break; \ 5232 - case offsetof(md_type, data_segs_out): \ 5233 - CONVERT(data_segs_out); break; \ 5234 - case 
offsetof(md_type, lost_out): \ 5235 - CONVERT(lost_out); break; \ 5236 - case offsetof(md_type, sacked_out): \ 5237 - CONVERT(sacked_out); break; \ 5238 - case offsetof(md_type, bytes_received): \ 5239 - CONVERT(bytes_received); break; \ 5240 - case offsetof(md_type, bytes_acked): \ 5241 - CONVERT(bytes_acked); break; \ 5242 - } \ 5243 - } while (0) 5244 - 5245 5195 #ifdef CONFIG_INET 5246 5196 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, 5247 5197 int dif, int sdif, u8 family, u8 proto) ··· 5544 5590 bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, 5545 5591 struct bpf_insn_access_aux *info) 5546 5592 { 5547 - if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, bytes_acked)) 5593 + if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, 5594 + icsk_retransmits)) 5548 5595 return false; 5549 5596 5550 5597 if (off % size != 0) ··· 5576 5621 offsetof(struct tcp_sock, FIELD)); \ 5577 5622 } while (0) 5578 5623 5579 - CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_tcp_sock, 5580 - BPF_TCP_SOCK_GET_COMMON); 5624 + #define BPF_INET_SOCK_GET_COMMON(FIELD) \ 5625 + do { \ 5626 + BUILD_BUG_ON(FIELD_SIZEOF(struct inet_connection_sock, \ 5627 + FIELD) > \ 5628 + FIELD_SIZEOF(struct bpf_tcp_sock, FIELD)); \ 5629 + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ 5630 + struct inet_connection_sock, \ 5631 + FIELD), \ 5632 + si->dst_reg, si->src_reg, \ 5633 + offsetof( \ 5634 + struct inet_connection_sock, \ 5635 + FIELD)); \ 5636 + } while (0) 5581 5637 5582 5638 if (insn > insn_buf) 5583 5639 return insn - insn_buf; ··· 5604 5638 offsetof(struct tcp_sock, rtt_min) + 5605 5639 offsetof(struct minmax_sample, v)); 5606 5640 break; 5641 + case offsetof(struct bpf_tcp_sock, snd_cwnd): 5642 + BPF_TCP_SOCK_GET_COMMON(snd_cwnd); 5643 + break; 5644 + case offsetof(struct bpf_tcp_sock, srtt_us): 5645 + BPF_TCP_SOCK_GET_COMMON(srtt_us); 5646 + break; 5647 + case offsetof(struct bpf_tcp_sock, snd_ssthresh): 5648 + 
BPF_TCP_SOCK_GET_COMMON(snd_ssthresh); 5649 + break; 5650 + case offsetof(struct bpf_tcp_sock, rcv_nxt): 5651 + BPF_TCP_SOCK_GET_COMMON(rcv_nxt); 5652 + break; 5653 + case offsetof(struct bpf_tcp_sock, snd_nxt): 5654 + BPF_TCP_SOCK_GET_COMMON(snd_nxt); 5655 + break; 5656 + case offsetof(struct bpf_tcp_sock, snd_una): 5657 + BPF_TCP_SOCK_GET_COMMON(snd_una); 5658 + break; 5659 + case offsetof(struct bpf_tcp_sock, mss_cache): 5660 + BPF_TCP_SOCK_GET_COMMON(mss_cache); 5661 + break; 5662 + case offsetof(struct bpf_tcp_sock, ecn_flags): 5663 + BPF_TCP_SOCK_GET_COMMON(ecn_flags); 5664 + break; 5665 + case offsetof(struct bpf_tcp_sock, rate_delivered): 5666 + BPF_TCP_SOCK_GET_COMMON(rate_delivered); 5667 + break; 5668 + case offsetof(struct bpf_tcp_sock, rate_interval_us): 5669 + BPF_TCP_SOCK_GET_COMMON(rate_interval_us); 5670 + break; 5671 + case offsetof(struct bpf_tcp_sock, packets_out): 5672 + BPF_TCP_SOCK_GET_COMMON(packets_out); 5673 + break; 5674 + case offsetof(struct bpf_tcp_sock, retrans_out): 5675 + BPF_TCP_SOCK_GET_COMMON(retrans_out); 5676 + break; 5677 + case offsetof(struct bpf_tcp_sock, total_retrans): 5678 + BPF_TCP_SOCK_GET_COMMON(total_retrans); 5679 + break; 5680 + case offsetof(struct bpf_tcp_sock, segs_in): 5681 + BPF_TCP_SOCK_GET_COMMON(segs_in); 5682 + break; 5683 + case offsetof(struct bpf_tcp_sock, data_segs_in): 5684 + BPF_TCP_SOCK_GET_COMMON(data_segs_in); 5685 + break; 5686 + case offsetof(struct bpf_tcp_sock, segs_out): 5687 + BPF_TCP_SOCK_GET_COMMON(segs_out); 5688 + break; 5689 + case offsetof(struct bpf_tcp_sock, data_segs_out): 5690 + BPF_TCP_SOCK_GET_COMMON(data_segs_out); 5691 + break; 5692 + case offsetof(struct bpf_tcp_sock, lost_out): 5693 + BPF_TCP_SOCK_GET_COMMON(lost_out); 5694 + break; 5695 + case offsetof(struct bpf_tcp_sock, sacked_out): 5696 + BPF_TCP_SOCK_GET_COMMON(sacked_out); 5697 + break; 5698 + case offsetof(struct bpf_tcp_sock, bytes_received): 5699 + BPF_TCP_SOCK_GET_COMMON(bytes_received); 5700 + break; 5701 + case 
offsetof(struct bpf_tcp_sock, bytes_acked): 5702 + BPF_TCP_SOCK_GET_COMMON(bytes_acked); 5703 + break; 5704 + case offsetof(struct bpf_tcp_sock, dsack_dups): 5705 + BPF_TCP_SOCK_GET_COMMON(dsack_dups); 5706 + break; 5707 + case offsetof(struct bpf_tcp_sock, delivered): 5708 + BPF_TCP_SOCK_GET_COMMON(delivered); 5709 + break; 5710 + case offsetof(struct bpf_tcp_sock, delivered_ce): 5711 + BPF_TCP_SOCK_GET_COMMON(delivered_ce); 5712 + break; 5713 + case offsetof(struct bpf_tcp_sock, icsk_retransmits): 5714 + BPF_INET_SOCK_GET_COMMON(icsk_retransmits); 5715 + break; 5607 5716 } 5608 5717 5609 5718 return insn - insn_buf; ··· 5692 5651 return (unsigned long)NULL; 5693 5652 } 5694 5653 5695 - static const struct bpf_func_proto bpf_tcp_sock_proto = { 5654 + const struct bpf_func_proto bpf_tcp_sock_proto = { 5696 5655 .func = bpf_tcp_sock, 5697 5656 .gpl_only = false, 5698 5657 .ret_type = RET_PTR_TO_TCP_SOCK_OR_NULL, ··· 7952 7911 SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ); \ 7953 7912 } while (0) 7954 7913 7955 - CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_sock_ops, 7956 - SOCK_OPS_GET_TCP_SOCK_FIELD); 7957 - 7958 7914 if (insn > insn_buf) 7959 7915 return insn - insn_buf; 7960 7916 ··· 8120 8082 case offsetof(struct bpf_sock_ops, sk_txhash): 8121 8083 SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash, 8122 8084 struct sock, type); 8085 + break; 8086 + case offsetof(struct bpf_sock_ops, snd_cwnd): 8087 + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_cwnd); 8088 + break; 8089 + case offsetof(struct bpf_sock_ops, srtt_us): 8090 + SOCK_OPS_GET_TCP_SOCK_FIELD(srtt_us); 8091 + break; 8092 + case offsetof(struct bpf_sock_ops, snd_ssthresh): 8093 + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_ssthresh); 8094 + break; 8095 + case offsetof(struct bpf_sock_ops, rcv_nxt): 8096 + SOCK_OPS_GET_TCP_SOCK_FIELD(rcv_nxt); 8097 + break; 8098 + case offsetof(struct bpf_sock_ops, snd_nxt): 8099 + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_nxt); 8100 + break; 8101 + case offsetof(struct bpf_sock_ops, snd_una): 8102 + 
SOCK_OPS_GET_TCP_SOCK_FIELD(snd_una); 8103 + break; 8104 + case offsetof(struct bpf_sock_ops, mss_cache): 8105 + SOCK_OPS_GET_TCP_SOCK_FIELD(mss_cache); 8106 + break; 8107 + case offsetof(struct bpf_sock_ops, ecn_flags): 8108 + SOCK_OPS_GET_TCP_SOCK_FIELD(ecn_flags); 8109 + break; 8110 + case offsetof(struct bpf_sock_ops, rate_delivered): 8111 + SOCK_OPS_GET_TCP_SOCK_FIELD(rate_delivered); 8112 + break; 8113 + case offsetof(struct bpf_sock_ops, rate_interval_us): 8114 + SOCK_OPS_GET_TCP_SOCK_FIELD(rate_interval_us); 8115 + break; 8116 + case offsetof(struct bpf_sock_ops, packets_out): 8117 + SOCK_OPS_GET_TCP_SOCK_FIELD(packets_out); 8118 + break; 8119 + case offsetof(struct bpf_sock_ops, retrans_out): 8120 + SOCK_OPS_GET_TCP_SOCK_FIELD(retrans_out); 8121 + break; 8122 + case offsetof(struct bpf_sock_ops, total_retrans): 8123 + SOCK_OPS_GET_TCP_SOCK_FIELD(total_retrans); 8124 + break; 8125 + case offsetof(struct bpf_sock_ops, segs_in): 8126 + SOCK_OPS_GET_TCP_SOCK_FIELD(segs_in); 8127 + break; 8128 + case offsetof(struct bpf_sock_ops, data_segs_in): 8129 + SOCK_OPS_GET_TCP_SOCK_FIELD(data_segs_in); 8130 + break; 8131 + case offsetof(struct bpf_sock_ops, segs_out): 8132 + SOCK_OPS_GET_TCP_SOCK_FIELD(segs_out); 8133 + break; 8134 + case offsetof(struct bpf_sock_ops, data_segs_out): 8135 + SOCK_OPS_GET_TCP_SOCK_FIELD(data_segs_out); 8136 + break; 8137 + case offsetof(struct bpf_sock_ops, lost_out): 8138 + SOCK_OPS_GET_TCP_SOCK_FIELD(lost_out); 8139 + break; 8140 + case offsetof(struct bpf_sock_ops, sacked_out): 8141 + SOCK_OPS_GET_TCP_SOCK_FIELD(sacked_out); 8142 + break; 8143 + case offsetof(struct bpf_sock_ops, bytes_received): 8144 + SOCK_OPS_GET_TCP_SOCK_FIELD(bytes_received); 8145 + break; 8146 + case offsetof(struct bpf_sock_ops, bytes_acked): 8147 + SOCK_OPS_GET_TCP_SOCK_FIELD(bytes_acked); 8123 8148 break; 8124 8149 case offsetof(struct bpf_sock_ops, sk): 8125 8150 *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
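The filter.c changes above move the devmap/xskmap lookup from flush time into `bpf_redirect_map()` itself: the result is cached in `ri->tgt_value`, and on a lookup miss the lower bits of the flags argument are returned directly as the XDP action, so a program can fall back to e.g. XDP_PASS without a second map access. A stripped-down sketch of that new contract (the lookup and the `redirect_info` struct are simplified stand-ins; the kernel also clears `ri->map` on a miss, which this sketch omits):

```c
#include <stdint.h>
#include <stddef.h>

enum { XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

struct redirect_info_sketch {
	uint32_t tgt_index;
	void *tgt_value;
};

/* Trivial stand-in for __xdp_map_lookup_elem(). */
static void *map_lookup_sketch(void **slots, int nslots, uint32_t key)
{
	return key < (uint32_t)nslots ? slots[key] : NULL;
}

static int bpf_redirect_map_sketch(struct redirect_info_sketch *ri,
				   void **slots, int nslots,
				   uint32_t key, uint32_t flags)
{
	/* Only action codes up to XDP_TX are allowed in the flags. */
	if (flags > XDP_TX)
		return XDP_ABORTED;

	ri->tgt_value = map_lookup_sketch(slots, nslots, key);
	if (!ri->tgt_value)
		return flags;	/* miss: fall back to the action in flags */

	ri->tgt_index = key;
	return XDP_REDIRECT;
}
```

This is also why `xdp_do_redirect_map()` above no longer needs its own lookup-and-`-EINVAL` path: by the time it runs, `tgt_value` is already known to be valid.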
+1 -1
net/core/xdp.c
··· 85 85 kfree(xa); 86 86 } 87 87 88 - bool __mem_id_disconnect(int id, bool force) 88 + static bool __mem_id_disconnect(int id, bool force) 89 89 { 90 90 struct xdp_mem_allocator *xa; 91 91 bool safe_to_remove = true;
+4
net/ipv4/tcp_input.c
··· 778 778 tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2; 779 779 tp->rtt_seq = tp->snd_nxt; 780 780 tp->mdev_max_us = tcp_rto_min_us(sk); 781 + 782 + tcp_bpf_rtt(sk); 781 783 } 782 784 } else { 783 785 /* no previous measure. */ ··· 788 786 tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk)); 789 787 tp->mdev_max_us = tp->rttvar_us; 790 788 tp->rtt_seq = tp->snd_nxt; 789 + 790 + tcp_bpf_rtt(sk); 791 791 } 792 792 tp->srtt_us = max(1U, srtt); 793 793 }
+30
net/socket.c
··· 2050 2050 static int __sys_setsockopt(int fd, int level, int optname, 2051 2051 char __user *optval, int optlen) 2052 2052 { 2053 + mm_segment_t oldfs = get_fs(); 2054 + char *kernel_optval = NULL; 2053 2055 int err, fput_needed; 2054 2056 struct socket *sock; 2055 2057 ··· 2064 2062 if (err) 2065 2063 goto out_put; 2066 2064 2065 + err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, &level, 2066 + &optname, optval, &optlen, 2067 + &kernel_optval); 2068 + 2069 + if (err < 0) { 2070 + goto out_put; 2071 + } else if (err > 0) { 2072 + err = 0; 2073 + goto out_put; 2074 + } 2075 + 2076 + if (kernel_optval) { 2077 + set_fs(KERNEL_DS); 2078 + optval = (char __user __force *)kernel_optval; 2079 + } 2080 + 2067 2081 if (level == SOL_SOCKET) 2068 2082 err = 2069 2083 sock_setsockopt(sock, level, optname, optval, ··· 2088 2070 err = 2089 2071 sock->ops->setsockopt(sock, level, optname, optval, 2090 2072 optlen); 2073 + 2074 + if (kernel_optval) { 2075 + set_fs(oldfs); 2076 + kfree(kernel_optval); 2077 + } 2091 2078 out_put: 2092 2079 fput_light(sock->file, fput_needed); 2093 2080 } ··· 2115 2092 { 2116 2093 int err, fput_needed; 2117 2094 struct socket *sock; 2095 + int max_optlen; 2118 2096 2119 2097 sock = sockfd_lookup_light(fd, &err, &fput_needed); 2120 2098 if (sock != NULL) { 2121 2099 err = security_socket_getsockopt(sock, level, optname); 2122 2100 if (err) 2123 2101 goto out_put; 2102 + 2103 + max_optlen = BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen); 2124 2104 2125 2105 if (level == SOL_SOCKET) 2126 2106 err = ··· 2133 2107 err = 2134 2108 sock->ops->getsockopt(sock, level, optname, optval, 2135 2109 optlen); 2110 + 2111 + err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname, 2112 + optval, optlen, 2113 + max_optlen, err); 2136 2114 out_put: 2137 2115 fput_light(sock->file, fput_needed); 2138 2116 }
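The `__sys_setsockopt()` hunk above gives the cgroup hook a three-way contract: a negative return aborts the syscall, a positive return means the BPF program consumed the call (the kernel path is skipped and 0 is reported to userspace), and zero falls through to the normal path, optionally with a kernel-allocated replacement buffer. A stripped-down sketch of just that control flow (function and variable names are hypothetical):

```c
/* hook_ret: result of the (hypothetical) cgroup BPF hook.
 * *ran_kernel_path reports whether the normal setsockopt path ran. */
static int setsockopt_dispatch(int hook_ret, int *ran_kernel_path)
{
	*ran_kernel_path = 0;
	if (hook_ret < 0)
		return hook_ret;	/* hook rejected the call outright */
	if (hook_ret > 0)
		return 0;		/* hook consumed it; report success */
	*ran_kernel_path = 1;		/* fall through to the kernel path */
	return 0;
}
```

The `set_fs(KERNEL_DS)` dance in the real hunk exists because the replacement buffer is kernel memory being passed through an interface that expects a `__user` pointer.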
+29 -7
net/xdp/xsk.c
··· 37 37 READ_ONCE(xs->umem->fq); 38 38 } 39 39 40 + bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt) 41 + { 42 + return xskq_has_addrs(umem->fq, cnt); 43 + } 44 + EXPORT_SYMBOL(xsk_umem_has_addrs); 45 + 40 46 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr) 41 47 { 42 48 return xskq_peek_addr(umem->fq, addr); ··· 172 166 } 173 167 EXPORT_SYMBOL(xsk_umem_consume_tx_done); 174 168 175 - bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len) 169 + bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc) 176 170 { 177 - struct xdp_desc desc; 178 171 struct xdp_sock *xs; 179 172 180 173 rcu_read_lock(); 181 174 list_for_each_entry_rcu(xs, &umem->xsk_list, list) { 182 - if (!xskq_peek_desc(xs->tx, &desc)) 175 + if (!xskq_peek_desc(xs->tx, desc)) 183 176 continue; 184 177 185 - if (xskq_produce_addr_lazy(umem->cq, desc.addr)) 178 + if (xskq_produce_addr_lazy(umem->cq, desc->addr)) 186 179 goto out; 187 - 188 - *dma = xdp_umem_get_dma(umem, desc.addr); 189 - *len = desc.len; 190 180 191 181 xskq_discard_desc(xs->tx); 192 182 rcu_read_unlock(); ··· 640 638 641 639 len = sizeof(off); 642 640 if (copy_to_user(optval, &off, len)) 641 + return -EFAULT; 642 + if (put_user(len, optlen)) 643 + return -EFAULT; 644 + 645 + return 0; 646 + } 647 + case XDP_OPTIONS: 648 + { 649 + struct xdp_options opts = {}; 650 + 651 + if (len < sizeof(opts)) 652 + return -EINVAL; 653 + 654 + mutex_lock(&xs->mutex); 655 + if (xs->zc) 656 + opts.flags |= XDP_OPTIONS_ZEROCOPY; 657 + mutex_unlock(&xs->mutex); 658 + 659 + len = sizeof(opts); 660 + if (copy_to_user(optval, &opts, len)) 643 661 return -EFAULT; 644 662 if (put_user(len, optlen)) 645 663 return -EFAULT;
+14
net/xdp/xsk_queue.h
··· 117 117 return q->nentries - (producer - q->cons_tail); 118 118 } 119 119 120 + static inline bool xskq_has_addrs(struct xsk_queue *q, u32 cnt) 121 + { 122 + u32 entries = q->prod_tail - q->cons_tail; 123 + 124 + if (entries >= cnt) 125 + return true; 126 + 127 + /* Refresh the local pointer. */ 128 + q->prod_tail = READ_ONCE(q->ring->producer); 129 + entries = q->prod_tail - q->cons_tail; 130 + 131 + return entries >= cnt; 132 + } 133 + 120 134 /* UMEM queue */ 121 135 122 136 static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
+3
samples/bpf/Makefile
··· 154 154 always += tcp_clamp_kern.o 155 155 always += tcp_basertt_kern.o 156 156 always += tcp_tos_reflect_kern.o 157 + always += tcp_dumpstats_kern.o 157 158 always += xdp_redirect_kern.o 158 159 always += xdp_redirect_map_kern.o 159 160 always += xdp_redirect_cpu_kern.o ··· 169 168 always += xdp_sample_pkts_kern.o 170 169 always += ibumad_kern.o 171 170 always += hbm_out_kern.o 171 + always += hbm_edt_kern.o 172 172 173 173 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include 174 174 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/ ··· 274 272 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h 275 273 $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h 276 274 $(obj)/hbm.o: $(src)/hbm.h 275 + $(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h 277 276 278 277 # asm/sysreg.h - inline assembly used by it is incompatible with llvm. 279 278 # But, there is no easy way to fix it, so just exclude it since it is
+12 -10
samples/bpf/do_hbm_test.sh
··· 14 14 echo "loads. The output is the goodput in Mbps (unless -D was used)." 15 15 echo "" 16 16 echo "USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>]" 17 - echo " [-D] [-d=<delay>|--delay=<delay>] [--debug] [-E]" 17 + echo " [-D] [-d=<delay>|--delay=<delay>] [--debug] [-E] [--edt]" 18 18 echo " [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id >]" 19 19 echo " [-l] [-N] [--no_cn] [-p=<port>|--port=<port>] [-P]" 20 20 echo " [-q=<qdisc>] [-R] [-s=<server>|--server=<server]" ··· 30 30 echo " other detailed information. This information is" 31 31 echo " test dependent (i.e. iperf3 or netperf)." 32 32 echo " -E enable ECN (not required for dctcp)" 33 + echo " --edt use fq's Earliest Departure Time (requires fq)" 33 34 echo " -f or --flows number of concurrent flows (default=1)" 34 35 echo " -i or --id cgroup id (an integer, default is 1)" 35 36 echo " -N use netperf instead of iperf3" ··· 131 130 details=1 132 131 ;; 133 132 -E) 134 - ecn=1 133 + ecn=1 134 + ;; 135 + --edt) 136 + flags="$flags --edt" 137 + qdisc="fq" 135 138 ;; 136 - # Support for upcomming fq Early Departure Time egress rate limiting 137 - #--edt) 138 - # prog="hbm_out_edt_kern.o" 139 - # qdisc="fq" 140 - # ;; 141 139 -f=*|--flows=*) 142 140 flows="${i#*=}" 143 141 ;; ··· 228 228 tc qdisc del dev lo root > /dev/null 2>&1 229 229 tc qdisc add dev lo root netem delay $netem\ms > /dev/null 2>&1 230 230 elif [ "$qdisc" != "" ] ; then 231 - tc qdisc del dev lo root > /dev/null 2>&1 232 - tc qdisc add dev lo root $qdisc > /dev/null 2>&1 231 + tc qdisc del dev eth0 root > /dev/null 2>&1 232 + tc qdisc add dev eth0 root $qdisc > /dev/null 2>&1 233 233 fi 234 234 235 235 n=0 ··· 399 399 if [ "$netem" -ne "0" ] ; then 400 400 tc qdisc del dev lo root > /dev/null 2>&1 401 401 fi 402 - 402 + if [ "$qdisc" != "" ] ; then 403 + tc qdisc del dev eth0 root > /dev/null 2>&1 404 + fi 403 405 sleep 2 404 406 405 407 hbmPid=`ps ax | grep "hbm " | grep --invert-match "grep" | awk '{ print $1 }'`
+15 -3
samples/bpf/hbm.c
··· 62 62 bool debugFlag; 63 63 bool work_conserving_flag; 64 64 bool no_cn_flag; 65 + bool edt_flag; 65 66 66 67 static void Usage(void); 67 68 static void read_trace_pipe2(void); ··· 373 372 fprintf(fout, "avg rtt:%d\n", 374 373 (int)(qstats.sum_rtt / (qstats.pkts_total + 1))); 375 374 // Average credit 376 - fprintf(fout, "avg credit:%d\n", 377 - (int)(qstats.sum_credit / 378 - (1500 * ((int)qstats.pkts_total) + 1))); 375 + if (edt_flag) 376 + fprintf(fout, "avg credit_ms:%.03f\n", 377 + (qstats.sum_credit / 378 + (qstats.pkts_total + 1.0)) / 1000000.0); 379 + else 380 + fprintf(fout, "avg credit:%d\n", 381 + (int)(qstats.sum_credit / 382 + (1500 * ((int)qstats.pkts_total ) + 1))); 379 383 380 384 // Return values stats 381 385 for (k = 0; k < RET_VAL_COUNT; k++) { ··· 414 408 " Where:\n" 415 409 " -o indicates egress direction (default)\n" 416 410 " -d print BPF trace debug buffer\n" 411 + " --edt use fq's Earliest Departure Time\n" 417 412 " -l also limit flows using loopback\n" 418 413 " -n <#> to create cgroup \"/hbm#\" and attach prog\n" 419 414 " Default is /hbm1\n" ··· 440 433 char *optstring = "iodln:r:st:wh"; 441 434 struct option loptions[] = { 442 435 {"no_cn", 0, NULL, 1}, 436 + {"edt", 0, NULL, 2}, 443 437 {NULL, 0, NULL, 0} 444 438 }; 445 439 ··· 448 440 switch (k) { 449 441 case 1: 450 442 no_cn_flag = true; 443 + break; 444 + case 2: 445 + prog = "hbm_edt_kern.o"; 446 + edt_flag = true; 451 447 break; 452 448 case'o': 453 449 break;
+168
samples/bpf/hbm_edt_kern.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Copyright (c) 2019 Facebook 3 + * 4 + * This program is free software; you can redistribute it and/or 5 + * modify it under the terms of version 2 of the GNU General Public 6 + * License as published by the Free Software Foundation. 7 + * 8 + * Sample Host Bandwidth Manager (HBM) BPF program. 9 + * 10 + * A cgroup skb BPF egress program to limit cgroup output bandwidth. 11 + * It uses a modified virtual token bucket queue to limit average 12 + * egress bandwidth. The implementation uses credits instead of tokens. 13 + * Negative credits imply that queueing would have happened (this is 14 + * a virtual queue, so no queueing is done by it. However, queueing may 15 + * occur at the actual qdisc (which is not used for rate limiting). 16 + * 17 + * This implementation uses 3 thresholds, one to start marking packets and 18 + * the other two to drop packets: 19 + * CREDIT 20 + * - <--------------------------|------------------------> + 21 + * | | | 0 22 + * | Large pkt | 23 + * | drop thresh | 24 + * Small pkt drop Mark threshold 25 + * thresh 26 + * 27 + * The effect of marking depends on the type of packet: 28 + * a) If the packet is ECN enabled and it is a TCP packet, then the packet 29 + * is ECN marked. 30 + * b) If the packet is a TCP packet, then we probabilistically call tcp_cwr 31 + * to reduce the congestion window. The current implementation uses a linear 32 + * distribution (0% probability at marking threshold, 100% probability 33 + * at drop threshold). 34 + * c) If the packet is not a TCP packet, then it is dropped. 35 + * 36 + * If the credit is below the drop threshold, the packet is dropped. If it 37 + * is a TCP packet, then it also calls tcp_cwr since packets dropped by 38 + * by a cgroup skb BPF program do not automatically trigger a call to 39 + * tcp_cwr in the current kernel code. 
40 + * 41 + * This BPF program actually uses 2 drop thresholds, one threshold 42 + * for larger packets (>= 120 bytes) and another for smaller packets. This 43 + * protects smaller packets such as SYNs, ACKs, etc. 44 + * 45 + * The default bandwidth limit is set at 1Gbps but this can be changed by 46 + * a user program through a shared BPF map. In addition, by default this BPF 47 + * program does not limit connections using loopback. This behavior can be 48 + * overwritten by the user program. There is also an option to calculate 49 + * some statistics, such as percent of packets marked or dropped, which 50 + * a user program, such as hbm, can access. 51 + */ 52 + 53 + #include "hbm_kern.h" 54 + 55 + SEC("cgroup_skb/egress") 56 + int _hbm_out_cg(struct __sk_buff *skb) 57 + { 58 + long long delta = 0, delta_send; 59 + unsigned long long curtime, sendtime; 60 + struct hbm_queue_stats *qsp = NULL; 61 + unsigned int queue_index = 0; 62 + bool congestion_flag = false; 63 + bool ecn_ce_flag = false; 64 + struct hbm_pkt_info pkti = {}; 65 + struct hbm_vqueue *qdp; 66 + bool drop_flag = false; 67 + bool cwr_flag = false; 68 + int len = skb->len; 69 + int rv = ALLOW_PKT; 70 + 71 + qsp = bpf_map_lookup_elem(&queue_stats, &queue_index); 72 + 73 + // Check if we should ignore loopback traffic 74 + if (qsp != NULL && !qsp->loopback && (skb->ifindex == 1)) 75 + return ALLOW_PKT; 76 + 77 + hbm_get_pkt_info(skb, &pkti); 78 + 79 + // We may want to account for the length of headers in len 80 + // calculation, like ETH header + overhead, specially if it 81 + // is a gso packet. But I am not doing it right now. 
82 + 83 + qdp = bpf_get_local_storage(&queue_state, 0); 84 + if (!qdp) 85 + return ALLOW_PKT; 86 + if (qdp->lasttime == 0) 87 + hbm_init_edt_vqueue(qdp, 1024); 88 + 89 + curtime = bpf_ktime_get_ns(); 90 + 91 + // Begin critical section 92 + bpf_spin_lock(&qdp->lock); 93 + delta = qdp->lasttime - curtime; 94 + // bound bursts to 100us 95 + if (delta < -BURST_SIZE_NS) { 96 + // negative delta is a credit that allows bursts 97 + qdp->lasttime = curtime - BURST_SIZE_NS; 98 + delta = -BURST_SIZE_NS; 99 + } 100 + sendtime = qdp->lasttime; 101 + delta_send = BYTES_TO_NS(len, qdp->rate); 102 + __sync_add_and_fetch(&(qdp->lasttime), delta_send); 103 + bpf_spin_unlock(&qdp->lock); 104 + // End critical section 105 + 106 + // Set EDT of packet 107 + skb->tstamp = sendtime; 108 + 109 + // Check if we should update rate 110 + if (qsp != NULL && (qsp->rate * 128) != qdp->rate) 111 + qdp->rate = qsp->rate * 128; 112 + 113 + // Set flags (drop, congestion, cwr) 114 + // last packet will be sent in the future, bound latency 115 + if (delta > DROP_THRESH_NS || (delta > LARGE_PKT_DROP_THRESH_NS && 116 + len > LARGE_PKT_THRESH)) { 117 + drop_flag = true; 118 + if (pkti.is_tcp && pkti.ecn == 0) 119 + cwr_flag = true; 120 + } else if (delta > MARK_THRESH_NS) { 121 + if (pkti.is_tcp) 122 + congestion_flag = true; 123 + else 124 + drop_flag = true; 125 + } 126 + 127 + if (congestion_flag) { 128 + if (bpf_skb_ecn_set_ce(skb)) { 129 + ecn_ce_flag = true; 130 + } else { 131 + if (pkti.is_tcp) { 132 + unsigned int rand = bpf_get_prandom_u32(); 133 + 134 + if (delta >= MARK_THRESH_NS + 135 + (rand % MARK_REGION_SIZE_NS)) { 136 + // Do congestion control 137 + cwr_flag = true; 138 + } 139 + } else if (len > LARGE_PKT_THRESH) { 140 + // Problem if too many small packets? 
141 + drop_flag = true; 142 + congestion_flag = false; 143 + } 144 + } 145 + } 146 + 147 + if (pkti.is_tcp && drop_flag && pkti.packets_out <= 1) { 148 + drop_flag = false; 149 + cwr_flag = true; 150 + congestion_flag = false; 151 + } 152 + 153 + if (qsp != NULL && qsp->no_cn) 154 + cwr_flag = false; 155 + 156 + hbm_update_stats(qsp, len, curtime, congestion_flag, drop_flag, 157 + cwr_flag, ecn_ce_flag, &pkti, (int) delta); 158 + 159 + if (drop_flag) { 160 + __sync_add_and_fetch(&(qdp->lasttime), -delta_send); 161 + rv = DROP_PKT; 162 + } 163 + 164 + if (cwr_flag) 165 + rv |= CWR; 166 + return rv; 167 + } 168 + char _license[] SEC("license") = "GPL";
+34 -6
samples/bpf/hbm_kern.h
··· 29 29 #define DROP_PKT 0 30 30 #define ALLOW_PKT 1 31 31 #define TCP_ECN_OK 1 32 + #define CWR 2 32 33 33 34 #ifndef HBM_DEBUG // Define HBM_DEBUG to enable debugging 34 35 #undef bpf_printk ··· 46 45 #define MAX_CREDIT (100 * MAX_BYTES_PER_PACKET) 47 46 #define INIT_CREDIT (INITIAL_CREDIT_PACKETS * MAX_BYTES_PER_PACKET) 48 47 48 + // Time base accounting for fq's EDT 49 + #define BURST_SIZE_NS 100000 // 100us 50 + #define MARK_THRESH_NS 50000 // 50us 51 + #define DROP_THRESH_NS 500000 // 500us 52 + // Reserve 20us of queuing for small packets (less than 120 bytes) 53 + #define LARGE_PKT_DROP_THRESH_NS (DROP_THRESH_NS - 20000) 54 + #define MARK_REGION_SIZE_NS (LARGE_PKT_DROP_THRESH_NS - MARK_THRESH_NS) 55 + 49 56 // rate in bytes per ns << 20 50 57 #define CREDIT_PER_NS(delta, rate) ((((u64)(delta)) * (rate)) >> 20) 58 + #define BYTES_PER_NS(delta, rate) ((((u64)(delta)) * (rate)) >> 20) 59 + #define BYTES_TO_NS(bytes, rate) div64_u64(((u64)(bytes)) << 20, (u64)(rate)) 51 60 52 61 struct bpf_map_def SEC("maps") queue_state = { 53 62 .type = BPF_MAP_TYPE_CGROUP_STORAGE, ··· 78 67 struct hbm_pkt_info { 79 68 int cwnd; 80 69 int rtt; 70 + int packets_out; 81 71 bool is_ip; 82 72 bool is_tcp; 83 73 short ecn; ··· 98 86 if (tp) { 99 87 pkti->cwnd = tp->snd_cwnd; 100 88 pkti->rtt = tp->srtt_us >> 3; 89 + pkti->packets_out = tp->packets_out; 101 90 return 0; 102 91 } 103 92 } 104 93 } 105 94 } 95 + pkti->cwnd = 0; 96 + pkti->rtt = 0; 97 + pkti->packets_out = 0; 106 98 return 1; 107 99 } 108 100 109 - static __always_inline void hbm_get_pkt_info(struct __sk_buff *skb, 110 - struct hbm_pkt_info *pkti) 101 + static void hbm_get_pkt_info(struct __sk_buff *skb, 102 + struct hbm_pkt_info *pkti) 111 103 { 112 104 struct iphdr iph; 113 105 struct ipv6hdr *ip6h; ··· 139 123 140 124 static __always_inline void hbm_init_vqueue(struct hbm_vqueue *qdp, int rate) 141 125 { 142 - bpf_printk("Initializing queue_state, rate:%d\n", rate * 128); 143 - qdp->lasttime = bpf_ktime_get_ns(); 
144 - qdp->credit = INIT_CREDIT; 145 - qdp->rate = rate * 128; 126 + bpf_printk("Initializing queue_state, rate:%d\n", rate * 128); 127 + qdp->lasttime = bpf_ktime_get_ns(); 128 + qdp->credit = INIT_CREDIT; 129 + qdp->rate = rate * 128; 130 + } 131 + 132 + static __always_inline void hbm_init_edt_vqueue(struct hbm_vqueue *qdp, 133 + int rate) 134 + { 135 + unsigned long long curtime; 136 + 137 + curtime = bpf_ktime_get_ns(); 138 + bpf_printk("Initializing queue_state, rate:%d\n", rate * 128); 139 + qdp->lasttime = curtime - BURST_SIZE_NS; // support initial burst 140 + qdp->credit = 0; // not used 141 + qdp->rate = rate * 128; 146 142 } 147 143 148 144 static __always_inline void hbm_update_stats(struct hbm_queue_stats *qsp,
+6 -12
samples/bpf/ibumad_kern.c
··· 31 31 }; 32 32 33 33 #undef DEBUG 34 - #ifdef DEBUG 35 - #define bpf_debug(fmt, ...) \ 36 - ({ \ 37 - char ____fmt[] = fmt; \ 38 - bpf_trace_printk(____fmt, sizeof(____fmt), \ 39 - ##__VA_ARGS__); \ 40 - }) 41 - #else 42 - #define bpf_debug(fmt, ...) 34 + #ifndef DEBUG 35 + #undef bpf_printk 36 + #define bpf_printk(fmt, ...) 43 37 #endif 44 38 45 39 /* Taken from the current format defined in ··· 80 86 u64 zero = 0, *val; 81 87 u8 class = ctx->mgmt_class; 82 88 83 - bpf_debug("ib_umad read recv : class 0x%x\n", class); 89 + bpf_printk("ib_umad read recv : class 0x%x\n", class); 84 90 85 91 val = bpf_map_lookup_elem(&read_count, &class); 86 92 if (!val) { ··· 100 106 u64 zero = 0, *val; 101 107 u8 class = ctx->mgmt_class; 102 108 103 - bpf_debug("ib_umad read send : class 0x%x\n", class); 109 + bpf_printk("ib_umad read send : class 0x%x\n", class); 104 110 105 111 val = bpf_map_lookup_elem(&read_count, &class); 106 112 if (!val) { ··· 120 126 u64 zero = 0, *val; 121 127 u8 class = ctx->mgmt_class; 122 128 123 - bpf_debug("ib_umad write : class 0x%x\n", class); 129 + bpf_printk("ib_umad write : class 0x%x\n", class); 124 130 125 131 val = bpf_map_lookup_elem(&write_count, &class); 126 132 if (!val) {
+1 -1
samples/bpf/tcp_bpf.readme
··· 25 25 26 26 To remove (unattach) a socket_ops BPF program from a cgroupv2: 27 27 28 - bpftool cgroup attach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog 28 + bpftool cgroup detach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog
+68
samples/bpf/tcp_dumpstats_kern.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* Refer to samples/bpf/tcp_bpf.readme for the instructions on 3 + * how to run this sample program. 4 + */ 5 + #include <linux/bpf.h> 6 + 7 + #include "bpf_helpers.h" 8 + #include "bpf_endian.h" 9 + 10 + #define INTERVAL 1000000000ULL 11 + 12 + int _version SEC("version") = 1; 13 + char _license[] SEC("license") = "GPL"; 14 + 15 + struct { 16 + __u32 type; 17 + __u32 map_flags; 18 + int *key; 19 + __u64 *value; 20 + } bpf_next_dump SEC(".maps") = { 21 + .type = BPF_MAP_TYPE_SK_STORAGE, 22 + .map_flags = BPF_F_NO_PREALLOC, 23 + }; 24 + 25 + SEC("sockops") 26 + int _sockops(struct bpf_sock_ops *ctx) 27 + { 28 + struct bpf_tcp_sock *tcp_sk; 29 + struct bpf_sock *sk; 30 + __u64 *next_dump; 31 + __u64 now; 32 + 33 + switch (ctx->op) { 34 + case BPF_SOCK_OPS_TCP_CONNECT_CB: 35 + bpf_sock_ops_cb_flags_set(ctx, BPF_SOCK_OPS_RTT_CB_FLAG); 36 + return 1; 37 + case BPF_SOCK_OPS_RTT_CB: 38 + break; 39 + default: 40 + return 1; 41 + } 42 + 43 + sk = ctx->sk; 44 + if (!sk) 45 + return 1; 46 + 47 + next_dump = bpf_sk_storage_get(&bpf_next_dump, sk, 0, 48 + BPF_SK_STORAGE_GET_F_CREATE); 49 + if (!next_dump) 50 + return 1; 51 + 52 + now = bpf_ktime_get_ns(); 53 + if (now < *next_dump) 54 + return 1; 55 + 56 + tcp_sk = bpf_tcp_sock(sk); 57 + if (!tcp_sk) 58 + return 1; 59 + 60 + *next_dump = now + INTERVAL; 61 + 62 + bpf_printk("dsack_dups=%u delivered=%u\n", 63 + tcp_sk->dsack_dups, tcp_sk->delivered); 64 + bpf_printk("delivered_ce=%u icsk_retransmits=%u\n", 65 + tcp_sk->delivered_ce, tcp_sk->icsk_retransmits); 66 + 67 + return 1; 68 + }
+10 -2
samples/bpf/xdp_adjust_tail_user.c
··· 13 13 #include <stdio.h> 14 14 #include <stdlib.h> 15 15 #include <string.h> 16 + #include <net/if.h> 16 17 #include <sys/resource.h> 17 18 #include <arpa/inet.h> 18 19 #include <netinet/ether.h> ··· 70 69 printf("Start a XDP prog which send ICMP \"packet too big\" \n" 71 70 "messages if ingress packet is bigger then MAX_SIZE bytes\n"); 72 71 printf("Usage: %s [...]\n", cmd); 73 - printf(" -i <ifindex> Interface Index\n"); 72 + printf(" -i <ifname|ifindex> Interface\n"); 74 73 printf(" -T <stop-after-X-seconds> Default: 0 (forever)\n"); 75 74 printf(" -S use skb-mode\n"); 76 75 printf(" -N enforce native mode\n"); ··· 103 102 104 103 switch (opt) { 105 104 case 'i': 106 - ifindex = atoi(optarg); 105 + ifindex = if_nametoindex(optarg); 106 + if (!ifindex) 107 + ifindex = atoi(optarg); 107 108 break; 108 109 case 'T': 109 110 kill_after_s = atoi(optarg); ··· 136 133 137 134 if (setrlimit(RLIMIT_MEMLOCK, &r)) { 138 135 perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)"); 136 + return 1; 137 + } 138 + 139 + if (!ifindex) { 140 + fprintf(stderr, "Invalid ifname\n"); 139 141 return 1; 140 142 } 141 143
+11 -4
samples/bpf/xdp_redirect_map_user.c
··· 10 10 #include <stdlib.h> 11 11 #include <stdbool.h> 12 12 #include <string.h> 13 + #include <net/if.h> 13 14 #include <unistd.h> 14 15 #include <libgen.h> 15 16 #include <sys/resource.h> ··· 86 85 static void usage(const char *prog) 87 86 { 88 87 fprintf(stderr, 89 - "usage: %s [OPTS] IFINDEX_IN IFINDEX_OUT\n\n" 88 + "usage: %s [OPTS] <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n\n" 90 89 "OPTS:\n" 91 90 " -S use skb-mode\n" 92 91 " -N enforce native mode\n" ··· 128 127 } 129 128 130 129 if (optind == argc) { 131 - printf("usage: %s IFINDEX_IN IFINDEX_OUT\n", argv[0]); 130 + printf("usage: %s <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n", argv[0]); 132 131 return 1; 133 132 } 134 133 ··· 137 136 return 1; 138 137 } 139 138 140 - ifindex_in = strtoul(argv[optind], NULL, 0); 141 - ifindex_out = strtoul(argv[optind + 1], NULL, 0); 139 + ifindex_in = if_nametoindex(argv[optind]); 140 + if (!ifindex_in) 141 + ifindex_in = strtoul(argv[optind], NULL, 0); 142 + 143 + ifindex_out = if_nametoindex(argv[optind + 1]); 144 + if (!ifindex_out) 145 + ifindex_out = strtoul(argv[optind + 1], NULL, 0); 146 + 142 147 printf("input: %d output: %d\n", ifindex_in, ifindex_out); 143 148 144 149 snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+11 -4
samples/bpf/xdp_redirect_user.c
··· 10 10 #include <stdlib.h> 11 11 #include <stdbool.h> 12 12 #include <string.h> 13 + #include <net/if.h> 13 14 #include <unistd.h> 14 15 #include <libgen.h> 15 16 #include <sys/resource.h> ··· 86 85 static void usage(const char *prog) 87 86 { 88 87 fprintf(stderr, 89 - "usage: %s [OPTS] IFINDEX_IN IFINDEX_OUT\n\n" 88 + "usage: %s [OPTS] <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n\n" 90 89 "OPTS:\n" 91 90 " -S use skb-mode\n" 92 91 " -N enforce native mode\n" ··· 129 128 } 130 129 131 130 if (optind == argc) { 132 - printf("usage: %s IFINDEX_IN IFINDEX_OUT\n", argv[0]); 131 + printf("usage: %s <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n", argv[0]); 133 132 return 1; 134 133 } 135 134 ··· 138 137 return 1; 139 138 } 140 139 141 - ifindex_in = strtoul(argv[optind], NULL, 0); 142 - ifindex_out = strtoul(argv[optind + 1], NULL, 0); 140 + ifindex_in = if_nametoindex(argv[optind]); 141 + if (!ifindex_in) 142 + ifindex_in = strtoul(argv[optind], NULL, 0); 143 + 144 + ifindex_out = if_nametoindex(argv[optind + 1]); 145 + if (!ifindex_out) 146 + ifindex_out = strtoul(argv[optind + 1], NULL, 0); 147 + 143 148 printf("input: %d output: %d\n", ifindex_in, ifindex_out); 144 149 145 150 snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+10 -2
samples/bpf/xdp_tx_iptunnel_user.c
··· 9 9 #include <stdio.h> 10 10 #include <stdlib.h> 11 11 #include <string.h> 12 + #include <net/if.h> 12 13 #include <sys/resource.h> 13 14 #include <arpa/inet.h> 14 15 #include <netinet/ether.h> ··· 84 83 "in an IPv4/v6 header and XDP_TX it out. The dst <VIP:PORT>\n" 85 84 "is used to select packets to encapsulate\n\n"); 86 85 printf("Usage: %s [...]\n", cmd); 87 - printf(" -i <ifindex> Interface Index\n"); 86 + printf(" -i <ifname|ifindex> Interface\n"); 88 87 printf(" -a <vip-service-address> IPv4 or IPv6\n"); 89 88 printf(" -p <vip-service-port> A port range (e.g. 433-444) is also allowed\n"); 90 89 printf(" -s <source-ip> Used in the IPTunnel header\n"); ··· 182 181 183 182 switch (opt) { 184 183 case 'i': 185 - ifindex = atoi(optarg); 184 + ifindex = if_nametoindex(optarg); 185 + if (!ifindex) 186 + ifindex = atoi(optarg); 186 187 break; 187 188 case 'a': 188 189 vip.family = parse_ipstr(optarg, vip.daddr.v6); ··· 253 250 254 251 if (setrlimit(RLIMIT_MEMLOCK, &r)) { 255 252 perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)"); 253 + return 1; 254 + } 255 + 256 + if (!ifindex) { 257 + fprintf(stderr, "Invalid ifname\n"); 256 258 return 1; 257 259 } 258 260
+28 -16
samples/bpf/xdpsock_user.c
··· 68 68 static int opt_poll; 69 69 static int opt_interval = 1; 70 70 static u32 opt_xdp_bind_flags; 71 + static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE; 71 72 static __u32 prog_id; 72 73 73 74 struct xsk_umem_info { ··· 277 276 static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size) 278 277 { 279 278 struct xsk_umem_info *umem; 279 + struct xsk_umem_config cfg = { 280 + .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, 281 + .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, 282 + .frame_size = opt_xsk_frame_size, 283 + .frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM, 284 + }; 280 285 int ret; 281 286 282 287 umem = calloc(1, sizeof(*umem)); ··· 290 283 exit_with_error(errno); 291 284 292 285 ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq, 293 - NULL); 286 + &cfg); 294 287 if (ret) 295 288 exit_with_error(-ret); 296 289 ··· 330 323 &idx); 331 324 if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS) 332 325 exit_with_error(-ret); 333 - for (i = 0; 334 - i < XSK_RING_PROD__DEFAULT_NUM_DESCS * 335 - XSK_UMEM__DEFAULT_FRAME_SIZE; 336 - i += XSK_UMEM__DEFAULT_FRAME_SIZE) 337 - *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = i; 326 + for (i = 0; i < XSK_RING_PROD__DEFAULT_NUM_DESCS; i++) 327 + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = 328 + i * opt_xsk_frame_size; 338 329 xsk_ring_prod__submit(&xsk->umem->fq, 339 330 XSK_RING_PROD__DEFAULT_NUM_DESCS); 340 331 ··· 351 346 {"interval", required_argument, 0, 'n'}, 352 347 {"zero-copy", no_argument, 0, 'z'}, 353 348 {"copy", no_argument, 0, 'c'}, 349 + {"frame-size", required_argument, 0, 'f'}, 354 350 {0, 0, 0, 0} 355 351 }; 356 352 ··· 371 365 " -n, --interval=n Specify statistics update interval (default 1 sec).\n" 372 366 " -z, --zero-copy Force zero-copy mode.\n" 373 367 " -c, --copy Force copy mode.\n" 368 + " -f, --frame-size=n Set the frame size (must be a power of two, default is %d).\n" 374 369 "\n"; 375 - fprintf(stderr, str, prog); 370 + fprintf(stderr, 
str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE); 376 371 exit(EXIT_FAILURE); 377 372 } 378 373 ··· 384 377 opterr = 0; 385 378 386 379 for (;;) { 387 - c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options, 380 + c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:", long_options, 388 381 &option_index); 389 382 if (c == -1) 390 383 break; ··· 427 420 case 'F': 428 421 opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST; 429 422 break; 423 + case 'f': 424 + opt_xsk_frame_size = atoi(optarg); 425 + break; 430 426 default: 431 427 usage(basename(argv[0])); 432 428 } ··· 442 432 usage(basename(argv[0])); 443 433 } 444 434 435 + if (opt_xsk_frame_size & (opt_xsk_frame_size - 1)) { 436 + fprintf(stderr, "--frame-size=%d is not a power of two\n", 437 + opt_xsk_frame_size); 438 + usage(basename(argv[0])); 439 + } 445 440 } 446 441 447 442 static void kick_tx(struct xsk_socket_info *xsk) ··· 598 583 599 584 for (i = 0; i < BATCH_SIZE; i++) { 600 585 xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr 601 - = (frame_nb + i) << 602 - XSK_UMEM__DEFAULT_FRAME_SHIFT; 586 + = (frame_nb + i) * opt_xsk_frame_size; 603 587 xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len = 604 588 sizeof(pkt_data) - 1; 605 589 } ··· 675 661 } 676 662 677 663 ret = posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */ 678 - NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE); 664 + NUM_FRAMES * opt_xsk_frame_size); 679 665 if (ret) 680 666 exit_with_error(ret); 681 667 682 668 /* Create sockets... 
*/ 683 - umem = xsk_configure_umem(bufs, 684 - NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE); 669 + umem = xsk_configure_umem(bufs, NUM_FRAMES * opt_xsk_frame_size); 685 670 xsks[num_socks++] = xsk_configure_socket(umem); 686 671 687 672 if (opt_bench == BENCH_TXONLY) { 688 673 int i; 689 674 690 - for (i = 0; i < NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE; 691 - i += XSK_UMEM__DEFAULT_FRAME_SIZE) 692 - (void)gen_eth_frame(umem, i); 675 + for (i = 0; i < NUM_FRAMES; i++) 676 + (void)gen_eth_frame(umem, i * opt_xsk_frame_size); 693 677 } 694 678 695 679 signal(SIGINT, int_exit);
+5 -2
tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
··· 29 29 | *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* } 30 30 | *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | **sock_ops** | **device** | 31 31 | **bind4** | **bind6** | **post_bind4** | **post_bind6** | **connect4** | **connect6** | 32 - | **sendmsg4** | **sendmsg6** | **recvmsg4** | **recvmsg6** | **sysctl** } 32 + | **sendmsg4** | **sendmsg6** | **recvmsg4** | **recvmsg6** | **sysctl** | 33 + | **getsockopt** | **setsockopt** } 33 34 | *ATTACH_FLAGS* := { **multi** | **override** } 34 35 35 36 DESCRIPTION ··· 91 90 an unconnected udp4 socket (since 5.2); 92 91 **recvmsg6** call to recvfrom(2), recvmsg(2), recvmmsg(2) for 93 92 an unconnected udp6 socket (since 5.2); 94 - **sysctl** sysctl access (since 5.2). 93 + **sysctl** sysctl access (since 5.2); 94 + **getsockopt** call to getsockopt (since 5.3); 95 + **setsockopt** call to setsockopt (since 5.3). 95 96 96 97 **bpftool cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG* 97 98 Detach *PROG* from the cgroup *CGROUP* and attach type
+2 -1
tools/bpf/bpftool/Documentation/bpftool-prog.rst
··· 40 40 | **lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | **lirc_mode2** | 41 41 | **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** | 42 42 | **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** | 43 - | **cgroup/recvmsg4** | **cgroup/recvmsg6** | **cgroup/sysctl** 43 + | **cgroup/recvmsg4** | **cgroup/recvmsg6** | **cgroup/sysctl** | 44 + | **cgroup/getsockopt** | **cgroup/setsockopt** 44 45 | } 45 46 | *ATTACH_TYPE* := { 46 47 | **msg_verdict** | **stream_verdict** | **stream_parser** | **flow_dissector**
+6 -3
tools/bpf/bpftool/bash-completion/bpftool
··· 379 379 cgroup/sendmsg4 cgroup/sendmsg6 \ 380 380 cgroup/recvmsg4 cgroup/recvmsg6 \ 381 381 cgroup/post_bind4 cgroup/post_bind6 \ 382 - cgroup/sysctl" -- \ 382 + cgroup/sysctl cgroup/getsockopt \ 383 + cgroup/setsockopt" -- \ 383 384 "$cur" ) ) 384 385 return 0 385 386 ;; ··· 690 689 attach|detach) 691 690 local ATTACH_TYPES='ingress egress sock_create sock_ops \ 692 691 device bind4 bind6 post_bind4 post_bind6 connect4 \ 693 - connect6 sendmsg4 sendmsg6 recvmsg4 recvmsg6 sysctl' 692 + connect6 sendmsg4 sendmsg6 recvmsg4 recvmsg6 sysctl \ 693 + getsockopt setsockopt' 694 694 local ATTACH_FLAGS='multi override' 695 695 local PROG_TYPE='id pinned tag' 696 696 case $prev in ··· 701 699 ;; 702 700 ingress|egress|sock_create|sock_ops|device|bind4|bind6|\ 703 701 post_bind4|post_bind6|connect4|connect6|sendmsg4|\ 704 - sendmsg6|recvmsg4|recvmsg6|sysctl) 702 + sendmsg6|recvmsg4|recvmsg6|sysctl|getsockopt|\ 703 + setsockopt) 705 704 COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \ 706 705 "$cur" ) ) 707 706 return 0
+4 -1
tools/bpf/bpftool/cgroup.c
··· 26 26 " sock_ops | device | bind4 | bind6 |\n" \ 27 27 " post_bind4 | post_bind6 | connect4 |\n" \ 28 28 " connect6 | sendmsg4 | sendmsg6 |\n" \ 29 - " recvmsg4 | recvmsg6 | sysctl }" 29 + " recvmsg4 | recvmsg6 | sysctl |\n" \ 30 + " getsockopt | setsockopt }" 30 31 31 32 static const char * const attach_type_strings[] = { 32 33 [BPF_CGROUP_INET_INGRESS] = "ingress", ··· 46 45 [BPF_CGROUP_SYSCTL] = "sysctl", 47 46 [BPF_CGROUP_UDP4_RECVMSG] = "recvmsg4", 48 47 [BPF_CGROUP_UDP6_RECVMSG] = "recvmsg6", 48 + [BPF_CGROUP_GETSOCKOPT] = "getsockopt", 49 + [BPF_CGROUP_SETSOCKOPT] = "setsockopt", 49 50 [__MAX_BPF_ATTACH_TYPE] = NULL, 50 51 }; 51 52
+1
tools/bpf/bpftool/main.h
··· 74 74 [BPF_PROG_TYPE_SK_REUSEPORT] = "sk_reuseport", 75 75 [BPF_PROG_TYPE_FLOW_DISSECTOR] = "flow_dissector", 76 76 [BPF_PROG_TYPE_CGROUP_SYSCTL] = "cgroup_sysctl", 77 + [BPF_PROG_TYPE_CGROUP_SOCKOPT] = "cgroup_sockopt", 77 78 }; 78 79 79 80 extern const char * const map_type_name[];
+2 -1
tools/bpf/bpftool/prog.c
··· 1071 1071 " cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n" 1072 1072 " cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n" 1073 1073 " cgroup/sendmsg4 | cgroup/sendmsg6 | cgroup/recvmsg4 |\n" 1074 - " cgroup/recvmsg6 }\n" 1074 + " cgroup/recvmsg6 | cgroup/getsockopt |\n" 1075 + " cgroup/setsockopt }\n" 1075 1076 " ATTACH_TYPE := { msg_verdict | stream_verdict | stream_parser |\n" 1076 1077 " flow_dissector }\n" 1077 1078 " " HELP_SPEC_OPTIONS "\n"
+25 -1
tools/include/uapi/linux/bpf.h
··· 170 170 BPF_PROG_TYPE_FLOW_DISSECTOR, 171 171 BPF_PROG_TYPE_CGROUP_SYSCTL, 172 172 BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, 173 + BPF_PROG_TYPE_CGROUP_SOCKOPT, 173 174 }; 174 175 175 176 enum bpf_attach_type { ··· 195 194 BPF_CGROUP_SYSCTL, 196 195 BPF_CGROUP_UDP4_RECVMSG, 197 196 BPF_CGROUP_UDP6_RECVMSG, 197 + BPF_CGROUP_GETSOCKOPT, 198 + BPF_CGROUP_SETSOCKOPT, 198 199 __MAX_BPF_ATTACH_TYPE 199 200 }; 200 201 ··· 1767 1764 * * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out) 1768 1765 * * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission) 1769 1766 * * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change) 1767 + * * **BPF_SOCK_OPS_RTT_CB_FLAG** (every RTT) 1770 1768 * 1771 1769 * Therefore, this function can be used to clear a callback flag by 1772 1770 * setting the appropriate bit to zero. e.g. to disable the RTO ··· 3070 3066 * sum(delta(snd_una)), or how many bytes 3071 3067 * were acked. 3072 3068 */ 3069 + __u32 dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups 3070 + * total number of DSACK blocks received 3071 + */ 3072 + __u32 delivered; /* Total data packets delivered incl. rexmits */ 3073 + __u32 delivered_ce; /* Like the above but only ECE marked packets */ 3074 + __u32 icsk_retransmits; /* Number of unrecovered [RTO] timeouts */ 3073 3075 }; 3074 3076 3075 3077 struct bpf_sock_tuple { ··· 3318 3308 #define BPF_SOCK_OPS_RTO_CB_FLAG (1<<0) 3319 3309 #define BPF_SOCK_OPS_RETRANS_CB_FLAG (1<<1) 3320 3310 #define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2) 3321 - #define BPF_SOCK_OPS_ALL_CB_FLAGS 0x7 /* Mask of all currently 3311 + #define BPF_SOCK_OPS_RTT_CB_FLAG (1<<3) 3312 + #define BPF_SOCK_OPS_ALL_CB_FLAGS 0xF /* Mask of all currently 3322 3313 * supported cb flags 3323 3314 */ 3324 3315 ··· 3373 3362 */ 3374 3363 BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after 3375 3364 * socket transition to LISTEN state. 3365 + */ 3366 + BPF_SOCK_OPS_RTT_CB, /* Called on every RTT. 
3376 3367 */ 3377 3368 }; 3378 3369 ··· 3552 3539 __u32 file_pos; /* Sysctl file position to read from, write to. 3553 3540 * Allows 1,2,4-byte read an 4-byte write. 3554 3541 */ 3542 + }; 3543 + 3544 + struct bpf_sockopt { 3545 + __bpf_md_ptr(struct bpf_sock *, sk); 3546 + __bpf_md_ptr(void *, optval); 3547 + __bpf_md_ptr(void *, optval_end); 3548 + 3549 + __s32 level; 3550 + __s32 optname; 3551 + __s32 optlen; 3552 + __s32 retval; 3555 3553 }; 3556 3554 3557 3555 #endif /* _UAPI__LINUX_BPF_H__ */
+8
tools/include/uapi/linux/if_xdp.h
··· 46 46 #define XDP_UMEM_FILL_RING 5 47 47 #define XDP_UMEM_COMPLETION_RING 6 48 48 #define XDP_STATISTICS 7 49 + #define XDP_OPTIONS 8 49 50 50 51 struct xdp_umem_reg { 51 52 __u64 addr; /* Start of packet data area */ ··· 60 59 __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ 61 60 __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ 62 61 }; 62 + 63 + struct xdp_options { 64 + __u32 flags; 65 + }; 66 + 67 + /* Flags for the flags field of struct xdp_options */ 68 + #define XDP_OPTIONS_ZEROCOPY (1 << 0) 63 69 64 70 /* Pgoff for mmaping the rings */ 65 71 #define XDP_PGOFF_RX_RING 0
+14 -9
tools/lib/bpf/libbpf.c
··· 778 778 if (obj->nr_maps < obj->maps_cap) 779 779 return &obj->maps[obj->nr_maps++]; 780 780 781 - new_cap = max(4ul, obj->maps_cap * 3 / 2); 781 + new_cap = max((size_t)4, obj->maps_cap * 3 / 2); 782 782 new_maps = realloc(obj->maps, new_cap * sizeof(*obj->maps)); 783 783 if (!new_maps) { 784 784 pr_warning("alloc maps for object failed\n"); ··· 1169 1169 pr_debug("map '%s': found key_size = %u.\n", 1170 1170 map_name, sz); 1171 1171 if (map->def.key_size && map->def.key_size != sz) { 1172 - pr_warning("map '%s': conflictling key size %u != %u.\n", 1172 + pr_warning("map '%s': conflicting key size %u != %u.\n", 1173 1173 map_name, map->def.key_size, sz); 1174 1174 return -EINVAL; 1175 1175 } ··· 1197 1197 pr_debug("map '%s': found key [%u], sz = %lld.\n", 1198 1198 map_name, t->type, sz); 1199 1199 if (map->def.key_size && map->def.key_size != sz) { 1200 - pr_warning("map '%s': conflictling key size %u != %lld.\n", 1200 + pr_warning("map '%s': conflicting key size %u != %lld.\n", 1201 1201 map_name, map->def.key_size, sz); 1202 1202 return -EINVAL; 1203 1203 } ··· 1212 1212 pr_debug("map '%s': found value_size = %u.\n", 1213 1213 map_name, sz); 1214 1214 if (map->def.value_size && map->def.value_size != sz) { 1215 - pr_warning("map '%s': conflictling value size %u != %u.\n", 1215 + pr_warning("map '%s': conflicting value size %u != %u.\n", 1216 1216 map_name, map->def.value_size, sz); 1217 1217 return -EINVAL; 1218 1218 } ··· 1240 1240 pr_debug("map '%s': found value [%u], sz = %lld.\n", 1241 1241 map_name, t->type, sz); 1242 1242 if (map->def.value_size && map->def.value_size != sz) { 1243 - pr_warning("map '%s': conflictling value size %u != %lld.\n", 1243 + pr_warning("map '%s': conflicting value size %u != %lld.\n", 1244 1244 map_name, map->def.value_size, sz); 1245 1245 return -EINVAL; 1246 1246 } ··· 2646 2646 case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE: 2647 2647 case BPF_PROG_TYPE_PERF_EVENT: 2648 2648 case BPF_PROG_TYPE_CGROUP_SYSCTL: 2649 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 2649 2650 return false; 2650 2651 case BPF_PROG_TYPE_KPROBE: 2651 2652 default: ··· 3605 3604 BPF_CGROUP_UDP6_RECVMSG), 3606 3605 BPF_EAPROG_SEC("cgroup/sysctl", BPF_PROG_TYPE_CGROUP_SYSCTL, 3607 3606 BPF_CGROUP_SYSCTL), 3607 + BPF_EAPROG_SEC("cgroup/getsockopt", BPF_PROG_TYPE_CGROUP_SOCKOPT, 3608 + BPF_CGROUP_GETSOCKOPT), 3609 + BPF_EAPROG_SEC("cgroup/setsockopt", BPF_PROG_TYPE_CGROUP_SOCKOPT, 3610 + BPF_CGROUP_SETSOCKOPT), 3608 3611 }; 3609 3612 3610 3613 #undef BPF_PROG_SEC_IMPL ··· 3872 3867 int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr, 3873 3868 struct bpf_object **pobj, int *prog_fd) 3874 3869 { 3875 - struct bpf_object_open_attr open_attr = { 3876 - .file = attr->file, 3877 - .prog_type = attr->prog_type, 3878 - }; 3870 + struct bpf_object_open_attr open_attr = {}; 3879 3871 struct bpf_program *prog, *first_prog = NULL; 3880 3872 enum bpf_attach_type expected_attach_type; 3881 3873 enum bpf_prog_type prog_type; ··· 3884 3882 return -EINVAL; 3885 3883 if (!attr->file) 3886 3884 return -EINVAL; 3885 + 3886 + open_attr.file = attr->file; 3887 + open_attr.prog_type = attr->prog_type; 3887 3888 3888 3889 obj = bpf_object__open_xattr(&open_attr); 3889 3890 if (IS_ERR_OR_NULL(obj))
+1
tools/lib/bpf/libbpf_probes.c
··· 101 101 case BPF_PROG_TYPE_SK_REUSEPORT: 102 102 case BPF_PROG_TYPE_FLOW_DISSECTOR: 103 103 case BPF_PROG_TYPE_CGROUP_SYSCTL: 104 + case BPF_PROG_TYPE_CGROUP_SOCKOPT: 104 105 default: 105 106 break; 106 107 }
+14 -1
tools/lib/bpf/xsk.c
··· 65 65 int xsks_map_fd; 66 66 __u32 queue_id; 67 67 char ifname[IFNAMSIZ]; 68 + bool zc; 68 69 }; 69 70 70 71 struct xsk_nl_info { ··· 327 326 328 327 channels.cmd = ETHTOOL_GCHANNELS; 329 328 ifr.ifr_data = (void *)&channels; 330 - strncpy(ifr.ifr_name, xsk->ifname, IFNAMSIZ); 329 + strncpy(ifr.ifr_name, xsk->ifname, IFNAMSIZ - 1); 330 + ifr.ifr_name[IFNAMSIZ - 1] = '\0'; 331 331 err = ioctl(fd, SIOCETHTOOL, &ifr); 332 332 if (err && errno != EOPNOTSUPP) { 333 333 ret = -errno; ··· 482 480 void *rx_map = NULL, *tx_map = NULL; 483 481 struct sockaddr_xdp sxdp = {}; 484 482 struct xdp_mmap_offsets off; 483 + struct xdp_options opts; 485 484 struct xsk_socket *xsk; 486 485 socklen_t optlen; 487 486 int err; ··· 600 597 } 601 598 602 599 xsk->prog_fd = -1; 600 + 601 + optlen = sizeof(opts); 602 + err = getsockopt(xsk->fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen); 603 + if (err) { 604 + err = -errno; 605 + goto out_mmap_tx; 606 + } 607 + 608 + xsk->zc = opts.flags & XDP_OPTIONS_ZEROCOPY; 609 + 603 610 if (!(xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)) { 604 611 err = xsk_setup_xdp_prog(xsk); 605 612 if (err)
+1 -1
tools/lib/bpf/xsk.h
··· 167 167 168 168 #define XSK_RING_CONS__DEFAULT_NUM_DESCS 2048 169 169 #define XSK_RING_PROD__DEFAULT_NUM_DESCS 2048 170 - #define XSK_UMEM__DEFAULT_FRAME_SHIFT 11 /* 2048 bytes */ 170 + #define XSK_UMEM__DEFAULT_FRAME_SHIFT 12 /* 4096 bytes */ 171 171 #define XSK_UMEM__DEFAULT_FRAME_SIZE (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT) 172 172 #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0 173 173
+3
tools/testing/selftests/bpf/.gitignore
··· 39 39 test_hashmap 40 40 test_btf_dump 41 41 xdping 42 + test_sockopt 43 + test_sockopt_sk 44 + test_sockopt_multi
+8 -2
tools/testing/selftests/bpf/Makefile
··· 15 15 LLVM_OBJCOPY ?= llvm-objcopy 16 16 LLVM_READELF ?= llvm-readelf 17 17 BTF_PAHOLE ?= pahole 18 - CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include \ 18 + CFLAGS += -g -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(BPFDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include \ 19 19 -Dbpf_prog_load=bpf_prog_test_load \ 20 20 -Dbpf_load_program=bpf_test_load_program 21 21 LDLIBS += -lcap -lelf -lrt -lpthread ··· 26 26 test_sock test_btf test_sockmap get_cgroup_id_user test_socket_cookie \ 27 27 test_cgroup_storage test_select_reuseport test_section_names \ 28 28 test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \ 29 - test_btf_dump test_cgroup_attach xdping 29 + test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \ 30 + test_sockopt_multi test_tcp_rtt 30 31 31 32 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c))) 32 33 TEST_GEN_FILES = $(BPF_OBJ_FILES) ··· 47 46 test_libbpf.sh \ 48 47 test_xdp_redirect.sh \ 49 48 test_xdp_meta.sh \ 49 + test_xdp_veth.sh \ 50 50 test_offload.py \ 51 51 test_sock_addr.sh \ 52 52 test_tunnel.sh \ ··· 104 102 $(OUTPUT)/test_sock_fields: cgroup_helpers.c 105 103 $(OUTPUT)/test_sysctl: cgroup_helpers.c 106 104 $(OUTPUT)/test_cgroup_attach: cgroup_helpers.c 105 + $(OUTPUT)/test_sockopt: cgroup_helpers.c 106 + $(OUTPUT)/test_sockopt_sk: cgroup_helpers.c 107 + $(OUTPUT)/test_sockopt_multi: cgroup_helpers.c 108 + $(OUTPUT)/test_tcp_rtt: cgroup_helpers.c 107 109 108 110 .PHONY: force 109 111
+4 -5
tools/testing/selftests/bpf/progs/pyperf.h
··· 75 75 void* co_name; // PyCodeObject.co_name 76 76 } FrameData; 77 77 78 - static inline __attribute__((__always_inline__)) void* 79 - get_thread_state(void* tls_base, PidData* pidData) 78 + static __always_inline void *get_thread_state(void *tls_base, PidData *pidData) 80 79 { 81 80 void* thread_state; 82 81 int key; ··· 86 87 return thread_state; 87 88 } 88 89 89 - static inline __attribute__((__always_inline__)) bool 90 - get_frame_data(void* frame_ptr, PidData* pidData, FrameData* frame, Symbol* symbol) 90 + static __always_inline bool get_frame_data(void *frame_ptr, PidData *pidData, 91 + FrameData *frame, Symbol *symbol) 91 92 { 92 93 // read data from PyFrameObject 93 94 bpf_probe_read(&frame->f_back, ··· 160 161 .max_elem = 1000, 161 162 }; 162 163 163 - static inline __attribute__((__always_inline__)) int __on_event(struct pt_regs *ctx) 164 + static __always_inline int __on_event(struct pt_regs *ctx) 164 165 { 165 166 uint64_t pid_tgid = bpf_get_current_pid_tgid(); 166 167 pid_t pid = (pid_t)(pid_tgid >> 32);
+71
tools/testing/selftests/bpf/progs/sockopt_multi.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <netinet/in.h> 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + char _license[] SEC("license") = "GPL"; 7 + __u32 _version SEC("version") = 1; 8 + 9 + SEC("cgroup/getsockopt/child") 10 + int _getsockopt_child(struct bpf_sockopt *ctx) 11 + { 12 + __u8 *optval_end = ctx->optval_end; 13 + __u8 *optval = ctx->optval; 14 + 15 + if (ctx->level != SOL_IP || ctx->optname != IP_TOS) 16 + return 1; 17 + 18 + if (optval + 1 > optval_end) 19 + return 0; /* EPERM, bounds check */ 20 + 21 + if (optval[0] != 0x80) 22 + return 0; /* EPERM, unexpected optval from the kernel */ 23 + 24 + ctx->retval = 0; /* Reset system call return value to zero */ 25 + 26 + optval[0] = 0x90; 27 + ctx->optlen = 1; 28 + 29 + return 1; 30 + } 31 + 32 + SEC("cgroup/getsockopt/parent") 33 + int _getsockopt_parent(struct bpf_sockopt *ctx) 34 + { 35 + __u8 *optval_end = ctx->optval_end; 36 + __u8 *optval = ctx->optval; 37 + 38 + if (ctx->level != SOL_IP || ctx->optname != IP_TOS) 39 + return 1; 40 + 41 + if (optval + 1 > optval_end) 42 + return 0; /* EPERM, bounds check */ 43 + 44 + if (optval[0] != 0x90) 45 + return 0; /* EPERM, unexpected optval from the kernel */ 46 + 47 + ctx->retval = 0; /* Reset system call return value to zero */ 48 + 49 + optval[0] = 0xA0; 50 + ctx->optlen = 1; 51 + 52 + return 1; 53 + } 54 + 55 + SEC("cgroup/setsockopt") 56 + int _setsockopt(struct bpf_sockopt *ctx) 57 + { 58 + __u8 *optval_end = ctx->optval_end; 59 + __u8 *optval = ctx->optval; 60 + 61 + if (ctx->level != SOL_IP || ctx->optname != IP_TOS) 62 + return 1; 63 + 64 + if (optval + 1 > optval_end) 65 + return 0; /* EPERM, bounds check */ 66 + 67 + optval[0] += 0x10; 68 + ctx->optlen = 1; 69 + 70 + return 1; 71 + }
+111
tools/testing/selftests/bpf/progs/sockopt_sk.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <netinet/in.h> 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + char _license[] SEC("license") = "GPL"; 7 + __u32 _version SEC("version") = 1; 8 + 9 + #define SOL_CUSTOM 0xdeadbeef 10 + 11 + struct sockopt_sk { 12 + __u8 val; 13 + }; 14 + 15 + struct bpf_map_def SEC("maps") socket_storage_map = { 16 + .type = BPF_MAP_TYPE_SK_STORAGE, 17 + .key_size = sizeof(int), 18 + .value_size = sizeof(struct sockopt_sk), 19 + .map_flags = BPF_F_NO_PREALLOC, 20 + }; 21 + BPF_ANNOTATE_KV_PAIR(socket_storage_map, int, struct sockopt_sk); 22 + 23 + SEC("cgroup/getsockopt") 24 + int _getsockopt(struct bpf_sockopt *ctx) 25 + { 26 + __u8 *optval_end = ctx->optval_end; 27 + __u8 *optval = ctx->optval; 28 + struct sockopt_sk *storage; 29 + 30 + if (ctx->level == SOL_IP && ctx->optname == IP_TOS) 31 + /* Not interested in SOL_IP:IP_TOS; 32 + * let next BPF program in the cgroup chain or kernel 33 + * handle it. 34 + */ 35 + return 1; 36 + 37 + if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) { 38 + /* Not interested in SOL_SOCKET:SO_SNDBUF; 39 + * let next BPF program in the cgroup chain or kernel 40 + * handle it. 41 + */ 42 + return 1; 43 + } 44 + 45 + if (ctx->level != SOL_CUSTOM) 46 + return 0; /* EPERM, deny everything except custom level */ 47 + 48 + if (optval + 1 > optval_end) 49 + return 0; /* EPERM, bounds check */ 50 + 51 + storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0, 52 + BPF_SK_STORAGE_GET_F_CREATE); 53 + if (!storage) 54 + return 0; /* EPERM, couldn't get sk storage */ 55 + 56 + if (!ctx->retval) 57 + return 0; /* EPERM, kernel should not have handled 58 + * SOL_CUSTOM, something is wrong! 
59 + */ 60 + ctx->retval = 0; /* Reset system call return value to zero */ 61 + 62 + optval[0] = storage->val; 63 + ctx->optlen = 1; 64 + 65 + return 1; 66 + } 67 + 68 + SEC("cgroup/setsockopt") 69 + int _setsockopt(struct bpf_sockopt *ctx) 70 + { 71 + __u8 *optval_end = ctx->optval_end; 72 + __u8 *optval = ctx->optval; 73 + struct sockopt_sk *storage; 74 + 75 + if (ctx->level == SOL_IP && ctx->optname == IP_TOS) 76 + /* Not interested in SOL_IP:IP_TOS; 77 + * let next BPF program in the cgroup chain or kernel 78 + * handle it. 79 + */ 80 + return 1; 81 + 82 + if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) { 83 + /* Overwrite SO_SNDBUF value */ 84 + 85 + if (optval + sizeof(__u32) > optval_end) 86 + return 0; /* EPERM, bounds check */ 87 + 88 + *(__u32 *)optval = 0x55AA; 89 + ctx->optlen = 4; 90 + 91 + return 1; 92 + } 93 + 94 + if (ctx->level != SOL_CUSTOM) 95 + return 0; /* EPERM, deny everything except custom level */ 96 + 97 + if (optval + 1 > optval_end) 98 + return 0; /* EPERM, bounds check */ 99 + 100 + storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0, 101 + BPF_SK_STORAGE_GET_F_CREATE); 102 + if (!storage) 103 + return 0; /* EPERM, couldn't get sk storage */ 104 + 105 + storage->val = optval[0]; 106 + ctx->optlen = -1; /* BPF has consumed this option, don't call kernel 107 + * setsockopt handler. 108 + */ 109 + 110 + return 1; 111 + }
+19 -17
tools/testing/selftests/bpf/progs/strobemeta.h
··· 266 266 uint64_t offset; 267 267 }; 268 268 269 - static inline __attribute__((always_inline)) 270 - void *calc_location(struct strobe_value_loc *loc, void *tls_base) 269 + static __always_inline void *calc_location(struct strobe_value_loc *loc, 270 + void *tls_base) 271 271 { 272 272 /* 273 273 * tls_mode value is: ··· 327 327 : NULL; 328 328 } 329 329 330 - static inline __attribute__((always_inline)) 331 - void read_int_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base, 332 - struct strobe_value_generic *value, 333 - struct strobemeta_payload *data) 330 + static __always_inline void read_int_var(struct strobemeta_cfg *cfg, 331 + size_t idx, void *tls_base, 332 + struct strobe_value_generic *value, 333 + struct strobemeta_payload *data) 334 334 { 335 335 void *location = calc_location(&cfg->int_locs[idx], tls_base); 336 336 if (!location) ··· 342 342 data->int_vals_set_mask |= (1 << idx); 343 343 } 344 344 345 - static inline __attribute__((always_inline)) 346 - uint64_t read_str_var(struct strobemeta_cfg* cfg, size_t idx, void *tls_base, 347 - struct strobe_value_generic *value, 348 - struct strobemeta_payload *data, void *payload) 345 + static __always_inline uint64_t read_str_var(struct strobemeta_cfg *cfg, 346 + size_t idx, void *tls_base, 347 + struct strobe_value_generic *value, 348 + struct strobemeta_payload *data, 349 + void *payload) 349 350 { 350 351 void *location; 351 352 uint32_t len; ··· 372 371 return len; 373 372 } 374 373 375 - static inline __attribute__((always_inline)) 376 - void *read_map_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base, 377 - struct strobe_value_generic *value, 378 - struct strobemeta_payload* data, void *payload) 374 + static __always_inline void *read_map_var(struct strobemeta_cfg *cfg, 375 + size_t idx, void *tls_base, 376 + struct strobe_value_generic *value, 377 + struct strobemeta_payload *data, 378 + void *payload) 379 379 { 380 380 struct strobe_map_descr* descr = &data->map_descrs[idx]; 381 381 struct strobe_map_raw map; ··· 437 435 * read_strobe_meta returns NULL, if no metadata was read; otherwise returns 438 436 * pointer to *right after* payload ends 439 437 */ 440 - static inline __attribute__((always_inline)) 441 - void *read_strobe_meta(struct task_struct* task, 442 - struct strobemeta_payload* data) { 438 + static __always_inline void *read_strobe_meta(struct task_struct *task, 439 + struct strobemeta_payload *data) 440 + { 443 441 pid_t pid = bpf_get_current_pid_tgid() >> 32; 444 442 struct strobe_value_generic value = {0}; 445 443 struct strobemeta_cfg *cfg;
+61
tools/testing/selftests/bpf/progs/tcp_rtt.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <linux/bpf.h> 3 + #include "bpf_helpers.h" 4 + 5 + char _license[] SEC("license") = "GPL"; 6 + __u32 _version SEC("version") = 1; 7 + 8 + struct tcp_rtt_storage { 9 + __u32 invoked; 10 + __u32 dsack_dups; 11 + __u32 delivered; 12 + __u32 delivered_ce; 13 + __u32 icsk_retransmits; 14 + }; 15 + 16 + struct bpf_map_def SEC("maps") socket_storage_map = { 17 + .type = BPF_MAP_TYPE_SK_STORAGE, 18 + .key_size = sizeof(int), 19 + .value_size = sizeof(struct tcp_rtt_storage), 20 + .map_flags = BPF_F_NO_PREALLOC, 21 + }; 22 + BPF_ANNOTATE_KV_PAIR(socket_storage_map, int, struct tcp_rtt_storage); 23 + 24 + SEC("sockops") 25 + int _sockops(struct bpf_sock_ops *ctx) 26 + { 27 + struct tcp_rtt_storage *storage; 28 + struct bpf_tcp_sock *tcp_sk; 29 + int op = (int) ctx->op; 30 + struct bpf_sock *sk; 31 + 32 + sk = ctx->sk; 33 + if (!sk) 34 + return 1; 35 + 36 + storage = bpf_sk_storage_get(&socket_storage_map, sk, 0, 37 + BPF_SK_STORAGE_GET_F_CREATE); 38 + if (!storage) 39 + return 1; 40 + 41 + if (op == BPF_SOCK_OPS_TCP_CONNECT_CB) { 42 + bpf_sock_ops_cb_flags_set(ctx, BPF_SOCK_OPS_RTT_CB_FLAG); 43 + return 1; 44 + } 45 + 46 + if (op != BPF_SOCK_OPS_RTT_CB) 47 + return 1; 48 + 49 + tcp_sk = bpf_tcp_sock(sk); 50 + if (!tcp_sk) 51 + return 1; 52 + 53 + storage->invoked++; 54 + 55 + storage->dsack_dups = tcp_sk->dsack_dups; 56 + storage->delivered = tcp_sk->delivered; 57 + storage->delivered_ce = tcp_sk->delivered_ce; 58 + storage->icsk_retransmits = tcp_sk->icsk_retransmits; 59 + 60 + return 1; 61 + }
+2 -1
tools/testing/selftests/bpf/progs/test_jhash.h
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 // Copyright (c) 2019 Facebook 3 + #include <features.h> 3 4 4 5 typedef unsigned int u32; 5 6 6 - static __attribute__((always_inline)) u32 rol32(u32 word, unsigned int shift) 7 + static __always_inline u32 rol32(u32 word, unsigned int shift) 7 8 { 8 9 return (word << shift) | (word >> ((-shift) & 31)); 9 10 }
+12 -11
tools/testing/selftests/bpf/progs/test_seg6_loop.c
··· 54 54 unsigned char value[0]; 55 55 } BPF_PACKET_HEADER; 56 56 57 - static __attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb) 57 + static __always_inline struct ip6_srh_t *get_srh(struct __sk_buff *skb) 58 58 { 59 59 void *cursor, *data_end; 60 60 struct ip6_srh_t *srh; ··· 88 88 return srh; 89 89 } 90 90 91 - static __attribute__((always_inline)) 92 - int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad, 93 - uint32_t old_pad, uint32_t pad_off) 91 + static __always_inline int update_tlv_pad(struct __sk_buff *skb, 92 + uint32_t new_pad, uint32_t old_pad, 93 + uint32_t pad_off) 94 94 { 95 95 int err; 96 96 ··· 118 118 return 0; 119 119 } 120 120 121 - static __attribute__((always_inline)) 122 - int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh, 123 - uint32_t *tlv_off, uint32_t *pad_size, 124 - uint32_t *pad_off) 121 + static __always_inline int is_valid_tlv_boundary(struct __sk_buff *skb, 122 + struct ip6_srh_t *srh, 123 + uint32_t *tlv_off, 124 + uint32_t *pad_size, 125 + uint32_t *pad_off) 125 126 { 126 127 uint32_t srh_off, cur_off; 127 128 int offset_valid = 0; ··· 178 177 return 0; 179 178 } 180 179 181 - static __attribute__((always_inline)) 182 - int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off, 183 - struct sr6_tlv_t *itlv, uint8_t tlv_size) 180 + static __always_inline int add_tlv(struct __sk_buff *skb, 181 + struct ip6_srh_t *srh, uint32_t tlv_off, 182 + struct sr6_tlv_t *itlv, uint8_t tlv_size) 184 183 { 185 184 uint32_t srh_off = (char *)srh - (char *)(long)skb->data; 186 185 uint8_t len_remaining, new_pad;
+1 -1
tools/testing/selftests/bpf/progs/test_verif_scale2.c
··· 2 2 // Copyright (c) 2019 Facebook 3 3 #include <linux/bpf.h> 4 4 #include "bpf_helpers.h" 5 - #define ATTR __attribute__((always_inline)) 5 + #define ATTR __always_inline 6 6 #include "test_jhash.h" 7 7 8 8 SEC("scale90_inline")
+31
tools/testing/selftests/bpf/progs/xdp_redirect_map.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + struct bpf_map_def SEC("maps") tx_port = { 7 + .type = BPF_MAP_TYPE_DEVMAP, 8 + .key_size = sizeof(int), 9 + .value_size = sizeof(int), 10 + .max_entries = 8, 11 + }; 12 + 13 + SEC("redirect_map_0") 14 + int xdp_redirect_map_0(struct xdp_md *xdp) 15 + { 16 + return bpf_redirect_map(&tx_port, 0, 0); 17 + } 18 + 19 + SEC("redirect_map_1") 20 + int xdp_redirect_map_1(struct xdp_md *xdp) 21 + { 22 + return bpf_redirect_map(&tx_port, 1, 0); 23 + } 24 + 25 + SEC("redirect_map_2") 26 + int xdp_redirect_map_2(struct xdp_md *xdp) 27 + { 28 + return bpf_redirect_map(&tx_port, 2, 0); 29 + } 30 + 31 + char _license[] SEC("license") = "GPL";
+12
tools/testing/selftests/bpf/progs/xdp_tx.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/bpf.h> 4 + #include "bpf_helpers.h" 5 + 6 + SEC("tx") 7 + int xdp_tx(struct xdp_md *xdp) 8 + { 9 + return XDP_TX; 10 + } 11 + 12 + char _license[] SEC("license") = "GPL";
+10
tools/testing/selftests/bpf/test_section_names.c
··· 134 134 {0, BPF_PROG_TYPE_CGROUP_SYSCTL, BPF_CGROUP_SYSCTL}, 135 135 {0, BPF_CGROUP_SYSCTL}, 136 136 }, 137 + { 138 + "cgroup/getsockopt", 139 + {0, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT}, 140 + {0, BPF_CGROUP_GETSOCKOPT}, 141 + }, 142 + { 143 + "cgroup/setsockopt", 144 + {0, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT}, 145 + {0, BPF_CGROUP_SETSOCKOPT}, 146 + }, 137 147 }; 138 148 139 149 static int test_prog_type_by_name(const struct sec_name_test *test)
+1021
tools/testing/selftests/bpf/test_sockopt.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <errno.h> 4 + #include <stdio.h> 5 + #include <unistd.h> 6 + #include <sys/types.h> 7 + #include <sys/socket.h> 8 + #include <netinet/in.h> 9 + 10 + #include <linux/filter.h> 11 + #include <bpf/bpf.h> 12 + #include <bpf/libbpf.h> 13 + 14 + #include "bpf_rlimit.h" 15 + #include "bpf_util.h" 16 + #include "cgroup_helpers.h" 17 + 18 + #define CG_PATH "/sockopt" 19 + 20 + static char bpf_log_buf[4096]; 21 + static bool verbose; 22 + 23 + enum sockopt_test_error { 24 + OK = 0, 25 + DENY_LOAD, 26 + DENY_ATTACH, 27 + EPERM_GETSOCKOPT, 28 + EFAULT_GETSOCKOPT, 29 + EPERM_SETSOCKOPT, 30 + EFAULT_SETSOCKOPT, 31 + }; 32 + 33 + static struct sockopt_test { 34 + const char *descr; 35 + const struct bpf_insn insns[64]; 36 + enum bpf_attach_type attach_type; 37 + enum bpf_attach_type expected_attach_type; 38 + 39 + int set_optname; 40 + int set_level; 41 + const char set_optval[64]; 42 + socklen_t set_optlen; 43 + 44 + int get_optname; 45 + int get_level; 46 + const char get_optval[64]; 47 + socklen_t get_optlen; 48 + socklen_t get_optlen_ret; 49 + 50 + enum sockopt_test_error error; 51 + } tests[] = { 52 + 53 + /* ==================== getsockopt ==================== */ 54 + 55 + { 56 + .descr = "getsockopt: no expected_attach_type", 57 + .insns = { 58 + /* return 1 */ 59 + BPF_MOV64_IMM(BPF_REG_0, 1), 60 + BPF_EXIT_INSN(), 61 + 62 + }, 63 + .attach_type = BPF_CGROUP_GETSOCKOPT, 64 + .expected_attach_type = 0, 65 + .error = DENY_LOAD, 66 + }, 67 + { 68 + .descr = "getsockopt: wrong expected_attach_type", 69 + .insns = { 70 + /* return 1 */ 71 + BPF_MOV64_IMM(BPF_REG_0, 1), 72 + BPF_EXIT_INSN(), 73 + 74 + }, 75 + .attach_type = BPF_CGROUP_GETSOCKOPT, 76 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 77 + .error = DENY_ATTACH, 78 + }, 79 + { 80 + .descr = "getsockopt: bypass bpf hook", 81 + .insns = { 82 + /* return 1 */ 83 + BPF_MOV64_IMM(BPF_REG_0, 1), 84 + BPF_EXIT_INSN(), 85 + }, 86 + .attach_type = BPF_CGROUP_GETSOCKOPT, 87 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 88 + 89 + .get_level = SOL_IP, 90 + .set_level = SOL_IP, 91 + 92 + .get_optname = IP_TOS, 93 + .set_optname = IP_TOS, 94 + 95 + .set_optval = { 1 << 3 }, 96 + .set_optlen = 1, 97 + 98 + .get_optval = { 1 << 3 }, 99 + .get_optlen = 1, 100 + }, 101 + { 102 + .descr = "getsockopt: return EPERM from bpf hook", 103 + .insns = { 104 + BPF_MOV64_IMM(BPF_REG_0, 0), 105 + BPF_EXIT_INSN(), 106 + }, 107 + .attach_type = BPF_CGROUP_GETSOCKOPT, 108 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 109 + 110 + .get_level = SOL_IP, 111 + .get_optname = IP_TOS, 112 + 113 + .get_optlen = 1, 114 + .error = EPERM_GETSOCKOPT, 115 + }, 116 + { 117 + .descr = "getsockopt: no optval bounds check, deny loading", 118 + .insns = { 119 + /* r6 = ctx->optval */ 120 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 121 + offsetof(struct bpf_sockopt, optval)), 122 + 123 + /* ctx->optval[0] = 0x80 */ 124 + BPF_MOV64_IMM(BPF_REG_0, 0x80), 125 + BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_0, 0), 126 + 127 + /* return 1 */ 128 + BPF_MOV64_IMM(BPF_REG_0, 1), 129 + BPF_EXIT_INSN(), 130 + }, 131 + .attach_type = BPF_CGROUP_GETSOCKOPT, 132 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 133 + .error = DENY_LOAD, 134 + }, 135 + { 136 + .descr = "getsockopt: read ctx->level", 137 + .insns = { 138 + /* r6 = ctx->level */ 139 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 140 + offsetof(struct bpf_sockopt, level)), 141 + 142 + /* if (ctx->level == 123) { */ 143 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 4), 144 + /* ctx->retval = 0 */ 145 + BPF_MOV64_IMM(BPF_REG_0, 0), 146 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 147 + offsetof(struct bpf_sockopt, retval)), 148 + /* return 1 */ 149 + BPF_MOV64_IMM(BPF_REG_0, 1), 150 + BPF_JMP_A(1), 151 + /* } else { */ 152 + /* return 0 */ 153 + BPF_MOV64_IMM(BPF_REG_0, 0), 154 + /* } */ 155 + BPF_EXIT_INSN(), 156 + }, 157 + .attach_type = BPF_CGROUP_GETSOCKOPT, 158 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT,
159 + 160 + .get_level = 123, 161 + 162 + .get_optlen = 1, 163 + }, 164 + { 165 + .descr = "getsockopt: deny writing to ctx->level", 166 + .insns = { 167 + /* ctx->level = 1 */ 168 + BPF_MOV64_IMM(BPF_REG_0, 1), 169 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 170 + offsetof(struct bpf_sockopt, level)), 171 + BPF_EXIT_INSN(), 172 + }, 173 + .attach_type = BPF_CGROUP_GETSOCKOPT, 174 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 175 + 176 + .error = DENY_LOAD, 177 + }, 178 + { 179 + .descr = "getsockopt: read ctx->optname", 180 + .insns = { 181 + /* r6 = ctx->optname */ 182 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 183 + offsetof(struct bpf_sockopt, optname)), 184 + 185 + /* if (ctx->optname == 123) { */ 186 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 4), 187 + /* ctx->retval = 0 */ 188 + BPF_MOV64_IMM(BPF_REG_0, 0), 189 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 190 + offsetof(struct bpf_sockopt, retval)), 191 + /* return 1 */ 192 + BPF_MOV64_IMM(BPF_REG_0, 1), 193 + BPF_JMP_A(1), 194 + /* } else { */ 195 + /* return 0 */ 196 + BPF_MOV64_IMM(BPF_REG_0, 0), 197 + /* } */ 198 + BPF_EXIT_INSN(), 199 + }, 200 + .attach_type = BPF_CGROUP_GETSOCKOPT, 201 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 202 + 203 + .get_optname = 123, 204 + 205 + .get_optlen = 1, 206 + }, 207 + { 208 + .descr = "getsockopt: read ctx->retval", 209 + .insns = { 210 + /* r6 = ctx->retval */ 211 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 212 + offsetof(struct bpf_sockopt, retval)), 213 + 214 + /* return 1 */ 215 + BPF_MOV64_IMM(BPF_REG_0, 1), 216 + BPF_EXIT_INSN(), 217 + }, 218 + .attach_type = BPF_CGROUP_GETSOCKOPT, 219 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 220 + 221 + .get_level = SOL_IP, 222 + .get_optname = IP_TOS, 223 + .get_optlen = 1, 224 + }, 225 + { 226 + .descr = "getsockopt: deny writing to ctx->optname", 227 + .insns = { 228 + /* ctx->optname = 1 */ 229 + BPF_MOV64_IMM(BPF_REG_0, 1), 230 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 231 + offsetof(struct bpf_sockopt, optname)), 232 + BPF_EXIT_INSN(), 233 + }, 234 + .attach_type = BPF_CGROUP_GETSOCKOPT, 235 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 236 + 237 + .error = DENY_LOAD, 238 + }, 239 + { 240 + .descr = "getsockopt: read ctx->optlen", 241 + .insns = { 242 + /* r6 = ctx->optlen */ 243 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 244 + offsetof(struct bpf_sockopt, optlen)), 245 + 246 + /* if (ctx->optlen == 64) { */ 247 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 64, 4), 248 + /* ctx->retval = 0 */ 249 + BPF_MOV64_IMM(BPF_REG_0, 0), 250 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 251 + offsetof(struct bpf_sockopt, retval)), 252 + /* return 1 */ 253 + BPF_MOV64_IMM(BPF_REG_0, 1), 254 + BPF_JMP_A(1), 255 + /* } else { */ 256 + /* return 0 */ 257 + BPF_MOV64_IMM(BPF_REG_0, 0), 258 + /* } */ 259 + BPF_EXIT_INSN(), 260 + }, 261 + .attach_type = BPF_CGROUP_GETSOCKOPT, 262 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 263 + 264 + .get_optlen = 64, 265 + }, 266 + { 267 + .descr = "getsockopt: deny bigger ctx->optlen", 268 + .insns = { 269 + /* ctx->optlen = 65 */ 270 + BPF_MOV64_IMM(BPF_REG_0, 65), 271 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 272 + offsetof(struct bpf_sockopt, optlen)), 273 + 274 + /* ctx->retval = 0 */ 275 + BPF_MOV64_IMM(BPF_REG_0, 0), 276 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 277 + offsetof(struct bpf_sockopt, retval)), 278 + 279 + /* return 1 */ 280 + BPF_MOV64_IMM(BPF_REG_0, 1), 281 + BPF_EXIT_INSN(), 282 + }, 283 + .attach_type = BPF_CGROUP_GETSOCKOPT, 284 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 285 + 286 + .get_optlen = 64, 287 + 288 + .error = EFAULT_GETSOCKOPT, 289 + }, 290 + { 291 + .descr = "getsockopt: deny arbitrary ctx->retval", 292 + .insns = { 293 + /* ctx->retval = 123 */ 294 + BPF_MOV64_IMM(BPF_REG_0, 123), 295 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 296 + offsetof(struct bpf_sockopt, retval)), 297 + 298 + /* return 1 */ 299 + BPF_MOV64_IMM(BPF_REG_0, 1), 300 + BPF_EXIT_INSN(), 301 + }, 302 + .attach_type = BPF_CGROUP_GETSOCKOPT,
303 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 304 + 305 + .get_optlen = 64, 306 + 307 + .error = EFAULT_GETSOCKOPT, 308 + }, 309 + { 310 + .descr = "getsockopt: support smaller ctx->optlen", 311 + .insns = { 312 + /* ctx->optlen = 32 */ 313 + BPF_MOV64_IMM(BPF_REG_0, 32), 314 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 315 + offsetof(struct bpf_sockopt, optlen)), 316 + /* ctx->retval = 0 */ 317 + BPF_MOV64_IMM(BPF_REG_0, 0), 318 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 319 + offsetof(struct bpf_sockopt, retval)), 320 + /* return 1 */ 321 + BPF_MOV64_IMM(BPF_REG_0, 1), 322 + BPF_EXIT_INSN(), 323 + }, 324 + .attach_type = BPF_CGROUP_GETSOCKOPT, 325 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 326 + 327 + .get_optlen = 64, 328 + .get_optlen_ret = 32, 329 + }, 330 + { 331 + .descr = "getsockopt: deny writing to ctx->optval", 332 + .insns = { 333 + /* ctx->optval = 1 */ 334 + BPF_MOV64_IMM(BPF_REG_0, 1), 335 + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 336 + offsetof(struct bpf_sockopt, optval)), 337 + BPF_EXIT_INSN(), 338 + }, 339 + .attach_type = BPF_CGROUP_GETSOCKOPT, 340 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 341 + 342 + .error = DENY_LOAD, 343 + }, 344 + { 345 + .descr = "getsockopt: deny writing to ctx->optval_end", 346 + .insns = { 347 + /* ctx->optval_end = 1 */ 348 + BPF_MOV64_IMM(BPF_REG_0, 1), 349 + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 350 + offsetof(struct bpf_sockopt, optval_end)), 351 + BPF_EXIT_INSN(), 352 + }, 353 + .attach_type = BPF_CGROUP_GETSOCKOPT, 354 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 355 + 356 + .error = DENY_LOAD, 357 + }, 358 + { 359 + .descr = "getsockopt: rewrite value", 360 + .insns = { 361 + /* r6 = ctx->optval */ 362 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 363 + offsetof(struct bpf_sockopt, optval)), 364 + /* r2 = ctx->optval */ 365 + BPF_MOV64_REG(BPF_REG_2, BPF_REG_6), 366 + /* r6 = ctx->optval + 1 */ 367 + BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 1), 368 + 369 + /* r7 = ctx->optval_end */ 370 + BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_1, 371 + offsetof(struct bpf_sockopt, optval_end)), 372 + 373 + /* if (ctx->optval + 1 <= ctx->optval_end) { */ 374 + BPF_JMP_REG(BPF_JGT, BPF_REG_6, BPF_REG_7, 1), 375 + /* ctx->optval[0] = 0xF0 */ 376 + BPF_ST_MEM(BPF_B, BPF_REG_2, 0, 0xF0), 377 + /* } */ 378 + 379 + /* ctx->retval = 0 */ 380 + BPF_MOV64_IMM(BPF_REG_0, 0), 381 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 382 + offsetof(struct bpf_sockopt, retval)), 383 + 384 + /* return 1*/ 385 + BPF_MOV64_IMM(BPF_REG_0, 1), 386 + BPF_EXIT_INSN(), 387 + }, 388 + .attach_type = BPF_CGROUP_GETSOCKOPT, 389 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 390 + 391 + .get_level = SOL_IP, 392 + .get_optname = IP_TOS, 393 + 394 + .get_optval = { 0xF0 }, 395 + .get_optlen = 1, 396 + }, 397 + 398 + /* ==================== setsockopt ==================== */ 399 + 400 + { 401 + .descr = "setsockopt: no expected_attach_type", 402 + .insns = { 403 + /* return 1 */ 404 + BPF_MOV64_IMM(BPF_REG_0, 1), 405 + BPF_EXIT_INSN(), 406 + 407 + }, 408 + .attach_type = BPF_CGROUP_SETSOCKOPT, 409 + .expected_attach_type = 0, 410 + .error = DENY_LOAD, 411 + }, 412 + { 413 + .descr = "setsockopt: wrong expected_attach_type", 414 + .insns = { 415 + /* return 1 */ 416 + BPF_MOV64_IMM(BPF_REG_0, 1), 417 + BPF_EXIT_INSN(), 418 + 419 + }, 420 + .attach_type = BPF_CGROUP_SETSOCKOPT, 421 + .expected_attach_type = BPF_CGROUP_GETSOCKOPT, 422 + .error = DENY_ATTACH, 423 + }, 424 + { 425 + .descr = "setsockopt: bypass bpf hook", 426 + .insns = { 427 + /* return 1 */ 428 + BPF_MOV64_IMM(BPF_REG_0, 1), 429 + BPF_EXIT_INSN(), 430 + }, 431 + .attach_type = BPF_CGROUP_SETSOCKOPT, 432 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 433 + 434 + .get_level = SOL_IP, 435 + .set_level = SOL_IP, 436 + 437 + .get_optname = IP_TOS, 438 + .set_optname = IP_TOS, 439 + 440 + .set_optval = { 1 << 3 }, 441 + .set_optlen = 1, 442 + 443 + .get_optval = { 1 << 3 }, 444 + .get_optlen = 1, 445 + }, 446 + { 447 + .descr = "setsockopt: return EPERM from bpf hook", 448 + .insns = { 449 + /* return 0 */ 450 + BPF_MOV64_IMM(BPF_REG_0, 0), 451 + BPF_EXIT_INSN(), 452 + }, 453 + .attach_type = BPF_CGROUP_SETSOCKOPT, 454 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 455 + 456 + .set_level = SOL_IP, 457 + .set_optname = IP_TOS, 458 + 459 + .set_optlen = 1, 460 + .error = EPERM_SETSOCKOPT, 461 + }, 462 + { 463 + .descr = "setsockopt: no optval bounds check, deny loading", 464 + .insns = { 465 + /* r6 = ctx->optval */ 466 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 467 + offsetof(struct bpf_sockopt, optval)), 468 + 469 + /* r0 = ctx->optval[0] */ 470 + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0), 471 + 472 + /* return 1 */ 473 + BPF_MOV64_IMM(BPF_REG_0, 1), 474 + BPF_EXIT_INSN(), 475 + }, 476 + .attach_type = BPF_CGROUP_SETSOCKOPT, 477 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 478 + .error = DENY_LOAD, 479 + }, 480 + { 481 + .descr = "setsockopt: read ctx->level", 482 + .insns = { 483 + /* r6 = ctx->level */ 484 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 485 + offsetof(struct bpf_sockopt, level)), 486 + 487 + /* if (ctx->level == 123) { */ 488 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 4), 489 + /* ctx->optlen = -1 */ 490 + BPF_MOV64_IMM(BPF_REG_0, -1), 491 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 492 + offsetof(struct bpf_sockopt, optlen)), 493 + /* return 1 */ 494 + BPF_MOV64_IMM(BPF_REG_0, 1), 495 + BPF_JMP_A(1), 496 + /* } else { */ 497 + /* return 0 */ 498 + BPF_MOV64_IMM(BPF_REG_0, 0), 499 + /* } */ 500 + BPF_EXIT_INSN(), 501 + }, 502 + .attach_type = BPF_CGROUP_SETSOCKOPT, 503 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 504 + 505 + .set_level = 123, 506 + 507 + .set_optlen = 1, 508 + }, 509 + { 510 + .descr = "setsockopt: allow changing ctx->level", 511 + .insns = { 512 + /* ctx->level = SOL_IP */ 513 + BPF_MOV64_IMM(BPF_REG_0, SOL_IP), 514 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 515 + offsetof(struct bpf_sockopt, level)), 516 + /* return 1 */ 517 + BPF_MOV64_IMM(BPF_REG_0, 1), 518 +
BPF_EXIT_INSN(), 519 + }, 520 + .attach_type = BPF_CGROUP_SETSOCKOPT, 521 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 522 + 523 + .get_level = SOL_IP, 524 + .set_level = 234, /* should be rewritten to SOL_IP */ 525 + 526 + .get_optname = IP_TOS, 527 + .set_optname = IP_TOS, 528 + 529 + .set_optval = { 1 << 3 }, 530 + .set_optlen = 1, 531 + .get_optval = { 1 << 3 }, 532 + .get_optlen = 1, 533 + }, 534 + { 535 + .descr = "setsockopt: read ctx->optname", 536 + .insns = { 537 + /* r6 = ctx->optname */ 538 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 539 + offsetof(struct bpf_sockopt, optname)), 540 + 541 + /* if (ctx->optname == 123) { */ 542 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 4), 543 + /* ctx->optlen = -1 */ 544 + BPF_MOV64_IMM(BPF_REG_0, -1), 545 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 546 + offsetof(struct bpf_sockopt, optlen)), 547 + /* return 1 */ 548 + BPF_MOV64_IMM(BPF_REG_0, 1), 549 + BPF_JMP_A(1), 550 + /* } else { */ 551 + /* return 0 */ 552 + BPF_MOV64_IMM(BPF_REG_0, 0), 553 + /* } */ 554 + BPF_EXIT_INSN(), 555 + }, 556 + .attach_type = BPF_CGROUP_SETSOCKOPT, 557 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 558 + 559 + .set_optname = 123, 560 + 561 + .set_optlen = 1, 562 + }, 563 + { 564 + .descr = "setsockopt: allow changing ctx->optname", 565 + .insns = { 566 + /* ctx->optname = IP_TOS */ 567 + BPF_MOV64_IMM(BPF_REG_0, IP_TOS), 568 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 569 + offsetof(struct bpf_sockopt, optname)), 570 + /* return 1 */ 571 + BPF_MOV64_IMM(BPF_REG_0, 1), 572 + BPF_EXIT_INSN(), 573 + }, 574 + .attach_type = BPF_CGROUP_SETSOCKOPT, 575 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 576 + 577 + .get_level = SOL_IP, 578 + .set_level = SOL_IP, 579 + 580 + .get_optname = IP_TOS, 581 + .set_optname = 456, /* should be rewritten to IP_TOS */ 582 + 583 + .set_optval = { 1 << 3 }, 584 + .set_optlen = 1, 585 + .get_optval = { 1 << 3 }, 586 + .get_optlen = 1, 587 + }, 588 + { 589 + .descr = "setsockopt: read ctx->optlen", 590 + 
.insns = { 591 + /* r6 = ctx->optlen */ 592 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 593 + offsetof(struct bpf_sockopt, optlen)), 594 + 595 + /* if (ctx->optlen == 64) { */ 596 + BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 64, 4), 597 + /* ctx->optlen = -1 */ 598 + BPF_MOV64_IMM(BPF_REG_0, -1), 599 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 600 + offsetof(struct bpf_sockopt, optlen)), 601 + /* return 1 */ 602 + BPF_MOV64_IMM(BPF_REG_0, 1), 603 + BPF_JMP_A(1), 604 + /* } else { */ 605 + /* return 0 */ 606 + BPF_MOV64_IMM(BPF_REG_0, 0), 607 + /* } */ 608 + BPF_EXIT_INSN(), 609 + }, 610 + .attach_type = BPF_CGROUP_SETSOCKOPT, 611 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 612 + 613 + .set_optlen = 64, 614 + }, 615 + { 616 + .descr = "setsockopt: ctx->optlen == -1 is ok", 617 + .insns = { 618 + /* ctx->optlen = -1 */ 619 + BPF_MOV64_IMM(BPF_REG_0, -1), 620 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 621 + offsetof(struct bpf_sockopt, optlen)), 622 + /* return 1 */ 623 + BPF_MOV64_IMM(BPF_REG_0, 1), 624 + BPF_EXIT_INSN(), 625 + }, 626 + .attach_type = BPF_CGROUP_SETSOCKOPT, 627 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 628 + 629 + .set_optlen = 64, 630 + }, 631 + { 632 + .descr = "setsockopt: deny ctx->optlen < 0 (except -1)", 633 + .insns = { 634 + /* ctx->optlen = -2 */ 635 + BPF_MOV64_IMM(BPF_REG_0, -2), 636 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 637 + offsetof(struct bpf_sockopt, optlen)), 638 + /* return 1 */ 639 + BPF_MOV64_IMM(BPF_REG_0, 1), 640 + BPF_EXIT_INSN(), 641 + }, 642 + .attach_type = BPF_CGROUP_SETSOCKOPT, 643 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 644 + 645 + .set_optlen = 4, 646 + 647 + .error = EFAULT_SETSOCKOPT, 648 + }, 649 + { 650 + .descr = "setsockopt: deny ctx->optlen > input optlen", 651 + .insns = { 652 + /* ctx->optlen = 65 */ 653 + BPF_MOV64_IMM(BPF_REG_0, 65), 654 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 655 + offsetof(struct bpf_sockopt, optlen)), 656 + BPF_MOV64_IMM(BPF_REG_0, 1), 657 + BPF_EXIT_INSN(), 658 + }, 659 + 
.attach_type = BPF_CGROUP_SETSOCKOPT, 660 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 661 + 662 + .set_optlen = 64, 663 + 664 + .error = EFAULT_SETSOCKOPT, 665 + }, 666 + { 667 + .descr = "setsockopt: allow changing ctx->optlen within bounds", 668 + .insns = { 669 + /* r6 = ctx->optval */ 670 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 671 + offsetof(struct bpf_sockopt, optval)), 672 + /* r2 = ctx->optval */ 673 + BPF_MOV64_REG(BPF_REG_2, BPF_REG_6), 674 + /* r6 = ctx->optval + 1 */ 675 + BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 1), 676 + 677 + /* r7 = ctx->optval_end */ 678 + BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_1, 679 + offsetof(struct bpf_sockopt, optval_end)), 680 + 681 + /* if (ctx->optval + 1 <= ctx->optval_end) { */ 682 + BPF_JMP_REG(BPF_JGT, BPF_REG_6, BPF_REG_7, 1), 683 + /* ctx->optval[0] = 1 << 3 */ 684 + BPF_ST_MEM(BPF_B, BPF_REG_2, 0, 1 << 3), 685 + /* } */ 686 + 687 + /* ctx->optlen = 1 */ 688 + BPF_MOV64_IMM(BPF_REG_0, 1), 689 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 690 + offsetof(struct bpf_sockopt, optlen)), 691 + 692 + /* return 1*/ 693 + BPF_MOV64_IMM(BPF_REG_0, 1), 694 + BPF_EXIT_INSN(), 695 + }, 696 + .attach_type = BPF_CGROUP_SETSOCKOPT, 697 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 698 + 699 + .get_level = SOL_IP, 700 + .set_level = SOL_IP, 701 + 702 + .get_optname = IP_TOS, 703 + .set_optname = IP_TOS, 704 + 705 + .set_optval = { 1, 1, 1, 1 }, 706 + .set_optlen = 4, 707 + .get_optval = { 1 << 3 }, 708 + .get_optlen = 1, 709 + }, 710 + { 711 + .descr = "setsockopt: deny write ctx->retval", 712 + .insns = { 713 + /* ctx->retval = 0 */ 714 + BPF_MOV64_IMM(BPF_REG_0, 0), 715 + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 716 + offsetof(struct bpf_sockopt, retval)), 717 + 718 + /* return 1 */ 719 + BPF_MOV64_IMM(BPF_REG_0, 1), 720 + BPF_EXIT_INSN(), 721 + }, 722 + .attach_type = BPF_CGROUP_SETSOCKOPT, 723 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 724 + 725 + .error = DENY_LOAD, 726 + }, 727 + { 728 + .descr = "setsockopt: deny read 
ctx->retval", 729 + .insns = { 730 + /* r6 = ctx->retval */ 731 + BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1, 732 + offsetof(struct bpf_sockopt, retval)), 733 + 734 + /* return 1 */ 735 + BPF_MOV64_IMM(BPF_REG_0, 1), 736 + BPF_EXIT_INSN(), 737 + }, 738 + .attach_type = BPF_CGROUP_SETSOCKOPT, 739 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 740 + 741 + .error = DENY_LOAD, 742 + }, 743 + { 744 + .descr = "setsockopt: deny writing to ctx->optval", 745 + .insns = { 746 + /* ctx->optval = 1 */ 747 + BPF_MOV64_IMM(BPF_REG_0, 1), 748 + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 749 + offsetof(struct bpf_sockopt, optval)), 750 + BPF_EXIT_INSN(), 751 + }, 752 + .attach_type = BPF_CGROUP_SETSOCKOPT, 753 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 754 + 755 + .error = DENY_LOAD, 756 + }, 757 + { 758 + .descr = "setsockopt: deny writing to ctx->optval_end", 759 + .insns = { 760 + /* ctx->optval_end = 1 */ 761 + BPF_MOV64_IMM(BPF_REG_0, 1), 762 + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 763 + offsetof(struct bpf_sockopt, optval_end)), 764 + BPF_EXIT_INSN(), 765 + }, 766 + .attach_type = BPF_CGROUP_SETSOCKOPT, 767 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 768 + 769 + .error = DENY_LOAD, 770 + }, 771 + { 772 + .descr = "setsockopt: allow IP_TOS <= 128", 773 + .insns = { 774 + /* r6 = ctx->optval */ 775 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 776 + offsetof(struct bpf_sockopt, optval)), 777 + /* r7 = ctx->optval + 1 */ 778 + BPF_MOV64_REG(BPF_REG_7, BPF_REG_6), 779 + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 1), 780 + 781 + /* r8 = ctx->optval_end */ 782 + BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_1, 783 + offsetof(struct bpf_sockopt, optval_end)), 784 + 785 + /* if (ctx->optval + 1 <= ctx->optval_end) { */ 786 + BPF_JMP_REG(BPF_JGT, BPF_REG_7, BPF_REG_8, 4), 787 + 788 + /* r9 = ctx->optval[0] */ 789 + BPF_LDX_MEM(BPF_B, BPF_REG_9, BPF_REG_6, 0), 790 + 791 + /* if (ctx->optval[0] < 128) */ 792 + BPF_JMP_IMM(BPF_JGT, BPF_REG_9, 128, 2), 793 + BPF_MOV64_IMM(BPF_REG_0, 1), 794 
+ BPF_JMP_A(1), 795 + /* } */ 796 + 797 + /* } else { */ 798 + BPF_MOV64_IMM(BPF_REG_0, 0), 799 + /* } */ 800 + 801 + BPF_EXIT_INSN(), 802 + }, 803 + .attach_type = BPF_CGROUP_SETSOCKOPT, 804 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 805 + 806 + .get_level = SOL_IP, 807 + .set_level = SOL_IP, 808 + 809 + .get_optname = IP_TOS, 810 + .set_optname = IP_TOS, 811 + 812 + .set_optval = { 0x80 }, 813 + .set_optlen = 1, 814 + .get_optval = { 0x80 }, 815 + .get_optlen = 1, 816 + }, 817 + { 818 + .descr = "setsockopt: deny IP_TOS > 128", 819 + .insns = { 820 + /* r6 = ctx->optval */ 821 + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 822 + offsetof(struct bpf_sockopt, optval)), 823 + /* r7 = ctx->optval + 1 */ 824 + BPF_MOV64_REG(BPF_REG_7, BPF_REG_6), 825 + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 1), 826 + 827 + /* r8 = ctx->optval_end */ 828 + BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_1, 829 + offsetof(struct bpf_sockopt, optval_end)), 830 + 831 + /* if (ctx->optval + 1 <= ctx->optval_end) { */ 832 + BPF_JMP_REG(BPF_JGT, BPF_REG_7, BPF_REG_8, 4), 833 + 834 + /* r9 = ctx->optval[0] */ 835 + BPF_LDX_MEM(BPF_B, BPF_REG_9, BPF_REG_6, 0), 836 + 837 + /* if (ctx->optval[0] < 128) */ 838 + BPF_JMP_IMM(BPF_JGT, BPF_REG_9, 128, 2), 839 + BPF_MOV64_IMM(BPF_REG_0, 1), 840 + BPF_JMP_A(1), 841 + /* } */ 842 + 843 + /* } else { */ 844 + BPF_MOV64_IMM(BPF_REG_0, 0), 845 + /* } */ 846 + 847 + BPF_EXIT_INSN(), 848 + }, 849 + .attach_type = BPF_CGROUP_SETSOCKOPT, 850 + .expected_attach_type = BPF_CGROUP_SETSOCKOPT, 851 + 852 + .get_level = SOL_IP, 853 + .set_level = SOL_IP, 854 + 855 + .get_optname = IP_TOS, 856 + .set_optname = IP_TOS, 857 + 858 + .set_optval = { 0x81 }, 859 + .set_optlen = 1, 860 + .get_optval = { 0x00 }, 861 + .get_optlen = 1, 862 + 863 + .error = EPERM_SETSOCKOPT, 864 + }, 865 + }; 866 + 867 + static int load_prog(const struct bpf_insn *insns, 868 + enum bpf_attach_type expected_attach_type) 869 + { 870 + struct bpf_load_program_attr attr = { 871 + .prog_type = 
BPF_PROG_TYPE_CGROUP_SOCKOPT, 872 + .expected_attach_type = expected_attach_type, 873 + .insns = insns, 874 + .license = "GPL", 875 + .log_level = 2, 876 + }; 877 + int fd; 878 + 879 + for (; 880 + insns[attr.insns_cnt].code != (BPF_JMP | BPF_EXIT); 881 + attr.insns_cnt++) { 882 + } 883 + attr.insns_cnt++; 884 + 885 + fd = bpf_load_program_xattr(&attr, bpf_log_buf, sizeof(bpf_log_buf)); 886 + if (verbose && fd < 0) 887 + fprintf(stderr, "%s\n", bpf_log_buf); 888 + 889 + return fd; 890 + } 891 + 892 + static int run_test(int cgroup_fd, struct sockopt_test *test) 893 + { 894 + int sock_fd, err, prog_fd; 895 + void *optval = NULL; 896 + int ret = 0; 897 + 898 + prog_fd = load_prog(test->insns, test->expected_attach_type); 899 + if (prog_fd < 0) { 900 + if (test->error == DENY_LOAD) 901 + return 0; 902 + 903 + log_err("Failed to load BPF program"); 904 + return -1; 905 + } 906 + 907 + err = bpf_prog_attach(prog_fd, cgroup_fd, test->attach_type, 0); 908 + if (err < 0) { 909 + if (test->error == DENY_ATTACH) 910 + goto close_prog_fd; 911 + 912 + log_err("Failed to attach BPF program"); 913 + ret = -1; 914 + goto close_prog_fd; 915 + } 916 + 917 + sock_fd = socket(AF_INET, SOCK_STREAM, 0); 918 + if (sock_fd < 0) { 919 + log_err("Failed to create AF_INET socket"); 920 + ret = -1; 921 + goto detach_prog; 922 + } 923 + 924 + if (test->set_optlen) { 925 + err = setsockopt(sock_fd, test->set_level, test->set_optname, 926 + test->set_optval, test->set_optlen); 927 + if (err) { 928 + if (errno == EPERM && test->error == EPERM_SETSOCKOPT) 929 + goto close_sock_fd; 930 + if (errno == EFAULT && test->error == EFAULT_SETSOCKOPT) 931 + goto free_optval; 932 + 933 + log_err("Failed to call setsockopt"); 934 + ret = -1; 935 + goto close_sock_fd; 936 + } 937 + } 938 + 939 + if (test->get_optlen) { 940 + optval = malloc(test->get_optlen); 941 + socklen_t optlen = test->get_optlen; 942 + socklen_t expected_get_optlen = test->get_optlen_ret ?: 943 + test->get_optlen; 944 + 945 + err = 
getsockopt(sock_fd, test->get_level, test->get_optname, 946 + optval, &optlen); 947 + if (err) { 948 + if (errno == EPERM && test->error == EPERM_GETSOCKOPT) 949 + goto free_optval; 950 + if (errno == EFAULT && test->error == EFAULT_GETSOCKOPT) 951 + goto free_optval; 952 + 953 + log_err("Failed to call getsockopt"); 954 + ret = -1; 955 + goto free_optval; 956 + } 957 + 958 + if (optlen != expected_get_optlen) { 959 + errno = 0; 960 + log_err("getsockopt returned unexpected optlen"); 961 + ret = -1; 962 + goto free_optval; 963 + } 964 + 965 + if (memcmp(optval, test->get_optval, optlen) != 0) { 966 + errno = 0; 967 + log_err("getsockopt returned unexpected optval"); 968 + ret = -1; 969 + goto free_optval; 970 + } 971 + } 972 + 973 + ret = test->error != OK; 974 + 975 + free_optval: 976 + free(optval); 977 + close_sock_fd: 978 + close(sock_fd); 979 + detach_prog: 980 + bpf_prog_detach2(prog_fd, cgroup_fd, test->attach_type); 981 + close_prog_fd: 982 + close(prog_fd); 983 + return ret; 984 + } 985 + 986 + int main(int args, char **argv) 987 + { 988 + int err = EXIT_FAILURE, error_cnt = 0; 989 + int cgroup_fd, i; 990 + 991 + if (setup_cgroup_environment()) 992 + goto cleanup_obj; 993 + 994 + cgroup_fd = create_and_get_cgroup(CG_PATH); 995 + if (cgroup_fd < 0) 996 + goto cleanup_cgroup_env; 997 + 998 + if (join_cgroup(CG_PATH)) 999 + goto cleanup_cgroup; 1000 + 1001 + for (i = 0; i < ARRAY_SIZE(tests); i++) { 1002 + int err = run_test(cgroup_fd, &tests[i]); 1003 + 1004 + if (err) 1005 + error_cnt++; 1006 + 1007 + printf("#%d %s: %s\n", i, err ? "FAIL" : "PASS", 1008 + tests[i].descr); 1009 + } 1010 + 1011 + printf("Summary: %ld PASSED, %d FAILED\n", 1012 + ARRAY_SIZE(tests) - error_cnt, error_cnt); 1013 + err = error_cnt ? EXIT_FAILURE : EXIT_SUCCESS; 1014 + 1015 + cleanup_cgroup: 1016 + close(cgroup_fd); 1017 + cleanup_cgroup_env: 1018 + cleanup_cgroup_environment(); 1019 + cleanup_obj: 1020 + return err; 1021 + }
tools/testing/selftests/bpf/test_sockopt_multi.c (new file, +374 lines)
// SPDX-License-Identifier: GPL-2.0

#include <error.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#include <linux/filter.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "bpf_rlimit.h"
#include "bpf_util.h"
#include "cgroup_helpers.h"

static int prog_attach(struct bpf_object *obj, int cgroup_fd, const char *title)
{
	enum bpf_attach_type attach_type;
	enum bpf_prog_type prog_type;
	struct bpf_program *prog;
	int err;

	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
	if (err) {
		log_err("Failed to deduce types for %s BPF program", title);
		return -1;
	}

	prog = bpf_object__find_program_by_title(obj, title);
	if (!prog) {
		log_err("Failed to find %s BPF program", title);
		return -1;
	}

	err = bpf_prog_attach(bpf_program__fd(prog), cgroup_fd,
			      attach_type, BPF_F_ALLOW_MULTI);
	if (err) {
		log_err("Failed to attach %s BPF program", title);
		return -1;
	}

	return 0;
}

static int prog_detach(struct bpf_object *obj, int cgroup_fd, const char *title)
{
	enum bpf_attach_type attach_type;
	enum bpf_prog_type prog_type;
	struct bpf_program *prog;
	int err;

	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
	if (err)
		return -1;

	prog = bpf_object__find_program_by_title(obj, title);
	if (!prog)
		return -1;

	err = bpf_prog_detach2(bpf_program__fd(prog), cgroup_fd,
			       attach_type);
	if (err)
		return -1;

	return 0;
}

static int run_getsockopt_test(struct bpf_object *obj, int cg_parent,
			       int cg_child, int sock_fd)
{
	socklen_t optlen;
	__u8 buf;
	int err;

	/* Set IP_TOS to the expected value (0x80). */

	buf = 0x80;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x80) {
		log_err("Unexpected getsockopt 0x%x != 0x80 without BPF", buf);
		err = -1;
		goto detach;
	}

	/* Attach child program and make sure it returns new value:
	 * - kernel:      -> 0x80
	 * - child:  0x80 -> 0x90
	 */

	err = prog_attach(obj, cg_child, "cgroup/getsockopt/child");
	if (err)
		goto detach;

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x90) {
		log_err("Unexpected getsockopt 0x%x != 0x90", buf);
		err = -1;
		goto detach;
	}

	/* Attach parent program and make sure it returns new value:
	 * - kernel:      -> 0x80
	 * - child:  0x80 -> 0x90
	 * - parent: 0x90 -> 0xA0
	 */

	err = prog_attach(obj, cg_parent, "cgroup/getsockopt/parent");
	if (err)
		goto detach;

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0xA0) {
		log_err("Unexpected getsockopt 0x%x != 0xA0", buf);
		err = -1;
		goto detach;
	}

	/* Setting unexpected initial sockopt should return EPERM:
	 * - kernel: -> 0x40
	 * - child:  unexpected 0x40, EPERM
	 * - parent: unexpected 0x40, EPERM
	 */

	buf = 0x40;
	if (setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1) < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (!err) {
		log_err("Unexpected success from getsockopt(IP_TOS)");
		err = -1;
		goto detach;
	}

	/* Detach child program and make sure we still get EPERM:
	 * - kernel: -> 0x40
	 * - parent: unexpected 0x40, EPERM
	 */

	err = prog_detach(obj, cg_child, "cgroup/getsockopt/child");
	if (err) {
		log_err("Failed to detach child program");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (!err) {
		log_err("Unexpected success from getsockopt(IP_TOS)");
		err = -1;
		goto detach;
	}

	/* Set initial value to the one the parent program expects:
	 * - kernel:      -> 0x90
	 * - parent: 0x90 -> 0xA0
	 */

	buf = 0x90;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0xA0) {
		log_err("Unexpected getsockopt 0x%x != 0xA0", buf);
		err = -1;
		goto detach;
	}

detach:
	prog_detach(obj, cg_child, "cgroup/getsockopt/child");
	prog_detach(obj, cg_parent, "cgroup/getsockopt/parent");

	return err;
}

static int run_setsockopt_test(struct bpf_object *obj, int cg_parent,
			       int cg_child, int sock_fd)
{
	socklen_t optlen;
	__u8 buf;
	int err;

	/* Set IP_TOS to the expected value (0x80). */

	buf = 0x80;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x80) {
		log_err("Unexpected getsockopt 0x%x != 0x80 without BPF", buf);
		err = -1;
		goto detach;
	}

	/* Attach child program and make sure it adds 0x10. */

	err = prog_attach(obj, cg_child, "cgroup/setsockopt");
	if (err)
		goto detach;

	buf = 0x80;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x80 + 0x10) {
		log_err("Unexpected getsockopt 0x%x != 0x80 + 0x10", buf);
		err = -1;
		goto detach;
	}

	/* Attach parent program and make sure it adds another 0x10. */

	err = prog_attach(obj, cg_parent, "cgroup/setsockopt");
	if (err)
		goto detach;

	buf = 0x80;
	err = setsockopt(sock_fd, SOL_IP, IP_TOS, &buf, 1);
	if (err < 0) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto detach;
	}

	buf = 0x00;
	optlen = 1;
	err = getsockopt(sock_fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto detach;
	}

	if (buf != 0x80 + 2 * 0x10) {
		log_err("Unexpected getsockopt 0x%x != 0x80 + 2 * 0x10", buf);
		err = -1;
		goto detach;
	}

detach:
	prog_detach(obj, cg_child, "cgroup/setsockopt");
	prog_detach(obj, cg_parent, "cgroup/setsockopt");

	return err;
}

int main(int argc, char **argv)
{
	struct bpf_prog_load_attr attr = {
		.file = "./sockopt_multi.o",
	};
	int cg_parent = -1, cg_child = -1;
	struct bpf_object *obj = NULL;
	int sock_fd = -1;
	int err = -1;
	int ignored;

	if (setup_cgroup_environment()) {
		log_err("Failed to setup cgroup environment\n");
		goto out;
	}

	cg_parent = create_and_get_cgroup("/parent");
	if (cg_parent < 0) {
		log_err("Failed to create cgroup /parent\n");
		goto out;
	}

	cg_child = create_and_get_cgroup("/parent/child");
	if (cg_child < 0) {
		log_err("Failed to create cgroup /parent/child\n");
		goto out;
	}

	if (join_cgroup("/parent/child")) {
		log_err("Failed to join cgroup /parent/child\n");
		goto out;
	}

	err = bpf_prog_load_xattr(&attr, &obj, &ignored);
	if (err) {
		log_err("Failed to load BPF object");
		goto out;
	}

	sock_fd = socket(AF_INET, SOCK_STREAM, 0);
	if (sock_fd < 0) {
		log_err("Failed to create socket");
		goto out;
	}

	if (run_getsockopt_test(obj, cg_parent, cg_child, sock_fd))
		err = -1;
	printf("test_sockopt_multi: getsockopt %s\n",
	       err ? "FAILED" : "PASSED");

	if (run_setsockopt_test(obj, cg_parent, cg_child, sock_fd))
		err = -1;
	printf("test_sockopt_multi: setsockopt %s\n",
	       err ? "FAILED" : "PASSED");

out:
	close(sock_fd);
	bpf_object__close(obj);
	close(cg_child);
	close(cg_parent);

	printf("test_sockopt_multi: %s\n", err ? "FAILED" : "PASSED");
	return err ? EXIT_FAILURE : EXIT_SUCCESS;
}
tools/testing/selftests/bpf/test_sockopt_sk.c (new file, +211 lines)
// SPDX-License-Identifier: GPL-2.0

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#include <linux/filter.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "bpf_rlimit.h"
#include "bpf_util.h"
#include "cgroup_helpers.h"

#define CG_PATH		"/sockopt"

#define SOL_CUSTOM	0xdeadbeef

static int getsetsockopt(void)
{
	int fd, err;
	union {
		char u8[4];
		__u32 u32;
	} buf = {};
	socklen_t optlen;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		log_err("Failed to create socket");
		return -1;
	}

	/* IP_TOS - BPF bypass */

	buf.u8[0] = 0x08;
	err = setsockopt(fd, SOL_IP, IP_TOS, &buf, 1);
	if (err) {
		log_err("Failed to call setsockopt(IP_TOS)");
		goto err;
	}

	buf.u8[0] = 0x00;
	optlen = 1;
	err = getsockopt(fd, SOL_IP, IP_TOS, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(IP_TOS)");
		goto err;
	}

	if (buf.u8[0] != 0x08) {
		log_err("Unexpected getsockopt(IP_TOS) buf[0] 0x%02x != 0x08",
			buf.u8[0]);
		goto err;
	}

	/* IP_TTL - EPERM */

	buf.u8[0] = 1;
	err = setsockopt(fd, SOL_IP, IP_TTL, &buf, 1);
	if (!err || errno != EPERM) {
		log_err("Unexpected success from setsockopt(IP_TTL)");
		goto err;
	}

	/* SOL_CUSTOM - handled by BPF */

	buf.u8[0] = 0x01;
	err = setsockopt(fd, SOL_CUSTOM, 0, &buf, 1);
	if (err) {
		log_err("Failed to call setsockopt");
		goto err;
	}

	buf.u32 = 0x00;
	optlen = 4;
	err = getsockopt(fd, SOL_CUSTOM, 0, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt");
		goto err;
	}

	if (optlen != 1) {
		log_err("Unexpected optlen %d != 1", optlen);
		goto err;
	}
	if (buf.u8[0] != 0x01) {
		log_err("Unexpected buf[0] 0x%02x != 0x01", buf.u8[0]);
		goto err;
	}

	/* SO_SNDBUF is overwritten */

	buf.u32 = 0x01010101;
	err = setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf, 4);
	if (err) {
		log_err("Failed to call setsockopt(SO_SNDBUF)");
		goto err;
	}

	buf.u32 = 0x00;
	optlen = 4;
	err = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf, &optlen);
	if (err) {
		log_err("Failed to call getsockopt(SO_SNDBUF)");
		goto err;
	}

	if (buf.u32 != 0x55AA * 2) {
		log_err("Unexpected getsockopt(SO_SNDBUF) 0x%x != 0x55AA*2",
			buf.u32);
		goto err;
	}

	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}

static int prog_attach(struct bpf_object *obj, int cgroup_fd, const char *title)
{
	enum bpf_attach_type attach_type;
	enum bpf_prog_type prog_type;
	struct bpf_program *prog;
	int err;

	err = libbpf_prog_type_by_name(title, &prog_type, &attach_type);
	if (err) {
		log_err("Failed to deduce types for %s BPF program", title);
		return -1;
	}

	prog = bpf_object__find_program_by_title(obj, title);
	if (!prog) {
		log_err("Failed to find %s BPF program", title);
		return -1;
	}

	err = bpf_prog_attach(bpf_program__fd(prog), cgroup_fd,
			      attach_type, 0);
	if (err) {
		log_err("Failed to attach %s BPF program", title);
		return -1;
	}

	return 0;
}

static int run_test(int cgroup_fd)
{
	struct bpf_prog_load_attr attr = {
		.file = "./sockopt_sk.o",
	};
	struct bpf_object *obj;
	int ignored;
	int err;

	err = bpf_prog_load_xattr(&attr, &obj, &ignored);
	if (err) {
		log_err("Failed to load BPF object");
		return -1;
	}

	err = prog_attach(obj, cgroup_fd, "cgroup/getsockopt");
	if (err)
		goto close_bpf_object;

	err = prog_attach(obj, cgroup_fd, "cgroup/setsockopt");
	if (err)
		goto close_bpf_object;

	err = getsetsockopt();

close_bpf_object:
	bpf_object__close(obj);
	return err;
}

int main(int argc, char **argv)
{
	int cgroup_fd;
	int err = EXIT_FAILURE;

	if (setup_cgroup_environment())
		goto cleanup_obj;

	cgroup_fd = create_and_get_cgroup(CG_PATH);
	if (cgroup_fd < 0)
		goto cleanup_cgroup_env;

	if (join_cgroup(CG_PATH))
		goto cleanup_cgroup;

	err = run_test(cgroup_fd) ? EXIT_FAILURE : EXIT_SUCCESS;

	printf("test_sockopt_sk: %s\n",
	       err == EXIT_SUCCESS ? "PASSED" : "FAILED");

cleanup_cgroup:
	close(cgroup_fd);
cleanup_cgroup_env:
	cleanup_cgroup_environment();
cleanup_obj:
	return err;
}
+254
tools/testing/selftests/bpf/test_tcp_rtt.c
// SPDX-License-Identifier: GPL-2.0
#include <error.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>

#include <linux/filter.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "bpf_rlimit.h"
#include "bpf_util.h"
#include "cgroup_helpers.h"

#define CG_PATH "/tcp_rtt"

struct tcp_rtt_storage {
	__u32 invoked;
	__u32 dsack_dups;
	__u32 delivered;
	__u32 delivered_ce;
	__u32 icsk_retransmits;
};

static void send_byte(int fd)
{
	char b = 0x55;

	if (write(fd, &b, sizeof(b)) != 1)
		error(1, errno, "Failed to send single byte");
}

static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
		     __u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
		     __u32 icsk_retransmits)
{
	int err = 0;
	struct tcp_rtt_storage val;

	if (bpf_map_lookup_elem(map_fd, &client_fd, &val) < 0)
		error(1, errno, "Failed to read socket storage");

	if (val.invoked != invoked) {
		log_err("%s: unexpected bpf_tcp_sock.invoked %d != %d",
			msg, val.invoked, invoked);
		err++;
	}

	if (val.dsack_dups != dsack_dups) {
		log_err("%s: unexpected bpf_tcp_sock.dsack_dups %d != %d",
			msg, val.dsack_dups, dsack_dups);
		err++;
	}

	if (val.delivered != delivered) {
		log_err("%s: unexpected bpf_tcp_sock.delivered %d != %d",
			msg, val.delivered, delivered);
		err++;
	}

	if (val.delivered_ce != delivered_ce) {
		log_err("%s: unexpected bpf_tcp_sock.delivered_ce %d != %d",
			msg, val.delivered_ce, delivered_ce);
		err++;
	}

	if (val.icsk_retransmits != icsk_retransmits) {
		log_err("%s: unexpected bpf_tcp_sock.icsk_retransmits %d != %d",
			msg, val.icsk_retransmits, icsk_retransmits);
		err++;
	}

	return err;
}

static int connect_to_server(int server_fd)
{
	struct sockaddr_storage addr;
	socklen_t len = sizeof(addr);
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		log_err("Failed to create client socket");
		return -1;
	}

	if (getsockname(server_fd, (struct sockaddr *)&addr, &len)) {
		log_err("Failed to get server addr");
		goto out;
	}

	if (connect(fd, (const struct sockaddr *)&addr, len) < 0) {
		log_err("Failed to connect to server");
		goto out;
	}

	return fd;

out:
	close(fd);
	return -1;
}

static int run_test(int cgroup_fd, int server_fd)
{
	struct bpf_prog_load_attr attr = {
		.prog_type = BPF_PROG_TYPE_SOCK_OPS,
		.file = "./tcp_rtt.o",
		.expected_attach_type = BPF_CGROUP_SOCK_OPS,
	};
	struct bpf_object *obj;
	struct bpf_map *map;
	int client_fd;
	int prog_fd;
	int map_fd;
	int err;

	err = bpf_prog_load_xattr(&attr, &obj, &prog_fd);
	if (err) {
		log_err("Failed to load BPF object");
		return -1;
	}

	map = bpf_map__next(NULL, obj);
	map_fd = bpf_map__fd(map);

	err = bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_SOCK_OPS, 0);
	if (err) {
		log_err("Failed to attach BPF program");
		goto close_bpf_object;
	}

	client_fd = connect_to_server(server_fd);
	if (client_fd < 0) {
		err = -1;
		goto close_bpf_object;
	}

	err += verify_sk(map_fd, client_fd, "syn-ack",
			 /*invoked=*/1,
			 /*dsack_dups=*/0,
			 /*delivered=*/1,
			 /*delivered_ce=*/0,
			 /*icsk_retransmits=*/0);

	send_byte(client_fd);

	err += verify_sk(map_fd, client_fd, "first payload byte",
			 /*invoked=*/2,
			 /*dsack_dups=*/0,
			 /*delivered=*/2,
			 /*delivered_ce=*/0,
			 /*icsk_retransmits=*/0);

	close(client_fd);

close_bpf_object:
	bpf_object__close(obj);
	return err;
}

static int start_server(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		log_err("Failed to create server socket");
		return -1;
	}

	if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
		log_err("Failed to bind socket");
		close(fd);
		return -1;
	}

	return fd;
}

static void *server_thread(void *arg)
{
	struct sockaddr_storage addr;
	socklen_t len = sizeof(addr);
	int fd = *(int *)arg;
	int client_fd;

	if (listen(fd, 1) < 0)
		error(1, errno, "Failed to listen on socket");

	client_fd = accept(fd, (struct sockaddr *)&addr, &len);
	if (client_fd < 0)
		error(1, errno, "Failed to accept client");

	/* Wait for the next connection (that never arrives)
	 * to keep this thread alive to prevent calling
	 * close() on client_fd.
	 */
	if (accept(fd, (struct sockaddr *)&addr, &len) >= 0)
		error(1, errno, "Unexpected success in second accept");

	close(client_fd);

	return NULL;
}

int main(int argc, char **argv)
{
	int server_fd, cgroup_fd;
	int err = EXIT_SUCCESS;
	pthread_t tid;

	if (setup_cgroup_environment())
		goto cleanup_obj;

	cgroup_fd = create_and_get_cgroup(CG_PATH);
	if (cgroup_fd < 0)
		goto cleanup_cgroup_env;

	if (join_cgroup(CG_PATH))
		goto cleanup_cgroup;

	server_fd = start_server();
	if (server_fd < 0) {
		err = EXIT_FAILURE;
		goto cleanup_cgroup;
	}

	pthread_create(&tid, NULL, server_thread, (void *)&server_fd);

	if (run_test(cgroup_fd, server_fd))
		err = EXIT_FAILURE;

	close(server_fd);

	printf("test_tcp_rtt: %s\n",
	       err == EXIT_SUCCESS ? "PASSED" : "FAILED");

cleanup_cgroup:
	close(cgroup_fd);
cleanup_cgroup_env:
	cleanup_cgroup_environment();
cleanup_obj:
	return err;
}
+118
tools/testing/selftests/bpf/test_xdp_veth.sh
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
#
# Create 3 namespaces with 3 veth peers, and
# forward packets in-between using native XDP
#
#                        XDP_TX
# NS1(veth11)        NS2(veth22)        NS3(veth33)
#      |                  |                  |
#      |                  |                  |
#   (veth1,            (veth2,            (veth3,
#   id:111)            id:122)            id:133)
#     ^ |                ^ |                ^ |
#     | |  XDP_REDIRECT  | |  XDP_REDIRECT  | |
#     | ------------------ ------------------ |
#     -----------------------------------------
#                    XDP_REDIRECT

# Kselftest framework requirement - SKIP code is 4.
ksft_skip=4

TESTNAME=xdp_veth
BPF_FS=$(awk '$3 == "bpf" {print $2; exit}' /proc/mounts)
BPF_DIR=$BPF_FS/test_$TESTNAME

_cleanup()
{
	set +e
	ip link del veth1 2> /dev/null
	ip link del veth2 2> /dev/null
	ip link del veth3 2> /dev/null
	ip netns del ns1 2> /dev/null
	ip netns del ns2 2> /dev/null
	ip netns del ns3 2> /dev/null
	rm -rf $BPF_DIR 2> /dev/null
}

cleanup_skip()
{
	echo "selftests: $TESTNAME [SKIP]"
	_cleanup

	exit $ksft_skip
}

cleanup()
{
	if [ "$?" = 0 ]; then
		echo "selftests: $TESTNAME [PASS]"
	else
		echo "selftests: $TESTNAME [FAILED]"
	fi
	_cleanup
}

if [ $(id -u) -ne 0 ]; then
	echo "selftests: $TESTNAME [SKIP] Need root privileges"
	exit $ksft_skip
fi

if ! ip link set dev lo xdp off > /dev/null 2>&1; then
	echo "selftests: $TESTNAME [SKIP] Could not run test without ip xdp support"
	exit $ksft_skip
fi

if [ -z "$BPF_FS" ]; then
	echo "selftests: $TESTNAME [SKIP] Could not run test without bpffs mounted"
	exit $ksft_skip
fi

if ! bpftool version > /dev/null 2>&1; then
	echo "selftests: $TESTNAME [SKIP] Could not run test without bpftool"
	exit $ksft_skip
fi

set -e

trap cleanup_skip EXIT

ip netns add ns1
ip netns add ns2
ip netns add ns3

ip link add veth1 index 111 type veth peer name veth11 netns ns1
ip link add veth2 index 122 type veth peer name veth22 netns ns2
ip link add veth3 index 133 type veth peer name veth33 netns ns3

ip link set veth1 up
ip link set veth2 up
ip link set veth3 up

ip -n ns1 addr add 10.1.1.11/24 dev veth11
ip -n ns3 addr add 10.1.1.33/24 dev veth33

ip -n ns1 link set dev veth11 up
ip -n ns2 link set dev veth22 up
ip -n ns3 link set dev veth33 up

mkdir $BPF_DIR
bpftool prog loadall \
	xdp_redirect_map.o $BPF_DIR/progs type xdp \
	pinmaps $BPF_DIR/maps
bpftool map update pinned $BPF_DIR/maps/tx_port key 0 0 0 0 value 122 0 0 0
bpftool map update pinned $BPF_DIR/maps/tx_port key 1 0 0 0 value 133 0 0 0
bpftool map update pinned $BPF_DIR/maps/tx_port key 2 0 0 0 value 111 0 0 0
ip link set dev veth1 xdp pinned $BPF_DIR/progs/redirect_map_0
ip link set dev veth2 xdp pinned $BPF_DIR/progs/redirect_map_1
ip link set dev veth3 xdp pinned $BPF_DIR/progs/redirect_map_2

ip -n ns1 link set dev veth11 xdp obj xdp_dummy.o sec xdp_dummy
ip -n ns2 link set dev veth22 xdp obj xdp_tx.o sec tx
ip -n ns3 link set dev veth33 xdp obj xdp_dummy.o sec xdp_dummy

trap cleanup EXIT

ip netns exec ns1 ping -c 1 -W 1 10.1.1.33

exit 0