Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

xsk: add multi-buffer documentation

Add AF_XDP multi-buffer support documentation including two
pseudo-code samples.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-18-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Authored by Magnus Karlsson and committed by Alexei Starovoitov
49ca37d0 a92b96c4

+210 -1
Documentation/networking/af_xdp.rst
···
 Gets options from an XDP socket. The only one supported so far is
 XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.

+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+  frame. In the case the packet consists of a single frame, the
+  descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why the reverse logic of end-of-packet (eop) flag found
+in many NICs? Just to preserve compatibility with non-multi-buffer
+applications that have this bit set to false for all packets on Rx,
+and the apps set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+  descriptors/frames of this packet are marked as invalid and not
+  completed. The next descriptor is treated as the start of a new
+  packet, even if this was not the intent (because we cannot guess
+  the intent). As before, if your program is producing invalid
+  descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+  descriptors accumulated so far are dropped and treated as
+  invalid. To produce an application that will work on any system
+  regardless of this config setting, limit the number of frags to 18,
+  as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is up to what the NIC HW
+  supports. Usually at least five on the NICs we have checked. We
+  consciously chose to not enforce a rigid limit (such as
+  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+  resulted in copy actions under the hood to fit into what limit the
+  NIC supports. Kind of defeats the purpose of zero-copy mode. How to
+  probe for this limit is explained in the "probe for multi-buffer
+  support" section.
+
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
 Usage
-=====
+-----

 In order to use AF_XDP sockets two parts are needed. The
 user-space application and the XDP program. For a complete setup and
···

 But please use the libbpf functions as they are optimized and ready to
 use. Will make your life easier.
+
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+    void rx_packets(struct xsk_socket_info *xsk)
+    {
+        static bool new_packet = true;
+        u32 idx_rx = 0, idx_fq = 0;
+        static char *pkt;
+
+        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+        for (int i = 0; i < rcvd; i++) {
+            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+            bool eop = !(desc->options & XDP_PKT_CONTD);
+
+            if (new_packet)
+                pkt = frag;
+            else
+                add_frag_to_pkt(pkt, frag);
+
+            if (eop)
+                process_pkt(pkt);
+
+            new_packet = eop;
+
+            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+        }
+
+        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+    }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. Also assumes pkts.addr
+points to a valid location in the umem.
+
+.. code-block:: c
+
+    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                    int batch_size)
+    {
+        u32 idx, i, pkt_nb = 0;
+
+        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+        for (i = 0; i < batch_size;) {
+            u64 addr = pkts[pkt_nb].addr;
+            u32 len = pkts[pkt_nb].size;
+
+            do {
+                struct xdp_desc *tx_desc;
+
+                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                tx_desc->addr = addr;
+
+                if (len > xsk_frame_size) {
+                    tx_desc->len = xsk_frame_size;
+                    tx_desc->options = XDP_PKT_CONTD;
+                } else {
+                    tx_desc->len = len;
+                    tx_desc->options = 0;
+                    pkt_nb++;
+                }
+                len -= tx_desc->len;
+                addr += xsk_frame_size;
+
+                if (i == batch_size) {
+                    /* Remember len, addr, pkt_nb for next iteration.
+                     * Skipped for simplicity.
+                     */
+                    break;
+                }
+            } while (len);
+        }
+
+        xsk_ring_prod__submit(&xsk->tx, i);
+    }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+   one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+     returned number signifies the max number of frags supported.
+
+For an example on how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with full packet at the end. This
+to facilitate extending a zero-copy driver with multi-buffer support.

 Sample application
 ==================
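
Note on batch boundaries: the documentation added above states that a batch of Rx descriptors may end in the middle of a packet, with the remaining frames arriving at the start of the next batch. The sketch below is one possible way to carry reassembly state across batches. It is not part of the commit. The names sketch_desc, sketch_rx_state, sketch_rx_batch and SKETCH_PKT_CONTD are hypothetical stand-ins, defined locally so the example builds without kernel headers; the struct layout and the flag value are assumed to mirror struct xdp_desc and XDP_PKT_CONTD in linux/if_xdp.h, which a real application would include instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: mirrors XDP_PKT_CONTD from linux/if_xdp.h. */
#define SKETCH_PKT_CONTD (1u << 0)

/* Assumption: same field layout as struct xdp_desc. */
struct sketch_desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

/* Reassembly state that has to survive between batches (hypothetical). */
struct sketch_rx_state {
    bool in_packet;     /* true while a packet is only partly seen */
    uint32_t pkt_len;   /* bytes accumulated for the current packet */
    uint32_t pkt_frags; /* frames accumulated for the current packet */
};

/* Walk one batch of Rx descriptors. Returns the number of packets that
 * completed in this batch and adds their lengths to *total_len. A packet
 * left unfinished at the end of the batch stays recorded in *st.
 */
static size_t sketch_rx_batch(struct sketch_rx_state *st,
                              const struct sketch_desc *descs, size_t n,
                              uint64_t *total_len)
{
    size_t completed = 0;

    assert(st != NULL && total_len != NULL);

    for (size_t i = 0; i < n; i++) {
        if (!st->in_packet) {
            /* First frame of a new packet: reset the counters. */
            st->pkt_len = 0;
            st->pkt_frags = 0;
        }
        st->pkt_len += descs[i].len;
        st->pkt_frags++;

        /* Flag set: more frames follow. Flag clear: last frame. */
        st->in_packet = (descs[i].options & SKETCH_PKT_CONTD) != 0;
        if (!st->in_packet) {
            *total_len += st->pkt_len;
            completed++;
        }
    }
    return completed;
}
```

Keeping the state in a caller-owned struct rather than in function-local statics, as the Rx pseudo-code in the diff does, is a design choice of this sketch: it lets one function serve several sockets.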