docs: networking: convert packet_mmap.txt to ReST

+1

Documentation/networking/index.rst

@@ -89,6 +89,7 @@
    nf_flowtable
    openvswitch
    operstates
+   packet_mmap
 
 .. only::  subproject and html
 

+1084

Documentation/networking/packet_mmap.rst

.. SPDX-License-Identifier: GPL-2.0

===========
Packet MMAP
===========

Abstract
========

This file documents the mmap() facility available with the PACKET
socket interface on 2.4/2.6/3.x kernels. This type of socket is used to

i) capture network traffic with utilities like tcpdump,
ii) transmit network traffic, or perform any other task that needs raw
    access to a network interface.

A howto can be found at:

    https://sites.google.com/site/packetmmap/

Please send your comments to
    - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
    - Johann Baudy

Why use PACKET_MMAP
===================

In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet; it requires two if you want to get the packet's timestamp
(like libpcap always does).

On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a
size-configurable circular buffer mapped in user space that can be used to either
send or receive packets. This way, reading packets just requires waiting for
them; most of the time there is no need to issue a single system call. Concerning
transmission, multiple packets can be sent through one system call to get the
highest bandwidth. Using a buffer shared between the kernel and the user
also has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
at high speeds (this is relative to the CPU speed), you should check if the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) if it supports NAPI; also make sure it is
enabled.
For transmission, check the MTU (Maximum Transmission Unit) used and
supported by devices of your network. CPU IRQ pinning of your network interface
card can also be an advantage.

How to use mmap() to improve capture process
============================================

From the user standpoint, you should use the higher level libpcap library, which
is a de facto standard, portable across nearly all operating systems
including Win32.

Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
TPACKET_V3 support was added in version 1.5.0.

How to use mmap() directly to improve capture process
=====================================================

From the system calls standpoint, the use of PACKET_MMAP involves
the following process::


    [setup]     socket() -------> creation of the capture socket
                setsockopt() ---> allocation of the circular buffer (ring)
                                  option: PACKET_RX_RING
                mmap() ---------> mapping of the allocated buffer to the
                                  user process

    [capture]   poll() ---------> to wait for incoming packets

    [shutdown]  close() --------> destruction of the capture socket and
                                  deallocation of all associated
                                  resources.


Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP::

    int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.

The destruction of the socket and all associated resources
is done by a simple call to close(fd).

As without PACKET_MMAP, it is possible to use one socket
for capture and transmission.
This can be done by mapping the
allocated RX and TX buffer ring with a single mmap() call.
See "Mapping and use of the circular buffer (ring)".

The next sections describe the PACKET_MMAP settings and their constraints,
the mapping of the circular buffer in the user process, and
the use of this buffer.

How to use mmap() directly to improve transmission process
==========================================================

The transmission process is similar to capture, as shown below::

    [setup]         socket() -------> creation of the transmission socket
                    setsockopt() ---> allocation of the circular buffer (ring)
                                      option: PACKET_TX_RING
                    bind() ---------> bind transmission socket with a network interface
                    mmap() ---------> mapping of the allocated buffer to the
                                      user process

    [transmission]  poll() ---------> wait for free packets (optional)
                    send() ---------> send all packets that are set as ready in
                                      the ring
                                      The flag MSG_DONTWAIT can be used to return
                                      before end of transfer.

    [shutdown]      close() --------> destruction of the transmission socket and
                                      deallocation of all associated resources.

Socket creation and destruction is also straightforward, and is done
the same way as for capturing, described in the previous section::

    int fd = socket(PF_PACKET, mode, 0);

The protocol can optionally be 0 in case we only want to transmit
via this socket, which avoids an expensive call to packet_rcv().
In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
set. Otherwise, use htons(ETH_P_ALL) or any other protocol, for example.

Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.
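
As a minimal sketch of the transmit-only case described above, the helper
below fills a struct sockaddr_ll with sll_protocol left at 0. The name
tx_only_addr() is hypothetical and not part of any kernel or libc API; the
caller is assumed to have looked up the interface index already (for example
with if_nametoindex()) and to pass the result to bind(2).

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper: prepare the bind address for a transmit-only
 * packet socket. sll_protocol stays 0, matching a socket created with
 * socket(PF_PACKET, mode, 0). */
static struct sockaddr_ll tx_only_addr(int ifindex)
{
    struct sockaddr_ll addr;

    memset(&addr, 0, sizeof(addr));
    addr.sll_family = AF_PACKET;
    addr.sll_protocol = 0;        /* no receive protocol requested */
    addr.sll_ifindex = ifindex;   /* interface the TX ring will use */
    return addr;
}
```

A caller would then pass the address of the returned structure, cast to
``struct sockaddr *``, to bind(2) together with ``sizeof(struct sockaddr_ll)``.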

As in capture, each frame contains two parts::

     --------------------
    | struct tpacket_hdr | Header. It contains the status
    |                    | of this frame
    |--------------------|
    | data buffer        |
    .                    .  Data that will be sent over the network interface.
    .                    .
     --------------------

bind() associates the socket with your network interface through the
sll_ifindex parameter of struct sockaddr_ll.

Initialization example::

    struct sockaddr_ll my_addr;
    struct ifreq s_ifr;
    ...

    strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));

    /* get interface index of eth0 */
    ioctl(this->socket, SIOCGIFINDEX, &s_ifr);

    /* fill sockaddr_ll struct to prepare binding */
    my_addr.sll_family = AF_PACKET;
    my_addr.sll_protocol = htons(ETH_P_ALL);
    my_addr.sll_ifindex =  s_ifr.ifr_ifindex;

    /* bind socket to eth0 */
    bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));

A complete tutorial is available at: https://sites.google.com/site/packetmmap/

By default, the user should put data at::

    frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)

So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
the beginning of the user data will be at::

    frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))

If you wish to put user data at a custom offset from the beginning of
the frame (for payload alignment with SOCK_RAW mode for instance) you
can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
to make this work it must be enabled previously with setsockopt()
and the PACKET_TX_HAS_OFF option.
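
The two default-offset expressions above describe the same location, because
TPACKET_HDRLEN is defined in linux/if_packet.h as the aligned size of
struct tpacket_hdr plus sizeof(struct sockaddr_ll). A small sketch that
computes it follows; the helper names tx_data_offset() and tx_data_ptr() are
hypothetical, and the code assumes the default TPACKET_V1 layout without
PACKET_TX_HAS_OFF.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Default offset of the user data inside a TPACKET_V1 TX frame. */
static size_t tx_data_offset(void)
{
    return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/* Hypothetical helper: where the payload of a TX frame should be written. */
static void *tx_data_ptr(void *frame_base)
{
    return (char *)frame_base + tx_data_offset();
}
```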

PACKET_MMAP settings
====================

PACKET_MMAP is set up from user level code with a call like

 - Capture process::

     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))

 - Transmission process::

     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter;
this parameter must have the following structure::

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.

Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr.
Note that tp_frame_nr is a redundant parameter because::

    frames_per_block = tp_block_size/tp_frame_size

indeed, packet_set_ring checks that the following condition is true::

    frames_per_block * tp_block_nr == tp_frame_nr

Let's see an example, with the following values::

     tp_block_size= 4096
     tp_frame_size= 2048
     tp_block_nr  = 4
     tp_frame_nr  = 8

we will get the following buffer structure::

            block #1                 block #2
    +---------+---------+    +---------+---------+
    | frame 1 | frame 2 |    | frame 3 | frame 4 |
    +---------+---------+    +---------+---------+

            block #3                 block #4
    +---------+---------+    +---------+---------+
    | frame 5 | frame 6 |    | frame 7 | frame 8 |
    +---------+---------+    +---------+---------+

A frame can be of any size, with the only condition that it fits in a block. A
block can only hold an integer number of frames, or in other words, a frame
cannot span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".

PACKET_MMAP setting constraints
===============================

In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture. For information on these kernel versions
see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt

Block size limit
----------------

As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function.
As the name indicates, this function allocates pages of memory, and the second
argument is "order" or a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely the limit can be calculated as::

   PAGE_SIZE << MAX_ORDER

   In a i386 architecture PAGE_SIZE is 4096 bytes
   In a 2.4/i386 kernel MAX_ORDER is 10
   In a 2.6/i386 kernel MAX_ORDER is 11

So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize (2)
system call.

Block number limit
------------------

To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc; its size limits the number of blocks that can be allocated::

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1

kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes.
The predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo

In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is::

     131072/4 = 32768 blocks

PACKET_MMAP buffer size calculator
==================================

Definitions:

==============  ================================================================
<size-max>      is the maximum size allocable with kmalloc
                (see /proc/slabinfo)
<pointer size>  depends on the architecture -- ``sizeof(void *)``
<page size>     depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>     is the value defined with MAX_ORDER
<frame size>    it's an upper bound of frame's capture size (more on this later)
==============  ================================================================

from these definitions we will derive::

        <block number> = <size-max>/<pointer size>
        <block size> = <pagesize> << <max-order>

so, the max buffer size is::

        <block number> * <block size>

and the number of frames is::

        <block number> * <block size> / <frame size>

Suppose the following parameters, which apply for 2.6 kernel and an
i386 architecture::

        <size-max> = 131072 bytes
        <pointer size> = 4 bytes
        <pagesize> = 4096 bytes
        <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield::

        <block number> = 131072/4 = 32768 blocks
        <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames

Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; in the case of
an i386 kernel, memory size is limited to 1GiB.

Memory allocations are not freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, which basically means that
the allocation can wait and swap other processes' memory in order to allocate
the necessary memory, so normally limits can be reached.

Other constraints
-----------------

If you check the source code you will see that what is drawn here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level
frame's meta information like the timestamp. So what is drawn here as a frame
is really the following (from include/linux/if_packet.h)::

 /*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */

The following conditions are checked in packet_set_ring:

   - tp_block_size must be a multiple of PAGE_SIZE (1)
   - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
   - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
   - tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.
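
The conditions listed above can be checked in user space before calling
setsockopt(), which may make a failing ring setup easier to diagnose. The
following is only a sketch under stated assumptions: ring_req_plausible() is
a hypothetical helper, it covers only the TPACKET_V1 conditions named in this
section, and the kernel may apply further checks that it does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical user space pre-check mirroring the conditions listed above.
 * page_size is supplied by the caller, e.g. from sysconf(_SC_PAGESIZE). */
static bool ring_req_plausible(const struct tpacket_req *req,
                               unsigned int page_size)
{
    unsigned int frames_per_block;

    if (page_size == 0 || req->tp_block_size == 0 ||
        req->tp_block_size % page_size != 0)
        return false;             /* a block must be a whole number of pages */
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return false;             /* no room for the header plus data */
    if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
        return false;             /* frames must stay aligned */

    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return false;             /* a frame must fit in one block */

    /* tp_frame_nr is redundant and must match exactly */
    return frames_per_block * req->tp_block_nr == req->tp_frame_nr;
}
```

With the example values used earlier (tp_block_size 4096, tp_frame_size 2048,
tp_block_nr 4, tp_frame_nr 8) and a 4096 byte page, the check passes;
changing tp_frame_nr to 7 makes it fail.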

Mapping and use of the circular buffer (ring)
---------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several physically
discontiguous blocks of memory, they are contiguous in user space, hence
just one call to mmap is needed::

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, after every
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot span two
blocks.

To use one socket for capture and transmission, the mapping of both the
RX and TX buffer ring has to be done with one call to mmap::

    ...
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
    setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
    ...
    rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    tx_ring = rx_ring + size;

RX must be the first as the kernel maps the TX ring memory right
after the RX one.

At the beginning of each frame there is a status field (see
struct tpacket_hdr).
If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:

Capture process
^^^^^^^^^^^^^^^

From include/linux/if_packet.h::

     #define TP_STATUS_COPY          (1 << 1)
     #define TP_STATUS_LOSING        (1 << 2)
     #define TP_STATUS_CSUMNOTREADY  (1 << 3)
     #define TP_STATUS_CSUM_VALID    (1 << 7)

======================  =======================================================
TP_STATUS_COPY          This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING        indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY  currently it's used for outgoing IP packets whose
                        checksum will be computed in hardware. So while
                        reading the packet we should not try to check the
                        checksum.

TP_STATUS_CSUM_VALID    This flag indicates that at least the transport
                        header checksum of the packet has been already
                        validated on the kernel side. If the flag is not set
                        then we are free to check the checksum by ourselves
                        provided that TP_STATUS_CSUMNOTREADY is also not set.
======================  =======================================================

for convenience there are also the following defines::

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet, it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read the user must zero the status field, so the kernel
can use that frame buffer again.

The user can use poll (any other variant should apply too) to check if new
packets are in the ring::

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not
introduce a race condition.

Transmission process
^^^^^^^^^^^^^^^^^^^^

Those defines are also used for transmission::

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
packet, the user fills a data buffer of an available frame, sets tp_len to
the current data buffer size and sets its status field to
TP_STATUS_SEND_REQUEST. This can be done on multiple frames. Once the user is
ready to transmit, it calls send(). Then all buffers with status equal to
TP_STATUS_SEND_REQUEST are forwarded to the network device. The kernel updates
the status of each sent frame to TP_STATUS_SENDING until the end of the
transfer.
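
The user side of this sequence, up to but not including the send() call, can
be sketched as follows. This is an illustration rather than the reference
implementation: tx_queue_frame() is a hypothetical helper, it assumes a
TPACKET_V1 ring with the default data offset, and the use of the GCC/Clang
builtin __sync_synchronize() as a memory barrier before publishing the status
is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper for a TPACKET_V1 TX ring: copy len bytes into an
 * available frame and mark it for the next send(). Returns 0 on success,
 * -1 if the frame is not available or the payload does not fit. */
static int tx_queue_frame(void *frame, unsigned int frame_size,
                          const void *data, unsigned int len)
{
    struct tpacket_hdr *hdr = frame;
    unsigned int off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;                /* frame still queued or in transmission */
    if (frame_size < off || len > frame_size - off)
        return -1;                /* payload would overrun the frame */

    memcpy((char *)frame + off, data, len);
    hdr->tp_len = len;
    __sync_synchronize();         /* assumption: data visible before status */
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}
```

After one or more frames have been queued this way, a single send() on the
socket hands all of them to the kernel.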

At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.

::

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(this->socket, NULL, 0, 0);

The user can also use poll() to check if a buffer is available:

(status == TP_STATUS_SENDING)

::

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);

What TPACKET versions are available and when to use them?
=========================================================

::

 int val = tpacket_version;
 socklen_t len = sizeof(val);
 setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
 getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);

where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.

TPACKET_V1:
        - Default if not otherwise specified by setsockopt(2)
        - RX_RING, TX_RING available

TPACKET_V1 --> TPACKET_V2:
        - Made 64 bit clean due to unsigned long usage in TPACKET_V1
          structures, thus this also works on 64 bit kernel with 32 bit
          userspace and the like
        - Timestamp resolution in nanoseconds instead of microseconds
        - RX_RING, TX_RING available
        - VLAN metadata information available for packets
          (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
          in the tpacket2_hdr structure:

                - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
                  that the tp_vlan_tci field has valid VLAN TCI value
                - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
                  indicates that the tp_vlan_tpid field has valid VLAN TPID value

        - How to switch to TPACKET_V2:

                1. Replace struct tpacket_hdr by struct tpacket2_hdr
                2. Query header len and save
                3. Set protocol version to 2, set up ring as usual
                4. For getting the sockaddr_ll,
                   use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
                   ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``

TPACKET_V2 --> TPACKET_V3:
        - Flexible buffer implementation for RX_RING:

                1. Blocks can be configured with non-static frame-size
                2. Read/poll is at a block-level (as opposed to packet-level)
                3. Added poll timeout to avoid indefinite user-space wait
                   on idle links
                4. Added user-configurable knobs:

                        4.1 block::timeout
                        4.2 tpkt_hdr::sk_rxhash

        - RX Hash data available in user space
        - TX_RING semantics are conceptually similar to TPACKET_V2;
          use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
          instead of TPACKET2_HDRLEN. In the current implementation,
          the tp_next_offset field in the tpacket3_hdr MUST be set to
          zero, indicating that the ring does not hold variable sized frames.
          Packets with non-zero values of tp_next_offset will be dropped.

AF_PACKET fanout mode
=====================

In the AF_PACKET fanout mode, packet reception can be load balanced among
processes. This also works in combination with mmap(2) on packet sockets.

Currently implemented fanout policies are:

  - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
  - PACKET_FANOUT_LB: schedule to socket by round-robin
  - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
  - PACKET_FANOUT_RND: schedule to socket by random selection
  - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
  - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping

Minimal example code by David S.
Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.)::

    #include <stddef.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>

    #include <unistd.h>
    #include <arpa/inet.h>      /* htons() */

    #include <linux/if_ether.h>
    #include <linux/if_packet.h>

    #include <net/if.h>

    static const char *device_name;
    static int fanout_type;
    static int fanout_id;

    #ifndef PACKET_FANOUT
    # define PACKET_FANOUT                  18
    # define PACKET_FANOUT_HASH             0
    # define PACKET_FANOUT_LB               1
    #endif

    /* Returns the socket fd, or -1 on error. */
    static int setup_socket(void)
    {
        int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
        struct sockaddr_ll ll;
        struct ifreq ifr;
        int fanout_arg;

        if (fd < 0) {
            perror("socket");
            return -1;
        }

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, device_name, sizeof(ifr.ifr_name) - 1);
        err = ioctl(fd, SIOCGIFINDEX, &ifr);
        if (err < 0) {
            perror("SIOCGIFINDEX");
            return -1;
        }

        memset(&ll, 0, sizeof(ll));
        ll.sll_family = AF_PACKET;
        ll.sll_ifindex = ifr.ifr_ifindex;
        err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
        if (err < 0) {
            perror("bind");
            return -1;
        }

        /* group id in the low 16 bits, fanout policy in the high 16 bits */
        fanout_arg = (fanout_id | (fanout_type << 16));
        err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                         &fanout_arg, sizeof(fanout_arg));
        if (err) {
            perror("setsockopt");
            return -1;
        }

        return fd;
    }

    static void fanout_thread(void)
    {
        int fd = setup_socket();
        int limit = 10000;

        if (fd < 0)
            exit(EXIT_FAILURE);

        while (limit-- > 0) {
            char buf[1600];
            int err;

            err = read(fd, buf, sizeof(buf));
            if (err < 0) {
                perror("read");
                exit(EXIT_FAILURE);
            }
            if ((limit % 10) == 0)
                fprintf(stdout, "(%d) \n", getpid());
        }

        fprintf(stdout, "%d: Received 10000 packets\n", getpid());

        close(fd);
        exit(0);
    }

    int main(int argc, char **argp)
    {
        int i;

        if (argc != 3) {
            fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
            return EXIT_FAILURE;
        }

        if (!strcmp(argp[2], "hash"))
            fanout_type = PACKET_FANOUT_HASH;
        else if (!strcmp(argp[2], "lb"))
            fanout_type = PACKET_FANOUT_LB;
        else {
            fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
            exit(EXIT_FAILURE);
        }

        device_name = argp[1];
        fanout_id = getpid() & 0xffff;

        for (i = 0; i < 4; i++) {
            pid_t pid = fork();

            switch (pid) {
            case 0:
                fanout_thread();    /* does not return */

            case -1:
                perror("fork");
                exit(EXIT_FAILURE);
            }
        }

        for (i = 0; i < 4; i++) {
            int status;

            wait(&status);
        }

        return 0;
    }

AF_PACKET TPACKET_V3 example
============================

AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing its own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and its
predecessor.

It is said that TPACKET_V3 brings the following benefits:

 * ~15% - 20% reduction in CPU-usage
 * ~20% increase in packet capture rate
 * ~2x increase in packet density
 * Port aggregation analysis
 * Non static frame size to capture entire packet payload

So it seems to be a good candidate to be used with packet fanout.
771 + 772 + Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile 773 + it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):: 774 + 775 + /* Written from scratch, but kernel-to-user space API usage 776 + * dissected from lolpcap: 777 + * Copyright 2011, Chetan Loke <loke.chetan@gmail.com> 778 + * License: GPL, version 2.0 779 + */ 780 + 781 + #include <stdio.h> 782 + #include <stdlib.h> 783 + #include <stdint.h> 784 + #include <string.h> 785 + #include <assert.h> 786 + #include <net/if.h> 787 + #include <arpa/inet.h> 788 + #include <netdb.h> 789 + #include <poll.h> 790 + #include <unistd.h> 791 + #include <signal.h> 792 + #include <inttypes.h> 793 + #include <sys/socket.h> 794 + #include <sys/mman.h> 795 + #include <linux/if_packet.h> 796 + #include <linux/if_ether.h> 797 + #include <linux/ip.h> 798 + 799 + #ifndef likely 800 + # define likely(x) __builtin_expect(!!(x), 1) 801 + #endif 802 + #ifndef unlikely 803 + # define unlikely(x) __builtin_expect(!!(x), 0) 804 + #endif 805 + 806 + struct block_desc { 807 + uint32_t version; 808 + uint32_t offset_to_priv; 809 + struct tpacket_hdr_v1 h1; 810 + }; 811 + 812 + struct ring { 813 + struct iovec *rd; 814 + uint8_t *map; 815 + struct tpacket_req3 req; 816 + }; 817 + 818 + static unsigned long packets_total = 0, bytes_total = 0; 819 + static sig_atomic_t sigint = 0; 820 + 821 + static void sighandler(int num) 822 + { 823 + sigint = 1; 824 + } 825 + 826 + static int setup_socket(struct ring *ring, char *netdev) 827 + { 828 + int err, i, fd, v = TPACKET_V3; 829 + struct sockaddr_ll ll; 830 + unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; 831 + unsigned int blocknum = 64; 832 + 833 + fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); 834 + if (fd < 0) { 835 + perror("socket"); 836 + exit(1); 837 + } 838 + 839 + err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); 840 + if (err < 0) { 841 + perror("setsockopt"); 842 + exit(1); 843 + } 844 + 845 + 
memset(&ring->req, 0, sizeof(ring->req)); 846 + ring->req.tp_block_size = blocksiz; 847 + ring->req.tp_frame_size = framesiz; 848 + ring->req.tp_block_nr = blocknum; 849 + ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; 850 + ring->req.tp_retire_blk_tov = 60; 851 + ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; 852 + 853 + err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, 854 + sizeof(ring->req)); 855 + if (err < 0) { 856 + perror("setsockopt"); 857 + exit(1); 858 + } 859 + 860 + ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, 861 + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); 862 + if (ring->map == MAP_FAILED) { 863 + perror("mmap"); 864 + exit(1); 865 + } 866 + 867 + ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); 868 + assert(ring->rd); 869 + for (i = 0; i < ring->req.tp_block_nr; ++i) { 870 + ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); 871 + ring->rd[i].iov_len = ring->req.tp_block_size; 872 + } 873 + 874 + memset(&ll, 0, sizeof(ll)); 875 + ll.sll_family = PF_PACKET; 876 + ll.sll_protocol = htons(ETH_P_ALL); 877 + ll.sll_ifindex = if_nametoindex(netdev); 878 + ll.sll_hatype = 0; 879 + ll.sll_pkttype = 0; 880 + ll.sll_halen = 0; 881 + 882 + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); 883 + if (err < 0) { 884 + perror("bind"); 885 + exit(1); 886 + } 887 + 888 + return fd; 889 + } 890 + 891 + static void display(struct tpacket3_hdr *ppd) 892 + { 893 + struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); 894 + struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); 895 + 896 + if (eth->h_proto == htons(ETH_P_IP)) { 897 + struct sockaddr_in ss, sd; 898 + char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; 899 + 900 + memset(&ss, 0, sizeof(ss)); 901 + ss.sin_family = PF_INET; 902 + ss.sin_addr.s_addr = ip->saddr; 903 + getnameinfo((struct sockaddr *) &ss, sizeof(ss), 904 + sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); 905 + 906 + 
memset(&sd, 0, sizeof(sd)); 907 + sd.sin_family = PF_INET; 908 + sd.sin_addr.s_addr = ip->daddr; 909 + getnameinfo((struct sockaddr *) &sd, sizeof(sd), 910 + dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); 911 + 912 + printf("%s -> %s, ", sbuff, dbuff); 913 + } 914 + 915 + printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); 916 + } 917 + 918 + static void walk_block(struct block_desc *pbd, const int block_num) 919 + { 920 + int num_pkts = pbd->h1.num_pkts, i; 921 + unsigned long bytes = 0; 922 + struct tpacket3_hdr *ppd; 923 + 924 + ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + 925 + pbd->h1.offset_to_first_pkt); 926 + for (i = 0; i < num_pkts; ++i) { 927 + bytes += ppd->tp_snaplen; 928 + display(ppd); 929 + 930 + ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + 931 + ppd->tp_next_offset); 932 + } 933 + 934 + packets_total += num_pkts; 935 + bytes_total += bytes; 936 + } 937 + 938 + static void flush_block(struct block_desc *pbd) 939 + { 940 + pbd->h1.block_status = TP_STATUS_KERNEL; 941 + } 942 + 943 + static void teardown_socket(struct ring *ring, int fd) 944 + { 945 + munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); 946 + free(ring->rd); 947 + close(fd); 948 + } 949 + 950 + int main(int argc, char **argp) 951 + { 952 + int fd, err; 953 + socklen_t len; 954 + struct ring ring; 955 + struct pollfd pfd; 956 + unsigned int block_num = 0, blocks = 64; 957 + struct block_desc *pbd; 958 + struct tpacket_stats_v3 stats; 959 + 960 + if (argc != 2) { 961 + fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); 962 + return EXIT_FAILURE; 963 + } 964 + 965 + signal(SIGINT, sighandler); 966 + 967 + memset(&ring, 0, sizeof(ring)); 968 + fd = setup_socket(&ring, argp[argc - 1]); 969 + assert(fd > 0); 970 + 971 + memset(&pfd, 0, sizeof(pfd)); 972 + pfd.fd = fd; 973 + pfd.events = POLLIN | POLLERR; 974 + pfd.revents = 0; 975 + 976 + while (likely(!sigint)) { 977 + pbd = (struct block_desc *) ring.rd[block_num].iov_base; 978 + 979 + if ((pbd->h1.block_status & 
TP_STATUS_USER) == 0) { 980 + poll(&pfd, 1, -1); 981 + continue; 982 + } 983 + 984 + walk_block(pbd, block_num); 985 + flush_block(pbd); 986 + block_num = (block_num + 1) % blocks; 987 + } 988 + 989 + len = sizeof(stats); 990 + err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); 991 + if (err < 0) { 992 + perror("getsockopt"); 993 + exit(1); 994 + } 995 + 996 + fflush(stdout); 997 + printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", 998 + stats.tp_packets, bytes_total, stats.tp_drops, 999 + stats.tp_freeze_q_cnt); 1000 + 1001 + teardown_socket(&ring, fd); 1002 + return 0; 1003 + } 1004 + 1005 + PACKET_QDISC_BYPASS 1006 + =================== 1007 + 1008 + If there is a requirement to load the network with many packets in a similar 1009 + fashion as pktgen does, you might set the following option after socket 1010 + creation:: 1011 + 1012 + int one = 1; 1013 + setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); 1014 + 1015 + This has the side effect that packets sent through PF_PACKET bypass the 1016 + kernel's qdisc layer and are forcibly pushed to the driver directly. This means 1017 + packets are not buffered, tc disciplines are ignored, increased loss can occur 1018 + and such packets are also not visible to other PF_PACKET sockets anymore. So, 1019 + you have been warned; generally, this can be useful for stress testing various 1020 + components of a system. 1021 + 1022 + By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled 1023 + on PF_PACKET sockets. 1024 + 1025 + PACKET_TIMESTAMP 1026 + ================ 1027 + 1028 + The PACKET_TIMESTAMP setting determines the source of the timestamp in 1029 + the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your 1030 + NIC is capable of timestamping packets in hardware, you can request those 1031 + hardware timestamps to be used.
Note: you may need to enable the generation 1032 + of hardware timestamps with SIOCSHWTSTAMP (see related information from 1033 + Documentation/networking/timestamping.txt). 1034 + 1035 + PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:: 1036 + 1037 + int req = SOF_TIMESTAMPING_RAW_HARDWARE; 1038 + setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)); 1039 + 1040 + For the mmap(2)ed ring buffers, such timestamps are stored in the 1041 + ``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members. 1042 + To determine what kind of timestamp has been reported, the tp_status field 1043 + is binary or'ed with the following possible bits ... 1044 + 1045 + :: 1046 + 1047 + TP_STATUS_TS_RAW_HARDWARE 1048 + TP_STATUS_TS_SOFTWARE 1049 + 1050 + ... that are equivalent to their ``SOF_TIMESTAMPING_*`` counterparts. For the 1051 + RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a 1052 + software fallback was invoked *within* PF_PACKET's processing code (less 1053 + precise). 1054 + 1055 + Getting timestamps for the TX_RING works as follows: i) fill the ring frames, 1056 + ii) call sendto() e.g. in blocking mode, iii) wait for the status of the relevant 1057 + frames to be updated or for the frame to be handed over to the application, iv) walk 1058 + through the frames to pick up the individual hw/sw timestamps. 1059 + 1060 + Only (!) if transmit timestamping is enabled, then these bits are combined 1061 + with binary | with TP_STATUS_AVAILABLE, so you must check for that in your 1062 + application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) 1063 + in a first step to see if the frame belongs to the application, and then 1064 + one can extract the type of timestamp in a second step from tp_status)! 1065 + 1066 + If you don't care about them, thus having it disabled, checking for 1067 + TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient.
If in the 1068 + TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec 1069 + members do not contain a valid value. For TX_RINGs, by default no timestamp 1070 + is generated! 1071 + 1072 + See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt 1073 + for more information on hardware timestamps. 1074 + 1075 + Miscellaneous bits 1076 + ================== 1077 + 1078 + - Packet sockets work well together with Linux socket filters, thus you also 1079 + might want to have a look at Documentation/networking/filter.txt 1080 + 1081 + THANKS 1082 + ====== 1083 + 1084 + Jesse Brandeburg, for fixing my grammatical/spelling errors
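Referring back to the PACKET_TIMESTAMP section of the new file, the two-step check it describes for TX_RING frames (first decide whether the frame belongs to the application, then extract the timestamp type from tp_status) might be sketched as follows. This is a minimal illustration, not part of the patch: the helper names `tx_frame_is_ours()` and `ts_kind_of()` and the `enum ts_kind` are hypothetical, and only the `TP_STATUS_*` constants come from `<linux/if_packet.h>`.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical classification of the timestamp carried by a frame. */
enum ts_kind { TS_NONE, TS_SOFTWARE, TS_RAW_HARDWARE };

/*
 * Step 1: the frame belongs to the application once the kernel has
 * cleared both the "send requested" and the "sending" bits. Timestamp
 * bits may still be set on top of TP_STATUS_AVAILABLE, so comparing
 * tp_status against 0 would not be enough.
 */
static inline int tx_frame_is_ours(uint32_t tp_status)
{
	return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/*
 * Step 2: read the timestamp type from the status word. When neither
 * bit is set, tp_sec and tp_{n,u}sec hold no valid TX timestamp.
 */
static inline enum ts_kind ts_kind_of(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TS_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TS_SOFTWARE;
	return TS_NONE;
}
```

A caller would typically test `tx_frame_is_ours(hdr->tp_status)` before touching the frame and only then consult `ts_kind_of()` to decide whether the time fields are meaningful.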

-1061

Documentation/networking/packet_mmap.txt

··· 1 - -------------------------------------------------------------------------------- 2 - + ABSTRACT 3 - -------------------------------------------------------------------------------- 4 - 5 - This file documents the mmap() facility available with the PACKET 6 - socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for 7 - i) capture network traffic with utilities like tcpdump, ii) transmit network 8 - traffic, or any other that needs raw access to network interface. 9 - 10 - Howto can be found at: 11 - https://sites.google.com/site/packetmmap/ 12 - 13 - Please send your comments to 14 - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> 15 - Johann Baudy 16 - 17 - ------------------------------------------------------------------------------- 18 - + Why use PACKET_MMAP 19 - -------------------------------------------------------------------------------- 20 - 21 - In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very 22 - inefficient. It uses very limited buffers and requires one system call to 23 - capture each packet, it requires two if you want to get packet's timestamp 24 - (like libpcap always does). 25 - 26 - In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size 27 - configurable circular buffer mapped in user space that can be used to either 28 - send or receive packets. This way reading packets just needs to wait for them, 29 - most of the time there is no need to issue a single system call. Concerning 30 - transmission, multiple packets can be sent through one system call to get the 31 - highest bandwidth. By using a shared buffer between the kernel and the user 32 - also has the benefit of minimizing packet copies. 33 - 34 - It's fine to use PACKET_MMAP to improve the performance of the capture and 35 - transmission process, but it isn't everything. 
At least, if you are capturing 36 - at high speeds (this is relative to the cpu speed), you should check if the 37 - device driver of your network interface card supports some sort of interrupt 38 - load mitigation or (even better) if it supports NAPI, also make sure it is 39 - enabled. For transmission, check the MTU (Maximum Transmission Unit) used and 40 - supported by devices of your network. CPU IRQ pinning of your network interface 41 - card can also be an advantage. 42 - 43 - -------------------------------------------------------------------------------- 44 - + How to use mmap() to improve capture process 45 - -------------------------------------------------------------------------------- 46 - 47 - From the user standpoint, you should use the higher level libpcap library, which 48 - is a de facto standard, portable across nearly all operating systems 49 - including Win32. 50 - 51 - Packet MMAP support was integrated into libpcap around the time of version 1.3.0; 52 - TPACKET_V3 support was added in version 1.5.0 53 - 54 - -------------------------------------------------------------------------------- 55 - + How to use mmap() directly to improve capture process 56 - -------------------------------------------------------------------------------- 57 - 58 - From the system calls stand point, the use of PACKET_MMAP involves 59 - the following process: 60 - 61 - 62 - [setup] socket() -------> creation of the capture socket 63 - setsockopt() ---> allocation of the circular buffer (ring) 64 - option: PACKET_RX_RING 65 - mmap() ---------> mapping of the allocated buffer to the 66 - user process 67 - 68 - [capture] poll() ---------> to wait for incoming packets 69 - 70 - [shutdown] close() --------> destruction of the capture socket and 71 - deallocation of all associated 72 - resources. 
73 - 74 - 75 - socket creation and destruction is straight forward, and is done 76 - the same way with or without PACKET_MMAP: 77 - 78 - int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); 79 - 80 - where mode is SOCK_RAW for the raw interface were link level 81 - information can be captured or SOCK_DGRAM for the cooked 82 - interface where link level information capture is not 83 - supported and a link level pseudo-header is provided 84 - by the kernel. 85 - 86 - The destruction of the socket and all associated resources 87 - is done by a simple call to close(fd). 88 - 89 - Similarly as without PACKET_MMAP, it is possible to use one socket 90 - for capture and transmission. This can be done by mapping the 91 - allocated RX and TX buffer ring with a single mmap() call. 92 - See "Mapping and use of the circular buffer (ring)". 93 - 94 - Next I will describe PACKET_MMAP settings and its constraints, 95 - also the mapping of the circular buffer in the user process and 96 - the use of this buffer. 97 - 98 - -------------------------------------------------------------------------------- 99 - + How to use mmap() directly to improve transmission process 100 - -------------------------------------------------------------------------------- 101 - Transmission process is similar to capture as shown below. 102 - 103 - [setup] socket() -------> creation of the transmission socket 104 - setsockopt() ---> allocation of the circular buffer (ring) 105 - option: PACKET_TX_RING 106 - bind() ---------> bind transmission socket with a network interface 107 - mmap() ---------> mapping of the allocated buffer to the 108 - user process 109 - 110 - [transmission] poll() ---------> wait for free packets (optional) 111 - send() ---------> send all packets that are set as ready in 112 - the ring 113 - The flag MSG_DONTWAIT can be used to return 114 - before end of transfer. 
115 - 116 - [shutdown] close() --------> destruction of the transmission socket and 117 - deallocation of all associated resources. 118 - 119 - Socket creation and destruction is also straight forward, and is done 120 - the same way as in capturing described in the previous paragraph: 121 - 122 - int fd = socket(PF_PACKET, mode, 0); 123 - 124 - The protocol can optionally be 0 in case we only want to transmit 125 - via this socket, which avoids an expensive call to packet_rcv(). 126 - In this case, you also need to bind(2) the TX_RING with sll_protocol = 0 127 - set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example. 128 - 129 - Binding the socket to your network interface is mandatory (with zero copy) to 130 - know the header size of frames used in the circular buffer. 131 - 132 - As capture, each frame contains two parts: 133 - 134 - -------------------- 135 - | struct tpacket_hdr | Header. It contains the status of 136 - | | of this frame 137 - |--------------------| 138 - | data buffer | 139 - . . Data that will be sent over the network interface. 140 - . . 141 - -------------------- 142 - 143 - bind() associates the socket to your network interface thanks to 144 - sll_ifindex parameter of struct sockaddr_ll. 145 - 146 - Initialization example: 147 - 148 - struct sockaddr_ll my_addr; 149 - struct ifreq s_ifr; 150 - ... 
151 - 152 - strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); 153 - 154 - /* get interface index of eth0 */ 155 - ioctl(this->socket, SIOCGIFINDEX, &s_ifr); 156 - 157 - /* fill sockaddr_ll struct to prepare binding */ 158 - my_addr.sll_family = AF_PACKET; 159 - my_addr.sll_protocol = htons(ETH_P_ALL); 160 - my_addr.sll_ifindex = s_ifr.ifr_ifindex; 161 - 162 - /* bind socket to eth0 */ 163 - bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); 164 - 165 - A complete tutorial is available at: https://sites.google.com/site/packetmmap/ 166 - 167 - By default, the user should put data at : 168 - frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) 169 - 170 - So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), 171 - the beginning of the user data will be at : 172 - frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) 173 - 174 - If you wish to put user data at a custom offset from the beginning of 175 - the frame (for payload alignment with SOCK_RAW mode for instance) you 176 - can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order 177 - to make this work it must be enabled previously with setsockopt() 178 - and the PACKET_TX_HAS_OFF option. 
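The two offset formulas given above for the default user data position describe the same address, because TPACKET_HDRLEN is defined as the aligned header size plus sizeof(struct sockaddr_ll). A small sketch for the TPACKET_V1 layout, assuming the definitions from `<linux/if_packet.h>`; the helper names `tx_payload()` and `tx_payload_offset()` are hypothetical and exist only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Default location of user data inside a TX frame:
 * frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll).
 */
static inline uint8_t *tx_payload(void *frame_base)
{
	return (uint8_t *)frame_base + TPACKET_HDRLEN -
	       sizeof(struct sockaddr_ll);
}

/* The same offset expressed through the aligned header size. */
static inline size_t tx_payload_offset(void)
{
	return TPACKET_ALIGN(sizeof(struct tpacket_hdr));
}
```

With PACKET_TX_HAS_OFF enabled the offset would instead be taken from tp_net or tp_mac, which this sketch does not cover.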
179 - 180 - -------------------------------------------------------------------------------- 181 - + PACKET_MMAP settings 182 - -------------------------------------------------------------------------------- 183 - 184 - To setup PACKET_MMAP from user level code is done with a call like 185 - 186 - - Capture process 187 - setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) 188 - - Transmission process 189 - setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) 190 - 191 - The most significant argument in the previous call is the req parameter, 192 - this parameter must to have the following structure: 193 - 194 - struct tpacket_req 195 - { 196 - unsigned int tp_block_size; /* Minimal size of contiguous block */ 197 - unsigned int tp_block_nr; /* Number of blocks */ 198 - unsigned int tp_frame_size; /* Size of frame */ 199 - unsigned int tp_frame_nr; /* Total number of frames */ 200 - }; 201 - 202 - This structure is defined in /usr/include/linux/if_packet.h and establishes a 203 - circular buffer (ring) of unswappable memory. 204 - Being mapped in the capture process allows reading the captured frames and 205 - related meta-information like timestamps without requiring a system call. 206 - 207 - Frames are grouped in blocks. Each block is a physically contiguous 208 - region of memory and holds tp_block_size/tp_frame_size frames. The total number 209 - of blocks is tp_block_nr. 
Note that tp_frame_nr is a redundant parameter because 210 - 211 - frames_per_block = tp_block_size/tp_frame_size 212 - 213 - indeed, packet_set_ring checks that the following condition is true 214 - 215 - frames_per_block * tp_block_nr == tp_frame_nr 216 - 217 - Lets see an example, with the following values: 218 - 219 - tp_block_size= 4096 220 - tp_frame_size= 2048 221 - tp_block_nr = 4 222 - tp_frame_nr = 8 223 - 224 - we will get the following buffer structure: 225 - 226 - block #1 block #2 227 - +---------+---------+ +---------+---------+ 228 - | frame 1 | frame 2 | | frame 3 | frame 4 | 229 - +---------+---------+ +---------+---------+ 230 - 231 - block #3 block #4 232 - +---------+---------+ +---------+---------+ 233 - | frame 5 | frame 6 | | frame 7 | frame 8 | 234 - +---------+---------+ +---------+---------+ 235 - 236 - A frame can be of any size with the only condition it can fit in a block. A block 237 - can only hold an integer number of frames, or in other words, a frame cannot 238 - be spawned across two blocks, so there are some details you have to take into 239 - account when choosing the frame_size. See "Mapping and use of the circular 240 - buffer (ring)". 241 - 242 - -------------------------------------------------------------------------------- 243 - + PACKET_MMAP setting constraints 244 - -------------------------------------------------------------------------------- 245 - 246 - In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), 247 - the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or 248 - 16384 in a 64 bit architecture. For information on these kernel versions 249 - see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt 250 - 251 - Block size limit 252 - ------------------ 253 - 254 - As stated earlier, each block is a contiguous physical region of memory. These 255 - memory regions are allocated with calls to the __get_free_pages() function. 
As 256 - the name indicates, this function allocates pages of memory, and the second 257 - argument is "order" or a power of two number of pages, that is 258 - (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, 259 - order=2 ==> 16384 bytes, etc. The maximum size of a 260 - region allocated by __get_free_pages is determined by the MAX_ORDER macro. More 261 - precisely the limit can be calculated as: 262 - 263 - PAGE_SIZE << MAX_ORDER 264 - 265 - In a i386 architecture PAGE_SIZE is 4096 bytes 266 - In a 2.4/i386 kernel MAX_ORDER is 10 267 - In a 2.6/i386 kernel MAX_ORDER is 11 268 - 269 - So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel 270 - respectively, with an i386 architecture. 271 - 272 - User space programs can include /usr/include/sys/user.h and 273 - /usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations. 274 - 275 - The pagesize can also be determined dynamically with the getpagesize (2) 276 - system call. 277 - 278 - Block number limit 279 - -------------------- 280 - 281 - To understand the constraints of PACKET_MMAP, we have to see the structure 282 - used to hold the pointers to each block. 283 - 284 - Currently, this structure is a dynamically allocated vector with kmalloc 285 - called pg_vec, its size limits the number of blocks that can be allocated. 286 - 287 - +---+---+---+---+ 288 - | x | x | x | x | 289 - +---+---+---+---+ 290 - | | | | 291 - | | | v 292 - | | v block #4 293 - | v block #3 294 - v block #2 295 - block #1 296 - 297 - kmalloc allocates any number of bytes of physically contiguous memory from 298 - a pool of pre-determined sizes. This pool of memory is maintained by the slab 299 - allocator which is at the end the responsible for doing the allocation and 300 - hence which imposes the maximum memory that kmalloc can allocate. 301 - 302 - In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. 
The 303 - predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" 304 - entries of /proc/slabinfo 305 - 306 - In a 32 bit architecture, pointers are 4 bytes long, so the total number of 307 - pointers to blocks is 308 - 309 - 131072/4 = 32768 blocks 310 - 311 - PACKET_MMAP buffer size calculator 312 - ------------------------------------ 313 - 314 - Definitions: 315 - 316 - <size-max> : is the maximum size of allocable with kmalloc (see /proc/slabinfo) 317 - <pointer size>: depends on the architecture -- sizeof(void *) 318 - <page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2) 319 - <max-order> : is the value defined with MAX_ORDER 320 - <frame size> : it's an upper bound of frame's capture size (more on this later) 321 - 322 - from these definitions we will derive 323 - 324 - <block number> = <size-max>/<pointer size> 325 - <block size> = <pagesize> << <max-order> 326 - 327 - so, the max buffer size is 328 - 329 - <block number> * <block size> 330 - 331 - and, the number of frames be 332 - 333 - <block number> * <block size> / <frame size> 334 - 335 - Suppose the following parameters, which apply for 2.6 kernel and an 336 - i386 architecture: 337 - 338 - <size-max> = 131072 bytes 339 - <pointer size> = 4 bytes 340 - <pagesize> = 4096 bytes 341 - <max-order> = 11 342 - 343 - and a value for <frame size> of 2048 bytes. These parameters will yield 344 - 345 - <block number> = 131072/4 = 32768 blocks 346 - <block size> = 4096 << 11 = 8 MiB. 347 - 348 - and hence the buffer will have a 262144 MiB size. So it can hold 349 - 262144 MiB / 2048 bytes = 134217728 frames 350 - 351 - Actually, this buffer size is not possible with an i386 architecture. 352 - Remember that the memory is allocated in kernel space, in the case of 353 - an i386 kernel's memory size is limited to 1GiB. 354 - 355 - All memory allocations are not freed until the socket is closed. 
The memory 356 - allocations are done with GFP_KERNEL priority, this basically means that 357 - the allocation can wait and swap other process' memory in order to allocate 358 - the necessary memory, so normally limits can be reached. 359 - 360 - Other constraints 361 - ------------------- 362 - 363 - If you check the source code you will see that what I draw here as a frame 364 - is not only the link level frame. At the beginning of each frame there is a 365 - header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame 366 - meta information like timestamp. So what we draw here a frame it's really 367 - the following (from include/linux/if_packet.h): 368 - 369 - /* 370 - Frame structure: 371 - 372 - - Start. Frame must be aligned to TPACKET_ALIGNMENT=16 373 - - struct tpacket_hdr 374 - - pad to TPACKET_ALIGNMENT=16 375 - - struct sockaddr_ll 376 - - Gap, chosen so that packet data (Start+tp_net) aligns to 377 - TPACKET_ALIGNMENT=16 378 - - Start+tp_mac: [ Optional MAC header ] 379 - - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. 380 - - Pad to align to TPACKET_ALIGNMENT=16 381 - */ 382 - 383 - The following are conditions that are checked in packet_set_ring 384 - 385 - tp_block_size must be a multiple of PAGE_SIZE (1) 386 - tp_frame_size must be greater than TPACKET_HDRLEN (obvious) 387 - tp_frame_size must be a multiple of TPACKET_ALIGNMENT 388 - tp_frame_nr must be exactly frames_per_block*tp_block_nr 389 - 390 - Note that tp_block_size should be chosen to be a power of two or there will 391 - be a waste of memory. 392 - 393 - -------------------------------------------------------------------------------- 394 - + Mapping and use of the circular buffer (ring) 395 - -------------------------------------------------------------------------------- 396 - 397 - The mapping of the buffer in the user process is done with the conventional 398 - mmap function. 
Even the circular buffer is compound of several physically 399 - discontiguous blocks of memory, they are contiguous to the user space, hence 400 - just one call to mmap is needed: 401 - 402 - mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); 403 - 404 - If tp_frame_size is a divisor of tp_block_size frames will be 405 - contiguously spaced by tp_frame_size bytes. If not, each 406 - tp_block_size/tp_frame_size frames there will be a gap between 407 - the frames. This is because a frame cannot be spawn across two 408 - blocks. 409 - 410 - To use one socket for capture and transmission, the mapping of both the 411 - RX and TX buffer ring has to be done with one call to mmap: 412 - 413 - ... 414 - setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo)); 415 - setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar)); 416 - ... 417 - rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); 418 - tx_ring = rx_ring + size; 419 - 420 - RX must be the first as the kernel maps the TX ring memory right 421 - after the RX one. 422 - 423 - At the beginning of each frame there is an status field (see 424 - struct tpacket_hdr). If this field is 0 means that the frame is ready 425 - to be used for the kernel, If not, there is a frame the user can read 426 - and the following flags apply: 427 - 428 - +++ Capture process: 429 - from include/linux/if_packet.h 430 - 431 - #define TP_STATUS_COPY (1 << 1) 432 - #define TP_STATUS_LOSING (1 << 2) 433 - #define TP_STATUS_CSUMNOTREADY (1 << 3) 434 - #define TP_STATUS_CSUM_VALID (1 << 7) 435 - 436 - TP_STATUS_COPY : This flag indicates that the frame (and associated 437 - meta information) has been truncated because it's 438 - larger than tp_frame_size. This packet can be 439 - read entirely with recvfrom(). 440 - 441 - In order to make this work it must to be 442 - enabled previously with setsockopt() and 443 - the PACKET_COPY_THRESH option. 
444 - 445 - The number of frames that can be buffered to 446 - be read with recvfrom is limited like a normal socket. 447 - See the SO_RCVBUF option in the socket (7) man page. 448 - 449 - TP_STATUS_LOSING : indicates there were packet drops from last time 450 - statistics where checked with getsockopt() and 451 - the PACKET_STATISTICS option. 452 - 453 - TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which 454 - its checksum will be done in hardware. So while 455 - reading the packet we should not try to check the 456 - checksum. 457 - 458 - TP_STATUS_CSUM_VALID : This flag indicates that at least the transport 459 - header checksum of the packet has been already 460 - validated on the kernel side. If the flag is not set 461 - then we are free to check the checksum by ourselves 462 - provided that TP_STATUS_CSUMNOTREADY is also not set. 463 - 464 - for convenience there are also the following defines: 465 - 466 - #define TP_STATUS_KERNEL 0 467 - #define TP_STATUS_USER 1 468 - 469 - The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel 470 - receives a packet it puts in the buffer and updates the status with 471 - at least the TP_STATUS_USER flag. Then the user can read the packet, 472 - once the packet is read the user must zero the status field, so the kernel 473 - can use again that frame buffer. 474 - 475 - The user can use poll (any other variant should apply too) to check if new 476 - packets are in the ring: 477 - 478 - struct pollfd pfd; 479 - 480 - pfd.fd = fd; 481 - pfd.revents = 0; 482 - pfd.events = POLLIN|POLLRDNORM|POLLERR; 483 - 484 - if (status == TP_STATUS_KERNEL) 485 - retval = poll(&pfd, 1, timeout); 486 - 487 - It doesn't incur in a race condition to first check the status value and 488 - then poll for frames. 
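The status handshake for the capture ring described above (kernel sets TP_STATUS_USER, user reads the frame and writes TP_STATUS_KERNEL back) can be sketched with a few small helpers. This is an illustration under assumptions, not code from the document: the helper names are hypothetical, the TPACKET_V1 header is assumed, and the `__sync_synchronize()` barrier is a GCC/Clang builtin chosen for this sketch.

```c
#include <assert.h>
#include <linux/if_packet.h>

/* Non-zero when the kernel has handed the frame to user space. */
static inline int rx_frame_ready(const volatile struct tpacket_hdr *hdr)
{
	return (hdr->tp_status & TP_STATUS_USER) != 0;
}

/* Non-zero when the frame was truncated to fit tp_frame_size. */
static inline int rx_frame_truncated(const volatile struct tpacket_hdr *hdr)
{
	return (hdr->tp_status & TP_STATUS_COPY) != 0;
}

/* Hand the frame back so the kernel can reuse the slot. */
static inline void rx_frame_release(volatile struct tpacket_hdr *hdr)
{
	/* Assumption of this sketch: finish all payload reads first. */
	__sync_synchronize();
	hdr->tp_status = TP_STATUS_KERNEL;
}
```

A capture loop would call poll() only while `rx_frame_ready()` is false for the current frame, process the frame otherwise, and finish with `rx_frame_release()` before moving to the next slot.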
489 - 490 - ++ Transmission process 491 - Those defines are also used for transmission: 492 - 493 - #define TP_STATUS_AVAILABLE 0 // Frame is available 494 - #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send() 495 - #define TP_STATUS_SENDING 2 // Frame is currently in transmission 496 - #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct 497 - 498 - First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a 499 - packet, the user fills a data buffer of an available frame, sets tp_len to 500 - current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. 501 - This can be done on multiple frames. Once the user is ready to transmit, it 502 - calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are 503 - forwarded to the network device. The kernel updates each status of sent 504 - frames with TP_STATUS_SENDING until the end of transfer. 505 - At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE. 506 - 507 - header->tp_len = in_i_size; 508 - header->tp_status = TP_STATUS_SEND_REQUEST; 509 - retval = send(this->socket, NULL, 0, 0); 510 - 511 - The user can also use poll() to check if a buffer is available: 512 - (status == TP_STATUS_SENDING) 513 - 514 - struct pollfd pfd; 515 - pfd.fd = fd; 516 - pfd.revents = 0; 517 - pfd.events = POLLOUT; 518 - retval = poll(&pfd, 1, timeout); 519 - 520 - ------------------------------------------------------------------------------- 521 - + What TPACKET versions are available and when to use them? 522 - ------------------------------------------------------------------------------- 523 - 524 - int val = tpacket_version; 525 - setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); 526 - getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); 527 - 528 - where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. 
529 - 530 - TPACKET_V1: 531 - - Default if not otherwise specified by setsockopt(2) 532 - - RX_RING, TX_RING available 533 - 534 - TPACKET_V1 --> TPACKET_V2: 535 - - Made 64 bit clean due to unsigned long usage in TPACKET_V1 536 - structures, thus this also works on 64 bit kernel with 32 bit 537 - userspace and the like 538 - - Timestamp resolution in nanoseconds instead of microseconds 539 - - RX_RING, TX_RING available 540 - - VLAN metadata information available for packets 541 - (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID), 542 - in the tpacket2_hdr structure: 543 - - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates 544 - that the tp_vlan_tci field has valid VLAN TCI value 545 - - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field 546 - indicates that the tp_vlan_tpid field has valid VLAN TPID value 547 - - How to switch to TPACKET_V2: 548 - 1. Replace struct tpacket_hdr by struct tpacket2_hdr 549 - 2. Query header len and save 550 - 3. Set protocol version to 2, set up ring as usual 551 - 4. For getting the sockaddr_ll, 552 - use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of 553 - (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) 554 - 555 - TPACKET_V2 --> TPACKET_V3: 556 - - Flexible buffer implementation for RX_RING: 557 - 1. Blocks can be configured with non-static frame-size 558 - 2. Read/poll is at a block-level (as opposed to packet-level) 559 - 3. Added poll timeout to avoid indefinite user-space wait 560 - on idle links 561 - 4. Added user-configurable knobs: 562 - 4.1 block::timeout 563 - 4.2 tpkt_hdr::sk_rxhash 564 - - RX Hash data available in user space 565 - - TX_RING semantics are conceptually similar to TPACKET_V2; 566 - use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN 567 - instead of TPACKET2_HDRLEN. In the current implementation, 568 - the tp_next_offset field in the tpacket3_hdr MUST be set to 569 - zero, indicating that the ring does not hold variable sized frames. 
570 - Packets with non-zero values of tp_next_offset will be dropped. 571 - 572 - ------------------------------------------------------------------------------- 573 - + AF_PACKET fanout mode 574 - ------------------------------------------------------------------------------- 575 - 576 - In the AF_PACKET fanout mode, packet reception can be load balanced among 577 - processes. This also works in combination with mmap(2) on packet sockets. 578 - 579 - Currently implemented fanout policies are: 580 - 581 - - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash 582 - - PACKET_FANOUT_LB: schedule to socket by round-robin 583 - - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on 584 - - PACKET_FANOUT_RND: schedule to socket by random selection 585 - - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another 586 - - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping 587 - 588 - Minimal example code by David S. Miller (try things like "./test eth0 hash", 589 - "./test eth0 lb", etc.): 590 - 591 - #include <stddef.h> 592 - #include <stdlib.h> 593 - #include <stdio.h> 594 - #include <string.h> 595 - 596 - #include <sys/types.h> 597 - #include <sys/wait.h> 598 - #include <sys/socket.h> 599 - #include <sys/ioctl.h> 600 - 601 - #include <unistd.h> 602 - 603 - #include <linux/if_ether.h> 604 - #include <linux/if_packet.h> 605 - 606 - #include <net/if.h> 607 - 608 - static const char *device_name; 609 - static int fanout_type; 610 - static int fanout_id; 611 - 612 - #ifndef PACKET_FANOUT 613 - # define PACKET_FANOUT 18 614 - # define PACKET_FANOUT_HASH 0 615 - # define PACKET_FANOUT_LB 1 616 - #endif 617 - 618 - static int setup_socket(void) 619 - { 620 - int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); 621 - struct sockaddr_ll ll; 622 - struct ifreq ifr; 623 - int fanout_arg; 624 - 625 - if (fd < 0) { 626 - perror("socket"); 627 - return EXIT_FAILURE; 628 - } 629 - 630 - memset(&ifr, 0, sizeof(ifr)); 631 - 
strcpy(ifr.ifr_name, device_name); 632 - err = ioctl(fd, SIOCGIFINDEX, &ifr); 633 - if (err < 0) { 634 - perror("SIOCGIFINDEX"); 635 - return EXIT_FAILURE; 636 - } 637 - 638 - memset(&ll, 0, sizeof(ll)); 639 - ll.sll_family = AF_PACKET; 640 - ll.sll_ifindex = ifr.ifr_ifindex; 641 - err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); 642 - if (err < 0) { 643 - perror("bind"); 644 - return EXIT_FAILURE; 645 - } 646 - 647 - fanout_arg = (fanout_id | (fanout_type << 16)); 648 - err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, 649 - &fanout_arg, sizeof(fanout_arg)); 650 - if (err) { 651 - perror("setsockopt"); 652 - return EXIT_FAILURE; 653 - } 654 - 655 - return fd; 656 - } 657 - 658 - static void fanout_thread(void) 659 - { 660 - int fd = setup_socket(); 661 - int limit = 10000; 662 - 663 - if (fd < 0) 664 - exit(fd); 665 - 666 - while (limit-- > 0) { 667 - char buf[1600]; 668 - int err; 669 - 670 - err = read(fd, buf, sizeof(buf)); 671 - if (err < 0) { 672 - perror("read"); 673 - exit(EXIT_FAILURE); 674 - } 675 - if ((limit % 10) == 0) 676 - fprintf(stdout, "(%d) \n", getpid()); 677 - } 678 - 679 - fprintf(stdout, "%d: Received 10000 packets\n", getpid()); 680 - 681 - close(fd); 682 - exit(0); 683 - } 684 - 685 - int main(int argc, char **argp) 686 - { 687 - int fd, err; 688 - int i; 689 - 690 - if (argc != 3) { 691 - fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); 692 - return EXIT_FAILURE; 693 - } 694 - 695 - if (!strcmp(argp[2], "hash")) 696 - fanout_type = PACKET_FANOUT_HASH; 697 - else if (!strcmp(argp[2], "lb")) 698 - fanout_type = PACKET_FANOUT_LB; 699 - else { 700 - fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); 701 - exit(EXIT_FAILURE); 702 - } 703 - 704 - device_name = argp[1]; 705 - fanout_id = getpid() & 0xffff; 706 - 707 - for (i = 0; i < 4; i++) { 708 - pid_t pid = fork(); 709 - 710 - switch (pid) { 711 - case 0: 712 - fanout_thread(); 713 - 714 - case -1: 715 - perror("fork"); 716 - exit(EXIT_FAILURE); 717 - } 718 - } 719 - 720 - 
for (i = 0; i < 4; i++) { 721 - int status; 722 - 723 - wait(&status); 724 - } 725 - 726 - return 0; 727 - } 728 - 729 - ------------------------------------------------------------------------------- 730 - + AF_PACKET TPACKET_V3 example 731 - ------------------------------------------------------------------------------- 732 - 733 - AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame 734 - sizes by doing it's own memory management. It is based on blocks where polling 735 - works on a per block basis instead of per ring as in TPACKET_V2 and predecessor. 736 - 737 - It is said that TPACKET_V3 brings the following benefits: 738 - *) ~15 - 20% reduction in CPU-usage 739 - *) ~20% increase in packet capture rate 740 - *) ~2x increase in packet density 741 - *) Port aggregation analysis 742 - *) Non static frame size to capture entire packet payload 743 - 744 - So it seems to be a good candidate to be used with packet fanout. 745 - 746 - Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile 747 - it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.): 748 - 749 - /* Written from scratch, but kernel-to-user space API usage 750 - * dissected from lolpcap: 751 - * Copyright 2011, Chetan Loke <loke.chetan@gmail.com> 752 - * License: GPL, version 2.0 753 - */ 754 - 755 - #include <stdio.h> 756 - #include <stdlib.h> 757 - #include <stdint.h> 758 - #include <string.h> 759 - #include <assert.h> 760 - #include <net/if.h> 761 - #include <arpa/inet.h> 762 - #include <netdb.h> 763 - #include <poll.h> 764 - #include <unistd.h> 765 - #include <signal.h> 766 - #include <inttypes.h> 767 - #include <sys/socket.h> 768 - #include <sys/mman.h> 769 - #include <linux/if_packet.h> 770 - #include <linux/if_ether.h> 771 - #include <linux/ip.h> 772 - 773 - #ifndef likely 774 - # define likely(x) __builtin_expect(!!(x), 1) 775 - #endif 776 - #ifndef unlikely 777 - # define unlikely(x) __builtin_expect(!!(x), 0) 778 - #endif 779 - 
780 - struct block_desc { 781 - uint32_t version; 782 - uint32_t offset_to_priv; 783 - struct tpacket_hdr_v1 h1; 784 - }; 785 - 786 - struct ring { 787 - struct iovec *rd; 788 - uint8_t *map; 789 - struct tpacket_req3 req; 790 - }; 791 - 792 - static unsigned long packets_total = 0, bytes_total = 0; 793 - static sig_atomic_t sigint = 0; 794 - 795 - static void sighandler(int num) 796 - { 797 - sigint = 1; 798 - } 799 - 800 - static int setup_socket(struct ring *ring, char *netdev) 801 - { 802 - int err, i, fd, v = TPACKET_V3; 803 - struct sockaddr_ll ll; 804 - unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; 805 - unsigned int blocknum = 64; 806 - 807 - fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); 808 - if (fd < 0) { 809 - perror("socket"); 810 - exit(1); 811 - } 812 - 813 - err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); 814 - if (err < 0) { 815 - perror("setsockopt"); 816 - exit(1); 817 - } 818 - 819 - memset(&ring->req, 0, sizeof(ring->req)); 820 - ring->req.tp_block_size = blocksiz; 821 - ring->req.tp_frame_size = framesiz; 822 - ring->req.tp_block_nr = blocknum; 823 - ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; 824 - ring->req.tp_retire_blk_tov = 60; 825 - ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; 826 - 827 - err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, 828 - sizeof(ring->req)); 829 - if (err < 0) { 830 - perror("setsockopt"); 831 - exit(1); 832 - } 833 - 834 - ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, 835 - PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); 836 - if (ring->map == MAP_FAILED) { 837 - perror("mmap"); 838 - exit(1); 839 - } 840 - 841 - ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); 842 - assert(ring->rd); 843 - for (i = 0; i < ring->req.tp_block_nr; ++i) { 844 - ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); 845 - ring->rd[i].iov_len = ring->req.tp_block_size; 846 - } 847 - 848 - memset(&ll, 0, sizeof(ll)); 
849 - ll.sll_family = PF_PACKET; 850 - ll.sll_protocol = htons(ETH_P_ALL); 851 - ll.sll_ifindex = if_nametoindex(netdev); 852 - ll.sll_hatype = 0; 853 - ll.sll_pkttype = 0; 854 - ll.sll_halen = 0; 855 - 856 - err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); 857 - if (err < 0) { 858 - perror("bind"); 859 - exit(1); 860 - } 861 - 862 - return fd; 863 - } 864 - 865 - static void display(struct tpacket3_hdr *ppd) 866 - { 867 - struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); 868 - struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); 869 - 870 - if (eth->h_proto == htons(ETH_P_IP)) { 871 - struct sockaddr_in ss, sd; 872 - char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; 873 - 874 - memset(&ss, 0, sizeof(ss)); 875 - ss.sin_family = PF_INET; 876 - ss.sin_addr.s_addr = ip->saddr; 877 - getnameinfo((struct sockaddr *) &ss, sizeof(ss), 878 - sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); 879 - 880 - memset(&sd, 0, sizeof(sd)); 881 - sd.sin_family = PF_INET; 882 - sd.sin_addr.s_addr = ip->daddr; 883 - getnameinfo((struct sockaddr *) &sd, sizeof(sd), 884 - dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); 885 - 886 - printf("%s -> %s, ", sbuff, dbuff); 887 - } 888 - 889 - printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); 890 - } 891 - 892 - static void walk_block(struct block_desc *pbd, const int block_num) 893 - { 894 - int num_pkts = pbd->h1.num_pkts, i; 895 - unsigned long bytes = 0; 896 - struct tpacket3_hdr *ppd; 897 - 898 - ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + 899 - pbd->h1.offset_to_first_pkt); 900 - for (i = 0; i < num_pkts; ++i) { 901 - bytes += ppd->tp_snaplen; 902 - display(ppd); 903 - 904 - ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + 905 - ppd->tp_next_offset); 906 - } 907 - 908 - packets_total += num_pkts; 909 - bytes_total += bytes; 910 - } 911 - 912 - static void flush_block(struct block_desc *pbd) 913 - { 914 - pbd->h1.block_status = TP_STATUS_KERNEL; 915 - } 916 - 917 - static void teardown_socket(struct ring 
*ring, int fd) 918 - { 919 - munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); 920 - free(ring->rd); 921 - close(fd); 922 - } 923 - 924 - int main(int argc, char **argp) 925 - { 926 - int fd, err; 927 - socklen_t len; 928 - struct ring ring; 929 - struct pollfd pfd; 930 - unsigned int block_num = 0, blocks = 64; 931 - struct block_desc *pbd; 932 - struct tpacket_stats_v3 stats; 933 - 934 - if (argc != 2) { 935 - fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); 936 - return EXIT_FAILURE; 937 - } 938 - 939 - signal(SIGINT, sighandler); 940 - 941 - memset(&ring, 0, sizeof(ring)); 942 - fd = setup_socket(&ring, argp[argc - 1]); 943 - assert(fd > 0); 944 - 945 - memset(&pfd, 0, sizeof(pfd)); 946 - pfd.fd = fd; 947 - pfd.events = POLLIN | POLLERR; 948 - pfd.revents = 0; 949 - 950 - while (likely(!sigint)) { 951 - pbd = (struct block_desc *) ring.rd[block_num].iov_base; 952 - 953 - if ((pbd->h1.block_status & TP_STATUS_USER) == 0) { 954 - poll(&pfd, 1, -1); 955 - continue; 956 - } 957 - 958 - walk_block(pbd, block_num); 959 - flush_block(pbd); 960 - block_num = (block_num + 1) % blocks; 961 - } 962 - 963 - len = sizeof(stats); 964 - err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); 965 - if (err < 0) { 966 - perror("getsockopt"); 967 - exit(1); 968 - } 969 - 970 - fflush(stdout); 971 - printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", 972 - stats.tp_packets, bytes_total, stats.tp_drops, 973 - stats.tp_freeze_q_cnt); 974 - 975 - teardown_socket(&ring, fd); 976 - return 0; 977 - } 978 - 979 - ------------------------------------------------------------------------------- 980 - + PACKET_QDISC_BYPASS 981 - ------------------------------------------------------------------------------- 982 - 983 - If there is a requirement to load the network with many packets in a similar 984 - fashion as pktgen does, you might set the following option after socket 985 - creation: 986 - 987 - int one = 1; 988 - setsockopt(fd, 
SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); 989 - 990 - This has the side-effect, that packets sent through PF_PACKET will bypass the 991 - kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning, 992 - packet are not buffered, tc disciplines are ignored, increased loss can occur 993 - and such packets are also not visible to other PF_PACKET sockets anymore. So, 994 - you have been warned; generally, this can be useful for stress testing various 995 - components of a system. 996 - 997 - On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled 998 - on PF_PACKET sockets. 999 - 1000 - ------------------------------------------------------------------------------- 1001 - + PACKET_TIMESTAMP 1002 - ------------------------------------------------------------------------------- 1003 - 1004 - The PACKET_TIMESTAMP setting determines the source of the timestamp in 1005 - the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your 1006 - NIC is capable of timestamping packets in hardware, you can request those 1007 - hardware timestamps to be used. Note: you may need to enable the generation 1008 - of hardware timestamps with SIOCSHWTSTAMP (see related information from 1009 - Documentation/networking/timestamping.txt). 1010 - 1011 - PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING: 1012 - 1013 - int req = SOF_TIMESTAMPING_RAW_HARDWARE; 1014 - setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)) 1015 - 1016 - For the mmap(2)ed ring buffers, such timestamps are stored in the 1017 - tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine 1018 - what kind of timestamp has been reported, the tp_status field is binary |'ed 1019 - with the following possible bits ... 1020 - 1021 - TP_STATUS_TS_RAW_HARDWARE 1022 - TP_STATUS_TS_SOFTWARE 1023 - 1024 - ... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the 1025 - RX_RING, if neither is set (i.e. 
PACKET_TIMESTAMP is not set), then a 1026 - software fallback was invoked *within* PF_PACKET's processing code (less 1027 - precise). 1028 - 1029 - Getting timestamps for the TX_RING works as follows: i) fill the ring frames, 1030 - ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant 1031 - frames to be updated resp. the frame handed over to the application, iv) walk 1032 - through the frames to pick up the individual hw/sw timestamps. 1033 - 1034 - Only (!) if transmit timestamping is enabled, then these bits are combined 1035 - with binary | with TP_STATUS_AVAILABLE, so you must check for that in your 1036 - application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) 1037 - in a first step to see if the frame belongs to the application, and then 1038 - one can extract the type of timestamp in a second step from tp_status)! 1039 - 1040 - If you don't care about them, thus having it disabled, checking for 1041 - TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the 1042 - TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec 1043 - members do not contain a valid value. For TX_RINGs, by default no timestamp 1044 - is generated! 1045 - 1046 - See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt 1047 - for more information on hardware timestamps. 
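The two-step TX_RING status check described in the PACKET_TIMESTAMP section can be sketched roughly as follows. This is a minimal illustration, not code from the kernel tree: the helper names tx_frame_is_ours() and tx_frame_ts_kind() and the enum tx_ts_kind are hypothetical, and only the TP_STATUS_* constants come from linux/if_packet.h. It assumes the caller has already read tp_status from a tpacket{,2,3}_hdr frame in the mmap(2)ed TX_RING.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical result type for this sketch; not part of the kernel API. */
enum tx_ts_kind { TX_TS_NONE, TX_TS_SOFTWARE, TX_TS_RAW_HARDWARE };

/*
 * Step 1: the frame belongs to the application once the kernel is
 * neither waiting to send it nor in the middle of sending it.
 * With transmit timestamping enabled, tp_status may carry extra
 * TP_STATUS_TS_* bits, so comparing against TP_STATUS_AVAILABLE
 * alone would not be enough.
 */
static int tx_frame_is_ours(uint32_t tp_status)
{
	return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/*
 * Step 2: report which kind of timestamp, if any, was stored in
 * tp_sec and tp_{n,u}sec. Only meaningful for a frame that has
 * already been handed back to the application.
 */
static enum tx_ts_kind tx_frame_ts_kind(uint32_t tp_status)
{
	assert(tx_frame_is_ours(tp_status));

	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TX_TS_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TX_TS_SOFTWARE;

	/* No timestamp bit set: tp_sec and tp_{n,u}sec hold no valid value. */
	return TX_TS_NONE;
}
```

A caller walking the ring after sendto() would test tx_frame_is_ours() on each frame first, and read the timestamp fields only when tx_frame_ts_kind() returns something other than TX_TS_NONE.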
1048 - 1049 - ------------------------------------------------------------------------------- 1050 - + Miscellaneous bits 1051 - ------------------------------------------------------------------------------- 1052 - 1053 - - Packet sockets work well together with Linux socket filters, thus you also 1054 - might want to have a look at Documentation/networking/filter.rst 1055 - 1056 - -------------------------------------------------------------------------------- 1057 - + THANKS 1058 - -------------------------------------------------------------------------------- 1059 - 1060 - Jesse Brandeburg, for fixing my grammathical/spelling errors 1061 -