
Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux

Pavel Begunkov says:

====================
io_uring zerocopy send

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers; mixing the two is allowed but not recommended. Apart from
the usual request completions, just as with MSG_ZEROCOPY, io_uring separately
notifies userspace when buffers are freed and can be reused (see API design
below); these notifications are delivered into io_uring's Completion Queue.
The "buffer-free" notifications are not necessarily per request: userspace
controls the granularity and explicitly attaches a number of requests to a
single notification. The series also adds some internal optimisations for
registered buffers, such as removing page referencing.

From the kernel networking perspective there are two main changes. The first
is passing ubuf_info into the network layer from io_uring (inside an
in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, and also helps to avoid cross-referencing
and synchronisation problems. The second part is an optional optimisation
removing page referencing for requests with registered buffers.

Benchmarking UDP with an optimised version of the selftest (see [1]), which
sends a bunch of requests, waits for completions and repeats. The "+ flush"
column posts one additional "buffer-free" notification per request, and the
plain "zc" column doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc | zc             | zc + flush
   4000 | 495134 | 606420 (+22%)  | 558971 (+12%)
   1500 | 551808 | 577116 (+4.5%) | 565803 (+2.5%)
   1000 | 584677 | 592088 (+1.2%) | 560885 (-4%)
    600 | 596292 | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc  | zc             | zc + flush
   8000 | 1299916 | 2396600 (+84%) | 2224219 (+71%)
   4000 | 1869230 | 2344146 (+25%) | 2170069 (+16%)
   1200 | 2071617 | 2361960 (+14%) | 2203052 (+6%)
    600 | 2106794 | 2381527 (+13%) | 2195295 (+4%)

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting. There
is also an additional bunch of refcounting optimisations that were omitted
from the series for simplicity, as they don't change the picture drastically;
they will be sent as a follow-up, as will flushing optimisations closing the
performance gap between the last two columns.

For TCP on localhost (with hacks enabling localhost zerocopy) and including
additional overhead for receive:

IO size | non-zc | zc
   1200 |   4174 |  4148
   4096 |   7597 | 11228

Using a real NIC with 1200-byte sends, zc is ~5-10% worse than non-zc; the
omitted optimisations may help somewhat. It should look better for 4000-byte
sends, but that couldn't be tested properly because of setup problems.

Links:

liburing (benchmark + tests):
[1] https://github.com/isilence/liburing/tree/zc_v4

kernel repo:
[2] https://github.com/isilence/linux/tree/zc_v4

RFC v1:
[3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

RFC v2:
https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/

Net patches are based on:
git@github.com:isilence/linux.git zc_v4-net-base
or
https://github.com/isilence/linux/tree/zc_v4-net-base

API design overview:

The series introduces an io_uring concept of notifiers. From the userspace
perspective, a notifier is an entity to which userspace can bind one or more
requests and then request to flush it. Flushing a notifier makes it
impossible to attach new requests to it, and instructs the notifier to post a
completion once all requests attached to it are completed and the kernel
doesn't need the buffers anymore.

Notifications are stored in notification slots, which should be registered as
an array in io_uring. Each slot stores only one notifier at any given moment.
Flushing removes the notifier from the slot, and the slot automatically
replaces it with a new one. All operations on notifiers are done by
specifying the index of the slot a notifier currently occupies.

When registering notifications, userspace specifies a u64 tag for each slot,
which is copied into notification completion entries as cqe::user_data.
cqe::res is 0, and cqe::flags carries a wrap-around u32 sequence number
counting the notifiers of that slot.

====================

Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9 files changed: +191 -64
include/linux/skbuff.h | +48 -18

@@
 	 * charged to the kernel memory.
 	 */
 	SKBFL_PURE_ZEROCOPY = BIT(2),
+
+	SKBFL_DONT_ORPHAN = BIT(3),
+
+	/* page references are managed by the ubuf_info, so it's safe to
+	 * use frags only up until ubuf_info is released
+	 */
+	SKBFL_MANAGED_FRAG_REFS = BIT(4),
 };

 #define SKBFL_ZEROCOPY_FRAG	(SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
-#define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY)
+#define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
+				 SKBFL_DONT_ORPHAN | SKBFL_MANAGED_FRAG_REFS)

 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
@@
 void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
 			   bool success);

-int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *from, size_t length);
+int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
+			    struct sk_buff *skb, struct iov_iter *from,
+			    size_t length);

 static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
 					  struct msghdr *msg, int len)
 {
-	return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
 }

 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
@@
 static inline bool skb_zcopy_pure(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->flags & SKBFL_PURE_ZEROCOPY;
+}
+
+static inline bool skb_zcopy_managed(const struct sk_buff *skb)
+{
+	return skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAG_REFS;
 }

 static inline bool skb_pure_zcopy_same(const struct sk_buff *skb1,
@@
 		skb_shinfo(skb)->flags &= ~SKBFL_ALL_ZEROCOPY;
 	}
+}
+
+void __skb_zcopy_downgrade_managed(struct sk_buff *skb);
+
+static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
+{
+	if (unlikely(skb_zcopy_managed(skb)))
+		__skb_zcopy_downgrade_managed(skb);
 }

 static inline void skb_mark_not_on_list(struct sk_buff *skb)
@@
 	return skb_headlen(skb) + __skb_pagelen(skb);
 }

+static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
+					      int i, struct page *page,
+					      int off, int size)
+{
+	skb_frag_t *frag = &shinfo->frags[i];
+
+	/*
+	 * Propagate page pfmemalloc to the skb if we can. The problem is
+	 * that not all callers have unique ownership of the page but rely
+	 * on page_is_pfmemalloc doing the right thing(tm).
+	 */
+	frag->bv_page = page;
+	frag->bv_offset = off;
+	skb_frag_size_set(frag, size);
+}
+
 /**
  * skb_len_add - adds a number to len fields of skb
  * @skb: buffer to add len to
@@
 static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 					struct page *page, int off, int size)
 {
-	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
-	/*
-	 * Propagate page pfmemalloc to the skb if we can. The problem is
-	 * that not all callers have unique ownership of the page but rely
-	 * on page_is_pfmemalloc doing the right thing(tm).
-	 */
-	frag->bv_page = page;
-	frag->bv_offset = off;
-	skb_frag_size_set(frag, size);
-
+	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
 	page = compound_head(page);
 	if (page_is_pfmemalloc(page))
 		skb->pfmemalloc = true;
@@
 {
 	if (likely(!skb_zcopy(skb)))
 		return 0;
-	if (!skb_zcopy_is_nouarg(skb) &&
-	    skb_uarg(skb)->callback == msg_zerocopy_callback)
+	if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
 		return 0;
 	return skb_copy_ubufs(skb, gfp_mask);
 }
@@
  */
 static inline void skb_frag_unref(struct sk_buff *skb, int f)
 {
-	__skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	if (!skb_zcopy_managed(skb))
+		__skb_frag_unref(&shinfo->frags[f], skb->pp_recycle);
 }

 /**
include/linux/socket.h | +5

@@
 struct pid;
 struct cred;
 struct socket;
+struct sock;
+struct sk_buff;

 #define __sockaddr_check_size(size)	\
 	BUILD_BUG_ON(((size) > sizeof(struct __kernel_sockaddr_storage)))
@@
 	unsigned int	msg_flags;	/* flags on received message */
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
 	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
+	struct ubuf_info *msg_ubuf;
+	int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length);
 };

 struct user_msghdr {
net/compat.c | +1

@@
 		return -EMSGSIZE;

 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*ptr = msg.msg_iov;
 	*len = msg.msg_iovlen;
 	return 0;
net/core/datagram.c | +10 -4

@@
 }
 EXPORT_SYMBOL(skb_copy_datagram_from_iter);

-int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *from, size_t length)
+int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
+			    struct sk_buff *skb, struct iov_iter *from,
+			    size_t length)
 {
-	int frag = skb_shinfo(skb)->nr_frags;
+	int frag;
+
+	if (msg && msg->sg_from_iter)
+		return msg->sg_from_iter(sk, skb, from, length);
+
+	frag = skb_shinfo(skb)->nr_frags;

 	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
@@
 	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
 		return -EFAULT;

-	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+	return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U);
 }
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
net/core/skbuff.c | +33 -4

@@
 			      &shinfo->dataref))
 		goto exit;

-	skb_zcopy_clear(skb, true);
+	if (skb_zcopy(skb)) {
+		bool skip_unref = shinfo->flags & SKBFL_MANAGED_FRAG_REFS;
+
+		skb_zcopy_clear(skb, true);
+		if (skip_unref)
+			goto free_head;
+	}

 	for (i = 0; i < shinfo->nr_frags; i++)
 		__skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);

+free_head:
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
@@
  */
 void skb_tx_error(struct sk_buff *skb)
 {
-	skb_zcopy_clear(skb, true);
+	if (skb) {
+		skb_zcopy_downgrade_managed(skb);
+		skb_zcopy_clear(skb, true);
+	}
 }
 EXPORT_SYMBOL(skb_tx_error);
@@
 	uarg->len = 1;
 	uarg->bytelen = size;
 	uarg->zerocopy = 1;
-	uarg->flags = SKBFL_ZEROCOPY_FRAG;
+	uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	refcount_set(&uarg->refcnt, 1);
 	sock_hold(sk);
@@
 	if (uarg) {
 		const u32 byte_limit = 1 << 19;		/* limit to a few TSO */
 		u32 bytelen, next;
+
+		/* there might be non MSG_ZEROCOPY users */
+		if (uarg->callback != msg_zerocopy_callback)
+			return NULL;

 		/* realloc only when socket is locked (TCP, UDP cork),
 		 * so uarg->len and sk_zckey access is serialized
@@
 	if (orig_uarg && uarg != orig_uarg)
 		return -EEXIST;

-	err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
+	err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len);
 	if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
 		struct sock *save_sk = skb->sk;
@@
 	return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
+
+void __skb_zcopy_downgrade_managed(struct sk_buff *skb)
+{
+	int i;
+
+	skb_shinfo(skb)->flags &= ~SKBFL_MANAGED_FRAG_REFS;
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+		skb_frag_ref(skb, i);
+}
+EXPORT_SYMBOL_GPL(__skb_zcopy_downgrade_managed);

 static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
@@
 	BUG_ON(nhead < 0);

 	BUG_ON(skb_shared(skb));
+
+	skb_zcopy_downgrade_managed(skb);

 	size = SKB_DATA_ALIGN(size);
@@
 	int pos = skb_headlen(skb);
 	const int zc_flags = SKBFL_SHARED_FRAG | SKBFL_PURE_ZEROCOPY;

+	skb_zcopy_downgrade_managed(skb);
+
 	skb_shinfo(skb1)->flags |= skb_shinfo(skb)->flags & zc_flags;
 	skb_zerocopy_clone(skb1, skb, 0);
 	if (len < pos)	/* Split line is inside header. */
@@
 	if (skb_can_coalesce(skb, i, page, offset)) {
 		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
 	} else if (i < MAX_SKB_FRAGS) {
+		skb_zcopy_downgrade_managed(skb);
 		get_page(page);
 		skb_fill_page_desc(skb, i, page, offset, size);
 	} else {
net/ipv4/ip_output.c | +36 -14

@@
 	struct inet_sock *inet = inet_sk(sk);
 	struct ubuf_info *uarg = NULL;
 	struct sk_buff *skb;
-
 	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
@@
 	int copy;
 	int err;
 	int offset = 0;
+	bool zc = false;
 	unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
 	int csummode = CHECKSUM_NONE;
 	struct rtable *rt = (struct rtable *)cork->dst;
@@
 	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
 		csummode = CHECKSUM_PARTIAL;

-	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
-		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-		if (!uarg)
-			return -ENOBUFS;
-		extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
-		if (rt->dst.dev->features & NETIF_F_SG &&
-		    csummode == CHECKSUM_PARTIAL) {
-			paged = true;
-		} else {
-			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg, &extra_uref);
+	if ((flags & MSG_ZEROCOPY) && length) {
+		struct msghdr *msg = from;
+
+		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+			if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+				return -EINVAL;
+
+			/* Leave uarg NULL if can't zerocopy, callers should
+			 * be able to handle it.
+			 */
+			if ((rt->dst.dev->features & NETIF_F_SG) &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+				uarg = msg->msg_ubuf;
+			}
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			if (!uarg)
+				return -ENOBUFS;
+			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				uarg->zerocopy = 0;
+				skb_zcopy_set(skb, uarg, &extra_uref);
+			}
 		}
 	}
@@
 		    (fraglen + alloc_extra < SKB_MAX_ALLOC ||
 		     !(rt->dst.dev->features & NETIF_F_SG)))
 			alloclen = fraglen;
-		else {
+		else if (!zc) {
 			alloclen = min_t(int, fraglen, MAX_HEADER);
 			pagedlen = fraglen - alloclen;
+		} else {
+			alloclen = fragheaderlen + transhdrlen;
+			pagedlen = datalen - transhdrlen;
 		}

 		alloclen += alloc_extra;
@@
 				err = -EFAULT;
 				goto error;
 			}
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;

 			err = -ENOMEM;
 			if (!sk_page_frag_refill(sk, pfrag))
 				goto error;

+			skb_zcopy_downgrade_managed(skb);
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
 				err = -EMSGSIZE;
net/ipv4/tcp.c | +20 -11

@@
 	flags = msg->msg_flags;

-	if (flags & MSG_ZEROCOPY && size && sock_flag(sk, SOCK_ZEROCOPY)) {
+	if ((flags & MSG_ZEROCOPY) && size) {
 		skb = tcp_write_queue_tail(sk);
-		uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
-		if (!uarg) {
-			err = -ENOBUFS;
-			goto out_err;
-		}

-		zc = sk->sk_route_caps & NETIF_F_SG;
-		if (!zc)
-			uarg->zerocopy = 0;
+		if (msg->msg_ubuf) {
+			uarg = msg->msg_ubuf;
+			net_zcopy_get(uarg);
+			zc = sk->sk_route_caps & NETIF_F_SG;
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			if (!uarg) {
+				err = -ENOBUFS;
+				goto out_err;
+			}
+			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (!zc)
+				uarg->zerocopy = 0;
+		}
 	}

 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@
 			copy = min_t(int, copy, pfrag->size - pfrag->offset);

-			if (tcp_downgrade_zcopy_pure(sk, skb))
-				goto wait_for_space;
+			if (unlikely(skb_zcopy_pure(skb) || skb_zcopy_managed(skb))) {
+				if (tcp_downgrade_zcopy_pure(sk, skb))
+					goto wait_for_space;
+				skb_zcopy_downgrade_managed(skb);
+			}

 			copy = tcp_wmem_schedule(sk, copy);
 			if (!copy)
net/ipv6/ip6_output.c | +36 -13

@@
 	int copy;
 	int err;
 	int offset = 0;
+	bool zc = false;
 	u32 tskey = 0;
 	struct rt6_info *rt = (struct rt6_info *)cork->dst;
 	struct ipv6_txoptions *opt = v6_cork->opt;
@@
 	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
 		csummode = CHECKSUM_PARTIAL;

-	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
-		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-		if (!uarg)
-			return -ENOBUFS;
-		extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
-		if (rt->dst.dev->features & NETIF_F_SG &&
-		    csummode == CHECKSUM_PARTIAL) {
-			paged = true;
-		} else {
-			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg, &extra_uref);
+	if ((flags & MSG_ZEROCOPY) && length) {
+		struct msghdr *msg = from;
+
+		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+			if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+				return -EINVAL;
+
+			/* Leave uarg NULL if can't zerocopy, callers should
+			 * be able to handle it.
+			 */
+			if ((rt->dst.dev->features & NETIF_F_SG) &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+				uarg = msg->msg_ubuf;
+			}
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			if (!uarg)
+				return -ENOBUFS;
+			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				uarg->zerocopy = 0;
+				skb_zcopy_set(skb, uarg, &extra_uref);
+			}
 		}
 	}
@@
 		    (fraglen + alloc_extra < SKB_MAX_ALLOC ||
 		     !(rt->dst.dev->features & NETIF_F_SG)))
 			alloclen = fraglen;
-		else {
+		else if (!zc) {
 			alloclen = min_t(int, fraglen, MAX_HEADER);
 			pagedlen = fraglen - alloclen;
+		} else {
+			alloclen = fragheaderlen + transhdrlen;
+			pagedlen = datalen - transhdrlen;
 		}
 		alloclen += alloc_extra;
@@
 				err = -EFAULT;
 				goto error;
 			}
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;

 			err = -ENOMEM;
 			if (!sk_page_frag_refill(sk, pfrag))
 				goto error;

+			skb_zcopy_downgrade_managed(skb);
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
 				err = -EMSGSIZE;
net/socket.c | +2

@@
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 	if (addr) {
 		err = move_addr_to_kernel(addr, addr_len, &address);
 		if (err < 0)
@@
 		return -EMSGSIZE;

 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*uiov = msg.msg_iov;
 	*nsegs = msg.msg_iovlen;
 	return 0;