Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R

On long-running enterprise production servers, high-order contiguous
memory pages are usually scarce, and in most cases only fragmented
pages can be obtained.

When replacing TCP with SMC-R in such production scenarios, attempting
to allocate high-order physically contiguous sndbufs and RMBs may
trigger frequent memory compaction, which can cause unexpected hangs
and poses further stability risks.

This patch therefore allows an SMC-R link group to use virtually
contiguous sndbufs and RMBs, avoiding the potential issues mentioned
above. Whether physically or virtually contiguous buffers are used can
be selected via the sysctl smcr_buf_type.
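As a usage sketch, the buffer type can be switched at runtime via sysctl.
The sysctl path and value encoding below (0 = physically contiguous,
1 = mixed, 2 = virtually contiguous, mirroring SMCR_PHYS_CONT_BUFS,
SMCR_MIXED_BUFS and SMCR_VIRT_CONT_BUFS in the diff) are assumptions
based on the smcr_buf_type name, not confirmed by this commit text:

```shell
# Read the current SMC-R buffer type (sysctl path is an assumption)
sysctl net.smc.smcr_buf_type

# Prefer physically contiguous buffers, but fall back to virtually
# contiguous ones when high-order allocation fails (mixed mode)
sysctl -w net.smc.smcr_buf_type=1
```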

Note that using virtually contiguous buffers introduces an acceptable
performance regression, which falls into two parts:

1) Regression in the data path, caused by the additional address
translation of the sndbuf performed by the RNIC on Tx. In general,
though, translating addresses through the MTT is fast.

Taking 256KB sndbufs and RMBs as an example, the qperf latency and
bandwidth results with physically versus virtually contiguous buffers
compare as follows:

- client:
smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
-t 5 -vu tcp_{bw|lat}
- server:
smc_run taskset -c <cpu> qperf

[latency]
msgsize tcp smcr smcr-use-virt-buf
1 11.17 us 7.56 us 7.51 us (-0.67%)
2 10.65 us 7.74 us 7.56 us (-2.31%)
4 11.11 us 7.52 us 7.59 us ( 0.84%)
8 10.83 us 7.55 us 7.51 us (-0.48%)
16 11.21 us 7.46 us 7.51 us ( 0.71%)
32 10.65 us 7.53 us 7.58 us ( 0.61%)
64 10.95 us 7.74 us 7.80 us ( 0.76%)
128 11.14 us 7.83 us 7.87 us ( 0.47%)
256 10.97 us 7.94 us 7.92 us (-0.28%)
512 11.23 us 7.94 us 8.20 us ( 3.25%)
1024 11.60 us 8.12 us 8.20 us ( 0.96%)
2048 14.04 us 8.30 us 8.51 us ( 2.49%)
4096 16.88 us 9.13 us 9.07 us (-0.64%)
8192 22.50 us 10.56 us 11.22 us ( 6.26%)
16384 28.99 us 12.88 us 13.83 us ( 7.37%)
32768 40.13 us 16.76 us 16.95 us ( 1.16%)
65536 68.70 us 24.68 us 24.85 us ( 0.68%)
[bandwidth]
msgsize tcp smcr smcr-use-virt-buf
1 1.65 MB/s 1.59 MB/s 1.53 MB/s (-3.88%)
2 3.32 MB/s 3.17 MB/s 3.08 MB/s (-2.67%)
4 6.66 MB/s 6.33 MB/s 6.09 MB/s (-3.85%)
8 13.67 MB/s 13.45 MB/s 11.97 MB/s (-10.99%)
16 25.36 MB/s 27.15 MB/s 24.16 MB/s (-11.01%)
32 48.22 MB/s 54.24 MB/s 49.41 MB/s (-8.89%)
64 106.79 MB/s 107.32 MB/s 99.05 MB/s (-7.71%)
128 210.21 MB/s 202.46 MB/s 201.02 MB/s (-0.71%)
256 400.81 MB/s 416.81 MB/s 393.52 MB/s (-5.59%)
512 746.49 MB/s 834.12 MB/s 809.99 MB/s (-2.89%)
1024 1292.33 MB/s 1641.96 MB/s 1571.82 MB/s (-4.27%)
2048 2007.64 MB/s 2760.44 MB/s 2717.68 MB/s (-1.55%)
4096 2665.17 MB/s 4157.44 MB/s 4070.76 MB/s (-2.09%)
8192 3159.72 MB/s 4361.57 MB/s 4270.65 MB/s (-2.08%)
16384 4186.70 MB/s 4574.13 MB/s 4501.17 MB/s (-1.60%)
32768 4093.21 MB/s 4487.42 MB/s 4322.43 MB/s (-3.68%)
65536 4057.14 MB/s 4735.61 MB/s 4555.17 MB/s (-3.81%)

2) Regression in the buffer initialization and destruction path, caused
by the additional memory region (MR) operations on sndbufs. Thanks to
the link group's buffer reuse mechanism, however, the impact of this
regression decreases as buffers are reused more often.

Taking 256KB sndbufs and RMBs as an example, the latencies of some key
SMC-R buffer-related functions, obtained with bpftrace, are as follows:

Function Phys-bufs Virt-bufs
smcr_new_buf_create() 67154 ns 79164 ns
smc_ib_buf_map_sg() 525 ns 928 ns
smc_ib_get_memory_region() 162294 ns 161191 ns
smc_wr_reg_send() 9957 ns 9635 ns
smc_ib_put_memory_region() 203548 ns 198374 ns
smc_ib_buf_unmap_sg() 508 ns 1158 ns
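Per-function latencies like those above can be gathered with a
kprobe/kretprobe pair; a minimal bpftrace sketch for one of the
functions introduced by this patch (it assumes smcr_link_reg_buf is
kprobe-able on the running kernel):

```shell
# Histogram of smcr_link_reg_buf() latency in nanoseconds
bpftrace -e '
kprobe:smcr_link_reg_buf { @start[tid] = nsecs; }
kretprobe:smcr_link_reg_buf /@start[tid]/ {
	@lat_ns = hist(nsecs - @start[tid]);
	delete(@start[tid]);
}'
```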

------------
Test environment notes:
1. The above tests were run on 2 VMs within the same host.
2. The NIC is a ConnectX-4 Lx, using SR-IOV to pass through 2 VFs, one
   to each VM.
3. The VMs' vCPUs are bound to different physical CPUs, and the bound
   physical CPUs are isolated via the `isolcpus=xxx` kernel cmdline.
4. The NICs' queue count is set to 1.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Wen Gu; committed by David S. Miller
b8d19945 b984f370
9 files changed: +327 -117
net/smc/af_smc.c (+58 -8)
···
 	smc_copy_sock_settings(&smc->sk, smc->clcsock->sk, SK_FLAGS_CLC_TO_SMC);
 }
 
+/* register the new vzalloced sndbuf on all links */
+static int smcr_lgr_reg_sndbufs(struct smc_link *link,
+				struct smc_buf_desc *snd_desc)
+{
+	struct smc_link_group *lgr = link->lgr;
+	int i, rc = 0;
+
+	if (!snd_desc->is_vm)
+		return -EINVAL;
+
+	/* protect against parallel smcr_link_reg_buf() */
+	mutex_lock(&lgr->llc_conf_mutex);
+	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
+		if (!smc_link_active(&lgr->lnk[i]))
+			continue;
+		rc = smcr_link_reg_buf(&lgr->lnk[i], snd_desc);
+		if (rc)
+			break;
+	}
+	mutex_unlock(&lgr->llc_conf_mutex);
+	return rc;
+}
+
 /* register the new rmb on all links */
 static int smcr_lgr_reg_rmbs(struct smc_link *link,
 			     struct smc_buf_desc *rmb_desc)
···
 	if (rc)
 		return rc;
 	/* protect against parallel smc_llc_cli_rkey_exchange() and
-	 * parallel smcr_link_reg_rmb()
+	 * parallel smcr_link_reg_buf()
 	 */
 	mutex_lock(&lgr->llc_conf_mutex);
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 		if (!smc_link_active(&lgr->lnk[i]))
 			continue;
-		rc = smcr_link_reg_rmb(&lgr->lnk[i], rmb_desc);
+		rc = smcr_link_reg_buf(&lgr->lnk[i], rmb_desc);
 		if (rc)
 			goto out;
 	}
···
 	smc_wr_remember_qp_attr(link);
 
-	if (smcr_link_reg_rmb(link, smc->conn.rmb_desc))
-		return SMC_CLC_DECL_ERR_REGRMB;
+	/* reg the sndbuf if it was vzalloced */
+	if (smc->conn.sndbuf_desc->is_vm) {
+		if (smcr_link_reg_buf(link, smc->conn.sndbuf_desc))
+			return SMC_CLC_DECL_ERR_REGBUF;
+	}
+
+	/* reg the rmb */
+	if (smcr_link_reg_buf(link, smc->conn.rmb_desc))
+		return SMC_CLC_DECL_ERR_REGBUF;
 
 	/* confirm_rkey is implicit on 1st contact */
 	smc->conn.rmb_desc->is_conf_rkey = true;
···
 			goto connect_abort;
 		}
 	} else {
+		/* reg sendbufs if they were vzalloced */
+		if (smc->conn.sndbuf_desc->is_vm) {
+			if (smcr_lgr_reg_sndbufs(link, smc->conn.sndbuf_desc)) {
+				reason_code = SMC_CLC_DECL_ERR_REGBUF;
+				goto connect_abort;
+			}
+		}
 		if (smcr_lgr_reg_rmbs(link, smc->conn.rmb_desc)) {
-			reason_code = SMC_CLC_DECL_ERR_REGRMB;
+			reason_code = SMC_CLC_DECL_ERR_REGBUF;
 			goto connect_abort;
 		}
 	}
···
 	struct smc_llc_qentry *qentry;
 	int rc;
 
-	if (smcr_link_reg_rmb(link, smc->conn.rmb_desc))
-		return SMC_CLC_DECL_ERR_REGRMB;
+	/* reg the sndbuf if it was vzalloced */
+	if (smc->conn.sndbuf_desc->is_vm) {
+		if (smcr_link_reg_buf(link, smc->conn.sndbuf_desc))
+			return SMC_CLC_DECL_ERR_REGBUF;
+	}
+
+	/* reg the rmb */
+	if (smcr_link_reg_buf(link, smc->conn.rmb_desc))
+		return SMC_CLC_DECL_ERR_REGBUF;
 
 	/* send CONFIRM LINK request to client over the RoCE fabric */
 	rc = smc_llc_send_confirm_link(link, SMC_LLC_REQ);
···
 	struct smc_connection *conn = &new_smc->conn;
 
 	if (!local_first) {
+		/* reg sendbufs if they were vzalloced */
+		if (conn->sndbuf_desc->is_vm) {
+			if (smcr_lgr_reg_sndbufs(conn->lnk,
+						 conn->sndbuf_desc))
+				return SMC_CLC_DECL_ERR_REGBUF;
+		}
 		if (smcr_lgr_reg_rmbs(conn->lnk, conn->rmb_desc))
-			return SMC_CLC_DECL_ERR_REGRMB;
+			return SMC_CLC_DECL_ERR_REGBUF;
 	}
 
 	return 0;
net/smc/smc_clc.c (+5 -3)
···
 			ETH_ALEN);
 	hton24(clc->r0.qpn, link->roce_qp->qp_num);
 	clc->r0.rmb_rkey =
-		htonl(conn->rmb_desc->mr_rx[link->link_idx]->rkey);
+		htonl(conn->rmb_desc->mr[link->link_idx]->rkey);
 	clc->r0.rmbe_idx = 1; /* for now: 1 RMB = 1 RMBE */
 	clc->r0.rmbe_alert_token = htonl(conn->alert_token_local);
 	switch (clc->hdr.type) {
···
 		break;
 	}
 	clc->r0.rmbe_size = conn->rmbe_size_short;
-	clc->r0.rmb_dma_addr = cpu_to_be64((u64)sg_dma_address
-					   (conn->rmb_desc->sgt[link->link_idx].sgl));
+	clc->r0.rmb_dma_addr = conn->rmb_desc->is_vm ?
+		cpu_to_be64((uintptr_t)conn->rmb_desc->cpu_addr) :
+		cpu_to_be64((u64)sg_dma_address
+			    (conn->rmb_desc->sgt[link->link_idx].sgl));
 	hton24(clc->r0.psn, link->psn_initial);
 	if (version == SMC_V1) {
 		clc->hdr.length = htons(SMCR_CLC_ACCEPT_CONFIRM_LEN);
net/smc/smc_clc.h (+1 -1)
···
 #define SMC_CLC_DECL_INTERR	0x09990000  /* internal error */
 #define SMC_CLC_DECL_ERR_RTOK	0x09990001  /* rtoken handling failed */
 #define SMC_CLC_DECL_ERR_RDYLNK	0x09990002  /* ib ready link failed */
-#define SMC_CLC_DECL_ERR_REGRMB	0x09990003  /* reg rmb failed */
+#define SMC_CLC_DECL_ERR_REGBUF	0x09990003  /* reg rdma bufs failed */
 
 #define SMC_FIRST_CONTACT_MASK	0b10	/* first contact bit within typev2 */
 
net/smc/smc_core.c (+149 -64)
···
 	return NULL;
 }
 
-static void smcr_buf_unuse(struct smc_buf_desc *rmb_desc,
+static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
 			   struct smc_link_group *lgr)
 {
+	struct mutex *lock;	/* lock buffer list */
 	int rc;
 
-	if (rmb_desc->is_conf_rkey && !list_empty(&lgr->list)) {
+	if (is_rmb && buf_desc->is_conf_rkey && !list_empty(&lgr->list)) {
 		/* unregister rmb with peer */
 		rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
 		if (!rc) {
 			/* protect against smc_llc_cli_rkey_exchange() */
 			mutex_lock(&lgr->llc_conf_mutex);
-			smc_llc_do_delete_rkey(lgr, rmb_desc);
-			rmb_desc->is_conf_rkey = false;
+			smc_llc_do_delete_rkey(lgr, buf_desc);
+			buf_desc->is_conf_rkey = false;
 			mutex_unlock(&lgr->llc_conf_mutex);
 			smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
 		}
 	}
 
-	if (rmb_desc->is_reg_err) {
+	if (buf_desc->is_reg_err) {
 		/* buf registration failed, reuse not possible */
-		mutex_lock(&lgr->rmbs_lock);
-		list_del(&rmb_desc->list);
-		mutex_unlock(&lgr->rmbs_lock);
+		lock = is_rmb ? &lgr->rmbs_lock :
+				&lgr->sndbufs_lock;
+		mutex_lock(lock);
+		list_del(&buf_desc->list);
+		mutex_unlock(lock);
 
-		smc_buf_free(lgr, true, rmb_desc);
+		smc_buf_free(lgr, is_rmb, buf_desc);
 	} else {
-		rmb_desc->used = 0;
-		memset(rmb_desc->cpu_addr, 0, rmb_desc->len);
+		buf_desc->used = 0;
+		memset(buf_desc->cpu_addr, 0, buf_desc->len);
 	}
 }
···
 			  struct smc_link_group *lgr)
 {
 	if (conn->sndbuf_desc) {
-		conn->sndbuf_desc->used = 0;
-		memset(conn->sndbuf_desc->cpu_addr, 0, conn->sndbuf_desc->len);
+		if (!lgr->is_smcd && conn->sndbuf_desc->is_vm) {
+			smcr_buf_unuse(conn->sndbuf_desc, false, lgr);
+		} else {
+			conn->sndbuf_desc->used = 0;
+			memset(conn->sndbuf_desc->cpu_addr, 0,
+			       conn->sndbuf_desc->len);
+		}
 	}
-	if (conn->rmb_desc && lgr->is_smcd) {
-		conn->rmb_desc->used = 0;
-		memset(conn->rmb_desc->cpu_addr, 0, conn->rmb_desc->len +
-		       sizeof(struct smcd_cdc_msg));
-	} else if (conn->rmb_desc) {
-		smcr_buf_unuse(conn->rmb_desc, lgr);
+	if (conn->rmb_desc) {
+		if (!lgr->is_smcd) {
+			smcr_buf_unuse(conn->rmb_desc, true, lgr);
+		} else {
+			conn->rmb_desc->used = 0;
+			memset(conn->rmb_desc->cpu_addr, 0,
+			       conn->rmb_desc->len +
+			       sizeof(struct smcd_cdc_msg));
+		}
 	}
 }
···
 static void smcr_buf_unmap_link(struct smc_buf_desc *buf_desc, bool is_rmb,
 				struct smc_link *lnk)
 {
-	if (is_rmb)
+	if (is_rmb || buf_desc->is_vm)
 		buf_desc->is_reg_mr[lnk->link_idx] = false;
 	if (!buf_desc->is_map_ib[lnk->link_idx])
 		return;
-	if (is_rmb) {
-		if (buf_desc->mr_rx[lnk->link_idx]) {
-			smc_ib_put_memory_region(
-					buf_desc->mr_rx[lnk->link_idx]);
-			buf_desc->mr_rx[lnk->link_idx] = NULL;
-		}
-		smc_ib_buf_unmap_sg(lnk, buf_desc, DMA_FROM_DEVICE);
-	} else {
-		smc_ib_buf_unmap_sg(lnk, buf_desc, DMA_TO_DEVICE);
+
+	if ((is_rmb || buf_desc->is_vm) &&
+	    buf_desc->mr[lnk->link_idx]) {
+		smc_ib_put_memory_region(buf_desc->mr[lnk->link_idx]);
+		buf_desc->mr[lnk->link_idx] = NULL;
 	}
+	if (is_rmb)
+		smc_ib_buf_unmap_sg(lnk, buf_desc, DMA_FROM_DEVICE);
+	else
+		smc_ib_buf_unmap_sg(lnk, buf_desc, DMA_TO_DEVICE);
+
 	sg_free_table(&buf_desc->sgt[lnk->link_idx]);
 	buf_desc->is_map_ib[lnk->link_idx] = false;
 }
···
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++)
 		smcr_buf_unmap_link(buf_desc, is_rmb, &lgr->lnk[i]);
 
-	if (buf_desc->pages)
+	if (!buf_desc->is_vm && buf_desc->pages)
 		__free_pages(buf_desc->pages, buf_desc->order);
+	else if (buf_desc->is_vm && buf_desc->cpu_addr)
+		vfree(buf_desc->cpu_addr);
 	kfree(buf_desc);
 }
···
 	return max_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2);
 }
 
-/* map an rmb buf to a link */
+/* map an buf to a link */
 static int smcr_buf_map_link(struct smc_buf_desc *buf_desc, bool is_rmb,
 			     struct smc_link *lnk)
 {
-	int rc;
+	int rc, i, nents, offset, buf_size, size, access_flags;
+	struct scatterlist *sg;
+	void *buf;
 
 	if (buf_desc->is_map_ib[lnk->link_idx])
 		return 0;
 
-	rc = sg_alloc_table(&buf_desc->sgt[lnk->link_idx], 1, GFP_KERNEL);
+	if (buf_desc->is_vm) {
+		buf = buf_desc->cpu_addr;
+		buf_size = buf_desc->len;
+		offset = offset_in_page(buf_desc->cpu_addr);
+		nents = PAGE_ALIGN(buf_size + offset) / PAGE_SIZE;
+	} else {
+		nents = 1;
+	}
+
+	rc = sg_alloc_table(&buf_desc->sgt[lnk->link_idx], nents, GFP_KERNEL);
 	if (rc)
 		return rc;
-	sg_set_buf(buf_desc->sgt[lnk->link_idx].sgl,
-		   buf_desc->cpu_addr, buf_desc->len);
+
+	if (buf_desc->is_vm) {
+		/* virtually contiguous buffer */
+		for_each_sg(buf_desc->sgt[lnk->link_idx].sgl, sg, nents, i) {
+			size = min_t(int, PAGE_SIZE - offset, buf_size);
+			sg_set_page(sg, vmalloc_to_page(buf), size, offset);
+			buf += size / sizeof(*buf);
+			buf_size -= size;
+			offset = 0;
+		}
+	} else {
+		/* physically contiguous buffer */
+		sg_set_buf(buf_desc->sgt[lnk->link_idx].sgl,
+			   buf_desc->cpu_addr, buf_desc->len);
+	}
 
 	/* map sg table to DMA address */
 	rc = smc_ib_buf_map_sg(lnk, buf_desc,
 			       is_rmb ? DMA_FROM_DEVICE : DMA_TO_DEVICE);
 	/* SMC protocol depends on mapping to one DMA address only */
-	if (rc != 1) {
+	if (rc != nents) {
 		rc = -EAGAIN;
 		goto free_table;
 	}
···
 	buf_desc->is_dma_need_sync |=
 		smc_ib_is_sg_need_sync(lnk, buf_desc) << lnk->link_idx;
 
-	/* create a new memory region for the RMB */
-	if (is_rmb) {
-		rc = smc_ib_get_memory_region(lnk->roce_pd,
-					      IB_ACCESS_REMOTE_WRITE |
-					      IB_ACCESS_LOCAL_WRITE,
+	if (is_rmb || buf_desc->is_vm) {
+		/* create a new memory region for the RMB or vzalloced sndbuf */
+		access_flags = is_rmb ?
+			       IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+			       IB_ACCESS_LOCAL_WRITE;
+
+		rc = smc_ib_get_memory_region(lnk->roce_pd, access_flags,
 					      buf_desc, lnk->link_idx);
 		if (rc)
 			goto buf_unmap;
-		smc_ib_sync_sg_for_device(lnk, buf_desc, DMA_FROM_DEVICE);
+		smc_ib_sync_sg_for_device(lnk, buf_desc,
+					  is_rmb ? DMA_FROM_DEVICE : DMA_TO_DEVICE);
 	}
 	buf_desc->is_map_ib[lnk->link_idx] = true;
 	return 0;
···
 	return rc;
 }
 
-/* register a new rmb on IB device,
+/* register a new buf on IB device, rmb or vzalloced sndbuf
  * must be called under lgr->llc_conf_mutex lock
  */
-int smcr_link_reg_rmb(struct smc_link *link, struct smc_buf_desc *rmb_desc)
+int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *buf_desc)
 {
 	if (list_empty(&link->lgr->list))
 		return -ENOLINK;
-	if (!rmb_desc->is_reg_mr[link->link_idx]) {
-		/* register memory region for new rmb */
-		if (smc_wr_reg_send(link, rmb_desc->mr_rx[link->link_idx])) {
-			rmb_desc->is_reg_err = true;
+	if (!buf_desc->is_reg_mr[link->link_idx]) {
+		/* register memory region for new buf */
+		if (buf_desc->is_vm)
+			buf_desc->mr[link->link_idx]->iova =
+				(uintptr_t)buf_desc->cpu_addr;
+		if (smc_wr_reg_send(link, buf_desc->mr[link->link_idx])) {
+			buf_desc->is_reg_err = true;
 			return -EFAULT;
 		}
-		rmb_desc->is_reg_mr[link->link_idx] = true;
+		buf_desc->is_reg_mr[link->link_idx] = true;
 	}
 	return 0;
 }
···
 	struct smc_buf_desc *buf_desc, *bf;
 	int i, rc = 0;
 
+	/* reg all RMBs for a new link */
 	mutex_lock(&lgr->rmbs_lock);
 	for (i = 0; i < SMC_RMBE_SIZES; i++) {
 		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list) {
 			if (!buf_desc->used)
 				continue;
-			rc = smcr_link_reg_rmb(lnk, buf_desc);
-			if (rc)
-				goto out;
+			rc = smcr_link_reg_buf(lnk, buf_desc);
+			if (rc) {
+				mutex_unlock(&lgr->rmbs_lock);
+				return rc;
+			}
 		}
 	}
-out:
 	mutex_unlock(&lgr->rmbs_lock);
+
+	if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
+		return rc;
+
+	/* reg all vzalloced sndbufs for a new link */
+	mutex_lock(&lgr->sndbufs_lock);
+	for (i = 0; i < SMC_RMBE_SIZES; i++) {
+		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i], list) {
+			if (!buf_desc->used || !buf_desc->is_vm)
+				continue;
+			rc = smcr_link_reg_buf(lnk, buf_desc);
+			if (rc) {
+				mutex_unlock(&lgr->sndbufs_lock);
+				return rc;
+			}
+		}
+	}
+	mutex_unlock(&lgr->sndbufs_lock);
 	return rc;
 }
···
 	if (!buf_desc)
 		return ERR_PTR(-ENOMEM);
 
-	buf_desc->order = get_order(bufsize);
-	buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
-				      __GFP_NOMEMALLOC | __GFP_COMP |
-				      __GFP_NORETRY | __GFP_ZERO,
-				      buf_desc->order);
-	if (!buf_desc->pages) {
-		kfree(buf_desc);
-		return ERR_PTR(-EAGAIN);
+	switch (lgr->buf_type) {
+	case SMCR_PHYS_CONT_BUFS:
+	case SMCR_MIXED_BUFS:
+		buf_desc->order = get_order(bufsize);
+		buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
+					      __GFP_NOMEMALLOC | __GFP_COMP |
+					      __GFP_NORETRY | __GFP_ZERO,
+					      buf_desc->order);
+		if (buf_desc->pages) {
+			buf_desc->cpu_addr =
+				(void *)page_address(buf_desc->pages);
+			buf_desc->len = bufsize;
+			buf_desc->is_vm = false;
+			break;
+		}
+		if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
+			goto out;
+		fallthrough;	// try virtually contiguous buf
+	case SMCR_VIRT_CONT_BUFS:
+		buf_desc->order = get_order(bufsize);
+		buf_desc->cpu_addr = vzalloc(PAGE_SIZE << buf_desc->order);
+		if (!buf_desc->cpu_addr)
+			goto out;
+		buf_desc->pages = NULL;
+		buf_desc->len = bufsize;
+		buf_desc->is_vm = true;
+		break;
 	}
-	buf_desc->cpu_addr = (void *)page_address(buf_desc->pages);
-	buf_desc->len = bufsize;
 	return buf_desc;
+
+out:
+	kfree(buf_desc);
+	return ERR_PTR(-EAGAIN);
 }
···
 	if (!is_smcd) {
 		if (smcr_buf_map_usable_links(lgr, buf_desc, is_rmb)) {
-			smcr_buf_unuse(buf_desc, lgr);
+			smcr_buf_unuse(buf_desc, is_rmb, lgr);
 			return -ENOMEM;
 		}
 	}
net/smc/smc_core.h (+7 -3)
···
 	struct { /* SMC-R */
 		struct sg_table sgt[SMC_LINKS_PER_LGR_MAX];
 			/* virtual buffer */
-		struct ib_mr *mr_rx[SMC_LINKS_PER_LGR_MAX];
-			/* for rmb only: memory region
+		struct ib_mr *mr[SMC_LINKS_PER_LGR_MAX];
+			/* memory region: for rmb and
+			 * vzalloced sndbuf
 			 * incl. rkey provided to peer
+			 * and lkey provided to local
 			 */
 		u32 order;	/* allocation order */
···
 		u8 is_dma_need_sync;
 		u8 is_reg_err;
 			/* buffer registration err */
+		u8 is_vm;
+			/* virtually contiguous */
 	};
 	struct { /* SMC-D */
 		unsigned short sba_idx;
···
 void smcr_lgr_set_type(struct smc_link_group *lgr, enum smc_lgr_type new_type);
 void smcr_lgr_set_type_asym(struct smc_link_group *lgr,
 			    enum smc_lgr_type new_type, int asym_lnk_idx);
-int smcr_link_reg_rmb(struct smc_link *link, struct smc_buf_desc *rmb_desc);
+int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *rmb_desc);
 struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
 				  struct smc_link *from_lnk, bool is_dev_err);
 void smcr_link_down_cond(struct smc_link *lnk);
net/smc/smc_ib.c (+8 -7)
···
 	int sg_num;
 
 	/* map the largest prefix of a dma mapped SG list */
-	sg_num = ib_map_mr_sg(buf_slot->mr_rx[link_idx],
+	sg_num = ib_map_mr_sg(buf_slot->mr[link_idx],
 			      buf_slot->sgt[link_idx].sgl,
 			      buf_slot->sgt[link_idx].orig_nents,
 			      &offset, PAGE_SIZE);
···
 int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
 			     struct smc_buf_desc *buf_slot, u8 link_idx)
 {
-	if (buf_slot->mr_rx[link_idx])
+	if (buf_slot->mr[link_idx])
 		return 0; /* already done */
 
-	buf_slot->mr_rx[link_idx] =
+	buf_slot->mr[link_idx] =
 		ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order);
-	if (IS_ERR(buf_slot->mr_rx[link_idx])) {
+	if (IS_ERR(buf_slot->mr[link_idx])) {
 		int rc;
 
-		rc = PTR_ERR(buf_slot->mr_rx[link_idx]);
-		buf_slot->mr_rx[link_idx] = NULL;
+		rc = PTR_ERR(buf_slot->mr[link_idx]);
+		buf_slot->mr[link_idx] = NULL;
 		return rc;
 	}
 
-	if (smc_ib_map_mr_sg(buf_slot, link_idx) != 1)
+	if (smc_ib_map_mr_sg(buf_slot, link_idx) !=
+	    buf_slot->sgt[link_idx].orig_nents)
 		return -EINVAL;
 
 	return 0;
net/smc/smc_llc.c (+19 -14)
···
 		if (smc_link_active(link) && link != send_link) {
 			rkeyllc->rtoken[rtok_ix].link_id = link->link_id;
 			rkeyllc->rtoken[rtok_ix].rmb_key =
-				htonl(rmb_desc->mr_rx[link->link_idx]->rkey);
-			rkeyllc->rtoken[rtok_ix].rmb_vaddr = cpu_to_be64(
-				(u64)sg_dma_address(
-					rmb_desc->sgt[link->link_idx].sgl));
+				htonl(rmb_desc->mr[link->link_idx]->rkey);
+			rkeyllc->rtoken[rtok_ix].rmb_vaddr = rmb_desc->is_vm ?
+				cpu_to_be64((uintptr_t)rmb_desc->cpu_addr) :
+				cpu_to_be64((u64)sg_dma_address
+					    (rmb_desc->sgt[link->link_idx].sgl));
 			rtok_ix++;
 		}
 	}
 	/* rkey of send_link is in rtoken[0] */
 	rkeyllc->rtoken[0].num_rkeys = rtok_ix - 1;
 	rkeyllc->rtoken[0].rmb_key =
-		htonl(rmb_desc->mr_rx[send_link->link_idx]->rkey);
-	rkeyllc->rtoken[0].rmb_vaddr = cpu_to_be64(
-		(u64)sg_dma_address(rmb_desc->sgt[send_link->link_idx].sgl));
+		htonl(rmb_desc->mr[send_link->link_idx]->rkey);
+	rkeyllc->rtoken[0].rmb_vaddr = rmb_desc->is_vm ?
+		cpu_to_be64((uintptr_t)rmb_desc->cpu_addr) :
+		cpu_to_be64((u64)sg_dma_address
+			    (rmb_desc->sgt[send_link->link_idx].sgl));
 	/* send llc message */
 	rc = smc_wr_tx_send(send_link, pend);
 put_out:
···
 	rkeyllc->hd.common.llc_type = SMC_LLC_DELETE_RKEY;
 	smc_llc_init_msg_hdr(&rkeyllc->hd, link->lgr, sizeof(*rkeyllc));
 	rkeyllc->num_rkeys = 1;
-	rkeyllc->rkey[0] = htonl(rmb_desc->mr_rx[link->link_idx]->rkey);
+	rkeyllc->rkey[0] = htonl(rmb_desc->mr[link->link_idx]->rkey);
 	/* send llc message */
 	rc = smc_wr_tx_send(link, pend);
 put_out:
···
 		if (!buf_pos)
 			break;
 		rmb = buf_pos;
-		ext->rt[i].rmb_key = htonl(rmb->mr_rx[prim_lnk_idx]->rkey);
-		ext->rt[i].rmb_key_new = htonl(rmb->mr_rx[lnk_idx]->rkey);
-		ext->rt[i].rmb_vaddr_new =
+		ext->rt[i].rmb_key = htonl(rmb->mr[prim_lnk_idx]->rkey);
+		ext->rt[i].rmb_key_new = htonl(rmb->mr[lnk_idx]->rkey);
+		ext->rt[i].rmb_vaddr_new = rmb->is_vm ?
+			cpu_to_be64((uintptr_t)rmb->cpu_addr) :
 			cpu_to_be64((u64)sg_dma_address(rmb->sgt[lnk_idx].sgl));
 		buf_pos = smc_llc_get_next_rmb(lgr, &buf_lst, buf_pos);
 		while (buf_pos && !(buf_pos)->used)
···
 	}
 	rmb = *buf_pos;
 
-	addc_llc->rt[i].rmb_key = htonl(rmb->mr_rx[prim_lnk_idx]->rkey);
-	addc_llc->rt[i].rmb_key_new = htonl(rmb->mr_rx[lnk_idx]->rkey);
-	addc_llc->rt[i].rmb_vaddr_new =
+	addc_llc->rt[i].rmb_key = htonl(rmb->mr[prim_lnk_idx]->rkey);
+	addc_llc->rt[i].rmb_key_new = htonl(rmb->mr[lnk_idx]->rkey);
+	addc_llc->rt[i].rmb_vaddr_new = rmb->is_vm ?
+		cpu_to_be64((uintptr_t)rmb->cpu_addr) :
 		cpu_to_be64((u64)sg_dma_address(rmb->sgt[lnk_idx].sgl));
 
 	(*num_rkeys_todo)--;
net/smc/smc_rx.c (+73 -15)
···
 static int smc_rx_splice(struct pipe_inode_info *pipe, char *src, size_t len,
 			 struct smc_sock *smc)
 {
+	struct smc_link_group *lgr = smc->conn.lgr;
+	int offset = offset_in_page(src);
+	struct partial_page *partial;
 	struct splice_pipe_desc spd;
-	struct partial_page partial;
-	struct smc_spd_priv *priv;
-	int bytes;
+	struct smc_spd_priv **priv;
+	struct page **pages;
+	int bytes, nr_pages;
+	int i;
 
-	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	nr_pages = !lgr->is_smcd && smc->conn.rmb_desc->is_vm ?
+		   PAGE_ALIGN(len + offset) / PAGE_SIZE : 1;
+
+	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		goto out;
+	partial = kcalloc(nr_pages, sizeof(*partial), GFP_KERNEL);
+	if (!partial)
+		goto out_page;
+	priv = kcalloc(nr_pages, sizeof(*priv), GFP_KERNEL);
 	if (!priv)
-		return -ENOMEM;
-	priv->len = len;
-	priv->smc = smc;
-	partial.offset = src - (char *)smc->conn.rmb_desc->cpu_addr;
-	partial.len = len;
-	partial.private = (unsigned long)priv;
+		goto out_part;
+	for (i = 0; i < nr_pages; i++) {
+		priv[i] = kzalloc(sizeof(**priv), GFP_KERNEL);
+		if (!priv[i])
+			goto out_priv;
+	}
 
-	spd.nr_pages_max = 1;
-	spd.nr_pages = 1;
-	spd.pages = &smc->conn.rmb_desc->pages;
-	spd.partial = &partial;
+	if (lgr->is_smcd ||
+	    (!lgr->is_smcd && !smc->conn.rmb_desc->is_vm)) {
+		/* smcd or smcr that uses physically contiguous RMBs */
+		priv[0]->len = len;
+		priv[0]->smc = smc;
+		partial[0].offset = src - (char *)smc->conn.rmb_desc->cpu_addr;
+		partial[0].len = len;
+		partial[0].private = (unsigned long)priv[0];
+		pages[0] = smc->conn.rmb_desc->pages;
+	} else {
+		int size, left = len;
+		void *buf = src;
+		/* smcr that uses virtually contiguous RMBs */
+		for (i = 0; i < nr_pages; i++) {
+			size = min_t(int, PAGE_SIZE - offset, left);
+			priv[i]->len = size;
+			priv[i]->smc = smc;
+			pages[i] = vmalloc_to_page(buf);
+			partial[i].offset = offset;
+			partial[i].len = size;
+			partial[i].private = (unsigned long)priv[i];
+			buf += size / sizeof(*buf);
+			left -= size;
+			offset = 0;
+		}
+	}
+	spd.nr_pages_max = nr_pages;
+	spd.nr_pages = nr_pages;
+	spd.pages = pages;
+	spd.partial = partial;
 	spd.ops = &smc_pipe_ops;
 	spd.spd_release = smc_rx_spd_release;
 
 	bytes = splice_to_pipe(pipe, &spd);
 	if (bytes > 0) {
 		sock_hold(&smc->sk);
-		get_page(smc->conn.rmb_desc->pages);
+		if (!lgr->is_smcd && smc->conn.rmb_desc->is_vm) {
+			for (i = 0; i < PAGE_ALIGN(bytes + offset) / PAGE_SIZE; i++)
+				get_page(pages[i]);
+		} else {
+			get_page(smc->conn.rmb_desc->pages);
+		}
 		atomic_add(bytes, &smc->conn.splice_pending);
 	}
+	kfree(priv);
+	kfree(partial);
+	kfree(pages);
 
 	return bytes;
+
+out_priv:
+	for (i = (i - 1); i >= 0; i--)
+		kfree(priv[i]);
+	kfree(priv);
+out_part:
+	kfree(partial);
+out_page:
+	kfree(pages);
+out:
+	return -ENOMEM;
 }
 
 static int smc_rx_data_available_and_no_splice_pend(struct smc_connection *conn)
net/smc/smc_tx.c (+7 -2)
···
 
 	dma_addr_t dma_addr =
 		sg_dma_address(conn->sndbuf_desc->sgt[link->link_idx].sgl);
+	u64 virt_addr = (uintptr_t)conn->sndbuf_desc->cpu_addr;
 	int src_len_sum = src_len, dst_len_sum = dst_len;
 	int sent_count = src_off;
 	int srcchunk, dstchunk;
···
 		u64 base_addr = dma_addr;
 
 		if (dst_len < link->qp_attr.cap.max_inline_data) {
-			base_addr = (uintptr_t)conn->sndbuf_desc->cpu_addr;
+			base_addr = virt_addr;
 			wr->wr.send_flags |= IB_SEND_INLINE;
 		} else {
 			wr->wr.send_flags &= ~IB_SEND_INLINE;
 		}
 
 		num_sges = 0;
 		for (srcchunk = 0; srcchunk < 2; srcchunk++) {
-			sge[srcchunk].addr = base_addr + src_off;
+			sge[srcchunk].addr = conn->sndbuf_desc->is_vm ?
+				(virt_addr + src_off) : (base_addr + src_off);
 			sge[srcchunk].length = src_len;
+			if (conn->sndbuf_desc->is_vm)
+				sge[srcchunk].lkey =
+					conn->sndbuf_desc->mr[link->link_idx]->lkey;
 			num_sges++;
 
 			src_off += src_len;