
net/smc: add autocorking support

This patch adds autocorking support for SMC, which improves
throughput for small messages by 3x+.

The main idea is borrowed from TCP autocorking, with some RDMA
specific modifications:
1. The first message should never be corked, to make sure we
don't add extra latency
2. If we have posted any Tx WRs to the NIC that have not yet
completed, cork new messages until:
a) We receive the CQE for the last Tx WR
b) We have corked enough messages on the connection
3. Try to push the corked data out when we receive the CQE of
the last Tx WR, to prevent the corked messages from hanging
in the send queue.
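As a rough userspace sketch of the decision rules above (illustrative
names and a plain int model, not the kernel implementation; the 64K cap
mirrors SMC_AUTOCORKING_DEFAULT_SIZE from the patch):

```c
#include <stdbool.h>

/* Simplified model of the autocork decision:
 * pending_tx_wr: posted-but-uncompleted Tx WRs on the connection
 * corked_bytes:  bytes already queued but not yet pushed
 * sndbuf_len:    total send buffer size
 */
#define AUTOCORKING_DEFAULT_SIZE 0x10000 /* 64K */

static bool should_autocork(int pending_tx_wr, int corked_bytes,
			    int sndbuf_len)
{
	int corking_size = AUTOCORKING_DEFAULT_SIZE;

	if (sndbuf_len / 2 < corking_size)
		corking_size = sndbuf_len / 2;

	if (pending_tx_wr == 0)		 /* rule 1: first message never corks */
		return false;
	if (corked_bytes > corking_size) /* rule 2b: corked enough, push now */
		return false;
	return true;			 /* rule 2a: wait for the last CQE */
}
```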

Both SMC autocorking and TCP autocorking check Tx completions
to decide whether to cork or not. The difference is that when we
get an SMC Tx WR completion, the data has been confirmed by the
remote RNIC, while a TCP Tx completion only tells us the data has
been sent out by the local NIC.

Add an atomic variable tx_pushing in smc_connection to make sure
only one thread sends at a time, letting the connection cork more
data and saving CDC slots.
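A userspace sketch of this single-pusher scheme using C11 atomics
(illustrative only; do_push() stands in for the real CDC send, and the
retry loop mirrors the again: label in the patch):

```c
#include <stdatomic.h>

static atomic_int tx_pushing;	/* nr of threads trying to push */
static int pushes_done;

static void do_push(void) { pushes_done++; } /* stands in for the real send */

static int tx_sndbuf_nonempty(void)
{
	/* only the first caller pushes; others just record their attempt */
	if (atomic_fetch_add(&tx_pushing, 1) + 1 > 1)
		return 0;
again:
	atomic_store(&tx_pushing, 1);
	do_push();
	/* if another caller bumped tx_pushing while we pushed, run one
	 * more round so their data is not left sitting in the queue */
	if (atomic_fetch_sub(&tx_pushing, 1) - 1 != 0)
		goto again;
	return 0;
}
```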

SMC autocorking should not add extra latency, since the first
message will always be sent out immediately.

The qperf tcp_bw test shows a more than 4x increase at small
message sizes with Mellanox ConnectX-4 Lx; other throughput
benchmarks like sockperf/netperf show similar results.
The qperf tcp_lat test shows that SMC autocorking does not
increase ping-pong latency.

Test command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf

==== Bandwidth ====
MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking
1 0.578 MB/s 2.392 MB/s(313.57%) 2.647 MB/s(357.72%)
2 1.159 MB/s 4.780 MB/s(312.53%) 5.153 MB/s(344.71%)
4 2.283 MB/s 10.266 MB/s(349.77%) 10.363 MB/s(354.02%)
8 4.668 MB/s 19.040 MB/s(307.86%) 21.215 MB/s(354.45%)
16 9.147 MB/s 38.904 MB/s(325.31%) 41.740 MB/s(356.32%)
32 18.369 MB/s 79.587 MB/s(333.25%) 82.392 MB/s(348.52%)
64 36.562 MB/s 148.668 MB/s(306.61%) 161.564 MB/s(341.89%)
128 72.961 MB/s 274.913 MB/s(276.80%) 325.363 MB/s(345.94%)
256 144.705 MB/s 512.059 MB/s(253.86%) 633.743 MB/s(337.96%)
512 288.873 MB/s 884.977 MB/s(206.35%) 1250.681 MB/s(332.95%)
1024 574.180 MB/s 1337.736 MB/s(132.98%) 2246.121 MB/s(291.19%)
2048 1095.192 MB/s 1865.952 MB/s( 70.38%) 2057.767 MB/s( 87.89%)
4096 2066.157 MB/s 2380.337 MB/s( 15.21%) 2173.983 MB/s( 5.22%)
8192 3717.198 MB/s 2733.073 MB/s(-26.47%) 3491.223 MB/s( -6.08%)
16384 4742.221 MB/s 2958.693 MB/s(-37.61%) 4637.692 MB/s( -2.20%)
32768 5349.550 MB/s 3061.285 MB/s(-42.77%) 5385.796 MB/s( 0.68%)
65536 5162.919 MB/s 3731.408 MB/s(-27.73%) 5223.890 MB/s( 1.18%)
==== Latency ====
MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking
1 10.540 us 11.938 us( 13.26%) 10.573 us( 0.31%)
2 10.996 us 11.992 us( 9.06%) 10.269 us( -6.61%)
4 10.229 us 11.687 us( 14.25%) 10.240 us( 0.11%)
8 10.203 us 11.653 us( 14.21%) 10.402 us( 1.95%)
16 10.530 us 11.313 us( 7.44%) 10.599 us( 0.66%)
32 10.241 us 11.586 us( 13.13%) 10.223 us( -0.18%)
64 10.693 us 11.652 us( 8.97%) 10.251 us( -4.13%)
128 10.597 us 11.579 us( 9.27%) 10.494 us( -0.97%)
256 10.409 us 11.957 us( 14.87%) 10.710 us( 2.89%)
512 11.088 us 12.505 us( 12.78%) 10.547 us( -4.88%)
1024 11.240 us 12.255 us( 9.03%) 10.787 us( -4.03%)
2048 11.485 us 16.970 us( 47.76%) 11.256 us( -1.99%)
4096 12.077 us 13.948 us( 15.49%) 12.230 us( 1.27%)
8192 13.683 us 16.693 us( 22.00%) 13.786 us( 0.75%)
16384 16.470 us 23.615 us( 43.38%) 16.459 us( -0.07%)
32768 22.540 us 40.966 us( 81.75%) 23.284 us( 3.30%)
65536 34.192 us 73.003 us(113.51%) 34.233 us( 0.12%)

With SMC autocorking support, we can achieve better throughput
than TCP at most message sizes without any latency trade-off.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

---
Authored by Dust Li, committed by David S. Miller
(commit dcd2cf5f, parent 462791bb)

 net/smc/smc.h     |   2 +
 net/smc/smc_cdc.c |  11 ++-
 net/smc/smc_tx.c  | 107 ++++++++++++----
 3 files changed, 105 insertions(+), 15 deletions(-)
net/smc/smc.h
···
 #define SMC_MAX_ISM_DEVS        8       /* max # of proposed non-native ISM
                                          * devices
                                          */
+#define SMC_AUTOCORKING_DEFAULT_SIZE    0x10000 /* 64K by default */

 extern struct proto smc_proto;
 extern struct proto smc_proto6;
···
         * - dec on polled tx cqe
         */
        wait_queue_head_t       cdc_pend_tx_wq; /* wakeup on no cdc_pend_tx_wr*/
+       atomic_t                tx_pushing;     /* nr_threads trying tx push */
        struct delayed_work     tx_work;        /* retry of smc_cdc_msg_send */
        u32                     tx_off;         /* base offset in peer rmb */
+8 -3
net/smc/smc_cdc.c
···
                conn->tx_cdc_seq_fin = cdcpend->ctrl_seq;
        }

-       if (atomic_dec_and_test(&conn->cdc_pend_tx_wr) &&
-           unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
-               wake_up(&conn->cdc_pend_tx_wq);
+       if (atomic_dec_and_test(&conn->cdc_pend_tx_wr)) {
+               /* If this is the last pending WR complete, we must push to
+                * prevent hang when autocork enabled.
+                */
+               smc_tx_sndbuf_nonempty(conn);
+               if (unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
+                       wake_up(&conn->cdc_pend_tx_wq);
+       }
        WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0);

        smc_tx_sndbuf_nonfull(smc);
+95 -12
net/smc/smc_tx.c
···
        return (tp->nonagle & TCP_NAGLE_CORK) ? true : false;
 }

+/* If we have pending CDC messages, do not send:
+ * Because CQE of this CDC message will happen shortly, it gives
+ * a chance to coalesce future sendmsg() payload in to one RDMA Write,
+ * without need for a timer, and with no latency trade off.
+ * Algorithm here:
+ *  1. First message should never cork
+ *  2. If we have pending Tx CDC messages, wait for the first CDC
+ *     message's completion
+ *  3. Don't cork to much data in a single RDMA Write to prevent burst
+ *     traffic, total corked message should not exceed sendbuf/2
+ */
+static bool smc_should_autocork(struct smc_sock *smc)
+{
+       struct smc_connection *conn = &smc->conn;
+       int corking_size;
+
+       corking_size = min(SMC_AUTOCORKING_DEFAULT_SIZE,
+                          conn->sndbuf_desc->len >> 1);
+
+       if (atomic_read(&conn->cdc_pend_tx_wr) == 0 ||
+           smc_tx_prepared_sends(conn) > corking_size)
+               return false;
+       return true;
+}
+
+static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
+{
+       struct smc_connection *conn = &smc->conn;
+
+       if (smc_should_autocork(smc))
+               return true;
+
+       /* for a corked socket defer the RDMA writes if
+        * sndbuf_space is still available. The applications
+        * should known how/when to uncork it.
+        */
+       if ((msg->msg_flags & MSG_MORE ||
+            smc_tx_is_corked(smc) ||
+            msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
+           atomic_read(&conn->sndbuf_space))
+               return true;
+
+       return false;
+}
+
 /* sndbuf producer: main API called by socket layer.
  * called under sock lock.
  */
···
         */
        if ((msg->msg_flags & MSG_OOB) && !send_remaining)
                conn->urg_tx_pend = true;
-       /* for a corked socket defer the RDMA writes if
-        * sndbuf_space is still available. The applications
-        * should known how/when to uncork it.
+       /* If we need to cork, do nothing and wait for the next
+        * sendmsg() call or push on tx completion
         */
-       if (!((msg->msg_flags & MSG_MORE || smc_tx_is_corked(smc) ||
-              msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
-             atomic_read(&conn->sndbuf_space)))
+       if (!smc_tx_should_cork(smc, msg))
                smc_tx_sndbuf_nonempty(conn);

        trace_smc_tx_sendmsg(smc, copylen);
···
        return rc;
 }

-int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
+static int __smc_tx_sndbuf_nonempty(struct smc_connection *conn)
 {
-       int rc;
+       struct smc_sock *smc = container_of(conn, struct smc_sock, conn);
+       int rc = 0;
+
+       /* No data in the send queue */
+       if (unlikely(smc_tx_prepared_sends(conn) <= 0))
+               goto out;
+
+       /* Peer don't have RMBE space */
+       if (unlikely(atomic_read(&conn->peer_rmbe_space) <= 0)) {
+               SMC_STAT_RMB_TX_PEER_FULL(smc, !conn->lnk);
+               goto out;
+       }

        if (conn->killed ||
-           conn->local_rx_ctrl.conn_state_flags.peer_conn_abort)
-               return -EPIPE; /* connection being aborted */
+           conn->local_rx_ctrl.conn_state_flags.peer_conn_abort) {
+               rc = -EPIPE; /* connection being aborted */
+               goto out;
+       }
        if (conn->lgr->is_smcd)
                rc = smcd_tx_sndbuf_nonempty(conn);
        else
···
        if (!rc) {
                /* trigger socket release if connection is closing */
-               struct smc_sock *smc = container_of(conn, struct smc_sock,
-                                                   conn);
                smc_close_wake_tx_prepared(smc);
        }
+
+out:
+       return rc;
+}
+
+int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
+{
+       int rc;
+
+       /* This make sure only one can send simultaneously to prevent wasting
+        * of CPU and CDC slot.
+        * Record whether someone has tried to push while we are pushing.
+        */
+       if (atomic_inc_return(&conn->tx_pushing) > 1)
+               return 0;
+
+again:
+       atomic_set(&conn->tx_pushing, 1);
+       smp_wmb(); /* Make sure tx_pushing is 1 before real send */
+       rc = __smc_tx_sndbuf_nonempty(conn);
+
+       /* We need to check whether someone else have added some data into
+        * the send queue and tried to push but failed after the atomic_set()
+        * when we are pushing.
+        * If so, we need to push again to prevent those data hang in the send
+        * queue.
+        */
+       if (unlikely(!atomic_dec_and_test(&conn->tx_pushing)))
+               goto again;
+
        return rc;
 }