Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'rds-enable-mprds'

Sowmini Varadhan says:

====================
RDS: TCP: Enable mprds for rds-tcp

The third, and final, installment of the mprds-tcp changes.

In Patch 3 of this set, if the transport has set t_mp_capable,
we hash outgoing traffic across multiple paths. Additionally, even if
the transport is MP capable, we may be peering with some node that does
not support mprds, or supports a different number of paths. This
necessitates RDS control plane changes so that both peers agree
on the number of paths to be used for the rds-tcp connection.
Patch 3 implements all these changes, which are documented in patch 5
of the series.

Patch 1 of this series is a bug fix for a race condition
that has always existed, but is now more easily encountered with mprds.
Patch 2 is code refactoring. Patches 4 and 5 are Documentation updates.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+342 -49
+71 -1
Documentation/networking/rds.txt
···
  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
-       transport.
+       transport, if one has not already been selected via the
+       SO_RDS_TRANSPORT socket option.

  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
···
        operation. In this case, it would use RDS_CANCEL_SENT_TO to
        nuke any pending messages.

+  setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
+  getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
+       Set or read an integer defining the underlying
+       encapsulating transport to be used for RDS packets on the
+       socket. When setting the option, the integer argument may be
+       one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
+       value, RDS_TRANS_NONE will be returned on an unbound socket.
+       This socket option may only be set exactly once on the socket,
+       prior to binding it via the bind(2) system call. Attempts to
+       set SO_RDS_TRANSPORT on a socket for which the transport has
+       been previously attached explicitly (by SO_RDS_TRANSPORT) or
+       implicitly (via bind(2)) will return an error of EOPNOTSUPP.
+       An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
+       always return EINVAL.

 RDMA for RDS
 ============
···
        handle CMSGs
        return to application

+Multipath RDS (mprds)
+=====================
+  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
+  (though the concept can be extended to other transports). The classical
+  implementation of RDS-over-TCP is implemented by demultiplexing multiple
+  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
+  port]) over a single TCP socket between the 2 IP addresses involved. This
+  has the limitation that it ends up funneling multiple RDS flows over a
+  single TCP flow, thus it is
+  (a) upper-bounded to the single-flow bandwidth,
+  (b) suffers from head-of-line blocking for all the RDS sockets.
+
+  Better throughput (for a fixed small packet size, MTU) can be achieved
+  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
+  RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
+  connection. RDS sockets will be attached to a path based on some hash
+  (e.g., of local address and RDS port number) and packets for that RDS
+  socket will be sent over the attached path using TCP to segment/reassemble
+  RDS datagrams on that path.
+
+  Multipathed RDS is implemented by splitting the struct rds_connection into
+  a common (to all paths) part, and a per-path struct rds_conn_path. All
+  I/O workqs and reconnect threads are driven from the rds_conn_path.
+  Transports such as TCP that are multipath capable may then set up a
+  TCP socket per rds_conn_path, and this is managed by the transport via
+  the transport-private cp_transport_data pointer.
+
+  Transports announce themselves as multipath capable by setting the
+  t_mp_capable bit during registration with the rds core module. When the
+  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
+  across multiple paths. The outgoing hash is computed based on the
+  local address and port that the PF_RDS socket is bound to.
+
+  Additionally, even if the transport is MP capable, we may be
+  peering with some node that does not support mprds, or supports
+  a different number of paths. As a result, the peering nodes need
+  to agree on the number of paths to be used for the connection.
+  This is done by sending out a control packet exchange before the
+  first data packet. The control packet exchange must have completed
+  prior to outgoing hash completion in rds_sendmsg() when the transport
+  is multipath capable.
+
+  The control packet is an RDS ping packet (i.e., a packet to rds dest
+  port 0) with the ping packet having an rds extension header option of
+  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
+  number of paths supported by the sender. The "probe" ping packet will
+  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
+  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
+  be able to compute the min(sender_paths, rcvr_paths). The pong
+  sent in response to a probe-ping should contain the rcvr's npaths
+  when the rcvr is mprds-capable.
+
+  If the rcvr is not mprds-capable, the exthdr in the ping will be
+  ignored. In this case the pong will not have any exthdrs, so the sender
+  of the probe-ping can default to single-path mprds.
+6
net/rds/bind.c
···
        if (*port != 0) {
                rover = be16_to_cpu(*port);
+               if (rover == RDS_FLAG_PROBE_PORT)
+                       return -EINVAL;
                last = rover;
        } else {
                rover = max_t(u16, prandom_u32(), 2);
···
                if (rover == 0)
                        rover++;

+               if (rover == RDS_FLAG_PROBE_PORT)
+                       continue;
                key = ((u64)addr << 32) | cpu_to_be16(rover);
                if (rhashtable_lookup_fast(&bind_hash_table, &key, ht_parms))
                        continue;

                rs->rs_bound_key = key;
                rs->rs_bound_addr = addr;
+               net_get_random_once(&rs->rs_hash_initval,
+                                   sizeof(rs->rs_hash_initval));
                rs->rs_bound_port = cpu_to_be16(rover);
                rs->rs_bound_node.next = NULL;
                rds_sock_addref(rs);
+8 -9
net/rds/connection.c
···
        struct hlist_head *head = rds_conn_bucket(laddr, faddr);
        struct rds_transport *loop_trans;
        unsigned long flags;
-       int ret;
+       int ret, i;

        rcu_read_lock();
        conn = rds_conn_lookup(net, head, laddr, faddr, trans);
···

        conn->c_trans = trans;

+       init_waitqueue_head(&conn->c_hs_waitq);
+       for (i = 0; i < RDS_MPATH_WORKERS; i++) {
+               __rds_conn_path_init(conn, &conn->c_path[i],
+                                    is_outgoing);
+               conn->c_path[i].cp_index = i;
+       }
        ret = trans->conn_alloc(conn, gfp);
        if (ret) {
                kmem_cache_free(rds_conn_slab, conn);
···
                kmem_cache_free(rds_conn_slab, conn);
                conn = found;
        } else {
-               int i;
-
-               for (i = 0; i < RDS_MPATH_WORKERS; i++) {
-                       __rds_conn_path_init(conn, &conn->c_path[i],
-                                            is_outgoing);
-                       conn->c_path[i].cp_index = i;
-               }
-
                hlist_add_head_rcu(&conn->c_hash_node, head);
                rds_cong_add_conn(conn);
                rds_conn_count++;
···

 void rds_conn_drop(struct rds_connection *conn)
 {
+       WARN_ON(conn->c_trans->t_mp_capable);
        rds_conn_path_drop(&conn->c_path[0]);
 }
 EXPORT_SYMBOL_GPL(rds_conn_drop);
+1
net/rds/message.c
···
        [RDS_EXTHDR_VERSION] = sizeof(struct rds_ext_header_version),
        [RDS_EXTHDR_RDMA] = sizeof(struct rds_ext_header_rdma),
        [RDS_EXTHDR_RDMA_DEST] = sizeof(struct rds_ext_header_rdma_dest),
+       [RDS_EXTHDR_NPATHS] = sizeof(u16),
 };
+23 -2
net/rds/rds.h
···
 #define RDS_RECV_REFILL        3

 /* Max number of multipaths per RDS connection. Must be a power of 2 */
-#define RDS_MPATH_WORKERS      1
+#define RDS_MPATH_WORKERS      8
+#define RDS_MPATH_HASH(rs, n) (jhash_1word((rs)->rs_bound_port, \
+                              (rs)->rs_hash_initval) & ((n) - 1))

 /* Per mpath connection state */
 struct rds_conn_path {
···
        __be32                  c_laddr;
        __be32                  c_faddr;
        unsigned int            c_loopback:1,
-                               c_pad_to_32:31;
+                               c_ping_triggered:1,
+                               c_pad_to_32:30;
        int                     c_npaths;
        struct rds_connection   *c_passive;
        struct rds_transport    *c_trans;
···
        unsigned long           c_map_queued;

        struct rds_conn_path    c_path[RDS_MPATH_WORKERS];
+       wait_queue_head_t       c_hs_waitq; /* handshake waitq */
 };

 static inline
···
 #define RDS_FLAG_RETRANSMITTED 0x04
 #define RDS_MAX_ADV_CREDIT     255

+/* RDS_FLAG_PROBE_PORT is the reserved sport used for sending a ping
+ * probe to exchange control information before establishing a connection.
+ * Currently the control information that is exchanged is the number of
+ * supported paths. If the peer is a legacy (older kernel revision) peer,
+ * it would return a pong message without additional control information
+ * that would then alert the sender that the peer was an older rev.
+ */
+#define RDS_FLAG_PROBE_PORT    1
+#define RDS_HS_PROBE(sport, dport) \
+               ((sport == RDS_FLAG_PROBE_PORT && dport == 0) || \
+                (sport == 0 && dport == RDS_FLAG_PROBE_PORT))
 /*
  * Maximum space available for extension headers.
  */
···
        __be32  h_rdma_rkey;
        __be32  h_rdma_offset;
 };
+
+/* Extension header announcing number of paths.
+ * Implicit length = 2 bytes.
+ */
+#define RDS_EXTHDR_NPATHS      4

 #define __RDS_EXTHDR_MAX       16 /* for now */
···
        /* Socket options - in case there will be more */
        unsigned char           rs_recverr,
                                rs_cong_monitor;
+       u32                     rs_hash_initval;
 };

 static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)
+75
net/rds/recv.c
···
        }
 }

+static void rds_recv_hs_exthdrs(struct rds_header *hdr,
+                               struct rds_connection *conn)
+{
+       unsigned int pos = 0, type, len;
+       union {
+               struct rds_ext_header_version version;
+               u16 rds_npaths;
+       } buffer;
+
+       while (1) {
+               len = sizeof(buffer);
+               type = rds_message_next_extension(hdr, &pos, &buffer, &len);
+               if (type == RDS_EXTHDR_NONE)
+                       break;
+               /* Process extension header here */
+               switch (type) {
+               case RDS_EXTHDR_NPATHS:
+                       conn->c_npaths = min_t(int, RDS_MPATH_WORKERS,
+                                              buffer.rds_npaths);
+                       break;
+               default:
+                       pr_warn_ratelimited("ignoring unknown exthdr type "
+                                           "0x%x\n", type);
+               }
+       }
+       /* if RDS_EXTHDR_NPATHS was not found, default to a single-path */
+       conn->c_npaths = max_t(int, conn->c_npaths, 1);
+}
+
+/* rds_start_mprds() will synchronously start multiple paths when appropriate.
+ * The scheme is based on the following rules:
+ *
+ * 1. rds_sendmsg on first connect attempt sends the probe ping, with the
+ *    sender's npaths (s_npaths)
+ * 2. rcvr of probe-ping knows the mprds_paths = min(s_npaths, r_npaths). It
+ *    sends back a probe-pong with r_npaths. After that, if rcvr is the
+ *    smaller ip addr, it starts rds_conn_path_connect_if_down on all
+ *    mprds_paths.
+ * 3. sender gets woken up, and can move to rds_conn_path_connect_if_down.
+ *    If it is the smaller ipaddr, rds_conn_path_connect_if_down can be
+ *    called after reception of the probe-pong on all mprds_paths.
+ *    Otherwise (sender of probe-ping is not the smaller ip addr): just call
+ *    rds_conn_path_connect_if_down on the hashed path. (see rule 4)
+ * 4. when cp_index > 0, rds_connect_worker must only trigger
+ *    a connection if laddr < faddr.
+ * 5. sender may end up queuing the packet on the cp; it will get sent out
+ *    later, when the connection is completed.
+ */
+static void rds_start_mprds(struct rds_connection *conn)
+{
+       int i;
+       struct rds_conn_path *cp;
+
+       if (conn->c_npaths > 1 && conn->c_laddr < conn->c_faddr) {
+               for (i = 1; i < conn->c_npaths; i++) {
+                       cp = &conn->c_path[i];
+                       rds_conn_path_connect_if_down(cp);
+               }
+       }
+}
+
 /*
  * The transport must make sure that this is serialized against other
  * rx and conn reset on this specific conn.
···
                }
                rds_stats_inc(s_recv_ping);
                rds_send_pong(cp, inc->i_hdr.h_sport);
+               /* if this is a handshake ping, start multipath if necessary */
+               if (RDS_HS_PROBE(inc->i_hdr.h_sport, inc->i_hdr.h_dport)) {
+                       rds_recv_hs_exthdrs(&inc->i_hdr, cp->cp_conn);
+                       rds_start_mprds(cp->cp_conn);
+               }
+               goto out;
+       }
+
+       if (inc->i_hdr.h_dport == RDS_FLAG_PROBE_PORT &&
+           inc->i_hdr.h_sport == 0) {
+               rds_recv_hs_exthdrs(&inc->i_hdr, cp->cp_conn);
+               /* if this is a handshake pong, start multipath if necessary */
+               rds_start_mprds(cp->cp_conn);
+               wake_up(&cp->cp_conn->c_hs_waitq);
                goto out;
        }
+67 -4
net/rds/send.c
···
        return ret;
 }

+static void rds_send_ping(struct rds_connection *conn);
+
+static int rds_send_mprds_hash(struct rds_sock *rs, struct rds_connection *conn)
+{
+       int hash;
+
+       if (conn->c_npaths == 0)
+               hash = RDS_MPATH_HASH(rs, RDS_MPATH_WORKERS);
+       else
+               hash = RDS_MPATH_HASH(rs, conn->c_npaths);
+       if (conn->c_npaths == 0 && hash != 0) {
+               rds_send_ping(conn);
+
+               if (conn->c_npaths == 0) {
+                       wait_event_interruptible(conn->c_hs_waitq,
+                                                (conn->c_npaths != 0));
+               }
+               if (conn->c_npaths == 1)
+                       hash = 0;
+       }
+       return hash;
+}
+
 int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 {
        struct sock *sk = sock->sk;
···
                goto out;
        }

-       cpath = &conn->c_path[0];
+       if (conn->c_trans->t_mp_capable)
+               cpath = &conn->c_path[rds_send_mprds_hash(rs, conn)];
+       else
+               cpath = &conn->c_path[0];

        rds_conn_path_connect_if_down(cpath);
···
 }

 /*
- * Reply to a ping packet.
+ * Send out a probe. Can be shared by rds_send_ping,
+ * rds_send_pong, rds_send_hb.
+ * rds_send_hb should use h_flags
+ *   RDS_FLAG_HB_PING|RDS_FLAG_ACK_REQUIRED
+ * or
+ *   RDS_FLAG_HB_PONG|RDS_FLAG_ACK_REQUIRED
  */
 int
-rds_send_pong(struct rds_conn_path *cp, __be16 dport)
+rds_send_probe(struct rds_conn_path *cp, __be16 sport,
+              __be16 dport, u8 h_flags)
 {
        struct rds_message *rm;
        unsigned long flags;
···
        rm->m_inc.i_conn = cp->cp_conn;
        rm->m_inc.i_conn_path = cp;

-       rds_message_populate_header(&rm->m_inc.i_hdr, 0, dport,
+       rds_message_populate_header(&rm->m_inc.i_hdr, sport, dport,
                                    cp->cp_next_tx_seq);
+       rm->m_inc.i_hdr.h_flags |= h_flags;
        cp->cp_next_tx_seq++;
+
+       if (RDS_HS_PROBE(sport, dport) && cp->cp_conn->c_trans->t_mp_capable) {
+               u16 npaths = RDS_MPATH_WORKERS;
+
+               rds_message_add_extension(&rm->m_inc.i_hdr,
+                                         RDS_EXTHDR_NPATHS, &npaths,
+                                         sizeof(npaths));
+       }
        spin_unlock_irqrestore(&cp->cp_lock, flags);

        rds_stats_inc(s_send_queued);
···
        if (rm)
                rds_message_put(rm);
        return ret;
+}
+
+int
+rds_send_pong(struct rds_conn_path *cp, __be16 dport)
+{
+       return rds_send_probe(cp, 0, dport, 0);
+}
+
+void
+rds_send_ping(struct rds_connection *conn)
+{
+       unsigned long flags;
+       struct rds_conn_path *cp = &conn->c_path[0];
+
+       spin_lock_irqsave(&cp->cp_lock, flags);
+       if (conn->c_ping_triggered) {
+               spin_unlock_irqrestore(&cp->cp_lock, flags);
+               return;
+       }
+       conn->c_ping_triggered = 1;
+       spin_unlock_irqrestore(&cp->cp_lock, flags);
+       rds_send_probe(&conn->c_path[0], RDS_FLAG_PROBE_PORT, 0, 0);
 }
+12 -19
net/rds/tcp.c
···
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>

-#include "rds_single_path.h"
 #include "rds.h"
 #include "tcp.h"
···
        wait_event(cp->cp_waitq, !test_bit(RDS_IN_XMIT, &cp->cp_flags));
        lock_sock(osock->sk);
        /* reset receive side state for rds_tcp_data_recv() for osock */
+       cancel_delayed_work_sync(&cp->cp_send_w);
+       cancel_delayed_work_sync(&cp->cp_recv_w);
        if (tc->t_tinc) {
                rds_inc_put(&tc->t_tinc->ti_inc);
                tc->t_tinc = NULL;
        }
        tc->t_tinc_hdr_rem = sizeof(struct rds_header);
        tc->t_tinc_data_rem = 0;
-       tc->t_sock = NULL;
-
-       write_lock_bh(&osock->sk->sk_callback_lock);
-
-       osock->sk->sk_user_data = NULL;
-       osock->sk->sk_data_ready = tc->t_orig_data_ready;
-       osock->sk->sk_write_space = tc->t_orig_write_space;
-       osock->sk->sk_state_change = tc->t_orig_state_change;
-       write_unlock_bh(&osock->sk->sk_callback_lock);
+       rds_tcp_restore_callbacks(osock, tc);
        release_sock(osock->sk);
        sock_release(osock);
 newsock:
        rds_send_path_reset(cp);
        lock_sock(sock->sk);
-       write_lock_bh(&sock->sk->sk_callback_lock);
-       tc->t_sock = sock;
-       tc->t_cpath = cp;
-       sock->sk->sk_user_data = cp;
-       sock->sk->sk_data_ready = rds_tcp_data_ready;
-       sock->sk->sk_write_space = rds_tcp_write_space;
-       sock->sk->sk_state_change = rds_tcp_state_change;
-
-       write_unlock_bh(&sock->sk->sk_callback_lock);
+       rds_tcp_set_callbacks(sock, cp);
        release_sock(sock->sk);
 }
···
        .t_name                 = "tcp",
        .t_type                 = RDS_TRANS_TCP,
        .t_prefer_loopback      = 1,
+       .t_mp_capable           = 1,
 };

 static int rds_tcp_netid;
···
                rds_tcp_conn_paths_destroy(tc->t_cpath->cp_conn);
                rds_conn_destroy(tc->t_cpath->cp_conn);
        }
+}
+
+void *rds_tcp_listen_sock_def_readable(struct net *net)
+{
+       struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+
+       return rtn->rds_tcp_listen_sock->sk->sk_user_data;
 }

 static int rds_tcp_dev_event(struct notifier_block *this,
+1
net/rds/tcp.h
···
 void rds_tcp_listen_data_ready(struct sock *sk);
 int rds_tcp_accept_one(struct socket *sock);
 int rds_tcp_keepalive(struct socket *sock);
+void *rds_tcp_listen_sock_def_readable(struct net *net);

 /* tcp_recv.c */
 int rds_tcp_recv_init(void);
+6 -1
net/rds/tcp_connect.c
···
 #include <linux/in.h>
 #include <net/tcp.h>

-#include "rds_single_path.h"
 #include "rds.h"
 #include "tcp.h"
···
        int ret;
        struct rds_connection *conn = cp->cp_conn;
        struct rds_tcp_connection *tc = cp->cp_transport_data;
+
+       /* for multipath rds, we only trigger the connection after
+        * the handshake probe has determined the number of paths.
+        */
+       if (cp->cp_index > 0 && cp->cp_conn->c_npaths < 2)
+               return -EAGAIN;

        mutex_lock(&tc->t_conn_path_lock);
+57 -8
net/rds/tcp_listen.c
···
 #include <linux/in.h>
 #include <net/tcp.h>

-#include "rds_single_path.h"
 #include "rds.h"
 #include "tcp.h"
···
                               (char *)&keepidle, sizeof(keepidle));
 bail:
        return ret;
+}
+
+/* rds_tcp_accept_one_path(): if accepting on cp_index > 0, make sure the
+ * client's ipaddr < server's ipaddr. Otherwise, close the accepted
+ * socket and force a reconnect from smaller -> larger ip addr. The reason
+ * we special-case cp_index 0 is to allow the rds probe ping itself to
+ * itself get through efficiently.
+ * Since reconnects are only initiated from the node with the numerically
+ * smaller ip address, we recycle conns in RDS_CONN_ERROR on the passive side
+ * by moving them to CONNECTING in this function.
+ */
+struct rds_tcp_connection *rds_tcp_accept_one_path(struct rds_connection *conn)
+{
+       int i;
+       bool peer_is_smaller = (conn->c_faddr < conn->c_laddr);
+       int npaths = conn->c_npaths;
+
+       if (npaths <= 1) {
+               struct rds_conn_path *cp = &conn->c_path[0];
+               int ret;
+
+               ret = rds_conn_path_transition(cp, RDS_CONN_DOWN,
+                                              RDS_CONN_CONNECTING);
+               if (!ret)
+                       rds_conn_path_transition(cp, RDS_CONN_ERROR,
+                                                RDS_CONN_CONNECTING);
+               return cp->cp_transport_data;
+       }
+
+       /* for mprds, paths with cp_index > 0 MUST be initiated by the peer
+        * with the smaller address.
+        */
+       if (!peer_is_smaller)
+               return NULL;
+
+       for (i = 1; i < npaths; i++) {
+               struct rds_conn_path *cp = &conn->c_path[i];
+
+               if (rds_conn_path_transition(cp, RDS_CONN_DOWN,
+                                            RDS_CONN_CONNECTING) ||
+                   rds_conn_path_transition(cp, RDS_CONN_ERROR,
+                                            RDS_CONN_CONNECTING)) {
+                       return cp->cp_transport_data;
+               }
+       }
+       return NULL;
 }

 int rds_tcp_accept_one(struct socket *sock)
···
         * If the client reboots, this conn will need to be cleaned up.
         * rds_tcp_state_change() will do that cleanup
         */
-       rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
-       cp = &conn->c_path[0];
-       rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING);
+       rs_tcp = rds_tcp_accept_one_path(conn);
+       if (!rs_tcp)
+               goto rst_nsk;
        mutex_lock(&rs_tcp->t_conn_path_lock);
-       conn_state = rds_conn_state(conn);
-       if (conn_state != RDS_CONN_CONNECTING && conn_state != RDS_CONN_UP)
+       cp = rs_tcp->t_cpath;
+       conn_state = rds_conn_path_state(cp);
+       if (conn_state != RDS_CONN_CONNECTING && conn_state != RDS_CONN_UP &&
+           conn_state != RDS_CONN_ERROR)
                goto rst_nsk;
        if (rs_tcp->t_sock) {
                /* Need to resolve a duelling SYN between peers.
···
                 * c_transport_data.
                 */
                if (ntohl(inet->inet_saddr) < ntohl(inet->inet_daddr) ||
-                   !conn->c_path[0].cp_outgoing) {
+                   !cp->cp_outgoing) {
                        goto rst_nsk;
                } else {
                        rds_tcp_reset_callbacks(new_sock, cp);
-                       conn->c_path[0].cp_outgoing = 0;
+                       cp->cp_outgoing = 0;
                        /* rds_connect_path_complete() marks RDS_CONN_UP */
                        rds_connect_path_complete(cp, RDS_CONN_RESETTING);
                }
···
         */
        if (sk->sk_state == TCP_LISTEN)
                rds_tcp_accept_work(sk);
+       else
+               ready = rds_tcp_listen_sock_def_readable(sock_net(sk));

 out:
        read_unlock_bh(&sk->sk_callback_lock);
+13 -5
net/rds/tcp_send.c
···
 int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
                 unsigned int hdr_off, unsigned int sg, unsigned int off)
 {
-       struct rds_tcp_connection *tc = conn->c_transport_data;
+       struct rds_conn_path *cp = rm->m_inc.i_conn_path;
+       struct rds_tcp_connection *tc = cp->cp_transport_data;
        int done = 0;
        int ret = 0;
        int more;
···
                        rds_tcp_stats_inc(s_tcp_sndbuf_full);
                        ret = 0;
                } else {
-                       printk(KERN_WARNING "RDS/tcp: send to %pI4 "
-                              "returned %d, disconnecting and reconnecting\n",
-                              &conn->c_faddr, ret);
-                       rds_conn_drop(conn);
+                       /* No need to disconnect/reconnect if path_drop
+                        * has already been triggered, because, e.g., of
+                        * an incoming RST.
+                        */
+                       if (rds_conn_path_up(cp)) {
+                               pr_warn("RDS/tcp: send to %pI4 on cp [%d] "
+                                       "returned %d, "
+                                       "disconnecting and reconnecting\n",
+                                       &conn->c_faddr, cp->cp_index, ret);
+                               rds_conn_path_drop(cp);
+                       }
                }
        }
        if (done == 0)
+2
net/rds/threads.c
···
        struct rds_connection *conn = cp->cp_conn;
        int ret;

+       if (cp->cp_index > 1 && cp->cp_conn->c_laddr > cp->cp_conn->c_faddr)
+               return;
        clear_bit(RDS_RECONNECT_PENDING, &cp->cp_flags);
        ret = rds_conn_path_transition(cp, RDS_CONN_DOWN, RDS_CONN_CONNECTING);
        if (ret) {