Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'tcp-plb'

Mubashir Adnan Qureshi says:

====================
net: Add PLB functionality to TCP

This patch series adds PLB (Protective Load Balancing) to TCP and hooks
it up to DCTCP. PLB is disabled by default and can be enabled via the
relevant sysctls, provided the underlying congestion control (CC)
module supports it.

PLB (Protective Load Balancing) is a host-based mechanism for load
balancing across switch links. It leverages congestion signals (e.g. ECN)
from the transport layer to randomly change the path of a connection
experiencing congestion. PLB changes the path of a connection by
changing the outgoing IPv6 flow label (implemented in Linux by calling
sk_rethink_txhash()). Because of this implementation mechanism, PLB
currently works only for IPv6 traffic. For more information, see the
SIGCOMM 2022 paper:
https://doi.org/10.1145/3544216.3544226
====================
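The per-round decision described above (count consecutive congested RTTs, rehash after N of them, or after M when the connection is idle) can be sketched as a standalone userspace model. This is an illustration, not the kernel code; `plb_round` and the constants below are hypothetical names, with values matching the series' defaults:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PLB_SCALE 8                /* fractions mapped to 0..256 */
#define PLB_CONG_THRESH 128        /* K: round congested if >= 50% marked */
#define PLB_REHASH_ROUNDS 12       /* N: forced rehash threshold */
#define PLB_IDLE_REHASH_ROUNDS 3   /* M: idle rehash threshold */

struct plb_state {
	uint8_t consec_cong_rounds;
};

/* Run once per RTT; returns true when the flow should pick a new path. */
static bool plb_round(struct plb_state *plb, uint32_t delivered,
		      uint32_t delivered_ce, bool idle)
{
	uint32_t ce_ratio =
		delivered ? (delivered_ce << PLB_SCALE) / delivered : 0;

	if (ce_ratio < PLB_CONG_THRESH)
		plb->consec_cong_rounds = 0;	/* round not congested: reset */
	else if (plb->consec_cong_rounds < PLB_REHASH_ROUNDS)
		plb->consec_cong_rounds++;

	if (plb->consec_cong_rounds >= PLB_REHASH_ROUNDS ||
	    (idle && plb->consec_cong_rounds >= PLB_IDLE_REHASH_ROUNDS)) {
		plb->consec_cong_rounds = 0;	/* rehash and start over */
		return true;
	}
	return false;
}
```

With 60% of packets marked every RTT, the model rehashes exactly once per 12 rounds; a single uncongested round resets the count.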

Signed-off-by: David S. Miller <davem@davemloft.net>

+305 -2
+75
Documentation/networking/ip-sysctl.rst
··· 1069 1069 1070 1070 Default: 0 1071 1071 1072 + tcp_plb_enabled - BOOLEAN 1073 + If set and the underlying congestion control (e.g. DCTCP) supports 1074 + and enables PLB feature, TCP PLB (Protective Load Balancing) is 1075 + enabled. PLB is described in the following paper: 1076 + https://doi.org/10.1145/3544216.3544226. Based on PLB parameters, 1077 + upon sensing sustained congestion, TCP triggers a change in 1078 + flow label field for outgoing IPv6 packets. A change in flow label 1079 + field potentially changes the path of outgoing packets for switches 1080 + that use ECMP/WCMP for routing. 1081 + 1082 + PLB changes socket txhash which results in a change in IPv6 Flow Label 1083 + field, and currently no-op for IPv4 headers. It is possible 1084 + to apply PLB for IPv4 with other network header fields (e.g. TCP 1085 + or IPv4 options) or using encapsulation where outer header is used 1086 + by switches to determine next hop. In either case, further host 1087 + and switch side changes will be needed. 1088 + 1089 + When set, PLB assumes that congestion signal (e.g. ECN) is made 1090 + available and used by congestion control module to estimate a 1091 + congestion measure (e.g. ce_ratio). PLB needs a congestion measure to 1092 + make repathing decisions. 1093 + 1094 + Default: FALSE 1095 + 1096 + tcp_plb_idle_rehash_rounds - INTEGER 1097 + Number of consecutive congested rounds (RTT) seen after which 1098 + a rehash can be performed, given there are no packets in flight. 1099 + This is referred to as M in PLB paper: 1100 + https://doi.org/10.1145/3544216.3544226. 1101 + 1102 + Possible Values: 0 - 31 1103 + 1104 + Default: 3 1105 + 1106 + tcp_plb_rehash_rounds - INTEGER 1107 + Number of consecutive congested rounds (RTT) seen after which 1108 + a forced rehash can be performed. Be careful when setting this 1109 + parameter, as a small value increases the risk of retransmissions. 
1110 + This is referred to as N in PLB paper: 1111 + https://doi.org/10.1145/3544216.3544226. 1112 + 1113 + Possible Values: 0 - 31 1114 + 1115 + Default: 12 1116 + 1117 + tcp_plb_suspend_rto_sec - INTEGER 1118 + Time, in seconds, to suspend PLB in event of an RTO. In order to avoid 1119 + having PLB repath onto a connectivity "black hole", after an RTO a TCP 1120 + connection suspends PLB repathing for a random duration between 1x and 1121 + 2x of this parameter. Randomness is added to avoid concurrent rehashing 1122 + of multiple TCP connections. This should be set corresponding to the 1123 + amount of time it takes to repair a failed link. 1124 + 1125 + Possible Values: 0 - 255 1126 + 1127 + Default: 60 1128 + 1129 + tcp_plb_cong_thresh - INTEGER 1130 + Fraction of packets marked with congestion over a round (RTT) to 1131 + tag that round as congested. This is referred to as K in the PLB paper: 1132 + https://doi.org/10.1145/3544216.3544226. 1133 + 1134 + The 0-1 fraction range is mapped to 0-256 range to avoid floating 1135 + point operations. For example, 128 means that if at least 50% of 1136 + the packets in a round were marked as congested then the round 1137 + will be tagged as congested. 1138 + 1139 + Setting threshold to 0 means that PLB repaths every RTT regardless 1140 + of congestion. This is not intended behavior for PLB and should be 1141 + used only for experimentation purpose. 1142 + 1143 + Possible Values: 0 - 256 1144 + 1145 + Default: 128 1146 + 1072 1147 UDP variables 1073 1148 ============= 1074 1149
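Taken together, the knobs documented above could be enabled at runtime roughly as follows (a config sketch: the values shown are the documented defaults, and these sysctls exist only on kernels carrying this series):

```shell
# Enable PLB; takes effect only if the CC module (e.g. DCTCP) supports it.
sysctl -w net.ipv4.tcp_plb_enabled=1

# M = 3: allow a rehash after 3 congested rounds once nothing is in flight.
sysctl -w net.ipv4.tcp_plb_idle_rehash_rounds=3

# N = 12: force a rehash after 12 consecutive congested rounds.
sysctl -w net.ipv4.tcp_plb_rehash_rounds=12

# Suspend PLB for a random 1x-2x of 60 s after an RTO.
sysctl -w net.ipv4.tcp_plb_suspend_rto_sec=60

# K = 128/256: tag a round as congested if >= 50% of its packets were marked.
sysctl -w net.ipv4.tcp_plb_cong_thresh=128
```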
+1
include/linux/tcp.h
··· 423 423 u32 probe_seq_start; 424 424 u32 probe_seq_end; 425 425 } mtu_probe; 426 + u32 plb_rehash; /* PLB-triggered rehash attempts */ 426 427 u32 mtu_info; /* We received an ICMP_FRAG_NEEDED / ICMPV6_PKT_TOOBIG 427 428 * while socket was owned by user. 428 429 */
+5
include/net/netns/ipv4.h
··· 183 183 unsigned long tfo_active_disable_stamp; 184 184 u32 tcp_challenge_timestamp; 185 185 u32 tcp_challenge_count; 186 + u8 sysctl_tcp_plb_enabled; 187 + u8 sysctl_tcp_plb_idle_rehash_rounds; 188 + u8 sysctl_tcp_plb_rehash_rounds; 189 + u8 sysctl_tcp_plb_suspend_rto_sec; 190 + int sysctl_tcp_plb_cong_thresh; 186 191 187 192 int sysctl_udp_wmem_min; 188 193 int sysctl_udp_rmem_min;
+28
include/net/tcp.h
··· 2140 2140 extern void tcp_rack_reo_timeout(struct sock *sk); 2141 2141 extern void tcp_rack_update_reo_wnd(struct sock *sk, struct rate_sample *rs); 2142 2142 2143 + /* tcp_plb.c */ 2144 + 2145 + /* 2146 + * Scaling factor for fractions in PLB. For example, tcp_plb_update_state 2147 + * expects cong_ratio which represents fraction of traffic that experienced 2148 + * congestion over a single RTT. In order to avoid floating point operations, 2149 + * this fraction should be mapped to (1 << TCP_PLB_SCALE) and passed in. 2150 + */ 2151 + #define TCP_PLB_SCALE 8 2152 + 2153 + /* State for PLB (Protective Load Balancing) for a single TCP connection. */ 2154 + struct tcp_plb_state { 2155 + u8 consec_cong_rounds:5, /* consecutive congested rounds */ 2156 + unused:3; 2157 + u32 pause_until; /* jiffies32 when PLB can resume rerouting */ 2158 + }; 2159 + 2160 + static inline void tcp_plb_init(const struct sock *sk, 2161 + struct tcp_plb_state *plb) 2162 + { 2163 + plb->consec_cong_rounds = 0; 2164 + plb->pause_until = 0; 2165 + } 2166 + void tcp_plb_update_state(const struct sock *sk, struct tcp_plb_state *plb, 2167 + const int cong_ratio); 2168 + void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb); 2169 + void tcp_plb_update_state_upon_rto(struct sock *sk, struct tcp_plb_state *plb); 2170 + 2143 2171 /* At how many usecs into the future should the RTO fire? */ 2144 2172 static inline s64 tcp_rto_delta_us(const struct sock *sk) 2145 2173 {
+1
include/uapi/linux/snmp.h
··· 292 292 LINUX_MIB_TCPDSACKIGNOREDDUBIOUS, /* TCPDSACKIgnoredDubious */ 293 293 LINUX_MIB_TCPMIGRATEREQSUCCESS, /* TCPMigrateReqSuccess */ 294 294 LINUX_MIB_TCPMIGRATEREQFAILURE, /* TCPMigrateReqFailure */ 295 + LINUX_MIB_TCPPLBREHASH, /* TCPPLBRehash */ 295 296 __LINUX_MIB_MAX 296 297 }; 297 298
+6
include/uapi/linux/tcp.h
··· 284 284 __u32 tcpi_snd_wnd; /* peer's advertised receive window after 285 285 * scaling (bytes) 286 286 */ 287 + __u32 tcpi_rcv_wnd; /* local advertised receive window after 288 + * scaling (bytes) 289 + */ 290 + 291 + __u32 tcpi_rehash; /* PLB or timeout triggered rehash attempts */ 287 292 }; 288 293 289 294 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */ ··· 320 315 TCP_NLA_BYTES_NOTSENT, /* Bytes in write queue not yet sent */ 321 316 TCP_NLA_EDT, /* Earliest departure time (CLOCK_MONOTONIC) */ 322 317 TCP_NLA_TTL, /* TTL or hop limit of a packet received */ 318 + TCP_NLA_REHASH, /* PLB and timeout triggered rehash attempts */ 323 319 }; 324 320 325 321 /* for TCP_MD5SIG socket option */
+1 -1
net/ipv4/Makefile
··· 10 10 tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \ 11 11 tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \ 12 12 tcp_rate.o tcp_recovery.o tcp_ulp.o \ 13 - tcp_offload.o datagram.o raw.o udp.o udplite.o \ 13 + tcp_offload.o tcp_plb.o datagram.o raw.o udp.o udplite.o \ 14 14 udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \ 15 15 fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \ 16 16 inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+1
net/ipv4/proc.c
··· 297 297 SNMP_MIB_ITEM("TCPDSACKIgnoredDubious", LINUX_MIB_TCPDSACKIGNOREDDUBIOUS), 298 298 SNMP_MIB_ITEM("TCPMigrateReqSuccess", LINUX_MIB_TCPMIGRATEREQSUCCESS), 299 299 SNMP_MIB_ITEM("TCPMigrateReqFailure", LINUX_MIB_TCPMIGRATEREQFAILURE), 300 + SNMP_MIB_ITEM("TCPPLBRehash", LINUX_MIB_TCPPLBREHASH), 300 301 SNMP_MIB_SENTINEL 301 302 }; 302 303
+43
net/ipv4/sysctl_net_ipv4.c
··· 40 40 static u32 fib_multipath_hash_fields_all_mask __maybe_unused = 41 41 FIB_MULTIPATH_HASH_FIELD_ALL_MASK; 42 42 static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024; 43 + static int tcp_plb_max_rounds = 31; 44 + static int tcp_plb_max_cong_thresh = 256; 43 45 44 46 /* obsolete */ 45 47 static int sysctl_tcp_low_latency __read_mostly; ··· 1385 1383 .proc_handler = proc_dou8vec_minmax, 1386 1384 .extra1 = SYSCTL_ZERO, 1387 1385 .extra2 = SYSCTL_TWO, 1386 + }, 1387 + { 1388 + .procname = "tcp_plb_enabled", 1389 + .data = &init_net.ipv4.sysctl_tcp_plb_enabled, 1390 + .maxlen = sizeof(u8), 1391 + .mode = 0644, 1392 + .proc_handler = proc_dou8vec_minmax, 1393 + .extra1 = SYSCTL_ZERO, 1394 + .extra2 = SYSCTL_ONE, 1395 + }, 1396 + { 1397 + .procname = "tcp_plb_idle_rehash_rounds", 1398 + .data = &init_net.ipv4.sysctl_tcp_plb_idle_rehash_rounds, 1399 + .maxlen = sizeof(u8), 1400 + .mode = 0644, 1401 + .proc_handler = proc_dou8vec_minmax, 1402 + .extra2 = &tcp_plb_max_rounds, 1403 + }, 1404 + { 1405 + .procname = "tcp_plb_rehash_rounds", 1406 + .data = &init_net.ipv4.sysctl_tcp_plb_rehash_rounds, 1407 + .maxlen = sizeof(u8), 1408 + .mode = 0644, 1409 + .proc_handler = proc_dou8vec_minmax, 1410 + .extra2 = &tcp_plb_max_rounds, 1411 + }, 1412 + { 1413 + .procname = "tcp_plb_suspend_rto_sec", 1414 + .data = &init_net.ipv4.sysctl_tcp_plb_suspend_rto_sec, 1415 + .maxlen = sizeof(u8), 1416 + .mode = 0644, 1417 + .proc_handler = proc_dou8vec_minmax, 1418 + }, 1419 + { 1420 + .procname = "tcp_plb_cong_thresh", 1421 + .data = &init_net.ipv4.sysctl_tcp_plb_cong_thresh, 1422 + .maxlen = sizeof(int), 1423 + .mode = 0644, 1424 + .proc_handler = proc_dointvec_minmax, 1425 + .extra1 = SYSCTL_ZERO, 1426 + .extra2 = &tcp_plb_max_cong_thresh, 1388 1427 }, 1389 1428 { } 1390 1429 };
+5
net/ipv4/tcp.c
··· 3176 3176 tp->sacked_out = 0; 3177 3177 tp->tlp_high_seq = 0; 3178 3178 tp->last_oow_ack_time = 0; 3179 + tp->plb_rehash = 0; 3179 3180 /* There's a bubble in the pipe until at least the first ACK. */ 3180 3181 tp->app_limited = ~0U; 3181 3182 tp->rack.mstamp = 0; ··· 3940 3939 info->tcpi_reord_seen = tp->reord_seen; 3941 3940 info->tcpi_rcv_ooopack = tp->rcv_ooopack; 3942 3941 info->tcpi_snd_wnd = tp->snd_wnd; 3942 + info->tcpi_rcv_wnd = tp->rcv_wnd; 3943 + info->tcpi_rehash = tp->plb_rehash + tp->timeout_rehash; 3943 3944 info->tcpi_fastopen_client_fail = tp->fastopen_client_fail; 3944 3945 unlock_sock_fast(sk, slow); 3945 3946 } ··· 3976 3973 nla_total_size(sizeof(u32)) + /* TCP_NLA_BYTES_NOTSENT */ 3977 3974 nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_EDT */ 3978 3975 nla_total_size(sizeof(u8)) + /* TCP_NLA_TTL */ 3976 + nla_total_size(sizeof(u32)) + /* TCP_NLA_REHASH */ 3979 3977 0; 3980 3978 } 3981 3979 ··· 4053 4049 nla_put_u8(stats, TCP_NLA_TTL, 4054 4050 tcp_skb_ttl_or_hop_limit(ack_skb)); 4055 4051 4052 + nla_put_u32(stats, TCP_NLA_REHASH, tp->plb_rehash + tp->timeout_rehash); 4056 4053 return stats; 4057 4054 } 4058 4055
+22 -1
net/ipv4/tcp_dctcp.c
··· 54 54 u32 next_seq; 55 55 u32 ce_state; 56 56 u32 loss_cwnd; 57 + struct tcp_plb_state plb; 57 58 }; 58 59 59 60 static unsigned int dctcp_shift_g __read_mostly = 4; /* g = 1/2^4 */ ··· 92 91 ca->ce_state = 0; 93 92 94 93 dctcp_reset(tp, ca); 94 + tcp_plb_init(sk, &ca->plb); 95 + 95 96 return; 96 97 } 97 98 ··· 120 117 121 118 /* Expired RTT */ 122 119 if (!before(tp->snd_una, ca->next_seq)) { 120 + u32 delivered = tp->delivered - ca->old_delivered; 123 121 u32 delivered_ce = tp->delivered_ce - ca->old_delivered_ce; 124 122 u32 alpha = ca->dctcp_alpha; 123 + u32 ce_ratio = 0; 124 + 125 + if (delivered > 0) { 126 + /* dctcp_alpha keeps EWMA of fraction of ECN marked 127 + * packets. Because of EWMA smoothing, PLB reaction can 128 + * be slow so we use ce_ratio which is an instantaneous 129 + * measure of congestion. ce_ratio is the fraction of 130 + * ECN marked packets in the previous RTT. 131 + */ 132 + if (delivered_ce > 0) 133 + ce_ratio = (delivered_ce << TCP_PLB_SCALE) / delivered; 134 + tcp_plb_update_state(sk, &ca->plb, (int)ce_ratio); 135 + tcp_plb_check_rehash(sk, &ca->plb); 136 + } 125 137 126 138 /* alpha = (1 - g) * alpha + g * F */ 127 139 128 140 alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g); 129 141 if (delivered_ce) { 130 - u32 delivered = tp->delivered - ca->old_delivered; 131 142 132 143 /* If dctcp_shift_g == 1, a 32bit value would overflow 133 144 * after 8 M packets. ··· 189 172 dctcp_ece_ack_update(sk, ev, &ca->prior_rcv_nxt, &ca->ce_state); 190 173 break; 191 174 case CA_EVENT_LOSS: 175 + tcp_plb_update_state_upon_rto(sk, &ca->plb); 192 176 dctcp_react_to_loss(sk); 177 + break; 178 + case CA_EVENT_TX_START: 179 + tcp_plb_check_rehash(sk, &ca->plb); /* Maybe rehash when inflight is 0 */ 193 180 break; 194 181 default: 195 182 /* Don't care for the rest. */
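The comment in the hunk above notes that dctcp_alpha is an EWMA and therefore reacts slowly, which is why PLB is fed the instantaneous ce_ratio instead. The contrast can be shown numerically with a simplified standalone model (illustrative only; the alpha update below approximates the kernel's fixed-point EWMA with g = 1/2^4 and omits the min_not_zero() clamp):

```c
#include <assert.h>
#include <stdint.h>

#define DCTCP_SHIFT_G 4		/* g = 1/2^4, as in dctcp_shift_g */
#define TCP_PLB_SCALE 8

/* EWMA in 0..1024 fixed point: alpha = (1 - g) * alpha + g * F */
static uint32_t dctcp_alpha_step(uint32_t alpha, uint32_t delivered,
				 uint32_t delivered_ce)
{
	alpha -= alpha >> DCTCP_SHIFT_G;
	if (delivered_ce && delivered)
		alpha += (delivered_ce << (10 - DCTCP_SHIFT_G)) / delivered;
	return alpha;
}

/* Instantaneous mark fraction over one RTT, in 0..(1 << TCP_PLB_SCALE). */
static uint32_t plb_ce_ratio(uint32_t delivered, uint32_t delivered_ce)
{
	return delivered ? (delivered_ce << TCP_PLB_SCALE) / delivered : 0;
}
```

After a single fully marked RTT starting from alpha = 0, the EWMA has only moved to 64/1024, while ce_ratio immediately reports 256/256 — the full congestion signal PLB needs for a repathing decision.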
+8
net/ipv4/tcp_ipv4.c
··· 3218 3218 net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 0; 3219 3219 atomic_set(&net->ipv4.tfo_active_disable_times, 0); 3220 3220 3221 + /* Set default values for PLB */ 3222 + net->ipv4.sysctl_tcp_plb_enabled = 0; /* Disabled by default */ 3223 + net->ipv4.sysctl_tcp_plb_idle_rehash_rounds = 3; 3224 + net->ipv4.sysctl_tcp_plb_rehash_rounds = 12; 3225 + net->ipv4.sysctl_tcp_plb_suspend_rto_sec = 60; 3226 + /* Default congestion threshold for PLB to mark a round is 50% */ 3227 + net->ipv4.sysctl_tcp_plb_cong_thresh = (1 << TCP_PLB_SCALE) / 2; 3228 + 3221 3229 /* Reno is always built in */ 3222 3230 if (!net_eq(net, &init_net) && 3223 3231 bpf_try_module_get(init_net.ipv4.tcp_congestion_control,
+109
net/ipv4/tcp_plb.c
··· 1 + /* Protective Load Balancing (PLB) 2 + * 3 + * PLB was designed to reduce link load imbalance across datacenter 4 + * switches. PLB is a host-based optimization; it leverages congestion 5 + * signals from the transport layer to randomly change the path of the 6 + * connection experiencing sustained congestion. PLB prefers to repath 7 + * after idle periods to minimize packet reordering. It repaths by 8 + * changing the IPv6 Flow Label on the packets of a connection, which 9 + * datacenter switches include as part of ECMP/WCMP hashing. 10 + * 11 + * PLB is described in detail in: 12 + * 13 + * Mubashir Adnan Qureshi, Yuchung Cheng, Qianwen Yin, Qiaobin Fu, 14 + * Gautam Kumar, Masoud Moshref, Junhua Yan, Van Jacobson, 15 + * David Wetherall,Abdul Kabbani: 16 + * "PLB: Congestion Signals are Simple and Effective for 17 + * Network Load Balancing" 18 + * In ACM SIGCOMM 2022, Amsterdam Netherlands. 19 + * 20 + */ 21 + 22 + #include <net/tcp.h> 23 + 24 + /* Called once per round-trip to update PLB state for a connection. */ 25 + void tcp_plb_update_state(const struct sock *sk, struct tcp_plb_state *plb, 26 + const int cong_ratio) 27 + { 28 + struct net *net = sock_net(sk); 29 + 30 + if (!READ_ONCE(net->ipv4.sysctl_tcp_plb_enabled)) 31 + return; 32 + 33 + if (cong_ratio >= 0) { 34 + if (cong_ratio < READ_ONCE(net->ipv4.sysctl_tcp_plb_cong_thresh)) 35 + plb->consec_cong_rounds = 0; 36 + else if (plb->consec_cong_rounds < 37 + READ_ONCE(net->ipv4.sysctl_tcp_plb_rehash_rounds)) 38 + plb->consec_cong_rounds++; 39 + } 40 + } 41 + EXPORT_SYMBOL_GPL(tcp_plb_update_state); 42 + 43 + /* Check whether recent congestion has been persistent enough to warrant 44 + * a load balancing decision that switches the connection to another path. 
45 + */ 46 + void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb) 47 + { 48 + struct net *net = sock_net(sk); 49 + u32 max_suspend; 50 + bool forced_rehash = false, idle_rehash = false; 51 + 52 + if (!READ_ONCE(net->ipv4.sysctl_tcp_plb_enabled)) 53 + return; 54 + 55 + forced_rehash = plb->consec_cong_rounds >= 56 + READ_ONCE(net->ipv4.sysctl_tcp_plb_rehash_rounds); 57 + /* If sender goes idle then we check whether to rehash. */ 58 + idle_rehash = READ_ONCE(net->ipv4.sysctl_tcp_plb_idle_rehash_rounds) && 59 + !tcp_sk(sk)->packets_out && 60 + plb->consec_cong_rounds >= 61 + READ_ONCE(net->ipv4.sysctl_tcp_plb_idle_rehash_rounds); 62 + 63 + if (!forced_rehash && !idle_rehash) 64 + return; 65 + 66 + /* Note that tcp_jiffies32 can wrap; we detect wraps by checking for 67 + * cases where the max suspension end is before the actual suspension 68 + * end. We clear pause_until to 0 to indicate there is no recent 69 + * RTO event that constrains PLB rehashing. 70 + */ 71 + max_suspend = 2 * READ_ONCE(net->ipv4.sysctl_tcp_plb_suspend_rto_sec) * HZ; 72 + if (plb->pause_until && 73 + (!before(tcp_jiffies32, plb->pause_until) || 74 + before(tcp_jiffies32 + max_suspend, plb->pause_until))) 75 + plb->pause_until = 0; 76 + 77 + if (plb->pause_until) 78 + return; 79 + 80 + sk_rethink_txhash(sk); 81 + plb->consec_cong_rounds = 0; 82 + tcp_sk(sk)->plb_rehash++; 83 + NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH); 84 + } 85 + EXPORT_SYMBOL_GPL(tcp_plb_check_rehash); 86 + 87 + /* Upon RTO, disallow load balancing for a while, to avoid having load 88 + * balancing decisions switch traffic to a black-holed path that was 89 + * previously avoided with a sk_rethink_txhash() call at RTO time. 
90 + */ 91 + void tcp_plb_update_state_upon_rto(struct sock *sk, struct tcp_plb_state *plb) 92 + { 93 + struct net *net = sock_net(sk); 94 + u32 pause; 95 + 96 + if (!READ_ONCE(net->ipv4.sysctl_tcp_plb_enabled)) 97 + return; 98 + 99 + pause = READ_ONCE(net->ipv4.sysctl_tcp_plb_suspend_rto_sec) * HZ; 100 + pause += prandom_u32_max(pause); 101 + plb->pause_until = tcp_jiffies32 + pause; 102 + 103 + /* Reset PLB state upon RTO, since an RTO causes a sk_rethink_txhash() call 104 + * that may switch this connection to a path with completely different 105 + * congestion characteristics. 106 + */ 107 + plb->consec_cong_rounds = 0; 108 + } 109 + EXPORT_SYMBOL_GPL(tcp_plb_update_state_upon_rto);
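The 1x-2x suspension window that tcp_plb_update_state_upon_rto() builds from tcp_plb_suspend_rto_sec can be checked with a small model. This is a sketch, not the kernel code: the random draw is passed in as a parameter, standing in for prandom_u32_max(pause), which returns a value in [0, pause):

```c
#include <assert.h>
#include <stdint.h>

#define HZ 1000			/* assumed jiffies rate for the example */
#define PLB_SUSPEND_RTO_SEC 60	/* default tcp_plb_suspend_rto_sec */

/* Suspension length in jiffies, uniform in [1x, 2x) of the base pause. */
static uint32_t plb_rto_pause(uint32_t rnd)
{
	uint32_t pause = PLB_SUSPEND_RTO_SEC * HZ;

	pause += rnd % pause;	/* models prandom_u32_max(pause) */
	return pause;
}
```

The jitter keeps many flows that hit RTOs at the same time from all rehashing at once, which would simply move the herd to another link.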