Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Documentation: net: Document resilient next-hop groups

Add a document describing the principles behind resilient next-hop groups,
and some notes about how to configure and offload them.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Petr Machata and committed by David S. Miller
87f2c671 177cb787

+294
+1
Documentation/networking/index.rst
···
    netdevices
    netfilter-sysctl
    netif-msg
+   nexthop-group-resilient
    nf_conntrack-sysctl
    nf_flowtable
    openvswitch
+293
Documentation/networking/nexthop-group-resilient.rst
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Resilient Next-hop Groups
+=========================
+
+Resilient groups are a type of next-hop group that is aimed at minimizing
+disruption in flow routing across changes to the group composition and
+weights of constituent next hops.
+
+The idea behind resilient hashing groups is best explained in contrast to
+the legacy multipath next-hop group, which uses the hash-threshold
+algorithm, described in RFC 2992.
+
+To select a next hop, hash-threshold algorithm first assigns a range of
+hashes to each next hop in the group, and then selects the next hop by
+comparing the SKB hash with the individual ranges. When a next hop is
+removed from the group, the ranges are recomputed, which leads to
+reassignment of parts of hash space from one next hop to another. RFC 2992
+illustrates it thus::
+
+    +-------+-------+-------+-------+-------+
+    |   1   |   2   |   3   |   4   |   5   |
+    +-------+-+-----+---+---+-----+-+-------+
+    |    1    |    2    |    4    |    5    |
+    +---------+---------+---------+---------+
+
+     Before and after deletion of next hop 3
+       under the hash-threshold algorithm.
+
+Note how next hop 2 gave up part of the hash space in favor of next hop 1,
+and 4 in favor of 5. While there will usually be some overlap between the
+previous and the new distribution, some traffic flows change the next hop
+that they resolve to.
+
+If a multipath group is used for load-balancing between multiple servers,
+this hash space reassignment causes an issue that packets from a single
+flow suddenly end up arriving at a server that does not expect them. This
+can result in TCP connections being reset.
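
The reassignment described above can be reproduced with a small sketch. It assumes equal-weight next hops and a deliberately tiny hash space of 20 values so that the result can be checked by hand; the function name and the key space size are illustrative and not taken from the kernel.

```python
def hash_threshold(key, next_hops, keyspace):
    """Pick a next hop by cutting [0, keyspace) into equal contiguous ranges."""
    return next_hops[key * len(next_hops) // keyspace]


KEYSPACE = 20  # illustrative; real SKB hashes span a far larger space

before = [1, 2, 3, 4, 5]
after = [1, 2, 4, 5]  # next hop 3 removed

# Hash values whose next hop changes once the ranges are recomputed.
moved = [k for k in range(KEYSPACE)
         if hash_threshold(k, before, KEYSPACE) != hash_threshold(k, after, KEYSPACE)]

# Of those, the ones that were not on next hop 3 to begin with.
collateral = [k for k in moved if hash_threshold(k, before, KEYSPACE) != 3]

print(moved)       # [4, 8, 9, 10, 11, 15]
print(collateral)  # [4, 15]
```

Hash values 4 and 15 were never served by next hop 3, yet they move anyway (from 2 to 1 and from 4 to 5). These are the flows that a resilient group is meant to leave alone.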
+
+If a multipath group is used for load-balancing among available paths to
+the same server, the issue is that different latencies and reordering along
+the way causes the packets to arrive in the wrong order, resulting in
+degraded application performance.
+
+To mitigate the above-mentioned flow redirection, resilient next-hop groups
+insert another layer of indirection between the hash space and its
+constituent next hops: a hash table. The selection algorithm uses SKB hash
+to choose a hash table bucket, then reads the next hop that this bucket
+contains, and forwards traffic there.
+
+This indirection brings an important feature. In the hash-threshold
+algorithm, the range of hashes associated with a next hop must be
+continuous. With a hash table, mapping between the hash table buckets and
+the individual next hops is arbitrary. Therefore when a next hop is deleted
+the buckets that held it are simply reassigned to other next hops::
+
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+    |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+                     v v v v
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+    |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+     Before and after deletion of next hop 3
+     under the resilient hashing algorithm.
+
+When weights of next hops in a group are altered, it may be possible to
+choose a subset of buckets that are currently not used for forwarding
+traffic, and use those to satisfy the new next-hop distribution demands,
+keeping the "busy" buckets intact. This way, established flows are ideally
+kept being forwarded to the same endpoints through the same paths as before
+the next-hop group change.
+
+Algorithm
+---------
+
+In a nutshell, the algorithm works as follows.
+Each next hop deserves a
+certain number of buckets, according to its weight and the number of
+buckets in the hash table. In accordance with the source code, we will call
+this number a "wants count" of a next hop. In case of an event that might
+cause bucket allocation change, the wants counts for individual next hops
+are updated.
+
+Next hops that have fewer buckets than their wants count, are called
+"underweight". Those that have more are "overweight". If there are no
+overweight (and therefore no underweight) next hops in the group, it is
+said to be "balanced".
+
+Each bucket maintains a last-used timer. Every time a packet is forwarded
+through a bucket, this timer is updated to current jiffies value. One
+attribute of a resilient group is then the "idle timer", which is the
+amount of time that a bucket must not be hit by traffic in order for it to
+be considered "idle". Buckets that are not idle are busy.
+
+After assigning wants counts to next hops, an "upkeep" algorithm runs. For
+buckets:
+
+ 1) that have no assigned next hop, or
+ 2) whose next hop has been removed, or
+ 3) that are idle and their next hop is overweight,
+
+upkeep changes the next hop that the bucket references to one of the
+underweight next hops. If, after considering all buckets in this manner,
+there are still underweight next hops, another upkeep run is scheduled to a
+future time.
+
+There may not be enough "idle" buckets to satisfy the updated wants counts
+of all next hops. Another attribute of a resilient group is the "unbalanced
+timer". This timer can be set to 0, in which case the table will stay out
+of balance until idle buckets do appear, possibly never. If set to a
+non-zero value, the value represents the period of time that the table is
+permitted to stay out of balance.
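
One upkeep pass can be sketched as follows. This is a simplified model of the rules above, not the kernel implementation. The `Bucket` class, the `upkeep()` signature, the use of plain seconds instead of jiffies, and the choice to always pick the first underweight next hop are all assumptions made for illustration. To honor a non-zero unbalanced timer, the sketch also lets busy buckets of overweight next hops migrate once the table has been out of balance for longer than that timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    nh: Optional[int]   # ID of the next hop the bucket points at, None if unassigned
    last_used: float    # time the bucket last forwarded a packet


def upkeep(buckets, wants, now, idle_timer, unbalanced_timer=0, unbalanced_since=None):
    """Run one upkeep pass; `wants` maps next hop ID to its wants count.

    Returns True if underweight next hops remain, meaning that another
    pass would have to be scheduled.
    """
    have = {nh: 0 for nh in wants}
    for b in buckets:
        if b.nh in have:
            have[b.nh] += 1

    # A non-zero unbalanced timer that has run out permits migrating busy buckets.
    force = (unbalanced_timer > 0 and unbalanced_since is not None
             and now - unbalanced_since > unbalanced_timer)

    for b in buckets:
        underweight = [nh for nh in wants if have[nh] < wants[nh]]
        if not underweight:
            break
        orphaned = b.nh not in wants  # unassigned, or its next hop was removed
        overweight = not orphaned and have[b.nh] > wants[b.nh]
        idle = now - b.last_used >= idle_timer
        if orphaned or (overweight and (idle or force)):
            if not orphaned:
                have[b.nh] -= 1
            b.nh = underweight[0]
            have[b.nh] += 1
            b.last_used = now  # assumption: migration restarts the idle clock
    return any(have[nh] < wants[nh] for nh in wants)


# Eight buckets split 4/4; the new weights call for a 6/2 split.
# Buckets 4 and 5 have been idle for 90 s, buckets 6 and 7 are busy.
table = [Bucket(1, 95.0) for _ in range(4)] + [
    Bucket(2, 10.0), Bucket(2, 10.0), Bucket(2, 90.0), Bucket(2, 90.0)]
again = upkeep(table, wants={1: 6, 2: 2}, now=100.0, idle_timer=60.0)
print([b.nh for b in table], again)  # [1, 1, 1, 1, 1, 1, 2, 2] False
```

Only the two idle buckets of the overweight next hop 2 are handed over, and the busy ones keep their flows. Had all four been busy and the unbalanced timer been 0, the pass would have changed nothing and returned True.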
+
+With this in mind, we update the above list of conditions with one more
+item. Thus buckets:
+
+ 4) whose next hop is overweight, and the amount of time that the table has
+    been out of balance exceeds the unbalanced timer, if that is non-zero,
+
+\... are migrated as well.
+
+Offloading & Driver Feedback
+----------------------------
+
+When offloading resilient groups, the algorithm that distributes buckets
+among next hops is still the one in SW. Drivers are notified of updates to
+next hop groups in the following three ways:
+
+- Full group notification with the type
+  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
+  created and buckets populated for the first time.
+
+- Single-bucket notifications of the type
+  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
+  individual migrations within an already-established group.
+
+- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
+  is sent before the group is replaced, and is a way for the driver to veto
+  the group before committing anything to the HW.
+
+Some single-bucket notifications are forced, as indicated by the "force"
+flag in the notification. Those are used for the cases where e.g. the next
+hop associated with the bucket was removed, and the bucket really must be
+migrated.
+
+Non-forced notifications can be overridden by the driver by returning an
+error code. The use case for this is that the driver notifies the HW that a
+bucket should be migrated, but the HW discovers that the bucket has in fact
+been hit by traffic.
+
+A second way for the HW to report that a bucket is busy is through the
+``nexthop_res_grp_activity_update()`` API. The buckets identified this way
+as busy are treated as if traffic hit them.
+
+Offloaded buckets should be flagged as either "offload" or "trap". This is
+done through the ``nexthop_bucket_set_hw_flags()`` API.
+
+Netlink UAPI
+------------
+
+Resilient Group Replacement
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
+same manner as other multipath groups. The following changes apply to the
+attributes passed in the netlink message:
+
+  =================== =========================================================
+  ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
+  ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
+                      groups.
+  =================== =========================================================
+
+``NHA_RES_GROUP`` payload:
+
+  =================================== =========================================
+  ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
+  ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
+  ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
+  =================================== =========================================
+
+Next Hop Get
+^^^^^^^^^^^^
+
+Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
+message in exactly the same way as other next hop get requests. The
+response attributes match the replacement attributes cited above, except
+``NHA_RES_GROUP`` payload will include the following attribute:
+
+  =================================== =========================================
+  ``NHA_RES_GROUP_UNBALANCED_TIME``   How long has the resilient group been out
+                                      of balance, in units of clock_t.
+  =================================== =========================================
+
+Bucket Get
+^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
+used to request a single bucket. The attributes recognized at get requests
+are:
+
+  =================== =========================================================
+  ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
+  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
+  =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+  ======================== ====================================================
+  ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
+  ======================== ====================================================
+
+Bucket Dumps
+^^^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
+to request a dump of matching buckets. The attributes recognized at dump
+requests are:
+
+  =================== =========================================================
+  ``NHA_ID``          If specified, limits the dump to just the next-hop group
+                      with this ID.
+  ``NHA_OIF``         If specified, limits the dump to buckets that contain
+                      next hops that use the device with this ifindex.
+  ``NHA_MASTER``      If specified, limits the dump to buckets that contain
+                      next hops that use a device in the VRF with this ifindex.
+  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
+  =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+  ======================== ====================================================
+  ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
+                           that contain the next hop with this ID.
+  ======================== ====================================================
+
+Usage
+-----
+
+To illustrate the usage, consider the following commands::
+
+    # ip nexthop add id 1 via 192.0.2.2 dev eth0
+    # ip nexthop add id 2 via 192.0.2.3 dev eth0
+    # ip nexthop add id 10 group 1/2 type resilient \
+        buckets 8 idle_timer 60 unbalanced_timer 300
+
+The last command creates a resilient next-hop group. It will have 8 buckets
+(which is unusually low number, and used here for demonstration purposes
+only), each bucket will be considered idle when no traffic hits it for at
+least 60 seconds, and if the table remains out of balance for 300 seconds,
+it will be forcefully brought into balance.
+
+Changing next-hop weights leads to change in bucket allocation::
+
+    # ip nexthop replace id 10 group 1,3/2 type resilient
+
+This can be confirmed by looking at individual buckets::
+
+    # ip nexthop bucket show id 10
+    id 10 index 0 idle_time 5.59 nhid 1
+    id 10 index 1 idle_time 5.59 nhid 1
+    id 10 index 2 idle_time 8.74 nhid 2
+    id 10 index 3 idle_time 8.74 nhid 2
+    id 10 index 4 idle_time 8.74 nhid 1
+    id 10 index 5 idle_time 8.74 nhid 1
+    id 10 index 6 idle_time 8.74 nhid 1
+    id 10 index 7 idle_time 8.74 nhid 1
+
+Note the two buckets that have a shorter idle time. Those are the ones that
+were migrated after the next-hop replace command to satisfy the new demand
+that next hop 1 be given 6 buckets instead of 4.
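
The move from 4 to 6 buckets follows from dividing the 8 buckets in proportion to the weights. ``group 1/2`` gives both next hops weight 1, and ``group 1,3/2`` gives next hop 1 weight 3. The sketch below shows one way to compute such wants counts. The function name and the cumulative round-to-nearest scheme are assumptions chosen so that the counts always add up to the table size; the text here states only that the count depends on the weight and the number of buckets.

```python
def wants_counts(weights, num_buckets):
    """Split num_buckets among next hops in proportion to their weights.

    `weights` maps next hop ID to weight, in group order. Rounding is applied
    to the running total, so the per-hop counts always sum to num_buckets.
    """
    total = sum(weights.values())
    wants = {}
    cumulative = 0
    prev_bound = 0
    for nh, weight in weights.items():
        cumulative += weight
        # Integer round-to-nearest of num_buckets * cumulative / total.
        bound = (num_buckets * cumulative + total // 2) // total
        wants[nh] = bound - prev_bound
        prev_bound = bound
    return wants


print(wants_counts({1: 1, 2: 1}, 8))  # {1: 4, 2: 4}
print(wants_counts({1: 3, 2: 1}, 8))  # {1: 6, 2: 2}
```

After the replace, next hop 1 is underweight by two buckets and next hop 2 is overweight by two, which matches the two migrated buckets in the listing above.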
+
+Netdevsim
+---------
+
+The netdevsim driver implements a mock offload of resilient groups, and
+exposes debugfs interface that allows marking individual buckets as busy.
+For example, the following will mark bucket 23 in next-hop group 10 as
+active::
+
+    # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity
+
+In addition, another debugfs interface can be used to configure that the
+next attempt to migrate a bucket should fail::
+
+    # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
+
+Besides serving as an example, the interfaces that netdevsim exposes are
+useful in automated testing, and
+``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
+them to test the algorithm.