Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'bonding'

Veaceslav Falico says:

====================
bonding: add an option to rely on unvalidated arp packets

v4 -> v5:
Again, per Nik's advice, correct the bond_opts restrictions for
arp_validate - set them the same as for arp_interval.

v3 -> v4:
Per Nikolay's advice, remove the new bond_opts restriction on modes setting
for arp_validate.

v2 -> v3:
Per Jay's advice, use the 'filter' keyword instead of the 'arp' one, and use
his text for documentation. Also, rebase on the latest net-next. Sorry for
the delay, didn't manage to send it before net-next was closed.

v1 -> v2:
Don't remove the 'all traffic' functionality - rather, add new arp_validate
options to specify that we want *only* unvalidated arps.

Currently, if arp_validate is off (0), slave_last_rx() returns
slave->dev->last_rx, which is updated on *any* packet received by the
slave, not only on arps. This means that, if the validation of arps is
off, we're treating *any* incoming packet as proof that the slave is up.

This might seem logical at first glance; however, it can cause a lot of
trouble and false positives. One example would be:

The arp_ip_target is NOT accessible, however someone in the broadcast domain
floods it with broadcast traffic. This way bonding is tricked into believing
that the slave is still up (as in: can access arp_ip_target), while it's not.

The net_device->last_rx is already used in a lot of drivers (even though the
comment states to NOT do it :)), and it's also ugly to modify it from bonding.

However, some loadbalance setups might rely on the fact that even non-arp
traffic is a sign of the slave being up - and we definitely can't break
anyone's config - so an extension to arp_validate is needed.

So, to fix this, add an option for the user to specify whether to filter
out non-arp traffic on unvalidated slaves, remove the last_rx from
bonding, *always* call bond_arp_rcv() in the slave's rx_handler (which is
bond_handle_frame), and if we spot an arp there with this option on, update
slave->last_arp_rx and use it instead of net_device->last_rx. Finally,
rename last_arp_rx to last_rx to reflect the changes.
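
The receive-path decision described above can be sketched as a small
userspace model. This is an illustration only, not the kernel code: the
function name early_path_updates_last_rx and the STATE_*/ARP_* names are
made up for the example, while the bit values follow the defines this
series adds to bonding.h. It covers only the early path of bond_arp_rcv();
when validation is enabled for a slave, last_rx is instead refreshed later
by bond_validate_arp(), once the ARP has matched a target.

```c
#include <stdbool.h>

/* Illustrative names; values mirror BOND_STATE_* and BOND_ARP_* in bonding.h. */
enum {
	STATE_ACTIVE = 0,
	STATE_BACKUP = 1,
};

#define ARP_VALIDATE_ACTIVE	(1 << STATE_ACTIVE)
#define ARP_VALIDATE_BACKUP	(1 << STATE_BACKUP)
#define ARP_VALIDATE_ALL	(ARP_VALIDATE_ACTIVE | ARP_VALIDATE_BACKUP)
#define ARP_FILTER		(ARP_VALIDATE_ALL + 1)

/*
 * Hypothetical helper: returns true when the early path of the receive
 * probe would refresh the slave's last_rx timestamp for this frame.
 */
bool early_path_updates_last_rx(int arp_validate, int slave_state, bool is_arp)
{
	bool validate = arp_validate & (1 << slave_state);
	bool filter = arp_validate & ARP_FILTER;

	/* Validated slaves are handled later, after the target checks. */
	if (validate)
		return false;

	/* Without filtering any frame counts; with filtering only ARPs do. */
	return !filter || is_arp;
}
```

With arp_validate set to none, every frame refreshes the timestamp (the
old behaviour); with filter set, a non-ARP broadcast no longer does.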

Also rename slave->jiffies to ->last_link_up, to better reflect its
meaning, add the new option's documentation and update the arp_validate one
to be a bit more descriptive.
====================

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

+117 -84
+64 -28
Documentation/networking/bonding.txt
···
 arp_validate

 	Specifies whether or not ARP probes and replies should be
-	validated in the active-backup mode. This causes the ARP
-	monitor to examine the incoming ARP requests and replies, and
-	only consider a slave to be up if it is receiving the
-	appropriate ARP traffic.
+	validated in any mode that supports arp monitoring, or whether
+	non-ARP traffic should be filtered (disregarded) for link
+	monitoring purposes.

 	Possible values are:

 	none or 0

-		No validation is performed. This is the default.
+		No validation or filtering is performed.

 	active or 1

···

 		Validation is performed for all slaves.

-	For the active slave, the validation checks ARP replies to
-	confirm that they were generated by an arp_ip_target. Since
-	backup slaves do not typically receive these replies, the
-	validation performed for backup slaves is on the ARP request
-	sent out via the active slave. It is possible that some
-	switch or network configurations may result in situations
-	wherein the backup slaves do not receive the ARP requests; in
-	such a situation, validation of backup slaves must be
-	disabled.
+	filter or 4

-	The validation of ARP requests on backup slaves is mainly
-	helping bonding to decide which slaves are more likely to
-	work in case of the active slave failure, it doesn't really
-	guarantee that the backup slave will work if it's selected
-	as the next active slave.
+		Filtering is applied to all slaves. No validation is
+		performed.

-	This option is useful in network configurations in which
-	multiple bonding hosts are concurrently issuing ARPs to one or
-	more targets beyond a common switch. Should the link between
-	the switch and target fail (but not the switch itself), the
-	probe traffic generated by the multiple bonding instances will
-	fool the standard ARP monitor into considering the links as
-	still up. Use of the arp_validate option can resolve this, as
-	the ARP monitor will only consider ARP requests and replies
-	associated with its own instance of bonding.
+	filter_active or 5
+
+		Filtering is applied to all slaves, validation is performed
+		only for the active slave.
+
+	filter_backup or 6
+
+		Filtering is applied to all slaves, validation is performed
+		only for backup slaves.
+
+	Validation:
+
+		Enabling validation causes the ARP monitor to examine the incoming
+		ARP requests and replies, and only consider a slave to be up if it
+		is receiving the appropriate ARP traffic.
+
+		For an active slave, the validation checks ARP replies to confirm
+		that they were generated by an arp_ip_target. Since backup slaves
+		do not typically receive these replies, the validation performed
+		for backup slaves is on the broadcast ARP request sent out via the
+		active slave. It is possible that some switch or network
+		configurations may result in situations wherein the backup slaves
+		do not receive the ARP requests; in such a situation, validation
+		of backup slaves must be disabled.
+
+		The validation of ARP requests on backup slaves is mainly helping
+		bonding to decide which slaves are more likely to work in case of
+		the active slave failure, it doesn't really guarantee that the
+		backup slave will work if it's selected as the next active slave.
+
+		Validation is useful in network configurations in which multiple
+		bonding hosts are concurrently issuing ARPs to one or more targets
+		beyond a common switch. Should the link between the switch and
+		target fail (but not the switch itself), the probe traffic
+		generated by the multiple bonding instances will fool the standard
+		ARP monitor into considering the links as still up. Use of
+		validation can resolve this, as the ARP monitor will only consider
+		ARP requests and replies associated with its own instance of
+		bonding.
+
+	Filtering:
+
+		Enabling filtering causes the ARP monitor to only use incoming ARP
+		packets for link availability purposes. Arriving packets that are
+		not ARPs are delivered normally, but do not count when determining
+		if a slave is available.
+
+		Filtering operates by only considering the reception of ARP
+		packets (any ARP packet, regardless of source or destination) when
+		determining if a slave has received traffic for link availability
+		purposes.
+
+		Filtering is useful in network configurations in which significant
+		levels of third party broadcast traffic would fool the standard
+		ARP monitor into considering the links as still up. Use of
+		filtering can resolve this, as only ARP traffic is considered for
+		link availability purposes.

 	This option was added in bonding version 3.1.0.

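
The "Filtering" scenario from the documentation text above can be
illustrated with a toy timeline. This is a simplified sketch under stated
assumptions: struct rx_event, last_counted_rx() and slave_looks_up() are
hypothetical names, times are plain milliseconds rather than jiffies, and
the freshness check is a bare "within a window" comparison, not the
kernel's bond_time_in_interval().

```c
#include <stdbool.h>
#include <stddef.h>

/* One received frame: arrival time and whether it was an ARP packet. */
struct rx_event {
	unsigned long time_ms;
	bool is_arp;
};

/*
 * Timestamp of the last frame that counts for link monitoring, or 0 if
 * none counted. With filter set, only ARP frames are considered.
 */
unsigned long last_counted_rx(const struct rx_event *ev, size_t n, bool filter)
{
	unsigned long last = 0;

	for (size_t i = 0; i < n; i++)
		if (!filter || ev[i].is_arp)
			last = ev[i].time_ms;
	return last;
}

/* Simplified freshness check (assumption: a fixed window, no jiffies wrap). */
bool slave_looks_up(unsigned long now_ms, unsigned long last_rx_ms,
		    unsigned long window_ms)
{
	return now_ms - last_rx_ms <= window_ms;
}
```

Given one early ARP followed only by third-party broadcasts, the slave
still looks up without filtering, while with filtering its last counted
receive time stays at the ARP and eventually falls outside the window.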
+24 -32
drivers/net/bonding/bond_main.c
···
 		return;

 	if (new_active) {
-		new_active->jiffies = jiffies;
+		new_active->last_link_up = jiffies;

 		if (new_active->link == BOND_LINK_BACK) {
 			if (USES_PRIMARY(bond->params.mode)) {
···
 	slave = bond_slave_get_rcu(skb->dev);
 	bond = slave->bond;

-	if (bond->params.arp_interval)
-		slave->dev->last_rx = jiffies;
-
 	recv_probe = ACCESS_ONCE(bond->recv_probe);
 	if (recv_probe) {
 		ret = recv_probe(skb, bond, slave);
···

 	bond_update_speed_duplex(new_slave);

-	new_slave->last_arp_rx = jiffies -
+	new_slave->last_rx = jiffies -
 		(msecs_to_jiffies(bond->params.arp_interval) + 1);
 	for (i = 0; i < BOND_MAX_ARP_TARGETS; i++)
-		new_slave->target_last_arp_rx[i] = new_slave->last_arp_rx;
+		new_slave->target_last_arp_rx[i] = new_slave->last_rx;

 	if (bond->params.miimon && !bond->params.use_carrier) {
 		link_reporting = bond_check_dev_link(bond, slave_dev, 1);
···
 	}

 	if (new_slave->link != BOND_LINK_DOWN)
-		new_slave->jiffies = jiffies;
+		new_slave->last_link_up = jiffies;
 	pr_debug("Initial state of slave_dev is BOND_LINK_%s\n",
 		 new_slave->link == BOND_LINK_DOWN ? "DOWN" :
 		 (new_slave->link == BOND_LINK_UP ? "UP" : "BACK"));
···
 				 * recovered before downdelay expired
 				 */
 				slave->link = BOND_LINK_UP;
-				slave->jiffies = jiffies;
+				slave->last_link_up = jiffies;
 				pr_info("%s: link status up again after %d ms for interface %s\n",
 					bond->dev->name,
 					(bond->params.downdelay - slave->delay) *
···

 		case BOND_LINK_UP:
 			slave->link = BOND_LINK_UP;
-			slave->jiffies = jiffies;
+			slave->last_link_up = jiffies;

 			if (bond->params.mode == BOND_MODE_8023AD) {
 				/* prevent it from being the active one */
···
 		pr_debug("bva: sip %pI4 not found in targets\n", &sip);
 		return;
 	}
-	slave->last_arp_rx = jiffies;
+	slave->last_rx = jiffies;
 	slave->target_last_arp_rx[i] = jiffies;
 }

···
 	struct arphdr *arp = (struct arphdr *)skb->data;
 	unsigned char *arp_ptr;
 	__be32 sip, tip;
-	int alen;
+	int alen, is_arp = skb->protocol == __cpu_to_be16(ETH_P_ARP);

-	if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
+	if (!slave_do_arp_validate(bond, slave)) {
+		if ((slave_do_arp_validate_only(bond, slave) && is_arp) ||
+		    !slave_do_arp_validate_only(bond, slave))
+			slave->last_rx = jiffies;
 		return RX_HANDLER_ANOTHER;
-
-	read_lock(&bond->lock);
-
-	if (!slave_do_arp_validate(bond, slave))
-		goto out_unlock;
+	} else if (!is_arp) {
+		return RX_HANDLER_ANOTHER;
+	}

 	alen = arp_hdr_len(bond->dev);

···
 		bond_validate_arp(bond, slave, sip, tip);
 	else if (bond->curr_active_slave &&
 		 time_after(slave_last_rx(bond, bond->curr_active_slave),
-			    bond->curr_active_slave->jiffies))
+			    bond->curr_active_slave->last_link_up))
 		bond_validate_arp(bond, slave, tip, sip);

 out_unlock:
-	read_unlock(&bond->lock);
 	if (arp != (struct arphdr *)skb->data)
 		kfree(arp);
 	return RX_HANDLER_ANOTHER;
···
 	oldcurrent = ACCESS_ONCE(bond->curr_active_slave);
 	/* see if any of the previous devices are up now (i.e. they have
 	 * xmt and rcv traffic). the curr_active_slave does not come into
-	 * the picture unless it is null. also, slave->jiffies is not needed
-	 * here because we send an arp on each slave and give a slave as
-	 * long as it needs to get the tx/rx within the delta.
+	 * the picture unless it is null. also, slave->last_link_up is not
+	 * needed here because we send an arp on each slave and give a slave
+	 * as long as it needs to get the tx/rx within the delta.
 	 * TODO: what about up/down delay in arp mode? it wasn't here before
 	 * so it can wait
 	 */
···

 		if (slave->link != BOND_LINK_UP) {
 			if (bond_time_in_interval(bond, trans_start, 1) &&
-			    bond_time_in_interval(bond, slave->dev->last_rx, 1)) {
+			    bond_time_in_interval(bond, slave->last_rx, 1)) {

 				slave->link = BOND_LINK_UP;
 				slave_state_changed = 1;
···
 			 * if we don't know our ip yet
 			 */
 			if (!bond_time_in_interval(bond, trans_start, 2) ||
-			    !bond_time_in_interval(bond, slave->dev->last_rx, 2)) {
+			    !bond_time_in_interval(bond, slave->last_rx, 2)) {

 				slave->link = BOND_LINK_DOWN;
 				slave_state_changed = 1;
···
 		 * active. This avoids bouncing, as the last receive
 		 * times need a full ARP monitor cycle to be updated.
 		 */
-		if (bond_time_in_interval(bond, slave->jiffies, 2))
+		if (bond_time_in_interval(bond, slave->last_link_up, 2))
 			continue;

 		/*
···
 	new_slave->link = BOND_LINK_BACK;
 	bond_set_slave_active_flags(new_slave);
 	bond_arp_send_all(bond, new_slave);
-	new_slave->jiffies = jiffies;
+	new_slave->last_link_up = jiffies;
 	rcu_assign_pointer(bond->current_arp_slave, new_slave);
 	rtnl_unlock();

···

 	if (bond->params.arp_interval) { /* arp interval, in milliseconds. */
 		queue_delayed_work(bond->wq, &bond->arp_work, 0);
-		if (bond->params.arp_validate)
-			bond->recv_probe = bond_arp_rcv;
+		bond->recv_probe = bond_arp_rcv;
 	}

 	if (bond->params.mode == BOND_MODE_8023AD) {
···
 	}

 	if (arp_validate) {
-		if (bond_mode != BOND_MODE_ACTIVEBACKUP) {
-			pr_err("arp_validate only supported in active-backup mode\n");
-			return -EINVAL;
-		}
 		if (!arp_interval) {
 			pr_err("arp_validate requires arp_interval\n");
 			return -EINVAL;
+11 -8
drivers/net/bonding/bond_options.c
···
 };

 static struct bond_opt_value bond_arp_validate_tbl[] = {
-	{ "none", BOND_ARP_VALIDATE_NONE, BOND_VALFLAG_DEFAULT},
-	{ "active", BOND_ARP_VALIDATE_ACTIVE, 0},
-	{ "backup", BOND_ARP_VALIDATE_BACKUP, 0},
-	{ "all", BOND_ARP_VALIDATE_ALL, 0},
-	{ NULL, -1, 0},
+	{ "none", BOND_ARP_VALIDATE_NONE, BOND_VALFLAG_DEFAULT},
+	{ "active", BOND_ARP_VALIDATE_ACTIVE, 0},
+	{ "backup", BOND_ARP_VALIDATE_BACKUP, 0},
+	{ "all", BOND_ARP_VALIDATE_ALL, 0},
+	{ "filter", BOND_ARP_FILTER, 0},
+	{ "filter_active", BOND_ARP_FILTER_ACTIVE, 0},
+	{ "filter_backup", BOND_ARP_FILTER_BACKUP, 0},
+	{ NULL, -1, 0},
 };

 static struct bond_opt_value bond_arp_all_targets_tbl[] = {
···
 		.id = BOND_OPT_ARP_VALIDATE,
 		.name = "arp_validate",
 		.desc = "validate src/dst of ARP probes",
-		.unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_ACTIVEBACKUP)),
+		.unsuppmodes = BIT(BOND_MODE_8023AD) | BIT(BOND_MODE_TLB) |
+			       BIT(BOND_MODE_ALB),
 		.values = bond_arp_validate_tbl,
 		.set = bond_option_arp_validate_set
 	},
···
 		cancel_delayed_work_sync(&bond->arp_work);
 	} else {
 		/* arp_validate can be set only in active-backup mode */
-		if (bond->params.arp_validate)
-			bond->recv_probe = bond_arp_rcv;
+		bond->recv_probe = bond_arp_rcv;
 		cancel_delayed_work_sync(&bond->mii_work);
 		queue_delayed_work(bond->wq, &bond->arp_work, 0);
 	}
+17 -9
drivers/net/bonding/bonding.h
···
 	struct net_device *dev; /* first - useful for panic debug */
 	struct bonding *bond; /* our master */
 	int delay;
-	unsigned long jiffies;
-	unsigned long last_arp_rx;
+	/* all three in jiffies */
+	unsigned long last_link_up;
+	unsigned long last_rx;
 	unsigned long target_last_arp_rx[BOND_MAX_ARP_TARGETS];
 	s8 link; /* one of BOND_LINK_XXXX */
 	s8 new_link;
···
 #define BOND_ARP_VALIDATE_BACKUP	(1 << BOND_STATE_BACKUP)
 #define BOND_ARP_VALIDATE_ALL		(BOND_ARP_VALIDATE_ACTIVE | \
 					 BOND_ARP_VALIDATE_BACKUP)
+#define BOND_ARP_FILTER			(BOND_ARP_VALIDATE_ALL + 1)
+#define BOND_ARP_FILTER_ACTIVE		(BOND_ARP_VALIDATE_ACTIVE | \
+					 BOND_ARP_FILTER)
+#define BOND_ARP_FILTER_BACKUP		(BOND_ARP_VALIDATE_BACKUP | \
+					 BOND_ARP_FILTER)

 static inline int slave_do_arp_validate(struct bonding *bond,
 					struct slave *slave)
 {
 	return bond->params.arp_validate & (1 << bond_slave_state(slave));
+}
+
+static inline int slave_do_arp_validate_only(struct bonding *bond,
+					     struct slave *slave)
+{
+	return bond->params.arp_validate & BOND_ARP_FILTER;
 }

 /* Get the oldest arp which we've received on this slave for bond's
···
 static inline unsigned long slave_last_rx(struct bonding *bond,
 					struct slave *slave)
 {
-	if (slave_do_arp_validate(bond, slave)) {
-		if (bond->params.arp_all_targets == BOND_ARP_TARGETS_ALL)
-			return slave_oldest_target_arp_rx(bond, slave);
-		else
-			return slave->last_arp_rx;
-	}
+	if (bond->params.arp_all_targets == BOND_ARP_TARGETS_ALL)
+		return slave_oldest_target_arp_rx(bond, slave);

-	return slave->dev->last_rx;
+	return slave->last_rx;
 }

 #ifdef CONFIG_NET_POLL_CONTROLLER
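
The defines above pack validation and filtering into one small bitmask,
which is why the documented numeric values run from 0 to 6. The standalone
sketch below reproduces the arithmetic outside the kernel. The macro bodies
are copied from the diff and assume BOND_STATE_ACTIVE is 0 and
BOND_STATE_BACKUP is 1 (consistent with the documented "active or 1" and
"backup or 2"); struct opt_value and arp_validate_lookup() are hypothetical
stand-ins for the real option table and parser.

```c
#include <stddef.h>
#include <string.h>

/* Assumed slave state values; see the lead-in. */
#define BOND_STATE_ACTIVE	0
#define BOND_STATE_BACKUP	1

#define BOND_ARP_VALIDATE_NONE		0
#define BOND_ARP_VALIDATE_ACTIVE	(1 << BOND_STATE_ACTIVE)
#define BOND_ARP_VALIDATE_BACKUP	(1 << BOND_STATE_BACKUP)
#define BOND_ARP_VALIDATE_ALL		(BOND_ARP_VALIDATE_ACTIVE | \
					 BOND_ARP_VALIDATE_BACKUP)
#define BOND_ARP_FILTER			(BOND_ARP_VALIDATE_ALL + 1)
#define BOND_ARP_FILTER_ACTIVE		(BOND_ARP_VALIDATE_ACTIVE | \
					 BOND_ARP_FILTER)
#define BOND_ARP_FILTER_BACKUP		(BOND_ARP_VALIDATE_BACKUP | \
					 BOND_ARP_FILTER)

/* Hypothetical, simplified stand-in for struct bond_opt_value. */
struct opt_value {
	const char *name;
	int value;
};

static const struct opt_value arp_validate_tbl[] = {
	{ "none",          BOND_ARP_VALIDATE_NONE },
	{ "active",        BOND_ARP_VALIDATE_ACTIVE },
	{ "backup",        BOND_ARP_VALIDATE_BACKUP },
	{ "all",           BOND_ARP_VALIDATE_ALL },
	{ "filter",        BOND_ARP_FILTER },
	{ "filter_active", BOND_ARP_FILTER_ACTIVE },
	{ "filter_backup", BOND_ARP_FILTER_BACKUP },
	{ NULL,            -1 },
};

/* Hypothetical lookup: numeric value for a keyword, or -1 if unknown. */
int arp_validate_lookup(const char *name)
{
	for (size_t i = 0; arp_validate_tbl[i].name; i++)
		if (strcmp(arp_validate_tbl[i].name, name) == 0)
			return arp_validate_tbl[i].value;
	return -1;
}
```

Because BOND_ARP_FILTER is bit 2 and the two validate flags are bits 0 and
1, slave_do_arp_validate() and slave_do_arp_validate_only() can test their
bits independently on the same arp_validate value.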
+1 -7
include/linux/netdevice.h
···
 /*
  * Cache lines mostly used on receive path (including eth_type_trans())
  */
-	unsigned long		last_rx;	/* Time of last Rx
-						 * This should not be set in
-						 * drivers, unless really needed,
-						 * because network stack (bonding)
-						 * use it if/when necessary, to
-						 * avoid dirtying this cache line.
-						 */
+	unsigned long		last_rx;	/* Time of last Rx */

 	/* Interface address info used in eth_type_trans() */
 	unsigned char		*dev_addr;	/* hw address, (before bcast