Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries

tl;dr
=====

Add a new neighbor flag ("extern_valid") that can be used to indicate to
the kernel that a neighbor entry was learned and determined to be valid
externally. The kernel will not try to remove or invalidate such an
entry, leaving these decisions to the user space control plane. This is
needed for EVPN multi-homing where a neighbor entry for a multi-homed
host needs to be synced across all the VTEPs among which the host is
multi-homed.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches (VTEPs). VTEPs that are connected to the same ES are called ES
peers.

When a neighbor entry is learned on a VTEP, it is distributed to both ES
peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
use the neighbor entry when routing traffic towards the multi-homed host
and remote VTEPs use it for ARP/NS suppression.

Motivation
==========

If the ES link between a host and the VTEP on which the neighbor entry
was locally learned goes down, the EVPN MAC/IP advertisement route will
be withdrawn and the neighbor entries will be removed from both ES peers
and remote VTEPs. Routing towards the multi-homed host and ARP/NS
suppression can fail until another ES peer locally learns the neighbor
entry and distributes it via an EVPN MAC/IP advertisement route.

"draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
intermittent failures by having the ES peers install the neighbor
entries as before, but also injecting EVPN MAC/IP advertisement routes
with a proxy indication. When the previously mentioned ES link goes down
and the original EVPN MAC/IP advertisement route is withdrawn, the ES
peers will not withdraw their neighbor entries, but instead start aging
timers for the proxy indication.

If an ES peer locally learns the neighbor entry (i.e., it becomes
"reachable"), it will restart its aging timer for the entry and emit an
EVPN MAC/IP advertisement route without a proxy indication. An ES peer
will stop its aging timer for the proxy indication if it observes the
removal of the proxy indication from at least one of the ES peers
advertising the entry.

If the aging timer for the proxy indication expires, an ES peer
withdraws its EVPN MAC/IP advertisement route. Once the timer has
expired on all ES peers and they have all withdrawn their proxy
advertisements, the neighbor entry is completely removed from the EVPN
fabric.
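
The proxy aging behavior described above can be pictured as a small
state machine per ES peer. The following is only a sketch under
simplifying assumptions: time advances in discrete ticks, and every
name in it (es_peer_entry, es_event, es_peer_handle) is hypothetical
and comes neither from the draft nor from FRR.

```c
#include <stdbool.h>

/* Hypothetical model of how one ES peer tracks a synced neighbor entry. */
struct es_peer_entry {
	bool advertised;         /* peer advertises a MAC/IP route for the entry */
	bool proxy;              /* that route carries the proxy indication */
	bool aging;              /* proxy aging timer is running */
	unsigned int ticks_left; /* remaining time on the proxy aging timer */
};

enum es_event {
	ES_ORIGIN_WITHDRAWN,   /* the original, non-proxy route was withdrawn */
	ES_LOCALLY_LEARNED,    /* the entry became "reachable" on this peer */
	ES_PEER_DROPPED_PROXY, /* another ES peer re-advertised without proxy */
	ES_TICK,               /* one unit of time elapsed */
};

static void es_peer_handle(struct es_peer_entry *e, enum es_event ev,
			   unsigned int aging_ticks)
{
	switch (ev) {
	case ES_ORIGIN_WITHDRAWN:
		/* Keep the neighbor entry, but start aging the proxy route. */
		if (e->advertised && e->proxy) {
			e->aging = true;
			e->ticks_left = aging_ticks;
		}
		break;
	case ES_LOCALLY_LEARNED:
		/* Advertise without the proxy indication from now on. */
		e->advertised = true;
		e->proxy = false;
		e->aging = false;
		break;
	case ES_PEER_DROPPED_PROXY:
		/* Another peer learned the entry; stop the proxy timer. */
		e->aging = false;
		break;
	case ES_TICK:
		if (!e->aging)
			break;
		if (e->ticks_left > 0)
			e->ticks_left--;
		if (e->ticks_left == 0) {
			/* Timer expired: withdraw the proxy advertisement. */
			e->advertised = false;
			e->aging = false;
		}
		break;
	}
}
```

In this model the entry leaves the fabric only when every ES peer has
reached the "advertised == false" state.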

Implementation
==============

In the above scheme, when the control plane (e.g., FRR) advertises a
neighbor entry with a proxy indication, it expects the corresponding
entry in the data plane (i.e., the kernel) to remain valid and not be
removed due to garbage collection or loss of carrier. The control plane
also expects the kernel to notify it if the entry is learned locally
(i.e., becomes "reachable") so that it can remove the proxy indication
from the EVPN MAC/IP advertisement route. That is why these entries
cannot be programmed with dummy states such as "permanent" or "noarp":
the kernel does not probe entries in these states, so they would never
become "reachable".

Instead, add a new neighbor flag ("extern_valid") which indicates that
the entry was learned and determined to be valid externally and should
not be removed or invalidated by the kernel. The kernel can probe the
entry and notify user space when it becomes "reachable" (it is initially
installed as "stale"). However, if the kernel does not receive a
confirmation, have it return the entry to the "stale" state instead of
the "failed" state.

In other words, an entry marked with the "extern_valid" flag behaves
like any other dynamically learned entry, except that the kernel cannot
remove or invalidate it.
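
The probe-failure rule can be sketched as a pure function. This is a
reduced illustration of the branch this patch adds to the neighbor
timer handler, not the handler itself. The helper name
state_after_probe_exhaustion() is hypothetical, and the constants are
redefined locally with the values I believe the kernel headers use.

```c
#include <stdint.h>

/* Local copies for illustration; values assumed to match the kernel. */
#define NUD_INCOMPLETE    0x01
#define NUD_STALE         0x04
#define NUD_PROBE         0x10
#define NUD_FAILED        0x20
#define NTF_EXT_VALIDATED (1u << 10) /* (1 << 2) << NTF_EXT_SHIFT */

/* State an entry moves to once its probe budget is exhausted. */
static uint8_t state_after_probe_exhaustion(uint8_t nud_state, uint32_t flags)
{
	/* Externally validated entries fall back to "stale". */
	if (nud_state == NUD_PROBE && (flags & NTF_EXT_VALIDATED))
		return NUD_STALE;

	/* Everything else is invalidated as before. */
	return NUD_FAILED;
}
```

Note that only the "probe" state is special-cased: an entry in the
"incomplete" state still ends up "failed" even with the flag set.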

One can argue that the "extern_valid" flag should not prevent garbage
collection and that instead a neighbor entry should be programmed with
both the "extern_valid" and "extern_learn" flags. There are two reasons
for not doing that:

1. It is unclear why a control plane would want to program an entry
   that the kernel cannot invalidate but can completely remove.

2. The "extern_learn" flag is used by FRR for neighbor entries learned
   on remote VTEPs (for ARP/NS suppression) whereas here we are
   concerned with local entries. This distinction is currently
   irrelevant for the kernel, but might be relevant in the future.
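
As a consequence, the garbage collection exemption treats both flags
alike. A minimal sketch of the predicate, mirroring the expression in
the patch (the function wrapper and the locally defined constants are
for illustration only):

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies for illustration; values assumed to match the kernel. */
#define NUD_STALE         0x04
#define NUD_PERMANENT     0x80
#define NTF_EXT_LEARNED   (1u << 4)
#define NTF_EXT_VALIDATED (1u << 10) /* (1 << 2) << NTF_EXT_SHIFT */

/* True if the entry must be kept off the gc list. */
static bool exempt_from_gc(uint8_t nud_state, uint32_t flags)
{
	return (nud_state & NUD_PERMANENT) ||
	       (flags & (NTF_EXT_LEARNED | NTF_EXT_VALIDATED));
}
```

A "stale" entry with only "extern_valid" set is therefore exempt, just
like one with only "extern_learn" set.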

Given that the flag only makes sense when the neighbor has a valid
state, reject attempts to add a neighbor with an invalid state and with
this flag set. For example:

 # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
 Error: Cannot create externally validated neighbor with an invalid state.
 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
 # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
 Error: Cannot mark neighbor as externally validated with an invalid state.
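
The check behind both errors boils down to testing the requested state
against NUD_VALID. A sketch, with a hypothetical helper name and
locally defined constants whose values are assumed to match the kernel
headers:

```c
#include <errno.h>
#include <stdint.h>

/* Local copies for illustration; values assumed to match the kernel. */
#define NUD_NONE      0x00
#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04
#define NUD_DELAY     0x08
#define NUD_PROBE     0x10
#define NUD_FAILED    0x20
#define NUD_NOARP     0x40
#define NUD_PERMANENT 0x80
#define NUD_VALID     (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
		       NUD_PROBE | NUD_STALE | NUD_DELAY)

/* 0 if "extern_valid" may be combined with this state, -EINVAL if not. */
static int check_ext_validated_state(uint8_t state)
{
	return (state & NUD_VALID) ? 0 : -EINVAL;
}
```

With these definitions "none" and "failed" are rejected while "stale"
is accepted, matching the three commands above.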

The above means that a neighbor cannot be created with the
"extern_valid" flag and flags such as "use" or "managed" as they result
in a neighbor being created with an invalid state ("none") and
immediately getting probed:

 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 Error: Cannot create externally validated neighbor with an invalid state.

However, these flags can be used together with "extern_valid" after the
neighbor was created with a valid state:

 # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
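
The difference between the create and replace cases comes from which
state is actually validated when "use" or "managed" is set. The sketch
below follows the two checks in the patch. The helper names and the
"creating" parameter are hypothetical, and the constants are local
copies with values assumed to match the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies for illustration; values assumed to match the kernel. */
#define NUD_NONE      0x00
#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04
#define NUD_DELAY     0x08
#define NUD_PROBE     0x10
#define NUD_NOARP     0x40
#define NUD_PERMANENT 0x80
#define NUD_VALID     (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
		       NUD_PROBE | NUD_STALE | NUD_DELAY)
#define NTF_USE       (1u << 0)
#define NTF_MANAGED   (1u << 8) /* (1 << 0) << NTF_EXT_SHIFT */

/* State that the "extern_valid" check looks at. */
static uint8_t effective_state(uint8_t requested, uint8_t existing,
			       uint32_t ndm_flags, bool creating)
{
	if (!(ndm_flags & (NTF_USE | NTF_MANAGED)))
		return requested;

	/* New entries start out as "none"; existing ones keep their
	 * state, except that "permanent" is cleared.
	 */
	return creating ? NUD_NONE : (uint8_t)(existing & ~NUD_PERMANENT);
}

static bool ext_validated_allowed(uint8_t requested, uint8_t existing,
				  uint32_t ndm_flags, bool creating)
{
	return effective_state(requested, existing, ndm_flags, creating) &
	       NUD_VALID;
}
```

Creating a "stale" entry with "use" is evaluated as "none" and
rejected, while replacing an existing "stale" entry with "use" is
evaluated as "stale" and accepted.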

One consequence of preventing the kernel from invalidating a neighbor
entry is that by default it will only try to determine reachability
using unicast probes. This can be changed using the "mcast_resolicit"
sysctl:

 # sysctl net.ipv4.neigh.br0/10.mcast_resolicit
 0
 # tcpdump -nn -e -i br0.10 -Q out arp &
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3
 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
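
The number of requests in the trace follows from how the probe budget
is computed. The sketch below is loosely modelled on my understanding
of the kernel's neigh_max_probes() helper. The struct and its field
names are hypothetical, and the unicast value of 3 is an assumption
that happens to agree with the three unicast requests seen above.

```c
#include <stdint.h>

#define NUD_INCOMPLETE 0x01
#define NUD_PROBE      0x10

/* Hypothetical container for the per-device probe settings. */
struct probe_parms {
	int ucast_probes;   /* unicast solicitations */
	int app_probes;     /* solicitations delegated to user space */
	int mcast_probes;   /* multicast solicitations during resolution */
	int mcast_reprobes; /* "mcast_resolicit": multicast while in "probe" */
};

/* Total probes sent before the entry is declared unreachable. */
static int max_probes(const struct probe_parms *p, uint8_t nud_state)
{
	return p->ucast_probes + p->app_probes +
	       ((nud_state & NUD_PROBE) ? p->mcast_reprobes : p->mcast_probes);
}
```

Since an "extern_valid" entry never becomes "failed" and is therefore
not resolved again from scratch, it is only ever probed in the "probe"
state, where the multicast share of the budget comes from
"mcast_resolicit" alone.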

iproute2 patches can be found here [2].

[1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03
[2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Authored by Ido Schimmel and committed by Jakub Kicinski
03dc03fa af232e76

+78 -11
+1
Documentation/netlink/specs/rt-neigh.yaml
···
     entries:
       - managed
       - locked
+      - ext-validated
   -
     name: rtm-type
     type: enum
+3 -1
include/net/neighbour.h
···
 #define NEIGH_UPDATE_F_EXT_LEARNED BIT(5)
 #define NEIGH_UPDATE_F_ISROUTER BIT(6)
 #define NEIGH_UPDATE_F_ADMIN BIT(7)
+#define NEIGH_UPDATE_F_EXT_VALIDATED BIT(8)
 
 /* In-kernel representation for NDA_FLAGS_EXT flags: */
 #define NTF_OLD_MASK 0xff
 #define NTF_EXT_SHIFT 8
-#define NTF_EXT_MASK (NTF_EXT_MANAGED)
+#define NTF_EXT_MASK (NTF_EXT_MANAGED | NTF_EXT_EXT_VALIDATED)
 
 #define NTF_MANAGED (NTF_EXT_MANAGED << NTF_EXT_SHIFT)
+#define NTF_EXT_VALIDATED (NTF_EXT_EXT_VALIDATED << NTF_EXT_SHIFT)
 
 extern const struct nla_policy nda_policy[];
 
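
The relationship between the user-visible extended flags and the
in-kernel representation can be sketched as follows. The macros are
copied from this patch and the uAPI header. The helper
fold_ext_flags() is hypothetical, and rejecting bits outside
NTF_EXT_MASK is an assumption about how the netlink policy treats
them.

```c
#include <stdint.h>

#define NTF_EXT_SHIFT         8
#define NTF_EXT_MANAGED       (1u << 0)
#define NTF_EXT_LOCKED        (1u << 1)
#define NTF_EXT_EXT_VALIDATED (1u << 2)
#define NTF_EXT_MASK          (NTF_EXT_MANAGED | NTF_EXT_EXT_VALIDATED)
#define NTF_MANAGED           (NTF_EXT_MANAGED << NTF_EXT_SHIFT)
#define NTF_EXT_VALIDATED     (NTF_EXT_EXT_VALIDATED << NTF_EXT_SHIFT)

/* Fold the 8-bit ndm_flags and the NDA_FLAGS_EXT value into one word.
 * Returns 0 on success, -1 if an unsupported extended flag is set.
 */
static int fold_ext_flags(uint8_t ndm_flags, uint32_t ext, uint32_t *out)
{
	if (ext & ~NTF_EXT_MASK)
		return -1;

	/* Extended flags live above the legacy 8 bits. */
	*out = ndm_flags | (ext << NTF_EXT_SHIFT);
	return 0;
}
```

With these definitions NTF_EXT_VALIDATED evaluates to 0x400, so it
cannot collide with any of the legacy flags covered by NTF_OLD_MASK.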
+5
include/uapi/linux/neighbour.h
···
 /* Extended flags under NDA_FLAGS_EXT: */
 #define NTF_EXT_MANAGED (1 << 0)
 #define NTF_EXT_LOCKED (1 << 1)
+#define NTF_EXT_EXT_VALIDATED (1 << 2)
 
 /*
  * Neighbor Cache Entry States.
···
  * bridge in response to a host trying to communicate via a locked bridge port
  * with MAB enabled. Their purpose is to notify user space that a host requires
  * authentication.
+ *
+ * NTF_EXT_EXT_VALIDATED flagged neighbor entries were externally validated by
+ * a user space control plane. The kernel will not remove or invalidate them,
+ * but it can probe them and notify user space when they become reachable.
  */
 
 struct nda_cacheinfo {
+69 -10
net/core/neighbour.c
···
 	if (n->dead)
 		goto out;
 
-	/* remove from the gc list if new state is permanent or if neighbor
-	 * is externally learned; otherwise entry should be on the gc list
+	/* remove from the gc list if new state is permanent or if neighbor is
+	 * externally learned / validated; otherwise entry should be on the gc
+	 * list
 	 */
 	exempt_from_gc = n->nud_state & NUD_PERMANENT ||
-			 n->flags & NTF_EXT_LEARNED;
+			 n->flags & (NTF_EXT_LEARNED | NTF_EXT_VALIDATED);
 	on_gc_list = !list_empty(&n->gc_list);
 
 	if (exempt_from_gc && on_gc_list) {
···
 
 	ndm_flags = (flags & NEIGH_UPDATE_F_EXT_LEARNED) ? NTF_EXT_LEARNED : 0;
 	ndm_flags |= (flags & NEIGH_UPDATE_F_MANAGED) ? NTF_MANAGED : 0;
+	ndm_flags |= (flags & NEIGH_UPDATE_F_EXT_VALIDATED) ? NTF_EXT_VALIDATED : 0;
 
 	if ((old_flags ^ ndm_flags) & NTF_EXT_LEARNED) {
 		if (ndm_flags & NTF_EXT_LEARNED)
···
 			neigh->flags &= ~NTF_MANAGED;
 		*notify = 1;
 		*managed_update = true;
+	}
+	if ((old_flags ^ ndm_flags) & NTF_EXT_VALIDATED) {
+		if (ndm_flags & NTF_EXT_VALIDATED)
+			neigh->flags |= NTF_EXT_VALIDATED;
+		else
+			neigh->flags &= ~NTF_EXT_VALIDATED;
+		*notify = 1;
+		*gc_update = true;
 	}
 }
 
···
 	dev_head = neigh_get_dev_table(dev, tbl->family);
 
 	hlist_for_each_entry_safe(n, tmp, dev_head, dev_list) {
-		if (skip_perm && n->nud_state & NUD_PERMANENT)
+		if (skip_perm &&
+		    (n->nud_state & NUD_PERMANENT ||
+		     n->flags & NTF_EXT_VALIDATED))
 			continue;
 
 		hlist_del_rcu(&n->hash);
···
 
 			state = n->nud_state;
 			if ((state & (NUD_PERMANENT | NUD_IN_TIMER)) ||
-			    (n->flags & NTF_EXT_LEARNED)) {
+			    (n->flags &
+			     (NTF_EXT_LEARNED | NTF_EXT_VALIDATED))) {
 				write_unlock(&n->lock);
 				continue;
 			}
···
 
 	if ((neigh->nud_state & (NUD_INCOMPLETE | NUD_PROBE)) &&
 	    atomic_read(&neigh->probes) >= neigh_max_probes(neigh)) {
-		WRITE_ONCE(neigh->nud_state, NUD_FAILED);
+		if (neigh->nud_state == NUD_PROBE &&
+		    neigh->flags & NTF_EXT_VALIDATED) {
+			WRITE_ONCE(neigh->nud_state, NUD_STALE);
+			neigh->updated = jiffies;
+		} else {
+			WRITE_ONCE(neigh->nud_state, NUD_FAILED);
+			neigh_invalidate(neigh);
+		}
 		notify = 1;
-		neigh_invalidate(neigh);
 		goto out;
 	}
 
···
 				NTF_ROUTER flag.
 	NEIGH_UPDATE_F_ISROUTER	indicates if the neighbour is known as
 				a router.
+	NEIGH_UPDATE_F_EXT_VALIDATED means that the entry will not be removed
+				or invalidated.
 
 	Caller MUST hold reference count on the entry.
  */
···
 	if (ndm_flags & NTF_PROXY) {
 		struct pneigh_entry *pn;
 
-		if (ndm_flags & NTF_MANAGED) {
+		if (ndm_flags & (NTF_MANAGED | NTF_EXT_VALIDATED)) {
 			NL_SET_ERR_MSG(extack, "Invalid NTF_* flag combination");
 			goto out;
 		}
···
 	if (neigh == NULL) {
 		bool ndm_permanent = ndm->ndm_state & NUD_PERMANENT;
 		bool exempt_from_gc = ndm_permanent ||
-				      ndm_flags & NTF_EXT_LEARNED;
+				      ndm_flags & (NTF_EXT_LEARNED |
+						   NTF_EXT_VALIDATED);
 
 		if (!(nlh->nlmsg_flags & NLM_F_CREATE)) {
 			err = -ENOENT;
···
 			err = -EINVAL;
 			goto out;
 		}
+		if (ndm_flags & NTF_EXT_VALIDATED) {
+			u8 state = ndm->ndm_state;
+
+			/* NTF_USE and NTF_MANAGED will result in the neighbor
+			 * being created with an invalid state (NUD_NONE).
+			 */
+			if (ndm_flags & (NTF_USE | NTF_MANAGED))
+				state = NUD_NONE;
+
+			if (!(state & NUD_VALID)) {
+				NL_SET_ERR_MSG(extack,
+					       "Cannot create externally validated neighbor with an invalid state");
+				err = -EINVAL;
+				goto out;
+			}
+		}
 
 		neigh = ___neigh_create(tbl, dst, dev,
 					ndm_flags &
-					(NTF_EXT_LEARNED | NTF_MANAGED),
+					(NTF_EXT_LEARNED | NTF_MANAGED |
+					 NTF_EXT_VALIDATED),
 					exempt_from_gc, true);
 		if (IS_ERR(neigh)) {
 			err = PTR_ERR(neigh);
···
 			err = -EEXIST;
 			neigh_release(neigh);
 			goto out;
+		}
+		if (ndm_flags & NTF_EXT_VALIDATED) {
+			u8 state = ndm->ndm_state;
+
+			/* NTF_USE and NTF_MANAGED do not update the existing
+			 * state other than clearing it if it was
+			 * NUD_PERMANENT.
+			 */
+			if (ndm_flags & (NTF_USE | NTF_MANAGED))
+				state = READ_ONCE(neigh->nud_state) & ~NUD_PERMANENT;
+
+			if (!(state & NUD_VALID)) {
+				NL_SET_ERR_MSG(extack,
+					       "Cannot mark neighbor as externally validated with an invalid state");
+				err = -EINVAL;
+				neigh_release(neigh);
+				goto out;
+			}
 		}
 
 		if (!(nlh->nlmsg_flags & NLM_F_REPLACE))
···
 		flags |= NEIGH_UPDATE_F_MANAGED;
 	if (ndm_flags & NTF_USE)
 		flags |= NEIGH_UPDATE_F_USE;
+	if (ndm_flags & NTF_EXT_VALIDATED)
+		flags |= NEIGH_UPDATE_F_EXT_VALIDATED;
 
 	err = __neigh_update(neigh, lladdr, ndm->ndm_state, flags,
 			     NETLINK_CB(skb).portid, extack);
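
The flag update hunk above uses an XOR to detect whether the
"extern_valid" bit actually changed before touching the entry. A
stand-alone sketch of that pattern follows. The function name and its
signature are hypothetical, and the surrounding kernel function does
more than is shown here.

```c
#include <stdbool.h>
#include <stdint.h>

#define NEIGH_UPDATE_F_EXT_VALIDATED (1u << 8)  /* BIT(8) */
#define NTF_EXT_VALIDATED            (1u << 10) /* (1 << 2) << NTF_EXT_SHIFT */

/* Apply the requested "extern_valid" setting to *neigh_flags. The
 * out-parameters are set only when the bit really changed.
 */
static void update_ext_validated(uint32_t *neigh_flags, uint32_t update_flags,
				 bool *notify, bool *gc_update)
{
	uint32_t want = (update_flags & NEIGH_UPDATE_F_EXT_VALIDATED) ?
			NTF_EXT_VALIDATED : 0;

	/* XOR is non-zero in this bit only if old and new differ. */
	if (!((*neigh_flags ^ want) & NTF_EXT_VALIDATED))
		return;

	if (want)
		*neigh_flags |= NTF_EXT_VALIDATED;
	else
		*neigh_flags &= ~NTF_EXT_VALIDATED;

	*notify = true;    /* user space should hear about the change */
	*gc_update = true; /* gc list membership must be re-evaluated */
}
```

Requesting the flag on an entry that already has it is a no-op, so no
redundant notification is generated.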