Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

netfilter: Introduce egress hook

Support classifying packets with netfilter on egress to satisfy user
requirements such as:
* outbound security policies for containers (Laura)
* filtering and mangling intra-node Direct Server Return (DSR) traffic
  on a load balancer (Laura)
* filtering locally generated traffic coming in through AF_PACKET,
  such as local ARP traffic generated for clustering purposes or DHCP
  (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
* L2 filtering from ingress and egress for AVB (Audio Video Bridging)
  and gPTP with nftables (Pablo)
* in the future: in-kernel NAT64/NAT46 (Pablo)

The egress hook introduced herein complements the ingress hook added by
commit e687ad60af09 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key"). A patch for nftables to hook up
egress rules from user space has been submitted separately, so users may
immediately take advantage of the feature.

Alternatively or in addition to netfilter, packets can be classified
with traffic control (tc). On ingress, packets are classified first by
tc, then by netfilter. On egress, the order is reversed for symmetry.
Conceptually, tc and netfilter can be thought of as layers, with
netfilter layered above tc.

Traffic control is capable of redirecting packets to another interface
(man 8 tc-mirred). E.g., an ingress packet may be redirected from the
host namespace to a container via a veth connection:
  tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)

In this case, netfilter egress classifying is not performed when leaving
the host namespace! That's because the packet is still on the tc layer.
If tc redirects the packet to a physical interface in the host namespace
such that it leaves the system, the packet is never subjected to
netfilter egress classifying. That is only logical since it hasn't
passed through netfilter ingress classifying either.

Packets can alternatively be redirected at the netfilter layer using
nft fwd. Such a packet *is* subjected to netfilter egress classifying
since it has reached the netfilter layer.

Internally, the skb->nf_skip_egress flag controls whether netfilter is
invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may
be called recursively by tunnel drivers such as vxlan, the flag is
reverted to false after sch_handle_egress(). This ensures that
netfilter is applied both on the overlay and underlying network.
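
The flag handling described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_dev`, `model_skb`, `model_tc_egress()` and `model_xmit()` are hypothetical stand-ins rather than kernel APIs, and the real `__dev_queue_xmit()` does far more. The model captures only the order "netfilter, then tc" and the set/clear of the skip flag around the tc step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for kernel structures. */
struct model_dev {
	const char *name;
	struct model_dev *tc_redirect_to; /* tc egress redirects here, if set */
	struct model_dev *tunnel_lower;   /* tunnel transmits via this device, if set */
	int nf_egress_runs;               /* times netfilter egress saw a packet */
};

struct model_skb {
	bool nf_skip_egress;
};

static void model_xmit(struct model_skb *skb, struct model_dev *dev);

/* Stand-in for sch_handle_egress(): returns false if tc consumed the packet. */
static bool model_tc_egress(struct model_skb *skb, struct model_dev *dev)
{
	if (dev->tc_redirect_to) {
		/* The redirect happens while nf_skip_egress is still true. */
		model_xmit(skb, dev->tc_redirect_to);
		return false;
	}
	return true;
}

/* Stand-in for the egress classification step of __dev_queue_xmit(). */
static void model_xmit(struct model_skb *skb, struct model_dev *dev)
{
	assert(skb != NULL && dev != NULL);

	if (!skb->nf_skip_egress)
		dev->nf_egress_runs++;      /* netfilter egress runs first */

	skb->nf_skip_egress = true;         /* packet enters the tc layer */
	if (!model_tc_egress(skb, dev))
		return;
	skb->nf_skip_egress = false;        /* packet leaves the tc layer */

	if (dev->tunnel_lower)              /* e.g. vxlan re-enters transmit */
		model_xmit(skb, dev->tunnel_lower);
}
```

In this model, transmitting over a tunnel device stacked on a physical device increments `nf_egress_runs` on both, because the flag is false again when the tunnel re-enters `model_xmit()`. A packet redirected by tc from one device to another increments the counter on the first device only.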

Interaction between tc and netfilter is possible by setting and querying
skb->mark.

If netfilter egress classifying is not enabled on any interface, it is
patched out of the data path by way of a static_key and doesn't make a
performance difference that is discernible from noise:

Before:             1537 1538 1538 1537 1538 1537 Mb/sec
After:              1536 1534 1539 1539 1539 1540 Mb/sec
Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
After + tc accept:  1419 1424 1418 1419 1422 1420 Mb/sec
Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
After + tc drop:    1616 1624 1625 1624 1622 1619 Mb/sec

When netfilter egress classifying is enabled on at least one interface,
a minimal performance penalty is incurred for every egress packet, even
if the interface it's transmitted over doesn't have any netfilter egress
rules configured. That is caused by checking dev->nf_hooks_egress
against NULL.
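
The two levels of cost described above can be sketched as a tiny userspace model. All names here (`model_netdev`, `model_egress_hooks_registered`, `model_egress_path()`) are hypothetical. A plain counter stands in for the static key, whereas the kernel patches the branch out of the instruction stream instead of comparing a variable at run time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nf_hook_entries. */
struct model_hooks {
	int nhooks;
};

/* Hypothetical stand-in for struct net_device. */
struct model_netdev {
	struct model_hooks *nf_hooks_egress; /* NULL: no egress rules on this device */
};

/* Stand-in for the static key: egress hooks registered system-wide. */
static int model_egress_hooks_registered;

enum model_path {
	PATH_PATCHED_OUT,     /* no egress hook anywhere: no cost */
	PATH_NULL_CHECK_ONLY, /* hooks exist elsewhere: one pointer check */
	PATH_HOOKS_RUN,       /* this device has hooks: they are evaluated */
};

/* Which path does an egress packet on @dev take? */
static enum model_path model_egress_path(const struct model_netdev *dev)
{
	assert(dev != NULL);

	if (model_egress_hooks_registered == 0)
		return PATH_PATCHED_OUT;
	if (dev->nf_hooks_egress == NULL)
		return PATH_NULL_CHECK_ONLY;
	return PATH_HOOKS_RUN;
}
```

The middle case is the one the text calls a minimal penalty: once any interface has egress rules, every other interface pays for the NULL check.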

Measurements were performed on a Core i7-3615QM. Commands to reproduce:
  ip link add dev foo type dummy
  ip link set dev foo up
  modprobe pktgen
  echo "add_device foo" > /proc/net/pktgen/kpktgend_3
  samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Accept all traffic with tc:
  tc qdisc add dev foo clsact
  tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

Drop all traffic with tc:
  tc qdisc add dev foo clsact
  tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
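
For readers unfamiliar with the bytecode strings in the two filters above, here is a minimal sketch of how they are built. The format is an instruction count followed by "code jt jf k" per instruction. Each program is a single classic BPF "return constant" instruction, and in direct-action (da) mode the returned constant is the tc verdict. The `MODEL_*` constants mirror values from the UAPI headers (`BPF_RET`, `BPF_K`, `TC_ACT_OK`, `TC_ACT_SHOT`) so the sketch builds without them, and `model_ret_program()` is a hypothetical helper, not part of tc.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Values mirrored from the UAPI headers so the sketch builds anywhere. */
#define MODEL_BPF_RET     0x06 /* instruction class: return */
#define MODEL_BPF_K       0x00 /* operand source: constant k */
#define MODEL_TC_ACT_OK   0    /* let the packet pass */
#define MODEL_TC_ACT_SHOT 2    /* drop the packet */

/*
 * Format a one-instruction classic BPF program "ret #verdict" in the
 * "count,code jt jf k," notation accepted by tc's bytecode option.
 * Returns the number of characters snprintf() would have written.
 */
static int model_ret_program(char *buf, size_t len, unsigned int verdict)
{
	assert(buf != NULL);

	return snprintf(buf, len, "1,%u 0 0 %u,",
			(unsigned int)(MODEL_BPF_RET | MODEL_BPF_K), verdict);
}
```

Calling the helper with `MODEL_TC_ACT_OK` yields the accept filter's string and with `MODEL_TC_ACT_SHOT` the drop filter's string.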

Apply this patch when measuring packet drops to avoid errors in dmesg:
https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Laura García Liébana <nevola@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Authored by Lukas Wunner and committed by Pablo Neira Ayuso
42df6e1d 17d20784

+168 -10
+3
drivers/net/ifb.c
···
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/moduleparam.h>
+#include <linux/netfilter_netdev.h>
 #include <net/pkt_sched.h>
 #include <net/net_namespace.h>
 
···
 	}
 
 	while ((skb = __skb_dequeue(&txp->tq)) != NULL) {
+		/* Skip tc and netfilter to prevent redirection loop. */
 		skb->redirected = 0;
 		skb->tc_skip_classify = 1;
+		nf_skip_egress(skb, true);
 
 		u64_stats_update_begin(&txp->tsync);
 		txp->tx_packets++;
+4
include/linux/netdevice.h
···
  *	@xps_maps:	XXX: need comments on this one
  *	@miniq_egress:		clsact qdisc specific data for
  *				egress processing
+ *	@nf_hooks_egress:	netfilter hooks executed for egress packets
  *	@qdisc_hash:		qdisc hash table
  *	@watchdog_timeo:	Represents the timeout that is used by
  *				the watchdog (see dev_watchdog())
···
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	struct nf_hook_entries __rcu *nf_hooks_egress;
 #endif
 
 #ifdef CONFIG_NET_SCHED
+86
include/linux/netfilter_netdev.h
···
 }
 #endif /* CONFIG_NETFILTER_INGRESS */
 
+#ifdef CONFIG_NETFILTER_EGRESS
+static inline bool nf_hook_egress_active(void)
+{
+#ifdef CONFIG_JUMP_LABEL
+	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_EGRESS]))
+		return false;
+#endif
+	return true;
+}
+
+/**
+ * nf_hook_egress - classify packets before transmission
+ * @skb: packet to be classified
+ * @rc: result code which shall be returned by __dev_queue_xmit() on failure
+ * @dev: netdev whose egress hooks shall be applied to @skb
+ *
+ * Returns @skb on success or %NULL if the packet was consumed or filtered.
+ * Caller must hold rcu_read_lock.
+ *
+ * On ingress, packets are classified first by tc, then by netfilter.
+ * On egress, the order is reversed for symmetry. Conceptually, tc and
+ * netfilter can be thought of as layers, with netfilter layered above tc:
+ * When tc redirects a packet to another interface, netfilter is not applied
+ * because the packet is on the tc layer.
+ *
+ * The nf_skip_egress flag controls whether netfilter is applied on egress.
+ * It is updated by __netif_receive_skb_core() and __dev_queue_xmit() when the
+ * packet passes through tc and netfilter. Because __dev_queue_xmit() may be
+ * called recursively by tunnel drivers such as vxlan, the flag is reverted to
+ * false after sch_handle_egress(). This ensures that netfilter is applied
+ * both on the overlay and underlying network.
+ */
+static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
+					     struct net_device *dev)
+{
+	struct nf_hook_entries *e;
+	struct nf_hook_state state;
+	int ret;
+
+#ifdef CONFIG_NETFILTER_SKIP_EGRESS
+	if (skb->nf_skip_egress)
+		return skb;
+#endif
+
+	e = rcu_dereference(dev->nf_hooks_egress);
+	if (!e)
+		return skb;
+
+	nf_hook_state_init(&state, NF_NETDEV_EGRESS,
+			   NFPROTO_NETDEV, dev, NULL, NULL,
+			   dev_net(dev), NULL);
+	ret = nf_hook_slow(skb, &state, e, 0);
+
+	if (ret == 1) {
+		return skb;
+	} else if (ret < 0) {
+		*rc = NET_XMIT_DROP;
+		return NULL;
+	} else { /* ret == 0 */
+		*rc = NET_XMIT_SUCCESS;
+		return NULL;
+	}
+}
+#else /* CONFIG_NETFILTER_EGRESS */
+static inline bool nf_hook_egress_active(void)
+{
+	return false;
+}
+
+static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
+					     struct net_device *dev)
+{
+	return skb;
+}
+#endif /* CONFIG_NETFILTER_EGRESS */
+
+static inline void nf_skip_egress(struct sk_buff *skb, bool skip)
+{
+#ifdef CONFIG_NETFILTER_SKIP_EGRESS
+	skb->nf_skip_egress = skip;
+#endif
+}
+
 static inline void nf_hook_netdev_init(struct net_device *dev)
 {
 #ifdef CONFIG_NETFILTER_INGRESS
 	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	RCU_INIT_POINTER(dev->nf_hooks_egress, NULL);
 #endif
 }
 
+4
include/linux/skbuff.h
···
  *	@tc_at_ingress: used within tc_classify to distinguish in/egress
  *	@redirected: packet was redirected by packet classifier
  *	@from_ingress: packet was redirected from the ingress path
+ *	@nf_skip_egress: packet shall skip nf egress - see netfilter_netdev.h
  *	@peeked: this packet has been seen already, so stats have been
  *		done for it, don't do them again
  *	@nf_trace: netfilter packet trace flag
···
 	__u8			redirected:1;
 #ifdef CONFIG_NET_REDIRECT
 	__u8			from_ingress:1;
+#endif
+#ifdef CONFIG_NETFILTER_SKIP_EGRESS
+	__u8			nf_skip_egress:1;
 #endif
 #ifdef CONFIG_TLS_DEVICE
 	__u8			decrypted:1;
+1
include/uapi/linux/netfilter.h
···
 
 enum nf_dev_hooks {
 	NF_NETDEV_INGRESS,
+	NF_NETDEV_EGRESS,
 	NF_NETDEV_NUMHOOKS
 };
 
+13 -2
net/core/dev.c
···
 static struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+#ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
 	struct tcf_result cl_res;
 
···
 	default:
 		break;
 	}
+#endif /* CONFIG_NET_CLS_ACT */
 
 	return skb;
 }
···
 	qdisc_pkt_len_init(skb);
 #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_at_ingress = 0;
-# ifdef CONFIG_NET_EGRESS
+#endif
+#ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
+		if (nf_hook_egress_active()) {
+			skb = nf_hook_egress(skb, &rc, dev);
+			if (!skb)
+				goto out;
+		}
+		nf_skip_egress(skb, true);
 		skb = sch_handle_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
+		nf_skip_egress(skb, false);
 	}
-# endif
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
···
 	if (static_branch_unlikely(&ingress_needed_key)) {
 		bool another = false;
 
+		nf_skip_egress(skb, true);
 		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
 					 &another);
 		if (another)
···
 		if (!skb)
 			goto out;
 
+		nf_skip_egress(skb, false);
 		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
 			goto out;
 	}
+11
net/netfilter/Kconfig
···
 	  This allows you to classify packets from ingress using the Netfilter
 	  infrastructure.
 
+config NETFILTER_EGRESS
+	bool "Netfilter egress support"
+	default y
+	select NET_EGRESS
+	help
+	  This allows you to classify packets before transmission using the
+	  Netfilter infrastructure.
+
+config NETFILTER_SKIP_EGRESS
+	def_bool NETFILTER_EGRESS && (NET_CLS_ACT || IFB)
+
 config NETFILTER_NETLINK
 	tristate
 
+31 -3
net/netfilter/core.c
···
 			return &dev->nf_hooks_ingress;
 		}
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (hooknum == NF_NETDEV_EGRESS) {
+			if (dev && dev_net(dev) == net)
+				return &dev->nf_hooks_egress;
+		}
+#endif
 		WARN_ON_ONCE(1);
 		return NULL;
 	}
···
 		return true;
 
 	return false;
+}
+
+static inline bool nf_egress_hook(const struct nf_hook_ops *reg, int pf)
+{
+	return pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS;
 }
 
 static void nf_static_key_inc(const struct nf_hook_ops *reg, int pf)
···
 
 	switch (pf) {
 	case NFPROTO_NETDEV:
-		err = nf_ingress_check(net, reg, NF_NETDEV_INGRESS);
-		if (err < 0)
-			return err;
+#ifndef CONFIG_NETFILTER_INGRESS
+		if (reg->hooknum == NF_NETDEV_INGRESS)
+			return -EOPNOTSUPP;
+#endif
+#ifndef CONFIG_NETFILTER_EGRESS
+		if (reg->hooknum == NF_NETDEV_EGRESS)
+			return -EOPNOTSUPP;
+#endif
+		if ((reg->hooknum != NF_NETDEV_INGRESS &&
+		     reg->hooknum != NF_NETDEV_EGRESS) ||
+		    !reg->dev || dev_net(reg->dev) != net)
+			return -EINVAL;
 		break;
 	case NFPROTO_INET:
 		if (reg->hooknum != NF_INET_INGRESS)
···
 #ifdef CONFIG_NETFILTER_INGRESS
 	if (nf_ingress_hook(reg, pf))
 		net_inc_ingress_queue();
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (nf_egress_hook(reg, pf))
+		net_inc_egress_queue();
 #endif
 	nf_static_key_inc(reg, pf);
 
···
 #ifdef CONFIG_NETFILTER_INGRESS
 		if (nf_ingress_hook(reg, pf))
 			net_dec_ingress_queue();
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (nf_egress_hook(reg, pf))
+			net_dec_egress_queue();
 #endif
 		nf_static_key_dec(reg, pf);
 	} else {
+3 -1
net/netfilter/nft_chain_filter.c
···
 	.name		= "filter",
 	.type		= NFT_CHAIN_T_DEFAULT,
 	.family		= NFPROTO_NETDEV,
-	.hook_mask	= (1 << NF_NETDEV_INGRESS),
+	.hook_mask	= (1 << NF_NETDEV_INGRESS) |
+			  (1 << NF_NETDEV_EGRESS),
 	.hooks		= {
 		[NF_NETDEV_INGRESS]	= nft_do_chain_netdev,
+		[NF_NETDEV_EGRESS]	= nft_do_chain_netdev,
 	},
 };
 