Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf: Add fd-based tcx multi-prog infra with link support

This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress datapath side, allowing BPF program management based
on fds via the bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work, which we presented at LPC [0] last year
along with a recent update at LSF/MM/BPF this year [3], is to support the
long-awaited BPF link functionality for tc BPF programs, which allows for a
model of safe ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard-to-debug incidents caused either by stale leftover
programs or by 3rd party applications accidentally stepping on each other's
toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep the BPF program
alive. Moreover, hook points do not reference a BPF link; only the
application's fd or pinning does. A BPF link holds metadata specific to the
attachment and implements operations for link creation, (atomic) BPF program
update, detachment and introspection. The motivation for BPF links for tc
BPF programs is multi-fold, for example:

- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]

- From the Cilium side, the tc BPF programs we attach to host-facing veth
devices and phys devices build the core datapath for Kubernetes Pods, and
they implement forwarding, load-balancing, policy, EDT-management, etc,
within BPF. Currently there is no concept of 'safe' ownership; e.g. we've
recently experienced hard-to-debug issues in a user's staging environment
where another Kubernetes application using tc BPF attached to the same
prio/handle of cls_bpf and accidentally wiped all Cilium-based BPF programs
from underneath it. The goal is to establish a clear/safe ownership model
via links which cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line with XDP links: BPF links cannot replace other BPF links, BPF links
cannot replace non-BPF links, non-BPF links cannot replace BPF links, and
lastly only non-BPF links can replace non-BPF links. In the case of Cilium,
this would solve the mentioned issue of a safe ownership model, as 3rd party
applications would not be able to accidentally wipe Cilium programs, even if
they are not BPF link aware.
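The replacement matrix above boils down to one predicate. The following
userspace C sketch (a hypothetical illustration, not kernel code) captures
those semantics:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the tcx/XDP link replacement semantics described above:
 * an existing attachment may only be replaced when both the old and the
 * new attachment are non-link (plain fd-based) attachments. BPF links
 * must instead be detached through their owning application or pin.
 */
static bool replace_allowed(bool old_is_link, bool new_is_link)
{
	/* Links cannot replace links, links cannot replace non-links,
	 * non-links cannot replace links; only non-links may replace
	 * non-links.
	 */
	return !old_is_link && !new_is_link;
}
```

This is what protects a link-attached Cilium program from a link-unaware
3rd party loader: every replacement path involving a link is rejected.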

Earlier attempts [4] tried to integrate BPF links into the core tc machinery
to solve this for cls_bpf, which was intrusive to the generic tc kernel API,
with extensions only specific to cls_bpf, and suboptimal/complex since
cls_bpf could also be wiped from the qdisc. Locking a tc BPF program in place
this way quickly gets into layering hacks given that the two object models
are vastly different.

We instead implemented the tcx (tc 'express') layer, which is an fd-based tc
BPF attach API, so that the BPF link implementation blends in naturally,
similar to other link types which are fd-based, and without the need to
change core tc internal APIs. BPF programs for tc can then be successively
migrated from classic cls_bpf to the new tc BPF link without needing to
change the program's source code; changing the BPF loader mechanics for
attaching is sufficient.

For the current tc framework, there is no change in behavior, and this patch
does not touch tc core kernel APIs either. The gist of this patch is that the
ingress and egress hooks gain a lightweight, qdisc-less extension for BPF to
attach its tc BPF programs, in other words, a minimal entry point for tc BPF.
The name tcx was suggested in discussion of earlier revisions of this work as
a good fit, and to more easily differentiate between the classic cls_bpf
attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from the control plane (slow-path)
data. Earlier versions of this work used a priority to determine ordering and
to express dependencies, similar to classic tc, but it was challenged that
something more future-proof requires a better user experience. This resulted
in the design and development of the generic attach/detach/query API for
multi-progs. See the prior patch with its discussion on the API design. tcx
is the first user, and we plan to integrate others later; for example, one
candidate is multi-prog support for XDP, which would benefit from and have
the same 'look and feel' from an API perspective.
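The dependency expression replacing priorities can be sketched in plain C:
instead of a numeric priority, a new program is placed relative to an
existing anchor (before or after it). This is an illustrative userspace
model only, not the kernel's bpf_mprog implementation:

```c
#include <string.h>

/* Hypothetical illustration of relative ordering: insert new_prog into
 * the progs[] array directly before or after the anchor rel_prog, the
 * way the multi-prog API expresses BPF_F_BEFORE / BPF_F_AFTER relative
 * to an fd or id, instead of relying on priority numbers.
 */
enum { INSERT_BEFORE, INSERT_AFTER };

static int insert_relative(int *progs, int *count, int max,
			   int new_prog, int rel_prog, int how)
{
	int i, pos = -1;

	if (*count >= max)
		return -1;
	for (i = 0; i < *count; i++)
		if (progs[i] == rel_prog)
			pos = (how == INSERT_BEFORE) ? i : i + 1;
	if (pos < 0)
		return -1;	/* anchor program not attached */
	/* Shift the tail up by one slot and place the new program. */
	memmove(&progs[pos + 1], &progs[pos],
		(*count - pos) * sizeof(int));
	progs[pos] = new_prog;
	(*count)++;
	return 0;
}
```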

The goal with tcx is maximum compatibility with existing tc BPF programs, so
that they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive
migration, or for both to cleanly co-exist where needed, given it is all one
logical tc layer and tcx plus the classic tc cls/act build one logical
overall processing pipeline.

tcx supports the simplified return code TCX_NEXT, which is non-terminating
(go to the next program), and the terminating return codes TCX_PASS,
TCX_DROP and TCX_REDIRECT. The fd-based API is behind a static key, so that
when unused the code is not entered. The struct tcx_entry's program array is
currently static, but could be made dynamic if necessary at some point in
the future. The a/b pair swap design has been chosen so that for detachment
there are no allocations which otherwise could fail.
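Per the UAPI comment added by this patch, unknown return codes are mapped to
TCX_NEXT for compatibility. A small userspace C sketch of that sanitization,
using the enum tcx_action_base values from the patch (the helper name here
is illustrative, not the kernel's):

```c
/* Values from enum tcx_action_base in the patch's UAPI header; kept
 * compatible with their TC_ACT_* counterparts.
 */
enum tcx_action_base {
	TCX_NEXT	= -1,	/* non-terminating: run next program */
	TCX_PASS	= 0,	/* terminating */
	TCX_DROP	= 2,	/* terminating */
	TCX_REDIRECT	= 7,	/* terminating */
};

/* Sketch of the return-code handling: defined terminating codes pass
 * through unchanged, everything else behaves as TCX_NEXT.
 */
static int tcx_sanitize_code(int code)
{
	switch (code) {
	case TCX_PASS:
	case TCX_DROP:
	case TCX_REDIRECT:
		return code;
	case TCX_NEXT:
	default:
		return TCX_NEXT;
	}
}
```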
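The a/b pair swap idea can be sketched in userspace C: a bundle embeds two
fixed entries, updates are prepared on the inactive entry, and publication is
a pointer swap, so detach never allocates. This is a simplified model under
that assumption; the kernel additionally publishes via RCU and synchronizes
readers:

```c
#include <stddef.h>

#define MAX_PROGS 64

struct bundle;

struct entry {
	void *progs[MAX_PROGS];
	size_t count;
	struct bundle *parent;
};

struct bundle {
	struct entry a, b;	/* the preallocated a/b pair */
	struct entry *active;	/* what the fast path would read */
};

static struct entry *peer(struct entry *e)
{
	struct bundle *bundle = e->parent;

	return e == &bundle->a ? &bundle->b : &bundle->a;
}

static void bundle_init(struct bundle *bundle)
{
	bundle->a.parent = bundle->b.parent = bundle;
	bundle->a.count = bundle->b.count = 0;
	bundle->active = &bundle->a;
}

/* Copy into the inactive entry plus the new program, then swap. */
static void bundle_attach(struct bundle *bundle, void *prog)
{
	struct entry *act = bundle->active, *inact = peer(act);
	size_t i;

	for (i = 0; i < act->count; i++)
		inact->progs[i] = act->progs[i];
	inact->progs[i] = prog;
	inact->count = act->count + 1;
	bundle->active = inact;		/* publish */
}

/* Copy all but the removed program into the inactive entry, then swap;
 * note that no allocation is involved, so detach cannot fail with -ENOMEM.
 */
static void bundle_detach(struct bundle *bundle, void *prog)
{
	struct entry *act = bundle->active, *inact = peer(act);
	size_t i, j = 0;

	for (i = 0; i < act->count; i++)
		if (act->progs[i] != prog)
			inact->progs[j++] = act->progs[i];
	inact->count = j;
	bundle->active = inact;		/* publish */
}
```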

The work has been tested with the tc-testing selftest suite, all of which
passes, as well as with the tc BPF tests from the BPF CI, and with Cilium's
L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Authored by: Daniel Borkmann
Committed by: Alexei Starovoitov
Commits: e420bed0, 053c8e1f
Diffstat: +948 -157
+3 -1
MAINTAINERS
···
 S:	Maintained
 F:	kernel/bpf/bpf_struct*

-BPF [NETWORKING] (tc BPF, sock_addr)
+BPF [NETWORKING] (tcx & tc BPF, sock_addr)
 M:	Martin KaFai Lau <martin.lau@linux.dev>
 M:	Daniel Borkmann <daniel@iogearbox.net>
 R:	John Fastabend <john.fastabend@gmail.com>
 L:	bpf@vger.kernel.org
 L:	netdev@vger.kernel.org
 S:	Maintained
+F:	include/net/tcx.h
+F:	kernel/bpf/tcx.c
 F:	net/core/filter.c
 F:	net/sched/act_bpf.c
 F:	net/sched/cls_bpf.c
+9
include/linux/bpf_mprog.h
···
 int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
		     struct bpf_mprog_entry *entry);

+static inline bool bpf_mprog_supported(enum bpf_prog_type type)
+{
+	switch (type) {
+	case BPF_PROG_TYPE_SCHED_CLS:
+		return true;
+	default:
+		return false;
+	}
+}
 #endif /* __BPF_MPROG_H */
+6 -9
include/linux/netdevice.h
···
 *	@rx_handler:		handler for received packets
 *	@rx_handler_data:	XXX: need comments on this one
-*	@miniq_ingress:		ingress/clsact qdisc specific data for
-*				ingress processing
+*	@tcx_ingress:		BPF & clsact qdisc specific data for ingress processing
 *	@ingress_queue:		XXX: need comments on this one
 *	@nf_hooks_ingress:	netfilter hooks executed for ingress packets
 *	@broadcast:		hw bcast address
···
 *	@xps_maps:	all CPUs/RXQs maps for XPS device
 *
 *	@xps_maps:	XXX: need comments on this one
-*	@miniq_egress:		clsact qdisc specific data for
-*				egress processing
+*	@tcx_egress:		BPF & clsact qdisc specific data for egress processing
 *	@nf_hooks_egress:	netfilter hooks executed for egress packets
 *	@qdisc_hash:		qdisc hash table
 *	@watchdog_timeo:	Represents the timeout that is used by
···
	unsigned int		xdp_zc_max_segs;
	rx_handler_func_t __rcu	*rx_handler;
	void __rcu		*rx_handler_data;
-
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc __rcu	*miniq_ingress;
+#ifdef CONFIG_NET_XGRESS
+	struct bpf_mprog_entry __rcu *tcx_ingress;
 #endif
	struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
···
 #ifdef CONFIG_XPS
	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 #endif
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc __rcu	*miniq_egress;
+#ifdef CONFIG_NET_XGRESS
+	struct bpf_mprog_entry __rcu *tcx_egress;
 #endif
 #ifdef CONFIG_NETFILTER_EGRESS
	struct nf_hook_entries __rcu *nf_hooks_egress;
+2 -2
include/linux/skbuff.h
···
	__u8			__mono_tc_offset[0];
	/* public: */
	__u8			mono_delivery_time:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
	__u8			tc_skip_classify:1;
 #endif
···
	__u8			csum_not_inet:1;
 #endif

-#ifdef CONFIG_NET_SCHED
+#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
	__u16			tc_index;	/* traffic control index */
 #endif
+1 -1
include/net/sch_generic.h
···
 static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
 {
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
	return skb->tc_at_ingress;
 #else
	return false;
+206
include/net/tcx.h
··· (new file)
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2023 Isovalent */
#ifndef __NET_TCX_H
#define __NET_TCX_H

#include <linux/bpf.h>
#include <linux/bpf_mprog.h>

#include <net/sch_generic.h>

struct mini_Qdisc;

struct tcx_entry {
	struct mini_Qdisc __rcu *miniq;
	struct bpf_mprog_bundle bundle;
	bool miniq_active;
	struct rcu_head rcu;
};

struct tcx_link {
	struct bpf_link link;
	struct net_device *dev;
	u32 location;
};

static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress)
{
#ifdef CONFIG_NET_XGRESS
	skb->tc_at_ingress = ingress;
#endif
}

#ifdef CONFIG_NET_XGRESS
static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry)
{
	struct bpf_mprog_bundle *bundle = entry->parent;

	return container_of(bundle, struct tcx_entry, bundle);
}

static inline struct tcx_link *tcx_link(struct bpf_link *link)
{
	return container_of(link, struct tcx_link, link);
}

static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link)
{
	return tcx_link((struct bpf_link *)link);
}

void tcx_inc(void);
void tcx_dec(void);

static inline void tcx_entry_sync(void)
{
	/* bpf_mprog_entry got a/b swapped, therefore ensure that
	 * there are no inflight users on the old one anymore.
	 */
	synchronize_rcu();
}

static inline void
tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry,
		 bool ingress)
{
	ASSERT_RTNL();
	if (ingress)
		rcu_assign_pointer(dev->tcx_ingress, entry);
	else
		rcu_assign_pointer(dev->tcx_egress, entry);
}

static inline struct bpf_mprog_entry *
tcx_entry_fetch(struct net_device *dev, bool ingress)
{
	ASSERT_RTNL();
	if (ingress)
		return rcu_dereference_rtnl(dev->tcx_ingress);
	else
		return rcu_dereference_rtnl(dev->tcx_egress);
}

static inline struct bpf_mprog_entry *tcx_entry_create(void)
{
	struct tcx_entry *tcx = kzalloc(sizeof(*tcx), GFP_KERNEL);

	if (tcx) {
		bpf_mprog_bundle_init(&tcx->bundle);
		return &tcx->bundle.a;
	}
	return NULL;
}

static inline void tcx_entry_free(struct bpf_mprog_entry *entry)
{
	kfree_rcu(tcx_entry(entry), rcu);
}

static inline struct bpf_mprog_entry *
tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created)
{
	struct bpf_mprog_entry *entry = tcx_entry_fetch(dev, ingress);

	*created = false;
	if (!entry) {
		entry = tcx_entry_create();
		if (!entry)
			return NULL;
		*created = true;
	}
	return entry;
}

static inline void tcx_skeys_inc(bool ingress)
{
	tcx_inc();
	if (ingress)
		net_inc_ingress_queue();
	else
		net_inc_egress_queue();
}

static inline void tcx_skeys_dec(bool ingress)
{
	if (ingress)
		net_dec_ingress_queue();
	else
		net_dec_egress_queue();
	tcx_dec();
}

static inline void tcx_miniq_set_active(struct bpf_mprog_entry *entry,
					const bool active)
{
	ASSERT_RTNL();
	tcx_entry(entry)->miniq_active = active;
}

static inline bool tcx_entry_is_active(struct bpf_mprog_entry *entry)
{
	ASSERT_RTNL();
	return bpf_mprog_total(entry) || tcx_entry(entry)->miniq_active;
}

static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb,
						   int code)
{
	switch (code) {
	case TCX_PASS:
		skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
		fallthrough;
	case TCX_DROP:
	case TCX_REDIRECT:
		return code;
	case TCX_NEXT:
	default:
		return TCX_NEXT;
	}
}
#endif /* CONFIG_NET_XGRESS */

#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL)
int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
void tcx_uninstall(struct net_device *dev, bool ingress);

int tcx_prog_query(const union bpf_attr *attr,
		   union bpf_attr __user *uattr);

static inline void dev_tcx_uninstall(struct net_device *dev)
{
	ASSERT_RTNL();
	tcx_uninstall(dev, true);
	tcx_uninstall(dev, false);
}
#else
static inline int tcx_prog_attach(const union bpf_attr *attr,
				  struct bpf_prog *prog)
{
	return -EINVAL;
}

static inline int tcx_link_attach(const union bpf_attr *attr,
				  struct bpf_prog *prog)
{
	return -EINVAL;
}

static inline int tcx_prog_detach(const union bpf_attr *attr,
				  struct bpf_prog *prog)
{
	return -EINVAL;
}

static inline int tcx_prog_query(const union bpf_attr *attr,
				 union bpf_attr __user *uattr)
{
	return -EINVAL;
}

static inline void dev_tcx_uninstall(struct net_device *dev)
{
}
#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */
#endif /* __NET_TCX_H */
+30 -4
include/uapi/linux/bpf.h
···
	BPF_LSM_CGROUP,
	BPF_STRUCT_OPS,
	BPF_NETFILTER,
+	BPF_TCX_INGRESS,
+	BPF_TCX_EGRESS,
	__MAX_BPF_ATTACH_TYPE
 };
···
	BPF_LINK_TYPE_KPROBE_MULTI = 8,
	BPF_LINK_TYPE_STRUCT_OPS = 9,
	BPF_LINK_TYPE_NETFILTER = 10,
-
+	BPF_LINK_TYPE_TCX = 11,
	MAX_BPF_LINK_TYPE,
 };
···
		__u32		map_fd;		/* struct_ops to attach */
	};
	union {
-		__u32		target_fd;	/* object to attach to */
-		__u32		target_ifindex; /* target ifindex */
+		__u32	target_fd;	/* target object to attach to or ... */
+		__u32	target_ifindex; /* target ifindex */
	};
	__u32		attach_type;	/* attach type */
	__u32		flags;		/* extra flags */
	union {
-		__u32		target_btf_id;	/* btf_id of target to attach to */
+		__u32	target_btf_id;	/* btf_id of target to attach to */
		struct {
			__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
			__u32		iter_info_len;	/* iter_info length */
···
			__s32		priority;
			__u32		flags;
		} netfilter;
+		struct {
+			union {
+				__u32	relative_fd;
+				__u32	relative_id;
+			};
+			__u64		expected_revision;
+		} tcx;
	};
 } link_create;
···
	};
 };

+/* (Simplified) user return codes for tcx prog type.
+ * A valid tcx program must return one of these defined values. All other
+ * return codes are reserved for future use. Must remain compatible with
+ * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
+ * return codes are mapped to TCX_NEXT.
+ */
+enum tcx_action_base {
+	TCX_NEXT	= -1,
+	TCX_PASS	= 0,
+	TCX_DROP	= 2,
+	TCX_REDIRECT	= 7,
+};
+
 struct bpf_xdp_sock {
	__u32 queue_id;
 };
···
			} event; /* BPF_PERF_EVENT_EVENT */
		};
	} perf_event;
+	struct {
+		__u32 ifindex;
+		__u32 attach_type;
+	} tcx;
	};
 } __attribute__((aligned(8)));
+1
kernel/bpf/Kconfig
···
	select TASKS_TRACE_RCU
	select BINARY_PRINTF
	select NET_SOCK_MSG if NET
+	select NET_XGRESS if NET
	select PAGE_POOL if NET
	default n
	help
+1
kernel/bpf/Makefile
···
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
+obj-$(CONFIG_BPF_SYSCALL) += tcx.o
 endif
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
+69 -13
kernel/bpf/syscall.c
···
 #include <linux/trace_events.h>
 #include <net/netfilter/nf_bpf_link.h>

+#include <net/tcx.h>
+
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
			  (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
···
		return BPF_PROG_TYPE_XDP;
	case BPF_LSM_CGROUP:
		return BPF_PROG_TYPE_LSM;
+	case BPF_TCX_INGRESS:
+	case BPF_TCX_EGRESS:
+		return BPF_PROG_TYPE_SCHED_CLS;
	default:
		return BPF_PROG_TYPE_UNSPEC;
	}
 }

-#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd
+#define BPF_PROG_ATTACH_LAST_FIELD expected_revision

-#define BPF_F_ATTACH_MASK \
-	(BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE)
+#define BPF_F_ATTACH_MASK_BASE	\
+	(BPF_F_ALLOW_OVERRIDE |	\
+	 BPF_F_ALLOW_MULTI |	\
+	 BPF_F_REPLACE)
+
+#define BPF_F_ATTACH_MASK_MPROG	\
+	(BPF_F_REPLACE |	\
+	 BPF_F_BEFORE |		\
+	 BPF_F_AFTER |		\
+	 BPF_F_ID |		\
+	 BPF_F_LINK)

 static int bpf_prog_attach(const union bpf_attr *attr)
 {
	enum bpf_prog_type ptype;
	struct bpf_prog *prog;
+	u32 mask;
	int ret;

	if (CHECK_ATTR(BPF_PROG_ATTACH))
		return -EINVAL;

-	if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
-		return -EINVAL;
-
	ptype = attach_type_to_prog_type(attr->attach_type);
	if (ptype == BPF_PROG_TYPE_UNSPEC)
		return -EINVAL;
+	mask = bpf_mprog_supported(ptype) ?
+	       BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE;
+	if (attr->attach_flags & ~mask)
+		return -EINVAL;

	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
···
	else
		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_prog_attach(attr, prog);
+		break;
	default:
		ret = -EINVAL;
	}
···
	return ret;
 }

-#define BPF_PROG_DETACH_LAST_FIELD attach_type
+#define BPF_PROG_DETACH_LAST_FIELD expected_revision

 static int bpf_prog_detach(const union bpf_attr *attr)
 {
+	struct bpf_prog *prog = NULL;
	enum bpf_prog_type ptype;
+	int ret;

	if (CHECK_ATTR(BPF_PROG_DETACH))
		return -EINVAL;

	ptype = attach_type_to_prog_type(attr->attach_type);
+	if (bpf_mprog_supported(ptype)) {
+		if (ptype == BPF_PROG_TYPE_UNSPEC)
+			return -EINVAL;
+		if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG)
+			return -EINVAL;
+		if (attr->attach_bpf_fd) {
+			prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+			if (IS_ERR(prog))
+				return PTR_ERR(prog);
+		}
+	}

	switch (ptype) {
	case BPF_PROG_TYPE_SK_MSG:
	case BPF_PROG_TYPE_SK_SKB:
-		return sock_map_prog_detach(attr, ptype);
+		ret = sock_map_prog_detach(attr, ptype);
+		break;
	case BPF_PROG_TYPE_LIRC_MODE2:
-		return lirc_prog_detach(attr);
+		ret = lirc_prog_detach(attr);
+		break;
	case BPF_PROG_TYPE_FLOW_DISSECTOR:
-		return netns_bpf_prog_detach(attr, ptype);
+		ret = netns_bpf_prog_detach(attr, ptype);
+		break;
	case BPF_PROG_TYPE_CGROUP_DEVICE:
	case BPF_PROG_TYPE_CGROUP_SKB:
	case BPF_PROG_TYPE_CGROUP_SOCK:
···
	case BPF_PROG_TYPE_CGROUP_SYSCTL:
	case BPF_PROG_TYPE_SOCK_OPS:
	case BPF_PROG_TYPE_LSM:
-		return cgroup_bpf_prog_detach(attr, ptype);
+		ret = cgroup_bpf_prog_detach(attr, ptype);
+		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_prog_detach(attr, prog);
+		break;
	default:
-		return -EINVAL;
+		ret = -EINVAL;
	}
+
+	if (prog)
+		bpf_prog_put(prog);
+	return ret;
 }

-#define BPF_PROG_QUERY_LAST_FIELD query.prog_attach_flags
+#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags

 static int bpf_prog_query(const union bpf_attr *attr,
			  union bpf_attr __user *uattr)
···
	case BPF_SK_MSG_VERDICT:
	case BPF_SK_SKB_VERDICT:
		return sock_map_bpf_prog_query(attr, uattr);
+	case BPF_TCX_INGRESS:
+	case BPF_TCX_EGRESS:
+		return tcx_prog_query(attr, uattr);
	default:
		return -EINVAL;
	}
···
			goto out;
		}
		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		if (attr->link_create.attach_type != BPF_TCX_INGRESS &&
+		    attr->link_create.attach_type != BPF_TCX_EGRESS) {
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
	default:
		ptype = attach_type_to_prog_type(attr->link_create.attach_type);
		if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
···
 #ifdef CONFIG_NET
	case BPF_PROG_TYPE_XDP:
		ret = bpf_xdp_link_attach(attr, prog);
		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_link_attach(attr, prog);
+		break;
	case BPF_PROG_TYPE_NETFILTER:
		ret = bpf_nf_link_attach(attr, prog);
+348
kernel/bpf/tcx.c
··· (new file)
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2023 Isovalent */

#include <linux/bpf.h>
#include <linux/bpf_mprog.h>
#include <linux/netdevice.h>

#include <net/tcx.h>

int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
{
	bool created, ingress = attr->attach_type == BPF_TCX_INGRESS;
	struct net *net = current->nsproxy->net_ns;
	struct bpf_mprog_entry *entry, *entry_new;
	struct bpf_prog *replace_prog = NULL;
	struct net_device *dev;
	int ret;

	rtnl_lock();
	dev = __dev_get_by_index(net, attr->target_ifindex);
	if (!dev) {
		ret = -ENODEV;
		goto out;
	}
	if (attr->attach_flags & BPF_F_REPLACE) {
		replace_prog = bpf_prog_get_type(attr->replace_bpf_fd,
						 prog->type);
		if (IS_ERR(replace_prog)) {
			ret = PTR_ERR(replace_prog);
			replace_prog = NULL;
			goto out;
		}
	}
	entry = tcx_entry_fetch_or_create(dev, ingress, &created);
	if (!entry) {
		ret = -ENOMEM;
		goto out;
	}
	ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog,
			       attr->attach_flags, attr->relative_fd,
			       attr->expected_revision);
	if (!ret) {
		if (entry != entry_new) {
			tcx_entry_update(dev, entry_new, ingress);
			tcx_entry_sync();
			tcx_skeys_inc(ingress);
		}
		bpf_mprog_commit(entry);
	} else if (created) {
		tcx_entry_free(entry);
	}
out:
	if (replace_prog)
		bpf_prog_put(replace_prog);
	rtnl_unlock();
	return ret;
}

int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
{
	bool ingress = attr->attach_type == BPF_TCX_INGRESS;
	struct net *net = current->nsproxy->net_ns;
	struct bpf_mprog_entry *entry, *entry_new;
	struct net_device *dev;
	int ret;

	rtnl_lock();
	dev = __dev_get_by_index(net, attr->target_ifindex);
	if (!dev) {
		ret = -ENODEV;
		goto out;
	}
	entry = tcx_entry_fetch(dev, ingress);
	if (!entry) {
		ret = -ENOENT;
		goto out;
	}
	ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags,
			       attr->relative_fd, attr->expected_revision);
	if (!ret) {
		if (!tcx_entry_is_active(entry_new))
			entry_new = NULL;
		tcx_entry_update(dev, entry_new, ingress);
		tcx_entry_sync();
		tcx_skeys_dec(ingress);
		bpf_mprog_commit(entry);
		if (!entry_new)
			tcx_entry_free(entry);
	}
out:
	rtnl_unlock();
	return ret;
}

void tcx_uninstall(struct net_device *dev, bool ingress)
{
	struct bpf_tuple tuple = {};
	struct bpf_mprog_entry *entry;
	struct bpf_mprog_fp *fp;
	struct bpf_mprog_cp *cp;

	entry = tcx_entry_fetch(dev, ingress);
	if (!entry)
		return;
	tcx_entry_update(dev, NULL, ingress);
	tcx_entry_sync();
	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
		if (tuple.link)
			tcx_link(tuple.link)->dev = NULL;
		else
			bpf_prog_put(tuple.prog);
		tcx_skeys_dec(ingress);
	}
	WARN_ON_ONCE(tcx_entry(entry)->miniq_active);
	tcx_entry_free(entry);
}

int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
{
	bool ingress = attr->query.attach_type == BPF_TCX_INGRESS;
	struct net *net = current->nsproxy->net_ns;
	struct bpf_mprog_entry *entry;
	struct net_device *dev;
	int ret;

	rtnl_lock();
	dev = __dev_get_by_index(net, attr->query.target_ifindex);
	if (!dev) {
		ret = -ENODEV;
		goto out;
	}
	entry = tcx_entry_fetch(dev, ingress);
	if (!entry) {
		ret = -ENOENT;
		goto out;
	}
	ret = bpf_mprog_query(attr, uattr, entry);
out:
	rtnl_unlock();
	return ret;
}

static int tcx_link_prog_attach(struct bpf_link *link, u32 flags, u32 id_or_fd,
				u64 revision)
{
	struct tcx_link *tcx = tcx_link(link);
	bool created, ingress = tcx->location == BPF_TCX_INGRESS;
	struct bpf_mprog_entry *entry, *entry_new;
	struct net_device *dev = tcx->dev;
	int ret;

	ASSERT_RTNL();
	entry = tcx_entry_fetch_or_create(dev, ingress, &created);
	if (!entry)
		return -ENOMEM;
	ret = bpf_mprog_attach(entry, &entry_new, link->prog, link, NULL, flags,
			       id_or_fd, revision);
	if (!ret) {
		if (entry != entry_new) {
			tcx_entry_update(dev, entry_new, ingress);
			tcx_entry_sync();
			tcx_skeys_inc(ingress);
		}
		bpf_mprog_commit(entry);
	} else if (created) {
		tcx_entry_free(entry);
	}
	return ret;
}

static void tcx_link_release(struct bpf_link *link)
{
	struct tcx_link *tcx = tcx_link(link);
	bool ingress = tcx->location == BPF_TCX_INGRESS;
	struct bpf_mprog_entry *entry, *entry_new;
	struct net_device *dev;
	int ret = 0;

	rtnl_lock();
	dev = tcx->dev;
	if (!dev)
		goto out;
	entry = tcx_entry_fetch(dev, ingress);
	if (!entry) {
		ret = -ENOENT;
		goto out;
	}
	ret = bpf_mprog_detach(entry, &entry_new, link->prog, link, 0, 0, 0);
	if (!ret) {
		if (!tcx_entry_is_active(entry_new))
			entry_new = NULL;
		tcx_entry_update(dev, entry_new, ingress);
		tcx_entry_sync();
		tcx_skeys_dec(ingress);
		bpf_mprog_commit(entry);
		if (!entry_new)
			tcx_entry_free(entry);
		tcx->dev = NULL;
	}
out:
	WARN_ON_ONCE(ret);
	rtnl_unlock();
}

static int tcx_link_update(struct bpf_link *link, struct bpf_prog *nprog,
			   struct bpf_prog *oprog)
{
	struct tcx_link *tcx = tcx_link(link);
	bool ingress = tcx->location == BPF_TCX_INGRESS;
	struct bpf_mprog_entry *entry, *entry_new;
	struct net_device *dev;
	int ret = 0;

	rtnl_lock();
	dev = tcx->dev;
	if (!dev) {
		ret = -ENOLINK;
		goto out;
	}
	if (oprog && link->prog != oprog) {
		ret = -EPERM;
		goto out;
	}
	oprog = link->prog;
	if (oprog == nprog) {
		bpf_prog_put(nprog);
		goto out;
	}
	entry = tcx_entry_fetch(dev, ingress);
	if (!entry) {
		ret = -ENOENT;
		goto out;
	}
	ret = bpf_mprog_attach(entry, &entry_new, nprog, link, oprog,
			       BPF_F_REPLACE | BPF_F_ID,
			       link->prog->aux->id, 0);
	if (!ret) {
		WARN_ON_ONCE(entry != entry_new);
		oprog = xchg(&link->prog, nprog);
		bpf_prog_put(oprog);
		bpf_mprog_commit(entry);
	}
out:
	rtnl_unlock();
	return ret;
}

static void tcx_link_dealloc(struct bpf_link *link)
{
	kfree(tcx_link(link));
}

static void tcx_link_fdinfo(const struct bpf_link *link, struct seq_file *seq)
{
	const struct tcx_link *tcx = tcx_link_const(link);
	u32 ifindex = 0;

	rtnl_lock();
	if (tcx->dev)
		ifindex = tcx->dev->ifindex;
	rtnl_unlock();

	seq_printf(seq, "ifindex:\t%u\n", ifindex);
	seq_printf(seq, "attach_type:\t%u (%s)\n",
		   tcx->location,
		   tcx->location == BPF_TCX_INGRESS ? "ingress" : "egress");
}

static int tcx_link_fill_info(const struct bpf_link *link,
			      struct bpf_link_info *info)
{
	const struct tcx_link *tcx = tcx_link_const(link);
	u32 ifindex = 0;

	rtnl_lock();
	if (tcx->dev)
		ifindex = tcx->dev->ifindex;
	rtnl_unlock();

	info->tcx.ifindex = ifindex;
	info->tcx.attach_type = tcx->location;
	return 0;
}

static int tcx_link_detach(struct bpf_link *link)
{
	tcx_link_release(link);
	return 0;
}

static const struct bpf_link_ops tcx_link_lops = {
	.release = tcx_link_release,
	.detach = tcx_link_detach,
	.dealloc = tcx_link_dealloc,
	.update_prog = tcx_link_update,
	.show_fdinfo = tcx_link_fdinfo,
	.fill_link_info = tcx_link_fill_info,
};

static int tcx_link_init(struct tcx_link *tcx,
			 struct bpf_link_primer *link_primer,
			 const union bpf_attr *attr,
			 struct net_device *dev,
			 struct bpf_prog *prog)
{
	bpf_link_init(&tcx->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
	tcx->location = attr->link_create.attach_type;
	tcx->dev = dev;
	return bpf_link_prime(&tcx->link, link_primer);
}

int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
{
	struct net *net = current->nsproxy->net_ns;
	struct bpf_link_primer link_primer;
	struct net_device *dev;
	struct tcx_link *tcx;
	int ret;

	rtnl_lock();
	dev = __dev_get_by_index(net, attr->link_create.target_ifindex);
	if (!dev) {
		ret = -ENODEV;
		goto out;
	}
	tcx = kzalloc(sizeof(*tcx), GFP_USER);
	if (!tcx) {
		ret = -ENOMEM;
		goto out;
	}
	ret = tcx_link_init(tcx, &link_primer, attr, dev, prog);
	if (ret) {
		kfree(tcx);
		goto out;
	}
	ret = tcx_link_prog_attach(&tcx->link, attr->link_create.flags,
				   attr->link_create.tcx.relative_fd,
				   attr->link_create.tcx.expected_revision);
	if (ret) {
		tcx->dev = NULL;
		bpf_link_cleanup(&link_primer);
		goto out;
	}
	ret = bpf_link_settle(&link_primer);
out:
	rtnl_unlock();
	return ret;
}
+5
net/Kconfig
···
 config NET_EGRESS
 	bool
 
+config NET_XGRESS
+	select NET_INGRESS
+	select NET_EGRESS
+	bool
+
 config NET_REDIRECT
 	bool
 
+176 -115
net/core/dev.c
···
 #include <net/pkt_cls.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
+#include <net/tcx.h>
 #include <linux/highmem.h>
 #include <linux/init.h>
 #include <linux/module.h>
···
 
 #include "dev.h"
 #include "net-sysfs.h"
-
 static DEFINE_SPINLOCK(ptype_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
···
 EXPORT_SYMBOL(dev_loopback_xmit);
 
 #ifdef CONFIG_NET_EGRESS
-static struct sk_buff *
-sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
-{
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
-	struct tcf_result cl_res;
-
-	if (!miniq)
-		return skb;
-
-	/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
-	tc_skb_cb(skb)->mru = 0;
-	tc_skb_cb(skb)->post_ct = false;
-	mini_qdisc_bstats_cpu_update(miniq, skb);
-
-	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
-	case TC_ACT_OK:
-	case TC_ACT_RECLASSIFY:
-		skb->tc_index = TC_H_MIN(cl_res.classid);
-		break;
-	case TC_ACT_SHOT:
-		mini_qdisc_qstats_cpu_drop(miniq);
-		*ret = NET_XMIT_DROP;
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
-		return NULL;
-	case TC_ACT_STOLEN:
-	case TC_ACT_QUEUED:
-	case TC_ACT_TRAP:
-		*ret = NET_XMIT_SUCCESS;
-		consume_skb(skb);
-		return NULL;
-	case TC_ACT_REDIRECT:
-		/* No need to push/pop skb's mac_header here on egress! */
-		skb_do_redirect(skb);
-		*ret = NET_XMIT_SUCCESS;
-		return NULL;
-	default:
-		break;
-	}
-#endif /* CONFIG_NET_CLS_ACT */
-
-	return skb;
-}
-
 static struct netdev_queue *
 netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
 {
···
 }
 EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
 #endif /* CONFIG_NET_EGRESS */
+
+#ifdef CONFIG_NET_XGRESS
+static int tc_run(struct tcx_entry *entry, struct sk_buff *skb)
+{
+	int ret = TC_ACT_UNSPEC;
+#ifdef CONFIG_NET_CLS_ACT
+	struct mini_Qdisc *miniq = rcu_dereference_bh(entry->miniq);
+	struct tcf_result res;
+
+	if (!miniq)
+		return ret;
+
+	tc_skb_cb(skb)->mru = 0;
+	tc_skb_cb(skb)->post_ct = false;
+
+	mini_qdisc_bstats_cpu_update(miniq, skb);
+	ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
+	/* Only tcf related quirks below. */
+	switch (ret) {
+	case TC_ACT_SHOT:
+		mini_qdisc_qstats_cpu_drop(miniq);
+		break;
+	case TC_ACT_OK:
+	case TC_ACT_RECLASSIFY:
+		skb->tc_index = TC_H_MIN(res.classid);
+		break;
+	}
+#endif /* CONFIG_NET_CLS_ACT */
+	return ret;
+}
+
+static DEFINE_STATIC_KEY_FALSE(tcx_needed_key);
+
+void tcx_inc(void)
+{
+	static_branch_inc(&tcx_needed_key);
+}
+
+void tcx_dec(void)
+{
+	static_branch_dec(&tcx_needed_key);
+}
+
+static __always_inline enum tcx_action_base
+tcx_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
+	const bool needs_mac)
+{
+	const struct bpf_mprog_fp *fp;
+	const struct bpf_prog *prog;
+	int ret = TCX_NEXT;
+
+	if (needs_mac)
+		__skb_push(skb, skb->mac_len);
+	bpf_mprog_foreach_prog(entry, fp, prog) {
+		bpf_compute_data_pointers(skb);
+		ret = bpf_prog_run(prog, skb);
+		if (ret != TCX_NEXT)
+			break;
+	}
+	if (needs_mac)
+		__skb_pull(skb, skb->mac_len);
+	return tcx_action_code(skb, ret);
+}
+
+static __always_inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev, bool *another)
+{
+	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	int sch_ret;
+
+	if (!entry)
+		return skb;
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	qdisc_skb_cb(skb)->pkt_len = skb->len;
+	tcx_set_ingress(skb, true);
+
+	if (static_branch_unlikely(&tcx_needed_key)) {
+		sch_ret = tcx_run(entry, skb, true);
+		if (sch_ret != TC_ACT_UNSPEC)
+			goto ingress_verdict;
+	}
+	sch_ret = tc_run(tcx_entry(entry), skb);
+ingress_verdict:
+	switch (sch_ret) {
+	case TC_ACT_REDIRECT:
+		/* skb_mac_header check was done by BPF, so we can safely
+		 * push the L2 header back before redirecting to another
+		 * netdev.
+		 */
+		__skb_push(skb, skb->mac_len);
+		if (skb_do_redirect(skb) == -EAGAIN) {
+			__skb_pull(skb, skb->mac_len);
+			*another = true;
+			break;
+		}
+		*ret = NET_RX_SUCCESS;
+		return NULL;
+	case TC_ACT_SHOT:
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
+		*ret = NET_RX_DROP;
+		return NULL;
+	/* used by tc_run */
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+	case TC_ACT_TRAP:
+		consume_skb(skb);
+		fallthrough;
+	case TC_ACT_CONSUMED:
+		*ret = NET_RX_SUCCESS;
+		return NULL;
+	}
+
+	return skb;
+}
+
+static __always_inline struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+{
+	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
+	int sch_ret;
+
+	if (!entry)
+		return skb;
+
+	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
+	 * already set by the caller.
+	 */
+	if (static_branch_unlikely(&tcx_needed_key)) {
+		sch_ret = tcx_run(entry, skb, false);
+		if (sch_ret != TC_ACT_UNSPEC)
+			goto egress_verdict;
+	}
+	sch_ret = tc_run(tcx_entry(entry), skb);
+egress_verdict:
+	switch (sch_ret) {
+	case TC_ACT_REDIRECT:
+		/* No need to push/pop skb's mac_header here on egress! */
+		skb_do_redirect(skb);
+		*ret = NET_XMIT_SUCCESS;
+		return NULL;
+	case TC_ACT_SHOT:
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
+		*ret = NET_XMIT_DROP;
+		return NULL;
+	/* used by tc_run */
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+	case TC_ACT_TRAP:
+		*ret = NET_XMIT_SUCCESS;
+		return NULL;
+	}
+
+	return skb;
+}
+#else
+static __always_inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev, bool *another)
+{
+	return skb;
+}
+
+static __always_inline struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+{
+	return skb;
+}
+#endif /* CONFIG_NET_XGRESS */
 
 #ifdef CONFIG_XPS
 static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
···
 	skb_update_prio(skb);
 
 	qdisc_pkt_len_init(skb);
-#ifdef CONFIG_NET_CLS_ACT
-	skb->tc_at_ingress = 0;
-#endif
+	tcx_set_ingress(skb, false);
 #ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
 		if (nf_hook_egress_active()) {
···
 			     unsigned char *addr) __read_mostly;
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
-
-static inline struct sk_buff *
-sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
-		   struct net_device *orig_dev, bool *another)
-{
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
-	struct tcf_result cl_res;
-
-	/* If there's at least one ingress present somewhere (so
-	 * we get here via enabled static key), remaining devices
-	 * that are not configured with an ingress qdisc will bail
-	 * out here.
-	 */
-	if (!miniq)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	qdisc_skb_cb(skb)->pkt_len = skb->len;
-	tc_skb_cb(skb)->mru = 0;
-	tc_skb_cb(skb)->post_ct = false;
-	skb->tc_at_ingress = 1;
-	mini_qdisc_bstats_cpu_update(miniq, skb);
-
-	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
-	case TC_ACT_OK:
-	case TC_ACT_RECLASSIFY:
-		skb->tc_index = TC_H_MIN(cl_res.classid);
-		break;
-	case TC_ACT_SHOT:
-		mini_qdisc_qstats_cpu_drop(miniq);
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
-		*ret = NET_RX_DROP;
-		return NULL;
-	case TC_ACT_STOLEN:
-	case TC_ACT_QUEUED:
-	case TC_ACT_TRAP:
-		consume_skb(skb);
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	case TC_ACT_REDIRECT:
-		/* skb_mac_header check was done by cls/act_bpf, so
-		 * we can safely push the L2 header back before
-		 * redirecting to another netdev
-		 */
-		__skb_push(skb, skb->mac_len);
-		if (skb_do_redirect(skb) == -EAGAIN) {
-			__skb_pull(skb, skb->mac_len);
-			*another = true;
-			break;
-		}
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	case TC_ACT_CONSUMED:
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	default:
-		break;
-	}
-#endif /* CONFIG_NET_CLS_ACT */
-	return skb;
-}
 
 /**
  * netdev_is_rx_handler_busy - check if receive handler is registered
···
 
 	/* Shutdown queueing discipline. */
 	dev_shutdown(dev);
-
+	dev_tcx_uninstall(dev);
 	dev_xdp_uninstall(dev);
 	bpf_dev_bound_netdev_unregister(dev);
 
+2 -2
net/core/filter.c
···
 	__u8 value_reg = si->dst_reg;
 	__u8 skb_reg = si->src_reg;
 
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	/* If the tstamp_type is read,
 	 * the bpf prog is aware the tstamp could have delivery time.
 	 * Thus, read skb->tstamp as is if tstamp_type_access is true.
···
 	__u8 value_reg = si->src_reg;
 	__u8 skb_reg = si->dst_reg;
 
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	/* If the tstamp_type is read,
 	 * the bpf prog is aware the tstamp could have delivery time.
 	 * Thus, write skb->tstamp as is if tstamp_type_access is true.
+2 -2
net/sched/Kconfig
···
 config NET_SCH_INGRESS
 	tristate "Ingress/classifier-action Qdisc"
 	depends on NET_CLS_ACT
-	select NET_INGRESS
-	select NET_EGRESS
+	select NET_XGRESS
 	help
 	  Say Y here if you want to use classifiers for incoming and/or outgoing
 	  packets. This qdisc doesn't do anything else besides running classifiers,
···
 config NET_CLS_ACT
 	bool "Actions"
 	select NET_CLS
+	select NET_XGRESS
 	help
 	  Say Y here if you want to use traffic control actions. Actions
 	  get attached to classifiers and are invoked after a successful
+57 -4
net/sched/sch_ingress.c
···
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
+#include <net/tcx.h>
 
 struct ingress_sched_data {
 	struct tcf_block *block;
···
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry;
+	bool created;
 	int err;
 
 	if (sch->parent != TC_H_INGRESS)
···
 	net_inc_ingress_queue();
 
-	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
+	entry = tcx_entry_fetch_or_create(dev, true, &created);
+	if (!entry)
+		return -ENOMEM;
+	tcx_miniq_set_active(entry, true);
+	mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, true);
 
 	q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	q->block_info.chain_head_change = clsact_chain_head_change;
···
 static void ingress_destroy(struct Qdisc *sch)
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry = rtnl_dereference(dev->tcx_ingress);
 
 	if (sch->parent != TC_H_INGRESS)
 		return;
 
 	tcf_block_put_ext(q->block, sch, &q->block_info);
+
+	if (entry) {
+		tcx_miniq_set_active(entry, false);
+		if (!tcx_entry_is_active(entry)) {
+			tcx_entry_update(dev, NULL, false);
+			tcx_entry_free(entry);
+		}
+	}
+
 	net_dec_ingress_queue();
 }
···
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry;
+	bool created;
 	int err;
 
 	if (sch->parent != TC_H_CLSACT)
···
 	net_inc_ingress_queue();
 	net_inc_egress_queue();
 
-	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
+	entry = tcx_entry_fetch_or_create(dev, true, &created);
+	if (!entry)
+		return -ENOMEM;
+	tcx_miniq_set_active(entry, true);
+	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, true);
 
 	q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	q->ingress_block_info.chain_head_change = clsact_chain_head_change;
···
 
 	mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
 
-	mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
+	entry = tcx_entry_fetch_or_create(dev, false, &created);
+	if (!entry)
+		return -ENOMEM;
+	tcx_miniq_set_active(entry, true);
+	mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, false);
 
 	q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
 	q->egress_block_info.chain_head_change = clsact_chain_head_change;
···
 static void clsact_destroy(struct Qdisc *sch)
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *ingress_entry = rtnl_dereference(dev->tcx_ingress);
+	struct bpf_mprog_entry *egress_entry = rtnl_dereference(dev->tcx_egress);
 
 	if (sch->parent != TC_H_CLSACT)
 		return;
 
-	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
 	tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
+	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
+
+	if (ingress_entry) {
+		tcx_miniq_set_active(ingress_entry, false);
+		if (!tcx_entry_is_active(ingress_entry)) {
+			tcx_entry_update(dev, NULL, true);
+			tcx_entry_free(ingress_entry);
+		}
+	}
+
+	if (egress_entry) {
+		tcx_miniq_set_active(egress_entry, false);
+		if (!tcx_entry_is_active(egress_entry)) {
+			tcx_entry_update(dev, NULL, false);
+			tcx_entry_free(egress_entry);
+		}
+	}
 
 	net_dec_ingress_queue();
 	net_dec_egress_queue();
+30 -4
tools/include/uapi/linux/bpf.h
···
 	BPF_LSM_CGROUP,
 	BPF_STRUCT_OPS,
 	BPF_NETFILTER,
+	BPF_TCX_INGRESS,
+	BPF_TCX_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
···
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
 	BPF_LINK_TYPE_NETFILTER = 10,
-
+	BPF_LINK_TYPE_TCX = 11,
 	MAX_BPF_LINK_TYPE,
 };
···
 		__u32		map_fd;		/* struct_ops to attach */
 	};
 	union {
-		__u32		target_fd;	/* object to attach to */
-		__u32		target_ifindex; /* target ifindex */
+		__u32	target_fd;	/* target object to attach to or ... */
+		__u32	target_ifindex; /* target ifindex */
 	};
 	__u32		attach_type;	/* attach type */
 	__u32		flags;		/* extra flags */
 	union {
-		__u32		target_btf_id;	/* btf_id of target to attach to */
+		__u32	target_btf_id;	/* btf_id of target to attach to */
 		struct {
 			__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
 			__u32		iter_info_len;	/* iter_info length */
···
 			__s32		priority;
 			__u32		flags;
 		} netfilter;
+		struct {
+			union {
+				__u32	relative_fd;
+				__u32	relative_id;
+			};
+			__u64	expected_revision;
+		} tcx;
 	};
 } link_create;
···
 	};
 };
 
+/* (Simplified) user return codes for tcx prog type.
+ * A valid tcx program must return one of these defined values. All other
+ * return codes are reserved for future use. Must remain compatible with
+ * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
+ * return codes are mapped to TCX_NEXT.
+ */
+enum tcx_action_base {
+	TCX_NEXT	= -1,
+	TCX_PASS	= 0,
+	TCX_DROP	= 2,
+	TCX_REDIRECT	= 7,
+};
+
 struct bpf_xdp_sock {
 	__u32 queue_id;
 };
···
 		} event; /* BPF_PERF_EVENT_EVENT */
 		};
 	} perf_event;
+	struct {
+		__u32 ifindex;
+		__u32 attach_type;
+	} tcx;
 	};
 } __attribute__((aligned(8)));