Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode

Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the day, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and, in order
to save 8 bytes in struct sock, made the two mutually exclusive: when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes now also widely supporting cgroup v2, this
assumption no longer holds, and hitting the v2 root fallback in such a
v1/v2 mixed mode becomes a real security issue.

Many cgroup v2 BPF programs are also used for policy enforcement, to pick
just _one_ example, to programmatically deny socket-related system calls
like connect(2) or bind(2). A v2 root fallback would implicitly cause a
policy bypass for the affected Pods.

In production environments, we have recently seen this case due to various
circumstances: i) a third-party agent and/or ii) a container runtime such
as [0] in the user's environment configuring legacy cgroup v1 net_cls tags,
which triggered the root fallback mentioned above. Another case is
Kubernetes projects like kind [1], which create Kubernetes nodes in a
container and also add cgroup namespaces to the mix, meaning that programs
attached to the cgroup v2 root of the cgroup namespace are in fact attached
to a non-root cgroup v2 path from the init namespace's point of view. The
latter's root is out of reach for agents on a kind Kubernetes node to
configure, meaning any entity on the node setting a cgroup v1 net_cls tag
will trigger the bypass despite the cgroup v2 BPF programs attached to the
namespace root.

Generally, this mutual exclusivity no longer holds in today's user
environments and makes cgroup v2 usage from the BPF side fragile and
unreliable. This fix adds a proper struct cgroup pointer for the cgroup v2
case to struct sock_cgroup_data in order to address these issues; this
implicitly also fixes the trade-offs made back then with regard to races
and refcount leaks as stated in bd1060a1d671, and removes the fallback, so
that cgroup v2 BPF programs always operate as expected.

[0] https://github.com/nestybox/sysbox/
[1] https://kind.sigs.k8s.io/

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net

Authored by Daniel Borkmann, committed by Alexei Starovoitov
8520e224 0e6491b5

5 files changed: +41 -155
include/linux/cgroup-defs.h (+27 -80)

--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@
  * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
  * per-socket cgroup information except for memcg association.
  *
- * On legacy hierarchies, net_prio and net_cls controllers directly set
- * attributes on each sock which can then be tested by the network layer.
- * On the default hierarchy, each sock is associated with the cgroup it was
- * created in and the networking layer can match the cgroup directly.
- *
- * To avoid carrying all three cgroup related fields separately in sock,
- * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
- * On boot, sock_cgroup_data records the cgroup that the sock was created
- * in so that cgroup2 matches can be made; however, once either net_prio or
- * net_cls starts being used, the area is overridden to carry prioidx and/or
- * classid. The two modes are distinguished by whether the lowest bit is
- * set. Clear bit indicates cgroup pointer while set bit prioidx and
- * classid.
- *
- * While userland may start using net_prio or net_cls at any time, once
- * either is used, cgroup2 matching no longer works. There is no reason to
- * mix the two and this is in line with how legacy and v2 compatibility is
- * handled. On mode switch, cgroup references which are already being
- * pointed to by socks may be leaked. While this can be remedied by adding
- * synchronization around sock_cgroup_data, given that the number of leaked
- * cgroups is bound and highly unlikely to be high, this seems to be the
- * better trade-off.
+ * On legacy hierarchies, net_prio and net_cls controllers directly
+ * set attributes on each sock which can then be tested by the network
+ * layer. On the default hierarchy, each sock is associated with the
+ * cgroup it was created in and the networking layer can match the
+ * cgroup directly.
  */
 struct sock_cgroup_data {
-	union {
-#ifdef __LITTLE_ENDIAN
-		struct {
-			u8	is_data : 1;
-			u8	no_refcnt : 1;
-			u8	unused : 6;
-			u8	padding;
-			u16	prioidx;
-			u32	classid;
-		} __packed;
-#else
-		struct {
-			u32	classid;
-			u16	prioidx;
-			u8	padding;
-			u8	unused : 6;
-			u8	no_refcnt : 1;
-			u8	is_data : 1;
-		} __packed;
+	struct cgroup	*cgroup; /* v2 */
+#ifdef CONFIG_CGROUP_NET_CLASSID
+	u32		classid; /* v1 */
 #endif
-		u64	val;
-	};
+#ifdef CONFIG_CGROUP_NET_PRIO
+	u16		prioidx; /* v1 */
+#endif
 };
 
-/*
- * There's a theoretical window where the following accessors race with
- * updaters and return part of the previous pointer as the prioidx or
- * classid. Such races are short-lived and the result isn't critical.
- */
 static inline u16 sock_cgroup_prioidx(const struct sock_cgroup_data *skcd)
 {
-	/* fallback to 1 which is always the ID of the root cgroup */
-	return (skcd->is_data & 1) ? skcd->prioidx : 1;
+#ifdef CONFIG_CGROUP_NET_PRIO
+	return READ_ONCE(skcd->prioidx);
+#else
+	return 1;
+#endif
 }
 
 static inline u32 sock_cgroup_classid(const struct sock_cgroup_data *skcd)
 {
-	/* fallback to 0 which is the unconfigured default classid */
-	return (skcd->is_data & 1) ? skcd->classid : 0;
+#ifdef CONFIG_CGROUP_NET_CLASSID
+	return READ_ONCE(skcd->classid);
+#else
+	return 0;
+#endif
 }
 
-/*
- * If invoked concurrently, the updaters may clobber each other. The
- * caller is responsible for synchronization.
- */
 static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
 					   u16 prioidx)
 {
-	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
-
-	if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
-		return;
-
-	if (!(skcd_buf.is_data & 1)) {
-		skcd_buf.val = 0;
-		skcd_buf.is_data = 1;
-	}
-
-	skcd_buf.prioidx = prioidx;
-	WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
+#ifdef CONFIG_CGROUP_NET_PRIO
+	WRITE_ONCE(skcd->prioidx, prioidx);
+#endif
 }
 
 static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
 					   u32 classid)
 {
-	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
-
-	if (sock_cgroup_classid(&skcd_buf) == classid)
-		return;
-
-	if (!(skcd_buf.is_data & 1)) {
-		skcd_buf.val = 0;
-		skcd_buf.is_data = 1;
-	}
-
-	skcd_buf.classid = classid;
-	WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
+#ifdef CONFIG_CGROUP_NET_CLASSID
+	WRITE_ONCE(skcd->classid, classid);
+#endif
 }
 
 #else	/* CONFIG_SOCK_CGROUP_DATA */
include/linux/cgroup.h (+1 -21)

--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@
  */
 #ifdef CONFIG_SOCK_CGROUP_DATA
 
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-extern spinlock_t cgroup_sk_update_lock;
-#endif
-
-void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
 void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 {
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-	unsigned long v;
-
-	/*
-	 * @skcd->val is 64bit but the following is safe on 32bit too as we
-	 * just need the lower ulong to be written and read atomically.
-	 */
-	v = READ_ONCE(skcd->val);
-
-	if (v & 3)
-		return &cgrp_dfl_root.cgrp;
-
-	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
-#else
-	return (struct cgroup *)(unsigned long)skcd->val;
-#endif
+	return skcd->cgroup;
 }
 
 #else	/* CONFIG_CGROUP_DATA */
kernel/cgroup/cgroup.c (+10 -40)

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@
  */
 #ifdef CONFIG_SOCK_CGROUP_DATA
 
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-
-DEFINE_SPINLOCK(cgroup_sk_update_lock);
-static bool cgroup_sk_alloc_disabled __read_mostly;
-
-void cgroup_sk_alloc_disable(void)
-{
-	if (cgroup_sk_alloc_disabled)
-		return;
-	pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
-	cgroup_sk_alloc_disabled = true;
-}
-
-#else
-
-#define cgroup_sk_alloc_disabled	false
-
-#endif
-
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 {
-	if (cgroup_sk_alloc_disabled) {
-		skcd->no_refcnt = 1;
-		return;
-	}
-
 	/* Don't associate the sock with unrelated interrupted task's cgroup. */
 	if (in_interrupt())
 		return;
 
 	rcu_read_lock();
-
 	while (true) {
 		struct css_set *cset;
 
 		cset = task_css_set(current);
 		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
-			skcd->val = (unsigned long)cset->dfl_cgrp;
+			skcd->cgroup = cset->dfl_cgrp;
 			cgroup_bpf_get(cset->dfl_cgrp);
 			break;
 		}
 		cpu_relax();
 	}
-
 	rcu_read_unlock();
 }
 
 void cgroup_sk_clone(struct sock_cgroup_data *skcd)
 {
-	if (skcd->val) {
-		if (skcd->no_refcnt)
-			return;
-		/*
-		 * We might be cloning a socket which is left in an empty
-		 * cgroup and the cgroup might have already been rmdir'd.
-		 * Don't use cgroup_get_live().
-		 */
-		cgroup_get(sock_cgroup_ptr(skcd));
-		cgroup_bpf_get(sock_cgroup_ptr(skcd));
-	}
+	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
+
+	/*
+	 * We might be cloning a socket which is left in an empty
+	 * cgroup and the cgroup might have already been rmdir'd.
+	 * Don't use cgroup_get_live().
+	 */
+	cgroup_get(cgrp);
+	cgroup_bpf_get(cgrp);
 }
 
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
 	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
 
-	if (skcd->no_refcnt)
-		return;
 	cgroup_bpf_put(cgrp);
 	cgroup_put(cgrp);
 }
net/core/netclassid_cgroup.c (+1 -6)

--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@
 	struct update_classid_context *ctx = (void *)v;
 	struct socket *sock = sock_from_file(file);
 
-	if (sock) {
-		spin_lock(&cgroup_sk_update_lock);
+	if (sock)
 		sock_cgroup_set_classid(&sock->sk->sk_cgrp_data, ctx->classid);
-		spin_unlock(&cgroup_sk_update_lock);
-	}
 	if (--ctx->batch == 0) {
 		ctx->batch = UPDATE_CLASSID_BATCH;
 		return n + 1;
@@
 	struct cgroup_cls_state *cs = css_cls_state(css);
 	struct css_task_iter it;
 	struct task_struct *p;
-
-	cgroup_sk_alloc_disable();
 
 	cs->classid = (u32)value;
net/core/netprio_cgroup.c (+2 -8)

--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@
 	if (!dev)
 		return -ENODEV;
 
-	cgroup_sk_alloc_disable();
-
 	rtnl_lock();
 
 	ret = netprio_set_prio(of_css(of), dev, prio);
@@
 static int update_netprio(const void *v, struct file *file, unsigned n)
 {
 	struct socket *sock = sock_from_file(file);
-	if (sock) {
-		spin_lock(&cgroup_sk_update_lock);
+
+	if (sock)
 		sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
 					(unsigned long)v);
-		spin_unlock(&cgroup_sk_update_lock);
-	}
 	return 0;
 }
@@
 {
 	struct task_struct *p;
 	struct cgroup_subsys_state *css;
-
-	cgroup_sk_alloc_disable();
 
 	cgroup_taskset_for_each(p, css, tset) {
 		void *v = (void *)(unsigned long)css->id;