Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bpf: Fix too early release of tcx_entry

Pedro Pinto, and later independently Hyunwoo Kim and Wongi Lee, reported
that the tcx_entry can be released too early, leading to a use-after-free
(UAF) when an active old-style ingress or clsact qdisc with a shared tc
block is later replaced by another ingress or clsact instance.

One example sequence that triggers the UAF is as follows:

1. A network namespace is created
2. An ingress qdisc is created. This allocates a tcx_entry, and
&tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the
same time, a tcf block with index 1 is created.
3. chain0 is attached to the tcf block. chain0 must be connected to
the block linked to the ingress qdisc to later reach the function
tcf_chain0_head_change_cb_del() which triggers the UAF.
4. A clsact qdisc is created and grafted. This causes the ingress qdisc
created in step 2 to be removed, thus freeing the previously linked
tcx_entry:

rtnetlink_rcv_msg()
  => tc_modify_qdisc()
    => qdisc_create()
      => clsact_init() [a]
    => qdisc_graft()
      => qdisc_destroy()
        => __qdisc_destroy()
          => ingress_destroy() [b]
            => tcx_entry_free()
              => kfree_rcu() // tcx_entry freed

5. Finally, the network namespace is closed. This registers the
cleanup_net worker, and during the process of releasing the
remaining clsact qdisc, it accesses the tcx_entry that was
already freed in step 4, causing the UAF to occur:

cleanup_net()
  => ops_exit_list()
    => default_device_exit_batch()
      => unregister_netdevice_many()
        => unregister_netdevice_many_notify()
          => dev_shutdown()
            => qdisc_put()
              => clsact_destroy() [c]
                => tcf_block_put_ext()
                  => tcf_chain0_head_change_cb_del()
                    => tcf_chain_head_change_item()
                      => clsact_chain_head_change()
                        => mini_qdisc_pair_swap() // UAF

There are also other variants; the gist is to add an ingress (or clsact)
qdisc with a specific shared block, replace that qdisc, wait for the
tcx_entry kfree_rcu() to be executed, and subsequently access the
currently active qdisc's miniq one way or another.
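
The early release described above can be replayed in a small userspace
model. This is only a sketch under stated assumptions, not kernel code:
struct flag_entry and the flag_* helpers are hypothetical names, the
"freed" field stands in for tcx_entry_free() having run, and attached BPF
programs are ignored for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the pre-fix state; not kernel code. */
struct flag_entry {
	bool miniq_active;	/* pre-fix: one flag shared by all qdiscs */
	bool freed;		/* stands in for tcx_entry_free() having run */
};

/* Models ingress_init()/clsact_init() marking the miniq as in use. */
static void flag_qdisc_init(struct flag_entry *e)
{
	assert(!e->freed);
	e->miniq_active = true;
}

/* Models ingress_destroy()/clsact_destroy() before the fix. */
static void flag_qdisc_destroy(struct flag_entry *e)
{
	e->miniq_active = false;
	/* The flag cannot tell that a second qdisc still uses the entry. */
	if (!e->miniq_active)
		e->freed = true;
}

/*
 * Replays step 2, [a] and [b]. Returns true if the entry was released
 * while the replacement clsact qdisc still references it.
 */
static bool flag_replay(void)
{
	struct flag_entry e = { false, false };

	flag_qdisc_init(&e);	/* step 2: ingress qdisc created */
	flag_qdisc_init(&e);	/* [a]: clsact_init() during replacement */
	flag_qdisc_destroy(&e);	/* [b]: ingress_destroy() */
	return e.freed;		/* the clsact qdisc is still alive here */
}
```

In this model flag_replay() returns true: the entry is marked freed after
[b] even though the clsact qdisc from [a] is still alive, which is the
window that the later mini_qdisc_pair_swap() access falls into.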

The correct fix is to turn the miniq_active boolean into a counter. With
the counter, step 2 above transitions it from 0->1, step [a] from 1->2 (so
that the miniq object remains active during the replacement), [b] from
2->1, and finally [c] from 1->0 with the eventual release. The reference
counter in general ranges over [0,2] and does not need to be atomic since
all access to it is protected by the rtnl mutex. With this in place, the
UAF no longer occurs and the tcx_entry is freed at the correct time.
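
The counter transitions can be checked with a small userspace model of the
same sequence. Again this is a sketch under assumptions rather than the
kernel implementation: struct cnt_entry and the cnt_* helpers are
hypothetical names, "freed" stands in for tcx_entry_free(), and no locking
is modelled because the real counter relies on the rtnl mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the post-fix state; not kernel code. */
struct cnt_entry {
	unsigned int miniq_active;	/* post-fix: number of qdisc users */
	bool freed;			/* stands in for tcx_entry_free() */
};

/* Models the init side: each qdisc takes one reference. */
static void cnt_qdisc_init(struct cnt_entry *e)
{
	assert(!e->freed);
	e->miniq_active++;
}

/* Models the destroy side: release only when the last user goes away. */
static void cnt_qdisc_destroy(struct cnt_entry *e)
{
	assert(e->miniq_active > 0);
	e->miniq_active--;
	if (e->miniq_active == 0)
		e->freed = true;
}

/*
 * Replays step 2, [a], [b] and [c], recording the counter after each
 * step. Returns true if the entry was freed before the final destroy.
 */
static bool cnt_replay(unsigned int trace[4])
{
	struct cnt_entry e = { 0, false };
	bool freed_early;

	cnt_qdisc_init(&e);	/* step 2: 0->1 */
	trace[0] = e.miniq_active;
	cnt_qdisc_init(&e);	/* [a]: 1->2 */
	trace[1] = e.miniq_active;
	cnt_qdisc_destroy(&e);	/* [b]: 2->1 */
	trace[2] = e.miniq_active;
	freed_early = e.freed;
	cnt_qdisc_destroy(&e);	/* [c]: 1->0, entry released */
	trace[3] = e.miniq_active;
	assert(e.freed);
	return freed_early;
}
```

Here cnt_replay() returns false and the recorded values are 1, 2, 1, 0,
matching the transitions listed above: the entry outlives [b] and is only
released at [c].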

Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Pedro Pinto <xten@osec.io>
Co-developed-by: Pedro Pinto <xten@osec.io>
Signed-off-by: Pedro Pinto <xten@osec.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Hyunwoo Kim <v4bel@theori.io>
Cc: Wongi Lee <qwerty@theori.io>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240708133130.11609-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Authored by Daniel Borkmann and committed by Martin KaFai Lau
1cb6f0ba 83c36e7c

+15 -10
+9 -4
include/net/tcx.h
···
 struct tcx_entry {
 	struct mini_Qdisc __rcu *miniq;
 	struct bpf_mprog_bundle bundle;
-	bool miniq_active;
+	u32 miniq_active;
 	struct rcu_head rcu;
 };

···
 	tcx_dec();
 }

-static inline void tcx_miniq_set_active(struct bpf_mprog_entry *entry,
-					const bool active)
+static inline void tcx_miniq_inc(struct bpf_mprog_entry *entry)
 {
 	ASSERT_RTNL();
-	tcx_entry(entry)->miniq_active = active;
+	tcx_entry(entry)->miniq_active++;
+}
+
+static inline void tcx_miniq_dec(struct bpf_mprog_entry *entry)
+{
+	ASSERT_RTNL();
+	tcx_entry(entry)->miniq_active--;
 }

 static inline bool tcx_entry_is_active(struct bpf_mprog_entry *entry)
+6 -6
net/sched/sch_ingress.c
···
 	entry = tcx_entry_fetch_or_create(dev, true, &created);
 	if (!entry)
 		return -ENOMEM;
-	tcx_miniq_set_active(entry, true);
+	tcx_miniq_inc(entry);
 	mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq);
 	if (created)
 		tcx_entry_update(dev, entry, true);
···
 	tcf_block_put_ext(q->block, sch, &q->block_info);

 	if (entry) {
-		tcx_miniq_set_active(entry, false);
+		tcx_miniq_dec(entry);
 		if (!tcx_entry_is_active(entry)) {
 			tcx_entry_update(dev, NULL, true);
 			tcx_entry_free(entry);
···
 	entry = tcx_entry_fetch_or_create(dev, true, &created);
 	if (!entry)
 		return -ENOMEM;
-	tcx_miniq_set_active(entry, true);
+	tcx_miniq_inc(entry);
 	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq);
 	if (created)
 		tcx_entry_update(dev, entry, true);
···
 	entry = tcx_entry_fetch_or_create(dev, false, &created);
 	if (!entry)
 		return -ENOMEM;
-	tcx_miniq_set_active(entry, true);
+	tcx_miniq_inc(entry);
 	mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq);
 	if (created)
 		tcx_entry_update(dev, entry, false);
···
 	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);

 	if (ingress_entry) {
-		tcx_miniq_set_active(ingress_entry, false);
+		tcx_miniq_dec(ingress_entry);
 		if (!tcx_entry_is_active(ingress_entry)) {
 			tcx_entry_update(dev, NULL, true);
 			tcx_entry_free(ingress_entry);
···
 	}

 	if (egress_entry) {
-		tcx_miniq_set_active(egress_entry, false);
+		tcx_miniq_dec(egress_entry);
 		if (!tcx_entry_is_active(egress_entry)) {
 			tcx_entry_update(dev, NULL, false);
 			tcx_entry_free(egress_entry);
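
The destroy-side hunks decrement the counter and then test
tcx_entry_is_active() before freeing, rather than freeing directly at zero.
The sketch below models one plausible reason for that shape. The body of
tcx_entry_is_active() is not shown in this diff, so the model assumes that
it also treats an entry with attached BPF programs as active; the
prog_total field and all act_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not kernel code. */
struct act_entry {
	unsigned int miniq_active;	/* qdisc users, as in the patch */
	unsigned int prog_total;	/* assumption: attached BPF programs */
	bool freed;			/* stands in for tcx_entry_free() */
};

/* Assumed shape of the activity check: any qdisc user or any program. */
static bool act_is_active(const struct act_entry *e)
{
	return e->miniq_active || e->prog_total;
}

/* Mirrors the destroy hunks: decrement first, then test, then free. */
static void act_qdisc_destroy(struct act_entry *e)
{
	assert(e->miniq_active > 0);
	e->miniq_active--;
	if (!act_is_active(e))
		e->freed = true;
}

/* Destroys one qdisc user and reports whether the entry was released. */
static bool act_freed_after_destroy(unsigned int users, unsigned int progs)
{
	struct act_entry e = { users, progs, false };

	act_qdisc_destroy(&e);
	return e.freed;
}
```

Under these assumptions, destroying the only qdisc user releases the entry,
while a second qdisc user or an attached program keeps it alive.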