
netfilter: conntrack: fix crash due to confirmed bit load reordering

Kajetan Puchalski reports a crash on ARM, with a backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1					CPU2
					encounter 'ct' during hlist walk
delete_from_lists
refcount drops to 0
kmem_cache_free(ct);
__nf_conntrack_alloc() // returns same object
					refcount_inc_not_zero(ct); /* might fail */

					/* If set, ct is public/in the hash table */
					test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.
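The reuse window can be replayed single-threaded in userspace. The
following is a minimal sketch using C11 atomics; 'struct obj', get_ref()
and the 'confirmed' flag are hypothetical stand-ins for struct nf_conn,
refcount_inc_not_zero() and IPS_CONFIRMED_BIT, not the kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* illustrative model of a SLAB_TYPESAFE_BY_RCU object */
struct obj {
	atomic_uint refcount;
	bool confirmed;		/* stale contents survive kmem_cache_free() */
};

/* analog of refcount_inc_not_zero(): takes a reference only if the
 * object is live; note there is no acquire ordering on success.
 */
static bool get_ref(struct obj *o)
{
	unsigned int c = atomic_load_explicit(&o->refcount,
					      memory_order_relaxed);

	while (c != 0) {
		if (atomic_compare_exchange_weak_explicit(&o->refcount, &c,
							  c + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* single-threaded replay of the CPU1/CPU2 interleaving above */
static bool demo_recycled_object(void)
{
	struct obj o;

	atomic_init(&o.refcount, 0);	/* freed: refcount dropped to 0 */
	o.confirmed = true;		/* stale bit from the previous life */

	if (get_ref(&o))		/* CPU2 arrives too early: must fail */
		return false;

	atomic_init(&o.refcount, 1);	/* CPU1 reallocates the same object */
	return get_ref(&o);		/* now succeeds despite stale state */
}
```

The point of the sketch: once the allocator re-initialises the refcount to
1, get_ref() succeeds again on the very same memory, so the caller must
re-validate everything it read about the object.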

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
the next entry in the list. This happens for objects that are about
to be free'd, that have been free'd, or that have been reallocated
by __nf_conntrack_alloc(), but where the refcount has not been
increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
in ct->status. If set, the object is public/in the table.

If not, the object must be skipped; CPU2 calls nf_ct_put() to undo the
refcount increment and moves to the next object.
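Both outcomes can be folded into one lookup helper. This is a hedged
userspace sketch with C11 atomics; try_get_public(), get_ref(), put_ref()
and 'struct obj' are illustrative names, and an acquire fence plays the
role of the kernel's smp_acquire__after_ctrl_dep():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_uint refcount;
	bool confirmed;		/* stands in for IPS_CONFIRMED_BIT */
};

/* refcount_inc_not_zero() analog: no acquire ordering on success */
static bool get_ref(struct obj *o)
{
	unsigned int c = atomic_load_explicit(&o->refcount,
					      memory_order_relaxed);

	while (c != 0) {
		if (atomic_compare_exchange_weak_explicit(&o->refcount, &c,
							  c + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* nf_ct_put() analog */
static void put_ref(struct obj *o)
{
	atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_release);
}

/* returns true iff the caller now holds a reference to a public object */
static bool try_get_public(struct obj *o)
{
	if (!get_ref(o))
		return false;		/* case 1: skip, move to next entry */

	/* order the ->confirmed load after the refcount increment;
	 * models smp_acquire__after_ctrl_dep() in the patch below */
	atomic_thread_fence(memory_order_acquire);

	if (!o->confirmed) {		/* case 2, bit clear: private object */
		put_ref(o);		/* undo the increment and skip */
		return false;
	}
	return true;			/* case 2, bit set: public, usable */
}
```

Without the fence, the 'confirmed' load may be satisfied before the
increment, which is exactly the reordering this patch closes.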

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
CPU will do the unlink; the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() succeeds (the refcount was re-initialised to 1
   on the other CPU).
2. The CONFIRMED test also passes (the load was reordered, or the zeroing
   of ct->status is not yet visible).
3. delete_from_lists unlinks an entry that is not on the hlist, because
   IPS_DYING_BIT is 0 (already cleared).

Step 2 is already wrong: CPU2 will handle a partially initialised object
that is supposed to be private to CPU1.

Add the needed barriers for the case where refcount_inc_not_zero() is
successful, and insert a smp_wmb() before the refcount is set to 1 during
allocation.

Because another CPU might still see the object, refcount_set(1)
"resurrects" it, so we need to make sure that other CPUs will also observe
the right content. In particular, the CONFIRMED bit test must only pass
once the object is fully initialised and either in the hash or about to be
inserted (with locks held to delay possible unlink from early_drop or
gc worker).
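The allocation side of the pairing can be sketched the same way. Below, a
C11 release fence models the inserted smp_wmb(), and a relaxed store
models refcount_set(); 'struct obj' and init_and_resurrect() are
hypothetical names, not the kernel functions:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_uint refcount;
	bool confirmed;		/* stands in for the CONFIRMED bit */
};

/* __nf_conntrack_alloc()-style resurrection of a recycled object */
static void init_and_resurrect(struct obj *o)
{
	o->confirmed = false;	/* zero ->status: object is private for now */

	/* analog of the inserted smp_wmb(): the initialised contents must
	 * be ordered before the store that revives the refcount */
	atomic_thread_fence(memory_order_release);

	/* refcount_set(..., 1) analog: from here on, a concurrent
	 * inc-not-zero on another CPU may succeed again */
	atomic_store_explicit(&o->refcount, 1, memory_order_relaxed);
}
```

Paired with the acquire on the reader side, this guarantees that a lookup
which sees refcount >= 1 also sees the zeroed status bits rather than the
stale CONFIRMED bit of the previous incarnation.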

I did not change flow_offload_alloc(); as far as I can see, it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb, so its refcount should be >= 1 in all cases.

v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
add comment in nf_conntrack_netlink, no control dependency there
due to locks.

Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/all/Yr7WTfd6AVTQkLjI@e126311.manchester.arm.com/
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Will Deacon <will@kernel.org>

 net/netfilter/nf_conntrack_core.c       | 22 ++++++++++++++++++++++
 net/netfilter/nf_conntrack_netlink.c    |  1 +
 net/netfilter/nf_conntrack_standalone.c |  3 +++
 3 files changed, 26 insertions(+)

--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ ... @@
 	if (!refcount_inc_not_zero(&ct->ct_general.use))
 		return;

+	/* load ->status after refcount increase */
+	smp_acquire__after_ctrl_dep();
+
 	if (nf_ct_should_gc(ct))
 		nf_ct_kill(ct);
@@ ... @@
 	 */
 	ct = nf_ct_tuplehash_to_ctrack(h);
 	if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+		/* re-check key after refcount */
+		smp_acquire__after_ctrl_dep();
+
 		if (likely(nf_ct_key_equal(h, tuple, zone, net)))
 			goto found;
@@ ... @@
 		if (!refcount_inc_not_zero(&tmp->ct_general.use))
 			continue;

+		/* load ->ct_net and ->status after refcount increase */
+		smp_acquire__after_ctrl_dep();
+
 		/* kill only if still in same netns -- might have moved due to
 		 * SLAB_TYPESAFE_BY_RCU rules.
 		 *
@@ ... @@
 		/* need to take reference to avoid possible races */
 		if (!refcount_inc_not_zero(&tmp->ct_general.use))
 			continue;
+
+		/* load ->status after refcount increase */
+		smp_acquire__after_ctrl_dep();

 		if (gc_worker_skip_ct(tmp)) {
 			nf_ct_put(tmp);
@@ ... @@
 	}
 	if (!exp)
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
+
+	/* Other CPU might have obtained a pointer to this object before it was
+	 * released.  Because refcount is 0, refcount_inc_not_zero() will fail.
+	 *
+	 * After refcount_set(1) it will succeed; ensure that zeroing of
+	 * ct->status and the correct ct->net pointer are visible; else other
+	 * core might observe CONFIRMED bit which means the entry is valid and
+	 * in the hash table, but its not (anymore).
+	 */
+	smp_wmb();

 	/* Now it is going to be associated with an sk_buff, set refcount to 1. */
 	refcount_set(&ct->ct_general.use, 1);

--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ ... @@
 				    hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (nf_ct_is_expired(ct)) {
+			/* need to defer nf_ct_kill() until lock is released */
 			if (i < ARRAY_SIZE(nf_ct_evict) &&
 			    refcount_inc_not_zero(&ct->ct_general.use))
 				nf_ct_evict[i++] = ct;

--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ ... @@
 	if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
 		return 0;

+	/* load ->status after refcount increase */
+	smp_acquire__after_ctrl_dep();
+
 	if (nf_ct_should_gc(ct)) {
 		nf_ct_kill(ct);
 		goto release;
··· 306 306 if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use))) 307 307 return 0; 308 308 309 + /* load ->status after refcount increase */ 310 + smp_acquire__after_ctrl_dep(); 311 + 309 312 if (nf_ct_should_gc(ct)) { 310 313 nf_ct_kill(ct); 311 314 goto release;