
bpf: Call free_htab_elem() after htab_unlock_bucket()

For htab of maps, when an element is removed from the htab, the htab may hold
the last reference to the inner map. bpf_map_fd_put_ptr() then invokes
bpf_map_free_id() to free the id of the removed map. However,
bpf_map_fd_put_ptr() is invoked while the bucket lock (raw_spinlock_t) is
held, and bpf_map_free_id() attempts to acquire map_idr_lock (spinlock_t),
triggering the following lockdep warning:

=============================
[ BUG: Invalid wait context ]
6.11.0-rc4+ #49 Not tainted
-----------------------------
test_maps/4881 is trying to lock:
ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
other info that might help us debug this:
context-{5:5}
2 locks held by test_maps/4881:
#0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
#1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
stack backtrace:
CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xb0
dump_stack+0x10/0x20
__lock_acquire+0x73e/0x36c0
lock_acquire+0x182/0x450
_raw_spin_lock_irqsave+0x43/0x70
bpf_map_free_id.part.0+0x21/0x70
bpf_map_put+0xcf/0x110
bpf_map_fd_put_ptr+0x9a/0xb0
free_htab_elem+0x69/0xe0
htab_map_update_elem+0x50f/0xa80
bpf_fd_htab_map_update_elem+0x131/0x270
bpf_map_update_value+0x266/0x380
__sys_bpf+0x21bb/0x36b0
__x64_sys_bpf+0x45/0x60
x64_sys_call+0x1b2a/0x20d0
do_syscall_64+0x5d/0x100
entry_SYSCALL_64_after_hwframe+0x76/0x7e

One way to fix the lockdep warning is to use raw_spinlock_t for map_idr_lock
as well. However, bpf_map_alloc_id() invokes idr_alloc_cyclic() after
acquiring map_idr_lock, and that would trigger a similar lockdep warning,
because the slab's lock (s->cpu_slab->lock) is still a spinlock.

Instead of changing the type of map_idr_lock, fix the issue by invoking
htab_put_fd_value() after htab_unlock_bucket(). However, deferring only the
invocation of htab_put_fd_value() is not enough, because the old map pointers
in an htab of maps can not be saved during batched deletion. Therefore, also
defer the invocation of free_htab_elem(), so the to-be-freed elements can be
linked together, similar to what the LRU map does.

There are four callers of ->map_fd_put_ptr():

(1) alloc_htab_elem() (through htab_put_fd_value())
It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of
htab_put_fd_value() can not simply be moved after htab_unlock_bucket(),
because the old element has already been stashed in htab->extra_elems. It may
be reused immediately after htab_unlock_bucket(), so invoking
htab_put_fd_value() at that point could incorrectly release the newly-added
element. Therefore, for htab of maps, save the map pointer of the old element
before unlocking the bucket, and release the map pointer after the unlock.
Besides the map pointer, do the same for the special fields in the old
element.

(2) free_htab_elem() (through htab_put_fd_value())
Its callers include __htab_map_lookup_and_delete_elem(),
htab_map_delete_elem() and __htab_map_lookup_and_delete_batch().

For htab_map_delete_elem(), simply invoke free_htab_elem() after
htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), link the
to-be-freed elements into a node_to_free list, just like the LRU map does,
and invoke free_htab_elem() for these elements after the unlock. It is safe
to reuse batch_flink as the link for node_to_free, because these elements
have already been removed from the hash list.

Because htab of maps doesn't support the lookup_and_delete operation,
__htab_map_lookup_and_delete_elem() doesn't have this problem, so it is kept
as is.

(3) fd_htab_map_free()
It invokes ->map_fd_put_ptr() without holding a raw_spinlock_t.

(4) bpf_fd_htab_map_update_elem()
It invokes ->map_fd_put_ptr() without holding a raw_spinlock_t.

After moving free_htab_elem() out of the scope of the htab bucket lock, use
pcpu_freelist_push() instead of __pcpu_freelist_push() to disable irqs before
freeing elements, and protect the invocations of bpf_mem_cache_free() with a
migrate_{disable|enable} pair.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241106063542.357743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Authored by Hou Tao, committed by Andrii Nakryiko (b9e9ed90, 269e7c97)

+39 -17
kernel/bpf/hashtab.c
···
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	check_and_free_fields(htab, l);
+
+	migrate_disable();
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
 	bpf_mem_cache_free(&htab->ma, l);
+	migrate_enable();
 }
···
 	if (htab_is_prealloc(htab)) {
 		bpf_map_dec_elem_count(&htab->map);
 		check_and_free_fields(htab, l);
-		__pcpu_freelist_push(&htab->freelist, &l->fnode);
+		pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
 		htab_elem_free(htab, l);
···
 		 */
 		pl_new = this_cpu_ptr(htab->extra_elems);
 		l_new = *pl_new;
-		htab_put_fd_value(htab, old_elem);
 		*pl_new = old_elem;
 	} else {
 		struct pcpu_freelist_node *l;
···
 	struct htab_elem *l_new = NULL, *l_old;
 	struct hlist_nulls_head *head;
 	unsigned long flags;
+	void *old_map_ptr;
 	struct bucket *b;
 	u32 key_size, hash;
 	int ret;
···
 	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
 	if (l_old) {
 		hlist_nulls_del_rcu(&l_old->hash_node);
+
+		/* l_old has already been stashed in htab->extra_elems, free
+		 * its special fields before it is available for reuse. Also
+		 * save the old map pointer in htab of maps before unlock
+		 * and release it after unlock.
+		 */
+		old_map_ptr = NULL;
+		if (htab_is_prealloc(htab)) {
+			if (map->ops->map_fd_put_ptr)
+				old_map_ptr = fd_htab_map_get_ptr(map, l_old);
+			check_and_free_fields(htab, l_old);
+		}
+	}
+	htab_unlock_bucket(htab, b, hash, flags);
+	if (l_old) {
+		if (old_map_ptr)
+			map->ops->map_fd_put_ptr(map, old_map_ptr, true);
 		if (!htab_is_prealloc(htab))
 			free_htab_elem(htab, l_old);
-		else
-			check_and_free_fields(htab, l_old);
 	}
-	ret = 0;
+	return 0;
 err:
 	htab_unlock_bucket(htab, b, hash, flags);
 	return ret;
···
 		return ret;
 
 	l = lookup_elem_raw(head, hash, key, key_size);
-
-	if (l) {
+	if (l)
 		hlist_nulls_del_rcu(&l->hash_node);
-		free_htab_elem(htab, l);
-	} else {
+	else
 		ret = -ENOENT;
-	}
 
 	htab_unlock_bucket(htab, b, hash, flags);
+
+	if (l)
+		free_htab_elem(htab, l);
 	return ret;
 }
···
 			 * may cause deadlock. See comments in function
 			 * prealloc_lru_pop(). Let us do bpf_lru_push_free()
 			 * after releasing the bucket lock.
+			 *
+			 * For htab of maps, htab_put_fd_value() in
+			 * free_htab_elem() may acquire a spinlock with bucket
+			 * lock being held and it violates the lock rule, so
+			 * invoke free_htab_elem() after unlock as well.
 			 */
-			if (is_lru_map) {
-				l->batch_flink = node_to_free;
-				node_to_free = l;
-			} else {
-				free_htab_elem(htab, l);
-			}
+			l->batch_flink = node_to_free;
+			node_to_free = l;
 		}
 		dst_key += key_size;
 		dst_val += value_size;
···
 	while (node_to_free) {
 		l = node_to_free;
 		node_to_free = node_to_free->batch_flink;
-		htab_lru_push_free(htab, l);
+		if (is_lru_map)
+			htab_lru_push_free(htab, l);
+		else
+			free_htab_elem(htab, l);
 	}
 
 next_batch: