
netfilter: nf_tables: allow iter callbacks to sleep

Quoting Sven Auhagen:
we do see on occasions that we get the following error message, more so on
x86 systems than on arm64:

Error: Could not process rule: Cannot allocate memory delete table inet filter

It is not a consistent error and does not happen all the time.
We are on Kernel 6.6.80, seems to me like we have something along the lines
of the nf_tables: allow clone callbacks to sleep problem using GFP_ATOMIC.

As hinted at by Sven, this is because of GFP_ATOMIC allocations during
set flush.

When set is flushed, all elements are deactivated. This triggers a set
walk and each element gets added to the transaction list.

The rbtree and rhashtable sets don't allow the iter callback to sleep:
the rbtree walk acquires the read side of an rwlock with bottom halves
disabled, and the rhashtable walk happens with the rcu read lock held.

Rbtree is simple enough to resolve:
When the walk context is ITER_READ, no change is needed (the iter
callback must not deactivate elements; we're not in a transaction).

When the iter type is ITER_UPDATE, the rwlock isn't needed because the
caller holds the transaction mutex, which prevents any and all changes
to the ruleset, including add/remove of set elements.

Rhashtable is slightly more complex.
When the iter type is ITER_READ, no change is needed, like rbtree.

For ITER_UPDATE, we hold the transaction mutex, which prevents elements
from getting freed, even outside of the rcu read lock section.

So build a temporary list of all elements while doing the rcu iteration
and then call the iterator in a second pass.

The disadvantage is the need to iterate twice, but this cost comes with
the benefit of allowing the iter callback to use GFP_KERNEL allocations
in a followup patch.

The new list based logic makes it necessary to catch recursive calls to
the same set earlier.

Such walk -> iter -> walk recursion for the same set can happen during
ruleset validation in case userspace gave us a bogus (cyclic) ruleset
where verdict map m jumps to a chain that sooner or later also calls
"vmap @m".

Before the new ->in_update_walk test, the ruleset is rejected because the
infinite recursion causes ctx->level to exceed the allowed maximum.

But with the new logic added here, elements would get skipped:

nft_rhash_walk_update would see elements that are on the walk_list of
an older stack frame.

As all recursive calls into the same map result in -EMLINK, we can avoid
this problem by using the new in_update_walk flag and rejecting
immediately.

Next patch converts the problematic GFP_ATOMIC allocations.

Reported-by: Sven Auhagen <Sven.Auhagen@belden.com>
Closes: https://lore.kernel.org/netfilter-devel/BY1PR18MB5874110CAFF1ED098D0BC4E7E07BA@BY1PR18MB5874.namprd18.prod.outlook.com/
Signed-off-by: Florian Westphal <fw@strlen.de>

+126 -11
+2 -0
include/net/netfilter/nf_tables.h

···
  * @size: maximum set size
  * @field_len: length of each field in concatenation, bytes
  * @field_count: number of concatenated fields in element
+ * @in_update_walk: true during ->walk() in transaction phase
  * @use: number of rules references to this set
  * @nelems: number of elements
  * @ndeact: number of deactivated elements queued for removal
···
 	u32 size;
 	u8 field_len[NFT_REG32_COUNT];
 	u8 field_count;
+	bool in_update_walk;
 	u32 use;
 	atomic_t nelems;
 	u32 ndeact;
+97 -3
net/netfilter/nft_set_hash.c

···
 struct nft_rhash_elem {
 	struct nft_elem_priv priv;
 	struct rhash_head node;
+	struct llist_node walk_node;
 	u32 wq_gc_seq;
 	struct nft_set_ext ext;
 };
···
 		goto err1;
 
 	he = nft_elem_priv_cast(elem_priv);
+	init_llist_node(&he->walk_node);
 	prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
 						nft_rhash_params);
 	if (IS_ERR(prev))
···
 	};
 	struct nft_rhash_elem *prev;
 
+	init_llist_node(&he->walk_node);
 	prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
 						nft_rhash_params);
 	if (IS_ERR(prev))
···
 	return true;
 }
 
-static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
-			   struct nft_set_iter *iter)
+static void nft_rhash_walk_ro(const struct nft_ctx *ctx, struct nft_set *set,
+			      struct nft_set_iter *iter)
 {
 	struct nft_rhash *priv = nft_set_priv(set);
-	struct nft_rhash_elem *he;
 	struct rhashtable_iter hti;
+	struct nft_rhash_elem *he;
 
 	rhashtable_walk_enter(&priv->ht, &hti);
 	rhashtable_walk_start(&hti);
···
 	}
 	rhashtable_walk_stop(&hti);
 	rhashtable_walk_exit(&hti);
+}
+
+static void nft_rhash_walk_update(const struct nft_ctx *ctx,
+				  struct nft_set *set,
+				  struct nft_set_iter *iter)
+{
+	struct nft_rhash *priv = nft_set_priv(set);
+	struct nft_rhash_elem *he, *tmp;
+	struct llist_node *first_node;
+	struct rhashtable_iter hti;
+	LLIST_HEAD(walk_list);
+
+	lockdep_assert_held(&nft_pernet(ctx->net)->commit_mutex);
+
+	if (set->in_update_walk) {
+		/* This can happen with bogus rulesets during ruleset validation
+		 * when a verdict map causes a jump back to the same map.
+		 *
+		 * Without this extra check the walk_next loop below will see
+		 * elems on the callers walk_list and skip (not validate) them.
+		 */
+		iter->err = -EMLINK;
+		return;
+	}
+
+	/* walk happens under RCU.
+	 *
+	 * We create a snapshot list so ->iter callback can sleep.
+	 * commit_mutex is held, elements can ...
+	 * .. be added in parallel from dataplane (dynset)
+	 * .. be marked as dead in parallel from dataplane (dynset).
+	 * .. be queued for removal in parallel (gc timeout).
+	 * .. not be freed: transaction mutex is held.
+	 */
+	rhashtable_walk_enter(&priv->ht, &hti);
+	rhashtable_walk_start(&hti);
+
+	while ((he = rhashtable_walk_next(&hti))) {
+		if (IS_ERR(he)) {
+			if (PTR_ERR(he) != -EAGAIN) {
+				iter->err = PTR_ERR(he);
+				break;
+			}
+
+			continue;
+		}
+
+		/* rhashtable resized during walk, skip */
+		if (llist_on_list(&he->walk_node))
+			continue;
+
+		llist_add(&he->walk_node, &walk_list);
+	}
+	rhashtable_walk_stop(&hti);
+	rhashtable_walk_exit(&hti);
+
+	first_node = __llist_del_all(&walk_list);
+	set->in_update_walk = true;
+	llist_for_each_entry_safe(he, tmp, first_node, walk_node) {
+		if (iter->err == 0) {
+			iter->err = iter->fn(ctx, set, iter, &he->priv);
+			if (iter->err == 0)
+				iter->count++;
+		}
+
+		/* all entries must be cleared again, else next ->walk iteration
+		 * will skip entries.
+		 */
+		init_llist_node(&he->walk_node);
+	}
+	set->in_update_walk = false;
+}
+
+static void nft_rhash_walk(const struct nft_ctx *ctx, struct nft_set *set,
+			   struct nft_set_iter *iter)
+{
+	switch (iter->type) {
+	case NFT_ITER_UPDATE:
+		/* only relevant for netlink dumps which use READ type */
+		WARN_ON_ONCE(iter->skip != 0);
+
+		nft_rhash_walk_update(ctx, set, iter);
+		break;
+	case NFT_ITER_READ:
+		nft_rhash_walk_ro(ctx, set, iter);
+		break;
+	default:
+		iter->err = -EINVAL;
+		WARN_ON_ONCE(1);
+		break;
+	}
 }
 
 static bool nft_rhash_expr_needs_gc_run(const struct nft_set *set,
+27 -8
net/netfilter/nft_set_rbtree.c

···
 	return NULL;
 }
 
-static void nft_rbtree_walk(const struct nft_ctx *ctx,
-			    struct nft_set *set,
-			    struct nft_set_iter *iter)
+static void nft_rbtree_do_walk(const struct nft_ctx *ctx,
+			       struct nft_set *set,
+			       struct nft_set_iter *iter)
 {
 	struct nft_rbtree *priv = nft_set_priv(set);
 	struct nft_rbtree_elem *rbe;
 	struct rb_node *node;
 
-	read_lock_bh(&priv->lock);
 	for (node = rb_first(&priv->root); node != NULL; node = rb_next(node)) {
 		rbe = rb_entry(node, struct nft_rbtree_elem, node);
 
···
 			goto cont;
 
 		iter->err = iter->fn(ctx, set, iter, &rbe->priv);
-		if (iter->err < 0) {
-			read_unlock_bh(&priv->lock);
+		if (iter->err < 0)
 			return;
-		}
 cont:
 		iter->count++;
 	}
-	read_unlock_bh(&priv->lock);
+}
+
+static void nft_rbtree_walk(const struct nft_ctx *ctx,
+			    struct nft_set *set,
+			    struct nft_set_iter *iter)
+{
+	struct nft_rbtree *priv = nft_set_priv(set);
+
+	switch (iter->type) {
+	case NFT_ITER_UPDATE:
+		lockdep_assert_held(&nft_pernet(ctx->net)->commit_mutex);
+		nft_rbtree_do_walk(ctx, set, iter);
+		break;
+	case NFT_ITER_READ:
+		read_lock_bh(&priv->lock);
+		nft_rbtree_do_walk(ctx, set, iter);
+		read_unlock_bh(&priv->lock);
+		break;
+	default:
+		iter->err = -EINVAL;
+		WARN_ON_ONCE(1);
+		break;
+	}
 }
 
 static void nft_rbtree_gc_remove(struct net *net, struct nft_set *set,