page_pool: Track DMA-mapped pages and unmap them when destroying the pool

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kernel os linux

When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes resource leaks and/or
crashes when there are pages still outstanding while the device is torn
down, because page_pool will attempt an unmap through a non-existent DMA
device on the subsequent page return.

To fix this, implement a simple tracking of outstanding DMA-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straight forward: We simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.

To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, using the upper bits of the pp_magic field. This
requires a couple of defines to avoid conflicting with the
POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
so does not affect run-time performance. The bitmap calculations in this
patch gives the following number of bits for different architectures:

- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures

Stashing a value into the unused bits of pp_magic does have the effect
that it can make the value stored there lie outside the unmappable
range (as governed by the mmap_min_addr sysctl), for architectures that
don't define ILLEGAL_POINTER_VALUE. This means that if one of the
pointers that is aliased to the pp_magic field (such as page->lru.next)
is dereferenced while the page is owned by page_pool, that could lead to
a dereference into userspace, which is a security concern. The risk of
this is mitigated by the fact that (a) we always clear pp_magic before
releasing a page from page_pool, and (b) this would need a
use-after-free bug for struct page, which can have many other risks
since page->lru.next is used as a generic list pointer in multiple
places in the kernel. As such, with this patch we take the position that
this risk is negligible in practice. For more discussion, see[1].

Since all the tracking added in this patch is performed on DMA
map/unmap, no additional code is needed in the fast path, meaning the
performance overhead of this tracking is negligible there. A
micro-benchmark shows that the total overhead of the tracking itself is
about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
Since this cost is only paid on DMA map and unmap, it seems like an
acceptable cost to fix the late unmap issue. Further optimisation can
narrow the cases where this cost is paid (for instance by eliding the
tracking when DMA map/unmap is a no-op).

The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a
full node occurs only 9 bytes of overhead per slot it tracks (in
practice, it probably won't be this efficient, but in any case it should
be an acceptable overhead).

[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
[2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com

Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Qiuling Ren <qren@redhat.com>
Tested-by: Yuying Ma <yuma@redhat.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

authored by

Toke Høiland-Jørgensen and committed by

Jakub Kicinski 11 months ago ee62ce7a cd3c9316

+147 -18

5 changed files

expand all

include

linux

mm.h

poison.h

net

page_pool

types.h

net

core

netmem_priv.h

page_pool.c

+42 -4

include/linux/mm.h

··· 4248 4248 #define VM_SEALED_SYSMAP VM_NONE 4249 4249 #endif 4250 4250 4251 + /* 4252 + * DMA mapping IDs for page_pool 4253 + * 4254 + * When DMA-mapping a page, page_pool allocates an ID (from an xarray) and 4255 + * stashes it in the upper bits of page->pp_magic. We always want to be able to 4256 + * unambiguously identify page pool pages (using page_pool_page_is_pp()). Non-PP 4257 + * pages can have arbitrary kernel pointers stored in the same field as pp_magic 4258 + * (since it overlaps with page->lru.next), so we must ensure that we cannot 4259 + * mistake a valid kernel pointer with any of the values we write into this 4260 + * field. 4261 + * 4262 + * On architectures that set POISON_POINTER_DELTA, this is already ensured, 4263 + * since this value becomes part of PP_SIGNATURE; meaning we can just use the 4264 + * space between the PP_SIGNATURE value (without POISON_POINTER_DELTA), and the 4265 + * lowest bits of POISON_POINTER_DELTA. On arches where POISON_POINTER_DELTA is 4266 + * 0, we make sure that we leave the two topmost bits empty, as that guarantees 4267 + * we won't mistake a valid kernel pointer for a value we set, regardless of the 4268 + * VMSPLIT setting. 4269 + * 4270 + * Altogether, this means that the number of bits available is constrained by 4271 + * the size of an unsigned long (at the upper end, subtracting two bits per the 4272 + * above), and the definition of PP_SIGNATURE (with or without 4273 + * POISON_POINTER_DELTA). 4274 + */ 4275 + #define PP_DMA_INDEX_SHIFT (1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA)) 4276 + #if POISON_POINTER_DELTA > 0 4277 + /* PP_SIGNATURE includes POISON_POINTER_DELTA, so limit the size of the DMA 4278 + * index to not overlap with that if set 4279 + */ 4280 + #define PP_DMA_INDEX_BITS MIN(32, __ffs(POISON_POINTER_DELTA) - PP_DMA_INDEX_SHIFT) 4281 + #else 4282 + /* Always leave out the topmost two; see above. */ 4283 + #define PP_DMA_INDEX_BITS MIN(32, BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 2) 4284 + #endif 4285 + 4286 + #define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \ 4287 + PP_DMA_INDEX_SHIFT) 4288 + 4251 4289 /* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is 4252 4290 * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for 4253 - * the head page of compound page and bit 1 for pfmemalloc page. 4254 - * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling 4255 - * the pfmemalloc page. 4291 + * the head page of compound page and bit 1 for pfmemalloc page, as well as the 4292 + * bits used for the DMA index. page_is_pfmemalloc() is checked in 4293 + * __page_pool_put_page() to avoid recycling the pfmemalloc page. 4256 4294 */ 4257 - #define PP_MAGIC_MASK ~0x3UL 4295 + #define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL) 4258 4296 4259 4297 #ifdef CONFIG_PAGE_POOL 4260 4298 static inline bool page_pool_page_is_pp(struct page *page)

include/linux/poison.h

··· 70 70 #define KEY_DESTROY 0xbd 71 71 72 72 /********** net/core/page_pool.c **********/ 73 + /* 74 + * page_pool uses additional free bits within this value to store data, see the 75 + * definition of PP_DMA_INDEX_MASK in mm.h 76 + */ 73 77 #define PP_SIGNATURE (0x40 + POISON_POINTER_DELTA) 74 78 75 79 /********** net/core/skbuff.c **********/

include/net/page_pool/types.h

··· 6 6 #include <linux/dma-direction.h> 7 7 #include <linux/ptr_ring.h> 8 8 #include <linux/types.h> 9 + #include <linux/xarray.h> 9 10 #include <net/netmem.h> 10 11 11 12 #define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA ··· 33 32 34 33 #define PP_FLAG_ALL (PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \ 35 34 PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM) 35 + 36 + /* Index limit to stay within PP_DMA_INDEX_BITS for DMA indices */ 37 + #define PP_DMA_INDEX_LIMIT XA_LIMIT(1, BIT(PP_DMA_INDEX_BITS) - 1) 36 38 37 39 /* 38 40 * Fast allocation side cache array/stack ··· 224 220 225 221 void *mp_priv; 226 222 const struct memory_provider_ops *mp_ops; 223 + 224 + struct xarray dma_mapped; 227 225 228 226 #ifdef CONFIG_PAGE_POOL_STATS 229 227 /* recycle stats are per-cpu to avoid locking */

+27 -1

net/core/netmem_priv.h

··· 5 5 6 6 static inline unsigned long netmem_get_pp_magic(netmem_ref netmem) 7 7 { 8 - return __netmem_clear_lsb(netmem)->pp_magic; 8 + return __netmem_clear_lsb(netmem)->pp_magic & ~PP_DMA_INDEX_MASK; 9 9 } 10 10 11 11 static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic) ··· 15 15 16 16 static inline void netmem_clear_pp_magic(netmem_ref netmem) 17 17 { 18 + WARN_ON_ONCE(__netmem_clear_lsb(netmem)->pp_magic & PP_DMA_INDEX_MASK); 19 + 18 20 __netmem_clear_lsb(netmem)->pp_magic = 0; 19 21 } 20 22 ··· 34 32 unsigned long dma_addr) 35 33 { 36 34 __netmem_clear_lsb(netmem)->dma_addr = dma_addr; 35 + } 36 + 37 + static inline unsigned long netmem_get_dma_index(netmem_ref netmem) 38 + { 39 + unsigned long magic; 40 + 41 + if (WARN_ON_ONCE(netmem_is_net_iov(netmem))) 42 + return 0; 43 + 44 + magic = __netmem_clear_lsb(netmem)->pp_magic; 45 + 46 + return (magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT; 47 + } 48 + 49 + static inline void netmem_set_dma_index(netmem_ref netmem, 50 + unsigned long id) 51 + { 52 + unsigned long magic; 53 + 54 + if (WARN_ON_ONCE(netmem_is_net_iov(netmem))) 55 + return; 56 + 57 + magic = netmem_get_pp_magic(netmem) | (id << PP_DMA_INDEX_SHIFT); 58 + __netmem_clear_lsb(netmem)->pp_magic = magic; 37 59 } 38 60 #endif

+68 -13

net/core/page_pool.c

··· 276 276 /* Driver calling page_pool_create() also call page_pool_destroy() */ 277 277 refcount_set(&pool->user_cnt, 1); 278 278 279 - if (pool->dma_map) 280 - get_device(pool->p.dev); 279 + xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1); 281 280 282 281 if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) { 283 282 netdev_assert_locked(pool->slow.netdev); ··· 319 320 static void page_pool_uninit(struct page_pool *pool) 320 321 { 321 322 ptr_ring_cleanup(&pool->ring, NULL); 322 - 323 - if (pool->dma_map) 324 - put_device(pool->p.dev); 323 + xa_destroy(&pool->dma_mapped); 325 324 326 325 #ifdef CONFIG_PAGE_POOL_STATS 327 326 if (!pool->system) ··· 460 463 netmem_ref netmem, 461 464 u32 dma_sync_size) 462 465 { 463 - if (pool->dma_sync && dma_dev_need_sync(pool->p.dev)) 464 - __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size); 466 + if (pool->dma_sync && dma_dev_need_sync(pool->p.dev)) { 467 + rcu_read_lock(); 468 + /* re-check under rcu_read_lock() to sync with page_pool_scrub() */ 469 + if (pool->dma_sync) 470 + __page_pool_dma_sync_for_device(pool, netmem, 471 + dma_sync_size); 472 + rcu_read_unlock(); 473 + } 465 474 } 466 475 467 - static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) 476 + static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp) 468 477 { 469 478 dma_addr_t dma; 479 + int err; 480 + u32 id; 470 481 471 482 /* Setup DMA mapping: use 'struct page' area for storing DMA-addr 472 483 * since dma_addr_t can be either 32 or 64 bits and does not always fit ··· 488 483 if (dma_mapping_error(pool->p.dev, dma)) 489 484 return false; 490 485 491 - if (page_pool_set_dma_addr_netmem(netmem, dma)) 486 + if (page_pool_set_dma_addr_netmem(netmem, dma)) { 487 + WARN_ONCE(1, "unexpected DMA address, please report to netdev@"); 492 488 goto unmap_failed; 489 + } 493 490 491 + if (in_softirq()) 492 + err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem), 493 + PP_DMA_INDEX_LIMIT, gfp); 494 + else 495 + err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem), 496 + PP_DMA_INDEX_LIMIT, gfp); 497 + if (err) { 498 + WARN_ONCE(err != -ENOMEM, "couldn't track DMA mapping, please report to netdev@"); 499 + goto unset_failed; 500 + } 501 + 502 + netmem_set_dma_index(netmem, id); 494 503 page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len); 495 504 496 505 return true; 497 506 507 + unset_failed: 508 + page_pool_set_dma_addr_netmem(netmem, 0); 498 509 unmap_failed: 499 - WARN_ONCE(1, "unexpected DMA address, please report to netdev@"); 500 510 dma_unmap_page_attrs(pool->p.dev, dma, 501 511 PAGE_SIZE << pool->p.order, pool->p.dma_dir, 502 512 DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); ··· 528 508 if (unlikely(!page)) 529 509 return NULL; 530 510 531 - if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) { 511 + if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) { 532 512 put_page(page); 533 513 return NULL; 534 514 } ··· 574 554 */ 575 555 for (i = 0; i < nr_pages; i++) { 576 556 netmem = pool->alloc.cache[i]; 577 - if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) { 557 + if (dma_map && unlikely(!page_pool_dma_map(pool, netmem, gfp))) { 578 558 put_page(netmem_to_page(netmem)); 579 559 continue; 580 560 } ··· 676 656 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, 677 657 netmem_ref netmem) 678 658 { 659 + struct page *old, *page = netmem_to_page(netmem); 660 + unsigned long id; 679 661 dma_addr_t dma; 680 662 681 663 if (!pool->dma_map) 682 664 /* Always account for inflight pages, even if we didn't 683 665 * map them 684 666 */ 667 + return; 668 + 669 + id = netmem_get_dma_index(netmem); 670 + if (!id) 671 + return; 672 + 673 + if (in_softirq()) 674 + old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0); 675 + else 676 + old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0); 677 + if (old != page) 685 678 return; 686 679 687 680 dma = page_pool_get_dma_addr_netmem(netmem); ··· 704 671 PAGE_SIZE << pool->p.order, pool->p.dma_dir, 705 672 DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); 706 673 page_pool_set_dma_addr_netmem(netmem, 0); 674 + netmem_set_dma_index(netmem, 0); 707 675 } 708 676 709 677 /* Disconnects a page (from a page_pool). API users can have a need ··· 1114 1080 1115 1081 static void page_pool_scrub(struct page_pool *pool) 1116 1082 { 1083 + unsigned long id; 1084 + void *ptr; 1085 + 1117 1086 page_pool_empty_alloc_cache_once(pool); 1118 - pool->destroy_cnt++; 1087 + if (!pool->destroy_cnt++ && pool->dma_map) { 1088 + if (pool->dma_sync) { 1089 + /* Disable page_pool_dma_sync_for_device() */ 1090 + pool->dma_sync = false; 1091 + 1092 + /* Make sure all concurrent returns that may see the old 1093 + * value of dma_sync (and thus perform a sync) have 1094 + * finished before doing the unmapping below. Skip the 1095 + * wait if the device doesn't actually need syncing, or 1096 + * if there are no outstanding mapped pages. 1097 + */ 1098 + if (dma_dev_need_sync(pool->p.dev) && 1099 + !xa_empty(&pool->dma_mapped)) 1100 + synchronize_net(); 1101 + } 1102 + 1103 + xa_for_each(&pool->dma_mapped, id, ptr) 1104 + __page_pool_release_page_dma(pool, page_to_netmem(ptr)); 1105 + } 1119 1106 1120 1107 /* No more consumers should exist, but producers could still 1121 1108 * be in-flight.