
Merge branch 'page_pool-API-for-numa-node-change-handling'

Saeed Mahameed says:

====================
page_pool: API for numa node change handling

This series extends the page pool API to allow page pool consumers to
update the page pool NUMA node on the fly. This is required since, on
some systems, RX ring IRQs can migrate between NUMA nodes due to the IRQ
balancer or user-defined scripts; the current page pool has no way to
detect such a migration and will keep allocating, and holding on to,
pages from the wrong NUMA node, which hurts consumer performance.

1) Add an API to update the NUMA node id of the page pool.
Consumers will call this API to update the page pool NUMA node id.

2) Don't recycle non-reusable pages:
The page pool will check, upon page return, whether a page is suitable
for recycling or not:
2.1) when it belongs to a different NUMA node;
2.2) when it was allocated under memory pressure.

3) mlx5 will use the new API to update the page pool NUMA id on demand.

The series is joint work between me and Jonathan. We tested it, and it
proved effective at avoiding page allocator bottlenecks and
significantly improving packet rate and CPU utilization in the scenarios
described above.

Performance testing:
XDP drop/TX rate and TCP single/multi stream, on the mlx5 driver,
while migrating the RX ring IRQ from a close to a far NUMA node:

The mlx5 internal page cache was disabled locally to get pure page pool
results.

CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)

XDP Drop/TX single core:
NUMA  | XDP  | Before   | After
------+------+----------+----------
Close | Drop | 11 Mpps  | 10.9 Mpps
Far   | Drop | 4.4 Mpps | 5.8 Mpps

Close | TX   | 6.5 Mpps | 6.5 Mpps
Far   | TX   | 3.5 Mpps | 4 Mpps

The far-NUMA tests improve by about 30% in drop packet rate and 15% in
TX packet rate.
No degradation in the close-NUMA tests.

TCP single/multi cpu/stream:
NUMA  | #cpu | Before  | After
------+------+---------+--------
Close |  1   | 18 Gbps | 18 Gbps
Far   |  1   | 15 Gbps | 18 Gbps
Close |  12  | 80 Gbps | 80 Gbps
Far   |  12  | 68 Gbps | 80 Gbps

In all test cases we see improvement for the far numa case, and no
impact on the close numa case.

====================

Performance analysis and conclusions by Jesper [1]:
The impact on XDP drop on x86_64 is inconclusive, showing only a
0.3459 ns slowdown, which is below the measurement accuracy of the
system.

v2->v3:
- Rebase on top of latest net-next and Jesper's page pool object
release patchset [2]
- No code changes
- Performance analysis by Jesper added to the cover letter.

v1->v2:
- Drop last patch, as requested by Ilias and Jesper.
- Fix documentation's performance numbers order.

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool04_inflight_changes.org#performance-notes
[2] https://patchwork.ozlabs.org/cover/1192098/
====================

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

 4 files changed, 53 insertions(+), 1 deletion(-)
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c (+3):

@@ -1386,6 +1386,9 @@
 	if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state)))
 		return 0;
 
+	if (rq->page_pool)
+		page_pool_nid_changed(rq->page_pool, numa_mem_id());
+
 	if (rq->cqd.left) {
 		work_done += mlx5e_decompress_cqes_cont(rq, cqwq, 0, budget);
 		if (rq->cqd.left || work_done >= budget)
include/net/page_pool.h (+7):

@@ -204,4 +204,11 @@
 	return refcount_dec_and_test(&pool->user_cnt);
 }
 
+/* Caller must provide appropriate safe context, e.g. NAPI. */
+void page_pool_update_nid(struct page_pool *pool, int new_nid);
+static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
+{
+	if (unlikely(pool->p.nid != new_nid))
+		page_pool_update_nid(pool, new_nid);
+}
 #endif /* _NET_PAGE_POOL_H */
include/trace/events/page_pool.h (+22):

@@ -89,6 +89,28 @@
 		  __entry->pool, __entry->page, __entry->pfn, __entry->hold)
 );
 
+TRACE_EVENT(page_pool_update_nid,
+
+	TP_PROTO(const struct page_pool *pool, int new_nid),
+
+	TP_ARGS(pool, new_nid),
+
+	TP_STRUCT__entry(
+		__field(const struct page_pool *, pool)
+		__field(int, pool_nid)
+		__field(int, new_nid)
+	),
+
+	TP_fast_assign(
+		__entry->pool = pool;
+		__entry->pool_nid = pool->p.nid;
+		__entry->new_nid = new_nid;
+	),
+
+	TP_printk("page_pool=%p pool_nid=%d new_nid=%d",
+		  __entry->pool, __entry->pool_nid, __entry->new_nid)
+);
+
 #endif /* _TRACE_PAGE_POOL_H */
 
 /* This part must be outside protection */
net/core/page_pool.c (+21 -1):

@@ -281,6 +281,17 @@
 		return true;
 }
 
+/* page is NOT reusable when:
+ * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
+ * 2) belongs to a different NUMA node than pool->p.nid.
+ *
+ * To update pool->p.nid users must call page_pool_update_nid.
+ */
+static bool pool_page_reusable(struct page_pool *pool, struct page *page)
+{
+	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+}
+
 void __page_pool_put_page(struct page_pool *pool,
 			  struct page *page, bool allow_direct)
 {
@@ -290,7 +301,8 @@
 	 *
 	 * refcnt == 1 means page_pool owns page, and can recycle it.
 	 */
-	if (likely(page_ref_count(page) == 1)) {
+	if (likely(page_ref_count(page) == 1 &&
+		   pool_page_reusable(pool, page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
 		if (allow_direct && in_serving_softirq())
@@ -436,3 +448,11 @@
 	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
 }
 EXPORT_SYMBOL(page_pool_destroy);
+
+/* Caller must provide appropriate safe context, e.g. NAPI. */
+void page_pool_update_nid(struct page_pool *pool, int new_nid)
+{
+	trace_page_pool_update_nid(pool, new_nid);
+	pool->p.nid = new_nid;
+}
+EXPORT_SYMBOL(page_pool_update_nid);