Linux kernel mirror: git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

[PATCH] Periodically drain non local pagesets

The pageset array can potentially acquire a huge amount of memory on large
NUMA systems. For example, on a system with 512 processors and 256 nodes
there will be 256*512 = 131072 pagesets. If each pageset holds only 5 pages
then we are talking about 655360 pages. With a 16K page size on IA64 this
results in potentially 10 gigabytes of memory being trapped in pagesets. The
numbers are much smaller on typical systems, but there is still the
potential for memory to be trapped in off-node pagesets. Off-node memory may
be rarely used if local memory is available, so without this patch we may
have memory sitting in seldom-used pagesets.
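The worst-case arithmetic in the paragraph above can be checked with a short
userspace program (the node, CPU, and page-size figures are the hypothetical
large-system values from the commit message, not properties of any real
machine):

#include <stdio.h>

int main(void)
{
	/* Hypothetical worst-case system from the commit message. */
	long nodes = 256, cpus = 512;
	long pages_per_pageset = 5;		/* pages held in each pageset */
	long page_size = 16 * 1024;		/* 16K pages on IA64 */

	long pagesets = nodes * cpus;			/* one pageset per (node, cpu) */
	long pages = pagesets * pages_per_pageset;
	long long bytes = (long long)pages * page_size;

	printf("pagesets: %ld\n", pagesets);
	printf("pages trapped: %ld\n", pages);
	printf("memory trapped: %.1f GiB\n",
	       bytes / (1024.0 * 1024.0 * 1024.0));
	return 0;
}

This prints 131072 pagesets, 655360 pages, and 10.0 GiB, matching the
figures in the message.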

The slab allocator flushes its per-cpu caches every 2 seconds. The
following patch flushes the off-node pageset caches in the same way by
tying into the slab reaper.
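The drain logic the patch adds is simple: on each reap, the running CPU
walks every zone, skips zones on its own node, and frees whatever its
per-cpu lists hold for remote zones. A toy userspace model of that loop
(not the kernel code; the array shape and `drain_remote()` helper are
invented for illustration) looks like this:

#include <stdio.h>

#define NODES     4
#define CPUS      2
#define PCP_LISTS 2	/* hot and cold lists, as in struct per_cpu_pageset */

/* Toy model: pages cached per (node, cpu, list). */
static int count[NODES][CPUS][PCP_LISTS];

/*
 * Mimics drain_remote_pages(): running on `cpu` of `local_node`,
 * free this CPU's cached pages for every *remote* node.
 */
static int drain_remote(int local_node, int cpu)
{
	int freed = 0;

	for (int node = 0; node < NODES; node++) {
		if (node == local_node)		/* do not drain local pagesets */
			continue;
		for (int i = 0; i < PCP_LISTS; i++) {
			freed += count[node][cpu][i];
			count[node][cpu][i] = 0;
		}
	}
	return freed;
}

int main(void)
{
	/* CPU 0 of node 0 has cached 5 pages per list for every node. */
	for (int node = 0; node < NODES; node++)
		for (int i = 0; i < PCP_LISTS; i++)
			count[node][0][i] = 5;

	int freed = drain_remote(0, 0);
	printf("freed %d remote pages, %d local pages kept\n",
	       freed, count[0][0][0] + count[0][0][1]);
	return 0;
}

The local node's pages stay cached (they are likely to be reused), while
the rarely touched remote pages go back to the buddy allocator.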

The patch also changes /proc/zoneinfo to include the number of pages
currently in each pageset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Authored by Christoph Lameter, committed by Linus Torvalds
4ae7c039 578c2fd6

39 insertions(+), 2 deletions(-)

include/linux/gfp.h (+5)

@@ -133,5 +133,10 @@
 #define free_page(addr) free_pages((addr),0)
 
 void page_alloc_init(void);
+#ifdef CONFIG_NUMA
+void drain_remote_pages(void);
+#else
+static inline void drain_remote_pages(void) { };
+#endif
 
 #endif /* __LINUX_GFP_H */
mm/page_alloc.c (+33, -2)

@@ -516,6 +516,36 @@
 	return allocated;
 }
 
+#ifdef CONFIG_NUMA
+/* Called from the slab reaper to drain remote pagesets */
+void drain_remote_pages(void)
+{
+	struct zone *zone;
+	int i;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	for_each_zone(zone) {
+		struct per_cpu_pageset *pset;
+
+		/* Do not drain local pagesets */
+		if (zone->zone_pgdat->node_id == numa_node_id())
+			continue;
+
+		pset = zone->pageset[smp_processor_id()];
+		for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
+			struct per_cpu_pages *pcp;
+
+			pcp = &pset->pcp[i];
+			if (pcp->count)
+				pcp->count -= free_pages_bulk(zone, pcp->count,
+						&pcp->list, 0);
+		}
+	}
+	local_irq_restore(flags);
+}
+#endif
+
 #if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
 static void __drain_pages(unsigned int cpu)
 {
@@ -1301,12 +1271,13 @@
 		pageset = zone_pcp(zone, cpu);
 
 		for (temperature = 0; temperature < 2; temperature++)
-			printk("cpu %d %s: low %d, high %d, batch %d\n",
+			printk("cpu %d %s: low %d, high %d, batch %d used:%d\n",
 			       cpu,
 			       temperature ? "cold" : "hot",
 			       pageset->pcp[temperature].low,
 			       pageset->pcp[temperature].high,
-			       pageset->pcp[temperature].batch);
+			       pageset->pcp[temperature].batch,
+			       pageset->pcp[temperature].count);
 	}
 }
mm/slab.c (+1)

@@ -2851,6 +2851,7 @@
 	}
 	check_irq_on();
 	up(&cache_chain_sem);
+	drain_remote_pages();
 	/* Setup the next iteration */
 	schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC + smp_processor_id());
 }