
ksm: allow trees per NUMA node

Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.

(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense in
preferring to allocate the new pages local to the caller's node.)

This patch:

Introduces a new sysfs boolean knob, /sys/kernel/mm/ksm/merge_across_nodes,
which controls the merging of pages across different NUMA nodes. When it
is set to zero, only pages from the same node are merged; otherwise pages
from all nodes can be merged together (the default behavior).
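The knob's effect can be sketched as a small userspace C model (toy code, not taken from the kernel: node_of_pfn() and its pages-per-node mapping are invented stand-ins for the kernel's pfn_to_nid()). With the knob set, every page is looked up under a single shared index 0; with it clear, each page is confined to its own node's trees:

```c
#include <assert.h>

/* Toy model of KSM's tree-root selection: when merging across nodes,
 * everything maps to index 0 (one stable and one unstable tree for the
 * whole system); otherwise the page's NUMA node picks the tree. */
static int merge_across_nodes = 1;

/* Invented stand-in for pfn_to_nid(): pretend each node owns 1M pages. */
static int node_of_pfn(unsigned long pfn)
{
	return (int)(pfn >> 20);
}

static int tree_index(unsigned long kpfn)
{
	if (merge_across_nodes)
		return 0;
	return node_of_pfn(kpfn);
}
```

Two pages on different nodes can only meet in the same tree (and hence merge) when they compute the same index, which is what restricts merging to a single node once the knob is cleared.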

A typical use case is many KVM guests on a NUMA machine, where CPUs on
more distant nodes would see a significantly increased access latency to
a merged KSM page. A sysfs knob was chosen for flexibility, since some
users may still prefer a greater amount of saved physical memory
regardless of access latency.

Every NUMA node has its own stable and unstable trees, for faster
searching and inserting. Changing the merge_across_nodes value is
possible only while there are no KSM shared pages in the system.
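The accepted values and the "no shared pages" restriction can be modelled in userspace C (a hypothetical helper, not the kernel's sysfs store handler; the globals mirror the kernel's counters in name only):

```c
#include <assert.h>
#include <errno.h>

/* Userspace model of the knob's rules: only 0 and 1 are accepted, and
 * flipping the value is refused while any KSM pages are shared. */
static unsigned int ksm_merge_across_nodes = 1;	/* default: merge everywhere */
static unsigned long ksm_pages_shared;

static int set_merge_across_nodes(unsigned long knob)
{
	if (knob > 1)
		return -EINVAL;
	/* Writing the current value back is always allowed. */
	if (ksm_merge_across_nodes != knob && ksm_pages_shared)
		return -EBUSY;
	ksm_merge_across_nodes = knob;
	return 0;
}
```

The EBUSY restriction exists because already-merged pages sit in trees indexed under the old policy; changing the policy underneath them would leave stale entries behind.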

I've tested this patch on NUMA machines with 2, 4 and 8 nodes, and
measured the speed of memory access inside KVM guests with memory pinned
to one of the nodes, using this benchmark:

http://pholasek.fedorapeople.org/alloc_pg.c

The population standard deviations of access times, as a percentage of
the average, were as follows:

merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes 1.7%

merge_across_nodes=0
2 nodes 1%
4 nodes 0.32%
8 nodes 0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225

Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:

1) switching merge_across_nodes after running KSM is liable to oops
on stale nodes still left over from the previous stable tree;

2) memory hotremove may migrate KSM pages, but there is no provision
here for !merge_across_nodes to migrate nodes to the proper tree.

Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Petr Holasek, committed by Linus Torvalds
90bd6fd3 22b751c3
+139 -19
Documentation/vm/ksm.txt  (+7)

···
       e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
       Default: 20 (chosen for demonstration purposes)

+merge_across_nodes - specifies if pages from different numa nodes can be merged.
+                     When set to 0, ksm merges only pages which physically
+                     reside in the memory area of same NUMA node. It brings
+                     lower latency to access to shared page. Value can be
+                     changed only when there is no ksm shared pages in system.
+                     Default: 1
+
 run - set 0 to stop ksmd from running but keep merged pages,
       set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
       set 2 to stop ksmd and unmerge all pages currently merged,
mm/ksm.c  (+132 -19)

···
 #include <linux/hashtable.h>
 #include <linux/freezer.h>
 #include <linux/oom.h>
+#include <linux/numa.h>

 #include <asm/tlbflush.h>
 #include "internal.h"
···
 	struct mm_struct *mm;
 	unsigned long address;		/* + low bits used for flags below */
 	unsigned int oldchecksum;	/* when unstable */
+#ifdef CONFIG_NUMA
+	unsigned int nid;
+#endif
 	union {
 		struct rb_node node;	/* when node of unstable tree */
 		struct {		/* when listed from stable tree */
···
 #define STABLE_FLAG	0x200	/* is listed from the stable tree */

 /* The stable and unstable tree heads */
-static struct rb_root root_stable_tree = RB_ROOT;
-static struct rb_root root_unstable_tree = RB_ROOT;
+static struct rb_root root_unstable_tree[MAX_NUMNODES];
+static struct rb_root root_stable_tree[MAX_NUMNODES];

 #define MM_SLOTS_HASH_BITS 10
 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
···
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
+
+/* Zeroed when merging across nodes is not allowed */
+static unsigned int ksm_merge_across_nodes = 1;

 #define KSM_RUN_STOP	0
 #define KSM_RUN_MERGE	1
···
 	return page;
 }

+/*
+ * This helper is used for getting right index into array of tree roots.
+ * When merge_across_nodes knob is set to 1, there are only two rb-trees for
+ * stable and unstable pages from all nodes with roots in index 0. Otherwise,
+ * every node has its own stable and unstable tree.
+ */
+static inline int get_kpfn_nid(unsigned long kpfn)
+{
+	if (ksm_merge_across_nodes)
+		return 0;
+	else
+		return pfn_to_nid(kpfn);
+}
+
 static void remove_node_from_stable_tree(struct stable_node *stable_node)
 {
 	struct rmap_item *rmap_item;
 	struct hlist_node *hlist;
+	int nid;

 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		if (rmap_item->hlist.next)
···
 		cond_resched();
 	}

-	rb_erase(&stable_node->node, &root_stable_tree);
+	nid = get_kpfn_nid(stable_node->kpfn);
+
+	rb_erase(&stable_node->node, &root_stable_tree[nid]);
 	free_stable_node(stable_node);
 }
···
 	age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
 	BUG_ON(age > 1);
 	if (!age)
-		rb_erase(&rmap_item->node, &root_unstable_tree);
+#ifdef CONFIG_NUMA
+		rb_erase(&rmap_item->node,
+				&root_unstable_tree[rmap_item->nid]);
+#else
+		rb_erase(&rmap_item->node, &root_unstable_tree[0]);
+#endif

 	ksm_pages_unshared--;
 	rmap_item->address &= PAGE_MASK;
···
  */
 static struct page *stable_tree_search(struct page *page)
 {
-	struct rb_node *node = root_stable_tree.rb_node;
+	struct rb_node *node;
 	struct stable_node *stable_node;
+	int nid;

 	stable_node = page_stable_node(page);
 	if (stable_node) {		/* ksm page forked */
 		get_page(page);
 		return page;
 	}
+
+	nid = get_kpfn_nid(page_to_pfn(page));
+	node = root_stable_tree[nid].rb_node;

 	while (node) {
 		struct page *tree_page;
···
  */
 static struct stable_node *stable_tree_insert(struct page *kpage)
 {
-	struct rb_node **new = &root_stable_tree.rb_node;
+	int nid;
+	unsigned long kpfn;
+	struct rb_node **new;
 	struct rb_node *parent = NULL;
 	struct stable_node *stable_node;
+
+	kpfn = page_to_pfn(kpage);
+	nid = get_kpfn_nid(kpfn);
+	new = &root_stable_tree[nid].rb_node;

 	while (*new) {
 		struct page *tree_page;
···
 		return NULL;

 	rb_link_node(&stable_node->node, parent, new);
-	rb_insert_color(&stable_node->node, &root_stable_tree);
+	rb_insert_color(&stable_node->node, &root_stable_tree[nid]);

 	INIT_HLIST_HEAD(&stable_node->hlist);

-	stable_node->kpfn = page_to_pfn(kpage);
+	stable_node->kpfn = kpfn;
 	set_page_stable_node(kpage, stable_node);

 	return stable_node;
···
 struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
 					      struct page *page,
 					      struct page **tree_pagep)
-
 {
-	struct rb_node **new = &root_unstable_tree.rb_node;
+	struct rb_node **new;
+	struct rb_root *root;
 	struct rb_node *parent = NULL;
+	int nid;
+
+	nid = get_kpfn_nid(page_to_pfn(page));
+	root = &root_unstable_tree[nid];
+	new = &root->rb_node;

 	while (*new) {
 		struct rmap_item *tree_rmap_item;
···
 		 * Don't substitute a ksm page for a forked page.
 		 */
 		if (page == tree_page) {
+			put_page(tree_page);
+			return NULL;
+		}
+
+		/*
+		 * If tree_page has been migrated to another NUMA node, it
+		 * will be flushed out and put into the right unstable tree
+		 * next time: only merge with it if merge_across_nodes.
+		 * Just notice, we don't have similar problem for PageKsm
+		 * because their migration is disabled now. (62b61f611e)
+		 */
+		if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) {
 			put_page(tree_page);
 			return NULL;
 		}
···
 	rmap_item->address |= UNSTABLE_FLAG;
 	rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
+#ifdef CONFIG_NUMA
+	rmap_item->nid = nid;
+#endif
 	rb_link_node(&rmap_item->node, parent, new);
-	rb_insert_color(&rmap_item->node, &root_unstable_tree);
+	rb_insert_color(&rmap_item->node, root);

 	ksm_pages_unshared++;
 	return NULL;
···
 static void stable_tree_append(struct rmap_item *rmap_item,
 			       struct stable_node *stable_node)
 {
+#ifdef CONFIG_NUMA
+	/*
+	 * Usually rmap_item->nid is already set correctly,
+	 * but it may be wrong after switching merge_across_nodes.
+	 */
+	rmap_item->nid = get_kpfn_nid(stable_node->kpfn);
+#endif
 	rmap_item->head = stable_node;
 	rmap_item->address |= STABLE_FLAG;
 	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
···
 	struct mm_slot *slot;
 	struct vm_area_struct *vma;
 	struct rmap_item *rmap_item;
+	int nid;

 	if (list_empty(&ksm_mm_head.mm_list))
 		return NULL;
···
 	 */
 	lru_add_drain_all();

-	root_unstable_tree = RB_ROOT;
+	for (nid = 0; nid < nr_node_ids; nid++)
+		root_unstable_tree[nid] = RB_ROOT;

 	spin_lock(&ksm_mmlist_lock);
 	slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
···
 					    unsigned long end_pfn)
 {
 	struct rb_node *node;
+	int nid;

-	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
-		struct stable_node *stable_node;
+	for (nid = 0; nid < nr_node_ids; nid++)
+		for (node = rb_first(&root_stable_tree[nid]); node;
+				node = rb_next(node)) {
+			struct stable_node *stable_node;

-		stable_node = rb_entry(node, struct stable_node, node);
-		if (stable_node->kpfn >= start_pfn &&
-		    stable_node->kpfn < end_pfn)
-			return stable_node;
-	}
+			stable_node = rb_entry(node, struct stable_node, node);
+			if (stable_node->kpfn >= start_pfn &&
+			    stable_node->kpfn < end_pfn)
+				return stable_node;
+		}
+
 	return NULL;
 }
···
 }
 KSM_ATTR(run);

+#ifdef CONFIG_NUMA
+static ssize_t merge_across_nodes_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_merge_across_nodes);
+}
+
+static ssize_t merge_across_nodes_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long knob;
+
+	err = kstrtoul(buf, 10, &knob);
+	if (err)
+		return err;
+	if (knob > 1)
+		return -EINVAL;
+
+	mutex_lock(&ksm_thread_mutex);
+	if (ksm_merge_across_nodes != knob) {
+		if (ksm_pages_shared)
+			err = -EBUSY;
+		else
+			ksm_merge_across_nodes = knob;
+	}
+	mutex_unlock(&ksm_thread_mutex);
+
+	return err ? err : count;
+}
+KSM_ATTR(merge_across_nodes);
+#endif
+
 static ssize_t pages_shared_show(struct kobject *kobj,
 				 struct kobj_attribute *attr, char *buf)
 {
···
 	&pages_unshared_attr.attr,
 	&pages_volatile_attr.attr,
 	&full_scans_attr.attr,
+#ifdef CONFIG_NUMA
+	&merge_across_nodes_attr.attr,
+#endif
 	NULL,
 };
···
 {
 	struct task_struct *ksm_thread;
 	int err;
+	int nid;

 	err = ksm_slab_init();
 	if (err)
 		goto out;
+
+	for (nid = 0; nid < nr_node_ids; nid++)
+		root_stable_tree[nid] = RB_ROOT;

 	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
 	if (IS_ERR(ksm_thread)) {