Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/mempolicy: Weighted Interleave Auto-tuning

On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across nodes according to user-set
ratios.

Ideally, these weights should be proportional to each node's bandwidth, so
that under bandwidth pressure each node operates at its maximal efficient
bandwidth and no single node is saturated, which would otherwise cause its
latency to climb sharply.
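
To see why proportional weights help, here is a small illustrative sketch
(not kernel code; the node count and bandwidth figures are invented): when
the weight ratio matches the bandwidth ratio, each node's share of allocated
pages equals its share of total bandwidth, so all nodes are loaded in
proportion to what they can serve.

```c
/*
 * Illustrative sketch: with weights proportional to bandwidth, a node's
 * share of pages equals its share of total bandwidth. Figures hypothetical.
 */
struct node_info {
        unsigned int bw;        /* reported bandwidth, e.g. GB/s */
        unsigned int weight;    /* interleave weight */
};

/* Pages handed to node `idx` out of `total`, ignoring partial rounds. */
unsigned long pages_for_node(const struct node_info *nodes, int n,
                             int idx, unsigned long total)
{
        unsigned int weight_total = 0;
        int i;

        for (i = 0; i < n; i++)
                weight_total += nodes[i].weight;
        return total / weight_total * nodes[idx].weight;
}
```

For example, nodes with 200 and 50 units of bandwidth and weights 4 and 1
split 1000 pages 800/200, matching the 80%/20% bandwidth split.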

Previously, weighted interleave's default weights were all 1, which made it
equivalent to the (unweighted) interleave mempolicy: nodes are visited in a
round-robin fashion and bandwidth information is ignored.

This patch has two main goals. First, it makes weighted interleave easier to
use for users who wish to relieve bandwidth pressure on systems whose nodes
have varying bandwidth (e.g. CXL). By providing a set of "real" default
weights that work out of the box, users who are unable (or do not wish) to
experiment to find the optimal weights for their system can still take
advantage of bandwidth-informed weighted interleave.

Second, it allows weighted interleave to adjust dynamically to hotplugged
memory that arrives with new bandwidth information. Instead of requiring
node weights to be updated manually every time bandwidth information is
reported or removed, weighted interleave recomputes and installs a new set
of default weights whenever the bandwidth information changes.

To meet these goals, this patch introduces an auto-configuration mode for
the interleave weights, which provides a reasonable set of default weights
calculated from the bandwidth data reported by the system. In auto mode,
the weights are dynamically adjusted whenever the current bandwidth
information changes, including in response to hotplug events.
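
The weight derivation added by this patch (reduce_interleave_weights() in
mm/mempolicy.c) scales each node's bandwidth into the range [1, weightiness]
with weightiness = 32, then divides all weights by their GCD. A simplified
userspace sketch of the same arithmetic (it performs the 64-bit division
directly, which the kernel version takes care to avoid):

```c
/*
 * Userspace sketch of reduce_interleave_weights(): scale each node's
 * bandwidth into [1, weightiness], then reduce by the weights' GCD.
 */
static const unsigned int weightiness = 32;

unsigned int gcd_u(unsigned int a, unsigned int b)
{
        while (b) {
                unsigned int t = a % b;
                a = b;
                b = t;
        }
        return a;
}

void reduce_weights(const unsigned int *bw, unsigned char *new_iw, int n)
{
        unsigned long long sum_bw = 0;
        unsigned int g = 0;
        int i;

        for (i = 0; i < n; i++)
                sum_bw += bw[i];

        for (i = 0; i < n; i++) {
                unsigned long long scaled =
                        (unsigned long long)weightiness * bw[i];

                /* Nodes with no/unknown bandwidth default to weight 1 */
                new_iw[i] = (bw[i] && scaled > sum_bw) ?
                            (unsigned char)(scaled / sum_bw) : 1;
                g = g ? gcd_u(g, new_iw[i]) : new_iw[i];
        }

        /* 1:2 is strictly better than 16:32 -- divide out the GCD */
        for (i = 0; i < n; i++)
                new_iw[i] /= g;
}
```

With two nodes of bandwidth 200 and 50, this yields weights 25:6; with two
equal nodes, the scaled 16:16 reduces to 1:1, so the policy cycles through
nodes as quickly as possible.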

Users can still write weights manually into the nodeN sysfs interfaces;
doing so switches the system into manual mode. In manual mode, the system
stops dynamically updating any of the node weights, even during hotplug
events that shift the optimal weight distribution.

A new sysfs interface "auto" is introduced, which allows users to switch
between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
system also automatically enters manual mode when a nodeN interface is
manually written to.

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface reset the weight to the system default. Before this patch the
default for all weights was 1, so writing 0 and writing 1 were functionally
equivalent. With this patch, writing 0 is invalid and returns -EINVAL.
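
Assuming a kernel with this patch applied, a session against the new sysfs
ABI might look like the following (the weight values shown are illustrative,
not guaranteed outputs):

```shell
cd /sys/kernel/mm/mempolicy/weighted_interleave

cat auto            # "true": weights follow reported bandwidth
cat node0 node1     # auto-derived weights, e.g. 25 and 6

echo 7 > node1      # manual write: sets node1's weight and...
cat auto            # ..."false": the system is now in manual mode

echo Y > auto       # recompute from bandwidth data, back to auto mode
echo 0 > node0      # now invalid: the write fails with EINVAL
```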

Link: https://lkml.kernel.org/r/20250520141236.2987309-1-joshua.hahnjy@gmail.com
[joshua.hahnjy@gmail.com: wordsmithing changes, simplification, fixes]
Link: https://lkml.kernel.org/r/20250511025840.2410154-1-joshua.hahnjy@gmail.com
[joshua.hahnjy@gmail.com: remove auto_kobj_attr field from struct sysfs_wi_group]
Link: https://lkml.kernel.org/r/20250512142511.3959833-1-joshua.hahnjy@gmail.com
Link: https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/ [1]
Link: https://lkml.kernel.org/r/20250505182328.4148265-1-joshua.hahnjy@gmail.com
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Joshua Hahn; committed by Andrew Morton (e341f9c3 9e619cd4)

4 files changed: +311 -63

Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave (+32 -3)

--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ ... @@
 		Minimum weight: 1
 		Maximum weight: 255
 
-		Writing an empty string or `0` will reset the weight to the
-		system default. The system default may be set by the kernel
-		or drivers at boot or during hotplug events.
+		Writing invalid values (i.e. any values not in [1,255],
+		empty string, ...) will return -EINVAL.
+
+		Changing the weight to a valid value will automatically
+		switch the system to manual mode as well.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
+Date:		May 2025
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Auto-weighting configuration interface
+
+		Configuration mode for weighted interleave. 'true' indicates
+		that the system is in auto mode, and a 'false' indicates that
+		the system is in manual mode.
+
+		In auto mode, all node weights are re-calculated and overwritten
+		(visible via the nodeN interfaces) whenever new bandwidth data
+		is made available during either boot or hotplug events.
+
+		In manual mode, node weights can only be updated by the user.
+		Note that nodes that are onlined with previously set weights
+		will reuse those weights. If they were not previously set or
+		are onlined with missing bandwidth data, the weights will use
+		a default weight of 1.
+
+		Writing any true value string (e.g. Y or 1) will enable auto
+		mode, while writing any false value string (e.g. N or 0) will
+		enable manual mode. All other strings are ignored and will
+		return -EINVAL.
+
+		Writing a new weight to a node directly via the nodeN interface
+		will also automatically switch the system to manual mode.
drivers/base/node.c (+9)

--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ ... @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ ... @@
 			pr_info("failed to add performance attribute to node %d\n",
 				nid);
 			break;
+		}
+	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord)) {
+			pr_info("failed to set mempolicy attrs for node %d\n",
+				nid);
 		}
 	}
 }
include/linux/mempolicy.h (+4)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ ... @@
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
+#include <linux/node.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
@@ ... @@
 }
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
+
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
 
 #else
mm/mempolicy.c (+266 -60)

--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ ... @@
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/gcd.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ ... @@
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
 /*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
- *
- * iw_table is RCU protected
+ * weightiness balances the tradeoff between small weights (cycles through nodes
+ * faster, more fair/even distribution) and large weights (smaller errors
+ * between actual bandwidth ratios and weight ratios). 32 is a number that has
+ * been found to perform at a reasonable compromise between the two goals.
  */
-static u8 __rcu *iw_table;
-static DEFINE_MUTEX(iw_table_lock);
+static const int weightiness = 32;
+
+/*
+ * A null weighted_interleave_state is interpreted as having .mode="auto",
+ * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
+ */
+struct weighted_interleave_state {
+	bool mode_auto;
+	u8 iw_table[];
+};
+static struct weighted_interleave_state __rcu *wi_state;
+static unsigned int *node_bw_table;
+
+/*
+ * wi_state_lock protects both wi_state and node_bw_table.
+ * node_bw_table is only used by writers to update wi_state.
+ */
+static DEFINE_MUTEX(wi_state_lock);
 
 static u8 get_il_weight(int node)
 {
-	u8 *table;
-	u8 weight;
+	struct weighted_interleave_state *state;
+	u8 weight = 1;
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
-	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
+	state = rcu_dereference(wi_state);
+	if (state)
+		weight = state->iw_table[node];
 	rcu_read_unlock();
 	return weight;
+}
+
+/*
+ * Convert bandwidth values into weighted interleave weights.
+ * Call with wi_state_lock.
+ */
+static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
+{
+	u64 sum_bw = 0;
+	unsigned int cast_sum_bw, scaling_factor = 1, iw_gcd = 0;
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY)
+		sum_bw += bw[nid];
+
+	/* Scale bandwidths to whole numbers in the range [1, weightiness] */
+	for_each_node_state(nid, N_MEMORY) {
+		/*
+		 * Try not to perform 64-bit division.
+		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
+		 * If sum_bw > scaling_factor, then round the weight up to 1.
+		 */
+		scaling_factor = weightiness * bw[nid];
+		if (bw[nid] && sum_bw < scaling_factor) {
+			cast_sum_bw = (unsigned int)sum_bw;
+			new_iw[nid] = scaling_factor / cast_sum_bw;
+		} else {
+			new_iw[nid] = 1;
+		}
+		if (!iw_gcd)
+			iw_gcd = new_iw[nid];
+		iw_gcd = gcd(iw_gcd, new_iw[nid]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
+	for_each_node_state(nid, N_MEMORY)
+		new_iw[nid] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
+	unsigned int *old_bw, *new_bw;
+	unsigned int bw_val;
+	int i;
+
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_wi_state = kmalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+	new_wi_state->mode_auto = true;
+	for (i = 0; i < nr_node_ids; i++)
+		new_wi_state->iw_table[i] = 1;
+
+	/*
+	 * Update bandwidth info, even in manual mode. That way, when switching
+	 * to auto mode in the future, iw_table can be overwritten using
+	 * accurate bw data.
+	 */
+	mutex_lock(&wi_state_lock);
+
+	old_bw = node_bw_table;
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids * sizeof(*old_bw));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	old_wi_state = rcu_dereference_protected(wi_state,
+			lockdep_is_held(&wi_state_lock));
+	if (old_wi_state && !old_wi_state->mode_auto) {
+		/* Manual mode; skip reducing weights and updating wi_state */
+		mutex_unlock(&wi_state_lock);
+		kfree(new_wi_state);
+		goto out;
+	}
+
+	/* NULL wi_state assumes auto=true; reduce weights and update wi_state*/
+	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
+	rcu_assign_pointer(wi_state, new_wi_state);
+
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
+out:
+	kfree(old_bw);
+	return 0;
 }
@@ ... @@
 static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
+	struct weighted_interleave_state *state;
 	nodemask_t nodemask;
 	unsigned int target, nr_nodes;
-	u8 *table;
+	u8 *table = NULL;
 	unsigned int weight_total = 0;
 	u8 weight;
-	int nid;
+	int nid = 0;
 
 	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
 	if (!nr_nodes)
 		return numa_node_id();
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
+
+	state = rcu_dereference(wi_state);
+	/* Uninitialized wi_state means we should assume all weights are 1 */
+	if (state)
+		table = state->iw_table;
+
 	/* calculate the total weight */
-	for_each_node_mask(nid, nodemask) {
-		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
-		weight_total += weight;
-	}
+	for_each_node_mask(nid, nodemask)
+		weight_total += table ? table[nid] : 1;
 
 	/* Calculate the node offset based on totals */
 	target = ilx % weight_total;
@@ ... @@
 	while (target) {
 		/* detect system default usage */
 		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
 		if (target < weight)
 			break;
 		target -= weight;
@@ ... @@
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
 {
+	struct weighted_interleave_state *state;
 	struct task_struct *me = current;
 	unsigned int cpuset_mems_cookie;
 	unsigned long total_allocated = 0;
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ ... @@
 		return total_allocated;
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
+	state = rcu_dereference(wi_state);
+	if (state) {
+		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
+		rcu_read_unlock();
+	} else {
+		rcu_read_unlock();
+		for (i = 0; i < nr_node_ids; i++)
+			weights[i] = 1;
+	}
 
 	/* calculate total, detect system default usage */
-	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+	for_each_node_mask(node, nodes)
 		weight_total += weights[node];
-	}
 
 	/*
 	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
@@ ... @@
 static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 			  const char *buf, size_t count)
 {
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
 	struct iw_node_attr *node_attr;
-	u8 *new;
-	u8 *old;
 	u8 weight = 0;
+	int i;
 
 	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
-	if (count == 0 || sysfs_streq(buf, ""))
-		weight = 0;
-	else if (kstrtou8(buf, 0, &weight))
+	if (count == 0 || sysfs_streq(buf, "") ||
+	    kstrtou8(buf, 0, &weight) || weight == 0)
 		return -EINVAL;
 
-	new = kzalloc(nr_node_ids, GFP_KERNEL);
-	if (!new)
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state)
 		return -ENOMEM;
 
-	mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
-					lockdep_is_held(&iw_table_lock));
-	if (old)
-		memcpy(new, old, nr_node_ids);
-	new[node_attr->nid] = weight;
-	rcu_assign_pointer(iw_table, new);
-	mutex_unlock(&iw_table_lock);
-	synchronize_rcu();
-	kfree(old);
+	mutex_lock(&wi_state_lock);
+	old_wi_state = rcu_dereference_protected(wi_state,
+			lockdep_is_held(&wi_state_lock));
+	if (old_wi_state) {
+		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+		       nr_node_ids * sizeof(u8));
+	} else {
+		for (i = 0; i < nr_node_ids; i++)
+			new_wi_state->iw_table[i] = 1;
+	}
+	new_wi_state->iw_table[node_attr->nid] = weight;
+	new_wi_state->mode_auto = false;
+
+	rcu_assign_pointer(wi_state, new_wi_state);
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
+	return count;
+}
+
+static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct weighted_interleave_state *state;
+	bool wi_auto = true;
+
+	rcu_read_lock();
+	state = rcu_dereference(wi_state);
+	if (state)
+		wi_auto = state->mode_auto;
+	rcu_read_unlock();
+
+	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
+}
+
+static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
+	unsigned int *bw;
+	bool input;
+	int i;
+
+	if (kstrtobool(buf, &input))
+		return -EINVAL;
+
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state)
+		return -ENOMEM;
+	for (i = 0; i < nr_node_ids; i++)
+		new_wi_state->iw_table[i] = 1;
+
+	mutex_lock(&wi_state_lock);
+	if (!input) {
+		old_wi_state = rcu_dereference_protected(wi_state,
+				lockdep_is_held(&wi_state_lock));
+		if (!old_wi_state)
+			goto update_wi_state;
+		if (input == old_wi_state->mode_auto) {
+			mutex_unlock(&wi_state_lock);
+			return count;
+		}
+
+		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+		       nr_node_ids * sizeof(u8));
+		goto update_wi_state;
+	}
+
+	bw = node_bw_table;
+	if (!bw) {
+		mutex_unlock(&wi_state_lock);
+		kfree(new_wi_state);
+		return -ENODEV;
+	}
+
+	new_wi_state->mode_auto = true;
+	reduce_interleave_weights(bw, new_wi_state->iw_table);
+
+update_wi_state:
+	rcu_assign_pointer(wi_state, new_wi_state);
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
 	return count;
 }
@@ ... @@
 		sysfs_wi_node_delete(nid);
 }
 
-static void iw_table_free(void)
+static void wi_state_free(void)
 {
-	u8 *old;
+	struct weighted_interleave_state *old_wi_state;
 
-	mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
-					lockdep_is_held(&iw_table_lock));
-	rcu_assign_pointer(iw_table, NULL);
-	mutex_unlock(&iw_table_lock);
+	mutex_lock(&wi_state_lock);
 
+	old_wi_state = rcu_dereference_protected(wi_state,
+			lockdep_is_held(&wi_state_lock));
+	if (!old_wi_state) {
+		mutex_unlock(&wi_state_lock);
+		goto out;
+	}
+
+	rcu_assign_pointer(wi_state, NULL);
+	mutex_unlock(&wi_state_lock);
 	synchronize_rcu();
-	kfree(old);
+	kfree(old_wi_state);
+out:
+	kfree(&wi_group->wi_kobj);
 }
 
+static struct kobj_attribute wi_auto_attr =
+	__ATTR(auto, 0664, weighted_interleave_auto_show,
+	       weighted_interleave_auto_store);
+
 static void wi_cleanup(void) {
+	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
 	sysfs_wi_node_delete_all();
-	iw_table_free();
+	wi_state_free();
 }
 
 static void wi_kobj_release(struct kobject *wi_kobj)
@@ ... @@
 	err = kobject_init_and_add(&wi_group->wi_kobj, &wi_ktype, mempolicy_kobj,
 				   "weighted_interleave");
+	if (err)
+		goto err_put_kobj;
+
+	err = sysfs_create_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
 	if (err)
 		goto err_put_kobj;