
mm/swap: do not choose swap device according to numa node

Patch series "mm/swapfile.c: select swap devices of default priority round
robin", v5.

Currently, on systems with multiple swap devices, swap allocation
selects a swap device according to priority: the device with the
highest priority is used first.

A priority from 0 to 32767 can be specified when swapping on a device;
otherwise the system assigns priorities starting from -2 and counting
downwards by default.  Meanwhile, on NUMA systems, a swap device
attached to a given node is considered first by that node.

In the current code, an array of plists, swap_avail_heads[nid], is used
to organize swap devices on each NUMA node.  For each NUMA node, there
is a plist holding all swap devices.  The 'prio' value in the plist is
the negated value of the device's priority, because a plist is sorted
from low to high.  The swap device attached to a node is promoted to
the front position on that node's plist, and the other swap devices
follow in order of their default priority.

E.g. on a system with 8 NUMA nodes, with 4 zram partitions set up as
swap devices.

Current behaviour:
Their priorities will be (note that -1 is skipped):

NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  16G   0B   -2
/dev/zram1 partition  16G   0B   -3
/dev/zram2 partition  16G   0B   -4
/dev/zram3 partition  16G   0B   -5

And their positions in the 8 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
   zram0  -> zram1  -> zram2  -> zram3
   prio:1    prio:3    prio:4    prio:5
swap_avail_lists[1]: /* node 1's available swap device list */
   zram1  -> zram0  -> zram2  -> zram3
   prio:1    prio:2    prio:4    prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
   zram2  -> zram0  -> zram1  -> zram3
   prio:1    prio:2    prio:3    prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
   zram3  -> zram0  -> zram1  -> zram2
   prio:1    prio:2    prio:3    prio:4
swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
   zram0  -> zram1  -> zram2  -> zram3
   prio:2    prio:3    prio:4    prio:5

The per-node adjustment was intended to reduce lock contention on a
single swap device by having different nodes use different swap
devices.  It was introduced in commit a2468cc9bfdf ("swap: choose swap
device according to numa node").  However, the adjustment is rather
coarse-grained.  On a given node, the swap device sharing that node's
id is always selected first by the node's CPUs until it is exhausted,
then the next one.  And on nodes where no swap device shares the node
id, the swap device with priority '-2' is selected first until
exhausted, then the one with priority '-3', and so on.

This is the swapon output taken while a high-pressure vm-scalability
test is running.  It clearly shows that zram0 is heavily used until
exhausted:

===================================
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE       SIZE  USED PRIO
/dev/zram0 partition   16G 15.7G   -2
/dev/zram1 partition   16G  3.4G   -3
/dev/zram2 partition   16G  3.4G   -4
/dev/zram3 partition   16G  2.6G   -5

The node-based strategy for selecting a swap device is much better than
the old way of exhausting swap devices one by one.  However, it is
still unreasonable, because swap devices are assumed to have similar
access speed when no priority is specified at swapon time.  It is
unfair, and makes no sense, that one swap device gets a higher priority
than another merely because it was swapped on first.

So in this patchset, a change is made to select swap devices round
robin when they have the default priority.  In code, the plist array
swap_avail_heads[nid] is replaced with a single plist swap_avail_head,
which reverts commit a2468cc9bfdf.  On top of the revert, a further
change makes every device without a specified priority get the same
default priority '-1'.  Swap devices with a specified priority are
still put foremost; that behaviour is not affected.  If you care about
their different access speeds, use 'swapon -p xx' to assign priorities
to your swap devices.

New behaviour:

swap_avail_list: /* one global available swap device list */
zram0 -> zram1 -> zram2 -> zram3
prio:1 prio:1 prio:1 prio:1

This is the swapon output during the same high-pressure vm-scalability
test; all devices are selected round robin:
=======================================
[root@hp-dl385g10-03 linux]# swapon
NAME       TYPE       SIZE  USED PRIO
/dev/zram0 partition   16G 12.6G   -1
/dev/zram1 partition   16G 12.6G   -1
/dev/zram2 partition   16G 12.6G   -1
/dev/zram3 partition   16G 12.6G   -1

With the change, we can see about 18% performance improvement, as
below:

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)

                            Before:          After:
System time:                637.92 s         526.74 s        (lower is better)
Sum Throughput:             3546.56 MB/s     4207.56 MB/s    (higher is better)
Single process Throughput:  114.40 MB/s      135.72 MB/s     (higher is better)
free latency:               10138455.99 us   6810119.01 us   (lower is better)


This patch (of 2):

This reverts commit a2468cc9bfdf ("swap: choose swap device according to
numa node").

After this patch, the behaviour changes back to what it was before
commit a2468cc9bfdf: priorities are assigned from -1 downwards by
default, and when swapping, swap devices are exhausted one by one in
order of priority from high to low.  This is preparation work for the
later change.

[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE       SIZE   USED PRIO
/dev/zram0 partition   16G    16G   -1
/dev/zram1 partition   16G 966.2M   -2
/dev/zram2 partition   16G     0B   -3
/dev/zram3 partition   16G     0B   -4

Link: https://lkml.kernel.org/r/20251028034308.929550-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20251028034308.929550-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Baoquan He, committed by Andrew Morton
commit 8e689f8e, parent 6af766c8

 Documentation/admin-guide/mm/index.rst     |  1 -
 Documentation/admin-guide/mm/swap_numa.rst | 78 ---
 include/linux/swap.h                       | 11 +-
 mm/swapfile.c                              | 80 ++--
 4 files changed, 15 insertions(+), 155 deletions(-)
Documentation/admin-guide/mm/index.rst (-1)

--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ ... @@
    shrinker_debugfs
    slab
    soft-dirty
-   swap_numa
    transhuge
    userfaultfd
    zswap
Documentation/admin-guide/mm/swap_numa.rst (deleted, -78)

--- a/Documentation/admin-guide/mm/swap_numa.rst
+++ /dev/null
@@ ... @@
-===========================================
-Automatically bind swap device to numa node
-===========================================
-
-If the system has more than one swap device and swap device has the node
-information, we can make use of this information to decide which swap
-device to use in get_swap_pages() to get better performance.
-
-
-How to use this feature
-=======================
-
-Swap device has priority and that decides the order of it to be used. To make
-use of automatically binding, there is no need to manipulate priority settings
-for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and
-swapB, with swapA attached to node 0 and swapB attached to node 1, are going
-to be swapped on. Simply swapping them on by doing::
-
-	# swapon /dev/swapA
-	# swapon /dev/swapB
-
-Then node 0 will use the two swap devices in the order of swapA then swapB and
-node 1 will use the two swap devices in the order of swapB then swapA. Note
-that the order of them being swapped on doesn't matter.
-
-A more complex example on a 4 node machine. Assume 6 swap devices are going to
-be swapped on: swapA and swapB are attached to node 0, swapC is attached to
-node 1, swapD and swapE are attached to node 2 and swapF is attached to node3.
-The way to swap them on is the same as above::
-
-	# swapon /dev/swapA
-	# swapon /dev/swapB
-	# swapon /dev/swapC
-	# swapon /dev/swapD
-	# swapon /dev/swapE
-	# swapon /dev/swapF
-
-Then node 0 will use them in the order of::
-
-	swapA/swapB -> swapC -> swapD -> swapE -> swapF
-
-swapA and swapB will be used in a round robin mode before any other swap device.
-
-node 1 will use them in the order of::
-
-	swapC -> swapA -> swapB -> swapD -> swapE -> swapF
-
-node 2 will use them in the order of::
-
-	swapD/swapE -> swapA -> swapB -> swapC -> swapF
-
-Similaly, swapD and swapE will be used in a round robin mode before any
-other swap devices.
-
-node 3 will use them in the order of::
-
-	swapF -> swapA -> swapB -> swapC -> swapD -> swapE
-
-
-Implementation details
-======================
-
-The current code uses a priority based list, swap_avail_list, to decide
-which swap device to use and if multiple swap devices share the same
-priority, they are used round robin. This change here replaces the single
-global swap_avail_list with a per-numa-node list, i.e. for each numa node,
-it sees its own priority based list of available swap devices. Swap
-device's priority can be promoted on its matching node's swap_avail_list.
-
-The current swap device's priority is set as: user can set a >=0 value,
-or the system will pick one starting from -1 then downwards. The priority
-value in the swap_avail_list is the negated value of the swap device's
-due to plist being sorted from low to high. The new policy doesn't change
-the semantics for priority >=0 cases, the previous starting from -1 then
-downwards now becomes starting from -2 then downwards and -1 is reserved
-as the promoted value. So if multiple swap devices are attached to the same
-node, they will all be promoted to priority -1 on that node's plist and will
-be used round robin before any other swap devices.
include/linux/swap.h (+1 -10)

--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ ... @@
 	struct work_struct discard_work; /* discard worker */
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
-	struct plist_node avail_lists[]; /*
-					  * entries in swap_avail_heads, one
-					  * entry per node.
-					  * Must be last as the number of the
-					  * array is nr_node_ids, which is not
-					  * a fixed value so have to allocate
-					  * dynamically.
-					  * And it has to be an array so that
-					  * plist_for_each_* can work.
-					  */
+	struct plist_node avail_list; /* entry in swap_avail_head */
 };
 
 static inline swp_entry_t page_swap_entry(struct page *page)
mm/swapfile.c (+14 -66)

--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ ... @@
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-static int least_priority = -1;
+static int least_priority;
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ ... @@
  * is held and the locking order requires swap_lock to be taken
  * before any swap_info_struct->lock.
  */
-static struct plist_head *swap_avail_heads;
+static PLIST_HEAD(swap_avail_head);
 static DEFINE_SPINLOCK(swap_avail_lock);
 
 struct swap_info_struct *swap_info[MAX_SWAPFILES];
@@ ... @@
 /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
 static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 {
-	int nid;
 	unsigned long pages;
 
 	spin_lock(&swap_avail_lock);
@@ ... @@
 		goto skip;
 	}
 
-	for_each_node(nid)
-		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
+	plist_del(&si->avail_list, &swap_avail_head);
 
 skip:
 	spin_unlock(&swap_avail_lock);
@@ ... @@
 /* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
 static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 {
-	int nid;
 	long val;
 	unsigned long pages;
@@ ... @@
 		goto skip;
 	}
 
-	for_each_node(nid)
-		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
+	plist_add(&si->avail_list, &swap_avail_head);
 
 skip:
 	spin_unlock(&swap_avail_lock);
@@ ... @@
 static bool swap_alloc_slow(swp_entry_t *entry,
 			    int order)
 {
-	int node;
 	unsigned long offset;
 	struct swap_info_struct *si, *next;
 
-	node = numa_node_id();
 	spin_lock(&swap_avail_lock);
 start_over:
-	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
+	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
 		/* Rotate the device and switch to a new cluster */
-		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
+		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
 			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
@@ ... @@
 		 * still in the swap_avail_head list then try it, otherwise
 		 * start over if we have not gotten any slots.
 		 */
-		if (plist_node_empty(&next->avail_lists[node]))
+		if (plist_node_empty(&si->avail_list))
 			goto start_over;
 	}
 	spin_unlock(&swap_avail_lock);
@@ ... @@
 static bool swap_sync_discard(void)
 {
 	bool ret = false;
-	int nid = numa_node_id();
 	struct swap_info_struct *si, *next;
 
 	spin_lock(&swap_avail_lock);
-	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
+	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
 			if (si->flags & SWP_PAGE_DISCARD)
@@ ... @@
 	return generic_swapfile_activate(sis, swap_file, span);
 }
 
-static int swap_node(struct swap_info_struct *si)
-{
-	struct block_device *bdev;
-
-	if (si->bdev)
-		bdev = si->bdev;
-	else
-		bdev = si->swap_file->f_inode->i_sb->s_bdev;
-
-	return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
-}
-
 static void setup_swap_info(struct swap_info_struct *si, int prio,
 			    unsigned char *swap_map,
 			    struct swap_cluster_info *cluster_info,
 			    unsigned long *zeromap)
 {
-	int i;
-
 	if (prio >= 0)
 		si->prio = prio;
 	else
@@ ... @@
 	 * low-to-high, while swap ordering is high-to-low
 	 */
 	si->list.prio = -si->prio;
-	for_each_node(i) {
-		if (si->prio >= 0)
-			si->avail_lists[i].prio = -si->prio;
-		else {
-			if (swap_node(si) == i)
-				si->avail_lists[i].prio = 1;
-			else
-				si->avail_lists[i].prio = -si->prio;
-		}
-	}
+	si->avail_list.prio = -si->prio;
 	si->swap_map = swap_map;
 	si->cluster_info = cluster_info;
 	si->zeromap = zeromap;
@@ ... @@
 	del_from_avail_list(p, true);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
-		int nid;
 
 		plist_for_each_entry_continue(si, &swap_active_head, list) {
 			si->prio++;
 			si->list.prio--;
-			for_each_node(nid) {
-				if (si->avail_lists[nid].prio != 1)
-					si->avail_lists[nid].prio--;
-			}
+			si->avail_list.prio--;
 		}
 		least_priority++;
 	}
@@ ... @@
 	struct swap_info_struct *p;
 	struct swap_info_struct *defer = NULL;
 	unsigned int type;
-	int i;
 
-	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
+	p = kvzalloc(sizeof(struct swap_info_struct), GFP_KERNEL);
 	if (!p)
 		return ERR_PTR(-ENOMEM);
@@ ... @@
 	}
 	p->swap_extent_root = RB_ROOT;
 	plist_node_init(&p->list, 0);
-	for_each_node(i)
-		plist_node_init(&p->avail_lists[i], 0);
+	plist_node_init(&p->avail_list, 0);
 	p->flags = SWP_USED;
 	spin_unlock(&swap_lock);
 	if (defer) {
@@ ... @@
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
-
-	if (!swap_avail_heads)
-		return -ENOMEM;
 
 	si = alloc_swap_info();
 	if (IS_ERR(si))
@@ ... @@
 void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
 {
 	struct swap_info_struct *si, *next;
-	int nid = folio_nid(folio);
 
 	if (!(gfp & __GFP_IO))
 		return;
@@ ... @@
 		return;
 
 	spin_lock(&swap_avail_lock);
-	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
-				  avail_lists[nid]) {
+	plist_for_each_entry_safe(si, next, &swap_avail_head,
+				  avail_list) {
 		if (si->bdev) {
 			blkcg_schedule_throttle(si->bdev->bd_disk, true);
 			break;
@@ ... @@
 static int __init swapfile_init(void)
 {
-	int nid;
-
-	swap_avail_heads = kmalloc_array(nr_node_ids, sizeof(struct plist_head),
-					 GFP_KERNEL);
-	if (!swap_avail_heads) {
-		pr_emerg("Not enough memory for swap heads, swap is disabled\n");
-		return -ENOMEM;
-	}
-
-	for_each_node(nid)
-		plist_head_init(&swap_avail_heads[nid]);
-
 	swapfile_maximum_size = arch_max_swapfile_size();
 
 	/*