
mm/mempolicy: use numa_node_id() instead of cpu_to_node()

Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY
policy", v4.

This patchset optimizes cross-socket memory access with the
MPOL_PREFERRED_MANY policy.

To test this patch we ran the following test on a 3-node system.
Node 0 - 2GB - Tier 1
Node 1 - 11GB - Tier 1
Node 6 - 10GB - Tier 2

The following changes were made to memcached to set the memory policy;
Node 0 and Node 1 are selected as the preferred nodes.

#include <numaif.h>
#include <numa.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

unsigned long nodemask;
int ret;

nodemask = 0x03;	/* Node 0 and Node 1 */
ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
		    &nodemask, 10);
/*
 * If MPOL_F_NUMA_BALANCING isn't supported,
 * fall back to MPOL_PREFERRED_MANY.
 */
if (ret < 0 && errno == EINVAL) {
	printf("set mem policy normal\n");
	ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10);
}
if (ret < 0) {
	perror("Failed to call set_mempolicy");
	exit(-1);
}

Test Procedure:
===============
1. Make sure memory tiering and demotion are enabled.
2. Start memcached.

# ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7
-d -s "/tmp/memcached.sock"

3. Run memtier_benchmark to store 3200000 keys.

# ./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
--threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1
--key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024

4. Start a memory eater on nodes 0 and 1. This demotes all memcached
pages to node 6.
5. Make sure all the memcached pages were demoted to the lower tier by
reading /proc/<memcached PID>/numa_maps.

# cat /proc/2771/numa_maps
---
default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
---

6. Kill memory eater.
7. Read the pgpromote_success counter.
8. Start reading the keys by running memtier_benchmark.

# ./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
--pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R
--key-minimum=1 --key-maximum=3200000 -n allkeys
--threads=64 -c 1 -R -x 6

9. Read the pgpromote_success counter.

Test Results:
=============
Without Patch
------------------
1. pgpromote_success before test
Node 0: pgpromote_success 11
Node 1: pgpromote_success 140974

pgpromote_success after test
Node 0: pgpromote_success 11
Node 1: pgpromote_success 140974

2. Memtier-benchmark result.
AGGREGATED AVERAGE RESULTS (6 runs)
==================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency
------------------------------------------------------------------
Sets 0.00 --- --- --- ---
Gets 305792.03 305791.93 0.10 0.18949 0.16700
Waits 0.00 --- --- --- ---
Totals 305792.03 305791.93 0.10 0.18949 0.16700

======================================
p99 Latency p99.9 Latency KB/sec
-------------------------------------
--- --- 0.00
0.44700 1.71100 11542.69
--- --- ---
0.44700 1.71100 11542.69

With Patch
---------------
1. pgpromote_success before test
Node 0: pgpromote_success 5
Node 1: pgpromote_success 89386

pgpromote_success after test
Node 0: pgpromote_success 57895
Node 1: pgpromote_success 141463

2. Memtier-benchmark result.
AGGREGATED AVERAGE RESULTS (6 runs)
====================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency
--------------------------------------------------------------------
Sets 0.00 --- --- --- ---
Gets 521942.24 521942.07 0.17 0.11459 0.10300
Waits 0.00 --- --- --- ---
Totals 521942.24 521942.07 0.17 0.11459 0.10300

=======================================
p99 Latency p99.9 Latency KB/sec
---------------------------------------
--- --- 0.00
0.23100 0.31900 19701.68
--- --- ---
0.23100 0.31900 19701.68


Test Result Analysis:
=====================
1. With the patch we can observe that pages are getting promoted.
2. The memtier-benchmark results show that, with the patch,
performance increased by more than 50% (~70%).

Ops/sec without fix - 305792.03
Ops/sec with fix - 521942.24


This patch (of 2):

Instead of using 'cpu_to_node()', use 'numa_node_id()', which is
quicker. smp_processor_id() is guaranteed to be stable in
'mpol_misplaced()' because it is called with the ptl held. A
lockdep_assert_held() was added to ensure that.

No functional change in this patch.

[donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment]
Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Donet Tom and committed by Andrew Morton
f8fd525b fea68a75

5 files changed, 20 insertions(+), 11 deletions(-)
include/linux/mempolicy.h | +3 -2

 /* Check if a vma is migratable */
 extern bool vma_migratable(struct vm_area_struct *vma);

-int mpol_misplaced(struct folio *, struct vm_area_struct *, unsigned long);
+int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
+		   unsigned long addr);
 extern void mpol_put_task_policy(struct task_struct *);

 static inline bool mpol_is_preferred_many(struct mempolicy *pol)
...
 #endif

 static inline int mpol_misplaced(struct folio *folio,
-				 struct vm_area_struct *vma,
+				 struct vm_fault *vmf,
 				 unsigned long address)
 {
 	return -1; /* no node preference */
mm/huge_memory.c | +1 -1

 	 */
 	if (node_is_toptier(nid))
 		last_cpupid = folio_last_cpupid(folio);
-	target_nid = numa_migrate_prep(folio, vma, haddr, nid, &flags);
+	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
 	if (target_nid == NUMA_NO_NODE) {
 		folio_put(folio);
 		goto out_map;
mm/internal.h | +1 -1

 void __vunmap_range_noflush(unsigned long start, unsigned long end);

-int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
+int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
 		      unsigned long addr, int page_nid, int *flags);

 void free_zone_device_page(struct page *page);
mm/memory.c | +5 -3

 	return ret;
 }

-int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
+int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
 		      unsigned long addr, int page_nid, int *flags)
 {
+	struct vm_area_struct *vma = vmf->vma;
+
 	folio_get(folio);

 	/* Record the current PID acceesing VMA */
...
 		*flags |= TNF_FAULT_LOCAL;
 	}

-	return mpol_misplaced(folio, vma, addr);
+	return mpol_misplaced(folio, vmf, addr);
 }

 static vm_fault_t do_numa_page(struct vm_fault *vmf)
...
 		last_cpupid = (-1 & LAST_CPUPID_MASK);
 	else
 		last_cpupid = folio_last_cpupid(folio);
-	target_nid = numa_migrate_prep(folio, vma, vmf->address, nid, &flags);
+	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
 	if (target_nid == NUMA_NO_NODE) {
 		folio_put(folio);
 		goto out_map;
mm/mempolicy.c | +10 -4

  * mpol_misplaced - check whether current folio node is valid in policy
  *
  * @folio: folio to be checked
- * @vma: vm area where folio mapped
+ * @vmf: structure describing the fault
  * @addr: virtual address in @vma for shared policy lookup and interleave policy
  *
  * Lookup current policy node id for vma,addr and "compare to" folio's
...
  * Return: NUMA_NO_NODE if the page is in a node that is valid for this
  * policy, or a suitable node ID to allocate a replacement folio from.
  */
-int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
+int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 		   unsigned long addr)
 {
 	struct mempolicy *pol;
 	pgoff_t ilx;
 	struct zoneref *z;
 	int curnid = folio_nid(folio);
+	struct vm_area_struct *vma = vmf->vma;
 	int thiscpu = raw_smp_processor_id();
-	int thisnid = cpu_to_node(thiscpu);
+	int thisnid = numa_node_id();
 	int polnid = NUMA_NO_NODE;
 	int ret = NUMA_NO_NODE;

+	/*
+	 * Make sure ptl is held so that we don't preempt and we
+	 * have a stable smp processor id
+	 */
+	lockdep_assert_held(vmf->ptl);
 	pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out;
...
 	if (node_isset(curnid, pol->nodes))
 		goto out;
 	z = first_zones_zonelist(
-			node_zonelist(numa_node_id(), GFP_HIGHUSER),
+			node_zonelist(thisnid, GFP_HIGHUSER),
 			gfp_zone(GFP_HIGHUSER),
 			&pol->nodes);
 	polnid = zone_to_nid(z->zone);