Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: multi-gen LRU: admin guide

Add an admin guide.

Link: https://lkml.kernel.org/r/20220918080010.2920238-14-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Yu Zhao and committed by Andrew Morton
07017acb d6c3af7d

+169 -1
+1
Documentation/admin-guide/mm/index.rst
···
    idle_page_tracking
    ksm
    memory-hotplug
+   multigen_lru
    nommu-mmap
    numa_memory_policy
    numaperf
+162
Documentation/admin-guide/mm/multigen_lru.rst
···
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. Its default value depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+       batches, when MMU sets it (e.g., on x86). This behavior can
+       theoretically worsen lock contention (mmap_lock). If it is
+       disabled, the multi-gen LRU will suffer a minor performance
+       degradation for workloads that contiguously map hot pages,
+       whose accessed bits can be otherwise cleared by fewer larger
+       batches.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+       well, when MMU sets it (e.g., on x86). This behavior was not
+       verified on x86 varieties other than Intel and AMD. If it is
+       disabled, the multi-gen LRU will suffer a negligible
+       performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+    echo y >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0007
+    echo 5 >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, so does
+concatenation with delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application needs
+in a given time interval, and it is usually done with little impact on
+the performance of the application. E.g., data centers want to
+optimize job scheduling (bin packing) to improve memory utilizations.
+When a new job comes in, the job scheduler needs to find out whether
+each server it manages can allocate a certain amount of memory for
+this new job before it can pick a candidate. To do so, the job
+scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
+histograms are noncumulative.
+::
+
+    memcg  memcg_id  memcg_path
+       node  node_id
+           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
+           ...
+           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
+
+Each bin contains an estimated number of pages that have been accessed
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
+the former is the largest and that of the latter is the smallest.
+
+Users can write the following command to ``lru_gen`` to create a new
+generation ``max_gen_nr+1``:
+
+    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
+
+``can_swap`` defaults to the swap setting and, if it is set to ``1``,
+it forces the scan of anon pages when swap is off, and vice versa.
+``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
+employs heuristics to reduce the overhead, which is likely to reduce
+the coverage as well.
+
+A typical use case is that a job scheduler runs this command at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold pages defined by
+this time interval.
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces page reclaim when there is no memory
+pressure. It usually targets cold pages only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim cold pages on
+the server it selected, to improve the chance of successfully landing
+this new job.
+
+Users can write the following command to ``lru_gen`` to evict
+generations less than or equal to ``min_gen_nr``.
+
+    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
+
+``min_gen_nr`` should be less than ``max_gen_nr-1``, since
+``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
+the active list) and therefore cannot be evicted. ``swappiness``
+overrides the default value in ``/proc/sys/vm/swappiness``.
+``nr_to_reclaim`` limits the number of pages to evict.
+
+A typical use case is that a job scheduler runs this command before it
+tries to land a new job on a server. If it fails to materialize enough
+cold pages because of the overestimation, it retries on the next
+server according to the ranking result obtained from the working set
+estimation step. This less forceful approach limits the impacts on the
+existing jobs.
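
Illustration only, not part of the patch: the ``lru_gen`` read format and the two write commands described in multigen_lru.rst could be handled from userspace roughly as sketched below. The helper names (`parse_lru_gen`, `evictable`, `evictable_pages`, `aging_command`, `eviction_command`) and the sample text are hypothetical, and the sketch assumes the fields are whitespace-separated integers laid out as in the guide. It only builds command strings; it does not touch debugfs.

```python
from typing import Dict, List, NamedTuple, Tuple


class Generation(NamedTuple):
    gen_nr: int
    age_in_ms: int
    nr_anon_pages: int
    nr_file_pages: int


def parse_lru_gen(text: str) -> Dict[Tuple[int, int], List[Generation]]:
    """Map (memcg_id, node_id) to its generations in file order."""
    result: Dict[Tuple[int, int], List[Generation]] = {}
    memcg_id = None
    key = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            memcg_id = int(fields[1])
        elif fields[0] == "node":
            key = (memcg_id, int(fields[1]))
            result[key] = []
        elif key is not None:
            result[key].append(Generation(*(int(f) for f in fields[:4])))
    return result


def evictable(gens: List[Generation]) -> List[Generation]:
    """Generations below max_gen_nr-1, which the guide says can be evicted."""
    max_gen_nr = max(g.gen_nr for g in gens)
    return [g for g in gens if g.gen_nr < max_gen_nr - 1]


def evictable_pages(gens: List[Generation]) -> int:
    """Total anon and file pages in the evictable generations."""
    return sum(g.nr_anon_pages + g.nr_file_pages for g in evictable(gens))


def _command(op: str, required, optional) -> str:
    # Nested optionals such as [a [b]] must be supplied left to right.
    parts = [op, *required]
    gap = False
    for value in optional:
        if value is None:
            gap = True
        elif gap:
            raise ValueError("optional arguments must be given left to right")
        else:
            parts.append(int(value))
    return " ".join(str(p) for p in parts)


def aging_command(memcg_id, node_id, max_gen_nr, can_swap=None, force_scan=None):
    """Build '+ memcg_id node_id max_gen_nr [can_swap [force_scan]]'."""
    return _command("+", (memcg_id, node_id, max_gen_nr), (can_swap, force_scan))


def eviction_command(memcg_id, node_id, min_gen_nr, swappiness=None, nr_to_reclaim=None):
    """Build '- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]'."""
    return _command("-", (memcg_id, node_id, min_gen_nr), (swappiness, nr_to_reclaim))


# Made-up sample in the documented layout (not real kernel output).
SAMPLE = """\
memcg 1 /user.slice
 node 0
  5 40000 100 2000
  6 20000 300 1500
  7 1000 800 900
  8 10 50 60
"""
```

A scheduler-style caller would read ``/sys/kernel/debug/lru_gen`` (typically as root, with debugfs mounted), pass the text to `parse_lru_gen`, and write the strings returned by the command builders back to the same file.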
+2 -1
mm/Kconfig
···
 	# make sure folio->flags has enough spare bits
 	depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
 	help
-	  A high performance LRU implementation to overcommit memory.
+	  A high performance LRU implementation to overcommit memory. See
+	  Documentation/admin-guide/mm/multigen_lru.rst for details.
 
 config LRU_GEN_ENABLED
 	bool "Enable by default"
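
Illustration only, not part of the patch: whether a kernel was built with the two options named in the guide's quick start can be checked against the text of its configuration file. The helper `config_enabled` and the sample text are hypothetical; the sketch assumes the usual ``CONFIG_FOO=y`` and ``# CONFIG_FOO is not set`` line formats.

```python
def config_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is set to y in the given kernel config text."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


# Made-up config excerpt for demonstration.
SAMPLE_CONFIG = """\
CONFIG_LRU_GEN=y
# CONFIG_LRU_GEN_ENABLED is not set
CONFIG_LRU_GEN_STATS=y
"""

lru_gen_built = config_enabled(SAMPLE_CONFIG, "CONFIG_LRU_GEN")
lru_gen_default_on = config_enabled(SAMPLE_CONFIG, "CONFIG_LRU_GEN_ENABLED")
```

Where the configuration text comes from depends on the distribution, so the sketch takes it as a string rather than assuming a path.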
+4
mm/vmscan.c
···
 	return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
 }
 
+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
 static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
 			     const char *buf, size_t len)
 {
···
 	return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
 }
 
+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
 static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr,
 			     const char *buf, size_t len)
 {
···
 	seq_putc(m, '\n');
 }
 
+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
 static int lru_gen_seq_show(struct seq_file *m, void *v)
 {
 	unsigned long seq;
···
 	return err;
 }
 
+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
 static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
 				 size_t len, loff_t *pos)
 {
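
Illustration only, not part of the patch: the ``enabled`` attribute is printed with the ``"0x%04x\n"`` format visible in the hunk above, and the guide's kill switch table assigns one bit per component. A userspace reader could decode it roughly as follows. The names `COMPONENTS`, `decode_enabled` and `format_enabled` are hypothetical, and the component descriptions are shortened paraphrases of the table.

```python
# Bit values taken from the kill switch table in multigen_lru.rst.
COMPONENTS = {
    0x0001: "main switch",
    0x0002: "batched clearing of the accessed bit in leaf page table entries",
    0x0004: "clearing of the accessed bit in non-leaf page table entries",
}


def decode_enabled(raw: str) -> list:
    """Turn text such as '0x0005\\n' into the list of enabled components."""
    caps = int(raw.strip(), 16)
    return [name for bit, name in COMPONENTS.items() if caps & bit]


def format_enabled(caps: int) -> str:
    """Mirror the kernel's "0x%04x" formatting, without the newline."""
    return f"0x{caps:04x}"
```

For instance, the ``0x0005`` shown in the guide's example decodes to the main switch plus non-leaf clearing, with leaf batching turned off.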