Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: multi-gen LRU: improve design doc

This patch improves the design doc. Specifically,
1. add a section for the per-memcg mm_struct list, and
2. add a section for the PID controller.

Link: https://lkml.kernel.org/r/20230214035445.1250139-2-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by T.J. Alumbaugh and committed by Andrew Morton
32d32ef1 9a52b2f3

Total: +40 -6

Documentation/mm/multigen_lru.rst (+39 -5)
···
 ``folio->flags`` and therefore has a negligible cost. A feedback loop
 modeled after the PID controller monitors refaults over all the tiers
 from anon and file types and decides which tiers from which types to
-evict or protect.
+evict or protect. The desired effect is to balance refault percentages
+between anon and file types proportional to the swappiness level.
 
 There are two conceptually independent procedures: the aging and the
 eviction. They form a closed-loop system, i.e., the page reclaim.
···
    and memory sizes.
 2. It is more reliable because it is directly wired to the OOM killer.
 
+``mm_struct`` list
+------------------
+An ``mm_struct`` list is maintained for each memcg, and an
+``mm_struct`` follows its owner task to the new memcg when this task
+is migrated.
+
+A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
+``walk_page_range()`` with each ``mm_struct`` on this list to scan
+PTEs. When multiple page table walkers iterate the same list, each of
+them gets a unique ``mm_struct``, and therefore they can run in
+parallel.
+
+Page table walkers ignore any misplaced pages, e.g., if an
+``mm_struct`` was migrated, pages left in the previous memcg will be
+ignored when the current memcg is under reclaim. Similarly, page table
+walkers will ignore pages from nodes other than the one under reclaim.
+
+This infrastructure also tracks the usage of ``mm_struct`` between
+context switches so that page table walkers can skip processes that
+have been sleeping since the last iteration.
+
 Rmap/PT walk feedback
 ---------------------
 Searching the rmap for PTEs mapping each page on an LRU list (to test
···
 adds the PMD entry pointing to the PTE table to the Bloom filter. This
 forms a feedback loop between the eviction and the aging.
 
-Bloom Filters
+Bloom filters
 -------------
 Bloom filters are a space and memory efficient data structure for set
 membership test, i.e., test if an element is not in the set or may be
···
 is false positive, the cost is an additional scan of a range of PTEs,
 which may yield hot pages anyway. Parameters of the filter itself can
 control the false positive rate in the limit.
+
+PID controller
+--------------
+A feedback loop modeled after the Proportional-Integral-Derivative
+(PID) controller monitors refaults over anon and file types and
+decides which type to evict when both types are available from the
+same generation.
+
+The PID controller uses generations rather than the wall clock as the
+time domain because a CPU can scan pages at different rates under
+varying memory pressure. It calculates a moving average for each new
+generation to avoid being permanently locked in a suboptimal state.
 
 Memcg LRU
 ---------
···
 
 * Generations
 * Rmap walks
-* Page table walks
-* Bloom filters
-* PID controller
+* Page table walks via ``mm_struct`` list
+* Bloom filters for rmap/PT walk feedback
+* PID controller for refault feedback
 
 The aging and the eviction form a producer-consumer model;
 specifically, the latter drives the former by the sliding window over
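The new ``mm_struct`` list section says that concurrent page table walkers each get a unique ``mm_struct`` and skip processes that have been sleeping since the last iteration. The userspace sketch below illustrates that idea only; it is not the kernel's code. The names ``toy_mm``, ``toy_mm_list``, ``toy_mm_list_init()`` and ``toy_mm_list_next()`` are hypothetical, and the shared atomic cursor is an assumption chosen for brevity rather than the locking scheme used in mm/vmscan.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an mm_struct entry on a per-memcg list. */
struct toy_mm {
	int id;
	bool used_since_last_walk;	/* assumed to be set on context switch */
};

/* Hypothetical per-memcg list with one cursor shared by all walkers. */
struct toy_mm_list {
	struct toy_mm *mms;
	size_t nr;
	atomic_size_t cursor;
};

static void toy_mm_list_init(struct toy_mm_list *list,
			     struct toy_mm *mms, size_t nr)
{
	list->mms = mms;
	list->nr = nr;
	atomic_init(&list->cursor, 0);
}

/*
 * Hand out the next entry that was used since the last iteration.
 * atomic_fetch_add() gives every caller a distinct index, so two
 * walkers never receive the same entry and can scan in parallel.
 * Returns NULL once the list is exhausted.
 */
static struct toy_mm *toy_mm_list_next(struct toy_mm_list *list)
{
	for (;;) {
		size_t i = atomic_fetch_add(&list->cursor, 1);

		if (i >= list->nr)
			return NULL;
		if (list->mms[i].used_since_last_walk)
			return &list->mms[i];
		/* Otherwise the owner has been sleeping: skip it. */
	}
}
```

In this model a walker would loop on ``toy_mm_list_next()`` and scan the page tables of each entry it receives; the real walker calls ``walk_page_range()`` at that point.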
mm/vmscan.c (+1 -1)
···
 }
 
 /******************************************************************************
- * refault feedback loop
+ * PID controller
  ******************************************************************************/
 
 /*
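The doc text added by this patch describes a feedback loop that keeps a moving average per generation and balances refault percentages between anon and file types in proportion to swappiness. The following is a minimal userspace sketch of that behavior, not the code under the renamed comment in mm/vmscan.c. All names (``toy_ctrl``, ``toy_new_generation()``, ``toy_pick_type()``), the halving average, the gain formula and the 0 to 200 swappiness range are assumptions made for illustration.

```c
#include <assert.h>

#define TOY_MAX_SWAPPINESS 200UL	/* assumed range: 0..200 */

enum toy_type { TOY_ANON, TOY_FILE };

struct toy_ctrl {
	unsigned long avg_refaulted;	/* moving average of refaults */
	unsigned long avg_evicted;	/* moving average of evictions */
	unsigned long refaulted;	/* counts for the current generation */
	unsigned long evicted;
};

/*
 * Time advances per generation, not per wall-clock tick. Each new
 * generation folds the current counts into the averages and halves
 * the weight of older history, so a past bad state fades out.
 */
static void toy_new_generation(struct toy_ctrl *c)
{
	c->avg_refaulted = (c->avg_refaulted + c->refaulted) / 2;
	c->avg_evicted = (c->avg_evicted + c->evicted) / 2;
	c->refaulted = 0;
	c->evicted = 0;
}

/*
 * Pick the type to evict. The target is
 *   anon_rate : file_rate == anon_gain : file_gain
 * where rate = refaulted / evicted. Cross-multiplying avoids division.
 * Ties go to anon in this toy model; swappiness 0 always picks file.
 */
static enum toy_type toy_pick_type(const struct toy_ctrl *anon,
				   const struct toy_ctrl *file,
				   unsigned long swappiness)
{
	unsigned long anon_gain, file_gain, lhs, rhs;

	if (swappiness > TOY_MAX_SWAPPINESS)
		swappiness = TOY_MAX_SWAPPINESS;
	if (swappiness == 0)
		return TOY_FILE;

	anon_gain = swappiness;
	file_gain = TOY_MAX_SWAPPINESS - swappiness;

	lhs = anon->avg_refaulted * file->avg_evicted * file_gain;
	rhs = file->avg_refaulted * anon->avg_evicted * anon_gain;

	return lhs <= rhs ? TOY_ANON : TOY_FILE;
}
```

With equal eviction counts, a type that refaults less is picked for eviction at a neutral swappiness, while a low swappiness shifts the choice toward file pages. The sketch covers only the proportional comparison and the moving average; it does not model tiers.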