Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

workingset: refactor LRU refault to expose refault recency check

Patch series "cachestat: a new syscall for page cache state of files",
v13.

There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
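
To make the cost concrete, the mincore() approach described above can be sketched roughly as follows. This is a minimal illustration, not code from this series; the helper name count_resident_mincore() is hypothetical. Note that mincore() fills in one byte per page, all of which userspace has to walk.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the pages of @path that are resident in the
 * page cache, using mincore(). Returns the count, or -1 on error. If
 * @total_pages is non-NULL it receives the number of pages in the file.
 */
long count_resident_mincore(const char *path, long *total_pages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long resident = -1;
	unsigned char *vec = NULL;
	void *map = MAP_FAILED;
	size_t npages, i;
	struct stat st;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)
		goto out;

	npages = ((size_t)st.st_size + page_size - 1) / page_size;
	if (total_pages)
		*total_pages = (long)npages;
	if (npages == 0) {
		resident = 0;
		goto out;
	}

	/* The file must be mapped only so that mincore() has an address range. */
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/* One output byte per page: this is what userspace has to aggregate. */
	vec = malloc(npages);
	if (vec && mincore(map, st.st_size, vec) == 0) {
		resident = 0;
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	}

out:
	free(vec);
	if (map != MAP_FAILED)
		munmap(map, st.st_size);
	close(fd);
	return resident;
}
```

Every file visited costs an open, an mmap, a vector allocation proportional to the file size, a full scan of that vector, and an munmap, which is the overhead the paragraph above refers to.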

Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or direct
  table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for diagnosing performance
  issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
  (and IO to be done) within a range of a file, allowing for more
  frequent syncing when and where there is IO capacity, and batching
  when there is not.
* Computing memory usage of large files/directory trees, analogous to
  the du tool for disk usage.

More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/

This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for the x86 architecture.
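
For illustration, a userspace call might look roughly like the sketch below. The struct layouts and the syscall number 451 are assumptions based on the ABI as it was later merged upstream (x86-64 numbering), not something this message specifies; the "_sketch" names and query_cachestat() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451	/* assumption: x86-64 number used upstream */
#endif

/* Assumed mirror of the uapi range argument: byte offset and length. */
struct cachestat_range_sketch {
	uint64_t off;
	uint64_t len;
};

/* Assumed mirror of the uapi result: one counter per page state. */
struct cachestat_sketch {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Hypothetical wrapper: query the page cache state of [off, off + len)
 * in the file behind @fd. Returns 0 on success or a negative errno
 * (for example -ENOSYS on kernels without the syscall).
 */
int query_cachestat(int fd, uint64_t off, uint64_t len,
		    struct cachestat_sketch *cs)
{
	struct cachestat_range_sketch range = { .off = off, .len = len };

	if (syscall(__NR_cachestat, fd, &range, cs, 0) == 0)
		return 0;
	return -errno;
}
```

Unlike the mincore() approach, the caller needs only an open fd, and the result is a fixed-size summary regardless of the size of the range.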

This interface is inspired by past discussions of, and concerns with,
fincore, which has a similar design to mincore (and, as a result, similar
issues). Relevant links:

https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html


I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. The user can choose between
mincore and cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, with five runs
for each option:

Using cachestat:
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689

Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046

I also ran both syscalls on a 2TB sparse file:

Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s

Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s

Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
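
A back-of-the-envelope calculation shows why. The sketch below assumes "2TB" means 2 TiB and a 4 KiB page size; neither is stated above, and the helper name is hypothetical. mincore() fills one byte per page whether or not anything is cached, so the size of the output grows linearly with the file.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes mincore() has to fill in for a
 * mapping of @file_bytes bytes, at one byte per page.
 */
uint64_t mincore_vec_bytes(uint64_t file_bytes, uint64_t page_size)
{
	return (file_bytes + page_size - 1) / page_size;
}
```

Under those assumptions, mincore_vec_bytes(1ULL << 41, 4096) is 2^29 = 536,870,912, i.e. a 512 MiB vector that the kernel writes and userspace scans even for a completely sparse file, whereas cachestat returns a handful of counters.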

Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.

The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows for a 64-bit length argument, even on 32-bit architectures
(which was previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
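
The argument-count problem can be sketched as follows. This is an illustrative model only: it assumes the usual limit of six syscall arguments, a hypothetical flat signature (fd, off, len, cstat, flags) with 64-bit off and len, and the range being passed by pointer in a struct instead. All names below are made up for the example.

```c
#include <stdint.h>

enum { MAX_SYSCALL_ARGS = 6 };	/* assumption: usual Linux syscall limit */

/* Hypothetical range argument: two 64-bit fields, 16 bytes in total. */
struct range_sketch {
	uint64_t off;
	uint64_t len;
};
_Static_assert(sizeof(struct range_sketch) == 16,
	       "two 64-bit fields occupy 16 bytes");

/*
 * Argument slots needed by the flat signature (fd, off, len, cstat, flags)
 * when each slot is @word_bytes wide and off/len are 64-bit values.
 */
unsigned int flat_arg_slots(unsigned int word_bytes)
{
	unsigned int u64_slots = 8 / word_bytes;

	return 1 + u64_slots + u64_slots + 1 + 1;
}

/* Slots needed when the range is passed by pointer: fd, range, cstat, flags. */
unsigned int pointer_arg_slots(void)
{
	return 4;
}
```

With 4-byte words the flat form needs 7 slots, one more than the limit, while the pointer form needs 4 on every architecture. Because the struct has the same field widths everywhere, no separate compat entry point is required.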


This patch (of 4):

In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.

[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Nhat Pham, committed by Andrew Morton
ffcb5f52 18b1d18b

+102 -47
+1
include/linux/swap.h
···
 }

 /* linux/mm/workingset.c */
+bool workingset_test_recent(void *shadow, bool file, bool *workingset);
 void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
 void workingset_refault(struct folio *folio, void *shadow);
+101 -47
mm/workingset.c
···
 	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
 }

+/*
+ * Tests if the shadow entry is for a folio that was recently evicted.
+ * Fills in @memcgid, @pglist_data, @token, @workingset with the values
+ * unpacked from shadow.
+ */
+static bool lru_gen_test_recent(void *shadow, bool file, int *memcgid,
+		struct pglist_data **pgdat, unsigned long *token, bool *workingset)
+{
+	struct mem_cgroup *eviction_memcg;
+	struct lruvec *lruvec;
+	struct lru_gen_folio *lrugen;
+	unsigned long min_seq;
+
+	unpack_shadow(shadow, memcgid, pgdat, token, workingset);
+	eviction_memcg = mem_cgroup_from_id(*memcgid);
+
+	lruvec = mem_cgroup_lruvec(eviction_memcg, *pgdat);
+	lrugen = &lruvec->lrugen;
+
+	min_seq = READ_ONCE(lrugen->min_seq[file]);
+	return (*token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
+}
+
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
 	int hist, tier, refs;
···
 	int type = folio_is_file_lru(folio);
 	int delta = folio_nr_pages(folio);

-	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
-
-	if (pgdat != folio_pgdat(folio))
-		return;
-
 	rcu_read_lock();
+
+	if (!lru_gen_test_recent(shadow, type, &memcg_id, &pgdat, &token,
+			&workingset))
+		goto unlock;

 	memcg = folio_memcg_rcu(folio);
 	if (memcg_id != mem_cgroup_id(memcg))
 		goto unlock;

+	if (pgdat != folio_pgdat(folio))
+		goto unlock;
+
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	lrugen = &lruvec->lrugen;
-
 	min_seq = READ_ONCE(lrugen->min_seq[type]);
-	if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
-		goto unlock;

 	hist = lru_hist_from_seq(min_seq);
 	/* see the comment in folio_lru_refs() */
···
 static void *lru_gen_eviction(struct folio *folio)
 {
 	return NULL;
+}
+
+static bool lru_gen_test_recent(void *shadow, bool file, int *memcgid,
+		struct pglist_data **pgdat, unsigned long *token, bool *workingset)
+{
+	return false;
 }

 static void lru_gen_refault(struct folio *folio, void *shadow)
···
 }

 /**
- * workingset_refault - Evaluate the refault of a previously evicted folio.
- * @folio: The freshly allocated replacement folio.
- * @shadow: Shadow entry of the evicted folio.
+ * workingset_test_recent - tests if the shadow entry is for a folio that was
+ * recently evicted. Also fills in @workingset with the value unpacked from
+ * shadow.
+ * @shadow: the shadow entry to be tested.
+ * @file: whether the corresponding folio is from the file lru.
+ * @workingset: where the workingset value unpacked from shadow should
+ * be stored.
  *
- * Calculates and evaluates the refault distance of the previously
- * evicted folio in the context of the node and the memcg whose memory
- * pressure caused the eviction.
+ * Return: true if the shadow is for a recently evicted folio; false otherwise.
  */
-void workingset_refault(struct folio *folio, void *shadow)
+bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 {
-	bool file = folio_is_file_lru(folio);
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
 	unsigned long workingset_size;
-	struct pglist_data *pgdat;
-	struct mem_cgroup *memcg;
-	unsigned long eviction;
-	struct lruvec *lruvec;
 	unsigned long refault;
-	bool workingset;
 	int memcgid;
-	long nr;
+	struct pglist_data *pgdat;
+	unsigned long eviction;

-	if (lru_gen_enabled()) {
-		lru_gen_refault(folio, shadow);
-		return;
-	}
+	if (lru_gen_enabled())
+		return lru_gen_test_recent(shadow, file, &memcgid, &pgdat, &eviction,
+					workingset);

-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
 	eviction <<= bucket_order;

-	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
-
-	rcu_read_lock();
 	/*
 	 * Look up the memcg associated with the stored ID. It might
 	 * have been deleted since the folio's eviction.
···
 	 */
 	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
-		goto out;
+		return false;
+
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->nonresident_age);

···
 	refault_distance = (refault - eviction) & EVICTION_MASK;

 	/*
-	 * The activation decision for this folio is made at the level
-	 * where the eviction occurred, as that is where the LRU order
-	 * during folio reclaim is being determined.
-	 *
-	 * However, the cgroup that will own the folio is the one that
-	 * is actually experiencing the refault event.
-	 */
-	nr = folio_nr_pages(folio);
-	memcg = folio_memcg(folio);
-	pgdat = folio_pgdat(folio);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
-	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
-	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
 	 * all the memory was available to the workingset. Whether
···
 						     NR_INACTIVE_ANON);
 		}
 	}
-	if (refault_distance > workingset_size)
+
+	return refault_distance <= workingset_size;
+}
+
+/**
+ * workingset_refault - Evaluate the refault of a previously evicted folio.
+ * @folio: The freshly allocated replacement folio.
+ * @shadow: Shadow entry of the evicted folio.
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted folio in the context of the node and the memcg whose memory
+ * pressure caused the eviction.
+ */
+void workingset_refault(struct folio *folio, void *shadow)
+{
+	bool file = folio_is_file_lru(folio);
+	struct pglist_data *pgdat;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+	bool workingset;
+	long nr;
+
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		return;
+	}
+
+	/* Flush stats (and potentially sleep) before holding RCU read lock */
+	mem_cgroup_flush_stats_ratelimited();
+
+	rcu_read_lock();
+
+	/*
+	 * The activation decision for this folio is made at the level
+	 * where the eviction occurred, as that is where the LRU order
+	 * during folio reclaim is being determined.
+	 *
+	 * However, the cgroup that will own the folio is the one that
+	 * is actually experiencing the refault event.
+	 */
+	nr = folio_nr_pages(folio);
+	memcg = folio_memcg(folio);
+	pgdat = folio_pgdat(folio);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
+
+	if (!workingset_test_recent(shadow, file, &workingset))
 		goto out;

 	folio_set_active(folio);
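
The core of the recency test in the non-MGLRU path is the masked subtraction followed by a comparison against the workingset size. A standalone model of just that arithmetic is sketched below. The 8-bit mask is an assumption chosen so that wraparound is easy to see (the kernel derives EVICTION_MASK from the bits available in a shadow entry), and the "_sketch" names are hypothetical.

```c
#include <stdbool.h>

/* Assumption: a tiny 8-bit counter so that wraparound is easy to follow. */
#define EVICTION_MASK_SKETCH 0xffUL

/*
 * Model of the distance check: @eviction is the nonresident age recorded
 * at eviction time, @refault is the current nonresident age, both already
 * truncated to the mask. The folio counts as recently evicted if the
 * distance between the two fits within @workingset_size.
 */
bool test_recent_sketch(unsigned long eviction, unsigned long refault,
			unsigned long workingset_size)
{
	/* Unsigned subtraction plus mask stays correct across counter wrap. */
	unsigned long refault_distance =
		(refault - eviction) & EVICTION_MASK_SKETCH;

	return refault_distance <= workingset_size;
}
```

For example, with eviction = 250 and refault = 4 (the counter wrapped past 255), the distance is (4 - 250) & 0xff = 10, so the check passes for a workingset size of 16 and fails for a size of 8.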