Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/vmscan: throttle reclaim until some writeback completes if congested

Patch series "Remove dependency on congestion_wait in mm/", v5.

This series removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but
congestion_wait has been broken for a long time [1].

Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one reason
that reclaim may be slow, reclaim can also fail to make progress for
other reasons (elevated page references, too many pages isolated,
excessive LRU contention etc).

This series replaces the "congestion" throttling with 3 different types.

- If there are too many dirty/writeback pages, sleep until a timeout or
enough pages get cleaned

- If too many pages are isolated, sleep until enough isolated pages are
either reclaimed or put back on the LRU

- If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
ssh timeout if executing remotely, as ssh itself can get throttled and
the connection may time out.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.

- One "anon latency" worker creates small mappings with mmap() and
times how long it takes to fault the mapping reading it 4K at a time

- X file writers, which are fio instances randomly writing X files
where the total size of the files adds up to the allowed dirty_ratio.
fio is allowed to run for a warmup period so that some file-backed
pages can accumulate. The duration of the warmup is based on the
best-case linear write speed of the storage.

- Y file readers, which are fio instances randomly reading small files

- Z anon memory hogs which continually map (100-dirty_ratio)% of memory

- Total estimated WSS = (100+dirty_ratio) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results. For example, the other two machines had much better latencies.

First, the results of the "anon latency" worker

stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r4
Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%)
Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%)
Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%)
Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%)
Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%)
Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%)
Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%)
Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%)
Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%)
Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%)
Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%)
Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%)
Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%)
Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%)
Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%)
Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%)
Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%)
Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%)
Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%)
Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%)
Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%)
Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%)
Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%)
Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%)
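The percentage deltas in these tables follow the usual mmtests
convention for lower-is-better metrics such as latency. As an
illustrative sketch (the helper name is made up, and recomputing from
the printed, rounded means only approximately reproduces the printed
deltas):

```python
# Illustrative sketch of the mmtests-style percentage delta for
# lower-is-better metrics: positive means the patched kernel improved
# on the baseline, negative means a regression.
def mmtests_delta(baseline, patched):
    return 100.0 * (baseline - patched) / baseline

# e.g. Amean mmap-48: baseline 8106.8160, patched 1070.5616 -> ~86.79%
```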

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout, increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each task's rate of page allocation
versus progress of reclaim. The variance is also impacted for high
worker counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show
the massive outliers which are increased due to stalling.

It is expected that this will be very machine dependent. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes whereas fio is randomly writing multiple files from
multiple tasks, so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory

Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%*
Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%*
Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%)
Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%)
Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%)

The overall system CPU usage and elapsed time is as follows

5.15.0-rc3 5.15.0-rc3
vanilla mm-reclaimcongest-v5r4
Duration User 6989.03 983.42
Duration System 7308.12 799.68
Duration Elapsed 2277.67 2092.98

The patches reduce system CPU usage by 89% as the vanilla kernel rarely
stalls and instead keeps scanning.

The high-level /proc/vmstats show

5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned 1056608451.00 503594991.00
Ops Kswapd pages scanned 109795048.00 147289810.00
Ops Kswapd pages reclaimed 63269243.00 31036005.00
Ops Direct pages reclaimed 10803973.00 6328887.00
Ops Kswapd efficiency % 57.62 21.07
Ops Kswapd velocity 48204.98 57572.86
Ops Direct efficiency % 1.02 1.26
Ops Direct velocity 463898.83 196845.97

Kswapd scanned fewer pages but the detailed pattern is different. The
vanilla kernel scans slowly over time whereas the patched kernel
exhibits bursts of scan activity. Direct reclaim scanning is reduced by
52% due to stalling.

The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel, when reclaiming, shows pages
being reclaimed over a period of time whereas the patched kernel tends
to reclaim in spikes. The difference is that vanilla is not throttling
and instead constantly scanning, finding some pages over time, whereas
the patched kernel throttles and reclaims in spikes.

Ops Percentage direct scans 90.59 77.37

With the vanilla kernel, direct reclaim accounted for 90.59% of pages
scanned whereas with the patches it accounted for 77.37%, due to
throttling.

Ops Page writes by reclaim 2613590.00 1687131.00

Page writes from reclaim context are reduced.

Ops Page writes anon 2932752.00 1917048.00

And there is less swapping.

Ops Page reclaim immediate 996248528.00 107664764.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.

Ops Slabs scanned 164284.00 153608.00

Slab scan activity is similar.

ftrace was used to gather stall activity

Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The vast majority of wait_iff_congested calls do not stall at all.
What is likely happening is that cond_resched() reschedules the task
for a short period when the BDI is not registering congestion (which it
never will in this test setup).
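For reference, histograms like the one above can be rebuilt from the
raw ftrace buffer by counting identical event payloads (the moral
equivalent of `grep | sort | uniq -c`). A sketch, with the assumption
that each trace line carries the usual task/CPU/timestamp prefix before
the tracepoint name:

```python
# Hedged sketch: count identical tracepoint payloads in ftrace output.
# Only the tracepoint names come from this report; the line prefix
# format is an assumption about standard ftrace output.
import re
from collections import Counter

def stall_histogram(lines, tracepoint):
    # Keep everything from the tracepoint name onwards, count duplicates,
    # and return (payload, count) pairs sorted by ascending count.
    pat = re.compile(re.escape(tracepoint) + r":.*")
    counts = Counter(m.group(0) for line in lines if (m := pat.search(line)))
    return sorted(counts.items(), key=lambda kv: kv[1])
```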

1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait, if called, always exceeds the timeout as there is no
trigger to wake it up early.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout. For direct reclaim, the
number of stalls for each reason was

6624 reason=VMSCAN_THROTTLE_ISOLATED
93246 reason=VMSCAN_THROTTLE_NOPROGRESS
96934 reason=VMSCAN_THROTTLE_WRITEBACK

The most common reason to stall was excessive pages tagged for
immediate reclaim at the tail of the LRU, followed by a failure to make
forward progress. A relatively small number were due to too many pages
being isolated from the LRU by parallel threads.

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all. A small number reached the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over
the map

1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.
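As a rough cross-check, the share of events that slept for the full
timeout can be computed straight from a histogram in the format above
(a sketch; the `count usec_timeout=... usect_delayed=... reason=...`
layout is taken from the report's own formatting):

```python
# Sketch: given histogram lines of the form
#   "<count> usec_timeout=<T> usect_delayed=<D> reason=<R>"
# return the fraction of throttle events that slept the full timeout.
def timeout_hit_fraction(histogram_lines):
    total = hit = 0
    for line in histogram_lines:
        parts = line.split()
        count = int(parts[0])
        fields = dict(p.split("=", 1) for p in parts[1:])
        total += count
        if int(fields["usect_delayed"]) >= int(fields["usec_timeout"]):
            hit += count
    return hit / total
```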

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.

This patch (of 5):

Page reclaim throttles on wait_iff_congested under the following
conditions:

- kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.

- Direct reclaim will stall if all dirty pages are backed by congested
inodes.

wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleep until the
timeout expires.

[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]

Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Mel Gorman and committed by Linus Torvalds
8cd7c588 cb75463c

+135 -68
include/linux/backing-dev.h (-1)

···
 }

 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);

 static inline bool mapping_can_writeback(struct address_space *mapping)
 {
include/linux/mmzone.h (+13)

···
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
···
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	LRU_UNEVICTABLE,
 	NR_LRU_LISTS
+};
+
+enum vmscan_throttle_state {
+	VMSCAN_THROTTLE_WRITEBACK,
+	NR_VMSCAN_THROTTLE,
 };

 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
···
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+
+	/* workqueues for throttling reclaim for different reasons. */
+	wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
+
+	atomic_t nr_writeback_throttled;/* nr of writeback-throttled tasks */
+	unsigned long nr_reclaim_start;	/* nr pages written while throttled
+					 * when throttling started. */
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
include/trace/events/vmscan.h (+34)

···
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"

+#define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
+
+#define show_throttle_flags(flags)					\
+	(flags) ? __print_flags(flags, "|",				\
+		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"} \
+		) : "VMSCAN_THROTTLE_NONE"
+
+
 #define trace_reclaim_flags(file) ( \
 	(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
 	(RECLAIM_WB_ASYNC) \
···
 	TP_ARGS(nr_reclaimed)
 );

+TRACE_EVENT(mm_vmscan_throttled,
+
+	TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason),
+
+	TP_ARGS(nid, usec_timeout, usec_delayed, reason),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, usec_timeout)
+		__field(int, usec_delayed)
+		__field(int, reason)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->usec_timeout = usec_timeout;
+		__entry->usec_delayed = usec_delayed;
+		__entry->reason = 1U << reason;
+	),
+
+	TP_printk("nid=%d usec_timeout=%d usect_delayed=%d reason=%s",
+		__entry->nid,
+		__entry->usec_timeout,
+		__entry->usec_delayed,
+		show_throttle_flags(__entry->reason))
+);
 #endif /* _TRACE_VMSCAN_H */

 /* This part must be outside protection */
include/trace/events/writeback.h (-7)

···
 	TP_ARGS(usec_timeout, usec_delayed)
 );

-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 DECLARE_EVENT_CLASS(writeback_single_inode_template,

 	TP_PROTO(struct inode *inode,
mm/backing-dev.c (-48)

···
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	/*
-	 * If there is no congestion, yield if necessary instead
-	 * of sleeping on the congestion queue
-	 */
-	if (atomic_read(&nr_wb_congested[sync]) == 0) {
-		cond_resched();
-
-		/* In case we scheduled, work out time remaining */
-		ret = timeout - (jiffies - start);
-		if (ret < 0)
-			ret = 0;
-
-		goto out;
-	}
-
-	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-out:
-	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
mm/filemap.c (+1)

···
 
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
+	acct_reclaim_writeback(page);
 	put_page(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
mm/internal.h (+11)

···
 
 void page_writeback_init(void);

+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+						int nr_throttled);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	int nr_throttled = atomic_read(&pgdat->nr_writeback_throttled);
+
+	if (nr_throttled)
+		__acct_reclaim_writeback(pgdat, page, nr_throttled);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);

 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
mm/page_alloc.c (+5)

···
 
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
+	int i;
+
 	pgdat_resize_init(pgdat);

 	pgdat_init_split_queue(pgdat);
···
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+
+	for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
+		init_waitqueue_head(&pgdat->reclaim_wait[i]);

 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
+70 -12
mm/vmscan.c
···
 	unlock_page(page);
 }

+static void
+reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+		 long timeout)
+{
+	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
+	long ret;
+	DEFINE_WAIT(wait);
+
+	/*
+	 * Do not throttle IO workers, kthreads other than kswapd or
+	 * workqueues. They may be required for reclaim to make
+	 * forward progress (e.g. journalling workqueues or kthreads).
+	 */
+	if (!current_is_kswapd() &&
+	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
+		return;
+
+	if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
+		WRITE_ONCE(pgdat->nr_reclaim_start,
+			node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+	}
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	atomic_dec(&pgdat->nr_writeback_throttled);
+
+	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
+				jiffies_to_usecs(timeout - ret),
+				reason);
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+							int nr_throttled)
+{
+	unsigned long nr_written;
+
+	inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+
+	/*
+	 * This is an inaccurate read as the per-cpu deltas may not
+	 * be synchronised. However, given that the system is
+	 * writeback throttled, it is not worth taking the penalty
+	 * of getting an accurate count. At worst, the throttle
+	 * timeout guarantees forward progress.
+	 */
+	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
+		READ_ONCE(pgdat->nr_reclaim_start);
+
+	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+		wake_up(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
+}
+
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
···

 		/*
 		 * The number of dirty pages determines if a node is marked
-		 * reclaim_congested which affects wait_iff_congested. kswapd
-		 * will stall and start writing pages if the tail of the LRU
-		 * is all dirty unqueued pages.
+		 * reclaim_congested. kswapd will stall and start writing
+		 * pages if the tail of the LRU is all dirty unqueued pages.
 		 */
 		page_check_dirty_writeback(page, &dirty, &writeback);
 		if (dirty || writeback)
···
 		 * If kswapd scans pages marked for immediate
 		 * reclaim and under writeback (nr_immediate), it
 		 * implies that pages are cycling through the LRU
-		 * faster than they are written so also forcibly stall.
+		 * faster than they are written so forcibly stall
+		 * until some pages complete writeback.
 		 */
 		if (sc->nr.immediate)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
 	}

 	/*
-	 * Tag a node/memcg as congested if all the dirty pages
-	 * scanned were backed by a congested BDI and
-	 * wait_iff_congested will stall.
+	 * Tag a node/memcg as congested if all the dirty pages were marked
+	 * for writeback and immediate reclaim (counted in nr.congested).
 	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
-	 * stalling in wait_iff_congested().
+	 * stalling in reclaim_throttle().
 	 */
 	if ((current_is_kswapd() ||
 	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
 	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
 		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);

 	/*
-	 * Stall direct reclaim for IO completions if underlying BDIs
-	 * and node is congested. Allow kswapd to continue until it
+	 * Stall direct reclaim for IO completions if the lruvec's
+	 * node is congested. Allow kswapd to continue until it
 	 * starts encountering unqueued dirty pages or cycling through
 	 * the LRU too quickly.
 	 */
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);

 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
···

 	WRITE_ONCE(pgdat->kswapd_order, 0);
 	WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+	atomic_set(&pgdat->nr_writeback_throttled, 0);
 	for ( ; ; ) {
 		bool ret;
+1
mm/vmstat.c
···
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
+	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",