NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst

This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:

/sys/kernel/debug/nfsd/io_cache_read
/sys/kernel/debug/nfsd/io_cache_write

This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).

Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Documentation/filesystems/nfs/nfsd-io-modes.rst
.. SPDX-License-Identifier: GPL-2.0

=============
NFSD IO MODES
=============

Overview
========

NFSD has historically used buffered IO when servicing READ and WRITE
operations. BUFFERED is NFSD's default IO mode, but it is possible to
override that default to use either the DONTCACHE or DIRECT IO modes.

Experimental NFSD debugfs interfaces are available to allow the NFSD IO
mode used for READ and WRITE to be configured independently. See both:
- /sys/kernel/debug/nfsd/io_cache_read
- /sys/kernel/debug/nfsd/io_cache_write

The default value for both io_cache_read and io_cache_write reflects
NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).

Based on the configured settings, NFSD's IO will either be:
- cached using the page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache on completion (NFSD_IO_DONTCACHE=1)
- not cached, stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)

To set an NFSD IO mode, write a supported value (0 - 2) to the
corresponding IO operation's debugfs interface, e.g.:
echo 2 > /sys/kernel/debug/nfsd/io_cache_read
echo 2 > /sys/kernel/debug/nfsd/io_cache_write

To check which IO mode NFSD is using for READ or WRITE, simply read the
corresponding IO operation's debugfs interface, e.g.:
cat /sys/kernel/debug/nfsd/io_cache_read
cat /sys/kernel/debug/nfsd/io_cache_write

If you experiment with NFSD's IO modes on a recent kernel and have
interesting results, please report them to linux-nfs@vger.kernel.org.

NFSD DONTCACHE
==============

DONTCACHE offers a hybrid approach to servicing IO that aims to offer
the benefits of DIRECT IO without any of the strict alignment
requirements that DIRECT IO imposes. To achieve this, buffered IO is
used but the IO is flagged to "drop behind" (meaning the associated
pages are dropped from the page cache) when the IO completes.

DONTCACHE aims to avoid what has proven to be a fairly significant
limitation of Linux's memory management subsystem if/when large amounts
of data are infrequently accessed (e.g. read once _or_ written once but
not read until much later). Such use-cases are particularly problematic
because the page cache will eventually become a bottleneck to servicing
new IO requests.

For more context on DONTCACHE, please see these Linux commit headers:
- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
  to take a struct kiocb")
- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
  RWF_DONTCACHE")
- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")

NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
filesystem doesn't indicate support by setting FOP_DONTCACHE.
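A rough way to observe the drop-behind behaviour is to watch the NFS
server's page cache while a client reads a large exported file (the
client-side path below is only a placeholder):

On the server:
echo 1 > /sys/kernel/debug/nfsd/io_cache_read
grep ^Cached /proc/meminfo

On the client:
dd if=/mnt/export/largefile of=/dev/null bs=1M

On the server again:
grep ^Cached /proc/meminfo

With BUFFERED, "Cached" typically grows by roughly the size of the file
read; with DONTCACHE it should stay close to its starting value.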
NFSD DIRECT
===========

DIRECT IO doesn't make use of the page cache; as such, it is able to
avoid the Linux memory management subsystem's page reclaim scalability
problems without resorting to the hybrid use of the page cache that
DONTCACHE employs.

Some workloads benefit from NFSD avoiding the page cache, particularly
those with a working set that is significantly larger than available
system memory. The pathological worst-case workload that NFSD DIRECT has
proven to help most is an NFS client issuing large sequential IO to a
file that is 2-3 times larger than the NFS server's available system
memory. The reason for such improvement is that NFSD DIRECT eliminates a
lot of work that the memory management subsystem would otherwise be
required to perform (e.g. page allocation, dirty writeback, page
reclaim). When using NFSD DIRECT, kswapd and kcompactd no longer consume
CPU time trying to find adequate free pages so that forward IO progress
can be made.

The performance win associated with using NFSD DIRECT was previously
discussed on linux-nfs, see:
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
But in summary:
- NFSD DIRECT can significantly reduce memory requirements
- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
- NFSD DIRECT can offer more deterministic IO performance

As always, your mileage may vary, so it is important to carefully
consider if/when it is beneficial to make use of NFSD DIRECT. When
assessing the comparative performance of your workload, be sure to log
relevant performance metrics during testing (e.g. memory usage, CPU
usage, IO performance). Using perf to collect profile data, and
generating a "flamegraph" of the work Linux must perform on behalf of
your test, is a meaningful way to compare the relative health of the
system and to see how switching NFSD's IO mode changes what is observed.

If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
NFSD's debugfs interfaces, ideally the IO will be aligned relative to
the underlying block device's logical_block_size. Also, the memory
buffer used to store the READ or WRITE payload must be aligned relative
to the underlying block device's dma_alignment.
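Both limits can be read from sysfs on the NFS server. For example,
assuming the exported filesystem sits on a hypothetical block device
named sdX (dma_alignment is exposed on recent kernels):

cat /sys/block/sdX/queue/logical_block_size
cat /sys/block/sdX/queue/dma_alignment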
But NFSD DIRECT does handle IO that is misaligned with respect to
O_DIRECT's requirements as best it can:

Misaligned READ:
  If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
  DIO-aligned block (on either end of the READ). The expanded READ is
  verified to have a proper offset/len (logical_block_size) and to pass
  dma_alignment checking.

Misaligned WRITE:
  If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
  middle and end as needed. The large middle segment is DIO-aligned
  and the start and/or end are misaligned. Buffered IO is used for the
  misaligned segments and O_DIRECT is used for the middle DIO-aligned
  segment. DONTCACHE buffered IO is _not_ used for the misaligned
  segments because using normal buffered IO offers a significant RMW
  performance benefit when handling streaming misaligned WRITEs.

Tracing:
  The nfsd_read_direct trace event shows how NFSD expands any
  misaligned READ to the next DIO-aligned block (on either end of the
  original READ, as needed).

  This combination of trace events is useful for READs:
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable

  The nfsd_write_direct trace event shows how NFSD splits a given
  misaligned WRITE into a DIO-aligned middle segment.

  This combination of trace events is useful for WRITEs:
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
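  Once the desired trace events are enabled, the resulting trace can be
  watched while the test workload runs, e.g.:
  cat /sys/kernel/tracing/trace_pipe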