Documentation/filesystems/netfs_library.rst at v6.16-rc6

.. SPDX-License-Identifier: GPL-2.0

===================================
Network Filesystem Services Library
===================================

.. Contents:

 - Overview.
   - Requests and streams.
   - Subrequests.
   - Result collection and retry.
   - Local caching.
   - Content encryption (fscrypt).
 - Per-inode context.
   - Inode context helper functions.
   - Inode locking.
   - Inode writeback.
 - High-level VFS API.
   - Unlocked read/write iter.
   - Pre-locked read/write iter.
   - Memory-mapped I/O API.
   - Monolithic files API.
 - High-level VM API.
   - Deprecated PG_private_2 API.
 - I/O request API.
   - Request structure.
   - Stream structure.
   - Subrequest structure.
   - Filesystem methods.
   - Terminating a subrequest.
   - Local cache API.
 - API function reference.


Overview
========

The network filesystem services library, netfslib, is a set of functions
designed to aid a network filesystem in implementing VM/VFS API operations.  It
takes over the normal buffered read, readahead, write and writeback and also
handles unbuffered and direct I/O.

The library provides support for (re-)negotiation of I/O sizes and retrying
failed I/O as well as local caching and will, in the future, provide content
encryption.

It insulates the filesystem from VM interface changes as much as possible and
handles VM features such as large multipage folios.  The filesystem basically
just has to provide a way to perform read and write RPC calls.

Netfslib organises I/O using three kinds of object:

 * A *request*.  A request is used to track the progress of the I/O overall and
   to hold on to resources.  The collection of results is done at the request
   level.  The I/O within a request is divided into a number of parallel
   streams of subrequests.

 * A *stream*.  A non-overlapping series of subrequests.  The subrequests
   within a stream do not have to be contiguous.

 * A *subrequest*.  This is the basic unit of I/O.  It represents a single RPC
   call or a single cache I/O operation.  The library passes these to the
   filesystem and the cache to perform.

Requests and Streams
--------------------

When actually performing I/O (as opposed to just copying into the pagecache),
netfslib will create one or more requests to track the progress of the I/O and
to hold resources.

A read operation will have a single stream and the subrequests within that
stream may be of mixed origins, for instance mixing RPC subrequests and cache
subrequests.

On the other hand, a write operation may have multiple streams, where each
stream targets a different destination.  For instance, there may be one stream
writing to the local cache and one to the server.  Currently, only two streams
are allowed, but this could be increased if parallel writes to multiple servers
are desired.

The subrequests within a write stream do not need to match alignment or size
with the subrequests in another write stream and netfslib performs the tiling
of subrequests in each stream over the source buffer independently.  Further,
each stream may contain holes that don't correspond to holes in the other
stream.

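The independent tiling of write streams can be illustrated with a small
userspace model.  This is only a sketch: ``struct toy_stream`` and
``toy_tile_stream()`` are hypothetical names invented for the illustration and
are not part of netfslib, and the per-stream limits are assumed values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a write stream (not a kernel type). */
struct toy_stream {
	size_t max_len;		/* Largest subrequest this target accepts */
	size_t nr_subreqs;	/* Subrequests generated so far */
};

/*
 * Tile @len bytes of the source buffer into subrequests of at most
 * s->max_len bytes each and return the number of subrequests generated.
 * Each stream is tiled without reference to any other stream.
 */
static size_t toy_tile_stream(struct toy_stream *s, size_t len)
{
	assert(s->max_len > 0);

	while (len > 0) {
		size_t part = len < s->max_len ? len : s->max_len;

		len -= part;
		s->nr_subreqs++;
	}
	return s->nr_subreqs;
}
```

With an assumed 64KiB limit on the server stream and a 256KiB limit on the
cache stream, the same 1MiB of dirty data ends up as a different number of
subrequests in each stream.
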
In addition, the subrequests do not need to correspond to the boundaries of the
folios or vectors in the source/destination buffer.  The library handles the
collection of results and the wrangling of folio flags and references.

Subrequests
-----------

Subrequests are at the heart of the interaction between netfslib and the
filesystem using it.  Each subrequest is expected to correspond to a single
read or write RPC or cache operation.  The library will stitch together the
results from a set of subrequests to provide a higher level operation.

Netfslib has two interactions with the filesystem or the cache when setting up
a subrequest.  First, there's an optional preparatory step that allows the
filesystem to negotiate the limits on the subrequest, both in terms of maximum
number of bytes and maximum number of vectors (e.g. for RDMA).  This may
involve negotiating with the server (e.g. cifs needing to acquire credits).

And, secondly, there's the issuing step in which the subrequest is handed off
to the filesystem to perform.

Note that these two steps are done slightly differently between read and write:

 * For reads, the VM/VFS tells us how much is being requested up front, so the
   library can preset maximum values that the cache and then the filesystem
   can reduce.  The cache also gets consulted first on whether it wants to do
   a read before the filesystem is consulted.

 * For writeback, it is unknown how much there will be to write until the
   pagecache is walked, so no limit is set by the library.

Once a subrequest is completed, the filesystem or cache informs the library of
the completion and then collection is invoked.  Depending on whether the
request is synchronous or asynchronous, the collection of results will be done
in either the application thread or in a work queue.

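As a rough illustration of the read-side preparatory step, the following
userspace sketch models a limit that is preset by the library and then only
ever lowered, first by the cache and then by the filesystem.  The function name
and the convention that a limit of 0 means "no opinion" are assumptions made
for this example, not netfslib behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Return the smaller of two sizes. */
static size_t toy_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of read-size negotiation.  The library presets the
 * maximum to the amount requested; the cache and then the filesystem may
 * reduce it but never raise it.  A limit of 0 is taken to mean "no limit"
 * (an assumption of this sketch).
 */
static size_t toy_negotiate_read_len(size_t requested, size_t cache_limit,
				     size_t fs_limit)
{
	size_t max_len = requested;	/* Preset by the library */

	assert(requested > 0);

	if (cache_limit)
		max_len = toy_min(max_len, cache_limit);
	if (fs_limit)
		max_len = toy_min(max_len, fs_limit);
	return max_len;
}
```
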
Result Collection and Retry
---------------------------

As subrequests complete, the results are collected and collated by the library
and folio unlocking is performed progressively (if appropriate).  Once the
request is complete, async completion will be invoked (again, if appropriate).
It is possible for the filesystem to provide interim progress reports to the
library to cause folio unlocking to happen earlier if possible.

If any subrequests fail, netfslib can retry them.  It will wait until all
subrequests are completed, offer the filesystem the opportunity to fiddle with
the resources/state held by the request and poke at the subrequests before
re-preparing and re-issuing the subrequests.

This allows the tiling of contiguous sets of failed subrequests within a stream
to be changed, adding more subrequests or ditching excess as necessary (for
instance, if the network sizes change or the server decides it wants smaller
chunks).

Further, if one or more contiguous cache-read subrequests fail, the library
will pass them to the filesystem to perform instead, renegotiating and retiling
them as necessary to fit with the filesystem's parameters rather than those of
the cache.

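The retiling of a run of failed subrequests can be sketched in userspace as
follows.  The structure and function here are hypothetical simplifications
invented for the example; the real library works on its own subrequest lists
and consults the filesystem for the new limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified subrequest (not the kernel structure). */
struct toy_subreq {
	unsigned long long start;	/* File position */
	size_t len;			/* Length of the slice */
	bool failed;			/* True if it needs retrying */
};

/*
 * Starting at index @from, gather the run of failed subrequests that are
 * contiguous in the file and work out how many subrequests are needed to
 * cover the same span when each may be at most @new_max bytes.  The index
 * just past the run is stored in *@next.
 */
static size_t toy_retile_failed_run(const struct toy_subreq *s, size_t n,
				    size_t from, size_t new_max, size_t *next)
{
	size_t i = from, span = 0;

	assert(new_max > 0);

	while (i < n && s[i].failed &&
	       (i == from || s[i].start == s[i - 1].start + s[i - 1].len)) {
		span += s[i].len;
		i++;
	}
	*next = i;
	return (span + new_max - 1) / new_max;	/* Round up */
}
```

For instance, two adjacent failed 64KiB subrequests would be reissued as eight
16KiB subrequests if the negotiated size dropped to 16KiB, while a failed
subrequest on the far side of a gap would be handled as a separate run.
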
Local Caching
-------------

One of the services netfslib provides, via ``fscache``, is the option to cache
on local disk a copy of the data obtained from/written to a network filesystem.
The library will manage the storing, retrieval and some invalidation of data
automatically on behalf of the filesystem if a cookie is attached to the
``netfs_inode``.

Note that local caching used to use the PG_private_2 flag (aliased as
PG_fscache) to keep track of a page that was being written to the cache, but
this is now deprecated as PG_private_2 will be removed.

Instead, folios that are read from the server for which there was no data in
the cache will be marked as dirty and will have ``folio->private`` set to a
special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left for writeback to write.
If the folio is modified before that happens, the special value will be
cleared and the folio will become normally dirty.

When writeback occurs, folios that are so marked will only be written to the
cache and not to the server.  Writeback handles mixed cache-only writes and
server-and-cache writes by using two streams, sending one to the cache and one
to the server.  The server stream will have gaps in it corresponding to those
folios.

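The routing of folios to the two writeback streams can be modelled in
userspace as below.  This is a sketch under stated assumptions: the types and
the ``copy_to_cache_only`` field are hypothetical stand-ins for a folio whose
private value is ``NETFS_FOLIO_COPY_TO_CACHE``, and byte totals stand in for
the actual subrequests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a dirty folio seen by writeback. */
struct toy_folio {
	bool copy_to_cache_only;	/* Marked for cache-only writing */
	size_t size;			/* Bytes of dirty data */
};

/* Bytes sent down each stream. */
struct toy_totals {
	size_t to_server;
	size_t to_cache;
};

/*
 * Walk the dirty folios.  Cache-only folios leave a gap in the server
 * stream; normally dirty folios go to the server and, if a cache is
 * available, to the cache as well.
 */
static struct toy_totals toy_route_writeback(const struct toy_folio *f,
					     size_t n, bool have_cache)
{
	struct toy_totals t = { 0, 0 };

	assert(f != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!f[i].copy_to_cache_only)
			t.to_server += f[i].size;
		if (have_cache)
			t.to_cache += f[i].size;
	}
	return t;
}
```
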
Content Encryption (fscrypt)
----------------------------

Though it does not do so yet, at some point netfslib will acquire the ability
to do client-side content encryption on behalf of the network filesystem (Ceph,
for example).  fscrypt can be used for this if appropriate (it may not be -
cifs, for example).

The data will be stored encrypted in the local cache using the same manner of
encryption as the data written to the server and the library will impose bounce
buffering and RMW cycles as necessary.


Per-Inode Context
=================

The network filesystem helper library needs a place to store a bit of state for
its use on each netfs inode it is helping to manage.  To this end, a context
structure is defined::

	struct netfs_inode {
		struct inode inode;
		const struct netfs_request_ops *ops;
		struct fscache_cookie *cache;
		loff_t remote_i_size;
		unsigned long flags;
		...
	};

A network filesystem that wants to use netfslib must place one of these in its
inode wrapper struct instead of the VFS ``struct inode``.  This can be done in
a way similar to the following::

	struct my_inode {
		struct netfs_inode netfs; /* Netfslib context and vfs inode */
		...
	};

This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed to
directly by the VFS/VM operation tables.

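The embedding trick can be tried out in userspace with mock structures.  In
the sketch below every ``toy_`` name is a hypothetical stand-in invented for
the example (the real ``struct inode``, ``struct netfs_inode`` and
``container_of()`` live in the kernel headers), but the pointer arithmetic is
the same idea.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures (hypothetical). */
struct toy_inode {
	unsigned long i_ino;
};

struct toy_netfs_inode {
	struct toy_inode inode;		/* VFS inode embedded first */
	unsigned long flags;
};

struct my_inode {
	struct toy_netfs_inode netfs;	/* Netfslib context and VFS inode */
	int my_private;			/* Filesystem's own state */
};

/* Minimal equivalent of the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Go from the VFS inode to the netfslib context that contains it. */
static struct toy_netfs_inode *toy_netfs_inode(struct toy_inode *inode)
{
	assert(inode != NULL);
	return toy_container_of(inode, struct toy_netfs_inode, inode);
}

/* Go from the VFS inode to the filesystem's own wrapper. */
static struct my_inode *toy_my_inode(struct toy_inode *inode)
{
	return toy_container_of(toy_netfs_inode(inode), struct my_inode, netfs);
}
```

Because both layers are recovered from the one inode pointer, neither netfslib
nor the filesystem needs a separate back-pointer to find its state.
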
The structure contains the following fields that are of interest to the
filesystem:

 * ``inode``

   The VFS inode structure.

 * ``ops``

   The set of operations provided by the network filesystem to netfslib.

 * ``cache``

   Local caching cookie, or NULL if no caching is enabled.  This field does not
   exist if fscache is disabled.

 * ``remote_i_size``

   The size of the file on the server.  This differs from inode->i_size if
   local modifications have been made but not yet written back.

 * ``flags``

   A set of flags, some of which the filesystem might be interested in:

   * ``NETFS_ICTX_MODIFIED_ATTR``

     Set if netfslib modifies mtime/ctime.  The filesystem is free to ignore
     this or clear it.

   * ``NETFS_ICTX_UNBUFFERED``

     Do unbuffered I/O upon the file.  Like direct I/O but without the
     alignment limitations.  RMW will be performed if necessary.  The pagecache
     will not be used unless mmap() is also used.

   * ``NETFS_ICTX_WRITETHROUGH``

     Do writethrough caching upon the file.  I/O will be set up and dispatched
     as buffered writes are made to the page cache.  mmap() does the normal
     writeback thing.

   * ``NETFS_ICTX_SINGLE_NO_UPLOAD``

     Set if the file has a monolithic content that must be read entirely in a
     single go and must not be written back to the server, though it can be
     cached (e.g. AFS directories).

Inode Context Helper Functions
------------------------------

To help deal with the per-inode context, a number of helper functions are
provided.  Firstly, a function to perform basic initialisation on a context and
set the operations table pointer::

	void netfs_inode_init(struct netfs_inode *ctx,
			      const struct netfs_request_ops *ops);

then a function to cast from the VFS inode structure to the netfs context::

	struct netfs_inode *netfs_inode(struct inode *inode);

and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::

	struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);

Inode Locking
-------------

A number of functions are provided to manage the locking of i_rwsem for I/O and
to effectively extend it to provide more separate classes of exclusion::

	int netfs_start_io_read(struct inode *inode);
	void netfs_end_io_read(struct inode *inode);
	int netfs_start_io_write(struct inode *inode);
	void netfs_end_io_write(struct inode *inode);
	int netfs_start_io_direct(struct inode *inode);
	void netfs_end_io_direct(struct inode *inode);

The exclusion breaks down into four separate classes:

 1) Buffered reads and writes.

    Buffered reads can run concurrently with each other and with buffered
    writes, but buffered writes cannot run concurrently with each other.

 2) Direct reads and writes.

    Direct (and unbuffered) reads and writes can run concurrently since they do
    not share local buffering (i.e. the pagecache) and, in a network
    filesystem, are expected to have exclusion managed on the server (though
    this may not be the case for, say, Ceph).

 3) Other major inode modifying operations (e.g. truncate, fallocate).

    These should just access i_rwsem directly.

 4) mmap().

    mmap'd accesses might operate concurrently with any of the other classes.
    They might form the buffer for an intra-file loopback DIO read/write.  They
    might be permitted on unbuffered files.

Inode Writeback
---------------

Netfslib will pin resources on an inode for future writeback (such as pinning
use of an fscache cookie) when an inode is dirtied.  However, this pinning
needs careful management.  To manage the pinning, the following sequence
occurs:

 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the
    pinning begins (when a folio is dirtied, for example) if the cache is
    active to stop the cache structures from being discarded and the cache
    space from being culled.  This also prevents re-getting of cache resources
    if the flag is already set.

 2) This flag is then cleared inside the inode lock during inode writeback in
    the VM - and the fact that it was set is transferred to
    ``->unpinned_netfs_wb`` in ``struct writeback_control``.

 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced.

 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup.

 5) The filesystem invokes netfs to do its cleanup.

To do the cleanup, netfslib provides a function to do the resource unpinning::

	int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);

If the filesystem doesn't need to do anything else, this may be set as its
``.write_inode`` method.

Further, if an inode is deleted, the filesystem's write_inode method may not
get called, so::

	void netfs_clear_inode_writeback(struct inode *inode, const void *aux);

must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.


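The pinning sequence described under Inode Writeback can be walked through
with a toy userspace state machine.  Everything in this sketch is a
hypothetical model: the structures, the flag's bit value and the pin counter
are invented for illustration and only mimic the order of events in steps 1
to 5.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_I_PINNING_NETFS_WB	0x1UL	/* Hypothetical bit value */

/* Hypothetical stand-ins for the inode and writeback control. */
struct toy_inode {
	unsigned long i_state;
	int cache_pins;		/* Models pinned cache resources */
};

struct toy_wbc {
	bool unpinned_netfs_wb;
};

/* Step 1: pin once when a folio is dirtied and the cache is active. */
static void toy_dirty_folio(struct toy_inode *inode, bool cache_active)
{
	if (cache_active && !(inode->i_state & TOY_I_PINNING_NETFS_WB)) {
		inode->i_state |= TOY_I_PINNING_NETFS_WB;
		inode->cache_pins++;
	}
}

/*
 * Steps 2-5: transfer the flag into the writeback control, clear it and,
 * if it was set, perform the cleanup that ->write_inode() would delegate
 * to netfslib.  Returns true if write_inode was forced.
 */
static bool toy_writeback_inode(struct toy_inode *inode, struct toy_wbc *wbc)
{
	wbc->unpinned_netfs_wb = (inode->i_state & TOY_I_PINNING_NETFS_WB) != 0;
	inode->i_state &= ~TOY_I_PINNING_NETFS_WB;

	if (!wbc->unpinned_netfs_wb)
		return false;

	assert(inode->cache_pins > 0);
	inode->cache_pins--;	/* Models the unpinning of the cache */
	return true;
}
```
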
High-Level VFS API
==================

Netfslib provides a number of sets of API calls for the filesystem to delegate
VFS operations to.  Netfslib, in turn, will call out to the filesystem and the
cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene
at various times.

Unlocked Read/Write Iter
------------------------

The first API set is for the delegation of operations to netfslib when the
filesystem is called through the standard VFS read/write_iter methods::

	ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
	ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
	ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);

They can be assigned directly to ``.read_iter`` and ``.write_iter``.  They
perform the inode locking themselves and the first two will switch between
buffered I/O and DIO as appropriate.

Pre-Locked Read/Write Iter
--------------------------

The second API set is for the delegation of operations to netfslib when the
filesystem is called through the standard VFS methods, but needs to do some
other stuff before or after calling netfslib whilst still inside the locked
section (e.g. Ceph negotiating caps).  The unbuffered read function is::

	ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);

This must not be assigned directly to ``.read_iter`` and the filesystem is
responsible for performing the inode locking before calling it.  In the case of
buffered read, the filesystem should use ``filemap_read()``.

There are three functions for writes::

	ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
						 struct netfs_group *netfs_group);
	ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
				    struct netfs_group *netfs_group);
	ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
						   struct netfs_group *netfs_group);

These must not be assigned directly to ``.write_iter`` and the filesystem is
responsible for performing the inode locking before calling them.

The first two functions are for buffered writes; the first just adds some
standard write checks and jumps to the second, but if the filesystem wants to
do the checks itself, it can use the second directly.  The third function is
for unbuffered or DIO writes.

On all three write functions, there is a writeback group pointer (which should
be NULL if the filesystem doesn't use this).  Writeback groups are set on
folios when they're modified.  If a folio to-be-modified is already marked with
a different group, it is flushed first.  The writeback API allows writing back
of a specific group.

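The flush-before-regroup rule for writeback groups can be pictured with a
small userspace model.  The ``toy_`` types and the flush counter are
hypothetical; in the kernel the flush is a real writeback of the folio and the
group is a ``struct netfs_group`` pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a writeback group and a folio. */
struct toy_group {
	int id;
};

struct toy_folio {
	const struct toy_group *group;	/* Group the dirty data belongs to */
	bool dirty;
	unsigned int flushes;		/* Times the folio had to be flushed */
};

/*
 * Modify a folio on behalf of @group.  If the folio is already dirty under
 * a different group, the old data is flushed before the new group is set.
 * A filesystem that does not use groups passes NULL every time and so never
 * triggers a flush here.
 */
static void toy_modify_folio(struct toy_folio *folio,
			     const struct toy_group *group)
{
	assert(folio != NULL);

	if (folio->dirty && folio->group != group) {
		folio->flushes++;	/* Models writing back the old data */
		folio->dirty = false;
	}
	folio->group = group;
	folio->dirty = true;
}
```
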
Memory-Mapped I/O API
---------------------

An API for support of mmap()'d I/O is provided::

	vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);

This allows the filesystem to delegate ``.page_mkwrite`` to netfslib.  The
filesystem should not take the inode lock before calling it, but, as with the
locked write functions above, this does take a writeback group pointer.  If the
page to be made writable is in a different group, it will be flushed first.

Monolithic Files API
--------------------

There is also a special API set for files for which the content must be read in
a single RPC (and not written back) and is maintained as a monolithic blob
(e.g. an AFS directory), though it can be stored and updated in the local cache::

	ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
	void netfs_single_mark_inode_dirty(struct inode *inode);
	int netfs_writeback_single(struct address_space *mapping,
				   struct writeback_control *wbc,
				   struct iov_iter *iter);

The first function reads from a file into the given buffer, reading from the
cache in preference if the data is cached there; the second function allows the
inode to be marked dirty, causing a later writeback; and the third function can
be called from the writeback code to write the data to the cache, if there is
one.

The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be
used.  The writeback function requires the buffer to be of ITER_FOLIOQ type.

High-Level VM API
=================

Netfslib also provides a number of sets of API calls for the filesystem to
delegate VM operations to.  Again, netfslib, in turn, will call out to the
filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places
for it to intervene at various times::

	void netfs_readahead(struct readahead_control *);
	int netfs_read_folio(struct file *, struct folio *);
	int netfs_writepages(struct address_space *mapping,
			     struct writeback_control *wbc);
	bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
	void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
	bool netfs_release_folio(struct folio *folio, gfp_t gfp);

These are ``address_space_operations`` methods and can be set directly in the
operations table.

Deprecated PG_private_2 API
---------------------------

There is also a deprecated function for filesystems that still use the
``->write_begin`` method::

	int netfs_write_begin(struct netfs_inode *inode, struct file *file,
			      struct address_space *mapping, loff_t pos, unsigned int len,
			      struct folio **_folio, void **_fsdata);

It uses the deprecated PG_private_2 flag and so should not be used.


I/O Request API
===============

The I/O request API comprises a number of structures and a number of functions
that the filesystem may need to use.

Request Structure
-----------------

The request structure manages the request as a whole, holding some resources
and state on behalf of the filesystem and tracking the collection of results::

	struct netfs_io_request {
		enum netfs_io_origin	origin;
		struct inode		*inode;
		struct address_space	*mapping;
		struct netfs_group	*group;
		struct netfs_io_stream	io_streams[];
		void			*netfs_priv;
		void			*netfs_priv2;
		unsigned long long	start;
		unsigned long long	len;
		unsigned long long	i_size;
		unsigned int		debug_id;
		unsigned long		flags;
		...
	};

Many of the fields are for internal use, but the fields shown here are of
interest to the filesystem:

 * ``origin``

   The origin of the request (readahead, read_folio, DIO read, writeback, ...).

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from.  The mapping
   may or may not point to inode->i_data.

 * ``group``

   The writeback group this request is dealing with or NULL.  This holds a ref
   on the group.

 * ``io_streams``

   The parallel streams of subrequests available to the request.  Currently two
   are available, but this may be made extensible in future.  ``NR_IO_STREAMS``
   indicates the size of the array.

 * ``netfs_priv``
 * ``netfs_priv2``

   The network filesystem's private data.  The value for this can be passed in
   to the helper functions or set during the request.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length.  These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.

 * ``flags``

   Flags for managing and controlling the operation of the request.  Some of
   these may be of interest to the filesystem:

   * ``NETFS_RREQ_RETRYING``

     Netfslib sets this when generating retries.

   * ``NETFS_RREQ_PAUSE``

     The filesystem can set this to request to pause the library's subrequest
     issuing loop - but care needs to be taken as netfslib may also set it.

   * ``NETFS_RREQ_NONBLOCK``
   * ``NETFS_RREQ_BLOCKED``

     Netfslib sets the first to indicate that non-blocking mode was set by the
     caller and the filesystem can set the second to indicate that it would
     have had to block.

   * ``NETFS_RREQ_USE_PGPRIV2``

     The filesystem can set this if it wants to use PG_private_2 to track
     whether a folio is being written to the cache.  This is deprecated as
     PG_private_2 is going to go away.

If the filesystem wants more private data than is afforded by this structure,
then it should wrap it and provide its own allocator.

Stream Structure
----------------

A request is composed of one or more parallel streams and each stream may be
aimed at a different target.

For read requests, only stream 0 is used.  This can contain a mixture of
subrequests aimed at different sources.  For write requests, stream 0 is used
for the server and stream 1 is used for the cache.  For buffered writeback,
stream 0 is not enabled unless a normal dirty folio is encountered, at which
point ->begin_writeback() will be invoked and the filesystem can mark the
stream available.

The stream struct looks like::

	struct netfs_io_stream {
		unsigned char		stream_nr;
		bool			avail;
		size_t			sreq_max_len;
		unsigned int		sreq_max_segs;
		unsigned int		submit_extendable_to;
		...
	};

A number of members are available for access/use by the filesystem:

 * ``stream_nr``

   The number of the stream within the request.

 * ``avail``

   True if the stream is available for use.  The filesystem should set this on
   stream zero in ->begin_writeback().

 * ``sreq_max_len``
 * ``sreq_max_segs``

   These are set by the filesystem or the cache in ->prepare_read() or
   ->prepare_write() for each subrequest to indicate the maximum number of
   bytes and, optionally, the maximum number of segments (if not 0) that the
   subrequest can support.

 * ``submit_extendable_to``

   The size that a subrequest can be rounded up to beyond the EOF, given the
   available buffer.  This allows the cache to work out if it can do a DIO read
   or write that straddles the EOF marker.

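How the two limits interact can be sketched in userspace.  The function below
is a hypothetical illustration, not the library's algorithm: given the lengths
of the buffer segments available, it returns how many bytes a single
subrequest could cover under a byte limit and an optional segment limit, where
a segment limit of 0 means that no segment limit applies.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: work out how much of a segmented buffer fits in one
 * subrequest.  @max_len caps the bytes; @max_segs caps the number of
 * segments unless it is 0.  A segment may be used partially in order to
 * fill the subrequest up to @max_len.
 */
static size_t toy_limit_subreq(const size_t *seg_len, size_t nr_segs,
			       size_t max_len, unsigned int max_segs)
{
	size_t total = 0;

	assert(seg_len != NULL || nr_segs == 0);

	for (size_t i = 0; i < nr_segs; i++) {
		if (max_segs && i >= max_segs)
			break;			/* Segment limit reached */
		if (seg_len[i] >= max_len - total)
			return max_len;		/* Byte limit reached */
		total += seg_len[i];
	}
	return total;
}
```
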
Subrequest Structure
--------------------

Individual units of I/O are managed by the subrequest structure.  These
represent slices of the overall request and run independently::

	struct netfs_io_subrequest {
		struct netfs_io_request *rreq;
		struct iov_iter		io_iter;
		unsigned long long	start;
		size_t			len;
		size_t			transferred;
		unsigned long		flags;
		short			error;
		unsigned short		debug_index;
		unsigned char		stream_nr;
		...
	};

Each subrequest is expected to access a single source, though the library will
handle falling back from one source type to another.  The members are:

 * ``rreq``

   A pointer to the request that this subrequest belongs to.

 * ``io_iter``

   An I/O iterator representing a slice of the buffer to be read into or
   written from.

 * ``start``
 * ``len``

   The file position of the start of this slice of the request and the
   length.

 * ``transferred``

   The amount of data transferred so far for this subrequest.  This should be
   added to with the length of the transfer made by this issuance of the
   subrequest.  If this is less than ``len`` then the subrequest may be
   reissued to continue.

 * ``flags``

   Flags for managing the subrequest.  A number of these are of interest to
   the filesystem or cache:

   * ``NETFS_SREQ_MADE_PROGRESS``

     Set by the filesystem to indicate that at least one byte of data was read
     or written.

   * ``NETFS_SREQ_HIT_EOF``

     The filesystem should set this if a read hit the EOF on the file (in which
     case ``transferred`` should stop at the EOF).  Netfslib may expand the
     subrequest out to the size of the folio containing the EOF on the off
     chance that a third party change happened or a DIO read may have asked for
     more than is available.  The library will clear any excess pagecache.

   * ``NETFS_SREQ_CLEAR_TAIL``

     The filesystem can set this to indicate that the remainder of the slice,
     from transferred to len, should be cleared.  Do not set if HIT_EOF is set.

   * ``NETFS_SREQ_NEED_RETRY``

     The filesystem can set this to tell netfslib to retry the subrequest.

   * ``NETFS_SREQ_BOUNDARY``

     This can be set by the filesystem on a subrequest to indicate that it ends
     at a boundary with the filesystem structure (e.g. at the end of a Ceph
     object).  It tells netfslib not to retile subrequests across it.

 * ``error``

   This is for the filesystem to store the result of the subrequest.  It should
   be set to 0 if successful and a negative error code otherwise.

 * ``debug_index``
 * ``stream_nr``

   A number allocated to this slice that can be displayed in trace lines for
   reference and the number of the request stream that it belongs to.

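The bookkeeping expected of the filesystem when one issuance of a read
completes might look something like the userspace model below.  The structure,
the flag bit values and the helper names are hypothetical; only the rules
being modelled (add to ``transferred``, flag progress, flag EOF, record the
error) come from the member descriptions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's values differ. */
#define TOY_SREQ_MADE_PROGRESS	(1UL << 0)
#define TOY_SREQ_HIT_EOF	(1UL << 1)

/* Hypothetical cut-down subrequest. */
struct toy_subreq {
	unsigned long long start;
	size_t len;
	size_t transferred;
	unsigned long flags;
	short error;
};

/*
 * Record the outcome of one issuance of a read: @got is the number of
 * bytes received or a negative error code and @hit_eof says whether the
 * server reported the end of the file.
 */
static void toy_read_done(struct toy_subreq *s, long got, bool hit_eof)
{
	if (got < 0) {
		s->error = (short)got;
		return;
	}
	s->error = 0;
	if (got > 0) {
		s->transferred += (size_t)got;	/* Accumulate, don't overwrite */
		s->flags |= TOY_SREQ_MADE_PROGRESS;
	}
	if (hit_eof)
		s->flags |= TOY_SREQ_HIT_EOF;
	assert(s->transferred <= s->len);
}

/* A short read that did not hit EOF may be reissued to continue. */
static bool toy_needs_reissue(const struct toy_subreq *s)
{
	return !s->error && !(s->flags & TOY_SREQ_HIT_EOF) &&
	       s->transferred < s->len;
}
```
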
If necessary, the filesystem can get and put extra refs on the subrequest it is
given::

	void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
				  enum netfs_sreq_ref_trace what);
	void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
				  enum netfs_sreq_ref_trace what);

using netfs trace codes to indicate the reason.  Care must be taken, however,
as once control of the subrequest is returned to netfslib, the same subrequest
can be reissued/retried.

 738Filesystem Methods
 739------------------
 740
 741The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
 742use::
 743
 744	struct netfs_request_ops {
 745		mempool_t *request_pool;
 746		mempool_t *subrequest_pool;
 747		int (*init_request)(struct netfs_io_request *rreq, struct file *file);
 748		void (*free_request)(struct netfs_io_request *rreq);
 749		void (*free_subrequest)(struct netfs_io_subrequest *rreq);
 750		void (*expand_readahead)(struct netfs_io_request *rreq);
 751		int (*prepare_read)(struct netfs_io_subrequest *subreq);
 752		void (*issue_read)(struct netfs_io_subrequest *subreq);
 753		void (*done)(struct netfs_io_request *rreq);
 754		void (*update_i_size)(struct inode *inode, loff_t i_size);
 755		void (*post_modify)(struct inode *inode);
 756		void (*begin_writeback)(struct netfs_io_request *wreq);
 757		void (*prepare_write)(struct netfs_io_subrequest *subreq);
 758		void (*issue_write)(struct netfs_io_subrequest *subreq);
 759		void (*retry_request)(struct netfs_io_request *wreq,
 760				      struct netfs_io_stream *stream);
 761		void (*invalidate_cache)(struct netfs_io_request *wreq);
 762	};
 763
 764The table starts with a pair of optional pointers to memory pools from which
 765requests and subrequests can be allocated.  If these are not given, netfslib
 766has default pools that it will use instead.  If the filesystem wraps the netfs
 767structs in its own larger structs, then it will need to use its own pools.
 768Netfslib will allocate directly from the pools.

The methods defined in the table are:

 * ``init_request()``
 * ``free_request()``
 * ``free_subrequest()``

   [Optional] A filesystem may implement these to initialise or clean up any
   resources that it attaches to the request or subrequest.

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead request.  The filesystem gets to expand the request in both
   directions, though it must retain the initial region as that may represent
   an allocation already made.  If local caching is enabled, it gets to expand
   the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure.  Note that if any change is made, ->len must be increased by at
   least as much as ->start is reduced.

 * ``prepare_read()``

   [Optional] This is called to allow the filesystem to limit the size of a
   subrequest.  It may also limit the number of individual regions in the
   iterator, such as is required by RDMA.  This information should be set on
   stream zero in::

	rreq->io_streams[0].sreq_max_len
	rreq->io_streams[0].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple reads in flight.

   Zero should be returned on success and an error code otherwise.

 * ``issue_read()``

   [Required] Netfslib calls this to dispatch a subrequest to the server for
   reading.  In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_read_subreq_terminated()`` function
   should be called to indicate that the subrequest completed either way.
   ->error, ->transferred and ->flags should be updated before completing.  The
   termination can be done asynchronously.

   Note: the filesystem must not deal with setting folios uptodate, unlocking
   them or dropping their refs - the library deals with this as it may have to
   stitch together the results of multiple subrequests that variously overlap
   the set of folios.

 * ``done()``

   [Optional] This is called after the folios in a read request have all been
   unlocked (and marked uptodate if applicable).

 * ``update_i_size()``

   [Optional] This is invoked by netfslib at various points during the write
   paths to ask the filesystem to update its idea of the file size.  If not
   given, netfslib will set i_size and i_blocks and update the local cache
   cookie.

 * ``post_modify()``

   [Optional] This is called after netfslib writes to the pagecache or when it
   allows an mmap'd page to be marked as writable.

 * ``begin_writeback()``

   [Optional] Netfslib calls this when processing a writeback request if it
   finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
   indicating it must be written to the server.  This allows the filesystem to
   only set up writeback resources when it knows it's going to have to perform
   a write.

 * ``prepare_write()``

   [Optional] This is called to allow the filesystem to limit the size of a
   subrequest.  It may also limit the number of individual regions in the
   iterator, such as is required by RDMA.  This information should be set on
   the stream to which the subrequest belongs::

	rreq->io_streams[subreq->stream_nr].sreq_max_len
	rreq->io_streams[subreq->stream_nr].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple writes in flight.

   This is not permitted to return an error.  Instead, in the event of failure,
   ``netfs_prepare_write_failed()`` must be called.

 * ``issue_write()``

   [Required] This is used to dispatch a subrequest to the server for writing.
   In the subrequest, ->start, ->len and ->transferred indicate what data
   should be written to the server and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_write_subrequest_terminated()``
   function should be called to indicate that the subrequest completed either
   way.  ->error, ->transferred and ->flags should be updated before
   completing.  The termination can be done asynchronously.

   Note: the filesystem must not deal with removing the dirty or writeback
   marks on folios involved in the operation and should not take refs or pins
   on them, but should leave retention to netfslib.

 * ``retry_request()``

   [Optional] Netfslib calls this at the beginning of a retry cycle.  This
   allows the filesystem to examine the state of the request, the subrequests
   in the indicated stream and of its own data and make adjustments or
   renegotiate resources.

 * ``invalidate_cache()``

   [Optional] This is called by netfslib to invalidate data stored in the local
   cache in the event that writing to the local cache fails, providing updated
   coherency data that netfs can't provide.
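
The ``expand_readahead()`` rule and the ``prepare_read()`` limits described
above can be sketched in userspace C as follows.  This is an illustration only:
``demo_request``, ``demo_stream`` and both functions are hypothetical stand-ins
for the real structures and methods, and the granule and size parameters are
assumed values that a filesystem would normally derive from its own
configuration::

	#include <assert.h>
	#include <errno.h>
	#include <stddef.h>

	/* Hypothetical stand-ins for the request and stream structures. */
	struct demo_request {
		unsigned long long start;	/* models ->start */
		size_t len;			/* models ->len */
	};

	struct demo_stream {
		size_t sreq_max_len;		/* largest subrequest, in bytes */
		unsigned int sreq_max_segs;	/* most iterator segments allowed */
	};

	/* Round the region outwards to whole granules, keeping the original span. */
	static void demo_expand_readahead(struct demo_request *rreq, size_t granule)
	{
		unsigned long long end, new_start, new_end;

		assert(granule > 0);
		end = rreq->start + rreq->len;
		new_start = rreq->start - rreq->start % granule;
		new_end = end;
		if (end % granule)
			new_end = end + (granule - end % granule);

		/* ->len grows by at least as much as ->start shrinks. */
		rreq->len = (size_t)(new_end - new_start);
		rreq->start = new_start;
	}

	/* Cap subrequests on stream zero; 0 on success, negative errno on failure. */
	static int demo_prepare_read(struct demo_stream *stream0, size_t rsize,
				     unsigned int max_segs)
	{
		if (rsize == 0)
			return -EINVAL;
		stream0->sreq_max_len = rsize;
		stream0->sreq_max_segs = max_segs;
		return 0;
	}

Rounding outwards in both directions means the original region always stays
inside the expanded one, which satisfies the requirement that the initial
region be retained.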

Terminating a subrequest
------------------------

When a subrequest completes, there are a number of functions that the cache or
filesystem can call to inform netfslib of the status change.  One function is
provided to terminate a write subrequest at the preparation stage and acts
synchronously:

 * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``

   Indicate that the ->prepare_write() call failed.  The ``error`` field should
   have been updated.

Note that ->prepare_read() can return an error as a read can simply be aborted.
Dealing with writeback failure is trickier.
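
The calling convention for a failed write preparation might look like the
following userspace sketch.  ``demo_subreq``, ``demo_prepare_write_failed()``
and ``myfs_prepare_write()`` are hypothetical stand-ins, and the zero-sized
write limit is just an assumed failure condition chosen for illustration::

	#include <assert.h>
	#include <errno.h>
	#include <stdbool.h>
	#include <stddef.h>

	/* Hypothetical stand-in for a write subrequest. */
	struct demo_subreq {
		int error;		/* models ->error */
		bool terminated;	/* set once the library has been told */
		size_t max_len;		/* models the stream's sreq_max_len */
	};

	/* Stand-in for netfs_prepare_write_failed(): just records the call. */
	static void demo_prepare_write_failed(struct demo_subreq *subreq)
	{
		assert(subreq->error < 0);	/* the error must already be set */
		subreq->terminated = true;
	}

	/* A void preparation hook: failure is reported by call, not by return. */
	static void myfs_prepare_write(struct demo_subreq *subreq, size_t wsize)
	{
		if (wsize == 0) {
			subreq->error = -EIO;
			demo_prepare_write_failed(subreq);
			return;
		}
		subreq->max_len = wsize;
	}

The point of the sketch is the ordering: the error field is filled in first
and the failure function is called afterwards, since the hook itself has no
return value through which to report the problem.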

The other functions are used for subrequests that got as far as being issued:

 * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``

   Tell netfslib that a read subrequest has terminated.  The ``error``,
   ``flags`` and ``transferred`` fields should have been updated.

 * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``

   Tell netfslib that a write subrequest has terminated.  Either the amount of
   data processed or the negative error code can be passed in.  This can be
   used as a kiocb completion function.

 * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``

   This is provided to optionally update netfslib on the incremental progress
   of a read, allowing some folios to be unlocked early.  It does not actually
   terminate the subrequest.  The ``transferred`` field should have been
   updated.
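
The combined ``transferred_or_error`` argument taken by the write termination
function can be decoded along the lines of the userspace sketch below.  The
``demo_subreq`` structure and ``demo_write_terminated()`` are hypothetical, and
how the real function accounts for the result internally is not shown here;
only the sign convention is taken from the description above::

	#include <assert.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <sys/types.h>	/* ssize_t */

	/* Hypothetical stand-in for a write subrequest. */
	struct demo_subreq {
		size_t len;		/* bytes the subrequest covers */
		size_t transferred;	/* bytes written so far */
		int error;		/* 0 or a negative error code */
		bool done;		/* termination has been reported */
	};

	/*
	 * Shaped like a completion callback: an opaque pointer plus a single
	 * value that is a byte count if non-negative or an error if negative.
	 */
	static void demo_write_terminated(void *_op, ssize_t transferred_or_error)
	{
		struct demo_subreq *subreq = _op;

		assert(!subreq->done);	/* terminate each subrequest only once */
		if (transferred_or_error < 0)
			subreq->error = (int)transferred_or_error;
		else
			subreq->transferred += (size_t)transferred_or_error;
		subreq->done = true;
	}

Taking an opaque pointer and a single signed result is what lets a function of
this shape be handed directly to an asynchronous I/O layer as its completion
callback.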

Local Cache API
---------------

Netfslib provides a separate API for a local cache to implement, though some of
its routines are similar to those of the filesystem request API.

Firstly, the netfs_io_request object contains a place for the cache to hang its
state::

	struct netfs_cache_resources {
		const struct netfs_cache_ops	*ops;
		void				*cache_priv;
		void				*cache_priv2;
		unsigned int			debug_id;
		unsigned int			inval_counter;
	};

This contains an operations table pointer and two private pointers plus the
debug ID of the fscache cookie for tracing purposes and an invalidation counter
that is cranked by calls to ``fscache_invalidate()`` allowing cache subrequests
to be invalidated after completion.
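
One way such an invalidation counter can be used is sketched below in userspace
C.  This is an assumption about the general technique (snapshot the counter
when an operation begins, compare it on completion), not a description of the
fscache implementation, and every name in it is hypothetical::

	#include <assert.h>
	#include <stdbool.h>

	/* Hypothetical stand-ins for the cookie and the per-request resources. */
	struct demo_cookie {
		unsigned int inval_counter;	/* bumped on every invalidation */
	};

	struct demo_cache_resources {
		unsigned int inval_counter;	/* snapshot taken at op start */
	};

	/* Record the cookie's counter when a cache operation begins. */
	static void demo_begin_operation(struct demo_cache_resources *cres,
					 const struct demo_cookie *cookie)
	{
		assert(cres && cookie);
		cres->inval_counter = cookie->inval_counter;
	}

	/* Models an invalidation: crank the counter. */
	static void demo_invalidate(struct demo_cookie *cookie)
	{
		cookie->inval_counter++;
	}

	/* A completed subrequest is stale if an invalidation happened since. */
	static bool demo_result_is_stale(const struct demo_cache_resources *cres,
					 const struct demo_cookie *cookie)
	{
		return cres->inval_counter != cookie->inval_counter;
	}

Comparing for inequality rather than ordering keeps the check correct even if
the unsigned counter wraps around.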

The cache operation table looks like the following::

	struct netfs_cache_ops {
		void (*end_operation)(struct netfs_cache_resources *cres);
		void (*expand_readahead)(struct netfs_cache_resources *cres,
					 loff_t *_start, size_t *_len, loff_t i_size);
		enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
						     loff_t i_size);
		int (*read)(struct netfs_cache_resources *cres,
			    loff_t start_pos,
			    struct iov_iter *iter,
			    bool seek_data,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv);
		void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
		void (*issue_write)(struct netfs_io_subrequest *subreq);
	};

With a termination handler function pointer::

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);

The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a readahead operation to allow the
   cache to expand a request in either direction.  This allows the cache to
   size the request appropriately for the cache granularity.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request.  ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   The function is passed the subrequest, plus the size of the file for
   reference, and adjusts ->start and ->len in the subrequest appropriately.
   It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache.  The start file offset is given
   along with an iterator to read to, which gives the length also.  It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function.  The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``prepare_write_subreq()``

   [Required] This is called to allow the cache to limit the size of a
   subrequest.  It may also limit the number of individual regions in the
   iterator, such as is required by DIO/DMA.  This information should be set on
   the stream to which the subrequest belongs::

	rreq->io_streams[subreq->stream_nr].sreq_max_len
	rreq->io_streams[subreq->stream_nr].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple writes in flight.

   This is not permitted to return an error.  In the event of failure,
   ``netfs_prepare_write_failed()`` must be called.

 * ``issue_write()``

   [Required] This is used to dispatch a subrequest to the cache for writing.
   In the subrequest, ->start, ->len and ->transferred indicate what data
   should be written to the cache and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_write_subrequest_terminated()``
   function should be called to indicate that the subrequest completed either
   way.  ->error, ->transferred and ->flags should be updated before
   completing.  The termination can be done asynchronously.
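
The decision made by the cache's ``prepare_read()`` method, as described above,
might be modelled in userspace as follows.  The enum and struct are
hypothetical mirrors of the real ones, and the single ``cached_end`` boundary
is an assumed, deliberately simplistic model of what a cache knows about its
contents::

	#include <assert.h>
	#include <stddef.h>

	/* Hypothetical mirror of the sources a slice can be filled from. */
	enum demo_io_source {
		DEMO_FILL_WITH_ZEROES,
		DEMO_DOWNLOAD_FROM_SERVER,
		DEMO_READ_FROM_CACHE,
		DEMO_INVALID_READ,
	};

	/* Hypothetical stand-in for a read subrequest. */
	struct demo_subreq {
		unsigned long long start;	/* models ->start */
		size_t len;			/* models ->len */
	};

	/*
	 * Pick a source for the next slice.  The cache is assumed to hold the
	 * bytes in [0, cached_end) and nothing else.
	 */
	static enum demo_io_source demo_prepare_read(struct demo_subreq *subreq,
						     unsigned long long i_size,
						     unsigned long long cached_end)
	{
		assert(subreq != NULL);
		if (subreq->len == 0)
			return DEMO_INVALID_READ;	/* nothing to slice */
		if (subreq->start >= i_size)
			return DEMO_FILL_WITH_ZEROES;	/* wholly beyond EOF */
		if (subreq->start < cached_end) {
			/* Trim the slice so it ends where the cached data ends. */
			if (subreq->start + subreq->len > cached_end)
				subreq->len = (size_t)(cached_end - subreq->start);
			return DEMO_READ_FROM_CACHE;
		}
		return DEMO_DOWNLOAD_FROM_SERVER;
	}

Only the length is ever reduced in this sketch, so the remainder of the
original region is left to be covered by whatever slice is configured next.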


API Function Reference
======================

.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c