Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

dax: replace XIP documentation with DAX documentation

Based on the original XIP documentation, this documents the current state
of affairs, and includes instructions on how users can enable DAX if their
devices and kernel support it.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Matthew Wilcox and committed by Linus Torvalds
95ec8dab 4c0ccfef

+92 -73
+3 -2
Documentation/filesystems/00-INDEX
···
   - directory containing configfs documentation and example code.
 cramfs.txt
   - info on the cram filesystem for small storage (ROMs etc).
+dax.txt
+  - info on avoiding the page cache for files stored on CPU-addressable
+    storage devices.
 debugfs.txt
   - info on the debugfs filesystem.
 devpts.txt
···
   - info on XFS Self Describing Metadata.
 xfs.txt
   - info and mount options for the XFS filesystem.
-xip.txt
-  - info on execute-in-place for file mappings.
+89
Documentation/filesystems/dax.txt
+Direct Access for files
+-----------------------
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage. The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual. When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call dax_fault() and
+  dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents. If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages. This problem is being worked on. That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here). Other things that will not
+work include RDMA, sendfile() and splice().
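The direct_access contract described under "Implementation Tips for Block
Driver Writers" (sector in, pfn and kernel address out, byte count or
negative errno returned) can be illustrated with a small userspace model.
This is only a sketch under stated assumptions, not kernel code: struct
model_dev, model_direct_access(), the 4 KiB page size, the choice of
-ERANGE for an out-of-range sector, and clamping the return value to
'size' are all hypothetical and chosen for illustration. A real driver
implements the 'direct_access' block device operation against kernel
types instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9    /* 512-byte sectors, as in the text */
#define MODEL_PAGE_SHIFT   12   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for a memory-like block device. */
struct model_dev {
	unsigned char *base;     /* start of the CPU-addressable storage */
	unsigned long base_pfn;  /* made-up pfn of 'base' (assumed page-aligned) */
	size_t bytes;            /* device capacity in bytes */
};

/*
 * Userspace model of the direct_access contract: translate 'sector'
 * into an address and a pfn, and return how many bytes (at most 'size')
 * can be accessed contiguously from there, or a negative errno.
 */
long model_direct_access(struct model_dev *dev, uint64_t sector,
			 void **kaddr, unsigned long *pfn, long size)
{
	uint64_t offset, avail;

	assert(dev && kaddr && pfn);

	if (size < 0)
		return -EINVAL;
	/* Compare in sectors so the shift below cannot overflow. */
	if (sector >= ((uint64_t)dev->bytes >> MODEL_SECTOR_SHIFT))
		return -ERANGE;

	offset = sector << MODEL_SECTOR_SHIFT;
	*kaddr = dev->base + offset;
	*pfn = dev->base_pfn + (unsigned long)(offset >> MODEL_PAGE_SHIFT);

	/* Everything up to the end of the model device is contiguous. */
	avail = dev->bytes - offset;
	return (long)(avail < (uint64_t)size ? avail : (uint64_t)size);
}
```

In this model, a caller that asks for 4096 bytes at sector 15 of an
8192-byte device gets 512 back, because only one sector remains before
the end of the device. The caller is expected to handle such short
returns as well as negative error values.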
-71
Documentation/filesystems/xip.txt
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory. It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that can be contiguously accessed at that offset. It may also
-return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.