@@ -1,381 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
-	"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
-
-<book id="Linux-filesystems-API">
- <bookinfo>
-  <title>Linux Filesystems API</title>
-
-  <legalnotice>
-   <para>
-     This documentation is free software; you can redistribute
-     it and/or modify it under the terms of the GNU General Public
-     License as published by the Free Software Foundation; either
-     version 2 of the License, or (at your option) any later
-     version.
-   </para>
-
-   <para>
-     This program is distributed in the hope that it will be
-     useful, but WITHOUT ANY WARRANTY; without even the implied
-     warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
-     See the GNU General Public License for more details.
-   </para>
-
-   <para>
-     You should have received a copy of the GNU General Public
-     License along with this program; if not, write to the Free
-     Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
-     MA 02111-1307 USA
-   </para>
-
-   <para>
-     For more details see the file COPYING in the source
-     distribution of Linux.
-   </para>
-  </legalnotice>
- </bookinfo>
-
-<toc></toc>
-
-  <chapter id="vfs">
-     <title>The Linux VFS</title>
-     <sect1 id="the_filesystem_types"><title>The Filesystem types</title>
-!Iinclude/linux/fs.h
-     </sect1>
-     <sect1 id="the_directory_cache"><title>The Directory Cache</title>
-!Efs/dcache.c
-!Iinclude/linux/dcache.h
-     </sect1>
-     <sect1 id="inode_handling"><title>Inode Handling</title>
-!Efs/inode.c
-!Efs/bad_inode.c
-     </sect1>
-     <sect1 id="registration_and_superblocks"><title>Registration and Superblocks</title>
-!Efs/super.c
-     </sect1>
-     <sect1 id="file_locks"><title>File Locks</title>
-!Efs/locks.c
-!Ifs/locks.c
-     </sect1>
-     <sect1 id="other_functions"><title>Other Functions</title>
-!Efs/mpage.c
-!Efs/namei.c
-!Efs/buffer.c
-!Eblock/bio.c
-!Efs/seq_file.c
-!Efs/filesystems.c
-!Efs/fs-writeback.c
-!Efs/block_dev.c
-     </sect1>
-  </chapter>
-
-  <chapter id="proc">
-     <title>The proc filesystem</title>
-
-     <sect1 id="sysctl_interface"><title>sysctl interface</title>
-!Ekernel/sysctl.c
-     </sect1>
-
-     <sect1 id="proc_filesystem_interface"><title>proc filesystem interface</title>
-!Ifs/proc/base.c
-     </sect1>
-  </chapter>
-
-  <chapter id="fs_events">
-     <title>Events based on file descriptors</title>
-!Efs/eventfd.c
-  </chapter>
-
-  <chapter id="sysfs">
-     <title>The Filesystem for Exporting Kernel Objects</title>
-!Efs/sysfs/file.c
-!Efs/sysfs/symlink.c
-  </chapter>
-
-  <chapter id="debugfs">
-     <title>The debugfs filesystem</title>
-
-     <sect1 id="debugfs_interface"><title>debugfs interface</title>
-!Efs/debugfs/inode.c
-!Efs/debugfs/file.c
-     </sect1>
-  </chapter>
-
-  <chapter id="LinuxJDBAPI">
-  <chapterinfo>
-  <title>The Linux Journalling API</title>
-
-  <authorgroup>
-  <author>
-     <firstname>Roger</firstname>
-     <surname>Gammans</surname>
-     <affiliation>
-     <address>
-      <email>rgammans@computer-surgery.co.uk</email>
-     </address>
-    </affiliation>
-     </author>
-  </authorgroup>
-
-  <authorgroup>
-   <author>
-    <firstname>Stephen</firstname>
-    <surname>Tweedie</surname>
-    <affiliation>
-     <address>
-      <email>sct@redhat.com</email>
-     </address>
-    </affiliation>
-   </author>
-  </authorgroup>
-
-  <copyright>
-   <year>2002</year>
-   <holder>Roger Gammans</holder>
-  </copyright>
-  </chapterinfo>
-
-  <title>The Linux Journalling API</title>
-
-    <sect1 id="journaling_overview">
-     <title>Overview</title>
-    <sect2 id="journaling_details">
-     <title>Details</title>
-<para>
-The journalling layer is easy to use. You need to
-first of all create a journal_t data structure. There are
-two calls to do this dependent on how you decide to allocate the physical
-media on which the journal resides. The jbd2_journal_init_inode() call
-is for journals stored in filesystem inodes, or the jbd2_journal_init_dev()
-call can be used for journals stored on a raw device (in a continuous range
-of blocks). A journal_t is a typedef for a struct pointer, so when
-you are finally finished make sure you call jbd2_journal_destroy() on it
-to free up any used kernel memory.
-</para>
-
-<para>
-Once you have got your journal_t object you need to 'mount' or load the journal
-file. The journalling layer expects the space for the journal was already
-allocated and initialized properly by the userspace tools. When loading the
-journal you must call jbd2_journal_load() to process journal contents. If the
-client file system detects the journal contents do not need to be processed
-(or even need not have valid contents), it may call jbd2_journal_wipe() to
-clear the journal contents before calling jbd2_journal_load().
-</para>
-
-<para>
-Note that jbd2_journal_wipe(..,0) calls jbd2_journal_skip_recovery() for you if
-it detects any outstanding transactions in the journal and similarly
-jbd2_journal_load() will call jbd2_journal_recover() if necessary. I would
-advise reading ext4_load_journal() in fs/ext4/super.c for examples on this
-stage.
-</para>
-
-<para>
-Now you can go ahead and start modifying the underlying
-filesystem. Almost.
-</para>
-
-<para>
-
-You still need to actually journal your filesystem changes; this
-is done by wrapping them into transactions. Additionally you
-also need to wrap the modification of each of the buffers
-with calls to the journal layer, so it knows what the modifications
-you are actually making are. To do this use jbd2_journal_start() which
-returns a transaction handle.
-</para>
-
-<para>
-jbd2_journal_start()
-and its counterpart jbd2_journal_stop(), which indicates the end of a
-transaction, are nestable calls, so you can reenter a transaction if necessary,
-but remember you must call jbd2_journal_stop() the same number of times as
-jbd2_journal_start() before the transaction is completed (or more accurately
-leaves the update phase). Ext4/VFS makes use of this feature to simplify
-handling of inode dirtying, quota support, etc.
-</para>
-
-<para>
-Inside each transaction you need to wrap the modifications to the
-individual buffers (blocks). Before you start to modify a buffer you
-need to call jbd2_journal_get_{create,write,undo}_access() as appropriate;
-this allows the journalling layer to copy the unmodified data if it
-needs to. After all, the buffer may be part of a previously uncommitted
-transaction.
-At this point you are at last ready to modify a buffer, and once
-you have done so you need to call jbd2_journal_dirty_{meta,}data().
-Or if you've asked for access to a buffer you now know is no longer
-required to be pushed back on the device you can call jbd2_journal_forget()
-in much the same way as you might have used bforget() in the past.
-</para>
-
-<para>
-A jbd2_journal_flush() may be called at any time to commit and checkpoint
-all your transactions.
-</para>
-
-<para>
-Then at umount time, in your put_super() you can then call jbd2_journal_destroy()
-to clean up your in-core journal object.
-</para>
-
-<para>
-Unfortunately there are a couple of ways the journal layer can cause a deadlock.
-The first thing to note is that each task can only have
-a single outstanding transaction at any one time; remember, nothing
-commits until the outermost jbd2_journal_stop(). This means
-you must complete the transaction at the end of each file/inode/address
-etc. operation you perform, so that the journalling system isn't re-entered
-on another journal, since transactions can't be nested/batched
-across differing journals, and another filesystem other than
-yours (say ext4) may be modified in a later syscall.
-</para>
-
-<para>
-The second case to bear in mind is that jbd2_journal_start() can
-block if there isn't enough space in the journal for your transaction
-(based on the passed nblocks param) - when it blocks it merely(!) needs to
-wait for transactions to complete and be committed from other tasks,
-so essentially we are waiting for jbd2_journal_stop(). So to avoid
-deadlocks you must treat jbd2_journal_start/stop() as if they
-were semaphores and include them in your semaphore ordering rules to prevent
-deadlocks. Note that jbd2_journal_extend() has similar blocking behaviour to
-jbd2_journal_start() so you can deadlock here just as easily as on
-jbd2_journal_start().
-</para>
-
-<para>
-Try to reserve the right number of blocks the first time. ;-). This will
-be the maximum number of blocks you are going to touch in this transaction.
-I advise having a look at at least ext4_jbd.h to see the basis on which
-ext4 makes these decisions.
-</para>
-
-<para>
-Another wriggle to watch out for is your on-disk block allocation strategy.
-Why? Because, if you do a delete, you need to ensure you haven't reused any
-of the freed blocks until the transaction freeing these blocks commits. If you
-reuse these blocks and a crash happens, there is no way to restore the contents
-of the reallocated blocks at the end of the last fully committed transaction.
-
-One simple way of doing this is to mark blocks as free in internal in-memory
-block allocation structures only after the transaction freeing them commits.
-Ext4 uses a journal commit callback for this purpose.
-</para>
-
-<para>
-With journal commit callbacks you can ask the journalling layer to call a
-callback function when the transaction is finally committed to disk, so that
-you can do some of your own management. You ask the journalling layer to
-call the callback by simply setting the journal->j_commit_callback function
-pointer and that function is called after each transaction commit. You can also
-use transaction->t_private_list for attaching entries to a transaction that
-need processing when the transaction commits.
-</para>
-
-<para>
-JBD2 also provides a way to block all transaction updates via
-jbd2_journal_{un,}lock_updates(). Ext4 uses this when it wants a window with a
-clean and stable fs for a moment. E.g.
-</para>
-
-<programlisting>
-
-    jbd2_journal_lock_updates() //stop new stuff happening..
-    jbd2_journal_flush()        // checkpoint everything.
-    ..do stuff on stable fs
-    jbd2_journal_unlock_updates() // carry on with filesystem use.
-</programlisting>
-
-<para>
-The opportunities for abuse and DoS attacks with this should be obvious,
-if you allow unprivileged userspace to trigger codepaths containing these
-calls.
-</para>
-
-    </sect2>
-
-    <sect2 id="jbd_summary">
-     <title>Summary</title>
-<para>
-Using the journal is a matter of wrapping the different context changes,
-being each mount, each modification (transaction) and each changed buffer
-to tell the journalling layer about them.
-</para>
-
-    </sect2>
-
-    </sect1>
-
-    <sect1 id="data_types">
-     <title>Data Types</title>
-     <para>
-     The journalling layer uses typedefs to 'hide' the concrete definitions
-     of the structures used. As a client of the JBD2 layer you can
-     just rely on using the pointer as a magic cookie of some sort.
-
-     Obviously the hiding is not enforced as this is 'C'.
-     </para>
-     <sect2 id="structures"><title>Structures</title>
-!Iinclude/linux/jbd2.h
-     </sect2>
-    </sect1>
-
-    <sect1 id="functions">
-     <title>Functions</title>
-     <para>
-     The functions here are split into two groups: those that
-     affect a journal as a whole, and those which are used to
-     manage transactions.
-     </para>
-     <sect2 id="journal_level"><title>Journal Level</title>
-!Efs/jbd2/journal.c
-!Ifs/jbd2/recovery.c
-     </sect2>
-     <sect2 id="transaction_level"><title>Transaction Level</title>
-!Efs/jbd2/transaction.c
-     </sect2>
-    </sect1>
-    <sect1 id="see_also">
-     <title>See also</title>
-     <para>
-      <citation>
-       <ulink url="http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz">
-        Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen Tweedie
-       </ulink>
-      </citation>
-     </para>
-     <para>
-      <citation>
-       <ulink url="http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html">
-        Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen Tweedie
-       </ulink>
-      </citation>
-     </para>
-    </sect1>
-
-  </chapter>
-
-  <chapter id="splice">
-      <title>splice API</title>
-  <para>
-	splice is a method for moving blocks of data around inside the
-	kernel, without continually transferring them between the kernel
-	and user space.
-  </para>
-!Ffs/splice.c
-  </chapter>
-
-  <chapter id="pipes">
-      <title>pipes API</title>
-  <para>
-	Pipe interfaces are all for in-kernel (builtin image) use.
-	They are not exported for use by modules.
-  </para>
-!Iinclude/linux/pipe_fs_i.h
-!Ffs/pipe.c
-  </chapter>
-
-</book>
+2
Documentation/conf.py
@@ -359,6 +359,8 @@
      'The kernel development community', 'manual'),
     ('driver-api/index', 'driver-api.tex', 'The kernel driver API manual',
      'The kernel development community', 'manual'),
+    ('filesystems/index', 'filesystems.tex', 'Linux Filesystems API',
+     'The kernel development community', 'manual'),
     ('gpu/index', 'gpu.tex', 'Linux GPU Driver Developer\'s Guide',
      'The kernel development community', 'manual'),
     ('input/index', 'linux-input.tex', 'The Linux input driver subsystem',
@@ -0,0 +1,314 @@
+=====================
+Linux Filesystems API
+=====================
+
+The Linux VFS
+=============
+
+The Filesystem types
+--------------------
+
+.. kernel-doc:: include/linux/fs.h
+   :internal:
+
+The Directory Cache
+-------------------
+
+.. kernel-doc:: fs/dcache.c
+   :export:
+
+.. kernel-doc:: include/linux/dcache.h
+   :internal:
+
+Inode Handling
+--------------
+
+.. kernel-doc:: fs/inode.c
+   :export:
+
+.. kernel-doc:: fs/bad_inode.c
+   :export:
+
+Registration and Superblocks
+----------------------------
+
+.. kernel-doc:: fs/super.c
+   :export:
+
+File Locks
+----------
+
+.. kernel-doc:: fs/locks.c
+   :export:
+
+.. kernel-doc:: fs/locks.c
+   :internal:
+
+Other Functions
+---------------
+
+.. kernel-doc:: fs/mpage.c
+   :export:
+
+.. kernel-doc:: fs/namei.c
+   :export:
+
+.. kernel-doc:: fs/buffer.c
+   :export:
+
+.. kernel-doc:: block/bio.c
+   :export:
+
+.. kernel-doc:: fs/seq_file.c
+   :export:
+
+.. kernel-doc:: fs/filesystems.c
+   :export:
+
+.. kernel-doc:: fs/fs-writeback.c
+   :export:
+
+.. kernel-doc:: fs/block_dev.c
+   :export:
+
+The proc filesystem
+===================
+
+sysctl interface
+----------------
+
+.. kernel-doc:: kernel/sysctl.c
+   :export:
+
+proc filesystem interface
+-------------------------
+
+.. kernel-doc:: fs/proc/base.c
+   :internal:
+
+Events based on file descriptors
+================================
+
+.. kernel-doc:: fs/eventfd.c
+   :export:
+
+The Filesystem for Exporting Kernel Objects
+===========================================
+
+.. kernel-doc:: fs/sysfs/file.c
+   :export:
+
+.. kernel-doc:: fs/sysfs/symlink.c
+   :export:
+
+The debugfs filesystem
+======================
+
+debugfs interface
+-----------------
+
+.. kernel-doc:: fs/debugfs/inode.c
+   :export:
+
+.. kernel-doc:: fs/debugfs/file.c
+   :export:
+
+The Linux Journalling API
+=========================
+
+Overview
+--------
+
+Details
+~~~~~~~
+
+The journalling layer is easy to use. You need to first of all create a
+journal_t data structure. There are two calls to do this dependent on
+how you decide to allocate the physical media on which the journal
+resides. The jbd2_journal_init_inode() call is for journals stored in
+filesystem inodes, or the jbd2_journal_init_dev() call can be used
+for journals stored on a raw device (in a continuous range of blocks). A
+journal_t is a typedef for a struct pointer, so when you are finally
+finished make sure you call jbd2_journal_destroy() on it to free up
+any used kernel memory.
+
+Once you have got your journal_t object you need to 'mount' or load the
+journal file. The journalling layer expects the space for the journal
+was already allocated and initialized properly by the userspace tools.
+When loading the journal you must call jbd2_journal_load() to process
+journal contents. If the client file system detects the journal contents
+do not need to be processed (or even need not have valid contents), it
+may call jbd2_journal_wipe() to clear the journal contents before
+calling jbd2_journal_load().
+
+Note that jbd2_journal_wipe(..,0) calls
+jbd2_journal_skip_recovery() for you if it detects any outstanding
+transactions in the journal and similarly jbd2_journal_load() will
+call jbd2_journal_recover() if necessary. I would advise reading
+ext4_load_journal() in fs/ext4/super.c for examples on this stage.
+
+Now you can go ahead and start modifying the underlying filesystem.
+Almost.
+
+You still need to actually journal your filesystem changes; this is done
+by wrapping them into transactions. Additionally you also need to wrap
+the modification of each of the buffers with calls to the journal layer,
+so it knows what the modifications you are actually making are. To do
+this use jbd2_journal_start() which returns a transaction handle.
+
+jbd2_journal_start() and its counterpart jbd2_journal_stop(), which
+indicates the end of a transaction, are nestable calls, so you can
+reenter a transaction if necessary, but remember you must call
+jbd2_journal_stop() the same number of times as jbd2_journal_start()
+before the transaction is completed (or more accurately leaves the
+update phase). Ext4/VFS makes use of this feature to simplify handling
+of inode dirtying, quota support, etc.
+
+Inside each transaction you need to wrap the modifications to the
+individual buffers (blocks). Before you start to modify a buffer you
+need to call jbd2_journal_get_{create,write,undo}_access() as
+appropriate; this allows the journalling layer to copy the unmodified
+data if it needs to. After all, the buffer may be part of a previously
+uncommitted transaction. At this point you are at last ready to modify a
+buffer, and once you have done so you need to call
+jbd2_journal_dirty_{meta,}data(). Or if you've asked for access to a
+buffer you now know is no longer required to be pushed back on the
+device you can call jbd2_journal_forget() in much the same way as you
+might have used bforget() in the past.
+
+A jbd2_journal_flush() may be called at any time to commit and
+checkpoint all your transactions.
+
+Then at umount time, in your put_super() you can then call
+jbd2_journal_destroy() to clean up your in-core journal object.
+
+Unfortunately there are a couple of ways the journal layer can cause a
+deadlock. The first thing to note is that each task can only have a
+single outstanding transaction at any one time; remember, nothing commits
+until the outermost jbd2_journal_stop(). This means you must complete
+the transaction at the end of each file/inode/address etc. operation you
+perform, so that the journalling system isn't re-entered on another
+journal, since transactions can't be nested/batched across differing
+journals, and another filesystem other than yours (say ext4) may be
+modified in a later syscall.
+
+The second case to bear in mind is that jbd2_journal_start() can block
+if there isn't enough space in the journal for your transaction (based
+on the passed nblocks param) - when it blocks it merely(!) needs to wait
+for transactions to complete and be committed from other tasks, so
+essentially we are waiting for jbd2_journal_stop(). So to avoid
+deadlocks you must treat jbd2_journal_start/stop() as if they were
+semaphores and include them in your semaphore ordering rules to prevent
+deadlocks. Note that jbd2_journal_extend() has similar blocking
+behaviour to jbd2_journal_start() so you can deadlock here just as
+easily as on jbd2_journal_start().
+
+Try to reserve the right number of blocks the first time. ;-). This will
+be the maximum number of blocks you are going to touch in this
+transaction. I advise having a look at at least ext4_jbd.h to see the
+basis on which ext4 makes these decisions.
+
+Another wriggle to watch out for is your on-disk block allocation
+strategy. Why? Because, if you do a delete, you need to ensure you
+haven't reused any of the freed blocks until the transaction freeing
+these blocks commits. If you reuse these blocks and a crash happens,
+there is no way to restore the contents of the reallocated blocks at the
+end of the last fully committed transaction. One simple way of doing
+this is to mark blocks as free in internal in-memory block allocation
+structures only after the transaction freeing them commits. Ext4 uses
+a journal commit callback for this purpose.
+
+With journal commit callbacks you can ask the journalling layer to call
+a callback function when the transaction is finally committed to disk,
+so that you can do some of your own management. You ask the journalling
+layer to call the callback by simply setting the
+journal->j_commit_callback function pointer and that function is
+called after each transaction commit. You can also use
+transaction->t_private_list for attaching entries to a transaction
+that need processing when the transaction commits.
+
+JBD2 also provides a way to block all transaction updates via
+jbd2_journal_{un,}lock_updates(). Ext4 uses this when it wants a
+window with a clean and stable fs for a moment. E.g.
+
+::
+
+
+    jbd2_journal_lock_updates() //stop new stuff happening..
+    jbd2_journal_flush()        // checkpoint everything.
+    ..do stuff on stable fs
+    jbd2_journal_unlock_updates() // carry on with filesystem use.
+
+The opportunities for abuse and DoS attacks with this should be obvious,
+if you allow unprivileged userspace to trigger codepaths containing
+these calls.
+
+Summary
+~~~~~~~
+
+Using the journal is a matter of wrapping the different context changes,
+being each mount, each modification (transaction) and each changed
+buffer to tell the journalling layer about them.
+
+Data Types
+----------
+
+The journalling layer uses typedefs to 'hide' the concrete definitions
+of the structures used. As a client of the JBD2 layer you can just rely
+on using the pointer as a magic cookie of some sort. Obviously the
+hiding is not enforced as this is 'C'.
+
+Structures
+~~~~~~~~~~
+
+.. kernel-doc:: include/linux/jbd2.h
+   :internal:
+
+Functions
+---------
+
+The functions here are split into two groups: those that affect a journal
+as a whole, and those which are used to manage transactions.
+
+Journal Level
+~~~~~~~~~~~~~
+
+.. kernel-doc:: fs/jbd2/journal.c
+   :export:
+
+.. kernel-doc:: fs/jbd2/recovery.c
+   :internal:
+
+Transaction Level
+~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: fs/jbd2/transaction.c
+   :export:
+
+See also
+--------
+
+`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
+Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__
+
+`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
+Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__
+
+splice API
+==========
+
+splice is a method for moving blocks of data around inside the kernel,
+without continually transferring them between the kernel and user space.
+
+.. kernel-doc:: fs/splice.c
+
+pipes API
+=========
+
+Pipe interfaces are all for in-kernel (builtin image) use. They are not
+exported for use by modules.
+
+.. kernel-doc:: include/linux/pipe_fs_i.h
+   :internal:
+
+.. kernel-doc:: fs/pipe.c
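
Note on the journalling text carried over by this patch: the rule that jbd2_journal_start() and jbd2_journal_stop() nest, and that only the outermost stop lets the handle leave the update phase, may be easier to follow with a small model. The sketch below is a userspace toy, not kernel code. Every name in it (toy_task, toy_handle, toy_start, toy_stop) is hypothetical and exists only for illustration; it assumes a per-task "current handle" slot and a reference count, and models nothing else (no journal space reservation, no blocking, no commit).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct toy_handle {
    int ref;           /* nesting depth of start calls on this handle */
    int updates_done;  /* how many times the update phase was left */
};

struct toy_task {
    struct toy_handle *current_handle; /* at most one open handle per task */
    struct toy_handle storage;         /* backing storage for that handle */
};

/*
 * Start a handle, or re-enter the one already open for this task.
 * A nested call returns the same handle with a higher reference count.
 */
static struct toy_handle *toy_start(struct toy_task *task)
{
    if (task->current_handle != NULL) {
        task->current_handle->ref++;
        return task->current_handle;
    }
    task->storage.ref = 1;
    task->current_handle = &task->storage;
    return task->current_handle;
}

/*
 * Stop once. Returns 1 only for the outermost stop, which is the point
 * where the handle leaves the update phase; inner stops return 0.
 */
static int toy_stop(struct toy_task *task)
{
    struct toy_handle *h = task->current_handle;

    assert(h != NULL && h->ref > 0); /* stop without a matching start */
    if (--h->ref > 0)
        return 0;
    h->updates_done++;
    task->current_handle = NULL;
    return 1;
}
```

With a zero-initialised toy_task, two toy_start() calls return the same handle, the first toy_stop() returns 0, and only the second returns 1. This is why a start without its matching stop keeps the task's handle open.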