Written by: Neil Brown <neilb@suse.de>

Overlay Filesystem
==================

This document describes a prototype for a new approach to providing
overlay-filesystem functionality in Linux (sometimes referred to as
union-filesystems). An overlay-filesystem tries to present a
filesystem which is the result of overlaying one filesystem on top
of the other.

The result will inevitably fail to look exactly like a normal
filesystem for various technical reasons. The expectation is that
many use cases will be able to ignore these differences.

This approach is 'hybrid' because the objects that appear in the
filesystem do not all appear to belong to that filesystem. In many
cases accessing an object in the union will be indistinguishable
from accessing the corresponding object from the original filesystem.
This is most obvious from the 'st_dev' field returned by stat(2).

While directories will report an st_dev from the overlay-filesystem,
all non-directory objects will report an st_dev from the lower or
upper filesystem that is providing the object. Similarly st_ino will
only be unique when combined with st_dev, and both of these can change
over the lifetime of a non-directory object. Many applications and
tools ignore these values and will not be affected.

Upper and Lower
---------------

An overlay filesystem combines two filesystems - an 'upper' filesystem
and a 'lower' filesystem.
When a name exists in both filesystems, the
object in the 'upper' filesystem is visible while the object in the
'lower' filesystem is either hidden or, in the case of directories,
merged with the 'upper' object.

It would be more correct to refer to an upper and lower 'directory
tree' rather than 'filesystem' as it is quite possible for both
directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.

The lower filesystem can be any filesystem supported by Linux and does
not need to be writable. The lower filesystem can even be another
overlayfs. The upper filesystem will normally be writable and, if it
is, it must support the creation of trusted.* extended attributes and
must provide valid d_type in readdir responses, so NFS is not suitable.

A read-only overlay of two read-only filesystems may use any
filesystem type.

Directories
-----------

Overlaying mainly involves directories. If a given name appears in both
upper and lower filesystems and refers to a non-directory in either,
then the lower object is hidden - the name refers only to the upper
object.

Where both upper and lower objects are directories, a merged directory
is formed.

At mount time, the two directories given as mount options "lowerdir" and
"upperdir" are combined into a merged directory:

  mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged

The "workdir" needs to be an empty directory on the same filesystem
as upperdir.

Then whenever a lookup is requested in such a merged directory, the
lookup is performed in each actual directory and the combined result
is cached in the dentry belonging to the overlay filesystem.
If both actual lookups find directories, both are stored and a merged
directory is created, otherwise only one is stored: the upper if it
exists, else the lower.

Only the lists of names from directories are merged. Other content
such as metadata and extended attributes is reported for the upper
directory only. These attributes of the lower directory are hidden.

whiteouts and opaque directories
--------------------------------

In order to support rm and rmdir without changing the lower
filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).

A whiteout is created as a character device with 0/0 device number.
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.

A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.

readdir
-------

When a 'readdir' request is made on a merged directory, the upper and
lower directories are each read and the name lists merged in the
obvious way (upper is read first, then lower - entries that already
exist are not re-added). This merged name list is cached in the
'struct file' and so remains as long as the file is kept open. If the
directory is opened and read by two processes at the same time, they
will each have separate caches. A seekdir to the start of the
directory (offset 0) followed by a readdir will cause the cache to be
discarded and rebuilt.

This means that changes to the merged directory do not appear while a
directory is being read.
This is unlikely to be noticed by many programs.

Seek offsets are assigned sequentially when the directories are read.
Thus if a program
  - reads part of a directory
  - remembers an offset, and closes the directory
  - re-opens the directory some time later
  - seeks to the remembered offset

there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
directory.

Readdir on directories that are not merged is simply handled by the
underlying directory (upper or lower).


Non-directories
---------------

Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
not.

The copy_up may turn out to be unnecessary, for example if the file is
opened for read-write but the data is not modified.

The copy_up process first makes sure that the containing directory
exists in the upper filesystem - creating it and any parents as
necessary. It then creates the object with the same metadata (owner,
mode, mtime, symlink-target etc.) and then if the object is a file, the
data is copied from the lower to the upper filesystem.
Finally any extended attributes are copied up.

Once the copy_up is complete, the overlay filesystem simply
provides direct access to the newly created file in the upper
filesystem - future operations on the file are barely noticed by the
overlay filesystem (though an operation on the name of the file such as
rename or unlink will of course be noticed and handled).


Non-standard behavior
---------------------

The copy_up operation essentially creates a new, identical file and
moves it over to the old name. The new file may be on a different
filesystem, so both st_dev and st_ino of the file may change.

Any open files referring to the original inode will access the old data
and metadata. Similarly any file locks obtained before copy_up will not
apply to the copied up file.

On a file opened with O_RDONLY, fchmod(2), fchown(2), futimesat(2) and
fsetxattr(2) will fail with EROFS.

If a file with multiple hard links is copied up, then this will
"break" the link. Changes will not be propagated to other names
referring to the same inode.

Symlinks in /proc/PID/ and /proc/PID/fd which point to a non-directory
object in overlayfs will not contain valid absolute paths, only
relative paths leading up to the filesystem's root. This will be
fixed in the future.

Some operations are not atomic, for example a crash during copy_up or
rename will leave the filesystem in an inconsistent state. This will
be addressed in the future.

Changes to underlying filesystems
---------------------------------

Offline changes, when the overlay is not mounted, are allowed to either
the upper or the lower trees.

Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed.
If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock.
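
As a recap of the name-merging rules described in the sections on
directories, whiteouts, opaque directories and readdir, the following
is a minimal user-space sketch in Python. It is a toy model and not the
kernel implementation: directories are plain dicts, and the names
merged_names, WHITEOUT and upper_is_opaque are hypothetical, chosen
only for this illustration.

```python
# Toy model only: directories are plain dicts, not real filesystem objects.
WHITEOUT = "whiteout"  # stands in for a character device with 0/0 device number


def merged_names(upper, lower, upper_is_opaque=False):
    """Return the names visible in a merged directory.

    upper, lower: dicts mapping name -> kind ("file", "dir" or WHITEOUT).
    upper_is_opaque: models trusted.overlay.opaque set to "y" on the
    upper directory.
    """
    names = []
    seen = set()
    # Upper is read first. Whiteouts are remembered so that they mask
    # the lower name, but they are never listed themselves.
    for name, kind in upper.items():
        seen.add(name)
        if kind != WHITEOUT:
            names.append(name)
    # An opaque upper directory hides the lower directory entirely.
    if not upper_is_opaque:
        for name in lower:
            # Entries that already exist are not re-added.
            if name not in seen:
                names.append(name)
    return names


upper = {"a.txt": "file", "old.log": WHITEOUT}
lower = {"a.txt": "file", "old.log": "file", "b.txt": "file"}
print(merged_names(upper, lower))        # ['a.txt', 'b.txt']
print(merged_names(upper, lower, True))  # ['a.txt']
```

In the first call the whiteout for old.log hides both itself and the
lower entry, and a.txt is listed once, from the upper directory. In
the second call the opaque upper directory means the lower names are
ignored altogether.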