Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

[PATCH] VFS: update documentation

This patch brings the now out-of-date Documentation/filesystems/vfs.txt
back to life. Thanks to Carsten Otte, Trond Myklebust, and Anton
Altaparmakov for their help on updating this documentation.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

authored by Pekka J Enberg, committed by Linus Torvalds
5ea626aa 952b6492

+323 -112
Documentation/filesystems/vfs.txt
		Overview of the Linux Virtual File System

	Original author: Richard Gooch <rgooch@atnf.csiro.au>

		  Last updated on August 25, 2005

  Copyright (C) 1999 Richard Gooch
  Copyright (C) 2005 Pekka Enberg

  This file is released under the GPLv2.


What is it?
===========

The Virtual File System (otherwise known as the Virtual Filesystem
Switch) is the software layer in the kernel that provides the
filesystem interface to userspace programs. It also provides an
abstraction within the kernel which allows different filesystem
implementations to coexist.


A Quick Look At How It Works
============================

In this section I'll briefly describe how things work, before
···
other view which is how a filesystem is supported and subsequently
mounted.


Opening a File
--------------

The VFS implements the open(2), stat(2), chmod(2) and similar system
···

Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file
descriptors). The freshly allocated file structure is initialized with
a pointer to the dentry and a set of file operation member functions.
These are taken from the inode data. The open() file method is then
called so the specific filesystem implementation can do its work. You
···
processors. You should ensure that access to shared resources is
protected by appropriate locks.


Registering and Mounting a Filesystem
-------------------------------------

If you want to support a new kind of filesystem in the kernel, all you
···
It's now time to look at things in more detail.


struct file_system_type
=======================

This describes the filesystem. As of kernel 2.6.13, the following
members are defined:

struct file_system_type {
	const char *name;
	int fs_flags;
	struct super_block *(*get_sb) (struct file_system_type *, int,
				       const char *, void *);
	void (*kill_sb) (struct super_block *);
	struct module *owner;
	struct file_system_type * next;
	struct list_head fs_supers;
};

  name: the name of the filesystem type, such as "ext2", "iso9660",
···

  fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
  get_sb: the method to call when a new instance of this
	filesystem should be mounted

  kill_sb: the method to call when an instance of this filesystem
	should be unmounted

  owner: for internal VFS use: you should initialize this to THIS_MODULE
	in most cases.

  next: for internal VFS use: you should initialize this to NULL

The get_sb() method has the following arguments:

  struct super_block *sb: the superblock structure. This is partially
	initialized by the VFS and the rest must be initialized by the
	get_sb() method

  int flags: mount flags

  const char *dev_name: the device name we are mounting.

  void *data: arbitrary mount options, usually comes as an ASCII
	string

  int silent: whether or not to be silent on error

The get_sb() method must determine if the block device specified
in the superblock contains a filesystem of the type the method
supports. On success the method returns the superblock pointer, on
failure it returns NULL.

The most interesting member of the superblock structure that the
get_sb() method fills in is the "s_op" field. This is a pointer to
a "struct super_operations" which describes the next level of the
filesystem implementation.

Usually, a filesystem uses one of the generic get_sb()
implementations and provides a fill_super() method instead. The
generic methods are:

  get_sb_bdev: mount a filesystem residing on a block device

  get_sb_nodev: mount a filesystem that is not backed by a device

  get_sb_single: mount a filesystem which shares the instance between
	all mounts

A fill_super() method implementation has the following arguments:

  struct super_block *sb: the superblock structure. The method fill_super()
	must initialize this properly.

  void *data: arbitrary mount options, usually comes as an ASCII
	string

  int silent: whether or not to be silent on error


struct super_operations
=======================

This describes how the VFS can manipulate the superblock of your
filesystem. As of kernel 2.6.13, the following members are defined:

struct super_operations {
	struct inode *(*alloc_inode)(struct super_block *sb);
	void (*destroy_inode)(struct inode *);

	void (*read_inode) (struct inode *);

	void (*dirty_inode) (struct inode *);
	int (*write_inode) (struct inode *, int);
	void (*put_inode) (struct inode *);
	void (*drop_inode) (struct inode *);
	void (*delete_inode) (struct inode *);
	void (*put_super) (struct super_block *);
	void (*write_super) (struct super_block *);
	int (*sync_fs)(struct super_block *sb, int wait);
	void (*write_super_lockfs) (struct super_block *);
	void (*unlockfs) (struct super_block *);
	int (*statfs) (struct super_block *, struct kstatfs *);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*clear_inode) (struct inode *);
	void (*umount_begin) (struct super_block *);

	void (*sync_inodes) (struct super_block *sb,
				struct writeback_control *wbc);
	int (*show_options)(struct seq_file *, struct vfsmount *);

	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
};

All methods are called without any locks being held, unless otherwise
···
only called from a process context (i.e. not from an interrupt handler
or bottom half).

  alloc_inode: this method is called by inode_alloc() to allocate memory
	for struct inode and initialize it.

  destroy_inode: this method is called by destroy_inode() to release
	resources allocated for struct inode.

  read_inode: this method is called to read a specific inode from the
	mounted filesystem. The i_ino member in the struct inode is
	initialized by the VFS to indicate which inode to read. Other
	members are filled in by this method.

	You can set this to NULL and use iget5_locked() instead of iget()
	to read inodes. This is necessary for filesystems for which the
	inode number is not sufficient to identify an inode.

  dirty_inode: this method is called by the VFS to mark an inode dirty.
  write_inode: this method is called when the VFS needs to write an
	inode to disc. The second parameter indicates whether the write
	should be synchronous or not; not all filesystems check this flag.

  put_inode: called when the VFS inode is removed from the inode
	cache.

  drop_inode: called when the last access to the inode is dropped,
	with the inode_lock spinlock held.

	This method should be either NULL (normal UNIX filesystem
	semantics) or "generic_delete_inode" (for filesystems that do not
	want to cache inodes - causing "delete_inode" to always be
	called regardless of the value of i_nlink)

	The "generic_delete_inode()" behavior is equivalent to the
	old practice of using "force_delete" in the put_inode() case,
	but does not have the races that the "force_delete()" approach
	had.

  delete_inode: called when the VFS wants to delete an inode

  put_super: called when the VFS wishes to free the superblock
	(i.e. unmount). This is called with the superblock lock held

  write_super: called when the VFS superblock needs to be written to
	disc. This method is optional

  sync_fs: called when VFS is writing out all dirty data associated with
	a superblock. The second parameter indicates whether the method
	should wait until the write out has been completed. Optional.

  write_super_lockfs: called when VFS is locking a filesystem and forcing
	it into a consistent state. This function is currently used by the
	Logical Volume Manager (LVM).

  unlockfs: called when VFS is unlocking a filesystem and making it writable
	again.

  statfs: called when the VFS needs to get filesystem statistics. This
	is called with the kernel lock held
···

  clear_inode: called when the VFS clears the inode. Optional

  umount_begin: called when the VFS is unmounting a filesystem.

  sync_inodes: called when the VFS is writing out dirty data associated with
	a superblock.

  show_options: called by the VFS to show mount options for /proc/<pid>/mounts.

  quota_read: called by the VFS to read from the filesystem quota file.

  quota_write: called by the VFS to write to the filesystem quota file.

The read_inode() method is responsible for filling in the "i_op"
field. This is a pointer to a "struct inode_operations" which
describes the methods that can be performed on individual inodes.


struct inode_operations
=======================

This describes how the VFS can manipulate an inode in your
filesystem. As of kernel 2.6.13, the following members are defined:

struct inode_operations {
	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
···
	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
	int (*rename) (struct inode *, struct dentry *,
			struct inode *, struct dentry *);
	int (*readlink) (struct dentry *, char __user *,int);
	void * (*follow_link) (struct dentry *, struct nameidata *);
	void (*put_link) (struct dentry *, struct nameidata *, void *);
	void (*truncate) (struct inode *);
	int (*permission) (struct inode *, int, struct nameidata *);
	int (*setattr) (struct dentry *, struct iattr *);
	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
	int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
	ssize_t (*listxattr) (struct dentry *, char *, size_t);
	int (*removexattr) (struct dentry *, const char *);
};

Again, all methods are called without any locks being held, unless
otherwise noted.

  create: called by the open(2) and creat(2) system calls. Only
	required if you want to support regular files. The dentry you
···
	you want to support reading symbolic links

  follow_link: called by the VFS to follow a symbolic link to the
	inode it points to. Only required if you want to support
	symbolic links. This function returns a void pointer cookie
	that is passed to put_link().

  put_link: called by the VFS to release resources allocated by
	follow_link(). The cookie returned by follow_link() is passed
	to this function as the last parameter. It is used by filesystems
	such as NFS where the page cache is not stable (i.e. the page that
	was installed when the symbolic link walk started might not be in
	the page cache at the end of the walk).

  truncate: called by the VFS to change the size of a file. The i_size
	field of the inode is set to the desired size by the VFS before
	this function is called. This function is called by the truncate(2)
	system call and related functionality.

  permission: called by the VFS to check for access rights on a POSIX-like
	filesystem.

  setattr: called by the VFS to set attributes for a file. This function is
	called by chmod(2) and related system calls.

  getattr: called by the VFS to get attributes of a file. This function is
	called by stat(2) and related system calls.

  setxattr: called by the VFS to set an extended attribute for a file. An
	extended attribute is a name:value pair associated with an inode.
	This function is called by the setxattr(2) system call.

  getxattr: called by the VFS to retrieve the value of an extended attribute
	name. This function is called by the getxattr(2) system call.

  listxattr: called by the VFS to list all extended attributes for a given
	file. This function is called by the listxattr(2) system call.

  removexattr: called by the VFS to remove an extended attribute from a file.
	This function is called by the removexattr(2) system call.


struct address_space_operations
===============================

This describes how the VFS can manipulate the mapping of a file to the
page cache in your filesystem. As of kernel 2.6.13, the following members
are defined:

struct address_space_operations {
	int (*writepage)(struct page *page, struct writeback_control *wbc);
	int (*readpage)(struct file *, struct page *);
	int (*sync_page)(struct page *);
	int (*writepages)(struct address_space *, struct writeback_control *);
	int (*set_page_dirty)(struct page *page);
	int (*readpages)(struct file *filp, struct address_space *mapping,
			struct list_head *pages, unsigned nr_pages);
	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
	sector_t (*bmap)(struct address_space *, sector_t);
	int (*invalidatepage) (struct page *, unsigned long);
	int (*releasepage) (struct page *, int);
	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
			loff_t offset, unsigned long nr_segs);
	struct page* (*get_xip_page)(struct address_space *, sector_t,
			int);
};

  writepage: called by the VM to write a dirty page to backing store.
  readpage: called by the VM to read a page from backing store.

  sync_page: called by the VM to notify the backing store to perform all
	queued I/O operations for a page. I/O operations for other pages
	associated with this address_space object may also be performed.

  writepages: called by the VM to write out pages associated with the
	address_space object.

  set_page_dirty: called by the VM to set a page dirty.

  readpages: called by the VM to read pages associated with the address_space
	object.

  prepare_write: called by the generic write path in VM to set up a write
	request for a page.

  commit_write: called by the generic write path in VM to write a page to
	its backing store.

  bmap: called by the VFS to map a logical block offset within the object to
	a physical block number. This method is used by the legacy FIBMAP
	ioctl. Other uses are discouraged.

  invalidatepage: called by the VM on truncate to disassociate a page from its
	address_space mapping.

  releasepage: called by the VFS to release filesystem-specific metadata from
	a page.

  direct_IO: called by the VM for direct I/O writes and reads.

  get_xip_page: called by the VM to translate a block number to a page.
	The page is valid until the corresponding filesystem is unmounted.
	Filesystems that want to use execute-in-place (XIP) need to implement
	it. An example implementation can be found in fs/ext2/xip.c.


struct file_operations
======================

This describes how the VFS can manipulate an open file. As of kernel
2.6.13, the following members are defined:

struct file_operations {
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *, int datasync);
	int (*aio_fsync) (struct kiocb *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
	ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long,
			unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*dir_notify)(struct file *filp, unsigned long arg);
	int (*flock) (struct file *, int, struct file_lock *);
};

Again, all methods are called without any locks being held, unless
···

  read: called by read(2) and related system calls

  aio_read: called by io_submit(2) and other asynchronous I/O operations

  write: called by write(2) and related system calls

  aio_write: called by io_submit(2) and other asynchronous I/O operations

  readdir: called when the VFS needs to read the directory contents
···

  ioctl: called by the ioctl(2) system call

  unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
	require the BKL should use this method instead of the ioctl() above.

  compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
	are used on 64 bit kernels.

  mmap: called by the mmap(2) system call

  open: called by the VFS when an inode should be opened. When the VFS
	opens a file, it creates a new "struct file". It then calls the
	open method for the newly allocated file structure. You might
	think that the open method really belongs in
	"struct inode_operations", and you may be right. I think it's
	done the way it is because it makes filesystems simpler to
	implement. The open() method is a good place to initialize the
	"private_data" member in the file structure if you want to point
	to a device structure

  flush: called by the close(2) system call to flush a file

  release: called when the last reference to an open file is closed
···

  fasync: called by the fcntl(2) system call when asynchronous
	(non-blocking) mode is enabled for a file

  lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
	commands

  readv: called by the readv(2) system call

  writev: called by the writev(2) system call

  sendfile: called by the sendfile(2) system call

  get_unmapped_area: called by the mmap(2) system call

  check_flags: called by the fcntl(2) system call for the F_SETFL command

  dir_notify: called by the fcntl(2) system call for the F_NOTIFY command

  flock: called by the flock(2) system call

Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
···
operations with those for the device driver, and then proceed to call
the new open() method for the file. This is how opening a device file
in the filesystem eventually ends up calling the device driver open()
method.


Directory Entry Cache (dcache)
==============================


struct dentry_operations
------------------------

This describes how a filesystem can overload the standard dentry
operations. Dentries and the dcache are the domain of the VFS and the
individual filesystem implementations. Device drivers have no business
here. These methods may be set to NULL, as they are either optional or
the VFS uses a default. As of kernel 2.6.13, the following members are
defined:

struct dentry_operations {
	int (*d_revalidate)(struct dentry *, struct nameidata *);
	int (*d_hash) (struct dentry *, struct qstr *);
	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
	int (*d_delete)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
};
···
of child dentries. Child dentries are basically like files in a
directory.


Directory Entry Cache APIs
--------------------------
···
	"d_delete" method is called

  d_drop: this unhashes a dentry from its parent's hash list. A
	subsequent call to dput() will deallocate the dentry if its
	usage count drops to 0

  d_delete: delete a dentry. If there are no other open references to
···
	of the pathname and using that dentry along with the next
	component to look up the next level and so on. Since it
	is a frequent operation for workloads like multiuser
	environments and web servers, it is important to optimize
	this path.

	Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
	in every component during path look-up. Since 2.5.10 onwards,
	the fast-walk algorithm changed this by holding the dcache_lock
	at the beginning and walking as many cached path component
	dentries as possible. This significantly decreased the number
	of acquisitions of dcache_lock. However, it also increased the
	lock hold time significantly and affected performance on large
	SMP machines. Since the 2.5.62 kernel, dcache has been using
	a new locking model that uses RCU to make dcache look-up
	lock-free.
···
	as d_inode and several other things like mount look-up. RCU-based
	changes affect only the way the hash chain is protected. For everything
	else the dcache_lock must be taken for both traversing as well as
	updating. The hash chain updates too take the dcache_lock.
	The significant change is the way d_lookup traverses the hash chain:
	it doesn't acquire the dcache_lock for this and relies on RCU to
	ensure that the dentry has not been *freed*.
···

Dcache locking details
----------------------

For many multi-user workloads, open() and stat() on files are
very frequently occurring operations. Both involve walking
of path names to find the dentry corresponding to the
concerned file. In the 2.4 kernel, dcache_lock was held
during look-up of each path component. Contention and
cache-line bouncing of this global lock caused significant
scalability problems. With the introduction of RCU
into the Linux kernel, this was worked around by making
the look-up of path components during path walking lock-free.
···

 2. Insertion of a dentry into the hash table is done using
    hlist_add_head_rcu() which takes care of ordering the writes -
    the writes to the dentry must be visible before the dentry
    is inserted. This works in conjunction with hlist_for_each_rcu()
    while walking the hash chain. The only requirement is that
    all initialization of the dentry must be done before
    hlist_add_head_rcu() since we don't have dcache_lock protection
    while traversing
···
    the same. In some sense, dcache_rcu path walking looks like
    the pre-2.5.10 version.

 5. All dentry hash chain updates must take the dcache_lock as well as
    the per-dentry lock in that order. dput() does this to ensure
    that a dentry that has just been looked up on another CPU
    doesn't get deleted before dget() can be done on it.
···
    Since we redo the d_parent check and compare the name while holding
    d_lock, lock-free look-up will not race against d_move().

 4. There can be a theoretical race when a dentry keeps coming back
    to its original bucket due to double moves. Due to this, look-up may
    consider that it has never moved and can end up in an infinite loop.
    But this is not any worse than the theoretical livelocks we already
    have in the kernel.