	- info and mount options for the JFS filesystem.
locks.txt
	- info on file locking implementations, flock() vs. fcntl(), etc.
+logfs.txt
+	- info on the LogFS flash filesystem.
mandatory-locking.txt
	- info on the Linux implementation of Sys V mandatory file locking.
ncpfs.txt
Documentation/filesystems/logfs.txt (new file, +241 lines)
The LogFS Flash Filesystem
==========================

Specification
=============

Superblocks
-----------

Two superblocks exist at the beginning and end of the filesystem.
Each superblock is 256 Bytes large, with another 3840 Bytes reserved
for future purposes, making a total of 4096 Bytes.

Superblock locations may differ for MTD and block devices.  On MTD the
first non-bad block contains a superblock in the first 4096 Bytes and
the last non-bad block contains a superblock in the last 4096 Bytes.
On block devices, the first 4096 Bytes of the device contain the first
superblock and the last aligned 4096 Byte-block contains the second
superblock.

For the most part, the superblocks can be considered read-only.  They
are written only to correct errors detected within the superblocks,
to move the journal and to change the filesystem parameters through
tunefs.  As a result, the superblock does not contain any fields that
require constant updates, like the amount of free space, etc.

Segments
--------

The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS.  Within each segment,
writes happen from front (low addresses) to back (high addresses).  If
only a partial segment has been written, the segment number, the
current position within it and optionally a write buffer are stored in
the journal.

Segments are erased as a whole.  Therefore Garbage Collection may be
required to completely free a segment before doing so.

Journal
-------

The journal contains all global information about the filesystem that
is subject to frequent change.  At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.

Object Store
------------

All space except for the superblocks and journal is part of the object
store.  Each segment contains a segment header and a number of
objects, each consisting of the object header and the payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.

Levels
------

Garbage collection (GC) may fail if all data is written
indiscriminately.  One requirement of GC is that data is separated
roughly according to the distance between the tree root and the data.
Effectively that means all file data is on level 0, indirect blocks
are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect
blocks, respectively.  Inode file data is on level 6 for the inodes
and levels 7-11 for indirect blocks.

Each segment contains objects of a single level only.  As a result,
each level requires its own separate segment to be open for writing.

Inode File
----------

All inodes are stored in a special file, the inode file.  The single
exception is the inode file's inode (master inode) which for obvious
reasons is stored in the journal instead.  Instead of data blocks, the
leaf nodes of the inode file are inodes.

Aliases
-------

Writes in LogFS are done by means of a wandering tree.  A naïve
implementation would require that for each write of a block, all
parent blocks are written as well, since the block pointers have
changed.  Such an implementation would not be very efficient.

In LogFS, the block pointer changes are cached in the journal by means
of alias entries.  Each alias consists of its logical address - inode
number, block index, level and child number (index into block) - and
the changed data.  Any 8-byte word can be changed in this manner.

Currently aliases are used for block pointers, file size, file used
bytes and the height of an inode's indirect tree.

Segment Aliases
---------------

Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tuple.  GC will later empty this
segment and the alias can be removed again.  This is used on MTD only.

Vim
---

By cleverly predicting the lifetime of data, it is possible to
separate long-living data from short-living data and thereby reduce
the GC overhead later.  Each type of distinct life expectancy (vim)
can have a separate segment open for writing.  Each (level, vim) tuple
can be open just once.  If an open segment with unknown vim is
encountered at mount time, it is closed and ignored henceforth.

Indirect Tree
-------------

Inodes in LogFS are similar to FFS-style filesystems with direct and
indirect block pointers.  One difference is that LogFS uses a single
indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
A height field in the inode defines the height of the indirect tree
and thereby the indirection of the pointer.

Another difference is the addressing of indirect blocks.  In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode.  In ext2 (and
maybe others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 uses arithmetic to better utilize space, LogFS keeps
the arithmetic simple and uses compression to save space.

Compression
-----------

Both file data and metadata can be compressed.  Compression for file
data can be enabled with chattr +c and disabled with chattr -c.  Doing
so has no effect on existing data, but new data will be stored
accordingly.  New inodes will inherit the compression flag of the
parent directory.

Metadata is always compressed.  However, the space accounting ignores
this and charges for the uncompressed size.  Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously.  Even on a 100% full medium, GC may
not consume any extra space, so the compression gains are lost space
to the user.

However, they are not lost space to the filesystem internals.  By
cheating the user for those bytes, the filesystem gained some slack
space and GC will run less often and faster.

Garbage Collection and Wear Leveling
------------------------------------

Garbage collection is invoked whenever the number of free segments
falls below a threshold.  The best (known) candidate is picked based
on the least amount of valid data contained in the segment.  All
remaining valid data is copied elsewhere, thereby invalidating it.

The GC code also checks for aliases and writes them back if their
number gets too large.

Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection.  If a stale segment's erase count is significantly
lower than the active segments' erase counts, it will be picked.  Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment worth at a time.

Values for "occasionally" and "significantly lower" are compile time
constants.

Hashed directories
------------------

To support efficient lookup(), directory entries are hashed and
located based on the hash.  In order to both support large directories
and not be overly inefficient for small directories, several hash
tables of increasing size are used.  For each table, the hash value
modulo the table size gives the table index.

Table sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3 respectively.  So the first
table contains 16 entries, the second 512-16, etc.

The last table is special in several ways.  First, its size depends on
the effective 32bit limit on telldir/seekdir cookies.  Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31.  Secondly, the table contains hash buckets with 16
entries each.

Using single-entry buckets would result in birthday "attacks".  At
just 2^16 used entries, hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries with an average of 315,149,915 entries.  So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.

With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1).  So
there may be a security concern if a malicious user has write access
to a directory.

Open For Discussion
===================

Device Address Space
--------------------

A device address space is used for caching.  Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.

Meta Inodes
-----------

Inodes are stored in the inode file, which is just a regular file for
most purposes.  At umount time, however, the inode file needs to
remain open until all dirty inodes are written.  So
generic_shutdown_super() may not close this inode, but shouldn't
complain about remaining inodes due to the inode file either.  The
same goes for the mapping inode of the device address space.

Currently logfs uses a hack that essentially copies part of the
fs/inode.c code over.  A general solution would be preferred.

Indirect block mapping
----------------------

With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks.  Some other place is required.  Currently
logfs uses the top half of each inode's address space.  The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.

One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB.  But files >16TB would cause problems anyway, so
only the limit has changed.
config LOGFS
	tristate "LogFS file system (EXPERIMENTAL)"
	depends on (MTD || BLOCK) && EXPERIMENTAL
	select ZLIB_INFLATE
	select ZLIB_DEFLATE
	select CRC32
	select BTREE
	help
	  Flash filesystem aimed to scale efficiently to large devices.
	  In comparison to JFFS2 it offers significantly faster mount
	  times and potentially less RAM usage, although the latter has
	  not been measured yet.

	  In its current state it is still very experimental and should
	  not be used for other than testing purposes.

	  If unsure, say N.
/*
 * fs/logfs/compr.c - compression routines
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 */
#include "logfs.h"
#include <linux/vmalloc.h>
#include <linux/zlib.h>

#define COMPR_LEVEL 3

static DEFINE_MUTEX(compr_mutex);
static struct z_stream_s stream;

int logfs_compress(void *in, void *out, size_t inlen, size_t outlen)
{
	int err, ret;

	ret = -EIO;
	mutex_lock(&compr_mutex);
	err = zlib_deflateInit(&stream, COMPR_LEVEL);
	if (err != Z_OK)
		goto error;

	stream.next_in = in;
	stream.avail_in = inlen;
	stream.total_in = 0;
	stream.next_out = out;
	stream.avail_out = outlen;
	stream.total_out = 0;

	err = zlib_deflate(&stream, Z_FINISH);
	if (err != Z_STREAM_END)
		goto error;

	err = zlib_deflateEnd(&stream);
	if (err != Z_OK)
		goto error;

	if (stream.total_out >= stream.total_in)
		goto error;

	ret = stream.total_out;
error:
	mutex_unlock(&compr_mutex);
	return ret;
}

int logfs_uncompress(void *in, void *out, size_t inlen, size_t outlen)
{
	int err, ret;

	ret = -EIO;
	mutex_lock(&compr_mutex);
	err = zlib_inflateInit(&stream);
	if (err != Z_OK)
		goto error;

	stream.next_in = in;
	stream.avail_in = inlen;
	stream.total_in = 0;
	stream.next_out = out;
	stream.avail_out = outlen;
	stream.total_out = 0;

	err = zlib_inflate(&stream, Z_FINISH);
	if (err != Z_STREAM_END)
		goto error;

	err = zlib_inflateEnd(&stream);
	if (err != Z_OK)
		goto error;

	ret = 0;
error:
	mutex_unlock(&compr_mutex);
	return ret;
}

int __init logfs_compr_init(void)
{
	size_t size = max(zlib_deflate_workspacesize(),
			zlib_inflate_workspacesize());
	stream.workspace = vmalloc(size);
	if (!stream.workspace)
		return -ENOMEM;
	return 0;
}

void logfs_compr_exit(void)
{
	vfree(stream.workspace);
}
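Note that logfs_compress() only succeeds when the deflated output is
strictly smaller than the input (the total_out >= total_in check);
otherwise it returns -EIO and the caller stores the data uncompressed.
A userspace sketch of that store-only-if-smaller policy, using a naive
run-length coder as a hypothetical stand-in for zlib:

```c
#include <stddef.h>

/* Stand-in "compressor": naive run-length encoding emitting
 * (count, byte) pairs.  Mirrors the logfs_compress() contract:
 * return the compressed length, or -1 when the result would not be
 * strictly smaller than the input (caller then stores raw data). */
static int toy_compress(const unsigned char *in, unsigned char *out,
			size_t inlen, size_t outlen)
{
	size_t i = 0, o = 0;

	while (i < inlen) {
		unsigned char b = in[i];
		size_t run = 1;

		/* count the run length, capped so it fits in one byte */
		while (i + run < inlen && in[i + run] == b && run < 255)
			run++;
		if (o + 2 > outlen)
			return -1;	/* output buffer exhausted */
		out[o++] = (unsigned char)run;
		out[o++] = b;
		i += run;
	}
	if (o >= inlen)
		return -1;		/* not strictly smaller: reject */
	return (int)o;
}
```

Sixteen identical bytes shrink to a single (count, byte) pair, while
incompressible input is rejected rather than stored larger than it was.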
fs/logfs/dev_bdev.c (new file, +263 lines)
/*
 * fs/logfs/dev_bdev.c - Device access methods for block devices
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 */
#include "logfs.h"
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>

#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))

static void request_complete(struct bio *bio, int err)
{
	complete((struct completion *)bio->bi_private);
}

static int sync_request(struct page *page, struct block_device *bdev, int rw)
{
	struct bio bio;
	struct bio_vec bio_vec;
	struct completion complete;

	bio_init(&bio);
	bio.bi_io_vec = &bio_vec;
	bio_vec.bv_page = page;
	bio_vec.bv_len = PAGE_SIZE;
	bio_vec.bv_offset = 0;
	bio.bi_vcnt = 1;
	bio.bi_idx = 0;
	bio.bi_size = PAGE_SIZE;
	bio.bi_bdev = bdev;
	bio.bi_sector = page->index * (PAGE_SIZE >> 9);
	init_completion(&complete);
	bio.bi_private = &complete;
	bio.bi_end_io = request_complete;

	submit_bio(rw, &bio);
	generic_unplug_device(bdev_get_queue(bdev));
	wait_for_completion(&complete);
	return test_bit(BIO_UPTODATE, &bio.bi_flags) ? 0 : -EIO;
}

static int bdev_readpage(void *_sb, struct page *page)
{
	struct super_block *sb = _sb;
	struct block_device *bdev = logfs_super(sb)->s_bdev;
	int err;

	err = sync_request(page, bdev, READ);
	if (err) {
		ClearPageUptodate(page);
		SetPageError(page);
	} else {
		SetPageUptodate(page);
		ClearPageError(page);
	}
	unlock_page(page);
	return err;
}

static DECLARE_WAIT_QUEUE_HEAD(wq);

static void writeseg_end_io(struct bio *bio, int err)
{
	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct super_block *sb = bio->bi_private;
	struct logfs_super *super = logfs_super(sb);
	struct page *page;

	BUG_ON(!uptodate); /* FIXME: Retry io or write elsewhere */
	BUG_ON(err);
	BUG_ON(bio->bi_vcnt == 0);
	do {
		page = bvec->bv_page;
		if (--bvec >= bio->bi_io_vec)
			prefetchw(&bvec->bv_page->flags);

		end_page_writeback(page);
	} while (bvec >= bio->bi_io_vec);
	bio_put(bio);
	if (atomic_dec_and_test(&super->s_pending_writes))
		wake_up(&wq);
}

static int __bdev_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
		size_t nr_pages)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	struct bio *bio;
	struct page *page;
	struct request_queue *q = bdev_get_queue(sb->s_bdev);
	unsigned int max_pages = queue_max_hw_sectors(q) >> (PAGE_SHIFT - 9);
	int i;

	bio = bio_alloc(GFP_NOFS, max_pages);
	BUG_ON(!bio); /* FIXME: handle this */

	for (i = 0; i < nr_pages; i++) {
		if (i >= max_pages) {
			/* Block layer cannot split bios :( */
			bio->bi_vcnt = i;
			bio->bi_idx = 0;
			bio->bi_size = i * PAGE_SIZE;
			bio->bi_bdev = super->s_bdev;
			bio->bi_sector = ofs >> 9;
			bio->bi_private = sb;
			bio->bi_end_io = writeseg_end_io;
			atomic_inc(&super->s_pending_writes);
			submit_bio(WRITE, bio);

			ofs += i * PAGE_SIZE;
			index += i;
			nr_pages -= i;
			i = 0;

			bio = bio_alloc(GFP_NOFS, max_pages);
			BUG_ON(!bio);
		}
		page = find_lock_page(mapping, index + i);
		BUG_ON(!page);
		bio->bi_io_vec[i].bv_page = page;
		bio->bi_io_vec[i].bv_len = PAGE_SIZE;
		bio->bi_io_vec[i].bv_offset = 0;

		BUG_ON(PageWriteback(page));
		set_page_writeback(page);
		unlock_page(page);
	}
	bio->bi_vcnt = nr_pages;
	bio->bi_idx = 0;
	bio->bi_size = nr_pages * PAGE_SIZE;
	bio->bi_bdev = super->s_bdev;
	bio->bi_sector = ofs >> 9;
	bio->bi_private = sb;
	bio->bi_end_io = writeseg_end_io;
	atomic_inc(&super->s_pending_writes);
	submit_bio(WRITE, bio);
	return 0;
}

static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len)
{
	struct logfs_super *super = logfs_super(sb);
	int head;

	BUG_ON(super->s_flags & LOGFS_SB_FLAG_RO);

	if (len == 0) {
		/* This can happen when the object fit perfectly into a
		 * segment, the segment gets written per sync and subsequently
		 * closed.
		 */
		return;
	}
	head = ofs & (PAGE_SIZE - 1);
	if (head) {
		ofs -= head;
		len += head;
	}
	len = PAGE_ALIGN(len);
	__bdev_writeseg(sb, ofs, ofs >> PAGE_SHIFT, len >> PAGE_SHIFT);
	generic_unplug_device(bdev_get_queue(logfs_super(sb)->s_bdev));
}

static int bdev_erase(struct super_block *sb, loff_t to, size_t len)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	struct page *page;
	pgoff_t index = to >> PAGE_SHIFT;
	int i, nr_pages = len >> PAGE_SHIFT;

	BUG_ON(to & (PAGE_SIZE - 1));
	BUG_ON(len & (PAGE_SIZE - 1));

	if (logfs_super(sb)->s_flags & LOGFS_SB_FLAG_RO)
		return -EROFS;

	for (i = 0; i < nr_pages; i++) {
		page = find_get_page(mapping, index + i);
		if (page) {
			memset(page_address(page), 0xFF, PAGE_SIZE);
			page_cache_release(page);
		}
	}
	return 0;
}

static void bdev_sync(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);

	wait_event(wq, atomic_read(&super->s_pending_writes) == 0);
}

static struct page *bdev_find_first_sb(struct super_block *sb, u64 *ofs)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	filler_t *filler = bdev_readpage;

	*ofs = 0;
	return read_cache_page(mapping, 0, filler, sb);
}

static struct page *bdev_find_last_sb(struct super_block *sb, u64 *ofs)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	filler_t *filler = bdev_readpage;
	u64 pos = (super->s_bdev->bd_inode->i_size & ~0xfffULL) - 0x1000;
	pgoff_t index = pos >> PAGE_SHIFT;

	*ofs = pos;
	return read_cache_page(mapping, index, filler, sb);
}

static int bdev_write_sb(struct super_block *sb, struct page *page)
{
	struct block_device *bdev = logfs_super(sb)->s_bdev;

	/* Nothing special to do for block devices. */
	return sync_request(page, bdev, WRITE);
}

static void bdev_put_device(struct super_block *sb)
{
	close_bdev_exclusive(logfs_super(sb)->s_bdev, FMODE_READ|FMODE_WRITE);
}

static const struct logfs_device_ops bd_devops = {
	.find_first_sb	= bdev_find_first_sb,
	.find_last_sb	= bdev_find_last_sb,
	.write_sb	= bdev_write_sb,
	.readpage	= bdev_readpage,
	.writeseg	= bdev_writeseg,
	.erase		= bdev_erase,
	.sync		= bdev_sync,
	.put_device	= bdev_put_device,
};

int logfs_get_sb_bdev(struct file_system_type *type, int flags,
		const char *devname, struct vfsmount *mnt)
{
	struct block_device *bdev;

	bdev = open_bdev_exclusive(devname, FMODE_READ|FMODE_WRITE, type);
	if (IS_ERR(bdev))
		return PTR_ERR(bdev);

	if (MAJOR(bdev->bd_dev) == MTD_BLOCK_MAJOR) {
		int mtdnr = MINOR(bdev->bd_dev);
		close_bdev_exclusive(bdev, FMODE_READ|FMODE_WRITE);
		return logfs_get_sb_mtd(type, flags, mtdnr, mnt);
	}

	return logfs_get_sb_device(type, flags, NULL, bdev, &bd_devops, mnt);
}
fs/logfs/dev_mtd.c (new file, +253 lines)
/*
 * fs/logfs/dev_mtd.c - Device access methods for MTD
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 */
#include "logfs.h"
#include <linux/completion.h>
#include <linux/mount.h>
#include <linux/sched.h>

#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))

static int mtd_read(struct super_block *sb, loff_t ofs, size_t len, void *buf)
{
	struct mtd_info *mtd = logfs_super(sb)->s_mtd;
	size_t retlen;
	int ret;

	ret = mtd->read(mtd, ofs, len, &retlen, buf);
	BUG_ON(ret == -EINVAL);
	if (ret)
		return ret;

	/* Not sure if we should loop instead. */
	if (retlen != len)
		return -EIO;

	return 0;
}

static int mtd_write(struct super_block *sb, loff_t ofs, size_t len, void *buf)
{
	struct logfs_super *super = logfs_super(sb);
	struct mtd_info *mtd = super->s_mtd;
	size_t retlen;
	loff_t page_start, page_end;
	int ret;

	if (super->s_flags & LOGFS_SB_FLAG_RO)
		return -EROFS;

	BUG_ON((ofs >= mtd->size) || (len > mtd->size - ofs));
	BUG_ON(ofs != (ofs >> super->s_writeshift) << super->s_writeshift);
	BUG_ON(len > PAGE_CACHE_SIZE);
	page_start = ofs & PAGE_CACHE_MASK;
	page_end = PAGE_CACHE_ALIGN(ofs + len) - 1;
	ret = mtd->write(mtd, ofs, len, &retlen, buf);
	if (ret || (retlen != len))
		return -EIO;

	return 0;
}

/*
 * For as long as I can remember (since about 2001) mtd->erase has been an
 * asynchronous interface lacking the first driver to actually use the
 * asynchronous properties.  So just to prevent the first implementor of such
 * a thing from breaking logfs in 2350, we do the usual pointless dance to
 * declare a completion variable and wait for completion before returning
 * from mtd_erase().  What an exercise in futility!
 */
static void logfs_erase_callback(struct erase_info *ei)
{
	complete((struct completion *)ei->priv);
}

static int mtd_erase_mapping(struct super_block *sb, loff_t ofs, size_t len)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	struct page *page;
	pgoff_t index;

	for (index = ofs >> PAGE_SHIFT; index < (ofs + len) >> PAGE_SHIFT;
			index++) {
		page = find_get_page(mapping, index);
		if (!page)
			continue;
		memset(page_address(page), 0xFF, PAGE_SIZE);
		page_cache_release(page);
	}
	return 0;
}

static int mtd_erase(struct super_block *sb, loff_t ofs, size_t len)
{
	struct mtd_info *mtd = logfs_super(sb)->s_mtd;
	struct erase_info ei;
	DECLARE_COMPLETION_ONSTACK(complete);
	int ret;

	BUG_ON(len % mtd->erasesize);
	if (logfs_super(sb)->s_flags & LOGFS_SB_FLAG_RO)
		return -EROFS;

	memset(&ei, 0, sizeof(ei));
	ei.mtd = mtd;
	ei.addr = ofs;
	ei.len = len;
	ei.callback = logfs_erase_callback;
	ei.priv = (long)&complete;
	ret = mtd->erase(mtd, &ei);
	if (ret)
		return -EIO;

	wait_for_completion(&complete);
	if (ei.state != MTD_ERASE_DONE)
		return -EIO;
	return mtd_erase_mapping(sb, ofs, len);
}

static void mtd_sync(struct super_block *sb)
{
	struct mtd_info *mtd = logfs_super(sb)->s_mtd;

	if (mtd->sync)
		mtd->sync(mtd);
}

static int mtd_readpage(void *_sb, struct page *page)
{
	struct super_block *sb = _sb;
	int err;

	err = mtd_read(sb, page->index << PAGE_SHIFT, PAGE_SIZE,
			page_address(page));
	if (err == -EUCLEAN) {
		err = 0;
		/* FIXME: force GC this segment */
	}
	if (err) {
		ClearPageUptodate(page);
		SetPageError(page);
	} else {
		SetPageUptodate(page);
		ClearPageError(page);
	}
	unlock_page(page);
	return err;
}

static struct page *mtd_find_first_sb(struct super_block *sb, u64 *ofs)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	filler_t *filler = mtd_readpage;
	struct mtd_info *mtd = super->s_mtd;

	if (!mtd->block_isbad)
		return NULL;

	*ofs = 0;
	while (mtd->block_isbad(mtd, *ofs)) {
		*ofs += mtd->erasesize;
		if (*ofs >= mtd->size)
			return NULL;
	}
	BUG_ON(*ofs & ~PAGE_MASK);
	return read_cache_page(mapping, *ofs >> PAGE_SHIFT, filler, sb);
}

static struct page *mtd_find_last_sb(struct super_block *sb, u64 *ofs)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	filler_t *filler = mtd_readpage;
	struct mtd_info *mtd = super->s_mtd;

	if (!mtd->block_isbad)
		return NULL;

	*ofs = mtd->size - mtd->erasesize;
	while (mtd->block_isbad(mtd, *ofs)) {
		*ofs -= mtd->erasesize;
		if (*ofs <= 0)
			return NULL;
	}
	*ofs = *ofs + mtd->erasesize - 0x1000;
	BUG_ON(*ofs & ~PAGE_MASK);
	return read_cache_page(mapping, *ofs >> PAGE_SHIFT, filler, sb);
}

static int __mtd_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
		size_t nr_pages)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	struct page *page;
	int i, err;

	for (i = 0; i < nr_pages; i++) {
		page = find_lock_page(mapping, index + i);
		BUG_ON(!page);

		err = mtd_write(sb, page->index << PAGE_SHIFT, PAGE_SIZE,
				page_address(page));
		unlock_page(page);
		page_cache_release(page);
		if (err)
			return err;
	}
	return 0;
}

static void mtd_writeseg(struct super_block *sb, u64 ofs, size_t len)
{
	struct logfs_super *super = logfs_super(sb);
	int head;

	if (super->s_flags & LOGFS_SB_FLAG_RO)
		return;

	if (len == 0) {
		/* This can happen when the object fit perfectly into a
		 * segment, the segment gets written per sync and subsequently
		 * closed.
		 */
		return;
	}
	head = ofs & (PAGE_SIZE - 1);
	if (head) {
		ofs -= head;
		len += head;
	}
	len = PAGE_ALIGN(len);
	__mtd_writeseg(sb, ofs, ofs >> PAGE_SHIFT, len >> PAGE_SHIFT);
}

static void mtd_put_device(struct super_block *sb)
{
	put_mtd_device(logfs_super(sb)->s_mtd);
}

static const struct logfs_device_ops mtd_devops = {
	.find_first_sb	= mtd_find_first_sb,
	.find_last_sb	= mtd_find_last_sb,
	.readpage	= mtd_readpage,
	.writeseg	= mtd_writeseg,
	.erase		= mtd_erase,
	.sync		= mtd_sync,
	.put_device	= mtd_put_device,
};

int logfs_get_sb_mtd(struct file_system_type *type, int flags,
		int mtdnr, struct vfsmount *mnt)
{
	struct mtd_info *mtd;
	const struct logfs_device_ops *devops = &mtd_devops;

	mtd = get_mtd_device(NULL, mtdnr);
	return logfs_get_sb_device(type, flags, mtd, NULL, devops, mnt);
}
fs/logfs/dir.c (new file, +818 lines)
/*
 * fs/logfs/dir.c - directory-related code
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 */
#include "logfs.h"


/*
 * Atomic dir operations
 *
 * Directory operations are by default not atomic.  Dentries and Inodes are
 * created/removed/altered in separate operations.  Therefore we need to do
 * a small amount of journaling.
 *
 * Create, link, mkdir, mknod and symlink all share the same function to do
 * the work: __logfs_create.  This function works in two atomic steps:
 * 1. allocate inode (remember in journal)
 * 2. allocate dentry (clear journal)
 *
 * As we can only get interrupted between the two steps, the inode we just
 * created is simply stored in the anchor.  On next mount, if we were
 * interrupted, we delete the inode.  From a user's point of view the
 * operation never happened.
 *
 * Unlink and rmdir also share the same function: unlink.  Again, this
 * function works in two atomic steps:
 * 1. remove dentry (remember inode in journal)
 * 2. unlink inode (clear journal)
 *
 * And again, on the next mount, if we were interrupted, we delete the inode.
 * From a user's point of view the operation succeeded.
 *
 * Rename is the real pain to deal with, harder than all the other methods
 * combined.  Depending on the circumstances we can run into three cases.
 * A "target rename" where the target dentry already existed, a "local
 * rename" where both parent directories are identical or a "cross-directory
 * rename" in the remaining case.
 *
 * Local rename is atomic, as the old dentry is simply rewritten with a new
 * name.
 *
 * Cross-directory rename works in two steps, similar to __logfs_create and
 * logfs_unlink:
 * 1. Write new dentry (remember old dentry in journal)
 * 2. Remove old dentry (clear journal)
 *
 * Here we remember a dentry instead of an inode.  On next mount, if we were
 * interrupted, we delete the dentry.  From a user's point of view, the
 * operation succeeded.
 *
 * Target rename works in three atomic steps:
 * 1. Attach old inode to new dentry (remember old dentry and new inode)
 * 2. Remove old dentry (still remember the new inode)
 * 3. Remove victim inode
 *
 * Here we remember both an inode and a dentry.  If we get interrupted
 * between steps 1 and 2, we delete both the dentry and the inode.  If
 * we get interrupted between steps 2 and 3, we delete just the inode.
 * In either case, the remaining objects are deleted on next mount.  From
 * a user's point of view, the operation succeeded.
 */

static int write_dir(struct inode *dir, struct logfs_disk_dentry *dd,
		loff_t pos)
{
	return logfs_inode_write(dir, dd, sizeof(*dd), pos, WF_LOCK, NULL);
}

static int write_inode(struct inode *inode)
{
	return __logfs_write_inode(inode, WF_LOCK);
}

static s64 dir_seek_data(struct inode *inode, s64 pos)
{
	s64 new_pos = logfs_seek_data(inode, pos);

	return max(pos, new_pos - 1);
}

static int beyond_eof(struct inode *inode, loff_t bix)
{
	loff_t pos = bix << inode->i_sb->s_blocksize_bits;
	return pos >= i_size_read(inode);
}

/*
 * Prime value was chosen to be roughly 256 + 26.  r5 hash uses 11,
 * so short names (len <= 9) don't even occupy the complete 32bit name
 * space.  A prime >256 ensures short names quickly spread the 32bit
 * name space.  Add about 26 for the estimated amount of information
 * of each character and pick a prime nearby, preferably a bit-sparse
 * one.
 */
static u32 hash_32(const char *s, int len, u32 seed)
{
	u32 hash = seed;
	int i;

	for (i = 0; i < len; i++)
		hash = hash * 293 + s[i];
	return hash;
}

/*
 * We have to satisfy several conflicting requirements here.  Small
 * directories should stay fairly compact and not require too many
 * indirect blocks.  The number of possible locations for a given hash
 * should be small to make lookup() fast.  And we should try hard not
 * to overflow the 32bit name space or nfs and 32bit host systems will
 * be unhappy.
 *
 * So we use the following scheme.  First we reduce the hash to 0..15
 * and try a direct block.  If that is occupied we reduce the hash to
 * 16..255 and try an indirect block.  Same for 2x and 3x indirect
 * blocks.  Lastly we reduce the hash to 0x800_0000 .. 0xffff_ffff,
 * but use buckets containing eight entries instead of a single one.
 *
 * Using 16 entries should allow for a reasonable amount of hash
 * collisions, so the 32bit name space can be packed fairly tight
 * before overflowing.  Oh, and currently we don't overflow but return
 * an error.
 *
 * How likely are collisions?  Doing the appropriate math is beyond me
 * and the Bronstein textbook.  But running a test program to brute
 * force collisions for a couple of days showed that on average the
 * first collision occurs after 598M entries, with 290M being the
 * smallest result.  Obviously 21 entries could already cause a
 * collision if all entries are carefully chosen.
 */
static pgoff_t hash_index(u32 hash, int round)
{
	switch (round) {
	case 0:
		return hash % I0_BLOCKS;
	case 1:
		return I0_BLOCKS + hash % (I1_BLOCKS - I0_BLOCKS);
	case 2:
		return I1_BLOCKS + hash % (I2_BLOCKS - I1_BLOCKS);
	case 3:
		return I2_BLOCKS + hash % (I3_BLOCKS - I2_BLOCKS);
	case 4 ... 19:
		return I3_BLOCKS + 16 * (hash % (((1<<31) - I3_BLOCKS) / 16))
			+ round - 4;
	}
	BUG();
}

static struct page *logfs_get_dd_page(struct inode *dir, struct dentry *dentry)
{
	struct qstr *name = &dentry->d_name;
	struct page *page;
	struct logfs_disk_dentry *dd;
	u32 hash = hash_32(name->name, name->len, 0);
	pgoff_t index;
	int round;

	if (name->len > LOGFS_MAX_NAMELEN)
		return ERR_PTR(-ENAMETOOLONG);

	for (round = 0; round < 20; round++) {
		index = hash_index(hash, round);

		if (beyond_eof(dir, index))
			return NULL;
		if (!logfs_exist_block(dir, index))
			continue;
		page = read_cache_page(dir->i_mapping, index,
				(filler_t *)logfs_readpage, NULL);
		if (IS_ERR(page))
			return page;
		dd = kmap_atomic(page, KM_USER0);
		BUG_ON(dd->namelen == 0);

		if (name->len != be16_to_cpu(dd->namelen) ||
				memcmp(name->name, dd->name, name->len)) {
			kunmap_atomic(dd, KM_USER0);
			page_cache_release(page);
			continue;
		}

		kunmap_atomic(dd, KM_USER0);
		return page;
	}
	return NULL;
}

static int logfs_remove_inode(struct inode *inode)
{
	int ret;

	inode->i_nlink--;
	ret = write_inode(inode);
	LOGFS_BUG_ON(ret, inode->i_sb);
	return ret;
}

static void abort_transaction(struct inode
*inode, struct logfs_transaction *ta)
{
	if (logfs_inode(inode)->li_block)
		logfs_inode(inode)->li_block->ta = NULL;
	kfree(ta);
}

static int logfs_unlink(struct inode *dir, struct dentry *dentry)
{
	struct logfs_super *super = logfs_super(dir->i_sb);
	struct inode *inode = dentry->d_inode;
	struct logfs_transaction *ta;
	struct page *page;
	pgoff_t index;
	int ret;

	page = logfs_get_dd_page(dir, dentry);
	if (!page)
		return -ENOENT;
	if (IS_ERR(page))
		return PTR_ERR(page);
	index = page->index;
	page_cache_release(page);

	/* Allocate ta only after the lookups that can fail, so the early
	 * returns above don't leak it. */
	ta = kzalloc(sizeof(*ta), GFP_KERNEL);
	if (!ta)
		return -ENOMEM;

	ta->state = UNLINK_1;
	ta->ino = inode->i_ino;

	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;

	mutex_lock(&super->s_dirop_mutex);
	logfs_add_transaction(dir, ta);

	ret = logfs_delete(dir, index, NULL);
	if (!ret)
		ret = write_inode(dir);

	if (ret) {
		abort_transaction(dir, ta);
		printk(KERN_ERR"LOGFS: unable to delete inode\n");
		goto out;
	}

	ta->state = UNLINK_2;
	logfs_add_transaction(inode, ta);
	ret = logfs_remove_inode(inode);
out:
	mutex_unlock(&super->s_dirop_mutex);
	return ret;
}

static inline int logfs_empty_dir(struct inode *dir)
{
	u64 data;

	data = logfs_seek_data(dir, 0) << dir->i_sb->s_blocksize_bits;
	return data >= i_size_read(dir);
}

static int logfs_rmdir(struct inode *dir, struct dentry *dentry)
{
	struct inode *inode = dentry->d_inode;

	if (!logfs_empty_dir(inode))
		return -ENOTEMPTY;

	return logfs_unlink(dir, dentry);
}

/* FIXME: readdir currently has its own dir_walk code.
I don't see a good274274+ * way to combine the two copies */275275+#define IMPLICIT_NODES 2276276+static int __logfs_readdir(struct file *file, void *buf, filldir_t filldir)277277+{278278+ struct inode *dir = file->f_dentry->d_inode;279279+ loff_t pos = file->f_pos - IMPLICIT_NODES;280280+ struct page *page;281281+ struct logfs_disk_dentry *dd;282282+ int full;283283+284284+ BUG_ON(pos < 0);285285+ for (;; pos++) {286286+ if (beyond_eof(dir, pos))287287+ break;288288+ if (!logfs_exist_block(dir, pos)) {289289+ /* deleted dentry */290290+ pos = dir_seek_data(dir, pos);291291+ continue;292292+ }293293+ page = read_cache_page(dir->i_mapping, pos,294294+ (filler_t *)logfs_readpage, NULL);295295+ if (IS_ERR(page))296296+ return PTR_ERR(page);297297+ dd = kmap_atomic(page, KM_USER0);298298+ BUG_ON(dd->namelen == 0);299299+300300+ full = filldir(buf, (char *)dd->name, be16_to_cpu(dd->namelen),301301+ pos, be64_to_cpu(dd->ino), dd->type);302302+ kunmap_atomic(dd, KM_USER0);303303+ page_cache_release(page);304304+ if (full)305305+ break;306306+ }307307+308308+ file->f_pos = pos + IMPLICIT_NODES;309309+ return 0;310310+}311311+312312+static int logfs_readdir(struct file *file, void *buf, filldir_t filldir)313313+{314314+ struct inode *inode = file->f_dentry->d_inode;315315+ ino_t pino = parent_ino(file->f_dentry);316316+ int err;317317+318318+ if (file->f_pos < 0)319319+ return -EINVAL;320320+321321+ if (file->f_pos == 0) {322322+ if (filldir(buf, ".", 1, 1, inode->i_ino, DT_DIR) < 0)323323+ return 0;324324+ file->f_pos++;325325+ }326326+ if (file->f_pos == 1) {327327+ if (filldir(buf, "..", 2, 2, pino, DT_DIR) < 0)328328+ return 0;329329+ file->f_pos++;330330+ }331331+332332+ err = __logfs_readdir(file, buf, filldir);333333+ return err;334334+}335335+336336+static void logfs_set_name(struct logfs_disk_dentry *dd, struct qstr *name)337337+{338338+ dd->namelen = cpu_to_be16(name->len);339339+ memcpy(dd->name, name->name, name->len);340340+}341341+342342+static struct dentry 
*logfs_lookup(struct inode *dir, struct dentry *dentry,
		struct nameidata *nd)
{
	struct page *page;
	struct logfs_disk_dentry *dd;
	pgoff_t index;
	u64 ino = 0;
	struct inode *inode;

	page = logfs_get_dd_page(dir, dentry);
	if (IS_ERR(page))
		return ERR_CAST(page);
	if (!page) {
		d_add(dentry, NULL);
		return NULL;
	}
	index = page->index;
	dd = kmap_atomic(page, KM_USER0);
	ino = be64_to_cpu(dd->ino);
	kunmap_atomic(dd, KM_USER0);
	page_cache_release(page);

	inode = logfs_iget(dir->i_sb, ino);
	if (IS_ERR(inode)) {
		printk(KERN_ERR"LogFS: Cannot read inode #%llx for dentry (%lx, %lx)\n",
				ino, dir->i_ino, index);
		return ERR_CAST(inode);
	}
	return d_splice_alias(inode, dentry);
}

static void grow_dir(struct inode *dir, loff_t index)
{
	index = (index + 1) << dir->i_sb->s_blocksize_bits;
	if (i_size_read(dir) < index)
		i_size_write(dir, index);
}

static int logfs_write_dir(struct inode *dir, struct dentry *dentry,
		struct inode *inode)
{
	struct page *page;
	struct logfs_disk_dentry *dd;
	u32 hash = hash_32(dentry->d_name.name, dentry->d_name.len, 0);
	pgoff_t index;
	int round, err;

	for (round = 0; round < 20; round++) {
		index = hash_index(hash, round);

		if (logfs_exist_block(dir, index))
			continue;
		page = find_or_create_page(dir->i_mapping, index, GFP_KERNEL);
		if (!page)
			return -ENOMEM;

		dd = kmap_atomic(page, KM_USER0);
		memset(dd, 0, sizeof(*dd));
		dd->ino = cpu_to_be64(inode->i_ino);
		dd->type = logfs_type(inode);
		logfs_set_name(dd, &dentry->d_name);
		kunmap_atomic(dd, KM_USER0);

		err = logfs_write_buf(dir, page, WF_LOCK);
		unlock_page(page);
page_cache_release(page);408408+ if (!err)409409+ grow_dir(dir, index);410410+ return err;411411+ }412412+ /* FIXME: Is there a better return value? In most cases neither413413+ * the filesystem nor the directory are full. But we have had414414+ * too many collisions for this particular hash and no fallback.415415+ */416416+ return -ENOSPC;417417+}418418+419419+static int __logfs_create(struct inode *dir, struct dentry *dentry,420420+ struct inode *inode, const char *dest, long destlen)421421+{422422+ struct logfs_super *super = logfs_super(dir->i_sb);423423+ struct logfs_inode *li = logfs_inode(inode);424424+ struct logfs_transaction *ta;425425+ int ret;426426+427427+ ta = kzalloc(sizeof(*ta), GFP_KERNEL);428428+ if (!ta)429429+ return -ENOMEM;430430+431431+ ta->state = CREATE_1;432432+ ta->ino = inode->i_ino;433433+ mutex_lock(&super->s_dirop_mutex);434434+ logfs_add_transaction(inode, ta);435435+436436+ if (dest) {437437+ /* symlink */438438+ ret = logfs_inode_write(inode, dest, destlen, 0, WF_LOCK, NULL);439439+ if (!ret)440440+ ret = write_inode(inode);441441+ } else {442442+ /* creat/mkdir/mknod */443443+ ret = write_inode(inode);444444+ }445445+ if (ret) {446446+ abort_transaction(inode, ta);447447+ li->li_flags |= LOGFS_IF_STILLBORN;448448+ /* FIXME: truncate symlink */449449+ inode->i_nlink--;450450+ iput(inode);451451+ goto out;452452+ }453453+454454+ ta->state = CREATE_2;455455+ logfs_add_transaction(dir, ta);456456+ ret = logfs_write_dir(dir, dentry, inode);457457+ /* sync directory */458458+ if (!ret)459459+ ret = write_inode(dir);460460+461461+ if (ret) {462462+ logfs_del_transaction(dir, ta);463463+ ta->state = CREATE_2;464464+ logfs_add_transaction(inode, ta);465465+ logfs_remove_inode(inode);466466+ iput(inode);467467+ goto out;468468+ }469469+ d_instantiate(dentry, inode);470470+out:471471+ mutex_unlock(&super->s_dirop_mutex);472472+ return ret;473473+}474474+475475+static int logfs_mkdir(struct inode *dir, struct dentry *dentry, int 
mode)476476+{477477+ struct inode *inode;478478+479479+ /*480480+ * FIXME: why do we have to fill in S_IFDIR, while the mode is481481+ * correct for mknod, creat, etc.? Smells like the vfs *should*482482+ * do it for us but for some reason fails to do so.483483+ */484484+ inode = logfs_new_inode(dir, S_IFDIR | mode);485485+ if (IS_ERR(inode))486486+ return PTR_ERR(inode);487487+488488+ inode->i_op = &logfs_dir_iops;489489+ inode->i_fop = &logfs_dir_fops;490490+491491+ return __logfs_create(dir, dentry, inode, NULL, 0);492492+}493493+494494+static int logfs_create(struct inode *dir, struct dentry *dentry, int mode,495495+ struct nameidata *nd)496496+{497497+ struct inode *inode;498498+499499+ inode = logfs_new_inode(dir, mode);500500+ if (IS_ERR(inode))501501+ return PTR_ERR(inode);502502+503503+ inode->i_op = &logfs_reg_iops;504504+ inode->i_fop = &logfs_reg_fops;505505+ inode->i_mapping->a_ops = &logfs_reg_aops;506506+507507+ return __logfs_create(dir, dentry, inode, NULL, 0);508508+}509509+510510+static int logfs_mknod(struct inode *dir, struct dentry *dentry, int mode,511511+ dev_t rdev)512512+{513513+ struct inode *inode;514514+515515+ if (dentry->d_name.len > LOGFS_MAX_NAMELEN)516516+ return -ENAMETOOLONG;517517+518518+ inode = logfs_new_inode(dir, mode);519519+ if (IS_ERR(inode))520520+ return PTR_ERR(inode);521521+522522+ init_special_inode(inode, mode, rdev);523523+524524+ return __logfs_create(dir, dentry, inode, NULL, 0);525525+}526526+527527+static int logfs_symlink(struct inode *dir, struct dentry *dentry,528528+ const char *target)529529+{530530+ struct inode *inode;531531+ size_t destlen = strlen(target) + 1;532532+533533+ if (destlen > dir->i_sb->s_blocksize)534534+ return -ENAMETOOLONG;535535+536536+ inode = logfs_new_inode(dir, S_IFLNK | 0777);537537+ if (IS_ERR(inode))538538+ return PTR_ERR(inode);539539+540540+ inode->i_op = &logfs_symlink_iops;541541+ inode->i_mapping->a_ops = &logfs_reg_aops;542542+543543+ return __logfs_create(dir, dentry, 
inode, target, destlen);544544+}545545+546546+static int logfs_permission(struct inode *inode, int mask)547547+{548548+ return generic_permission(inode, mask, NULL);549549+}550550+551551+static int logfs_link(struct dentry *old_dentry, struct inode *dir,552552+ struct dentry *dentry)553553+{554554+ struct inode *inode = old_dentry->d_inode;555555+556556+ if (inode->i_nlink >= LOGFS_LINK_MAX)557557+ return -EMLINK;558558+559559+ inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;560560+ atomic_inc(&inode->i_count);561561+ inode->i_nlink++;562562+ mark_inode_dirty_sync(inode);563563+564564+ return __logfs_create(dir, dentry, inode, NULL, 0);565565+}566566+567567+static int logfs_get_dd(struct inode *dir, struct dentry *dentry,568568+ struct logfs_disk_dentry *dd, loff_t *pos)569569+{570570+ struct page *page;571571+ void *map;572572+573573+ page = logfs_get_dd_page(dir, dentry);574574+ if (IS_ERR(page))575575+ return PTR_ERR(page);576576+ *pos = page->index;577577+ map = kmap_atomic(page, KM_USER0);578578+ memcpy(dd, map, sizeof(*dd));579579+ kunmap_atomic(map, KM_USER0);580580+ page_cache_release(page);581581+ return 0;582582+}583583+584584+static int logfs_delete_dd(struct inode *dir, loff_t pos)585585+{586586+ /*587587+ * Getting called with pos somewhere beyond eof is either a goofup588588+ * within this file or means someone maliciously edited the589589+ * (crc-protected) journal.590590+ */591591+ BUG_ON(beyond_eof(dir, pos));592592+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;593593+ log_dir(" Delete dentry (%lx, %llx)\n", dir->i_ino, pos);594594+ return logfs_delete(dir, pos, NULL);595595+}596596+597597+/*598598+ * Cross-directory rename, target does not exist. 
Just a little nasty.599599+ * Create a new dentry in the target dir, then remove the old dentry,600600+ * all the while taking care to remember our operation in the journal.601601+ */602602+static int logfs_rename_cross(struct inode *old_dir, struct dentry *old_dentry,603603+ struct inode *new_dir, struct dentry *new_dentry)604604+{605605+ struct logfs_super *super = logfs_super(old_dir->i_sb);606606+ struct logfs_disk_dentry dd;607607+ struct logfs_transaction *ta;608608+ loff_t pos;609609+ int err;610610+611611+ /* 1. locate source dd */612612+ err = logfs_get_dd(old_dir, old_dentry, &dd, &pos);613613+ if (err)614614+ return err;615615+616616+ ta = kzalloc(sizeof(*ta), GFP_KERNEL);617617+ if (!ta)618618+ return -ENOMEM;619619+620620+ ta->state = CROSS_RENAME_1;621621+ ta->dir = old_dir->i_ino;622622+ ta->pos = pos;623623+624624+ /* 2. write target dd */625625+ mutex_lock(&super->s_dirop_mutex);626626+ logfs_add_transaction(new_dir, ta);627627+ err = logfs_write_dir(new_dir, new_dentry, old_dentry->d_inode);628628+ if (!err)629629+ err = write_inode(new_dir);630630+631631+ if (err) {632632+ super->s_rename_dir = 0;633633+ super->s_rename_pos = 0;634634+ abort_transaction(new_dir, ta);635635+ goto out;636636+ }637637+638638+ /* 3. 
remove source dd */639639+ ta->state = CROSS_RENAME_2;640640+ logfs_add_transaction(old_dir, ta);641641+ err = logfs_delete_dd(old_dir, pos);642642+ if (!err)643643+ err = write_inode(old_dir);644644+ LOGFS_BUG_ON(err, old_dir->i_sb);645645+out:646646+ mutex_unlock(&super->s_dirop_mutex);647647+ return err;648648+}649649+650650+static int logfs_replace_inode(struct inode *dir, struct dentry *dentry,651651+ struct logfs_disk_dentry *dd, struct inode *inode)652652+{653653+ loff_t pos;654654+ int err;655655+656656+ err = logfs_get_dd(dir, dentry, dd, &pos);657657+ if (err)658658+ return err;659659+ dd->ino = cpu_to_be64(inode->i_ino);660660+ dd->type = logfs_type(inode);661661+662662+ err = write_dir(dir, dd, pos);663663+ if (err)664664+ return err;665665+ log_dir("Replace dentry (%lx, %llx) %s -> %llx\n", dir->i_ino, pos,666666+ dd->name, be64_to_cpu(dd->ino));667667+ return write_inode(dir);668668+}669669+670670+/* Target dentry exists - the worst case. We need to attach the source671671+ * inode to the target dentry, then remove the orphaned target inode and672672+ * source dentry.673673+ */674674+static int logfs_rename_target(struct inode *old_dir, struct dentry *old_dentry,675675+ struct inode *new_dir, struct dentry *new_dentry)676676+{677677+ struct logfs_super *super = logfs_super(old_dir->i_sb);678678+ struct inode *old_inode = old_dentry->d_inode;679679+ struct inode *new_inode = new_dentry->d_inode;680680+ int isdir = S_ISDIR(old_inode->i_mode);681681+ struct logfs_disk_dentry dd;682682+ struct logfs_transaction *ta;683683+ loff_t pos;684684+ int err;685685+686686+ BUG_ON(isdir != S_ISDIR(new_inode->i_mode));687687+ if (isdir) {688688+ if (!logfs_empty_dir(new_inode))689689+ return -ENOTEMPTY;690690+ }691691+692692+ /* 1. 
locate source dd */693693+ err = logfs_get_dd(old_dir, old_dentry, &dd, &pos);694694+ if (err)695695+ return err;696696+697697+ ta = kzalloc(sizeof(*ta), GFP_KERNEL);698698+ if (!ta)699699+ return -ENOMEM;700700+701701+ ta->state = TARGET_RENAME_1;702702+ ta->dir = old_dir->i_ino;703703+ ta->pos = pos;704704+ ta->ino = new_inode->i_ino;705705+706706+ /* 2. attach source inode to target dd */707707+ mutex_lock(&super->s_dirop_mutex);708708+ logfs_add_transaction(new_dir, ta);709709+ err = logfs_replace_inode(new_dir, new_dentry, &dd, old_inode);710710+ if (err) {711711+ super->s_rename_dir = 0;712712+ super->s_rename_pos = 0;713713+ super->s_victim_ino = 0;714714+ abort_transaction(new_dir, ta);715715+ goto out;716716+ }717717+718718+ /* 3. remove source dd */719719+ ta->state = TARGET_RENAME_2;720720+ logfs_add_transaction(old_dir, ta);721721+ err = logfs_delete_dd(old_dir, pos);722722+ if (!err)723723+ err = write_inode(old_dir);724724+ LOGFS_BUG_ON(err, old_dir->i_sb);725725+726726+ /* 4. remove target inode */727727+ ta->state = TARGET_RENAME_3;728728+ logfs_add_transaction(new_inode, ta);729729+ err = logfs_remove_inode(new_inode);730730+731731+out:732732+ mutex_unlock(&super->s_dirop_mutex);733733+ return err;734734+}735735+736736+static int logfs_rename(struct inode *old_dir, struct dentry *old_dentry,737737+ struct inode *new_dir, struct dentry *new_dentry)738738+{739739+ if (new_dentry->d_inode)740740+ return logfs_rename_target(old_dir, old_dentry,741741+ new_dir, new_dentry);742742+ return logfs_rename_cross(old_dir, old_dentry, new_dir, new_dentry);743743+}744744+745745+/* No locking done here, as this is called before .get_sb() returns. 
*/746746+int logfs_replay_journal(struct super_block *sb)747747+{748748+ struct logfs_super *super = logfs_super(sb);749749+ struct inode *inode;750750+ u64 ino, pos;751751+ int err;752752+753753+ if (super->s_victim_ino) {754754+ /* delete victim inode */755755+ ino = super->s_victim_ino;756756+ printk(KERN_INFO"LogFS: delete unmapped inode #%llx\n", ino);757757+ inode = logfs_iget(sb, ino);758758+ if (IS_ERR(inode))759759+ goto fail;760760+761761+ LOGFS_BUG_ON(i_size_read(inode) > 0, sb);762762+ super->s_victim_ino = 0;763763+ err = logfs_remove_inode(inode);764764+ iput(inode);765765+ if (err) {766766+ super->s_victim_ino = ino;767767+ goto fail;768768+ }769769+ }770770+ if (super->s_rename_dir) {771771+ /* delete old dd from rename */772772+ ino = super->s_rename_dir;773773+ pos = super->s_rename_pos;774774+ printk(KERN_INFO"LogFS: delete unbacked dentry (%llx, %llx)\n",775775+ ino, pos);776776+ inode = logfs_iget(sb, ino);777777+ if (IS_ERR(inode))778778+ goto fail;779779+780780+ super->s_rename_dir = 0;781781+ super->s_rename_pos = 0;782782+ err = logfs_delete_dd(inode, pos);783783+ iput(inode);784784+ if (err) {785785+ super->s_rename_dir = ino;786786+ super->s_rename_pos = pos;787787+ goto fail;788788+ }789789+ }790790+ return 0;791791+fail:792792+ LOGFS_BUG(sb);793793+ return -EIO;794794+}795795+796796+const struct inode_operations logfs_symlink_iops = {797797+ .readlink = generic_readlink,798798+ .follow_link = page_follow_link_light,799799+};800800+801801+const struct inode_operations logfs_dir_iops = {802802+ .create = logfs_create,803803+ .link = logfs_link,804804+ .lookup = logfs_lookup,805805+ .mkdir = logfs_mkdir,806806+ .mknod = logfs_mknod,807807+ .rename = logfs_rename,808808+ .rmdir = logfs_rmdir,809809+ .permission = logfs_permission,810810+ .symlink = logfs_symlink,811811+ .unlink = logfs_unlink,812812+};813813+const struct file_operations logfs_dir_fops = {814814+ .fsync = logfs_fsync,815815+ .ioctl = logfs_ioctl,816816+ .readdir = 
logfs_readdir,817817+ .read = generic_read_dir,818818+};
+263
fs/logfs/file.c
···
/*
 * fs/logfs/file.c	- prepare_write, commit_write and friends
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 */
#include "logfs.h"
#include <linux/sched.h>
#include <linux/writeback.h>

static int logfs_write_begin(struct file *file, struct address_space *mapping,
		loff_t pos, unsigned len, unsigned flags,
		struct page **pagep, void **fsdata)
{
	struct inode *inode = mapping->host;
	struct page *page;
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;
	*pagep = page;

	if ((len == PAGE_CACHE_SIZE) || PageUptodate(page))
		return 0;
	if ((pos & PAGE_CACHE_MASK) >= i_size_read(inode)) {
		unsigned start = pos & (PAGE_CACHE_SIZE - 1);
		unsigned end = start + len;

		/* Reading beyond i_size is simple: memset to zero */
		zero_user_segments(page, 0, start, end, PAGE_CACHE_SIZE);
		return 0;
	}
	return logfs_readpage_nolock(page);
}

static int logfs_write_end(struct file *file, struct address_space *mapping,
		loff_t pos, unsigned len, unsigned copied, struct page *page,
		void *fsdata)
{
	struct inode *inode = mapping->host;
	pgoff_t index = page->index;
	unsigned start = pos & (PAGE_CACHE_SIZE - 1);
	unsigned end = start + copied;
	int ret = 0;

	BUG_ON(PAGE_CACHE_SIZE != inode->i_sb->s_blocksize);
	BUG_ON(page->index > I3_BLOCKS);

	if (copied < len) {
		/*
		 * Short write of a non-initialized page.  Just tell userspace
		 * to retry the entire page.
		 */
		if (!PageUptodate(page)) {
			copied = 0;
			goto out;
		}
	}
	if (copied == 0)
		goto out;	/* FIXME: do we need to update inode?
*/6363+6464+ if (i_size_read(inode) < (index << PAGE_CACHE_SHIFT) + end) {6565+ i_size_write(inode, (index << PAGE_CACHE_SHIFT) + end);6666+ mark_inode_dirty_sync(inode);6767+ }6868+6969+ SetPageUptodate(page);7070+ if (!PageDirty(page)) {7171+ if (!get_page_reserve(inode, page))7272+ __set_page_dirty_nobuffers(page);7373+ else7474+ ret = logfs_write_buf(inode, page, WF_LOCK);7575+ }7676+out:7777+ unlock_page(page);7878+ page_cache_release(page);7979+ return ret ? ret : copied;8080+}8181+8282+int logfs_readpage(struct file *file, struct page *page)8383+{8484+ int ret;8585+8686+ ret = logfs_readpage_nolock(page);8787+ unlock_page(page);8888+ return ret;8989+}9090+9191+/* Clear the page's dirty flag in the radix tree. */9292+/* TODO: mucking with PageWriteback is silly. Add a generic function to clear9393+ * the dirty bit from the radix tree for filesystems that don't have to wait9494+ * for page writeback to finish (i.e. any compressing filesystem).9595+ */9696+static void clear_radix_tree_dirty(struct page *page)9797+{9898+ BUG_ON(PagePrivate(page) || page->private);9999+ set_page_writeback(page);100100+ end_page_writeback(page);101101+}102102+103103+static int __logfs_writepage(struct page *page)104104+{105105+ struct inode *inode = page->mapping->host;106106+ int err;107107+108108+ err = logfs_write_buf(inode, page, WF_LOCK);109109+ if (err)110110+ set_page_dirty(page);111111+ else112112+ clear_radix_tree_dirty(page);113113+ unlock_page(page);114114+ return err;115115+}116116+117117+static int logfs_writepage(struct page *page, struct writeback_control *wbc)118118+{119119+ struct inode *inode = page->mapping->host;120120+ loff_t i_size = i_size_read(inode);121121+ pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;122122+ unsigned offset;123123+ u64 bix;124124+ level_t level;125125+126126+ log_file("logfs_writepage(%lx, %lx, %p)\n", inode->i_ino, page->index,127127+ page);128128+129129+ logfs_unpack_index(page->index, &bix, &level);130130+131131+ /* Indirect blocks 
are never truncated */
	if (level != 0)
		return __logfs_writepage(page);

	/*
	 * TODO: everything below is a near-verbatim copy of nobh_writepage().
	 * The relevant bits should be factored out after logfs is merged.
	 */

	/* Is the page fully inside i_size? */
	if (bix < end_index)
		return __logfs_writepage(page);

	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_CACHE_SIZE-1);
	if (bix > end_index || offset == 0) {
		unlock_page(page);
		return 0; /* don't care */
	}

	/*
	 * The page straddles i_size.  It must be zeroed out on each and every
	 * writepage invocation because it may be mmapped.  "A file is mapped
	 * in multiples of the page size.  For a file that is not a multiple of
	 * the page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */
	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
	return __logfs_writepage(page);
}

static void logfs_invalidatepage(struct page *page, unsigned long offset)
{
	move_page_to_btree(page);
	BUG_ON(PagePrivate(page) || page->private);
}

static int logfs_releasepage(struct page *page, gfp_t only_xfs_uses_this)
{
	return 0;	/* None of these are easy to release */
}


int logfs_ioctl(struct inode *inode, struct file *file, unsigned int cmd,
		unsigned long arg)
{
	struct logfs_inode *li = logfs_inode(inode);
	unsigned int oldflags, flags;
	int err;

	switch (cmd) {
	case FS_IOC_GETFLAGS:
		flags = li->li_flags & LOGFS_FL_USER_VISIBLE;
		return put_user(flags, (int __user *)arg);
	case FS_IOC_SETFLAGS:
		if (IS_RDONLY(inode))
			return -EROFS;

		if (!is_owner_or_cap(inode))
			return
-EACCES;191191+192192+ err = get_user(flags, (int __user *)arg);193193+ if (err)194194+ return err;195195+196196+ mutex_lock(&inode->i_mutex);197197+ oldflags = li->li_flags;198198+ flags &= LOGFS_FL_USER_MODIFIABLE;199199+ flags |= oldflags & ~LOGFS_FL_USER_MODIFIABLE;200200+ li->li_flags = flags;201201+ mutex_unlock(&inode->i_mutex);202202+203203+ inode->i_ctime = CURRENT_TIME;204204+ mark_inode_dirty_sync(inode);205205+ return 0;206206+207207+ default:208208+ return -ENOTTY;209209+ }210210+}211211+212212+int logfs_fsync(struct file *file, struct dentry *dentry, int datasync)213213+{214214+ struct super_block *sb = dentry->d_inode->i_sb;215215+ struct logfs_super *super = logfs_super(sb);216216+217217+ /* FIXME: write anchor */218218+ super->s_devops->sync(sb);219219+ return 0;220220+}221221+222222+static int logfs_setattr(struct dentry *dentry, struct iattr *attr)223223+{224224+ struct inode *inode = dentry->d_inode;225225+ int err = 0;226226+227227+ if (attr->ia_valid & ATTR_SIZE)228228+ err = logfs_truncate(inode, attr->ia_size);229229+ attr->ia_valid &= ~ATTR_SIZE;230230+231231+ if (!err)232232+ err = inode_change_ok(inode, attr);233233+ if (!err)234234+ err = inode_setattr(inode, attr);235235+ return err;236236+}237237+238238+const struct inode_operations logfs_reg_iops = {239239+ .setattr = logfs_setattr,240240+};241241+242242+const struct file_operations logfs_reg_fops = {243243+ .aio_read = generic_file_aio_read,244244+ .aio_write = generic_file_aio_write,245245+ .fsync = logfs_fsync,246246+ .ioctl = logfs_ioctl,247247+ .llseek = generic_file_llseek,248248+ .mmap = generic_file_readonly_mmap,249249+ .open = generic_file_open,250250+ .read = do_sync_read,251251+ .write = do_sync_write,252252+};253253+254254+const struct address_space_operations logfs_reg_aops = {255255+ .invalidatepage = logfs_invalidatepage,256256+ .readpage = logfs_readpage,257257+ .releasepage = logfs_releasepage,258258+ .set_page_dirty = __set_page_dirty_nobuffers,259259+ .writepage = 
logfs_writepage,260260+ .writepages = generic_writepages,261261+ .write_begin = logfs_write_begin,262262+ .write_end = logfs_write_end,263263+};
+730
fs/logfs/gc.c
/*
 * fs/logfs/gc.c - garbage collection code
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 */
#include "logfs.h"
#include <linux/sched.h>

/*
 * Wear leveling needs to kick in when the difference between low erase
 * counts and high erase counts gets too big. A good value for "too big"
 * may be somewhat below 10% of maximum erase count for the device.
 * Why not 397, to pick a nice round number with no specific meaning? :)
 *
 * WL_RATELIMIT is the minimum time between two wear level events. A huge
 * number of segments may fulfil the requirements for wear leveling at the
 * same time. If that happens we don't want to cause a latency from hell,
 * but just gently pick one segment every so often and minimize overhead.
 */
#define WL_DELTA 397
#define WL_RATELIMIT 100
#define MAX_OBJ_ALIASES	2600
#define SCAN_RATIO 512	/* number of scanned segments per gc'd segment */
#define LIST_SIZE 64	/* base size of candidate lists */
#define SCAN_ROUNDS 128	/* maximum number of complete medium scans */
#define SCAN_ROUNDS_HIGH 4 /* maximum number of higher-level scans */

static int no_free_segments(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);

	return super->s_free_list.count;
}

/* journal has distance -1, top-most ifile layer distance 0 */
static u8 root_distance(struct super_block *sb, gc_level_t __gc_level)
{
	struct logfs_super *super = logfs_super(sb);
	u8 gc_level = (__force u8)__gc_level;

	switch (gc_level) {
	case 0: /* fall through */
	case 1: /* fall through */
	case 2: /* fall through */
	case 3:
		/* file data or indirect blocks */
		return super->s_ifile_levels + super->s_iblock_levels - gc_level;
	case 6: /* fall through */
	case 7: /* fall through */
	case 8: /* fall through */
	case 9:
		/* inode file data or indirect blocks */
		return super->s_ifile_levels - (gc_level - 6);
	default:
		printk(KERN_ERR "LOGFS: segment of unknown level %x found\n",
				gc_level);
		WARN_ON(1);
		return super->s_ifile_levels + super->s_iblock_levels;
	}
}

static int segment_is_reserved(struct super_block *sb, u32 segno)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_area *area;
	void *reserved;
	int i;

	/* Some segments are reserved. Just pretend they were all valid */
	reserved = btree_lookup32(&super->s_reserved_segments, segno);
	if (reserved)
		return 1;

	/* Currently open segments */
	for_each_area(i) {
		area = super->s_area[i];
		if (area->a_is_open && area->a_segno == segno)
			return 1;
	}

	return 0;
}

static void logfs_mark_segment_bad(struct super_block *sb, u32 segno)
{
	BUG();
}

/*
 * Returns the bytes consumed by valid objects in this segment. Object headers
 * are counted, the segment header is not.
 */
static u32 logfs_valid_bytes(struct super_block *sb, u32 segno, u32 *ec,
		gc_level_t *gc_level)
{
	struct logfs_segment_entry se;
	u32 ec_level;

	logfs_get_segment_entry(sb, segno, &se);
	if (se.ec_level == cpu_to_be32(BADSEG) ||
			se.valid == cpu_to_be32(RESERVED))
		return RESERVED;

	ec_level = be32_to_cpu(se.ec_level);
	*ec = ec_level >> 4;
	*gc_level = GC_LEVEL(ec_level & 0xf);
	return be32_to_cpu(se.valid);
}

static void logfs_cleanse_block(struct super_block *sb, u64 ofs, u64 ino,
		u64 bix, gc_level_t gc_level)
{
	struct inode *inode;
	int err, cookie;

	inode = logfs_safe_iget(sb, ino, &cookie);
	err = logfs_rewrite_block(inode, bix, ofs, gc_level, 0);
	BUG_ON(err);
	logfs_safe_iput(inode, cookie);
}

static u32 logfs_gc_segment(struct super_block *sb, u32 segno, u8 dist)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_segment_header sh;
	struct logfs_object_header oh;
	u64 ofs, ino, bix;
	u32 seg_ofs, logical_segno, cleaned = 0;
	int err, len, valid;
	gc_level_t gc_level;

	LOGFS_BUG_ON(segment_is_reserved(sb, segno), sb);

	btree_insert32(&super->s_reserved_segments, segno, (void *)1, GFP_NOFS);
	err = wbuf_read(sb, dev_ofs(sb, segno, 0), sizeof(sh), &sh);
	BUG_ON(err);
	gc_level = GC_LEVEL(sh.level);
	logical_segno = be32_to_cpu(sh.segno);
	if (sh.crc != logfs_crc32(&sh, sizeof(sh), 4)) {
		logfs_mark_segment_bad(sb, segno);
		cleaned = -1;
		goto out;
	}

	for (seg_ofs = LOGFS_SEGMENT_HEADERSIZE;
			seg_ofs + sizeof(oh) < super->s_segsize; ) {
		ofs = dev_ofs(sb, logical_segno, seg_ofs);
		err = wbuf_read(sb, dev_ofs(sb, segno, seg_ofs), sizeof(oh),
				&oh);
		BUG_ON(err);

		if (!memchr_inv(&oh, 0xff, sizeof(oh)))
			break;

		if (oh.crc != logfs_crc32(&oh, sizeof(oh) - 4, 4)) {
			logfs_mark_segment_bad(sb, segno);
			cleaned = super->s_segsize - 1;
			goto out;
		}

		ino = be64_to_cpu(oh.ino);
		bix = be64_to_cpu(oh.bix);
		len = sizeof(oh) + be16_to_cpu(oh.len);
		valid = logfs_is_valid_block(sb, ofs, ino, bix, gc_level);
		if (valid == 1) {
			logfs_cleanse_block(sb, ofs, ino, bix, gc_level);
			cleaned += len;
		} else if (valid == 2) {
			/* Will be invalid upon journal commit */
			cleaned += len;
		}
		seg_ofs += len;
	}
out:
	btree_remove32(&super->s_reserved_segments, segno);
	return cleaned;
}

static struct gc_candidate *add_list(struct gc_candidate *cand,
		struct candidate_list *list)
{
	struct rb_node **p = &list->rb_tree.rb_node;
	struct rb_node *parent = NULL;
	struct gc_candidate *cur;
	int comp;

	cand->list = list;
	while (*p) {
		parent = *p;
		cur = rb_entry(parent, struct gc_candidate, rb_node);

		if (list->sort_by_ec)
			comp = cand->erase_count < cur->erase_count;
		else
			comp = cand->valid < cur->valid;

		if (comp)
			p = &parent->rb_left;
		else
			p = &parent->rb_right;
	}
	rb_link_node(&cand->rb_node, parent, p);
	rb_insert_color(&cand->rb_node, &list->rb_tree);

	if (list->count <= list->maxcount) {
		list->count++;
		return NULL;
	}
	cand = rb_entry(rb_last(&list->rb_tree), struct gc_candidate, rb_node);
	rb_erase(&cand->rb_node, &list->rb_tree);
	cand->list = NULL;
	return cand;
}

static void remove_from_list(struct gc_candidate *cand)
{
	struct candidate_list *list = cand->list;

	rb_erase(&cand->rb_node, &list->rb_tree);
	list->count--;
}

static void free_candidate(struct super_block *sb, struct gc_candidate *cand)
{
	struct logfs_super *super = logfs_super(sb);

	btree_remove32(&super->s_cand_tree, cand->segno);
	kfree(cand);
}

u32 get_best_cand(struct super_block *sb, struct candidate_list *list, u32 *ec)
{
	struct gc_candidate *cand;
	u32 segno;

	BUG_ON(list->count == 0);

	cand = rb_entry(rb_first(&list->rb_tree), struct gc_candidate, rb_node);
	remove_from_list(cand);
	segno = cand->segno;
	if (ec)
		*ec = cand->erase_count;
	free_candidate(sb, cand);
	return segno;
}

/*
 * We have several lists to manage segments with. The reserve_list is used to
 * deal with bad blocks. We try to keep the best (lowest ec) segments on this
 * list.
 * The free_list contains free segments for normal usage. It usually gets the
 * second pick after the reserve_list. But when the free_list is running short
 * it is more important to keep the free_list full than to keep a reserve.
 *
 * Segments that are not free are put onto a per-level low_list. If we have
 * to run garbage collection, we pick a candidate from there. All segments on
 * those lists should have at least some free space so GC will make progress.
 *
 * And last we have the ec_list, which is used to pick segments for wear
 * leveling.
 *
 * If all appropriate lists are full, we simply free the candidate and forget
 * about that segment for a while. We have better candidates for each purpose.
 */
static void __add_candidate(struct super_block *sb, struct gc_candidate *cand)
{
	struct logfs_super *super = logfs_super(sb);
	u32 full = super->s_segsize - LOGFS_SEGMENT_RESERVE;

	if (cand->valid == 0) {
		/* 100% free segments */
		log_gc_noisy("add reserve segment %x (ec %x) at %llx\n",
				cand->segno, cand->erase_count,
				dev_ofs(sb, cand->segno, 0));
		cand = add_list(cand, &super->s_reserve_list);
		if (cand) {
			log_gc_noisy("add free segment %x (ec %x) at %llx\n",
					cand->segno, cand->erase_count,
					dev_ofs(sb, cand->segno, 0));
			cand = add_list(cand, &super->s_free_list);
		}
	} else {
		/* good candidates for Garbage Collection */
		if (cand->valid < full)
			cand = add_list(cand, &super->s_low_list[cand->dist]);
		/* good candidates for wear leveling,
		 * segments that were recently written get ignored */
		if (cand)
			cand = add_list(cand, &super->s_ec_list);
	}
	if (cand)
		free_candidate(sb, cand);
}

static int add_candidate(struct super_block *sb, u32 segno, u32 valid, u32 ec,
		u8 dist)
{
	struct logfs_super *super = logfs_super(sb);
	struct gc_candidate *cand;

	cand = kmalloc(sizeof(*cand), GFP_NOFS);
	if (!cand)
		return -ENOMEM;

	cand->segno = segno;
	cand->valid = valid;
	cand->erase_count = ec;
	cand->dist = dist;

	btree_insert32(&super->s_cand_tree, segno, cand, GFP_NOFS);
	__add_candidate(sb, cand);
	return 0;
}

static void remove_segment_from_lists(struct super_block *sb, u32 segno)
{
	struct logfs_super *super = logfs_super(sb);
	struct gc_candidate *cand;

	cand = btree_lookup32(&super->s_cand_tree, segno);
	if (cand) {
		remove_from_list(cand);
		free_candidate(sb, cand);
	}
}

static void scan_segment(struct super_block *sb, u32 segno)
{
	u32 valid, ec = 0;
	gc_level_t gc_level = 0;
	u8 dist;

	if (segment_is_reserved(sb, segno))
		return;

	remove_segment_from_lists(sb, segno);
	valid = logfs_valid_bytes(sb, segno, &ec, &gc_level);
	if (valid == RESERVED)
		return;

	dist = root_distance(sb, gc_level);
	add_candidate(sb, segno, valid, ec, dist);
}

static struct gc_candidate *first_in_list(struct candidate_list *list)
{
	if (list->count == 0)
		return NULL;
	return rb_entry(rb_first(&list->rb_tree), struct gc_candidate, rb_node);
}

/*
 * Find the best segment for garbage collection. Main criterion is
 * the segment requiring the least effort to clean. Secondary
 * criterion is to GC on the lowest level available.
 *
 * So we search the least effort segment on the lowest level first,
 * then move up and pick another segment iff it requires significantly
 * less effort. Hence the LOGFS_MAX_OBJECTSIZE in the comparison.
 */
static struct gc_candidate *get_candidate(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	int i, max_dist;
	struct gc_candidate *cand = NULL, *this;

	max_dist = min(no_free_segments(sb), LOGFS_NO_AREAS);

	for (i = max_dist; i >= 0; i--) {
		this = first_in_list(&super->s_low_list[i]);
		if (!this)
			continue;
		if (!cand)
			cand = this;
		if (this->valid + LOGFS_MAX_OBJECTSIZE <= cand->valid)
			cand = this;
	}
	return cand;
}

static int __logfs_gc_once(struct super_block *sb, struct gc_candidate *cand)
{
	struct logfs_super *super = logfs_super(sb);
	gc_level_t gc_level;
	u32 cleaned, valid, segno, ec;
	u8 dist;

	if (!cand) {
		log_gc("GC attempted, but no candidate found\n");
		return 0;
	}

	segno = cand->segno;
	dist = cand->dist;
	valid = logfs_valid_bytes(sb, segno, &ec, &gc_level);
	free_candidate(sb, cand);
	log_gc("GC segment #%02x at %llx, %x required, %x free, %x valid, %llx free\n",
			segno, (u64)segno << super->s_segshift,
			dist, no_free_segments(sb), valid,
			super->s_free_bytes);
	cleaned = logfs_gc_segment(sb, segno, dist);
	log_gc("GC segment #%02x complete - now %x valid\n", segno,
			valid - cleaned);
	BUG_ON(cleaned != valid);
	return 1;
}

static int logfs_gc_once(struct super_block *sb)
{
	struct gc_candidate *cand;

	cand = get_candidate(sb);
	if (cand)
		remove_from_list(cand);
	return __logfs_gc_once(sb, cand);
}

/* returns 1 if a wrap occurs, 0 otherwise */
static int logfs_scan_some(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	u32 segno;
	int i, ret = 0;

	segno = super->s_sweeper;
	for (i = SCAN_RATIO; i > 0; i--) {
		segno++;
		if (segno >= super->s_no_segs) {
			segno = 0;
			ret = 1;
			/* Break out of the loop. We want to read a single
			 * block from the segment size on next invocation if
			 * SCAN_RATIO is set to match block size
			 */
			break;
		}

		scan_segment(sb, segno);
	}
	super->s_sweeper = segno;
	return ret;
}

/*
 * In principle, this function should loop forever, looking for GC candidates
 * and moving data. LogFS is designed in such a way that this loop is
 * guaranteed to terminate.
 *
 * Limiting the loop to some iterations serves purely to catch cases when
 * these guarantees have failed. An actual endless loop is an obvious bug
 * and should be reported as such.
 */
static void __logfs_gc_pass(struct super_block *sb, int target)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_block *block;
	int round, progress, last_progress = 0;

	if (no_free_segments(sb) >= target &&
			super->s_no_object_aliases < MAX_OBJ_ALIASES)
		return;

	log_gc("__logfs_gc_pass(%x)\n", target);
	for (round = 0; round < SCAN_ROUNDS; ) {
		if (no_free_segments(sb) >= target)
			goto write_alias;

		/* Sync in-memory state with on-medium state in case they
		 * diverged */
		logfs_write_anchor(super->s_master_inode);
		round += logfs_scan_some(sb);
		if (no_free_segments(sb) >= target)
			goto write_alias;
		progress = logfs_gc_once(sb);
		if (progress)
			last_progress = round;
		else if (round - last_progress > 2)
			break;
		continue;

		/*
		 * The goto logic is nasty, I just don't know a better way to
		 * code it. GC is supposed to ensure two things:
		 * 1. Enough free segments are available.
		 * 2. The number of aliases is bounded.
		 * When 1. is achieved, we take a look at 2. and write back
		 * some alias-containing blocks, if necessary. However, after
		 * each such write we need to go back to 1., as writes can
		 * consume free segments.
		 */
write_alias:
		if (super->s_no_object_aliases < MAX_OBJ_ALIASES)
			return;
		if (list_empty(&super->s_object_alias)) {
			/* All aliases are still in btree */
			return;
		}
		log_gc("Write back one alias\n");
		block = list_entry(super->s_object_alias.next,
				struct logfs_block, alias_list);
		block->ops->write_block(block);
		/*
		 * To round off the nasty goto logic, we reset round here. It
		 * is a safety-net for GC not making any progress and limited
		 * to something reasonably small. If we incremented it for
		 * every single alias, the loop could terminate rather quickly.
		 */
		round = 0;
	}
	LOGFS_BUG(sb);
}

static int wl_ratelimit(struct super_block *sb, u64 *next_event)
{
	struct logfs_super *super = logfs_super(sb);

	if (*next_event < super->s_gec) {
		*next_event = super->s_gec + WL_RATELIMIT;
		return 0;
	}
	return 1;
}

static void logfs_wl_pass(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	struct gc_candidate *wl_cand, *free_cand;

	if (wl_ratelimit(sb, &super->s_wl_gec_ostore))
		return;

	wl_cand = first_in_list(&super->s_ec_list);
	if (!wl_cand)
		return;
	free_cand = first_in_list(&super->s_free_list);
	if (!free_cand)
		return;

	if (wl_cand->erase_count < free_cand->erase_count + WL_DELTA) {
		remove_from_list(wl_cand);
		__logfs_gc_once(sb, wl_cand);
	}
}

/*
 * The journal needs wear leveling as well. But moving the journal is an
 * expensive operation so we try to avoid it as much as possible. And if we
 * have to do it, we move the whole journal, not individual segments.
 *
 * Ratelimiting is not strictly necessary here, it mainly serves to avoid the
 * calculations. First we check whether moving the journal would be a
 * significant improvement. That means that a) the current journal segments
 * have more wear than the future journal segments and b) the current journal
 * segments have more wear than normal ostore segments.
 * Rationale for b) is that we don't have to move the journal if it is aging
 * less than the ostore, even if the reserve segments age even less (they are
 * excluded from wear leveling, after all).
 * Next we check that the superblocks have less wear than the journal. Since
 * moving the journal requires writing the superblocks, we have to protect the
 * superblocks even more than the journal.
 *
 * Also we double the acceptable wear difference, compared to ostore wear
 * leveling. Journal data is read and rewritten rapidly, comparatively. So
 * soft errors have much less time to accumulate and we allow the journal to
 * be a bit worse than the ostore.
 */
static void logfs_journal_wl_pass(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	struct gc_candidate *cand;
	u32 min_journal_ec = -1, max_reserve_ec = 0;
	int i;

	if (wl_ratelimit(sb, &super->s_wl_gec_journal))
		return;

	if (super->s_reserve_list.count < super->s_no_journal_segs) {
		/* Reserve is not full enough to move complete journal */
		return;
	}

	journal_for_each(i)
		if (super->s_journal_seg[i])
			min_journal_ec = min(min_journal_ec,
					super->s_journal_ec[i]);
	cand = rb_entry(rb_first(&super->s_free_list.rb_tree),
			struct gc_candidate, rb_node);
	max_reserve_ec = cand->erase_count;
	for (i = 0; i < 2; i++) {
		struct logfs_segment_entry se;
		u32 segno = seg_no(sb, super->s_sb_ofs[i]);
		u32 ec;

		logfs_get_segment_entry(sb, segno, &se);
		ec = be32_to_cpu(se.ec_level) >> 4;
		max_reserve_ec = max(max_reserve_ec, ec);
	}

	if (min_journal_ec > max_reserve_ec + 2 * WL_DELTA) {
		do_logfs_journal_wl_pass(sb);
	}
}

void logfs_gc_pass(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);

	//BUG_ON(mutex_trylock(&logfs_super(sb)->s_w_mutex));
	/* Write journal before free space is getting saturated with dirty
	 * objects.
	 */
	if (super->s_dirty_used_bytes + super->s_dirty_free_bytes
			+ LOGFS_MAX_OBJECTSIZE >= super->s_free_bytes)
		logfs_write_anchor(super->s_master_inode);
	__logfs_gc_pass(sb, logfs_super(sb)->s_total_levels);
	logfs_wl_pass(sb);
	logfs_journal_wl_pass(sb);
}

static int check_area(struct super_block *sb, int i)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_area *area = super->s_area[i];
	struct logfs_object_header oh;
	u32 segno = area->a_segno;
	u32 ofs = area->a_used_bytes;
	__be32 crc;
	int err;

	if (!area->a_is_open)
		return 0;

	for (ofs = area->a_used_bytes;
			ofs <= super->s_segsize - sizeof(oh);
			ofs += (u32)be16_to_cpu(oh.len) + sizeof(oh)) {
		err = wbuf_read(sb, dev_ofs(sb, segno, ofs), sizeof(oh), &oh);
		if (err)
			return err;

		if (!memchr_inv(&oh, 0xff, sizeof(oh)))
			break;

		crc = logfs_crc32(&oh, sizeof(oh) - 4, 4);
		if (crc != oh.crc) {
			printk(KERN_INFO "interrupted header at %llx\n",
					dev_ofs(sb, segno, ofs));
			return 0;
		}
	}
	if (ofs != area->a_used_bytes) {
		printk(KERN_INFO "%x bytes unaccounted data found at %llx\n",
				ofs - area->a_used_bytes,
				dev_ofs(sb, segno, area->a_used_bytes));
		area->a_used_bytes = ofs;
	}
	return 0;
}

int logfs_check_areas(struct super_block *sb)
{
	int i, err;

	for_each_area(i) {
		err = check_area(sb, i);
		if (err)
			return err;
	}
	return 0;
}

static void logfs_init_candlist(struct candidate_list *list, int maxcount,
		int sort_by_ec)
{
	list->count = 0;
	list->maxcount = maxcount;
	list->sort_by_ec = sort_by_ec;
	list->rb_tree = RB_ROOT;
}

int logfs_init_gc(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	int i;

	btree_init_mempool32(&super->s_cand_tree, super->s_btree_pool);
	logfs_init_candlist(&super->s_free_list, LIST_SIZE + SCAN_RATIO, 1);
	logfs_init_candlist(&super->s_reserve_list,
			super->s_bad_seg_reserve, 1);
	for_each_area(i)
		logfs_init_candlist(&super->s_low_list[i], LIST_SIZE, 0);
	logfs_init_candlist(&super->s_ec_list, LIST_SIZE, 1);
	return 0;
}

static void logfs_cleanup_list(struct super_block *sb,
		struct candidate_list *list)
{
	struct gc_candidate *cand;

	while (list->count) {
		cand = rb_entry(list->rb_tree.rb_node, struct gc_candidate,
				rb_node);
		remove_from_list(cand);
		free_candidate(sb, cand);
	}
	BUG_ON(list->rb_tree.rb_node);
}

void logfs_cleanup_gc(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	int i;

	if (!super->s_free_list.count)
		return;

	/*
	 * FIXME: The btree may still contain a single empty node. So we
	 * call the grim visitor to clean up that mess. Btree code should
	 * do it for us, really.
	 */
	btree_grim_visitor32(&super->s_cand_tree, 0, NULL);
	logfs_cleanup_list(sb, &super->s_free_list);
	logfs_cleanup_list(sb, &super->s_reserve_list);
	for_each_area(i)
		logfs_cleanup_list(sb, &super->s_low_list[i]);
	logfs_cleanup_list(sb, &super->s_ec_list);
}
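add_list() above caps every candidate list at `maxcount` entries: the new node is always linked into the rb-tree, and if the list overflows, rb_last() (the worst entry under the list's sort order) is evicted and handed back to the caller. Below is a minimal userspace sketch of the same bounded "keep the N best" policy, using an insertion-sorted array instead of an rb-tree. All names (`MAXCOUNT`, `struct cand`, `cand_add`) are local to the sketch, not LogFS API.

```c
#include <assert.h>

#define MAXCOUNT 4	/* stand-in for list->maxcount */

struct cand { unsigned segno, valid; };

struct cand_list {
	struct cand entries[MAXCOUNT];	/* sorted by ascending 'valid' */
	int count;
};

/* Insert a candidate, keeping the list sorted by ascending 'valid'
 * (fewest valid bytes == cheapest segment to clean).  When the list is
 * full the worst entry falls off the end, mirroring add_list() evicting
 * rb_last() once count would exceed maxcount. */
static void cand_add(struct cand_list *list, struct cand c)
{
	int i = list->count < MAXCOUNT ? list->count : MAXCOUNT - 1;

	if (list->count == MAXCOUNT && c.valid >= list->entries[i].valid)
		return;	/* worse than everything we keep */
	for (; i > 0 && list->entries[i - 1].valid > c.valid; i--)
		list->entries[i] = list->entries[i - 1];
	list->entries[i] = c;
	if (list->count < MAXCOUNT)
		list->count++;
}
```

The front of the list plays the role of rb_first() in get_best_cand(): the cheapest segment to garbage-collect is always at index 0.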
+417
fs/logfs/inode.c
/*
 * fs/logfs/inode.c - inode handling code
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 */
#include "logfs.h"
#include <linux/writeback.h>
#include <linux/backing-dev.h>

/*
 * How soon to reuse old inode numbers? LogFS doesn't store deleted inodes
 * on the medium. It therefore also lacks a method to store the previous
 * generation number for deleted inodes. Instead a single generation number
 * is stored which will be used for new inodes. Being just a 32bit counter,
 * this can obviously wrap relatively quickly. So we only reuse inodes if we
 * know that a fair number of inodes can be created before we have to increment
 * the generation again - effectively adding some bits to the counter.
 * But being too aggressive here means we keep a very large and very sparse
 * inode file, wasting space on indirect blocks.
 * So what is a good value? Beats me. 64k seems moderately bad on both
 * fronts, so let's use that for now...
 *
 * NFS sucks, as everyone already knows.
 */
#define INOS_PER_WRAP (0x10000)

/*
 * Logfs' requirement to read inodes for garbage collection makes life a bit
 * harder. GC may have to read inodes that are in I_FREEING state, when they
 * are being written out - and waiting for GC to make progress, naturally.
 *
 * So we cannot just call iget() or some variant of it, but first have to check
 * whether the inode in question might be in I_FREEING state. Therefore we
 * maintain our own per-sb list of "almost deleted" inodes and check against
 * that list first. Normally this should be at most 1-2 entries long.
 *
 * Also, inodes have logfs-specific reference counting on top of what the vfs
 * does. When .destroy_inode is called, normally the reference count will drop
 * to zero and the inode gets deleted. But if GC accessed the inode, its
 * refcount will remain nonzero and final deletion will have to wait.
 *
 * As a result we have two sets of functions to get/put inodes:
 * logfs_safe_iget/logfs_safe_iput	- safe to call from GC context
 * logfs_iget/iput			- normal version
 */
static struct kmem_cache *logfs_inode_cache;

static DEFINE_SPINLOCK(logfs_inode_lock);

static void logfs_inode_setops(struct inode *inode)
{
	switch (inode->i_mode & S_IFMT) {
	case S_IFDIR:
		inode->i_op = &logfs_dir_iops;
		inode->i_fop = &logfs_dir_fops;
		inode->i_mapping->a_ops = &logfs_reg_aops;
		break;
	case S_IFREG:
		inode->i_op = &logfs_reg_iops;
		inode->i_fop = &logfs_reg_fops;
		inode->i_mapping->a_ops = &logfs_reg_aops;
		break;
	case S_IFLNK:
		inode->i_op = &logfs_symlink_iops;
		inode->i_mapping->a_ops = &logfs_reg_aops;
		break;
	case S_IFSOCK: /* fall through */
	case S_IFBLK: /* fall through */
	case S_IFCHR: /* fall through */
	case S_IFIFO:
		init_special_inode(inode, inode->i_mode, inode->i_rdev);
		break;
	default:
		BUG();
	}
}

static struct inode *__logfs_iget(struct super_block *sb, ino_t ino)
{
	struct inode *inode = iget_locked(sb, ino);
	int err;

	if (!inode)
		return ERR_PTR(-ENOMEM);
	if (!(inode->i_state & I_NEW))
		return inode;

	err = logfs_read_inode(inode);
	if (err || inode->i_nlink == 0) {
		/* inode->i_nlink == 0 can be true when called from
		 * block validator */
		/* set i_nlink to 0 to prevent caching */
		inode->i_nlink = 0;
		logfs_inode(inode)->li_flags |= LOGFS_IF_ZOMBIE;
		iget_failed(inode);
		if (!err)
			err = -ENOENT;
		return ERR_PTR(err);
	}

	logfs_inode_setops(inode);
	unlock_new_inode(inode);
	return inode;
}

struct inode *logfs_iget(struct super_block *sb, ino_t ino)
{
	BUG_ON(ino == LOGFS_INO_MASTER);
	BUG_ON(ino == LOGFS_INO_SEGFILE);
	return __logfs_iget(sb, ino);
}

/*
 * is_cached is set to 1 if we hand out a cached inode, 0 otherwise.
 * this allows logfs_iput to do the right thing later
 */
struct inode *logfs_safe_iget(struct super_block *sb, ino_t ino, int *is_cached)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_inode *li;

	if (ino == LOGFS_INO_MASTER)
		return super->s_master_inode;
	if (ino == LOGFS_INO_SEGFILE)
		return super->s_segfile_inode;

	spin_lock(&logfs_inode_lock);
	list_for_each_entry(li, &super->s_freeing_list, li_freeing_list)
		if (li->vfs_inode.i_ino == ino) {
			li->li_refcount++;
			spin_unlock(&logfs_inode_lock);
			*is_cached = 1;
			return &li->vfs_inode;
		}
	spin_unlock(&logfs_inode_lock);

	*is_cached = 0;
	return __logfs_iget(sb, ino);
}

static void __logfs_destroy_inode(struct inode *inode)
{
	struct logfs_inode *li = logfs_inode(inode);

	BUG_ON(li->li_block);
	list_del(&li->li_freeing_list);
	kmem_cache_free(logfs_inode_cache, li);
}

static void logfs_destroy_inode(struct inode *inode)
{
	struct logfs_inode *li = logfs_inode(inode);

	BUG_ON(list_empty(&li->li_freeing_list));
	spin_lock(&logfs_inode_lock);
	li->li_refcount--;
	if (li->li_refcount == 0)
		__logfs_destroy_inode(inode);
	spin_unlock(&logfs_inode_lock);
}

void logfs_safe_iput(struct inode *inode, int is_cached)
{
	if (inode->i_ino == LOGFS_INO_MASTER)
		return;
	if (inode->i_ino == LOGFS_INO_SEGFILE)
		return;

	if (is_cached) {
		logfs_destroy_inode(inode);
		return;
	}

	iput(inode);
}

static void logfs_init_inode(struct super_block *sb, struct inode *inode)
{
	struct logfs_inode *li = logfs_inode(inode);
	int i;

	li->li_flags = 0;
	li->li_height = 0;
	li->li_used_bytes = 0;
	li->li_block = NULL;
	inode->i_uid = 0;
	inode->i_gid = 0;
	inode->i_size = 0;
	inode->i_blocks = 0;
	inode->i_ctime = CURRENT_TIME;
	inode->i_mtime = CURRENT_TIME;
	inode->i_nlink = 1;
	INIT_LIST_HEAD(&li->li_freeing_list);

	for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)
		li->li_data[i] = 0;

	return;
}

static struct inode *logfs_alloc_inode(struct super_block *sb)
{
	struct logfs_inode *li;

	li = kmem_cache_alloc(logfs_inode_cache, GFP_NOFS);
	if (!li)
		return NULL;
	logfs_init_inode(sb, &li->vfs_inode);
	return &li->vfs_inode;
}

/*
 * In logfs inodes are written to an inode file. The inode file, like any
 * other file, is managed with an inode. The inode file's inode, aka master
 * inode, requires special handling in several respects. First, it cannot be
 * written to the inode file, so it is stored in the journal instead.
 *
 * Secondly, this inode cannot be written back and destroyed before all other
 * inodes have been written. The ordering is important. Linux' VFS is happily
 * unaware of the ordering constraint and would ordinarily destroy the master
 * inode at umount time while other inodes are still in use and dirty. Not
 * good.
 *
 * So logfs makes sure the master inode is not written until all other inodes
 * have been destroyed. Sadly, this method has another side-effect. The VFS
 * will notice one remaining inode and print a frightening warning message.
 * Worse, it is impossible to judge whether such a warning was caused by the
 * master inode or any other inodes have leaked as well.
 *
 * Our attempt at solving this is with logfs_new_meta_inode() below. Its
 * purpose is to create a new inode that will not trigger the warning if such
 * an inode is still in use. An ugly hack, no doubt. Suggestions for
 * improvement are welcome.
 */
struct inode *logfs_new_meta_inode(struct super_block *sb, u64 ino)
{
	struct inode *inode;

	inode = logfs_alloc_inode(sb);
	if (!inode)
		return ERR_PTR(-ENOMEM);

	inode->i_mode = S_IFREG;
	inode->i_ino = ino;
	inode->i_sb = sb;

	/* This is a blatant copy of alloc_inode code. We'd need alloc_inode
	 * to be nonstatic, alas. */
	{
		struct address_space * const mapping = &inode->i_data;

		mapping->a_ops = &logfs_reg_aops;
		mapping->host = inode;
		mapping->flags = 0;
		mapping_set_gfp_mask(mapping, GFP_NOFS);
		mapping->assoc_mapping = NULL;
		mapping->backing_dev_info = &default_backing_dev_info;
		inode->i_mapping = mapping;
		inode->i_nlink = 1;
	}

	return inode;
}

struct inode *logfs_read_meta_inode(struct super_block *sb, u64 ino)
{
	struct inode *inode;
	int err;

	inode = logfs_new_meta_inode(sb, ino);
	if (IS_ERR(inode))
		return inode;

	err = logfs_read_inode(inode);
	if (err) {
		destroy_meta_inode(inode);
		return ERR_PTR(err);
	}
	logfs_inode_setops(inode);
	return inode;
}

static int logfs_write_inode(struct inode *inode, int do_sync)
{
	int ret;
	long flags = WF_LOCK;

	/* Can only happen if creat() failed. Safe to skip. */
	if (logfs_inode(inode)->li_flags & LOGFS_IF_STILLBORN)
		return 0;

	ret = __logfs_write_inode(inode, flags);
	LOGFS_BUG_ON(ret, inode->i_sb);
	return ret;
}

void destroy_meta_inode(struct inode *inode)
{
	if (inode) {
		if (inode->i_data.nrpages)
			truncate_inode_pages(&inode->i_data, 0);
		logfs_clear_inode(inode);
		kmem_cache_free(logfs_inode_cache, logfs_inode(inode));
	}
}

/* called with inode_lock held */
static void logfs_drop_inode(struct inode *inode)
{
	struct logfs_super *super = logfs_super(inode->i_sb);
	struct logfs_inode *li = logfs_inode(inode);

	spin_lock(&logfs_inode_lock);
	list_move(&li->li_freeing_list, &super->s_freeing_list);
	spin_unlock(&logfs_inode_lock);
	generic_drop_inode(inode);
}

static void logfs_set_ino_generation(struct super_block *sb,
		struct inode *inode)
{
	struct logfs_super *super = logfs_super(sb);
	u64 ino;

	mutex_lock(&super->s_journal_mutex);
	ino = logfs_seek_hole(super->s_master_inode, super->s_last_ino);
	super->s_last_ino = ino;
	super->s_inos_till_wrap--;
	if (super->s_inos_till_wrap < 0) {
		super->s_last_ino = LOGFS_RESERVED_INOS;
		super->s_generation++;
		super->s_inos_till_wrap = INOS_PER_WRAP;
	}
	inode->i_ino = ino;
	inode->i_generation = super->s_generation;
	mutex_unlock(&super->s_journal_mutex);
}

struct inode *logfs_new_inode(struct inode *dir, int mode)
{
	struct super_block *sb = dir->i_sb;
	struct inode *inode;

	inode = new_inode(sb);
	if (!inode)
		return ERR_PTR(-ENOMEM);

	logfs_init_inode(sb, inode);

	/* inherit parent flags */
	logfs_inode(inode)->li_flags |=
		logfs_inode(dir)->li_flags & LOGFS_FL_INHERITED;

	inode->i_mode = mode;
	logfs_set_ino_generation(sb, inode);

	inode->i_uid = current_fsuid();
	inode->i_gid = current_fsgid();
	if (dir->i_mode & S_ISGID) {
		inode->i_gid = dir->i_gid;
		if (S_ISDIR(mode))
			inode->i_mode |= S_ISGID;
	}

	logfs_inode_setops(inode);
	insert_inode_hash(inode);

	return inode;
}

static void logfs_init_once(void *_li)
{
	struct logfs_inode *li = _li;
	int i;

	li->li_flags = 0;
	li->li_used_bytes = 0;
	li->li_refcount = 1;
	for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)
		li->li_data[i] = 0;
	inode_init_once(&li->vfs_inode);
}

static int logfs_sync_fs(struct super_block *sb, int wait)
{
	/* FIXME: write anchor */
	logfs_super(sb)->s_devops->sync(sb);
	return 0;
}

const struct super_operations logfs_super_operations = {
	.alloc_inode	= logfs_alloc_inode,
	.clear_inode	= logfs_clear_inode,
	.delete_inode	= logfs_delete_inode,
	.destroy_inode	= logfs_destroy_inode,
	.drop_inode	= logfs_drop_inode,
	.write_inode	= logfs_write_inode,
	.statfs		= logfs_statfs,
	.sync_fs	= logfs_sync_fs,
};

int logfs_init_inode_cache(void)
{
	logfs_inode_cache = kmem_cache_create("logfs_inode_cache",
			sizeof(struct logfs_inode), 0, SLAB_RECLAIM_ACCOUNT,
			logfs_init_once);
	if (!logfs_inode_cache)
		return -ENOMEM;
	return 0;
}

void logfs_destroy_inode_cache(void)
{
	kmem_cache_destroy(logfs_inode_cache);
}
+879
fs/logfs/journal.c
···
+/*
+ * fs/logfs/journal.c - journal handling code
+ *
+ * As should be obvious for Linux kernel code, license is GPLv2
+ *
+ * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
+ */
+#include "logfs.h"
+
+static void logfs_calc_free(struct super_block *sb)
+{
+	struct logfs_super *super = logfs_super(sb);
+	u64 reserve, no_segs = super->s_no_segs;
+	s64 free;
+	int i;
+
+	/* superblock segments */
+	no_segs -= 2;
+	super->s_no_journal_segs = 0;
+	/* journal */
+	journal_for_each(i)
+		if (super->s_journal_seg[i]) {
+			no_segs--;
+			super->s_no_journal_segs++;
+		}
+
+	/* open segments plus one extra per level for GC */
+	no_segs -= 2 * super->s_total_levels;
+
+	free = no_segs * (super->s_segsize - LOGFS_SEGMENT_RESERVE);
+	free -= super->s_used_bytes;
+	/* just a bit extra */
+	free -= super->s_total_levels * 4096;
+
+	/* Bad blocks are 'paid' for with the speed reserve - the filesystem
+	 * simply gets slower as bad blocks accumulate.  Only once the bad
+	 * blocks exceed the speed reserve does the filesystem get smaller.
+	 */
+	reserve = super->s_bad_segments + super->s_bad_seg_reserve;
+	reserve *= super->s_segsize - LOGFS_SEGMENT_RESERVE;
+	reserve = max(reserve, super->s_speed_reserve);
+	free -= reserve;
+	if (free < 0)
+		free = 0;
+
+	super->s_free_bytes = free;
+}
+
+static void reserve_sb_and_journal(struct super_block *sb)
+{
+	struct logfs_super *super = logfs_super(sb);
+	struct btree_head32 *head = &super->s_reserved_segments;
+	int i, err;
+
+	err = btree_insert32(head, seg_no(sb, super->s_sb_ofs[0]), (void *)1,
+			GFP_KERNEL);
+	BUG_ON(err);
+
+	err = btree_insert32(head, seg_no(sb, super->s_sb_ofs[1]), (void *)1,
+			GFP_KERNEL);
+	BUG_ON(err);
+
+	journal_for_each(i) {
+		if (!super->s_journal_seg[i])
+			continue;
+		err = btree_insert32(head, super->s_journal_seg[i], (void *)1,
+				GFP_KERNEL);
+		BUG_ON(err);
+	}
+}
+
+static void read_dynsb(struct super_block *sb,
+		struct logfs_je_dynsb *dynsb)
+{
+	struct logfs_super *super = logfs_super(sb);
+
+	super->s_gec = be64_to_cpu(dynsb->ds_gec);
+	super->s_sweeper = be64_to_cpu(dynsb->ds_sweeper);
+	super->s_victim_ino = be64_to_cpu(dynsb->ds_victim_ino);
+	super->s_rename_dir = be64_to_cpu(dynsb->ds_rename_dir);
+	super->s_rename_pos = be64_to_cpu(dynsb->ds_rename_pos);
+	super->s_used_bytes = be64_to_cpu(dynsb->ds_used_bytes);
+	super->s_generation = be32_to_cpu(dynsb->ds_generation);
+}
+
+static void read_anchor(struct super_block *sb,
+		struct logfs_je_anchor *da)
+{
+	struct logfs_super *super = logfs_super(sb);
+	struct inode *inode = super->s_master_inode;
+	struct logfs_inode *li = logfs_inode(inode);
+	int i;
+
+	super->s_last_ino = be64_to_cpu(da->da_last_ino);
+	li->li_flags = 0;
+	li->li_height = da->da_height;
+	i_size_write(inode, be64_to_cpu(da->da_size));
+	li->li_used_bytes = be64_to_cpu(da->da_used_bytes);
+
+	for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)
+		li->li_data[i] = be64_to_cpu(da->da_data[i]);
+}
+
+static void read_erasecount(struct super_block *sb,
+		struct logfs_je_journal_ec *ec)
+{
+	struct logfs_super *super = logfs_super(sb);
+	int i;
+
+	journal_for_each(i)
+		super->s_journal_ec[i] = be32_to_cpu(ec->ec[i]);
+}
+
+static int read_area(struct super_block *sb, struct logfs_je_area *a)
+{
+	struct logfs_super *super = logfs_super(sb);
+	struct logfs_area *area;
+	u64 ofs;
+	u32 writemask = ~(super->s_writesize - 1);
+
+	/* Validate gc_level before using it as an array index. */
+	if (a->gc_level >= LOGFS_NO_AREAS)
+		return -EIO;
+	if (a->vim != VIM_DEFAULT)
+		return -EIO; /* TODO: close area and continue */
+
+	area = super->s_area[a->gc_level];
+	area->a_used_bytes = be32_to_cpu(a->used_bytes);
+	area->a_written_bytes = area->a_used_bytes & writemask;
+	area->a_segno = be32_to_cpu(a->segno);
+	if (area->a_segno)
+		area->a_is_open = 1;
+
+	ofs = dev_ofs(sb, area->a_segno, area->a_written_bytes);
+	if (super->s_writesize > 1)
+		logfs_buf_recover(area, ofs, a + 1, super->s_writesize);
+	else
+		logfs_buf_recover(area, ofs, NULL, 0);
+	return 0;
+}
+
+static void *unpack(void *from, void *to)
+{
+	struct logfs_journal_header *jh = from;
+	void *data = from + sizeof(struct logfs_journal_header);
+	int err;
+	size_t inlen, outlen;
+
+	inlen = be16_to_cpu(jh->h_len);
+	outlen = be16_to_cpu(jh->h_datalen);
+
+	if (jh->h_compr == COMPR_NONE)
+		memcpy(to, data, inlen);
+	else {
+		err = logfs_uncompress(data, to, inlen, outlen);
+		BUG_ON(err);
+	}
+	return to;
+}
+
+static int __read_je_header(struct super_block *sb, u64 ofs,
+		struct
logfs_journal_header *jh)161161+{162162+ struct logfs_super *super = logfs_super(sb);163163+ size_t bufsize = max_t(size_t, sb->s_blocksize, super->s_writesize)164164+ + MAX_JOURNAL_HEADER;165165+ u16 type, len, datalen;166166+ int err;167167+168168+ /* read header only */169169+ err = wbuf_read(sb, ofs, sizeof(*jh), jh);170170+ if (err)171171+ return err;172172+ type = be16_to_cpu(jh->h_type);173173+ len = be16_to_cpu(jh->h_len);174174+ datalen = be16_to_cpu(jh->h_datalen);175175+ if (len > sb->s_blocksize)176176+ return -EIO;177177+ if ((type < JE_FIRST) || (type > JE_LAST))178178+ return -EIO;179179+ if (datalen > bufsize)180180+ return -EIO;181181+ return 0;182182+}183183+184184+static int __read_je_payload(struct super_block *sb, u64 ofs,185185+ struct logfs_journal_header *jh)186186+{187187+ u16 len;188188+ int err;189189+190190+ len = be16_to_cpu(jh->h_len);191191+ err = wbuf_read(sb, ofs + sizeof(*jh), len, jh + 1);192192+ if (err)193193+ return err;194194+ if (jh->h_crc != logfs_crc32(jh, len + sizeof(*jh), 4)) {195195+ /* Old code was confused. 
It forgot about the header length196196+ * and stopped calculating the crc 16 bytes before the end197197+ * of data - ick!198198+ * FIXME: Remove this hack once the old code is fixed.199199+ */200200+ if (jh->h_crc == logfs_crc32(jh, len, 4))201201+ WARN_ON_ONCE(1);202202+ else203203+ return -EIO;204204+ }205205+ return 0;206206+}207207+208208+/*209209+ * jh needs to be large enough to hold the complete entry, not just the header210210+ */211211+static int __read_je(struct super_block *sb, u64 ofs,212212+ struct logfs_journal_header *jh)213213+{214214+ int err;215215+216216+ err = __read_je_header(sb, ofs, jh);217217+ if (err)218218+ return err;219219+ return __read_je_payload(sb, ofs, jh);220220+}221221+222222+static int read_je(struct super_block *sb, u64 ofs)223223+{224224+ struct logfs_super *super = logfs_super(sb);225225+ struct logfs_journal_header *jh = super->s_compressed_je;226226+ void *scratch = super->s_je;227227+ u16 type, datalen;228228+ int err;229229+230230+ err = __read_je(sb, ofs, jh);231231+ if (err)232232+ return err;233233+ type = be16_to_cpu(jh->h_type);234234+ datalen = be16_to_cpu(jh->h_datalen);235235+236236+ switch (type) {237237+ case JE_DYNSB:238238+ read_dynsb(sb, unpack(jh, scratch));239239+ break;240240+ case JE_ANCHOR:241241+ read_anchor(sb, unpack(jh, scratch));242242+ break;243243+ case JE_ERASECOUNT:244244+ read_erasecount(sb, unpack(jh, scratch));245245+ break;246246+ case JE_AREA:247247+ read_area(sb, unpack(jh, scratch));248248+ break;249249+ case JE_OBJ_ALIAS:250250+ err = logfs_load_object_aliases(sb, unpack(jh, scratch),251251+ datalen);252252+ break;253253+ default:254254+ WARN_ON_ONCE(1);255255+ return -EIO;256256+ }257257+ return err;258258+}259259+260260+static int logfs_read_segment(struct super_block *sb, u32 segno)261261+{262262+ struct logfs_super *super = logfs_super(sb);263263+ struct logfs_journal_header *jh = super->s_compressed_je;264264+ u64 ofs, seg_ofs = dev_ofs(sb, segno, 0);265265+ u32 h_ofs, last_ofs = 
0;266266+ u16 len, datalen, last_len;267267+ int i, err;268268+269269+ /* search for most recent commit */270270+ for (h_ofs = 0; h_ofs < super->s_segsize; h_ofs += sizeof(*jh)) {271271+ ofs = seg_ofs + h_ofs;272272+ err = __read_je_header(sb, ofs, jh);273273+ if (err)274274+ continue;275275+ if (jh->h_type != cpu_to_be16(JE_COMMIT))276276+ continue;277277+ err = __read_je_payload(sb, ofs, jh);278278+ if (err)279279+ continue;280280+ len = be16_to_cpu(jh->h_len);281281+ datalen = be16_to_cpu(jh->h_datalen);282282+ if ((datalen > sizeof(super->s_je_array)) ||283283+ (datalen % sizeof(__be64)))284284+ continue;285285+ last_ofs = h_ofs;286286+ last_len = datalen;287287+ h_ofs += ALIGN(len, sizeof(*jh)) - sizeof(*jh);288288+ }289289+ /* read commit */290290+ if (last_ofs == 0)291291+ return -ENOENT;292292+ ofs = seg_ofs + last_ofs;293293+ log_journal("Read commit from %llx\n", ofs);294294+ err = __read_je(sb, ofs, jh);295295+ BUG_ON(err); /* We should have caught it in the scan loop already */296296+ if (err)297297+ return err;298298+ /* uncompress */299299+ unpack(jh, super->s_je_array);300300+ super->s_no_je = last_len / sizeof(__be64);301301+ /* iterate over array */302302+ for (i = 0; i < super->s_no_je; i++) {303303+ err = read_je(sb, be64_to_cpu(super->s_je_array[i]));304304+ if (err)305305+ return err;306306+ }307307+ super->s_journal_area->a_segno = segno;308308+ return 0;309309+}310310+311311+static u64 read_gec(struct super_block *sb, u32 segno)312312+{313313+ struct logfs_segment_header sh;314314+ __be32 crc;315315+ int err;316316+317317+ if (!segno)318318+ return 0;319319+ err = wbuf_read(sb, dev_ofs(sb, segno, 0), sizeof(sh), &sh);320320+ if (err)321321+ return 0;322322+ crc = logfs_crc32(&sh, sizeof(sh), 4);323323+ if (crc != sh.crc) {324324+ WARN_ON(sh.gec != cpu_to_be64(0xffffffffffffffffull));325325+ /* Most likely it was just erased */326326+ return 0;327327+ }328328+ return be64_to_cpu(sh.gec);329329+}330330+331331+static int 
logfs_read_journal(struct super_block *sb)332332+{333333+ struct logfs_super *super = logfs_super(sb);334334+ u64 gec[LOGFS_JOURNAL_SEGS], max;335335+ u32 segno;336336+ int i, max_i;337337+338338+ max = 0;339339+ max_i = -1;340340+ journal_for_each(i) {341341+ segno = super->s_journal_seg[i];342342+ gec[i] = read_gec(sb, super->s_journal_seg[i]);343343+ if (gec[i] > max) {344344+ max = gec[i];345345+ max_i = i;346346+ }347347+ }348348+ if (max_i == -1)349349+ return -EIO;350350+ /* FIXME: Try older segments in case of error */351351+ return logfs_read_segment(sb, super->s_journal_seg[max_i]);352352+}353353+354354+/*355355+ * First search the current segment (outer loop), then pick the next segment356356+ * in the array, skipping any zero entries (inner loop).357357+ */358358+static void journal_get_free_segment(struct logfs_area *area)359359+{360360+ struct logfs_super *super = logfs_super(area->a_sb);361361+ int i;362362+363363+ journal_for_each(i) {364364+ if (area->a_segno != super->s_journal_seg[i])365365+ continue;366366+367367+ do {368368+ i++;369369+ if (i == LOGFS_JOURNAL_SEGS)370370+ i = 0;371371+ } while (!super->s_journal_seg[i]);372372+373373+ area->a_segno = super->s_journal_seg[i];374374+ area->a_erase_count = ++(super->s_journal_ec[i]);375375+ log_journal("Journal now at %x (ec %x)\n", area->a_segno,376376+ area->a_erase_count);377377+ return;378378+ }379379+ BUG();380380+}381381+382382+static void journal_get_erase_count(struct logfs_area *area)383383+{384384+ /* erase count is stored globally and incremented in385385+ * journal_get_free_segment() - nothing to do here */386386+}387387+388388+static int journal_erase_segment(struct logfs_area *area)389389+{390390+ struct super_block *sb = area->a_sb;391391+ struct logfs_segment_header sh;392392+ u64 ofs;393393+ int err;394394+395395+ err = logfs_erase_segment(sb, area->a_segno);396396+ if (err)397397+ return err;398398+399399+ sh.pad = 0;400400+ sh.type = SEG_JOURNAL;401401+ sh.level = 0;402402+ 
sh.segno = cpu_to_be32(area->a_segno);403403+ sh.ec = cpu_to_be32(area->a_erase_count);404404+ sh.gec = cpu_to_be64(logfs_super(sb)->s_gec);405405+ sh.crc = logfs_crc32(&sh, sizeof(sh), 4);406406+407407+ /* This causes a bug in segment.c. Not yet. */408408+ //logfs_set_segment_erased(sb, area->a_segno, area->a_erase_count, 0);409409+410410+ ofs = dev_ofs(sb, area->a_segno, 0);411411+ area->a_used_bytes = ALIGN(sizeof(sh), 16);412412+ logfs_buf_write(area, ofs, &sh, sizeof(sh));413413+ return 0;414414+}415415+416416+static size_t __logfs_write_header(struct logfs_super *super,417417+ struct logfs_journal_header *jh, size_t len, size_t datalen,418418+ u16 type, u8 compr)419419+{420420+ jh->h_len = cpu_to_be16(len);421421+ jh->h_type = cpu_to_be16(type);422422+ jh->h_version = cpu_to_be16(++super->s_last_version);423423+ jh->h_datalen = cpu_to_be16(datalen);424424+ jh->h_compr = compr;425425+ jh->h_pad[0] = 'H';426426+ jh->h_pad[1] = 'A';427427+ jh->h_pad[2] = 'T';428428+ jh->h_crc = logfs_crc32(jh, len + sizeof(*jh), 4);429429+ return ALIGN(len, 16) + sizeof(*jh);430430+}431431+432432+static size_t logfs_write_header(struct logfs_super *super,433433+ struct logfs_journal_header *jh, size_t datalen, u16 type)434434+{435435+ size_t len = datalen;436436+437437+ return __logfs_write_header(super, jh, len, datalen, type, COMPR_NONE);438438+}439439+440440+static inline size_t logfs_journal_erasecount_size(struct logfs_super *super)441441+{442442+ return LOGFS_JOURNAL_SEGS * sizeof(__be32);443443+}444444+445445+static void *logfs_write_erasecount(struct super_block *sb, void *_ec,446446+ u16 *type, size_t *len)447447+{448448+ struct logfs_super *super = logfs_super(sb);449449+ struct logfs_je_journal_ec *ec = _ec;450450+ int i;451451+452452+ journal_for_each(i)453453+ ec->ec[i] = cpu_to_be32(super->s_journal_ec[i]);454454+ *type = JE_ERASECOUNT;455455+ *len = logfs_journal_erasecount_size(super);456456+ return ec;457457+}458458+459459+static void account_shadow(void 
*_shadow, unsigned long _sb, u64 ignore,460460+ size_t ignore2)461461+{462462+ struct logfs_shadow *shadow = _shadow;463463+ struct super_block *sb = (void *)_sb;464464+ struct logfs_super *super = logfs_super(sb);465465+466466+ /* consume new space */467467+ super->s_free_bytes -= shadow->new_len;468468+ super->s_used_bytes += shadow->new_len;469469+ super->s_dirty_used_bytes -= shadow->new_len;470470+471471+ /* free up old space */472472+ super->s_free_bytes += shadow->old_len;473473+ super->s_used_bytes -= shadow->old_len;474474+ super->s_dirty_free_bytes -= shadow->old_len;475475+476476+ logfs_set_segment_used(sb, shadow->old_ofs, -shadow->old_len);477477+ logfs_set_segment_used(sb, shadow->new_ofs, shadow->new_len);478478+479479+ log_journal("account_shadow(%llx, %llx, %x) %llx->%llx %x->%x\n",480480+ shadow->ino, shadow->bix, shadow->gc_level,481481+ shadow->old_ofs, shadow->new_ofs,482482+ shadow->old_len, shadow->new_len);483483+ mempool_free(shadow, super->s_shadow_pool);484484+}485485+486486+static void account_shadows(struct super_block *sb)487487+{488488+ struct logfs_super *super = logfs_super(sb);489489+ struct inode *inode = super->s_master_inode;490490+ struct logfs_inode *li = logfs_inode(inode);491491+ struct shadow_tree *tree = &super->s_shadow_tree;492492+493493+ btree_grim_visitor64(&tree->new, (unsigned long)sb, account_shadow);494494+ btree_grim_visitor64(&tree->old, (unsigned long)sb, account_shadow);495495+496496+ if (li->li_block) {497497+ /*498498+ * We never actually use the structure, when attached to the499499+ * master inode. 
But it is easier to always free it here than500500+ * to have checks in several places elsewhere when allocating501501+ * it.502502+ */503503+ li->li_block->ops->free_block(sb, li->li_block);504504+ }505505+ BUG_ON((s64)li->li_used_bytes < 0);506506+}507507+508508+static void *__logfs_write_anchor(struct super_block *sb, void *_da,509509+ u16 *type, size_t *len)510510+{511511+ struct logfs_super *super = logfs_super(sb);512512+ struct logfs_je_anchor *da = _da;513513+ struct inode *inode = super->s_master_inode;514514+ struct logfs_inode *li = logfs_inode(inode);515515+ int i;516516+517517+ da->da_height = li->li_height;518518+ da->da_last_ino = cpu_to_be64(super->s_last_ino);519519+ da->da_size = cpu_to_be64(i_size_read(inode));520520+ da->da_used_bytes = cpu_to_be64(li->li_used_bytes);521521+ for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)522522+ da->da_data[i] = cpu_to_be64(li->li_data[i]);523523+ *type = JE_ANCHOR;524524+ *len = sizeof(*da);525525+ return da;526526+}527527+528528+static void *logfs_write_dynsb(struct super_block *sb, void *_dynsb,529529+ u16 *type, size_t *len)530530+{531531+ struct logfs_super *super = logfs_super(sb);532532+ struct logfs_je_dynsb *dynsb = _dynsb;533533+534534+ dynsb->ds_gec = cpu_to_be64(super->s_gec);535535+ dynsb->ds_sweeper = cpu_to_be64(super->s_sweeper);536536+ dynsb->ds_victim_ino = cpu_to_be64(super->s_victim_ino);537537+ dynsb->ds_rename_dir = cpu_to_be64(super->s_rename_dir);538538+ dynsb->ds_rename_pos = cpu_to_be64(super->s_rename_pos);539539+ dynsb->ds_used_bytes = cpu_to_be64(super->s_used_bytes);540540+ dynsb->ds_generation = cpu_to_be32(super->s_generation);541541+ *type = JE_DYNSB;542542+ *len = sizeof(*dynsb);543543+ return dynsb;544544+}545545+546546+static void write_wbuf(struct super_block *sb, struct logfs_area *area,547547+ void *wbuf)548548+{549549+ struct logfs_super *super = logfs_super(sb);550550+ struct address_space *mapping = super->s_mapping_inode->i_mapping;551551+ u64 ofs;552552+ pgoff_t 
index;553553+ int page_ofs;554554+ struct page *page;555555+556556+ ofs = dev_ofs(sb, area->a_segno,557557+ area->a_used_bytes & ~(super->s_writesize - 1));558558+ index = ofs >> PAGE_SHIFT;559559+ page_ofs = ofs & (PAGE_SIZE - 1);560560+561561+ page = find_lock_page(mapping, index);562562+ BUG_ON(!page);563563+ memcpy(wbuf, page_address(page) + page_ofs, super->s_writesize);564564+ unlock_page(page);565565+}566566+567567+static void *logfs_write_area(struct super_block *sb, void *_a,568568+ u16 *type, size_t *len)569569+{570570+ struct logfs_super *super = logfs_super(sb);571571+ struct logfs_area *area = super->s_area[super->s_sum_index];572572+ struct logfs_je_area *a = _a;573573+574574+ a->vim = VIM_DEFAULT;575575+ a->gc_level = super->s_sum_index;576576+ a->used_bytes = cpu_to_be32(area->a_used_bytes);577577+ a->segno = cpu_to_be32(area->a_segno);578578+ if (super->s_writesize > 1)579579+ write_wbuf(sb, area, a + 1);580580+581581+ *type = JE_AREA;582582+ *len = sizeof(*a) + super->s_writesize;583583+ return a;584584+}585585+586586+static void *logfs_write_commit(struct super_block *sb, void *h,587587+ u16 *type, size_t *len)588588+{589589+ struct logfs_super *super = logfs_super(sb);590590+591591+ *type = JE_COMMIT;592592+ *len = super->s_no_je * sizeof(__be64);593593+ return super->s_je_array;594594+}595595+596596+static size_t __logfs_write_je(struct super_block *sb, void *buf, u16 type,597597+ size_t len)598598+{599599+ struct logfs_super *super = logfs_super(sb);600600+ void *header = super->s_compressed_je;601601+ void *data = header + sizeof(struct logfs_journal_header);602602+ ssize_t compr_len, pad_len;603603+ u8 compr = COMPR_ZLIB;604604+605605+ if (len == 0)606606+ return logfs_write_header(super, header, 0, type);607607+608608+ compr_len = logfs_compress(buf, data, len, sb->s_blocksize);609609+ if (compr_len < 0 || type == JE_ANCHOR) {610610+ BUG_ON(len > sb->s_blocksize);611611+ memcpy(data, buf, len);612612+ compr_len = len;613613+ compr = 
COMPR_NONE;614614+ }615615+616616+ pad_len = ALIGN(compr_len, 16);617617+ memset(data + compr_len, 0, pad_len - compr_len);618618+619619+ return __logfs_write_header(super, header, compr_len, len, type, compr);620620+}621621+622622+static s64 logfs_get_free_bytes(struct logfs_area *area, size_t *bytes,623623+ int must_pad)624624+{625625+ u32 writesize = logfs_super(area->a_sb)->s_writesize;626626+ s32 ofs;627627+ int ret;628628+629629+ ret = logfs_open_area(area, *bytes);630630+ if (ret)631631+ return -EAGAIN;632632+633633+ ofs = area->a_used_bytes;634634+ area->a_used_bytes += *bytes;635635+636636+ if (must_pad) {637637+ area->a_used_bytes = ALIGN(area->a_used_bytes, writesize);638638+ *bytes = area->a_used_bytes - ofs;639639+ }640640+641641+ return dev_ofs(area->a_sb, area->a_segno, ofs);642642+}643643+644644+static int logfs_write_je_buf(struct super_block *sb, void *buf, u16 type,645645+ size_t buf_len)646646+{647647+ struct logfs_super *super = logfs_super(sb);648648+ struct logfs_area *area = super->s_journal_area;649649+ struct logfs_journal_header *jh = super->s_compressed_je;650650+ size_t len;651651+ int must_pad = 0;652652+ s64 ofs;653653+654654+ len = __logfs_write_je(sb, buf, type, buf_len);655655+ if (jh->h_type == cpu_to_be16(JE_COMMIT))656656+ must_pad = 1;657657+658658+ ofs = logfs_get_free_bytes(area, &len, must_pad);659659+ if (ofs < 0)660660+ return ofs;661661+ logfs_buf_write(area, ofs, super->s_compressed_je, len);662662+ super->s_je_array[super->s_no_je++] = cpu_to_be64(ofs);663663+ return 0;664664+}665665+666666+static int logfs_write_je(struct super_block *sb,667667+ void* (*write)(struct super_block *sb, void *scratch,668668+ u16 *type, size_t *len))669669+{670670+ void *buf;671671+ size_t len;672672+ u16 type;673673+674674+ buf = write(sb, logfs_super(sb)->s_je, &type, &len);675675+ return logfs_write_je_buf(sb, buf, type, len);676676+}677677+678678+int write_alias_journal(struct super_block *sb, u64 ino, u64 bix,679679+ level_t level, 
int child_no, __be64 val)680680+{681681+ struct logfs_super *super = logfs_super(sb);682682+ struct logfs_obj_alias *oa = super->s_je;683683+ int err = 0, fill = super->s_je_fill;684684+685685+ log_aliases("logfs_write_obj_aliases #%x(%llx, %llx, %x, %x) %llx\n",686686+ fill, ino, bix, level, child_no, be64_to_cpu(val));687687+ oa[fill].ino = cpu_to_be64(ino);688688+ oa[fill].bix = cpu_to_be64(bix);689689+ oa[fill].val = val;690690+ oa[fill].level = (__force u8)level;691691+ oa[fill].child_no = cpu_to_be16(child_no);692692+ fill++;693693+ if (fill >= sb->s_blocksize / sizeof(*oa)) {694694+ err = logfs_write_je_buf(sb, oa, JE_OBJ_ALIAS, sb->s_blocksize);695695+ fill = 0;696696+ }697697+698698+ super->s_je_fill = fill;699699+ return err;700700+}701701+702702+static int logfs_write_obj_aliases(struct super_block *sb)703703+{704704+ struct logfs_super *super = logfs_super(sb);705705+ int err;706706+707707+ log_journal("logfs_write_obj_aliases: %d aliases to write\n",708708+ super->s_no_object_aliases);709709+ super->s_je_fill = 0;710710+ err = logfs_write_obj_aliases_pagecache(sb);711711+ if (err)712712+ return err;713713+714714+ if (super->s_je_fill)715715+ err = logfs_write_je_buf(sb, super->s_je, JE_OBJ_ALIAS,716716+ super->s_je_fill717717+ * sizeof(struct logfs_obj_alias));718718+ return err;719719+}720720+721721+/*722722+ * Write all journal entries. The goto logic ensures that all journal entries723723+ * are written whenever a new segment is used. It is ugly and potentially a724724+ * bit wasteful, but robustness is more important. 
With this we can *always*
+ * erase all journal segments except the one containing the most recent commit.
+ */
+void logfs_write_anchor(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct logfs_super *super = logfs_super(sb);
+	struct logfs_area *area = super->s_journal_area;
+	int i, err;
+
+	BUG_ON(logfs_super(sb)->s_flags & LOGFS_SB_FLAG_SHUTDOWN);
+	mutex_lock(&super->s_journal_mutex);
+
+	/* Do this first or suffer corruption */
+	logfs_sync_segments(sb);
+	account_shadows(sb);
+
+again:
+	super->s_no_je = 0;
+	for_each_area(i) {
+		if (!super->s_area[i]->a_is_open)
+			continue;
+		super->s_sum_index = i;
+		err = logfs_write_je(sb, logfs_write_area);
+		if (err)
+			goto again;
+	}
+	err = logfs_write_obj_aliases(sb);
+	if (err)
+		goto again;
+	err = logfs_write_je(sb, logfs_write_erasecount);
+	if (err)
+		goto again;
+	err = logfs_write_je(sb, __logfs_write_anchor);
+	if (err)
+		goto again;
+	err = logfs_write_je(sb, logfs_write_dynsb);
+	if (err)
+		goto again;
+	/*
+	 * Order is imperative.  First we sync all writes, including the
+	 * non-committed journal writes.  Then we write the final commit and
+	 * sync the current journal segment.
+	 *
+	 * There is a theoretical bug here.  Syncing the journal segment will
+	 * write a number of journal entries and the final commit.  All these
+	 * are written in a single operation.  If the device layer writes the
+	 * data back-to-front, the commit will precede the other journal
+	 * entries, leaving a race window.
+	 * Two fixes are possible.  The preferred one is to fix the device
+	 * layer to ensure writes happen front-to-back.  Alternatively we can
+	 * insert another logfs_sync_area()/super->s_devops->sync() combo
+	 * before writing the commit.
+	 */
+	/*
+	 * On another subject, super->s_devops->sync is usually not necessary.
+	 * Unless called from sys_sync or friends, a barrier would suffice.
+	 */
+	super->s_devops->sync(sb);
+	err = logfs_write_je(sb, logfs_write_commit);
+	if (err)
+		goto again;
+	log_journal("Write commit to %llx\n",
+			be64_to_cpu(super->s_je_array[super->s_no_je - 1]));
+	logfs_sync_area(area);
+	BUG_ON(area->a_used_bytes != area->a_written_bytes);
+	super->s_devops->sync(sb);
+
+	mutex_unlock(&super->s_journal_mutex);
+}
+
+void do_logfs_journal_wl_pass(struct super_block *sb)
+{
+	struct logfs_super *super = logfs_super(sb);
+	struct logfs_area *area = super->s_journal_area;
+	u32 segno, ec;
+	int i, err;
+
+	log_journal("Journal requires wear-leveling.\n");
+	/* Drop old segments */
+	journal_for_each(i)
+		if (super->s_journal_seg[i]) {
+			logfs_set_segment_unreserved(sb,
+					super->s_journal_seg[i],
+					super->s_journal_ec[i]);
+			super->s_journal_seg[i] = 0;
+			super->s_journal_ec[i] = 0;
+		}
+	/* Get new segments */
+	for (i = 0; i < super->s_no_journal_segs; i++) {
+		segno = get_best_cand(sb, &super->s_reserve_list, &ec);
+		super->s_journal_seg[i] = segno;
+		super->s_journal_ec[i] = ec;
+		logfs_set_segment_reserved(sb, segno);
+	}
+	/* Manually move journal_area */
+	area->a_segno = super->s_journal_seg[0];
+	area->a_is_open = 0;
+	area->a_used_bytes = 0;
+	/* Write journal */
+	logfs_write_anchor(super->s_master_inode);
+	/* Write superblocks */
+	err = logfs_write_sb(sb);
+	BUG_ON(err);
+}
+
+static const struct logfs_area_ops journal_area_ops = {
+	.get_free_segment	= journal_get_free_segment,
+	.get_erase_count	= journal_get_erase_count,
+	.erase_segment		= journal_erase_segment,
+};
+
+int logfs_init_journal(struct super_block *sb)
+{
+	struct logfs_super *super = logfs_super(sb);
+	size_t bufsize = max_t(size_t, sb->s_blocksize, super->s_writesize)
+		+ MAX_JOURNAL_HEADER;
+	int ret = -ENOMEM;
+
+	mutex_init(&super->s_journal_mutex);
+	btree_init_mempool32(&super->s_reserved_segments, super->s_btree_pool);
+
+	super->s_je = kzalloc(bufsize, GFP_KERNEL);
+	if (!super->s_je)
+		return ret;
+
+	super->s_compressed_je = kzalloc(bufsize, GFP_KERNEL);
+	if (!super->s_compressed_je)
+		return ret;
+
+	super->s_master_inode = logfs_new_meta_inode(sb, LOGFS_INO_MASTER);
+	if (IS_ERR(super->s_master_inode))
+		return PTR_ERR(super->s_master_inode);
+
+	ret = logfs_read_journal(sb);
+	if (ret)
+		return -EIO;
+
+	reserve_sb_and_journal(sb);
+	logfs_calc_free(sb);
+
+	super->s_journal_area->a_ops = &journal_area_ops;
+	return 0;
+}
+
+void logfs_cleanup_journal(struct super_block *sb)
+{
+	struct logfs_super *super = logfs_super(sb);
+
+	btree_grim_visitor32(&super->s_reserved_segments, 0, NULL);
+	destroy_meta_inode(super->s_master_inode);
+	super->s_master_inode = NULL;
+
+	kfree(super->s_compressed_je);
+	kfree(super->s_je);
+}
+722
fs/logfs/logfs.h
/*
 * fs/logfs/logfs.h
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 *
 * Private header for logfs.
 */
#ifndef FS_LOGFS_LOGFS_H
#define FS_LOGFS_LOGFS_H

#undef __CHECK_ENDIAN__
#define __CHECK_ENDIAN__

#include <linux/btree.h>
#include <linux/crc32.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mempool.h>
#include <linux/pagemap.h>
#include <linux/mtd/mtd.h>
#include "logfs_abi.h"

#define LOGFS_DEBUG_SUPER	(0x0001)
#define LOGFS_DEBUG_SEGMENT	(0x0002)
#define LOGFS_DEBUG_JOURNAL	(0x0004)
#define LOGFS_DEBUG_DIR		(0x0008)
#define LOGFS_DEBUG_FILE	(0x0010)
#define LOGFS_DEBUG_INODE	(0x0020)
#define LOGFS_DEBUG_READWRITE	(0x0040)
#define LOGFS_DEBUG_GC		(0x0080)
#define LOGFS_DEBUG_GC_NOISY	(0x0100)
#define LOGFS_DEBUG_ALIASES	(0x0200)
#define LOGFS_DEBUG_BLOCKMOVE	(0x0400)
#define LOGFS_DEBUG_ALL		(0xffffffff)

#define LOGFS_DEBUG		(0x01)
/*
 * To enable specific log messages, simply define LOGFS_DEBUG to match any
 * or all of the above.
 */
#ifndef LOGFS_DEBUG
#define LOGFS_DEBUG		(0)
#endif

#define log_cond(cond, fmt, arg...) do {	\
	if (cond)				\
		printk(KERN_DEBUG fmt, ##arg);	\
} while (0)

#define log_super(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_SUPER, fmt, ##arg)
#define log_segment(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_SEGMENT, fmt, ##arg)
#define log_journal(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_JOURNAL, fmt, ##arg)
#define log_dir(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_DIR, fmt, ##arg)
#define log_file(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_FILE, fmt, ##arg)
#define log_inode(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_INODE, fmt, ##arg)
#define log_readwrite(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_READWRITE, fmt, ##arg)
#define log_gc(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_GC, fmt, ##arg)
#define log_gc_noisy(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_GC_NOISY, fmt, ##arg)
#define log_aliases(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_ALIASES, fmt, ##arg)
#define log_blockmove(fmt, arg...) \
	log_cond(LOGFS_DEBUG & LOGFS_DEBUG_BLOCKMOVE, fmt, ##arg)

#define PG_pre_locked		PG_owner_priv_1
#define PagePreLocked(page)	test_bit(PG_pre_locked, &(page)->flags)
#define SetPagePreLocked(page)	set_bit(PG_pre_locked, &(page)->flags)
#define ClearPagePreLocked(page) clear_bit(PG_pre_locked, &(page)->flags)

/* FIXME: This should really be somewhere in the 64bit area. */
#define LOGFS_LINK_MAX		(1<<30)

/* Read-only filesystem */
#define LOGFS_SB_FLAG_RO	0x0001
#define LOGFS_SB_FLAG_SEG_ALIAS	0x0002
#define LOGFS_SB_FLAG_OBJ_ALIAS	0x0004
#define LOGFS_SB_FLAG_SHUTDOWN	0x0008

/* Write Control Flags */
#define WF_LOCK			0x01 /* take write lock */
#define WF_WRITE		0x02 /* write block */
#define WF_DELETE		0x04 /* delete old block */

typedef u8 __bitwise level_t;
typedef u8 __bitwise gc_level_t;

#define LEVEL(level)		((__force level_t)(level))
#define GC_LEVEL(gc_level)	((__force gc_level_t)(gc_level))

#define SUBLEVEL(level)	( (void)((level) == LEVEL(1)),	\
		(__force level_t)((__force u8)(level) - 1) )

/**
 * struct logfs_area - area management information
 *
 * @a_sb:		the superblock this area belongs to
 * @a_is_open:		1 if the area is currently open, else 0
 * @a_segno:		segment number of area
 * @a_written_bytes:	number of bytes already written back
 * @a_used_bytes:	number of used bytes
 * @a_ops:		area operations (either journal or ostore)
 * @a_erase_count:	erase count
 * @a_level:		GC level
 */
struct logfs_area { /* a segment open for writing */
	struct super_block *a_sb;
	int	a_is_open;
	u32	a_segno;
	u32	a_written_bytes;
	u32	a_used_bytes;
	const struct logfs_area_ops *a_ops;
	u32	a_erase_count;
	gc_level_t a_level;
};

/**
 * struct logfs_area_ops - area operations
 *
 * @get_free_segment:	fill area->ofs with the offset of a free segment
 * @get_erase_count:	fill area->erase_count (needs area->ofs)
 * @erase_segment:	erase and setup segment
 */
struct logfs_area_ops {
	void	(*get_free_segment)(struct logfs_area *area);
	void	(*get_erase_count)(struct logfs_area *area);
	int	(*erase_segment)(struct logfs_area *area);
};

/**
 * struct logfs_device_ops - device access operations
 *
 * @find_first_sb:	return the page containing the first superblock
 * @find_last_sb:	return the page containing the last superblock
 * @write_sb:		write a superblock page back to the device
 * @readpage:		read one page (mm page)
 * @writeseg:		write one segment; may be a partial segment
 * @erase:		erase part of the device
 * @sync:		wait for outstanding writes to finish
 * @put_device:		release the underlying device
 */
struct logfs_device_ops {
	struct page *(*find_first_sb)(struct super_block *sb, u64 *ofs);
	struct page *(*find_last_sb)(struct super_block *sb, u64 *ofs);
	int (*write_sb)(struct super_block *sb, struct page *page);
	int (*readpage)(void *_sb, struct page *page);
	void (*writeseg)(struct super_block *sb, u64 ofs, size_t len);
	int (*erase)(struct super_block *sb, loff_t ofs, size_t len);
	void (*sync)(struct super_block *sb);
	void (*put_device)(struct super_block *sb);
};

/**
 * struct candidate_list - list of similar candidates
 */
struct candidate_list {
	struct rb_root rb_tree;
	int count;
	int maxcount;
	int sort_by_ec;
};

/**
 * struct gc_candidate - "candidate" segment to be garbage collected next
 *
 * @rb_node:	tree node within a candidate_list
 * @list:	list (either free or low)
 * @segno:	segment number
 * @valid:	number of valid bytes
 * @erase_count: erase count of segment
 * @dist:	distance from tree root
 *
 * Candidates can be on two lists.  The free list contains electees rather
 * than candidates - segments that no longer contain any valid data.  The
 * low list contains candidates to be picked for GC.  It should be kept
 * short.  It is not required to always pick a perfect candidate.  In the
 * worst case GC will have to move more data than absolutely necessary.
 */
struct gc_candidate {
	struct rb_node rb_node;
	struct candidate_list *list;
	u32	segno;
	u32	valid;
	u32	erase_count;
	u8	dist;
};

/**
 * struct logfs_journal_entry - temporary structure used during journal scan
 *
 * @used:	1 if the entry is in use, else 0
 * @version:	normalized version
 * @len:	length
 * @datalen:	length of payload
 * @offset:	offset
 */
struct logfs_journal_entry {
	int used;
	s16 version;
	u16 len;
	u16 datalen;
	u64 offset;
};

enum transaction_state {
	CREATE_1 = 1,
	CREATE_2,
	UNLINK_1,
	UNLINK_2,
	CROSS_RENAME_1,
	CROSS_RENAME_2,
	TARGET_RENAME_1,
	TARGET_RENAME_2,
	TARGET_RENAME_3
};

/**
 * struct logfs_transaction - essential fields to support atomic dirops
 *
 * @state:	current state, see enum transaction_state
 * @ino:	target inode
 * @dir:	inode of directory containing dentry
 * @pos:	pos of dentry in directory
 */
struct logfs_transaction {
	enum transaction_state state;
	u64	 ino;
	u64	 dir;
	u64	 pos;
};

/**
 * struct logfs_shadow - old block in the shadow of a not-yet-committed new one
 * @old_ofs:	offset of old block on medium
 * @new_ofs:	offset of new block on medium
 * @ino:	inode number
 * @bix:	block index
 * @old_len:	size of old block, including header
 * @new_len:	size of new block, including header
 * @gc_level:	GC level of block
 */
struct logfs_shadow {
	u64 old_ofs;
	u64 new_ofs;
	u64 ino;
	u64 bix;
	int old_len;
	int new_len;
	gc_level_t gc_level;
};

/**
 * struct shadow_tree
 * @new:	shadows where old_ofs==0, indexed by new_ofs
 * @old:	shadows where old_ofs!=0, indexed by old_ofs
 */
struct shadow_tree {
	struct btree_head64 new;
	struct btree_head64 old;
};

struct object_alias_item {
	struct list_head list;
	__be64 val;
	int child_no;
};

/**
 * struct logfs_block - contains any block state
 * @type:	indirect block or inode
 * @full:	number of fully populated children
 * @partial:	number of partially populated children
 *
 * Most blocks are directly represented by page cache pages.  But when a block
 * becomes dirty, is part of a transaction, contains aliases or is otherwise
 * special, a struct logfs_block is allocated to track the additional state.
 * Inodes are very similar to indirect blocks, so they can also get one of
 * these structures added when appropriate.
 */
#define BLOCK_INDIRECT	1	/* Indirect block */
#define BLOCK_INODE	2	/* Inode */
struct logfs_block_ops;
struct logfs_block {
	struct list_head alias_list;
	struct list_head item_list;
	struct super_block *sb;
	u64 ino;
	u64 bix;
	level_t level;
	struct page *page;
	struct inode *inode;
	struct logfs_transaction *ta;
	unsigned long alias_map[LOGFS_BLOCK_FACTOR / BITS_PER_LONG];
	struct logfs_block_ops *ops;
	int full;
	int partial;
	int reserved_bytes;
};

typedef int write_alias_t(struct super_block *sb, u64 ino, u64 bix,
		level_t level, int child_no, __be64 val);
struct logfs_block_ops {
	void	(*write_block)(struct logfs_block *block);
	gc_level_t	(*block_level)(struct logfs_block *block);
	void	(*free_block)(struct super_block *sb, struct logfs_block *block);
	int	(*write_alias)(struct super_block *sb,
			struct logfs_block *block,
			write_alias_t *write_one_alias);
};

struct logfs_super {
	struct mtd_info *s_mtd;			/* underlying device */
	struct block_device *s_bdev;		/* underlying device */
	const struct logfs_device_ops *s_devops;/* device access */
	struct inode	*s_master_inode;	/* inode file */
	struct inode	*s_segfile_inode;	/* segment file */
	struct inode	*s_mapping_inode;	/* device mapping */
	atomic_t	 s_pending_writes;	/* outstanding bios */
	long		 s_flags;
	mempool_t	*s_btree_pool;		/* for btree nodes */
	mempool_t	*s_alias_pool;		/* aliases in segment.c */
	u64		 s_feature_incompat;
	u64		 s_feature_ro_compat;
	u64		 s_feature_compat;
	u64		 s_feature_flags;
	u64		 s_sb_ofs[2];
	/* alias.c fields */
	struct btree_head32 s_segment_alias;	/* remapped segments */
	int		 s_no_object_aliases;
	struct list_head s_object_alias;	/* remapped objects */
	struct btree_head128 s_object_alias_tree; /* remapped objects */
	struct mutex	 s_object_alias_mutex;
	/* dir.c fields */
	struct mutex	 s_dirop_mutex;		/* for creat/unlink/rename */
	u64		 s_victim_ino;		/* used for atomic dir-ops */
	u64		 s_rename_dir;		/* source directory ino */
	u64		 s_rename_pos;		/* position of source dd */
	/* gc.c fields */
	long		 s_segsize;		/* size of a segment */
	int		 s_segshift;		/* log2 of segment size */
	long		 s_segmask;		/* (1 << s_segshift) - 1 */
	long		 s_no_segs;		/* segments on device */
	long		 s_no_journal_segs;	/* segments used for journal */
	long		 s_no_blocks;		/* blocks per segment */
	long		 s_writesize;		/* minimum write size */
	int		 s_writeshift;		/* log2 of write size */
	u64		 s_size;		/* filesystem size */
	struct logfs_area *s_area[LOGFS_NO_AREAS]; /* open segment array */
	u64		 s_gec;			/* global erase count */
	u64		 s_wl_gec_ostore;	/* time of last wl event */
	u64		 s_wl_gec_journal;	/* time of last wl event */
	u64		 s_sweeper;		/* current sweeper pos */
	u8		 s_ifile_levels;	/* max level of ifile */
	u8		 s_iblock_levels;	/* max level of regular files */
	u8		 s_data_levels;		/* # of segments to leaf block */
	u8		 s_total_levels;	/* sum of above three */
	struct btree_head32 s_cand_tree;	/* all candidates */
	struct candidate_list s_free_list;	/* 100% free segments */
	struct candidate_list s_reserve_list;	/* Bad segment reserve */
	struct candidate_list s_low_list[LOGFS_NO_AREAS]; /* good candidates */
	struct candidate_list s_ec_list;	/* wear level candidates */
	struct btree_head32 s_reserved_segments; /* sb, journal, bad, etc. */
	/* inode.c fields */
	u64		 s_last_ino;		/* highest ino used */
	long		 s_inos_till_wrap;
	u32		 s_generation;		/* i_generation for new files */
	struct list_head s_freeing_list;	/* inodes being freed */
	/* journal.c fields */
	struct mutex	 s_journal_mutex;
	void		*s_je;			/* journal entry to compress */
	void		*s_compressed_je;	/* block to write to journal */
	u32		 s_journal_seg[LOGFS_JOURNAL_SEGS]; /* journal segments */
	u32		 s_journal_ec[LOGFS_JOURNAL_SEGS]; /* journal erasecounts */
	u64		 s_last_version;
	struct logfs_area *s_journal_area;	/* open journal segment */
	__be64		 s_je_array[64];
	int		 s_no_je;

	int		 s_sum_index;		/* for the 12 summaries */
	struct shadow_tree s_shadow_tree;
	int		 s_je_fill;		/* index of current je */
	/* readwrite.c fields */
	struct mutex	 s_write_mutex;
	int		 s_lock_count;
	mempool_t	*s_block_pool;		/* struct logfs_block pool */
	mempool_t	*s_shadow_pool;		/* struct logfs_shadow pool */
	/*
	 * Space accounting:
	 * - s_used_bytes specifies space used to store valid data objects.
	 * - s_dirty_used_bytes is space used to store non-committed data
	 *   objects.  Those objects have already been written themselves,
	 *   but they don't become valid until all indirect blocks up to the
	 *   journal have been written as well.
	 * - s_dirty_free_bytes is space used to store the old copy of a
	 *   replaced object, as long as the replacement is non-committed.
	 *   In other words, it is the amount of space freed when all dirty
	 *   blocks are written back.
	 * - s_free_bytes is the amount of free space available for any
	 *   purpose.
	 * - s_root_reserve is the amount of free space available only to
	 *   the root user.  Non-privileged users can no longer write once
	 *   this watermark has been reached.
	 * - s_speed_reserve is space which remains unused to speed up
	 *   garbage collection performance.
	 * - s_dirty_pages is the space reserved for currently dirty pages.
	 *   It is a pessimistic estimate, so some/most will get freed on
	 *   page writeback.
	 *
	 * s_used_bytes + s_free_bytes + s_speed_reserve = total usable size
	 */
	u64		 s_free_bytes;
	u64		 s_used_bytes;
	u64		 s_dirty_free_bytes;
	u64		 s_dirty_used_bytes;
	u64		 s_root_reserve;
	u64		 s_speed_reserve;
	u64		 s_dirty_pages;
	/* Bad block handling:
	 * - s_bad_seg_reserve is a number of segments usually kept
	 *   free.  When encountering bad blocks, the affected segment's data
	 *   is _temporarily_ moved to a reserved segment.
	 * - s_bad_segments is the number of known bad segments.
	 */
	u32		 s_bad_seg_reserve;
	u32		 s_bad_segments;
};

/**
 * struct logfs_inode - in-memory inode
 *
 * @vfs_inode:		struct inode
 * @li_data:		data pointers
 * @li_used_bytes:	number of used bytes
 * @li_freeing_list:	used to track inodes currently being freed
 * @li_block:		block state, if one has been allocated
 * @li_flags:		inode flags
 * @li_height:		height of this inode's block tree
 * @li_refcount:	number of internal (GC-induced) references
 */
struct logfs_inode {
	struct inode vfs_inode;
	u64	li_data[LOGFS_EMBEDDED_FIELDS];
	u64	li_used_bytes;
	struct list_head li_freeing_list;
	struct logfs_block *li_block;
	u32	li_flags;
	u8	li_height;
	int	li_refcount;
};

#define journal_for_each(__i) for (__i = 0; __i < LOGFS_JOURNAL_SEGS; __i++)
#define for_each_area(__i) for (__i = 0; __i < LOGFS_NO_AREAS; __i++)
#define for_each_area_down(__i) for (__i = LOGFS_NO_AREAS - 1; __i >= 0; __i--)

/* compr.c */
int logfs_compress(void *in, void *out, size_t inlen, size_t outlen);
int logfs_uncompress(void *in, void *out, size_t inlen, size_t outlen);
int __init logfs_compr_init(void);
void logfs_compr_exit(void);

/* dev_bdev.c */
#ifdef CONFIG_BLOCK
int logfs_get_sb_bdev(struct file_system_type *type, int flags,
		const char *devname, struct vfsmount *mnt);
#else
static inline int logfs_get_sb_bdev(struct file_system_type *type, int flags,
		const char *devname, struct vfsmount *mnt)
{
	return -ENODEV;
}
#endif

/* dev_mtd.c */
#ifdef CONFIG_MTD
int logfs_get_sb_mtd(struct file_system_type *type, int flags,
		int mtdnr, struct vfsmount *mnt);
#else
static inline int logfs_get_sb_mtd(struct file_system_type *type, int flags,
		int mtdnr, struct vfsmount *mnt)
{
	return -ENODEV;
}
#endif

/* dir.c */
extern const struct inode_operations logfs_symlink_iops;
extern const struct inode_operations logfs_dir_iops;
extern const struct file_operations logfs_dir_fops;
int logfs_replay_journal(struct super_block *sb);

/* file.c */
extern const struct inode_operations logfs_reg_iops;
extern const struct file_operations logfs_reg_fops;
extern const struct address_space_operations logfs_reg_aops;
int logfs_readpage(struct file *file, struct page *page);
int logfs_ioctl(struct inode *inode, struct file *file, unsigned int cmd,
		unsigned long arg);
int logfs_fsync(struct file *file, struct dentry *dentry, int datasync);

/* gc.c */
u32 get_best_cand(struct super_block *sb, struct candidate_list *list, u32 *ec);
void logfs_gc_pass(struct super_block *sb);
int logfs_check_areas(struct super_block *sb);
int logfs_init_gc(struct super_block *sb);
void logfs_cleanup_gc(struct super_block *sb);

/* inode.c */
extern const struct super_operations logfs_super_operations;
struct inode *logfs_iget(struct super_block *sb, ino_t ino);
struct inode *logfs_safe_iget(struct super_block *sb, ino_t ino, int *cookie);
void logfs_safe_iput(struct inode *inode, int cookie);
struct inode *logfs_new_inode(struct inode *dir, int mode);
struct inode *logfs_new_meta_inode(struct super_block *sb, u64 ino);
struct inode *logfs_read_meta_inode(struct super_block *sb, u64 ino);
int logfs_init_inode_cache(void);
void logfs_destroy_inode_cache(void);
void destroy_meta_inode(struct inode *inode);
void logfs_set_blocks(struct inode *inode, u64 no);
/* these logically belong into inode.c but actually reside in readwrite.c */
int logfs_read_inode(struct inode *inode);
int __logfs_write_inode(struct inode *inode, long flags);
void logfs_delete_inode(struct inode *inode);
void logfs_clear_inode(struct inode *inode);

/* journal.c */
void logfs_write_anchor(struct inode *inode);
int logfs_init_journal(struct super_block *sb);
void logfs_cleanup_journal(struct super_block *sb);
int write_alias_journal(struct super_block *sb, u64 ino, u64 bix,
		level_t level, int child_no, __be64 val);
void do_logfs_journal_wl_pass(struct super_block *sb);

/* readwrite.c */
pgoff_t logfs_pack_index(u64 bix, level_t level);
void logfs_unpack_index(pgoff_t index, u64 *bix, level_t *level);
int logfs_inode_write(struct inode *inode, const void *buf, size_t count,
		loff_t bix, long flags, struct shadow_tree *shadow_tree);
int logfs_readpage_nolock(struct page *page);
int logfs_write_buf(struct inode *inode, struct page *page, long flags);
int logfs_delete(struct inode *inode, pgoff_t index,
		struct shadow_tree *shadow_tree);
int logfs_rewrite_block(struct inode *inode, u64 bix, u64 ofs,
		gc_level_t gc_level, long flags);
int logfs_is_valid_block(struct super_block *sb, u64 ofs, u64 ino, u64 bix,
		gc_level_t gc_level);
int logfs_truncate(struct inode *inode, u64 size);
u64 logfs_seek_hole(struct inode *inode, u64 bix);
u64 logfs_seek_data(struct inode *inode, u64 bix);
int logfs_open_segfile(struct super_block *sb);
int logfs_init_rw(struct super_block *sb);
void logfs_cleanup_rw(struct super_block *sb);
void logfs_add_transaction(struct inode *inode, struct logfs_transaction *ta);
void logfs_del_transaction(struct inode *inode, struct logfs_transaction *ta);
void logfs_write_block(struct logfs_block *block, long flags);
int logfs_write_obj_aliases_pagecache(struct super_block *sb);
void logfs_get_segment_entry(struct super_block *sb, u32 segno,
		struct logfs_segment_entry *se);
void logfs_set_segment_used(struct super_block *sb, u64 ofs, int increment);
void logfs_set_segment_erased(struct super_block *sb, u32 segno, u32 ec,
		gc_level_t gc_level);
void logfs_set_segment_reserved(struct super_block *sb, u32 segno);
void logfs_set_segment_unreserved(struct super_block *sb, u32 segno, u32 ec);
struct logfs_block *__alloc_block(struct super_block *sb,
		u64 ino, u64 bix, level_t level);
void __free_block(struct super_block *sb, struct logfs_block *block);
void btree_write_block(struct logfs_block *block);
void initialize_block_counters(struct page *page, struct logfs_block *block,
		__be64 *array, int page_is_empty);
int logfs_exist_block(struct inode *inode, u64 bix);
int get_page_reserve(struct inode *inode, struct page *page);
extern struct logfs_block_ops indirect_block_ops;

/* segment.c */
int logfs_erase_segment(struct super_block *sb, u32 ofs);
int wbuf_read(struct super_block *sb, u64 ofs, size_t len, void *buf);
int logfs_segment_read(struct inode *inode, struct page *page, u64 ofs, u64 bix,
		level_t level);
int logfs_segment_write(struct inode *inode, struct page *page,
		struct logfs_shadow *shadow);
int logfs_segment_delete(struct inode *inode, struct logfs_shadow *shadow);
int logfs_load_object_aliases(struct super_block *sb,
		struct logfs_obj_alias *oa, int count);
void move_page_to_btree(struct page *page);
int logfs_init_mapping(struct super_block *sb);
void logfs_sync_area(struct logfs_area *area);
void logfs_sync_segments(struct super_block *sb);

/* area handling */
int logfs_init_areas(struct super_block *sb);
void logfs_cleanup_areas(struct super_block *sb);
int logfs_open_area(struct logfs_area *area, size_t bytes);
void __logfs_buf_write(struct logfs_area *area, u64 ofs, void *buf, size_t len,
		int use_filler);

static inline void logfs_buf_write(struct logfs_area *area, u64 ofs,
		void *buf, size_t len)
{
	__logfs_buf_write(area, ofs, buf, len, 0);
}

static inline void logfs_buf_recover(struct logfs_area *area, u64 ofs,
		void *buf, size_t len)
{
	__logfs_buf_write(area, ofs, buf, len, 1);
}

/* super.c */
struct page *emergency_read_begin(struct address_space *mapping, pgoff_t index);
void emergency_read_end(struct page *page);
void logfs_crash_dump(struct super_block *sb);
void *memchr_inv(const void *s, int c, size_t n);
int logfs_statfs(struct dentry *dentry, struct kstatfs *stats);
int logfs_get_sb_device(struct file_system_type *type, int flags,
		struct mtd_info *mtd, struct block_device *bdev,
		const struct logfs_device_ops *devops, struct vfsmount *mnt);
int logfs_check_ds(struct logfs_disk_super *ds);
int logfs_write_sb(struct super_block *sb);

static inline struct logfs_super *logfs_super(struct super_block *sb)
{
	return sb->s_fs_info;
}

static inline struct logfs_inode *logfs_inode(struct inode *inode)
{
	return container_of(inode, struct logfs_inode, vfs_inode);
}

static inline void logfs_set_ro(struct super_block *sb)
{
	logfs_super(sb)->s_flags |= LOGFS_SB_FLAG_RO;
}

#define LOGFS_BUG(sb) do {					\
	struct super_block *__sb = sb;				\
	logfs_crash_dump(__sb);					\
	logfs_super(__sb)->s_flags |= LOGFS_SB_FLAG_RO;		\
	BUG();							\
} while (0)

#define LOGFS_BUG_ON(condition, sb) \
	do { if (unlikely(condition)) LOGFS_BUG((sb)); } while (0)

static inline __be32 logfs_crc32(void *data, size_t len, size_t skip)
{
	return cpu_to_be32(crc32(~0, data+skip, len-skip));
}

static inline u8 logfs_type(struct inode *inode)
{
	return (inode->i_mode >> 12) & 15;
}

static inline pgoff_t logfs_index(struct super_block *sb, u64 pos)
{
	return pos >> sb->s_blocksize_bits;
}

static inline u64 dev_ofs(struct super_block *sb, u32 segno, u32 ofs)
{
	return ((u64)segno << logfs_super(sb)->s_segshift) + ofs;
}

static inline u32 seg_no(struct super_block *sb, u64 ofs)
{
	return ofs >> logfs_super(sb)->s_segshift;
}

static inline u32 seg_ofs(struct super_block *sb, u64 ofs)
{
	return ofs & logfs_super(sb)->s_segmask;
}

static inline u64 seg_align(struct super_block *sb, u64 ofs)
{
	return ofs & ~logfs_super(sb)->s_segmask;
}

static inline struct logfs_block *logfs_block(struct page *page)
{
	return (void *)page->private;
}

static inline level_t shrink_level(gc_level_t __level)
{
	u8 level = (__force u8)__level;

	if (level >= LOGFS_MAX_LEVELS)
		level -= LOGFS_MAX_LEVELS;
	return (__force level_t)level;
}

static inline gc_level_t expand_level(u64 ino, level_t __level)
{
	u8 level = (__force u8)__level;

	if (ino == LOGFS_INO_MASTER) {
		/* ifile has separate areas */
		level += LOGFS_MAX_LEVELS;
	}
	return (__force gc_level_t)level;
}

static inline int logfs_block_shift(struct super_block *sb, level_t level)
{
	level = shrink_level((__force gc_level_t)level);
	return (__force int)level * (sb->s_blocksize_bits - 3);
}

static inline u64 logfs_block_mask(struct super_block *sb, level_t level)
{
	return ~0ull << logfs_block_shift(sb, level);
}

static inline struct logfs_area *get_area(struct super_block *sb,
		gc_level_t gc_level)
{
	return logfs_super(sb)->s_area[(__force u8)gc_level];
}

#endif
fs/logfs/logfs_abi.h (new file, 627 lines added)
···11+/*22+ * fs/logfs/logfs_abi.h33+ *44+ * As should be obvious for Linux kernel code, license is GPLv255+ *66+ * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>77+ *88+ * Public header for logfs.99+ */1010+#ifndef FS_LOGFS_LOGFS_ABI_H1111+#define FS_LOGFS_LOGFS_ABI_H1212+1313+/* For out-of-kernel compiles */1414+#ifndef BUILD_BUG_ON1515+#define BUILD_BUG_ON(condition) /**/1616+#endif1717+1818+#define SIZE_CHECK(type, size) \1919+static inline void check_##type(void) \2020+{ \2121+ BUILD_BUG_ON(sizeof(struct type) != (size)); \2222+}2323+2424+/*2525+ * Throughout the logfs code, we're constantly dealing with blocks at2626+ * various positions or offsets. To remove confusion, we stricly2727+ * distinguish between a "position" - the logical position within a2828+ * file and an "offset" - the physical location within the device.2929+ *3030+ * Any usage of the term offset for a logical location or position for3131+ * a physical one is a bug and should get fixed.3232+ */3333+3434+/*3535+ * Block are allocated in one of several segments depending on their3636+ * level. The following levels are used:3737+ * 0 - regular data block3838+ * 1 - i1 indirect blocks3939+ * 2 - i2 indirect blocks4040+ * 3 - i3 indirect blocks4141+ * 4 - i4 indirect blocks4242+ * 5 - i5 indirect blocks4343+ * 6 - ifile data blocks4444+ * 7 - ifile i1 indirect blocks4545+ * 8 - ifile i2 indirect blocks4646+ * 9 - ifile i3 indirect blocks4747+ * 10 - ifile i4 indirect blocks4848+ * 11 - ifile i5 indirect blocks4949+ * Potential levels to be used in the future:5050+ * 12 - gc recycled blocks, long-lived data5151+ * 13 - replacement blocks, short-lived data5252+ *5353+ * Levels 1-11 are necessary for robust gc operations and help seperate5454+ * short-lived metadata from longer-lived file data. In the future,5555+ * file data should get seperated into several segments based on simple5656+ * heuristics. Old data recycled during gc operation is expected to be5757+ * long-lived. 
New data is of uncertain life expectancy. New data5858+ * used to replace older blocks in existing files is expected to be5959+ * short-lived.6060+ */6161+6262+6363+/* Magic numbers. 64bit for superblock, 32bit for statfs f_type */6464+#define LOGFS_MAGIC 0xb21f205ac97e8168ull6565+#define LOGFS_MAGIC_U32 0xc97e8168u6666+6767+/*6868+ * Various blocksize related macros. Blocksize is currently fixed at 4KiB.6969+ * Sooner or later that should become configurable and the macros replaced7070+ * by something superblock-dependent. Pointers in indirect blocks are and7171+ * will remain 64bit.7272+ *7373+ * LOGFS_BLOCKSIZE - self-explaining7474+ * LOGFS_BLOCK_FACTOR - number of pointers per indirect block7575+ * LOGFS_BLOCK_BITS - log2 of LOGFS_BLOCK_FACTOR, used for shifts7676+ */7777+#define LOGFS_BLOCKSIZE (4096ull)7878+#define LOGFS_BLOCK_FACTOR (LOGFS_BLOCKSIZE / sizeof(u64))7979+#define LOGFS_BLOCK_BITS (9)8080+8181+/*8282+ * Number of blocks at various levels of indirection. There are 16 direct8383+ * block pointers plus a single indirect pointer.8484+ */8585+#define I0_BLOCKS (16)8686+#define I1_BLOCKS LOGFS_BLOCK_FACTOR8787+#define I2_BLOCKS (LOGFS_BLOCK_FACTOR * I1_BLOCKS)8888+#define I3_BLOCKS (LOGFS_BLOCK_FACTOR * I2_BLOCKS)8989+#define I4_BLOCKS (LOGFS_BLOCK_FACTOR * I3_BLOCKS)9090+#define I5_BLOCKS (LOGFS_BLOCK_FACTOR * I4_BLOCKS)9191+9292+#define INDIRECT_INDEX I0_BLOCKS9393+#define LOGFS_EMBEDDED_FIELDS (I0_BLOCKS + 1)9494+9595+/*9696+ * Sizes at which files require another level of indirection. 
Files smaller9797+ * than LOGFS_EMBEDDED_SIZE can be completely stored in the inode itself,9898+ * similar like ext2 fast symlinks.9999+ *100100+ * Data at a position smaller than LOGFS_I0_SIZE is accessed through the101101+ * direct pointers, else through the 1x indirect pointer and so forth.102102+ */103103+#define LOGFS_EMBEDDED_SIZE (LOGFS_EMBEDDED_FIELDS * sizeof(u64))104104+#define LOGFS_I0_SIZE (I0_BLOCKS * LOGFS_BLOCKSIZE)105105+#define LOGFS_I1_SIZE (I1_BLOCKS * LOGFS_BLOCKSIZE)106106+#define LOGFS_I2_SIZE (I2_BLOCKS * LOGFS_BLOCKSIZE)107107+#define LOGFS_I3_SIZE (I3_BLOCKS * LOGFS_BLOCKSIZE)108108+#define LOGFS_I4_SIZE (I4_BLOCKS * LOGFS_BLOCKSIZE)109109+#define LOGFS_I5_SIZE (I5_BLOCKS * LOGFS_BLOCKSIZE)110110+111111+/*112112+ * Each indirect block pointer must have this flag set, if all block pointers113113+ * behind it are set, i.e. there is no hole hidden in the shadow of this114114+ * indirect block pointer.115115+ */116116+#define LOGFS_FULLY_POPULATED (1ULL << 63)117117+#define pure_ofs(ofs) (ofs & ~LOGFS_FULLY_POPULATED)118118+119119+/*120120+ * LogFS needs to seperate data into levels. Each level is defined as the121121+ * maximal possible distance from the master inode (inode of the inode file).122122+ * Data blocks reside on level 0, 1x indirect block on level 1, etc.123123+ * Inodes reside on level 6, indirect blocks for the inode file on levels 7-11.124124+ * This effort is necessary to guarantee garbage collection to always make125125+ * progress.126126+ *127127+ * LOGFS_MAX_INDIRECT is the maximal indirection through indirect blocks,128128+ * LOGFS_MAX_LEVELS is one more for the actual data level of a file. 
It is129129+ * the maximal number of levels for one file.130130+ * LOGFS_NO_AREAS is twice that, as the inode file and regular files are131131+ * effectively stacked on top of each other.132132+ */133133+#define LOGFS_MAX_INDIRECT (5)134134+#define LOGFS_MAX_LEVELS (LOGFS_MAX_INDIRECT + 1)135135+#define LOGFS_NO_AREAS (2 * LOGFS_MAX_LEVELS)136136+137137+/* Maximum size of filenames */138138+#define LOGFS_MAX_NAMELEN (255)139139+140140+/* Number of segments in the primary journal. */141141+#define LOGFS_JOURNAL_SEGS (16)142142+143143+/* Maximum number of free/erased/etc. segments in journal entries */144144+#define MAX_CACHED_SEGS (64)145145+146146+147147+/*148148+ * LOGFS_OBJECT_HEADERSIZE is the size of a single header in the object store,149149+ * LOGFS_MAX_OBJECTSIZE the size of the largest possible object, including150150+ * its header,151151+ * LOGFS_SEGMENT_RESERVE is the amount of space reserved for each segment for152152+ * its segment header and the padded space at the end when no further objects153153+ * fit.154154+ */155155+#define LOGFS_OBJECT_HEADERSIZE (0x1c)156156+#define LOGFS_SEGMENT_HEADERSIZE (0x18)157157+#define LOGFS_MAX_OBJECTSIZE (LOGFS_OBJECT_HEADERSIZE + LOGFS_BLOCKSIZE)158158+#define LOGFS_SEGMENT_RESERVE \159159+ (LOGFS_SEGMENT_HEADERSIZE + LOGFS_MAX_OBJECTSIZE - 1)160160+161161+/*162162+ * Segment types:163163+ * SEG_SUPER - Data or indirect block164164+ * SEG_JOURNAL - Inode165165+ * SEG_OSTORE - Dentry166166+ */167167+enum {168168+ SEG_SUPER = 0x01,169169+ SEG_JOURNAL = 0x02,170170+ SEG_OSTORE = 0x03,171171+};172172+173173+/**174174+ * struct logfs_segment_header - per-segment header in the ostore175175+ *176176+ * @crc: crc32 of header (there is no data)177177+ * @pad: unused, must be 0178178+ * @type: segment type, see above179179+ * @level: GC level for all objects in this segment180180+ * @segno: segment number181181+ * @ec: erase count for this segment182182+ * @gec: global erase count at time of writing183183+ */184184+struct 
logfs_segment_header {185185+ __be32 crc;186186+ __be16 pad;187187+ __u8 type;188188+ __u8 level;189189+ __be32 segno;190190+ __be32 ec;191191+ __be64 gec;192192+};193193+194194+SIZE_CHECK(logfs_segment_header, LOGFS_SEGMENT_HEADERSIZE);195195+196196+/**197197+ * struct logfs_disk_super - on-medium superblock198198+ *199199+ * @ds_magic: magic number, must equal LOGFS_MAGIC200200+ * @ds_crc: crc32 of structure starting with the next field201201+ * @ds_ifile_levels: maximum number of levels for ifile202202+ * @ds_iblock_levels: maximum number of levels for regular files203203+ * @ds_data_levels: number of separate levels for data204204+ * @pad0: reserved, must be 0205205+ * @ds_feature_incompat: incompatible filesystem features206206+ * @ds_feature_ro_compat: read-only compatible filesystem features207207+ * @ds_feature_compat: compatible filesystem features208208+ * @ds_flags: flags209209+ * @ds_segment_shift: log2 of segment size210210+ * @ds_block_shift: log2 of block size211211+ * @ds_write_shift: log2 of write size212212+ * @pad1: reserved, must be 0213213+ * @ds_journal_seg: segments used by primary journal214214+ * @ds_root_reserve: bytes reserved for the superuser215215+ * @ds_speed_reserve: bytes reserved to speed up GC216216+ * @ds_bad_seg_reserve: number of segments reserved to handle bad blocks217217+ * @pad2: reserved, must be 0218218+ * @pad3: reserved, must be 0219219+ *220220+ * Contains only read-only fields. 
Read-write fields like the amount of used221221+ * space are tracked in the dynamic superblock, which is stored in the journal.222222+ */223223+struct logfs_disk_super {224224+ struct logfs_segment_header ds_sh;225225+ __be64 ds_magic;226226+227227+ __be32 ds_crc;228228+ __u8 ds_ifile_levels;229229+ __u8 ds_iblock_levels;230230+ __u8 ds_data_levels;231231+ __u8 ds_segment_shift;232232+ __u8 ds_block_shift;233233+ __u8 ds_write_shift;234234+ __u8 pad0[6];235235+236236+ __be64 ds_filesystem_size;237237+ __be32 ds_segment_size;238238+ __be32 ds_bad_seg_reserve;239239+240240+ __be64 ds_feature_incompat;241241+ __be64 ds_feature_ro_compat;242242+243243+ __be64 ds_feature_compat;244244+ __be64 ds_feature_flags;245245+246246+ __be64 ds_root_reserve;247247+ __be64 ds_speed_reserve;248248+249249+ __be32 ds_journal_seg[LOGFS_JOURNAL_SEGS];250250+251251+ __be64 ds_super_ofs[2];252252+ __be64 pad3[8];253253+};254254+255255+SIZE_CHECK(logfs_disk_super, 256);256256+257257+/*258258+ * Object types:259259+ * OBJ_BLOCK - Data or indirect block260260+ * OBJ_INODE - Inode261261+ * OBJ_DENTRY - Dentry262262+ */263263+enum {264264+ OBJ_BLOCK = 0x04,265265+ OBJ_INODE = 0x05,266266+ OBJ_DENTRY = 0x06,267267+};268268+269269+/**270270+ * struct logfs_object_header - per-object header in the ostore271271+ *272272+ * @crc: crc32 of header, excluding data_crc273273+ * @len: length of data274274+ * @type: object type, see above275275+ * @compr: compression type276276+ * @ino: inode number277277+ * @bix: block index278278+ * @data_crc: crc32 of payload279279+ */280280+struct logfs_object_header {281281+ __be32 crc;282282+ __be16 len;283283+ __u8 type;284284+ __u8 compr;285285+ __be64 ino;286286+ __be64 bix;287287+ __be32 data_crc;288288+} __attribute__((packed));289289+290290+SIZE_CHECK(logfs_object_header, LOGFS_OBJECT_HEADERSIZE);291291+292292+/*293293+ * Reserved inode numbers:294294+ * LOGFS_INO_MASTER - master inode (for inode file)295295+ * LOGFS_INO_ROOT - root directory296296+ * 
LOGFS_INO_SEGFILE - per-segment used bytes and erase count297297+ */298298+enum {299299+ LOGFS_INO_MAPPING = 0x00,300300+ LOGFS_INO_MASTER = 0x01,301301+ LOGFS_INO_ROOT = 0x02,302302+ LOGFS_INO_SEGFILE = 0x03,303303+ LOGFS_RESERVED_INOS = 0x10,304304+};305305+306306+/*307307+ * Inode flags. High bits should never be written to the medium. They are308308+ * reserved for in-memory usage.309309+ * Low bits should either remain in sync with the corresponding FS_*_FL or310310+ * reuse slots that obviously don't make sense for logfs.311311+ *312312+ * LOGFS_IF_DIRTY Inode must be written back313313+ * LOGFS_IF_ZOMBIE Inode has been deleted314314+ * LOGFS_IF_STILLBORN -ENOSPC happened when creating inode315315+ */316316+#define LOGFS_IF_COMPRESSED 0x00000004 /* == FS_COMPR_FL */317317+#define LOGFS_IF_DIRTY 0x20000000318318+#define LOGFS_IF_ZOMBIE 0x40000000319319+#define LOGFS_IF_STILLBORN 0x80000000320320+321321+/* Flags available to chattr */322322+#define LOGFS_FL_USER_VISIBLE (LOGFS_IF_COMPRESSED)323323+#define LOGFS_FL_USER_MODIFIABLE (LOGFS_IF_COMPRESSED)324324+/* Flags inherited from parent directory on file/directory creation */325325+#define LOGFS_FL_INHERITED (LOGFS_IF_COMPRESSED)326326+327327+/**328328+ * struct logfs_disk_inode - on-medium inode329329+ *330330+ * @di_mode: file mode331331+ * @di_pad: reserved, must be 0332332+ * @di_flags: inode flags, see above333333+ * @di_uid: user id334334+ * @di_gid: group id335335+ * @di_ctime: change time336336+ * @di_mtime: modify time337337+ * @di_refcount: reference count (aka nlink or link count)338338+ * @di_generation: inode generation, for nfs339339+ * @di_used_bytes: number of bytes used340340+ * @di_size: file size341341+ * @di_data: data pointers342342+ */343343+struct logfs_disk_inode {344344+ __be16 di_mode;345345+ __u8 di_height;346346+ __u8 di_pad;347347+ __be32 di_flags;348348+ __be32 di_uid;349349+ __be32 di_gid;350350+351351+ __be64 di_ctime;352352+ __be64 di_mtime;353353+354354+ __be64 
di_atime;355355+ __be32 di_refcount;356356+ __be32 di_generation;357357+358358+ __be64 di_used_bytes;359359+ __be64 di_size;360360+361361+ __be64 di_data[LOGFS_EMBEDDED_FIELDS];362362+};363363+364364+SIZE_CHECK(logfs_disk_inode, 200);365365+366366+#define INODE_POINTER_OFS \367367+ (offsetof(struct logfs_disk_inode, di_data) / sizeof(__be64))368368+#define INODE_USED_OFS \369369+ (offsetof(struct logfs_disk_inode, di_used_bytes) / sizeof(__be64))370370+#define INODE_SIZE_OFS \371371+ (offsetof(struct logfs_disk_inode, di_size) / sizeof(__be64))372372+#define INODE_HEIGHT_OFS (0)373373+374374+/**375375+ * struct logfs_disk_dentry - on-medium dentry structure376376+ *377377+ * @ino: inode number378378+ * @namelen: length of file name379379+ * @type: file type, identical to bits 12..15 of mode380380+ * @name: file name381381+ */382382+/* FIXME: add 6 bytes of padding to remove the __packed */383383+struct logfs_disk_dentry {384384+ __be64 ino;385385+ __be16 namelen;386386+ __u8 type;387387+ __u8 name[LOGFS_MAX_NAMELEN];388388+} __attribute__((packed));389389+390390+SIZE_CHECK(logfs_disk_dentry, 266);391391+392392+#define RESERVED 0xffffffff393393+#define BADSEG 0xffffffff394394+/**395395+ * struct logfs_segment_entry - segment file entry396396+ *397397+ * @ec_level: erase count and level398398+ * @valid: number of valid bytes399399+ *400400+ * Segment file contains one entry for every segment. ec_level contains the401401+ * erasecount in the upper 28 bits and the level in the lower 4 bits. An402402+ * ec_level of BADSEG (-1) identifies bad segments. 
valid contains the number403403+ * of valid bytes or RESERVED (-1 again) if the segment is used for either the404404+ * superblock or the journal, or when the segment is bad.405405+ */406406+struct logfs_segment_entry {407407+ __be32 ec_level;408408+ __be32 valid;409409+};410410+411411+SIZE_CHECK(logfs_segment_entry, 8);412412+413413+/**414414+ * struct logfs_journal_header - header for journal entries (JEs)415415+ *416416+ * @h_crc: crc32 of journal entry417417+ * @h_len: length of compressed journal entry,418418+ * not including header419419+ * @h_datalen: length of uncompressed data420420+ * @h_type: JE type421421+ * @h_version: unnormalized version of journal entry422422+ * @h_compr: compression type423423+ * @h_pad: reserved424424+ */425425+struct logfs_journal_header {426426+ __be32 h_crc;427427+ __be16 h_len;428428+ __be16 h_datalen;429429+ __be16 h_type;430430+ __be16 h_version;431431+ __u8 h_compr;432432+ __u8 h_pad[3];433433+};434434+435435+SIZE_CHECK(logfs_journal_header, 16);436436+437437+/*438438+ * Life expectancy of data.439439+ * VIM_DEFAULT - default vim440440+ * VIM_SEGFILE - for segment file only - very short-living441441+ * VIM_GC - GC'd data - likely long-living442442+ */443443+enum logfs_vim {444444+ VIM_DEFAULT = 0,445445+ VIM_SEGFILE = 1,446446+};447447+448448+/**449449+ * struct logfs_je_area - wbuf header450450+ *451451+ * @segno: segment number of area452452+ * @used_bytes: number of bytes already used453453+ * @gc_level: GC level454454+ * @vim: life expectancy of data455455+ *456456+ * "Areas" are segments currently being used for writing. There is at least457457+ * one area per GC level. Several may be used to separate long-living from458458+ * short-living data. 
If an area with unknown vim is encountered, it can459459+ * simply be closed.460460+ * The write buffer immediately follows this header.461461+ */462462+struct logfs_je_area {463463+ __be32 segno;464464+ __be32 used_bytes;465465+ __u8 gc_level;466466+ __u8 vim;467467+} __attribute__((packed));468468+469469+SIZE_CHECK(logfs_je_area, 10);470470+471471+#define MAX_JOURNAL_HEADER \472472+ (sizeof(struct logfs_journal_header) + sizeof(struct logfs_je_area))473473+474474+/**475475+ * struct logfs_je_dynsb - dynamic superblock476476+ *477477+ * @ds_gec: global erase count478478+ * @ds_sweeper: current position of GC "sweeper"479479+ * @ds_rename_dir: source directory ino (see dir.c documentation)480480+ * @ds_rename_pos: position of source dd (see dir.c documentation)481481+ * @ds_victim_ino: victims of incomplete dir operation (see dir.c)482482+ * @ds_victim_parent: parent inode of victim (see dir.c)483483+ * @ds_used_bytes: number of used bytes484484+ */485485+struct logfs_je_dynsb {486486+ __be64 ds_gec;487487+ __be64 ds_sweeper;488488+489489+ __be64 ds_rename_dir;490490+ __be64 ds_rename_pos;491491+492492+ __be64 ds_victim_ino;493493+ __be64 ds_victim_parent; /* XXX */494494+495495+ __be64 ds_used_bytes;496496+ __be32 ds_generation;497497+ __be32 pad;498498+};499499+500500+SIZE_CHECK(logfs_je_dynsb, 64);501501+502502+/**503503+ * struct logfs_je_anchor - anchor of filesystem tree, aka master inode504504+ *505505+ * @da_size: size of inode file506506+ * @da_last_ino: last created inode507507+ * @da_used_bytes: number of bytes used508508+ * @da_data: data pointers509509+ */510510+struct logfs_je_anchor {511511+ __be64 da_size;512512+ __be64 da_last_ino;513513+514514+ __be64 da_used_bytes;515515+ u8 da_height;516516+ u8 pad[7];517517+518518+ __be64 da_data[LOGFS_EMBEDDED_FIELDS];519519+};520520+521521+SIZE_CHECK(logfs_je_anchor, 168);522522+523523+/**524524+ * struct logfs_je_spillout - spillout entry (from 1st to 2nd journal)525525+ *526526+ * @so_segment: segments used for 
2nd journal527527+ *528528+ * Length of the array is given by the h_len field in the header.529529+ */530530+struct logfs_je_spillout {531531+ __be64 so_segment[0];532532+};533533+534534+SIZE_CHECK(logfs_je_spillout, 0);535535+536536+/**537537+ * struct logfs_je_journal_ec - erase counts for all journal segments538538+ *539539+ * @ec: erase count540540+ *541541+ * Length of the array is given by the h_len field in the header.542542+ */543543+struct logfs_je_journal_ec {544544+ __be32 ec[0];545545+};546546+547547+SIZE_CHECK(logfs_je_journal_ec, 0);548548+549549+/**550550+ * struct logfs_je_free_segments - list of free segments with erase count551551+ */552552+struct logfs_je_free_segments {553553+ __be32 segno;554554+ __be32 ec;555555+};556556+557557+SIZE_CHECK(logfs_je_free_segments, 8);558558+559559+/**560560+ * struct logfs_seg_alias - list of segment aliases561561+ */562562+struct logfs_seg_alias {563563+ __be32 old_segno;564564+ __be32 new_segno;565565+};566566+567567+SIZE_CHECK(logfs_seg_alias, 8);568568+569569+/**570570+ * struct logfs_obj_alias - list of object aliases571571+ */572572+struct logfs_obj_alias {573573+ __be64 ino;574574+ __be64 bix;575575+ __be64 val;576576+ u8 level;577577+ u8 pad[5];578578+ __be16 child_no;579579+};580580+581581+SIZE_CHECK(logfs_obj_alias, 32);582582+583583+/**584584+ * Compression types.585585+ *586586+ * COMPR_NONE - uncompressed587587+ * COMPR_ZLIB - compressed with zlib588588+ */589589+enum {590590+ COMPR_NONE = 0,591591+ COMPR_ZLIB = 1,592592+};593593+594594+/*595595+ * Journal entries come in groups of 16. 
First group contains unique596596+ * entries, next groups contain one entry per level.597597+ *598598+ * JE_FIRST - smallest possible journal entry number599599+ *600600+ * JEG_BASE - base group, containing unique entries601601+ * JE_COMMIT - commit entry, validates all previous entries602602+ * JE_DYNSB - dynamic superblock, anything that ought to be in the603603+ * superblock but cannot because it is read-write data604604+ * JE_ANCHOR - anchor aka master inode aka inode file's inode605605+ * JE_ERASECOUNT - erase counts for all journal segments606606+ * JE_SPILLOUT - unused607607+ * JE_OBJ_ALIAS - object aliases608608+ * JE_AREA - area description609609+ *610610+ * JE_LAST - largest possible journal entry number611611+ */612612+enum {613613+ JE_FIRST = 0x01,614614+615615+ JEG_BASE = 0x00,616616+ JE_COMMIT = 0x02,617617+ JE_DYNSB = 0x03,618618+ JE_ANCHOR = 0x04,619619+ JE_ERASECOUNT = 0x05,620620+ JE_SPILLOUT = 0x06,621621+ JE_OBJ_ALIAS = 0x0d,622622+ JE_AREA = 0x0e,623623+624624+ JE_LAST = 0x0e,625625+};626626+627627+#endif
+2246
fs/logfs/readwrite.c
···11+/*22+ * fs/logfs/readwrite.c33+ *44+ * As should be obvious for Linux kernel code, license is GPLv255+ *66+ * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>77+ *88+ *99+ * Actually contains eight sets of very similar functions:1010+ * read read blocks from a file1111+ * seek_hole find next hole1212+ * seek_data find next data block1313+ * valid check whether a block still belongs to a file1414+ * write write blocks to a file1515+ * delete delete a block (for directories and ifile)1616+ * rewrite move existing blocks of a file to a new location (gc helper)1717+ * truncate truncate a file1818+ */1919+#include "logfs.h"2020+#include <linux/sched.h>2121+2222+static u64 adjust_bix(u64 bix, level_t level)2323+{2424+ switch (level) {2525+ case 0:2626+ return bix;2727+ case LEVEL(1):2828+ return max_t(u64, bix, I0_BLOCKS);2929+ case LEVEL(2):3030+ return max_t(u64, bix, I1_BLOCKS);3131+ case LEVEL(3):3232+ return max_t(u64, bix, I2_BLOCKS);3333+ case LEVEL(4):3434+ return max_t(u64, bix, I3_BLOCKS);3535+ case LEVEL(5):3636+ return max_t(u64, bix, I4_BLOCKS);3737+ default:3838+ WARN_ON(1);3939+ return bix;4040+ }4141+}4242+4343+static inline u64 maxbix(u8 height)4444+{4545+ return 1ULL << (LOGFS_BLOCK_BITS * height);4646+}4747+4848+/**4949+ * The inode address space is cut in two halves. Lower half belongs to data5050+ * pages, upper half to indirect blocks. If the high bit (INDIRECT_BIT) is5151+ * set, the actual block index (bix) and level can be derived from the page5252+ * index.5353+ *5454+ * The lowest three bits of the block index are set to 0 after packing and5555+ * unpacking. 
Since the lowest n bits (9 for 4KiB blocksize) are ignored5656+ * anyway this is harmless.5757+ */5858+#define ARCH_SHIFT (BITS_PER_LONG - 32)5959+#define INDIRECT_BIT (0x80000000UL << ARCH_SHIFT)6060+#define LEVEL_SHIFT (28 + ARCH_SHIFT)6161+static inline pgoff_t first_indirect_block(void)6262+{6363+ return INDIRECT_BIT | (1ULL << LEVEL_SHIFT);6464+}6565+6666+pgoff_t logfs_pack_index(u64 bix, level_t level)6767+{6868+ pgoff_t index;6969+7070+ BUG_ON(bix >= INDIRECT_BIT);7171+ if (level == 0)7272+ return bix;7373+7474+ index = INDIRECT_BIT;7575+ index |= (__force long)level << LEVEL_SHIFT;7676+ index |= bix >> ((__force u8)level * LOGFS_BLOCK_BITS);7777+ return index;7878+}7979+8080+void logfs_unpack_index(pgoff_t index, u64 *bix, level_t *level)8181+{8282+ u8 __level;8383+8484+ if (!(index & INDIRECT_BIT)) {8585+ *bix = index;8686+ *level = 0;8787+ return;8888+ }8989+9090+ __level = (index & ~INDIRECT_BIT) >> LEVEL_SHIFT;9191+ *level = LEVEL(__level);9292+ *bix = (index << (__level * LOGFS_BLOCK_BITS)) & ~INDIRECT_BIT;9393+ *bix = adjust_bix(*bix, *level);9494+ return;9595+}9696+#undef ARCH_SHIFT9797+#undef INDIRECT_BIT9898+#undef LEVEL_SHIFT9999+100100+/*101101+ * Time is stored as nanoseconds since the epoch.102102+ */103103+static struct timespec be64_to_timespec(__be64 betime)104104+{105105+ return ns_to_timespec(be64_to_cpu(betime));106106+}107107+108108+static __be64 timespec_to_be64(struct timespec tsp)109109+{110110+ return cpu_to_be64((u64)tsp.tv_sec * NSEC_PER_SEC + tsp.tv_nsec);111111+}112112+113113+static void logfs_disk_to_inode(struct logfs_disk_inode *di, struct inode*inode)114114+{115115+ struct logfs_inode *li = logfs_inode(inode);116116+ int i;117117+118118+ inode->i_mode = be16_to_cpu(di->di_mode);119119+ li->li_height = di->di_height;120120+ li->li_flags = be32_to_cpu(di->di_flags);121121+ inode->i_uid = be32_to_cpu(di->di_uid);122122+ inode->i_gid = be32_to_cpu(di->di_gid);123123+ inode->i_size = be64_to_cpu(di->di_size);124124+ 
logfs_set_blocks(inode, be64_to_cpu(di->di_used_bytes));125125+ inode->i_atime = be64_to_timespec(di->di_atime);126126+ inode->i_ctime = be64_to_timespec(di->di_ctime);127127+ inode->i_mtime = be64_to_timespec(di->di_mtime);128128+ inode->i_nlink = be32_to_cpu(di->di_refcount);129129+ inode->i_generation = be32_to_cpu(di->di_generation);130130+131131+ switch (inode->i_mode & S_IFMT) {132132+ case S_IFSOCK: /* fall through */133133+ case S_IFBLK: /* fall through */134134+ case S_IFCHR: /* fall through */135135+ case S_IFIFO:136136+ inode->i_rdev = be64_to_cpu(di->di_data[0]);137137+ break;138138+ case S_IFDIR: /* fall through */139139+ case S_IFREG: /* fall through */140140+ case S_IFLNK:141141+ for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)142142+ li->li_data[i] = be64_to_cpu(di->di_data[i]);143143+ break;144144+ default:145145+ BUG();146146+ }147147+}148148+149149+static void logfs_inode_to_disk(struct inode *inode, struct logfs_disk_inode*di)150150+{151151+ struct logfs_inode *li = logfs_inode(inode);152152+ int i;153153+154154+ di->di_mode = cpu_to_be16(inode->i_mode);155155+ di->di_height = li->li_height;156156+ di->di_pad = 0;157157+ di->di_flags = cpu_to_be32(li->li_flags);158158+ di->di_uid = cpu_to_be32(inode->i_uid);159159+ di->di_gid = cpu_to_be32(inode->i_gid);160160+ di->di_size = cpu_to_be64(i_size_read(inode));161161+ di->di_used_bytes = cpu_to_be64(li->li_used_bytes);162162+ di->di_atime = timespec_to_be64(inode->i_atime);163163+ di->di_ctime = timespec_to_be64(inode->i_ctime);164164+ di->di_mtime = timespec_to_be64(inode->i_mtime);165165+ di->di_refcount = cpu_to_be32(inode->i_nlink);166166+ di->di_generation = cpu_to_be32(inode->i_generation);167167+168168+ switch (inode->i_mode & S_IFMT) {169169+ case S_IFSOCK: /* fall through */170170+ case S_IFBLK: /* fall through */171171+ case S_IFCHR: /* fall through */172172+ case S_IFIFO:173173+ di->di_data[0] = cpu_to_be64(inode->i_rdev);174174+ break;175175+ case S_IFDIR: /* fall through */176176+ case 
S_IFREG: /* fall through */177177+ case S_IFLNK:178178+ for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)179179+ di->di_data[i] = cpu_to_be64(li->li_data[i]);180180+ break;181181+ default:182182+ BUG();183183+ }184184+}185185+186186+static void __logfs_set_blocks(struct inode *inode)187187+{188188+ struct super_block *sb = inode->i_sb;189189+ struct logfs_inode *li = logfs_inode(inode);190190+191191+ inode->i_blocks = ULONG_MAX;192192+ if (li->li_used_bytes >> sb->s_blocksize_bits < ULONG_MAX)193193+ inode->i_blocks = ALIGN(li->li_used_bytes, 512) >> 9;194194+}195195+196196+void logfs_set_blocks(struct inode *inode, u64 bytes)197197+{198198+ struct logfs_inode *li = logfs_inode(inode);199199+200200+ li->li_used_bytes = bytes;201201+ __logfs_set_blocks(inode);202202+}203203+204204+static void prelock_page(struct super_block *sb, struct page *page, int lock)205205+{206206+ struct logfs_super *super = logfs_super(sb);207207+208208+ BUG_ON(!PageLocked(page));209209+ if (lock) {210210+ BUG_ON(PagePreLocked(page));211211+ SetPagePreLocked(page);212212+ } else {213213+ /* We are in GC path. */214214+ if (PagePreLocked(page))215215+ super->s_lock_count++;216216+ else217217+ SetPagePreLocked(page);218218+ }219219+}220220+221221+static void preunlock_page(struct super_block *sb, struct page *page, int lock)222222+{223223+ struct logfs_super *super = logfs_super(sb);224224+225225+ BUG_ON(!PageLocked(page));226226+ if (lock)227227+ ClearPagePreLocked(page);228228+ else {229229+ /* We are in GC path. */230230+ BUG_ON(!PagePreLocked(page));231231+ if (super->s_lock_count)232232+ super->s_lock_count--;233233+ else234234+ ClearPagePreLocked(page);235235+ }236236+}237237+238238+/*239239+ * Logfs is prone to an AB-BA deadlock where one task tries to acquire240240+ * s_write_mutex with a locked page and GC tries to get that page while holding241241+ * s_write_mutex.242242+ * To solve this issue logfs will ignore the page lock iff the page in question243243+ * is waiting for s_write_mutex. 
We annotate this fact by setting PG_pre_locked244244+ * in addition to PG_locked.245245+ */246246+static void logfs_get_wblocks(struct super_block *sb, struct page *page,247247+ int lock)248248+{249249+ struct logfs_super *super = logfs_super(sb);250250+251251+ if (page)252252+ prelock_page(sb, page, lock);253253+254254+ if (lock) {255255+ mutex_lock(&super->s_write_mutex);256256+ logfs_gc_pass(sb);257257+ /* FIXME: We also have to check for shadowed space258258+ * and mempool fill grade */259259+ }260260+}261261+262262+static void logfs_put_wblocks(struct super_block *sb, struct page *page,263263+ int lock)264264+{265265+ struct logfs_super *super = logfs_super(sb);266266+267267+ if (page)268268+ preunlock_page(sb, page, lock);269269+ /* Order matters - we must clear PG_pre_locked before releasing270270+ * s_write_mutex or we could race against another task. */271271+ if (lock)272272+ mutex_unlock(&super->s_write_mutex);273273+}274274+275275+static struct page *logfs_get_read_page(struct inode *inode, u64 bix,276276+ level_t level)277277+{278278+ return find_or_create_page(inode->i_mapping,279279+ logfs_pack_index(bix, level), GFP_NOFS);280280+}281281+282282+static void logfs_put_read_page(struct page *page)283283+{284284+ unlock_page(page);285285+ page_cache_release(page);286286+}287287+288288+static void logfs_lock_write_page(struct page *page)289289+{290290+ int loop = 0;291291+292292+ while (unlikely(!trylock_page(page))) {293293+ if (loop++ > 0x1000) {294294+ /* Has been observed once so far... */295295+ printk(KERN_ERR "stack at %p\n", &loop);296296+ BUG();297297+ }298298+ if (PagePreLocked(page)) {299299+ /* Holder of page lock is waiting for us, it300300+ * is safe to use this page. */301301+ break;302302+ }303303+ /* Some other process has this page locked and has304304+ * nothing to do with us. 
Wait for it to finish.305305+ */306306+ schedule();307307+ }308308+ BUG_ON(!PageLocked(page));309309+}310310+311311+static struct page *logfs_get_write_page(struct inode *inode, u64 bix,312312+ level_t level)313313+{314314+ struct address_space *mapping = inode->i_mapping;315315+ pgoff_t index = logfs_pack_index(bix, level);316316+ struct page *page;317317+ int err;318318+319319+repeat:320320+ page = find_get_page(mapping, index);321321+ if (!page) {322322+ page = __page_cache_alloc(GFP_NOFS);323323+ if (!page)324324+ return NULL;325325+ err = add_to_page_cache_lru(page, mapping, index, GFP_NOFS);326326+ if (unlikely(err)) {327327+ page_cache_release(page);328328+ if (err == -EEXIST)329329+ goto repeat;330330+ return NULL;331331+ }332332+ } else logfs_lock_write_page(page);333333+ BUG_ON(!PageLocked(page));334334+ return page;335335+}336336+337337+static void logfs_unlock_write_page(struct page *page)338338+{339339+ if (!PagePreLocked(page))340340+ unlock_page(page);341341+}342342+343343+static void logfs_put_write_page(struct page *page)344344+{345345+ logfs_unlock_write_page(page);346346+ page_cache_release(page);347347+}348348+349349+static struct page *logfs_get_page(struct inode *inode, u64 bix, level_t level,350350+ int rw)351351+{352352+ if (rw == READ)353353+ return logfs_get_read_page(inode, bix, level);354354+ else355355+ return logfs_get_write_page(inode, bix, level);356356+}357357+358358+static void logfs_put_page(struct page *page, int rw)359359+{360360+ if (rw == READ)361361+ logfs_put_read_page(page);362362+ else363363+ logfs_put_write_page(page);364364+}365365+366366+static unsigned long __get_bits(u64 val, int skip, int no)367367+{368368+ u64 ret = val;369369+370370+ ret >>= skip * no;371371+ ret <<= 64 - no;372372+ ret >>= 64 - no;373373+ return ret;374374+}375375+376376+static unsigned long get_bits(u64 val, level_t skip)377377+{378378+ return __get_bits(val, (__force int)skip, LOGFS_BLOCK_BITS);379379+}380380+381381+static inline void 
init_shadow_tree(struct super_block *sb,382382+ struct shadow_tree *tree)383383+{384384+ struct logfs_super *super = logfs_super(sb);385385+386386+ btree_init_mempool64(&tree->new, super->s_btree_pool);387387+ btree_init_mempool64(&tree->old, super->s_btree_pool);388388+}389389+390390+static void indirect_write_block(struct logfs_block *block)391391+{392392+ struct page *page;393393+ struct inode *inode;394394+ int ret;395395+396396+ page = block->page;397397+ inode = page->mapping->host;398398+ logfs_lock_write_page(page);399399+ ret = logfs_write_buf(inode, page, 0);400400+ logfs_unlock_write_page(page);401401+ /*402402+ * This needs some rework. Unless you want your filesystem to run403403+ * completely synchronously (you don't), the filesystem will always404404+ * report writes as 'successful' before the actual work has been405405+ * done. The actual work gets done here and this is where any errors406406+ * will show up. And there isn't much we can do about it, really.407407+ *408408+ * Some attempts to fix the errors (move from bad blocks, retry io,...)409409+ * have already been done, so anything left should be either a broken410410+ * device or a bug somewhere in logfs itself. 
Being relatively new,411411+ * the odds currently favor a bug, so for now the line below isn't412412+ * entirely tasteless.413413+ */414414+ BUG_ON(ret);415415+}416416+417417+static void inode_write_block(struct logfs_block *block)418418+{419419+ struct inode *inode;420420+ int ret;421421+422422+ inode = block->inode;423423+ if (inode->i_ino == LOGFS_INO_MASTER)424424+ logfs_write_anchor(inode);425425+ else {426426+ ret = __logfs_write_inode(inode, 0);427427+ /* see indirect_write_block comment */428428+ BUG_ON(ret);429429+ }430430+}431431+432432+static gc_level_t inode_block_level(struct logfs_block *block)433433+{434434+ BUG_ON(block->inode->i_ino == LOGFS_INO_MASTER);435435+ return GC_LEVEL(LOGFS_MAX_LEVELS);436436+}437437+438438+static gc_level_t indirect_block_level(struct logfs_block *block)439439+{440440+ struct page *page;441441+ struct inode *inode;442442+ u64 bix;443443+ level_t level;444444+445445+ page = block->page;446446+ inode = page->mapping->host;447447+ logfs_unpack_index(page->index, &bix, &level);448448+ return expand_level(inode->i_ino, level);449449+}450450+451451+/*452452+ * This silences a false, yet annoying gcc warning. I hate it when my editor453453+ * jumps into bitops.h each time I recompile this file.454454+ * TODO: Complain to gcc folks about this and upgrade compiler.455455+ */456456+static unsigned long fnb(const unsigned long *addr,457457+ unsigned long size, unsigned long offset)458458+{459459+ return find_next_bit(addr, size, offset);460460+}461461+462462+static __be64 inode_val0(struct inode *inode)463463+{464464+ struct logfs_inode *li = logfs_inode(inode);465465+ u64 val;466466+467467+ /*468468+ * Explicit shifting generates good code, but must match the format469469+ * of the structure. 
Add some paranoia just in case.470470+ */471471+ BUILD_BUG_ON(offsetof(struct logfs_disk_inode, di_mode) != 0);472472+ BUILD_BUG_ON(offsetof(struct logfs_disk_inode, di_height) != 2);473473+ BUILD_BUG_ON(offsetof(struct logfs_disk_inode, di_flags) != 4);474474+475475+ val = (u64)inode->i_mode << 48 |476476+ (u64)li->li_height << 40 |477477+ (u64)li->li_flags;478478+ return cpu_to_be64(val);479479+}480480+481481+static int inode_write_alias(struct super_block *sb,482482+ struct logfs_block *block, write_alias_t *write_one_alias)483483+{484484+ struct inode *inode = block->inode;485485+ struct logfs_inode *li = logfs_inode(inode);486486+ unsigned long pos;487487+ u64 ino , bix;488488+ __be64 val;489489+ level_t level;490490+ int err;491491+492492+ for (pos = 0; ; pos++) {493493+ pos = fnb(block->alias_map, LOGFS_BLOCK_FACTOR, pos);494494+ if (pos >= LOGFS_EMBEDDED_FIELDS + INODE_POINTER_OFS)495495+ return 0;496496+497497+ switch (pos) {498498+ case INODE_HEIGHT_OFS:499499+ val = inode_val0(inode);500500+ break;501501+ case INODE_USED_OFS:502502+ val = cpu_to_be64(li->li_used_bytes);503503+ break;504504+ case INODE_SIZE_OFS:505505+ val = cpu_to_be64(i_size_read(inode));506506+ break;507507+ case INODE_POINTER_OFS ... 
INODE_POINTER_OFS + LOGFS_EMBEDDED_FIELDS - 1:508508+ val = cpu_to_be64(li->li_data[pos - INODE_POINTER_OFS]);509509+ break;510510+ default:511511+ BUG();512512+ }513513+514514+ ino = LOGFS_INO_MASTER;515515+ bix = inode->i_ino;516516+ level = LEVEL(0);517517+ err = write_one_alias(sb, ino, bix, level, pos, val);518518+ if (err)519519+ return err;520520+ }521521+}522522+523523+static int indirect_write_alias(struct super_block *sb,524524+ struct logfs_block *block, write_alias_t *write_one_alias)525525+{526526+ unsigned long pos;527527+ struct page *page = block->page;528528+ u64 ino , bix;529529+ __be64 *child, val;530530+ level_t level;531531+ int err;532532+533533+ for (pos = 0; ; pos++) {534534+ pos = fnb(block->alias_map, LOGFS_BLOCK_FACTOR, pos);535535+ if (pos >= LOGFS_BLOCK_FACTOR)536536+ return 0;537537+538538+ ino = page->mapping->host->i_ino;539539+ logfs_unpack_index(page->index, &bix, &level);540540+ child = kmap_atomic(page, KM_USER0);541541+ val = child[pos];542542+ kunmap_atomic(child, KM_USER0);543543+ err = write_one_alias(sb, ino, bix, level, pos, val);544544+ if (err)545545+ return err;546546+ }547547+}548548+549549+int logfs_write_obj_aliases_pagecache(struct super_block *sb)550550+{551551+ struct logfs_super *super = logfs_super(sb);552552+ struct logfs_block *block;553553+ int err;554554+555555+ list_for_each_entry(block, &super->s_object_alias, alias_list) {556556+ err = block->ops->write_alias(sb, block, write_alias_journal);557557+ if (err)558558+ return err;559559+ }560560+ return 0;561561+}562562+563563+void __free_block(struct super_block *sb, struct logfs_block *block)564564+{565565+ BUG_ON(!list_empty(&block->item_list));566566+ list_del(&block->alias_list);567567+ mempool_free(block, logfs_super(sb)->s_block_pool);568568+}569569+570570+static void inode_free_block(struct super_block *sb, struct logfs_block *block)571571+{572572+ struct inode *inode = block->inode;573573+574574+ logfs_inode(inode)->li_block = NULL;575575+ 
__free_block(sb, block);576576+}577577+578578+static void indirect_free_block(struct super_block *sb,579579+ struct logfs_block *block)580580+{581581+ ClearPagePrivate(block->page);582582+ block->page->private = 0;583583+ __free_block(sb, block);584584+}585585+586586+587587+static struct logfs_block_ops inode_block_ops = {588588+ .write_block = inode_write_block,589589+ .block_level = inode_block_level,590590+ .free_block = inode_free_block,591591+ .write_alias = inode_write_alias,592592+};593593+594594+struct logfs_block_ops indirect_block_ops = {595595+ .write_block = indirect_write_block,596596+ .block_level = indirect_block_level,597597+ .free_block = indirect_free_block,598598+ .write_alias = indirect_write_alias,599599+};600600+601601+struct logfs_block *__alloc_block(struct super_block *sb,602602+ u64 ino, u64 bix, level_t level)603603+{604604+ struct logfs_super *super = logfs_super(sb);605605+ struct logfs_block *block;606606+607607+ block = mempool_alloc(super->s_block_pool, GFP_NOFS);608608+ memset(block, 0, sizeof(*block));609609+ INIT_LIST_HEAD(&block->alias_list);610610+ INIT_LIST_HEAD(&block->item_list);611611+ block->sb = sb;612612+ block->ino = ino;613613+ block->bix = bix;614614+ block->level = level;615615+ return block;616616+}617617+618618+static void alloc_inode_block(struct inode *inode)619619+{620620+ struct logfs_inode *li = logfs_inode(inode);621621+ struct logfs_block *block;622622+623623+ if (li->li_block)624624+ return;625625+626626+ block = __alloc_block(inode->i_sb, LOGFS_INO_MASTER, inode->i_ino, 0);627627+ block->inode = inode;628628+ li->li_block = block;629629+ block->ops = &inode_block_ops;630630+}631631+632632+void initialize_block_counters(struct page *page, struct logfs_block *block,633633+ __be64 *array, int page_is_empty)634634+{635635+ u64 ptr;636636+ int i, start;637637+638638+ block->partial = 0;639639+ block->full = 0;640640+ start = 0;641641+ if (page->index < first_indirect_block()) {642642+ /* Counters are pointless 
		   on level 0 */
		return;
	}
	if (page->index == first_indirect_block()) {
		/* Skip unused pointers */
		start = I0_BLOCKS;
		block->full = I0_BLOCKS;
	}
	if (!page_is_empty) {
		for (i = start; i < LOGFS_BLOCK_FACTOR; i++) {
			ptr = be64_to_cpu(array[i]);
			if (ptr)
				block->partial++;
			if (ptr & LOGFS_FULLY_POPULATED)
				block->full++;
		}
	}
}

static void alloc_data_block(struct inode *inode, struct page *page)
{
	struct logfs_block *block;
	u64 bix;
	level_t level;

	if (PagePrivate(page))
		return;

	logfs_unpack_index(page->index, &bix, &level);
	block = __alloc_block(inode->i_sb, inode->i_ino, bix, level);
	block->page = page;
	SetPagePrivate(page);
	page->private = (unsigned long)block;
	block->ops = &indirect_block_ops;
}

static void alloc_indirect_block(struct inode *inode, struct page *page,
		int page_is_empty)
{
	struct logfs_block *block;
	__be64 *array;

	if (PagePrivate(page))
		return;

	alloc_data_block(inode, page);

	block = logfs_block(page);
	array = kmap_atomic(page, KM_USER0);
	initialize_block_counters(page, block, array, page_is_empty);
	kunmap_atomic(array, KM_USER0);
}

static void block_set_pointer(struct page *page, int index, u64 ptr)
{
	struct logfs_block *block = logfs_block(page);
	__be64 *array;
	u64 oldptr;

	BUG_ON(!block);
	array = kmap_atomic(page, KM_USER0);
	oldptr = be64_to_cpu(array[index]);
	array[index] = cpu_to_be64(ptr);
	kunmap_atomic(array, KM_USER0);
	SetPageUptodate(page);

	block->full += !!(ptr & LOGFS_FULLY_POPULATED)
		- !!(oldptr & LOGFS_FULLY_POPULATED);
	block->partial += !!ptr - !!oldptr;
}

static u64 block_get_pointer(struct page *page, int index)
{
	__be64 *block;
	u64 ptr;

	block = kmap_atomic(page, KM_USER0);
	ptr = be64_to_cpu(block[index]);
	kunmap_atomic(block, KM_USER0);
	return ptr;
}

static int logfs_read_empty(struct page *page)
{
	zero_user_segment(page, 0, PAGE_CACHE_SIZE);
	return 0;
}

static int logfs_read_direct(struct inode *inode, struct page *page)
{
	struct logfs_inode *li = logfs_inode(inode);
	pgoff_t index = page->index;
	u64 block;

	block = li->li_data[index];
	if (!block)
		return logfs_read_empty(page);

	return logfs_segment_read(inode, page, block, index, 0);
}

static int logfs_read_loop(struct inode *inode, struct page *page,
		int rw_context)
{
	struct logfs_inode *li = logfs_inode(inode);
	u64 bix, bofs = li->li_data[INDIRECT_INDEX];
	level_t level, target_level;
	int ret;
	struct page *ipage;

	logfs_unpack_index(page->index, &bix, &target_level);
	if (!bofs)
		return logfs_read_empty(page);

	if (bix >= maxbix(li->li_height))
		return logfs_read_empty(page);

	for (level = LEVEL(li->li_height);
			(__force u8)level > (__force u8)target_level;
			level = SUBLEVEL(level)) {
		ipage = logfs_get_page(inode, bix, level, rw_context);
		if (!ipage)
			return -ENOMEM;

		ret = logfs_segment_read(inode, ipage, bofs, bix, level);
		if (ret) {
			logfs_put_read_page(ipage);
			return ret;
		}

		bofs = block_get_pointer(ipage, get_bits(bix, SUBLEVEL(level)));
		logfs_put_page(ipage, rw_context);
		if (!bofs)
			return logfs_read_empty(page);
	}

	return logfs_segment_read(inode, page, bofs, bix, 0);
}

static int logfs_read_block(struct inode *inode, struct page *page,
		int rw_context)
{
	pgoff_t index = page->index;

	if (index < I0_BLOCKS)
		return logfs_read_direct(inode, page);
	return logfs_read_loop(inode, page, rw_context);
}

static int logfs_exist_loop(struct inode *inode, u64 bix)
{
	struct logfs_inode *li = logfs_inode(inode);
	u64 bofs = li->li_data[INDIRECT_INDEX];
	level_t level;
	int ret;
	struct page *ipage;

	if (!bofs)
		return 0;
	if (bix >= maxbix(li->li_height))
		return 0;

	for (level = LEVEL(li->li_height); level != 0; level = SUBLEVEL(level)) {
		ipage = logfs_get_read_page(inode, bix, level);
		if (!ipage)
			return -ENOMEM;

		ret = logfs_segment_read(inode, ipage, bofs, bix, level);
		if (ret) {
			logfs_put_read_page(ipage);
			return ret;
		}

		bofs = block_get_pointer(ipage, get_bits(bix, SUBLEVEL(level)));
		logfs_put_read_page(ipage);
		if (!bofs)
			return 0;
	}

	return 1;
}

int logfs_exist_block(struct inode *inode, u64 bix)
{
	struct logfs_inode *li = logfs_inode(inode);

	if (bix < I0_BLOCKS)
		return !!li->li_data[bix];
	return logfs_exist_loop(inode, bix);
}

static u64 seek_holedata_direct(struct inode *inode, u64 bix, int data)
{
	struct logfs_inode *li = logfs_inode(inode);

	for (; bix < I0_BLOCKS; bix++)
		if (data ^ (li->li_data[bix] == 0))
			return bix;
	return I0_BLOCKS;
}

static u64 seek_holedata_loop(struct inode *inode, u64 bix, int data)
{
	struct logfs_inode *li = logfs_inode(inode);
	__be64 *rblock;
	u64 increment, bofs = li->li_data[INDIRECT_INDEX];
	level_t level;
	int ret, slot;
	struct page *page;

	BUG_ON(!bofs);

	for (level = LEVEL(li->li_height);
			level != 0; level = SUBLEVEL(level)) {
		increment = 1 << (LOGFS_BLOCK_BITS * ((__force u8)level - 1));
		page = logfs_get_read_page(inode, bix, level);
		if (!page)
			return bix;

		ret = logfs_segment_read(inode, page, bofs, bix, level);
		if (ret) {
			logfs_put_read_page(page);
			return bix;
		}

		slot = get_bits(bix, SUBLEVEL(level));
		rblock = kmap_atomic(page, KM_USER0);
		while (slot < LOGFS_BLOCK_FACTOR) {
			if (data && (rblock[slot] != 0))
				break;
			if (!data && !(be64_to_cpu(rblock[slot]) & LOGFS_FULLY_POPULATED))
				break;
			slot++;
			bix += increment;
			bix &= ~(increment - 1);
		}
		if (slot >= LOGFS_BLOCK_FACTOR) {
			kunmap_atomic(rblock, KM_USER0);
			logfs_put_read_page(page);
			return bix;
		}
		bofs = be64_to_cpu(rblock[slot]);
		kunmap_atomic(rblock, KM_USER0);
		logfs_put_read_page(page);
		if (!bofs) {
			BUG_ON(data);
			return bix;
		}
	}
	return bix;
}

/**
 * logfs_seek_hole - find next hole starting at a given block index
 * @inode:	inode to search in
 * @bix:	block index to start searching
 *
 * Returns next hole. If the file doesn't contain any further holes, the
 * block address next to eof is returned instead.
 */
u64 logfs_seek_hole(struct inode *inode, u64 bix)
{
	struct logfs_inode *li = logfs_inode(inode);

	if (bix < I0_BLOCKS) {
		bix = seek_holedata_direct(inode, bix, 0);
		if (bix < I0_BLOCKS)
			return bix;
	}

	if (!li->li_data[INDIRECT_INDEX])
		return bix;
	else if (li->li_data[INDIRECT_INDEX] & LOGFS_FULLY_POPULATED)
		bix = maxbix(li->li_height);
	else {
		bix = seek_holedata_loop(inode, bix, 0);
		if (bix < maxbix(li->li_height))
			return bix;
		/* Should not happen anymore.  But if some port writes semi-
		 * corrupt images (as this one used to) we might run into it.
		 */
		WARN_ON_ONCE(bix == maxbix(li->li_height));
	}

	return bix;
}

static u64 __logfs_seek_data(struct inode *inode, u64 bix)
{
	struct logfs_inode *li = logfs_inode(inode);

	if (bix < I0_BLOCKS) {
		bix = seek_holedata_direct(inode, bix, 1);
		if (bix < I0_BLOCKS)
			return bix;
	}

	if (bix < maxbix(li->li_height)) {
		if (!li->li_data[INDIRECT_INDEX])
			bix = maxbix(li->li_height);
		else
			return seek_holedata_loop(inode, bix, 1);
	}

	return bix;
}

/**
 * logfs_seek_data - find next data block after a given block index
 * @inode:	inode to search in
 * @bix:	block index to start searching
 *
 * Returns next data block. If the file doesn't contain any further data
 * blocks, the last block in the file is returned instead.
 */
u64 logfs_seek_data(struct inode *inode, u64 bix)
{
	struct super_block *sb = inode->i_sb;
	u64 ret, end;

	ret = __logfs_seek_data(inode, bix);
	end = i_size_read(inode) >> sb->s_blocksize_bits;
	if (ret >= end)
		ret = max(bix, end);
	return ret;
}

static int logfs_is_valid_direct(struct logfs_inode *li, u64 bix, u64 ofs)
{
	return pure_ofs(li->li_data[bix]) == ofs;
}

static int __logfs_is_valid_loop(struct inode *inode, u64 bix,
		u64 ofs, u64 bofs)
{
	struct logfs_inode *li = logfs_inode(inode);
	level_t level;
	int ret;
	struct page *page;

	for (level = LEVEL(li->li_height); level != 0; level = SUBLEVEL(level)) {
		page = logfs_get_write_page(inode, bix, level);
		BUG_ON(!page);

		ret = logfs_segment_read(inode, page, bofs, bix, level);
		if (ret) {
			logfs_put_write_page(page);
			return 0;
		}

		bofs = block_get_pointer(page, get_bits(bix, SUBLEVEL(level)));
		logfs_put_write_page(page);
		if (!bofs)
			return 0;

		if (pure_ofs(bofs) == ofs)
			return 1;
	}
	return 0;
}

static int logfs_is_valid_loop(struct inode *inode, u64 bix, u64 ofs)
{
	struct logfs_inode *li = logfs_inode(inode);
	u64 bofs = li->li_data[INDIRECT_INDEX];

	if (!bofs)
		return 0;

	if (bix >= maxbix(li->li_height))
		return 0;

	if (pure_ofs(bofs) == ofs)
		return 1;

	return __logfs_is_valid_loop(inode, bix, ofs, bofs);
}

static int __logfs_is_valid_block(struct inode *inode, u64 bix, u64 ofs)
{
	struct logfs_inode *li = logfs_inode(inode);

	if ((inode->i_nlink == 0) && atomic_read(&inode->i_count) == 1)
		return 0;

	if (bix < I0_BLOCKS)
		return logfs_is_valid_direct(li, bix, ofs);
	return logfs_is_valid_loop(inode, bix, ofs);
}

/**
 * logfs_is_valid_block - check whether this block is still valid
 *
 * @sb:		superblock
 * @ofs:	block physical offset
 * @ino:	block inode number
 * @bix:	block index
 * @gc_level:	block level
 *
 * Returns 0 if the block is invalid, 1 if it is valid and 2 if it will
 * become invalid once the journal is written.
 */
int logfs_is_valid_block(struct super_block *sb, u64 ofs, u64 ino, u64 bix,
		gc_level_t gc_level)
{
	struct logfs_super *super = logfs_super(sb);
	struct inode *inode;
	int ret, cookie;

	/* Umount closes a segment with free blocks remaining. Those
	 * blocks are by definition invalid.
	 */
	if (ino == -1)
		return 0;

	LOGFS_BUG_ON((u64)(u_long)ino != ino, sb);

	inode = logfs_safe_iget(sb, ino, &cookie);
	if (IS_ERR(inode))
		goto invalid;

	ret = __logfs_is_valid_block(inode, bix, ofs);
	logfs_safe_iput(inode, cookie);
	if (ret)
		return ret;

invalid:
	/* Block is nominally invalid, but may still sit in the shadow tree,
	 * waiting for a journal commit.
	 */
	if (btree_lookup64(&super->s_shadow_tree.old, ofs))
		return 2;
	return 0;
}

int logfs_readpage_nolock(struct page *page)
{
	struct inode *inode = page->mapping->host;
	int ret;

	ret = logfs_read_block(inode, page, READ);
	if (ret) {
		ClearPageUptodate(page);
		SetPageError(page);
	} else {
		SetPageUptodate(page);
		ClearPageError(page);
	}
	flush_dcache_page(page);

	return ret;
}

static int logfs_reserve_bytes(struct inode *inode, int bytes)
{
	struct logfs_super *super = logfs_super(inode->i_sb);
	u64 available = super->s_free_bytes + super->s_dirty_free_bytes
			- super->s_dirty_used_bytes - super->s_dirty_pages;

	if (!bytes)
		return 0;

	if (available < bytes)
		return -ENOSPC;

	if (available < bytes + super->s_root_reserve &&
			!capable(CAP_SYS_RESOURCE))
		return -ENOSPC;

	return 0;
}

int get_page_reserve(struct inode *inode, struct page *page)
{
	struct logfs_super *super = logfs_super(inode->i_sb);
	int ret;

	if (logfs_block(page) && logfs_block(page)->reserved_bytes)
		return 0;

	logfs_get_wblocks(inode->i_sb, page,
			WF_LOCK);
	ret = logfs_reserve_bytes(inode, 6 * LOGFS_MAX_OBJECTSIZE);
	if (!ret) {
		alloc_data_block(inode, page);
		logfs_block(page)->reserved_bytes += 6 * LOGFS_MAX_OBJECTSIZE;
		super->s_dirty_pages += 6 * LOGFS_MAX_OBJECTSIZE;
	}
	logfs_put_wblocks(inode->i_sb, page, WF_LOCK);
	return ret;
}

/*
 * We are protected by write lock. Push victims up to superblock level
 * and release transaction when appropriate.
 */
/* FIXME: This is currently called from the wrong spots. */
static void logfs_handle_transaction(struct inode *inode,
		struct logfs_transaction *ta)
{
	struct logfs_super *super = logfs_super(inode->i_sb);

	if (!ta)
		return;
	logfs_inode(inode)->li_block->ta = NULL;

	if (inode->i_ino != LOGFS_INO_MASTER) {
		BUG(); /* FIXME: Yes, this needs more thought */
		/* just remember the transaction until inode is written */
		//BUG_ON(logfs_inode(inode)->li_transaction);
		//logfs_inode(inode)->li_transaction = ta;
		return;
	}

	switch (ta->state) {
	case CREATE_1: /* fall through */
	case UNLINK_1:
		BUG_ON(super->s_victim_ino);
		super->s_victim_ino = ta->ino;
		break;
	case CREATE_2: /* fall through */
	case UNLINK_2:
		BUG_ON(super->s_victim_ino != ta->ino);
		super->s_victim_ino = 0;
		/* transaction ends here - free it */
		kfree(ta);
		break;
	case CROSS_RENAME_1:
		BUG_ON(super->s_rename_dir);
		BUG_ON(super->s_rename_pos);
		super->s_rename_dir = ta->dir;
		super->s_rename_pos = ta->pos;
		break;
	case CROSS_RENAME_2:
		BUG_ON(super->s_rename_dir != ta->dir);
		BUG_ON(super->s_rename_pos != ta->pos);
		super->s_rename_dir = 0;
		super->s_rename_pos = 0;
		kfree(ta);
		break;
	case TARGET_RENAME_1:
		BUG_ON(super->s_rename_dir);
		BUG_ON(super->s_rename_pos);
		BUG_ON(super->s_victim_ino);
		super->s_rename_dir = ta->dir;
		super->s_rename_pos = ta->pos;
		super->s_victim_ino = ta->ino;
		break;
	case TARGET_RENAME_2:
		BUG_ON(super->s_rename_dir != ta->dir);
		BUG_ON(super->s_rename_pos != ta->pos);
		BUG_ON(super->s_victim_ino != ta->ino);
		super->s_rename_dir = 0;
		super->s_rename_pos = 0;
		break;
	case TARGET_RENAME_3:
		BUG_ON(super->s_rename_dir);
		BUG_ON(super->s_rename_pos);
		BUG_ON(super->s_victim_ino != ta->ino);
		super->s_victim_ino = 0;
		kfree(ta);
		break;
	default:
		BUG();
	}
}

/*
 * Not strictly a reservation, but rather a check that we still have enough
 * space to satisfy the write.
 */
static int logfs_reserve_blocks(struct inode *inode, int blocks)
{
	return logfs_reserve_bytes(inode, blocks * LOGFS_MAX_OBJECTSIZE);
}

struct write_control {
	u64 ofs;
	long flags;
};

static struct logfs_shadow *alloc_shadow(struct inode *inode, u64 bix,
		level_t level, u64 old_ofs)
{
	struct logfs_super *super = logfs_super(inode->i_sb);
	struct logfs_shadow *shadow;

	shadow = mempool_alloc(super->s_shadow_pool, GFP_NOFS);
	memset(shadow, 0, sizeof(*shadow));
	shadow->ino = inode->i_ino;
	shadow->bix = bix;
	shadow->gc_level = expand_level(inode->i_ino, level);
	shadow->old_ofs = old_ofs & ~LOGFS_FULLY_POPULATED;
	return shadow;
}

static void free_shadow(struct inode *inode, struct logfs_shadow *shadow)
{
	struct logfs_super *super = logfs_super(inode->i_sb);

	mempool_free(shadow, super->s_shadow_pool);
}

/**
 * fill_shadow_tree - Propagate shadow tree changes due to a write
 * @inode:	Inode owning the page
 * @page:	Struct page that was written
 * @shadow:	Shadow for the current write
 *
 * Writes in logfs can result in two semi-valid objects. The old object
 * is still valid as long as it can be reached by following pointers on
 * the medium. Only when writes propagate all the way up to the journal
 * has the new object safely replaced the old one.
 *
 * To handle this problem, a struct logfs_shadow is used to represent
 * every single write. It is attached to the indirect block, which is
 * marked dirty. When the indirect block is written, its shadows are
 * handed up to the next indirect block (or inode). Ultimately they
 * will reach the master inode and be freed upon journal commit.
 *
 * This function handles a single step in the propagation. It adds the
 * shadow for the current write to the tree, along with any shadows in
 * the page's tree, in case it was an indirect block.
 * If a page is written, the inode parameter is left NULL; if an inode is
 * written, the page parameter is left NULL.
 */
static void fill_shadow_tree(struct inode *inode, struct page *page,
		struct logfs_shadow *shadow)
{
	struct logfs_super *super = logfs_super(inode->i_sb);
	struct logfs_block *block = logfs_block(page);
	struct shadow_tree *tree = &super->s_shadow_tree;

	if (PagePrivate(page)) {
		if (block->alias_map)
			super->s_no_object_aliases -= bitmap_weight(
					block->alias_map, LOGFS_BLOCK_FACTOR);
		logfs_handle_transaction(inode, block->ta);
		block->ops->free_block(inode->i_sb, block);
	}
	if (shadow) {
		if (shadow->old_ofs)
			btree_insert64(&tree->old, shadow->old_ofs, shadow,
					GFP_NOFS);
		else
			btree_insert64(&tree->new, shadow->new_ofs, shadow,
					GFP_NOFS);

		super->s_dirty_used_bytes += shadow->new_len;
		super->s_dirty_free_bytes += shadow->old_len;
	}
}

static void logfs_set_alias(struct super_block *sb, struct logfs_block *block,
		long child_no)
{
	struct logfs_super *super = logfs_super(sb);

	if (block->inode && block->inode->i_ino == LOGFS_INO_MASTER) {
		/* Aliases in the master inode are pointless. */
		return;
	}

	if (!test_bit(child_no, block->alias_map)) {
		set_bit(child_no, block->alias_map);
		super->s_no_object_aliases++;
	}
	list_move_tail(&block->alias_list, &super->s_object_alias);
}

/*
 * Object aliases can and often do change the size and occupied space of a
 * file. So not only do we have to change the pointers, we also have to
 * change inode->i_size and li->li_used_bytes, which is done by setting
 * another two object aliases for the inode itself.
 */
static void set_iused(struct inode *inode, struct logfs_shadow *shadow)
{
	struct logfs_inode *li = logfs_inode(inode);

	if (shadow->new_len == shadow->old_len)
		return;

	alloc_inode_block(inode);
	li->li_used_bytes += shadow->new_len - shadow->old_len;
	__logfs_set_blocks(inode);
	logfs_set_alias(inode->i_sb, li->li_block, INODE_USED_OFS);
	logfs_set_alias(inode->i_sb, li->li_block, INODE_SIZE_OFS);
}

static int logfs_write_i0(struct inode *inode, struct page *page,
		struct write_control *wc)
{
	struct logfs_shadow *shadow;
	u64 bix;
	level_t level;
	int full, err = 0;

	logfs_unpack_index(page->index, &bix, &level);
	if (wc->ofs == 0)
		if (logfs_reserve_blocks(inode, 1))
			return -ENOSPC;

	shadow = alloc_shadow(inode, bix, level, wc->ofs);
	if (wc->flags & WF_WRITE)
		err = logfs_segment_write(inode, page, shadow);
	if (wc->flags & WF_DELETE)
		logfs_segment_delete(inode, shadow);
	if (err) {
		free_shadow(inode, shadow);
		return err;
	}

	set_iused(inode, shadow);
	full = 1;
	if (level != 0) {
		alloc_indirect_block(inode, page, 0);
		full = logfs_block(page)->full == LOGFS_BLOCK_FACTOR;
	}
	fill_shadow_tree(inode, page, shadow);
	wc->ofs = shadow->new_ofs;
	if (wc->ofs && full)
		wc->ofs |= LOGFS_FULLY_POPULATED;
	return 0;
}

static int logfs_write_direct(struct inode *inode, struct page *page,
		long flags)
{
	struct logfs_inode *li = logfs_inode(inode);
	struct write_control wc = {
		.ofs = li->li_data[page->index],
		.flags = flags,
	};
	int err;

	alloc_inode_block(inode);

	err = logfs_write_i0(inode, page, &wc);
	if (err)
		return err;

	li->li_data[page->index] = wc.ofs;
	logfs_set_alias(inode->i_sb, li->li_block,
			page->index + INODE_POINTER_OFS);
	return 0;
}

static int ptr_change(u64 ofs, struct page *page)
{
	struct logfs_block *block = logfs_block(page);
	int empty0, empty1, full0, full1;

	empty0 = ofs == 0;
	empty1 = block->partial == 0;
	if (empty0 != empty1)
		return 1;

	/* The !! is necessary to shrink result to int */
	full0 = !!(ofs & LOGFS_FULLY_POPULATED);
	full1 = block->full == LOGFS_BLOCK_FACTOR;
	if (full0 != full1)
		return 1;
	return 0;
}

static int __logfs_write_rec(struct inode *inode, struct page *page,
		struct write_control *this_wc,
		pgoff_t bix, level_t target_level, level_t level)
{
	int ret, page_empty = 0;
	int child_no = get_bits(bix, SUBLEVEL(level));
	struct page *ipage;
	struct write_control child_wc = {
		.flags = this_wc->flags,
	};

	ipage = logfs_get_write_page(inode, bix, level);
	if (!ipage)
		return -ENOMEM;

	if (this_wc->ofs) {
		ret = logfs_segment_read(inode, ipage, this_wc->ofs, bix, level);
		if (ret)
			goto out;
	} else if (!PageUptodate(ipage)) {
		page_empty = 1;
		logfs_read_empty(ipage);
	}

	child_wc.ofs = block_get_pointer(ipage, child_no);

	if ((__force u8)level - 1 > (__force u8)target_level)
		ret = __logfs_write_rec(inode, page, &child_wc, bix,
				target_level, SUBLEVEL(level));
	else
		ret = logfs_write_i0(inode, page, &child_wc);

	if (ret)
		goto out;

	alloc_indirect_block(inode, ipage, page_empty);
	block_set_pointer(ipage, child_no, child_wc.ofs);
	/* FIXME: first condition seems superfluous */
	if (child_wc.ofs || logfs_block(ipage)->partial)
		this_wc->flags |= WF_WRITE;
	/* the condition on this_wc->ofs ensures that we won't consume extra
	 * space for indirect blocks in the future, which we cannot reserve */
	if (!this_wc->ofs || ptr_change(this_wc->ofs, ipage))
		ret = logfs_write_i0(inode, ipage, this_wc);
	else
		logfs_set_alias(inode->i_sb, logfs_block(ipage), child_no);
out:
	logfs_put_write_page(ipage);
	return ret;
}

static int logfs_write_rec(struct inode *inode, struct page *page,
		pgoff_t bix, level_t target_level, long flags)
{
	struct logfs_inode *li = logfs_inode(inode);
	struct write_control wc = {
		.ofs = li->li_data[INDIRECT_INDEX],
		.flags = flags,
	};
	int ret;

	alloc_inode_block(inode);

	if (li->li_height > (__force u8)target_level)
		ret = __logfs_write_rec(inode, page, &wc, bix, target_level,
				LEVEL(li->li_height));
	else
		ret = logfs_write_i0(inode, page, &wc);
	if (ret)
		return ret;

	if (li->li_data[INDIRECT_INDEX] != wc.ofs) {
		li->li_data[INDIRECT_INDEX] = wc.ofs;
		logfs_set_alias(inode->i_sb, li->li_block,
				INDIRECT_INDEX + INODE_POINTER_OFS);
	}
	return ret;
}

void logfs_add_transaction(struct inode *inode, struct logfs_transaction *ta)
{
	alloc_inode_block(inode);
	logfs_inode(inode)->li_block->ta = ta;
}

void logfs_del_transaction(struct inode *inode, struct logfs_transaction *ta)
{
	struct logfs_block *block = logfs_inode(inode)->li_block;

	if (block && block->ta)
		block->ta = NULL;
}

static int grow_inode(struct inode *inode, u64 bix, level_t level)
{
	struct logfs_inode *li = logfs_inode(inode);
	u8 height = (__force u8)level;
	struct page *page;
	struct write_control wc = {
		.flags = WF_WRITE,
	};
	int err;

	BUG_ON(height > 5 || li->li_height > 5);
	while (height > li->li_height || bix >= maxbix(li->li_height)) {
		page = logfs_get_write_page(inode, I0_BLOCKS + 1,
				LEVEL(li->li_height + 1));
		if (!page)
			return -ENOMEM;
		logfs_read_empty(page);
		alloc_indirect_block(inode, page, 1);
		block_set_pointer(page, 0, li->li_data[INDIRECT_INDEX]);
		err = logfs_write_i0(inode, page, &wc);
		logfs_put_write_page(page);
		if (err)
			return err;
		li->li_data[INDIRECT_INDEX] = wc.ofs;
		wc.ofs = 0;
		li->li_height++;
		logfs_set_alias(inode->i_sb, li->li_block, INODE_HEIGHT_OFS);
	}
	return 0;
}

static int __logfs_write_buf(struct inode *inode, struct page *page, long flags)
{
	struct logfs_super *super = logfs_super(inode->i_sb);
	pgoff_t index = page->index;
	u64 bix;
	level_t level;
	int err;

	flags |= WF_WRITE | WF_DELETE;
	inode->i_ctime = inode->i_mtime = CURRENT_TIME;

	logfs_unpack_index(index, &bix, &level);
	if (logfs_block(page) && logfs_block(page)->reserved_bytes)
		super->s_dirty_pages -= logfs_block(page)->reserved_bytes;

	if (index < I0_BLOCKS)
		return logfs_write_direct(inode, page, flags);

	bix = adjust_bix(bix, level);
	err = grow_inode(inode, bix, level);
	if (err)
		return err;
	return logfs_write_rec(inode, page, bix, level, flags);
}

int logfs_write_buf(struct inode *inode, struct page *page, long flags)
{
	struct super_block *sb = inode->i_sb;
	int ret;

	logfs_get_wblocks(sb, page, flags & WF_LOCK);
	ret = __logfs_write_buf(inode, page, flags);
	logfs_put_wblocks(sb, page, flags & WF_LOCK);
	return ret;
}

static int __logfs_delete(struct inode *inode, struct page *page)
{
	long flags = WF_DELETE;

	inode->i_ctime = inode->i_mtime = CURRENT_TIME;

	if (page->index < I0_BLOCKS)
		return logfs_write_direct(inode, page, flags);
	return logfs_write_rec(inode, page, page->index, 0, flags);
}

int logfs_delete(struct inode *inode, pgoff_t index,
		struct shadow_tree *shadow_tree)
{
	struct super_block *sb = inode->i_sb;
	struct page *page;
	int ret;

	page = logfs_get_read_page(inode, index, 0);
	if (!page)
		return -ENOMEM;

	logfs_get_wblocks(sb, page, 1);
	ret = __logfs_delete(inode, page);
	logfs_put_wblocks(sb, page, 1);

	logfs_put_read_page(page);

	return ret;
}

/* Rewrite cannot mark the inode dirty but has to write it immediately. */
int logfs_rewrite_block(struct inode *inode, u64 bix, u64 ofs,
		gc_level_t gc_level, long flags)
{
	level_t level = shrink_level(gc_level);
	struct page *page;
	int err;

	page = logfs_get_write_page(inode, bix, level);
	if (!page)
		return -ENOMEM;

	err = logfs_segment_read(inode, page, ofs, bix, level);
	if (!err) {
		if (level != 0)
			alloc_indirect_block(inode, page, 0);
		err = logfs_write_buf(inode, page, flags);
	}
	logfs_put_write_page(page);
	return err;
}

static int truncate_data_block(struct inode *inode, struct page *page,
		u64 ofs, struct logfs_shadow *shadow, u64 size)
{
	loff_t pageofs = page->index << inode->i_sb->s_blocksize_bits;
	u64 bix;
	level_t level;
	int err;

	/* Does truncation happen within this page? */
	if (size <= pageofs || size - pageofs >= PAGE_SIZE)
		return 0;

	logfs_unpack_index(page->index, &bix, &level);
	BUG_ON(level != 0);

	err = logfs_segment_read(inode, page, ofs, bix, level);
	if (err)
		return err;

	zero_user_segment(page, size - pageofs, PAGE_CACHE_SIZE);
	return logfs_segment_write(inode, page, shadow);
}

static int logfs_truncate_i0(struct inode *inode, struct page *page,
		struct write_control *wc, u64 size)
{
	struct logfs_shadow *shadow;
	u64 bix;
	level_t level;
	int err = 0;

	logfs_unpack_index(page->index, &bix, &level);
	BUG_ON(level != 0);
	shadow = alloc_shadow(inode, bix, level, wc->ofs);

	err = truncate_data_block(inode, page, wc->ofs, shadow, size);
	if (err) {
		free_shadow(inode, shadow);
		return err;
	}

logfs_segment_delete(inode, shadow);16611661+ set_iused(inode, shadow);16621662+ fill_shadow_tree(inode, page, shadow);16631663+ wc->ofs = shadow->new_ofs;16641664+ return 0;16651665+}16661666+16671667+static int logfs_truncate_direct(struct inode *inode, u64 size)16681668+{16691669+ struct logfs_inode *li = logfs_inode(inode);16701670+ struct write_control wc;16711671+ struct page *page;16721672+ int e;16731673+ int err;16741674+16751675+ alloc_inode_block(inode);16761676+16771677+ for (e = I0_BLOCKS - 1; e >= 0; e--) {16781678+ if (size > (e+1) * LOGFS_BLOCKSIZE)16791679+ break;16801680+16811681+ wc.ofs = li->li_data[e];16821682+ if (!wc.ofs)16831683+ continue;16841684+16851685+ page = logfs_get_write_page(inode, e, 0);16861686+ if (!page)16871687+ return -ENOMEM;16881688+ err = logfs_segment_read(inode, page, wc.ofs, e, 0);16891689+ if (err) {16901690+ logfs_put_write_page(page);16911691+ return err;16921692+ }16931693+ err = logfs_truncate_i0(inode, page, &wc, size);16941694+ logfs_put_write_page(page);16951695+ if (err)16961696+ return err;16971697+16981698+ li->li_data[e] = wc.ofs;16991699+ }17001700+ return 0;17011701+}17021702+17031703+/* FIXME: these need to become per-sb once we support different blocksizes */17041704+static u64 __logfs_step[] = {17051705+ 1,17061706+ I1_BLOCKS,17071707+ I2_BLOCKS,17081708+ I3_BLOCKS,17091709+};17101710+17111711+static u64 __logfs_start_index[] = {17121712+ I0_BLOCKS,17131713+ I1_BLOCKS,17141714+ I2_BLOCKS,17151715+ I3_BLOCKS17161716+};17171717+17181718+static inline u64 logfs_step(level_t level)17191719+{17201720+ return __logfs_step[(__force u8)level];17211721+}17221722+17231723+static inline u64 logfs_factor(u8 level)17241724+{17251725+ return __logfs_step[level] * LOGFS_BLOCKSIZE;17261726+}17271727+17281728+static inline u64 logfs_start_index(level_t level)17291729+{17301730+ return __logfs_start_index[(__force u8)level];17311731+}17321732+17331733+static void logfs_unpack_raw_index(pgoff_t index, u64 *bix, level_t 
*level)17341734+{17351735+ logfs_unpack_index(index, bix, level);17361736+ if (*bix <= logfs_start_index(SUBLEVEL(*level)))17371737+ *bix = 0;17381738+}17391739+17401740+static int __logfs_truncate_rec(struct inode *inode, struct page *ipage,17411741+ struct write_control *this_wc, u64 size)17421742+{17431743+ int truncate_happened = 0;17441744+ int e, err = 0;17451745+ u64 bix, child_bix, next_bix;17461746+ level_t level;17471747+ struct page *page;17481748+ struct write_control child_wc = { /* FIXME: flags */ };17491749+17501750+ logfs_unpack_raw_index(ipage->index, &bix, &level);17511751+ err = logfs_segment_read(inode, ipage, this_wc->ofs, bix, level);17521752+ if (err)17531753+ return err;17541754+17551755+ for (e = LOGFS_BLOCK_FACTOR - 1; e >= 0; e--) {17561756+ child_bix = bix + e * logfs_step(SUBLEVEL(level));17571757+ next_bix = child_bix + logfs_step(SUBLEVEL(level));17581758+ if (size > next_bix * LOGFS_BLOCKSIZE)17591759+ break;17601760+17611761+ child_wc.ofs = pure_ofs(block_get_pointer(ipage, e));17621762+ if (!child_wc.ofs)17631763+ continue;17641764+17651765+ page = logfs_get_write_page(inode, child_bix, SUBLEVEL(level));17661766+ if (!page)17671767+ return -ENOMEM;17681768+17691769+ if ((__force u8)level > 1)17701770+ err = __logfs_truncate_rec(inode, page, &child_wc, size);17711771+ else17721772+ err = logfs_truncate_i0(inode, page, &child_wc, size);17731773+ logfs_put_write_page(page);17741774+ if (err)17751775+ return err;17761776+17771777+ truncate_happened = 1;17781778+ alloc_indirect_block(inode, ipage, 0);17791779+ block_set_pointer(ipage, e, child_wc.ofs);17801780+ }17811781+17821782+ if (!truncate_happened) {17831783+ printk("ineffectual truncate (%lx, %lx, %llx)\n", inode->i_ino, ipage->index, size);17841784+ return 0;17851785+ }17861786+17871787+ this_wc->flags = WF_DELETE;17881788+ if (logfs_block(ipage)->partial)17891789+ this_wc->flags |= WF_WRITE;17901790+17911791+ return logfs_write_i0(inode, ipage, 
this_wc);17921792+}17931793+17941794+static int logfs_truncate_rec(struct inode *inode, u64 size)17951795+{17961796+ struct logfs_inode *li = logfs_inode(inode);17971797+ struct write_control wc = {17981798+ .ofs = li->li_data[INDIRECT_INDEX],17991799+ };18001800+ struct page *page;18011801+ int err;18021802+18031803+ alloc_inode_block(inode);18041804+18051805+ if (!wc.ofs)18061806+ return 0;18071807+18081808+ page = logfs_get_write_page(inode, 0, LEVEL(li->li_height));18091809+ if (!page)18101810+ return -ENOMEM;18111811+18121812+ err = __logfs_truncate_rec(inode, page, &wc, size);18131813+ logfs_put_write_page(page);18141814+ if (err)18151815+ return err;18161816+18171817+ if (li->li_data[INDIRECT_INDEX] != wc.ofs)18181818+ li->li_data[INDIRECT_INDEX] = wc.ofs;18191819+ return 0;18201820+}18211821+18221822+static int __logfs_truncate(struct inode *inode, u64 size)18231823+{18241824+ int ret;18251825+18261826+ if (size >= logfs_factor(logfs_inode(inode)->li_height))18271827+ return 0;18281828+18291829+ ret = logfs_truncate_rec(inode, size);18301830+ if (ret)18311831+ return ret;18321832+18331833+ return logfs_truncate_direct(inode, size);18341834+}18351835+18361836+int logfs_truncate(struct inode *inode, u64 size)18371837+{18381838+ struct super_block *sb = inode->i_sb;18391839+ int err;18401840+18411841+ logfs_get_wblocks(sb, NULL, 1);18421842+ err = __logfs_truncate(inode, size);18431843+ if (!err)18441844+ err = __logfs_write_inode(inode, 0);18451845+ logfs_put_wblocks(sb, NULL, 1);18461846+18471847+ if (!err)18481848+ err = vmtruncate(inode, size);18491849+18501850+ /* I don't trust error recovery yet. 
*/18511851+ WARN_ON(err);18521852+ return err;18531853+}18541854+18551855+static void move_page_to_inode(struct inode *inode, struct page *page)18561856+{18571857+ struct logfs_inode *li = logfs_inode(inode);18581858+ struct logfs_block *block = logfs_block(page);18591859+18601860+ if (!block)18611861+ return;18621862+18631863+ log_blockmove("move_page_to_inode(%llx, %llx, %x)\n",18641864+ block->ino, block->bix, block->level);18651865+ BUG_ON(li->li_block);18661866+ block->ops = &inode_block_ops;18671867+ block->inode = inode;18681868+ li->li_block = block;18691869+18701870+ block->page = NULL;18711871+ page->private = 0;18721872+ ClearPagePrivate(page);18731873+}18741874+18751875+static void move_inode_to_page(struct page *page, struct inode *inode)18761876+{18771877+ struct logfs_inode *li = logfs_inode(inode);18781878+ struct logfs_block *block = li->li_block;18791879+18801880+ if (!block)18811881+ return;18821882+18831883+ log_blockmove("move_inode_to_page(%llx, %llx, %x)\n",18841884+ block->ino, block->bix, block->level);18851885+ BUG_ON(PagePrivate(page));18861886+ block->ops = &indirect_block_ops;18871887+ block->page = page;18881888+ page->private = (unsigned long)block;18891889+ SetPagePrivate(page);18901890+18911891+ block->inode = NULL;18921892+ li->li_block = NULL;18931893+}18941894+18951895+int logfs_read_inode(struct inode *inode)18961896+{18971897+ struct super_block *sb = inode->i_sb;18981898+ struct logfs_super *super = logfs_super(sb);18991899+ struct inode *master_inode = super->s_master_inode;19001900+ struct page *page;19011901+ struct logfs_disk_inode *di;19021902+ u64 ino = inode->i_ino;19031903+19041904+ if (ino << sb->s_blocksize_bits > i_size_read(master_inode))19051905+ return -ENODATA;19061906+ if (!logfs_exist_block(master_inode, ino))19071907+ return -ENODATA;19081908+19091909+ page = read_cache_page(master_inode->i_mapping, ino,19101910+ (filler_t *)logfs_readpage, NULL);19111911+ if (IS_ERR(page))19121912+ return 
PTR_ERR(page);19131913+19141914+ di = kmap_atomic(page, KM_USER0);19151915+ logfs_disk_to_inode(di, inode);19161916+ kunmap_atomic(di, KM_USER0);19171917+ move_page_to_inode(inode, page);19181918+ page_cache_release(page);19191919+ return 0;19201920+}19211921+19221922+/* Caller must logfs_put_write_page(page); */19231923+static struct page *inode_to_page(struct inode *inode)19241924+{19251925+ struct inode *master_inode = logfs_super(inode->i_sb)->s_master_inode;19261926+ struct logfs_disk_inode *di;19271927+ struct page *page;19281928+19291929+ BUG_ON(inode->i_ino == LOGFS_INO_MASTER);19301930+19311931+ page = logfs_get_write_page(master_inode, inode->i_ino, 0);19321932+ if (!page)19331933+ return NULL;19341934+19351935+ di = kmap_atomic(page, KM_USER0);19361936+ logfs_inode_to_disk(inode, di);19371937+ kunmap_atomic(di, KM_USER0);19381938+ move_inode_to_page(page, inode);19391939+ return page;19401940+}19411941+19421942+/* Cheaper version of write_inode. All changes are concealed in19431943+ * aliases, which are moved back. 
No write to the medium happens.19441944+ */19451945+void logfs_clear_inode(struct inode *inode)19461946+{19471947+ struct super_block *sb = inode->i_sb;19481948+ struct logfs_inode *li = logfs_inode(inode);19491949+ struct logfs_block *block = li->li_block;19501950+ struct page *page;19511951+19521952+ /* Only deleted files may be dirty at this point */19531953+ BUG_ON(inode->i_state & I_DIRTY && inode->i_nlink);19541954+ if (!block)19551955+ return;19561956+ if ((logfs_super(sb)->s_flags & LOGFS_SB_FLAG_SHUTDOWN)) {19571957+ block->ops->free_block(inode->i_sb, block);19581958+ return;19591959+ }19601960+19611961+ BUG_ON(inode->i_ino < LOGFS_RESERVED_INOS);19621962+ page = inode_to_page(inode);19631963+ BUG_ON(!page); /* FIXME: Use emergency page */19641964+ logfs_put_write_page(page);19651965+}19661966+19671967+static int do_write_inode(struct inode *inode)19681968+{19691969+ struct super_block *sb = inode->i_sb;19701970+ struct inode *master_inode = logfs_super(sb)->s_master_inode;19711971+ loff_t size = (inode->i_ino + 1) << inode->i_sb->s_blocksize_bits;19721972+ struct page *page;19731973+ int err;19741974+19751975+ BUG_ON(inode->i_ino == LOGFS_INO_MASTER);19761976+ /* FIXME: lock inode */19771977+19781978+ if (i_size_read(master_inode) < size)19791979+ i_size_write(master_inode, size);19801980+19811981+ /* TODO: Tell vfs this inode is clean now */19821982+19831983+ page = inode_to_page(inode);19841984+ if (!page)19851985+ return -ENOMEM;19861986+19871987+ /* FIXME: transaction is part of logfs_block now. Is that enough? 
*/19881988+ err = logfs_write_buf(master_inode, page, 0);19891989+ logfs_put_write_page(page);19901990+ return err;19911991+}19921992+19931993+static void logfs_mod_segment_entry(struct super_block *sb, u32 segno,19941994+ int write,19951995+ void (*change_se)(struct logfs_segment_entry *, long),19961996+ long arg)19971997+{19981998+ struct logfs_super *super = logfs_super(sb);19991999+ struct inode *inode;20002000+ struct page *page;20012001+ struct logfs_segment_entry *se;20022002+ pgoff_t page_no;20032003+ int child_no;20042004+20052005+ page_no = segno >> (sb->s_blocksize_bits - 3);20062006+ child_no = segno & ((sb->s_blocksize >> 3) - 1);20072007+20082008+ inode = super->s_segfile_inode;20092009+ page = logfs_get_write_page(inode, page_no, 0);20102010+ BUG_ON(!page); /* FIXME: We need some reserve page for this case */20112011+ if (!PageUptodate(page))20122012+ logfs_read_block(inode, page, WRITE);20132013+20142014+ if (write)20152015+ alloc_indirect_block(inode, page, 0);20162016+ se = kmap_atomic(page, KM_USER0);20172017+ change_se(se + child_no, arg);20182018+ if (write) {20192019+ logfs_set_alias(sb, logfs_block(page), child_no);20202020+ BUG_ON((int)be32_to_cpu(se[child_no].valid) > super->s_segsize);20212021+ }20222022+ kunmap_atomic(se, KM_USER0);20232023+20242024+ logfs_put_write_page(page);20252025+}20262026+20272027+static void __get_segment_entry(struct logfs_segment_entry *se, long _target)20282028+{20292029+ struct logfs_segment_entry *target = (void *)_target;20302030+20312031+ *target = *se;20322032+}20332033+20342034+void logfs_get_segment_entry(struct super_block *sb, u32 segno,20352035+ struct logfs_segment_entry *se)20362036+{20372037+ logfs_mod_segment_entry(sb, segno, 0, __get_segment_entry, (long)se);20382038+}20392039+20402040+static void __set_segment_used(struct logfs_segment_entry *se, long increment)20412041+{20422042+ u32 valid;20432043+20442044+ valid = be32_to_cpu(se->valid);20452045+ valid += increment;20462046+ se->valid = 
cpu_to_be32(valid);20472047+}20482048+20492049+void logfs_set_segment_used(struct super_block *sb, u64 ofs, int increment)20502050+{20512051+ struct logfs_super *super = logfs_super(sb);20522052+ u32 segno = ofs >> super->s_segshift;20532053+20542054+ if (!increment)20552055+ return;20562056+20572057+ logfs_mod_segment_entry(sb, segno, 1, __set_segment_used, increment);20582058+}20592059+20602060+static void __set_segment_erased(struct logfs_segment_entry *se, long ec_level)20612061+{20622062+ se->ec_level = cpu_to_be32(ec_level);20632063+}20642064+20652065+void logfs_set_segment_erased(struct super_block *sb, u32 segno, u32 ec,20662066+ gc_level_t gc_level)20672067+{20682068+ u32 ec_level = ec << 4 | (__force u8)gc_level;20692069+20702070+ logfs_mod_segment_entry(sb, segno, 1, __set_segment_erased, ec_level);20712071+}20722072+20732073+static void __set_segment_reserved(struct logfs_segment_entry *se, long ignore)20742074+{20752075+ se->valid = cpu_to_be32(RESERVED);20762076+}20772077+20782078+void logfs_set_segment_reserved(struct super_block *sb, u32 segno)20792079+{20802080+ logfs_mod_segment_entry(sb, segno, 1, __set_segment_reserved, 0);20812081+}20822082+20832083+static void __set_segment_unreserved(struct logfs_segment_entry *se,20842084+ long ec_level)20852085+{20862086+ se->valid = 0;20872087+ se->ec_level = cpu_to_be32(ec_level);20882088+}20892089+20902090+void logfs_set_segment_unreserved(struct super_block *sb, u32 segno, u32 ec)20912091+{20922092+ u32 ec_level = ec << 4;20932093+20942094+ logfs_mod_segment_entry(sb, segno, 1, __set_segment_unreserved,20952095+ ec_level);20962096+}20972097+20982098+int __logfs_write_inode(struct inode *inode, long flags)20992099+{21002100+ struct super_block *sb = inode->i_sb;21012101+ int ret;21022102+21032103+ logfs_get_wblocks(sb, NULL, flags & WF_LOCK);21042104+ ret = do_write_inode(inode);21052105+ logfs_put_wblocks(sb, NULL, flags & WF_LOCK);21062106+ return ret;21072107+}21082108+21092109+static int 
do_delete_inode(struct inode *inode)21102110+{21112111+ struct super_block *sb = inode->i_sb;21122112+ struct inode *master_inode = logfs_super(sb)->s_master_inode;21132113+ struct page *page;21142114+ int ret;21152115+21162116+ page = logfs_get_write_page(master_inode, inode->i_ino, 0);21172117+ if (!page)21182118+ return -ENOMEM;21192119+21202120+ move_inode_to_page(page, inode);21212121+21222122+ logfs_get_wblocks(sb, page, 1);21232123+ ret = __logfs_delete(master_inode, page);21242124+ logfs_put_wblocks(sb, page, 1);21252125+21262126+ logfs_put_write_page(page);21272127+ return ret;21282128+}21292129+21302130+/*21312131+ * ZOMBIE inodes have already been deleted before and should remain dead,21322132+ * if it weren't for valid checking. No need to kill them again here.21332133+ */21342134+void logfs_delete_inode(struct inode *inode)21352135+{21362136+ struct logfs_inode *li = logfs_inode(inode);21372137+21382138+ if (!(li->li_flags & LOGFS_IF_ZOMBIE)) {21392139+ li->li_flags |= LOGFS_IF_ZOMBIE;21402140+ if (i_size_read(inode) > 0)21412141+ logfs_truncate(inode, 0);21422142+ do_delete_inode(inode);21432143+ }21442144+ truncate_inode_pages(&inode->i_data, 0);21452145+ clear_inode(inode);21462146+}21472147+21482148+void btree_write_block(struct logfs_block *block)21492149+{21502150+ struct inode *inode;21512151+ struct page *page;21522152+ int err, cookie;21532153+21542154+ inode = logfs_safe_iget(block->sb, block->ino, &cookie);21552155+ page = logfs_get_write_page(inode, block->bix, block->level);21562156+21572157+ err = logfs_readpage_nolock(page);21582158+ BUG_ON(err);21592159+ BUG_ON(!PagePrivate(page));21602160+ BUG_ON(logfs_block(page) != block);21612161+ err = __logfs_write_buf(inode, page, 0);21622162+ BUG_ON(err);21632163+ BUG_ON(PagePrivate(page) || page->private);21642164+21652165+ logfs_put_write_page(page);21662166+ logfs_safe_iput(inode, cookie);21672167+}21682168+21692169+/**21702170+ * logfs_inode_write - write inode or dentry objects21712171+ 
*21722172+ * @inode: parent inode (ifile or directory)21732173+ * @buf: object to write (inode or dentry)21742174+ * @n: object size21752175+ * @_pos: object number (file position in blocks/objects)21762176+ * @flags: write flags21772177+ * @lock: 0 if write lock is already taken, 1 otherwise21782178+ * @shadow_tree: shadow below this inode21792179+ *21802180+ * FIXME: All caller of this put a 200-300 byte variable on the stack,21812181+ * only to call here and do a memcpy from that stack variable. A good21822182+ * example of wasted performance and stack space.21832183+ */21842184+int logfs_inode_write(struct inode *inode, const void *buf, size_t count,21852185+ loff_t bix, long flags, struct shadow_tree *shadow_tree)21862186+{21872187+ loff_t pos = bix << inode->i_sb->s_blocksize_bits;21882188+ int err;21892189+ struct page *page;21902190+ void *pagebuf;21912191+21922192+ BUG_ON(pos & (LOGFS_BLOCKSIZE-1));21932193+ BUG_ON(count > LOGFS_BLOCKSIZE);21942194+ page = logfs_get_write_page(inode, bix, 0);21952195+ if (!page)21962196+ return -ENOMEM;21972197+21982198+ pagebuf = kmap_atomic(page, KM_USER0);21992199+ memcpy(pagebuf, buf, count);22002200+ flush_dcache_page(page);22012201+ kunmap_atomic(pagebuf, KM_USER0);22022202+22032203+ if (i_size_read(inode) < pos + LOGFS_BLOCKSIZE)22042204+ i_size_write(inode, pos + LOGFS_BLOCKSIZE);22052205+22062206+ err = logfs_write_buf(inode, page, flags);22072207+ logfs_put_write_page(page);22082208+ return err;22092209+}22102210+22112211+int logfs_open_segfile(struct super_block *sb)22122212+{22132213+ struct logfs_super *super = logfs_super(sb);22142214+ struct inode *inode;22152215+22162216+ inode = logfs_read_meta_inode(sb, LOGFS_INO_SEGFILE);22172217+ if (IS_ERR(inode))22182218+ return PTR_ERR(inode);22192219+ super->s_segfile_inode = inode;22202220+ return 0;22212221+}22222222+22232223+int logfs_init_rw(struct super_block *sb)22242224+{22252225+ struct logfs_super *super = logfs_super(sb);22262226+ int min_fill = 3 * 
super->s_no_blocks;22272227+22282228+ INIT_LIST_HEAD(&super->s_object_alias);22292229+ mutex_init(&super->s_write_mutex);22302230+ super->s_block_pool = mempool_create_kmalloc_pool(min_fill,22312231+ sizeof(struct logfs_block));22322232+ super->s_shadow_pool = mempool_create_kmalloc_pool(min_fill,22332233+ sizeof(struct logfs_shadow));22342234+ return 0;22352235+}22362236+22372237+void logfs_cleanup_rw(struct super_block *sb)22382238+{22392239+ struct logfs_super *super = logfs_super(sb);22402240+22412241+ destroy_meta_inode(super->s_segfile_inode);22422242+ if (super->s_block_pool)22432243+ mempool_destroy(super->s_block_pool);22442244+ if (super->s_shadow_pool)22452245+ mempool_destroy(super->s_shadow_pool);22462246+}
+924
fs/logfs/segment.c
/*
 * fs/logfs/segment.c	- Handling the Object Store
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 *
 * Object store or ostore makes up the complete device with the exception of
 * the superblock and journal areas.  Apart from its own metadata it stores
 * three kinds of objects: inodes, dentries and blocks, both data and indirect.
 */
#include "logfs.h"

static int logfs_mark_segment_bad(struct super_block *sb, u32 segno)
{
	struct logfs_super *super = logfs_super(sb);
	struct btree_head32 *head = &super->s_reserved_segments;
	int err;

	err = btree_insert32(head, segno, (void *)1, GFP_NOFS);
	if (err)
		return err;
	logfs_super(sb)->s_bad_segments++;
	/* FIXME: write to journal */
	return 0;
}

int logfs_erase_segment(struct super_block *sb, u32 segno)
{
	struct logfs_super *super = logfs_super(sb);

	super->s_gec++;

	return super->s_devops->erase(sb, (u64)segno << super->s_segshift,
			super->s_segsize);
}

static s64 logfs_get_free_bytes(struct logfs_area *area, size_t bytes)
{
	s32 ofs;

	logfs_open_area(area, bytes);

	ofs = area->a_used_bytes;
	area->a_used_bytes += bytes;
	BUG_ON(area->a_used_bytes >= logfs_super(area->a_sb)->s_segsize);

	return dev_ofs(area->a_sb, area->a_segno, ofs);
}

static struct page *get_mapping_page(struct super_block *sb, pgoff_t index,
		int use_filler)
{
	struct logfs_super *super = logfs_super(sb);
	struct address_space *mapping = super->s_mapping_inode->i_mapping;
	filler_t *filler = super->s_devops->readpage;
	struct page *page;

	BUG_ON(mapping_gfp_mask(mapping) & __GFP_FS);
	if (use_filler)
		page = read_cache_page(mapping, index, filler, sb);
	else {
		page = find_or_create_page(mapping, index, GFP_NOFS);
		unlock_page(page);
	}
	return page;
}

void __logfs_buf_write(struct logfs_area *area, u64 ofs, void *buf, size_t len,
		int use_filler)
{
	pgoff_t index = ofs >> PAGE_SHIFT;
	struct page *page;
	long offset = ofs & (PAGE_SIZE-1);
	long copylen;

	/* Only logfs_wbuf_recover may use len==0 */
	BUG_ON(!len && !use_filler);
	do {
		copylen = min((ulong)len, PAGE_SIZE - offset);

		page = get_mapping_page(area->a_sb, index, use_filler);
		SetPageUptodate(page);
		BUG_ON(!page); /* FIXME: reserve a pool */
		memcpy(page_address(page) + offset, buf, copylen);
		SetPagePrivate(page);
		page_cache_release(page);

		buf += copylen;
		len -= copylen;
		offset = 0;
		index++;
	} while (len);
}

/*
 * bdev_writeseg will write full pages.  Memset the tail to prevent data leaks.
 */
static void pad_wbuf(struct logfs_area *area, int final)
{
	struct super_block *sb = area->a_sb;
	struct logfs_super *super = logfs_super(sb);
	struct page *page;
	u64 ofs = dev_ofs(sb, area->a_segno, area->a_used_bytes);
	pgoff_t index = ofs >> PAGE_SHIFT;
	long offset = ofs & (PAGE_SIZE-1);
	u32 len = PAGE_SIZE - offset;

	if (len == PAGE_SIZE) {
		/* The math in this function can surely use some love */
		len = 0;
	}
	if (len) {
		BUG_ON(area->a_used_bytes >= super->s_segsize);

		page = get_mapping_page(area->a_sb, index, 0);
		BUG_ON(!page); /* FIXME: reserve a pool */
		memset(page_address(page) + offset, 0xff, len);
		SetPagePrivate(page);
		page_cache_release(page);
	}

	if (!final)
		return;

	area->a_used_bytes += len;
	for ( ; area->a_used_bytes < super->s_segsize;
			area->a_used_bytes += PAGE_SIZE) {
		/* Memset another page */
		index++;
		page = get_mapping_page(area->a_sb, index, 0);
		BUG_ON(!page); /* FIXME: reserve a pool */
		memset(page_address(page), 0xff, PAGE_SIZE);
		SetPagePrivate(page);
		page_cache_release(page);
	}
}

/*
 * We have to be careful with the alias tree.  Since lookup is done by bix,
 * it needs to be normalized, so 14, 15, 16, etc. all match when dealing with
 * indirect blocks.  So always use it through accessor functions.
 */
static void *alias_tree_lookup(struct super_block *sb, u64 ino, u64 bix,
		level_t level)
{
	struct btree_head128 *head = &logfs_super(sb)->s_object_alias_tree;
	pgoff_t index = logfs_pack_index(bix, level);

	return btree_lookup128(head, ino, index);
}

static int alias_tree_insert(struct super_block *sb, u64 ino, u64 bix,
		level_t level, void *val)
{
	struct btree_head128 *head = &logfs_super(sb)->s_object_alias_tree;
	pgoff_t index = logfs_pack_index(bix, level);

	return btree_insert128(head, ino, index, val, GFP_NOFS);
}

static int btree_write_alias(struct super_block *sb, struct logfs_block *block,
		write_alias_t *write_one_alias)
{
	struct object_alias_item *item;
	int err;

	list_for_each_entry(item, &block->item_list, list) {
		err = write_alias_journal(sb, block->ino, block->bix,
				block->level, item->child_no, item->val);
		if (err)
			return err;
	}
	return 0;
}

static gc_level_t btree_block_level(struct logfs_block *block)
{
	return expand_level(block->ino, block->level);
}

static struct logfs_block_ops btree_block_ops = {
	.write_block	= btree_write_block,
	.block_level	= btree_block_level,
	.free_block	= __free_block,
	.write_alias	= btree_write_alias,
};

int logfs_load_object_aliases(struct super_block *sb,
		struct logfs_obj_alias *oa, int count)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_block *block;
	struct object_alias_item *item;
	u64 ino, bix;
	level_t level;
	int i, err;

	super->s_flags |= LOGFS_SB_FLAG_OBJ_ALIAS;
	count /= sizeof(*oa);
	for (i = 0; i < count; i++) {
		item = mempool_alloc(super->s_alias_pool, GFP_NOFS);
		if (!item)
			return -ENOMEM;
		memset(item, 0, sizeof(*item));

		super->s_no_object_aliases++;
		item->val = oa[i].val;
		item->child_no = be16_to_cpu(oa[i].child_no);

		ino = be64_to_cpu(oa[i].ino);
		bix = be64_to_cpu(oa[i].bix);
		level = LEVEL(oa[i].level);

		log_aliases("logfs_load_object_aliases(%llx, %llx, %x, %x) %llx\n",
				ino, bix, level, item->child_no,
				be64_to_cpu(item->val));
		block = alias_tree_lookup(sb, ino, bix, level);
		if (!block) {
			block = __alloc_block(sb, ino, bix, level);
			block->ops = &btree_block_ops;
			err = alias_tree_insert(sb, ino, bix, level, block);
			BUG_ON(err); /* mempool empty */
		}
		if (test_and_set_bit(item->child_no, block->alias_map)) {
			printk(KERN_ERR"LogFS: Alias collision detected\n");
			return -EIO;
		}
		list_move_tail(&block->alias_list, &super->s_object_alias);
		list_add(&item->list, &block->item_list);
	}
	return 0;
}

static void kill_alias(void *_block, unsigned long ignore0,
		u64 ignore1, u64 ignore2, size_t ignore3)
{
	struct logfs_block *block = _block;
	struct super_block *sb = block->sb;
	struct logfs_super *super = logfs_super(sb);
	struct object_alias_item *item;

	while (!list_empty(&block->item_list)) {
		item = list_entry(block->item_list.next, typeof(*item), list);
		list_del(&item->list);
		mempool_free(item, super->s_alias_pool);
	}
	block->ops->free_block(sb, block);
}

static int obj_type(struct inode *inode, level_t level)
{
	if (level == 0) {
		if (S_ISDIR(inode->i_mode))
			return OBJ_DENTRY;
		if (inode->i_ino == LOGFS_INO_MASTER)
			return OBJ_INODE;
	}
	return OBJ_BLOCK;
}

static int obj_len(struct super_block *sb, int obj_type)
{
	switch (obj_type) {
	case OBJ_DENTRY:
		return sizeof(struct logfs_disk_dentry);
	case OBJ_INODE:
		return sizeof(struct logfs_disk_inode);
	case OBJ_BLOCK:
		return sb->s_blocksize;
	default:
		BUG();
	}
}

static int __logfs_segment_write(struct inode *inode, void *buf,
		struct logfs_shadow *shadow, int type, int len, int compr)
{
	struct logfs_area *area;
	struct super_block *sb = inode->i_sb;
	s64 ofs;
	struct logfs_object_header h;
	int acc_len;

	if (shadow->gc_level == 0)
		acc_len = len;
	else
		acc_len = obj_len(sb, type);

	area = get_area(sb, shadow->gc_level);
	ofs = logfs_get_free_bytes(area, len + LOGFS_OBJECT_HEADERSIZE);
	LOGFS_BUG_ON(ofs <= 0, sb);
	/*
	 * Order is important.  logfs_get_free_bytes(), by modifying the
	 * segment file, may modify the content of the very page we're about
	 * to write now.  Which is fine, as long as the calculated crc and
	 * written data still match.  So do the modifications _before_
	 * calculating the crc.
	 */

	h.len	= cpu_to_be16(len);
	h.type	= type;
	h.compr	= compr;
	h.ino	= cpu_to_be64(inode->i_ino);
	h.bix	= cpu_to_be64(shadow->bix);
	h.crc	= logfs_crc32(&h, sizeof(h) - 4, 4);
	h.data_crc = logfs_crc32(buf, len, 0);

	logfs_buf_write(area, ofs, &h, sizeof(h));
	logfs_buf_write(area, ofs + LOGFS_OBJECT_HEADERSIZE, buf, len);

	shadow->new_ofs = ofs;
	shadow->new_len = acc_len + LOGFS_OBJECT_HEADERSIZE;

	return 0;
}

static s64 logfs_segment_write_compress(struct inode *inode, void *buf,
		struct logfs_shadow *shadow, int type, int len)
{
	struct super_block *sb = inode->i_sb;
	void *compressor_buf = logfs_super(sb)->s_compressed_je;
	ssize_t compr_len;
	int ret;

	mutex_lock(&logfs_super(sb)->s_journal_mutex);
	compr_len = logfs_compress(buf, compressor_buf, len, len);

	if (compr_len >= 0) {
		ret = __logfs_segment_write(inode, compressor_buf, shadow,
				type, compr_len, COMPR_ZLIB);
	} else {
		ret = __logfs_segment_write(inode, buf, shadow, type, len,
				COMPR_NONE);
	}
	mutex_unlock(&logfs_super(sb)->s_journal_mutex);
	return ret;
}

/**
 * logfs_segment_write - write data block to object store
 * @inode:		inode containing data
 *
 * Returns an errno or zero.
 */
int logfs_segment_write(struct inode *inode, struct page *page,
		struct logfs_shadow *shadow)
{
	struct super_block *sb = inode->i_sb;
	struct logfs_super *super = logfs_super(sb);
	int do_compress, type, len;
	int ret;
	void *buf;

	BUG_ON(logfs_super(sb)->s_flags & LOGFS_SB_FLAG_SHUTDOWN);
	do_compress = logfs_inode(inode)->li_flags & LOGFS_IF_COMPRESSED;
	if (shadow->gc_level != 0) {
		/* temporarily disable compression for indirect blocks */
		do_compress = 0;
	}

	type = obj_type(inode, shrink_level(shadow->gc_level));
	len = obj_len(sb, type);
	buf = kmap(page);
	if (do_compress)
		ret = logfs_segment_write_compress(inode, buf, shadow, type,
				len);
	else
		ret = __logfs_segment_write(inode, buf, shadow, type, len,
				COMPR_NONE);
	kunmap(page);

	log_segment("logfs_segment_write(%llx, %llx, %x) %llx->%llx %x->%x\n",
			shadow->ino, shadow->bix, shadow->gc_level,
			shadow->old_ofs, shadow->new_ofs,
			shadow->old_len, shadow->new_len);
	/* this BUG_ON did catch a locking bug. useful */
	BUG_ON(!(shadow->new_ofs & (super->s_segsize - 1)));
	return ret;
}

int wbuf_read(struct super_block *sb, u64 ofs, size_t len, void *buf)
{
	pgoff_t index = ofs >> PAGE_SHIFT;
	struct page *page;
	long offset = ofs & (PAGE_SIZE-1);
	long copylen;

	while (len) {
		copylen = min((ulong)len, PAGE_SIZE - offset);

		page = get_mapping_page(sb, index, 1);
		if (IS_ERR(page))
			return PTR_ERR(page);
		memcpy(buf, page_address(page) + offset, copylen);
		page_cache_release(page);

		buf += copylen;
		len -= copylen;
		offset = 0;
		index++;
	}
	return 0;
}

/*
 * The "position" of indirect blocks is ambiguous.  It can be the position
 * of any data block somewhere behind this indirect block.  So we need to
 * normalize the positions through logfs_block_mask() before comparing.
 */
static int check_pos(struct super_block *sb, u64 pos1, u64 pos2, level_t level)
{
	return	(pos1 & logfs_block_mask(sb, level)) !=
		(pos2 & logfs_block_mask(sb, level));
}

#if 0
static int read_seg_header(struct super_block *sb, u64 ofs,
		struct logfs_segment_header *sh)
{
	__be32 crc;
	int err;

	err = wbuf_read(sb, ofs, sizeof(*sh), sh);
	if (err)
		return err;
	crc = logfs_crc32(sh, sizeof(*sh), 4);
	if (crc != sh->crc) {
		printk(KERN_ERR"LOGFS: header crc error at %llx: expected %x, "
				"got %x\n", ofs, be32_to_cpu(sh->crc),
				be32_to_cpu(crc));
		return -EIO;
	}
	return 0;
}
#endif

static int read_obj_header(struct super_block *sb, u64 ofs,
		struct logfs_object_header *oh)
{
	__be32 crc;
	int err;

	err = wbuf_read(sb, ofs, sizeof(*oh), oh);
	if (err)
		return err;
	crc = logfs_crc32(oh, sizeof(*oh) - 4, 4);
	if (crc != oh->crc) {
		printk(KERN_ERR"LOGFS: header crc error at %llx: expected %x, "
				"got %x\n", ofs, be32_to_cpu(oh->crc),
				be32_to_cpu(crc));
		return -EIO;
	}
	return 0;
}

static void move_btree_to_page(struct inode *inode, struct page *page,
		__be64 *data)
{
	struct super_block *sb = inode->i_sb;
	struct logfs_super *super = logfs_super(sb);
	struct btree_head128 *head = &super->s_object_alias_tree;
	struct logfs_block *block;
	struct object_alias_item *item, *next;

	if (!(super->s_flags & LOGFS_SB_FLAG_OBJ_ALIAS))
		return;

	block = btree_remove128(head, inode->i_ino, page->index);
	if (!block)
		return;

	log_blockmove("move_btree_to_page(%llx, %llx, %x)\n",
			block->ino, block->bix, block->level);
	list_for_each_entry_safe(item, next, &block->item_list, list) {
		data[item->child_no] = item->val;
		list_del(&item->list);
		mempool_free(item, super->s_alias_pool);
	}
	block->page = page;
	SetPagePrivate(page);
	page->private = (unsigned long)block;
	block->ops = &indirect_block_ops;
	initialize_block_counters(page, block, data, 0);
}

/*
 * This silences a false, yet annoying gcc warning.  I hate it when my editor
 * jumps into bitops.h each time I recompile this file.
 * TODO: Complain to gcc folks about this and upgrade compiler.
 */
static unsigned long fnb(const unsigned long *addr,
		unsigned long size, unsigned long offset)
{
	return find_next_bit(addr, size, offset);
}

void move_page_to_btree(struct page *page)
{
	struct logfs_block *block = logfs_block(page);
	struct super_block *sb = block->sb;
	struct logfs_super *super = logfs_super(sb);
	struct object_alias_item *item;
	unsigned long pos;
	__be64 *child;
	int err;

	if (super->s_flags & LOGFS_SB_FLAG_SHUTDOWN) {
		block->ops->free_block(sb, block);
		return;
	}
	log_blockmove("move_page_to_btree(%llx, %llx, %x)\n",
			block->ino, block->bix, block->level);
	super->s_flags |= LOGFS_SB_FLAG_OBJ_ALIAS;

	for (pos = 0; ; pos++) {
		pos = fnb(block->alias_map, LOGFS_BLOCK_FACTOR, pos);
		if (pos >= LOGFS_BLOCK_FACTOR)
			break;

		item = mempool_alloc(super->s_alias_pool, GFP_NOFS);
		BUG_ON(!item); /* mempool empty */
		memset(item, 0, sizeof(*item));

		child = kmap_atomic(page, KM_USER0);
		item->val = child[pos];
		kunmap_atomic(child, KM_USER0);
		item->child_no = pos;
		list_add(&item->list, &block->item_list);
	}
	block->page = NULL;
	ClearPagePrivate(page);
	page->private = 0;
	block->ops = &btree_block_ops;
	err = alias_tree_insert(block->sb, block->ino, block->bix, block->level,
			block);
	BUG_ON(err); /* mempool empty */
	ClearPageUptodate(page);
}

static int __logfs_segment_read(struct inode *inode, void *buf,
		u64 ofs, u64 bix, level_t level)
{
	struct super_block *sb = inode->i_sb;
	void *compressor_buf = logfs_super(sb)->s_compressed_je;
	struct logfs_object_header oh;
	__be32 crc;
	u16 len;
	int err, block_len;

	block_len = obj_len(sb, obj_type(inode, level));
	err = read_obj_header(sb, ofs, &oh);
	if (err)
		goto out_err;

	err = -EIO;
	if (be64_to_cpu(oh.ino) != inode->i_ino
			|| check_pos(sb, be64_to_cpu(oh.bix), bix, level)) {
		printk(KERN_ERR"LOGFS: (ino, bix) don't match at %llx: "
				"expected (%lx, %llx), got (%llx, %llx)\n",
				ofs, inode->i_ino, bix,
				be64_to_cpu(oh.ino), be64_to_cpu(oh.bix));
		goto out_err;
	}

	len = be16_to_cpu(oh.len);

	switch (oh.compr) {
	case COMPR_NONE:
		err = wbuf_read(sb, ofs + LOGFS_OBJECT_HEADERSIZE, len, buf);
		if (err)
			goto out_err;
		crc = logfs_crc32(buf, len, 0);
		if (crc != oh.data_crc) {
			printk(KERN_ERR"LOGFS: uncompressed data crc error at "
					"%llx: expected %x, got %x\n", ofs,
					be32_to_cpu(oh.data_crc),
					be32_to_cpu(crc));
			goto out_err;
		}
		break;
	case COMPR_ZLIB:
		mutex_lock(&logfs_super(sb)->s_journal_mutex);
		err = wbuf_read(sb, ofs + LOGFS_OBJECT_HEADERSIZE, len,
				compressor_buf);
		if (err) {
			mutex_unlock(&logfs_super(sb)->s_journal_mutex);
			goto out_err;
		}
		crc = logfs_crc32(compressor_buf, len, 0);
		if (crc != oh.data_crc) {
			printk(KERN_ERR"LOGFS: compressed data crc error at
"593593+ "%llx: expected %x, got %x\n", ofs,594594+ be32_to_cpu(oh.data_crc),595595+ be32_to_cpu(crc));596596+ mutex_unlock(&logfs_super(sb)->s_journal_mutex);597597+ goto out_err;598598+ }599599+ err = logfs_uncompress(compressor_buf, buf, len, block_len);600600+ mutex_unlock(&logfs_super(sb)->s_journal_mutex);601601+ if (err) {602602+ printk(KERN_ERR"LOGFS: uncompress error at %llx\n", ofs);603603+ goto out_err;604604+ }605605+ break;606606+ default:607607+ LOGFS_BUG(sb);608608+ err = -EIO;609609+ goto out_err;610610+ }611611+ return 0;612612+613613+out_err:614614+ logfs_set_ro(sb);615615+ printk(KERN_ERR"LOGFS: device is read-only now\n");616616+ LOGFS_BUG(sb);617617+ return err;618618+}619619+620620+/**621621+ * logfs_segment_read - read data block from object store622622+ * @inode: inode containing data623623+ * @buf: data buffer624624+ * @ofs: physical data offset625625+ * @bix: block index626626+ * @level: block level627627+ *628628+ * Returns 0 on success or a negative errno.629629+ */630630+int logfs_segment_read(struct inode *inode, struct page *page,631631+ u64 ofs, u64 bix, level_t level)632632+{633633+ int err;634634+ void *buf;635635+636636+ if (PageUptodate(page))637637+ return 0;638638+639639+ ofs &= ~LOGFS_FULLY_POPULATED;640640+641641+ buf = kmap(page);642642+ err = __logfs_segment_read(inode, buf, ofs, bix, level);643643+ if (!err) {644644+ move_btree_to_page(inode, page, buf);645645+ SetPageUptodate(page);646646+ }647647+ kunmap(page);648648+ log_segment("logfs_segment_read(%lx, %llx, %x) %llx (%d)\n",649649+ inode->i_ino, bix, level, ofs, err);650650+ return err;651651+}652652+653653+int logfs_segment_delete(struct inode *inode, struct logfs_shadow *shadow)654654+{655655+ struct super_block *sb = inode->i_sb;656656+ struct logfs_object_header h;657657+ u16 len;658658+ int err;659659+660660+ BUG_ON(logfs_super(sb)->s_flags & LOGFS_SB_FLAG_SHUTDOWN);661661+ BUG_ON(shadow->old_ofs & LOGFS_FULLY_POPULATED);662662+ if (!shadow->old_ofs)663663+ 
return 0;664664+665665+ log_segment("logfs_segment_delete(%llx, %llx, %x) %llx->%llx %x->%x\n",666666+ shadow->ino, shadow->bix, shadow->gc_level,667667+ shadow->old_ofs, shadow->new_ofs,668668+ shadow->old_len, shadow->new_len);669669+ err = read_obj_header(sb, shadow->old_ofs, &h);670670+ LOGFS_BUG_ON(err, sb);671671+ LOGFS_BUG_ON(be64_to_cpu(h.ino) != inode->i_ino, sb);672672+ LOGFS_BUG_ON(check_pos(sb, shadow->bix, be64_to_cpu(h.bix),673673+ shrink_level(shadow->gc_level)), sb);674674+675675+ if (shadow->gc_level == 0)676676+ len = be16_to_cpu(h.len);677677+ else678678+ len = obj_len(sb, h.type);679679+ shadow->old_len = len + sizeof(h);680680+ return 0;681681+}682682+683683+static void freeseg(struct super_block *sb, u32 segno)684684+{685685+ struct logfs_super *super = logfs_super(sb);686686+ struct address_space *mapping = super->s_mapping_inode->i_mapping;687687+ struct page *page;688688+ u64 ofs, start, end;689689+690690+ start = dev_ofs(sb, segno, 0);691691+ end = dev_ofs(sb, segno + 1, 0);692692+ for (ofs = start; ofs < end; ofs += PAGE_SIZE) {693693+ page = find_get_page(mapping, ofs >> PAGE_SHIFT);694694+ if (!page)695695+ continue;696696+ ClearPagePrivate(page);697697+ page_cache_release(page);698698+ }699699+}700700+701701+int logfs_open_area(struct logfs_area *area, size_t bytes)702702+{703703+ struct super_block *sb = area->a_sb;704704+ struct logfs_super *super = logfs_super(sb);705705+ int err, closed = 0;706706+707707+ if (area->a_is_open && area->a_used_bytes + bytes <= super->s_segsize)708708+ return 0;709709+710710+ if (area->a_is_open) {711711+ u64 ofs = dev_ofs(sb, area->a_segno, area->a_written_bytes);712712+ u32 len = super->s_segsize - area->a_written_bytes;713713+714714+ log_gc("logfs_close_area(%x)\n", area->a_segno);715715+ pad_wbuf(area, 1);716716+ super->s_devops->writeseg(area->a_sb, ofs, len);717717+ freeseg(sb, area->a_segno);718718+ closed = 1;719719+ }720720+721721+ area->a_used_bytes = 0;722722+ area->a_written_bytes = 
0;723723+again:724724+ area->a_ops->get_free_segment(area);725725+ area->a_ops->get_erase_count(area);726726+727727+ log_gc("logfs_open_area(%x, %x)\n", area->a_segno, area->a_level);728728+ err = area->a_ops->erase_segment(area);729729+ if (err) {730730+ printk(KERN_WARNING "LogFS: Error erasing segment %x\n",731731+ area->a_segno);732732+ logfs_mark_segment_bad(sb, area->a_segno);733733+ goto again;734734+ }735735+ area->a_is_open = 1;736736+ return closed;737737+}738738+739739+void logfs_sync_area(struct logfs_area *area)740740+{741741+ struct super_block *sb = area->a_sb;742742+ struct logfs_super *super = logfs_super(sb);743743+ u64 ofs = dev_ofs(sb, area->a_segno, area->a_written_bytes);744744+ u32 len = (area->a_used_bytes - area->a_written_bytes);745745+746746+ if (super->s_writesize)747747+ len &= ~(super->s_writesize - 1);748748+ if (len == 0)749749+ return;750750+ pad_wbuf(area, 0);751751+ super->s_devops->writeseg(sb, ofs, len);752752+ area->a_written_bytes += len;753753+}754754+755755+void logfs_sync_segments(struct super_block *sb)756756+{757757+ struct logfs_super *super = logfs_super(sb);758758+ int i;759759+760760+ for_each_area(i)761761+ logfs_sync_area(super->s_area[i]);762762+}763763+764764+/*765765+ * Pick a free segment to be used for this area. 
Effectively takes a766766+ * candidate from the free list (not really a candidate anymore).767767+ */768768+static void ostore_get_free_segment(struct logfs_area *area)769769+{770770+ struct super_block *sb = area->a_sb;771771+ struct logfs_super *super = logfs_super(sb);772772+773773+ if (super->s_free_list.count == 0) {774774+ printk(KERN_ERR"LOGFS: ran out of free segments\n");775775+ LOGFS_BUG(sb);776776+ }777777+778778+ area->a_segno = get_best_cand(sb, &super->s_free_list, NULL);779779+}780780+781781+static void ostore_get_erase_count(struct logfs_area *area)782782+{783783+ struct logfs_segment_entry se;784784+ u32 ec_level;785785+786786+ logfs_get_segment_entry(area->a_sb, area->a_segno, &se);787787+ BUG_ON(se.ec_level == cpu_to_be32(BADSEG) ||788788+ se.valid == cpu_to_be32(RESERVED));789789+790790+ ec_level = be32_to_cpu(se.ec_level);791791+ area->a_erase_count = (ec_level >> 4) + 1;792792+}793793+794794+static int ostore_erase_segment(struct logfs_area *area)795795+{796796+ struct super_block *sb = area->a_sb;797797+ struct logfs_segment_header sh;798798+ u64 ofs;799799+ int err;800800+801801+ err = logfs_erase_segment(sb, area->a_segno);802802+ if (err)803803+ return err;804804+805805+ sh.pad = 0;806806+ sh.type = SEG_OSTORE;807807+ sh.level = (__force u8)area->a_level;808808+ sh.segno = cpu_to_be32(area->a_segno);809809+ sh.ec = cpu_to_be32(area->a_erase_count);810810+ sh.gec = cpu_to_be64(logfs_super(sb)->s_gec);811811+ sh.crc = logfs_crc32(&sh, sizeof(sh), 4);812812+813813+ logfs_set_segment_erased(sb, area->a_segno, area->a_erase_count,814814+ area->a_level);815815+816816+ ofs = dev_ofs(sb, area->a_segno, 0);817817+ area->a_used_bytes = sizeof(sh);818818+ logfs_buf_write(area, ofs, &sh, sizeof(sh));819819+ return 0;820820+}821821+822822+static const struct logfs_area_ops ostore_area_ops = {823823+ .get_free_segment = ostore_get_free_segment,824824+ .get_erase_count = ostore_get_erase_count,825825+ .erase_segment = 
ostore_erase_segment,826826+};827827+828828+static void free_area(struct logfs_area *area)829829+{830830+ if (area)831831+ freeseg(area->a_sb, area->a_segno);832832+ kfree(area);833833+}834834+835835+static struct logfs_area *alloc_area(struct super_block *sb)836836+{837837+ struct logfs_area *area;838838+839839+ area = kzalloc(sizeof(*area), GFP_KERNEL);840840+ if (!area)841841+ return NULL;842842+843843+ area->a_sb = sb;844844+ return area;845845+}846846+847847+static void map_invalidatepage(struct page *page, unsigned long l)848848+{849849+ BUG();850850+}851851+852852+static int map_releasepage(struct page *page, gfp_t g)853853+{854854+ /* Don't release these pages */855855+ return 0;856856+}857857+858858+static const struct address_space_operations mapping_aops = {859859+ .invalidatepage = map_invalidatepage,860860+ .releasepage = map_releasepage,861861+ .set_page_dirty = __set_page_dirty_nobuffers,862862+};863863+864864+int logfs_init_mapping(struct super_block *sb)865865+{866866+ struct logfs_super *super = logfs_super(sb);867867+ struct address_space *mapping;868868+ struct inode *inode;869869+870870+ inode = logfs_new_meta_inode(sb, LOGFS_INO_MAPPING);871871+ if (IS_ERR(inode))872872+ return PTR_ERR(inode);873873+ super->s_mapping_inode = inode;874874+ mapping = inode->i_mapping;875875+ mapping->a_ops = &mapping_aops;876876+ /* Would it be possible to use __GFP_HIGHMEM as well? 
*/877877+ mapping_set_gfp_mask(mapping, GFP_NOFS);878878+ return 0;879879+}880880+881881+int logfs_init_areas(struct super_block *sb)882882+{883883+ struct logfs_super *super = logfs_super(sb);884884+ int i = -1;885885+886886+ super->s_alias_pool = mempool_create_kmalloc_pool(600,887887+ sizeof(struct object_alias_item));888888+ if (!super->s_alias_pool)889889+ return -ENOMEM;890890+891891+ super->s_journal_area = alloc_area(sb);892892+ if (!super->s_journal_area)893893+ goto err;894894+895895+ for_each_area(i) {896896+ super->s_area[i] = alloc_area(sb);897897+ if (!super->s_area[i])898898+ goto err;899899+ super->s_area[i]->a_level = GC_LEVEL(i);900900+ super->s_area[i]->a_ops = &ostore_area_ops;901901+ }902902+ btree_init_mempool128(&super->s_object_alias_tree,903903+ super->s_btree_pool);904904+ return 0;905905+906906+err:907907+ for (i--; i >= 0; i--)908908+ free_area(super->s_area[i]);909909+ free_area(super->s_journal_area);910910+ mempool_destroy(super->s_alias_pool);911911+ return -ENOMEM;912912+}913913+914914+void logfs_cleanup_areas(struct super_block *sb)915915+{916916+ struct logfs_super *super = logfs_super(sb);917917+ int i;918918+919919+ btree_grim_visitor128(&super->s_object_alias_tree, 0, kill_alias);920920+ for_each_area(i)921921+ free_area(super->s_area[i]);922922+ free_area(super->s_journal_area);923923+ destroy_meta_inode(super->s_mapping_inode);924924+}
fs/logfs/super.c (new file, +634 lines)
/*
 * fs/logfs/super.c
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
 *
 * Generally contains mount/umount code and also serves as a dump area for
 * any functions that don't fit elsewhere and neither justify a file of their
 * own.
 */
#include "logfs.h"
#include <linux/bio.h>
#include <linux/mtd/mtd.h>
#include <linux/statfs.h>
#include <linux/buffer_head.h>

static DEFINE_MUTEX(emergency_mutex);
static struct page *emergency_page;

struct page *emergency_read_begin(struct address_space *mapping, pgoff_t index)
{
	filler_t *filler = (filler_t *)mapping->a_ops->readpage;
	struct page *page;
	int err;

	page = read_cache_page(mapping, index, filler, NULL);
	if (page)
		return page;

	/* No more pages available, switch to emergency page */
	printk(KERN_INFO"Logfs: Using emergency page\n");
	mutex_lock(&emergency_mutex);
	err = filler(NULL, emergency_page);
	if (err) {
		mutex_unlock(&emergency_mutex);
		printk(KERN_EMERG"Logfs: Error reading emergency page\n");
		return ERR_PTR(err);
	}
	return emergency_page;
}

void emergency_read_end(struct page *page)
{
	if (page == emergency_page)
		mutex_unlock(&emergency_mutex);
	else
		page_cache_release(page);
}

static void dump_segfile(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_segment_entry se;
	u32 segno;

	for (segno = 0; segno < super->s_no_segs; segno++) {
		logfs_get_segment_entry(sb, segno, &se);
		printk("%3x: %6x %8x", segno, be32_to_cpu(se.ec_level),
				be32_to_cpu(se.valid));
		if (++segno < super->s_no_segs) {
			logfs_get_segment_entry(sb, segno, &se);
			printk(" %6x %8x", be32_to_cpu(se.ec_level),
					be32_to_cpu(se.valid));
		}
		if (++segno < super->s_no_segs) {
			logfs_get_segment_entry(sb, segno, &se);
			printk(" %6x %8x", be32_to_cpu(se.ec_level),
					be32_to_cpu(se.valid));
		}
		if (++segno < super->s_no_segs) {
			logfs_get_segment_entry(sb, segno, &se);
			printk(" %6x %8x", be32_to_cpu(se.ec_level),
					be32_to_cpu(se.valid));
		}
		printk("\n");
	}
}

/*
 * logfs_crash_dump - dump debug information to device
 *
 * The LogFS superblock only occupies part of a segment.  This function will
 * write as much debug information as it can gather into the spare space.
 */
void logfs_crash_dump(struct super_block *sb)
{
	dump_segfile(sb);
}

/*
 * TODO: move to lib/string.c
 */
/**
 * memchr_inv - Find a character in an area of memory.
 * @s: The memory area
 * @c: The byte to search for
 * @n: The size of the area.
 *
 * returns the address of the first character other than @c, or %NULL
 * if the whole buffer contains just @c.
 */
void *memchr_inv(const void *s, int c, size_t n)
{
	const unsigned char *p = s;

	while (n-- != 0)
		if ((unsigned char)c != *p++)
			return (void *)(p - 1);

	return NULL;
}

/*
 * FIXME: There should be a reserve for root, similar to ext2.
 */
int logfs_statfs(struct dentry *dentry, struct kstatfs *stats)
{
	struct super_block *sb = dentry->d_sb;
	struct logfs_super *super = logfs_super(sb);

	stats->f_type		= LOGFS_MAGIC_U32;
	stats->f_bsize		= sb->s_blocksize;
	stats->f_blocks		= super->s_size >> LOGFS_BLOCK_BITS >> 3;
	stats->f_bfree		= super->s_free_bytes >> sb->s_blocksize_bits;
	stats->f_bavail		= super->s_free_bytes >> sb->s_blocksize_bits;
	stats->f_files		= 0;
	stats->f_ffree		= 0;
	stats->f_namelen	= LOGFS_MAX_NAMELEN;
	return 0;
}

static int logfs_sb_set(struct super_block *sb, void *_super)
{
	struct logfs_super *super = _super;

	sb->s_fs_info = super;
	sb->s_mtd = super->s_mtd;
	sb->s_bdev = super->s_bdev;
	return 0;
}

static int logfs_sb_test(struct super_block *sb, void *_super)
{
	struct logfs_super *super = _super;
	struct mtd_info *mtd = super->s_mtd;

	if (mtd && sb->s_mtd == mtd)
		return 1;
	if (super->s_bdev && sb->s_bdev == super->s_bdev)
		return 1;
	return 0;
}

static void set_segment_header(struct logfs_segment_header *sh, u8 type,
		u8 level, u32 segno, u32 ec)
{
	sh->pad = 0;
	sh->type = type;
	sh->level = level;
	sh->segno = cpu_to_be32(segno);
	sh->ec = cpu_to_be32(ec);
	sh->gec = cpu_to_be64(segno);
	sh->crc = logfs_crc32(sh, LOGFS_SEGMENT_HEADERSIZE, 4);
}

static void logfs_write_ds(struct super_block *sb, struct logfs_disk_super *ds,
		u32 segno, u32 ec)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_segment_header *sh = &ds->ds_sh;
	int i;

	memset(ds, 0, sizeof(*ds));
	set_segment_header(sh, SEG_SUPER, 0, segno, ec);

	ds->ds_ifile_levels	= super->s_ifile_levels;
	ds->ds_iblock_levels	= super->s_iblock_levels;
	ds->ds_data_levels	= super->s_data_levels; /* XXX: Remove */
	ds->ds_segment_shift	= super->s_segshift;
	ds->ds_block_shift	= sb->s_blocksize_bits;
	ds->ds_write_shift	= super->s_writeshift;
	ds->ds_filesystem_size	= cpu_to_be64(super->s_size);
	ds->ds_segment_size	= cpu_to_be32(super->s_segsize);
	ds->ds_bad_seg_reserve	= cpu_to_be32(super->s_bad_seg_reserve);
	ds->ds_feature_incompat	= cpu_to_be64(super->s_feature_incompat);
	ds->ds_feature_ro_compat = cpu_to_be64(super->s_feature_ro_compat);
	ds->ds_feature_compat	= cpu_to_be64(super->s_feature_compat);
	ds->ds_feature_flags	= cpu_to_be64(super->s_feature_flags);
	ds->ds_root_reserve	= cpu_to_be64(super->s_root_reserve);
	ds->ds_speed_reserve	= cpu_to_be64(super->s_speed_reserve);
	journal_for_each(i)
		ds->ds_journal_seg[i] = cpu_to_be32(super->s_journal_seg[i]);
	ds->ds_magic		= cpu_to_be64(LOGFS_MAGIC);
	ds->ds_crc = logfs_crc32(ds, sizeof(*ds),
			LOGFS_SEGMENT_HEADERSIZE + 12);
}

static int write_one_sb(struct super_block *sb,
		struct page *(*find_sb)(struct super_block *sb, u64 *ofs))
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_disk_super *ds;
	struct logfs_segment_entry se;
	struct page *page;
	u64 ofs;
	u32 ec, segno;
	int err;

	page = find_sb(sb, &ofs);
	if (!page)
		return -EIO;
	ds = page_address(page);
	segno = seg_no(sb, ofs);
	logfs_get_segment_entry(sb, segno, &se);
	ec = be32_to_cpu(se.ec_level) >> 4;
	ec++;
	logfs_set_segment_erased(sb, segno, ec, 0);
	logfs_write_ds(sb, ds, segno, ec);
	err = super->s_devops->write_sb(sb, page);
	page_cache_release(page);
	return err;
}

int logfs_write_sb(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	int err;

	/* First superblock */
	err = write_one_sb(sb, super->s_devops->find_first_sb);
	if (err)
		return err;

	/* Last superblock */
	err = write_one_sb(sb, super->s_devops->find_last_sb);
	if (err)
		return err;
	return 0;
}

static int ds_cmp(const void *ds0, const void *ds1)
{
	size_t len = sizeof(struct logfs_disk_super);

	/* We know the segment headers differ, so ignore them */
	len -= LOGFS_SEGMENT_HEADERSIZE;
	ds0 += LOGFS_SEGMENT_HEADERSIZE;
	ds1 += LOGFS_SEGMENT_HEADERSIZE;
	return memcmp(ds0, ds1, len);
}

static int logfs_recover_sb(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	struct logfs_disk_super _ds0, *ds0 = &_ds0;
	struct logfs_disk_super _ds1, *ds1 = &_ds1;
	int err, valid0, valid1;

	/* read first superblock */
	err = wbuf_read(sb, super->s_sb_ofs[0], sizeof(*ds0), ds0);
	if (err)
		return err;
	/* read last superblock */
	err = wbuf_read(sb, super->s_sb_ofs[1], sizeof(*ds1), ds1);
	if (err)
		return err;
	valid0 = logfs_check_ds(ds0) == 0;
	valid1 = logfs_check_ds(ds1) == 0;

	if (!valid0 && valid1) {
		printk(KERN_INFO"First superblock is invalid - fixing.\n");
		return write_one_sb(sb, super->s_devops->find_first_sb);
	}
	if (valid0 && !valid1) {
		printk(KERN_INFO"Last superblock is invalid - fixing.\n");
		return write_one_sb(sb, super->s_devops->find_last_sb);
	}
	if (valid0 && valid1 && ds_cmp(ds0, ds1)) {
		printk(KERN_INFO"Superblocks don't match - fixing.\n");
		return write_one_sb(sb, super->s_devops->find_last_sb);
	}
	/* If neither is valid now, something's wrong.  Didn't we properly
	 * check them before?!? */
	BUG_ON(!valid0 && !valid1);
	return 0;
}

static int logfs_make_writeable(struct super_block *sb)
{
	int err;

	/* Repair any broken superblock copies */
	err = logfs_recover_sb(sb);
	if (err)
		return err;

	/* Check areas for trailing unaccounted data */
	err = logfs_check_areas(sb);
	if (err)
		return err;

	err = logfs_open_segfile(sb);
	if (err)
		return err;

	/* Do one GC pass before any data gets dirtied */
	logfs_gc_pass(sb);

	/* after all initializations are done, replay the journal
	 * for rw-mounts, if necessary */
	err = logfs_replay_journal(sb);
	if (err)
		return err;

	return 0;
}

static int logfs_get_sb_final(struct super_block *sb, struct vfsmount *mnt)
{
	struct inode *rootdir;
	int err;

	/* root dir */
	rootdir = logfs_iget(sb, LOGFS_INO_ROOT);
	if (IS_ERR(rootdir))
		goto fail;

	sb->s_root = d_alloc_root(rootdir);
	if (!sb->s_root)
		goto fail;

	/* FIXME: check for read-only mounts */
	err = logfs_make_writeable(sb);
	if (err)
		goto fail2;

	log_super("LogFS: Finished mounting\n");
	simple_set_mnt(mnt, sb);
	return 0;

fail2:
	iput(rootdir);
fail:
	iput(logfs_super(sb)->s_master_inode);
	return -EIO;
}

int logfs_check_ds(struct logfs_disk_super *ds)
{
	struct logfs_segment_header *sh = &ds->ds_sh;

	if (ds->ds_magic != cpu_to_be64(LOGFS_MAGIC))
		return -EINVAL;
	if (sh->crc != logfs_crc32(sh, LOGFS_SEGMENT_HEADERSIZE, 4))
		return -EINVAL;
	if (ds->ds_crc != logfs_crc32(ds, sizeof(*ds),
				LOGFS_SEGMENT_HEADERSIZE + 12))
		return -EINVAL;
	return 0;
}

static struct page *find_super_block(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	struct page *first, *last;

	first = super->s_devops->find_first_sb(sb, &super->s_sb_ofs[0]);
	if (!first || IS_ERR(first))
		return NULL;
	last = super->s_devops->find_last_sb(sb, &super->s_sb_ofs[1]);
	if (!last || IS_ERR(last)) {
		page_cache_release(first);
		return NULL;
	}

	if (!logfs_check_ds(page_address(first))) {
		page_cache_release(last);
		return first;
	}

	/* First one didn't work, try the second superblock */
	if (!logfs_check_ds(page_address(last))) {
		page_cache_release(first);
		return last;
	}

	/* Neither worked, sorry folks */
	page_cache_release(first);
	page_cache_release(last);
	return NULL;
}

static int __logfs_read_sb(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	struct page *page;
	struct logfs_disk_super *ds;
	int i;

	page = find_super_block(sb);
	if (!page)
		return -EIO;

	ds = page_address(page);
	super->s_size = be64_to_cpu(ds->ds_filesystem_size);
	super->s_root_reserve = be64_to_cpu(ds->ds_root_reserve);
	super->s_speed_reserve = be64_to_cpu(ds->ds_speed_reserve);
	super->s_bad_seg_reserve = be32_to_cpu(ds->ds_bad_seg_reserve);
	super->s_segsize = 1 << ds->ds_segment_shift;
	super->s_segmask = (1 << ds->ds_segment_shift) - 1;
	super->s_segshift = ds->ds_segment_shift;
	sb->s_blocksize = 1 << ds->ds_block_shift;
	sb->s_blocksize_bits = ds->ds_block_shift;
	super->s_writesize = 1 << ds->ds_write_shift;
	super->s_writeshift = ds->ds_write_shift;
	super->s_no_segs = super->s_size >> super->s_segshift;
	super->s_no_blocks = super->s_segsize >> sb->s_blocksize_bits;
	super->s_feature_incompat = be64_to_cpu(ds->ds_feature_incompat);
	super->s_feature_ro_compat = be64_to_cpu(ds->ds_feature_ro_compat);
	super->s_feature_compat = be64_to_cpu(ds->ds_feature_compat);
	super->s_feature_flags = be64_to_cpu(ds->ds_feature_flags);

	journal_for_each(i)
		super->s_journal_seg[i] = be32_to_cpu(ds->ds_journal_seg[i]);

	super->s_ifile_levels = ds->ds_ifile_levels;
	super->s_iblock_levels = ds->ds_iblock_levels;
	super->s_data_levels = ds->ds_data_levels;
	super->s_total_levels = super->s_ifile_levels + super->s_iblock_levels
		+ super->s_data_levels;
	page_cache_release(page);
	return 0;
}

static int logfs_read_sb(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);
	int ret;

	super->s_btree_pool = mempool_create(32, btree_alloc, btree_free, NULL);
	if (!super->s_btree_pool)
		return -ENOMEM;

	btree_init_mempool64(&super->s_shadow_tree.new, super->s_btree_pool);
	btree_init_mempool64(&super->s_shadow_tree.old, super->s_btree_pool);

	ret = logfs_init_mapping(sb);
	if (ret)
		return ret;

	ret = __logfs_read_sb(sb);
	if (ret)
		return ret;

	mutex_init(&super->s_dirop_mutex);
	mutex_init(&super->s_object_alias_mutex);
	INIT_LIST_HEAD(&super->s_freeing_list);

	ret = logfs_init_rw(sb);
	if (ret)
		return ret;

	ret = logfs_init_areas(sb);
	if (ret)
		return ret;

	ret = logfs_init_gc(sb);
	if (ret)
		return ret;

	ret = logfs_init_journal(sb);
	if (ret)
		return ret;

	return 0;
}

static void logfs_kill_sb(struct super_block *sb)
{
	struct logfs_super *super = logfs_super(sb);

	log_super("LogFS: Start unmounting\n");
	/* Alias entries slow down mount, so evict as many as possible */
	sync_filesystem(sb);
	logfs_write_anchor(super->s_master_inode);

	/*
	 * From this point on alias entries are simply dropped - and any
	 * writes to the object store are considered bugs.
	 */
	super->s_flags |= LOGFS_SB_FLAG_SHUTDOWN;
	log_super("LogFS: Now in shutdown\n");
	generic_shutdown_super(sb);

	BUG_ON(super->s_dirty_used_bytes || super->s_dirty_free_bytes);

	logfs_cleanup_gc(sb);
	logfs_cleanup_journal(sb);
	logfs_cleanup_areas(sb);
	logfs_cleanup_rw(sb);
	super->s_devops->put_device(sb);
	mempool_destroy(super->s_btree_pool);
	mempool_destroy(super->s_alias_pool);
	kfree(super);
	log_super("LogFS: Finished unmounting\n");
}

int logfs_get_sb_device(struct file_system_type *type, int flags,
		struct mtd_info *mtd, struct block_device *bdev,
		const struct logfs_device_ops *devops, struct vfsmount *mnt)
{
	struct logfs_super *super;
	struct super_block *sb;
	int err = -ENOMEM;
	static int mount_count;

	log_super("LogFS: Start mount %x\n", mount_count++);
	super = kzalloc(sizeof(*super), GFP_KERNEL);
	if (!super)
		goto err0;

	super->s_mtd = mtd;
	super->s_bdev = bdev;
	err = -EINVAL;
	sb = sget(type, logfs_sb_test, logfs_sb_set, super);
	if (IS_ERR(sb))
		goto err0;

	if (sb->s_root) {
		/* Device is already in use */
		err = 0;
		simple_set_mnt(mnt, sb);
		goto err0;
	}

	super->s_devops = devops;

	/*
	 * sb->s_maxbytes is limited to 8TB.  On 32bit systems, the page cache
	 * only covers 16TB and the upper 8TB are used for indirect blocks.
	 * On 64bit system we could bump up the limit, but that would make
	 * the filesystem incompatible with 32bit systems.
	 */
	sb->s_maxbytes	= (1ull << 43) - 1;
	sb->s_op	= &logfs_super_operations;
	sb->s_flags	= flags | MS_NOATIME;

	err = logfs_read_sb(sb);
	if (err)
		goto err1;

	sb->s_flags |= MS_ACTIVE;
	err = logfs_get_sb_final(sb, mnt);
	if (err)
		goto err1;
	return 0;

err1:
	up_write(&sb->s_umount);
	deactivate_super(sb);
	return err;
err0:
	kfree(super);
	//devops->put_device(sb);
	return err;
}

static int logfs_get_sb(struct file_system_type *type, int flags,
		const char *devname, void *data, struct vfsmount *mnt)
{
	ulong mtdnr;

	if (!devname)
		return logfs_get_sb_bdev(type, flags, devname, mnt);
	if (strncmp(devname, "mtd", 3))
		return logfs_get_sb_bdev(type, flags, devname, mnt);

	{
		char *garbage;
		mtdnr = simple_strtoul(devname+3, &garbage, 0);
		if (*garbage)
			return -EINVAL;
	}

	return logfs_get_sb_mtd(type, flags, mtdnr, mnt);
}

static struct file_system_type logfs_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "logfs",
	.get_sb		= logfs_get_sb,
	.kill_sb	= logfs_kill_sb,
	.fs_flags	= FS_REQUIRES_DEV,
};

static int __init logfs_init(void)
{
	int ret;

	emergency_page = alloc_pages(GFP_KERNEL, 0);
	if (!emergency_page)
		return -ENOMEM;

	ret = logfs_compr_init();
	if (ret)
		goto out1;

	ret = logfs_init_inode_cache();
	if (ret)
		goto out2;

	return register_filesystem(&logfs_fs_type);
out2:
	logfs_compr_exit();
out1:
	__free_pages(emergency_page, 0);
	return ret;
}

static void __exit logfs_exit(void)
{
	unregister_filesystem(&logfs_fs_type);
	logfs_destroy_inode_cache();
	logfs_compr_exit();
	__free_pages(emergency_page, 0);
}

module_init(logfs_init);
module_exit(logfs_exit);

MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Joern Engel <joern@logfs.org>");
MODULE_DESCRIPTION("scalable flash filesystem");
include/linux/btree.h

#ifndef BTREE_H
#define BTREE_H

#include <linux/kernel.h>
#include <linux/mempool.h>

/**
 * DOC: B+Tree basics
 *
 * A B+Tree is a data structure for looking up arbitrary (currently allowing
 * unsigned long, u32, u64 and 2 * u64) keys into pointers. The data structure
 * is described at http://en.wikipedia.org/wiki/B-tree; note that we currently
 * do not use binary search to find the key on lookups.
 *
 * Each B+Tree consists of a head, that contains bookkeeping information and
 * a variable number (starting with zero) of nodes. Each node contains the
 * keys and pointers to sub-nodes, or, for leaf nodes, the keys and values
 * for the tree entries.
 *
 * Each node in this implementation has the following layout:
 * [key1, key2, ..., keyN] [val1, val2, ..., valN]
 *
 * Each key here is an array of unsigned longs, geo->no_longs in total. The
 * number of keys and values (N) is geo->no_pairs.
 */

/**
 * struct btree_head - btree head
 *
 * @node: the first node in the tree
 * @mempool: mempool used for node allocations
 * @height: current height of the tree
 */
struct btree_head {
	unsigned long *node;
	mempool_t *mempool;
	int height;
};

/* btree geometry */
struct btree_geo;

/**
 * btree_alloc - allocate function for the mempool
 * @gfp_mask: gfp mask for the allocation
 * @pool_data: unused
 */
void *btree_alloc(gfp_t gfp_mask, void *pool_data);

/**
 * btree_free - free function for the mempool
 * @element: the element to free
 * @pool_data: unused
 */
void btree_free(void *element, void *pool_data);

/**
 * btree_init_mempool - initialise a btree with given mempool
 *
 * @head: the btree head to initialise
 * @mempool: the mempool to use
 *
 * When this function is used, there is no need to destroy
 * the mempool.
 */
void btree_init_mempool(struct btree_head *head, mempool_t *mempool);

/**
 * btree_init - initialise a btree
 *
 * @head: the btree head to initialise
 *
 * This function allocates the memory pool that the
 * btree needs. Returns zero or a negative error code
 * (-%ENOMEM) when memory allocation fails.
 */
int __must_check btree_init(struct btree_head *head);

/**
 * btree_destroy - destroy mempool
 *
 * @head: the btree head to destroy
 *
 * This function destroys the internal memory pool, use only
 * when using btree_init(), not with btree_init_mempool().
 */
void btree_destroy(struct btree_head *head);

/**
 * btree_lookup - look up a key in the btree
 *
 * @head: the btree to look in
 * @geo: the btree geometry
 * @key: the key to look up
 *
 * This function returns the value for the given key, or %NULL.
 */
void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
		   unsigned long *key);

/**
 * btree_insert - insert an entry into the btree
 *
 * @head: the btree to add to
 * @geo: the btree geometry
 * @key: the key to add (must not already be present)
 * @val: the value to add (must not be %NULL)
 * @gfp: allocation flags for node allocations
 *
 * This function returns 0 if the item could be added, or an
 * error code if it failed (may fail due to memory pressure).
 */
int __must_check btree_insert(struct btree_head *head, struct btree_geo *geo,
			      unsigned long *key, void *val, gfp_t gfp);

/**
 * btree_update - update an entry in the btree
 *
 * @head: the btree to update
 * @geo: the btree geometry
 * @key: the key to update
 * @val: the value to change it to (must not be %NULL)
 *
 * This function returns 0 if the update was successful, or
 * -%ENOENT if the key could not be found.
 */
int btree_update(struct btree_head *head, struct btree_geo *geo,
		 unsigned long *key, void *val);

/**
 * btree_remove - remove an entry from the btree
 *
 * @head: the btree to update
 * @geo: the btree geometry
 * @key: the key to remove
 *
 * This function returns the removed entry, or %NULL if the key
 * could not be found.
 */
void *btree_remove(struct btree_head *head, struct btree_geo *geo,
		   unsigned long *key);

/**
 * btree_merge - merge two btrees
 *
 * @target: the tree that gets all the entries
 * @victim: the tree that gets merged into @target
 * @geo: the btree geometry
 * @gfp: allocation flags
 *
 * The two trees @target and @victim may not contain the same keys,
 * that is a bug and triggers a BUG(). This function returns zero
 * if the trees were merged successfully, and may return a failure
 * when memory allocation fails, in which case both trees might have
 * been partially merged, i.e. some entries have been moved from
 * @victim to @target.
 */
int btree_merge(struct btree_head *target, struct btree_head *victim,
		struct btree_geo *geo, gfp_t gfp);

/**
 * btree_last - get last entry in btree
 *
 * @head: btree head
 * @geo: btree geometry
 * @key: last key
 *
 * Returns the last entry in the btree, and sets @key to the key
 * of that entry; returns %NULL if the tree is empty, in that case
 * key is not changed.
 */
void *btree_last(struct btree_head *head, struct btree_geo *geo,
		 unsigned long *key);

/**
 * btree_get_prev - get previous entry
 *
 * @head: btree head
 * @geo: btree geometry
 * @key: pointer to key
 *
 * The function returns the next item right before the value pointed to by
 * @key, and updates @key with its key, or returns %NULL when there is no
 * entry with a key smaller than the given key.
 */
void *btree_get_prev(struct btree_head *head, struct btree_geo *geo,
		     unsigned long *key);


/* internal use, use btree_visitor{l,32,64,128} */
size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
		     unsigned long opaque,
		     void (*func)(void *elem, unsigned long opaque,
				  unsigned long *key, size_t index,
				  void *func2),
		     void *func2);

/* internal use, use btree_grim_visitor{l,32,64,128} */
size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
			  unsigned long opaque,
			  void (*func)(void *elem, unsigned long opaque,
				       unsigned long *key,
				       size_t index, void *func2),
			  void *func2);


#include <linux/btree-128.h>

extern struct btree_geo btree_geo32;
#define BTREE_TYPE_SUFFIX l
#define BTREE_TYPE_BITS BITS_PER_LONG
#define BTREE_TYPE_GEO &btree_geo32
#define BTREE_KEYTYPE unsigned long
#include <linux/btree-type.h>

#define btree_for_each_safel(head, key, val)	\
	for (val = btree_lastl(head, &key);	\
	     val;				\
	     val = btree_get_prevl(head, &key))

#define BTREE_TYPE_SUFFIX 32
#define BTREE_TYPE_BITS 32
#define BTREE_TYPE_GEO &btree_geo32
#define BTREE_KEYTYPE u32
#include <linux/btree-type.h>

#define btree_for_each_safe32(head, key, val)	\
	for (val = btree_last32(head, &key);	\
	     val;				\
	     val = btree_get_prev32(head, &key))

extern struct btree_geo btree_geo64;
#define BTREE_TYPE_SUFFIX 64
#define BTREE_TYPE_BITS 64
#define BTREE_TYPE_GEO &btree_geo64
#define BTREE_KEYTYPE u64
#include <linux/btree-type.h>

#define btree_for_each_safe64(head, key, val)	\
	for (val = btree_last64(head, &key);	\
	     val;				\
	     val = btree_get_prev64(head, &key))

#endif
lib/btree.c

/*
 * lib/btree.c - Simple In-memory B+Tree
 *
 * As should be obvious for Linux kernel code, license is GPLv2
 *
 * Copyright (c) 2007-2008 Joern Engel <joern@logfs.org>
 * Bits and pieces stolen from Peter Zijlstra's code, which is
 * Copyright 2007, Red Hat Inc. Peter Zijlstra <pzijlstr@redhat.com>
 * GPLv2
 *
 * see http://programming.kicks-ass.net/kernel-patches/vma_lookup/btree.patch
 *
 * A relatively simple B+Tree implementation. I have written it as a learning
 * exercise to understand how B+Trees work. Turned out to be useful as well.
 *
 * B+Trees can be used similar to Linux radix trees (which don't have anything
 * in common with textbook radix trees, beware). Prerequisite for them working
 * well is that access to a random tree node is much faster than a large number
 * of operations within each node.
 *
 * Disks have fulfilled the prerequisite for a long time. More recently DRAM
 * has gained similar properties, as memory access times, when measured in cpu
 * cycles, have increased. Cacheline sizes have increased as well, which also
 * helps B+Trees.
 *
 * Compared to radix trees, B+Trees are more efficient when dealing with a
 * sparsely populated address space. Between 25% and 50% of the memory is
 * occupied with valid pointers. When densely populated, radix trees contain
 * ~98% pointers - hard to beat. Very sparse radix trees contain only ~2%
 * pointers.
 *
 * This particular implementation stores pointers identified by a long value.
 * Storing NULL pointers is illegal, lookup will return NULL when no entry
 * was found.
 *
 * A trick was used that is not commonly found in textbooks. The lowest
 * values are to the right, not to the left. All used slots within a node
 * are on the left, all unused slots contain NUL values. Most operations
 * simply loop once over all slots and terminate on the first NUL.
 */

#include <linux/btree.h>
#include <linux/cache.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/module.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define NODESIZE MAX(L1_CACHE_BYTES, 128)

struct btree_geo {
	int keylen;
	int no_pairs;
	int no_longs;
};

struct btree_geo btree_geo32 = {
	.keylen = 1,
	.no_pairs = NODESIZE / sizeof(long) / 2,
	.no_longs = NODESIZE / sizeof(long) / 2,
};
EXPORT_SYMBOL_GPL(btree_geo32);

#define LONG_PER_U64 (64 / BITS_PER_LONG)
struct btree_geo btree_geo64 = {
	.keylen = LONG_PER_U64,
	.no_pairs = NODESIZE / sizeof(long) / (1 + LONG_PER_U64),
	.no_longs = LONG_PER_U64 * (NODESIZE / sizeof(long) / (1 + LONG_PER_U64)),
};
EXPORT_SYMBOL_GPL(btree_geo64);

struct btree_geo btree_geo128 = {
	.keylen = 2 * LONG_PER_U64,
	.no_pairs = NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64),
	.no_longs = 2 * LONG_PER_U64 * (NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64)),
};
EXPORT_SYMBOL_GPL(btree_geo128);

static struct kmem_cache *btree_cachep;

void *btree_alloc(gfp_t gfp_mask, void *pool_data)
{
	return kmem_cache_alloc(btree_cachep, gfp_mask);
}
EXPORT_SYMBOL_GPL(btree_alloc);

void btree_free(void *element, void *pool_data)
{
	kmem_cache_free(btree_cachep, element);
}
EXPORT_SYMBOL_GPL(btree_free);

static unsigned long *btree_node_alloc(struct btree_head *head, gfp_t gfp)
{
	unsigned long *node;

	node = mempool_alloc(head->mempool, gfp);
	memset(node, 0, NODESIZE);
	return node;
}

static int longcmp(const unsigned long *l1, const unsigned long *l2, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (l1[i] < l2[i])
			return -1;
		if (l1[i] > l2[i])
			return 1;
	}
	return 0;
}

static unsigned long *longcpy(unsigned long *dest, const unsigned long *src,
		size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dest[i] = src[i];
	return dest;
}

static unsigned long *longset(unsigned long *s, unsigned long c, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		s[i] = c;
	return s;
}

static void dec_key(struct btree_geo *geo, unsigned long *key)
{
	unsigned long val;
	int i;

	for (i = geo->keylen - 1; i >= 0; i--) {
		val = key[i];
		key[i] = val - 1;
		if (val)
			break;
	}
}

static unsigned long *bkey(struct btree_geo *geo, unsigned long *node, int n)
{
	return &node[n * geo->keylen];
}

static void *bval(struct btree_geo *geo, unsigned long *node, int n)
{
	return (void *)node[geo->no_longs + n];
}

static void setkey(struct btree_geo *geo, unsigned long *node, int n,
		   unsigned long *key)
{
	longcpy(bkey(geo, node, n), key, geo->keylen);
}

static void setval(struct btree_geo *geo, unsigned long *node, int n,
		   void *val)
{
	node[geo->no_longs + n] = (unsigned long) val;
}

static void clearpair(struct btree_geo *geo, unsigned long *node, int n)
{
	longset(bkey(geo, node, n), 0, geo->keylen);
	node[geo->no_longs + n] = 0;
}

static inline void __btree_init(struct btree_head *head)
{
	head->node = NULL;
	head->height = 0;
}

void btree_init_mempool(struct btree_head *head, mempool_t *mempool)
{
	__btree_init(head);
	head->mempool = mempool;
}
EXPORT_SYMBOL_GPL(btree_init_mempool);

int btree_init(struct btree_head *head)
{
	__btree_init(head);
	head->mempool = mempool_create(0, btree_alloc, btree_free, NULL);
	if (!head->mempool)
		return -ENOMEM;
	return 0;
}
EXPORT_SYMBOL_GPL(btree_init);

void btree_destroy(struct btree_head *head)
{
	mempool_destroy(head->mempool);
	head->mempool = NULL;
}
EXPORT_SYMBOL_GPL(btree_destroy);

void *btree_last(struct btree_head *head, struct btree_geo *geo,
		 unsigned long *key)
{
	int height = head->height;
	unsigned long *node = head->node;

	if (height == 0)
		return NULL;

	for ( ; height > 1; height--)
		node = bval(geo, node, 0);

	longcpy(key, bkey(geo, node, 0), geo->keylen);
	return bval(geo, node, 0);
}
EXPORT_SYMBOL_GPL(btree_last);

static int keycmp(struct btree_geo *geo, unsigned long *node, int pos,
		  unsigned long *key)
{
	return longcmp(bkey(geo, node, pos), key, geo->keylen);
}

static int keyzero(struct btree_geo *geo, unsigned long *key)
{
	int i;

	for (i = 0; i < geo->keylen; i++)
		if (key[i])
			return 0;

	return 1;
}

void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
		unsigned long *key)
{
	int i, height = head->height;
	unsigned long *node = head->node;

	if (height == 0)
		return NULL;

	for ( ; height > 1; height--) {
		for (i = 0; i < geo->no_pairs; i++)
			if (keycmp(geo, node, i, key) <= 0)
				break;
		if (i == geo->no_pairs)
			return NULL;
		node = bval(geo, node, i);
		if (!node)
			return NULL;
	}

	if (!node)
		return NULL;

	for (i = 0; i < geo->no_pairs; i++)
		if (keycmp(geo, node, i, key) == 0)
			return bval(geo, node, i);
	return NULL;
}
EXPORT_SYMBOL_GPL(btree_lookup);

int btree_update(struct btree_head *head, struct btree_geo *geo,
		 unsigned long *key, void *val)
{
	int i, height = head->height;
	unsigned long *node = head->node;

	if (height == 0)
		return -ENOENT;

	for ( ; height > 1; height--) {
		for (i = 0; i < geo->no_pairs; i++)
			if (keycmp(geo, node, i, key) <= 0)
				break;
		if (i == geo->no_pairs)
			return -ENOENT;
		node = bval(geo, node, i);
		if (!node)
			return -ENOENT;
	}

	if (!node)
		return -ENOENT;

	for (i = 0; i < geo->no_pairs; i++)
		if (keycmp(geo, node, i, key) == 0) {
			setval(geo, node, i, val);
			return 0;
		}
	return -ENOENT;
}
EXPORT_SYMBOL_GPL(btree_update);

/*
 * Usually this function is quite similar to normal lookup. But the key of
 * a parent node may be smaller than the smallest key of all its siblings.
 * In such a case we cannot just return NULL, as we have only proven that no
 * key smaller than __key, but larger than this parent key exists.
 * So we set __key to the parent key and retry. We have to use the smallest
 * such parent key, which is the last parent key we encountered.
 */
void *btree_get_prev(struct btree_head *head, struct btree_geo *geo,
		     unsigned long *__key)
{
	int i, height;
	unsigned long *node, *oldnode;
	unsigned long *retry_key = NULL, key[geo->keylen];

	if (keyzero(geo, __key))
		return NULL;

	if (head->height == 0)
		return NULL;
retry:
	longcpy(key, __key, geo->keylen);
	dec_key(geo, key);

	node = head->node;
	for (height = head->height ; height > 1; height--) {
		for (i = 0; i < geo->no_pairs; i++)
			if (keycmp(geo, node, i, key) <= 0)
				break;
		if (i == geo->no_pairs)
			goto miss;
		oldnode = node;
		node = bval(geo, node, i);
		if (!node)
			goto miss;
		retry_key = bkey(geo, oldnode, i);
	}

	if (!node)
		goto miss;

	for (i = 0; i < geo->no_pairs; i++) {
		if (keycmp(geo, node, i, key) <= 0) {
			if (bval(geo, node, i)) {
				longcpy(__key, bkey(geo, node, i), geo->keylen);
				return bval(geo, node, i);
			} else
				goto miss;
		}
	}
miss:
	if (retry_key) {
		__key = retry_key;
		retry_key = NULL;
		goto retry;
	}
	return NULL;
}

static int getpos(struct btree_geo *geo, unsigned long *node,
		unsigned long *key)
{
	int i;

	for (i = 0; i < geo->no_pairs; i++) {
		if (keycmp(geo, node, i, key) <= 0)
			break;
	}
	return i;
}

static int getfill(struct btree_geo *geo, unsigned long *node, int start)
{
	int i;

	for (i = start; i < geo->no_pairs; i++)
		if (!bval(geo, node, i))
			break;
	return i;
}

/*
 * locate the correct leaf node in the btree
 */
static unsigned long *find_level(struct btree_head *head, struct btree_geo *geo,
		unsigned long *key, int level)
{
	unsigned long *node = head->node;
	int i, height;

	for (height = head->height; height > level; height--) {
		for (i = 0; i < geo->no_pairs; i++)
			if (keycmp(geo, node, i, key) <= 0)
				break;

		if ((i == geo->no_pairs) || !bval(geo, node, i)) {
			/* right-most key is too large, update it */
			/* FIXME: If the right-most key on higher levels is
			 * always zero, this wouldn't be necessary. */
			i--;
			setkey(geo, node, i, key);
		}
		BUG_ON(i < 0);
		node = bval(geo, node, i);
	}
	BUG_ON(!node);
	return node;
}

static int btree_grow(struct btree_head *head, struct btree_geo *geo,
		      gfp_t gfp)
{
	unsigned long *node;
	int fill;

	node = btree_node_alloc(head, gfp);
	if (!node)
		return -ENOMEM;
	if (head->node) {
		fill = getfill(geo, head->node, 0);
		setkey(geo, node, 0, bkey(geo, head->node, fill - 1));
		setval(geo, node, 0, head->node);
	}
	head->node = node;
	head->height++;
	return 0;
}

static void btree_shrink(struct btree_head *head, struct btree_geo *geo)
{
	unsigned long *node;
	int fill;

	if (head->height <= 1)
		return;

	node = head->node;
	fill = getfill(geo, node, 0);
	BUG_ON(fill > 1);
	head->node = bval(geo, node, 0);
	head->height--;
	mempool_free(node, head->mempool);
}

static int btree_insert_level(struct btree_head *head, struct btree_geo *geo,
			      unsigned long *key, void *val, int level,
			      gfp_t gfp)
{
	unsigned long *node;
	int i, pos, fill, err;

	BUG_ON(!val);
	if (head->height < level) {
		err = btree_grow(head, geo, gfp);
		if (err)
			return err;
	}

retry:
	node = find_level(head, geo, key, level);
	pos = getpos(geo, node, key);
	fill = getfill(geo, node, pos);
	/* two identical keys are not allowed */
	BUG_ON(pos < fill && keycmp(geo, node, pos, key) == 0);

	if (fill == geo->no_pairs) {
		/* need to split node */
		unsigned long *new;

		new = btree_node_alloc(head, gfp);
		if (!new)
			return -ENOMEM;
		err = btree_insert_level(head, geo,
				bkey(geo, node, fill / 2 - 1),
				new, level + 1, gfp);
		if (err) {
			mempool_free(new, head->mempool);
			return err;
		}
		for (i = 0; i < fill / 2; i++) {
			setkey(geo, new, i, bkey(geo, node, i));
			setval(geo, new, i, bval(geo, node, i));
			setkey(geo, node, i, bkey(geo, node, i + fill / 2));
			setval(geo, node, i, bval(geo, node, i + fill / 2));
			clearpair(geo, node, i + fill / 2);
		}
		if (fill & 1) {
			setkey(geo, node, i, bkey(geo, node, fill - 1));
			setval(geo, node, i, bval(geo, node, fill - 1));
			clearpair(geo, node, fill - 1);
		}
		goto retry;
	}
	BUG_ON(fill >= geo->no_pairs);

	/* shift and insert */
	for (i = fill; i > pos; i--) {
		setkey(geo, node, i, bkey(geo, node, i - 1));
		setval(geo, node, i, bval(geo, node, i - 1));
	}
	setkey(geo, node, pos, key);
	setval(geo, node, pos, val);

	return 0;
}

int btree_insert(struct btree_head *head, struct btree_geo *geo,
		unsigned long *key, void *val, gfp_t gfp)
{
	return btree_insert_level(head, geo, key, val, 1, gfp);
}
EXPORT_SYMBOL_GPL(btree_insert);

static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
		unsigned long *key, int level);
static void merge(struct btree_head *head, struct btree_geo *geo, int level,
		unsigned long *left, int lfill,
		unsigned long *right, int rfill,
		unsigned long *parent, int lpos)
{
	int i;

	for (i = 0; i < rfill; i++) {
		/* Move all keys to the left */
		setkey(geo, left, lfill + i, bkey(geo, right, i));
		setval(geo, left, lfill + i, bval(geo, right, i));
	}
	/* Exchange left and right child in parent */
	setval(geo, parent, lpos, right);
	setval(geo, parent, lpos + 1, left);
	/* Remove left (formerly right) child from parent */
	btree_remove_level(head, geo, bkey(geo, parent, lpos), level + 1);
	mempool_free(right, head->mempool);
}

static void rebalance(struct btree_head *head, struct btree_geo *geo,
		unsigned long *key, int level, unsigned long *child, int fill)
{
	unsigned long *parent, *left = NULL, *right = NULL;
	int i, no_left, no_right;

	if (fill == 0) {
		/* Because we don't steal entries from a neighbour, this case
		 * can happen.  Parent node contains a single child, this
		 * node, so merging with a sibling never happens.
		 */
		btree_remove_level(head, geo, key, level + 1);
		mempool_free(child, head->mempool);
		return;
	}

	parent = find_level(head, geo, key, level + 1);
	i = getpos(geo, parent, key);
	BUG_ON(bval(geo, parent, i) != child);

	if (i > 0) {
		left = bval(geo, parent, i - 1);
		no_left = getfill(geo, left, 0);
		if (fill + no_left <= geo->no_pairs) {
			merge(head, geo, level,
					left, no_left,
					child, fill,
					parent, i - 1);
			return;
		}
	}
	if (i + 1 < getfill(geo, parent, i)) {
		right = bval(geo, parent, i + 1);
		no_right = getfill(geo, right, 0);
		if (fill + no_right <= geo->no_pairs) {
			merge(head, geo, level,
					child, fill,
					right, no_right,
					parent, i);
			return;
		}
	}
	/*
	 * We could also try to steal one entry from the left or right
	 * neighbor. By not doing so we changed the invariant from
	 * "all nodes are at least half full" to "no two neighboring
	 * nodes can be merged". Which means that the average fill of
	 * all nodes is still half or better.
	 */
}

static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
		unsigned long *key, int level)
{
	unsigned long *node;
	int i, pos, fill;
	void *ret;

	if (level > head->height) {
		/* we recursed all the way up */
		head->height = 0;
		head->node = NULL;
		return NULL;
	}

	node = find_level(head, geo, key, level);
	pos = getpos(geo, node, key);
	fill = getfill(geo, node, pos);
	if ((level == 1) && (keycmp(geo, node, pos, key) != 0))
		return NULL;
	ret = bval(geo, node, pos);

	/* remove and shift */
	for (i = pos; i < fill - 1; i++) {
		setkey(geo, node, i, bkey(geo, node, i + 1));
		setval(geo, node, i, bval(geo, node, i + 1));
	}
	clearpair(geo, node, fill - 1);

	if (fill - 1 < geo->no_pairs / 2) {
		if (level < head->height)
			rebalance(head, geo, key, level, node, fill - 1);
		else if (fill - 1 == 1)
			btree_shrink(head, geo);
	}

	return ret;
}

void *btree_remove(struct btree_head *head, struct btree_geo *geo,
		unsigned long *key)
{
	if (head->height == 0)
		return NULL;

	return btree_remove_level(head, geo, key, 1);
}
EXPORT_SYMBOL_GPL(btree_remove);

int btree_merge(struct btree_head *target, struct btree_head *victim,
		struct btree_geo *geo, gfp_t gfp)
{
	unsigned long key[geo->keylen];
	unsigned long dup[geo->keylen];
	void *val;
	int err;

	BUG_ON(target == victim);

	if (!(target->node)) {
		/* target is empty, just copy fields over */
		target->node = victim->node;
		target->height = victim->height;
		__btree_init(victim);
		return 0;
	}

	/* TODO: This needs some optimizations.  Currently we do three tree
	 * walks to remove a single object from the victim.
	 */
	for (;;) {
		if (!btree_last(victim, geo, key))
			break;
		val = btree_lookup(victim, geo, key);
		err = btree_insert(target, geo, key, val, gfp);
		if (err)
			return err;
		/* We must make a copy of the key, as the original will get
		 * mangled inside btree_remove. */
		longcpy(dup, key, geo->keylen);
		btree_remove(victim, geo, dup);
	}
	return 0;
}
EXPORT_SYMBOL_GPL(btree_merge);

static size_t __btree_for_each(struct btree_head *head, struct btree_geo *geo,
			       unsigned long *node, unsigned long opaque,
			       void (*func)(void *elem, unsigned long opaque,
					    unsigned long *key, size_t index,
					    void *func2),
			       void *func2, int reap, int height, size_t count)
{
	int i;
	unsigned long *child;

	for (i = 0; i < geo->no_pairs; i++) {
		child = bval(geo, node, i);
		if (!child)
			break;
		if (height > 1)
			count = __btree_for_each(head, geo, child, opaque,
					func, func2, reap, height - 1, count);
		else
			func(child, opaque, bkey(geo, node, i), count++,
					func2);
	}
	if (reap)
		mempool_free(node, head->mempool);
	return count;
}

static void empty(void *elem, unsigned long opaque, unsigned long *key,
		  size_t index, void *func2)
{
}

void visitorl(void *elem, unsigned long opaque, unsigned long *key,
	      size_t index, void *__func)
{
	visitorl_t func = __func;

	func(elem, opaque, *key, index);
}
EXPORT_SYMBOL_GPL(visitorl);

void visitor32(void *elem, unsigned long opaque, unsigned long *__key,
	       size_t index, void *__func)
{
	visitor32_t func = __func;
	u32 *key = (void *)__key;

	func(elem, opaque, *key, index);
}
EXPORT_SYMBOL_GPL(visitor32);

void visitor64(void *elem, unsigned long opaque, unsigned long *__key,
	       size_t index, void *__func)
{
	visitor64_t func = __func;
	u64 *key = (void *)__key;

	func(elem, opaque, *key, index);
}
EXPORT_SYMBOL_GPL(visitor64);

void visitor128(void *elem, unsigned long opaque, unsigned long *__key,
		size_t index, void *__func)
{
	visitor128_t func = __func;
	u64 *key = (void *)__key;

	func(elem, opaque, key[0], key[1], index);
}
EXPORT_SYMBOL_GPL(visitor128);

size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
		     unsigned long opaque,
		     void (*func)(void *elem, unsigned long opaque,
				  unsigned long *key,
				  size_t index, void *func2),
		     void *func2)
{
	size_t count = 0;

	if (!func2)
		func = empty;
	if (head->node)
		count = __btree_for_each(head, geo, head->node, opaque, func,
				func2, 0, head->height, 0);
	return count;
}
EXPORT_SYMBOL_GPL(btree_visitor);

size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
			  unsigned long opaque,
			  void (*func)(void *elem, unsigned long opaque,
				       unsigned long *key,
				       size_t index, void *func2),
			  void *func2)
{
	size_t count = 0;

	if (!func2)
		func = empty;
	if (head->node)
		count = __btree_for_each(head, geo, head->node, opaque, func,
				func2, 1, head->height, 0);
	__btree_init(head);
	return count;
}
EXPORT_SYMBOL_GPL(btree_grim_visitor);

static int __init btree_module_init(void)
{
	btree_cachep = kmem_cache_create("btree_node", NODESIZE, 0,
			SLAB_HWCACHE_ALIGN, NULL);
	return 0;
}

static void __exit btree_module_exit(void)
{
	kmem_cache_destroy(btree_cachep);
}

/* If core code starts using btree, initialization should happen even earlier */
module_init(btree_module_init);
module_exit(btree_module_exit);

MODULE_AUTHOR("Joern Engel <joern@logfs.org>");
MODULE_AUTHOR("Johannes Berg <johannes@sipsolutions.net>");
MODULE_LICENSE("GPL");