Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
"The significant new ext4 feature this time around is Harshad's new
fast_commit mode.

In addition, thanks to Mauricio for fixing a race where mmap'ed pages
that are being changed in parallel with a data=journal transaction
commit could result in bad checksums in the failure case, which could
cause journal replays to fail.

Also notable is Ritesh's buffered write optimization, which can result
in significant improvements on parallel write workloads. (The kernel
test robot reported a 330.6% improvement on fio.write_iops on a 96
core system using DAX.)

Besides that, we have the usual miscellaneous cleanups and bug fixes"

Link: https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
ext4: fix invalid inode checksum
ext4: add fast commit stats in procfs
ext4: add a mount opt to forcefully turn fast commits on
ext4: fast commit recovery path
jbd2: fast commit recovery path
ext4: main fast-commit commit path
jbd2: add fast commit machinery
ext4 / jbd2: add fast commit initialization
ext4: add fast_commit feature and handling for extended mount options
doc: update ext4 and journalling docs to include fast commit feature
ext4: Detect already used quota file early
jbd2: avoid transaction reuse after reformatting
ext4: use the normal helper to get the actual inode
ext4: fix bs < ps issue reported with dioread_nolock mount opt
ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
ext4: data=journal: fixes for ext4_page_mkwrite()
jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()
ext4: use ext4_sb_bread() instead of sb_bread()
...

+4725 -396
+66
Documentation/filesystems/ext4/journal.rst
··· 28 28 safest. If ``data=writeback``, dirty data blocks are not flushed to the 29 29 disk before the metadata are written to disk through the journal. 30 30 31 + In ``data=ordered`` mode, Ext4 also supports fast commits which 32 + help reduce commit latency significantly. The default ``data=ordered`` 33 + mode works by logging metadata blocks to the journal. In fast commit 34 + mode, Ext4 only stores the minimal delta needed to recreate the 35 + affected metadata in fast commit space that is shared with JBD2. 36 + Once the fast commit area fills up, or if a fast commit is not possible, 37 + or if the JBD2 commit timer goes off, Ext4 performs a traditional full commit. 38 + A full commit invalidates all the fast commits that happened before 39 + it and thus makes the fast commit area empty for further fast 40 + commits. This feature needs to be enabled at mkfs time. 41 + 31 42 The journal inode is typically inode 8. The first 68 bytes of the 32 43 journal inode are replicated in the ext4 superblock. The journal itself 33 44 is normal (but hidden) file within the filesystem. The file usually ··· 619 608 - \_\_be32 620 609 - h\_commit\_nsec 621 610 - Nanoseconds component of the above timestamp. 611 + 612 + Fast commits 613 + ~~~~~~~~~~~~ 614 + 615 + The fast commit area is organized as a log of tag-length-value (TLV) entries. Each TLV has 616 + a ``struct ext4_fc_tl`` at the beginning, which stores the tag and the length 617 + of the entire field. It is followed by a variable-length, tag-specific value. 618 + Here is the list of supported tags and their meanings: 619 + 620 + .. list-table:: 621 + :widths: 8 20 20 32 622 + :header-rows: 1 623 + 624 + * - Tag 625 + - Meaning 626 + - Value struct 627 + - Description 628 + * - EXT4_FC_TAG_HEAD 629 + - Fast commit area header 630 + - ``struct ext4_fc_head`` 631 + - Stores the TID of the transaction after which these fast commits should 632 + be applied. 
633 + * - EXT4_FC_TAG_ADD_RANGE 634 + - Add extent to inode 635 + - ``struct ext4_fc_add_range`` 636 + - Stores the inode number and extent to be added in this inode 637 + * - EXT4_FC_TAG_DEL_RANGE 638 + - Remove logical offsets to inode 639 + - ``struct ext4_fc_del_range`` 640 + - Stores the inode number and the logical offset range that needs to be 641 + removed 642 + * - EXT4_FC_TAG_CREAT 643 + - Create directory entry for a newly created file 644 + - ``struct ext4_fc_dentry_info`` 645 + - Stores the parent inode number, inode number and directory entry of the 646 + newly created file 647 + * - EXT4_FC_TAG_LINK 648 + - Link a directory entry to an inode 649 + - ``struct ext4_fc_dentry_info`` 650 + - Stores the parent inode number, inode number and directory entry 651 + * - EXT4_FC_TAG_UNLINK 652 + - Unlink a directory entry of an inode 653 + - ``struct ext4_fc_dentry_info`` 654 + - Stores the parent inode number, inode number and directory entry 655 + 656 + * - EXT4_FC_TAG_PAD 657 + - Padding (unused area) 658 + - None 659 + - Unused bytes in the fast commit area. 660 + 661 + * - EXT4_FC_TAG_TAIL 662 + - Mark the end of a fast commit 663 + - ``struct ext4_fc_tail`` 664 + - Stores the TID of the commit, CRC of the fast commit of which this tag 665 + represents the end of 622 666
+33
Documentation/filesystems/journalling.rst
··· 132 132 if you allow unprivileged userspace to trigger codepaths containing 133 133 these calls. 134 134 135 + Fast commits 136 + ~~~~~~~~~~~~ 137 + 138 + JBD2 also allows you to perform file-system specific delta commits known as 139 + fast commits. In order to use fast commits, you first need to call 140 + :c:func:`jbd2_fc_init` and tell it how many blocks at the end of the journal 141 + area should be reserved for fast commits. Along with that, you will also need 142 + to set the following callbacks that perform the corresponding work: 143 + 144 + `journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and 145 + fast commit. 146 + 147 + `journal->j_fc_replay_cb`: Replay function called for replay of fast commit 148 + blocks. 149 + 150 + The file system is free to perform fast commits as and when it wants, as long as it 151 + gets permission from JBD2 to do so by calling the function 152 + :c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client 153 + file system should tell JBD2 about it by calling 154 + :c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full 155 + commit immediately after stopping the fast commit, it can do so by calling 156 + :c:func:`jbd2_fc_end_commit_fallback()`. This is useful if the fast commit operation 157 + fails for some reason and the only way to guarantee consistency is for JBD2 to 158 + perform the full traditional commit. 159 + 160 + JBD2 provides helper functions to manage fast commit buffers. A file system can use 161 + :c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()` to allocate 162 + and wait on IO completion of fast commit buffers. 163 + 164 + Currently, only Ext4 implements fast commits. For details of its implementation 165 + of fast commits, please refer to the top level comments in 166 + fs/ext4/fast_commit.c. 167 + 135 168 Summary 136 169 ~~~~~~~ 137 170
+1 -1
fs/ext4/Makefile
··· 10 10 indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \ 11 11 mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \ 12 12 super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \ 13 - xattr_user.o 13 + xattr_user.o fast_commit.o 14 14 15 15 ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o 16 16 ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
+2
fs/ext4/acl.c
··· 242 242 handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits); 243 243 if (IS_ERR(handle)) 244 244 return PTR_ERR(handle); 245 + ext4_fc_start_update(inode); 245 246 246 247 if ((type == ACL_TYPE_ACCESS) && acl) { 247 248 error = posix_acl_update_mode(inode, &mode, &acl); ··· 260 259 } 261 260 out_stop: 262 261 ext4_journal_stop(handle); 262 + ext4_fc_stop_update(inode); 263 263 if (error == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) 264 264 goto retry; 265 265 return error;
+9 -5
fs/ext4/balloc.c
··· 368 368 struct buffer_head *bh) 369 369 { 370 370 ext4_fsblk_t blk; 371 - struct ext4_group_info *grp = ext4_get_group_info(sb, block_group); 371 + struct ext4_group_info *grp; 372 + 373 + if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) 374 + return 0; 375 + 376 + grp = ext4_get_group_info(sb, block_group); 372 377 373 378 if (buffer_verified(bh)) 374 379 return 0; ··· 500 495 */ 501 496 set_buffer_new(bh); 502 497 trace_ext4_read_block_bitmap_load(sb, block_group, ignore_locked); 503 - bh->b_end_io = ext4_end_bitmap_read; 504 - get_bh(bh); 505 - submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO | 506 - (ignore_locked ? REQ_RAHEAD : 0), bh); 498 + ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO | 499 + (ignore_locked ? REQ_RAHEAD : 0), 500 + ext4_end_bitmap_read); 507 501 return bh; 508 502 verify: 509 503 err = ext4_validate_block_bitmap(sb, desc, block_group, bh);
+5 -5
fs/ext4/block_validity.c
··· 131 131 132 132 printk(KERN_INFO "System zones: "); 133 133 rcu_read_lock(); 134 - system_blks = rcu_dereference(sbi->system_blks); 134 + system_blks = rcu_dereference(sbi->s_system_blks); 135 135 node = rb_first(&system_blks->root); 136 136 while (node) { 137 137 entry = rb_entry(node, struct ext4_system_zone, node); ··· 261 261 * with ext4_data_block_valid() accessing the rbtree at the same 262 262 * time. 263 263 */ 264 - rcu_assign_pointer(sbi->system_blks, system_blks); 264 + rcu_assign_pointer(sbi->s_system_blks, system_blks); 265 265 266 266 if (test_opt(sb, DEBUG)) 267 267 debug_print_tree(sbi); ··· 286 286 { 287 287 struct ext4_system_blocks *system_blks; 288 288 289 - system_blks = rcu_dereference_protected(EXT4_SB(sb)->system_blks, 289 + system_blks = rcu_dereference_protected(EXT4_SB(sb)->s_system_blks, 290 290 lockdep_is_held(&sb->s_umount)); 291 - rcu_assign_pointer(EXT4_SB(sb)->system_blks, NULL); 291 + rcu_assign_pointer(EXT4_SB(sb)->s_system_blks, NULL); 292 292 293 293 if (system_blks) 294 294 call_rcu(&system_blks->rcu, ext4_destroy_system_zone); ··· 319 319 * mount option. 320 320 */ 321 321 rcu_read_lock(); 322 - system_blks = rcu_dereference(sbi->system_blks); 322 + system_blks = rcu_dereference(sbi->s_system_blks); 323 323 if (system_blks == NULL) 324 324 goto out_rcu; 325 325
+2 -2
fs/ext4/dir.c
··· 674 674 { 675 675 struct qstr qstr = {.name = str, .len = len }; 676 676 const struct dentry *parent = READ_ONCE(dentry->d_parent); 677 - const struct inode *inode = READ_ONCE(parent->d_inode); 677 + const struct inode *inode = d_inode_rcu(parent); 678 678 char strbuf[DNAME_INLINE_LEN]; 679 679 680 680 if (!inode || !IS_CASEFOLDED(inode) || ··· 706 706 { 707 707 const struct ext4_sb_info *sbi = EXT4_SB(dentry->d_sb); 708 708 const struct unicode_map *um = sbi->s_encoding; 709 - const struct inode *inode = READ_ONCE(dentry->d_inode); 709 + const struct inode *inode = d_inode_rcu(dentry); 710 710 unsigned char *norm; 711 711 int len, ret = 0; 712 712
+126 -10
fs/ext4/ext4.h
··· 27 27 #include <linux/seqlock.h> 28 28 #include <linux/mutex.h> 29 29 #include <linux/timer.h> 30 - #include <linux/version.h> 31 30 #include <linux/wait.h> 32 31 #include <linux/sched/signal.h> 33 32 #include <linux/blockgroup_lock.h> ··· 491 492 492 493 /* Flags which are mutually exclusive to DAX */ 493 494 #define EXT4_DAX_MUT_EXCL (EXT4_VERITY_FL | EXT4_ENCRYPT_FL |\ 494 - EXT4_JOURNAL_DATA_FL) 495 + EXT4_JOURNAL_DATA_FL | EXT4_INLINE_DATA_FL) 495 496 496 497 /* Mask out flags that are inappropriate for the given type of inode. */ 497 498 static inline __u32 ext4_mask_flags(umode_t mode, __u32 flags) ··· 963 964 #endif /* defined(__KERNEL__) || defined(__linux__) */ 964 965 965 966 #include "extents_status.h" 967 + #include "fast_commit.h" 966 968 967 969 /* 968 970 * Lock subclasses for i_data_sem in the ext4_inode_info structure. ··· 1020 1020 struct rw_semaphore xattr_sem; 1021 1021 1022 1022 struct list_head i_orphan; /* unlinked but open inodes */ 1023 + 1024 + /* Fast commit related info */ 1025 + 1026 + struct list_head i_fc_list; /* 1027 + * inodes that need fast commit 1028 + * protected by sbi->s_fc_lock. 
1029 + */ 1030 + 1031 + /* Fast commit subtid when this inode was committed */ 1032 + unsigned int i_fc_committed_subtid; 1033 + 1034 + /* Start of lblk range that needs to be committed in this fast commit */ 1035 + ext4_lblk_t i_fc_lblk_start; 1036 + 1037 + /* End of lblk range that needs to be committed in this fast commit */ 1038 + ext4_lblk_t i_fc_lblk_len; 1039 + 1040 + /* Number of ongoing updates on this inode */ 1041 + atomic_t i_fc_updates; 1042 + 1043 + /* Fast commit wait queue for this inode */ 1044 + wait_queue_head_t i_fc_wait; 1045 + 1046 + /* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */ 1047 + struct mutex i_fc_lock; 1023 1048 1024 1049 /* 1025 1050 * i_disksize keeps track of what the inode size is ON DISK, not ··· 1166 1141 #define EXT4_VALID_FS 0x0001 /* Unmounted cleanly */ 1167 1142 #define EXT4_ERROR_FS 0x0002 /* Errors detected */ 1168 1143 #define EXT4_ORPHAN_FS 0x0004 /* Orphans being recovered */ 1144 + #define EXT4_FC_INELIGIBLE 0x0008 /* Fast commit ineligible */ 1145 + #define EXT4_FC_COMMITTING 0x0010 /* File system underoing a fast 1146 + * commit. 1147 + */ 1148 + #define EXT4_FC_REPLAY 0x0020 /* Fast commit replay ongoing */ 1169 1149 1170 1150 /* 1171 1151 * Misc. 
filesystem flags ··· 1243 1213 1244 1214 #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly 1245 1215 specified journal checksum */ 1216 + 1217 + #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */ 1246 1218 1247 1219 #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \ 1248 1220 ~EXT4_MOUNT_##opt ··· 1513 1481 unsigned long s_commit_interval; 1514 1482 u32 s_max_batch_time; 1515 1483 u32 s_min_batch_time; 1516 - struct block_device *journal_bdev; 1484 + struct block_device *s_journal_bdev; 1517 1485 #ifdef CONFIG_QUOTA 1518 1486 /* Names of quota files with journalled quota */ 1519 1487 char __rcu *s_qf_names[EXT4_MAXQUOTAS]; 1520 1488 int s_jquota_fmt; /* Format of quota to use */ 1521 1489 #endif 1522 1490 unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */ 1523 - struct ext4_system_blocks __rcu *system_blks; 1491 + struct ext4_system_blocks __rcu *s_system_blks; 1524 1492 1525 1493 #ifdef EXTENTS_STATS 1526 1494 /* ext4 extents stats */ ··· 1643 1611 /* Record the errseq of the backing block device */ 1644 1612 errseq_t s_bdev_wb_err; 1645 1613 spinlock_t s_bdev_wb_lock; 1614 + 1615 + /* Ext4 fast commit stuff */ 1616 + atomic_t s_fc_subtid; 1617 + atomic_t s_fc_ineligible_updates; 1618 + /* 1619 + * After commit starts, the main queue gets locked, and the further 1620 + * updates get added in the staging queue. 1621 + */ 1622 + #define FC_Q_MAIN 0 1623 + #define FC_Q_STAGING 1 1624 + struct list_head s_fc_q[2]; /* Inodes staged for fast commit 1625 + * that have data changes in them. 1626 + */ 1627 + struct list_head s_fc_dentry_q[2]; /* directory entry updates */ 1628 + unsigned int s_fc_bytes; 1629 + /* 1630 + * Main fast commit lock. This lock protects accesses to the 1631 + * following fields: 1632 + * ei->i_fc_list, s_fc_dentry_q, s_fc_q, s_fc_bytes, s_fc_bh. 
1633 + */ 1634 + spinlock_t s_fc_lock; 1635 + struct buffer_head *s_fc_bh; 1636 + struct ext4_fc_stats s_fc_stats; 1637 + u64 s_fc_avg_commit_time; 1638 + #ifdef CONFIG_EXT4_DEBUG 1639 + int s_fc_debug_max_replay; 1640 + #endif 1641 + struct ext4_fc_replay_state s_fc_replay_state; 1646 1642 }; 1647 1643 1648 1644 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb) ··· 1781 1721 EXT4_STATE_EXT_PRECACHED, /* extents have been precached */ 1782 1722 EXT4_STATE_LUSTRE_EA_INODE, /* Lustre-style ea_inode */ 1783 1723 EXT4_STATE_VERITY_IN_PROGRESS, /* building fs-verity Merkle tree */ 1724 + EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */ 1784 1725 }; 1785 1726 1786 1727 #define EXT4_INODE_BIT_FNS(name, field, offset) \ ··· 1875 1814 #define EXT4_FEATURE_COMPAT_RESIZE_INODE 0x0010 1876 1815 #define EXT4_FEATURE_COMPAT_DIR_INDEX 0x0020 1877 1816 #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2 0x0200 1817 + #define EXT4_FEATURE_COMPAT_FAST_COMMIT 0x0400 1878 1818 #define EXT4_FEATURE_COMPAT_STABLE_INODES 0x0800 1879 1819 1880 1820 #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001 ··· 1978 1916 EXT4_FEATURE_COMPAT_FUNCS(resize_inode, RESIZE_INODE) 1979 1917 EXT4_FEATURE_COMPAT_FUNCS(dir_index, DIR_INDEX) 1980 1918 EXT4_FEATURE_COMPAT_FUNCS(sparse_super2, SPARSE_SUPER2) 1919 + EXT4_FEATURE_COMPAT_FUNCS(fast_commit, FAST_COMMIT) 1981 1920 EXT4_FEATURE_COMPAT_FUNCS(stable_inodes, STABLE_INODES) 1982 1921 1983 1922 EXT4_FEATURE_RO_COMPAT_FUNCS(sparse_super, SPARSE_SUPER) ··· 2713 2650 struct dx_hash_info *hinfo); 2714 2651 2715 2652 /* ialloc.c */ 2653 + extern int ext4_mark_inode_used(struct super_block *sb, int ino); 2716 2654 extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t, 2717 2655 const struct qstr *qstr, __u32 goal, 2718 2656 uid_t *owner, __u32 i_flags, ··· 2738 2674 extern int ext4_init_inode_table(struct super_block *sb, 2739 2675 ext4_group_t group, int barrier); 2740 2676 extern void ext4_end_bitmap_read(struct buffer_head *bh, 
int uptodate); 2677 + 2678 + /* fast_commit.c */ 2679 + int ext4_fc_info_show(struct seq_file *seq, void *v); 2680 + void ext4_fc_init(struct super_block *sb, journal_t *journal); 2681 + void ext4_fc_init_inode(struct inode *inode); 2682 + void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start, 2683 + ext4_lblk_t end); 2684 + void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry); 2685 + void ext4_fc_track_link(struct inode *inode, struct dentry *dentry); 2686 + void ext4_fc_track_create(struct inode *inode, struct dentry *dentry); 2687 + void ext4_fc_track_inode(struct inode *inode); 2688 + void ext4_fc_mark_ineligible(struct super_block *sb, int reason); 2689 + void ext4_fc_start_ineligible(struct super_block *sb, int reason); 2690 + void ext4_fc_stop_ineligible(struct super_block *sb); 2691 + void ext4_fc_start_update(struct inode *inode); 2692 + void ext4_fc_stop_update(struct inode *inode); 2693 + void ext4_fc_del(struct inode *inode); 2694 + bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t block); 2695 + void ext4_fc_replay_cleanup(struct super_block *sb); 2696 + int ext4_fc_commit(journal_t *journal, tid_t commit_tid); 2697 + int __init ext4_fc_init_dentry_cache(void); 2741 2698 2742 2699 /* mballoc.c */ 2743 2700 extern const struct seq_operations ext4_mb_seq_groups_ops; ··· 2789 2704 ext4_fsblk_t block, unsigned long count); 2790 2705 extern int ext4_trim_fs(struct super_block *, struct fstrim_range *); 2791 2706 extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid); 2707 + extern void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block, 2708 + int len, int state); 2792 2709 2793 2710 /* inode.c */ 2711 + void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw, 2712 + struct ext4_inode_info *ei); 2794 2713 int ext4_inode_is_fast_symlink(struct inode *inode); 2795 2714 struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int); 2796 2715 struct 
buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int); ··· 2841 2752 extern void ext4_dirty_inode(struct inode *, int); 2842 2753 extern int ext4_change_inode_journal_flag(struct inode *, int); 2843 2754 extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); 2755 + extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino, 2756 + struct ext4_iloc *iloc); 2844 2757 extern int ext4_inode_attach_jinode(struct inode *inode); 2845 2758 extern int ext4_can_truncate(struct inode *inode); 2846 2759 extern int ext4_truncate(struct inode *); ··· 2876 2785 /* ioctl.c */ 2877 2786 extern long ext4_ioctl(struct file *, unsigned int, unsigned long); 2878 2787 extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long); 2788 + extern void ext4_reset_inode_seed(struct inode *inode); 2879 2789 2880 2790 /* migrate.c */ 2881 2791 extern int ext4_ext_migrate(struct inode *); 2882 2792 extern int ext4_ind_migrate(struct inode *inode); 2883 2793 2884 2794 /* namei.c */ 2795 + extern int ext4_init_new_dir(handle_t *handle, struct inode *dir, 2796 + struct inode *inode); 2885 2797 extern int ext4_dirblock_csum_verify(struct inode *inode, 2886 2798 struct buffer_head *bh); 2887 2799 extern int ext4_orphan_add(handle_t *, struct inode *); ··· 2918 2824 /* super.c */ 2919 2825 extern struct buffer_head *ext4_sb_bread(struct super_block *sb, 2920 2826 sector_t block, int op_flags); 2827 + extern struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb, 2828 + sector_t block); 2829 + extern void ext4_read_bh_nowait(struct buffer_head *bh, int op_flags, 2830 + bh_end_io_t *end_io); 2831 + extern int ext4_read_bh(struct buffer_head *bh, int op_flags, 2832 + bh_end_io_t *end_io); 2833 + extern int ext4_read_bh_lock(struct buffer_head *bh, int op_flags, bool wait); 2834 + extern void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block); 2921 2835 extern int ext4_seq_options_show(struct seq_file *seq, void 
*offset); 2922 2836 extern int ext4_calculate_overhead(struct super_block *sb); 2923 2837 extern void ext4_superblock_csum_set(struct super_block *sb); ··· 3114 3012 return ext4_has_feature_gdt_csum(sb) || ext4_has_metadata_csum(sb); 3115 3013 } 3116 3014 3015 + #define ext4_read_incompat_64bit_val(es, name) \ 3016 + (((es)->s_feature_incompat & cpu_to_le32(EXT4_FEATURE_INCOMPAT_64BIT) \ 3017 + ? (ext4_fsblk_t)le32_to_cpu(es->name##_hi) << 32 : 0) | \ 3018 + le32_to_cpu(es->name##_lo)) 3019 + 3117 3020 static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es) 3118 3021 { 3119 - return ((ext4_fsblk_t)le32_to_cpu(es->s_blocks_count_hi) << 32) | 3120 - le32_to_cpu(es->s_blocks_count_lo); 3022 + return ext4_read_incompat_64bit_val(es, s_blocks_count); 3121 3023 } 3122 3024 3123 3025 static inline ext4_fsblk_t ext4_r_blocks_count(struct ext4_super_block *es) 3124 3026 { 3125 - return ((ext4_fsblk_t)le32_to_cpu(es->s_r_blocks_count_hi) << 32) | 3126 - le32_to_cpu(es->s_r_blocks_count_lo); 3027 + return ext4_read_incompat_64bit_val(es, s_r_blocks_count); 3127 3028 } 3128 3029 3129 3030 static inline ext4_fsblk_t ext4_free_blocks_count(struct ext4_super_block *es) 3130 3031 { 3131 - return ((ext4_fsblk_t)le32_to_cpu(es->s_free_blocks_count_hi) << 32) | 3132 - le32_to_cpu(es->s_free_blocks_count_lo); 3032 + return ext4_read_incompat_64bit_val(es, s_free_blocks_count); 3133 3033 } 3134 3034 3135 3035 static inline void ext4_blocks_count_set(struct ext4_super_block *es, ··· 3257 3153 3258 3154 struct ext4_group_info { 3259 3155 unsigned long bb_state; 3156 + #ifdef AGGRESSIVE_CHECK 3157 + unsigned long bb_check_counter; 3158 + #endif 3260 3159 struct rb_root bb_free_root; 3261 3160 ext4_grpblk_t bb_first_free; /* first free block */ 3262 3161 ext4_grpblk_t bb_free; /* total free blocks */ ··· 3464 3357 extern int ext4_ci_compare(const struct inode *parent, 3465 3358 const struct qstr *fname, 3466 3359 const struct qstr *entry, bool quick); 3360 + extern int 
__ext4_unlink(struct inode *dir, const struct qstr *d_name, 3361 + struct inode *inode); 3362 + extern int __ext4_link(struct inode *dir, struct inode *inode, 3363 + struct dentry *dentry); 3467 3364 3468 3365 #define S_SHIFT 12 3469 3366 static const unsigned char ext4_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = { ··· 3568 3457 extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode, 3569 3458 int check_cred, int restart_cred, 3570 3459 int revoke_cred); 3460 + extern void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end); 3461 + extern int ext4_ext_replay_set_iblocks(struct inode *inode); 3462 + extern int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start, 3463 + int len, int unwritten, ext4_fsblk_t pblk); 3464 + extern int ext4_ext_clear_bb(struct inode *inode); 3571 3465 3572 3466 3573 3467 /* move_extent.c */
+1 -1
fs/ext4/ext4_jbd2.c
··· 100 100 return ERR_PTR(err); 101 101 102 102 journal = EXT4_SB(sb)->s_journal; 103 - if (!journal) 103 + if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)) 104 104 return ext4_get_nojournal(); 105 105 return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds, 106 106 GFP_NOFS, type, line);
+298 -17
fs/ext4/extents.c
··· 501 501 502 502 if (!bh_uptodate_or_lock(bh)) { 503 503 trace_ext4_ext_load_extent(inode, pblk, _RET_IP_); 504 - err = bh_submit_read(bh); 504 + err = ext4_read_bh(bh, 0, NULL); 505 505 if (err < 0) 506 506 goto errout; 507 507 } ··· 3723 3723 err = ext4_ext_dirty(handle, inode, path + path->p_depth); 3724 3724 out: 3725 3725 ext4_ext_show_leaf(inode, path); 3726 + ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1); 3726 3727 return err; 3727 3728 } 3728 3729 ··· 3795 3794 if (*allocated > map->m_len) 3796 3795 *allocated = map->m_len; 3797 3796 map->m_len = *allocated; 3797 + ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1); 3798 3798 return 0; 3799 3799 } 3800 3800 ··· 4025 4023 * down_read(&EXT4_I(inode)->i_data_sem) if not allocating file system block 4026 4024 * (ie, create is zero). Otherwise down_write(&EXT4_I(inode)->i_data_sem) 4027 4025 * 4028 - * return > 0, number of of blocks already mapped/allocated 4026 + * return > 0, number of blocks already mapped/allocated 4029 4027 * if create == 0 and these are pre-allocated blocks 4030 4028 * buffer head is unmapped 4031 4029 * otherwise blocks are mapped ··· 4329 4327 map->m_len = ar.len; 4330 4328 allocated = map->m_len; 4331 4329 ext4_ext_show_leaf(inode, path); 4332 - 4330 + ext4_fc_track_range(inode, map->m_lblk, map->m_lblk + map->m_len - 1); 4333 4331 out: 4334 4332 ext4_ext_drop_refs(path); 4335 4333 kfree(path); ··· 4602 4600 ret = ext4_mark_inode_dirty(handle, inode); 4603 4601 if (unlikely(ret)) 4604 4602 goto out_handle; 4605 - 4603 + ext4_fc_track_range(inode, offset >> inode->i_sb->s_blocksize_bits, 4604 + (offset + len - 1) >> inode->i_sb->s_blocksize_bits); 4606 4605 /* Zero out partial block at the edges of the range */ 4607 4606 ret = ext4_zero_partial_blocks(handle, inode, offset, len); 4608 4607 if (ret >= 0) ··· 4651 4648 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | 4652 4649 FALLOC_FL_INSERT_RANGE)) 4653 4650 return -EOPNOTSUPP; 4651 + 
ext4_fc_track_range(inode, offset >> blkbits, 4652 + (offset + len - 1) >> blkbits); 4654 4653 4655 - if (mode & FALLOC_FL_PUNCH_HOLE) 4656 - return ext4_punch_hole(inode, offset, len); 4654 + ext4_fc_start_update(inode); 4655 + 4656 + if (mode & FALLOC_FL_PUNCH_HOLE) { 4657 + ret = ext4_punch_hole(inode, offset, len); 4658 + goto exit; 4659 + } 4657 4660 4658 4661 ret = ext4_convert_inline_data(inode); 4659 4662 if (ret) 4660 - return ret; 4663 + goto exit; 4661 4664 4662 - if (mode & FALLOC_FL_COLLAPSE_RANGE) 4663 - return ext4_collapse_range(inode, offset, len); 4665 + if (mode & FALLOC_FL_COLLAPSE_RANGE) { 4666 + ret = ext4_collapse_range(inode, offset, len); 4667 + goto exit; 4668 + } 4664 4669 4665 - if (mode & FALLOC_FL_INSERT_RANGE) 4666 - return ext4_insert_range(inode, offset, len); 4670 + if (mode & FALLOC_FL_INSERT_RANGE) { 4671 + ret = ext4_insert_range(inode, offset, len); 4672 + goto exit; 4673 + } 4667 4674 4668 - if (mode & FALLOC_FL_ZERO_RANGE) 4669 - return ext4_zero_range(file, offset, len, mode); 4670 - 4675 + if (mode & FALLOC_FL_ZERO_RANGE) { 4676 + ret = ext4_zero_range(file, offset, len, mode); 4677 + goto exit; 4678 + } 4671 4679 trace_ext4_fallocate_enter(inode, offset, len, mode); 4672 4680 lblk = offset >> blkbits; 4673 4681 ··· 4712 4698 goto out; 4713 4699 4714 4700 if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) { 4715 - ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal, 4716 - EXT4_I(inode)->i_sync_tid); 4701 + ret = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal, 4702 + EXT4_I(inode)->i_sync_tid); 4717 4703 } 4718 4704 out: 4719 4705 inode_unlock(inode); 4720 4706 trace_ext4_fallocate_exit(inode, offset, max_blocks, ret); 4707 + exit: 4708 + ext4_fc_stop_update(inode); 4721 4709 return ret; 4722 4710 } 4723 4711 ··· 4785 4769 4786 4770 int ext4_convert_unwritten_io_end_vec(handle_t *handle, ext4_io_end_t *io_end) 4787 4771 { 4788 - int ret, err = 0; 4772 + int ret = 0, err = 0; 4789 4773 struct 
ext4_io_end_vec *io_end_vec; 4790 4774 4791 4775 /* ··· 5307 5291 ret = PTR_ERR(handle); 5308 5292 goto out_mmap; 5309 5293 } 5294 + ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE); 5310 5295 5311 5296 down_write(&EXT4_I(inode)->i_data_sem); 5312 5297 ext4_discard_preallocations(inode, 0); ··· 5346 5329 5347 5330 out_stop: 5348 5331 ext4_journal_stop(handle); 5332 + ext4_fc_stop_ineligible(sb); 5349 5333 out_mmap: 5350 5334 up_write(&EXT4_I(inode)->i_mmap_sem); 5351 5335 out_mutex: ··· 5447 5429 ret = PTR_ERR(handle); 5448 5430 goto out_mmap; 5449 5431 } 5432 + ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE); 5450 5433 5451 5434 /* Expand file to avoid data loss if there is error while shifting */ 5452 5435 inode->i_size += len; ··· 5522 5503 5523 5504 out_stop: 5524 5505 ext4_journal_stop(handle); 5506 + ext4_fc_stop_ineligible(sb); 5525 5507 out_mmap: 5526 5508 up_write(&EXT4_I(inode)->i_mmap_sem); 5527 5509 out_mutex: ··· 5803 5783 kfree(path); 5804 5784 5805 5785 return err ? err : mapped; 5786 + } 5787 + 5788 + /* 5789 + * Updates physical block address and unwritten status of extent 5790 + * starting at lblk start and of len. If such an extent doesn't exist, 5791 + * this function splits the extent tree appropriately to create an 5792 + * extent like this. This function is called in the fast commit 5793 + * replay path. Returns 0 on success and error on failure. 
+ */
+int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start,
+			      int len, int unwritten, ext4_fsblk_t pblk)
+{
+	struct ext4_ext_path *path = NULL, *ppath;
+	struct ext4_extent *ex;
+	int ret;
+
+	path = ext4_find_extent(inode, start, NULL, 0);
+	if (!path)
+		return -EINVAL;
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ret = -EFSCORRUPTED;
+		goto out;
+	}
+
+	if (le32_to_cpu(ex->ee_block) != start ||
+		ext4_ext_get_actual_len(ex) != len) {
+		/* We need to split this extent to match our extent first */
+		ppath = path;
+		down_write(&EXT4_I(inode)->i_data_sem);
+		ret = ext4_force_split_extent_at(NULL, inode, &ppath, start, 1);
+		up_write(&EXT4_I(inode)->i_data_sem);
+		if (ret)
+			goto out;
+		kfree(path);
+		path = ext4_find_extent(inode, start, NULL, 0);
+		if (IS_ERR(path))
+			return -1;
+		ppath = path;
+		ex = path[path->p_depth].p_ext;
+		WARN_ON(le32_to_cpu(ex->ee_block) != start);
+		if (ext4_ext_get_actual_len(ex) != len) {
+			down_write(&EXT4_I(inode)->i_data_sem);
+			ret = ext4_force_split_extent_at(NULL, inode, &ppath,
+							 start + len, 1);
+			up_write(&EXT4_I(inode)->i_data_sem);
+			if (ret)
+				goto out;
+			kfree(path);
+			path = ext4_find_extent(inode, start, NULL, 0);
+			if (IS_ERR(path))
+				return -EINVAL;
+			ex = path[path->p_depth].p_ext;
+		}
+	}
+	if (unwritten)
+		ext4_ext_mark_unwritten(ex);
+	else
+		ext4_ext_mark_initialized(ex);
+	ext4_ext_store_pblock(ex, pblk);
+	down_write(&EXT4_I(inode)->i_data_sem);
+	ret = ext4_ext_dirty(NULL, inode, &path[path->p_depth]);
+	up_write(&EXT4_I(inode)->i_data_sem);
+out:
+	ext4_ext_drop_refs(path);
+	kfree(path);
+	ext4_mark_inode_dirty(NULL, inode);
+	return ret;
+}
+
+/* Try to shrink the extent tree */
+void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end)
+{
+	struct ext4_ext_path *path = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t old_cur, cur = 0;
+
+	while (cur < end) {
+		path = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path))
+			return;
+		ex = path[path->p_depth].p_ext;
+		if (!ex) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			ext4_mark_inode_dirty(NULL, inode);
+			return;
+		}
+		old_cur = cur;
+		cur = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+		if (cur <= old_cur)
+			cur = old_cur + 1;
+		ext4_ext_try_to_merge(NULL, inode, path, ex);
+		down_write(&EXT4_I(inode)->i_data_sem);
+		ext4_ext_dirty(NULL, inode, &path[path->p_depth]);
+		up_write(&EXT4_I(inode)->i_data_sem);
+		ext4_mark_inode_dirty(NULL, inode);
+		ext4_ext_drop_refs(path);
+		kfree(path);
+	}
+}
+
+/* Check if *cur is a hole and if it is, skip it */
+static void skip_hole(struct inode *inode, ext4_lblk_t *cur)
+{
+	int ret;
+	struct ext4_map_blocks map;
+
+	map.m_lblk = *cur;
+	map.m_len = ((inode->i_size) >> inode->i_sb->s_blocksize_bits) - *cur;
+
+	ret = ext4_map_blocks(NULL, inode, &map, 0);
+	if (ret != 0)
+		return;
+	*cur = *cur + map.m_len;
+}
+
+/* Count number of blocks used by this inode and update i_blocks */
+int ext4_ext_replay_set_iblocks(struct inode *inode)
+{
+	struct ext4_ext_path *path = NULL, *path2 = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t cur = 0, end;
+	int numblks = 0, i, ret = 0;
+	ext4_fsblk_t cmp1, cmp2;
+	struct ext4_map_blocks map;
+
+	/* Determine the size of the file first */
+	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL,
+					EXT4_EX_NOCACHE);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ext4_ext_drop_refs(path);
+		kfree(path);
+		goto out;
+	}
+	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+	ext4_ext_drop_refs(path);
+	kfree(path);
+
+	/* Count the number of data blocks */
+	cur = 0;
+	while (cur < end) {
+		map.m_lblk = cur;
+		map.m_len = end - cur;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0)
+			numblks += ret;
+		cur = cur + map.m_len;
+	}
+
+	/*
+	 * Count the number of extent tree blocks. We do it by looking up
+	 * two successive extents and determining the difference between
+	 * their paths. When path is different for 2 successive extents
+	 * we compare the blocks in the path at each level and increment
+	 * iblocks by total number of differences found.
+	 */
+	cur = 0;
+	skip_hole(inode, &cur);
+	path = ext4_find_extent(inode, cur, NULL, 0);
+	if (IS_ERR(path))
+		goto out;
+	numblks += path->p_depth;
+	ext4_ext_drop_refs(path);
+	kfree(path);
+	while (cur < end) {
+		path = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path))
+			break;
+		ex = path[path->p_depth].p_ext;
+		if (!ex) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			return 0;
+		}
+		cur = max(cur + 1, le32_to_cpu(ex->ee_block) +
+				ext4_ext_get_actual_len(ex));
+		skip_hole(inode, &cur);
+
+		path2 = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path2)) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			break;
+		}
+		ex = path2[path2->p_depth].p_ext;
+		for (i = 0; i <= max(path->p_depth, path2->p_depth); i++) {
+			cmp1 = cmp2 = 0;
+			if (i <= path->p_depth)
+				cmp1 = path[i].p_bh ?
+					path[i].p_bh->b_blocknr : 0;
+			if (i <= path2->p_depth)
+				cmp2 = path2[i].p_bh ?
+					path2[i].p_bh->b_blocknr : 0;
+			if (cmp1 != cmp2 && cmp2 != 0)
+				numblks++;
+		}
+		ext4_ext_drop_refs(path);
+		ext4_ext_drop_refs(path2);
+		kfree(path);
+		kfree(path2);
+	}
+
+out:
+	inode->i_blocks = numblks << (inode->i_sb->s_blocksize_bits - 9);
+	ext4_mark_inode_dirty(NULL, inode);
+	return 0;
+}
+
+int ext4_ext_clear_bb(struct inode *inode)
+{
+	struct ext4_ext_path *path = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t cur = 0, end;
+	int j, ret = 0;
+	struct ext4_map_blocks map;
+
+	/* Determine the size of the file first */
+	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL,
+					EXT4_EX_NOCACHE);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ext4_ext_drop_refs(path);
+		kfree(path);
+		return 0;
+	}
+	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+	ext4_ext_drop_refs(path);
+	kfree(path);
+
+	cur = 0;
+	while (cur < end) {
+		map.m_lblk = cur;
+		map.m_len = end - cur;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			path = ext4_find_extent(inode, map.m_lblk, NULL, 0);
+			if (!IS_ERR_OR_NULL(path)) {
+				for (j = 0; j < path->p_depth; j++) {
+
+					ext4_mb_mark_bb(inode->i_sb,
+							path[j].p_block, 1, 0);
+				}
+				ext4_ext_drop_refs(path);
+				kfree(path);
+			}
+			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
+		}
+		cur = cur + map.m_len;
+	}
+
+	return 0;
 }
fs/ext4/extents_status.c (+24)
···
 		ext4_lblk_t lblk, ext4_lblk_t end,
 		struct extent_status *es)
 {
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return;
+
 	trace_ext4_es_find_extent_range_enter(inode, lblk);
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
···
 {
 	bool ret;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return false;
+
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	ret = __es_scan_range(inode, matching_fn, lblk, end);
 	read_unlock(&EXT4_I(inode)->i_es_lock);
···
 		ext4_lblk_t lblk)
 {
 	bool ret;
+
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return false;
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	ret = __es_scan_clu(inode, matching_fn, lblk);
···
 	int err = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	es_debug("add [%u/%u) %llu %x to extent status tree of inode %lu\n",
 		 lblk, len, pblk, status, inode->i_ino);
···
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return;
+
 	newes.es_lblk = lblk;
 	newes.es_len = len;
 	ext4_es_store_pblock_status(&newes, pblk, status);
···
 	struct extent_status *es1 = NULL;
 	struct rb_node *node;
 	int found = 0;
+
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
 
 	trace_ext4_es_lookup_extent_enter(inode, lblk);
 	es_debug("lookup extent in block %u\n", lblk);
···
 	ext4_lblk_t end;
 	int err = 0;
 	int reserved = 0;
+
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
 
 	trace_ext4_es_remove_extent(inode, lblk, len);
 
 	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
···
 {
 	struct extent_status newes;
 	int err = 0;
+
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
 
 	es_debug("add [%u/1) delayed to extent status tree of inode %lu\n",
 		 lblk, inode->i_ino);
fs/ext4/fast_commit.c (+2139)
···
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * fs/ext4/fast_commit.c
+ *
+ * Written by Harshad Shirwadkar <harshadshirwadkar@gmail.com>
+ *
+ * Ext4 fast commits routines.
+ */
+#include "ext4.h"
+#include "ext4_jbd2.h"
+#include "ext4_extents.h"
+#include "mballoc.h"
+
+/*
+ * Ext4 Fast Commits
+ * -----------------
+ *
+ * Ext4 fast commits implement fine grained journalling for Ext4.
+ *
+ * Fast commits are organized as a log of tag-length-value (TLV) structs. (See
+ * struct ext4_fc_tl). Each TLV contains some delta that is replayed TLV by
+ * TLV during the recovery phase. For the scenarios for which we currently
+ * don't have replay code, fast commit falls back to full commits.
+ * Fast commits record deltas in one of the following three categories.
+ *
+ * (A) Directory entry updates:
+ *
+ * - EXT4_FC_TAG_UNLINK		- records directory entry unlink
+ * - EXT4_FC_TAG_LINK		- records directory entry link
+ * - EXT4_FC_TAG_CREAT		- records inode and directory entry creation
+ *
+ * (B) File specific data range updates:
+ *
+ * - EXT4_FC_TAG_ADD_RANGE	- records addition of new blocks to an inode
+ * - EXT4_FC_TAG_DEL_RANGE	- records deletion of blocks from an inode
+ *
+ * (C) Inode metadata (mtime / ctime etc):
+ *
+ * - EXT4_FC_TAG_INODE		- record the inode that should be replayed
+ *				  during recovery. Note that iblocks field is
+ *				  not replayed and instead derived during
+ *				  replay.
+ *
+ * Commit Operation
+ * ----------------
+ * With fast commits, we maintain all the directory entry operations in the
+ * order in which they are issued in an in-memory queue. This queue is flushed
+ * to disk during the commit operation. We also maintain a list of inodes
+ * that need to be committed during a fast commit in another in-memory queue
+ * of inodes. During the commit operation, we commit in the following order:
+ *
+ * [1] Lock inodes for any further data updates by setting COMMITTING state
+ * [2] Submit data buffers of all the inodes
+ * [3] Wait for [2] to complete
+ * [4] Commit all the directory entry updates in the fast commit space
+ * [5] Commit all the changed inode structures
+ * [6] Write tail tag (this tag ensures the atomicity, please read the
+ *     following section for more details).
+ * [7] Wait for [4], [5] and [6] to complete.
+ *
+ * All the inode updates must call ext4_fc_start_update() before starting an
+ * update. If such an ongoing update is present, fast commit waits for it to
+ * complete. The completion of such an update is marked by
+ * ext4_fc_stop_update().
+ *
+ * Fast Commit Ineligibility
+ * -------------------------
+ * Not all operations are supported by fast commits today (e.g. extended
+ * attributes). Fast commit ineligibility is marked by calling one of the
+ * two following functions:
+ *
+ * - ext4_fc_mark_ineligible(): This makes the next fast commit operation
+ *   fall back to a full commit. This is useful in case of transient errors.
+ *
+ * - ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() - This makes all
+ *   the fast commits happening between ext4_fc_start_ineligible() and
+ *   ext4_fc_stop_ineligible(), and one fast commit after the call to
+ *   ext4_fc_stop_ineligible(), fall back to full commits. It is important to
+ *   make one more fast commit fall back to a full commit after the stop call,
+ *   so that it is guaranteed that the fast commit ineligible operation
+ *   contained within ext4_fc_start_ineligible() and ext4_fc_stop_ineligible()
+ *   is followed by at least 1 full commit.
+ *
+ * Atomicity of commits
+ * --------------------
+ * In order to guarantee atomicity during the commit operation, fast commit
+ * uses the "EXT4_FC_TAG_TAIL" tag that marks a fast commit as complete. The
+ * tail tag contains the CRC of the contents and the TID of the transaction
+ * after which this fast commit should be applied. Recovery code replays fast
+ * commit logs only if there's at least 1 valid tail present. For every fast
+ * commit operation, there is 1 tail. This means, we may end up with multiple
+ * tails in the fast commit space. Here's an example:
+ *
+ * - Create a new file A and remove existing file B
+ * - fsync()
+ * - Append contents to file A
+ * - Truncate file A
+ * - fsync()
+ *
+ * The fast commit space at the end of above operations would look like this:
+ *      [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL]
+ *             |<--- Fast Commit 1 --->|<--- Fast Commit 2 ---->|
+ *
+ * Replay code should thus check for all the valid tails in the FC area.
+ *
+ * TODOs
+ * -----
+ * 1) Make fast commit atomic updates more fine grained. Today, a fast commit
+ *    eligible update must be protected within ext4_fc_start_update() and
+ *    ext4_fc_stop_update(). These routines are called at a much higher
+ *    level than where the update actually happens. This can be made more
+ *    fine grained by combining with ext4_journal_start().
+ *
+ * 2) Same as above for ext4_fc_start_ineligible() and
+ *    ext4_fc_stop_ineligible().
+ *
+ * 3) Handle more ineligible cases.
+ */
+
+#include <trace/events/ext4.h>
+static struct kmem_cache *ext4_fc_dentry_cachep;
+
+static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+	BUFFER_TRACE(bh, "");
+	if (uptodate) {
+		ext4_debug("%s: Block %lld up-to-date",
+			   __func__, bh->b_blocknr);
+		set_buffer_uptodate(bh);
+	} else {
+		ext4_debug("%s: Block %lld not up-to-date",
+			   __func__, bh->b_blocknr);
+		clear_buffer_uptodate(bh);
+	}
+
+	unlock_buffer(bh);
+}
+
+static inline void ext4_fc_reset_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	ei->i_fc_lblk_start = 0;
+	ei->i_fc_lblk_len = 0;
+}
+
+void ext4_fc_init_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	ext4_fc_reset_inode(inode);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	INIT_LIST_HEAD(&ei->i_fc_list);
+	init_waitqueue_head(&ei->i_fc_wait);
+	atomic_set(&ei->i_fc_updates, 0);
+	ei->i_fc_committed_subtid = 0;
+}
+
+/*
+ * Inform Ext4's fast commits about the start of an inode update
+ *
+ * This function is called by the high-level VFS callbacks before
+ * performing any inode update. This function blocks if there's an ongoing
+ * fast commit on the inode in question.
+ */
+void ext4_fc_start_update(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
+restart:
+	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+	if (list_empty(&ei->i_fc_list))
+		goto out;
+
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+		wait_queue_head_t *wq;
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		schedule();
+		finish_wait(wq, &wait.wq_entry);
+		goto restart;
+	}
+out:
+	atomic_inc(&ei->i_fc_updates);
+	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+}
+
+/*
+ * Stop inode update and wake up waiting fast commits if any.
+ */
+void ext4_fc_stop_update(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
+	if (atomic_dec_and_test(&ei->i_fc_updates))
+		wake_up_all(&ei->i_fc_wait);
+}
+
+/*
+ * Remove inode from fast commit list. If the inode is being committed
+ * we wait until inode commit is done.
+ */
+void ext4_fc_del(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
+restart:
+	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+	if (list_empty(&ei->i_fc_list)) {
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		return;
+	}
+
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+		wait_queue_head_t *wq;
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		schedule();
+		finish_wait(wq, &wait.wq_entry);
+		goto restart;
+	}
+	if (!list_empty(&ei->i_fc_list))
+		list_del_init(&ei->i_fc_list);
+	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+}
+
+/*
+ * Mark file system as fast commit ineligible. This means that the next
+ * commit operation would result in a full jbd2 commit.
+ */
+void ext4_fc_mark_ineligible(struct super_block *sb, int reason)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
+	sbi->s_mount_state |= EXT4_FC_INELIGIBLE;
+	WARN_ON(reason >= EXT4_FC_REASON_MAX);
+	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
+}
+
+/*
+ * Start a fast commit ineligible update. Any commits that happen while
+ * such an operation is in progress fall back to full commits.
+ */
+void ext4_fc_start_ineligible(struct super_block *sb, int reason)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
+	WARN_ON(reason >= EXT4_FC_REASON_MAX);
+	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
+	atomic_inc(&sbi->s_fc_ineligible_updates);
+}
+
+/*
+ * Stop a fast commit ineligible update. We set EXT4_FC_INELIGIBLE flag here
+ * to ensure that after stopping the ineligible update, at least one full
+ * commit takes place.
+ */
+void ext4_fc_stop_ineligible(struct super_block *sb)
+{
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
+	EXT4_SB(sb)->s_mount_state |= EXT4_FC_INELIGIBLE;
+	atomic_dec(&EXT4_SB(sb)->s_fc_ineligible_updates);
+}
+
+static inline int ext4_fc_is_ineligible(struct super_block *sb)
+{
+	return (EXT4_SB(sb)->s_mount_state & EXT4_FC_INELIGIBLE) ||
+		atomic_read(&EXT4_SB(sb)->s_fc_ineligible_updates);
+}
+
+/*
+ * Generic fast commit tracking function. If this is the first time we are
+ * called after a full commit, we initialize fast commit fields and then call
+ * __fc_track_fn() with update = 0. If we have already been called after a
+ * full commit, we pass update = 1. Based on that, the track function can
+ * determine if it needs to track a field for the first time or if it needs
+ * to just update the previously tracked value.
+ *
+ * If enqueue is set, this function enqueues the inode in fast commit list.
+ */
+static int ext4_fc_track_template(
+	struct inode *inode, int (*__fc_track_fn)(struct inode *, void *, bool),
+	void *args, int enqueue)
+{
+	tid_t running_txn_tid;
+	bool update = false;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	int ret;
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (sbi->s_mount_state & EXT4_FC_REPLAY))
+		return -EOPNOTSUPP;
+
+	if (ext4_fc_is_ineligible(inode->i_sb))
+		return -EINVAL;
+
+	running_txn_tid = sbi->s_journal ?
+		sbi->s_journal->j_commit_sequence + 1 : 0;
+
+	mutex_lock(&ei->i_fc_lock);
+	if (running_txn_tid == ei->i_sync_tid) {
+		update = true;
+	} else {
+		ext4_fc_reset_inode(inode);
+		ei->i_sync_tid = running_txn_tid;
+	}
+	ret = __fc_track_fn(inode, args, update);
+	mutex_unlock(&ei->i_fc_lock);
+
+	if (!enqueue)
+		return ret;
+
+	spin_lock(&sbi->s_fc_lock);
+	if (list_empty(&EXT4_I(inode)->i_fc_list))
+		list_add_tail(&EXT4_I(inode)->i_fc_list,
+				(sbi->s_mount_state & EXT4_FC_COMMITTING) ?
+				&sbi->s_fc_q[FC_Q_STAGING] :
+				&sbi->s_fc_q[FC_Q_MAIN]);
+	spin_unlock(&sbi->s_fc_lock);
+
+	return ret;
+}
+
+struct __track_dentry_update_args {
+	struct dentry *dentry;
+	int op;
+};
+
+/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
+static int __track_dentry_update(struct inode *inode, void *arg, bool update)
+{
+	struct ext4_fc_dentry_update *node;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct __track_dentry_update_args *dentry_update =
+		(struct __track_dentry_update_args *)arg;
+	struct dentry *dentry = dentry_update->dentry;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	mutex_unlock(&ei->i_fc_lock);
+	node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
+	if (!node) {
+		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_MEM);
+		mutex_lock(&ei->i_fc_lock);
+		return -ENOMEM;
+	}
+
+	node->fcd_op = dentry_update->op;
+	node->fcd_parent = dentry->d_parent->d_inode->i_ino;
+	node->fcd_ino = inode->i_ino;
+	if (dentry->d_name.len > DNAME_INLINE_LEN) {
+		node->fcd_name.name = kmalloc(dentry->d_name.len, GFP_NOFS);
+		if (!node->fcd_name.name) {
+			kmem_cache_free(ext4_fc_dentry_cachep, node);
+			ext4_fc_mark_ineligible(inode->i_sb,
+						EXT4_FC_REASON_MEM);
+			mutex_lock(&ei->i_fc_lock);
+			return -ENOMEM;
+		}
+		memcpy((u8 *)node->fcd_name.name, dentry->d_name.name,
+			dentry->d_name.len);
+	} else {
+		memcpy(node->fcd_iname, dentry->d_name.name,
+			dentry->d_name.len);
+		node->fcd_name.name = node->fcd_iname;
+	}
+	node->fcd_name.len = dentry->d_name.len;
+
+	spin_lock(&sbi->s_fc_lock);
+	if (sbi->s_mount_state & EXT4_FC_COMMITTING)
+		list_add_tail(&node->fcd_list,
+				&sbi->s_fc_dentry_q[FC_Q_STAGING]);
+	else
+		list_add_tail(&node->fcd_list, &sbi->s_fc_dentry_q[FC_Q_MAIN]);
+	spin_unlock(&sbi->s_fc_lock);
+	mutex_lock(&ei->i_fc_lock);
+
+	return 0;
+}
+
+void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_UNLINK;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_unlink(inode, dentry, ret);
+}
+
+void ext4_fc_track_link(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_LINK;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_link(inode, dentry, ret);
+}
+
+void ext4_fc_track_create(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_CREAT;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_create(inode, dentry, ret);
+}
+
+/* __track_fn for inode tracking */
+static int __track_inode(struct inode *inode, void *arg, bool update)
+{
+	if (update)
+		return -EEXIST;
+
+	EXT4_I(inode)->i_fc_lblk_len = 0;
+
+	return 0;
+}
+
+void ext4_fc_track_inode(struct inode *inode)
+{
+	int ret;
+
+	if (S_ISDIR(inode->i_mode))
+		return;
+
+	ret = ext4_fc_track_template(inode, __track_inode, NULL, 1);
+	trace_ext4_fc_track_inode(inode, ret);
+}
+
+struct __track_range_args {
+	ext4_lblk_t start, end;
+};
+
+/* __track_fn for tracking data updates */
+static int __track_range(struct inode *inode, void *arg, bool update)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t oldstart;
+	struct __track_range_args *__arg =
+		(struct __track_range_args *)arg;
+
+	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) {
+		ext4_debug("Special inode %ld being modified\n", inode->i_ino);
+		return -ECANCELED;
+	}
+
+	oldstart = ei->i_fc_lblk_start;
+
+	if (update && ei->i_fc_lblk_len > 0) {
+		ei->i_fc_lblk_start = min(ei->i_fc_lblk_start, __arg->start);
+		ei->i_fc_lblk_len =
+			max(oldstart + ei->i_fc_lblk_len - 1, __arg->end) -
+				ei->i_fc_lblk_start + 1;
+	} else {
+		ei->i_fc_lblk_start = __arg->start;
+		ei->i_fc_lblk_len = __arg->end - __arg->start + 1;
+	}
+
+	return 0;
+}
+
+void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
+			 ext4_lblk_t end)
+{
+	struct __track_range_args args;
+	int ret;
+
+	if (S_ISDIR(inode->i_mode))
+		return;
+
+	args.start = start;
+	args.end = end;
+
+	ret = ext4_fc_track_template(inode, __track_range, &args, 1);
+
+	trace_ext4_fc_track_range(inode, start, end, ret);
+}
+
+static void ext4_fc_submit_bh(struct super_block *sb)
+{
+	int write_flags = REQ_SYNC;
+	struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
+
+	if (test_opt(sb, BARRIER))
+		write_flags |= REQ_FUA | REQ_PREFLUSH;
+	lock_buffer(bh);
+	clear_buffer_dirty(bh);
+	set_buffer_uptodate(bh);
+	bh->b_end_io = ext4_end_buffer_io_sync;
+	submit_bh(REQ_OP_WRITE, write_flags, bh);
+	EXT4_SB(sb)->s_fc_bh = NULL;
+}
+
+/* Ext4 commit path routines */
+
+/* memzero and update CRC */
+static void *ext4_fc_memzero(struct super_block *sb, void *dst, int len,
+				u32 *crc)
+{
+	void *ret;
+
+	ret = memset(dst, 0, len);
+	if (crc)
+		*crc = ext4_chksum(EXT4_SB(sb), *crc, dst, len);
+	return ret;
+}
+
+/*
+ * Allocate len bytes on a fast commit buffer.
+ *
+ * During the commit time this function is used to manage fast commit
+ * block space. We don't split a fast commit log onto different
+ * blocks. So this function makes sure that if there's not enough space
+ * on the current block, the remaining space in the current block is
+ * marked as unused by adding EXT4_FC_TAG_PAD tag. In that case, a new
+ * block is requested from jbd2 and the CRC is updated to reflect the
+ * padding we added.
+ */
+static u8 *ext4_fc_reserve_space(struct super_block *sb, int len, u32 *crc)
+{
+	struct ext4_fc_tl *tl;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct buffer_head *bh;
+	int bsize = sbi->s_journal->j_blocksize;
+	int ret, off = sbi->s_fc_bytes % bsize;
+	int pad_len;
+
+	/*
+	 * After allocating len, we should have space at least for a 0 byte
+	 * padding.
+	 */
+	if (len + sizeof(struct ext4_fc_tl) > bsize)
+		return NULL;
+
+	if (bsize - off - 1 > len + sizeof(struct ext4_fc_tl)) {
+		/*
+		 * Only allocate from current buffer if we have enough space
+		 * for this request AND we have space to add a zero byte
+		 * padding.
595 + */ 596 + if (!sbi->s_fc_bh) { 597 + ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh); 598 + if (ret) 599 + return NULL; 600 + sbi->s_fc_bh = bh; 601 + } 602 + sbi->s_fc_bytes += len; 603 + return sbi->s_fc_bh->b_data + off; 604 + } 605 + /* Need to add PAD tag */ 606 + tl = (struct ext4_fc_tl *)(sbi->s_fc_bh->b_data + off); 607 + tl->fc_tag = cpu_to_le16(EXT4_FC_TAG_PAD); 608 + pad_len = bsize - off - 1 - sizeof(struct ext4_fc_tl); 609 + tl->fc_len = cpu_to_le16(pad_len); 610 + if (crc) 611 + *crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl)); 612 + if (pad_len > 0) 613 + ext4_fc_memzero(sb, tl + 1, pad_len, crc); 614 + ext4_fc_submit_bh(sb); 615 + 616 + ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh); 617 + if (ret) 618 + return NULL; 619 + sbi->s_fc_bh = bh; 620 + sbi->s_fc_bytes = (sbi->s_fc_bytes / bsize + 1) * bsize + len; 621 + return sbi->s_fc_bh->b_data; 622 + } 623 + 624 + /* memcpy to fc reserved space and update CRC */ 625 + static void *ext4_fc_memcpy(struct super_block *sb, void *dst, const void *src, 626 + int len, u32 *crc) 627 + { 628 + if (crc) 629 + *crc = ext4_chksum(EXT4_SB(sb), *crc, src, len); 630 + return memcpy(dst, src, len); 631 + } 632 + 633 + /* 634 + * Complete a fast commit by writing tail tag. 635 + * 636 + * Writing tail tag marks the end of a fast commit. In order to guarantee 637 + * atomicity, after writing tail tag, even if there's space remaining 638 + * in the block, next commit shouldn't use it. That's why tail tag 639 + * has the length as that of the remaining space on the block. 640 + */ 641 + static int ext4_fc_write_tail(struct super_block *sb, u32 crc) 642 + { 643 + struct ext4_sb_info *sbi = EXT4_SB(sb); 644 + struct ext4_fc_tl tl; 645 + struct ext4_fc_tail tail; 646 + int off, bsize = sbi->s_journal->j_blocksize; 647 + u8 *dst; 648 + 649 + /* 650 + * ext4_fc_reserve_space takes care of allocating an extra block if 651 + * there's no enough space on this block for accommodating this tail. 
652 + */ 653 + dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc); 654 + if (!dst) 655 + return -ENOSPC; 656 + 657 + off = sbi->s_fc_bytes % bsize; 658 + 659 + tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL); 660 + tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail)); 661 + sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize); 662 + 663 + ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc); 664 + dst += sizeof(tl); 665 + tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid); 666 + ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc); 667 + dst += sizeof(tail.fc_tid); 668 + tail.fc_crc = cpu_to_le32(crc); 669 + ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL); 670 + 671 + ext4_fc_submit_bh(sb); 672 + 673 + return 0; 674 + } 675 + 676 + /* 677 + * Adds tag, length, value and updates CRC. Returns true if tlv was added. 678 + * Returns false if there's not enough space. 679 + */ 680 + static bool ext4_fc_add_tlv(struct super_block *sb, u16 tag, u16 len, u8 *val, 681 + u32 *crc) 682 + { 683 + struct ext4_fc_tl tl; 684 + u8 *dst; 685 + 686 + dst = ext4_fc_reserve_space(sb, sizeof(tl) + len, crc); 687 + if (!dst) 688 + return false; 689 + 690 + tl.fc_tag = cpu_to_le16(tag); 691 + tl.fc_len = cpu_to_le16(len); 692 + 693 + ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc); 694 + ext4_fc_memcpy(sb, dst + sizeof(tl), val, len, crc); 695 + 696 + return true; 697 + } 698 + 699 + /* Same as above, but adds dentry tlv. 
 */
static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u16 tag,
				   int parent_ino, int ino, int dlen,
				   const unsigned char *dname,
				   u32 *crc)
{
	struct ext4_fc_dentry_info fcd;
	struct ext4_fc_tl tl;
	u8 *dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(fcd) + dlen,
					crc);

	if (!dst)
		return false;

	fcd.fc_parent_ino = cpu_to_le32(parent_ino);
	fcd.fc_ino = cpu_to_le32(ino);
	tl.fc_tag = cpu_to_le16(tag);
	tl.fc_len = cpu_to_le16(sizeof(fcd) + dlen);
	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
	dst += sizeof(tl);
	ext4_fc_memcpy(sb, dst, &fcd, sizeof(fcd), crc);
	dst += sizeof(fcd);
	ext4_fc_memcpy(sb, dst, dname, dlen, crc);
	dst += dlen;

	return true;
}

/*
 * Writes the inode in the fast commit space under a TLV with tag @tag.
 * Returns 0 on success, error on failure.
 */
static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
{
	struct ext4_inode_info *ei = EXT4_I(inode);
	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
	int ret;
	struct ext4_iloc iloc;
	struct ext4_fc_inode fc_inode;
	struct ext4_fc_tl tl;
	u8 *dst;

	ret = ext4_get_inode_loc(inode, &iloc);
	if (ret)
		return ret;

	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
		inode_len += ei->i_extra_isize;

	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
	tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));

	dst = ext4_fc_reserve_space(inode->i_sb,
			sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
	if (!dst)
		return -ECANCELED;

	if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
		return -ECANCELED;
	dst += sizeof(tl);
	if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
		return
			-ECANCELED;
	dst += sizeof(fc_inode);
	if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
			    inode_len, crc))
		return -ECANCELED;

	return 0;
}

/*
 * Writes updated data ranges for the inode in question. Updates CRC.
 * Returns 0 on success, error otherwise.
 */
static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
{
	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
	struct ext4_inode_info *ei = EXT4_I(inode);
	struct ext4_map_blocks map;
	struct ext4_fc_add_range fc_ext;
	struct ext4_fc_del_range lrange;
	struct ext4_extent *ex;
	int ret;

	mutex_lock(&ei->i_fc_lock);
	if (ei->i_fc_lblk_len == 0) {
		mutex_unlock(&ei->i_fc_lock);
		return 0;
	}
	old_blk_size = ei->i_fc_lblk_start;
	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
	ei->i_fc_lblk_len = 0;
	mutex_unlock(&ei->i_fc_lock);

	cur_lblk_off = old_blk_size;
	jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
		  __func__, cur_lblk_off, new_blk_size, inode->i_ino);

	while (cur_lblk_off <= new_blk_size) {
		map.m_lblk = cur_lblk_off;
		map.m_len = new_blk_size - cur_lblk_off + 1;
		ret = ext4_map_blocks(NULL, inode, &map, 0);
		if (ret < 0)
			return -ECANCELED;

		if (map.m_len == 0) {
			cur_lblk_off++;
			continue;
		}

		if (ret == 0) {
			lrange.fc_ino = cpu_to_le32(inode->i_ino);
			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
			lrange.fc_len = cpu_to_le32(map.m_len);
			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
					     sizeof(lrange), (u8 *)&lrange, crc))
				return -ENOSPC;
		} else {
			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
			ex = (struct ext4_extent *)&fc_ext.fc_ex;
			ex->ee_block = cpu_to_le32(map.m_lblk);
			ex->ee_len = cpu_to_le16(map.m_len);
			ext4_ext_store_pblock(ex, map.m_pblk);
			if (map.m_flags & EXT4_MAP_UNWRITTEN)
				ext4_ext_mark_unwritten(ex);
			else
				ext4_ext_mark_initialized(ex);
			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
					     sizeof(fc_ext), (u8 *)&fc_ext, crc))
				return -ENOSPC;
		}

		cur_lblk_off += map.m_len;
	}

	return 0;
}


/* Submit data for all the fast commit inodes */
static int ext4_fc_submit_inode_data_all(journal_t *journal)
{
	struct super_block *sb = (struct super_block *)(journal->j_private);
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_inode_info *ei;
	struct list_head *pos;
	int ret = 0;

	spin_lock(&sbi->s_fc_lock);
	sbi->s_mount_state |= EXT4_FC_COMMITTING;
	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
		ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
		ext4_set_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING);
		while (atomic_read(&ei->i_fc_updates)) {
			DEFINE_WAIT(wait);

			prepare_to_wait(&ei->i_fc_wait, &wait,
					TASK_UNINTERRUPTIBLE);
			if (atomic_read(&ei->i_fc_updates)) {
				spin_unlock(&sbi->s_fc_lock);
				schedule();
				spin_lock(&sbi->s_fc_lock);
			}
			finish_wait(&ei->i_fc_wait, &wait);
		}
		spin_unlock(&sbi->s_fc_lock);
		ret = jbd2_submit_inode_data(ei->jinode);
		if (ret)
			return ret;
		spin_lock(&sbi->s_fc_lock);
	}
	spin_unlock(&sbi->s_fc_lock);

	return ret;
}

/* Wait for completion of data for all the fast commit inodes */
static int ext4_fc_wait_inode_data_all(journal_t *journal)
{
	struct super_block *sb = (struct super_block *)(journal->j_private);
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_inode_info *pos, *n;
	int ret = 0;

	spin_lock(&sbi->s_fc_lock);
	list_for_each_entry_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN],
				 i_fc_list) {
		if (!ext4_test_inode_state(&pos->vfs_inode,
					   EXT4_STATE_FC_COMMITTING))
			continue;
		spin_unlock(&sbi->s_fc_lock);

		ret = jbd2_wait_inode_data(journal, pos->jinode);
		if (ret)
			return ret;
		spin_lock(&sbi->s_fc_lock);
	}
	spin_unlock(&sbi->s_fc_lock);

	return 0;
}

/* Commit all the directory entry updates */
static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
{
	struct super_block *sb = (struct super_block *)(journal->j_private);
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_fc_dentry_update *fc_dentry;
	struct inode *inode;
	struct list_head *pos, *n, *fcd_pos, *fcd_n;
	struct ext4_inode_info *ei;
	int ret;

	if (list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN]))
		return 0;
	list_for_each_safe(fcd_pos, fcd_n, &sbi->s_fc_dentry_q[FC_Q_MAIN]) {
		fc_dentry = list_entry(fcd_pos, struct ext4_fc_dentry_update,
				       fcd_list);
		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT) {
			spin_unlock(&sbi->s_fc_lock);
			if (!ext4_fc_add_dentry_tlv(
				    sb, fc_dentry->fcd_op,
				    fc_dentry->fcd_parent, fc_dentry->fcd_ino,
				    fc_dentry->fcd_name.len,
				    fc_dentry->fcd_name.name, crc)) {
				ret = -ENOSPC;
				goto lock_and_exit;
			}
			spin_lock(&sbi->s_fc_lock);
			continue;
		}

		inode = NULL;
		list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
			ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
			if (ei->vfs_inode.i_ino == fc_dentry->fcd_ino) {
				inode = &ei->vfs_inode;
				break;
			}
		}
		/*
		 * If we don't find the inode in our list, then it was deleted,
		 * in which case we don't need to record its create tag.
		 */
		if (!inode)
			continue;
		spin_unlock(&sbi->s_fc_lock);

		/*
		 * We first write the inode and then the create dirent.
		 * This allows the recovery code to create an unnamed inode
		 * first and then link it to a directory entry. This allows us
		 * to use namei.c routines almost as is and simplifies
		 * the recovery code.
		 */
		ret = ext4_fc_write_inode(inode, crc);
		if (ret)
			goto lock_and_exit;

		ret = ext4_fc_write_inode_data(inode, crc);
		if (ret)
			goto lock_and_exit;

		if (!ext4_fc_add_dentry_tlv(
			    sb, fc_dentry->fcd_op,
			    fc_dentry->fcd_parent, fc_dentry->fcd_ino,
			    fc_dentry->fcd_name.len,
			    fc_dentry->fcd_name.name, crc)) {
			spin_lock(&sbi->s_fc_lock);
			ret = -ENOSPC;
			goto lock_and_exit;
		}

		spin_lock(&sbi->s_fc_lock);
	}
	return 0;
lock_and_exit:
	spin_lock(&sbi->s_fc_lock);
	return ret;
}

static int ext4_fc_perform_commit(journal_t *journal)
{
	struct super_block *sb = (struct super_block *)(journal->j_private);
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_inode_info *iter;
	struct ext4_fc_head head;
	struct list_head *pos;
	struct inode *inode;
	struct blk_plug plug;
	int ret = 0;
	u32 crc = 0;

	ret = ext4_fc_submit_inode_data_all(journal);
	if (ret)
		return ret;

	ret = ext4_fc_wait_inode_data_all(journal);
	if (ret)
		return ret;

	blk_start_plug(&plug);
	if (sbi->s_fc_bytes == 0) {
		/*
		 * Add a head tag only if this is the first fast commit
		 * in this TID.
		 */
		head.fc_features = cpu_to_le32(EXT4_FC_SUPPORTED_FEATURES);
		head.fc_tid = cpu_to_le32(
			sbi->s_journal->j_running_transaction->t_tid);
		if (!ext4_fc_add_tlv(sb, EXT4_FC_TAG_HEAD, sizeof(head),
				     (u8 *)&head, &crc))
			goto out;
	}

	spin_lock(&sbi->s_fc_lock);
	ret = ext4_fc_commit_dentry_updates(journal, &crc);
	if (ret) {
		spin_unlock(&sbi->s_fc_lock);
		goto out;
	}

	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
		inode = &iter->vfs_inode;
		if (!ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
			continue;

		spin_unlock(&sbi->s_fc_lock);
		ret = ext4_fc_write_inode_data(inode, &crc);
		if (ret)
			goto out;
		ret = ext4_fc_write_inode(inode, &crc);
		if (ret)
			goto out;
		spin_lock(&sbi->s_fc_lock);
		EXT4_I(inode)->i_fc_committed_subtid =
			atomic_read(&sbi->s_fc_subtid);
	}
	spin_unlock(&sbi->s_fc_lock);

	ret = ext4_fc_write_tail(sb, crc);

out:
	blk_finish_plug(&plug);
	return ret;
}

/*
 * The main commit entry point. Performs a fast commit for transaction
 * commit_tid if needed. If it's not possible to perform a fast commit
 * due to various reasons, we fall back to a full commit. Returns 0
 * on success, error otherwise.
 */
int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
{
	struct super_block *sb = (struct super_block *)(journal->j_private);
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	int nblks = 0, ret, bsize = journal->j_blocksize;
	int subtid = atomic_read(&sbi->s_fc_subtid);
	int reason = EXT4_FC_REASON_OK, fc_bufs_before = 0;
	ktime_t start_time, commit_time;

	trace_ext4_fc_commit_start(sb);

	start_time = ktime_get();

	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
	    (ext4_fc_is_ineligible(sb))) {
		reason = EXT4_FC_REASON_INELIGIBLE;
		goto out;
	}

restart_fc:
	ret = jbd2_fc_begin_commit(journal, commit_tid);
	if (ret == -EALREADY) {
		/* There was an ongoing commit, check if we need to restart */
		if (atomic_read(&sbi->s_fc_subtid) <= subtid &&
		    commit_tid > journal->j_commit_sequence)
			goto restart_fc;
		reason = EXT4_FC_REASON_ALREADY_COMMITTED;
		goto out;
	} else if (ret) {
		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
		reason = EXT4_FC_REASON_FC_START_FAILED;
		goto out;
	}

	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
	ret = ext4_fc_perform_commit(journal);
	if (ret < 0) {
		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
		reason = EXT4_FC_REASON_FC_FAILED;
		goto out;
	}
	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
	ret = jbd2_fc_wait_bufs(journal, nblks);
	if (ret < 0) {
		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
		reason = EXT4_FC_REASON_FC_FAILED;
		goto out;
	}
	atomic_inc(&sbi->s_fc_subtid);
	jbd2_fc_end_commit(journal);
out:
	/* Has any ineligible update happened since we started?
	 */
	if (reason == EXT4_FC_REASON_OK && ext4_fc_is_ineligible(sb)) {
		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
		reason = EXT4_FC_REASON_INELIGIBLE;
	}

	spin_lock(&sbi->s_fc_lock);
	if (reason != EXT4_FC_REASON_OK &&
	    reason != EXT4_FC_REASON_ALREADY_COMMITTED) {
		sbi->s_fc_stats.fc_ineligible_commits++;
	} else {
		sbi->s_fc_stats.fc_num_commits++;
		sbi->s_fc_stats.fc_numblks += nblks;
	}
	spin_unlock(&sbi->s_fc_lock);
	nblks = (reason == EXT4_FC_REASON_OK) ? nblks : 0;
	trace_ext4_fc_commit_stop(sb, nblks, reason);
	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
	/*
	 * weight the average commit time higher than the latest commit time
	 * so we don't react too strongly to vast changes in the commit time
	 */
	if (likely(sbi->s_fc_avg_commit_time))
		sbi->s_fc_avg_commit_time = (commit_time +
					     sbi->s_fc_avg_commit_time * 3) / 4;
	else
		sbi->s_fc_avg_commit_time = commit_time;
	jbd_debug(1,
		  "Fast commit ended with blks = %d, reason = %d, subtid - %d",
		  nblks, reason, subtid);
	if (reason == EXT4_FC_REASON_FC_FAILED)
		return jbd2_fc_end_commit_fallback(journal, commit_tid);
	if (reason == EXT4_FC_REASON_FC_START_FAILED ||
	    reason == EXT4_FC_REASON_INELIGIBLE)
		return jbd2_complete_transaction(journal, commit_tid);
	return 0;
}

/*
 * Fast commit cleanup routine. This is called after every fast commit and
 * full commit. full is true if we are called after a full commit.
 */
static void ext4_fc_cleanup(journal_t *journal, int full)
{
	struct super_block *sb = journal->j_private;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_inode_info *iter;
	struct ext4_fc_dentry_update *fc_dentry;
	struct list_head *pos, *n;

	if (full && sbi->s_fc_bh)
		sbi->s_fc_bh = NULL;

	jbd2_fc_release_bufs(journal);

	spin_lock(&sbi->s_fc_lock);
	list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
		list_del_init(&iter->i_fc_list);
		ext4_clear_inode_state(&iter->vfs_inode,
				       EXT4_STATE_FC_COMMITTING);
		ext4_fc_reset_inode(&iter->vfs_inode);
		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
		smp_mb();
#if (BITS_PER_LONG < 64)
		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
#else
		wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
#endif
	}

	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
					     struct ext4_fc_dentry_update,
					     fcd_list);
		list_del_init(&fc_dentry->fcd_list);
		spin_unlock(&sbi->s_fc_lock);

		if (fc_dentry->fcd_name.name &&
		    fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
			kfree(fc_dentry->fcd_name.name);
		kmem_cache_free(ext4_fc_dentry_cachep, fc_dentry);
		spin_lock(&sbi->s_fc_lock);
	}

	list_splice_init(&sbi->s_fc_dentry_q[FC_Q_STAGING],
			 &sbi->s_fc_dentry_q[FC_Q_MAIN]);
	/* Splice the staging inode queue into the main queue, as above */
	list_splice_init(&sbi->s_fc_q[FC_Q_STAGING],
			 &sbi->s_fc_q[FC_Q_MAIN]);

	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
	sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE;

	if (full)
		sbi->s_fc_bytes = 0;
	spin_unlock(&sbi->s_fc_lock);
	trace_ext4_fc_stats(sb);
}

/* Ext4 Replay
   Path Routines */

/* Get length of a particular tlv */
static inline int ext4_fc_tag_len(struct ext4_fc_tl *tl)
{
	return le16_to_cpu(tl->fc_len);
}

/* Get a pointer to "value" of a tlv */
static inline u8 *ext4_fc_tag_val(struct ext4_fc_tl *tl)
{
	return (u8 *)tl + sizeof(*tl);
}

/* Helper struct for dentry replay routines */
struct dentry_info_args {
	int parent_ino, dname_len, ino, inode_len;
	char *dname;
};

static inline void tl_to_darg(struct dentry_info_args *darg,
			      struct ext4_fc_tl *tl)
{
	struct ext4_fc_dentry_info *fcd;

	fcd = (struct ext4_fc_dentry_info *)ext4_fc_tag_val(tl);

	darg->parent_ino = le32_to_cpu(fcd->fc_parent_ino);
	darg->ino = le32_to_cpu(fcd->fc_ino);
	darg->dname = fcd->fc_dname;
	darg->dname_len = ext4_fc_tag_len(tl) -
		sizeof(struct ext4_fc_dentry_info);
}

/* Unlink replay function */
static int ext4_fc_replay_unlink(struct super_block *sb, struct ext4_fc_tl *tl)
{
	struct inode *inode, *old_parent;
	struct qstr entry;
	struct dentry_info_args darg;
	int ret = 0;

	tl_to_darg(&darg, tl);

	trace_ext4_fc_replay(sb, EXT4_FC_TAG_UNLINK, darg.ino,
			     darg.parent_ino, darg.dname_len);

	entry.name = darg.dname;
	entry.len = darg.dname_len;
	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);

	if (IS_ERR_OR_NULL(inode)) {
		jbd_debug(1, "Inode %d not found", darg.ino);
		return 0;
	}

	old_parent = ext4_iget(sb, darg.parent_ino,
			       EXT4_IGET_NORMAL);
	if (IS_ERR_OR_NULL(old_parent)) {
		jbd_debug(1, "Dir with inode %d not found", darg.parent_ino);
		iput(inode);
		return 0;
	}

	ret = __ext4_unlink(old_parent, &entry, inode);
	/* -ENOENT is ok because
	   it might not exist anymore. */
	if (ret == -ENOENT)
		ret = 0;
	iput(old_parent);
	iput(inode);
	return ret;
}

static int ext4_fc_replay_link_internal(struct super_block *sb,
					struct dentry_info_args *darg,
					struct inode *inode)
{
	struct inode *dir = NULL;
	struct dentry *dentry_dir = NULL, *dentry_inode = NULL;
	struct qstr qstr_dname = QSTR_INIT(darg->dname, darg->dname_len);
	int ret = 0;

	dir = ext4_iget(sb, darg->parent_ino, EXT4_IGET_NORMAL);
	if (IS_ERR(dir)) {
		jbd_debug(1, "Dir with inode %d not found.", darg->parent_ino);
		dir = NULL;
		goto out;
	}

	dentry_dir = d_obtain_alias(dir);
	if (IS_ERR(dentry_dir)) {
		jbd_debug(1, "Failed to obtain dentry");
		dentry_dir = NULL;
		goto out;
	}

	dentry_inode = d_alloc(dentry_dir, &qstr_dname);
	if (!dentry_inode) {
		jbd_debug(1, "Inode dentry not created.");
		ret = -ENOMEM;
		goto out;
	}

	ret = __ext4_link(dir, inode, dentry_inode);
	/*
	 * It's possible that the link already existed since the data blocks
	 * for the dir in question got persisted before we crashed, OR
	 * we replayed this tag and crashed before the entire replay
	 * could complete.
	 */
	if (ret && ret != -EEXIST) {
		jbd_debug(1, "Failed to link\n");
		goto out;
	}

	ret = 0;
out:
	if (dentry_dir) {
		d_drop(dentry_dir);
		dput(dentry_dir);
	} else if (dir) {
		iput(dir);
	}
	if (dentry_inode) {
		d_drop(dentry_inode);
		dput(dentry_inode);
	}

	return ret;
}

/* Link replay function */
static int ext4_fc_replay_link(struct super_block *sb, struct ext4_fc_tl *tl)
{
	struct inode *inode;
	struct dentry_info_args darg;
	int ret = 0;

	tl_to_darg(&darg, tl);
	trace_ext4_fc_replay(sb, EXT4_FC_TAG_LINK, darg.ino,
			     darg.parent_ino, darg.dname_len);

	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
	if (IS_ERR_OR_NULL(inode)) {
		jbd_debug(1, "Inode not found.");
		return 0;
	}

	ret = ext4_fc_replay_link_internal(sb, &darg, inode);
	iput(inode);
	return ret;
}

/*
 * Record all the modified inodes during replay. We use this later to set up
 * block bitmaps correctly.
 */
static int ext4_fc_record_modified_inode(struct super_block *sb, int ino)
{
	struct ext4_fc_replay_state *state;
	int i;

	state = &EXT4_SB(sb)->s_fc_replay_state;
	for (i = 0; i < state->fc_modified_inodes_used; i++)
		if (state->fc_modified_inodes[i] == ino)
			return 0;
	if (state->fc_modified_inodes_used == state->fc_modified_inodes_size) {
		state->fc_modified_inodes_size +=
			EXT4_FC_REPLAY_REALLOC_INCREMENT;
		state->fc_modified_inodes = krealloc(
			state->fc_modified_inodes, sizeof(int) *
			state->fc_modified_inodes_size,
			GFP_KERNEL);
		if (!state->fc_modified_inodes)
			return -ENOMEM;
	}
	state->fc_modified_inodes[state->fc_modified_inodes_used++] = ino;
	return 0;
}

/*
 * Inode replay function
 */
static int ext4_fc_replay_inode(struct super_block *sb, struct ext4_fc_tl *tl)
{
	struct ext4_fc_inode *fc_inode;
	struct ext4_inode *raw_inode;
	struct ext4_inode *raw_fc_inode;
	struct inode *inode = NULL;
	struct ext4_iloc iloc;
	int inode_len, ino, ret, tag = le16_to_cpu(tl->fc_tag);
	struct ext4_extent_header *eh;

	fc_inode = (struct ext4_fc_inode *)ext4_fc_tag_val(tl);

	ino = le32_to_cpu(fc_inode->fc_ino);
	trace_ext4_fc_replay(sb, tag, ino, 0, 0);

	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
	if (!IS_ERR_OR_NULL(inode)) {
		ext4_ext_clear_bb(inode);
		iput(inode);
	}

	ext4_fc_record_modified_inode(sb, ino);

	raw_fc_inode = (struct ext4_inode *)fc_inode->fc_raw_inode;
	ret = ext4_get_fc_inode_loc(sb, ino, &iloc);
	if (ret)
		goto out;

	inode_len = ext4_fc_tag_len(tl) - sizeof(struct ext4_fc_inode);
	raw_inode = ext4_raw_inode(&iloc);

	memcpy(raw_inode, raw_fc_inode, offsetof(struct ext4_inode,
						 i_block));
	memcpy(&raw_inode->i_generation, &raw_fc_inode->i_generation,
	       inode_len - offsetof(struct ext4_inode, i_generation));
	if (le32_to_cpu(raw_inode->i_flags) & EXT4_EXTENTS_FL) {
		eh = (struct ext4_extent_header *)(&raw_inode->i_block[0]);
		if (eh->eh_magic != EXT4_EXT_MAGIC) {
			memset(eh, 0, sizeof(*eh));
			eh->eh_magic = EXT4_EXT_MAGIC;
			eh->eh_max = cpu_to_le16(
				(sizeof(raw_inode->i_block) -
				 sizeof(struct ext4_extent_header))
				/ sizeof(struct ext4_extent));
		}
	} else if (le32_to_cpu(raw_inode->i_flags) & EXT4_INLINE_DATA_FL) {
		memcpy(raw_inode->i_block, raw_fc_inode->i_block,
		       sizeof(raw_inode->i_block));
	}

	/* Immediately update the inode on disk. */
	ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh);
	if (ret)
		goto out;
	ret = sync_dirty_buffer(iloc.bh);
	if (ret)
		goto out;
	ret = ext4_mark_inode_used(sb, ino);
	if (ret)
		goto out;

	/* Given that we just wrote the inode on disk, this SHOULD succeed. */
	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
	if (IS_ERR_OR_NULL(inode)) {
		jbd_debug(1, "Inode not found.");
		return -EFSCORRUPTED;
	}

	/*
	 * Our allocator could have made different decisions than before
	 * crashing. This should be fixed but until then, we calculate
	 * the number of blocks used by the inode.
	 */
	ext4_ext_replay_set_iblocks(inode);

	inode->i_generation = le32_to_cpu(ext4_raw_inode(&iloc)->i_generation);
	ext4_reset_inode_seed(inode);

	ext4_inode_csum_set(inode, ext4_raw_inode(&iloc), EXT4_I(inode));
	ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh);
	sync_dirty_buffer(iloc.bh);
	brelse(iloc.bh);
out:
	iput(inode);
	if (!ret)
		blkdev_issue_flush(sb->s_bdev, GFP_KERNEL);

	return 0;
}

/*
 * Dentry create replay function.
 *
 * EXT4_FC_TAG_CREAT is preceded by EXT4_FC_TAG_INODE_FULL, which means the
 * inode for which we are trying to create a dentry here should already
 * have been replayed before we get here.
 */
static int ext4_fc_replay_create(struct super_block *sb, struct ext4_fc_tl *tl)
{
	int ret = 0;
	struct inode *inode = NULL;
	struct inode *dir = NULL;
	struct dentry_info_args darg;

	tl_to_darg(&darg, tl);

	trace_ext4_fc_replay(sb, EXT4_FC_TAG_CREAT, darg.ino,
			     darg.parent_ino, darg.dname_len);

	/* This takes care of updating the group descriptor and other metadata */
	ret = ext4_mark_inode_used(sb, darg.ino);
	if (ret)
		goto out;

	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
	if (IS_ERR_OR_NULL(inode)) {
		jbd_debug(1, "inode %d not found.", darg.ino);
		inode = NULL;
		ret = -EINVAL;
		goto out;
	}

	if (S_ISDIR(inode->i_mode)) {
		/*
		 * If we are creating a directory, we need to make sure that the
		 * dot and dot dot dirents are set up properly.
		 */
		dir = ext4_iget(sb, darg.parent_ino, EXT4_IGET_NORMAL);
		if (IS_ERR_OR_NULL(dir)) {
			jbd_debug(1, "Dir %d not found.", darg.ino);
			goto out;
		}
		ret = ext4_init_new_dir(NULL, dir, inode);
		iput(dir);
		if (ret) {
			ret = 0;
			goto out;
		}
	}
	ret = ext4_fc_replay_link_internal(sb, &darg, inode);
	if (ret)
		goto out;
	set_nlink(inode, 1);
	ext4_mark_inode_dirty(NULL, inode);
out:
	if (inode)
		iput(inode);
	return ret;
}

/*
 * Record physical disk regions which are in use by the fast commit area. Our
 * simple replay-phase allocator excludes these regions from allocation.
 */
static int ext4_fc_record_regions(struct super_block *sb, int ino,
				  ext4_lblk_t lblk, ext4_fsblk_t pblk, int len)
{
	struct ext4_fc_replay_state *state;
	struct ext4_fc_alloc_region *region;

	state = &EXT4_SB(sb)->s_fc_replay_state;
	if (state->fc_regions_used == state->fc_regions_size) {
		state->fc_regions_size +=
			EXT4_FC_REPLAY_REALLOC_INCREMENT;
		state->fc_regions = krealloc(
			state->fc_regions,
			state->fc_regions_size *
			sizeof(struct ext4_fc_alloc_region),
			GFP_KERNEL);
		if (!state->fc_regions)
			return -ENOMEM;
	}
	region = &state->fc_regions[state->fc_regions_used++];
	region->ino = ino;
	region->lblk = lblk;
	region->pblk = pblk;
	region->len = len;

	return 0;
}

/* Replay add range tag */
static int ext4_fc_replay_add_range(struct super_block *sb,
				    struct ext4_fc_tl *tl)
{
	struct ext4_fc_add_range *fc_add_ex;
	struct ext4_extent newex, *ex;
	struct inode *inode;
	ext4_lblk_t start, cur;
	int remaining, len;
	ext4_fsblk_t start_pblk;
	struct ext4_map_blocks map;
	struct
	       ext4_ext_path *path = NULL;
	int ret;

	fc_add_ex = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
	ex = (struct ext4_extent *)&fc_add_ex->fc_ex;

	trace_ext4_fc_replay(sb, EXT4_FC_TAG_ADD_RANGE,
			     le32_to_cpu(fc_add_ex->fc_ino),
			     le32_to_cpu(ex->ee_block),
			     ext4_ext_get_actual_len(ex));

	inode = ext4_iget(sb, le32_to_cpu(fc_add_ex->fc_ino),
			  EXT4_IGET_NORMAL);
	if (IS_ERR_OR_NULL(inode)) {
		jbd_debug(1, "Inode not found.");
		return 0;
	}

	ret = ext4_fc_record_modified_inode(sb, inode->i_ino);

	start = le32_to_cpu(ex->ee_block);
	start_pblk = ext4_ext_pblock(ex);
	len = ext4_ext_get_actual_len(ex);

	cur = start;
	remaining = len;
	jbd_debug(1, "ADD_RANGE, lblk %d, pblk %lld, len %d, unwritten %d, inode %ld\n",
		  start, start_pblk, len, ext4_ext_is_unwritten(ex),
		  inode->i_ino);

	while (remaining > 0) {
		map.m_lblk = cur;
		map.m_len = remaining;
		map.m_pblk = 0;
		ret = ext4_map_blocks(NULL, inode, &map, 0);

		if (ret < 0) {
			iput(inode);
			return 0;
		}

		if (ret == 0) {
			/* Range is not mapped */
			path = ext4_find_extent(inode, cur, NULL, 0);
			if (!path)
				continue;
			memset(&newex, 0, sizeof(newex));
			newex.ee_block = cpu_to_le32(cur);
			ext4_ext_store_pblock(
				&newex, start_pblk + cur - start);
			newex.ee_len = cpu_to_le16(map.m_len);
			if (ext4_ext_is_unwritten(ex))
				ext4_ext_mark_unwritten(&newex);
			down_write(&EXT4_I(inode)->i_data_sem);
			ret = ext4_ext_insert_extent(
				NULL, inode, &path, &newex, 0);
			up_write((&EXT4_I(inode)->i_data_sem));
			ext4_ext_drop_refs(path);
			kfree(path);
			if (ret) {
				iput(inode);
				return 0;
			}
			goto next;
		}

		if (start_pblk + cur - start != map.m_pblk)
		{
			/*
			 * Logical to physical mapping changed. This can happen
			 * if this range was removed and then reallocated to
			 * map to new physical blocks during a fast commit.
			 */
			ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
					ext4_ext_is_unwritten(ex),
					start_pblk + cur - start);
			if (ret) {
				iput(inode);
				return 0;
			}
			/*
			 * Mark the old blocks as free since they aren't used
			 * anymore. We maintain an array of all the modified
			 * inodes. In case these blocks are still used at either
			 * a different logical range in the same inode or in
			 * some different inode, we will mark them as allocated
			 * at the end of the FC replay using our array of
			 * modified inodes.
			 */
			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
			goto next;
		}

		/* Range is mapped and needs a state change */
		jbd_debug(1, "Converting from %d to %d %lld",
			  map.m_flags & EXT4_MAP_UNWRITTEN,
			  ext4_ext_is_unwritten(ex), map.m_pblk);
		ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
				ext4_ext_is_unwritten(ex), map.m_pblk);
		if (ret) {
			iput(inode);
			return 0;
		}
		/*
		 * We may have split the extent tree while toggling the state.
		 * Try to shrink the extent tree now.
		 */
		ext4_ext_replay_shrink_inode(inode, start + len);
next:
		cur += map.m_len;
		remaining -= map.m_len;
	}
	ext4_ext_replay_shrink_inode(inode, i_size_read(inode) >>
				     sb->s_blocksize_bits);
	iput(inode);
	return 0;
}

/* Replay DEL_RANGE tag */
static int
ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl)
{
	struct inode *inode;
	struct ext4_fc_del_range *lrange;
	struct ext4_map_blocks map;
	ext4_lblk_t cur, remaining;
	int ret;

	lrange = (struct ext4_fc_del_range *)ext4_fc_tag_val(tl);
	cur = le32_to_cpu(lrange->fc_lblk);
	remaining = le32_to_cpu(lrange->fc_len);

	trace_ext4_fc_replay(sb, EXT4_FC_TAG_DEL_RANGE,
			     le32_to_cpu(lrange->fc_ino), cur, remaining);

	inode = ext4_iget(sb, le32_to_cpu(lrange->fc_ino), EXT4_IGET_NORMAL);
	if (IS_ERR_OR_NULL(inode)) {
		jbd_debug(1, "Inode %d not found", le32_to_cpu(lrange->fc_ino));
		return 0;
	}

	ret = ext4_fc_record_modified_inode(sb, inode->i_ino);

	jbd_debug(1, "DEL_RANGE, inode %ld, lblk %d, len %d\n",
		  inode->i_ino, le32_to_cpu(lrange->fc_lblk),
		  le32_to_cpu(lrange->fc_len));
	while (remaining > 0) {
		map.m_lblk = cur;
		map.m_len = remaining;

		ret = ext4_map_blocks(NULL, inode, &map, 0);
		if (ret < 0) {
			iput(inode);
			return 0;
		}
		if (ret > 0) {
			remaining -= ret;
			cur += ret;
			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
		} else {
			remaining -= map.m_len;
			cur += map.m_len;
		}
	}

	ret = ext4_punch_hole(inode,
		le32_to_cpu(lrange->fc_lblk) << sb->s_blocksize_bits,
		le32_to_cpu(lrange->fc_len) << sb->s_blocksize_bits);
	if (ret)
		jbd_debug(1, "ext4_punch_hole returned %d", ret);
ext4_ext_replay_shrink_inode(inode, 1746 + i_size_read(inode) >> sb->s_blocksize_bits); 1747 + ext4_mark_inode_dirty(NULL, inode); 1748 + iput(inode); 1749 + 1750 + return 0; 1751 + } 1752 + 1753 + static inline const char *tag2str(u16 tag) 1754 + { 1755 + switch (tag) { 1756 + case EXT4_FC_TAG_LINK: 1757 + return "TAG_ADD_ENTRY"; 1758 + case EXT4_FC_TAG_UNLINK: 1759 + return "TAG_DEL_ENTRY"; 1760 + case EXT4_FC_TAG_ADD_RANGE: 1761 + return "TAG_ADD_RANGE"; 1762 + case EXT4_FC_TAG_CREAT: 1763 + return "TAG_CREAT_DENTRY"; 1764 + case EXT4_FC_TAG_DEL_RANGE: 1765 + return "TAG_DEL_RANGE"; 1766 + case EXT4_FC_TAG_INODE: 1767 + return "TAG_INODE"; 1768 + case EXT4_FC_TAG_PAD: 1769 + return "TAG_PAD"; 1770 + case EXT4_FC_TAG_TAIL: 1771 + return "TAG_TAIL"; 1772 + case EXT4_FC_TAG_HEAD: 1773 + return "TAG_HEAD"; 1774 + default: 1775 + return "TAG_ERROR"; 1776 + } 1777 + } 1778 + 1779 + static void ext4_fc_set_bitmaps_and_counters(struct super_block *sb) 1780 + { 1781 + struct ext4_fc_replay_state *state; 1782 + struct inode *inode; 1783 + struct ext4_ext_path *path = NULL; 1784 + struct ext4_map_blocks map; 1785 + int i, ret, j; 1786 + ext4_lblk_t cur, end; 1787 + 1788 + state = &EXT4_SB(sb)->s_fc_replay_state; 1789 + for (i = 0; i < state->fc_modified_inodes_used; i++) { 1790 + inode = ext4_iget(sb, state->fc_modified_inodes[i], 1791 + EXT4_IGET_NORMAL); 1792 + if (IS_ERR_OR_NULL(inode)) { 1793 + jbd_debug(1, "Inode %d not found.", 1794 + state->fc_modified_inodes[i]); 1795 + continue; 1796 + } 1797 + cur = 0; 1798 + end = EXT_MAX_BLOCKS; 1799 + while (cur < end) { 1800 + map.m_lblk = cur; 1801 + map.m_len = end - cur; 1802 + 1803 + ret = ext4_map_blocks(NULL, inode, &map, 0); 1804 + if (ret < 0) 1805 + break; 1806 + 1807 + if (ret > 0) { 1808 + path = ext4_find_extent(inode, map.m_lblk, NULL, 0); 1809 + if (!IS_ERR_OR_NULL(path)) { 1810 + for (j = 0; j < path->p_depth; j++) 1811 + ext4_mb_mark_bb(inode->i_sb, 1812 + path[j].p_block, 1, 1); 1813 + 
ext4_ext_drop_refs(path); 1814 + kfree(path); 1815 + } 1816 + cur += ret; 1817 + ext4_mb_mark_bb(inode->i_sb, map.m_pblk, 1818 + map.m_len, 1); 1819 + } else { 1820 + cur = cur + (map.m_len ? map.m_len : 1); 1821 + } 1822 + } 1823 + iput(inode); 1824 + } 1825 + } 1826 + 1827 + /* 1828 + * Check if block is in excluded regions for block allocation. The simple 1829 + * allocator that runs during the replay phase calls this function to see 1830 + * if it is okay to use a block. 1831 + */ 1832 + bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t blk) 1833 + { 1834 + int i; 1835 + struct ext4_fc_replay_state *state; 1836 + 1837 + state = &EXT4_SB(sb)->s_fc_replay_state; 1838 + for (i = 0; i < state->fc_regions_valid; i++) { 1839 + if (state->fc_regions[i].ino == 0 || 1840 + state->fc_regions[i].len == 0) 1841 + continue; 1842 + if (blk >= state->fc_regions[i].pblk && 1843 + blk < state->fc_regions[i].pblk + state->fc_regions[i].len) 1844 + return true; 1845 + } 1846 + return false; 1847 + } 1848 + 1849 + /* Cleanup function called after replay */ 1850 + void ext4_fc_replay_cleanup(struct super_block *sb) 1851 + { 1852 + struct ext4_sb_info *sbi = EXT4_SB(sb); 1853 + 1854 + sbi->s_mount_state &= ~EXT4_FC_REPLAY; 1855 + kfree(sbi->s_fc_replay_state.fc_regions); 1856 + kfree(sbi->s_fc_replay_state.fc_modified_inodes); 1857 + } 1858 + 1859 + /* 1860 + * Recovery Scan phase handler 1861 + * 1862 + * This function is called during the scan phase and is responsible 1863 + * for doing the following things: 1864 + * - Make sure the fast commit area has valid tags for replay 1865 + * - Count number of tags that need to be replayed by the replay handler 1866 + * - Verify CRC 1867 + * - Create a list of excluded blocks for allocation during replay phase 1868 + * 1869 + * This function returns JBD2_FC_REPLAY_CONTINUE to indicate that SCAN is 1870 + * incomplete and JBD2 should send more blocks.
It returns JBD2_FC_REPLAY_STOP 1871 + * to indicate that scan has finished and JBD2 can now start replay phase. 1872 + * It returns a negative error to indicate that there was an error. At the end 1873 + * of a successful scan phase, sbi->s_fc_replay_state.fc_replay_num_tags is set 1874 + * to indicate the number of tags that need to be replayed during the replay phase. 1875 + */ 1876 + static int ext4_fc_replay_scan(journal_t *journal, 1877 + struct buffer_head *bh, int off, 1878 + tid_t expected_tid) 1879 + { 1880 + struct super_block *sb = journal->j_private; 1881 + struct ext4_sb_info *sbi = EXT4_SB(sb); 1882 + struct ext4_fc_replay_state *state; 1883 + int ret = JBD2_FC_REPLAY_CONTINUE; 1884 + struct ext4_fc_add_range *ext; 1885 + struct ext4_fc_tl *tl; 1886 + struct ext4_fc_tail *tail; 1887 + __u8 *start, *end; 1888 + struct ext4_fc_head *head; 1889 + struct ext4_extent *ex; 1890 + 1891 + state = &sbi->s_fc_replay_state; 1892 + 1893 + start = (u8 *)bh->b_data; 1894 + end = (__u8 *)bh->b_data + journal->j_blocksize - 1; 1895 + 1896 + if (state->fc_replay_expected_off == 0) { 1897 + state->fc_cur_tag = 0; 1898 + state->fc_replay_num_tags = 0; 1899 + state->fc_crc = 0; 1900 + state->fc_regions = NULL; 1901 + state->fc_regions_valid = state->fc_regions_used = 1902 + state->fc_regions_size = 0; 1903 + /* Check if we can stop early */ 1904 + if (le16_to_cpu(((struct ext4_fc_tl *)start)->fc_tag) 1905 + != EXT4_FC_TAG_HEAD) 1906 + return 0; 1907 + } 1908 + 1909 + if (off != state->fc_replay_expected_off) { 1910 + ret = -EFSCORRUPTED; 1911 + goto out_err; 1912 + } 1913 + 1914 + state->fc_replay_expected_off++; 1915 + fc_for_each_tl(start, end, tl) { 1916 + jbd_debug(3, "Scan phase, tag:%s, blk %lld\n", 1917 + tag2str(le16_to_cpu(tl->fc_tag)), bh->b_blocknr); 1918 + switch (le16_to_cpu(tl->fc_tag)) { 1919 + case EXT4_FC_TAG_ADD_RANGE: 1920 + ext = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl); 1921 + ex = (struct ext4_extent *)&ext->fc_ex; 1922 + ret =
ext4_fc_record_regions(sb, 1923 + le32_to_cpu(ext->fc_ino), 1924 + le32_to_cpu(ex->ee_block), ext4_ext_pblock(ex), 1925 + ext4_ext_get_actual_len(ex)); 1926 + if (ret < 0) 1927 + break; 1928 + ret = JBD2_FC_REPLAY_CONTINUE; 1929 + fallthrough; 1930 + case EXT4_FC_TAG_DEL_RANGE: 1931 + case EXT4_FC_TAG_LINK: 1932 + case EXT4_FC_TAG_UNLINK: 1933 + case EXT4_FC_TAG_CREAT: 1934 + case EXT4_FC_TAG_INODE: 1935 + case EXT4_FC_TAG_PAD: 1936 + state->fc_cur_tag++; 1937 + state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl, 1938 + sizeof(*tl) + ext4_fc_tag_len(tl)); 1939 + break; 1940 + case EXT4_FC_TAG_TAIL: 1941 + state->fc_cur_tag++; 1942 + tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl); 1943 + state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl, 1944 + sizeof(*tl) + 1945 + offsetof(struct ext4_fc_tail, 1946 + fc_crc)); 1947 + if (le32_to_cpu(tail->fc_tid) == expected_tid && 1948 + le32_to_cpu(tail->fc_crc) == state->fc_crc) { 1949 + state->fc_replay_num_tags = state->fc_cur_tag; 1950 + state->fc_regions_valid = 1951 + state->fc_regions_used; 1952 + } else { 1953 + ret = state->fc_replay_num_tags ? 1954 + JBD2_FC_REPLAY_STOP : -EFSBADCRC; 1955 + } 1956 + state->fc_crc = 0; 1957 + break; 1958 + case EXT4_FC_TAG_HEAD: 1959 + head = (struct ext4_fc_head *)ext4_fc_tag_val(tl); 1960 + if (le32_to_cpu(head->fc_features) & 1961 + ~EXT4_FC_SUPPORTED_FEATURES) { 1962 + ret = -EOPNOTSUPP; 1963 + break; 1964 + } 1965 + if (le32_to_cpu(head->fc_tid) != expected_tid) { 1966 + ret = JBD2_FC_REPLAY_STOP; 1967 + break; 1968 + } 1969 + state->fc_cur_tag++; 1970 + state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl, 1971 + sizeof(*tl) + ext4_fc_tag_len(tl)); 1972 + break; 1973 + default: 1974 + ret = state->fc_replay_num_tags ? 
1975 + JBD2_FC_REPLAY_STOP : -ECANCELED; 1976 + } 1977 + if (ret < 0 || ret == JBD2_FC_REPLAY_STOP) 1978 + break; 1979 + } 1980 + 1981 + out_err: 1982 + trace_ext4_fc_replay_scan(sb, ret, off); 1983 + return ret; 1984 + } 1985 + 1986 + /* 1987 + * Main recovery path entry point. 1988 + * The meaning of return codes is similar as above. 1989 + */ 1990 + static int ext4_fc_replay(journal_t *journal, struct buffer_head *bh, 1991 + enum passtype pass, int off, tid_t expected_tid) 1992 + { 1993 + struct super_block *sb = journal->j_private; 1994 + struct ext4_sb_info *sbi = EXT4_SB(sb); 1995 + struct ext4_fc_tl *tl; 1996 + __u8 *start, *end; 1997 + int ret = JBD2_FC_REPLAY_CONTINUE; 1998 + struct ext4_fc_replay_state *state = &sbi->s_fc_replay_state; 1999 + struct ext4_fc_tail *tail; 2000 + 2001 + if (pass == PASS_SCAN) { 2002 + state->fc_current_pass = PASS_SCAN; 2003 + return ext4_fc_replay_scan(journal, bh, off, expected_tid); 2004 + } 2005 + 2006 + if (state->fc_current_pass != pass) { 2007 + state->fc_current_pass = pass; 2008 + sbi->s_mount_state |= EXT4_FC_REPLAY; 2009 + } 2010 + if (!sbi->s_fc_replay_state.fc_replay_num_tags) { 2011 + jbd_debug(1, "Replay stops\n"); 2012 + ext4_fc_set_bitmaps_and_counters(sb); 2013 + return 0; 2014 + } 2015 + 2016 + #ifdef CONFIG_EXT4_DEBUG 2017 + if (sbi->s_fc_debug_max_replay && off >= sbi->s_fc_debug_max_replay) { 2018 + pr_warn("Dropping fc block %d because max_replay set\n", off); 2019 + return JBD2_FC_REPLAY_STOP; 2020 + } 2021 + #endif 2022 + 2023 + start = (u8 *)bh->b_data; 2024 + end = (__u8 *)bh->b_data + journal->j_blocksize - 1; 2025 + 2026 + fc_for_each_tl(start, end, tl) { 2027 + if (state->fc_replay_num_tags == 0) { 2028 + ret = JBD2_FC_REPLAY_STOP; 2029 + ext4_fc_set_bitmaps_and_counters(sb); 2030 + break; 2031 + } 2032 + jbd_debug(3, "Replay phase, tag:%s\n", 2033 + tag2str(le16_to_cpu(tl->fc_tag))); 2034 + state->fc_replay_num_tags--; 2035 + switch (le16_to_cpu(tl->fc_tag)) { 2036 + case EXT4_FC_TAG_LINK: 2037 
+ ret = ext4_fc_replay_link(sb, tl); 2038 + break; 2039 + case EXT4_FC_TAG_UNLINK: 2040 + ret = ext4_fc_replay_unlink(sb, tl); 2041 + break; 2042 + case EXT4_FC_TAG_ADD_RANGE: 2043 + ret = ext4_fc_replay_add_range(sb, tl); 2044 + break; 2045 + case EXT4_FC_TAG_CREAT: 2046 + ret = ext4_fc_replay_create(sb, tl); 2047 + break; 2048 + case EXT4_FC_TAG_DEL_RANGE: 2049 + ret = ext4_fc_replay_del_range(sb, tl); 2050 + break; 2051 + case EXT4_FC_TAG_INODE: 2052 + ret = ext4_fc_replay_inode(sb, tl); 2053 + break; 2054 + case EXT4_FC_TAG_PAD: 2055 + trace_ext4_fc_replay(sb, EXT4_FC_TAG_PAD, 0, 2056 + ext4_fc_tag_len(tl), 0); 2057 + break; 2058 + case EXT4_FC_TAG_TAIL: 2059 + trace_ext4_fc_replay(sb, EXT4_FC_TAG_TAIL, 0, 2060 + ext4_fc_tag_len(tl), 0); 2061 + tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl); 2062 + WARN_ON(le32_to_cpu(tail->fc_tid) != expected_tid); 2063 + break; 2064 + case EXT4_FC_TAG_HEAD: 2065 + break; 2066 + default: 2067 + trace_ext4_fc_replay(sb, le16_to_cpu(tl->fc_tag), 0, 2068 + ext4_fc_tag_len(tl), 0); 2069 + ret = -ECANCELED; 2070 + break; 2071 + } 2072 + if (ret < 0) 2073 + break; 2074 + ret = JBD2_FC_REPLAY_CONTINUE; 2075 + } 2076 + return ret; 2077 + } 2078 + 2079 + void ext4_fc_init(struct super_block *sb, journal_t *journal) 2080 + { 2081 + /* 2082 + * We set the replay callback even if fast commit is disabled because we 2083 + * could still have fast commit blocks that need to be replayed even if 2084 + * fast commit has now been turned off.
2085 + */ 2086 + journal->j_fc_replay_callback = ext4_fc_replay; 2087 + if (!test_opt2(sb, JOURNAL_FAST_COMMIT)) 2088 + return; 2089 + journal->j_fc_cleanup_callback = ext4_fc_cleanup; 2090 + if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) { 2091 + pr_warn("Error while enabling fast commits, turning off."); 2092 + ext4_clear_feature_fast_commit(sb); 2093 + } 2094 + } 2095 + 2096 + const char *fc_ineligible_reasons[] = { 2097 + "Extended attributes changed", 2098 + "Cross rename", 2099 + "Journal flag changed", 2100 + "Insufficient memory", 2101 + "Swap boot", 2102 + "Resize", 2103 + "Dir renamed", 2104 + "Falloc range op", 2105 + "FC Commit Failed" 2106 + }; 2107 + 2108 + int ext4_fc_info_show(struct seq_file *seq, void *v) 2109 + { 2110 + struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private); 2111 + struct ext4_fc_stats *stats = &sbi->s_fc_stats; 2112 + int i; 2113 + 2114 + if (v != SEQ_START_TOKEN) 2115 + return 0; 2116 + 2117 + seq_printf(seq, 2118 + "fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_time\n", 2119 + stats->fc_num_commits, stats->fc_ineligible_commits, 2120 + stats->fc_numblks, 2121 + div_u64(sbi->s_fc_avg_commit_time, 1000)); 2122 + seq_puts(seq, "Ineligible reasons:\n"); 2123 + for (i = 0; i < EXT4_FC_REASON_MAX; i++) 2124 + seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i], 2125 + stats->fc_ineligible_reason_count[i]); 2126 + 2127 + return 0; 2128 + } 2129 + 2130 + int __init ext4_fc_init_dentry_cache(void) 2131 + { 2132 + ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update, 2133 + SLAB_RECLAIM_ACCOUNT); 2134 + 2135 + if (ext4_fc_dentry_cachep == NULL) 2136 + return -ENOMEM; 2137 + 2138 + return 0; 2139 + }
+159
fs/ext4/fast_commit.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + 3 + #ifndef __FAST_COMMIT_H__ 4 + #define __FAST_COMMIT_H__ 5 + 6 + /* Number of blocks in journal area to allocate for fast commits */ 7 + #define EXT4_NUM_FC_BLKS 256 8 + 9 + /* Fast commit tags */ 10 + #define EXT4_FC_TAG_ADD_RANGE 0x0001 11 + #define EXT4_FC_TAG_DEL_RANGE 0x0002 12 + #define EXT4_FC_TAG_CREAT 0x0003 13 + #define EXT4_FC_TAG_LINK 0x0004 14 + #define EXT4_FC_TAG_UNLINK 0x0005 15 + #define EXT4_FC_TAG_INODE 0x0006 16 + #define EXT4_FC_TAG_PAD 0x0007 17 + #define EXT4_FC_TAG_TAIL 0x0008 18 + #define EXT4_FC_TAG_HEAD 0x0009 19 + 20 + #define EXT4_FC_SUPPORTED_FEATURES 0x0 21 + 22 + /* On disk fast commit tlv value structures */ 23 + 24 + /* Fast commit on disk tag length structure */ 25 + struct ext4_fc_tl { 26 + __le16 fc_tag; 27 + __le16 fc_len; 28 + }; 29 + 30 + /* Value structure for tag EXT4_FC_TAG_HEAD. */ 31 + struct ext4_fc_head { 32 + __le32 fc_features; 33 + __le32 fc_tid; 34 + }; 35 + 36 + /* Value structure for EXT4_FC_TAG_ADD_RANGE. */ 37 + struct ext4_fc_add_range { 38 + __le32 fc_ino; 39 + __u8 fc_ex[12]; 40 + }; 41 + 42 + /* Value structure for tag EXT4_FC_TAG_DEL_RANGE. */ 43 + struct ext4_fc_del_range { 44 + __le32 fc_ino; 45 + __le32 fc_lblk; 46 + __le32 fc_len; 47 + }; 48 + 49 + /* 50 + * This is the value structure for tags EXT4_FC_TAG_CREAT, EXT4_FC_TAG_LINK 51 + * and EXT4_FC_TAG_UNLINK. 52 + */ 53 + struct ext4_fc_dentry_info { 54 + __le32 fc_parent_ino; 55 + __le32 fc_ino; 56 + u8 fc_dname[0]; 57 + }; 58 + 59 + /* Value structure for EXT4_FC_TAG_INODE and EXT4_FC_TAG_INODE_PARTIAL. */ 60 + struct ext4_fc_inode { 61 + __le32 fc_ino; 62 + __u8 fc_raw_inode[0]; 63 + }; 64 + 65 + /* Value structure for tag EXT4_FC_TAG_TAIL. */ 66 + struct ext4_fc_tail { 67 + __le32 fc_tid; 68 + __le32 fc_crc; 69 + }; 70 + 71 + /* 72 + * In memory list of dentry updates that are performed on the file 73 + * system used by fast commit code. 
74 + */ 75 + struct ext4_fc_dentry_update { 76 + int fcd_op; /* Type of update create / unlink / link */ 77 + int fcd_parent; /* Parent inode number */ 78 + int fcd_ino; /* Inode number */ 79 + struct qstr fcd_name; /* Dirent name */ 80 + unsigned char fcd_iname[DNAME_INLINE_LEN]; /* Dirent name string */ 81 + struct list_head fcd_list; 82 + }; 83 + 84 + /* 85 + * Fast commit reason codes 86 + */ 87 + enum { 88 + /* 89 + * Commit status codes: 90 + */ 91 + EXT4_FC_REASON_OK = 0, 92 + EXT4_FC_REASON_INELIGIBLE, 93 + EXT4_FC_REASON_ALREADY_COMMITTED, 94 + EXT4_FC_REASON_FC_START_FAILED, 95 + EXT4_FC_REASON_FC_FAILED, 96 + 97 + /* 98 + * Fast commit ineligibility reasons: 99 + */ 100 + EXT4_FC_REASON_XATTR = 0, 101 + EXT4_FC_REASON_CROSS_RENAME, 102 + EXT4_FC_REASON_JOURNAL_FLAG_CHANGE, 103 + EXT4_FC_REASON_MEM, 104 + EXT4_FC_REASON_SWAP_BOOT, 105 + EXT4_FC_REASON_RESIZE, 106 + EXT4_FC_REASON_RENAME_DIR, 107 + EXT4_FC_REASON_FALLOC_RANGE, 108 + EXT4_FC_COMMIT_FAILED, 109 + EXT4_FC_REASON_MAX 110 + }; 111 + 112 + struct ext4_fc_stats { 113 + unsigned int fc_ineligible_reason_count[EXT4_FC_REASON_MAX]; 114 + unsigned long fc_num_commits; 115 + unsigned long fc_ineligible_commits; 116 + unsigned long fc_numblks; 117 + }; 118 + 119 + #define EXT4_FC_REPLAY_REALLOC_INCREMENT 4 120 + 121 + /* 122 + * Physical block regions added to different inodes due to fast commit 123 + * recovery. These are set during the SCAN phase. During the replay phase, 124 + * our allocator excludes these from its allocation. This ensures that 125 + * we don't accidentally allocate a block that is going to be used by 126 + * another inode. 127 + */ 128 + struct ext4_fc_alloc_region { 129 + ext4_lblk_t lblk; 130 + ext4_fsblk_t pblk; 131 + int ino, len; 132 + }; 133 + 134 + /* 135 + * Fast commit replay state.
136 + */ 137 + struct ext4_fc_replay_state { 138 + int fc_replay_num_tags; 139 + int fc_replay_expected_off; 140 + int fc_current_pass; 141 + int fc_cur_tag; 142 + int fc_crc; 143 + struct ext4_fc_alloc_region *fc_regions; 144 + int fc_regions_size, fc_regions_used, fc_regions_valid; 145 + int *fc_modified_inodes; 146 + int fc_modified_inodes_used, fc_modified_inodes_size; 147 + }; 148 + 149 + #define region_last(__region) (((__region)->lblk) + ((__region)->len) - 1) 150 + 151 + #define fc_for_each_tl(__start, __end, __tl) \ 152 + for (tl = (struct ext4_fc_tl *)start; \ 153 + (u8 *)tl < (u8 *)end; \ 154 + tl = (struct ext4_fc_tl *)((u8 *)tl + \ 155 + sizeof(struct ext4_fc_tl) + \ 156 + + le16_to_cpu(tl->fc_len))) 157 + 158 + 159 + #endif /* __FAST_COMMIT_H__ */
+9 -3
fs/ext4/file.c
··· 260 260 if (iocb->ki_flags & IOCB_NOWAIT) 261 261 return -EOPNOTSUPP; 262 262 263 + ext4_fc_start_update(inode); 263 264 inode_lock(inode); 264 265 ret = ext4_write_checks(iocb, from); 265 266 if (ret <= 0) ··· 272 271 273 272 out: 274 273 inode_unlock(inode); 274 + ext4_fc_stop_update(inode); 275 275 if (likely(ret > 0)) { 276 276 iocb->ki_pos += ret; 277 277 ret = generic_write_sync(iocb, ret); ··· 536 534 goto out; 537 535 } 538 536 537 + ext4_fc_start_update(inode); 539 538 ret = ext4_orphan_add(handle, inode); 539 + ext4_fc_stop_update(inode); 540 540 if (ret) { 541 541 ext4_journal_stop(handle); 542 542 goto out; ··· 660 656 #endif 661 657 if (iocb->ki_flags & IOCB_DIRECT) 662 658 return ext4_dio_write_iter(iocb, from); 663 - 664 - return ext4_buffered_write_iter(iocb, from); 659 + else 660 + return ext4_buffered_write_iter(iocb, from); 665 661 } 666 662 667 663 #ifdef CONFIG_FS_DAX ··· 761 757 if (!daxdev_mapping_supported(vma, dax_dev)) 762 758 return -EOPNOTSUPP; 763 759 760 + ext4_fc_start_update(inode); 764 761 file_accessed(file); 765 762 if (IS_DAX(file_inode(file))) { 766 763 vma->vm_ops = &ext4_dax_vm_ops; ··· 769 764 } else { 770 765 vma->vm_ops = &ext4_file_vm_ops; 771 766 } 767 + ext4_fc_stop_update(inode); 772 768 return 0; 773 769 } 774 770 ··· 850 844 return ret; 851 845 } 852 846 853 - filp->f_mode |= FMODE_NOWAIT; 847 + filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC; 854 848 return dquot_file_open(inode, filp); 855 849 } 856 850
+7 -4
fs/ext4/fsmap.c
··· 108 108 109 109 /* Are we just counting mappings? */ 110 110 if (info->gfi_head->fmh_count == 0) { 111 + if (info->gfi_head->fmh_entries == UINT_MAX) 112 + return EXT4_QUERY_RANGE_ABORT; 113 + 111 114 if (rec_fsblk > info->gfi_next_fsblk) 112 115 info->gfi_head->fmh_entries++; 113 116 ··· 574 571 if (fm->fmr_device == 0 || fm->fmr_device == UINT_MAX || 575 572 fm->fmr_device == new_encode_dev(sb->s_bdev->bd_dev)) 576 573 return true; 577 - if (EXT4_SB(sb)->journal_bdev && 578 - fm->fmr_device == new_encode_dev(EXT4_SB(sb)->journal_bdev->bd_dev)) 574 + if (EXT4_SB(sb)->s_journal_bdev && 575 + fm->fmr_device == new_encode_dev(EXT4_SB(sb)->s_journal_bdev->bd_dev)) 579 576 return true; 580 577 return false; 581 578 } ··· 645 642 memset(handlers, 0, sizeof(handlers)); 646 643 handlers[0].gfd_dev = new_encode_dev(sb->s_bdev->bd_dev); 647 644 handlers[0].gfd_fn = ext4_getfsmap_datadev; 648 - if (EXT4_SB(sb)->journal_bdev) { 645 + if (EXT4_SB(sb)->s_journal_bdev) { 649 646 handlers[1].gfd_dev = new_encode_dev( 650 - EXT4_SB(sb)->journal_bdev->bd_dev); 647 + EXT4_SB(sb)->s_journal_bdev->bd_dev); 651 648 handlers[1].gfd_fn = ext4_getfsmap_logdev; 652 649 } 653 650
+2 -2
fs/ext4/fsync.c
··· 112 112 !jbd2_trans_will_send_data_barrier(journal, commit_tid)) 113 113 *needs_barrier = true; 114 114 115 - return jbd2_complete_transaction(journal, commit_tid); 115 + return ext4_fc_commit(journal, commit_tid); 116 116 } 117 117 118 118 /* ··· 150 150 151 151 ret = file_write_and_wait_range(file, start, end); 152 152 if (ret) 153 - return ret; 153 + goto out; 154 154 155 155 /* 156 156 * data=writeback,ordered:
+152 -19
fs/ext4/ialloc.c
··· 82 82 struct buffer_head *bh) 83 83 { 84 84 ext4_fsblk_t blk; 85 - struct ext4_group_info *grp = ext4_get_group_info(sb, block_group); 85 + struct ext4_group_info *grp; 86 + 87 + if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) 88 + return 0; 89 + 90 + grp = ext4_get_group_info(sb, block_group); 86 91 87 92 if (buffer_verified(bh)) 88 93 return 0; ··· 194 189 * submit the buffer_head for reading 195 190 */ 196 191 trace_ext4_load_inode_bitmap(sb, block_group); 197 - bh->b_end_io = ext4_end_bitmap_read; 198 - get_bh(bh); 199 - submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh); 200 - wait_on_buffer(bh); 192 + ext4_read_bh(bh, REQ_META | REQ_PRIO, ext4_end_bitmap_read); 201 193 ext4_simulate_fail_bh(sb, bh, EXT4_SIM_IBITMAP_EIO); 202 194 if (!buffer_uptodate(bh)) { 203 195 put_bh(bh); ··· 286 284 bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb); 287 285 bitmap_bh = ext4_read_inode_bitmap(sb, block_group); 288 286 /* Don't bother if the inode bitmap is corrupt. */ 289 - grp = ext4_get_group_info(sb, block_group); 290 287 if (IS_ERR(bitmap_bh)) { 291 288 fatal = PTR_ERR(bitmap_bh); 292 289 bitmap_bh = NULL; 293 290 goto error_return; 294 291 } 295 - if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) { 296 - fatal = -EFSCORRUPTED; 297 - goto error_return; 292 + if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) { 293 + grp = ext4_get_group_info(sb, block_group); 294 + if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) { 295 + fatal = -EFSCORRUPTED; 296 + goto error_return; 297 + } 298 298 } 299 299 300 300 BUFFER_TRACE(bitmap_bh, "get_write_access"); ··· 746 742 return 1; 747 743 } 748 744 745 + int ext4_mark_inode_used(struct super_block *sb, int ino) 746 + { 747 + unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count); 748 + struct buffer_head *inode_bitmap_bh = NULL, *group_desc_bh = NULL; 749 + struct ext4_group_desc *gdp; 750 + ext4_group_t group; 751 + int bit; 752 + int err = -EFSCORRUPTED; 753 + 754 + if (ino < EXT4_FIRST_INO(sb) || ino > max_ino) 755 + goto out; 
756 + 757 + group = (ino - 1) / EXT4_INODES_PER_GROUP(sb); 758 + bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb); 759 + inode_bitmap_bh = ext4_read_inode_bitmap(sb, group); 760 + if (IS_ERR(inode_bitmap_bh)) 761 + return PTR_ERR(inode_bitmap_bh); 762 + 763 + if (ext4_test_bit(bit, inode_bitmap_bh->b_data)) { 764 + err = 0; 765 + goto out; 766 + } 767 + 768 + gdp = ext4_get_group_desc(sb, group, &group_desc_bh); 769 + if (!gdp || !group_desc_bh) { 770 + err = -EINVAL; 771 + goto out; 772 + } 773 + 774 + ext4_set_bit(bit, inode_bitmap_bh->b_data); 775 + 776 + BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata"); 777 + err = ext4_handle_dirty_metadata(NULL, NULL, inode_bitmap_bh); 778 + if (err) { 779 + ext4_std_error(sb, err); 780 + goto out; 781 + } 782 + err = sync_dirty_buffer(inode_bitmap_bh); 783 + if (err) { 784 + ext4_std_error(sb, err); 785 + goto out; 786 + } 787 + 788 + /* We may have to initialize the block bitmap if it isn't already */ 789 + if (ext4_has_group_desc_csum(sb) && 790 + gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { 791 + struct buffer_head *block_bitmap_bh; 792 + 793 + block_bitmap_bh = ext4_read_block_bitmap(sb, group); 794 + if (IS_ERR(block_bitmap_bh)) { 795 + err = PTR_ERR(block_bitmap_bh); 796 + goto out; 797 + } 798 + 799 + BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap"); 800 + err = ext4_handle_dirty_metadata(NULL, NULL, block_bitmap_bh); 801 + sync_dirty_buffer(block_bitmap_bh); 802 + 803 + /* recheck and clear flag under lock if we still need to */ 804 + ext4_lock_group(sb, group); 805 + if (ext4_has_group_desc_csum(sb) && 806 + (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) { 807 + gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT); 808 + ext4_free_group_clusters_set(sb, gdp, 809 + ext4_free_clusters_after_init(sb, group, gdp)); 810 + ext4_block_bitmap_csum_set(sb, group, gdp, 811 + block_bitmap_bh); 812 + ext4_group_desc_csum_set(sb, group, gdp); 813 + } 814 + ext4_unlock_group(sb, group); 815 + 
brelse(block_bitmap_bh); 816 + 817 + if (err) { 818 + ext4_std_error(sb, err); 819 + goto out; 820 + } 821 + } 822 + 823 + /* Update the relevant bg descriptor fields */ 824 + if (ext4_has_group_desc_csum(sb)) { 825 + int free; 826 + 827 + ext4_lock_group(sb, group); /* while we modify the bg desc */ 828 + free = EXT4_INODES_PER_GROUP(sb) - 829 + ext4_itable_unused_count(sb, gdp); 830 + if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) { 831 + gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT); 832 + free = 0; 833 + } 834 + 835 + /* 836 + * Check the relative inode number against the last used 837 + * relative inode number in this group. if it is greater 838 + * we need to update the bg_itable_unused count 839 + */ 840 + if (bit >= free) 841 + ext4_itable_unused_set(sb, gdp, 842 + (EXT4_INODES_PER_GROUP(sb) - bit - 1)); 843 + } else { 844 + ext4_lock_group(sb, group); 845 + } 846 + 847 + ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1); 848 + if (ext4_has_group_desc_csum(sb)) { 849 + ext4_inode_bitmap_csum_set(sb, group, gdp, inode_bitmap_bh, 850 + EXT4_INODES_PER_GROUP(sb) / 8); 851 + ext4_group_desc_csum_set(sb, group, gdp); 852 + } 853 + 854 + ext4_unlock_group(sb, group); 855 + err = ext4_handle_dirty_metadata(NULL, NULL, group_desc_bh); 856 + sync_dirty_buffer(group_desc_bh); 857 + out: 858 + return err; 859 + } 860 + 749 861 static int ext4_xattr_credits_for_new_inode(struct inode *dir, mode_t mode, 750 862 bool encrypt) 751 863 { ··· 938 818 struct inode *ret; 939 819 ext4_group_t i; 940 820 ext4_group_t flex_group; 941 - struct ext4_group_info *grp; 821 + struct ext4_group_info *grp = NULL; 942 822 bool encrypt = false; 943 823 944 824 /* Cannot create files in a deleted directory */ ··· 1038 918 if (ext4_free_inodes_count(sb, gdp) == 0) 1039 919 goto next_group; 1040 920 1041 - grp = ext4_get_group_info(sb, group); 1042 - /* Skip groups with already-known suspicious inode tables */ 1043 - if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) 1044 - 
goto next_group; 921 + if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) { 922 + grp = ext4_get_group_info(sb, group); 923 + /* 924 + * Skip groups with already-known suspicious inode 925 + * tables 926 + */ 927 + if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) 928 + goto next_group; 929 + } 1045 930 1046 931 brelse(inode_bitmap_bh); 1047 932 inode_bitmap_bh = ext4_read_inode_bitmap(sb, group); 1048 933 /* Skip groups with suspicious inode tables */ 1049 - if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp) || 934 + if (((!(sbi->s_mount_state & EXT4_FC_REPLAY)) 935 + && EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) || 1050 936 IS_ERR(inode_bitmap_bh)) { 1051 937 inode_bitmap_bh = NULL; 1052 938 goto next_group; ··· 1071 945 goto next_group; 1072 946 } 1073 947 1074 - if (!handle) { 948 + if ((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && !handle) { 1075 949 BUG_ON(nblocks <= 0); 1076 950 handle = __ext4_journal_start_sb(dir->i_sb, line_no, 1077 951 handle_type, nblocks, 0, ··· 1175 1049 /* Update the relevant bg descriptor fields */ 1176 1050 if (ext4_has_group_desc_csum(sb)) { 1177 1051 int free; 1178 - struct ext4_group_info *grp = ext4_get_group_info(sb, group); 1052 + struct ext4_group_info *grp = NULL; 1179 1053 1180 - down_read(&grp->alloc_sem); /* protect vs itable lazyinit */ 1054 + if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) { 1055 + grp = ext4_get_group_info(sb, group); 1056 + down_read(&grp->alloc_sem); /* 1057 + * protect vs itable 1058 + * lazyinit 1059 + */ 1060 + } 1181 1061 ext4_lock_group(sb, group); /* while we modify the bg desc */ 1182 1062 free = EXT4_INODES_PER_GROUP(sb) - 1183 1063 ext4_itable_unused_count(sb, gdp); ··· 1199 1067 if (ino > free) 1200 1068 ext4_itable_unused_set(sb, gdp, 1201 1069 (EXT4_INODES_PER_GROUP(sb) - ino)); 1202 - up_read(&grp->alloc_sem); 1070 + if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) 1071 + up_read(&grp->alloc_sem); 1203 1072 } else { 1204 1073 ext4_lock_group(sb, group); 1205 1074 }
+7 -6
fs/ext4/indirect.c
··· 163 163 } 164 164 165 165 if (!bh_uptodate_or_lock(bh)) { 166 - if (bh_submit_read(bh) < 0) { 166 + if (ext4_read_bh(bh, 0, NULL) < 0) { 167 167 put_bh(bh); 168 168 goto failure; 169 169 } ··· 593 593 if (ext4_has_feature_bigalloc(inode->i_sb)) { 594 594 EXT4_ERROR_INODE(inode, "Can't allocate blocks for " 595 595 "non-extent mapped inodes with bigalloc"); 596 - return -EFSCORRUPTED; 596 + err = -EFSCORRUPTED; 597 + goto out; 597 598 } 598 599 599 600 /* Set up for the direct block allocation */ ··· 1013 1012 } 1014 1013 1015 1014 /* Go read the buffer for the next level down */ 1016 - bh = sb_bread(inode->i_sb, nr); 1015 + bh = ext4_sb_bread(inode->i_sb, nr, 0); 1017 1016 1018 1017 /* 1019 1018 * A read failure? Report error and clear slot 1020 1019 * (should be rare). 1021 1020 */ 1022 - if (!bh) { 1023 - ext4_error_inode_block(inode, nr, EIO, 1021 + if (IS_ERR(bh)) { 1022 + ext4_error_inode_block(inode, nr, -PTR_ERR(bh), 1024 1023 "Read failure"); 1025 1024 continue; 1026 1025 } ··· 1034 1033 brelse(bh); 1035 1034 1036 1035 /* 1037 - * Everything below this this pointer has been 1036 + * Everything below this pointer has been 1038 1037 * released. Now let this top-of-subtree go. 1039 1038 * 1040 1039 * We want the freeing of this indirect block to be
+1 -1
fs/ext4/inline.c
··· 354 354 if (error) 355 355 goto out; 356 356 357 - /* Update the xttr entry. */ 357 + /* Update the xattr entry. */ 358 358 i.value = value; 359 359 i.value_len = len; 360 360
+210 -82
fs/ext4/inode.c
··· 101 101 return provided == calculated; 102 102 } 103 103 104 - static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw, 105 - struct ext4_inode_info *ei) 104 + void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw, 105 + struct ext4_inode_info *ei) 106 106 { 107 107 __u32 csum; 108 108 ··· 514 514 return -EFSCORRUPTED; 515 515 516 516 /* Lookup extent status tree firstly */ 517 - if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) { 517 + if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) && 518 + ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) { 518 519 if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) { 519 520 map->m_pblk = ext4_es_pblock(&es) + 520 521 map->m_lblk - es.es_lblk; ··· 730 729 if (ret) 731 730 return ret; 732 731 } 732 + ext4_fc_track_range(inode, map->m_lblk, 733 + map->m_lblk + map->m_len - 1); 733 734 } 734 735 735 736 if (retval < 0) ··· 828 825 int create = map_flags & EXT4_GET_BLOCKS_CREATE; 829 826 int err; 830 827 831 - J_ASSERT(handle != NULL || create == 0); 828 + J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) 829 + || handle != NULL || create == 0); 832 830 833 831 map.m_lblk = block; 834 832 map.m_len = 1; ··· 845 841 return ERR_PTR(-ENOMEM); 846 842 if (map.m_flags & EXT4_MAP_NEW) { 847 843 J_ASSERT(create != 0); 848 - J_ASSERT(handle != NULL); 844 + J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) 845 + || (handle != NULL)); 849 846 850 847 /* 851 848 * Now that we do not always journal data, we should ··· 883 878 ext4_lblk_t block, int map_flags) 884 879 { 885 880 struct buffer_head *bh; 881 + int ret; 886 882 887 883 bh = ext4_getblk(handle, inode, block, map_flags); 888 884 if (IS_ERR(bh)) 889 885 return bh; 890 886 if (!bh || ext4_buffer_uptodate(bh)) 891 887 return bh; 892 - ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1, &bh); 893 - wait_on_buffer(bh); 894 - if (buffer_uptodate(bh)) 895 - return bh; 896 - put_bh(bh); 897 - return 
ERR_PTR(-EIO); 888 + 889 + ret = ext4_read_bh_lock(bh, REQ_META | REQ_PRIO, true); 890 + if (ret) { 891 + put_bh(bh); 892 + return ERR_PTR(ret); 893 + } 894 + return bh; 898 895 } 899 896 900 897 /* Read a contiguous batch of blocks. */ ··· 917 910 for (i = 0; i < bh_count; i++) 918 911 /* Note that NULL bhs[i] is valid because of holes. */ 919 912 if (bhs[i] && !ext4_buffer_uptodate(bhs[i])) 920 - ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1, 921 - &bhs[i]); 913 + ext4_read_bh_lock(bhs[i], REQ_META | REQ_PRIO, false); 922 914 923 915 if (!wait) 924 916 return 0; ··· 1087 1081 if (!buffer_uptodate(bh) && !buffer_delay(bh) && 1088 1082 !buffer_unwritten(bh) && 1089 1083 (block_start < from || block_end > to)) { 1090 - ll_rw_block(REQ_OP_READ, 0, 1, &bh); 1084 + ext4_read_bh_lock(bh, 0, false); 1091 1085 wait[nr_wait++] = bh; 1092 1086 } 1093 1087 } ··· 1918 1912 } 1919 1913 if (ret == 0) 1920 1914 ret = err; 1915 + err = ext4_jbd2_inode_add_write(handle, inode, 0, len); 1916 + if (ret == 0) 1917 + ret = err; 1921 1918 EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid; 1922 1919 err = ext4_journal_stop(handle); 1923 1920 if (!ret) ··· 2263 2254 err = PTR_ERR(io_end_vec); 2264 2255 goto out; 2265 2256 } 2266 - io_end_vec->offset = mpd->map.m_lblk << blkbits; 2257 + io_end_vec->offset = (loff_t)mpd->map.m_lblk << blkbits; 2267 2258 } 2268 2259 *map_bh = true; 2269 2260 goto out; ··· 2794 2785 * ext4_journal_stop() can wait for transaction commit 2795 2786 * to finish which may depend on writeback of pages to 2796 2787 * complete or on page lock to be released. In that 2797 - * case, we have to wait until after after we have 2788 + * case, we have to wait until after we have 2798 2789 * submitted all the IO, released page locks we hold, 2799 2790 * and dropped io_end reference (for extent conversion 2800 2791 * to be able to complete) before stopping the handle. 
··· 3305 3296 { 3306 3297 journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; 3307 3298 3308 - if (journal) 3309 - return !jbd2_transaction_committed(journal, 3310 - EXT4_I(inode)->i_datasync_tid); 3299 + if (journal) { 3300 + if (jbd2_transaction_committed(journal, 3301 + EXT4_I(inode)->i_datasync_tid)) 3302 + return true; 3303 + return atomic_read(&EXT4_SB(inode->i_sb)->s_fc_subtid) >= 3304 + EXT4_I(inode)->i_fc_committed_subtid; 3305 + } 3306 + 3311 3307 /* Any metadata buffers to write? */ 3312 3308 if (!list_empty(&inode->i_mapping->private_list)) 3313 3309 return true; ··· 3450 3436 map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, 3451 3437 EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; 3452 3438 3453 - if (flags & IOMAP_WRITE) 3439 + if (flags & IOMAP_WRITE) { 3440 + /* 3441 + * We check here if the blocks are already allocated, then we 3442 + * don't need to start a journal txn and we can directly return 3443 + * the mapping information. This could boost performance 3444 + * especially in multi-threaded overwrite requests. 
3445 + */ 3446 + if (offset + length <= i_size_read(inode)) { 3447 + ret = ext4_map_blocks(NULL, inode, &map, 0); 3448 + if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED)) 3449 + goto out; 3450 + } 3454 3451 ret = ext4_iomap_alloc(inode, &map, flags); 3455 - else 3452 + } else { 3456 3453 ret = ext4_map_blocks(NULL, inode, &map, 0); 3454 + } 3457 3455 3458 3456 if (ret < 0) 3459 3457 return ret; 3460 - 3458 + out: 3461 3459 ext4_set_iomap(inode, iomap, &map, offset, length); 3462 3460 3463 3461 return 0; ··· 3627 3601 return __set_page_dirty_buffers(page); 3628 3602 } 3629 3603 3604 + static int ext4_iomap_swap_activate(struct swap_info_struct *sis, 3605 + struct file *file, sector_t *span) 3606 + { 3607 + return iomap_swapfile_activate(sis, file, span, 3608 + &ext4_iomap_report_ops); 3609 + } 3610 + 3630 3611 static const struct address_space_operations ext4_aops = { 3631 3612 .readpage = ext4_readpage, 3632 3613 .readahead = ext4_readahead, ··· 3649 3616 .migratepage = buffer_migrate_page, 3650 3617 .is_partially_uptodate = block_is_partially_uptodate, 3651 3618 .error_remove_page = generic_error_remove_page, 3619 + .swap_activate = ext4_iomap_swap_activate, 3652 3620 }; 3653 3621 3654 3622 static const struct address_space_operations ext4_journalled_aops = { ··· 3666 3632 .direct_IO = noop_direct_IO, 3667 3633 .is_partially_uptodate = block_is_partially_uptodate, 3668 3634 .error_remove_page = generic_error_remove_page, 3635 + .swap_activate = ext4_iomap_swap_activate, 3669 3636 }; 3670 3637 3671 3638 static const struct address_space_operations ext4_da_aops = { ··· 3684 3649 .migratepage = buffer_migrate_page, 3685 3650 .is_partially_uptodate = block_is_partially_uptodate, 3686 3651 .error_remove_page = generic_error_remove_page, 3652 + .swap_activate = ext4_iomap_swap_activate, 3687 3653 }; 3688 3654 3689 3655 static const struct address_space_operations ext4_dax_aops = { ··· 3693 3657 .set_page_dirty = noop_set_page_dirty, 3694 3658 .bmap = ext4_bmap, 3695 3659 
.invalidatepage = noop_invalidatepage, 3660 + .swap_activate = ext4_iomap_swap_activate, 3696 3661 }; 3697 3662 3698 3663 void ext4_set_aops(struct inode *inode) ··· 3767 3730 set_buffer_uptodate(bh); 3768 3731 3769 3732 if (!buffer_uptodate(bh)) { 3770 - err = -EIO; 3771 - ll_rw_block(REQ_OP_READ, 0, 1, &bh); 3772 - wait_on_buffer(bh); 3773 - /* Uhhuh. Read error. Complain and punt. */ 3774 - if (!buffer_uptodate(bh)) 3733 + err = ext4_read_bh_lock(bh, 0, true); 3734 + if (err) 3775 3735 goto unlock; 3776 3736 if (fscrypt_inode_uses_fs_layer_crypto(inode)) { 3777 3737 /* We expect the key to be set. */ ··· 4107 4073 4108 4074 up_write(&EXT4_I(inode)->i_data_sem); 4109 4075 } 4076 + ext4_fc_track_range(inode, first_block, stop_block); 4110 4077 if (IS_SYNC(inode)) 4111 4078 ext4_handle_sync(handle); 4112 4079 ··· 4287 4252 * data in memory that is needed to recreate the on-disk version of this 4288 4253 * inode. 4289 4254 */ 4290 - static int __ext4_get_inode_loc(struct inode *inode, 4291 - struct ext4_iloc *iloc, int in_mem) 4255 + static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino, 4256 + struct ext4_iloc *iloc, int in_mem, 4257 + ext4_fsblk_t *ret_block) 4292 4258 { 4293 4259 struct ext4_group_desc *gdp; 4294 4260 struct buffer_head *bh; 4295 - struct super_block *sb = inode->i_sb; 4296 4261 ext4_fsblk_t block; 4297 4262 struct blk_plug plug; 4298 4263 int inodes_per_block, inode_offset; 4299 4264 4300 4265 iloc->bh = NULL; 4301 - if (inode->i_ino < EXT4_ROOT_INO || 4302 - inode->i_ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)) 4266 + if (ino < EXT4_ROOT_INO || 4267 + ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)) 4303 4268 return -EFSCORRUPTED; 4304 4269 4305 - iloc->block_group = (inode->i_ino - 1) / EXT4_INODES_PER_GROUP(sb); 4270 + iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb); 4306 4271 gdp = ext4_get_group_desc(sb, iloc->block_group, NULL); 4307 4272 if (!gdp) 4308 4273 return -EIO; ··· 4311 4276 * Figure 
out the offset within the block group inode table 4312 4277 */ 4313 4278 inodes_per_block = EXT4_SB(sb)->s_inodes_per_block; 4314 - inode_offset = ((inode->i_ino - 1) % 4279 + inode_offset = ((ino - 1) % 4315 4280 EXT4_INODES_PER_GROUP(sb)); 4316 4281 block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block); 4317 4282 iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb); ··· 4324 4289 if (!buffer_uptodate(bh)) { 4325 4290 lock_buffer(bh); 4326 4291 4327 - /* 4328 - * If the buffer has the write error flag, we have failed 4329 - * to write out another inode in the same block. In this 4330 - * case, we don't have to read the block because we may 4331 - * read the old inode data successfully. 4332 - */ 4333 - if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) 4334 - set_buffer_uptodate(bh); 4335 - 4336 - if (buffer_uptodate(bh)) { 4292 + if (ext4_buffer_uptodate(bh)) { 4337 4293 /* someone brought it uptodate while we waited */ 4338 4294 unlock_buffer(bh); 4339 4295 goto has_buffer; ··· 4395 4369 if (end > table) 4396 4370 end = table; 4397 4371 while (b <= end) 4398 - sb_breadahead_unmovable(sb, b++); 4372 + ext4_sb_breadahead_unmovable(sb, b++); 4399 4373 } 4400 4374 4401 4375 /* ··· 4403 4377 * has in-inode xattrs, or we don't have this inode in memory. 4404 4378 * Read the block from disk. 
4405 4379 */ 4406 - trace_ext4_load_inode(inode); 4407 - get_bh(bh); 4408 - bh->b_end_io = end_buffer_read_sync; 4409 - submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh); 4380 + trace_ext4_load_inode(sb, ino); 4381 + ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL); 4410 4382 blk_finish_plug(&plug); 4411 4383 wait_on_buffer(bh); 4412 4384 if (!buffer_uptodate(bh)) { 4413 4385 simulate_eio: 4414 - ext4_error_inode_block(inode, block, EIO, 4415 - "unable to read itable block"); 4386 + if (ret_block) 4387 + *ret_block = block; 4416 4388 brelse(bh); 4417 4389 return -EIO; 4418 4390 } ··· 4420 4396 return 0; 4421 4397 } 4422 4398 4399 + static int __ext4_get_inode_loc_noinmem(struct inode *inode, 4400 + struct ext4_iloc *iloc) 4401 + { 4402 + ext4_fsblk_t err_blk; 4403 + int ret; 4404 + 4405 + ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc, 0, 4406 + &err_blk); 4407 + 4408 + if (ret == -EIO) 4409 + ext4_error_inode_block(inode, err_blk, EIO, 4410 + "unable to read itable block"); 4411 + 4412 + return ret; 4413 + } 4414 + 4423 4415 int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc) 4424 4416 { 4417 + ext4_fsblk_t err_blk; 4418 + int ret; 4419 + 4425 4420 /* We have all inode data except xattrs in memory here. 
*/ 4426 - return __ext4_get_inode_loc(inode, iloc, 4427 - !ext4_test_inode_state(inode, EXT4_STATE_XATTR)); 4421 + ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc, 4422 + !ext4_test_inode_state(inode, EXT4_STATE_XATTR), &err_blk); 4423 + 4424 + if (ret == -EIO) 4425 + ext4_error_inode_block(inode, err_blk, EIO, 4426 + "unable to read itable block"); 4427 + 4428 + return ret; 4429 + } 4430 + 4431 + 4432 + int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino, 4433 + struct ext4_iloc *iloc) 4434 + { 4435 + return __ext4_get_inode_loc(sb, ino, iloc, 0, NULL); 4428 4436 } 4429 4437 4430 4438 static bool ext4_should_enable_dax(struct inode *inode) ··· 4622 4566 ei = EXT4_I(inode); 4623 4567 iloc.bh = NULL; 4624 4568 4625 - ret = __ext4_get_inode_loc(inode, &iloc, 0); 4569 + ret = __ext4_get_inode_loc_noinmem(inode, &iloc); 4626 4570 if (ret < 0) 4627 4571 goto bad_inode; 4628 4572 raw_inode = ext4_raw_inode(&iloc); ··· 4668 4612 sizeof(gen)); 4669 4613 } 4670 4614 4671 - if (!ext4_inode_csum_verify(inode, raw_inode, ei) || 4672 - ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) { 4673 - ext4_error_inode_err(inode, function, line, 0, EFSBADCRC, 4674 - "iget: checksum invalid"); 4615 + if ((!ext4_inode_csum_verify(inode, raw_inode, ei) || 4616 + ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) && 4617 + (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))) { 4618 + ext4_error_inode_err(inode, function, line, 0, 4619 + EFSBADCRC, "iget: checksum invalid"); 4675 4620 ret = -EFSBADCRC; 4676 4621 goto bad_inode; 4677 4622 } ··· 4760 4703 for (block = 0; block < EXT4_N_BLOCKS; block++) 4761 4704 ei->i_data[block] = raw_inode->i_block[block]; 4762 4705 INIT_LIST_HEAD(&ei->i_orphan); 4706 + ext4_fc_init_inode(&ei->vfs_inode); 4763 4707 4764 4708 /* 4765 4709 * Set transaction id's of transactions that have to be committed ··· 4826 4768 goto bad_inode; 4827 4769 } else if (!ext4_has_inline_data(inode)) { 4828 4770 /* validate the block references in the inode */ 4829 - 
if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) || 4830 - (S_ISLNK(inode->i_mode) && 4831 - !ext4_inode_is_fast_symlink(inode))) { 4771 + if (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) && 4772 + (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) || 4773 + (S_ISLNK(inode->i_mode) && 4774 + !ext4_inode_is_fast_symlink(inode)))) { 4832 4775 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) 4833 4776 ret = ext4_ext_check_inode(inode); 4834 4777 else ··· 5030 4971 if (ext4_test_inode_state(inode, EXT4_STATE_NEW)) 5031 4972 memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size); 5032 4973 4974 + err = ext4_inode_blocks_set(handle, raw_inode, ei); 4975 + if (err) { 4976 + spin_unlock(&ei->i_raw_lock); 4977 + goto out_brelse; 4978 + } 4979 + 5033 4980 raw_inode->i_mode = cpu_to_le16(inode->i_mode); 5034 4981 i_uid = i_uid_read(inode); 5035 4982 i_gid = i_gid_read(inode); ··· 5069 5004 EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode); 5070 5005 EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode); 5071 5006 5072 - err = ext4_inode_blocks_set(handle, raw_inode, ei); 5073 - if (err) { 5074 - spin_unlock(&ei->i_raw_lock); 5075 - goto out_brelse; 5076 - } 5077 5007 raw_inode->i_dtime = cpu_to_le32(ei->i_dtime); 5078 5008 raw_inode->i_flags = cpu_to_le32(ei->i_flags & 0xFFFFFFFF); 5079 5009 if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) ··· 5209 5149 if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync) 5210 5150 return 0; 5211 5151 5212 - err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal, 5152 + err = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal, 5213 5153 EXT4_I(inode)->i_sync_tid); 5214 5154 } else { 5215 5155 struct ext4_iloc iloc; 5216 5156 5217 - err = __ext4_get_inode_loc(inode, &iloc, 0); 5157 + err = __ext4_get_inode_loc_noinmem(inode, &iloc); 5218 5158 if (err) 5219 5159 return err; 5220 5160 /* ··· 5338 5278 if (error) 5339 5279 return error; 5340 5280 } 5281 + ext4_fc_start_update(inode); 5341 5282 if ((ia_valid & ATTR_UID && 
!uid_eq(attr->ia_uid, inode->i_uid)) || 5342 5283 (ia_valid & ATTR_GID && !gid_eq(attr->ia_gid, inode->i_gid))) { 5343 5284 handle_t *handle; ··· 5362 5301 5363 5302 if (error) { 5364 5303 ext4_journal_stop(handle); 5304 + ext4_fc_stop_update(inode); 5365 5305 return error; 5366 5306 } 5367 5307 /* Update corresponding info in inode so that everything is in ··· 5385 5323 if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { 5386 5324 struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); 5387 5325 5388 - if (attr->ia_size > sbi->s_bitmap_maxbytes) 5326 + if (attr->ia_size > sbi->s_bitmap_maxbytes) { 5327 + ext4_fc_stop_update(inode); 5389 5328 return -EFBIG; 5329 + } 5390 5330 } 5391 - if (!S_ISREG(inode->i_mode)) 5331 + if (!S_ISREG(inode->i_mode)) { 5332 + ext4_fc_stop_update(inode); 5392 5333 return -EINVAL; 5334 + } 5393 5335 5394 5336 if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size) 5395 5337 inode_inc_iversion(inode); ··· 5417 5351 rc = ext4_break_layouts(inode); 5418 5352 if (rc) { 5419 5353 up_write(&EXT4_I(inode)->i_mmap_sem); 5420 - return rc; 5354 + goto err_out; 5421 5355 } 5422 5356 5423 5357 if (attr->ia_size != inode->i_size) { ··· 5438 5372 inode->i_mtime = current_time(inode); 5439 5373 inode->i_ctime = inode->i_mtime; 5440 5374 } 5375 + 5376 + if (shrink) 5377 + ext4_fc_track_range(inode, 5378 + (attr->ia_size > 0 ? attr->ia_size - 1 : 0) >> 5379 + inode->i_sb->s_blocksize_bits, 5380 + (oldsize > 0 ? oldsize - 1 : 0) >> 5381 + inode->i_sb->s_blocksize_bits); 5382 + else 5383 + ext4_fc_track_range( 5384 + inode, 5385 + (oldsize > 0 ? oldsize - 1 : oldsize) >> 5386 + inode->i_sb->s_blocksize_bits, 5387 + (attr->ia_size > 0 ? 
attr->ia_size - 1 : 0) >> 5388 + inode->i_sb->s_blocksize_bits); 5389 + 5441 5390 down_write(&EXT4_I(inode)->i_data_sem); 5442 5391 EXT4_I(inode)->i_disksize = attr->ia_size; 5443 5392 rc = ext4_mark_inode_dirty(handle, inode); ··· 5511 5430 rc = posix_acl_chmod(inode, inode->i_mode); 5512 5431 5513 5432 err_out: 5514 - ext4_std_error(inode->i_sb, error); 5433 + if (error) 5434 + ext4_std_error(inode->i_sb, error); 5515 5435 if (!error) 5516 5436 error = rc; 5437 + ext4_fc_stop_update(inode); 5517 5438 return error; 5518 5439 } 5519 5440 ··· 5697 5614 put_bh(iloc->bh); 5698 5615 return -EIO; 5699 5616 } 5617 + ext4_fc_track_inode(inode); 5618 + 5700 5619 if (IS_I_VERSION(inode)) 5701 5620 inode_inc_iversion(inode); 5702 5621 ··· 6022 5937 if (IS_ERR(handle)) 6023 5938 return PTR_ERR(handle); 6024 5939 5940 + ext4_fc_mark_ineligible(inode->i_sb, 5941 + EXT4_FC_REASON_JOURNAL_FLAG_CHANGE); 6025 5942 err = ext4_mark_inode_dirty(handle, inode); 6026 5943 ext4_handle_sync(handle); 6027 5944 ext4_journal_stop(handle); ··· 6064 5977 if (err) 6065 5978 goto out_ret; 6066 5979 5980 + /* 5981 + * On data journalling we skip straight to the transaction handle: 5982 + * there's no delalloc; page truncated will be checked later; the 5983 + * early return w/ all buffers mapped (calculates size/len) can't 5984 + * be used; and there's no dioread_nolock, so only ext4_get_block. 5985 + */ 5986 + if (ext4_should_journal_data(inode)) 5987 + goto retry_alloc; 5988 + 6067 5989 /* Delalloc case is easy... */ 6068 5990 if (test_opt(inode->i_sb, DELALLOC) && 6069 - !ext4_should_journal_data(inode) && 6070 5991 !ext4_nonda_switch(inode->i_sb)) { 6071 5992 do { 6072 5993 err = block_page_mkwrite(vma, vmf, ··· 6100 6005 /* 6101 6006 * Return if we have all the buffers mapped. 
This avoids the need to do 6102 6007 * journal_start/journal_stop which can block and take a long time 6008 + * 6009 + * This cannot be done for data journalling, as we have to add the 6010 + * inode to the transaction's list to writeprotect pages on commit. 6103 6011 */ 6104 6012 if (page_has_buffers(page)) { 6105 6013 if (!ext4_walk_page_buffers(NULL, page_buffers(page), ··· 6127 6029 ret = VM_FAULT_SIGBUS; 6128 6030 goto out; 6129 6031 } 6130 - err = block_page_mkwrite(vma, vmf, get_block); 6131 - if (!err && ext4_should_journal_data(inode)) { 6132 - if (ext4_walk_page_buffers(handle, page_buffers(page), 0, 6133 - PAGE_SIZE, NULL, do_journal_get_write_access)) { 6134 - unlock_page(page); 6135 - ret = VM_FAULT_SIGBUS; 6136 - ext4_journal_stop(handle); 6137 - goto out; 6032 + /* 6033 + * Data journalling can't use block_page_mkwrite() because it 6034 + * will set_buffer_dirty() before do_journal_get_write_access() 6035 + * thus might hit warning messages for dirty metadata buffers. 6036 + */ 6037 + if (!ext4_should_journal_data(inode)) { 6038 + err = block_page_mkwrite(vma, vmf, get_block); 6039 + } else { 6040 + lock_page(page); 6041 + size = i_size_read(inode); 6042 + /* Page got truncated from under us? 
*/ 6043 + if (page->mapping != mapping || page_offset(page) > size) { 6044 + ret = VM_FAULT_NOPAGE; 6045 + goto out_error; 6138 6046 } 6139 - ext4_set_inode_state(inode, EXT4_STATE_JDATA); 6047 + 6048 + if (page->index == size >> PAGE_SHIFT) 6049 + len = size & ~PAGE_MASK; 6050 + else 6051 + len = PAGE_SIZE; 6052 + 6053 + err = __block_write_begin(page, 0, len, ext4_get_block); 6054 + if (!err) { 6055 + ret = VM_FAULT_SIGBUS; 6056 + if (ext4_walk_page_buffers(handle, page_buffers(page), 6057 + 0, len, NULL, do_journal_get_write_access)) 6058 + goto out_error; 6059 + if (ext4_walk_page_buffers(handle, page_buffers(page), 6060 + 0, len, NULL, write_end_fn)) 6061 + goto out_error; 6062 + if (ext4_jbd2_inode_add_write(handle, inode, 0, len)) 6063 + goto out_error; 6064 + ext4_set_inode_state(inode, EXT4_STATE_JDATA); 6065 + } else { 6066 + unlock_page(page); 6067 + } 6140 6068 } 6141 6069 ext4_journal_stop(handle); 6142 6070 if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) ··· 6173 6049 up_read(&EXT4_I(inode)->i_mmap_sem); 6174 6050 sb_end_pagefault(inode->i_sb); 6175 6051 return ret; 6052 + out_error: 6053 + unlock_page(page); 6054 + ext4_journal_stop(handle); 6055 + goto out; 6176 6056 } 6177 6057 6178 6058 vm_fault_t ext4_filemap_fault(struct vm_fault *vmf)
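The `IOMAP_WRITE` hunk in `ext4_iomap_begin()` above skips starting a journal transaction when an in-bounds write lands on already-allocated blocks. The decision it makes can be sketched in userspace; `struct toy_inode` and `block_mapped` below are illustrative stand-ins, not kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_inode {
    uint64_t i_size;
    bool (*block_mapped)(uint64_t lblk); /* "is this block allocated?" probe */
};

static bool overwrite_fast_path_ok(const struct toy_inode *inode,
                                   uint64_t offset, uint64_t length,
                                   unsigned int blkbits)
{
    uint64_t first, last, b;

    if (length == 0 || offset + length > inode->i_size)
        return false;           /* may extend the file: allocation possible */
    first = offset >> blkbits;
    last = (offset + length - 1) >> blkbits;
    for (b = first; b <= last; b++)
        if (!inode->block_mapped(b))
            return false;       /* hole or delalloc: take the allocating path */
    return true;                /* pure overwrite: no journal start needed */
}
```

This is why the optimization helps multi-threaded overwrite workloads: concurrent writers no longer serialize on journal handle start for blocks that are already mapped.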
+18 -4
fs/ext4/ioctl.c
··· 86 86 i_size_write(inode2, isize); 87 87 } 88 88 89 - static void reset_inode_seed(struct inode *inode) 89 + void ext4_reset_inode_seed(struct inode *inode) 90 90 { 91 91 struct ext4_inode_info *ei = EXT4_I(inode); 92 92 struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); ··· 165 165 err = -EINVAL; 166 166 goto err_out; 167 167 } 168 + ext4_fc_start_ineligible(sb, EXT4_FC_REASON_SWAP_BOOT); 168 169 169 170 /* Protect extent tree against block allocations via delalloc */ 170 171 ext4_double_down_write_data_sem(inode, inode_bl); ··· 200 199 201 200 inode->i_generation = prandom_u32(); 202 201 inode_bl->i_generation = prandom_u32(); 203 - reset_inode_seed(inode); 204 - reset_inode_seed(inode_bl); 202 + ext4_reset_inode_seed(inode); 203 + ext4_reset_inode_seed(inode_bl); 205 204 206 205 ext4_discard_preallocations(inode, 0); 207 206 ··· 248 247 249 248 err_out1: 250 249 ext4_journal_stop(handle); 250 + ext4_fc_stop_ineligible(sb); 251 251 ext4_double_up_write_data_sem(inode, inode_bl); 252 252 253 253 err_out: ··· 809 807 return error; 810 808 } 811 809 812 - long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) 810 + static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) 813 811 { 814 812 struct inode *inode = file_inode(filp); 815 813 struct super_block *sb = inode->i_sb; ··· 1076 1074 1077 1075 err = ext4_resize_fs(sb, n_blocks_count); 1078 1076 if (EXT4_SB(sb)->s_journal) { 1077 + ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE); 1079 1078 jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); 1080 1079 err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal); 1081 1080 jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); ··· 1309 1306 default: 1310 1307 return -ENOTTY; 1311 1308 } 1309 + } 1310 + 1311 + long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) 1312 + { 1313 + long ret; 1314 + 1315 + ext4_fc_start_update(file_inode(filp)); 1316 + ret = __ext4_ioctl(filp, cmd, arg); 1317 + 
ext4_fc_stop_update(file_inode(filp)); 1318 + 1319 + return ret; 1312 1320 } 1313 1321 1314 1322 #ifdef CONFIG_COMPAT
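The `ext4_ioctl()` change above demotes the old entry point to `__ext4_ioctl()` and brackets it with `ext4_fc_start_update()`/`ext4_fc_stop_update()` so the fast-commit machinery knows an update is in flight. The wrapper shape, modeled with a plain counter standing in for the per-inode fast-commit state (names here are illustrative only):

```c
#include <assert.h>

/* toy counter standing in for the per-inode fast-commit update state */
static int fc_updates_in_flight;

static void fc_start_update(void) { fc_updates_in_flight++; }
static void fc_stop_update(void)  { fc_updates_in_flight--; }

/* placeholder for the real __ext4_ioctl() body */
static long do_ioctl(unsigned int cmd, unsigned long arg)
{
    (void)arg;
    return (long)cmd;
}

static long wrapped_ioctl(unsigned int cmd, unsigned long arg)
{
    long ret;

    fc_start_update();          /* mark: an update is in progress */
    ret = do_ioctl(cmd, arg);
    fc_stop_update();           /* always paired, including error returns */
    return ret;
}
```

The same pairing discipline shows up in the `ext4_setattr()` hunks earlier, where every early `return` had to grow a matching `ext4_fc_stop_update()`.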
+218 -39
fs/ext4/mballoc.c
··· 124 124 * /sys/fs/ext4/<partition>/mb_group_prealloc. The value is represented in 125 125 * terms of number of blocks. If we have mounted the file system with -O 126 126 * stripe=<value> option the group prealloc request is normalized to the 127 - * the smallest multiple of the stripe value (sbi->s_stripe) which is 127 + * smallest multiple of the stripe value (sbi->s_stripe) which is 128 128 * greater than the default mb_group_prealloc. 129 129 * 130 130 * The regular allocator (using the buddy cache) supports a few tunables. ··· 619 619 void *buddy; 620 620 void *buddy2; 621 621 622 - { 623 - static int mb_check_counter; 624 - if (mb_check_counter++ % 100 != 0) 625 - return 0; 626 - } 622 + if (e4b->bd_info->bb_check_counter++ % 10) 623 + return 0; 627 624 628 625 while (order > 1) { 629 626 buddy = mb_find_buddy(e4b, order, &max); ··· 1391 1394 } 1392 1395 } 1393 1396 1394 - /* 1395 - * _________________________________________________________________ */ 1396 - 1397 1397 static inline int mb_buddy_adjust_border(int* bit, void* bitmap, int side) 1398 1398 { 1399 1399 if (mb_test_bit(*bit + side, bitmap)) { ··· 1502 1508 1503 1509 blocknr = ext4_group_first_block_no(sb, e4b->bd_group); 1504 1510 blocknr += EXT4_C2B(sbi, block); 1505 - ext4_grp_locked_error(sb, e4b->bd_group, 1506 - inode ? inode->i_ino : 0, 1507 - blocknr, 1508 - "freeing already freed block " 1509 - "(bit %u); block bitmap corrupt.", 1510 - block); 1511 - ext4_mark_group_bitmap_corrupted(sb, e4b->bd_group, 1511 + if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) { 1512 + ext4_grp_locked_error(sb, e4b->bd_group, 1513 + inode ? 
inode->i_ino : 0, 1514 + blocknr, 1515 + "freeing already freed block (bit %u); block bitmap corrupt.", 1516 + block); 1517 + ext4_mark_group_bitmap_corrupted( 1518 + sb, e4b->bd_group, 1512 1519 EXT4_GROUP_INFO_BBITMAP_CORRUPT); 1520 + } 1513 1521 mb_regenerate_buddy(e4b); 1514 1522 goto done; 1515 1523 } ··· 2015 2019 /* 2016 2020 * IF we have corrupt bitmap, we won't find any 2017 2021 * free blocks even though group info says we 2018 - * we have free blocks 2022 + * have free blocks 2019 2023 */ 2020 2024 ext4_grp_locked_error(sb, e4b->bd_group, 0, 0, 2021 2025 "%d free clusters as per " ··· 3299 3303 } 3300 3304 3301 3305 /* 3306 + * Idempotent helper for Ext4 fast commit replay path to set the state of 3307 + * blocks in bitmaps and update counters. 3308 + */ 3309 + void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block, 3310 + int len, int state) 3311 + { 3312 + struct buffer_head *bitmap_bh = NULL; 3313 + struct ext4_group_desc *gdp; 3314 + struct buffer_head *gdp_bh; 3315 + struct ext4_sb_info *sbi = EXT4_SB(sb); 3316 + ext4_group_t group; 3317 + ext4_grpblk_t blkoff; 3318 + int i, clen, err; 3319 + int already; 3320 + 3321 + clen = EXT4_B2C(sbi, len); 3322 + 3323 + ext4_get_group_no_and_offset(sb, block, &group, &blkoff); 3324 + bitmap_bh = ext4_read_block_bitmap(sb, group); 3325 + if (IS_ERR(bitmap_bh)) { 3326 + err = PTR_ERR(bitmap_bh); 3327 + bitmap_bh = NULL; 3328 + goto out_err; 3329 + } 3330 + 3331 + err = -EIO; 3332 + gdp = ext4_get_group_desc(sb, group, &gdp_bh); 3333 + if (!gdp) 3334 + goto out_err; 3335 + 3336 + ext4_lock_group(sb, group); 3337 + already = 0; 3338 + for (i = 0; i < clen; i++) 3339 + if (!mb_test_bit(blkoff + i, bitmap_bh->b_data) == !state) 3340 + already++; 3341 + 3342 + if (state) 3343 + ext4_set_bits(bitmap_bh->b_data, blkoff, clen); 3344 + else 3345 + mb_test_and_clear_bits(bitmap_bh->b_data, blkoff, clen); 3346 + if (ext4_has_group_desc_csum(sb) && 3347 + (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) { 3348 
+ gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT); 3349 + ext4_free_group_clusters_set(sb, gdp, 3350 + ext4_free_clusters_after_init(sb, 3351 + group, gdp)); 3352 + } 3353 + if (state) 3354 + clen = ext4_free_group_clusters(sb, gdp) - clen + already; 3355 + else 3356 + clen = ext4_free_group_clusters(sb, gdp) + clen - already; 3357 + 3358 + ext4_free_group_clusters_set(sb, gdp, clen); 3359 + ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh); 3360 + ext4_group_desc_csum_set(sb, group, gdp); 3361 + 3362 + ext4_unlock_group(sb, group); 3363 + 3364 + if (sbi->s_log_groups_per_flex) { 3365 + ext4_group_t flex_group = ext4_flex_group(sbi, group); 3366 + 3367 + atomic64_sub(len, 3368 + &sbi_array_rcu_deref(sbi, s_flex_groups, 3369 + flex_group)->free_clusters); 3370 + } 3371 + 3372 + err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh); 3373 + if (err) 3374 + goto out_err; 3375 + sync_dirty_buffer(bitmap_bh); 3376 + err = ext4_handle_dirty_metadata(NULL, NULL, gdp_bh); 3377 + sync_dirty_buffer(gdp_bh); 3378 + 3379 + out_err: 3380 + brelse(bitmap_bh); 3381 + } 3382 + 3383 + /* 3302 3384 * here we normalize request for locality group 3303 3385 * Group request are normalized to s_mb_group_prealloc, which goes to 3304 3386 * s_strip if we set the same via mount option. ··· 4234 4160 struct ext4_buddy e4b; 4235 4161 int err; 4236 4162 int busy = 0; 4237 - int free = 0; 4163 + int free, free_total = 0; 4238 4164 4239 4165 mb_debug(sb, "discard preallocation for group %u\n", group); 4240 4166 if (list_empty(&grp->bb_prealloc_list)) ··· 4262 4188 4263 4189 INIT_LIST_HEAD(&list); 4264 4190 repeat: 4191 + free = 0; 4265 4192 ext4_lock_group(sb, group); 4266 - this_cpu_inc(discard_pa_seq); 4267 4193 list_for_each_entry_safe(pa, tmp, 4268 4194 &grp->bb_prealloc_list, pa_group_list) { 4269 4195 spin_lock(&pa->pa_lock); ··· 4280 4206 /* seems this one can be freed ... 
*/ 4281 4207 ext4_mb_mark_pa_deleted(sb, pa); 4282 4208 4209 + if (!free) 4210 + this_cpu_inc(discard_pa_seq); 4211 + 4283 4212 /* we can trust pa_free ... */ 4284 4213 free += pa->pa_free; 4285 4214 ··· 4290 4213 4291 4214 list_del(&pa->pa_group_list); 4292 4215 list_add(&pa->u.pa_tmp_list, &list); 4293 - } 4294 - 4295 - /* if we still need more blocks and some PAs were used, try again */ 4296 - if (free < needed && busy) { 4297 - busy = 0; 4298 - ext4_unlock_group(sb, group); 4299 - cond_resched(); 4300 - goto repeat; 4301 - } 4302 - 4303 - /* found anything to free? */ 4304 - if (list_empty(&list)) { 4305 - BUG_ON(free != 0); 4306 - mb_debug(sb, "Someone else may have freed PA for this group %u\n", 4307 - group); 4308 - goto out; 4309 4216 } 4310 4217 4311 4218 /* now free all selected PAs */ ··· 4309 4248 call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback); 4310 4249 } 4311 4250 4312 - out: 4251 + free_total += free; 4252 + 4253 + /* if we still need more blocks and some PAs were used, try again */ 4254 + if (free_total < needed && busy) { 4255 + ext4_unlock_group(sb, group); 4256 + cond_resched(); 4257 + busy = 0; 4258 + goto repeat; 4259 + } 4313 4260 ext4_unlock_group(sb, group); 4314 4261 ext4_mb_unload_buddy(&e4b); 4315 4262 put_bh(bitmap_bh); 4316 4263 out_dbg: 4317 4264 mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n", 4318 - free, group, grp->bb_free); 4319 - return free; 4265 + free_total, group, grp->bb_free); 4266 + return free_total; 4320 4267 } 4321 4268 4322 4269 /* ··· 4351 4282 /*BUG_ON(!list_empty(&ei->i_prealloc_list));*/ 4352 4283 return; 4353 4284 } 4285 + 4286 + if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) 4287 + return; 4354 4288 4355 4289 mb_debug(sb, "discard preallocation for inode %lu\n", 4356 4290 inode->i_ino); ··· 4902 4830 return ret; 4903 4831 } 4904 4832 4833 + static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle, 4834 + struct ext4_allocation_request *ar, int *errp); 4835 + 4905 4836 /* 4906 
4837 * Main entry point into mballoc to allocate blocks 4907 4838 * it tries to use preallocation first, then falls back ··· 4926 4851 sbi = EXT4_SB(sb); 4927 4852 4928 4853 trace_ext4_request_blocks(ar); 4854 + if (sbi->s_mount_state & EXT4_FC_REPLAY) 4855 + return ext4_mb_new_blocks_simple(handle, ar, errp); 4929 4856 4930 4857 /* Allow to use superuser reservation for quota file */ 4931 4858 if (ext4_is_quota_file(ar->inode)) ··· 5155 5078 return 0; 5156 5079 } 5157 5080 5081 + /* 5082 + * Simple allocator for Ext4 fast commit replay path. It searches for blocks 5083 + * linearly starting at the goal block and also excludes the blocks which 5084 + * are going to be in use after fast commit replay. 5085 + */ 5086 + static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle, 5087 + struct ext4_allocation_request *ar, int *errp) 5088 + { 5089 + struct buffer_head *bitmap_bh; 5090 + struct super_block *sb = ar->inode->i_sb; 5091 + ext4_group_t group; 5092 + ext4_grpblk_t blkoff; 5093 + int i; 5094 + ext4_fsblk_t goal, block; 5095 + struct ext4_super_block *es = EXT4_SB(sb)->s_es; 5096 + 5097 + goal = ar->goal; 5098 + if (goal < le32_to_cpu(es->s_first_data_block) || 5099 + goal >= ext4_blocks_count(es)) 5100 + goal = le32_to_cpu(es->s_first_data_block); 5101 + 5102 + ar->len = 0; 5103 + ext4_get_group_no_and_offset(sb, goal, &group, &blkoff); 5104 + for (; group < ext4_get_groups_count(sb); group++) { 5105 + bitmap_bh = ext4_read_block_bitmap(sb, group); 5106 + if (IS_ERR(bitmap_bh)) { 5107 + *errp = PTR_ERR(bitmap_bh); 5108 + pr_warn("Failed to read block bitmap\n"); 5109 + return 0; 5110 + } 5111 + 5112 + ext4_get_group_no_and_offset(sb, 5113 + max(ext4_group_first_block_no(sb, group), goal), 5114 + NULL, &blkoff); 5115 + i = mb_find_next_zero_bit(bitmap_bh->b_data, sb->s_blocksize, 5116 + blkoff); 5117 + brelse(bitmap_bh); 5118 + if (i >= sb->s_blocksize) 5119 + continue; 5120 + if (ext4_fc_replay_check_excluded(sb, 5121 + ext4_group_first_block_no(sb, group) 
+ i)) 5122 + continue; 5123 + break; 5124 + } 5125 + 5126 + if (group >= ext4_get_groups_count(sb) && i >= sb->s_blocksize) 5127 + return 0; 5128 + 5129 + block = ext4_group_first_block_no(sb, group) + i; 5130 + ext4_mb_mark_bb(sb, block, 1, 1); 5131 + ar->len = 1; 5132 + 5133 + return block; 5134 + } 5135 + 5136 + static void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block, 5137 + unsigned long count) 5138 + { 5139 + struct buffer_head *bitmap_bh; 5140 + struct super_block *sb = inode->i_sb; 5141 + struct ext4_group_desc *gdp; 5142 + struct buffer_head *gdp_bh; 5143 + ext4_group_t group; 5144 + ext4_grpblk_t blkoff; 5145 + int already_freed = 0, err, i; 5146 + 5147 + ext4_get_group_no_and_offset(sb, block, &group, &blkoff); 5148 + bitmap_bh = ext4_read_block_bitmap(sb, group); 5149 + if (IS_ERR(bitmap_bh)) { 5150 + err = PTR_ERR(bitmap_bh); 5151 + pr_warn("Failed to read block bitmap\n"); 5152 + return; 5153 + } 5154 + gdp = ext4_get_group_desc(sb, group, &gdp_bh); 5155 + if (!gdp) 5156 + return; 5157 + 5158 + for (i = 0; i < count; i++) { 5159 + if (!mb_test_bit(blkoff + i, bitmap_bh->b_data)) 5160 + already_freed++; 5161 + } 5162 + mb_clear_bits(bitmap_bh->b_data, blkoff, count); 5163 + err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh); 5164 + if (err) 5165 + return; 5166 + ext4_free_group_clusters_set( 5167 + sb, gdp, ext4_free_group_clusters(sb, gdp) + 5168 + count - already_freed); 5169 + ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh); 5170 + ext4_group_desc_csum_set(sb, group, gdp); 5171 + ext4_handle_dirty_metadata(NULL, NULL, gdp_bh); 5172 + sync_dirty_buffer(bitmap_bh); 5173 + sync_dirty_buffer(gdp_bh); 5174 + brelse(bitmap_bh); 5175 + } 5176 + 5158 5177 /** 5159 5178 * ext4_free_blocks() -- Free given blocks and update quota 5160 5179 * @handle: handle for this transaction ··· 5277 5104 int err = 0; 5278 5105 int ret; 5279 5106 5107 + sbi = EXT4_SB(sb); 5108 + 5109 + if (sbi->s_mount_state & EXT4_FC_REPLAY) { 5110 + 
ext4_free_blocks_simple(inode, block, count); 5111 + return; 5112 + } 5113 + 5280 5114 might_sleep(); 5281 5115 if (bh) { 5282 5116 if (block) ··· 5292 5112 block = bh->b_blocknr; 5293 5113 } 5294 5114 5295 - sbi = EXT4_SB(sb); 5296 5115 if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) && 5297 5116 !ext4_inode_block_valid(inode, block, count)) { 5298 5117 ext4_error(sb, "Freeing blocks not in datazone - "
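The replay helpers above (`ext4_mb_mark_bb()`, `ext4_free_blocks_simple()`) stay idempotent by first counting how many bits already hold the target state (`already`/`already_freed`) and adjusting the free-cluster count only by the delta. A toy model over a plain `bool` array rather than the kernel's bitmap API:

```c
#include <assert.h>
#include <stdbool.h>

static int mark_range(bool *bitmap, int start, int len, bool state,
                      int *free_clusters)
{
    int already = 0;
    int i;

    for (i = 0; i < len; i++) {
        if (bitmap[start + i] == state)
            already++;          /* bit already holds the target state */
        bitmap[start + i] = state;
    }
    /* only the bits that actually flipped adjust the free count */
    if (state)
        *free_clusters -= len - already;
    else
        *free_clusters += len - already;
    return already;
}
```

Idempotence matters here because fast-commit replay may encounter the same allocation or free more than once; re-applying a mark must not double-count against the group's free clusters.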
+3 -7
fs/ext4/mmp.c
··· 85 85 } 86 86 } 87 87 88 - get_bh(*bh); 89 88 lock_buffer(*bh); 90 - (*bh)->b_end_io = end_buffer_read_sync; 91 - submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, *bh); 92 - wait_on_buffer(*bh); 93 - if (!buffer_uptodate(*bh)) { 94 - ret = -EIO; 89 + ret = ext4_read_bh(*bh, REQ_META | REQ_PRIO, NULL); 90 + if (ret) 95 91 goto warn_exit; 96 - } 92 + 97 93 mmp = (struct mmp_struct *)((*bh)->b_data); 98 94 if (le32_to_cpu(mmp->mmp_magic) != EXT4_MMP_MAGIC) { 99 95 ret = -EFSCORRUPTED;
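Several hunks in this series (here in `fs/ext4/mmp.c`, earlier in `fs/ext4/inode.c` and `fs/ext4/move_extent.c`) replace open-coded `ll_rw_block()`/`submit_bh()` + `wait_on_buffer()` + `buffer_uptodate()` sequences with `ext4_read_bh()`-style helpers that return an error code directly. The consolidated shape, as a userspace sketch where `do_io` stands in for submitting and waiting on the block read:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_bh {
    bool uptodate;
};

static int toy_read_bh(struct toy_bh *bh, bool (*do_io)(struct toy_bh *))
{
    if (bh->uptodate)
        return 0;                 /* cached: no I/O needed */
    bh->uptodate = do_io(bh);     /* submit and wait */
    return bh->uptodate ? 0 : -5; /* -EIO when the read fails */
}
```

Centralizing the read path lets every caller drop its own error handling boilerplate, as the shrinking diffs above show.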
+1 -1
fs/ext4/move_extent.c
··· 215 215 for (i = 0; i < nr; i++) { 216 216 bh = arr[i]; 217 217 if (!bh_uptodate_or_lock(bh)) { 218 - err = bh_submit_read(bh); 218 + err = ext4_read_bh(bh, 0, NULL); 219 219 if (err) 220 220 return err; 221 221 }
+130 -76
fs/ext4/namei.c
··· 2553 2553 * for checking S_ISDIR(inode) (since the INODE_INDEX feature will not be set 2554 2554 * on regular files) and to avoid creating huge/slow non-HTREE directories. 2555 2555 */ 2556 - static void ext4_inc_count(handle_t *handle, struct inode *inode) 2556 + static void ext4_inc_count(struct inode *inode) 2557 2557 { 2558 2558 inc_nlink(inode); 2559 2559 if (is_dx(inode) && ··· 2565 2565 * If a directory had nlink == 1, then we should let it be 1. This indicates 2566 2566 * directory has >EXT4_LINK_MAX subdirs. 2567 2567 */ 2568 - static void ext4_dec_count(handle_t *handle, struct inode *inode) 2568 + static void ext4_dec_count(struct inode *inode) 2569 2569 { 2570 2570 if (!S_ISDIR(inode->i_mode) || inode->i_nlink > 2) 2571 2571 drop_nlink(inode); ··· 2610 2610 bool excl) 2611 2611 { 2612 2612 handle_t *handle; 2613 - struct inode *inode; 2613 + struct inode *inode, *inode_save; 2614 2614 int err, credits, retries = 0; 2615 2615 2616 2616 err = dquot_initialize(dir); ··· 2628 2628 inode->i_op = &ext4_file_inode_operations; 2629 2629 inode->i_fop = &ext4_file_operations; 2630 2630 ext4_set_aops(inode); 2631 + inode_save = inode; 2632 + ihold(inode_save); 2631 2633 err = ext4_add_nondir(handle, dentry, &inode); 2634 + ext4_fc_track_create(inode_save, dentry); 2635 + iput(inode_save); 2632 2636 } 2633 2637 if (handle) 2634 2638 ext4_journal_stop(handle); ··· 2647 2643 umode_t mode, dev_t rdev) 2648 2644 { 2649 2645 handle_t *handle; 2650 - struct inode *inode; 2646 + struct inode *inode, *inode_save; 2651 2647 int err, credits, retries = 0; 2652 2648 2653 2649 err = dquot_initialize(dir); ··· 2664 2660 if (!IS_ERR(inode)) { 2665 2661 init_special_inode(inode, inode->i_mode, rdev); 2666 2662 inode->i_op = &ext4_special_inode_operations; 2663 + inode_save = inode; 2664 + ihold(inode_save); 2667 2665 err = ext4_add_nondir(handle, dentry, &inode); 2666 + if (!err) 2667 + ext4_fc_track_create(inode_save, dentry); 2668 + iput(inode_save); 2668 2669 } 2669 2670 
if (handle) 2670 2671 ext4_journal_stop(handle); ··· 2748 2739 return ext4_next_entry(de, blocksize); 2749 2740 } 2750 2741 2751 - static int ext4_init_new_dir(handle_t *handle, struct inode *dir, 2742 + int ext4_init_new_dir(handle_t *handle, struct inode *dir, 2752 2743 struct inode *inode) 2753 2744 { 2754 2745 struct buffer_head *dir_block = NULL; ··· 2833 2824 iput(inode); 2834 2825 goto out_retry; 2835 2826 } 2836 - ext4_inc_count(handle, dir); 2827 + ext4_fc_track_create(inode, dentry); 2828 + ext4_inc_count(dir); 2829 + 2837 2830 ext4_update_dx_flag(dir); 2838 2831 err = ext4_mark_inode_dirty(handle, dir); 2839 2832 if (err) ··· 3173 3162 retval = ext4_mark_inode_dirty(handle, inode); 3174 3163 if (retval) 3175 3164 goto end_rmdir; 3176 - ext4_dec_count(handle, dir); 3165 + ext4_dec_count(dir); 3177 3166 ext4_update_dx_flag(dir); 3167 + ext4_fc_track_unlink(inode, dentry); 3178 3168 retval = ext4_mark_inode_dirty(handle, dir); 3179 3169 3180 3170 #ifdef CONFIG_UNICODE ··· 3196 3184 return retval; 3197 3185 } 3198 3186 3199 - static int ext4_unlink(struct inode *dir, struct dentry *dentry) 3187 + int __ext4_unlink(struct inode *dir, const struct qstr *d_name, 3188 + struct inode *inode) 3200 3189 { 3201 - int retval; 3202 - struct inode *inode; 3190 + int retval = -ENOENT; 3203 3191 struct buffer_head *bh; 3204 3192 struct ext4_dir_entry_2 *de; 3205 3193 handle_t *handle = NULL; 3194 + int skip_remove_dentry = 0; 3206 3195 3207 - if (unlikely(ext4_forced_shutdown(EXT4_SB(dir->i_sb)))) 3208 - return -EIO; 3196 + bh = ext4_find_entry(dir, d_name, &de, NULL); 3197 + if (IS_ERR(bh)) 3198 + return PTR_ERR(bh); 3209 3199 3210 - trace_ext4_unlink_enter(dir, dentry); 3211 - /* Initialize quotas before so that eventual writes go 3212 - * in separate transaction */ 3213 - retval = dquot_initialize(dir); 3214 - if (retval) 3215 - goto out_trace; 3216 - retval = dquot_initialize(d_inode(dentry)); 3217 - if (retval) 3218 - goto out_trace; 3219 - 3220 - bh = 
ext4_find_entry(dir, &dentry->d_name, &de, NULL); 3221 - if (IS_ERR(bh)) { 3222 - retval = PTR_ERR(bh); 3223 - goto out_trace; 3224 - } 3225 - if (!bh) { 3226 - retval = -ENOENT; 3227 - goto out_trace; 3228 - } 3229 - 3230 - inode = d_inode(dentry); 3200 + if (!bh) 3201 + return -ENOENT; 3231 3202 3232 3203 if (le32_to_cpu(de->inode) != inode->i_ino) { 3233 - retval = -EFSCORRUPTED; 3234 - goto out_bh; 3204 + /* 3205 + * It's okay if we don't find a dentry which matches 3206 + * the inode. That's because it might have gotten 3207 + * renamed to a different inode number. 3208 + */ 3209 + if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) 3210 + skip_remove_dentry = 1; 3211 + else 3212 + goto out_bh; 3235 3213 } 3236 3214 3237 3215 handle = ext4_journal_start(dir, EXT4_HT_DIR, ··· 3234 3232 if (IS_DIRSYNC(dir)) 3235 3233 ext4_handle_sync(handle); 3236 3234 3237 - retval = ext4_delete_entry(handle, dir, de, bh); 3238 - if (retval) 3239 - goto out_handle; 3240 - dir->i_ctime = dir->i_mtime = current_time(dir); 3241 - ext4_update_dx_flag(dir); 3242 - retval = ext4_mark_inode_dirty(handle, dir); 3243 - if (retval) 3244 - goto out_handle; 3235 + if (!skip_remove_dentry) { 3236 + retval = ext4_delete_entry(handle, dir, de, bh); 3237 + if (retval) 3238 + goto out_handle; 3239 + dir->i_ctime = dir->i_mtime = current_time(dir); 3240 + ext4_update_dx_flag(dir); 3241 + retval = ext4_mark_inode_dirty(handle, dir); 3242 + if (retval) 3243 + goto out_handle; 3244 + } else { 3245 + retval = 0; 3246 + } 3245 3247 if (inode->i_nlink == 0) 3246 3248 ext4_warning_inode(inode, "Deleting file '%.*s' with no links", 3247 - dentry->d_name.len, dentry->d_name.name); 3249 + d_name->len, d_name->name); 3248 3250 else 3249 3251 drop_nlink(inode); 3250 3252 if (!inode->i_nlink) ··· 3256 3250 inode->i_ctime = current_time(inode); 3257 3251 retval = ext4_mark_inode_dirty(handle, inode); 3258 3252 3253 + out_handle: 3254 + ext4_journal_stop(handle); 3255 + out_bh: 3256 + brelse(bh); 3257 + 
return retval; 3258 + } 3259 + 3260 + static int ext4_unlink(struct inode *dir, struct dentry *dentry) 3261 + { 3262 + int retval; 3263 + 3264 + if (unlikely(ext4_forced_shutdown(EXT4_SB(dir->i_sb)))) 3265 + return -EIO; 3266 + 3267 + trace_ext4_unlink_enter(dir, dentry); 3268 + /* 3269 + * Initialize quotas before so that eventual writes go 3270 + * in separate transaction 3271 + */ 3272 + retval = dquot_initialize(dir); 3273 + if (retval) 3274 + goto out_trace; 3275 + retval = dquot_initialize(d_inode(dentry)); 3276 + if (retval) 3277 + goto out_trace; 3278 + 3279 + retval = __ext4_unlink(dir, &dentry->d_name, d_inode(dentry)); 3280 + if (!retval) 3281 + ext4_fc_track_unlink(d_inode(dentry), dentry); 3259 3282 #ifdef CONFIG_UNICODE 3260 3283 /* VFS negative dentries are incompatible with Encoding and 3261 3284 * Case-insensitiveness. Eventually we'll want avoid ··· 3296 3261 d_invalidate(dentry); 3297 3262 #endif 3298 3263 3299 - out_handle: 3300 - ext4_journal_stop(handle); 3301 - out_bh: 3302 - brelse(bh); 3303 3264 out_trace: 3304 3265 trace_ext4_unlink_exit(dentry, retval); 3305 3266 return retval; ··· 3376 3345 */ 3377 3346 drop_nlink(inode); 3378 3347 err = ext4_orphan_add(handle, inode); 3379 - ext4_journal_stop(handle); 3348 + if (handle) 3349 + ext4_journal_stop(handle); 3380 3350 handle = NULL; 3381 3351 if (err) 3382 3352 goto err_drop_inode; ··· 3431 3399 return err; 3432 3400 } 3433 3401 3434 - static int ext4_link(struct dentry *old_dentry, 3435 - struct inode *dir, struct dentry *dentry) 3402 + int __ext4_link(struct inode *dir, struct inode *inode, struct dentry *dentry) 3436 3403 { 3437 3404 handle_t *handle; 3438 - struct inode *inode = d_inode(old_dentry); 3439 3405 int err, retries = 0; 3440 - 3441 - if (inode->i_nlink >= EXT4_LINK_MAX) 3442 - return -EMLINK; 3443 - 3444 - err = fscrypt_prepare_link(old_dentry, dir, dentry); 3445 - if (err) 3446 - return err; 3447 - 3448 - if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) && 3449 - 
(!projid_eq(EXT4_I(dir)->i_projid, 3450 - EXT4_I(old_dentry->d_inode)->i_projid))) 3451 - return -EXDEV; 3452 - 3453 - err = dquot_initialize(dir); 3454 - if (err) 3455 - return err; 3456 - 3457 3406 retry: 3458 3407 handle = ext4_journal_start(dir, EXT4_HT_DIR, 3459 3408 (EXT4_DATA_TRANS_BLOCKS(dir->i_sb) + ··· 3446 3433 ext4_handle_sync(handle); 3447 3434 3448 3435 inode->i_ctime = current_time(inode); 3449 - ext4_inc_count(handle, inode); 3436 + ext4_inc_count(inode); 3450 3437 ihold(inode); 3451 3438 3452 3439 err = ext4_add_entry(handle, dentry, inode); 3453 3440 if (!err) { 3441 + ext4_fc_track_link(inode, dentry); 3454 3442 err = ext4_mark_inode_dirty(handle, inode); 3455 3443 /* this can happen only for tmpfile being 3456 3444 * linked the first time ··· 3469 3455 return err; 3470 3456 } 3471 3457 3458 + static int ext4_link(struct dentry *old_dentry, 3459 + struct inode *dir, struct dentry *dentry) 3460 + { 3461 + struct inode *inode = d_inode(old_dentry); 3462 + int err; 3463 + 3464 + if (inode->i_nlink >= EXT4_LINK_MAX) 3465 + return -EMLINK; 3466 + 3467 + err = fscrypt_prepare_link(old_dentry, dir, dentry); 3468 + if (err) 3469 + return err; 3470 + 3471 + if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) && 3472 + (!projid_eq(EXT4_I(dir)->i_projid, 3473 + EXT4_I(old_dentry->d_inode)->i_projid))) 3474 + return -EXDEV; 3475 + 3476 + err = dquot_initialize(dir); 3477 + if (err) 3478 + return err; 3479 + return __ext4_link(dir, inode, dentry); 3480 + } 3472 3481 3473 3482 /* 3474 3483 * Try to find buffer head where contains the parent block. 
··· 3667 3630 { 3668 3631 if (ent->dir_nlink_delta) { 3669 3632 if (ent->dir_nlink_delta == -1) 3670 - ext4_dec_count(handle, ent->dir); 3633 + ext4_dec_count(ent->dir); 3671 3634 else 3672 - ext4_inc_count(handle, ent->dir); 3635 + ext4_inc_count(ent->dir); 3673 3636 ext4_mark_inode_dirty(handle, ent->dir); 3674 3637 } 3675 3638 } ··· 3881 3844 } 3882 3845 3883 3846 if (new.inode) { 3884 - ext4_dec_count(handle, new.inode); 3847 + ext4_dec_count(new.inode); 3885 3848 new.inode->i_ctime = current_time(new.inode); 3886 3849 } 3887 3850 old.dir->i_ctime = old.dir->i_mtime = current_time(old.dir); ··· 3891 3854 if (retval) 3892 3855 goto end_rename; 3893 3856 3894 - ext4_dec_count(handle, old.dir); 3857 + ext4_dec_count(old.dir); 3895 3858 if (new.inode) { 3896 3859 /* checked ext4_empty_dir above, can't have another 3897 3860 * parent, ext4_dec_count() won't work for many-linked 3898 3861 * dirs */ 3899 3862 clear_nlink(new.inode); 3900 3863 } else { 3901 - ext4_inc_count(handle, new.dir); 3864 + ext4_inc_count(new.dir); 3902 3865 ext4_update_dx_flag(new.dir); 3903 3866 retval = ext4_mark_inode_dirty(handle, new.dir); 3904 3867 if (unlikely(retval)) ··· 3908 3871 retval = ext4_mark_inode_dirty(handle, old.dir); 3909 3872 if (unlikely(retval)) 3910 3873 goto end_rename; 3874 + 3875 + if (S_ISDIR(old.inode->i_mode)) { 3876 + /* 3877 + * We disable fast commits here that's because the 3878 + * replay code is not yet capable of changing dot dot 3879 + * dirents in directories. 
3880 + */ 3881 + ext4_fc_mark_ineligible(old.inode->i_sb, 3882 + EXT4_FC_REASON_RENAME_DIR); 3883 + } else { 3884 + if (new.inode) 3885 + ext4_fc_track_unlink(new.inode, new.dentry); 3886 + ext4_fc_track_link(old.inode, new.dentry); 3887 + ext4_fc_track_unlink(old.inode, old.dentry); 3888 + } 3889 + 3911 3890 if (new.inode) { 3912 3891 retval = ext4_mark_inode_dirty(handle, new.inode); 3913 3892 if (unlikely(retval)) ··· 4067 4014 retval = ext4_mark_inode_dirty(handle, new.inode); 4068 4015 if (unlikely(retval)) 4069 4016 goto end_rename; 4070 - 4017 + ext4_fc_mark_ineligible(new.inode->i_sb, 4018 + EXT4_FC_REASON_CROSS_RENAME); 4071 4019 if (old.dir_bh) { 4072 4020 retval = ext4_rename_dir_finish(handle, &old, new.dir->i_ino); 4073 4021 if (retval)
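The namei.c changes split `ext4_unlink()` into a thin VFS wrapper and a `__ext4_unlink()` core that takes a bare name and inode, so the fast-commit replay path can reuse it; during replay, a directory entry that now points at a different inode (re-pointed by a later rename) is tolerated via `skip_remove_dentry`. A hypothetical userspace sketch of that core logic (names and the `-EIO` error code are illustrative):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Sketch: the core unlink helper operates on primitives (name + inode
 * number) rather than a struct dentry, and only the replay path is
 * allowed to skip removing an entry whose inode no longer matches. */

struct dirent_sim {
    const char *name;
    unsigned long ino;
    bool present;
};

static int core_unlink(struct dirent_sim *de, const char *name,
                       unsigned long ino, bool replaying)
{
    bool skip_remove = false;

    if (!de->present || strcmp(de->name, name) != 0)
        return -ENOENT;
    if (de->ino != ino) {
        /* Entry was re-pointed by a later rename: only replay may skip. */
        if (!replaying)
            return -EIO;        /* treated as corruption outside replay */
        skip_remove = true;
    }
    if (!skip_remove)
        de->present = false;    /* actually remove the directory entry */
    return 0;
}
```

The design choice here is that replay works from a log of logical operations, so a stale name-to-inode mapping is an expected intermediate state, not corruption.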
+8 -6
fs/ext4/resize.c
··· 843 843 844 844 BUFFER_TRACE(dind, "get_write_access"); 845 845 err = ext4_journal_get_write_access(handle, dind); 846 - if (unlikely(err)) 846 + if (unlikely(err)) { 847 847 ext4_std_error(sb, err); 848 + goto errout; 849 + } 848 850 849 851 /* ext4_reserve_inode_write() gets a reference on the iloc */ 850 852 err = ext4_reserve_inode_write(handle, inode, &iloc); ··· 1245 1243 if (unlikely(!bh)) 1246 1244 return NULL; 1247 1245 if (!bh_uptodate_or_lock(bh)) { 1248 - if (bh_submit_read(bh) < 0) { 1246 + if (ext4_read_bh(bh, 0, NULL) < 0) { 1249 1247 brelse(bh); 1250 1248 return NULL; 1251 1249 } ··· 1808 1806 o_blocks_count + add, add); 1809 1807 1810 1808 /* See if the device is actually as big as what was requested */ 1811 - bh = sb_bread(sb, o_blocks_count + add - 1); 1812 - if (!bh) { 1809 + bh = ext4_sb_bread(sb, o_blocks_count + add - 1, 0); 1810 + if (IS_ERR(bh)) { 1813 1811 ext4_warning(sb, "can't read last block, resize aborted"); 1814 1812 return -ENOSPC; 1815 1813 } ··· 1934 1932 int meta_bg; 1935 1933 1936 1934 /* See if the device is actually as big as what was requested */ 1937 - bh = sb_bread(sb, n_blocks_count - 1); 1938 - if (!bh) { 1935 + bh = ext4_sb_bread(sb, n_blocks_count - 1, 0); 1936 + if (IS_ERR(bh)) { 1939 1937 ext4_warning(sb, "can't read last block, resize aborted"); 1940 1938 return -ENOSPC; 1941 1939 }
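The first resize.c hunk is a classic error-path fix: when `ext4_journal_get_write_access()` fails, the function previously reported the error but fell through and kept operating on buffers it had no write access to; the patch adds the missing `goto errout`. A minimal userspace sketch of the goto-cleanup pattern (all names simulated):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Sketch: on failure to acquire access, jump to the common cleanup
 * label instead of falling through into work that assumed success. */

static bool cleanup_ran;

static int get_write_access_sim(bool fail)
{
    return fail ? -EIO : 0;
}

static int grow_sim(bool fail_access, int *steps_done)
{
    int err;

    *steps_done = 0;
    err = get_write_access_sim(fail_access);
    if (err)
        goto errout;            /* the missing jump the patch adds */

    (*steps_done)++;            /* work that requires write access */
errout:
    cleanup_ran = true;         /* release handles, brelse buffers, ... */
    return err;
}
```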
+307 -45
fs/ext4/super.c
··· 141 141 MODULE_ALIAS("ext3"); 142 142 #define IS_EXT3_SB(sb) ((sb)->s_bdev->bd_holder == &ext3_fs_type) 143 143 144 + 145 + static inline void __ext4_read_bh(struct buffer_head *bh, int op_flags, 146 + bh_end_io_t *end_io) 147 + { 148 + /* 149 + * buffer's verified bit is no longer valid after reading from 150 + * disk again due to write out error, clear it to make sure we 151 + * recheck the buffer contents. 152 + */ 153 + clear_buffer_verified(bh); 154 + 155 + bh->b_end_io = end_io ? end_io : end_buffer_read_sync; 156 + get_bh(bh); 157 + submit_bh(REQ_OP_READ, op_flags, bh); 158 + } 159 + 160 + void ext4_read_bh_nowait(struct buffer_head *bh, int op_flags, 161 + bh_end_io_t *end_io) 162 + { 163 + BUG_ON(!buffer_locked(bh)); 164 + 165 + if (ext4_buffer_uptodate(bh)) { 166 + unlock_buffer(bh); 167 + return; 168 + } 169 + __ext4_read_bh(bh, op_flags, end_io); 170 + } 171 + 172 + int ext4_read_bh(struct buffer_head *bh, int op_flags, bh_end_io_t *end_io) 173 + { 174 + BUG_ON(!buffer_locked(bh)); 175 + 176 + if (ext4_buffer_uptodate(bh)) { 177 + unlock_buffer(bh); 178 + return 0; 179 + } 180 + 181 + __ext4_read_bh(bh, op_flags, end_io); 182 + 183 + wait_on_buffer(bh); 184 + if (buffer_uptodate(bh)) 185 + return 0; 186 + return -EIO; 187 + } 188 + 189 + int ext4_read_bh_lock(struct buffer_head *bh, int op_flags, bool wait) 190 + { 191 + if (trylock_buffer(bh)) { 192 + if (wait) 193 + return ext4_read_bh(bh, op_flags, NULL); 194 + ext4_read_bh_nowait(bh, op_flags, NULL); 195 + return 0; 196 + } 197 + if (wait) { 198 + wait_on_buffer(bh); 199 + if (buffer_uptodate(bh)) 200 + return 0; 201 + return -EIO; 202 + } 203 + return 0; 204 + } 205 + 144 206 /* 145 - * This works like sb_bread() except it uses ERR_PTR for error 207 + * This works like __bread_gfp() except it uses ERR_PTR for error 146 208 * returns. Currently with sb_bread it's impossible to distinguish 147 209 * between ENOMEM and EIO situations (since both result in a NULL 148 210 * return. 
149 211 */ 150 - struct buffer_head * 151 - ext4_sb_bread(struct super_block *sb, sector_t block, int op_flags) 212 + static struct buffer_head *__ext4_sb_bread_gfp(struct super_block *sb, 213 + sector_t block, int op_flags, 214 + gfp_t gfp) 152 215 { 153 - struct buffer_head *bh = sb_getblk(sb, block); 216 + struct buffer_head *bh; 217 + int ret; 154 218 219 + bh = sb_getblk_gfp(sb, block, gfp); 155 220 if (bh == NULL) 156 221 return ERR_PTR(-ENOMEM); 157 222 if (ext4_buffer_uptodate(bh)) 158 223 return bh; 159 - ll_rw_block(REQ_OP_READ, REQ_META | op_flags, 1, &bh); 160 - wait_on_buffer(bh); 161 - if (buffer_uptodate(bh)) 162 - return bh; 163 - put_bh(bh); 164 - return ERR_PTR(-EIO); 224 + 225 + ret = ext4_read_bh_lock(bh, REQ_META | op_flags, true); 226 + if (ret) { 227 + put_bh(bh); 228 + return ERR_PTR(ret); 229 + } 230 + return bh; 231 + } 232 + 233 + struct buffer_head *ext4_sb_bread(struct super_block *sb, sector_t block, 234 + int op_flags) 235 + { 236 + return __ext4_sb_bread_gfp(sb, block, op_flags, __GFP_MOVABLE); 237 + } 238 + 239 + struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb, 240 + sector_t block) 241 + { 242 + return __ext4_sb_bread_gfp(sb, block, 0, 0); 243 + } 244 + 245 + void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block) 246 + { 247 + struct buffer_head *bh = sb_getblk_gfp(sb, block, 0); 248 + 249 + if (likely(bh)) { 250 + ext4_read_bh_lock(bh, REQ_RAHEAD, false); 251 + brelse(bh); 252 + } 165 253 } 166 254 167 255 static int ext4_verify_csum_type(struct super_block *sb, ··· 289 201 if (!ext4_has_metadata_csum(sb)) 290 202 return; 291 203 204 + /* 205 + * Locking the superblock prevents the scenario 206 + * where: 207 + * 1) a first thread pauses during checksum calculation. 
208 + * 2) a second thread updates the superblock, recalculates 209 + * the checksum, and updates s_checksum 210 + * 3) the first thread resumes and finishes its checksum calculation 211 + * and updates s_checksum with a potentially stale or torn value. 212 + */ 213 + lock_buffer(EXT4_SB(sb)->s_sbh); 292 214 es->s_checksum = ext4_superblock_csum(sb, es); 215 + unlock_buffer(EXT4_SB(sb)->s_sbh); 293 216 } 294 217 295 218 ext4_fsblk_t ext4_block_bitmap(struct super_block *sb, ··· 569 470 spin_lock(&sbi->s_md_lock); 570 471 } 571 472 spin_unlock(&sbi->s_md_lock); 473 + } 474 + 475 + /* 476 + * This writepage callback for write_cache_pages() 477 + * takes care of a few cases after page cleaning. 478 + * 479 + * write_cache_pages() already checks for dirty pages 480 + * and calls clear_page_dirty_for_io(), which we want, 481 + * to write protect the pages. 482 + * 483 + * However, we may have to redirty a page (see below.) 484 + */ 485 + static int ext4_journalled_writepage_callback(struct page *page, 486 + struct writeback_control *wbc, 487 + void *data) 488 + { 489 + transaction_t *transaction = (transaction_t *) data; 490 + struct buffer_head *bh, *head; 491 + struct journal_head *jh; 492 + 493 + bh = head = page_buffers(page); 494 + do { 495 + /* 496 + * We have to redirty a page in these cases: 497 + * 1) If buffer is dirty, it means the page was dirty because it 498 + * contains a buffer that needs checkpointing. So the dirty bit 499 + * needs to be preserved so that checkpointing writes the buffer 500 + * properly. 501 + * 2) If buffer is not part of the committing transaction 502 + * (we may have just accidentally come across this buffer because 503 + * inode range tracking is not exact) or if the currently running 504 + * transaction already contains this buffer as well, dirty bit 505 + * needs to be preserved so that the buffer gets writeprotected 506 + * properly on running transaction's commit. 
507 + */ 508 + jh = bh2jh(bh); 509 + if (buffer_dirty(bh) || 510 + (jh && (jh->b_transaction != transaction || 511 + jh->b_next_transaction))) { 512 + redirty_page_for_writepage(wbc, page); 513 + goto out; 514 + } 515 + } while ((bh = bh->b_this_page) != head); 516 + 517 + out: 518 + return AOP_WRITEPAGE_ACTIVATE; 519 + } 520 + 521 + static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode) 522 + { 523 + struct address_space *mapping = jinode->i_vfs_inode->i_mapping; 524 + struct writeback_control wbc = { 525 + .sync_mode = WB_SYNC_ALL, 526 + .nr_to_write = LONG_MAX, 527 + .range_start = jinode->i_dirty_start, 528 + .range_end = jinode->i_dirty_end, 529 + }; 530 + 531 + return write_cache_pages(mapping, &wbc, 532 + ext4_journalled_writepage_callback, 533 + jinode->i_transaction); 534 + } 535 + 536 + static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode) 537 + { 538 + int ret; 539 + 540 + if (ext4_should_journal_data(jinode->i_vfs_inode)) 541 + ret = ext4_journalled_submit_inode_data_buffers(jinode); 542 + else 543 + ret = jbd2_journal_submit_inode_data_buffers(jinode); 544 + 545 + return ret; 546 + } 547 + 548 + static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode) 549 + { 550 + int ret = 0; 551 + 552 + if (!ext4_should_journal_data(jinode->i_vfs_inode)) 553 + ret = jbd2_journal_finish_inode_data_buffers(jinode); 554 + 555 + return ret; 572 556 } 573 557 574 558 static bool system_going_down(void) ··· 1121 939 static void ext4_blkdev_remove(struct ext4_sb_info *sbi) 1122 940 { 1123 941 struct block_device *bdev; 1124 - bdev = sbi->journal_bdev; 942 + bdev = sbi->s_journal_bdev; 1125 943 if (bdev) { 1126 944 ext4_blkdev_put(bdev); 1127 - sbi->journal_bdev = NULL; 945 + sbi->s_journal_bdev = NULL; 1128 946 } 1129 947 } 1130 948 ··· 1255 1073 1256 1074 sync_blockdev(sb->s_bdev); 1257 1075 invalidate_bdev(sb->s_bdev); 1258 - if (sbi->journal_bdev && sbi->journal_bdev != sb->s_bdev) { 1076 + if 
(sbi->s_journal_bdev && sbi->s_journal_bdev != sb->s_bdev) { 1259 1077 /* 1260 1078 * Invalidate the journal device's buffers. We don't want them 1261 1079 * floating about in memory - the physical journal device may 1262 1080 * hotswapped, and it breaks the `ro-after' testing code. 1263 1081 */ 1264 - sync_blockdev(sbi->journal_bdev); 1265 - invalidate_bdev(sbi->journal_bdev); 1082 + sync_blockdev(sbi->s_journal_bdev); 1083 + invalidate_bdev(sbi->s_journal_bdev); 1266 1084 ext4_blkdev_remove(sbi); 1267 1085 } 1268 1086 ··· 1331 1149 ei->i_datasync_tid = 0; 1332 1150 atomic_set(&ei->i_unwritten, 0); 1333 1151 INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work); 1152 + ext4_fc_init_inode(&ei->vfs_inode); 1153 + mutex_init(&ei->i_fc_lock); 1334 1154 return &ei->vfs_inode; 1335 1155 } 1336 1156 ··· 1350 1166 static void ext4_free_in_core_inode(struct inode *inode) 1351 1167 { 1352 1168 fscrypt_free_inode(inode); 1169 + if (!list_empty(&(EXT4_I(inode)->i_fc_list))) { 1170 + pr_warn("%s: inode %ld still in fc list", 1171 + __func__, inode->i_ino); 1172 + } 1353 1173 kmem_cache_free(ext4_inode_cachep, EXT4_I(inode)); 1354 1174 } 1355 1175 ··· 1379 1191 init_rwsem(&ei->i_data_sem); 1380 1192 init_rwsem(&ei->i_mmap_sem); 1381 1193 inode_init_once(&ei->vfs_inode); 1194 + ext4_fc_init_inode(&ei->vfs_inode); 1382 1195 } 1383 1196 1384 1197 static int __init init_inodecache(void) ··· 1408 1219 1409 1220 void ext4_clear_inode(struct inode *inode) 1410 1221 { 1222 + ext4_fc_del(inode); 1411 1223 invalidate_inode_buffers(inode); 1412 1224 clear_inode(inode); 1413 1225 ext4_discard_preallocations(inode, 0); ··· 1716 1526 Opt_dioread_nolock, Opt_dioread_lock, 1717 1527 Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable, 1718 1528 Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache, 1719 - Opt_prefetch_block_bitmaps, 1529 + Opt_prefetch_block_bitmaps, Opt_no_fc, 1530 + #ifdef CONFIG_EXT4_DEBUG 1531 + Opt_fc_debug_max_replay, 1532 + #endif 1533 + 
Opt_fc_debug_force 1720 1534 }; 1721 1535 1722 1536 static const match_table_t tokens = { ··· 1807 1613 {Opt_init_itable, "init_itable=%u"}, 1808 1614 {Opt_init_itable, "init_itable"}, 1809 1615 {Opt_noinit_itable, "noinit_itable"}, 1616 + {Opt_no_fc, "no_fc"}, 1617 + {Opt_fc_debug_force, "fc_debug_force"}, 1618 + #ifdef CONFIG_EXT4_DEBUG 1619 + {Opt_fc_debug_max_replay, "fc_debug_max_replay=%u"}, 1620 + #endif 1810 1621 {Opt_max_dir_size_kb, "max_dir_size_kb=%u"}, 1811 1622 {Opt_test_dummy_encryption, "test_dummy_encryption=%s"}, 1812 1623 {Opt_test_dummy_encryption, "test_dummy_encryption"}, ··· 1938 1739 #define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3) 1939 1740 #define MOPT_STRING 0x0400 1940 1741 #define MOPT_SKIP 0x0800 1742 + #define MOPT_2 0x1000 1941 1743 1942 1744 static const struct mount_opts { 1943 1745 int token; ··· 2039 1839 {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET}, 2040 1840 {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS, 2041 1841 MOPT_SET}, 1842 + {Opt_no_fc, EXT4_MOUNT2_JOURNAL_FAST_COMMIT, 1843 + MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY}, 1844 + {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT, 1845 + MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY}, 1846 + #ifdef CONFIG_EXT4_DEBUG 1847 + {Opt_fc_debug_max_replay, 0, MOPT_GTE0}, 1848 + #endif 2042 1849 {Opt_err, 0, 0} 2043 1850 }; 2044 1851 ··· 2255 2048 sbi->s_li_wait_mult = arg; 2256 2049 } else if (token == Opt_max_dir_size_kb) { 2257 2050 sbi->s_max_dir_size_kb = arg; 2051 + #ifdef CONFIG_EXT4_DEBUG 2052 + } else if (token == Opt_fc_debug_max_replay) { 2053 + sbi->s_fc_debug_max_replay = arg; 2054 + #endif 2258 2055 } else if (token == Opt_stripe) { 2259 2056 sbi->s_stripe = arg; 2260 2057 } else if (token == Opt_resuid) { ··· 2427 2216 WARN_ON(1); 2428 2217 return -1; 2429 2218 } 2430 - if (arg != 0) 2431 - sbi->s_mount_opt |= m->mount_opt; 2432 - else 2433 - sbi->s_mount_opt &= ~m->mount_opt; 2219 + if (m->flags & MOPT_2) { 2220 + if (arg != 0) 2221 + sbi->s_mount_opt2 |= 
m->mount_opt; 2222 + else 2223 + sbi->s_mount_opt2 &= ~m->mount_opt; 2224 + } else { 2225 + if (arg != 0) 2226 + sbi->s_mount_opt |= m->mount_opt; 2227 + else 2228 + sbi->s_mount_opt &= ~m->mount_opt; 2229 + } 2434 2230 } 2435 2231 return 1; 2436 2232 } ··· 2653 2435 } else if (test_opt2(sb, DAX_INODE)) { 2654 2436 SEQ_OPTS_PUTS("dax=inode"); 2655 2437 } 2438 + 2439 + if (test_opt2(sb, JOURNAL_FAST_COMMIT)) 2440 + SEQ_OPTS_PUTS("fast_commit"); 2656 2441 2657 2442 ext4_show_quota_options(seq, sb); 2658 2443 return 0; ··· 3975 3754 * Add the internal journal blocks whether the journal has been 3976 3755 * loaded or not 3977 3756 */ 3978 - if (sbi->s_journal && !sbi->journal_bdev) 3757 + if (sbi->s_journal && !sbi->s_journal_bdev) 3979 3758 overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_maxlen); 3980 3759 else if (ext4_has_feature_journal(sb) && !sbi->s_journal && j_inum) { 3981 3760 /* j_inum for internal journal is non-zero */ ··· 4089 3868 logical_sb_block = sb_block; 4090 3869 } 4091 3870 4092 - if (!(bh = sb_bread_unmovable(sb, logical_sb_block))) { 3871 + bh = ext4_sb_bread_unmovable(sb, logical_sb_block); 3872 + if (IS_ERR(bh)) { 4093 3873 ext4_msg(sb, KERN_ERR, "unable to read superblock"); 3874 + ret = PTR_ERR(bh); 3875 + bh = NULL; 4094 3876 goto out_fail; 4095 3877 } 4096 3878 /* ··· 4160 3936 #ifdef CONFIG_EXT4_FS_POSIX_ACL 4161 3937 set_opt(sb, POSIX_ACL); 4162 3938 #endif 3939 + if (ext4_has_feature_fast_commit(sb)) 3940 + set_opt2(sb, JOURNAL_FAST_COMMIT); 4163 3941 /* don't forget to enable journal_csum when metadata_csum is enabled. 
*/ 4164 3942 if (ext4_has_metadata_csum(sb)) 4165 3943 set_opt(sb, JOURNAL_CHECKSUM); ··· 4491 4265 brelse(bh); 4492 4266 logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE; 4493 4267 offset = do_div(logical_sb_block, blocksize); 4494 - bh = sb_bread_unmovable(sb, logical_sb_block); 4495 - if (!bh) { 4268 + bh = ext4_sb_bread_unmovable(sb, logical_sb_block); 4269 + if (IS_ERR(bh)) { 4496 4270 ext4_msg(sb, KERN_ERR, 4497 4271 "Can't read superblock on 2nd try"); 4272 + ret = PTR_ERR(bh); 4273 + bh = NULL; 4498 4274 goto failed_mount; 4499 4275 } 4500 4276 es = (struct ext4_super_block *)(bh->b_data + offset); ··· 4708 4480 /* Pre-read the descriptors into the buffer cache */ 4709 4481 for (i = 0; i < db_count; i++) { 4710 4482 block = descriptor_loc(sb, logical_sb_block, i); 4711 - sb_breadahead_unmovable(sb, block); 4483 + ext4_sb_breadahead_unmovable(sb, block); 4712 4484 } 4713 4485 4714 4486 for (i = 0; i < db_count; i++) { 4715 4487 struct buffer_head *bh; 4716 4488 4717 4489 block = descriptor_loc(sb, logical_sb_block, i); 4718 - bh = sb_bread_unmovable(sb, block); 4719 - if (!bh) { 4490 + bh = ext4_sb_bread_unmovable(sb, block); 4491 + if (IS_ERR(bh)) { 4720 4492 ext4_msg(sb, KERN_ERR, 4721 4493 "can't read group descriptor %d", i); 4722 4494 db_count = i; 4495 + ret = PTR_ERR(bh); 4496 + bh = NULL; 4723 4497 goto failed_mount2; 4724 4498 } 4725 4499 rcu_read_lock(); ··· 4768 4538 4769 4539 INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */ 4770 4540 mutex_init(&sbi->s_orphan_lock); 4541 + 4542 + /* Initialize fast commit stuff */ 4543 + atomic_set(&sbi->s_fc_subtid, 0); 4544 + atomic_set(&sbi->s_fc_ineligible_updates, 0); 4545 + INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_MAIN]); 4546 + INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_STAGING]); 4547 + INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_MAIN]); 4548 + INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_STAGING]); 4549 + sbi->s_fc_bytes = 0; 4550 + sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE; 4551 + sbi->s_mount_state &= 
~EXT4_FC_COMMITTING; 4552 + spin_lock_init(&sbi->s_fc_lock); 4553 + memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats)); 4554 + sbi->s_fc_replay_state.fc_regions = NULL; 4555 + sbi->s_fc_replay_state.fc_regions_size = 0; 4556 + sbi->s_fc_replay_state.fc_regions_used = 0; 4557 + sbi->s_fc_replay_state.fc_regions_valid = 0; 4558 + sbi->s_fc_replay_state.fc_modified_inodes = NULL; 4559 + sbi->s_fc_replay_state.fc_modified_inodes_size = 0; 4560 + sbi->s_fc_replay_state.fc_modified_inodes_used = 0; 4771 4561 4772 4562 sb->s_root = NULL; 4773 4563 ··· 4838 4588 sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM; 4839 4589 clear_opt(sb, JOURNAL_CHECKSUM); 4840 4590 clear_opt(sb, DATA_FLAGS); 4591 + clear_opt2(sb, JOURNAL_FAST_COMMIT); 4841 4592 sbi->s_journal = NULL; 4842 4593 needs_recovery = 0; 4843 4594 goto no_journal; ··· 4897 4646 set_task_ioprio(sbi->s_journal->j_task, journal_ioprio); 4898 4647 4899 4648 sbi->s_journal->j_commit_callback = ext4_journal_commit_callback; 4649 + sbi->s_journal->j_submit_inode_data_buffers = 4650 + ext4_journal_submit_inode_data_buffers; 4651 + sbi->s_journal->j_finish_inode_data_buffers = 4652 + ext4_journal_finish_inode_data_buffers; 4900 4653 4901 4654 no_journal: 4902 4655 if (!test_opt(sb, NO_MBCACHE)) { ··· 5003 4748 goto failed_mount4a; 5004 4749 } 5005 4750 } 4751 + ext4_fc_replay_cleanup(sb); 5006 4752 5007 4753 ext4_ext_init(sb); 5008 4754 err = ext4_mb_init(sb); ··· 5070 4814 * used to detect the metadata async write error. 
5071 4815 */ 5072 4816 spin_lock_init(&sbi->s_bdev_wb_lock); 5073 - if (!sb_rdonly(sb)) 5074 - errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err, 5075 - &sbi->s_bdev_wb_err); 4817 + errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err, 4818 + &sbi->s_bdev_wb_err); 5076 4819 sb->s_bdev->bd_super = sb; 5077 4820 EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS; 5078 4821 ext4_orphan_cleanup(sb, es); ··· 5127 4872 5128 4873 failed_mount8: 5129 4874 ext4_unregister_sysfs(sb); 4875 + kobject_put(&sbi->s_kobj); 5130 4876 failed_mount7: 5131 4877 ext4_unregister_li_request(sb); 5132 4878 failed_mount6: ··· 5216 4960 journal->j_commit_interval = sbi->s_commit_interval; 5217 4961 journal->j_min_batch_time = sbi->s_min_batch_time; 5218 4962 journal->j_max_batch_time = sbi->s_max_batch_time; 4963 + ext4_fc_init(sb, journal); 5219 4964 5220 4965 write_lock(&journal->j_state_lock); 5221 4966 if (test_opt(sb, BARRIER)) ··· 5359 5102 goto out_bdev; 5360 5103 } 5361 5104 journal->j_private = sb; 5362 - ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1, &journal->j_sb_buffer); 5363 - wait_on_buffer(journal->j_sb_buffer); 5364 - if (!buffer_uptodate(journal->j_sb_buffer)) { 5105 + if (ext4_read_bh_lock(journal->j_sb_buffer, REQ_META | REQ_PRIO, true)) { 5365 5106 ext4_msg(sb, KERN_ERR, "I/O error on journal device"); 5366 5107 goto out_journal; 5367 5108 } ··· 5369 5114 be32_to_cpu(journal->j_superblock->s_nr_users)); 5370 5115 goto out_journal; 5371 5116 } 5372 - EXT4_SB(sb)->journal_bdev = bdev; 5117 + EXT4_SB(sb)->s_journal_bdev = bdev; 5373 5118 ext4_init_journal_params(sb, journal); 5374 5119 return journal; 5375 5120 ··· 5963 5708 } 5964 5709 5965 5710 /* 5966 - * Update the original bdev mapping's wb_err value 5967 - * which could be used to detect the metadata async 5968 - * write error. 
5969 - */ 5970 - errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err, 5971 - &sbi->s_bdev_wb_err); 5972 - 5973 - /* 5974 5711 * Mounting a RDONLY partition read-write, so reread 5975 5712 * and store the current valid flag. (It may have 5976 5713 * been changed by e2fsck since we originally mounted ··· 6007 5760 * Releasing of existing data is done when we are sure remount will 6008 5761 * succeed. 6009 5762 */ 6010 - if (test_opt(sb, BLOCK_VALIDITY) && !sbi->system_blks) { 5763 + if (test_opt(sb, BLOCK_VALIDITY) && !sbi->s_system_blks) { 6011 5764 err = ext4_setup_system_zone(sb); 6012 5765 if (err) 6013 5766 goto restore_opts; ··· 6033 5786 } 6034 5787 } 6035 5788 #endif 6036 - if (!test_opt(sb, BLOCK_VALIDITY) && sbi->system_blks) 5789 + if (!test_opt(sb, BLOCK_VALIDITY) && sbi->s_system_blks) 6037 5790 ext4_release_system_zone(sb); 6038 5791 6039 5792 /* ··· 6056 5809 sbi->s_commit_interval = old_opts.s_commit_interval; 6057 5810 sbi->s_min_batch_time = old_opts.s_min_batch_time; 6058 5811 sbi->s_max_batch_time = old_opts.s_max_batch_time; 6059 - if (!test_opt(sb, BLOCK_VALIDITY) && sbi->system_blks) 5812 + if (!test_opt(sb, BLOCK_VALIDITY) && sbi->s_system_blks) 6060 5813 ext4_release_system_zone(sb); 6061 5814 #ifdef CONFIG_QUOTA 6062 5815 sbi->s_jquota_fmt = old_opts.s_jquota_fmt; ··· 6289 6042 /* Quotafile not on the same filesystem? */ 6290 6043 if (path->dentry->d_sb != sb) 6291 6044 return -EXDEV; 6045 + 6046 + /* Quota already enabled for this file? */ 6047 + if (IS_NOQUOTA(d_inode(path->dentry))) 6048 + return -EBUSY; 6049 + 6292 6050 /* Journaling quota? */ 6293 6051 if (EXT4_SB(sb)->s_qf_names[type]) { 6294 6052 /* Quotafile not in fs root? */ ··· 6561 6309 brelse(bh); 6562 6310 out: 6563 6311 if (inode->i_size < off + len) { 6312 + ext4_fc_track_range(inode, 6313 + (inode->i_size > 0 ? 
inode->i_size - 1 : 0) 6314 + >> inode->i_sb->s_blocksize_bits, 6315 + (off + len) >> inode->i_sb->s_blocksize_bits); 6564 6316 i_size_write(inode, off + len); 6565 6317 EXT4_I(inode)->i_disksize = inode->i_size; 6566 6318 err2 = ext4_mark_inode_dirty(handle, inode); ··· 6693 6437 err = init_inodecache(); 6694 6438 if (err) 6695 6439 goto out1; 6440 + 6441 + err = ext4_fc_init_dentry_cache(); 6442 + if (err) 6443 + goto out05; 6444 + 6696 6445 register_as_ext3(); 6697 6446 register_as_ext2(); 6698 6447 err = register_filesystem(&ext4_fs_type); ··· 6708 6447 out: 6709 6448 unregister_as_ext2(); 6710 6449 unregister_as_ext3(); 6450 + out05: 6711 6451 destroy_inodecache(); 6712 6452 out1: 6713 6453 ext4_exit_mballoc();
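The quota-write hunk above hands ext4_fc_track_range() a block range derived from byte offsets: from the last block of the old i_size to the block containing off + len, shifting by the filesystem's block-size bits. A small user-space sketch of that arithmetic (function and struct names are hypothetical, not ext4's):

```c
#include <assert.h>

struct blk_range { unsigned long start, end; };

/* Block range touched when a write extends i_size to off + len,
 * mirroring the shift arithmetic in the hunk above. */
static struct blk_range fc_range_for_extension(long long i_size, long long off,
					       long long len, unsigned bits)
{
	struct blk_range r;

	r.start = (unsigned long)((i_size > 0 ? i_size - 1 : 0) >> bits);
	r.end   = (unsigned long)((off + len) >> bits);
	return r;
}
```

With 4 KiB blocks (bits = 12), extending an 8192-byte file with a 500-byte write at offset 10000 yields the range of blocks 1 through 2.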
fs/ext4/sysfs.c (+2)
··· 521 521 proc_create_single_data("es_shrinker_info", S_IRUGO, 522 522 sbi->s_proc, ext4_seq_es_shrinker_info_show, 523 523 sb); 524 + proc_create_single_data("fc_info", 0444, sbi->s_proc, 525 + ext4_fc_info_show, sb); 524 526 proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc, 525 527 &ext4_mb_seq_groups_ops, sb); 526 528 }
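The new fc_info entry is registered with the octal mode 0444 while the neighboring entries still use S_IRUGO; the two spell the same world-readable mode (current kernel style prefers plain octal). A quick user-space check, defining the kernel-only S_IRUGO macro locally:

```c
#include <sys/stat.h>

/* S_IRUGO is a kernel macro, not in userspace sys/stat.h; it is read
 * permission for user, group and other. */
#ifndef S_IRUGO
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

static int mode_is_world_readable(unsigned mode)
{
	return (mode & S_IRUGO) == S_IRUGO;
}
```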
fs/ext4/xattr.c (+3)
··· 2419 2419 if (IS_SYNC(inode)) 2420 2420 ext4_handle_sync(handle); 2421 2421 } 2422 + ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR); 2422 2423 2423 2424 cleanup: 2424 2425 brelse(is.iloc.bh); ··· 2497 2496 if (error == 0) 2498 2497 error = error2; 2499 2498 } 2499 + ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR); 2500 2500 2501 2501 return error; 2502 2502 } ··· 2930 2928 error); 2931 2929 goto cleanup; 2932 2930 } 2931 + ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR); 2933 2932 } 2934 2933 error = 0; 2935 2934 cleanup:
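Each xattr path above calls ext4_fc_mark_ineligible() with EXT4_FC_REASON_XATTR, forcing the next commit to be a full one because fast commit cannot describe xattr changes. A minimal user-space model of that flag-plus-statistics pattern (the struct layout and names are assumptions, not ext4's actual ones):

```c
enum fc_reason { FC_REASON_XATTR, FC_REASON_MAX };

struct fc_state {
	int fc_ineligible;			  /* next commit must be full */
	unsigned long fc_reason_cnt[FC_REASON_MAX]; /* per-reason stats */
};

static void fc_mark_ineligible(struct fc_state *s, enum fc_reason why)
{
	s->fc_ineligible = 1;
	s->fc_reason_cnt[why]++;
}

static int fc_selftest(void)
{
	struct fc_state s = { 0, { 0 } };

	fc_mark_ineligible(&s, FC_REASON_XATTR);
	return s.fc_ineligible == 1 && s.fc_reason_cnt[FC_REASON_XATTR] == 1;
}
```

The per-reason counters are what a stats file such as the fc_info procfs entry would report.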
fs/jbd2/commit.c (+76 -30)
··· 187 187 * use writepages() because with delayed allocation we may be doing 188 188 * block allocation in writepages(). 189 189 */ 190 - static int journal_submit_inode_data_buffers(struct address_space *mapping, 191 - loff_t dirty_start, loff_t dirty_end) 190 + int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode) 192 191 { 193 - int ret; 192 + struct address_space *mapping = jinode->i_vfs_inode->i_mapping; 194 193 struct writeback_control wbc = { 195 194 .sync_mode = WB_SYNC_ALL, 196 195 .nr_to_write = mapping->nrpages * 2, 197 - .range_start = dirty_start, 198 - .range_end = dirty_end, 196 + .range_start = jinode->i_dirty_start, 197 + .range_end = jinode->i_dirty_end, 199 198 }; 200 199 201 - ret = generic_writepages(mapping, &wbc); 202 - return ret; 200 + /* 201 + * submit the inode data buffers. We use writepage 202 + * instead of writepages. Because writepages can do 203 + * block allocation with delalloc. We need to write 204 + * only allocated blocks here. 205 + */ 206 + return generic_writepages(mapping, &wbc); 203 207 } 208 + 209 + /* Send all the data buffers related to an inode */ 210 + int jbd2_submit_inode_data(struct jbd2_inode *jinode) 211 + { 212 + 213 + if (!jinode || !(jinode->i_flags & JI_WRITE_DATA)) 214 + return 0; 215 + 216 + trace_jbd2_submit_inode_data(jinode->i_vfs_inode); 217 + return jbd2_journal_submit_inode_data_buffers(jinode); 218 + 219 + } 220 + EXPORT_SYMBOL(jbd2_submit_inode_data); 221 + 222 + int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode) 223 + { 224 + if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) || 225 + !jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping) 226 + return 0; 227 + return filemap_fdatawait_range_keep_errors( 228 + jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start, 229 + jinode->i_dirty_end); 230 + } 231 + EXPORT_SYMBOL(jbd2_wait_inode_data); 204 232 205 233 /* 206 234 * Submit all the data buffers of inode associated with the transaction to ··· 243 215 { 244 
216 struct jbd2_inode *jinode; 245 217 int err, ret = 0; 246 - struct address_space *mapping; 247 218 248 219 spin_lock(&journal->j_list_lock); 249 220 list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) { 250 - loff_t dirty_start = jinode->i_dirty_start; 251 - loff_t dirty_end = jinode->i_dirty_end; 252 - 253 221 if (!(jinode->i_flags & JI_WRITE_DATA)) 254 222 continue; 255 - mapping = jinode->i_vfs_inode->i_mapping; 256 223 jinode->i_flags |= JI_COMMIT_RUNNING; 257 224 spin_unlock(&journal->j_list_lock); 258 - /* 259 - * submit the inode data buffers. We use writepage 260 - * instead of writepages. Because writepages can do 261 - * block allocation with delalloc. We need to write 262 - * only allocated blocks here. 263 - */ 225 + /* submit the inode data buffers. */ 264 226 trace_jbd2_submit_inode_data(jinode->i_vfs_inode); 265 - err = journal_submit_inode_data_buffers(mapping, dirty_start, 266 - dirty_end); 267 - if (!ret) 268 - ret = err; 227 + if (journal->j_submit_inode_data_buffers) { 228 + err = journal->j_submit_inode_data_buffers(jinode); 229 + if (!ret) 230 + ret = err; 231 + } 269 232 spin_lock(&journal->j_list_lock); 270 233 J_ASSERT(jinode->i_transaction == commit_transaction); 271 234 jinode->i_flags &= ~JI_COMMIT_RUNNING; ··· 265 246 } 266 247 spin_unlock(&journal->j_list_lock); 267 248 return ret; 249 + } 250 + 251 + int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode) 252 + { 253 + struct address_space *mapping = jinode->i_vfs_inode->i_mapping; 254 + 255 + return filemap_fdatawait_range_keep_errors(mapping, 256 + jinode->i_dirty_start, 257 + jinode->i_dirty_end); 268 258 } 269 259 270 260 /* ··· 290 262 /* For locking, see the comment in journal_submit_data_buffers() */ 291 263 spin_lock(&journal->j_list_lock); 292 264 list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) { 293 - loff_t dirty_start = jinode->i_dirty_start; 294 - loff_t dirty_end = jinode->i_dirty_end; 295 - 296 265 if 
(!(jinode->i_flags & JI_WAIT_DATA)) 297 266 continue; 298 267 jinode->i_flags |= JI_COMMIT_RUNNING; 299 268 spin_unlock(&journal->j_list_lock); 300 - err = filemap_fdatawait_range_keep_errors( 301 - jinode->i_vfs_inode->i_mapping, dirty_start, 302 - dirty_end); 303 - if (!ret) 304 - ret = err; 269 + /* wait for the inode data buffers writeout. */ 270 + if (journal->j_finish_inode_data_buffers) { 271 + err = journal->j_finish_inode_data_buffers(jinode); 272 + if (!ret) 273 + ret = err; 274 + } 305 275 spin_lock(&journal->j_list_lock); 306 276 jinode->i_flags &= ~JI_COMMIT_RUNNING; 307 277 smp_mb(); ··· 439 413 J_ASSERT(journal->j_running_transaction != NULL); 440 414 J_ASSERT(journal->j_committing_transaction == NULL); 441 415 416 + write_lock(&journal->j_state_lock); 417 + journal->j_flags |= JBD2_FULL_COMMIT_ONGOING; 418 + while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) { 419 + DEFINE_WAIT(wait); 420 + 421 + prepare_to_wait(&journal->j_fc_wait, &wait, 422 + TASK_UNINTERRUPTIBLE); 423 + write_unlock(&journal->j_state_lock); 424 + schedule(); 425 + write_lock(&journal->j_state_lock); 426 + finish_wait(&journal->j_fc_wait, &wait); 427 + } 428 + write_unlock(&journal->j_state_lock); 429 + 442 430 commit_transaction = journal->j_running_transaction; 443 431 444 432 trace_jbd2_start_commit(journal, commit_transaction); ··· 460 420 commit_transaction->t_tid); 461 421 462 422 write_lock(&journal->j_state_lock); 423 + journal->j_fc_off = 0; 463 424 J_ASSERT(commit_transaction->t_state == T_RUNNING); 464 425 commit_transaction->t_state = T_LOCKED; 465 426 ··· 1160 1119 1161 1120 if (journal->j_commit_callback) 1162 1121 journal->j_commit_callback(journal, commit_transaction); 1122 + if (journal->j_fc_cleanup_callback) 1123 + journal->j_fc_cleanup_callback(journal, 1); 1163 1124 1164 1125 trace_jbd2_end_commit(journal, commit_transaction); 1165 1126 jbd_debug(1, "JBD2: commit %d complete, head %d\n", 1166 1127 journal->j_commit_sequence, journal->j_tail_sequence); 1167 
1128 1168 1129 write_lock(&journal->j_state_lock); 1130 + journal->j_flags &= ~JBD2_FULL_COMMIT_ONGOING; 1131 + journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING; 1169 1132 spin_lock(&journal->j_list_lock); 1170 1133 commit_transaction->t_state = T_FINISHED; 1171 1134 /* Check if the transaction can be dropped now that we are finished */ ··· 1181 1136 spin_unlock(&journal->j_list_lock); 1182 1137 write_unlock(&journal->j_state_lock); 1183 1138 wake_up(&journal->j_wait_done_commit); 1139 + wake_up(&journal->j_fc_wait); 1184 1140 1185 1141 /* 1186 1142 * Calculate overall stats
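The commit.c changes above make full and fast commits mutually exclusive via the JBD2_FULL_COMMIT_ONGOING and JBD2_FAST_COMMIT_ONGOING flags and the j_fc_wait queue. A single-threaded sketch of the begin-commit decision logic (the real jbd2_fc_begin_commit sleeps on j_fc_wait instead of returning immediately, but -EALREADY is also what a woken caller sees):

```c
#include <errno.h>

struct jmodel {
	unsigned ts_tid;	  /* nonzero once a full commit has run */
	unsigned commit_sequence; /* tid of the last committed transaction */
	int full_ongoing, fast_ongoing;
};

static int fc_begin_commit(struct jmodel *j, unsigned tid)
{
	if (!j->ts_tid)
		return -EINVAL;   /* fast commit needs one full commit first */
	if (tid <= j->commit_sequence)
		return -EALREADY; /* tid already on disk */
	if (j->full_ongoing || j->fast_ongoing)
		return -EALREADY; /* another commit is in flight */
	j->fast_ongoing = 1;
	return 0;
}

static int begin_selftest(void)
{
	struct jmodel j = { 1, 5, 0, 0 };

	if (fc_begin_commit(&j, 5) != -EALREADY)	/* old tid */
		return 0;
	if (fc_begin_commit(&j, 6) != 0)		/* starts */
		return 0;
	if (fc_begin_commit(&j, 7) != -EALREADY)	/* busy */
		return 0;
	j.fast_ongoing = 0;
	j.ts_tid = 0;
	if (fc_begin_commit(&j, 6) != -EINVAL)		/* no full commit yet */
		return 0;
	return 1;
}
```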
fs/jbd2/journal.c (+239 -6)
··· 91 91 EXPORT_SYMBOL(jbd2_journal_force_commit); 92 92 EXPORT_SYMBOL(jbd2_journal_inode_ranged_write); 93 93 EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait); 94 + EXPORT_SYMBOL(jbd2_journal_submit_inode_data_buffers); 95 + EXPORT_SYMBOL(jbd2_journal_finish_inode_data_buffers); 94 96 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode); 95 97 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode); 96 98 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate); ··· 159 157 * 160 158 * 1) COMMIT: Every so often we need to commit the current state of the 161 159 * filesystem to disk. The journal thread is responsible for writing 162 - * all of the metadata buffers to disk. 160 + * all of the metadata buffers to disk. If a fast commit is ongoing 161 + * journal thread waits until it's done and then continues from 162 + * there on. 163 163 * 164 164 * 2) CHECKPOINT: We cannot reuse a used section of the log file until all 165 165 * of the data in that part of the log has been rewritten elsewhere on ··· 718 714 return err; 719 715 } 720 716 717 + /* 718 + * Start a fast commit. If there's an ongoing fast or full commit wait for 719 + * it to complete. Returns 0 if a new fast commit was started. Returns -EALREADY 720 + * if a fast commit is not needed, either because there's an already a commit 721 + * going on or this tid has already been committed. Returns -EINVAL if no jbd2 722 + * commit has yet been performed. 723 + */ 724 + int jbd2_fc_begin_commit(journal_t *journal, tid_t tid) 725 + { 726 + /* 727 + * Fast commits only allowed if at least one full commit has 728 + * been processed. 
729 + */ 730 + if (!journal->j_stats.ts_tid) 731 + return -EINVAL; 732 + 733 + if (tid <= journal->j_commit_sequence) 734 + return -EALREADY; 735 + 736 + write_lock(&journal->j_state_lock); 737 + if (journal->j_flags & JBD2_FULL_COMMIT_ONGOING || 738 + (journal->j_flags & JBD2_FAST_COMMIT_ONGOING)) { 739 + DEFINE_WAIT(wait); 740 + 741 + prepare_to_wait(&journal->j_fc_wait, &wait, 742 + TASK_UNINTERRUPTIBLE); 743 + write_unlock(&journal->j_state_lock); 744 + schedule(); 745 + finish_wait(&journal->j_fc_wait, &wait); 746 + return -EALREADY; 747 + } 748 + journal->j_flags |= JBD2_FAST_COMMIT_ONGOING; 749 + write_unlock(&journal->j_state_lock); 750 + 751 + return 0; 752 + } 753 + EXPORT_SYMBOL(jbd2_fc_begin_commit); 754 + 755 + /* 756 + * Stop a fast commit. If fallback is set, this function starts commit of 757 + * TID tid before any other fast commit can start. 758 + */ 759 + static int __jbd2_fc_end_commit(journal_t *journal, tid_t tid, bool fallback) 760 + { 761 + if (journal->j_fc_cleanup_callback) 762 + journal->j_fc_cleanup_callback(journal, 0); 763 + write_lock(&journal->j_state_lock); 764 + journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING; 765 + if (fallback) 766 + journal->j_flags |= JBD2_FULL_COMMIT_ONGOING; 767 + write_unlock(&journal->j_state_lock); 768 + wake_up(&journal->j_fc_wait); 769 + if (fallback) 770 + return jbd2_complete_transaction(journal, tid); 771 + return 0; 772 + } 773 + 774 + int jbd2_fc_end_commit(journal_t *journal) 775 + { 776 + return __jbd2_fc_end_commit(journal, 0, 0); 777 + } 778 + EXPORT_SYMBOL(jbd2_fc_end_commit); 779 + 780 + int jbd2_fc_end_commit_fallback(journal_t *journal, tid_t tid) 781 + { 782 + return __jbd2_fc_end_commit(journal, tid, 1); 783 + } 784 + EXPORT_SYMBOL(jbd2_fc_end_commit_fallback); 785 + 721 786 /* Return 1 when transaction with given tid has already committed. 
*/ 722 787 int jbd2_transaction_committed(journal_t *journal, tid_t tid) 723 788 { ··· 854 781 write_unlock(&journal->j_state_lock); 855 782 return jbd2_journal_bmap(journal, blocknr, retp); 856 783 } 784 + 785 + /* Map one fast commit buffer for use by the file system */ 786 + int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out) 787 + { 788 + unsigned long long pblock; 789 + unsigned long blocknr; 790 + int ret = 0; 791 + struct buffer_head *bh; 792 + int fc_off; 793 + 794 + *bh_out = NULL; 795 + write_lock(&journal->j_state_lock); 796 + 797 + if (journal->j_fc_off + journal->j_fc_first < journal->j_fc_last) { 798 + fc_off = journal->j_fc_off; 799 + blocknr = journal->j_fc_first + fc_off; 800 + journal->j_fc_off++; 801 + } else { 802 + ret = -EINVAL; 803 + } 804 + write_unlock(&journal->j_state_lock); 805 + 806 + if (ret) 807 + return ret; 808 + 809 + ret = jbd2_journal_bmap(journal, blocknr, &pblock); 810 + if (ret) 811 + return ret; 812 + 813 + bh = __getblk(journal->j_dev, pblock, journal->j_blocksize); 814 + if (!bh) 815 + return -ENOMEM; 816 + 817 + lock_buffer(bh); 818 + 819 + clear_buffer_uptodate(bh); 820 + set_buffer_dirty(bh); 821 + unlock_buffer(bh); 822 + journal->j_fc_wbuf[fc_off] = bh; 823 + 824 + *bh_out = bh; 825 + 826 + return 0; 827 + } 828 + EXPORT_SYMBOL(jbd2_fc_get_buf); 829 + 830 + /* 831 + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf 832 + * for completion. 
833 + */ 834 + int jbd2_fc_wait_bufs(journal_t *journal, int num_blks) 835 + { 836 + struct buffer_head *bh; 837 + int i, j_fc_off; 838 + 839 + read_lock(&journal->j_state_lock); 840 + j_fc_off = journal->j_fc_off; 841 + read_unlock(&journal->j_state_lock); 842 + 843 + /* 844 + * Wait in reverse order to minimize chances of us being woken up before 845 + * all IOs have completed 846 + */ 847 + for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) { 848 + bh = journal->j_fc_wbuf[i]; 849 + wait_on_buffer(bh); 850 + put_bh(bh); 851 + journal->j_fc_wbuf[i] = NULL; 852 + if (unlikely(!buffer_uptodate(bh))) 853 + return -EIO; 854 + } 855 + 856 + return 0; 857 + } 858 + EXPORT_SYMBOL(jbd2_fc_wait_bufs); 859 + 860 + /* 861 + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf 862 + * for completion. 863 + */ 864 + int jbd2_fc_release_bufs(journal_t *journal) 865 + { 866 + struct buffer_head *bh; 867 + int i, j_fc_off; 868 + 869 + read_lock(&journal->j_state_lock); 870 + j_fc_off = journal->j_fc_off; 871 + read_unlock(&journal->j_state_lock); 872 + 873 + /* 874 + * Wait in reverse order to minimize chances of us being woken up before 875 + * all IOs have completed 876 + */ 877 + for (i = j_fc_off - 1; i >= 0; i--) { 878 + bh = journal->j_fc_wbuf[i]; 879 + if (!bh) 880 + break; 881 + put_bh(bh); 882 + journal->j_fc_wbuf[i] = NULL; 883 + } 884 + 885 + return 0; 886 + } 887 + EXPORT_SYMBOL(jbd2_fc_release_bufs); 857 888 858 889 /* 859 890 * Conversion of logical to physical block numbers for the journal ··· 1317 1140 init_waitqueue_head(&journal->j_wait_commit); 1318 1141 init_waitqueue_head(&journal->j_wait_updates); 1319 1142 init_waitqueue_head(&journal->j_wait_reserved); 1143 + init_waitqueue_head(&journal->j_fc_wait); 1320 1144 mutex_init(&journal->j_abort_mutex); 1321 1145 mutex_init(&journal->j_barrier); 1322 1146 mutex_init(&journal->j_checkpoint_mutex); ··· 1357 1179 if (!journal->j_wbuf) 1358 1180 goto err_cleanup; 1359 1181 1182 + if 
(journal->j_fc_wbufsize > 0) { 1183 + journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize, 1184 + sizeof(struct buffer_head *), 1185 + GFP_KERNEL); 1186 + if (!journal->j_fc_wbuf) 1187 + goto err_cleanup; 1188 + } 1189 + 1360 1190 bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize); 1361 1191 if (!bh) { 1362 1192 pr_err("%s: Cannot get buffer for journal superblock\n", ··· 1378 1192 1379 1193 err_cleanup: 1380 1194 kfree(journal->j_wbuf); 1195 + kfree(journal->j_fc_wbuf); 1381 1196 jbd2_journal_destroy_revoke(journal); 1382 1197 kfree(journal); 1383 1198 return NULL; 1384 1199 } 1200 + 1201 + int jbd2_fc_init(journal_t *journal, int num_fc_blks) 1202 + { 1203 + journal->j_fc_wbufsize = num_fc_blks; 1204 + journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize, 1205 + sizeof(struct buffer_head *), GFP_KERNEL); 1206 + if (!journal->j_fc_wbuf) 1207 + return -ENOMEM; 1208 + return 0; 1209 + } 1210 + EXPORT_SYMBOL(jbd2_fc_init); 1385 1211 1386 1212 /* jbd2_journal_init_dev and jbd2_journal_init_inode: 1387 1213 * ··· 1512 1314 } 1513 1315 1514 1316 journal->j_first = first; 1515 - journal->j_last = last; 1516 1317 1517 - journal->j_head = first; 1518 - journal->j_tail = first; 1519 - journal->j_free = last - first; 1318 + if (jbd2_has_feature_fast_commit(journal) && 1319 + journal->j_fc_wbufsize > 0) { 1320 + journal->j_fc_last = last; 1321 + journal->j_last = last - journal->j_fc_wbufsize; 1322 + journal->j_fc_first = journal->j_last + 1; 1323 + journal->j_fc_off = 0; 1324 + } else { 1325 + journal->j_last = last; 1326 + } 1327 + 1328 + journal->j_head = journal->j_first; 1329 + journal->j_tail = journal->j_first; 1330 + journal->j_free = journal->j_last - journal->j_first; 1520 1331 1521 1332 journal->j_tail_sequence = journal->j_transaction_sequence; 1522 1333 journal->j_commit_sequence = journal->j_transaction_sequence - 1; ··· 1671 1464 static void jbd2_mark_journal_empty(journal_t *journal, int write_op) 1672 1465 { 1673 1466 
journal_superblock_t *sb = journal->j_superblock; 1467 + bool had_fast_commit = false; 1674 1468 1675 1469 BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex)); 1676 1470 lock_buffer(journal->j_sb_buffer); ··· 1685 1477 1686 1478 sb->s_sequence = cpu_to_be32(journal->j_tail_sequence); 1687 1479 sb->s_start = cpu_to_be32(0); 1480 + if (jbd2_has_feature_fast_commit(journal)) { 1481 + /* 1482 + * When journal is clean, no need to commit fast commit flag and 1483 + * make file system incompatible with older kernels. 1484 + */ 1485 + jbd2_clear_feature_fast_commit(journal); 1486 + had_fast_commit = true; 1487 + } 1688 1488 1689 1489 jbd2_write_superblock(journal, write_op); 1490 + 1491 + if (had_fast_commit) 1492 + jbd2_set_feature_fast_commit(journal); 1690 1493 1691 1494 /* Log is no longer empty */ 1692 1495 write_lock(&journal->j_state_lock); ··· 1882 1663 journal->j_tail_sequence = be32_to_cpu(sb->s_sequence); 1883 1664 journal->j_tail = be32_to_cpu(sb->s_start); 1884 1665 journal->j_first = be32_to_cpu(sb->s_first); 1885 - journal->j_last = be32_to_cpu(sb->s_maxlen); 1886 1666 journal->j_errno = be32_to_cpu(sb->s_errno); 1667 + 1668 + if (jbd2_has_feature_fast_commit(journal) && 1669 + journal->j_fc_wbufsize > 0) { 1670 + journal->j_fc_last = be32_to_cpu(sb->s_maxlen); 1671 + journal->j_last = journal->j_fc_last - journal->j_fc_wbufsize; 1672 + journal->j_fc_first = journal->j_last + 1; 1673 + journal->j_fc_off = 0; 1674 + } else { 1675 + journal->j_last = be32_to_cpu(sb->s_maxlen); 1676 + } 1887 1677 1888 1678 return 0; 1889 1679 } ··· 1954 1726 */ 1955 1727 journal->j_flags &= ~JBD2_ABORT; 1956 1728 1729 + if (journal->j_fc_wbufsize > 0) 1730 + jbd2_journal_set_features(journal, 0, 0, 1731 + JBD2_FEATURE_INCOMPAT_FAST_COMMIT); 1957 1732 /* OK, we've finished with the dynamic journal bits: 1958 1733 * reinitialise the dynamic contents of the superblock in memory 1959 1734 * and reset them on disk. 
*/ ··· 2040 1809 jbd2_journal_destroy_revoke(journal); 2041 1810 if (journal->j_chksum_driver) 2042 1811 crypto_free_shash(journal->j_chksum_driver); 1812 + if (journal->j_fc_wbufsize > 0) 1813 + kfree(journal->j_fc_wbuf); 2043 1814 kfree(journal->j_wbuf); 2044 1815 kfree(journal); 2045 1816
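journal.c now carves the fast-commit area out of the tail of the log: j_fc_last keeps the old end of the journal, j_last shrinks by j_fc_wbufsize, and j_fc_first starts just past the new j_last. A sketch of that layout arithmetic as it appears in journal_reset() and load_superblock():

```c
struct jgeom {
	unsigned long first, last;	 /* normal journal area */
	unsigned long fc_first, fc_last; /* fast commit area (0 if unused) */
	unsigned long free;
};

/* Reserve fc_blks blocks at the tail of [first, maxlen] for fast commits. */
static struct jgeom journal_layout(unsigned long first, unsigned long maxlen,
				   int fast_commit, unsigned long fc_blks)
{
	struct jgeom g = { 0 };

	g.first = first;
	if (fast_commit && fc_blks > 0) {
		g.fc_last = maxlen;
		g.last = maxlen - fc_blks;
		g.fc_first = g.last + 1;
	} else {
		g.last = maxlen;
	}
	g.free = g.last - g.first;
	return g;
}
```

For a 1024-block journal starting at block 1 with 256 fast-commit blocks, the regular area ends at block 768 and the fast-commit area spans blocks 769 through 1024.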
fs/jbd2/recovery.c (+119 -16)
··· 35 35 int nr_revoke_hits; 36 36 }; 37 37 38 - enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY}; 39 38 static int do_one_pass(journal_t *journal, 40 39 struct recovery_info *info, enum passtype pass); 41 40 static int scan_revoke_records(journal_t *, struct buffer_head *, ··· 224 225 /* Make sure we wrap around the log correctly! */ 225 226 #define wrap(journal, var) \ 226 227 do { \ 227 - if (var >= (journal)->j_last) \ 228 - var -= ((journal)->j_last - (journal)->j_first); \ 228 + unsigned long _wrap_last = \ 229 + jbd2_has_feature_fast_commit(journal) ? \ 230 + (journal)->j_fc_last : (journal)->j_last; \ 231 + \ 232 + if (var >= _wrap_last) \ 233 + var -= (_wrap_last - (journal)->j_first); \ 229 234 } while (0) 235 + 236 + static int fc_do_one_pass(journal_t *journal, 237 + struct recovery_info *info, enum passtype pass) 238 + { 239 + unsigned int expected_commit_id = info->end_transaction; 240 + unsigned long next_fc_block; 241 + struct buffer_head *bh; 242 + int err = 0; 243 + 244 + next_fc_block = journal->j_fc_first; 245 + if (!journal->j_fc_replay_callback) 246 + return 0; 247 + 248 + while (next_fc_block <= journal->j_fc_last) { 249 + jbd_debug(3, "Fast commit replay: next block %ld", 250 + next_fc_block); 251 + err = jread(&bh, journal, next_fc_block); 252 + if (err) { 253 + jbd_debug(3, "Fast commit replay: read error"); 254 + break; 255 + } 256 + 257 + jbd_debug(3, "Processing fast commit blk with seq %d"); 258 + err = journal->j_fc_replay_callback(journal, bh, pass, 259 + next_fc_block - journal->j_fc_first, 260 + expected_commit_id); 261 + next_fc_block++; 262 + if (err < 0 || err == JBD2_FC_REPLAY_STOP) 263 + break; 264 + err = 0; 265 + } 266 + 267 + if (err) 268 + jbd_debug(3, "Fast commit replay failed, err = %d\n", err); 269 + 270 + return err; 271 + } 230 272 231 273 /** 232 274 * jbd2_journal_recover - recovers a on-disk journal ··· 468 428 __u32 crc32_sum = ~0; /* Transactional Checksums */ 469 429 int descr_csum_size = 0; 470 430 int 
block_error = 0; 431 + bool need_check_commit_time = false; 432 + __u64 last_trans_commit_time = 0, commit_time; 471 433 472 434 /* 473 435 * First thing is to establish what we expect to find in the log ··· 512 470 break; 513 471 514 472 jbd_debug(2, "Scanning for sequence ID %u at %lu/%lu\n", 515 - next_commit_ID, next_log_block, journal->j_last); 473 + next_commit_ID, next_log_block, 474 + jbd2_has_feature_fast_commit(journal) ? 475 + journal->j_fc_last : journal->j_last); 516 476 517 477 /* Skip over each chunk of the transaction looking 518 478 * either the next descriptor block or the final commit ··· 564 520 if (descr_csum_size > 0 && 565 521 !jbd2_descriptor_block_csum_verify(journal, 566 522 bh->b_data)) { 567 - printk(KERN_ERR "JBD2: Invalid checksum " 568 - "recovering block %lu in log\n", 569 - next_log_block); 570 - err = -EFSBADCRC; 571 - brelse(bh); 572 - goto failed; 523 + /* 524 + * PASS_SCAN can see stale blocks due to lazy 525 + * journal init. Don't error out on those yet. 526 + */ 527 + if (pass != PASS_SCAN) { 528 + pr_err("JBD2: Invalid checksum recovering block %lu in log\n", 529 + next_log_block); 530 + err = -EFSBADCRC; 531 + brelse(bh); 532 + goto failed; 533 + } 534 + need_check_commit_time = true; 535 + jbd_debug(1, 536 + "invalid descriptor block found in %lu\n", 537 + next_log_block); 573 538 } 574 539 575 540 /* If it is a valid descriptor block, replay it ··· 588 535 if (pass != PASS_REPLAY) { 589 536 if (pass == PASS_SCAN && 590 537 jbd2_has_feature_checksum(journal) && 538 + !need_check_commit_time && 591 539 !info->end_transaction) { 592 540 if (calc_chksums(journal, bh, 593 541 &next_log_block, ··· 737 683 * mentioned conditions. Hence assume 738 684 * "Interrupted Commit".) 739 685 */ 686 + commit_time = be64_to_cpu( 687 + ((struct commit_header *)bh->b_data)->h_commit_sec); 688 + /* 689 + * If need_check_commit_time is set, it means we are in 690 + * PASS_SCAN and csum verify failed before. 
If 691 + * commit_time is increasing, it's the same journal, 692 + * otherwise it is stale journal block, just end this 693 + * recovery. 694 + */ 695 + if (need_check_commit_time) { 696 + if (commit_time >= last_trans_commit_time) { 697 + pr_err("JBD2: Invalid checksum found in transaction %u\n", 698 + next_commit_ID); 699 + err = -EFSBADCRC; 700 + brelse(bh); 701 + goto failed; 702 + } 703 + ignore_crc_mismatch: 704 + /* 705 + * It likely does not belong to same journal, 706 + * just end this recovery with success. 707 + */ 708 + jbd_debug(1, "JBD2: Invalid checksum ignored in transaction %u, likely stale data\n", 709 + next_commit_ID); 710 + err = 0; 711 + brelse(bh); 712 + goto done; 713 + } 740 714 741 - /* Found an expected commit block: if checksums 742 - * are present verify them in PASS_SCAN; else not 715 + /* 716 + * Found an expected commit block: if checksums 717 + * are present, verify them in PASS_SCAN; else not 743 718 * much to do other than move on to the next sequence 744 - * number. */ 719 + * number. 720 + */ 745 721 if (pass == PASS_SCAN && 746 722 jbd2_has_feature_checksum(journal)) { 747 723 struct commit_header *cbh = ··· 803 719 !jbd2_commit_block_csum_verify(journal, 804 720 bh->b_data)) { 805 721 chksum_error: 722 + if (commit_time < last_trans_commit_time) 723 + goto ignore_crc_mismatch; 806 724 info->end_transaction = next_commit_ID; 807 725 808 726 if (!jbd2_has_feature_async_commit(journal)) { ··· 814 728 break; 815 729 } 816 730 } 731 + if (pass == PASS_SCAN) 732 + last_trans_commit_time = commit_time; 817 733 brelse(bh); 818 734 next_commit_ID++; 819 735 continue; 820 736 821 737 case JBD2_REVOKE_BLOCK: 738 + /* 739 + * Check revoke block crc in pass_scan, if csum verify 740 + * failed, check commit block time later. 
741 + */ 742 + if (pass == PASS_SCAN && 743 + !jbd2_descriptor_block_csum_verify(journal, 744 + bh->b_data)) { 745 + jbd_debug(1, "JBD2: invalid revoke block found in %lu\n", 746 + next_log_block); 747 + need_check_commit_time = true; 748 + } 822 749 /* If we aren't in the REVOKE pass, then we can 823 750 * just skip over this block. */ 824 751 if (pass != PASS_REVOKE) { ··· 876 777 success = -EIO; 877 778 } 878 779 } 780 + 781 + if (jbd2_has_feature_fast_commit(journal) && pass != PASS_REVOKE) { 782 + err = fc_do_one_pass(journal, info, pass); 783 + if (err) 784 + success = err; 785 + } 786 + 879 787 if (block_error && success == 0) 880 788 success = -EIO; 881 789 return success; ··· 905 799 header = (jbd2_journal_revoke_header_t *) bh->b_data; 906 800 offset = sizeof(jbd2_journal_revoke_header_t); 907 801 rcount = be32_to_cpu(header->r_count); 908 - 909 - if (!jbd2_descriptor_block_csum_verify(journal, header)) 910 - return -EFSBADCRC; 911 802 912 803 if (jbd2_journal_has_csum_v2or3(journal)) 913 804 csum_size = sizeof(struct jbd2_journal_block_tail);
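The wrap() change above makes log scanning wrap at j_fc_last rather than j_last when the fast commit feature is set, since the fast-commit area lies past j_last but is still part of the on-disk circular log. A user-space model of the macro:

```c
/* Fold a block number back to the start of the circular log once it
 * passes the last usable block, as recovery.c's wrap() macro does. */
static unsigned long wrap_block(unsigned long var, unsigned long j_first,
				unsigned long j_last, unsigned long j_fc_last,
				int fast_commit)
{
	unsigned long wrap_last = fast_commit ? j_fc_last : j_last;

	if (var >= wrap_last)
		var -= wrap_last - j_first;
	return var;
}
```

With j_first = 1, j_last = 769 and j_fc_last = 1025, a scan position of 1025 wraps back to block 1 only when fast commit is enabled; without it, block 800 would already have wrapped.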
fs/ocfs2/journal.c (+4)
··· 883 883 OCFS2_JOURNAL_DIRTY_FL); 884 884 885 885 journal->j_journal = j_journal; 886 + journal->j_journal->j_submit_inode_data_buffers = 887 + jbd2_journal_submit_inode_data_buffers; 888 + journal->j_journal->j_finish_inode_data_buffers = 889 + jbd2_journal_finish_inode_data_buffers; 886 890 journal->j_inode = inode; 887 891 journal->j_bh = bh; 888 892
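ocfs2 keeps its existing behavior by pointing the two new per-journal hooks at the generic jbd2 helpers, while the commit path in commit.c only invokes a hook when it is non-NULL. The optional-callback pattern in miniature (types are stand-ins, not the jbd2 structures):

```c
#include <stddef.h>

struct jinode { int submitted; };

struct jrn {
	/* Optional per-filesystem hook; NULL means nothing to do. */
	int (*submit_inode_data_buffers)(struct jinode *);
};

static int generic_submit(struct jinode *ji)
{
	ji->submitted = 1;	/* stands in for generic_writepages() */
	return 0;
}

static int commit_submit_data(struct jrn *j, struct jinode *ji)
{
	if (j->submit_inode_data_buffers)
		return j->submit_inode_data_buffers(ji);
	return 0;
}

static int cb_selftest(void)
{
	struct jinode ji = { 0 };
	struct jrn with = { generic_submit }, without = { NULL };

	if (commit_submit_data(&with, &ji) != 0 || !ji.submitted)
		return 0;
	ji.submitted = 0;
	if (commit_submit_data(&without, &ji) != 0 || ji.submitted)
		return 0;
	return 1;
}
```

Hooks let ext4's data=journal mode substitute its page write-protection variants while ocfs2 and ordered-mode ext4 reuse the defaults.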
include/linux/jbd2.h (+120 -4)
··· 289 289 #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004 290 290 #define JBD2_FEATURE_INCOMPAT_CSUM_V2 0x00000008 291 291 #define JBD2_FEATURE_INCOMPAT_CSUM_V3 0x00000010 292 + #define JBD2_FEATURE_INCOMPAT_FAST_COMMIT 0x00000020 292 293 293 294 /* See "journal feature predicate functions" below */ 294 295 ··· 300 299 JBD2_FEATURE_INCOMPAT_64BIT | \ 301 300 JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \ 302 301 JBD2_FEATURE_INCOMPAT_CSUM_V2 | \ 303 - JBD2_FEATURE_INCOMPAT_CSUM_V3) 302 + JBD2_FEATURE_INCOMPAT_CSUM_V3 | \ 303 + JBD2_FEATURE_INCOMPAT_FAST_COMMIT) 304 304 305 305 #ifdef __KERNEL__ 306 306 ··· 454 452 struct jbd2_revoke_table_s; 455 453 456 454 /** 457 - * struct handle_s - The handle_s type is the concrete type associated with 458 - * handle_t. 455 + * struct jbd2_journal_handle - The jbd2_journal_handle type is the concrete 456 + * type associated with handle_t. 459 457 * @h_transaction: Which compound transaction is this update a part of? 460 458 * @h_journal: Which journal handle belongs to - used iff h_reserved set. 461 459 * @h_rsv_handle: Handle reserved for finishing the logical operation. ··· 631 629 struct journal_head *t_shadow_list; 632 630 633 631 /* 634 - * List of inodes whose data we've modified in data=ordered mode. 632 + * List of inodes associated with the transaction; e.g., ext4 uses 633 + * this to track inodes in data=ordered and data=journal mode that 634 + * need special handling on transaction commit; also used by ocfs2. 635 635 * [j_list_lock] 636 636 */ 637 637 struct list_head t_inode_list; ··· 751 747 752 748 #define JBD2_NR_BATCH 64 753 749 750 + enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY}; 751 + 752 + #define JBD2_FC_REPLAY_STOP 0 753 + #define JBD2_FC_REPLAY_CONTINUE 1 754 + 754 755 /** 755 756 * struct journal_s - The journal_s type is the concrete type associated with 756 757 * journal_t. 
··· 867 858 wait_queue_head_t j_wait_reserved; 868 859 869 860 /** 861 + * @j_fc_wait: 862 + * 863 + * Wait queue to wait for completion of async fast commits. 864 + */ 865 + wait_queue_head_t j_fc_wait; 866 + 867 + /** 870 868 * @j_checkpoint_mutex: 871 869 * 872 870 * Semaphore for locking against concurrent checkpoints. ··· 929 913 * [j_state_lock]. 930 914 */ 931 915 unsigned long j_last; 916 + 917 + /** 918 + * @j_fc_first: 919 + * 920 + * The block number of the first fast commit block in the journal 921 + * [j_state_lock]. 922 + */ 923 + unsigned long j_fc_first; 924 + 925 + /** 926 + * @j_fc_off: 927 + * 928 + * Number of fast commit blocks currently allocated. 929 + * [j_state_lock]. 930 + */ 931 + unsigned long j_fc_off; 932 + 933 + /** 934 + * @j_fc_last: 935 + * 936 + * The block number one beyond the last fast commit block in the journal 937 + * [j_state_lock]. 938 + */ 939 + unsigned long j_fc_last; 932 940 933 941 /** 934 942 * @j_dev: Device where we store the journal. ··· 1105 1065 struct buffer_head **j_wbuf; 1106 1066 1107 1067 /** 1068 + * @j_fc_wbuf: Array of fast commit bhs for 1069 + * jbd2_journal_commit_transaction. 1070 + */ 1071 + struct buffer_head **j_fc_wbuf; 1072 + 1073 + /** 1108 1074 * @j_wbufsize: 1109 1075 * 1110 1076 * Size of @j_wbuf array. 1111 1077 */ 1112 1078 int j_wbufsize; 1079 + 1080 + /** 1081 + * @j_fc_wbufsize: 1082 + * 1083 + * Size of @j_fc_wbuf array. 1084 + */ 1085 + int j_fc_wbufsize; 1113 1086 1114 1087 /** 1115 1088 * @j_last_sync_writer: ··· 1163 1110 */ 1164 1111 void (*j_commit_callback)(journal_t *, 1165 1112 transaction_t *); 1113 + 1114 + /** 1115 + * @j_submit_inode_data_buffers: 1116 + * 1117 + * This function is called for all inodes associated with the 1118 + * committing transaction marked with JI_WRITE_DATA flag 1119 + * before we start to write out the transaction to the journal. 
1120 + */ 1121 + int (*j_submit_inode_data_buffers) 1122 + (struct jbd2_inode *); 1123 + 1124 + /** 1125 + * @j_finish_inode_data_buffers: 1126 + * 1127 + * This function is called for all inodes associated with the 1128 + * committing transaction marked with JI_WAIT_DATA flag 1129 + * after we have written the transaction to the journal 1130 + * but before we write out the commit block. 1131 + */ 1132 + int (*j_finish_inode_data_buffers) 1133 + (struct jbd2_inode *); 1166 1134 1167 1135 /* 1168 1136 * Journal statistics ··· 1244 1170 */ 1245 1171 struct lockdep_map j_trans_commit_map; 1246 1172 #endif 1173 + 1174 + /** 1175 + * @j_fc_cleanup_callback: 1176 + * 1177 + * Clean-up after fast commit or full commit. JBD2 calls this function 1178 + * after every commit operation. 1179 + */ 1180 + void (*j_fc_cleanup_callback)(struct journal_s *journal, int); 1181 + 1182 + /* 1183 + * @j_fc_replay_callback: 1184 + * 1185 + * File-system specific function that performs replay of a fast 1186 + * commit. JBD2 calls this function for each fast commit block found in 1187 + * the journal. This function should return JBD2_FC_REPLAY_CONTINUE 1188 + * to indicate that the block was processed correctly and more fast 1189 + * commit replay should continue. Return value of JBD2_FC_REPLAY_STOP 1190 + * indicates the end of replay (no more blocks remaining). A negative 1191 + * return value indicates error. 
1192 + */ 1193 + int (*j_fc_replay_callback)(struct journal_s *journal, 1194 + struct buffer_head *bh, 1195 + enum passtype pass, int off, 1196 + tid_t expected_commit_id); 1247 1197 }; 1248 1198 1249 1199 #define jbd2_might_wait_for_commit(j) \ ··· 1338 1240 JBD2_FEATURE_INCOMPAT_FUNCS(async_commit, ASYNC_COMMIT) 1339 1241 JBD2_FEATURE_INCOMPAT_FUNCS(csum2, CSUM_V2) 1340 1242 JBD2_FEATURE_INCOMPAT_FUNCS(csum3, CSUM_V3) 1243 + JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit, FAST_COMMIT) 1341 1244 1342 1245 /* 1343 1246 * Journal flag definitions ··· 1352 1253 #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file 1353 1254 * data write error in ordered 1354 1255 * mode */ 1256 + #define JBD2_FAST_COMMIT_ONGOING 0x100 /* Fast commit is ongoing */ 1257 + #define JBD2_FULL_COMMIT_ONGOING 0x200 /* Full commit is ongoing */ 1355 1258 1356 1259 /* 1357 1260 * Function declarations for the journaling transaction and buffer ··· 1522 1421 extern int jbd2_journal_inode_ranged_wait(handle_t *handle, 1523 1422 struct jbd2_inode *inode, loff_t start_byte, 1524 1423 loff_t length); 1424 + extern int jbd2_journal_submit_inode_data_buffers( 1425 + struct jbd2_inode *jinode); 1426 + extern int jbd2_journal_finish_inode_data_buffers( 1427 + struct jbd2_inode *jinode); 1525 1428 extern int jbd2_journal_begin_ordered_truncate(journal_t *journal, 1526 1429 struct jbd2_inode *inode, loff_t new_size); 1527 1430 extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode); ··· 1609 1504 void __jbd2_log_wait_for_space(journal_t *journal); 1610 1505 extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *); 1611 1506 extern int jbd2_cleanup_journal_tail(journal_t *); 1507 + 1508 + /* Fast commit related APIs */ 1509 + int jbd2_fc_init(journal_t *journal, int num_fc_blks); 1510 + int jbd2_fc_begin_commit(journal_t *journal, tid_t tid); 1511 + int jbd2_fc_end_commit(journal_t *journal); 1512 + int jbd2_fc_end_commit_fallback(journal_t 
*journal, tid_t tid); 1513 + int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out); 1514 + int jbd2_submit_inode_data(struct jbd2_inode *jinode); 1515 + int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode); 1516 + int jbd2_fc_wait_bufs(journal_t *journal, int num_blks); 1517 + int jbd2_fc_release_bufs(journal_t *journal); 1612 1518 1613 1519 /* 1614 1520 * is_journal_abort
+224 -4
include/trace/events/ext4.h
··· 95 95 { FALLOC_FL_COLLAPSE_RANGE, "COLLAPSE_RANGE"}, \ 96 96 { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"}) 97 97 98 + #define show_fc_reason(reason) \ 99 + __print_symbolic(reason, \ 100 + { EXT4_FC_REASON_XATTR, "XATTR"}, \ 101 + { EXT4_FC_REASON_CROSS_RENAME, "CROSS_RENAME"}, \ 102 + { EXT4_FC_REASON_JOURNAL_FLAG_CHANGE, "JOURNAL_FLAG_CHANGE"}, \ 103 + { EXT4_FC_REASON_MEM, "NO_MEM"}, \ 104 + { EXT4_FC_REASON_SWAP_BOOT, "SWAP_BOOT"}, \ 105 + { EXT4_FC_REASON_RESIZE, "RESIZE"}, \ 106 + { EXT4_FC_REASON_RENAME_DIR, "RENAME_DIR"}, \ 107 + { EXT4_FC_REASON_FALLOC_RANGE, "FALLOC_RANGE"}) 98 108 99 109 TRACE_EVENT(ext4_other_inode_update_time, 100 110 TP_PROTO(struct inode *inode, ino_t orig_ino), ··· 1776 1766 ); 1777 1767 1778 1768 TRACE_EVENT(ext4_load_inode, 1779 - TP_PROTO(struct inode *inode), 1769 + TP_PROTO(struct super_block *sb, unsigned long ino), 1780 1770 1781 - TP_ARGS(inode), 1771 + TP_ARGS(sb, ino), 1782 1772 1783 1773 TP_STRUCT__entry( 1784 1774 __field( dev_t, dev ) ··· 1786 1776 ), 1787 1777 1788 1778 TP_fast_assign( 1789 - __entry->dev = inode->i_sb->s_dev; 1790 - __entry->ino = inode->i_ino; 1779 + __entry->dev = sb->s_dev; 1780 + __entry->ino = ino; 1791 1781 ), 1792 1782 1793 1783 TP_printk("dev %d,%d ino %ld", ··· 2800 2790 TP_printk("dev %d,%d group %u", 2801 2791 MAJOR(__entry->dev), MINOR(__entry->dev), __entry->group) 2802 2792 ); 2793 + 2794 + TRACE_EVENT(ext4_fc_replay_scan, 2795 + TP_PROTO(struct super_block *sb, int error, int off), 2796 + 2797 + TP_ARGS(sb, error, off), 2798 + 2799 + TP_STRUCT__entry( 2800 + __field(dev_t, dev) 2801 + __field(int, error) 2802 + __field(int, off) 2803 + ), 2804 + 2805 + TP_fast_assign( 2806 + __entry->dev = sb->s_dev; 2807 + __entry->error = error; 2808 + __entry->off = off; 2809 + ), 2810 + 2811 + TP_printk("FC scan pass on dev %d,%d: error %d, off %d", 2812 + MAJOR(__entry->dev), MINOR(__entry->dev), 2813 + __entry->error, __entry->off) 2814 + ); 2815 + 2816 + TRACE_EVENT(ext4_fc_replay, 2817 + 
TP_PROTO(struct super_block *sb, int tag, int ino, int priv1, int priv2), 2818 + 2819 + TP_ARGS(sb, tag, ino, priv1, priv2), 2820 + 2821 + TP_STRUCT__entry( 2822 + __field(dev_t, dev) 2823 + __field(int, tag) 2824 + __field(int, ino) 2825 + __field(int, priv1) 2826 + __field(int, priv2) 2827 + ), 2828 + 2829 + TP_fast_assign( 2830 + __entry->dev = sb->s_dev; 2831 + __entry->tag = tag; 2832 + __entry->ino = ino; 2833 + __entry->priv1 = priv1; 2834 + __entry->priv2 = priv2; 2835 + ), 2836 + 2837 + TP_printk("FC Replay %d,%d: tag %d, ino %d, data1 %d, data2 %d", 2838 + MAJOR(__entry->dev), MINOR(__entry->dev), 2839 + __entry->tag, __entry->ino, __entry->priv1, __entry->priv2) 2840 + ); 2841 + 2842 + TRACE_EVENT(ext4_fc_commit_start, 2843 + TP_PROTO(struct super_block *sb), 2844 + 2845 + TP_ARGS(sb), 2846 + 2847 + TP_STRUCT__entry( 2848 + __field(dev_t, dev) 2849 + ), 2850 + 2851 + TP_fast_assign( 2852 + __entry->dev = sb->s_dev; 2853 + ), 2854 + 2855 + TP_printk("fast_commit started on dev %d,%d", 2856 + MAJOR(__entry->dev), MINOR(__entry->dev)) 2857 + ); 2858 + 2859 + TRACE_EVENT(ext4_fc_commit_stop, 2860 + TP_PROTO(struct super_block *sb, int nblks, int reason), 2861 + 2862 + TP_ARGS(sb, nblks, reason), 2863 + 2864 + TP_STRUCT__entry( 2865 + __field(dev_t, dev) 2866 + __field(int, nblks) 2867 + __field(int, reason) 2868 + __field(int, num_fc) 2869 + __field(int, num_fc_ineligible) 2870 + __field(int, nblks_agg) 2871 + ), 2872 + 2873 + TP_fast_assign( 2874 + __entry->dev = sb->s_dev; 2875 + __entry->nblks = nblks; 2876 + __entry->reason = reason; 2877 + __entry->num_fc = EXT4_SB(sb)->s_fc_stats.fc_num_commits; 2878 + __entry->num_fc_ineligible = 2879 + EXT4_SB(sb)->s_fc_stats.fc_ineligible_commits; 2880 + __entry->nblks_agg = EXT4_SB(sb)->s_fc_stats.fc_numblks; 2881 + ), 2882 + 2883 + TP_printk("fc on [%d,%d] nblks %d, reason %d, fc = %d, ineligible = %d, agg_nblks %d", 2884 + MAJOR(__entry->dev), MINOR(__entry->dev), 2885 + __entry->nblks, __entry->reason, 
__entry->num_fc, 2886 + __entry->num_fc_ineligible, __entry->nblks_agg) 2887 + ); 2888 + 2889 + #define FC_REASON_NAME_STAT(reason) \ 2890 + show_fc_reason(reason), \ 2891 + __entry->sbi->s_fc_stats.fc_ineligible_reason_count[reason] 2892 + 2893 + TRACE_EVENT(ext4_fc_stats, 2894 + TP_PROTO(struct super_block *sb), 2895 + 2896 + TP_ARGS(sb), 2897 + 2898 + TP_STRUCT__entry( 2899 + __field(dev_t, dev) 2900 + __field(struct ext4_sb_info *, sbi) 2901 + __field(int, count) 2902 + ), 2903 + 2904 + TP_fast_assign( 2905 + __entry->dev = sb->s_dev; 2906 + __entry->sbi = EXT4_SB(sb); 2907 + ), 2908 + 2909 + TP_printk("dev %d:%d fc ineligible reasons:\n" 2910 + "%s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s,%d; " 2911 + "num_commits:%ld, ineligible: %ld, numblks: %ld", 2912 + MAJOR(__entry->dev), MINOR(__entry->dev), 2913 + FC_REASON_NAME_STAT(EXT4_FC_REASON_XATTR), 2914 + FC_REASON_NAME_STAT(EXT4_FC_REASON_CROSS_RENAME), 2915 + FC_REASON_NAME_STAT(EXT4_FC_REASON_JOURNAL_FLAG_CHANGE), 2916 + FC_REASON_NAME_STAT(EXT4_FC_REASON_MEM), 2917 + FC_REASON_NAME_STAT(EXT4_FC_REASON_SWAP_BOOT), 2918 + FC_REASON_NAME_STAT(EXT4_FC_REASON_RESIZE), 2919 + FC_REASON_NAME_STAT(EXT4_FC_REASON_RENAME_DIR), 2920 + FC_REASON_NAME_STAT(EXT4_FC_REASON_FALLOC_RANGE), 2921 + __entry->sbi->s_fc_stats.fc_num_commits, 2922 + __entry->sbi->s_fc_stats.fc_ineligible_commits, 2923 + __entry->sbi->s_fc_stats.fc_numblks) 2924 + 2925 + ); 2926 + 2927 + #define DEFINE_TRACE_DENTRY_EVENT(__type) \ 2928 + TRACE_EVENT(ext4_fc_track_##__type, \ 2929 + TP_PROTO(struct inode *inode, struct dentry *dentry, int ret), \ 2930 + \ 2931 + TP_ARGS(inode, dentry, ret), \ 2932 + \ 2933 + TP_STRUCT__entry( \ 2934 + __field(dev_t, dev) \ 2935 + __field(int, ino) \ 2936 + __field(int, error) \ 2937 + ), \ 2938 + \ 2939 + TP_fast_assign( \ 2940 + __entry->dev = inode->i_sb->s_dev; \ 2941 + __entry->ino = inode->i_ino; \ 2942 + __entry->error = ret; \ 2943 + ), \ 2944 + \ 2945 + TP_printk("dev %d:%d, inode %d, error %d, 
fc_%s", \ 2946 + MAJOR(__entry->dev), MINOR(__entry->dev), \ 2947 + __entry->ino, __entry->error, \ 2948 + #__type) \ 2949 + ) 2950 + 2951 + DEFINE_TRACE_DENTRY_EVENT(create); 2952 + DEFINE_TRACE_DENTRY_EVENT(link); 2953 + DEFINE_TRACE_DENTRY_EVENT(unlink); 2954 + 2955 + TRACE_EVENT(ext4_fc_track_inode, 2956 + TP_PROTO(struct inode *inode, int ret), 2957 + 2958 + TP_ARGS(inode, ret), 2959 + 2960 + TP_STRUCT__entry( 2961 + __field(dev_t, dev) 2962 + __field(int, ino) 2963 + __field(int, error) 2964 + ), 2965 + 2966 + TP_fast_assign( 2967 + __entry->dev = inode->i_sb->s_dev; 2968 + __entry->ino = inode->i_ino; 2969 + __entry->error = ret; 2970 + ), 2971 + 2972 + TP_printk("dev %d:%d, inode %d, error %d", 2973 + MAJOR(__entry->dev), MINOR(__entry->dev), 2974 + __entry->ino, __entry->error) 2975 + ); 2976 + 2977 + TRACE_EVENT(ext4_fc_track_range, 2978 + TP_PROTO(struct inode *inode, long start, long end, int ret), 2979 + 2980 + TP_ARGS(inode, start, end, ret), 2981 + 2982 + TP_STRUCT__entry( 2983 + __field(dev_t, dev) 2984 + __field(int, ino) 2985 + __field(long, start) 2986 + __field(long, end) 2987 + __field(int, error) 2988 + ), 2989 + 2990 + TP_fast_assign( 2991 + __entry->dev = inode->i_sb->s_dev; 2992 + __entry->ino = inode->i_ino; 2993 + __entry->start = start; 2994 + __entry->end = end; 2995 + __entry->error = ret; 2996 + ), 2997 + 2998 + TP_printk("dev %d:%d, inode %d, error %d, start %ld, end %ld", 2999 + MAJOR(__entry->dev), MINOR(__entry->dev), 3000 + __entry->ino, __entry->error, __entry->start, 3001 + __entry->end) 3002 + ); 2803 3003 2804 3004 #endif /* _TRACE_EXT4_H */ 2805 3005