Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

+22 -10

Documentation/filesystems/ext4/bigalloc.rst

···
 exceeds the page size. However, for a filesystem of mostly huge files,
 it is desirable to be able to allocate disk blocks in units of multiple
 blocks to reduce both fragmentation and metadata overhead. The
-`bigalloc <Bigalloc>`__ feature provides exactly this ability. The
-administrator can set a block cluster size at mkfs time (which is stored
-in the s\_log\_cluster\_size field in the superblock); from then on, the
-block bitmaps track clusters, not individual blocks. This means that
-block groups can be several gigabytes in size (instead of just 128MiB);
-however, the minimum allocation unit becomes a cluster, not a block,
-even for directories. TaoBao had a patchset to extend the “use units of
-clusters instead of blocks” to the extent tree, though it is not clear
-where those patches went-- they eventually morphed into “extent tree v2”
-but that code has not landed as of May 2015.
+bigalloc feature provides exactly this ability.
+
+The bigalloc feature (EXT4_FEATURE_RO_COMPAT_BIGALLOC) changes ext4 to
+use clustered allocation, so that each bit in the ext4 block allocation
+bitmap addresses a power of two number of blocks. For example, if the
+file system is mainly going to be storing large files in the 4-32
+megabyte range, it might make sense to set a cluster size of 1 megabyte.
+This means that each bit in the block allocation bitmap now addresses
+256 4k blocks. This shrinks the total size of the block allocation
+bitmaps for a 2T file system from 64 megabytes to 256 kilobytes. It also
+means that a block group addresses 32 gigabytes instead of 128 megabytes,
+also shrinking the amount of file system overhead for metadata.
+
+The administrator can set a block cluster size at mkfs time (which is
+stored in the s\_log\_cluster\_size field in the superblock); from then
+on, the block bitmaps track clusters, not individual blocks. This means
+that block groups can be several gigabytes in size (instead of just
+128MiB); however, the minimum allocation unit becomes a cluster, not a
+block, even for directories. TaoBao had a patchset to extend the “use
+units of clusters instead of blocks” to the extent tree, though it is
+not clear where those patches went-- they eventually morphed into
+“extent tree v2” but that code has not landed as of May 2015.
 
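The bigalloc figures in the new paragraph can be checked with a few lines of arithmetic. These figures are 1 MiB clusters on 4 KiB blocks and a "2T" filesystem, taken here as 2 TiB. They follow from two rules: one allocation-bitmap bit per cluster, and one bitmap block per block group. The helpers below are hypothetical and exist only to verify that arithmetic. They are not ext4 functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers checking the bigalloc sizing claims.
 * All sizes are in bytes; each bitmap bit tracks one cluster. */

/* Blocks covered by one bitmap bit (the cluster ratio). */
static uint64_t blocks_per_bit(uint64_t block_size, uint64_t cluster_size)
{
	assert(cluster_size % block_size == 0);
	return cluster_size / block_size;
}

/* Total size of all block allocation bitmaps for a filesystem. */
static uint64_t bitmap_bytes(uint64_t fs_size, uint64_t cluster_size)
{
	uint64_t clusters = fs_size / cluster_size;

	return clusters / 8; /* one bit per cluster */
}

/* Bytes addressed by one block group: its bitmap fills one block,
 * i.e. block_size * 8 bits, and each bit covers one cluster. */
static uint64_t group_bytes(uint64_t block_size, uint64_t cluster_size)
{
	return block_size * 8 * cluster_size;
}
```

Without bigalloc the cluster equals the block. In that case, `bitmap_bytes(2ULL << 40, 4096)` gives 64 MiB and `group_bytes(4096, 4096)` gives 128 MiB. With 1 MiB clusters, the same calls give 256 KiB and 32 GiB, which matches the paragraph.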

+5 -5

Documentation/filesystems/ext4/blockgroup.rst

···
 superblock, group descriptors, data block bitmaps for groups 0-3, inode
 bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
 space in group 0 is for file data. The effect of this is to group the
-block metadata close together for faster loading, and to enable large
-files to be continuous on disk. Backup copies of the superblock and
-group descriptors are always at the beginning of block groups, even if
-flex\_bg is enabled. The number of block groups that make up a flex\_bg
-is given by 2 ^ ``sb.s_log_groups_per_flex``.
+block group metadata close together for faster loading, and to enable
+large files to be continuous on disk. Backup copies of the superblock
+and group descriptors are always at the beginning of block groups, even
+if flex\_bg is enabled. The number of block groups that make up a
+flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
 
 Meta Block Groups
 -----------------
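The flex_bg size is 2 ^ s_log_groups_per_flex, so a block group's flex group and its position inside that flex group reduce to a shift and a mask. This is the same idea as the kernel's ext4_flex_group() helper. The sketch below is minimal and assumes s_log_groups_per_flex has already been read from the superblock. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Block groups per flex group: 2 ^ s_log_groups_per_flex. */
static uint32_t groups_per_flex(uint8_t log_groups_per_flex)
{
	return UINT32_C(1) << log_groups_per_flex;
}

/* Flex group that block group `group` belongs to. */
static uint32_t flex_group_of(uint32_t group, uint8_t log_groups_per_flex)
{
	return group >> log_groups_per_flex;
}

/* Position of `group` inside its flex group. */
static uint32_t index_in_flex(uint32_t group, uint8_t log_groups_per_flex)
{
	return group & (groups_per_flex(log_groups_per_flex) - 1);
}
```

The text's example uses groups 0-3 as one flex group, which corresponds to s_log_groups_per_flex = 2. With that value, group 3 maps to flex group 0 and group 4 starts flex group 1.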

+3 -1

Documentation/filesystems/ext4/blocks.rst

···
 4KiB. You may experience mounting problems if block size is greater than
 page size (i.e. 64KiB blocks on a i386 which only has 4KiB memory
 pages). By default a filesystem can contain 2^32 blocks; if the '64bit'
-feature is enabled, then a filesystem can have 2^64 blocks.
+feature is enabled, then a filesystem can have 2^64 blocks. The location
+of structures is stored in terms of the block number the structure lives
+in and not the absolute offset on disk.
 
 For 32-bit filesystems, limits are as follows:
 
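Structures are located by block number, so turning a location into a byte offset means multiplying by the block size. The superblock encodes that size as 2 ^ (10 + s_log_block_size). The sketch below is minimal and uses hypothetical helper names. It uses 64-bit arithmetic throughout, because block numbers can exceed 2^32 with the 64bit feature.

```c
#include <assert.h>
#include <stdint.h>

/* Block size in bytes, as encoded by the superblock's s_log_block_size. */
static uint64_t block_size(uint32_t log_block_size)
{
	return UINT64_C(1024) << log_block_size;
}

/* Byte offset on the device of the first byte of block `blk`.
 * Even a 32-bit block number overflows a 32-bit offset once it is
 * multiplied by the block size, so the product is kept in 64 bits. */
static uint64_t block_to_offset(uint64_t blk, uint32_t log_block_size)
{
	return blk * block_size(log_block_size);
}
```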

+1 -1

Documentation/filesystems/ext4/directory.rst

···
      - File name.
 
 Since file names cannot be longer than 255 bytes, the new directory
-entry format shortens the rec\_len field and uses the space for a file
+entry format shortens the name\_len field and uses the space for a file
 type flag, probably to avoid having to load every inode during directory
 tree traversal. This format is ``ext4_dir_entry_2``, which is at most
 263 bytes long, though on disk you'll need to reference
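The corrected sentence is easier to see with the two layouts side by side. The sketch below mirrors the field order of ext4_dir_entry and ext4_dir_entry_2: inode, rec_len, then the name length. It uses fixed-width host types instead of the kernel's __le types, so real on-disk values would still need little-endian conversion. The struct names are hypothetical. The 255-byte limit is what the kernel calls EXT4_NAME_LEN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 255 /* longest file name in bytes (EXT4_NAME_LEN) */

/* Original format: 16-bit name length. */
struct dir_entry {
	uint32_t inode;    /* inode number */
	uint16_t rec_len;  /* length of this directory record */
	uint16_t name_len; /* name length */
	char name[NAME_LEN];
};

/* ext4_dir_entry_2-style format: name_len shrinks to 8 bits and the
 * freed byte carries a file type code. */
struct dir_entry_2 {
	uint32_t inode;
	uint16_t rec_len;
	uint8_t name_len;
	uint8_t file_type;
	char name[NAME_LEN];
};

/* Both headers are 8 bytes; only the meaning of bytes 6-7 differs. */
_Static_assert(offsetof(struct dir_entry, name) == 8, "8-byte header");
_Static_assert(offsetof(struct dir_entry_2, name) == 8, "8-byte header");
_Static_assert(offsetof(struct dir_entry_2, file_type) == 7, "type byte");

/* Largest on-disk record: header plus longest name. sizeof() may be
 * larger than this because of trailing struct padding. */
static size_t max_entry_bytes(void)
{
	return offsetof(struct dir_entry_2, name) + NAME_LEN;
}
```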

+6 -3

Documentation/filesystems/ext4/group_descr.rst

···
    * - 0x1E
      - \_\_le16
      - bg\_checksum
-     - Group descriptor checksum; crc16(sb\_uuid+group+desc) if the
-       RO\_COMPAT\_GDT\_CSUM feature is set, or crc32c(sb\_uuid+group\_desc) &
-       0xFFFF if the RO\_COMPAT\_METADATA\_CSUM feature is set.
+     - Group descriptor checksum; crc16(sb\_uuid+group\_num+bg\_desc) if the
+       RO\_COMPAT\_GDT\_CSUM feature is set, or
+       crc32c(sb\_uuid+group\_num+bg\_desc) & 0xFFFF if the
+       RO\_COMPAT\_METADATA\_CSUM feature is set. The bg\_checksum
+       field in bg\_desc is skipped when calculating crc16 checksum,
+       and set to zero if crc32c checksum is used.
    * -
      -
      -
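The metadata_csum rule in the corrected row can be sketched as follows. The checksum is crc32c over the UUID, the group number, and the descriptor with bg_checksum zeroed, keeping the low 16 bits. The sketch rests on several assumptions:

- a bit-at-a-time CRC32C (Castagnoli, reflected polynomial 0x82F63B78), chained from ~0 with no final inversion;
- a little-endian 32-bit group number;
- bg_checksum at offset 0x1E, as in the table.

It is illustrative only and should be checked against real metadata before anyone relies on it. The crc16 path used with GDT_CSUM is not shown. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BG_CHECKSUM_OFFSET 0x1E /* offset of bg_checksum in a descriptor */
#define MAX_DESC_SIZE 64

/* Bitwise CRC32C, reflected polynomial 0x82F63B78. No final inversion,
 * so the running value can be chained across several buffers. */
static uint32_t crc32c_raw(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/* crc32c(uuid + le32 group number + descriptor with bg_checksum = 0),
 * truncated to 16 bits. */
static uint16_t gd_csum(const uint8_t uuid[16], uint32_t group,
			const uint8_t *desc, size_t desc_size)
{
	uint8_t le_group[4] = {
		(uint8_t)group, (uint8_t)(group >> 8),
		(uint8_t)(group >> 16), (uint8_t)(group >> 24)
	};
	uint8_t copy[MAX_DESC_SIZE];
	uint32_t crc;

	assert(desc_size >= BG_CHECKSUM_OFFSET + 2 && desc_size <= MAX_DESC_SIZE);
	memcpy(copy, desc, desc_size);
	copy[BG_CHECKSUM_OFFSET] = 0;     /* the stored checksum is not */
	copy[BG_CHECKSUM_OFFSET + 1] = 0; /* part of its own input      */

	crc = crc32c_raw(~0u, uuid, 16);
	crc = crc32c_raw(crc, le_group, sizeof(le_group));
	crc = crc32c_raw(crc, copy, desc_size);
	return (uint16_t)(crc & 0xFFFF);
}
```

Because the stored bg_checksum bytes are zeroed before hashing, rewriting them does not change the computed value. A verifier can therefore compute the checksum over the descriptor exactly as read from disk.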

+2 -2

Documentation/filesystems/ext4/inodes.rst

···
 having to upgrade all of the on-disk inodes. Access to fields beyond
 EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
 ``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
-of October 2013) the inode structure is 156 bytes
-(``i_extra_isize = 28``). The extra space between the end of the inode
+of August 2019) the inode structure is 160 bytes
+(``i_extra_isize = 32``). The extra space between the end of the inode
 structure and the end of the inode record can be used to store extended
 attributes. Each inode record can be as large as the filesystem block
 size, though this is not terribly efficient.
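The updated numbers fit together as 128 + 32 = 160 bytes of inode structure. In a 256-byte record, that leaves 96 bytes for in-inode extended attributes. The bounds check the paragraph asks for works like this: a field beyond EXT2_GOOD_OLD_INODE_SIZE is usable only if it ends within 128 + i_extra_isize. The kernel expresses this idea with its EXT4_FITS_IN_INODE macro. The helpers below are a hypothetical standalone version of the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GOOD_OLD_INODE_SIZE 128 /* EXT2_GOOD_OLD_INODE_SIZE */

/* True if a field at byte offset `off` of `size` bytes lies inside the
 * part of the inode this inode actually has (128 + i_extra_isize). */
static bool field_fits(size_t off, size_t size, unsigned extra_isize)
{
	return off + size <= GOOD_OLD_INODE_SIZE + extra_isize;
}

/* Bytes between the end of the inode structure and the end of the
 * inode record, usable for in-inode extended attributes. */
static size_t inline_xattr_space(size_t inode_record_size, unsigned extra_isize)
{
	return inode_record_size - (GOOD_OLD_INODE_SIZE + extra_isize);
}
```

For example, i_projid is a 4-byte field at offset 0x9C and ends exactly at byte 160. It therefore fits when i_extra_isize is 32 but not when it is 28.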

+14 -6

Documentation/filesystems/ext4/super.rst

···
    * - 0x1C
      - \_\_le32
      - s\_log\_cluster\_size
-     - Cluster size is (2 ^ s\_log\_cluster\_size) blocks if bigalloc is
+     - Cluster size is 2 ^ (10 + s\_log\_cluster\_size) blocks if bigalloc is
        enabled. Otherwise s\_log\_cluster\_size must equal s\_log\_block\_size.
    * - 0x20
      - \_\_le32
···
      - Upper 8 bits of the s_wtime field.
    * - 0x275
      - \_\_u8
-     - s\_wtime_hi
+     - s\_mtime_hi
      - Upper 8 bits of the s_mtime field.
    * - 0x276
      - \_\_u8
···
      - s\_last_error_time_hi
      - Upper 8 bits of the s_last_error_time_hi field.
    * - 0x27A
-     - \_\_u8[2]
-     - s\_pad
+     - \_\_u8
+     - s\_pad[2]
      - Zero padding.
    * - 0x27C
+     - \_\_le16
+     - s\_encoding
+     - Filename charset encoding.
+   * - 0x27E
+     - \_\_le16
+     - s\_encoding_flags
+     - Filename charset encoding flags.
+   * - 0x280
      - \_\_le32
-     - s\_reserved[96]
+     - s\_reserved[95]
      - Padding to the end of the block.
    * - 0x3FC
      - \_\_le32
···
    * - 0x80
      - Enable a filesystem size of 2^64 blocks (INCOMPAT\_64BIT).
    * - 0x100
-     - Multiple mount protection. Not implemented (INCOMPAT\_MMP).
+     - Multiple mount protection (INCOMPAT\_MMP).
    * - 0x200
      - Flexible block groups. See the earlier discussion of this feature
        (INCOMPAT\_FLEX\_BG).
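The new superblock rows can be cross-checked with offsetof. s_encoding and s_encoding_flags take the first four bytes of what used to be s_reserved. The reserved array therefore shrinks from 96 to 95 __le32 words and still ends at 0x3FC, where s_checksum begins. The struct below models only the tail of the superblock. For brevity, everything before 0x27A is collapsed into a placeholder array, which is an assumption of this sketch.

The sketch also covers the s_log_cluster_size row. The kernel's mount code derives the cluster size as 1024 << s_log_cluster_size, which is a size in bytes, and derives the blocks-per-cluster ratio from the difference of the two log fields. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tail of the ext4 superblock after this change. Fixed-width types
 * stand in for __u8/__le16/__le32; fields before 0x27A are elided. */
struct sb_tail {
	uint8_t  head[0x27A];       /* everything up to s_pad */
	uint8_t  s_pad[2];          /* 0x27A */
	uint16_t s_encoding;        /* 0x27C: filename charset encoding */
	uint16_t s_encoding_flags;  /* 0x27E */
	uint32_t s_reserved[95];    /* 0x280 */
	uint32_t s_checksum;        /* 0x3FC */
};

_Static_assert(offsetof(struct sb_tail, s_encoding) == 0x27C, "s_encoding");
_Static_assert(offsetof(struct sb_tail, s_encoding_flags) == 0x27E, "flags");
_Static_assert(offsetof(struct sb_tail, s_reserved) == 0x280, "reserved");
_Static_assert(offsetof(struct sb_tail, s_checksum) == 0x3FC, "checksum");
_Static_assert(sizeof(struct sb_tail) == 1024, "superblock is 1024 bytes");

/* Cluster size in bytes from s_log_cluster_size. */
static uint64_t cluster_bytes(uint32_t log_cluster_size)
{
	return UINT64_C(1024) << log_cluster_size;
}

/* Blocks per cluster (the kernel's s_cluster_ratio). */
static uint32_t cluster_ratio(uint32_t log_block_size, uint32_t log_cluster_size)
{
	assert(log_cluster_size >= log_block_size);
	return UINT32_C(1) << (log_cluster_size - log_block_size);
}
```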

+142 -55

fs/ext4/block_validity.c

···
 
 void ext4_exit_system_zone(void)
 {
+	rcu_barrier();
 	kmem_cache_destroy(ext4_system_zone_cachep);
 }
 
···
 	return 0;
 }
 
+static void release_system_zone(struct ext4_system_blocks *system_blks)
+{
+	struct ext4_system_zone	*entry, *n;
+
+	rbtree_postorder_for_each_entry_safe(entry, n,
+				&system_blks->root, node)
+		kmem_cache_free(ext4_system_zone_cachep, entry);
+}
+
 /*
  * Mark a range of blocks as belonging to the "system zone" --- that
  * is, filesystem metadata blocks which should never be used by
  * inodes.
  */
-static int add_system_zone(struct ext4_sb_info *sbi,
+static int add_system_zone(struct ext4_system_blocks *system_blks,
 			   ext4_fsblk_t start_blk,
 			   unsigned int count)
 {
 	struct ext4_system_zone *new_entry = NULL, *entry;
-	struct rb_node **n = &sbi->system_blks.rb_node, *node;
+	struct rb_node **n = &system_blks->root.rb_node, *node;
 	struct rb_node *parent = NULL, *new_node = NULL;
 
 	while (*n) {
···
 		new_node = &new_entry->node;
 
 		rb_link_node(new_node, parent, n);
-		rb_insert_color(new_node, &sbi->system_blks);
+		rb_insert_color(new_node, &system_blks->root);
 	}
 
 	/* Can we merge to the left? */
···
 		if (can_merge(entry, new_entry)) {
 			new_entry->start_blk = entry->start_blk;
 			new_entry->count += entry->count;
-			rb_erase(node, &sbi->system_blks);
+			rb_erase(node, &system_blks->root);
 			kmem_cache_free(ext4_system_zone_cachep, entry);
 		}
 	}
···
 		entry = rb_entry(node, struct ext4_system_zone, node);
 		if (can_merge(new_entry, entry)) {
 			new_entry->count += entry->count;
-			rb_erase(node, &sbi->system_blks);
+			rb_erase(node, &system_blks->root);
 			kmem_cache_free(ext4_system_zone_cachep, entry);
 		}
 	}
···
 	int first = 1;
 
 	printk(KERN_INFO "System zones: ");
-	node = rb_first(&sbi->system_blks);
+	node = rb_first(&sbi->system_blks->root);
 	while (node) {
 		entry = rb_entry(node, struct ext4_system_zone, node);
 		printk(KERN_CONT "%s%llu-%llu", first ? "" : ", ",
···
 	printk(KERN_CONT "\n");
 }
 
-static int ext4_protect_reserved_inode(struct super_block *sb, u32 ino)
+/*
+ * Returns 1 if the passed-in block region (start_blk,
+ * start_blk+count) is valid; 0 if some part of the block region
+ * overlaps with filesystem metadata blocks.
+ */
+static int ext4_data_block_valid_rcu(struct ext4_sb_info *sbi,
+				     struct ext4_system_blocks *system_blks,
+				     ext4_fsblk_t start_blk,
+				     unsigned int count)
+{
+	struct ext4_system_zone *entry;
+	struct rb_node *n;
+
+	if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
+	    (start_blk + count < start_blk) ||
+	    (start_blk + count > ext4_blocks_count(sbi->s_es))) {
+		sbi->s_es->s_last_error_block = cpu_to_le64(start_blk);
+		return 0;
+	}
+
+	if (system_blks == NULL)
+		return 1;
+
+	n = system_blks->root.rb_node;
+	while (n) {
+		entry = rb_entry(n, struct ext4_system_zone, node);
+		if (start_blk + count - 1 < entry->start_blk)
+			n = n->rb_left;
+		else if (start_blk >= (entry->start_blk + entry->count))
+			n = n->rb_right;
+		else {
+			sbi->s_es->s_last_error_block = cpu_to_le64(start_blk);
+			return 0;
+		}
+	}
+	return 1;
+}
+
+static int ext4_protect_reserved_inode(struct super_block *sb,
+				       struct ext4_system_blocks *system_blks,
+				       u32 ino)
 {
 	struct inode *inode;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
···
 		if (n == 0) {
 			i++;
 		} else {
-			if (!ext4_data_block_valid(sbi, map.m_pblk, n)) {
+			if (!ext4_data_block_valid_rcu(sbi, system_blks,
+						map.m_pblk, n)) {
 				ext4_error(sb, "blocks %llu-%llu from inode %u "
 					   "overlap system zone", map.m_pblk,
 					   map.m_pblk + map.m_len - 1, ino);
 				err = -EFSCORRUPTED;
 				break;
 			}
-			err = add_system_zone(sbi, map.m_pblk, n);
+			err = add_system_zone(system_blks, map.m_pblk, n);
 			if (err < 0)
 				break;
 			i += n;
···
 	return err;
 }
 
+static void ext4_destroy_system_zone(struct rcu_head *rcu)
+{
+	struct ext4_system_blocks *system_blks;
+
+	system_blks = container_of(rcu, struct ext4_system_blocks, rcu);
+	release_system_zone(system_blks);
+	kfree(system_blks);
+}
+
+/*
+ * Build system zone rbtree which is used for block validity checking.
+ *
+ * The update of system_blks pointer in this function is protected by
+ * sb->s_umount semaphore. However we have to be careful as we can be
+ * racing with ext4_data_block_valid() calls reading system_blks rbtree
+ * protected only by RCU. That's why we first build the rbtree and then
+ * swap it in place.
+ */
 int ext4_setup_system_zone(struct super_block *sb)
 {
 	ext4_group_t ngroups = ext4_get_groups_count(sb);
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_system_blocks *system_blks;
 	struct ext4_group_desc *gdp;
 	ext4_group_t i;
 	int flex_size = ext4_flex_bg_size(sbi);
 	int ret;
 
 	if (!test_opt(sb, BLOCK_VALIDITY)) {
-		if (sbi->system_blks.rb_node)
+		if (sbi->system_blks)
 			ext4_release_system_zone(sb);
 		return 0;
 	}
-	if (sbi->system_blks.rb_node)
+	if (sbi->system_blks)
 		return 0;
+
+	system_blks = kzalloc(sizeof(*system_blks), GFP_KERNEL);
+	if (!system_blks)
+		return -ENOMEM;
 
 	for (i=0; i < ngroups; i++) {
 		cond_resched();
 		if (ext4_bg_has_super(sb, i) &&
 		    ((i < 5) || ((i % flex_size) == 0)))
-			add_system_zone(sbi, ext4_group_first_block_no(sb, i),
+			add_system_zone(system_blks,
+					ext4_group_first_block_no(sb, i),
 					ext4_bg_num_gdb(sb, i) + 1);
 		gdp = ext4_get_group_desc(sb, i, NULL);
-		ret = add_system_zone(sbi, ext4_block_bitmap(sb, gdp), 1);
+		ret = add_system_zone(system_blks,
+				ext4_block_bitmap(sb, gdp), 1);
 		if (ret)
-			return ret;
-		ret = add_system_zone(sbi, ext4_inode_bitmap(sb, gdp), 1);
+			goto err;
+		ret = add_system_zone(system_blks,
+				ext4_inode_bitmap(sb, gdp), 1);
 		if (ret)
-			return ret;
-		ret = add_system_zone(sbi, ext4_inode_table(sb, gdp),
+			goto err;
+		ret = add_system_zone(system_blks,
+				ext4_inode_table(sb, gdp),
 				sbi->s_itb_per_group);
 		if (ret)
-			return ret;
+			goto err;
 	}
 	if (ext4_has_feature_journal(sb) && sbi->s_es->s_journal_inum) {
-		ret = ext4_protect_reserved_inode(sb,
+		ret = ext4_protect_reserved_inode(sb, system_blks,
 				le32_to_cpu(sbi->s_es->s_journal_inum));
 		if (ret)
-			return ret;
+			goto err;
 	}
+
+	/*
+	 * System blks rbtree complete, announce it once to prevent racing
+	 * with ext4_data_block_valid() accessing the rbtree at the same
+	 * time.
+	 */
+	rcu_assign_pointer(sbi->system_blks, system_blks);
 
 	if (test_opt(sb, DEBUG))
 		debug_print_tree(sbi);
 	return 0;
-}
-
-/* Called when the filesystem is unmounted */
-void ext4_release_system_zone(struct super_block *sb)
-{
-	struct ext4_system_zone	*entry, *n;
-
-	rbtree_postorder_for_each_entry_safe(entry, n,
-			&EXT4_SB(sb)->system_blks, node)
-		kmem_cache_free(ext4_system_zone_cachep, entry);
-
-	EXT4_SB(sb)->system_blks = RB_ROOT;
+err:
+	release_system_zone(system_blks);
+	kfree(system_blks);
+	return ret;
 }
 
 /*
- * Returns 1 if the passed-in block region (start_blk,
- * start_blk+count) is valid; 0 if some part of the block region
- * overlaps with filesystem metadata blocks.
+ * Called when the filesystem is unmounted or when remounting it with
+ * noblock_validity specified.
+ *
+ * The update of system_blks pointer in this function is protected by
+ * sb->s_umount semaphore. However we have to be careful as we can be
+ * racing with ext4_data_block_valid() calls reading system_blks rbtree
+ * protected only by RCU. So we first clear the system_blks pointer and
+ * then free the rbtree only after RCU grace period expires.
  */
+void ext4_release_system_zone(struct super_block *sb)
+{
+	struct ext4_system_blocks *system_blks;
+
+	system_blks = rcu_dereference_protected(EXT4_SB(sb)->system_blks,
+					lockdep_is_held(&sb->s_umount));
+	rcu_assign_pointer(EXT4_SB(sb)->system_blks, NULL);
+
+	if (system_blks)
+		call_rcu(&system_blks->rcu, ext4_destroy_system_zone);
+}
+
 int ext4_data_block_valid(struct ext4_sb_info *sbi, ext4_fsblk_t start_blk,
 			  unsigned int count)
 {
-	struct ext4_system_zone *entry;
-	struct rb_node *n = sbi->system_blks.rb_node;
+	struct ext4_system_blocks *system_blks;
+	int ret;
 
-	if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
-	    (start_blk + count < start_blk) ||
-	    (start_blk + count > ext4_blocks_count(sbi->s_es))) {
-		sbi->s_es->s_last_error_block = cpu_to_le64(start_blk);
-		return 0;
-	}
-	while (n) {
-		entry = rb_entry(n, struct ext4_system_zone, node);
-		if (start_blk + count - 1 < entry->start_blk)
-			n = n->rb_left;
-		else if (start_blk >= (entry->start_blk + entry->count))
-			n = n->rb_right;
-		else {
-			sbi->s_es->s_last_error_block = cpu_to_le64(start_blk);
-			return 0;
-		}
-	}
-	return 1;
+	/*
+	 * Lock the system zone to prevent it being released concurrently
+	 * when doing a remount which inverse current "[no]block_validity"
+	 * mount option.
+	 */
+	rcu_read_lock();
+	system_blks = rcu_dereference(sbi->system_blks);
+	ret = ext4_data_block_valid_rcu(sbi, system_blks, start_blk,
+					count);
+	rcu_read_unlock();
+	return ret;
 }
 
 int ext4_check_blockref(const char *function, unsigned int line,

+4 -3

fs/ext4/dir.c

···
 			  const char *str, const struct qstr *name)
 {
 	struct qstr qstr = {.name = str, .len = len };
+	struct inode *inode = dentry->d_parent->d_inode;
 
-	if (!IS_CASEFOLDED(dentry->d_parent->d_inode)) {
+	if (!IS_CASEFOLDED(inode) || !EXT4_SB(inode->i_sb)->s_encoding) {
 		if (len != name->len)
 			return -1;
 		return memcmp(str, name->name, len);
 	}
 
-	return ext4_ci_compare(dentry->d_parent->d_inode, name, &qstr, false);
+	return ext4_ci_compare(inode, name, &qstr, false);
 }
 
 static int ext4_d_hash(const struct dentry *dentry, struct qstr *str)
···
 	unsigned char *norm;
 	int len, ret = 0;
 
-	if (!IS_CASEFOLDED(dentry->d_inode))
+	if (!IS_CASEFOLDED(dentry->d_inode) || !um)
 		return 0;
 
 	norm = kmalloc(PATH_MAX, GFP_ATOMIC);

+49 -15

fs/ext4/ext4.h

···
 };
 
 /*
+ * Block validity checking, system zone rbtree.
+ */
+struct ext4_system_blocks {
+	struct rb_root root;
+	struct rcu_head rcu;
+};
+
+/*
  * Flags for ext4_io_end->flags
  */
 #define	EXT4_IO_END_UNWRITTEN	0x0001
···
 				 ~((ext4_fsblk_t) (s)->s_cluster_ratio - 1))
 #define EXT4_LBLK_CMASK(s, lblk) ((lblk) &				\
 				  ~((ext4_lblk_t) (s)->s_cluster_ratio - 1))
+/* Fill in the low bits to get the last block of the cluster */
+#define EXT4_LBLK_CFILL(sbi, lblk) ((lblk) |				\
+				    ((ext4_lblk_t) (sbi)->s_cluster_ratio - 1))
 /* Get the cluster offset */
 #define EXT4_PBLK_COFF(s, pblk) ((pblk) &				\
 				 ((ext4_fsblk_t) (s)->s_cluster_ratio - 1))
···
 #define EXT4_IOC_SET_ENCRYPTION_POLICY	FS_IOC_SET_ENCRYPTION_POLICY
 #define EXT4_IOC_GET_ENCRYPTION_PWSALT	FS_IOC_GET_ENCRYPTION_PWSALT
 #define EXT4_IOC_GET_ENCRYPTION_POLICY	FS_IOC_GET_ENCRYPTION_POLICY
+/* ioctl codes 19--39 are reserved for fscrypt */
+#define EXT4_IOC_CLEAR_ES_CACHE		_IO('f', 40)
+#define EXT4_IOC_GETSTATE		_IOW('f', 41, __u32)
+#define EXT4_IOC_GET_ES_CACHE		_IOWR('f', 42, struct fiemap)
 
 #define EXT4_IOC_FSGETXATTR		FS_IOC_FSGETXATTR
 #define EXT4_IOC_FSSETXATTR		FS_IOC_FSSETXATTR
···
 #define EXT4_GOING_FLAGS_LOGFLUSH		0x1	/* flush log but not data */
 #define EXT4_GOING_FLAGS_NOLOGFLUSH		0x2	/* don't flush log nor data */
 
+/*
+ * Flags returned by EXT4_IOC_GETSTATE
+ *
+ * We only expose to userspace a subset of the state flags in
+ * i_state_flags
+ */
+#define EXT4_STATE_FLAG_EXT_PRECACHED	0x00000001
+#define EXT4_STATE_FLAG_NEW		0x00000002
+#define EXT4_STATE_FLAG_NEWENTRY	0x00000004
+#define EXT4_STATE_FLAG_DA_ALLOC_CLOSE	0x00000008
 
 #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
 /*
···
 #define EXT4_IOC32_GETVERSION_OLD	FS_IOC32_GETVERSION
 #define EXT4_IOC32_SETVERSION_OLD	FS_IOC32_SETVERSION
 #endif
+
+/*
+ * Returned by EXT4_IOC_GET_ES_CACHE as an additional possible flag.
+ * It indicates that the entry in extent status cache is for a hole.
+ */
+#define EXT4_FIEMAP_EXTENT_HOLE		0x08000000
 
 /* Max physical block we can address w/o extents */
 #define EXT4_MAX_BLOCK_FILE_PHYS	0xFFFFFFFF
···
 static inline void ext4_decode_extra_time(struct timespec64 *time,
 					  __le32 extra)
 {
-	if (unlikely(extra & cpu_to_le32(EXT4_EPOCH_MASK))) {
-
-#if 1
-		/* Handle legacy encoding of pre-1970 dates with epoch
-		 * bits 1,1. (This backwards compatibility may be removed
-		 * at the discretion of the ext4 developers.)
-		 */
-		u64 extra_bits = le32_to_cpu(extra) & EXT4_EPOCH_MASK;
-		if (extra_bits == 3 && ((time->tv_sec) & 0x80000000) != 0)
-			extra_bits = 0;
-		time->tv_sec += extra_bits << 32;
-#else
+	if (unlikely(extra & cpu_to_le32(EXT4_EPOCH_MASK)))
 		time->tv_sec += (u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK) << 32;
-#endif
-	}
 	time->tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> EXT4_EPOCH_BITS;
 }
 
···
 	int s_jquota_fmt;			/* Format of quota to use */
 #endif
 	unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */
-	struct rb_root system_blks;
+	struct ext4_system_blocks __rcu *system_blks;
 
 #ifdef EXTENTS_STATS
 	/* ext4 extents stats */
···
 extern ext4_lblk_t ext4_ext_next_allocated_block(struct ext4_ext_path *path);
 extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			__u64 start, __u64 len);
+extern int ext4_get_es_cache(struct inode *inode,
+			     struct fiemap_extent_info *fieinfo,
+			     __u64 start, __u64 len);
 extern int ext4_ext_precache(struct inode *inode);
 extern int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len);
 extern int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len);
···
 }
 
 extern const struct iomap_ops ext4_iomap_ops;
+
+static inline int ext4_buffer_uptodate(struct buffer_head *bh)
+{
+	/*
+	 * If the buffer has the write error flag, we have failed
+	 * to write out data in the block. In this case, we don't
+	 * have to read the block because we may read the old data
+	 * successfully.
+	 */
+	if (!buffer_uptodate(bh) && buffer_write_io_error(bh))
+		set_buffer_uptodate(bh);
+	return buffer_uptodate(bh);
+}
 
 #endif	/* __KERNEL__ */
 
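With the legacy pre-1970 special case removed, ext4_decode_extra_time() splits the extra word into two parts:

- the low two bits (EXT4_EPOCH_BITS = 2) are an epoch extension added above bit 32 of the seconds value;
- the remaining 30 bits are nanoseconds.

A userspace sketch follows. It assumes the 32-bit on-disk seconds field has already been sign-extended, as the inode time accessors do. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EPOCH_BITS 2                        /* EXT4_EPOCH_BITS */
#define EPOCH_MASK ((1u << EPOCH_BITS) - 1) /* low 2 bits: epoch */
#define NSEC_MASK  (~0u << EPOCH_BITS)      /* high 30 bits: nanoseconds */

struct ts {
	int64_t sec;
	uint32_t nsec;
};

/* Decode a 32-bit on-disk seconds value plus its *_extra word. */
static struct ts decode_time(uint32_t raw_sec, uint32_t extra)
{
	struct ts t;

	t.sec = (int32_t)raw_sec;                     /* sign-extend base */
	t.sec += (int64_t)(extra & EPOCH_MASK) << 32; /* add epoch bits */
	t.nsec = (extra & NSEC_MASK) >> EPOCH_BITS;
	return t;
}
```

With raw seconds 0x80000000 and epoch bits 1, the result is 2^31 (early 2038) rather than -2^31. Epoch bits 3 on a negative base are no longer treated specially. For example, `decode_time(0xFFFFFFFF, 3)` now yields 3 * 2^32 - 1, where the removed `#if 1` branch would have returned -1.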

+88 -10

fs/ext4/extents.c

···
 	return err;
 }
 
+static int ext4_fill_es_cache_info(struct inode *inode,
+				   ext4_lblk_t block, ext4_lblk_t num,
+				   struct fiemap_extent_info *fieinfo)
+{
+	ext4_lblk_t next, end = block + num - 1;
+	struct extent_status es;
+	unsigned char blksize_bits = inode->i_sb->s_blocksize_bits;
+	unsigned int flags;
+	int err;
+
+	while (block <= end) {
+		next = 0;
+		flags = 0;
+		if (!ext4_es_lookup_extent(inode, block, &next, &es))
+			break;
+		if (ext4_es_is_unwritten(&es))
+			flags |= FIEMAP_EXTENT_UNWRITTEN;
+		if (ext4_es_is_delayed(&es))
+			flags |= (FIEMAP_EXTENT_DELALLOC |
+				  FIEMAP_EXTENT_UNKNOWN);
+		if (ext4_es_is_hole(&es))
+			flags |= EXT4_FIEMAP_EXTENT_HOLE;
+		if (next == 0)
+			flags |= FIEMAP_EXTENT_LAST;
+		if (flags & (FIEMAP_EXTENT_DELALLOC|
+			     EXT4_FIEMAP_EXTENT_HOLE))
+			es.es_pblk = 0;
+		else
+			es.es_pblk = ext4_es_pblock(&es);
+		err = fiemap_fill_next_extent(fieinfo,
+				(__u64)es.es_lblk << blksize_bits,
+				(__u64)es.es_pblk << blksize_bits,
+				(__u64)es.es_len << blksize_bits,
+				flags);
+		if (next == 0)
+			break;
+		block = next;
+		if (err < 0)
+			return err;
+		if (err == 1)
+			return 0;
+	}
+	return 0;
+}
+
+
 /*
  * ext4_ext_determine_hole - determine hole around given block
  * @inode: inode we lookup in
···
 	 * illegal.
 	 */
 	if (ee_block != map->m_lblk || ee_len > map->m_len) {
-#ifdef EXT4_DEBUG
-		ext4_warning("Inode (%ld) finished: extent logical block %llu,"
+#ifdef CONFIG_EXT4_DEBUG
+		ext4_warning(inode->i_sb, "Inode (%ld) finished: extent logical block %llu,"
 			     " len %u; IO logical block %llu, len %u",
 			     inode->i_ino, (unsigned long long)ee_block, ee_len,
 			     (unsigned long long)map->m_lblk, map->m_len);
···
 
 	return next_del;
 }
-/* fiemap flags we can handle specified here */
-#define EXT4_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR)
 
 static int ext4_xattr_fiemap(struct inode *inode,
 				struct fiemap_extent_info *fieinfo)
···
 	return (error < 0 ? error : 0);
 }
 
-int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-		__u64 start, __u64 len)
+static int _ext4_fiemap(struct inode *inode,
+			struct fiemap_extent_info *fieinfo,
+			__u64 start, __u64 len,
+			int (*fill)(struct inode *, ext4_lblk_t,
+				    ext4_lblk_t,
+				    struct fiemap_extent_info *))
 {
 	ext4_lblk_t start_blk;
+	u32 ext4_fiemap_flags = FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR;
+
 	int error = 0;
 
 	if (ext4_has_inline_data(inode)) {
···
 		error = ext4_ext_precache(inode);
 		if (error)
 			return error;
+		fieinfo->fi_flags &= ~FIEMAP_FLAG_CACHE;
 	}
 
 	/* fallback to generic here if not in extents fmt */
-	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) &&
+		fill == ext4_fill_fiemap_extents)
 		return generic_block_fiemap(inode, fieinfo, start, len,
 			ext4_get_block);
 
-	if (fiemap_check_flags(fieinfo, EXT4_FIEMAP_FLAGS))
+	if (fill == ext4_fill_es_cache_info)
+		ext4_fiemap_flags &= FIEMAP_FLAG_XATTR;
+	if (fiemap_check_flags(fieinfo, ext4_fiemap_flags))
 		return -EBADR;
 
 	if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) {
···
 		 * Walk the extent tree gathering extent information
 		 * and pushing extents back to the user.
 		 */
-		error = ext4_fill_fiemap_extents(inode, start_blk,
-						 len_blks, fieinfo);
+		error = fill(inode, start_blk, len_blks, fieinfo);
 	}
 	return error;
 }
+
+int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		__u64 start, __u64 len)
+{
+	return _ext4_fiemap(inode, fieinfo, start, len,
+			    ext4_fill_fiemap_extents);
+}
+
+int ext4_get_es_cache(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		      __u64 start, __u64 len)
+{
+	if (ext4_has_inline_data(inode)) {
+		int has_inline;
+
+		down_read(&EXT4_I(inode)->xattr_sem);
+		has_inline = ext4_has_inline_data(inode);
+		up_read(&EXT4_I(inode)->xattr_sem);
+		if (has_inline)
+			return 0;
+	}
+
+	return _ext4_fiemap(inode, fieinfo, start, len,
+			    ext4_fill_es_cache_info);
+}
+
 
 /*
  * ext4_access_path:
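The per-extent flag logic in ext4_fill_es_cache_info() can be isolated as a pure function. In the sketch below, the extent-status predicates arrive as booleans, which are hypothetical stand-ins for ext4_es_is_unwritten() and related helpers. The sketch mirrors the patch's choices, including reporting a physical address of 0 for delayed and hole entries. The FIEMAP_* values match the UAPI header <linux/fiemap.h> and are redefined locally so the sketch builds on its own. EXT4_FIEMAP_EXTENT_HOLE is the ext4-specific value added above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same values as <linux/fiemap.h>; HOLE is the ext4-specific flag. */
#define FIEMAP_EXTENT_LAST       0x00000001
#define FIEMAP_EXTENT_UNKNOWN    0x00000002
#define FIEMAP_EXTENT_DELALLOC   0x00000004
#define FIEMAP_EXTENT_UNWRITTEN  0x00000800
#define EXT4_FIEMAP_EXTENT_HOLE  0x08000000

/* Flags for one cached extent. `next` is the first block of the
 * following cached extent, or 0 if none, as next_lblk reports it. */
static uint32_t es_fiemap_flags(bool unwritten, bool delayed, bool hole,
				uint32_t next)
{
	uint32_t flags = 0;

	if (unwritten)
		flags |= FIEMAP_EXTENT_UNWRITTEN;
	if (delayed)
		flags |= FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN;
	if (hole)
		flags |= EXT4_FIEMAP_EXTENT_HOLE;
	if (next == 0)
		flags |= FIEMAP_EXTENT_LAST;
	return flags;
}

/* Physical byte address to report: none for delayed or hole extents. */
static uint64_t es_fiemap_physical(uint32_t flags, uint64_t pblk,
				   unsigned blksize_bits)
{
	if (flags & (FIEMAP_EXTENT_DELALLOC | EXT4_FIEMAP_EXTENT_HOLE))
		return 0;
	return pblk << blksize_bits;
}
```

Keeping the mapping separate from the tree walk mirrors the refactor in the diff. _ext4_fiemap() takes the fill routine as a callback, so the extent-tree walk and the extent-status-cache walk share the same argument checking.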

+411 -110

fs/ext4/extents_status.c

···
 
 static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end);
+			      ext4_lblk_t end, int *reserved);
 static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei);
···
 	ext4_es_insert_extent_check(inode, &newes);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end);
+	err = __es_remove_extent(inode, lblk, end, NULL);
 	if (err != 0)
 		goto error;
 retry:
···
  * Return: 1 on found, 0 on not
  */
 int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
+			  ext4_lblk_t *next_lblk,
 			  struct extent_status *es)
 {
 	struct ext4_es_tree *tree;
···
 		es->es_pblk = es1->es_pblk;
 		if (!ext4_es_is_referenced(es1))
 			ext4_es_set_referenced(es1);
-		stats->es_stats_cache_hits++;
+		percpu_counter_inc(&stats->es_stats_cache_hits);
+		if (next_lblk) {
+			node = rb_next(&es1->rb_node);
+			if (node) {
+				es1 = rb_entry(node, struct extent_status,
+					       rb_node);
+				*next_lblk = es1->es_lblk;
+			} else
+				*next_lblk = 0;
+		}
 	} else {
-		stats->es_stats_cache_misses++;
+		percpu_counter_inc(&stats->es_stats_cache_misses);
 	}
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
···
 	return found;
 }
 
+struct rsvd_count {
+	int ndelonly;
+	bool first_do_lblk_found;
+	ext4_lblk_t first_do_lblk;
+	ext4_lblk_t last_do_lblk;
+	struct extent_status *left_es;
+	bool partial;
+	ext4_lblk_t lclu;
+};
+
+/*
+ * init_rsvd - initialize reserved count data before removing block range
+ *	       in file from extent status tree
+ *
+ * @inode - file containing range
+ * @lblk - first block in range
+ * @es - pointer to first extent in range
+ * @rc - pointer to reserved count data
+ *
+ * Assumes es is not NULL
+ */
+static void init_rsvd(struct inode *inode, ext4_lblk_t lblk,
+		      struct extent_status *es, struct rsvd_count *rc)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct rb_node *node;
+
+	rc->ndelonly = 0;
+
+	/*
+	 * for bigalloc, note the first delonly block in the range has not
+	 * been found, record the extent containing the block to the left of
+	 * the region to be removed, if any, and note that there's no partial
+	 * cluster to track
+	 */
+	if (sbi->s_cluster_ratio > 1) {
+		rc->first_do_lblk_found = false;
+		if (lblk > es->es_lblk) {
+			rc->left_es = es;
+		} else {
+			node = rb_prev(&es->rb_node);
+			rc->left_es = node ? rb_entry(node,
+						      struct extent_status,
+						      rb_node) : NULL;
+		}
+		rc->partial = false;
+	}
+}
+
+/*
+ * count_rsvd - count the clusters containing delayed and not unwritten
+ *		(delonly) blocks in a range within an extent and add to
+ *		the running tally in rsvd_count
+ *
+ * @inode - file containing extent
+ * @lblk - first block in range
+ * @len - length of range in blocks
+ * @es - pointer to extent containing clusters to be counted
+ * @rc - pointer to reserved count data
+ *
+ * Tracks partial clusters found at the beginning and end of extents so
+ * they aren't overcounted when they span adjacent extents
+ */
+static void count_rsvd(struct inode *inode, ext4_lblk_t lblk, long len,
+		       struct extent_status *es, struct rsvd_count *rc)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	ext4_lblk_t i, end, nclu;
+
+	if (!ext4_es_is_delonly(es))
+		return;
+
+	WARN_ON(len <= 0);
+
+	if (sbi->s_cluster_ratio == 1) {
+		rc->ndelonly += (int) len;
+		return;
+	}
+
+	/* bigalloc */
+
+	i = (lblk < es->es_lblk) ? es->es_lblk : lblk;
+	end = lblk + (ext4_lblk_t) len - 1;
+	end = (end > ext4_es_end(es)) ? ext4_es_end(es) : end;
+
+	/* record the first block of the first delonly extent seen */
+	if (rc->first_do_lblk_found == false) {
+		rc->first_do_lblk = i;
+		rc->first_do_lblk_found = true;
+	}
+
+	/* update the last lblk in the region seen so far */
+	rc->last_do_lblk = end;
+
+	/*
+	 * if we're tracking a partial cluster and the current extent
+	 * doesn't start with it, count it and stop tracking
+	 */
+	if (rc->partial && (rc->lclu != EXT4_B2C(sbi, i))) {
+		rc->ndelonly++;
+		rc->partial = false;
+	}
+
+	/*
+	 * if the first cluster doesn't start on a cluster boundary but
+	 * ends on one, count it
+	 */
+	if (EXT4_LBLK_COFF(sbi, i) != 0) {
+		if (end >= EXT4_LBLK_CFILL(sbi, i)) {
+			rc->ndelonly++;
+			rc->partial = false;
+			i = EXT4_LBLK_CFILL(sbi, i) + 1;
+		}
+	}
+
+	/*
+	 * if the current cluster starts on a cluster boundary, count the
+	 * number of whole delonly clusters in the extent
+	 */
+	if ((i + sbi->s_cluster_ratio - 1) <= end) {
+		nclu = (end - i + 1) >> sbi->s_cluster_bits;
+		rc->ndelonly += nclu;
+		i += nclu << sbi->s_cluster_bits;
+	}
+
+	/*
+	 * start tracking a partial cluster if there's a partial at the end
+	 * of the current extent and we're not already tracking one
+	 */
+	if (!rc->partial && i <= end) {
+		rc->partial = true;
+		rc->lclu = EXT4_B2C(sbi, i);
+	}
+}
+
+/*
+ * __pr_tree_search - search for a pending cluster reservation
+ *
+ * @root - root of pending reservation tree
+ * @lclu - logical cluster to search for
+ *
+ * Returns the pending reservation for the cluster identified by @lclu
+ * if found. If not, returns a reservation for the next cluster if any,
+ * and if not, returns NULL.
+ */
+static struct pending_reservation *__pr_tree_search(struct rb_root *root,
+						    ext4_lblk_t lclu)
+{
+	struct rb_node *node = root->rb_node;
+	struct pending_reservation *pr = NULL;
+
+	while (node) {
+		pr = rb_entry(node, struct pending_reservation, rb_node);
+		if (lclu < pr->lclu)
+			node = node->rb_left;
+		else if (lclu > pr->lclu)
+			node = node->rb_right;
+		else
+			return pr;
+	}
+	if (pr && lclu < pr->lclu)
+		return pr;
+	if (pr && lclu > pr->lclu) {
+		node = rb_next(&pr->rb_node);
+		return node ? rb_entry(node, struct pending_reservation,
+				       rb_node) : NULL;
+	}
+	return NULL;
+}
+
+/*
+ * get_rsvd - calculates and returns the number of cluster reservations to be
+ *	      released when removing a block range from the extent status tree
+ *	      and releases any pending reservations within the range
+ *
+ * @inode - file containing block range
+ * @end - last block in range
+ * @right_es - pointer to extent containing next block beyond end or NULL
+ * @rc - pointer to reserved count data
+ *
+ * The number of reservations to be released is equal to the number of
+ * clusters containing delayed and not unwritten (delonly) blocks within
+ * the range, minus the number of clusters still containing delonly blocks
+ * at the ends of the range, and minus the number of pending reservations
+ * within the range.
1146 + */ 1147 + static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end, 1148 + struct extent_status *right_es, 1149 + struct rsvd_count *rc) 1150 + { 1151 + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); 1152 + struct pending_reservation *pr; 1153 + struct ext4_pending_tree *tree = &EXT4_I(inode)->i_pending_tree; 1154 + struct rb_node *node; 1155 + ext4_lblk_t first_lclu, last_lclu; 1156 + bool left_delonly, right_delonly, count_pending; 1157 + struct extent_status *es; 1158 + 1159 + if (sbi->s_cluster_ratio > 1) { 1160 + /* count any remaining partial cluster */ 1161 + if (rc->partial) 1162 + rc->ndelonly++; 1163 + 1164 + if (rc->ndelonly == 0) 1165 + return 0; 1166 + 1167 + first_lclu = EXT4_B2C(sbi, rc->first_do_lblk); 1168 + last_lclu = EXT4_B2C(sbi, rc->last_do_lblk); 1169 + 1170 + /* 1171 + * decrease the delonly count by the number of clusters at the 1172 + * ends of the range that still contain delonly blocks - 1173 + * these clusters still need to be reserved 1174 + */ 1175 + left_delonly = right_delonly = false; 1176 + 1177 + es = rc->left_es; 1178 + while (es && ext4_es_end(es) >= 1179 + EXT4_LBLK_CMASK(sbi, rc->first_do_lblk)) { 1180 + if (ext4_es_is_delonly(es)) { 1181 + rc->ndelonly--; 1182 + left_delonly = true; 1183 + break; 1184 + } 1185 + node = rb_prev(&es->rb_node); 1186 + if (!node) 1187 + break; 1188 + es = rb_entry(node, struct extent_status, rb_node); 1189 + } 1190 + if (right_es && (!left_delonly || first_lclu != last_lclu)) { 1191 + if (end < ext4_es_end(right_es)) { 1192 + es = right_es; 1193 + } else { 1194 + node = rb_next(&right_es->rb_node); 1195 + es = node ? 
rb_entry(node, struct extent_status, 1196 + rb_node) : NULL; 1197 + } 1198 + while (es && es->es_lblk <= 1199 + EXT4_LBLK_CFILL(sbi, rc->last_do_lblk)) { 1200 + if (ext4_es_is_delonly(es)) { 1201 + rc->ndelonly--; 1202 + right_delonly = true; 1203 + break; 1204 + } 1205 + node = rb_next(&es->rb_node); 1206 + if (!node) 1207 + break; 1208 + es = rb_entry(node, struct extent_status, 1209 + rb_node); 1210 + } 1211 + } 1212 + 1213 + /* 1214 + * Determine the block range that should be searched for 1215 + * pending reservations, if any. Clusters on the ends of the 1216 + * original removed range containing delonly blocks are 1217 + * excluded. They've already been accounted for and it's not 1218 + * possible to determine if an associated pending reservation 1219 + * should be released with the information available in the 1220 + * extents status tree. 1221 + */ 1222 + if (first_lclu == last_lclu) { 1223 + if (left_delonly | right_delonly) 1224 + count_pending = false; 1225 + else 1226 + count_pending = true; 1227 + } else { 1228 + if (left_delonly) 1229 + first_lclu++; 1230 + if (right_delonly) 1231 + last_lclu--; 1232 + if (first_lclu <= last_lclu) 1233 + count_pending = true; 1234 + else 1235 + count_pending = false; 1236 + } 1237 + 1238 + /* 1239 + * a pending reservation found between first_lclu and last_lclu 1240 + * represents an allocated cluster that contained at least one 1241 + * delonly block, so the delonly total must be reduced by one 1242 + * for each pending reservation found and released 1243 + */ 1244 + if (count_pending) { 1245 + pr = __pr_tree_search(&tree->root, first_lclu); 1246 + while (pr && pr->lclu <= last_lclu) { 1247 + rc->ndelonly--; 1248 + node = rb_next(&pr->rb_node); 1249 + rb_erase(&pr->rb_node, &tree->root); 1250 + kmem_cache_free(ext4_pending_cachep, pr); 1251 + if (!node) 1252 + break; 1253 + pr = rb_entry(node, struct pending_reservation, 1254 + rb_node); 1255 + } 1256 + } 1257 + } 1258 + return rc->ndelonly; 1259 + } 1260 + 1261 + 
1262 + /* 1263 + * __es_remove_extent - removes block range from extent status tree 1264 + * 1265 + * @inode - file containing range 1266 + * @lblk - first block in range 1267 + * @end - last block in range 1268 + * @reserved - number of cluster reservations released 1269 + * 1270 + * If @reserved is not NULL and delayed allocation is enabled, counts 1271 + * block/cluster reservations freed by removing range and if bigalloc 1272 + * enabled cancels pending reservations as needed. Returns 0 on success, 1273 + * error code on failure. 1274 + */ 971 1275 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, 972 - ext4_lblk_t end) 1276 + ext4_lblk_t end, int *reserved) 973 1277 { 974 1278 struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; 975 1279 struct rb_node *node; ··· 1292 968 ext4_lblk_t len1, len2; 1293 969 ext4_fsblk_t block; 1294 970 int err; 971 + bool count_reserved = true; 972 + struct rsvd_count rc; 1295 973 974 + if (reserved == NULL || !test_opt(inode->i_sb, DELALLOC)) 975 + count_reserved = false; 1296 976 retry: 1297 977 err = 0; 978 + 1298 979 es = __es_tree_search(&tree->root, lblk); 1299 980 if (!es) 1300 981 goto out; ··· 1308 979 1309 980 /* Simply invalidate cache_es. 
*/ 1310 981 tree->cache_es = NULL; 982 + if (count_reserved) 983 + init_rsvd(inode, lblk, es, &rc); 1311 984 1312 985 orig_es.es_lblk = es->es_lblk; 1313 986 orig_es.es_len = es->es_len; ··· 1351 1020 ext4_es_store_pblock(es, block); 1352 1021 } 1353 1022 } 1023 + if (count_reserved) 1024 + count_rsvd(inode, lblk, orig_es.es_len - len1 - len2, 1025 + &orig_es, &rc); 1354 1026 goto out; 1355 1027 } 1356 1028 1357 1029 if (len1 > 0) { 1030 + if (count_reserved) 1031 + count_rsvd(inode, lblk, orig_es.es_len - len1, 1032 + &orig_es, &rc); 1358 1033 node = rb_next(&es->rb_node); 1359 1034 if (node) 1360 1035 es = rb_entry(node, struct extent_status, rb_node); ··· 1369 1032 } 1370 1033 1371 1034 while (es && ext4_es_end(es) <= end) { 1035 + if (count_reserved) 1036 + count_rsvd(inode, es->es_lblk, es->es_len, es, &rc); 1372 1037 node = rb_next(&es->rb_node); 1373 1038 rb_erase(&es->rb_node, &tree->root); 1374 1039 ext4_es_free_extent(inode, es); ··· 1385 1046 ext4_lblk_t orig_len = es->es_len; 1386 1047 1387 1048 len1 = ext4_es_end(es) - end; 1049 + if (count_reserved) 1050 + count_rsvd(inode, es->es_lblk, orig_len - len1, 1051 + es, &rc); 1388 1052 es->es_lblk = end + 1; 1389 1053 es->es_len = len1; 1390 1054 if (ext4_es_is_written(es) || ext4_es_is_unwritten(es)) { ··· 1396 1054 } 1397 1055 } 1398 1056 1057 + if (count_reserved) 1058 + *reserved = get_rsvd(inode, end, es, &rc); 1399 1059 out: 1400 1060 return err; 1401 1061 } 1402 1062 1403 1063 /* 1404 - * ext4_es_remove_extent() removes a space from a extent status tree. 1064 + * ext4_es_remove_extent - removes block range from extent status tree 1405 1065 * 1406 - * Return 0 on success, error code on failure. 1066 + * @inode - file containing range 1067 + * @lblk - first block in range 1068 + * @len - number of blocks to remove 1069 + * 1070 + * Reduces block/cluster reservation count and for bigalloc cancels pending 1071 + * reservations as needed. Returns 0 on success, error code on failure. 
1407 1072 */ 1408 1073 int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk, 1409 1074 ext4_lblk_t len) 1410 1075 { 1411 1076 ext4_lblk_t end; 1412 1077 int err = 0; 1078 + int reserved = 0; 1413 1079 1414 1080 trace_ext4_es_remove_extent(inode, lblk, len); 1415 1081 es_debug("remove [%u/%u) from extent status tree of inode %lu\n", ··· 1435 1085 * is reclaimed. 1436 1086 */ 1437 1087 write_lock(&EXT4_I(inode)->i_es_lock); 1438 - err = __es_remove_extent(inode, lblk, end); 1088 + err = __es_remove_extent(inode, lblk, end, &reserved); 1439 1089 write_unlock(&EXT4_I(inode)->i_es_lock); 1440 1090 ext4_es_print_tree(inode); 1091 + ext4_da_release_space(inode, reserved); 1441 1092 return err; 1442 1093 } 1443 1094 ··· 1586 1235 seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n", 1587 1236 percpu_counter_sum_positive(&es_stats->es_stats_all_cnt), 1588 1237 percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt)); 1589 - seq_printf(seq, " %lu/%lu cache hits/misses\n", 1590 - es_stats->es_stats_cache_hits, 1591 - es_stats->es_stats_cache_misses); 1238 + seq_printf(seq, " %lld/%lld cache hits/misses\n", 1239 + percpu_counter_sum_positive(&es_stats->es_stats_cache_hits), 1240 + percpu_counter_sum_positive(&es_stats->es_stats_cache_misses)); 1592 1241 if (inode_cnt) 1593 1242 seq_printf(seq, " %d inodes on list\n", inode_cnt); 1594 1243 ··· 1615 1264 sbi->s_es_nr_inode = 0; 1616 1265 spin_lock_init(&sbi->s_es_lock); 1617 1266 sbi->s_es_stats.es_stats_shrunk = 0; 1618 - sbi->s_es_stats.es_stats_cache_hits = 0; 1619 - sbi->s_es_stats.es_stats_cache_misses = 0; 1267 + err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_hits, 0, 1268 + GFP_KERNEL); 1269 + if (err) 1270 + return err; 1271 + err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_misses, 0, 1272 + GFP_KERNEL); 1273 + if (err) 1274 + goto err1; 1620 1275 sbi->s_es_stats.es_stats_scan_time = 0; 1621 1276 sbi->s_es_stats.es_stats_max_scan_time = 0; 1622 1277 err = 
percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0, GFP_KERNEL); 1623 1278 if (err) 1624 - return err; 1279 + goto err2; 1625 1280 err = percpu_counter_init(&sbi->s_es_stats.es_stats_shk_cnt, 0, GFP_KERNEL); 1626 1281 if (err) 1627 - goto err1; 1282 + goto err3; 1628 1283 1629 1284 sbi->s_es_shrinker.scan_objects = ext4_es_scan; 1630 1285 sbi->s_es_shrinker.count_objects = ext4_es_count; 1631 1286 sbi->s_es_shrinker.seeks = DEFAULT_SEEKS; 1632 1287 err = register_shrinker(&sbi->s_es_shrinker); 1633 1288 if (err) 1634 - goto err2; 1289 + goto err4; 1635 1290 1636 1291 return 0; 1637 - 1638 - err2: 1292 + err4: 1639 1293 percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt); 1640 - err1: 1294 + err3: 1641 1295 percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt); 1296 + err2: 1297 + percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses); 1298 + err1: 1299 + percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits); 1642 1300 return err; 1643 1301 } 1644 1302 1645 1303 void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi) 1646 1304 { 1305 + percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits); 1306 + percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses); 1647 1307 percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt); 1648 1308 percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt); 1649 1309 unregister_shrinker(&sbi->s_es_shrinker); ··· 1679 1317 es = __es_tree_search(&tree->root, ei->i_es_shrink_lblk); 1680 1318 if (!es) 1681 1319 goto out_wrap; 1320 + 1682 1321 while (*nr_to_scan > 0) { 1683 1322 if (es->es_lblk > end) { 1684 1323 ei->i_es_shrink_lblk = end + 1; ··· 1735 1372 1736 1373 ei->i_es_tree.cache_es = NULL; 1737 1374 return nr_shrunk; 1375 + } 1376 + 1377 + /* 1378 + * Called to support EXT4_IOC_CLEAR_ES_CACHE. We can only remove 1379 + * discretionary entries from the extent status cache. (Some entries 1380 + * must be present for proper operations.) 
1381 + */ 1382 + void ext4_clear_inode_es(struct inode *inode) 1383 + { 1384 + struct ext4_inode_info *ei = EXT4_I(inode); 1385 + struct extent_status *es; 1386 + struct ext4_es_tree *tree; 1387 + struct rb_node *node; 1388 + 1389 + write_lock(&ei->i_es_lock); 1390 + tree = &EXT4_I(inode)->i_es_tree; 1391 + tree->cache_es = NULL; 1392 + node = rb_first(&tree->root); 1393 + while (node) { 1394 + es = rb_entry(node, struct extent_status, rb_node); 1395 + node = rb_next(node); 1396 + if (!ext4_es_is_delayed(es)) { 1397 + rb_erase(&es->rb_node, &tree->root); 1398 + ext4_es_free_extent(inode, es); 1399 + } 1400 + } 1401 + ext4_clear_inode_state(inode, EXT4_STATE_EXT_PRECACHED); 1402 + write_unlock(&ei->i_es_lock); 1738 1403 } 1739 1404 1740 1405 #ifdef ES_DEBUG__ ··· 1981 1590 1982 1591 write_lock(&EXT4_I(inode)->i_es_lock); 1983 1592 1984 - err = __es_remove_extent(inode, lblk, lblk); 1593 + err = __es_remove_extent(inode, lblk, lblk, NULL); 1985 1594 if (err != 0) 1986 1595 goto error; 1987 1596 retry: ··· 2169 1778 else 2170 1779 __remove_pending(inode, last); 2171 1780 } 2172 - } 2173 - 2174 - /* 2175 - * ext4_es_remove_blks - remove block range from extents status tree and 2176 - * reduce reservation count or cancel pending 2177 - * reservation as needed 2178 - * 2179 - * @inode - file containing range 2180 - * @lblk - first block in range 2181 - * @len - number of blocks to remove 2182 - * 2183 - */ 2184 - void ext4_es_remove_blks(struct inode *inode, ext4_lblk_t lblk, 2185 - ext4_lblk_t len) 2186 - { 2187 - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); 2188 - unsigned int clu_size, reserved = 0; 2189 - ext4_lblk_t last_lclu, first, length, remainder, last; 2190 - bool delonly; 2191 - int err = 0; 2192 - struct pending_reservation *pr; 2193 - struct ext4_pending_tree *tree; 2194 - 2195 - /* 2196 - * Process cluster by cluster for bigalloc - there may be up to 2197 - * two clusters in a 4k page with a 1k block size and two blocks 2198 - * per cluster. 
Also necessary for systems with larger page sizes 2199 - * and potentially larger block sizes. 2200 - */ 2201 - clu_size = sbi->s_cluster_ratio; 2202 - last_lclu = EXT4_B2C(sbi, lblk + len - 1); 2203 - 2204 - write_lock(&EXT4_I(inode)->i_es_lock); 2205 - 2206 - for (first = lblk, remainder = len; 2207 - remainder > 0; 2208 - first += length, remainder -= length) { 2209 - 2210 - if (EXT4_B2C(sbi, first) == last_lclu) 2211 - length = remainder; 2212 - else 2213 - length = clu_size - EXT4_LBLK_COFF(sbi, first); 2214 - 2215 - /* 2216 - * The BH_Delay flag, which triggers calls to this function, 2217 - * and the contents of the extents status tree can be 2218 - * inconsistent due to writepages activity. So, note whether 2219 - * the blocks to be removed actually belong to an extent with 2220 - * delayed only status. 2221 - */ 2222 - delonly = __es_scan_clu(inode, &ext4_es_is_delonly, first); 2223 - 2224 - /* 2225 - * because of the writepages effect, written and unwritten 2226 - * blocks could be removed here 2227 - */ 2228 - last = first + length - 1; 2229 - err = __es_remove_extent(inode, first, last); 2230 - if (err) 2231 - ext4_warning(inode->i_sb, 2232 - "%s: couldn't remove page (err = %d)", 2233 - __func__, err); 2234 - 2235 - /* non-bigalloc case: simply count the cluster for release */ 2236 - if (sbi->s_cluster_ratio == 1 && delonly) { 2237 - reserved++; 2238 - continue; 2239 - } 2240 - 2241 - /* 2242 - * bigalloc case: if all delayed allocated only blocks have 2243 - * just been removed from a cluster, either cancel a pending 2244 - * reservation if it exists or count a cluster for release 2245 - */ 2246 - if (delonly && 2247 - !__es_scan_clu(inode, &ext4_es_is_delonly, first)) { 2248 - pr = __get_pending(inode, EXT4_B2C(sbi, first)); 2249 - if (pr != NULL) { 2250 - tree = &EXT4_I(inode)->i_pending_tree; 2251 - rb_erase(&pr->rb_node, &tree->root); 2252 - kmem_cache_free(ext4_pending_cachep, pr); 2253 - } else { 2254 - reserved++; 2255 - } 2256 - } 2257 - } 
2258 - 2259 - write_unlock(&EXT4_I(inode)->i_es_lock); 2260 - 2261 - ext4_da_release_space(inode, reserved); 2262 1781 }
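The new bigalloc accounting in __es_remove_extent() counts each cluster that contains delayed-only (delonly) blocks exactly once, even when that cluster straddles two adjacent extents. Below is a userspace sketch of the partial-cluster tracking in count_rsvd(), followed by the "count any remaining partial cluster" step from get_rsvd(). It is not kernel code. struct rsvd_tally, tally_range() and tally_finish() are hypothetical names. The sketch assumes every range passed in is delonly and that ranges arrive in ascending, non-overlapping order. It leaves out the end-cluster and pending-reservation adjustments that get_rsvd() also makes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t lblk_t;   /* stand-in for ext4_lblk_t */

/* Hypothetical, simplified analogue of struct rsvd_count. */
struct rsvd_tally {
	int nclusters;   /* clusters counted so far */
	bool partial;    /* is a trailing partial cluster being tracked? */
	lblk_t lclu;     /* logical cluster number of that partial cluster */
};

/*
 * Add the clusters touched by blocks [start, end] to the tally.
 * cluster_bits is log2(blocks per cluster), like s_cluster_bits.
 */
static void tally_range(struct rsvd_tally *t, lblk_t start, lblk_t end,
			unsigned int cluster_bits)
{
	lblk_t ratio = (lblk_t)1 << cluster_bits;
	lblk_t i = start;

	/* A tracked partial cluster that this range doesn't start in is done. */
	if (t->partial && t->lclu != (i >> cluster_bits)) {
		t->nclusters++;
		t->partial = false;
	}

	/* Range starts mid-cluster and reaches that cluster's last block. */
	if ((i & (ratio - 1)) != 0) {
		lblk_t cfill = i | (ratio - 1);  /* last block of i's cluster */

		if (end >= cfill) {
			t->nclusters++;
			t->partial = false;
			i = cfill + 1;
		}
	}

	/* Count the whole clusters that start on a cluster boundary. */
	if (i + ratio - 1 <= end) {
		lblk_t nclu = (end - i + 1) >> cluster_bits;

		t->nclusters += (int)nclu;
		i += nclu << cluster_bits;
	}

	/* Leftover blocks form a partial cluster at the end of the range. */
	if (!t->partial && i <= end) {
		t->partial = true;
		t->lclu = i >> cluster_bits;
	}
}

/* Count any partial cluster still being tracked and return the total. */
static int tally_finish(struct rsvd_tally *t)
{
	if (t->partial) {
		t->nclusters++;
		t->partial = false;
	}
	return t->nclusters;
}
```

Take 4-block clusters (cluster_bits = 2). The ranges 2..5 and 6..6 touch clusters 0 and 1. Cluster 1 spans both ranges, but it is counted only once, because it stays "partial" until a range starting in a different cluster, or the final tally, closes it.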

+4 -4

fs/ext4/extents_status.h

··· 70 70 71 71 struct ext4_es_stats { 72 72 unsigned long es_stats_shrunk; 73 - unsigned long es_stats_cache_hits; 74 - unsigned long es_stats_cache_misses; 73 + struct percpu_counter es_stats_cache_hits; 74 + struct percpu_counter es_stats_cache_misses; 75 75 u64 es_stats_scan_time; 76 76 u64 es_stats_max_scan_time; 77 77 struct percpu_counter es_stats_all_cnt; ··· 140 140 ext4_lblk_t lblk, ext4_lblk_t end, 141 141 struct extent_status *es); 142 142 extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk, 143 + ext4_lblk_t *next_lblk, 143 144 struct extent_status *es); 144 145 extern bool ext4_es_scan_range(struct inode *inode, 145 146 int (*matching_fn)(struct extent_status *es), ··· 247 246 bool allocated); 248 247 extern unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk, 249 248 ext4_lblk_t len); 250 - extern void ext4_es_remove_blks(struct inode *inode, ext4_lblk_t lblk, 251 - ext4_lblk_t len); 249 + extern void ext4_clear_inode_es(struct inode *inode); 252 250 253 251 #endif /* _EXT4_EXTENTS_STATUS_H */
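ext4_es_lookup_extent() now takes a next_lblk argument. On a hit, the patch uses rb_next() to store the first block of the following extent in that argument, or 0 if no later extent exists. All existing callers pass NULL. The sketch below reproduces those semantics over a sorted array that stands in for the rbtree. toy_extent and toy_lookup_extent() are hypothetical names, and the sketch omits locking, extent status bits and the referenced flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t lblk_t;   /* stand-in for ext4_lblk_t */

struct toy_extent {
	lblk_t lblk;   /* first logical block */
	lblk_t len;    /* length in blocks */
};

/*
 * Look up the extent containing @lblk in @ext (sorted, non-overlapping).
 * Returns 1 and copies it to *out on a hit, 0 on a miss. On a hit, if
 * @next_lblk is non-NULL, it receives the start of the next extent, or 0
 * when there is none (as in the patch, 0 doubles as "no next extent").
 */
static int toy_lookup_extent(const struct toy_extent *ext, size_t n,
			     lblk_t lblk, lblk_t *next_lblk,
			     struct toy_extent *out)
{
	size_t lo = 0, hi = n;

	/* Binary search: lo ends as the count of extents starting <= lblk. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ext[mid].lblk <= lblk)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return 0;                      /* lblk precedes every extent */

	const struct toy_extent *e = &ext[lo - 1];

	if (lblk - e->lblk >= e->len)
		return 0;                      /* lblk is in a hole after e */

	*out = *e;
	if (next_lblk)
		*next_lblk = (lo < n) ? ext[lo].lblk : 0;
	return 1;
}
```

As in the kernel function, *next_lblk is written only on a hit. After a miss, the caller should not rely on its value.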

-2

fs/ext4/file.c

··· 230 230 if (IS_DAX(inode)) 231 231 return ext4_dax_write_iter(iocb, from); 232 232 #endif 233 - if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT)) 234 - return -EOPNOTSUPP; 235 233 236 234 if (!inode_trylock(inode)) { 237 235 if (iocb->ki_flags & IOCB_NOWAIT)

+1 -1

fs/ext4/hash.c

··· 280 280 unsigned char *buff; 281 281 struct qstr qstr = {.name = name, .len = len }; 282 282 283 - if (len && IS_CASEFOLDED(dir)) { 283 + if (len && IS_CASEFOLDED(dir) && um) { 284 284 buff = kzalloc(sizeof(char) * PATH_MAX, GFP_KERNEL); 285 285 if (!buff) 286 286 return -ENOMEM;

+1 -1

fs/ext4/inline.c

··· 1416 1416 err = ext4_htree_store_dirent(dir_file, hinfo->hash, 1417 1417 hinfo->minor_hash, de, &tmp_str); 1418 1418 if (err) { 1419 - count = err; 1419 + ret = err; 1420 1420 goto out; 1421 1421 } 1422 1422 count++;

+27 -76

fs/ext4/inode.c

··· 527 527 return -EFSCORRUPTED; 528 528 529 529 /* Lookup extent status tree firstly */ 530 - if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) { 530 + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) { 531 531 if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) { 532 532 map->m_pblk = ext4_es_pblock(&es) + 533 533 map->m_lblk - es.es_lblk; ··· 695 695 * extent status tree. 696 696 */ 697 697 if ((flags & EXT4_GET_BLOCKS_PRE_IO) && 698 - ext4_es_lookup_extent(inode, map->m_lblk, &es)) { 698 + ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) { 699 699 if (ext4_es_is_written(&es)) 700 700 goto out_sem; 701 701 } ··· 1024 1024 bh = ext4_getblk(handle, inode, block, map_flags); 1025 1025 if (IS_ERR(bh)) 1026 1026 return bh; 1027 - if (!bh || buffer_uptodate(bh)) 1027 + if (!bh || ext4_buffer_uptodate(bh)) 1028 1028 return bh; 1029 1029 ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1, &bh); 1030 1030 wait_on_buffer(bh); ··· 1051 1051 1052 1052 for (i = 0; i < bh_count; i++) 1053 1053 /* Note that NULL bhs[i] is valid because of holes. 
*/ 1054 - if (bhs[i] && !buffer_uptodate(bhs[i])) 1054 + if (bhs[i] && !ext4_buffer_uptodate(bhs[i])) 1055 1055 ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1, 1056 1056 &bhs[i]); 1057 1057 ··· 1656 1656 dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free)); 1657 1657 } 1658 1658 1659 - static void ext4_da_page_release_reservation(struct page *page, 1660 - unsigned int offset, 1661 - unsigned int length) 1662 - { 1663 - int contiguous_blks = 0; 1664 - struct buffer_head *head, *bh; 1665 - unsigned int curr_off = 0; 1666 - struct inode *inode = page->mapping->host; 1667 - unsigned int stop = offset + length; 1668 - ext4_fsblk_t lblk; 1669 - 1670 - BUG_ON(stop > PAGE_SIZE || stop < length); 1671 - 1672 - head = page_buffers(page); 1673 - bh = head; 1674 - do { 1675 - unsigned int next_off = curr_off + bh->b_size; 1676 - 1677 - if (next_off > stop) 1678 - break; 1679 - 1680 - if ((offset <= curr_off) && (buffer_delay(bh))) { 1681 - contiguous_blks++; 1682 - clear_buffer_delay(bh); 1683 - } else if (contiguous_blks) { 1684 - lblk = page->index << 1685 - (PAGE_SHIFT - inode->i_blkbits); 1686 - lblk += (curr_off >> inode->i_blkbits) - 1687 - contiguous_blks; 1688 - ext4_es_remove_blks(inode, lblk, contiguous_blks); 1689 - contiguous_blks = 0; 1690 - } 1691 - curr_off = next_off; 1692 - } while ((bh = bh->b_this_page) != head); 1693 - 1694 - if (contiguous_blks) { 1695 - lblk = page->index << (PAGE_SHIFT - inode->i_blkbits); 1696 - lblk += (curr_off >> inode->i_blkbits) - contiguous_blks; 1697 - ext4_es_remove_blks(inode, lblk, contiguous_blks); 1698 - } 1699 - 1700 - } 1701 - 1702 1659 /* 1703 1660 * Delayed allocation stuff 1704 1661 */ ··· 1835 1878 (unsigned long) map->m_lblk); 1836 1879 1837 1880 /* Lookup extent status tree firstly */ 1838 - if (ext4_es_lookup_extent(inode, iblock, &es)) { 1881 + if (ext4_es_lookup_extent(inode, iblock, NULL, &es)) { 1839 1882 if (ext4_es_is_hole(&es)) { 1840 1883 retval = 0; 1841 1884 
down_read(&EXT4_I(inode)->i_data_sem); ··· 2757 2800 goto out_writepages; 2758 2801 } 2759 2802 2760 - if (ext4_should_dioread_nolock(inode)) { 2761 - /* 2762 - * We may need to convert up to one extent per block in 2763 - * the page and we may dirty the inode. 2764 - */ 2765 - rsv_blocks = 1 + ext4_chunk_trans_blocks(inode, 2766 - PAGE_SIZE >> inode->i_blkbits); 2767 - } 2768 - 2769 2803 /* 2770 2804 * If we have inline data and arrive here, it means that 2771 2805 * we will soon create the block for the 1st page, so ··· 2773 2825 EXT4_STATE_MAY_INLINE_DATA)); 2774 2826 ext4_destroy_inline_data(handle, inode); 2775 2827 ext4_journal_stop(handle); 2828 + } 2829 + 2830 + if (ext4_should_dioread_nolock(inode)) { 2831 + /* 2832 + * We may need to convert up to one extent per block in 2833 + * the page and we may dirty the inode. 2834 + */ 2835 + rsv_blocks = 1 + ext4_chunk_trans_blocks(inode, 2836 + PAGE_SIZE >> inode->i_blkbits); 2776 2837 } 2777 2838 2778 2839 if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) ··· 3197 3240 ret = ret2; 3198 3241 3199 3242 return ret ? 
ret : copied; 3200 - } 3201 - 3202 - static void ext4_da_invalidatepage(struct page *page, unsigned int offset, 3203 - unsigned int length) 3204 - { 3205 - /* 3206 - * Drop reserved blocks 3207 - */ 3208 - BUG_ON(!PageLocked(page)); 3209 - if (!page_has_buffers(page)) 3210 - goto out; 3211 - 3212 - ext4_da_page_release_reservation(page, offset, length); 3213 - 3214 - out: 3215 - ext4_invalidatepage(page, offset, length); 3216 - 3217 - return; 3218 3243 } 3219 3244 3220 3245 /* ··· 3941 4002 .write_end = ext4_da_write_end, 3942 4003 .set_page_dirty = ext4_set_page_dirty, 3943 4004 .bmap = ext4_bmap, 3944 - .invalidatepage = ext4_da_invalidatepage, 4005 + .invalidatepage = ext4_invalidatepage, 3945 4006 .releasepage = ext4_releasepage, 3946 4007 .direct_IO = ext4_direct_IO, 3947 4008 .migratepage = buffer_migrate_page, ··· 4252 4313 return -EOPNOTSUPP; 4253 4314 4254 4315 trace_ext4_punch_hole(inode, offset, length, 0); 4316 + 4317 + ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); 4318 + if (ext4_has_inline_data(inode)) { 4319 + down_write(&EXT4_I(inode)->i_mmap_sem); 4320 + ret = ext4_convert_inline_data(inode); 4321 + up_write(&EXT4_I(inode)->i_mmap_sem); 4322 + if (ret) 4323 + return ret; 4324 + } 4255 4325 4256 4326 /* 4257 4327 * Write out all dirty pages to avoid race conditions ··· 5085 5137 "iget: bogus i_mode (%o)", inode->i_mode); 5086 5138 goto bad_inode; 5087 5139 } 5140 + if (IS_CASEFOLDED(inode) && !ext4_has_feature_casefold(inode->i_sb)) 5141 + ext4_error_inode(inode, function, line, 0, 5142 + "casefold flag without casefold feature"); 5088 5143 brelse(iloc.bh); 5089 5144 5090 5145 unlock_new_inode(inode);

+98

fs/ext4/ioctl.c

··· 745 745 fa->fsx_projid = from_kprojid(&init_user_ns, ei->i_projid); 746 746 } 747 747 748 + /* copied from fs/ioctl.c */ 749 + static int fiemap_check_ranges(struct super_block *sb, 750 + u64 start, u64 len, u64 *new_len) 751 + { 752 + u64 maxbytes = (u64) sb->s_maxbytes; 753 + 754 + *new_len = len; 755 + 756 + if (len == 0) 757 + return -EINVAL; 758 + 759 + if (start > maxbytes) 760 + return -EFBIG; 761 + 762 + /* 763 + * Shrink request scope to what the fs can actually handle. 764 + */ 765 + if (len > maxbytes || (maxbytes - len) < start) 766 + *new_len = maxbytes - start; 767 + 768 + return 0; 769 + } 770 + 771 + /* So that the fiemap access checks can't overflow on 32 bit machines. */ 772 + #define FIEMAP_MAX_EXTENTS (UINT_MAX / sizeof(struct fiemap_extent)) 773 + 774 + static int ext4_ioctl_get_es_cache(struct file *filp, unsigned long arg) 775 + { 776 + struct fiemap fiemap; 777 + struct fiemap __user *ufiemap = (struct fiemap __user *) arg; 778 + struct fiemap_extent_info fieinfo = { 0, }; 779 + struct inode *inode = file_inode(filp); 780 + struct super_block *sb = inode->i_sb; 781 + u64 len; 782 + int error; 783 + 784 + if (copy_from_user(&fiemap, ufiemap, sizeof(fiemap))) 785 + return -EFAULT; 786 + 787 + if (fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS) 788 + return -EINVAL; 789 + 790 + error = fiemap_check_ranges(sb, fiemap.fm_start, fiemap.fm_length, 791 + &len); 792 + if (error) 793 + return error; 794 + 795 + fieinfo.fi_flags = fiemap.fm_flags; 796 + fieinfo.fi_extents_max = fiemap.fm_extent_count; 797 + fieinfo.fi_extents_start = ufiemap->fm_extents; 798 + 799 + if (fiemap.fm_extent_count != 0 && 800 + !access_ok(fieinfo.fi_extents_start, 801 + fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) 802 + return -EFAULT; 803 + 804 + if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC) 805 + filemap_write_and_wait(inode->i_mapping); 806 + 807 + error = ext4_get_es_cache(inode, &fieinfo, fiemap.fm_start, len); 808 + fiemap.fm_flags = fieinfo.fi_flags; 809 + 
fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped; 810 + if (copy_to_user(ufiemap, &fiemap, sizeof(fiemap))) 811 + error = -EFAULT; 812 + 813 + return error; 814 + } 815 + 748 816 long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) 749 817 { 750 818 struct inode *inode = file_inode(filp); ··· 1210 1142 return -EOPNOTSUPP; 1211 1143 return fscrypt_ioctl_get_key_status(filp, (void __user *)arg); 1212 1144 1145 + case EXT4_IOC_CLEAR_ES_CACHE: 1146 + { 1147 + if (!inode_owner_or_capable(inode)) 1148 + return -EACCES; 1149 + ext4_clear_inode_es(inode); 1150 + return 0; 1151 + } 1152 + 1153 + case EXT4_IOC_GETSTATE: 1154 + { 1155 + __u32 state = 0; 1156 + 1157 + if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED)) 1158 + state |= EXT4_STATE_FLAG_EXT_PRECACHED; 1159 + if (ext4_test_inode_state(inode, EXT4_STATE_NEW)) 1160 + state |= EXT4_STATE_FLAG_NEW; 1161 + if (ext4_test_inode_state(inode, EXT4_STATE_NEWENTRY)) 1162 + state |= EXT4_STATE_FLAG_NEWENTRY; 1163 + if (ext4_test_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE)) 1164 + state |= EXT4_STATE_FLAG_DA_ALLOC_CLOSE; 1165 + 1166 + return put_user(state, (__u32 __user *) arg); 1167 + } 1168 + 1169 + case EXT4_IOC_GET_ES_CACHE: 1170 + return ext4_ioctl_get_es_cache(filp, arg); 1171 + 1213 1172 case EXT4_IOC_FSGETXATTR: 1214 1173 { 1215 1174 struct fsxattr fa; ··· 1373 1278 case FS_IOC_GETFSMAP: 1374 1279 case FS_IOC_ENABLE_VERITY: 1375 1280 case FS_IOC_MEASURE_VERITY: 1281 + case EXT4_IOC_CLEAR_ES_CACHE: 1282 + case EXT4_IOC_GETSTATE: 1283 + case EXT4_IOC_GET_ES_CACHE: 1376 1284 break; 1377 1285 default: 1378 1286 return -ENOIOCTLCMD;
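fiemap_check_ranges(), copied from fs/ioctl.c, validates an EXT4_IOC_GET_ES_CACHE request in three steps. It rejects a zero length and rejects a start beyond s_maxbytes. Otherwise, it clamps the length so that the request ends no later than s_maxbytes. The same arithmetic can be tested in userspace. In this sketch, check_ranges() is a hypothetical name, and maxbytes is passed in directly rather than read from a super_block.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace mirror of fiemap_check_ranges(); maxbytes stands in for s_maxbytes. */
static int check_ranges(uint64_t maxbytes, uint64_t start, uint64_t len,
			uint64_t *new_len)
{
	*new_len = len;

	if (len == 0)
		return -EINVAL;        /* empty request */
	if (start > maxbytes)
		return -EFBIG;         /* starts past the largest possible file */

	/* Clamp without overflow so that start + *new_len <= maxbytes. */
	if (len > maxbytes || (maxbytes - len) < start)
		*new_len = maxbytes - start;

	return 0;
}
```

The check is written as (maxbytes - len) < start instead of start + len > maxbytes because userspace commonly passes a huge fm_length to mean "to end of file", and start + len could then wrap around.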

+2 -2

fs/ext4/namei.c

··· 1312 1312 { 1313 1313 int len; 1314 1314 1315 - if (!IS_CASEFOLDED(dir)) { 1315 + if (!IS_CASEFOLDED(dir) || !EXT4_SB(dir->i_sb)->s_encoding) { 1316 1316 cf_name->name = NULL; 1317 1317 return; 1318 1318 } ··· 2183 2183 2184 2184 #ifdef CONFIG_UNICODE 2185 2185 if (ext4_has_strict_mode(sbi) && IS_CASEFOLDED(dir) && 2186 - utf8_validate(sbi->s_encoding, &dentry->d_name)) 2186 + sbi->s_encoding && utf8_validate(sbi->s_encoding, &dentry->d_name)) 2187 2187 return -EINVAL; 2188 2188 #endif 2189 2189

+7

fs/ext4/super.c

··· 1878 1878 } else if (token == Opt_commit) { 1879 1879 if (arg == 0) 1880 1880 arg = JBD2_DEFAULT_MAX_COMMIT_AGE; 1881 + else if (arg > INT_MAX / HZ) { 1882 + ext4_msg(sb, KERN_ERR, 1883 + "Invalid commit interval %d, " 1884 + "must be smaller than %d", 1885 + arg, INT_MAX / HZ); 1886 + return -1; 1887 + } 1881 1888 sbi->s_commit_interval = HZ * arg; 1882 1889 } else if (token == Opt_debug_want_extra_isize) { 1883 1890 sbi->s_want_extra_isize = arg;
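The new Opt_commit check rejects any commit interval whose conversion to jiffies (HZ * arg) would overflow an int. Comparing arg with INT_MAX / HZ before multiplying means the overflowing product is never evaluated, which matters because signed int overflow is undefined behaviour in C. Below is a sketch that takes hz and the default commit age as parameters. The name commit_interval_jiffies() is hypothetical, whereas the kernel uses the compile-time HZ and JBD2_DEFAULT_MAX_COMMIT_AGE. Like the patch, the sketch assumes arg is non-negative.

```c
#include <assert.h>
#include <limits.h>

/*
 * Convert a commit interval in seconds to jiffies, or return -1 if the
 * result would not fit in an int. Assumes arg >= 0 and hz > 0.
 */
static int commit_interval_jiffies(int arg, int hz, int default_age)
{
	if (arg == 0)
		arg = default_age;           /* 0 selects the default age */
	else if (arg > INT_MAX / hz)
		return -1;                   /* hz * arg would exceed INT_MAX */
	return hz * arg;
}
```

For positive values, arg > INT_MAX / hz is exactly equivalent to arg * hz > INT_MAX under integer division. So the bound neither rejects a representable interval nor admits an overflowing one.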

+1 -3

fs/jbd2/revoke.c

··· 638 638 { 639 639 jbd2_journal_revoke_header_t *header; 640 640 641 - if (is_journal_aborted(journal)) { 642 - put_bh(descriptor); 641 + if (is_journal_aborted(journal)) 643 642 return; 644 - } 645 643 646 644 header = (jbd2_journal_revoke_header_t *)descriptor->b_data; 647 645 header->r_count = cpu_to_be32(offset);

+3

fs/jbd2/transaction.c

··· 569 569 } 570 570 handle->h_type = type; 571 571 handle->h_line_no = line_no; 572 + trace_jbd2_handle_start(journal->j_fs_dev->bd_dev, 573 + handle->h_transaction->t_tid, type, 574 + line_no, handle->h_buffer_credits); 572 575 return 0; 573 576 } 574 577 EXPORT_SYMBOL(jbd2_journal_start_reserved);

+1 -1

fs/unicode/utf8-core.c

··· 154 154 { 155 155 substring_t args[3]; 156 156 char version_string[12]; 157 - const struct match_token token[] = { 157 + static const struct match_token token[] = { 158 158 {1, "%d.%d.%d"}, 159 159 {0, NULL} 160 160 };

+2 -2

fs/unicode/utf8-selftest.c

··· 35 35 #define test_f(cond, fmt, ...) _test(cond, __func__, __LINE__, fmt, ##__VA_ARGS__) 36 36 #define test(cond) _test(cond, __func__, __LINE__, "") 37 37 38 - const static struct { 38 + static const struct { 39 39 /* UTF-8 strings in this vector _must_ be NULL-terminated. */ 40 40 unsigned char str[10]; 41 41 unsigned char dec[10]; ··· 89 89 90 90 }; 91 91 92 - const static struct { 92 + static const struct { 93 93 /* UTF-8 strings in this vector _must_ be NULL-terminated. */ 94 94 unsigned char str[30]; 95 95 unsigned char ncf[30];