Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (32 commits)
[PATCH] ocfs2: zero_user_page conversion
ocfs2: Support xfs style space reservation ioctls
ocfs2: support for removing file regions
ocfs2: update truncate handling of partial clusters
ocfs2: btree support for removal of arbitrary extents
ocfs2: Support creation of unwritten extents
ocfs2: support writing of unwritten extents
ocfs2: small cleanup of ocfs2_write_begin_nolock()
ocfs2: btree changes for unwritten extents
ocfs2: abstract btree growing calls
ocfs2: use all extent block suballocators
ocfs2: plug truncate into cached dealloc routines
ocfs2: simplify deallocation locking
ocfs2: harden buffer check during mapping of page blocks
ocfs2: shared writeable mmap
ocfs2: factor out write aops into nolock variants
ocfs2: rework ocfs2_buffered_write_cluster()
ocfs2: take ip_alloc_sem during entire truncate
ocfs2: Add "preferred slot" mount option
[KJ PATCH] Replacing memset(<addr>,0,PAGE_SIZE) with clear_page() in fs/ocfs2/dlm/dlmrecovery.c
...

+4650 -1081
+48 -9
Documentation/filesystems/configfs/configfs.txt
··· 238 238 struct config_group *(*make_group)(struct config_group *group, 239 239 const char *name); 240 240 int (*commit_item)(struct config_item *item); 241 + void (*disconnect_notify)(struct config_group *group, 242 + struct config_item *item); 241 243 void (*drop_item)(struct config_group *group, 242 244 struct config_item *item); 243 245 }; ··· 270 268 for the item to actually disappear from the subsystem's usage. But it 271 269 is gone from configfs. 272 270 271 + When drop_item() is called, the item's linkage has already been torn 272 + down. It no longer has a reference on its parent and has no place in 273 + the item hierarchy. If a client needs to do some cleanup before this 274 + teardown happens, the subsystem can implement the 275 + ct_group_ops->disconnect_notify() method. The method is called after 276 + configfs has removed the item from the filesystem view but before the 277 + item is removed from its parent group. Like drop_item(), 278 + disconnect_notify() is void and cannot fail. Client subsystems should 279 + not drop any references here, as they still must do it in drop_item(). 280 + 273 281 A config_group cannot be removed while it still has child items. This 274 282 is implemented in the configfs rmdir(2) code. ->drop_item() will not be 275 283 called, as the item has not been dropped. rmdir(2) will fail, as the ··· 292 280 293 281 struct configfs_subsystem { 294 282 struct config_group su_group; 295 - struct semaphore su_sem; 283 + struct mutex su_mutex; 296 284 }; 297 285 298 286 int configfs_register_subsystem(struct configfs_subsystem *subsys); 299 287 void configfs_unregister_subsystem(struct configfs_subsystem *subsys); 300 288 301 - A subsystem consists of a toplevel config_group and a semaphore. 289 + A subsystem consists of a toplevel config_group and a mutex. 302 290 The group is where child config_items are created. For a subsystem, 303 291 this group is usually defined statically. 
Before calling 304 292 configfs_register_subsystem(), the subsystem must have initialized the 305 293 group via the usual group _init() functions, and it must also have 306 - initialized the semaphore. 294 + initialized the mutex. 307 295 When the register call returns, the subsystem is live, and it 308 296 will be visible via configfs. At that point, mkdir(2) can be called and 309 297 the subsystem must be ready for it. ··· 315 303 shows a trivial object displaying and storing an attribute, and a simple 316 304 group creating and destroying these children. 317 305 318 - [Hierarchy Navigation and the Subsystem Semaphore] 306 + [Hierarchy Navigation and the Subsystem Mutex] 319 307 320 308 There is an extra bonus that configfs provides. The config_groups and 321 309 config_items are arranged in a hierarchy due to the fact that they ··· 326 314 327 315 A subsystem can navigate the cg_children list and the ci_parent pointer 328 316 to see the tree created by the subsystem. This can race with configfs' 329 - management of the hierarchy, so configfs uses the subsystem semaphore to 317 + management of the hierarchy, so configfs uses the subsystem mutex to 330 318 protect modifications. Whenever a subsystem wants to navigate the 331 319 hierarchy, it must do so under the protection of the subsystem 332 - semaphore. 320 + mutex. 333 321 334 - A subsystem will be prevented from acquiring the semaphore while a newly 322 + A subsystem will be prevented from acquiring the mutex while a newly 335 323 allocated item has not been linked into this hierarchy. Similarly, it 336 - will not be able to acquire the semaphore while a dropping item has not 324 + will not be able to acquire the mutex while a dropping item has not 337 325 yet been unlinked. This means that an item's ci_parent pointer will 338 326 never be NULL while the item is in configfs, and that an item will only 339 327 be in its parent's cg_children list for the same duration. 
This allows 340 328 a subsystem to trust ci_parent and cg_children while they hold the 341 - semaphore. 329 + mutex. 342 330 343 331 [Item Aggregation Via symlink(2)] 344 332 ··· 397 385 As a consequence of this, default_groups cannot be removed directly via 398 386 rmdir(2). They also are not considered when rmdir(2) on the parent 399 387 group is checking for children. 388 + 389 + [Dependant Subsystems] 390 + 391 + Sometimes other drivers depend on particular configfs items. For 392 + example, ocfs2 mounts depend on a heartbeat region item. If that 393 + region item is removed with rmdir(2), the ocfs2 mount must BUG or go 394 + readonly. Not happy. 395 + 396 + configfs provides two additional API calls: configfs_depend_item() and 397 + configfs_undepend_item(). A client driver can call 398 + configfs_depend_item() on an existing item to tell configfs that it is 399 + depended on. configfs will then return -EBUSY from rmdir(2) for that 400 + item. When the item is no longer depended on, the client driver calls 401 + configfs_undepend_item() on it. 402 + 403 + These APIs cannot be called underneath any configfs callbacks, as 404 + they will conflict. They can block and allocate. A client driver 405 + probably shouldn't call them of its own gumption. Rather, it should 406 + provide an API that external subsystems call. 407 + 408 + How does this work? Imagine the ocfs2 mount process. When it mounts, 409 + it asks for a heartbeat region item. This is done via a call into the 410 + heartbeat code. Inside the heartbeat code, the region item is looked 411 + up. Here, the heartbeat code calls configfs_depend_item(). If it 412 + succeeds, then heartbeat knows the region is safe to give to ocfs2. 413 + If it fails, it was being torn down anyway, and heartbeat can gracefully 414 + pass up an error. 400 415 401 416 [Committable Items] 402 417
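The depend/undepend protocol documented above can be sketched in userspace C. This is an illustrative model, not the kernel API: `model_item`, `item_depend`, `item_undepend`, and `item_rmdir` are hypothetical names, with a pthread mutex standing in for the dentry's i_mutex and an integer mirroring the new `s_dependent_count` field.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace model of the configfs dependent-item protocol:
 * a depender bumps a count under the item's lock, and the rmdir(2)
 * analogue refuses with -EBUSY while that count is nonzero. */
struct model_item {
	pthread_mutex_t mutex;   /* stands in for the dentry's i_mutex */
	int dependent_count;     /* mirrors configfs_dirent.s_dependent_count */
	int live;
};

static void item_depend(struct model_item *it)
{
	pthread_mutex_lock(&it->mutex);
	it->dependent_count++;
	pthread_mutex_unlock(&it->mutex);
}

static void item_undepend(struct model_item *it)
{
	pthread_mutex_lock(&it->mutex);
	assert(it->dependent_count > 0);
	it->dependent_count--;
	pthread_mutex_unlock(&it->mutex);
}

/* rmdir(2) analogue: fails while anyone still depends on the item. */
static int item_rmdir(struct model_item *it)
{
	int ret = 0;

	pthread_mutex_lock(&it->mutex);
	if (it->dependent_count)
		ret = -EBUSY;
	else
		it->live = 0;
	pthread_mutex_unlock(&it->mutex);
	return ret;
}
```

This is the shape of the ocfs2/heartbeat usage the text describes: heartbeat calls the depend side before handing the region to ocfs2, and removal is held off until the undepend.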
+1 -1
Documentation/filesystems/configfs/configfs_example.c
··· 453 453 subsys = example_subsys[i]; 454 454 455 455 config_group_init(&subsys->su_group); 456 - init_MUTEX(&subsys->su_sem); 456 + mutex_init(&subsys->su_mutex); 457 457 ret = configfs_register_subsystem(subsys); 458 458 if (ret) { 459 459 printk(KERN_ERR "Error %d while registering subsystem %s\n",
+4 -3
fs/configfs/configfs_internal.h
··· 29 29 30 30 struct configfs_dirent { 31 31 atomic_t s_count; 32 + int s_dependent_count; 32 33 struct list_head s_sibling; 33 34 struct list_head s_children; 34 35 struct list_head s_links; 35 - void * s_element; 36 + void * s_element; 36 37 int s_type; 37 38 umode_t s_mode; 38 39 struct dentry * s_dentry; ··· 42 41 43 42 #define CONFIGFS_ROOT 0x0001 44 43 #define CONFIGFS_DIR 0x0002 45 - #define CONFIGFS_ITEM_ATTR 0x0004 46 - #define CONFIGFS_ITEM_LINK 0x0020 44 + #define CONFIGFS_ITEM_ATTR 0x0004 45 + #define CONFIGFS_ITEM_LINK 0x0020 47 46 #define CONFIGFS_USET_DIR 0x0040 48 47 #define CONFIGFS_USET_DEFAULT 0x0080 49 48 #define CONFIGFS_USET_DROPPING 0x0100
+280 -9
fs/configfs/dir.c
··· 355 355 /* Mark that we've taken i_mutex */ 356 356 sd->s_type |= CONFIGFS_USET_DROPPING; 357 357 358 + /* 359 + * Yup, recursive. If there's a problem, blame 360 + * deep nesting of default_groups 361 + */ 358 362 ret = configfs_detach_prep(sd->s_dentry); 359 363 if (!ret) 360 364 continue; ··· 566 562 567 563 /* 568 564 * All of link_obj/unlink_obj/link_group/unlink_group require that 569 - * subsys->su_sem is held. 565 + * subsys->su_mutex is held. 570 566 */ 571 567 572 568 static void unlink_obj(struct config_item *item) ··· 718 714 } 719 715 720 716 /* 717 + * After the item has been detached from the filesystem view, we are 718 + * ready to tear it out of the hierarchy. Notify the client before 719 + * we do that so they can perform any cleanup that requires 720 + * navigating the hierarchy. A client does not need to provide this 721 + * callback. The subsystem semaphore MUST be held by the caller, and 722 + * references must be valid for both items. It also assumes the 723 + * caller has validated ci_type. 724 + */ 725 + static void client_disconnect_notify(struct config_item *parent_item, 726 + struct config_item *item) 727 + { 728 + struct config_item_type *type; 729 + 730 + type = parent_item->ci_type; 731 + BUG_ON(!type); 732 + 733 + if (type->ct_group_ops && type->ct_group_ops->disconnect_notify) 734 + type->ct_group_ops->disconnect_notify(to_config_group(parent_item), 735 + item); 736 + } 737 + 738 + /* 721 739 * Drop the initial reference from make_item()/make_group() 722 740 * This function assumes that reference is held on item 723 741 * and that item holds a valid reference to the parent. 
Also, it ··· 759 733 */ 760 734 if (type->ct_group_ops && type->ct_group_ops->drop_item) 761 735 type->ct_group_ops->drop_item(to_config_group(parent_item), 762 - item); 736 + item); 763 737 else 764 738 config_item_put(item); 765 739 } 766 740 741 + #ifdef DEBUG 742 + static void configfs_dump_one(struct configfs_dirent *sd, int level) 743 + { 744 + printk(KERN_INFO "%*s\"%s\":\n", level, " ", configfs_get_name(sd)); 745 + 746 + #define type_print(_type) if (sd->s_type & _type) printk(KERN_INFO "%*s %s\n", level, " ", #_type); 747 + type_print(CONFIGFS_ROOT); 748 + type_print(CONFIGFS_DIR); 749 + type_print(CONFIGFS_ITEM_ATTR); 750 + type_print(CONFIGFS_ITEM_LINK); 751 + type_print(CONFIGFS_USET_DIR); 752 + type_print(CONFIGFS_USET_DEFAULT); 753 + type_print(CONFIGFS_USET_DROPPING); 754 + #undef type_print 755 + } 756 + 757 + static int configfs_dump(struct configfs_dirent *sd, int level) 758 + { 759 + struct configfs_dirent *child_sd; 760 + int ret = 0; 761 + 762 + configfs_dump_one(sd, level); 763 + 764 + if (!(sd->s_type & (CONFIGFS_DIR|CONFIGFS_ROOT))) 765 + return 0; 766 + 767 + list_for_each_entry(child_sd, &sd->s_children, s_sibling) { 768 + ret = configfs_dump(child_sd, level + 2); 769 + if (ret) 770 + break; 771 + } 772 + 773 + return ret; 774 + } 775 + #endif 776 + 777 + 778 + /* 779 + * configfs_depend_item() and configfs_undepend_item() 780 + * 781 + * WARNING: Do not call these from a configfs callback! 782 + * 783 + * This describes these functions and their helpers. 784 + * 785 + * Allow another kernel system to depend on a config_item. If this 786 + * happens, the item cannot go away until the dependant can live without 787 + * it. The idea is to give client modules as simple an interface as 788 + * possible. When a system asks them to depend on an item, they just 789 + * call configfs_depend_item(). If the item is live and the client 790 + * driver is in good shape, we'll happily do the work for them. 791 + * 792 + * Why is the locking complex? 
Because configfs uses the VFS to handle 793 + * all locking, but this function is called outside the normal 794 + * VFS->configfs path. So it must take VFS locks to prevent the 795 + * VFS->configfs stuff (configfs_mkdir(), configfs_rmdir(), etc). This is 796 + * why you can't call these functions underneath configfs callbacks. 797 + * 798 + * Note, btw, that this can be called at *any* time, even when a configfs 799 + * subsystem isn't registered, or when configfs is loading or unloading. 800 + * Just like configfs_register_subsystem(). So we take the same 801 + * precautions. We pin the filesystem. We lock each i_mutex _in_order_ 802 + * on our way down the tree. If we can find the target item in the 803 + * configfs tree, it must be part of the subsystem tree as well, so we 804 + * do not need the subsystem semaphore. Holding the i_mutex chain locks 805 + * out mkdir() and rmdir(), who might be racing us. 806 + */ 807 + 808 + /* 809 + * configfs_depend_prep() 810 + * 811 + * Only subdirectories count here. Files (CONFIGFS_NOT_PINNED) are 812 + * attributes. This is similar but not the same to configfs_detach_prep(). 813 + * Note that configfs_detach_prep() expects the parent to be locked when it 814 + * is called, but we lock the parent *inside* configfs_depend_prep(). We 815 + * do that so we can unlock it if we find nothing. 816 + * 817 + * Here we do a depth-first search of the dentry hierarchy looking for 818 + * our object. We take i_mutex on each step of the way down. IT IS 819 + * ESSENTIAL THAT i_mutex LOCKING IS ORDERED. If we come back up a branch, 820 + * we'll drop the i_mutex. 821 + * 822 + * If the target is not found, -ENOENT is bubbled up and we have released 823 + * all locks. If the target was found, the locks will be cleared by 824 + * configfs_depend_rollback(). 825 + * 826 + * This adds a requirement that all config_items be unique! 827 + * 828 + * This is recursive because the locking traversal is tricky. 
There isn't 829 + * much on the stack, though, so folks that need this function - be careful 830 + * about your stack! Patches will be accepted to make it iterative. 831 + */ 832 + static int configfs_depend_prep(struct dentry *origin, 833 + struct config_item *target) 834 + { 835 + struct configfs_dirent *child_sd, *sd = origin->d_fsdata; 836 + int ret = 0; 837 + 838 + BUG_ON(!origin || !sd); 839 + 840 + /* Lock this guy on the way down */ 841 + mutex_lock(&sd->s_dentry->d_inode->i_mutex); 842 + if (sd->s_element == target) /* Boo-yah */ 843 + goto out; 844 + 845 + list_for_each_entry(child_sd, &sd->s_children, s_sibling) { 846 + if (child_sd->s_type & CONFIGFS_DIR) { 847 + ret = configfs_depend_prep(child_sd->s_dentry, 848 + target); 849 + if (!ret) 850 + goto out; /* Child path boo-yah */ 851 + } 852 + } 853 + 854 + /* We looped all our children and didn't find target */ 855 + mutex_unlock(&sd->s_dentry->d_inode->i_mutex); 856 + ret = -ENOENT; 857 + 858 + out: 859 + return ret; 860 + } 861 + 862 + /* 863 + * This is ONLY called if configfs_depend_prep() did its job. So we can 864 + * trust the entire path from item back up to origin. 865 + * 866 + * We walk backwards from item, unlocking each i_mutex. We finish by 867 + * unlocking origin. 868 + */ 869 + static void configfs_depend_rollback(struct dentry *origin, 870 + struct config_item *item) 871 + { 872 + struct dentry *dentry = item->ci_dentry; 873 + 874 + while (dentry != origin) { 875 + mutex_unlock(&dentry->d_inode->i_mutex); 876 + dentry = dentry->d_parent; 877 + } 878 + 879 + mutex_unlock(&origin->d_inode->i_mutex); 880 + } 881 + 882 + int configfs_depend_item(struct configfs_subsystem *subsys, 883 + struct config_item *target) 884 + { 885 + int ret; 886 + struct configfs_dirent *p, *root_sd, *subsys_sd = NULL; 887 + struct config_item *s_item = &subsys->su_group.cg_item; 888 + 889 + /* 890 + * Pin the configfs filesystem. This means we can safely access 891 + * the root of the configfs filesystem. 
892 + */ 893 + ret = configfs_pin_fs(); 894 + if (ret) 895 + return ret; 896 + 897 + /* 898 + * Next, lock the root directory. We're going to check that the 899 + * subsystem is really registered, and so we need to lock out 900 + * configfs_[un]register_subsystem(). 901 + */ 902 + mutex_lock(&configfs_sb->s_root->d_inode->i_mutex); 903 + 904 + root_sd = configfs_sb->s_root->d_fsdata; 905 + 906 + list_for_each_entry(p, &root_sd->s_children, s_sibling) { 907 + if (p->s_type & CONFIGFS_DIR) { 908 + if (p->s_element == s_item) { 909 + subsys_sd = p; 910 + break; 911 + } 912 + } 913 + } 914 + 915 + if (!subsys_sd) { 916 + ret = -ENOENT; 917 + goto out_unlock_fs; 918 + } 919 + 920 + /* Ok, now we can trust subsys/s_item */ 921 + 922 + /* Scan the tree, locking i_mutex recursively, return 0 if found */ 923 + ret = configfs_depend_prep(subsys_sd->s_dentry, target); 924 + if (ret) 925 + goto out_unlock_fs; 926 + 927 + /* We hold all i_mutexes from the subsystem down to the target */ 928 + p = target->ci_dentry->d_fsdata; 929 + p->s_dependent_count += 1; 930 + 931 + configfs_depend_rollback(subsys_sd->s_dentry, target); 932 + 933 + out_unlock_fs: 934 + mutex_unlock(&configfs_sb->s_root->d_inode->i_mutex); 935 + 936 + /* 937 + * If we succeeded, the fs is pinned via other methods. If not, 938 + * we're done with it anyway. So release_fs() is always right. 939 + */ 940 + configfs_release_fs(); 941 + 942 + return ret; 943 + } 944 + EXPORT_SYMBOL(configfs_depend_item); 945 + 946 + /* 947 + * Release the dependent linkage. This is much simpler than 948 + * configfs_depend_item() because we know that that the client driver is 949 + * pinned, thus the subsystem is pinned, and therefore configfs is pinned. 950 + */ 951 + void configfs_undepend_item(struct configfs_subsystem *subsys, 952 + struct config_item *target) 953 + { 954 + struct configfs_dirent *sd; 955 + 956 + /* 957 + * Since we can trust everything is pinned, we just need i_mutex 958 + * on the item. 
959 + */ 960 + mutex_lock(&target->ci_dentry->d_inode->i_mutex); 961 + 962 + sd = target->ci_dentry->d_fsdata; 963 + BUG_ON(sd->s_dependent_count < 1); 964 + 965 + sd->s_dependent_count -= 1; 966 + 967 + /* 968 + * After this unlock, we cannot trust the item to stay alive! 969 + * DO NOT REFERENCE item after this unlock. 970 + */ 971 + mutex_unlock(&target->ci_dentry->d_inode->i_mutex); 972 + } 973 + EXPORT_SYMBOL(configfs_undepend_item); 767 974 768 975 static int configfs_mkdir(struct inode *dir, struct dentry *dentry, int mode) 769 976 { ··· 1042 783 1043 784 snprintf(name, dentry->d_name.len + 1, "%s", dentry->d_name.name); 1044 785 1045 - down(&subsys->su_sem); 786 + mutex_lock(&subsys->su_mutex); 1046 787 group = NULL; 1047 788 item = NULL; 1048 789 if (type->ct_group_ops->make_group) { ··· 1056 797 if (item) 1057 798 link_obj(parent_item, item); 1058 799 } 1059 - up(&subsys->su_sem); 800 + mutex_unlock(&subsys->su_mutex); 1060 801 1061 802 kfree(name); 1062 803 if (!item) { ··· 1100 841 out_unlink: 1101 842 if (ret) { 1102 843 /* Tear down everything we built up */ 1103 - down(&subsys->su_sem); 844 + mutex_lock(&subsys->su_mutex); 845 + 846 + client_disconnect_notify(parent_item, item); 1104 847 if (group) 1105 848 unlink_group(group); 1106 849 else 1107 850 unlink_obj(item); 1108 851 client_drop_item(parent_item, item); 1109 - up(&subsys->su_sem); 852 + 853 + mutex_unlock(&subsys->su_mutex); 1110 854 1111 855 if (module_got) 1112 856 module_put(owner); ··· 1143 881 if (sd->s_type & CONFIGFS_USET_DEFAULT) 1144 882 return -EPERM; 1145 883 884 + /* 885 + * Here's where we check for dependents. We're protected by 886 + * i_mutex. 
887 + */ 888 + if (sd->s_dependent_count) 889 + return -EBUSY; 890 + 1146 891 /* Get a working ref until we have the child */ 1147 892 parent_item = configfs_get_config_item(dentry->d_parent); 1148 893 subsys = to_config_group(parent_item)->cg_subsys; ··· 1179 910 if (sd->s_type & CONFIGFS_USET_DIR) { 1180 911 configfs_detach_group(item); 1181 912 1182 - down(&subsys->su_sem); 913 + mutex_lock(&subsys->su_mutex); 914 + client_disconnect_notify(parent_item, item); 1183 915 unlink_group(to_config_group(item)); 1184 916 } else { 1185 917 configfs_detach_item(item); 1186 918 1187 - down(&subsys->su_sem); 919 + mutex_lock(&subsys->su_mutex); 920 + client_disconnect_notify(parent_item, item); 1188 921 unlink_obj(item); 1189 922 } 1190 923 1191 924 client_drop_item(parent_item, item); 1192 - up(&subsys->su_sem); 925 + mutex_unlock(&subsys->su_mutex); 1193 926 1194 927 /* Drop our reference from above */ 1195 928 config_item_put(item);
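The `configfs_depend_prep()`/`configfs_depend_rollback()` pair in the hunk above does an ordered, depth-first lock walk: lock each directory on the way down, keep the whole chain locked if the target is found beneath it, and drop everything otherwise. A minimal userspace sketch of that traversal (illustrative names; a `locked` flag stands in for i_mutex):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_KIDS 4

struct node {
	int locked;              /* stands in for holding i_mutex */
	struct node *parent;
	struct node *kids[MAX_KIDS];
};

/* Depth-first search: lock on the way down, in order. Returns 0 with
 * the branch to the target still locked, or -ENOENT with this subtree
 * fully unlocked again. */
static int depend_prep(struct node *n, struct node *target)
{
	n->locked = 1;
	if (n == target)
		return 0;

	for (int i = 0; i < MAX_KIDS && n->kids[i]; i++)
		if (depend_prep(n->kids[i], target) == 0)
			return 0;    /* target found below: keep chain locked */

	n->locked = 0;               /* nothing here: release and bubble up */
	return -ENOENT;
}

/* Walk back up from the target, dropping each lock, ending at origin.
 * Only valid if depend_prep() succeeded, matching the kernel comment. */
static void depend_rollback(struct node *origin, struct node *target)
{
	for (struct node *n = target; n != origin; n = n->parent)
		n->locked = 0;
	origin->locked = 0;
}
```

As in the kernel version, the recursion mirrors the tree depth, so deeply nested default_groups are the stack-usage concern the comment warns about.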
+18 -10
fs/configfs/file.c
··· 27 27 #include <linux/fs.h> 28 28 #include <linux/module.h> 29 29 #include <linux/slab.h> 30 + #include <linux/mutex.h> 30 31 #include <asm/uaccess.h> 31 - #include <asm/semaphore.h> 32 32 33 33 #include <linux/configfs.h> 34 34 #include "configfs_internal.h" 35 35 36 + /* 37 + * A simple attribute can only be 4096 characters. Why 4k? Because the 38 + * original code limited it to PAGE_SIZE. That's a bad idea, though, 39 + * because an attribute of 16k on ia64 won't work on x86. So we limit to 40 + * 4k, our minimum common page size. 41 + */ 42 + #define SIMPLE_ATTR_SIZE 4096 36 43 37 44 struct configfs_buffer { 38 45 size_t count; 39 46 loff_t pos; 40 47 char * page; 41 48 struct configfs_item_operations * ops; 42 - struct semaphore sem; 49 + struct mutex mutex; 43 50 int needs_read_fill; 44 51 }; 45 52 ··· 76 69 77 70 count = ops->show_attribute(item,attr,buffer->page); 78 71 buffer->needs_read_fill = 0; 79 - BUG_ON(count > (ssize_t)PAGE_SIZE); 72 + BUG_ON(count > (ssize_t)SIMPLE_ATTR_SIZE); 80 73 if (count >= 0) 81 74 buffer->count = count; 82 75 else ··· 109 102 struct configfs_buffer * buffer = file->private_data; 110 103 ssize_t retval = 0; 111 104 112 - down(&buffer->sem); 105 + mutex_lock(&buffer->mutex); 113 106 if (buffer->needs_read_fill) { 114 107 if ((retval = fill_read_buffer(file->f_path.dentry,buffer))) 115 108 goto out; ··· 119 112 retval = simple_read_from_buffer(buf, count, ppos, buffer->page, 120 113 buffer->count); 121 114 out: 122 - up(&buffer->sem); 115 + mutex_unlock(&buffer->mutex); 123 116 return retval; 124 117 } 125 118 ··· 144 137 if (!buffer->page) 145 138 return -ENOMEM; 146 139 147 - if (count >= PAGE_SIZE) 148 - count = PAGE_SIZE - 1; 140 + if (count >= SIMPLE_ATTR_SIZE) 141 + count = SIMPLE_ATTR_SIZE - 1; 149 142 error = copy_from_user(buffer->page,buf,count); 150 143 buffer->needs_read_fill = 1; 151 144 /* if buf is assumed to contain a string, terminate it by \0, ··· 200 193 struct configfs_buffer * buffer = 
file->private_data; 201 194 ssize_t len; 202 195 203 - down(&buffer->sem); 196 + mutex_lock(&buffer->mutex); 204 197 len = fill_write_buffer(buffer, buf, count); 205 198 if (len > 0) 206 199 len = flush_write_buffer(file->f_path.dentry, buffer, count); 207 200 if (len > 0) 208 201 *ppos += len; 209 - up(&buffer->sem); 202 + mutex_unlock(&buffer->mutex); 210 203 return len; 211 204 } 212 205 ··· 260 253 error = -ENOMEM; 261 254 goto Enomem; 262 255 } 263 - init_MUTEX(&buffer->sem); 256 + mutex_init(&buffer->mutex); 264 257 buffer->needs_read_fill = 1; 265 258 buffer->ops = ops; 266 259 file->private_data = buffer; ··· 299 292 if (buffer) { 300 293 if (buffer->page) 301 294 free_page((unsigned long)buffer->page); 295 + mutex_destroy(&buffer->mutex); 302 296 kfree(buffer); 303 297 } 304 298 return 0;
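The file.c hunk above makes two related changes: the per-open attribute buffer is guarded by a mutex instead of a semaphore, and its size is capped at a fixed `SIMPLE_ATTR_SIZE` of 4096 bytes so attributes behave the same on every page size. A hedged userspace sketch of that buffer pattern (names are illustrative, not the configfs internals):

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define SIMPLE_ATTR_SIZE 4096

struct attr_buffer {
	pthread_mutex_t mutex;           /* was a semaphore; now a mutex */
	size_t count;
	char page[SIMPLE_ATTR_SIZE];
};

/* Copy incoming data under the mutex, clamping to SIMPLE_ATTR_SIZE - 1
 * and NUL-terminating, the way fill_write_buffer() treats string
 * attributes. Returns the number of bytes stored. */
static size_t buffer_write(struct attr_buffer *b, const char *buf,
			   size_t count)
{
	pthread_mutex_lock(&b->mutex);
	if (count >= SIMPLE_ATTR_SIZE)
		count = SIMPLE_ATTR_SIZE - 1;
	memcpy(b->page, buf, count);
	b->page[count] = '\0';
	b->count = count;
	pthread_mutex_unlock(&b->mutex);
	return count;
}
```

The fixed cap is the portability point the new comment makes: a 16k attribute written on ia64 would not be readable on x86 if the limit tracked PAGE_SIZE.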
+9 -20
fs/configfs/item.c
··· 62 62 * dynamically allocated string that @item->ci_name points to. 63 63 * Otherwise, use the static @item->ci_namebuf array. 64 64 */ 65 - 66 65 int config_item_set_name(struct config_item * item, const char * fmt, ...) 67 66 { 68 67 int error = 0; ··· 138 139 return item; 139 140 } 140 141 141 - /** 142 - * config_item_cleanup - free config_item resources. 143 - * @item: item. 144 - */ 145 - 146 - void config_item_cleanup(struct config_item * item) 142 + static void config_item_cleanup(struct config_item * item) 147 143 { 148 144 struct config_item_type * t = item->ci_type; 149 145 struct config_group * s = item->ci_group; ··· 173 179 kref_put(&item->ci_kref, config_item_release); 174 180 } 175 181 176 - 177 182 /** 178 183 * config_group_init - initialize a group for use 179 184 * @k: group 180 185 */ 181 - 182 186 void config_group_init(struct config_group *group) 183 187 { 184 188 config_item_init(&group->cg_item); 185 189 INIT_LIST_HEAD(&group->cg_children); 186 190 } 187 191 188 - 189 192 /** 190 - * config_group_find_obj - search for item in group. 193 + * config_group_find_item - search for item in group. 191 194 * @group: group we're looking in. 192 195 * @name: item's name. 193 196 * 194 - * Lock group via @group->cg_subsys, and iterate over @group->cg_list, 195 - * looking for a matching config_item. If matching item is found 196 - * take a reference and return the item. 197 + * Iterate over @group->cg_list, looking for a matching config_item. 198 + * If matching item is found take a reference and return the item. 199 + * Caller must have locked group via @group->cg_subsys->su_mtx. 197 200 */ 198 - 199 - struct config_item * config_group_find_obj(struct config_group * group, const char * name) 201 + struct config_item *config_group_find_item(struct config_group *group, 202 + const char *name) 200 203 { 201 204 struct list_head * entry; 202 205 struct config_item * ret = NULL; 203 206 204 - /* XXX LOCKING! 
*/ 205 207 list_for_each(entry,&group->cg_children) { 206 208 struct config_item * item = to_item(entry); 207 209 if (config_item_name(item) && 208 - !strcmp(config_item_name(item), name)) { 210 + !strcmp(config_item_name(item), name)) { 209 211 ret = config_item_get(item); 210 212 break; 211 213 } ··· 209 219 return ret; 210 220 } 211 221 212 - 213 222 EXPORT_SYMBOL(config_item_init); 214 223 EXPORT_SYMBOL(config_group_init); 215 224 EXPORT_SYMBOL(config_item_get); 216 225 EXPORT_SYMBOL(config_item_put); 217 - EXPORT_SYMBOL(config_group_find_obj); 226 + EXPORT_SYMBOL(config_group_find_item);
+6 -14
fs/dlm/config.c
··· 133 133 return len; 134 134 } 135 135 136 - #define __CONFIGFS_ATTR(_name,_mode,_read,_write) { \ 137 - .attr = { .ca_name = __stringify(_name), \ 138 - .ca_mode = _mode, \ 139 - .ca_owner = THIS_MODULE }, \ 140 - .show = _read, \ 141 - .store = _write, \ 142 - } 143 - 144 136 #define CLUSTER_ATTR(name, check_zero) \ 145 137 static ssize_t name##_write(struct cluster *cl, const char *buf, size_t len) \ 146 138 { \ ··· 607 615 int dlm_config_init(void) 608 616 { 609 617 config_group_init(&clusters_root.subsys.su_group); 610 - init_MUTEX(&clusters_root.subsys.su_sem); 618 + mutex_init(&clusters_root.subsys.su_mutex); 611 619 return configfs_register_subsystem(&clusters_root.subsys); 612 620 } 613 621 ··· 751 759 if (!space_list) 752 760 return NULL; 753 761 754 - down(&space_list->cg_subsys->su_sem); 755 - i = config_group_find_obj(space_list, name); 756 - up(&space_list->cg_subsys->su_sem); 762 + mutex_lock(&space_list->cg_subsys->su_mutex); 763 + i = config_group_find_item(space_list, name); 764 + mutex_unlock(&space_list->cg_subsys->su_mutex); 757 765 758 766 return to_space(i); 759 767 } ··· 772 780 if (!comm_list) 773 781 return NULL; 774 782 775 - down(&clusters_root.subsys.su_sem); 783 + mutex_lock(&clusters_root.subsys.su_mutex); 776 784 777 785 list_for_each_entry(i, &comm_list->cg_children, ci_entry) { 778 786 cm = to_comm(i); ··· 792 800 break; 793 801 } 794 802 } 795 - up(&clusters_root.subsys.su_sem); 803 + mutex_unlock(&clusters_root.subsys.su_mutex); 796 804 797 805 if (!found) 798 806 cm = NULL;
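The dlm conversion above shows the intended calling pattern for the renamed `config_group_find_item()`: the caller takes the subsystem mutex, the lookup grabs a reference on the match before the lock is dropped, so the returned item stays pinned. A small userspace model of that lookup (all names here are illustrative):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct item {
	const char *name;
	int refcount;
	struct item *next;
};

struct group {
	pthread_mutex_t mutex;   /* stands in for subsys->su_mutex */
	struct item *children;
};

/* Search the child list by name under the group's mutex. A reference
 * is taken while still locked, so the item cannot vanish between the
 * lookup and the caller using it. */
static struct item *group_find_item(struct group *g, const char *name)
{
	struct item *found = NULL;

	pthread_mutex_lock(&g->mutex);
	for (struct item *it = g->children; it; it = it->next) {
		if (!strcmp(it->name, name)) {
			it->refcount++;
			found = it;
			break;
		}
	}
	pthread_mutex_unlock(&g->mutex);
	return found;
}
```

This mirrors `get_space()` above: lock, find-and-get, unlock, and the caller later drops the reference when done.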
+2444 -232
fs/ocfs2/alloc.c
··· 50 50 #include "buffer_head_io.h" 51 51 52 52 static void ocfs2_free_truncate_context(struct ocfs2_truncate_context *tc); 53 + static int ocfs2_cache_extent_block_free(struct ocfs2_cached_dealloc_ctxt *ctxt, 54 + struct ocfs2_extent_block *eb); 53 55 54 56 /* 55 57 * Structures which describe a path through a btree, and functions to ··· 115 113 if (path) { 116 114 ocfs2_reinit_path(path, 0); 117 115 kfree(path); 116 + } 117 + } 118 + 119 + /* 120 + * All the elements of src into dest. After this call, src could be freed 121 + * without affecting dest. 122 + * 123 + * Both paths should have the same root. Any non-root elements of dest 124 + * will be freed. 125 + */ 126 + static void ocfs2_cp_path(struct ocfs2_path *dest, struct ocfs2_path *src) 127 + { 128 + int i; 129 + 130 + BUG_ON(path_root_bh(dest) != path_root_bh(src)); 131 + BUG_ON(path_root_el(dest) != path_root_el(src)); 132 + 133 + ocfs2_reinit_path(dest, 1); 134 + 135 + for(i = 1; i < OCFS2_MAX_PATH_DEPTH; i++) { 136 + dest->p_node[i].bh = src->p_node[i].bh; 137 + dest->p_node[i].el = src->p_node[i].el; 138 + 139 + if (dest->p_node[i].bh) 140 + get_bh(dest->p_node[i].bh); 118 141 } 119 142 } 120 143 ··· 239 212 return ret; 240 213 } 241 214 215 + /* 216 + * Return the index of the extent record which contains cluster #v_cluster. 217 + * -1 is returned if it was not found. 218 + * 219 + * Should work fine on interior and exterior nodes. 
220 + */ 221 + int ocfs2_search_extent_list(struct ocfs2_extent_list *el, u32 v_cluster) 222 + { 223 + int ret = -1; 224 + int i; 225 + struct ocfs2_extent_rec *rec; 226 + u32 rec_end, rec_start, clusters; 227 + 228 + for(i = 0; i < le16_to_cpu(el->l_next_free_rec); i++) { 229 + rec = &el->l_recs[i]; 230 + 231 + rec_start = le32_to_cpu(rec->e_cpos); 232 + clusters = ocfs2_rec_clusters(el, rec); 233 + 234 + rec_end = rec_start + clusters; 235 + 236 + if (v_cluster >= rec_start && v_cluster < rec_end) { 237 + ret = i; 238 + break; 239 + } 240 + } 241 + 242 + return ret; 243 + } 244 + 242 245 enum ocfs2_contig_type { 243 246 CONTIG_NONE = 0, 244 247 CONTIG_LEFT, 245 - CONTIG_RIGHT 248 + CONTIG_RIGHT, 249 + CONTIG_LEFTRIGHT, 246 250 }; 247 251 248 252 ··· 311 253 { 312 254 u64 blkno = le64_to_cpu(insert_rec->e_blkno); 313 255 256 + /* 257 + * Refuse to coalesce extent records with different flag 258 + * fields - we don't want to mix unwritten extents with user 259 + * data. 260 + */ 261 + if (ext->e_flags != insert_rec->e_flags) 262 + return CONTIG_NONE; 263 + 314 264 if (ocfs2_extents_adjacent(ext, insert_rec) && 315 265 ocfs2_block_extent_contig(inode->i_sb, ext, blkno)) 316 266 return CONTIG_RIGHT; ··· 343 277 APPEND_TAIL, 344 278 }; 345 279 280 + enum ocfs2_split_type { 281 + SPLIT_NONE = 0, 282 + SPLIT_LEFT, 283 + SPLIT_RIGHT, 284 + }; 285 + 346 286 struct ocfs2_insert_type { 287 + enum ocfs2_split_type ins_split; 347 288 enum ocfs2_append_type ins_appending; 348 289 enum ocfs2_contig_type ins_contig; 349 290 int ins_contig_index; 350 291 int ins_free_records; 351 292 int ins_tree_depth; 293 + }; 294 + 295 + struct ocfs2_merge_ctxt { 296 + enum ocfs2_contig_type c_contig_type; 297 + int c_has_empty_extent; 298 + int c_split_covers_rec; 299 + int c_used_tail_recs; 352 300 }; 353 301 354 302 /* ··· 464 384 strcpy(eb->h_signature, OCFS2_EXTENT_BLOCK_SIGNATURE); 465 385 eb->h_blkno = cpu_to_le64(first_blkno); 466 386 eb->h_fs_generation = 
cpu_to_le32(osb->fs_generation); 467 - 468 - #ifndef OCFS2_USE_ALL_METADATA_SUBALLOCATORS 469 - /* we always use slot zero's suballocator */ 470 - eb->h_suballoc_slot = 0; 471 - #else 472 387 eb->h_suballoc_slot = cpu_to_le16(osb->slot_num); 473 - #endif 474 388 eb->h_suballoc_bit = cpu_to_le16(suballoc_bit_start); 475 389 eb->h_list.l_count = 476 390 cpu_to_le16(ocfs2_extent_recs_per_eb(osb->sb)); ··· 535 461 struct inode *inode, 536 462 struct buffer_head *fe_bh, 537 463 struct buffer_head *eb_bh, 538 - struct buffer_head *last_eb_bh, 464 + struct buffer_head **last_eb_bh, 539 465 struct ocfs2_alloc_context *meta_ac) 540 466 { 541 467 int status, new_blocks, i; ··· 550 476 551 477 mlog_entry_void(); 552 478 553 - BUG_ON(!last_eb_bh); 479 + BUG_ON(!last_eb_bh || !*last_eb_bh); 554 480 555 481 fe = (struct ocfs2_dinode *) fe_bh->b_data; 556 482 ··· 581 507 goto bail; 582 508 } 583 509 584 - eb = (struct ocfs2_extent_block *)last_eb_bh->b_data; 510 + eb = (struct ocfs2_extent_block *)(*last_eb_bh)->b_data; 585 511 new_cpos = ocfs2_sum_rightmost_rec(&eb->h_list); 586 512 587 513 /* Note: new_eb_bhs[new_blocks - 1] is the guy which will be ··· 642 568 * journal_dirty erroring as it won't unless we've aborted the 643 569 * handle (in which case we would never be here) so reserving 644 570 * the write with journal_access is all we need to do. */ 645 - status = ocfs2_journal_access(handle, inode, last_eb_bh, 571 + status = ocfs2_journal_access(handle, inode, *last_eb_bh, 646 572 OCFS2_JOURNAL_ACCESS_WRITE); 647 573 if (status < 0) { 648 574 mlog_errno(status); ··· 675 601 * next_leaf on the previously last-extent-block. 
*/ 676 602 fe->i_last_eb_blk = cpu_to_le64(new_last_eb_blk); 677 603 678 - eb = (struct ocfs2_extent_block *) last_eb_bh->b_data; 604 + eb = (struct ocfs2_extent_block *) (*last_eb_bh)->b_data; 679 605 eb->h_next_leaf_blk = cpu_to_le64(new_last_eb_blk); 680 606 681 - status = ocfs2_journal_dirty(handle, last_eb_bh); 607 + status = ocfs2_journal_dirty(handle, *last_eb_bh); 682 608 if (status < 0) 683 609 mlog_errno(status); 684 610 status = ocfs2_journal_dirty(handle, fe_bh); ··· 689 615 if (status < 0) 690 616 mlog_errno(status); 691 617 } 618 + 619 + /* 620 + * Some callers want to track the rightmost leaf so pass it 621 + * back here. 622 + */ 623 + brelse(*last_eb_bh); 624 + get_bh(new_eb_bhs[0]); 625 + *last_eb_bh = new_eb_bhs[0]; 692 626 693 627 status = 0; 694 628 bail: ··· 911 829 } 912 830 913 831 /* 832 + * Grow a b-tree so that it has more records. 833 + * 834 + * We might shift the tree depth in which case existing paths should 835 + * be considered invalid. 836 + * 837 + * Tree depth after the grow is returned via *final_depth. 838 + * 839 + * *last_eb_bh will be updated by ocfs2_add_branch(). 
840 + */ 841 + static int ocfs2_grow_tree(struct inode *inode, handle_t *handle, 842 + struct buffer_head *di_bh, int *final_depth, 843 + struct buffer_head **last_eb_bh, 844 + struct ocfs2_alloc_context *meta_ac) 845 + { 846 + int ret, shift; 847 + struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data; 848 + int depth = le16_to_cpu(di->id2.i_list.l_tree_depth); 849 + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 850 + struct buffer_head *bh = NULL; 851 + 852 + BUG_ON(meta_ac == NULL); 853 + 854 + shift = ocfs2_find_branch_target(osb, inode, di_bh, &bh); 855 + if (shift < 0) { 856 + ret = shift; 857 + mlog_errno(ret); 858 + goto out; 859 + } 860 + 861 + /* We traveled all the way to the bottom of the allocation tree 862 + * and didn't find room for any more extents - we need to add 863 + * another tree level */ 864 + if (shift) { 865 + BUG_ON(bh); 866 + mlog(0, "need to shift tree depth (current = %d)\n", depth); 867 + 868 + /* ocfs2_shift_tree_depth will return us a buffer with 869 + * the new extent block (so we can pass that to 870 + * ocfs2_add_branch). */ 871 + ret = ocfs2_shift_tree_depth(osb, handle, inode, di_bh, 872 + meta_ac, &bh); 873 + if (ret < 0) { 874 + mlog_errno(ret); 875 + goto out; 876 + } 877 + depth++; 878 + if (depth == 1) { 879 + /* 880 + * Special case: we have room now if we shifted from 881 + * tree_depth 0, so no more work needs to be done. 882 + * 883 + * We won't be calling add_branch, so pass 884 + * back *last_eb_bh as the new leaf. At depth 885 + * zero, it should always be null so there's 886 + * no reason to brelse. 887 + */ 888 + BUG_ON(*last_eb_bh); 889 + get_bh(bh); 890 + *last_eb_bh = bh; 891 + goto out; 892 + } 893 + } 894 + 895 + /* call ocfs2_add_branch to add the final part of the tree with 896 + * the new data. */ 897 + mlog(0, "add branch. 
bh = %p\n", bh); 898 + ret = ocfs2_add_branch(osb, handle, inode, di_bh, bh, last_eb_bh, 899 + meta_ac); 900 + if (ret < 0) { 901 + mlog_errno(ret); 902 + goto out; 903 + } 904 + 905 + out: 906 + if (final_depth) 907 + *final_depth = depth; 908 + brelse(bh); 909 + return ret; 910 + } 911 + 912 + /* 914 913 * This is only valid for leaf nodes, which are the only ones that can 915 914 * have empty extents anyway. 916 915 */ ··· 1095 932 1096 933 el->l_recs[insert_index] = *insert_rec; 1097 934 935 + } 936 + 937 + static void ocfs2_remove_empty_extent(struct ocfs2_extent_list *el) 938 + { 939 + int size, num_recs = le16_to_cpu(el->l_next_free_rec); 940 + 941 + BUG_ON(num_recs == 0); 942 + 943 + if (ocfs2_is_empty_extent(&el->l_recs[0])) { 944 + num_recs--; 945 + size = num_recs * sizeof(struct ocfs2_extent_rec); 946 + memmove(&el->l_recs[0], &el->l_recs[1], size); 947 + memset(&el->l_recs[num_recs], 0, 948 + sizeof(struct ocfs2_extent_rec)); 949 + el->l_next_free_rec = cpu_to_le16(num_recs); 950 + } 1098 951 } 1099 952 1100 953 /* ··· 1390 1211 * immediately to their right. 1391 1212 */ 1392 1213 left_clusters = le32_to_cpu(right_child_el->l_recs[0].e_cpos); 1214 + if (ocfs2_is_empty_extent(&right_child_el->l_recs[0])) { 1215 + BUG_ON(le16_to_cpu(right_child_el->l_next_free_rec) <= 1); 1216 + left_clusters = le32_to_cpu(right_child_el->l_recs[1].e_cpos); 1217 + } 1393 1218 left_clusters -= le32_to_cpu(left_rec->e_cpos); 1394 1219 left_rec->e_int_clusters = cpu_to_le32(left_clusters); 1395 1220 ··· 1714 1531 return ret; 1715 1532 } 1716 1533 1534 + /* 1535 + * Extend the transaction by enough credits to complete the rotation, 1536 + * and still leave at least the original number of credits allocated 1537 + * to this transaction. 
1538 + */ 1717 1539 static int ocfs2_extend_rotate_transaction(handle_t *handle, int subtree_depth, 1540 + int op_credits, 1718 1541 struct ocfs2_path *path) 1719 1542 { 1720 - int credits = (path->p_tree_depth - subtree_depth) * 2 + 1; 1543 + int credits = (path->p_tree_depth - subtree_depth) * 2 + 1 + op_credits; 1721 1544 1722 1545 if (handle->h_buffer_credits < credits) 1723 1546 return ocfs2_extend_trans(handle, credits); ··· 1757 1568 return 0; 1758 1569 } 1759 1570 1571 + static int ocfs2_leftmost_rec_contains(struct ocfs2_extent_list *el, u32 cpos) 1572 + { 1573 + int next_free = le16_to_cpu(el->l_next_free_rec); 1574 + unsigned int range; 1575 + struct ocfs2_extent_rec *rec; 1576 + 1577 + if (next_free == 0) 1578 + return 0; 1579 + 1580 + rec = &el->l_recs[0]; 1581 + if (ocfs2_is_empty_extent(rec)) { 1582 + /* Empty list. */ 1583 + if (next_free == 1) 1584 + return 0; 1585 + rec = &el->l_recs[1]; 1586 + } 1587 + 1588 + range = le32_to_cpu(rec->e_cpos) + ocfs2_rec_clusters(el, rec); 1589 + if (cpos >= le32_to_cpu(rec->e_cpos) && cpos < range) 1590 + return 1; 1591 + return 0; 1592 + } 1593 + 1760 1594 /* 1761 1595 * Rotate all the records in a btree right one record, starting at insert_cpos. 
1762 1596 * ··· 1798 1586 */ 1799 1587 static int ocfs2_rotate_tree_right(struct inode *inode, 1800 1588 handle_t *handle, 1589 + enum ocfs2_split_type split, 1801 1590 u32 insert_cpos, 1802 1591 struct ocfs2_path *right_path, 1803 1592 struct ocfs2_path **ret_left_path) 1804 1593 { 1805 - int ret, start; 1594 + int ret, start, orig_credits = handle->h_buffer_credits; 1806 1595 u32 cpos; 1807 1596 struct ocfs2_path *left_path = NULL; 1808 1597 ··· 1870 1657 (unsigned long long) 1871 1658 path_leaf_bh(left_path)->b_blocknr); 1872 1659 1873 - if (ocfs2_rotate_requires_path_adjustment(left_path, 1660 + if (split == SPLIT_NONE && 1661 + ocfs2_rotate_requires_path_adjustment(left_path, 1874 1662 insert_cpos)) { 1875 - mlog(0, "Path adjustment required\n"); 1876 1663 1877 1664 /* 1878 1665 * We've rotated the tree as much as we ··· 1900 1687 right_path->p_tree_depth); 1901 1688 1902 1689 ret = ocfs2_extend_rotate_transaction(handle, start, 1903 - right_path); 1690 + orig_credits, right_path); 1904 1691 if (ret) { 1905 1692 mlog_errno(ret); 1906 1693 goto out; ··· 1911 1698 if (ret) { 1912 1699 mlog_errno(ret); 1913 1700 goto out; 1701 + } 1702 + 1703 + if (split != SPLIT_NONE && 1704 + ocfs2_leftmost_rec_contains(path_leaf_el(right_path), 1705 + insert_cpos)) { 1706 + /* 1707 + * A rotate moves the rightmost left leaf 1708 + * record over to the leftmost right leaf 1709 + * slot. If we're doing an extent split 1710 + * instead of a real insert, then we have to 1711 + * check that the extent to be split wasn't 1712 + * just moved over. If it was, then we can 1713 + * exit here, passing left_path back - 1714 + * ocfs2_split_extent() is smart enough to 1715 + * search both leaves. 
1716 + */ 1717 + *ret_left_path = left_path; 1718 + goto out_ret_path; 1914 1719 } 1915 1720 1916 1721 /* ··· 1953 1722 return ret; 1954 1723 } 1955 1724 1725 + static void ocfs2_update_edge_lengths(struct inode *inode, handle_t *handle, 1726 + struct ocfs2_path *path) 1727 + { 1728 + int i, idx; 1729 + struct ocfs2_extent_rec *rec; 1730 + struct ocfs2_extent_list *el; 1731 + struct ocfs2_extent_block *eb; 1732 + u32 range; 1733 + 1734 + /* Path should always be rightmost. */ 1735 + eb = (struct ocfs2_extent_block *)path_leaf_bh(path)->b_data; 1736 + BUG_ON(eb->h_next_leaf_blk != 0ULL); 1737 + 1738 + el = &eb->h_list; 1739 + BUG_ON(le16_to_cpu(el->l_next_free_rec) == 0); 1740 + idx = le16_to_cpu(el->l_next_free_rec) - 1; 1741 + rec = &el->l_recs[idx]; 1742 + range = le32_to_cpu(rec->e_cpos) + ocfs2_rec_clusters(el, rec); 1743 + 1744 + for (i = 0; i < path->p_tree_depth; i++) { 1745 + el = path->p_node[i].el; 1746 + idx = le16_to_cpu(el->l_next_free_rec) - 1; 1747 + rec = &el->l_recs[idx]; 1748 + 1749 + rec->e_int_clusters = cpu_to_le32(range); 1750 + le32_add_cpu(&rec->e_int_clusters, -le32_to_cpu(rec->e_cpos)); 1751 + 1752 + ocfs2_journal_dirty(handle, path->p_node[i].bh); 1753 + } 1754 + } 1755 + 1756 + static void ocfs2_unlink_path(struct inode *inode, handle_t *handle, 1757 + struct ocfs2_cached_dealloc_ctxt *dealloc, 1758 + struct ocfs2_path *path, int unlink_start) 1759 + { 1760 + int ret, i; 1761 + struct ocfs2_extent_block *eb; 1762 + struct ocfs2_extent_list *el; 1763 + struct buffer_head *bh; 1764 + 1765 + for(i = unlink_start; i < path_num_items(path); i++) { 1766 + bh = path->p_node[i].bh; 1767 + 1768 + eb = (struct ocfs2_extent_block *)bh->b_data; 1769 + /* 1770 + * Not all nodes might have had their final count 1771 + * decremented by the caller - handle this here. 
1772 + */ 1773 + el = &eb->h_list; 1774 + if (le16_to_cpu(el->l_next_free_rec) > 1) { 1775 + mlog(ML_ERROR, 1776 + "Inode %llu, attempted to remove extent block " 1777 + "%llu with %u records\n", 1778 + (unsigned long long)OCFS2_I(inode)->ip_blkno, 1779 + (unsigned long long)le64_to_cpu(eb->h_blkno), 1780 + le16_to_cpu(el->l_next_free_rec)); 1781 + 1782 + ocfs2_journal_dirty(handle, bh); 1783 + ocfs2_remove_from_cache(inode, bh); 1784 + continue; 1785 + } 1786 + 1787 + el->l_next_free_rec = 0; 1788 + memset(&el->l_recs[0], 0, sizeof(struct ocfs2_extent_rec)); 1789 + 1790 + ocfs2_journal_dirty(handle, bh); 1791 + 1792 + ret = ocfs2_cache_extent_block_free(dealloc, eb); 1793 + if (ret) 1794 + mlog_errno(ret); 1795 + 1796 + ocfs2_remove_from_cache(inode, bh); 1797 + } 1798 + } 1799 + 1800 + static void ocfs2_unlink_subtree(struct inode *inode, handle_t *handle, 1801 + struct ocfs2_path *left_path, 1802 + struct ocfs2_path *right_path, 1803 + int subtree_index, 1804 + struct ocfs2_cached_dealloc_ctxt *dealloc) 1805 + { 1806 + int i; 1807 + struct buffer_head *root_bh = left_path->p_node[subtree_index].bh; 1808 + struct ocfs2_extent_list *root_el = left_path->p_node[subtree_index].el; 1809 + struct ocfs2_extent_list *el; 1810 + struct ocfs2_extent_block *eb; 1811 + 1812 + el = path_leaf_el(left_path); 1813 + 1814 + eb = (struct ocfs2_extent_block *)right_path->p_node[subtree_index + 1].bh->b_data; 1815 + 1816 + for(i = 1; i < le16_to_cpu(root_el->l_next_free_rec); i++) 1817 + if (root_el->l_recs[i].e_blkno == eb->h_blkno) 1818 + break; 1819 + 1820 + BUG_ON(i >= le16_to_cpu(root_el->l_next_free_rec)); 1821 + 1822 + memset(&root_el->l_recs[i], 0, sizeof(struct ocfs2_extent_rec)); 1823 + le16_add_cpu(&root_el->l_next_free_rec, -1); 1824 + 1825 + eb = (struct ocfs2_extent_block *)path_leaf_bh(left_path)->b_data; 1826 + eb->h_next_leaf_blk = 0; 1827 + 1828 + ocfs2_journal_dirty(handle, root_bh); 1829 + ocfs2_journal_dirty(handle, path_leaf_bh(left_path)); 1830 + 1831 + 
ocfs2_unlink_path(inode, handle, dealloc, right_path, 1832 + subtree_index + 1); 1833 + } 1834 + 1835 + static int ocfs2_rotate_subtree_left(struct inode *inode, handle_t *handle, 1836 + struct ocfs2_path *left_path, 1837 + struct ocfs2_path *right_path, 1838 + int subtree_index, 1839 + struct ocfs2_cached_dealloc_ctxt *dealloc, 1840 + int *deleted) 1841 + { 1842 + int ret, i, del_right_subtree = 0, right_has_empty = 0; 1843 + struct buffer_head *root_bh, *di_bh = path_root_bh(right_path); 1844 + struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data; 1845 + struct ocfs2_extent_list *right_leaf_el, *left_leaf_el; 1846 + struct ocfs2_extent_block *eb; 1847 + 1848 + *deleted = 0; 1849 + 1850 + right_leaf_el = path_leaf_el(right_path); 1851 + left_leaf_el = path_leaf_el(left_path); 1852 + root_bh = left_path->p_node[subtree_index].bh; 1853 + BUG_ON(root_bh != right_path->p_node[subtree_index].bh); 1854 + 1855 + if (!ocfs2_is_empty_extent(&left_leaf_el->l_recs[0])) 1856 + return 0; 1857 + 1858 + eb = (struct ocfs2_extent_block *)path_leaf_bh(right_path)->b_data; 1859 + if (ocfs2_is_empty_extent(&right_leaf_el->l_recs[0])) { 1860 + /* 1861 + * It's legal for us to proceed if the right leaf is 1862 + * the rightmost one and it has an empty extent. There 1863 + * are two cases to handle - whether the leaf will be 1864 + * empty after removal or not. If the leaf isn't empty 1865 + * then just remove the empty extent up front. The 1866 + * next block will handle empty leaves by flagging 1867 + * them for unlink. 1868 + * 1869 + * Non rightmost leaves will throw -EAGAIN and the 1870 + * caller can manually move the subtree and retry. 
1871 + */ 1872 + 1873 + if (eb->h_next_leaf_blk != 0ULL) 1874 + return -EAGAIN; 1875 + 1876 + if (le16_to_cpu(right_leaf_el->l_next_free_rec) > 1) { 1877 + ret = ocfs2_journal_access(handle, inode, 1878 + path_leaf_bh(right_path), 1879 + OCFS2_JOURNAL_ACCESS_WRITE); 1880 + if (ret) { 1881 + mlog_errno(ret); 1882 + goto out; 1883 + } 1884 + 1885 + ocfs2_remove_empty_extent(right_leaf_el); 1886 + } else 1887 + right_has_empty = 1; 1888 + } 1889 + 1890 + if (eb->h_next_leaf_blk == 0ULL && 1891 + le16_to_cpu(right_leaf_el->l_next_free_rec) == 1) { 1892 + /* 1893 + * We have to update i_last_eb_blk during the meta 1894 + * data delete. 1895 + */ 1896 + ret = ocfs2_journal_access(handle, inode, di_bh, 1897 + OCFS2_JOURNAL_ACCESS_WRITE); 1898 + if (ret) { 1899 + mlog_errno(ret); 1900 + goto out; 1901 + } 1902 + 1903 + del_right_subtree = 1; 1904 + } 1905 + 1906 + /* 1907 + * Getting here with an empty extent in the right path implies 1908 + * that it's the rightmost path and will be deleted. 1909 + */ 1910 + BUG_ON(right_has_empty && !del_right_subtree); 1911 + 1912 + ret = ocfs2_journal_access(handle, inode, root_bh, 1913 + OCFS2_JOURNAL_ACCESS_WRITE); 1914 + if (ret) { 1915 + mlog_errno(ret); 1916 + goto out; 1917 + } 1918 + 1919 + for(i = subtree_index + 1; i < path_num_items(right_path); i++) { 1920 + ret = ocfs2_journal_access(handle, inode, 1921 + right_path->p_node[i].bh, 1922 + OCFS2_JOURNAL_ACCESS_WRITE); 1923 + if (ret) { 1924 + mlog_errno(ret); 1925 + goto out; 1926 + } 1927 + 1928 + ret = ocfs2_journal_access(handle, inode, 1929 + left_path->p_node[i].bh, 1930 + OCFS2_JOURNAL_ACCESS_WRITE); 1931 + if (ret) { 1932 + mlog_errno(ret); 1933 + goto out; 1934 + } 1935 + } 1936 + 1937 + if (!right_has_empty) { 1938 + /* 1939 + * Only do this if we're moving a real 1940 + * record. Otherwise, the action is delayed until 1941 + * after removal of the right path in which case we 1942 + * can do a simple shift to remove the empty extent. 
1943 + */ 1944 + ocfs2_rotate_leaf(left_leaf_el, &right_leaf_el->l_recs[0]); 1945 + memset(&right_leaf_el->l_recs[0], 0, 1946 + sizeof(struct ocfs2_extent_rec)); 1947 + } 1948 + if (eb->h_next_leaf_blk == 0ULL) { 1949 + /* 1950 + * Move recs over to get rid of empty extent, decrease 1951 + * next_free. This is allowed to remove the last 1952 + * extent in our leaf (setting l_next_free_rec to 1953 + * zero) - the delete code below won't care. 1954 + */ 1955 + ocfs2_remove_empty_extent(right_leaf_el); 1956 + } 1957 + 1958 + ret = ocfs2_journal_dirty(handle, path_leaf_bh(left_path)); 1959 + if (ret) 1960 + mlog_errno(ret); 1961 + ret = ocfs2_journal_dirty(handle, path_leaf_bh(right_path)); 1962 + if (ret) 1963 + mlog_errno(ret); 1964 + 1965 + if (del_right_subtree) { 1966 + ocfs2_unlink_subtree(inode, handle, left_path, right_path, 1967 + subtree_index, dealloc); 1968 + ocfs2_update_edge_lengths(inode, handle, left_path); 1969 + 1970 + eb = (struct ocfs2_extent_block *)path_leaf_bh(left_path)->b_data; 1971 + di->i_last_eb_blk = eb->h_blkno; 1972 + 1973 + /* 1974 + * Removal of the extent in the left leaf was skipped 1975 + * above so we could delete the right path 1976 + * 1st. 1977 + */ 1978 + if (right_has_empty) 1979 + ocfs2_remove_empty_extent(left_leaf_el); 1980 + 1981 + ret = ocfs2_journal_dirty(handle, di_bh); 1982 + if (ret) 1983 + mlog_errno(ret); 1984 + 1985 + *deleted = 1; 1986 + } else 1987 + ocfs2_complete_edge_insert(inode, handle, left_path, right_path, 1988 + subtree_index); 1989 + 1990 + out: 1991 + return ret; 1992 + } 1993 + 1994 + /* 1995 + * Given a full path, determine what cpos value would return us a path 1996 + * containing the leaf immediately to the right of the current one. 1997 + * 1998 + * Will return zero if the path passed in is already the rightmost path. 1999 + * 2000 + * This looks similar, but is subtly different to 2001 + * ocfs2_find_cpos_for_left_leaf(). 
2002 + */ 2003 + static int ocfs2_find_cpos_for_right_leaf(struct super_block *sb, 2004 + struct ocfs2_path *path, u32 *cpos) 2005 + { 2006 + int i, j, ret = 0; 2007 + u64 blkno; 2008 + struct ocfs2_extent_list *el; 2009 + 2010 + *cpos = 0; 2011 + 2012 + if (path->p_tree_depth == 0) 2013 + return 0; 2014 + 2015 + blkno = path_leaf_bh(path)->b_blocknr; 2016 + 2017 + /* Start at the tree node just above the leaf and work our way up. */ 2018 + i = path->p_tree_depth - 1; 2019 + while (i >= 0) { 2020 + int next_free; 2021 + 2022 + el = path->p_node[i].el; 2023 + 2024 + /* 2025 + * Find the extent record just after the one in our 2026 + * path. 2027 + */ 2028 + next_free = le16_to_cpu(el->l_next_free_rec); 2029 + for(j = 0; j < le16_to_cpu(el->l_next_free_rec); j++) { 2030 + if (le64_to_cpu(el->l_recs[j].e_blkno) == blkno) { 2031 + if (j == (next_free - 1)) { 2032 + if (i == 0) { 2033 + /* 2034 + * We've determined that the 2035 + * path specified is already 2036 + * the rightmost one - return a 2037 + * cpos of zero. 2038 + */ 2039 + goto out; 2040 + } 2041 + /* 2042 + * The rightmost record points to our 2043 + * leaf - we need to travel up the 2044 + * tree one level. 2045 + */ 2046 + goto next_node; 2047 + } 2048 + 2049 + *cpos = le32_to_cpu(el->l_recs[j + 1].e_cpos); 2050 + goto out; 2051 + } 2052 + } 2053 + 2054 + /* 2055 + * If we got here, we never found a valid node where 2056 + * the tree indicated one should be. 
2057 + */ 2058 + ocfs2_error(sb, 2059 + "Invalid extent tree at extent block %llu\n", 2060 + (unsigned long long)blkno); 2061 + ret = -EROFS; 2062 + goto out; 2063 + 2064 + next_node: 2065 + blkno = path->p_node[i].bh->b_blocknr; 2066 + i--; 2067 + } 2068 + 2069 + out: 2070 + return ret; 2071 + } 2072 + 2073 + static int ocfs2_rotate_rightmost_leaf_left(struct inode *inode, 2074 + handle_t *handle, 2075 + struct buffer_head *bh, 2076 + struct ocfs2_extent_list *el) 2077 + { 2078 + int ret; 2079 + 2080 + if (!ocfs2_is_empty_extent(&el->l_recs[0])) 2081 + return 0; 2082 + 2083 + ret = ocfs2_journal_access(handle, inode, bh, 2084 + OCFS2_JOURNAL_ACCESS_WRITE); 2085 + if (ret) { 2086 + mlog_errno(ret); 2087 + goto out; 2088 + } 2089 + 2090 + ocfs2_remove_empty_extent(el); 2091 + 2092 + ret = ocfs2_journal_dirty(handle, bh); 2093 + if (ret) 2094 + mlog_errno(ret); 2095 + 2096 + out: 2097 + return ret; 2098 + } 2099 + 2100 + static int __ocfs2_rotate_tree_left(struct inode *inode, 2101 + handle_t *handle, int orig_credits, 2102 + struct ocfs2_path *path, 2103 + struct ocfs2_cached_dealloc_ctxt *dealloc, 2104 + struct ocfs2_path **empty_extent_path) 2105 + { 2106 + int ret, subtree_root, deleted; 2107 + u32 right_cpos; 2108 + struct ocfs2_path *left_path = NULL; 2109 + struct ocfs2_path *right_path = NULL; 2110 + 2111 + BUG_ON(!ocfs2_is_empty_extent(&(path_leaf_el(path)->l_recs[0]))); 2112 + 2113 + *empty_extent_path = NULL; 2114 + 2115 + ret = ocfs2_find_cpos_for_right_leaf(inode->i_sb, path, 2116 + &right_cpos); 2117 + if (ret) { 2118 + mlog_errno(ret); 2119 + goto out; 2120 + } 2121 + 2122 + left_path = ocfs2_new_path(path_root_bh(path), 2123 + path_root_el(path)); 2124 + if (!left_path) { 2125 + ret = -ENOMEM; 2126 + mlog_errno(ret); 2127 + goto out; 2128 + } 2129 + 2130 + ocfs2_cp_path(left_path, path); 2131 + 2132 + right_path = ocfs2_new_path(path_root_bh(path), 2133 + path_root_el(path)); 2134 + if (!right_path) { 2135 + ret = -ENOMEM; 2136 + mlog_errno(ret); 2137 
+ goto out; 2138 + } 2139 + 2140 + while (right_cpos) { 2141 + ret = ocfs2_find_path(inode, right_path, right_cpos); 2142 + if (ret) { 2143 + mlog_errno(ret); 2144 + goto out; 2145 + } 2146 + 2147 + subtree_root = ocfs2_find_subtree_root(inode, left_path, 2148 + right_path); 2149 + 2150 + mlog(0, "Subtree root at index %d (blk %llu, depth %d)\n", 2151 + subtree_root, 2152 + (unsigned long long) 2153 + right_path->p_node[subtree_root].bh->b_blocknr, 2154 + right_path->p_tree_depth); 2155 + 2156 + ret = ocfs2_extend_rotate_transaction(handle, subtree_root, 2157 + orig_credits, left_path); 2158 + if (ret) { 2159 + mlog_errno(ret); 2160 + goto out; 2161 + } 2162 + 2163 + ret = ocfs2_rotate_subtree_left(inode, handle, left_path, 2164 + right_path, subtree_root, 2165 + dealloc, &deleted); 2166 + if (ret == -EAGAIN) { 2167 + /* 2168 + * The rotation has to temporarily stop due to 2169 + * the right subtree having an empty 2170 + * extent. Pass it back to the caller for a 2171 + * fixup. 2172 + */ 2173 + *empty_extent_path = right_path; 2174 + right_path = NULL; 2175 + goto out; 2176 + } 2177 + if (ret) { 2178 + mlog_errno(ret); 2179 + goto out; 2180 + } 2181 + 2182 + /* 2183 + * The subtree rotate might have removed records on 2184 + * the rightmost edge. If so, then rotation is 2185 + * complete. 
2186 + */ 2187 + if (deleted) 2188 + break; 2189 + 2190 + ocfs2_mv_path(left_path, right_path); 2191 + 2192 + ret = ocfs2_find_cpos_for_right_leaf(inode->i_sb, left_path, 2193 + &right_cpos); 2194 + if (ret) { 2195 + mlog_errno(ret); 2196 + goto out; 2197 + } 2198 + } 2199 + 2200 + out: 2201 + ocfs2_free_path(right_path); 2202 + ocfs2_free_path(left_path); 2203 + 2204 + return ret; 2205 + } 2206 + 2207 + static int ocfs2_remove_rightmost_path(struct inode *inode, handle_t *handle, 2208 + struct ocfs2_path *path, 2209 + struct ocfs2_cached_dealloc_ctxt *dealloc) 2210 + { 2211 + int ret, subtree_index; 2212 + u32 cpos; 2213 + struct ocfs2_path *left_path = NULL; 2214 + struct ocfs2_dinode *di; 2215 + struct ocfs2_extent_block *eb; 2216 + struct ocfs2_extent_list *el; 2217 + 2218 + /* 2219 + * XXX: This code assumes that the root is an inode, which is 2220 + * true for now but may change as tree code gets generic. 2221 + */ 2222 + di = (struct ocfs2_dinode *)path_root_bh(path)->b_data; 2223 + if (!OCFS2_IS_VALID_DINODE(di)) { 2224 + ret = -EIO; 2225 + ocfs2_error(inode->i_sb, 2226 + "Inode %llu has invalid path root", 2227 + (unsigned long long)OCFS2_I(inode)->ip_blkno); 2228 + goto out; 2229 + } 2230 + 2231 + /* 2232 + * There's two ways we handle this depending on 2233 + * whether path is the only existing one. 2234 + */ 2235 + ret = ocfs2_extend_rotate_transaction(handle, 0, 2236 + handle->h_buffer_credits, 2237 + path); 2238 + if (ret) { 2239 + mlog_errno(ret); 2240 + goto out; 2241 + } 2242 + 2243 + ret = ocfs2_journal_access_path(inode, handle, path); 2244 + if (ret) { 2245 + mlog_errno(ret); 2246 + goto out; 2247 + } 2248 + 2249 + ret = ocfs2_find_cpos_for_left_leaf(inode->i_sb, path, &cpos); 2250 + if (ret) { 2251 + mlog_errno(ret); 2252 + goto out; 2253 + } 2254 + 2255 + if (cpos) { 2256 + /* 2257 + * We have a path to the left of this one - it needs 2258 + * an update too. 
2259 + */ 2260 + left_path = ocfs2_new_path(path_root_bh(path), 2261 + path_root_el(path)); 2262 + if (!left_path) { 2263 + ret = -ENOMEM; 2264 + mlog_errno(ret); 2265 + goto out; 2266 + } 2267 + 2268 + ret = ocfs2_find_path(inode, left_path, cpos); 2269 + if (ret) { 2270 + mlog_errno(ret); 2271 + goto out; 2272 + } 2273 + 2274 + ret = ocfs2_journal_access_path(inode, handle, left_path); 2275 + if (ret) { 2276 + mlog_errno(ret); 2277 + goto out; 2278 + } 2279 + 2280 + subtree_index = ocfs2_find_subtree_root(inode, left_path, path); 2281 + 2282 + ocfs2_unlink_subtree(inode, handle, left_path, path, 2283 + subtree_index, dealloc); 2284 + ocfs2_update_edge_lengths(inode, handle, left_path); 2285 + 2286 + eb = (struct ocfs2_extent_block *)path_leaf_bh(left_path)->b_data; 2287 + di->i_last_eb_blk = eb->h_blkno; 2288 + } else { 2289 + /* 2290 + * 'path' is also the leftmost path which 2291 + * means it must be the only one. This gets 2292 + * handled differently because we want to 2293 + * revert the inode back to having extents 2294 + * in-line. 2295 + */ 2296 + ocfs2_unlink_path(inode, handle, dealloc, path, 1); 2297 + 2298 + el = &di->id2.i_list; 2299 + el->l_tree_depth = 0; 2300 + el->l_next_free_rec = 0; 2301 + memset(&el->l_recs[0], 0, sizeof(struct ocfs2_extent_rec)); 2302 + 2303 + di->i_last_eb_blk = 0; 2304 + } 2305 + 2306 + ocfs2_journal_dirty(handle, path_root_bh(path)); 2307 + 2308 + out: 2309 + ocfs2_free_path(left_path); 2310 + return ret; 2311 + } 2312 + 2313 + /* 2314 + * Left rotation of btree records. 2315 + * 2316 + * In many ways, this is (unsurprisingly) the opposite of right 2317 + * rotation. We start at some non-rightmost path containing an empty 2318 + * extent in the leaf block. The code works its way to the rightmost 2319 + * path by rotating records to the left in every subtree. 2320 + * 2321 + * This is used by any code which reduces the number of extent records 2322 + * in a leaf. 
After removal, an empty record should be placed in the 2323 + * leftmost list position. 2324 + * 2325 + * This won't handle a length update of the rightmost path records if 2326 + * the rightmost tree leaf record is removed so the caller is 2327 + * responsible for detecting and correcting that. 2328 + */ 2329 + static int ocfs2_rotate_tree_left(struct inode *inode, handle_t *handle, 2330 + struct ocfs2_path *path, 2331 + struct ocfs2_cached_dealloc_ctxt *dealloc) 2332 + { 2333 + int ret, orig_credits = handle->h_buffer_credits; 2334 + struct ocfs2_path *tmp_path = NULL, *restart_path = NULL; 2335 + struct ocfs2_extent_block *eb; 2336 + struct ocfs2_extent_list *el; 2337 + 2338 + el = path_leaf_el(path); 2339 + if (!ocfs2_is_empty_extent(&el->l_recs[0])) 2340 + return 0; 2341 + 2342 + if (path->p_tree_depth == 0) { 2343 + rightmost_no_delete: 2344 + /* 2345 + * In-inode extents. This is trivially handled, so do 2346 + * it up front. 2347 + */ 2348 + ret = ocfs2_rotate_rightmost_leaf_left(inode, handle, 2349 + path_leaf_bh(path), 2350 + path_leaf_el(path)); 2351 + if (ret) 2352 + mlog_errno(ret); 2353 + goto out; 2354 + } 2355 + 2356 + /* 2357 + * Handle rightmost branch now. There's several cases: 2358 + * 1) simple rotation leaving records in there. That's trivial. 2359 + * 2) rotation requiring a branch delete - there's no more 2360 + * records left. Two cases of this: 2361 + * a) There are branches to the left. 2362 + * b) This is also the leftmost (the only) branch. 2363 + * 2364 + * 1) is handled via ocfs2_rotate_rightmost_leaf_left() 2365 + * 2a) we need the left branch so that we can update it with the unlink 2366 + * 2b) we need to bring the inode back to inline extents. 2367 + */ 2368 + 2369 + eb = (struct ocfs2_extent_block *)path_leaf_bh(path)->b_data; 2370 + el = &eb->h_list; 2371 + if (eb->h_next_leaf_blk == 0) { 2372 + /* 2373 + * This gets a bit tricky if we're going to delete the 2374 + * rightmost path. 
Get the other cases out of the way 2375 + * 1st. 2376 + */ 2377 + if (le16_to_cpu(el->l_next_free_rec) > 1) 2378 + goto rightmost_no_delete; 2379 + 2380 + if (le16_to_cpu(el->l_next_free_rec) == 0) { 2381 + ret = -EIO; 2382 + ocfs2_error(inode->i_sb, 2383 + "Inode %llu has empty extent block at %llu", 2384 + (unsigned long long)OCFS2_I(inode)->ip_blkno, 2385 + (unsigned long long)le64_to_cpu(eb->h_blkno)); 2386 + goto out; 2387 + } 2388 + 2389 + /* 2390 + * XXX: The caller can not trust "path" any more after 2391 + * this as it will have been deleted. What do we do? 2392 + * 2393 + * In theory the rotate-for-merge code will never get 2394 + * here because it'll always ask for a rotate in a 2395 + * nonempty list. 2396 + */ 2397 + 2398 + ret = ocfs2_remove_rightmost_path(inode, handle, path, 2399 + dealloc); 2400 + if (ret) 2401 + mlog_errno(ret); 2402 + goto out; 2403 + } 2404 + 2405 + /* 2406 + * Now we can loop, remembering the path we get from -EAGAIN 2407 + * and restarting from there. 
2408 + */ 2409 + try_rotate: 2410 + ret = __ocfs2_rotate_tree_left(inode, handle, orig_credits, path, 2411 + dealloc, &restart_path); 2412 + if (ret && ret != -EAGAIN) { 2413 + mlog_errno(ret); 2414 + goto out; 2415 + } 2416 + 2417 + while (ret == -EAGAIN) { 2418 + tmp_path = restart_path; 2419 + restart_path = NULL; 2420 + 2421 + ret = __ocfs2_rotate_tree_left(inode, handle, orig_credits, 2422 + tmp_path, dealloc, 2423 + &restart_path); 2424 + if (ret && ret != -EAGAIN) { 2425 + mlog_errno(ret); 2426 + goto out; 2427 + } 2428 + 2429 + ocfs2_free_path(tmp_path); 2430 + tmp_path = NULL; 2431 + 2432 + if (ret == 0) 2433 + goto try_rotate; 2434 + } 2435 + 2436 + out: 2437 + ocfs2_free_path(tmp_path); 2438 + ocfs2_free_path(restart_path); 2439 + return ret; 2440 + } 2441 + 2442 + static void ocfs2_cleanup_merge(struct ocfs2_extent_list *el, 2443 + int index) 2444 + { 2445 + struct ocfs2_extent_rec *rec = &el->l_recs[index]; 2446 + unsigned int size; 2447 + 2448 + if (rec->e_leaf_clusters == 0) { 2449 + /* 2450 + * We consumed all of the merged-from record. An empty 2451 + * extent cannot exist anywhere but the 1st array 2452 + * position, so move things over if the merged-from 2453 + * record doesn't occupy that position. 2454 + * 2455 + * This creates a new empty extent so the caller 2456 + * should be smart enough to have removed any existing 2457 + * ones. 2458 + */ 2459 + if (index > 0) { 2460 + BUG_ON(ocfs2_is_empty_extent(&el->l_recs[0])); 2461 + size = index * sizeof(struct ocfs2_extent_rec); 2462 + memmove(&el->l_recs[1], &el->l_recs[0], size); 2463 + } 2464 + 2465 + /* 2466 + * Always memset - the caller doesn't check whether it 2467 + * created an empty extent, so there could be junk in 2468 + * the other fields. 2469 + */ 2470 + memset(&el->l_recs[0], 0, sizeof(struct ocfs2_extent_rec)); 2471 + } 2472 + } 2473 + 2474 + /* 2475 + * Remove split_rec clusters from the record at index and merge them 2476 + * onto the beginning of the record at index + 1. 
2477 + */ 2478 + static int ocfs2_merge_rec_right(struct inode *inode, struct buffer_head *bh, 2479 + handle_t *handle, 2480 + struct ocfs2_extent_rec *split_rec, 2481 + struct ocfs2_extent_list *el, int index) 2482 + { 2483 + int ret; 2484 + unsigned int split_clusters = le16_to_cpu(split_rec->e_leaf_clusters); 2485 + struct ocfs2_extent_rec *left_rec; 2486 + struct ocfs2_extent_rec *right_rec; 2487 + 2488 + BUG_ON(index >= le16_to_cpu(el->l_next_free_rec)); 2489 + 2490 + left_rec = &el->l_recs[index]; 2491 + right_rec = &el->l_recs[index + 1]; 2492 + 2493 + ret = ocfs2_journal_access(handle, inode, bh, 2494 + OCFS2_JOURNAL_ACCESS_WRITE); 2495 + if (ret) { 2496 + mlog_errno(ret); 2497 + goto out; 2498 + } 2499 + 2500 + le16_add_cpu(&left_rec->e_leaf_clusters, -split_clusters); 2501 + 2502 + le32_add_cpu(&right_rec->e_cpos, -split_clusters); 2503 + le64_add_cpu(&right_rec->e_blkno, 2504 + -ocfs2_clusters_to_blocks(inode->i_sb, split_clusters)); 2505 + le16_add_cpu(&right_rec->e_leaf_clusters, split_clusters); 2506 + 2507 + ocfs2_cleanup_merge(el, index); 2508 + 2509 + ret = ocfs2_journal_dirty(handle, bh); 2510 + if (ret) 2511 + mlog_errno(ret); 2512 + 2513 + out: 2514 + return ret; 2515 + } 2516 + 2517 + /* 2518 + * Remove split_rec clusters from the record at index and merge them 2519 + * onto the tail of the record at index - 1. 
2520 + */ 2521 + static int ocfs2_merge_rec_left(struct inode *inode, struct buffer_head *bh, 2522 + handle_t *handle, 2523 + struct ocfs2_extent_rec *split_rec, 2524 + struct ocfs2_extent_list *el, int index) 2525 + { 2526 + int ret, has_empty_extent = 0; 2527 + unsigned int split_clusters = le16_to_cpu(split_rec->e_leaf_clusters); 2528 + struct ocfs2_extent_rec *left_rec; 2529 + struct ocfs2_extent_rec *right_rec; 2530 + 2531 + BUG_ON(index <= 0); 2532 + 2533 + left_rec = &el->l_recs[index - 1]; 2534 + right_rec = &el->l_recs[index]; 2535 + if (ocfs2_is_empty_extent(&el->l_recs[0])) 2536 + has_empty_extent = 1; 2537 + 2538 + ret = ocfs2_journal_access(handle, inode, bh, 2539 + OCFS2_JOURNAL_ACCESS_WRITE); 2540 + if (ret) { 2541 + mlog_errno(ret); 2542 + goto out; 2543 + } 2544 + 2545 + if (has_empty_extent && index == 1) { 2546 + /* 2547 + * The easy case - we can just plop the record right in. 2548 + */ 2549 + *left_rec = *split_rec; 2550 + 2551 + has_empty_extent = 0; 2552 + } else { 2553 + le16_add_cpu(&left_rec->e_leaf_clusters, split_clusters); 2554 + } 2555 + 2556 + le32_add_cpu(&right_rec->e_cpos, split_clusters); 2557 + le64_add_cpu(&right_rec->e_blkno, 2558 + ocfs2_clusters_to_blocks(inode->i_sb, split_clusters)); 2559 + le16_add_cpu(&right_rec->e_leaf_clusters, -split_clusters); 2560 + 2561 + ocfs2_cleanup_merge(el, index); 2562 + 2563 + ret = ocfs2_journal_dirty(handle, bh); 2564 + if (ret) 2565 + mlog_errno(ret); 2566 + 2567 + out: 2568 + return ret; 2569 + } 2570 + 2571 + static int ocfs2_try_to_merge_extent(struct inode *inode, 2572 + handle_t *handle, 2573 + struct ocfs2_path *left_path, 2574 + int split_index, 2575 + struct ocfs2_extent_rec *split_rec, 2576 + struct ocfs2_cached_dealloc_ctxt *dealloc, 2577 + struct ocfs2_merge_ctxt *ctxt) 2578 + 2579 + { 2580 + int ret = 0, delete_tail_recs = 0; 2581 + struct ocfs2_extent_list *el = path_leaf_el(left_path); 2582 + struct ocfs2_extent_rec *rec = &el->l_recs[split_index]; 2583 + 2584 + 
BUG_ON(ctxt->c_contig_type == CONTIG_NONE); 2585 + 2586 + if (ctxt->c_split_covers_rec) { 2587 + delete_tail_recs++; 2588 + 2589 + if (ctxt->c_contig_type == CONTIG_LEFTRIGHT || 2590 + ctxt->c_has_empty_extent) 2591 + delete_tail_recs++; 2592 + 2593 + if (ctxt->c_has_empty_extent) { 2594 + /* 2595 + * The merge code will need to create an empty 2596 + * extent to take the place of the newly 2597 + * emptied slot. Remove any pre-existing empty 2598 + * extents - having more than one in a leaf is 2599 + * illegal. 2600 + */ 2601 + ret = ocfs2_rotate_tree_left(inode, handle, left_path, 2602 + dealloc); 2603 + if (ret) { 2604 + mlog_errno(ret); 2605 + goto out; 2606 + } 2607 + split_index--; 2608 + rec = &el->l_recs[split_index]; 2609 + } 2610 + } 2611 + 2612 + if (ctxt->c_contig_type == CONTIG_LEFTRIGHT) { 2613 + /* 2614 + * Left-right contig implies this. 2615 + */ 2616 + BUG_ON(!ctxt->c_split_covers_rec); 2617 + BUG_ON(split_index == 0); 2618 + 2619 + /* 2620 + * Since the leftright insert always covers the entire 2621 + * extent, this call will delete the insert record 2622 + * entirely, resulting in an empty extent record added to 2623 + * the extent block. 2624 + * 2625 + * Since the adding of an empty extent shifts 2626 + * everything back to the right, there's no need to 2627 + * update split_index here. 2628 + */ 2629 + ret = ocfs2_merge_rec_left(inode, path_leaf_bh(left_path), 2630 + handle, split_rec, el, split_index); 2631 + if (ret) { 2632 + mlog_errno(ret); 2633 + goto out; 2634 + } 2635 + 2636 + /* 2637 + * We can only get this from logic error above. 2638 + */ 2639 + BUG_ON(!ocfs2_is_empty_extent(&el->l_recs[0])); 2640 + 2641 + /* 2642 + * The left merge left us with an empty extent, remove 2643 + * it. 
2644 + */ 2645 + ret = ocfs2_rotate_tree_left(inode, handle, left_path, dealloc); 2646 + if (ret) { 2647 + mlog_errno(ret); 2648 + goto out; 2649 + } 2650 + split_index--; 2651 + rec = &el->l_recs[split_index]; 2652 + 2653 + /* 2654 + * Note that we don't pass split_rec here on purpose - 2655 + * we've merged it into the left side. 2656 + */ 2657 + ret = ocfs2_merge_rec_right(inode, path_leaf_bh(left_path), 2658 + handle, rec, el, split_index); 2659 + if (ret) { 2660 + mlog_errno(ret); 2661 + goto out; 2662 + } 2663 + 2664 + BUG_ON(!ocfs2_is_empty_extent(&el->l_recs[0])); 2665 + 2666 + ret = ocfs2_rotate_tree_left(inode, handle, left_path, 2667 + dealloc); 2668 + /* 2669 + * Error from this last rotate is not critical, so 2670 + * print but don't bubble it up. 2671 + */ 2672 + if (ret) 2673 + mlog_errno(ret); 2674 + ret = 0; 2675 + } else { 2676 + /* 2677 + * Merge a record to the left or right. 2678 + * 2679 + * 'contig_type' is relative to the existing record, 2680 + * so for example, if we're "right contig", it's to 2681 + * the record on the left (hence the left merge). 2682 + */ 2683 + if (ctxt->c_contig_type == CONTIG_RIGHT) { 2684 + ret = ocfs2_merge_rec_left(inode, 2685 + path_leaf_bh(left_path), 2686 + handle, split_rec, el, 2687 + split_index); 2688 + if (ret) { 2689 + mlog_errno(ret); 2690 + goto out; 2691 + } 2692 + } else { 2693 + ret = ocfs2_merge_rec_right(inode, 2694 + path_leaf_bh(left_path), 2695 + handle, split_rec, el, 2696 + split_index); 2697 + if (ret) { 2698 + mlog_errno(ret); 2699 + goto out; 2700 + } 2701 + } 2702 + 2703 + if (ctxt->c_split_covers_rec) { 2704 + /* 2705 + * The merge may have left an empty extent in 2706 + * our leaf. Try to rotate it away. 
2707 + */ 2708 + ret = ocfs2_rotate_tree_left(inode, handle, left_path, 2709 + dealloc); 2710 + if (ret) 2711 + mlog_errno(ret); 2712 + ret = 0; 2713 + } 2714 + } 2715 + 2716 + out: 2717 + return ret; 2718 + } 2719 + 2720 + static void ocfs2_subtract_from_rec(struct super_block *sb, 2721 + enum ocfs2_split_type split, 2722 + struct ocfs2_extent_rec *rec, 2723 + struct ocfs2_extent_rec *split_rec) 2724 + { 2725 + u64 len_blocks; 2726 + 2727 + len_blocks = ocfs2_clusters_to_blocks(sb, 2728 + le16_to_cpu(split_rec->e_leaf_clusters)); 2729 + 2730 + if (split == SPLIT_LEFT) { 2731 + /* 2732 + * Region is on the left edge of the existing 2733 + * record. 2734 + */ 2735 + le32_add_cpu(&rec->e_cpos, 2736 + le16_to_cpu(split_rec->e_leaf_clusters)); 2737 + le64_add_cpu(&rec->e_blkno, len_blocks); 2738 + le16_add_cpu(&rec->e_leaf_clusters, 2739 + -le16_to_cpu(split_rec->e_leaf_clusters)); 2740 + } else { 2741 + /* 2742 + * Region is on the right edge of the existing 2743 + * record. 2744 + */ 2745 + le16_add_cpu(&rec->e_leaf_clusters, 2746 + -le16_to_cpu(split_rec->e_leaf_clusters)); 2747 + } 2748 + } 2749 + 1956 2750 /* 1957 2751 * Do the final bits of extent record insertion at the target leaf 1958 2752 * list. If this leaf is part of an allocation tree, it is assumed ··· 2993 1737 struct ocfs2_extent_rec *rec; 2994 1738 2995 1739 BUG_ON(le16_to_cpu(el->l_tree_depth) != 0); 1740 + 1741 + if (insert->ins_split != SPLIT_NONE) { 1742 + i = ocfs2_search_extent_list(el, le32_to_cpu(insert_rec->e_cpos)); 1743 + BUG_ON(i == -1); 1744 + rec = &el->l_recs[i]; 1745 + ocfs2_subtract_from_rec(inode->i_sb, insert->ins_split, rec, 1746 + insert_rec); 1747 + goto rotate; 1748 + } 2996 1749 2997 1750 /* 2998 1751 * Contiguous insert - either left or right. ··· 3057 1792 return; 3058 1793 } 3059 1794 1795 + rotate: 3060 1796 /* 3061 1797 * Ok, we have to rotate. 
3062 1798 * ··· 3081 1815 spin_unlock(&OCFS2_I(inode)->ip_lock); 3082 1816 } 3083 1817 1818 + static void ocfs2_adjust_rightmost_records(struct inode *inode, 1819 + handle_t *handle, 1820 + struct ocfs2_path *path, 1821 + struct ocfs2_extent_rec *insert_rec) 1822 + { 1823 + int ret, i, next_free; 1824 + struct buffer_head *bh; 1825 + struct ocfs2_extent_list *el; 1826 + struct ocfs2_extent_rec *rec; 1827 + 1828 + /* 1829 + * Update everything except the leaf block. 1830 + */ 1831 + for (i = 0; i < path->p_tree_depth; i++) { 1832 + bh = path->p_node[i].bh; 1833 + el = path->p_node[i].el; 1834 + 1835 + next_free = le16_to_cpu(el->l_next_free_rec); 1836 + if (next_free == 0) { 1837 + ocfs2_error(inode->i_sb, 1838 + "Dinode %llu has a bad extent list", 1839 + (unsigned long long)OCFS2_I(inode)->ip_blkno); 1840 + ret = -EIO; 1841 + return; 1842 + } 1843 + 1844 + rec = &el->l_recs[next_free - 1]; 1845 + 1846 + rec->e_int_clusters = insert_rec->e_cpos; 1847 + le32_add_cpu(&rec->e_int_clusters, 1848 + le16_to_cpu(insert_rec->e_leaf_clusters)); 1849 + le32_add_cpu(&rec->e_int_clusters, 1850 + -le32_to_cpu(rec->e_cpos)); 1851 + 1852 + ret = ocfs2_journal_dirty(handle, bh); 1853 + if (ret) 1854 + mlog_errno(ret); 1855 + 1856 + } 1857 + } 1858 + 3084 1859 static int ocfs2_append_rec_to_path(struct inode *inode, handle_t *handle, 3085 1860 struct ocfs2_extent_rec *insert_rec, 3086 1861 struct ocfs2_path *right_path, 3087 1862 struct ocfs2_path **ret_left_path) 3088 1863 { 3089 - int ret, i, next_free; 3090 - struct buffer_head *bh; 1864 + int ret, next_free; 3091 1865 struct ocfs2_extent_list *el; 3092 1866 struct ocfs2_path *left_path = NULL; 3093 1867 ··· 3193 1887 goto out; 3194 1888 } 3195 1889 3196 - el = path_root_el(right_path); 3197 - bh = path_root_bh(right_path); 3198 - i = 0; 3199 - while (1) { 3200 - struct ocfs2_extent_rec *rec; 3201 - 3202 - next_free = le16_to_cpu(el->l_next_free_rec); 3203 - if (next_free == 0) { 3204 - ocfs2_error(inode->i_sb, 3205 - "Dinode 
%llu has a bad extent list", 3206 - (unsigned long long)OCFS2_I(inode)->ip_blkno); 3207 - ret = -EIO; 3208 - goto out; 3209 - } 3210 - 3211 - rec = &el->l_recs[next_free - 1]; 3212 - 3213 - rec->e_int_clusters = insert_rec->e_cpos; 3214 - le32_add_cpu(&rec->e_int_clusters, 3215 - le16_to_cpu(insert_rec->e_leaf_clusters)); 3216 - le32_add_cpu(&rec->e_int_clusters, 3217 - -le32_to_cpu(rec->e_cpos)); 3218 - 3219 - ret = ocfs2_journal_dirty(handle, bh); 3220 - if (ret) 3221 - mlog_errno(ret); 3222 - 3223 - /* Don't touch the leaf node */ 3224 - if (++i >= right_path->p_tree_depth) 3225 - break; 3226 - 3227 - bh = right_path->p_node[i].bh; 3228 - el = right_path->p_node[i].el; 3229 - } 1890 + ocfs2_adjust_rightmost_records(inode, handle, right_path, insert_rec); 3230 1891 3231 1892 *ret_left_path = left_path; 3232 1893 ret = 0; ··· 3202 1929 ocfs2_free_path(left_path); 3203 1930 3204 1931 return ret; 1932 + } 1933 + 1934 + static void ocfs2_split_record(struct inode *inode, 1935 + struct ocfs2_path *left_path, 1936 + struct ocfs2_path *right_path, 1937 + struct ocfs2_extent_rec *split_rec, 1938 + enum ocfs2_split_type split) 1939 + { 1940 + int index; 1941 + u32 cpos = le32_to_cpu(split_rec->e_cpos); 1942 + struct ocfs2_extent_list *left_el = NULL, *right_el, *insert_el, *el; 1943 + struct ocfs2_extent_rec *rec, *tmprec; 1944 + 1945 + right_el = path_leaf_el(right_path); 1946 + if (left_path) 1947 + left_el = path_leaf_el(left_path); 1948 + 1949 + el = right_el; 1950 + insert_el = right_el; 1951 + index = ocfs2_search_extent_list(el, cpos); 1952 + if (index != -1) { 1953 + if (index == 0 && left_path) { 1954 + BUG_ON(ocfs2_is_empty_extent(&el->l_recs[0])); 1955 + 1956 + /* 1957 + * This typically means that the record 1958 + * started in the left path but moved to the 1959 + * right as a result of rotation. We either 1960 + * move the existing record to the left, or we 1961 + * do the later insert there.
1962 + * 1963 + * In this case, the left path should always 1964 + * exist as the rotate code will have passed 1965 + * it back for a post-insert update. 1966 + */ 1967 + 1968 + if (split == SPLIT_LEFT) { 1969 + /* 1970 + * It's a left split. Since we know 1971 + * that the rotate code gave us an 1972 + * empty extent in the left path, we 1973 + * can just do the insert there. 1974 + */ 1975 + insert_el = left_el; 1976 + } else { 1977 + /* 1978 + * Right split - we have to move the 1979 + * existing record over to the left 1980 + * leaf. The insert will be into the 1981 + * newly created empty extent in the 1982 + * right leaf. 1983 + */ 1984 + tmprec = &right_el->l_recs[index]; 1985 + ocfs2_rotate_leaf(left_el, tmprec); 1986 + el = left_el; 1987 + 1988 + memset(tmprec, 0, sizeof(*tmprec)); 1989 + index = ocfs2_search_extent_list(left_el, cpos); 1990 + BUG_ON(index == -1); 1991 + } 1992 + } 1993 + } else { 1994 + BUG_ON(!left_path); 1995 + BUG_ON(!ocfs2_is_empty_extent(&left_el->l_recs[0])); 1996 + /* 1997 + * Left path is easy - we can just allow the insert to 1998 + * happen. 1999 + */ 2000 + el = left_el; 2001 + insert_el = left_el; 2002 + index = ocfs2_search_extent_list(el, cpos); 2003 + BUG_ON(index == -1); 2004 + } 2005 + 2006 + rec = &el->l_recs[index]; 2007 + ocfs2_subtract_from_rec(inode->i_sb, split, rec, split_rec); 2008 + ocfs2_rotate_leaf(insert_el, split_rec); 3205 2009 } 3206 2010 3207 2011 /* ··· 3298 1948 { 3299 1949 int ret, subtree_index; 3300 1950 struct buffer_head *leaf_bh = path_leaf_bh(right_path); 3301 - struct ocfs2_extent_list *el; 3302 1951 3303 1952 /* 3304 1953 * Pass both paths to the journal. The majority of inserts ··· 3333 1984 } 3334 1985 } 3335 1986 3336 - el = path_leaf_el(right_path); 1987 + if (insert->ins_split != SPLIT_NONE) { 1988 + /* 1989 + * We could call ocfs2_insert_at_leaf() for some types 1990 + * of splits, but it's easier to just let one separate 1991 + * function sort it all out.
1992 + */ 1993 + ocfs2_split_record(inode, left_path, right_path, 1994 + insert_rec, insert->ins_split); 1995 + } else 1996 + ocfs2_insert_at_leaf(insert_rec, path_leaf_el(right_path), 1997 + insert, inode); 3337 1998 3338 - ocfs2_insert_at_leaf(insert_rec, el, insert, inode); 3339 1999 ret = ocfs2_journal_dirty(handle, leaf_bh); 3340 2000 if (ret) 3341 2001 mlog_errno(ret); ··· 3433 2075 * can wind up skipping both of these two special cases... 3434 2076 */ 3435 2077 if (rotate) { 3436 - ret = ocfs2_rotate_tree_right(inode, handle, 2078 + ret = ocfs2_rotate_tree_right(inode, handle, type->ins_split, 3437 2079 le32_to_cpu(insert_rec->e_cpos), 3438 2080 right_path, &left_path); 3439 2081 if (ret) { ··· 3458 2100 } 3459 2101 3460 2102 out_update_clusters: 3461 - ocfs2_update_dinode_clusters(inode, di, 3462 - le16_to_cpu(insert_rec->e_leaf_clusters)); 2103 + if (type->ins_split == SPLIT_NONE) 2104 + ocfs2_update_dinode_clusters(inode, di, 2105 + le16_to_cpu(insert_rec->e_leaf_clusters)); 3463 2106 3464 2107 ret = ocfs2_journal_dirty(handle, di_bh); 3465 2108 if (ret) ··· 3469 2110 out: 3470 2111 ocfs2_free_path(left_path); 3471 2112 ocfs2_free_path(right_path); 2113 + 2114 + return ret; 2115 + } 2116 + 2117 + static enum ocfs2_contig_type 2118 + ocfs2_figure_merge_contig_type(struct inode *inode, 2119 + struct ocfs2_extent_list *el, int index, 2120 + struct ocfs2_extent_rec *split_rec) 2121 + { 2122 + struct ocfs2_extent_rec *rec; 2123 + enum ocfs2_contig_type ret = CONTIG_NONE; 2124 + 2125 + /* 2126 + * We're careful to check for an empty extent record here - 2127 + * the merge code will know what to do if it sees one. 
2128 + */ 2129 + 2130 + if (index > 0) { 2131 + rec = &el->l_recs[index - 1]; 2132 + if (index == 1 && ocfs2_is_empty_extent(rec)) { 2133 + if (split_rec->e_cpos == el->l_recs[index].e_cpos) 2134 + ret = CONTIG_RIGHT; 2135 + } else { 2136 + ret = ocfs2_extent_contig(inode, rec, split_rec); 2137 + } 2138 + } 2139 + 2140 + if (index < (le16_to_cpu(el->l_next_free_rec) - 1)) { 2141 + enum ocfs2_contig_type contig_type; 2142 + 2143 + rec = &el->l_recs[index + 1]; 2144 + contig_type = ocfs2_extent_contig(inode, rec, split_rec); 2145 + 2146 + if (contig_type == CONTIG_LEFT && ret == CONTIG_RIGHT) 2147 + ret = CONTIG_LEFTRIGHT; 2148 + else if (ret == CONTIG_NONE) 2149 + ret = contig_type; 2150 + } 3472 2151 3473 2152 return ret; 3474 2153 } ··· 3601 2204 struct ocfs2_extent_list *el; 3602 2205 struct ocfs2_path *path = NULL; 3603 2206 struct buffer_head *bh = NULL; 2207 + 2208 + insert->ins_split = SPLIT_NONE; 3604 2209 3605 2210 el = &di->id2.i_list; 3606 2211 insert->ins_tree_depth = le16_to_cpu(el->l_tree_depth); ··· 3726 2327 u32 cpos, 3727 2328 u64 start_blk, 3728 2329 u32 new_clusters, 2330 + u8 flags, 3729 2331 struct ocfs2_alloc_context *meta_ac) 3730 2332 { 3731 - int status, shift; 2333 + int status; 3732 2334 struct buffer_head *last_eb_bh = NULL; 3733 2335 struct buffer_head *bh = NULL; 3734 2336 struct ocfs2_insert_type insert = {0, }; ··· 3750 2350 rec.e_cpos = cpu_to_le32(cpos); 3751 2351 rec.e_blkno = cpu_to_le64(start_blk); 3752 2352 rec.e_leaf_clusters = cpu_to_le16(new_clusters); 2353 + rec.e_flags = flags; 3753 2354 3754 2355 status = ocfs2_figure_insert_type(inode, fe_bh, &last_eb_bh, &rec, 3755 2356 &insert); ··· 3765 2364 insert.ins_appending, insert.ins_contig, insert.ins_contig_index, 3766 2365 insert.ins_free_records, insert.ins_tree_depth); 3767 2366 3768 - /* 3769 - * Avoid growing the tree unless we're out of records and the 3770 - * insert type requres one. 
3771 - */ 3772 - if (insert.ins_contig != CONTIG_NONE || insert.ins_free_records) 3773 - goto out_add; 3774 - 3775 - shift = ocfs2_find_branch_target(osb, inode, fe_bh, &bh); 3776 - if (shift < 0) { 3777 - status = shift; 3778 - mlog_errno(status); 3779 - goto bail; 3780 - } 3781 - 3782 - /* We traveled all the way to the bottom of the allocation tree 3783 - * and didn't find room for any more extents - we need to add 3784 - * another tree level */ 3785 - if (shift) { 3786 - BUG_ON(bh); 3787 - mlog(0, "need to shift tree depth " 3788 - "(current = %d)\n", insert.ins_tree_depth); 3789 - 3790 - /* ocfs2_shift_tree_depth will return us a buffer with 3791 - * the new extent block (so we can pass that to 3792 - * ocfs2_add_branch). */ 3793 - status = ocfs2_shift_tree_depth(osb, handle, inode, fe_bh, 3794 - meta_ac, &bh); 3795 - if (status < 0) { 2367 + if (insert.ins_contig == CONTIG_NONE && insert.ins_free_records == 0) { 2368 + status = ocfs2_grow_tree(inode, handle, fe_bh, 2369 + &insert.ins_tree_depth, &last_eb_bh, 2370 + meta_ac); 2371 + if (status) { 3796 2372 mlog_errno(status); 3797 2373 goto bail; 3798 2374 } 3799 - insert.ins_tree_depth++; 3800 - /* Special case: we have room now if we shifted from 3801 - * tree_depth 0 */ 3802 - if (insert.ins_tree_depth == 1) 3803 - goto out_add; 3804 2375 } 3805 2376 3806 - /* call ocfs2_add_branch to add the final part of the tree with 3807 - * the new data. */ 3808 - mlog(0, "add branch. bh = %p\n", bh); 3809 - status = ocfs2_add_branch(osb, handle, inode, fe_bh, bh, last_eb_bh, 3810 - meta_ac); 3811 - if (status < 0) { 3812 - mlog_errno(status); 3813 - goto bail; 3814 - } 3815 - 3816 - out_add: 3817 2377 /* Finally, we can add clusters. This might rotate the tree for us. 
*/ 3818 2378 status = ocfs2_do_insert_extent(inode, handle, fe_bh, &rec, &insert); 3819 2379 if (status < 0) ··· 3793 2431 return status; 3794 2432 } 3795 2433 3796 - static inline int ocfs2_truncate_log_needs_flush(struct ocfs2_super *osb) 2434 + static void ocfs2_make_right_split_rec(struct super_block *sb, 2435 + struct ocfs2_extent_rec *split_rec, 2436 + u32 cpos, 2437 + struct ocfs2_extent_rec *rec) 2438 + { 2439 + u32 rec_cpos = le32_to_cpu(rec->e_cpos); 2440 + u32 rec_range = rec_cpos + le16_to_cpu(rec->e_leaf_clusters); 2441 + 2442 + memset(split_rec, 0, sizeof(struct ocfs2_extent_rec)); 2443 + 2444 + split_rec->e_cpos = cpu_to_le32(cpos); 2445 + split_rec->e_leaf_clusters = cpu_to_le16(rec_range - cpos); 2446 + 2447 + split_rec->e_blkno = rec->e_blkno; 2448 + le64_add_cpu(&split_rec->e_blkno, 2449 + ocfs2_clusters_to_blocks(sb, cpos - rec_cpos)); 2450 + 2451 + split_rec->e_flags = rec->e_flags; 2452 + } 2453 + 2454 + static int ocfs2_split_and_insert(struct inode *inode, 2455 + handle_t *handle, 2456 + struct ocfs2_path *path, 2457 + struct buffer_head *di_bh, 2458 + struct buffer_head **last_eb_bh, 2459 + int split_index, 2460 + struct ocfs2_extent_rec *orig_split_rec, 2461 + struct ocfs2_alloc_context *meta_ac) 2462 + { 2463 + int ret = 0, depth; 2464 + unsigned int insert_range, rec_range, do_leftright = 0; 2465 + struct ocfs2_extent_rec tmprec; 2466 + struct ocfs2_extent_list *rightmost_el; 2467 + struct ocfs2_extent_rec rec; 2468 + struct ocfs2_extent_rec split_rec = *orig_split_rec; 2469 + struct ocfs2_insert_type insert; 2470 + struct ocfs2_extent_block *eb; 2471 + struct ocfs2_dinode *di; 2472 + 2473 + leftright: 2474 + /* 2475 + * Store a copy of the record on the stack - it might move 2476 + * around as the tree is manipulated below. 
2477 + */ 2478 + rec = path_leaf_el(path)->l_recs[split_index]; 2479 + 2480 + di = (struct ocfs2_dinode *)di_bh->b_data; 2481 + rightmost_el = &di->id2.i_list; 2482 + 2483 + depth = le16_to_cpu(rightmost_el->l_tree_depth); 2484 + if (depth) { 2485 + BUG_ON(!(*last_eb_bh)); 2486 + eb = (struct ocfs2_extent_block *) (*last_eb_bh)->b_data; 2487 + rightmost_el = &eb->h_list; 2488 + } 2489 + 2490 + if (le16_to_cpu(rightmost_el->l_next_free_rec) == 2491 + le16_to_cpu(rightmost_el->l_count)) { 2492 + int old_depth = depth; 2493 + 2494 + ret = ocfs2_grow_tree(inode, handle, di_bh, &depth, last_eb_bh, 2495 + meta_ac); 2496 + if (ret) { 2497 + mlog_errno(ret); 2498 + goto out; 2499 + } 2500 + 2501 + if (old_depth != depth) { 2502 + eb = (struct ocfs2_extent_block *)(*last_eb_bh)->b_data; 2503 + rightmost_el = &eb->h_list; 2504 + } 2505 + } 2506 + 2507 + memset(&insert, 0, sizeof(struct ocfs2_insert_type)); 2508 + insert.ins_appending = APPEND_NONE; 2509 + insert.ins_contig = CONTIG_NONE; 2510 + insert.ins_free_records = le16_to_cpu(rightmost_el->l_count) 2511 + - le16_to_cpu(rightmost_el->l_next_free_rec); 2512 + insert.ins_tree_depth = depth; 2513 + 2514 + insert_range = le32_to_cpu(split_rec.e_cpos) + 2515 + le16_to_cpu(split_rec.e_leaf_clusters); 2516 + rec_range = le32_to_cpu(rec.e_cpos) + 2517 + le16_to_cpu(rec.e_leaf_clusters); 2518 + 2519 + if (split_rec.e_cpos == rec.e_cpos) { 2520 + insert.ins_split = SPLIT_LEFT; 2521 + } else if (insert_range == rec_range) { 2522 + insert.ins_split = SPLIT_RIGHT; 2523 + } else { 2524 + /* 2525 + * Left/right split. We fake this as a right split 2526 + * first and then make a second pass as a left split. 
2527 + */ 2528 + insert.ins_split = SPLIT_RIGHT; 2529 + 2530 + ocfs2_make_right_split_rec(inode->i_sb, &tmprec, insert_range, 2531 + &rec); 2532 + 2533 + split_rec = tmprec; 2534 + 2535 + BUG_ON(do_leftright); 2536 + do_leftright = 1; 2537 + } 2538 + 2539 + ret = ocfs2_do_insert_extent(inode, handle, di_bh, &split_rec, 2540 + &insert); 2541 + if (ret) { 2542 + mlog_errno(ret); 2543 + goto out; 2544 + } 2545 + 2546 + if (do_leftright == 1) { 2547 + u32 cpos; 2548 + struct ocfs2_extent_list *el; 2549 + 2550 + do_leftright++; 2551 + split_rec = *orig_split_rec; 2552 + 2553 + ocfs2_reinit_path(path, 1); 2554 + 2555 + cpos = le32_to_cpu(split_rec.e_cpos); 2556 + ret = ocfs2_find_path(inode, path, cpos); 2557 + if (ret) { 2558 + mlog_errno(ret); 2559 + goto out; 2560 + } 2561 + 2562 + el = path_leaf_el(path); 2563 + split_index = ocfs2_search_extent_list(el, cpos); 2564 + goto leftright; 2565 + } 2566 + out: 2567 + 2568 + return ret; 2569 + } 2570 + 2571 + /* 2572 + * Mark part or all of the extent record at split_index in the leaf 2573 + * pointed to by path as written. This removes the unwritten 2574 + * extent flag. 2575 + * 2576 + * Care is taken to handle contiguousness so as to not grow the tree. 2577 + * 2578 + * meta_ac is not strictly necessary - we only truly need it if growth 2579 + * of the tree is required. All other cases will degrade into a less 2580 + * optimal tree layout. 2581 + * 2582 + * last_eb_bh should be the rightmost leaf block for any inode with a 2583 + * btree. Since a split may grow the tree or a merge might shrink it, the caller cannot trust the contents of that buffer after this call. 2584 + * 2585 + * This code is optimized for readability - several passes might be 2586 + * made over certain portions of the tree. All of those blocks will 2587 + * have been brought into cache (and pinned via the journal), so the 2588 + * extra overhead is not expressed in terms of disk reads. 
2589 + */ 2590 + static int __ocfs2_mark_extent_written(struct inode *inode, 2591 + struct buffer_head *di_bh, 2592 + handle_t *handle, 2593 + struct ocfs2_path *path, 2594 + int split_index, 2595 + struct ocfs2_extent_rec *split_rec, 2596 + struct ocfs2_alloc_context *meta_ac, 2597 + struct ocfs2_cached_dealloc_ctxt *dealloc) 2598 + { 2599 + int ret = 0; 2600 + struct ocfs2_extent_list *el = path_leaf_el(path); 2601 + struct buffer_head *eb_bh, *last_eb_bh = NULL; 2602 + struct ocfs2_extent_rec *rec = &el->l_recs[split_index]; 2603 + struct ocfs2_merge_ctxt ctxt; 2604 + struct ocfs2_extent_list *rightmost_el; 2605 + 2606 + if (!(rec->e_flags & OCFS2_EXT_UNWRITTEN)) { 2607 + ret = -EIO; 2608 + mlog_errno(ret); 2609 + goto out; 2610 + } 2611 + 2612 + if (le32_to_cpu(rec->e_cpos) > le32_to_cpu(split_rec->e_cpos) || 2613 + ((le32_to_cpu(rec->e_cpos) + le16_to_cpu(rec->e_leaf_clusters)) < 2614 + (le32_to_cpu(split_rec->e_cpos) + le16_to_cpu(split_rec->e_leaf_clusters)))) { 2615 + ret = -EIO; 2616 + mlog_errno(ret); 2617 + goto out; 2618 + } 2619 + 2620 + eb_bh = path_leaf_bh(path); 2621 + ret = ocfs2_journal_access(handle, inode, eb_bh, 2622 + OCFS2_JOURNAL_ACCESS_WRITE); 2623 + if (ret) { 2624 + mlog_errno(ret); 2625 + goto out; 2626 + } 2627 + 2628 + ctxt.c_contig_type = ocfs2_figure_merge_contig_type(inode, el, 2629 + split_index, 2630 + split_rec); 2631 + 2632 + /* 2633 + * The core merge / split code wants to know how much room is 2634 + * left in this inode's allocation tree, so we pass the 2635 + * rightmost extent list.
2636 + */ 2637 + if (path->p_tree_depth) { 2638 + struct ocfs2_extent_block *eb; 2639 + struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data; 2640 + 2641 + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), 2642 + le64_to_cpu(di->i_last_eb_blk), 2643 + &last_eb_bh, OCFS2_BH_CACHED, inode); 2644 + if (ret) { 2645 + mlog_exit(ret); 2646 + goto out; 2647 + } 2648 + 2649 + eb = (struct ocfs2_extent_block *) last_eb_bh->b_data; 2650 + if (!OCFS2_IS_VALID_EXTENT_BLOCK(eb)) { 2651 + OCFS2_RO_ON_INVALID_EXTENT_BLOCK(inode->i_sb, eb); 2652 + ret = -EROFS; 2653 + goto out; 2654 + } 2655 + 2656 + rightmost_el = &eb->h_list; 2657 + } else 2658 + rightmost_el = path_root_el(path); 2659 + 2660 + ctxt.c_used_tail_recs = le16_to_cpu(rightmost_el->l_next_free_rec); 2661 + if (ctxt.c_used_tail_recs > 0 && 2662 + ocfs2_is_empty_extent(&rightmost_el->l_recs[0])) 2663 + ctxt.c_used_tail_recs--; 2664 + 2665 + if (rec->e_cpos == split_rec->e_cpos && 2666 + rec->e_leaf_clusters == split_rec->e_leaf_clusters) 2667 + ctxt.c_split_covers_rec = 1; 2668 + else 2669 + ctxt.c_split_covers_rec = 0; 2670 + 2671 + ctxt.c_has_empty_extent = ocfs2_is_empty_extent(&el->l_recs[0]); 2672 + 2673 + mlog(0, "index: %d, contig: %u, used_tail_recs: %u, " 2674 + "has_empty: %u, split_covers: %u\n", split_index, 2675 + ctxt.c_contig_type, ctxt.c_used_tail_recs, 2676 + ctxt.c_has_empty_extent, ctxt.c_split_covers_rec); 2677 + 2678 + if (ctxt.c_contig_type == CONTIG_NONE) { 2679 + if (ctxt.c_split_covers_rec) 2680 + el->l_recs[split_index] = *split_rec; 2681 + else 2682 + ret = ocfs2_split_and_insert(inode, handle, path, di_bh, 2683 + &last_eb_bh, split_index, 2684 + split_rec, meta_ac); 2685 + if (ret) 2686 + mlog_errno(ret); 2687 + } else { 2688 + ret = ocfs2_try_to_merge_extent(inode, handle, path, 2689 + split_index, split_rec, 2690 + dealloc, &ctxt); 2691 + if (ret) 2692 + mlog_errno(ret); 2693 + } 2694 + 2695 + ocfs2_journal_dirty(handle, eb_bh); 2696 + 2697 + out: 2698 + brelse(last_eb_bh); 2699 + 
return ret; 2700 + } 2701 + 2702 + /* 2703 + * Mark the already-existing extent at cpos as written for len clusters. 2704 + * 2705 + * If the existing extent is larger than the request, initiate a 2706 + * split. An attempt will be made at merging with adjacent extents. 2707 + * 2708 + * The caller is responsible for passing down meta_ac if we'll need it. 2709 + */ 2710 + int ocfs2_mark_extent_written(struct inode *inode, struct buffer_head *di_bh, 2711 + handle_t *handle, u32 cpos, u32 len, u32 phys, 2712 + struct ocfs2_alloc_context *meta_ac, 2713 + struct ocfs2_cached_dealloc_ctxt *dealloc) 2714 + { 2715 + int ret, index; 2716 + u64 start_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys); 2717 + struct ocfs2_extent_rec split_rec; 2718 + struct ocfs2_path *left_path = NULL; 2719 + struct ocfs2_extent_list *el; 2720 + 2721 + mlog(0, "Inode %lu cpos %u, len %u, phys %u (%llu)\n", 2722 + inode->i_ino, cpos, len, phys, (unsigned long long)start_blkno); 2723 + 2724 + if (!ocfs2_writes_unwritten_extents(OCFS2_SB(inode->i_sb))) { 2725 + ocfs2_error(inode->i_sb, "Inode %llu has unwritten extents " 2726 + "that are being written to, but the feature bit " 2727 + "is not set in the super block.", 2728 + (unsigned long long)OCFS2_I(inode)->ip_blkno); 2729 + ret = -EROFS; 2730 + goto out; 2731 + } 2732 + 2733 + /* 2734 + * XXX: This should be fixed up so that we just re-insert the 2735 + * next extent records. 
2736 + */ 2737 + ocfs2_extent_map_trunc(inode, 0); 2738 + 2739 + left_path = ocfs2_new_inode_path(di_bh); 2740 + if (!left_path) { 2741 + ret = -ENOMEM; 2742 + mlog_errno(ret); 2743 + goto out; 2744 + } 2745 + 2746 + ret = ocfs2_find_path(inode, left_path, cpos); 2747 + if (ret) { 2748 + mlog_errno(ret); 2749 + goto out; 2750 + } 2751 + el = path_leaf_el(left_path); 2752 + 2753 + index = ocfs2_search_extent_list(el, cpos); 2754 + if (index == -1 || index >= le16_to_cpu(el->l_next_free_rec)) { 2755 + ocfs2_error(inode->i_sb, 2756 + "Inode %llu has an extent at cpos %u which can no " 2757 + "longer be found.\n", 2758 + (unsigned long long)OCFS2_I(inode)->ip_blkno, cpos); 2759 + ret = -EROFS; 2760 + goto out; 2761 + } 2762 + 2763 + memset(&split_rec, 0, sizeof(struct ocfs2_extent_rec)); 2764 + split_rec.e_cpos = cpu_to_le32(cpos); 2765 + split_rec.e_leaf_clusters = cpu_to_le16(len); 2766 + split_rec.e_blkno = cpu_to_le64(start_blkno); 2767 + split_rec.e_flags = path_leaf_el(left_path)->l_recs[index].e_flags; 2768 + split_rec.e_flags &= ~OCFS2_EXT_UNWRITTEN; 2769 + 2770 + ret = __ocfs2_mark_extent_written(inode, di_bh, handle, left_path, 2771 + index, &split_rec, meta_ac, dealloc); 2772 + if (ret) 2773 + mlog_errno(ret); 2774 + 2775 + out: 2776 + ocfs2_free_path(left_path); 2777 + return ret; 2778 + } 2779 + 2780 + static int ocfs2_split_tree(struct inode *inode, struct buffer_head *di_bh, 2781 + handle_t *handle, struct ocfs2_path *path, 2782 + int index, u32 new_range, 2783 + struct ocfs2_alloc_context *meta_ac) 2784 + { 2785 + int ret, depth, credits = handle->h_buffer_credits; 2786 + struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data; 2787 + struct buffer_head *last_eb_bh = NULL; 2788 + struct ocfs2_extent_block *eb; 2789 + struct ocfs2_extent_list *rightmost_el, *el; 2790 + struct ocfs2_extent_rec split_rec; 2791 + struct ocfs2_extent_rec *rec; 2792 + struct ocfs2_insert_type insert; 2793 + 2794 + /* 2795 + * Setup the record to split before we grow 
the tree. 2796 + */ 2797 + el = path_leaf_el(path); 2798 + rec = &el->l_recs[index]; 2799 + ocfs2_make_right_split_rec(inode->i_sb, &split_rec, new_range, rec); 2800 + 2801 + depth = path->p_tree_depth; 2802 + if (depth > 0) { 2803 + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), 2804 + le64_to_cpu(di->i_last_eb_blk), 2805 + &last_eb_bh, OCFS2_BH_CACHED, inode); 2806 + if (ret < 0) { 2807 + mlog_errno(ret); 2808 + goto out; 2809 + } 2810 + 2811 + eb = (struct ocfs2_extent_block *) last_eb_bh->b_data; 2812 + rightmost_el = &eb->h_list; 2813 + } else 2814 + rightmost_el = path_leaf_el(path); 2815 + 2816 + credits += path->p_tree_depth + ocfs2_extend_meta_needed(di); 2817 + ret = ocfs2_extend_trans(handle, credits); 2818 + if (ret) { 2819 + mlog_errno(ret); 2820 + goto out; 2821 + } 2822 + 2823 + if (le16_to_cpu(rightmost_el->l_next_free_rec) == 2824 + le16_to_cpu(rightmost_el->l_count)) { 2825 + int old_depth = depth; 2826 + 2827 + ret = ocfs2_grow_tree(inode, handle, di_bh, &depth, &last_eb_bh, 2828 + meta_ac); 2829 + if (ret) { 2830 + mlog_errno(ret); 2831 + goto out; 2832 + } 2833 + 2834 + if (old_depth != depth) { 2835 + eb = (struct ocfs2_extent_block *)last_eb_bh->b_data; 2836 + rightmost_el = &eb->h_list; 2837 + } 2838 + } 2839 + 2840 + memset(&insert, 0, sizeof(struct ocfs2_insert_type)); 2841 + insert.ins_appending = APPEND_NONE; 2842 + insert.ins_contig = CONTIG_NONE; 2843 + insert.ins_split = SPLIT_RIGHT; 2844 + insert.ins_free_records = le16_to_cpu(rightmost_el->l_count) 2845 + - le16_to_cpu(rightmost_el->l_next_free_rec); 2846 + insert.ins_tree_depth = depth; 2847 + 2848 + ret = ocfs2_do_insert_extent(inode, handle, di_bh, &split_rec, &insert); 2849 + if (ret) 2850 + mlog_errno(ret); 2851 + 2852 + out: 2853 + brelse(last_eb_bh); 2854 + return ret; 2855 + } 2856 + 2857 + static int ocfs2_truncate_rec(struct inode *inode, handle_t *handle, 2858 + struct ocfs2_path *path, int index, 2859 + struct ocfs2_cached_dealloc_ctxt *dealloc, 2860 + u32 cpos, u32 len) 
2861 + { 2862 + int ret; 2863 + u32 left_cpos, rec_range, trunc_range; 2864 + int wants_rotate = 0, is_rightmost_tree_rec = 0; 2865 + struct super_block *sb = inode->i_sb; 2866 + struct ocfs2_path *left_path = NULL; 2867 + struct ocfs2_extent_list *el = path_leaf_el(path); 2868 + struct ocfs2_extent_rec *rec; 2869 + struct ocfs2_extent_block *eb; 2870 + 2871 + if (ocfs2_is_empty_extent(&el->l_recs[0]) && index > 0) { 2872 + ret = ocfs2_rotate_tree_left(inode, handle, path, dealloc); 2873 + if (ret) { 2874 + mlog_errno(ret); 2875 + goto out; 2876 + } 2877 + 2878 + index--; 2879 + } 2880 + 2881 + if (index == (le16_to_cpu(el->l_next_free_rec) - 1) && 2882 + path->p_tree_depth) { 2883 + /* 2884 + * Check whether this is the rightmost tree record. If 2885 + * we remove all of this record or part of its right 2886 + * edge then an update of the record lengths above it 2887 + * will be required. 2888 + */ 2889 + eb = (struct ocfs2_extent_block *)path_leaf_bh(path)->b_data; 2890 + if (eb->h_next_leaf_blk == 0) 2891 + is_rightmost_tree_rec = 1; 2892 + } 2893 + 2894 + rec = &el->l_recs[index]; 2895 + if (index == 0 && path->p_tree_depth && 2896 + le32_to_cpu(rec->e_cpos) == cpos) { 2897 + /* 2898 + * Changing the leftmost offset (via partial or whole 2899 + * record truncate) of an interior (or rightmost) path 2900 + * means we have to update the subtree that is formed 2901 + * by this leaf and the one to its left. 2902 + * 2903 + * There are two cases we can skip: 2904 + * 1) Path is the leftmost one in our inode tree. 2905 + * 2) The leaf is rightmost and will be empty after 2906 + * we remove the extent record - the rotate code 2907 + * knows how to update the newly formed edge.
2908 + */ 2909 + 2910 + ret = ocfs2_find_cpos_for_left_leaf(inode->i_sb, path, 2911 + &left_cpos); 2912 + if (ret) { 2913 + mlog_errno(ret); 2914 + goto out; 2915 + } 2916 + 2917 + if (left_cpos && le16_to_cpu(el->l_next_free_rec) > 1) { 2918 + left_path = ocfs2_new_path(path_root_bh(path), 2919 + path_root_el(path)); 2920 + if (!left_path) { 2921 + ret = -ENOMEM; 2922 + mlog_errno(ret); 2923 + goto out; 2924 + } 2925 + 2926 + ret = ocfs2_find_path(inode, left_path, left_cpos); 2927 + if (ret) { 2928 + mlog_errno(ret); 2929 + goto out; 2930 + } 2931 + } 2932 + } 2933 + 2934 + ret = ocfs2_extend_rotate_transaction(handle, 0, 2935 + handle->h_buffer_credits, 2936 + path); 2937 + if (ret) { 2938 + mlog_errno(ret); 2939 + goto out; 2940 + } 2941 + 2942 + ret = ocfs2_journal_access_path(inode, handle, path); 2943 + if (ret) { 2944 + mlog_errno(ret); 2945 + goto out; 2946 + } 2947 + 2948 + ret = ocfs2_journal_access_path(inode, handle, left_path); 2949 + if (ret) { 2950 + mlog_errno(ret); 2951 + goto out; 2952 + } 2953 + 2954 + rec_range = le32_to_cpu(rec->e_cpos) + ocfs2_rec_clusters(el, rec); 2955 + trunc_range = cpos + len; 2956 + 2957 + if (le32_to_cpu(rec->e_cpos) == cpos && rec_range == trunc_range) { 2958 + int next_free; 2959 + 2960 + memset(rec, 0, sizeof(*rec)); 2961 + ocfs2_cleanup_merge(el, index); 2962 + wants_rotate = 1; 2963 + 2964 + next_free = le16_to_cpu(el->l_next_free_rec); 2965 + if (is_rightmost_tree_rec && next_free > 1) { 2966 + /* 2967 + * We skip the edge update if this path will 2968 + * be deleted by the rotate code. 2969 + */ 2970 + rec = &el->l_recs[next_free - 1]; 2971 + ocfs2_adjust_rightmost_records(inode, handle, path, 2972 + rec); 2973 + } 2974 + } else if (le32_to_cpu(rec->e_cpos) == cpos) { 2975 + /* Remove leftmost portion of the record. 
*/ 2976 + le32_add_cpu(&rec->e_cpos, len); 2977 + le64_add_cpu(&rec->e_blkno, ocfs2_clusters_to_blocks(sb, len)); 2978 + le16_add_cpu(&rec->e_leaf_clusters, -len); 2979 + } else if (rec_range == trunc_range) { 2980 + /* Remove rightmost portion of the record */ 2981 + le16_add_cpu(&rec->e_leaf_clusters, -len); 2982 + if (is_rightmost_tree_rec) 2983 + ocfs2_adjust_rightmost_records(inode, handle, path, rec); 2984 + } else { 2985 + /* Caller should have trapped this. */ 2986 + mlog(ML_ERROR, "Inode %llu: Invalid record truncate: (%u, %u) " 2987 + "(%u, %u)\n", (unsigned long long)OCFS2_I(inode)->ip_blkno, 2988 + le32_to_cpu(rec->e_cpos), 2989 + le16_to_cpu(rec->e_leaf_clusters), cpos, len); 2990 + BUG(); 2991 + } 2992 + 2993 + if (left_path) { 2994 + int subtree_index; 2995 + 2996 + subtree_index = ocfs2_find_subtree_root(inode, left_path, path); 2997 + ocfs2_complete_edge_insert(inode, handle, left_path, path, 2998 + subtree_index); 2999 + } 3000 + 3001 + ocfs2_journal_dirty(handle, path_leaf_bh(path)); 3002 + 3003 + ret = ocfs2_rotate_tree_left(inode, handle, path, dealloc); 3004 + if (ret) { 3005 + mlog_errno(ret); 3006 + goto out; 3007 + } 3008 + 3009 + out: 3010 + ocfs2_free_path(left_path); 3011 + return ret; 3012 + } 3013 + 3014 + int ocfs2_remove_extent(struct inode *inode, struct buffer_head *di_bh, 3015 + u32 cpos, u32 len, handle_t *handle, 3016 + struct ocfs2_alloc_context *meta_ac, 3017 + struct ocfs2_cached_dealloc_ctxt *dealloc) 3018 + { 3019 + int ret, index; 3020 + u32 rec_range, trunc_range; 3021 + struct ocfs2_extent_rec *rec; 3022 + struct ocfs2_extent_list *el; 3023 + struct ocfs2_path *path; 3024 + 3025 + ocfs2_extent_map_trunc(inode, 0); 3026 + 3027 + path = ocfs2_new_inode_path(di_bh); 3028 + if (!path) { 3029 + ret = -ENOMEM; 3030 + mlog_errno(ret); 3031 + goto out; 3032 + } 3033 + 3034 + ret = ocfs2_find_path(inode, path, cpos); 3035 + if (ret) { 3036 + mlog_errno(ret); 3037 + goto out; 3038 + } 3039 + 3040 + el = path_leaf_el(path); 3041 + 
index = ocfs2_search_extent_list(el, cpos); 3042 + if (index == -1 || index >= le16_to_cpu(el->l_next_free_rec)) { 3043 + ocfs2_error(inode->i_sb, 3044 + "Inode %llu has an extent at cpos %u which can no " 3045 + "longer be found.\n", 3046 + (unsigned long long)OCFS2_I(inode)->ip_blkno, cpos); 3047 + ret = -EROFS; 3048 + goto out; 3049 + } 3050 + 3051 + /* 3052 + * We have 3 cases of extent removal: 3053 + * 1) Range covers the entire extent rec 3054 + * 2) Range begins or ends on one edge of the extent rec 3055 + * 3) Range is in the middle of the extent rec (no shared edges) 3056 + * 3057 + * For case 1 we remove the extent rec and left rotate to 3058 + * fill the hole. 3059 + * 3060 + * For case 2 we just shrink the existing extent rec, with a 3061 + * tree update if the shrinking edge is also the edge of an 3062 + * extent block. 3063 + * 3064 + * For case 3 we do a right split to turn the extent rec into 3065 + * something case 2 can handle. 3066 + */ 3067 + rec = &el->l_recs[index]; 3068 + rec_range = le32_to_cpu(rec->e_cpos) + ocfs2_rec_clusters(el, rec); 3069 + trunc_range = cpos + len; 3070 + 3071 + BUG_ON(cpos < le32_to_cpu(rec->e_cpos) || trunc_range > rec_range); 3072 + 3073 + mlog(0, "Inode %llu, remove (cpos %u, len %u). 
Existing index %d " 3074 + "(cpos %u, len %u)\n", 3075 + (unsigned long long)OCFS2_I(inode)->ip_blkno, cpos, len, index, 3076 + le32_to_cpu(rec->e_cpos), ocfs2_rec_clusters(el, rec)); 3077 + 3078 + if (le32_to_cpu(rec->e_cpos) == cpos || rec_range == trunc_range) { 3079 + ret = ocfs2_truncate_rec(inode, handle, path, index, dealloc, 3080 + cpos, len); 3081 + if (ret) { 3082 + mlog_errno(ret); 3083 + goto out; 3084 + } 3085 + } else { 3086 + ret = ocfs2_split_tree(inode, di_bh, handle, path, index, 3087 + trunc_range, meta_ac); 3088 + if (ret) { 3089 + mlog_errno(ret); 3090 + goto out; 3091 + } 3092 + 3093 + /* 3094 + * The split could have manipulated the tree enough to 3095 + * move the record location, so we have to look for it again. 3096 + */ 3097 + ocfs2_reinit_path(path, 1); 3098 + 3099 + ret = ocfs2_find_path(inode, path, cpos); 3100 + if (ret) { 3101 + mlog_errno(ret); 3102 + goto out; 3103 + } 3104 + 3105 + el = path_leaf_el(path); 3106 + index = ocfs2_search_extent_list(el, cpos); 3107 + if (index == -1 || index >= le16_to_cpu(el->l_next_free_rec)) { 3108 + ocfs2_error(inode->i_sb, 3109 + "Inode %llu: split at cpos %u lost record.", 3110 + (unsigned long long)OCFS2_I(inode)->ip_blkno, 3111 + cpos); 3112 + ret = -EROFS; 3113 + goto out; 3114 + } 3115 + 3116 + /* 3117 + * Double check our values here. If anything is fishy, 3118 + * it's easier to catch it at the top level. 
3119 + */ 3120 + rec = &el->l_recs[index]; 3121 + rec_range = le32_to_cpu(rec->e_cpos) + 3122 + ocfs2_rec_clusters(el, rec); 3123 + if (rec_range != trunc_range) { 3124 + ocfs2_error(inode->i_sb, 3125 + "Inode %llu: error after split at cpos %u, " 3126 + "trunc len %u, existing record is (%u,%u)", 3127 + (unsigned long long)OCFS2_I(inode)->ip_blkno, 3128 + cpos, len, le32_to_cpu(rec->e_cpos), 3129 + ocfs2_rec_clusters(el, rec)); 3130 + ret = -EROFS; 3131 + goto out; 3132 + } 3133 + 3134 + ret = ocfs2_truncate_rec(inode, handle, path, index, dealloc, 3135 + cpos, len); 3136 + if (ret) { 3137 + mlog_errno(ret); 3138 + goto out; 3139 + } 3140 + } 3141 + 3142 + out: 3143 + ocfs2_free_path(path); 3144 + return ret; 3145 + } 3146 + 3147 + int ocfs2_truncate_log_needs_flush(struct ocfs2_super *osb) 3797 3148 { 3798 3149 struct buffer_head *tl_bh = osb->osb_tl_bh; 3799 3150 struct ocfs2_dinode *di; ··· 4539 2464 return current_tail == new_start; 4540 2465 } 4541 2466 4542 - static int ocfs2_truncate_log_append(struct ocfs2_super *osb, 4543 - handle_t *handle, 4544 - u64 start_blk, 4545 - unsigned int num_clusters) 2467 + int ocfs2_truncate_log_append(struct ocfs2_super *osb, 2468 + handle_t *handle, 2469 + u64 start_blk, 2470 + unsigned int num_clusters) 4546 2471 { 4547 2472 int status, index; 4548 2473 unsigned int start_cluster, tl_count; ··· 4698 2623 } 4699 2624 4700 2625 /* Expects you to already be holding tl_inode->i_mutex */ 4701 - static int __ocfs2_flush_truncate_log(struct ocfs2_super *osb) 2626 + int __ocfs2_flush_truncate_log(struct ocfs2_super *osb) 4702 2627 { 4703 2628 int status; 4704 2629 unsigned int num_to_flush; ··· 5032 2957 return status; 5033 2958 } 5034 2959 2960 + /* 2961 + * Delayed de-allocation of suballocator blocks. 2962 + * 2963 + * Some sets of block de-allocations might involve multiple suballocator inodes. 
2964 + * 2965 + * The locking for this can get extremely complicated, especially when 2966 + * the suballocator inodes to delete from aren't known until deep 2967 + * within an unrelated codepath. 2968 + * 2969 + * ocfs2_extent_block structures are a good example of this - an inode 2970 + * btree could have been grown by any number of nodes each allocating 2971 + * out of their own suballoc inode. 2972 + * 2973 + * These structures allow the delay of block de-allocation until a 2974 + * later time, when locking of multiple cluster inodes won't cause 2975 + * deadlock. 2976 + */ 2977 + 2978 + /* 2979 + * Describes a single block free from a suballocator 2980 + */ 2981 + struct ocfs2_cached_block_free { 2982 + struct ocfs2_cached_block_free *free_next; 2983 + u64 free_blk; 2984 + unsigned int free_bit; 2985 + }; 2986 + 2987 + struct ocfs2_per_slot_free_list { 2988 + struct ocfs2_per_slot_free_list *f_next_suballocator; 2989 + int f_inode_type; 2990 + int f_slot; 2991 + struct ocfs2_cached_block_free *f_first; 2992 + }; 2993 + 2994 + static int ocfs2_free_cached_items(struct ocfs2_super *osb, 2995 + int sysfile_type, 2996 + int slot, 2997 + struct ocfs2_cached_block_free *head) 2998 + { 2999 + int ret; 3000 + u64 bg_blkno; 3001 + handle_t *handle; 3002 + struct inode *inode; 3003 + struct buffer_head *di_bh = NULL; 3004 + struct ocfs2_cached_block_free *tmp; 3005 + 3006 + inode = ocfs2_get_system_file_inode(osb, sysfile_type, slot); 3007 + if (!inode) { 3008 + ret = -EINVAL; 3009 + mlog_errno(ret); 3010 + goto out; 3011 + } 3012 + 3013 + mutex_lock(&inode->i_mutex); 3014 + 3015 + ret = ocfs2_meta_lock(inode, &di_bh, 1); 3016 + if (ret) { 3017 + mlog_errno(ret); 3018 + goto out_mutex; 3019 + } 3020 + 3021 + handle = ocfs2_start_trans(osb, OCFS2_SUBALLOC_FREE); 3022 + if (IS_ERR(handle)) { 3023 + ret = PTR_ERR(handle); 3024 + mlog_errno(ret); 3025 + goto out_unlock; 3026 + } 3027 + 3028 + while (head) { 3029 + bg_blkno = ocfs2_which_suballoc_group(head->free_blk, 3030 + 
head->free_bit); 3031 + mlog(0, "Free bit: (bit %u, blkno %llu)\n", 3032 + head->free_bit, (unsigned long long)head->free_blk); 3033 + 3034 + ret = ocfs2_free_suballoc_bits(handle, inode, di_bh, 3035 + head->free_bit, bg_blkno, 1); 3036 + if (ret) { 3037 + mlog_errno(ret); 3038 + goto out_journal; 3039 + } 3040 + 3041 + ret = ocfs2_extend_trans(handle, OCFS2_SUBALLOC_FREE); 3042 + if (ret) { 3043 + mlog_errno(ret); 3044 + goto out_journal; 3045 + } 3046 + 3047 + tmp = head; 3048 + head = head->free_next; 3049 + kfree(tmp); 3050 + } 3051 + 3052 + out_journal: 3053 + ocfs2_commit_trans(osb, handle); 3054 + 3055 + out_unlock: 3056 + ocfs2_meta_unlock(inode, 1); 3057 + brelse(di_bh); 3058 + out_mutex: 3059 + mutex_unlock(&inode->i_mutex); 3060 + iput(inode); 3061 + out: 3062 + while(head) { 3063 + /* Premature exit may have left some dangling items. */ 3064 + tmp = head; 3065 + head = head->free_next; 3066 + kfree(tmp); 3067 + } 3068 + 3069 + return ret; 3070 + } 3071 + 3072 + int ocfs2_run_deallocs(struct ocfs2_super *osb, 3073 + struct ocfs2_cached_dealloc_ctxt *ctxt) 3074 + { 3075 + int ret = 0, ret2; 3076 + struct ocfs2_per_slot_free_list *fl; 3077 + 3078 + if (!ctxt) 3079 + return 0; 3080 + 3081 + while (ctxt->c_first_suballocator) { 3082 + fl = ctxt->c_first_suballocator; 3083 + 3084 + if (fl->f_first) { 3085 + mlog(0, "Free items: (type %u, slot %d)\n", 3086 + fl->f_inode_type, fl->f_slot); 3087 + ret2 = ocfs2_free_cached_items(osb, fl->f_inode_type, 3088 + fl->f_slot, fl->f_first); 3089 + if (ret2) 3090 + mlog_errno(ret2); 3091 + if (!ret) 3092 + ret = ret2; 3093 + } 3094 + 3095 + ctxt->c_first_suballocator = fl->f_next_suballocator; 3096 + kfree(fl); 3097 + } 3098 + 3099 + return ret; 3100 + } 3101 + 3102 + static struct ocfs2_per_slot_free_list * 3103 + ocfs2_find_per_slot_free_list(int type, 3104 + int slot, 3105 + struct ocfs2_cached_dealloc_ctxt *ctxt) 3106 + { 3107 + struct ocfs2_per_slot_free_list *fl = ctxt->c_first_suballocator; 3108 + 3109 + while 
(fl) { 3110 + if (fl->f_inode_type == type && fl->f_slot == slot) 3111 + return fl; 3112 + 3113 + fl = fl->f_next_suballocator; 3114 + } 3115 + 3116 + fl = kmalloc(sizeof(*fl), GFP_NOFS); 3117 + if (fl) { 3118 + fl->f_inode_type = type; 3119 + fl->f_slot = slot; 3120 + fl->f_first = NULL; 3121 + fl->f_next_suballocator = ctxt->c_first_suballocator; 3122 + 3123 + ctxt->c_first_suballocator = fl; 3124 + } 3125 + return fl; 3126 + } 3127 + 3128 + static int ocfs2_cache_block_dealloc(struct ocfs2_cached_dealloc_ctxt *ctxt, 3129 + int type, int slot, u64 blkno, 3130 + unsigned int bit) 3131 + { 3132 + int ret; 3133 + struct ocfs2_per_slot_free_list *fl; 3134 + struct ocfs2_cached_block_free *item; 3135 + 3136 + fl = ocfs2_find_per_slot_free_list(type, slot, ctxt); 3137 + if (fl == NULL) { 3138 + ret = -ENOMEM; 3139 + mlog_errno(ret); 3140 + goto out; 3141 + } 3142 + 3143 + item = kmalloc(sizeof(*item), GFP_NOFS); 3144 + if (item == NULL) { 3145 + ret = -ENOMEM; 3146 + mlog_errno(ret); 3147 + goto out; 3148 + } 3149 + 3150 + mlog(0, "Insert: (type %d, slot %u, bit %u, blk %llu)\n", 3151 + type, slot, bit, (unsigned long long)blkno); 3152 + 3153 + item->free_blk = blkno; 3154 + item->free_bit = bit; 3155 + item->free_next = fl->f_first; 3156 + 3157 + fl->f_first = item; 3158 + 3159 + ret = 0; 3160 + out: 3161 + return ret; 3162 + } 3163 + 3164 + static int ocfs2_cache_extent_block_free(struct ocfs2_cached_dealloc_ctxt *ctxt, 3165 + struct ocfs2_extent_block *eb) 3166 + { 3167 + return ocfs2_cache_block_dealloc(ctxt, EXTENT_ALLOC_SYSTEM_INODE, 3168 + le16_to_cpu(eb->h_suballoc_slot), 3169 + le64_to_cpu(eb->h_blkno), 3170 + le16_to_cpu(eb->h_suballoc_bit)); 3171 + } 3172 + 5035 3173 /* This function will figure out whether the currently last extent 5036 3174 * block will be deleted, and if it will, what the new last extent 5037 3175 * block will be so we can update his h_next_leaf_blk field, as well ··· 5526 3238 BUG_ON(le32_to_cpu(el->l_recs[0].e_cpos)); 5527 3239 
BUG_ON(le64_to_cpu(el->l_recs[0].e_blkno)); 5528 3240 5529 - if (le16_to_cpu(eb->h_suballoc_slot) == 0) { 5530 - /* 5531 - * This code only understands how to 5532 - * lock the suballocator in slot 0, 5533 - * which is fine because allocation is 5534 - * only ever done out of that 5535 - * suballocator too. A future version 5536 - * might change that however, so avoid 5537 - * a free if we don't know how to 5538 - * handle it. This way an fs incompat 5539 - * bit will not be necessary. 5540 - */ 5541 - ret = ocfs2_free_extent_block(handle, 5542 - tc->tc_ext_alloc_inode, 5543 - tc->tc_ext_alloc_bh, 5544 - eb); 5545 - 5546 - /* An error here is not fatal. */ 5547 - if (ret < 0) 5548 - mlog_errno(ret); 5549 - } 3241 + ret = ocfs2_cache_extent_block_free(&tc->tc_dealloc, eb); 3242 + /* An error here is not fatal. */ 3243 + if (ret < 0) 3244 + mlog_errno(ret); 5550 3245 } else { 5551 3246 deleted_eb = 0; 5552 3247 } ··· 5668 3397 return ocfs2_journal_dirty_data(handle, bh); 5669 3398 } 5670 3399 5671 - static void ocfs2_zero_cluster_pages(struct inode *inode, loff_t isize, 5672 - struct page **pages, int numpages, 5673 - u64 phys, handle_t *handle) 3400 + static void ocfs2_zero_cluster_pages(struct inode *inode, loff_t start, 3401 + loff_t end, struct page **pages, 3402 + int numpages, u64 phys, handle_t *handle) 5674 3403 { 5675 3404 int i, ret, partial = 0; 5676 3405 void *kaddr; ··· 5683 3412 if (numpages == 0) 5684 3413 goto out; 5685 3414 5686 - from = isize & (PAGE_CACHE_SIZE - 1); /* 1st page offset */ 5687 - if (PAGE_CACHE_SHIFT > OCFS2_SB(sb)->s_clustersize_bits) { 5688 - /* 5689 - * Since 'from' has been capped to a value below page 5690 - * size, this calculation won't be able to overflow 5691 - * 'to' 5692 - */ 5693 - to = ocfs2_align_bytes_to_clusters(sb, from); 5694 - 5695 - /* 5696 - * The truncate tail in this case should never contain 5697 - * more than one page at maximum. The loop below also 5698 - * assumes this. 
5699 - */ 5700 - BUG_ON(numpages != 1); 5701 - } 5702 - 3415 + to = PAGE_CACHE_SIZE; 5703 3416 for(i = 0; i < numpages; i++) { 5704 3417 page = pages[i]; 3418 + 3419 + from = start & (PAGE_CACHE_SIZE - 1); 3420 + if ((end >> PAGE_CACHE_SHIFT) == page->index) 3421 + to = end & (PAGE_CACHE_SIZE - 1); 5705 3422 5706 3423 BUG_ON(from > PAGE_CACHE_SIZE); 5707 3424 BUG_ON(to > PAGE_CACHE_SIZE); ··· 5727 3468 5728 3469 flush_dcache_page(page); 5729 3470 5730 - /* 5731 - * Every page after the 1st one should be completely zero'd. 5732 - */ 5733 - from = 0; 3471 + start = (page->index + 1) << PAGE_CACHE_SHIFT; 5734 3472 } 5735 3473 out: 5736 3474 if (pages) { ··· 5740 3484 } 5741 3485 } 5742 3486 5743 - static int ocfs2_grab_eof_pages(struct inode *inode, loff_t isize, struct page **pages, 5744 - int *num, u64 *phys) 3487 + static int ocfs2_grab_eof_pages(struct inode *inode, loff_t start, loff_t end, 3488 + struct page **pages, int *num, u64 *phys) 5745 3489 { 5746 3490 int i, numpages = 0, ret = 0; 5747 - unsigned int csize = OCFS2_SB(inode->i_sb)->s_clustersize; 5748 3491 unsigned int ext_flags; 5749 3492 struct super_block *sb = inode->i_sb; 5750 3493 struct address_space *mapping = inode->i_mapping; 5751 3494 unsigned long index; 5752 - u64 next_cluster_bytes; 3495 + loff_t last_page_bytes; 5753 3496 5754 3497 BUG_ON(!ocfs2_sparse_alloc(OCFS2_SB(sb))); 3498 + BUG_ON(start > end); 5755 3499 5756 - /* Cluster boundary, so we don't need to grab any pages. 
*/ 5757 - if ((isize & (csize - 1)) == 0) 3500 + if (start == end) 5758 3501 goto out; 5759 3502 5760 - ret = ocfs2_extent_map_get_blocks(inode, isize >> sb->s_blocksize_bits, 3503 + BUG_ON(start >> OCFS2_SB(sb)->s_clustersize_bits != 3504 + (end - 1) >> OCFS2_SB(sb)->s_clustersize_bits); 3505 + 3506 + ret = ocfs2_extent_map_get_blocks(inode, start >> sb->s_blocksize_bits, 5761 3507 phys, NULL, &ext_flags); 5762 3508 if (ret) { 5763 3509 mlog_errno(ret); ··· 5775 3517 if (ext_flags & OCFS2_EXT_UNWRITTEN) 5776 3518 goto out; 5777 3519 5778 - next_cluster_bytes = ocfs2_align_bytes_to_clusters(inode->i_sb, isize); 5779 - index = isize >> PAGE_CACHE_SHIFT; 3520 + last_page_bytes = PAGE_ALIGN(end); 3521 + index = start >> PAGE_CACHE_SHIFT; 5780 3522 do { 5781 3523 pages[numpages] = grab_cache_page(mapping, index); 5782 3524 if (!pages[numpages]) { ··· 5787 3529 5788 3530 numpages++; 5789 3531 index++; 5790 - } while (index < (next_cluster_bytes >> PAGE_CACHE_SHIFT)); 3532 + } while (index < (last_page_bytes >> PAGE_CACHE_SHIFT)); 5791 3533 5792 3534 out: 5793 3535 if (ret != 0) { ··· 5816 3558 * otherwise block_write_full_page() will skip writeout of pages past 5817 3559 * i_size. The new_i_size parameter is passed for this reason. 
5818 3560 */ 5819 - int ocfs2_zero_tail_for_truncate(struct inode *inode, handle_t *handle, 5820 - u64 new_i_size) 3561 + int ocfs2_zero_range_for_truncate(struct inode *inode, handle_t *handle, 3562 + u64 range_start, u64 range_end) 5821 3563 { 5822 3564 int ret, numpages; 5823 - loff_t endbyte; 5824 3565 struct page **pages = NULL; 5825 3566 u64 phys; 5826 3567 ··· 5838 3581 goto out; 5839 3582 } 5840 3583 5841 - ret = ocfs2_grab_eof_pages(inode, new_i_size, pages, &numpages, &phys); 3584 + ret = ocfs2_grab_eof_pages(inode, range_start, range_end, pages, 3585 + &numpages, &phys); 5842 3586 if (ret) { 5843 3587 mlog_errno(ret); 5844 3588 goto out; ··· 5848 3590 if (numpages == 0) 5849 3591 goto out; 5850 3592 5851 - ocfs2_zero_cluster_pages(inode, new_i_size, pages, numpages, phys, 5852 - handle); 3593 + ocfs2_zero_cluster_pages(inode, range_start, range_end, pages, 3594 + numpages, phys, handle); 5853 3595 5854 3596 /* 5855 3597 * Initiate writeout of the pages we zero'd here. We don't 5856 3598 * wait on them - the truncate_inode_pages() call later will 5857 3599 * do that for us. 
5858 3600 */ 5859 - endbyte = ocfs2_align_bytes_to_clusters(inode->i_sb, new_i_size); 5860 - ret = do_sync_mapping_range(inode->i_mapping, new_i_size, 5861 - endbyte - 1, SYNC_FILE_RANGE_WRITE); 3601 + ret = do_sync_mapping_range(inode->i_mapping, range_start, 3602 + range_end - 1, SYNC_FILE_RANGE_WRITE); 5862 3603 if (ret) 5863 3604 mlog_errno(ret); 5864 3605 ··· 5887 3630 struct ocfs2_path *path = NULL; 5888 3631 5889 3632 mlog_entry_void(); 5890 - 5891 - down_write(&OCFS2_I(inode)->ip_alloc_sem); 5892 3633 5893 3634 new_highest_cpos = ocfs2_clusters_for_bytes(osb->sb, 5894 3635 i_size_read(inode)); ··· 6009 3754 goto start; 6010 3755 6011 3756 bail: 6012 - up_write(&OCFS2_I(inode)->ip_alloc_sem); 6013 3757 6014 3758 ocfs2_schedule_truncate_log_flush(osb, 1); 6015 3759 ··· 6017 3763 6018 3764 if (handle) 6019 3765 ocfs2_commit_trans(osb, handle); 3766 + 3767 + ocfs2_run_deallocs(osb, &tc->tc_dealloc); 6020 3768 6021 3769 ocfs2_free_path(path); 6022 3770 ··· 6030 3774 } 6031 3775 6032 3776 /* 6033 - * Expects the inode to already be locked. This will figure out which 6034 - * inodes need to be locked and will put them on the returned truncate 6035 - * context. 3777 + * Expects the inode to already be locked. 
6036 3778 */ 6037 3779 int ocfs2_prepare_truncate(struct ocfs2_super *osb, 6038 3780 struct inode *inode, 6039 3781 struct buffer_head *fe_bh, 6040 3782 struct ocfs2_truncate_context **tc) 6041 3783 { 6042 - int status, metadata_delete, i; 3784 + int status; 6043 3785 unsigned int new_i_clusters; 6044 3786 struct ocfs2_dinode *fe; 6045 3787 struct ocfs2_extent_block *eb; 6046 - struct ocfs2_extent_list *el; 6047 3788 struct buffer_head *last_eb_bh = NULL; 6048 - struct inode *ext_alloc_inode = NULL; 6049 - struct buffer_head *ext_alloc_bh = NULL; 6050 3789 6051 3790 mlog_entry_void(); 6052 3791 ··· 6061 3810 mlog_errno(status); 6062 3811 goto bail; 6063 3812 } 3813 + ocfs2_init_dealloc_ctxt(&(*tc)->tc_dealloc); 6064 3814 6065 - metadata_delete = 0; 6066 3815 if (fe->id2.i_list.l_tree_depth) { 6067 - /* If we have a tree, then the truncate may result in 6068 - * metadata deletes. Figure this out from the 6069 - * rightmost leaf block.*/ 6070 3816 status = ocfs2_read_block(osb, le64_to_cpu(fe->i_last_eb_blk), 6071 3817 &last_eb_bh, OCFS2_BH_CACHED, inode); 6072 3818 if (status < 0) { ··· 6078 3830 status = -EIO; 6079 3831 goto bail; 6080 3832 } 6081 - el = &(eb->h_list); 6082 - 6083 - i = 0; 6084 - if (ocfs2_is_empty_extent(&el->l_recs[0])) 6085 - i = 1; 6086 - /* 6087 - * XXX: Should we check that next_free_rec contains 6088 - * the extent? 6089 - */ 6090 - if (le32_to_cpu(el->l_recs[i].e_cpos) >= new_i_clusters) 6091 - metadata_delete = 1; 6092 3833 } 6093 3834 6094 3835 (*tc)->tc_last_eb_bh = last_eb_bh; 6095 - 6096 - if (metadata_delete) { 6097 - mlog(0, "Will have to delete metadata for this trunc. 
" 6098 - "locking allocator.\n"); 6099 - ext_alloc_inode = ocfs2_get_system_file_inode(osb, EXTENT_ALLOC_SYSTEM_INODE, 0); 6100 - if (!ext_alloc_inode) { 6101 - status = -ENOMEM; 6102 - mlog_errno(status); 6103 - goto bail; 6104 - } 6105 - 6106 - mutex_lock(&ext_alloc_inode->i_mutex); 6107 - (*tc)->tc_ext_alloc_inode = ext_alloc_inode; 6108 - 6109 - status = ocfs2_meta_lock(ext_alloc_inode, &ext_alloc_bh, 1); 6110 - if (status < 0) { 6111 - mlog_errno(status); 6112 - goto bail; 6113 - } 6114 - (*tc)->tc_ext_alloc_bh = ext_alloc_bh; 6115 - (*tc)->tc_ext_alloc_locked = 1; 6116 - } 6117 3836 6118 3837 status = 0; 6119 3838 bail: ··· 6095 3880 6096 3881 static void ocfs2_free_truncate_context(struct ocfs2_truncate_context *tc) 6097 3882 { 6098 - if (tc->tc_ext_alloc_inode) { 6099 - if (tc->tc_ext_alloc_locked) 6100 - ocfs2_meta_unlock(tc->tc_ext_alloc_inode, 1); 6101 - 6102 - mutex_unlock(&tc->tc_ext_alloc_inode->i_mutex); 6103 - iput(tc->tc_ext_alloc_inode); 6104 - } 6105 - 6106 - if (tc->tc_ext_alloc_bh) 6107 - brelse(tc->tc_ext_alloc_bh); 3883 + /* 3884 + * The caller is responsible for completing deallocation 3885 + * before freeing the context. 3886 + */ 3887 + if (tc->tc_dealloc.c_first_suballocator != NULL) 3888 + mlog(ML_NOTICE, 3889 + "Truncate completion has non-empty dealloc context\n"); 6108 3890 6109 3891 if (tc->tc_last_eb_bh) 6110 3892 brelse(tc->tc_last_eb_bh);
+39 -4
fs/ocfs2/alloc.h
··· 34 34 u32 cpos, 35 35 u64 start_blk, 36 36 u32 new_clusters, 37 + u8 flags, 37 38 struct ocfs2_alloc_context *meta_ac); 39 + struct ocfs2_cached_dealloc_ctxt; 40 + int ocfs2_mark_extent_written(struct inode *inode, struct buffer_head *di_bh, 41 + handle_t *handle, u32 cpos, u32 len, u32 phys, 42 + struct ocfs2_alloc_context *meta_ac, 43 + struct ocfs2_cached_dealloc_ctxt *dealloc); 44 + int ocfs2_remove_extent(struct inode *inode, struct buffer_head *di_bh, 45 + u32 cpos, u32 len, handle_t *handle, 46 + struct ocfs2_alloc_context *meta_ac, 47 + struct ocfs2_cached_dealloc_ctxt *dealloc); 38 48 int ocfs2_num_free_extents(struct ocfs2_super *osb, 39 49 struct inode *inode, 40 50 struct ocfs2_dinode *fe); ··· 72 62 struct ocfs2_dinode **tl_copy); 73 63 int ocfs2_complete_truncate_log_recovery(struct ocfs2_super *osb, 74 64 struct ocfs2_dinode *tl_copy); 65 + int ocfs2_truncate_log_needs_flush(struct ocfs2_super *osb); 66 + int ocfs2_truncate_log_append(struct ocfs2_super *osb, 67 + handle_t *handle, 68 + u64 start_blk, 69 + unsigned int num_clusters); 70 + int __ocfs2_flush_truncate_log(struct ocfs2_super *osb); 71 + 72 + /* 73 + * Process local structure which describes the block unlinks done 74 + * during an operation. This is populated via 75 + * ocfs2_cache_block_dealloc(). 76 + * 77 + * ocfs2_run_deallocs() should be called after the potentially 78 + * de-allocating routines. No journal handles should be open, and most 79 + * locks should have been dropped. 
80 + */ 81 + struct ocfs2_cached_dealloc_ctxt { 82 + struct ocfs2_per_slot_free_list *c_first_suballocator; 83 + }; 84 + static inline void ocfs2_init_dealloc_ctxt(struct ocfs2_cached_dealloc_ctxt *c) 85 + { 86 + c->c_first_suballocator = NULL; 87 + } 88 + int ocfs2_run_deallocs(struct ocfs2_super *osb, 89 + struct ocfs2_cached_dealloc_ctxt *ctxt); 75 90 76 91 struct ocfs2_truncate_context { 77 - struct inode *tc_ext_alloc_inode; 78 - struct buffer_head *tc_ext_alloc_bh; 92 + struct ocfs2_cached_dealloc_ctxt tc_dealloc; 79 93 int tc_ext_alloc_locked; /* is it cluster locked? */ 80 94 /* these get destroyed once it's passed to ocfs2_commit_truncate. */ 81 95 struct buffer_head *tc_last_eb_bh; 82 96 }; 83 97 84 - int ocfs2_zero_tail_for_truncate(struct inode *inode, handle_t *handle, 85 - u64 new_i_size); 98 + int ocfs2_zero_range_for_truncate(struct inode *inode, handle_t *handle, 99 + u64 range_start, u64 range_end); 86 100 int ocfs2_prepare_truncate(struct ocfs2_super *osb, 87 101 struct inode *inode, 88 102 struct buffer_head *fe_bh, ··· 118 84 119 85 int ocfs2_find_leaf(struct inode *inode, struct ocfs2_extent_list *root_el, 120 86 u32 cpos, struct buffer_head **leaf_bh); 87 + int ocfs2_search_extent_list(struct ocfs2_extent_list *el, u32 v_cluster); 121 88 122 89 /* 123 90 * Helper function to look at the # of clusters in an extent record.
+685 -384
fs/ocfs2/aops.c
··· 684 684 bh = bh->b_this_page, block_start += bsize) { 685 685 block_end = block_start + bsize; 686 686 687 + clear_buffer_new(bh); 688 + 687 689 /* 688 690 * Ignore blocks outside of our i/o range - 689 691 * they may belong to unallocated clusters. ··· 700 698 * For an allocating write with cluster size >= page 701 699 * size, we always write the entire page. 702 700 */ 703 - 704 - if (buffer_new(bh)) 705 - clear_buffer_new(bh); 701 + if (new) 702 + set_buffer_new(bh); 706 703 707 704 if (!buffer_mapped(bh)) { 708 705 map_bh(bh, inode->i_sb, *p_blkno); ··· 712 711 if (!buffer_uptodate(bh)) 713 712 set_buffer_uptodate(bh); 714 713 } else if (!buffer_uptodate(bh) && !buffer_delay(bh) && 715 - (block_start < from || block_end > to)) { 714 + !buffer_new(bh) && 715 + (block_start < from || block_end > to)) { 716 716 ll_rw_block(READ, 1, &bh); 717 717 *wait_bh++=bh; 718 718 } ··· 740 738 bh = head; 741 739 block_start = 0; 742 740 do { 743 - void *kaddr; 744 - 745 741 block_end = block_start + bsize; 746 742 if (block_end <= from) 747 743 goto next_bh; 748 744 if (block_start >= to) 749 745 break; 750 746 751 - kaddr = kmap_atomic(page, KM_USER0); 752 - memset(kaddr+block_start, 0, bh->b_size); 753 - flush_dcache_page(page); 754 - kunmap_atomic(kaddr, KM_USER0); 747 + zero_user_page(page, block_start, bh->b_size, KM_USER0); 755 748 set_buffer_uptodate(bh); 756 749 mark_buffer_dirty(bh); 757 750 ··· 758 761 return ret; 759 762 } 760 763 764 + #if (PAGE_CACHE_SIZE >= OCFS2_MAX_CLUSTERSIZE) 765 + #define OCFS2_MAX_CTXT_PAGES 1 766 + #else 767 + #define OCFS2_MAX_CTXT_PAGES (OCFS2_MAX_CLUSTERSIZE / PAGE_CACHE_SIZE) 768 + #endif 769 + 770 + #define OCFS2_MAX_CLUSTERS_PER_PAGE (PAGE_CACHE_SIZE / OCFS2_MIN_CLUSTERSIZE) 771 + 761 772 /* 762 - * This will copy user data from the buffer page in the splice 763 - * context. 764 - * 765 - * For now, we ignore SPLICE_F_MOVE as that would require some extra 766 - * communication out all the way to ocfs2_write(). 
773 + * Describe the state of a single cluster to be written to. 767 774 */ 768 - int ocfs2_map_and_write_splice_data(struct inode *inode, 769 - struct ocfs2_write_ctxt *wc, u64 *p_blkno, 770 - unsigned int *ret_from, unsigned int *ret_to) 775 + struct ocfs2_write_cluster_desc { 776 + u32 c_cpos; 777 + u32 c_phys; 778 + /* 779 + * Give this a unique field because c_phys eventually gets 780 + * filled. 781 + */ 782 + unsigned c_new; 783 + unsigned c_unwritten; 784 + }; 785 + 786 + static inline int ocfs2_should_zero_cluster(struct ocfs2_write_cluster_desc *d) 771 787 { 772 - int ret; 773 - unsigned int to, from, cluster_start, cluster_end; 774 - char *src, *dst; 775 - struct ocfs2_splice_write_priv *sp = wc->w_private; 776 - struct pipe_buffer *buf = sp->s_buf; 777 - unsigned long bytes, src_from; 778 - struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 779 - 780 - ocfs2_figure_cluster_boundaries(osb, wc->w_cpos, &cluster_start, 781 - &cluster_end); 782 - 783 - from = sp->s_offset; 784 - src_from = sp->s_buf_offset; 785 - bytes = wc->w_count; 786 - 787 - if (wc->w_large_pages) { 788 - /* 789 - * For cluster size < page size, we have to 790 - * calculate pos within the cluster and obey 791 - * the rightmost boundary. 
792 - */ 793 - bytes = min(bytes, (unsigned long)(osb->s_clustersize 794 - - (wc->w_pos & (osb->s_clustersize - 1)))); 795 - } 796 - to = from + bytes; 797 - 798 - BUG_ON(from > PAGE_CACHE_SIZE); 799 - BUG_ON(to > PAGE_CACHE_SIZE); 800 - BUG_ON(from < cluster_start); 801 - BUG_ON(to > cluster_end); 802 - 803 - if (wc->w_this_page_new) 804 - ret = ocfs2_map_page_blocks(wc->w_this_page, p_blkno, inode, 805 - cluster_start, cluster_end, 1); 806 - else 807 - ret = ocfs2_map_page_blocks(wc->w_this_page, p_blkno, inode, 808 - from, to, 0); 809 - if (ret) { 810 - mlog_errno(ret); 811 - goto out; 812 - } 813 - 814 - src = buf->ops->map(sp->s_pipe, buf, 1); 815 - dst = kmap_atomic(wc->w_this_page, KM_USER1); 816 - memcpy(dst + from, src + src_from, bytes); 817 - kunmap_atomic(wc->w_this_page, KM_USER1); 818 - buf->ops->unmap(sp->s_pipe, buf, src); 819 - 820 - wc->w_finished_copy = 1; 821 - 822 - *ret_from = from; 823 - *ret_to = to; 824 - out: 825 - 826 - return bytes ? (unsigned int)bytes : ret; 788 + return d->c_new || d->c_unwritten; 827 789 } 828 790 829 - /* 830 - * This will copy user data from the iovec in the buffered write 831 - * context. 
832 - */ 833 - int ocfs2_map_and_write_user_data(struct inode *inode, 834 - struct ocfs2_write_ctxt *wc, u64 *p_blkno, 835 - unsigned int *ret_from, unsigned int *ret_to) 836 - { 837 - int ret; 838 - unsigned int to, from, cluster_start, cluster_end; 839 - unsigned long bytes, src_from; 840 - char *dst; 841 - struct ocfs2_buffered_write_priv *bp = wc->w_private; 842 - const struct iovec *cur_iov = bp->b_cur_iov; 843 - char __user *buf; 844 - struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 791 + struct ocfs2_write_ctxt { 792 + /* Logical cluster position / len of write */ 793 + u32 w_cpos; 794 + u32 w_clen; 845 795 846 - ocfs2_figure_cluster_boundaries(osb, wc->w_cpos, &cluster_start, 847 - &cluster_end); 848 - 849 - buf = cur_iov->iov_base + bp->b_cur_off; 850 - src_from = (unsigned long)buf & ~PAGE_CACHE_MASK; 851 - 852 - from = wc->w_pos & (PAGE_CACHE_SIZE - 1); 796 + struct ocfs2_write_cluster_desc w_desc[OCFS2_MAX_CLUSTERS_PER_PAGE]; 853 797 854 798 /* 855 - * This is a lot of comparisons, but it reads quite 856 - * easily, which is important here. 857 - */ 858 - /* Stay within the src page */ 859 - bytes = PAGE_SIZE - src_from; 860 - /* Stay within the vector */ 861 - bytes = min(bytes, 862 - (unsigned long)(cur_iov->iov_len - bp->b_cur_off)); 863 - /* Stay within count */ 864 - bytes = min(bytes, (unsigned long)wc->w_count); 865 - /* 866 - * For clustersize > page size, just stay within 867 - * target page, otherwise we have to calculate pos 868 - * within the cluster and obey the rightmost 869 - * boundary. 870 - */ 871 - if (wc->w_large_pages) { 872 - /* 873 - * For cluster size < page size, we have to 874 - * calculate pos within the cluster and obey 875 - * the rightmost boundary. 876 - */ 877 - bytes = min(bytes, (unsigned long)(osb->s_clustersize 878 - - (wc->w_pos & (osb->s_clustersize - 1)))); 879 - } else { 880 - /* 881 - * cluster size > page size is the most common 882 - * case - we just stay within the target page 883 - * boundary. 
884 - */ 885 - bytes = min(bytes, PAGE_CACHE_SIZE - from); 886 - } 887 - 888 - to = from + bytes; 889 - 890 - BUG_ON(from > PAGE_CACHE_SIZE); 891 - BUG_ON(to > PAGE_CACHE_SIZE); 892 - BUG_ON(from < cluster_start); 893 - BUG_ON(to > cluster_end); 894 - 895 - if (wc->w_this_page_new) 896 - ret = ocfs2_map_page_blocks(wc->w_this_page, p_blkno, inode, 897 - cluster_start, cluster_end, 1); 898 - else 899 - ret = ocfs2_map_page_blocks(wc->w_this_page, p_blkno, inode, 900 - from, to, 0); 901 - if (ret) { 902 - mlog_errno(ret); 903 - goto out; 904 - } 905 - 906 - dst = kmap(wc->w_this_page); 907 - memcpy(dst + from, bp->b_src_buf + src_from, bytes); 908 - kunmap(wc->w_this_page); 909 - 910 - /* 911 - * XXX: This is slow, but simple. The caller of 912 - * ocfs2_buffered_write_cluster() is responsible for 913 - * passing through the iovecs, so it's difficult to 914 - * predict what our next step is in here after our 915 - * initial write. A future version should be pushing 916 - * that iovec manipulation further down. 799 + * This is true if page_size > cluster_size. 917 800 * 918 - * By setting this, we indicate that a copy from user 919 - * data was done, and subsequent calls for this 920 - * cluster will skip copying more data. 801 + * It triggers a set of special cases during write which might 802 + * have to deal with allocating writes to partial pages. 921 803 */ 922 - wc->w_finished_copy = 1; 804 + unsigned int w_large_pages; 923 805 924 - *ret_from = from; 925 - *ret_to = to; 926 - out: 806 + /* 807 + * Pages involved in this write. 808 + * 809 + * w_target_page is the page being written to by the user. 810 + * 811 + * w_pages is an array of pages which always contains 812 + * w_target_page, and in the case of an allocating write with 813 + * page_size < cluster size, it will contain zero'd and mapped 814 + * pages adjacent to w_target_page which need to be written 815 + * out in so that future reads from that region will get 816 + * zero's. 
817 + */ 818 + struct page *w_pages[OCFS2_MAX_CTXT_PAGES]; 819 + unsigned int w_num_pages; 820 + struct page *w_target_page; 927 821 928 - return bytes ? (unsigned int)bytes : ret; 822 + /* 823 + * ocfs2_write_end() uses this to know what the real range to 824 + * write in the target should be. 825 + */ 826 + unsigned int w_target_from; 827 + unsigned int w_target_to; 828 + 829 + /* 830 + * We could use journal_current_handle() but this is cleaner, 831 + * IMHO -Mark 832 + */ 833 + handle_t *w_handle; 834 + 835 + struct buffer_head *w_di_bh; 836 + 837 + struct ocfs2_cached_dealloc_ctxt w_dealloc; 838 + }; 839 + 840 + static void ocfs2_free_write_ctxt(struct ocfs2_write_ctxt *wc) 841 + { 842 + int i; 843 + 844 + for(i = 0; i < wc->w_num_pages; i++) { 845 + if (wc->w_pages[i] == NULL) 846 + continue; 847 + 848 + unlock_page(wc->w_pages[i]); 849 + mark_page_accessed(wc->w_pages[i]); 850 + page_cache_release(wc->w_pages[i]); 851 + } 852 + 853 + brelse(wc->w_di_bh); 854 + kfree(wc); 855 + } 856 + 857 + static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp, 858 + struct ocfs2_super *osb, loff_t pos, 859 + unsigned len, struct buffer_head *di_bh) 860 + { 861 + struct ocfs2_write_ctxt *wc; 862 + 863 + wc = kzalloc(sizeof(struct ocfs2_write_ctxt), GFP_NOFS); 864 + if (!wc) 865 + return -ENOMEM; 866 + 867 + wc->w_cpos = pos >> osb->s_clustersize_bits; 868 + wc->w_clen = ocfs2_clusters_for_bytes(osb->sb, len); 869 + get_bh(di_bh); 870 + wc->w_di_bh = di_bh; 871 + 872 + if (unlikely(PAGE_CACHE_SHIFT > osb->s_clustersize_bits)) 873 + wc->w_large_pages = 1; 874 + else 875 + wc->w_large_pages = 0; 876 + 877 + ocfs2_init_dealloc_ctxt(&wc->w_dealloc); 878 + 879 + *wcp = wc; 880 + 881 + return 0; 929 882 } 930 883 931 884 /* 932 - * Map, fill and write a page to disk. 933 - * 934 - * The work of copying data is done via callback. 
Newly allocated 935 - * pages which don't take user data will be zero'd (set 'new' to 936 - * indicate an allocating write) 937 - * 938 - * Returns a negative error code or the number of bytes copied into 939 - * the page. 885 + * If a page has any new buffers, zero them out here, and mark them uptodate 886 + * and dirty so they'll be written out (in order to prevent uninitialised 887 + * block data from leaking). And clear the new bit. 940 888 */ 941 - static int ocfs2_write_data_page(struct inode *inode, handle_t *handle, 942 - u64 *p_blkno, struct page *page, 943 - struct ocfs2_write_ctxt *wc, int new) 889 + static void ocfs2_zero_new_buffers(struct page *page, unsigned from, unsigned to) 944 890 { 945 - int ret, copied = 0; 946 - unsigned int from = 0, to = 0; 947 - unsigned int cluster_start, cluster_end; 948 - unsigned int zero_from = 0, zero_to = 0; 891 + unsigned int block_start, block_end; 892 + struct buffer_head *head, *bh; 949 893 950 - ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), wc->w_cpos, 894 + BUG_ON(!PageLocked(page)); 895 + if (!page_has_buffers(page)) 896 + return; 897 + 898 + bh = head = page_buffers(page); 899 + block_start = 0; 900 + do { 901 + block_end = block_start + bh->b_size; 902 + 903 + if (buffer_new(bh)) { 904 + if (block_end > from && block_start < to) { 905 + if (!PageUptodate(page)) { 906 + unsigned start, end; 907 + 908 + start = max(from, block_start); 909 + end = min(to, block_end); 910 + 911 + zero_user_page(page, start, end - start, KM_USER0); 912 + set_buffer_uptodate(bh); 913 + } 914 + 915 + clear_buffer_new(bh); 916 + mark_buffer_dirty(bh); 917 + } 918 + } 919 + 920 + block_start = block_end; 921 + bh = bh->b_this_page; 922 + } while (bh != head); 923 + } 924 + 925 + /* 926 + * Only called when we have a failure during allocating write to write 927 + * zero's to the newly allocated region. 
928 + */ 929 + static void ocfs2_write_failure(struct inode *inode, 930 + struct ocfs2_write_ctxt *wc, 931 + loff_t user_pos, unsigned user_len) 932 + { 933 + int i; 934 + unsigned from, to; 935 + struct page *tmppage; 936 + 937 + ocfs2_zero_new_buffers(wc->w_target_page, user_pos, user_len); 938 + 939 + if (wc->w_large_pages) { 940 + from = wc->w_target_from; 941 + to = wc->w_target_to; 942 + } else { 943 + from = 0; 944 + to = PAGE_CACHE_SIZE; 945 + } 946 + 947 + for(i = 0; i < wc->w_num_pages; i++) { 948 + tmppage = wc->w_pages[i]; 949 + 950 + if (ocfs2_should_order_data(inode)) 951 + walk_page_buffers(wc->w_handle, page_buffers(tmppage), 952 + from, to, NULL, 953 + ocfs2_journal_dirty_data); 954 + 955 + block_commit_write(tmppage, from, to); 956 + } 957 + } 958 + 959 + static int ocfs2_prepare_page_for_write(struct inode *inode, u64 *p_blkno, 960 + struct ocfs2_write_ctxt *wc, 961 + struct page *page, u32 cpos, 962 + loff_t user_pos, unsigned user_len, 963 + int new) 964 + { 965 + int ret; 966 + unsigned int map_from = 0, map_to = 0; 967 + unsigned int cluster_start, cluster_end; 968 + unsigned int user_data_from = 0, user_data_to = 0; 969 + 970 + ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), cpos, 951 971 &cluster_start, &cluster_end); 952 972 953 - if ((wc->w_pos >> PAGE_CACHE_SHIFT) == page->index 954 - && !wc->w_finished_copy) { 973 + if (page == wc->w_target_page) { 974 + map_from = user_pos & (PAGE_CACHE_SIZE - 1); 975 + map_to = map_from + user_len; 955 976 956 - wc->w_this_page = page; 957 - wc->w_this_page_new = new; 958 - ret = wc->w_write_data_page(inode, wc, p_blkno, &from, &to); 959 - if (ret < 0) { 977 + if (new) 978 + ret = ocfs2_map_page_blocks(page, p_blkno, inode, 979 + cluster_start, cluster_end, 980 + new); 981 + else 982 + ret = ocfs2_map_page_blocks(page, p_blkno, inode, 983 + map_from, map_to, new); 984 + if (ret) { 960 985 mlog_errno(ret); 961 986 goto out; 962 987 } 963 988 964 - copied = ret; 965 - 966 - zero_from = from; 967 
- zero_to = to; 989 + user_data_from = map_from; 990 + user_data_to = map_to; 968 991 if (new) { 969 - from = cluster_start; 970 - to = cluster_end; 992 + map_from = cluster_start; 993 + map_to = cluster_end; 971 994 } 995 + 996 + wc->w_target_from = map_from; 997 + wc->w_target_to = map_to; 972 998 } else { 973 999 /* 974 1000 * If we haven't allocated the new page yet, we ··· 1000 980 */ 1001 981 BUG_ON(!new); 1002 982 1003 - from = cluster_start; 1004 - to = cluster_end; 983 + map_from = cluster_start; 984 + map_to = cluster_end; 1005 985 1006 986 ret = ocfs2_map_page_blocks(page, p_blkno, inode, 1007 - cluster_start, cluster_end, 1); 987 + cluster_start, cluster_end, new); 1008 988 if (ret) { 1009 989 mlog_errno(ret); 1010 990 goto out; ··· 1023 1003 */ 1024 1004 if (new && !PageUptodate(page)) 1025 1005 ocfs2_clear_page_regions(page, OCFS2_SB(inode->i_sb), 1026 - wc->w_cpos, zero_from, zero_to); 1006 + cpos, user_data_from, user_data_to); 1027 1007 1028 1008 flush_dcache_page(page); 1029 1009 1030 - if (ocfs2_should_order_data(inode)) { 1031 - ret = walk_page_buffers(handle, 1032 - page_buffers(page), 1033 - from, to, NULL, 1034 - ocfs2_journal_dirty_data); 1035 - if (ret < 0) 1036 - mlog_errno(ret); 1037 - } 1038 - 1039 - /* 1040 - * We don't use generic_commit_write() because we need to 1041 - * handle our own i_size update. 1042 - */ 1043 - ret = block_commit_write(page, from, to); 1044 - if (ret) 1045 - mlog_errno(ret); 1046 1010 out: 1047 - 1048 - return copied ? copied : ret; 1011 + return ret; 1049 1012 } 1050 1013 1051 1014 /* 1052 - * Do the actual write of some data into an inode. Optionally allocate 1053 - * in order to fulfill the write. 1054 - * 1055 - * cpos is the logical cluster offset within the file to write at 1056 - * 1057 - * 'phys' is the physical mapping of that offset. a 'phys' value of 1058 - * zero indicates that allocation is required. 
In this case, data_ac 1059 - * and meta_ac should be valid (meta_ac can be null if metadata 1060 - * allocation isn't required). 1015 + * This function will only grab one clusters worth of pages. 1061 1016 */ 1062 - static ssize_t ocfs2_write(struct file *file, u32 phys, handle_t *handle, 1063 - struct buffer_head *di_bh, 1064 - struct ocfs2_alloc_context *data_ac, 1065 - struct ocfs2_alloc_context *meta_ac, 1066 - struct ocfs2_write_ctxt *wc) 1017 + static int ocfs2_grab_pages_for_write(struct address_space *mapping, 1018 + struct ocfs2_write_ctxt *wc, 1019 + u32 cpos, loff_t user_pos, int new, 1020 + struct page *mmap_page) 1067 1021 { 1068 - int ret, i, numpages = 1, new; 1069 - unsigned int copied = 0; 1070 - u32 tmp_pos; 1071 - u64 v_blkno, p_blkno; 1072 - struct address_space *mapping = file->f_mapping; 1022 + int ret = 0, i; 1023 + unsigned long start, target_index, index; 1073 1024 struct inode *inode = mapping->host; 1074 - unsigned long index, start; 1075 - struct page **cpages; 1076 1025 1077 - new = phys == 0 ? 1 : 0; 1026 + target_index = user_pos >> PAGE_CACHE_SHIFT; 1078 1027 1079 1028 /* 1080 1029 * Figure out how many pages we'll be manipulating here. For 1081 1030 * non allocating write, we just change the one 1082 1031 * page. Otherwise, we'll need a whole clusters worth. 1083 1032 */ 1084 - if (new) 1085 - numpages = ocfs2_pages_per_cluster(inode->i_sb); 1086 - 1087 - cpages = kzalloc(sizeof(*cpages) * numpages, GFP_NOFS); 1088 - if (!cpages) { 1089 - ret = -ENOMEM; 1090 - mlog_errno(ret); 1091 - return ret; 1092 - } 1093 - 1094 - /* 1095 - * Fill our page array first. That way we've grabbed enough so 1096 - * that we can zero and flush if we error after adding the 1097 - * extent. 
1098 - */ 1099 1033 if (new) { 1100 - start = ocfs2_align_clusters_to_page_index(inode->i_sb, 1101 - wc->w_cpos); 1102 - v_blkno = ocfs2_clusters_to_blocks(inode->i_sb, wc->w_cpos); 1034 + wc->w_num_pages = ocfs2_pages_per_cluster(inode->i_sb); 1035 + start = ocfs2_align_clusters_to_page_index(inode->i_sb, cpos); 1103 1036 } else { 1104 - start = wc->w_pos >> PAGE_CACHE_SHIFT; 1105 - v_blkno = wc->w_pos >> inode->i_sb->s_blocksize_bits; 1037 + wc->w_num_pages = 1; 1038 + start = target_index; 1106 1039 } 1107 1040 1108 - for(i = 0; i < numpages; i++) { 1041 + for(i = 0; i < wc->w_num_pages; i++) { 1109 1042 index = start + i; 1110 1043 1111 - cpages[i] = find_or_create_page(mapping, index, GFP_NOFS); 1112 - if (!cpages[i]) { 1113 - ret = -ENOMEM; 1114 - mlog_errno(ret); 1115 - goto out; 1044 + if (index == target_index && mmap_page) { 1045 + /* 1046 + * ocfs2_pagemkwrite() is a little different 1047 + * and wants us to directly use the page 1048 + * passed in. 1049 + */ 1050 + lock_page(mmap_page); 1051 + 1052 + if (mmap_page->mapping != mapping) { 1053 + unlock_page(mmap_page); 1054 + /* 1055 + * Sanity check - the locking in 1056 + * ocfs2_pagemkwrite() should ensure 1057 + * that this code doesn't trigger. 1058 + */ 1059 + ret = -EINVAL; 1060 + mlog_errno(ret); 1061 + goto out; 1062 + } 1063 + 1064 + page_cache_get(mmap_page); 1065 + wc->w_pages[i] = mmap_page; 1066 + } else { 1067 + wc->w_pages[i] = find_or_create_page(mapping, index, 1068 + GFP_NOFS); 1069 + if (!wc->w_pages[i]) { 1070 + ret = -ENOMEM; 1071 + mlog_errno(ret); 1072 + goto out; 1073 + } 1116 1074 } 1075 + 1076 + if (index == target_index) 1077 + wc->w_target_page = wc->w_pages[i]; 1117 1078 } 1079 + out: 1080 + return ret; 1081 + } 1082 + 1083 + /* 1084 + * Prepare a single cluster for write one cluster into the file. 
1085 + */ 1086 + static int ocfs2_write_cluster(struct address_space *mapping, 1087 + u32 phys, unsigned int unwritten, 1088 + struct ocfs2_alloc_context *data_ac, 1089 + struct ocfs2_alloc_context *meta_ac, 1090 + struct ocfs2_write_ctxt *wc, u32 cpos, 1091 + loff_t user_pos, unsigned user_len) 1092 + { 1093 + int ret, i, new, should_zero = 0; 1094 + u64 v_blkno, p_blkno; 1095 + struct inode *inode = mapping->host; 1096 + 1097 + new = phys == 0 ? 1 : 0; 1098 + if (new || unwritten) 1099 + should_zero = 1; 1118 1100 1119 1101 if (new) { 1102 + u32 tmp_pos; 1103 + 1120 1104 /* 1121 1105 * This is safe to call with the page locks - it won't take 1122 1106 * any additional semaphores or cluster locks. 1123 1107 */ 1124 - tmp_pos = wc->w_cpos; 1108 + tmp_pos = cpos; 1125 1109 ret = ocfs2_do_extend_allocation(OCFS2_SB(inode->i_sb), inode, 1126 - &tmp_pos, 1, di_bh, handle, 1127 - data_ac, meta_ac, NULL); 1110 + &tmp_pos, 1, 0, wc->w_di_bh, 1111 + wc->w_handle, data_ac, 1112 + meta_ac, NULL); 1128 1113 /* 1129 1114 * This shouldn't happen because we must have already 1130 1115 * calculated the correct meta data allocation required. The ··· 1146 1121 mlog_errno(ret); 1147 1122 goto out; 1148 1123 } 1124 + } else if (unwritten) { 1125 + ret = ocfs2_mark_extent_written(inode, wc->w_di_bh, 1126 + wc->w_handle, cpos, 1, phys, 1127 + meta_ac, &wc->w_dealloc); 1128 + if (ret < 0) { 1129 + mlog_errno(ret); 1130 + goto out; 1131 + } 1149 1132 } 1150 1133 1134 + if (should_zero) 1135 + v_blkno = ocfs2_clusters_to_blocks(inode->i_sb, cpos); 1136 + else 1137 + v_blkno = user_pos >> inode->i_sb->s_blocksize_bits; 1138 + 1139 + /* 1140 + * The only reason this should fail is due to an inability to 1141 + * find the extent added. 1142 + */ 1151 1143 ret = ocfs2_extent_map_get_blocks(inode, v_blkno, &p_blkno, NULL, 1152 1144 NULL); 1153 1145 if (ret < 0) { 1154 - 1155 - /* 1156 - * XXX: Should we go readonly here? 
1157 - */ 1158 - 1159 - mlog_errno(ret); 1146 + ocfs2_error(inode->i_sb, "Corrupting extent for inode %llu, " 1147 + "at logical block %llu", 1148 + (unsigned long long)OCFS2_I(inode)->ip_blkno, 1149 + (unsigned long long)v_blkno); 1160 1150 goto out; 1161 1151 } 1162 1152 1163 1153 BUG_ON(p_blkno == 0); 1164 1154 1165 - for(i = 0; i < numpages; i++) { 1166 - ret = ocfs2_write_data_page(inode, handle, &p_blkno, cpages[i], 1167 - wc, new); 1168 - if (ret < 0) { 1155 + for(i = 0; i < wc->w_num_pages; i++) { 1156 + int tmpret; 1157 + 1158 + tmpret = ocfs2_prepare_page_for_write(inode, &p_blkno, wc, 1159 + wc->w_pages[i], cpos, 1160 + user_pos, user_len, 1161 + should_zero); 1162 + if (tmpret) { 1163 + mlog_errno(tmpret); 1164 + if (ret == 0) 1165 + ret = tmpret; 1166 + } 1167 + } 1168 + 1169 + /* 1170 + * We only have cleanup to do in case of allocating write. 1171 + */ 1172 + if (ret && new) 1173 + ocfs2_write_failure(inode, wc, user_pos, user_len); 1174 + 1175 + out: 1176 + 1177 + return ret; 1178 + } 1179 + 1180 + static int ocfs2_write_cluster_by_desc(struct address_space *mapping, 1181 + struct ocfs2_alloc_context *data_ac, 1182 + struct ocfs2_alloc_context *meta_ac, 1183 + struct ocfs2_write_ctxt *wc, 1184 + loff_t pos, unsigned len) 1185 + { 1186 + int ret, i; 1187 + struct ocfs2_write_cluster_desc *desc; 1188 + 1189 + for (i = 0; i < wc->w_clen; i++) { 1190 + desc = &wc->w_desc[i]; 1191 + 1192 + ret = ocfs2_write_cluster(mapping, desc->c_phys, 1193 + desc->c_unwritten, data_ac, meta_ac, 1194 + wc, desc->c_cpos, pos, len); 1195 + if (ret) { 1169 1196 mlog_errno(ret); 1170 1197 goto out; 1171 1198 } 1172 - 1173 - copied += ret; 1174 1199 } 1175 1200 1201 + ret = 0; 1176 1202 out: 1177 - for(i = 0; i < numpages; i++) { 1178 - unlock_page(cpages[i]); 1179 - mark_page_accessed(cpages[i]); 1180 - page_cache_release(cpages[i]); 1181 - } 1182 - kfree(cpages); 1183 - 1184 - return copied ?
copied : ret; 1185 - } 1186 - 1187 - static void ocfs2_write_ctxt_init(struct ocfs2_write_ctxt *wc, 1188 - struct ocfs2_super *osb, loff_t pos, 1189 - size_t count, ocfs2_page_writer *cb, 1190 - void *cb_priv) 1191 - { 1192 - wc->w_count = count; 1193 - wc->w_pos = pos; 1194 - wc->w_cpos = wc->w_pos >> osb->s_clustersize_bits; 1195 - wc->w_finished_copy = 0; 1196 - 1197 - if (unlikely(PAGE_CACHE_SHIFT > osb->s_clustersize_bits)) 1198 - wc->w_large_pages = 1; 1199 - else 1200 - wc->w_large_pages = 0; 1201 - 1202 - wc->w_write_data_page = cb; 1203 - wc->w_private = cb_priv; 1203 + return ret; 1204 1204 } 1205 1205 1206 1206 /* 1207 - * Write a cluster to an inode. The cluster may not be allocated yet, 1208 - * in which case it will be. This only exists for buffered writes - 1209 - * O_DIRECT takes a more "traditional" path through the kernel. 1210 - * 1211 - * The caller is responsible for incrementing pos, written counts, etc 1212 - * 1213 - * For file systems that don't support sparse files, pre-allocation 1214 - * and page zeroing up until cpos should be done prior to this 1215 - * function call. 1216 - * 1217 - * Callers should be holding i_sem, and the rw cluster lock. 1218 - * 1219 - * Returns the number of user bytes written, or less than zero for 1220 - * error. 1207 + * ocfs2_write_end() wants to know which parts of the target page it 1208 + * should complete the write on. It's easiest to compute them ahead of 1209 + * time when a more complete view of the write is available. 
1221 1210 */ 1222 - ssize_t ocfs2_buffered_write_cluster(struct file *file, loff_t pos, 1223 - size_t count, ocfs2_page_writer *actor, 1224 - void *priv) 1211 + static void ocfs2_set_target_boundaries(struct ocfs2_super *osb, 1212 + struct ocfs2_write_ctxt *wc, 1213 + loff_t pos, unsigned len, int alloc) 1214 + { 1215 + struct ocfs2_write_cluster_desc *desc; 1216 + 1217 + wc->w_target_from = pos & (PAGE_CACHE_SIZE - 1); 1218 + wc->w_target_to = wc->w_target_from + len; 1219 + 1220 + if (alloc == 0) 1221 + return; 1222 + 1223 + /* 1224 + * Allocating write - we may have different boundaries based 1225 + * on page size and cluster size. 1226 + * 1227 + * NOTE: We can no longer compute one value from the other as 1228 + * the actual write length and user provided length may be 1229 + * different. 1230 + */ 1231 + 1232 + if (wc->w_large_pages) { 1233 + /* 1234 + * We only care about the 1st and last cluster within 1235 + * our range and whether they should be zero'd or not. Either 1236 + * value may be extended out to the start/end of a 1237 + * newly allocated cluster. 1238 + */ 1239 + desc = &wc->w_desc[0]; 1240 + if (ocfs2_should_zero_cluster(desc)) 1241 + ocfs2_figure_cluster_boundaries(osb, 1242 + desc->c_cpos, 1243 + &wc->w_target_from, 1244 + NULL); 1245 + 1246 + desc = &wc->w_desc[wc->w_clen - 1]; 1247 + if (ocfs2_should_zero_cluster(desc)) 1248 + ocfs2_figure_cluster_boundaries(osb, 1249 + desc->c_cpos, 1250 + NULL, 1251 + &wc->w_target_to); 1252 + } else { 1253 + wc->w_target_from = 0; 1254 + wc->w_target_to = PAGE_CACHE_SIZE; 1255 + } 1256 + } 1257 + 1258 + /* 1259 + * Populate each single-cluster write descriptor in the write context 1260 + * with information about the i/o to be done. 1261 + * 1262 + * Returns the number of clusters that will have to be allocated, as 1263 + * well as a worst case estimate of the number of extent records that 1264 + * would have to be created during a write to an unwritten region. 
1265 + */ 1266 + static int ocfs2_populate_write_desc(struct inode *inode, 1267 + struct ocfs2_write_ctxt *wc, 1268 + unsigned int *clusters_to_alloc, 1269 + unsigned int *extents_to_split) 1270 + { 1271 + int ret; 1272 + struct ocfs2_write_cluster_desc *desc; 1273 + unsigned int num_clusters = 0; 1274 + unsigned int ext_flags = 0; 1275 + u32 phys = 0; 1276 + int i; 1277 + 1278 + *clusters_to_alloc = 0; 1279 + *extents_to_split = 0; 1280 + 1281 + for (i = 0; i < wc->w_clen; i++) { 1282 + desc = &wc->w_desc[i]; 1283 + desc->c_cpos = wc->w_cpos + i; 1284 + 1285 + if (num_clusters == 0) { 1286 + /* 1287 + * Need to look up the next extent record. 1288 + */ 1289 + ret = ocfs2_get_clusters(inode, desc->c_cpos, &phys, 1290 + &num_clusters, &ext_flags); 1291 + if (ret) { 1292 + mlog_errno(ret); 1293 + goto out; 1294 + } 1295 + 1296 + /* 1297 + * Assume worst case - that we're writing in 1298 + * the middle of the extent. 1299 + * 1300 + * We can assume that the write proceeds from 1301 + * left to right, in which case the extent 1302 + * insert code is smart enough to coalesce the 1303 + * next splits into the previous records created. 1304 + */ 1305 + if (ext_flags & OCFS2_EXT_UNWRITTEN) 1306 + *extents_to_split = *extents_to_split + 2; 1307 + } else if (phys) { 1308 + /* 1309 + * Only increment phys if it doesn't describe 1310 + * a hole. 
1311 + */ 1312 + phys++; 1313 + } 1314 + 1315 + desc->c_phys = phys; 1316 + if (phys == 0) { 1317 + desc->c_new = 1; 1318 + *clusters_to_alloc = *clusters_to_alloc + 1; 1319 + } 1320 + if (ext_flags & OCFS2_EXT_UNWRITTEN) 1321 + desc->c_unwritten = 1; 1322 + 1323 + num_clusters--; 1324 + } 1325 + 1326 + ret = 0; 1327 + out: 1328 + return ret; 1329 + } 1330 + 1331 + int ocfs2_write_begin_nolock(struct address_space *mapping, 1332 + loff_t pos, unsigned len, unsigned flags, 1333 + struct page **pagep, void **fsdata, 1334 + struct buffer_head *di_bh, struct page *mmap_page) 1225 1335 { 1226 1336 int ret, credits = OCFS2_INODE_UPDATE_CREDITS; 1227 - ssize_t written = 0; 1228 - u32 phys; 1229 - struct inode *inode = file->f_mapping->host; 1337 + unsigned int clusters_to_alloc, extents_to_split; 1338 + struct ocfs2_write_ctxt *wc; 1339 + struct inode *inode = mapping->host; 1230 1340 struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 1231 - struct buffer_head *di_bh = NULL; 1232 1341 struct ocfs2_dinode *di; 1233 1342 struct ocfs2_alloc_context *data_ac = NULL; 1234 1343 struct ocfs2_alloc_context *meta_ac = NULL; 1235 1344 handle_t *handle; 1236 - struct ocfs2_write_ctxt wc; 1237 1345 1238 - ocfs2_write_ctxt_init(&wc, osb, pos, count, actor, priv); 1346 + ret = ocfs2_alloc_write_ctxt(&wc, osb, pos, len, di_bh); 1347 + if (ret) { 1348 + mlog_errno(ret); 1349 + return ret; 1350 + } 1239 1351 1240 - ret = ocfs2_meta_lock(inode, &di_bh, 1); 1352 + ret = ocfs2_populate_write_desc(inode, wc, &clusters_to_alloc, 1353 + &extents_to_split); 1241 1354 if (ret) { 1242 1355 mlog_errno(ret); 1243 1356 goto out; 1244 1357 } 1245 - di = (struct ocfs2_dinode *)di_bh->b_data; 1358 + 1359 + di = (struct ocfs2_dinode *)wc->w_di_bh->b_data; 1360 + 1361 + /* 1362 + * We set w_target_from, w_target_to here so that 1363 + * ocfs2_write_end() knows which range in the target page to 1364 + * write out. An allocation requires that we write the entire 1365 + * cluster range. 
1366 + */ 1367 + if (clusters_to_alloc || extents_to_split) { 1368 + /* 1369 + * XXX: We are stretching the limits of 1370 + * ocfs2_lock_allocators(). It greatly over-estimates 1371 + * the work to be done. 1372 + */ 1373 + ret = ocfs2_lock_allocators(inode, di, clusters_to_alloc, 1374 + extents_to_split, &data_ac, &meta_ac); 1375 + if (ret) { 1376 + mlog_errno(ret); 1377 + goto out; 1378 + } 1379 + 1380 + credits = ocfs2_calc_extend_credits(inode->i_sb, di, 1381 + clusters_to_alloc); 1382 + 1383 + } 1384 + 1385 + ocfs2_set_target_boundaries(osb, wc, pos, len, 1386 + clusters_to_alloc + extents_to_split); 1387 + 1388 + handle = ocfs2_start_trans(osb, credits); 1389 + if (IS_ERR(handle)) { 1390 + ret = PTR_ERR(handle); 1391 + mlog_errno(ret); 1392 + goto out; 1393 + } 1394 + 1395 + wc->w_handle = handle; 1396 + 1397 + /* 1398 + * We don't want this to fail in ocfs2_write_end(), so do it 1399 + * here. 1400 + */ 1401 + ret = ocfs2_journal_access(handle, inode, wc->w_di_bh, 1402 + OCFS2_JOURNAL_ACCESS_WRITE); 1403 + if (ret) { 1404 + mlog_errno(ret); 1405 + goto out_commit; 1406 + } 1407 + 1408 + /* 1409 + * Fill our page array first. That way we've grabbed enough so 1410 + * that we can zero and flush if we error after adding the 1411 + * extent. 
1412 + */ 1413 + ret = ocfs2_grab_pages_for_write(mapping, wc, wc->w_cpos, pos, 1414 + clusters_to_alloc + extents_to_split, 1415 + mmap_page); 1416 + if (ret) { 1417 + mlog_errno(ret); 1418 + goto out_commit; 1419 + } 1420 + 1421 + ret = ocfs2_write_cluster_by_desc(mapping, data_ac, meta_ac, wc, pos, 1422 + len); 1423 + if (ret) { 1424 + mlog_errno(ret); 1425 + goto out_commit; 1426 + } 1427 + 1428 + if (data_ac) 1429 + ocfs2_free_alloc_context(data_ac); 1430 + if (meta_ac) 1431 + ocfs2_free_alloc_context(meta_ac); 1432 + 1433 + *pagep = wc->w_target_page; 1434 + *fsdata = wc; 1435 + return 0; 1436 + out_commit: 1437 + ocfs2_commit_trans(osb, handle); 1438 + 1439 + out: 1440 + ocfs2_free_write_ctxt(wc); 1441 + 1442 + if (data_ac) 1443 + ocfs2_free_alloc_context(data_ac); 1444 + if (meta_ac) 1445 + ocfs2_free_alloc_context(meta_ac); 1446 + return ret; 1447 + } 1448 + 1449 + int ocfs2_write_begin(struct file *file, struct address_space *mapping, 1450 + loff_t pos, unsigned len, unsigned flags, 1451 + struct page **pagep, void **fsdata) 1452 + { 1453 + int ret; 1454 + struct buffer_head *di_bh = NULL; 1455 + struct inode *inode = mapping->host; 1456 + 1457 + ret = ocfs2_meta_lock(inode, &di_bh, 1); 1458 + if (ret) { 1459 + mlog_errno(ret); 1460 + return ret; 1461 + } 1246 1462 1247 1463 /* 1248 1464 * Take alloc sem here to prevent concurrent lookups. That way ··· 1494 1228 */ 1495 1229 down_write(&OCFS2_I(inode)->ip_alloc_sem); 1496 1230 1497 - ret = ocfs2_get_clusters(inode, wc.w_cpos, &phys, NULL, NULL); 1498 - if (ret) { 1499 - mlog_errno(ret); 1500 - goto out_meta; 1501 - } 1502 - 1503 - /* phys == 0 means that allocation is required. 
*/ 1504 - if (phys == 0) { 1505 - ret = ocfs2_lock_allocators(inode, di, 1, &data_ac, &meta_ac); 1506 - if (ret) { 1507 - mlog_errno(ret); 1508 - goto out_meta; 1509 - } 1510 - 1511 - credits = ocfs2_calc_extend_credits(inode->i_sb, di, 1); 1512 - } 1513 - 1514 1231 ret = ocfs2_data_lock(inode, 1); 1515 1232 if (ret) { 1516 1233 mlog_errno(ret); 1517 - goto out_meta; 1234 + goto out_fail; 1518 1235 } 1519 1236 1520 - handle = ocfs2_start_trans(osb, credits); 1521 - if (IS_ERR(handle)) { 1522 - ret = PTR_ERR(handle); 1523 - mlog_errno(ret); 1524 - goto out_data; 1525 - } 1526 - 1527 - written = ocfs2_write(file, phys, handle, di_bh, data_ac, 1528 - meta_ac, &wc); 1529 - if (written < 0) { 1530 - ret = written; 1531 - mlog_errno(ret); 1532 - goto out_commit; 1533 - } 1534 - 1535 - ret = ocfs2_journal_access(handle, inode, di_bh, 1536 - OCFS2_JOURNAL_ACCESS_WRITE); 1237 + ret = ocfs2_write_begin_nolock(mapping, pos, len, flags, pagep, 1238 + fsdata, di_bh, NULL); 1537 1239 if (ret) { 1538 1240 mlog_errno(ret); 1539 - goto out_commit; 1241 + goto out_fail_data; 1540 1242 } 1541 1243 1542 - pos += written; 1244 + brelse(di_bh); 1245 + 1246 + return 0; 1247 + 1248 + out_fail_data: 1249 + ocfs2_data_unlock(inode, 1); 1250 + out_fail: 1251 + up_write(&OCFS2_I(inode)->ip_alloc_sem); 1252 + 1253 + brelse(di_bh); 1254 + ocfs2_meta_unlock(inode, 1); 1255 + 1256 + return ret; 1257 + } 1258 + 1259 + int ocfs2_write_end_nolock(struct address_space *mapping, 1260 + loff_t pos, unsigned len, unsigned copied, 1261 + struct page *page, void *fsdata) 1262 + { 1263 + int i; 1264 + unsigned from, to, start = pos & (PAGE_CACHE_SIZE - 1); 1265 + struct inode *inode = mapping->host; 1266 + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 1267 + struct ocfs2_write_ctxt *wc = fsdata; 1268 + struct ocfs2_dinode *di = (struct ocfs2_dinode *)wc->w_di_bh->b_data; 1269 + handle_t *handle = wc->w_handle; 1270 + struct page *tmppage; 1271 + 1272 + if (unlikely(copied < len)) { 1273 + if 
(!PageUptodate(wc->w_target_page)) 1274 + copied = 0; 1275 + 1276 + ocfs2_zero_new_buffers(wc->w_target_page, start+copied, 1277 + start+len); 1278 + } 1279 + flush_dcache_page(wc->w_target_page); 1280 + 1281 + for(i = 0; i < wc->w_num_pages; i++) { 1282 + tmppage = wc->w_pages[i]; 1283 + 1284 + if (tmppage == wc->w_target_page) { 1285 + from = wc->w_target_from; 1286 + to = wc->w_target_to; 1287 + 1288 + BUG_ON(from > PAGE_CACHE_SIZE || 1289 + to > PAGE_CACHE_SIZE || 1290 + to < from); 1291 + } else { 1292 + /* 1293 + * Pages adjacent to the target (if any) imply 1294 + * a hole-filling write in which case we want 1295 + * to flush their entire range. 1296 + */ 1297 + from = 0; 1298 + to = PAGE_CACHE_SIZE; 1299 + } 1300 + 1301 + if (ocfs2_should_order_data(inode)) 1302 + walk_page_buffers(wc->w_handle, page_buffers(tmppage), 1303 + from, to, NULL, 1304 + ocfs2_journal_dirty_data); 1305 + 1306 + block_commit_write(tmppage, from, to); 1307 + } 1308 + 1309 + pos += copied; 1543 1310 if (pos > inode->i_size) { 1544 1311 i_size_write(inode, pos); 1545 1312 mark_inode_dirty(inode); ··· 1582 1283 inode->i_mtime = inode->i_ctime = CURRENT_TIME; 1583 1284 di->i_mtime = di->i_ctime = cpu_to_le64(inode->i_mtime.tv_sec); 1584 1285 di->i_mtime_nsec = di->i_ctime_nsec = cpu_to_le32(inode->i_mtime.tv_nsec); 1286 + ocfs2_journal_dirty(handle, wc->w_di_bh); 1585 1287 1586 - ret = ocfs2_journal_dirty(handle, di_bh); 1587 - if (ret) 1588 - mlog_errno(ret); 1589 - 1590 - out_commit: 1591 1288 ocfs2_commit_trans(osb, handle); 1592 1289 1593 - out_data: 1594 - ocfs2_data_unlock(inode, 1); 1290 + ocfs2_run_deallocs(osb, &wc->w_dealloc); 1595 1291 1596 - out_meta: 1292 + ocfs2_free_write_ctxt(wc); 1293 + 1294 + return copied; 1295 + } 1296 + 1297 + int ocfs2_write_end(struct file *file, struct address_space *mapping, 1298 + loff_t pos, unsigned len, unsigned copied, 1299 + struct page *page, void *fsdata) 1300 + { 1301 + int ret; 1302 + struct inode *inode = mapping->host; 1303 + 1304 + 
ret = ocfs2_write_end_nolock(mapping, pos, len, copied, page, fsdata); 1305 + 1306 + ocfs2_data_unlock(inode, 1); 1597 1307 up_write(&OCFS2_I(inode)->ip_alloc_sem); 1598 1308 ocfs2_meta_unlock(inode, 1); 1599 1309 1600 - out: 1601 - brelse(di_bh); 1602 - if (data_ac) 1603 - ocfs2_free_alloc_context(data_ac); 1604 - if (meta_ac) 1605 - ocfs2_free_alloc_context(meta_ac); 1606 - 1607 - return written ? written : ret; 1310 + return ret; 1608 1311 } 1609 1312 1610 1313 const struct address_space_operations ocfs2_aops = {
+13 -48
fs/ocfs2/aops.h
··· 42 42 int (*fn)( handle_t *handle, 43 43 struct buffer_head *bh)); 44 44 45 - struct ocfs2_write_ctxt; 46 - typedef int (ocfs2_page_writer)(struct inode *, struct ocfs2_write_ctxt *, 47 - u64 *, unsigned int *, unsigned int *); 45 + int ocfs2_write_begin(struct file *file, struct address_space *mapping, 46 + loff_t pos, unsigned len, unsigned flags, 47 + struct page **pagep, void **fsdata); 48 48 49 - ssize_t ocfs2_buffered_write_cluster(struct file *file, loff_t pos, 50 - size_t count, ocfs2_page_writer *actor, 51 - void *priv); 49 + int ocfs2_write_end(struct file *file, struct address_space *mapping, 50 + loff_t pos, unsigned len, unsigned copied, 51 + struct page *page, void *fsdata); 52 52 53 - struct ocfs2_write_ctxt { 54 - size_t w_count; 55 - loff_t w_pos; 56 - u32 w_cpos; 57 - unsigned int w_finished_copy; 53 + int ocfs2_write_end_nolock(struct address_space *mapping, 54 + loff_t pos, unsigned len, unsigned copied, 55 + struct page *page, void *fsdata); 58 56 59 - /* This is true if page_size > cluster_size */ 60 - unsigned int w_large_pages; 61 - 62 - /* Filler callback and private data */ 63 - ocfs2_page_writer *w_write_data_page; 64 - void *w_private; 65 - 66 - /* Only valid for the filler callback */ 67 - struct page *w_this_page; 68 - unsigned int w_this_page_new; 69 - }; 70 - 71 - struct ocfs2_buffered_write_priv { 72 - char *b_src_buf; 73 - const struct iovec *b_cur_iov; /* Current iovec */ 74 - size_t b_cur_off; /* Offset in the 75 - * current iovec */ 76 - }; 77 - int ocfs2_map_and_write_user_data(struct inode *inode, 78 - struct ocfs2_write_ctxt *wc, 79 - u64 *p_blkno, 80 - unsigned int *ret_from, 81 - unsigned int *ret_to); 82 - 83 - struct ocfs2_splice_write_priv { 84 - struct splice_desc *s_sd; 85 - struct pipe_buffer *s_buf; 86 - struct pipe_inode_info *s_pipe; 87 - /* Neither offset value is ever larger than one page */ 88 - unsigned int s_offset; 89 - unsigned int s_buf_offset; 90 - }; 91 - int ocfs2_map_and_write_splice_data(struct 
inode *inode, 92 - struct ocfs2_write_ctxt *wc, 93 - u64 *p_blkno, 94 - unsigned int *ret_from, 95 - unsigned int *ret_to); 57 + int ocfs2_write_begin_nolock(struct address_space *mapping, 58 + loff_t pos, unsigned len, unsigned flags, 59 + struct page **pagep, void **fsdata, 60 + struct buffer_head *di_bh, struct page *mmap_page); 96 61 97 62 /* all ocfs2_dio_end_io()'s fault */ 98 63 #define ocfs2_iocb_is_rw_locked(iocb) \
+93 -3
fs/ocfs2/cluster/heartbeat.c
··· 1335 1335 ret = wait_event_interruptible(o2hb_steady_queue, 1336 1336 atomic_read(&reg->hr_steady_iterations) == 0); 1337 1337 if (ret) { 1338 + /* We got interrupted (hello ptrace!). Clean up */ 1338 1339 spin_lock(&o2hb_live_lock); 1339 1340 hb_task = reg->hr_task; 1340 1341 reg->hr_task = NULL; ··· 1346 1345 goto out; 1347 1346 } 1348 1347 1349 - ret = count; 1348 + /* Ok, we were woken. Make sure it wasn't by drop_item() */ 1349 + spin_lock(&o2hb_live_lock); 1350 + hb_task = reg->hr_task; 1351 + spin_unlock(&o2hb_live_lock); 1352 + 1353 + if (hb_task) 1354 + ret = count; 1355 + else 1356 + ret = -EIO; 1357 + 1350 1358 out: 1351 1359 if (filp) 1352 1360 fput(filp); ··· 1533 1523 if (hb_task) 1534 1524 kthread_stop(hb_task); 1535 1525 1526 + /* 1527 + * If we're racing a dev_write(), we need to wake them. They will 1528 + * check reg->hr_task 1529 + */ 1530 + if (atomic_read(&reg->hr_steady_iterations) != 0) { 1531 + atomic_set(&reg->hr_steady_iterations, 0); 1532 + wake_up(&o2hb_steady_queue); 1533 + } 1534 + 1536 1535 config_item_put(item); 1537 1536 } 1538 1537 ··· 1684 1665 } 1685 1666 EXPORT_SYMBOL_GPL(o2hb_setup_callback); 1686 1667 1687 - int o2hb_register_callback(struct o2hb_callback_func *hc) 1668 + static struct o2hb_region *o2hb_find_region(const char *region_uuid) 1669 + { 1670 + struct o2hb_region *p, *reg = NULL; 1671 + 1672 + assert_spin_locked(&o2hb_live_lock); 1673 + 1674 + list_for_each_entry(p, &o2hb_all_regions, hr_all_item) { 1675 + if (!strcmp(region_uuid, config_item_name(&p->hr_item))) { 1676 + reg = p; 1677 + break; 1678 + } 1679 + } 1680 + 1681 + return reg; 1682 + } 1683 + 1684 + static int o2hb_region_get(const char *region_uuid) 1685 + { 1686 + int ret = 0; 1687 + struct o2hb_region *reg; 1688 + 1689 + spin_lock(&o2hb_live_lock); 1690 + 1691 + reg = o2hb_find_region(region_uuid); 1692 + if (!reg) 1693 + ret = -ENOENT; 1694 + spin_unlock(&o2hb_live_lock); 1695 + 1696 + if (ret) 1697 + goto out; 1698 + 1699 + ret = 
o2nm_depend_this_node(); 1700 + if (ret) 1701 + goto out; 1702 + 1703 + ret = o2nm_depend_item(&reg->hr_item); 1704 + if (ret) 1705 + o2nm_undepend_this_node(); 1706 + 1707 + out: 1708 + return ret; 1709 + } 1710 + 1711 + static void o2hb_region_put(const char *region_uuid) 1712 + { 1713 + struct o2hb_region *reg; 1714 + 1715 + spin_lock(&o2hb_live_lock); 1716 + 1717 + reg = o2hb_find_region(region_uuid); 1718 + 1719 + spin_unlock(&o2hb_live_lock); 1720 + 1721 + if (reg) { 1722 + o2nm_undepend_item(&reg->hr_item); 1723 + o2nm_undepend_this_node(); 1724 + } 1725 + } 1726 + 1727 + int o2hb_register_callback(const char *region_uuid, 1728 + struct o2hb_callback_func *hc) 1688 1729 { 1689 1730 struct o2hb_callback_func *tmp; 1690 1731 struct list_head *iter; ··· 1758 1679 if (IS_ERR(hbcall)) { 1759 1680 ret = PTR_ERR(hbcall); 1760 1681 goto out; 1682 + } 1683 + 1684 + if (region_uuid) { 1685 + ret = o2hb_region_get(region_uuid); 1686 + if (ret) 1687 + goto out; 1761 1688 } 1762 1689 1763 1690 down_write(&o2hb_callback_sem); ··· 1787 1702 } 1788 1703 EXPORT_SYMBOL_GPL(o2hb_register_callback); 1789 1704 1790 - void o2hb_unregister_callback(struct o2hb_callback_func *hc) 1705 + void o2hb_unregister_callback(const char *region_uuid, 1706 + struct o2hb_callback_func *hc) 1791 1707 { 1792 1708 BUG_ON(hc->hc_magic != O2HB_CB_MAGIC); 1793 1709 1794 1710 mlog(ML_HEARTBEAT, "on behalf of %p for funcs %p\n", 1795 1711 __builtin_return_address(0), hc); 1796 1712 1713 + /* XXX Can this happen _with_ a region reference? */ 1797 1714 if (list_empty(&hc->hc_item)) 1798 1715 return; 1716 + 1717 + if (region_uuid) 1718 + o2hb_region_put(region_uuid); 1799 1719 1800 1720 down_write(&o2hb_callback_sem); 1801 1721
+4 -2
fs/ocfs2/cluster/heartbeat.h
···
 69  69 			o2hb_cb_func *func,
 70  70 			void *data,
 71  71 			int priority);
 72      -	int o2hb_register_callback(struct o2hb_callback_func *hc);
 73      -	void o2hb_unregister_callback(struct o2hb_callback_func *hc);
      72 +	int o2hb_register_callback(const char *region_uuid,
      73 +				   struct o2hb_callback_func *hc);
      74 +	void o2hb_unregister_callback(const char *region_uuid,
      75 +				      struct o2hb_callback_func *hc);
 74  76 	void o2hb_fill_node_map(unsigned long *map,
 75  77 			        unsigned bytes);
 76  78 	void o2hb_init(void);
+41 -1
fs/ocfs2/cluster/nodemanager.c
··· 900 900 }, 901 901 }; 902 902 903 + int o2nm_depend_item(struct config_item *item) 904 + { 905 + return configfs_depend_item(&o2nm_cluster_group.cs_subsys, item); 906 + } 907 + 908 + void o2nm_undepend_item(struct config_item *item) 909 + { 910 + configfs_undepend_item(&o2nm_cluster_group.cs_subsys, item); 911 + } 912 + 913 + int o2nm_depend_this_node(void) 914 + { 915 + int ret = 0; 916 + struct o2nm_node *local_node; 917 + 918 + local_node = o2nm_get_node_by_num(o2nm_this_node()); 919 + if (!local_node) { 920 + ret = -EINVAL; 921 + goto out; 922 + } 923 + 924 + ret = o2nm_depend_item(&local_node->nd_item); 925 + o2nm_node_put(local_node); 926 + 927 + out: 928 + return ret; 929 + } 930 + 931 + void o2nm_undepend_this_node(void) 932 + { 933 + struct o2nm_node *local_node; 934 + 935 + local_node = o2nm_get_node_by_num(o2nm_this_node()); 936 + BUG_ON(!local_node); 937 + 938 + o2nm_undepend_item(&local_node->nd_item); 939 + o2nm_node_put(local_node); 940 + } 941 + 942 + 903 943 static void __exit exit_o2nm(void) 904 944 { 905 945 if (ocfs2_table_header) ··· 974 934 goto out_sysctl; 975 935 976 936 config_group_init(&o2nm_cluster_group.cs_subsys.su_group); 977 - init_MUTEX(&o2nm_cluster_group.cs_subsys.su_sem); 937 + mutex_init(&o2nm_cluster_group.cs_subsys.su_mutex); 978 938 ret = configfs_register_subsystem(&o2nm_cluster_group.cs_subsys); 979 939 if (ret) { 980 940 printk(KERN_ERR "nodemanager: Registration returned %d\n", ret);
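The new o2nm_depend_this_node() above follows a common kernel idiom: take a temporary lookup reference to find the object, register the longer-lived dependency, then drop the lookup reference, unwinding on failure via goto out. A hedged userspace sketch of that shape (the struct node counters and all names here are illustrative stand-ins, not the o2nm API):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct node {
	int refcount;	/* temporary lookup references */
	int depends;	/* longer-lived dependency pins */
};

static struct node *node_get(struct node *n)
{
	if (!n)
		return NULL;
	n->refcount++;
	return n;
}

static void node_put(struct node *n)
{
	assert(n->refcount > 0);
	n->refcount--;
}

static int depend_item(struct node *n)
{
	n->depends++;
	return 0;
}

/* Mirrors the shape of o2nm_depend_this_node(): look up, pin,
 * drop the lookup reference, with goto-out error unwinding. */
static int depend_this_node(struct node *local)
{
	int ret = 0;
	struct node *n;

	n = node_get(local);
	if (!n) {
		ret = -EINVAL;
		goto out;
	}

	ret = depend_item(n);
	node_put(n);	/* the dependency, not this ref, keeps it pinned */
out:
	return ret;
}
```

Note that the temporary reference is dropped even on success: what outlives the call is the dependency registration, not the lookup reference, which is exactly why o2nm_undepend_this_node() must re-look-up the node before undepending it.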
+5
fs/ocfs2/cluster/nodemanager.h
···
 77  77 void o2nm_node_get(struct o2nm_node *node);
 78  78 void o2nm_node_put(struct o2nm_node *node);
 79  79
     80 +int o2nm_depend_item(struct config_item *item);
     81 +void o2nm_undepend_item(struct config_item *item);
     82 +int o2nm_depend_this_node(void);
     83 +void o2nm_undepend_this_node(void);
     84 +
 80  85 #endif /* O2CLUSTER_NODEMANAGER_H */
+8 -13
fs/ocfs2/cluster/tcp.c
··· 261 261 262 262 static void o2net_complete_nodes_nsw(struct o2net_node *nn) 263 263 { 264 - struct list_head *iter, *tmp; 264 + struct o2net_status_wait *nsw, *tmp; 265 265 unsigned int num_kills = 0; 266 - struct o2net_status_wait *nsw; 267 266 268 267 assert_spin_locked(&nn->nn_lock); 269 268 270 - list_for_each_safe(iter, tmp, &nn->nn_status_list) { 271 - nsw = list_entry(iter, struct o2net_status_wait, ns_node_item); 269 + list_for_each_entry_safe(nsw, tmp, &nn->nn_status_list, ns_node_item) { 272 270 o2net_complete_nsw_locked(nn, nsw, O2NET_ERR_DIED, 0); 273 271 num_kills++; 274 272 } ··· 762 764 763 765 void o2net_unregister_handler_list(struct list_head *list) 764 766 { 765 - struct list_head *pos, *n; 766 - struct o2net_msg_handler *nmh; 767 + struct o2net_msg_handler *nmh, *n; 767 768 768 769 write_lock(&o2net_handler_lock); 769 - list_for_each_safe(pos, n, list) { 770 - nmh = list_entry(pos, struct o2net_msg_handler, 771 - nh_unregister_item); 770 + list_for_each_entry_safe(nmh, n, list, nh_unregister_item) { 772 771 mlog(ML_TCP, "unregistering handler func %p type %u key %08x\n", 773 772 nmh->nh_func, nmh->nh_msg_type, nmh->nh_key); 774 773 rb_erase(&nmh->nh_node, &o2net_handler_tree); ··· 1633 1638 1634 1639 void o2net_unregister_hb_callbacks(void) 1635 1640 { 1636 - o2hb_unregister_callback(&o2net_hb_up); 1637 - o2hb_unregister_callback(&o2net_hb_down); 1641 + o2hb_unregister_callback(NULL, &o2net_hb_up); 1642 + o2hb_unregister_callback(NULL, &o2net_hb_down); 1638 1643 } 1639 1644 1640 1645 int o2net_register_hb_callbacks(void) ··· 1646 1651 o2hb_setup_callback(&o2net_hb_up, O2HB_NODE_UP_CB, 1647 1652 o2net_hb_node_up_cb, NULL, O2NET_HB_PRI); 1648 1653 1649 - ret = o2hb_register_callback(&o2net_hb_up); 1654 + ret = o2hb_register_callback(NULL, &o2net_hb_up); 1650 1655 if (ret == 0) 1651 - ret = o2hb_register_callback(&o2net_hb_down); 1656 + ret = o2hb_register_callback(NULL, &o2net_hb_down); 1652 1657 1653 1658 if (ret) 1654 1659 
o2net_unregister_hb_callbacks();
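A recurring cleanup in this merge (tcp.c above, and the dlm files below) replaces open-coded list_for_each/list_for_each_safe plus list_entry() with list_for_each_entry(_safe), which iterates directly over the containing type and drops the raw iterator variables. A minimal userspace sketch with hand-rolled stand-ins for the kernel's <linux/list.h> macros (struct item and all names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel's intrusive list primitives. */
struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* Old style: walk raw list_heads, then call list_entry() by hand. */
#define list_for_each(pos, head) \
	for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

/* New style: iterate directly over the containing type; the _safe
 * variant caches ->next so the current entry may be unlinked. */
#define list_for_each_entry_safe(pos, n, head, member)                     \
	for ((pos) = list_entry((head)->next, __typeof__(*(pos)), member), \
	     (n) = list_entry((pos)->member.next, __typeof__(*(pos)), member); \
	     &(pos)->member != (head);                                     \
	     (pos) = (n),                                                  \
	     (n) = list_entry((n)->member.next, __typeof__(*(n)), member))

struct item { int val; struct list_head link; };

/* Sum all values, unlinking entries as we go (the same shape as
 * o2net_complete_nodes_nsw() and the dlm teardown loops). */
static int drain_sum(struct list_head *head)
{
	struct item *it, *next;
	int sum = 0;

	list_for_each_entry_safe(it, next, head, link) {
		sum += it->val;
		list_del(&it->link);	/* safe: 'next' was cached */
	}
	return sum;
}
```

The conversion is purely mechanical; behavior is unchanged, but each call site loses two helper variables and a list_entry() line.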
+1 -1
fs/ocfs2/dir.c
···
 368  368 	u32 offset = OCFS2_I(dir)->ip_clusters;
 369  369
 370  370 	status = ocfs2_do_extend_allocation(OCFS2_SB(sb), dir, &offset,
 371       -					    1, parent_fe_bh, handle,
      371 +					    1, 0, parent_fe_bh, handle,
 372  372 					    data_ac, meta_ac, NULL);
 373  373 	BUG_ON(status == -EAGAIN);
 374  374 	if (status < 0) {
+4 -4
fs/ocfs2/dlm/dlmdomain.c
···
 1128 1128
 1129 1129 static void dlm_unregister_domain_handlers(struct dlm_ctxt *dlm)
 1130 1130 {
 1131      -	o2hb_unregister_callback(&dlm->dlm_hb_up);
 1132      -	o2hb_unregister_callback(&dlm->dlm_hb_down);
      1131 +	o2hb_unregister_callback(NULL, &dlm->dlm_hb_up);
      1132 +	o2hb_unregister_callback(NULL, &dlm->dlm_hb_down);
 1133 1133 	o2net_unregister_handler_list(&dlm->dlm_domain_handlers);
 1134 1134 }
 1135 1135
···
 1141 1141
 1142 1142 	o2hb_setup_callback(&dlm->dlm_hb_down, O2HB_NODE_DOWN_CB,
 1143 1143 			    dlm_hb_node_down_cb, dlm, DLM_HB_NODE_DOWN_PRI);
 1144      -	status = o2hb_register_callback(&dlm->dlm_hb_down);
      1144 +	status = o2hb_register_callback(NULL, &dlm->dlm_hb_down);
 1145 1145 	if (status)
 1146 1146 		goto bail;
 1147 1147
 1148 1148 	o2hb_setup_callback(&dlm->dlm_hb_up, O2HB_NODE_UP_CB,
 1149 1149 			    dlm_hb_node_up_cb, dlm, DLM_HB_NODE_UP_PRI);
 1150      -	status = o2hb_register_callback(&dlm->dlm_hb_up);
      1150 +	status = o2hb_register_callback(NULL, &dlm->dlm_hb_up);
 1151 1151 	if (status)
 1152 1152 		goto bail;
 1153 1153
+11 -29
fs/ocfs2/dlm/dlmmaster.c
··· 192 192 static void dlm_dump_mles(struct dlm_ctxt *dlm) 193 193 { 194 194 struct dlm_master_list_entry *mle; 195 - struct list_head *iter; 196 195 197 196 mlog(ML_NOTICE, "dumping all mles for domain %s:\n", dlm->name); 198 197 spin_lock(&dlm->master_lock); 199 - list_for_each(iter, &dlm->master_list) { 200 - mle = list_entry(iter, struct dlm_master_list_entry, list); 198 + list_for_each_entry(mle, &dlm->master_list, list) 201 199 dlm_print_one_mle(mle); 202 - } 203 200 spin_unlock(&dlm->master_lock); 204 201 } 205 202 206 203 int dlm_dump_all_mles(const char __user *data, unsigned int len) 207 204 { 208 - struct list_head *iter; 209 205 struct dlm_ctxt *dlm; 210 206 211 207 spin_lock(&dlm_domain_lock); 212 - list_for_each(iter, &dlm_domains) { 213 - dlm = list_entry (iter, struct dlm_ctxt, list); 208 + list_for_each_entry(dlm, &dlm_domains, list) { 214 209 mlog(ML_NOTICE, "found dlm: %p, name=%s\n", dlm, dlm->name); 215 210 dlm_dump_mles(dlm); 216 211 } ··· 449 454 char *name, unsigned int namelen) 450 455 { 451 456 struct dlm_master_list_entry *tmpmle; 452 - struct list_head *iter; 453 457 454 458 assert_spin_locked(&dlm->master_lock); 455 459 456 - list_for_each(iter, &dlm->master_list) { 457 - tmpmle = list_entry(iter, struct dlm_master_list_entry, list); 460 + list_for_each_entry(tmpmle, &dlm->master_list, list) { 458 461 if (!dlm_mle_equal(dlm, tmpmle, name, namelen)) 459 462 continue; 460 463 dlm_get_mle(tmpmle); ··· 465 472 void dlm_hb_event_notify_attached(struct dlm_ctxt *dlm, int idx, int node_up) 466 473 { 467 474 struct dlm_master_list_entry *mle; 468 - struct list_head *iter; 469 475 470 476 assert_spin_locked(&dlm->spinlock); 471 477 472 - list_for_each(iter, &dlm->mle_hb_events) { 473 - mle = list_entry(iter, struct dlm_master_list_entry, 474 - hb_events); 478 + list_for_each_entry(mle, &dlm->mle_hb_events, hb_events) { 475 479 if (node_up) 476 480 dlm_mle_node_up(dlm, mle, NULL, idx); 477 481 else ··· 2424 2434 int ret; 2425 2435 int i; 2426 
2436 int count = 0; 2427 - struct list_head *queue, *iter; 2437 + struct list_head *queue; 2428 2438 struct dlm_lock *lock; 2429 2439 2430 2440 assert_spin_locked(&res->spinlock); ··· 2443 2453 ret = 0; 2444 2454 queue = &res->granted; 2445 2455 for (i = 0; i < 3; i++) { 2446 - list_for_each(iter, queue) { 2447 - lock = list_entry(iter, struct dlm_lock, list); 2456 + list_for_each_entry(lock, queue, list) { 2448 2457 ++count; 2449 2458 if (lock->ml.node == dlm->node_num) { 2450 2459 mlog(0, "found a lock owned by this node still " ··· 2912 2923 static void dlm_remove_nonlocal_locks(struct dlm_ctxt *dlm, 2913 2924 struct dlm_lock_resource *res) 2914 2925 { 2915 - struct list_head *iter, *iter2; 2916 2926 struct list_head *queue = &res->granted; 2917 2927 int i, bit; 2918 - struct dlm_lock *lock; 2928 + struct dlm_lock *lock, *next; 2919 2929 2920 2930 assert_spin_locked(&res->spinlock); 2921 2931 2922 2932 BUG_ON(res->owner == dlm->node_num); 2923 2933 2924 2934 for (i=0; i<3; i++) { 2925 - list_for_each_safe(iter, iter2, queue) { 2926 - lock = list_entry (iter, struct dlm_lock, list); 2935 + list_for_each_entry_safe(lock, next, queue, list) { 2927 2936 if (lock->ml.node != dlm->node_num) { 2928 2937 mlog(0, "putting lock for node %u\n", 2929 2938 lock->ml.node); ··· 2963 2976 { 2964 2977 int i; 2965 2978 struct list_head *queue = &res->granted; 2966 - struct list_head *iter; 2967 2979 struct dlm_lock *lock; 2968 2980 int nodenum; 2969 2981 ··· 2970 2984 2971 2985 spin_lock(&res->spinlock); 2972 2986 for (i=0; i<3; i++) { 2973 - list_for_each(iter, queue) { 2987 + list_for_each_entry(lock, queue, list) { 2974 2988 /* up to the caller to make sure this node 2975 2989 * is alive */ 2976 - lock = list_entry (iter, struct dlm_lock, list); 2977 2990 if (lock->ml.node != dlm->node_num) { 2978 2991 spin_unlock(&res->spinlock); 2979 2992 return lock->ml.node; ··· 3219 3234 3220 3235 void dlm_clean_master_list(struct dlm_ctxt *dlm, u8 dead_node) 3221 3236 { 3222 - struct 
list_head *iter, *iter2; 3223 - struct dlm_master_list_entry *mle; 3237 + struct dlm_master_list_entry *mle, *next; 3224 3238 struct dlm_lock_resource *res; 3225 3239 unsigned int hash; 3226 3240 ··· 3229 3245 3230 3246 /* clean the master list */ 3231 3247 spin_lock(&dlm->master_lock); 3232 - list_for_each_safe(iter, iter2, &dlm->master_list) { 3233 - mle = list_entry(iter, struct dlm_master_list_entry, list); 3234 - 3248 + list_for_each_entry_safe(mle, next, &dlm->master_list, list) { 3235 3249 BUG_ON(mle->type != DLM_MLE_BLOCK && 3236 3250 mle->type != DLM_MLE_MASTER && 3237 3251 mle->type != DLM_MLE_MIGRATION);
+26 -53
fs/ocfs2/dlm/dlmrecovery.c
··· 158 158 struct dlm_ctxt *dlm = 159 159 container_of(work, struct dlm_ctxt, dispatched_work); 160 160 LIST_HEAD(tmp_list); 161 - struct list_head *iter, *iter2; 162 - struct dlm_work_item *item; 161 + struct dlm_work_item *item, *next; 163 162 dlm_workfunc_t *workfunc; 164 163 int tot=0; 165 164 ··· 166 167 list_splice_init(&dlm->work_list, &tmp_list); 167 168 spin_unlock(&dlm->work_lock); 168 169 169 - list_for_each_safe(iter, iter2, &tmp_list) { 170 + list_for_each_entry(item, &tmp_list, list) { 170 171 tot++; 171 172 } 172 173 mlog(0, "%s: work thread has %d work items\n", dlm->name, tot); 173 174 174 - list_for_each_safe(iter, iter2, &tmp_list) { 175 - item = list_entry(iter, struct dlm_work_item, list); 175 + list_for_each_entry_safe(item, next, &tmp_list, list) { 176 176 workfunc = item->func; 177 177 list_del_init(&item->list); 178 178 ··· 547 549 { 548 550 int status = 0; 549 551 struct dlm_reco_node_data *ndata; 550 - struct list_head *iter; 551 552 int all_nodes_done; 552 553 int destroy = 0; 553 554 int pass = 0; ··· 564 567 565 568 /* safe to access the node data list without a lock, since this 566 569 * process is the only one to change the list */ 567 - list_for_each(iter, &dlm->reco.node_data) { 568 - ndata = list_entry (iter, struct dlm_reco_node_data, list); 570 + list_for_each_entry(ndata, &dlm->reco.node_data, list) { 569 571 BUG_ON(ndata->state != DLM_RECO_NODE_DATA_INIT); 570 572 ndata->state = DLM_RECO_NODE_DATA_REQUESTING; 571 573 ··· 651 655 * done, or if anyone died */ 652 656 all_nodes_done = 1; 653 657 spin_lock(&dlm_reco_state_lock); 654 - list_for_each(iter, &dlm->reco.node_data) { 655 - ndata = list_entry (iter, struct dlm_reco_node_data, list); 656 - 658 + list_for_each_entry(ndata, &dlm->reco.node_data, list) { 657 659 mlog(0, "checking recovery state of node %u\n", 658 660 ndata->node_num); 659 661 switch (ndata->state) { ··· 768 774 769 775 static void dlm_destroy_recovery_area(struct dlm_ctxt *dlm, u8 dead_node) 770 776 { 771 - 
struct list_head *iter, *iter2; 772 - struct dlm_reco_node_data *ndata; 777 + struct dlm_reco_node_data *ndata, *next; 773 778 LIST_HEAD(tmplist); 774 779 775 780 spin_lock(&dlm_reco_state_lock); 776 781 list_splice_init(&dlm->reco.node_data, &tmplist); 777 782 spin_unlock(&dlm_reco_state_lock); 778 783 779 - list_for_each_safe(iter, iter2, &tmplist) { 780 - ndata = list_entry (iter, struct dlm_reco_node_data, list); 784 + list_for_each_entry_safe(ndata, next, &tmplist, list) { 781 785 list_del_init(&ndata->list); 782 786 kfree(ndata); 783 787 } ··· 868 876 struct dlm_lock_resource *res; 869 877 struct dlm_ctxt *dlm; 870 878 LIST_HEAD(resources); 871 - struct list_head *iter; 872 879 int ret; 873 880 u8 dead_node, reco_master; 874 881 int skip_all_done = 0; ··· 911 920 912 921 /* any errors returned will be due to the new_master dying, 913 922 * the dlm_reco_thread should detect this */ 914 - list_for_each(iter, &resources) { 915 - res = list_entry (iter, struct dlm_lock_resource, recovering); 923 + list_for_each_entry(res, &resources, recovering) { 916 924 ret = dlm_send_one_lockres(dlm, res, mres, reco_master, 917 925 DLM_MRES_RECOVERY); 918 926 if (ret < 0) { ··· 973 983 { 974 984 struct dlm_ctxt *dlm = data; 975 985 struct dlm_reco_data_done *done = (struct dlm_reco_data_done *)msg->buf; 976 - struct list_head *iter; 977 986 struct dlm_reco_node_data *ndata = NULL; 978 987 int ret = -EINVAL; 979 988 ··· 989 1000 dlm->reco.dead_node, done->node_idx, dlm->node_num); 990 1001 991 1002 spin_lock(&dlm_reco_state_lock); 992 - list_for_each(iter, &dlm->reco.node_data) { 993 - ndata = list_entry (iter, struct dlm_reco_node_data, list); 1003 + list_for_each_entry(ndata, &dlm->reco.node_data, list) { 994 1004 if (ndata->node_num != done->node_idx) 995 1005 continue; 996 1006 ··· 1037 1049 struct list_head *list, 1038 1050 u8 dead_node) 1039 1051 { 1040 - struct dlm_lock_resource *res; 1041 - struct list_head *iter, *iter2; 1052 + struct dlm_lock_resource *res, *next; 
1042 1053 struct dlm_lock *lock; 1043 1054 1044 1055 spin_lock(&dlm->spinlock); 1045 - list_for_each_safe(iter, iter2, &dlm->reco.resources) { 1046 - res = list_entry (iter, struct dlm_lock_resource, recovering); 1056 + list_for_each_entry_safe(res, next, &dlm->reco.resources, recovering) { 1047 1057 /* always prune any $RECOVERY entries for dead nodes, 1048 1058 * otherwise hangs can occur during later recovery */ 1049 1059 if (dlm_is_recovery_lock(res->lockname.name, ··· 1155 1169 u8 flags, u8 master) 1156 1170 { 1157 1171 /* mres here is one full page */ 1158 - memset(mres, 0, PAGE_SIZE); 1172 + clear_page(mres); 1159 1173 mres->lockname_len = namelen; 1160 1174 memcpy(mres->lockname, lockname, namelen); 1161 1175 mres->num_locks = 0; ··· 1238 1252 struct dlm_migratable_lockres *mres, 1239 1253 u8 send_to, u8 flags) 1240 1254 { 1241 - struct list_head *queue, *iter; 1255 + struct list_head *queue; 1242 1256 int total_locks, i; 1243 1257 u64 mig_cookie = 0; 1244 1258 struct dlm_lock *lock; ··· 1264 1278 total_locks = 0; 1265 1279 for (i=DLM_GRANTED_LIST; i<=DLM_BLOCKED_LIST; i++) { 1266 1280 queue = dlm_list_idx_to_ptr(res, i); 1267 - list_for_each(iter, queue) { 1268 - lock = list_entry (iter, struct dlm_lock, list); 1269 - 1281 + list_for_each_entry(lock, queue, list) { 1270 1282 /* add another lock. 
*/ 1271 1283 total_locks++; 1272 1284 if (!dlm_add_lock_to_array(lock, mres, i)) ··· 1701 1717 struct dlm_lockstatus *lksb = NULL; 1702 1718 int ret = 0; 1703 1719 int i, j, bad; 1704 - struct list_head *iter; 1705 1720 struct dlm_lock *lock = NULL; 1706 1721 u8 from = O2NM_MAX_NODES; 1707 1722 unsigned int added = 0; ··· 1738 1755 spin_lock(&res->spinlock); 1739 1756 for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) { 1740 1757 tmpq = dlm_list_idx_to_ptr(res, j); 1741 - list_for_each(iter, tmpq) { 1742 - lock = list_entry (iter, struct dlm_lock, list); 1758 + list_for_each_entry(lock, tmpq, list) { 1743 1759 if (lock->ml.cookie != ml->cookie) 1744 1760 lock = NULL; 1745 1761 else ··· 1912 1930 struct dlm_lock_resource *res) 1913 1931 { 1914 1932 int i; 1915 - struct list_head *queue, *iter, *iter2; 1916 - struct dlm_lock *lock; 1933 + struct list_head *queue; 1934 + struct dlm_lock *lock, *next; 1917 1935 1918 1936 res->state |= DLM_LOCK_RES_RECOVERING; 1919 1937 if (!list_empty(&res->recovering)) { ··· 1929 1947 /* find any pending locks and put them back on proper list */ 1930 1948 for (i=DLM_BLOCKED_LIST; i>=DLM_GRANTED_LIST; i--) { 1931 1949 queue = dlm_list_idx_to_ptr(res, i); 1932 - list_for_each_safe(iter, iter2, queue) { 1933 - lock = list_entry (iter, struct dlm_lock, list); 1950 + list_for_each_entry_safe(lock, next, queue, list) { 1934 1951 dlm_lock_get(lock); 1935 1952 if (lock->convert_pending) { 1936 1953 /* move converting lock back to granted */ ··· 1994 2013 u8 dead_node, u8 new_master) 1995 2014 { 1996 2015 int i; 1997 - struct list_head *iter, *iter2; 1998 2016 struct hlist_node *hash_iter; 1999 2017 struct hlist_head *bucket; 2000 - 2001 - struct dlm_lock_resource *res; 2018 + struct dlm_lock_resource *res, *next; 2002 2019 2003 2020 mlog_entry_void(); 2004 2021 2005 2022 assert_spin_locked(&dlm->spinlock); 2006 2023 2007 - list_for_each_safe(iter, iter2, &dlm->reco.resources) { 2008 - res = list_entry (iter, struct dlm_lock_resource, 
recovering); 2024 + list_for_each_entry_safe(res, next, &dlm->reco.resources, recovering) { 2009 2025 if (res->owner == dead_node) { 2010 2026 list_del_init(&res->recovering); 2011 2027 spin_lock(&res->spinlock); ··· 2077 2099 static void dlm_revalidate_lvb(struct dlm_ctxt *dlm, 2078 2100 struct dlm_lock_resource *res, u8 dead_node) 2079 2101 { 2080 - struct list_head *iter, *queue; 2102 + struct list_head *queue; 2081 2103 struct dlm_lock *lock; 2082 2104 int blank_lvb = 0, local = 0; 2083 2105 int i; ··· 2099 2121 2100 2122 for (i=DLM_GRANTED_LIST; i<=DLM_CONVERTING_LIST; i++) { 2101 2123 queue = dlm_list_idx_to_ptr(res, i); 2102 - list_for_each(iter, queue) { 2103 - lock = list_entry (iter, struct dlm_lock, list); 2124 + list_for_each_entry(lock, queue, list) { 2104 2125 if (lock->ml.node == search_node) { 2105 2126 if (dlm_lvb_needs_invalidation(lock, local)) { 2106 2127 /* zero the lksb lvb and lockres lvb */ ··· 2120 2143 static void dlm_free_dead_locks(struct dlm_ctxt *dlm, 2121 2144 struct dlm_lock_resource *res, u8 dead_node) 2122 2145 { 2123 - struct list_head *iter, *tmpiter; 2124 - struct dlm_lock *lock; 2146 + struct dlm_lock *lock, *next; 2125 2147 unsigned int freed = 0; 2126 2148 2127 2149 /* this node is the lockres master: ··· 2131 2155 assert_spin_locked(&res->spinlock); 2132 2156 2133 2157 /* TODO: check pending_asts, pending_basts here */ 2134 - list_for_each_safe(iter, tmpiter, &res->granted) { 2135 - lock = list_entry (iter, struct dlm_lock, list); 2158 + list_for_each_entry_safe(lock, next, &res->granted, list) { 2136 2159 if (lock->ml.node == dead_node) { 2137 2160 list_del_init(&lock->list); 2138 2161 dlm_lock_put(lock); 2139 2162 freed++; 2140 2163 } 2141 2164 } 2142 - list_for_each_safe(iter, tmpiter, &res->converting) { 2143 - lock = list_entry (iter, struct dlm_lock, list); 2165 + list_for_each_entry_safe(lock, next, &res->converting, list) { 2144 2166 if (lock->ml.node == dead_node) { 2145 2167 list_del_init(&lock->list); 2146 2168 
dlm_lock_put(lock); 2147 2169 freed++; 2148 2170 } 2149 2171 } 2150 - list_for_each_safe(iter, tmpiter, &res->blocked) { 2151 - lock = list_entry (iter, struct dlm_lock, list); 2172 + list_for_each_entry_safe(lock, next, &res->blocked, list) { 2152 2173 if (lock->ml.node == dead_node) { 2153 2174 list_del_init(&lock->list); 2154 2175 dlm_lock_put(lock);
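The [KJ PATCH] hunk in dlmrecovery.c swaps memset(mres, 0, PAGE_SIZE) for clear_page(mres): identical semantics for a page-aligned, page-sized buffer, but the helper states intent and lets the architecture supply an optimized zeroing routine. A userspace sketch of the pattern (the clear_page stand-in, the 4096 value, and init_message_page are illustrative, not kernel code):

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096	/* illustrative; the real value is per-arch */

/* Userspace stand-in: in the kernel, clear_page() may dispatch to an
 * arch-optimized routine, but its contract is simply "zero one
 * page-sized, page-aligned buffer". */
static void clear_page(void *page)
{
	memset(page, 0, PAGE_SIZE);
}

/* The converted call site zeroes a full page before filling in a
 * header, as dlm_init_migratable_lockres() does with mres. */
static void init_message_page(unsigned char *page,
			      const char *name, size_t namelen)
{
	clear_page(page);		/* was: memset(page, 0, PAGE_SIZE) */
	memcpy(page, name, namelen);
}
```

The substitution is only valid because mres is documented as "one full page"; for partial or unaligned buffers, memset() remains the right call.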
+2 -4
fs/ocfs2/dlmglue.c
···
 600  600 static void lockres_set_flags(struct ocfs2_lock_res *lockres,
 601  601 			      unsigned long newflags)
 602  602 {
 603       -	struct list_head *pos, *tmp;
 604       -	struct ocfs2_mask_waiter *mw;
      603 +	struct ocfs2_mask_waiter *mw, *tmp;
 605  604
 606  605 	assert_spin_locked(&lockres->l_lock);
 607  606
 608  607 	lockres->l_flags = newflags;
 609  608
 610       -	list_for_each_safe(pos, tmp, &lockres->l_mask_waiters) {
 611       -		mw = list_entry(pos, struct ocfs2_mask_waiter, mw_item);
      609 +	list_for_each_entry_safe(mw, tmp, &lockres->l_mask_waiters, mw_item) {
 612  610 		if ((lockres->l_flags & mw->mw_mask) != mw->mw_goal)
 613  611 			continue;
 614  612
+5
fs/ocfs2/endian.h
···
 32  32 	*var = cpu_to_le32(le32_to_cpu(*var) + val);
 33  33 }
 34  34
     35 +static inline void le64_add_cpu(__le64 *var, u64 val)
     36 +{
     37 +	*var = cpu_to_le64(le64_to_cpu(*var) + val);
     38 +}
     39 +
 35  40 static inline void le32_and_cpu(__le32 *var, u32 val)
 36  41 {
 37  42 	*var = cpu_to_le32(le32_to_cpu(*var) & val);
+3 -38
fs/ocfs2/extent_map.c
··· 109 109 */ 110 110 void ocfs2_extent_map_trunc(struct inode *inode, unsigned int cpos) 111 111 { 112 - struct list_head *p, *n; 113 - struct ocfs2_extent_map_item *emi; 112 + struct ocfs2_extent_map_item *emi, *n; 114 113 struct ocfs2_inode_info *oi = OCFS2_I(inode); 115 114 struct ocfs2_extent_map *em = &oi->ip_extent_map; 116 115 LIST_HEAD(tmp_list); 117 116 unsigned int range; 118 117 119 118 spin_lock(&oi->ip_lock); 120 - list_for_each_safe(p, n, &em->em_list) { 121 - emi = list_entry(p, struct ocfs2_extent_map_item, ei_list); 122 - 119 + list_for_each_entry_safe(emi, n, &em->em_list, ei_list) { 123 120 if (emi->ei_cpos >= cpos) { 124 121 /* Full truncate of this record. */ 125 122 list_move(&emi->ei_list, &tmp_list); ··· 133 136 } 134 137 spin_unlock(&oi->ip_lock); 135 138 136 - list_for_each_safe(p, n, &tmp_list) { 137 - emi = list_entry(p, struct ocfs2_extent_map_item, ei_list); 139 + list_for_each_entry_safe(emi, n, &tmp_list, ei_list) { 138 140 list_del(&emi->ei_list); 139 141 kfree(emi); 140 142 } ··· 370 374 ret = 0; 371 375 out: 372 376 brelse(next_eb_bh); 373 - return ret; 374 - } 375 - 376 - /* 377 - * Return the index of the extent record which contains cluster #v_cluster. 378 - * -1 is returned if it was not found. 379 - * 380 - * Should work fine on interior and exterior nodes. 381 - */ 382 - static int ocfs2_search_extent_list(struct ocfs2_extent_list *el, 383 - u32 v_cluster) 384 - { 385 - int ret = -1; 386 - int i; 387 - struct ocfs2_extent_rec *rec; 388 - u32 rec_end, rec_start, clusters; 389 - 390 - for(i = 0; i < le16_to_cpu(el->l_next_free_rec); i++) { 391 - rec = &el->l_recs[i]; 392 - 393 - rec_start = le32_to_cpu(rec->e_cpos); 394 - clusters = ocfs2_rec_clusters(el, rec); 395 - 396 - rec_end = rec_start + clusters; 397 - 398 - if (v_cluster >= rec_start && v_cluster < rec_end) { 399 - ret = i; 400 - break; 401 - } 402 - } 403 - 404 377 return ret; 405 378 } 406 379
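The extent_map.c hunk above removes its private copy of ocfs2_search_extent_list(), presumably consolidated with the shared btree code elsewhere in this series. The logic itself is a linear scan for the record whose half-open cluster range contains the target. A userspace sketch of that search (simplified record type with plain host-order integers instead of the on-disk little-endian fields):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified extent record: starting virtual cluster plus length. */
struct extent_rec {
	uint32_t cpos;		/* first virtual cluster covered */
	uint32_t clusters;	/* number of clusters in the record */
};

/* Return the index of the record containing v_cluster, or -1 if no
 * record covers it. Mirrors the removed ocfs2_search_extent_list():
 * each record covers the half-open range [cpos, cpos + clusters). */
static int search_extent_list(const struct extent_rec *recs, int nr,
			      uint32_t v_cluster)
{
	int i;

	for (i = 0; i < nr; i++) {
		uint32_t start = recs[i].cpos;
		uint32_t end = start + recs[i].clusters;

		if (v_cluster >= start && v_cluster < end)
			return i;
	}
	return -1;
}
```

A miss (-1) is meaningful on sparse file systems: a virtual cluster falling between records is simply a hole.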
+601 -101
fs/ocfs2/file.c
··· 263 263 int status; 264 264 handle_t *handle; 265 265 struct ocfs2_dinode *di; 266 + u64 cluster_bytes; 266 267 267 268 mlog_entry_void(); 268 269 ··· 287 286 /* 288 287 * Do this before setting i_size. 289 288 */ 290 - status = ocfs2_zero_tail_for_truncate(inode, handle, new_i_size); 289 + cluster_bytes = ocfs2_align_bytes_to_clusters(inode->i_sb, new_i_size); 290 + status = ocfs2_zero_range_for_truncate(inode, handle, new_i_size, 291 + cluster_bytes); 291 292 if (status) { 292 293 mlog_errno(status); 293 294 goto out_commit; ··· 329 326 (unsigned long long)OCFS2_I(inode)->ip_blkno, 330 327 (unsigned long long)new_i_size); 331 328 332 - unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1); 333 - truncate_inode_pages(inode->i_mapping, new_i_size); 334 - 335 329 fe = (struct ocfs2_dinode *) di_bh->b_data; 336 330 if (!OCFS2_IS_VALID_DINODE(fe)) { 337 331 OCFS2_RO_ON_INVALID_DINODE(inode->i_sb, fe); ··· 363 363 if (new_i_size == le64_to_cpu(fe->i_size)) 364 364 goto bail; 365 365 366 + down_write(&OCFS2_I(inode)->ip_alloc_sem); 367 + 366 368 /* This forces other nodes to sync and drop their pages. Do 367 369 * this even if we have a truncate without allocation change - 368 370 * ocfs2 cluster sizes can be much greater than page size, so 369 371 * we have to truncate them anyway. */ 370 372 status = ocfs2_data_lock(inode, 1); 371 373 if (status < 0) { 374 + up_write(&OCFS2_I(inode)->ip_alloc_sem); 375 + 372 376 mlog_errno(status); 373 377 goto bail; 374 378 } 379 + 380 + unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1); 381 + truncate_inode_pages(inode->i_mapping, new_i_size); 375 382 376 383 /* alright, we're going to need to do a full blown alloc size 377 384 * change. 
Orphan the inode so that recovery can complete the ··· 406 399 bail_unlock_data: 407 400 ocfs2_data_unlock(inode, 1); 408 401 402 + up_write(&OCFS2_I(inode)->ip_alloc_sem); 403 + 409 404 bail: 410 405 411 406 mlog_exit(status); ··· 428 419 struct inode *inode, 429 420 u32 *logical_offset, 430 421 u32 clusters_to_add, 422 + int mark_unwritten, 431 423 struct buffer_head *fe_bh, 432 424 handle_t *handle, 433 425 struct ocfs2_alloc_context *data_ac, ··· 441 431 enum ocfs2_alloc_restarted reason = RESTART_NONE; 442 432 u32 bit_off, num_bits; 443 433 u64 block; 434 + u8 flags = 0; 444 435 445 436 BUG_ON(!clusters_to_add); 437 + 438 + if (mark_unwritten) 439 + flags = OCFS2_EXT_UNWRITTEN; 446 440 447 441 free_extents = ocfs2_num_free_extents(osb, inode, fe); 448 442 if (free_extents < 0) { ··· 497 483 num_bits, bit_off, (unsigned long long)OCFS2_I(inode)->ip_blkno); 498 484 status = ocfs2_insert_extent(osb, handle, inode, fe_bh, 499 485 *logical_offset, block, num_bits, 500 - meta_ac); 486 + flags, meta_ac); 501 487 if (status < 0) { 502 488 mlog_errno(status); 503 489 goto leave; ··· 530 516 * For a given allocation, determine which allocators will need to be 531 517 * accessed, and lock them, reserving the appropriate number of bits. 532 518 * 533 - * Called from ocfs2_extend_allocation() for file systems which don't 534 - * support holes, and from ocfs2_write() for file systems which 535 - * understand sparse inodes. 519 + * Sparse file systems call this from ocfs2_write_begin_nolock() 520 + * and ocfs2_allocate_unwritten_extents(). 521 + * 522 + * File systems which don't support holes call this from 523 + * ocfs2_extend_allocation(). 
536 524 */ 537 525 int ocfs2_lock_allocators(struct inode *inode, struct ocfs2_dinode *di, 538 - u32 clusters_to_add, 526 + u32 clusters_to_add, u32 extents_to_split, 539 527 struct ocfs2_alloc_context **data_ac, 540 528 struct ocfs2_alloc_context **meta_ac) 541 529 { 542 - int ret, num_free_extents; 530 + int ret = 0, num_free_extents; 531 + unsigned int max_recs_needed = clusters_to_add + 2 * extents_to_split; 543 532 struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 544 533 545 534 *meta_ac = NULL; 546 - *data_ac = NULL; 535 + if (data_ac) 536 + *data_ac = NULL; 537 + 538 + BUG_ON(clusters_to_add != 0 && data_ac == NULL); 547 539 548 540 mlog(0, "extend inode %llu, i_size = %lld, di->i_clusters = %u, " 549 - "clusters_to_add = %u\n", 541 + "clusters_to_add = %u, extents_to_split = %u\n", 550 542 (unsigned long long)OCFS2_I(inode)->ip_blkno, i_size_read(inode), 551 - le32_to_cpu(di->i_clusters), clusters_to_add); 543 + le32_to_cpu(di->i_clusters), clusters_to_add, extents_to_split); 552 544 553 545 num_free_extents = ocfs2_num_free_extents(osb, inode, di); 554 546 if (num_free_extents < 0) { ··· 572 552 * 573 553 * Most of the time we'll only be seeing this 1 cluster at a time 574 554 * anyway. 555 + * 556 + * Always lock for any unwritten extents - we might want to 557 + * add blocks during a split. 
575 558 */ 576 559 if (!num_free_extents || 577 - (ocfs2_sparse_alloc(osb) && num_free_extents < clusters_to_add)) { 560 + (ocfs2_sparse_alloc(osb) && num_free_extents < max_recs_needed)) { 578 561 ret = ocfs2_reserve_new_metadata(osb, di, meta_ac); 579 562 if (ret < 0) { 580 563 if (ret != -ENOSPC) ··· 585 562 goto out; 586 563 } 587 564 } 565 + 566 + if (clusters_to_add == 0) 567 + goto out; 588 568 589 569 ret = ocfs2_reserve_clusters(osb, clusters_to_add, data_ac); 590 570 if (ret < 0) { ··· 611 585 return ret; 612 586 } 613 587 614 - static int ocfs2_extend_allocation(struct inode *inode, 615 - u32 clusters_to_add) 588 + static int __ocfs2_extend_allocation(struct inode *inode, u32 logical_start, 589 + u32 clusters_to_add, int mark_unwritten) 616 590 { 617 591 int status = 0; 618 592 int restart_func = 0; 619 - int drop_alloc_sem = 0; 620 593 int credits; 621 - u32 prev_clusters, logical_start; 594 + u32 prev_clusters; 622 595 struct buffer_head *bh = NULL; 623 596 struct ocfs2_dinode *fe = NULL; 624 597 handle_t *handle = NULL; ··· 632 607 * This function only exists for file systems which don't 633 608 * support holes. 634 609 */ 635 - BUG_ON(ocfs2_sparse_alloc(osb)); 610 + BUG_ON(mark_unwritten && !ocfs2_sparse_alloc(osb)); 636 611 637 612 status = ocfs2_read_block(osb, OCFS2_I(inode)->ip_blkno, &bh, 638 613 OCFS2_BH_CACHED, inode); ··· 648 623 goto leave; 649 624 } 650 625 651 - logical_start = OCFS2_I(inode)->ip_clusters; 652 - 653 626 restart_all: 654 627 BUG_ON(le32_to_cpu(fe->i_clusters) != OCFS2_I(inode)->ip_clusters); 655 628 656 - /* blocks peope in read/write from reading our allocation 657 - * until we're done changing it. We depend on i_mutex to block 658 - * other extend/truncate calls while we're here. Ordering wrt 659 - * start_trans is important here -- always do it before! 
*/ 660 - down_write(&OCFS2_I(inode)->ip_alloc_sem); 661 - drop_alloc_sem = 1; 662 - 663 - status = ocfs2_lock_allocators(inode, fe, clusters_to_add, &data_ac, 629 + status = ocfs2_lock_allocators(inode, fe, clusters_to_add, 0, &data_ac, 664 630 &meta_ac); 665 631 if (status) { 666 632 mlog_errno(status); ··· 684 668 inode, 685 669 &logical_start, 686 670 clusters_to_add, 671 + mark_unwritten, 687 672 bh, 688 673 handle, 689 674 data_ac, ··· 737 720 OCFS2_I(inode)->ip_clusters, i_size_read(inode)); 738 721 739 722 leave: 740 - if (drop_alloc_sem) { 741 - up_write(&OCFS2_I(inode)->ip_alloc_sem); 742 - drop_alloc_sem = 0; 743 - } 744 723 if (handle) { 745 724 ocfs2_commit_trans(osb, handle); 746 725 handle = NULL; ··· 760 747 761 748 mlog_exit(status); 762 749 return status; 750 + } 751 + 752 + static int ocfs2_extend_allocation(struct inode *inode, u32 logical_start, 753 + u32 clusters_to_add, int mark_unwritten) 754 + { 755 + int ret; 756 + 757 + /* 758 + * The alloc sem blocks people in read/write from reading our 759 + * allocation until we're done changing it. We depend on 760 + * i_mutex to block other extend/truncate calls while we're 761 + * here. 762 + */ 763 + down_write(&OCFS2_I(inode)->ip_alloc_sem); 764 + ret = __ocfs2_extend_allocation(inode, logical_start, clusters_to_add, 765 + mark_unwritten); 766 + up_write(&OCFS2_I(inode)->ip_alloc_sem); 767 + 768 + return ret; 763 769 }
1001 + * Otherwise, we could get into problems with truncate as 1002 + * ip_alloc_sem is used there to protect against i_size 1003 + * changes. 1004 + */ 1032 1005 status = inode_setattr(inode, attr); 1033 1006 if (status < 0) { 1034 1007 mlog_errno(status); ··· 1111 1070 return ret; 1112 1071 } 1113 1072 1114 - static int ocfs2_write_remove_suid(struct inode *inode) 1073 + static int __ocfs2_write_remove_suid(struct inode *inode, 1074 + struct buffer_head *bh) 1115 1075 { 1116 1076 int ret; 1117 - struct buffer_head *bh = NULL; 1118 - struct ocfs2_inode_info *oi = OCFS2_I(inode); 1119 1077 handle_t *handle; 1120 1078 struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 1121 1079 struct ocfs2_dinode *di; 1122 1080 1123 1081 mlog_entry("(Inode %llu, mode 0%o)\n", 1124 - (unsigned long long)oi->ip_blkno, inode->i_mode); 1082 + (unsigned long long)OCFS2_I(inode)->ip_blkno, inode->i_mode); 1125 1083 1126 1084 handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS); 1127 1085 if (handle == NULL) { ··· 1129 1089 goto out; 1130 1090 } 1131 1091 1132 - ret = ocfs2_read_block(osb, oi->ip_blkno, &bh, OCFS2_BH_CACHED, inode); 1133 - if (ret < 0) { 1134 - mlog_errno(ret); 1135 - goto out_trans; 1136 - } 1137 - 1138 1092 ret = ocfs2_journal_access(handle, inode, bh, 1139 1093 OCFS2_JOURNAL_ACCESS_WRITE); 1140 1094 if (ret < 0) { 1141 1095 mlog_errno(ret); 1142 - goto out_bh; 1096 + goto out_trans; 1143 1097 } 1144 1098 1145 1099 inode->i_mode &= ~S_ISUID; ··· 1146 1112 ret = ocfs2_journal_dirty(handle, bh); 1147 1113 if (ret < 0) 1148 1114 mlog_errno(ret); 1149 - out_bh: 1150 - brelse(bh); 1115 + 1151 1116 out_trans: 1152 1117 ocfs2_commit_trans(osb, handle); 1153 1118 out: ··· 1188 1155 clusters -= extent_len; 1189 1156 cpos += extent_len; 1190 1157 } 1158 + out: 1159 + return ret; 1160 + } 1161 + 1162 + static int ocfs2_write_remove_suid(struct inode *inode) 1163 + { 1164 + int ret; 1165 + struct buffer_head *bh = NULL; 1166 + struct ocfs2_inode_info *oi = OCFS2_I(inode); 
1167 + 1168 + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), 1169 + oi->ip_blkno, &bh, OCFS2_BH_CACHED, inode); 1170 + if (ret < 0) { 1171 + mlog_errno(ret); 1172 + goto out; 1173 + } 1174 + 1175 + ret = __ocfs2_write_remove_suid(inode, bh); 1176 + out: 1177 + brelse(bh); 1178 + return ret; 1179 + } 1180 + 1181 + /* 1182 + * Allocate enough extents to cover the region starting at byte offset 1183 + * start for len bytes. Existing extents are skipped, any extents 1184 + * added are marked as "unwritten". 1185 + */ 1186 + static int ocfs2_allocate_unwritten_extents(struct inode *inode, 1187 + u64 start, u64 len) 1188 + { 1189 + int ret; 1190 + u32 cpos, phys_cpos, clusters, alloc_size; 1191 + 1192 + /* 1193 + * We consider both start and len to be inclusive. 1194 + */ 1195 + cpos = start >> OCFS2_SB(inode->i_sb)->s_clustersize_bits; 1196 + clusters = ocfs2_clusters_for_bytes(inode->i_sb, start + len); 1197 + clusters -= cpos; 1198 + 1199 + while (clusters) { 1200 + ret = ocfs2_get_clusters(inode, cpos, &phys_cpos, 1201 + &alloc_size, NULL); 1202 + if (ret) { 1203 + mlog_errno(ret); 1204 + goto out; 1205 + } 1206 + 1207 + /* 1208 + * Hole or existing extent len can be arbitrary, so 1209 + * cap it to our own allocation request. 1210 + */ 1211 + if (alloc_size > clusters) 1212 + alloc_size = clusters; 1213 + 1214 + if (phys_cpos) { 1215 + /* 1216 + * We already have an allocation at this 1217 + * region so we can safely skip it. 
1218 + */ 1219 + goto next; 1220 + } 1221 + 1222 + ret = __ocfs2_extend_allocation(inode, cpos, alloc_size, 1); 1223 + if (ret) { 1224 + if (ret != -ENOSPC) 1225 + mlog_errno(ret); 1226 + goto out; 1227 + } 1228 + 1229 + next: 1230 + cpos += alloc_size; 1231 + clusters -= alloc_size; 1232 + } 1233 + 1234 + ret = 0; 1235 + out: 1236 + return ret; 1237 + } 1238 + 1239 + static int __ocfs2_remove_inode_range(struct inode *inode, 1240 + struct buffer_head *di_bh, 1241 + u32 cpos, u32 phys_cpos, u32 len, 1242 + struct ocfs2_cached_dealloc_ctxt *dealloc) 1243 + { 1244 + int ret; 1245 + u64 phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos); 1246 + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 1247 + struct inode *tl_inode = osb->osb_tl_inode; 1248 + handle_t *handle; 1249 + struct ocfs2_alloc_context *meta_ac = NULL; 1250 + struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data; 1251 + 1252 + ret = ocfs2_lock_allocators(inode, di, 0, 1, NULL, &meta_ac); 1253 + if (ret) { 1254 + mlog_errno(ret); 1255 + return ret; 1256 + } 1257 + 1258 + mutex_lock(&tl_inode->i_mutex); 1259 + 1260 + if (ocfs2_truncate_log_needs_flush(osb)) { 1261 + ret = __ocfs2_flush_truncate_log(osb); 1262 + if (ret < 0) { 1263 + mlog_errno(ret); 1264 + goto out; 1265 + } 1266 + } 1267 + 1268 + handle = ocfs2_start_trans(osb, OCFS2_REMOVE_EXTENT_CREDITS); 1269 + if (handle == NULL) { 1270 + ret = -ENOMEM; 1271 + mlog_errno(ret); 1272 + goto out; 1273 + } 1274 + 1275 + ret = ocfs2_journal_access(handle, inode, di_bh, 1276 + OCFS2_JOURNAL_ACCESS_WRITE); 1277 + if (ret) { 1278 + mlog_errno(ret); 1279 + goto out; 1280 + } 1281 + 1282 + ret = ocfs2_remove_extent(inode, di_bh, cpos, len, handle, meta_ac, 1283 + dealloc); 1284 + if (ret) { 1285 + mlog_errno(ret); 1286 + goto out_commit; 1287 + } 1288 + 1289 + OCFS2_I(inode)->ip_clusters -= len; 1290 + di->i_clusters = cpu_to_le32(OCFS2_I(inode)->ip_clusters); 1291 + 1292 + ret = ocfs2_journal_dirty(handle, di_bh); 1293 + if (ret) { 1294 + 
mlog_errno(ret); 1295 + goto out_commit; 1296 + } 1297 + 1298 + ret = ocfs2_truncate_log_append(osb, handle, phys_blkno, len); 1299 + if (ret) 1300 + mlog_errno(ret); 1301 + 1302 + out_commit: 1303 + ocfs2_commit_trans(osb, handle); 1304 + out: 1305 + mutex_unlock(&tl_inode->i_mutex); 1306 + 1307 + if (meta_ac) 1308 + ocfs2_free_alloc_context(meta_ac); 1309 + 1310 + return ret; 1311 + } 1312 + 1313 + /* 1314 + * Truncate a byte range, avoiding pages within partial clusters. This 1315 + * preserves those pages for the zeroing code to write to. 1316 + */ 1317 + static void ocfs2_truncate_cluster_pages(struct inode *inode, u64 byte_start, 1318 + u64 byte_len) 1319 + { 1320 + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 1321 + loff_t start, end; 1322 + struct address_space *mapping = inode->i_mapping; 1323 + 1324 + start = (loff_t)ocfs2_align_bytes_to_clusters(inode->i_sb, byte_start); 1325 + end = byte_start + byte_len; 1326 + end = end & ~(osb->s_clustersize - 1); 1327 + 1328 + if (start < end) { 1329 + unmap_mapping_range(mapping, start, end - start, 0); 1330 + truncate_inode_pages_range(mapping, start, end - 1); 1331 + } 1332 + } 1333 + 1334 + static int ocfs2_zero_partial_clusters(struct inode *inode, 1335 + u64 start, u64 len) 1336 + { 1337 + int ret = 0; 1338 + u64 tmpend, end = start + len; 1339 + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 1340 + unsigned int csize = osb->s_clustersize; 1341 + handle_t *handle; 1342 + 1343 + /* 1344 + * The "start" and "end" values are NOT necessarily part of 1345 + * the range whose allocation is being deleted. Rather, this 1346 + * is what the user passed in with the request. We must zero 1347 + * partial clusters here. There's no need to worry about 1348 + * physical allocation - the zeroing code knows to skip holes. 
1349 + */ 1350 + mlog(0, "byte start: %llu, end: %llu\n", 1351 + (unsigned long long)start, (unsigned long long)end); 1352 + 1353 + /* 1354 + * If both edges are on a cluster boundary then there's no 1355 + * zeroing required as the region is part of the allocation to 1356 + * be truncated. 1357 + */ 1358 + if ((start & (csize - 1)) == 0 && (end & (csize - 1)) == 0) 1359 + goto out; 1360 + 1361 + handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS); 1362 + if (handle == NULL) { 1363 + ret = -ENOMEM; 1364 + mlog_errno(ret); 1365 + goto out; 1366 + } 1367 + 1368 + /* 1369 + * We want to get the byte offset of the end of the 1st cluster. 1370 + */ 1371 + tmpend = (u64)osb->s_clustersize + (start & ~(osb->s_clustersize - 1)); 1372 + if (tmpend > end) 1373 + tmpend = end; 1374 + 1375 + mlog(0, "1st range: start: %llu, tmpend: %llu\n", 1376 + (unsigned long long)start, (unsigned long long)tmpend); 1377 + 1378 + ret = ocfs2_zero_range_for_truncate(inode, handle, start, tmpend); 1379 + if (ret) 1380 + mlog_errno(ret); 1381 + 1382 + if (tmpend < end) { 1383 + /* 1384 + * This may make start and end equal, but the zeroing 1385 + * code will skip any work in that case so there's no 1386 + * need to catch it up here. 
1387 + */ 1388 + start = end & ~(osb->s_clustersize - 1); 1389 + 1390 + mlog(0, "2nd range: start: %llu, end: %llu\n", 1391 + (unsigned long long)start, (unsigned long long)end); 1392 + 1393 + ret = ocfs2_zero_range_for_truncate(inode, handle, start, end); 1394 + if (ret) 1395 + mlog_errno(ret); 1396 + } 1397 + 1398 + ocfs2_commit_trans(osb, handle); 1399 + out: 1400 + return ret; 1401 + } 1402 + 1403 + static int ocfs2_remove_inode_range(struct inode *inode, 1404 + struct buffer_head *di_bh, u64 byte_start, 1405 + u64 byte_len) 1406 + { 1407 + int ret = 0; 1408 + u32 trunc_start, trunc_len, cpos, phys_cpos, alloc_size; 1409 + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 1410 + struct ocfs2_cached_dealloc_ctxt dealloc; 1411 + 1412 + ocfs2_init_dealloc_ctxt(&dealloc); 1413 + 1414 + if (byte_len == 0) 1415 + return 0; 1416 + 1417 + trunc_start = ocfs2_clusters_for_bytes(osb->sb, byte_start); 1418 + trunc_len = (byte_start + byte_len) >> osb->s_clustersize_bits; 1419 + if (trunc_len >= trunc_start) 1420 + trunc_len -= trunc_start; 1421 + else 1422 + trunc_len = 0; 1423 + 1424 + mlog(0, "Inode: %llu, start: %llu, len: %llu, cstart: %u, clen: %u\n", 1425 + (unsigned long long)OCFS2_I(inode)->ip_blkno, 1426 + (unsigned long long)byte_start, 1427 + (unsigned long long)byte_len, trunc_start, trunc_len); 1428 + 1429 + ret = ocfs2_zero_partial_clusters(inode, byte_start, byte_len); 1430 + if (ret) { 1431 + mlog_errno(ret); 1432 + goto out; 1433 + } 1434 + 1435 + cpos = trunc_start; 1436 + while (trunc_len) { 1437 + ret = ocfs2_get_clusters(inode, cpos, &phys_cpos, 1438 + &alloc_size, NULL); 1439 + if (ret) { 1440 + mlog_errno(ret); 1441 + goto out; 1442 + } 1443 + 1444 + if (alloc_size > trunc_len) 1445 + alloc_size = trunc_len; 1446 + 1447 + /* Only do work for non-holes */ 1448 + if (phys_cpos != 0) { 1449 + ret = __ocfs2_remove_inode_range(inode, di_bh, cpos, 1450 + phys_cpos, alloc_size, 1451 + &dealloc); 1452 + if (ret) { 1453 + mlog_errno(ret); 1454 + goto out; 
1455 + } 1456 + } 1457 + 1458 + cpos += alloc_size; 1459 + trunc_len -= alloc_size; 1460 + } 1461 + 1462 + ocfs2_truncate_cluster_pages(inode, byte_start, byte_len); 1463 + 1464 + out: 1465 + ocfs2_schedule_truncate_log_flush(osb, 1); 1466 + ocfs2_run_deallocs(osb, &dealloc); 1467 + 1468 + return ret; 1469 + } 1470 + 1471 + /* 1472 + * Parts of this function taken from xfs_change_file_space() 1473 + */ 1474 + int ocfs2_change_file_space(struct file *file, unsigned int cmd, 1475 + struct ocfs2_space_resv *sr) 1476 + { 1477 + int ret; 1478 + s64 llen; 1479 + struct inode *inode = file->f_path.dentry->d_inode; 1480 + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 1481 + struct buffer_head *di_bh = NULL; 1482 + handle_t *handle; 1483 + unsigned long long max_off = ocfs2_max_file_offset(inode->i_sb->s_blocksize_bits); 1484 + 1485 + if ((cmd == OCFS2_IOC_RESVSP || cmd == OCFS2_IOC_RESVSP64) && 1486 + !ocfs2_writes_unwritten_extents(osb)) 1487 + return -ENOTTY; 1488 + else if ((cmd == OCFS2_IOC_UNRESVSP || cmd == OCFS2_IOC_UNRESVSP64) && 1489 + !ocfs2_sparse_alloc(osb)) 1490 + return -ENOTTY; 1491 + 1492 + if (!S_ISREG(inode->i_mode)) 1493 + return -EINVAL; 1494 + 1495 + if (!(file->f_mode & FMODE_WRITE)) 1496 + return -EBADF; 1497 + 1498 + if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb)) 1499 + return -EROFS; 1500 + 1501 + mutex_lock(&inode->i_mutex); 1502 + 1503 + /* 1504 + * This prevents concurrent writes on other nodes 1505 + */ 1506 + ret = ocfs2_rw_lock(inode, 1); 1507 + if (ret) { 1508 + mlog_errno(ret); 1509 + goto out; 1510 + } 1511 + 1512 + ret = ocfs2_meta_lock(inode, &di_bh, 1); 1513 + if (ret) { 1514 + mlog_errno(ret); 1515 + goto out_rw_unlock; 1516 + } 1517 + 1518 + if (inode->i_flags & (S_IMMUTABLE|S_APPEND)) { 1519 + ret = -EPERM; 1520 + goto out_meta_unlock; 1521 + } 1522 + 1523 + switch (sr->l_whence) { 1524 + case 0: /*SEEK_SET*/ 1525 + break; 1526 + case 1: /*SEEK_CUR*/ 1527 + sr->l_start += file->f_pos; 1528 + break; 1529 + case 2: 
/*SEEK_END*/ 1530 + sr->l_start += i_size_read(inode); 1531 + break; 1532 + default: 1533 + ret = -EINVAL; 1534 + goto out_meta_unlock; 1535 + } 1536 + sr->l_whence = 0; 1537 + 1538 + llen = sr->l_len > 0 ? sr->l_len - 1 : sr->l_len; 1539 + 1540 + if (sr->l_start < 0 1541 + || sr->l_start > max_off 1542 + || (sr->l_start + llen) < 0 1543 + || (sr->l_start + llen) > max_off) { 1544 + ret = -EINVAL; 1545 + goto out_meta_unlock; 1546 + } 1547 + 1548 + if (cmd == OCFS2_IOC_RESVSP || cmd == OCFS2_IOC_RESVSP64) { 1549 + if (sr->l_len <= 0) { 1550 + ret = -EINVAL; 1551 + goto out_meta_unlock; 1552 + } 1553 + } 1554 + 1555 + if (should_remove_suid(file->f_path.dentry)) { 1556 + ret = __ocfs2_write_remove_suid(inode, di_bh); 1557 + if (ret) { 1558 + mlog_errno(ret); 1559 + goto out_meta_unlock; 1560 + } 1561 + } 1562 + 1563 + down_write(&OCFS2_I(inode)->ip_alloc_sem); 1564 + switch (cmd) { 1565 + case OCFS2_IOC_RESVSP: 1566 + case OCFS2_IOC_RESVSP64: 1567 + /* 1568 + * This takes unsigned offsets, but the signed ones we 1569 + * pass have been checked against overflow above. 
1570 + */ 1571 + ret = ocfs2_allocate_unwritten_extents(inode, sr->l_start, 1572 + sr->l_len); 1573 + break; 1574 + case OCFS2_IOC_UNRESVSP: 1575 + case OCFS2_IOC_UNRESVSP64: 1576 + ret = ocfs2_remove_inode_range(inode, di_bh, sr->l_start, 1577 + sr->l_len); 1578 + break; 1579 + default: 1580 + ret = -EINVAL; 1581 + } 1582 + up_write(&OCFS2_I(inode)->ip_alloc_sem); 1583 + if (ret) { 1584 + mlog_errno(ret); 1585 + goto out_meta_unlock; 1586 + } 1587 + 1588 + /* 1589 + * We update c/mtime for these changes 1590 + */ 1591 + handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS); 1592 + if (IS_ERR(handle)) { 1593 + ret = PTR_ERR(handle); 1594 + mlog_errno(ret); 1595 + goto out_meta_unlock; 1596 + } 1597 + 1598 + inode->i_ctime = inode->i_mtime = CURRENT_TIME; 1599 + ret = ocfs2_mark_inode_dirty(handle, inode, di_bh); 1600 + if (ret < 0) 1601 + mlog_errno(ret); 1602 + 1603 + ocfs2_commit_trans(osb, handle); 1604 + 1605 + out_meta_unlock: 1606 + brelse(di_bh); 1607 + ocfs2_meta_unlock(inode, 1); 1608 + out_rw_unlock: 1609 + ocfs2_rw_unlock(inode, 1); 1610 + 1611 + mutex_unlock(&inode->i_mutex); 1191 1612 out: 1192 1613 return ret; 1193 1614 } ··· 1816 1329 *basep = base; 1817 1330 } 1818 1331 1819 - static struct page * ocfs2_get_write_source(struct ocfs2_buffered_write_priv *bp, 1332 + static struct page * ocfs2_get_write_source(char **ret_src_buf, 1820 1333 const struct iovec *cur_iov, 1821 1334 size_t iov_offset) 1822 1335 { 1823 1336 int ret; 1824 - char *buf; 1337 + char *buf = cur_iov->iov_base + iov_offset; 1825 1338 struct page *src_page = NULL; 1339 + unsigned long off; 1826 1340 1827 - buf = cur_iov->iov_base + iov_offset; 1341 + off = (unsigned long)(buf) & ~PAGE_CACHE_MASK; 1828 1342 1829 1343 if (!segment_eq(get_fs(), KERNEL_DS)) { 1830 1344 /* ··· 1837 1349 (unsigned long)buf & PAGE_CACHE_MASK, 1, 1838 1350 0, 0, &src_page, NULL); 1839 1351 if (ret == 1) 1840 - bp->b_src_buf = kmap(src_page); 1352 + *ret_src_buf = kmap(src_page) + off; 1841 1353 else 
1842 1354 src_page = ERR_PTR(-EFAULT); 1843 1355 } else { 1844 - bp->b_src_buf = buf; 1356 + *ret_src_buf = buf; 1845 1357 } 1846 1358 1847 1359 return src_page; 1848 1360 } 1849 1361 1850 - static void ocfs2_put_write_source(struct ocfs2_buffered_write_priv *bp, 1851 - struct page *page) 1362 + static void ocfs2_put_write_source(struct page *page) 1852 1363 { 1853 1364 if (page) { 1854 1365 kunmap(page); ··· 1863 1376 { 1864 1377 int ret = 0; 1865 1378 ssize_t copied, total = 0; 1866 - size_t iov_offset = 0; 1379 + size_t iov_offset = 0, bytes; 1380 + loff_t pos; 1867 1381 const struct iovec *cur_iov = iov; 1868 - struct ocfs2_buffered_write_priv bp; 1869 - struct page *page; 1382 + struct page *user_page, *page; 1383 + char *buf, *dst; 1384 + void *fsdata; 1870 1385 1871 1386 /* 1872 1387 * handle partial DIO write. Adjust cur_iov if needed. ··· 1876 1387 ocfs2_set_next_iovec(&cur_iov, &iov_offset, o_direct_written); 1877 1388 1878 1389 do { 1879 - bp.b_cur_off = iov_offset; 1880 - bp.b_cur_iov = cur_iov; 1390 + pos = *ppos; 1881 1391 1882 - page = ocfs2_get_write_source(&bp, cur_iov, iov_offset); 1883 - if (IS_ERR(page)) { 1884 - ret = PTR_ERR(page); 1392 + user_page = ocfs2_get_write_source(&buf, cur_iov, iov_offset); 1393 + if (IS_ERR(user_page)) { 1394 + ret = PTR_ERR(user_page); 1885 1395 goto out; 1886 1396 } 1887 1397 1888 - copied = ocfs2_buffered_write_cluster(file, *ppos, count, 1889 - ocfs2_map_and_write_user_data, 1890 - &bp); 1398 + /* Stay within our page boundaries */ 1399 + bytes = min((PAGE_CACHE_SIZE - ((unsigned long)pos & ~PAGE_CACHE_MASK)), 1400 + (PAGE_CACHE_SIZE - ((unsigned long)buf & ~PAGE_CACHE_MASK))); 1401 + /* Stay within the vector boundary */ 1402 + bytes = min_t(size_t, bytes, cur_iov->iov_len - iov_offset); 1403 + /* Stay within count */ 1404 + bytes = min(bytes, count); 1891 1405 1892 - ocfs2_put_write_source(&bp, page); 1406 + page = NULL; 1407 + ret = ocfs2_write_begin(file, file->f_mapping, pos, bytes, 0, 1408 + &page, 
&fsdata); 1409 + if (ret) { 1410 + mlog_errno(ret); 1411 + goto out; 1412 + } 1893 1413 1414 + dst = kmap_atomic(page, KM_USER0); 1415 + memcpy(dst + (pos & (PAGE_CACHE_SIZE - 1)), buf, bytes); 1416 + kunmap_atomic(dst, KM_USER0); 1417 + flush_dcache_page(page); 1418 + ocfs2_put_write_source(user_page); 1419 + 1420 + copied = ocfs2_write_end(file, file->f_mapping, pos, bytes, 1421 + bytes, page, fsdata); 1894 1422 if (copied < 0) { 1895 1423 mlog_errno(copied); 1896 1424 ret = copied; ··· 1915 1409 } 1916 1410 1917 1411 total += copied; 1918 - *ppos = *ppos + copied; 1412 + *ppos = pos + copied; 1919 1413 count -= copied; 1920 1414 1921 1415 ocfs2_set_next_iovec(&cur_iov, &iov_offset, copied); ··· 2085 1579 struct pipe_buffer *buf, 2086 1580 struct splice_desc *sd) 2087 1581 { 2088 - int ret, count, total = 0; 1582 + int ret, count; 2089 1583 ssize_t copied = 0; 2090 - struct ocfs2_splice_write_priv sp; 1584 + struct file *file = sd->u.file; 1585 + unsigned int offset; 1586 + struct page *page = NULL; 1587 + void *fsdata; 1588 + char *src, *dst; 2091 1589 2092 1590 ret = buf->ops->confirm(pipe, buf); 2093 1591 if (ret) 2094 1592 goto out; 2095 1593 2096 - sp.s_sd = sd; 2097 - sp.s_buf = buf; 2098 - sp.s_pipe = pipe; 2099 - sp.s_offset = sd->pos & ~PAGE_CACHE_MASK; 2100 - sp.s_buf_offset = buf->offset; 2101 - 1594 + offset = sd->pos & ~PAGE_CACHE_MASK; 2102 1595 count = sd->len; 2103 - if (count + sp.s_offset > PAGE_CACHE_SIZE) 2104 - count = PAGE_CACHE_SIZE - sp.s_offset; 1596 + if (count + offset > PAGE_CACHE_SIZE) 1597 + count = PAGE_CACHE_SIZE - offset; 2105 1598 2106 - do { 2107 - /* 2108 - * splice wants us to copy up to one page at a 2109 - * time. For pagesize > cluster size, this means we 2110 - * might enter ocfs2_buffered_write_cluster() more 2111 - * than once, so keep track of our progress here. 
2112 - */ 2113 - copied = ocfs2_buffered_write_cluster(sd->u.file, 2114 - (loff_t)sd->pos + total, 2115 - count, 2116 - ocfs2_map_and_write_splice_data, 2117 - &sp); 2118 - if (copied < 0) { 2119 - mlog_errno(copied); 2120 - ret = copied; 2121 - goto out; 2122 - } 1599 + ret = ocfs2_write_begin(file, file->f_mapping, sd->pos, count, 0, 1600 + &page, &fsdata); 1601 + if (ret) { 1602 + mlog_errno(ret); 1603 + goto out; 1604 + } 2123 1605 2124 - count -= copied; 2125 - sp.s_offset += copied; 2126 - sp.s_buf_offset += copied; 2127 - total += copied; 2128 - } while (count); 1606 + src = buf->ops->map(pipe, buf, 1); 1607 + dst = kmap_atomic(page, KM_USER1); 1608 + memcpy(dst + offset, src + buf->offset, count); 1609 + kunmap_atomic(page, KM_USER1); 1610 + buf->ops->unmap(pipe, buf, src); 2129 1611 2130 - ret = 0; 1612 + copied = ocfs2_write_end(file, file->f_mapping, sd->pos, count, count, 1613 + page, fsdata); 1614 + if (copied < 0) { 1615 + mlog_errno(copied); 1616 + ret = copied; 1617 + goto out; 1618 + } 2131 1619 out: 2132 1620 2133 - return total ? total : ret; 1621 + return copied ? copied : ret; 2134 1622 } 2135 1623 2136 1624 static ssize_t __ocfs2_file_splice_write(struct pipe_inode_info *pipe,
+7 -3
fs/ocfs2/file.h
··· 39 39 }; 40 40 int ocfs2_do_extend_allocation(struct ocfs2_super *osb, 41 41 struct inode *inode, 42 - u32 *cluster_start, 42 + u32 *logical_offset, 43 43 u32 clusters_to_add, 44 + int mark_unwritten, 44 45 struct buffer_head *fe_bh, 45 46 handle_t *handle, 46 47 struct ocfs2_alloc_context *data_ac, 47 48 struct ocfs2_alloc_context *meta_ac, 48 - enum ocfs2_alloc_restarted *reason); 49 + enum ocfs2_alloc_restarted *reason_ret); 49 50 int ocfs2_lock_allocators(struct inode *inode, struct ocfs2_dinode *di, 50 - u32 clusters_to_add, 51 + u32 clusters_to_add, u32 extents_to_split, 51 52 struct ocfs2_alloc_context **data_ac, 52 53 struct ocfs2_alloc_context **meta_ac); 53 54 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr); ··· 61 60 struct vfsmount *vfsmnt); 62 61 int ocfs2_update_inode_atime(struct inode *inode, 63 62 struct buffer_head *bh); 63 + 64 + int ocfs2_change_file_space(struct file *file, unsigned int cmd, 65 + struct ocfs2_space_resv *sr); 64 66 65 67 #endif /* OCFS2_FILE_H */
+5 -5
fs/ocfs2/heartbeat.c
··· 157 157 if (ocfs2_mount_local(osb)) 158 158 return 0; 159 159 160 - status = o2hb_register_callback(&osb->osb_hb_down); 160 + status = o2hb_register_callback(osb->uuid_str, &osb->osb_hb_down); 161 161 if (status < 0) { 162 162 mlog_errno(status); 163 163 goto bail; 164 164 } 165 165 166 - status = o2hb_register_callback(&osb->osb_hb_up); 166 + status = o2hb_register_callback(osb->uuid_str, &osb->osb_hb_up); 167 167 if (status < 0) { 168 168 mlog_errno(status); 169 - o2hb_unregister_callback(&osb->osb_hb_down); 169 + o2hb_unregister_callback(osb->uuid_str, &osb->osb_hb_down); 170 170 } 171 171 172 172 bail: ··· 178 178 if (ocfs2_mount_local(osb)) 179 179 return; 180 180 181 - o2hb_unregister_callback(&osb->osb_hb_down); 182 - o2hb_unregister_callback(&osb->osb_hb_up); 181 + o2hb_unregister_callback(osb->uuid_str, &osb->osb_hb_down); 182 + o2hb_unregister_callback(osb->uuid_str, &osb->osb_hb_up); 183 183 } 184 184 185 185 void ocfs2_stop_heartbeat(struct ocfs2_super *osb)
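The heartbeat.c hunk threads `osb->uuid_str` through callback registration, but keeps the existing error-unwinding shape: register the "down" callback, then the "up" callback, and roll back the first if the second fails. A toy userspace model of that shape (the register/unregister stand-ins and their counter bookkeeping are purely illustrative, not the o2hb API):

```c
#include <assert.h>

static int registered;	/* callbacks currently live */
static int fail_on;	/* which register call should fail (0 = none) */
static int calls;

/* Stand-ins for o2hb_register_callback()/o2hb_unregister_callback();
 * the uuid parameter mirrors the per-region key the patch adds. */
static int register_cb(const char *uuid)
{
	(void)uuid;
	if (++calls == fail_on)
		return -22;	/* say, -EINVAL */
	registered++;
	return 0;
}

static void unregister_cb(const char *uuid)
{
	(void)uuid;
	registered--;
}

/* Mirrors the shape of ocfs2_register_hb_callbacks(): if the second
 * registration fails, the first is rolled back so nothing leaks. */
static int register_hb_callbacks(const char *uuid)
{
	int status = register_cb(uuid);

	if (status < 0)
		return status;

	status = register_cb(uuid);
	if (status < 0)
		unregister_cb(uuid);

	return status;
}

/* Run the failure scenario; returns how many callbacks leaked. */
static int leaked_after_partial_failure(void)
{
	calls = registered = 0;
	fail_on = 2;
	return register_hb_callbacks("toy-uuid") < 0 ? registered : -1;
}
```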
+15
fs/ocfs2/ioctl.c
··· 14 14 #include "ocfs2.h" 15 15 #include "alloc.h" 16 16 #include "dlmglue.h" 17 + #include "file.h" 17 18 #include "inode.h" 18 19 #include "journal.h" 19 20 ··· 116 115 { 117 116 unsigned int flags; 118 117 int status; 118 + struct ocfs2_space_resv sr; 119 119 120 120 switch (cmd) { 121 121 case OCFS2_IOC_GETFLAGS: ··· 132 130 133 131 return ocfs2_set_inode_attr(inode, flags, 134 132 OCFS2_FL_MODIFIABLE); 133 + case OCFS2_IOC_RESVSP: 134 + case OCFS2_IOC_RESVSP64: 135 + case OCFS2_IOC_UNRESVSP: 136 + case OCFS2_IOC_UNRESVSP64: 137 + if (copy_from_user(&sr, (int __user *) arg, sizeof(sr))) 138 + return -EFAULT; 139 + 140 + return ocfs2_change_file_space(filp, cmd, &sr); 135 141 default: 136 142 return -ENOTTY; 137 143 } ··· 157 147 break; 158 148 case OCFS2_IOC32_SETFLAGS: 159 149 cmd = OCFS2_IOC_SETFLAGS; 150 + break; 151 + case OCFS2_IOC_RESVSP: 152 + case OCFS2_IOC_RESVSP64: 153 + case OCFS2_IOC_UNRESVSP: 154 + case OCFS2_IOC_UNRESVSP64: 160 155 break; 161 156 default: 162 157 return -ENOIOCTLCMD;
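The ioctl.c hunk wires the new xfs-style reservation commands through to `ocfs2_change_file_space()`. A hedged userspace sketch of building such a request; the field layout below follows the xfs `flock64` convention these ioctls are styled on, but treat it as an assumption, since the authoritative `struct ocfs2_space_resv` definition lives in ocfs2's headers:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed userspace mirror of struct ocfs2_space_resv (xfs flock64
 * convention); verify against ocfs2_fs.h before real use. */
struct space_resv {
	int16_t		l_type;
	int16_t		l_whence;
	int64_t		l_start;
	int64_t		l_len;
	int32_t		l_sysid;
	uint32_t	l_pid;
	int32_t		l_pad[4];
};

/* Build a reservation request for [start, start+len) relative to the
 * file start: l_whence = 0 is SEEK_SET in the switch statement of
 * ocfs2_change_file_space(), and RESVSP requires l_len > 0. */
static struct space_resv make_resv(int64_t start, int64_t len)
{
	struct space_resv sr;

	memset(&sr, 0, sizeof(sr));
	sr.l_whence = 0;	/* SEEK_SET */
	sr.l_start = start;
	sr.l_len = len;
	return sr;
}
```

The filled struct would then be handed to `ioctl()` with `OCFS2_IOC_RESVSP64` (or `UNRESVSP64` to punch the range out) on a file opened for writing.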
+2 -4
fs/ocfs2/journal.c
··· 722 722 container_of(work, struct ocfs2_journal, j_recovery_work); 723 723 struct ocfs2_super *osb = journal->j_osb; 724 724 struct ocfs2_dinode *la_dinode, *tl_dinode; 725 - struct ocfs2_la_recovery_item *item; 726 - struct list_head *p, *n; 725 + struct ocfs2_la_recovery_item *item, *n; 727 726 LIST_HEAD(tmp_la_list); 728 727 729 728 mlog_entry_void(); ··· 733 734 list_splice_init(&journal->j_la_cleanups, &tmp_la_list); 734 735 spin_unlock(&journal->j_lock); 735 736 736 - list_for_each_safe(p, n, &tmp_la_list) { 737 - item = list_entry(p, struct ocfs2_la_recovery_item, lri_list); 737 + list_for_each_entry_safe(item, n, &tmp_la_list, lri_list) { 738 738 list_del_init(&item->lri_list); 739 739 740 740 mlog(0, "Complete recovery for slot %d\n", item->lri_slot);
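The journal.c cleanup converts the recovery loop to `list_for_each_entry_safe()`, which matters because each item is deleted and freed inside the loop body. A minimal userspace illustration of the same "grab the next pointer before destroying the current node" discipline, using a plain singly linked list rather than the kernel's `list_head`:

```c
#include <assert.h>
#include <stdlib.h>

struct item {
	int slot;
	struct item *next;
};

static struct item *push(struct item *head, int slot)
{
	struct item *it = malloc(sizeof(*it));

	it->slot = slot;
	it->next = head;
	return it;
}

/* Free every node while walking the list; saving it->next into n
 * before free() is exactly what the _safe iterator variant does. */
static int drain(struct item *head)
{
	struct item *it, *n;
	int drained = 0;

	for (it = head; it; it = n) {
		n = it->next;	/* the "safe" part */
		free(it);
		drained++;
	}
	return drained;
}
```

Without the saved `n`, the loop increment would dereference freed memory, which is the bug class the `_entry_safe` macro exists to prevent.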
+2
fs/ocfs2/journal.h
··· 289 289 #define OCFS2_TRUNCATE_LOG_FLUSH_ONE_REC (OCFS2_SUBALLOC_FREE \ 290 290 + OCFS2_TRUNCATE_LOG_UPDATE) 291 291 292 + #define OCFS2_REMOVE_EXTENT_CREDITS (OCFS2_TRUNCATE_LOG_UPDATE + OCFS2_INODE_UPDATE_CREDITS) 293 + 292 294 /* data block for new dir/symlink, 2 for bitmap updates (bitmap fe + 293 295 * bitmap block for the new bit) */ 294 296 #define OCFS2_DIR_LINK_ADDITIONAL_CREDITS (1 + 2)
+143 -24
fs/ocfs2/mmap.c
··· 37 37 38 38 #include "ocfs2.h" 39 39 40 + #include "aops.h" 40 41 #include "dlmglue.h" 41 42 #include "file.h" 42 43 #include "inode.h" 43 44 #include "mmap.h" 45 + 46 + static inline int ocfs2_vm_op_block_sigs(sigset_t *blocked, sigset_t *oldset) 47 + { 48 + /* The best way to deal with signals in the vm path is 49 + * to block them upfront, rather than allowing the 50 + * locking paths to return -ERESTARTSYS. */ 51 + sigfillset(blocked); 52 + 53 + /* We should technically never get a bad return value 54 + * from sigprocmask */ 55 + return sigprocmask(SIG_BLOCK, blocked, oldset); 56 + } 57 + 58 + static inline int ocfs2_vm_op_unblock_sigs(sigset_t *oldset) 59 + { 60 + return sigprocmask(SIG_SETMASK, oldset, NULL); 61 + } 44 62 45 63 static struct page *ocfs2_nopage(struct vm_area_struct * area, 46 64 unsigned long address, ··· 71 53 mlog_entry("(area=%p, address=%lu, type=%p)\n", area, address, 72 54 type); 73 55 74 - /* The best way to deal with signals in this path is 75 - * to block them upfront, rather than allowing the 76 - * locking paths to return -ERESTARTSYS. 
*/ 77 - sigfillset(&blocked); 78 - 79 - /* We should technically never get a bad ret return 80 - * from sigprocmask */ 81 - ret = sigprocmask(SIG_BLOCK, &blocked, &oldset); 56 + ret = ocfs2_vm_op_block_sigs(&blocked, &oldset); 82 57 if (ret < 0) { 83 58 mlog_errno(ret); 84 59 goto out; ··· 79 68 80 69 page = filemap_nopage(area, address, type); 81 70 82 - ret = sigprocmask(SIG_SETMASK, &oldset, NULL); 71 + ret = ocfs2_vm_op_unblock_sigs(&oldset); 83 72 if (ret < 0) 84 73 mlog_errno(ret); 85 74 out: ··· 87 76 return page; 88 77 } 89 78 79 + static int __ocfs2_page_mkwrite(struct inode *inode, struct buffer_head *di_bh, 80 + struct page *page) 81 + { 82 + int ret; 83 + struct address_space *mapping = inode->i_mapping; 84 + loff_t pos = page->index << PAGE_CACHE_SHIFT; 85 + unsigned int len = PAGE_CACHE_SIZE; 86 + pgoff_t last_index; 87 + struct page *locked_page = NULL; 88 + void *fsdata; 89 + loff_t size = i_size_read(inode); 90 + 91 + /* 92 + * Another node might have truncated while we were waiting on 93 + * cluster locks. 94 + */ 95 + last_index = size >> PAGE_CACHE_SHIFT; 96 + if (page->index > last_index) { 97 + ret = -EINVAL; 98 + goto out; 99 + } 100 + 101 + /* 102 + * The i_size check above doesn't catch the case where nodes 103 + * truncated and then re-extended the file. We'll re-check the 104 + * page mapping after taking the page lock inside of 105 + * ocfs2_write_begin_nolock(). 106 + */ 107 + if (!PageUptodate(page) || page->mapping != inode->i_mapping) { 108 + ret = -EINVAL; 109 + goto out; 110 + } 111 + 112 + /* 113 + * Call ocfs2_write_begin() and ocfs2_write_end() to take 114 + * advantage of the allocation code there. We pass a write 115 + * length of the whole page (chopped to i_size) to make sure 116 + * the whole thing is allocated. 117 + * 118 + * Since we know the page is up to date, we don't have to 119 + * worry about ocfs2_write_begin() skipping some buffer reads 120 + * because the "write" would invalidate their data. 
121 + */ 122 + if (page->index == last_index) 123 + len = size & ~PAGE_CACHE_MASK; 124 + 125 + ret = ocfs2_write_begin_nolock(mapping, pos, len, 0, &locked_page, 126 + &fsdata, di_bh, page); 127 + if (ret) { 128 + if (ret != -ENOSPC) 129 + mlog_errno(ret); 130 + goto out; 131 + } 132 + 133 + ret = ocfs2_write_end_nolock(mapping, pos, len, len, locked_page, 134 + fsdata); 135 + if (ret < 0) { 136 + mlog_errno(ret); 137 + goto out; 138 + } 139 + BUG_ON(ret != len); 140 + ret = 0; 141 + out: 142 + return ret; 143 + } 144 + 145 + static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct page *page) 146 + { 147 + struct inode *inode = vma->vm_file->f_path.dentry->d_inode; 148 + struct buffer_head *di_bh = NULL; 149 + sigset_t blocked, oldset; 150 + int ret, ret2; 151 + 152 + ret = ocfs2_vm_op_block_sigs(&blocked, &oldset); 153 + if (ret < 0) { 154 + mlog_errno(ret); 155 + return ret; 156 + } 157 + 158 + /* 159 + * The cluster locks taken will block a truncate from another 160 + * node. Taking the data lock will also ensure that we don't 161 + * attempt page truncation as part of a downconvert. 162 + */ 163 + ret = ocfs2_meta_lock(inode, &di_bh, 1); 164 + if (ret < 0) { 165 + mlog_errno(ret); 166 + goto out; 167 + } 168 + 169 + /* 170 + * The alloc sem should be enough to serialize with 171 + * ocfs2_truncate_file() changing i_size as well as any thread 172 + * modifying the inode btree. 
173 + */ 174 + down_write(&OCFS2_I(inode)->ip_alloc_sem); 175 + 176 + ret = ocfs2_data_lock(inode, 1); 177 + if (ret < 0) { 178 + mlog_errno(ret); 179 + goto out_meta_unlock; 180 + } 181 + 182 + ret = __ocfs2_page_mkwrite(inode, di_bh, page); 183 + 184 + ocfs2_data_unlock(inode, 1); 185 + 186 + out_meta_unlock: 187 + up_write(&OCFS2_I(inode)->ip_alloc_sem); 188 + 189 + brelse(di_bh); 190 + ocfs2_meta_unlock(inode, 1); 191 + 192 + out: 193 + ret2 = ocfs2_vm_op_unblock_sigs(&oldset); 194 + if (ret2 < 0) 195 + mlog_errno(ret2); 196 + 197 + return ret; 198 + } 199 + 90 200 static struct vm_operations_struct ocfs2_file_vm_ops = { 91 - .nopage = ocfs2_nopage, 201 + .nopage = ocfs2_nopage, 202 + .page_mkwrite = ocfs2_page_mkwrite, 92 203 }; 93 204 94 205 int ocfs2_mmap(struct file *file, struct vm_area_struct *vma) 95 206 { 96 207 int ret = 0, lock_level = 0; 97 - struct ocfs2_super *osb = OCFS2_SB(file->f_dentry->d_inode->i_sb); 98 - 99 - /* 100 - * Only support shared writeable mmap for local mounts which 101 - * don't know about holes. 102 - */ 103 - if ((!ocfs2_mount_local(osb) || ocfs2_sparse_alloc(osb)) && 104 - ((vma->vm_flags & VM_SHARED) || (vma->vm_flags & VM_MAYSHARE)) && 105 - ((vma->vm_flags & VM_WRITE) || (vma->vm_flags & VM_MAYWRITE))) { 106 - mlog(0, "disallow shared writable mmaps %lx\n", vma->vm_flags); 107 - /* This is -EINVAL because generic_file_readonly_mmap 108 - * returns it in a similar situation. */ 109 - return -EINVAL; 110 - } 111 208 112 209 ret = ocfs2_meta_lock_atime(file->f_dentry->d_inode, 113 210 file->f_vfsmnt, &lock_level);
+1 -1
fs/ocfs2/namei.c
··· 1674 1674 u32 offset = 0; 1675 1675 1676 1676 inode->i_op = &ocfs2_symlink_inode_operations; 1677 - status = ocfs2_do_extend_allocation(osb, inode, &offset, 1, 1677 + status = ocfs2_do_extend_allocation(osb, inode, &offset, 1, 0, 1678 1678 new_fe_bh, 1679 1679 handle, data_ac, NULL, 1680 1680 NULL);
+14
fs/ocfs2/ocfs2.h
··· 219 219 u16 max_slots; 220 220 s16 node_num; 221 221 s16 slot_num; 222 + s16 preferred_slot; 222 223 int s_sectsize_bits; 223 224 int s_clustersize; 224 225 int s_clustersize_bits; ··· 302 301 static inline int ocfs2_sparse_alloc(struct ocfs2_super *osb) 303 302 { 304 303 if (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC) 304 + return 1; 305 + return 0; 306 + } 307 + 308 + static inline int ocfs2_writes_unwritten_extents(struct ocfs2_super *osb) 309 + { 310 + /* 311 + * Support for sparse files is a pre-requisite 312 + */ 313 + if (!ocfs2_sparse_alloc(osb)) 314 + return 0; 315 + 316 + if (osb->s_feature_ro_compat & OCFS2_FEATURE_RO_COMPAT_UNWRITTEN) 305 317 return 1; 306 318 return 0; 307 319 }
+32 -1
fs/ocfs2/ocfs2_fs.h
··· 88 88 #define OCFS2_FEATURE_COMPAT_SUPP OCFS2_FEATURE_COMPAT_BACKUP_SB 89 89 #define OCFS2_FEATURE_INCOMPAT_SUPP (OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT \ 90 90 | OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC) 91 - #define OCFS2_FEATURE_RO_COMPAT_SUPP 0 91 + #define OCFS2_FEATURE_RO_COMPAT_SUPP OCFS2_FEATURE_RO_COMPAT_UNWRITTEN 92 92 93 93 /* 94 94 * Heartbeat-only devices are missing journals and other files. The ··· 115 115 * has backup superblocks. 116 116 */ 117 117 #define OCFS2_FEATURE_COMPAT_BACKUP_SB 0x0001 118 + 119 + /* 120 + * Unwritten extents support. 121 + */ 122 + #define OCFS2_FEATURE_RO_COMPAT_UNWRITTEN 0x0001 118 123 119 124 /* The byte offset of the first backup block will be 1G. 120 125 * The following will be 4G, 16G, 64G, 256G and 1T. ··· 173 168 #define OCFS2_IOC_SETFLAGS _IOW('f', 2, long) 174 169 #define OCFS2_IOC32_GETFLAGS _IOR('f', 1, int) 175 170 #define OCFS2_IOC32_SETFLAGS _IOW('f', 2, int) 171 + 172 + /* 173 + * Space reservation / allocation / free ioctls and argument structure 174 + * are designed to be compatible with XFS. 175 + * 176 + * ALLOCSP* and FREESP* are not and will never be supported, but are 177 + * included here for completeness. 
178 + */ 179 + struct ocfs2_space_resv { 180 + __s16 l_type; 181 + __s16 l_whence; 182 + __s64 l_start; 183 + __s64 l_len; /* len == 0 means until end of file */ 184 + __s32 l_sysid; 185 + __u32 l_pid; 186 + __s32 l_pad[4]; /* reserve area */ 187 + }; 188 + 189 + #define OCFS2_IOC_ALLOCSP _IOW ('X', 10, struct ocfs2_space_resv) 190 + #define OCFS2_IOC_FREESP _IOW ('X', 11, struct ocfs2_space_resv) 191 + #define OCFS2_IOC_RESVSP _IOW ('X', 40, struct ocfs2_space_resv) 192 + #define OCFS2_IOC_UNRESVSP _IOW ('X', 41, struct ocfs2_space_resv) 193 + #define OCFS2_IOC_ALLOCSP64 _IOW ('X', 36, struct ocfs2_space_resv) 194 + #define OCFS2_IOC_FREESP64 _IOW ('X', 37, struct ocfs2_space_resv) 195 + #define OCFS2_IOC_RESVSP64 _IOW ('X', 42, struct ocfs2_space_resv) 196 + #define OCFS2_IOC_UNRESVSP64 _IOW ('X', 43, struct ocfs2_space_resv) 176 197 177 198 /* 178 199 * Journal Flags (ocfs2_dinode.id1.journal1.i_flags)
+10 -2
fs/ocfs2/slot_map.c
··· 121 121 return ret; 122 122 } 123 123 124 - static s16 __ocfs2_find_empty_slot(struct ocfs2_slot_info *si) 124 + static s16 __ocfs2_find_empty_slot(struct ocfs2_slot_info *si, s16 preferred) 125 125 { 126 126 int i; 127 127 s16 ret = OCFS2_INVALID_SLOT; 128 + 129 + if (preferred >= 0 && preferred < si->si_num_slots) { 130 + if (OCFS2_INVALID_SLOT == si->si_global_node_nums[preferred]) { 131 + ret = preferred; 132 + goto out; 133 + } 134 + } 128 135 129 136 for(i = 0; i < si->si_num_slots; i++) { 130 137 if (OCFS2_INVALID_SLOT == si->si_global_node_nums[i]) { ··· 139 132 break; 140 133 } 141 134 } 135 + out: 142 136 return ret; 143 137 } 144 138 ··· 256 248 if (slot == OCFS2_INVALID_SLOT) { 257 249 /* if no slot yet, then just take 1st available 258 250 * one. */ 259 - slot = __ocfs2_find_empty_slot(si); 251 + slot = __ocfs2_find_empty_slot(si, osb->preferred_slot); 260 252 if (slot == OCFS2_INVALID_SLOT) { 261 253 spin_unlock(&si->si_lock); 262 254 mlog(ML_ERROR, "no free slots available!\n");
+6 -40
fs/ocfs2/suballoc.c
··· 98 98 u16 chain); 99 99 static inline int ocfs2_block_group_reasonably_empty(struct ocfs2_group_desc *bg, 100 100 u32 wanted); 101 - static int ocfs2_free_suballoc_bits(handle_t *handle, 102 - struct inode *alloc_inode, 103 - struct buffer_head *alloc_bh, 104 - unsigned int start_bit, 105 - u64 bg_blkno, 106 - unsigned int count); 107 - static inline u64 ocfs2_which_suballoc_group(u64 block, 108 - unsigned int bit); 109 101 static inline u32 ocfs2_desc_bitmap_to_cluster_off(struct inode *inode, 110 102 u64 bg_blkno, 111 103 u16 bg_bit_off); ··· 488 496 489 497 (*ac)->ac_bits_wanted = ocfs2_extend_meta_needed(fe); 490 498 (*ac)->ac_which = OCFS2_AC_USE_META; 491 - 492 - #ifndef OCFS2_USE_ALL_METADATA_SUBALLOCATORS 493 - slot = 0; 494 - #else 495 499 slot = osb->slot_num; 496 - #endif 497 - 498 500 (*ac)->ac_group_search = ocfs2_block_group_search; 499 501 500 502 status = ocfs2_reserve_suballoc_bits(osb, (*ac), ··· 1612 1626 /* 1613 1627 * expects the suballoc inode to already be locked. 
1614 1628 */ 1615 - static int ocfs2_free_suballoc_bits(handle_t *handle, 1616 - struct inode *alloc_inode, 1617 - struct buffer_head *alloc_bh, 1618 - unsigned int start_bit, 1619 - u64 bg_blkno, 1620 - unsigned int count) 1629 + int ocfs2_free_suballoc_bits(handle_t *handle, 1630 + struct inode *alloc_inode, 1631 + struct buffer_head *alloc_bh, 1632 + unsigned int start_bit, 1633 + u64 bg_blkno, 1634 + unsigned int count) 1621 1635 { 1622 1636 int status = 0; 1623 1637 u32 tmp_used; ··· 1689 1703 return status; 1690 1704 } 1691 1705 1692 - static inline u64 ocfs2_which_suballoc_group(u64 block, unsigned int bit) 1693 - { 1694 - u64 group = block - (u64) bit; 1695 - 1696 - return group; 1697 - } 1698 - 1699 1706 int ocfs2_free_dinode(handle_t *handle, 1700 1707 struct inode *inode_alloc_inode, 1701 1708 struct buffer_head *inode_alloc_bh, ··· 1700 1721 1701 1722 return ocfs2_free_suballoc_bits(handle, inode_alloc_inode, 1702 1723 inode_alloc_bh, bit, bg_blkno, 1); 1703 - } 1704 - 1705 - int ocfs2_free_extent_block(handle_t *handle, 1706 - struct inode *eb_alloc_inode, 1707 - struct buffer_head *eb_alloc_bh, 1708 - struct ocfs2_extent_block *eb) 1709 - { 1710 - u64 blk = le64_to_cpu(eb->h_blkno); 1711 - u16 bit = le16_to_cpu(eb->h_suballoc_bit); 1712 - u64 bg_blkno = ocfs2_which_suballoc_group(blk, bit); 1713 - 1714 - return ocfs2_free_suballoc_bits(handle, eb_alloc_inode, eb_alloc_bh, 1715 - bit, bg_blkno, 1); 1716 1724 } 1717 1725 1718 1726 int ocfs2_free_clusters(handle_t *handle,
+13 -4
fs/ocfs2/suballoc.h
··· 86 86 u32 *cluster_start, 87 87 u32 *num_clusters); 88 88 89 + int ocfs2_free_suballoc_bits(handle_t *handle, 90 + struct inode *alloc_inode, 91 + struct buffer_head *alloc_bh, 92 + unsigned int start_bit, 93 + u64 bg_blkno, 94 + unsigned int count); 89 95 int ocfs2_free_dinode(handle_t *handle, 90 96 struct inode *inode_alloc_inode, 91 97 struct buffer_head *inode_alloc_bh, 92 98 struct ocfs2_dinode *di); 93 - int ocfs2_free_extent_block(handle_t *handle, 94 - struct inode *eb_alloc_inode, 95 - struct buffer_head *eb_alloc_bh, 96 - struct ocfs2_extent_block *eb); 97 99 int ocfs2_free_clusters(handle_t *handle, 98 100 struct inode *bitmap_inode, 99 101 struct buffer_head *bitmap_bh, 100 102 u64 start_blk, 101 103 unsigned int num_clusters); 104 + 105 + static inline u64 ocfs2_which_suballoc_group(u64 block, unsigned int bit) 106 + { 107 + u64 group = block - (u64) bit; 108 + 109 + return group; 110 + } 102 111 103 112 static inline u32 ocfs2_cluster_from_desc(struct ocfs2_super *osb, 104 113 u64 bg_blkno)
+21 -6
fs/ocfs2/super.c
··· 82 82 MODULE_LICENSE("GPL"); 83 83 84 84 static int ocfs2_parse_options(struct super_block *sb, char *options, 85 - unsigned long *mount_opt, int is_remount); 85 + unsigned long *mount_opt, s16 *slot, 86 + int is_remount); 86 87 static void ocfs2_put_super(struct super_block *sb); 87 88 static int ocfs2_mount_volume(struct super_block *sb); 88 89 static int ocfs2_remount(struct super_block *sb, int *flags, char *data); ··· 115 114 static struct inode *ocfs2_alloc_inode(struct super_block *sb); 116 115 static void ocfs2_destroy_inode(struct inode *inode); 117 116 118 - static unsigned long long ocfs2_max_file_offset(unsigned int blockshift); 119 - 120 117 static const struct super_operations ocfs2_sops = { 121 118 .statfs = ocfs2_statfs, 122 119 .alloc_inode = ocfs2_alloc_inode, ··· 139 140 Opt_data_ordered, 140 141 Opt_data_writeback, 141 142 Opt_atime_quantum, 143 + Opt_slot, 142 144 Opt_err, 143 145 }; 144 146 ··· 154 154 {Opt_data_ordered, "data=ordered"}, 155 155 {Opt_data_writeback, "data=writeback"}, 156 156 {Opt_atime_quantum, "atime_quantum=%u"}, 157 + {Opt_slot, "preferred_slot=%u"}, 157 158 {Opt_err, NULL} 158 159 }; 159 160 ··· 319 318 /* From xfs_super.c:xfs_max_file_offset 320 319 * Copyright (c) 2000-2004 Silicon Graphics, Inc. 
321 320 */ 322 - static unsigned long long ocfs2_max_file_offset(unsigned int blockshift) 321 + unsigned long long ocfs2_max_file_offset(unsigned int blockshift) 323 322 { 324 323 unsigned int pagefactor = 1; 325 324 unsigned int bitshift = BITS_PER_LONG - 1; ··· 356 355 int incompat_features; 357 356 int ret = 0; 358 357 unsigned long parsed_options; 358 + s16 slot; 359 359 struct ocfs2_super *osb = OCFS2_SB(sb); 360 360 361 - if (!ocfs2_parse_options(sb, data, &parsed_options, 1)) { 361 + if (!ocfs2_parse_options(sb, data, &parsed_options, &slot, 1)) { 362 362 ret = -EINVAL; 363 363 goto out; 364 364 } ··· 536 534 struct dentry *root; 537 535 int status, sector_size; 538 536 unsigned long parsed_opt; 537 + s16 slot; 539 538 struct inode *inode = NULL; 540 539 struct ocfs2_super *osb = NULL; 541 540 struct buffer_head *bh = NULL; ··· 544 541 545 542 mlog_entry("%p, %p, %i", sb, data, silent); 546 543 547 - if (!ocfs2_parse_options(sb, data, &parsed_opt, 0)) { 544 + if (!ocfs2_parse_options(sb, data, &parsed_opt, &slot, 0)) { 548 545 status = -EINVAL; 549 546 goto read_super_error; 550 547 } ··· 574 571 brelse(bh); 575 572 bh = NULL; 576 573 osb->s_mount_opt = parsed_opt; 574 + osb->preferred_slot = slot; 577 575 578 576 sb->s_magic = OCFS2_SUPER_MAGIC; 579 577 ··· 717 713 static int ocfs2_parse_options(struct super_block *sb, 718 714 char *options, 719 715 unsigned long *mount_opt, 716 + s16 *slot, 720 717 int is_remount) 721 718 { 722 719 int status; ··· 727 722 options ? options : "(none)"); 728 723 729 724 *mount_opt = 0; 725 + *slot = OCFS2_INVALID_SLOT; 730 726 731 727 if (!options) { 732 728 status = 1; ··· 787 781 osb->s_atime_quantum = option; 788 782 else 789 783 osb->s_atime_quantum = OCFS2_DEFAULT_ATIME_QUANTUM; 784 + break; 785 + case Opt_slot: 786 + option = 0; 787 + if (match_int(&args[0], &option)) { 788 + status = 0; 789 + goto bail; 790 + } 791 + if (option) 792 + *slot = (s16)option; 790 793 break; 791 794 default: 792 795 mlog(ML_ERROR,
+2
fs/ocfs2/super.h
··· 45 45 46 46 #define ocfs2_abort(sb, fmt, args...) __ocfs2_abort(sb, __PRETTY_FUNCTION__, fmt, ##args) 47 47 48 + unsigned long long ocfs2_max_file_offset(unsigned int blockshift); 49 + 48 50 #endif /* OCFS2_SUPER_H */
+26 -8
include/linux/configfs.h
··· 40 40 #include <linux/types.h> 41 41 #include <linux/list.h> 42 42 #include <linux/kref.h> 43 + #include <linux/mutex.h> 43 44 44 45 #include <asm/atomic.h> 45 - #include <asm/semaphore.h> 46 46 47 47 #define CONFIGFS_ITEM_NAME_LEN 20 48 48 ··· 75 75 extern void config_item_init_type_name(struct config_item *item, 76 76 const char *name, 77 77 struct config_item_type *type); 78 - extern void config_item_cleanup(struct config_item *); 79 78 80 79 extern struct config_item * config_item_get(struct config_item *); 81 80 extern void config_item_put(struct config_item *); ··· 86 87 struct configfs_attribute **ct_attrs; 87 88 }; 88 89 89 - 90 90 /** 91 91 * group - a group of config_items of a specific type, belonging 92 92 * to a specific subsystem. 93 93 */ 94 - 95 94 struct config_group { 96 95 struct config_item cg_item; 97 96 struct list_head cg_children; ··· 97 100 struct config_group **default_groups; 98 101 }; 99 102 100 - 101 103 extern void config_group_init(struct config_group *group); 102 104 extern void config_group_init_type_name(struct config_group *group, 103 105 const char *name, 104 106 struct config_item_type *type); 105 - 106 107 107 108 static inline struct config_group *to_config_group(struct config_item *item) 108 109 { ··· 117 122 config_item_put(&group->cg_item); 118 123 } 119 124 120 - extern struct config_item *config_group_find_obj(struct config_group *, const char *); 125 + extern struct config_item *config_group_find_item(struct config_group *, 126 + const char *); 121 127 122 128 123 129 struct configfs_attribute { ··· 127 131 mode_t ca_mode; 128 132 }; 129 133 134 + /* 135 + * Users often need to create attribute structures for their configurable 136 + * attributes, containing a configfs_attribute member and function pointers 137 + * for the show() and store() operations on that attribute. They can use 138 + * this macro (similar to sysfs' __ATTR) to make defining attributes easier. 
139 + */ 140 + #define __CONFIGFS_ATTR(_name, _mode, _show, _store) \ 141 + { \ 142 + .attr = { \ 143 + .ca_name = __stringify(_name), \ 144 + .ca_mode = _mode, \ 145 + .ca_owner = THIS_MODULE, \ 146 + }, \ 147 + .show = _show, \ 148 + .store = _store, \ 149 + } 130 150 131 151 /* 132 152 * If allow_link() exists, the item can symlink(2) out to other ··· 169 157 struct config_item *(*make_item)(struct config_group *group, const char *name); 170 158 struct config_group *(*make_group)(struct config_group *group, const char *name); 171 159 int (*commit_item)(struct config_item *item); 160 + void (*disconnect_notify)(struct config_group *group, struct config_item *item); 172 161 void (*drop_item)(struct config_group *group, struct config_item *item); 173 162 }; 174 163 175 164 struct configfs_subsystem { 176 165 struct config_group su_group; 177 - struct semaphore su_sem; 166 + struct mutex su_mutex; 178 167 }; 179 168 180 169 static inline struct configfs_subsystem *to_configfs_subsystem(struct config_group *group) ··· 187 174 188 175 int configfs_register_subsystem(struct configfs_subsystem *subsys); 189 176 void configfs_unregister_subsystem(struct configfs_subsystem *subsys); 177 + 178 + /* These functions can sleep and can alloc with GFP_KERNEL */ 179 + /* WARNING: These cannot be called underneath configfs callbacks!! */ 180 + int configfs_depend_item(struct configfs_subsystem *subsys, struct config_item *target); 181 + void configfs_undepend_item(struct configfs_subsystem *subsys, struct config_item *target); 190 182 191 183 #endif /* __KERNEL__ */ 192 184