Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus-v4.8' of git://github.com/martinbrandenburg/linux

Pull orangefs update from Martin Brandenburg:
"Kernel side caching and executable bugfix

This allows OrangeFS to utilize the dcache and adds an in kernel
attribute cache. We previously used the user side client for this
purpose.

We see a modest performance increase on small file operations. For
example, without the cache, compiling coreutils takes about 17
minutes. With the patch and a 50 millisecond timeout for
dcache_timeout_msecs and getattr_timeout_msecs (the default),
compiling coreutils takes about 6 minutes 20 seconds. On the same
hardware, compiling coreutils on an xfs filesystem takes 90 seconds.
We see similar improvements with mdtest and a test involving writing,
reading, and deleting a large number of small files.

Interested parties can review more data at the following URL.

https://docs.google.com/spreadsheets/d/1v4aUeppKexIbRMz_Yn9k4eaM3uy2KCaPoe_93YKWOtA/pubhtml

The eventual goal of this is to allow getdents to turn into a
readdirplus to the OrangeFS server. The cache will be filled then,
which should provide a performance benefit to the common case of
readdir followed by getattr on each entry (i.e. ls -l).

This also fixes a bug. When orangefs_inode_permission was added, it
did not collect i_size from the OrangeFS server, since doing so places
an unnecessary load on the OrangeFS server. However, it left a case
where i_size is never initialized. Then running an executable could
fail.

With this patch, size is always collected to be inserted into the
cache. Thus the bug disappears. If this patch is not accepted during
this merge window, we will send a one-line band-aid for this bug
instead"

* tag 'for-linus-v4.8' of git://github.com/martinbrandenburg/linux:
Orangefs: update orangefs.txt
orangefs: Account for jiffies wraparound.
orangefs: Change default dcache and getattr timeout to 50 msec.
orangefs: Allow dcache and getattr cache time to be configured.
orangefs: Cache getattr results.
orangefs: Use d_time to avoid excessive lookups

+135 -34
+46 -4
Documentation/filesystems/orangefs.txt
···
 if the client-core stays dead too long, the arbitrary userspace processes
 trying to use Orangefs will be negatively affected. Waiting ops
 that can't be serviced will be removed from the request list and
-have their states set to "given up". In-progress ops that can't
+have their states set to "given up". In-progress ops that can't
 be serviced will be removed from the in_progress hash table and
 have their states set to "given up".
 
···
 PVFS2_VFS_OP_STATFS
 fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
 us to know, in a timely fashion, these statistics about our
-distributed network filesystem.
+distributed network filesystem.
 
 PVFS2_VFS_OP_FS_MOUNT
 fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
···
 
 io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
 io_array[1].iov_len = sizeof(int32_t)
-
+
 io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
 io_array[2].iov_len = sizeof(int64_t)
 
···
 io_array[4].iov_len = contents of member trailer_size (PVFS_size)
 from out_downcall member of global variable
 vfs_request
-
+
+Orangefs exploits the dcache in order to avoid sending redundant
+requests to userspace. We keep object inode attributes up-to-date with
+orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to
+help it decide whether or not to update an inode: "new" and "bypass".
+Orangefs keeps private data in an object's inode that includes a short
+timeout value, getattr_time, which allows any iteration of
+orangefs_inode_getattr to know how long it has been since the inode was
+updated. When the object is not new (new == 0) and the bypass flag is not
+set (bypass == 0) orangefs_inode_getattr returns without updating the inode
+if getattr_time has not timed out. Getattr_time is updated each time the
+inode is updated.
+
+Creation of a new object (file, dir, sym-link) includes the evaluation of
+its pathname, resulting in a negative directory entry for the object.
+A new inode is allocated and associated with the dentry, turning it from
+a negative dentry into a "productive full member of society". Orangefs
+obtains the new inode from Linux with new_inode() and associates
+the inode with the dentry by sending the pair back to Linux with
+d_instantiate().
+
+The evaluation of a pathname for an object resolves to its corresponding
+dentry. If there is no corresponding dentry, one is created for it in
+the dcache. Whenever a dentry is modified or verified Orangefs stores a
+short timeout value in the dentry's d_time, and the dentry will be trusted
+for that amount of time. Orangefs is a network filesystem, and objects
+can potentially change out-of-band with any particular Orangefs kernel module
+instance, so trusting a dentry is risky. The alternative to trusting
+dentries is to always obtain the needed information from userspace - at
+least a trip to the client-core, maybe to the servers. Obtaining information
+from a dentry is cheap, obtaining it from userspace is relatively expensive,
+hence the motivation to use the dentry when possible.
+
+The timeout values d_time and getattr_time are jiffy based, and the
+code is designed to avoid the jiffy-wrap problem:
+
+"In general, if the clock may have wrapped around more than once, there
+is no way to tell how much time has elapsed. However, if the times t1
+and t2 are known to be fairly close, we can reliably compute the
+difference in a way that takes into account the possibility that the
+clock may have wrapped between times."
+
+ from course notes by instructor Andy Wang
 
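The jiffy-wrap argument quoted in the documentation hunk above can be tried out in userspace. The following is a minimal sketch, not kernel code: tick_before and tick_deadline are hypothetical names, and the signed-difference comparison is assumed to mirror what the kernel's time_before() macro does.

```c
#include <limits.h>

/*
 * Userspace sketch of a wrap-safe "a is before b" test for a free-running
 * unsigned tick counter. tick_before and tick_deadline are hypothetical
 * names used only for this illustration.
 */
static int tick_before(unsigned long a, unsigned long b)
{
	/*
	 * The unsigned subtraction wraps modulo 2^N. Reinterpreting the
	 * result as signed (two's complement assumed) gives a negative
	 * value when a is earlier than b, as long as the two values are
	 * less than half the counter range apart.
	 */
	return (long)(a - b) < 0;
}

/* Compute a deadline "ticks" in the future; the sum may wrap past ULONG_MAX. */
static unsigned long tick_deadline(unsigned long now, unsigned long ticks)
{
	return now + ticks;
}
```

A plain `now < deadline` comparison would give the wrong answer when the deadline has wrapped to a small number while `now` is still close to ULONG_MAX. The signed difference handles that case, which is the property the quoted course notes describe.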
+4
fs/orangefs/dcache.c
···
 		}
 	}
 
+	dentry->d_time = jiffies + dcache_timeout_msecs*HZ/1000;
 	ret = 1;
 out_release_op:
 	op_release(new_op);
···
 static int orangefs_d_revalidate(struct dentry *dentry, unsigned int flags)
 {
 	int ret;
+
+	if (time_before(jiffies, dentry->d_time))
+		return 1;
 
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
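The expression `dcache_timeout_msecs*HZ/1000` in the hunk above converts milliseconds to ticks with integer arithmetic. A small sketch of that arithmetic follows, assuming a hypothetical helper msecs_to_ticks and passing the tick rate as a parameter, since HZ is a build-time constant in the kernel.

```c
/*
 * Sketch of the conversion used in the patch: msecs * HZ / 1000, evaluated
 * in integer arithmetic, so any fractional tick is truncated.
 * msecs_to_ticks is a hypothetical name; hz stands in for the kernel's HZ.
 */
static unsigned long msecs_to_ticks(int msecs, int hz)
{
	return (unsigned long)(msecs * hz / 1000);
}
```

With the default of 50 msec this gives 5 ticks at hz = 100, 12 ticks at hz = 250 (12.5 truncated) and 50 ticks at hz = 1000. A timeout shorter than one tick, such as 5 msec at hz = 100, truncates to 0, so the stored deadline equals the current time and the time_before() test in orangefs_d_revalidate would not trust the dentry at all.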
+3 -3
fs/orangefs/inode.c
···
 		     "orangefs_getattr: called on %s\n",
 		     dentry->d_name.name);
 
-	ret = orangefs_inode_getattr(inode, 0, 1);
+	ret = orangefs_inode_getattr(inode, 0, 0);
 	if (ret == 0) {
 		generic_fillattr(inode, kstat);
 
···
 	if (!inode || !(inode->i_state & I_NEW))
 		return inode;
 
-	error = orangefs_inode_getattr(inode, 1, 0);
+	error = orangefs_inode_getattr(inode, 1, 1);
 	if (error) {
 		iget_failed(inode);
 		return ERR_PTR(error);
···
 	orangefs_set_inode(inode, ref);
 	inode->i_ino = hash;	/* needed for stat etc */
 
-	error = orangefs_inode_getattr(inode, 1, 0);
+	error = orangefs_inode_getattr(inode, 1, 1);
 	if (error)
 		goto out_iput;
 
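The call sites above change the meaning of the second and third arguments to "new" and "bypass". The decision they feed into can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel function: struct attr_cache, must_fetch, mark_fresh and mark_stale are hypothetical names, and the current tick count is passed in explicitly instead of being read from jiffies.

```c
/* Hypothetical stand-in for the per-inode private data holding getattr_time. */
struct attr_cache {
	unsigned long getattr_time;	/* tick at which cached attributes expire */
};

/*
 * Return 1 when attributes must be fetched from userspace and 0 when the
 * cached copy may be used. Mirrors the rule in the documentation hunk:
 * skip the fetch only if the object is not new, bypass is not set, and
 * the deadline has not been reached (wrap-safe signed comparison).
 */
static int must_fetch(const struct attr_cache *c, unsigned long now,
		      int new, int bypass)
{
	if (!new && !bypass && (long)(now - c->getattr_time) < 0)
		return 0;
	return 1;
}

/* After a successful fetch: trust the attributes for timeout_ticks. */
static void mark_fresh(struct attr_cache *c, unsigned long now,
		       unsigned long timeout_ticks)
{
	c->getattr_time = now + timeout_ticks;
}

/* Force the next lookup to fetch, as the patch does with "jiffies - 1". */
static void mark_stale(struct attr_cache *c, unsigned long now)
{
	c->getattr_time = now - 1;
}
```

Under this sketch, orangefs_getattr's `(inode, 0, 0)` corresponds to a call that may be answered from the cache, while the `(inode, 1, 1)` calls made for a freshly obtained inode always go to userspace.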
+12
fs/orangefs/namei.c
···
 
 	d_instantiate(dentry, inode);
 	unlock_new_inode(inode);
+	dentry->d_time = jiffies + dcache_timeout_msecs*HZ/1000;
+	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 
 	gossip_debug(GOSSIP_NAME_DEBUG,
 		     "%s: dentry instantiated for %s\n",
···
 		goto out;
 	}
 
+	dentry->d_time = jiffies + dcache_timeout_msecs*HZ/1000;
+
 	inode = orangefs_iget(dir->i_sb, &new_op->downcall.resp.lookup.refn);
 	if (IS_ERR(inode)) {
 		gossip_debug(GOSSIP_NAME_DEBUG,
···
 		res = ERR_CAST(inode);
 		goto out;
 	}
+
+	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 
 	gossip_debug(GOSSIP_NAME_DEBUG,
 		     "%s:%s:%d "
···
 
 	d_instantiate(dentry, inode);
 	unlock_new_inode(inode);
+	dentry->d_time = jiffies + dcache_timeout_msecs*HZ/1000;
+	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 
 	gossip_debug(GOSSIP_NAME_DEBUG,
 		     "Inode (Symlink) %pU -> %s\n",
···
 
 	d_instantiate(dentry, inode);
 	unlock_new_inode(inode);
+	dentry->d_time = jiffies + dcache_timeout_msecs*HZ/1000;
+	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 
 	gossip_debug(GOSSIP_NAME_DEBUG,
 		     "Inode (Directory) %pU -> %s\n",
···
 	gossip_debug(GOSSIP_NAME_DEBUG,
 		     "orangefs_rename: called (%pd2 => %pd2) ct=%d\n",
 		     old_dentry, new_dentry, d_count(new_dentry));
+
+	ORANGEFS_I(new_dentry->d_parent->d_inode)->getattr_time = jiffies - 1;
 
 	new_op = op_alloc(ORANGEFS_VFS_OP_RENAME);
 	if (!new_op)
+5 -1
fs/orangefs/orangefs-kernel.h
···
 	 * with this object
 	 */
 	unsigned long pinode_flags;
+
+	unsigned long getattr_time;
 };
 
 #define P_ATIME_FLAG 0
···
 		     size_t size,
 		     int flags);
 
-int orangefs_inode_getattr(struct inode *inode, int new, int size);
+int orangefs_inode_getattr(struct inode *inode, int new, int bypass);
 
 int orangefs_inode_check_changed(struct inode *inode);
 
···
 extern int debug;
 extern int op_timeout_secs;
 extern int slot_timeout_secs;
+extern int dcache_timeout_msecs;
+extern int getattr_timeout_msecs;
 extern struct list_head orangefs_superblocks;
 extern spinlock_t orangefs_superblocks_lock;
 extern struct list_head orangefs_request_list;
+2
fs/orangefs/orangefs-mod.c
···
 unsigned int kernel_mask_set_mod_init; /* implicitly false */
 int op_timeout_secs = ORANGEFS_DEFAULT_OP_TIMEOUT_SECS;
 int slot_timeout_secs = ORANGEFS_DEFAULT_SLOT_TIMEOUT_SECS;
+int dcache_timeout_msecs = 50;
+int getattr_timeout_msecs = 50;
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("ORANGEFS Development Team");
+42 -1
fs/orangefs/orangefs-sysfs.c
···
  * Slots are requested and waited for,
  * the wait times out after slot_timeout_secs.
  *
+ * What: /sys/fs/orangefs/dcache_timeout_msecs
+ * Date: Jul 2016
+ * Contact: Martin Brandenburg <martin@omnibond.com>
+ * Description:
+ * Time lookup is valid in milliseconds.
+ *
+ * What: /sys/fs/orangefs/getattr_timeout_msecs
+ * Date: Jul 2016
+ * Contact: Martin Brandenburg <martin@omnibond.com>
+ * Description:
+ * Time getattr is valid in milliseconds.
  *
  * What: /sys/fs/orangefs/acache/...
  * Date: Jun 2015
- * Contact: Mike Marshall <hubcap@omnibond.com>
+ * Contact: Martin Brandenburg <martin@omnibond.com>
  * Description:
  * Attribute cache configurable settings.
  *
···
 	int perf_history_size;
 	int perf_time_interval_secs;
 	int slot_timeout_secs;
+	int dcache_timeout_msecs;
+	int getattr_timeout_msecs;
 };
 
 struct acache_orangefs_obj {
···
 			       "%d\n",
 			       slot_timeout_secs);
 		goto out;
+	} else if (!strcmp(orangefs_attr->attr.name,
+			   "dcache_timeout_msecs")) {
+		rc = scnprintf(buf,
+			       PAGE_SIZE,
+			       "%d\n",
+			       dcache_timeout_msecs);
+		goto out;
+	} else if (!strcmp(orangefs_attr->attr.name,
+			   "getattr_timeout_msecs")) {
+		rc = scnprintf(buf,
+			       PAGE_SIZE,
+			       "%d\n",
+			       getattr_timeout_msecs);
+		goto out;
 	} else {
 		goto out;
 	}
···
 		goto out;
 	} else if (!strcmp(attr->attr.name, "slot_timeout_secs")) {
 		rc = kstrtoint(buf, 0, &slot_timeout_secs);
+		goto out;
+	} else if (!strcmp(attr->attr.name, "dcache_timeout_msecs")) {
+		rc = kstrtoint(buf, 0, &dcache_timeout_msecs);
+		goto out;
+	} else if (!strcmp(attr->attr.name, "getattr_timeout_msecs")) {
+		rc = kstrtoint(buf, 0, &getattr_timeout_msecs);
 		goto out;
 	} else {
 		goto out;
···
 static struct orangefs_attribute slot_timeout_secs_attribute =
 	__ATTR(slot_timeout_secs, 0664, int_orangefs_show, int_store);
 
+static struct orangefs_attribute dcache_timeout_msecs_attribute =
+	__ATTR(dcache_timeout_msecs, 0664, int_orangefs_show, int_store);
+
+static struct orangefs_attribute getattr_timeout_msecs_attribute =
+	__ATTR(getattr_timeout_msecs, 0664, int_orangefs_show, int_store);
+
 static struct orangefs_attribute perf_counter_reset_attribute =
 	__ATTR(perf_counter_reset,
 	       0664,
···
 static struct attribute *orangefs_default_attrs[] = {
 	&op_timeout_secs_attribute.attr,
 	&slot_timeout_secs_attribute.attr,
+	&dcache_timeout_msecs_attribute.attr,
+	&getattr_timeout_msecs_attribute.attr,
 	&perf_counter_reset_attribute.attr,
 	&perf_history_size_attribute.attr,
 	&perf_time_interval_secs_attribute.attr,
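The two new sysfs attributes are shown as a decimal integer followed by a newline ("%d\n") and are parsed on write with kstrtoint. A userspace tool could read or set them roughly as sketched below. The helper names read_timeout_msecs and write_timeout_msecs are hypothetical, and the helpers take an already opened stream so that the sketch does not depend on the module being loaded.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read one integer tunable in the "%d\n" form that
 * the show routine in the diff produces.
 * Returns 0 on success and -1 if no integer could be parsed.
 */
static int read_timeout_msecs(FILE *f, int *msecs)
{
	return fscanf(f, "%d", msecs) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: write a new value in the same decimal form.
 * Returns 0 on success and -1 on an output error.
 */
static int write_timeout_msecs(FILE *f, int msecs)
{
	return fprintf(f, "%d\n", msecs) < 0 ? -1 : 0;
}
```

In practice the stream would come from fopen() on /sys/fs/orangefs/dcache_timeout_msecs or /sys/fs/orangefs/getattr_timeout_msecs, the paths named in the comment block of the diff. The attributes are created with mode 0664, so writing normally requires elevated privileges.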
+21 -17
fs/orangefs/orangefs-utils.c
···
 	return 0;
 }
 
-int orangefs_inode_getattr(struct inode *inode, int new, int size)
+int orangefs_inode_getattr(struct inode *inode, int new, int bypass)
 {
 	struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
 	struct orangefs_kernel_op_s *new_op;
···
 	gossip_debug(GOSSIP_UTILS_DEBUG, "%s: called on inode %pU\n", __func__,
 	    get_khandle_from_ino(inode));
 
+	if (!new && !bypass) {
+		if (time_before(jiffies, orangefs_inode->getattr_time))
+			return 0;
+	}
+
 	new_op = op_alloc(ORANGEFS_VFS_OP_GETATTR);
 	if (!new_op)
 		return -ENOMEM;
 	new_op->upcall.req.getattr.refn = orangefs_inode->refn;
-	new_op->upcall.req.getattr.mask = size ?
-	    ORANGEFS_ATTR_SYS_ALL_NOHINT : ORANGEFS_ATTR_SYS_ALL_NOHINT_NOSIZE;
+	new_op->upcall.req.getattr.mask = ORANGEFS_ATTR_SYS_ALL_NOHINT;
 
 	ret = service_operation(new_op, __func__,
 	    get_interruptible_flag(inode));
···
 	case S_IFREG:
 		inode->i_flags = orangefs_inode_flags(&new_op->
 		    downcall.resp.getattr.attributes);
-		if (size) {
-			inode_size = (loff_t)new_op->
-			    downcall.resp.getattr.attributes.size;
-			rounded_up_size =
-			    (inode_size + (4096 - (inode_size % 4096)));
-			inode->i_size = inode_size;
-			orangefs_inode->blksize =
-			    new_op->downcall.resp.getattr.attributes.blksize;
-			spin_lock(&inode->i_lock);
-			inode->i_bytes = inode_size;
-			inode->i_blocks =
-			    (unsigned long)(rounded_up_size / 512);
-			spin_unlock(&inode->i_lock);
-		}
+		inode_size = (loff_t)new_op->
+		    downcall.resp.getattr.attributes.size;
+		rounded_up_size =
+		    (inode_size + (4096 - (inode_size % 4096)));
+		inode->i_size = inode_size;
+		orangefs_inode->blksize =
+		    new_op->downcall.resp.getattr.attributes.blksize;
+		spin_lock(&inode->i_lock);
+		inode->i_bytes = inode_size;
+		inode->i_blocks =
+		    (unsigned long)(rounded_up_size / 512);
+		spin_unlock(&inode->i_lock);
 		break;
 	case S_IFDIR:
 		inode->i_size = PAGE_SIZE;
···
 	inode->i_mode = type | (is_root_handle(inode) ? S_ISVTX : 0) |
 	    orangefs_inode_perms(&new_op->downcall.resp.getattr.attributes);
 
+	orangefs_inode->getattr_time = jiffies + getattr_timeout_msecs*HZ/1000;
 	ret = 0;
 out:
 	op_release(new_op);
···
 		ClearMtimeFlag(orangefs_inode);
 		ClearCtimeFlag(orangefs_inode);
 		ClearModeFlag(orangefs_inode);
+		orangefs_inode->getattr_time = jiffies - 1;
 	}
 
 	return ret;
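The S_IFREG branch above, now executed unconditionally, derives i_blocks from the file size by rounding to 4096 bytes and dividing by 512. The arithmetic can be checked in isolation with the sketch below. All three function names are hypothetical, and the second function is a conventional round-up included only for comparison; it is not part of the patch.

```c
/* Rounding exactly as written in the diff: size + (4096 - (size % 4096)). */
static long long patch_rounded_size(long long size)
{
	return size + (4096 - (size % 4096));
}

/* Conventional round-up to a multiple of 4096, shown only for comparison. */
static long long aligned_round_up(long long size)
{
	return (size + 4095) / 4096 * 4096;
}

/* Number of 512-byte units, as stored in i_blocks by the patch. */
static unsigned long blocks_512(long long rounded)
{
	return (unsigned long)(rounded / 512);
}
```

For sizes that are not a multiple of 4096 the two roundings agree. When the size is already a multiple of 4096, including 0, the expression from the diff adds a further 4096 bytes, so a 4096-byte file is reported with 16 blocks of 512 bytes where the conventional rounding would give 8.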
-8
fs/orangefs/protocol.h
···
 	 ORANGEFS_ATTR_SYS_DIRENT_COUNT | \
 	 ORANGEFS_ATTR_SYS_BLKSIZE)
 
-#define ORANGEFS_ATTR_SYS_ALL_NOHINT_NOSIZE \
-	(ORANGEFS_ATTR_SYS_COMMON_ALL | \
-	 ORANGEFS_ATTR_SYS_LNK_TARGET | \
-	 ORANGEFS_ATTR_SYS_DFILE_COUNT | \
-	 ORANGEFS_ATTR_SYS_MIRROR_COPIES_COUNT | \
-	 ORANGEFS_ATTR_SYS_DIRENT_COUNT | \
-	 ORANGEFS_ATTR_SYS_BLKSIZE)
-
 #define ORANGEFS_XATTR_REPLACE 0x2
 #define ORANGEFS_XATTR_CREATE 0x1
 #define ORANGEFS_MAX_SERVER_ADDR_LEN 256