Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

virtiofs: propagate sync() to file server

Even if POSIX doesn't mandate it, Linux users legitimately expect sync() to
flush all data and metadata to physical storage when it is located on the
same system. This isn't happening with virtiofs though: sync() inside the
guest returns right away even though data still needs to be flushed from
the host page cache.

This is easily demonstrated by doing the following in the guest:

$ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
sync() = 0 <0.024068>

and start the following in the host when the 'dd' command completes
in the guest:

$ strace -T -e fsync /usr/bin/sync virtiofs/foo
fsync(3) = 0 <10.371640>
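
The same measurement can be taken without strace. The following is a minimal sketch, not part of the patch: `timed_sync()` is a hypothetical helper that reports how long a single sync() call blocks, which is the figure `strace -T` prints in angle brackets above.

```c
#define _GNU_SOURCE
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): return the wall-clock
 * seconds spent inside one sync() call.
 */
double timed_sync(void)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	sync();
	clock_gettime(CLOCK_MONOTONIC, &end);

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling it in the guest right after the `dd` run should show the same near-zero duration on an unpatched kernel.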

There is no good reason not to honor the expected behavior of sync(): the
current behavior gives an unrealistic impression that virtiofs is super
fast and that data has safely landed on HW, which obviously isn't the case.

Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS
request type for this purpose. Provision a 64-bit placeholder for possible
future extensions. Since the file server cannot handle the wait == 0 case,
we skip it to avoid a gratuitous roundtrip. Note that this is
per-superblock: a FUSE_SYNCFS is sent for the root mount and for each
submount.
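
For illustration, here is one way a file server might service such a request. This is a sketch under assumptions, not virtiofsd's actual code: `handle_syncfs()` and its `export_root` parameter are hypothetical, and the sketch assumes the server can open the directory backing the mount the request targets. Because the request is per-superblock, a handler like this would run once for the root mount and once for each submount.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical server-side handler: flush the host filesystem that
 * backs one exported tree. Returns 0 on success or a negative errno
 * value.
 */
int handle_syncfs(const char *export_root)
{
	int fd, err = 0;

	fd = open(export_root, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* syncfs(2) flushes the filesystem that contains fd. */
	if (syncfs(fd) < 0)
		err = -errno;

	close(fd);
	return err;
}
```

The handler blocks until the flush completes, which matches the wait != 0 case that the client forwards.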

Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in
the file server is treated as permanent success. This ensures
compatibility with older file servers: the client will get the current
behavior of sync() not being propagated to the file server.
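
The fallback can be modeled in plain userspace C. This is only a sketch of the pattern: `struct conn`, `send_syncfs()` and `conn_sync_fs()` are hypothetical stand-ins for the connection state and the request path, and the server reply is mocked.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-connection state. */
struct conn {
	bool sync_fs;       /* still worth sending the request? */
	int requests_sent;  /* round trips made, for illustration */
	int server_reply;   /* what the mock server answers */
};

/* Mock transport: count one round trip and return the canned reply. */
static int send_syncfs(struct conn *c)
{
	c->requests_sent++;
	return c->server_reply;
}

/*
 * Treat -ENOSYS as permanent success: clear the flag so later calls
 * return 0 without another round trip. Other errors are passed up and
 * leave the flag set.
 */
int conn_sync_fs(struct conn *c)
{
	int err;

	if (!c->sync_fs)
		return 0;

	err = send_syncfs(c);
	if (err == -ENOSYS) {
		c->sync_fs = false;
		err = 0;
	}
	return err;
}
```

With a server that answers -ENOSYS, the first call costs one round trip and every later call returns 0 immediately, so an older server sees at most one unknown request per connection.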

Note that such an operation allows the file server to DoS sync(). Since a
typical FUSE file server is an untrusted piece of software running in
userspace, this is disabled by default. Only enable it with virtiofs for
now since virtiofsd is supposedly trusted by the guest kernel.

Reported-by: Robert Krawitz <rlk@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

Authored by Greg Kurz and committed by Miklos Szeredi
2d82ab25 49221cf8

+53 -1
+3
fs/fuse/fuse_i.h
···
 	/* Auto-mount submounts announced by the server */
 	unsigned int auto_submounts:1;
 
+	/* Propagate syncfs() to server */
+	unsigned int sync_fs:1;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
+40
fs/fuse/inode.c
···
 	return err;
 }
 
+static int fuse_sync_fs(struct super_block *sb, int wait)
+{
+	struct fuse_mount *fm = get_fuse_mount_super(sb);
+	struct fuse_conn *fc = fm->fc;
+	struct fuse_syncfs_in inarg;
+	FUSE_ARGS(args);
+	int err;
+
+	/*
+	 * Userspace cannot handle the wait == 0 case. Avoid a
+	 * gratuitous roundtrip.
+	 */
+	if (!wait)
+		return 0;
+
+	/* The filesystem is being unmounted. Nothing to do. */
+	if (!sb->s_root)
+		return 0;
+
+	if (!fc->sync_fs)
+		return 0;
+
+	memset(&inarg, 0, sizeof(inarg));
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	args.opcode = FUSE_SYNCFS;
+	args.nodeid = get_node_id(sb->s_root->d_inode);
+	args.out_numargs = 0;
+
+	err = fuse_simple_request(fm, &args);
+	if (err == -ENOSYS) {
+		fc->sync_fs = 0;
+		err = 0;
+	}
+
+	return err;
+}
+
 enum {
 	OPT_SOURCE,
 	OPT_SUBTYPE,
···
 	.put_super = fuse_put_super,
 	.umount_begin = fuse_umount_begin,
 	.statfs = fuse_statfs,
+	.sync_fs = fuse_sync_fs,
 	.show_options = fuse_show_options,
 };
 
+1
fs/fuse/virtio_fs.c
···
 	fc->release = fuse_free_conn;
 	fc->delete_stale = true;
 	fc->auto_submounts = true;
+	fc->sync_fs = true;
 
 	/* Tell FUSE to split requests that exceed the virtqueue's size */
 	fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit,
+9 -1
include/uapi/linux/fuse.h
···
  * - add FUSE_OPEN_KILL_SUIDGID
  * - extend fuse_setxattr_in, add FUSE_SETXATTR_EXT
  * - add FUSE_SETXATTR_ACL_KILL_SGID
+ *
+ * 7.34
+ * - add FUSE_SYNCFS
  */
 
 #ifndef _LINUX_FUSE_H
···
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 33
+#define FUSE_KERNEL_MINOR_VERSION 34
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
···
 	FUSE_COPY_FILE_RANGE = 47,
 	FUSE_SETUPMAPPING = 48,
 	FUSE_REMOVEMAPPING = 49,
+	FUSE_SYNCFS = 50,
 
 	/* CUSE specific operations */
 	CUSE_INIT = 4096,
···
 
 #define FUSE_REMOVEMAPPING_MAX_ENTRY \
 	(PAGE_SIZE / sizeof(struct fuse_removemapping_one))
+
+struct fuse_syncfs_in {
+	uint64_t padding;
+};
 
 #endif /* _LINUX_FUSE_H */
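
To make the wire format concrete, the sketch below builds one FUSE_SYNCFS request in a byte buffer. It is an illustration only: the `sketch_*` types are local mirrors written for this example rather than the uapi definitions, and the header layout is an assumption based on the FUSE request header at this protocol version. `build_syncfs_request()` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_FUSE_SYNCFS 50	/* opcode value added by this patch */

/* Local mirror of the FUSE request header (assumed layout). */
struct sketch_in_header {
	uint32_t len;
	uint32_t opcode;
	uint64_t unique;
	uint64_t nodeid;
	uint32_t uid;
	uint32_t gid;
	uint32_t pid;
	uint32_t padding;
};

/* Local mirror of struct fuse_syncfs_in: a single 64-bit placeholder. */
struct sketch_syncfs_in {
	uint64_t padding;
};

/*
 * Hypothetical helper: serialize header plus zeroed payload into buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t build_syncfs_request(uint8_t *buf, size_t cap,
			    uint64_t unique, uint64_t nodeid)
{
	struct sketch_in_header hdr;
	struct sketch_syncfs_in arg;
	size_t total = sizeof(hdr) + sizeof(arg);

	if (cap < total)
		return 0;

	memset(&hdr, 0, sizeof(hdr));
	hdr.len = (uint32_t)total;
	hdr.opcode = SKETCH_FUSE_SYNCFS;
	hdr.unique = unique;
	hdr.nodeid = nodeid;

	/* The placeholder is zeroed, mirroring the memset() in fuse_sync_fs(). */
	memset(&arg, 0, sizeof(arg));

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), &arg, sizeof(arg));
	return total;
}
```

Under this assumed layout the request is a fixed-size message with no reply payload, consistent with `args.out_numargs = 0` in the client code.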