
dax: introduce holder for dax_device

Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2.

The fsdax-rmap patchset aims to support shared-page tracking for
fsdax.

It moves owner tracking from dax_associate_entry() to the pmem device
driver, by introducing a ->memory_failure() interface for struct
dev_pagemap. This interface is called by memory_failure() in mm, and
implemented by the pmem device driver.

Then the holder operations are called to find the filesystem in which the
corrupted data is located, and the filesystem handler is called to track
the files or metadata associated with this page.

Finally, we are able to try to fix the corrupted data in the filesystem
and do other necessary processing, such as killing the processes that are
using the affected files.

The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
| dax_holder_notify_failure() =>
| dax_device->holder_ops->notify_failure() =>
| - xfs_dax_notify_failure()
| |* xfs_dax_notify_failure()
| |--------------------------
| | xfs_rmap_query_range()
| | xfs_dax_failure_fn()
| | * corrupted on metadata
| | try to recover data, call xfs_force_shutdown()
| | * corrupted on file data
| | try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
|mf_generic_kill_procs()


The fsdax-reflink patchset attempts to add CoW support for fsdax, taking
XFS, which has both the reflink and fsdax features, as an example.

One of the key mechanisms that needs to be implemented in fsdax is CoW:
copy the data from the srcmap before actually writing data to the
destination iomap, and copy only the ranges whose data will not be
overwritten by the write.

Another mechanism is range comparison. In the page cache case, readpage()
is used to load on-disk data into the page cache so that the data can be
compared. In the fsdax case, readpage() does not work, so we need another
way to compare data, one with direct access support.

With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.


This patch (of 14):

To easily track the filesystem behind a pmem device, we introduce a
holder for the dax_device structure, along with its operations. This
holder is used to remember who is using this dax_device:

- When it is the backend of a filesystem, the holder will be the
instance of this filesystem.
- When this pmem device is one of the targets of a mapped device, the
holder will be that mapped device. In this case, the mapped device
has its own dax_device which follows the first rule, so we can
eventually track down to the filesystem we need.

The holder and holder_ops are set when a filesystem is mounted, or when
a target device is activated.

Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Shiyang Ruan, committed by akpm
8012b866 96c06573

+110 -23
+66 -1
drivers/dax/super.c
···
  * @private: dax driver private data
  * @flags: state and boolean properties
  * @ops: operations for this device
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
+ * @holder_ops: operations for the inner holder
  */
 struct dax_device {
 	struct inode inode;
···
 	void *private;
 	unsigned long flags;
 	const struct dax_operations *ops;
+	void *holder_data;
+	const struct dax_holder_operations *holder_ops;
 };
 
 static dev_t dax_devt;
···
  * fs_dax_get_by_bdev() - temporary lookup mechanism for filesystem-dax
  * @bdev: block device to find a dax_device for
  * @start_off: returns the byte offset into the dax_device that @bdev starts
+ * @holder: filesystem or mapped device inside the dax_device
+ * @ops: operations for the inner holder
  */
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off)
+struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off,
+		void *holder, const struct dax_holder_operations *ops)
 {
 	struct dax_device *dax_dev;
 	u64 part_size;
···
 	dax_dev = xa_load(&dax_hosts, (unsigned long)bdev->bd_disk);
 	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
 		dax_dev = NULL;
+	else if (holder) {
+		if (!cmpxchg(&dax_dev->holder_data, NULL, holder))
+			dax_dev->holder_ops = ops;
+		else
+			dax_dev = NULL;
+	}
 	dax_read_unlock(id);
 
 	return dax_dev;
 }
 EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
+
+void fs_put_dax(struct dax_device *dax_dev, void *holder)
+{
+	if (dax_dev && holder &&
+	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
+		dax_dev->holder_ops = NULL;
+	put_dax(dax_dev);
+}
+EXPORT_SYMBOL_GPL(fs_put_dax);
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
 
 enum dax_device_flags {
···
 }
 EXPORT_SYMBOL_GPL(dax_recovery_write);
 
+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
+			      u64 len, int mf_flags)
+{
+	int rc, id;
+
+	id = dax_read_lock();
+	if (!dax_alive(dax_dev)) {
+		rc = -ENXIO;
+		goto out;
+	}
+
+	if (!dax_dev->holder_ops) {
+		rc = -EOPNOTSUPP;
+		goto out;
+	}
+
+	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+out:
+	dax_read_unlock(id);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
···
 	if (!dax_dev)
 		return;
 
+	if (dax_dev->holder_data != NULL)
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
+
+	/* clear holder data */
+	dax_dev->holder_ops = NULL;
+	dax_dev->holder_data = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dax);
···
 	iput(&dax_dev->inode);
 }
 EXPORT_SYMBOL_GPL(put_dax);
+
+/**
+ * dax_holder() - obtain the holder of a dax device
+ * @dax_dev: a dax_device instance
+ *
+ * Return: the holder's data which represents the holder if registered,
+ * otherwise NULL.
+ */
+void *dax_holder(struct dax_device *dax_dev)
+{
+	return dax_dev->holder_data;
+}
+EXPORT_SYMBOL_GPL(dax_holder);
 
 /**
  * inode_dax: convert a public inode into its dax_dev
+1 -1
drivers/md/dm.c
···
 	}
 
 	td->dm_dev.bdev = bdev;
-	td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev, &part_off);
+	td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev, &part_off, NULL, NULL);
 	return 0;
 }
 
+6 -4
fs/erofs/super.c
···
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);
 	dif->bdev = bdev;
-	dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off);
+	dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off,
+					  NULL, NULL);
 }
 
 dif->blocks = le32_to_cpu(dis->blocks);
···
 	}
 
 	sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev,
-			&sbi->dax_part_off);
+			&sbi->dax_part_off,
+			NULL, NULL);
 }
 
···
 {
 	struct erofs_device_info *dif = ptr;
 
-	fs_put_dax(dif->dax_dev);
+	fs_put_dax(dif->dax_dev, NULL);
 	if (dif->bdev)
 		blkdev_put(dif->bdev, FMODE_READ | FMODE_EXCL);
 	erofs_fscache_unregister_cookie(&dif->fscache);
···
 		return;
 
 	erofs_free_dev_context(sbi->devs);
-	fs_put_dax(sbi->dax_dev);
+	fs_put_dax(sbi->dax_dev, NULL);
 	erofs_fscache_unregister_cookie(&sbi->s_fscache);
 	erofs_fscache_unregister_fs(sb);
 	kfree(sbi->opt.fsid);
+4 -3
fs/ext2/super.c
···
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	kfree(sbi);
 }
 
···
 	}
 	sb->s_fs_info = sbi;
 	sbi->s_sb_block = sb_block;
-	sbi->s_daxdev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->s_dax_part_off);
+	sbi->s_daxdev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->s_dax_part_off,
+					   NULL, NULL);
 
 	spin_lock_init(&sbi->s_lock);
 	ret = -EINVAL;
···
 failed_mount:
 	brelse(bh);
 failed_sbi:
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
 	kfree(sbi);
+5 -4
fs/ext4/super.c
···
 	if (sbi->s_chksum_driver)
 		crypto_free_shash(sbi->s_chksum_driver);
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
 #if IS_ENABLED(CONFIG_UNICODE)
 	utf8_unload(sb->s_encoding);
···
 		return;
 
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	kfree(sbi);
 }
 
···
 	if (!sbi)
 		return NULL;
 
-	sbi->s_daxdev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->s_dax_part_off);
+	sbi->s_daxdev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->s_dax_part_off,
+					   NULL, NULL);
 
 	sbi->s_blockgroup_lock =
 		kzalloc(sizeof(struct blockgroup_lock), GFP_KERNEL);
···
 	sbi->s_sb = sb;
 	return sbi;
 err_out:
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	kfree(sbi);
 	return NULL;
 }
+3 -2
fs/xfs/xfs_buf.c
···
 	list_lru_destroy(&btp->bt_lru);
 
 	blkdev_issue_flush(btp->bt_bdev);
-	fs_put_dax(btp->bt_daxdev);
+	fs_put_dax(btp->bt_daxdev, NULL);
 
 	kmem_free(btp);
 }
···
 	btp->bt_mount = mp;
 	btp->bt_dev = bdev->bd_dev;
 	btp->bt_bdev = bdev;
-	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
+	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, NULL,
+					    NULL);
 
 	/*
 	 * Buffer IO error rate limiting. Limit it to no more than 10 messages
+25 -8
include/linux/dax.h
···
 		void *addr, size_t bytes, struct iov_iter *iter);
 };
 
+struct dax_holder_operations {
+	/*
+	 * notify_failure - notify memory failure into inner holder device
+	 * @dax_dev: the dax device which contains the holder
+	 * @offset: offset on this dax device where memory failure occurs
+	 * @len: length of this memory failure event
+	 * @flags: action flags for memory failure handler
+	 */
+	int (*notify_failure)(struct dax_device *dax_dev, u64 offset,
+			u64 len, int mf_flags);
+};
+
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
···
 	return dax_synchronous(dax_dev);
 }
 #else
+static inline void *dax_holder(struct dax_device *dax_dev)
+{
+	return NULL;
+}
 static inline struct dax_device *alloc_dax(void *private,
 		const struct dax_operations *ops)
 {
···
 #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
 int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk);
 void dax_remove_host(struct gendisk *disk);
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev,
-		u64 *start_off);
-static inline void fs_put_dax(struct dax_device *dax_dev)
-{
-	put_dax(dax_dev);
-}
+struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off,
+		void *holder, const struct dax_holder_operations *ops);
+void fs_put_dax(struct dax_device *dax_dev, void *holder);
 #else
 static inline int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk)
 {
···
 {
 }
 static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev,
-		u64 *start_off)
+		u64 *start_off, void *holder,
+		const struct dax_holder_operations *ops)
 {
 	return NULL;
 }
-static inline void fs_put_dax(struct dax_device *dax_dev)
+static inline void fs_put_dax(struct dax_device *dax_dev, void *holder)
 {
 }
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
···
 		size_t bytes, struct iov_iter *i);
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 		size_t nr_pages);
+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off, u64 len,
+		int mf_flags);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,