Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull writeback error handling updates from Jeff Layton:
"This pile represents the bulk of the writeback error handling fixes
that I have for this cycle. Some of the earlier patches in this pile
may look trivial but they are prerequisites for later patches in the
series.

The aim of this set is to improve how we track and report writeback
errors to userland. Most applications that care about data integrity
will periodically call fsync/fdatasync/msync to ensure that their
writes have made it to the backing store.

For a very long time, we have tracked writeback errors using two flags
in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
writeback error occurs (via mapping_set_error) and are cleared as a
side-effect of filemap_check_errors (as you noted yesterday). This
model really sucks for userland.

Only the first task to call fsync (or msync or fdatasync) will see the
error. Any subsequent task calling fsync on a file will get back 0
(unless another writeback error occurs in the interim). If I have
several tasks writing to a file and calling fsync to ensure that their
writes got stored, then I need to have them coordinate with one
another. That's difficult enough, but in a world of containerized
setups that coordination may not even be possible.

But wait...it gets worse!

The calls to filemap_check_errors can be buried pretty far down in the
call stack, and there are internal callers of filemap_write_and_wait
and the like that also end up clearing those errors. Many of those
callers ignore the error return from that function or return it to
userland at nonsensical times (e.g. truncate() or stat()). If I get
back -EIO on a truncate, there is no reason to think that it was
because some previous writeback failed, and a subsequent fsync() will
(incorrectly) return 0.

This pile aims to do three things:

1) ensure that when a writeback error occurs, that error will be
reported to userland on a subsequent fsync/fdatasync/msync call,
regardless of what internal callers are doing

2) report writeback errors on all file descriptions that were open at
the time that the error occurred. This is a user-visible change,
but I think most applications are written to assume this behavior
anyway. Those that aren't are unlikely to be hurt by it.

3) document what filesystems should do when there is a writeback
error. Today, there is very little consistency between them, and a
lot of cargo-cult copying. We need to make it very clear what
filesystems should do in this situation.

To achieve this, the set adds a new data type (errseq_t) and then
builds new writeback error tracking infrastructure around that. Once
all of that is in place, we change the filesystems to use the new
infrastructure for reporting wb errors to userland.

Note that this is just the initial foray into cleaning up this mess.
There is a lot of work remaining here:

1) convert the rest of the filesystems in a similar fashion. Once the
initial set is in, then I think most other fs' will be fairly
simple to convert. Hopefully most of those can go in via individual
filesystem trees.

2) convert internal waiters on writeback to use errseq_t for
detecting errors instead of relying on the AS_* flags. I have some
draft patches for this for ext4, but they are not quite ready for
prime time yet.

This was a discussion topic this year at LSF/MM too. If you're
interested in the gory details, LWN has some good articles about this:

https://lwn.net/Articles/718734/
https://lwn.net/Articles/724307/"

* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
btrfs: minimal conversion to errseq_t writeback error reporting on fsync
xfs: minimal conversion to errseq_t writeback error reporting
ext4: use errseq_t based error handling for reporting data writeback errors
fs: convert __generic_file_fsync to use errseq_t based reporting
block: convert to errseq_t based writeback error tracking
dax: set errors in mapping when writeback fails
Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
fs: new infrastructure for writeback error handling and reporting
lib: add errseq_t type and infrastructure for handling it
mm: don't TestClearPageError in __filemap_fdatawait_range
mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
jbd2: don't clear and reset errors after waiting on writeback
buffer: set errors in mapping at the time that the error occurs
fs: check for writeback errors after syncing out buffers in generic_file_fsync
buffer: use mapping_set_error instead of setting the flag
mm: fix mapping_set_error call in me_pagecache_dirty

+572 -63
+41 -3
Documentation/filesystems/vfs.txt
···
  written at any point after PG_Dirty is clear. Once it is known to be
  safe, PG_Writeback is cleared.

- Writeback makes use of a writeback_control structure...
+ Writeback makes use of a writeback_control structure to direct the
+ operations. This gives the writepage and writepages operations some
+ information about the nature of and reason for the writeback request,
+ and the constraints under which it is being done. It is also used to
+ return information back to the caller about the result of a writepage
+ or writepages request.
+
+ Handling errors during writeback
+ --------------------------------
+ Most applications that do buffered I/O will periodically call a file
+ synchronization call (fsync, fdatasync, msync or sync_file_range) to
+ ensure that data written has made it to the backing store. When there
+ is an error during writeback, they expect that error to be reported when
+ a file sync request is made. After an error has been reported on one
+ request, subsequent requests on the same file descriptor should return
+ 0, unless further writeback errors have occurred since the previous file
+ synchronization.
+
+ Ideally, the kernel would report errors only on file descriptions on
+ which writes were done that subsequently failed to be written back. The
+ generic pagecache infrastructure does not track the file descriptions
+ that have dirtied each individual page however, so determining which
+ file descriptors should get back an error is not possible.
+
+ Instead, the generic writeback error tracking infrastructure in the
+ kernel settles for reporting errors to fsync on all file descriptions
+ that were open at the time that the error occurred. In a situation with
+ multiple writers, all of them will get back an error on a subsequent
+ fsync, even if all of the writes done through that particular file
+ descriptor succeeded (or even if there were no writes on that file
+ descriptor at all).
+
+ Filesystems that wish to use this infrastructure should call
+ mapping_set_error to record the error in the address_space when it
+ occurs. Then, after writing back data from the pagecache in their
+ file->fsync operation, they should call file_check_and_advance_wb_err to
+ ensure that the struct file's error cursor has advanced to the correct
+ point in the stream of errors emitted by the backing device(s).

  struct address_space_operations
  -------------------------------
···
  The File Object
  ===============

- A file object represents a file opened by a process.
+ A file object represents a file opened by a process. This is also known
+ as an "open file description" in POSIX parlance.

  struct file_operations
···
  release: called when the last reference to an open file is closed

- fsync: called by the fsync(2) system call
+ fsync: called by the fsync(2) system call. Also see the section above
+ entitled "Handling errors during writeback".

  fasync: called by the fcntl(2) system call when asynchronous
  (non-blocking) mode is enabled for a file
+6
MAINTAINERS
··· 5069 5069 F: drivers/video/fbdev/s1d13xxxfb.c 5070 5070 F: include/video/s1d13xxxfb.h 5071 5071 5072 + ERRSEQ ERROR TRACKING INFRASTRUCTURE 5073 + M: Jeff Layton <jlayton@poochiereds.net> 5074 + S: Maintained 5075 + F: lib/errseq.c 5076 + F: include/linux/errseq.h 5077 + 5072 5078 ET131X NETWORK DRIVER 5073 5079 M: Mark Einon <mark.einon@gmail.com> 5074 5080 S: Odd Fixes
+1
drivers/dax/device.c
··· 499 499 inode->i_mapping = __dax_inode->i_mapping; 500 500 inode->i_mapping->host = __dax_inode; 501 501 filp->f_mapping = inode->i_mapping; 502 + filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping); 502 503 filp->private_data = dev_dax; 503 504 inode->i_flags = S_DAX; 504 505
+2 -1
fs/block_dev.c
··· 632 632 struct block_device *bdev = I_BDEV(bd_inode); 633 633 int error; 634 634 635 - error = filemap_write_and_wait_range(filp->f_mapping, start, end); 635 + error = file_write_and_wait_range(filp, start, end); 636 636 if (error) 637 637 return error; 638 638 ··· 1751 1751 return -ENOMEM; 1752 1752 1753 1753 filp->f_mapping = bdev->bd_inode->i_mapping; 1754 + filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping); 1754 1755 1755 1756 return blkdev_get(bdev, filp->f_mode, filp); 1756 1757 }
+8 -5
fs/btrfs/file.c
··· 2032 2032 struct btrfs_root *root = BTRFS_I(inode)->root; 2033 2033 struct btrfs_trans_handle *trans; 2034 2034 struct btrfs_log_ctx ctx; 2035 - int ret = 0; 2035 + int ret = 0, err; 2036 2036 bool full_sync = 0; 2037 2037 u64 len; 2038 2038 ··· 2051 2051 */ 2052 2052 ret = start_ordered_ops(inode, start, end); 2053 2053 if (ret) 2054 - return ret; 2054 + goto out; 2055 2055 2056 2056 inode_lock(inode); 2057 2057 atomic_inc(&root->log_batch); ··· 2156 2156 * An ordered extent might have started before and completed 2157 2157 * already with io errors, in which case the inode was not 2158 2158 * updated and we end up here. So check the inode's mapping 2159 - * flags for any errors that might have happened while doing 2160 - * writeback of file data. 2159 + * for any errors that might have happened since we last 2160 + * checked called fsync. 2161 2161 */ 2162 - ret = filemap_check_errors(inode->i_mapping); 2162 + ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err); 2163 2163 inode_unlock(inode); 2164 2164 goto out; 2165 2165 } ··· 2248 2248 ret = btrfs_end_transaction(trans); 2249 2249 } 2250 2250 out: 2251 + err = file_check_and_advance_wb_err(file); 2252 + if (!ret) 2253 + ret = err; 2251 2254 return ret > 0 ? -EIO : ret; 2252 2255 } 2253 2256
+13 -7
fs/buffer.c
··· 178 178 set_buffer_uptodate(bh); 179 179 } else { 180 180 buffer_io_error(bh, ", lost sync page write"); 181 - set_buffer_write_io_error(bh); 181 + mark_buffer_write_io_error(bh); 182 182 clear_buffer_uptodate(bh); 183 183 } 184 184 unlock_buffer(bh); ··· 352 352 set_buffer_uptodate(bh); 353 353 } else { 354 354 buffer_io_error(bh, ", lost async page write"); 355 - mapping_set_error(page->mapping, -EIO); 356 - set_buffer_write_io_error(bh); 355 + mark_buffer_write_io_error(bh); 357 356 clear_buffer_uptodate(bh); 358 357 SetPageError(page); 359 358 } ··· 480 481 { 481 482 list_del_init(&bh->b_assoc_buffers); 482 483 WARN_ON(!bh->b_assoc_map); 483 - if (buffer_write_io_error(bh)) 484 - set_bit(AS_EIO, &bh->b_assoc_map->flags); 485 484 bh->b_assoc_map = NULL; 486 485 } 487 486 ··· 1177 1180 } 1178 1181 } 1179 1182 EXPORT_SYMBOL(mark_buffer_dirty); 1183 + 1184 + void mark_buffer_write_io_error(struct buffer_head *bh) 1185 + { 1186 + set_buffer_write_io_error(bh); 1187 + /* FIXME: do we need to set this in both places? */ 1188 + if (bh->b_page && bh->b_page->mapping) 1189 + mapping_set_error(bh->b_page->mapping, -EIO); 1190 + if (bh->b_assoc_map) 1191 + mapping_set_error(bh->b_assoc_map, -EIO); 1192 + } 1193 + EXPORT_SYMBOL(mark_buffer_write_io_error); 1180 1194 1181 1195 /* 1182 1196 * Decrement a buffer_head's reference count. If all buffers against a page ··· 3290 3282 3291 3283 bh = head; 3292 3284 do { 3293 - if (buffer_write_io_error(bh) && page->mapping) 3294 - mapping_set_error(page->mapping, -EIO); 3295 3285 if (buffer_busy(bh)) 3296 3286 goto failed; 3297 3287 bh = bh->b_this_page;
+3 -1
fs/dax.c
··· 855 855 856 856 ret = dax_writeback_one(bdev, dax_dev, mapping, 857 857 indices[i], pvec.pages[i]); 858 - if (ret < 0) 858 + if (ret < 0) { 859 + mapping_set_error(mapping, ret); 859 860 goto out; 861 + } 860 862 } 861 863 start_index = indices[pvec.nr - 1] + 1; 862 864 }
+1 -4
fs/ext2/file.c
··· 174 174 { 175 175 int ret; 176 176 struct super_block *sb = file->f_mapping->host->i_sb; 177 - struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; 178 177 179 178 ret = generic_file_fsync(file, start, end, datasync); 180 - if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { 179 + if (ret == -EIO) 181 180 /* We don't really know where the IO error happened... */ 182 181 ext2_error(sb, __func__, 183 182 "detected IO error when writing metadata buffers"); 184 - ret = -EIO; 185 - } 186 183 return ret; 187 184 } 188 185
+1 -1
fs/ext4/fsync.c
··· 124 124 goto out; 125 125 } 126 126 127 - ret = filemap_write_and_wait_range(inode->i_mapping, start, end); 127 + ret = file_write_and_wait_range(file, start, end); 128 128 if (ret) 129 129 return ret; 130 130 /*
+1
fs/file_table.c
··· 168 168 file->f_path = *path; 169 169 file->f_inode = path->dentry->d_inode; 170 170 file->f_mapping = path->dentry->d_inode->i_mapping; 171 + file->f_wb_err = filemap_sample_wb_err(file->f_mapping); 171 172 if ((mode & FMODE_READ) && 172 173 likely(fop->read || fop->read_iter)) 173 174 mode |= FMODE_CAN_READ;
+1 -1
fs/gfs2/lops.c
··· 180 180 bh = bh->b_this_page; 181 181 do { 182 182 if (error) 183 - set_buffer_write_io_error(bh); 183 + mark_buffer_write_io_error(bh); 184 184 unlock_buffer(bh); 185 185 next = bh->b_this_page; 186 186 size -= bh->b_size;
+4 -12
fs/jbd2/commit.c
··· 263 263 continue; 264 264 jinode->i_flags |= JI_COMMIT_RUNNING; 265 265 spin_unlock(&journal->j_list_lock); 266 - err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping); 267 - if (err) { 268 - /* 269 - * Because AS_EIO is cleared by 270 - * filemap_fdatawait_range(), set it again so 271 - * that user process can get -EIO from fsync(). 272 - */ 273 - mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO); 274 - 275 - if (!ret) 276 - ret = err; 277 - } 266 + err = filemap_fdatawait_keep_errors( 267 + jinode->i_vfs_inode->i_mapping); 268 + if (!ret) 269 + ret = err; 278 270 spin_lock(&journal->j_list_lock); 279 271 jinode->i_flags &= ~JI_COMMIT_RUNNING; 280 272 smp_mb();
+5 -1
fs/libfs.c
··· 974 974 int err; 975 975 int ret; 976 976 977 - err = filemap_write_and_wait_range(inode->i_mapping, start, end); 977 + err = file_write_and_wait_range(file, start, end); 978 978 if (err) 979 979 return err; 980 980 ··· 991 991 992 992 out: 993 993 inode_unlock(inode); 994 + /* check and advance again to catch errors after syncing out buffers */ 995 + err = file_check_and_advance_wb_err(file); 996 + if (ret == 0) 997 + ret = err; 994 998 return ret; 995 999 } 996 1000 EXPORT_SYMBOL(__generic_file_fsync);
+3
fs/open.c
··· 707 707 f->f_inode = inode; 708 708 f->f_mapping = inode->i_mapping; 709 709 710 + /* Ensure that we skip any errors that predate opening of the file */ 711 + f->f_wb_err = filemap_sample_wb_err(f->f_mapping); 712 + 710 713 if (unlikely(f->f_flags & O_PATH)) { 711 714 f->f_mode = FMODE_PATH; 712 715 f->f_op = &empty_fops;
+1 -1
fs/xfs/xfs_file.c
··· 140 140 141 141 trace_xfs_file_fsync(ip); 142 142 143 - error = filemap_write_and_wait_range(inode->i_mapping, start, end); 143 + error = file_write_and_wait_range(file, start, end); 144 144 if (error) 145 145 return error; 146 146
+1
include/linux/buffer_head.h
··· 149 149 */ 150 150 151 151 void mark_buffer_dirty(struct buffer_head *bh); 152 + void mark_buffer_write_io_error(struct buffer_head *bh); 152 153 void init_buffer(struct buffer_head *, bh_end_io_t *, void *); 153 154 void touch_buffer(struct buffer_head *bh); 154 155 void set_bh_page(struct buffer_head *bh,
+19
include/linux/errseq.h
··· 1 + #ifndef _LINUX_ERRSEQ_H 2 + #define _LINUX_ERRSEQ_H 3 + 4 + /* See lib/errseq.c for more info */ 5 + 6 + typedef u32 errseq_t; 7 + 8 + errseq_t __errseq_set(errseq_t *eseq, int err); 9 + static inline void errseq_set(errseq_t *eseq, int err) 10 + { 11 + /* Optimize for the common case of no error */ 12 + if (unlikely(err)) 13 + __errseq_set(eseq, err); 14 + } 15 + 16 + errseq_t errseq_sample(errseq_t *eseq); 17 + int errseq_check(errseq_t *eseq, errseq_t since); 18 + int errseq_check_and_advance(errseq_t *eseq, errseq_t *since); 19 + #endif
+60 -1
include/linux/fs.h
··· 32 32 #include <linux/workqueue.h> 33 33 #include <linux/delayed_call.h> 34 34 #include <linux/uuid.h> 35 + #include <linux/errseq.h> 35 36 36 37 #include <asm/byteorder.h> 37 38 #include <uapi/linux/fs.h> ··· 402 401 gfp_t gfp_mask; /* implicit gfp mask for allocations */ 403 402 struct list_head private_list; /* ditto */ 404 403 void *private_data; /* ditto */ 404 + errseq_t wb_err; 405 405 } __attribute__((aligned(sizeof(long)))); 406 406 /* 407 407 * On most architectures that alignment is already the case; but ··· 881 879 struct list_head f_tfile_llink; 882 880 #endif /* #ifdef CONFIG_EPOLL */ 883 881 struct address_space *f_mapping; 882 + errseq_t f_wb_err; 884 883 } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */ 885 884 886 885 struct file_handle { ··· 2539 2536 extern int filemap_fdatawrite(struct address_space *); 2540 2537 extern int filemap_flush(struct address_space *); 2541 2538 extern int filemap_fdatawait(struct address_space *); 2542 - extern void filemap_fdatawait_keep_errors(struct address_space *); 2539 + extern int filemap_fdatawait_keep_errors(struct address_space *mapping); 2543 2540 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart, 2544 2541 loff_t lend); 2545 2542 extern bool filemap_range_has_page(struct address_space *, loff_t lstart, ··· 2552 2549 extern int filemap_fdatawrite_range(struct address_space *mapping, 2553 2550 loff_t start, loff_t end); 2554 2551 extern int filemap_check_errors(struct address_space *mapping); 2552 + 2553 + extern void __filemap_set_wb_err(struct address_space *mapping, int err); 2554 + extern int __must_check file_check_and_advance_wb_err(struct file *file); 2555 + extern int __must_check file_write_and_wait_range(struct file *file, 2556 + loff_t start, loff_t end); 2557 + 2558 + /** 2559 + * filemap_set_wb_err - set a writeback error on an address_space 2560 + * @mapping: mapping in which to set writeback error 2561 + * @err: error to be set in 
mapping 2562 + * 2563 + * When writeback fails in some way, we must record that error so that 2564 + * userspace can be informed when fsync and the like are called. We endeavor 2565 + * to report errors on any file that was open at the time of the error. Some 2566 + * internal callers also need to know when writeback errors have occurred. 2567 + * 2568 + * When a writeback error occurs, most filesystems will want to call 2569 + * filemap_set_wb_err to record the error in the mapping so that it will be 2570 + * automatically reported whenever fsync is called on the file. 2571 + * 2572 + * FIXME: mention FS_* flag here? 2573 + */ 2574 + static inline void filemap_set_wb_err(struct address_space *mapping, int err) 2575 + { 2576 + /* Fastpath for common case of no error */ 2577 + if (unlikely(err)) 2578 + __filemap_set_wb_err(mapping, err); 2579 + } 2580 + 2581 + /** 2582 + * filemap_check_wb_error - has an error occurred since the mark was sampled? 2583 + * @mapping: mapping to check for writeback errors 2584 + * @since: previously-sampled errseq_t 2585 + * 2586 + * Grab the errseq_t value from the mapping, and see if it has changed "since" 2587 + * the given value was sampled. 2588 + * 2589 + * If it has then report the latest error set, otherwise return 0. 2590 + */ 2591 + static inline int filemap_check_wb_err(struct address_space *mapping, 2592 + errseq_t since) 2593 + { 2594 + return errseq_check(&mapping->wb_err, since); 2595 + } 2596 + 2597 + /** 2598 + * filemap_sample_wb_err - sample the current errseq_t to test for later errors 2599 + * @mapping: mapping to be sampled 2600 + * 2601 + * Writeback errors are always reported relative to a particular sample point 2602 + * in the past. This function provides those sample points. 
2603 + */ 2604 + static inline errseq_t filemap_sample_wb_err(struct address_space *mapping) 2605 + { 2606 + return errseq_sample(&mapping->wb_err); 2607 + } 2555 2608 2556 2609 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end, 2557 2610 int datasync);
+25 -6
include/linux/pagemap.h
··· 28 28 AS_NO_WRITEBACK_TAGS = 5, 29 29 }; 30 30 31 + /** 32 + * mapping_set_error - record a writeback error in the address_space 33 + * @mapping - the mapping in which an error should be set 34 + * @error - the error to set in the mapping 35 + * 36 + * When writeback fails in some way, we must record that error so that 37 + * userspace can be informed when fsync and the like are called. We endeavor 38 + * to report errors on any file that was open at the time of the error. Some 39 + * internal callers also need to know when writeback errors have occurred. 40 + * 41 + * When a writeback error occurs, most filesystems will want to call 42 + * mapping_set_error to record the error in the mapping so that it can be 43 + * reported when the application calls fsync(2). 44 + */ 31 45 static inline void mapping_set_error(struct address_space *mapping, int error) 32 46 { 33 - if (unlikely(error)) { 34 - if (error == -ENOSPC) 35 - set_bit(AS_ENOSPC, &mapping->flags); 36 - else 37 - set_bit(AS_EIO, &mapping->flags); 38 - } 47 + if (likely(!error)) 48 + return; 49 + 50 + /* Record in wb_err for checkers using errseq_t based tracking */ 51 + filemap_set_wb_err(mapping, error); 52 + 53 + /* Record it in flags for now, for legacy callers */ 54 + if (error == -ENOSPC) 55 + set_bit(AS_ENOSPC, &mapping->flags); 56 + else 57 + set_bit(AS_EIO, &mapping->flags); 39 58 } 40 59 41 60 static inline void mapping_set_unevictable(struct address_space *mapping)
+57
include/trace/events/filemap.h
··· 10 10 #include <linux/memcontrol.h> 11 11 #include <linux/device.h> 12 12 #include <linux/kdev_t.h> 13 + #include <linux/errseq.h> 13 14 14 15 DECLARE_EVENT_CLASS(mm_filemap_op_page_cache, 15 16 ··· 53 52 TP_ARGS(page) 54 53 ); 55 54 55 + TRACE_EVENT(filemap_set_wb_err, 56 + TP_PROTO(struct address_space *mapping, errseq_t eseq), 57 + 58 + TP_ARGS(mapping, eseq), 59 + 60 + TP_STRUCT__entry( 61 + __field(unsigned long, i_ino) 62 + __field(dev_t, s_dev) 63 + __field(errseq_t, errseq) 64 + ), 65 + 66 + TP_fast_assign( 67 + __entry->i_ino = mapping->host->i_ino; 68 + __entry->errseq = eseq; 69 + if (mapping->host->i_sb) 70 + __entry->s_dev = mapping->host->i_sb->s_dev; 71 + else 72 + __entry->s_dev = mapping->host->i_rdev; 73 + ), 74 + 75 + TP_printk("dev=%d:%d ino=0x%lx errseq=0x%x", 76 + MAJOR(__entry->s_dev), MINOR(__entry->s_dev), 77 + __entry->i_ino, __entry->errseq) 78 + ); 79 + 80 + TRACE_EVENT(file_check_and_advance_wb_err, 81 + TP_PROTO(struct file *file, errseq_t old), 82 + 83 + TP_ARGS(file, old), 84 + 85 + TP_STRUCT__entry( 86 + __field(struct file *, file); 87 + __field(unsigned long, i_ino) 88 + __field(dev_t, s_dev) 89 + __field(errseq_t, old) 90 + __field(errseq_t, new) 91 + ), 92 + 93 + TP_fast_assign( 94 + __entry->file = file; 95 + __entry->i_ino = file->f_mapping->host->i_ino; 96 + if (file->f_mapping->host->i_sb) 97 + __entry->s_dev = 98 + file->f_mapping->host->i_sb->s_dev; 99 + else 100 + __entry->s_dev = 101 + file->f_mapping->host->i_rdev; 102 + __entry->old = old; 103 + __entry->new = file->f_wb_err; 104 + ), 105 + 106 + TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x", 107 + __entry->file, MAJOR(__entry->s_dev), 108 + MINOR(__entry->s_dev), __entry->i_ino, __entry->old, 109 + __entry->new) 110 + ); 56 111 #endif /* _TRACE_FILEMAP_H */ 57 112 58 113 /* This part must be outside protection */
+1 -1
lib/Makefile
··· 38 38 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \ 39 39 bsearch.o find_bit.o llist.o memweight.o kfifo.o \ 40 40 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \ 41 - once.o refcount.o usercopy.o 41 + once.o refcount.o usercopy.o errseq.o 42 42 obj-y += string_helpers.o 43 43 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o 44 44 obj-y += hexdump.o
+208
lib/errseq.c
··· 1 + #include <linux/err.h> 2 + #include <linux/bug.h> 3 + #include <linux/atomic.h> 4 + #include <linux/errseq.h> 5 + 6 + /* 7 + * An errseq_t is a way of recording errors in one place, and allowing any 8 + * number of "subscribers" to tell whether it has changed since a previous 9 + * point where it was sampled. 10 + * 11 + * It's implemented as an unsigned 32-bit value. The low order bits are 12 + * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits 13 + * are used as a counter. This is done with atomics instead of locking so that 14 + * these functions can be called from any context. 15 + * 16 + * The general idea is for consumers to sample an errseq_t value. That value 17 + * can later be used to tell whether any new errors have occurred since that 18 + * sampling was done. 19 + * 20 + * Note that there is a risk of collisions if new errors are being recorded 21 + * frequently, since we have so few bits to use as a counter. 22 + * 23 + * To mitigate this, one bit is used as a flag to tell whether the value has 24 + * been sampled since a new value was recorded. That allows us to avoid bumping 25 + * the counter if no one has sampled it since the last time an error was 26 + * recorded. 27 + * 28 + * A new errseq_t should always be zeroed out. A errseq_t value of all zeroes 29 + * is the special (but common) case where there has never been an error. An all 30 + * zero value thus serves as the "epoch" if one wishes to know whether there 31 + * has ever been an error set since it was first initialized. 
32 + */ 33 + 34 + /* The low bits are designated for error code (max of MAX_ERRNO) */ 35 + #define ERRSEQ_SHIFT ilog2(MAX_ERRNO + 1) 36 + 37 + /* This bit is used as a flag to indicate whether the value has been seen */ 38 + #define ERRSEQ_SEEN (1 << ERRSEQ_SHIFT) 39 + 40 + /* The lowest bit of the counter */ 41 + #define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1)) 42 + 43 + /** 44 + * __errseq_set - set a errseq_t for later reporting 45 + * @eseq: errseq_t field that should be set 46 + * @err: error to set 47 + * 48 + * This function sets the error in *eseq, and increments the sequence counter 49 + * if the last sequence was sampled at some point in the past. 50 + * 51 + * Any error set will always overwrite an existing error. 52 + * 53 + * Most callers will want to use the errseq_set inline wrapper to efficiently 54 + * handle the common case where err is 0. 55 + * 56 + * We do return an errseq_t here, primarily for debugging purposes. The return 57 + * value should not be used as a previously sampled value in later calls as it 58 + * will not have the SEEN flag set. 59 + */ 60 + errseq_t __errseq_set(errseq_t *eseq, int err) 61 + { 62 + errseq_t cur, old; 63 + 64 + /* MAX_ERRNO must be able to serve as a mask */ 65 + BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO + 1); 66 + 67 + /* 68 + * Ensure the error code actually fits where we want it to go. If it 69 + * doesn't then just throw a warning and don't record anything. We 70 + * also don't accept zero here as that would effectively clear a 71 + * previous error. 
72 + */ 73 + old = READ_ONCE(*eseq); 74 + 75 + if (WARN(unlikely(err == 0 || (unsigned int)-err > MAX_ERRNO), 76 + "err = %d\n", err)) 77 + return old; 78 + 79 + for (;;) { 80 + errseq_t new; 81 + 82 + /* Clear out error bits and set new error */ 83 + new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err; 84 + 85 + /* Only increment if someone has looked at it */ 86 + if (old & ERRSEQ_SEEN) 87 + new += ERRSEQ_CTR_INC; 88 + 89 + /* If there would be no change, then call it done */ 90 + if (new == old) { 91 + cur = new; 92 + break; 93 + } 94 + 95 + /* Try to swap the new value into place */ 96 + cur = cmpxchg(eseq, old, new); 97 + 98 + /* 99 + * Call it success if we did the swap or someone else beat us 100 + * to it for the same value. 101 + */ 102 + if (likely(cur == old || cur == new)) 103 + break; 104 + 105 + /* Raced with an update, try again */ 106 + old = cur; 107 + } 108 + return cur; 109 + } 110 + EXPORT_SYMBOL(__errseq_set); 111 + 112 + /** 113 + * errseq_sample - grab current errseq_t value 114 + * @eseq: pointer to errseq_t to be sampled 115 + * 116 + * This function allows callers to sample an errseq_t value, marking it as 117 + * "seen" if required. 118 + */ 119 + errseq_t errseq_sample(errseq_t *eseq) 120 + { 121 + errseq_t old = READ_ONCE(*eseq); 122 + errseq_t new = old; 123 + 124 + /* 125 + * For the common case of no errors ever having been set, we can skip 126 + * marking the SEEN bit. Once an error has been set, the value will 127 + * never go back to zero. 128 + */ 129 + if (old != 0) { 130 + new |= ERRSEQ_SEEN; 131 + if (old != new) 132 + cmpxchg(eseq, old, new); 133 + } 134 + return new; 135 + } 136 + EXPORT_SYMBOL(errseq_sample); 137 + 138 + /** 139 + * errseq_check - has an error occurred since a particular sample point? 
140 + * @eseq: pointer to errseq_t value to be checked 141 + * @since: previously-sampled errseq_t from which to check 142 + * 143 + * Grab the value that eseq points to, and see if it has changed "since" 144 + * the given value was sampled. The "since" value is not advanced, so there 145 + * is no need to mark the value as seen. 146 + * 147 + * Returns the latest error set in the errseq_t or 0 if it hasn't changed. 148 + */ 149 + int errseq_check(errseq_t *eseq, errseq_t since) 150 + { 151 + errseq_t cur = READ_ONCE(*eseq); 152 + 153 + if (likely(cur == since)) 154 + return 0; 155 + return -(cur & MAX_ERRNO); 156 + } 157 + EXPORT_SYMBOL(errseq_check); 158 + 159 + /** 160 + * errseq_check_and_advance - check an errseq_t and advance to current value 161 + * @eseq: pointer to value being checked and reported 162 + * @since: pointer to previously-sampled errseq_t to check against and advance 163 + * 164 + * Grab the eseq value, and see whether it matches the value that "since" 165 + * points to. If it does, then just return 0. 166 + * 167 + * If it doesn't, then the value has changed. Set the "seen" flag, and try to 168 + * swap it into place as the new eseq value. Then, set that value as the new 169 + * "since" value, and return whatever the error portion is set to. 170 + * 171 + * Note that no locking is provided here for concurrent updates to the "since" 172 + * value. The caller must provide that if necessary. Because of this, callers 173 + * may want to do a lockless errseq_check before taking the lock and calling 174 + * this. 175 + */ 176 + int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) 177 + { 178 + int err = 0; 179 + errseq_t old, new; 180 + 181 + /* 182 + * Most callers will want to use the inline wrapper to check this, 183 + * so that the common case of no error is handled without needing 184 + * to take the lock that protects the "since" value. 
185 + */ 186 + old = READ_ONCE(*eseq); 187 + if (old != *since) { 188 + /* 189 + * Set the flag and try to swap it into place if it has 190 + * changed. 191 + * 192 + * We don't care about the outcome of the swap here. If the 193 + * swap doesn't occur, then it has either been updated by a 194 + * writer who is altering the value in some way (updating 195 + * counter or resetting the error), or another reader who is 196 + * just setting the "seen" flag. Either outcome is OK, and we 197 + * can advance "since" and return an error based on what we 198 + * have. 199 + */ 200 + new = old | ERRSEQ_SEEN; 201 + if (new != old) 202 + cmpxchg(eseq, old, new); 203 + *since = new; 204 + err = -(new & MAX_ERRNO); 205 + } 206 + return err; 207 + } 208 + EXPORT_SYMBOL(errseq_check_and_advance);
+109 -17
mm/filemap.c
···
 }
 EXPORT_SYMBOL(filemap_check_errors);

+static int filemap_check_and_keep_errors(struct address_space *mapping)
+{
+	/* Check for outstanding write errors */
+	if (test_bit(AS_EIO, &mapping->flags))
+		return -EIO;
+	if (test_bit(AS_ENOSPC, &mapping->flags))
+		return -ENOSPC;
+	return 0;
+}
+
 /**
  * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
  * @mapping: address space structure to write
···
 }
 EXPORT_SYMBOL(filemap_range_has_page);

-static int __filemap_fdatawait_range(struct address_space *mapping,
+static void __filemap_fdatawait_range(struct address_space *mapping,
 				     loff_t start_byte, loff_t end_byte)
 {
 	pgoff_t index = start_byte >> PAGE_SHIFT;
 	pgoff_t end = end_byte >> PAGE_SHIFT;
 	struct pagevec pvec;
 	int nr_pages;
-	int ret = 0;

 	if (end_byte < start_byte)
-		goto out;
+		return;

 	pagevec_init(&pvec, 0);
 	while ((index <= end) &&
···
 				continue;

 			wait_on_page_writeback(page);
-			if (TestClearPageError(page))
-				ret = -EIO;
+			ClearPageError(page);
 		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-out:
-	return ret;
 }

 /**
···
 int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 			    loff_t end_byte)
 {
-	int ret, ret2;
-
-	ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
-	ret2 = filemap_check_errors(mapping);
-	if (!ret)
-		ret = ret2;
-
-	return ret;
+	__filemap_fdatawait_range(mapping, start_byte, end_byte);
+	return filemap_check_errors(mapping);
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);

···
  * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
  * fsfreeze(8)
  */
-void filemap_fdatawait_keep_errors(struct address_space *mapping)
+int filemap_fdatawait_keep_errors(struct address_space *mapping)
 {
 	loff_t i_size = i_size_read(mapping->host);

 	if (i_size == 0)
-		return;
+		return 0;

 	__filemap_fdatawait_range(mapping, 0, i_size - 1);
+	return filemap_check_and_keep_errors(mapping);
 }
+EXPORT_SYMBOL(filemap_fdatawait_keep_errors);

 /**
  * filemap_fdatawait - wait for all under-writeback pages to complete
···
 			int err2 = filemap_fdatawait(mapping);
 			if (!err)
 				err = err2;
+		} else {
+			/* Clear any previously stored errors */
+			filemap_check_errors(mapping);
 		}
 	} else {
 		err = filemap_check_errors(mapping);
···
 							 lstart, lend);
 			if (!err)
 				err = err2;
+		} else {
+			/* Clear any previously stored errors */
+			filemap_check_errors(mapping);
 		}
 	} else {
 		err = filemap_check_errors(mapping);
···
 	return err;
 }
 EXPORT_SYMBOL(filemap_write_and_wait_range);
+
+void __filemap_set_wb_err(struct address_space *mapping, int err)
+{
+	errseq_t eseq = __errseq_set(&mapping->wb_err, err);
+
+	trace_filemap_set_wb_err(mapping, eseq);
+}
+EXPORT_SYMBOL(__filemap_set_wb_err);
+
+/**
+ * file_check_and_advance_wb_err - report wb error (if any) that was previously
+ *				   reported and advance wb_err to current one
+ * @file: struct file on which the error is being reported
+ *
+ * When userland calls fsync (or something like nfsd does the equivalent), we
+ * want to report any writeback errors that occurred since the last fsync (or
+ * since the file was opened if there haven't been any).
+ *
+ * Grab the wb_err from the mapping. If it matches what we have in the file,
+ * then just quickly return 0. The file is all caught up.
+ *
+ * If it doesn't match, then take the mapping value, set the "seen" flag in
+ * it and try to swap it into place. If it works, or another task beat us
+ * to it with the new value, then update the f_wb_err and return the error
+ * portion. The error at this point must be reported via proper channels
+ * (a'la fsync, or NFS COMMIT operation, etc.).
+ *
+ * While we handle mapping->wb_err with atomic operations, the f_wb_err
+ * value is protected by the f_lock since we must ensure that it reflects
+ * the latest value swapped in for this file descriptor.
+ */
+int file_check_and_advance_wb_err(struct file *file)
+{
+	int err = 0;
+	errseq_t old = READ_ONCE(file->f_wb_err);
+	struct address_space *mapping = file->f_mapping;
+
+	/* Locklessly handle the common case where nothing has changed */
+	if (errseq_check(&mapping->wb_err, old)) {
+		/* Something changed, must use slow path */
+		spin_lock(&file->f_lock);
+		old = file->f_wb_err;
+		err = errseq_check_and_advance(&mapping->wb_err,
+					       &file->f_wb_err);
+		trace_file_check_and_advance_wb_err(file, old);
+		spin_unlock(&file->f_lock);
+	}
+	return err;
+}
+EXPORT_SYMBOL(file_check_and_advance_wb_err);
+
+/**
+ * file_write_and_wait_range - write out & wait on a file range
+ * @file: file pointing to address_space with pages
+ * @lstart: offset in bytes where the range starts
+ * @lend: offset in bytes where the range ends (inclusive)
+ *
+ * Write out and wait upon file offsets lstart->lend, inclusive.
+ *
+ * Note that @lend is inclusive (describes the last byte to be written) so
+ * that this function can be used to write to the very end-of-file (end = -1).
+ *
+ * After writing out and waiting on the data, we check and advance the
+ * f_wb_err cursor to the latest value, and return any errors detected there.
+ */
+int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
+{
+	int err = 0, err2;
+	struct address_space *mapping = file->f_mapping;
+
+	if ((!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
+		err = __filemap_fdatawrite_range(mapping, lstart, lend,
+						 WB_SYNC_ALL);
+		/* See comment of filemap_write_and_wait() */
+		if (err != -EIO)
+			__filemap_fdatawait_range(mapping, lstart, lend);
+	}
+	err2 = file_check_and_advance_wb_err(file);
+	if (!err)
+		err = err2;
+	return err;
+}
+EXPORT_SYMBOL(file_write_and_wait_range);

 /**
  * replace_page_cache_page - replace a pagecache page with a new one
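The user-visible effect of file_check_and_advance_wb_err() can be shown with a hypothetical miniature model (all `mini_*` names are invented for illustration): the mapping carries a stamped error, and each open file carries its own cursor, sampled at open time like f_wb_err. The real code packs the stamp and errno into one errseq_t word and protects f_wb_err with f_lock; this single-threaded sketch omits both, and it bumps the stamp on every error rather than only after the previous one was seen.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical miniature model (not kernel code): mapping->wb_err becomes a
 * (generation, errno) stamp; each open file keeps its own cursor into it. */
struct mini_mapping { unsigned gen; int err; };
struct mini_file    { struct mini_mapping *mapping; unsigned seen_gen; };

static void mini_set_wb_err(struct mini_mapping *m, int err)
{
	m->gen++;		/* each new writeback error advances the stamp */
	m->err = err;
}

static void mini_open(struct mini_file *f, struct mini_mapping *m)
{
	f->mapping = m;
	f->seen_gen = m->gen;	/* like f_wb_err sampled at open time */
}

/* Analogue of file_check_and_advance_wb_err(): report an error only if the
 * mapping's stamp moved past this file's cursor, then catch the cursor up. */
static int mini_fsync_check(struct mini_file *f)
{
	if (f->seen_gen == f->mapping->gen)
		return 0;	/* caught up: nothing new since last check */
	f->seen_gen = f->mapping->gen;
	return f->mapping->err;
}
```

In this model every open file sees a writeback error exactly once, and a file opened after the error never sees it, which is the semantics the series gives fsync.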
+1 -1
mm/memory-failure.c
···
 		 * the first EIO, but we're not worse than other parts
 		 * of the kernel.
 		 */
-		mapping_set_error(mapping, EIO);
+		mapping_set_error(mapping, -EIO);
 	}

 	return me_pagecache_clean(p, pfn);
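The one-character fix above matters more now than it used to: mapping_set_error() treats any nonzero value as an error, but once errors feed into the errseq encoding, the value must be a negative errno because it is negated and masked into the low bits of the stored word. A minimal illustration of the sign convention, assuming the 12-bit errno encoding (`encode_err`/`decode_err` are hypothetical helpers, not kernel functions):

```c
#include <assert.h>
#include <errno.h>

#define MAX_ERRNO 4095U	/* errnos occupy the low 12 bits of the stored word */

/* Hypothetical helper mirroring how errseq stores an error: the caller must
 * pass a NEGATIVE errno, which is negated into the low bits of the word. */
static unsigned int encode_err(int err)
{
	return (unsigned int)-err & MAX_ERRNO;
}

/* Recover the negative errno from the stored word. */
static int decode_err(unsigned int word)
{
	return -(int)(word & MAX_ERRNO);
}
```

Passing -EIO round-trips cleanly; passing positive EIO would be negated and masked into a meaningless value, so the error reported back to fsync would be garbage.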