
Merge tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
"This fixes multiple problems in the reserve pool sizing functions: an
incorrect free space calculation, a pointless infinite loop, and even
more braindamage that could result in the pool being overfilled. The
pile of patches from Dave fixes myriad races and UAF bugs in the log
recovery code that, much to our mutual surprise, nobody's tripped over.
Dave also fixed a performance optimization that had turned into a
regression.

Dave Chinner is taking over as XFS maintainer starting Sunday and
lasting until 5.19-rc1 is tagged so that I can focus on starting a
massive design review for the (feature complete after five years)
online repair feature. From then on, he and I will be moving XFS to a
co-maintainership model by trading duties every other release.

NOTE: I hope very strongly that the other pieces of the (X)FS
ecosystem (fstests and xfsprogs) will make similar changes to spread
their maintenance load.

Summary:

- Fix an incorrect free space calculation in xfs_reserve_blocks that
could lead to a request for free blocks that will never succeed.

- Fix a hang in xfs_reserve_blocks caused by an infinite loop and the
incorrect free space calculation.

- Fix yet a third problem in xfs_reserve_blocks where multiple racing
threads can overfill the reserve pool.

- Fix an accounting error that led to us reporting reserved space as
"available".

- Fix a race condition during abnormal fs shutdown that could cause
UAF problems when memory reclaim and log shutdown try to clean up
inodes.

- Fix a bug where log shutdown can race with unmount to tear down the
log, thereby causing UAF errors.

- Disentangle log and filesystem shutdown to reduce confusion.

- Fix some confusion in xfs_trans_commit such that a race between
transaction commit and filesystem shutdown can cause unlogged dirty
inode metadata to be committed, thereby corrupting the filesystem.

- Remove a performance optimization in the log, as it was discovered
that certain storage hardware handles async log flushes so poorly as
to cause serious performance regressions. Recent restructuring of
other parts of the logging code means that no performance benefit is
seen on hardware that handles it well"
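The racing-shutdown fix above hinges on a simple rule: only the first caller tears the log down, and any racing caller must wait until the shutdown state is fully set before returning. A toy single-process model of that rule (the `toy_*` names are invented for this sketch; this is not the kernel code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy model of the shutdown gate: shutdown_started models the
 * XFS_OPSTATE_SHUTDOWN bit, shutdown_complete models the point where
 * xlog_is_shutdown() becomes true for everyone.
 */
struct toy_log {
	atomic_bool shutdown_started;
	atomic_bool shutdown_complete;
};

/* Returns true if this caller performed the shutdown work itself. */
static bool toy_force_shutdown(struct toy_log *log)
{
	if (atomic_exchange(&log->shutdown_started, true)) {
		/* Racing caller: spin until the winner finishes
		 * (models xlog_shutdown_wait()). */
		while (!atomic_load(&log->shutdown_complete))
			;
		return false;
	}
	/* ... winner tears down log state here ... */
	atomic_store(&log->shutdown_complete, true);
	return true;
}
```

The key property is that a loser never returns before `shutdown_complete` is set, so callers can rely on the shutdown state being visible after any `toy_force_shutdown()` call returns.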

* tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: drop async cache flushes from CIL commits.
xfs: shutdown during log recovery needs to mark the log shutdown
xfs: xfs_trans_commit() path must check for log shutdown
xfs: xfs_do_force_shutdown needs to block racing shutdowns
xfs: log shutdown triggers should only shut down the log
xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
xfs: shutdown in intent recovery has non-intent items in the AIL
xfs: aborting inodes on shutdown may need buffer lock
xfs: don't report reserved bnobt space as available
xfs: fix overfilling of reserve pool
xfs: always succeed at setting the reserve pool size
xfs: remove infinite loop when reserving free block pool
xfs: don't include bnobt blocks when reserving free block pool
xfs: document the XFS_ALLOC_AGFL_RESERVE constant

+349 -248
+23 -5
fs/xfs/libxfs/xfs_alloc.c
···
 	}
 
 /*
+ * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
+ * guarantee that we can refill the AGFL prior to allocating space in a nearly
+ * full AG. Although the space described by the free space btrees, the
+ * blocks used by the freesp btrees themselves, and the blocks owned by the
+ * AGFL are counted in the ondisk fdblocks, it's a mistake to let the ondisk
+ * free space in the AG drop so low that the free space btrees cannot refill an
+ * empty AGFL up to the minimum level. Rather than grind through empty AGs
+ * until the fs goes down, we subtract this many AG blocks from the incore
+ * fdblocks to ensure user allocation does not overcommit the space the
+ * filesystem needs for the AGFLs. The rmap btree uses a per-AG reservation to
+ * withhold space from xfs_mod_fdblocks, so we do not account for that here.
+ */
+#define XFS_ALLOCBT_AGFL_RESERVE	4
+
+/*
+ * Compute the number of blocks that we set aside to guarantee the ability to
+ * refill the AGFL and handle a full bmap btree split.
+ *
  * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
  * AGF buffer (PV 947395), we place constraints on the relationship among
  * actual allocations for data blocks, freelist blocks, and potential file data
···
  * extents need to be actually allocated. To get around this, we explicitly set
  * aside a few blocks which will not be reserved in delayed allocation.
  *
- * We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to handle a
- * potential split of the file's bmap btree.
+ * For each AG, we need to reserve enough blocks to replenish a totally empty
+ * AGFL and 4 more to handle a potential split of the file's bmap btree.
  */
 unsigned int
 xfs_alloc_set_aside(
 	struct xfs_mount	*mp)
 {
-	return mp->m_sb.sb_agcount * (XFS_ALLOC_AGFL_RESERVE + 4);
+	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
 }
 
 /*
···
 	unsigned int		blocks;
 
 	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
-	blocks += XFS_ALLOC_AGFL_RESERVE;
+	blocks += XFS_ALLOCBT_AGFL_RESERVE;
 	blocks += 3;			/* AGF, AGI btree root blocks */
 	if (xfs_has_finobt(mp))
 		blocks++;		/* finobt root block */
 	if (xfs_has_rmapbt(mp))
-		blocks++; /* rmap root block */
+		blocks++;		/* rmap root block */
 	if (xfs_has_reflink(mp))
 		blocks++;		/* refcount root block */
 
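The set-aside arithmetic in the patch above is straightforward: per AG, reserve enough blocks to refill an empty AGFL plus 4 for a potential bmap btree split. A standalone userspace restatement of that formula (names prefixed `toy_` are illustrative, not the kernel symbols):

```c
#include <assert.h>

/* Mirrors XFS_ALLOCBT_AGFL_RESERVE from the patch above. */
#define TOY_ALLOCBT_AGFL_RESERVE 4

/*
 * Per-AG set-aside: AGFL refill reserve plus 4 blocks for a potential
 * split of a file's bmap btree, multiplied by the number of AGs.
 */
static unsigned int toy_alloc_set_aside(unsigned int agcount)
{
	return agcount * (TOY_ALLOCBT_AGFL_RESERVE + 4);
}
```

So a 16-AG filesystem withholds 16 * 8 = 128 blocks from the incore free-space count.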
-1
fs/xfs/libxfs/xfs_alloc.h
···
 #define XFS_ALLOC_NOBUSY	(1 << 2)/* Busy extents not allowed */
 
 /* freespace limit calculations */
-#define XFS_ALLOC_AGFL_RESERVE	4
 unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
 unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
 
-33
fs/xfs/xfs_bio_io.c
···
 	return bio_max_segs(howmany(count, PAGE_SIZE));
 }
 
-static void
-xfs_flush_bdev_async_endio(
-	struct bio		*bio)
-{
-	complete(bio->bi_private);
-}
-
-/*
- * Submit a request for an async cache flush to run. If the request queue does
- * not require flush operations, just skip it altogether. If the caller needs
- * to wait for the flush completion at a later point in time, they must supply a
- * valid completion. This will be signalled when the flush completes. The
- * caller never sees the bio that is issued here.
- */
-void
-xfs_flush_bdev_async(
-	struct bio		*bio,
-	struct block_device	*bdev,
-	struct completion	*done)
-{
-	struct request_queue	*q = bdev->bd_disk->queue;
-
-	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
-		complete(done);
-		return;
-	}
-
-	bio_init(bio, bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC);
-	bio->bi_private = done;
-	bio->bi_end_io = xfs_flush_bdev_async_endio;
-
-	submit_bio(bio);
-}
 int
 xfs_rw_bdev(
 	struct block_device	*bdev,
+27 -33
fs/xfs/xfs_fsops.c
···
 #include "xfs_fsops.h"
 #include "xfs_trans_space.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 #include "xfs_trace.h"
···
 	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
 	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
 	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
+						xfs_fdblocks_unavailable(mp);
 
 	spin_lock(&mp->m_sb_lock);
 	cnt->freertx = mp->m_sb.sb_frextents;
···
 	 * If the request is larger than the current reservation, reserve the
 	 * blocks before we update the reserve counters. Sample m_fdblocks and
 	 * perform a partial reservation if the request exceeds free space.
+	 *
+	 * The code below estimates how many blocks it can request from
+	 * fdblocks to stash in the reserve pool. This is a classic TOCTOU
+	 * race since fdblocks updates are not always coordinated via
+	 * m_sb_lock. Set the reserve size even if there's not enough free
+	 * space to fill it because mod_fdblocks will refill an undersized
+	 * reserve when it can.
 	 */
-	error = -ENOSPC;
-	do {
-		free = percpu_counter_sum(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
-		if (free <= 0)
-			break;
-
-		delta = request - mp->m_resblks;
-		lcounter = free - delta;
-		if (lcounter < 0)
-			/* We can't satisfy the request, just get what we can */
-			fdblks_delta = free;
-		else
-			fdblks_delta = delta;
-
+	free = percpu_counter_sum(&mp->m_fdblocks) -
+						xfs_fdblocks_unavailable(mp);
+	delta = request - mp->m_resblks;
+	mp->m_resblks = request;
+	if (delta > 0 && free > 0) {
 		/*
 		 * We'll either succeed in getting space from the free block
-		 * count or we'll get an ENOSPC. If we get a ENOSPC, it means
-		 * things changed while we were calculating fdblks_delta and so
-		 * we should try again to see if there is anything left to
-		 * reserve.
+		 * count or we'll get an ENOSPC. Don't set the reserved flag
+		 * here - we don't want to reserve the extra reserve blocks
+		 * from the reserve.
 		 *
-		 * Don't set the reserved flag here - we don't want to reserve
-		 * the extra reserve blocks from the reserve.....
+		 * The desired reserve size can change after we drop the lock.
+		 * Use mod_fdblocks to put the space into the reserve or into
+		 * fdblocks as appropriate.
 		 */
+		fdblks_delta = min(free, delta);
 		spin_unlock(&mp->m_sb_lock);
 		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
+		if (!error)
+			xfs_mod_fdblocks(mp, fdblks_delta, 0);
 		spin_lock(&mp->m_sb_lock);
-	} while (error == -ENOSPC);
-
-	/*
-	 * Update the reserve counters if blocks have been successfully
-	 * allocated.
-	 */
-	if (!error && fdblks_delta) {
-		mp->m_resblks += fdblks_delta;
-		mp->m_resblks_avail += fdblks_delta;
 	}
-
 out:
 	if (outval) {
 		outval->resblks = mp->m_resblks;
···
 	int		tag;
 	const char	*why;
 
-	if (test_and_set_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate))
+
+	if (test_and_set_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate)) {
+		xlog_shutdown_wait(mp->m_log);
 		return;
+	}
 	if (mp->m_sb_bp)
 		mp->m_sb_bp->b_flags |= XBF_DONE;
 
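The reworked reserve sizing above replaces the retry-until-ENOSPC loop with a single attempt: record the requested pool size unconditionally, then move at most `min(free, delta)` blocks into the pool, leaving an undersized reserve to be topped up later as space frees. A simplified userspace model of that control flow (the `toy_*` struct and fields are invented for this sketch and omit the locking and percpu counters):

```c
#include <assert.h>

static long long toy_min(long long a, long long b)
{
	return a < b ? a : b;
}

struct toy_mount {
	long long fdblocks;	/* free blocks available */
	long long resblks;	/* desired reserve pool size */
	long long resavail;	/* blocks actually in the reserve */
};

/*
 * Always succeed at setting the reserve pool size; only move as many
 * blocks as are actually free, instead of looping on -ENOSPC.
 */
static void toy_reserve_blocks(struct toy_mount *mp, long long request)
{
	long long free = mp->fdblocks;
	long long delta = request - mp->resblks;

	mp->resblks = request;
	if (delta > 0 && free > 0) {
		long long move = toy_min(free, delta);

		mp->fdblocks -= move;
		mp->resavail += move;
	}
}
```

With 100 free blocks, a request of 40 fills the pool completely; a follow-up request of 200 only moves the remaining 60 blocks, leaving `resavail < resblks` for later refill, and the clamped `min(free, delta)` transfer is what prevents the overfill the summary describes.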
+1 -1
fs/xfs/xfs_icache.c
···
 	 */
 	if (xlog_is_shutdown(ip->i_mount->m_log)) {
 		xfs_iunpin_wait(ip);
-		xfs_iflush_abort(ip);
+		xfs_iflush_shutdown_abort(ip);
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip))
+1 -1
fs/xfs/xfs_inode.c
···
 
 	/*
 	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
-	 * can remove itself from the list.
+	 * will remove itself from the list.
 	 */
 	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
 		iip = (struct xfs_inode_log_item *)lip;
+134 -30
fs/xfs/xfs_inode_item.c
···
 	uint			rval = XFS_ITEM_SUCCESS;
 	int			error;
 
-	ASSERT(iip->ili_item.li_buf);
+	if (!bp || (ip->i_flags & XFS_ISTALE)) {
+		/*
+		 * Inode item/buffer is being aborted due to cluster
+		 * buffer deletion. Trigger a log force to have that operation
+		 * completed and items removed from the AIL before the next push
+		 * attempt.
+		 */
+		return XFS_ITEM_PINNED;
+	}
 
-	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
-	    (ip->i_flags & XFS_ISTALE))
+	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp))
 		return XFS_ITEM_PINNED;
 
 	if (xfs_iflags_test(ip, XFS_IFLUSHING))
···
 }
 
 /*
- * This is the inode flushing abort routine. It is called when
- * the filesystem is shutting down to clean up the inode state. It is
- * responsible for removing the inode item from the AIL if it has not been
- * re-logged and clearing the inode's flush state.
+ * Clear the inode logging fields so no more flushes are attempted. If we are
+ * on a buffer list, it is now safe to remove it because the buffer is
+ * guaranteed to be locked. The caller will drop the reference to the buffer
+ * the log item held.
+ */
+static void
+xfs_iflush_abort_clean(
+	struct xfs_inode_log_item *iip)
+{
+	iip->ili_last_fields = 0;
+	iip->ili_fields = 0;
+	iip->ili_fsync_fields = 0;
+	iip->ili_flush_lsn = 0;
+	iip->ili_item.li_buf = NULL;
+	list_del_init(&iip->ili_item.li_bio_list);
+}
+
+/*
+ * Abort flushing the inode from a context holding the cluster buffer locked.
+ *
+ * This is the normal runtime method of aborting writeback of an inode that is
+ * attached to a cluster buffer. It occurs when the inode and the backing
+ * cluster buffer have been freed (i.e. inode is XFS_ISTALE), or when cluster
+ * flushing or buffer IO completion encounters a log shutdown situation.
+ *
+ * If we need to abort inode writeback and we don't already hold the buffer
+ * locked, call xfs_iflush_shutdown_abort() instead as this should only ever be
+ * necessary in a shutdown situation.
  */
 void
 xfs_iflush_abort(
 	struct xfs_inode	*ip)
 {
 	struct xfs_inode_log_item *iip = ip->i_itemp;
-	struct xfs_buf		*bp = NULL;
+	struct xfs_buf		*bp;
 
-	if (iip) {
-		/*
-		 * Clear the failed bit before removing the item from the AIL so
-		 * xfs_trans_ail_delete() doesn't try to clear and release the
-		 * buffer attached to the log item before we are done with it.
-		 */
-		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
-		xfs_trans_ail_delete(&iip->ili_item, 0);
-
-		/*
-		 * Clear the inode logging fields so no more flushes are
-		 * attempted.
-		 */
-		spin_lock(&iip->ili_lock);
-		iip->ili_last_fields = 0;
-		iip->ili_fields = 0;
-		iip->ili_fsync_fields = 0;
-		iip->ili_flush_lsn = 0;
-		bp = iip->ili_item.li_buf;
-		iip->ili_item.li_buf = NULL;
-		list_del_init(&iip->ili_item.li_bio_list);
-		spin_unlock(&iip->ili_lock);
+	if (!iip) {
+		/* clean inode, nothing to do */
+		xfs_iflags_clear(ip, XFS_IFLUSHING);
+		return;
 	}
+
+	/*
+	 * Remove the inode item from the AIL before we clear its internal
+	 * state. Whilst the inode is in the AIL, it should have a valid buffer
+	 * pointer for push operations to access - it is only safe to remove the
+	 * inode from the buffer once it has been removed from the AIL.
+	 *
+	 * We also clear the failed bit before removing the item from the AIL
+	 * as xfs_trans_ail_delete()->xfs_clear_li_failed() will release buffer
+	 * references the inode item owns and needs to hold until we've fully
+	 * aborted the inode log item and detached it from the buffer.
+	 */
+	clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
+	xfs_trans_ail_delete(&iip->ili_item, 0);
+
+	/*
+	 * Grab the inode buffer so can we release the reference the inode log
+	 * item holds on it.
+	 */
+	spin_lock(&iip->ili_lock);
+	bp = iip->ili_item.li_buf;
+	xfs_iflush_abort_clean(iip);
+	spin_unlock(&iip->ili_lock);
 
 	xfs_iflags_clear(ip, XFS_IFLUSHING);
 	if (bp)
 		xfs_buf_rele(bp);
 }
+
+/*
+ * Abort an inode flush in the case of a shutdown filesystem. This can be called
+ * from anywhere with just an inode reference and does not require holding the
+ * inode cluster buffer locked. If the inode is attached to a cluster buffer,
+ * it will grab and lock it safely, then abort the inode flush.
+ */
+void
+xfs_iflush_shutdown_abort(
+	struct xfs_inode	*ip)
+{
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct xfs_buf		*bp;
+
+	if (!iip) {
+		/* clean inode, nothing to do */
+		xfs_iflags_clear(ip, XFS_IFLUSHING);
+		return;
+	}
+
+	spin_lock(&iip->ili_lock);
+	bp = iip->ili_item.li_buf;
+	if (!bp) {
+		spin_unlock(&iip->ili_lock);
+		xfs_iflush_abort(ip);
+		return;
+	}
+
+	/*
+	 * We have to take a reference to the buffer so that it doesn't get
+	 * freed when we drop the ili_lock and then wait to lock the buffer.
+	 * We'll clean up the extra reference after we pick up the ili_lock
+	 * again.
+	 */
+	xfs_buf_hold(bp);
+	spin_unlock(&iip->ili_lock);
+	xfs_buf_lock(bp);
+
+	spin_lock(&iip->ili_lock);
+	if (!iip->ili_item.li_buf) {
+		/*
+		 * Raced with another removal, hold the only reference
+		 * to bp now. Inode should not be in the AIL now, so just clean
+		 * up and return;
+		 */
+		ASSERT(list_empty(&iip->ili_item.li_bio_list));
+		ASSERT(!test_bit(XFS_LI_IN_AIL, &iip->ili_item.li_flags));
+		xfs_iflush_abort_clean(iip);
+		spin_unlock(&iip->ili_lock);
+		xfs_iflags_clear(ip, XFS_IFLUSHING);
+		xfs_buf_relse(bp);
+		return;
+	}
+
+	/*
+	 * Got two references to bp. The first will get dropped by
+	 * xfs_iflush_abort() when the item is removed from the buffer list, but
+	 * we can't drop our reference until _abort() returns because we have to
+	 * unlock the buffer as well. Hence we abort and then unlock and release
+	 * our reference to the buffer.
+	 */
+	ASSERT(iip->ili_item.li_buf == bp);
+	spin_unlock(&iip->ili_lock);
+	xfs_iflush_abort(ip);
+	xfs_buf_relse(bp);
+}
+
 
 /*
  * convert an xfs_inode_log_format struct from the old 32 bit version
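The reference dance in `xfs_iflush_shutdown_abort()` above (take an extra hold on the buffer before dropping the item lock, lock the buffer, then recheck `li_buf` in case a racing abort detached it) can be walked through single-threaded with plain counters. This sketch is illustrative only; the `toy_*` types are invented stand-ins, with no real locking:

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

struct toy_buf {
	int	refcount;
	bool	locked;
};

struct toy_item {
	struct toy_buf	*li_buf;	/* reference owned by the item */
};

/* Models xfs_iflush_abort(): detach the buffer, drop the item's ref. */
static void toy_iflush_abort(struct toy_item *iip)
{
	struct toy_buf *bp = iip->li_buf;

	iip->li_buf = NULL;
	if (bp)
		bp->refcount--;
}

/* Models xfs_iflush_shutdown_abort(): safe without the buffer locked. */
static void toy_iflush_shutdown_abort(struct toy_item *iip)
{
	struct toy_buf *bp = iip->li_buf;

	if (!bp) {
		toy_iflush_abort(iip);
		return;
	}
	bp->refcount++;		/* extra hold: bp can't be freed under us */
	bp->locked = true;	/* models xfs_buf_lock() */
	if (!iip->li_buf) {	/* recheck: a racing abort may have detached */
		bp->locked = false;
		bp->refcount--;
		return;
	}
	toy_iflush_abort(iip);	/* drops the item's reference */
	bp->locked = false;	/* drop our lock and extra hold */
	bp->refcount--;
}
```

Starting with a buffer whose only reference is held by the item, the sequence ends with the item detached, the buffer unlocked, and the refcount balanced at zero, which is the invariant the real code needs to avoid the UAF.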
+1
fs/xfs/xfs_inode_item.h
···
 extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
 extern void xfs_inode_item_destroy(struct xfs_inode *);
 extern void xfs_iflush_abort(struct xfs_inode *);
+extern void xfs_iflush_shutdown_abort(struct xfs_inode *);
 extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
 		struct xfs_inode_log_format *);
 
-2
fs/xfs/xfs_linux.h
···
 
 int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
 		char *data, unsigned int op);
-void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev,
-		struct completion *done);
 
 #define ASSERT_ALWAYS(expr)	\
 	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
+59 -52
fs/xfs/xfs_log.c
···
  * Run all the pending iclog callbacks and wake log force waiters and iclog
  * space waiters so they can process the newly set shutdown state. We really
  * don't care what order we process callbacks here because the log is shut down
- * and so state cannot change on disk anymore.
+ * and so state cannot change on disk anymore. However, we cannot wake waiters
+ * until the callbacks have been processed because we may be in unmount and
+ * we must ensure that all AIL operations the callbacks perform have completed
+ * before we tear down the AIL.
  *
  * We avoid processing actively referenced iclogs so that we don't run callbacks
  * while the iclog owner might still be preparing the iclog for IO submssion.
···
 	struct xlog_in_core	*iclog;
 	LIST_HEAD(cb_list);
 
-	spin_lock(&log->l_icloglock);
 	iclog = log->l_iclog;
 	do {
 		if (atomic_read(&iclog->ic_refcnt)) {
···
 			continue;
 		}
 		list_splice_init(&iclog->ic_callbacks, &cb_list);
+		spin_unlock(&log->l_icloglock);
+
+		xlog_cil_process_committed(&cb_list);
+
+		spin_lock(&log->l_icloglock);
 		wake_up_all(&iclog->ic_write_wait);
 		wake_up_all(&iclog->ic_force_wait);
 	} while ((iclog = iclog->ic_next) != log->l_iclog);
 
 	wake_up_all(&log->l_flush_wait);
-	spin_unlock(&log->l_icloglock);
-
-	xlog_cil_process_committed(&cb_list);
 }
 
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
  * it is in the WANT_SYNC state.
- *
- * If the caller passes in a non-zero @old_tail_lsn and the current log tail
- * does not match, there may be metadata on disk that must be persisted before
- * this iclog is written. To satisfy that requirement, set the
- * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
- * log tail value.
  *
  * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
  * log tail is updated correctly. NEED_FUA indicates that the iclog will be
···
  * always capture the tail lsn on the iclog on the first NEED_FUA release
  * regardless of the number of active reference counts on this iclog.
  */
-
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	xfs_lsn_t		old_tail_lsn)
+	struct xlog_in_core	*iclog)
 {
 	xfs_lsn_t		tail_lsn;
 	bool			last_ref;
···
 	/*
 	 * Grabbing the current log tail needs to be atomic w.r.t. the writing
 	 * of the tail LSN into the iclog so we guarantee that the log tail does
-	 * not move between deciding if a cache flush is required and writing
-	 * the LSN into the iclog below.
+	 * not move between the first time we know that the iclog needs to be
+	 * made stable and when we eventually submit it.
 	 */
-	if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) {
+	if ((iclog->ic_state == XLOG_STATE_WANT_SYNC ||
+	     (iclog->ic_flags & XLOG_ICL_NEED_FUA)) &&
+	    !iclog->ic_header.h_tail_lsn) {
 		tail_lsn = xlog_assign_tail_lsn(log->l_mp);
-
-		if (old_tail_lsn && tail_lsn != old_tail_lsn)
-			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
-
-		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
-		    !iclog->ic_header.h_tail_lsn)
-			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	}
 
 	last_ref = atomic_dec_and_test(&iclog->ic_refcnt);
···
 		 * pending iclog callbacks that were waiting on the release of
 		 * this iclog.
 		 */
-		if (last_ref) {
-			spin_unlock(&log->l_icloglock);
+		if (last_ref)
 			xlog_state_shutdown_callbacks(log);
-			spin_lock(&log->l_icloglock);
-		}
 		return -EIO;
 	}
 
···
 	}
 
 	iclog->ic_state = XLOG_STATE_SYNCING;
-	if (!iclog->ic_header.h_tail_lsn)
-		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	xlog_verify_tail_lsn(log, iclog);
 	trace_xlog_iclog_syncing(iclog, _RET_IP_);
 
···
 	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
 	if (iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
-	return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
+	return xlog_state_release_iclog(iclog->ic_log, iclog);
 }
 
 /*
···
 	 */
 	if (XFS_TEST_ERROR(error, log->l_mp, XFS_ERRTAG_IODONE_IOERR)) {
 		xfs_alert(log->l_mp, "log I/O error %d", error);
-		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
+		xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
 	}
 
 	xlog_state_done_syncing(iclog);
···
 	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 
 	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
-		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
+		xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
 		return;
 	}
 	if (is_vmalloc_addr(iclog->ic_data))
···
 	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
 	       xlog_is_shutdown(log));
 release_iclog:
-	error = xlog_state_release_iclog(log, iclog, 0);
+	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
 	return error;
 }
···
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
 			"ctx ticket reservation ran out. Need to up reservation");
 		xlog_print_tic_res(log->l_mp, ticket);
-		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
+		xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
 	}
 
 	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
···
 
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
-	error = xlog_state_release_iclog(log, iclog, 0);
+	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
 
 	return error;
···
 	 * reference to the iclog.
 	 */
 	if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
-		error = xlog_state_release_iclog(log, iclog, 0);
+		error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
 	if (error)
 		return error;
···
 #endif
 
 /*
- * Perform a forced shutdown on the log. This should be called once and once
- * only by the high level filesystem shutdown code to shut the log subsystem
- * down cleanly.
+ * Perform a forced shutdown on the log.
+ *
+ * This can be called from low level log code to trigger a shutdown, or from the
+ * high level mount shutdown code when the mount shuts down.
  *
  * Our main objectives here are to make sure that:
  *	a. if the shutdown was not due to a log IO error, flush the logs to
···
  *	   parties to find out. Nothing new gets queued after this is done.
  *	c. Tasks sleeping on log reservations, pinned objects and
  *	   other resources get woken up.
+ *	d. The mount is also marked as shut down so that log triggered shutdowns
+ *	   still behave the same as if they called xfs_forced_shutdown().
  *
  * Return true if the shutdown cause was a log IO error and we actually shut the
  * log down.
···
 {
 	bool		log_error = (shutdown_flags & SHUTDOWN_LOG_IO_ERROR);
 
-	/*
-	 * If this happens during log recovery then we aren't using the runtime
-	 * log mechanisms yet so there's nothing to shut down.
-	 */
-	if (!log || xlog_in_recovery(log))
+	if (!log)
 		return false;
-
-	ASSERT(!xlog_is_shutdown(log));
 
 	/*
 	 * Flush all the completed transactions to disk before marking the log
···
 	 * before the force will prevent the log force from flushing the iclogs
 	 * to disk.
 	 *
-	 * Re-entry due to a log IO error shutdown during the log force is
-	 * prevented by the atomicity of higher level shutdown code.
+	 * When we are in recovery, there are no transactions to flush, and
+	 * we don't want to touch the log because we don't want to perturb the
+	 * current head/tail for future recovery attempts. Hence we need to
+	 * avoid a log force in this case.
+	 *
+	 * If we are shutting down due to a log IO error, then we must avoid
+	 * trying to write the log as that may just result in more IO errors and
+	 * an endless shutdown/force loop.
 	 */
-	if (!log_error)
+	if (!log_error && !xlog_in_recovery(log))
 		xfs_log_force(log->l_mp, XFS_LOG_SYNC);
 
 	/*
···
 	spin_lock(&log->l_icloglock);
 	if (test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) {
 		spin_unlock(&log->l_icloglock);
-		ASSERT(0);
 		return false;
 	}
 	spin_unlock(&log->l_icloglock);
+
+	/*
+	 * If this log shutdown also sets the mount shutdown state, issue a
+	 * shutdown warning message.
+	 */
+	if (!test_and_set_bit(XFS_OPSTATE_SHUTDOWN, &log->l_mp->m_opstate)) {
+		xfs_alert_tag(log->l_mp, XFS_PTAG_SHUTDOWN_LOGERROR,
+			"Filesystem has been shut down due to log error (0x%x).",
+			shutdown_flags);
+		xfs_alert(log->l_mp,
+			"Please unmount the filesystem and rectify the problem(s).");
+		if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
+			xfs_stack_trace();
+	}
 
 	/*
 	 * We don't want anybody waiting for log reservations after this. That
···
 	wake_up_all(&log->l_cilp->xc_start_wait);
 	wake_up_all(&log->l_cilp->xc_commit_wait);
 	spin_unlock(&log->l_cilp->xc_push_lock);
-	xlog_state_shutdown_callbacks(log);
 
+	spin_lock(&log->l_icloglock);
+	xlog_state_shutdown_callbacks(log);
+	spin_unlock(&log->l_icloglock);
+
+	wake_up_var(&log->l_opstate);
 	return log_error;
 }
 
+15 -31
fs/xfs/xfs_log_cil.c
···
 	spin_unlock(&cil->xc_cil_lock);
 
 	if (tp->t_ticket->t_curr_res < 0)
-		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
+		xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
 }
 
 static void
···
 		 * The LSN we need to pass to the log items on transaction
 		 * commit is the LSN reported by the first log vector write, not
 		 * the commit lsn. If we use the commit record lsn then we can
-		 * move the tail beyond the grant write head.
+		 * move the grant write head beyond the tail LSN and overwrite
+		 * it.
 		 */
 		ctx->start_lsn = lsn;
 		wake_up_all(&cil->xc_start_wait);
 		spin_unlock(&cil->xc_push_lock);
+
+		/*
+		 * Make sure the metadata we are about to overwrite in the log
+		 * has been flushed to stable storage before this iclog is
+		 * issued.
+		 */
+		spin_lock(&cil->xc_log->l_icloglock);
+		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+		spin_unlock(&cil->xc_log->l_icloglock);
 		return;
 	}
 
···
 
 	error = xlog_write(log, ctx, &vec, ctx->ticket, XLOG_COMMIT_TRANS);
 	if (error)
-		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
+		xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
 	return error;
 }
 
···
 	struct xfs_trans_header thdr;
 	struct xfs_log_iovec	lhdr;
 	struct xfs_log_vec	lvhdr = { NULL };
-	xfs_lsn_t		preflush_tail_lsn;
 	xfs_csn_t		push_seq;
-	struct bio		bio;
-	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			push_commit_stable;
 
 	new_ctx = xlog_cil_ctx_alloc();
···
 	 */
 	list_add(&ctx->committing, &cil->xc_committing);
 	spin_unlock(&cil->xc_push_lock);
-
-	/*
-	 * The CIL is stable at this point - nothing new will be added to it
-	 * because we hold the flush lock exclusively. Hence we can now issue
-	 * a cache flush to ensure all the completed metadata in the journal we
-	 * are about to overwrite is on stable storage.
-	 *
-	 * Because we are issuing this cache flush before we've written the
-	 * tail lsn to the iclog, we can have metadata IO completions move the
-	 * tail forwards between the completion of this flush and the iclog
-	 * being written. In this case, we need to re-issue the cache flush
-	 * before the iclog write. To detect whether the log tail moves, sample
-	 * the tail LSN *before* we issue the flush.
-	 */
-	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
-	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
-				&bdev_flush);
 
 	/*
 	 * Pull all the log vectors off the items in the CIL, and remove the
···
 	lvhdr.lv_iovecp = &lhdr;
 	lvhdr.lv_next = ctx->lv_chain;
 
-	/*
-	 * Before we format and submit the first iclog, we have to ensure that
-	 * the metadata writeback ordering cache flush is complete.
-	 */
-	wait_for_completion(&bdev_flush);
-
 	error = xlog_cil_write_chain(ctx, &lvhdr);
 	if (error)
 		goto out_abort_free_ticket;
···
 	if (push_commit_stable &&
 	    ctx->commit_iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(log, ctx->commit_iclog, 0);
-	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
+	xlog_state_release_iclog(log, ctx->commit_iclog);
 
 	/* Not safe to reference ctx now! */
 
···
 		return;
 	}
 	spin_lock(&log->l_icloglock);
-	xlog_state_release_iclog(log, ctx->commit_iclog, 0);
+	xlog_state_release_iclog(log, ctx->commit_iclog);
 	/* Not safe to reference ctx now! */
 	spin_unlock(&log->l_icloglock);
 }
+12 -2
fs/xfs/xfs_log_priv.h
···
484 484 return test_bit(XLOG_IO_ERROR, &log->l_opstate);
485 485 }
486 486
487 + /*
488 + * Wait until the xlog_force_shutdown() has marked the log as shut down
489 + * so xlog_is_shutdown() will always return true.
490 + */
491 + static inline void
492 + xlog_shutdown_wait(
493 + struct xlog *log)
494 + {
495 + wait_var_event(&log->l_opstate, xlog_is_shutdown(log));
496 + }
497 +
487 498 /* common routines */
488 499 extern int
489 500 xlog_recover(
···
535 524
536 525 void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
537 526 int eventual_size);
538 - int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog,
539 - xfs_lsn_t log_tail_lsn);
527 + int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
540 528
541 529 /*
542 530 * When we crack an atomic LSN, we sample it first so that the value will not
+20 -36
fs/xfs/xfs_log_recover.c
···
2485 2485 error = xfs_trans_alloc(mp, &resv, dfc->dfc_blkres,
2486 2486 dfc->dfc_rtxres, XFS_TRANS_RESERVE, &tp);
2487 2487 if (error) {
2488 - xfs_force_shutdown(mp, SHUTDOWN_LOG_IO_ERROR);
2488 + xlog_force_shutdown(mp->m_log, SHUTDOWN_LOG_IO_ERROR);
2489 2489 return error;
2490 2490 }
···
2519 2519 xfs_defer_ops_capture_free(mp, dfc);
2520 2520 }
2521 2521 }
2522 +
2522 2523 /*
2523 2524 * When this is called, all of the log intent items which did not have
2524 - * corresponding log done items should be in the AIL. What we do now
2525 - * is update the data structures associated with each one.
2525 + * corresponding log done items should be in the AIL. What we do now is update
2526 + * the data structures associated with each one.
2526 2527 *
2527 - * Since we process the log intent items in normal transactions, they
2528 - * will be removed at some point after the commit. This prevents us
2529 - * from just walking down the list processing each one. We'll use a
2530 - * flag in the intent item to skip those that we've already processed
2531 - * and use the AIL iteration mechanism's generation count to try to
2532 - * speed this up at least a bit.
2528 + * Since we process the log intent items in normal transactions, they will be
2529 + * removed at some point after the commit. This prevents us from just walking
2530 + * down the list processing each one. We'll use a flag in the intent item to
2531 + * skip those that we've already processed and use the AIL iteration mechanism's
2532 + * generation count to try to speed this up at least a bit.
2533 2533 *
2534 - * When we start, we know that the intents are the only things in the
2535 - * AIL. As we process them, however, other items are added to the
2536 - * AIL.
2534 + * When we start, we know that the intents are the only things in the AIL. As we
2535 + * process them, however, other items are added to the AIL. Hence we know we
2536 + * have started recovery on all the pending intents when we find a non-intent
2537 + * item in the AIL.
2537 2538 */
2538 2539 STATIC int
2539 2540 xlog_recover_process_intents(
···
2557 2556 for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
2558 2557 lip != NULL;
2559 2558 lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
2560 - /*
2561 - * We're done when we see something other than an intent.
2562 - * There should be no intents left in the AIL now.
2563 - */
2564 - if (!xlog_item_is_intent(lip)) {
2565 - #ifdef DEBUG
2566 - for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
2567 - ASSERT(!xlog_item_is_intent(lip));
2568 - #endif
2559 + if (!xlog_item_is_intent(lip))
2569 2560 break;
2570 - }
2571 2561
2572 2562 /*
2573 2563 * We should never see a redo item with a LSN higher than
···
2599 2607 }
2600 2608
2601 2609 /*
2602 - * A cancel occurs when the mount has failed and we're bailing out.
2603 - * Release all pending log intent items so they don't pin the AIL.
2610 + * A cancel occurs when the mount has failed and we're bailing out. Release all
2611 + * pending log intent items that we haven't started recovery on so they don't
2612 + * pin the AIL.
2604 2613 */
2605 2614 STATIC void
2606 2615 xlog_recover_cancel_intents(
···
2615 2622 spin_lock(&ailp->ail_lock);
2616 2623 lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
2617 2624 while (lip != NULL) {
2618 - /*
2619 - * We're done when we see something other than an intent.
2620 - * There should be no intents left in the AIL now.
2621 - */
2622 - if (!xlog_item_is_intent(lip)) {
2623 - #ifdef DEBUG
2624 - for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
2625 - ASSERT(!xlog_item_is_intent(lip));
2626 - #endif
2625 + if (!xlog_item_is_intent(lip))
2627 2626 break;
2628 - }
2629 2627
2630 2628 spin_unlock(&ailp->ail_lock);
2631 2629 lip->li_ops->iop_release(lip);
···
3454 3470 */
3455 3471 xlog_recover_cancel_intents(log);
3456 3472 xfs_alert(log->l_mp, "Failed to recover intents");
3457 - xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
3473 + xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
3458 3474 return error;
3459 3475 }
···
3501 3517 * end of intents processing can be pushed through the CIL
3502 3518 * and AIL.
3503 3519 */
3504 - xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
3520 + xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
3505 3521 }
3506 3522
3507 3523 return 0;
+2 -1
fs/xfs/xfs_mount.c
···
21 21 #include "xfs_trans.h"
22 22 #include "xfs_trans_priv.h"
23 23 #include "xfs_log.h"
24 + #include "xfs_log_priv.h"
24 25 #include "xfs_error.h"
25 26 #include "xfs_quota.h"
26 27 #include "xfs_fsops.h"
···
1147 1146 * problems (i.e. transaction abort, pagecache discards, etc.) than
1148 1147 * slightly premature -ENOSPC.
1149 1148 */
1150 - set_aside = mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
1149 + set_aside = xfs_fdblocks_unavailable(mp);
1151 1150 percpu_counter_add_batch(&mp->m_fdblocks, delta, batch);
1152 1151 if (__percpu_counter_compare(&mp->m_fdblocks, set_aside,
1153 1152 XFS_FDBLOCKS_BATCH) >= 0) {
+15
fs/xfs/xfs_mount.h
···
479 479 */
480 480 #define XFS_FDBLOCKS_BATCH 1024
481 481
482 + /*
483 + * Estimate the amount of free space that is not available to userspace and is
484 + * not explicitly reserved from the incore fdblocks. This includes:
485 + *
486 + * - The minimum number of blocks needed to support splitting a bmap btree
487 + * - The blocks currently in use by the freespace btrees because they record
488 + * the actual blocks that will fill per-AG metadata space reservations
489 + */
490 + static inline uint64_t
491 + xfs_fdblocks_unavailable(
492 + struct xfs_mount *mp)
493 + {
494 + return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
495 + }
496 +
482 497 extern int xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
483 498 bool reserved);
484 499 extern int xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
+2 -1
fs/xfs/xfs_super.c
···
815 815 spin_unlock(&mp->m_sb_lock);
816 816
817 817 /* make sure statp->f_bfree does not underflow */
818 - statp->f_bfree = max_t(int64_t, fdblocks - mp->m_alloc_set_aside, 0);
818 + statp->f_bfree = max_t(int64_t, 0,
819 + fdblocks - xfs_fdblocks_unavailable(mp));
819 820 statp->f_bavail = statp->f_bfree;
820 821
821 822 fakeinos = XFS_FSB_TO_INO(mp, statp->f_bfree);
+33 -15
fs/xfs/xfs_trans.c
···
836 836 bool regrant)
837 837 {
838 838 struct xfs_mount *mp = tp->t_mountp;
839 + struct xlog *log = mp->m_log;
839 840 xfs_csn_t commit_seq = 0;
840 841 int error = 0;
841 842 int sync = tp->t_flags & XFS_TRANS_SYNC;
···
865 864 if (!(tp->t_flags & XFS_TRANS_DIRTY))
866 865 goto out_unreserve;
867 866
868 - if (xfs_is_shutdown(mp)) {
867 + /*
868 + * We must check against log shutdown here because we cannot abort log
869 + * items and leave them dirty, inconsistent and unpinned in memory while
870 + * the log is active. This leaves them open to being written back to
871 + * disk, and that will lead to on-disk corruption.
872 + */
873 + if (xlog_is_shutdown(log)) {
869 874 error = -EIO;
870 875 goto out_unreserve;
871 876 }
···
885 878 xfs_trans_apply_sb_deltas(tp);
886 879 xfs_trans_apply_dquot_deltas(tp);
887 880
888 - xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
881 + xlog_cil_commit(log, tp, &commit_seq, regrant);
889 882
890 883 xfs_trans_free(tp);
891 884
···
912 905 */
913 906 xfs_trans_unreserve_and_mod_dquots(tp);
914 907 if (tp->t_ticket) {
915 - if (regrant && !xlog_is_shutdown(mp->m_log))
916 - xfs_log_ticket_regrant(mp->m_log, tp->t_ticket);
908 + if (regrant && !xlog_is_shutdown(log))
909 + xfs_log_ticket_regrant(log, tp->t_ticket);
917 910 else
918 - xfs_log_ticket_ungrant(mp->m_log, tp->t_ticket);
911 + xfs_log_ticket_ungrant(log, tp->t_ticket);
919 912 tp->t_ticket = NULL;
920 913 }
921 914 xfs_trans_free_items(tp, !!error);
···
933 926 }
934 927
935 928 /*
936 - * Unlock all of the transaction's items and free the transaction.
937 - * The transaction must not have modified any of its items, because
938 - * there is no way to restore them to their previous state.
929 + * Unlock all of the transaction's items and free the transaction. If the
930 + * transaction is dirty, we must shut down the filesystem because there is no
931 + * way to restore them to their previous state.
939 932 *
940 - * If the transaction has made a log reservation, make sure to release
941 - * it as well.
933 + * If the transaction has made a log reservation, make sure to release it as
934 + * well.
935 + *
936 + * This is a high level function (equivalent to xfs_trans_commit()) and so can
937 + * be called after the transaction has effectively been aborted due to the mount
938 + * being shut down. However, if the mount has not been shut down and the
939 + * transaction is dirty we will shut the mount down and, in doing so, that
940 + * guarantees that the log is shut down, too. Hence we don't need to be as
941 + * careful with shutdown state and dirty items here as we need to be in
942 + * xfs_trans_commit().
942 943 */
943 944 void
944 945 xfs_trans_cancel(
945 946 struct xfs_trans *tp)
946 947 {
947 948 struct xfs_mount *mp = tp->t_mountp;
949 + struct xlog *log = mp->m_log;
948 950 bool dirty = (tp->t_flags & XFS_TRANS_DIRTY);
950 952 trace_xfs_trans_cancel(tp, _RET_IP_);
···
971 955 }
972 956
973 957 /*
974 - * See if the caller is relying on us to shut down the
975 - * filesystem. This happens in paths where we detect
976 - * corruption and decide to give up.
958 + * See if the caller is relying on us to shut down the filesystem. We
959 + * only want an error report if there isn't already a shutdown in
960 + * progress, so we only need to check against the mount shutdown state
961 + * here.
977 962 */
978 963 if (dirty && !xfs_is_shutdown(mp)) {
979 964 XFS_ERROR_REPORT("xfs_trans_cancel", XFS_ERRLEVEL_LOW, mp);
980 965 xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
981 966 }
982 967 #ifdef DEBUG
983 - if (!dirty && !xfs_is_shutdown(mp)) {
968 + /* Log items need to be consistent until the log is shut down. */
969 + if (!dirty && !xlog_is_shutdown(log)) {
984 970 struct xfs_log_item *lip;
985 971
986 972 list_for_each_entry(lip, &tp->t_items, li_trans)
···
993 975 xfs_trans_unreserve_and_mod_dquots(tp);
994 976
995 977 if (tp->t_ticket) {
996 - xfs_log_ticket_ungrant(mp->m_log, tp->t_ticket);
978 + xfs_log_ticket_ungrant(log, tp->t_ticket);
997 979 tp->t_ticket = NULL;
998 980 }
999 981
+4 -4
fs/xfs/xfs_trans_ail.c
···
873 873 int shutdown_type)
874 874 {
875 875 struct xfs_ail *ailp = lip->li_ailp;
876 - struct xfs_mount *mp = ailp->ail_log->l_mp;
876 + struct xlog *log = ailp->ail_log;
877 877 xfs_lsn_t tail_lsn;
878 878
879 879 spin_lock(&ailp->ail_lock);
880 880 if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
881 881 spin_unlock(&ailp->ail_lock);
882 - if (shutdown_type && !xlog_is_shutdown(ailp->ail_log)) {
883 - xfs_alert_tag(mp, XFS_PTAG_AILDELETE,
882 + if (shutdown_type && !xlog_is_shutdown(log)) {
883 + xfs_alert_tag(log->l_mp, XFS_PTAG_AILDELETE,
884 884 "%s: attempting to delete a log item that is not in the AIL",
885 885 __func__);
886 - xfs_force_shutdown(mp, shutdown_type);
886 + xlog_force_shutdown(log, shutdown_type);
887 887 }
888 888 return;
889 889 }