
Merge tag 'repair-auto-reap-space-reservations-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA

xfs: reserve disk space for online repairs

Online repair fixes metadata structures by writing a new copy out to
disk and atomically committing the new structure into the filesystem.
For this to work, we need to reserve all the space we're going to need
ahead of time so that the atomic commit transaction is as small as
possible. We also require the reserved space to be freed if the system
goes down, or if we decide not to commit the repair, or if we reserve
too much space.

To keep the atomic commit transaction as small as possible, we would
like to allocate some space and simultaneously schedule automatic
reaping of the reserved space, even on log recovery. EFIs are the
mechanism to get us there, but we need to use them in a novel manner.
Once we allocate the space, we want to hold on to the EFI (relogging as
necessary) until we can commit or cancel the repair. EFIs for written
committed blocks need to go away, but unwritten or uncommitted blocks
can be freed like normal.

Earlier versions of this patchset directly manipulated the log items,
but Dave thought that to be a layering violation. For v27, I've
modified the defer ops handling code to be capable of pausing a deferred
work item. Log intent items are created as they always have been, but
paused items are pushed onto a side list when finishing deferred work
items, and pushed back onto the transaction afterwards. Log intent done
items are not created for paused work.

The second part adds a "stale" flag to the EFI so that the repair
reservation code can dispose of an EFI the normal way, but without the
space actually being freed.

This has been lightly tested with fstests. Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'repair-auto-reap-space-reservations-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: force small EFIs for reaping btree extents
xfs: log EFIs for all btree blocks being used to stage a btree
xfs: implement block reservation accounting for btrees we're staging
xfs: remove unused fields from struct xbtree_ifakeroot
xfs: automatic freeing of freshly allocated unwritten space
xfs: remove __xfs_free_extent_later
xfs: allow pausing of pending deferred work items
xfs: don't append work items to logged xfs_defer_pending objects

+1007 -76
+1
fs/xfs/Makefile
···
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y += $(addprefix scrub/, \
 	agheader_repair.o \
+	newbt.o \
 	reap.o \
 	repair.o \
 	)
+1 -1
fs/xfs/libxfs/xfs_ag.c
···
 	if (err2 != -ENOSPC)
 		goto resv_err;
 
-	err2 = __xfs_free_extent_later(*tpp, args.fsbno, delta, NULL,
+	err2 = xfs_free_extent_later(*tpp, args.fsbno, delta, NULL,
 			XFS_AG_RESV_NONE, true);
 	if (err2)
 		goto resv_err;
+100 -4
fs/xfs/libxfs/xfs_alloc.c
··· 2522 2522 * Add the extent to the list of extents to be free at transaction end. 2523 2523 * The list is maintained sorted (by block number). 2524 2524 */ 2525 - int 2526 - __xfs_free_extent_later( 2525 + static int 2526 + xfs_defer_extent_free( 2527 2527 struct xfs_trans *tp, 2528 2528 xfs_fsblock_t bno, 2529 2529 xfs_filblks_t len, 2530 2530 const struct xfs_owner_info *oinfo, 2531 2531 enum xfs_ag_resv_type type, 2532 - bool skip_discard) 2532 + bool skip_discard, 2533 + struct xfs_defer_pending **dfpp) 2533 2534 { 2534 2535 struct xfs_extent_free_item *xefi; 2535 2536 struct xfs_mount *mp = tp->t_mountp; ··· 2578 2577 XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len); 2579 2578 2580 2579 xfs_extent_free_get_group(mp, xefi); 2581 - xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list); 2580 + *dfpp = xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list); 2582 2581 return 0; 2582 + } 2583 + 2584 + int 2585 + xfs_free_extent_later( 2586 + struct xfs_trans *tp, 2587 + xfs_fsblock_t bno, 2588 + xfs_filblks_t len, 2589 + const struct xfs_owner_info *oinfo, 2590 + enum xfs_ag_resv_type type, 2591 + bool skip_discard) 2592 + { 2593 + struct xfs_defer_pending *dontcare = NULL; 2594 + 2595 + return xfs_defer_extent_free(tp, bno, len, oinfo, type, skip_discard, 2596 + &dontcare); 2597 + } 2598 + 2599 + /* 2600 + * Set up automatic freeing of unwritten space in the filesystem. 2601 + * 2602 + * This function attached a paused deferred extent free item to the 2603 + * transaction. Pausing means that the EFI will be logged in the next 2604 + * transaction commit, but the pending EFI will not be finished until the 2605 + * pending item is unpaused. 2606 + * 2607 + * If the system goes down after the EFI has been persisted to the log but 2608 + * before the pending item is unpaused, log recovery will find the EFI, fail to 2609 + * find the EFD, and free the space. 
2610 + * 2611 + * If the pending item is unpaused, the next transaction commit will log an EFD 2612 + * without freeing the space. 2613 + * 2614 + * Caller must ensure that the tp, fsbno, len, oinfo, and resv flags of the 2615 + * @args structure are set to the relevant values. 2616 + */ 2617 + int 2618 + xfs_alloc_schedule_autoreap( 2619 + const struct xfs_alloc_arg *args, 2620 + bool skip_discard, 2621 + struct xfs_alloc_autoreap *aarp) 2622 + { 2623 + int error; 2624 + 2625 + error = xfs_defer_extent_free(args->tp, args->fsbno, args->len, 2626 + &args->oinfo, args->resv, skip_discard, &aarp->dfp); 2627 + if (error) 2628 + return error; 2629 + 2630 + xfs_defer_item_pause(args->tp, aarp->dfp); 2631 + return 0; 2632 + } 2633 + 2634 + /* 2635 + * Cancel automatic freeing of unwritten space in the filesystem. 2636 + * 2637 + * Earlier, we created a paused deferred extent free item and attached it to 2638 + * this transaction so that we could automatically roll back a new space 2639 + * allocation if the system went down. Now we want to cancel the paused work 2640 + * item by marking the EFI stale so we don't actually free the space, unpausing 2641 + * the pending item and logging an EFD. 2642 + * 2643 + * The caller generally should have already mapped the space into the ondisk 2644 + * filesystem. If the reserved space was partially used, the caller must call 2645 + * xfs_free_extent_later to create a new EFI to free the unused space. 2646 + */ 2647 + void 2648 + xfs_alloc_cancel_autoreap( 2649 + struct xfs_trans *tp, 2650 + struct xfs_alloc_autoreap *aarp) 2651 + { 2652 + struct xfs_defer_pending *dfp = aarp->dfp; 2653 + struct xfs_extent_free_item *xefi; 2654 + 2655 + if (!dfp) 2656 + return; 2657 + 2658 + list_for_each_entry(xefi, &dfp->dfp_work, xefi_list) 2659 + xefi->xefi_flags |= XFS_EFI_CANCELLED; 2660 + 2661 + xfs_defer_item_unpause(tp, dfp); 2662 + } 2663 + 2664 + /* 2665 + * Commit automatic freeing of unwritten space in the filesystem. 
2666 + * 2667 + * This unpauses an earlier _schedule_autoreap and commits to freeing the 2668 + * allocated space. Call this if none of the reserved space was used. 2669 + */ 2670 + void 2671 + xfs_alloc_commit_autoreap( 2672 + struct xfs_trans *tp, 2673 + struct xfs_alloc_autoreap *aarp) 2674 + { 2675 + if (aarp->dfp) 2676 + xfs_defer_item_unpause(tp, aarp->dfp); 2583 2677 } 2584 2678 2585 2679 #ifdef DEBUG
+11 -11
fs/xfs/libxfs/xfs_alloc.h
···
 	return bp->b_addr;
 }
 
-int __xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno,
+int xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno,
 		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
 		enum xfs_ag_resv_type type, bool skip_discard);
···
 #define XFS_EFI_SKIP_DISCARD	(1U << 0)	/* don't issue discard */
 #define XFS_EFI_ATTR_FORK	(1U << 1)	/* freeing attr fork block */
 #define XFS_EFI_BMBT_BLOCK	(1U << 2)	/* freeing bmap btree block */
+#define XFS_EFI_CANCELLED	(1U << 3)	/* dont actually free the space */
 
-static inline int
-xfs_free_extent_later(
-	struct xfs_trans		*tp,
-	xfs_fsblock_t			bno,
-	xfs_filblks_t			len,
-	const struct xfs_owner_info	*oinfo,
-	enum xfs_ag_resv_type		type)
-{
-	return __xfs_free_extent_later(tp, bno, len, oinfo, type, false);
-}
+struct xfs_alloc_autoreap {
+	struct xfs_defer_pending	*dfp;
+};
 
+int xfs_alloc_schedule_autoreap(const struct xfs_alloc_arg *args,
+		bool skip_discard, struct xfs_alloc_autoreap *aarp);
+void xfs_alloc_cancel_autoreap(struct xfs_trans *tp,
+		struct xfs_alloc_autoreap *aarp);
+void xfs_alloc_commit_autoreap(struct xfs_trans *tp,
+		struct xfs_alloc_autoreap *aarp);
 
 extern struct kmem_cache *xfs_extfree_item_cache;
 
+2 -2
fs/xfs/libxfs/xfs_bmap.c
···
 
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
 	error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo,
-			XFS_AG_RESV_NONE);
+			XFS_AG_RESV_NONE, false);
 	if (error)
 		return error;
 
···
 	if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK) {
 		xfs_refcount_decrease_extent(tp, del);
 	} else {
-		error = __xfs_free_extent_later(tp, del->br_startblock,
+		error = xfs_free_extent_later(tp, del->br_startblock,
 				del->br_blockcount, NULL,
 				XFS_AG_RESV_NONE,
 				((bflags & XFS_BMAPI_NODISCARD) ||
+1 -1
fs/xfs/libxfs/xfs_bmap_btree.c
···
 
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_ino.whichfork);
 	error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo,
-			XFS_AG_RESV_NONE);
+			XFS_AG_RESV_NONE, false);
 	if (error)
 		return error;
 
-6
fs/xfs/libxfs/xfs_btree_staging.h
···
 
 	/* Number of bytes available for this fork in the inode. */
 	unsigned int		if_fork_size;
-
-	/* Fork format. */
-	unsigned int		if_format;
-
-	/* Number of records. */
-	unsigned int		if_extents;
 };
 
 /* Cursor interactions with fake roots for inode-rooted btrees. */
+227 -34
fs/xfs/libxfs/xfs_defer.c
··· 182 182 * Note that the continuation requested between t2 and t3 is likely to 183 183 * reoccur. 184 184 */ 185 + STATIC struct xfs_log_item * 186 + xfs_defer_barrier_create_intent( 187 + struct xfs_trans *tp, 188 + struct list_head *items, 189 + unsigned int count, 190 + bool sort) 191 + { 192 + return NULL; 193 + } 194 + 195 + STATIC void 196 + xfs_defer_barrier_abort_intent( 197 + struct xfs_log_item *intent) 198 + { 199 + /* empty */ 200 + } 201 + 202 + STATIC struct xfs_log_item * 203 + xfs_defer_barrier_create_done( 204 + struct xfs_trans *tp, 205 + struct xfs_log_item *intent, 206 + unsigned int count) 207 + { 208 + return NULL; 209 + } 210 + 211 + STATIC int 212 + xfs_defer_barrier_finish_item( 213 + struct xfs_trans *tp, 214 + struct xfs_log_item *done, 215 + struct list_head *item, 216 + struct xfs_btree_cur **state) 217 + { 218 + ASSERT(0); 219 + return -EFSCORRUPTED; 220 + } 221 + 222 + STATIC void 223 + xfs_defer_barrier_cancel_item( 224 + struct list_head *item) 225 + { 226 + ASSERT(0); 227 + } 228 + 229 + static const struct xfs_defer_op_type xfs_barrier_defer_type = { 230 + .max_items = 1, 231 + .create_intent = xfs_defer_barrier_create_intent, 232 + .abort_intent = xfs_defer_barrier_abort_intent, 233 + .create_done = xfs_defer_barrier_create_done, 234 + .finish_item = xfs_defer_barrier_finish_item, 235 + .cancel_item = xfs_defer_barrier_cancel_item, 236 + }; 185 237 186 238 static const struct xfs_defer_op_type *defer_op_types[] = { 187 239 [XFS_DEFER_OPS_TYPE_BMAP] = &xfs_bmap_update_defer_type, ··· 242 190 [XFS_DEFER_OPS_TYPE_FREE] = &xfs_extent_free_defer_type, 243 191 [XFS_DEFER_OPS_TYPE_AGFL_FREE] = &xfs_agfl_free_defer_type, 244 192 [XFS_DEFER_OPS_TYPE_ATTR] = &xfs_attr_defer_type, 193 + [XFS_DEFER_OPS_TYPE_BARRIER] = &xfs_barrier_defer_type, 245 194 }; 246 195 247 196 /* Create a log intent done item for a log intent item. */ ··· 540 487 * done item to release the intent item; and then log a new intent item. 
541 488 * The caller should provide a fresh transaction and roll it after we're done. 542 489 */ 543 - static int 490 + static void 544 491 xfs_defer_relog( 545 492 struct xfs_trans **tpp, 546 493 struct list_head *dfops) ··· 582 529 583 530 xfs_defer_relog_intent(*tpp, dfp); 584 531 } 585 - 586 - if ((*tpp)->t_flags & XFS_TRANS_DIRTY) 587 - return xfs_defer_trans_roll(tpp); 588 - return 0; 589 532 } 590 533 591 534 /* ··· 637 588 return error; 638 589 } 639 590 591 + /* Move all paused deferred work from @tp to @paused_list. */ 592 + static void 593 + xfs_defer_isolate_paused( 594 + struct xfs_trans *tp, 595 + struct list_head *paused_list) 596 + { 597 + struct xfs_defer_pending *dfp; 598 + struct xfs_defer_pending *pli; 599 + 600 + list_for_each_entry_safe(dfp, pli, &tp->t_dfops, dfp_list) { 601 + if (!(dfp->dfp_flags & XFS_DEFER_PAUSED)) 602 + continue; 603 + 604 + list_move_tail(&dfp->dfp_list, paused_list); 605 + trace_xfs_defer_isolate_paused(tp->t_mountp, dfp); 606 + } 607 + } 608 + 640 609 /* 641 610 * Finish all the pending work. This involves logging intent items for 642 611 * any work items that wandered in since the last transaction roll (if ··· 670 603 struct xfs_defer_pending *dfp = NULL; 671 604 int error = 0; 672 605 LIST_HEAD(dop_pending); 606 + LIST_HEAD(dop_paused); 673 607 674 608 ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES); 675 609 ··· 689 621 */ 690 622 int has_intents = xfs_defer_create_intents(*tp); 691 623 624 + xfs_defer_isolate_paused(*tp, &dop_paused); 625 + 692 626 list_splice_init(&(*tp)->t_dfops, &dop_pending); 693 627 694 628 if (has_intents < 0) { ··· 703 633 goto out_shutdown; 704 634 705 635 /* Relog intent items to keep the log moving. 
*/ 706 - error = xfs_defer_relog(tp, &dop_pending); 707 - if (error) 708 - goto out_shutdown; 636 + xfs_defer_relog(tp, &dop_pending); 637 + xfs_defer_relog(tp, &dop_paused); 638 + 639 + if ((*tp)->t_flags & XFS_TRANS_DIRTY) { 640 + error = xfs_defer_trans_roll(tp); 641 + if (error) 642 + goto out_shutdown; 643 + } 709 644 } 710 645 711 - dfp = list_first_entry(&dop_pending, struct xfs_defer_pending, 712 - dfp_list); 646 + dfp = list_first_entry_or_null(&dop_pending, 647 + struct xfs_defer_pending, dfp_list); 648 + if (!dfp) 649 + break; 713 650 error = xfs_defer_finish_one(*tp, dfp); 714 651 if (error && error != -EAGAIN) 715 652 goto out_shutdown; 716 653 } 717 654 655 + /* Requeue the paused items in the outgoing transaction. */ 656 + list_splice_tail_init(&dop_paused, &(*tp)->t_dfops); 657 + 718 658 trace_xfs_defer_finish_done(*tp, _RET_IP_); 719 659 return 0; 720 660 721 661 out_shutdown: 662 + list_splice_tail_init(&dop_paused, &dop_pending); 722 663 xfs_defer_trans_abort(*tp, &dop_pending); 723 664 xfs_force_shutdown((*tp)->t_mountp, SHUTDOWN_CORRUPT_INCORE); 724 665 trace_xfs_defer_finish_error(*tp, error); ··· 742 661 xfs_defer_finish( 743 662 struct xfs_trans **tp) 744 663 { 664 + #ifdef DEBUG 665 + struct xfs_defer_pending *dfp; 666 + #endif 745 667 int error; 746 668 747 669 /* ··· 764 680 } 765 681 766 682 /* Reset LOWMODE now that we've finished all the dfops. */ 767 - ASSERT(list_empty(&(*tp)->t_dfops)); 683 + #ifdef DEBUG 684 + list_for_each_entry(dfp, &(*tp)->t_dfops, dfp_list) 685 + ASSERT(dfp->dfp_flags & XFS_DEFER_PAUSED); 686 + #endif 768 687 (*tp)->t_flags &= ~XFS_TRANS_LOWMODE; 769 688 return 0; 770 689 } ··· 779 692 struct xfs_mount *mp = tp->t_mountp; 780 693 781 694 trace_xfs_defer_cancel(tp, _RET_IP_); 695 + xfs_defer_trans_abort(tp, &tp->t_dfops); 782 696 xfs_defer_cancel_list(mp, &tp->t_dfops); 783 697 } 784 698 699 + /* 700 + * Return the last pending work item attached to this transaction if it matches 701 + * the deferred op type. 
702 + */ 703 + static inline struct xfs_defer_pending * 704 + xfs_defer_find_last( 705 + struct xfs_trans *tp, 706 + enum xfs_defer_ops_type type, 707 + const struct xfs_defer_op_type *ops) 708 + { 709 + struct xfs_defer_pending *dfp = NULL; 710 + 711 + /* No dfops at all? */ 712 + if (list_empty(&tp->t_dfops)) 713 + return NULL; 714 + 715 + dfp = list_last_entry(&tp->t_dfops, struct xfs_defer_pending, 716 + dfp_list); 717 + 718 + /* Wrong type? */ 719 + if (dfp->dfp_type != type) 720 + return NULL; 721 + return dfp; 722 + } 723 + 724 + /* 725 + * Decide if we can add a deferred work item to the last dfops item attached 726 + * to the transaction. 727 + */ 728 + static inline bool 729 + xfs_defer_can_append( 730 + struct xfs_defer_pending *dfp, 731 + const struct xfs_defer_op_type *ops) 732 + { 733 + /* Already logged? */ 734 + if (dfp->dfp_intent) 735 + return false; 736 + 737 + /* Paused items cannot absorb more work */ 738 + if (dfp->dfp_flags & XFS_DEFER_PAUSED) 739 + return NULL; 740 + 741 + /* Already full? */ 742 + if (ops->max_items && dfp->dfp_count >= ops->max_items) 743 + return false; 744 + 745 + return true; 746 + } 747 + 748 + /* Create a new pending item at the end of the transaction list. */ 749 + static inline struct xfs_defer_pending * 750 + xfs_defer_alloc( 751 + struct xfs_trans *tp, 752 + enum xfs_defer_ops_type type) 753 + { 754 + struct xfs_defer_pending *dfp; 755 + 756 + dfp = kmem_cache_zalloc(xfs_defer_pending_cache, 757 + GFP_NOFS | __GFP_NOFAIL); 758 + dfp->dfp_type = type; 759 + INIT_LIST_HEAD(&dfp->dfp_work); 760 + list_add_tail(&dfp->dfp_list, &tp->t_dfops); 761 + 762 + return dfp; 763 + } 764 + 785 765 /* Add an item for later deferred processing. 
*/ 786 - void 766 + struct xfs_defer_pending * 787 767 xfs_defer_add( 788 768 struct xfs_trans *tp, 789 769 enum xfs_defer_ops_type type, ··· 862 708 ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); 863 709 BUILD_BUG_ON(ARRAY_SIZE(defer_op_types) != XFS_DEFER_OPS_TYPE_MAX); 864 710 865 - /* 866 - * Add the item to a pending item at the end of the intake list. 867 - * If the last pending item has the same type, reuse it. Else, 868 - * create a new pending item at the end of the intake list. 869 - */ 870 - if (!list_empty(&tp->t_dfops)) { 871 - dfp = list_last_entry(&tp->t_dfops, 872 - struct xfs_defer_pending, dfp_list); 873 - if (dfp->dfp_type != type || 874 - (ops->max_items && dfp->dfp_count >= ops->max_items)) 875 - dfp = NULL; 876 - } 877 - if (!dfp) { 878 - dfp = kmem_cache_zalloc(xfs_defer_pending_cache, 879 - GFP_NOFS | __GFP_NOFAIL); 880 - dfp->dfp_type = type; 881 - dfp->dfp_intent = NULL; 882 - dfp->dfp_done = NULL; 883 - dfp->dfp_count = 0; 884 - INIT_LIST_HEAD(&dfp->dfp_work); 885 - list_add_tail(&dfp->dfp_list, &tp->t_dfops); 886 - } 711 + dfp = xfs_defer_find_last(tp, type, ops); 712 + if (!dfp || !xfs_defer_can_append(dfp, ops)) 713 + dfp = xfs_defer_alloc(tp, type); 887 714 888 715 xfs_defer_add_item(dfp, li); 889 716 trace_xfs_defer_add_item(tp->t_mountp, dfp, li); 717 + return dfp; 718 + } 719 + 720 + /* 721 + * Add a defer ops barrier to force two otherwise adjacent deferred work items 722 + * to be tracked separately and have separate log items. 723 + */ 724 + void 725 + xfs_defer_add_barrier( 726 + struct xfs_trans *tp) 727 + { 728 + struct xfs_defer_pending *dfp; 729 + const enum xfs_defer_ops_type type = XFS_DEFER_OPS_TYPE_BARRIER; 730 + const struct xfs_defer_op_type *ops = defer_op_types[type]; 731 + 732 + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); 733 + 734 + /* If the last defer op added was a barrier, we're done. 
*/ 735 + dfp = xfs_defer_find_last(tp, type, ops); 736 + if (dfp) 737 + return; 738 + 739 + xfs_defer_alloc(tp, type); 740 + 741 + trace_xfs_defer_add_item(tp->t_mountp, dfp, NULL); 890 742 } 891 743 892 744 /* ··· 1217 1057 xfs_refcount_intent_destroy_cache(); 1218 1058 xfs_rmap_intent_destroy_cache(); 1219 1059 xfs_defer_destroy_cache(); 1060 + } 1061 + 1062 + /* 1063 + * Mark a deferred work item so that it will be requeued indefinitely without 1064 + * being finished. Caller must ensure there are no data dependencies on this 1065 + * work item in the meantime. 1066 + */ 1067 + void 1068 + xfs_defer_item_pause( 1069 + struct xfs_trans *tp, 1070 + struct xfs_defer_pending *dfp) 1071 + { 1072 + ASSERT(!(dfp->dfp_flags & XFS_DEFER_PAUSED)); 1073 + 1074 + dfp->dfp_flags |= XFS_DEFER_PAUSED; 1075 + 1076 + trace_xfs_defer_item_pause(tp->t_mountp, dfp); 1077 + } 1078 + 1079 + /* 1080 + * Release a paused deferred work item so that it will be finished during the 1081 + * next transaction roll. 1082 + */ 1083 + void 1084 + xfs_defer_item_unpause( 1085 + struct xfs_trans *tp, 1086 + struct xfs_defer_pending *dfp) 1087 + { 1088 + ASSERT(dfp->dfp_flags & XFS_DEFER_PAUSED); 1089 + 1090 + dfp->dfp_flags &= ~XFS_DEFER_PAUSED; 1091 + 1092 + trace_xfs_defer_item_unpause(tp->t_mountp, dfp); 1220 1093 }
+18 -2
fs/xfs/libxfs/xfs_defer.h
···
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_AGFL_FREE,
 	XFS_DEFER_OPS_TYPE_ATTR,
+	XFS_DEFER_OPS_TYPE_BARRIER,
 	XFS_DEFER_OPS_TYPE_MAX,
 };
 
···
 	struct xfs_log_item	*dfp_intent;	/* log intent item */
 	struct xfs_log_item	*dfp_done;	/* log done item */
 	unsigned int		dfp_count;	/* # extent items */
+	unsigned int		dfp_flags;
 	enum xfs_defer_ops_type	dfp_type;
 };
 
-void xfs_defer_add(struct xfs_trans *tp, enum xfs_defer_ops_type type,
-		struct list_head *h);
+/*
+ * Create a log intent item for this deferred item, but don't actually finish
+ * the work.  Caller must clear this before the final transaction commit.
+ */
+#define XFS_DEFER_PAUSED	(1U << 0)
+
+#define XFS_DEFER_PENDING_STRINGS \
+	{ XFS_DEFER_PAUSED,	"paused" }
+
+void xfs_defer_item_pause(struct xfs_trans *tp, struct xfs_defer_pending *dfp);
+void xfs_defer_item_unpause(struct xfs_trans *tp, struct xfs_defer_pending *dfp);
+
+struct xfs_defer_pending *xfs_defer_add(struct xfs_trans *tp,
+		enum xfs_defer_ops_type type, struct list_head *h);
 int xfs_defer_finish_noroll(struct xfs_trans **tp);
 int xfs_defer_finish(struct xfs_trans **tp);
 int xfs_defer_finish_one(struct xfs_trans *tp, struct xfs_defer_pending *dfp);
···
 
 int __init xfs_defer_init_item_caches(void);
 void xfs_defer_destroy_item_caches(void);
+
+void xfs_defer_add_barrier(struct xfs_trans *tp);
 
 #endif /* __XFS_DEFER_H__ */
+3 -2
fs/xfs/libxfs/xfs_ialloc.c
···
 		return xfs_free_extent_later(tp,
 				XFS_AGB_TO_FSB(mp, agno, sagbno),
 				M_IGEO(mp)->ialloc_blks, &XFS_RMAP_OINFO_INODES,
-				XFS_AG_RESV_NONE);
+				XFS_AG_RESV_NONE, false);
 }
 
 /* holemask is only 16-bits (fits in an unsigned long) */
···
 		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
 		error = xfs_free_extent_later(tp,
 				XFS_AGB_TO_FSB(mp, agno, agbno), contigblk,
-				&XFS_RMAP_OINFO_INODES, XFS_AG_RESV_NONE);
+				&XFS_RMAP_OINFO_INODES, XFS_AG_RESV_NONE,
+				false);
 		if (error)
 			return error;
 
+1 -1
fs/xfs/libxfs/xfs_ialloc_btree.c
···
 	xfs_inobt_mod_blockcount(cur, -1);
 	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
 	return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
-			&XFS_RMAP_OINFO_INOBT, resv);
+			&XFS_RMAP_OINFO_INOBT, resv, false);
 }
 
 STATIC int
+3 -3
fs/xfs/libxfs/xfs_refcount.c
···
 				tmp.rc_startblock);
 		error = xfs_free_extent_later(cur->bc_tp, fsbno,
 				tmp.rc_blockcount, NULL,
-				XFS_AG_RESV_NONE);
+				XFS_AG_RESV_NONE, false);
 		if (error)
 			goto out_error;
 	}
···
 				ext.rc_startblock);
 		error = xfs_free_extent_later(cur->bc_tp, fsbno,
 				ext.rc_blockcount, NULL,
-				XFS_AG_RESV_NONE);
+				XFS_AG_RESV_NONE, false);
 		if (error)
 			goto out_error;
 	}
···
 	/* Free the block. */
 	error = xfs_free_extent_later(tp, fsb,
 			rr->rr_rrec.rc_blockcount, NULL,
-			XFS_AG_RESV_NONE);
+			XFS_AG_RESV_NONE, false);
 	if (error)
 		goto out_trans;
 
+1 -1
fs/xfs/libxfs/xfs_refcount_btree.c
···
 	be32_add_cpu(&agf->agf_refcount_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
 	return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
-			&XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA);
+			&XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA, false);
 }
 
 STATIC int
+513
fs/xfs/scrub/newbt.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-or-later 2 + /* 3 + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. 4 + * Author: Darrick J. Wong <djwong@kernel.org> 5 + */ 6 + #include "xfs.h" 7 + #include "xfs_fs.h" 8 + #include "xfs_shared.h" 9 + #include "xfs_format.h" 10 + #include "xfs_trans_resv.h" 11 + #include "xfs_mount.h" 12 + #include "xfs_btree.h" 13 + #include "xfs_btree_staging.h" 14 + #include "xfs_log_format.h" 15 + #include "xfs_trans.h" 16 + #include "xfs_sb.h" 17 + #include "xfs_inode.h" 18 + #include "xfs_alloc.h" 19 + #include "xfs_rmap.h" 20 + #include "xfs_ag.h" 21 + #include "xfs_defer.h" 22 + #include "scrub/scrub.h" 23 + #include "scrub/common.h" 24 + #include "scrub/trace.h" 25 + #include "scrub/repair.h" 26 + #include "scrub/newbt.h" 27 + 28 + /* 29 + * Estimate proper slack values for a btree that's being reloaded. 30 + * 31 + * Under most circumstances, we'll take whatever default loading value the 32 + * btree bulk loading code calculates for us. However, there are some 33 + * exceptions to this rule: 34 + * 35 + * (1) If this is a per-AG btree and the AG has less than 10% space free. 36 + * (2) If this is an inode btree and the FS has less than 10% space free. 37 + 38 + * In either case, format the new btree blocks almost completely full to 39 + * minimize space usage. 40 + */ 41 + static void 42 + xrep_newbt_estimate_slack( 43 + struct xrep_newbt *xnr) 44 + { 45 + struct xfs_scrub *sc = xnr->sc; 46 + struct xfs_btree_bload *bload = &xnr->bload; 47 + uint64_t free; 48 + uint64_t sz; 49 + 50 + /* Let the btree code compute the default slack values. */ 51 + bload->leaf_slack = -1; 52 + bload->node_slack = -1; 53 + 54 + if (sc->ops->type == ST_PERAG) { 55 + free = sc->sa.pag->pagf_freeblks; 56 + sz = xfs_ag_block_count(sc->mp, sc->sa.pag->pag_agno); 57 + } else { 58 + free = percpu_counter_sum(&sc->mp->m_fdblocks); 59 + sz = sc->mp->m_sb.sb_dblocks; 60 + } 61 + 62 + /* No further changes if there's more than 10% free space left. 
*/ 63 + if (free >= div_u64(sz, 10)) 64 + return; 65 + 66 + /* 67 + * We're low on space; load the btrees as tightly as possible. Leave 68 + * a couple of open slots in each btree block so that we don't end up 69 + * splitting the btrees like crazy after a mount. 70 + */ 71 + if (bload->leaf_slack < 0) 72 + bload->leaf_slack = 2; 73 + if (bload->node_slack < 0) 74 + bload->node_slack = 2; 75 + } 76 + 77 + /* Initialize accounting resources for staging a new AG btree. */ 78 + void 79 + xrep_newbt_init_ag( 80 + struct xrep_newbt *xnr, 81 + struct xfs_scrub *sc, 82 + const struct xfs_owner_info *oinfo, 83 + xfs_fsblock_t alloc_hint, 84 + enum xfs_ag_resv_type resv) 85 + { 86 + memset(xnr, 0, sizeof(struct xrep_newbt)); 87 + xnr->sc = sc; 88 + xnr->oinfo = *oinfo; /* structure copy */ 89 + xnr->alloc_hint = alloc_hint; 90 + xnr->resv = resv; 91 + INIT_LIST_HEAD(&xnr->resv_list); 92 + xrep_newbt_estimate_slack(xnr); 93 + } 94 + 95 + /* Initialize accounting resources for staging a new inode fork btree. */ 96 + int 97 + xrep_newbt_init_inode( 98 + struct xrep_newbt *xnr, 99 + struct xfs_scrub *sc, 100 + int whichfork, 101 + const struct xfs_owner_info *oinfo) 102 + { 103 + struct xfs_ifork *ifp; 104 + 105 + ifp = kmem_cache_zalloc(xfs_ifork_cache, XCHK_GFP_FLAGS); 106 + if (!ifp) 107 + return -ENOMEM; 108 + 109 + xrep_newbt_init_ag(xnr, sc, oinfo, 110 + XFS_INO_TO_FSB(sc->mp, sc->ip->i_ino), 111 + XFS_AG_RESV_NONE); 112 + xnr->ifake.if_fork = ifp; 113 + xnr->ifake.if_fork_size = xfs_inode_fork_size(sc->ip, whichfork); 114 + return 0; 115 + } 116 + 117 + /* 118 + * Initialize accounting resources for staging a new btree. Callers are 119 + * expected to add their own reservations (and clean them up) manually. 
120 + */ 121 + void 122 + xrep_newbt_init_bare( 123 + struct xrep_newbt *xnr, 124 + struct xfs_scrub *sc) 125 + { 126 + xrep_newbt_init_ag(xnr, sc, &XFS_RMAP_OINFO_ANY_OWNER, NULLFSBLOCK, 127 + XFS_AG_RESV_NONE); 128 + } 129 + 130 + /* 131 + * Designate specific blocks to be used to build our new btree. @pag must be 132 + * a passive reference. 133 + */ 134 + STATIC int 135 + xrep_newbt_add_blocks( 136 + struct xrep_newbt *xnr, 137 + struct xfs_perag *pag, 138 + const struct xfs_alloc_arg *args) 139 + { 140 + struct xfs_mount *mp = xnr->sc->mp; 141 + struct xrep_newbt_resv *resv; 142 + int error; 143 + 144 + resv = kmalloc(sizeof(struct xrep_newbt_resv), XCHK_GFP_FLAGS); 145 + if (!resv) 146 + return -ENOMEM; 147 + 148 + INIT_LIST_HEAD(&resv->list); 149 + resv->agbno = XFS_FSB_TO_AGBNO(mp, args->fsbno); 150 + resv->len = args->len; 151 + resv->used = 0; 152 + resv->pag = xfs_perag_hold(pag); 153 + 154 + ASSERT(xnr->oinfo.oi_offset == 0); 155 + 156 + error = xfs_alloc_schedule_autoreap(args, true, &resv->autoreap); 157 + if (error) 158 + goto out_pag; 159 + 160 + list_add_tail(&resv->list, &xnr->resv_list); 161 + return 0; 162 + out_pag: 163 + xfs_perag_put(resv->pag); 164 + kfree(resv); 165 + return error; 166 + } 167 + 168 + /* Don't let our allocation hint take us beyond this AG */ 169 + static inline void 170 + xrep_newbt_validate_ag_alloc_hint( 171 + struct xrep_newbt *xnr) 172 + { 173 + struct xfs_scrub *sc = xnr->sc; 174 + xfs_agnumber_t agno = XFS_FSB_TO_AGNO(sc->mp, xnr->alloc_hint); 175 + 176 + if (agno == sc->sa.pag->pag_agno && 177 + xfs_verify_fsbno(sc->mp, xnr->alloc_hint)) 178 + return; 179 + 180 + xnr->alloc_hint = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno, 181 + XFS_AGFL_BLOCK(sc->mp) + 1); 182 + } 183 + 184 + /* Allocate disk space for a new per-AG btree. 
+ */
+STATIC int
+xrep_newbt_alloc_ag_blocks(
+	struct xrep_newbt	*xnr,
+	uint64_t		nr_blocks)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+	struct xfs_mount	*mp = sc->mp;
+	int			error = 0;
+
+	ASSERT(sc->sa.pag != NULL);
+
+	while (nr_blocks > 0) {
+		struct xfs_alloc_arg	args = {
+			.tp		= sc->tp,
+			.mp		= mp,
+			.oinfo		= xnr->oinfo,
+			.minlen		= 1,
+			.maxlen		= nr_blocks,
+			.prod		= 1,
+			.resv		= xnr->resv,
+		};
+		xfs_agnumber_t		agno;
+
+		xrep_newbt_validate_ag_alloc_hint(xnr);
+
+		error = xfs_alloc_vextent_near_bno(&args, xnr->alloc_hint);
+		if (error)
+			return error;
+		if (args.fsbno == NULLFSBLOCK)
+			return -ENOSPC;
+
+		agno = XFS_FSB_TO_AGNO(mp, args.fsbno);
+
+		trace_xrep_newbt_alloc_ag_blocks(mp, agno,
+				XFS_FSB_TO_AGBNO(mp, args.fsbno), args.len,
+				xnr->oinfo.oi_owner);
+
+		if (agno != sc->sa.pag->pag_agno) {
+			ASSERT(agno == sc->sa.pag->pag_agno);
+			return -EFSCORRUPTED;
+		}
+
+		error = xrep_newbt_add_blocks(xnr, sc->sa.pag, &args);
+		if (error)
+			return error;
+
+		nr_blocks -= args.len;
+		xnr->alloc_hint = args.fsbno + args.len;
+
+		error = xrep_defer_finish(sc);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Don't let our allocation hint take us beyond EOFS */
+static inline void
+xrep_newbt_validate_file_alloc_hint(
+	struct xrep_newbt	*xnr)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+
+	if (xfs_verify_fsbno(sc->mp, xnr->alloc_hint))
+		return;
+
+	xnr->alloc_hint = XFS_AGB_TO_FSB(sc->mp, 0, XFS_AGFL_BLOCK(sc->mp) + 1);
+}
+
+/* Allocate disk space for our new file-based btree. */
+STATIC int
+xrep_newbt_alloc_file_blocks(
+	struct xrep_newbt	*xnr,
+	uint64_t		nr_blocks)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+	struct xfs_mount	*mp = sc->mp;
+	int			error = 0;
+
+	while (nr_blocks > 0) {
+		struct xfs_alloc_arg	args = {
+			.tp		= sc->tp,
+			.mp		= mp,
+			.oinfo		= xnr->oinfo,
+			.minlen		= 1,
+			.maxlen		= nr_blocks,
+			.prod		= 1,
+			.resv		= xnr->resv,
+		};
+		struct xfs_perag	*pag;
+		xfs_agnumber_t		agno;
+
+		xrep_newbt_validate_file_alloc_hint(xnr);
+
+		error = xfs_alloc_vextent_start_ag(&args, xnr->alloc_hint);
+		if (error)
+			return error;
+		if (args.fsbno == NULLFSBLOCK)
+			return -ENOSPC;
+
+		agno = XFS_FSB_TO_AGNO(mp, args.fsbno);
+
+		trace_xrep_newbt_alloc_file_blocks(mp, agno,
+				XFS_FSB_TO_AGBNO(mp, args.fsbno), args.len,
+				xnr->oinfo.oi_owner);
+
+		pag = xfs_perag_get(mp, agno);
+		if (!pag) {
+			ASSERT(0);
+			return -EFSCORRUPTED;
+		}
+
+		error = xrep_newbt_add_blocks(xnr, pag, &args);
+		xfs_perag_put(pag);
+		if (error)
+			return error;
+
+		nr_blocks -= args.len;
+		xnr->alloc_hint = args.fsbno + args.len;
+
+		error = xrep_defer_finish(sc);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Allocate disk space for our new btree. */
+int
+xrep_newbt_alloc_blocks(
+	struct xrep_newbt	*xnr,
+	uint64_t		nr_blocks)
+{
+	if (xnr->sc->ip)
+		return xrep_newbt_alloc_file_blocks(xnr, nr_blocks);
+	return xrep_newbt_alloc_ag_blocks(xnr, nr_blocks);
+}
+
+/*
+ * Free the unused part of a space extent that was reserved for a new ondisk
+ * structure.  Returns the number of EFIs logged or a negative errno.
+ */
+STATIC int
+xrep_newbt_free_extent(
+	struct xrep_newbt	*xnr,
+	struct xrep_newbt_resv	*resv,
+	bool			btree_committed)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+	xfs_agblock_t		free_agbno = resv->agbno;
+	xfs_extlen_t		free_aglen = resv->len;
+	xfs_fsblock_t		fsbno;
+	int			error;
+
+	if (!btree_committed || resv->used == 0) {
+		/*
+		 * If we're not committing a new btree or we didn't use the
+		 * space reservation, let the existing EFI free the entire
+		 * space extent.
+		 */
+		trace_xrep_newbt_free_blocks(sc->mp, resv->pag->pag_agno,
+				free_agbno, free_aglen, xnr->oinfo.oi_owner);
+		xfs_alloc_commit_autoreap(sc->tp, &resv->autoreap);
+		return 1;
+	}
+
+	/*
+	 * We used space and committed the btree.  Cancel the autoreap, remove
+	 * the written blocks from the reservation, and possibly log a new EFI
+	 * to free any unused reservation space.
+	 */
+	xfs_alloc_cancel_autoreap(sc->tp, &resv->autoreap);
+	free_agbno += resv->used;
+	free_aglen -= resv->used;
+
+	if (free_aglen == 0)
+		return 0;
+
+	trace_xrep_newbt_free_blocks(sc->mp, resv->pag->pag_agno, free_agbno,
+			free_aglen, xnr->oinfo.oi_owner);
+
+	ASSERT(xnr->resv != XFS_AG_RESV_AGFL);
+
+	/*
+	 * Use EFIs to free the reservations.  This reduces the chance
+	 * that we leak blocks if the system goes down.
+	 */
+	fsbno = XFS_AGB_TO_FSB(sc->mp, resv->pag->pag_agno, free_agbno);
+	error = xfs_free_extent_later(sc->tp, fsbno, free_aglen, &xnr->oinfo,
+			xnr->resv, true);
+	if (error)
+		return error;
+
+	return 1;
+}
+
+/* Free all the accounting info and disk space we reserved for a new btree. */
+STATIC int
+xrep_newbt_free(
+	struct xrep_newbt	*xnr,
+	bool			btree_committed)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+	struct xrep_newbt_resv	*resv, *n;
+	unsigned int		freed = 0;
+	int			error = 0;
+
+	/*
+	 * If the filesystem already went down, we can't free the blocks.  Skip
+	 * ahead to freeing the incore metadata because we can't fix anything.
+	 */
+	if (xfs_is_shutdown(sc->mp))
+		goto junkit;
+
+	list_for_each_entry_safe(resv, n, &xnr->resv_list, list) {
+		int ret;
+
+		ret = xrep_newbt_free_extent(xnr, resv, btree_committed);
+		list_del(&resv->list);
+		xfs_perag_put(resv->pag);
+		kfree(resv);
+		if (ret < 0) {
+			error = ret;
+			goto junkit;
+		}
+
+		freed += ret;
+		if (freed >= XREP_MAX_ITRUNCATE_EFIS) {
+			error = xrep_defer_finish(sc);
+			if (error)
+				goto junkit;
+			freed = 0;
+		}
+	}
+
+	if (freed)
+		error = xrep_defer_finish(sc);
+
+junkit:
+	/*
+	 * If we still have reservations attached to @newbt, cleanup must have
+	 * failed and the filesystem is about to go down.  Clean up the incore
+	 * reservations and try to commit to freeing the space we used.
+	 */
+	list_for_each_entry_safe(resv, n, &xnr->resv_list, list) {
+		xfs_alloc_commit_autoreap(sc->tp, &resv->autoreap);
+		list_del(&resv->list);
+		xfs_perag_put(resv->pag);
+		kfree(resv);
+	}
+
+	if (sc->ip) {
+		kmem_cache_free(xfs_ifork_cache, xnr->ifake.if_fork);
+		xnr->ifake.if_fork = NULL;
+	}
+
+	return error;
+}
+
+/*
+ * Free all the accounting info and unused disk space allocations after
+ * committing a new btree.
+ */
+int
+xrep_newbt_commit(
+	struct xrep_newbt	*xnr)
+{
+	return xrep_newbt_free(xnr, true);
+}
+
+/*
+ * Free all the accounting info and all of the disk space we reserved for a new
+ * btree that we're not going to commit.  We want to try to roll things back
+ * cleanly for things like ENOSPC midway through allocation.
+ */
+void
+xrep_newbt_cancel(
+	struct xrep_newbt	*xnr)
+{
+	xrep_newbt_free(xnr, false);
+}
+
+/* Feed one of the reserved btree blocks to the bulk loader. */
+int
+xrep_newbt_claim_block(
+	struct xfs_btree_cur	*cur,
+	struct xrep_newbt	*xnr,
+	union xfs_btree_ptr	*ptr)
+{
+	struct xrep_newbt_resv	*resv;
+	struct xfs_mount	*mp = cur->bc_mp;
+	xfs_agblock_t		agbno;
+
+	/*
+	 * The first item in the list should always have a free block unless
+	 * we're completely out.
+	 */
+	resv = list_first_entry(&xnr->resv_list, struct xrep_newbt_resv, list);
+	if (resv->used == resv->len)
+		return -ENOSPC;
+
+	/*
+	 * Peel off a block from the start of the reservation.  We allocate
+	 * blocks in order to place blocks on disk in increasing record or key
+	 * order.  The block reservations tend to end up on the list in
+	 * decreasing order, which hopefully results in leaf blocks ending up
+	 * together.
+	 */
+	agbno = resv->agbno + resv->used;
+	resv->used++;
+
+	/* If we used all the blocks in this reservation, move it to the end. */
+	if (resv->used == resv->len)
+		list_move_tail(&resv->list, &xnr->resv_list);
+
+	trace_xrep_newbt_claim_block(mp, resv->pag->pag_agno, agbno, 1,
+			xnr->oinfo.oi_owner);
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		ptr->l = cpu_to_be64(XFS_AGB_TO_FSB(mp, resv->pag->pag_agno,
+								agbno));
+	else
+		ptr->s = cpu_to_be32(agbno);
+
+	/* Relog all the EFIs. */
+	return xrep_defer_finish(xnr->sc);
+}
fs/xfs/scrub/newbt.h (+65)
···
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022-2023 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_NEWBT_H__
+#define __XFS_SCRUB_NEWBT_H__
+
+struct xrep_newbt_resv {
+	/* Link to list of extents that we've reserved. */
+	struct list_head	list;
+
+	struct xfs_perag	*pag;
+
+	/* Auto-freeing this reservation if we don't commit. */
+	struct xfs_alloc_autoreap autoreap;
+
+	/* AG block of the extent we reserved. */
+	xfs_agblock_t		agbno;
+
+	/* Length of the reservation. */
+	xfs_extlen_t		len;
+
+	/* How much of this reservation has been used. */
+	xfs_extlen_t		used;
+};
+
+struct xrep_newbt {
+	struct xfs_scrub	*sc;
+
+	/* List of extents that we've reserved. */
+	struct list_head	resv_list;
+
+	/* Fake root for new btree. */
+	union {
+		struct xbtree_afakeroot	afake;
+		struct xbtree_ifakeroot	ifake;
+	};
+
+	/* rmap owner of these blocks */
+	struct xfs_owner_info	oinfo;
+
+	/* btree geometry for the bulk loader */
+	struct xfs_btree_bload	bload;
+
+	/* Allocation hint */
+	xfs_fsblock_t		alloc_hint;
+
+	/* per-ag reservation type */
+	enum xfs_ag_resv_type	resv;
+};
+
+void xrep_newbt_init_bare(struct xrep_newbt *xnr, struct xfs_scrub *sc);
+void xrep_newbt_init_ag(struct xrep_newbt *xnr, struct xfs_scrub *sc,
+		const struct xfs_owner_info *oinfo, xfs_fsblock_t alloc_hint,
+		enum xfs_ag_resv_type resv);
+int xrep_newbt_init_inode(struct xrep_newbt *xnr, struct xfs_scrub *sc,
+		int whichfork, const struct xfs_owner_info *oinfo);
+int xrep_newbt_alloc_blocks(struct xrep_newbt *xnr, uint64_t nr_blocks);
+void xrep_newbt_cancel(struct xrep_newbt *xnr);
+int xrep_newbt_commit(struct xrep_newbt *xnr);
+int xrep_newbt_claim_block(struct xfs_btree_cur *cur, struct xrep_newbt *xnr,
+		union xfs_btree_ptr *ptr);
+
+#endif /* __XFS_SCRUB_NEWBT_H__ */
fs/xfs/scrub/reap.c (+6 -1)
···
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
 #include "xfs_attr_remote.h"
+#include "xfs_defer.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
···
 	/*
 	 * Use deferred frees to get rid of the old btree blocks to try to
 	 * minimize the window in which we could crash and lose the old blocks.
+	 * Add a defer ops barrier every other extent to avoid stressing the
+	 * system with large EFIs.
 	 */
-	error = __xfs_free_extent_later(sc->tp, fsbno, *aglenp, rs->oinfo,
+	error = xfs_free_extent_later(sc->tp, fsbno, *aglenp, rs->oinfo,
 			rs->resv, true);
 	if (error)
 		return error;
 
 	rs->deferred++;
+	if (rs->deferred % 2 == 0)
+		xfs_defer_add_barrier(sc->tp);
 	return 0;
 }
fs/xfs/scrub/trace.h (+37)
···
 		__entry->freemask)
 )
 
+DECLARE_EVENT_CLASS(xrep_newbt_extent_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agblock_t agbno, xfs_extlen_t len,
+		 int64_t owner),
+	TP_ARGS(mp, agno, agbno, len, owner),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(int64_t, owner)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agbno = agbno;
+		__entry->len = len;
+		__entry->owner = owner;
+	),
+	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x owner 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->owner)
+);
+#define DEFINE_NEWBT_EXTENT_EVENT(name) \
+DEFINE_EVENT(xrep_newbt_extent_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 xfs_agblock_t agbno, xfs_extlen_t len, \
+		 int64_t owner), \
+	TP_ARGS(mp, agno, agbno, len, owner))
+DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_alloc_ag_blocks);
+DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_alloc_file_blocks);
+DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_free_blocks);
+DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_claim_block);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */
fs/xfs/xfs_extfree_item.c (+5 -4)
···
 	struct xfs_extent	*extp;
 	uint			next_extent;
 	xfs_agblock_t		agbno;
-	int			error;
+	int			error = 0;
 
 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
 	agbno = XFS_FSB_TO_AGBNO(mp, xefi->xefi_startblock);
···
 	 * the existing EFI, and so we need to copy all the unprocessed extents
 	 * in this EFI to the EFD so this works correctly.
 	 */
-	error = __xfs_free_extent(tp, xefi->xefi_pag, agbno,
-			xefi->xefi_blockcount, &oinfo, xefi->xefi_agresv,
-			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
+	if (!(xefi->xefi_flags & XFS_EFI_CANCELLED))
+		error = __xfs_free_extent(tp, xefi->xefi_pag, agbno,
+				xefi->xefi_blockcount, &oinfo, xefi->xefi_agresv,
+				xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
 	if (error == -EAGAIN) {
 		xfs_efd_from_efi(efdp);
 		return error;
fs/xfs/xfs_reflink.c (+1 -1)
···
 
 	error = xfs_free_extent_later(*tpp, del.br_startblock,
 			del.br_blockcount, NULL,
-			XFS_AG_RESV_NONE);
+			XFS_AG_RESV_NONE, false);
 	if (error)
 		break;
 
fs/xfs/xfs_trace.h (+11 -2)
···
 		__field(dev_t, dev)
 		__field(int, type)
 		__field(void *, intent)
+		__field(unsigned int, flags)
 		__field(char, committed)
 		__field(int, nr)
 	),
···
 		__entry->dev = mp ? mp->m_super->s_dev : 0;
 		__entry->type = dfp->dfp_type;
 		__entry->intent = dfp->dfp_intent;
+		__entry->flags = dfp->dfp_flags;
 		__entry->committed = dfp->dfp_done != NULL;
 		__entry->nr = dfp->dfp_count;
 	),
-	TP_printk("dev %d:%d optype %d intent %p committed %d nr %d",
+	TP_printk("dev %d:%d optype %d intent %p flags %s committed %d nr %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->type,
 		  __entry->intent,
+		  __print_flags(__entry->flags, "|", XFS_DEFER_PENDING_STRINGS),
 		  __entry->committed,
 		  __entry->nr)
 )
···
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_relog_intent);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_isolate_paused);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_item_pause);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_item_unpause);
 
 #define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
 DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
···
 		__field(void *, intent)
 		__field(void *, item)
 		__field(char, committed)
+		__field(unsigned int, flags)
 		__field(int, nr)
 	),
 	TP_fast_assign(
···
 		__entry->intent = dfp->dfp_intent;
 		__entry->item = item;
 		__entry->committed = dfp->dfp_done != NULL;
+		__entry->flags = dfp->dfp_flags;
 		__entry->nr = dfp->dfp_count;
 	),
-	TP_printk("dev %d:%d optype %d intent %p item %p committed %d nr %d",
+	TP_printk("dev %d:%d optype %d intent %p item %p flags %s committed %d nr %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->type,
 		  __entry->intent,
 		  __entry->item,
+		  __print_flags(__entry->flags, "|", XFS_DEFER_PENDING_STRINGS),
 		  __entry->committed,
 		  __entry->nr)
 )