Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

xfs: fix broken icreate log item cancellation

Inode cluster buffers are invalidated and cancelled when inode chunks
are freed to notify log recovery that previous logged updates to the
metadata buffer should be skipped. This ensures that log recovery does
not overwrite buffers that might have already been reused.

On v4 filesystems, inode chunk allocation and inode updates are logged
via the cluster buffers and thus cancellation is easily detected via
buffer cancellation items. v5 filesystems use the new icreate
transaction, which uses logical logging and ordered buffers to log a
full inode chunk allocation at once. The resulting icreate item often
spans multiple inode cluster buffers.

Log recovery checks for cancelled buffers when processing icreate log
items, but it has a couple problems. First, it uses the full length of
the inode chunk rather than the cluster size. Second, it uses the length
in FSB units rather than BB units. Either of these problems prevent
icreate recovery from identifying cancelled buffers and thus inode
initialization proceeds unconditionally.

Update xlog_recover_do_icreate_pass2() to iterate the icreate range in
cluster sized increments and check each increment for cancellation.
Since icreate is currently only used for the minimum atomic inode chunk
allocation, we expect that either all or none of the buffers will be
cancelled. Cancel the icreate if at least one buffer is cancelled to
avoid making a bad situation worse by initializing a partial inode
chunk, but detect such anomalies and warn the user.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

authored by

Brian Foster and committed by
Dave Chinner
fc0d1656 78d57e45

+38 -13
+38 -13
fs/xfs/xfs_log_recover.c
··· 3032 3032 unsigned int count; 3033 3033 unsigned int isize; 3034 3034 xfs_agblock_t length; 3035 + int blks_per_cluster; 3036 + int bb_per_cluster; 3037 + int cancel_count; 3038 + int nbufs; 3039 + int i; 3035 3040 3036 3041 icl = (struct xfs_icreate_log *)item->ri_buf[0].i_addr; 3037 3042 if (icl->icl_type != XFS_LI_ICREATE) { ··· 3095 3090 } 3096 3091 3097 3092 /* 3098 - * Inode buffers can be freed. Do not replay the inode initialisation as 3099 - * we could be overwriting something written after this inode buffer was 3100 - * cancelled. 3101 - * 3102 - * XXX: we need to iterate all buffers and only init those that are not 3103 - * cancelled. I think that a more fine grained factoring of 3104 - * xfs_ialloc_inode_init may be appropriate here to enable this to be 3105 - * done easily. 3093 + * The icreate transaction can cover multiple cluster buffers and these 3094 + * buffers could have been freed and reused. Check the individual 3095 + * buffers for cancellation so we don't overwrite anything written after 3096 + * a cancellation. 3106 3097 */ 3107 - if (xlog_check_buffer_cancelled(log, 3108 - XFS_AGB_TO_DADDR(mp, agno, agbno), length, 0)) { 3098 + blks_per_cluster = xfs_icluster_size_fsb(mp); 3099 + bb_per_cluster = XFS_FSB_TO_BB(mp, blks_per_cluster); 3100 + nbufs = length / blks_per_cluster; 3101 + for (i = 0, cancel_count = 0; i < nbufs; i++) { 3102 + xfs_daddr_t daddr; 3103 + 3104 + daddr = XFS_AGB_TO_DADDR(mp, agno, 3105 + agbno + i * blks_per_cluster); 3106 + if (xlog_check_buffer_cancelled(log, daddr, bb_per_cluster, 0)) 3107 + cancel_count++; 3108 + } 3109 + 3110 + /* 3111 + * We currently only use icreate for a single allocation at a time. This 3112 + * means we should expect either all or none of the buffers to be 3113 + * cancelled. Be conservative and skip replay if at least one buffer is 3114 + * cancelled, but warn the user that something is awry if the buffers 3115 + * are not consistent. 3116 + * 3117 + * XXX: This must be refined to only skip cancelled clusters once we use 3118 + * icreate for multiple chunk allocations. 3119 + */ 3120 + ASSERT(!cancel_count || cancel_count == nbufs); 3121 + if (cancel_count) { 3122 + if (cancel_count != nbufs) 3123 + xfs_warn(mp, 3124 + "WARNING: partial inode chunk cancellation, skipped icreate."); 3109 3125 trace_xfs_log_recover_icreate_cancel(log, icl); 3110 3126 return 0; 3111 3127 } 3112 3128 3113 3129 trace_xfs_log_recover_icreate_recover(log, icl); 3114 - xfs_ialloc_inode_init(mp, NULL, buffer_list, count, agno, agbno, length, 3115 - be32_to_cpu(icl->icl_gen)); 3116 - return 0; 3130 + return xfs_ialloc_inode_init(mp, NULL, buffer_list, count, agno, agbno, 3131 + length, be32_to_cpu(icl->icl_gen)); 3117 3132 } 3118 3133 3119 3134 STATIC void