Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

xfs: log vector rounding leaks log space

The addition of direct formatting of log items into the CIL
linear buffer added alignment restrictions that the start of each
vector needed to be 64 bit aligned. Hence padding was added in
xlog_finish_iovec() to round up the vector length to ensure the next
vector started with the correct alignment.

This adds a small number of bytes to the size of
the linear buffer that is otherwise unused. The issue is that we
then use the linear buffer size to determine the log space used by
the log item, and this includes the unused space. Hence when we
account for space used by the log item, it's more than is actually
written into the iclogs, and hence we slowly leak this space.

This results on log hangs when reserving space, with threads getting
stuck with these stack traces:

Call Trace:
[<ffffffff81d15989>] schedule+0x29/0x70
[<ffffffff8150d3a2>] xlog_grant_head_wait+0xa2/0x1a0
[<ffffffff8150d55d>] xlog_grant_head_check+0xbd/0x140
[<ffffffff8150ee33>] xfs_log_reserve+0x103/0x220
[<ffffffff814b7f05>] xfs_trans_reserve+0x2f5/0x310
.....

The 4 bytes is significant. Brain Foster did all the hard work in
tracking down a reproducable leak to inode chunk allocation (it went
away with the ikeep mount option). His rough numbers were that
creating 50,000 inodes leaked 11 log blocks. This turns out to be
roughly 800 inode chunks or 1600 inode cluster buffers. That
works out at roughly 4 bytes per cluster buffer logged, and at that
I started looking for a 4 byte leak in the buffer logging code.

What I found was that a struct xfs_buf_log_format structure for an
inode cluster buffer is 28 bytes in length. This gets rounded up to
32 bytes, but the vector length remains 28 bytes. Hence the CIL
ticket reservation is decremented by 32 bytes (via lv->lv_buf_len)
for that vector rather than 28 bytes which are written into the log.

The fix for this problem is to separately track the bytes used by
the log vectors in the item and use that instead of the buffer
length when accounting for the log space that will be used by the
formatted log item.

Again, thanks to Brian Foster for doing all the hard work and long
hours to isolate this leak and make finding the bug relatively
simple.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


authored by

Dave Chinner and committed by
Dave Chinner
110dc24a ce576f1c

+17 -9
+13 -6
fs/xfs/xfs_log.h
··· 24 24 struct xfs_log_iovec *lv_iovecp; /* iovec array */ 25 25 struct xfs_log_item *lv_item; /* owner */ 26 26 char *lv_buf; /* formatted buffer */ 27 - int lv_buf_len; /* size of formatted buffer */ 27 + int lv_bytes; /* accounted space in buffer */ 28 + int lv_buf_len; /* aligned size of buffer */ 28 29 int lv_size; /* size of allocated lv */ 29 30 }; 30 31 ··· 53 52 return vec->i_addr; 54 53 } 55 54 55 + /* 56 + * We need to make sure the next buffer is naturally aligned for the biggest 57 + * basic data type we put into it. We already accounted for this padding when 58 + * sizing the buffer. 59 + * 60 + * However, this padding does not get written into the log, and hence we have to 61 + * track the space used by the log vectors separately to prevent log space hangs 62 + * due to inaccurate accounting (i.e. a leak) of the used log space through the 63 + * CIL context ticket. 64 + */ 56 65 static inline void 57 66 xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len) 58 67 { 59 - /* 60 - * We need to make sure the next buffer is naturally aligned for the 61 - * biggest basic data type we put into it. We already accounted for 62 - * this when sizing the buffer. 63 - */ 64 68 lv->lv_buf_len += round_up(len, sizeof(uint64_t)); 69 + lv->lv_bytes += len; 65 70 vec->i_len = len; 66 71 } 67 72
+4 -3
fs/xfs/xfs_log_cil.c
··· 97 97 { 98 98 /* Account for the new LV being passed in */ 99 99 if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) { 100 - *diff_len += lv->lv_buf_len; 100 + *diff_len += lv->lv_bytes; 101 101 *diff_iovecs += lv->lv_niovecs; 102 102 } 103 103 ··· 111 111 else if (old_lv != lv) { 112 112 ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED); 113 113 114 - *diff_len -= old_lv->lv_buf_len; 114 + *diff_len -= old_lv->lv_bytes; 115 115 *diff_iovecs -= old_lv->lv_niovecs; 116 116 kmem_free(old_lv); 117 117 } ··· 239 239 * that the space reservation accounting is correct. 240 240 */ 241 241 *diff_iovecs -= lv->lv_niovecs; 242 - *diff_len -= lv->lv_buf_len; 242 + *diff_len -= lv->lv_bytes; 243 243 } else { 244 244 /* allocate new data chunk */ 245 245 lv = kmem_zalloc(buf_size, KM_SLEEP|KM_NOFS); ··· 259 259 260 260 /* The allocated data region lies beyond the iovec region */ 261 261 lv->lv_buf_len = 0; 262 + lv->lv_bytes = 0; 262 263 lv->lv_buf = (char *)lv + buf_size - nbytes; 263 264 ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t))); 264 265