Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

dio: optimize cache misses in the submission path

Some investigation of a transaction processing workload showed that a
major consumer of cycles in __blockdev_direct_IO is the cache miss while
accessing the block size. This is because it has to walk the chain from
block_dev to gendisk to queue.

The block size is needed early on to check alignment and sizes, but it is
only actually used if the check against the inode block size fails. The
costly block device state, however, was fetched unconditionally.

- Reorganized the code to fetch the block device state only when it is
actually needed.

Then issue a prefetch on the block device early in the direct IO path. This
is worthwhile because substantial code now runs before the block device is
actually touched, giving the prefetch time to complete.

- I also added some unlikelies to make it clear to the compiler that the
block device fetch code is not normally executed.

This gave a small but measurable improvement (about 0.3%) on a large
database benchmark.

[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Andi Kleen, committed by Linus Torvalds
65dd2aa9 87192a2a

+37 -9
fs/direct-io.c
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -36,6 +36,7 @@
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <linux/atomic.h>
+#include <linux/prefetch.h>
 
 /*
  * How many user pages to map in one call to get_user_pages(). This determines
@@ -1088,8 +1087,8 @@
  * individual fields and will generate much worse code. This is important
  * for the whole file.
  */
-ssize_t
-__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+static inline ssize_t
+do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,
 	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
 	dio_submit_t submit_io, int flags)
@@ -1098,7 +1097,6 @@
 	size_t size;
 	unsigned long addr;
 	unsigned blkbits = inode->i_blkbits;
-	unsigned bdev_blkbits = 0;
 	unsigned blocksize_mask = (1 << blkbits) - 1;
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
@@ -1110,12 +1110,14 @@
 	if (rw & WRITE)
 		rw = WRITE_ODIRECT;
 
-	if (bdev)
-		bdev_blkbits = blksize_bits(bdev_logical_block_size(bdev));
+	/*
+	 * Avoid references to bdev if not absolutely needed to give
+	 * the early prefetch in the caller enough time.
+	 */
 
 	if (offset & blocksize_mask) {
 		if (bdev)
-			blkbits = bdev_blkbits;
+			blkbits = blksize_bits(bdev_logical_block_size(bdev));
 		blocksize_mask = (1 << blkbits) - 1;
 		if (offset & blocksize_mask)
 			goto out;
@@ -1128,11 +1126,13 @@
 		addr = (unsigned long)iov[seg].iov_base;
 		size = iov[seg].iov_len;
 		end += size;
-		if ((addr & blocksize_mask) || (size & blocksize_mask)) {
+		if (unlikely((addr & blocksize_mask) ||
+			     (size & blocksize_mask))) {
 			if (bdev)
-				blkbits = bdev_blkbits;
+				blkbits = blksize_bits(
+					 bdev_logical_block_size(bdev));
 			blocksize_mask = (1 << blkbits) - 1;
-			if ((addr & blocksize_mask) || (size & blocksize_mask)) 
+			if ((addr & blocksize_mask) || (size & blocksize_mask))
 				goto out;
 		}
 	}
@@ -1317,6 +1313,30 @@
 out:
 	return retval;
 }
+
+ssize_t
+__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+	struct block_device *bdev, const struct iovec *iov, loff_t offset,
+	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
+	dio_submit_t submit_io, int flags)
+{
+	/*
+	 * The block device state is needed in the end to finally
+	 * submit everything. Since it's likely to be cache cold
+	 * prefetch it here as first thing to hide some of the
+	 * latency.
+	 *
+	 * Attempt to prefetch the pieces we likely need later.
+	 */
+	prefetch(&bdev->bd_disk->part_tbl);
+	prefetch(bdev->bd_queue);
+	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
+
+	return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
+				     nr_segs, get_block, end_io,
+				     submit_io, flags);
+}
+
 EXPORT_SYMBOL(__blockdev_direct_IO);
 
 static __init int dio_init(void)