Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

block: replace REQ_NOIDLE with REQ_IDLE

Noidle should be the default for writes, as seen by all the compound
definitions in fs.h using it. In fact only direct I/O really should
be using REQ_NOIDLE, so turn the whole flag around to get the defaults
right, which will make our life much easier, especially once the
WRITE_* defines go away.

This assumes all the existing "raw" users of REQ_SYNC for writes
want noidle behavior, which seems to be spot on from a quick audit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

Authored by Christoph Hellwig, committed by Jens Axboe
a2b80967 b685d3d6

+33 -28
+16 -16
Documentation/block/cfq-iosched.txt
@@ -240,11 +240,11 @@
 On this tree we idle on each queue individually.
 
 All synchronous non-sequential queues go on sync-noidle tree. Also any
-request which are marked with REQ_NOIDLE go on this service tree. On this
-tree we do not idle on individual queues instead idle on the whole group
-of queues or the tree. So if there are 4 queues waiting for IO to dispatch
-we will idle only once last queue has dispatched the IO and there is
-no more IO on this service tree.
+synchronous write request which is not marked with REQ_IDLE goes on this
+service tree. On this tree we do not idle on individual queues instead idle
+on the whole group of queues or the tree. So if there are 4 queues waiting
+for IO to dispatch we will idle only once last queue has dispatched the IO
+and there is no more IO on this service tree.
 
 All async writes go on async service tree. There is no idling on async
 queues.
@@ -257,17 +257,17 @@
 
 FAQ
 ===
-Q1. Why to idle at all on queues marked with REQ_NOIDLE.
+Q1. Why to idle at all on queues not marked with REQ_IDLE.
 
-A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
-    with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
+A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
+    with REQ_IDLE. This helps in providing isolation with all the sync-idle
     queues. Otherwise in presence of many sequential readers, other
     synchronous IO might not get fair share of disk.
 
     For example, if there are 10 sequential readers doing IO and they get
-    100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
-    roughly after 1 second. If after completion of REQ_NOIDLE request we
-    do not idle, and after a couple of milli seconds a another REQ_NOIDLE
+    100ms each. If a !REQ_IDLE request comes in, it will be scheduled
+    roughly after 1 second. If after completion of !REQ_IDLE request we
+    do not idle, and after a couple of milli seconds a another !REQ_IDLE
     request comes in, again it will be scheduled after 1second. Repeat it
     and notice how a workload can lose its disk share and suffer due to
     multiple sequential readers.
@@ -276,16 +276,16 @@
     context of fsync, and later some journaling data is written. Journaling
     data comes in only after fsync has finished its IO (atleast for ext4
     that seemed to be the case). Now if one decides not to idle on fsync
-    thread due to REQ_NOIDLE, then next journaling write will not get
+    thread due to !REQ_IDLE, then next journaling write will not get
    scheduled for another second. A process doing small fsync, will suffer
     badly in presence of multiple sequential readers.
 
-    Hence doing tree idling on threads using REQ_NOIDLE flag on requests
+    Hence doing tree idling on threads using !REQ_IDLE flag on requests
     provides isolation from multiple sequential readers and at the same
     time we do not idle on individual threads.
 
-Q2. When to specify REQ_NOIDLE
-A2. I would think whenever one is doing synchronous write and not expecting
+Q2. When to specify REQ_IDLE
+A2. I would think whenever one is doing synchronous write and expecting
     more writes to be dispatched from same context soon, should be able
-    to specify REQ_NOIDLE on writes and that probably should work well for
+    to specify REQ_IDLE on writes and that probably should work well for
     most of the cases.
+8 -3
block/cfq-iosched.c
@@ -3914,6 +3914,12 @@
 	cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
 }
 
+static inline bool req_noidle(struct request *req)
+{
+	return req_op(req) == REQ_OP_WRITE &&
+		(req->cmd_flags & (REQ_SYNC | REQ_IDLE)) == REQ_SYNC;
+}
+
 /*
  * Disable idle window if the process thinks too long or seeks so much that
  * it doesn't matter
@@ -3935,7 +3941,7 @@
 	if (cfqq->queued[0] + cfqq->queued[1] >= 4)
 		cfq_mark_cfqq_deep(cfqq);
 
-	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
+	if (cfqq->next_rq && req_noidle(cfqq->next_rq))
 		enable_idle = 0;
 	else if (!atomic_read(&cic->icq.ioc->active_ref) ||
 		 !cfqd->cfq_slice_idle ||
@@ -4220,8 +4226,7 @@
 	const int sync = rq_is_sync(rq);
 	u64 now = ktime_get_ns();
 
-	cfq_log_cfqq(cfqd, cfqq, "complete rqnoidle %d",
-		     !!(rq->cmd_flags & REQ_NOIDLE));
+	cfq_log_cfqq(cfqd, cfqq, "complete rqnoidle %d", req_noidle(rq));
 
 	cfq_update_hw_tag(cfqd);
+1 -1
drivers/block/drbd/drbd_actlog.c
@@ -148,7 +148,7 @@
 
 	if ((op == REQ_OP_WRITE) && !test_bit(MD_NO_FUA, &device->flags))
 		op_flags |= REQ_FUA | REQ_PREFLUSH;
-	op_flags |= REQ_SYNC | REQ_NOIDLE;
+	op_flags |= REQ_SYNC;
 
 	bio = bio_alloc_drbd(GFP_NOIO);
 	bio->bi_bdev = bdev->md_bdev;
+2 -2
include/linux/blk_types.h
@@ -175,7 +175,7 @@
 	__REQ_META,		/* metadata io request */
 	__REQ_PRIO,		/* boost priority in cfq */
 	__REQ_NOMERGE,		/* don't touch this for merging */
-	__REQ_NOIDLE,		/* don't anticipate more IO after this one */
+	__REQ_IDLE,		/* anticipate more IO after this one */
 	__REQ_INTEGRITY,	/* I/O includes block integrity payload */
 	__REQ_FUA,		/* forced unit access */
 	__REQ_PREFLUSH,		/* request for cache flush */
@@ -190,7 +190,7 @@
 #define REQ_META		(1ULL << __REQ_META)
 #define REQ_PRIO		(1ULL << __REQ_PRIO)
 #define REQ_NOMERGE		(1ULL << __REQ_NOMERGE)
-#define REQ_NOIDLE		(1ULL << __REQ_NOIDLE)
+#define REQ_IDLE		(1ULL << __REQ_IDLE)
 #define REQ_INTEGRITY		(1ULL << __REQ_INTEGRITY)
 #define REQ_FUA			(1ULL << __REQ_FUA)
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
+5 -5
include/linux/fs.h
@@ -197,11 +197,11 @@
 #define WRITE			REQ_OP_WRITE
 
 #define READ_SYNC		0
-#define WRITE_SYNC		(REQ_SYNC | REQ_NOIDLE)
-#define WRITE_ODIRECT		REQ_SYNC
-#define WRITE_FLUSH		(REQ_NOIDLE | REQ_PREFLUSH)
-#define WRITE_FUA		(REQ_NOIDLE | REQ_FUA)
-#define WRITE_FLUSH_FUA		(REQ_NOIDLE | REQ_PREFLUSH | REQ_FUA)
+#define WRITE_SYNC		REQ_SYNC
+#define WRITE_ODIRECT		(REQ_SYNC | REQ_IDLE)
+#define WRITE_FLUSH		REQ_PREFLUSH
+#define WRITE_FUA		REQ_FUA
+#define WRITE_FLUSH_FUA		(REQ_PREFLUSH | REQ_FUA)
 
 /*
  * Attribute flags. These should be or-ed together to figure out what
+1 -1
include/trace/events/f2fs.h
@@ -32,7 +32,7 @@
 TRACE_DEFINE_ENUM(SSR);
 TRACE_DEFINE_ENUM(__REQ_RAHEAD);
 TRACE_DEFINE_ENUM(__REQ_SYNC);
-TRACE_DEFINE_ENUM(__REQ_NOIDLE);
+TRACE_DEFINE_ENUM(__REQ_IDLE);
 TRACE_DEFINE_ENUM(__REQ_PREFLUSH);
 TRACE_DEFINE_ENUM(__REQ_FUA);
 TRACE_DEFINE_ENUM(__REQ_PRIO);