Documentation/block/as-iosched.txt at v2.6.13

Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au>    13 Sep 2003

Attention! Database servers, especially those using "TCQ" disks, should
investigate performance with the 'deadline' IO scheduler. Any system with high
disk performance requirements should do so, in fact.

If you see unusual performance characteristics of your disk systems, or you
see big performance regressions versus the deadline scheduler, please email
me. Database users, don't bother unless you're willing to test a lot of patches
from me ;) it's a known issue.

Also, users with hardware RAID controllers doing striping may find
highly variable performance results when using the as-iosched. The
as-iosched anticipatory implementation is based on the notion that a disk
device has only one physical seeking head.  A striped RAID controller
actually has a head for each physical device in the logical RAID device.

However, setting antic_expire to zero (see tunable parameters below) produces
behavior very similar to that of the deadline IO scheduler.


Selecting IO schedulers
-----------------------
To choose an IO scheduler at boot time, use the argument 'elevator=deadline'.
'noop' and 'as' (the default) are also available. At present, IO schedulers
can only be assigned globally, at boot time.


Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation applies several layers of policies
to determine when an IO request is dispatched to the disk controller.
The policies are outlined here, in order of application.

1. One-way elevator algorithm.

The elevator algorithm is similar to that used in the deadline scheduler, with
the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards).  A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position, and
the other is in front of the elevator's position. If the seek distance to
the request behind the elevator is less than half the seek distance to
the request in front of the elevator, then the request behind can be chosen.
Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
This favors forward movement of the elevator, while allowing opportunistic
"short" backward seeks.
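
The backward-seek rule can be illustrated with a minimal sketch. It is not
the kernel code: the function name and the use of bare sector numbers are
assumptions made for illustration, and where the text says the request behind
"can be" chosen, the sketch simply always chooses it.

```python
MAXBACK = 1024 * 1024  # maximum backward seek in sectors (value from the text)


def choose_next(head, front, back):
    """Pick between a request in front of the head and one behind it.

    head, front and back are sector numbers with back < head <= front.
    Returns the sector of the request to service next.
    """
    forward_dist = front - head
    backward_dist = head - back
    # Seek backwards only if the distance is within MAXBACK and is less
    # than half of the forward seek distance.
    if backward_dist <= MAXBACK and backward_dist < forward_dist / 2:
        return back
    return front
```

With the head at sector 1000, a request 100 sectors behind is preferred over
one 400 sectors ahead, but not over one 100 sectors ahead.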

2. FIFO expiration times for reads and for writes.

This is again very similar to the deadline IO scheduler.  The expiration
times for requests on these lists are tunable using the parameters read_expire
and write_expire discussed below.  When a read or a write expires in this way,
the IO scheduler will interrupt its current elevator sweep or read anticipation
to service the expired request.
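
A rough sketch of the expiration check follows. The request layout, the
helper names, and the choice to treat a request as expired exactly at its
deadline are all assumptions for illustration, not the scheduler's actual
data structures.

```python
from collections import deque


def make_request(sector, arrival_ms, expire_ms):
    """Return a request tagged with the time at which it expires."""
    return {"sector": sector, "deadline_ms": arrival_ms + expire_ms}


def expired_head(fifo, now_ms):
    """Return the oldest request if its deadline has passed, else None.

    fifo is a deque ordered by arrival, so only the head needs checking.
    """
    if fifo and fifo[0]["deadline_ms"] <= now_ms:
        return fifo[0]
    return None
```

A scheduler built this way would keep one such FIFO for reads (using
read_expire) and one for writes (using write_expire).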

3. Read and write request batching.

A batch is a collection of read requests or a collection of write
requests.  The as scheduler alternates dispatching read and write batches
to the driver.  In the case of a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit, and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.

In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available, and the
write batch time limit has not been exceeded (write_batch_expire).
However, the length of write batches will be gradually shortened
when read batches frequently exceed their time limit.
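
The two batch conditions can be sketched as simple predicates. The function
and argument names are hypothetical, and the gradual shortening of write
batches is left out because the text does not say how it is computed.

```python
def read_batch_may_continue(reads_pending, writes_pending,
                            ms_since_writes_pending, read_batch_expire):
    """True while the current read batch may keep dispatching reads."""
    if reads_pending == 0:
        return False
    # The read batch clock only runs while writes are competing.
    if writes_pending == 0:
        return True
    return ms_since_writes_pending < read_batch_expire


def write_batch_may_continue(writes_pending, ms_in_batch, write_batch_expire):
    """True while the current write batch may keep dispatching writes."""
    return writes_pending > 0 and ms_in_batch < write_batch_expire
```

Note the asymmetry: a read batch with no competing writes can run
indefinitely, while a write batch is always bounded by its time limit.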

When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch.

The read and write FIFO expiration times described in policy 2 above
are checked only while scheduling IO for a batch of the corresponding
(read/write) type.  So for example, the read FIFO timeout values are
tested only during read batches.  Likewise, the write FIFO timeout
values are tested only during write batches.  For this reason,
it is generally not recommended for the read batch time
to be longer than the write expiration time, nor for the write batch
time to exceed the read expiration time (see tunable parameters below).
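
The recommendation above can be turned into a small sanity check on a set of
tunable values. This is only a sketch; the function name and the warning
strings are made up for illustration, and all values are in milliseconds.

```python
def batch_expire_warnings(read_expire, read_batch_expire,
                          write_expire, write_batch_expire):
    """Return a list of warnings for settings the text advises against."""
    warnings = []
    # Writes cannot expire their way into service during a read batch.
    if read_batch_expire > write_expire:
        warnings.append("read_batch_expire exceeds write_expire")
    # Reads cannot expire their way into service during a write batch.
    if write_batch_expire > read_expire:
        warnings.append("write_batch_expire exceeds read_expire")
    return warnings
```

An empty list means the settings follow the recommendation.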

When the IO scheduler changes from a read to a write batch,
it begins the elevator from the request that is at the head of the
write expiration FIFO.  Likewise, when changing from a write batch to
a read batch, the scheduler begins the elevator from the first entry
on the read expiration FIFO.

4. Read anticipation.

Read anticipation occurs only when scheduling a read batch.
This implementation of read anticipation allows only one read request
to be dispatched to the disk controller at a time.  In
contrast, many write requests may be dispatched to the disk controller
at a time during a write batch.  It is this characteristic that can make
the anticipatory scheduler perform anomalously with controllers supporting
TCQ, or with hardware striped RAID devices. Setting the antic_expire
queue parameter (see below) to zero disables this behavior, and the anticipatory
scheduler behaves essentially like the deadline scheduler.

When read anticipation is enabled (antic_expire is not zero), reads
are dispatched to the disk controller one at a time.
At the end of each read request, the IO scheduler examines its next
candidate read request from its sorted read list.  If that next request
is from the same process as the request that just completed,
or if the next request in the queue is "very close" to the
just completed request, it is dispatched immediately.  Otherwise,
statistics (average think time, average seek distance) on the process
that submitted the just completed request are examined.  If it seems
likely that the process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up to antic_expire
milliseconds, hoping that process will submit a new request near the one
that just completed.  If such a request is made, then it is dispatched
immediately.  If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.
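
The decision made when a read completes might be sketched as below. The text
does not define "very close", "soon" or "near", so the thresholds here are
assumptions: close_sectors is a made-up cutoff, "soon" is taken to mean a
mean think time within antic_expire, and "near" is taken to mean a mean seek
distance shorter than the seek to the candidate request.

```python
def on_read_complete(done, candidate, stats, antic_expire, close_sectors=8):
    """Return "dispatch" to send the candidate now, or "wait" to anticipate.

    done and candidate are dicts with "pid" and "sector" keys; stats holds
    "mean_think_ms" and "mean_seek_sectors" for the process that issued done.
    """
    if antic_expire == 0:
        return "dispatch"              # anticipation is disabled
    distance = abs(candidate["sector"] - done["sector"])
    if candidate["pid"] == done["pid"] or distance <= close_sectors:
        return "dispatch"              # same process, or "very close"
    # Assumed heuristics for "soon" and "near" (not taken from the text).
    likely_soon = stats["mean_think_ms"] <= antic_expire
    likely_near = stats["mean_seek_sectors"] < distance
    return "wait" if likely_soon and likely_near else "dispatch"
```

A "wait" result stands for the anticipation period of up to antic_expire
milliseconds; the timeout handling itself is not modelled.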

To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute
mean "think time" (the time between read requests), and mean seek
distance for that process.  One observation is that these statistics
are associated with each process, but not with a specific IO device.
So for example, if a process is doing IO on several file systems on
separate devices, the statistics will be a combination of IO behavior
from all those devices.
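
The per-process, device-independent bookkeeping can be sketched as follows.
The exponentially weighted mean and its weight are assumptions chosen for
brevity; the text only says that means can be computed, not how.

```python
class ProcessIOStats:
    """Running means of think time and seek distance for one process."""

    def __init__(self, weight=0.25):
        self.weight = weight           # hypothetical smoothing factor
        self.mean_think_ms = None
        self.mean_seek_sectors = None

    def _blend(self, old, sample):
        # The first sample seeds the mean; later samples are blended in.
        if old is None:
            return float(sample)
        return (1 - self.weight) * old + self.weight * sample

    def record(self, think_ms, seek_sectors):
        self.mean_think_ms = self._blend(self.mean_think_ms, think_ms)
        self.mean_seek_sectors = self._blend(self.mean_seek_sectors,
                                             seek_sectors)


stats_by_pid = {}


def record_read(pid, device, think_ms, seek_sectors):
    """Update statistics for pid; device is deliberately not part of the key."""
    stats_by_pid.setdefault(pid, ProcessIOStats()).record(think_ms,
                                                          seek_sectors)
```

Because the dictionary is keyed by process only, reads issued by one process
to two different devices end up in the same pair of means.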


Tuning the anticipatory IO scheduler
------------------------------------
When using 'as', the anticipatory IO scheduler, there are 5 parameters under
/sys/block/*/queue/iosched/. All are in units of milliseconds.
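
As a small sketch, the five values can be read from one device's iosched
directory like this. The helper name is made up, and it assumes each file
holds a single integer; the file names are the parameter names used in this
document.

```python
import os

AS_TUNABLES = ("read_expire", "read_batch_expire", "write_expire",
               "write_batch_expire", "antic_expire")


def read_as_tunables(iosched_dir):
    """Return {name: milliseconds} for the five tunables in iosched_dir."""
    values = {}
    for name in AS_TUNABLES:
        with open(os.path.join(iosched_dir, name)) as f:
            values[name] = int(f.read().strip())
    return values
```

For a disk named hda (a hypothetical device name) this would be called as
read_as_tunables("/sys/block/hda/queue/iosched").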

The parameters are:
* read_expire
    Controls how long until a read request becomes "expired". It also controls the
    interval at which expired requests are served, so with it set to 50, a request
    might take anywhere up to 100ms to be serviced _if_ it is the next on the
    expired list. Obviously request expiration strategies won't make the disk
    go faster. The result basically equates to the timeslice a single reader
    gets in the presence of other IO. 100/((seek time / read_expire) + 1) is
    very roughly the % streaming read efficiency your disk should get with
    multiple readers.

* read_batch_expire
    Controls how much time a batch of reads is given before pending writes are
    served. A higher value is more efficient. This might be set below read_expire
    if writes are to be given higher priority than reads, but reads are to be
    as efficient as possible when there are no writes. Generally though, it
    should be some multiple of read_expire.

* write_expire, and
* write_batch_expire are equivalent to the above, for writes.

* antic_expire
    Controls the maximum amount of time we can anticipate a good read (one
    with a short seek distance from the most recently completed request) before
    giving up. Many other factors may cause anticipation to be stopped early,
    and some processes will not be "anticipated" at all. Should be a bit higher
    for devices with long seek times, though the correspondence is not linear -
    most processes have only a few ms of think time.
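
The streaming efficiency estimate given under read_expire can be worked
through numerically. The sketch below reads it as the share of each
timeslice spent transferring rather than seeking, that is
100 / (seek time / read_expire + 1), which keeps the result at or below 100%.
That reading is an interpretation, and the function name is made up.

```python
def streaming_read_efficiency(seek_ms, read_expire_ms):
    """Very rough % of disk time spent streaming with multiple readers.

    Each reader streams for about read_expire_ms, then the disk spends
    about seek_ms moving to the next reader's data.
    """
    return 100.0 / (seek_ms / read_expire_ms + 1)
```

For example, a 10 ms seek with read_expire at 90 gives roughly 90%, and
raising read_expire pushes the estimate closer to 100% at the cost of longer
waits for the other readers.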