The following was written in response to an email from a
customer who asked about monitoring IO after reading my scsi.d
postings. Tim covers where disk IO requests can be queued in his
posting titled “Where can I/O queue up in sd/ssd”, which
I would recommend as a starting point.
The disk IO sub-systems are built to provide maximum throughput,
which is most often the right thing. However, the weakness of tuning
for throughput is that occasionally you can get some bizarre
behaviour when it comes to latency. The way that optimum IO bandwidth
is achieved is by sorting each IO by logical
block address (LBA) and then issuing those in order to minimize
head seek. This is documented in the disksort(9F) manual page.
Suppose you have a sequence of writes to blocks N, N+1, N+2, N-200,
N+3, N+4, N+5, N+6, N+7, in that order, and your LUN
has a queue depth, and therefore a throttle, of 2.
The IOs will actually be delivered to the LUN in this order: N, N+1,
N+2, N+3, N+4, N+5, N+6, N+7, N-200. Hence there will be a significant
latency applied to the IO going to LBA N-200, and in practice it is
possible to have IO requests delayed on the waitq for many seconds (I
have a pathological test case that can hold them there for the time
it takes to perform an IO on nearly every block on the LUN,
literally hours). You had better hope that IO was not your important
one!
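To make the reordering concrete, here is a minimal Python sketch of the example above. It is a deliberately simplified model, not the disksort(9F) code: it assumes all the writes arrive together, that the first `throttle` requests go straight to the LUN, and that the rest are sorted one way upwards from the last block issued, with anything behind that block held until the end. The function name `delivery_order` and the value chosen for N are made up for the illustration.

```python
def delivery_order(blocks, throttle, sort=True):
    """Return the order in which a toy driver hands blocks to the LUN.

    Assumption: the first `throttle` requests are passed straight through,
    and everything else waits on a queue that is either left in arrival
    order or sorted one way upwards from the last block issued.
    """
    issued = list(blocks[:throttle])      # below the throttle: no queuing
    waiting = list(blocks[throttle:])     # these land on the wait queue
    if not sort:
        return issued + waiting           # disksort disabled: arrival order
    head = issued[-1] if issued else 0    # last LBA sent to the LUN
    ahead = sorted(b for b in waiting if b >= head)
    behind = sorted(b for b in waiting if b < head)
    return issued + ahead + behind        # blocks behind the head go last


N = 1000  # arbitrary starting LBA for the illustration
writes = [N, N + 1, N + 2, N - 200, N + 3, N + 4, N + 5, N + 6, N + 7]

# Offsets from N, so the output lines up with the text above.
print([b - N for b in delivery_order(writes, throttle=2)])
# [0, 1, 2, 3, 4, 5, 6, 7, -200]
print([b - N for b in delivery_order(writes, throttle=2, sort=False)])
# [0, 1, 2, -200, 3, 4, 5, 6, 7]
```

With sorting on, the write to N-200 is pushed to the back of the queue. With sorting off, it keeps its place in the arrival order.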
This issue only comes into play once the disk driver has reached the
throttle for the device, as up until that point each IO can be passed
straight to the LUN for processing.
Once the driver has reached the throttle for the LUN it begins
queuing IO requests internally and by default will sort them to get
maximum throughput. Clearly, the lower the throttle, the sooner you
get into this potential scenario.
Now for the good news. For most disk arrays, sorting by LBA does
not make much sense, since the LUN will be made up of a number of
drives and there will be a read cache and a write cache. So for these
devices it makes sense to disable disksort and deliver the IO
requests to the LUN in the order in which they are delivered to the
disk driver. If you look in the source for sd.c you will see that we
do this by default for most common arrays. To achieve this there is a
flag, “disable disksort”, that can be set in sd.conf or
ssd.conf depending on which driver is in use. See Micheal’s
blog entry about editing sd.conf. While you are reading that
entry, note that you can use it to set the throttle for individual LUNs, so
you do not have to set [s]sd_max_throttle, which penalizes all
devices rather than just the one you were aiming for when only one of
them has a small queue depth. You will see below why
a small queue depth can be a really bad thing.
So how could you spot these latency bubbles?
It will come as no surprise that the answer is dtrace. Using my
pathological test case, but with it set to run for only 10 minutes
against a single spindle, the following D script produces a clear
indication that all is not well:
fbt:ssd:ssdstrategy:entry,
fbt:sd:sdstrategy:entry
{
        /* Record when each buf enters the driver. */
        start[(struct buf *)arg0] = timestamp;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ start[(this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private)] != 0 /
{
        this->un = ((struct sd_xbuf *) this->buf->b_private)->xb_un;
        /* Elapsed time in milliseconds, keyed by the xb_un pointer. */
        @[this->un] = lquantize((timestamp - start[this->buf])/1000000,
            60000, 600000, 60000);
        @q[this->un] = quantize((timestamp - start[this->buf])/1000000);
        start[this->buf] = 0;
}
This produces the following output;
the times are in milliseconds:
dtrace: script 'ssdrwtime.d' matched 4 probes
^C

    6597960853440
           value  ------------- Distribution ------------- count
         < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 377204
           60000 |                                         0
          120000 |                                         0
          180000 |                                         0
          240000 |                                         0
          300000 |                                         0
          360000 |                                         0
          420000 |                                         0
          480000 |                                         2
          540000 |                                         300
       >= 600000 |                                         0

    6597960853440
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         40
               1 |                                         9
               2 |                                         6
               4 |                                         17
               8 |                                         23
              16 |                                         6
              32 |                                         36
              64 |@@                                       15407
             128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   361660
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |                                         0
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |                                         0
          262144 |                                         0
          524288 |                                         302
         1048576 |                                         0
Now recall that my test case is particularly unpleasant, but it
demonstrates the point. 300 IO requests took over 9 minutes, and they
only actually got to complete as the test case was shutting down,
while the vast majority of the IO requests completed in less than
256ms.
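If the two distributions are hard to read, the following Python sketch shows how I understand the bucketing to work: quantize() files a value under the largest power of two that does not exceed it, and lquantize() files it under a fixed-width step between a base and a limit. The function names are my own and the sketch only handles non-negative values; it is not how dtrace implements the aggregations.

```python
def quantize_bucket(value):
    """Lower bound of the power-of-two bucket for a non-negative value."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    if value == 0:
        return 0
    return 1 << (value.bit_length() - 1)  # largest power of two <= value


def lquantize_bucket(value, base, limit, step):
    """Row label for value in a linear distribution from base to limit."""
    if value < base:
        return f"< {base}"
    if value >= limit:
        return f">= {limit}"
    return str(base + ((value - base) // step) * step)


# An IO that took nine and a half minutes, expressed in milliseconds.
slow = 570000
print(quantize_bucket(slow))                          # 524288
print(lquantize_bucket(slow, 60000, 600000, 60000))   # 540000

# A more typical IO of 200ms.
print(quantize_bucket(200))                           # 128
print(lquantize_bucket(200, 60000, 600000, 60000))    # < 60000
```

So a count against 128 means the IO took at least 128ms but less than 256ms, and a count against 540000 means it took at least 9 minutes but less than 10.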
Now let’s run the same pathological test
with disksort disabled:
dtrace: script 'ssdrwtime.d' matched 4 probes
^C

    6597960853440
           value  ------------- Distribution ------------- count
         < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 543956
           60000 |                                         0

    6597960853440
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         30
               1 |                                         21
               2 |                                         30
               4 |                                         0
               8 |                                         0
              16 |                                         50
              32 |                                         3
              64 |                                         384
             128 |                                         505
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  531169
             512 |@                                        11764
            1024 |                                         0
Notice that the majority of the IO requests now took longer,
falling in the 256ms bucket rather than the 128ms bucket, but none of
the IO requests took many minutes.
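The difference between these two runs can be reproduced with a toy model. The Python sketch below is an assumption-laden illustration, not my test case and not the driver's logic: one request to a low LBA is already waiting, a stream of requests to increasing LBAs arrives at one per tick, and one queued request is issued per tick. With sorting, the driver always picks the lowest block at or beyond the last one issued. Without sorting, it picks the oldest. The name `ticks_until_issued` and the block numbers are invented for the example.

```python
def ticks_until_issued(outlier, stream, sort):
    """Count the IOs issued before `outlier` leaves the wait queue.

    Assumptions: `stream` is a non-empty list of LBAs that arrive one per
    tick after the outlier is queued, one queued IO is issued per tick, and
    the last block issued starts at the first LBA of the stream.
    """
    waitq = [outlier]
    head = stream[0]
    arrivals = iter(stream)
    issued = 0
    while True:
        nxt = next(arrivals, None)
        if nxt is not None:
            waitq.append(nxt)             # a new request joins the queue
        if sort:
            # One-way sort: lowest block at or beyond the head, wrapping
            # back to the lowest block overall only when nothing is ahead.
            ahead = [b for b in waitq if b >= head]
            pick = min(ahead) if ahead else min(waitq)
        else:
            pick = waitq[0]               # arrival order
        waitq.remove(pick)
        head = pick
        if pick == outlier:
            return issued
        issued += 1


stream = list(range(1000, 1100))          # 100 sequential requests
print(ticks_until_issued(800, stream, sort=True))    # 100
print(ticks_until_issued(800, stream, sort=False))   # 0
```

In the sorted case the low-LBA request waits for the whole stream to drain, however long the stream is, which is the same shape as the 9 minute outliers above. In the unsorted case it is issued in turn.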
Now my test case is pathological, but if
you have drives with small queue depths and disksort is still
enabled you are open to some quite spectacular latency bubbles. To
mitigate this my advice is:
- Don’t ever set the global [s]sd_max_throttle in /etc/system. Use
  the [s]sd.conf file to set the appropriate throttle for each device.
- Consider what is more important to you: throughput or latency. If
  it is latency, or if your LUN is on a storage array, then turn off
  disksort using the [s]sd.conf file.
- If you have pathological applications then understand that the IO
  subsystem can give you throughput or bounded latency, not both. So
  separate out the IO devices that need throughput from those for
  which latency is more important.
- Be aware that even “dumb” disk drives often implement disksort
  internally, so in some cases they can give similar issues when they
  have a queue depth greater than 2. In those cases you may find it
  better to throttle them down to a queue depth of 2 and disable
  disksort in [s]sd to get the most predictable latency, albeit at the
  expense of throughput. If this is your issue then you can spot it
  either by using scsi.d directly or by modifying it to produce
  aggregations like those above. I’ll leave that as an exercise for
  the reader.