Latency bubbles in your disk IO

January 11, 2008

The following was written in response to an email from a customer about monitoring I/O, prompted by my scsi.d postings. Tim covers where disk I/O requests can be queued in his posting titled “Where can I/O queue up in sd/ssd”, which I would recommend as a starting point.

The disk I/O sub-systems are built to provide maximum throughput, which is most often the right thing. However, the weakness of tuning for throughput is that you can occasionally get some bizarre behaviour when it comes to latency. Optimum I/O bandwidth is achieved by sorting the I/O requests by logical block address (LBA) and then issuing them in that order to minimize head seeks. This is documented in the disksort(9F) manual page.

So suppose you have a sequence of writes to blocks N, N+1, N+2, N-200, N+3, N+4, N+5, N+6, N+7 in that order, and your LUN has a queue depth, and therefore a throttle, of 2.[1] The I/Os will actually be delivered to the LUN in this order: N, N+1, N+2, N+3, N+4, N+5, N+6, N+7, N-200. Hence significant latency is added to the I/O going to LBA N-200, and in practice it is possible to have I/O requests delayed on the waitq for many seconds (I have a pathological test case that can hold them there for the time it takes to perform an I/O on nearly every block on the LUN, literally hours). You had better hope that that I/O was not your important one!
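The reordering can be sketched with a toy model. This is not the sd driver code, just a simplification under stated assumptions: a single ascending elevator sweep, a throttle of 2, and commands completing in the order they were issued. It reproduces exactly the sequence described above, with N-200 overtaken by every later request:

```python
# Toy model of disksort(9F)-style queuing: once the device throttle is
# reached, requests wait in a driver queue kept sorted by LBA and are
# serviced elevator-style (next LBA at or past the head, wrapping around).
import bisect

def simulate_disksort(requests, throttle=2):
    """Return the order in which LBAs are serviced by the LUN."""
    in_flight = []      # commands outstanding on the LUN (max = throttle)
    waitq = []          # driver wait queue, kept sorted by LBA
    issued = []
    for lba in requests:
        if len(in_flight) < throttle:
            in_flight.append(lba)      # below throttle: straight to the LUN
        else:
            bisect.insort(waitq, lba)  # at throttle: queued, sorted by LBA
    while in_flight:
        head = in_flight.pop(0)        # oldest outstanding command completes
        issued.append(head)
        if waitq:
            i = bisect.bisect_left(waitq, head)
            if i == len(waitq):
                i = 0                  # sweep passed the end: wrap around
            in_flight.append(waitq.pop(i))
    return issued

N = 1000
order = simulate_disksort([N, N+1, N+2, N-200, N+3, N+4, N+5, N+6, N+7])
print(order)   # N-200 (800) is serviced last, after every later arrival
```

With a longer stream of ascending writes, the same model keeps N-200 waiting indefinitely, which is the pathological case described above.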

This issue only comes into play once the disk driver has reached the throttle for the device, as up until that point each I/O can be passed straight to the LUN for processing.[2] Once the driver has reached the throttle for the LUN it begins queuing I/O requests internally and by default will sort them to get maximum throughput. Clearly, the lower the throttle, the sooner you get into this potential scenario.

Now for the good news. For most disk arrays, sorting by LBA does not make much sense, since the LUN will be made up of a number of drives and there will be a read cache and a write cache. So for these devices it makes sense to disable disksort and deliver the I/O requests to the LUN in the order in which they are delivered to the disk driver. If you look in the source for sd.c you will see that we do this by default for most common arrays. To achieve this there is a flag, “disable disksort”, that can be set in sd.conf or ssd.conf depending on which driver is in use. See Michael’s blog entry about editing sd.conf. While you are reading that entry, note that you can use it to set the throttle for individual LUNs, so you do not have to set [s]sd_max_throttle, which penalizes all devices rather than just the one you were aiming for. This matters if you have even one device with a small queue depth, and you will see below why a small queue depth can be a really bad thing.
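As an illustration only (the exact property syntax varies between Solaris releases, and the vendor/product string must match your device’s INQUIRY data with the vendor field padded to eight characters, so check the sd(7D) documentation for your release), a per-device sd.conf entry might look something like:

```
# Hypothetical example: match a device by its INQUIRY vendor/product
# strings and apply per-LUN tunables rather than the global
# [s]sd_max_throttle. Names and values here are illustrative.
sd-config-list = "SUN     T300", "disksort:false, throttle-max:8";
```

The point is that the tunables apply only to LUNs matching that string, leaving every other device’s behaviour unchanged.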

So how could you spot these latency bubbles?

It will come as no surprise that the answer is DTrace. Using my pathological test case, but with it set to run for only 10 minutes against a single spindle, the following D script produces a clear indication that all is not well:

fbt:ssd:ssdstrategy:entry, fbt:sd:sdstrategy:entry
{
      start[(struct buf *)arg0] = timestamp;
}

fbt:ssd:ssdintr:entry, fbt:sd:sdintr:entry
/ start[(this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private)] != 0 /
{
       this->un = ((struct sd_xbuf *)this->buf->b_private)->xb_un;
       @[this->un] = lquantize((timestamp - start[this->buf]) / 1000000,
            60000, 600000, 60000);
       @q[this->un] = quantize((timestamp - start[this->buf]) / 1000000);
       start[this->buf] = 0;
}

This produces the following output;[3] the times are in milliseconds:

dtrace: script 'ssdrwtime.d' matched 4 probes
^C
        value  ------------- Distribution ------------- count
      < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 377204
        60000 |                                         0
       120000 |                                         0
       180000 |                                         0
       240000 |                                         0
       300000 |                                         0
       360000 |                                         0
       420000 |                                         0
       480000 |                                         2
       540000 |                                         300
    >= 600000 |                                         0
        value  ------------- Distribution ------------- count
           -1 |                                         0
            0 |                                         40
            1 |                                         9
            2 |                                         6
            4 |                                         17
            8 |                                         23
           16 |                                         6
           32 |                                         36
           64 |@@                                       15407
          128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   361660
          256 |                                         0
          512 |                                         0
         1024 |                                         0
         2048 |                                         0
         4096 |                                         0
         8192 |                                         0
        16384 |                                         0
        32768 |                                         0
        65536 |                                         0
       131072 |                                         0
       262144 |                                         0
       524288 |                                         302
      1048576 |                                         0

Now recall that my test case is particularly unpleasant, but it demonstrates the point: 300 I/O requests took over 9 minutes, and they only actually got to complete as the test case was shutting down, while the vast majority of the I/O requests completed in less than 256ms.

Now let’s run the same pathological test with disksort disabled:

dtrace: script 'ssdrwtime.d' matched 4 probes
value  ------------- Distribution ------------- count
< 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 543956
  60000 |                                         0
value  ------------- Distribution ------------- count
   -1 |                                         0
    0 |                                         30
    1 |                                         21
    2 |                                         30
    4 |                                         0
    8 |                                         0
   16 |                                         50
   32 |                                         3
   64 |                                         384
  128 |                                         505
  256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  531169
  512 |@                                        11764
 1024 |                                         0

Notice that the majority of the I/O requests now took longer, falling in the 256ms bucket rather than the 128ms bucket, but none of them took many minutes.

Now, my test case is pathological, but if you have drives with small queue depths and disksort is still enabled you are open to some quite spectacular latency bubbles. To mitigate this, my advice is:

  1. Don’t ever set the global [s]sd_max_throttle in /etc/system. Use the [s]sd.conf file to set the appropriate throttle for each device.

  2. Consider which is more important to you: throughput or latency. If it is latency, or if your LUN is on a storage array, then turn off disksort using the [s]sd.conf file.

  3. If you have pathological applications then understand that the I/O subsystem can give you throughput or bounded latency, not both. So separate out the I/O devices that need throughput from those for which latency is more important.

  4. Be aware that even “dumb” disk drives often implement disksort internally, so in some cases they can exhibit a similar issue when they have a queue depth greater than 2.[4] In those cases you may find it better to throttle them down to a queue depth of 2 and disable disksort in [s]sd to get the most predictable latency, albeit at the expense of throughput. If this is your issue then you can spot it either by using scsi.d directly or by modifying it to produce aggregations like those above. I’ll leave that as an exercise for the reader.
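To see what the latency side of that trade-off buys you, here is a toy model (again, a sketch, not driver code) of the FIFO case: with disksort disabled, requests reach the LUN in arrival order, so no request can be overtaken and each one waits behind at most the requests that arrived before it, regardless of the LBA pattern:

```python
# Toy model with disksort disabled: a FIFO wait queue with 'throttle'
# commands outstanding on the LUN at once. Service order always matches
# arrival order, so worst-case wait is bounded by queue length.
from collections import deque

def simulate_fifo(requests, throttle=2):
    """Return the order in which LBAs are serviced by the LUN."""
    in_flight = deque()
    waitq = deque()
    issued = []
    for lba in requests:
        (in_flight if len(in_flight) < throttle else waitq).append(lba)
    while in_flight:
        issued.append(in_flight.popleft())     # oldest command completes
        if waitq:
            in_flight.append(waitq.popleft())  # next arrival is issued
    return issued

N = 1000
stream = [N, N+1, N+2, N-200, N+3, N+4, N+5, N+6, N+7]
print(simulate_fifo(stream) == stream)   # True: arrival order preserved
```

The backwards seek to N-200 now costs some throughput on every sweep, which is exactly the 128ms-to-256ms shift visible in the second set of output above.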

[1] The queue depth of a LUN is the number of commands that it can handle at the same time. The throttle is usually set to the same number and is used by the disk driver to prevent it from sending more commands than the device can cope with.

[2] Now the LUN itself may then reorder the I/O if it has more than two I/Os in its internal queue.

[3] Edited to remove output for other drives.

[4] With a queue depth of 2 the drive cannot sort the I/O requests. It has to have one active and the next one waiting. When the active one completes, the waiting one will be actioned before a new command can come from the initiator.


From → bsc, Solaris

  1. Alex Maidak

    Great blog… Many 3rd party array vendors require you to tune down the sd_max_throttle parameter. While researching this further I encountered this document in SPECTRUM: 6378621. According to it disksort should be disabled by default. How can I tell whether or not disksort is enabled for a particular device on my system?

  2. [Trackback] Following on from the latency
    bubbles in your IO posting. I have been asked two questions about
    this post privately:
    How can you map those long numbers in the output into
    readable entries, eg sd0.
    How can I confirm that disksort has …

Trackbacks & Pingbacks

  1. Latency Bubbles follow up « Chris Gerhard's Blog
