The disk drivers in Solaris have supported SCSI tagged queuing for a long time. This lets them send more than one command to a LUN (Logical Unit Number) at a time. The number of commands that can be sent in parallel is limited, or throttled, by the disk drivers so that they never send more commands than the LUN can cope with. While it is possible for the LUN to respond with a “queue full” SCSI status to tell the driver that it cannot cope with any more commands, there are significant problems with relying on this approach:
Devices connected via fibre channel have to negotiate onto the loop before they can return the queue full status. By the time the device manages to return queue full, the host can have sent many more commands. This risks overwhelming the LUN, which typically results in the LUN resetting.
If the LUN is being accessed from more than one host, it is possible for it to return queue full on the very first command. This makes it hard for the host to know when it will be safe to send a command, since there are none outstanding from that host.
If the LUN is one of many LUNs on a single target, it may share the total pool of commands that can be accepted across all the LUNs, and so again could respond with “queue full” on the first command sent to it.
In the two cases above, the total number of commands a single host can send to a single LUN varies depending on conditions that the host simply cannot know, making adaptive algorithms unreliable.
All the above issues result in people taking the safest option and setting the throttle for a device as low as is required so that the LUN never needs to send queue full, in some cases as low as 1. This is bad when limited to an individual LUN; it is terrible when done globally for the entire system.
As soon as you hit the throttle, two things happen:
For writes, you are no longer transferring data over the interconnect (fibre channel, parallel SCSI or iSCSI). That data has to wait until another command completes before it can be transferred, which reduces the throughput of the device. Your writes can end up being throttled by reads, and hence tend towards the speed of the spinning disk if the reads have to go to the platter, even though you may have a write cache.
The command is queued on the waitq, which increases latency still further if the queue becomes deep. See here for information about disksort’s effect on latency.
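The waitq effect can be sketched with a toy model (illustrative only, not Solaris code; the function name and numbers here are made up for the example). Commands beyond the throttle cannot be issued until an earlier command completes, so they finish in later “waves” and their latency grows with queue depth:

```python
import math

def waitq_latencies(burst, throttle, service_ms):
    """Model a burst of `burst` commands sent to a device that accepts at
    most `throttle` outstanding commands, each taking `service_ms` to
    complete. The i-th command completes in wave ceil(i / throttle), so
    its latency is that wave number times the service time."""
    return [math.ceil(i / throttle) * service_ms for i in range(1, burst + 1)]

# A burst of 32 commands, each serviced in 5 ms:
deep = waitq_latencies(32, throttle=8, service_ms=5)
shallow = waitq_latencies(32, throttle=1, service_ms=5)
print(deep[-1])     # → 20   (last command waits 20 ms)
print(shallow[-1])  # → 160  (with a throttle of 1, 160 ms)
```

The point of the model is only that the penalty scales with burst size divided by throttle: the lower the throttle, the more waves a burst needs, and the longer the tail of the waitq sits idle waiting for slots.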
Given that the system will regularly dump large numbers of commands on devices for short periods of time, you want those commands to be handled as quickly as possible to minimize applications stalling while their IO completes. If you want to observe the maximum number of commands sent to a device, there is a D script here to do that.
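The linked script is not reproduced here, but a minimal sketch of the idea using the DTrace io provider (my own approximation, not the original script) would count outstanding commands per device and keep the high-water mark:

```d
#!/usr/sbin/dtrace -s
/*
 * Track the maximum number of commands outstanding per device.
 * io:::start fires when a command is issued, io:::done when it
 * completes; args[1] is the devinfo_t for the device.
 */
io:::start
{
	pending[args[1]->dev_statname]++;
	@maxpend[args[1]->dev_statname] = max(pending[args[1]->dev_statname]);
}

io:::done
/pending[args[1]->dev_statname] > 0/
{
	pending[args[1]->dev_statname]--;
}
```

Run it for a while under your normal workload and the aggregation printed on exit shows, per device, the deepest the queue got; comparing that against your throttle setting tells you whether you are hitting it.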
So the advice for configuring storage would be:
Don’t break large arrays into so many LUNs that you have to throttle the IO down to a very low, single-digit, value.
If you do have to throttle IO, then don’t do it globally; do it on a per-LUN basis. See http://blogs.sun.com/mtibbets/entry/scuzzy_scsi_configuration for details on how to set per-LUN settings in sd.conf and ssd.conf.
Make sure that, if you have a storage device with a write cache, you have disabled disksort. Again, see http://blogs.sun.com/mtibbets/entry/scuzzy_scsi_configuration for how.
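As a sketch of what the per-LUN tuning from both links above looks like: on recent Solaris 10 updates the sd and ssd drivers accept name:value pairs in sd-config-list (older releases use a different bitmask format, so check the sd(7D) man page for your release). The vendor/product string below is a placeholder; it must match your array’s inquiry data exactly, including the padding spaces, and the throttle value of 8 is just an example:

```
# /kernel/drv/sd.conf (or ssd.conf for fibre channel devices)
# "SUN     T300" is a placeholder vendor/product string: replace it with
# the inquiry data of your array, padded to the correct width.
sd-config-list = "SUN     T300", "throttle-max:8, disksort:false";
```

This applies both pieces of advice at once, and only to LUNs on the matching array: the throttle is capped per LUN rather than globally, and disksort is disabled so commands are not reordered on the assumption that they are destined for a bare spinning disk.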