I am finally getting around to setting up a SSD array, starting with a few, then working up to perhaps 20 units over 2 controllers and 4 x4 SAS ports. During initial testing I observed very high disk latency, in the range of 100ms+ for reads and upto 400ms+ for writes in certain operations. This occurs during periods of exceptionally high disk queue depth. So the questions are 1) does high queue-depth and latency improve performance 2) cause problems with other operations, including responsiveness, and 3) can this be avoided without giving up performance?
Lets start with a single disk drive performance principles, then proceeding to disk arrays, caching RAID controllers, and finally to the SAN system, with consideration for SSD characteristics. Certain storage performance advice commonly cited without reference to their origin and more importantly scope can be traced. Then the SQL Server IO queue depth policy is reviewed. With this, we can understand why the simple model used by SQL Server in trying for the best performance can overload the storage system causing periods of poor responsiveness. A more comprehensive model of hard disk, SSD, and storage system performance characteristics, along with input on whether OLTP or DW behavior is desired, can significantly improve SQL Server responsiveness while staying very close to the best achievable throughput performance.
Hard Disk IOPS Theory
The standard theory of disk drive random IO rate is that average access time is the sum of the rotational latency, the average seek time, the transfer time, command overhead, and propagation delays. For small block IO, only the first two items make significant contribution. The average rotational latency for a 15K disk drive is 2ms, and the typical 15K 3.5in cited average seek time is 3.4ms, plus others for the average total access time of around 5.5ms. Then with the two key qualifiers:
- 1) for random accesses to data distributed over the entire disk
- 2) at queue depth one,one IO issued, the next issued after the completion of the predecessor
A 15K disk should support 180 IOPS. Too many people neglect to mention the two key qualifiers.
Short Stroke Effect
The accesses are to data within a range of disk cylinders, then the average seek time is less, and IOPS should be higher.
Hard Disk Random Read IOPS versus Queue Depth
Now consider the implication of the second qualifier, random read IOPS versus queue depth. At higher queue depth, meaning multiple IO issued to the disk, and even if there are multiple outstanding IO, the controller on the disk drive itself can reorder the IO, with net effect of reducing time between each IO to increase the IOPS at the expense of higher latency for each IO.
There is a small gain at queue depth 2, perhaps to 200 IOPS, and larger gains to queue depth 4 at 240 IOPS, and about 40-50 IOPS with each doubling of queue depth to 32, with small gain to queue depth 64. (the disk drives of 2005 has a 64-deep task queue, it is now 128-deep.) With each doubling of the queue depth, the latency almost doubles.
The figures below show both the short-effect and queue depth impact on IOPS and access latency.
IOPS versus queue depth for various disk space utilizations
Latency versus queue depth for various disk space utilizations
Individually, both effects improve hard disk performance, but the two combined produce more dramatic benefits. In online transaction processing, response time and hence disk IO latency is as important as throughput performance. Hence the common rule adopted was to keep (time-)averaged queue depth below 2 per disk (disregarding transient spikes) for online transaction processing. (This rule was popular in the days of 5400 and 7200RPM drives).
In batch processing, where there is not a live person waiting for each transaction to complete, the strategy is to drive queue depth higher for the improved throughput performance. In DW/DSS, all such rules go out the window. If there was unused throughput not being utilized, then that was throughput squandered. Note the importance of the qualifers, and yet people feel that it is acceptable to state the rule of queue depth per disk below 2 without qualification.
RAID Controllers and Disk Arrays
In the earlier days, we had just a bunch of disks (JBOD). The major RDBMS handled this situation by supporting multiple files, and file groups as well for each database. And then there was light (sorry, wrong book) RAID and RAID controllers. And people saw that RAID was good, having fewer "disks" to manage at the operating system and database level. Now a disk array appears to the operating system as a single disk and performance counters were usually read from the operating system, not the storage system.
The rule of queue depth 2 per disk (and accompanying qualifiers) does not translate directly to the queue depth determined from the operating system performance counter. So it became popular to cite a latency rule the data should not be higher than 10-20ms roughly, corresponding to the rule of queue depth 2 per disk, having long since forgotten that there were additional qualifiers.
In general, data access latency below 10 ms was usually an indication that transaction response time should be good. Latency in the range 10-20ms corresponds to acceptable response time. Latency over 20ms corresponds to a heavily loaded disk system. More importantly, any temporary surge would push the disk IO into the higher queue depth range with sharp spikes in repsonse times. So even if average transaction response time is deem acceptable, there could be a noticeable distribution tail experiencing very poor response.
RAID Group Random Read IOPS versus Queue Depth
So what should be the IO characteristics of a set of disks in a RAID group. If a disk read IO were issued one at a time, i.e., queue depth 1, then one disk in the group would get the IO. The other disks would be idle, with 180 IOPS from the RAID group. At queue depth 2, it is likely the two IO will go to different disks, so two disks will operate at queue depth 1, with the remaining disks idle. Only when the average queue depth is one per disk in the group will IOPS reach 180 per 15K disk, with some distribution probability adjustments.
For queue depth at some whole integer multiple of one per disk, the expected IOPS per disk is similar to the single disk IOPS versus queue depth. The important point is that at queue depth 1, a given thread will experience single disk IOPS, not disk group IOPS.
Log Write Latency
In the old direct-attach storage days, the only advice was to have a dedicated RAID 1 disk pair for each high transaction volume database log. It was rarely mentioned, but the pure sequential small block log write could achieve latency on the order of 0.3ms, and around 3000-5000 IOPS.
SAN vendors frequently suggest not bothering with dedicated physical disks for each high transaction volume log, every thing will be fine, trust them. Even if do provided dedicated disks, and perhaps even a dedicated service processor, the SAN still could not achieve a very low log write latency. As SAN systems were pervasive, Micrsoft changed SQL Server to allow more log writes to be in flight (32 for 64-bit SQL Server and xxKB?).
RAID Small Block Random Write
RAID level write overhead is a subject covered elsewhere and is not discussed here. People like to cite RAID 5 and 10 rules without qualification, but the rule commonly cited only applies to small block random writes. On a non-caching controller, we would expect write IO to have similar IOPS characteristics as read IO, adjusted for RAID level overhead, both theoretical and controller specific.
Caching RAID Controllers and Read IO
In other discussions, I have explained why read cache is counter-productive. In essence, the database engine itself is a data cache that is much closer and less expensive to access than cache on the storage controller. Next, a properly configured system, the database engine should have much larger buffer cache than on the storage system. It is very unlikely that anything in the storage controller cache will be accessed again. Finally, the overhead of read-caching is significant in a storage configured for high IOPS performance. Read-caching on the storage controllers incurs overhead for blocks that will almost never be accessed.
Do you write weekly TPS reports at work that never get read?
Put this in your report just to see who actually comments on it, with the correct TPS cover sheet of course!
Read caching is usually disabled in TPC benchmark systems for the reasons just cited. One reputable source stated that a small 2MB (not GB!) read cache per LUN to enable read-ahead is the preferred strategy. I recall someone asserting that a specific server system with 48GB memory showed IO performance improvement when SAN cache was increased from 80GB to 120GB. What this fact proves could be argued along more than one angle.
Caching RAID Controllers and Write IO
And now to the random write IO performance characteristics on a caching RAID controller. We have mostly skirted around write IO until now. There is a reason for this. Below is the small block random write IOPS pattern with a caching RAID controller.
When SQL Server or the operating systems sends one or more write IO to the RAID controller, the IO is written to the controller cache, and a completion signal is sent back to the source. The next IO is then sent. There is nearly no variation in IOPS versus queue depth. Latency is very low until the write volume reaches the IOPS limit. Beyond this, the write cache fills up, and latency becomes high, until the source throttle back on write IO.
Large Storage Systems with Large RAID Groups
As systems became more powerful, compute performance growing 40% per year, and hard disk performance averaging less than 10% per year (7.2K to 10K to 15K, then nothing until SSD), it was necessary to build storage systems with very large number of disks. During this period, SAN systems became pervasive, especially for large storage systems.
It was soon noticed that the SAN could not deliver anywhere near the expected IOPS based on the number of disks. One cause was traced to the FC HBA default queue depth setting of 32 (per adapter, now per target?). The reasoning behind the default was the SAN vendor doctrine of shared storage. To prevent one host from generating too much load, the IO was throttled with the HBA queue depth setting so all hosts could get a share of IO volume.
If one were to measure IOPS versus the HBA Queue Depth setting on LUNs comprised of many disks, one should find that IOPS performance increases with the higher Queue Depth all the way up to the maximum. The behavior in fact was already described in the RAID Group IOPS versus queue depth section above.
FC HBA Queue Depth Setting
Note that in the early days, the HBA queue depth setting applied to either the HBA or each HBA FC port. On the more recent Emulex FC HBA, the default is now queue depth 32 per LUN, with the option of per LUN or for the entire target. (QLogic uses the term Execution Throttle?) I suppose that in one of the few TPC-C benchmark reports with SAN storage system, a reference was made to changing the HBA queue depth from 32 to 254 without addition explanation.
The TPC-C systems all have very large disk arrays. Of course it is proper to dial HBA queue depth to maximum. Eventually, this was noted and the recommendation to change HBA queue depth from 32 to 255 made its way into various Microsoft documents. The ones I have seen gave no explanation to the underlying cause and effect, and supporting measurements.
So now what about a SAN with a small disk array? Especially if the queue depth setting is per LUN, and each LUN is comprised of 4 disks? Should the queue depth setting be increased to 254? I suggest following my guidelines for IOPS and latency versus queue depth with adjustments for the number of disks per LUN, all weighed against whether the objective is OLTP responsiveness or batch/DSS pure throughput.
Sequential IO
All of the IOPS versus queue depth discussed so far does not pertain to sequential disk IO. For large block sequential IO, a queue depth of 1 per LUN seems to be sufficient to generate maximum IO bandwidth. I am inclined to think that the theory is the IO size x Queue depth should be larger than the number of disks in the array x the RAID stripe size. The reasoning is that each disk will have IO to process, but I have not verified this hypothesis.
Increasing queue depth beyond the minimum necessary to achieve near maximum bandwidth will only server to increase latency. In a mixed small and large block IO load, perhaps a higher queue depth on the large block might improve the large block portion of throughput, but this has not been studied. In a SAN, there are some suggestions that a higher queue depth may be necessary to reach maximum sequential bandwidth, along with multiple LUNs per RAID group. A satisfactory plausible explanation has not been provided.
SQL Server IO Characteristics
There are several Microsoft documents that describe SQL Server IO in detail. A selection include the CSS SQL Server Engineers blog How It Works: Bob Dorr's SQL Server I/O Presentation,
and Microsoft-SQL-Server-IO-Internals slidedeck,
KB (917047) Microsoft SQL Server I/O subsystem requirements for the tempdb database.
The SQL Server Technical Article by Emily Wilson, Mike Ruthruff, Thomas Kejser Analyzing I/O Characteristics and Sizing Storage Systems for SQL Server Database Applications.
SQL Server 2008R2 Books Online has the following under Buffer Management, and Craig Freedman's Random Prefetching discussion with asynchronous IO.
In brief, in a table scan operation, SQL Server will issue IO to try to stay 1024 pages ahead of the scan with Enterprise Edition and 128 pages in Standard Edition. In the random-like 8KB accesses for key lookup and loop join inner source rows, SQL Server switches from synchronous to asynchronous IO at estimate 25 rows (a source other than me is needed for this).
High Queue Depth SQL on SSD
In a table scan query without lock hints, a Read Queue depth of over 1300 was observed. The IO size was 8KB, read latency went above 200ms even on SSD storage. With table lock, the IO size was around 500K (probably mostly 512K plus a few small block IO), disk latency was less than 50ms and queue depth was around 40.
For key lookup 8KB IO, queue depth was around 160 with 7ms latency. With HDD storage and 20 or so disk, queue depth 160 works out to 8 per disk, a reasonable number for good IO but not excessive latency.
On SSD, any queue depth more than 1-2 per LUN should achieve maximum IOPS and latency is below 0.1ms. Marc Bevand on the Zorinaq's blog pointed out that IOPS at queue depth 1 is essentially measure of latency. Suppose a SSD is rated at 100us latency, and 30K IOPS for 8KB IO (30K x 8KB = 240MB). Then the queue depth 1 IOPS should be 10K (1,000,000 us/s / 100us). So the theory is that queue depth 3 should reach 30K IOPS. Keeping queue depth at the bare minimum necessary for maximum IOPS does not degrade the query generating the huge IOPS, while providing good responsiveness for other concurrent queries.
High Write Latency in Creating Clustered Indexes
The Create Clustered Index command was observed to generate the highest write latency. Queue depth was 500, latency was 600ms+ and IO size averaging 100KB. Now creating large clustered indexes does not occur during the working day. Still, there is no point issuing so many outstanding IO. With either a caching RAID controller or SSD, write IO bandwidth can be saturated even at low queue depth. Driving IO so high only makes the system highly unresponsive for any function requiring IO to the affected drives.
IO Queue Depth Summary
We have explored in brief the key components that are impacted by IO queue depth. The following are the main points to consider.
1) Random read to hard disks have the most to gain from having sufficiently high queue depth.
2) Sequential IO does (should?) not need high queue depth operation beyond what is necessary to keep all disks busy. Staying 1024 pages ahead of the scan seems reasonable for large block IO, but I would not flood the queue with 8KB IO. This strategy should be adjusted based on IO size, or perhaps we should why is it necessary it issue 8K IO if the table is not fragmented.
3) Random write to RAID controller with write cache do not need deep queue depth for best performance.
4) SSD do not need deep queue for maximum performance
A) SQL Server appears to follow a set piece strategy on IO queue depth, dependent only on Edition, Standard or Enterprise. The number of disks behind each LUN is not considered, the useage model (OLTP versus DW/DSS) is not considered.
The proposed strategy is as follows. 1) Sequential IO should not try for 1024 pages ahead if IO size is 8KB. 2) Writes IO to controllers with write cache should use lower queue depth. 3) It is important to adjust random read IO queue depth based on the type of storage, HDD or SSD. 4) It is helpful to adjust HDD random read IO by usage model, OLTP or DW/DSS. 4) It is helpful to adjust HDD random read IO based on disks per LUN.
Some of the above could be detected automatically. Others might require a parameter setting, sp_configure seems appropriate. As much as we would like a universal answer (42) independent of user action, having adjustments could greatly improve the usability of SQL Server. Today a number of operations can render the SQL Server system completely unresponsive for the duration due to disk queue flooding, even with SSD storage. Only very large perfectly configured storage systems would have immunity.
PS - The performance data that I cited could be broadened in scope prior to making the change in SQL Server IO queue depth strategy. I am only providing enough to support my arguments, and Microsoft probably has more resources than I to put into this.
More references
Joe Sack's SQL Server Blog SQL Server and HBA Queue Depth Mashup also references Christian Boltons blogcast.
(Update 2010-24) Tempdb
SQL Server is also fairly aggressive on tempdb IO queue depth, probably having been tuned on a high-disk count storage system (as critical database server storage systems should be), and so issues IO at high queue depth. On all but the most poweful storage systems, this results in high disk latencies, and even then still might have somewhat high disk latency.
Many people want to interpret the best practice guideline of less than 10-20ms as a law of xxx engraved in stone, rather than as just one point in the queue depth - latency relation. Hence the (incorrect) conclusion drawn is that the tempdb is overloaded, instead of (the more correct interpretation that) SQL Server just wanting the load up the queue for best throughput.
Synchronous and Asynchronous IO
I mentioned above about the cut over from synchronous IO to asynchronous at 25 rows or there abouts. Consider a transaction processing server. If a transactional query involving 20 row also generates 20 read IO for the key lookup or loop join. The IO are issued serially, the next IO is not issued until the previous completes, 5ms with a properly configured low-queue depth storage system. The complete query will take 100ms, 20 row x 5ms/row (the query does not wait from data page writes). Even if there are many disks able to handle simultaneous IO, only 1 IO is issued by the query at a time.
Now consier a report that generates 100 or 1000's of IO. The IO is issued asynchronously. A batch of 100 IO could complete in 10-20ms, but disk latency is also driven up to 10-20ms. The effect is that reporting queries are prioritized with high queue disk IO, while transaction queries are deprioritized by both serial IO, and the latency for each IO is now longer.
This is another reason why it is important to have fine grain controller over IO queue depth strategy.
See: http://www.qdpma.com/Storage/IoQueueDepthStrategy.html for updates.