This post is my contribution to Adam Machanic's T-SQL Tuesday #004, hosted this time by Mike Walsh.
Being applicative DBA, I usually don't take part in discussions which storage to buy or how to configure it. My interaction with IO is usually via PerfMon. When somebody calls me asking why everything is suddenly so slow on database server, "disk queue length" or "average seconds per transfer" counters provide an overwhelming answer in 60-70% of such cases. Sometimes it can be fixed by adding indexes to heavy tables. Not always though. Many companies, especially young and fast growing ones, just can't properly measure IO requirements. They are usually aware of how much disk space they need. Question about IOpS is met with puzzled looks.
Recently I had to build a new system capable of 40 thousand random IOpS at the peak load. In such IO is by far the most expensive part. Although space requirements were moderate, just several Terabytes, 4-5 SATA disks of 1TB won't be enough. Far from it. Modern disks with 15000 RPM rotation speed can perform ~170 random IO requests per second. So, 40K/170 = 235 disks in several RAID0 arrays. Enough?
Not so fast. What about high availability requirements? Can we afford downtime if one of the disks is gone? No, we can't. OK, so RAID0 is out of scope. It leaves us with RAID10(striping plus mirroring) and different parity options: RAID3 through RAID6. I'll consider RAID5 as it is the most popular one. For RAID10 calculation is simple: 235 disk in RAID0 we've seen earlier x2 for mirror disks = 470 disks. That's quite a lot! So let's consider less expensive (at least sounds so) option - RAID5. Standard RAID5 array contains 5 disks. Unlike RAID3 and RAID4 it has no dedicated disk for parity (instead parity is written in a round robbin on all disks), so we can assume that nothing is wasted. We're still running with the same number - 235 disks, just a little less overall space due to parity. But space isn't our concern here. So, can we finally issue order to the storage vendor?
Unfortunately we're still not there. Good time to ask about usage pattern - what do those 40K IOpS contain? ~40% are SELECT operations (e.g. 16K operations per second), 20% are INSERT (8K per second) and 40% are UPDATE (16K per second). We also know, that rows that undergo UPDATE are those that are SELECTed before. And we know that all operations use clustered index seek. Let's perform simple calculation. We can safely assume that upper levels of clustered index are in cache, so we have 1 physical read per SELECT. UPDATE performs read + write but we know that row has been read earlier, so we can assume that read is also from cache. So, price of UPDATE is 1 IO. Great! 1 IO per operation regardless of its type - we're still with 235 disks. At this point some of you would say: "Stop wasting our time, let's buy that damn storage!"
OK, after week or two we install storage, configure 235 disks in arrays of 5 plus a little more for hot spare, Windows, SQL Server transaction log etc. Now we open PerfMon and start load test expecting to see 40K IOpS at maximum load. And we would be greatly disappointed. Actually, having CV updated is a good thing at this point - some organizations that I know won't employ DBA that wastes them around 100K bucks. So, where is the catch?
The catch is an UPDATE. Contrary to the wide belief, PerfMon doesn't show physical IO. Yes, we know that drives D and E reside on the same "physical disk". But is it really physical? No, it is just another layer that can conceal beneath anything from directly attached disk to RAID50 array. PerfMon shows IO "on host". So we probably see 1 IO for UPDATE. But storage actually does something else - and it is really physical layer now. First, it reads both data and parity bytes. Data byte is in cache, so we have 1 read instead of 2; but remember, we haven't updated anything yet (you probably ask about parity byte - why it isn't in cache? because for SELECT operations only data bytes are read). Next, it writes both data and parity. So, we have 3 IO operations per single UPDATE. Going back to overall numbers, 16K updates are actually 48K IO operations when working on RAID5. This trap is known as "RAID5 write penalty". That's not the only RAID5 issue. We also have ~2/3 performance degradation on single disks failure - because in order to reconstruct corrupted bit we have to read all 4 bits on other disks including parity -> 4 times more IO. Considering all this, we should carefully evaluate whether RAID5 is really more cost effective than RAID10. The answer is positive for read-intensive environments where we do not have write penalty. For write-intensive environments TCO of RAID10 can be better, especially taking into account possible corruptions.
Last but not least. I know that some statements here are oversimplified. I didn't take into account page splits or influence of storage system cache; I didn't consider SSD-based or mixed storage. Although I used real system I've built as example, here it is more of a mathematical model. Real world is always more complicated. But I don't want my fellow DBAs to fall asleep reading this post :-).