A contribution for T-SQL Tuesday #004, hosted by the illustrious Mike Walsh!
In the past few weeks I had some correspondence with Kendal Van Dyke leading up to his SQL Saturday presentation on SSDs, and he got me fired up to share a little of my team’s experience with a real implementation. Over the past four months or so, our IT group at work has deployed a new disk array incorporating enterprise-class fibre channel SSDs for database functions, and I am happy to report it’s been a success. I do have a few nuggets that I learned along the way, which might be useful if you are contemplating a move like this. Some of these tips I’ve talked about before, but they are doubly true in the rarefied (read: expensive) air surrounding SAN SSD’s. You really don’t want to sink this kind of money and not get the maximum possible return!
In the Fall our organization faced a large data migration to a second datacenter, to enhance our disaster recovery and business continuity position. This entailed the purchase of some new infrastructure, and for various reasons the group decided a storage array including SSD’s made strategic sense for the next several years. I had had one eye on SSD’s when they started becoming cost-effective enough to make it into the smaller scale TPC benchmarks – indicating the cost/performance ratio was starting to make sense. Our SAN admin had also been watching them too, for all kinds of reasons, and I think I might have seen her drooling a little when this notion came up.
Anyway, we were able to obtain an EMC Symmetrix-class array with three tiers of storage, from SSD’s at the high end to 15k Fibre Channel disks to SATA disks. The array was incorporated into our existing SAN, and attached to a combination of new and repurposed servers, to create production environments for both Oracle and SQL Server. We tested the complete environment as pre-production for several weeks, mainly to validate the IO performance of this new architecture relative to what we had been running, and then have progressively cut production systems over to the equipment.
I can say that it’s performed really well – the Oracle and SQL Server systems that have SSD storage on the new array definitely saw a net boost in performance. Most of the time, our Oracle DBA’s report that they see about a 2/3 increase in performance, with some exceptions. I am running a data warehouse function on the SSD’s, and I can say we shaved perhaps 25% off of our ETL/load window by moving to the new hardware. At the same time, I would like to caution that, though it might seem like these things are supernatural in terms of performance, they are not a make-all-your-problems-disappear silver bullet. They are dramatically faster for some particular tasks, but not for everything.
With help from EMC, EMC professional services consultants and our internal staff, we went through this overall process to design and deploy:
- We came up with a general size and performance profile of all the existing systems we planned to move and/or consolidate onto the new array. Importantly, we got the data from real performance stats collected off our existing servers and arrays, so we could describe to EMC what the real IO requirements looked like, as well as the capacity needed. IO usually drives the cost of storage more than capacity in the database world, and we had to take pains to make sure our execs and other parties didn’t get sidetracked into looking only at capacity (so many TB) instead of performance.
- EMC came back with a general design showing how many of what type of disks would be in the array. That design set most of our budget parameters :-). The new array ended up being sort of mid-range, with 46 200 GB STEC Fibre Channel SSD’s, 124 450 GB 15k RPM FC Disks, and 36 1TB 7200 RPM SATA disks.
- While the array was in the manufacturing and testing queue at EMC, we did a deep dive into the specifics of how to arrange the data on all the available devices. We took about a month’s worth of performance counters from all our existing storage and mashed that into aggregate IO targets for the new servers on the new array. EMC professional services then took that data and proposed a design showing exactly how the LUNs and files would physically lay out. We did a couple of back-and-forth iterations before settling on the final design.
- The basic guiding principle around the SSDs was to use them for the files that would benefit most: the read-intensive ones. We literally found the read-intensive files or disks on our existing systems from the IO counters and placed those, only, on the SSD tier. The rest of the files are on conventional 15K FC disks. All the SSD’s are set up as RAID5 (reads being the priority) and all the 15K disks are mirrored pairs, for performance.
- Then The Beast, as I like to call it, was delivered, put together and health-checked by EMC.
- We moved our pre-production data to the new system, and ran automated performance tests for about a month, just so we knew how to expect this thing to act.
- Finally, we moved production systems over one at a time.
I learned a lot going through this process – and it was really fun, to boot. Here are some basic ideas to think about if you are looking at SSDs in SAN:
- If you don’t have a system for gathering IO performance counters on a rolling basis I strongly recommend, as I have before, getting or building one. This capability is vital to right-size your infrastructure and spend money wisely. I think it’s most cost-effective just to buy a monitoring system (and put your valuable working hours into something business-specific that can’t be purchased), but if you can’t afford a monitor, it’s not very hard to build a basic counter-collection system. Worst case, it is possible to log this data with perfmon, but that takes extra effort and can push your schedule out, as you have to gather the data when needed rather than just having it already in a repository.
- Resist the idea that SSD’s are just so miraculously fast that it’ll be like RAM, or make all your performance worries disappear. In fact they solve mainly one problem – an important problem, but one only. They avoid the penalty of random access to disks, which requires mechanical moving parts to get to some region of a disk, then pass the data on the disk past the disk head. If you suffer from that problem, they will help – a lot. If not, then you should dig deeper and see if SSD’s will really be an advantage for your workload. Writes might not be that much faster, nor sequential reads, depending on the details. You might be best off with a hybrid solution that uses two kinds of storage, and it’s worth considering a design like that. Be realistic about their true performance characteristics and set expectations appropriately.
- You still have to RAID the devices together. SSD’s can fail just as disks can fail, and if you are running a production system then you will want the redundancy. That obviously has a cost impact. On the other hand, the arithmetic might go like this: if you need, say, 48 disks in RAID 1+0 to get good performance for an existing server (and those disks are probably mostly empty) but SSD’s could get you the same performance (imaginary numbers here) with only 6 or 8 devices in RAID5, then maybe the budget starts to look attractive.
Still, it was really fun last week when I spied our little data warehouse doing over 7,900 IOPS! Sweet!