Storage has changed dramatically over the last three years driven by SSD developments.
Most of the key components necessary for a powerful storage system are available
and the cost is highly favorable for direct placement of data files.
Some additional infrastructure elements could greatly enhance the flexibility of storage systems with SSDs. There is still some discussion on whether SSD should interface directly to PCI-E
or continue using the SAS/SATA interfaces originally designed for hard disks.
New products coming this year include Express Bay, an ecosystem of connectors allowing
both PCI-E and SAS/SATA to co-exist until a clear direction is established.
Also expected in the coming year are PCI-E SSDs based on the NVM Express interface.
The Intel Xeon E5 processors, codename Sandy Bridge-EP, have 40 PCI-E gen 3 lanes on each processor socket. Even though PCI-E gen 3 is 8GT/s, a change in the encoding means that the usable bandwidth is double that of PCI-E gen2 at 5GT/s.
The net realizable bandwidth of a PCI-E gen 3 x8 slot is 6.4GB/s versus 3.2GB/s for gen 2.
The unfortunate aspect is that the major server vendors all implement a mix of x16 and x4 slots,
while the HBA vendors seem to be concentrating on products for PCI-E x8.
Only Supermicro has a system with 10
8 PCI-E gen 3 x8 slots.
Could a vendor put 2 HBA/RAID Controllers designed for x8 onto a single card for a x16 slot?
Perhaps the new Express Bay form factor will have some means to use x16 slots?
Another disappointment is that the 4-socket Xeon E5-46xx systems only connect half of the PCI-E lanes. This might be because the base system configuration is 2-socket populated.
If a full set of slots are provided, there would no connection to half of the slots unless all four sockets are populated. But this is also an issue on the 2-socket systems if only 1 socket is populated.
For the most part, I will discuss direct-attach storage configuration,
as we can pick and choose among the latest components available.
Technically, direct-attach with SAS can support a 2-node cluster,
but few system vendors promote this configuration.
Dell sells the MD3200 as direct-attach storage supporting 4 hosts in a cluster (or not),
but technically it is a SAN that just happens to use SAS interfaces on both the front-end and back-end.
The objective in the baseline storage configuration below is to achieve
very high IO bandwidth even in the low capacity configuration.
Of course it will also have very high IOPS capability because the main elements are SSD.
My recommended storage system has both SSD and HDD in each IO channel.
This intent is to place the main databases on SSD and use the HDD for backups and for restore verification. For an Inmon style data warehouse, the HDD might also be used for older data.
The reason for having both SSD and HDD on each IO channel is to take advantage of simultaneous bi-directional IO. On a database backup, the IO systems reads from SSD, and simultaneously writes to HDD.
4 RAID Controllers, 4GB/s per controller, 16GB/s total IO bandwidth
4 Disk enclosures (yes, I am showing 8 enclosures in the diagram above)
4 x 16 = 64 SSD
4 x 8 = 32 (10K) HDD
The standard 2U enclosure has 24 x 15mm bays. My preference is for a 16 SSD and 8 HDD mix.
With the small capacity 100GB SSD, there will be 1.6TB per enclosure and 6.4TB over 4 enclosures before RAID. In 7+1 RAID 5 groups, there will be 5.6TB net capacity, and 4.8TB in 3+1 RG across 4 units. The goal is 4GB/s per controller because the SAS infrastructure is still 6Gbps,
supporting 2.2GB/s on each x4 port. With 16 SSDs per controller, each SSD needs to support 250MB/s. Most of the recent enterprise class SSDs are rated for well over 300MB/s per unit,
allowing for a large degree of excess capability.
Another option is to configure 12 SSDs per controller, expecting each SSD to support 333MB/s.
The cost structure for the above is as follows:
RAID controller $1K
2U Enclosure $3K
Intel SSD DC 3700 100GB SSD $235 x 16 = $3760, RAID 5 7+1: 1.6TB
Seagate Savvio 600GB 10K HDD $400
x 8 = $3200.
This works out to $11K per unit or $44K for the set of 4.
The set of 16 x 100GB contributes $3760. For the 800GB SSD, the R5 7+1 capacity is 44.8TB at cost $148K.
At maximum expansion of 4 enclosures per RAID controller, capacity is 170TB at cost is $590K. Of course at this level, I would elect a system with more PCI-E slots for greater IO bandwidth. Another option is a RAID controller with 4 x4 SAS ports.
Unfortunately none of these have 4 external ports.
While the Intel SSD DC 3700 drew reviews for pioneering consistency of IO performance
instead over peak performance, it is only available in SATA interface.
Crucial has announced the P410m with similar specifications but with SAS interface. This is listed on the Micron website as in production, but probably only to OEM customers. There are other enterprise grade high endureance MLC SSDs with SAS interface as well.
Note: I do not recommend anything less than 10K HDD even to support database backups.
The 10K HDDs are not particularly expensive as direct-attach components ($400 for the 600GB model). Only SAN vendors sell $400 HDDs for $2K or more.
SAS 12Gbps Enclosures
Disk enclosures supporting SAS at 12Gbps might become available as early as this year.
Each of the 12Gbps SAS x4 uplink and down link ports would then support 4GB/s.
The RAID controller (HBA) can support 6GB/s+ in a PCI-E gen 3 x8.
The system with 4 RAID controllers could then deliver 24GB/s instead of 16GB/s.
At 16 SSDs per controller, this would require 400MB/s per SSD. While SSDs are rated as high as 550MB/s, achieving the full aggregate bandwidth in an array is not necessarily practical. So 400MB/s per SSD in an array is a more reasonable expectation. Also, enterprise SAS SSDs
many only be rated to 400MB/s.
We should not need 12Gbps SAS SSDs or HDDs in the near future (but 8 NAND channels is a good match for a 1.1GB/s interface). The internal wires in the enclosure connect through a SAS expander.
The IO from each device bay can signal at 6Gbps, then uplink to the HBA at 12Gbps,
assuming that packets are buffered on the expander.
The standard 2U disk enclosure today supports 24 or 25 2.5in (SFF) bays, with 15mm thickness.
This is the dimension of an enterprise class 10K or 15K HDD with up to 3 platters.
The older full size notebook used a 9mm HDD supporting 2 platters.
Thin notebooks used a 7mm HDD restricted to a single platter.
There is no particular reason for an SSD to be more than 7mm.
It would be better if the new 12Gbps SAS enclosures support more than 24 bays.
My preference is for 16 x 15mm and 16 x 7mm bays. The key is to match the practically realizable aggregate bandwidth of SSDs to the uplink with some degree of excess.
Personally, I would like to discard the SSD case to further reduce thickness.
Another option is to employ the NGFF, perhaps a 1U stick, at 5mm or less.
There could be 2 rows of 24 for SSD, and the 16 x 15mm bays.
I believe that the all-SSD idea is misguided. SSDs are wonderful, but HDD still have an important role. One example is having the HDDs available for backup and restores.
I want local HDD for backups because so very few people know how to configure for multiple parallel 10GbE network transmission, not to mention IO bandwidth on the backup system.
A database backup that has not been actually verified to restore (with recovery) is a potentially useless backup.
Having HDDs for backup and restore verification preserves the write endurance on the SSD.
This allows the use of high-endurance MLC instead of SLC.
In some cases, it might even be possible to use consumer grade MLC if and only if the database organization maintenance strategy is architected to minimize wear on the SSD.
Some of the discussion on PCI-E versus SATA/SAS interface for the NAND/Flash controller
incorrectly focuses on the bandwidth of a single 6Gbps lane versus 4 or 8 lanes on PCI-E.
It is correct that PCI-E was designed to distribute traffic over multiple lanes
and that hard drives were never expected to exceed to bandwidth of a single lane
at the contemporary SATA/SAS signaling rate.
The transmission delay across an extra silicon trip,
from NAND controller with SATA interface to a SATA to PCI-E bridge chip,
on the order of 50ns,
is inconsequential compare this with the 25-50µsec access time of NAND.
The more relevant matter is matching NAND bandwidth to the upstream bandwidth.
All (or almost all?) the SATA interface flash controllers have 8 NAND channels.
Back when SATA was 3Gbps and NAND was 40-50MB/s, 8 channel s to the 280MB/s net bandwidth of SATA 3G was a good match.
About the time SATA moved to 6Gbps, NAND at 100-133MB/s became available so 8 channels was still a good choice.
NAND is now at 200 and 333MB/s, while SATA is still 6Gpbs.
The nature of silicon product cost structure is such that there is only minor cost reduction in building a 4 channel flash controller.
The 8 channel controller only requires 256-pin package.
The PCI-E flash controllers have been designed with 32 NAND channels.
The IDT 32-channel controller has 1517 pins,
which is not excessively difficult or expensive for a high-end server product.
Note the Intel Xeon processors are 2011-pins.
As noted earlier a PCI-E gen 3 x8 port supports 6.4GB/s.
Over 32 channels, each channel needs to provide 200MB/s.
The new 333MB/s NAND is probably a better fit to sustain the full PCI-E gen 3 x8 bandwidth after RAID (now RAIN because disks are replaced by NAND).
Based on 64Gbit die, and 8 die per package, a package has 64GB raw capacity. The 32-channel PCI-E with 1 package per channel would have 2TB raw capacity (net capacity with 0.78 for over-provisioning and 0.875 for RAIN would be 1400GB) versus 512GB on an 8-channel SATA/SAS SSD.
The IDT document states capacity is 4TB raw for their 32-channel controllers, so perhaps it allows 2 packages per channel? The Micron datasheet mentions 32-channel and 64 placements.
As is today, a PCI-E SSD can achieve maximum bandwidth at lower NAND capacity and in a more compact form factor than with SAS SSD.
On the other hand, SAS infrastructure provides flexible expansion.
Capacity can be increased without replacing existing devices.
Some systems support hot swap PCI-E slots.
However the orientation of the connector in the system chassis makes this a complicated matter.
The implications are that PCI-E slot SSDs are highly suitable for high density requirements with limited expansion needs. One database server example is tempdb on SSD.
The new generation of PCI-E SSDs may employ the NVMe interface standard.
There is a standard driver for Windows and other operating systems,
which will later be incorporated into to the OS distribution media
allowing boot from an NVMe device, as with SATA devices today.
This is mostly a client side feature.
For the server side, the NVMe driver is designed for massive bandwidth and IOPS.
There can be up to 64K queues, 64K commands (outstanding IO requests?).
The driver is designed for IO to be both super-efficient in cpu-cycles
and scalable on NUMA systems with very many processor cores.
To promote the growth of SSD without betting on which interface,
the Express Bay standard defines a connector that can support both PCI-E and SATA or SAS.
Some Dell servers today support PCI-E to SSDs in the 2.5in HDD form factor (SFF),
but I am not sure if this is Express Bay.
This form factor will allow PCI-E devices to hot-swapped with the same ease as SAS devices today.
As mentioned earlier, the PCI-E slot arrangement in server systems does not facilitate hot-add,
even if it is supported.
Existing PCI-E SSDs also do not provide a mechanism for capacity expansion,
aside from adding a card to an empty slot or replacing an existing card.
Of course, there are PCI-E switches, just like the SAS expanders.
A 64 lane PCI-E switch could connect 5 x8 PCI-E devices over a x8 upstream link.
Other possibilities is a x16 link supporting 12.8GB/s to host with 4 ports for SSD,
or 8 x4 ports to SSD for finer grain expansion.
It may also be possible to support multiple hosts, as in a cluster storage arrangement?
Below is a representation of a typical configuration sold to customers by the SAN vendor.
I am not joking in that it is common to find 2 ports FC or FCOE on each host.
The most astounding case was a SAN with 240 x 15K disks and 2 single port FC HBAs in the server.
Even though the storage system service processors had 4 FC port each (and the FC switches had 48-ports), only 1 on each SP was connected.
Obviously the storage engineer understood single component and path failure fault-tolerant design. It was just too bad he built a fault-tolerant garden hose system when a fire hose was needed.
As I understand it, what happened was the SAN engineer asked how much space is needed for the databases accounting for growth, and then created one volume for it.
The Windows OS does support multi-path IO.
Originally the storage vendor provided the MPIO driver, but now it is managed by Microsoft.
Apparently it was not considered that even with MPIO, all IO for a single volume has a primary path. The secondary path is only used when the primary is not available.
High-bandwidth SAN Configuration
A proper SAN configuration for both OLTP and DW database servers is shown below.
Traditionally, a transaction processing database generates small block random IO (2KB in the old days, 8KB since SQL Server 7). As it was difficult to get 10K IOPS (x8KB = 80MB/s), it was thought that IO bandwidth was not a requirement.
This was 15 years ago. Apparently the SAN vendors read documents from this period, but not more recently, hence their tunnel vision on IOPS, ignoring bandwidth.
For the last 10 or more years, people have been running large queries on the OLTP system. I have noticed report queries that saturate the storage IO channels could essentially shutdown transaction processing. This is because the report query generates asynchronous IO at high queue depth, while the transaction queries issue synchronous IO at queue depth 1. And the report may escalate to a table lock (or it may use nolock). Furthermore, it is desirable to be able to backup and restore the transaction database quickly. This means bandwidth.
Note that the system below shows 8Gbps FC, not 10Gbps FCOE.
A single 10Gbps FCOE may have more bandwidth than a single 8Gbps FC port.
But no serious storage system will have less than 4 or even 8 ports.
Apparently FCOE currently does not scale well over multiple ports, due to the overhead of handling Ethernet packets?
An Intel IDF 2012 topic mentions that this will be solved in the next generation.
The above diagram shows 8 x 8Gbps FC ports between host and storage system.
Each 8Gbps FC port can support 700MB/s for a system IO bandwidth target of 5.6GB/s.
An OLTP system that handles very high transaction volume may benefit from a dedicated HBA and
FC ports for log traffic. This would allow the log HBA to be configured for low latency,
and the data HBA to be configured with interrupt moderation and high throughput.
An alternate SAN configuration for SQL Server 2012 is shown below with local SSD for tempdb.
The write cache on a SAN must be mirrored for fault tolerance.
There is very little detail on the bandwidth capability of the link between controllers
(or SP) on SAN systems, beyond what can be deduced from the fact that the sustained write bandwidth is much lower than the read bandwidth.
So keeping tempdb off the SAN should preserve IO write bandwidth for traffic that
actually needs protection.
The number of volumes for data and temp should be some multiple of 8.
It would be nice to have 1 volume for each FC path.
However we do need to consider how SQL Server place extents over multiple files.
This favors RAID groups of 4 disks.
In a HDD storage system,
the objective for bandwidth is to simultaneously issue large block IO
to all data disks across all IO channels.
A 256K block could be sufficiently large to generate 100MB/s per disk (400 IOPS, not random).
If this were issued at low queue depth (2?),
then the storage system would not only generate high IO bandwidth
and still be perceptually responsive to other requests for small block IO.
For small block random IO, it is only necessary to distribute IO over all hard disks with reasonable uniformity.
The file layout strategy has two objectives.
One is to not overwhelm any single IO channel.
In direct-attach, this is not a proper as the smallest pipe is x4 SAS for 2GB/s.
In a SAN, even using 8Gbps FC, this is a concern as 8Gb FC can support only 700-760MB/s.
Although 10Gb FCoE seems to have higher bandwidth, this may not scale with the number of channels
as well as straight FC.
The new Intel Xeon E5 (Sandy-Bridge EP) processors may be able to scale 10Gb FCoE
with Data Direct IO (DDIO) - but this needs to be verified.
The second is to ensure IO goes to every disk in the RAID Group (or volume).
By default, SQL Server allocates a single 64K extent from each file
before round-robin allocating from the next file.
This might be the reason that many SAN systems generate only 10MB/s per disk (150 IOPS at 64K),
along with no read-ahead.
The -E startup flag instructs SQL to allocate up to 4 consecutive extents
before proceeding to the next file.
See James Rowland-Jones
Focus on Fast Track : Understanding the –E Startup Parameter
for more on this.
In a 4-disk RAID group with stripe-size 64K, a 256K IO to the file
would generate a 64K IO to each disk.
It would be necessary to rebuild indexes before this scheme takes effect.
Somewhere it was mentioned that it is important to build indexes with max degree of parallelism
limited to either 4 or 8. It might be in the
Microsoft Fast Track Data Warehouse Reference Architecture.
Start with version 4 for SQL Server 2012, and work backwards?
Trace Flag 1117 (-T1117) causes all files in a filegroup to grow together.
SQL Server 2008 Trace Flag -T 1117.
With SSD, the second may not be important as the SQL Server read-ahead strategy
(1024 pages?) should generate IO to all units.
On the hard disk, generate close-to-sequential IO was important.
On SSD, it is sufficient beneficial just to generate large block IO, with 64K being large.
The old concept of distributing IO over both devices and channels still apply.
The recent pricing of SSD is sufficiently low to warrant serious consideration ($2-3K/TB eMLC).
While there is more flexibility in SSD configuration,
it is still necessary to validate performance characteristics
with real queries to an actual database.
SQLIO or other synthetic tests are not sufficient.
If the SAN vendor advised in the configuration,
then chances are IO bandwidth will be not be good.
If anyone thinks I am being unfair to or overly critical of SAN vendors, do the following test.
Find the biggest table in your database, excluding LOB fields.
Then compute 8 (KB/page) * (physical reads + read-ahead reads)/(elapsed time in ms)
Is this closer to 700 MB/s or 4GB/s? What did your SAN vendor tell you?
I am also not fan of SSD caching or auto-tiering on critical databases, meaning the database that runs your business, that is managed by one or more full-time DBAs. In other applications, there may not be a way to segregate the placement of hot data differently from inactive data. In SQL Server, there are filegroups and partitioning. We have all the necessary means of isolating and placing hot data whereever we want it. SSD caching or auto-tiering will probably require SLC NAND. With active management using database controls, we should be able to use HET MLC or even MLC.
I stress the importance of analyzing the complete system and how it will be used instead of over-focusing on the components. There are criteria that might be of interest when there is only a single device or even single HBA. Today it is possible to over-configure the storage performance without unwarranted expense, and this is best accomplished by watching the big picture.
Adaptec reports that their Series 7 SAS RAID Controller (72405 - PCI-E gen 3 x8 on the upstream side and 6 x4 SAS 6Gpbs) using the PMC PM8015 controller can do 500K IOPS and 6.6GB/s.
I will keep this topic up to date on www.qdpma.com
related posts on storage:
If you are using SQL Server 2012 clustering on a SAN,
I do suggest placing tempdb on local SSD, making use of the new 2012 feature
that does not require tempdb to be on shared storage.
Keep in mind on the SAN, writes must be mirrored between two storage processors for fault recovery, and this is not a cheap thing to do.
We should plan redo whatever was using tempdb at the time.