THE SQL Server Blog Spot on the Web


Joe Chang

Enterprise Storage Systems - EMC VMAX

I generally do not get involved in high-end SAN systems. It is almost impossible to find meaningful information on the hardware architecture from the vendor. And it is just as impossible to get configuration information from the SAN admin. The high-end SAN is usually a corporate resource managed in a different department from the database.

The SAN admin is generally hard set on implementing the SAN vendor doctrine of "Storage as a Service" and does not care to hear input on special considerations from the database team. In addition to unpredictable or poor performance, and sometimes both, it is often necessary to fight for every GB of storage space via email requests, filling out forms, or endless meetings. This is especially ridiculous because storage capacity at the component level is really cheap. It only becomes a precious resource in a SAN.

Still, I am expected to answer questions on what is wrong with SQL Server when there are performance problems against a magically all-powerful enterprise SAN, so this is my best understanding. The example I am using is the EMC Symmetrix line, but the concepts here could be applied to other systems if details were available.

The EMC Symmetrix VMAX was introduced in 2009 using Intel Core2 architecture processors (45nm Penryn) with RapidIO fabric. A second generation came out in 2012, with the VMAX 10K, 20K and 40K models using Intel Xeon 5600 (32nm Westmere) processors. The predecessor to the VMAX was the Symmetrix DMX-4, which used PPC processors and a cross-bar architecture connecting front-end, memory and back-end units.

The basic information here is from the EMC documents. Because the details on the internal architecture of the VMAX are not found in a single authoritative source, much of it has to be pieced together. Some of the assessments here are speculation, so anyone with hard knowledge is invited to provide corrections.

VMAX (2009)

The original VMAX architecture is comprised of up to 8 engines. Each engine is comprised of a pair of directors. Each director is a 2-way quad-core Intel Xeon 5400 system with up to 64GB memory (compared with 16GB for the CLARiiON CX4-960).

VMaxEngine

Each director has 8 back-end 4Gb/s FC ports (comprised of quad-port HBAs?) and various options for the front-end including 8 x 4Gb/s FC ports.

The engine with 2 directors has 16 back-end FC ports (2 ports making 1 loop) and can have 16 ports on the front-end in the FC configuration. Assuming 375MB/s net realizable throughput with 4Gbps FC, each director could support an aggregate of 3.0GB/s on both the front and back-end ports.

In the full VMAX system of 8 engines (16 directors) with FC front-end configuration there are 128 x 4Gb/s FC ports on the front and back ends. Then in theory, the combined front-end and back-end bandwidth of the full system is 16 x 3.0GB/s (or 128 x 375MB/s) = 48 GB/s.
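The aggregate numbers above are straightforward arithmetic; a quick sketch, assuming the 375MB/s net realizable figure per 4Gbps FC port:

```python
# Sketch of the aggregate FC bandwidth arithmetic, assuming 375 MB/s
# net realizable throughput per 4Gb/s FC port.
FC4_NET_MBPS = 375          # net MB/s per 4Gb/s FC port (assumed)
PORTS_PER_DIRECTOR = 8      # back-end (or front-end) ports per director

director_gbps = FC4_NET_MBPS * PORTS_PER_DIRECTOR / 1000.0
print(director_gbps)        # 3.0 GB/s per director, front or back end

directors = 16              # 8 engines x 2 directors
system_ports = PORTS_PER_DIRECTOR * directors   # 128 ports per side
system_gbps = FC4_NET_MBPS * system_ports / 1000.0
print(system_gbps)          # 48.0 GB/s theoretical system aggregate
```

This is the theoretical port-limited ceiling only; whether the internals can sustain it is the question the rest of this post pokes at.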

Of course, there is no documentation on the actual sequential (or large block) IO capability of the V-Max system. There is an EMC VMAX Oracle document mentioning 10GB/s on 2 engines (not sure whether this is the 2009 VMAX or the second generation VMAX).

To support the above director, I would guess that the system architecture should have 6 x8 PCI-E slots. Based on quad-port FC HBAs, the 8 back-end ports require 2 x8 slots, and there are also 2 x8 slots for the front-end for any supported interface.

Without discussing the nature of the interconnect between directors in an engine, and the Virtual Matrix Interface, I am supposing that each requires one x8 slot. The above diagram does show a connection between the two directors in one engine.

So there should be 2 back-end, 2 front-end, 1 VM and 1 director-director x8 PCI-E slots in all. It could also be presumed that the slots are not connected through an expander, as this would result in an arrangement with unbalanced bandwidth.

At this point I would like to digress to review the Intel Core2 system architecture. The original memory controller hub (MCH or chipset) for the 2-socket Core2 system was the 5000P in 2006, with 1432 pins. The 5000P has 24 PCI-E lanes plus the ESI, which is equivalent to 4 lanes. So this is clearly inadequate to support the VMAX director.

5000P MCH

In late 2007 or early 2008, late in the product life of the Core2 architecture processors, Intel produced the 5400 MCH chipset, codename Seaburg, with 1520 pins, supporting 36 PCI-E lanes plus the ESI, equivalent to 4 PCI-E lanes.

5400MCH

This MCH chipset was not used by any server system vendor, so why did Intel make it if there were no apparent customers? It is possible the 5400 MCH was built specifically to the requirements of the high-end storage system vendors. I mentioned this briefly in System Architecture 2011 Q3.

The 5400 MCH can support 5 x8 PCI-E slots. I think this is done by using the ESI plus 1 x4 on the upstream side of the Enterprise South Bridge to support a x8 on the downstream side. So there is something wrong with my estimate of the PCI-E slot count required for the VMAX engine.

 

When the original EMC VMAX came out in 2009, I could find no documentation on the Virtual Matrix interface. I had assumed it was InfiniBand, as FC would not have been suitable on bandwidth or protocol support. Later I came across a slide deck illustrating VMI implemented with an ASIC connecting x8 PCI-E to RapidIO. The second generation VMAX specification sheets explicitly list RapidIO as the interconnect fabric.

RapidIO is an open-standard switched fabric. In short, RapidIO has protocols for additional functionality that was not necessary in PCI-E, a point-to-point protocol. (Some of these may have been added to PCI-E in later versions?) RapidIO can "seamlessly encapsulate PCI-E". The other aspect of RapidIO is that the packet overhead is far lower than Ethernet layer 2, and even more so than Ethernet layer 2 plus layer 3 (IP) plus layer 4 (TCP) as there is no requirement to handle world-wide networks. The RapidIO protocol overhead is also slightly lower than PCI-E.

The first version of serial RapidIO supported 1.25, 2.5 and 3.125 Gbaud, and x1 and x4 links. Version 2 added 5 and 6.25 Gbaud and x2, x8 and x16 links.

The diagram below is for the original VMAX using two Xeon L5410 processors. I neglected to note the source, so some help on this would be appreciated.

VMax Director

In the diagram above, the VMI ASIC is connected by x8 PCI-E to the director system, and by 2 x4 RapidIO links for the interconnect. The RapidIO encoded signaling rate is 3.125 Gbaud. The data rate before 8b/10b encoding is 2.5Gb/s per lane, or 1.25GB/s bandwidth for the x4 connection in each direction. The bandwidth per connection cited at 2.5GB/s full duplex is the combined bandwidth of both directions on the RapidIO side.

The bandwidth on the PCI-E side is 2.5Gbit/s raw per lane, or 2Gbps of unencoded data (8b/10b), for 2.0GB/s on the x8 slot. This is the nominal bandwidth of the full PCI-E packet including header and payload. The PCI-E packet overhead is 22 bytes.

The net bandwidth that I have seen for disk IO on x8 PCI-E gen 1 is 1.6GB/s. I am not sure what the average payload size was for this. It could have been 512 bytes, the disk sector size commonly used. In any case, the packet overhead is much less than 20%, so there is a difference between the net achievable bandwidth and the net bandwidth after PCI-E packet overhead.
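As a rough check on that gap (the 512-byte payload is an assumption, as noted above):

```python
# PCI-E gen1 x8: nominal slot bandwidth and packet-overhead efficiency.
lanes = 8
raw_gbps = 2.5                      # Gbit/s per lane, encoded
unencoded_gbps = raw_gbps * 8 / 10  # 2.0 Gbit/s per lane after 8b/10b
slot_gbytes = unencoded_gbps * lanes / 8
print(slot_gbytes)                  # 2.0 GB/s for the x8 slot

overhead_bytes = 22                 # PCI-E packet overhead
payload_bytes = 512                 # assumed (common disk sector size)
efficiency = payload_bytes / (payload_bytes + overhead_bytes)
print(round(efficiency, 3))         # ~0.959: overhead ~4%, well under 20%
```

Packet overhead alone would predict roughly 1.92GB/s net, above the 1.6GB/s observed, which is exactly the discrepancy pointed out above.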

The VMAX diagram above shows one x8 PCI-E for VMI and 4 x8 PCI-E for Disks (Back-end) and front-end channels (HBAs). The 4 BE and FE slots are labeled at 1.5GB/s each and 6.0GB/s for the set of four. Presumably this is the 4 x 375MB/s FC bandwidth, and not the PCI-E x8 bandwidth of 2.0 GB/s including packet overhead.

A dedicated interconnect between the two directors in one engine is not shown. So this would represent a valid configuration for 5400 MCH, except that 4 x8 PCI-E should be to the MCH, and only 1 x8 on the ICH (ICH was the desktop I/O controller hub, ESB was the server version).

The main observation here is that EMC decided it is a waste of time and money to continue building custom architecture in silicon when there are alternatives. It is better to use Intel Xeon (or AMD Opteron) components along with an open-standard fabric. There are ASIC and FPGA vendors that provide a base PCI-E to RapidIO interface design that can be customized. I am presuming the EMC VMI ASIC is built on this.

Below is EMC's representation of the VMAX system, showing 8 engines (16 directors) interconnected via the Virtual Matrix.

VMax Matrix

The diagram is pretty, but conveys very little understanding of what it is. Knowing that the Virtual Matrix interface is RapidIO is all that we need to know. The Virtual Matrix is a RapidIO switch, or rather a set of RapidIO switches.

Each of 16 directors is connected to the VM with 2 x4 RapidIO ports. A single switch with 128 (16x2x4) RapidIO lanes could connect the full VMAX system. A second possibility is two switches with 64 (16x4) RapidIO lanes. Each switch connects one x4 port on each director. Other possibilities with fewer than 64 lanes include 8 switches of 16 lanes, or some arrangement involving more than 1 switch between directors.
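The lane counts are simple arithmetic; a small sketch (the switch arrangements are the possibilities named above, not a confirmed EMC design):

```python
# RapidIO lanes needed to connect a full VMAX: 16 directors, each with
# 2 x4 RapidIO ports into the Virtual Matrix.
directors, ports_per_director, lanes_per_port = 16, 2, 4
total_lanes = directors * ports_per_director * lanes_per_port
print(total_lanes)   # 128

# Possible switch arrangements covering 128 lanes (speculative):
for switches in (1, 2, 8):
    print(switches, "switch(es) of", total_lanes // switches, "lanes")
```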

IDT makes RapidIO switches and PCI-E to RapidIO bridges (not to mention PCI-E switches). There are other vendors that make RapidIO switches and I do not know the source for the EMC VMAX. The RapidIO switches are available with up to 48 lanes as shown below.

I am not sure if there is one with 64 lanes; there is an IDT PCIe switch with 64 lanes in a 1156-pin BGA. IDT describes their 48-lane RapidIO switch, capable of operating at 6.25 Gbaud, as having 240Gb/s throughput. So they felt it was more correct to cite the unencoded bandwidth in a single direction: not the full duplex rate, and not the encoded data rate.

The diagram below shows the full VMax system comprising 11 racks with the maximum disk configuration!

VMax Full Config

The center rack is for the VMax engines, the other 10 are storage bays. Each storage bay can hold up to 240 drives. There are 160 disk array enclosures, 64 directly connected, and 96 daisy chained. There are 8 VMax engines, with the disk enclosures in matching color.

The 2009 VMAX initially supported only 3.5in drives? (I misplaced or did not keep the original VMAX documentation, oops.) The back-end interface on both the original and second generation (!@#$%^&) VMAX is 4Gbps FC. The 3.5in disk drives are also FC. The 2.5in disk drives for the second generation VMAX are listed as SAS, so presumably the disk enclosure converts the external FC interface to SAS internally. There are flash drive options for both 3.5in and 2.5in, the 3.5in being FC and the 2.5in SAS?

The mid-range VNX moved off FC disks in 2011. Perhaps the size of the VMAX with all 11 racks is beyond the cable limits of SAS? But why 4Gb/s FC and not 8Gbps? Is this to maintain compatibility with the previous generation DMX? I am inclined to think it is not a good idea to saddle a new generation with the baggage from the older generation. Perhaps in the next generation FC on the back-end would be replaced by SAS?

VMAX Second Generation (2012)

The second generation EMC VMAX employs the Intel Xeon 5600 series (Westmere-EP) processors with up to six cores. There are three series, the VMAX 10K, 20K and 40K. The complete system is comprised of one or more engines. There can be up to 8 engines in the 20K and 40K and up to 4 engines in the 10K.

Each engine is comprised of 2 directors. A director is a computer system. The 10K director originally had a single quad-core processor; later versions have a single six-core processor. The 20K director has two quad-core processors. The 40K director has two six-core processors. Both the 10K and 20K (directors) have dual Virtual Matrix Interface (VMI or just VM?). The 40K (director) has quad-VM.

It is very hard to find useful detailed SAN system architecture information. I came across the following from an EMC VMAX 40K Oracle presentation, which appears to be just an update of the original VMAX engine diagram to the second generation VMAX 40K.

VMaxEngineII

But notice that the interconnect between the two directors (2-socket servers) is labeled as CMI-II. CMI is of course the acronym for CLARiiON Messaging Interface, which in turn was once Configuration Manager Interface (prior to marketing intervention?). This makes sense. There is no reason to develop different technologies to perform the same function in the two product lines. So the separation between VNX and VMAX is that the latter has VMI to cluster multiple engines together.

Along the same lines, does there need to be a difference in the chips to perform the CMI and VMI functions? It does not matter if the software stacks are different.

To support the VMAX 40K director, there should be 2 x8 PCI-E slots each for both the front-end and back-end ports as before in the original VMAX. I am also assuming a single x8 PCI-E slot for the CMI-II. The difference is that the 40K director needs 2 x8 PCI-E slots to support 4 VM connections, each x4 RapidIO. This makes a total of 7 x8 PCI-E slots.

The 2-socket Xeon 5600 system architecture is shown below with two 5520 IOH devices each supporting 36 PCI-E gen2 lanes for 72 lanes total, not counting the ESI (equivalent to PCI-E gen 1 x4).

5600

The full Xeon 5600 system can support 8 PCI-E gen2 x8 slots, plus 2 gen2 x4 (because the extra x4 on each IOH cannot be combined into a single x8?). So this time there are more PCI-E slots than necessary. Note also that all of these are PCI-E gen2 slots. The back-end FC on the 2nd generation VMAX is still 4Gb/s FC. The front-end FC can be 8Gbps FC. It could be that all FC HBAs in the second generation can support 8Gbps, just that the back-end ports operate at 4Gbps?

Virtual Matrix and RapidIO

The original VMAX used RapidIO at 3.125 Gbaud. After 8b/10b encoding, the unencoded data rate is 2.5Gbps. In a x4 link, the combined data rate is 10 Gbit/s, or 1.25 GByte/s. As with modern serial protocols, data transmission is simultaneous bi-directional. So the bandwidth in both directions combined is 2.5GB/s full duplex.
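The link arithmetic above, as a quick sketch:

```python
# Original VMAX: RapidIO x4 link bandwidth from the encoded signaling rate.
baud_per_lane = 3.125                      # Gbaud, encoded
unencoded_gbps = baud_per_lane * 8 / 10    # 2.5 Gbit/s per lane after 8b/10b
lanes = 4
per_direction = unencoded_gbps * lanes / 8 # GB/s
print(per_direction)                       # 1.25 GB/s in each direction
print(per_direction * 2)                   # 2.5 GB/s quoted as "full duplex"
```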

In a server system, citing full duplex bandwidth for storage is not meaningful because IO is almost always heavily in one direction (except for backups directly to disk). However, it should be pointed out that the bi-directional capability is immensely valuable because the primary stream is not disrupted by minor traffic in the opposite direction (including acknowledgement packets). Just do not confuse this with the full duplex bandwidth being a useful value.

In a storage system, it could be legitimate to cite the full duplex bandwidth for the engine, because each engine could be simultaneously processing data in-bound from and out-bound to other engines. So the engine must be able to handle the full duplex bandwidth.

Now considering the complete storage system, any traffic that leaves one engine must arrive at another engine. The total traffic is the sum of the bandwidth a single direction. So it is misleading to cite the sum total full duplex bandwidth. But marketing people can be relied upon to mislead, and we can trust marketing material to be misleading.

The VMI ASIC bridges 8 PCI-E lanes to 8 RapidIO lanes. In the original VMAX, this is PCI-E gen 1 to RapidIO at 3.125 Gbaud. In the second generation VMAX with Westmere-EP processors, the PCI-E is gen2 and the RapidIO is now presumed to be 6.25 Gbaud. PCI-E gen1 is 2.5Gbps and gen2 is 5Gbps per lane.

I suppose that there is a good reason RapidIO was defined at 3.125 Gbaud at the time PCI-E was 2.5Gbps. Consider sending data from one system to another. In the first system, data is first transmitted over PCI-E (hop 1). A device converts the data to be transmitted over RapidIO (hop 2). At the other end, a device converts back for transmission over PCI-E (hop 3) to the final destination.

It would seem reasonable that if all interfaces had equal data rates, there would be some loss of efficiency due to the multiple hops. So for lack of hard analysis I am just speculating that there was a deliberate reason in the RapidIO specification.

Another speculation is that it was known that RapidIO would be interconnecting systems with PCI-E, and the extra bandwidth would allow encapsulated PCI-E packets on RapidIO with the upstream and downstream PCI-E ports to be running at full bandwidth?

The point of the above discussion is that the bandwidth on the RapidIO of the VMI ASIC is less material to the storage professional. The bandwidth on the PCI-E side is closer to net storage IO bandwidth.

 

In the table below, I am trying to make sense of the Virtual Matrix bandwidth of the original VMAX, and the second generation VMAX 10K, 20K and 40K. The original VMAX 2009 had 3.125 Gbaud RapidIO, so each x4 link had 1.25GB/s unencoded bandwidth per direction. Each director has dual Virtual Matrix, so the combined full duplex bandwidth of 4 VM for the engine is 10GB/s unencoded. The combined full duplex bandwidth on the PCI-E side is 8GB/s per engine.

                   Original      10K           20K         40K
Processor          Core2         Westmere      Westmere    Westmere
Sockets            2             1             2           2
Cores              4             4-6           4           6
VMI/dir            2             2             2           4
VMI/eng            4             2             4           8
RapidIO            3.125 Gbaud   6.25 Gbaud    ?           6.25 Gbaud
Unencoded 8b/10b   2.5 Gbaud     5 Gbaud       ?           5 Gbaud
x4 link            1.25GB/s      2.5GB/s       ?           2.5GB/s
x4 link bi-dir     2.5GB/s       5GB/s         ?           5GB/s
Engine VM BW       10GB/s        50GB/s?       24GB/s      50GB/s
System VM BW       80GB/s?       200GB/s?      192GB/s     400GB/s

The second generation VMAX should be on RapidIO at 6.25 Gbaud and PCI-E gen 2 at 5Gbps. The VMAX 40K specification sheet cites Virtual Matrix bandwidth of 50GB/s for the engine and the full system with 8 engines VM at 400GB/s. The VMAX 20K specification sheet cites VM bandwidth of 24GB/s for the engine and the full system with 8 engines VM at 192GB/s. The VMAX 10K specification sheet cites the full system (4 engines) VM bandwidth at 200GB/s, implying a single engine at VM bandwidth of 50GB/s.

Given that the VMAX 40K has twice as many Virtual Matrix interfaces and double the signaling rate, the cited VM value of 50GB/s can only mean the bi-directional encoded rate of 6.25 Gbaud over 8 x4 links on the RapidIO side. The VMAX 20K value of 24GB/s is perplexing. Why is it not the full duplex rate of 25GB/s for 6.25 Gbaud over 4 x4 links?

The VMAX 10K system value of 200GB/s is also perplexing. There are only 4 engines maximum, meaning each engine would be 50GB/s. The other documents or slide decks indicate the VMAX 10K director is dual VMI? So the VM bandwidth should be 25GB/s full duplex encoded?

On the assumption that the VMAX 40K engine has 50GB/s Virtual Matrix encoded full duplex bandwidth, then the unencoded bi-directional bandwidth is 40GB/s on the RapidIO side, and the unencoded bi-directional bandwidth is 32GB/s on the PCI-E side, corresponding to 4 x 8 PCI-E gen 2 lanes. So the useful bandwidth for the engine VM is 16GB/s single direction.
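That chain of conversions can be sketched as arithmetic; the 8-link and 32-PCI-E-lane figures per engine are my assumptions from the discussion above:

```python
# Unwinding the quoted 50GB/s Virtual Matrix figure for a VMAX 40K engine.
links = 8            # assumed: 2 directors x 4 VMI, each an x4 RapidIO link
lanes = links * 4    # 32 RapidIO lanes
baud = 6.25          # Gbaud per lane, encoded

encoded_fd = lanes * baud * 2 / 8        # both directions, GB/s
unencoded_fd = encoded_fd * 8 / 10       # after 8b/10b
pcie_lanes = 32                          # assumed: 4 x8 PCI-E gen2 slots/engine
pcie_fd = pcie_lanes * 5.0 * 8 / 10 * 2 / 8   # GB/s, both directions
single_dir = pcie_fd / 2

print(encoded_fd)    # 50.0  (the marketing number)
print(unencoded_fd)  # 40.0  unencoded, RapidIO side
print(pcie_fd)       # 32.0  unencoded, PCI-E side
print(single_dir)    # 16.0  useful single-direction bandwidth
```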

 

 

Bandwidth Calculation and Speculation

For lack of hard data on what the VMAX IO bandwidth capability actually is, I will speculate. The original VMAX director could have 8 x 4Gbps FC ports on both front-end and back-end. As discussed above, based on 375MB/s for each 4Gbps FC, the director FE and BE bandwidth is then 3.0 GB/s.

I will further assume that the sole purpose of the CMI-II between the two directors in each engine is to maintain a duplicate of the write cache for fault tolerance. This means all other traffic between directors must go through the VMI.

In the circumstance that every I/O request coming to a particular port on one of the directors accesses data only on RAID groups directly attached to that director, we would have 100% locality and there would be nearly zero traffic over the VM. Not only is this highly improbable and extremely difficult to contrive, it also goes against one of the key principles of the central SAN argument. The idea is to pool a very large number of disks into one system such that every volume from every host can access the aggregate IOPS capability of the complete storage system.

A RAID group must be built only from the disks directly attached to the director. So the aggregate concept is achieved by pooling all RAID groups together. Volumes are created by taking a (small) slice of each RAID group across all directors. Each volume now has access to the IOPS capability of the entire set of disks. This is why the SAN shared storage concept is valid for transaction processing systems but not for DW systems that would benefit from sequential large block IO.

In this scenario, the presumption is that IO requests arriving at any director are evenly distributed to all directors. In the full system of 8 engines (16 directors), 6.25% (1/16) of IO is on local disks accessed via the back-end ports and 93.75% (15/16) must come through the VM from the other directors.

Then the SAN system bandwidth is constrained by the most limiting of the front-end channels, the back-end channels, and the adjusted Virtual Matrix single-direction (not full duplex) bandwidth. The adjustment accounts for the percentage of traffic that must come through the VM: if the VM must handle 15/16 of the total traffic, then the upper limit is 16/15 times the VM bandwidth. On the VM, it so happens that the PCI-E side is more limiting than the RapidIO side, so quoting the bi-directional bandwidth is misleading, and so is quoting the RapidIO side bandwidth instead of the PCI-E bandwidth.

The PCI-E bandwidth to VM in the original VMAX is 2.0 GB/s (x8 gen 1) including PCI-E protocol overhead. The actual net bandwidth is less than 2GB/s but possibly more than 1.6GB/s cited earlier as the maximum that I have seen in direct attach IO. This is more limiting than the 3GB/s on the 8 x 4Gbps FC front-end or backend ports.
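A sketch of that comparison; the 1.6GB/s net PCI-E figure is the assumption taken from direct-attach measurements, not a published number:

```python
# Director bandwidth limit in the original VMAX: the binding constraint is
# the smallest of front end, back end, and locality-adjusted Virtual Matrix.
fe = be = 3.0             # GB/s, 8 x 4Gbps FC ports at 375MB/s each
vm_net = 1.6              # GB/s, assumed net x8 PCI-E gen1 to the VMI
remote_fraction = 15/16   # share of traffic crossing the VM (16 directors)

vm_adjusted = vm_net / remote_fraction
print(round(vm_adjusted, 2))      # ~1.71 GB/s
print(min(fe, be, vm_adjusted))   # the VM side is the limiter
```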

The second generation VMAX allows 8 x 8Gbps FC ports on the front-end for an aggregate bandwidth of 6GB/s based on 750MB/s per 8Gbps FC port. However the back-end ports are still 4Gbps FC for an aggregate of the same 3GB/s as in the original VMAX. The 40K VMAX engine is described as having 50GB/s VM bandwidth, without mentioning that this is the full-duplex value encoded on the RapidIO side. The single direction encoded data rate on a single director is 12.5GB/s. The unencoded rate is 10GB/s on the RapidIO side. The single direction unencoded rate on the PCI-E side is 8GB/s (16 PCI-E gen 2 lanes). Still, this is much more than either the FE or BE ports.

Note that with fewer engines and corresponding directors, more of the traffic is local. With 4 engines and 8 directors, the local traffic is 12.5% and 87.5% remote. With 2 engines and 4 directors, the local traffic is 25% and 75% remote.
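The locality fractions follow directly from the director count; a small sketch:

```python
# Fraction of IO that stays local vs crosses the Virtual Matrix, assuming
# IO arriving at any director is spread evenly over all directors' disks.
locality = {}
for engines in (8, 4, 2):
    directors = engines * 2
    local_pct = 100.0 / directors
    locality[engines] = (local_pct, 100.0 - local_pct)
    print(engines, "engines:", locality[engines])
# 8 engines: (6.25, 93.75)  i.e. 1/16 local, 15/16 remote
# 4 engines: (12.5, 87.5)
# 2 engines: (25.0, 75.0)
```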

All of the above is for read traffic, and does not consider whether there are other more limiting elements. Another consideration is memory bandwidth. A read from "disk" could be first written to memory, then read from memory (the latency due to the CPU cycles involved is not considered). An 8-byte wide DDR DRAM channel at 1333MHz has 10.7GB/s bandwidth, but this is only for memory reads.

The memory write bandwidth to SDR/DDR is one-half the nominal bandwidth. In the really old days, a disk access involving a memory write followed by a memory read would be constrained by one-third of the nominal memory bandwidth. Intel server systems from 2006 or so on used memory with a buffer chip that is described as supporting simultaneous read at the nominal bandwidth and write at one-half the nominal bandwidth.
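The one-third figure follows from simple arithmetic; a sketch, normalizing to one byte that passes through memory twice (written, then read back):

```python
# Effective memory bandwidth for a disk access that is written to memory
# and then read back, when memory writes run at half the nominal rate.
nominal = 10.7                          # GB/s, 8-byte DDR channel at 1333MHz
bytes_moved = 1.0                       # normalized

t_write = bytes_moved / (nominal / 2)   # write at half nominal bandwidth
t_read = bytes_moved / nominal          # read at full nominal bandwidth
effective = bytes_moved / (t_write + t_read)
print(round(effective, 2))              # ~3.57 GB/s, i.e. nominal / 3
```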

In writes to storage, the write IO is first sent to memory on the local director, then copied across the CMI-II(?) to the other director in the same engine? So the net bandwidth across the CMI is also limiting.

Now that SQL Server 2012 allows clustering (AlwaysOn?) with tempdb on local (preferably SSD) storage, I recommend this to avoid burdening the SAN with writes. Or a SAN vendor can bother to understand the nature of tempdb and allow write cache mirroring to be selectively disabled?

Even with all this, there is not a definitive statement from EMC on the actual bandwidth capability of the VMAX, original or extra-crispy second generation. Some slides mention a 3X increase in bandwidth. Was that a particular element, or the realizable bandwidth? Is it possible that the original VMAX could do only 1/3 the back-end aggregate of 48GB/s, and that the second generation can do the full back-end limit?

Summary

Regardless of the SAN, focus on the IOPS and bandwidth that can be realized by actual SQL Server queries. SAN vendor big meaningless numbers are not helpful. The standard SAN vendor configuration should be able to produce reasonable IOPS, but will probably be sadly deficient on bandwidth that can be realized by SQL Server. I do like SSD. I do not like auto-tiering, flash-cache or 7200RPM drives for the main line-of-business database. It should be the database administrator's responsibility to isolate hot data with filegroups and partitioning.

Considering that a 10K 900GB drive lists for $600, why bother with the 7200RPM (3TB) drive in an enterprise system, unless it is because the markup is so ridiculously high? Or perhaps data that needs to be on 7200RPM drives for cost reasons should not be on a high-cost enterprise storage system? If there are SSDs, these should be made available as pure SSD.

(Edit 2013-08-02)
Also extremely important is how SQL Server IO works. First, FC 8Gbps is 8 Giga-bits/sec, not Giga-Bytes/sec. After encoding overhead, the net single-direction BW is 700-750MB/s. On modern server hardware, SQL Server can easily consume data at 5-10GB/s (single direction), so 8-16 x 8Gbps FC ports are recommended for a multi-ten-TB DW. Furthermore, use serious configuration documents, not the stupid idiots who say 1 LUN each for data, temp, log. There should be 1 volume per FC path, to be shared between data and temp. For OLTP, log should be on a separate volume with dedicated physical HDD (ie, not a volume carved from a shared pool) and possibly even its own FC path. For DW, log can share a path with the data volumes. Each critical FG has one file on each of the data volumes. And don't forget the -E startup parameter.

 

 

Symmetrix DMX (pre-2009)

The old DMX-4 architecture is shown below. The front-end and back-end units used PPC processors(?), connected through a matrix to memory controllers?

VMax

A history of the EMC Symmetrix product line can be found on Storage Nerve.

There can be up to 8 front-end units. Each FE can have 8 FC ports, for a total of 64 FE ports? Assuming that this was designed for 4Gbps FC, with a realizable bandwidth of 375MB/s on each 4Gbps FC port, each FE unit of 8 ports would in theory have a maximum BW of 3.0GB/sec. The combined 8 FE units with 64 ports total would have a theoretical upper bound of 24GB/s. It is also possible that the DMX was originally designed for 2Gbps FC, for an upper-bound design of 12GB/s.

Various EMC documents mention the interconnect bandwidth as a sum total of the individual component bandwidths. But nowhere in the EMC documents is there a mention of the actual DMX bandwidth capability. I have heard that due to some internal architecture aspect, the actual bandwidth capability of the DMX is in fact 3GB/s.

Lonny Niederstadt provided the following link Agilysys Audit.

Published Friday, May 10, 2013 12:52 PM by jchang


Comments

 

Chris Adkin said:

I have first hand experience of SANs running out of IOPS bandwidth before they run out of storage capacity, both with high and low end offerings. To get the best performance out of them you need SANs that have the intelligence to optimize for using the outer 'cylinders' on disks. Before you mention SSDs, most SAN vendors charge through the nose for their own brand SSDs. SANs also tend to be biased towards performing large read-aheads, and I've known some to quiesce all I/O if the write part of the cache on the storage processors fills up. iSCSI has become popular of late, however I think that with things like TCP/IP back-off this is not an ideal way of talking to a SAN. A new storage paradigm is in the wings; it's not going to be pure flash arrays, but along the lines of converged all-in-one compute, network and storage. Someone somewhere will develop a way of presenting flash on PCIe cards as consolidated pools of storage; it might be something along the lines of PernixData.

May 21, 2013 5:14 PM
 

jchang said:

The outer cylinder trick is to achieve maximum sequential IO with HDDs. The Seagate Cheetah 15K.7 is rated from 122MB/s on the inner tracks to 204MB/s on the outer tracks. This is useful for DW bandwidth, not so much for random IO. All this assumes the IO system can deliver the BW available from the disks. SAN systems are designed for random IO. They configure many disks per channel, so they would not benefit from being able to extract max BW per disk.

My opinion is that the database engine is designed to be tightly coupled to raw storage devices. Putting "general intelligence" into the storage system may or may not help and may even be counter productive. This is why I prefer direct attach storage for DW, and just basic SAN for OLTP.

May 24, 2013 7:42 AM
 

Josh Krischer said:

DMX bandwidth.

The DMX-3 uses what EMC refers to as “Direct Matrix Architecture”, which EMC claims provides up to 128 GB/s bandwidth.

The DMX cache is built from 2 to 8 cache modules with capacities between 16 and 512 GBytes. Each cache module has eight 1 GByte/s connections to the Channel Directors (CD - host front-end interface) and similar connectivity to the 8 Device Directors (DD - back-end interface), which means that only a fully configured DMX has 128 x 1 GByte/s bi-directional connections.

However, this doesn’t mean that the maximum cache bandwidth is 128 GBytes/s because the DMX cache supports a maximum of only 32 concurrent operations* (4 concurrent memory transfers per cache module) which only result in a total theoretical 32 GBytes/s for data and control traffic.

Each of the 1GB/s serial connections is composed of a pair of full-duplex unidirectional serial links: two 250MB/s serial transmit links (TX) and two 250MB/s serial receive links (RX), which means 0.5 GByte/s in each direction. Depending on the type of workload, this may further reduce the practical available bandwidth.

There are only 32 bi-directional paths to cache, and only half the bandwidth can be used in any direction at one time.

Considering all the above, the maximum achievable bandwidth of the DMX is below 16 GByte/s, much lower than the 128 GByte/s stated in DMX documentation.

A modest DMX configuration with two cache modules, two CDs and two DDs has half of this bandwidth.  Because the cache directory is stored in the cache and each access to cache requires additional access to fetch the metadata, the effective bandwidth is even lower.

June 30, 2013 12:06 PM
 

jchang said:

thanks Josh. I have it from a reliable source that the actual realizable bandwidth of the DMX-3 is 3GB/s. For a high-end (complex) system that came to market in 2005, this is not entirely bad, but I think the target should have been 10MB/s per disk. At 1000 disks (Wiki says the max is 1440/2400), 10GB/s would have been nice.

Of course, the high band-width is really for DW, and the high-end storage systems (all vendors) are just a waste of money in DW. It is really designed for consolidation, sharing IOPS for random IO.

July 1, 2013 10:44 AM
 

Garret Black said:

Nice article.  I'm a SAN admin and I try to work closely with our DBAs as our company has a lot of SQL.  Not typical OLTP but large DW with bulk loads (writes) etc., which typically isn't covered in white paper solutions.  You are correct in that the drive for SANs is for efficiency.  SSDs in every server that needs more disks for IOPS than it can hold really adds up in cost, and it will likely not get enough capacity.  We have SQL servers from 2TB to 50TB and IOPS ranges peaking from 2,000 to 70,000 per server.  One could add multiple DAS units to each server but then it takes up valuable datacenter space and more non-redundant components per server.  When you have over 1000 servers, adding a DAS to each server really adds up.  There are many people that are afraid of auto-tiering but it's nothing new at this point and it's where everyone is going.  The fact is every auto-tiering solution is different.  We were told point blank by HDS that auto-tiering wouldn't work with our workload on a VSP due to the batch nature of our workload, but EMC said it would work on the VMAX and we have been running great with it, to the point that we are now at an 8-engine VMAX.  Could it be faster?  Yes, but it would be at a cost.  We are provisioning different storage groups for each data type (SQLDB, TempDB, Logs, Backup) so that we can pin data to a tier of disk if needed, but we haven't had that need.  One thing that I'm quite scared of is Microsoft's recent push for essentially creating your own SAN with clustered shared volumes and SMB 3.0 and using their auto-tiering.  Microsoft's auto-tiering is very retroactive as well compared to a SAN.  It has many thinking "storage for cheap" but everything has a cost.  The other argument is whether or not to virtualize SQL.  DBAs typically have the argument that SQL needs to separate disk I/O on different disks, such as tempdb/logs local and DB on SAN, but look at the Microsoft PDW v2.
It's based off VMs on Hyper-V, all sharing the same spindles.  It seems that many of the old rules of LUNs to cores and separated disk groups don't matter these days.

July 29, 2013 4:00 PM
 

jchang said:

A shared SAN is fine for consolidated DBs. I would prefer that the most critical OLTP DB not be on a shared SAN, but if that's the way you want to run your business, so be it. A shared SAN is not right for a very large DW. The IO patterns are completely wrong, and IOPS is the wrong metric for DW. I would really want to target 10-20GB/s of bandwidth to support a 50TB DW.

In any case, for a large DW, there should be several volumes for data/temp. That is, data and temp should share volumes, but there should be one volume for each path. To support 10GB/s of bandwidth, you should probably have 16 x 8Gbps FC paths. Backups should go to a different storage system.
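To sketch the path-count arithmetic (in Python; the ~700MB/s net per 8Gb port figure is the estimate discussed further down, and the helper name is just for illustration):

```python
# Rough FC path sizing for a target scan bandwidth.
# Assumes ~700 MB/s usable per 8Gb FC port, net of protocol overhead.

def fc_ports_needed(target_gb_per_s, net_mb_per_port=700):
    """Ceiling of target bandwidth divided by per-port net bandwidth."""
    target_mb = target_gb_per_s * 1000
    return -(-target_mb // net_mb_per_port)  # integer ceiling division

print(fc_ports_needed(10))  # 15, rounded up to 16 paths for a symmetric layout
```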

Auto-tiering is a dumb idea for DW. On every large table scan, data will move on and off the tier-0 SSD generating excessive and unnecessary wear. This actually degrades performance. The whole concept of DW with sequential access is that HDD can support this. You can even use MLC SSD for DW because the core data is write once, followed by heavy read. tempdb writes would be distributed over a lot of excess space.

July 30, 2013 10:59 AM
 

Garret Black said:

It seems we don't quite fit the I/O characteristics you are describing for a typical DW.  I see a lot of random I/O, and unfortunately I don't know enough SQL to dig into why that is.  We also don't have SQL servers that push over 1.5GB/s, including 4-socket servers with multiple LUNs.  We provision 8Gb FC paths to servers over two 8Gb HBAs, and they don't come close to saturation.  Seems like we are thinking of different workloads here.

I was expecting to have to pin LUNs to tiers when we first implemented the SAN.  The fact is, with auto-tiering we have cut disk cost in half, yet still get equal or better performance.

July 30, 2013 2:02 PM
 

Garret Black said:

Also, what are your thoughts on Microsoft's push to "virtualize everything," which of course includes SQL?  In a large environment the only way to do that is to share disks on some type of SAN or NAS solution.

July 30, 2013 2:11 PM
 

jchang said:

Broadly, there are Inmon and Kimball DWs, which are completely different entities that were both given the same name. I generally think in terms of BI/DSS DWs, where queries involve large table scans. In principle, a table scan might generate sequential IO, but after the layers of SQL Server page allocation, the OS file system, and the SAN allocation, you may not be generating sequential IO. So just pay attention to IO size: 8K IO from SQL Server generally points to pseudo-random access, while 64K IO may indicate a table scan.
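As a rough sketch of that diagnosis (a hypothetical Python helper; the counter values and thresholds are illustrative, taken from the 8K/64K rule of thumb above):

```python
# Hypothetical helper: infer the likely access pattern from perfmon-style
# disk counters. Average IO size = bytes/sec divided by IOs/sec.

def avg_io_size_kb(read_bytes_per_sec, reads_per_sec):
    return read_bytes_per_sec / reads_per_sec / 1024

def likely_pattern(io_kb):
    if io_kb <= 8:
        return "pseudo-random (singleton 8K page reads)"
    if io_kb >= 64:
        return "scan (read-ahead, multi-page IO)"
    return "mixed"

io_kb = avg_io_size_kb(52_428_800, 800)   # 50 MB/s at 800 reads/s
print(round(io_kb), likely_pattern(io_kb))
```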

Each 8Gbit FC port can support about 700-750MByte/s net after protocol overhead at all levels. You have 2 x 8Gb ports, but if you only have one data LUN and one temp LUN, then chances are that most of the time you will only see 700MB/s, reaching 1.5GB/s only in unusual circumstances.

So you are saturated on the FC IO paths. A modern 4-socket server should be able to drive table scans in the 10-20GB/s range, but your configuration will only support 0.7GB/s.

I am not sure you have proper baselines for comparing with and without auto-tiering. In principle, auto-tiering is intended to help random IO performance, typically small-block. Auto-tiering just gets in the way of large-block or sequential IO.

Virtualizing everything is the pinnacle of stupidity. It should be: virtualize everything except the very few truly (super) critical systems, for which your organization is staffed with dedicated extra personnel to ensure absolute best performance.

July 30, 2013 4:46 PM
 

Garret Black said:

Please take this as a constructive debate.  It's interesting to hear the reasons from the DBA side of things.

Your calculations seem very low for 8Gbit FC ports.  One 8Gbit QLE2560 FC port has a spec of 16000 MBps.  Even if their documentation is incorrect and actually means Mbps, that's still 2000 MB/s per FC port.  Protocol overhead for FC isn't near what it is for TCP/IP protocols like iSCSI.  We have a server that has 2 QLE2560 cards and pushes over 2500 MB/s.  While I agree 4-socket servers are large and could in theory push more I/O, it's not the bottleneck for our DW servers.

We compared run times on the old SAN vs. the new SAN, which is the only metric that matters to the DBAs.  The EVAs had 240 15K FC drives, and our VMAX started with about 500 drives; now it has 1440, composed of 1% EFD, 30% SAS, and 70% SATA.  I'll admit that this comparison was unfair due to the amount of cache and spindles in the new SAN, since we were consolidating about 8 EVAs.  The main thing we wanted to test was SATA performance, since having all data on SATA would be the worst-case scenario in an auto-tiering SAN.  We also tested concurrency as much as we could during our POC, and we haven't run into any issues regarding that.  On the SAN side I compared the usual response time, transfer rate, and IOPS.  Most SANs have algorithms to optimize for sequential I/O.  Cache is also a factor.

I totally agree the best performance will be dedicated resources, but that doesn't scale efficiently.  When your business provides a DW service, you suddenly have many critical systems.  Not only that, but when dealing with large SQL servers with local storage, you now have many large servers with wasted resources at some level, unless you have a SQL server that runs jobs 24/7 and you sized local storage needs perfectly.

Why do you believe virtualization doesn't work?  Have you looked at Microsoft's PDW v2?  It's built on VMs and is purpose-built for DW.  Pretty interesting stuff, and in our POCs the DBAs are telling us it's faster than Netezza, which isn't built on virtualization.  This, I think, has opened some eyes with our DBAs that virtualization isn't so bad.

July 31, 2013 10:12 AM
 

jchang said:

I think you are confused about FC protocol overhead.

see http://www.fibrechannel.org/fibre-channel-roadmaps.html and

http://www.infostor.com/san/fibre-channel/2010/16gfc-standard-doubles-fibre-channel-speed-.html

The 8Gb (gigabit) FC line rate is actually 8.5 Gbaud, but after 8b/10b encoding overhead, the net packet data rate is 800MB/s.

I believe the reason fibrechannel.org lists 1600 MByte/s (not 16000) throughput is that FC, like all high-speed transports today, is bi-directional. Notice that each FC port has 2 fiber connections, one for each direction. So 8Gbit/s FC can carry 800MB/s in each direction simultaneously.

However, if you consider real workloads, the traffic is almost all in one particular direction for any given task at a point in time. So citing the bi-directional bandwidth is dishonest or ignorant, both characteristics of a marketing puke pretending to be technical.

Now, the actual data rate realizable by an application after the upper layer overhead (the FC packet has a header, and payload, the payload itself has an upper level header) is in the 700-750MB/s range.
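The arithmetic, step by step (a sketch; the figures are the ones quoted above):

```python
# 8GFC line-rate arithmetic, per direction.
line_rate_gbaud = 8.5      # 8GFC signaling rate
payload_ratio = 8 / 10     # 8b/10b encoding: 2 of every 10 bits are overhead

gbit_net = line_rate_gbaud * payload_ratio   # 6.8 Gbit/s of encoded payload
mb_per_s = gbit_net * 1000 / 8               # 850 MB/s before framing overhead
print(round(mb_per_s))                       # 850

# Frame headers bring the nominal rate to ~800 MB/s per direction, and
# upper-layer (FCP/SCSI) overhead leaves ~700-750 MB/s for application data.
```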

I believe EMC even has documentation stating this. Of course, they don't volunteer it; you have to ask for it. They would rather you focus on the 8Gb number. Sometimes I wonder if SAN vendors hope their customers are too stupid to understand the difference between bit and byte, considering they are already trying to confuse you on single versus bi-directional bandwidth. I am actually surprised they didn't sell you their 10Gb FCoE crap, which has a different issue.

Something is very wrong if you have 2 QLE2560 single-port 8Gbps FC cards pushing 2500MB/s. Is this a real application, or some kind of benchmark? The only real database operation that will generate 2500MB/s on 2 x 8Gbit/s FC is a database backup: reading 750MB/s from each of 2 ports while simultaneously writing 750MB/s to each port is, in theory, nearly a combined bi-directional bandwidth of 3000MB/s. Of course, if you are backing up to SATA disks, then 600MB/s per port per direction is perfectly reasonable.

How do you know that your SQL Server does not need high bandwidth? You have not configured the storage system to support high bandwidth. So what is probably happening is that in IO-intensive operations, the CPU load is low, because the IO channel is saturated (you have 2 channels, but a data read will only go over a single channel).

If you are on a second-generation VMAX with 8 engines, it could probably deliver 10GB/s over 16 x 8Gb FC ports, and a single 4-socket SQL Server could consume data at that rate. But with 30% of the 1440 drives being SAS, I would target 10MB/s per disk for about 4GB/s. I think you have too many SATA drives for a serious 50TB DW system (or just too few 10K SAS drives).
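The back-of-envelope disk math (a sketch using the 10MB/s-per-disk planning figure above):

```python
# Aggregate bandwidth estimate from the SAS tier.
total_disks = 1440
sas_disks = total_disks * 30 // 100   # 30% of the drives = 432
mb_per_disk = 10                      # conservative per-disk target for scans

aggregate_gb_per_s = sas_disks * mb_per_disk / 1000
print(sas_disks, aggregate_gb_per_s)  # 432 disks, ~4.3 GB/s
```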

July 31, 2013 8:55 PM
 

jchang said:

I actually got a chance to work on PDW v1 in the lab. The v1 architecture was entry-level SAN (for example, the HP P2000 G3?) connected via FC.

I heard about the v2 architecture but did not remember the details, probably because I am getting old and senile, so I had to look it up.

I found this, I think there might also be a video from TechEd?

http://www.youtube.com/watch?v=8mr0qkkrZo4

v2 uses virtualization to consolidate management into 2 nodes? v1 had too many management nodes, which was ridiculous. So that is good: virtualizing the little things. But as far as I can tell, the core compute nodes are not virtualized. v2 storage also ditches the performance-impeding SAN for direct-attach SAS storage. Two nodes connect to each SAS storage unit for fault tolerance.

If anyone can point me to a good PDW v2 architecture slide deck + architecture whitepaper, I would appreciate it.

I found this, but have not watched it yet

http://www.youtube.com/embed/8_EXwvNtch8?wmode=transparent#t=71m43s

July 31, 2013 9:18 PM
 

Garret Black said:

I'd have to dig into the metrics of the server but it's a DW server and also has SAS cubes.  

I see what you mean about the 2 channels per port, one for each direction.  I never thought of it that way when it comes down to reads and writes.  The average I/O of a server is even less than 500 MB/s, with some peaks to 1500 MB/s and once in a while up to 2500 MB/s.  What I'll have to dig into is whether at those peaks it's combined read/write, or only read, or only write.  I assume if a server is pushing 1600 MB/s of only read then it's being bottlenecked, and the same with write.

We are on Gen2, with our VMAX being a 20K.  My percentages are a bit misleading, as those are capacities.  What I should have noted were the disk counts, which are 64 EFD, 920 10K SAS in RAID 1, and 424 SATA in RAID 6.

I'll see what architecture I can come up with for PDW v2.  I'll have to confirm, but I believe the Hyper-V hosts are in a cluster, so the VMs would actually access the disk via SMB 3.0 from a clustered shared volume and not exactly locally, since the data could actually reside on another Hyper-V node.

August 1, 2013 5:59 PM
 

jchang said:

In looking over the slide deck (HP AppSystem for Parallel Data Warehouse, from TechEd 2013 Europe), it does appear that the compute nodes are VMs. I am not thrilled about VMs for a core compute-intensive element, in part due to the memory overhead of VMs. If I understand correctly, each compute node also supports VMs for quorum and app fabric, but most of the cycles are for compute. So I suppose it is OK to use VMs for administrative purposes, but a consolidation VM (one physical system supporting very many VMs) is still a stupid idea for DW.

The normal nodes connect directly to storage over SAS. On failure of one node in a compute node pair, one of the spare nodes takes over as compute, connecting to the data over InfiniBand via SMB 3.0 through the other compute node attached to the storage array.

I am more disappointed in the use of SATA 7200rpm 3.5in disks (1, 2, 3TB options) instead of 10K SAS disks (600, 900GB options). Physical volume would have been comparable between 3.5in 3TB SATA and 2.5in 600GB SAS with the right storage enclosure. I consider the cost difference between SATA and 10K SAS relative to capacity to be a red herring because in the big picture, the HW element is small, and anyone stupid enough to fill 3TB SATA disks with important data deserves what happens next.

I do not consider 7200rpm disks an option for normal SQL Server because the query optimizer is hard-locked on an old formula for the random-to-sequential IO ratio (320 IOPS having elapsed time equivalent to a 10.5MB/s scan operation). This was apparent in the TPC-H benchmarks: Sybase could produce a good result with far fewer HDDs than SQL Server because, under the old random-to-sequential ratio, there were still lookup and loop join operations in SQL Server plans, while Sybase heavily favored scans to leverage the capability of (then) modern 15K disks - 100MB/s+ sequential vs 200 IOPS random.
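To make the ratio concrete (a sketch; the cost constants are the commonly cited per-page values for the SQL Server cost model and should be treated as approximate):

```python
# Implied IO rates from the SQL Server query optimizer cost constants
# (seconds per 8KB page; the ratio is what matters, not the absolute values).
random_page_cost = 0.003125     # 1/320 -> assumes 320 IOPS for random IO
seq_page_cost = 0.000740741     # -> assumes ~1350 pages/s sequential

page_kb = 8
random_mb_s = (1 / random_page_cost) * page_kb / 1024   # 2.5 MB/s
seq_mb_s = (1 / seq_page_cost) * page_kb / 1024         # ~10.5 MB/s
print(round(random_mb_s, 1), round(seq_mb_s, 1))
```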

Of course it is possible that in PDW, they have access to the QO, and may have changed it to reflect the underlying HW.

Before you get too gaga over PDW, note that there has not been a TPC-H benchmark result published for PDW, or Columnstore for that matter. There are some queries in TPC-H that PDW cannot do. So it all depends on what you are doing. If your DW is just feeding the MS BI stack (is it still called Analysis Services?), then that's fine, because that's what PDW concentrates on.

Anyways, a good set of questions and comments from GB, with a fully stacked VMAX 20K. I think 920 x 10K is a good configuration. It will not be hard to prove that your SQL Server single-query read (from a data file) is 750MB/s, because it goes over only one of the FC ports. It would require a second query to hit 1500MB/s of read, and combined read/write to exceed 1.5GB/s. And it can be demonstrated that your 4-socket SQL Server could easily consume 4-5GB/s if that IO bandwidth were available.

Given that you have already purchased the expensive elements (1. the VMAX 8 engines and 1440 disks, 2. SQL Server licenses, 3. server hardware), you only need a few more dual-port FC HBAs, at about $2K each, to achieve high bandwidth, assuming there are spare FC switch ports. It will take some DBA work to rebuild the database objects, and some details on the VMAX pool configuration.

August 2, 2013 9:42 AM
 

Sirinath said:

Great details and write. Excellent work. Thank you.

November 11, 2013 11:11 PM
 

Mark Kulacz said:

Hi Joe - To the best of my knowledge, the VMAX 20k is essentially the same as the original VMAX from April 2009. The VMAXe, released in early/mid 2011 (and re-labeled the VMAX 10k in mid-2012), was updated in Jan 2013 to use the new Westmere CPU, updated PCIe, etc. But the VMAX 20k was not updated.

It is quite confusing. This would help explain the challenges you ran into when calculating bus speeds of the different models.

I'm also mystified how the current (gen2) VMAX 10k can have 200GB/s of aggregate interconnect bandwidth, but the VMAX 40k is 400GB/s, even though the VMAX 40k has 4 matrices and twice the number of engines. Either it is an EMC typo, or two of the four matrices are in a full standby-only mode. Any thoughts?

November 19, 2013 5:07 AM
 

Lee H. said:

Is it possible to get information from the SAN side as to which server each storage group is connected to? I need to correlate the storage groups and volumes to servers on VMAX storage.

January 9, 2014 11:25 AM
 

jchang said:

Your SAN admin would have this information. Good luck getting it. In many cases, the SAN admin will absolutely refuse to provide this information. I was on one project where an Oracle team tried to get this for months without success. Finally, they said screw this, we're getting Exadata (probably the full Oracle Database Machine).

January 9, 2014 3:59 PM
