I generally do not get involved in high-end SAN systems.
It is almost impossible to find meaningful information on the hardware architecture from the vendor.
And it is just as impossible to get configuration information from the SAN admin.
The high-end SAN is usually a corporate resource managed in a different
department from the database.
The SAN admin is generally hard set on implementing SAN vendor doctrine of "Storage as a Service"
and does not care to hear input on special considerations from the database team.
In addition to unpredictable or poor performance and sometimes both,
it is often necessary for fight for every GB of storage space via email requests,
filling out forms or endless meetings.
This is especially ridiculous because storage capacity at the component level is a really cheap.
It only becomes a precious resource in a SAN.
Still, I am expected to answer questions on what is wrong with SQL Server when there are
performance problems against a magically all-powerful enterprise SAN,
so this is my best understanding.
The example I am using is the EMC Symmetrix line, but the concepts here could be applied to other systems if details were available.
The EMC Symmetrix VMAX was introduced in 2009 using Intel Core2 architecture processors (45nm Penryn)
with RapidIO fabric.
A second generation came out in 2012,
with the VMAX 10K, 20K and 40K models using Intel Xeon 5600 (32nm Westmere) processors.
The predecessor to the VMAX was the Symmetrix DMX-4,
which used PPC processors and a cross-bar architecture connecting front-end,
memory and back-end units.
The basic information here is from the EMC documents.
Because the details on the internal architecture of the VMAX
are not found in a single authoritative source,
much of it has to be pieced together.
Some of the assessments here are speculation, so anyone with hard knowledge is invited to provide corrections.
The original VMAX architecture is comprised of up to 8 engines.
Each engine is comprised of a pair of directors.
Each director is a 2-way quad-core Intel Xeon 5400 system with up to 64GB memory
(compared with 16GB for the CLARiiON CX4-960).
Each director has 8 back-end 4Gb/s FC ports (comprised of quad-port HBAs?)
and various options for the front-end including 8 x 4Gb/s FC ports.
The engine with 2 directors has 16 back-end FC ports (2 ports making 1 loop)
and can have 16 ports on the front-end in the FC configuration.
Assuming 375MB/s net realizable throughput with 4Gbps FC,
each director could support an aggregate of 3.0GB/s on both the front and back-end ports.
In the full VMAX system of 8 engines (16 directors) with FC front-end configuration
there are 128 x 4Gb/s FC ports on the front and back ends.
Then in theory, the combined front-end and back-end bandwidth
of the full system is 16 x 3.0GB/s (or 128 x 375MB/s) = 48 GB/s.
Of course, there is no documentation on the actual sequential (or large block) IO capability of the V-Max system.
There is an EMC VMAX Oracle document mentioning 10GB/s on 2 engines (not sure whether this is the 2009 VMAX or the second generation VMAX).
To support the above director,
I would guess that the system architecture should have 6 x8 PCI-E slots.
Based on a quad-port FC HBA, the 8 back-end ports requires 2 x8 slots,
and there are also 2 x8 slots for the front-end for any supported interface.
Without discussing the nature of the interconnect between directors in an engine,
and the Virtual Matrix Interface, I am supposing that each requires one x8 slot.
The above diagram does show a connection between the two directors in one engine.
So there should be 2 back-end, 2 front-end, 1 VM and 1 director-director x8 PCI-E slots in all.
It could also be presumed that the slots are not connected through an expander,
as this would result in an arrangement with unbalanced bandwidth.
At this point I would like to digress to review the Intel Core2 system architecture.
The original memory controller hub (MCH or chipset) for the 2-socket Core2 system
was the 5000P in 2006, 1432-pins.
The 5000P has 24 PCI-E lanes and the ESI, which is equivalent to 4 lanes.
So this is clearly inadequate to support the VMAX director.
In late 2007-early 2008 or so, late in the product life of the Core2 architecture processors,
Intel produced the 5400 MCH chipset, codename Seaburg, with 1520-pins
supporting 36 PCI-E lanes plus the ESI, equivalent to 4 PCI-E lanes.
This MCH chipset was not used by any server system vendor,
so why did Intel make it if there were no apparent customers?
It is possible the 5400 MCH was built specifically to the requirements of the
high-end storage system vendors.
I mentioned this briefly in
System Architecture 2011 Q3.
The 5400 MCH can support 5 x8 PCI-E slots.
I think this is done by using the ESI plus 1 x4 on the upstream side of the Enterprise South Bridge
to support x8 on the downstream side.
So there is something wrong with my estimate to the PCI-E slot count required for the VMAX engine.
When the original EMC VMAX came out in 2009, I could find no documentation on
the Virtual Matrix interface.
I had assumed it was Infini-band, as FC would not have been suitable on bandwidth or protocol support.
Later I came across a slide deck illustrating VMI implemented with an ASIC connecting
x8 PCI-E to RapidIO.
The second generation VMAX specification sheets explicitly lists RapidIO as the interconnect fabric.
RapidIO is an open-standard switched fabric.
In short, RapidIO has protocols for additional functionality that was not necessary in PCI-E,
a point-to-point protocol. (Some of these may have been added to PCI-E in later versions?)
RapidIO can "seamlessly encapsulate PCI-E".
The other aspect of RapidIO is that the packet overhead is far lower than Ethernet layer 2,
and even more so than Ethernet layer 2 plus layer 3 (IP) plus layer 4 (TCP)
as there is no requirement to handle world-wide networks.
The RapidIO protocol overhead is also slightly lower than PCI-E.
The first version of serial RapidIO supported 1.25, 2.5 and 3.125 Gbaud, and x1 and x4 links.
Version 2 added 5 and 6.25 Gbaud and x2, x8 and x16 links.
The diagram below is for the original VMAX using two Xeon L5410 processors.
I neglected to note the source, so some help on this would be appreciated.
In the diagram above, the VMI ASIC is connected to x8 PCI-E to the director system,
and 2 x4 RapidIO for the interconnect.
The RapidIO encoded data rate is 3.125GHz. The data rate before 8b/10b encoding is 2.5Gb/s per lane
or 1.25GB/s bandwidth for the x4 connection in each direction.
The bandwidth per connection cited at 2.5GB/s full duplex is the combined bandwidth in each direction
on the RapidIO side.
The bandwidth on the PCI-E side is 2.5Gbit/s raw, or 2Gbps unencoded data (8b/10b)
for 2.0GB/s on the x8 slot.
This is the nominal bandwidth of the full PCI-E packet including header and payload.
The PCI-E packet overhead is 22 bytes.
The net bandwidth that I have seen for disk IO on x8 PCI-E gen 1 is 1.6GB/s.
I am not sure what the average payload size was for this.
It could have been 512 bytes, the disk sector size commonly used.
In any case, the packet overhead is much less than 20%,
so there is a difference between the net achievable bandwidth and the net bandwidth after PCI-E packet overhead.
The VMAX diagram above shows one x8 PCI-E for VMI and 4 x8 PCI-E for Disks (Back-end) and front-end channels (HBAs).
The 4 BE and FE slots are labeled at 1.5GB/s each and 6.0GB/s for the set of four.
Presumably this is the 4 x 375MB/s FC bandwidth,
and not the PCI-E x8 bandwidth of 2.0 GB/s including packet overhead.
A dedicated interconnect between the two directors in one engine is not shown.
So this would represent a valid configuration for 5400 MCH,
except that 4 x8 PCI-E should be to the MCH, and only 1 x8 on the ICH
(ICH was the desktop I/O controller hub, ESB was the server version).
The main observation here is that EMC decided it is a waste of time and money
to continue to building custom architecture in silicon when there are alternatives.
It is better to use Intel Xeon (or AMD Opteron) components along with an open-standard fabric.
There are ASIC and FPGA vendors that provide a base PCI-E to RapidIO interface design
that can be customized.
I am presuming the EMC VMI ASIC is built on this.
Below is EMC's representation of the VMAX system, showing 8 engines (16 directors)
interconnected via the Virtual Matrix.
The diagram is pretty, but conveys very little understanding of what it is.
Knowing that the Virtual Matrix interface is RapidIO is all that we need to know.
The Virtual Matrix is a RapidIO switch, or rather a set of RapidIO switches.
Each of 16 directors is connected to the VM with 2 x4 RapidIO ports.
A single switch with 128 (16x2x4) RapidIO lanes could connect the full VMAX system.
A second possibility is two switches with 64 (16x4) RapidIO lanes. Each switch connects
one x4 port on each director.
Other possibilities with fewer than 64 lanes include 8 switches of 16 lanes,
or some arrangement involving more than 1 switch between directors.
IDT makes RapidIO switches and PCI-E to RapidIO bridges (not to mention PCI-E switches).
There are other vendors that make RapidIO switches and I do not know the source for the EMC VMAX.
The RapidIO switches are available with up to 48 lanes as shown below.
I am not sure if there are any with 64 lanes?
There is an IDT PCIe switch with 64 lanes in a 1156-pin BGA.
IDT describes their 48-port RapidIO switch, capable of operating at 6.25Gbaud,
as having 240Gb/s Throughput.
So they felt it was more correct to cite the unencoded bandwidth, single direction,
not the full duplex, and not the encoded data rate.
The diagram below shows the full VMax system comprising 11 racks with the maximum disk configuration!
The center rack is for the VMax engines, the other 10 are storage bays.
Each storage bay can hold up to 240 drives.
There are 160 disk array enclosures, 64 directly connected, and 96 daisy chained.
There are 8 VMax engines, with the disk enclosures in matching color.
The 2009 VMAX only supported 3.5in drives initially?
(I misplaced or did not keep the original VMAX documentation, oops)
The back-end interface on both the original and second generation (!@#$%^&) VMAX is 4Gbps FC.
The 3.5in disk drives are also FC.
The 2.5in disk drives for the second generation VMAX is listed as SAS,
so presumably the disk enclosure converts the external FC interface to SAS internally.
There are flash drive options for both 3.5in and 2.5in, the 3.5in being FC and the 2.5in SAS?
The mid-range VNX moved off FC disks in 2011.
Perhaps the size of the VMAX with all 11 racks is beyond the cable limits of SAS?
But why 4Gb/s FC and not 8Gbps?
Is this to maintain compatibility with the previous generation DMX?
I am inclined to think it is not a good idea to saddle a new generation with the baggage
from the older generation.
Perhaps in the next generation FC on the back-end would be replaced by SAS?
VMAX Second Generation (2012)
The second generation EMC VMAX employs the Intel Xeon 5600 series (Westmere-EP) processors
with up to six cores.
There are three series, the VMAX 10K, 20K and 40K.
The complete system is comprised of one or more engines.
There can be up to 8 engines in the 20K and 40K and up to 4 engines in the 10K.
Each engine is comprised of 2 directors.
A director is a computer system.
The 10K director originally had a single quad-core processor; later versions have a single six-core processor.
The 20K director has two quad-core processors.
The 40K director has two six-core processors.
Both the 10K and 20K (directors) have dual Virtual Matrix Interface (VMI or just VM?).
The 40K (director) has quad-VM.
It is very hard to find useful detailed SAN system architecture information.
I came across the following from an EMC VMAX 40K Oracle presentation,
which appears to be just an update of the original
VMAX engine diagram to the second generation VMAX 40K.
But notice that the interconnect between the two directors (2-socket servers) is labeled as CMI-II.
CMI is of course the acronym for CLARiiON Messaging Interface,
which in turn was once Configuration Manager Interface (prior to marketing intervention?).
This makes sense.
There is no reason to develop different technologies to perform the same function
in the two product lines.
So the separation between VNX and VMAX is that the latter has VMI to cluster
multiple engines together.
Along the same lines, does there need to be a difference in the chips to perform the CMI
and VMI functions? It does not matter if the software stacks are different.
To support the VMAX 40K director, there should be 2 x8 PCI-E slots each for both
the front-end and back-end ports as before in the original VMAX.
I am also assuming a single x8 PCI-E slot for the CMI-II.
The difference is that the 40K director needs 2 x8 PCI-E slots to support 4 VM connections, each x4 RapidIO.
This makes a total of 7 x8 PCI-E slots.
The 2-socket Xeon 5600 system architecture is shown below with two 5520 IOH devices
each supporting 36 PCI-E gen2 lanes for 72 lanes total, not counting the ESI (equivalent to PCI-E gen 1 x4).
The full Xeon 5600 system can support 8 PCI-E gen2 x8 slots, plus 2 gen2 x4
(because the extra x4 on each IOH cannot be combined into a single x8?).
So this time there are more PCI-E slots than necessary.
Note also that all of these are PCI-E gen2 slots.
The back-end FC on the 2nd generation VMAX is still 4Gb/s FC.
The front-end FC can be 8Gbps FC.
It could be that all FC HBAs in the second generation can support 8Gbps,
just that the back-end ports operate at 4Gbps?
Virtual Matrix and RapidIO
The original VMAX used RapidIO at 3.135 Gbaud.
After 8b/10b encoding, the un-encoded data rate is 2.5Gbps.
In a x4 link, the combined data rate is 10 Gbit/s or 1.25 GByte/s.
As with modern serial protocols, data transmission is simultaneous bi-directional.
So the bandwidth in both directions combined is 2.5GB/s full duplex.
In a server system, citing full duplex bandwidth for storage is not meaningful
because IO is almost always heavily in one direction (except for backups directly to disk).
However, it should be pointed out that the bi-directional capability is immensely valuable
because the primary stream is not disrupted by minor traffic in the opposite direction
(including acknowledgement packets).
Just do not confuse this with the full duplex bandwidth being a useful value.
In a storage system, it could be legitimate to cite the full duplex bandwidth
for the engine, because each engine could be simultaneously processing data in-bound from
and out-bound to other engines.
So the engine must be able to handle the full duplex bandwidth.
Now considering the complete storage system,
any traffic that leaves one engine must arrive at another engine.
The total traffic is the sum of the bandwidth a single direction.
So it is misleading to cite the sum total full duplex bandwidth.
But marketing people can be relied upon to mislead,
and we can trust marketing material to be misleading.
The VMI ASIC bridges 8 PCI-E lanes to 8 RapidIO lanes.
In the original VMAX, this PCI-E gen 1 to RapidIO at 3.125 Gbaud.
In the second generation VMAX with Westmere-EP processors,
the PCI-E is gen2 and RapidIO is now presumed to be 6.25 Gbaud.
PCI-E gen1 is 2.5Gbps and gen2 is 5Gbps.
I suppose that there is a good reason RapidIO was defined to 3.125 Gbaud
at the time PCI-E was 2.5Gbps.
Consider sending data from one system to another.
In the first system, data is first transmitted over PCI-E (hop 1),
A device converts the data to be transmitted over RapidIO (hop 2).
At the other end, a device converts back for transmission over PCI-E (hop 3)
to the final destination.
It would seem reasonable that if all interfaces had equal data rates,
there would be some loss of efficiency due to the multiple hops.
So for lack of hard analysis I am just speculating that there
was a deliberate reason in the RapidIO specification.
Another speculation is that it was known that RapidIO would be interconnecting systems with PCI-E,
and the extra bandwidth would allow encapsulated PCI-E packets
on RapidIO with the upstream and downstream PCI-E ports to be running at full bandwidth?
The point of the above discussion is that the bandwidth on the RapidIO
of the VMI ASIC is less material to the storage professional.
The bandwidth on the PCI-E side is closer to net storage IO bandwidth.
In the table below, I am trying to make sense of the Virtual Matrix bandwidth
of the original VMAX, and the second generation VMAX 10K, 20K and 40K.
The original VMAX 2009 had 3.125 Gbaud RapidIO, so each x4 link had 1.25GB/s unencoded bandwidth
Each director has dual Virtual Matrix, so the combined full duplex
bandwidth of 4 VM for the engine is 10GB/s unencoded.
The combined full duplex bandwidth on the PCI-E side is 8GB/s per engine.
|x4 link bi-dir
|Engine VM BW
|System VM BW
The second generation VMAX should be on RapidIO at 6.25 Gbaud and PCI-E gen 2 at 5Gbps.
The VMAX 40K
specification sheet cites Virtual Matrix bandwidth of 50GB/s for the engine
and the full system with 8 engines VM at 400GB/s.
The VMAX 20K
specification sheet cites VM bandwidth of 24GB/s for the engine and the full system with 8 engines VM at 192GB/s.
The VMAX 10K
specification sheet cites the full system (4 engines) VM bandwidth at 200GB/s,
implying a single engine at VM bandwidth of 50GB/s.
Given that the VMAX 40K has twice as many Virtual Matrix interfaces and double the signaling rate,
the cited VM value of 50GB/s can only mean the bi-directional encoded rate of 6.25 Gbaud
over 8 x4 lanes on the RapidIO side.
The VMAX 20K value of 24GB/s is perplexing.
Why is it not the full duplex rate of 25GB/s for 6.25 GBaud over 8 x4 lanes?
The VMAX 10K system value of 200GB/s is also perplexing.
There are only 4 engines maximum, meaning each engine would be 50GB/s.
The other documents or slide decks indicate the VMAX 10K director is dual VMI?
So the VM bandwidth should be 25GB/s full duplex encoded?
On the assumption that the VMAX 40K engine has 50GB/s Virtual Matrix encoded full duplex bandwidth,
then the unencoded bi-directional bandwidth is 40GB/s on the RapidIO side,
and the unencoded bi-directional bandwidth is 32GB/s on the PCI-E side,
corresponding to 4 x 8 PCI-E gen 2 lanes.
So the useful bandwidth for the engine VM is 16GB/s single direction.
Bandwidth Calculation and Speculation
For lack of hard data on what the VMAX IO bandwidth capability actually is,
I will speculate.
The original VMAX director could have 8 x 4Gbps FC ports on both front-end and back-end.
As discussed above, based on 375MB/s for each 4Gbps FC,
the director FE and BE bandwidth is then 3.0 GB/s.
I will further assume that the sole purpose of the CMI-II between the two directors in each engine is to
maintain a duplicate of the write cache for fault tolerance.
This means all other traffic between directors must go through the VMI.
In the circumstance that every I/O request coming to a particular port on one of the directors
access data only RAID groups directly attached to that director,
then we would have 100% locality and the would be nearly zero traffic over the VM.
Not only is this highly improbable and extremely difficult to contrive,
it also goes against one of the key principles of the central SAN argument.
The idea is to pool a very large number of disks into one system such that every volume
from all hosts could access the aggregate IOPS capability of the complete storage system.
A RAID group must be built only from the disks directly attached to the director.
So the aggregate concept is achieved by pooling all RAID groups together.
Volumes are created by taking a (small) slice of each RAID group across all directors.
Each volume now has access to the IOPS capability of the entire set of disks.
This is why the SAN shared storage concept is valid for transaction processing systems
but not for DW systems that would benefit from sequential large block IO.
In this scenario, the presumption is that IO requests arriving at any director
are evenly distributed to all directors.
In the full system of 8 engines, 16 directors, 6.25% (1/16) of IO are on local disks
accessed via the back-end ports
and 93.75% (15/16) must come through the VM from the other directors.
Then the SAN system bandwidth is constrained by the more limiting of the Front-end channels,
the backend channels, and the adjusted Virtual Matrix single direction, not full duplex, bandwidth.
The adjustment accounts for the percentage of traffic that must come through the VM.
If the VM must handle 15/16 of the total traffic, then the upper limit is 16/15 times the VM bandwidth.
On the VM, it so happens the PCI-E is more limiting than the RapidIO side,
so quoting bi-directional bandwidth is misleading and so is quoting the RapidIO side bandwidth instead of the PCI-E bandwidth.
The PCI-E bandwidth to VM in the original VMAX is 2.0 GB/s (x8 gen 1) including PCI-E protocol overhead.
The actual net bandwidth is less than 2GB/s
but possibly more than 1.6GB/s cited earlier as the maximum that I have seen in direct attach IO.
This is more limiting than the 3GB/s on the 8 x 4Gbps FC front-end or backend ports.
The second generation VMAX allows 8 x 8Gbps FC ports on the front-end for an aggregate bandwidth of 6GB/s
based on 750MB/s per 8Gbps FC port.
However the back-end ports are still 4Gbps FC for an aggregate of the same 3GB/s in the original VMAX.
The 40K VMAX engine is described as 50GB/s VM bandwidth, not mentioning this is the full-duplex value
encoded on the RapidIO side.
The single direction encoded data rate on a single director is 12.5GB/s.
The unencoded rate 10GB/s on the RapidIO side.
The single direction unencoded rate on the PCI-E side is 8GB/s (16 PCI-E gen 2 lanes).
Still this is much more than either the FE or BE ports.
Note that with fewer engines and corresponding directors, more of the traffic is local.
With 4 engines and 8 directors, the local traffic is 12.5% and 87.5% remote.
With 2 engines and 4 directors, the local traffic is 25% and 75% remote.
All of the above is for read traffic, and does not consider if there are other more limiting elements.
Another consideration is memory bandwidth.
A read from "disk" could be first written to memory, then read from memory.
(the latency due to the CPU-cycles involved is not considered).
An 8-byte wide DDR DRAM channel at 1333MHz has 10.7GB/s bandwidth, but this is only for memory read.
The memory write bandwidth to SDR/DDR is one-half the nominal bandwidth.
In the really old days, a disk access involving a memory write followed by a memory read would be constrained
by one-third of the nominal memory bandwidth.
Intel server systems from 2006 or so on used memory with a buffer chip that is described as supporting
simultaneous read at the nominal bandwidth and write at one-half the nominal bandwidth.
In writes to storage, the write IO is first sent to memory on the local director, then copied across the
CMI-II(?) to the other director in the same engine?
So the net bandwidth across the CMI is also limiting.
Now that SQL Server 2012 allows clustering (AlwaysOn?) with tempdb on local (preferably SSD) storage,
I recommend this to avoid burdening the SAN with writes.
Or a SAN vendor can bother to understand the nature of tempdb and allow write cache mirroring to be selectively disabled?
Even with all this, there is not a definitive statement from EMC on the actual bandwidth capability
of the VMAX, original or
extra-crispy second generation.
Some slides mention a 3X increase in bandwidth.
Was that a particular element, or the realizable bandwidth?
Is it possible that the original VMAX could do only 1/3 the back-end aggregate of 48GB/s,
and that the second generation can do the full back-end limit?
Regardless of the SAN, focus on the IOPS and bandwidth that can be realized by actual SQL Server queries.
SAN vendor big meaningless numbers are not helpful.
The standard SAN vendor configuration should be able to produce reasonable IOPS,
but will probably be sadly deficient on bandwidth that can be realized by SQL Server.
I do like SSD.
I do not like auto-tiering, flash-cache or 7200RPM drives for the main Line of Business database.
It should be the database administrators responsibility to isolate hot data with filegroups and partitioning.
Considering that a 10K 900GB drive lists for $600,
why bother with the 7200RPM (3TB) drive in an enterprise system,
unless it is because the markup is so ridiculously high?
Or perhaps data that needs to be on 7200RPM drives for cost
should not be on an host-cost enterprise storage system?)
If there are SSDs, these should be made available as pure SSD.
Also extremely important is how SQL Server IO works.
First, FC 8Gbps is 8 Giga-bits/sec, not Giga-Bytes/s. After encoding overhead, the net single direction BW is 700-750MB/s.
On modern server hardware, SQL Server can easily consume data at 5-10GB/s, ie, single direction, so 8-16 x 8Gbps FC ports are recommended for multi-ten TB DW. Furthermore, use serious configuration documents, not the stupid idiots who say 1 LUN each for data, temp, log.
The should be 1 volume per FC path to be shared between data and temp. For OLTP, log should be on a separate volume with dedicated physical HDD (ie, not a volume carved from a shared pool) and even possibly its own FC path. For DW, log can share a path with the data volumes. Each critical FG has one file on each of the data volumes. And don't forget the -E startup parameter.
Symmetrix DMX (pre-2009)
The old DMX-4 architecture below. The front-end and back-end units used PPC processors(?)
connected with a matrix to memory controllers?
A history of the EMC Symmetrix product line can be found on
There can be up to 8 front-end units. Each FE can have 8 FC ports for a total of 64 FE ports?
Assuming that this was designed for 4Gbps FC, with a realizable bandwidth 375MB/s on each 4Gbps FC port,
then each FE unit of 8 ports would in theory have a maximum BW of 3.0GB/sec.
The combined 8 FE units with 64 ports total would a have theoretical upper bound of 24GB/s.
Now it is possible that the DMX was originally design to 2Gbps FC, for an upper bound design of 12GB/s.
Various EMC documents mentions the interconnect bandwidth as a sum total of the individual component bandwidths.
But nowhere in EMC document is there a mention of the actual DMX bandwidth capability.
I have heard that due to some internal architecture aspect, the actual bandwidth capability of the DMX is in fact 3GB/s.
Lonny Niederstadt provided the following link