
Joe Chang

  • Enterprise Storage Systems - EMC VMAX

    I generally do not get involved in high-end SAN systems. It is almost impossible to find meaningful information on the hardware architecture from the vendor. And it is just as impossible to get configuration information from the SAN admin. The high-end SAN is usually a corporate resource managed in a different department from the database.

    The SAN admin is generally hard set on implementing the SAN vendor doctrine of "Storage as a Service" and does not care to hear input on special considerations from the database team. In addition to unpredictable or poor performance, and sometimes both, it is often necessary to fight for every GB of storage space via email requests, filling out forms, or endless meetings. This is especially ridiculous because storage capacity at the component level is really cheap. It only becomes a precious resource in a SAN.

    Still, I am expected to answer questions on what is wrong with SQL Server when there are performance problems against a magically all-powerful enterprise SAN, so this is my best understanding. The example I am using is the EMC Symmetrix line, but the concepts here could be applied to other systems if details were available.

    The EMC Symmetrix VMAX was introduced in 2009 using Intel Core2 architecture processors (45nm Penryn) with RapidIO fabric. A second generation came out in 2012, with the VMAX 10K, 20K and 40K models using Intel Xeon 5600 (32nm Westmere) processors. The predecessor to the VMAX was the Symmetrix DMX-4, which used PPC processors and a cross-bar architecture connecting front-end, memory and back-end units.

    The basic information here is from the EMC documents. Because the details on the internal architecture of the VMAX are not found in a single authoritative source, much of it has to be pieced together. Some of the assessments here are speculation, so anyone with hard knowledge is invited to provide corrections.

    VMAX (2009)

    The original VMAX architecture is comprised of up to 8 engines. Each engine is comprised of a pair of directors. Each director is a 2-way quad-core Intel Xeon 5400 system with up to 64GB memory (compared with 16GB for the CLARiiON CX4-960).

    VMaxEngine

    Each director has 8 back-end 4Gb/s FC ports (comprised of quad-port HBAs?) and various options for the front-end including 8 x 4Gb/s FC ports.

    The engine with 2 directors has 16 back-end FC ports (2 ports making 1 loop) and can have 16 ports on the front-end in the FC configuration. Assuming 375MB/s net realizable throughput with 4Gbps FC, each director could support an aggregate of 3.0GB/s on both the front and back-end ports.

    In the full VMAX system of 8 engines (16 directors) with FC front-end configuration there are 128 x 4Gb/s FC ports on the front and back ends. Then in theory, the combined front-end and back-end bandwidth of the full system is 16 x 3.0GB/s (or 128 x 375MB/s) = 48 GB/s.
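    As a sanity check on these figures, here is the arithmetic as a small Python sketch; the only input is the 375MB/s-per-4Gbps-FC-port assumption stated above.

    ```python
    # Aggregate FC bandwidth estimate for the original VMAX, assuming 375 MB/s
    # net realizable throughput per 4Gb/s FC port (front-end or back-end side).
    FC_4G_NET_MBPS = 375
    PORTS_PER_DIRECTOR = 8          # 8 FC ports per director, per side
    DIRECTORS_PER_ENGINE = 2
    MAX_ENGINES = 8

    director_GBps = FC_4G_NET_MBPS * PORTS_PER_DIRECTOR / 1000   # 3.0 GB/s
    engine_GBps = director_GBps * DIRECTORS_PER_ENGINE           # 6.0 GB/s
    system_GBps = engine_GBps * MAX_ENGINES                      # 48.0 GB/s
    print(director_GBps, engine_GBps, system_GBps)
    ```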

    Of course, there is no documentation on the actual sequential (or large block) IO capability of the V-Max system. There is an EMC VMAX Oracle document mentioning 10GB/s on 2 engines (not sure whether this is the 2009 VMAX or the second generation VMAX).

    To support the above director, I would guess that the system architecture should have 6 x8 PCI-E slots. Based on a quad-port FC HBA, the 8 back-end ports require 2 x8 slots, and there are also 2 x8 slots for the front-end for any supported interface.

    Without discussing the nature of the interconnect between directors in an engine, and the Virtual Matrix Interface, I am supposing that each requires one x8 slot. The above diagram does show a connection between the two directors in one engine.

    So there should be 2 back-end, 2 front-end, 1 VM and 1 director-director x8 PCI-E slots in all. It could also be presumed that the slots are not connected through an expander, as this would result in an arrangement with unbalanced bandwidth.

    At this point I would like to digress to review the Intel Core2 system architecture. The original memory controller hub (MCH or chipset) for the 2-socket Core2 system was the 5000P in 2006, 1432-pins. The 5000P has 24 PCI-E lanes and the ESI, which is equivalent to 4 lanes. So this is clearly inadequate to support the VMAX director.

    5000P MCH

    In late 2007-early 2008 or so, late in the product life of the Core2 architecture processors, Intel produced the 5400 MCH chipset, codename Seaburg, with 1520-pins supporting 36 PCI-E lanes plus the ESI, equivalent to 4 PCI-E lanes.

    5400MCH

    This MCH chipset was not used by any server system vendor, so why did Intel make it if there were no apparent customers? It is possible the 5400 MCH was built specifically to the requirements of the high-end storage system vendors. I mentioned this briefly in System Architecture 2011 Q3.

    The 5400 MCH can support 5 x8 PCI-E slots. I think this is done by using the ESI plus 1 x4 on the upstream side of the Enterprise South Bridge to support x8 on the downstream side. So there is something wrong with my estimate of the PCI-E slot count required for the VMAX engine.

     

    When the original EMC VMAX came out in 2009, I could find no documentation on the Virtual Matrix interface. I had assumed it was InfiniBand, as FC would not have been suitable on bandwidth or protocol support. Later I came across a slide deck illustrating VMI implemented with an ASIC connecting x8 PCI-E to RapidIO. The second generation VMAX specification sheets explicitly list RapidIO as the interconnect fabric.

    RapidIO is an open-standard switched fabric. In short, RapidIO has protocols for additional functionality that was not necessary in PCI-E, a point-to-point protocol. (Some of these may have been added to PCI-E in later versions?) RapidIO can "seamlessly encapsulate PCI-E". The other aspect of RapidIO is that the packet overhead is far lower than Ethernet layer 2, and even more so than Ethernet layer 2 plus layer 3 (IP) plus layer 4 (TCP) as there is no requirement to handle world-wide networks. The RapidIO protocol overhead is also slightly lower than PCI-E.

    The first version of serial RapidIO supported 1.25, 2.5 and 3.125 Gbaud, and x1 and x4 links. Version 2 added 5 and 6.25 Gbaud and x2, x8 and x16 links.

    The diagram below is for the original VMAX using two Xeon L5410 processors. I neglected to note the source, so some help on this would be appreciated.

    VMax Director

    In the diagram above, the VMI ASIC is connected by x8 PCI-E to the director system, and by 2 x4 RapidIO links for the interconnect. The RapidIO encoded signaling rate is 3.125 Gbaud per lane. The data rate before 8b/10b encoding is 2.5Gb/s per lane, or 1.25GB/s bandwidth for the x4 connection in each direction. The 2.5GB/s full duplex bandwidth cited per connection is the combined bandwidth of both directions on the RapidIO side.

    The bandwidth on the PCI-E side is 2.5Gbit/s raw per lane, or 2Gbps unencoded data (8b/10b), for 2.0GB/s on the x8 slot. This is the nominal bandwidth of the full PCI-E packet, including header and payload. The PCI-E packet overhead is 22 bytes.
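    The encoding arithmetic for the two sides of the VMI ASIC works out as follows (a sketch; both interfaces use 8b/10b encoding as discussed above).

    ```python
    # 8b/10b encoded links: usable data rate is 8/10 of the signaling rate.
    def unencoded_gbps(signaling_rate):
        return signaling_rate * 8 / 10

    # RapidIO side of the VMI ASIC: x4 link at 3.125 Gbaud per lane.
    rio_x4_GBps = unencoded_gbps(3.125) * 4 / 8     # 1.25 GB/s per direction
    print(rio_x4_GBps, rio_x4_GBps * 2)             # 1.25, and 2.5 GB/s "full duplex"

    # PCI-E gen 1 side: x8 at 2.5 Gbps raw per lane, including packet overhead.
    pcie_x8_GBps = unencoded_gbps(2.5) * 8 / 8      # 2.0 GB/s per direction
    print(pcie_x8_GBps)
    ```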

    The net bandwidth that I have seen for disk IO on x8 PCI-E gen 1 is 1.6GB/s. I am not sure what the average payload size was for this. It could have been 512 bytes, the disk sector size commonly used. In any case, the packet overhead is much less than 20%, so there is a difference between the net achievable bandwidth and the net bandwidth after PCI-E packet overhead.

    The VMAX diagram above shows one x8 PCI-E for VMI and 4 x8 PCI-E for Disks (Back-end) and front-end channels (HBAs). The 4 BE and FE slots are labeled at 1.5GB/s each and 6.0GB/s for the set of four. Presumably this is the 4 x 375MB/s FC bandwidth, and not the PCI-E x8 bandwidth of 2.0 GB/s including packet overhead.

    A dedicated interconnect between the two directors in one engine is not shown. So this would represent a valid configuration for 5400 MCH, except that 4 x8 PCI-E should be to the MCH, and only 1 x8 on the ICH (ICH was the desktop I/O controller hub, ESB was the server version).

    The main observation here is that EMC decided it is a waste of time and money to continue building a custom architecture in silicon when there are alternatives. It is better to use Intel Xeon (or AMD Opteron) components along with an open-standard fabric. There are ASIC and FPGA vendors that provide a base PCI-E to RapidIO interface design that can be customized. I am presuming the EMC VMI ASIC is built on this.

    Below is EMC's representation of the VMAX system, showing 8 engines (16 directors) interconnected via the Virtual Matrix.

    VMax Matrix

    The diagram is pretty, but conveys very little understanding of what it is. Knowing that the Virtual Matrix interface is RapidIO is all that we need to know. The Virtual Matrix is a RapidIO switch, or rather a set of RapidIO switches.

    Each of 16 directors is connected to the VM with 2 x4 RapidIO ports. A single switch with 128 (16x2x4) RapidIO lanes could connect the full VMAX system. A second possibility is two switches with 64 (16x4) RapidIO lanes. Each switch connects one x4 port on each director. Other possibilities with fewer than 64 lanes include 8 switches of 16 lanes, or some arrangement involving more than 1 switch between directors.

    IDT makes RapidIO switches and PCI-E to RapidIO bridges (not to mention PCI-E switches). There are other vendors that make RapidIO switches and I do not know the source for the EMC VMAX. The RapidIO switches are available with up to 48 lanes as shown below.

    I am not sure if there are any with 64 lanes? There is an IDT PCIe switch with 64 lanes in a 1156-pin BGA. IDT describes their 48-port RapidIO switch, capable of operating at 6.25Gbaud, as having 240Gb/s Throughput. So they felt it was more correct to cite the unencoded bandwidth, single direction, not the full duplex, and not the encoded data rate.

    The diagram below shows the full VMax system comprising 11 racks with the maximum disk configuration!

    VMax Full Config

    The center rack is for the VMax engines, the other 10 are storage bays. Each storage bay can hold up to 240 drives. There are 160 disk array enclosures, 64 directly connected, and 96 daisy chained. There are 8 VMax engines, with the disk enclosures in matching color.

    The 2009 VMAX only supported 3.5in drives initially? (I misplaced or did not keep the original VMAX documentation, oops.) The back-end interface on both the original and second generation (!@#$%^&) VMAX is 4Gbps FC. The 3.5in disk drives are also FC. The 2.5in disk drives for the second generation VMAX are listed as SAS, so presumably the disk enclosure converts the external FC interface to SAS internally. There are flash drive options for both 3.5in and 2.5in, the 3.5in being FC and the 2.5in SAS?

    The mid-range VNX moved off FC disks in 2011. Perhaps the size of the VMAX with all 11 racks is beyond the cable limits of SAS? But why 4Gb/s FC and not 8Gbps? Is this to maintain compatibility with the previous generation DMX? I am inclined to think it is not a good idea to saddle a new generation with the baggage from the older generation. Perhaps in the next generation FC on the back-end would be replaced by SAS?

    VMAX Second Generation (2012)

    The second generation EMC VMAX employs the Intel Xeon 5600 series (Westmere-EP) processors with up to six cores. There are three series, the VMAX 10K, 20K and 40K. The complete system is comprised of one or more engines. There can be up to 8 engines in the 20K and 40K and up to 4 engines in the 10K.

    Each engine is comprised of 2 directors. A director is a computer system. The 10K director originally had a single quad-core processor; later versions have a single six-core processor. The 20K director has two quad-core processors. The 40K director has two six-core processors. Both the 10K and 20K (directors) have dual Virtual Matrix Interface (VMI or just VM?). The 40K (director) has quad-VM.

    It is very hard to find useful detailed SAN system architecture information. I came across the following from an EMC VMAX 40K Oracle presentation, which appears to be just an update of the original VMAX engine diagram to the second generation VMAX 40K.

    VMaxEngineII

    But notice that the interconnect between the two directors (2-socket servers) is labeled as CMI-II. CMI is of course the acronym for CLARiiON Messaging Interface, which in turn was once Configuration Manager Interface (prior to marketing intervention?). This makes sense. There is no reason to develop different technologies to perform the same function in the two product lines. So the separation between VNX and VMAX is that the latter has VMI to cluster multiple engines together.

    Along the same lines, does there need to be a difference in the chips to perform the CMI and VMI functions? It does not matter if the software stacks are different.

    To support the VMAX 40K director, there should be 2 x8 PCI-E slots each for both the front-end and back-end ports as before in the original VMAX. I am also assuming a single x8 PCI-E slot for the CMI-II. The difference is that the 40K director needs 2 x8 PCI-E slots to support 4 VM connections, each x4 RapidIO. This makes a total of 7 x8 PCI-E slots.

    The 2-socket Xeon 5600 system architecture is shown below with two 5520 IOH devices each supporting 36 PCI-E gen2 lanes for 72 lanes total, not counting the ESI (equivalent to PCI-E gen 1 x4).

    5600

    The full Xeon 5600 system can support 8 PCI-E gen2 x8 slots, plus 2 gen2 x4 (because the extra x4 on each IOH cannot be combined into a single x8?). So this time there are more PCI-E slots than necessary. Note also that all of these are PCI-E gen2 slots. The back-end FC on the 2nd generation VMAX is still 4Gb/s FC. The front-end FC can be 8Gbps FC. It could be that all FC HBAs in the second generation can support 8Gbps, just that the back-end ports operate at 4Gbps?

    Virtual Matrix and RapidIO

    The original VMAX used RapidIO at 3.125 Gbaud. After accounting for 8b/10b encoding, the unencoded data rate is 2.5Gbps. In a x4 link, the combined data rate is 10 Gbit/s or 1.25 GByte/s. As with modern serial protocols, data transmission is simultaneous bi-directional. So the bandwidth in both directions combined is 2.5GB/s full duplex.

    In a server system, citing full duplex bandwidth for storage is not meaningful because IO is almost always heavily in one direction (except for backups directly to disk). However, it should be pointed out that the bi-directional capability is immensely valuable because the primary stream is not disrupted by minor traffic in the opposite direction (including acknowledgement packets). Just do not confuse this with the full duplex bandwidth being a useful value.

    In a storage system, it could be legitimate to cite the full duplex bandwidth for the engine, because each engine could be simultaneously processing data in-bound from and out-bound to other engines. So the engine must be able to handle the full duplex bandwidth.

    Now considering the complete storage system, any traffic that leaves one engine must arrive at another engine. The total traffic is the sum of the bandwidth a single direction. So it is misleading to cite the sum total full duplex bandwidth. But marketing people can be relied upon to mislead, and we can trust marketing material to be misleading.

    The VMI ASIC bridges 8 PCI-E lanes to 8 RapidIO lanes. In the original VMAX, this is PCI-E gen 1 to RapidIO at 3.125 Gbaud. In the second generation VMAX with Westmere-EP processors, the PCI-E is gen2 and the RapidIO is now presumed to be 6.25 Gbaud. PCI-E gen1 is 2.5Gbps and gen2 is 5Gbps.

    I suppose that there is a good reason RapidIO was defined at 3.125 Gbaud at the time PCI-E was 2.5Gbps. Consider sending data from one system to another. In the first system, data is first transmitted over PCI-E (hop 1). A device converts the data to be transmitted over RapidIO (hop 2). At the other end, a device converts back for transmission over PCI-E (hop 3) to the final destination.

    It would seem reasonable that if all interfaces had equal data rates, there would be some loss of efficiency due to the multiple hops. So for lack of hard analysis I am just speculating that there was a deliberate reason in the RapidIO specification.

    Another speculation is that it was known that RapidIO would be interconnecting systems with PCI-E, and the extra bandwidth would allow encapsulated PCI-E packets on RapidIO with the upstream and downstream PCI-E ports to be running at full bandwidth?

    The point of the above discussion is that the bandwidth on the RapidIO of the VMI ASIC is less material to the storage professional. The bandwidth on the PCI-E side is closer to net storage IO bandwidth.

     

    In the table below, I am trying to make sense of the Virtual Matrix bandwidth of the original VMAX, and the second generation VMAX 10K, 20K and 40K. The original VMAX 2009 had 3.125 Gbaud RapidIO, so each x4 link had 1.25GB/s unencoded bandwidth per direction. Each director has dual Virtual Matrix, so the combined full duplex bandwidth of 4 VM for the engine is 10GB/s unencoded. The combined full duplex bandwidth on the PCI-E side is 8GB/s per engine.

                         Original      10K           20K       40K
    Processor            Core2         Westmere      Westmere  Westmere
    Sockets              2             1             2         2
    Cores                4             4-6           4         6
    VMI/dir              2             2             2         4
    VMI/eng              4             2             4         8
    RapidIO              3.125 Gbaud   6.25 Gbaud    ?         6.25 Gbaud
    Unencoded 8b/10b     2.5 Gbaud     5 Gbaud       ?         5 Gbaud
    x4 link              1.25GB/s      2.5GB/s       ?         2.5GB/s
    x4 link bi-dir       2.5GB/s       5GB/s         ?         5GB/s
    Engine VM BW         10GB/s        50GB/s?       24GB/s    50GB/s
    System VM BW         80GB/s?       200GB/s?      192GB/s   400GB/s

    The second generation VMAX should be on RapidIO at 6.25 Gbaud and PCI-E gen 2 at 5Gbps. The VMAX 40K specification sheet cites Virtual Matrix bandwidth of 50GB/s for the engine and the full system with 8 engines VM at 400GB/s. The VMAX 20K specification sheet cites VM bandwidth of 24GB/s for the engine and the full system with 8 engines VM at 192GB/s. The VMAX 10K specification sheet cites the full system (4 engines) VM bandwidth at 200GB/s, implying a single engine at VM bandwidth of 50GB/s.

    Given that the VMAX 40K has twice as many Virtual Matrix interfaces and double the signaling rate, the cited VM value of 50GB/s can only mean the bi-directional encoded rate of 6.25 Gbaud over 8 x4 lanes on the RapidIO side. The VMAX 20K value of 24GB/s is perplexing. Why is it not the full duplex rate of 25GB/s for 6.25 GBaud over 8 x4 lanes?

    The VMAX 10K system value of 200GB/s is also perplexing. There are only 4 engines maximum, meaning each engine would be 50GB/s. The other documents or slide decks indicate the VMAX 10K director is dual VMI? So the VM bandwidth should be 25GB/s full duplex encoded?

    On the assumption that the VMAX 40K engine has 50GB/s Virtual Matrix encoded full duplex bandwidth, then the unencoded bi-directional bandwidth is 40GB/s on the RapidIO side, and the unencoded bi-directional bandwidth is 32GB/s on the PCI-E side, corresponding to 4 x 8 PCI-E gen 2 lanes. So the useful bandwidth for the engine VM is 16GB/s single direction.
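    Here is that reconciliation worked out as a sketch, under the assumptions above: 2 directors x 4 VM links x 4 RapidIO lanes per engine at 6.25 Gbaud, and 4 x8 PCI-E gen 2 slots feeding the VMI ASICs.

    ```python
    # Reconciling the cited 50GB/s VMAX 40K engine Virtual Matrix bandwidth,
    # assuming 2 directors x 4 VM links x 4 RapidIO lanes at 6.25 Gbaud, and
    # 4 x8 PCI-E gen 2 slots (32 lanes at 5 Gbps) per engine on the PCI-E side.
    RIO_LANES, RIO_GBAUD = 2 * 4 * 4, 6.25
    PCIE_LANES, PCIE_GBPS = 4 * 8, 5.0

    rio_encoded_fd    = RIO_LANES * RIO_GBAUD * 2 / 8            # 50.0 GB/s (the cited value)
    rio_unencoded_fd  = rio_encoded_fd * 8 / 10                  # 40.0 GB/s
    pcie_unencoded_fd = PCIE_LANES * PCIE_GBPS * 8 / 10 * 2 / 8  # 32.0 GB/s
    useful_one_way    = pcie_unencoded_fd / 2                    # 16.0 GB/s
    print(rio_encoded_fd, rio_unencoded_fd, pcie_unencoded_fd, useful_one_way)
    ```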

     

     

    Bandwidth Calculation and Speculation

    For lack of hard data on what the VMAX IO bandwidth capability actually is, I will speculate. The original VMAX director could have 8 x 4Gbps FC ports on both front-end and back-end. As discussed above, based on 375MB/s for each 4Gbps FC, the director FE and BE bandwidth is then 3.0 GB/s.

    I will further assume that the sole purpose of the CMI-II between the two directors in each engine is to maintain a duplicate of the write cache for fault tolerance. This means all other traffic between directors must go through the VMI.

    In the circumstance that every IO request coming to a particular port on one of the directors accesses data only on RAID groups directly attached to that director, we would have 100% locality and there would be nearly zero traffic over the VM. Not only is this highly improbable and extremely difficult to contrive, it also goes against one of the key principles of the central SAN argument. The idea is to pool a very large number of disks into one system such that every volume from all hosts can access the aggregate IOPS capability of the complete storage system.

    A RAID group must be built only from the disks directly attached to the director. So the aggregate concept is achieved by pooling all RAID groups together. Volumes are created by taking a (small) slice of each RAID group across all directors. Each volume now has access to the IOPS capability of the entire set of disks. This is why the SAN shared storage concept is valid for transaction processing systems but not for DW systems that would benefit from sequential large block IO.

    In this scenario, the presumption is that IO requests arriving at any director are evenly distributed to all directors. In the full system of 8 engines, 16 directors, 6.25% (1/16) of IO are on local disks accessed via the back-end ports and 93.75% (15/16) must come through the VM from the other directors.

    Then the SAN system bandwidth is constrained by the more limiting of the Front-end channels, the backend channels, and the adjusted Virtual Matrix single direction, not full duplex, bandwidth. The adjustment accounts for the percentage of traffic that must come through the VM. If the VM must handle 15/16 of the total traffic, then the upper limit is 16/15 times the VM bandwidth. On the VM, it so happens the PCI-E is more limiting than the RapidIO side, so quoting bi-directional bandwidth is misleading and so is quoting the RapidIO side bandwidth instead of the PCI-E bandwidth.
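    This adjustment can be expressed as a small model. The 3GB/s FE and BE figures are from above, and the 1.6-2.0GB/s per-director VM figures on the PCI-E side are discussed in the next paragraph; the system totals are my extrapolation, not an EMC number.

    ```python
    # Upper bound on system bandwidth when IO arriving at each director is
    # spread evenly across all directors: the VM must carry the remote fraction
    # of the traffic, so per-director throughput <= VM bandwidth / remote fraction.
    def system_limit_GBps(directors, fe_GBps, be_GBps, vm_GBps):
        remote_fraction = (directors - 1) / directors
        vm_limit = vm_GBps / remote_fraction
        return min(fe_GBps, be_GBps, vm_limit) * directors

    # Original VMAX, 16 directors: 3.0 GB/s FE and BE per director, VM PCI-E
    # side at 2.0 GB/s nominal or ~1.6 GB/s net per director.
    print(system_limit_GBps(16, 3.0, 3.0, 2.0))   # ~34 GB/s, VM limited
    print(system_limit_GBps(16, 3.0, 3.0, 1.6))   # ~27 GB/s with the net figure
    ```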

    The PCI-E bandwidth to VM in the original VMAX is 2.0 GB/s (x8 gen 1) including PCI-E protocol overhead. The actual net bandwidth is less than 2GB/s but possibly more than 1.6GB/s cited earlier as the maximum that I have seen in direct attach IO. This is more limiting than the 3GB/s on the 8 x 4Gbps FC front-end or backend ports.

    The second generation VMAX allows 8 x 8Gbps FC ports on the front-end for an aggregate bandwidth of 6GB/s, based on 750MB/s per 8Gbps FC port. However, the back-end ports are still 4Gbps FC, for the same 3GB/s aggregate as in the original VMAX. The 40K VMAX engine is described as having 50GB/s VM bandwidth, without mentioning that this is the full-duplex value encoded on the RapidIO side. The single direction encoded data rate on a single director is 12.5GB/s. The unencoded rate is 10GB/s on the RapidIO side. The single direction unencoded rate on the PCI-E side is 8GB/s (16 PCI-E gen 2 lanes). Still, this is much more than either the FE or BE ports.

    Note that with fewer engines and corresponding directors, more of the traffic is local. With 4 engines and 8 directors, the local traffic is 12.5% and 87.5% remote. With 2 engines and 4 directors, the local traffic is 25% and 75% remote.

    All of the above is for read traffic, and does not consider if there are other more limiting elements. Another consideration is memory bandwidth. A read from "disk" could be first written to memory, then read from memory. (the latency due to the CPU-cycles involved is not considered). An 8-byte wide DDR DRAM channel at 1333MHz has 10.7GB/s bandwidth, but this is only for memory read.

    The memory write bandwidth to SDR/DDR is one-half the nominal bandwidth. In the really old days, a disk access involving a memory write followed by a memory read would be constrained by one-third of the nominal memory bandwidth. Intel server systems from 2006 or so on used memory with a buffer chip that is described as supporting simultaneous read at the nominal bandwidth and write at one-half the nominal bandwidth.
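    The memory arithmetic, as a sketch: with writes at half the nominal rate, a write followed by a read of the same data is limited to one third of nominal.

    ```python
    # 8-byte wide DDR channel at 1333 MT/s; writes at half the nominal rate.
    nominal_GBps = 8 * 1.333             # ~10.7 GB/s for reads
    write_GBps = nominal_GBps / 2        # ~5.3 GB/s for writes
    # Move 1 GB in (memory write) and back out (memory read):
    effective_GBps = 1 / (1 / write_GBps + 1 / nominal_GBps)
    print(nominal_GBps, effective_GBps)  # effective is nominal / 3, ~3.6 GB/s
    ```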

    In writes to storage, the write IO is first sent to memory on the local director, then copied across the CMI-II(?) to the other director in the same engine? So the net bandwidth across the CMI is also limiting.

    Now that SQL Server 2012 allows clustering (AlwaysOn?) with tempdb on local (preferably SSD) storage, I recommend this to avoid burdening the SAN with writes. Or a SAN vendor can bother to understand the nature of tempdb and allow write cache mirroring to be selectively disabled?

    Even with all this, there is not a definitive statement from EMC on the actual bandwidth capability of the VMAX, original or extra-crispy second generation. Some slides mention a 3X increase in bandwidth. Was that a particular element, or the realizable bandwidth? Is it possible that the original VMAX could do only 1/3 the back-end aggregate of 48GB/s, and that the second generation can do the full back-end limit?

    Summary

    Regardless of the SAN, focus on the IOPS and bandwidth that can be realized by actual SQL Server queries. The SAN vendors' big meaningless numbers are not helpful. The standard SAN vendor configuration should be able to produce reasonable IOPS, but will probably be sadly deficient on bandwidth that can be realized by SQL Server. I do like SSD. I do not like auto-tiering, flash-cache or 7200RPM drives for the main line of business database. It should be the database administrator's responsibility to isolate hot data with filegroups and partitioning.

    Considering that a 10K 900GB drive lists for $600, why bother with the 7200RPM (3TB) drive in an enterprise system, unless it is because the markup is so ridiculously high? Or perhaps data that needs to be on 7200RPM drives for cost reasons should not be on a high-cost enterprise storage system? If there are SSDs, these should be made available as pure SSD.

    (Edit 2013-08-02)
    Also extremely important is how SQL Server IO works. First, FC 8Gbps is 8 Giga-bits/sec, not Giga-Bytes/s. After encoding overhead, the net single direction BW is 700-750MB/s. On modern server hardware, SQL Server can easily consume data at 5-10GB/s, i.e. single direction, so 8-16 x 8Gbps FC ports are recommended for a multi-ten TB DW. Furthermore, use serious configuration documents, not the stupid idiots who say 1 LUN each for data, temp, and log. There should be 1 volume per FC path, to be shared between data and temp. For OLTP, log should be on a separate volume with dedicated physical HDDs (i.e. not a volume carved from a shared pool) and possibly even its own FC path. For DW, log can share a path with the data volumes. Each critical FG has one file on each of the data volumes. And don't forget the -E startup parameter.
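    A quick sizing sketch for the FC port recommendation above, assuming 700-750MB/s net per 8Gbps FC port as stated.

    ```python
    import math

    # 8Gbps FC nets roughly 700-750 MB/s per port, single direction.
    def fc_ports_needed(target_GBps, net_per_port_MBps=700):
        return math.ceil(target_GBps * 1000 / net_per_port_MBps)

    print(fc_ports_needed(5))    # 8 ports for 5 GB/s
    print(fc_ports_needed(10))   # 15 ports for 10 GB/s, round up to 16
    ```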

     

     

    Symmetrix DMX (pre-2009)

    The old DMX-4 architecture is shown below. The front-end and back-end units used PPC processors(?), connected through a matrix to memory controllers?

    VMax

    A history of the EMC Symmetrix product line can be found on Storage Nerve.

    There can be up to 8 front-end units. Each FE can have 8 FC ports for a total of 64 FE ports? Assuming that this was designed for 4Gbps FC, with a realizable bandwidth of 375MB/s on each 4Gbps FC port, each FE unit of 8 ports would in theory have a maximum BW of 3.0GB/sec. The combined 8 FE units with 64 ports total would have a theoretical upper bound of 24GB/s. Now it is possible that the DMX was originally designed for 2Gbps FC, for a design upper bound of 12GB/s.

    Various EMC documents mention the interconnect bandwidth as a sum total of the individual component bandwidths. But nowhere in EMC documentation is there a mention of the actual DMX bandwidth capability. I have heard that due to some internal architecture aspect, the actual bandwidth capability of the DMX is in fact 3GB/s.

    Lonny Niederstadt provided the following link Agilysys Audit.

  • Relativity e-discovery on SQL Server

    Back in late 2011 to early 2012, I was asked to look into issues for a SQL Server system supporting kCura Relativity. Relativity is an e-discovery platform, that is, a document search management system frequently used for document discovery in legal cases. So it also has auditing to prove that a search was done.

    Normally I would post items both here and on my own website. Now it is my nature to say things that others find to be not entirely polite (see my previous post) and I am too old to change. So to avoid hurting the feelings of too many people, I am not posting the details of my findings on Relativity on SQL Server here.

    The link is on my site, www.qdpma.com, which does not get much traffic, in the Optimizer section.

    Here is the direct link
    Relativity

     

    Still, I would like to hear from SQL people who support Relativity. Of course other people have different conclusions. So it is always important to fully examine the information to make an assessment on technical merit. Avoid judgment based solely on the reputation of the author.

  • Excel Error or Idiot User?

    Many of us have probably seen recent headlines titled Excel Error - major catastrophe, end of the world, blah blah blah. Nowhere in any of these is it cited that there was actually an error made by Excel. But if there was a mistake, people would rather imply that it was in the technology, and not the idiot using the technology without checking his or her work. The New York Times blog also mentions that an Excel error was part of the London Whale incident. This is why I never trust information given to me. I insist on seeing the supporting data for myself. Of course, there are people who double check their arithmetic to make sure they got it right to the penny on a five dollar transaction when it is their money. But not for a billion/trillion of someone else's money.

    Here is a link from Reuters: How a student took on eminent economists on debt issue - and won.

    If the link does not work, either search "How a student took on eminent economists on debt issue - and won" or try http://www.reuters.com/article/2013/04/18/us-global-economy-debt-herndon-idUSBRE93H0CV20130418

    Edit
    Let me clarify: my complaint is that the media headlines say "Excel Error" instead of emphasizing that it was an error by the user of Excel.

  • Hyper-Threading Performance

    The Hyper Threading technology returned to the Intel Xeon (and Core-ix) with processor codename Nehalem in 2009. It was first introduced with Willamette in late 2000?, and the first Xeon in 2001. But the Core 2 architecture, 2006, in the Xeon 5300 and 5400 series did not have this feature. As with many (potentially) performance enhancing technologies and features, there is a glaring absence of meaningful detailed performance data that might be of help to developers and information technology professionals. And it is not as if Intel and Microsoft do not have the budget for this. (Note: I am deliberately using both Hyper-Threading and Hyper Threading because it impacts search.)

    The complex interaction between software and advanced hardware inherently implies that there is no magic that always improves performance without risk of negative consequences. I am of the impression that marketing people squash anything that could be interpreted as negative because they want victims to believe that everything will be wonderful.

    On our side, we should expect that it takes talent and years of experience to build up the skills necessary to command professional salaries. So we need to see both good and bad in order to determine the best strategy to realize the benefits of the positive while containing the impact of any deficiencies. If it were not so, then why staff our positions with someone more expensive than necessary?

    Hyper-Threading from Nehalem to Sandy Bridge

    Nehalem and later Xeon Core-ix based processors implement temporal multi-threading. The older Pentium 4 based processors had simultaneous multi-threading. The details of both types are covered in the references at the end and will not be covered here. I have a brief discussion on HT in Processor Architectures and History.

    When the Nehalem-EX (Xeon 7500 series) was near introduction, the HP benchmark team suggested that the impact of HT was in the range of 30% improvement for the OLTP oriented benchmarks and 10% for the DW/DSS benchmarks. I presumed that this was on the 8-socket ProLiant DL980 as benchmarks for the ProLiant 300 and 500 systems were handled by a different group. My expectations are that HT should have greater impact on the 1 to 4 socket systems because it is much more difficult to achieve performance gains on the more complex systems.

    A few years ago, I did an assessment of a proprietary database engine that only did parallel b-tree searches without locking. Scaling was nearly linear with the number of threads over both physical cores and logical processors. This was not unexpected, because pointer-chasing code means the core executes serialized memory accesses, so there should be plenty of no-op cycles for two logical processors to share. As a matter of fact, 4 threads (logical processors) per core might be even better.

    Apparently the fact that SQL Server does not realize such a benefit has to do with locking. Is there a single resource that locks must go through? Dave Campbell's post Breakthrough performance with in-memory technologies on Hekaton cites the Per-Ake Larson et al. article, and mentions lock-free data structures. (I found Sasha Goldshtein's material Practical Concurrency Patterns: Lock-Free Operations and Practical Concurrency Patterns: Spinlock. If anyone knows good sources, please advise.) This might indicate that the next version of SQL Server could show more benefit from HT, and perhaps Intel should consider more threads per core in the next generation?

     

    HT in Small and Large Queries

    Ideally we should investigate HT for both high-volume small queries (non-parallel execution plans) and low-volume large queries with parallel execution plans. Unfortunately my multi-threaded test harness is old and I do not have time to bring it up to date.

    Recently I did work on a clustered system with two nodes. For whatever reason, one had HT enabled and the other had HT disabled. Periodically, the active node would be switched to allow for patching. This provided invaluable data on HT performance.

    Test System

    The average SQL statement on this system is 1-2 CPU-ms, comparable to TPC-E. HT was off until mid-September, on until October, off until early November, on until mid-January and off until beginning of March. Certain SQL statements are not impacted by HT, and some SQL could be more than 50% faster with HT on, even queries without a parallel execution plan. The overall steady CPU load on the system is typically 25% during the day.

    For large queries, I used the TPC-H data generator at scale factor 100 and the set of 22 TPC-H queries. The LineItem table is 80GB because of the use of the 4-byte date data type instead of the 8-byte datetime. The total database is 140GB.

     

    Test System

    The test server is a Dell PowerEdge T110 II with Intel Xeon E3-1240 processor, quad-core 3.3GHz, 32GB memory, and an LSI MegaRAID SAS-9260 controller connected to 8 SATA 6Gbps SSDs. There are 2 Crucial m4, 2 OCZ Vertex 3, 3 OCZ Vertex 3 Max IOPS and 1 OCZ Vector. The T110 only has PCI-E gen 2 slots, so I will need to get a new system for gen 3 testing. Below is a representation of the test system.

    Test System

    The software stack is Windows Server 2012 and SQL Server 2012 Enterprise Edition. Most testing was done with sp1 + cu3, with a few tests at RTM and sp1 base. The storage configurations evaluated were 4 x 2 disk RAID 0, 8 disk RAID 5, and 1 x 8 disk RAID 0. All results here are 1 x 8 disk RAID 0, but write activity was minimal, so RAID 5 results should be only moderately less.

    Hyper-Threading can be enabled or disabled in the UEFI (formerly BIOS, now Unified Extensible Firmware Interface). Dell also allows setting the number of cores to enable. I think this is most useful in light of SQL Server per-core licensing. We can buy a single model for many functions, then enable the correct number of cores to balance performance and licensing cost.

    The SQL Server DATE_CORRELATION_OPTIMIZATION setting was turned on. It so happens that correlated dates within the TPC-H tables are within 120 days of each other. Ship date is within 120 days of order date, and so on. The TPC-H queries are written citing a greater-than condition on one date column and a less-than condition on a correlated column, so this setting helps. If the correlation is not tight, the setting may not help. In practice, it might be better to just explicitly specify both upper and lower bounds on the cluster key column whenever possible.

    Several settings were tried on the LSI SAS-9260. No read-ahead is better than read-ahead and 128K stripe size was better than 64K, but results below are with no read-ahead and 64K stripe size.

    While the TPC-H scale factor 100 data set was used, the indexes were built differently than the TPC-H requirements, as I am not presenting official TPC-H results. The significant difference is that the clustered index key on LineItem is Shipdate+OrderKey+LineNumber, specified as unique. This is per my practice of building unique clustered indexes, and not to eliminate key lookups in the execution plan.

    Hyper-Threading On versus Off

    In all tests, only the execution time and CPU are reported. Statistics generation and compile times were excluded. Each test of the 22 TPC-H queries is run from clean buffers and a clean procedure cache. From a freshly populated database, the statistics generation time is about 40 sec. The compile time with statistics previously generated is about 4-5 sec for all 22 queries.

    CPU and elapsed time were collected with SET STATISTICS TIME ON. The DMV dm_exec_query_stats reports exactly the same elapsed time, but the worker time is higher, perhaps including the compile time?

    Below are the elapsed times in milliseconds to execute each of the 22 TPC-H queries with Hyper-Threading off and on. There was no restriction on max degree of parallelism, so HT off is DOP 4 and HT on is DOP 8 (4 physical cores).

    HT off on dur

    Below are CPU (worker) time in ms for the same with HT off and on.

    HT off on CPU

    The sum total execution elapsed time for 22 queries is 692 sec with HT off and 559 sec with HT on. Overall, the 22 queries run 24% faster with HT. On the individual queries, all but one range from 0 to 65% faster. The exception, query 15, runs 4 times faster with HT on instead of off. Below is the elapsed time for HT off relative to HT on.

    HT elapsed

    The sum total execution CPU time is 2253 sec with HT off and 3374 sec with HT on. Overall, CPU is 50% higher with HT on. Of course, there are twice as much available "CPU-sec" with HT on than off, so this is not a negative. With a perfect parallel execution plan and infinitely fast IO, all (logical) cores should be at 100% during the query execution.
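    The ratios behind these summary numbers, using the totals reported above:

    ```python
    # Totals over the 22 queries as reported above.
    elapsed_off, elapsed_on = 692.0, 559.0   # seconds
    cpu_off, cpu_on = 2253.0, 3374.0         # CPU (worker) seconds

    print(f"{elapsed_off / elapsed_on - 1:.0%} faster elapsed with HT on")  # ~24%
    print(f"{cpu_on / cpu_off - 1:.0%} higher CPU with HT on")              # ~50%
    ```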

    On the individual queries, CPU on all but Q 2 and Q15 range from 22 to 80% higher with HT on. On the Q2, CPU is 137% higher. On Q15, CPU was 45% lower with HT on than off, reflecting the 4X speedup with double the logical processors. Below is the CPU time for HT on relative to HT off.

    HT elapsed

    There is just the slight difference in the execution plans for Q15, shown below. The first is with HT off and the second with HT on.

    Q15 HT off

    Q15 HT on

    I do expect HT should improve the performance of this query, but it is probably reasonable to believe that most of the 4X gain in elapsed time is attributed to the difference in execution plans. I have no idea why the Query Optimizer would not use the same plan in both cases. Perhaps someone from the Microsoft QO team can help?

    For anyone interested, I have provided the actual execution plans for the above HT off and on sqlplan. (Too bad the web browser does not render sqlplan files!)

     

    Hyper-Threading Off and On at DOP 1

    Below is elapsed time with HT off and on at degree of parallelism 1. Amazingly, performance is better with HT on at DOP 1, 20% faster!

    HT off on dur

    How can this be? When this is the only activity, only one of the two logical processors on one of the four cores is in use. The performance should be very similar to single-threaded HT off? The range is 0 to 63% faster, except Query 17, which is 14% slower with HT on.

    Below is CPU time with HT off and on at DOP 1.

    HT off on CPU

    Below is the relative elapsed time with HT off to HT on.

    HT elapsed

    Query 1 is a simple table access aggregating 591M rows over nearly 80GB, so there can be no difference in execution plan, and yet it is 48% faster with HT on? If someone knows how this can be, please advise?

    I could speculate that with HT on, there are twice as many logical processors, so it is less likely an OS task would disrupt the SQL thread handling the running query. Perhaps it so happened that the thread handling the SQL query and the thread handling IO were on the same physical processor, so data was already in L2? These are just wild guesses.

    In the comments below, Chris A mentions that SQL Server 2012 tries to fit batches into L2 cache. What if it estimates the amount of available L2 too high? Then with HT on, it would presume only half of that to be available? But with nothing running on the other LP, single-threaded HT on now runs better than HT off?

    I like this hypothesis, but the HT-on benefit occurs for S2K8R2 too. It's never the theories that we like!

     

    Hyper-Threading Off and On at DOP 2

    Below is elapsed time for HT off and on at DOP 2. Most of the individual queries are within 12% of each other, faster or slower, except for Query 1, which is 28% faster with HT on. Overall, HT on is 2% faster, basically noise.

    HT off on dur

    Below is CPU time for HT off and on at DOP 2

    HT off on CPU

     

    Hyper-Threading Off and On at DOP 4

    Below is elapsed time for HT off and on at DOP 4. The range with HT on is 2.5X slower (Query 6) to 90% faster (Query 15). Overall, 13% slower with HT on. It is possible that both logical processors are used on certain cores instead of a single logical processor of each physical core. The same may have happened at DOP 2.

    HT off on dur

    Below is CPU time for HT off and on at DOP 4

    HT off on CPU

     

    Hyper-Threading Off, DOP 1, 2 and 4

    Below is elapsed time with HT off at DOP 1, 2 and 4. Notice that scaling is sometimes better than 2X from DOP 1 to 2. I discussed this super-scaling effect in Tpch Studies at SF10

    HT off on dur

    Below is CPU time for HT off at DOP 1, 2 and 4

    HT off on CPU

    Below is the ratio of elapsed time at DOP 2 over 1, and DOP 4 over 2. Correction: this is actually the speedup from DOP 1 to 2 and from 2 to 4, that is, the ratio of elapsed time at DOP 1 over DOP 2, and at DOP 2 over DOP 4. The conventional expectation is that scaling should be close to linear from DOP 1 to 2, for a value of just under 2 on the graph. From 2 to 4, scaling should still be good, but less than from 1 to 2. If the test system had more cores, there should be reduced gains in scaling to the higher DOP. Query 10 is an example.

    HT elapsed

    As it turns out, the behavior is all over the map of possibilities. One effect is super-scaling where the parallel execution plan is actually more efficient than the non-parallel plan. Four queries (2,11,13 and 18) actually exhibit super-scaling at both DOP 1-2 and DOP 2-4.

    Other effects are that the parallel plan is much less efficient than the non-parallel plan, but otherwise scales with parallelism. In this case, we would expect scaling from 1 to 2 to be poor, but good at 2 to 4. Examples are queries 6 and 15.

    In addition, because of the SQL Server cost formulas, we can expect completely different execution plans and characteristics at different degrees of parallelism.

    Below is the ratio of worker time at DOP 2 over 1, and DOP 4 over 2. The conventional expectation is progressively higher CPU with increasing DOP to account for the overhead of coordinating the threads.

    HT elapsed

    But as with elapsed time, there are radical variations in behavior. Per super-scaling effect, CPU at DOP 2 is actually lower than DOP 1 in many cases.

     

    Hyper-Threading On, DOP 1, 2, 4 and 8

    Below is elapsed time with HT on at DOP 1, 2, 4 and 8. Notice the more predictable behavior of queries running faster at progressively higher DOP, even when the number of threads exceeds the number of physical cores. The exception is query 15.

    HT off on dur

    Below is CPU time with HT on at DOP 1, 2, 4 and 8. Again the expected behavior of progressively higher CPU with increasing DOP except in the DOP 1 to 2 where the parallel execution plan is more efficient than the non-parallel plan.

    HT off on CPU

    Below is the ratio of elapsed time at DOP 2 over 1, DOP 4 over 2 and DOP 8 over 4. Correction: DOP 1 over 2, DOP 2 over 4, and DOP 4 over 8.

    HT elapsed

    Below is the ratio of CPU time at DOP 2 over 1, DOP 4 over 2 and DOP 8 over 4.

    HT elapsed

     

    HT Summary

    Overall, Hyper-Threading is a significant positive for performance. There are unexpected characteristics, so it is important to collect sufficient performance data with HT off and on, so that anomalies can be investigated. It is also important to have data on query performance at a range of DOP values. In a production system, one should also consider the strategy for max degree of parallelism (and cost threshold for parallelism), factoring in the number of concurrently running queries (with and without parallel execution plans).

    The premise is that HT performance gains on SQL Server are heavily impacted by the locking mechanism. Then the next version of SQL Server, which is expected to make extensive use of lock-free data structures, should see more substantial performance gains with HT. This should also prompt Intel to support more logical processors on each physical core, similar to the 4 threads per core on IBM POWER.

    See also Queries barely over the Cost Threshold for Parallelism and Parallelism Strategy and Comments. In the first, I mentioned that a parallel execution plan with an intermediate DOP may allocate both logical processors on one core, instead of distributing threads one per physical core. Even if using both logical processors on one core produces a 30% throughput gain, that implies each logical processor runs at 65% of the rate of one LP active with the other idle. Then the threads on the doubled-up core would take about 50% longer than those on cores with only one active LP.

    Another question is how should the threads be distributed? Consider one thread per core, all on a single socket first, before going to the next socket, so threads share a common L3 cache. Potentially operations such as the Parallelism Repartition Streams might run better. The other option is one thread on each socket first, before coming back to the first socket. If SQL Server could then distribute the work so that threads are assigned to access pages stored on local memory as much as possible, this would improve memory performance. Of course, this would be impossible to accomplish perfectly, but perhaps good enough is possible?

    Wikipedia References

    Wikipedia has pages on both Multi-Threading and processors with MT in one form or another. MT topics include the broader topic Multi-Threading, and the specific types of MT: Simultaneous Multi-Threading and Temporal Multi-Threading, along with the Intel brand name Intel Hyper-Threading on this topic.

    Processors with MT include the IBM POWER5 (2004), POWER6 (2007) and POWER7 (2010), all 4 threads per core. POWER8 in 2013?

    The Sun (now Oracle) UltraSPARC T1 (2005) is 4 threads. UltraSPARC T2 (2007) and later are 8 threads. Then came SPARC T3 (2010) and T4 (2011). The newest is the Oracle SPARC T5 (2013).

    Intel Itanium acquired MT with Montecito (2006), and continued with Tukwila (2010) and Poulson (2012).

    HT-SQL references

    I should have checked earlier: Linchi's post on this topic, performance-impact-not-all ... (2012-01) - 2 posts actually. And Kevin pointed out Slava Oks' weblog: Be aware: To Hyper or not to Hyper.

  • Storage Performance

    Storage has changed dramatically over the last three years driven by SSD developments. Most of the key components necessary for a powerful storage system are available and the cost is highly favorable for direct placement of data files. Some additional infrastructure elements could greatly enhance the flexibility of storage systems with SSDs. There is still some discussion on whether SSD should interface directly to PCI-E or continue using the SAS/SATA interfaces originally designed for hard disks. New products coming this year include Express Bay, an ecosystem of connectors allowing both PCI-E and SAS/SATA to co-exist until a clear direction is established. Also expected in the coming year are PCI-E SSDs based on the NVM Express interface.

    System Level

    The Intel Xeon E5 processors, codename Sandy Bridge-EP, have 40 PCI-E gen 3 lanes on each processor socket. Even though PCI-E gen 3 is 8GT/s, a change in the encoding means that the usable bandwidth is double that of PCI-E gen2 at 5GT/s. The net realizable bandwidth of a PCI-E gen 3 x8 slot is 6.4GB/s versus 3.2GB/s for gen 2.
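    The per-lane arithmetic behind that comparison, as a sketch; the ~0.8 factor for protocol and packet overhead is my assumption, chosen to land on the net figures quoted above.

    ```python
    # PCI-E gen 2 uses 8b/10b encoding at 5 GT/s; gen 3 uses 128b/130b at 8 GT/s,
    # so the usable per-lane rate roughly doubles even though the signaling rate
    # does not. The ~0.8 overhead factor is an assumption for net realizable BW.
    gen2_lane_GBps = 5.0 * 8 / 10 / 8        # 0.50 GB/s per lane
    gen3_lane_GBps = 8.0 * 128 / 130 / 8     # ~0.98 GB/s per lane

    print(gen2_lane_GBps * 8 * 0.8)          # ~3.2 GB/s net for a gen 2 x8 slot
    print(gen3_lane_GBps * 8 * 0.8)          # ~6.3 GB/s net, close to the 6.4 GB/s cited
    ```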

    The unfortunate aspect is that the major server vendors all implement a mix of x16 and x4 slots, while the HBA vendors seem to be concentrating on products for PCI-E x8. Only Supermicro has a system with 10 PCI-E gen 3 x8 slots. Could a vendor put 2 HBA/RAID controllers designed for x8 onto a single card for a x16 slot? Perhaps the new Express Bay form factor will have some means to use x16 slots?

    Sandy Bridge EP 2-socket

    Another disappointment is that the 4-socket Xeon E5-46xx systems only connect half of the PCI-E lanes. This might be because the base system configuration has only 2 sockets populated. If a full set of slots were provided, there would be no connection to half of the slots unless all four sockets are populated. But this is also an issue on the 2-socket systems if only 1 socket is populated.

    Direct-Attach

    For the most part, I will discuss direct-attach storage configuration, as we can pick and choose among the latest components available. Technically, direct-attach with SAS can support a 2-node cluster, but few system vendors promote this configuration. Dell sells the MD3200 as direct-attach storage supporting 4 hosts in a cluster (or not), but technically it is a SAN that just happens to use SAS interfaces on both the front-end and back-end.

    The objective in the baseline storage configuration below is to achieve very high IO bandwidth even in the low capacity configuration. Of course it will also have very high IOPS capability because the main elements are SSD. My recommended storage system has both SSD and HDD in each IO channel.

    PCIe

    The intent is to place the main databases on SSD and use the HDDs for backups and for restore verification. For an Inmon style data warehouse, the HDDs might also be used for older data. The reason for having both SSD and HDD on each IO channel is to take advantage of simultaneous bi-directional IO. On a database backup, the IO system reads from the SSDs and simultaneously writes to the HDDs.

    4 RAID Controllers, 4GB/s per controller, 16GB/s total IO bandwidth
    4 Disk enclosures (yes, I am showing 8 enclosures in the diagram above)
    4 x 16 = 64 SSD
    4 x 8 = 32 (10K) HDD

    The standard 2U enclosure has 24 x 15mm bays. My preference is for a 16 SSD and 8 HDD mix. With the small capacity 100GB SSD, there will be 1.6TB per enclosure and 6.4TB over 4 enclosures before RAID. In 7+1 RAID 5 groups, there will be 5.6TB net capacity, and 4.8TB in 3+1 RG across 4 units. The goal is 4GB/s per controller because the SAS infrastructure is still 6Gbps, supporting 2.2GB/s on each x4 port. With 16 SSDs per controller, each SSD needs to support 250MB/s. Most of the recent enterprise class SSDs are rated for well over 300MB/s per unit, allowing for a large degree of excess capability. Another option is to configure 12 SSDs per controller, expecting each SSD to support 333MB/s.
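    A quick check of the per-SSD bandwidth requirement and the capacity figures above, assuming 2 x4 SAS 6Gbps ports per controller as the 4GB/s target implies:

    ```python
    # Per-SSD bandwidth needed to hit 4 GB/s per controller, and net capacity
    # of the 4-enclosure baseline with 100GB SSDs.
    controllers, ssd_per_controller = 4, 16
    target_per_controller_GBps = 4.0

    print(target_per_controller_GBps * 1000 / ssd_per_controller)  # 250 MB/s per SSD

    raw_TB = controllers * ssd_per_controller * 0.1   # 6.4 TB raw with 100GB SSDs
    print(raw_TB * 7 / 8, raw_TB * 3 / 4)              # 5.6 TB (7+1 R5), 4.8 TB (3+1)
    ```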

    The cost structure for the above is as follows:
      RAID controller $1K
      2U Enclosure $3K
      Intel SSD DC 3700 100GB SSD $235 x 16 = $3760, RAID 5 7+1: 1.6TB
      Seagate Savvio 600GB 10K HDD $400 x 8 = $3200.

    This works out to $11K per unit or $44K for the set of 4. The set of 16 x 100GB contributes $3760. For the 800GB SSD, the R5 7+1 capacity is 44.8TB at cost $148K.
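    The cost arithmetic, as a sketch; note the 800GB SSD price is back-calculated from the $148K figure rather than a quoted list price.

    ```python
    # Per-unit and total cost for the baseline configuration above.
    controller, enclosure = 1000, 3000
    ssd_100GB = 235 * 16      # $3,760
    hdd_10K = 400 * 8         # $3,200

    unit = controller + enclosure + ssd_100GB + hdd_10K
    print(unit, unit * 4)     # ~$11K per unit, ~$44K for the set of 4

    # With 800GB SSDs: 64 drives in 7+1 RAID 5 groups across the 4 units.
    print(64 * 0.8 * 7 / 8)   # 44.8 TB net; at a bit under $1.9K per SSD, ~$148K total
    ```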

    At maximum expansion of 4 enclosures per RAID controller, capacity is 170TB at cost is $590K. Of course at this level, I would elect a system with more PCI-E slots for greater IO bandwidth. Another option is a RAID controller with 4 x4 SAS ports. Unfortunately none of these have 4 external ports.

    While the Intel SSD DC 3700 drew reviews for pioneering consistency of IO performance instead of peak performance, it is only available with a SATA interface. Micron Crucial has announced the P410m with similar specifications but with a SAS interface. This is listed on the Micron website as in production, but probably only to OEM customers. There are other enterprise grade high-endurance MLC SSDs with SAS interfaces as well.

    Note: I do not recommend anything less than 10K HDD even to support database backups. The 10K HDDs are not particularly expensive as direct-attach components ($400 for the 600GB model). Only SAN vendors sell $400 HDDs for $2K or more.

    SAS 12Gbps Enclosures

    Disk enclosures supporting SAS at 12Gbps might become available as early as this year. Each of the 12Gbps SAS x4 uplink and downlink ports would then support 4GB/s. The RAID controller (HBA) can support 6GB/s+ in a PCI-E gen 3 x8 slot. The system with 4 RAID controllers could then deliver 24GB/s instead of 16GB/s. At 16 SSDs per controller, this would require 400MB/s per SSD. While SSDs are rated as high as 550MB/s, achieving the full aggregate bandwidth in an array is not necessarily practical. So 400MB/s per SSD in an array is a more reasonable expectation. Also, enterprise SAS SSDs may only be rated to 400MB/s.

    We should not need 12Gbps SAS SSDs or HDDs in the near future (but 8 NAND channels is a good match for a 1.1GB/s interface). The internal wires in the enclosure connect through a SAS expander. The IO from each device bay can signal at 6Gbps, then uplink to the HBA at 12Gbps, assuming that packets are buffered on the expander.

    PCIe

    The standard 2U disk enclosure today supports 24 or 25 2.5in (SFF) bays, with 15mm thickness. This is the dimension of an enterprise class 10K or 15K HDD with up to 3 platters. The older full size notebook used a 9mm HDD supporting 2 platters. Thin notebooks used a 7mm HDD restricted to a single platter. There is no particular reason for an SSD to be more than 7mm.

    It would be better if the new 12Gbps SAS enclosures support more than 24 bays. My preference is for 16 x 15mm and 16 x 7mm bays. The key is to match the practically realizable aggregate bandwidth of SSDs to the uplink with some degree of excess. Personally, I would like to discard the SSD case to further reduce thickness.

    PCIe

    Another option is to employ the NGFF, perhaps a 1U stick, at 5mm or less. There could be 2 rows of 24 for SSD, and the 16 x 15mm bays.

    I believe that the all-SSD idea is misguided. SSDs are wonderful, but HDD still have an important role. One example is having the HDDs available for backup and restores. I want local HDD for backups because so very few people know how to configure for multiple parallel 10GbE network transmission, not to mention IO bandwidth on the backup system.

    A database backup that has not been actually verified to restore (with recovery) is a potentially useless backup. Having HDDs for backup and restore verification preserves the write endurance on the SSD. This allows the use of high-endurance MLC instead of SLC. In some cases, it might even be possible to use consumer grade MLC if and only if the database organization maintenance strategy is architected to minimize wear on the SSD.

    PCI-E SSD

    Some of the discussion on PCI-E versus SATA/SAS interfaces for the NAND/Flash controller incorrectly focuses on the bandwidth of a single 6Gbps lane versus 4 or 8 lanes on PCI-E. It is correct that PCI-E was designed to distribute traffic over multiple lanes and that hard drives were never expected to exceed the bandwidth of a single lane at the contemporary SATA/SAS signaling rate. The transmission delay across an extra silicon trip, from a NAND controller with SATA interface to a SATA to PCI-E bridge chip, on the order of 50ns, is inconsequential compared with the 25-50µsec access time of NAND.

The more relevant matter is matching NAND bandwidth to the upstream bandwidth. All (or almost all?) of the SATA interface flash controllers have 8 NAND channels. Back when SATA was 3Gbps and NAND was 40-50MB/s, 8 channels to the 280MB/s net bandwidth of SATA 3G was a good match. About the time SATA moved to 6Gbps, NAND at 100-133MB/s became available, so 8 channels was still a good choice.

    PCIe
NAND is now at 200 and 333MB/s, while SATA is still 6Gbps. The nature of silicon product cost structure is such that there is only minor cost reduction in building a 4 channel flash controller. The 8 channel controller only requires a 256-pin package.

The PCI-E flash controllers have been designed with 32 NAND channels. The IDT 32-channel controller has 1517 pins, which is not excessively difficult or expensive for a high-end server product; note that the Intel Xeon E5 processors use a 2011-pin socket. As noted earlier, a PCI-E gen 3 x8 port supports 6.4GB/s. Over 32 channels, each channel needs to provide 200MB/s. The new 333MB/s NAND is probably a better fit to sustain the full PCI-E gen 3 x8 bandwidth after RAID (now RAIN, because disks are replaced by NAND).

    PCIe

Based on a 64Gbit die and 8 die per package, a package has 64GB raw capacity. A 32-channel PCI-E controller with 1 package per channel would have 2TB raw capacity (net capacity, applying factors of 0.78 for over-provisioning and 0.875 for RAIN, would be 1400GB) versus 512GB on an 8-channel SATA/SAS SSD. The IDT document states 4TB raw capacity for their 32-channel controllers, so perhaps it allows 2 packages per channel? The Micron datasheet mentions 32 channels and 64 placements.

As it stands today, a PCI-E SSD can achieve maximum bandwidth at lower NAND capacity and in a more compact form factor than a SAS SSD array. On the other hand, SAS infrastructure provides flexible expansion: capacity can be increased without replacing existing devices. Some systems support hot swap PCI-E slots, but the orientation of the connector in the system chassis makes this a complicated matter. The implication is that PCI-E slot SSDs are highly suitable for high density requirements with limited expansion needs. One database server example is tempdb on SSD.

    NVM Express

The new generation of PCI-E SSDs may employ the NVMe interface standard. There is a standard driver for Windows and other operating systems, which will later be incorporated into the OS distribution media, allowing boot from an NVMe device as with SATA devices today. This is mostly a client side feature.

For the server side, the NVMe driver is designed for massive bandwidth and IOPS. There can be up to 64K queues, each with up to 64K commands (outstanding IO requests). The driver is designed for IO to be both super-efficient in CPU cycles and scalable on NUMA systems with very many processor cores.


    Express Bay

To promote the growth of SSD without betting on which interface wins, the Express Bay standard defines a connector that can support both PCI-E and SATA or SAS. Some Dell servers today support PCI-E SSDs in the 2.5in HDD form factor (SFF), but I am not sure if this is Express Bay. This form factor will allow PCI-E devices to be hot-swapped with the same ease as SAS devices today.

    PCIe

    PCI-E Switches

    As mentioned earlier, the PCI-E slot arrangement in server systems does not facilitate hot-add, even if it is supported. Existing PCI-E SSDs also do not provide a mechanism for capacity expansion, aside from adding a card to an empty slot or replacing an existing card.

Of course, there are PCI-E switches, just like SAS expanders. A 64 lane PCI-E switch could connect 5 x8 PCI-E devices over a x8 upstream link. Other possibilities are a x16 upstream link supporting 12.8GB/s to the host with 4 x8 ports for SSD, or 8 x4 ports to SSD for finer grained expansion. It may also be possible to support multiple hosts, as in a cluster storage arrangement?

    PCIe

    SAN Configuration

Below is a representation of a typical configuration sold to customers by the SAN vendor. I am not joking: it is common to find just 2 FC or FCoE ports on each host. The most astounding case was a SAN with 240 x 15K disks and 2 single-port FC HBAs in the server. Even though the storage system service processors had 4 FC ports each (and the FC switches had 48 ports), only 1 on each SP was connected. Obviously the storage engineer understood single component and path failure fault-tolerant design. It was just too bad he built a fault-tolerant garden hose system when a fire hose was needed.

    SAN_Configuration1c

As I understand it, what happened was the SAN engineer asked how much space was needed for the databases, accounting for growth, and then created one volume for it. The Windows OS does support multi-path IO. Originally the storage vendor provided the MPIO driver, but now it is managed by Microsoft. Apparently it was not considered that even with MPIO, all IO for a single volume has a primary path; the secondary path is only used when the primary is not available.

    High-bandwidth SAN Configuration

A proper SAN configuration for both OLTP and DW database servers is shown below. Traditionally, a transaction processing database generates small block random IO (2KB in the old days, 8KB since SQL Server 7). As it was difficult to get 10K IOPS (x 8KB = 80MB/s), it was thought that IO bandwidth was not a requirement. That was 15 years ago. Apparently the SAN vendors read documents from this period, but not more recent ones, hence their tunnel vision on IOPS, ignoring bandwidth.

For the last 10 or more years, people have been running large queries on the OLTP system. I have noticed that report queries that saturate the storage IO channels can essentially shut down transaction processing. This is because the report query generates asynchronous IO at high queue depth, while the transaction queries issue synchronous IO at queue depth 1. And the report may escalate to a table lock (or it may use nolock). Furthermore, it is desirable to be able to back up and restore the transaction database quickly. This means bandwidth.

Note that the system below shows 8Gbps FC, not 10Gbps FCoE. A single 10Gbps FCoE port may have more bandwidth than a single 8Gbps FC port, but no serious storage system will have less than 4 or even 8 ports. Apparently FCoE currently does not scale well over multiple ports, due to the overhead of handling Ethernet packets? An Intel IDF 2012 session mentions that this will be solved in the next generation.

    SAN_Configuration2e

The above diagram shows 8 x 8Gbps FC ports between host and storage system. Each 8Gbps FC port can support 700MB/s, for a system IO bandwidth target of 5.6GB/s. An OLTP system that handles very high transaction volume may benefit from a dedicated HBA and FC ports for log traffic. This would allow the log HBA to be configured for low latency, and the data HBAs to be configured with interrupt moderation for high throughput.

    An alternate SAN configuration for SQL Server 2012 is shown below with local SSD for tempdb.

    SAN_Configuration3b

    The write cache on a SAN must be mirrored for fault tolerance. There is very little detail on the bandwidth capability of the link between controllers (or SP) on SAN systems, beyond what can be deduced from the fact that the sustained write bandwidth is much lower than the read bandwidth. So keeping tempdb off the SAN should preserve IO write bandwidth for traffic that actually needs protection.

The number of volumes for data and temp should be some multiple of 8. It would be nice to have 1 volume for each FC path. However, we do need to consider how SQL Server places extents over multiple files. This favors RAID groups of 4 disks.
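As a sketch of what this looks like (the database name, paths and sizes are purely illustrative, not from any particular system), the data filegroup would get one file on each data volume, all the same size so that SQL Server's proportional fill spreads allocations evenly:

-- Illustrative only: one data file per volume (8 volumes in total, 4 shown here),
-- equal sizes so proportional fill distributes extents across all files and paths.
ALTER DATABASE MyDB ADD FILEGROUP DataFG
GO
ALTER DATABASE MyDB ADD FILE
  (NAME = Data1, FILENAME = 'E:\MSSQL\Data\MyDB_1.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
  (NAME = Data2, FILENAME = 'F:\MSSQL\Data\MyDB_2.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
  (NAME = Data3, FILENAME = 'G:\MSSQL\Data\MyDB_3.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
  (NAME = Data4, FILENAME = 'H:\MSSQL\Data\MyDB_4.ndf', SIZE = 100GB, FILEGROWTH = 10GB)
TO FILEGROUP DataFG
GO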

    File Layout

In a HDD storage system, the objective for bandwidth is to simultaneously issue large block IO to all data disks across all IO channels. A 256K block could be sufficiently large to generate 100MB/s per disk (400 IOPS, not random). If this were issued at low queue depth (2?), then the storage system would not only generate high IO bandwidth but would also remain perceptibly responsive to other requests for small block IO.

    For small block random IO, it is only necessary to distribute IO over all hard disks with reasonable uniformity.

The file layout strategy has two objectives. The first is not to overwhelm any single IO channel. In direct-attach, this is not a problem, as the smallest pipe is a x4 SAS port at 2GB/s. In a SAN, even using 8Gbps FC, this is a concern, as one 8Gb FC port can support only 700-760MB/s. Although 10Gb FCoE seems to have higher bandwidth, it may not scale with the number of channels as well as straight FC. The new Intel Xeon E5 (Sandy Bridge EP) processors may be able to scale 10Gb FCoE with Data Direct IO (DDIO), but this needs to be verified.

The second is to ensure IO goes to every disk in the RAID group (or volume). By default, SQL Server allocates a single 64K extent from each file before round-robin allocating from the next file. This might be the reason many SAN systems generate only 10MB/s per disk (150 IOPS at 64K), along with no read-ahead.

    FileLayout_1

The -E startup flag instructs SQL Server to allocate up to 4 consecutive extents from a file before proceeding to the next file. See James Rowland-Jones, Focus on Fast Track: Understanding the -E Startup Parameter, for more on this. In a 4-disk RAID group with a 64K stripe size, a 256K IO to the file would generate a 64K IO to each disk.

    FileLayout_2

    It would be necessary to rebuild indexes before this scheme takes effect. Somewhere it was mentioned that it is important to build indexes with max degree of parallelism limited to either 4 or 8. It might be in the Microsoft Fast Track Data Warehouse Reference Architecture. Start with version 4 for SQL Server 2012, and work backwards?

    Trace Flag 1117 (-T1117) causes all files in a filegroup to grow together. See SQL Server 2008 Trace Flag -T 1117.
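For reference, a minimal sketch: -E can only be set as a startup parameter (via SQL Server Configuration Manager), but trace flag 1117 can also be enabled globally at runtime and then verified:

-- Grow all files in a filegroup together (same effect as -T1117 at startup)
DBCC TRACEON (1117, -1)
-- List the trace flags currently active globally
DBCC TRACESTATUS (-1)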

With SSD, the second objective may not be important, as the SQL Server read-ahead strategy (1024 pages?) should generate IO to all units. On hard disks, generating close-to-sequential IO was important. On SSD, it is sufficient just to generate large block IO, with 64K counting as large.

    Summary

The old concept of distributing IO over both devices and channels still applies. Recent SSD pricing is sufficiently low to warrant serious consideration ($2-3K/TB for eMLC). While there is more flexibility in SSD configuration, it is still necessary to validate performance characteristics with real queries against an actual database. SQLIO or other synthetic tests are not sufficient. If the SAN vendor advised on the configuration, then chances are IO bandwidth will not be good.

    Addendum
    If anyone thinks I am being unfair to or overly critical of SAN vendors, do the following test.
    Find the biggest table in your database, excluding LOB fields.
    Run this:
    DBCC DROPCLEANBUFFERS
    GO
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    GO
    SELECT COUNT(*) FROM TableA WITH (INDEX(0))
    GO

Then compute 8 (KB/page) x (physical reads + read-ahead reads) / (elapsed time in ms). Since KB/ms equals MB/s, the result is the effective scan rate in MB/s.
Is this closer to 700MB/s or 4GB/s? What did your SAN vendor tell you?
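As a worked example with hypothetical numbers (0 physical reads, 1,280,000 read-ahead reads, 20 seconds elapsed):

-- Hypothetical values taken from the STATISTICS IO / STATISTICS TIME output
DECLARE @physical_reads   bigint = 0,
        @read_ahead_reads bigint = 1280000,  -- 8KB pages
        @elapsed_ms       bigint = 20000
-- 8 KB/page * pages / ms = KB/ms = MB/s
SELECT 8.0 * (@physical_reads + @read_ahead_reads) / @elapsed_ms AS [MB per sec]
-- 512 MB/s in this example: garden hose territory for a modern server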

I am also not a fan of SSD caching or auto-tiering for critical databases, meaning the database that runs your business, the one managed by one or more full-time DBAs. In other applications, there may not be a way to segregate the placement of hot data from inactive data. In SQL Server, there are filegroups and partitioning; we have all the necessary means of isolating and placing hot data wherever we want it. SSD caching or auto-tiering will probably require SLC NAND. With active management using database controls, we should be able to use HET MLC or even MLC.

    I stress the importance of analyzing the complete system and how it will be used instead of over-focusing on the components. There are criteria that might be of interest when there is only a single device or even single HBA. Today it is possible to over-configure the storage performance without unwarranted expense, and this is best accomplished by watching the big picture.

Adaptec reports that their Series 7 SAS RAID Controller (72405: PCI-E gen 3 x8 on the upstream side and 6 x4 SAS 6Gbps ports), using the PMC PM8015 controller, can do 500K IOPS and 6.6GB/s.

    I will keep this topic up to date on www.qdpma.com Storage 2013

    related posts on storage:
    io-queue-depth-strategy (2010-08)
    data-log-and-temp-file-placement (2010-03)
    io-cost-structure-preparing-for-ssd-arrays (2008-09)
    storage-performance-for-sql-server (2008-03)

    ps,
If you are using SQL Server 2012 clustering on a SAN, I do suggest placing tempdb on local SSD, making use of the new 2012 feature that does not require tempdb to be on shared storage. Keep in mind that on the SAN, writes must be mirrored between two storage processors for fault recovery, and this is not a cheap thing to do. We should plan to redo whatever was using tempdb at the time of a failover anyway.
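A minimal sketch of the tempdb move (the logical names tempdev and templog are the defaults; the path is illustrative, and the change takes effect at the next instance restart):

-- Illustrative path; tempdb files are recreated at the new location on restart
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'S:\TempDB\tempdb.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'S:\TempDB\templog.ldf')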

  • Job History row limiter

There are situations where guaranteed correctness is essential, which is why we have transactional databases meeting the ACID properties. And then there are situations where we just need an approximate number. An example is the size of the SQL Agent job history log.

    I bring up this topic because on more than one occasion, after fixing performance problems at a client site, the remaining top waits are two SQL statements belonging to stored procedures in msdb: one in sp_sqlagent_log_jobhistory,

    sql agent ins

    and one in sp_jobhistory_row_limiter.

    jobhistory row limiter

The average wait times for the day of Feb 28 were 207ms and 201ms respectively. The average elapsed times over several weeks are 171ms and 170ms respectively. Neither of these shows meaningful CPU consumption, at 0.716ms and 0.124ms respectively.

    A third statement, in sp_jobhistory_row_limiter, actually has much higher CPU consumption. This statement contributes 33.5ms CPU and 39.4ms elapsed time per call over the long period.

    jobhistory row limiter

In terms of overall load on the system, these msdb procedures are negligible. A modern 2-socket system has 12-16 physical cores, so there are 288-384 CPU-hours available per day. The CPU intensive query consumes about 40 minutes per day (33.5ms/call x 71,509 calls per day), a very small portion of the available CPU.
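For anyone who wants to check their own system, a query along these lines (a sketch; the ordering and the TOP 20 cutoff are just my habit) will surface the msdb statements and their per-call averages from the plan cache:

-- Average elapsed and CPU time per execution for cached statements in msdb
SELECT TOP 20
    OBJECT_NAME(qt.objectid, qt.dbid) AS proc_name,
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
    qs.total_worker_time  / qs.execution_count / 1000.0 AS avg_cpu_ms
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
WHERE qt.dbid = DB_ID('msdb')
ORDER BY qs.total_elapsed_time DESC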

    The wait time of the 2 longer running queries may be significant at 4 hours per day each. The calls are from SQL Agent, not a live person waiting for the response, so the wait time is not a problem by itself. The high wait statement in sp_sqlagent_log_jobhistory is:

    INSERT INTO msdb.dbo.sysjobhistory ( job_id, step_id, step_name, sql_message_id, sql_severity, message, run_status, run_date, run_time, run_duration, operator_id_emailed, operator_id_netsent, operator_id_paged, retries_attempted, server)
    VALUES ( @job_id, @step_id, @step_name, @sql_message_id, @sql_severity, @message, @run_status, @run_date, @run_time, @run_duration, @operator_id_emailed, @operator_id_netsent, @operator_id_paged, @retries_attempted, @server)
    GO

    The high wait statement in sp_jobhistory_row_limiter is:

    SELECT @current_rows_per_job = COUNT(*)
    FROM msdb.dbo.sysjobhistory with (TABLOCKX)
    WHERE (job_id = @job_id)

The maximum elapsed times of sp_sqlagent_log_jobhistory and sp_jobhistory_row_limiter over the longer period are 11 and 41 seconds respectively. Potentially this might cause problems in situations that require precise timing. Of course, it is probably not a good idea to rely on SQL Agent if precise timing is required on a heavily loaded database server processing a high volume of jobs.

One might think that at just less than 1 call per second (71,509 calls per day, 86,400 seconds per day) the chance of a collision between the two statements at 0.2 sec each is small. However, these are not random calls; they fire on schedule. So if 100 jobs fire off every 2 minutes on the minute, perhaps the chance of collision is much higher?

    The sp_sqlagent_log_jobhistory procedure inserts into jobhistory, and then at the end calls sp_jobhistory_row_limiter. There will not be contention between the two procedures for a given job and session. So the issue is whether there will be collisions between different jobs. (I am not sure under what circumstances there would be multiple sessions of 1 job. Does this occur if the job does not complete within its periodicity or can one job be controlled by multiple schedulers?)

    The typical sequence of statements in sp_jobhistory_row_limiter excluding conditionals is as follows:

    EXECUTE master.dbo.xp_instance_regread -- 'JobHistoryMaxRows'

    EXECUTE master.dbo.xp_instance_regread -- 'JobHistoryMaxRowsPerJob'

    BEGIN TRANSACTION

    /* this is the high wait statement */
    SELECT @current_rows_per_job = COUNT(*)
    FROM msdb.dbo.sysjobhistory with (TABLOCKX)
    WHERE (job_id = @job_id)

    WITH RowsToDelete AS (
      SELECT TOP (@rows_to_delete) *
      FROM msdb.dbo.sysjobhistory
      WHERE (job_id = @job_id)
      ORDER BY instance_id
    )
    DELETE FROM RowsToDelete;

    /* this is the high CPU statement */
    SELECT @current_rows = COUNT(*)
    FROM msdb.dbo.sysjobhistory

    /* this is rarely called? */
    WITH RowsToDelete AS (
      SELECT TOP (@rows_to_delete) *

      FROM msdb.dbo.sysjobhistory
      ORDER BY instance_id
    )
    DELETE FROM RowsToDelete;

    IF (@@trancount > 0)
    COMMIT TRANSACTION

    The combination of the SELECT with (TABLOCKX) and the BEGIN/COMMIT TRAN will block inserts to jobhistory for the duration of the transaction. So it is quite possible there are collisions with the sp_sqlagent_log_jobhistory insert statement.

    Below is the detail for the xp_instance_regread call. For some reason the execution statistics are not available. If this is mainly called from the sp_jobhistory_row_limiter procedure, with 2 calls, then there would be 143,000 calls per day with an average wait of 12.6ms per call.

    jobhistory row limiter

    I am thinking that the reason for the BEGIN/COMMIT TRAN block in the sp_jobhistory_row_limiter procedure is to prevent the action of one session from deleting rows in the process of being inserted by another session?

Having limits on both the total number of rows for job history and for individual jobs is a good idea. But this is an example of a situation where we are not concerned about guaranteed accuracy. In this case there are 150 jobs and 260,000 rows in jobhistory, requiring 460MB of storage. If we set total jobhistory at 500,000 rows and individual job history at 1,000 rows, perhaps our intention is only to keep the table to a manageable size, not to enforce the limits exactly.
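For reference, the limits themselves are set in the SQL Agent properties dialog or, as a sketch (verify the procedure and parameter names on your build; the values are just the ones discussed above), directly:

-- Hypothetical values from the discussion above
EXEC msdb.dbo.sp_set_sqlagent_properties
     @jobhistory_max_rows = 500000,
     @jobhistory_max_rows_per_job = 1000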

We definitely do not need to check and enforce (both) jobhistory limits 71,500 times per day. One way to reduce how often the enforcement actually has to do work is to delete more than the bare excess each time; I suggest deleting at minimum 10 rows per job every time the threshold is exceeded.

I think it is silly to do a full count of the number of rows in sysjobhistory on every job execution. The SQL Server SELECT COUNT(*) statement does not cheat: it reports the actual number of rows in the table by touching the row header of each and every row. So perhaps we should use an alternative, good-enough row count, from the sys.dm_db_partition_stats row_count column for example?

I would also like to do away with the BEGIN/COMMIT TRAN block along with the TABLOCKX. Perhaps we can delete only rows more than 1 day old?

    /* be sure to make this index change before altering the procedure */
    CREATE INDEX nc1 ON sysjobhistory (job_id) INCLUDE (run_date)
    WITH (DROP_EXISTING = ON)
    GO

So my proposal for the sequence of statements in jobhistory_row_limiter is as follows:

    SELECT @today = CONVERT( int, CONVERT(char (8), GETDATE(),112))

    --BEGIN TRAN
    SELECT @current_rows_per_job = COUNT(*)
    FROM msdb.dbo.sysjobhistory -- with (TABLOCKX)
    WHERE (job_id = @job_id) AND run_date < @today

    SELECT @rows_to_delete = @current_rows_per_job - @max_rows_per_job

    IF ( @rows_to_delete > 0 ) BEGIN
/* if we only delete previous days, then this will only be active once per day */
    WITH RowsToDelete AS (
      SELECT TOP (@rows_to_delete) *
      FROM msdb.dbo.sysjobhistory
      WHERE (job_id = @job_id) AND run_date < @today
      ORDER BY instance_id
    )
    DELETE FROM RowsToDelete;

    END

    /* do this first */
    SELECT @current_rows = row_count
    FROM sys.dm_db_partition_stats d WITH(NOLOCK)
    WHERE object_id = OBJECT_ID('sysjobhistory') AND index_id < 2

    /* only if the above is above threshold */
    IF (@current_rows > @max_total_rows) BEGIN

    SELECT @current_rows = COUNT(*)
    FROM msdb.dbo.sysjobhistory with (NOLOCK)
    WHERE run_date < @today

    SELECT @rows_to_delete = @current_rows - @max_total_rows

    IF ( @rows_to_delete > 0 ) BEGIN
    WITH RowsToDelete AS (
      SELECT TOP (@rows_to_delete) *
      FROM msdb.dbo.sysjobhistory
      WHERE run_date < @today
      ORDER BY instance_id
    )
    DELETE FROM RowsToDelete;
    -- ALTER INDEX ALL ON msdb.dbo.sysjobhistory REBUILD?
    END

    END

    --COMMIT TRAN

Even though the overall impact of the jobhistory limit procedures is insignificant, it is somewhat annoying to have them show up at the top of the list in a third party performance monitoring tool. With a few simple precautions, they can be eliminated.

    ps
If anyone is willing to alter the jobhistory row limiter procedure,
please let us know the changes made along with the before and after execution statistics.

I should also add that the sysjobhistory table will get heavily fragmented over time, especially since job_id is a uniqueidentifier, so periodic index rebuilds are probably in order. Perhaps our strategy should be: when the max total rows is exceeded, delete a larger percentage, say 20%, then rebuild the index (outside the transaction, if one is present).
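A minimal sketch of that maintenance step (check fragmentation, then rebuild; run it outside the row-limiter transaction):

-- Check fragmentation of the sysjobhistory indexes
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID('msdb'),
     OBJECT_ID('msdb.dbo.sysjobhistory'), NULL, NULL, 'LIMITED')

-- Rebuild after the bulk delete
ALTER INDEX ALL ON msdb.dbo.sysjobhistory REBUILD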

pps - I may have inadvertently deleted a comment? - oops! - sorry

  • EMC VNX2 and VNX Future

    Update 2013-10: StorageReview on EMC Next Generation VNX
    Update 2013-08: News reports that VNX2 will come out in Sep 2013

While going through the Flash Memory Summit 2012 slide decks, I came across the session Flash Implications in Enterprise Storage Designs by Denis Vilfort of EMC, which provided information on performance of the CLARiiON, VNX, a VNX2 and a VNX Future.

A common problem with SAN vendors is that it is almost impossible to find meaningful performance information on their storage systems. The typical practice is to cite some meaningless numbers, like IOPS to cache or the combined IO bandwidth of the FC ports, conveying the impression of massive IO bandwidth while actually guaranteeing nothing.

    VNX (Original)

    The original VNX was introduced in early 2011? The use of the new Intel Xeon 5600 (Westmere-EP) processors was progressive. The decision to employ only a single socket was not.

    EMC VNX

    EMC did provide the table below on their VNX mid-range systems in the document "VNX: Storage Technology High Bandwidth Application" (h8929) showing the maximum number of front-end FC and back-end SAS channels along with the IO bandwidths for several categories.

    EMC VNX

    It is actually unusual for a SAN storage vendor to provide such information, so good for EMC. Unfortunately, there is no detailed explanation of the IO patterns for each category.

    Now obviously the maximum IO bandwidth can be reached in the maximum configuration, that is with all IO channels and all drive bays populated. There is also no question that maximum IO bandwidth requires all back-end IO ports populated and a sufficient number of front-end ports populated. (The VNX systems may support more front-end ports than necessary for configuration flexibility?)

However, it should not be necessary to employ the full set of hard disks to reach maximum IO bandwidth, because SAN systems are designed for capacity and IOPS. There are Microsoft Fast Track Data Warehouse version 3.0 and 4.0 documents for the EMC VNX 5300 and 5500 systems. Unfortunately Microsoft has backed away from bare table scan tests of disk rates in favor of a composite metric, but the documents do seem to indicate that 30-50MB/s per disk is possible in the VNX.

    What is needed is a document specifying the configuration strategy for high bandwidth specific to SQL Server. This includes the number and type of front-end ports, the number of back-end SAS buses, the number of disk array enclosures (DAE) on each SAS bus, the number of disks in each RAID group and other details for each significant VNX model. It is also necessary to configure the SQL Server database file layout to match the storage system structure, but that should be our responsibility as DBA.

It is of interest to note that the VNX FTDW reference architectures do not employ FAST Cache (flash caching) or (auto) tiered storage. Both of these are an outright waste of money on DW systems and actually impede performance. It does make good sense to employ a mix of 10K/15K HDDs and SSDs in the DW storage system, but we should use the SQL Server storage engine features (filegroups and partitioning) to place data accordingly.

A properly configured OLTP system should also employ separate HDD and SSD volumes, again using filegroups and partitioning to place data correctly. The reason is that the database engine itself is a giant data cache, with perhaps as much as 1000GB of memory. What do we really expect to be in the 16-48GB SAN cache that is not already in the 1TB database buffer cache? The IO seen from the database server is likely to be very misleading in terms of what data is important and whether it should be on SSD or HDD.
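A minimal sketch of the filegroup approach (database, file names, paths and sizes are all illustrative): hot data goes to a filegroup on the SSD volumes, history to a filegroup on HDD.

-- Illustrative only: separate filegroups on SSD and HDD volumes
ALTER DATABASE MyDB ADD FILEGROUP FG_SSD
GO
ALTER DATABASE MyDB ADD FILE
    (NAME = Hot1, FILENAME = 'S:\MSSQL\Data\MyDB_hot1.ndf', SIZE = 50GB)
    TO FILEGROUP FG_SSD
GO
ALTER DATABASE MyDB ADD FILEGROUP FG_HDD
GO
ALTER DATABASE MyDB ADD FILE
    (NAME = Hist1, FILENAME = 'H:\MSSQL\Data\MyDB_hist1.ndf', SIZE = 500GB)
    TO FILEGROUP FG_HDD
GO
-- Place the hot table on SSD and the history table on HDD
CREATE TABLE dbo.OrdersCurrent (OrderID int NOT NULL, OrderDate datetime NOT NULL) ON FG_SSD
CREATE TABLE dbo.OrdersHistory (OrderID int NOT NULL, OrderDate datetime NOT NULL) ON FG_HDD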

    CLARiiON, VNX, VNX2, VNX Future Performance

Below are performance characteristics of the EMC mid-range systems: CLARiiON, VNX, VNX2 and VNX Future. This is why I found the following diagrams so interesting and noteworthy. Here, the CLARiiON bandwidth is cited as 3GB/s and the current VNX as 12GB/s (versus 10GB/s in the table above).

    EMC VNX

I am puzzled that the VNX is only rated at 200K IOPS. That would correspond to 200 IOPS per disk for 1000 15K HDDs at low queue depth. I would expect there to be some capability to support short-stroking and high queue depth to achieve greater than 200 IOPS per 15K disk. The CLARiiON CX4-960 supported 960 HDDs, yet the IOPS cited corresponds to the queue depth 1 performance of only 200 disks (200 IOPS x 200 HDD = 40K). Was there some internal issue in the CLARiiON? I do recall a CX3-40 generating 30K IOPS over 180 x 15K HDDs.

A modern SAS controller can support 80K IOPS, so the VNX 7500 with 8 back-end SAS buses should handle more than 200K IOPS (HDD or SSD), perhaps as high as 640K? So is there some limitation in the VNX storage processor (SP), perhaps the inter-SP communication? Or a limitation of the write cache, which requires writes to memory in both SPs?

    VNX2?

Below (I suppose) is the architecture of the new VNX2. (Perhaps VNX2 will come out in May at EMC World?) In addition to transitioning from Intel Xeon 5600 (Westmere) to E5-2600 series (Sandy Bridge EP) processors, the diagram indicates that the new VNX2 will be dual-socket, instead of single socket across the entire line of the original VNX. Considering that the 5500 and up are not entry systems, the original choice was disappointing.

    EMC VNX

VNX2 provides a 5X increase in IOPS to 1M and a 2.3X increase in IO bandwidth to 28GB/s. LSI mentions a FastPath option that dramatically increases the IOPS capability of their RAID controllers from 80K to 140-150K IOPS. My understanding is that this is done by completely disabling the cache on the RAID controller. The resources to implement caching for a large array of HDDs can actually impede IOPS performance, hence caching is even more degrading on an array of SSDs.

    The bandwidth objective is also interesting. The 12GB/s IO bandwidth of the original VNX would require 15-16 FC ports at 8Gbps (700-800MBps per port) on the front-end. The VNX 7500 has a maximum of 32 FC ports, implying 8 quad-port FC HBAs, 4 per SP.

The 8 back-end SAS buses imply 4 dual-port SAS HBAs per SP, as each SAS bus requires 1 SAS port on each SP? Together with the FC HBAs, that would make 8 HBAs per SP? The Intel Xeon 5600 processor connects over QPI to a 5520 IOH with 36 PCI-E gen 2 lanes, supporting 4 x8 and 1 x4 slots, plus a x4 gen 1 link for other functions.

    In addition, a link is needed for inter-SP communication. If one x8 PCI-E gen2 slot is used for this, then write bandwidth would be limited to 3.2GB/s (per SP?). A single socket should only be able to drive 1 IOH even though it is possible to connect 2. Perhaps the VNX 7500 is dual-socket?

An increase to 28GB/s could require 40 x 8Gbps FC ports (if 700MB/s is the practical limit of 1 port). A 2-socket Xeon E5-2600 should be able to handle this easily, with 4 memory channels and 5 x8 PCI-E gen 3 slots per socket.

    VNX Future?

The future VNX is cited as 5M IOPS and 112GB/s. I assume this might involve the new NVM Express driver architecture supporting distributed queues and high parallelism. Perhaps the reason both VNX2 and VNX Future are described is that the basic platform is ready, but not all the components needed to support the full bandwidth are?

    EMC VNX

The 5M IOPS should be no problem with an array of SSDs, and the new NVM Express architecture of course. But the 112GB/s bandwidth is curious. The number of FC ports, even at a future 16Gbit/s, is too large to be practical. When the expensive storage systems are finally able to do serious IO bandwidth, it will also be time to ditch FC and FCoE. Perhaps the VNX Future will support InfiniBand? The purpose of having extreme IO bandwidth capability is to be able to deliver all of it to a single database server on demand, not a little dribble here and there. If not, then the database server should have its own storage system.

The bandwidth is also too high for even a dual-socket E5-2600. Each Xeon E5-2600 has 40 PCI-E gen 3 lanes, enough for 5 x8 slots. The nominal bandwidth per PCI-E gen 3 lane is 1GB/s, but the realizable bandwidth might be only 800MB/s per lane, or 6.4GB/s per x8 slot. A 2-socket system in theory could drive 64GB/s. The storage system is comprised of 2 SPs, each SP being a 2-socket E5-2600 system.

To support 112GB/s in total, each SP must be able to simultaneously move 56GB/s on the storage side and 56GB/s on the host-side ports, for a total of 112GB/s through each SP. In addition, suppose the 112GB/s bandwidth is for reads and that the write bandwidth is 56GB/s. Then it is also necessary to support 56GB/s over the inter-SP link to guarantee write-cache coherency (unless it has been decided that write caching flash on the SP is stupid).

Is it possible the VNX Future has more than 2 SPs? Perhaps each SP is a 2-socket E5-4600 system, with the 2 SPs linked via QPI? Basically this would be a 4-socket system running as 2 separate nodes, each node having its own OS image. Or is it simply a 4-socket system? Later this year, Intel should be releasing Ivy Bridge-EX, which might have more bandwidth? Personally I am inclined to prefer a multi-SP system over a 4-socket SP.

Never mind: I think Haswell-EP will have 64 PCI-E gen 4 lanes at 16GT/s. That is 2GB/s per lane raw, and 1.6GB/s per lane net, 12.8GB/s per x8 slot and 100GB/s per socket. I still think it would be a good trick if one SP could communicate with the other over QPI instead of PCI-E. Write caching SSD at the SP level is probably stupid if the flash controller is already doing this? Perhaps the SP memory should be used for SSD metadata? In any case, there should be coordination between what each component does.

    Summary

    It is good to know that EMC is finally getting serious about IO bandwidth. I was of the opinion that the reason Oracle got into the storage business was that they were tired of hearing complaints from customers resulting from bad IO performance on the multi-million dollar SAN.

My concern is that the SAN vendor field engineers have been so thoroughly indoctrinated in the "Storage as a Service" concept, in which only capacity matters, with zero knowledge of bandwidth, that they will not be able to properly implement the IO bandwidth capability of the existing VNX, not to mention the even higher bandwidth of the VNX2 and VNX Future.

    Updates will be kept on QDPMA Storage.

  • The Path to In-Memory Database Technology

The term in-memory database can be subject to misinterpretation. It was originally used to describe a storage engine designed for the memory access characteristics of modern microprocessors, not simply a database stored in memory. Today it is common for a database to reside almost entirely in the buffer cache, i.e., the memory of a traditional relational DBMS, but this is very different from an in-memory database as just defined. As Microsoft recently announced that the next version of SQL Server will incorporate in-memory database technology under the Hekaton codename, it is worthwhile to revisit in more detail the difference between the original disk-storage-oriented databases and in-memory databases, along with the differences in computer system architecture between then and now that drove the change in the database storage engine.

    The First Relational Databases

Relational databases originated from the papers of Edgar Codd published from 1970 on. Oracle may have had the first commercial product. A group at UC Berkeley (Stonebraker and Wong) built INGRES, from which Sybase and later SQL Server descended. Ingres was developed on a DEC PDP-11, a popular mini-computer system at the time (16-bit integers/registers). The Design and Implementation of INGRES paper by Stonebraker, Wong and Held, ACM 1976, mentions support for UNIX on the PDP-11/40, 45 and 70 models. The 11/40 could support a process address space of 64K, and 128K on the 11/45 and 11/70 models.

    The main element was for a database engine to make best use of limited memory to complete a query with minimal disk IO.

    Computer System Architecture Evolution

    DEC PDP-11

The DEC PDP-11 came out in 1970 at a relatively low price-point, such that it was a very popular system in university environments. The Spring Joint Computer Conference 1970 paper A new architecture for mini-computers - The DEC PDP-11 cites a $5000-10K price target. This may have been why one happened to be available for the original Ingres development project. The PDP-11 Handbook lists the PDP-11/10 as having 1,024 words of 16-bit read-only memory and 128 words of read-write memory. The PDP-11/20 model has 4,096 words of 16-bit read-write (magnetic) core memory. The max data transfer rate on the Unibus was one word every 750ns. Core memory had a 1.2µs cycle time and 500ns access time.

    Wikipedia lists the history of the PDP-11 Unibus models as:

    • PDP-11/20 and 11/15: original with core memory, non-microprogrammed
    • PDP-11/35 and 11/40: with microprogramming (1972?)
    • PDP-11/45, 50 and 55: up to 256KB semiconductor memory (1971?)
    • PDP-11/70: up to 4MB memory and 2KB cache (1975)

Microsoft Research now seems to be the repository of DEC material under the Gordon Bell section, including the DEC Museum and 16-bit timeline. The timeline information from Wikipedia and Microsoft Research does not appear to be entirely consistent. Either it is difficult to interpret the surviving documents or people's recollections of this era are fading.

    DEC VAX 11/780

There is more information on the next generation DEC VAX-11/780, the first 32-bit mini-computer. This system came out in 1977. See the VAX-11/780 Hardware User's Guide and VAX Product Sales Guide for details. Also search for the VAX-11/780 Architecture Handbook from Carnegie Mellon ECE. The CPU was built with TTL, had a 200ns clock and an 8KB cache. No transistor count is cited? The VAX-11/780 pricing was between $210K and $350K?

The system was described as 1 MIPS, but that was because the performance was roughly comparable to an IBM system (370/158-3?) that was accepted as 1 MIPS. It turned out the VAX-11/780 executed 0.5M native instructions per second to deliver performance equivalent to the IBM 1 MIPS. John McCallum's jcmit cites the IBM 370/158-3 as 0.73 MIPS and the VAX-11/780 as 1 MIPS.

    VAX System

The CPUs of this time were very limited in the number of transistors, and could directly implement only basic operations. It would not have been feasible for compiled binaries to be built entirely on such basic instructions. The native VAX (or PDP-11) instruction set was comprised of complex instructions, which were translated by a set of microprogrammed instructions (microcode) into the basic operations? The presumption, based on 0.5 VAX MIPS and the 5MHz clock, is that the average VAX instruction decomposes into 10 basic operations, or rather clock cycles, accounting for memory access time?

    The memory system contains one or two memory controllers. Each controller can handle 1 to 16 arrays. The memory array has a cycle time of 600ns. A memory controller buffers one command while processing another. The memory controllers can be interleaved.

    Cache access time is 200ns, basically 1-cycle access. Memory cycle time is 600ns. Read access time at the processor is 1800ns. Effective average operand access time is 290ns.

    The first systems used 4Kbit DRAM supporting a maximum system memory of 2MB, in increments of 128K.

    VAX System

    Later systems used 16Kbit DRAM, supporting up to 8MB memory, in 256K increments. Minimum memory was cited as 128K and 256K in the 1977 and 1979 handbooks, but later documents cited minimum memory of 1M?

    VAX System

If we do the math, we can work out that, excluding the overhead for ECC, 4,096 chips are required for 8MB at 16Kbit per DRAM. The VAX-11/780 has a 72-bit memory path comprised of a 64-bit word with 8 bits for ECC.

By comparison, a modern server system supports 1TB memory over 64 DIMM sockets with 16GB DIMMs. There are 36 chips on each 16GB DIMM (32 for data, 4 for ECC) at 4Gbit per chip. The DRAM package could be a single or double die package (DDP). So the system could have up to 2048 chips plus 256 for ECC.

    Microprocessors

Over the course of time, computer systems transitioned to single chip microprocessors. The low-end systems transitioned first to realize the cost benefits of a lower part count. Eventually high-end systems transitioned to microprocessors as well, because chip-to-chip signal delays do not scale with the improving transistor performance within a chip.

    Pipelining

The next step in microprocessor architecture was pipelined execution. A complete single instruction is comprised of a sequence of many operations. By dividing the sequence into many steps, the clock rate for completing a single step can be higher than for the whole instruction. By allowing a sequence of instructions to overlap, one instruction can be completed each clock cycle with pipelining. Wikibooks Microprocessor Design/Pipelined Processors has excellent illustrations of pipelining. The time to execute a single full instruction is still several clock cycles.

    Pipeline
    Pipeline

    The Intel 80486 (1989) has a 5-stage pipeline: fetch, decode1, decode2 (effective address), execute, and write-back.

The Pentium (1993) pipeline stages are: Prefetch, Fetch (MMX only?), D1 Instruction Decode, D2 Address Generate, Execute, Writeback. So that makes 5 stages for the original Pentium and 6 for the MMX?

    Intel is curiously vague on the exact number of pipeline stages for the Pentium Pro to Pentium III, collectively known as the P6 line. The later Pentium M could be an improved P6, but is also called a new design. It might be because the actual number of stages varies with the instruction? The Pentium III has been cited as 10 stages, and the Pentium Pro (P6) could be the same. The later Pentium III processors may have added a (prefetch) stage purely to account for the time to access L1 cache as core frequency increased with process generation and maturity.

The first Pentium 4 processors (Willamette and Northwood) have 20 stages; the second generation Prescott has 31 stages.

    The diagram below is from "The Microarchitecture of the Pentium 4 Processor", Intel Technology Journal Q1 2001 showing 10 stages for a basic Pentium III, and 20 stages for the 180nm and 130nm Pentium 4s, Willamette and Northwood.

    Pipeline

    In other documents, I have P6 as:
    IFU1, IFU2, IFU3, Dec1, Dec2, RAT, ROB, Dis, EX, Ret1, Ret2.

The Core 2 (Conroe 65nm, Penryn 45nm; there were other codenames for server and mobile) has 14 stages. The Core 2 brand name was later changed to Core, even though pre-Core 2 processors had already been sold with Core (Duo and Solo) as the brand name. The difficult decisions that marketing pukes must make.

    Superscalar

The next significant innovation was superscalar execution, where a microprocessor can complete several instructions in parallel each clock cycle. The Intel Pentium had a limited 2-wide superscalar design. The Pentium Pro had a more broadly usable 3-wide design. The superscalar execution units typically have specialized uses, so it is not always possible to complete an instruction on every unit in each cycle.

    microarchitecture

    The Intel Pentium 4 is shown with 4 ports, 2 for execution, 1 Load and 1 Store port. I recall the Pentium 4 as 3-wide, which might be the maximum throughput of the ports.

    The Intel Core microarchitecture (Conroe/Penryn) is described as 4 instructions per clock cycle versus 3 in previous architectures. The diagram shows 5 units, 3 for different aspects of ALU, FP and vector, 1 Load and 1 Store. Also mentions 14-stage pipeline.

    The Intel Nehalem is shown in IDF 2008 with 6 execution units, 3 for integer, FP and vector, 1 Load, 1 Store Address and 1 Store Data. The Intel Sandy-Bridge is shown with 6 execution units, 3 for integer, FP and vector, 2 for Load/Store Address and 1 Store Data.

The Intel IDF Fall 2012 presentation on Haswell shows 8 units: 4 integer ports, of which 3 can do vector and 2 can do FP, plus 2 Load/Store Address, 1 Store Data and 1 Store Address.

    Million Instructions Per Second MIPS

Technically, an instruction on one system architecture has no inherent correlation to an instruction on a different system architecture, so there should be no correlation between MIPS on one system and another. But people need or want to compare systems, and MIPS had already become popular, so MIPS was standardized, first as Whetstone (which contains floating-point), and later Dhrystone (no floating-point). One Dhrystone MIPS is the performance of the VAX-11/780, rather than 1 million actual instructions per second.

There are minor inconsistencies between the MIPS figures from various sources. The table below is mostly from Wikipedia, Instructions per second. The last two items are multi-core processors; the MIPS rating is for all cores, but the Dhrystone IPS/clock is per core. Another broad compilation is jcmit.

Processor | MHz | MIPS | D IPS/clock per core | Year
Intel 4004 | 0.740 | 0.092 | 0.1 | 1971
IBM 370 158-3 | 8.69 | 1 (at 8.69MHz) | 0.1 | 1972
Intel 8080 | 2 | 0.33 (not Dhrystone) | 0.165 | 1974
VAX-11/780 | 5 | 1 (Dhrystone) | 0.2 | 1977
Intel 80286 | 12.5 | 2.66 | 0.2 | 1982
Intel 80386 | 33 | 9.9 | 0.3 | 1985
Intel 80486 DX2 | 66 | 54 | 0.8 | 1992
Intel Pentium | 100 | 188 | 1.88 | 1994
Intel Pentium Pro | 200 | 541 | 2.7 | 1996
Intel Pentium III | 600 | 2054 | 3.4 | 1999
Intel Pentium 4EE | 3200 | 9726 | 3.0 | 2003
Intel Core 2 (2c) | 2930 | 27079 | 4.6 | 2006
Intel Core i7 920 (4c) | 2660 | 82300 | 7.7 | 2008

Notice the sharp rise in IPS per clock between the Intel 80386 (non-pipelined) and the 80486 DX2 (pipelined), to nearly 1 per clock. Presumably the main contributor is the 8K (unified) on-die cache of the 80486, and the 8K data plus 8K instruction caches of the Pentium. The high-end 486 and Pentium systems of this period also had off-die L2 cache. I do not recall if off-die cache was common for 386 systems.

    Thereafter, IPS/clock is greater than 1 with the advent of super-scalar execution. Both the Pentium Pro and Pentium III are 3-wide, so the increase in IPC might be due to the SIMD capability of the Pentium III. The Pentium 4 gave up a degree of IPC on the very deep pipeline to achieve extraordinarily high clock rates. The Core 2 was 5-wide? The Core i7 is 5-wide but also has hyperthreading. The latest Sandy-Bridge is 6 wide?

    Intel provides MIPS rating of their earlier processors up to Pentium in List of Intel microprocessors

Processor | MHz | MIPS | transistors | process | Year
Intel 4004 | 0.740 | 0.07 | 2,300 | 10µm | 1971
Intel 8080 | 2 | 0.29 | 6,000 | 6µm | 1974
Intel 8086 | 5 / 8 | 0.33 / 0.66 | 29,000 | 3µm | 1978
Intel 80286 | 6 / 12.5 | 0.9 / 2.66 | 134,000 | 1.5µm | 1982
Intel 80386DX | 16 / 33 | 5 / 9.9 | 275,000 | 1µm | 1985 / 1989
Intel 80486DX | 25 / 50 | 20 / 41 | 1.2M | 1µm / 0.8µm | 1989 / 1991
Intel 80486DX4 | 75 / 100 | 53 / 70.7 | 1.6M | 0.6µm | 1994
Intel Pentium (P5) | 60 / 66 | 100 / 112 | 3.1M | 0.8µm | 1993
Intel Pentium (P54) | 90 / 100 | 149.8 / 166.3 | 3.2M | 0.6µm | 1994
Intel Pentium (P54CS) | 133 / 166 | 218.9 / 247 | 3.3M | 0.35µm | 1995 / 1996

    Memory

A complete DRAM history is more difficult to trace, with the primary manufacturers changing over time. Wikipedia is generally a good starting point; the Dynamic random access memory article cites the DRAM Design Overview from Stanford University by Junji Ogawa.

Year | DRAM | Density | tRAC
1971 | Intel 1103 | 1K | 300ns
1973 | TI TMS403 | 4K | ?
1977 | Mostek MK4116 | 16K | 250ns
1980 | ? | 64K | 200ns
1983 | ? | 256K | 150ns
1986 | ? | 1M | ?
1989 | ? | 4M | 80ns
1992 | ? | 16M | ?
1995 | ? | 64M | ?

    DRAM timing is particularly difficult to understand, more so with the radical change from asynchronous (FPM and EDO) DRAM to synchronous SDRAM, and DDR timings. Wikipedia SDRAM latency provides the diagram below.

    microarchitecture

Other references are AnandTech's everything-you-always-wanted-to-know-about-sdram-memory-but-were-afraid-to-ask and Ulrich Drepper's What every programmer should know about memory.

    The aspect of focus here is memory access latency. This element was generally quoted for asynchronous DRAM products. After the change to synchronous DRAM, the industry emphasis changed to bandwidth timings.

    The last of the FPM and EDO DRAM products were available with 50ns access times, but 60ns products were more common. Perhaps the 50ns access time required cherry picking from a normal production run?

Today, the best DDR3 may have an access time of 25-30ns at the DRAM chip. Local memory access time at the processor (with integrated memory controller) is on the order of 50ns? The difference is due to signal transmission from processor to memory and back. On server systems using registered memory, there may be a buffer chip between processor and DRAM. On multi-processor (socket) systems, access to remote node memory may be over 95ns to an adjacent node and 150ns+ for 2-hop distances?

DDR transfers data on both edges of the clock, i.e., at double the clock rate. Internally, DRAM is now organized into multiple banks in order to sustain data transfers at a very high rate.

The entire discussion above pertains to mainstream DRAM, which emphasizes cost relative to capacity first, followed by bandwidth, with the expectation that a computer system will be comprised of many DRAM chips. For example, a recent generation personal computer will have 2 memory channels, each 64 bits wide. The DRAM components are organized as x8, providing an 8-bit data path, so there are 8 chips to form a 64-bit channel, and the minimum system memory has 16 chips.

There are specialty DRAM products designed around different requirements. Graphics DRAM is designed for high bandwidth on the assumption that the memory system will be comprised of few chips. Consider a graphics subsystem that needs only 1GB comprised of 1Gbit chips. The desired bandwidth might require a 256-bit path, so GDDR DRAM is often organized wider, with x32 being popular.

    Another specialty is reduced latency DRAM for network systems. These systems do not require monstrous system memory capacity, but do need super low latency to support fast turn-around time for high-speed networking, in the 10-40Gbit/s range. A Micron RLDRAM document mentions tRC of 8ns versus 46-52ns for DDR3?

It has been my opinion that server system memory has long since become out of balance with the original concept of system main memory. The latency has become too long for main memory. Today most memory is used for caching of one form or another, including the database buffer cache. The time is right to split computer system memory. There should be a smaller memory subsystem emphasizing very low latency, not just with specialty DRAM, but also with physical proximity, perhaps in the same package as the processor. A separate larger subsystem can continue to implement bulk DRAM, tolerating longer latency.

It has long been known that memory access latency cannot keep up with the microprocessor. The Intel server microprocessor clock rates have settled into the 3GHz range, with further progress emphasizing the number of superscalar execution ports and the number of cores on a single die (or socket). For a 3GHz processor and 50ns local node access, memory is now 150 CPU clock cycles away, and 300+ cycles for remote node memory access.

    Micron and other memory companies have formed the Hybrid Memory Cube consortium, proposing a radical re-architecture of the memory system. See Hot Chips HC23 Hybrid Memory Cube (HMC). by J. Thomas Pawlowski, Micron and High-Performance Memories for Packet Processing.

On the VAX-11/780, the CPU clock was 5MHz, or 200ns cycle time, but a complete instruction averaged 10 cycles. DRAM access time was 250ns, 600ns to the memory controller and 1800ns to the processor. This was before the advent of SIMM and DIMM technology; the processor, memory controller and memory were all on separate boards, with long signal delays. So essentially, memory access time was comparable to the time to complete one VAX (complex?) instruction.

A single Intel Sandy Bridge core can complete 6 instructions per clock cycle if there are no memory stalls. The key to modern microprocessor performance is an effective cache strategy to hide memory latency. This can be successful if there is locality or if memory can be prefetched, ideally 150+ cycles before it is needed. An alternative strategy is sequential memory access, to make use of the high memory bandwidth of modern systems.

Year | Instructions per clock | CPU clock | Effective ns per instruction | Memory access
1978 | 0.1 | 200ns | 2000 | 1800ns
2012 | up to 6 | 0.33ns | 0.055 | 50ns (local), 100ns (1 hop)

Summarizing, the CPU clock was faster than memory access even back in 1978. However, the CPU was also a very simple device that required 10 full cycles to complete a (microprogrammed) instruction, so the net result was that instruction time was comparable to memory access time. Today, a single core is capable of completing 6 instructions per clock, on top of the 150-to-1 ratio between local memory access latency and the CPU clock. The decisions made thirty years ago for good reasons no longer hold today. The current nature of computer system architecture points to a completely different strategy, given the long latency of memory access.

    The modern microprocessor is designed to operate with pipelining and superscalar execution. There should be multiple independent instructions that can be executed on each clock. Furthermore, instructions executed in one clock should not have intractable dependencies on instructions in the immediately preceding clock cycles.

The most difficult code for modern microprocessors is pointer chasing, where a memory access retrieves the next memory location to be accessed. If the memory address is not in cache, then the access time to DRAM is over 150 CPU cycles, during which the processor core has nothing to do. Once the memory is accessed, it provides the address of the next memory fetch. Unfortunately, this code sequence just happens to describe a b-tree search.

    Modern Computer Architecture and Memory

    Page and Row Storage Organization

Having covered the computer system architecture transitions from 1970 to 2012, including the processor core and the memory system, it is appropriate to return to the original relational database implementation. The following diagram is from The Design and Implementation of INGRES by Stonebraker, Wong and Held, ACM 1976.

    SQL Page

    The page and row storage organization from Ingres in the 1970s is still in use today. The diagrams below are from Microsoft MSDN Understanding Pages and Extents

    SQL Page

    and SQLSkills Inside the Storage Engine: Anatomy of a page

    SQL Page

    Now examine the sequence of operations to access rows and columns with page-row storage, with consideration for whether memory access operations are in cache, or can be prefetched.

Assume that we have already found the sequence of rows required by a query from an index. The information we have for each row is the file_id, page_id, and row (slot) number.

    • 1) Check if the page is in the SQL Server buffer cache.
      Also the OS must check if the page is in memory (unless lock pages in memory is in effect)
    • 2) Acquire a shared lock or latch on the page (table and row locks)
    • 3) Read the page header
    • 4) Read the 2-byte row offset at the end of the page
    • 5) Read the row/record header
    • 6a) Fixed column loads: Address = row offset + column offset
    • 6b) Nullable columns: Load NULL bitmap, calculate offset, load?
    • 6c) Variable length: follow the chain of column offset to the desired column?

    Comments:
1) The cost of the page-in-cache check could be as high as 1000 CPU cycles? This is based on a series of table scan tests I did for varying numbers of rows per page, with the lock pages in memory permission on. The OS check could be equally expensive. One of Thomas Kejser's slides from SQLBits mentions that the lock pages in memory performance impact could be significant.
    Note to storage vendors: this is why the claim that caching solves IO performance problems is totally stupid.
    2) It is necessary to place some kind of lock or latch on the page even if nolock or tablock is applied on the query. This is so the SQL Server storage engine knows that the page cannot be evicted from the buffer cache while being read.
4) The reason that the row offsets and the actual row data are filled in from opposite ends of the page is to improve storage efficiency. With nullable or variable length data, it is not known how many rows will fit in any given page. This requires non-sequential memory access patterns.
5) One might think that in a SELECT COUNT(*) query we could just read the m_slotCnt value in the page header, or count the 2-byte row offset values at the end of the page, but apparently SQL Server actually reads the row header for each row.
    6) Fixed length non-nullable columns are the least effort because the column offset is known ahead of time and the same for each row. One of the recent SQL Server versions improved the handling of nullable columns by having a bitmask for all columns in each row, which simplifies the process of determining the offset?

Variable length columns are more difficult? I think we have to go to the first variable length column, read its length to find the next one, and so on until we reach the desired column. It would be nice to see the source code for this. Perhaps someone would be helpful in examining the code of one of the open source databases?

    There are also cycles expended to handle conversion from raw bytes to the correct data type and special SQL Server rules.
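For anyone who wants to look at the page header, the slot (row offset) array and the row headers directly, here is a sketch using the undocumented DBCC IND and DBCC PAGE commands (the database, table and page numbers are illustrative; do this on a test system):

DBCC TRACEON (3604)                    -- send DBCC output to the client
DBCC IND ('MyDB', 'dbo.TableA', 1)     -- list the pages of index 1 (the clustered index)
-- pick a PageFID/PagePID pair from the output, e.g. file 1, page 5648
DBCC PAGE ('MyDB', 1, 5648, 3)         -- print option 3: page header plus per-row detail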

A couple of years ago, I proposed an extension to the Intel vector instructions in SIMD Extensions for the Database Storage Engine to facilitate database page-row-column address calculation. This would require working out the details of the new instructions, and getting Intel to implement them in a processor still early in its design stage. I suspect that it would also be necessary to change the way metadata is stored to facilitate loading into the vector registers. It would take 2-3 years for the new processor to enter production, and another 2-3 years before the new technology was broadly deployed. Of course, all of this should have been started ten years ago when CPU frequency went over 1GHz.

    I ran a test on a series of tables with a range of rows per page from 1 to 476. The queries consisted of table scans, first getting just a count of rows and then aggregating successive columns. The first three systems are 2-socket, running SQL Server 2008R2 on Windows Server 2008R2. The last system is single socket, running SQL Server 2012 on Windows Server 2012.
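    The queries were of this general form; a minimal sketch with hypothetical table and column names (the method used to vary rows per page is not shown):

    SET STATISTICS TIME ON;
    SELECT COUNT(*)                        FROM dbo.TestRows;   -- page + row cost only
    SELECT COUNT(*), SUM(col1)             FROM dbo.TestRows;   -- adds the first column
    SELECT COUNT(*), SUM(col1), SUM(col2)  FROM dbo.TestRows;   -- adds a second column
    -- CPU time deltas between these, divided by the pages/rows/columns touched,
    -- give the per-page, per-row and per-column costs shown below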

    The table below shows, in CPU-nanoseconds, the cost per page (including 1 row) for the count query, the cost for each additional row, and then the additional cost for the first and each additional column aggregated.

    System           Table  Page w/1 row  Additional row  First column  Additional column
    X5670 2.93GHz    Heap   720           58              85            38
    X5670 2.93GHz    Clust  810           58              84            39
    X5660 2.80GHz    Heap   630           60              109           43
    X5660 2.80GHz    Clust  760           64              97            45
    E5-2690 2.9GHz   Heap   640           52              72            34
    E5-2690 2.9GHz   Clust  700           48              74            34
    E3-1270 3.4GHz   Heap   560           47              80            35
    E3-1270 3.4GHz   Clust  656           45              68            30

    Given that all processor cores were around 3GHz, the CPU-cycles for each page are in the range of 2000 cycles, with each row and column contributing another 150 or so cycles. When the first row is accessed, i.e., reading the last 2 bytes of an 8KB page, the entire 64-byte cache line comprising 32 row offset values would be read into cache. The approximately 150-cycle (about 50ns) cost per row corresponds to a single memory access for the row header, with the row offset most likely already in cache.

    The tests compared column accesses in sequence. The single column aggregate is on the second column. The two column aggregate is on the second and third columns, which should be stored in adjacent bytes. There is some indication that pairs of columns cost marginally more than a single column, but the cost of 100+ cycles per successive column seems high. Is it the type conversion, or the interpreted code?

    My standard SQL Server configuration has the lock pages in memory right assigned, as this is required for trace flag 834: use large-page allocations for the buffer pool. I was not aware of Thomas Kejser's report that lock pages in memory by itself can have a significant performance impact. If possible, I will re-run the above tests with and without lock pages in memory.
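    For reference, trace flag 834 can only be set as a startup parameter (-T834); a quick sketch of verifying the configuration on a running instance:

    -- Check whether trace flag 834 (large-page buffer pool allocations) is active globally.
    DBCC TRACESTATUS (834, -1);
    -- On 2008R2/2012, confirm lock pages in memory by searching the error log for
    -- "Using locked pages" or "Using large pages in the memory manager".
    EXEC xp_readerrorlog 0, 1, N'locked pages';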

    Scaling and Locks

    Another major topic in database performance is scaling to a very high number of cores. This is both scaling over the cores in a single processor socket and scaling over all the cores of a multi-socket system. Apparently the locking mechanism is a serious obstacle to scaling.

    A few years ago, I did a study of a non-transactional b-tree search engine, i.e., without locking. Not only did it scale perfectly over the physical cores, it also scaled perfectly over the Hyper-Threading logical cores. This was possible because the b-tree search is a series of pointer-chasing memory accesses, resulting in many no-op cycles within a single thread. With no lock contention, the scaling was perfect.

    I also looked at compression and parallelism. At DOP 1, queries to compressed tables consumed about 40% more CPU than queries to uncompressed tables, depending on the operation. The uncompressed tables would scale with increasing degree of parallelism up to a point, before scaling falls off and performance saturates. The compressed tables scaled perfectly until their performance equaled that of the uncompressed tables. The interpretation was that contention for locks was limiting scaling with parallelism; the compression added enough CPU on each thread to relieve the contention. At high degrees of parallelism, 16-32 in some examples, the compression essentially becomes free.
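    A sketch of the kind of comparison involved, with hypothetical table names; the compressed copy is rebuilt with page compression and the same aggregate is run at increasing MAXDOP:

    -- LineItem_P is a page-compressed copy of LineItem (hypothetical names).
    ALTER TABLE dbo.LineItem_P REBUILD WITH (DATA_COMPRESSION = PAGE);

    SET STATISTICS TIME ON;
    SELECT COUNT(*), SUM(Quantity) FROM dbo.LineItem   OPTION (MAXDOP 1);
    SELECT COUNT(*), SUM(Quantity) FROM dbo.LineItem_P OPTION (MAXDOP 1);
    -- repeat at MAXDOP 2, 4, 8, 16, 32: the uncompressed table levels off first,
    -- while the compressed table keeps scaling until the two converge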

    Transactional memory is currently a topic of discussion. See the Intel Developer Forum 2012 session ARCS0004, Intel Transactional Synchronization Extensions. The objective is a lower overhead alternative to locking that can be used in most cases? The Microsoft paper also discusses lockless or lock-free memory strategies?

    In-Memory Database

    As soon as it was evident that CPU cycle time and memory access latency were on diverging paths, perhaps around 1990, it was realized that the page-row storage system, with pointer-chasing code to retrieve scattered metadata, would not realize the full capability of modern processors. Hence the term in-memory database for describing a storage engine optimized for processor-to-memory access characteristics.

    Another option is columnar data storage. Sequential data access could then take advantage of the memory bandwidth of systems, which was improving at an adequate pace. Furthermore, the data type within each column would be known, except for the (hopefully) rarely used variant.

    By the time of the later Intel 486 or early Pentium processors, the ratio of memory access latency to CPU cycle time had exceeded ten. So there was talk of 10X or greater performance with in-memory and columnar database technology.

    At that time, system memory was not as ridiculously huge as it is today, so in-memory databases were not practical enough to achieve broad adoption.

    Today server memory capacity is both huge and cheap, with 16GB DIMM pricing below $400. Of course the mainstream database systems have progressed far beyond their original base, with a deep infrastructure of tools and features, so migrating to a different DBMS would involve huge effort and risk.

    The natural solution is to incorporate in-memory database technology into an existing DBMS. Microsoft SQL Server has already incorporated columnar storage in version 2012.
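    For example, SQL Server 2012 supports a (read-only) nonclustered columnstore index on an existing row-store table; a minimal sketch with hypothetical names:

    -- SQL Server 2012: the table becomes read-only while the columnstore index exists.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
    ON dbo.FactSales (DateKey, ProductKey, Quantity, Amount);
    -- large scans and aggregates can then run in batch mode against column segments
    -- instead of the page-row structures discussed above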

    Breakthrough performance with in-memory technologies (Nov 8, 2012) on the Microsoft TechNet SQL Server Blog by Dave Campbell, and The coming in-memory database tipping point (Apr 9, 2012), describe the rationale behind Hekaton. The Nov 8 post cites Per-Ake Larson et al., High-Performance Concurrency Control Mechanisms for Main-Memory Databases, which describes methods to reduce locking and other concurrency overhead.

    Additional references:
    Oracle TimesTen In-Memory Database Architectural Overview,
    IBM solidDB redbook,
    Wikipedia In-memory database, and Column-oriented DBMS.

    The diagram below is from Oracle TimesTen In-Memory Database Architectural Overview

    [Diagram: Oracle TimesTen In-Memory Database architecture]

    ps
    I am still working on this, to fill in missing data, correct mistakes, etc
    I will try to make updates here, but if not, the permanent copy is here
    http://www.qdpma.com/CBO/InMemory.html

  • Queries barely over the Cost Threshold for Parallelism

    I had discussed SQL Server parallelism in Oct 2010, with my thoughts on the best settings for Cost Threshold for Parallelism (CTP) and Max Degree of Parallelism (MAXDOP), in Parallelism Strategy and Comments. At the time, I had intended to follow up with detailed measurements. So now, a mere 2 years later, here it is. The general thought was that CTP should be raised from the default value of 5, and MAXDOP should be changed from unrestricted on modern systems with very many cores, most especially on systems with Hyper-Threading. However reasonable each person's ideas and suggestions are, nothing is better than hard data.
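    For reference, both are instance-level options set with sp_configure; the values below are purely illustrative, not a recommendation:

    EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
    EXEC sp_configure 'cost threshold for parallelism', 25;  -- default is 5
    EXEC sp_configure 'max degree of parallelism', 8;        -- default is 0 (unrestricted)
    RECONFIGURE;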

    The interest is in the smaller queries that can have a parallel execution plan. With the default Cost Threshold for Parallelism setting, we are trying to find the queries with the lowest plan cost and the lowest actual cost (which point to different queries) that have a parallel execution plan. (I will provide details on the index seek later.)

    The test system is now a 2-socket Intel Xeon X5670 six-core 2.93GHz (32nm Westmere-EP) with Hyper-Threading enabled (a total of 24 logical processors). Some references are made to test results on an earlier system with 2 x Xeon E5430 quad-core 2.66GHz (45nm Core 2) without Hyper-Threading.

    The test data is built using the TPC-H data generator, initially employing derivatives of the Part table. At scale factor 10, the Part table is 2M rows, about 285MB, 55 rows per page or 149 bytes per row. The derived tables have 2 additional 4-byte integer columns, size 301MB, 52 rows per page or 157 bytes per row.
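    A sketch of how such a derived table might be built from the TPC-H Part table; the two extra integer columns and all object names here are hypothetical:

    SELECT P_PARTKEY, P_NAME, P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER,
           P_RETAILPRICE, P_COMMENT,
           CAST(P_PARTKEY % 1000 AS int) AS GroupKey,   -- illustrative extra int columns
           CAST(P_PARTKEY / 1000 AS int) AS BucketKey
    INTO   dbo.PartM
    FROM   dbo.PART;

    CREATE UNIQUE CLUSTERED INDEX CIX_PartM ON dbo.PartM (P_PARTKEY);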

    Parallel Hash Join

    The smallest hash join on the modified Part table that results in a parallel execution plan occurs at around 100,000 rows. Let's first examine the non-parallel plan.
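    The test query is of this general form (hypothetical table and column names), with the range predicate sized to roughly 100,000 rows:

    SELECT COUNT(*), SUM(a.FloatCol1), SUM(b.FloatCol1)
    FROM dbo.PartM  a
    JOIN dbo.PartM2 b ON b.P_PARTKEY = a.P_PARTKEY
    WHERE a.P_PARTKEY BETWEEN 1 AND 100000
      AND b.P_PARTKEY BETWEEN 1 AND 100000
    OPTION (HASH JOIN, MAXDOP 1);   -- vary MAXDOP for the parallel plans discussed below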

    [Figure: hash join execution plan at DOP 1]

    The three major components of this plan are the outer and inner source index seeks, and the hash match. The Stream Aggregate cost is small compared to the major components.

    [Figure: outer and inner source index seek operator details at DOP 1]

    The IO cost of 1.42 corresponds roughly to a range scan/seek of 15MB (an IO cost of 1 is 10.5MB, or 1350 pages). The actual logical IO is 1931, very close to the value of (1.42446 - 0.003125) * 1350, plus a few upper level index LIOs.

    Below are the Hash Match and Stream Aggregate details. Note that the Hash Match has the largest cost of all operations in this execution plan, and the entire cost is in the CPU element. The 2 index seeks operations have their cost mostly in the IO element, with only 7.1% in the CPU.

    [Figure: Hash Match and Stream Aggregate operator details at DOP 1]

    Below is the parallel Hash Join execution plan at DOP 2.

    [Figure: hash join execution plan at DOP 2]

    The Outer and Inner Source index seeks have identical costs, so only the first is shown. As covered in Cost Based Optimizer Parallelism I, the CPU cost is reduced by the degree of parallelism, but the IO cost does not change, which I call a saturated IO model.

    [Figure: index seek and Hash Match operator details at DOP 2]

    The Hash Match details above and the Stream Aggregate details below at DOP 2. Both operations have cost reduced by the degree of parallelism as both elements have their costs entirely in the CPU portion.

    [Figure: Stream Aggregate and Parallelism operator details at DOP 2]

    The Parallelism operation, as its name implies, only occurs in parallel execution plans adding cost 0.0285. The cost structure of this operation is proportional to the number of rows, which implies that high row count queries are less likely to have a parallel execution plan as the cost of reconstituting threads may outweigh the formula benefit of parallel operations.

    Parallel Loop Join

    The second test query is a loop join. A parallel execution plan occurs at around 8000 rows for high degrees of parallelism. At DOP 2, this occurs around 10000 rows, which will be used for test purposes. The non-parallel loop join plan is shown below.
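    The loop join test query is of this general form (hypothetical names), with the outer range sized to about 10,000 rows:

    SELECT COUNT(*), SUM(b.FloatCol1)
    FROM dbo.PartM  a
    JOIN dbo.PartM2 b ON b.JoinKey = a.P_PARTKEY   -- each outer row hits a different inner page
    WHERE a.P_PARTKEY BETWEEN 1 AND 10000
    OPTION (LOOP JOIN, MAXDOP 1);   -- and again at MAXDOP 2 for the parallel plan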

    [Figure: loop join execution plan at DOP 1]

    The outer and inner source operation details are shown below. Note the inner source subtree cost of 29.0264, based on number of executions 10000. The CPU component should be 10000 * 0.0001581 = 1.581. Then the IO component is 29.0264 - 1.581 = 27.4454, which is approximately equal to 8782 * 0.003125. This is an estimate of the number of physical IO for 10000 executes. The assumption is that some of the pages would have been previously loaded into memory during the execution of this query, but not yet evicted.

    [Figure: outer and inner source operator details at DOP 1]

    There are 38465 leaf level pages in both tables. The alternate plan for this query is a hash join with a scan on the inner source table. In this plan, the table scan would have IO component 28.5, CPU 2.2 and the hash match would contribute another 9. So it would take another 35% more rows before the loop join plan would naturally shift to a hash join at DOP 1. At higher DOP, the hash join cost is reduced proportionate to DOP.

    The Nested Loops and Stream Aggregate details are shown below. The Nested Loops is only 0.14% of the overall plan cost, and the Stream Aggregate even less.

    [Figure: Nested Loops and Stream Aggregate operator details at DOP 1]

    Below is the parallel loop join query plan at DOP 2.

    [Figure: loop join execution plan at DOP 2]

    The parallel outer and inner details are below. Note the very minor reduction in outer source CPU from 0.155756 to 0.150178. There is no change in the inner source operation, not even the element attributed to CPU.

    [Figure: outer and inner source operator details at DOP 2]

    The Nested Loops and Parallelism details are below. In the parallel plan, there is a CPU reduction of 0.005 from the outer source, 0.0209 from the Nested Loops, and 0.003 from the Stream Aggregate, to just ever so slightly overcome the Parallelism cost of 0.0285.

    [Figure: Nested Loops and Parallelism operator details at DOP 2]

    The total plan cost for the non-parallel loop join is 29.23 and the parallel plan cost is 29.229. If this model were even remotely accurate, we should ask why bother switching to a parallel plan for such a minute gain. It turns out that the SQL Server parallel execution plan cost estimate is nothing close to the true cost structure, which raises a completely different set of questions, starting with: how can SQL Server be expected to generate good parallel (or non-parallel) execution plans if the cost model is completely different from the true cost structure?

    Parallel Query Performance and Actual Cost - Hash Joins

    Below is the hash join performance in rows per sec (left vertical scale), and (worker or CPU) cost in micro-sec per row (right vertical scale), both versus degree of parallelism.

    [Chart: Hash Join 100K rows versus Degree of Parallelism]

    At DOP 1, performance is just over 1M rows/sec, scaling well to DOP 4, leveling off for DOP 6-12 at just over 4M rows/s, then jumping again at DOP 14. The peak gain from parallelism is a 6.646X speedup at DOP 20 over DOP 1. The cost is just under 1 us per row at DOP 1, rising as high as 2.6 us per row at high DOP, more so when HT is involved? At DOP 1, the query execution time for 100K rows is 95ms.

    It appears that at DOP 2 and 4, the individual threads are running on logical processors in different physical cores, hence the excellent scaling. At DOP 6-12, some of the logical processors are on the same physical cores. Hyper-Threading does improve performance to a moderate degree in SQL Server, depending on the operation.

    (I worked on a custom b-tree search engine with no locking protection. The scaling was essentially linear regardless of physical or logical cores from 1 to 24 threads. This may seem fantastic, but it does make sense because a b-tree search is just serialized memory accesses: fetch a memory location, which determines the next memory location to fetch. The first memory access must complete before the next can be issued. The clock period at 3.3GHz is 0.3ns. Memory access latency is 50ns for the local node and 100ns for a 1-hop remote node, corresponding to 150 and 300 CPU-cycles respectively.)

    If this assessment is correct, then it would suggest the proper strategy for the SQL Server engine in parallel execution plans is to allocate 1 worker thread from a logical processor on each physical core, perhaps allocating from the cores on the same socket before proceeding to the next socket. Only if the desired degree of parallelism is greater than the number of cores should two logical processors be allocated from any single physical core, and this should only occur in unusual circumstances.
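    The scheduler-to-CPU mapping that such a strategy depends on is partially visible through the DMVs; a sketch of checking which schedulers are online per NUMA node (the pairing of logical processors to physical cores is not exposed here and would have to come from the OS):

    SELECT parent_node_id, scheduler_id, cpu_id, status, current_tasks_count
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE'
    ORDER BY parent_node_id, scheduler_id;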

    The next level of sophistication would be to match thread assignment with memory locality, so that the majority of memory accesses are to the local node. This might require partitioned tables (with hash key partitioning?). This effort would be hugely complicated, and only applicable to special circumstances, but a true fanatic would not be deterred.

    Below is the performance and cost structure for a hash join on the full table of 300MB and 2M rows. The cost per row is higher, 1.495 us per row at DOP 1, rising to 3 us at high DOP. However, scaling is better, with a peak speedup over DOP 1 of 11.35, and the peak rows per sec is also higher than in the previous query.

    [Chart: Hash Join 2M rows versus Degree of Parallelism]

    I am not sure why the cost per row in this query is higher for the full table than over the limited range of 100K rows. There was no tempdb activity in either case. One speculation is that the smaller query's hash join intermediate results remain in the L3 cache (15MB), resulting in lower access times than off-die memory accesses.

    A test with small and large merge joins also exhibits the same behavior. Is it possible this is due to local versus remote node memory locations? Life is not simple on NUMA systems with HT.

    Just to be sure, the same full table hash join query was tested on a 3GB, 20M row table, as shown below. The DOP 1 cost and performance were about the same, actually slightly lower at 1.376 us per row. The scaling was better, with peak speedup at 14.95 and peak performance at nearly 11M rows/sec.

    [Chart: Hash Join 20M rows versus Degree of Parallelism]

    Both full table hash joins (300MB and 3GB) demonstrate that parallel execution plan scaling is better for larger queries.

    Parallel Query Performance and Actual Cost - Loop Joins

    Below is the Loop Join performance in rows per sec and cost in usec per row for a query of 10000 rows on the 300MB tables. Notice that performance peaks at DOP 6 and degrades at higher DOP. The peak speedup over DOP 1 was 4.4.

    [Chart: Loop Join 10K rows versus Degree of Parallelism]

    The cost at DOP 1 is 1.827 usec per row. For 10000 rows, the execution time is a mere 18ms (corrected from 1.8ms). That this query benefits from parallel execution at all is amazing. Of course it also raises the serious question as to why SQL Server should go to the effort of a parallel execution plan for a query that runs in 18ms.

    This is due to the antiquated IO-based cost model and the default CTP setting of 5. Based on the 100K row hash join, with plan cost just over 5 and actual query time of 95ms, we might consider a Cost Threshold for Parallelism of around 25, putting the threshold at around 0.5 sec. However, for loop joins the same CTP of 25 would correspond to roughly a 50K row query with an actual execution time of only about 90ms, and at the default setting the 10K row loop join, with plan cost 29, goes parallel even though it runs in just 18ms.

    The disparity between the loop and hash join models and their actual costs does not allow a good strategy for the Cost Threshold for Parallelism setting without a complete overhaul of the SQL Server cost model, and this is not something the SQL Server team seems willing to tackle. Another possibility is for the Cost Threshold for Parallelism setting to only consider the CPU element of the plan cost. Too bad we do not have access to the source code to investigate this.

    The outer and inner source tables were populated and indexed in a manner such that for this loop join, each row from the outer source joins to a row in a different page of the inner source table. When the rows join to consecutive rows in the inner source table, the cost per row is even lower at 1.33 usec per row.

    Below is the Loop Join performance and cost for 100K rows on the 3GB tables. Performance continues to increase for higher DOP. The peak speedup over DOP 1 was 8.26.

    [Chart: Loop Join 100K rows versus Degree of Parallelism]

    In examining the CPU usage during the tests, it was evident that in some cases, only one logical processor of a physical core was used. In other cases, both logical processors on some cores were used. This leads to the uneven scaling versus DOP.

    Parallel Execution Plan Throughput Tests

    Another point of interest is throughput with parallel execution plans when supporting concurrent sessions. The figure below shows performance in rows per sec at various DOP settings versus the number of concurrent sessions. The results are based on the number of queries completed in a 10-sec window. Since the single session DOP 1 query run-time is under 100ms, this should be reasonably accurate.

    [Chart: Hash Join Performance (rows/s) by DOP versus # of Concurrent Sessions]

    SQL Server parallel execution plans are great for a single session, even for queries that run in so short a time that parallelism should have never been considered. However there are definitely issues with parallel execution in throughput tests.

    [Chart: Loop Join Performance (rows/s) by DOP versus # of Concurrent Sessions]

    It is stressed that the queries used in the tests above run in 95ms and 18ms respectively at DOP 1. So it is not surprising that there are significant throughput issues in attempting concurrent parallel execution with such minuscule queries. I present these as examples because they are among the smallest queries for which SQL Server engages parallel execution with the default settings.

    This is because the default settings are seriously obsolete. The original setting, which targeted a 5 sec threshold for parallel execution on a Pentium Pro 166MHz (I think this was a 0.8um, or 800nm, design), does not work well on a modern microprocessor at 3GHz on a 32nm process.

    That said, SQL Server does have issues with concurrent parallel execution throughput even on properly sized queries, so some effort here would be helpful. But the priority is below indexes on hash keys and parallel write operations.

    (The initial release of this post neglected the simple index seek test.)

    Parallel Query Performance and Actual Cost - Index Seek

    Below are the performance and cost characteristics for a simple index seek aggregating 1 float type column over 350K rows. The cost per row is 0.1734 usec at DOP 1. Considering that the hash join test above does 2 index seeks and sums 2 float columns, we could speculate that the cost of the bare hash join is 0.952 - 2 x 0.1734 = 0.605 usec per row.

    [Chart: Index Seek (350K rows, float sum) versus Degree of Parallelism]

    Below are the characteristics for a simple index seek with only COUNT(*). The cost is now only 0.0783 usec per row. This implies that the float column sum costs 0.095 usec per row.

    [Chart: Index Seek (350K rows, COUNT(*) only) versus Degree of Parallelism]

    The bare index seek cost of 0.0783 usec per row also amortizes the cost of the page access. At 52 rows per page, the full page cost including rows is 4.07 usec. Previously, I had assessed that the bare page access cost was 1 us on the Core 2 platform. The Westmere processor is more powerful and, most importantly, has significantly lower memory round-trip time. So I am expecting the bare page cost on Westmere to be less than 1 usec.
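    For reference, the two seek queries compared above are of this general form (hypothetical table and column names):

    -- range sized to 350,000 rows
    SELECT COUNT(*)                 FROM dbo.PartM WHERE P_PARTKEY BETWEEN 1 AND 350000;  -- bare seek
    SELECT COUNT(*), SUM(FloatCol1) FROM dbo.PartM WHERE P_PARTKEY BETWEEN 1 AND 350000;  -- + float sum
    -- the difference in CPU per row between the two is attributed to the float aggregate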

    It might seem that we are quibbling over fractions of 1 usec. But we need to consider that 1 us on a modern microprocessor is 2-3,000 CPU-cycles, and this on a processor core that can execute multiple instructions per cycle. Those of you with very good friends on the Microsoft SQL Server engine team might persuade them to show you the source code and compiled binary/assembly. Even without looking at the source code, it is clear that there are not nearly that many instructions needed to touch a page, row and column.

    The CPU-cycles are spent in the lock mechanism and the memory round-trip sequences. This is why there are significant differences between a traditional database engine with all data in memory and the "in-memory" database engines that have a completely different data organization. It might be better to refer to traditional database engines as page-row structures and the new technology as column-oriented structures.

    See my blog SIMD Extensions for the Database Storage Engine for my thoughts on an alternative solution.

    Parallel Execution Setting Summary

    It is definitely clear that the default settings for Cost Threshold for Parallelism (5) and Max Degree of Parallelism (0 - unrestricted) are seriously obsolete and should be changed as standard practice on SQL Server installation.

    It is not clear there is a single universal best value for Cost Threshold for Parallelism. I think the more important criterion is that there should not be a high volume of concurrently running queries with parallel execution plans. Whatever the plan cost of the high volume queries is, the CTP should be set above it. Of course, the other important objective is to reduce the plan cost of these high volume queries, assuming this correlates to improved actual query execution time.
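    One way to see where the high volume queries sit relative to a candidate CTP value is to pull the estimated subtree costs out of the plan cache; a sketch (assumes the standard showplan XML namespace):

    WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT TOP (50)
        cp.usecounts,
        qp.query_plan.value('(//StmtSimple/@StatementSubTreeCost)[1]', 'float') AS est_subtree_cost,
        st.text AS sql_text
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    WHERE cp.cacheobjtype = 'Compiled Plan'
    ORDER BY cp.usecounts DESC;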

    The strategy for Max Degree of Parallelism is also unclear. Before the advent of multi-core processors, a good strategy was to disable parallelism on transaction processing systems. It was also critical in the old days to ruthlessly enforce the restriction to only pure transactions. Today, we have 32-40 powerful cores on transaction systems, and almost everyone runs complex queries on them. So parallelism is good.

    Ideally I would like to restrict MAXDOP to the number of physical cores on a Data Warehouse system, and less than that on a transaction system. I would also like to get great scaling with parallelism. But because SQL Server does not balance the threads to one logical processor per physical core, this strategy does not work. So this is something the SQL Server team needs to address.

    I will make continued updates on this on my website www.qdpma.com in the topic Parallelism.

  • SSDs at PASS 2012

    There were 7 companies exhibiting SSD products at PASS this year, and one with a product to support SSD storage. This is not counting Dell, EMC and HP, who have SSD products but were at PASS for other reasons. This shows that many have aspirations here, in the hope that market leadership is not yet firmly set.

    The products fall into the following:
    PCI-E SSDs (Fusion-io, LSI, Virident)
    SSD SAN (Violin, TMS, Whiptail)
    SSD caching (LSI, GridIron)

    I am inclined to be of the opinion that the SSD market is not yet hard set. One reason is the upcoming form factor changes. The Dell PowerEdge T620 has the option of front-accessible PCI-E lanes to a 2.5in HDD form factor device. Intel is planning a transition to hybrid PCI-E, SAS, SATA connectors (SFF-8639 and SATA Express) for flexibility in supporting devices. Whoever has the best solution for the new form factors has an opportunity to shake up the existing order.

    The form factor shift could be an opportunity to address the issue of expansion. In a SAS storage system, it is easy to design for high IO bandwidth in the initial configuration while maintaining the ability to expand capacity over time. We can fill the PCI-E slots with RAID controllers or HBAs, with sufficient disks on each to match the PCI-E slot bandwidth, and it is still possible to add disks to each SAS channel while maintaining bandwidth.

    For PCI-E SSDs, if we fill the PCI-E slots with lower capacity devices initially (for maximum bandwidth), then capacity can only be expanded by replacing the existing cards. The new form factor is an opportunity to allow SAS-style expansion.

    Second, there will be a role for all of SLC, HET MLC (also known as eMLC?) and even ordinary MLC in the server market, the same way there was a role for 15K, 10K and 7200RPM hard disks. Some vendors are focused on the enterprise side with SLC and HET MLC configured with a high degree of over-provisioning (30% plus), resulting in a cost structure of $10K per TB for eMLC and $20K for SLC. Others have low-cost, consumer-oriented MLC with a low degree of over-provisioning (7-13%), netting a cost structure of $1K per TB. There is a role for all three, and any vendor without a full lineup is going to leave an opening for other vendors to enter an account.

  • SSD-HDD price parity

    It is hard to believe that we are essentially at SSD-HDD price parity? Of course I am comparing enterprise class 10K/15K HDDs to consumer grade SSDs. Below are the prices I am seeing:

    600GB 15K 3.5in HDD $370
    3TB 7.2K 3.5in HDD $400

    300GB 15K 2.5in HDD $370
    900GB 10K 2.5in HDD $600
    1TB 7.2K 2.5in HDD $230 (less for consumer HDDs)

    512GB SATA SSD $400-600
    Intel SSD DC S3700 400GB $940

    The 512GB SATA SSDs are consumer grade, MLC NAND, with only 7% over provisioning.
    That is 512GB (1GB = 2^30) of NAND, with 512GB (1GB =10^9) of user capacity.
    Intel just announced the SSD DC S3700, which appears to be a reasonable enterprise product, in having 32% over-provisioning. I am inclined to think that DW permanent data does not need more over-provisioning than in the consumer grade SSDs. Otherwise your db is probably not a DW.

    Potentially tempdb might need more write endurance than in the consumer SSDs. So the question is whether a large array of consumer SSDs supporting mostly static data and a smaller write intensive tempdb is a good match.
    Most SAN vendors are peddling grossly over-priced enterprise grade SLC SSDs. That is because they want SSD to be used for caching or tiered storage, which results in heavy write activity. Apparently SAN vendors have no concept of DW.

    Now if we could only get system vendors to provide storage bays for 9.3mm SSDs instead of 15mm 10K/15K HDDs to achieve higher density. Better yet, arrange with SSD vendors to ditch the case, providing the SSD on just the PCB. The SSD storage bays should also be more appropriately balanced, say 2 x4 SAS ports to 8-bays.

  • Supermicro motherboards and systems

    I used to buy SuperMicro exclusively for my own lab. SuperMicro always had a deep lineup of motherboards with almost every conceivable variation. In particular, they had the maximum memory and IO configuration that is desired for database servers. But from around 2006, I became too lazy to source the additional components necessary to complete the system, and switched to Dell PowerEdge Tower servers.

    Now I may reconsider, as neither Dell nor HP is offering the right combination of PCI-E slots, nor do their chassis support the capability I am looking for. The two Supermicro motherboards of interest are the X9DRX+-F for 2-way Xeon E5-2600 series processors, and the X9QR7-TF-JBOD for 4-way Xeon E5-4600 series processors.

    Below is a comparison of the Dell, HP and Supermicro 2-way Xeon E5-2600 systems (or motherboards). Both the Dell and HP have PCI-E x16 slots. Unfortunately this is not particularly useful as the only PCI-E SSD capable of using the full x16 bandwidth is the Fusion-IO ioDrive Octal at over $90K.

                      Dell T620   HP ML350p G8   SuperMicro X9DRX+-F
    DIMM sockets      24          24             16
    PCI-E 3.0 x16     4           3              0
    PCI-E 3.0 x8      2           1              10
    PCI-E 3.0 x4      1           4              0
    PCI-E 2.0 x4      0           1              1

    Below are the Dell and HP 4-way systems for the Xeon E5-4600, the HP 4-way Xeon E7 (Westmere-EX) and the Supermicro 4-way E5-4600 motherboard. It is apparent that neither the Dell nor the HP E5-4600 system is meant to fully replace the previous generation 4-way E7 (Westmere-EX) systems, as both implement only half of the full set of PCI-E lanes.

                      Dell R820   HP DL560 Gen8   HP DL580 Gen7   SuperMicro
    DIMM sockets      48          48              64              24
    PCI-E 3.0 x16     2           2               0               7
    PCI-E 3.0 x8      5*          3               6 (g2)          1
    PCI-E 3.0 x4      0           0               0               0
    PCI-E 2.0 x4      0           1               5               0

    The Xeon E5-2600 and 4600 series processors have 40 PCI-E gen 3 lanes each, plus DMI, which is equivalent to x4 PCI-E gen 2. One processor needs to connect the south-bridge via DMI, but the others could implement a x4 gen 2 port. Of course the full set of 160 gen 3 lanes is only available if all processors are populated, but the same concept applies for the memory sockets. These systems are probably more suitable as VM consolidation servers. Hopefully there is a true database server in the works?

    Today, I am interested in maximum IO bandwidth with a uniform set of PCI-E slots. Maximum memory bandwidth is required to support this, but it is not absolutely essential to have maximum memory capacity.

    The IO bandwidth plan is built around SSDs, because a 15K HDD starts around $200 providing 200MB/s on the outer tracks while a 128GB SSD can deliver 500MB/s for around $100. It would actually be easier to build a high bandwidth system with PCI-E SSDs.

    The Intel 910 400GB model is rated at 1GB/s for just over $2000, and the 800GB model does 2GB/s at the same IO bandwidth per dollar. The Fusion-io ioDrive2 Duo 2.4TB can do 3GB/s but costs $25K (or is it $38K?). The Micron P320h can also do 3GB/s but is probably expensive being based on SLC.

    The other option is 8 x 128GB SATA SSDs on a PCI-E SAS RAID controller. The LSI SAS 9260-8i can support 2.4GB/s with 8 SSDs. In theory 6 SSDs could support this, but I have not validated that. So one 9260-8i ($500) and 8 x 128GB SATA SSDs ($100 each) means I can get 2.4GB/s for $1300, possibly less. I understand that the LSI SAS 9265-8i ($665) can support 2.8GB/s (or better?), but the LSI rep did not send me one when he said he would. LSI now has PCI-E 3.0 SAS controllers, the 9271 and 9286, but I do not have any yet.

    Bandwidth versus cost options
    Fusion-IO ioDrive2 365GB    910MB/s      $6K
    Intel 910 400GB             1000MB/s     $2K
    LSI + 8 SATA SSD            2400MB/s?    $1.3K?

    To implement this strategy, the chassis should support many SATA/SAS devices organized as 8 bays per 8 SAS lanes. Both the Dell T620 and HP ML350p support 32 2.5in SAS devices, but organized as 16 per dual SAS port (x4 each?). So for my purposes, these systems are adequate to house SSDs for 2 adapters. It could also be pointed out that the 2.5in SAS bays are designed for enterprise class 10K/15K HDDs, which are 15mm thick. SATA SSDs, on the other hand, are 9.3mm thick, designed to fit laptop HDD dimensions. They could be even thinner without the case.

    I should point out that the Intel 910 400GB PCI-E SSD has 768GB of actual NAND, so about 50% of capacity is reserved; this should provide very good write endurance for MLC. This is typical for enterprise oriented SSDs. The typical consumer SSD has about a 7% reserve. For example, a device with 128GB (binary, 1GB = 1024^3) of NAND has 128GB decimal (1GB = 10^9) of user capacity. So for production use, stay with the enterprise oriented products.

    RichB:
    First, this is a lab, not a production server, and I am paying for this myself, so $1-2K matters to me.
    Let's start with a 2-way Xeon E5, and the Dell T620 for example.
    A reasonable IO target for this system is 5GB/s, based on 300MB/s per core. I can get this with 2 PCI-E cards that can do 2.5GB/s each, but the cheap card can only do 1GB/s, so I would need 5 of those. Plus I might like 2 RAID controllers to HDDs so I can do really fast local backups. Next I might like to have 2 x 10GbE or even an Infiniband HBA. So by this time I am out of slots. I might like to run extreme IO tests, so perhaps targeting 8GB/s.
    So the x16 slots are wasting PCI-E lanes that I need for extra x8 slots. And I cannot afford the Fusion Octal, and Fusion will not lend one to me long-term.

    Next, the Intel 910 400GB is $2100, while the Fusion ioDrive2 365GB is $5-6K (sorry, how much is 1£ these days?).
    Both are about the same in bandwidth. The Fusion is about 50% better in random read, and 10X+ in random write. Both cite 65us latency. If I had only a single card, I am sure I could see the difference between the Intel and Fusion. But if I were to fill the empty slots with PCI-E SSDs, I am inclined to think that I would have exceeded the ability of SQL Server to drive random IO.

    I tested this once with OCZ RevoDrives, but OCZ cannot provide server system support and OCZ uses the Marvell PCI-E to SATA controller, so I stopped using OCZ PCI-E cards. I still use OCZ SATA SSDs, just connected to LSI SAS HBAs. Intel uses the LSI controller, which has much better interrupt handling. While Fusion may be better at the individual card level, I am not willing to spend the extra money on a lab system. And I am using SATA SSDs because they are even cheaper than the Intel 910.

    Realistically, I need to replace my lab equipment every 1-2 years to be current, so I treat this as disposables, not investments.

  • Server Systems for SQL Server 2012 per core licensing

    Until recently, the SQL Server Enterprise Edition per processor (socket) licensing model resulted in only 2 or 3 server system configurations being the preferred choice. Determine the number of sockets: 2, 4 or 8. Then select the processor with the most compute capability at that socket count level. Finally, fill the DIMM sockets with the largest capacity ECC memory module at a reasonable cost per GB. Currently this is the 16GB DIMM, with a price of $365 on the Dell website and $240 from Crucial. The 32GB DIMM from Dell is currently (2012-Oct) $1,399 each, down significantly from $2,499 in early 2012? Perhaps next year the 32GB DIMM might be under $800?

    SQL Server 2012 Enterprise Edition options

    Now with SQL Server 2012 per core licensing, there are a broader range of possibilities based on the number of cores. The table below shows Dell PowerEdge system examples for the Intel Xeon E5 processors from 8 to 32 cores. I would cite HP ProLiant configurations as well, but their website has become so painful to use that I have given up.

    System  Sockets  Processor  GHz     Cores/socket  Total cores  Max DIMMs  System Price  SQL 2012 EE license
    T620    2        E5-2643    3.3GHz  4             8            24         $10K          $48K
    T620    2        E5-2667    2.9GHz  6             12           24         $12K          $72K
    T620    2        E5-2690    2.9GHz  8             16           24         $13K          $96K
    R820    4        E5-4603    2.0GHz  4             16           48         $14K          $96K
    R820    4        E5-4617    2.9GHz  6             24           48         $20K          $144K
    R820    4        E5-4650    2.7GHz  8             32           48         $29K          $192K

    Pricing for the Dell PowerEdge T620 systems above is with 16 x 16GB memory and 1 boot drive. The prices for the Dell PowerEdge R820 are also with 16 x 16GB memory. Each additional set of 16 x 16GB DIMMs costs $5,840.

    The SQL Server 2012 Enterprise Edition licensing shown above is based on a discounted price of $6K per core. The list price is $6,736 per core. The Fujitsu RX300 S7 TPC-E full disclosure report of 2012 Jul 5 shows a full environment (system + storage + software) discount of 20%. If evenly applied, this would put the SQL Server license at about $5,400 per core. I would like to hear what discounts people are getting with respect to volume. My understanding prior to 2012 was that the Microsoft sales rep does not love you unless you buy 16 EE processor licenses, which would translate to 32 core licenses in 2012. Is this still the threshold?

    It is unfortunate that Intel does not offer a high frequency 4-core in the E5-4600 line as they do in the E5-2600 series. I am of the opinion that a 4-way system with Xeon E5 3GHz+ quad-core processor and 48 DIMM sockets would be a very interesting platform. The Intel list price for the E5-4650 8-core is $3616, the 4617 6-core for $1611, and the 4603 4-core at $551.

    Considering that the SQL Server Enterprise Edition licensing component dwarfs the system and processor costs, it would be a good idea for Intel to offer an all-purpose E5-4600 at the high end that can be configured to 4, 6 or 8 cores in the microcode. It would be simpler for large organizations to purchase 4-way systems with the all-purpose processor as a standard configuration. Then each individual system could have the number of cores dialed down to the desired level.

    The 4-way E5-4603 2.0GHz is probably not as useful as the 2-way E5-2690 2.9GHz, both at 16 cores total. The 4-way has twice the memory bandwidth and capacity, but probably also much more than necessary to support the 16 x 2GHz cores. The 2-way has nearly 50% more compute capability with balanced memory bandwidth, because the complete processor was designed to be in balance for the high-end configuration. There are only a small number of situations that would favor the larger memory capacity of the 4-way E5-4600.

    The recent generation Intel processor cores are so powerful that 4 or 6 cores is probably good enough for most medium size businesses. I would prefer a 2-socket system for the extra memory bandwidth and capacity, but the minimum SQL Server 2012 license is for 4-cores per socket, negating the feasibility of a 2-way dual-core system.

    Standard Edition

    The limit for SQL Server 2012 Standard Edition is the lesser of 16 cores or 4 sockets, plus 64GB memory. In addition, many important features are not available, such as compression, partitioning, and advanced security. I recall that there was a limit to parallel query execution, and that it was less than 16? Standard Edition does not have parallel index operations - i.e., index creation? Perhaps all this means that 16 cores is far more than can be used in a Standard Edition environment. The 64GB memory limit also provides guidance on when to use Standard Edition.

    Personally, I do not have much side-by-side comparisons of Standard versus Enterprise edition. I would like to hear from people what are the key technical considerations in determining when SE is suitable.

    It would seem that a single socket system with 4-6 cores and 64GB+ memory is most suitable for Standard Edition. Since the 64GB memory limit applies to SQL Server, it might be a good idea to configure the server with more than 64GB, perhaps as much as 96GB, so as to leave more than sufficient memory for the operating system and other processes.
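    On such a server the usual practice is to cap SQL Server explicitly rather than rely on the edition limit; illustrative values for a 96GB box:

    -- cap the buffer pool at the Standard Edition limit, leaving the rest for the OS
    EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 65536;
    RECONFIGURE;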

    Below are some Dell system examples that might be suitable for SQL Server Standard Edition. It appears that Dell is discontinuing the T320 and T420 in favor of the Rx20 systems. While the R-models are great for higher density environments such as web servers, the T-models are best for small business database servers.

    System  Sockets  Processor   Frequency  Cores  Memory   PCI-E                 System Price
    T110    1        E3-1270v2   3.5GHz     4      4x8GB    1x16, 1x8, 1x4        $2,080
    T320*   1        E5-1410     2.8GHz     4      6x8GB    1x16, 2x4g3, 1x4g2    $1,944
    T320*   1        E5-2407     2.2GHz     4      6x8GB    1x16, 2x4g3, 1x4g2    $2,019
    T320    1        E5-2440     2.4GHz     6      6x8GB    1x16, 2x4g3, 1x4g2    $2,799
    R320    1        E5-2470     2.3GHz     8      6x8GB    1x16, 1x4g2           $3,529
    R520    2        E5-2407     2.2GHz     2x4    12x8GB   1x16, 3x8             $3,503
    T620    2        E5-2643     3.3GHz     2x4    8x8GB    4x16, 2x8, 1x4        $4,292

    Notes:
    * The T320 and T420 are no longer available? Only the R320 and R420?
    The T110 II with 4x8GB memory added is $1,251 from Dell; the price from Crucial is $440.
    The T320 memory price from Dell is $160 for 8GB and $365 for 16GB. Crucial is $85 for 8GB and $240 for 16GB.

    Technically, the systems for the E5 processors are better than those for the E3, with more memory bandwidth (3 channels versus 2) and larger memory capacity. On the downside is a large drop in processor frequency. The 2-socket quad-core is probably a better option than the single socket 8-core processor.

    The SQL Server 2012 per core licensing may be a shock compared with 2008 R2 licensing at the 8-core per socket level, effectively doubling SQL Server licensing costs. However, based on direct observations of many environments, I am of the opinion that most businesses would have more than adequate performance with a properly tuned 2-way quad-core system, 8 cores total. Such a system has more than 4X the compute capability of 4-way systems from the period before multi-core processors. So in fact SQL Server licensing costs have gone down; we just need to be judicious in the choice of configuration.

  • Decoding STATS_STREAM

    Data distribution statistics are one of the foundations of the cost-based query optimizer in all modern database engines, including SQL Server. From SQL Server 2005 on, most of the information displayed by DBCC SHOW_STATISTICS is kept in a binary field accessible with the STATS_STREAM clause. Back in SQL Server 2000, it was possible to modify system tables directly, including the sysindexes stat_blob field. At the time, I described a decode of the stat_blob field for the purpose of influencing the execution plan, presumably on a development system and not a production system.

    Starting with SQL Server 2005, it was no longer possible to directly modify system tables. An API was provided to access data distribution statistics to allow cloning the statistics from one database to another. The presumed usage is to clone statistics from a large production database to a small development database. For other database engines, I had heard of the idea of updating statistics on a backup system to be applied to the production system. While it was still possible to decode most of the 2005 stats_stream binary, it appears that a checksum was added, so it was not possible to apply an externally generated statistics binary unless the "checksum" value could be correctly calculated.
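    The cloning path referred to is the (undocumented) STATS_STREAM option, which is what Management Studio emits when scripting statistics; a sketch with hypothetical object names:

    -- 1) extract the statistics binary on the source database
    DBCC SHOW_STATISTICS ('dbo.Part', 'IX_Part_Size') WITH STATS_STREAM;
    -- 2) apply it on the target; this is the form SSMS generates, with the full 0x...
    --    binary from step 1 pasted in:
    -- UPDATE STATISTICS dbo.Part (IX_Part_Size)
    -- WITH STATS_STREAM = 0x010000000100..., ROWCOUNT = 2000000, PAGECOUNT = 36500;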

    Around this time, I was working on other SQL Server tools, most prominently SQL System for performance monitoring, Exec Stats for execution plan analysis and TraceAnalysis for trace processing. Work on the SQL Server data distribution cloning tool was discontinued, and I could not continue further research into the decoding of SQL Server data distribution statistics.

    Since several people have asked about the data distribution statistics decode, I am making what I know about stats_stream available. It would be helpful if other people would contribute the missing pieces.

    Note that the organization of stats_stream changed from SQL Server 2000 (then the sysindexes stat_blob) to 2005, and again in 2008? It is quite possible there are also changes in version 2012? Most of what I discuss here applies to version 2008 R2.

    Decoding Stats Stream for SQL Server 2008R2

    Here I am using 1-based references; byte index 1 is the first byte.
    C# and most other programming languages use a zero-based index.
    Position   Length         Value/Type     Purpose
    1          4?             1              unknown
    5          4?                            number of vectors
    9          4              0              zero
    13         4              0              zero
    17         4                             checksum
    21         4              0              zero
    25         4                             stats stream length
    29         4              0              zero
    33         4                             stats stream length minus vector variable length
    The difference [25]-[33] is 64 for 1 vector (defined as off1).
    Each additional vector adds 24 bytes starting at byte position 41.
    37         4              0              zero
    Start of vector information, 24 bytes per vector:
    41         1                             system type id
    42         1                             unknown
    43         2                             unknown
    45         4                             user type id
    49         2                             length
    51         1                             Prec
    52         1                             Scale
    53         4                             unknown
    57         4                             unknown
    61         2                             unknown
    63         2                             unknown
    Some of the unknown fields should be for nullable, collation, etc.
    Additional vectors follow here, if present.
    off1+1*    9                             Updated?, 9 byte datetime2?
    off1+10    3                             unknown
    off1+13    8                             Rows
    off1+21    8                             Rows sampled
    off1+29    4              4 byte real    Density - header
    off1+33    4x33=132       4 byte real    Density - vector, up to 33 values
    off1+165   4              4 byte int     Steps (first copy)
    off1+169   4              4 byte int     Steps (second copy)
    off1+173   4              4 byte int     number of vectors
    off1+177   4              4 byte int     Step size (in bytes)
    off1+181   4              4 byte real    Average key length - header
    off1+185   4              4 byte real    Unfiltered rows
    off1+189   4              4 byte int     unknown
    off1+193   4x33=132       4 byte real    Average key length - vector
    Some fields may represent the string index (bool), or a filter expression.
    off1+325   8              8 byte int     unknown; values 0x11, 0x13 and 0x19 observed;
                                             may determine the number of post-histogram 8-byte values starting at off1+341?
    off1+333   8              0              8 byte 0?
    off1+341   8              0              offset for value after histogram?
    off1+349   8              0              another offset
    off1+357   8              0              another offset if the value of [off1+325] is 19 or more?
    More offsets if the value of [off1+325] is 25 or more?
    Eventually this sequence appears: 0x10001100 followed by three 4-byte reals,
    a value in the native type of the statistic, and then ending with 0x040000.
    off2       2              0x10 (16)      length of core columns; determines the organization of the histogram structures?
    off2+2     2              17 or higher   size of step, excluding the 3 byte trailer
    off2+4     4              4 byte real    Eq Rows
    off2+8     4              4 byte real    Range Rows
    off2+12    4              4 byte real    Avg Range Rows
    off2+16    native length  native type    Range Hi Key
    off2+16+x  3              0x040000       step terminator?; x is the size of the type
    off3       ?              ?              additional info
    *off1 = value of the 4(8) byte int at position [25] minus the value at [33]
    **off2 = off1 + 341 + 16 if the value of [off1+325] is 0x11, or 24 if 0x13 or 0x19
    ***off3 = off1 + 341 + value of the 4(8) byte int at [off1+341]

    So far, for SQL Server 2008 R2, I have only looked at fixed length, not nullable, statistics. Variable length statistics have a different organization, particularly in the histogram part. String statistics may have extended information after the histogram, per a new feature of SQL Server 2008?
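    To examine the byte positions above, the binary can be captured into a table and sliced with SUBSTRING; a sketch with hypothetical object names (if the shape of the STATS_STREAM result set differs on your build, adjust the temp table; multi-byte integers are stored little-endian, so the bytes must be reversed before numeric conversion):

    CREATE TABLE #ss (stats_stream varbinary(max), rows_ bigint, data_pages bigint);

    INSERT #ss
    EXEC ('DBCC SHOW_STATISTICS (''dbo.Part'', ''IX_Part_Size'') WITH STATS_STREAM, NO_INFOMSGS');

    SELECT DATALENGTH(stats_stream)       AS actual_length,
           SUBSTRING(stats_stream, 17, 4) AS checksum_raw,       -- bytes [17-20], 1-based
           SUBSTRING(stats_stream, 25, 4) AS stream_length_raw   -- bytes [25-28], little-endian
    FROM #ss;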

    Umachandar provides SQL functions for converting 4-byte binary to real and 8-byte binary to float, and vice versa.

    Supporting SQL functions and procedures:

    The updated tools now have a stored procedure that accepts a table and index (or column statistic) as input parameters, in addition to the original procedure that takes the stats stream binary.

    Updates on QDPMA Stats Stream - Updated

    decoding stats stream - Updated

    decoding stats stream

    ps
    An interesting fact is that it is not necessary for statistics to be highly accurate to be effective. Normally we are interested in distribution differences that shift the execution plan from one to another, and the boundaries for this can be very wide. False statistics in certain circumstances might guard against catastrophically bad execution plans, for example in out-of-bounds situations. Another is skewed distributions, but this should be handled by other means, to ensure high and low distributions get different execution plans.

  • Intel Xeon E5 (Sandy Bridge-EP) and SQL Server 2012 Benchmarks

    Intel officially announced the Xeon E5-2600 series processor based on the Sandy Bridge-EP core, with up to 8 cores and 20MB LLC per socket. Only one TPC benchmark accompanied the product launch, summarized below.

    Processors          Cores/socket  Frequency  Memory           SQL     Vendor  TPC-E
    2 x Xeon E5-2690    8             2.9GHz     512GB (16x32GB)  2012    IBM     1,863.23
    2 x Xeon E7-2870    10            2.4GHz     512GB (32x16GB)  2008R2  IBM     1,560.70
    2 x Xeon X5690      6             3.46GHz    192GB (12x16GB)  2008R2  HP      1,284.14

    Note: the HP report lists SQL Server 2008 R2 Enterprise Edition licenses at $23,370 per socket.
    The first IBM report lists SQL Server 2012 Enterprise Edition licenses at $13,473 per pair of cores(?), or $53,892 per socket. All results used SSD storage. The IBM E7 result used eMLC SSDs; the IBM E5 result showed more expensive SSDs, but did not explicitly say SLC.

    The Xeon E5 supersedes 2-socket systems based on both the Xeon 5600 (Westmere-EP) and Xeon E7 (Westmere-EX). It is evident that Sandy Bridge improves performance over Westmere at both the socket and core levels, and also on a per-GHz basis.

    Architecture     Total cores  Frequency  Core-GHz  TPC-E     tps-E per core-GHz
    Sandy Bridge-EP  2 x 8 = 16   2.9GHz     46.4      1,863.23  40.16
    Westmere-EX      2 x 10 = 20  2.4GHz     48.0      1,560.70  32.51
    Westmere-EP      2 x 6 = 12   3.46GHz    41.52     1,284.14  30.93

    One advantage of the Xeon E7 (Westmere-EX) system is that the memory expanders support 4 DIMMs per channel, or 16 DIMMs per socket (4 memory channels). However, a two-socket Sandy Bridge-EP system supports 256GB with 16 (8 per socket) of the lower priced (per GB) 16GB DIMMs. And really, 256GB is more than enough for most situations, so it is quite reasonable not to burden the large majority with outlier configuration requirements.

    A later version of the Xeon E5 will support 4-socket systems. There is no word on whether glue-less 8-socket systems will be supported in the future. It was previously discussed that there would be an EN variant of Sandy Bridge with 3 memory channels and fewer PCI-E lanes.

    Hardware Strategy for SQL Server 2012 per core licensing
    Top frequency on the 6-core E5-2667 is 2.9GHz, the same as the 8-core (excluding the 8-core 2687W model at 3.1GHz). Top frequencies for the 4-core E5-2643 and 2-core E5-2637 are 3.3 and 3.0GHz respectively. The desktop i7-2830 is 3.6GHz with 4 cores, so Intel is deliberately constraining the top frequency on the 2 and 4 core versions of the server parts, apparently to favor interest in the 8-core part.

    Given the SQL Server 2012 per core licensing, there should be interest in a system with fewer cores per socket running at higher frequency, while taking advantage of the high memory and IO bandwidth of the E5 system. Consider also that SQL Server write operations (insert, update, delete, and the final stage of index builds) and even certain SELECT operations are not parallel (for example, the Sequence Project operator that supports the ROW_NUMBER function).

    I think it would also make sense for Intel to allow cores to be disabled in the BIOS (now UEFI) on the top of the line E5-2690, like the desktop extreme edition unlocked processors. Large corporate customers could buy a batch of identical systems, disabling cores that are not needed on individual systems.

    It would also be of value to engage a (no longer quite so, relative to core licenses) exorbitantly priced consultant to tune SQL Server to run on fewer cores. (Not to be construed as a solicitation for services.)
