This came out on the Microsoft website. I have not had time to look over the hardware configuration carefully. Of course the recommended storage is a SAN; vendors love to sell very high margin products when there are perfectly good, merely high margin alternatives. Well, at least this one looked at sequential disk IO bandwidth, unlike past configurations that were totally silly. The HP config for the DL385G5p does 1.5GB/s, the DL585 3GB/s, and the DL785 6GB/s. The Dell config for the PE2950 does 1.6GB/s and the R900 4.8GB/s. My gut feeling is that the 2-socket systems are underpowered storage-wise, but that is what happens when you config expensive storage solutions.
Ok, I withdraw my initial complaint on the MSA 2000, depending on which model. I love direct-attach storage in non-cluster environments for brute force capability. So on the HP side, I like the MSA60 with LFF drives or the MSA70 with SFF drives (because of the high drive count, 25, on a single x4 SAS link, the MSA70 is really for 8K random IO, not large block or sequential). The MSA 60 base unit is $2,999. The MSA 2012sa single controller base unit is $4,499. The MSA 60 requires a RAID controller in the server. The MSA 2012sa has a built-in RAID controller, requiring only a SAS HBA in the server. So the MSA 2000 SAS variant is only slightly more expensive than the pure DA solution. The MSA2000 FC variants are far more expensive and (while suitable for cluster environments, which this is not) do not contribute to performance.
Data Consumption Calculations
The reference configurations seem to be built around a calculated table scan data consumption rate of 200MB/sec per core. It is commendable that configurations are built around calculations. Also, most people cannot handle complexity, so calculations must be kept as simple as possible. However, it is grossly irresponsible to state that table scan consumption rate is linear with the number of cores. It is highly dubious that the parallel execution plan consumption rate was tested at all, given the cited SELECT * OPTION (MAXDOP 1) method. A SELECT * query that returns all rows is very unlikely to generate a parallel execution plan in the first place. This has to do with the formulas used by the SQL Server cost based optimizer (the execution plan cost formulas): parallelism reduces only the CPU cost, not the IO cost, so the extra cost of the "Parallelism (Gather Streams)" operation inhibits the parallel plan.
Also, SELECT * FROM Table is not a proper test of data consumption; it is as much a test of the ability of the client to receive the data.
Anyway, I have previously reported precise measurements of the cost structure of SQL Server table scans. A rough model is that the cost of a base table scan (SELECT COUNT(*) or equivalent) depends on the number of pages and rows involved, and on whether disk access is required. The cost per page (measured on a 2-way system with the Intel Xeon 5430, Core 2 architecture at 2.66GHz) is around 1 CPU-microsecond for in-memory data. The test table's columns were all fixed length, not null.
The cost per row is roughly 0.05 CPU-microseconds, so the in-memory scan cost for a table with 20 rows per page works out to approximately 2 CPU-microseconds per page. This applies for an unhinted SQL query, for which the default lock granularity at this table size should be a table lock. The cost of a table scan using row locks is much higher, possibly 1,000 CPU-cycles per row (I reported on this long ago, back in the SQL Server 2000 days, on the Pentium III, Pentium 4 and Itanium architectures).
Let me emphasize that these are measured values, known to be consistent over a wide range of conditions. It does not matter what one imagines the code executing the above looks like.
For large block disk access, the additional cost per page is approximately 4.3 CPU-microseconds. So the base table scan at 20 rows per page will run in the range of 4GB/sec in memory and 1.2GB/sec from disk on a single core. Note that I said large block disk access; I think this amortizes the disk IO cost over many pages (32-64?) and of course includes the cost of evicting a page from the buffer cache and bringing in the new page.
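The per-page and per-row figures above can be combined into a back-of-envelope model. A minimal sketch using the measured costs quoted above (1 CPU-µs per page, 0.05 CPU-µs per row, 4.3 CPU-µs extra per page for large block disk access; the function names are mine):

```python
# Back-of-envelope table scan cost model from the measured values above.
# Assumptions: 8KB SQL Server page, costs in CPU-microseconds.
PAGE_BYTES = 8192
COST_PER_PAGE_US = 1.0      # base in-memory cost per page
COST_PER_ROW_US = 0.05      # cost per row touched
DISK_EXTRA_US = 4.3         # additional per-page cost for large block disk access

def scan_cost_us(rows_per_page, from_disk=False):
    """CPU-microseconds to scan one page."""
    cost = COST_PER_PAGE_US + rows_per_page * COST_PER_ROW_US
    if from_disk:
        cost += DISK_EXTRA_US
    return cost

def scan_bandwidth_gb_s(rows_per_page, from_disk=False):
    """Implied single-core data consumption rate in GB/sec."""
    return PAGE_BYTES / (scan_cost_us(rows_per_page, from_disk) * 1e-6) / 1e9

print(scan_cost_us(20))               # 2.0 us/page in memory
print(scan_bandwidth_gb_s(20))        # ~4.1 GB/s, the ~4GB/sec figure
print(scan_bandwidth_gb_s(20, True))  # ~1.3 GB/s, near the ~1.2GB/sec figure
```

The disk case comes out closer to 1.3GB/sec by this arithmetic; the 1.2GB/sec figure presumably reflects additional overhead not in this crude model.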
Notice I said base table scan, meaning more or less SELECT COUNT(*) or equivalent, so that the execution plan is a table scan or clustered index scan, not a nonclustered index scan, though the formulas are approximately valid there too. It turns out that the cost of logic, i.e. AVG, SUM and other calculations, can be quite expensive. A single aggregate on an integer column appears to cost about 0.20 CPU-microseconds per row, meaning the in-memory scan cost at 20 rows per page is now 6 CPU-microseconds per page versus 2 for the base. TPC-H query 1 has 7 aggregates, 3 multiplies and 3 adds (or subtracts), yielding a net data consumption rate of around 140MB/sec on a single core.
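Extending the same back-of-envelope arithmetic with the ~0.20 CPU-µs per row per aggregate figure (again a sketch with my own function names, using only the measured costs quoted above):

```python
# In-memory scan cost including per-row aggregate logic.
# Assumptions: 8KB page, 1 us/page base, 0.05 us/row, 0.20 us/row per aggregate.
PAGE_BYTES = 8192

def scan_cost_us(rows_per_page, n_aggregates=0):
    """CPU-microseconds to scan one in-memory page with n aggregates."""
    per_row_us = 0.05 + 0.20 * n_aggregates
    return 1.0 + rows_per_page * per_row_us

def consumption_mb_s(rows_per_page, n_aggregates=0):
    """Implied single-core data consumption rate in MB/sec."""
    return PAGE_BYTES / (scan_cost_us(rows_per_page, n_aggregates) * 1e-6) / 1e6

print(scan_cost_us(20, 0))      # 2.0 us/page: base scan
print(scan_cost_us(20, 1))      # 6.0 us/page: one integer aggregate
print(consumption_mb_s(20, 1))  # ~1365 MB/sec for the single-aggregate case
```

Whether the multiplies and adds in TPC-H query 1 each cost the same per row as a full aggregate is my assumption, so I have not tried to reproduce the 140MB/sec figure here; the point is only that per-row logic, not IO, quickly becomes the bottleneck.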
So should each application be tested for its actual data consumption rate on its most important aggregate query? I think this is too complicated and too narrow. Even if the main query is complicated, it still helps to have brute force capability in the storage system. This is why my recommendation is to simply fill the available PCI-E slots with controllers, which are relatively inexpensive, and distribute many disks across the controllers, with consideration for the cost of the disk enclosures. This means avoiding the big capacity drives, i.e. skip the 300GB 15K drives. Last year, I recommended 73GB 15K drives. Today, there is little price difference between the 73GB and 146GB 15K drives, so go for the 146GB 15K drives.
On memory, forget the stupid 4GB per core rule. I see no justification for it. Memory is cheap. Fill the DIMM sockets with the largest capacity memory module where the cost per GB is essentially linear. Two years ago, 2GB ECC DIMMs were around $200, and 4GB ECC were around $1000, so 2GB DIMMs were the recommendation then. I just checked on Crucial, today 4GB ECC is less than $200 each, while 8GB is around $850, so 4GB DIMMs are the recommendation for now.
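The DIMM arithmetic is simple enough to sketch (the prices are the approximate Crucial figures quoted above, not quotes; the rule is just "take the largest DIMM that is still near the cheapest cost per GB"):

```python
# Cost-per-GB comparison for ECC DIMMs, using the approximate prices above.
def cost_per_gb(size_gb, price_usd):
    return price_usd / size_gb

dimms_2009 = {4: 200, 8: 850}   # size in GB -> approximate price in USD
for size, price in sorted(dimms_2009.items()):
    print(size, cost_per_gb(size, price))   # 4GB: $50/GB, 8GB: ~$106/GB

# The largest DIMM whose cost per GB is still the lowest available.
best = min(dimms_2009, key=lambda s: cost_per_gb(s, dimms_2009[s]))
print(best)  # 4 -> the 4GB DIMM recommendation for now
```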
Finally, a note on recommended processors. Large data warehouse queries do benefit from high clock rate (comparable only within a micro-architecture; do not equate Opteron to Core 2 frequency). DW queries do not benefit from cache size, so if the same frequency is available with different cache sizes at different prices, be aware that processor cache size does not impact DW performance. A large processor cache does significantly benefit high call volume transactional queries, so the higher price is more than justified there.
Reference Configurations with Direct Attach Storage
Below are my reference configurations, on the assumption the data warehouse is not clustered. I have included prices obtained from the Dell and HP websites on 26 February 2009. If you can get Dell or HP to quote the price of the SAN based storage in the Microsoft/Dell/HP reference configurations, compare it against the direct-attach pricing below.
Dell PowerEdge 2950 III, 2x3.33GHz Xeon X5470, 32GB memory
2 PERC6/E controllers, $850 each
2-4 MD 1000 w/ 15 x 146GB 15K SAS drives, $7K each
Total cost: $22K with 30 disks, $36K with 60 disks
PowerEdge R900 4 x Six Core Intel Xeon X7460 2.67GHz, 16M, 128GB, $21,366
6 PERC6/E controllers, $5,100
6 MD 1000 w/15x146G 15K, $43K
Total cost: $70K.
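The totals above can be checked with simple arithmetic (a sketch; the $850 PERC6/E and $7K MD1000 prices are from the lines above, and the PE2950 base price is my back-of-envelope inference from the stated $22K total):

```python
# Rough configuration cost check, prices in USD from the lines above.
def config_cost(server, n_controllers, controller, n_enclosures, enclosure):
    return server + n_controllers * controller + n_enclosures * enclosure

PERC6E = 850
MD1000 = 7000   # MD1000 with 15 x 146GB 15K SAS drives

# Back out the PE2950 base price from the $22K / 30-disk total (inference, not a quote).
pe2950_base = 22000 - config_cost(0, 2, PERC6E, 2, MD1000)   # ~ $6,300
print(config_cost(pe2950_base, 2, PERC6E, 4, MD1000))  # $36,000 -> the $36K figure
print(config_cost(21366, 6, PERC6E, 6, MD1000))        # $68,466 -> rounds to ~$70K
```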
While I do agree in principle with the concept of balanced configuration, I ask the question: which is better? Each configuration is approximately $36K.
1 PE 2950 w 2x3.33GHz X5470, 32GB memory, 2 PERC6, 60 x 146G 15K disks
or 1 PE R900, 4x2.67GHz X7460, 128GB memory, 2 PERC6, 30 x 146G 15K disks
HP ProLiant configurations with Opteron (Shanghai, 6M L3) processors.
DL385G5p, 2 x Opteron 2384 2.7GHz, 32GB, 2 x 72GB 15K HDD, $7,282
DL385G5p, 2 x Opteron 2384 2.7GHz, 32GB, 16x72GB 15K HDD $12,500
DL 585G5, 4 x Opteron 8384 2.7GHz, 128GB, 2 x 72GB 15K, $21,500
DL 785G5, 8 x Opteron 8384 2.7GHz, 256GB, 2 x 72GB 15K, $60,600
MSA60, 12 x 146GB 15K, $7K ($5,765 bundle price?)
Smart Array P800 $949
Smart Array P411 $649
The lowest cost DL385G5p DW configuration is to populate all 16 internal SFF drive bays, then attach 2 MSA 60 enclosures, each with 12 15K disks. The preferred DL585G5 configuration is to populate 6 RAID controllers, with 2 MSA 60 (directly connected, not daisy chained) on each controller in the 3 x8 PCI-E slots, and 1 MSA 60 on each controller in the x4 PCI-E slots.
On the DL785G5, place RAID controllers in each of the 3 x16 and 3 x8 PCI-E slots, each connecting 2 MSA 60, and connect a single MSA 60 to each RAID controller in the x4 PCI-E slots. Defer to Gunter Zink's team for any more specific slot-to-disk configuration.