Joe Chang

  • Fixing the Query Optimizer Cost Model

    The model for the cost of operations used by the SQL Server query optimizer to produce an execution plan is rather simple. It has been largely unchanged since SQL Server version 7, RTM in 1998. Back then, SQL Server had not yet achieved tier 1 status with commensurate development resources. Some adjustments were made in version 2005. One might question how something so old can possibly be relevant today, considering how much hardware has changed over the intervening years. The cost model is still adequate for pure transaction processing scenarios in which queries have highly selective search arguments. Its limitations become pronounced in parallel execution plans.

    The salient aspect of modern hardware is that compute capability is distributed over very many processor cores, possibly with irregularities due to non-uniform memory access (NUMA). As such, the decision on parallelism cannot be just on/off at a cost threshold, with a one-size-fits-all max degree of parallelism. A degree of sophistication is necessary in the employment of resources for parallel processing. For some queries, it might be best to utilize a low to moderate level of parallelism; for others, perhaps all cores in one socket; and in exceptional circumstances, the whole set of cores.

    For this, the existing SQL Server cost model does not have sufficient depth to be the foundation from which an effective strategy for deploying resources can be formulated. It should not be too difficult to build a suitable model, considering the wealth of knowledge that has been (or should have been) accumulated over the years. There will be some resistance to changing the cost model, given that people have become accustomed to the plans generated from the current cost model. But that could be accommodated as a legacy mode, so as not to stand in the way of progress.

    The SQL Server Plan Cost Model Overview

    The cost model of the SQL Server Query Optimizer is explored in more detail elsewhere. Other authors have written on the query optimizer itself. Only a brief description of the SQL Server cost model is given here.

    Buried in Books Online, in the section on the query governor cost limit option, is the statement: "Query cost refers to the estimated elapsed time, in seconds, required to complete a query on a specific hardware configuration." The reference system existed when SQL Server version 7 was in development, so this could have been a Pentium Pro system, but some people have said that it was a Pentium. My investigation into this topic began around year 2000, though I never went back to look at whether SQL Server versions 6 and 6.5 used the same model.

    Each operator in an execution plan can have two components, CPU and IO. The IO component represents the time to complete the IO operation. There is in fact also CPU involved in performing an IO, and it is unclear whether this is part of the IO component or represented in the CPU component. The assumption is that leaf level pages are not in memory at the start of query execution, but may be retained if the execution touches a page more than once. The complete cost of an operator factors in the CPU, IO, number of rows and number of executions. The CPU and IO may represent the cost of only a single execution, while the operator cost incorporates the entire operation as a function of the number of executions.

    The IO cost is based on a model of the disparate nature of hard disk performance in random and sequential access at fixed queue depth. SQL Server version 7 and 2000 had two baselines, but from version 2005 on, the IO cost model is that random IO performance is 320 IOPS, and sequential IO is 1350 pages per sec (10.8MB/s). As cost is in seconds, the random IO from key lookup and loop join operations or the first page of a scan is 1/320 = 0.003125. Each successive page in a scan or range seek is 1/1350 = 0.000740740 (the 740 sequence repeats).
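
    As a worked example of the arithmetic above (an illustrative sketch only, not a documented API; the page and execution counts are made up, and the page-revisit adjustment mentioned earlier is ignored):

      -- IO cost arithmetic per the 2005+ constants above (sketch, not a documented API)
      DECLARE @scan_pages int = 10000, @lookups int = 1000;
      SELECT 0.003125 + (@scan_pages - 1) / 1350.0 AS scan_io_cost,       -- first page random, remainder sequential
             @lookups / 320.0                      AS key_lookup_io_cost; -- each execution costed as one random IO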

    The plan cost does not model the cost of logic beyond the basic operation. For example, an aggregation of one column has the same cost as two or more. The interpretation of this is that the purpose of the plan cost model is to find the best index and table access sequence and method. Ancillary logic that must be executed in all cases does not affect the choice of plan.

    Hard Disks 1995-2005

    It might seem that, given the pace of hardware change, such an old model cannot possibly be valid, resulting in horrible execution plans. Around 1995 or so, the high-performance HDD was 7200RPM with a sequential bandwidth of 4-5MB/s. The mean rotational latency for 7200RPM is 4ms. An average seek time of 8.5ms seems reasonable, though I have not kept any documentation from this period. This would correspond to 80 IOPS per disk at queue depth 1 per HDD. So, it seems curious that the SQL Server cost model is based on the random IOPS of 4 disks, but the sequential IO of 2 HDDs.

    Performance HDDs progressed first to 10K RPM around 1996, and then to 15K around 2000, with corresponding rotational latencies of 3 and 2ms respectively. Average seek time was reduced over time to 3ms with developments in powerful rare-earth magnets. The 10K HDD could support 125 IOPS, and the 15K HDD 200 IOPS. But no further progress was made on HDD rotational speed. In the same time period, hard disk sequential IO improved phenomenally quickly, exceeding 100MB/s in the fourth-generation 15K HDD around 2005.

    In other words, the SQL Server cost model is based on a 1350/320 = 4.2 ratio. But 15K HDDs in the mid-2000's were 100MB/s × 128 pages/MB = 12,800 pages/sec sequential to 200 IOPS random for a ratio of 64:1. It turns out that achieving nearly the sequential IO capability of HDDs from SQL Server required a special layout strategy, as outlined in the Fast Track Data Warehouse Reference Architecture papers, which few people followed. This was due to the fixed, inflexible IO pattern of SQL Server, which required the disk configuration to match that of SQL Server instead of being able to adjust the IO pattern to match that of the storage configuration.

    In a SAN system, where the vendor was exclusively focused on random IO, the typical configuration supported sequential performance of about 150 IOPS at 64KB, for 9.6MB/s or 1200 pages per sec. Quite by accident, this was approximately in line with the SQL Server query optimizer. In any case, regardless of the actual random and sequential values, the important criterion is the sequential to random IO ratio, on the assumption that leaf level pages are not in-memory. And this was usually not true by the 2005 timeframe, when hot data was likely to be in memory for a well-tuned transaction processing system on a maximum memory configuration 4-way system.

    Queue Depth

    All the numbers cited above for HDD random IOPS performance were for queue depth 1 per disk. Of course, SCSI HDDs support elevator seeking, in which IO is issued at high queue depth, and the disk reorders the sequence of operations for higher IOPS. The SQL Server engine facilitated this by issuing IO at high queue depth when the estimated number of executions of a key lookup or loop join was 25 or more. Below this, IO was issued at queue depth 1. But the plan cost did not have a model for HDD random IO at different queue depths.

    It should be pointed out that individual query performance optimization is not always the prime objective. For a transaction processing system that also supports reporting queries, the first priority is usually transaction latency. In this case the preferred strategy is to sacrifice reporting query performance by not issuing IO at excessively high-queue depth to keep latency low.

    SSDs

    SSDs started to pick up momentum around 2010. Early use was for extreme requirements. Now it is the more practical choice for almost all main line-of-business systems over HDDs. This comes at a time when the system may also happen to have enough memory for almost the entire database. With data on SSD, the access times for sequential and random might reflect a lower ratio than the optimizer model as opposed to a much higher ratio for HDDs in the FTDW RA. Still, the CPU cost for loading 64KB or more into memory with a single IO is lower than evicting pages and loading multiple individual pages. The scan should have a lower cost per page.

    There is a less appreciated aspect of SSD storage in being able to achieve massive IOPS capability far in excess of what is really necessary. In an HDD system, it was possible to support sufficiently high random IOPS at less than excessive latency for transactions. It was also possible to achieve high bandwidth for large block IO in support of DW. What was not entirely practical was to simultaneously support low latency IOPS for transactions and high bandwidth for DW. This is now possible with an SSD storage system and this should be the configuration objective. It would be helpful for SQL Server to implement the correct IO queue depth strategy for this capability.

    Key Lookup - Scan Cross-over

    For pure transactions, queries have a highly selective search argument. In such cases, the plan cost model is not hugely important, regardless of wild differences between the plan cost model and various implementations of real hardware. The cost model is more important for queries involving an intermediate number of rows, in which the true cross-over point from an index seek with key lookup to scan operation is of greater interest. But even then, we may prefer one of the two for other reasons.

    Parallel Cost Model Inconsistency

    There is a curious inconsistency in the cost model for parallel execution plans. The CPU component of operator cost is scaled down in proportion to the degree of parallelism, except for the last doubling of the number of processors (cores). At DOP 2, the CPU cost is half of that at DOP 1. On a system with 20 logical processors and unrestricted parallelism, the CPU cost is reduced by a factor of 10, regardless of whether it is 10 physical cores with Hyper-Threading or 20 cores without HT. The inconsistency occurs in the IO cost. There is one model for source access, that is, seeks and scans of tables or indexes. And there is a different model for intermediate operations, examples being Hash Match and Sort operators.

    Parallel Cost - Table and Index IO

    The IO component in table and index access operations, including index seek, scan, and key lookup, does not change with parallelism. The interpretation of this is that the IO system is saturated by a single thread. Hence, a parallel plan does not reduce IO time in accesses to tables and indexes. This is a very reasonable assumption for the original reference system having 2 or 4 HDDs.

    It is not an accurate representation for scan operations on a system configured to the FTDW RA, capable of multi-GB/s IO bandwidth. The saturated IO model might be somewhat applicable for random IO, because a single processor core from about 2006 on could drive 50K IOPS depending on the situation, and only a very large array of perhaps 200 15K HDDs could support that volume at queue depth 1 per HDD.

    On a modern system with properly configured SSD storage, the expectation should be that both sequential and random IO scale with parallelism, fully up to the number of physical cores. An example of a properly configured system is the test system from Parallel Execution in SQL Server 2016: one Xeon E5-2630 v4 10-core processor with 4 Intel P750 PCI-E SSDs.

    Parallel Cost - Hash and Sort Operations

    Hash and Sort operations for a small data set only have a CPU component. When the number of rows times the row size exceeds a set point, there is an IO component as well. The set point is some fraction of memory, either the max server memory or system memory if the former is unlimited. This appears to be just over 5MB per GB of memory. The value is also per thread; at higher DOP, each thread can accommodate up to the set point. Beyond the set point, the IO cost grows somewhat faster than linear, at perhaps 25 for 10MB and 300 for 100MB. However, the IO cost in hash and sort operations does scale with parallelism, unlike the IO cost in index seek and scan operations.
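
    As rough arithmetic on the set point described above (a sketch; the ~5MB per GB figure is an observation rather than a documented value, and the 32GB and DOP 10 inputs simply match the test configuration used later):

      -- Approximate hash/sort spill set point per the observation above
      DECLARE @memory_GB float = 32, @dop int = 10;
      SELECT 5.0 * @memory_GB        AS set_point_MB_per_thread,     -- ~160MB before an IO component appears
             5.0 * @memory_GB * @dop AS set_point_MB_across_threads; -- ~1.6GB in total at DOP 10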

    Actual runtime behavior might appear to be governed by the min memory per query configuration setting and the Resource Governor max memory setting. But it might be that the plan cost model only factors in the max server memory setting.

    Plan Cost Scaling - Parallelism

    Below is the ratio of plan cost at DOP 1 to higher DOP between 2 and 10 for the TPC-H SF 100 database on the Xeon E5-2630 v4 system with 64GB, but SQL Server limited to 32GB. In essence, this is the predicted speedup from parallelism by the plan cost model.

    Test System

    In several queries, there is very little reduction in plan cost at higher DOP. This occurs when the bulk of the cost is in the table and index access operations, where parallelism does not reduce the IO cost; only the CPU cost is reduced.

    In a few queries, examples being 8, 16 and 18, there is significant scaling in plan cost. This occurs primarily when much of the cost is in the hash match or sort operations, in which case, both CPU and IO are reduced with parallelism.

    In any case, actual scaling is always better than plan scaling, as shown below. See Parallel Execution in SQL Server 2016 for a more complete discussion.

    Test System

    The SQL Server plan cost model has very little predictive value on the matter of scaling. The plan cost model only predicts scaling when a significant portion of the cost is in the IO component of hash and sort operations. When the cost is largely in the IO of table and index scans and range seeks, the plan predicts weak scaling. In fact, many operations seem to scale well when the IO subsystem is not saturated from a single thread.

    True Cost versus Plan Cost

    As mentioned earlier, the plan cost model was based on a reference system that was probably a Pentium processor from the mid-1990's. In that era, each successive generation of processor increased rapidly in performance. It would not make sense to recalibrate the cost model to newer generation processors. And furthermore, it does not really matter what the processor is. Plan cost tends to be dominated by the IO component of table and index accesses, based on 320 IOPS random and 10.8MB/s sequential, or by the IO cost of hash and sort operations. Still, one might be curious as to how plan cost correlates with actual cost.

    The sum of the plan cost for the 22 TPC-H SF100 queries at DOP 1 is 200,839.65. Note that many of the execution plans have large hash operations for which the IO cost is dependent on system memory. The sum of the execution times for the same 22 queries also at DOP 1 is 3,515.15sec, although it would be less if some queries were manually tuned, resulting in higher plan cost, but lower runtime. The ratio of the sum of the plan costs to actual runtime is 57.135.

    Curiously, this is the approximate order of magnitude difference between a Pentium and modern processor core. There was no particular expectation of this because the plan cost model represents mostly IO wait while the test system has sufficiently powerful IO that the ratio of worker time to elapsed time was 98%.

    The figure below shows actual query runtime divided by plan cost scaled by 57.1, for the 22 TPC-H queries.

    Test System

    The range in variation between actual execution time relative to plan cost scaled by 57 is between 0.32 and 2.53. This is actually not a bad range, because the query optimizer plan cost does not attempt to model the true cost of a query, only the methods of table and index access.

    Two of the outliers on either end are Q2 at 0.32 and Q13 at 2.53. The critical part of the plan for Q2 is shown below.

    Q02

    The largest portion of the plan cost is for the loop join to PARTSUPP. As pointed out earlier and elsewhere, the plan cost for random IO is fairly high. Most of the operations in the TPC-H queries are scans or range seeks. The bulk of the weight in the measured 57X scaling relative to plan cost is driven by the plan cost of sequential IO.

    Also, disregarding the plan cost model, the SQL Server engine actually issues IO at high queue depth when the estimated number of rows is more than 25. The expectation is that a storage system with 4 NVMe PCI-E SSDs blows through the loop join faster than the 57X average scaling.

    The other extreme is Q13, for which the core plan component is below. There is nothing unusual in the plan.

    Q13

    However, the query's search argument includes the following:

      AND CUSTOMER not like '%%special%%requests%%'

    This expression is expected to be expensive to evaluate, yet it is not modeled in the plan cost, being treated like any other predicate.

    Compression

    With compression, the plan cost for the 22 TPC-H SF100 queries at DOP 1 decreases by 32% from 200,839 to 136,792, reflecting lower IO cost from compressed tables. The actual execution time increases by 21% from 3,515 to 4,275sec, reflecting the overhead of compression significantly outweighing the reduced IO.

    Cost Threshold for Parallelism

    The original scale of plan cost was that 1 represented a simple estimate of execution time in seconds. This has long lost its original meaning. For all practical purposes, from a few years after its arrival, we treated cost as some arbitrary unit of measure. Except for the fact that plan cost was used in the cost threshold for parallelism. In the original model, the intent was that the query optimizer does not consider parallelism until the estimated execution time of the non-parallel plan exceeds 5sec, which was once a very reasonable value.

    Today, plan cost 5 might be a query that executes in less than 0.1 sec without parallelism. This is most definitely well below the point at which parallelism should be engaged. The overhead of coordinating multiple threads is heavy in relation to the actual work itself. But we should also not wait until plan cost 285, corresponding to 57 × 5, for the original 5 sec duration. Part of this is because people today are far less patient than in the previous era. The other part is that today we have so many processor cores at our disposal that we can now be more liberal in their use than before.
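
    For reference, the threshold is a server-level sp_configure option. The sketch below raises it from the default of 5 to 30, the illustrative value discussed later, not a blanket recommendation:

      EXEC sp_configure 'show advanced options', 1;
      RECONFIGURE;
      EXEC sp_configure 'cost threshold for parallelism', 30;  -- default is 5
      RECONFIGURE;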

    Parallelism Operators

    Having discussed both the cost model and, to some degree, actual scaling, it is now time to lay the groundwork for a new model of plan cost and parallelism. A parallel plan involves splitting the work to be done among multiple threads. This involves additional work not in a non-parallel plan, comprising one or more of: distribute, repartition or gather streams.

    The effectiveness of a parallel plan depends on the ability to divide the work with some degree of uniformity, and on the extra work necessary to coordinate the multiple threads not being too large. There is also some effort necessary to start a thread or to acquire threads from a pool.

      The Bitmap Operator

    There is also another factor in parallelism. As discussed elsewhere, there are a number of queries in which the parallel execution plan runs with lower CPU time than a non-parallel plan, which should not happen. It turns out that the parallel plan has a bitmap operator that is not in the non-parallel plan.

    The following is from Microsoft TechNet on Bitmap Showplan Operator: " SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering. "

    As it turns out, the bitmap operator might have the effect of improving plan efficiency by up to 30% depending on the circumstances. Microsoft could allow the bitmap operator to occur in plans above a certain cost threshold but which are not parallel, for whatever reason. That is, a separate cost threshold for the bitmap operator from the cost threshold for parallelism. I would not hold my breath on this.

    Processor and System Architecture

    The foundation of the parallelism strategy must encompass the nature of the modern processor and system. The largest Intel Broadwell EP/EX of 2016 has 24 cores, the full set of which is available in the Xeon E7-v4, but only 22 in the E5-v4.

    While we may have been conditioned from past history into viewing a server system as having multiple processor sockets, the fact is that a system having a single processor socket today can be incredibly powerful. There are significant benefits for a single socket system. One is that all memory is local. Another substantial benefit is that the effort to coordinate between multiple threads is low. This leads to excellent scaling characteristics even when there appears to be significant inter-thread redistribution activity.

    For those who cannot let go of the multi-socket mind-lock, below are representations of a 2-socket E5

    and a 4-socket Xeon E7.

    The inescapable fact is that the cost of coordinating between threads running on cores in different sockets is very substantial. And so, there is a very substantial degradation in scaling when threads are on different sockets. For this reason, the parallelism strategy must respect the nature of the system architecture. Low to moderate parallelism must use cores on a common socket. There should probably be one strategy for employing parallelism up to the number of cores in one socket and a separate strategy for high parallelism spanning cores on multiple sockets.

    A Modern Parallelism Strategy

    With this knowledge, the parallelism strategy needs to blend the method for partitioning of work with the nature of the modern server processor and system. There are very many cores, i.e., compute resources that can be allocated to query plans, in either a single or multi-socket system. For this, we need something more than the very basic, legacy configuration controls: 1) Cost Threshold for Parallelism and 2) Max Degree of Parallelism. To a Connect request on this matter, Microsoft replied that the newer additional controls in Resource Governor are sufficient. In essence, the SQL Server team is abdicating its responsibility to provide an intelligent mechanism to properly utilize the capabilities of a modern system.

    The first step might be to rescale the plan cost so that the cost threshold for parallelism is in more understandable terms. This is not absolutely essential, and we could just state that plan cost 60 might correspond roughly to 1 second. It did not make sense to rescale plan cost in the days when processor core performance changed substantially with every generation. But performance at the core level over the last several years has been more stable, improving at a slow or moderate pace, and may continue to do so barring a radical event. We might take this opportunity to reset the plan cost.

    Second, the cost model needs to be more comprehensive. The cost of touching each row in a scan or other sequential access is larger than the current model represents and should be corrected. For this, the impact of lock level might be considered. The cost of touching each column is both non-trivial and not accounted for at all in the current model. The cost of ancillary logic also needs to be modeled. And of course, the cost of access to compressed pages should be accounted for. The above items may not impact the structure of the execution plan, which is why they were not considered back in version 7, or in 2008 for compression. But they are relevant in the cost threshold for parallelism assessment, and also impact scaling projections.

    Third, the IO cost model needs to be fixed. This might involve modeling the actual IO subsystem performance characteristics. Or it might just assume that IO performance scales with parallelism. Also important is that parallelism scaling does not stop at half the number of cores, instead going up to the full number of physical cores. It would help if the optimizer knew the difference between physical and logical cores. The IO cost in hash and sort operations already scales with parallelism, so fixing the IO cost for source access would make the complete model consistent.

    As a rough guess, (current) plan cost 30, corresponding to roughly one-half of one second execution time, might be a good default cost threshold for parallelism, settable of course. For a query with plan cost 30 expected to execute in 0.5 CPU-sec (worker time), a DOP of 2 and not higher might be a good idea. One suggestion for the DOP strategy might be nonlinear: employ DOP 2 at the base value of the cost threshold, then double the DOP for every 4× increase over the base cost. If the base is 30, then plan cost 120 (4×30) is the threshold for DOP 4 and 480 (16×30) is the threshold for DOP 8. This could be the strategy for DOP up to the number of cores in one socket, with one threshold for the base value and another for escalating the DOP.
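
    A minimal sketch of that escalation rule, assuming base cost 30 (this is the proposal above, not an existing SQL Server feature):

      -- Suggested DOP from plan cost: DOP 2 at the base threshold, doubling for every 4x increase in cost
      DECLARE @base float = 30, @plan_cost float = 480;
      SELECT CASE
               WHEN @plan_cost <      @base THEN 1
               WHEN @plan_cost <  4 * @base THEN 2   --  30 to <120 : DOP 2
               WHEN @plan_cost < 16 * @base THEN 4   -- 120 to <480 : DOP 4
               WHEN @plan_cost < 64 * @base THEN 8   -- 480 to <1920: DOP 8
               ELSE 16                               -- capped around the cores in one socket
             END AS suggested_dop;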

    The expectation is that scaling is not so simple or excellent for parallelism on cores in different sockets. It would be good if the plan cost could model this. In any case, there might be an even higher threshold for employing cores over multiple sockets. We should not normally set DOP to be somewhat over the number of cores in one socket. It should be up to the number of cores in one socket for most circumstances, or most of the cores of multiple sockets for exceptional circumstances. Furthermore, if the desired number of cores in one socket are not available, then reduce the DOP rather than employ cores on different sockets.

    Summary

    Both the cost model and the simple controls for parallelism have become seriously obsolete for modern high core count processors. An improved cost model, encompassing cost structure that was not necessary for an optimal single-thread table access strategy, is now needed for the parallelism strategy. The cost model needs to correctly model parallelism on modern hardware, which means fixing the IO cost model that currently assumes a single thread saturates the IO. A proper parallelism strategy also needs more sophistication than the current on/off decision at the cost threshold for parallelism, and the all-or-nothing max degree of parallelism setting. A moderately more complex model has been suggested, but alternatives are possible.

  • Parallel Execution on SQL Server 2016

    There are a number of interesting questions in parallel execution performance that can be investigated using the TPC-H kit, which has a data generator. There are also supplemental files that accompany published results. Important questions in parallel execution are: scaling versus degree of parallelism (DOP) on physical cores, Hyper-Threading, Compression and Columnstore. (Memory-optimized tables support parallel plans in 2016? perhaps in a later post.) To be clear, the TPC-H data generator (dbgen) and queries are used here, but no attempt is made to adhere to the TPC-H benchmark requirements (there are some outright departures). The material reported here is only for the purpose of studying parallel execution characteristics.

    In brief, the findings are as follows. The average performance gain from Hyper-Threading over the 22 TPC-H queries, based on the sum of query times, is 23%, in line with characteristics observed in SQL Server version 2012. (Official TPC-H results use a geometric mean.) There used to be an anomalous behavior on SQL Server 2012 in which performance was better with HT enabled and the second logical processor idle than with HT disabled, even for single-threaded execution, as reported in Hyper Threading. Compression has an average overhead of 20% across all DOP levels. For SQL Server 2008, I had reported at SF 10 that compression overhead was 40% between DOP 1-8, then decreased to less than 10% at DOP 32. The test circumstances were different, and may have been related to the NUMA effects of an 8-way system. Columnstore performance is excellent and should be considered whenever practical.

    • Hyper-Threading performance boost 23% in parallel DW queries
    • HT anomalous performance gain in version 2012 at DOP 1 no more
    • Page compression overhead 20%, down from 40% in 2008
    • Columnstore should run one per physical core, HT can be enabled

    Test Environment

    The test system is a single socket Intel Xeon E5-2630 v4 10-core 2.2GHz processor with 64GB memory. Storage is 4 Intel P750 PCI-E NVMe SSDs for data and temp, and one SATA SSD for logs. The software stack is Windows Server 2016 and SQL Server 2016 SP1. SQL Server was limited to 32GB for results reported here.

    The data is generated using dbgen from the TPC-H kit at scale factor 100. This used to mean the Lineitem table is 100GB, but with a 4-byte date for three columns that were formerly 8-byte datetime, is now 80GB in row-store without compression (still 600M rows). Overall database size is 140GB. With page compression, the Lineitem table is 46GB data and overall database size is 101GB. In columnstore, total database size is 44GB.
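
    For reference, a sketch of the DDL behind the compressed and columnstore variants (object names follow the TPC-H schema; the row-store, page-compressed and columnstore versions are assumed to be separate copies of the database):

      -- Page compression on the row-store copy
      ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);
      -- Columnstore copy: build a clustered columnstore index (here on a heap copy of the table)
      CREATE CLUSTERED COLUMNSTORE INDEX ccsi_lineitem ON dbo.LINEITEM;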

    Elsewhere, I have argued that performance characteristics are very different between scaling with multiple cores on a single socket and scaling with cores over multiple sockets. Multi-socket scaling cannot be investigated here. Conclusions should not be extrapolated beyond what is tested.

    Baseline, no compression, row store

    The figure below is speedup (elapsed time ratio) at various degrees of parallelism relative to DOP 1 for the baseline row-store, no compression database. Note that there are 10 physical cores. The SQL Server engine did seem to run parallel plans on separate cores except when DOP exceeds the number of cores.

    Test System

    Queries 5 and 15 are worse at DOP 2 than DOP 1, at 0.675 and 0.802 respectively, and Q3 is slightly slower at 0.948 relative to DOP 1. In queries 3 and 5, the negative scaling from DOP 1 to 2 is caused by a change in the execution plan structure, excluding parallelism operators.

    Below is a focused look at scaling from DOP 1 to 2. As noted, three queries are worse at DOP 2 than 1. Several queries are almost exactly 2X better, as expected in the ideal case for a parallel execution plan.

    Test System

    And then there are also queries that are more than 2X better at DOP 2 than 1. Q17 is 4.3X faster at DOP 2 than 1. Four others at more than 3.5X. Usually, these have the bitmap operator in the parallel plan, which is not used in a non-parallel plan. The bitmap operator has some startup cost, but can process large data sets more efficiently.

    The reasonable assumption is that large queries have a parallel plan, so there is no need to employ the bitmap in non-parallel plans. There are other reasons for super-scaling. One is the set points for hash and sort operations spilling to tempdb. The setting is fixed as a fraction of system memory for each thread. At higher DOP, more memory is allocated across all threads of the parallel plan.

    Below is the scaling from DOP 2 to 10, which removes the wide fluctuation that occurs between DOP 1 to 2. In the ideal parallelism case, scaling should be close to 5.

    Overall, the DOP 2 to 10 scaling is more uniform, except for Q15, in which the poor scaling from 1 to 2 is recovered at DOP 4 and higher. Q15 scaling from 1 to 4 is 3.346, which is not bad, good even.

    Performance is largely comparable between HT enabled and disabled for DOP levels up to the number of physical cores, 1-10 in this case. It was reported in Hyper Threading that queries at DOP 1 ran 24% faster with HT enabled than disabled. This might seem paradoxical. A speculative explanation is that the SQL Server engine tries to work in batches that fit inside the L2 cache. There is an API for the L2 cache size, but how much of this should be assumed to be available for use by a particular operation? If the assumed value is too high, but the HT-enabled case reduces this by half, then it is possible that DOP 1 with HT enabled could be faster?

    Below is the effect of enabling Hyper-Threading (HT) with DOP 20, over disabling HT in UEFI (formerly BIOS) for DOP 10.

    Test System

    Overall, the benefit of HT is strong at 1.23X. Q2 shows a 0.9X reduction (slightly slower), going from 0.61 to 0.67sec runtime, but this is probably just the effect of running a small query at very high DOP. It would be interesting to know which logical processors are employed when DOP exceeds the number of physical cores on one socket, in a multi-socket system with HT enabled. Scaling over cores in different sockets is not as good as over cores all in one socket, so perhaps the 23% gain from HT might be acceptable?

    Special cases

    In queries 3 and 5, there is a good plan at DOP 1, but less efficient plans for DOP 2 and higher. The result is negative scaling to DOP 2, and poor scaling at higher DOP relative to DOP 1. The query optimizer model believes that two or more plans have similar costs. There are differences in how the cost for these plans vary with the degree of parallelism, and hence the winning plan changes with DOP.

    Both Q3 and 5 can be rewritten to force a particular plan. Note that the special feature Date Correlation Optimization was designed specifically for TPC-H, and probably no production database uses it. When a query plan is forced, the date correlation optimization is also turned off. Since we are interested in scaling for a particular plan, the full set of implied date range SARGs are explicitly specified.
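
    A sketch of the general shape of such a rewrite, using a simplified Q3-style query: the join order as written is pinned with FORCE ORDER, an explicit shipdate upper bound stands in for the implied date-range SARG, and the 122-day window is a hypothetical illustration:

      SELECT   o.O_ORDERKEY, SUM(l.L_EXTENDEDPRICE * (1 - l.L_DISCOUNT)) AS revenue
      FROM     dbo.ORDERS o
      JOIN     dbo.LINEITEM l ON l.L_ORDERKEY = o.O_ORDERKEY
      WHERE    o.O_ORDERDATE < '1995-03-15'
      AND      l.L_SHIPDATE  > '1995-03-15'
      AND      l.L_SHIPDATE  < DATEADD(day, 122, CONVERT(date, '1995-03-15'))  -- explicit implied SARG
      GROUP BY o.O_ORDERKEY
      OPTION   (FORCE ORDER, MAXDOP 2);  -- pin the written join order; test at a chosen DOP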

    The figure below shows Q3, 5, and 7 scaling relative to DOP 1 for natural and forced execution plans.

    Force

    In Q3 and Q5, the natural plan from DOP 1 is forced in the parallel plans. This results in good scaling at all DOP levels. Below is the execution plan for Query 3 at DOP 1 without other hints having cost 1349, execution time 28.38sec.

    Q3 DOP1

    Below is the plan at DOP 2 and higher, without hints, having cost 1057 at DOP2. Execution time is 29.94 sec, slightly longer than DOP 1.

    Q3 n

    On forcing the parallel plan to the same join order of the non-parallel plan, the cost is 1248.

    Q3 n

    In actuality, the above plan, having the join order of the non-parallel plan, executes faster at 8.564sec versus 29.94sec for the unforced parallel plan.

    Below is the natural plan for query 5 at DOP 1, having cost 4445, and execution time 72.70sec.

    Q5 a

    The natural parallel plan for Q5 has the structure below. In this case, the query was written to force the join order of the parallel plan, and then set to DOP 1. This is done so that the plan structure is easier to discern without excessive complexity of the parallelism operators. The plan cost at DOP 2 is 3243.31 and execution time 107.62, significantly longer than for DOP 1.

    Q5 a

    If the plan at DOP 2 had been forced to the same join order as at DOP 1, the plan cost would be 3244.02, or 0.022% higher. In fact, the forced plan is much quicker at 23.469sec than both the DOP 1 plan and the natural DOP 2 plan.

    Q15 is the other query that has negative scaling from DOP 1 to 2, yet miraculously recovers at DOP 4 and higher relative to DOP 1. This is a strange case. The execution plans are similar at the various DOP levels, but there is a slight difference at DOP 2 that does not appear at the other levels.

    It is not always the case that the non-parallel plan structure is better than the parallel plan structure. Q7 has a super-scaling gain of 2.969X from DOP 1 to 2, an outstanding gain of 1.908 from DOP 2 to 4, then another super-scaling gain of 3.072 from DOP 4 to 8, before trailing off, all with HT enabled.

    However, Q7 performance is lower at DOP 2-10 with HT disabled than with HT enabled, when the results should be similar. This results in Q7 showing an incredible 1.8X gain from HT disabled at DOP 10 to HT enabled at DOP 20. With HT enabled in UEFI, the gain from DOP 10 to 20 is 1.25X, which is very good but not irrationally high.

    Below is the natural plan at DOP 1, cost 4780 and execution time 114.96sec.

    Q7 a

    It changes to another plan at DOP 2, and then a third plan at DOP 4, for which the structure is shown below, but forced to DOP 1, cost 5840, execution time 102.074sec. The forced DOP 2 plan cost is 4614, execution time 31.895, whereas the natural DOP 2 plan cost 3710.8, execution time 38.72.

    Q7 a

    In this case, it is the parallel plan at DOP 4 and higher that runs faster at all DOP levels instead of the DOP 1 or 2 plans. In forcing the plan to the DOP 4+ plan, normal good scaling results are observed.

    Query Optimizer Cost Model

    (this should be a separate article)
    The root cause of most of these anomalies is that the cost formulas for execution operations are based on a very simple, fixed model. The model is adequate for transaction processing in which queries have highly selective search arguments. But the model is poor in assessing the index seek plus key lookup versus scan cross-over point.

    The SQL Server query optimizer employs a fixed model for IO cost. The model is 320 IOPS for operations assumed to be random (key lookups and loop join inner source), and 1350 pages/sec × 8KB/page = 10,800KB/s for sequential IO operations (scans and range seeks). The absolute values of random and sequential IOPS are not important, as it is the ratio that affects the execution plan.

    Furthermore, the cost model assumes that IO is saturated at DOP 1 for accesses to source tables and indexes. A parallel plan does not reduce the IO cost in source accesses. IO cost is reduced in accordance with DOP in intermediate hash and sort operations. For storage on a small number of hard disks, IO is saturated at DOP 1. A very large HDD array is capable of scaling sequential IO with parallelism. A proper SSD array is capable of that and also scaling random IO.

    Why does the cost model assume saturated IO for source accesses but scalable IO for intermediate non-source operations? It is as if one developer designed the model for source, another did the model for intermediate, but they did not talk to each other to present a consistent picture. The two developers may sit in adjacent cubicles.

    A second issue is that the cost model underestimates row access cost, and essentially does not model column and logic costs. This is evident in the plan cost for a query doing a row count versus aggregating one or more columns. There is no difference between a count and any number of column aggregations. There is a small cost assessed for the first logic operation (for example, multiplying two columns), but not for additional columns or logic. In actuality, column and logic costs might be the largest part of an aggregation query.

    The combined effect of the query optimizer cost model limitations is that it has no meaningful capability to predict the degree of scaling to be expected with parallelism. There are only two controls: the cost threshold for parallelism, and the max degree of parallelism. The query optimizer employs parallelism when the parallel plan has a model cost just slightly lower than the non-parallel plan and the non-parallel plan cost is over the threshold.

    This was not a huge issue back when big systems had a total of four processor cores. A modern system has up to 24 cores per socket in the current generation, and typically multiple sockets. The all-or-nothing strategy is seriously inadequate. It is now important to have the ability to assess what degree of parallelism to employ. This should be determined based on the query cost and expected scaling characteristics.

    A query that executes in a few seconds might employ a low DOP, perhaps 2 or 4, but only if the gain from parallelism is sufficient. For example, 40% could be considered sufficient, but 10% is not. An intermediate-to-large query might employ up to all the cores in one socket, again, so long as there is scaling with increasing DOP. Very large queries might employ cores in all sockets, but only if it can be predicted that the execution plan scales well across multiple sockets.

    Page Compression, row store

    Below is the compression overhead at various DOP levels relative to no compression. (The earlier comparison point, SQL Server 2008 sp1, was on an 8-way quad-core Opteron at SF10, with all data in memory.)

    Test System

    Q5 has the good execution plan at DOP 1 and 2, before shifting to the poor plan at DOP 4 and higher. However, at DOP 2 the comparison is between a good plan on the database with compressed tables and a poor plan on the database having tables without compression, leading to the anomaly. Overall, the compression overhead is a fairly uniform 1.2X at all DOP levels, but can range from being nearly free to a high of 1.4X. The individual values over 1.4X are probably anomalies rather than representative.

    This is different than in 2008R2, in which compression overhead was 1.4X at DOP 1, but then gradually decreased to being essentially free at higher DOP. The test system back then was an 8-way Opteron quad-core, in which the effects of NUMA might have been significant. The hypothesized explanation was that without compression, there was contention between threads as DOP increased, decreasing scaling. The extra overhead of compression just happened to alleviate the contention.

    Presumably the explanation now is that the SQL Server engine team did a good job of removing contention where possible, so that there is better scaling without compression. Then the overhead of compression is not masked. Or it could be because of the difference in the system, NUMA and not.

    Columnstore

    Below is columnstore performance at various DOP, relative to row store at the same DOP. In some queries, it can be almost 30X better. Overall, columnstore is more than 4X better than row store.

    Test System

    Note that columnstore at DOP 20 with HT enabled is very poor. This is the expected behavior, because columnstore marches through memory sequentially, so HT is expected to interfere with columnstore code.

    The figure below shows the large negative impact of DOP 20 relative to DOP 10, although four queries showed modest gain.

    Test System

    There was little difference between DOP 10 with HT disabled or enabled. Enabling HT in UEFI is not harmful for columnstore, so long as max DOP is limited to the number of physical cores in columnstore queries.
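
    One way to do that per query is a MAXDOP hint, as in the sketch below (a Resource Governor workload group with MAX_DOP is the broader-scoped alternative; the table name is from the TPC-H schema):

      -- Keep a columnstore query on the 10 physical cores while HT remains enabled
      SELECT COUNT_BIG(*) AS row_count
      FROM   dbo.LINEITEM
      OPTION (MAXDOP 10);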

    The more proper comparison is between columnstore at DOP 10 and row store at DOP 20 with Hyper-Threading, as shown below.

    Test System

    The overall gain for columnstore is now 3.45. Only Q13 shows no gain for columnstore, but no query is worse than row store. Q13 was one of the few columnstore queries that did show gain with HT, so columnstore is better than row store at matching DOP and HT.

    The figure below shows columnstore scaling versus DOP.

    Test System

    Columnstore scaling is uniformly good from DOP 1 to 4, but becomes erratic at DOP 8 and 10.

    Summary

    Parallelism is a complicated topic. The main guidance points were highlighted at the beginning. In addition, make few or no assumptions and always investigate. Evaluate performance characteristics at different levels of parallelism, DOP 1, 2, 4 and higher. Watch for scaling anomalies, and changes in execution plan structure. Be especially careful in comparing scaling past the number of cores in a single socket.

    Hyper-Threading does have benefit in parallel plans, but mostly for row store. It is probably advisable to enable HT in the UEFI/BIOS, but to limit columnstore to only the number of physical cores. It would be better if Microsoft did not employ HT for columnstore unless they know exactly when HT might benefit. Compression does have overhead, but might still be worthwhile.

    The larger issue is that SQL Server seriously needs to overhaul the query optimizer cost model. Much of the SQL Server engine is very sophisticated, and yet the cost model has not changed substantially since version 7 in 1998. As is, the execution plan cost has weak bearing on reality, and it has no ability to predict scaling with parallelism, issues in scaling beyond a single socket, or when scaling to HT works. Without a proper cost model, the benefit of a sophisticated query optimizer is lost. For the time being, developers and DBAs can (re-)learn the method for writing SQL to a particular join order.

  • Rethinking System Architecture

    Server system memory capacities have grown to ridiculously large levels far beyond what is necessary now that solid-state storage is practical. Why is this a problem? Because the requirement that memory capacity trumps other criteria has driven system architecture to be focused exclusively on low cost DRAM. DDR DRAM, currently in its fourth version, has low cost, acceptable bandwidth, but poor latency. Round-trip memory access latency is now far more important for database transaction processing. The inverse of latency is practically a direct measure of performance.

    In the hard disk era, the importance of memory in keeping IO volume manageable did trump all other criteria. In theory, any level of IO performance could be achieved by aggregating enough HDDs. But large arrays have proportionately more frequent component failures. Although RAID provides fault tolerance, the failed drive rebuild process causes operational problems. The practical upper bound for HDD storage is on the order of 1000 disks, or 200K IOPS. If it was necessary to purchase 1TB of memory to accomplish this, then it was money well spent because the storage was even more expensive.

    Since then, storage with performance requirements has been or should be transitioning to all-flash, and not some tiered mix of flash and HDD. Modern solid-state arrays can handle up to 1M IOPS, and easily far more on a full NVMe stack. Now is the time to re-evaluate just how much memory is really needed when storage is solid-state.

    There are existing memory technologies, RLDRAM and SRAM for example, with different degrees of lower latency, higher cost and lower density at both the chip and system level. There is potential to reduce memory latency by a factor of two or more. The performance impact for database transaction processing is expected to be comparable. A single-socket system with RLDRAM or SRAM memory could replace a 4-socket system with DDR4. Practically any IO volume, even several million IOPS, can be handled by the storage system. The key is that the CPU expended by the database engine on IO be kept to a reasonable level.

    An argument is made for a radical overhaul of current system architecture, replacing DDR4 DRAM as main memory with a different technology that has much lower latency, possibly sacrificing an order of magnitude in capacity. As the memory controller is integrated into the processor, this also calls for a processor architecture change. A proper assessment for such a proposal must examine not only its own merits, but also other technologies with the potential for order of magnitude impact.

    The discussion here mostly pertains to Intel processors and Microsoft SQL Server. However, the concepts are valid for any processor and any database engine built around page organization, row-store with b-tree indexes. The argument for low latency memory is valid even in the presence of Hekaton memory-optimized tables, or other MVCC implementations, because it is a drop-in hardware solution instead of a database re-architecture project. Of course, the heavy lifting is on Intel or others to re-architect the system via the processor and memory.

    Background

    For more than forty years, DRAM has been the standard for main memory in almost every computer system architecture. A very long time ago, having sufficient memory to avoid paging was a major accomplishment. And so, the driving force in memory was the semiconductor technology with low cost. This led to DRAM. And DRAM was optimized to 1 transistor plus 1 capacitor (1T1C). DRAM addressing was multiplexed because even the package and signal count impact on cost mattered.

    The page file has since become a relic, one that for some reason has yet to be removed. Over the years, the mainstream form of DRAM, from SDRAM to DDR4, did evolve to keep pace with progress on bandwidth. The criterion that has changed little is latency, and it is round-trip latency that has become critical in applications characterized by pointer-chasing code resulting in serialized memory accesses.

    Even as memory capacity reached enormous proportions, the database community routinely continued to configure systems with the memory slots filled, often with the maximum capacity DIMMs, despite the premium over the next lower capacity. There was a valid reason for this. Extravagant memory could mask (serious) deficiencies in storage system performance.

    Take the above in conjunction with SAN vendors providing helpful advice such as log volumes do not need dedicated physical disks, and the SAN has 32GB cache that will solve all IO performance problems. Then there is the doctrine to implement their vision of storage as a service. But now all-flash is the better technology for IO intensive storage. Solid-state storage on an NVMe stack is even better. Massive memory for the sole purpose of driving IO down to noise levels is no longer necessary.

    Now free of the need to cover up for weakness in storage, it is time to rethink system architecture, particularly the memory strategy. Not all applications desire low latency memory at lower capacity and higher cost. Some work just fine with the characteristics of DDR4. Others prefer a different direction entirely. The HPC community is going in the direction of extreme memory bandwidth, implemented with MCDRAM in the latest Xeon Phi.

    In the past, when Moore's Law was in full effect, Intel operated on: "the Process is the Business Model". It was more important to push the manufacturing process, which meant having a new architecture every two years with twice the complexity of the previous generation and then be ready to shrink the new architecture to a new process the next year. Specialty products divert resources from the main priority and were difficult to justify. But now process technology has slowed to a three-year cycle and perhaps four years in the next cycle. It is time to address the major profitable product outside of general purpose computing.

    The principle topic here is the impact of memory latency on database transaction performance, and the directions forward for significant advancement. As the matter has complex entanglements, a number of different aspects come into play. One is scaling on non-uniform memory access (NUMA) system architecture, in which the impact of memory access can be examined. Memory technologies with lower latency are mentioned, but these topics are left to experts in their respective fields. Hyper-Threading is a work-around to the issue of memory latency, so it is mentioned.

    The general strategy for server system performance in recent years is multi-core. But it is also important to consider which is the right core as the foundation. Another question to answer is whether a hardware and/or software solution should be employed. On software, Multi-version Concurrency Control (MVCC) is the technology employed by memory-optimized databases (the term in-memory, has misleading connotations) that can promise large performance gains, and even more extreme gain when coupled with natively compiled procedures. Architecting the database for scaling on NUMA architecture is also a good idea.

    A mixed item is SIMD, the SSE/AVX registers and instructions introduced over the last 18 years. From the database transaction processing point of view, this is not something that would have been pursued. But it has come to occupy a significant chunk of real estate on the processor core. So, find a way to use it! Or give it back.

    The list is as follows:

    • Scale up on NUMA versus single socket
    • RLDRAM/SRAM versus DDR4+3D XPoint
    • Hyper-Threading to 4-way
    • Core versus Atom and all comers?
    • Hekaton, memory-optimized, MVCC
    • Database architected for NUMA
    • SSE/AVX Vector Instructions

    The above topics will be addressed, perhaps only partially, often out of order, and sometimes mixed as appropriate. Factors to consider are potential impact, difficulty of implementation and overall business justification.

    L3 and Memory Latency

    The 7-cpu benchmark website lists L3 and memory latency for some recent processors, shown below.

    Processor                | Base Freq | L3           | Local Memory | Remote Memory
    Westmere EP (Xeon X5650) | 2.67GHz   | 40-42 cycles | L3+67ns      | L3+105ns
    Ivy Bridge (i7-3770)     | 3.4GHz    | 30 cycles    | L3+53ns      | -
    Haswell (i7-4770)        | 3.4GHz    | 36 cycles    | L3+57ns      | -
    Haswell (E5-2603 v3)     | 1.6GHz    | 42-43 cycles | -            | -
    Skylake (i7-6700)        | 4.0GHz    | 42 cycles    | L3+51ns      | -

    There are a number of questions. Is L3 latency determined by absolute time or cycles? In the Intel Xeon processors from Ivy Bridge EP/EX on, this is particularly complicated because there are 3 die layout models: LCC, MCC, and HCC, each with different structure and/or composition. The Intel material says that L3 latency is impacted by cache coherency snoop modes, ring structure and composition. See Intel Dev Conference 2015 Processor Architecture Update, and the similar Intel HPC Software Workshop 2016 Barcelona. For simplicity, L3 is assumed to be 15ns here without consideration for other details.

    DRAM, SQL Server 8KB Page

    The Crucial web site and Micron datasheet cite DDR4-2133 latency at 15-15-15, working out to 14ns each for CL, RCD and RP, so random row latency is 42ns at the DRAM interface. The Intel web page for their Memory Latency Checker utility shows an example having local node latency as 67.5 or 68.5 ns and remote node as 125.2 or 126.5 ns. L3 and memory latency on modern processors is a complicated matter. The Intel values above include the L3 latency. So, by implication, the 10ns difference is the transmission time from memory controller to DRAM and back?
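
    The timing arithmetic, for reference (DDR4-2133 transfers at 2133 MT/s on a ~1066.5MHz clock, so 15 clocks is about 14ns):

      -- DDR4-2133 timing arithmetic; matches the 14ns and 42ns figures above
      DECLARE @clock_MHz float = 1066.5;
      SELECT 1000 * 15 / @clock_MHz     AS ns_per_15_clocks,  -- ~14ns each for CL, RCD and RP
             1000 * 3 * 15 / @clock_MHz AS random_row_ns;     -- ~42ns at the DRAM interface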

    One of the modes of DRAM operation is to open a row, then access different columns, all on the same row, with only the CL time between columns. I do not recall this being in the memory access API? Is it set in the memory controller? Systems designed for database servers might force close a row immediately?

    In accessing a database page, first, the header is read. Next, the row offsets at the end of the 8KB page. Then the contents of a row, all or specific columns. This should constitute a sequence of successive reads all to the same DRAM row?


    Memory Architecture - Xeon E5 and E7

    The most recent Core i3/5/7 processors run in the mid-4GHz range. There are Xeon EP and EX processors with base frequency in the 3GHz+ range. But for the high core count models, 2.2GHz is the common base frequency. Below is a representation of the Xeon E5 memory subsystem.


    [Figure: Xeon E5 memory subsystem]

    In the Xeon E7, the memory controller connects to a scalable memory buffer (SMB) or Memory Extension Buffer (MXB), depending on which document, and the interface between MC and SMB is SMI. The SMB doubles the number of DIMMs that can connect to each memory channel.

    [Figure: Xeon E7 memory subsystem with SMB]

    There is no mention of the extra latency for the SMB. It cannot be free, "All magic comes with a price". The value of 15ns is used here for illustrative purposes.

    Anyone with access to both 2-way Xeon E5 and 4-way E7 systems of the same generation and core count model (LCC, MCC or HCC) is requested to run the 7-cpu benchmark or Intel Memory Latency Checker utility, and make the results known. In principle, if Intel were feeling helpful, they would do this.

    System Architecture

    Below is a single socket Xeon E5 v4. The high core count (HCC) model has 24 cores. But the Xeon E5 v4 series does not offer a 24-core model. Two cores are shown as disabled as indicated, though it could be any two. There is only one memory node, and all memory accesses are local.

    [Diagram: single-socket Xeon E5 v4, HCC die]

    Below is a 2-socket system, representing the Xeon E5 HCC or MCC models, both of which have two double rings and an interconnect between the rings. The details shown are to illustrate the nature of the NUMA architecture.

    [Diagram: 2-socket Xeon E5 system]

    NUMA is a complicated topic, with heavy discussion of cache coherency, the details of which impact L3 and memory latency. The discussion here considers only a very simple model, looking solely at the physical-distance impact on memory latency. See the NUMA Deep Dive series by Frank Denneman.

    Each socket is its own memory node (not sure what happens in Cluster-on-Die mode). To a core in socket 0, memory in node 0 is local, memory in node 1 is remote, and vice versa to a core in the other socket (more diagrams at System Architecture Review 2016).

    Below is a representation of a 4-socket system based on the Xeon E7 v4. To a core in socket 0, memory in node 0 is local, and memory in nodes 1, 2, and 3 is remote. And repeat for the other nodes.

    [Diagram: 4-socket Xeon E7 v4 system]

    It is assumed that memory accesses are 15ns longer than on the E5, due to the extra hop through the SMB, outbound and inbound. This applies to both local and remote node memory accesses.

    On system and database startup, threads should allocate memory on the local node. After the buffer-cache has been warmed up, there is no guarantee that threads running in socket 0 will primarily access pages located in memory node 0, or threads on socket 1 to memory node 1. That is, unless the database has been purposefully architected with a plan in mind to achieve higher than random memory locality.

    The application also needs to be built to the same memory locality plan. A connection to SQL Server will use a specific TCP/IP port number based on the key value range. SQL Server will have TCP/IP ports mapped to specified NUMA nodes. Assuming the application does the initial buffer-cache warm up, and not some other source that is not aware of the NUMA tuning, there should be alignment of threads to pages on the local node. (It might be better if an explicit statement specifies the preferred NUMA node by key value range?)

    An example of how this is done can be found in the TPC-C and TPC-E benchmark full disclosures and supplemental files. Better yet is to get the actual kit from the database vendor; for SQL Server, the POC is JR. Curiously, among the TPC-E full disclosure and supplemental files' thousands of pages of inane details, there is nothing that says that the NUMA tuning is of pivotal importance. It is assumed in the examples here that threads access random pages, and memory accesses are evenly distributed over the memory nodes.

    Simple Model for Memory Latency

    In Memory Latency, NUMA and HT, a highly simplified model for the role of memory latency in database transaction processing is used to demonstrate the differences between a single-socket system having uniform memory and multi-socket systems having NUMA architecture.

    Using the example of a hypothetical transaction that could execute in 10M cycles with "single-cycle" memory, the "actual" time is modeled for a 2.2GHz core in which 5% of instructions involve a non-open page memory access. The model assumes no IO, but IO could be factored in if desired. The Xeon E5 memory model is used for 1 and 2 sockets, and the E7 for the 4-socket system.

    GHz  L3+mem  remote  skt  avg. mem  mem cycles  fraction  tot cycles  tps/core  tot tx/sec
    2.2  67ns    n/a     1    67ns      147         0.05      83.2M       26.44     -
    2.2  67ns    125ns   2    96ns      211         0.05      115.1M      19.11     38.23
    2.2  82ns    140ns   4    125.5ns   276         0.05      147.5M      14.91     59.64

    If there were such a thing as single-cycle memory, the performance would be 220 transactions per second, based on 2,200M CPU-cycles per second and 10M cycles per transaction.

    Based on 67ns round-trip memory access, accounting for a full CL+tRCD+tRP, the transmission time between processor and DRAM, and L3 latency, incurred in 0.5M of the 10M "instructions", the transaction now completes in 83.2M cycles.

    The balance of 73.2M cycles is spent waiting for memory accesses to complete. This circumstance arises primarily in pointer-chasing code, where the contents of one memory access determine the next action. Until the access completes, there is nothing else for the thread to do. The general advice is to avoid this type of coding, except that this is exactly what happens in searching a B-tree index.
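    The arithmetic behind the table above can be expressed as a small sketch. This is Python used purely for illustration; the 10M-instruction transaction, 5% miss fraction, and evenly distributed node accesses are the stated assumptions of the model, and the function name model() is just a label for this sketch:

        # Simplified memory latency model from the discussion above.
        # Assumptions: 10M-instruction transaction, 5% of instructions stall on a
        # round-trip memory access, accesses spread evenly over memory nodes, no IO.
        def model(ghz, local_ns, remote_ns=None, sockets=1,
                  instructions=10e6, miss_fraction=0.05):
            if sockets == 1 or remote_ns is None:
                avg_ns = local_ns
            else:
                avg_ns = (local_ns + (sockets - 1) * remote_ns) / sockets
            mem_cycles = avg_ns * ghz                     # cycles per memory access
            misses = instructions * miss_fraction
            total_cycles = (instructions - misses) + misses * mem_cycles
            tps_per_core = ghz * 1e9 / total_cycles
            return avg_ns, mem_cycles, total_cycles, tps_per_core

        print(model(2.2, 67))           # ~147 cycles/access, ~83.2M cycles, ~26.4 tps/core
        print(model(2.2, 67, 125, 2))   # ~211 cycles, ~115.1M, ~19.1 tps/core
        print(model(2.2, 82, 140, 4))   # ~276 cycles, ~147.5M, ~14.9 tps/core (E7, +15ns SMB)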

    If the impact of NUMA on database transaction processing performance were understood and clearly communicated, databases could have been architected from the beginning to work with the SQL Server NUMA and TCP/IP port mapping features. Then threads running on a given node primarily access pages local to that node.

    If this forethought had been neglected, then one option is to re-architect both the database and application, which will probably involve changing the primary key of the core tables. Otherwise, accept that scaling on multi-socket systems is not going to be what might have been expected.

    Furthermore, the Xeon E7 processor, commonly used in 4-socket systems, has the SMB feature for doubling memory capacity. As mentioned earlier, this must incur some penalty in memory latency. In the model above, scaling is:

      1P -> 2P = 1.45X,   2P -> 4P = 1.56X and
      1P -> 4P = 2.26X

    The estimate here is that the SMB has an 11% performance penalty. If the doubling of memory capacity (or other functionality) is not needed, then it might be better to leave off the SMB. There is a 4-way Xeon E5 4600-series, but one processor is 2 hops away, which introduces its own issues.
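    The scaling factors and the 11% estimate follow from the table; a quick sketch of the arithmetic, in which the 67ns/125ns no-SMB case is an assumption based on the earlier E5 values:

        # Scaling implied by the table above (tot tx/sec per socket count)
        tps_core = {1: 26.44, 2: 19.11, 4: 14.91}
        total = {s: s * tps_core[s] for s in tps_core}
        print(total[2]/total[1], total[4]/total[2], total[4]/total[1])   # ~1.45, ~1.56, ~2.26

        # 4-socket without the assumed 15ns SMB penalty: 67ns local / 125ns remote
        avg_ns = (67 + 3 * 125) / 4                # 110.5ns average memory access
        cycles = 9.5e6 + 0.5e6 * avg_ns * 2.2      # ~131M cycles per transaction
        print(14.91 / (2.2e9 / cycles))            # ~0.89, i.e., an ~11% penalty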

    There is a paucity of comparable benchmark results to support meaningful quantitative analysis. In fact, it would seem that the few benchmarks available employ configuration variations with the intent to prevent insight. Below are TPC-E results from Lenovo on the Xeon E5 and E7 v4, at 2 and 4 sockets respectively.

    Processor    Sockets  cores  threads  memory         data storage  tpsE
    E5-2699 v4   2        44     88       512GB (16x32)  3x17 R5       4,938.14
    E7-8890 v4   4        96     192      4TB (64x64)    5x16 R5       9,068.00

    It would seem that scaling from 2P to 4P is outstanding at 1.915X. But there is a 9% increase in cores per socket from 22 to 24. Factoring this in, the scaling is 1.756X, although scaling versus core count should be moderately less than linear. Then there is the difference in memory, from 256GB per socket to 1TB per socket. Impressive, but how much did it actually contribute? Or did it just make up for the SMB extra latency? Note that TPC-E does have an intermediate level of memory locality, to a lesser extent than TPC-C.

    Hyper-Threading

    The details of the current Intel Hyper-Threading implementation are discussed elsewhere. The purpose of Hyper-Threading is to make use of the dead cycles that occur during memory accesses or other long latency operations. Hyper-Threading is an alternative solution to the memory latency problem. Workarounds are good, but there are times when it is necessary to attack the problem directly. Single thread performance is important, perhaps second only to overall system throughput.

    Given that the memory round-trip access time is more than 150 CPU clock cycles, it is surprising that Intel only implements 2-way HT. The generic term is simultaneous multi-threading (SMT). IBM POWER8 is at 8-way, up from 4-way in POWER7. SPARC has been 8-way for a few generations?

    There is a paper on one of the two RISC processors stating that there were technical challenges in SMT at 8-way. So, 4-way HT is reasonable. This could nearly double transaction performance. The effort to increase HT from 2-way to 4-way should not be particularly difficult. Given the already impossibly complex nature of the processor, "difficult should be a walk in the park".

    It might help if there were API directives from the operating system to the processor indicating which code runs well with HT and which does not.

    Frequency

    One other implication of the memory latency effect is that scaling versus frequency is poor. The originator of the memory latency investigation was an incident (Amdahl Revisited) in which a system UEFI/BIOS update reset processors to power-save mode, changing base frequency from 2.7GHz to 135MHz. There was a 3X increase in worker (CPU) time on key SQL statements. A 20X change in frequency for 3X performance.

    In other words, do not worry about the lower frequency of the high core count processors. They work just fine. But check the processor SKUs carefully, certain models do not have Hyper-Threading which is important. It also appears that turbo-boost might be more of a problem than benefit. It might be better to lock processors to the base frequency.

    Hekaton - MVCC

    Contrary to popular sentiment, putting the entire database into memory on a traditional engine having page organization and row-store does not make much of a direct contribution to performance over a far more modest buffer cache size. This is why database vendors have separate engines for memory-optimized operation, Hekaton in the case of Microsoft SQL Server. To achieve an order of magnitude performance gain, it was necessary to completely rethink the architecture of the database engine.

    A database entirely in memory can experience substantial performance improvement when the storage system is seriously inadequate to meet the IOPS needed. When people talk about in-memory being "ten times" faster, what is meant is that if the database engine had been designed around all data residing in memory, it would be built in a very different way than the page and row structure implemented by INGRES in the 1970s.

    Now that the memory-optimized tables feature, aka Hekaton for Microsoft SQL Server, is available, are there still performance requirements that have not been met? Memory-optimized tables and their accompanying natively compiled procedures are capable of unbelievable performance levels. All we have to do is re-architect the database to use memory-optimized tables, and then rewrite the stored procedures for native compilation. This can be done! And it should be done, when practical.

    In many organizations, the original architects have long retired, departed, or gone the way of Dilbert's Wally (1999 Y2K episode). The current staff developers know that if they touch something, it breaks, they own it. So, if there were a way to achieve significant performance gain, with no code changes, just by throwing money at the problem, then there would be interest, and money.

    There is not a TPC-E result for Hekaton. This will not happen without a rule change. The TPC-E requirement is that database size scales with the performance reported. The 4-way Xeon E7 v4 result of 9,068 tpsE corresponds to a database size of 37,362GB. The minimum expectation from Hekaton is 3X, pointing to a 100TB database. A rule change for this should be proposed. Allow a "memory-optimized" option to run with a database smaller than memory capacity of the system.

    About as difficult as MVCC, less upside? Potentially the benefits of NUMA scaling and Hekaton could be combined if Microsoft exposed a mechanism for how key values map to a NUMA node. It would be necessary for a collection of tables with compound primary key to have the same lead column and that the hash use the lead column in determining NUMA node?

    DDR4

    The major DRAM companies are producing DDR4 at the 4Gbit die level. Samsung has an 8Gbit die. Micron has a catalog entry for 2×4Gb die in one package as an 8Gb product. There can be up to 36 packages on a double-sided DIMM, 18 on each side. The multiple chips/packages form a 64-bit word plus 8 bits for ECC, capable of 1-bit error correction and 2-bit error detection. A memory controller might aggregate multiple channels into a larger word and combine the ECC bits to allow for a more sophisticated error correction and detection scheme.

    A 16GB non-ECC DDR4 DIMM sells for $100, or $6.25 per GB. The DIMM is comprised of 32×4Gb die, be it 32 single-die packages or 16 two-die packages. The 16GB ECC UDIMM or RDIMM, consisting of 36 ×4Gb die, is $128, for $8/GB net (data+ECC) or $7.11/GB raw. There is a slight premium for ECC parts, but much less than there was in the past, especially compared with fully buffered DIMMs that had an AMB chip on the module. The 4Gb die plus package sells for less than $3.
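    The per-GB arithmetic behind these figures, as a quick check (16GB data, 18GB raw for the 36-die ECC module):

        print(100 / 16)        # 16GB non-ECC DIMM at $100: $6.25/GB
        print(128 / 16)        # 16GB ECC DIMM at $128: $8.00/GB net (data only)
        print(128 / 18)        # 36 x 4Gb = 18GB raw including ECC: ~$7.11/GB raw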

    Jim Handy, the memory guy, estimates that 4Gbit DRAM needs to be 70mm2 to support a price in this range. The mainstream DRAM is a ruthlessly competitive environment. The 8Gbit DDR4 package with 2×4Gb die allows 32GB ECC DDR4 to sell for $250, no premium over the 16GB part.

    The 64GB ECC RDIMM (72GB raw) is priced around $1,000. This might indicate that it is difficult to put 4×4Gb die in one package, or that the 8Gb die sells for $12 compared to $3 for the 4Gb die. Regardless, it is possible to charge a substantial premium in the big capacity realm.

    One consequence of the price competitiveness in the mainstream DRAM market is that cost cutting is an imperative. Multiplexed row and column address lines originated in 1973, allowing for lower cost packages and modules. Twenty years ago, there was discussion of going back to a full width address, but no one was willing to pull the trigger on this.

    The only concession for performance in mainstream DRAM was increasing bandwidth by employing multiple sequential word accesses, starting with DDR to the present DDR4.

    RLDRAM III

    Reduced latency DRAM appeared in 1999 for applications that needed lower latency than mainstream DRAM, but at lower cost than SRAM. One application is in high-speed network switches. The address lines on RLDRAM are not multiplexed. The entire address is sent in one group. RLDRAM allows low latency access to a particular bank. That bank cannot be accessed again for the normal DRAM period? But the other banks can, so the strategy is to access banks in a round-robin fashion if possible.

    A 2012 paper, lead author Nilandrish Chatterjee (micro12, MICRO-45), has a discussion of RLDRAM. The Chatterjee paper mentions that RLDRAM employs many small arrays, sacrificing density for latency. Bank-turnaround time (tRC) is 10-15ns, compared to 50ns for DDR3. The first version of RLDRAM had 8 banks, while the contemporary DDR (just DDR then) had 2 banks. Both RLDRAM3 and DDR4 currently have 16 banks, but the banks are organized differently?

    Micron currently has a 1.125Gb RLDRAM 3 product in x18 and x36. Presumably the extra bits are for ECC, 4 x18 or 2 x36 forming a 72-bit path to support 64-bit data plus 8-bit ECC. The mainstream DDR4 8Gbit 2-die package from Micron comes in a 78-ball package for x4 and x8 organization, and 96-ball for x16. The RLDRAM comes in a 168-ball package for both x18 and x36. By comparison, GDDR5 8Gb at 32-wide comes in a 170-ball BGA, yet has multiplexed address? The package pin count factors into cost, and also in the die size because each signal needs to be boosted before it can go off chip?

    Digi-Key lists a Micron 576M RLDRAM3 part at $34.62, or $554/GB with ECC, compared with DDR4 at $8 to $14/GB, also with ECC, depending on the module capacity. At this level, RLDRAM is 40-70 times more expensive than DDR4 by capacity. A large part of this is probably because the RLDRAM is quoted as a specialty low volume product at high margins, while DDR4 is quoted on razor-thin margins. The top RLDRAM at 1.125Gb capacity might reflect the size needed for high-speed network switches, or it might have comparable die area to a 4Gb DDR?
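    The price-per-GB comparison works out roughly as follows, assuming the x18 organization means 512Mb of data plus ECC bits in the 576Mb part:

        data_gb = 512 / 8 / 1024            # 576Mb RLDRAM3 part: 512Mb data = 0.0625GB
        rldram_per_gb = 34.62 / data_gb     # ~$554/GB with ECC
        print(rldram_per_gb)
        print(rldram_per_gb / 14, rldram_per_gb / 8)   # ~40x to ~70x ECC DDR4 at $8-14/GB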

    SRAM

    There are different types of SRAM. High-performance SRAM has 6 transistors, 6T. Intel may use 8T Intel Labs at ISSCC 2012 or even 10T for low power? (see real world tech NTV). It would seem that SRAM should be six or eight times less dense than DRAM, depending on the number of transistors in SRAM, and the size of the capacitor in DRAM.

    There is a Micron slide in Micro 48 Keynote III that says SRAM does not scale on manufacturing process as well as DRAM. Instead of 6:1, or 0.67Gbit SRAM at the same die size as 4Gbit DRAM, it might be 40:1, implying 100Mbit in equal area? Another source says 100:1 might be appropriate.

    Eye-balling the Intel Broadwell 10-core (LCC) die, the L3 cache is 50mm2, listed as 25MB. It includes tags and ECC on both data and tags? There could be 240Mb or more in the 25MB L3? Then 1Gb could fit in a 250mm2 die, plus area for the signals going off-die.

    Digi-Key lists Cypress QDR IV 144M (8M×18, 361 pins) in the $235-276 range. This is $15K per GB with ECC. It is reasonable to assume that prices for both RLDRAM and QDR SRAM are much lower when purchased in volume?

    The lowest price for an Intel processor on the Broadwell LCC die of 246mm2 is $213, in a 2011-pin package. This would suggest SRAM at south of $1,800 per GB. While the ultra-high margins in high-end processors are desirable, it is just as important to fill the fab to capacity, so SRAM at 50% margin is justified. We could also estimate SRAM at 40X the price of DRAM, per the Micron assertion of relative density, pointing to $160-320 per GB.
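    One way to arrive at the "south of $1,800 per GB" figure, assuming the L3 density eyeballed above (roughly 1Gbit per 250mm2) and pricing the SRAM silicon at the same dollars per mm2 as the cheapest Broadwell LCC part; both assumptions are illustrative, not vendor data:

        price_per_mm2 = 213 / 246        # cheapest Broadwell LCC processor: $213, 246mm2
        mm2_per_gbit = 250               # ~1Gbit of L3-style SRAM in ~250mm2 (eyeballed)
        print(8 * mm2_per_gbit * price_per_mm2)   # ~$1,730/GB, i.e., south of $1,800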

    Graphics and High-Bandwidth Memory

    Many years ago, graphics processors diverged from mainstream DRAM. Their requirement was for very high bandwidth at a smaller capacity than main memory, plus other features to support the memory access patterns in graphics. GDDR is currently on version 5, at densities up to 8Gbit, with a x32-wide path (170-ball package) versus x4, x8 and x16 for mainstream DDR4. More recently, High Bandwidth Memory (HBM) is promoted by AMD, and Hybrid Memory Cube by Micron.

    High bandwidth memory is not pertinent to databases, but it does provide perspective on when there is a need to take a separate road from mainstream memory. Databases on a page-row type engine do not come close to testing the limits of DDR4 bandwidth. This is true for both transaction processing and DW large table scans. For that matter, neither does column-store, probably because of the CPU cycles spent on decompression.

    Edit
    I may have to take this back. DDR4-2133 bandwidth is 17GB/s per channel, and 68GB/s over 4 channels. (GB is always decimal by default for rates, but normally binary for size.) What is the rate of a table scan with simple aggregation from memory now? It was 200MB/s per core in Core 2 days, 350MB/s in Westmere. Is it 500 or 800MB/s per core now? It is probably more likely to be 500, but let's assume 800MB/s here.

    Then 24 cores (Xeon E7 v4 only, not E5) consume 19.2GB/s (HT does not contribute in table scans). This is still well inside the Xeon E5/7 memory bandwidth. But what if this were read from storage? A table scan from disk is a write to memory, followed by a read. DDR writes to memory at the clock rate, i.e., one-half the MT/s rate, so the write occupies twice its nominal bandwidth, and the realized table scan effectively consumes 3X the scan rate, or 57.6GB/s.
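    The bandwidth arithmetic, assuming the 800MB/s per core scan rate guessed above:

        cores, scan_gb_per_core = 24, 0.8
        scan = cores * scan_gb_per_core     # 19.2 GB/s consumed scanning from memory
        # From storage: the data is first written to memory (at half the transfer
        # rate, so it occupies two reads' worth of bandwidth) and then read back.
        print(scan, 3 * scan)               # 19.2 GB/s and 57.6 GB/s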

    Memory Summary

    To pursue the path of low latency memory, it is necessary to justify the cost and capacity structure of the alternative technologies. It is also necessary that the opportunity be worthwhile enough to justify building one more specialty processor with a different memory controller. And it may be necessary to work with operating system and database engine vendors so that all are aligned in doing what is necessary.

    The Chatterjee et al micro12 paper for MICRO-45 (microarch.org) shows RLDRAM3 improving throughput by 30%, averaged across 27 of the 29 components of the SPEC CPU 2006 suite, integer and fp. MCF shows the greatest gain, at 2.2X. Navigating a B-tree index should show very high gain as well.

    The cost can be justified as follows. An all-in 4-way server has the following processor and memory cost.

    Component   Detail            Unit Cost    Total
    Processor   4 × E7-8890 v4    $7,174 ea.   $28,700
    Memory      4TB, 64 × 64GB    $1,000 ea.   $64,000

    If the above seems excessive, recall that there was a time when some organizations were not afraid to spend $1M on the processor and memory complex, or sometimes just for the 60+ processors (sales of 1000 systems per year?). That was in the hope of having amazing performance. Except that vendors neglected to stress the importance of the NUMA implications. SAN vendors continued to sell multi-million-dollar storage without stressing the importance of dedicated disks for logs.

    If a more expensive low latency memory were to be implemented, the transmission time between the memory controller and the DIMM, estimated earlier to be 10ns, should be revisited. An RLDRAM system might still use the DIMM slot arrangement currently in use, but other options should be considered.

    An SRAM main memory should probably be in an MCM, or some other in-package interconnect (TSMC Hot Chips 28). This is if enough SRAM can be stuffed into a module or package. It would also require that the processor and memory be sold as a single unit, instead of memory being configured later. In the case of SRAM, the nature of the processor L3 probably needs to be re-examined.

    HDD Storage

    Before SSDs, the high-end 15K HDDs were popular with storage performance experts who understood that IOPS was more important than capacity. In a "short-path to bare metal" disk array, the 15K HDD could support 200 IOPS at queue depth 1 per HDD with low latency (5ms). It should be possible to assemble a very large array of 1,000 disks, capable of 200,000 IOPS.

    It is necessary to consider the mean-time-between-failure (MTBF), typically cited as over 1M-hours. There are 8,760 hours in a 365-day year. At 1M-hr MTBF, the individual disk failure rate is 0.876%. An array of 1000 disks is expected to see 9 failures per year. Hard disks in RAID groups will continue to operate with a single or sometimes multiple disk failures. However, rebuilding a RAID group from the failed drive could take several hours, and performance is degraded in this period. It is not operationally practical to run on a very large disk array. The recommendation was to fill the memory slots with big DIMMs, and damn the cost.
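    The failure arithmetic for a 1,000-disk array at the cited 1M-hour MTBF, as a quick sketch:

        mtbf_hours = 1_000_000
        hours_per_year = 8760
        afr = hours_per_year / mtbf_hours   # ~0.876% annual failure rate per disk
        print(afr * 100, afr * 1000)        # ~0.88% per disk, ~8.8 failures/year per 1,000 disks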

    SSD (tech)

    The common convention used to be that a NAND controller with SATA on the upstream side would have 8 channels on the NAND side. A PCI-E controller would have 16 or 32 NAND channels for x4 and x8 respectively. On the downstream side, a NAND channel could have 1 or 2 packages, and there could be up to 8 chips in a package. A NAND chip may be divided into 2 planes, and each plane is functionally an independent entity. An SSD with 8 packages could have 64 NAND chips comprising 128 planes. The random IOPS performance at the plane level is better than a 15K HDD, so even a modest collection of 24 SSDs presents a very large array (3,072) of base units.
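    The 3,072 figure is simply the product of the counts above:

        ssds, packages, chips_per_package, planes_per_chip = 24, 8, 8, 2
        print(ssds * packages * chips_per_package * planes_per_chip)   # 3,072 planes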

    At the component level, having sufficient units for 1M IOPS is not difficult. Achieving 1M IOPS at the system level is more involved. NVMe builds a new stack, software and hardware, for driving the extraordinarily high IOPS possible with a large array of SSDs, while making more efficient use of CPU than SAS. PCI-E NVMe SSDs have been around since 2014, so it is possible to build a direct-attach SSD array with the full NVMe stack. NVMe over Fabrics was recently finalized, so SAN products might not be too far off.

    From the host operating system, it is possible to drive 1M IOPS on the NVMe stack without consuming too much CPU. At the SQL Server level, there are additional steps, such as determining which page to evict from the buffer cache. Microsoft has reworked IO code to support the bandwidth made practical with SSD for DW usage. But given the enormous memory configuration of typical transaction processing systems, there may not have been much call for the ability to do random IOPS with a full buffer cache. But if the need arose, it could probably be done.

    All Flash Array

    When SSDs were still very expensive as components, storage system vendors promoted the idea of a SSD cache and/or a tiering structure of SSD, 10K and 7.2K HDDs. In the last few years, new upstarts are promoting all flash. HDD storage should not go away, but its role is backup and anything not random IO intensive.

    Rethinking System Architecture

    The justification for rethinking system architecture around low latency memory at far higher cost is shown below. The scaling achieved in a 4-socket system is less than exceptional, except for the very few NUMA-architected databases, which probably means just the TPC-C and TPC-E databases. It might be 2.2X better than a single socket.

    [Chart: justification for low latency memory]

    At lower latency, 40ns L3+memory, the single socket system could match the performance of a 2-socket system with DDR4 DRAM. If 25ns were possible, then it could even match up with the 4-socket system. The mission of massive memory made possible in the 4-way to reduce IO is no longer a mandatory requirement. The fact that a single-socket system with RLDRAM or SRAM could match a 4-socket with massive memory allows very wide latitude in cost.

     

     

    RLDRAM may reside inside or outside of the processor package. If outside, thought should be given to how to reduce the transmission delay. SRAM should most probably be placed inside the processor package, so the challenge is how much can be fit. Should there still be an L3? Any latency from processor core to memory must be minimized as much as is warranted by the cost of SRAM.

    Below are the simple memory latency model calculations for a single socket with L3+memory latency of 43 and 25ns. These are the values necessary for the single-socket system to match the 2 and 4-socket systems respectively.

    GHz  L3+mem  remote  skt  avg. mem  mem cycles  fraction  tot cycles  tps/core  tot tx/sec
    2.2  43ns    n/a     1    43ns      95          0.05      56.8M       38.73     -
    2.2  25ns    n/a     1    25ns      55          0.05      37.0M       59.46     -

    In the examples above, Hyper-Threading should still have good scaling to 4-way at 43ns, and some scaling to 4-way at 25ns memory latency.
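    Applying the same simplified model as before to the two single-socket cases, a self-contained sketch with the 10M-instruction, 5% miss assumptions:

        for ghz, mem_ns in [(2.2, 43), (2.2, 25)]:
            mem_cycles = mem_ns * ghz                 # ~95 and ~55 cycles per access
            total = 9.5e6 + 0.5e6 * mem_cycles        # ~56.8M and ~37.0M cycles
            print(mem_cycles, total / 1e6, ghz * 1e9 / total)   # ~38.7 and ~59.5 tps/core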

    The new memory architecture does not mean that DDR4 DRAM becomes obsolete. It is an established and moderately inexpensive technology. There could still be DRAM memory channels. Whether this is a two-class memory system or perhaps DDR memory is accessed like a memory-mapped file can be debated elsewhere. Xeon Phi has off-package DDR4 as memory node 0 and on-package MCDRAM as memory node 1, all to the same processor.

    It is acknowledged that the proposed system architecture is not a new idea. The Cray-1 used SRAM as main memory, and DRAM as storage? For those on a budget, the Cray-1M had MOS memory. Circumstances of the intervening years favored processors with SRAM cache and DRAM as main memory. But the time has come to revisit this thinking.

    While working on this, I came across the slide below in J Pawlowski, Micron, Memory as We Approach a New Horizon. The outline of the Pawlowski paper includes high bandwidth, and persistent memory. Deeper in, RL3 Row Cycle Time (tRC) is 6.67 to 8ns, versus 45-50ns for DDR4.


    [Slide: J Pawlowski, Micron, Memory as We Approach a New Horizon]

    I am guessing that the large number of double-ended arrows between processor and near memory means high bandwidth. And even bandwidth to DIMMs is substantial.

    The devices to the right seem to be storage. Does ASIC mean logic? Instead of just accessing blocks, it would be useful to say: read the pointer at address A, then fetch the memory that A points to.

    Intel Vision

    Below is roughly Intel's vision of next-generation system architecture, featuring 3D XPoint. The new Intel and Micron joint non-volatile technology is promoted as having performance characteristics almost as good as DRAM, higher density than DRAM, and cost somewhere in between DRAM and NAND.

    [Diagram: Intel vision of next-generation system architecture with 3D XPoint]

    The full potential of 3D XPoint cannot be realized as PCI-E attached storage. The idea is then to have 3D XPoint DIMM devices on the memory interface along with DRAM. The argument is that memory configurations in recent years have become ridiculously enormous, and that much of the memory is used to cache tepid or even cool data. In this case, DRAM is overkill. 3D XPoint is almost as good, costs less, consumes less power, is persistent, and will allow even larger capacity.

    In essence, the Intel vision acknowledges the fact that much of main memory is being used for less than hot data. The function of storing not so hot data can be accomplished with 3D XPoint at lower cost. But this also implies that the most critical functions of memory require far less capacity than that of recent generation systems.

    In the system architecture with a small SRAM or RLDRAM main memory, there will be more IO. To a degree, IO at 100µs to NAND is not bad, but the potential for 10µs or less IO to 3D XPoint further validates the concept and is too good to pass up.

    Below is my less fancy representation of the Micron system concept.

    [Diagram: representation of the Micron system concept]

    The Right Core

    The latest Intel mainline Core i3/5/7 processors have incredibly powerful cores, with 8-wide superscalar execution. A desktop version of Kaby Lake, 7th generation, has a 4.2GHz base frequency and 4.5GHz turbo. This means the individual core can run at 4.5GHz, if not significantly higher, but must throttle down to 4.2GHz so that four cores plus graphics and the system agent stay under 91W. A reasonable guess might be that power consumption is 20W per core at 4.2GHz? The Sky Lake top frequencies were 4.0GHz base and 4.2GHz turbo. Broadwell was probably 3.5GHz base and 3.8GHz turbo, but Intel did not deploy that product as widely as normal.

    In transaction processing, this blazing frequency is squandered on memory latency. The server strategy is high core count. At 24 cores, frequency is throttled down to 2.2GHz to stay under 165W. The Broadwell HCC products do allow a turbo mode in which a few cores can run at up to 3.5 or 3.6GHz.

    Every student of processor architecture knows the foundations of Moore's Law. One of the elements is that doubling the silicon real estate at a fixed process should yield a 1.4X increase in performance. (The other element is that a new process with a 0.71X linear shrink yields 50% higher frequency, also providing 1.4X performance.) The Intel mainline core is two times more powerful than the performance that can be utilized in a high core count processor.

    In theory, it should be possible to design a processor core with performance equivalent to the mainline core at 2.2GHz (see SQL on Xeon Phi). In theory, this new light core should be one-quarter the size of the mainline core. (Double the light core complexity for 1.4X performance. Double again for another 1.4X, for a cumulative 2X over the light core baseline.)
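    A quick check of the area versus performance arithmetic, using the 1.4X-per-doubling rule of thumb cited above:

        area, perf = 1.0, 1.0
        for _ in range(2):      # halve the core complexity twice
            area /= 2
            perf /= 1.4
        print(area, perf)       # 0.25x the area, ~0.51x the performance of the mainline core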

    The new light core would be running at maximum design frequency to match the 2.2GHz mainline, whatever frequency that might be. We can pretend it is 2.2GHz if that helps. This new core would have no turbo capability. What is the power consumption of this core? Perhaps one-quarter of the mainline, because it is one-quarter the size? Or more because it is running at a slightly higher voltage? (this is covered elsewhere).

    It might be possible to fit four times as many of the light cores on the same die size, assuming cache sizes are reduced. But maybe only 3 times as many cores can be supported to stay within power limits? This would be a much more powerful multi-core processor for multi-threaded server workloads. How valuable is the turbo capability?

    The turbo boost has value because not everything can be made heavily multi-threaded. Single or low-threaded code might not be pointer chasing code, and would then be fully capable of benefitting from the full power of the Intel mainline core. A major reason that Intel is in such a strong position is that they have had the most powerful core for several years running, and much of the time prior to the interlude period. (AMD does have a new core coming out and some people think highly of it.)

    Edit/Note:
    Intel has two versions of each manufacturing process, one for high performance, and another for low power. Could the mainline core be built on the lower power process? In principle this should reduce power to a greater degree than scaling voltage down. Would it also make the core more compact?

    Knights Landing

    We could speculate on theory, applying the general principles of Moore's Law. But there is a real product along these lines, just targeted towards a different function. The Xeon Phi 200, aka Knights Landing has 72 Atom cores, albeit at 245W (260 with fabric). The current Phi is based on the Airmont Atom core. (the latest Atom is actually Goldmont).

    The recent Atom cores have a 14-stage pipeline versus 14-19 for Core? Airmont is 3-wide superscalar with out-of-order execution, but does not have a µop cache? The true top frequency for Airmont is unclear; some products based on Airmont are listed at 2.6GHz in turbo.

    On Xeon Phi, the frequency is 1.5GHz base, 1.7GHz turbo. It might be that the low core count processors will always be able to operate at a higher voltage for maximum frequency while high core count products set a lower voltage resulting lower frequency, regardless of intent.

    Below is a diagram of Knights Landing, or Xeon Phi 200 from Intel's Hot Chips 27 (2015). The processor and 8 MCDRAM devices are on a single multi-chip-module (package) SVLC LGA 3647.

    [Diagram: Knights Landing (Xeon Phi 200) package, Intel Hot Chips 27]

    The MCDRAM is a version of Hybrid Memory Cube? (Intel must have their own private acronyms.) Each device is a stack of die totaling 2GB, for 16GB in all, with over 400GB/s bandwidth. There are also 3 memory controllers driving a total of 6 DDR4 memory channels for another 90GB/s of bandwidth (at 2133 MT/s). Only 1 DIMM per channel is supported; presumably the target applications are fine with 384GB but want extreme bandwidth.

    The Xeon Phi is designed for HPC. As is, it might be able to deliver impressive transaction processing performance, though perhaps not without tuning at many levels. The question is, how would Knights Landing perform on transaction processing had the memory been designed for latency instead of bandwidth?

    I suppose this could be tested simply by comparing an Airmont Atom against a Broadwell or Skylake? The theory is that the memory round-trip latency dominates, so the 8-wide superscalar of Haswell and later has little benefit. Even if there is some code that can use the wide superscalar, the benefit is drowned out by code that waits for memory accesses.

    Even in transaction processing databases, not everything is transactions, i.e., amenable to wide parallelism. Some code, with important functions, does benefit from the full capability of the mainline Intel core. Perhaps the long-term solution is asymmetric multi-core: two or four high-end cores, and very many mini-cores.

    SSE/AVX Vector Unit (SIMD)

    The vector (SSE/AVX) unit is a large portion of the core area. It is not used in transaction processing, but it is used in the more recent column-store engine. Microsoft once evaluated the use of the Intel SSE registers, but did not find a compelling case. It might have been on the assumption of the existing SSE instructions?

    Perhaps what is needed is to redesign the page structure so that the vector registers can be used effectively. The SQL Server 8KB page, 8,192 bytes, has a page header of 96 bytes, leaving 8,096 bytes. Row offsets (a slot array of 2-byte values) are filled in from the end of the page. See Paul Randal, SQL Skills, Anatomy of a Record.
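    As a rough illustration of the current layout, here is a minimal sketch of locating a row through the slot array, assuming only what is described above (2-byte offsets filled in from the end of the page); the helper name row_offset and the exact encoding are hypothetical, not the documented on-disk format:

        import struct

        PAGE_SIZE, HEADER_SIZE = 8192, 96     # 8KB page, 96-byte page header

        def row_offset(page: bytes, slot: int) -> int:
            # The slot array grows backward from the end of the page, 2 bytes per
            # slot; each entry holds the byte offset of that row within the page.
            pos = PAGE_SIZE - 2 * (slot + 1)
            return struct.unpack_from("<H", page, pos)[0]

        # Reading the page header, then a slot entry near the end of the page,
        # then the row itself touches locations spread across the 8KB page.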

    Within a page, each row has a header (16 bytes?) with several values. The goal of redesigning the page architecture is so that the slot and header arrays can be loaded into the vector registers in an efficient manner. This might mean moving the slot array and other headers up front. SQL Server would continue to recognize the old page structure; on index rebuild, the new page structure would be employed.

    The necessary instructions to do row-column byte offset calculations directly from the vector register would have to be devised. This needs to be worked out between Intel and various database vendors. Perhaps the load into the vector registers bypasses L1 and/or L2? It would be in L3 for cache coherency?

    The current Xeon E5/7 processors, with the latest on the Broadwell core, have 16 vector registers of 256-bits (32 bytes) totaling 512 bytes. The Skylake has 32 registers of 512-bits, 2KB of registers. This is too much to waste. If they cannot be used, then the processor with special memory controller should discard the vector unit.

    Recap

    The main purpose of this article was to argue for a new system architecture, having low latency memory, implying a processor architecture change, with a major focus on the feasibility for transaction processing databases, and mostly as pertinent to Intel processors. However, all avenues for significant transaction performance improvement are considered.

    • Hyper-Threading to 4-way
    • Memory-optimized tables with natively compiled procedures
    • Database architected for NUMA
    • Multi-core - the right size core?
    • RLDRAM/SRAM
    • 3D XPoint
    • SSE/AVX Vector Instructions

    Increasing HT from 2-way to 4-way has the potential to nearly double transaction processing performance. Other options have greater upside, but this is a drop-in option.

    Memory-optimized tables and natively compiled procedures combined has the greatest upside potential.

    People do not want to hear that the database should be re-architected for NUMA scaling. If it runs fine on single-socket or Hekaton, then fine. But Intel mentions that scaling to the Phi core count levels requires NUMA architecture even on one socket.

    A higher core count using a smaller core would give the best transaction throughput, but an asymmetric model might be more practical.

    The capabilities of the modern processor core are being squandered on long latency to memory capacity that is not needed. Figure out what the right low latency memory is.

    There is definitely potential for 3D XPoint. But it goes beyond displacing some DDR DRAM and NAND. The true potential is to enable a smaller, lower latency true main memory, and then have DDR DRAM and 3D XPoint as something in between memory and IO.

    SSE/AVX: use it or lose it.

     

    Summary

    The growing gap between processor clock cycle time to memory latency is not a new topic. There have been many other papers on the advantage of various memory technologies with lower latency. Most of these originate either from universities or semiconductor companies. Everyone acknowledged that cost relative to mainstream DRAM was a serious obstacle. A number of strategies were conceived to narrow the gap.

    Here, the topic is approached from the point of view of database transaction processing. TP is one of several database applications. Database is one of many computer applications. However, transaction processing is a significant portion of the market for high-end processors, and systems with maximum memory configuration.

    The database community regularly spends $100K on a processor-memory complex for performance levels that could be achieved with a single socket, if it were matched with the right memory. There is valid justification from the database side to pursue the memory strategy. There is justification to the processor-memory vendor that this one market has the dollar volume to make this effort worthwhile. And the extremely large memory capacity requirement is shown to now be a red herring.

    In all, there are several worthwhile actions for the next generation of server system architecture. Probably none are mutually exclusive. There are trade-offs between impact, cost, and who does the heavy lifting. No single factor wins in all cases, so a multi-pronged attack is the more correct approach.

     

    Addendum

    I will provide a discussion of the figure below later.

    The diagram below might be helpful in the discussion. The elapsed time for a transaction is the weighted sum of the operations that incur a wait at each level. Single thread performance is the inverse of elapsed time. For throughput, memory access latency can be partially hidden with Hyper-Threading. IO latency is hidden by asynchronous IO, but there is an overhead to that too.

    [Diagram: transaction elapsed time as a weighted sum of waits at each level]

    Suppose D is the latency for DDR memory, and S is the latency of some low latency memory, both inclusive of transmission time and possibly L3. There is no read IO in the DDR system. Suppose x is the fraction of memory accesses that fall outside of the fast memory, and I is the latency of those accesses. The term I might represent accesses to DDR or 3D XPoint on the memory interface via a memory access protocol, so it is not really IO. Or it could be to 3D XPoint or NAND attached to PCI-E via an IO protocol.

    The criterion for the small fast memory being an advantage in elapsed time, exclusive of operations that occur inside L3, is as follows.

    (1-x)×S + x×I < D

    Then

    x < (D-S)/(I-S)

    The objective is to achieve a large gain in performance via elapsed time. Only so much can be gained in the numerator D-S, so much depends on the latency I. If the secondary device were DDR or 3D XPoint on the memory interface, then a very high fraction of accesses (x) could be allowed while still achieving a good performance gain. If it were on PCI-E, then 3D XPoint might have a strong advantage over NAND.
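    A worked example of the break-even criterion, using latencies discussed earlier as illustrative inputs (D = 67ns DDR, S = 25ns fast memory, I = 125ns for a secondary tier on the memory interface or roughly 10µs for PCI-E attached 3D XPoint); the values and the function name are assumptions for the sketch:

        def max_outside_fraction(D, S, I):
            # break-even fraction x of accesses allowed outside the fast memory
            return (D - S) / (I - S)

        print(max_outside_fraction(67, 25, 125))      # ~0.42: generous on the memory interface
        print(max_outside_fraction(67, 25, 10_000))   # ~0.004: much tighter for ~10us PCI-E IO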

    Atom

    In the discussion on Knights Landing, I suggested that the Atom core might not be bad for transaction processing. The cheapest Xeon Phi is the 7210 at $2438. About $4700 in a system. What is the difference between Atom C2750 and 2758? Both are Silvermont 8-cores, no HT. Use ECC SODIMM.

    Atom has changed since its original inception, which did without out-of-order execution for simplicity and power efficiency. Silvermont added OOO. Not sure about Goldmont. Is Atom to become a slimmed-down Core, with 3-wide superscalar, manufactured on the SoC version of the process?

    Appendix

    I will try to sort out the material and redistribute over several articles as appropriate.

  • Intel Processor Architecture 2020

    There is an article on WCCFtech saying that a new Intel processor architecture to succeed the Lake processors (Sky, Cannon, Ice and Tiger) will be "faster and leaner" and, more interestingly, might not be entirely compatible with older software. The original source is bitsandchips.it. I suppose it is curious that the Lake processors form a double tick-tock, or now process-architecture-optimization (PAO), but skipped Kaby and Cannon. Both the bridge (Sandy and Ivy) and well (Has and Broad) processors each had only one tick-tock pair.

    Naturally, I cannot resist commenting on this. About time!

    For perspective, in the really old days, processor architecture and instruction set architecture (ISA) were somewhat the same thing. The processor implemented the instruction set, so that was the architecture. I am excluding the virtual-architecture concept, where lower cost versions would not implement the complete instruction set in hardware.

    The Intel Pentium Pro was a significant step away from this, with micro-architecture and instruction set architecture now largely different topics. Pentium Pro has its own internal instructions, called micro-operations. The processor dynamically decodes X86 instructions to the "native" micro-operations. This was one of the main concepts that allow Intel to borrow many of the important technologies from RISC.

    The Pentium 4 processor, codename Willamette, had a Trace cache, that was a cache for decoded instructions. This may not have been in the Core 2 architecture that followed Pentium 4.

    My recollection is that Pentium Pro had 36 physical registers of which only 8 are visible to the X86 ISA. The processor would rename the ISA registers as necessary to support out-of-order execution. Pentium 4 increased this to 128 registers.

    Also see MIT 6.838 and NJIT rlopes

    The Nehalem micro-architecture diagrams do not mention a µop cache, (somehow the acronym is DSB) but Sandy Bridge and subsequent processors do. This is curious because both Willamette and Nehalem are Oregon designs, while Core 2 and Sandy Bridge are Haifa designs.

    The other stream that comes into this topic involves the Intel Itanium adventure. The original plan for Itanium was to have a hardware (silicon) X86 unit. Naturally, this would not be comparable to the then-contemporary X86 processors, which for Merced would have been the Pentium III, codename Coppermine, at 900MHz. So by implication, X86 execution would probably be comparable to something several years old, a Pentium II 266MHz with luck, and Itanium was not lucky.

    By the time of Itanium 2, the sophistication of software CPU emulation was sufficiently advanced that the hardware X86 unit was discarded. In its place was IA-32 Execution Layer. Also see the IEEE Micro paper on this topic. My recollection was the Execution Layer emulation was not great but not bad either.

    The two relevant technologies are: one, the processor having native µops instead of the visible X86 instructions, and two, the Execution Layer for non-native code. With this, why is the compiler still generating X86 (OK, Intel wants to call these IA-32 and Intel 64 instructions) binaries?

    Why not make the native processor µops visible to the compiler? When the processor detects a binary with native micro-instructions, it can bypass the decoder. Also make the full set of physical registers visible to the compiler. If Hyper-Threading is enabled, then the compiler should know to use only the appropriate fraction of the registers.

    I am inclined to also say that the more the compiler knows about the underlying hardware, the better it can generate binaries to fully utilize available resources, with less reliance on the processor doing dynamic scheduling for parallelism. But of course, that was what Itanium was, and we would need to understand why Itanium did not succeed. My opinion was that EPIC was really better suited to scientific computing and not logic heavy server applications.

    Have one or two generations of overlap, for Microsoft and the Linux players to produce a native micro-op operating system. Then ditch the hardware decoders for X86. Any old code would then run on the Execution Layer, which may not be 100% compatible. But we need a clean break from old baggage or it will sink us.

    Off topic, but who thinks legacy baggage is sinking the Windows operating system?

    Addendum

    Of course, I still think that one major issue is that Intel is stretching their mainline processor core over too broad a spectrum. The Core is used in both high-performance and high-efficiency modes. For high performance, it is capable of well over 4GHz, probably limited more by power than by transistor switching speed. For power efficiency, the core is throttled to 2 or even 1 GHz.

    If Intel wants to do this in a mobile processor, it is probably not that big a deal. However, in the big server chips, with 24 cores in Xeon v4 and possibly 32 cores in the next generation (v5), it becomes a significant matter.

    The theory is that if a given core is designed to operate at a certain level, then doubling the logic should achieve a 40% increase in performance. So if Intel is deliberately de-rating the core in the Xeon HCC die, then they could build a different core targeted specifically at one half the original performance in perhaps one quarter the complexity.

    So it should be possible to have 100 cores, each with half the performance of the 4GHz-capable Broadwell core, i.e., equivalent to Broadwell at 2GHz? If this supposed core were very power efficient, then perhaps the thermal envelope could even support 100 mini-cores?

    Of course, not every application is suitable for wide parallelism. I would like to see Intel do a processor with mixed cores. Perhaps 2 or 4 high performance cores and 80 or so mini-cores?

    A really neat trick would be if the GPU were programmable, but graphics vendors have things along this line?

  • Memory Latency and NUMA

    It should be intuitively obvious that round-trip memory access latency is one of the most important factors in modern server system architecture for transaction processing databases. Yet this is a topic that no one talks about. Vendors do not want to discuss this because no near or long-term actions are planned. Outsiders cannot write a meaningful article because too much important information is missing. In some respects, there is nothing practical we can do about latency in terms of memory components that we elect to employ. However, we can influence latency because Intel processors since Nehalem in 2009/10 have integrated memory controllers. Hence all multi-socket systems since then have non-uniform memory access (NUMA) and this is one mechanism that determines latency.

    We will start by looking at modern Intel processor architecture and the time scales involved. Then continue by examining system architecture with respect to memory latency. From there, we can do basic calculations on how we expect memory latency to impact transaction processing performance. This basic model still needs to be backed up by real experimental measurements, but there is enough to provide the basis for further investigations. Pending such, one of the important conclusions is that it is time to re-examine fundamental assumptions on server sizing.

    Broadwell, Xeon E3 v4

    The diagram below is a representation of a modern Intel mainline (desktop and mobile) processor. The proportions are from Broadwell, the 5th generation Intel Core i3, 5 and 7 and Xeon E3 v4, as Sky Lake has a different layout arrangement. The four cores, system agent, and graphics communicate over the interconnect ring. The memory controller and IO are in the system agent, but the signals that go off-chip are in the interface shown as a separate unit.

    [Diagram: representation of a Broadwell desktop/mobile processor]

    Never mind the above; I don't think that is how the Broadwell 4c die is laid out. See below left. Sky Lake is below right.

    [Die layouts: Broadwell 4c (left), Sky Lake (right)]

    Intel and system vendors like to talk about specifications such as processor frequency and memory transfer rate. Recent generation Intel processor cores are capable of operating in the high 3 to low 4GHz range. It is quite possible that they could run even higher, but are power constrained.

    DDR4 Interlude

    The memory interface is DDR4 at 1866 and 2133MT/s. The term used is mega-transfers per second instead of MHz. This is because the memory clock is cited in MHz, and the data transfer rate is eight times the memory clock rate. Address and write commands run at one-half the data rate. Using MT/s for the data transfer rate is much clearer.
    (edit 2016-12-26)
    In DDR4, there is a memory clock, 266.67MHz for example, an I/O bus clock that is 4 times higher, 1066MHz, and the data transfer rate at double the I/O clock, 2133MT/s.
    DDR memory timings are specified at the memory interface and consist of 4 values: CL, RCD, RP and RAS, the last of which is frequently not cited, and sometimes only the CAS Latency is cited. The value is in terms of the I/O clock cycles.

    In database transaction processing, the memory access pattern is largely unpredictable, amounting to random memory row accesses, so the latency is RP + RCD + CL, for Row Precharge, Row to Column Delay, and CAS Latency. For Registered DDR4-2133, all three values are 15? (Does this include the extra cycle for registered memory?) The 2133 MT/s corresponds to a 1066MHz I/O clock, so each cycle is 0.938ns, and 15 cycles is 14ns. The total memory latency at the memory interface is then 42ns?

    Cache and Memory Latency

    There are applications in which processor frequency and memory bandwidth matter a great deal. But neither is particularly important for transaction processing databases. The diagram below calls out some details of the processor core, the various levels of cache, and memory that are more relevant to databases.

    [Diagram: core, cache levels, and memory with latencies]

    At 3GHz, the processor core cycle time is 0.33ns, the inverse of the frequency. The L1 cache is cited as having 4 cycle latency. The L1 is part of the processor core execution pipeline, so to some degree, L1 latency is hidden. L2 cache latency is cited as 12 cycles. It is not certain as to whether this is fixed in cycles, or is actually something like 4ns.

    L3 cache probably depends on the number of cores and other circumstances. It is shown here as 40+ cycles. If L3 latency is actually time based, 15ns for example, then number of cycles would depend on the core clock rate. I am not sure if the L3 latency incorporates L2 latency. Memory latency is probably somewhat over 50ns, plus the L3 latency. How much of this is at the DRAM chip versus the transmission delay between the processor to DRAM and back?

    Igor Pavlov provides both the 7z benchmark and results for many processors at 7-cpu, source code included. 7-cpu lists Haswell L3 at 36 cycles, and memory at L3 + 57ns. Sky Lake L3 is 42 cycles and memory at L3 + 51ns.
    (edit 2016-12-26)
    (This seems to imply that the transmission delay from processor to memory and back is about 10ns?)
    Intel has their own Memory Latency Checker utility. One of the Intel pages shows an example with local node memory latency as 67ns, and remote node at 125ns, which will be used in the examples below.

    Broadwell EP and EX

    The above diagrams were for the desktop and Xeon E3 processors. We are really more interested in the Intel EP/EX processors used in the Xeon E5 and E7 product lines. The latest Xeon E5 and E7 processors are v4, based on the Broadwell core. There are 3 layouts, HCC, MCC and LCC with 24, 15 and 10 cores respectively. Intel provides functional layouts for all three, but the actual die layout is provided only for the LCC model.

    The Broadwell HCC representation is shown below. There are 6 rows and 4 columns of cores, 24 cores total. The 24-core model is only available in the E7. The top E5 has only 22 cores enabled. Two columns of cores communicate over the interconnect dual ring (counter-rotating). The two double rings are connected by a switch. The Intel functional diagram actually shows both QPI and PCI-E on the left side.

    [Diagram: Broadwell HCC die representation]

    Below are my representations of the MCC model on the left with 3 columns of 5 rows for 15 cores and the LCC model on the right with 2 columns of 5 rows for 10 cores. In the LCC model, there is no ring switch. The PCI-E and memory actually do overhang the core, meaning that the space to the right of the cores is blank? As even the LCC is a high margin product, an exact and efficient fit is not necessary?

    [Die representations: Broadwell MCC (left), LCC (right)]

    The MCC arrangement of both QPI and PCI-E on the left side of the switch connected to the left ring and a memory controller on each side of the ring switches matches the Intel functional layout, but I do not know if there is overhang. Regardless of the actual placement of the controllers, the interface signals for QPI and PCI-E probably does run along the length of the upper edge, and the interface for the memory signals probably runs along most of the lower edge.

    I am inclined to believe that L3 latency is higher in the E5/E7 processors as the path is longer and more complicated. On the LCC die, there are 10 cores, the memory, QPI and PCI-E controllers on one ring. However, if the desktop and E3 processors have only one ring (single direction) then it is possible that the bidirectional ring in the E5/E7 processors help keep L3 latency low? Presumably latencies on the MCC and HCC die are longer than on the LCC because both rings must be checked?

    Edit 2017-01-4
    Search: Haswell Cluster on Die (COD) Mode, but filter out Call of Duty. An Intel slide deck on this suggests that memory latency is higher when crossing the coherency bus switches.

    Xeon E5 v4

    Below is a representation of a 2-socket Xeon E5 system with the HCC die. Two of the 24 cores are marked out, as the E5 has a maximum of 22 cores. The E5 has 2 full QPI links, and both are used to connect to the other processor. For a core in the left socket, the memory attached to that socket is local memory, and memory attached to the other socket is remote memory.

    [Diagram: 2-socket Xeon E5 v4 system]

    It should not be difficult to appreciate that there is a large difference in memory access time between local and remote memory nodes. The Intel Memory Latency Checker example has 67ns for local and 125ns for remote memory, which will be the values we use in a following example. I am not certain if these values are for unbuffered or registered memory. Unbuffered memory should have lower latency, but registered memory is available in larger capacities, 64GB versus 16GB.

    Xeon E7 v4

    Below is a representation of the 4-socket Xeon E7 system. The E7 has 3 full QPI links, one connecting directly to each of the three other processors. All remote processors are then said to be 1-hop away. The difference of significance here between the E5 and E7 systems is that the E5 memory channel connects directly to memory. The E7 connects to a scalable memory buffer (SMB, other names have been used too), that splits into two memory channels on the downstream side. Because there are so few acronyms in effect, the interface from processor to SMB is SMI. The SMB doubles the number of DIMM sites, and in effect, doubles the memory capacity per socket.

    [diagram: 4-socket Xeon E7 v4 system with SMB]

    This difference in the memory arrangement between processors designed for 2 and 4-socket systems has been a recurring pattern in Intel system architecture for a long time, though it was not present in all systems, and there were variations. In the current generation, there is a four-socket version of the E5, which does not have the SMB in the path to memory, but each processor only has two QPI links, so one of the remote sockets is two-hops away.

    Long ago, maximum memory capacity was very valuable for database servers in reducing IO from impossible to manageable levels. The extra latency incurred from the SMB chip was worth the price. Since then, memory configurations have increased to stupendous levels, and data read IO has been reduced to negligible levels. NAND flash has also become very economical, allowing storage to deliver formerly impossibly high IOPS. Of course, this occurred after it was no longer absolutely essential.

    In more recent years, with maximum memory configuration sometimes more than double what is already extravagant, we might not want to incur the extra latency of the SMB?

    Highly Simplified Example

    Before proceeding, I will take the opportunity to say that the modern microprocessor is an enormously complex entity in so many ways that it defies characterization in anything less than an exceedingly complicated model. The processor has multiple cores. Each core has pipelined superscalar and out-of-order execution. Each core has dedicated L1 I+D cache and dedicated L2 unified cache. And there is decoded micro-op cache as well. The processor has L3 cache shared among the cores.

    That said, I will now provide an example based on an extremely simplified model of the core-processor memory complex.

    Suppose we have a fictitious 3GHz processor, cycle time 0.33ns, with a very fast cache, that executes one instruction per cycle when everything is in cache. Suppose further that memory latency is 67ns, and that long-latency L3 cache effects are not considered.

    Suppose we have a similarly fictitious transaction of 10M instructions. If everything is in cache, the transaction completes in 3.33 million nanoseconds, or 3.33 milliseconds, and our performance is 300 transactions per second per core.

    Now suppose that 5% (1/20) of instructions require a round-trip memory access before proceeding to the next step. The 0.95 fraction of instructions that have a cache hit consume 9.5M cycles. The 0.05 fraction of 10M is 0.5M instructions that miss cache. Each of these requires a round-trip memory access of 67ns or 201 cycles, for 100.5M cycles. The total time to complete the transaction is now 110M cycles. Performance is now 27.27 transactions per sec per core instead of 300 tps.
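    The same arithmetic, used for every row of the tables that follow, can be written as a general formula (my notation, a sketch of the simplified model): f is the clock frequency, N the instructions per transaction, m the fraction of instructions that miss cache, and t_mem the average round-trip memory latency.

    \[
    \text{tps per core} \;=\; \frac{f}{N\,[\,(1-m) + m \cdot f \cdot t_{mem}\,]}
    \qquad \text{e.g.} \quad \frac{3\times10^{9}}{10^{7}\,[\,0.95 + 0.05\times 201\,]} \approx 27.27
    \]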

    GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
    3.0 | 67ns   | n/a    | 1   | 67ns     | 201       | 0.05     | 110.0M     | 27.27    | -

    Now suppose that we have a 2-socket system, meaning two memory nodes, and that we have not specially architected our database in a manner to achieve higher memory locality than expected from random access patterns. Any memory access is equally likely to be in either node. The local node memory latency continues to be 67ns and remote node memory access is 125ns. Average memory access is now (67+125)/2 = 96ns, or 288 cycles.

    GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
    3.0 | 67ns   | 125ns  | 2   | 96ns     | 288       | 0.05     | 153.5M     | 19.54    | 39.09

    Without a database architected to achieve memory locality, we have lost 28% performance per core (1 - 19.54/27.27). Of course, we did double the number of cores, so throughput has increased by 43% (2*19.54/27.27). Alternatively, the performance per core in the single socket system is 39.5% better than in the 2-socket system (27.27/19.54). This magnitude is important. Did your vendors and infrastructure experts forget to mention this?

    Now suppose we are on a 4-socket Xeon E7-type system with the same database, so memory access is equally probable to any of the four memory nodes. Local memory access is 25%, and 75% is remote to one of the three other sockets. All sockets are directly connected, so all remote nodes are one-hop away. Now, recall that Xeon E7 has a memory buffer in the path between the processor (memory controller) and memory.

    Let's suppose that the SMB adds 15ns additional latency. (I do not know what the number really is. It is not free. The magic of doubling memory capacity comes at a price.) Local node memory access is now 82ns and remote node is 140ns. Average memory access is (82 + 3*140)/4 = 125.5ns, or 377 cycles.

    GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
    3.0 | 82ns   | 140ns  | 4   | 125.5ns  | 377       | 0.05     | 197.75M    | 15.17    | 60.68

    We have now just lost another 22% performance per core going from a 2-socket E5 to the 4-socket E7 type systems (1 - 15.17/19.54). Total throughput is 55% better than the two-socket (2*15.17/19.54). Performance per core is 28.8% better on the 2-socket than on the 4-socket.

    Performance per core on the 1-socket is 80% better than on the 4-socket (27.27/15.17). The 4-socket has 2.22X better throughput than the 1-socket (4*15.17/27.27).

    Scaling - NUMA

    Below are all three of the above cases in a single table.

    GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
    3.0 | 67ns   | n/a    | 1   | 67ns     | 201       | 0.05     | 110.0M     | 27.27    | -
    3.0 | 67ns   | 125ns  | 2   | 96ns     | 288       | 0.05     | 153.5M     | 19.54    | 39.09
    3.0 | 82ns   | 140ns  | 4   | 125.5ns  | 377       | 0.05     | 197.75M    | 15.17    | 60.68

    Scaling - Frequency

    We can also do the same calculations based on a similarly fictitious 2GHz processor.

    GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
    2.0 | 67ns   | n/a    | 1   | 67ns     | 134       | 0.05     | 76.5M      | 26.14    | -
    2.0 | 67ns   | 125ns  | 2   | 96ns     | 192       | 0.05     | 105.5M     | 18.96    | 37.91
    2.0 | 82ns   | 140ns  | 4   | 125.5ns  | 251       | 0.05     | 135.0M     | 14.81    | 59.26

    Notice that we did not lose much performance in stepping down from 3 to 2GHz. We could step down further to 1GHz and still be at 23.26, 34.78 and 55.36 total tps for 1, 2, and 4 sockets respectively. It is important to stress that this is based on the assumption of a transaction processing workload having the characteristic of serialized memory accesses.

    Disclaimer

    All of the above is based on a highly simplified model. Real and rigorous testing should be done before drawing final conclusions. Regardless, there is no way anyone can claim that the difference between local and remote node memory access latency is not important, unless the database has been architected to achieve a high degree of memory locality.

    Front-Side Bus Systems

    In the days before Intel integrated the memory controller, four processors connected to a memory controller in a system with uniform memory access. The Pentium Pro arrangement of 4P on one bus is represented on the left, although the diagram is actually closer to the 450NX or later.

    [diagrams: 4-processor front-side bus systems, shared bus (left) and Xeon 7300 with dedicated buses to the 7300 MCH (right)]

    The system on the right represents the 4-way Xeon 7300, each quad-core processor on its own bus connected to the 7300 MCH. Intel had already committed to the Pentium 4 quad-pumped bus in 2000/01. Vendors were expecting a long stable infrastructure, so Intel delayed the switch-over from bus to point-to-point until 2009.

    Pre-Nehalem NUMA Systems

    A system with 16 or more processors could be built with a custom memory + node controller connecting four processors, memory and also a crossbar. The crossbar in turn connects multiple node controllers to form a system having non-uniform memory access (NUMA).

    [diagram: pre-Nehalem NUMA system with node controllers and crossbar]

    In the old NUMA systems, some SQL Server operations ran fine, and other operations had horrible characteristics, far worse than would be suggested by the remote-to-local node memory latency ratio. So, it is possible that there are other NUMA effects with far greater negative impact. Some may have since been resolved, while others may still be present but not as pronounced in modern NUMA systems.

    Hyper-Threading (HT)

    An important fact to notice is that a very high fraction of CPU cycles are no-ops, where the processor core does nothing while waiting for a round-trip memory access. This is why Hyper-Threading is highly effective. While one logical processor is waiting for a memory access to complete, the other thread can run. Scaling on the logical processors can be nearly linear for a transaction processing workload.

    Note, the first generation of Intel Hyper-Threading was on the Intel Pentium 4 (NetBurst) processors. In that generation, the implementation ran two threads simultaneously, as in both issuing in the same clock cycle, trying to fill the superscalar execution units. The first generation of HT was problematic. It could have been that trying to execute two threads simultaneously was too aggressive, or it could have been simply that the Windows operating system and SQL Server engine at the time did not know how to properly use HT. The next generation of Intel processor architecture, Core 2, did not have HT.

    Then in Nehalem, HT returned, except that this time, it was a time slice implementation. Only one thread executes on any given cycle. When the executing thread encounters a memory or other long latency operation, the processor core switches to the other thread. If anyone has doubts on HT based on experience or hearsay from the Pentium 4 generation, forget it. The Nehalem and later HT is highly effective for transaction processing workloads. There used to be several SKUs with HT disabled in the Xeon E5/7 v1-3 generations. Pay close attention and pass on the no-HT SKUs.

    The question to ask is why Intel does not increase the degree of HT? The generic term is Simultaneous Multi-Threading (SMT). Both IBM POWER and Oracle SPARC processors are or have been at 8-way SMT. Granted, one of the two mentioned that scaling to 8-way SMT was tricky. It is high time for Intel to increase HT to 4-way.

    Database Architecture

    In the above examples, the simple model suggests that scaling to multiple sockets is poor on the assumption of a transaction processing database without a means of achieving memory locality. (There is supposed to be an HPE whitepaper demonstrating the importance of the SQL NUMA tuning techniques in a properly designed database.) Just what does a database architected for NUMA mean? Naturally, this will have to be expounded in a separate article.

    But for now, take a close look at both the TPC-C and TPC-E databases. The TPC-C database has all tables leading with a common key, Warehouse Id, that provides a natural organizational structure. The TPC-E database has 5 transaction tables with a common key value, but does not use the identity property. Instead, it uses a function that must read, then update, a table to determine the next value.

    The Case for Single Socket

    Naturally, the database and application should be architected together with the SQL Server NUMA tuning options to support good scaling on multi-socket NUMA systems. If we neglected this in the original design, I am sure many DBA-developers know how well such a suggestion would be received by management.

    Is there another option? Well yes. Get rid of the NUMA latency issue with a non-NUMA system. Such a system has a single processor socket, hence one memory node. Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.

    Still, a single quad-core processor today is 40 times more powerful than a 4-way system from twenty years ago (100,000 tpm-C per core is probably possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996). The Xeon E3 could probably support many medium sized organizations. Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each). My recollection is that many IO problems went away at the 32-64GB level. And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4's.

    But I am really talking about a single-socket Xeon E5. In the v4 generation, we could have up to 22 cores, though we should start by looking at the 10-core E5-2630 v4 at $667, stepping up to the 16-core 2683 v4 at $1846 before going to the 20-22 core models at $3226 and $4938.

     

     

    The Xeon E5 has 40 PCI-E gen 3 lanes. It might be convenient if there were a motherboard with 1 PCI-E x8 and 8 x4, because NVMe PCI-E SSDs are more common and economical with the x4 interface. Supermicro does have a UP Xeon E5 motherboard (X10SRL-F) with 4 x8, 2x4 gen3 plus 1 x4 gen2. It only has 8 DIMM sites out of 12 possible with the E5, but that is probably good enough.

    Summary

    A strong explanation was provided showing why round-trip memory latency is very important in transaction processing. One implication of this is that scaling to multiple sockets is poor due to the NUMA effect. A remedy is to architect the database and application together, working with the SQL Server NUMA tuning options to achieve locality. Alternatively, give serious consideration to a single-socket, yet still very powerful system. A second implication is that processor frequency is less important for transaction processing, though it might be important for other aspects. The memory latency effect also supports the argument that Hyper-Threading is highly effective and that Intel really needs to increase the degree of HT.

    Addendum

    OK, I didn't show why database transaction processing incurs the round-trip memory latency. It has to do with the b-tree index, in which we read through a page to find the right pointer to the next level. We access the memory for that pointer, then read through to find the next pointer. I will try to do a diagram of this later. But if someone can dig through an open source database engine, please send a code example.

    Several of the images were updated.

    It would help if Intel were so kind as to provide L3 latencies for the Xeon E5 v4 LCC, MCC and HCC models. What are the memory latencies for local and remote node in E5 v4? How much latency does the SMB in the Xeon E7 add?

    Note that Xeon E3 and client side processors use unbuffered memory. While Xeon E5 can use unbuffered memory, these are currently limited to 16GB DIMMs, while registered memory is available in capacities up to 64GB.

    The Xeon D is for specialized embedded applications and not suited for the single-socket database server. It has only 4 DIMM sites?

    Supermicro has an Ultra product line targeting specialized applications. One of the features they call Hyper-Speed. They claim that with very high quality design and components, it is possible to reduce (memory?) latency via low jitter. I would like to know more about this. But the only option is Xeon E5 dual-socket, and I am more interested in single-socket. The emphasis seems to be on RHEL, and high frequency trading? There are examples for determining which processor socket the NIC is attached to, and whether a thread is running on a core in that socket. These hardware organization detection tools really should be incorporated into Windows as well. I have tried to use the WMI API from C#, but some things require coding in C or possibly assembly?

    It was stressed that round-trip memory latency impacts transaction processing databases. Column-store DW avoids this problem by emphasizing marching through memory sequentially. The main intent of Hekaton memory-optimized tables was to eliminate the need for locks. But the other part was the use of a hash index, which happens to reduce the number of memory round-trip operations.
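    As an illustration of that last point (not taken from Hekaton source or any benchmark kit; the table and bucket count are hypothetical), a memory-optimized table with a hash index looks like the following. A hash lookup is a single bucket probe plus a short chain walk, rather than a b-tree descent of several levels, which is where the reduction in memory round-trips comes from.

    -- requires a database with a MEMORY_OPTIMIZED_DATA filegroup (SQL Server 2014/2016 syntax)
    CREATE TABLE dbo.TradeCurrent (
        TradeID  bigint NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
        Price    money  NOT NULL,
        Quantity int    NOT NULL
    ) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);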

    Additional references
    Intel Core i7 Xeon 5500 series
    Core i7 Xeon 5500 Series
    Data Source Latency (approximate)
    L1 CACHE hit, ~4 cycles
    L2 CACHE hit, ~10 cycles
    L3 CACHE hit, line unshared ~40 cycles
    L3 CACHE hit, shared line in another core ~65 cycles
    L3 CACHE hit, modified in another core ~75 cycles
    remote L3 CACHE ~100-300 cycles
    Local Dram ~60 ns
    Remote Dram ~100 ns
    Intel D. Levinthal paper

    freebsd Ulrich Drepper paper

  • 40 Gigabit Ethernet

    A couple of months ago, I got a pair of Intel XL710QDA2 Ethernet Converged Network Adapters, basically the fancy way of saying 40 Gigabit Ethernet that does other stuff too. I also bought the Intel QSFP+ Twin-axial cable. At that point in time, I might have liked to opt for the newer Intel Omni-Path 100Gbps (56Gbps on PCI-E x4), which came out in Q4 2015, over the XL710, which came out in Q3 2014. But Intel Omni-Path only supports Linux at this time, with a heavy emphasis on HPC applications, so it is unclear whether they will support Windows in the future. Another option was InfiniBand, which other people have written about in conjunction with SMB 3.0 in Windows Server 2012.

    The XL710QDA2 (dual-port) adapters were $500 each, and the cable $130. The single port XL710QDA1 was $420, so I thought having the second port was worth an extra $80 even though I am not using it yet. A 40GbE switch seems to cost about $500 per port, which I think is reasonable. There is the Cisco Nexus 9332PQ for $17K. I thought there was another vendor, Edge-Core? with a 20-port, but it is difficult searching on 40 GbE because of all the junk results. In any case, I do not believe that there is a small-office (8-12-port) 40 gigabit Ethernet switch. So, my current usage is connecting two systems directly with the QSFP+ twin-axial cable, and using the embedded 1GbE for internet connectivity.

    The more readily apparent reason for high-speed networks in a SQL Server environment is to move backups from one machine to another. For my perfunctory dig at storage system vendors, the SAN vendor will say: our million-dollar product has software so you don’t need to move large backups over the network.

    A less obvious reason is to achieve low latency in support of serialized high-volume calls. Transaction processing might not need high network data rates, and can probably generate high call volume on 1GbE from multiple concurrent connections. However, there could be situations that require the highest possible call volume on a single thread and connection. Of course, in this case, the first course of action is to employ a back-to-back connection, bypassing any switches whatsoever. And definitely do not have more than one switch in the path from application to database if a direct connection is not possible.

    Potentially, we might think that high volume ETL could benefit from high data rate networks, but I believe that SQL Server single thread write operations are more limiting than 1 gigabit Ethernet. But perhaps more recent versions of SSIS might have better tools for parallelism.

    Trying out the Intel 40GbE adapter
    Back in August, when I got the XL710QDA, my test systems were running Windows Server 2016 TP5, for which Intel had a beta driver. The adapter was recognized by the operating system and the beta driver installed with no issues. The Intel documentation on firmware updates was very complicated, so I did not apply any. In this setup, even though the driver installed, I could not ping across the 40GbE network. Intel support was of the opinion that the back-to-back connection should work. No diagnostic tools were provided for examining the software stack for the XL710.

    Then in early October, there was a substantial hotfix for Windows Server 2016 TP5, after which SQL Server connections were going over the 40GbE. A ping test worked one way but not the other. In any case, Windows Server 2016 RTM was out about this time, so I installed both machines to RTM, applied OS hotfixes, and updated to the release driver. Now everything seems to work.

    In bulk copies, network transfer rate was high and serialized network calls were also faster than on the 1GbE network (which was via a switch, not back to back). When I get a chance, I will look at how much round-trip latency reduction I would have gotten on 1GbE with a direct back-to-back cable. I did this about 15 years ago when gigabit adapters came down to $300 but switches were still very expensive. If anyone is interested, I investigated whether a cross-over cable was needed for gigabit (Gigabit Ethernet Direct Connect); no special cross-over cable is needed.

    10 gigabit Ethernet
    For people not ready to jump directly to 40 Gigabit Ethernet, consider 10 Gigabit. Adapters are now in the $300-400 range, and switches are in the $100 per port territory. The main consideration is which interface to standardize on. Two options for short distances are 10GBase-T and SFP+ Direct-Attach. It appears CX4 is not getting traction? One might be tempted to default to twisted pair, CAT6a or CAT7, for 10GBase-T. Cables for CAT6a and 7 are really cheap, $3-10 for short-length pre-made CAT6a and $7-17 for CAT7, while SFP+ cables are $50-100.

    However, various sources mention that 10GBase-T latency is higher than for the other options. And in certain situations, serialized network calls for example, latency is the primary criterion. In large file transfers, latency is not as important.

    Netgear actually offers small office 10 Gigabit Ethernet switches. The XS7xx models have 8, 12, 16, and 24 10GBase-T ports, with some models having additional SFP+ uplink ports. The M4300 models include one with 8 10GBase-T + 8 SFP+ and one with 12 + 12. I did not see any that were all or mostly SFP+, but the mixed 10GBase-T + SFP+ models should be acceptable. It is possible to use the Intel XL710 with a QSFP+ to 4-port SFP+ breakout cable.

    Addendum
    Is the Intel 40GbE a dead end? Intel is going their own way at 100Gbps with Omni-Path, which currently does not support Windows.

    Mellanox has a new 100GbE that also does NVMe over fabric?

    2016-12-29
    Mellanox is offering a rebate on a one-time purchase of their 40/100GbE switches, the SN2100 16-port? Price might be $6600-7200? with a 50% rebate. This puts it at $200+ per port at the rebate price, and perhaps $450 per port normally?

  • Samsung SSD 960 PRO - 3500MB/s

    Last month, Samsung announced that their 960 Pro (PCI-E x4 gen3, M.2, NVMe) SSD would be available in October at 512GB, 1TB and 2TB capacities. This is a high-end PC and workstation product. There is a similar but separate product line (SM961?) for OEM customers. The 960 Pro had already been mentioned earlier this year. This week, all the hardware sites published product reviews, including Anandtech, Tomshardware, TheSsdReview and others as well.

    All the previous PCI-E x4 gen3 NVMe SSDs were rated between 2,000-2,500MB/s in large block read. The 960 Pro is rated for 3,500MB/s read. This is pretty much the maximum possible bandwidth for PCI-E x4 gen3. Each PCI-E gen3 lane is 8Gbit/s, but the realizable bandwidth is less. In earlier generation products, an upper bound of 800MB/s realizable per 8Gbit/s nominal signaling rate was typical.
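    As a rough sanity check (my arithmetic, assuming the standard PCI-E gen3 8GT/s signaling with 128b/130b encoding, and ignoring packet-level overhead):

    \[
    4 \text{ lanes} \times 8\,\text{Gbit/s} \times \tfrac{128}{130} \times \tfrac{1}{8} \approx 3.94\,\text{GB/s}
    \]

    so after transaction-layer overhead, 3,500MB/s is close to the practical ceiling for an x4 gen3 device.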

    Presumably there was a reason why every PCI-E x4 product was in the 2,000-2,500MB/s bandwidth range. It could be that these were 8-channel controllers and the NAND interface was 333-400MB/s. Even though 8 x 400MB/s = 3,200MB/s, it is expected that excess bandwidth is necessary on the downstream side. There could be other reasons as well, perhaps the DRAM for caching NAND meta-data. Intel had an 18-channel controller, which produced 2,400MB/s in the 750 line, and 2,800MB/s in the P3x00 line.

    It is very annoying that the more vertical SSD OEMs decline to disclose meaningful details of the controller in their SSDs. The 960 PRO uses the new Polaris controller, for which few details beyond the Read/Write IOPS and bandwidth specifications are published. On TheSsdReview, the 960 PRO is shown to have the Polaris controller and DRAM in a single package. There are 4 NAND packages, each with 16 stacked 512Gbit die for the 2TB product.

    I am guessing that the Polaris controller has 8-channels, at 8-bit per channel with each package connecting to 2 channels? But there are other possibilities, such as 4 x 16-bit channels.

    The reason is that the previous generation NAND were 333 or 400MB/s. The most recent NAND from Micron is listed at 533MB/s. In order to support 3,500MB/s on the upstream interface, there needs to be more than that amount on the downstream side combined. While Samsung is the leader in V-NAND, it seems more likely that the 960 has 8 channels 8-bit (or 4 channels, 16-bit) at 533MT/s and not 4 channels, 8-bit 1066MT/s. Between the two likely scenarios, 8-channel 8-bit should support higher IOPS than 4 channel 16bit?

    Both the 2TB and 1TB models have read specs of 440,000 IOPS while the 512GB model is 330,000 IOPS. It could be that the 1TB is 4 packages, each with 16 stacked 256Gb chips, and the 512GB model has 8 chips stacked. It is also possible that the 1TB has packages stacked 8 high with the same 512Gb die, just with no reduction in IOPS, while the 512GB has 4-chip stacks?

    I would think that 2 x 960 PROs in a single socket Xeon E3 quad-core would be a great match. My current test system is a single socket Xeon E5 10-core with four Intel 750 SSDs. I will probably not outfit another test system until the Xeon E5 v5 comes out next year, in which I may opt for the 960 PRO.

    Oh yeah, the 960 PRO is not an enterprise product, so it should probably not be used in critical production environments, but should be OK for test.
    It is possible to buy a M.2 to PCI-E adapter, but I would like an adapter that can accommodate 2 M.2 devices in one PCI-E x8 slot, without the use of a PCI-E switch if possible.
    HP has an adapter for 4 M.2 in a x16 that only works in HP systems

    ps,
    Last year I bought a Dell XPS9550 with the then-new Skylake processor. It came with a 512GB Samsung PM951, which was an older model rated for 1GB/s sequential, while the slightly more expensive SM951 was rated for 2GB/s. This drive died after 3 months. Dell sent a replacement, but I elected to go out and buy the 950 PRO myself, rated at the same 2GB/s as the SM951, though perhaps not exactly the same product? The failure of one device is not sufficient to make an assessment, but I have dozens of OCZ Vertex and Vector drives, and several Intel SSDs of various flavors, all running for several years now without failure.

  • Insert Performance Limitations with Sequentially Increasing Index

    Earlier this year, HPE and Microsoft sponsored an article, The Importance of Benchmarking, in SQL Server Pro. While true, it is also apparent that there has been little interest in benchmarks within the DBA/IT community over the last several years. There are many reasons for this, one of which is the emphasis on the benchmark result as the important takeaway.

    Today most people expect that systems are very powerful, and probably know that performance at the processor core level has improved only incrementally in recent years. The main avenue is the increasing number of cores with each generation, currently at 22 in Xeon E5 v4 and 24 in E7 v4.

    The TPC benchmarks are generally well designed, with rules meant to prevent the publication of a distorted result. But every application has its own characteristics, different table organizations, and transactions with SQL logic bearing no resemblance to a benchmark or to another application. As such, a benchmark result does not translate directly to a different set of circumstances.

    What is of great importance are the problems that were encountered in the benchmark, and the measures that were taken to address each problem. The topic of this article is Inserts to a table with a sequentially increasing index, commonly in the form of an identity column, but other implementations are possible as well.

    At a New York City SQL Server Users Group meeting, Thomas Grohser, author of Expert SQL Server Performance Engineering, mentioned that multiple client connections doing single row inserts to a table with a clustered index on an identity column had extremely poor performance running on a 4-socket system. In that case, this was about 6,000 rows per second. For inserts to a table with a clustered index on a uniqueidentifier, or with a compound key not leading with the identity column, performance was 95,000 (calls and) rows per second. Memory-optimized tables could achieve 1.7 rows per second. But it might be a good idea to consult with Thomas on this.

    The assessment was that this was a contention issue due to the high latency between processor sockets. All modern systems with more than one processor socket have Non-Uniform Memory Access (NUMA) architecture. For Intel processors, this goes back to Xeon 5500 (2009) and 7500 (2010) and 2003/4 for AMD Opteron.

    An Insert to a table with an index on an identity column, regardless of whether the index is clustered or nonclustered, involves acquiring an exclusive lock on the last row of the last page, as well as accessing the memory for that page and row.

    When there are multiple concurrent threads with affinity on cores in different sockets of a multi-socket system, the memory involved bounces between the L3 caches of each socket. While it is inefficient for memory to bounce across sockets, the magnitude of the impact on Inserts is stunning.

    Testing Insert Performance

    The test environment here is a single socket Xeon E3 v3, quad-core, hyper-threading enabled. Turbo-boost is disabled for consistency. The software stack is Windows Server 2016 TP5, and SQL Server 2016 cu2 (build 2164). Some tests were conducted on a single socket Xeon E5 v4 with 10 cores, but most are on the E3 system. In the past, I used to maintain two-socket systems for investigating issues, but only up to the Core2 processor, which were not NUMA.

    The test table has 8 fixed length not null columns, 4 bigint, 2 guids, 1 int, and a 3-byte date. This adds up to 70 bytes. With file and row pointer overhead, this works out to 100 rows per page at 100% fill-factor.

    Both heap and clustered index organized tables were tested. The indexes tested were 1) single column key sequentially increasing and 2) two column key leading with a grouping value followed by a sequentially increasing value. The grouping value was chosen so that inserts go to many different pages.

    The test was for a client to insert a single row per call. Note that the recommended practice is to consolidate multiple SQL statements into a single RPC, aka network roundtrip, and if appropriate, bracket multiple Insert, Update and Delete statements with a BEGIN and COMMIT TRAN. This test was contrived to determine the worst case insert scenario.
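    For reference, below is a minimal sketch of a test table and single-row insert procedure consistent with the description above; the object names, exact column layout, and procedure are my own reconstruction, not the actual test code.

    -- hypothetical reconstruction: 8 fixed-length NOT NULL columns, roughly 70 bytes per row
    CREATE TABLE dbo.InsertTest (
        SeqID   bigint NOT NULL,             -- sequentially increasing value
        GroupID int    NOT NULL,             -- grouping value for the two-column key
        Val1    bigint NOT NULL,
        Val2    bigint NOT NULL,
        Val3    bigint NOT NULL,
        Guid1   uniqueidentifier NOT NULL,
        Guid2   uniqueidentifier NOT NULL,
        Dt      date   NOT NULL              -- 3-byte date
    );
    -- index variant 1: single sequentially increasing key (all inserts go to the last page)
    --   CREATE CLUSTERED INDEX cx_seq ON dbo.InsertTest (SeqID);
    -- index variant 2: grouping value leading (inserts spread over many pages)
    --   CREATE CLUSTERED INDEX cx_grp ON dbo.InsertTest (GroupID, SeqID);
    GO
    -- one row per call, i.e. one network round-trip per insert
    CREATE PROCEDURE dbo.InsertOneRow
        @SeqID bigint, @GroupID int, @Val1 bigint, @Val2 bigint, @Val3 bigint,
        @Guid1 uniqueidentifier, @Guid2 uniqueidentifier, @Dt date
    AS
    INSERT dbo.InsertTest (SeqID, GroupID, Val1, Val2, Val3, Guid1, Guid2, Dt)
    VALUES (@SeqID, @GroupID, @Val1, @Val2, @Val3, @Guid1, @Guid2, @Dt);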

    On the single socket Xeon E3 quad-core system, the heap table with no indexes, the heap with one nonclustered index, and the clustered index table with the two-column key could all support an insert call rate of 70-72,000 per sec, counting both rows inserted and network round-trips.

    The heap table with one nonclustered index and table with only a clustered index, both with the index key being a single sequentially increasing column such that inserts from all threads/connections go to the last page could support an insert call rate of about 40,000 rows per sec.

    Comments

    The issue here is multiple concurrent connections making calls to Insert into a single table with a sequentially increasing index, clustered or nonclustered, resulting in contention between threads for the last page and row.

    There is a substantial 42% performance degradation on a single socket system, with memory in a common L3 cache, but the relevant memory locations are moved between the L2 cache of different cores after exclusive access is acquired. While the impact is large, even this may not be a crippling effect depending on the circumstance.

    In a multi-socket system, memory must now also be moved between the L3 caches of different processor sockets, which has higher latency. The impact is expected to be more severe as the number of sockets increases.

    The presumption here is that the limitation is in the ability to acquire exclusive locks between threads on different cores, potentially in different sockets. If each call inserted more than one row, the call rate might decrease only slightly, resulting in a higher row insert rate. Hence the explicit emphasis on the call rate as well as the row insert rate.

    Note also that we should probably be able to insert into multiple tables, each at more or less the same rate as inserts to a single table, until the limitation becomes the log write thread or overall CPU.

    This test is not the same as the test by Thomas Grohser in terms of table structure, client application, hardware, etc. Still, the expectation is that results would not be dramatically different. My tests on this matter are incomplete, and more work is necessary. As soon as I can get access to two and four socket systems, I will try to run this same test (hint to Intel or someone).

    In preliminary tests on the single socket 10-core system, the insert call rate with rows going to different pages was over 100,000 per sec for both heap and clustered, and around 30,000 per sec with a sequential index, nonclustered or clustered. We might infer that this is a contention issue in which performance degrades with an increasing number of cores, for both single and multi-socket systems. In the case of multiple sockets, there might be more severe degradation?

    This issue might be more complex than a simple matter of the higher latency between sockets, but other people might have better tools to conduct such an investigation.

    Work arounds

    There are work-arounds for this issue. One is to implement a multi-column clustered index key such that inserts are distributed over many pages, as sketched below. It is also necessary to not have even a nonclustered index on the identity column, which may have implications.
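    A sketch of that first work-around, under the assumption that a small leading bucket value is acceptable to the application (names are hypothetical, and the spid-based bucket assignment is just one possibility):

    -- spread concurrent inserts over many pages by leading the clustered key with a bucket value;
    -- note there is no separate nonclustered index on the identity column
    CREATE TABLE dbo.OrderEvents (
        BucketID tinyint NOT NULL,
        OrderID  bigint  IDENTITY NOT NULL,
        Payload  varchar(100) NOT NULL
    );
    CREATE CLUSTERED INDEX cx ON dbo.OrderEvents (BucketID, OrderID);
    GO
    -- each session inserts into its own bucket, e.g. spid modulo 8
    INSERT dbo.OrderEvents (BucketID, Payload)
    VALUES (@@SPID % 8, 'example');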

    Another work-around is simply to deploy on a single socket system. This is actually good advice for perhaps a majority of situations. The Xeon E3 with 4 cores is perhaps twenty times more powerful than a 4-socket Pentium II Xeon from 1998. If the quad-core E3 were not enough, more likely in memory or PCI-E than in CPU, then a single socket Xeon E5 with up to 22 cores should definitely be considered before reflexively defaulting to a two-socket system without quantitative substantiation of any performance assumptions.

    One of the problems today is that infrastructure people have bought into the vendor-driven arguments for virtual machines and/or cloud, but apply them pervasively, even to mission-critical systems, while completely disregarding any special requirements. The typical platform is a 2-socket.

    There are other options too. We could try affinitizing calls that insert into the critical table, see Map TCP IP Ports to NUMA Nodes.

    Benchmarks and Inserts

    Before closing this topic, given that it is so important, we might ask whether the benchmarks shed light on this matter. The TPC-C benchmark has the highest insert volume into a single table. However, TPC-C does not have this problem because all tables are organized (clustered index) by warehouse followed by district. This reflects a brick and mortar business, where customers place orders mostly in their own warehouse and district.

    If we were curious anyhow, TPC-C benchmark results are in transactions per minute (tpm-C). The highest SQL Server result is 1.8M tpm-C, corresponding to 30K new orders per second. Technically, the order line table has more rows inserted. Each transaction inserts one row to orders and on average 10 rows to the order line table with a single call, so the Insert statement call volume is the same for both tables.

    No TPC-C benchmark results were published for SQL Server version 2008 or later, even though a handful of results continued to be published for version 2005 after 2008 RTM. The reason for this is that SQL Server 2008 compressed log file writes to better support database mirroring, a very important feature across the broad base of users.

    Log writes were single-threaded until version 2016. This meant that SQL Server write performance could be limited by the ability of a single core running the log writer thread to compress log entries. Presumably there was some survey to suggest that this would be acceptable as there is no option to disable log write compression?

    The TPC-C benchmark has a curious statement when updating the stock table in the output clause that touches but does not change several fat columns. This part serves no meaningful business purpose, but has the effect of grossly inflating the log write size far beyond what should be necessary.

    Presumably other RDBMSs do not have mandatory log compression. And so the highest TPC-C result on a system with Intel processors is 5M tpm-C (8 sockets, 80 cores total, 160 threads) corresponding to 84000 inserts to orders per second.

    The TPC-E benchmark does have 3 tables that are unique on the (big)integer Trade Id column, and two more tables that have a two column key leading with Trade Id. Is there a magical tuning trick that can evade the insert to a table with a sequentially increasing index problem? Apparently not.

    TPC-E works around this by not using an identity column in the tables where one might think the primary key column should be an identity. In fact, it uses a scalar function and a table to assign trade Id values.

    The TradeOrderFrame4 procedure first calls the function Get_Next_T_ID with an output variable to retrieve a Trade Id. The function uses the table TID_RANGES to store blocks of ID values. Each block, represented by a row in TID_RANGES, is assigned to a specific session id (SPID, accessed with @@SPID). Every time the function is called, the column holding the current Id value for the caller's spid is returned and the row is updated to the next Id value.
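    A much simplified sketch of the pattern described above (this is not the actual TPC-E kit code; the names follow the description, and the function is written here as a stored procedure with an output parameter):

    CREATE TABLE dbo.TID_RANGES (
        SpId      int    NOT NULL PRIMARY KEY,   -- one row per session id
        CurrentID bigint NOT NULL                -- next Trade Id for that session
    );
    GO
    CREATE PROCEDURE dbo.Get_Next_T_ID
        @TradeID bigint OUTPUT
    AS
    -- advance the stored value and hand the new Id back, in a single atomic statement
    UPDATE dbo.TID_RANGES
       SET @TradeID = CurrentID = CurrentID + 1
     WHERE SpId = @@SPID;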

    This is probably an acceptable implementation in the TPC-E benchmark, where all connections are running transactions. In a real-world environment, there might be very many open connections, not all of which process transactions, possibly in a common connection pool. I might suggest using spid modulo the total number of logical processors. Or perhaps spid divided by 2 first, then modulo the number of cores when HT is active.

    All of this is in the details as required by TPC full disclosure rules. But it is buried in the supplemental files along with thousands of extraneous files. There is no explanation given as to why the three tables that have a primary key clustered on the trade Id column do not use identity, instead using a function that requires an update to maintain.

    It is interesting this method was in TPC-E from the beginning. The first TPC-E result was published in 2007 for a 16-socket Unisys ES7000 with Xeon (7140M?) dual-core processors. In that generation, there were probably 4 processors in each NUMA node. The performance was 660 tpsE, meaning 660 inserts per second to each of 3 tables unique on trade Id plus inserts to two more tables. The current 4-socket Xeon E7 v4 with 96 cores total is 9,068 tpsE.

    Was the insert with an index on a sequentially increasing column problem known back then? If it occurred at such a low insert rate, then how many people suffered this problem without knowing why? The conclusion to draw is that yes, benchmarks are important.

    However, what is important is not the result that vendors like to brag about, but rather the many problems that were solved in producing a competitive result. While all the methods are disclosed in the supporting files, there is no accompanying explanation.

    Some methods are simple to apply, like lock pages in memory. Others must be worked into the database architecture, which has deep ramifications if done after it is already in production, or must be part of the client-side application, such as the TCP port affinity.

    Some methods might have critically significant impact. Others might be of just minor impact that is desired for benchmarks, but have no important consequence in a production system. I do not suppose anyone would be willing to apply the -x startup parameter in a production environment, but it might be good to know the impact? Good luck in deciphering which details are critically important to a specific situation and which might just be used to get an extra 1% edge over a competitor.

    Addendum 2016-10-14
    The test systems, client-load generator and server, each had an Intel I210 1GbE connected to a common switch. A single thread connection could drive about 4K single row select RPC/sec, and 2K single row insert RPC/sec. A high level of saturation was achieved at 80 concurrent connections. About 12 years ago, I had tested Xeon (Pentium 4) systems with 1GbE connected back to back (no switch), and this could drive 5K single row select RPC/s. I did not test the current 1GbE direct connect network round-trip rate.

    Both of these systems also had an Intel XL710 40GbE with a direct QSFP+ connection. At first this did not seem to work in Windows Server 2016 TP5 with the Intel beta driver. Intel support was of the opinion that this should work but did not say definitively. Then on Tuesday, with the latest Windows TP5 hotfixes, all of a sudden the load generator was connecting via the 40GbE instead of the 1GbE. There were some quirks in the behavior.

    In any case, the 40GbE direct connect could drive 11K select RPC/s, and 6K insert RPC/sec, on a single connection. Only 20 concurrent threads were necessary to achieve a high level of load, at which the 10-core Xeon E5 could support perhaps 40K single row inserts with a sequential clustered index. I am currently in the process of changing over to Windows Server 2016 RTM + hotfixes, which also seems to upgrade the XL710 driver from 1.3.115 to 1.5.59, so more to come.

    Addendum 2016-10-19
    Henrik provided the link, SQL Server Latch Contention, which is why he mentioned that this is latch contention. I would have thought it would be lock contention, but apparently it is latch contention according to MS?.

    I only took a quick look; MS identifies the problem with the degree of concurrency, more severe at higher concurrency. Between Thomas Grohser's report and my data, it should be emphasized that there is most probably a strong hardware NUMA or non-NUMA impact. But it would be preferable if one person ran the same test on 1-socket, 2-socket, and 4-socket systems.

  • SQL Server on Xeon Phi?

    Can the Intel Xeon Phi x200, aka Knights Landing, run SQL Server? It does run Windows Server 2016, so is there anything in SQL Server 2016 that would stop it from installing? Xeon Phi is designed for HPC, so it would not have been tested with SQL Server, but that does not confirm whether it will or will not work. If so, then this could be used to prove out some important theories.

    The main line Intel processors used in the Core i7/5/3 and Xeon product lines, with recent generation codenames Haswell, Broadwell, Sky Lake and soon Kaby Lake, are heavily overbuilt at the core level. The Broadwell processor core is designed to operate at 4GHz if not higher.

    The Haswell 22nm processor was rated for up to 3.5GHz base, 3.9GHz turbo at launch in Q2’13. In Q2’14, a new model was rated 4/4.4GHz base/turbo at 88W TDP. Both are C-0 step, so the higher frequency was achieved by maturity of the manufacturing process or cherry picking? The Broadwell 14nm processor had a top frequency of 3.3/3.8GHz base/turbo 65W for desktop, but perhaps this is because it was more focused at mobile than desktop? (Curiously there is also a Xeon E3 v4 at 3.5/3.8GHz 95W and Iris Pro graphics). The top Sky Lake 14nm processor was 4.0/4.2 GHz base/turbo at 91W.

    With a single core under load, the processor is probably running at the turbo boost frequency. When all four cores are under load, it should be able to maintain the rated base frequency while staying within design thermal specifications, and it might be able to run at a boosted frequency depending on which execution units are active.

    The latest Intel Xeons (regular, not Phi) are the E5 and E7 v4, based on the Broadwell core. There are 3 die versions, LCC, MCC, and HCC, with 10, 15, and 24 cores respectively. All of these should be able to operate at the same frequency as the desktop Broadwell or better, considering that the Xeon E5/7 v4 Broadwells came out one year after the desktop processors. But Xeons need to be more conservative in their ratings, so a lower frequency is understandable.

    The top Xeon 4-core model, E5-1630 v4, using the LCC die is 3.7/4GHz at 140W TDP. The top 8-core is 3.4/4.0GHz, E5-1680 v4, also at 140W TDP.

    The top 14-core (MCC die) is 2.2/2.8GHz 105W. The top 24-core (HCC die) is 2.2/3.4GHz 140W. So the Xeon E5 and E7 v4 processors are built using cores designed to operate electrically at over 4GHz, but are constrained by heat dissipation when all cores are active to a much lower value, as low as one-half the design frequency in the high core count parts.

    The transistor density half of Moore’s law is that doubling the number of transistors on the same manufacturing process should enable a 40% increase in general purpose performance. The implication here is that if a particular Intel processor (full-size) core is designed with transistor budget to operator at 4GHz, then in theory, a processor with one-quarter of that transistor budget should be comparable to the full-size core operated at one-half the design frequency, whatever the actual operating frequency of the quarter size core is. (Doubling the quarter size core to half-size yields 1.4X gain. Double again to full-size yields another 1.4X for approximately 2X performance going from quarter to full size).
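    In other words, the working assumption is that single-core performance scales roughly with the square root of the transistor budget (sometimes called Pollack's rule):

    \[
    \text{perf} \propto \sqrt{T}
    \quad\Rightarrow\quad
    \text{perf}(T/4) \approx \tfrac{1}{2}\,\text{perf}(T)
    \]

    that is, a quarter-size core should be roughly comparable to the full-size core running at half its design frequency.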

    So the theory is that it might be possible to have 100 cores of one-quarter the complexity of the Broadwell core on a die of comparable size as Broadwell EX (456mm2), with adjustments for L2/L3 variations, and differences in the memory and PCI elements.

    This is just what Xeon Phi, aka Knights Landing, appears to be. There are 72 cores in 36 tiles, operating at 1.3-1.5GHz base, 1.7GHz turbo. The Xeon Phi x200 is based on the Silvermont Atom core, but at 14nm. A tile is composed of 2 Atom cores, with 4-way simultaneous multi-threading (SMT) and 1M L2 cache shared between the 2 cores. (There is no shared L3? How is cache coherency handled?) The Xeon Phi has 16GB of MCDRAM and 6 memory channels capable of 115GB/s and 384GB max capacity (6x64GB). The MCDRAM can be used in one of three modes: Cache, Flat, or Hybrid.

    There is no mention of the MCDRAM latency, only the phenomenal combined bandwidth of 400-500GB/s. My expectation is that the off-die memory round-trip latency should be lower when the memory is in the same package as the processor, compared to the common arrangement where memory is outside the processor package. This is because it should be possible to use really narrow wires to connect the processor to memory in a common package, so there should be fewer buffering circuits to amplify the signal current? (Can some circuit designer speak to this please?)

    This higher core count, with more threads per core via SMT, is more or less comparable to IBM POWER, SPARC and even AMD Zen. Transactional queries are essentially pointer chasing code: fetch a memory location, use its value to determine the next location to fetch. This should run fine on a simpler core than the 6/8-port superscalar Broadwell, and there would be many dead cycles during the round-trip memory access latency, implying SMT will work well (beyond the two threads per core in the main line Intel cores).

    However, this may not be the best general purpose computing solution, in that there are important single-threaded tasks, or tasks that are not massively parallelizable, for which the existing powerful Intel core is best. My thinking is that a mix of a few powerful cores and many smaller cores is the right solution, and that there should be a few smaller cores dedicated to special OS functions (interrupt handling and polling), in a blended asymmetric-symmetric arrangement.

  • Server Sizing Comments: the case for 1-socket in OLTP

    Long ago, almost 20 years, there was legitimate concern on whether servers had sufficient capability to drive transaction processing requirements for large businesses. Then for a long period, perhaps from the late 1990’s to 2012, the better choice for most mid to large businesses was to simply buy a 4-way (socket) system. Larger systems were based on a non-uniform memory access (NUMA) architecture that had unusual characteristics and vendors provided almost no information on how to work around issues that could cripple such systems.

    Towards the end of the second era of server sizing, processors (at the socket level) became ridiculously powerful such that properly tuned systems ran at very low overall CPU utilization. However, 2-socket systems in this period did not have adequate memory or IO capability, unless it was a Xeon 7500 or later E7 series system with 2 sockets populated. And even then, the structure of SQL Server per processor licensing meant that many people opted for the 4-way system regardless of requirements.

    In 2012, Intel finally put powerful processors (Sandy Bridge, up to 8-cores) with larger memory and IO capability into the Xeon E5 line (there were both 2 and 4 socket versions). This coincided with SQL Server 2012 per core licensing. Now there was a stronger technical justification and financial incentive to not automatically opt for a 4-socket system with the most number of cores.

    It would seem reasonable to make some effort to first determine the total number of processor cores that will meet requirements with some headroom, then determine the number of processor sockets, as more processor sockets means more memory sockets (also considering the difference in DIMM sockets per processor between Intel Xeon E5 and E7 processors). It is unclear whether anyone actually engaged in this technical analysis versus just making the decision strictly on budget goals.

    Today, it is time to consider the astonishing next step: that a single socket system is the best choice for a transaction processing system. First, with proper database architecture and tuning, 12 or so physical cores should be more than sufficient for a very large majority of requirements.

    We should also factor in that the second generation hyper-threading (two logical processors per physical core) from Nehalem on has almost linear scaling in transactional workloads (heavy on index seeks involving few rows). This is very different from the first generation HT in Willamette and Northwood which was problematic, and the improved first generation in Prescott which was somewhat better in the positive aspects, and had fewer negatives.

    A single socket Intel Xeon E5 processor supports 12 DIMM sockets for 384GB with 32GB DIMMs, or 768GB with 64GB DIMMs which is now at a moderate (50%) premium per GB over the 32GB DIMM. In the old days, DBA’s were taught that memory was precious in reducing disk IO to a tolerable level. That was back when a total system memory of 1GB was considered very large.

    I noticed that after common server memory configuration reached 32GB memory, disk IO could be brought to a very manageable level, well under the IOPS possible with 100-200 15K HDDs.

    Today, all-flash storage is practical. Definitely in direct-attach storage, and maybe in SAN based storage depending on the degree of price gouging from your SAN vendor. So while it might be possible to show that TB-plus memory reduces IOPS from 10,000 to nil, the reality is a modern storage system correctly configured for database applications can easily support 100,000 IOPS if not 1M (which would require NVMe to be practical). In this case, we should find that memory requirements are far lower than the memory capacity of recent generation 2 and 4-socket systems.

    Why not continue to spectacularly over-configure the server system, when even SQL Server per-core licensing is only a small portion of the cost of deploying a critical line-of-business application? The big bosses may not question the costs of a 2-socket system or perhaps a 4-socket system, even factoring in that there may be several environments between production, DR, QA and development. But if performance is inadequate, it may cost you your job.

    The answer is performance. Today, transaction processing performance is heavily dependent on serialized round-trip memory access latency. The term is pointer chasing in that a CPU instruction accesses memory to determine the next memory location to access. No amount of L2/L3 cache can hide this, nor is it possible to pre-fetch memory. (What Intel needs to do is increase Hyper-threading beyond two logical processors per physical core. IBM Power is 4 and SPARC is 8?).

    For a processor with integrated memory controllers (AMD from Opteron on, Intel from Nehalem on), local node memory access round-trip latency is probably 50ns. For memory on an adjacent processor, a 1-hop remote node, it is perhaps 100ns. This information is absolutely crucial in understanding system level performance for transactional databases, and yet vendors try to not mention it, instead stressing metrics that are mostly irrelevant.

    In a single socket system, 100% of memory accesses are local node, because there is only 1 node. In a two socket system, the expectation is that memory accesses are split 50-50 between local and remote node. In principle, it is possible to devise some strategy such that a higher percentage is local, but SQL Server does not make it practical to implement such a scheme. In the Intel Xeon E7 4-socket systems, 1 out of 4 memory nodes is local, and there are 3 memory nodes 1-hop away. In an 8-socket system, 4 nodes are 2-hop away.
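    Under this random-access assumption, with all remote nodes one hop away, the expected memory latency for an n-socket system is just the weighted average; using the 50ns local / 100ns remote figures above:

    \[
    t_{avg} = \frac{t_{local} + (n-1)\,t_{remote}}{n}
    \qquad n=2:\ 75\,\text{ns}, \qquad n=4:\ 87.5\,\text{ns}
    \]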

    An argument in favor of multi-socket systems is that the number of memory channels scales with the number of sockets. However, in transaction processing, it is memory round-trip latency that dominates, not memory bandwidth. (It is a factor that more memory channels allow more memory transactions in flight, but it is unclear whether this offsets the increase in memory latency.)

    If someone has reasonably comparable systems with 1, 2 and 4 processor sockets, all of the same processor generation, the impact of the number of sockets can be tested.

    I am also of the opinion that SQL Server cannot make meaningful use of all the processor cores in a top-of-the-line 2-socket (36 cores for Xeon E5 v3), let alone a 4-socket (72 cores), due to most write operations, data and log, being single-threaded. But I would like to hear opinions on this. Is this addressed in SQL Server 2016 to a degree?

    Addendum
    The above applies mostly to transaction processing applications.
    It is reasonable to suppose that decision support applications could use both the greater number of processor cores and the greater number of memory channels in a multi-socket system. However, it is unclear whether SQL Server is effective in implementing parallel execution plans when the degree of parallelism is greater than the number of cores in one socket.

    I would think that the strategy of partitioning and aligning compute to local (system) node memory from (formerly) PDW, applied to a single SQL Server instance, treating each processor socket as though it were a node, which it is, could dramatically improve high-DOP parallel execution plans. But it does not seem that Microsoft is interested in this approach.

    I am presuming that column store is better capable of utilizing multi-socket, as it is definitely not encumbered by serialized round-trip memory latency.

    Addendum 2
    The Dell PowerEdge R730 can be purchased with 1 or 2 of 2 sockets populated. The R930 must have 2 or 4 of 4 sockets populated. It appears that the premium from 32GB to 64GB memory modules might be because the 32GB is RDIMM and the 64GB is LRDIMM. The 2400MT/s 32GB RDIMM is $558, and the 64GB LRDIMM is $1483. For this differential, I would stay single socket, opting for the more expensive 64GB (if necessary) over 2 sockets using the less expensive 32GB, even though I am giving up the 4 extra memory channels.

    If the plan is to implement memory-optimized tables in 2016 (which supports more than 256GB? and has fewer other restrictions) and more than 768GB of system memory is needed, then go ahead with 2 sockets. But I would try to get my transaction processing DB under the single socket limit. In the old days, we moved the old data out to the DW system.

  • Join and Query hints – serious DBAs need to learn this

    For a long time, I have refrained from endorsing the use of Join and Query Hints. The reason is that there is no such thing as query hints in SQL Server except for a few simple items. The key options listed under Microsoft SQL Server Hints (Transact-SQL) that affect join order, join type, and index usage are not really hints.

    They are in fact directives: the specified option is employed if possible, but join order optimization is essentially disabled, and the execution plan joins tables in the order as written, the exception being EXISTS and NOT EXISTS clauses. It is as if the Query Optimizer is hyper-sensitive and its feelings are hurt when someone impudently suggests a hint. So if a hint is specified, good luck to you. You are on your own.

    Microsoft documentation is obviously very thorough on this matter with the following:
    “Because the SQL Server query optimizer typically selects the best execution plan for a query, we recommend only using hints as a last resort for experienced developers and database administrators.”

    But the fact is that there are complicated queries for which it is difficult to estimate the number of rows, especially past the source tables into the intermediate steps. The query optimizer does not produce a good plan, because the row estimate is grossly erroneous. This is aggravated by the fact that the plan costs employed by the query optimizer reflect a fixed IOPS model based around four 1995-era 7.2K hard disks, with no consideration for the actual capability of the storage system or whether data is entirely in memory.

    So there are times when it is necessary to override the query optimizer. I will not go into the details of query optimization here. Back around 2001, I reverse engineered the execution plan cost formulas for many of the fundamental operations without bothering to explain the operations or other aspects of query optimization. Since then, Paul White, Benjamin Nevarez, and others have written very detailed papers on query optimization. Roughly, the strategy is a balance of 1) table access efficiency, 2) join efficiency, and 3) reducing rows as quickly as possible.

    Because join order optimization is disabled when a join or query hint is specified, the key is to write the query to produce a specific join order. There is a method of doing so; see Inside Microsoft SQL Server 2008 T-SQL Querying: T-SQL Querying by Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass on the topic of bushy plans. Once the correct join sequence is found, most of the join types might already be good; perhaps a few join hints might be necessary. Join and query hints do not cause the query optimizer to disable index optimization, so index hints should not be needed as frequently. A sketch of the general idea is shown below.
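
    To illustrate only: the table and column names below are hypothetical, and this is one way (not necessarily the book's exact example) to express a bushy join shape. The parentheses group the joins into two branches, and OPTION (FORCE ORDER) tells the optimizer to honor the order and shape as written.

    SELECT oh.OrderID, od.Quantity, p.ProductName, pc.CategoryName
    FROM ( dbo.OrderHeader AS oh
           JOIN dbo.OrderDetail AS od ON od.OrderID = oh.OrderID )          -- branch 1: header/detail
    JOIN ( dbo.Product AS p
           JOIN dbo.ProductCategory AS pc ON pc.CategoryID = p.CategoryID ) -- branch 2: product/category
      ON od.ProductID = p.ProductID
    OPTION (FORCE ORDER);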

    This is a much deeper topic than I can discuss at this point in time. So why am I bringing it up? Today there was a news report on the cause of the Air Asia flight 8501 crash last December. For details, see the CNN Air Asia crash report or Wikipedia.

    In brief, there was an equipment malfunction that put the airplane outside of the normal flight envelope. It should have been possible for the flight crew to recover, but they did not do so.

    It further turns out: “because the manual provided by the plane's manufacturer said the aircraft, an Airbus 320, was designed to prevent it from becoming upset and therefore upset recovery training was unnecessary.”

    In our world, we are told the query optimizer “typically” is best, hence we should not have to do manual query optimization, hence we do not need to learn this. So when we do need to do manual query tuning, we do not know how because we did not learn this.

    This is BULLSHIT! I don’t give a damn how good the query optimizer is “typically.” A serious principal DBA needs to learn how to do query optimization so that he/she can do so when the situation calls for it. The starting point is the bushy join style in the Itzik Ben-Gan book.

    Addendum
    Here is the link to my previous post on bushy joins; notice that at the time I put caveats on it: Bushy Joins
    I am not saying go for broke on join/query hints, just that we need to learn to shape the join sequence. Rob Farley had a great note in the comments. I had seen that query style before, in TPC benchmarks?, without understanding at the time what the writer was smoking.

    Per Claus's comment linking to Dynamic Search Conditions in T‑SQL, I especially like the strategy below: hard-coded SQL when good parameters are specified, then something else to handle the unusual case, be it OPTION(RECOMPILE) or dynamically built parameterized SQL.

    IF @orderid IS NOT NULL
    BEGIN
        SELECT ...
        WHERE O.OrderID = @orderid
          AND -- Conditions on Order Details here.
        -- No OPTION (RECOMPILE) here!
    END
    ELSE
    BEGIN
        SELECT ...
        WHERE -- same conditions as before
        OPTION (RECOMPILE)
    END
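
    A minimal sketch of the same pattern, filled in with a hypothetical Orders schema (dbo.Orders, dbo.OrderDetails and the parameter names are illustrative only, not taken from the referenced article):

    CREATE PROCEDURE dbo.SearchOrders
        @orderid  int      = NULL,
        @custid   int      = NULL,
        @fromdate datetime = NULL
    AS
    BEGIN
        IF @orderid IS NOT NULL
        BEGIN
            -- Well-parameterized, highly selective path: keep the cached plan.
            SELECT O.OrderID, O.OrderDate, OD.ProductID, OD.Quantity
            FROM dbo.Orders AS O
            JOIN dbo.OrderDetails AS OD ON OD.OrderID = O.OrderID
            WHERE O.OrderID = @orderid;
            -- No OPTION (RECOMPILE) here!
        END
        ELSE
        BEGIN
            -- Catch-all path: recompile so the optimizer sees the actual parameter values.
            SELECT O.OrderID, O.OrderDate, OD.ProductID, OD.Quantity
            FROM dbo.Orders AS O
            JOIN dbo.OrderDetails AS OD ON OD.OrderID = O.OrderID
            WHERE (@custid   IS NULL OR O.CustomerID = @custid)
              AND (@fromdate IS NULL OR O.OrderDate >= @fromdate)
            OPTION (RECOMPILE);
        END
    END;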

  • SSD Form Factor and interface

    There is a curious quiet from the enterprise storage community on form factor and interface direction for solid state storage, be it NAND flash, Intel 3D XPoint or other. On the client side, personal computing, both desktop and mobile, shows a clear direction in favoring the M.2 form factor and PCI-E as the preferred interface for SSD storage. There is a backward compatibility option in M.2 to work with either a SATA or PCI-E interface, but I do not think this will be widely used. SATA and hard disks will not go away; it is just that the primary SSD will be M.2 form factor, PCI-E interface and NVMe host protocol.

    On the enterprise side, there is a great deal of deployed infrastructure built around the SAS interface (a super-set of SATA), and the small form factor (SFF) for 2.5in HDDs at 15mm height or thickness. The bean counter types would like to wish that SSDs (NAND flash for those who do not like the acronym SSD) would use the existing infrastructure, and not just as an interim measure. They are probably still unhappy that Fiber Channel on the back-end had to be abandoned several years ago, being not competitive and a cost burden relative to SAS.

    Preserving the value of investment in existing infrastructure is important because people are unhappy when equipment purchased at painfully high cost becomes obsolete. Of course, enterprise storage is only extremely expensive because storage vendors invented justifications for selling inexpensive components at very high markup. There is also a failure to consider that hardware has an effective depreciation of 30-40% per year due to the pace of progress, which renders the term investment in hardware foolish, or if I were less polite, then completely stupid. So ultimately this is circular logic based on an untenable premise.

    That said, it would be possible to build a viable enterprise storage system around either the PCI-E or SAS interface, because both support multiplexing lanes, and there are switch chips for both PCI-E and SAS. The reason PCs are moving from SATA to PCI-E is that the NAND interface bandwidth is advancing at a faster pace than any single lane connection can support, and SATA does not support multiplexing. (On websites catering to desktop hardware, some say that PCI-E is inherently superior to SATA. This is rubbish from writers with poor technical knowledge. The only important fact is whether the interface supports multiplexing lanes.)

    The reason existing enterprise infrastructure should be abandoned is not because of any deficiency in SAS, but rather that it is built around four lane (x4) uplink and downlink ports. A SAS x4 port at 12 Gbit/s would only support 4.4GB/s net bandwidth. This might seem high, but only because enterprise storage vendors sell crappy systems with pathetic bandwidth capability. The other reason is that most existing infrastructure is either the 24-25 bay SFF in 2U or 15-bay LFF in 3U 19-inch wide rack mount enclosures designed for hard disks. Both the number of bays and the physical volume are completely wrong for current generation SSDs going forward.

    My opinion is that the correct uplink and downlink for solid state (be it NAND flash, Intel 3D XPoint or other) storage enclosures (not individual devices) should be 16 lanes wide or x16. Both PCI-E and SAS have adequate bandwidth and protocols. For PCI-E gen 3 at 8Gbit/s per lane, this would support a net bandwidth of 12.8GB/s. The existing x4 SAS is just too low for an SSD (formerly disk) array enclosure.

    The core of this argument is based around the PC standard of a flash controller with 8 channels on the NAND side, and PCI-E on the uplink side. Today the NAND interface is 333MB/s, so an 8-channel controller could support 2.6GB/s. There may have been some thought that the upstream side should be PCI-E gen 3 with 2 lanes (x2, capable of 1.6GB/s), as it is common to have excess bandwidth capability on the downstream side. But in the PC world, single device benchmark performance is important, so the trend seems to be PCI-E x4 on the controller, with the option to connect only x2 (or even x1?).
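
    The rough arithmetic behind these figures (the roughly 0.8GB/s net per PCI-E gen 3 lane appears to be the approximation used throughout this post, i.e. after encoding and protocol overhead):

      NAND side: 8 channels x 333MB/s ≈ 2.6GB/s
      PCI-E gen 3 x2: 2 x 0.8GB/s ≈ 1.6GB/s; x4 ≈ 3.2GB/s; x16 ≈ 12.8GB/s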

    In the time of hard disks, client side PCs used 7200 RPM HDDs or less, for lower cost and higher capacity. Enterprise storage was primarily 10K or 15K RPM for greater IOPS performance, although 7200 RPM was adopted for tier 2 storage. (Storage capacity should have been too cheap to meter even for 10K HDDs, but because vendors sold at ridiculously high prices, this created a demand for 7.2K in enterprise storage systems.)

    In the first phase of SSD adoption, enterprise systems preferred single level cell (SLC) NAND with greater write endurance while client side was mostly 2-bit MLC and later some low cost devices being 3-bit TLC. Today NAND flash technology is sufficiently mature that MLC has sufficient write endurance for many enterprise needs. Fundamentally, the performance oriented PC and enterprise could use the same SSD, just with different over-provisioning and other firmware settings. It would be foolish for enterprise systems not to leverage components developed for client side systems, given the huge volume and low cost structure.

    While the standard desktop SSD element is M.2 form factor with an 8-channel controller capable of x4 on the upstream side, the enterprise strategy should be to connect x2 on the upstream side. In enterprise, it is the performance of the complete array of storage elements that is important, not the single component. The standard storage array enclosure should probably have 16 bays, each connected x2 to the PCI-E switch, and x16 for each of the uplink port and downlink expansion port. The PCI-E switch would then have 64 lanes: 16 for uplink, 16 for downlink expansion, and 16 x2 connections for the M.2 SSDs. The enclosure should work with either 1 or 2 controllers. Existing DAEs have a single (x1) SAS connection to each bay.

    The physical volume for 16 M.2 devices would occupy only one-quarter of a 1U rack. Existing enterprise storage infrastructure is x4 uplink/downlink ports, 2U full rack with 24-25 bays connected x1. This is wrong for SSDs on multiple points. Uplink and downlink ports should be x16. The volume of the enclosure should be shrunk by a factor of 8. Connections to each bay should be x2, though 16 bays connected at x1 is weakly viable. Given that existing infrastructure is unsuitable for SSDs going forward, there is no excuse not to adopt the client-side components with M.2 form factor and PCI-E in a new, properly designed infrastructure.

    Addendum
    for some reason I cannot respond to comments

    Good question. Sorry about the long winded answer, but life is complicated. I do agree with the shared storage concept in the HDD days, having a common pool of HDDs so that each host can access the aggregate IOPS capability when needed. This, and managing the complex storage system, alone would have justified a good profit margin. But storage vendors felt the need to justify extraordinary margins, hence they started to invent reasons, which led to doctrine based on the invented justifications. Any time that happens, it is a fuck-up of the first magnitude. And storage vendors do not seem to understand what bandwidth is, or about log write latency.

    Next, blade systems are a non-starter for databases because they give up DIMM slots and PCI-E slots. So we should stick with rack systems with the full boat of DIMM and PCI-E slots. Today a 4TB PCI-E SSD is do-able. What's missing is some way to match PCI-E SSDs to the available PCI-E lanes. System vendors have a mix of PCI-E slots, including several x16. Only workstations and HPC have x16 cards; servers do not. So we want to connect four PCI-E x4 SSDs to x16 slots. HP workstations have some card for this, but we need a server version. I can see a 4-socket server with 64-128 PCI-E lanes dedicated to storage, that is 16-32 x4 PCI-E SSDs, so 64-128TB in one system. All this can be internal; SSDs do not take much space, aren't too heavy and don't consume much power.

    Storage vendors still want to sell horribly expensive AFA storage with features that we don't need, and that cannot deliver anywhere close to the bandwidth that is possible. So it really is a fight between the DBA who wants cheap SSD at high bandwidth, and the SAN admin who wants to spend a shit load of money, have control over all storage, and make you fill out forms to justify why you need each precious GB of space, all so he/she can deny your request as not sufficiently justified in his/her judgment.

    Edit 2016-Jan-08
    Given that the NVMe controllers seem to be PCI-E x4, perhaps the strategy should be x16 uplink and downlink, with 8 x4 bays. There is still a 2:1 mismatch between downstream and upstream. The point being bandwidth max is reached with 4 devices, but there is space for 4 more. Above, I suggested 16 x2 bays.

  • Computers without ECC memory are crap - no exceptions

    In my previous post, Hardware rant 2015, some readers reacted to my suggestion that vendors start offering the Intel Xeon E3 v5 in laptop and desktop PCs as if this were an outlandish or impractical idea.

    First, doing so requires almost no additional work. Simply substitute 1) the Xeon E3 v5 for the Core i7 gen 6, 2) the server PCH (C236) in place of the desktop PCH (Z170), which are really the same thing, as are the two processors, and 3) ECC memory for non-ECC, which has 8 extra bits over the normal 64 bits. The cost of this might be one hundred dollars, mostly driven by the premium Intel charges, and only slightly by the 12% higher cost of memory. (It is the Xeon E5 line that would be seriously impractical in a laptop that an old person could easily carry. A young fit person might claim not to feel the difference between 4 and 15 lbs, or 2 and 6 kg.)

    Second, I should explain why ECC memory is so important, far outweighing the extra cost. This is true for user systems, not just servers with 24/7 requirements. As the title states, a PC without ECC protected memory is total crap, no exceptions, unless what you do on the PC is totally worthless, which could be the case for a significant market segment.

    Basically, without any check on memory integrity, we may have no idea when and where a soft error has occurred. Perhaps the only hint is the OS or application crashing for no traceable reason, or serious data corruption that has already occurred. Let it be clear that soft errors do occur unless you are deep underground.

    Up until the early 1990’s, many if not most PC’s sold as desktops and laptops had parity protected memory. Then in the time frame of Windows 3.x, (almost?) all PC vendors switched to memory with no data integrity protection for their entire lineup of desktop and mobile PCs (with perhaps the exception of dual-processor systems based on Pentium Pro and later, that were subsequently classified as workstations). This was done to reduce cost, eliminating 1/9th of the memory for parity.

    All server systems retained parity, and later switched to ECC memory, even though entry level servers use the same processor as desktops (either with the same product name, or a different one). The implementation of memory protection is done in the memory controller, which was in the north-bridge in the past, and more recently is integrated into the processor itself (starting with Opteron on the AMD side, and Nehalem on the Intel side).

    I recall that the pathetic (but valid?) excuse given to justify abandoning parity memory protection was that DOS and Windows were so unreliable as to be responsible for more system crashes than an unprotected memory system. However, since 2003 or so, new PCs were sold with the operating system shifted to the Windows NT code base, imaginatively called Windows XP.

    (In theory) Windows NT is supposed to be a hugely more reliable operating system than Windows 3.1/95, depending on the actual third-party kernel mode drivers used. (Let's not sidetrack on this item, and pretend what I just said is really true.) By this time, the cost of sufficient DRAM, unprotected or ECC, was no longer as serious a matter, even though the base memory configuration had grown from 4MB for Windows 3.1 to 512MB if not 1GB for Windows XP or later. And yet, there was not a peep from PC system vendors on restoring memory protection, with ECC now being the standard mechanism. (I did hear IBM engineers* propose this, but nothing from PC vendors without real engineers. We don't need to discuss what the gutless wonders in product management thought.)

    Presumably soft errors are now the most common source of faults in systems from Windows NT/XP on. Apple Mac OS (from which version?) and Linux are also protected mode operating systems. So this is pretty much the vast majority of systems in use today. It is possible that some crashes are caused by bugs in third-party drivers that have not been tested under the vast range of possible system configurations (more so for performance oriented graphics drivers?). Still, the fact that vendors do not regard correcting the most serious source of errors in PCs today as worthwhile is an indication that they consider the work we do on PCs to be worthless crap, which is the same regard we should have for their products.

    Let me stress again that putting out PCs with ECC memory does not require any technical innovation. ECC capability has been in entry server systems built from identical or comparable components all along. By this time, Intel memory controllers had ECC capability which could be (factory) enabled or disabled depending on the targeted market segment. (Intel does have dumbed-down chipsets for the low-end PCs, but it is unclear if ECC was actually removed from the silicon.)

    Additional notes:
    A. The Wikipedia article ECC memory cites references that mention actual soft-error rates. There is a wide range of values cited, so I suggest not getting hung up on the exact rate, and treating this as order-of-magnitude(s). There is a separate entry, soft errors, for anyone interested in the underlying physics. Of course there are other Wiki entries on the implementation of ECC.

    Briefly, the prevalent source of soft errors today originates with cosmic rays striking the upper atmosphere, creating a shower of secondary particles, of which neutrons can reach down to the habitable areas of Earth. Unless the environment is a cave deep underground, there will be soft errors caused by background radiation. The probability of errors also depends on the surface area of memory silicon, so a system with a single DIMM will experience fewer soft errors than a system with many DIMMs.

    B. Early memory modules were organized as 8 bits of data plus 1 bit for parity in a 30-pin x9 SIMM. Sometime in the early 1990's, around the 80486 to Pentium time, 72-pin x36 SIMMs (32 bits data, 4 bits parity) were popular. The implementation was 1 parity bit protecting 8 bits of data for both the x9 and x36 modules. Parity protected memory had the ability to detect, but not correct, single bit errors in an 8 bit "line".

    A few high-end servers in this era had ECC memory, which may have been implemented with 2 x36 memory modules forming a 64 bit line with the 8 parity bits serving as check bits, or perhaps a custom memory module? Later on, memory modules progressed to DIMMs, having 64 bits of data with allowance for 8 additional bits for ECC. The base implementation of ECC is to have a 72-bit line with 64 bits for data and 8 bits for ECC. This allows the ability to detect and correct single-bit errors and detect but not correct 2-bit errors (SECDED). More than 2 bits in error could potentially constitute an undetected error (dependent on the actual ECC implementation). There are also other ECC strategies, such as grouping 4 x72 DIMMs into a line, allowing the ability to detect and correct the failure of an entire x4 (or x8?) DRAM chip, when each DIMM is comprised of 18 x4 chips, each chip providing 4 bits of data.
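
    As a quick check on why 8 extra bits per 64 are enough for SECDED (standard Hamming-code arithmetic, not tied to any particular memory controller):

      Single-error correction needs r check bits with 2^r >= m + r + 1; for m = 64 data bits, r = 7 works (128 >= 72).
      One additional overall parity bit provides double-error detection, for 8 check bits total in a 72-bit line.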

    C. At the hardware level, if an error is detected and corrected, the operating system and applications continue to function. The event can be logged at the system level. On a detected but uncorrectable error, the hardware should cause a blue screen OS crash.

    An undetected error is just that: it is undetected. The system continues running with incorrect memory content. Depending on the nature of the memory corruption, anything can happen. It could be executable code, in which case the instruction changes. It could be critical operating system data, causing a subsequent memory access to read or write to the wrong location, which could have serious corruption consequences. It could also be end data, numeric or character or control, which may or may not be critical.

    Edit
    * It is probably more correct to say that soft errors are the province of scientists/physicists, not engineers. Sun had perfectly good engineers, but in the late 1990's, they had an Ultra-Sparc II processor with 4M L2 cache in their high-end enterprise system. I believe the L2 data had ECC (SECDED), but the tags were only parity protected (SED). Some of the systems started to experience mysterious failures (the ones located at high altitude?). This was ultimately traced to soft errors. It was not a simple thing to change the L2 cache tags from parity to ECC (logic in the processor itself?), so the temporary solution was to mirror the memory used for tags? (If someone knows the details, please step forward.)

    Edit 2015-11-10
    The Wikipedia topic ECC Memory states "ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing."
    It is more correct to say that ECC is used when it is necessary to correct the more common (typically single bit) errors, and to detect certain errors involving more than 1 bit, which cannot be corrected. However, it is possible that some multi-bit errors cannot even be detected.

    Edit 2015-12-10
    Donaldvc pointed to this new article on IEEE Spectrum drams-damning-defects-and-how-they-cripple-computers
    Much of my knowledge is very old, from back in the days when memory chips were 1-4 bits wide. Back then, a soft error might affect many memory cells, but only one bit in any given word. Then as memory became more dense, a soft error could affect multiple bits in a word? So processors did ECC on a bank of 4 DIMMs = 256 bits of data, 288 bits of memory, which allowed more sophisticated algorithms. I am not sure what Xeon E3 or E5 has. Xeon E7 is supposed to be very sophisticated. If someone has free time, please look into this.

  • Hardware rant 2015

    It has been a while so I suppose it is time for another rant on hardware. There are two systems I would like:
    One is a laptop.
    The second is a server capable of demonstrating extreme IO performance, with the secondary objective of being small enough to bring along to customer sites.

    On the laptop I am looking for
    1) quad-core with HT, i.e. 8 logical processors for better parallel execution plans.
    2) PCIe SSD, I would prefer 3GB/s+, so PCIe gen3 x4, or 2 x M.2 PCIe x2 is also an option.
    3) drive 1, but preferably 2 external 4K monitors (so I can look at complex execution plans)

    On this matter, it is time to bitch at the MS SQL Server team that there should be an option to contract the white space in execution plans. The existing zoom capability is worthless.
    Yes, I know SQL Sentry Plan Explorer can do this, but really MS, is it so hard? Or have you outsourced the entire SSMS to some team that does not know that there is such a thing as a complex query?
    The reason I want to drive 2 external 4K displays is that at 4K resolution, I need more than a 28in monitor to make use of the resolution.

    A couple of days ago, Dell announced the new XPS 15 with Core i7-6700 processors (Skylake), which I immediately ordered, but unfortunately it shows a shipping date of Nov 16.

    It does have a 4K display, and 1 external port which may or may not support 4K. I thought I ordered the docking station, but I do not know if this would support dual external 4K monitors.
    I currently have the Dell 28in 4K monitor, which is great for looking at execution plans, but at the normal text size setting is difficult to read.
    I am thinking that the much more expensive Dell 32in 4K monitor will be better, but maybe not enough. Should I get a 55in 4K TV instead? These all have just the HDMI connector, so I need to make sure there are proper 4K adapters.

    The new XPS 15 data sheet says it has an HDD bay (SATA interface) and one M.2 bay (uncertain if PCIe x2 or x4). It would have been nice if 2 M.2 x2 bays were available instead of the HDD bay. I ordered the XPS 15 with the PCIe SSD. I do not know if it is a good one (the Samsung SM951 is cited at 2150MB/s); if not, I will throw the Dell SSD out and get a good one.

    One more thing: ECC memory.
    Intel desktop and mobile processors all do not have ECC (or parity) memory capability. ECC support has been built into Intel processors for some time now; it is just disabled in the Core product lines, and enabled only in the server Xeon line.
    So the Skylake equivalent is the Xeon E3 v5. Intel released the v5 under the mobile group, with a 45W rating.
    Unfortunately I cannot find a laptop for sale that uses the Xeon E3 v5.

    Perhaps Dell or someone could offer a Xeon E3 mobile system?

    Extreme IO Performance demonstrator
    First, why do I need such a thing?
    when my clients have multi-million dollar SAN storage systems?
    Because SAN people are complete idiots on the matter of IO performance, being locked into irrelevant matters (to enterprise DB) like thin provisioning etc.
    Invariably, the SAN people (vendor sales engineer, the SAN admin, etc.) confuse Gigabits/sec (Gb/s), in which Fiber Channel is specified, with GigaBytes/sec (GB/s), in which all other IO bandwidth is specified.
    So we have a multi-million dollar storage system (full of add-on software that has no purpose in an enterprise DB) connected to a powerful server (60+ cores, paying for SQL Server 2012 EE per-core licensing) over 2 x 8Gbit/s FC links.
    Is this stupid or is this exceptionally stupid?

    Yes, I know it is extremely rude of me to call other people stupid, and that being stupid is not a crime, but when you are the vendor of multi-million dollar equipment, there is a reasonable expectation that you are not stupid.

    So onto the system.
    For this, I am sure I need more than 4 cores, so it needs to be the Xeon E5. Perhaps 8 cores (single socket) is sufficient.
    The new Intel SSD DC P3608 has great specs, but I am not sure when it is actually available?
    I would put 2 of these in the system to demonstrate 10GB/s. Ideally this would all go into a box that fits in carry-on luggage, which is unfortunately not one of the standard PC or server form factors.

    Another option is a 2 x 12 core system to demonstrate 20GB/s on 4 x P3608.

    Addendum
    I would prefer to get a laptop without high performance graphics, the NVidia GTX 960M in this case.
    The current Intel graphics is sufficient for high resolution rendering, and I do not need a high frame rate. All the Intel Core i7 6th gen processors have graphics; I wonder if I can remove the GTX (for power savings)?
    Apparently Dell will have a new docking station, the Thunderbolt Dock TB15 next year, that will support 2 x 4K monitors?

    2015-10-20
    Did I already rant on PC laptops only being available with 16x9 displays?
    How stupid is this? It is one thing for consumer laptops to have a 16x9 display, on the assumption that home users just watch movies,
    but on what justification does this apply to business and workstation laptops?

    Concurrent with the Intel Skylake Xeon E3 v5 announcement, Supermicro announced motherboards for the E3 v5.
    There is a micro-ATX (X11SAE-M), but with just 1 x16 and 1 x4 PCIe gen 3 slots,
    whereas the ATX (X11SAT) has 3 slots with x16/x8/x8 as an option. This would let me put in 2 P3608s? for 10GB/s?

  • GHz and MHz-BIOS updates and Amdahl Revisited

    Last week, a routine performance test ran about twice as long as expected. A check of dm_exec_query_stats showed that CPU bound statements (worker time roughly equal to elapsed time) were approximately 3 times higher than previous tests for matching SQL statements. Almost all of the SQL involved single or few row index seeks, usually Select, some Insert and Update. The server system is a 2-socket Xeon E5-2680 (8 cores, 2.7GHz nominal, Sandy Bridge) in a managed data center. The data center had sent out notice that there would be system restarts the previous weekend, which could mean either OS patches or BIOS/UEFI updates. So naturally the next thing to do is check the Processor Information object for the Processor Frequency and % of Maximum Frequency counters (or equivalent). This showed 135, as in 135MHz, and 5% of maximum. Another system of the same model also rebooted showed 1188 and 44%.
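
    For reference, a rough sketch of the kind of dm_exec_query_stats check described above; the 0.9 ratio is an arbitrary threshold I am using here to approximate "worker time roughly equal to elapsed time", not a standard value:

    SELECT TOP (50)
        qs.execution_count,
        qs.total_worker_time  / qs.execution_count AS avg_worker_time_us,
        qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time_us,
        SUBSTRING(st.text, qs.statement_start_offset/2 + 1,
            (CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
                  ELSE qs.statement_end_offset END - qs.statement_start_offset)/2 + 1) AS stmt_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    -- CPU bound statements: worker (CPU) time roughly equal to elapsed time
    WHERE qs.total_worker_time >= 0.9 * qs.total_elapsed_time
    ORDER BY qs.total_worker_time DESC;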

    This issue has occurred previously in this environment and in other HP systems that I am aware of. The BIOS (or UEFI) update puts the system into one of the energy efficient configurations. It could also be an operating system setting, but most that I have seen are BIOS settings? One can imagine a firmware engineer being so committed to green activism that this was made the default on BIOS updates without discussion with other parties. Or perhaps there is a facility (in Houston?) with inadequate air conditioning for the number of systems, so this setting was put in to prevent the lab from overheating, and then no one remembered to exclude the step from the production code? Not that I have ever done such a thing (and no further questions on this should be asked).

    Another question might be why the data center monitoring team did not check for this, as it has happened before. The whole argument for going to managed data center instead of a simple hosted data center was that the managed data center could provide the broad range of expertise that is not economical for a mid-size IT department. Obviously this managed data center did not monitor for the performance/power configuration.

    This matter is of serious concern to production DBAs and IT staff in handling operations. As the Processor Information performance object with extended information was only introduced in Windows Server 2008 R2, many software monitoring tools may not alert on changes of Processor Frequency, especially after reboot. Imagine the IT staff or DBA encountering this for the first time on the production servers, with users complaining, your boss watching over your shoulder, and his/her boss hovering over your boss, offering their helpful insights in the non-judgemental manner as bosses do.

    Performance Insight

    However, I am more interested in a different aspect of this incident. When there are two sets of data, one for the processor cores at 2.7GHz and another at presumably 135MHz, we can extrapolate parameters of interest. Does it seem stunning that the drop from 2.7GHz to 135MHz, a factor of 20, only decreases CPU efficiency (increases CPU-sec, or worker time) by 3X? Perhaps, but this actually should have been expected.

    The salient aspect of modern computer system architecture is the difference between the CPU clock cycle and memory access time. A young person might not know this, but old timers would. Up to about 20 years ago, the primary memory performance specification was access time, with 80, 70 and 60 ns being common in fast page mode and extended data out DRAM. Then with the switch to synchronous DRAM (SDRAM), the key specification changed to data rate. In the Xeon E5 (v1) generation, DDR3 at 1333MT/s was common. This means a memory channel can deliver one line every 0.75ns, or 1.333 billion times per sec, with a line being 64 bits (excluding ECC bits). The Xeon E5 26xx series has four DDR3 channels. (The Intel processor internally is shown as having 2 memory controllers, each controller driving 2 DDR channels, so channel can have different meanings depending on the context.)
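
    As a rough check on the bandwidth side (peak sequential figures only, which real workloads will not see):

      DDR3-1333: 1.333 GT/s x 8 bytes per transfer ≈ 10.7 GB/s per channel
      4 channels x 10.7 GB/s ≈ 42.7 GB/s peak per socket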

    What is less commonly cited is the round trip latency: from a processor issuing a request to memory, the internal memory access within the DRAM chip, and finally the transmit time back to the processor. (The L1, L2 and L3 cache sequence is also involved in memory access timing.) For local memory (attached directly to the processor) this is around 50ns. For memory on an adjacent processor, the round trip time might be 95ns or so.

    On a 2.7GHz processor, the CPU cycle time is 0.37 ns, so 50ns for local memory round trip access is 135 CPU cycles. This particular system has 2 sockets, so one might expect that half of memory accesses are local at 50ns round-trip latency, and half at 95ns latency.
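
    Using that 50/50 split, a crude blended figure would be:

      0.5 x 50ns + 0.5 x 95ns = 72.5ns, or roughly 196 CPU cycles at 2.7GHz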

    This is a well understood issue. Two methods of addressing the disparity between CPU cycle time and memory access are 1) large caches on the processor, and 2) pre-fetching memory. Current Intel processors have dedicated 32KB I+D L1 and 256KB L2 caches, both per core, and an additional shared L3 cache sized at 2.5MB per core. From Pentium 4 on, the processor pre-fetches 64 bytes (the cache line size), with an option to prefetch the adjacent cache line. Prefetching is exposed in the instruction set architecture (can someone provide a reference please?) and there should also be a BIOS/UEFI setting for hardware prefetch.

    Simple Model

    Now let's visualize the (simplified) code sequence in a relational database engine with traditional page-row data structures. There is a memory access for the index root level page. Read the page to find the pointer to the second level page. Memory access, and repeat. It is a sequence of serialized memory accesses with poor locality (so cache can only help so much), and the next location is not known until the current memory request is completed, so pre-fetching is not possible.

    Modern processor performance characteristics are very complicated, but we will attempt to build a very simple model focusing on the impact of round-trip memory access latency. Start with an imaginary processor with a cycle time equal to the full round-trip memory access time. In this scenario, one instruction completes every cycle, be it an arithmetic or logic or memory access instruction.

    Such a system may never have existed, so now consider a system where the round trip memory access latency is some multiple of the CPU cycle time. The average time to complete an instruction, where time is in units of the memory access latency (50ns, corresponding to 20MHz, for the local node), "a" is the fraction of instructions that involve a (non-local, non-prefetch-able) memory access, and "n" is the processor frequency in these units, is:

      (1-a)/n + a

    The term (1-a) is the fraction of instructions that are either not memory accesses, or are memory accesses satisfied from cache (from a previous access or pre-fetched). "1/n" is the processor cycle time (in units where the memory access time is 1).

    Performance (instructions per unit time), the inverse of average time per instruction is:

      P = 1 / ( (1-a)/n + a )

        = n / ( (1-a) + a*n )     (multiplying numerator and denominator by n)

        = n / ( 1 + (n-1)*a )

    We can see that the above equation has the characteristic that, as processor frequency increases, the upper bound on performance is:

      n -> infinity, P = 1/a

    Also, if the fraction of instructions that require memory access, "a," is zero, then P = n.

    Does the above look familiar? It is just Amdahl's Law, which was formulated in the old days to demonstrate the limits of vectorization in supercomputers. I have just recast it to express the limits of increasing processor frequency relative to round-trip memory access time.
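
    For comparison, the classic form of Amdahl's Law, for a serial fraction s and a speedup factor N applied to the remainder, is:

      Speedup = 1 / ( s + (1-s)/N )

    which is the same expression as P above, with the serialized memory accesses a playing the role of s, and the processor frequency n playing the role of N.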

    If someone would like to check my math, please do so. It has been a long time. Try tricking your teenage son/daughter into doing this as a practical math exercise?

    OK, anybody still reading is obviously not deterred by math, or knows the trick of skipping such things. What am I really after? In the above equation, what is known is the processor frequency relative to memory access latency. While we know the performance or worker time of certain queries, we do not know it in terms of instructions per CPU-cycle. And the second item we do not know is the fraction of instructions that incur a round-trip memory access latency that cannot be hidden with cache or pre-fetching.

    But, we have data points at 2 frequencies, 2.7GHz and reportedly 135MHz. Express the relative performance between the two points as a ratio.

      P2/P1 = R

    Then from the two equations

      P1 = 1 / ( (1-a)/n1 + a )

      P2 = 1 / ( (1-a)/n2 + a )

    we can solve for a in terms of the known values n1, n2 and R.

      a = (n2 – n1*R) / ( n1*n2*(R-1) + n2-n1*R )

    Assuming a memory access latency of 50ns, the base frequency of 20MHz corresponds to memory access in 1 cycle. Plugging in the values n1 = 135MHz / 20MHz = 6.75, n2 = 2700/20 = 135 and R = 3, we get a = 0.059. That is, 5.9% of instructions incurring a non-cached, non-prefetched round-trip memory access latency would result in a 3:1 performance ratio between 135MHz and 2700MHz. (Perhaps it would be more correct to estimate the round-trip memory access latency as the average of the local and 1-hop remote node, about 75ns?)
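
    For anyone checking the arithmetic, the numbers work out as:

      a = (135 - 6.75*3) / ( 6.75*135*(3-1) + 135 - 6.75*3 )
        = 114.75 / ( 1822.5 + 114.75 )
        = 114.75 / 1937.25 ≈ 0.059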

    So while it might seem astonishing that the difference between 135MHz and 2700MHz translates to only 3X performance, the database transaction processing workload is an extreme (but important) case. There are many workloads which exhibit better locality or have memory access patterns that are amenable to prefetch and have performance scaling better with frequency.

    Hyper-Threading

    Earlier, two methods of hiding round-trip memory access latency were mentioned. There is another: Hyper-Threading. The processor core appears as two (or more) logical processors to the operating system. Presumably, there is an extra set of program counters, and resources to determine which physical registers (different from the registers specified in the instruction set architecture) are assigned to each logical processor.

    In the earlier example, say that the round-trip memory access time is 135 CPU-cycles and the fraction of instructions that incurs the full round-trip latency is 6%. Then for 100 instructions, 94 execute in 1 cycle each (excluding consideration for superscalar), as either not involving memory access or having data already in cache, and the other 6 incur the round-trip memory latency of 135 cycles. The total time in terms of CPU-cycles is 94*1 + 6*135 = 904. In other words, only 100 cycles out of 904 are used; the rest are no-ops.
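
    Continuing this simplified arithmetic (my extrapolation, not a measured figure), a second logical processor could in principle execute its own 100 instructions in the otherwise idle cycles:

      1 logical processor:  100 / 904 ≈ 11% of peak instruction throughput
      2 logical processors: 200 / 904 ≈ 22%, assuming the memory stalls interleave cleanly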

    The Intel Xeon processors from Nehalem on implement Hyper-Threading with 2 logical processors on each physical core. (This can be disabled in BIOS/UEFI. Some models have HT disabled. The earlier Intel Pentium 4 based processors implemented a more complex form of HT.)

    Considering the nature of the database transaction processing workload, being memory accesses to determine the next memory access, it is perhaps time for Intel to increase the degree of HT, especially considering that the server oriented Xeon E5 and E7 models are already a full year or more behind the smaller desktop/mobile processor variants. I seem to recall IBM POWER as having 4 logical processors per physical core, and one of the SPARC processor lines as having 8. It would also be necessary to have a good strategy for using HT based on the workload. The option to enable or disable HT in the BIOS/UEFI is not what I mean. HT should be visible to the OS, but the application itself should detect the presence and degree of HT, and make its own decision on whether HT should be used and how it should be used.

    Xeon Phi, Many Integrated Core

    Another item worth mentioning here is the Intel many integrated core (MIC) architecture, codename Knights something, now Xeon Phi. The processor has many smaller processor cores, 61 in the 22nm Knights Corner, versus 12-18 in the 22nm mainline Xeon processors. The theory behind many smaller cores stems from one of the two main elements of Moore's Law: doubling the number of logic transistors (complexity) in a single core should translate to about a 40% performance gain.

    (This was the case up to several generations ago. Since then, Intel no longer tries to double the logic from one process to the next. There might be 10-20% performance gain in general instructions. Some effort is given to expanding the functionality of the special/vector instructions. And most effort has been in increasing the number of cores.)

    One manifestation of this (more logic transistors) could be increased frequency (which Intel stopped pursuing years ago). Another might be more execution ports (8 in Haswell) or other areas that improve instructions per cycle (IPC). Following the rule of 2X transistors per 1.4X performance (square root of 2) backwards, the expectation is that a core of 1/4th the size would have 1/2 the performance. But potentially there could be 4X as many cores, depending on interconnect and power limitations. So for workloads that are amenable to vectorization, or can otherwise be parallelized, more smaller cores could be the better strategy.

    The Xeon Phi is targeted to HPC workloads, as reflected in the 512-bit SIMD instructions. If we were thinking about a transaction processing database engine on the MIC architecture, we would probably consider a very basic ALU without SIMD, (not sure on FP). I am thinking that an asymmetric processor architecture might be the objective. Perhaps two powerful cores from the current main line, and many simpler cores (without SIMD) perhaps even simpler than Atom? (The Intel Xeon Phi line implements Hyper-Threading with 4 logical processors per physical core.)

    Column-Store

    As said earlier, the nature of database page storage along rows makes serialized memory access (also called pointer chasing code?) its hallmark. This is why there is interest in column storage architecture. Now all of a sudden, for certain database workloads, the next memory access is 4 bytes over, already in cache. The work a little further down touches memory in the next 64 byte line or two away. Both the software and the hardware know this, and either is capable of issuing a pre-fetch. It does not matter that columnstore must touch more data; the processor can stream huge volumes of data much more effectively than it can pointer-chase through only the necessary rows.

    Summary

    I should probably say something here.

    Notes

    As I said earlier, modern microprocessors are very complex. Pipelined execution was introduced (in Intel processors) with the 80486 (1989) and superscalar execution with the Pentium (1993). Pipelined means that while the processor can complete an instruction in each cycle, the actual start to finish time of a specific instruction occurs over several cycles. Intel does not talk about pipeline stages any more, but there are occasional references to Core2 and later processors having a 14+ stage pipeline. (Wikipedia says Core2 is a 12-14 stage pipeline, Nehalem and later 20-24?, and Sandy Bridge 14-19.)

    Superscalar means that there is more than one execution unit, with the goal of completing more than one instruction per cycle. Haswell has 8 execution ports; the several processor generations prior were 6-port superscalar. We could apply the principle of Amdahl's Law on scaling performance to any and all of pipelining, superscalar, and round-trip memory latency, and probably other things too.

    Rethinking Computer System Architecture

    I have said this elsewhere. It is long past due to do a clean sheet system architecture with a matching change to OS architecture. Current system architecture stems from the 1970's: a processor with physical memory (8MB was big) and a page file on disk. Why do we still have a page file on disk? In the old days, there was not enough physical memory, so it was tolerable to have a page file on disk to support a larger virtual address space.

    Today, more than 1TB of physical memory is possible and affordable (compared to the cost of SQL Server per core licensing). But the key factor is in how memory is used. Back then, it was mostly for executable code and internal data structures. The assumption was that very few database data pages would actually be in memory at any given point in time. Today, a very large percentage of memory is used for caching data pages. Of the memory used for executable code and internal data structures, most is junk.

    The CPU-cycle time to memory access time discrepancy dictates that the more urgent strategy is to get memory closer to the processor, even if it means drastically reducing the size of true memory, to perhaps a few GB per socket. Given that DRAM is so cheap, we would still have systems with multi-TB DRAM capacity, except that this would now be the page file. Of course the operating system (and applications) would have to be designed around this new architecture. Given how well the Intel Itanium software coordination went, I guess this might be too much to expect.
