The model for the cost of operations used by the SQL Server query optimizer
to produce an execution plan
is rather simple.
It has been largely unchanged since SQL Server version 7, RTM in 1998.
Back then, SQL Server had not yet achieved tier 1 status with commensurate engineering resources.
Some adjustments were made in version 2005.
One might question how something so old can possibly be relevant today,
considering how much hardware has changed over the intervening years.
The cost model is still adequate for pure transaction processing
scenarios in which queries have highly selective search arguments.
Its limitations become pronounced in parallel execution plans.
The salient aspect of modern hardware is that compute capability is
distributed over very many processor cores,
possibly with irregularities due to non-uniform memory access (NUMA).
Hence, the decision on parallelism cannot be just on/off with a
cost threshold, and one-size-fits-all with max degree of parallelism.
It is necessary to have a degree of sophistication
in the employment of resources for parallel processing.
For some queries, it might be best to utilize a low to moderate level of parallelism.
In others, perhaps all cores in one socket.
And in exceptional circumstances, the whole set of cores.
For this, the existing SQL Server cost model does not have sufficient
depth to be the foundation from which an effective strategy for
deploying resources can be formulated.
It should not be too difficult to build a suitable model,
considering the wealth of knowledge that has (or should have been)
accumulated over the years.
There will be some resistance to changing the cost model
given that people have become accustomed to the plans generated
from the current cost model.
But that could be accommodated as a legacy mode,
so as to not stand in the way of progress.
The SQL Server Plan Cost Model Overview
The cost model of SQL Server
is explored in more detail elsewhere.
Other authors have written on the query optimizer itself.
Only a brief description of the SQL Server cost model is given here.
Buried in Books Online in the section on the
query governor cost limit option, is the statement:
"Query cost refers to the estimated elapsed time, in seconds,
required to complete a query on a specific hardware configuration."
The reference system existed when SQL Server version 7 was in development,
so this could have been a Pentium Pro system, but some people have said
that it was a Pentium.
My investigation into this topic began around the year 2000,
though I never went back to look at whether SQL Server versions 6
and 6.5 used the same model.
Each operator in an execution plan can have two components, CPU and IO.
The IO component represents the time to complete the IO operation.
There is in fact also CPU involved to perform an IO, and it is unclear
whether this is part of the IO or represented in the CPU component.
The assumption is that leaf-level pages are not in memory at the start of query
execution, but may be retained if the execution touches a page more than once.
The complete cost of an operator factors in the CPU, IO, number of rows,
and number of executions.
The CPU and IO may represent the cost for only a single execution,
while the operator cost incorporates the entire operation
as a function of the number of executions.
The IO cost is based on a model of the disparate nature of hard disk performance
in random and sequential access at fixed queue depth.
SQL Server versions 7 and 2000 had two baselines, but from version 2005 on,
the IO cost model is that random IO performance is 320 IOPS, and sequential
IO is 1350 pages per sec (10.8MB/s).
As cost is in seconds, the random IO from key lookup and loop join
operations or the first page of a scan is 1/320 = 0.003125.
Each successive page in a scan or range seek is 1/1350 =
0.000740740 (the 740 sequence repeats).
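The per-page arithmetic above can be sketched as follows. The constants are from the text; the scan and lookup formulas are a plausible reading of the model as described here, not the optimizer's actual code.

```python
# Sketch of the version 2005+ IO cost constants described above.
RANDOM_IO_COST = 1 / 320    # seconds: key lookup, loop join, first page of a scan
SEQ_IO_COST = 1 / 1350      # seconds: each successive page of a scan or range seek

def scan_io_cost(pages):
    """One random IO for the first page, sequential IO for the rest (assumed form)."""
    return RANDOM_IO_COST + (pages - 1) * SEQ_IO_COST

def key_lookup_io_cost(executions):
    """Each lookup execution is costed as one random IO (assumed form)."""
    return executions * RANDOM_IO_COST

print(RANDOM_IO_COST)          # 0.003125
print(round(SEQ_IO_COST, 9))   # 0.000740741
```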
The plan cost does not model the cost of logic beyond the basic operation.
For example, an aggregation of one column has the same cost as two or more.
The interpretation of this is that the purpose of the plan cost model is to find
the best index and table access sequence and method.
Ancillary logic that must be executed in all cases does not affect the choice of plan.
Hard Disks 1995-2005
It might seem that given the pace of hardware change,
such an old model cannot possibly be valid, resulting in horrible execution plans.
Around 1995 or so, the high-performance HDD was 7200RPM with a sequential
bandwidth of 4-5MB/s.
The mean rotational latency for 7200RPM is 4ms.
An average seek time of 8.5ms seems reasonable, though I have not kept
any documentation from this period.
This would correspond to 80 IOPS per disk at queue depth 1 per HDD.
So, it seems curious that the SQL Server cost model is based on the random
IOPS of 4 disks, but the sequential IO of 2 HDDs.
Performance HDDs progressed first to 10K RPM around 1996,
and then to 15K around 2000, with corresponding rotational
latencies of 3 and 2ms respectively.
Average seek time was reduced over time
to 3ms with developments
in powerful rare-earth magnets.
The 10K HDD could support 125 IOPS, and the 15K HDD 200 IOPS.
But no further progress was made on HDD rotational speed.
In the same time period, hard disk sequential IO improved phenomenally,
quickly exceeding 100MB/s in the fourth-generation 15K HDD around 2005.
In other words, the SQL Server cost model is based on a 1350/320 = 4.2:1
ratio of sequential to random IO.
But 15K HDDs in the mid-2000's were 100MB/s × 128 pages/MB = 12,800 pages/sec
sequential to 200 IOPS random, for a ratio of 64:1.
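The arithmetic behind this comparison, using the numbers from the text:

```python
# Sequential-to-random ratios: the plan cost model vs a mid-2000s 15K HDD.
model_ratio = 1350 / 320           # model: sequential pages/sec over random IOPS
hdd_seq_pages = 100 * 128          # 100MB/s at 128 (8KB) pages per MB
hdd_ratio = hdd_seq_pages / 200    # 15K HDD: 200 IOPS random at queue depth 1

print(round(model_ratio, 1))   # 4.2
print(hdd_ratio)               # 64.0
```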
It turns out that achieving nearly the sequential IO capability of HDDs
from SQL Server required a special layout strategy,
as outlined in the Fast Track Data Warehouse Reference Architecture
papers, which few people followed.
This was due to the fixed, inflexible IO pattern of SQL Server, which
required the disk configuration to match that of SQL Server
instead of being able to adjust the IO pattern to match that of the storage configuration.
In a SAN system, where the vendor was exclusively focused on random
IO, the typical configuration supported sequential performance of about 150 IOPS at 64KB,
for 9.6MB/s or 1200 pages per sec.
Quite by accident, this was approximately in line with the SQL Server query optimizer.
In any case, regardless of the actual random and sequential values,
the important criterion is the sequential to random IO ratio,
on the assumption that leaf level pages are not in-memory.
And this was usually not true by the 2005 timeframe, when hot data
was likely to be in memory for a well-tuned transaction processing system
on a maximum memory configuration 4-way system.
All the numbers cited above for HDD random IOPS performance were for queue depth 1.
Of course, SCSI HDDs support elevator seeking, in which IO is issued at high queue depth,
and the disk reorders the sequence of operations for higher IOPS.
The SQL Server engine facilitated this by issuing IO at high queue depth when the
estimated number of executions of a key lookup or loop join was 25 or more.
Below this, IO was issued at queue depth 1.
But the plan cost did not have a model for HDD random IO at different queue depths.
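The runtime behavior described above might be sketched as follows; the threshold of 25 is from the text, while the high queue depth value itself is an illustrative assumption.

```python
def io_queue_depth(estimated_executions, high_depth=8):
    """Issue key lookup / loop join IO at high queue depth only when the
    estimated number of executions is 25 or more; otherwise queue depth 1.
    The value 8 for the high depth is an assumed illustration."""
    return high_depth if estimated_executions >= 25 else 1

print(io_queue_depth(10))    # 1
print(io_queue_depth(1000))  # 8
```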
It should be pointed out that individual query performance optimization is
not always the prime objective.
For a transaction processing system that also supports reporting queries,
the first priority is usually transaction latency.
In this case the preferred strategy is to sacrifice reporting query performance
by not issuing IO at excessively high queue depth, keeping transaction latency low.
SSDs started to pick up momentum around 2010.
Early use was for extreme requirements.
Now it is the more practical choice for almost all main line-of-business
systems over HDDs.
This comes at a time when the system may also happen
to have enough memory for almost the entire database.
With data on SSD, the access times for sequential and random might
reflect a lower ratio than the optimizer model
as opposed to a much higher ratio for HDDs in the FTDW RA.
Still, the CPU cost for loading 64KB or more into memory with a single IO is lower
than evicting pages and loading multiple individual pages.
The scan should have a lower cost per page.
There is a less appreciated aspect of SSD storage in being able to achieve
massive IOPS capability far in excess of what is really necessary.
In an HDD system, it was possible to support sufficiently high random IOPS
at less than excessive latency for transactions.
It was also possible to achieve high bandwidth for large block IO
in support of DW.
What was not entirely practical was to simultaneously support low latency IOPS
for transactions and high bandwidth for DW.
This is now possible with an SSD storage system
and this should be the configuration objective.
It would be helpful for SQL Server to implement the correct IO
queue depth strategy for this capability.
Key Lookup - Scan Cross-over
For pure transactions, queries have a highly selective search argument.
In such cases, the plan cost model is not hugely important,
regardless of wild differences between the plan cost model and various
implementations of real hardware.
The cost model is more important for queries involving an intermediate
number of rows,
in which the true cross-over point
from an index seek with key lookup to scan operation is of greater interest.
But even then, we may prefer one of the two for other reasons.
Parallel Cost Model Inconsistency
There is a curious inconsistency in the cost model for parallel execution plans.
The CPU component of operator cost is scaled down in proportion to the degree
of parallelism, except for the last doubling of the number of processors (cores).
At DOP 2, the CPU cost is half of that at DOP 1.
On a system with 20 logical processors and unrestricted parallelism,
the CPU cost is reduced by a factor of 10,
regardless of whether it is 10 physical cores with Hyper-Threading or
20 cores without HT.
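One way to read this behavior is that the CPU reduction factor is capped at half the logical processor count. A minimal sketch under that assumption:

```python
def parallel_cpu_cost(cpu_cost_dop1, dop, logical_cpus):
    """CPU cost scales down in proportion to DOP, except that the reduction
    factor never exceeds half the logical processor count (the 'last
    doubling' is not credited). This is an assumed reading of the behavior."""
    factor = min(dop, max(logical_cpus // 2, 1))
    return cpu_cost_dop1 / factor

print(parallel_cpu_cost(1.0, 2, 20))   # 0.5
print(parallel_cpu_cost(1.0, 20, 20))  # 0.1 -- factor capped at 10
```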
The inconsistency occurs in IO cost.
There is one model for source access, that is,
seeks and scans on tables or indexes.
And there is a different model for intermediate operations,
examples being Hash Match and Sort operators.
Parallel Cost - Table and Index IO
The IO component in table and index access operations,
including index seek, scan, and key lookup,
does not change with parallelism.
The interpretation of this is that the IO system is saturated by a single thread.
Hence, a parallel plan does not reduce IO time in accesses to tables and indexes.
This is a very reasonable assumption for the original reference system having
2 or 4 HDDs.
It is not an accurate representation for scan operations on a system
configured to the FTDW RA, capable of multi-GB/s IO bandwidth.
The saturated IO model might be somewhat applicable for random IO
because a single processor core from about 2006 on could
drive 50K IOPS depending on the situation
and only a very large array of perhaps 200 15K HDDs could support
that volume at queue depth 1 per HDD.
On a modern system with properly configured SSD storage,
the expectation should be that both sequential and random
IO scales with parallelism, fully up to the number of physical cores.
An example of a properly configured system is the test system from
Parallel Execution in SQL Server 2016:
one Xeon E5-2640 v4 10-core processor with 4 Intel P750 PCI-E SSDs.
Parallel Cost - Hash and Sort Operations
Hash and Sort operations for a small data set only have a CPU component.
When the number of rows times the row-size exceeds a value,
there will be an IO component as well.
The set point is some fraction of memory, either the
max server memory setting, or system memory if the former is unlimited.
This appears to be just over 5MB per GB of memory.
This value is also per thread:
at higher DOP, each thread can accommodate up to the set point.
Beyond the set point, the IO cost grows somewhat faster
than linearly, at perhaps 25 for 10MB and 300 for 100MB.
The IO cost in hash and sort operations does scale with parallelism,
unlike the IO cost in index seek and scan operations.
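A rough sketch of the set point logic described above; the 5MB-per-GB figure and the per-thread treatment are this article's estimates, not documented values.

```python
def hash_sort_spills(input_bytes, dop, server_memory_gb):
    """A hash or sort operator acquires an IO cost component once its
    per-thread input size exceeds a set point of roughly 5MB per GB of
    server memory (estimated figure, not a documented value)."""
    set_point = 5 * 2**20 * server_memory_gb   # ~5MB per GB of memory
    return (input_bytes / dop) > set_point     # each thread gets its own set point

# With 32GB of memory the per-thread set point is ~160MB.
print(hash_sort_spills(100 * 2**20, 1, 32))   # False: 100MB under the set point
print(hash_sort_spills(400 * 2**20, 1, 32))   # True: 400MB over the set point
print(hash_sort_spills(400 * 2**20, 4, 32))   # False: 100MB per thread at DOP 4
```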
One might expect this to reflect the min memory per query configuration setting
and the Resource Governor max memory setting at actual runtime,
but it might be that the plan cost model only factors
in the max server memory setting.
Plan Cost Scaling - Parallelism
Below is the ratio of plan cost at DOP 1 to higher DOP between 2 and 10
for the TPC-H SF 100 database on the Xeon E5-2630 v4 system with 64GB,
but SQL Server limited to 32GB.
In essence, this is the predicted speedup from parallelism
by the plan cost model.
In several queries, there is very little reduction in plan cost at higher DOP.
This occurs when the bulk of the cost is from the table and index access
operations, in which parallelism does not reduce IO cost; only the CPU cost is reduced.
In a few queries, examples being 8, 16 and 18, there is significant scaling in plan cost.
This occurs primarily when much of the cost is in the hash match or sort operations,
in which case, both CPU and IO are reduced with parallelism.
In any case, actual scaling is always better than plan scaling,
as shown below. See
Parallel Execution in SQL Server 2016
for a more complete discussion.
The SQL Server plan cost model has very little predictive value
on the matter of scaling.
The plan cost model only predicts scaling when a significant portion
of the cost is in the IO component of hash and sort operations.
When the cost is largely in the IO of table and index scans and range seeks,
the plan predicts weak scaling.
In fact, many operations seem to scale well
when the IO subsystem is not saturated from a single thread.
True Cost versus Plan Cost
As mentioned earlier, the plan cost model was based on a reference
system that was probably a Pentium processor from the mid-1990's.
In that era, each successive generation of processor increased rapidly in performance.
It would not make sense to recalibrate the cost model
to newer generation processors.
And furthermore, it does not really matter what the processor is.
Plan cost tends to be dominated by the IO component of table and index
accesses based on 320 IOPS random and 10.8MB/s sequential,
or from the IO cost of hash and sort operations.
Still, one might be curious as to how plan cost correlates to actual cost.
The sum of the plan cost for the 22 TPC-H SF100 queries at DOP 1 is 200,839.65.
Note that many of the execution plans have large hash operations for which the IO cost
is dependent on system memory.
The sum of the execution times for the same 22 queries also at DOP 1 is 3,515.15sec,
although it would be less if some queries were manually tuned,
resulting in higher plan cost, but lower runtime.
The ratio of the sum of the plan costs to actual runtime is 57.135.
Curiously, this is the approximate order of magnitude difference
between a Pentium and modern processor core.
There was no particular expectation of this because the plan cost model represents
mostly IO wait while the test system has sufficiently powerful IO
that the ratio of worker time to elapsed time was 98%.
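The scaling factor is simple arithmetic on the two sums reported above:

```python
plan_cost_sum = 200_839.65   # sum of plan costs, 22 TPC-H SF100 queries at DOP 1
elapsed_sum = 3_515.15       # sum of execution times in seconds, same queries

# Ratio of total plan cost to total actual runtime.
print(round(plan_cost_sum / elapsed_sum, 3))   # 57.135
```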
The figure below shows actual query runtime divided by plan cost
scaled by 57.1, for the 22 TPC-H queries.
The range in variation between actual execution time
relative to plan cost scaled by 57 is between 0.32 and 2.53.
This is actually not a bad range, because the query optimizer plan cost
does not attempt to model the true cost of a query,
only the methods of table and index access.
Two of the outliers on either end are Q2 at 0.32 and Q13 at 2.53.
The critical part of the plan for Q2 is shown below.
The large portion of the plan cost is for the loop join to PARTSUPP.
As pointed out earlier and elsewhere, the plan cost for random IO
is fairly high.
Most of the operations in the TPC-H queries are scans or range seeks.
The bulk of weight in the measured 57X scaling relative to plan cost
is driven by the plan cost of sequential IO.
Also, disregarding the plan cost model, the SQL Server engine actually
issues IO at high queue depth when the estimated rows is more than 25.
The expectation is that a storage system with 4 NVMe PCI-E SSDs
blows through the loop join faster than the 57X average scaling.
The other extreme is Q13, for which the core plan component is below.
There is nothing unusual in the plan.
However, the search argument in the query includes the following:
AND CUSTOMER not like
This expression is expected to be expensive to evaluate, and is not
modeled in the plan cost, being treated as just any other predicate.
With compression, the plan cost for the 22 TPC-H SF100 queries at DOP 1
decreases by 32% from 200,839 to 136,792,
reflecting lower IO cost from compressed tables.
The actual execution time increases by 21% from 3,515 to 4,275sec,
reflecting the overhead of compression significantly outweighing
the reduced IO.
Cost Threshold for Parallelism
The original scale of plan cost was that 1 represented a simple estimate
of execution time in seconds.
This has long lost its original meaning.
For all practical purposes,
from a few years after its arrival,
we treated cost as some arbitrary unit of measure.
Except for the fact that plan cost was used in the cost threshold for parallelism.
In the original model,
the intent was that the query optimizer does not consider parallelism
until the estimated execution time of the non-parallel plan exceeds 5sec,
which was once a very reasonable value.
Today, plan cost 5 might be a query that executes in less than 0.1 sec.
This is most definitely well below the point at which parallelism should be considered.
The overhead of coordinating multiple threads
is heavy in relation to the actual work itself.
But we should also not wait until plan cost 285, corresponding to 57 × 5,
for the original 5 sec duration.
Part of this is because people today are far less patient than
in the previous era.
The other part is that today we have so many processor cores at our
disposal that we can now be more liberal in their use than before.
Having discussed both the cost model and, to some degree, actual scaling,
it is now time to lay the groundwork for a new model of plan
cost and parallelism.
A parallel plan involves splitting the work to be done among multiple threads.
This involves additional work not in a non-parallel plan,
comprising one or more of: distribute,
repartition or gather streams.
The effectiveness of a parallel plan depends on both
the ability to divide the work with some degree of
uniformity, and on the extra work necessary to
coordinate multiple threads not being too large.
There is also some effort necessary to start a thread
or to acquire threads from a pool.
The Bitmap Operator
There is also another factor in parallelism.
As discussed elsewhere, there are a number of queries in which the parallel
execution plan runs with lower CPU time than a non-parallel plan, which should
not normally be possible.
It turns out that the parallel plan has a bitmap operator that is not in
the non-parallel plan.
The following is from Microsoft TechNet on
Bitmap Showplan Operator:
SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see
Optimizing Data Warehouse Query Performance Through Bitmap Filtering.
As it turns out, the bitmap operator might have the effect of improving plan efficiency
by up to 30% depending on the circumstances.
Microsoft could allow the bitmap operator to occur in plans above a certain cost
threshold, but not parallel for whatever reason.
That is, a separate cost threshold for the bitmap operator,
distinct from the cost threshold for parallelism.
I would not hold my breath on this.
Processor and System Architecture
The foundation of the parallelism strategy
must encompass the nature of the modern processor and system architecture.
The largest Intel Broadwell EP/EX of 2016 has 24 cores,
the full set of which is available in the Xeon E7-v4,
but only 22 in the E5-v4.
While we may have been conditioned from past history into viewing
a server system as having multiple processor sockets,
the fact is that a system having a single processor socket
today can be incredibly powerful.
There are significant benefits for a single socket system.
One of which is that all memory is local.
Another substantial benefit is that the effort to
coordinate between multiple threads is low.
This leads to excellent scaling characteristics even
when there appears to be significant inter-thread communication.
For those who cannot let go of the multi-socket mind-lock,
below are representations of a 2-socket E5
and a 4-socket Xeon E7.
The inescapable fact is that the cost of coordinating between threads running
on cores in different sockets is very substantial.
And so, there is a very substantial degradation in scaling
when threads are on different sockets.
For this reason, the parallelism strategy must respect
the nature of the system architecture.
Low to moderate parallelism
must use cores on a common socket.
There should probably be one strategy in employing parallelism
up to the number of cores in one socket
and a separate strategy for high-parallelism spanning
cores on multiple sockets.
A Modern Parallelism Strategy
With this knowledge,
the parallelism strategy needs to blend
the method for partitioning of work
with the nature of the modern server processor and system.
There are very many cores, i.e., compute resources that can be allocated
to query plans, in either a single or multi-socket system.
For this, we need something more than the very basic, legacy configuration controls:
1) Cost Threshold for Parallelism and 2) Max Degree of Parallelism.
To a Connect request on this matter,
Microsoft replied that the newer additional controls in Resource Governor were the answer.
In essence, the SQL Server team is abrogating their responsibility
in having an intelligent mechanism to properly utilize the capabilities
of a modern system.
The first step might be to rescale the plan cost so that the cost
threshold for parallelism is in more understandable terms.
This is not absolutely essential, and
we could just state that plan cost 60 might correspond roughly to 1 second.
It did not make sense to rescale plan cost in the days when
processor core performance changed substantially with every generation.
But performance at the core level over the last several years
has been more stable, improving at a slow to moderate pace,
and may continue to do so barring a radical event.
We might take this opportunity to reset the plan cost.
Second, the cost model needs to be more comprehensive.
The cost of touching each row in a scan or other sequential access
is larger than the current model represents
and should be corrected.
For this, the impact of lock-level might be considered.
The cost of touching each column is both non-trivial
and not accounted for at all in the current model.
The cost of ancillary logic also needs to be modeled.
And of course, the cost of access to compressed pages should be accounted for.
The above items may not impact the structure of the execution plan,
which is why they were not considered back in version 7, or in 2008 for compression.
But it is relevant in the cost threshold for parallelism assessment,
and also impacts scaling projections.
Third, the IO cost model needs to be fixed.
This might involve modeling the actual IO subsystem performance characteristics.
Or it might just assume that IO performance scales with parallelism.
Also important is that parallelism scaling does not stop at
half the number of cores, instead going up to the full number of physical cores.
It would help if the optimizer knew the difference between
physical and logical cores.
The IO cost in hash and sort operations already scales with parallelism,
so fixing the IO cost for source access would make the model complete.
As a rough guess, (current) plan cost 30, corresponding to roughly one-half
of one second execution time might be a good default cost threshold for parallelism,
settable of course.
For a query with plan cost 30 expected to execute in 0.5 CPU-sec (worker time),
a DOP of 2 and not higher might be a good idea.
One suggestion for the DOP strategy might be nonlinear.
Employ DOP 2 at the base value of the cost threshold.
Then double the DOP for every 4× over of the base cost.
If base is 30, then plan cost 120 (4×30) is the threshold for DOP 4
and 480 (16×30) is threshold for DOP 8.
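The suggested policy can be sketched as a simple function. The base value of 30 and the per-socket cap of 10 cores are the rough guesses from the text, not anything SQL Server implements.

```python
def suggested_dop(plan_cost, base=30.0, cores_per_socket=10):
    """DOP 2 at the base cost threshold, doubling for every 4x increase
    over the base, capped at the number of cores in one socket."""
    if plan_cost < base:
        return 1                       # below the threshold: no parallelism
    dop, next_threshold = 2, base * 4
    while plan_cost >= next_threshold and dop < cores_per_socket:
        dop *= 2
        next_threshold *= 4
    return min(dop, cores_per_socket)

print(suggested_dop(29))    # 1
print(suggested_dop(30))    # 2
print(suggested_dop(120))   # 4
print(suggested_dop(480))   # 8
```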
This could be the strategy for DOP up to the number of cores in one socket,
with one threshold for the base value and another for escalating DOP.
The expectation is that scaling is not so simple or excellent
for parallelism on cores in different sockets.
It would be good if the plan cost could model this.
In any case, there might be an even higher threshold for employing
cores over multiple sockets.
We should not normally set DOP to somewhat over the number of cores in one socket.
It should be up to the number of cores in one socket
for most circumstances,
or employ most of the cores of multiple sockets for exceptional circumstances.
Furthermore, if the desired number of cores in one socket are not available,
then reduce the DOP rather than employ cores on different sockets.
Both the cost model and the simple controls for parallelism
have become seriously obsolete for modern high core count systems.
An improved cost model, encompassing cost structure
that was not necessary for optimal single-threaded table access,
is now needed for the parallelism strategy.
The cost model needs to correctly model parallelism
on modern hardware, which means fixing the IO cost model
that currently assumes a single thread saturates the IO.
A proper parallelism strategy also needs more
sophistication than the current on/off decision based on the cost threshold
for parallelism, and the all-or-nothing approach
with only a max degree of parallelism setting.
A moderately more complex model has been suggested
but alternatives are possible.
There are a number of interesting questions in parallel execution performance
that can be investigated using the TPC-H kit, which has a data generator.
There are also supplemental files that accompany published results.
Important questions in parallel execution are:
scaling versus degree of parallelism (DOP) on physical cores,
Hyper-Threading, Compression and Columnstore.
(Memory optimized tables support parallel plans in 2016? perhaps in a later post.)
To be clear, the TPC-H data generator (dbgen) and queries are used here,
but no attempt is made to adhere to the TPC-H benchmark requirements
(there are some outright departures).
The material reported here is only for the purpose of studying
parallel execution characteristics.
In brief, the findings are as follows.
The average performance gain from Hyper-Threading over the 22 TPC-H queries,
measured on the sum of query times, is 23%, in line with characteristics observed in
SQL Server version 2012.
(Official TPC-H results use a geometric mean.)
There used to be an anomalous behavior in SQL Server 2012, in which
performance was better with HT enabled and the second logical processor idle
than with HT disabled, even for single-threaded execution, as reported previously.
Compression has an average overhead of 20% across all DOP levels.
For SQL Server 2008, I had reported in SF 10
that compression overhead was 40% between DOP 1-8,
then decreases to less than 10% at DOP 32.
The test circumstances were different, and may have been related
to the NUMA effects of an 8-way system.
Columnstore performance is excellent and should be considered whenever possible.
- Hyper-Threading performance boost 23% in parallel DW queries
- HT anomalous performance gain in version 2012 at DOP 1 no more
- Page compression overhead 20%, down from 40% in 2008
- Columnstore should run one per physical core, HT can be enabled
The test system is a single socket Intel Xeon E5-2630 v4 10-core 2.2GHz processor
with 64GB memory. Storage is 4 Intel P750 PCI-E NVMe SSDs for data and temp,
and one SATA SSD for logs.
The software stack is Windows Server 2016 and SQL Server 2016 SP1.
SQL Server was limited to 32GB for results reported here.
The data is generated using dbgen from the TPC-H kit at scale factor 100.
This used to mean that the Lineitem table is 100GB, but with a 4-byte date for the three
columns that were formerly 8-byte datetime, it is now 80GB in row-store without compression
(still 600M rows).
Overall database size is 140GB.
With page compression, the Lineitem table is 46GB data and overall database size is 101GB.
In columnstore, total database size is 44GB.
Elsewhere, I have argued that performance characteristics are very different
between scaling with multiple cores on a single socket and scaling with cores
over multiple sockets.
Multi-socket scaling cannot be investigated here.
Conclusions should not be extrapolated beyond what is tested.
Baseline, no compression, row store
The figure below is speedup (elapsed time ratio) at various degrees of parallelism
relative to DOP 1
for the baseline row-store, no compression database.
Note that there are 10 physical cores.
The SQL Server engine did seem to run parallel plans on separate cores
except when DOP exceeds the number of cores.
Queries 5 and 15 are worse at DOP 2 than DOP 1, at 0.675 and 0.802 respectively,
and Q3 is slightly slower at 0.948 relative to DOP 1.
In query 3 and 5, the negative scaling from DOP 1 to 2
is caused by a change in the execution plan structure,
excluding parallelism operators.
Below is a focused look at scaling from DOP 1 to 2.
As noted, three queries are worse at DOP 2 than 1.
Several queries are almost exactly 2X better, as expected in the
ideal case for a parallel execution plan.
And then there are also queries that are more than 2X better at DOP 2 than 1.
Q17 is 4.3X faster at DOP 2 than 1.
Four others are more than 3.5X faster.
Usually, these have the bitmap operator in the parallel plan,
which is not used in a non-parallel plan.
The bitmap operator has some startup cost,
but can process large data sets more efficiently.
The reasonable assumption is that large queries have a parallel plan,
so there is no need to employ the bitmap in non-parallel plans.
There are other reasons for super-scaling.
One is the set points for hash and sort operations spilling to tempdb.
The setting is fixed as a fraction of system memory for each thread.
At higher DOP, more memory is allocated across all threads of the parallel plan.
Below is the scaling from DOP 2 to 10, which removes the
wide fluctuation that occurs between DOP 1 to 2.
In the ideal parallelism case, scaling should be close to 5.
Overall, the DOP 2 to 10 scaling is more uniform,
except for Q15, in which the poor scaling from 1 to 2 is recovered
at DOP 4 and higher.
Q15 scaling from 1 to 4 is 3.346, which is not bad, good even.
Performance is largely comparable between HT enabled and disabled
for DOP levels up to the number of physical cores, 1-10 in this case.
It was reported in Hyper Threading
that queries ran 24% faster at DOP 1 with HT enabled than disabled.
This might seem paradoxical.
A speculative explanation is that the SQL Server engine tries to work
in batches that fit inside L2 cache.
There is an API for the L2 cache size, but how much of this should be assumed
to be available for use by a particular operation?
If the assumed value is too high, but the HT-enabled case reduces
this by half, then it is possible that DOP 1 with HT enabled could be faster.
Below is the effect of enabling Hyper-Threading (HT) with DOP 20, over
disabling HT in UEFI (formerly BIOS) for DOP 10.
Overall, the benefit of HT is strong at 1.23X.
Q2 shows a 0.9X reduction, going from 0.61 to 0.67sec runtime,
but this is probably just the effect of running a small query
at very high DOP.
It would be interesting to know which logical processors are employed
when DOP exceeds the number of physical cores on one socket,
in a multi-socket system with HT enabled.
Scaling over cores in different sockets is not as good
compared to all cores in one socket, so perhaps the 23% gain from HT
might be acceptable?
In queries 3 and 5, there is a good plan at DOP 1,
but less efficient plans for DOP 2 and higher.
The result is negative scaling to DOP 2,
and poor scaling at higher DOP relative to DOP 1.
The query optimizer model believes that two or more plans
have similar costs.
There are differences in how the cost for these plans vary with
the degree of parallelism, and hence the winning plan
changes with DOP.
Both Q3 and Q5 can be rewritten to force a particular plan.
Note that the special feature Date Correlation Optimization
was designed specifically for TPC-H,
and probably no production database uses it.
When a query plan is forced, the date correlation optimization
is also turned off.
Since we are interested in scaling for a particular plan,
the full set of implied date range SARGs are explicitly specified.
The figure below shows Q3, 5, and 7 scaling relative to DOP 1
for natural and forced execution plans.
In Q3 and Q5, the natural plan from DOP 1 is forced in the parallel plans.
This results in good scaling at all DOP levels.
Below is the execution plan for Query 3 at DOP 1 without other hints,
having cost 1349 and execution time 28.38 sec.
Below is the plan at DOP 2 and higher, without hints, having cost 1057 at DOP 2.
Execution time is 29.94 sec, slightly longer than at DOP 1.
On forcing the parallel plan to the same join order of the non-parallel
plan, the cost is 1248.
In actuality, the above plan, having the join order of the non-parallel plan,
executes faster at 8.564 sec versus 29.94 sec for the unforced parallel plan.
Below is the natural plan for query 5 at DOP 1, having cost 4445, and execution time 72.70sec.
The natural parallel plan for Q5 has the structure below.
In this case, the query was written to force the join order of the parallel
plan, and then set to DOP 1.
This is done so that the plan structure is easier to discern without excessive
complexity of the parallelism operators.
The plan cost at DOP 2 is 3243.31 and execution time 107.62 sec,
significantly longer than for DOP 1.
If the plan at DOP 2 is forced to the same join order as at DOP 1,
the plan cost is 3244.02, or 0.022% higher.
In fact, the forced plan is much quicker at 23.469 sec than both the DOP 1 plan
and the natural DOP 2 plan.
Q15 is the other query that has negative scaling from DOP 1 to 2,
but miraculously recovers at DOP 4 and higher relative to DOP 1.
This is a strange case.
The execution plans are similar at various DOP levels,
but there is a slight difference at DOP 2 that is not present at the other levels.
It is not always the case that the non-parallel plan structure
is better than the parallel plan structure.
Q7 has a super-scaling gain of 2.969X from DOP 1 to 2,
an outstanding gain of 1.908 from DOP 2 to 4,
then another super-scaling gain of 3.072 from DOP 4 to 8,
before trailing off, all with HT enabled.
However, Q7 performance is less at DOP 2-10 with HT disabled
than with HT enabled, when the results should be similar.
This results in Q7 showing an incredible 1.8X gain from HT disabled at DOP 10
to HT enabled at DOP 20.
With HT enabled in UEFI, the gain from DOP 10 to 20 is 1.25X,
which is very good but not irrationally high.
Below is the natural plan at DOP 1, cost 4780 and execution time 114.96sec.
It changes to another plan at DOP 2, and then a third plan at DOP 4,
for which the structure is shown below, but forced to DOP 1, cost 5840,
execution time 102.074sec.
The forced DOP 2 plan cost is 4614, execution time 31.895,
whereas the natural DOP 2 plan cost 3710.8, execution time 38.72.
In this case, it is the parallel plan at DOP 4 and higher
that runs faster at all DOP levels, rather than the DOP 1 or 2 plans.
Forcing the query to the DOP 4+ plan yields normal, good scaling.
Query Optimizer Cost Model
(this should be a separate article)
The root cause of most of these anomalies is that the cost formulas
for execution operations are based on a very simple, fixed model.
The model is adequate for transaction processing in which
queries have highly selective search arguments.
But the model is poor in assessing the index seek plus key lookup
versus scan cross-over point.
The SQL Server query optimizer employs a fixed model for IO cost:
320 IOPS for operations
assumed to be random (key lookups and loop join inner source),
and 1350 pages/sec × 8KB/page = 10,800KB/s for
sequential IO operations (scans and range seeks).
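The IO portion of the model can be sketched with the reciprocals of those rates as per-unit costs; the CPU components of plan cost are omitted here:

```python
# IO cost constants implied by the fixed model: 320 IOPS random,
# 1350 pages/sec sequential (8KB pages).
RANDOM_IO_COST = 1 / 320    # per key lookup or loop join inner access
SEQ_IO_COST = 1 / 1350      # per page scanned

def lookup_io_cost(rows: int) -> float:
    """Each qualifying row modeled as one random IO (key lookup)."""
    return rows * RANDOM_IO_COST

def scan_io_cost(pages: int) -> float:
    return pages * SEQ_IO_COST

def crossover_rows(pages: int) -> float:
    """Rows at which seek + key lookup IO cost equals a full scan:
    rows * (1/320) = pages * (1/1350), i.e. rows = pages * 320/1350."""
    return pages * SEQ_IO_COST / RANDOM_IO_COST

# For a 100,000 page table, the model switches to a scan at about
# 23,700 lookups, roughly 0.24 rows per page, regardless of the
# actual storage hardware behind the database.
rows_at_crossover = crossover_rows(100_000)
```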
The absolute values of random and sequential IOPS are not important,
as it is the ratio that affects the execution plan.
Furthermore, the cost model assumes that IO is saturated at DOP 1
for accesses to source tables and indexes.
A parallel plan does not reduce the IO cost in source accesses.
IO cost is reduced in accordance with DOP in intermediate hash and sort operations.
For storage on a small number of hard disks, IO is saturated at DOP 1.
A very large HDD array is capable of scaling sequential IO with parallelism.
A proper SSD array is capable of that and also scaling random IO.
Why does the cost model assume saturated IO for source accesses
but scalable IO for intermediate non-source operations?
It is as if one developer designed the model for source accesses, another
did the model for intermediate operations, but they did not talk to each
other to present a consistent picture.
The two developers may sit in adjacent cubicles.
A second issue is that the cost model underestimates row access cost,
and essentially does not model column and logic costs.
This is evident in the plan cost of a query doing a row count
versus aggregating one or more columns.
There is no difference between a count and any number of column aggregations.
There is a small cost assessed for the first logic operation
(for example, multiplying two columns),
but not for additional columns or logic.
In actuality, column and logic costs might be the larger part of an aggregation query.
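The limitation can be made concrete with the commonly cited scan CPU cost constants for the SQL Server model, treated here as assumptions (0.0001581 for the first row, 0.0000011 per additional row):

```python
# Commonly cited CPU cost constants for a scan in the SQL Server model
# (assumptions here, taken from published plan-cost observations).
FIRST_ROW = 0.0001581
PER_ROW = 0.0000011

def modeled_scan_cpu_cost(rows: int, aggregated_columns: int) -> float:
    # aggregated_columns is deliberately unused: the model charges
    # per row only, which is exactly the limitation being described.
    return FIRST_ROW + (rows - 1) * PER_ROW

count_star = modeled_scan_cpu_cost(10_000_000, aggregated_columns=0)
five_sums = modeled_scan_cpu_cost(10_000_000, aggregated_columns=5)
# identical modeled costs, though the real CPU work differs greatly
```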
The combined effect of the query optimizer cost model limitations
is that it has no meaningful capability to predict the
degree of scaling to be expected with parallelism.
There are only two controls: the cost threshold for parallelism,
and the max degree of parallelism.
The query optimizer employs parallelism
when the parallel plan has a model cost just slightly lower
than the non-parallel plan, and the non-parallel plan cost is
over the threshold.
This was not a huge issue back when big systems had a total of
four processor cores.
A modern system has up to 24 cores per socket in the current generation,
and typically multiple sockets.
The all-or-nothing strategy is seriously inadequate.
It is now important to have the ability to assess what degree
of parallelism to employ.
This should be determined based on the query cost and the expected gain.
A query that executes in a few seconds might employ
a low DOP, perhaps 2 or 4, but only if the gain from parallelism is sufficient.
For example, 40% could be considered sufficient, but 10% is not.
An intermediate-large query might employ up to all the cores in one socket,
again, so long as there is scaling with increasing DOP.
Very large queries might employ cores in all sockets,
but only if it can be predicted that the execution plan
scales well across multiple sockets.
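One possible shape for such a policy, sketched with hypothetical predicted runtimes and a 40% marginal-gain threshold (nothing like this exists in the current optimizer):

```python
# Step up the DOP only while the marginal gain clears a threshold.
def choose_dop(predicted_runtime: dict, min_gain: float = 1.4) -> int:
    """predicted_runtime maps DOP -> predicted seconds.
    Returns the highest DOP whose gain over the previous level
    meets min_gain (1.4 means 40% faster)."""
    dops = sorted(predicted_runtime)
    chosen = dops[0]
    for prev, cur in zip(dops, dops[1:]):
        if predicted_runtime[prev] / predicted_runtime[cur] >= min_gain:
            chosen = cur
        else:
            break
    return chosen

# Hypothetical predictions: good scaling to DOP 4, marginal beyond.
preds = {1: 10.0, 2: 6.0, 4: 3.8, 8: 3.5}
dop = choose_dop(preds)   # stops at DOP 4: 3.8/3.5 is only a 9% gain
```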
Page Compression, row store
Below is the compression overhead, at various DOP levels
relative to no compression.
(An earlier test, in SQL Server 2008 sp1, was on an 8-way quad-core Opteron
at SF10, all data in memory.)
Q5 has the good execution plan at DOP 1 and 2, before shifting to the
poor plan at DOP 4 and higher.
However, at DOP 2 the comparison is between a good plan on the database
with compressed tables and a poor plan on the database having tables without
compression, leading to the anomaly.
Overall, the compression overhead is a fairly uniform 1.2X at all DOP levels,
but can range from being nearly free to a high of 1.4X.
The individual values over 1.4X are probably anomalies rather than representative.
This is different than in 2008R2, in which compression overhead was 1.4X at DOP 1,
but then gradually decreased to being essentially free at higher DOP.
The test system back then was an 8-way Opteron quad-core, in which the effects
of NUMA might have been significant.
The hypothesized explanation was that without compression, there was contention
between threads as DOP increased, decreasing scaling.
The extra overhead of compression just happened to alleviate the contention.
Presumably the explanation now is that the SQL Server engine team did a good
job of removing contention where possible, so that there is better scaling
without compression, and the overhead of compression is no longer masked.
Or it could be due to the difference in systems, NUMA versus not.
Below is columnstore performance at various DOP, relative to row store
at the same DOP.
In some queries, it can be almost 30X better.
Overall, columnstore is more than 4X better than row store.
Note that columnstore at DOP 20 with HT enabled is very poor.
This is the expected behavior, because columnstore marches through memory
sequentially, so HT is expected to interfere with columnstore code.
The figure below shows the large negative impact of DOP 20 relative to DOP 10,
although four queries showed modest gain.
There was little difference between DOP 10 with HT disabled or enabled.
Enabling HT in UEFI is not harmful for columnstore,
so long as max DOP is limited to the number of physical cores
in columnstore queries.
The more proper comparison is between columnstore at DOP 10
and row store at DOP 20 with Hyper-Threading,
as shown below.
The overall gain for columnstore is now 3.45X.
Only Q13 shows no gain for columnstore, but no query is worse than row store.
Q13 was one of the few columnstore queries that did show gain with HT,
so columnstore is better than row store at matching DOP and HT.
The figure below shows columnstore scaling versus DOP.
Columnstore scaling is uniformly good from DOP 1 to 4, but becomes erratic
at DOP 8 and 10.
Parallelism is a complicated topic.
The main guidance points were highlighted at the beginning.
In addition, make few or no assumptions and always investigate.
Evaluate performance characteristics at different levels
of parallelism, DOP 1, 2, 4 and higher.
Watch for scaling anomalies, and changes in execution plan structure.
Be especially careful in comparing scaling past the number
of cores in a single socket.
Hyper-Threading does have benefit in parallel plans,
but mostly for row store.
It is probably advisable to enable HT in the UEFI/BIOS,
but to limit columnstore to only the number of physical cores.
It would be better if Microsoft did not employ HT for
columnstore unless they know exactly when HT might benefit.
Compression does have overhead, but might still be worthwhile.
The larger issue is that SQL Server seriously needs to overhaul
the query optimizer cost model.
Much of the SQL Server engine is very sophisticated, and yet
the cost model has not changed substantially since version 7 in 1998.
As is, the execution plan cost has weak bearing on reality,
and it has no ability to predict scaling with parallelism,
issues in scaling beyond a single socket, or when scaling to HT works.
Without a proper cost model, the benefit of a sophisticated query
optimizer is lost.
For the time being, developers and DBAs can (re-)learn the
method for writing SQL to a particular join order.
Server system memory capacities have grown to ridiculously large levels
far beyond what is necessary now that solid-state storage is practical.
Why is this a problem?
Because the requirement that memory capacity trumps other criteria has
driven system architecture to be focused exclusively on low cost DRAM.
DDR DRAM, currently in its fourth version,
has low cost, acceptable bandwidth,
but poor latency.
Round-trip memory access latency is now far more important
for database transaction processing.
The inverse of latency is practically a direct measure
of transaction performance.
In the hard disk era, the importance of memory in keeping IO
volume manageable did trump all other criteria.
In theory, any level of IO performance could be achieved
by aggregating enough HDDs.
But large arrays have proportionately more frequent drive failures.
Although RAID provides fault tolerance,
the failed drive rebuild process causes a period of degraded performance.
The practical upper bound for HDD storage
is on the order of 1000 disks, or 200K IOPS.
If it was necessary to purchase 1TB of memory to accomplish this,
then it was money well spent because the storage was
even more expensive.
Since then, storage with performance requirements
has been, or should be, transitioning to all-flash,
not some tiered mix of flash and HDD.
Modern solid-state arrays can handle up to 1M IOPS,
and far more, easily, when on a full NVMe stack.
Now is the time to re-evaluate just how much memory
is really needed when storage is solid-state.
There are existing memory technologies, RLDRAM and SRAM for example,
with different degrees of lower latency, higher cost
and lower density at both the chip and system level.
There is potential to reduce memory latency by a factor of two or more.
The performance impact for database transaction processing
is expected to be commensurate with the latency reduction.
A single-socket system with RLDRAM or SRAM memory could replace
a 4-socket system with DDR4.
Practically any IO volume, even several million IOPS,
can be handled by the storage system.
The key is that the CPU expended by the database engine
on IO be kept to a reasonable level.
An argument is made for a radical overhaul of current system
architecture, replacing DDR4 DRAM as main memory
with a different technology that has much lower latency,
possibly sacrificing an order of magnitude in capacity.
As the memory controller is integrated into the processor,
this also calls for a processor architecture change.
A proper assessment for such a proposal
must examine not only its own merits,
but also other technologies with the potential for order
of magnitude impact.
The discussion here mostly pertains to Intel processors
and Microsoft SQL Server.
However, the concepts are valid on any processor and database engines
built around page organization, row-store with b-tree indexes.
The argument for low latency memory is valid even in the
presence of Hekaton memory-optimized tables, or other MVCC implementations,
because it is a drop-in hardware solution instead of a database
re-architecture.
Of course, the heavy lifting is on Intel or others
to re-architect the system via processor and memory.
For more than forty years, DRAM has been the standard
for main memory in almost every computer system architecture.
A very long time ago, having sufficient memory to avoid paging
was a major accomplishment.
And so, the driving force in memory was the semiconductor technology
with low cost.
This led to DRAM, which was optimized to 1 transistor plus 1 capacitor (1T1C).
DRAM addressing was multiplexed because even the package
and signal count impact on cost mattered.
The page file has since become a relic, that for some reason
has yet to be removed.
Over the years, the mainstream form of DRAM, from SDRAM to DDR4,
did evolve to keep pace with progress on bandwidth.
The criterion that has changed little is latency,
and it is round-trip latency that has become critical in applications
characterized by pointer-chasing code resulting
in serialized memory accesses.
Even as memory capacity reached enormous proportions,
the database community routinely continued to configure systems
with the memory slots filled, often with the maximum
capacity DIMMs, despite the premium over the next lower capacity.
There was a valid reason for this.
Extravagant memory could mask (serious) deficiencies
in storage system performance.
Take the above in conjunction with SAN vendors providing helpful advice
such as log volumes do not need dedicated physical disks,
and the SAN has 32GB cache that will solve all IO performance problems.
Then there is the doctrine to implement their vision of storage as a service.
But now all-flash is the better technology for IO intensive storage.
Solid-state storage on an NVMe stack is even better.
Massive memory for the sole purpose of driving IO down to noise levels
is no longer necessary.
Now free of the need to cover-up for weakness in storage,
it is time to rethink system architecture,
particularly the memory strategy.
Not all applications desire low latency memory at lower capacity
and higher cost.
Some work just fine with the characteristics of DDR4.
Others prefer a different direction entirely.
The HPC community is going in the direction of extreme
memory bandwidth, implemented with MCDRAM in the latest Xeon Phi.
In the past, when Moore's Law was in full effect,
Intel operated on: "the Process is the Business Model".
It was more important to push the manufacturing process,
which meant having a new architecture every two years
with twice the complexity of the previous generation
and then be ready to shrink the new architecture to
a new process the next year.
Specialty products diverted resources from the main priority
and were difficult to justify.
But now process technology has slowed to a three-year cycle
and perhaps four years in the next cycle.
It is time to address major
products outside of general purpose computing.
The principle topic here is the impact of memory latency
on database transaction performance, and the
directions forward for significant advancement.
As the matter has complex entanglements, a number of different
aspects come into play.
One is scaling on non-uniform memory access (NUMA) system
architecture, in which the impact of memory access can be examined.
Memory technologies with lower latency are mentioned,
but these topics are left to experts in their respective fields.
Hyper-Threading is a work-around to the issue of memory latency,
so it is mentioned.
The general strategy for server system performance in recent years
has been ever more cores.
But it is also important to consider which is
the right core as the foundation.
Another question to answer is whether a hardware and/or software solution
should be employed.
On software, Multi-version Concurrency Control (MVCC) is the technology
employed by memory-optimized databases
(the term in-memory
has misleading connotations)
that can promise large performance gains,
and even more extreme gains when coupled with natively compiled procedures.
Architecting the database for scaling on NUMA architecture
is also a good idea.
A mixed item is SIMD, the SSE/AVX registers and instructions
introduced over the last 18 years.
From the database transaction processing point of view,
this is not something that would have been pursued.
But it has come to occupy a significant chunk of real estate
on the processor core.
So, find a way to use it! Or give it back.
The list is as follows:
- Scale up on NUMA versus single socket
- RLDRAM/SRAM versus DDR4+3D XPoint
- Hyper-Threading to 4-way
- Core versus Atom and all comers?
- Hekaton, memory-optimized, MVCC
- Database architected for NUMA
- SSE/AVX Vector Instructions
The above topics will be addressed, perhaps only partially,
often out of order, and sometimes mixed as appropriate.
Factors to consider are potential impact,
difficulty of implementation
and overall business justification.
L3 and Memory Latency
The 7-cpu benchmark website lists L3 and memory latency for some recent processors, shown below.
| Processor | Base Freq | L3 | Local Memory | Remote Memory |
|---|---|---|---|---|
| Westmere EP (Xeon X5650) | 2.67GHz | 40-42 cycles | L3+67ns | L3+105ns |
| Ivy Bridge (i7-3770) | 3.4GHz | 30 cycles | L3+53ns | |
| Haswell (i7-4770) | 3.4GHz | 36 cycles | L3+57ns | |
| Haswell (E5-2603 v3) | 1.6GHz | 42-43 cycles | | |
| Skylake (i7-6700) | 4.0GHz | 42 cycles | L3+51ns | |
There are a number of questions.
Is L3 latency determined by absolute time or cycles?
In the Intel Xeon processors from Ivy Bridge EP/EX on,
this is particularly complicated because there are 3 die layouts,
LCC, MCC, and HCC,
each with different structure and/or composition.
The Intel material says that L3 latency is impacted
by cache coherency snoop modes and ring structure.
See the Intel Dev Conference 2015
Processor Architecture Update,
and the similar Intel
HPC Software Workshop 2016 Barcelona material.
For simplicity, L3 is assumed to be 15ns here, without consideration for these complications.
DRAM, SQL Server 8KB Page
The Crucial web site and Micron datasheet cite DDR4-2133 latency at 15-15-15,
working out to 14ns each for CL, RCD and RP, so random row latency is 42ns at
the DRAM interface.
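The arithmetic: DDR4-2133 runs a 1066.5MHz command clock (two transfers per clock), so each timing cycle is just under a nanosecond:

```python
# DDR4-2133: 2133 MT/s, two transfers per clock -> 1066.5 MHz clock.
MT_PER_S = 2133e6
clock_hz = MT_PER_S / 2
cycle_ns = 1e9 / clock_hz            # ~0.938 ns per timing cycle

cl = rcd = rp = 15                   # DDR4-2133 at 15-15-15
per_timing_ns = cl * cycle_ns        # ~14.1 ns each for CL, tRCD, tRP
random_row_ns = (cl + rcd + rp) * cycle_ns   # ~42 ns closed-row access
```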
The Intel web page for their
Memory Latency Checker
utility shows an example having local node latency as 67.5 or 68.5 ns
and remote node as 125.2 or 126.5 ns.
L3 and memory latency on modern processors is a complicated matter.
The Intel values above include the L3 latency.
So, by implication, the 10ns difference is the transmission time from memory controller
to DRAM and back?
One of the modes of DRAM operation is to open a row,
then access different columns, all on the same row,
with only the CL time between columns.
I do not recall this being in the memory access API?
Is it set in the memory controller?
Systems designed for database servers might force close a row immediately?
In accessing a database page, first, the header is read.
Next, the row offsets at the end of the 8KB page.
Then the contents of a row, all or specific columns.
This should constitute a sequence of successive reads
all to the same DRAM row?
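If those successive reads do land on the same DRAM row, the timing difference is substantial. A sketch using the 14ns-per-timing figure from above, assuming an idealized open-row policy:

```python
# DRAM interface time for a sequence of reads, using the DDR4-2133
# figure of ~14ns per timing component (CL = tRCD = tRP assumed equal).
T = 14.0   # ns per timing component

def access_sequence_ns(n_accesses: int, open_row: bool) -> float:
    if open_row:
        # first access opens the row (tRP + tRCD + CL); subsequent
        # column reads on the open row pay CL alone
        return 3 * T + (n_accesses - 1) * T
    # closed-row policy: every access pays the full ~42ns
    return n_accesses * 3 * T

# header, row offsets, then row contents: four reads to one page
same_row = access_sequence_ns(4, open_row=True)    # 42 + 3*14 = 84 ns
closed = access_sequence_ns(4, open_row=False)     # 4*42 = 168 ns
```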
Memory Architecture - Xeon E5 and E7
The most recent Core i3/5/7 processors run in the mid-4GHz range.
There are Xeon EP and EX processors with base frequency in the 3GHz+ range.
But for the high core count models, 2.2GHz is the common base frequency.
Below is a representation of the Xeon E5 memory subsystem.
In the Xeon E7, the memory controller connects to a scalable memory buffer (SMB)
or Memory Extension Buffer (MXB), depending on which document,
and the interface between MC and SMB is SMI.
The SMB doubles the number of DIMMs that can connect to each memory channel.
There is no mention of the extra latency for the SMB.
It cannot be free,
"All magic comes with a price".
The value of 15ns is used here for illustrative purposes.
Anyone with access to both 2-way Xeon E5 and 4-way E7 systems
of the same generation and core count model (LCC, MCC or HCC)
is requested to run the 7-cpu benchmark
or Intel Memory Latency Checker
utilities, and make the results known.
In principle, if Intel were feeling helpful, they would do this.
Below is a single socket Xeon E5 v4.
The high core count (HCC) model has 24 cores.
But the Xeon E5 v4 series does not offer a 24-core model.
Two cores are shown as disabled as indicated, though it could be any two?
There is only one memory node, and all memory accesses are local.
Below is a 2-socket system, representing the Xeon E5
either HCC or MCC models in having two double rings,
and the interconnect between rings.
The details shown are to illustrate the nature of NUMA.
NUMA is a complicated topic, with heavy discussion on cache coherency,
the details of which impact L3 and memory latency.
The discussion here will only consider a very simple model,
looking only at the physical-distance impact on memory latency.
See the NUMA Deep Dive Series
by Frank Denneman.
Each socket is its own memory node
(not sure what happens in Cluster-on-Die mode).
To a core in socket 0, memory in node 0 is local, memory in node 1 is remote,
and vice versa to a core in the other socket
(more diagrams at System Architecture Review 2016).
Below is a representation of a 4-socket system based on the Xeon E7 v4 memory architecture.
To a core in socket 0, memory in node 0 is local,
memory in nodes 1, 2, and 3 are remote.
And repeat for other nodes.
It is assumed that memory accesses are 15ns longer than
on the E5, due to the extra hop through the SMB outbound and inbound.
This applies to both local and remote node memory accesses.
On system and database startup, threads should allocate memory on the local node.
After the buffer-cache has been warmed up,
there is no guarantee that threads running in socket 0
will primarily access pages located in memory node 0,
or threads on socket 1 to memory node 1.
That is, unless the database has been purposefully architected with a plan
in mind to achieve higher than random memory locality.
The application also needs to be built to the same memory locality plan.
A connection to SQL Server will use a specific TCP/IP port number
based on the key value range.
SQL Server will have TCP/IP ports mapped to specified NUMA nodes.
Assuming the application does the initial buffer-cache warm up,
and not some other source that is not aware of the NUMA tuning,
there should be alignment of threads to pages on the local node.
(It might be better if an explicit statement specifies the preferred NUMA node
by key value range?)
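The application-side half of this alignment scheme can be sketched as follows; the port numbers and key ranges are hypothetical, and the SQL Server side would bind each port to a NUMA node via its port-affinity configuration:

```python
# Hypothetical key-range-to-port routing for NUMA alignment.
# Each port is assumed to be mapped to one NUMA node on the server.
NODE_PORTS = {0: 1501, 1: 1502, 2: 1503, 3: 1504}

def port_for_key(key: int, n_nodes: int = 4,
                 keys_per_node: int = 250_000) -> int:
    """Route a connection by key range, so threads land on the node
    whose memory holds that range's pages."""
    node = (key // keys_per_node) % n_nodes
    return NODE_PORTS[node]

# keys 0-249,999 -> port 1501, 250,000-499,999 -> 1502, and so on,
# wrapping around for higher ranges
```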
An example of how this is done is in the TPC-C and TPC-E benchmark full disclosures
and supplemental files.
Better yet is to get the actual kit from the database vendor.
For SQL Server, the POC is JR
Curiously, among the TPC-E full disclosure and supplemental files'
thousands of pages of inane details, there is nothing
that says that the NUMA tuning is of pivotal importance.
It is assumed in the examples here that threads access random pages,
and memory accesses are evenly distributed over the memory nodes.
Simple Model for Memory Latency
In Memory Latency, NUMA and HT,
a highly simplified model for the role of memory latency in
database transaction processing is used to demonstrate the
differences between a single-socket system having uniform memory
and multi-socket systems having NUMA architecture.
Using the example of a hypothetical transaction that could execute in 10M cycles
with "single-cycle" memory,
the "actual" time is modeled for a 2.2GHz core in which 5% of instructions
involve a non-open page memory access.
The model assumes no IO, but IO could be factored in if desired.
The Xeon E5 memory model is used for 1 and 2 sockets,
and the E7 for the 4-socket system.
| GHz | L3+mem | remote | skt | avg. mem | mem cycles | fraction | tot cycles | tps/core | tot tx/sec |
|-----|--------|--------|-----|----------|------------|----------|------------|----------|------------|
If there were such a thing as single-cycle memory,
the performance would be 220 transactions per second,
based on 2,200M CPU-cycles per second and 10M cycles per transaction.
Based on 67ns round-trip memory access,
accounting for a full CL+tRCD+tRP,
the transmission time between processor and DRAM,
and L3 latency,
incurred in 0.5M of the 10M "instructions",
the transaction now completes in 83.2M cycles.
The balance of 73.2M cycles are spent waiting for memory accesses to complete.
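The arithmetic of the model is worth making explicit:

```python
# Simplified memory latency model: 10M cycles of work at 2.2GHz,
# with 5% of instructions incurring a 67ns round-trip memory access
# (the access replaces the instruction's single cycle).
freq_hz = 2.2e9
work_cycles = 10e6
mem_fraction = 0.05
mem_latency_ns = 67.0

mem_accesses = work_cycles * mem_fraction          # 0.5M accesses
mem_cycles_each = mem_latency_ns * 1e-9 * freq_hz  # ~147.4 cycles
extra_cycles = mem_accesses * (mem_cycles_each - 1)
total_cycles = work_cycles + extra_cycles          # ~83.2M cycles

tps_ideal = freq_hz / work_cycles    # 220 tps with "single-cycle" memory
tps_actual = freq_hz / total_cycles  # ~26 tps per core
```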
This circumstance arises primarily in pointer-chasing code,
where the contents of one memory access determines the next action.
Until the access completes, there is nothing else for the thread to do.
The general advice is to avoid this type of coding,
except that this is what happens in searching a b-tree index.
If the impact of NUMA on database transaction processing performance
were understood and clearly communicated,
databases could have been architected from the beginning to work with
the SQL Server NUMA and TCP/IP port mapping features.
Then threads running on a given node primarily access pages local to that node.
If this forethought had been neglected, then one option is to re-architect
both the database and application,
which will probably involve changing the primary key
of the core tables.
Otherwise, accept that scaling on multi-socket systems is
not going to be what might have been expected.
Furthermore, the Xeon E7 processor, commonly used in 4-socket systems,
has the SMB feature for doubling memory capacity.
As mentioned earlier, this must incur some penalty in memory latency.
In the model above, scaling is:
1P -> 2P = 1.45X, 2P -> 4P = 1.56X and
1P -> 4P = 2.26X
The estimate here is that the SMB has an 11% performance penalty.
If the doubling of memory capacity (or other functionality)
was not needed, then it might have been better to leave off the SMB.
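The scaling figures can be reproduced from the same model, with the 67/125ns local/remote latencies, accesses spread evenly over the memory nodes, and the 15ns SMB hop as an assumption:

```python
# Extend the latency model to NUMA: 10M work cycles, 5% memory
# accesses, 2.2GHz; 67ns local / 125ns remote, plus an assumed 15ns
# SMB hop on the E7; accesses spread evenly over the memory nodes.
def tx_cycles(avg_latency_ns: float, freq_hz: float = 2.2e9,
              work: float = 10e6, frac: float = 0.05) -> float:
    per_access = avg_latency_ns * 1e-9 * freq_hz
    return work + work * frac * (per_access - 1)

c1 = tx_cycles(67)                                    # 1P: all local
c2 = tx_cycles((67 + 125) / 2)                        # 2P: half remote
c4 = tx_cycles(0.25 * (67 + 15) + 0.75 * (125 + 15))  # 4P E7 with SMB

scale_1_to_2 = 2 * c1 / c2    # ~1.45X
scale_2_to_4 = 2 * c2 / c4    # ~1.56X
scale_1_to_4 = 4 * c1 / c4    # ~2.26X

# SMB penalty: the same 4P model without the 15ns hop
c4_no_smb = tx_cycles(0.25 * 67 + 0.75 * 125)
smb_penalty = 1 - c4_no_smb / c4   # ~11%
```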
There is a 4-way Xeon E5 4600-series, but one processor is
two hops away, which introduces its own issues.
There is a paucity of comparable benchmark results to support meaningful quantitative analysis.
In fact, it would seem that the few benchmarks available employ configuration variations
with the intent to prevent insight.
Below are TPC-E results from Lenovo on Xeon E5 and 7 v4, at 2 and 4-sockets respectively.
| Processor | Sockets | cores | threads | memory | data storage | tpsE |
|---|---|---|---|---|---|---|
| E5-2699 v4 | 2 | 44 | 88 | 512GB (16x32) | 3x17 R5 | 4,938.14 |
| E7-8890 v4 | 4 | 96 | 192 | 4TB (64x64) | 5x16 R5 | 9,068.00 |
It would seem that scaling from 2P to 4P is outstanding at 1.915X.
But there is a 9% increase in cores per socket from 22 to 24.
Factoring this in, the scaling is 1.756X, although scaling versus core count
should be moderately less than linear.
Then there is the difference in memory, from 256GB per socket to 1TB per socket.
Impressive, but how much did it actually contribute?
Or did it just make up for the SMB extra latency?
Note that TPC-E does have an intermediate level of memory locality,
to a lesser extent than TPC-C.
The details of the current Intel Hyper-Threading implementation are not examined here.
The purpose of Hyper-Threading is to make use of the dead cycles
during memory access or other long latency operations.
Hyper-threading is an alternative solution to
the memory latency problem.
Work arounds are good, but there are times when it is necessary
to attack the problem directly.
Single thread performance is important, perhaps second after overall system throughput.
Given that a memory round-trip access takes more than 150 CPU cycles,
it is surprising that Intel only implements 2-way HT.
The generic term is simultaneous multi-threading (SMT).
IBM POWER8 is at 8-way, up from 4-way in POWER7.
SPARC has been 8-way for a few generations?
There is a paper on one of the two RISC processors
stating that there were technical challenges in SMT at 8-way.
So, a 4-way HT is reasonable.
This can nearly double transaction performance.
The effort to increase HT from 2-way to 4-way should not be prohibitive.
Given the already impossibly complex
nature of the processor,
"difficult should be a walk in the park".
It might help if there were API directives from the operating
system to the processor marking code that runs well with HT and code that does not.
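Under the simple latency model above, a single thread keeps a core busy only about 12% of the time, which is the opening for SMT. The sketch below is a naive upper bound that ignores contention for shared core resources (the measured row store HT gain of 1.23X is far below it):

```python
# In the latency model, a core does useful work 10M of 83.2M cycles
# per transaction, ~12% utilization; the rest is memory stall.
# N-way SMT overlaps stalls from N threads, so relative throughput
# grows nearly linearly until the core saturates.
def smt_throughput(n_threads: int,
                   busy_fraction: float = 10 / 83.2) -> float:
    """Relative throughput versus one thread, capped where the core
    runs out of dead cycles to fill with other threads' work."""
    return min(n_threads, 1 / busy_fraction)

t2 = smt_throughput(2)    # 2.0: the bound behind "nearly double"
t4 = smt_throughput(4)    # 4.0: the case for 4-way SMT
t16 = smt_throughput(16)  # capped at ~8.3, the saturation point
```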
One other implication of the memory latency effect is that
scaling versus frequency is poor.
The origin of the memory latency investigation
was an incident
in which a system UEFI/BIOS update reset the processors
to power-save mode, changing base frequency from 2.7GHz to 135MHz.
There was a 3X increase in worker (CPU) time on key SQL statements.
A 20X change in frequency for 3X performance.
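That outcome is consistent with the latency model, using this article's parameters (the incident system's actual per-transaction numbers are not known):

```python
# Compute time scales with frequency; memory stall time does not.
# Assumed parameters: 10M work cycles, 0.5M accesses at 67ns each.
def tx_time_ms(freq_ghz: float, work_cycles: float = 10e6,
               accesses: float = 0.5e6, latency_ns: float = 67.0) -> float:
    compute_ms = work_cycles / (freq_ghz * 1e9) * 1e3
    stall_ms = accesses * latency_ns * 1e-6
    return compute_ms + stall_ms

slow = tx_time_ms(0.135)   # ~108 ms per transaction at 135MHz
fast = tx_time_ms(2.7)     # ~37 ms at 2.7GHz
ratio = slow / fast        # ~2.9X, despite a 20X frequency difference
```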
In other words, do not worry about the lower frequency of the
high core count processors.
They work just fine.
But check the processor SKUs carefully, certain models do not
have Hyper-Threading which is important.
It also appears that turbo-boost might be more of a problem than a benefit.
It might be better to lock processors to the base frequency.
Hekaton - MVCC
Contrary to popular sentiment,
putting the entire database into memory on a traditional engine
having page organization and row-store does not make much of
a direct contribution to performance
over a far more modest buffer cache size.
This is why database vendors have separate engines
for memory-optimized operation,
Hekaton in the case of Microsoft SQL Server.
To achieve order of magnitude performance gain, it was necessary to
completely rethink the architecture of the database engine.
A database entirely in memory can experience substantial performance
improvement when the storage system is seriously inadequate to meet
the IOPS needed.
When people talk about in-memory being "ten times" faster,
what is meant is that if the database engine had been designed around
all data residing in memory, it would be built in a very different way
than the page and row structure implemented by INGRES in the 1970s.
Now that the memory-optimized tables feature, aka Hekaton for Microsoft
SQL Server, is available,
are there still performance requirements that have not been met?
Memory-optimized tables, and their accompanying natively compiled
procedures, are capable of unbelievable performance levels.
All we have to do is re-architect the database to use memory-optimized tables,
and then rewrite the stored procedures for native compilation.
This can be done!
And it should be done, when practical.
In many organizations, the original architects have long retired,
departed, or gone the way of Dilbert's Wally (1999 Y2K episode).
The current staff developers know that if they touch something
and it breaks, they own it.
So, if there were a way to achieve significant performance gain,
with no code changes, just by throwing money at the problem,
then there would be interest, and money.
There is not a TPC-E result for Hekaton.
This will not happen without a rule change.
The TPC-E requirement is that database size scales with the tpsE rating.
The 4-way Xeon E7 v4 result of 9,068 tpsE corresponds to a database
size of 37,362GB.
The minimum expectation from Hekaton is 3X, pointing to a 100TB database.
A rule change for this should be proposed.
Allow a "memory-optimized" option to run with a database
smaller than memory capacity of the system.
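The sizing argument above can be sketched as follows, assuming database size scales linearly with the tpsE rating (an illustrative simplification, not the official TPC-E sizing formula):

```python
# Back-of-envelope TPC-E sizing, assuming database size scales linearly
# with the tpsE rating (illustrative, not the official sizing formula).
tpse = 9068                       # 4-way Xeon E7 v4 result cited above
db_gb = 37362                     # corresponding database size, GB
gb_per_tpse = db_gb / tpse        # ~4.12 GB per tpsE
hekaton_tpse = 3 * tpse           # assume a minimum 3X gain from Hekaton
hekaton_db_tb = hekaton_tpse * gb_per_tpse / 1000
print(round(gb_per_tpse, 2), round(hekaton_db_tb))  # ~4.12 GB/tpsE, ~112 TB
```

At ~112TB, well beyond practical memory capacity, the 3X-faster result cannot run "memory-optimized" under the current rule, hence the proposed rule change.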
About as difficult as MVCC, less upside?
Potentially the benefits of NUMA scaling and Hekaton
could be combined if Microsoft exposed a mechanism
for how key values map to a NUMA node.
It would be necessary for a collection of tables
with compound primary key to have the same lead column
and that the hash use the lead column in determining NUMA node?
The major DRAM companies are producing DDR4 at the 4Gb die level.
Samsung has an 8Gb die.
Micron has a catalog entry for 2×4Gb die in one package as an 8Gb product.
There can be up to 36 packages on a double-sided DIMM, 18 on each side.
The multiple chips/packages form a 64-bit word plus 8-bits for ECC,
capable of 1-bit error correction and 2-bit error detection.
A memory controller might aggregate multiple-channels
into a larger word and combine the ECC bits to allow for more
sophisticated error correction and detection scheme.
A 16GB non-ECC DDR4 DIMM sells for $100 or $6.25 per GB.
The DIMM is composed of 32 × 4Gb die, be it 32 single-die packages
or 16 two-die packages.
The 16GB ECC U- or RDIMM, consisting of 36 × 4Gb die, is $128,
for $8 per data GB or $7.11 per raw GB (data+ECC).
There is a slight premium for ECC parts, but much less than it was in the past,
especially with fully buffered DIMMs that had an XMB chip on the module.
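The $/GB figures above work out as follows (a sketch of the arithmetic, using the quoted street prices):

```python
# A sketch of the $/GB arithmetic for the DIMM prices quoted above.
non_ecc_16gb = 100          # $ for a 16GB non-ECC DDR4 DIMM
ecc_16gb = 128              # $ for the 16GB ECC U/RDIMM
print(non_ecc_16gb / 16)    # 6.25 $/GB non-ECC
print(ecc_16gb / 16)        # 8.0 $/GB over the 16GB data capacity
# ECC adds 8 bits per 64, so the module carries 18GB of raw DRAM.
print(round(ecc_16gb / 18, 2))  # 7.11 $/GB raw
```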
The 4Gb die + package sells for less than $3.
The Memory Guy
estimates that a 4Gbit DRAM die needs to be 70mm2 to support
a price in this range.
The mainstream DRAM is a ruthlessly competitive environment.
The 8Gbit DDR4 package with 2×4Gb die allows 32GB ECC DDR4 to sell for $250,
no premium over the 16GB part.
The 64GB ECC RDIMM (72GB raw) is priced around $1000.
This might indicate that it is difficult to put 4×4Gb die in one package,
or that the 8Gb die sells for $12 compared to $3 for the 4Gb die.
Regardless, it is possible to charge a substantial premium in
the big capacity realm.
One consequence of the price competitiveness in the mainstream DRAM
market is that cost cutting is an imperative.
Multiplexed row and column address lines originated in 1973,
allowing for lower cost package and module product.
Twenty years ago, there was discussion of going back to a full-width address,
but no one was willing to pull the trigger on this.
The only concession for performance in mainstream DRAM was increasing
bandwidth by employing multiple sequential word accesses,
starting with DDR to the present DDR4.
Reduced latency DRAM appeared in 1999
for applications that needed lower latency than mainstream DRAM,
but at lower cost than SRAM.
One application is in high-speed network switches.
The address lines on RLDRAM are not multiplexed.
The entire address is sent in one group.
RLDRAM allows low latency access to a particular bank.
That bank cannot be accessed again for the normal DRAM period?
But the other banks can, so the strategy is to access banks in a round-robin fashion if possible.
A 2012 paper, lead author Niladrish Chatterjee,
has a discussion of RLDRAM.
The Chatterjee paper mentions that
RLDRAM employs many small arrays, sacrificing density for latency.
Bank-turnaround time (tRC) is 10-15ns compared to 50ns for DDR3.
The first version of RLDRAM had 8 banks,
while the contemporary DDR (just DDR then) had 2 banks.
Both RLDRAM3 and DDR4 are currently 16 banks,
but the banks are organized differently?
Micron currently has a 1.125Gb RLDRAM 3 product in x18 and x36.
Presumably the extra bits are for ECC, 4 x18 or 2 x36 forming a 72-bit path
to support 64-bit data plus 8-bit ECC.
The mainstream DDR4 8Gbit 2-die package from Micron comes in
a 78-ball package for x4 and x8 organization, and 96-ball for x16.
The RLDRAM comes in a 168-ball package for both x18 and x36.
By comparison, GDDR5 8Gb at 32-wide comes in a 170-ball BGA,
yet has multiplexed address?
The package pin count factors into cost,
and also in the die size because each signal needs to be boosted before
it can go off chip?
Digi-Key lists a Micron 576M RLDRAM3 part at $34.62, or $554/GB w/ECC,
compared with DDR4 at $8 or $14/GB also with ECC, depending on the module.
At this level, RLDRAM is 40-70 times more expensive than DDR4.
A large part of this is probably because the RLDRAM is quoted
as a specialty low-volume product at high margins,
while DDR4 is quoted on razor-thin margins.
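The 40-70X ratio follows from the quoted prices; a sketch of the arithmetic, assuming the 576Mb x18/x36 part carries 64MB of data plus ECC:

```python
# Rough check of the RLDRAM vs DDR4 price ratio quoted above.
# A 576Mb x18/x36 part is 72MB raw, or 64MB of data after ECC.
rldram_part = 34.62                       # $ per 576Mb RLDRAM3 part
rldram_per_gb = rldram_part * 1024 / 64   # scale 64MB up to 1GB
print(round(rldram_per_gb))               # ~554 $/GB, matching the quote
for ddr4_per_gb in (8, 14):
    print(round(rldram_per_gb / ddr4_per_gb))  # ~69X and ~40X
```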
The top RLDRAM at 1.125Gb capacity might reflect the size needed
for high-speed network switches
or it might have comparable die area to a 4Gb DDR?
There are different types of SRAM.
High-performance SRAM has 6 transistors, 6T.
Intel may use 8T
(Intel Labs at ISSCC 2012)
or even 10T for low power
(see Real World Tech on NTV).
It would seem that SRAM should be six or eight times less dense than DRAM,
depending on the number of transistors in SRAM, and the size of the capacitor in DRAM.
There is a Micron slide in
Micro 48 Keynote III
that says SRAM does not scale
on manufacturing process as well as DRAM.
Instead of 6:1, or 0.67Gbit SRAM at the same die size
as 4Gbit DRAM, it might be 40:1, implying 100Mbit in equal area?
Another source says 100:1 might be appropriate.
Eye-balling the Intel Broadwell 10-core (LCC) die,
the L3 cache is 50mm2,
listed as 25MB.
It includes tags and ECC on both data and tags?
There could be 240Mb or more in the 25MB L3?
Then 1Gb could fit in a 250mm2 die,
plus area for the signals going off-die.
Digi-Key lists Cypress QDR IV 144M (8M×18, 361 pins) in the $235-276 range.
This is $15K per GB w/ECC.
It is reasonable to assume that prices for both RLDRAM and QDR SRAM
are much lower when purchased in volume?
The lowest price for an Intel processor on the Broadwell LCC die of 246mm2 is $213
in a 2011-pin package.
This would suggest SRAM at south of $1800 per GB.
While the ultra-high margins on high-end processors are desirable,
it is just as important to fill the fab to capacity.
So, SRAM at 50% margin is justified.
We could also estimate SRAM at 40X that of DRAM, per the Micron
assertion of relative density, pointing to $160-320 per GB.
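The two SRAM price estimates above can be checked as follows; both routes rest on the assumptions stated in the text (1Gb of SRAM in ~250mm2, and a 40:1 density ratio with DRAM at $4-8/GB):

```python
# Hypothetical SRAM $/GB estimates implied by the figures above.
# 1) Die-area route: a ~246mm2 Broadwell LCC sells for $213, and ~1Gb
#    of SRAM might fit in ~250mm2, so $213 buys roughly 1Gb (1/8 GB).
print(213 * 8)                # 1704 $/GB, i.e. "south of $1800"
# 2) Density route: SRAM at ~40X DRAM cost/GB, with DRAM at $4-8/GB.
for dram_per_gb in (4, 8):
    print(40 * dram_per_gb)   # 160 and 320 $/GB
```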
Graphics and High-Bandwidth Memory
Many years ago, graphics processors diverged from mainstream DRAM.
Their requirement was for very high bandwidth at a smaller capacity
than main memory, plus other features to support the memory
access patterns in graphics.
GDDR is currently on version 5, at density up to 8Gbit, with a x32
wide path (170-ball package), versus x4, x8 and x16 for mainstream DDR4.
More recently, High Bandwidth Memory (HBM) is promoted by AMD,
and Hybrid Memory Cube by Micron.
High bandwidth memory is not pertinent to databases,
but it does provide perspective on when there is a need to
go a separate road from mainstream memory.
Databases on a page-row type engine do not come close to testing the limits
of DDR4 bandwidth.
This is true for both transaction processing and DW large table scans.
For that matter, neither does column-store, probably because of the CPU cycles
consumed per byte. I may have to take this back.
DDR4-2133 bandwidth is 17GB/s per channel, and 68GB/s over 4 channels.
(GB is always decimal by default for rates, but normally binary for size.)
A table scan with simple aggregation from memory is what now?
It was 200MB/s per core in Core 2 days, 350MB/s in Westmere.
Is it 500 or 800MB/s per core now?
It is probably more likely to be 500,
but let's assume 800MB/s here.
Then 24 cores (Xeon E7 v4 only, not E5) consume 19.2GB/s
(HT does not contribute in table scans).
This is still well inside the Xeon E5/7 memory bandwidth.
But what if this were read from storage?
A table scan from disk is a write to memory, followed by a read.
DDR writes to memory at the clock rate, i.e., one-half the MT/s rate.
So the realized table scan rate effectively consumes 3X of the MT/s value,
which is 57.6GB/s here.
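The bandwidth arithmetic above can be laid out as follows (using the 800MB/s-per-core assumption from the text; integer MB/s keeps the arithmetic exact):

```python
# Table scan bandwidth arithmetic from the discussion above.
per_channel_mb = 17_000                   # DDR4-2133, MB/s per channel
channels = 4
print(per_channel_mb * channels / 1000)   # 68.0 GB/s per socket
scan_mb = 800 * 24                        # assumed 800MB/s/core, 24 cores
print(scan_mb / 1000)                     # 19.2 GB/s scanning from memory
# From storage, a scan is a write to memory plus a read; DDR writes run
# at half the transfer rate, so effective traffic is ~3X the scan rate.
print(3 * scan_mb / 1000)                 # 57.6 GB/s
```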
To pursue the path of low latency memory,
it is necessary to justify the cost and capacity
structure of alternative technologies.
It is also necessary that the opportunity be worthwhile
to justify building one more specialty processor
with a different memory controller.
And it may be necessary to work with operating system
and database engine vendors to all be aligned
in doing what is necessary.
The Chatterjee et al.
paper for Micro45 (microarch.org) shows RLDRAM3 improving throughput
by 30% averaged across 27 of the 29 components of the SPEC CPU 2006 suite, integer and fp.
MCF shows greatest gain at 2.2X.
Navigating the B-tree index should show very high gain as well.
The cost can be justified as follows.
An all-in 4-way server has the following processor and memory cost.
|Processor||4×E7-8890 v4||$7,174 ea.||$28,700|
|Memory||4TB, 64×64GB||$1,000 ea.||$64,000|
If the above seems excessive, recall that there was a time
when some organizations were not afraid to spend $1M on the
processor and memory complex,
or sometimes just for the 60+ processors
(sales of 1000 systems per year?).
That was in the hope of having amazing performance.
Except that system vendors neglected to stress the importance
of dedicated disks for logs, and SAN vendors continued to sell
multi-million-dollar storage without stressing it either.
If a more expensive low latency memory were to be implemented,
the transmission time between the memory controller and DIMM,
estimated to be 10ns earlier,
should be revisited.
A RLDRAM system might still have the DIMM slot arrangement currently
in use, but other options should be considered.
An SRAM main memory should probably be in an MCM module,
or some other advanced package
(TSMC Hot Chips 28).
This is if enough SRAM can be stuffed into a module or package.
It would also require that the processor and memory be sold as a single
unit, instead of memory being configured later.
In the case of SRAM, the nature of the processor L3 probably needs to be rethought.
Before SSDs, the high-end 15K HDDs were popular with storage
performance experts who understood that IOPS mattered more than capacity.
In a "short-path to bare metal" disk array,
the 15K HDD could support 200 IOPS at queue depth 1 per HDD
with low latency (5ms).
It should be possible to assemble a very large array of 1,000
disks, capable of 200,000 IOPS.
It is necessary to consider the mean-time-between-failure (MTBF),
typically cited as over 1M-hours.
There are 8,760 hours in a 365-day year.
At 1M-hr MTBF, the individual disk annual failure rate is 0.876%.
An array of 1000 disks is expected to see about 9 failures per year.
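The failure arithmetic above is simply the hours in a year divided by the MTBF:

```python
# The disk failure arithmetic from the MTBF figures above.
mtbf_hours = 1_000_000
hours_per_year = 8760
print(hours_per_year / mtbf_hours)           # 0.00876, i.e. 0.876% per disk-year
print(1000 * hours_per_year / mtbf_hours)    # 8.76 expected failures per year
```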
Hard disks in RAID groups will continue to operate
with a single or sometimes multiple disk failures.
However, rebuilding a RAID group from the failed drive
could take several hours, and performance is degraded
in this period.
It is not operationally practical to run on a very
large disk array.
So the recommendation was to fill the memory slots with big DIMMs,
and damn the cost.
The common convention used to be that a NAND controller with
SATA on the upstream side would have 8 channels on the NAND side.
A PCI-E controller would have 16 or 32 NAND channels for x4 and x8 respectively.
On the downstream side, a NAND channel could have
1 or 2 packages.
There could be up to 8 chips in a package.
A NAND chip may be divided into 2 planes,
and each plane is functionally an independent entity.
An SSD with 8 packages could have 64 NAND chips comprised of 128 planes.
The random IOPS performance at the plane level is better
than a 15K HDD, so even a modest collection of 24 SSDs
could have a very large array (3,072) of base units.
At the component level, having sufficient units for 1M IOPS is not difficult.
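The unit count above follows directly from the hierarchy described:

```python
# Counting the independent NAND units in a modest SSD array,
# per the package/chip/plane hierarchy described above.
packages_per_ssd = 8
chips_per_package = 8
planes_per_chip = 2
planes_per_ssd = packages_per_ssd * chips_per_package * planes_per_chip
print(planes_per_ssd)         # 128 planes per SSD (64 chips x 2)
ssds = 24
print(ssds * planes_per_ssd)  # 3072 base units in the array
```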
Achieving 1M IOPS at the system level is more involved.
NVMe builds a new stack, software and hardware, for driving the
extraordinarily high IOPS possible with a large array of SSDs,
while making more efficient use of CPU than the SAS stack.
PCI-E NVMe SSDs have been around since 2014,
so it is possible to build a direct-attach SSD array
with the full NVMe stack.
NVMe over fabric was recently finalized, so SAN products
might not be too far off.
From the host operating system, it is possible to drive
1M IOPS on the NVMe stack without consuming too much CPU.
At the SQL Server level, there are additional steps,
such as determining which page to evict from the buffer cache.
Microsoft has reworked IO code to support the bandwidth
made practical with SSD for DW usage.
But given the enormous memory configuration of typical
transaction processing systems,
there may not have been much call for the ability
to do random IOPS with a full buffer cache.
But if the need arose, it could probably be done.
All Flash Array
When SSDs were still very expensive as components,
storage system vendors promoted the idea of a SSD
cache and/or a tiering structure of SSD, 10K and 7.2K HDDs.
In the last few years, upstarts have been promoting all flash.
HDD storage should not go away, but its role is backup
and anything not random IO intensive.
Rethinking System Architecture
The justification for rethinking system architecture
to low latency memory at far higher cost
is shown below.
The scaling achieved in a 4-socket system is less than exceptional
except for the very few NUMA-architected databases,
which are probably just the TPC-C and TPC-E databases.
It might be 2.2X better than single socket.
At lower latency, 40ns L3+memory, the single socket system could
match the performance of a 2-socket system with DDR4 DRAM.
If 25ns were possible, then it could even match up with the 4-socket system.
The mission of the massive memory made possible by the 4-way system,
reducing IO, is no longer a mandatory requirement.
The fact that a single-socket system with RLDRAM or SRAM
could match a 4-socket with massive memory
allows very wide latitude in cost.
RLDRAM may reside inside or outside of the processor.
If outside, thought should be given on how to reduce
the transmission delay.
SRAM should most probably be placed inside the processor package,
so the challenge is how much could be done.
Should there still be an L3?
Any latency from processor core to memory must be minimized,
as far as is warranted by the cost of SRAM.
Below are the memory latency simple model calculations
for a single socket with L3+memory latency of 43 and 25ns.
These are the values necessary for the single-socket
system to match 2 and 4-socket systems respectively.
|GHz||L3+mem||remote||skt||avg. mem||mem cycles||fraction||tot cycles||tps/core||tot tx/sec|
In the examples above, Hyper-Threading should still have good scaling to 4-way
at 43ns, and some scaling to 4-way at 25ns memory latency.
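A sketch of the kind of simple model behind the table columns above follows; the per-transaction workload numbers (memory accesses, non-memory cycles) are placeholder assumptions for illustration, not measured values:

```python
# A sketch of the simple latency model implied by the table columns above.
# The per-transaction workload numbers are placeholder assumptions.
def tx_model(ghz, local_ns, remote_ns, sockets,
             mem_accesses=2_000_000, other_cycles=1_000_000):
    # Assume 1/sockets of memory accesses hit the local node.
    avg_ns = (local_ns + (sockets - 1) * remote_ns) / sockets
    mem_cycles = mem_accesses * avg_ns * ghz   # cycles stalled on memory
    tot_cycles = mem_cycles + other_cycles     # total cycles per transaction
    tps_per_core = ghz * 1e9 / tot_cycles
    return avg_ns, tps_per_core

# Single socket (no remote accesses) at 43ns vs 25ns L3+memory latency:
for lat in (43, 25):
    avg_ns, tps = tx_model(2.2, lat, 0, 1)
    print(lat, round(tps, 1))
```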
The new memory architecture does not mean that DDR4 DRAM goes away.
It is an established and moderately inexpensive technology.
There could still be DRAM memory channels.
Whether this is a two-class memory system or
perhaps DDR memory is accessed like a memory-mapped file
can be debated elsewhere.
Xeon Phi has off-package DDR4 as memory node 0 and on-package MCDRAM
as memory node 1, all to the same processor.
It is acknowledged that the proposed system architecture is not
a new idea.
The Cray-1 used SRAM as memory, and DRAM as storage?
For those on a budget, the Cray-1M has MOS memory.
Circumstances of the intervening years favored a processor with SRAM cache
and DRAM as main memory.
But the time has come to revisit this thinking.
While working on this, I came across the slide below in
J Pawlowski, Micron,
Memory as We Approach a New Horizon.
The outline of the Pawlowski paper includes high bandwidth, and persistent memory.
Deeper in, RL3 Row Cycle Time (tRC) is 6.67 to 8ns, versus 45-50ns for DDR4.
I am guessing that the large number of double-ended arrows between processor and
near memory means high bandwidth.
And even bandwidth to DIMMs is substantial.
The devices to the right seem to be storage.
Does ASIC mean logic?
Instead of just accessing blocks, it would be useful to say:
read the pointer at address A, then fetch the memory that A points to.
Below is roughly Intel's vision of next-generation system architecture,
featuring 3D XPoint.
The new Intel and Micron joint non-volatile technology is promoted
as having performance characteristics
almost as good as DRAM, higher density than DRAM,
and cost somewhere in between DRAM and NAND.
The full potential of 3D XPoint cannot be realized as a PCI-E device.
The idea is then to have 3D XPoint DIMM devices on the memory interface
along with DRAM.
The argument is that memory configurations in recent years
have become ridiculously enormous.
That much of it is used to cache tepid or even cool data.
In this case, DRAM is overkill.
3D XPoint is almost as good as DRAM for this purpose:
it costs less, consumes less power, is persistent,
and will allow even larger capacity.
In essence, the Intel vision acknowledges the fact
that much of main memory is being used for less than hot data.
The function of storing not so hot data can be accomplished
with 3D XPoint at lower cost.
But this also implies that the most critical functions
of memory require far less capacity than recent generation practice.
In the system architecture with a small SRAM or RLDRAM
main memory, there will be more IO.
To a degree, IO at 100µs to NAND is not bad,
but the potential for 10µs or less IO to 3D XPoint
further validates the concept and
is too good to pass up.
Below is my less fancy representation of the Micron System Concept.
The Right Core
The latest Intel mainline Core-i processor has incredibly powerful
cores, with 8-wide superscalar execution.
A desktop version of Kaby Lake, 7th generation,
has 4.2GHz base frequency and 4.5GHz turbo.
This means the individual core can run at 4.5GHz if not significantly
higher, but must throttle down to 4.2GHz
so that four cores plus graphics and the system agent stay under 91W.
A reasonable guess might be that the power consumption is 20W per core at 4.2GHz?
Sky Lake top frequency was 4.0GHz base and 4.2GHz turbo.
Broadwell is probably 3.5GHz base and 3.8GHz turbo, but Intel
did not deploy this product as widely as normal.
In transaction processing, this blazing frequency is squandered
on memory latency.
The server strategy is in high core count.
At 24 cores, frequency is throttled down to 2.2GHz to stay under 165W.
The Broadwell HCC products do allow turbo mode in which a few cores
can run at up to 3.5 or 3.6GHz.
Every student of processor architecture knows the foundations of Moore's Law.
One of the elements is that on doubling the silicon real estate
at a fixed process, our expectation is to achieve 1.4X increase in performance.
(The other element is a new process with 0.71X linear shrink yields 50% frequency,
also providing 1.4X performance.)
The Intel mainline core is two times more powerful than the
performance that can be utilized in a high core count processor.
In theory, it should be possible to design a processor core
with performance equivalent to the mainline core at 2.2GHz
(see SQL on Xeon Phi).
In theory, this new light core should be one-quarter the size of the mainline core.
(Double the light core complexity for 1.4X performance. Double again for another 1.4X,
for a cumulative 2X over baseline.)
The new light core would be running at maximum design frequency to match
the 2.2GHz mainline, whatever frequency that might be.
We can pretend it is 2.2GHz if that helps.
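The sizing rule above reduces to performance scaling as the square root of core area at a fixed process; a one-line check of the claimed ratios:

```python
# The area-performance rule from the text: doubling area gives 1.4X
# performance, i.e. performance ~ sqrt(area) at a fixed process.
big_area = 4.0                 # mainline core, in units of light-core area
perf_ratio = big_area ** 0.5
print(perf_ratio)              # 2.0: the mainline core is ~2X the light core
# So a light core with half the mainline's performance needs ~1/4 the
# area, and 3-4X as many could fit in the same die and power budget.
```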
This new core would have no turbo capability.
What is the power consumption of this core?
Perhaps one-quarter of the mainline, because it is one-quarter the size?
Or more because it is running at a slightly higher voltage? (this is covered elsewhere).
It might be possible to fit four times as many of the light cores on the same die
size, assuming cache sizes are reduced.
But maybe only 3 times as many cores can be supported to stay within power limits?
This would be a much more powerful multi-core processor
for multi-threaded server workloads.
How valuable is the turbo capability?
The turbo boost has value because not everything can be made heavily multi-threaded.
Single or low-threaded code might not be pointer chasing code,
and would then be fully capable of benefitting from the full power of the Intel mainline core.
A major reason that Intel is in such a strong position is that they have had
the most powerful core for several years running,
and much of the time prior to the interlude period.
(AMD does have a new core coming out and some people think highly of it.)
Intel has two versions of each manufacturing process, one for high performance,
and another for low power.
Could the mainline core be built on the lower power process?
In principle this should reduce power to a greater degree than scaling voltage down.
Would it also make the core more compact?
We could speculate on theory, applying the general principles of Moore's Law.
But there is a real product along these lines, just targeted towards a different
The Xeon Phi 200, aka Knights Landing,
has 72 Atom cores, albeit at 245W (260W with fabric).
The current Phi is based on the Airmont Atom core.
(the latest Atom is actually Goldmont).
The recent Atom cores have a 14-stage pipeline versus 14-19 for Core?
Airmont is 3-wide superscalar, with out-of-order, but does not have a µOP cache?
The true top frequency for Airmont is unclear; some products based on
Airmont are listed as 2.6GHz in turbo.
On Xeon Phi, the frequency is 1.5GHz base, 1.7GHz turbo.
It might be that low core count processors will always be able to operate at a higher voltage for maximum frequency,
while high core count products set a lower voltage, resulting in lower frequency,
regardless of intent.
Below is a diagram of Knights Landing, or Xeon Phi 200
from Intel's Hot Chips 27 (2015).
The processor and 8 MCDRAM devices are on a single multi-chip-module
(package) SVLC LGA 3647.
The MCDRAM is a version of Hybrid Memory Cube?
(Intel must have their own private acronyms.)
Each device is a stack of die, 2GB, for a total of 16GB
with over 400GB/s bandwidth.
There are also 3 memory controllers driving a total
of 6 DDR4 memory channels for another 90GB/s bandwidth (at 2133 MT/s).
Only 1 DIMM per channel is supported,
presumably the applications are fine with 384GB but want extreme bandwidth.
The Xeon Phi is designed for HPC.
As is, it might be able to deliver impressive
transaction processing performance.
But perhaps not without tuning at many levels.
The question is, how would Knights Landing perform on transaction processing
had the memory been designed for latency
instead of bandwidth?
I suppose this could be tested simply by comparing an Airmont Atom against
a Broadwell or Skylake?
The theory is that the memory round-trip latency dominates,
so the 8-wide superscalar of Haswell and later has little benefit.
Even if there is some code that can use wide superscalar,
the benefit is drowned out by code that waits on memory accesses.
Even in transaction processing databases,
not everything is transactions, i.e., amenable to wide parallelism.
Some code, with important functions, does benefit from the full capability
of the mainline Intel core.
Perhaps the long-term solution is asymmetric multi-core,
two or four high-end cores, and very many mini-cores.
SSE/AVX Vector Unit (SIMD)
The vector (SSE/AVX) unit is a large portion of the core area.
These are not used in transaction processing
but are used in the more recent column-store engine.
Microsoft once evaluated the use of the Intel SSE registers,
but did not find a compelling case.
It might have been on the assumption of the existing page structure.
Perhaps what is needed is to redesign the page structure
so that the vector registers can be used effectively.
The SQL Server 8KB page, 8,192 bytes, has a page header of 96 bytes,
leaving 8,096 bytes.
Row offsets (slot array of 2 byte values) are filled in from the end of the page.
See Paul Randal, SQLskills,
Anatomy of a Record.
Within a page, each row has a header (16-bytes?) with several values.
The goal of redesigning the page architecture is so that the slot
and header arrays can be loaded into the vector registers
in an efficient manner.
This might mean moving the slot array and other headers up front.
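The slot array layout described above can be sketched as follows; only the 96-byte header, 8KB page, and 2-byte end-anchored slot array come from the text, while the field positions and toy page are illustrative:

```python
import struct

# Hedged sketch of reading a SQL Server-style 8KB page: 96-byte page
# header, rows in the body, and a slot array of 2-byte row offsets
# filled in from the end of the page. Only the 96/8192/2-byte layout
# is taken from the text; the toy page contents are illustrative.
PAGE_SIZE = 8192
HEADER = 96

def slot_offsets(page: bytes, slot_count: int):
    """Return row offsets from the slot array at the end of the page."""
    offsets = []
    for i in range(slot_count):
        pos = PAGE_SIZE - 2 * (i + 1)   # slot i, growing backwards
        (off,) = struct.unpack_from('<H', page, pos)
        offsets.append(off)
    return offsets

# Build a toy page with two rows and a matching slot array.
page = bytearray(PAGE_SIZE)
struct.pack_into('<H', page, PAGE_SIZE - 2, HEADER)  # slot 0 -> offset 96
struct.pack_into('<H', page, PAGE_SIZE - 4, 200)     # slot 1 -> offset 200
print(slot_offsets(bytes(page), 2))  # [96, 200]
```

A vector-friendly page would let these scattered 2-byte reads become one contiguous load, which is the redesign being argued for.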
SQL Server would continue to recognize the old page structure.
On index rebuild, the new page structure would be employed.
The necessary instructions to do row-column byte offset calculations
directly from the vector register would have to be devised.
This needs to be worked out between Intel and various database vendors.
Perhaps the load into the vector registers bypasses L1 and/or L2?
It would be in L3 for cache coherency?
The current Xeon E5/7 processors, with the latest on the Broadwell
core, have 16 vector registers of 256-bits (32 bytes) totaling 512 bytes.
The Skylake has 32 registers of 512-bits, 2KB of registers.
This is too much to waste.
If they cannot be used, then the processor
with special memory controller
should discard the vector unit.
The main purpose of this article was to argue
for a new system architecture, having low latency memory,
implying a processor architecture change,
with a major focus on the feasibility for transaction processing
databases, and mostly as pertinent to Intel processors.
However, all avenues for significant transaction performance improvement
are considered:
- Hyper-Threading to 4-way
- Memory-optimized tables with natively compiled procedures
- Database architected for NUMA
- Multi-core - the right size core?
- 3D XPoint
- SSE/AVX Vector Instructions
Increasing HT from 2-way to 4-way has the potential to nearly
double transaction processing performance.
Other options have greater upside, but this is a drop-in option.
Memory-optimized tables and natively compiled procedures combined
has the greatest upside potential.
People do not want to hear that the database should be
re-architected for NUMA scaling.
If it runs fine on single-socket or Hekaton, then fine.
But Intel mentions that scaling to the Phi core count levels
requires NUMA architecture even on one socket.
Higher core count using a smaller core will have the best
transaction throughput, but an asymmetric model might be better.
The capabilities of the modern processor core are being squandered
in long latency for capacity that is not needed.
Figure out what is the right low latency memory.
There is definitely potential for 3D XPoint.
But it goes beyond displacing some DDR DRAM and NAND.
The true potential is to enable a smaller lower latency
true main memory, then have DDR DRAM and 3D XPoint as
something in-between memory and IO.
SSE/AVX: use it or lose it.
The growing gap between processor clock cycle time to memory latency
is not a new topic.
There have been many other papers on the advantage of various
memory technologies with lower latency.
Most of these originate either from universities or semiconductor companies.
Everyone acknowledged that cost relative to mainstream DRAM
was a serious obstacle.
A number of strategies were conceived to narrow the gap.
Here, the topic is approached from the point of view of
database transaction processing.
TP is one of several database applications.
Database is one of many computer applications.
However, transaction processing is a significant portion
of the market for high-end processors, and
systems with maximum memory configuration.
The database community regularly spends $100K on
a processor-memory complex for performance levels
that could be achieved with a single socket,
if it were matched with the right memory.
There is valid justification from the database side
to pursue the memory strategy.
There is justification to the processor-memory vendor
that this one market has the dollar volume to make this worthwhile.
And the extremely large memory capacity requirement is shown to now be a red herring.
In all, there are several worthwhile actions for the next generation
of server system architecture.
Probably none are mutually exclusive.
There are trade-offs between impact, cost and who does the heavy lifting.
No single factor wins in all cases, so a multi-pronged attack is the more promising route.
I will add a discussion of the figure below later.
The diagram below might be helpful in the discussion.
The elapsed time for a transaction is the weighted sum of operations
that incur a wait at each level.
Single thread performance is the inverse of elapsed time.
For throughput, memory access latency can be partially hidden with hyper-threading.
IO latency is hidden by asynchronous IO, but there is an overhead to that too.
Suppose D is the latency for DDR memory,
and S is the latency of some low latency memory,
both inclusive of transmission time and possibly L3.
There is no read IO in the DDR system.
Suppose x is the fraction of memory accesses that
are outside of the fast memory, and I is the latency of those accesses.
The term I might represent accesses to DDR or 3D XPoint
on the memory interface via a memory access protocol
so it is really not IO.
Or it could be to 3D XPoint or NAND attached to PCI-E
via an IO protocol.
The criterion for the small faster memory being an advantage in elapsed time,
exclusive of operations that occur inside L3, is as follows.
(1-x)×S + x×I < D
x < (D-S)/(I-S)
The objective is to achieve a large gain in performance via elapsed time.
Only so much can be gained on the numerator D-S, so much depends on the
latency of I.
If the secondary device were DDR or 3D XPoint on the memory interface,
then a very high value of accesses (x) could be allowed while
still achieving good performance gain.
If it were on PCI-E, then 3D XPoint might have a strong advantage over NAND.
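A worked example of the break-even inequality, with illustrative latency assumptions (D, S, and the two values of I are not from the text):

```python
# Worked example of the break-even inequality x < (D-S)/(I-S) above.
# Latencies are illustrative assumptions, in nanoseconds.
def max_outside_fraction(D, S, I):
    return (D - S) / (I - S)

D, S = 90, 43                  # DDR system vs fast-memory system latency
for I in (10_000, 100_000):    # ~10us (3D XPoint-class) vs ~100us (NAND)
    print(I, f"{max_outside_fraction(D, S, I):.2%}")
# Only a tiny fraction of accesses may fall outside the fast memory,
# and the 10us device tolerates ~10X more misses than the 100us device.
```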
In the discussion on Knights Landing, I suggested that the Atom core might not be bad
for transaction processing.
The cheapest Xeon Phi is the 7210 at $2438. About $4700 in a system.
What is the difference between Atom C2750 and 2758? Both are Silvermont 8-cores, no HT.
Use ECC SODIMM.
Atom has changed since its original inception,
which did not use out-of-order execution, for simplicity and power-efficiency.
Silvermont added OOO. Not sure about Goldmont.
Is Atom to become a slimmed-down Core, with 3-wide superscalar,
manufactured on the SoC version of the process?
I will try to sort out the material and redistribute
over several articles as appropriate.
There is an article saying
that a new Intel processor architecture to succeed the lake processors
(Sky, Cannon, Ice and Tiger)
will be "faster and leaner" and, more interestingly, might not be entirely
compatible with older software.
The original source is
I suppose it is curious that the Lake processors form a double tick-tock or now process-architecture-optimization (PAO), but skipped Kaby, and Cannon.
Both the bridge (Sandy and Ivy) and well processors (Has and Broad) each had only
one tick-tock pair.
Naturally, I cannot resist commenting on this.
In the really old days, processor architecture and instruction set architecture (ISA) were somewhat
the same thing.
The processor implemented the instruction set, so that was the architecture.
I am excluding the virtual-architecture concept where lower cost version
would not implement the complete instruction set in hardware.
The Pentium Pro was a significant step away from this,
with micro-architecture and instruction set architecture now
largely different topics.
Pentium Pro has its own internal instructions, called µops.
The processor dynamically decodes X86 instructions to the native µops.
This was one of the main concepts that allow Intel to borrow
many of the important technologies from RISC.
The Pentium 4 processor, codename Willamette, had a trace cache
that was a cache for decoded instructions.
This may not have been present in the Core 2 architecture that
followed Pentium 4.
My recollection is that Pentium Pro had 36 physical registers
of which only 8 are visible to the X86 ISA.
The processor would rename the ISA registers as necessary to support
out-of-order execution.
Pentium 4 increased this to 128 registers.
Nehalem micro-architecture diagrams
do not mention a µop cache
(somehow the acronym is DSB, for Decoded Stream Buffer), but Sandy Bridge
and subsequent processors do.
This is curious because both Willamette and Nehalem
are Oregon designs, while Core 2 and Sandy Bridge are Haifa designs.
The other stream that comes into this topic involves
the Intel Itanium adventure.
The original plan for Itanium was to have a hardware (silicon) X86 execution unit.
Naturally, this would not be comparable to the then contemporary
X86 processors, which would have been Pentium III, codename Coppermine
at 900MHz, for Merced.
So by implication, X86 execution would probably be comparable to
something several years old, a Pentium II 266MHz with luck,
and Itanium was not lucky.
By the time of Itanium 2, the sophistication of software CPU emulation
was sufficiently advanced that the hardware X86 unit was discarded.
In its place was the IA-32 Execution Layer.
Also see the
IEEE Micro paper on this topic.
My recollection was the Execution Layer emulation was not great but not bad either.
The two relevant technologies are: one, the processor having native µops
instead of the visible X86 instructions,
and two, the Execution Layer for non-native code.
With this, why is the compiler generating X86 binaries
(or, as Intel prefers, IA-32 and Intel 64 instructions)?
Why not make the native processor µops visible to the compiler?
When the processor detects a binary with native micro-instructions,
it could bypass the decoder.
Also, make the full set of physical registers visible to the compiler.
If Hyper-Threading is enabled, then the compiler should know to
use only the correct fraction of registers.
I am inclined to also say that the more the compiler knows about the underlying
hardware, the better it can generate binaries to fully utilize available resources,
with less reliance on the processor doing dynamic scheduling for parallelism.
But of course, that was what Itanium was,
and we would need to understand why Itanium did not succeed.
My opinion was that EPIC was really better suited to scientific computing
and not logic heavy server applications.
Have one or two generations of overlap,
during which Microsoft and the Linux players make a native micro-op operating system.
Then ditch the hardware decoders for X86.
Any old code would then run on the Execution Layer,
which may not be 100% compatible.
But we need a clean break from old baggage
or it will sink us.
Off topic, but who thinks legacy baggage is sinking the Windows operating system?
Of course, I still think that one major issue is that Intel is stretching
their main line processor core over too broad a spectrum.
The Core is used in both high-performance and high-efficiency mode.
For high performance, it is capable of well over 4GHz,
probably more limited by power than transistor switching speed.
For power efficiency, the core is throttled to 2 or even 1 GHz.
If Intel wants to do this in a mobile processor, it is probably not
that big a deal.
However, in the big server chips, with 24 cores in Xeon v4
and possibly 32 cores in the next generation (v5),
it becomes a significant matter.
The theory is that if a given core is designed to operate at
a certain level, then doubling the logic should achieve a 40%
increase in performance.
So if Intel is deliberately de-rating the core in the Xeon HCC die,
then they could build a different core targeting one half
the original performance at perhaps one quarter the complexity.
So it should be possible to have 100 cores with half the performance
of the Broadwell 4GHz capable core,
i.e., equivalent to Broadwell at 2GHz?
If this supposed core were very power efficient, then perhaps we could
even support the thermal envelope of 100 mini-cores?
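The doubling-logic-for-40% relationship above is Pollack's rule: performance scales roughly with the square root of complexity. A quick sketch of the mini-core arithmetic, treating the rule as a rough empirical assumption rather than a property of any particular core:

```python
# Pollack's rule (an assumption): performance ~ sqrt(complexity).
import math

def perf(complexity):
    return math.sqrt(complexity)

# Doubling the logic buys roughly 40% more performance.
print(perf(2.0) / perf(1.0))     # ~1.41

# Conversely, half the performance needs only a quarter the complexity.
print(perf(0.25))                # 0.5

# So 100 half-performance mini-cores vs 24 full cores, in aggregate:
print(100 * perf(0.25) / 24)     # ~2x the throughput
```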
Of course, not every application is suitable for wide parallelism.
I would like to see Intel do a processor with mixed cores.
Perhaps 2 or 4 high performance cores and 80 or so mini-cores?
A really neat trick would be if the GPU were programmable,
but perhaps the graphics vendors already have things along this line?
It should be intuitively obvious that round-trip memory access latency
is one of the most important factors in modern server system architecture
for transaction processing databases.
Yet this is a topic that no one talks about.
Vendors do not want to discuss this because no near or long-term actions are planned.
Outsiders cannot write a meaningful article because too much important information is missing.
In some respects, there is nothing practical we can do about latency
in terms of memory components that we elect to employ.
However, we can influence latency because Intel processors since Nehalem in 2009/10 have integrated memory controllers.
Hence all multi-socket systems since then have non-uniform
memory access (NUMA) and this is one mechanism that determines latency.
We will start by looking at modern Intel processor architecture
and the time scales involved.
Then continue by examining system architecture with respect to memory.
From there, we can do basic calculations on how we expect
memory latency to impact transaction processing performance.
This basic model still needs to be backed up by real experimental measurements,
but there is enough to provide the basis for further investigations.
Pending such, one of the important conclusions is that it is time to re-examine
fundamental assumptions on server sizing.
Broadwell, Xeon E3 v4
The diagram below is a representation of a modern Intel mainline
(desktop and mobile) processor.
The proportions are from Broadwell, the 5th generation Intel Core i3, i5 and i7 and Xeon E3 v4,
as Sky Lake has a different layout arrangement.
The four cores, system agent, and graphics communicate over the ring interconnect.
The memory controller and IO are in the system agent,
but the signals that go off-chip are in the interface shown as a separate unit.
Never mind the above; I don't think that's how Broadwell 4c is laid out.
See below left. Sky Lake is below right.
Intel and system vendors like to talk about specifications
such as processor frequency and memory transfer rate.
Recent generation Intel processor cores are capable of operating in
the high 3 to low 4GHz range.
It is quite possible that they could run even higher,
but are power constrained.
The memory interface is DDR4 at 1866 and 2133MT/s.
The term used is mega-transfers per second instead of MHz.
This is because the clock is cited in MHz, and the data transfer rate
is eight times the memory clock rate.
Address and writes should be one-half the data rate.
Using MT/s for the data transfer rate is so much clearer.
There is a memory clock, 266.67MHz for example,
an I/O bus clock that is 4 times higher, 1066MHz,
and the data transfer rate at double the I/O clock, 2133MT/s.
DDR memory timings
are specified at the memory interface and
consist of 4 values: CL, RCD, RP and RAS,
the last of which is frequently not cited, and sometimes only the
CAS Latency is cited.
The value is in terms of the I/O clock cycles.
In database transaction processing, the memory access pattern
is largely unpredictable, amounting to a random memory row access,
so the latency is RP + RCD + CL, for Row Pre-charge, Row to Column Delay,
and CAS Latency.
For Registered DDR4 2133, all three values are 15?
(Does this include the extra cycle for registered memory?)
The 2133 MT/s data rate corresponds to a 1066MHz I/O clock, so each cycle is 0.938ns,
and 15 cycles is 14ns.
The total memory latency at the memory interface is then 42ns?
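Sketching out the timing arithmetic (the 15-15-15 timings are an assumption for this grade of registered memory):

```python
# DDR4 random-row access latency at the memory interface.
# Assumption: Registered DDR4-2133 with 15-15-15 (CL-RCD-RP) timings.
data_rate_mts = 2133                  # mega-transfers per second
io_clock_mhz = data_rate_mts / 2      # transfers on both clock edges
cycle_ns = 1000 / io_clock_mhz        # ~0.938 ns per I/O clock

CL, RCD, RP = 15, 15, 15              # timings in I/O clock cycles
print(f"cycle: {cycle_ns:.3f} ns")
print(f"one timing parameter: {CL * cycle_ns:.1f} ns")   # ~14 ns

# Random row access pays precharge + row-to-column delay + CAS latency.
total = (RP + RCD + CL) * cycle_ns
print(f"random row access: {total:.0f} ns")              # ~42 ns
```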
Cache and Memory Latency
There are applications in which processor frequency and memory bandwidth
matter a great deal.
But neither are particularly important for transaction processing databases.
The diagram below calls out some details between the processor core,
various levels of cache, and memory that are more relevant to databases.
At 3GHz, the processor core cycle time is 0.33ns, the inverse of the frequency.
The L1 cache is cited as having 4 cycle latency.
The L1 is part of the processor core execution pipeline,
so to some degree, L1 latency is hidden.
L2 cache latency is cited as 12 cycles.
It is not certain as to whether this is fixed in cycles,
or is actually something like 4ns.
L3 cache probably depends on the number of cores and other circumstances.
It is shown here as 40+ cycles.
If L3 latency is actually time based, 15ns for example,
then number of cycles would depend on the core clock rate.
I am not sure if the L3 latency incorporates L2 latency.
Memory latency is probably somewhat over 50ns, plus the L3 latency.
How much of this is at the DRAM chip versus the transmission delay
between the processor to DRAM and back?
Igor Pavlov provides both the 7z benchmark and results for many processors
at 7-cpu, source code included.
7-cpu lists Haswell L3 at 36 cycles, and memory at L3 + 57ns.
Sky Lake L3 is 42 cycles and memory at L3 + 51ns.
(This seems to imply that the transmission delay from processor
to memory and back is about 10ns?)
Intel has their own
Memory Latency Checker
One of the Intel pages shows an example with local node memory latency as 67ns,
and remote node at 125ns,
which will be used in the examples below.
Broadwell EP and EX
The above diagrams were for the desktop and Xeon E3 processors.
We are really more interested in the Intel EP/EX processors used
in the Xeon E5 and E7 product lines.
The latest Xeon E5 and E7 processors are v4, based on the Broadwell core.
There are 3 layouts, HCC, MCC and LCC with 24, 15 and 10 cores respectively.
Intel provides functional layouts for all three,
but the actual die layout is provided only for the LCC model.
The Broadwell HCC representation is shown below.
There are 6 rows and 4 columns of cores, 24 cores total.
The 24-core model is only available in the E7.
The top E5 has only 22 cores enabled.
Two columns of cores communicate over the interconnect dual ring (counter-rotating).
The two double rings are connected by a switch.
The Intel functional diagram actually shows both QPI and PCI-E on the left side.
Below are my representations of the MCC model on the left with 3 columns of 5 rows
for 15 cores
and the LCC model on the right with 2 columns of 5 rows for 10 cores.
In the LCC model, there is no ring switch.
The PCI-E and memory controllers actually do overhang the cores,
meaning that the space to the right of the cores is not exactly filled.
As even the LCC is a high margin product, an exact and efficient fit is not necessary?
The MCC arrangement of both QPI and PCI-E on the left side
of the switch connected to the left ring
and a memory controller on each side
of the ring switches matches the Intel functional layout,
but I do not know if there is overhang.
Regardless of the actual placement of the controllers,
the interface signals for QPI and PCI-E probably do run along the length
of the upper edge,
and the interface for the memory signals probably runs along most of the lower edge.
I am inclined to believe that L3 latency is higher in the E5/E7 processors
as the path is longer and more complicated.
On the LCC die, there are 10 cores, the memory, QPI and PCI-E controllers
on one ring.
However, if the desktop and E3 processors have only one ring (single direction),
then it is possible that the bidirectional ring in the E5/E7 processors
helps keep L3 latency low?
Presumably latencies on the MCC and HCC die are longer than on the LCC
because both rings must be checked?
Search: Haswell Cluster on Die (COD) Mode, but filter out Call of Duty.
An Intel slidedeck on this suggests that memory latency is higher
when crossing the coherency bus switches.
Xeon E5 v4
Below is a representation of a 2-socket Xeon E5 system with the HCC die.
Two of the 24 cores are marked out, as the E5 has a maximum
of 22 cores.
The E5 has 2 full QPI links, and both are used to connect to the other processor.
For a core in the left socket, the memory attached to that socket is local
memory, and memory attached to the other socket is remote memory.
It should not be difficult to appreciate that there is a large difference
in memory access time between local and remote memory nodes.
The Intel Memory Latency Checker example
has 67ns for local and 125ns for remote memory,
which will be the values we use in a following example.
I am not certain if these values are for unbuffered or registered memory.
Unbuffered memory should have lower latency,
but registered memory is available in larger capacities, 64GB versus 16GB.
Xeon E7 v4
Below is a representation of the 4-socket Xeon E7 system.
The E7 has 3 full QPI links, one connecting directly to each of
the three other processors.
All remote processors are then said to be 1-hop away.
The difference of significance here between the E5 and E7 systems is that
the E5 memory channel connects directly to memory.
The E7 connects to a scalable memory buffer (SMB, other names have been used too),
that splits into two memory channels on the downstream side.
Because there are so few acronyms in effect, the interface from processor
to SMB is SMI.
The SMB doubles the number of DIMM sites, and in effect,
doubles the memory capacity per socket.
This difference in the memory arrangement between processors
designed for 2 and 4-socket systems
has been a recurring pattern in Intel system architecture for a long time,
though it was not present in all systems, and there were variations.
In the current generation, there is a four-socket version of the E5,
which does not have the SMB in the path to memory,
but each processor only has two QPI links,
so one of the remote sockets is two-hops away.
Long ago, maximum memory capacity was very valuable for database servers
in reducing IO to less than impossible levels.
The extra latency incurred from the SMB chip was worth the price.
Since then, memory configuration has increased to stupendous levels,
and data read IO has been reduced to negligible.
NAND flash has also become very economical,
allowing storage now to be capable of formerly impossibly high IOPS.
Of course, this occurred after it was no longer absolutely essential.
In more recent years, with maximum memory configuration sometimes more than
double what is already extravagant,
we might not want to incur the extra latency of the SMB?
Highly Simplified Example
Before proceeding, I will take the opportunity to say that the modern
microprocessor is an enormously complex entity in so many ways
that it defies characterization in anything less than an exceedingly complicated model.
The processor has multiple cores.
Each core has pipelined superscalar and out-of-order execution.
Each core has dedicated L1 I+D cache and dedicated L2 unified cache.
And there is decoded micro-op cache as well.
The processor has L3 cache shared among the cores.
That said, I will now provide an example based on an extremely simplified
model of the core-processor memory complex.
Suppose we have a fictitious 3GHz processor, cycle time 0.33ns,
having a very fast cache, that executes one instruction
per cycle when everything is in cache.
Suppose further that memory latency is 67ns,
and the long latency L3 cache effects are not considered.
Suppose we have a similarly fictitious transaction of 10M instructions.
If everything is in cache, the transaction completes in 3.33 million nanoseconds,
or 3.33 milli-sec, and our performance is 300 transactions per second per core.
Now suppose that 5% (1/20) of instructions require a round-trip memory access
before proceeding to the next step.
The 0.95 fraction of instructions that have a cache hit consume 9.5M cycles.
The 0.05 fraction of 10M is 0.5M instructions that miss cache.
Each of these require a round-trip memory access of 67ns or 201 cycles
for 100.5M cycles. The total time to complete the transaction
is now 110M cycles.
Performance is now 27.27 transactions per sec per core
instead of 300 tps.
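The simplified model can be written out in a few lines, using only the fictitious numbers assumed above:

```python
# Highly simplified transaction cost model, using the fictitious numbers
# from the text: 3GHz core, 1 instruction/cycle on cache hits, 67ns
# memory latency, 10M instructions per transaction.
INSTRUCTIONS = 10_000_000

def tps_per_core(mem_latency_ns, ghz=3.0, miss=0.05):
    mem_cycles = mem_latency_ns * ghz                  # 67ns -> 201 cycles
    total_cycles = INSTRUCTIONS * ((1 - miss) + miss * mem_cycles)
    return ghz * 1e9 / total_cycles                    # transactions/sec/core

print(tps_per_core(67, miss=0.0))   # everything in cache: 300.0 tps
print(tps_per_core(67))             # 5% misses at 67ns: ~27.27 tps
```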
|GHz||L3+mem||remote||skt||avg. mem||mem cycle||fraction||tot cycles||tps/core||tot tx/sec|
|3||67ns||-||1||67ns||201||0.05||110.0M||27.27||27.27|
Now suppose that we have a 2-socket system,
meaning two memory nodes,
and that we have not specially architected our database in a manner
to achieve higher memory locality than expected from random access patterns.
Any memory access is equally likely to be in either node.
The local node memory continues to be 67ns
and remote node memory access is 125ns.
Average memory access is now (67+125)/2 = 96ns, or 288 cycles.
|GHz||L3+mem||remote||skt||avg. mem||mem cycle||fraction||tot cycles||tps/core||tot tx/sec|
|3||67ns||125ns||2||96ns||288||0.05||153.5M||19.54||39.09|
Without a database architected to achieve memory locality,
we have lost 28% performance per core (1 - 19.54/27.27).
Of course, we did double the number of cores, so throughput has increased by 43% (2*19.54/27.27).
Alternatively, the performance per core in the single socket system is 39.5% better
than in the 2-socket system (27.27/19.54).
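The 2-socket arithmetic, as a sketch using the same assumed numbers:

```python
# 2-socket NUMA: any memory access equally likely to hit either node.
GHZ = 3.0
local_ns, remote_ns = 67, 125
avg_ns = (local_ns + remote_ns) / 2              # 96 ns average latency
mem_cycles = avg_ns * GHZ                        # 288 cycles at 3GHz
total_cycles = 10e6 * 0.95 + 0.5e6 * mem_cycles  # 153.5M cycles/transaction
tps_core = GHZ * 1e9 / total_cycles
print(round(tps_core, 2))                 # ~19.54 tps/core
print(round(1 - tps_core / 27.27, 3))     # ~28% per-core loss vs 1-socket
print(round(2 * tps_core / 27.27, 2))     # ~1.43x total throughput
```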
This magnitude is important.
Did your vendors and infrastructure experts forget to mention this?
Now suppose we are on a 4-socket Xeon E7-type system with the same database,
so memory access is equally probable to any of the four nodes.
Local memory access is 25%, and 75% is remote to one of the three other sockets.
All sockets are directly connected, so all remote nodes are one-hop away.
Now, recall that Xeon E7 has a memory buffer in the path between
the processor (memory controller) and memory.
|GHz||L3+mem||remote||skt||avg. mem||mem cycle||fraction||tot cycles||tps/core||tot tx/sec|
|3||82ns||140ns||4||125.5ns||377||0.05||197.8M||15.17||60.68|
Let's suppose that the SMB adds 15ns additional latency.
(I do not know what the number really is. It is not free.
The magic of doubling memory capacity comes at a price.)
Local node memory access is now 82ns and remote node is 140ns.
Average memory access is (82+3*140)/4 = 125.5ns, or 377 cycles.
We have now just lost another 22% performance per core going from a 2-socket
E5 to the 4-socket E7 type systems (1 - 15.17/19.54).
Total throughput is 55% better than the two-socket (2*15.17/19.54).
Performance per core is 28.8% better on the 2-socket than on the 4-socket.
The performance per core on the 1-socket is 80% better than on the 4-socket (27.27/15.17).
The 4-socket has 2.22X better throughput than the 1-socket (4*15.17/27.27).
Scaling - NUMA
Below is all three of the above cases in a single table.
|GHz||L3+mem||remote||skt||avg. mem||mem cycle||fraction||tot cycles||tps/core||tot tx/sec|
|3||67ns||-||1||67ns||201||0.05||110.0M||27.27||27.27|
|3||67ns||125ns||2||96ns||288||0.05||153.5M||19.54||39.09|
|3||82ns||140ns||4||125.5ns||377||0.05||197.8M||15.17||60.68|
Scaling - Frequency
We can also do the same calculations based on a similarly fictitious 2GHz processor.
|GHz||L3+mem||remote||skt||avg. mem||mem cycle||fraction||tot cycles||tps/core||tot tx/sec|
|2||67ns||-||1||67ns||134||0.05||76.5M||26.14||26.14|
|2||67ns||125ns||2||96ns||192||0.05||105.5M||18.96||37.91|
|2||82ns||140ns||4||125.5ns||251||0.05||135.0M||14.81||59.26|
Notice that we did not lose much performance in stepping down from 3 to 2GHz.
We could even further step down to 1GHz and still be at 23.26, 34.78 and 55.36 tot tps
for 1, 2, and 4 sockets respectively.
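All of the socket and frequency cases fall out of one small calculation. The sketch below uses the latencies assumed in the text: 67ns local, 125ns remote, and the E7's SMB assumed to add 15ns (82ns local, 140ns remote):

```python
# Socket-count x frequency grid for the simplified transaction model.
def avg_latency_ns(sockets):
    if sockets == 1:
        return 67.0                   # all accesses local
    if sockets == 2:
        return (67 + 125) / 2         # 96ns average
    return (82 + 3 * 140) / 4         # 4-socket E7 with SMB: 125.5ns

def tps_per_core(ghz, sockets):
    mem_cycles = avg_latency_ns(sockets) * ghz
    total_cycles = 10e6 * 0.95 + 0.5e6 * mem_cycles
    return ghz * 1e9 / total_cycles

for ghz in (3.0, 2.0, 1.0):
    for skt in (1, 2, 4):
        t = tps_per_core(ghz, skt)
        print(f"{ghz:.0f}GHz {skt}S: {t:5.2f} tps/core, {skt * t:6.2f} total")
```

Even at 1GHz the totals are 23.26, 34.78 and 55.36 tps for 1, 2, and 4 sockets, matching the figures above.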
It is important to stress that this is based on the assumption of a transaction
processing workload having the characteristic of serialized memory accesses.
All of the above is based on a highly simplified model.
Real and rigorous testing should be done before drawing firm conclusions.
Regardless, there is no way anyone can claim that the difference
between local and remote node memory access latency is not important,
unless the database has been architected to achieve a high degree of memory locality.
Front-Side Bus Systems
In the days before Intel integrated the memory controller,
four processors connected to a memory controller in a system with uniform memory access.
The Pentium Pro arrangement of 4P on one bus is represented on the left,
although the diagram is actually closer to the 450NX or later.
The system on the right represents the 4-way Xeon 7300,
each quad-core processor on its own bus connected to the 7300 MCH.
Intel had already committed to the Pentium 4 quad-pumped bus in 2000/01.
Vendors were expecting a long stable infrastructure,
so Intel delayed the switch-over from bus to point-to-point until 2009.
Pre-Nehalem NUMA Systems
A system with 16 or more processors could be built
with a custom memory + node controller
connecting four processors, memory and also a crossbar.
The crossbar in turn connects multiple node controllers
to form a system having non-uniform memory access (NUMA).
In the old NUMA systems, some SQL Server operations ran fine,
and other operations had horrible characteristics,
far worse than would be suggested by the
remote to local node memory latency ratio.
So it is possible that there are other NUMA effects with far greater impact.
Some may have since been resolved, while others may still be present
but not as pronounced in modern NUMA systems.
An important fact to notice is that a very high fraction of CPU cycles are no-ops,
where the processor core does nothing while waiting for a round-trip memory access.
This is why Hyper-Threading is highly effective.
While one logical processor is waiting for a memory access to complete,
the other thread can run.
Scaling on the logical processors can be nearly linear for a transaction processing workload.
Note, the first-generation of Intel Hyper-Threading was on the Intel Pentium 4
(Net Burst) processors.
In that generation, the implementation was two threads running simultaneously,
as in, on each clock cycle, both trying to fill the superscalar execution units.
The first generation of HT was problematic.
It could have been because it was too aggressive to try to execute two
threads simultaneously, or it could have been simply that the Windows
operating system and SQL Server engine at the time did not know how to properly use HT.
The next generation of Intel processor architecture, Core 2, did not have HT.
Then in Nehalem, HT returned, except that this time,
it was a time slice implementation.
Only one thread executes on any given cycle.
When the executing thread encounters a memory or other long latency operation,
the processor core switches to the other thread.
If anyone has doubts on HT based on experience or hearsay
from the Pentium 4 generation, forget it.
The Nehalem and later HT is highly effective for transaction processing workloads.
There used to be several SKUs with HT disabled in the Xeon E5/7 v1-3 generations.
Pay close attention and pass on the no-HT SKUs.
The question to ask is why Intel does not increase the degree of HT?
The generic term is Simultaneous Multi-Threading (SMT).
Both IBM POWER and Oracle SPARC processors are or have been at 8-way SMT.
Granted, one of the two mentioned that scaling to 8-way SMT was tricky.
It is high time for Intel to increase HT to 4-way.
In the above examples, the simple model suggests that scaling to
multiple sockets is poor on the assumption of a transaction processing database
without a means of achieving memory locality.
(There is supposed to be an HPE whitepaper demonstrating the importance
of the SQL NUMA tuning techniques in a properly designed database.)
Just what does a database architected for NUMA mean?
Naturally, this will have to be expounded in a separate article.
But for now, take a close look at both the TPC-C and TPC-E databases.
The TPC-C database has all tables leading with a common key, Warehouse Id,
that provides a natural organizational structure.
The TPC-E database has 5 transaction tables with a common key value
but does not use the identity property.
Instead it uses a function that must read, then update a table
to determine the next value.
The Case for Single Socket
Naturally, the database and application
should be architected together with the SQL Server NUMA tuning options
to support good scaling
on multi-socket NUMA systems.
If we neglected this in the original design, then rearchitecting would be called for;
I am sure many DBA-developers know how well such a suggestion would be received.
Is there another option? Well yes.
Get rid of the NUMA latency issue with a non-NUMA system.
Such a system has a single processor socket, hence one memory node.
Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.
Still, a single quad-core processor today is 40 times more powerful
than a 4-way system from twenty years ago (100,000 tpm-C per core is probably
possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996).
The Xeon E3 could probably support many medium sized organizations.
Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each).
My recollection is that many IO problems went away at the 32-64GB level.
And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4's.
But I am really talking about a single-socket Xeon E5.
In the v4 generation, we could have up to 22 cores,
though we should start by looking at the 10-core E5-2630 v4 at $667,
stepping up to the 16-core 2683 v4 at $1846
before going to the 20-22 core models at $3226 and $4938.
The Xeon E5 has 40 PCI-E gen 3 lanes.
It might be convenient if there were a motherboard with 1 PCI-E x8 and 8 x4,
because NVMe PCI-E SSDs are more common and economical with the x4 interface.
Supermicro does have a UP Xeon E5 motherboard (X10SRL-F) with 4 x8, 2x4 gen3 plus 1 x4 gen2.
It only has 8 DIMM sites out of 12 possible with the E5, but that is probably good enough.
A strong explanation was provided showing why round-trip memory latency
is very important in transaction processing.
One implication of this is that scaling to multiple sockets is poor
due to the NUMA effect.
A remedy is to architect the database and application together in working
with the SQL Server NUMA tuning options to achieve locality.
Alternatively, give serious consideration to a single-socket, yet still very powerful system.
A second implication is that processor frequency is less important
for transaction processing, though it might be important for other aspects.
The memory latency affect also supports the argument that Hyper-Threading
is highly effective and Intel really needs to increase the degree of HT.
OK, I didn't show why database transaction processing incurs the round-trip memory accesses.
It has to do with the b-tree index, in which we read through a page
to find the right pointer to the next level.
We access the memory for that pointer, then read through to find the next pointer.
I will try to do a diagram of this later.
But if someone can dig through an open source database engine,
please send a code example.
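Pending a proper diagram, here is a hypothetical sketch (not taken from any real engine) of why a b-tree search serializes memory accesses: the pointer to the next level is unknown until the current node has arrived from memory, so one round-trip per level cannot be overlapped.

```python
# Hypothetical b-tree search: each step depends on the previous node
# having been fetched, so cache misses are strictly serialized.
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # separator keys (interior) or leaf keys
        self.children = children    # None for a leaf
        self.values = values        # leaf payload

def btree_search(root, key):
    node, hops = root, 0
    while node.children is not None:
        # The child pointer is unknown until this node's memory arrives:
        # a dependent round-trip per level of the tree.
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
        hops += 1
    i = bisect.bisect_left(node.keys, key)
    found = i < len(node.keys) and node.keys[i] == key
    return (node.values[i] if found else None), hops

leaf1 = Node([10, 20], values=["a", "b"])
leaf2 = Node([30, 40], values=["c", "d"])
root = Node([30], children=[leaf1, leaf2])
print(btree_search(root, 20))   # ('b', 1): one dependent hop to the leaf
```

A real index is several levels deep, so each row lookup chains several such dependent accesses.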
Several of the images were updated.
It would help if Intel were so kind as
to provide L3 latencies for Xeon E5 v4 LCC, MCC and HCC models.
What memory latencies for local and remote node in E5 v4?
How much latency does the SMB in the Xeon E7 add?
Note that Xeon E3 and client side processors use unbuffered memory.
While Xeon E5 can use unbuffered memory, these are currently limited
to 16GB DIMMs while registered memory is available in capacities to 64GB.
The Xeon D is for specialized embedded applications and not suited for the
single-socket database server. It has only 4 DIMM sites?
Supermicro has an Ultra product line targeting specialized applications.
One of the features they call
They claim that with very high quality design and components,
it is possible to reduce (memory?) latency via low jitter.
I would like to know more about this.
But the only option is Xeon E5 dual-socket, and I am more interested in single-socket.
The emphasis seems to be on RHEL, and high frequency trading?
There are examples for determining which processor socket the NIC is attached to,
and whether a thread is running on a core in that socket.
These hardware organization detection tools really should be incorporated into Windows.
I have tried to use the WMI API from C#, but some things require coding in C
or possibly assembly?
It was stressed that round-trip memory latency impacts transaction processing databases.
Column-store DW avoids this problem by emphasizing marching through memory sequentially.
The main intent of Hekaton memory-optimized tables was to eliminate the need for locks.
But the other part was the use of a hash index, which happens to reduce
the number of memory round-trip operations.
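A toy sketch (not Hekaton's actual structure) of why a hash index reduces round-trips: the bucket address is computed from the key itself, so roughly one dependent memory access replaces the one-per-level chain of a b-tree.

```python
# Toy hash index: the bucket is computed directly from the key, so there
# is ~1 dependent memory access instead of one per b-tree level.
class HashIndex:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def lookup(self, key):
        # One address computation, then one bucket fetch -- one memory
        # round-trip if the chain is short.
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

idx = HashIndex()
idx.insert(20, "b")
print(idx.lookup(20))   # b
```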
Intel Core i7 Xeon 5500 Series
Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local Dram ~60 ns
Remote Dram ~100 ns
D. Levinthal, Intel performance analysis paper
U. Drepper, What Every Programmer Should Know About Memory
A couple of months ago, I got a pair of Intel XL710QDA2 Ethernet Converged Network Adapters, basically the fancy way of saying 40 Gigabit Ethernet that does other stuff too. I also bought the Intel QSFP+ Twin-axial cable. At that point in time, I might like to have opted for the newer Intel Omni-Path 100Gbps (56Gbps on PCI-E x4) which came out in Q4 2015 over the XL710 which came out in Q3 2014. But Intel Omni-Path only supports Linux at this time with a heavy emphasis on HPC applications, so it is unclear as to whether they will support Windows in the future. Another option was Infini-band, which other people have written about in conjunction with SMB 3.0 in Windows Server 2012.
The XL710QDA2 (dual-port) adapters were $500 each, and the cable $130. The single port XL710QDA1 was $420, so I thought having the second port was worth an extra $80 even though I am not using it yet. A 40GbE switch seems to cost about $500 per port, which I think is reasonable. There is the Cisco Nexus 9332PQ for $17K. I thought there was another vendor, Edge-Core? with a 20-port, but it is difficult searching on 40 GbE because of all the junk results. In any case, I do not believe that there is a small-office (8-12-port) 40 gigabit Ethernet switch. So, my current usage is connecting two systems directly with the QSFP+ twin-axial cable, and using the embedded 1GbE for internet connectivity.
The more readily apparent reason for high-speed networks in a SQL Server environment is to move backups from one machine to another. For my perfunctory dig at storage system vendors, the SAN vendor will say: our million-dollar product has software so you don’t need to move large backups over the network.
A less obvious reason is to achieve low latency in support of serialized high-volume calls. Transaction processing might not need high network data rates, and can probably generate high call volume on 1 GbE from multiple concurrent connections. However, there could be situations that require the highest possible call volume on a single thread and connection. Of course, in this case, the first course of action is to employ a back-to-back connection, bypassing any switches whatsoever. And definitely not have more than one switch in the path from application to database if a direct connection is not possible.
Potentially, we might think that high volume ETL could benefit from high data rate networks, but I believe that SQL Server single thread write operations are more limiting than 1 gigabit Ethernet. But perhaps more recent versions of SSIS might have better tools for parallelism.
Trying out the Intel 40GbE adapter
Back in August, when I got the XL710QDA, my test systems were running Windows Server 2016 TP5, for which Intel had a beta driver. The adapter was recognized by the operating system and the beta driver installed with no issues. The Intel documentation on firmware updates was very complicated, so I did not apply any. In this setup, even though the driver installed, I could not ping across the 40GbE network. Intel support was of the opinion that the back-to-back connection should work. No diagnostic tools were provided for examining the software stack for the XL710.
Then in early October, there was a substantial hotfix for Windows Server 2016 TP5, after which SQL Server connections were going over the 40GbE. A ping test worked one way but not the other. In any case, Windows Server 2016 RTM was out about this time, so I installed both machines to RTM, applied OS hotfixes, and updated to the release driver. Now everything seems to work.
In bulk copies, the network transfer rate was high, and serialized network calls were also faster than on the 1GbE network (which was via a switch, not back to back). When I get a chance, I will look at how much round-trip latency reduction I would have gotten on 1GbE with a direct back-to-back cable. I did this about 15 years ago when Gigabit adapters came down to $300 but switches were still very expensive. If anyone is interested, I investigated whether a cross-over cable was needed for gigabit:
Gigabit Ethernet Direct Connect, i.e., no special cross-over cable is needed.
10 gigabit Ethernet
For people not ready to jump directly to 40 Gigabit Ethernet, consider 10 Gigabit. Adapters are now in the $300-400 range, and switches are in the $100 per port territory. The main consideration is which interface to standardize on. Two options for short distances are 10GBase-T and SFP+ Direct-Attach. It appears CX4 is not getting traction? One might be tempted to default to twisted pair, CAT6a or CAT7, for 10GBase-T. Cables for CAT6a and 7 are really cheap, $3-10 for short length pre-made CAT6a and $7-17 for CAT7, while SFP+ cables are $50-100.
However, various sources mention that 10GBase-T latency is higher than for the other options. And in certain situations, serialized network calls for example, latency is the primary criterion. In large file transfers, latency is not as important.
Netgear actually offers small office 10 Gigabit Ethernet switches. The XS7xx models have 8, 12, 16, and 24 10GBase-T ports, with some models having additional SFP+ uplink ports. The M4300 models include one with 8 10GBase-T + 8 SFP+ and a 12+12. I did not see any that were all or mostly SFP+, but the mixed 10GBase-T + SFP+ models should be acceptable. It is possible to use the Intel XL710 with a QSFP to 4-port SFP breakout cable.
Is the Intel 40GbE a dead-end? Intel is going their own way at 100Gbps with Omni-Path, which currently does not support Windows.
Mellanox has a new 100GbE that also does NVMe over fabric?
Mellanox is offering a rebate on a one-time purchase of their 40/100GbE switches, the SN2100 16-port?
Price might be $6600-7200? with 50% rebate.
This puts it at $200+ per port at the rebate rate, and perhaps $450 per port normally?
Last month, Samsung announced
that their 960 Pro,
(PCI-E x4 gen3) M.2 NVMe SSD would be available in October
at 512GB, 1TB and 2TB capacities.
This is a high-end PC and workstation product.
There is a similar but separate product line (SM961?) for OEM customers.
The 960 Pro had already been mentioned earlier this year.
This week, all the hardware sites published product reviews,
and others as well.
All the previous PCI-E x4 gen3 NVMe SSDs were rated between 2,000-2,500MB/s in large block read.
The 960 Pro is rated for 3,500MB/s read.
This is pretty much the maximum possible bandwidth for PCI-E x4 gen3.
Each PCI-E gen3 lane is 8Gbit/s, but the realizable bandwidth is less.
In earlier generation products,
the upper bound was about 800MB/s realizable per 8Gbit/s nominal signaling rate.
Presumably there was a reason why every PCI-E x4 product was
in the 2,000-2,500MB/s range.
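As a sanity check on these numbers, a quick sketch (the 128b/130b factor is PCI-E gen3's line coding; the 800MB/s per lane figure is the rule of thumb cited above):

```python
# PCI-E gen3: 8 GT/s per lane with 128b/130b encoding.
SIGNALING_GTPS = 8.0
ENCODING = 128.0 / 130.0

raw_lane_MBps = SIGNALING_GTPS * 1000.0 / 8.0 * ENCODING  # ~985 MB/s per lane
x4_raw_MBps = 4 * raw_lane_MBps                           # ~3938 MB/s for x4

# Rule of thumb from earlier products: ~800 MB/s realizable per lane
# after packet header and flow-control overhead.
x4_realizable_MBps = 4 * 800                              # 3200 MB/s

print(round(raw_lane_MBps), round(x4_raw_MBps), x4_realizable_MBps)
```

The 960 PRO's 3,500MB/s rating sits between the 3,200MB/s rule-of-thumb bound and the ~3,900MB/s raw x4 limit, i.e. close to the maximum possible.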
It could be that these were 8-channel controllers and the NAND interface was 333-400MB/s.
Even though 8 x 400MB/s = 3,200MB/s, it is expected that excess bandwidth is necessary
on the downstream side.
There could be other reasons as well, perhaps the DRAM for caching NAND meta-data.
Intel had an 18-channel controller, which produced 2,400MB/s in the P750 line, and 2,800MB/s in the P3x00 line.
It is very annoying that the more vertical SSD OEMs decline to disclose meaningful details
of the controllers in their SSDs.
The 960 PRO uses the new Polaris controller, for which few details beyond the Read/Write IOPS and bandwidth
specifications are published.
The 960 PRO is shown to have the Polaris controller and DRAM in a single package.
There are 4 NAND packages, each with 16 stacked 512Gbit die for the 2TB product.
I am guessing that the Polaris controller has 8-channels,
at 8-bit per channel with each package connecting to 2 channels?
But there are other possibilities, such as 4 x 16-bit channels.
The reason is that the previous generation NAND were 333 or 400MB/s.
The most recent NAND from Micron is listed at 533MB/s.
In order to support 3,500MB/s on the upstream interface,
there needs to be more than that amount on the downstream side combined.
While Samsung is the leader in V-NAND, it seems more likely that the 960 has 8 channels 8-bit
(or 4 channels, 16-bit) at 533MT/s
and not 4 channels, 8-bit 1066MT/s.
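The channel arithmetic can be sketched as follows (the channel counts and widths are the guesses discussed above, not disclosed specifications):

```python
UPSTREAM_MBps = 3500  # 960 PRO rated large-block read bandwidth

def downstream_MBps(channels, width_bits, mt_per_sec):
    # aggregate NAND-side bandwidth: channels x bytes-per-transfer x MT/s
    return channels * (width_bits // 8) * mt_per_sec

candidates = {
    "8 ch x  8-bit @  533 MT/s": downstream_MBps(8, 8, 533),
    "4 ch x 16-bit @  533 MT/s": downstream_MBps(4, 16, 533),
    "4 ch x  8-bit @ 1066 MT/s": downstream_MBps(4, 8, 1066),
    "8 ch x  8-bit @  400 MT/s": downstream_MBps(8, 8, 400),  # prior-gen NAND
}
for name, bw in candidates.items():
    print(f"{name}: {bw} MB/s ({bw - UPSTREAM_MBps:+d} vs upstream)")
```

Note that the prior-generation configuration tops out at 3,200MB/s aggregate, short of 3,500MB/s, which is consistent with the earlier drives being rated at only 2,000-2,500MB/s.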
Between the two likely scenarios, 8-channel 8-bit should support higher IOPS than 4-channel 16-bit?
Both the 2TB and 1TB models have read specs of 440,000 IOPS while the 512GB model is 330,000 IOPS.
It could be that the 1TB is 4 packages, 16 chips stacked with 256Gb chips, and the 512GB model has 8 chips stacked. It is also possible that the 1TB has packages stacked 8 high with the same 512Gb die, just that there is no reduction in IOPS, while the 512GB has 4-chip stacks?
I would think that 2 x 960 PROs in a single socket Xeon E3 quad-core would be a great match.
My current test system is a single socket Xeon E5 10-core with four Intel 750 SSDs.
I will probably not outfit another test system until the Xeon E5 v5 comes out next year,
in which I may opt for the 960 PRO.
Oh yeah, the 960 PRO is not an enterprise product, so it should probably not be used in critical production environments, but should be OK for test.
It is possible to buy a M.2 to PCI-E adapter, but I would like an adapter that can accommodate 2 M.2 devices in one PCI-E x8 slot, without the use of a PCI-E switch if possible.
HP has an adapter for 4 M.2 in a x16 slot that only works in HP systems.
Last year I bought a Dell XPS9550 with the then-new Skylake processor.
It came with a 512GB Samsung PM951, which was an older model rated for 1GB/s sequential, while the slightly more expensive SM951 was rated for 2GB/s.
This drive died after 3 months. Dell sent a replacement, but I elected to go out and buy the 950 PRO myself, rated at the same 2GB/s as the SM951, though whether the two are otherwise identical is unclear.
The failure of one device is not sufficient to make an assessment, but I have dozens of OCZ Vertex and Vectors, and several Intel SSDs of various flavors, all running for several years now without failure.
Earlier this year, HPE and Microsoft sponsored an article,
The Importance of Benchmarking,
in SQL Server Pro.
While true, it is also apparent that there has been little interest in benchmarks
within the DBA/IT community over the last several years.
There are many reasons for this,
one of which is the emphasis on the benchmark result as the important takeaway.
Today most people expect that systems are very powerful,
and probably know that performance at the processor core level
has improved only incrementally in the recent years.
The main avenue is the increasing number of cores with each generation,
currently at 22 in Xeon E5 v4 and 24 in E7 v4.
The TPC benchmarks are generally well designed,
with rules meant to prevent the publication of a distorted result.
But every application has its own characteristics,
different table organizations,
and transactions with SQL logic
bearing no resemblance to a benchmark or another application.
As such, a benchmark result does not translate directly to a different set of circumstances.
What is of great importance are the problems that were encountered in the benchmark,
and the measures that were taken to address each problem.
The topic of this article is Inserts to a table with a sequentially increasing index,
commonly in the form of an identity column,
but other implementations are possible as well.
At a New York City SQL Server Users Group meeting, Thomas Grohser,
Expert SQL Server Performance Engineering,
mentioned that multiple client (connections)
doing single row inserts to a table
with clustered indexed on an identity column
had extremely poor performance running on a 4-socket system.
In that case, this was about 6,000 rows per second.
For inserts to a table with a clustered index on a uniqueidentifier,
or with a compound key not leading with the identity column,
performance was 95,000 (calls and) rows per second.
Memory-optimized tables could achieve 1.7M rows per second.
But it might be a good idea to consult with Thomas on this.
The assessment was that this was a contention issue due to the high latency
between processor sockets.
All modern systems with more than one processor socket have
Non-Uniform Memory Access (NUMA) architecture.
For Intel processors, this goes back to Xeon 5500 (2009) and 7500 (2010) and 2003/4 for AMD Opteron.
An Insert to a table with an index on an identity column,
regardless of whether the index is clustered or nonclustered,
involves acquiring an exclusive lock on the last row of the last page,
as well as accessing the memory for that page and row.
When there are multiple concurrent threads with affinity on cores
in different sockets of a multi-socket system,
the memory involved bounces between the L3 caches of each socket.
While it is inefficient for memory to bounce across sockets,
the magnitude of the impact on Inserts is stunning.
Testing Insert Performance
The test environment here is a single socket Xeon E3 v3, quad-core, hyper-threading enabled.
Turbo-boost is disabled for consistency.
The software stack is Windows Server 2016 TP5, and SQL Server 2016 cu2 (build 2164).
Some tests were conducted on a single socket Xeon E5 v4 with 10 cores,
but most are on the E3 system.
In the past, I used to maintain two-socket systems for investigating issues,
but only up to the Core2 processor, which were not NUMA.
The test table has 8 fixed length not null columns,
4 bigint, 2 guids, 1 int, and a 3-byte date.
This adds up to 70 bytes.
With page and row overhead, this works out to about 100 rows per page at 100% fill-factor.
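The rows-per-page figure can be verified with a back-of-envelope calculation. The per-row overhead bytes below are my assumptions for a fixed-length-only row (status/length header, column count, null bitmap, plus the slot array entry), which lands at roughly the 70 data bytes and 100 rows per page cited:

```python
import math

PAGE_BYTES = 8192       # SQL Server page size
PAGE_HEADER = 96        # page header, leaving 8096 bytes for rows + slot array
SLOT_ENTRY = 2          # per-row entry in the slot array

fixed_data = 4 * 8 + 2 * 16 + 4 + 3      # 4 bigint + 2 guid + 1 int + 3-byte date
row_overhead = 4 + 2 + math.ceil(8 / 8)  # header + column count + null bitmap
bytes_per_row = fixed_data + row_overhead + SLOT_ENTRY

rows_per_page = (PAGE_BYTES - PAGE_HEADER) // bytes_per_row
print(fixed_data, rows_per_page)  # ~71 data bytes, roughly 100 rows per page
```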
Both heap and clustered index organized tables were tested.
The indexes tested were 1) single column key sequentially increasing
and 2) two column key leading with a grouping value followed by a sequentially increasing value.
The grouping value was chosen so that inserts go to many different pages.
The test was for a client to insert a single row per call.
Note that the recommended practice is to consolidate
multiple SQL statements into a single RPC, aka network roundtrip,
and if appropriate, bracket multiple Insert, Update and Delete statements
with a BEGIN and COMMIT TRAN.
This test was contrived to determine the worst case insert scenario.
On the single socket Xeon E3 quad-core system, the heap table with no indexes
and both the heap with one nonclustered index and clustered index
with the two column key
could support an insert call rate of 70-72,000 per sec,
both rows inserted and number of network roundtrips.
The heap table with one nonclustered index and table with only a clustered index,
both with the index key being a single sequentially increasing column
such that inserts from all threads/connections go to the last page
could support an insert call rate of about 40,000 rows per sec.
The issue here is multiple concurrent connections
making calls to Insert into a single table
with a sequentially increasing index, clustered or nonclustered,
resulting in contention between threads for the last page and row.
There is a substantial 42% performance degradation on a single socket system,
with memory in a common L3 cache, but where the relevant memory locations
are moved between the L2 caches of different cores after exclusive access is acquired.
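A minimal calculation of the degradation from the measured call rates above:

```python
# Insert call rates measured on the single-socket quad-core E3 system.
distributed_rate = 70_000  # low end of the 70-72K/sec range, inserts spread over many pages
last_page_rate = 40_000    # all inserts contending for the last page

degradation = 1.0 - last_page_rate / distributed_rate
print(f"{degradation:.1%}")  # ~42.9%, the ~42% degradation cited above
```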
While the impact is large,
even this may not be a crippling effect depending on the circumstance.
In a multi-socket system, memory must now also be moved between L3 cache of
different processor sockets, which has higher latency.
The impact is expected to be more severe as the number of sockets increases.
The presumption here is that the limitation is in the ability to acquire exclusive
locks between threads on different cores, potentially in different sockets.
If each call inserted more than one row, the call rate might decrease only slightly,
resulting in a higher row insert rate.
Hence the explicit emphasis on the call rate as well as the row insert rate.
Note also that we should probably be able to insert into multiple tables
each at more or less the same rate as inserts to a single table
until the limitation becomes the log write thread or overall CPU.
This test is not the same as the test by Thomas Grohser
in terms of table structure, client application, and hardware etc.
Still, the expectation is that results would not be dramatically different.
My tests on this matter are incomplete, and more work is necessary.
As soon as I can get access to two and four socket systems, I will try to run this same test
(hint to Intel or someone).
In preliminary tests on the single socket 10-core system, the Insert call rate
with rows going to different pages was over 100,000 per sec for both heap and clustered,
and around 30,000 per sec with a sequential index, nonclustered or clustered.
We might infer that this is a contention issue in which performance degrades
with an increasing number of cores, for both single and multi-socket systems.
In the case of multiple sockets, the degradation might be more severe?
This issue might be more complex than a simple matter of the higher latency between sockets,
but other people might have better tools to conduct such an investigation.
There are work-arounds for this issue.
One is to implement a multi-column clustered index key such that inserts are distributed over many pages.
It is also necessary to not have even a nonclustered index on the identity column,
which may have implications.
Another work-around is simply to deploy on a single socket system.
This is actually good advice for perhaps a majority of situations.
The Xeon E3 with 4 cores is perhaps twenty times more powerful than
a 4-socket Pentium II Xeon from 1998.
If a quad-core E3 were not enough, probably in memory or PCI-E,
then a single socket Xeon E5 with up to 22 cores should definitely be considered
before reflexively defaulting to a two-socket system
without quantitative substantiation of any performance assumptions.
One of the problems today is that infrastructure people have bought into
the vendor driven arguments for virtual machines and/or cloud,
but apply it pervasively even into the mission-critical systems,
while completely disregarding any special requirements.
The typical platform is a 2-socket.
There are other options too.
We could try affinitizing calls that insert into the critical table,
Map TCP IP Ports to NUMA Nodes.
Benchmarks and Inserts
Before closing this topic, given that it is so important,
we might ask whether the benchmarks shed light on this matter.
The TPC-C benchmark has the highest insert volume into a single table.
However, TPC-C does not have this problem because all tables are organized
(clustered index) by warehouse followed by district.
This reflects a brick and mortar business,
where customers place orders mostly in their own warehouse and district.
If we were curious anyhow, TPC-C benchmark results are in transactions per minute (tpm-C).
The highest SQL Server result is 1.8M tpm-C, corresponding to 30K new orders per second.
Technically, the order line table has more rows inserted.
Each transaction inserts one row to orders
and on average 10 rows to the order line table with a single call,
so the Insert statement call volume is the same for both tables.
No TPC-C benchmark results were published for SQL Server version 2008 or later,
even though a handful of results continued to be published for version 2005
after 2008 RTM.
The reason for this is that SQL Server 2008 compressed log file writes
to better support database mirroring,
a very important feature across the broad base of users.
Log writes were single-threaded until version 2016.
This meant that SQL Server write performance could be limited by the ability of a
single core running the log writer thread to compress log entries.
Presumably there was some survey to suggest that this would be acceptable
as there is no option to disable log write compression?
The TPC-C benchmark has a curious statement when updating the stock table
in the output clause that touches but does not change several fat columns.
This part serves no meaningful business purpose,
but has the effect of grossly inflating the log write size far beyond
what should be necessary.
Presumably other RDBMSs do not have mandatory log compression.
And so the highest TPC-C result on a system with Intel processors
is 5M tpm-C (8 sockets, 80 cores total, 160 threads)
corresponding to 84000 inserts to orders per second.
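The conversion from tpm-C to insert rate is straightforward, since each new-order transaction inserts one row into the orders table (and about ten into order line):

```python
def tpmC_to_orders_per_sec(tpmC):
    # tpm-C counts new-order transactions per minute
    return tpmC / 60.0

# Highest SQL Server result: 1.8M tpm-C -> 30K orders inserted per second.
print(round(tpmC_to_orders_per_sec(1_800_000)))
# Highest Intel-based result: ~5M tpm-C -> roughly 83-84K orders per second
# (the 84,000 above reflects the unrounded benchmark figure),
# with about 10x that many order line rows inserted in the same statement.
print(round(tpmC_to_orders_per_sec(5_000_000)))
```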
The TPC-E benchmark does have 3 tables that are unique on the (big)integer Trade Id column,
and two more tables that have a two column key leading with Trade Id.
Is there a magical tuning trick that can evade the insert to a table
with a sequentially increasing index problem?
TPC-E works around this by not using an identity column in the tables
where one might think the primary key column should be an identity.
In fact, it uses a scalar function and a table to assign trade Id values.
The TradeOrderFrame4 procedure first calls the function Get_Next_T_ID
with an output variable to retrieve a Trade Id.
The function uses the table TID_RANGES to store blocks of ID values.
Each block, represented by a row in TID_RANGES, is assigned
to a specific session id (SPID, accessed with @@SPID).
Every time the function is called,
a column for the current Id value for the caller's spid is returned
and the row is updated to the next Id value.
This is probably an acceptable implementation in the TPC-E benchmark,
where all connections are running transactions.
In a real-world environment, there might be very many open connections,
not all of which process transactions, but in a common connection pool.
I might suggest using spid modulus the total number of logical processors.
Or perhaps spid divided by 2 first, then modulus the number of cores when HT is active.
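The mechanism can be sketched in Python (table, class, and function names are illustrative, not the actual TPC-E kit code; the slot mapping implements the spid-modulus suggestion, and the block size is an assumption):

```python
BLOCK_SIZE = 1_000_000  # assumed size of each pre-assigned ID block

class TidRanges:
    """Per-slot ID blocks, in the style of TPC-E's TID_RANGES table:
    each session works within its own block, so concurrent sessions
    never contend on a single last-ID row."""
    def __init__(self, num_slots):
        # slot i owns IDs [i * BLOCK_SIZE, (i+1) * BLOCK_SIZE)
        self.current = {i: i * BLOCK_SIZE for i in range(num_slots)}
        self.num_slots = num_slots

    def slot_for_spid(self, spid, ht_active=True):
        # suggested mapping: divide by 2 when hyper-threading is active,
        # then take the modulus over the number of slots
        s = spid // 2 if ht_active else spid
        return s % self.num_slots

    def get_next_t_id(self, spid):
        slot = self.slot_for_spid(spid)
        tid = self.current[slot]
        self.current[slot] = tid + 1  # the UPDATE in the real function
        return tid

tids = TidRanges(num_slots=8)
print(tids.get_next_t_id(spid=51), tids.get_next_t_id(spid=51))
```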
All of this is in the details as required by TPC full disclosure rules.
But it is buried in the supplement files with thousands of extraneous files.
There is no explanation given as to why the trade Id column,
the clustered primary key of these three tables, is not an identity,
instead using a function that requires an update to maintain.
It is interesting this method was in TPC-E from the beginning.
The first TPC-E result was published in 2007
for a 16-socket system
with Xeon (7140M?) dual-core processors.
In that generation, there were probably 4 processors in each NUMA node.
The performance was 660 tpsE,
meaning 660 inserts per second to each of 3 tables unique on trade Id
plus inserts to two more tables.
The current 4-socket Xeon E7 v4 with 96 cores total is vastly more powerful.
Was the insert with a sequentially increasing index problem known back then?
If it occurred at such a low insert rate,
then how many people suffered this problem without knowing why?
The conclusion to draw is that yes, benchmarks are important.
However, what is important is not the result that vendors like to brag about,
but rather the many problems that were solved in producing a competitive result.
While all the methods are disclosed in the supporting files,
there is no accompanying explanation.
Some methods are simple to apply, like lock pages in memory.
Others must be worked into the database architecture,
which has deep ramifications if done after it is already in production,
or must be part of the client-side application, such as the TCP port affinity.
Some methods might have critically significant impact.
Others might be of just minor impact that is desired for benchmarks,
but have no important consequence in a production system.
I do not suppose anyone would be willing to apply the -x startup parameter
in a production environment, but it might be good to know the impact?
Good luck in deciphering which details are critically important to a specific situation
and which might just be used to get an extra 1% edge over a competitor.
The tests systems, client-load generator and server each had Intel I210 1GbE connected to a common switch.
A single thread connection could drive about 4K single row select RPC/sec, and 2K single row insert RPC/sec. A high level of saturation was achieved at 80 concurrent connections.
About 12 years ago, I had tested Xeon (Pentium 4) systems with 1GbE connected back to back (no switch), and this could drive 5K single row select RPC/s. I did not test the current 1GbE direct connect network round-trip rate.
Both of these systems also had a Intel XL710 40GbE with direct QSFP+ connection. At first this did not seem to work in Windows Server 2016 TP5 with the Intel beta driver.
Intel support was of the opinion that this should work but did not say definitively.
Then on Tuesday, with the latest Windows TP5 hotfixes, all of a sudden the load generator was connecting via the 40GbE instead of the 1GbE. There were some quirks in the behavior.
In any case, the 40GbE direct connect could drive 11K select RPC/s, and 6K insert RPC/sec on a single thread. Only 20 concurrent threads were necessary to achieve a high level of load, at which the 10-core Xeon E5 could support perhaps 40K single row inserts with a sequential clustered index.
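These single-thread call rates translate directly into per-call round-trip times (client processing + network + server processing combined):

```python
def us_per_call(calls_per_sec):
    # one thread issuing calls serially: time per call is the reciprocal
    return 1_000_000.0 / calls_per_sec

print(f"1GbE select (4K/s):   {us_per_call(4_000):.0f} us/call")
print(f"40GbE select (11K/s): {us_per_call(11_000):.0f} us/call")
print(f"40GbE insert (6K/s):  {us_per_call(6_000):.0f} us/call")
```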
I am currently in the process of changing over to Windows Server 2016 RTM + hotfixes,
which also seems to upgrade the XL710 driver from 1.3.115 to 1.5.59, so more to come.
Henrik provided the link,
SQL Server Latch Contention, which is why he mentioned that this is latch contention.
I would have thought it would be lock contention, but apparently it is latch contention according to MS?
I only took a quick look; MS identifies the problem with the degree of concurrency, more severe at higher concurrency. Between Thomas Grohser's report and my data, it should be emphasized that there is most probably a strong hardware NUMA or non-NUMA impact. But it would be preferable if one person ran the same test on 1-socket, 2-socket, and 4-socket systems.
Can the Intel Xeon Phi x200, aka Knights Landing, run SQL Server? It does run Windows Server 2016, so is there anything in SQL Server 2016 that would stop it from installing? Xeon Phi is designed for HPC, so it would not have been tested with SQL Server, but that does not confirm whether it will or will not work. If so, then this could be used to prove out some important theories.
The main line Intel processors used in the Core i7/5/3 and Xeon product lines, with recent generation codenames Haswell, Broadwell, Sky Lake and soon Kaby Lake, are heavily overbuilt at the core level. The Broadwell processor core is designed to operate at 4GHz if not higher.
The Haswell 22nm processor was rated for up to 3.5GHz base, 3.9GHz turbo at launch in Q2’13. In Q2’14, a new model was rated 4/4.4GHz base/turbo at 88W TDP. Both are C-0 step, so the higher frequency was achieved by maturity of the manufacturing process or cherry picking?
The Broadwell 14nm processor had a top frequency of 3.3/3.8GHz base/turbo 65W for desktop,
but perhaps this is because it was more focused at mobile than desktop?
(Curiously there is also a Xeon E3 v4 at 3.5/3.8GHz 95W and Iris Pro graphics).
The top Sky Lake 14nm processor was 4.0/4.2 GHz base/turbo at 91W.
With a single core under load, processor is probably running at the turbo boost frequency.
When all four cores are under load, it should be able to maintain the rated base frequency while staying within design thermal specifications, and it might be able to run at a boosted frequency depending on which execution units are active.
The latest Intel Xeons (regular, not phi) are the E5 and E7 v4, based on the Broadwell core.
There are 3 die versions, LCC, MCC, and HCC with
10, 15, and 24 cores respectively.
All of these should be able to operate at the same frequency as the desktop Broadwell or better, considering that the Xeon E5/7 v4 Broadwells came out one year after the desktop processors.
But Xeons need to be more conservative in their ratings, so a lower frequency is understandable.
The top Xeon 4-core model, E5-1630 v4, using the LCC die is 3.7/4GHz at 140W TDP.
The top 8-core is 3.4/4.0GHz, E5-1680 v4, also at 140W TDP.
The top 14-core (MCC die) is 2.2/2.8GHz 105W. The top 24-core (HCC die) is 2.2/3.4GHz 140W. So the Xeon E5 and E7 v4 processors are built using cores designed to operate electrically at over 4GHz, but are constrained by heat dissipation when all cores are active to a much lower value, as low as one-half the design frequency in the high core count parts.
The transistor density half of Moore’s law is that doubling the number of transistors on the same manufacturing process should enable a 40% increase in general purpose performance.
The implication here is that if a particular Intel processor (full-size) core is designed with a transistor budget to operate at 4GHz, then in theory, a processor core with one-quarter of that transistor budget should be comparable to the full-size core operated at one-half the design frequency, whatever the actual operating frequency of the quarter-size core is.
(Doubling the quarter size core to half-size yields 1.4X gain. Double again to full-size yields another 1.4X for approximately 2X performance going from quarter to full size).
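This is the square-root scaling often referred to as Pollack's rule: single-thread performance roughly proportional to the square root of the transistor budget. A quick sketch:

```python
# Pollack's rule: single-thread performance scales roughly as the
# square root of the transistor (area) budget.
def relative_perf(transistor_fraction):
    return transistor_fraction ** 0.5

quarter = relative_perf(0.25)  # 0.5: comparable to the full core at half frequency
half = relative_perf(0.5)      # ~0.71: the 1.4X step from doubling transistors
full = relative_perf(1.0)      # 1.0

# Four quarter-size cores in the area of one full-size core:
aggregate = 4 * quarter        # 2.0x the throughput of one full core
print(quarter, round(half, 2), full, aggregate)
```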
So the theory is that it might be possible to have 100 cores of one-quarter the complexity of the Broadwell core on a die of comparable size as Broadwell EX (456mm2), with adjustments for L2/L3 variations, and differences in the memory and PCI elements.
This is just what Xeon Phi, aka Knights Landing, appears to be. There are 72 cores in 36 tiles, operating at 1.3-1.5GHz base, 1.7GHz turbo.
The Xeon Phi x200 is based on the Silvermont Atom, but at 14nm.
A tile is composed of 2 Atom cores, with 4-way simultaneous multi-threading (SMT)
and 1M L2 cache shared between 2 cores.
(There is no shared L3? how is cache coherency handled?)
The Xeon Phi has 16GB MCDRAM and 6 memory channels capable of 115GB/s and 384GB max capacity (6x64GB). The MCDRAM can be used in one of three modes: Cache, Flat, or Hybrid.
There is no mention of the MCDRAM latency, only the phenomenal combined bandwidth of 400-500GB/s.
My expectation is that the off-die memory round-trip latency should be lower when the memory is in the same package as the processor, compared to the common arrangement where memory is outside the processor package.
This is because it should be possible to use really narrow wires to connect the processor to memory in a common package, so there should be fewer buffering circuits to amplify the signal current?
(Can some circuit designer speak to this please?)
This higher core count, higher threads on SMT is more or less comparable to IBM POWER, SPARC and even AMD Zen.
Transactional queries are essentially pointer chasing code: fetch a memory location, use its value to determine the next location to fetch.
This should run fine on a simpler core than 6/8-port superscalar Broadwell.
And have many dead cycles during the round-trip memory access latency,
implying SMT will work well (beyond the two threads per core in the main line Intel cores).
However, this may not be the best general purpose computing solution,
in that there are important single threaded tasks,
or tasks that are not massively parallelizable for which the existing powerful Intel core is the best.
My thinking is that a mix of a few powerful cores and many smaller cores is the right solution.
And that there should be a few smaller cores dedicated to special OS functions (interrupt handling and polling), in a blended asymmetric-symmetric arrangement.
Long ago, almost 20 years, there was legitimate concern on whether servers had sufficient capability to drive transaction processing requirements for large businesses.
Then for a long period, perhaps from the late 1990’s to 2012, the better choice for most mid to large businesses was to simply buy a 4-way (socket) system.
Larger systems were based on a non-uniform memory access (NUMA) architecture that had unusual characteristics and vendors provided almost no information on how to work around issues that could cripple such systems.
Towards the end of the second era of server sizing, processors (at the socket level) became ridiculously powerful such that properly tuned systems ran at very low overall CPU utilization.
However, 2-socket systems in this period did not have adequate memory or IO capability, unless it was a Xeon 7500 or later E7 series system with 2 sockets populated.
And even then, the structure of SQL Server per processor licensing meant that many people opted for the 4-way system regardless of requirements.
In 2012, Intel finally put powerful processors (Sandy Bridge, up to 8-cores) with larger memory and IO capability into the Xeon E5 line (there were both 2 and 4 socket versions). This coincided with SQL Server 2012 per core licensing. Now there was a stronger technical justification and financial incentive to not automatically opt for a 4-socket system with the most number of cores.
It would seem reasonable to make some effort to first determine the total number of processor cores that will meet requirements with some headroom.
Then determine the number processor sockets as more processor sockets means more memory sockets (also considering the difference in DIMM sockets per processor between Intel Xeon E5 and E7 processors).
It is unclear whether anyone actually engaged in the technical analysis versus just making the decision strictly on budget goals.
Today, it is time to consider the astonishing next step, that a single socket system is the best choice for a transaction processing systems.
First, with proper database architecture and tuning, 12 or so physical cores should be more than sufficient for a very large majority of requirements.
We should also factor in that the second generation hyper-threading (two logical processors per physical core) from Nehalem on has almost linear scaling in transactional workloads (heavy on index seeks involving few rows). This is very different from the first generation HT in Willamette and Northwood which was problematic, and the improved first generation in Prescott which was somewhat better in the positive aspects, and had fewer negatives.
A single socket Intel Xeon E5 processor supports 12 DIMM sockets for 384GB with 32GB DIMMs, or 768GB with 64GB DIMMs which is now at a moderate (50%) premium per GB over the 32GB DIMM. In the old days, DBA’s were taught that memory was precious in reducing disk IO to a tolerable level. That was back when a total system memory of 1GB was considered very large.
I noticed that after common server memory configuration reached 32GB memory, disk IO could be brought to a very manageable level, well under the IOPS possible with 100-200 15K HDDs.
Today, all-flash storage is practical.
Definitely in direct-attach storage, and maybe in SAN based storage
depending on the degree of price gouging from your SAN vendor.
So while it might be possible to show that TB-plus memory reduces IOPS from 10,000 to nil, the reality is a modern storage system correctly configured for database applications can easily support 100,000 IOPS if not 1M (which would require NVMe to be practical).
In this case, we should find that memory requirements are far lower than the memory capacity of recent generation 2 and 4-socket systems.
Why not continue to spectacularly over-configure the server system when even SQL Server per-core licensing is only a small portion of the costs in deploying a critical line-of-business application? The big bosses may not question the costs of a 2-socket system or perhaps 4-socket system, even factoring in that there may be several environments between production, DR, QA and development. But if performance is inadequate, it may cost you your job.
The answer is performance. Today, transaction processing performance is heavily dependent on serialized round-trip memory access latency. The term is pointer chasing in that a CPU instruction accesses memory to determine the next memory location to access. No amount of L2/L3 cache can hide this, nor is it possible to pre-fetch memory. (What Intel needs to do is increase Hyper-threading beyond two logical processors per physical core. IBM Power is 4 and SPARC is 8?).
For a processor with integrated memory controllers (AMD from Opteron on, Intel from Nehalem on), local node memory access round-trip latency is probably 50ns. For memory on an adjacent processor, a 1-hop remote node, it is perhaps 100ns. This information is absolutely crucial in understanding system level performance for transactional databases, and yet vendors try to not mention it, instead stressing metrics that are mostly irrelevant.
In a single socket system, 100% of memory accesses are local node, because there is only 1 node. In a two socket system, the expectation is that memory accesses are split 50-50 between local and remote node. In principle, it is possible to devise some strategy such that a higher percentage is local, but SQL Server does not make it practical to implement such a scheme. In the Intel Xeon E7 4-socket systems, 1 out of 4 memory nodes is local, and there are 3 memory nodes 1-hop away. In an 8-socket system, 4 nodes are 2-hop away.
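A back-of-envelope model makes the scaling visible. This assumes uniform memory placement across nodes and the round-trip figures above; the 150ns 2-hop number is my assumption, not a measured spec:

```python
# Blended round-trip latency under uniform memory placement, using
# 50ns local, 100ns 1-hop remote, and an assumed 150ns 2-hop remote.
def avg_latency_ns(local, one_hop, two_hop=0):
    nodes = local + one_hop + two_hop
    return (local * 50 + one_hop * 100 + two_hop * 150) / nodes

print(avg_latency_ns(1, 0))       # 1-socket:  50.0 ns
print(avg_latency_ns(1, 1))       # 2-socket:  75.0 ns
print(avg_latency_ns(1, 3))       # 4-socket (E7): 87.5 ns
print(avg_latency_ns(1, 3, 4))    # 8-socket: 118.75 ns
```

So going from one socket to two costs roughly 50% in average latency on this model, before counting cache-coherency traffic.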
An argument in favor of multi-socket systems is that the number of memory channels scales with the number of sockets. However, in transaction processing, it is memory round-trip latency that dominates, not memory bandwidth. (It is a factor that more memory channels allow more memory transactions in flight, but it is unclear whether this offsets the increase in memory latency.)
If someone has reasonably comparable systems with 1, 2 and 4 processor sockets, all of the same processor generation, the impact of the number of sockets can be tested.
I am also of the opinion that SQL Server cannot make meaningful use of all the processor cores in a top-of-the-line 2-socket system (36 cores for Xeon E5 v3), let alone a 4-socket (72 cores), due to most write operations (data and log) being single-threaded. But I would like to hear opinions on this. Is this addressed in SQL Server 2016 to a degree?
The above applies mostly to transaction processing applications.
It is reasonable to suppose that decision support applications could use both the greater number of processor cores and the greater number of memory channels in a multi-socket system. However, it is unclear whether SQL Server is effective in implementing parallel execution plans when the degree of parallelism is greater than the number of cores in one socket.
I would think that the strategy of partitioning and aligning compute to local node memory from (formerly) PDW, applied to a single SQL Server instance, treating each processor socket as though it were a node (which it is), could dramatically improve high-DOP parallel execution plans. But Microsoft does not seem interested in this approach.
I am presuming that column store is better capable of utilizing multi-socket, as it is definitely not encumbered by serialized round-trip memory latency.
The Dell PowerEdge R730 can be purchased with 1 or 2 of 2 sockets populated. The R930 must have 2 or 4 of 4 sockets populated. It appears that the premium from 32GB to 64GB memory modules might be because the 32GB is an RDIMM and the 64GB is an LRDIMM. The 2400MT/s 32GB RDIMM is $558, and the 64GB LRDIMM is $1483.
For this differential, I would stay single socket, opting for the more expensive 64GB (if necessary) over 2 sockets using the less expensive 32GB, even though I am giving up the 4 extra memory channels.
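For what it is worth, at the Dell prices quoted above the 64GB per-GB premium works out closer to a third than to the 50% figure I cited earlier:

```python
rdimm_32_per_gb  = 558 / 32       # Dell 32GB RDIMM, price per GB
lrdimm_64_per_gb = 1483 / 64      # Dell 64GB LRDIMM, price per GB
premium = lrdimm_64_per_gb / rdimm_32_per_gb - 1
print(f"${rdimm_32_per_gb:.2f}/GB vs ${lrdimm_64_per_gb:.2f}/GB, "
      f"a {premium:.0%} per-GB premium")   # about 33%
```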
If the plan is to implement memory-optimized tables in 2016 (which supports more than 256GB? and has fewer other restrictions) and more than 768GB system memory is needed, then go ahead with 2-sockets.
But I would try to get my transaction processing DB under the single-socket limit. In the old days, we moved the old data out to the DW system.
For a long time, I have refrained from endorsing the use of join and query hints. The reason is that there is no such thing as a true query hint in SQL Server, except for a few simple items. The key options listed under Microsoft SQL Server Hints (Transact-SQL) that affect join order, join type, and index usage are not really hints.
They are in fact directives: the specified option is employed if possible. But join order optimization is then essentially disabled, and the execution plan joins tables in the order as written, the exception being EXISTS and NOT EXISTS clauses. It is as if the query optimizer is hyper-sensitive and its feelings are hurt when someone impudently suggests a hint. So if a hint is specified, good luck to you. You are on your own.
Microsoft documentation is obviously very thorough on this matter with the following:
“Because the SQL Server query optimizer typically selects the best execution plan for a query, we recommend only using hints as a last resort for experienced developers and database administrators.”
But the fact is that there are complicated queries for which it is difficult to estimate the number of rows, especially past the source tables into the intermediate steps. The query optimizer then does not produce a good plan, because the row estimate is grossly erroneous.
This is aggravated by the fact that the plan costs employed by the query optimizer reflect a fixed IOPS model based around four 1995-era 7.2K hard disks, with no consideration for the actual capability of the storage system or whether the data is entirely in memory.
So there are times when it is necessary to override the query optimizer. I will not go into the details of query optimization here. Back around 2001, I reverse engineered the execution plan cost formulas for many of the fundamental operations, without bothering to explain the operations or other aspects of query optimization. Since then, Paul White, Benjamin Nevarez, and others have written very detailed papers on query optimization. Roughly, the strategy is a balance of 1) table access efficiency, 2) join efficiency, and 3) reducing rows as quickly as possible.
Because join order optimization is disabled when a join or query hint is specified, the key is to write the query to produce a specific join order. There is a method of doing so. See Inside Microsoft SQL Server 2008: T-SQL Querying by Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, and Steve Kass on the topic of bushy plans.
Once the correct join sequence is found, most of the join types might already be good. Perhaps a few join hints might be necessary. Join and query hints do not cause the query optimizer to disable index selection, so index hints should not be needed as frequently.
This is a much deeper topic than I can discuss at this point in time. So why am I bringing it up? Today there was a news report on the cause of the AirAsia flight 8501 crash last December. For details, see
the CNN AirAsia crash report.
In brief, there was an equipment malfunction that put the airplane outside of the normal flight envelope. It should have been possible for the flight crew to recover, but they did not do so.
It further turns out: “because the manual provided by the plane's manufacturer said the aircraft, an Airbus 320, was designed to prevent it from becoming upset and therefore upset recovery training was unnecessary.”
In our world, we are told the query optimizer is "typically" best, hence we should not have to do manual query optimization, hence we do not need to learn it. So when we do need to do manual query tuning, we do not know how, because we never learned.
This is BULLSHIT! I don’t give a damn how good the query optimizer is “typically.” A serious principal DBA needs to learn how to do query optimization so that he/she can do so when the situation calls for it. The starting point is the bushy join style in the Itzik Ben-Gan book.
Here is the link to my previous post on bushy joins; notice that at the time I put caveats on it.
I am not saying go for broke on join/query hints, just that we need to learn to shape the join sequence.
Rob Farley had a great note in the comments. I had seen that query style before, in TPC benchmarks? without understanding at the time what the writer was smoking.
Per Claus's comment linking to
Dynamic Search Conditions in T-SQL,
I especially like the strategy below: hard-coded SQL when good parameters are specified,
then something else to handle the unusual cases, be it OPTION (RECOMPILE) or dynamically built parameterized SQL.
IF @orderid IS NOT NULL
    SELECT ...
    WHERE O.OrderID = @orderid
      AND ... -- Conditions on Order Details here.
    -- No OPTION (RECOMPILE) here!
ELSE
    SELECT ...
    WHERE ... -- same conditions as before
    OPTION (RECOMPILE)
There is a curious quiet from the enterprise storage community on the form factor and interface direction for solid state storage, be it NAND flash, Intel 3D XPoint, or other. On the client side, personal computing, both desktop and mobile, shows a clear direction favoring the M.2 form factor and PCI-E as the preferred interface for SSD storage. There is a backward-compatibility option in M.2 to work with either a SATA or PCI-E interface, but I do not think this will be widely used. SATA and hard disks will not go away; it is just that the primary SSD will be M.2 form factor, PCI-E interface, and NVMe host protocol.
On the enterprise side, there is a great deal of deployed infrastructure built around the SAS interface (a super-set of SATA) and the small form factor (SFF) for 2.5in HDDs at 15mm height or thickness. The bean-counter types wish that SSDs (NAND flash, for those who do not like the acronym SSD) would use the existing infrastructure, and not just as an interim measure. They are probably still unhappy that Fibre Channel on the back-end had to be abandoned several years ago, being uncompetitive and a cost burden relative to SAS.
Preserving the value of investment in existing infrastructure is important because people are unhappy when equipment purchased at painfully high cost becomes obsolete. Of course, enterprise storage is only extremely expensive because storage vendors invented justifications for selling inexpensive components at very high markup.
There is also a failure to consider that hardware has an effective depreciation of 30-40% per year due to the pace of progress, which renders the term investment in hardware foolish, or if I were less polite, then completely stupid. So ultimately this is circular logic based on an untenable premise.
That said, it would be possible to build a viable enterprise storage system around either the PCI-E or SAS interface, because both support multiplexing lanes, and there are switch chips for both PCI-E and SAS. The reason PCs are moving from SATA to PCI-E is that NAND interface bandwidth is advancing at a faster pace than any single-lane connection can support, and SATA does not support multiplexing. (On websites catering to desktop hardware, some say that PCI-E is superior to SATA. This is rubbish from writers with poor technical knowledge. The only important fact is whether the interface supports multiplexing.)
The reason existing enterprise infrastructure should be abandoned is not any deficiency in SAS, but rather that it is built around four-lane (x4) uplink and downlink ports. SAS at 12 Gbit/s would only support 4.4GB/s net bandwidth. This might seem high, because enterprise storage vendors sell crappy systems with pathetic bandwidth capability. The other reason is that most existing infrastructure is either 24-25 bay SFF in 2U, or 15-bay LFF in 3U, 19-inch wide rack-mount enclosures designed for hard disks. Both the number of bays and the physical volume are completely wrong for current-generation SSDs going forward.
My opinion is that the correct uplink and downlink for solid state (be it NAND flash, Intel 3D XPoint or other) storage enclosures (not individual devices) should be 16 lanes wide or x16. Both PCI-E and SAS have adequate bandwidth and protocols. For PCI-E gen 3 at 8Gbit/s per lane, this would support a net bandwidth of 12.8GB/s. The existing x4 SAS is just too low for an SSD (formerly disk) array enclosure.
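The arithmetic behind the x4 versus x16 comparison, counting only line-encoding overhead (actual protocol overhead shaves off a few percent more, which is how the 4.4 and 12.8 GB/s net figures above arise):

```python
# Per-lane payload bandwidth after line encoding, in GB/s.
sas12_lane_GBps = 12 * (8 / 10) / 8        # SAS: 12 Gbit/s, 8b/10b encoding
pcie3_lane_GBps = 8 * (128 / 130) / 8      # PCIe gen 3: 8 GT/s, 128b/130b encoding

print(f"SAS x4:        {4 * sas12_lane_GBps:.2f} GB/s")    # 4.80 GB/s
print(f"PCIe gen3 x16: {16 * pcie3_lane_GBps:.2f} GB/s")   # 15.75 GB/s
```

Even before protocol overhead, the x4 SAS port is roughly a third of what an x16 link of either flavor could carry.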
The core of this argument is based around the PC standard of a Flash controller with 8-channels on the NAND side, and PCI-E on the uplink side. Today the NAND interface is 333MB/s, so an 8-channel controller could support 2.6GB/s. There may have been some thought that the upstream side should be PCI-E gen 3 with 2 lanes (x2, capable of 1.6GB/s), as it is common to have excess bandwidth capability on the down stream side. But in PC world, single device benchmark performance is important, so the trend seems to be PCI-E x4 on the controller, with the option to connect only x2 (or even x1?).
In the time of hard disks, client-side PCs used 7200 RPM HDDs or less, for lower cost and higher capacity. Enterprise storage was primarily 10K or 15K RPM for greater IOPS performance, although 7200 RPM was adopted for tier 2 storage. (Storage capacity should have been too cheap to meter even with 10K HDDs, but because vendors sold at ridiculously high prices, this created demand for 7.2K in enterprise storage systems.)
In the first phase of SSD adoption, enterprise systems preferred single level cell (SLC) NAND with greater write endurance while client side was mostly 2-bit MLC and later some low cost devices being 3-bit TLC. Today NAND flash technology is sufficiently mature that MLC has sufficient write endurance for many enterprise needs. Fundamentally, the performance oriented PC and enterprise could use the same SSD, just with different over-provisioning and other firmware settings. It would be foolish for enterprise systems not to leverage components developed for client side systems, given the huge volume and low cost structure.
While the standard desktop SSD element is M.2 form factor with an 8-channel controller and capable of x4 on the upstream side, the enterprise strategy should be to connect x2 on the upstream side. In enterprise, it is the performance of the complete array of storage elements that is important, not the single component. The standard storage array enclosure should probably have 16 bays, each connected x2 to the PCI-E switch, and x16 for each of the uplink port and downlink expansion port. The PCI-E switch would have 64 ports, 16 for uplink, 16 for downlink expansion, and 16 x2 for the M.2 SSDs. The enclosure should work with either 1 or 2 controllers. Existing DAEs have a single (x1) SAS connection to each bay.
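A quick check on the lane accounting for the proposed enclosure, and on the deliberate mismatch between a device's flash-side bandwidth and its x2 uplink (lane counts as proposed above; the 0.8GB/s-per-lane net figure matches the x16 estimate earlier):

```python
# Lane budget of the proposed 64-port PCI-E switch.
uplink, expansion = 16, 16
bays, lanes_per_bay = 16, 2
ports_used = uplink + expansion + bays * lanes_per_bay
assert ports_used == 64                  # switch is fully subscribed

# Per device: 8 NAND channels at 333MB/s vs a PCI-E gen 3 x2 uplink.
flash_side_GBps = 8 * 0.333              # ~2.7 GB/s at the flash
x2_uplink_GBps = 2 * 0.8                 # ~1.6 GB/s net on the x2 link
print(f"flash {flash_side_GBps:.1f} GB/s vs x2 uplink {x2_uplink_GBps:.1f} GB/s")
```

The per-device uplink is undersized on purpose: it is the aggregate bandwidth of 16 devices against the x16 uplink that matters in an array.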
The physical volume of 16 M.2 devices would occupy only one-quarter of a 1U rack. Existing enterprise storage infrastructure is x4 uplink/downlink ports in a 2U full-rack enclosure, with 24-25 bays connected x1. This is wrong for SSDs on multiple points. Uplink and downlink ports should be x16. The volume of the enclosure should shrink by a factor of 8. Connections to each bay should be x2, though 16 bays connected at x1 is weakly viable. Given that existing infrastructure is unsuitable for SSDs going forward, there is no excuse not to adopt the client-side components, with M.2 form factor and PCI-E, in a new, properly designed infrastructure.
For some reason I cannot respond to comments.
Good question. Sorry about the long-winded answer, but life is complicated. I do agree with the shared storage concept in the HDD days: a common pool of HDDs so that each host can access the aggregate IOPS capability when needed. This, and managing the complex storage system, alone would have justified a good profit margin. But storage vendors felt the need to justify extraordinary margins, hence started to invent reasons, which led to doctrine based on the invented justifications. Any time that happens, it is a fuck-up of the first magnitude. And storage vendors do not seem to understand what bandwidth is, or about log write latency.
Next, blade systems are a non-starter for databases because they give up DIMM slots and PCI-E slots. So we should stick with rack systems with the full boat of DIMM and PCI-E slots. Today a 4TB PCI-E SSD is do-able. What's missing is some way to match PCI-E SSDs to the available PCI-E lanes. System vendors have a mix of PCI-E slots, including several x16. Only workstations and HPC have x16 cards; servers do not. So we want to connect four PCI-E x4 SSDs to x16 slots. HP workstations have a card for this, but we need a server version. I can see a 4-socket server with 64-128 PCI-E lanes dedicated to storage, that is, 16-32 x4 PCI-E SSDs, so 64-128TB in one system. All this can be internal; SSDs don't take much space, aren't too heavy, and don't consume much power.
Storage vendors still want to sell horribly expensive AFA storage with features that we don't need, which cannot deliver anywhere close to the bandwidth that is possible. So it really is a fight between the DBA, who wants cheap SSD at high bandwidth, and the SAN admin, who wants to spend a shit load of money, have control over all storage, and make you fill out forms to justify why you need each precious GB of space, all so he/she can deny your request as not sufficiently justified in his/her judgment.
Given that the NVMe controllers seem to be PCI-E x4, perhaps the strategy should be x16 uplink and downlink, with 8 x4 bays. There is still a 2:1 mismatch between downstream and upstream. The point being bandwidth max is reached with 4 devices, but there is space for 4 more. Above, I suggested 16 x2 bays.
In my previous post, Hardware rant 2015, some readers reacted to my suggestion that vendors start offering the Intel Xeon E3 v5 in laptop and desktop PCs as if this were an outlandish or impractical idea.
First, doing so requires almost no additional work. Simply substitute 1) the Xeon E3 v5 for the Core i7 gen 6, 2) the server PCH (C236) in place of the desktop PCH (Z170), which are really the same thing, as are the two processors, and 3) ECC memory for non-ECC, which has 8 extra bits over the normal 64 bits. The cost of this might be one hundred dollars, mostly driven by the premium Intel charges, and only slightly by the 12% higher cost of memory. (It is the Xeon E5 line that would be seriously impractical in a laptop that an old person could easily carry. A young fit person might claim not to feel the difference between 4 and 15 lbs, or 2 and 6 kg.)
Second, I should explain why ECC memory is so important, far outweighing the extra cost. This is true for user systems, not just servers with 24/7 requirements. As the title states, a PC without ECC-protected memory is total crap, no exceptions, unless what you do on the PC is totally worthless, which could be the case for a significant market segment.
Basically without any check on memory integrity, we may have no idea when and where a soft error has occurred. Perhaps the only hint being the OS or application crashes for no traceable reason or serious data corruption has already occurred. Let it be clear that soft errors do occur unless you are deep under ground.
Up until the early 1990s, many if not most PCs sold as desktops and laptops had parity-protected memory. Then, in the Windows 3.x time frame, (almost?) all PC vendors switched to memory with no data integrity protection for their entire lineup of desktop and mobile PCs (with perhaps the exception of dual-processor systems based on the Pentium Pro and later, which were subsequently classified as workstations). This was done to reduce cost, eliminating the 1/9th of the memory used for parity.
All server systems retained parity, and later switched to ECC memory, even though entry-level servers use the same processor as desktops (either with the same product name, or different). The implementation of memory protection is done in the memory controller, which was in the north-bridge in the past and more recently is integrated into the processor itself (starting with Opteron on the AMD side, and Nehalem on the Intel side).
I recall that the pathetic (but valid?) excuse given to justify abandoning parity memory protection was that DOS and Windows were so unreliable as to be responsible for more system crashes than an unprotected memory system. However, since 2003 or so, new PCs have been sold with the operating system shifted to the Windows NT code base, imaginatively called Windows XP.
(In theory) Windows NT is supposed to be a hugely more reliable operating system than Windows 3.1/95, depending on the actual third-party kernel-mode drivers used. (Let's not sidetrack on this item; pretend what I just said is really true.) By this time, the cost of sufficient DRAM, unprotected or ECC, was no longer as serious a matter, even though the base memory configuration had grown from 4MB for Windows 3.1 to 512MB if not 1GB for Windows XP or later. And yet there was not a peep from PC system vendors on restoring memory protection, with ECC now being the standard mechanism. (I did hear IBM engineers* propose this, but nothing from PC vendors without real engineers. We don't need to discuss what the gutless wonders in product management thought.)
Presumably, soft errors are now the most common source of faults in systems from Windows NT/XP on. Apple Mac OS (from some version?) and Linux are also protected-mode operating systems, so this covers the vast majority of systems in use today. It is possible that some crashes are due to bugs in third-party drivers that have not been tested under the vast range of possible system configurations (more so for performance-oriented graphics drivers?). Still, the fact that vendors do not regard correcting the most serious source of errors in PCs today as worthwhile is an indication that they consider the work we do on PCs to be worthless crap, which is the same regard we should have for their products.
Let me stress again that putting out PCs with ECC memory does not require any technical innovation. ECC capability has been in entry server systems built from identical or comparable components all along. By this time, Intel memory controllers had ECC capability which could be (factory) enabled or disabled depending on the targeted market segment. (Intel does have dumbed-down chipsets for the low-end PCs, but it is unclear if ECC was actually removed from the silicon.)
A. The Wikipedia article
cites references that mention actual soft-error rates.
There is a wide range of values cited, so I suggest not getting hung up on the exact rate, and treating this as order-of-magnitude(s).
There is a separate entry
for anyone interested in the underlying physics.
Of course there are other Wiki entries on the implementation of ECC.
Briefly, the prevalent source of soft errors today originates with cosmic rays striking the upper atmosphere, creating a shower of secondary particles, of which neutrons can reach down to the habitable areas of Earth. Unless the environment is a cave deep underground, there will be soft errors caused by background radiation. The probability of errors also depends on the surface area of memory silicon, so a system with a single DIMM will experience fewer soft errors than a system with many DIMMs.
B. Early memory modules were organized as 8 data bits plus 1 parity bit in a 30-pin x9 SIMM. Sometime in the early 1990s, around the 80486 to Pentium time, 72-pin x36 SIMMs (32 data bits, 4 parity bits) were popular. The implementation was 1 parity bit protecting 8 bits of data for both the x9 and x36 modules. Parity-protected memory had the ability to detect, but not correct, single-bit errors in an 8-bit "line".
A few high-end servers in this era had ECC memory which may have been implemented with 2 x36 memory modules forming a 64 bit line with 8 bits for parity, or perhaps a custom memory module? Later on, memory modules progressed to DIMMs, having 64 bits of data with allowance for 8 additional bits for ECC. The base implementation of ECC is to have a 72-bit line with 64-bits for data and 8 bits for ECC. This allows the ability to detect and correct single-bit errors and detect but not correct 2-bit errors (SECDED). More than 2-bits in error could potentially constitute an undetected error (dependent on the actual ECC implementation). There also other ECC strategies such as grouping 4 x72 DIMMs into a line allowing the ability to detect and correct the failure of an entire x4 (or x8?) DRAM chip, when each DIMM is comprised of 18 x4 chips, each chip providing 4 bits of data.
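The SECDED behavior described above can be illustrated at small scale. This is a sketch of an extended Hamming code protecting a single byte (8 data bits, 4 Hamming parity bits, plus one overall parity bit), not the actual (72,64) code a memory controller implements, but the detect-and-correct-1, detect-but-not-correct-2 logic is the same:

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # non-power-of-two positions hold data

def encode(byte):
    bits = [0] * 13                      # bits[0] = overall parity, 1..12 Hamming
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (byte >> i) & 1
    t = 0                                # XOR of positions holding a 1
    for pos in DATA_POS:
        if bits[pos]:
            t ^= pos
    for p in (1, 2, 4, 8):               # parity bits cancel the syndrome to 0
        bits[p] = 1 if t & p else 0
    bits[0] = sum(bits[1:]) % 2          # overall parity enables double detection
    return bits

def decode(bits):
    bits = bits[:]                       # work on a copy
    s = 0
    for pos in range(1, 13):
        if bits[pos]:
            s ^= pos                     # syndrome: 0 if clean, else error position
    parity_ok = sum(bits) % 2 == 0
    if s == 0 and parity_ok:
        status = "ok"
    elif s == 0:                         # only the overall parity bit flipped
        status = "corrected"
    elif not parity_ok:
        bits[s] ^= 1                     # single-bit error: flip it back
        status = "corrected"
    else:                                # syndrome set but even parity: 2 flips
        status = "double-bit error detected"
    byte = 0
    for i, pos in enumerate(DATA_POS):
        byte |= bits[pos] << i
    return byte, status
```

A single flipped bit in any of the 13 positions decodes back to the original byte; two flips are flagged as uncorrectable; three or more may go undetected, matching the SECDED limits noted above.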
C. At the hardware level, if an error is detected and corrected, the operating system and applications continue to function. The event can be logged at the system level. On a detected but uncorrectable error, the hardware should cause a blue-screen OS crash.
An undetected error is just that. It is undetected. The system continues running with incorrect memory content.
Depending on the nature of the memory corruption, anything can happen.
It could be executable code, in which case an instruction changes.
It could be critical operating system data, causing a subsequent memory access to read or write the wrong location, which could have serious corruption consequences. It could also be end data (numeric, character, or control), which may or may not be critical.
* It is probably more correct to say that soft errors are the province of scientists/physicists, not engineers. Sun had perfectly good engineers, but in the late 1990s they had an UltraSPARC II processor with a 4M L2 cache in their high-end enterprise systems. I believe the L2 data had ECC (SECDED), but the tags were only parity protected (SED). Some systems started to experience mysterious failures (the ones located at high altitude?). This was ultimately traced to soft errors. It was not a simple thing to change the L2 cache tags from parity to ECC (logic in the processor itself?), so the temporary solution was to mirror the memory used for tags? (If someone knows the details, please step forward.)
The Wikipedia topic ECC Memory states
"ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing."
It is more correct to say that ECC is used when it is necessary to correct the more common (typically single-bit) errors, and to detect certain errors involving more than 1 bit, which cannot be corrected. However, it is possible that some multi-bit errors cannot even be detected.
Donaldvc pointed to this new article on IEEE Spectrum
Much of my knowledge is very old, from back in the days when memory chips were 1-4 bit wide.
Back then, a soft error might affect many memory cells, but only one bit in any given word.
Then, as memory became more dense, a soft error could affect multiple bits in a word?
So processors did ECC on a bank of 4 DIMMs: 256 bits of data, 288 bits of memory, which allowed more sophisticated algorithms.
I am not sure what Xeon E3 or E5 has. Xeon E7 is supposed to be very sophisticated.
If someone has free time, please look into this.
It has been a while so I suppose it is time for another rant on hardware.
There are two systems I would like:
One is a laptop.
The second is a server capable of demonstrating extreme IO performance,
with the secondary objective of being small enough to bring along to customer sites.
On the laptop I am looking for
1) quad-core with HT, i.e. 8 logical processors for better parallel execution plans.
2) PCIe SSD, I would prefer 3GB/s+, so PCIe gen3 x4, or 2 x M.2 PCIe x2 is also an option.
3) drive 1, but preferably 2 external 4K monitors (so I can look at complex execution plans)
On this matter, it is time to bitch at the MS SQL Server team that there should be an option to contract the white space in execution plans.
The existing zoom capability is worthless.
Yes I know SQL Sentry Plan Explorer can do this,
but really, MS, is it so hard? Or have you outsourced the entire SSMS to some team that does not know that there is such a thing as a complex query?
The reason I want to drive 2 external 4K displays is that at 4K resolution, I need more than a 28in monitor to make use of the resolution.
A couple of days ago, Dell announced the new XPS 15 with Core i7-6700 processors (Skylake),
which I immediately ordered, but unfortunately it shows a shipping date of Nov 16.
it does have 4K display, and 1 external port which may or may not support 4K.
I thought I ordered the docking station, but I do not know if this would support dual external 4K monitors.
I currently have the Dell 28in 4K monitor, which is great for looking at execution plans,
but at the normal text size setting it is difficult to read.
I am thinking that the much more expensive Dell 32in 4K monitor will be better, but maybe not enough.
Should I get a 55in 4k TV instead? these all have just the HDMI connector,
so I need to make sure there are proper 4K adapters.
The new XPS 15 data sheet says it has an HDD bay (SATA interface) and one M.2 bay (uncertain if PCIe x2 or x4). It would have been nice if 2 M.2 x2 bays were available instead of the HDD bay.
I ordered the XPS 15 with the PCIe SSD. I do not know if it is a good one (the Samsung SM951 cites 2150MB/s);
if not, I will throw the Dell SSD out and get a good one.
One more thing, ECC memory
Intel desktop and mobile processors all lack ECC (or parity) memory capability.
ECC memory support has been built into Intel processors for some time now;
it is just disabled in the Core product lines and enabled only in the server Xeon line.
So the Sky Lake equivalent is the Xeon E3 v5.
Intel released the v5 under the mobile group, with a 45W rating.
Unfortunately I cannot find a laptop for sale that uses the Xeon E3 v5.
Perhaps Dell or someone could offer a Xeon E3 mobile system?
Extreme IO Performance demonstrator
First, why do I need such a thing?
when my clients have multi-million dollar SAN storage systems?
Because SAN people are complete idiots on the matter of IO performance,
being locked into matters irrelevant to enterprise DB, like thin provisioning etc.
Invariably, the SAN people (vendor sales engineer, SAN admin, etc.) confuse the fact that Fibre Channel is specified in gigabits/sec (Gb/s) while all other IO bandwidth is specified in gigabytes/sec (GB/s).
So we have a multi-million dollar storage system (full of add-on software that have no purpose in an enterprise DB) connected to a powerful server (60+ cores and paying for SQL Server 2012 EE per core licensing)
over 2 x 8Gbit/s FC links.
Is this stupid or is this exceptionally stupid?
Yes I know it is extremely rude of me to call other people stupid, and that being stupid is not crime,
but when you are the vendor for multi-million dollar equipment,
there is a reasonable expectation that you are not stupid.
So onto the system.
For this, I am sure I need more than 4 cores, so it needs to be the Xeon E5.
Perhaps 8 cores (single socket) is sufficient.
The new Intel SSD DC P3608 has great specs, but I am not sure when it is actually available?
I would put 2 of these in the system to demonstrate 10GB/s.
Ideally this would all go into box that fits carry on luggage, which is unfortunately not one of the standard PC or server form factors.
Another option is a 2 x 12 core system to demonstrate 20GB/s on 4 x P3608.
I would prefer to get a laptop without high performance graphics, the NVidia GTX 960M in this case.
The current Intel graphics is sufficient for high resolution rendering, but I do not need high frame rate.
All the Intel Core i7 6th gen processors have graphics, I wonder if I can remove the GTX (for power savings)?
Apparently Dell will have a new docking station, the Thunderbolt Dock TB15 next year, that will support 2 x 4K monitors?
I did already rant on PC laptops only being available with 16x9 displays?
How stupid is this? It is one thing for consumer laptops to have a 16x9 display,
on the assumption that home users just watch movies,
but what justification is there for applying this to business and workstation laptops?
Concurrent with the Intel Skylake Xeon E3 v5 regular announcement,
Supermicro announced motherboards for the E3 v5.
There is a micro-ATX board (X11SAE-M), but with just 1 x16 and 1 x4 PCIe gen3 slots,
whereas the ATX board (X11SAT) has 3 slots, with x16/x8/x8 as an option.
This would let me put in 2 P3608s for 10GB/s?
Last week, a routine performance test ran about twice as long as expected.
A check of dm_exec_query_stats showed that CPU bound statements (worker time roughly equal to elapsed time)
were approximately 3 times higher than previous tests for matching SQL statements.
Almost all of the SQL involved single or few row index seeks, usually Select, some Insert and Update.
The server system is a 2-socket Xeon E5-2680 (8 cores, 2.7GHz nominal, Sandy Bridge) in a managed data center.
The data center had sent out notice that there would be system restarts the previous weekend,
which could mean either OS patches or BIOS/UEFI updates.
So naturally the next thing to do is check the Processor Information object for the Processor Frequency
and % of Maximum Frequency counters (or equivalent).
This showed 135, as in 135MHz, and 5% of maximum.
Another system of the same model also rebooted showed 1188 and 44%.
This issue has occurred previously in this environment and in other HP systems that I am aware of.
The BIOS (or UEFI) update puts the system into one of the energy efficient configurations.
It could also be an operating system setting, but most that I have seen are BIOS settings.
One can imagine a firmware engineer being so committed to green activism
that this was made the default on BIOS updates without discussion with other parties.
Perhaps there is a facility (in Houston?) with inadequate air conditioning for the number of systems,
so this setting was put in to prevent the lab from overheating.
Then no one remembered to exclude that step from the production code?
Not that I have ever done such a thing (and no further questions on this should be asked).
Another question might be why the data center monitoring team did not check for this, as it has happened before.
The whole argument for going to a managed data center instead of a simple hosted data center was
that the managed data center could provide the broad range of expertise that is not economical for a mid-size IT department.
Obviously this managed data center did not monitor for the performance/power configuration.
This matter is of serious concern to production DBAs and IT operations staff.
As the Processor Information performance object with extended information was only introduced in Windows Server 2008 R2,
many software monitoring tools may not alert on changes of Processor Frequency,
especially after reboot.
Imagine the IT staff or DBA encountering this for the first time on the production servers,
with users complaining, your boss watching over your shoulder, and his/her boss hovering over your boss,
offering their helpful insights in the non-judgemental manner as bosses do.
However, I am more interested in a different aspect of this incident.
When there are two sets of data, one for the processor cores at 2.7GHz and another at presumably 135MHz,
we can extrapolate parameters of interest.
Does it seem stunning that the drop from 2.7GHz to 135MHz, a factor of 20,
only decreases CPU efficiency (increases CPU-seconds, or worker time) by 3X?
Perhaps, but this actually should have been expected.
The salient aspect of modern computer system architecture is the difference between CPU clock cycle and memory access time.
A young person might not know, but old timers would know.
Up to about 20 years ago, the primary memory performance specification was access time,
with 80, 70 and 60 ns being common in fast page mode and extended data out.
Then with the switch to synchronous dram (SDRAM),
the key specification changed to data rate. In the Xeon E5 (v1) generation, DDR at 1333MHz was common.
This means a memory channel can deliver one transfer every 0.75ns, or 1.333 billion times per second,
with each transfer being 64 bits (excluding ECC bits).
The Xeon E5 26xx series has four DDR3 channels.
(The Intel processor is internally shown as having 2 memory controllers,
each controller driving 2 DDR channels, so “channel” can have different meanings depending on the context.)
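The per-channel numbers work out as follows; a quick sketch of the arithmetic, with the figures from the text:

```python
# DDR3-1333: 1.333 billion transfers/sec, 64 bits (8 bytes) per transfer.
transfers_per_sec = 1.333e9
bytes_per_transfer = 8                                # excluding ECC bits
per_channel = transfers_per_sec * bytes_per_transfer  # ~10.7 GB/s
four_channels = 4 * per_channel                       # ~42.7 GB/s per socket
print(per_channel / 1e9, four_channels / 1e9)
```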
What is less commonly cited is the round trip latency, from a processor issuing a request to memory,
the internal memory access within the DRAM chip and finally the transmit time back to the processor.
(The L1, L2 and L3 cache sequence is also involved in memory access timing.)
For local memory (attached directly to the processor) this is around 50ns.
For memory on an adjacent processor, the round trip time might be 95ns or so.
On a 2.7GHz processor, the CPU cycle time is 0.37 ns, so 50ns for local memory round trip access is 135 CPU cycles.
This particular system has 2 sockets, so one might expect that half of memory accesses are local at 50ns round-trip latency,
and half at 95ns latency.
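Converting those latencies into cycle counts is simple arithmetic; a minimal sketch, assuming the 50ns local and 95ns remote figures above:

```python
# Round-trip memory latency expressed in CPU cycles at 2.7GHz.
freq_ghz = 2.7
cycle_ns = 1 / freq_ghz          # ~0.37ns per cycle
local_cycles = 50 * freq_ghz     # 50ns  -> ~135 cycles
remote_cycles = 95 * freq_ghz    # 95ns  -> ~256 cycles
blended_ns = (50 + 95) / 2       # 72.5ns if accesses split evenly between nodes
print(local_cycles, remote_cycles, blended_ns)
```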
This is a well understood issue.
Two methods of addressing the disparity between CPU cycle time and memory access are 1) large cache on the processor,
and 2) pre-fetching memory.
Current Intel processors have dedicated 32KB I+D L1 and 256K L2, both per core,
and an additional shared L3 cache sized at 2.5MB per core.
From Pentium 4 on, the processor fetches 64 bytes (the cache line size) at a time, with an option to prefetch the adjacent cache line.
Prefetching is exposed in the instruction set architecture (the SSE PREFETCH instructions),
and there should also be a BIOS/UEFI setting for hardware prefetch.
Now let's visualize the (simplified) code sequence in a relational database engine with traditional page-row data structures.
There is a memory access for the index root level page.
Read the page to find the pointer to the second level page.
Another memory access, and repeat. It is a sequence of serialized memory accesses with poor locality (so cache can only help so much),
and the next location is not known until the current memory request completes, so pre-fetching is not possible.
Modern processor performance characteristics are very complicated,
but we will attempt to build a very simple model focusing on the impact of round-trip memory access latency.
Start with an imaginary processor with a cycle time equal to the full round-trip memory access time.
In this scenario, one instruction completes every cycle, be it an arithmetic or logic or memory access instruction.
Such a system may never have existed, so now consider a system where the round-trip memory access latency
is some multiple of the CPU cycle time.
The average time to complete an instruction, where time is in units of the memory access latency
(50ns, or 20MHz, for the local node),
“a” is the fraction of instructions that involve a (non-local, non-prefetch-able) memory access,
and “n” is the processor frequency in those units, is:
(1-a)/n + a
The term (1-a) is the fraction of instructions that are either not memory accesses, or are memory accesses to cache
(from a previous access or a pre-fetch).
“1/n” is the processor cycle time (in units where memory access time is 1).
Performance (instructions per unit time), the inverse of average time per instruction is:
P = 1 / ( (1-a)/n + a )
= n / (1 +(n-1)*a )
We can see that the above equation has the characteristic that as processor frequency increases, the upper bound on performance is:
n -> infinity, P = 1/a
Also, if the fraction of instructions that require memory access, “a,” is zero, then P = n.
Does the above look familiar? It is just Amdahl's Law, which was formulated in the old days to demonstrate the limits of
vectorization in supercomputers.
I have just recast it to express the limits of increasing processor frequency relative to round-trip memory access time.
If someone would like to check my math, please do so. It has been a long time.
Try tricking your teenage son/daughter into doing this as a practical math exercise.
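Checking the model numerically is easy enough; a minimal sketch (the function name is mine):

```python
# P = n / (1 + (n-1)*a): performance vs frequency n (in units of the
# inverse memory latency) and a, the fraction of instructions that
# incur a full round-trip memory access.
def perf(n, a):
    return n / (1 + (n - 1) * a)

print(perf(135, 0.0))    # a = 0: performance scales linearly, P = n = 135.0
print(perf(135, 0.059))  # a = 5.9%: far below linear scaling...
print(1 / 0.059)         # ...bounded by the 1/a ceiling as n grows
```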
OK, anybody still reading is obviously not deterred by math, or knows the trick of skipping such things.
What am I really after?
In the above equation, what is known is the processor frequency relative to memory access latency.
While we know the performance or worker time of certain queries,
we do not know it in terms of instructions per CPU-cycle.
The second item we do not know is the fraction of instructions that incur a round-trip memory access latency
that cannot be hidden with cache or pre-fetching.
But, we have data points at 2 frequencies, 2.7GHz and reportedly 135MHz.
Express the relative performance between the two points as a ratio.
P2/P1 = R
Then from the two equations
P1 = 1 / ( (1-a)/n1 + a )
P2 = 1 / ( (1-a)/n2 + a )
we can solve for a in terms of the known values n1, n2 and R.
a = (n2 - n1*R) / ( n1*n2*(R-1) + n2 - n1*R )
Assuming a memory access latency of 50ns, the base frequency of 20MHz corresponds to a memory access in 1 cycle.
Plugging in the values n1 = 135MHz / 20MHz = 6.75, n2 = 2700/20 = 135 and R = 3,
we get a = 0.059; that is, 5.9% of instructions incurring a non-cached, non-prefetched
round-trip memory access latency would result in a 3:1 performance ratio between 135MHz and 2700MHz.
(Perhaps it would be more correct to estimate the round-trip memory access latency as the average
of the local and 1-hop remote node, about 75ns.)
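The back-calculation can be verified numerically; a minimal sketch, with the same n1, n2 and R as in the text:

```python
# Solve for a given two operating points: frequencies n1, n2 (in units of
# the 20MHz base frequency) and the measured performance ratio R = P2/P1.
def solve_a(n1, n2, R):
    return (n2 - n1 * R) / (n1 * n2 * (R - 1) + n2 - n1 * R)

a = solve_a(n1=135 / 20, n2=2700 / 20, R=3)
print(round(a, 3))   # 0.059
```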
So while it might seem astonishing that the difference between 135MHz and 2700MHz
translates to only 3X performance,
the database transaction processing workload is an extreme (but important) case.
There are many workloads which exhibit better locality
or have memory access patterns that are amenable to prefetch
and have performance scaling better with frequency.
Earlier, two methods of hiding round-trip memory access latency were mentioned.
There is another: Hyper-Threading. The processor core appears as two (or more)
logical processors to the operating system.
Presumably, there is an extra set of program counters, and resources to determine which physical registers
(different from the registers specified in the instruction set architecture)
are assigned to each logical processor.
In the earlier example, say that the round-trip memory access time is 135 CPU-cycles
and the fraction of instructions that incurs the full round-trip latency is 6%.
Then out of 100 instructions, 94 execute in 1 cycle each (excluding consideration of superscalar),
as either not involving memory access or finding data already in cache,
and 6 incur the round-trip memory latency of 135 cycles each.
The total time in terms of CPU-cycles is then 94*1 + 6*135 = 904.
In other words, only 100 cycles out of 904 do useful work; the rest are stalls.
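The cycle accounting is easy to verify; a quick sketch:

```python
# 100 instructions: 94 complete in 1 cycle (no memory stall),
# 6 each wait out the full 135-cycle memory round trip.
total_cycles = 94 * 1 + 6 * 135    # 904
utilization = 100 / total_cycles   # ~11% of cycles do useful work
print(total_cycles, f"{utilization:.1%}")
```

A second hardware thread could in principle run during some of those idle cycles, which is the rationale for Hyper-Threading.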
The Intel Xeon processors from Nehalem on implement Hyper-Threading with 2 logical processors
on each physical core. (This can be disabled in BIOS/UEFI. Some models have HT disabled.
The earlier Intel Pentium 4 based processors implemented a more complex form of HT.)
In considering the nature of the database transaction processing workload,
in which one memory access determines the next memory access,
it is perhaps time for Intel to increase the degree of HT,
especially considering that the server oriented Xeon E5 and E7 models are already 1 full year or more behind the
smaller desktop/mobile processor variants.
I seem to recall IBM POWER as having 4 logical processors per physical core,
and one of the SPARC processor lines as having 8.
It would also be necessary to have a good strategy for using HT based on workload.
The option to enable or disable HT in the BIOS/UEFI is not what I mean.
HT should be visible to the OS. But the application itself should detect the presence and degree of HT,
and make its own decision on whether HT should be used and how it should be used.
Xeon Phi, Many Integrated Core
Another item worth mentioning here is the Intel many integrated core (MIC) architecture,
codename Knights something, now Xeon Phi.
The processor puts many smaller cores on one die: 61 in the 22nm Knights Corner,
versus 12-18 in the 22nm mainline Xeon processors.
The theory behind many smaller cores stems from one of the two main elements of Moore's Law.
Doubling the number of logic transistors/complexity in a single core should translate to about a 40% performance gain.
(This was the case up to several generations ago.
Since then, Intel no longer tries to double the logic from one process to the next.
There might be 10-20% performance gain in general instructions.
Some effort is given to expanding the functionality of the special/vector instructions.
And most effort has been in increasing the number of cores.)
One manifestation of this (more logic transistors) could be increased frequency (which Intel stopped pursuing years ago).
Another might be more execution ports (8 in Haswell) or other improvements to instructions per cycle (IPC).
Following the rule of 2X transistors per 1.4X (square root of 2) performance backwards,
the expectation is that a core 1/4th the size would have 1/2 the performance.
But potentially there could be 4X as many cores, depending on interconnect and power limitations.
So in workloads that are amenable to vectorization, or otherwise can be parallelized,
many smaller cores could be the better strategy.
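A rough numerical sketch of this trade-off, assuming single-core performance scales as the square root of the transistor budget per the rule above:

```python
import math

small_core_area = 0.25                           # core at 1/4 the transistor budget
per_core_perf = math.sqrt(small_core_area)       # -> 1/2 the single-thread performance
cores_in_same_area = 4                           # 4x as many cores fit in the same area
aggregate = cores_in_same_area * per_core_perf   # -> 2x throughput, if fully parallel
print(per_core_perf, aggregate)
```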
The Xeon Phi is targeted to HPC workloads, as reflected in the 512-bit SIMD instructions.
If we were thinking about a transaction processing database engine on the MIC architecture,
we would probably consider a very basic ALU without SIMD (not sure on FP).
I am thinking that an asymmetric processor architecture might be the objective.
Perhaps two powerful cores from the current main line,
and many simpler cores (without SIMD) perhaps even simpler than Atom?
(The Intel Xeon Phi line implements Hyper-Threading with 4 logical processors per physical core.)
As said earlier, the nature of database page storage along rows makes serialized memory access
(also called pointer-chasing code) its hallmark.
This is why there is interest in column storage architecture.
Now all of a sudden, for certain database workloads, the next memory access is 4 bytes over,
already in cache.
The work a little further down touches memory in the next 64-byte line or two away.
Both the software and the hardware know this, and either is capable of issuing a pre-fetch.
It does not matter that columnstore must touch more data.
The processor can stream huge volumes of data much more effectively than it can pointer-chase
through only the necessary rows.
I should probably say something here.
As I said earlier, modern microprocessors are very complex.
Pipelined execution was introduced (in Intel processors) with the 80486 (1989) and
superscalar execution with Pentium (1993).
Pipelined means that while the processor can complete an instruction in each cycle,
the actual start to finish time of a specific instruction occurs over several cycles.
Intel does not talk about pipeline stages any more, but there are occasional references to Core2
and later processors having a 14+ stage pipeline.
(Wikipedia says Core2 has a 12-14 stage pipeline, Nehalem and later 20-24, and Sandy Bridge 14-19.)
Superscalar means that there are more than one execution unit, with the goal of completing more than one instruction per cycle.
Haswell has 8 execution ports. Several processor generations prior were 6-port superscalar.
We could apply Amdahl's principle on scaling performance to any and all of pipelining, superscalar execution,
and round-trip memory latency, and probably other things too.
Rethinking Computer System Architecture
I have said this elsewhere. It is long past due to do a clean-sheet system architecture with matching changes to OS architecture.
Current system architecture stems from the 1970's: a processor with physical memory (8MB was big) and a page file on disk.
Why do we still have a page file on disk?
In the old days, there was not enough physical memory such that it was tolerable to have a page file
on disk to support a larger virtual address space.
Today, more than 1TB of physical memory is possible and affordable (compared to the cost of SQL Server per-core licensing).
But the key factor is in how memory is used.
Back then, it was mostly for executable code and internal data structures.
The assumption was that very few database data pages would actually be in memory at any given point in time.
Today, a very large percentage of memory is used for caching data pages.
Of the memory used for executable code and internal data structures, most is junk.
The CPU-cycle time to memory access time discrepancy dictates that the more urgent strategy is to get memory closer
to the processor even if it means drastically reducing the size of true memory, to perhaps a few GB per socket.
Given that DRAM is so cheap, we would still have system with multi-TB DRAM capacity, except that this would now be the page file.
Of course the operating system (and applications) would have to be designed around this new architecture.
Given how well the Intel Itanium software coordination went, I guess this might be too much to expect.