For nearly ten years, I have had success in using a specifically modified version of the TPC-C benchmark for evaluating server hardware running SQL Server.
Depending on your purpose, you can evaluate computer hardware in numerous ways with numerous benchmarks. Arbitrarily picking a generic benchmark to evaluate a server is of no interest to me. Rather, I'm interested in a piece of hardware only in terms of how SQL Server runs on that hardware or on that specific configuration of hardware. In this day and age, you never really evaluate just the hardware. No matter how you approach it, you always end up evaluating a complete system under test, and that includes the hardware, the operating system, and the server software of interest (in my case, SQL Server). You can infer the hardware performance by trying to keep the other components in the system under test constant, but even that may not always be feasible or desirable. Often, you are really just interested in the performance of the entire system under test.
To have any meaningful results to talk about, and a meaningful frame of reference, at a bare minimum you want to keep your test workload constant across the systems under test. In many of my server hardware evaluation efforts, the modified TPC-C benchmark mentioned earlier is what remains constant, and it is the constant frame of reference with which I compare various server hardware models and configurations. The good news is that the evaluation data I have collected over the years show this particular benchmark to have both predictive and discriminating power.
It's predictive in that the results I get by running this benchmark on new processors or server models correlate well with the processing power seen by actual SQL Server production workloads. This of course does not mean that every production workload will see what the benchmark sees. After all, if your workload is strictly disk I/O bound, this benchmark has zero predictive power. But as an engineering tool, when you must make a prediction before you have a chance to test all your production workloads against a specific new server model or configuration (which, by the way, is not feasible anyway), the benchmark provides useful input into the decision-making process.
The chart below shows an example of the benchmark’s predictive power. It contains real data from running the benchmark on a set of real servers (many of which were commonly used in the datacenters for running SQL Server), but with the server identities scrambled.
Knowing how these servers generally behave, performance-wise, with the real production workloads, I can also interpret the above chart as an indication of the benchmark's discriminating power in that the benchmark is able to tell these servers apart. But more than that, in repeated blind tests where we ran the same benchmark against two servers known to be different, with that fact withheld from the test administrator, the benchmark results consistently identified which server was the more powerful.
Most importantly, however, the benchmark is practically useful beyond being a mere engineering tool. I was often called into cases where application users complained that their databases appeared to behave differently on the two nodes of a cluster, even though the nodes were identical in their hardware configuration. With this benchmark, I was able to determine whether the nodes were indeed different and by how much (as measured by the benchmark, of course), when generic hardware benchmark tools and system diagnostics tools failed to tell the difference.
Now you may ask how I could be so sure about what the benchmark was telling us. Well, that is not an easy question to answer in many cases. The beauty of this benchmark is that it's relatively quick to run, and there are shortcuts to make it even more efficient to administer. Because of that, in some cases we could afford to go through a process of elimination to actually zero in on the root cause. For an example, see what I reported earlier on identifying a bad memory module.
So what exactly is this modified TPC-C benchmark? At the risk of repeating myself (I have mentioned it several times in previous posts), it's the following:
- It uses a TPC-C compliant database populated for 100 warehouses. Note that the number of warehouses is the scaling factor in TPC-C: you control the size of the database by specifying the number of warehouses, which then determines how many rows each table has, and so on. In practice, a TPC-C database scaled to 100 warehouses is about 10GB in terms of its data and indexes.
- Its client program (i.e. the workload driver) calls only two out of the five official TPC-C transactions, and these two transactions are read-only. They are the Order Status transaction and the Stock Level transaction.
- The two transactions are called in a 50/50 split. In other words, before each call to the database, the client uses a random number generator to generate an integer between 1 and 100 in an even distribution, and if the number is between 1 and 50, the Order Status transaction is called, and if the number is between 51 and 100, the Stock Level transaction is called.
- Unlike an official TPC-C benchmark, it doesn't allow any wait time between two consecutive database calls. This of course makes the workload per user much heavier than that of most real-world apps, but it makes it efficient to load up the server.
- I only take measurements when all the data pages are cached and no disk I/O traffic is observed. This effectively takes the storage subsystem out of the picture.
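The driver logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual workload driver: the database plumbing is stubbed out (a real driver would call the stored procedures over ODBC or a similar client library), and the function names are placeholders.

```python
import random

def execute_proc(proc_name, warehouse_id):
    """Placeholder for the actual database call; a real driver would
    invoke the stored procedure against the TPC-C database here."""
    pass

def run_user(num_calls, num_warehouses=100):
    """One client user: issue back-to-back calls with no think time,
    splitting 50/50 between the two read-only TPC-C transactions."""
    counts = {"OrderStatus": 0, "StockLevel": 0}
    for _ in range(num_calls):
        w_id = random.randint(1, num_warehouses)  # pick a warehouse
        roll = random.randint(1, 100)             # uniform 1..100
        if roll <= 50:
            execute_proc("spOrderStatus", w_id)   # 1-50: Order Status
            counts["OrderStatus"] += 1
        else:
            execute_proc("spStockLevel", w_id)    # 51-100: Stock Level
            counts["StockLevel"] += 1
        # Deliberately no sleep here, unlike official TPC-C think times
    return counts
```

Many such users run concurrently, one per thread, until the server saturates.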
And that is it!
Life is good, eh?
Well, the bad news is that it doesn't scale beyond a certain number of processors. This was never a problem until 10-core processors came along. Lately, I've found this particular benchmark failing to fully utilize all the processor power on machines with a large number of processors, most notably those with 40 or more. Take a look at the following chart, which shows the maximum total CPU usage that can be obtained with the benchmark on a server with four Intel Xeon E7-4870 processors:
Note that each E7-4870 processor has 10 cores, and each core can have hyperthreading enabled. So the total number of processors on the machine visible to Windows is 80. To help demonstrate the problem, I looked at four different configurations in more detail:
- 20 cpus: Disabled half the cores and disabled hyperthreading from the BIOS
- 40 cpus: Disabled half the cores and enabled hyperthreading from the BIOS. That is, 20 cores with each core having two threads
- 40 cpus: Enabled all the cores and disabled hyperthreading from the BIOS. That is, 40 cores with each core having a single thread
- 80 cpus: Enabled all the cores and enabled hyperthreading from the BIOS. That is, 40 cores with each core having two threads
When the server is configured to have 20 logical processors (20 cores without hyperthreading), the benchmark has no problem driving all the processors to 100%, and this is expected because all the pages are already cached and SQL Server should not be waiting for any resource.
With 40 cpus, either 40 cores without hyperthreading or 20 cores with hyperthreading, the benchmark fails to drive all the processors to 100% no matter how many client users (i.e. threads) are fired up to submit the transactions. This is not expected. Obviously, SQL Server is not spending all its time doing useful work in this case. Checking the usual suspects (e.g. wait stats, latch stats, and spinlocks) reveals no clue as to what it may be waiting on. SQL Server doesn't seem to expose any instrumentation beyond what it considers resources, and it's beyond me to profile how SQL Server spends its time at a lower level, which I'm sure would shed light on what's going on.
The 80 cpus case is the most interesting. Although the maximum total CPU usage is reported to be around 50% in the chart, often I can only drive the total usage to around 22%. When SQL Server does manage to use 50% of the total CPU time, all the processors are observed to be active, and the load is more or less evenly distributed across them. When the benchmark can't drive the total CPU usage above 22%, half the processors are seen to have no activity in Task Manager. There is obviously a scheduling issue with SQL Server in this case, though I have no idea exactly what triggers it.
Whatever the underlying problem may be, the benchmark as configured is not effective with machines that have a large number of processors (say 40 cores or more).
Fortunately, it's not difficult to modify the benchmark so that it scales well with 80 processors. One modification is to have each transaction do more work than it currently does. Remember: I'm not doing any official TPC-C work, and I don't publish any official TPC-C benchmark results. All I care about is using it as a tool to help evaluate SQL Server running on some hardware configuration. To that end, I can change it in any way or shape I fancy, as long as I get useful results for what I want to accomplish.
Take the Stock Level transaction as an example. Its code is very simple and is reproduced as follows:
CREATE PROC dbo.spStockLevel @w_id int,
                             @d_id int,
                             @threshold int
AS
SET NOCOUNT ON

DECLARE @o_id_low int,
        @o_id_high int

SELECT @o_id_low  = (d_next_o_id - 20),
       @o_id_high = (d_next_o_id - 1)
  FROM district
 WHERE d_w_id = @w_id
   AND d_id = @d_id

SELECT COUNT(DISTINCT s_i_id) AS low_stock
  FROM stock, order_line
 WHERE ol_w_id = @w_id
   AND ol_d_id = @d_id
   AND ol_o_id BETWEEN @o_id_low AND @o_id_high
   AND s_w_id = ol_w_id
   AND s_i_id = ol_i_id
   AND s_quantity < @threshold
One way to make it do more work is to put the main code segment (with the resultset suppressed) in a while loop. Unfortunately, this does not work out too well. I still see the total CPU usage up against a ceiling, with the benchmark still unable to use all the processor power. Another approach is to include more index lookups, such as the following:
declare @a float
select @a = avg(i_price) from item with(nolock)
where i_id < 15000
This adds a moderate amount of work, and it turns out to be more effective. When I add this SELECT statement to each of the two read-only transactions, the benchmark scales well with any number of processors (tested up to 80). Of course, only more tests can tell whether the benchmark modified as such retains sufficient predictive and discriminating power to be practically useful.
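To make the shape of the change concrete, here is a driver-side sketch of pairing each transaction call with the extra lookup. In my actual modification the SELECT is added inside each stored procedure; issuing it as a separate statement from the client, as shown here with a stubbed-out database call and hypothetical names, is just a variation for illustration. (The range `i_id < 15000` touches on the order of 15,000 index entries, since the TPC-C item table has a fixed 100,000 rows.)

```python
# The added work: a moderate index-range lookup on the item table.
EXTRA_LOOKUP = (
    "SELECT AVG(i_price) FROM item WITH (NOLOCK) "
    "WHERE i_id < 15000"
)

issued = []

def execute_sql(sql):
    """Placeholder: record the statement instead of sending it to
    SQL Server; a real driver would execute it over ODBC."""
    issued.append(sql)

def call_transaction(proc_sql):
    """Run one of the two read-only transactions plus the added lookup."""
    execute_sql(proc_sql)      # original Order Status or Stock Level call
    execute_sql(EXTRA_LOOKUP)  # added per-transaction work
```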
No doubt there are many other ways to salvage this benchmark. But that discussion is not the focus of this post.
The purpose of this post is twofold. One is to leave a reference for one of the tools I often use in collecting empirical data, which will help future posts that depend on the tool. The other is to serve as a small example highlighting the significant impact of many-core machines on what we do; there is work ahead for the Microsoft SQL Server team.