Recently, I have been looking into the performance impact of enabling hyperthreading on various platforms with various SQL Server workloads. All the results I have shared so far are from a DL580 G7 with four Westmere-EX (E7-4870) processors. Overall, the results of enabling hyperthreading have been positive for both the tested reporting queries and the tested OLTP queries, although I did run into one exception where a reporting query workload performed better without hyperthreading. In all cases, when the system was pushed really hard, hyperthreading was found to either increase the throughput, or at least have no negative impact.
On a different platform, will we see similar patterns? Or will we see hyperthreading behave differently?
In this post, I’d like to share some results I obtained on a DL360 G7 with two 3.46GHz six-core Westmere-EP (aka Gulftown) processors. With two X5690 sockets, the server had 12 cores, and with hyperthreading (HT) enabled, the OS and SQL Server see 24 processors. I repeated all the tests I reported about previously, namely:
- Standard TPC-C readonly queries. This is the OLTP query workload described in an earlier post. It is the workload that I found did not scale well on a four-socket DL580 G7. As discussed in Evaluating server hardware: a sign of the times, when running this test workload, SQL Server was not able to push all the processors sufficiently. It does not have the same problem with the two-socket DL360 G7.
- Augmented TPC-C readonly queries. This is the OLTP workload described in the same post: the standard TPC-C readonly workload plus an additional SELECT query, which causes the server to do more work and therefore scales better with a high core count.
- Reporting queries - P1. This is the test workload described in Performance impact: hyperthreading for reporting queries. To distinguish this reporting query workload from the next one, I’ll refer to it in this post as Reporting queries – P1.
- Reporting queries – P2. This is the test workload described in Performance impact: not all is better with hyperthreading. This test workload uses a query suggested by Serguei Tarassov in a comment to my post.
Note that since the two-socket DL360 G7 platform is less powerful than the four-socket DL580 G7 platform I tested previously, it would not make sense to calibrate the tests exactly as before. I did not want to limit the test results to overly saturated load levels. Nor did I want to use a workload whose load level was too light, since our primary interest was to understand how hyperthreading affects the processing capacity of these systems.
To find a good test range, I first ran the above-mentioned test workloads with a rather wide load level range, the results of which gave me an overall picture of how these systems would respond to the load levels. The tests were then calibrated as follows:
- Standard TPC-C readonly queries and Augmented TPC-C readonly queries. With the query DOP set to 8, I tested 1, 4, 8, and 16 concurrent query streams. With the query DOP set to 1, I tested 4, 8, 16, 32, and 64 concurrent query streams. 64 streams proved too heavy for the system under test; the test could not reliably run to completion, and those results had to be discarded.
- Reporting queries – P1 and Reporting queries – P2. I tested 50, 100, 200, and 400 simulated users.
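A multi-stream test of this kind can be driven by a simple harness that launches N concurrent streams and counts completed queries. The sketch below is a minimal, hypothetical Python illustration of that pattern, not the actual test harness used in these tests; the stream body is a stub standing in for a real query against the test database:

```python
import threading
import time

def run_stream(stream_id, duration_s, counters, workload):
    """One query stream: run the workload back-to-back until time is up."""
    deadline = time.monotonic() + duration_s
    done = 0
    while time.monotonic() < deadline:
        workload()          # stand-in for one read-only transaction
        done += 1
    counters[stream_id] = done

def measure_throughput(n_streams, duration_s, workload):
    """Launch n_streams concurrent streams; return total completions/sec."""
    counters = [0] * n_streams
    threads = [
        threading.Thread(target=run_stream,
                         args=(i, duration_s, counters, workload))
        for i in range(n_streams)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counters) / duration_s

# A trivial CPU-bound stub so the harness is runnable without a database.
throughput = measure_throughput(4, 0.2, lambda: sum(range(1000)))
print(f"~{throughput:.0f} completions/sec across 4 streams")
```

A real driver would replace the stub with a database call and would also warm up the buffer pool before taking measurements, as was done in these tests.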
The following two system configurations were used to help check the impact of enabling hyperthreading on the two-socket DL360 G7 platform:
- with HT: 12 cores with hyperthreading enabled. The OS and the SQL Server instance see 24 processors.
- without HT: 12 cores with hyperthreading disabled. The OS and the SQL Server instance see 12 processors.
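One quick way to confirm which configuration is in effect is to check the logical processor count the OS reports. As a hedged illustration (the specific counts below assume the DL360 G7 configuration described in this post):

```python
import os

# os.cpu_count() reports logical processors visible to the OS.
logical = os.cpu_count()
print(f"OS sees {logical} logical processors")
# On the DL360 G7 above, this would be 24 with hyperthreading enabled
# and 12 with it disabled (2 sockets x 6 cores, x2 for HT).
```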
To summarize the test environment, the server was a DL360 G7 with two X5690 (6-core) processors and 64GB of RAM. The OS was Windows 2008 R2 Enterprise x64 Edition with SP1, and the DBMS was SQL Server 2008 R2 Enterprise x64 Edition (10.50.1600). I'd have preferred to run these tests on a 10.50.2500 instance, but you test with what you have, not what you wish to have. In all cases, the test data was fully cached in the buffer pool before test measures were taken.
For the reporting query workloads, the tran_stats table DDL and the data load script can be found in the attachment of this previous post. The table was populated with 50 million rows.
Standard TPC-C readonly queries
Overall, the impact of enabling hyperthreading was positive. The highest impact was at 100 users, where hyperthreading resulted in a ~19% boost in the transaction throughput.
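The throughput deltas quoted throughout this post are simple relative changes against the no-HT baseline. For clarity (the absolute numbers below are illustrative only, not measured values):

```python
def pct_change(baseline, measured):
    """Relative throughput change, as a percentage of the baseline."""
    return 100.0 * (measured - baseline) / baseline

# Illustrative numbers only: a workload doing 1000 tx/sec without HT
# and 1190 tx/sec with HT shows the ~19% boost quoted above.
print(f"{pct_change(1000, 1190):.0f}%")  # → 19%
```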
Augmented TPC-C readonly queries
Hyperthreading did not have any significant impact with the augmented TPC-C readonly workload.
Reporting queries – P1 with DOP = 8
With 4, 8, and 16 query streams, hyperthreading had a significant positive impact, as high as ~25% increase in the query throughput. With a single query stream, however, the workload performed consistently better without hyperthreading.
I looked at the single-stream case more closely. With the help of CoreInfo.exe and Task Manager’s NUMA Node view, SQL Server and Windows were seen to consistently favor the processors within the same NUMA node when scheduling threads to execute intra-query parallelism. In fact, I did not see a single case where the threads processing this reporting query with DOP set to 8 were spread across the two nodes. Note that since there were six cores in each node, some of the threads would have to share a core. This scheduling strategy generally makes sense because the threads would be able to share the L3 cache, and it may help reduce remote memory access. Apparently, for this particular workload, the shared L3 cache and local memory offered no advantage. Without hyperthreading, the threads were scheduled on different cores, and that appeared to offer better performance.
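The core-sharing observation above follows from simple pigeonholing: eight parallel threads confined to a six-core node cannot each get a core to themselves. A quick sketch makes this concrete (round-robin placement is an illustrative assumption here, not a claim about how SQLOS actually assigns workers):

```python
def assign_round_robin(n_threads, n_cores):
    """Map each thread to a core in round-robin order; return per-core thread counts."""
    per_core = [0] * n_cores
    for t in range(n_threads):
        per_core[t % n_cores] += 1
    return per_core

# DOP 8 on one 6-core node: at least two cores must host two threads each.
print(assign_round_robin(8, 6))  # → [2, 2, 1, 1, 1, 1]
```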
Reporting queries – P1 with DOP = 1
When the reporting queries were not processed with parallelism (i.e. DOP set to 1), the impact of enabling hyperthreading was generally positive, especially at a higher load level.
Reporting queries – P2 with DOP = 8
The results here were mixed. At lower load levels (i.e. when the number of query streams was 4 or 8), higher query throughput was achieved without hyperthreading. At the higher load levels, however, the workload saw higher throughput with hyperthreading enabled. I do not have any explanation for this pattern. But I should reiterate that I did repeat the tests several times, and the pattern was consistent.
Reporting queries – P2 with DOP = 1
When the reporting queries were not processed with parallelism, the impact of enabling hyperthreading was generally positive, especially at a higher load level. This exactly mirrors the observation with reporting queries – P1 when DOP was set to 1.
16 streams with DOP set to 16
What about the case I reported in the previous post, where the reporting test workload had 16 streams running queries with DOP set to 16, and hyperthreading was seen to have a negative impact?
I did repeat that test on this two-socket DL360 G7 platform, and it turned out that the impact of enabling hyperthreading was different. As shown in the following chart, hyperthreading resulted in a ~14% increase in throughput.
In the majority of the cases I studied, the impact of hyperthreading was either positive or negligible. In particular, I did not see a single negative case (so far) when the workloads were OLTP queries, when DOP was set to 1 for reporting queries, or when the system under test was heavily loaded. There were a few cases where reporting query performance was better with hyperthreading disabled. It is worth noting that the same workload can indeed behave differently on different platforms with respect to the hyperthreading impact.
I’d suggest the following:
- Test your SQL workload with and without hyperthreading. This sounds like a platitude, and it is. But you have seen the test results: the same workload can go either way.
- If you can’t test your SQL workload (and in many cases you simply can’t, or can’t afford to, perform this type of controlled test), or you have no idea what your SQL workload may look like, I recommend enabling hyperthreading by default (for Westmere-based processors anyway, and, to stretch it a bit, for the recent Nehalem-based processors).