THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Linchi Shea

Checking out SQL Server via empirical data points

Performance impact: not all is better with hyperthreading

In the comments to my previous post on the performance impact of enabling hyperthreading on reporting queries, Serguei Tarassov indicated that it might be interesting to try different reporting queries, and suggested a specific parameterized test query. I happened to have some free time over this past long weekend, and I was curious to see how a different workload might behave, differently, similarly or the same. So I modified the workload driver client I wrote earlier to work with Serguei’s query, ran the tests, and found a few interesting things in the process that I thought I should share.

Although I ran the tests for various numbers of concurrent streams and various degrees of parallelism, for this post I’ll focus on the following two workloads:

  • Workload 1: 32 concurrent query streams with maxdop set to 32 for the test query (actually a stored procedure called p_repot_16)
  • Workload 2: 16 concurrent query streams with maxdop set to 16 for the test query (actually a stored procedure called p_report_32)

Obviously, Workload 2 was lighter than Workload 1. Two test environments were examined with the test workloads:

  • with HT:       40 cores with hyperthreading enabled. The OS and the SQL Server instance see 80 processors.
  • without HT: 40 cores with hyperthreading disabled. The OS and the SQL Server instance see 40 processors.

The tran_stats table and the test environment were exactly the same as described in the previous post except that the tran_stats table was populated with 50 million rows (as opposed to 200 million rows) and that tran_stats was the only table used in this test.

The code for p_report_16 and the code for p_report-32 are identical except their maxdop settings:

CREATE PROCEDURE dbo.p_report_16
      @tran_begin datetime
WITH RECOMPILE
AS
SET NOCOUNT ON;
select sum(cnt) as cnt_sum from
(SELECT COUNT(1) as cnt,
         t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
    FROM tran_stats t1 INNER JOIN
         (SELECT AVG(resultset_size) AS avg_resultset_size, server_config
            FROM tran_stats
          GROUP BY server_config
         ) t2
         ON t1.server_config = t2.server_config AND
            t1.resultset_size <= t2.avg_resultset_size
   WHERE t1.tran_begin BETWEEN @tran_begin AND DATEADD(second, 1, @tran_begin)
   GROUP BY t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
) as v
option (maxdop 16)
go

CREATE PROCEDURE dbo.p_report_32
      @tran_begin datetime
WITH RECOMPILE
AS
SET NOCOUNT ON;
select sum(cnt) as cnt_sum from
(SELECT COUNT(1) as cnt,
         t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
    FROM tran_stats t1 INNER JOIN
         (SELECT AVG(resultset_size) AS avg_resultset_size, server_config
            FROM tran_stats
          GROUP BY server_config
         ) t2
         ON t1.server_config = t2.server_config AND
            t1.resultset_size <= t2.avg_resultset_size
   WHERE t1.tran_begin BETWEEN @tran_begin AND DATEADD(second, 1, @tran_begin)
   GROUP BY t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
) as v
option (maxdop 32)

Every time the proc (p_report_16 or p_report_32) was called in the workload driver, the input parameter value was picked randomly between min(tran_begin) and max(tran_begin).

Each stream of the stored proc calls was submitted from a separate thread in the workload driver without any wait between two consecutive calls. All streams started at the same time, and each stream ran for 20 minutes. The begin time and the end time of each call was recorded on the client side and loaded into a table at the conclusion of the test for analysis. The total number of queries processed in this 20-minute test run was tallied and the queries per hour metric was then calculated.

Workload 1: 32 concurrent streams running p_report_32

The following chart summarizes the result:

hyperthreading_32users_32dop

The chart shows that with this workload (i.e. 32 concurrent streams of the queries with maxdop set to 32), the 40-core server achieved higher query throughput with hyperthreading enabled than it did with hyperthreading disabled. Using the configuration without hyperthreading as the baseline, the throughput improvement was about 7%, that is (5848-5472)/5472.

I didn’t expect to see the same or even similar number as we saw in the previous tests. But at least, the result still shows that the performance impact of hyperthreading is positive.

Workload 2: 16 concurrent streams running p_report_16

The result of Workload 2 is a bit surprising and very different from what we have seen so far:

hyperthreading_16users_16dop

I had to re-run the workload several times to ensure that the result highlighted in the above chart was stable and repeatable. And it was! The 40-core server was able to support a much higher query throughput without hyperthreading than it was with hyperthreading. If we use the throughput with hyperthreading as the baseline (4460 queries per hour), disabling hyperthreading resulted in a ~23% gain to a throughput of ~5482 queries per hour!

I do not have a root cause explanation for this difference. I did look around, but did not come up with anything. However, I should reiterate that I did repeat the test several times, and the pattern was consistent.

Now, since I had the execution statistics of each call logged in a table, I looked at the response times more closely and found some interesting pattern in the response times. Take a look at the following summary table (the response times are in seconds):

hyperthreading_p_report

In the case of Workload 2 (16 streams with maxdop=16), when hyperthreading was enabled, the query response time fluctuated much more wildly than when hyperthreading was disabled. The response time standard deviation was 0.56 without hyperthreading. Compare that to 7.56 when hyperthreading was enabled.

With both workloads, all the processors (as seen in the OS) were pushed to 100% or nearly 100% with or without hyperthreading.

One more thing to note is that I did not test the above two workloads in isolation. Instead, as a standard practice, I’d run tests to cover a full spectrum of the load levels so that I can see the overall trends and patterns first before I zero in on certain specific scenarios. In this case, I did the same and then focused on the above two workloads. So what’s the story with the other workloads?

With 32 concurrent query streams, the 40-core server with hyperthreading consistently outperformed it without hyperthreading for maxdop = 16, 20, and 32. With 16 concurrent streams, however, the results were mixed.  With maxdop set to 16 and 20, it performed better without hyperthreading than with hyperthreading. With maxdop set to 32, the 16 streams workload performed better when hyperthreading was enabled than when hyperthreading was disabled.

Finally, I should point out that although the query plan was largely stable, it’s possible to get a different plan when the input parameter value was near the extreme upper end. But since the number of calls was sufficiently high (1400~1900 total number of calls for each test run), I do not expect the change in the query plan to have had a systematic impact on the test outcome.

I’ll upload the workload driver program with all the source code next time so that you can try it out in your own environment. I need to write up a little readme and can’t do it in this post.

The test environment was a DL580 G7 with four E7-4870 (10 core) processors and 264GB of RAM. The OS was Windows 2008 R2 Enterprise x64 Edition with SP1 and DBMS was SQL Server 2008 R2 Enterprise x64 Edition (10.50.2500). The tran_stats table DDL and the data load script can be found in the attachment of this previous post.

Published Tuesday, January 17, 2012 6:57 PM by Linchi Shea

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Thomas Kejser said:

Hi Linchi

Great to see you continue this work.i have been in lab and use, but I have not forgotten the other investigation.

One theory to explain what you are seeing is that the HT helps when you are over scheduling and burning time on spin locks and memory fetch (one thread can work while the other waits for memory access or spinlock/memory barrier)... If that's the case here, you should see HT get better if you increase user concurrency in your DOP 16 test. Might be worth a shot

January 18, 2012 2:36 AM
 

Linchi Shea said:

Oops! I just noticed that the chart title on the second chart for 16 streams with 16 DOP is incorrect (another copy/paste error, yikes!). I'll correct it when I get home.

January 18, 2012 9:53 AM
 

Linchi Shea said:

Thomas;

I basically ran the tests multiple times for all the permutations of streams (1,4,8,16,32,64) and dop's (1,2,8,16, etc). Since this is all automated, all I need to do is to check that the query plans are stable and when a test run is done, I go into BIOS, change the processor configuration and restart the test. So I already have the data, and I'll check the data when I get home. But from my recollection, I think you are correct about the effect of increasing concurrency.

However, the thing that really bugs me is the wild fluctuation in the query response time with 16 streams/16 DOP and HT enabled. I'd guess that perhaps that's because the work is not as evenly distributed among the cores when HT is enabled as it when HT is disabled. But I have no evidence to support that.

January 18, 2012 10:09 AM
 

Linchi Shea said:

I've corrected the title of the second chart (for 16 streams with 16 DOPs).

January 18, 2012 8:45 PM
 

Linchi Shea said:

Recently, I have been looking into the performance impact of enabling hyperthreading on various platforms

January 22, 2012 1:41 AM
 

Joe Chang said:

The Hyper-Threading technology returned to the Intel Xeon (and Core-ix) with processor codename Nehalem

April 11, 2013 12:57 PM

Leave a Comment

(required) 
(required) 
Submit

About Linchi Shea

Checking out SQL Server via empirical data points

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement