Linchi Shea

Checking out SQL Server via empirical data points

Performance impact: hyperthreading for reporting queries

There are a lot of questions on hyperthreading, but not a lot of answers. There is no shortage of opinions, but very few are based on significant first-hand experience or solid test data.

We know that the hyperthreading technology in the older generations of the Intel processors was not well received by the SQL Server community. Hyperthreading in the newer generations of Intel's Nehalem-based Xeon processors, however, is decidedly better implemented, and it appears to be much better received by the SQL Server community.

I have done some tests with the Intel Westmere-EX processors, exploring how their hyperthreading feature may influence various types of SQL Server workloads, and my experience from these tests is generally positive. In fact, I have not seen any significantly negative performance impact that could be directly attributed to enabling hyperthreading. I plan to post some of my empirical observations. The intent is not to settle any issues. Rather, in keeping with my own tradition, it is to contribute some data points to this very important subject.

In this post, I focus on the performance impact of enabling hyperthreading on reporting queries. Since a large number of variables could heavily influence the results of any hyperthreading-related test, quite a bit of attention is paid to ensuring the repeatability of the test results.

The tests are set up as follows:

  • The tests use a single common query (shown below). The focus is not on single-query performance, but on how streams of this query perform.
  • There are two key test variables:
    • Degree of parallelism. It is controlled with the option (maxdop) clause appended to the query.
    • Number of concurrent query streams. It is controlled through a C# program that spawns that many threads, each of which drives the query in a serial loop for a configurable amount of time.
  • There are three hyperthreading (HT) test configurations:
    • 40 cores with HT. That is 80 logical CPUs. HT is enabled in the BIOS.
    • 40 cores without HT. That is 40 logical CPUs. HT is disabled in the BIOS.
    • 20 cores with HT. That is 40 logical CPUs. HT is enabled in the BIOS.
  • The query execution plans are logged by the test driver and verified to be constant across each test run and each query execution.
  • The performance statistics of each query execution, regardless of its maxdop setting or which stream it is in, are captured by the test driver and loaded into a table for analysis and overall metric calculation.
  • The test database consists of a 200-million-row main table tran_stats, exactly as described in this previous post. The SQL scripts for creating and populating this table are in the attached zip file. The test database also includes a dimension table which basically is the #tmp table in this same previous post. The scripts to create and populate this dimension table are also included in the attached zip file.
  • The test results are summarized with a query processing throughput metric, Queries per Hour, representing the number of queries processed in an hour for the given test configuration.
  • All tests are run while the data are fully cached. The storage subsystem is not a factor in these tests.
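
The stream-driving mechanism above can be sketched as follows. The actual test driver is a C# program; this is a hypothetical Python equivalent (the names `drive_streams` and `run_query` are illustrative, not from the original driver). It shows the idea: one thread per stream, each executing the query in a serial loop for a fixed duration, with the execution counts reduced to the Queries-per-Hour metric.

```python
import threading
import time

def drive_streams(run_query, num_streams, duration_seconds):
    """Spawn num_streams threads; each drives run_query in a serial loop
    until duration_seconds elapse.  Returns the Queries-per-Hour metric."""
    counts = [0] * num_streams
    start = time.monotonic()

    def worker(i):
        # Each stream runs the query back to back (a serial loop).
        while time.monotonic() - start < duration_seconds:
            run_query()        # stand-in for executing the test query
            counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    elapsed = time.monotonic() - start
    # Queries per Hour: total queries completed, normalized to one hour.
    return sum(counts) * 3600.0 / elapsed
```

In the real driver, `run_query` would submit the test query (with its maxdop clause) over a dedicated connection and log the execution plan and per-execution statistics.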

Essentially, we want to see how this test workload behaves when we increase the load level by increasing the number of concurrent query streams for a given maxdop setting, with and without hyperthreading, on a DL580 G7 with four E7-4870 (10-core) processors and 264GB of RAM. The OS is Windows 2008 R2 Enterprise x64 Edition with SP2, and the DBMS is SQL Server 2008 R2 Enterprise x64 Edition (10.50.1600).

The main test query (without the maxdop option clause) is as follows:

select COUNT(*) from (
   select id_server_config, id_thread_name, id_tran_type, id_tran_status, id_users, id_cpus,
          COUNT(*) as cnt
   from tran_stats t1 join dimension t2
     on t1.server_config = t2.server_config and
        t1.thread_name   = t2.thread_name and
        t1.tran_type     = t2.tran_type and
        t1.tran_status   = t2.tran_status and
        t1.users         = t2.users and
        t1.cpus          = t2.cpus
   group by id_server_config, id_thread_name, id_tran_type, id_tran_status, id_users, id_cpus
) as v

I picked this particular query because it is a rather plain reporting query: nothing exotic, and something you'd often see in the real world. I also happened to be playing with it recently for other purposes.

By the way, in the spirit of full disclosure, if there is interest in the actual test driver program, I can post it here.

So, what do the results look like?

40 cores with HT vs. 40 cores without HT

For maxdop = 2, the query throughput results are summarized in the following chart.

maxdop2_40cores_HT_vs_40cores

Most notably, when the load level is stressed with 64 concurrent streams, the system manages to push through ~478 queries per hour with HT. Compare that to ~374 queries per hour without HT at the same load level. That is about a 28% increase in processing power (for this specific configuration; always remember that!). That's a substantial improvement.
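
As a quick arithmetic check, the ~28% figure follows directly from the two chart readings:

```python
# Approximate Queries-per-Hour readings from the chart at 64 streams.
with_ht = 478.0
without_ht = 374.0

# Relative throughput gain from enabling HT, in percent.
gain_pct = (with_ht - without_ht) / without_ht * 100
print(round(gain_pct))  # -> 28
```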

Also note that for streams 1 through 16, there is little to no difference whether or not HT is enabled. That is expected. After all, if the queries are not competing for processors, whether you have 40 or 80 processors should not make a big difference. This also means that if you just run your test with a single query, you probably won’t see any difference, one way or another. And if you draw a conclusion based on that single query result, the conclusion is likely to be misleading.

The next three charts show the results for maxdop=4, 8, and 20, respectively. The results look similar to those for maxdop=2 in terms of the general trend and pattern, though the extent of the performance improvement varies. The improvement is generally between 8% and 25%.

maxdop4_40cores_HT_vs_40cores

maxdop8_40cores_HT_vs_40cores

maxdop20_40cores_HT_vs_40cores

40 cores without HT vs. 20 cores with HT

In this case, both appear as 40 logical processors at the OS level and to the SQL Server instance, except that one has 40 cores and the other has 20 cores. So which configuration should have better performance capacity?

You probably would pick 40 cores without HT. And you’d be correct. The following charts are the supporting evidence.

maxdop2_40cores_vs_20cores_HT

maxdop4_40cores_vs_20cores_HT

maxdop8_40cores_vs_20cores_HT

maxdop20_40cores_vs_20cores_HT

When pushed, 40 cores without HT outperforms 20 cores with HT substantially, by as much as ~48%.

So overall in these tests, the performance impact of hyperthreading on the query throughput is positive, and in some cases very substantial.

To reiterate, I'd be careful about extrapolating the observations here, if at all. Hopefully, though, as evidence accumulates from the community, we can develop a better comfort level in using this feature effectively.

Published Thursday, January 05, 2012 1:36 AM by Linchi Shea

Attachment(s): create_tables_load_data.zip

Comments

 

GrumpyOldDBA said:

Very interesting; thank you for publishing this. I rarely get the chance to run this level of tests on new hardware, so your work is much appreciated.

I think I might have expected the HT to perform better, bearing in mind the claimed improvements. Our DW team has a couple of these boxes, but with HT turned off. Sadly, I wasn't able to run any tests on their boxes, and last I heard they had no plans to enable it, but that may well change.

January 5, 2012 8:02 AM
 

Linchi Shea said:

My previous post focuses on the performance impact of enabling hyperthreading (HT) on a machine with

January 5, 2012 11:44 PM
 

Bryce Deneen said:

Great post! Very nice to see how much HT can improve your performance given similar setups with/without HT enabled.

I would, however, make one suggestion. Between the first set of graphs and the second set of graphs, you changed the colors of HT vs. non-HT. It would make more sense to keep the colors consistent throughout, based on processor configuration rather than best performance.

January 7, 2012 11:18 AM
 

Serguei Tarassov said:

Thank you for publishing the results.

As I see it, your query seems to return the same result every time.

Could you change this query a little and introduce a filter in the WHERE clause? E.g., filter by "tran_type" randomly in the threads.

January 9, 2012 6:08 AM
 

Linchi Shea said:

Serguei;

What additional information do you think adding a WHERE clause would help bring forth in terms of evaluating the impact of hyperthreading?

January 9, 2012 10:54 AM
 

Linchi Shea said:

Bryce;

Thanks! I made that color-coding change on purpose. It might not have been a good idea.

January 9, 2012 10:55 AM
 

Mark said:

Linchi, thanks for taking the time to set up these tests and publish the results. Very interesting!

January 9, 2012 5:50 PM
 

Serguei Tarassov said:

Linchi,

HT vs. no-HT is very interesting for me because of numerous demands from database hosting teams, so we could work together on obtaining the results. Feel free to contact me directly at st@arbinada.com.

About your test, there are some ambiguous points:

1. After reproducing the test, the query always returns “0”

2. I cannot see any “parallelism” operator in execution plan

3. Your query is not appropriate for “real world OLAP”, and even less for “real OLTP”

I would like to suggest that you change your test a little:

1. Use only one table tran_stats

2. Fill the table with only 20M rows, so that some aggregates can be calculated in a reasonable time (real-world OLAP usually uses precalculated aggregates at 200M)

3. Create test query as parameterized stored procedure “test_q1” (see below)

CREATE PROCEDURE test_q1
  @tran_begin datetime
WITH RECOMPILE
AS
BEGIN
  SET NOCOUNT ON;
  SELECT
    COUNT(1), t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
  FROM
    tran_stats t1
    INNER JOIN
    (SELECT AVG(resultset_size) AS avg_resultset_size, server_config/*, thread_name, tran_type, tran_status, users, cpus*/
     FROM tran_stats
     GROUP BY server_config/*, thread_name, tran_type, tran_status, users, cpus*/
    ) t2
    ON
      t1.server_config = t2.server_config AND
      t1.resultset_size <= t2.avg_resultset_size
  WHERE
    t1.tran_begin BETWEEN @tran_begin AND DATEADD(second, 1, @tran_begin)
  GROUP BY
    t1.server_config, t1.thread_name, t1.tran_type, t1.tran_status, t1.users, t1.cpus
END
GO

4. In every testing thread:

4.1. Get “max” and “min” values for “tran_begin”

4.2. Repeat executing procedure “test_q1” with a random @tran_begin value between “max” and “min”, e.g.

EXEC test_q1 @tran_begin = '2012-01-10 12:30:50.903'

Could you please repeat the test on your server and compare the results?

January 10, 2012 11:52 AM
 

Linchi Shea said:

Serguei;

I'll see if I have time to run your query.

> 1. After reproducing the query always returns “0”

That is done on purpose so that we can focus on the processing on the server itself, without having to worry about other components confounding the results.

> 2. I cannot see any “parallelism” operator in execution plan

I did use a query hint to ensure parallelism. See Paul White's blog post on forcing parallelism from a few weeks ago.

> 3. Your query is not appropriate for “real world OLAP” and all the more for “real OLTP”

I beg to differ here. Very generally speaking, whether or not the query represents a real-world app is actually not that relevant. What is relevant is the general character of how the query is processed: scanning a large number of pages, parallelism, and so on. More specifically, though, the test query was in fact taken from a real-world app, with all the column names and the subject domain changed. In the real app, a similar query was written by someone else to extract the dimension IDs from the raw 'fact' table, in which the actual values are stored, when first populating the data warehouse table, where we want to store only the dimension IDs rather than the actual dimension values.

But then, your real-world app may not be my real-world app, a discussion I'd rather not get into.

January 11, 2012 11:09 AM
 

Serguei Tarassov said:

Linchi,

Thank you for your answer.

I suppose that using hints in the query is not good practice in general, and especially not for this test, because one of the principal objectives is testing the engine optimizer in an HT environment.

In "real world OLAP" data warehouses are not the objects for adhoc user queries but are the sources for cubes and datamarts attacked by final users. My query example is just the very simple version of one typical query in real relational OLAP system.

Please note that in real OLAP the clustered index _should be used_ by the query (usually on the time dimension), which is not the case in your test.

Hope this helps.

Bon courage :)

January 13, 2012 5:20 AM
 

Linchi Shea said:

In general, a query hint may not be good practice. But sometimes it's a necessity, and in such a test it's a must, because plan stability is a requirement for the results to make sense. Note that I wasn't really testing the optimizer itself (e.g. I wasn't interested in how turning on HT may cause the optimizer to choose a different plan), but to what extent processing the query may be exercising HT.

January 13, 2012 12:04 PM
 

Linchi Shea said:

In the comments to my previous post on the performance impact of enabling hyperthreading on reporting

January 17, 2012 5:57 PM
 

Linchi Shea said:

Recently, I have been looking into the performance impact of enabling hyperthreading on various platforms

January 22, 2012 1:41 AM
 

Linchi Shea said:

These days if you get a new server-class machine to run SQL Server, you can almost be 100% sure that

January 30, 2012 6:17 PM
 

Kevin Boles said:

Hey Linchi!  I see you followed up to Serguei's input.  I have two additional theories of my own and it would be great if you could test them out sometime.

1) You used a single fact table, which increases the potential for CPU-level cache hits.  I was wondering if you could spin up at least 2, preferably more, tables to spread across the threads to test this theory.

2) I think potentially MUCH more important than that one is that you used the Enterprise Edition of SQL Server, which has the Advanced Scan (merry-go-round) functionality. So theoretically, almost all the concurrent threads that start up will be accessing the same data pages a (large?) fraction of the time, again leading to significant CPU-level cache hits. This would seem to be the reason that the reporting test actually provided a BIGGER benefit from HT than the OLTP tests did, which I for one found VERY surprising.

BTW, I LOVE all your blog posts, but especially your benchmarking ones!!

April 27, 2012 8:56 AM
 

Prasanna said:

Interesting Article, and exactly on the lines of what I was looking for. Thank You!

Microsoft has changed SQL Server licensing to core-based licensing, and on virtual servers it now counts V-CPUs as cores.

So if I have a 4-core physical server, then I can use a 4-core license and enable hyperthreading to get maximum performance. But if I virtualize, that same license only gives me 4 V-CPUs (or half the processor capacity).

So is HT really worth a separate Core license?

January 12, 2013 9:55 PM

