
Elisabeth Redei

Lies damned lies and statistics - Part II


In a previous post I listed situations where your statistics might need some special attention (http://sqlblog.com/blogs/elisabeth_redei/archive/2009/03/01/lies-damned-lies-and-statistics-part-i.aspx).

As I mentioned in that post, the symptoms are typically related to more or less random variations in duration that cannot be explained by resource contention or blocking.

A quick recap of the various problems you can encounter:

1. Statistics are out of date

2. The default sample rate (used by the Autoupdate Statistics feature) may not be enough because it gets relatively lower as the tables grow larger (to avoid issues with Autoupdate statistics taking too long).

3. Data in the column is naturally "skewed" (e.g. LastName: Smith vs. Redei)
4. The Autoupdate Statistics process interrupts your user activity

In the last post I described how you can diagnose and alleviate the first two problems; in today's post I will talk about the third – skewed data and/or skewed statistics.

Skewed Data

First I will show you a repro that demonstrates the problem.

I created and filled a table with 600,000 rows, in buckets of 1,000 rows each, and with a non-clustered index on the second column:

 

CREATE TABLE test1

(

c1 tinyint

, c2 smallint

)

GO

CREATE INDEX idx1 ON test1 (c2)

However, I made sure that the value 500 does not exist at all:
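(The fill script is not reproduced here; the following is only a minimal sketch that produces the same distribution – 1,000 rows for each value of c2 between 1 and 600, with 500 deliberately skipped – and not necessarily the script that was actually used.)

-- Fill sketch: 1,000 rows for each value of c2 from 1 to 600, skipping 500
SET NOCOUNT ON
DECLARE @bucket INT, @i INT
SET @bucket = 1
WHILE @bucket <= 600
BEGIN
    IF @bucket <> 500 -- make sure the value 500 does not exist at all
    BEGIN
        SET @i = 1
        WHILE @i <= 1000
        BEGIN
            INSERT INTO test1 (c1, c2) VALUES (1, @bucket)
            SET @i = @i + 1
        END
    END
    SET @bucket = @bucket + 1
END
GO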


Fig 1. SELECT c2, COUNT(*) AS cnt FROM test1 GROUP BY c2 – note the missing 500.

If we look at the histogram of the statistics, because of how 500 falls between the steps, it looks like there are 1000 rows with the value 500 in the c2 column:

DBCC SHOW_STATISTICS ('test1', 'idx1') WITH HISTOGRAM


Fig 2. 500 falls in the range where 501 is the highest key value (RANGE_HI_KEY). AVG_RANGE_ROWS for this – and all other – ranges is 1000 rows.

This is the information that the Optimizer will use when estimating how many rows will be returned by any operator in the query plan. Let’s look at the problem in action.

-- Slow

DECLARE @mytime DATETIME

SET @mytime = GETDATE()

SELECT c1, c2 FROM test1 WHERE c2 = 500

SET @mytime = GETDATE() - @mytime

SELECT @mytime


Fig 3: The query returns 0 rows and executes in 253 ms.


Fig 4. Showplan output; there is a big discrepancy between the Estimated and Actual number of rows.

This query returns 0 rows, but SQL Server chooses a Table Scan because the statistics imply that 1000 rows will be returned; if it didn't do a Table Scan, it would have to do a lookup for an estimated 1000 rows to get c1 from the table data (the heap). If c1 was not included in the SELECT list, it could just scan the (leaf level of the) non-clustered index to satisfy the query:

SELECT c2 FROM test1 WHERE c2 = 500


Fig 5. Because all data can be found at the leaf level of the non-clustered index, the query results in an index seek.

If we force an index seek, the query executes much faster as well:

DECLARE @mytime datetime

SET @mytime = getdate()

SELECT c1, c2 FROM test1 WITH (INDEX = idx1) WHERE c2 = 500

SET @mytime = getdate() - @mytime

SELECT @mytime

The index hint is honored and the query runs a lot faster:


Fig 6.


Fig 7.

Let’s try another query:

DECLARE @mytime datetime

SET @mytime = getdate()

SELECT c1, c2 FROM test1 WHERE c2 = 502

SET @mytime = getdate() - @mytime

SELECT @mytime


Fig 8: A Table Scan is chosen, and in this case it is the best option, since the 1000 estimated rows equal the 1000 actual rows returned.


Fig 9. The table scan takes 287 ms to execute.

Let’s force an index hint for this last query, just to prove the point:

DECLARE @mytime datetime

SET @mytime = getdate()

SELECT c1, c2 FROM test1 WITH (INDEX = idx1) WHERE c2 = 502

SET @mytime = getdate() - @mytime

SELECT @mytime


Fig 10. When forcing an index seek, the query takes much longer, proving that a Table Scan is the right choice if 1000 – or more – rows are expected to be returned.

How do you diagnose problems related to statistics granularity and/or skewed data?

This is a little bit tricky. You need to find queries that contain operators with a big discrepancy between EstimatedRows (how many rows SQL Server expects the operator to return) and ActualRows (number of rows that were actually returned) as reported in the Actual execution plan.

If you have a single query, and the query is simple enough, it is relatively easy. In SQL Server Management Studio, choose "Include Actual Execution Plan" from the Query menu (or simply press Ctrl+M) and then run your query. If you hover over the icons (representing operators), the pop-up will present both EstimatedRows and ActualRows, as in Fig 4, Fig 5 and Fig 8 above.
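If you prefer T-SQL over the menu, the actual plan – with both EstimateRows and ActualRows – can also be requested per session with SET STATISTICS XML; a small sketch against the repro table:

-- Request the actual execution plan as XML for the statements that follow
SET STATISTICS XML ON
GO
SELECT c1, c2 FROM test1 WHERE c2 = 500
GO
SET STATISTICS XML OFF
GO
-- The showplan XML that is returned carries EstimateRows on each RelOp element
-- and ActualRows in its RunTimeInformation section.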

But if you have some workload you need to investigate, this method will be somewhat cumbersome. You already know that the plan is accessible via DMVs, but even if none of the issues with the sys.dm_exec_* DMVs that I described in "How to get High Quality Data about Query Performance" (http://sqlblog.com/blogs/elisabeth_redei/archive/2009/03/01/how-to-get-high-quality-information-about-query-performance.aspx) apply, you still cannot use them, because sys.dm_exec_query_plan contains the compile-time representation of the query plan – which means it only contains estimates, not actual values for rows returned.

Again, Profiler (or the corresponding sp_trace* procedures) comes to the rescue (Gail Shaw has written an excellent post on how to use the sp_trace procedures on simple-talk.com, http://www.simple-talk.com/sql/performance/finding-the-causes-of-poor-performance-in-sql-server,-part-1/). I will do it using the graphical interface.

Setting up The Profiler Trace to Capture Execution Plans

Start Profiler with a new trace and choose a blank template. On the Events Selection tab, click "Show all columns" and "Show all events" and choose only the Performance: Showplan XML Statistics Profile event. Click "Column Filters" and add any relevant filters – for example on ApplicationName, or on LoginName if you are only interested in traffic from a specific login.

Then move on to the Events Extraction Settings tab and check "Save XML Showplan events separately". Once you have specified the path, you can choose "All XML Showplan batches in a single file".

The events will end up in a file with the .sqlplan extension. You can double-click this file and it should open in SQL Server Management Studio. At this point you can hover over the operators just as you would over a graphical execution plan generated in SQL Server Management Studio.

But let's pretend we have a little bit more than the 4 queries I issued above. I want to find out which indexes have statistics that lead to large discrepancies between row estimates and actual rows returned.

The attached script, GetEstimateVsActualRows.sql*, gives the following output for the 4 queries I have issued in this article:


Fig 11. Estimates vs. Actual

The output reveals that the same index (or rather its statistics), idx1, sometimes gives correct estimates and sometimes very misleading ones (the Diff column).
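The attached script is not reproduced inline, but the core idea is to shred each captured showplan for its operators (the RelOp nodes) and compare the EstimateRows attribute with the ActualRows counters reported under RunTimeInformation. A much simplified sketch – assuming, purely for illustration, that the captured plans have been loaded into a table called PlanTrace with an XML column plan_xml – could look like this:

-- Simplified sketch (not the attached GetEstimateVsActualRows.sql):
-- estimated vs. actual rows per operator for each captured plan
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
      op.value('@PhysicalOp', 'nvarchar(128)') AS PhysicalOp
    , op.value('@EstimateRows', 'float') AS EstimateRows
    , op.value('sum(RunTimeInformation/RunTimeCountersPerThread/@ActualRows)', 'float') AS ActualRows
    , op.value('sum(RunTimeInformation/RunTimeCountersPerThread/@ActualRows)', 'float')
      - op.value('@EstimateRows', 'float') AS Diff
FROM PlanTrace
CROSS APPLY plan_xml.nodes('//RelOp') AS r(op)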

Now there are two reasons why this could happen.

  1. The problem I am describing in this post – skewed data or problems with statistics granularity
  2. The plan for a particular stored procedure (or any reusable plan) has been cached with an atypical value – or the opposite, it has been cached with typical values and suddenly an atypical value is passed as a parameter to the procedure

I am not going to talk about the second scenario in this post (but I might in a later post).

What can you do if you encounter this problem?

Several things actually:

  1. First of all, check whether increasing the SAMPLE RATE (if not already at FULLSCAN) alleviates the problem (although in my case it doesn’t – simply because I made sure it wouldn’t):

You can follow the steps I provided in http://sqlblog.com/blogs/elisabeth_redei/archive/2009/03/01/lies-damned-lies-and-statistics-part-i.aspx.
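As a quick reminder (the details are in Part I), the sample rate can be raised manually, for example:

-- Rebuild the statistics behind idx1 with a full scan instead of the default sample
UPDATE STATISTICS test1 idx1 WITH FULLSCAN
GO
-- ... or with an explicit, higher sample rate
UPDATE STATISTICS test1 idx1 WITH SAMPLE 50 PERCENT
GO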

  2. Provide an index hint (as I did in Fig 6 and Fig 7 above)
  3. Create a covering non-clustered index on (c2, c1) to avoid the entire table being scanned (all data will be present in the leaf level of the non-clustered index). However, because the test1 table consists of only those two columns, this would have no effect here.
  4. "Hide" the value you are looking for from the SQL Server optimizer so it will not use the histogram to make the estimate.

This will only work if the overall selectivity as presented by DBCC SHOW_STATISTICS ... WITH STAT_HEADER gives an estimate that works well for most instances of the query (i.e. no matter what value you pass in, you get satisfactory performance).
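For the repro table, that information can be inspected like this (the DENSITY_VECTOR option is an addition here; it exposes the "All density" value that the optimizer falls back to when it cannot use the histogram):

DBCC SHOW_STATISTICS ('test1', 'idx1') WITH STAT_HEADER
GO
-- The density vector holds the "All density" value (roughly 1 / number of distinct values)
DBCC SHOW_STATISTICS ('test1', 'idx1') WITH DENSITY_VECTOR
GO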

You can achieve this by wrapping your query in a stored procedure and using a local variable:

CREATE PROCEDURE usp_HideParam (@col1 INT)

AS

-- Because the parameter value is copied into a local variable, the optimizer
-- cannot "see" the actual value at compile time and bases its estimate on the
-- overall density instead of the histogram.

DECLARE @newcol1value INT

SET @newcol1value = @col1

SELECT c1, c2 FROM test1 WHERE c2 = @newcol1value

(In my – extreme – case it doesn't work very well at all, because the estimate, now based on the overall density rather than the histogram, will still be 1000.)

Obviously the reverse should be considered as well – if you want the optimizer to use the statistics in the histogram (which by the way is only created for the first column in the index), you need to make sure that the value you pass into the stored procedure is the value used in the statement.

For instance, if I execute the following query, I will get an index seek because EstimateRows = 1 for 601, which is outside the range of values (Fig 1):

SELECT c1, c2 FROM test1 WHERE c2 = 601

Whereas

usp_HideParam 601

... results in EstimateRows = 1000 and consequently a Table Scan.

  5. Wrap the query in a stored procedure and use the OPTIMIZE FOR hint.

Again, the goal is to create a plan that performs satisfactorily for most instances of the query.
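A hypothetical sketch of that approach (not part of the original repro) could look like this, compiling the plan as if a typical, existing value had been passed:

-- Hypothetical example: always optimize as if a typical value (502) was passed
CREATE PROCEDURE usp_OptimizeForTypical (@col2 INT)
AS
SELECT c1, c2
FROM test1
WHERE c2 = @col2
OPTION (OPTIMIZE FOR (@col2 = 502))
GO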

  6. Use the USE PLAN hint.

Needless to say, none of the above suggestions except the second (providing an index hint) would work very well in the above reproduction scenario, nor would they be very meaningful. If the problem manifests itself only for one or very few values in the column, it is hardly worth the effort.

Nevertheless, for a more natural, but still skewed, data distribution where increasing the SAMPLE RATE of the statistics does not alleviate the problem, it might be better to have predictable – although not the best – performance than a completely random performance pattern.

* This script was put together with bits and pieces from http://sqlblog.com/blogs/michael_coles/archive/2009/03/03/practical-sql-server-xml-part-2.aspx and http://sqlblog.com/blogs/jonathan_kehayias/archive/2009/07/27/digging-into-the-sql-plan-cache-finding-missing-indexes.aspx and a whole lot of sweat! It works on 2005 and 2008 showplan output, on parallel as well as non-parallel execution plans.

Published Monday, August 10, 2009 5:29 PM by Elisabeth Redei

Attachment(s): GetEstimatesVSActualRows.txt


Comments

 

jchang said:

in this specific example, when you are looking for a row that does not exist, it does not matter what the statistics sampling is, a good estimate is frequently not possible.

The presumption is that one is looking for a row that does exist.

Let's assume that value is not a Range_Hi_Key value but within the bounds of the low and high Range_Hi_Key,

then the only info available is from the Distinct_Range_Rows and Avg_Range_Rows.

Suppose we are looking for value = 550 and the lower Range_Hi_Key is 500 and the next Range_Hi_Key is 600, and the Distinct_Range_Rows is 30, and the Avg_Range_Rows is 100.

Of the 99 possible values between 500 and 600, only 30 exist, for which the average distribution is 100.

The query plan for value = 550 does not consider that there is a 2/3 chance of no values, the presumption is that we are searching for one of the 30 that do exist.

So I think the only time a zero return will show (as estimate 1 row) is if the value is outside the bounds, or if the Range_Rows is zero, which sometimes means there are fewer than 200 distinct values.

August 10, 2009 11:28 PM
 

