This post shows how the Query Optimizer uses statistics to estimate the selectivity of expressions during query optimization.

You can also use this as a second part of my last post, The Query Optimizer and Parameter Sniffing. Here I will show you the advantage of the use of statistics when the Query Optimizer can “sniff” the parameter values compared to just guessing the selectivity of expressions when local variables are used. So I will be using the same query as in that previous post.

Notice that there are many other details that can not be covered here so this post will focus on a very simple example to show how the Query Optimizer creates and uses statistics.

To start open your AdventureWorks database and run this to display the current statistics on the ProductID column of the Sales.SalesOrderDetail table

dbcc show_statistics('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID)

This will display the header, density vector and histogram of the statistics object.

1) Understanding the Histogram

First I will explain the meaning of the values of a histogram’s steps. Let us take a look at step 86, shown here

RANGE_HI_KEY is the upper boundary of a histogram step. The value 826 is the upper boundary for step 85 and 831 is the upper boundary for step 86. This means that step 86 may contain only values from 827 to 831.

Run the following query to obtain the real number of records for ProductIDs 827 to 831 to compare them against the histogram

select ProductID, COUNT(*) as Total

from Sales.SalesOrderDetail

where ProductID between 827 and 831

group by ProductID

This produces the following result

Going back to the histogram, EQ_ROWS is the estimated number of rows whose column value equals RANGE_HI_KEY. In our example RANGE_HI_KEY is 831 and the number of records with ProductID 831 is 198. The same value is shown on the histogram.

RANGE_ROWS is the estimated number of rows whose column value falls inside the range of the step, excluding the upper boundary. In our example, this is the number of records with values from 827 to 830 (831, the upper boundary, is excluded). The histogram shows 110 records and we could obtain the same value by getting the sum of 31 records for 827, 46 records for 828, 0 records for 829 and 33 records for 830.

DISTINCT_RANGE_ROWS is the estimated number of rows with a distinct column value inside this range, excluding the upper bound. In our example we have records for three distinct values: 827, 828, and 830, so DISTINCT_RANGE_ROWS is 3. There are no records for ProductID 829 and 831 is again excluded.

Finally, AVG_RANGE_ROWS is the average number of rows per distinct value and it is calculated as RANGE_ROWS / DISTINCT_RANGE_ROWS. In our example, we have a total of 110 records for 3 DISTINCT_RANGE_ROWS, so that gives, 110 / 3 = 36.6667 also shown on the histogram for step 86.

Now let us see how the statistics are used to estimate the selectivity of the queries.

2) When the Query Optimizer knows the value

Let us see the first query

select * from Sales.SalesOrderDetail

where ProductID = 831

Since 831 is a RANGE_HI_KEY on step 86, the Query Optimizer will use the EQ_ROWS value and the estimated number of rows will be 198.

Now run the same query with the value 828

This time the value is inside the range of step 86 but it is not a RANGE_HI_KEY so the Query Optimizer uses the value calculated before as AVG_RANGE_ROWS. Actually, we get the same estimated number of rows for any of the other values in the range (except RANGE_HI_KEY). This also includes 829, even when there are no records for this ProductID.

Let us try now a nonequality operator and try to find the number of records less than 714. For these we need to calculate the sum of the values of both RANGE_ROWS and EQ_ROWS for steps 1 thru 7, which give us a total of 13,223 rows.

This is the query and the estimated number of rows is shown on the execution plan

select * from Sales.SalesOrderDetail

where ProductID < 714

3) When the Query Optimizer does not know the value

In the case when the Query Optimizer does not know the value used in the expression, like when local variables are used, the Query Optimizer can not use the histogram so it will use some other information including the statistics density information or it will try to guess the selectivity using some fixed percentages. First, let us try using the equality operator.

declare @pid int = 897

select * from Sales.SalesOrderDetail

where ProductID = @pid

The Query Optimizer is not able to see the value 897 in this query. As explained in my previous post, the Query Optimizer does not know the value of the @pid local variable at optimization time. So it will use the density value of the ProductID column, 0.003759399, as listed on the second section, density vector, of the DBCC SHOW_STATISTICS output. If we multiply this value by the total number of records, 121,317, we will get 456.079 which will be shown in the next execution plan.

Finally, run this query with a nonequality operator

declare @pid int = 897

select * from Sales.SalesOrderDetail

where ProductID < @pid

Same as before, the value 897 does not matter; any other value will give you the same estimated number of rows and execution plan. The estimated number of rows is always 30% of the total number of records for a nonequality operator. In this case the 30% of 121,317 is 36,395.1 as shown next.