Distribution statistics are one of the most important sources of information that the Query Optimizer uses to determine a good query plan. In this post, I’m not going to tell you everything about distribution statistics. I’m just going to show you a few tricks for getting access to the statistics.
If you want a deeper understanding of what the statistics keep track of, and you don’t have any of my SQL Server Internals books handy, check out this whitepaper: Statistics Used by the Query Optimizer in Microsoft SQL Server 2008.
Microsoft does provide us a tool called DBCC SHOW_STATISTICS for examining the distribution statistics.
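For example, to see everything the optimizer keeps for a single index (using the same AdventureWorks index I’ll use later in this post), you can run the command directly; it returns three separate result sets:

```sql
DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID);
```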
Microsoft has gradually been making more and more of the old DBCC commands available as DMVs, even some undocumented ones. For example, one of my favorites, DBCC IND, has now been replaced in SQL Server 2012 by sys.dm_db_database_page_allocations.
I have been wishing for several versions that Microsoft would make the DBCC SHOW_STATISTICS information available as a DMV. But it hasn’t happened yet, and I’m tired of waiting, so I decided to do something about it.
My solution is not quite as easy to use as a DMV might be, but it allows you to get the information that DBCC SHOW_STATISTICS provides into a set of three tables that can then be saved into a more permanent location of your choice, and/or queried as desired.
DBCC SHOW_STATISTICS returns three sets of information, each with different columns in the output, so three different tables are needed. The DBCC SHOW_STATISTICS command can be called with an argument that specifies that you want just one of the three sets returned. The options are:
WITH STAT_HEADER – returns basic info such as the date the statistics were last updated and the number of rows in the table/index. It also reports the number of steps returned in the HISTOGRAM section.
WITH DENSITY_VECTOR – returns density info for each left-based subset of columns in the index. For example, an index on (lastname, firstname, city) would have a density value for (lastname), for (lastname, firstname), and for (lastname, firstname, city). Each density value is a single number equal to 1 divided by the number of distinct values in that column set. For example, if there are only 2 possible values in the column, the density would be 0.5. Multiplying the density by the number of rows reported in the STAT_HEADER section gives the average number of rows you would expect a query to return for an equality predicate on the specified column(s).
WITH HISTOGRAM – returns a set of ordered values from the first column of the index, creating a histogram. This histogram provides the optimizer with selectivity information for specific values or ranges of values in the first column of the index.
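As a quick sketch of the density arithmetic, with purely hypothetical numbers (an index on (lastname, firstname, city) over a 100,000-row table with 500 distinct last names):

```sql
-- Hypothetical: 100,000 rows, 500 distinct values of lastname.
-- DENSITY_VECTOR would report for (lastname): all_density = 1.0/500 = 0.002
-- Estimated rows for a predicate like: WHERE lastname = 'Smith'
SELECT 0.002 * 100000 AS estimated_rows;   -- 200 rows on average
```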
To collect this info, I will use one of my favorite tricks, which is to create a table in the master database with a name starting with sp_. (I’ve written about this trick several times, including in this earlier blog post.) Once I have the table(s) created, I can access them from any database. So here are the three tables:
-- Create these tables in the master database, so they can be accessed from any database.
-- The columns match the STAT_HEADER output in SQL Server 2008 R2/2012.
IF (SELECT object_id('sp_stat_header')) IS NOT NULL
    DROP TABLE sp_stat_header;
GO
CREATE TABLE sp_stat_header
(   Name sysname,
    Updated datetime,
    Rows bigint,
    Rows_Sampled bigint,
    Steps smallint,
    Density numeric (10,9),
    Average_key_length float,
    String_Index char(3),
    Filter_Expression nvarchar(1000),
    Unfiltered_Rows bigint );
GO
IF (SELECT object_id('sp_density_vector')) IS NOT NULL
    DROP TABLE sp_density_vector;
GO
CREATE TABLE sp_density_vector
(   all_density numeric(10,8),
    average_length float,
    columns nvarchar(2126) );
GO
IF (SELECT object_id('sp_histogram')) IS NOT NULL
    DROP TABLE sp_histogram;
GO
CREATE TABLE sp_histogram
(   RANGE_HI_KEY sql_variant,
    RANGE_ROWS float,
    EQ_ROWS float,
    DISTINCT_RANGE_ROWS bigint,
    AVG_RANGE_ROWS float );
GO
The second trick is to use INSERT … EXEC to execute a DBCC statement and populate the tables. I will build the DBCC command dynamically, after capturing the schema, table and index names in variables. You of course could take this code and turn it into a stored procedure, for which the schema, table and index names are passed as parameters. I’ll use as an example a table in the AdventureWorks2008 sample database, just so you can try running the code, and I can verify that it actually works!
I will use the table Sales.SalesOrderDetail and the index IX_SalesOrderDetail_ProductID. So the object name (@oname) is SalesOrderDetail, the schema name (@sname) is Sales, and the index name (@iname) is IX_SalesOrderDetail_ProductID.
SET NOCOUNT ON;
DECLARE @oname sysname, @iname sysname, @sname sysname;
SELECT @oname = 'SalesOrderDetail', @sname = 'Sales', @iname = 'IX_SalesOrderDetail_ProductID';
TRUNCATE TABLE sp_stat_header;
-- Update the object name to include the schema name, because that is the format the DBCC command expects
SELECT @oname = @sname +'.' + @oname;
INSERT INTO sp_stat_header
EXEC ('DBCC SHOW_STATISTICS(['+ @oname + '],' + @iname +') WITH STAT_HEADER');
TRUNCATE TABLE sp_density_vector;
INSERT INTO sp_density_vector
EXEC ('DBCC SHOW_STATISTICS(['+ @oname + '],' + @iname +') WITH DENSITY_VECTOR');
TRUNCATE TABLE sp_histogram;
INSERT INTO sp_histogram
EXEC ('DBCC SHOW_STATISTICS(['+ @oname + '],' + @iname +') WITH HISTOGRAM');
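If you do want to package this up, the script above could be wrapped in a procedure along these lines. (sp_showstats is just my name for this hypothetical wrapper; it assumes the three sp_ tables already exist, and like the script, it resolves the table name in the current database.)

```sql
CREATE PROCEDURE sp_showstats
    @sname sysname, @oname sysname, @iname sysname
AS
BEGIN
    SET NOCOUNT ON;
    -- Combine schema and object name, as the DBCC command expects
    DECLARE @fullname nvarchar(600) = @sname + '.' + @oname;
    TRUNCATE TABLE sp_stat_header;
    INSERT INTO sp_stat_header
        EXEC ('DBCC SHOW_STATISTICS([' + @fullname + '],' + @iname + ') WITH STAT_HEADER');
    TRUNCATE TABLE sp_density_vector;
    INSERT INTO sp_density_vector
        EXEC ('DBCC SHOW_STATISTICS([' + @fullname + '],' + @iname + ') WITH DENSITY_VECTOR');
    TRUNCATE TABLE sp_histogram;
    INSERT INTO sp_histogram
        EXEC ('DBCC SHOW_STATISTICS([' + @fullname + '],' + @iname + ') WITH HISTOGRAM');
END;
```

Then the whole capture becomes one call: EXEC sp_showstats 'Sales', 'SalesOrderDetail', 'IX_SalesOrderDetail_ProductID';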
So now you can look at the values collected and filter or query them in any way you like, or use SELECT INTO to save them into another table, so that the sp_ tables can be reused the next time you want to capture distribution statistics information.
SELECT * FROM sp_stat_header;
SELECT * FROM sp_density_vector;
SELECT * FROM sp_histogram;
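For instance, you could keep a permanent copy of the histogram, or find the histogram step that covers a particular ProductID (the destination table name here is just an example; the CONVERT is needed because RANGE_HI_KEY is stored as sql_variant):

```sql
-- Save a permanent copy of the captured histogram
SELECT * INTO SalesOrderDetail_ProductID_histogram FROM sp_histogram;

-- Find the histogram step whose upper boundary covers ProductID 870
SELECT TOP 1 *
FROM sp_histogram
WHERE CONVERT(int, RANGE_HI_KEY) >= 870
ORDER BY CONVERT(int, RANGE_HI_KEY);
```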
Let me know if you find this useful, and especially if you embellish it to create a procedure or an automated process of your own!