Data distribution statistics is one of the foundations of the cost-based query optimizer in all modern database engines
including SQL Server.
From SQL Server 2005 on, most of the information displayed by DBCC SHOW_STATISTICS is kept in a binary field
accessible with the STATS_STREAM clause.
Back in SQL Server 2000, it was possible to modify system tables directly,
including the sysindexes stat_blob field.
At the time, I described a decode of the stat_blob field with the purpose of influence the execution plan,
presumably on a development system and not a production system.
Starting with SQL Server 2005, it was nolonger possible to directly modify system tables.
An API was provided to access data distribution statistics to allow cloning the statistics from one database
to another.
The presumed usage is to clone statistics from a large production database to a small development database.
In other database engines, I had heard of the idea of updating statistics on a backup system
to be applied to the production system.
While it was still possible to decode most of the 2005 stats_stream binary,
it appears that a checksum was added so it was not possible to
apply an externally generated statistics binary unless the "checksum" value could be correctly calculated.
Around this time, I was working on other SQL Server tools,
most prominently SQL System for performance monitoring,
Exec Stats for execution plan analysis and TraceAnalysis for trace processing.
Work on the SQL Server data distribution cloning tool was discontinued,
and I could not continue further research into the decoding of SQL Server data distribution statistics.
Since several people have asked about the data distribution statisics decode,
I am making what I know about stat_stream available.
It would be helpful is other people would contribute to the missing pieces.
Note that organization of stats_stream changed from SQL Server version 2000 (then sysindexes stat_blob) to 2005 and again to 2008?
It is quite possible there are also changes in version 2012?
Most of what I discuss here applies to version 2008 R2.
Decoding Stats Stream for SQL Server 2008R2
Here I am using 1 based reference. Byte index 1 is the first byte.
In C# and most other programming languages use zero based index.
| Position | Length | Value/Type | Purpose |
| 1 |
4? |
1 |
unkown |
| 5 |
4? |
|
number of vectors |
| 9 |
4 |
0 |
zero |
| 13 |
4 |
0 |
zero |
| 17 |
4 |
|
checksum |
| 21 |
4 |
0 |
zero |
| 25 |
4 |
|
stats stream length |
| 29 |
4 |
0 |
zero |
| 33 |
4 |
|
stats stream length - minus vector variable length
The difference [25]-[33] is 64 for 1 vector (defined as off1).
Each additional vector adds 24 bytes starting at byte position 41
|
| 37 |
4 |
0 |
zero |
| Start of vector information, 24-bytes per vector |
| 41 |
1 |
|
system type id |
| 42 |
1 |
|
unkown |
| 43 |
2 |
|
unkown |
| 45 |
4 |
|
user type id |
| 49 |
2 |
|
length |
| 51 |
1 |
|
Prec |
| 52 |
1 |
|
Scale |
| 53 |
4 |
|
unknown |
| 57 |
4 |
|
unknown |
| 61 |
2 |
|
unknown |
| 63 |
2 |
|
unknown |
| |
|
|
Some of the unknown fields should be for nullable, collation, etc |
| Addition vectors if present |
| off1+1* |
9 |
|
Updated?, 9 byte datetime2? |
| off1+10 |
3 |
|
unknown |
| off1+13 |
8 |
|
Rows |
| off1+21 |
8 |
|
Rows sampled |
| off1+29 |
4 |
4 byte real |
Density - Header |
| off1+33 |
4x33=132 |
4 byte real |
Density - vector, upto 33 values |
| off1+165 |
4 |
4 byte int |
Steps (first copy) |
| off1+169 |
4 |
4 byte int |
Steps (second copy) |
| off1+173 |
4 |
4 byte int |
number of vectors |
| off1+177 |
4 |
4 byte int |
Step size (in bytes) |
| off1+181 |
4 |
4 byte real |
Average Key length - header |
| off1+185 |
4 |
4 byte real |
Unfiltered rows |
| off1+189 |
4 |
4 byte int |
unknown |
| off1+193 |
4x33=132 |
4 byte real |
Average key length - vector |
| |
|
|
Some fields may represent string index (bool), or filter expression |
| off1+325 |
8 |
8 byte int |
unkown, values 0x11, 0x13 and 0x19 observed,
may determine the # of post histogram 8 bytes values starting at off1+341? |
| off1+333 |
8 |
0 |
8 byte 0? |
| off1+341 |
8 |
0 |
offset for value after histogram? |
| off1+349 |
8 |
0 |
another offset |
| off1+357 |
8 |
0 |
another offset if value of [off1+25] is 19 or more? |
| more offsets if value of [off1+25] is 25 or more? |
Eventually, this sequence appears: 0x10001100 followed by three 4-byte real,
a value in native type of the stat, and then
ending with 0x040000 |
| off2 |
2 |
0x10 - 16 |
length of core columns
determines the organization of histogram structures?--> |
| off2+2 |
2 |
17 or higher |
size of step, excluding 3 byte trailer |
| off2+4 |
4 |
4 byte real |
Eq Rows |
| off2+8 |
4 |
4 byte real |
Range Rows |
| off2+12 |
4 |
4 byte real |
Avg Range Rows |
| off2+16 |
native length |
native type |
Range Hi Key |
| off2+16+x |
3 byte |
0x040000 |
step terminator?, x is the size of the type |
| off3 |
? |
? |
additional info |
| *off1 = value of 4(8) byte int at position [25] - value of [33] |
| **off2 = off1 + 341 + 16 if value of [off1+325] is 0x11, or 24 if 0x13 or 0x19 |
| ***off3 = off1 + 341 + value of 4(8) byte int at [off1+341] |
So far, for SQL Sever 2008 R2, I have only looked at fixed length not nullable statistics.
Variable length statistics has different organization, particularly in the histogram part.
String statistics may have extended information after the histogram, per new feature of SQL Server 2008?
Umachandar provides a SQL function for converting 4-byte binary to real or 8-byte binary float,
and vice versa.
Supporting SQL functions and procedures:
The updated tools now has a stored procedure that accepts table and index (or column stat) as input parameters,
in addition to the original procedure that has the stats stream binary.
Updates on
QDPMA Stats Stream - Updated
decoding stats stream - Updated
decoding stats stream
ps
An interesting fact is that it is not necessary for statistics to be highly accurate to be effective. Normally we are interested in distribution differences to shift the execution plan from one to another. The boundaries for this can be very wide. False statistics in certain circumstances might guard against catastrophically bad execution plans, example in out of bounds situations. Another is in skewed distributions, but this should be handled by other means, to ensure high and low distributions get different execution plans.