To use compression you have to understand what your data looks like, if your data is mostly unique then compression might not really help a lot. If your data is repeated on a page then it could be beneficial. In my own test compression would improve performance with some tables and not so much with other tables. I won't explain how compression works, this is all explained in books on line. the goal of this post is for you to have some code that you can run and then see how page and row level compression differs compared to the original data.
Here is some code that you can run to see how compression works. the data I have used is similar to what we use at my job(but not the same, i used fake tickers so don't try to invest in those ;-0). I have used just the 10 industry groups from the Industry Classification Benchmark, I left out supersectors,sectors and subsectors otherwise we would have a lot more rows per table. I also did not create fake ISIN Code,SEDOL and CUSIP columns since that would have bloated the code in this post. The code will generate 2780650 rows per table, this is a nice size to test with.
Compression is really nice when you get charged $30 per GB for the SAN, so besides performance there is also a cost benefit in storage
Here is the code
This block of code will create the database, setup the lookup tables and create the tables for our tests
use
master
go
create
database CompressionTest2
go
use
CompressionTest2
go
--our Industry Classification Benchmark table
Create
table IndustryGroup(GroupCode char(4) not null primary key,
GroupDescription
varchar(40))
GO
--just industry groups
insert
IndustryGroup values('0001','Oil & Gas' )
insert
IndustryGroup values('1000','Basic Materials' )
insert
IndustryGroup values('2000','Industrials')
insert
IndustryGroup values('3000','Consumer Goods')
insert
IndustryGroup values('4000','Healthcare')
insert
IndustryGroup values('5000','Consumer Services')
insert
IndustryGroup values('6000','Telecommunications')
insert
IndustryGroup values('7000','Utilities')
insert
IndustryGroup values('8000','Financials')
insert
IndustryGroup values('9000','Technology')
GO
--currency table
Create
table Currency (CurrencyCode char(3) not null primary key, CurrencyDescription varchar(30))
GO
--just handful of currencies
insert
currency values('USD','U.S. Dollar')
insert
currency values('AUD','Australian Dollar')
insert
currency values('CAD','Canadian Dollar')
insert
currency values('JPY','Japanese Yen')
insert
currency values('MXN','Mexican Peso')
insert
currency values('GBP','U.K. Sterling')
insert
currency values('EUR','European Euro')
insert
currency values('ISK','Iceland Krona')
insert
currency values('BGN','Bulgarian Lev')
insert
currency values('RON','Romanian Leu')
insert
currency values('INR','Indian Rupee')
insert
currency values('RUB','Russia Rubles')
insert
currency values('BHD','Bahrain Dinar')
insert
currency values('EGP','Egypt Pounds')
insert
currency values('JOD','Jodan Dinars')
insert
currency values('KWD','Kuwait Dinars')
insert
currency values('MAD','Morocco Dirham')
insert
currency values('OMR','Omam Rial')
insert
currency values('QAR','Qatari Rial')
GO
--market cap table
create
table MarketCap (MarketCap varchar(20) not null primary key,MarketCapDescription varchar(100) )
GO
--left out a bunch of market caps
insert
MarketCap values('ALL','broad')
insert
MarketCap values('MID','mid cap')
insert
MarketCap values('MCR','micro cap')
insert
MarketCap values('SML','small cap')
insert
MarketCap values('LRG','large cap')
--calendar table
create
table Calendar (CalendarDate date not null primary key)
GO
insert
Calendar
select
dateadd(d,number,'19920101') from master..spt_values
where
type = 'p'
and
datepart(dw,dateadd(d,number,'20080101')) not in (1,7)
union
select
dateadd(d,number,'19970810') from master..spt_values
where
type = 'p'
and
datepart(dw,dateadd(d,number,'19970810')) not in (1,7)
--the table that we will test against
create
table IndexCloseValues (CalendarDate date not null,
IndexSymbol
varchar(30) not null,
GroupCode
char(4) not null,
CurrencyCode
char(3) not null,
MarketCap
varchar(20) not null,
CloseValue
decimal(30,10),
TotalReturnClose
decimal(30,10))
--2780650 rows
insert
IndexCloseValues
select
CalendarDate,
CurrencyCode
+ left(GroupCode,1) + MarketCap as Symbol, --fake tickers
GroupCode
,CurrencyCode,MarketCap,
ROW_NUMBER() OVER( ORDER BY CalendarDate,GroupCode,CurrencyCode,MarketCap ) *0.00100,
ROW_NUMBER() OVER( ORDER BY CalendarDate,GroupCode,CurrencyCode,MarketCap ) *0.247001
from IndustryGroup i cross join currency c
cross
join MarketCap
cross
join calendar
GO
--create a copy for page compression
select
* into IndexCloseValuesPage
from
IndexCloseValues
Go
--create a copy for row compression
select
* into IndexCloseValuesRow
from
IndexCloseValues
GO
--add unique constraint
ALTER
TABLE IndexCloseValues WITH NOCHECK
ADD
CONSTRAINT UQ_IndexCloseValues UNIQUE Clustered (CalendarDate,IndexSymbol)
GO
--add unique constraint
ALTER
TABLE IndexCloseValuesPage WITH NOCHECK
ADD
CONSTRAINT UQ_IndexCloseValuesPage UNIQUE Clustered (CalendarDate,IndexSymbol)
GO
--page compression
ALTER
TABLE IndexCloseValuesPage
REBUILD
WITH (DATA_COMPRESSION = PAGE);
GO
--add unique constraint
ALTER
TABLE IndexCloseValuesRow WITH NOCHECK
ADD
CONSTRAINT UQ_IndexCloseValuesRow UNIQUE Clustered (CalendarDate,IndexSymbol)
GO
--row compression
ALTER
TABLE IndexCloseValuesRow
REBUILD
WITH (DATA_COMPRESSION = ROW);
GO
Now that everything is setup we can look how big the tables are. Run the code below (old school I know)
dbcc
showcontig('IndexCloseValues')
dbcc
showcontig('IndexCloseValuesPage')
dbcc
showcontig('IndexCloseValuesRow')
Table: 'IndexCloseValues' (213575799); index ID: 1, database ID: 16
TABLE level scan performed.
- Pages Scanned................................: 23767
- Extents Scanned..............................: 2972
- Extent Switches..............................: 2971
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.97% [2971:2972]
- Logical Scan Fragmentation ..................: 0.01%
- Extent Scan Fragmentation ...................: 0.17%
- Avg. Bytes Free per Page.....................: 23.3
- Avg. Page Density (full).....................: 99.71%
Table: 'IndexCloseValuesPage' (245575913); index ID: 1, database ID: 16
TABLE level scan performed.
- Pages Scanned................................: 9307
- Extents Scanned..............................: 1165
- Extent Switches..............................: 1164
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.91% [1164:1165]
- Logical Scan Fragmentation ..................: 0.04%
- Extent Scan Fragmentation ...................: 0.77%
- Avg. Bytes Free per Page.....................: 10.7
- Avg. Page Density (full).....................: 99.87%
Table: 'IndexCloseValuesRow' (261575970); index ID: 1, database ID: 16
TABLE level scan performed.
- Pages Scanned................................: 13255
- Extents Scanned..............................: 1657
- Extent Switches..............................: 1656
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 100.00% [1657:1657]
- Logical Scan Fragmentation ..................: 0.01%
- Extent Scan Fragmentation ...................: 0.36%
- Avg. Bytes Free per Page.....................: 11.4
- Avg. Page Density (full).....................: 99.86%
As you can see compression really reduced the pages for the table, page level compression looks really good here.
First let's look at the execution plans, running the following block of code....
select * from IndexCloseValues
where
CalendarDate = '19920101'
select
* from IndexCloseValuesPage
where
CalendarDate = '19920101'
select
* from IndexCloseValuesRow
where
CalendarDate = '19920101'
...will result in this plan
Increasing the query to return a month of data instead of a day
select
* from IndexCloseValues
where
CalendarDate between '19920101' and '19920121'
select
* from IndexCloseValuesPage
where
CalendarDate between '19920101' and '19920121'
select
* from IndexCloseValuesRow
where
CalendarDate between '19920101' and '19920121'
will result in this plan.
You can draw your own conclusions from those images
Let's look at some reads, first turn statistics io on
SET STATISTICS IO ON
go
select * from IndexCloseValues
where
CalendarDate = '19920101'
select
* from IndexCloseValuesPage
where
CalendarDate = '19920101'
select
* from IndexCloseValuesRow
where
CalendarDate = '19920101'
950 row(s) affected)
Table 'IndexCloseValues'. Scan count 1, logical reads 12, physical reads 0......
(950 row(s) affected)
Table 'IndexCloseValuesPage'. Scan count 1, logical reads 7, physical reads 0......
(950 row(s) affected)
Table 'IndexCloseValuesRow'. Scan count 1, logical reads 8, physical reads 0......
Those are some nice numbers for the reads, now we will increase the date range to one month
select * from IndexCloseValues
where
CalendarDate between '19920101' and '19920121'
select
* from IndexCloseValuesPage
where
CalendarDate between '19920101' and '19920121'
select
* from IndexCloseValuesRow
where
CalendarDate between '19920101' and '19920121'
(14250 row(s) affected)
Table 'IndexCloseValues'. Scan count 1, logical reads 125, physical reads 0......
(14250 row(s) affected)
Table 'IndexCloseValuesPage'. Scan count 1, logical reads 52, physical reads 0......
(14250 row(s) affected)
Table 'IndexCloseValuesRow'. Scan count 1, logical reads 69, physical reads 0......
When selecting more data the numbers look even better.
Turn statistics io off again
SET
STATISTICS IO OFF
go
So as you can see compression reduces the reads by over half when using page compression.