Some time ago, I blogged about how to really comprehensively re-index a data warehouse:
(Wow, 2009. Time flies.)
Today I am reviving that old post, because I had a chance this week to put that technique into practice against a +/- 1 TB warehouse running on SSDs. Noteworthy: the technique reclaimed 20 percent of the disk space on the SSD disk set. This is significant on a number of levels - but mostly it was just a whole lot more than I ever expected. The database was already being maintained with a conventional re-indexing scheme, so I did not imagine this process would have such a dramatic effect.
The orange line in this graph from our SCOM dashboard represents percent free space on the data drive array, before (left) and after (right):
Where did all that space come from? Two places. First, page density: a fully defragmented index can be rebuilt with the maximum possible density of rows per page, especially in a warehouse scenario where much of the data is read-only. That means very little empty space on each data page. Second, the free space inside each file that had been provided as headroom for re-indexing the partitions within that file (in a partitioned warehouse, there can be tens or hundreds of data files) is no longer needed once the older data stops receiving inserts or updates.
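If you want to see how much headroom is hiding in your pages before and after a rebuild, a minimal sketch like this can help. (Run it in the warehouse database; the 'SAMPLED' mode is needed to populate the page-density column.)

    -- Page density and fragmentation for all indexes in the current database.
    SELECT OBJECT_NAME(ips.object_id)          AS table_name,
           i.name                              AS index_name,
           ips.avg_page_space_used_in_percent, -- page density
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    ORDER BY ips.page_count DESC;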
With all the buzz about how SSDs eliminate most of the performance penalty caused by disk fragmentation, it’s tempting to dial back the defrag strategy, but there are a couple of reasons you might not want to do that:
Sparse Data Pages Still Waste RAM
In the buffer cache, the unit of memory is a whole data page. That means, if you are doing decision support, it pays to pack as many rows onto each data page as possible, because the net effect is that more rows fit in memory on your server. (By “sparse” here I mean “partly empty,” not the sparse columns feature of SQL Server.)
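To put a number on that, here is a rough sketch in the spirit of the query in Paul Randal’s post linked in the edit at the end, using the free_space_in_bytes column to total up the empty space sitting on cached pages:

    -- Rough estimate of how much buffer pool memory is "empty space" on
    -- cached data pages for the current database (each page is 8 KB).
    SELECT COUNT(*)                                                AS cached_pages,
           COUNT(*) * 8 / 1024                                     AS cached_mb,
           SUM(CAST(free_space_in_bytes AS bigint)) / 1024 / 1024  AS free_space_mb
    FROM sys.dm_os_buffer_descriptors
    WHERE database_id = DB_ID();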
SSDs are Still Expensive
In the past we had to buy spindles by the boatload just to reach the random IOPS numbers we needed, and the disks were often mostly empty. The main cost benefit of SSDs, apart from raw datacenter space and power, is that we can replace many, many spindles with a few SSD devices and fill them up. An interesting side effect: while I never worried much about filling the disks on the old system, I now have to start worrying that the SSDs could actually fill to capacity, and it would be costly to expand the total volume of available space by adding more SSDs.
SSDs can Slow as they Fill Up
Another reason, maybe more esoteric, is that the write performance of many SSD technologies deteriorates as the disk fills with data. That implies a bit of tension: you want to pre-allocate enough file space to be comfortable (or rely on auto-grow), but not over-allocate so much that the performance of the disks suffers. In a warehousing scenario with partitioned tables, this can be handled by automating the process of adding new files incrementally over time - weekly, monthly, or whatever your partition interval is - rather than creating all the files and filling the disks at the outset, as might be tempting on an array of spinning disks.
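To make that concrete, a hypothetical monthly maintenance step might look something like this. (The database, filegroup, file path, and partition object names are all placeholders; sizes are illustrative only.)

    -- Add a filegroup and file for next month's partition instead of
    -- pre-allocating everything up front.
    ALTER DATABASE MyWarehouse ADD FILEGROUP FG_2011_07;

    ALTER DATABASE MyWarehouse
    ADD FILE (
        NAME = N'MyWarehouse_2011_07',
        FILENAME = N'E:\Data\MyWarehouse_2011_07.ndf',
        SIZE = 10GB,
        FILEGROWTH = 1GB
    ) TO FILEGROUP FG_2011_07;

    -- Point the partition scheme at the new filegroup, then split off
    -- the new boundary so next month's rows land in the new file.
    ALTER PARTITION SCHEME PS_MonthlyScheme NEXT USED FG_2011_07;
    ALTER PARTITION FUNCTION PF_Monthly() SPLIT RANGE ('20110701');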
For all these reasons, it might be a good idea to continue to give all your indexes that same defragging TLC we have come to know and love.
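In practice that can be as simple as rebuilding at full fill factor when fragmentation warrants it; a sketch, with placeholder table and index names:

    -- REBUILD packs pages to the specified fill factor; 100 maximizes
    -- density, which suits read-mostly warehouse partitions.
    ALTER INDEX IX_FactSales_Date ON dbo.FactSales
    REBUILD WITH (FILLFACTOR = 100, SORT_IN_TEMPDB = ON);

    -- For lightly fragmented indexes, REORGANIZE compacts pages in place
    -- and is always an online operation.
    ALTER INDEX IX_FactSales_Date ON dbo.FactSales REORGANIZE;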
It’s always nice to finish out a Friday on a high note, especially before a holiday weekend. So if you are reading this in the US on Friday, enjoy the long weekend! Otherwise, happy re-indexing!
Edit 6/3/2011: Paul Randal has an excellent post on wasted (or potentially wasted) memory from low-density data pages: