This is part II of two Scandalous posts. Watch, mouth agape, as I run with scissors, right up against prevailing wisdom! Unfollow me now, before it’s too late!
Here’s the thing. There are two really outstanding posts out there on the ‘tubez that explain in vivid detail the problems with sending compressed data into a de-duplicating appliance. And these guys are both absolutely right. Everything in their posts is correct, and I would ask that, if you haven’t, you please read them before mine:
First, Brent Ozar:
(And, may I say, well done on the Numero Uno Google result for that post. Very nice!)
Next Denny Cherry:
(A very respectable #3 on the Google-ometer.)
Now, I’m not kidding. These guys know their stuff, and they are right. Stop reading right now.
Still here? Ok, now come closer.
I studied this whole thing very carefully, and I do it anyway.
It’s true that de-duplication works poorly with compressed data: if you compare the de-dupe ratios for “usual” uncompressed files with the de-dupe ratios for compressed files, the compressed data looks very, very bad. But there’s more to this story, so much more that we decided to, in a limited way, stuff the compressed files into our DDR anyway.
Both SQL Server backups and file compression are deterministic processes. If you back up the same database twice, and it has the same data pages in it, and those pages are largely unchanged, then the backup files will be substantially the same. The same holds if you compress both files with the same algorithm and settings: the data in the compressed files will be largely identical. Neither file will be like any OTHER file on your network, but the two files will be similar to one another.
If you change a small percentage of the data pages in the data file, that will still be true: a compressed backup of the database on, say, Monday will be mostly the same as a compressed backup of the same database, with modest changes, on Tuesday.
What that means is that if I have a 1 TB database, which I do, that produces a 250 GB compressed backup file, and that database receives mainly incremental changes from day to day or week to week, then each successive backup will be similar to the previous one. And if I copy them into a de-duplicating store (at least the one I have to work with), then, while the first file will be basically 100% net new data, the second will de-dupe against the first. It’s not as effective as with other types of files, but it does help. Even if, for argument’s sake, I get only 75% de-duplication across those two files, instead of the normal 85%+ across many instances of other files, I am still getting 75% de-duplication, and that can be very useful.
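You can sanity-check the idea with a toy model: compress each 8 KB page independently and deterministically, change a small fraction of the pages, and count how many compressed chunks come out byte-identical, which is exactly what a chunk-level de-duplicator exploits. To be clear, this is a sketch only; SQL Server’s real backup compression and the appliance’s chunking are both more sophisticated than per-page zlib:

```python
import hashlib
import zlib

PAGE = 8 * 1024  # SQL Server stores data in 8 KB pages


def compressed_pages(pages):
    # Deterministic, per-page compression: an unchanged page yields
    # byte-identical compressed output on every run.
    return [zlib.compress(p, 6) for p in pages]


# "Monday": 1,000 pages of synthetic, compressible data.
monday = [bytes([i % 251]) * PAGE for i in range(1000)]

# "Tuesday": the same database with ~2% of pages modified.
tuesday = list(monday)
for i in range(0, 1000, 50):  # touch every 50th page (20 pages total)
    tuesday[i] = b"\xff" * PAGE

mon, tue = compressed_pages(monday), compressed_pages(tuesday)
unchanged = sum(
    hashlib.sha256(a).digest() == hashlib.sha256(b).digest()
    for a, b in zip(mon, tue)
)
print(f"{unchanged / len(mon):.0%} of compressed chunks identical")  # 98%
```

The point is the determinism: because unchanged pages compress to identical bytes, Tuesday’s file mostly “cancels” against Monday’s in the store, even though both files are compressed.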
Useful how? Well, we have SAN replication married to our de-duplicating store for offsite backup and disaster recovery. That means that each night I have to transmit a LOT of SQL backup data across a WAN to another site. What’s a lot? For me, that just means the pipe is small and the data is much bigger. And that process would go a lot faster if, somehow, by magic, a whole lot of the data were already at the other end of the pipe before I start.
See where I’m going with this? With de-duplicated files, as days and weeks pass, each time we replicate new files from one site to the other, a whole lot of the data is already there at the other site. We only have to transmit the net new data. Even if that’s only 50% (a very poor performance number for de-duplicated storage in most people’s minds) that’s still cutting the data in half. Which is pretty good. Plus it’s compressed, which helps every other aspect of the backup story.
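To put rough numbers on it: the 250 GB backup size is from my environment, but the de-dupe rate and link speed below are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope WAN math. The 250 GB backup size is real;
# the de-dupe rate and link speed are assumptions for illustration.
backup_gb = 250      # compressed full backup of the ~1 TB database
dedupe_rate = 0.75   # fraction of the new file already at the far end
wan_mbps = 100       # hypothetical WAN link speed

net_new_gb = backup_gb * (1 - dedupe_rate)
hours = (net_new_gb * 8 * 1000) / wan_mbps / 3600  # GB -> Gb -> Mb -> hours
print(f"ship {net_new_gb:.1f} GB net-new, roughly {hours:.1f} h at {wan_mbps} Mb/s")
```

Without de-duplication, the same nightly copy would have to move the whole 250 GB, four times as long on the same pipe.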
So we have what I think is a good compromise, borne out by internal testing:
- Keeping compressed SQL Backups in de-duplicated storage indefinitely, as a replacement for tapes, is impractical. It’s just too expensive. So we keep the SQL Backups in there only for the purpose of DR, and we have a pretty aggressive purge schedule to be rid of old files. The sweet spot seems to be to keep only a week or two.
- We use tapes too, for archival purposes, and they have longer retention.
- We back up to local (DAS or SAN) disk first at the SQL Server, and then copy into the de-duplicating store, so that the backup process performs well and isn’t bottlenecked by the network or by the speed at which the appliance can ingest files. So backups go to disk, then get copied into the de-dupe store, de-dupe against whatever is already in there, and then the appliance replicates them off site.
This is not a cheap setup, but it works great. I love it. That 250 GB file I mentioned is available at my other site in a couple of hours, because it’s always mostly there already. Your mileage may vary depending on all the specifics of the technology you have, and, as I said, Brent and Denny are right.
* Professional driver on a closed course; don’t try this at home; no animals were de-duped in the production of this post.