
Merrill Aldrich

Hole in your Backup Sequence?

I have been stung by this scenario a few times, using SQL Server's typical sliding window to remove old backup files:

  1. The backup disk is nearly full
  2. Full backups are typically the largest, so they are the first to fail when disk space runs short.
  3. Transaction log backups are smaller, so they continue working, fitting into the smaller available space.
  4. Each transaction log backup execution deletes files older than some specific sliding window - and before long, files are missing between the last successful full backup and the oldest remaining log backup file. Oops!

If you want to have a point-in-time set of backups in near-line storage, this will not do, as you have an old Full backup, then a gap, then some transaction log backups that you can't restore.

A CLR-based Solution

In the past, I've created various workarounds for this, but they usually involved some process external to SQL Server, and an extra script or executable, and perhaps even xp_cmdshell from SQL Agent. Kluges at best. Enter SQL Server 2005 and CLR. CLR integration, for all the controversy surrounding it, seems ideal to me for this type of thing - I need a utilitarian process that can interact with the OS, but that is safer than the shell and can be code-signed and so on. The .NET framework is a perfect fit.

Before I delve into the details, I want to note some additional requirements: I've always worked in shops with a third-party backup compression tool such as Litespeed or Red Gate SQL Backup, so it's important that the solution work no matter how the backups get onto the disk. Second, I need to be able to disable (perhaps "leave disabled" is more accurate) xp_cmdshell in the configuration of SQL Server for security purposes. Finally, the solution should work from "inside" familiar SQL Server tools, like SQL Agent, so other DBAs don't face some cryptic solution that isn't simple and easy to find and understand.

What I am after is a method that can more selectively delete the old backup files from disk, but take into account the dependencies between them, so that if there are full backups and differentials and log backups, I will always preserve a working sequence of files. I settled on a CLR stored procedure that can examine the backup files in a directory, and sort out this dependency before removing old files.

Method

A simple outline for this procedure looks something like this:

  • Get the name of a database, and a folder containing its backup files.
  • Get a number indicating how many complete backup sequences to retain on disk.
  • From the backup folder, get a list of the existing full backup files. Identify the oldest one that we want to keep.
  • Remove any full backups that are older than that file.
  • Remove any dependent files - differentials or log backups - that are older than that file.

In C#, with classes like FileInfo from the .NET Framework, this is quite simple to put together. I elected to use the file names to identify which files depend on which others, mainly because of the first requirement above that the procedure work on any type of native or third-party backup files. This does introduce requirements for naming conventions that might be avoidable if we could reliably read the data in the files themselves (i.e. if these were all SQL Server native files), but the conventions can be simple, and I customarily name files this way in any case. The names have to follow this pattern:

<database name><anything you like>_<chronological datetime>.<extension>

For example:

mydatabase_FullBackup_20090101.bak

or

mydatabase_2009-01-21.bak

or

myDatabase_arbitraryDescription_20090101023015.foo

The only important features are that the file names begin with the database name, end with an underscore (_) followed by a string representing a datetime that sorts in dictionary order, and that the extensions are unique for backup, log backup, and differential backup. Each set of files, per database, should be in its own folder to prevent the possibility of file names colliding.
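
To make the convention concrete, here is a minimal sketch (my own illustration, not part of the stored procedure itself - the helper name GetDateToken is hypothetical) of how the sortable datetime token can be pulled out of a file name that follows the pattern:

using System;
using System.IO;

public static class BackupNameParser
{
    // Hypothetical helper, for illustration only: extract the sortable
    // datetime token from a name shaped like
    // <database name><anything>_<datetime>.<extension>.
    // Returns null if the name doesn't fit the convention.
    public static string GetDateToken(string fileName, string databaseName)
    {
        string baseName = Path.GetFileNameWithoutExtension(fileName);

        if (!baseName.StartsWith(databaseName, StringComparison.OrdinalIgnoreCase))
            return null;

        int lastUnderscore = baseName.LastIndexOf('_');
        if (lastUnderscore < 0 || lastUnderscore == baseName.Length - 1)
            return null;

        return baseName.Substring(lastUnderscore + 1);
    }
}

For example, GetDateToken("myDatabase_arbitraryDescription_20090101023015.foo", "myDatabase") returns "20090101023015", and because those tokens sort in dictionary order, comparing them as strings compares the files chronologically.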

Code

The solution is a fairly short C# function. I began by firing up Visual Studio 2005 and creating a solution with the C# SQL Server Project type. Within the new project I added a new Stored Procedure item, and that provided a template for a new CLR stored procedure, with the appropriate references and basic structure.

Within the Stored Procedure template, I created code following the basic outline above. First I set up arguments for the function to receive when the user calls the stored procedure, including the database name, folder to search, how many backups to keep, what the file naming convention is, and so on. The trick to the method signature is that the arguments have to be of .NET types that match the SQL Server data types a Stored Proc will provide. B.O.L. has a very useful chart of which SQL Server types (varchar(), for example) map to which C# types (SqlString). Beyond that, the statement to create a C# function and the T-SQL statement to create a procedure are very similar:

[Microsoft.SqlServer.Server.SqlProcedure]
public static int DeleteAgedBackupFiles(SqlString databaseName,
    SqlString  directory,
    SqlInt32   fullBackupsToRetain,
    SqlString  fullBackupExt,
    SqlString  trnBackupExt,
    SqlString  diffBackupExt,
    SqlBoolean archivedFilesOnly)
{
    // Body assembled piece by piece below; the complete listing is attached.
}
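
For comparison, once the assembly is deployed (covered below), the T-SQL side of that signature might look roughly like this. The assembly name BackupMaintenance and the class name StoredProcedures are placeholders (StoredProcedures is the default class in the Visual Studio project template), not necessarily the names in the attached listing:

-- Hypothetical names; SqlString maps to nvarchar, SqlInt32 to int,
-- and SqlBoolean to bit:
CREATE PROCEDURE dbo.DeleteAgedBackupFiles
    @databaseName        nvarchar(128),
    @directory           nvarchar(260),
    @fullBackupsToRetain int,
    @fullBackupExt       nvarchar(10),
    @trnBackupExt        nvarchar(10),
    @diffBackupExt       nvarchar(10),
    @archivedFilesOnly   bit
AS EXTERNAL NAME BackupMaintenance.StoredProcedures.DeleteAgedBackupFiles;
GO

-- Example call, perhaps from a SQL Agent job step:
EXEC dbo.DeleteAgedBackupFiles
    @databaseName        = N'mydatabase',
    @directory           = N'E:\Backups\mydatabase',
    @fullBackupsToRetain = 2,
    @fullBackupExt       = N'bak',
    @trnBackupExt        = N'trn',
    @diffBackupExt       = N'dif',
    @archivedFilesOnly   = 0;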

Next, DirectoryInfo and FileInfo provide ready-to-use functionality to interrogate files and folders, so I leveraged those to get lists of the existing files for the specified database in the specified folder, and put the file lists into standard .NET Arrays:

DirectoryInfo folder = new DirectoryInfo(directory.Value);

FileInfo[] bakFiles;
FileInfo[] diffFiles;
FileInfo[] trnFiles;

String bakPattern = databaseName.Value + "*." + fullBackupExt.Value;
String trnPattern = databaseName.Value + "*." + trnBackupExt.Value;
String diffPattern = databaseName.Value + "*." + diffBackupExt.Value;

try
{
    bakFiles = folder.GetFiles(bakPattern);
    trnFiles = folder.GetFiles(trnPattern);
    diffFiles = folder.GetFiles(diffPattern);
}
catch (DirectoryNotFoundException)
{
    // Minimal handler so this excerpt compiles: surface a clear error
    // if the backup folder doesn't exist.
    throw new ArgumentException("Backup directory not found: " + directory.Value);
}

Having lists of the files in arrays makes it pretty simple to run through them and perform the needed work, with for() or foreach() loops. But one tricky bit first: if the arrays are ordered, it saves a lot of extra effort comparing file names. As with most basic tasks in .NET, there's practically always a provided object or method, and so, unsurprisingly, Array has a built-in Sort. No need to reinvent that wheel. The only piece that is perhaps less than intuitive is that the sort function has to be told how to order the objects in the array, using a "helper" in the form of a Comparer. A comparer is just a way for the sort function to compare two objects and determine which comes first and which comes second in the order you intend. Making a comparer takes just a few lines of code. In this case we want the files in standard dictionary order, which means that if the file names are sorted with a standard string sort, we have what we need. A simple version of that comparer is:

// Note: this implements the non-generic IComparer from System.Collections.
public class CompareFileNames : IComparer
{
    public int Compare(object x, object y)
    {
        return ((FileInfo)x).Name.CompareTo(((FileInfo)y).Name);
    }
}

In plain English, this function says, given two FileInfo objects, fetch out the file names and compare them as strings. The order of the strings is the order we want. Having this class included in the code allows us to sort an array of file objects with one line:

Array.Sort(bakFiles, new CompareFileNames());

Once the Arrays are ordered, I can do something like this to remove the oldest files:

// Delete every full backup file older than the file that satisfies the
// FullBackupsToRetain argument, taking into account the Archive attribute:
if (bakFiles.Length > 0)
{
    for (int i = 0; i <= bakFiles.GetUpperBound(0) - fullBackupsToRetain.Value; i++)
    {
        // FileAttributes is a flags enum, so test the Archive bit with a
        // mask - a direct != comparison misfires when other attributes
        // (Compressed, for example) are also set:
        if (!archivedFilesOnly.Value ||
            (bakFiles[i].Attributes & FileAttributes.Archive) == 0)
        {
             bakFiles[i].Delete();
        }
    }
}

Similar loops can process the arrays of differential and log backup files, selectively deleting the appropriate files. That is the basic idea driving the whole stored procedure. I won't go into every detail here, but instead offer the entire function in the code listing at the end of this post.
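
To show just the dependency idea, though, here is a sketch of what the log-backup pass could look like - again my own illustration, reusing the hypothetical GetDateToken helper from the naming-convention section above; the attachment has the real code. The point is to compare the datetime tokens rather than whole file names, since the descriptive middle portion and the extension differ between file types:

// Sketch only: delete log backups that predate the oldest retained full.
if (fullBackupsToRetain.Value > 0 &&
    bakFiles.Length >= fullBackupsToRetain.Value)
{
    // After the ascending sort, this is the oldest full backup we keep:
    FileInfo oldestKeptFull = bakFiles[bakFiles.Length - fullBackupsToRetain.Value];
    string boundary = BackupNameParser.GetDateToken(oldestKeptFull.Name,
                                                    databaseName.Value);

    foreach (FileInfo trn in trnFiles)
    {
        string token = BackupNameParser.GetDateToken(trn.Name, databaseName.Value);

        // A log backup dated before the oldest retained full backup can't
        // participate in any restore sequence still on disk, so it can go.
        if (token != null && String.CompareOrdinal(token, boundary) < 0)
        {
            trn.Delete();
        }
    }
}

The differential array would get the same treatment, and the archivedFilesOnly test from the full-backup loop could be applied here as well.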

Once the procedure is written, the next step is to deploy it. This is where things perhaps get less familiar for the typical DBA, but I found this fairly simple once I did some reading. If .NET CLR code called from SQL Server is going to operate outside the SQL Server itself - for example, deleting files from the file system - it's important that the code be secure. For that reason, the SQL Server team created some special security categories for CLR procedures, and rules around those. In simplified terms, you have to do these things:

  1. Sign your code. This prevents someone from tampering with the code after you've compiled and deployed it. You then have to "tell" SQL Server about the signature, by creating a key from the signed assembly and a login that carries that key.
  2. Tell SQL Server, when you install the procedure, that the code should be in the correct security category to be able to access the file system. This is a category called "External Access."
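
In T-SQL, a minimal sketch of those two steps might look like the following; the key, login, and assembly names and the file path are placeholders, not the post's actual project:

USE master;
GO

-- 1. Create a key from the signed assembly, and a login that carries it:
CREATE ASYMMETRIC KEY BackupMaintenanceKey
    FROM EXECUTABLE FILE = 'C:\Deploy\BackupMaintenance.dll';

CREATE LOGIN BackupMaintenanceLogin
    FROM ASYMMETRIC KEY BackupMaintenanceKey;

GRANT EXTERNAL ACCESS ASSEMBLY TO BackupMaintenanceLogin;

-- 2. Load the assembly into a database with the elevated permission set:
USE MyUtilityDatabase;
GO

CREATE ASSEMBLY BackupMaintenance
    FROM 'C:\Deploy\BackupMaintenance.dll'
    WITH PERMISSION_SET = EXTERNAL_ACCESS;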

The first time through I found this a little bumpy, but it's not too bad once you've done it once or twice, and each tool (Visual Studio and T-SQL) has methods that simplify the process. I won't repeat all the detailed instructions here, because others have, and it's in B.O.L. Here is a simple reference I found helpful, by Jonathan Kehayias:

http://www.sqlclr.net/Articles/tabid/54/articleType/ArticleView/articleId/9/Default.aspx

I agree with his advice that signing the code is better and more secure than the "Trustworthy" option within SQL Server. Plus, it's really not any more work.

That's all there is to it. Extending SQL Server to do essentially anything that .NET libraries can do is quite straightforward, and might be a safer alternative to those xp_cmdshell-based jobs lurking around on your servers.

[Shout out to my colleague Toby for code review and C# tips]

Published Tuesday, July 21, 2009 9:08 AM by merrillaldrich

Attachment(s): BackupDeleteCodeSample.txt


Comments

 

ismail guzey said:

What is the login for?

Couldn't it be the job owner, or couldn't "with execute as" be used instead?

June 17, 2011 4:37 PM
 

merrillaldrich said:

The login enables code-signing the .NET assembly. It's basically so that SQL Server can validate the assembly with a key before executing it. If there were no key like the one managed by that login, some malicious code or person could potentially alter the assembly, sneak the new version in and the server might execute it.

June 17, 2011 7:57 PM
 

