Today I am posting a wee 200-level trick, taken from some work I am doing with archiving. Here’s the scenario: I have seen a few applications that have performance problems where “active” records in the database are comingled with “inactive” records. For the purposes of this post, imagine a company has a database with customer data, where the customer records are all stored in one large table. Suppose that only a small percentage of the records relate to active customers, and the rest are inactive customers – but the company wants to retain those inactive records. In daily use, it seems reasonable that the active customer records would be accessed frequently, and would want to be in cache, while the inactive records could, in theory, sit out there on disk and only be accessed rarely.
This presents two problems: first, if the active and inactive rows are mixed up together in the same table, and therefore in the same clustered and non-clustered indexes, then in the buffer cache at any given time, you will have some mix of the two. That wastes some memory, and some resources used to access the data, because with each operation the server has to pick out the active customer records from the inactive ones. Obviously the engine is geared to do that, but there is probably an advantage to horizontally partitioning this data, physically separating the inactive customers out, and concentrating the smaller set of active customers in a nice, compact table, more of which could sit in memory.
Second, in many implementations I have seen (and this came up yesterday in my own work) there’s just an Active/Inactive “flag” in the large table, marking each customer record. While not incorrect, there are some performance implications of that design – mainly that a two-value column like this, when incorporated into a table of perhaps millions of rows, will not have cardinality that is useful to index. Generally if you issue a query “where active = 1,” unless there has been some very clever index design, you are likely to be scanning the table, picking out the active “needles” from the inactive “haystack.” There are exceptions, but this is a very common pattern.
So, please take my rather contrived example, and let’s look at the trick I’m really looking to demonstrate here: moving rows from one table to another, transactionally. Assuming horizontal partitioning, it’s certain that we need a procedure to “move” rows from one table to the other. But in T-SQL there isn’t really a “move rows” statement.
Here’s the setup: in a test database, make two identical tables:
CREATE TABLE dbo.CustomersActive (
CustomerID int NOT NULL
PRIMARY KEY CLUSTERED,
FirstName varchar(50) NOT NULL,
LastName varchar(50) NOT NULL
)
GO
CREATE TABLE dbo.CustomersInactive (
CustomerID int NOT NULL
PRIMARY KEY CLUSTERED,
FirstName varchar(50) NOT NULL,
LastName varchar(50) NOT NULL
)
GO
Let’s imagine those are, in fact, large, complex and realistic tables. (Work with me here :-)
Next we populate with some sample data:
INSERT dbo.CustomersActive
( CustomerID, FirstName, LastName )
VALUES ( 1000, 'Road', 'Runner' )
INSERT dbo.CustomersActive
( CustomerID, FirstName, LastName )
VALUES ( 1001, 'Quincy', 'MaGoo' )
INSERT dbo.CustomersActive
( CustomerID, FirstName, LastName )
VALUES ( 1002, 'Bugs', 'Bunny' )
INSERT dbo.CustomersActive
( CustomerID, FirstName, LastName )
VALUES ( 1003, 'Daffy', 'Duck' )
INSERT dbo.CustomersActive
( CustomerID, FirstName, LastName )
VALUES ( 1004, 'Wile', 'Coyote' )
INSERT dbo.CustomersActive
( CustomerID, FirstName, LastName )
VALUES ( 1005, 'Elmer', 'Fudd' )
INSERT dbo.CustomersActive
( CustomerID, FirstName, LastName )
VALUES ( 1006, 'Captain', 'America' )
GO
Problem: we need a clear, transactional way to move rows from the active to the inactive customers table. In older versions of SQL Server, this would have required an explicit transaction, and insert and a delete, plus of course error handling to ensure a customer could not be lost or be duplicated in the middle of the transaction. Today we have TRY/CATCH, which helps on the error handling front, so this two-step process could be written something like this:
BEGIN TRANSACTION
BEGIN TRY
INSERT dbo.CustomersInactive (
CustomerID,
FirstName,
LastName
) SELECT
CustomerID,
FirstName,
LastName
FROM dbo.CustomersActive
WHERE CustomerID IN ( 1002, 1004, 1006 )
DELETE dbo.CustomersActive
WHERE CustomerID IN ( 1002, 1004, 1006 )
COMMIT
END TRY
BEGIN CATCH
ROLLBACK
END CATCH
SELECT * FROM dbo.CustomersActive
SELECT * FROM dbo.CustomersInactive
This is essentially correct (except, of course, the flow control and error handling would be different and more robust in a real system). But there are some liabilities here:
- We must wrap this in an explicit transaction regardless of the context, just to make the two statements commit or roll back together.
- We have to state the criteria for the selection of rows to move twice, which is no big deal if done correctly, but invites trouble later. Imagine a developer editing the first “where” clause but forgetting to implement an identical change to the second. Oops.
- The server has to do the work to locate the affected rows twice – once for the select and again for the delete. This is no big deal for our simple case, but in many real archiving scenarios, it can be expensive to locate the target rows, and it’s a problem to ask the server to do that work twice.
Here’s a perfect use case for the OUTPUT clause! If you are unfamiliar with OUTPUT, what it does is essentially cause a SQL statement to emit a set of rows - which can be the data, or a subset of the data, that the statement affected. For example, if we run:
DELETE dbo.CustomersActive
WHERE CustomerID IN ( 1002, 1004, 1006 )
there isn’t any output from the delete, other than perhaps a count of affected rows. The statement will not capture the deleted rows, it will just act on the table and the rows will be discarded.
On the other hand, if we add the OUTPUT clause:
DELETE dbo.CustomersActive
OUTPUT
DELETED.CustomerID,
DELETED.FirstName,
DELETED.LastName
WHERE CustomerID IN ( 1001, 1003, 1005 )
then the content of the DELETED virtual table is actually emitted when the statement is run, like a select statement, returning the affected rows as a result set:
| CustomerID |
FirstName |
LastName |
| 1001 |
Quincy |
MaGoo |
| 1003 |
Daffy |
Duck |
| 1005 |
Elmer |
Fudd |
The real magic with OUTPUT is, not only does it emit rows this way, it also makes SQL statements even more composable than they had been. That is, one SQL statement can be used to generate a derived table, which in turn can act as the input data for another SQL statement, composing the two statements together into one action. That means it’s possible to nest a DELETE, for example, inside of an INSERT statement.
“Sounds complicated,” is a reaction I see a lot around this.
But don’t let the language describing this fool you – it’s really simple, and quite handy. Think again about the need to move rows from one table to another. What we really want to do is perform a delete, but grab (i.e. select) those rows that were deleted and then put them into another table. The first bit of that process is to feed the DELETE-ed rows to a SELECT statement. Once we “have” that set, we hand it to an INSERT statement, with INSERT … SELECT.
INSERT dbo.CustomersInactive (
CustomerID,
FirstName,
LastName
) SELECT
CustomerID,
FirstName,
Lastname
FROM (
DELETE dbo.CustomersActive
OUTPUT
DELETED.CustomerID,
DELETED.FirstName,
DELETED.LastName
WHERE CustomerID IN ( 1001, 1003, 1005 )
) AS RowsToMove
SELECT * FROM dbo.CustomersActive
SELECT * FROM dbo.CustomersInactive
The fact that we can take output from DELETE and feed it to INSERT actually models what we are trying to do perfectly. And, we get some advantages:
- This is now a single, atomic statement on its own.
- The logic about which rows to move is specified only once, which is neater.
- The logic about which rows to move is only processed one time by the SQL Server engine.
I hope this simple example opens the door for you to begin using OUTPUT and composable DML.
Edit: Be sure to see Ben Thul’s comment below for an even better/more concise way to write this >