
Merrill Aldrich

Handy Trick: Move Rows in One Statement

Today I am posting a wee 200-level trick, taken from some work I am doing with archiving. Here’s the scenario: I have seen a few applications that have performance problems where “active” records in the database are commingled with “inactive” records. For the purposes of this post, imagine a company has a database with customer data, where the customer records are all stored in one large table. Suppose that only a small percentage of the records relate to active customers, and the rest are inactive customers – but the company wants to retain those inactive records. In daily use, it seems reasonable that the active customer records would be accessed frequently, and you would want them in cache, while the inactive records could, in theory, sit out there on disk and be accessed only rarely.

This presents two problems: first, if the active and inactive rows are mixed up together in the same table, and therefore in the same clustered and non-clustered indexes, then in the buffer cache at any given time, you will have some mix of the two. That wastes some memory, and some resources used to access the data, because with each operation the server has to pick out the active customer records from the inactive ones. Obviously the engine is geared to do that, but there is probably an advantage to horizontally partitioning this data, physically separating the inactive customers out, and concentrating the smaller set of active customers in a nice, compact table, more of which could sit in memory.

Second, in many implementations I have seen (and this came up yesterday in my own work) there’s just an Active/Inactive “flag” in the large table, marking each customer record. While not incorrect, that design has some performance implications – mainly that a two-value column like this, when incorporated into a table of perhaps millions of rows, will not have cardinality that is useful to index. Generally, if you issue a query “where active = 1,” unless there has been some very clever index design, you are likely to be scanning the table, picking out the active “needles” from the inactive “haystack.” There are exceptions (see the sketch just below), but this is a very common pattern.
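
To make this concrete, here is a minimal, hypothetical sketch of that flag design – the table and column names are mine, not from any particular system. One of those exceptions is a filtered index (Stephen Morris makes that point in the comments below), though it narrows the scan problem rather than physically separating the data:

CREATE TABLE dbo.Customers (
     CustomerID int NOT NULL
                    PRIMARY KEY CLUSTERED,
     FirstName varchar(50) NOT NULL,
     LastName varchar(50) NOT NULL,
     IsActive bit NOT NULL    -- the two-value "flag" column
    )
GO

-- Against millions of rows, this predicate alone has such low
-- cardinality that an ordinary index on IsActive is rarely useful,
-- and the query tends to scan:
SELECT CustomerID, FirstName, LastName
FROM   dbo.Customers
WHERE  IsActive = 1
GO

-- One of the exceptions: a filtered index (SQL Server 2008+)
CREATE NONCLUSTERED INDEX IX_Customers_Active
ON dbo.Customers ( CustomerID )
INCLUDE ( FirstName, LastName )
WHERE IsActive = 1
GO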

So, please take my rather contrived example, and let’s look at the trick I’m really trying to demonstrate here: moving rows from one table to another, transactionally. Assuming we partition horizontally, we certainly need a procedure to “move” rows from one table to the other. But in T-SQL there isn’t really a “move rows” statement.

Here’s the setup: in a test database, make two identical tables:

CREATE TABLE dbo.CustomersActive (
     CustomerID int NOT NULL
                    PRIMARY KEY CLUSTERED,
     FirstName varchar(50) NOT NULL,
     LastName varchar(50) NOT NULL
    )
GO
 
CREATE TABLE dbo.CustomersInactive (
     CustomerID int NOT NULL
                    PRIMARY KEY CLUSTERED,
     FirstName varchar(50) NOT NULL,
     LastName varchar(50) NOT NULL
    )
GO

Let’s imagine those are, in fact, large, complex and realistic tables. (Work with me here :-)

Next we populate with some sample data:

INSERT  dbo.CustomersActive
        ( CustomerID, FirstName, LastName )
VALUES  ( 1000, 'Road', 'Runner' ),
        ( 1001, 'Quincy', 'MaGoo' ),
        ( 1002, 'Bugs', 'Bunny' ),
        ( 1003, 'Daffy', 'Duck' ),
        ( 1004, 'Wile', 'Coyote' ),
        ( 1005, 'Elmer', 'Fudd' ),
        ( 1006, 'Captain', 'America' )
GO

Problem: we need a clear, transactional way to move rows from the active customers table to the inactive one. In older versions of SQL Server, this would have required an explicit transaction, an insert and a delete, plus of course error handling to ensure a customer could not be lost or duplicated in the middle of the transaction. Today we have TRY/CATCH, which helps on the error handling front, so this two-step process could be written something like this:

BEGIN TRANSACTION
 
BEGIN TRY
 
    -- Step 1: copy the target rows into the inactive table
    INSERT  dbo.CustomersInactive
            ( CustomerID, FirstName, LastName )
    SELECT  CustomerID,
            FirstName,
            LastName
    FROM    dbo.CustomersActive
    WHERE   CustomerID IN ( 1002, 1004, 1006 )

    -- Step 2: delete the same rows from the active table
    DELETE  dbo.CustomersActive
    WHERE   CustomerID IN ( 1002, 1004, 1006 )
       
    COMMIT
   
END TRY
       
BEGIN CATCH
       
    ROLLBACK
       
END CATCH 
SELECT * FROM dbo.CustomersActive
SELECT * FROM dbo.CustomersInactive

This is essentially correct (except, of course, the flow control and error handling would be different and more robust in a real system). But there are some liabilities here:

  1. We must wrap this in an explicit transaction regardless of the context, just to make the two statements commit or roll back together.
  2. We have to state the criteria for the selection of rows to move twice, which is no big deal if done correctly, but invites trouble later. Imagine a developer editing the first “where” clause but forgetting to implement an identical change to the second. Oops.
  3. The server has to do the work to locate the affected rows twice – once for the select and again for the delete. This is no big deal for our simple case, but in many real archiving scenarios, it can be expensive to locate the target rows, and it’s a problem to ask the server to do that work twice.

Here’s a perfect use case for the OUTPUT clause! If you are unfamiliar with OUTPUT, what it does is essentially cause a SQL statement to emit a set of rows - which can be the data, or a subset of the data, that the statement affected. For example, if we run:

    DELETE  dbo.CustomersActive
    WHERE   CustomerID IN ( 1002, 1004, 1006 )

there isn’t any output from the delete, other than perhaps a count of affected rows. The statement will not capture the deleted rows; it will just act on the table, and the rows will be discarded.

On the other hand, if we add the OUTPUT clause:

DELETE  dbo.CustomersActive
OUTPUT
        DELETED.CustomerID,
        DELETED.FirstName,
        DELETED.LastName
WHERE   CustomerID IN ( 1001, 1003, 1005 )

then the content of the DELETED virtual table is actually emitted when the statement is run, like a select statement, returning the affected rows as a result set:

CustomerID  FirstName  LastName
----------  ---------  --------
1001        Quincy     MaGoo
1003        Daffy      Duck
1005        Elmer      Fudd

The real magic with OUTPUT is that it not only emits rows this way, it also makes SQL statements even more composable than they were before. That is, one SQL statement can be used to generate a derived table, which in turn can act as the input data for another SQL statement, composing the two statements into one action. That means it’s possible to nest a DELETE, for example, inside of an INSERT statement.

“Sounds complicated,” is a reaction I see a lot around this.

But don’t let the language describing this fool you – it’s really simple, and quite handy. Think again about the need to move rows from one table to another. What we really want to do is perform a delete, but grab (i.e. select) the rows that were deleted and put them into another table. The first bit of that process is to feed the deleted rows to a SELECT statement. Once we “have” that set, we hand it to an INSERT statement, with INSERT … SELECT.

INSERT  dbo.CustomersInactive (
        CustomerID,
        FirstName,
        LastName
) SELECT
        CustomerID,
        FirstName,
        LastName
  FROM  (
        DELETE  dbo.CustomersActive
        OUTPUT
                DELETED.CustomerID,
                DELETED.FirstName,
                DELETED.LastName
        WHERE   CustomerID IN ( 1001, 1003, 1005 )
        ) AS RowsToMove

SELECT * FROM dbo.CustomersActive
SELECT * FROM dbo.CustomersInactive

The fact that we can take the output from DELETE and feed it to INSERT models what we are trying to do perfectly. And we get some advantages:

  1. This is now a single, atomic statement on its own.
  2. The logic about which rows to move is specified only once, which is neater.
  3. The logic about which rows to move is only processed one time by the SQL Server engine.

I hope this simple example opens the door for you to begin using OUTPUT and composable DML.

Edit: Be sure to see Ben Thul’s comment below for an even better/more concise way to write this.

Published Wednesday, August 17, 2011 5:44 PM by merrillaldrich


Comments

 

Timothy said:

This is brilliant KISS stuff! Keep It Simple, Stupid! I wonder what the OOP programmers will say about this!

August 17, 2011 9:10 PM
 

Ben Thul said:

What about:

DELETE  dbo.CustomersActive
OUTPUT
        DELETED.CustomerID,
        DELETED.FirstName,
        DELETED.LastName
INTO    dbo.CustomersInactive
WHERE   CustomerID IN ( 1001, 1003, 1005 )

August 17, 2011 9:29 PM
 

merrillaldrich said:

Timothy - I know, right? The tables involved don't even make a network round trip to an app server. Rather unsophisticated :-).

August 17, 2011 10:09 PM
 

merrillaldrich said:

Ben - even better. I got too wrapped up in demo'ing composition.

August 17, 2011 10:21 PM
 

Greg Milner said:

Nice tip. Thanks, Merrill.

August 18, 2011 11:21 AM
 

Stephen Morris said:

Also a filtered index on the IsActive column might be a winner

August 19, 2011 1:24 AM
 

Tim McCorkle said:

Very cool, Thanks

August 19, 2011 2:08 AM
 

Craig said:

Come on, would a DBA ever really approve that? LOL

August 19, 2011 2:12 PM
 

Tony said:

@Craig - as a DBA... yeah, why not? I'd much rather see something written like that than a multi-step approach or a client app.

August 21, 2011 4:14 AM
 

Mark said:

Wouldn't you end up with a table lock?

August 22, 2011 10:38 AM
 

merrillaldrich said:

Mark - no, this won't cause a table lock per se. I believe the locking and lock escalation rules are no different than for other insert and delete statements.

August 22, 2011 12:54 PM
 

Brij Sharma said:

Nice, thumbs up. Learned a lot.

August 22, 2011 1:10 PM
 

Steve Jones said:

Merrill,

Excellent example, and I like the use of OUTPUT. Ben's example is more compact, but I'm not sure it would perform differently across small sets of data, and the lack of the INSERT, IMHO, reduces readability quite a bit.

I'd wonder how this performs on, say, 10k record sets where you might have 50k active customers and roll off a slice every month.

August 29, 2011 11:37 AM
 

AndyG (@DBA_ANDY) said:

Very Slick - I have never heard of using OUTPUT in this fashion.

August 29, 2011 1:16 PM
 

JohnG said:

Perfect timing for me... I have just finished an archiving task that uses the traditional two-step method to archive millions of rows in manageable chunks. I will modify it to use OUTPUT, and if there is interest, I will return with my performance comparisons.

August 30, 2011 12:47 PM
 

Balaji said:

Nice article..

September 5, 2012 11:42 AM
 

Erika (@DancesWChickens) said:

Just used this and it was much simpler than the approach I probably would have taken. Thanks!!

September 21, 2012 4:08 PM
 

Octavient said:

Is there a way to do this?

OUTPUT

   DELETED.*

Otherwise, won't this break, or lose data, if another column is subsequently added to the main table?

November 2, 2012 5:45 PM
 

Tammra said:

I know this is an older post, but have to chime in ... I just ran across this article and found it very interesting. However, JohnG says he's going to give it a try to archive millions of rows (which is the same chore I currently have that sent me to this article). I tested this on my dev server and 'moved' just 31 records - it took 1.5 minutes but my transaction log took a major hit! From 32 to 30,745 MB ... for 31 records.  I'd be curious to know if others have tried using this method for archiving large amounts of data.

June 26, 2013 10:37 AM
 

Tammra said:

Just an added side note: I tested the traditional two-step Insert Into and Delete From, and though the timing was about the same, my transaction logs didn't change. I'm thinking 'old school' is a better answer for me.

June 26, 2013 11:09 AM
 

AbiNed said:

I just got a job as a .NET programmer and I need to do this. How many rows could I do this for? 200? 500?

November 12, 2013 2:40 AM
 

Joel said:

I just did it for 6.5M rows (took 53 mins).

November 14, 2013 4:15 PM
 

Joel said:

A couple other comments, the source table had 27M rows, I outputted DELETED.* into the new table, and it had a LEFT OUTER JOIN (grabbing everything that failed to join).

November 14, 2013 4:20 PM
 

merrillaldrich said:

There's no technical limitation for the statement/syntax. The limit will be the transaction log size for a single transaction, the time spent in the transaction, or blocking. That is, it's the same as any manipulation of a lot of rows in any single transaction. If you have to do a lot of rows, it can make sense to chunk the work out in batches. Often the batches can be thousands of rows, though, without a big hit.
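
For illustration only, here is a minimal sketch of that kind of batching, reusing Ben's OUTPUT ... INTO form and the sample tables from the post. The 5000-row batch size and the CustomerID predicate are placeholders, not recommendations:

-- Each loop iteration is its own autocommit transaction, so the
-- per-batch transaction log impact stays small (assuming regular log
-- backups, or simple recovery)
WHILE 1 = 1
BEGIN
    DELETE TOP ( 5000 )
            dbo.CustomersActive
    OUTPUT  DELETED.CustomerID,
            DELETED.FirstName,
            DELETED.LastName
    INTO    dbo.CustomersInactive
    WHERE   CustomerID < 1000000   -- hypothetical "rows to archive" test

    IF @@ROWCOUNT = 0 BREAK
END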

November 14, 2013 4:53 PM
 

WW said:

This was exactly what I was looking for.  Worked great for my purposes.  I used a variation of Ben's version.  Thank you!

February 20, 2014 9:27 PM
