Siddhartha Gautama, the Buddha, taught us to understand that the key to
enlightenment is following the Middle Path. And today I learned a
valuable lesson in extremes. You can file this one in the "Doh! Wrong again!" category...
A fairly common question on SQL Server forums is, "how can I get the
running sum of the data in this column?" Being the fan of set-based
queries that I am, I always answer the exact same way. I show the
person asking the question how to do a self-join on the grouped column,
getting all of the "previous" values to create a running sum. The
following example shows how you might do this against the
AdventureWorks Production.TransactionHistory table:
SELECT
TH1.TransactionID,
TH1.ActualCost,
SUM(TH2.ActualCost) AS RunningTotal
FROM Production.TransactionHistory TH1
JOIN Production.TransactionHistory TH2 ON TH2.TransactionID <= TH1.TransactionID
GROUP BY TH1.TransactionID, TH1.ActualCost
ORDER BY TH1.TransactionID
Pretty simple query. For each row of the "TH1" alias, every row with a
lesser-or-equal TransactionID will be summed. Thereby creating a
running total for every row of the table. Note, I've used the IDENTITY
column instead of the date column. I'd generally suggest not doing so
because, e.g., you might need to insert some post-dated rows at some
point and relying on the IDENTITY for a time sequence will thereby not
work. But in this case it's a lazy solution because the
TransactionDate column isn't indexed, and it's also not unique. I want
to test a lot of rows (TransactionHistory has around 113,000), but I
don't want to skew the test by forcing a table scan on every iteration!
But I digress. The point is, I've given this answer more than a few
times and, well, I'd like to apologize. Just now I went ahead and ran
this query on my powerful test server--err, my laptop.
As you might guess, since I'm performance-minded I also happen to be
extremely impatient--so I went ahead and killed the query at the
five-minute mark. SSMS's result grid showed the first 26,568 rows, so
obviously there was a long way to go to hit that 113,000 mark. And
with an estimated cost of 38,086 for the query, I can't say I'm
surprised.
A few moments of head scratching and the following re-write was issued:
SELECT
TH1.TransactionID,
TH1.ActualCost,
(
SELECT SUM(TH2.ActualCost)
FROM Production.TransactionHistory TH2
WHERE TH2.TransactionID <= TH1.TransactionID
) AS RunningTotal
FROM Production.TransactionHistory TH1
ORDER BY TH1.TransactionID
With an estimated cost of only 6,630, I had high hopes for this one.
Alas, once again I was forced to cancel the query at the five-minute
mark. 27,683 rows. Not much better, I'm afraid. And, as an aside,
I'm starting to wonder about these estimated costs. But that's another
post for another day.
So where am I going with all of this? Well, there's a reason I haven't
given any indication up until this point in the post. You see, it's
utterly painful to write this, but...
In this case, a cursor is faster.
Yes, I said it. That evil construct which we as database developers despise, the cursor. Thanks to
Paul Nielsen,
who revealed this ugly fact to me in a conversation today, I was forced
to test this for myself (hoping to prove him wrong, of course). Which
is why I started playing around with the solution that I've given so
many times on forums. Unfortunately, he is correct.
My next test query, using the first cursor I've written in several years:
DECLARE RunningTotalCursor
CURSOR LOCAL FAST_FORWARD FOR
SELECT TransactionID, ActualCost
FROM Production.TransactionHistory
ORDER BY TransactionID
OPEN RunningTotalCursor
DECLARE @TransactionID INT
DECLARE @ActualCost MONEY
DECLARE @RunningTotal MONEY
SET @RunningTotal = 0
DECLARE @Results TABLE
(
TransactionID INT NOT NULL PRIMARY KEY,
ActualCost MONEY,
RunningTotal MONEY
)
FETCH NEXT FROM RunningTotalCursor
INTO @TransactionID, @ActualCost
WHILE @@FETCH_STATUS = 0
BEGIN
SET @RunningTotal = @RunningTotal + @ActualCost
INSERT @Results
VALUES (@TransactionID, @ActualCost, @RunningTotal)
FETCH NEXT FROM RunningTotalCursor
INTO @TransactionID, @ActualCost
END
CLOSE RunningTotalCursor
DEALLOCATE RunningTotalCursor
SELECT *
FROM @Results
ORDER BY TransactionID
What's really unfortunate about the cursor approach is that you need to
use a temporary table if you want to return a single result set to the
client. I figured the additional I/O due to the temp table would
balance any improvement gains from the cursor approach, thereby
rendering my forum responses correct, and Paul wrong. Well, 14 seconds
and 113,443 rows later, SSMS and my laptop declared Paul the undisputed
Champion of the Cursor.
This cursor makes a lot of sense in this case. The set-based query
works by looping over each row of the table, taking a sum of every
previous row. So for the 10th row, 10 previous rows also need to be
visited. For the 1000th row, 1000 previous rows need to be visited.
And so on. The larger the set gets, the worse performance will be--and
that's not going to be a merely linear decrease in performance. Think
about this: Using the set-based method to find the running sum over a
set of 100 rows, 5050 total rows need to be visited. For a set of 200
rows, the query processor needs to visit 20100 total rows -- a
four-fold increase in the amount of work that must be done to satisfy
the query (O((N^2)/2), for those who are a bit more algorithmically
minded.)
The cursor, on the other hand, needs to visit each row exactly once
(O(N)). By maintaining the running count in a variable, there is no
need to re-visit previous rows. And as my laptop was so happy to show
me, the I/O cost due to the temp table does not overshadow the
performance improvement of having to visit so many less rows.
So what have we learned today? In my set-based singlemindedness I
failed to realize that the cursor does, indeed, have utility.
Everything in moderation.
Next steps? I get the feeling that this can be made even faster by
employing a CLR routine. Pull the data into a DataReader and loop over
that instead, which will completely eliminate the need for a temporary
table. Watch for that experiment coming to this space soon.
And next time you hear someone mention how horrible cursors are, remind
that person that there is a time and place for everything (
and it's called college).