SQLblog.com - The SQL Server blog spot on the web
Piotr Rodak




OUTPUT clause and windowing functions

A few days ago I was asked how to remove duplicates from a table. As usual in such cases, the duplicate values involved only some of the columns, not all of them, so the classical approach of copying the distinct data into a temp table, truncating the table with duplicates and then copying the data back wouldn't work. In such cases I find one of the windowing functions, row_number(), to be the easiest to use. You just define the criteria for numbering the rows and then remove all rows that are not first in their partition. Let's have a look at how it works.

First, let's create and populate a sample table:

if object_id('tempdb..#t') is not null
    drop table #t

create table #t(Id char(3), randomData uniqueidentifier default(newid()))

insert #t(Id)
values
('123'), ('124'),
('133'), ('133'), ('133'),
('141'), ('141'),
('121'), ('121'),
('145'), ('145'),
('152')

select * from #t

When you run the query, you should see a result similar to the following illustration:

Deleting duplicates is pretty straightforward. You define partitioned row numbering and then delete all rows that have a row number greater than 1:

;with dupes as
(
    select *, row_number() over(partition by Id order by Id) as rn
    from #t
)
delete dupes
where rn > 1

select * from #t

As you see, the duplicates were removed.

The execution plan for the above statement is pretty straightforward:

The most interesting operator in this plan is the Sequence Project operator. It adds the calculated row number to the output, based on the criteria that you pass to the row_number() OVER clause. Here we partitioned the input by the Id column, which causes the row number to restart from 1 whenever a new Id is encountered in the input. The row number is returned as the result of the internal function row_number, which you can see in the following picture:

To the left of the Sequence Project operator you can see the Filter operator, which filters the input based on a certain predicate. You can see in the picture below that the Filter operator lets through only rows that have Expr1004 > 1.

You can check that in this particular case only five rows were let through and subsequently deleted: two with Id 133, one with 141, one with 121 and one with 145.
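If you want to verify those counts yourself before running the delete, a simple aggregate over #t shows which Ids have extra copies (a quick sanity check I'm adding here, not part of the original script):

```sql
-- one row per duplicated Id, with the number of copies the delete will remove
select Id, count(*) - 1 as rowsToDelete
from #t
group by Id
having count(*) > 1
```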

What if you want to look at the rows that are deleted? Let’s repopulate the table and run the delete statement, this time with the OUTPUT clause:

truncate table #t

insert #t(Id)
values
('123'), ('124'),
('133'), ('133'), ('133'),
('141'), ('141'),
('121'), ('121'),
('145'), ('145'),
('152')

;with dupes as
(
    select *, row_number() over(partition by Id order by Id) as rn
    from #t
)
delete dupes
output deleted.*
where rn > 1

The OUTPUT clause allows you to return rows from the inserted and deleted pseudo-tables, the same ones you have access to in triggers. You can return these rows to the client, or to a temporary table or table variable for further processing or auditing. Let's have a look at the execution plan.
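As a side note, capturing the deleted rows with OUTPUT ... INTO looks like this (a minimal sketch; the @removed table variable is my addition, not part of the original script):

```sql
declare @removed table (Id char(3), randomData uniqueidentifier);

;with dupes as
(
    select *, row_number() over(partition by Id order by Id) as rn
    from #t
)
delete dupes
output deleted.Id, deleted.randomData into @removed
where rn > 1;

-- the removed rows are now available for auditing
select * from @removed;
```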

What's this?! The Sort, Segment and Sequence Project operators were added to the plan a second time! When you look at their details, you will see that they have exactly the same predicates and conditions as the set of operators on the right-hand side of the Filter operator. The difference is that they are applied to the data stream returned by the OUTPUT clause. You can verify this by looking at the results returned by the statement:

Have you noticed something? The rn column contains values 1 and 2, while it should contain only values 2 and 3, because that is what the Filter operator allowed to flow through. This is evidence that the new operators work on the output data stream. You can see that they work on the 5 rows that were deleted:

It is obvious that if you use the OUTPUT clause on queries that otherwise perform very well, you may run into problems. When you have many rows, the Sort operator can become a bottleneck when it is executed twice. The example also shows that the row number identifier of a row is not reliable within the boundaries of a single query. This may become an issue for you if you want to identify rows based on the output of an update or delete statement.

From a purely performance point of view, what can you do to avoid these additional operators? The answer is simple: don't return the row_number() column in the OUTPUT clause.

truncate table #t

insert #t(Id)
values
('123'), ('124'),
('133'), ('133'), ('133'),
('141'), ('141'),
('121'), ('121'),
('145'), ('145'),
('152')

;with dupes as
(
    select *, row_number() over(partition by Id order by Id) as rn
    from #t
)
delete dupes
output deleted.Id, deleted.randomData
where rn > 1

As you can see in the snippet above, I explicitly return columns from the deleted table and I don't return the rn column. The optimizer realizes that the column is not needed and removes the additional operators from the plan, so it looks exactly the same as the first plan we saw in this post.

Well, if someone asked me what I think about this behavior, I would say it is a bug in the SQL Server engine. The windowing functions shouldn't be applied to the results of the OUTPUT clause, because they affect performance and can potentially affect the logic of the application.

The test script is attached to this post, so you can run your own investigations.

Published Sunday, July 31, 2011 10:55 PM by Piotr Rodak

Attachment(s): OUTPUT-testing_duplicates.zip


Comments

 

Paul White said:

Hi Piotr,

This is a very good post - well done.

I have seen this plan shape before, but didn't really investigate it.  I had assumed this was just another example of the optimizer not collapsing ranking functions.

Having thought about it a bit more deeply, I have come to the conclusion that this is a bug in the parser.  We are not usually allowed to reference columns (or expressions) that are not being modified via the INSERTED or DELETED clause of the OUTPUT statement.

Indeed, trying to write a similar query using an UPDATE, and attempting to reference the ROW_NUMBER expression via the INSERTED clause, we get parser error 404: "The column reference "inserted.rn" is not allowed because it refers to a base table that is not being modified in this statement".

This restriction *should* also apply to the DELETED clause - but for some reason the parser allows it, and sends the optimizer an operator tree that it is not equipped to handle.  To see this more starkly, change your ROW_NUMBER expression to use any scalar function or an expression e.g:

abs(row_number() over(partition by Id order by Id)) as rn

OR

row_number() over(partition by Id order by Id) + 0 as rn

You will be rewarded with an unhandled exception - an access violation in this case - as the optimizer code tries to determine the partitioning and ordering for the ranking function.

In the specific example in this post, if non-modified references were disallowed in the DELETED clause (the correct behaviour) there's no easy way I can see to write a query in TSQL that outputs the row numbers of rows that were deleted.  If your example table had a primary key, we could do it with a join, but as it stands it seems impossible.
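To illustrate the join idea Paul describes, here is a rough sketch (my illustration, not Paul's code; the #t2, #numbered and pk names are hypothetical): materialize the row numbers keyed by the primary key first, delete via a join, and read the row numbers of the deleted rows back from the materialized copy.

```sql
if object_id('tempdb..#t2') is not null
    drop table #t2
if object_id('tempdb..#numbered') is not null
    drop table #numbered

-- same shape as #t, but with a surrogate primary key
create table #t2(pk int identity primary key, Id char(3),
                 randomData uniqueidentifier default(newid()))

insert #t2(Id)
values ('123'), ('133'), ('133'), ('141'), ('141')

-- materialize the row numbers, keyed by the primary key
select pk, Id, row_number() over(partition by Id order by pk) as rn
into #numbered
from #t2

delete t
output deleted.pk, deleted.Id
from #t2 t
join #numbered n on n.pk = t.pk
where n.rn > 1

-- #numbered still holds the row number of every deleted pk
select pk, Id, rn from #numbered where rn > 1
```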

By the way, this bug is not fixed in Denali CTP 3, and can be made to produce all sorts of counter-intuitive/just plain wrong results, e.g. with COUNT(*) OVER instead of ROW_NUMBER.  Have you entered a Connect bug for this?

Paul

July 31, 2011 10:27 PM
 

Piotr Rodak said:

Hi Paul,

Thanks for your reply. I actually hoped that you might shed more light on the problem. I never realized that the column rn should not be available in the output. It kind of makes sense, although it is also intuitive to have all columns defined on input available on output. Everything would probably be all right if the parser didn't inject the operators into the result of the OUTPUT clause but rather left the filtered output untouched.

I will raise the Connect bug for this issue.

Piotr

August 1, 2011 6:15 AM
 

Paul White said:

Hey Piotr,

I couldn't see your Connect item to vote on, so added one myself:

https://connect.microsoft.com/SQLServer/feedback/details/682645/wrong-results-unhandled-exception-using-deleted-in-the-output-clause

Cheers,

Paul

August 2, 2011 9:43 AM
 

Piotr Rodak said:

Thanks Paul,

I was going to add it today. I voted for the item.

Piotr

August 2, 2011 10:35 AM
 

Paul White said:

The Connect item is now resolved as 'fixed'.  As usual, there are no details concerning the fix - what it is, when it will be available, which versions...sigh.

August 17, 2011 8:26 AM
 

Piotr Rodak said:

Hi Paul,

Thanks for the update. Somehow I expected email notification from Connect.

Well, it looks like it's going to be my hobby to check every new version of SQL Server to find out what the fix is :)

Piotr

August 19, 2011 8:24 AM
 

Paul White said:

Yeah email notifications would be good - perhaps we should open a Connect item for that ;c)

UC has now updated the item with a good description of the problem and some details of the short term fix and longer-term goals.  Turns out you identified a real bug here - not just a problem, but one where they are going to have to change its behaviour at some point to comply with ANSI SQL standards...cool!

Paul

August 19, 2011 10:33 AM
 

Piotr Rodak said:

I see my name carved in stone in Hall of Fame, forever :)

August 21, 2011 6:50 AM
