THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

  • SQL Intersection Conference, Las Vegas MGM Grand April 8-11

    I'll_Be_There

    You might have noticed the SQL Intersection logo appearing in my recent blog posts. This is a brand new event focussed on tuning and troubleshooting the SQL Server core engine (including Windows SQL Azure Database). If you read and enjoy at least some of the things I write about, this conference is one you should seriously consider attending.

    For my part, I’ll be presenting three sessions: two on Interpreting Execution Plans, and one on Parallel Query Execution. I have to promote my sessions, because just look at the other speakers you have to choose from:

    SQL Speakers

    As well as speaking, all these people will be around to chat to over the four conference days, and some are offering complete days of training at pre- and post-conference workshops as well.

    Personally, I’m particularly excited about talking to the following people and attending their sessions:

    • Bob Ward
    • Conor Cunningham
    • Paul Randal
    • Kimberly Tripp
    • Brent Ozar
    • Kendra Little
    • Grant Fritchey
    • Joe Sack
    • Jeremiah Peschka
    • Aaron Bertrand
    • Steve Jones

    Main Conference Overview

    Each of the main conference days has three tracks to choose from:

    Tuesday

    Tuesday Schedule

    Wednesday

    Wednesday Schedule

    Thursday

    Thursdsay Schedule

    The full schedule is available here as a PDF document, or you can visit the conference website for all the details.

    Pricing

    The conference pricing is very competitive (especially so, considering the location and speaker line-up):

    Pricing Grid

    Discount Code

    When registering, quote the Discount Code WHITE to receive an extra $50 off these prices.

    I really hope to see you in Las Vegas – please feel free to come and talk to me about anything concerning execution plans or the query optimizer :)

    Paul White

    SQL Server MVP 2011, 2012
    Twitter:
    @SQL_Kiwi

  • Execution Plan Analysis: The Mystery Work Table

    Ill_Be_There4

    I love SQL Server execution plans. It is often easy to spot the cause of a performance problem just by looking at one. The task is considerably easier if the plan includes run-time information (a so-called ‘actual’ execution plan), but even a compiled plan can be very useful. Nevertheless, there are still times where the execution plan does not tell the whole story, and we need to think more deeply about query execution to really understand a performance problem. This post looks at one such example, based on a recent question posted on the SQL Performance Q & A site.

    The Execution Plan

    Original Query Plan

    This plan is reasonably large (20MB cached plan size) but not massively complex once you break it down (click on the image above to view it full-size in a new window). The context of the question is that this query usually executes in less than a minute, but sometimes it runs for nearly twenty minutes – though the plan appears identical.

    High-Cost Operators

    There are many different things to look for in execution plans. What you choose to look at first is as much a matter of personal preference as anything, but many people are drawn to high-cost operators, so I will start there. In this plan, the cost of one operator dominates all others, shown as being responsible for 100% of the cost of the query. It is highlighted in red in Plan Explorer; I have expanded the relevant plan section (the top right) below:

    100% operator cost

    There is no doubt that this seek is busy little thing. It is executed 249,484 times, though it only produces a grand total of 167,813 rows over all iterations of the loop join – an average of just 0.7 rows per seek. There are all sorts of interesting details in the plan about this seek – I could write a whole blog post about it – but two details that stand out are the “Force Seek: True” and “Partitioned: True” attributes. These tell us that the base table is partitioned, and the query writer had to use a FORCESEEK table hint to get this plan.

    Without this hint, the optimizer would almost certainly choose a Hash Match or Merge Join rather than Nested Loops. This is understandable given the optimizer’s cost model and the simplifying assumptions it makes (such as assuming every query starts with a cold buffer cache). That’s fine, but we can see from the query plan that the inner-side table has 643 million rows. Left to its own devices, the optimizer would estimate that it would be faster to perform a sequential scan of 643 million rows (with large-block read-ahead) than it would be to run a quarter-million randomly-distributed seeks driven by a Nested Loops join.

    I doubt that the optimizer’s reasoning here is sound (at least on any reasonably modern hardware) but there we go. The query author probably knows that a good fraction of this table is likely to be in cache, so with all that in mind, I think we can reasonably assume at this stage that the FORCESEEK hint is genuinely needed here, and this part of the plan is at least reasonably optimal.

    Important note: The seek certainly does not account for 100% of the runtime cost of this query. Remember cost percentages are always estimates – even in ‘actual’ plans. It can be useful to check the reasons for high-estimated-cost operators, but they should never be used as a primary tuning metric.

    Execution Warnings

    Sort Warning

    This query was executed on SQL Server 2012, so there is a handy warning triangle on the Sort operator indicating that one or more sort runs had to be spilled to physical tempdb disk. The plan clearly shows this spilling is a result of an inaccurate cardinality estimate at the Filter operator (the estimates are not bad at all prior to this). The Sort expects 9,217 rows totalling approximately 5MB, but actually encountered 61,846 rows in 35MB. As you may know, memory for sorts and hashes is allocated before execution starts, and generally cannot expand dynamically at run time.

    The spilled sort is undesirable, of course, but it is unlikely to be a major cause of the occasional poor performance given the small size of the spilled data. Nevertheless, this might be a good place to split this query up. The idea would be to write the results of the query (up to and including the Filter) to a temporary heap table using SELECT INTO, and then create a clustered index with the same keys as the Sort operator. The temporary table would not be large, and may well perform better overall than the spilled sort, including the cost of creating the clustered index. Of course, creating this index will involve a sort, but it will be one based on the known cardinality of the temporary heap table. The part of the plan that could be replaced by a temporary table is shown below:

    Plan subtree replaced with a temp table

    I am a big fan of simplifications like this. Smaller execution plans tend to optimize better for all sorts of reasons, and the source code usually becomes easier to maintain as well. I should mention there is another warning triangle in the 2012 execution plan (shown on the root icon), which relates to some implicit conversions that I will mention later.

    I/O Information

    The execution plan was captured with Plan Explorer, so we can also easily see I/O statistics for the two executions. The first is for a fast (sub-60-second) run:

    I/O data - fast

    Overall, these I/O numbers show pretty much what we would expect: a decent number of logical reads associated with the seeks into the Trans table (but certainly not 100% of the total, ha ha), a very small number of physical reads, and a small amount of read-ahead activity on a couple of tables.

    The second set of I/O data is from a slow run (18 minutes or so):

    I/O data - slow

    The very obvious difference is the appearance of a work table, with 178 million logical reads and 130 million LOB logical reads. It seems very likely this work table, and its 300 million logical reads, is responsible for the dramatic decrease in query performance. But given that the execution plans are identical (right down to the XML) what is causing this?

    My answer to that question (on the Q & A site) was that it is related to the increased level of read-ahead activity, but to see why that is the case, we will need to reproduce the issue and dig a bit deeper.

    Execution Outline

    Before we really get going on this, it will be useful to take a look at what the execution plan is doing in outline. We saw the first part of the plan earlier when looking at the spilling sort. The data set at that point (which we would like to write to a temporary table, remember) essentially represents source data for a second query, which uses a series of Nested Loops Left Joins to lookup information from other tables:

    Nested Loop Lookups

    The inner side of each join involves some reasonably involved logic, which is thankfully not important to the present discussion. What is important is that the result of each lookup is a LOB data type. This begins to shed some light on the LOB logical reads reported against the work table, but it does not explain why the work table (and the 300 million associated reads) do not appear when the query runs quickly (with the same execution plan).

    Reproducing the problem

    Table Creation

    The first part of the repro involves creating six tables that represent the lookup tables in the original query plan. Each table will have 10,000 rows, consisting of a sequential reference number and a second column containing a 2048 single-byte-character string. The source table used to drive the lookups will be a regular Numbers table containing just a single integer column.

    CREATE TABLE dbo.T1 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T2 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T3 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T4 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T5 (id integer IDENTITY PRIMARY KEY, d char(2048));
    CREATE TABLE dbo.T6 (id integer IDENTITY PRIMARY KEY, d char(2048));
    GO
    INSERT dbo.T1 WITH (TABLOCKX)
    SELECT REPLICATE('A', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T2 WITH (TABLOCKX)
    SELECT REPLICATE('B', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T3 WITH (TABLOCKX)
    SELECT REPLICATE('C', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T4 WITH (TABLOCKX)
    SELECT REPLICATE('D', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T5 WITH (TABLOCKX)
    SELECT REPLICATE('E', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;
    GO
    INSERT dbo.T6 WITH (TABLOCKX)
    SELECT REPLICATE('F', 2048)
    FROM dbo.Numbers AS n WHERE n BETWEEN 1 AND 10000;

    The next step is to ensure that each lookup table is optimally organized for read-ahead:

    ALTER TABLE dbo.T1 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T2 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T3 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T4 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T5 REBUILD WITH (MAXDOP = 1);
    ALTER TABLE dbo.T6 REBUILD WITH (MAXDOP = 1);

    Test Query

    The original query translates into our simplified test rig as:

    DECLARE @d nvarchar(max) = NCHAR(10000);
     
    SELECT n.n,
        DATALENGTH
        (
            CONCAT
            (
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T1 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T2 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T3 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T4 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T5 AS t WHERE t.id = n.n),
                (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T6 AS t WHERE t.id = n.n)
            )
        )
    FROM dbo.Numbers AS n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n
    OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);

    The broad idea there is to concatenate our 2048-character column to itself five times and include a Unicode character that was used in the original query as a delimiter that could not appear in the source data. Each lookup performs the same basic operation against its target table, and the final result is the result of concatenating all the intermediate results. The query hints are necessary to get the right plan shape, just because my test rig tables are so much smaller than the real ones.

    Note that the Unicode delimiter means the 2048-character single-byte data is implicitly converted to Unicode, doubling in size. It is not a crucial feature of the test, but it did appear in the original query and explains the type conversion warnings in the execution plan I mentioned earlier. The execution plan for the test query is (click to enlarge if necessary):

    Test query execution plan

    I should also stress that the CONCAT operator (new in SQL Server 2012) is not crucial either. If you are using an earlier version of SQL Server, an equivalent query (for present purposes) is shown below. I’m going to stick with CONCAT for the remainder of the post, however.

    DECLARE @d nvarchar(max) = NCHAR(10000);
     
    SELECT n.n,
        DATALENGTH
        (
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T1 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T2 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T3 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T4 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T5 AS t WHERE t.id = n.n) +
            (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T6 AS t WHERE t.id = n.n)
        )
    FROM dbo.Numbers AS n
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n
    OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);

    Warm cache results

    With all data in memory, the test query (in either form) completes in about 1.6 seconds on my laptop. The result shows that each output row contains 147,468 bytes of Unicode character data. A typical set of I/O statistics follows:

    image

    Nothing too exciting to see there, but this is just our baseline.

    Cold cache results

    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;

    With no data in memory, the test query now runs for 18.6 seconds – almost 12x slower. The I/O statistics show the expected (but still mysterious!) work table and its associated reads:

    image

    The Extended Events wait statistics show SQL Server spent very little of that time waiting on my laptop’s slow hard drive – just 402 ms:

    image

    Explanation

    The are a number of factors in play here that we will look at in turn.

    Nested Loops Prefetching

    One of the reasons the optimizer prefers Hash Match and Merge Join for larger inputs is that the data access patterns tend to favour large sequential read-ahead. Both hash and merge tend to scan (range-scan in the case of a seek) their inputs, and the SQL Server Storage Engine automatically issues read-ahead when it detects this type of access. There is nothing in the execution plan to show that a base table will be read with read-ahead, it just happens.

    A very basic implementation of Nested Loops join would not benefit from read-ahead at all on its inner side. The outer (driving) side of the loops join might well be a scan or range-scan of an index, and so benefit from automatic read-ahead, of course. The inner side is executed once per outer row, resulting in a rapid succession of small index seeks for different values. These small seeks will typically not be large enough to trigger the automatic read-ahead mechanism. Indeed, in our test, each inner side seek is for precisely one value.

    SQL Server improves on this by implementing a second read-ahead mechanism especially for Nested Loops joins (not all N-L joins, it is a cost-based decision the optimizer makes). The basic idea is to buffer extra rows from the outer side of the join, and to use the row values in the buffer to drive read-ahead for the inner side. The effect is that the Nested Loops join becomes a partly blocking operator as outer-side rows are read into the buffer and read-ahead issued based on buffered index key values.

    This read-ahead may be either order-preserving or not, and is indicated in the execution plan by the Nested Loop attributes With Ordered Prefetch and With Unordered Prefetch, respectively. When unordered prefetch occurs, the inner side is processed in whatever order the asynchronous reads happen to complete. With ordered prefetching, the mechanism is careful to ensure that the order of rows entering the join is preserved on the output.

    In the test rig, the ORDER BY clause means there is a need to preserve row order, so Ordered Prefetch is used:

    Ordered Prefetch

    The issue described in this post is not specific to ordered prefetching – the same behaviour is just as likely with unordered prefetching. The point is that Nested Loops prefetching is one of the requirements.

    Documented trace flags 652 and 8744 may be used (with care, and after serious testing) to disable automatic read-ahead and Nested Loops prefetching respectively. This is sometimes beneficial where all data is expected to be in memory (in which case read-ahead processing consumes resources better used by query execution) or where the I/O subsystem is extremely fast. In case you were wondering, there is no background thread for prefetching – all the work of checking whether the data is in memory, and issuing I/O if not, is performed by the worker thread executing the query.

    I should stress that read-ahead and Nested Loops prefetching is generally A Very Good Thing with typical storage solutions (e.g. SANs) and both work best (or at all) when indexes have low logical fragmentation.

    Manufactured LOBs

    The issue described here also requires that a large object data type is manufactured before prefetching. The Compute Scalar operators in the test execution plan perform that function:

    Manufactured LOB

    By ‘manufactured’, I mean that the source columns are not LOB types, but the expression output is – notice the implicit conversion to nvarchar(max). To be clear about it, the issue we are analysing here does not occur when Nested Loops prefetching occurs with an expression that was a LOB type to begin with.

    The Outer Join

    The optimizer is quite good, generally speaking, at moving scalar expressions around. If the query had featured inner joins (whether by query design or through optimizer activities) the chances are quite good that the problematic expressions (the LOB manufacturers) would have moved beyond the prefetching, and so out of harm’s way. It is quite tricky to preserve NULL-extension and other outer-join semantics properly when moving expressions above an outer join, so the optimizer generally does not even try. In essence, the outer join represents an optimization barrier to the LOB-manufacturing expressions.

    Memory Allocation

    When Nested Loops prefetching occurs with a manufactured LOB, the question arises of where to store the created LOBs when buffering rows for prefetch. If the source data were already a LOB type, the execution engine would already have memory structures in place to handle them. When prefetching encounters a manufactured LOB, it needs to store it somewhere, since the engine is no longer processing a stream of one row at a time. It turns out that there is a small memory buffer set aside for this eventuality, which empirical tests show to be 24KB.

    However, this 24KB (directly allocated, not via workspace memory grant) is shared across all concurrently executing prefetching joins in the query. With six such joins in the test rig plan and large manufactured LOBs, the buffer stands no chance. As a result, query execution engages a bail-out option: a work table created in tempdb. Though the pages of the worktable may in fact remain memory-resident, overheads (including latching and using general-purpose code interfaces for access to the buffered rows) mean this is very much slower than using the direct-memory cache.

    As with most internal work tables, the logical reads reported on this work table indicate the number of rows processed (not 8KB pages, as for regular I/O statistics). This fact, together with the large number of items processed via the worktable in our test, accounts for the millions of reads reported.

    The creation and use of the work table depends on run time conditions and timing. If execution finds the data it needs is already in memory, the prefetch checks are still performed, but no asynchronous read requests end up being posted. The 24KB buffer is never filled, so the need to create a work table never arises. The more prefetch that actually occurs, the higher the chances that the buffer will fill. It is quite possible to experience a low level of prefetch with manufactured LOBs without the engine needing to bail out to a work table, especially if the LOBs are not very big and the I/O system is quite fast.

    Workaround

    We can rewrite the query to avoid feeding manufactured LOB data to the prefetch buffer. The idea is to use OUTER APPLY to return the data that contributes to the concatenation, rather than the result of the concatenation. We can then perform the CONCAT operation (which handles NULLs nicely without extra work) after the join, avoiding the prefetch buffer issue completely. In SQL Server versions prior to 2012, we would need to use direct string concatenation, and handle rows that are NULL-extended explicitly using ISNULL or COALESCE.

    DECLARE @d nvarchar(max) = NCHAR(10000);
     
    SELECT 
        n.n,
        DATALENGTH
        (
            CONCAT
            (
                CONCAT(oa1.i0, oa1.i1, oa1.i2, oa1.i3, oa1.i4, oa1.i5, oa1.i6),
                CONCAT(oa2.i0, oa2.i1, oa2.i2, oa2.i3, oa2.i4, oa2.i5, oa2.i6),
                CONCAT(oa3.i0, oa3.i1, oa3.i2, oa3.i3, oa3.i4, oa3.i5, oa3.i6),
                CONCAT(oa4.i0, oa4.i1, oa4.i2, oa4.i3, oa4.i4, oa4.i5, oa4.i6),
                CONCAT(oa5.i0, oa5.i1, oa5.i2, oa5.i3, oa5.i4, oa5.i5, oa5.i6),
                CONCAT(oa6.i0, oa6.i1, oa6.i2, oa6.i3, oa6.i4, oa6.i5, oa6.i6)
            )
        )
    FROM dbo.Numbers AS n
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T1 AS t WHERE t.id = n.n) AS oa1
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T2 AS t WHERE t.id = n.n) AS oa2
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T3 AS t WHERE t.id = n.n) AS oa3
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T4 AS t WHERE t.id = n.n) AS oa4
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T5 AS t WHERE t.id = n.n) AS oa5
    OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T6 AS t WHERE t.id = n.n) AS oa6
    WHERE n.n BETWEEN 1 AND 10000
    ORDER BY n.n
    OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);

    The execution plan for the rewritten query looks visually similar to the problematic one:

    Rewritten query plan

    However, the Compute Scalars no longer manufacture a LOB data type, they just emit column and variable references:

    image

    All the concatenation work (and LOB manufacture) is performed by the final top-level Compute Scalar in a single monster expression [Expr1056]:

    image

    Warm cache results

    With all data in memory, the new query completes in 1.8 seconds (very slightly up on 1.6 seconds before):

    image

    Cold cache results

    When all data must be fetched from disk, the query issues optimal prefetching and completes in 7.3 seconds (down from 18.6 seconds) with no work table:

    image

    The Extended Events wait statistics now show 3.8 seconds spent waiting for my laptop’s slow spinning disk (which is a good thing!)

    image

    Final Thoughts

    Work tables can appear in STATISTICS IO output for a wide range of reasons, but if you encounter one with a very large number of reads – particularly LOB reads – you may be encountering this issue. The rewrite proposed above may not always be possible, but you should be able to refactor your query to avoid the issue now you know it exists.

    I am not a fan of doing large amounts of string manipulation in SQL Server. I am always particularly suspicious of the perceived need to split or concatenate large volumes of strings.

    I am, however, a fan of always using explicit data types (rather than relying on implicit conversions) and generating relatively small query plans that offer the query optimizer clear and obvious choices. By necessity, this often means writing small SQL queries in logical steps (and no, long chains of common table expressions do not count!)

    The real world does not always make these things possible, of course, but it is good to have goals :)

    © 2013 Paul White – All Rights Reserved
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

    Screenshots acquired using SnagIt by TechSmith
    Query plan details obtained using Plan Explorer PRO by SQLSentry

  • Halloween Protection – The Complete Series

    image

    I have just published a four-part series for SQLPerformance.com on the Halloween Problem. Some of you will never have heard of this issue, and those that have might associate it only with T-SQL UPDATE queries. In fact, the Halloween problem affects execution plans for INSERT, UPDATE, DELETE and MERGE statements.

    This is a topic I have been meaning to write about properly for years, ever since I read Craig Freedman’s 2008 blog post on the topic, which ended with the cryptic comment:

    “…although I've used update statements for all of the examples in this post, some insert and delete statements also require Halloween protection, but I'll save that topic for a future post.”

    That future post never materialized, sadly, so I thought I would have a go. The four parts of the series are summarized and linked below, I hope you find the material interesting.


    Part 1 – The Halloween Problem and UPDATE statements

    • The SQL standard and three-phase separation
    • Logical update processing
    • Pipelined execution
    • The Halloween problem
    • Avoiding the problem in UPDATE statements

    Part 2 – The Halloween Problem in INSERT and DELETE queries

    • INSERT examples
    • DELETE examples
    • Constraint checking and phase separation

    Part 3 – Halloween Problem optimizations for MERGE

    • MERGE contains several optimizations the other DML statements do not
    • Hole-filling with merge join
    • Hole-filling with nested loops
    • Avoiding an extra B-tree navigation
    • Avoiding the join

    Part 4 – The Halloween Problem and the Query Optimizer

    • Early optimization approaches
    • The SQL Server optimizer approach
    • The case of the redundant sort
    • HP levels and properties
    • Plan changes for Halloween Protection
    • Non-spool options
    • Row versioning
    • Heaps and forwarded records
    • T-SQL functions

    As always, I appreciate your comments and feedback.

    Paul White
    @SQL_Kiwi
    SQLkiwi@gmail.com

    Ill_Be_There4

  • Incorrect Results with Indexed Views


    Summary: If you use MERGE, indexed views and foreign keys, your queries can return incorrect results.

    Microsoft have released a fix for incorrect results returned when querying an indexed view. The problem applies to:

    • SQL Server 2012
    • SQL Server 2008 R2
    • SQL Server 2008

    The Knowledge Base article does not go into much detail, or provide a reproduction script, but this blog entry does :)

    The KB does say that reproducing the bug requires these features:

    • An indexed view on two tables that have a foreign key relationship
    • An update performed against the base tables
    • A query executed against the indexed view using a NOEXPAND hint

    There are two important details I would like to add right away:

    • The NOEXPAND hint is not required to reproduce the bug on Enterprise Edition
    • The update must be performed by a MERGE statement

    The fix is available in the following cumulative update packages:

    Cumulative Update 2 for SQL Server 2012 SP1 [build 11.0.3339]
    Cumulative Update 5 for SQL Server 2012 RTM [build 11.0.2395]
    Cumulative Update 4 for SQL Server 2008 R2 SP2 [build 10.50.4270]
    Cumulative Update 10 for SQL Server 2008 R2 SP1 [build 10.50.2868]
    Cumulative Update 8 for SQL Server 2008 SP3 [build 10.00.5828]

    No service pack contains this fix, you must apply one of the hotfix packages above.

    Steps to Reproduce

    The first thing we will need is two tables:

    CREATE TABLE dbo.Parent 
    (
        parent_id integer IDENTITY NOT NULL,
        value varchar(20) NOT NULL,
     
        CONSTRAINT PK_Parent_id 
            PRIMARY KEY CLUSTERED (parent_id)
    );
    GO
    CREATE TABLE dbo.Child
    (
        child_id integer IDENTITY NOT NULL,
        parent_id integer NOT NULL,
     
        CONSTRAINT PK_Child_id 
            PRIMARY KEY CLUSTERED (child_id)
    );

    And a few rows of data:

    INSERT dbo.Child (parent_id)
    SELECT New.parent_id 
    FROM 
        (
        INSERT Parent 
        OUTPUT inserted.parent_id 
        VALUES 
            ('Apple'), 
            ('Banana'), 
            ('Cherry')
        ) AS New;

    The tables now look like this (parent first):

    Original Data

    We can now add the required FOREIGN KEY relationship:

    ALTER TABLE dbo.Child
    ADD CONSTRAINT FK_Child_Parent
    FOREIGN KEY (parent_id)
    REFERENCES dbo.Parent (parent_id);

    Next, a very simple indexed view that joins the two tables (the view could contain more complex features like aggregates):

    CREATE VIEW dbo.ParentsAndChildren
    WITH SCHEMABINDING 
    AS
    SELECT 
        p.parent_id, 
        p.value, 
        c.child_id
    FROM dbo.Parent AS p
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id;
    GO
    CREATE UNIQUE CLUSTERED INDEX cuq 
    ON dbo.ParentsAndChildren (child_id);

    The final step is to use a MERGE statement to make some changes to the Parent table:

    DECLARE @ParentMerge AS TABLE
    (
        parent_id integer PRIMARY KEY, 
        value varchar(20) NOT NULL
    );
     
    INSERT @ParentMerge
        (parent_id, value)
    VALUES
        (1, 'Kiwi Fruit'),
        (4, 'Dragon Fruit');
     
    MERGE dbo.Parent AS p
    USING @ParentMerge AS s ON 
        s.parent_id = p.parent_id
    WHEN MATCHED THEN 
        UPDATE SET value = s.value
    WHEN NOT MATCHED THEN 
        INSERT (value) VALUES (s.value)
    OUTPUT 
        $action, 
        inserted.parent_id, 
        deleted.value AS old_value, 
        inserted.value AS new_value;

    This MERGE performs two actions:

    1. Updates the value column of parent row 1 from Apple to Kiwi Fruit
    2. Adds a new parent row 4 for Dragon Fruit

    The statement includes an OUTPUT clause to show the changes it makes (this is not required for the repro):

    MERGE changes

    This confirms that the changes have been made as we requested: parent row 1 has changed; and row 4 has been added. The changes are reflected in the base tables:

    SELECT * FROM dbo.Parent AS p;
    SELECT * FROM dbo.Child AS c;

    Updated Base Table Data

    As highlighted, row 1 has changed from Apple to Kiwi Fruit and row 4 has been added.

    We do not expect to see row 4 in the indexed view because there are no child records for that row, and the indexed view uses an inner join. Checking the indexed view using the NOEXPAND table hint (required in non-Enterprise SKUs to use indexes on a view):

    SELECT *
    FROM dbo.ParentsAndChildren 
    WITH (NOEXPAND);

    Incorrect indexed view data

    The results are incorrect. They show the old value of the data for parent row 1.

    Now we try using the EXPAND VIEWS query hint to force SQL Server to access the base tables rather than reading view indexes:

    SELECT *
    FROM dbo.ParentsAndChildren
    OPTION (EXPAND VIEWS);

    Expand Views Hint

    This query produces correct results.

    On SQL Server Enterprise Edition, the optimizer chooses whether to access the indexed view or the base tables. For following query, without any hints, the optimizer chooses not to expand the view. It reads the view index and produces incorrect results:

    -- Enterprise Edition ONLY
    SELECT * 
    FROM dbo.ParentsAndChildren;

    Indexed View Matching

    Perhaps adding a child row to match the new parent row 4 will somehow fix things up?

    INSERT dbo.Child (parent_id) VALUES (4);
    GO
    SELECT * FROM dbo.ParentsAndChildren WITH (NOEXPAND);
    SELECT * FROM dbo.ParentsAndChildren OPTION (EXPAND VIEWS);

    After Insert

    No. The query plan that accesses the view index still returns an incorrect value for row 1. It seems MERGE has corrupted our indexed view.

    Analysis using DBCC CHECKTABLE

    Checking the view with DBCC CHECKTABLE returns no errors:

    DBCC CHECKTABLE (ParentsAndChildren);

    DBCC output 1

    Unless we use the EXTENDED_LOGICAL_CHECKS option:

    DBCC CHECKTABLE (ParentsAndChildren) WITH EXTENDED_LOGICAL_CHECKS;

    DBCC output 2

    The damage is repairable:

    ALTER DATABASE Sandpit 
    SET SINGLE_USER 
    WITH ROLLBACK IMMEDIATE;
    GO
    DBCC CHECKTABLE (ParentsAndChildren, REPAIR_REBUILD) WITH EXTENDED_LOGICAL_CHECKS;
    GO
    DBCC CHECKTABLE (ParentsAndChildren) WITH EXTENDED_LOGICAL_CHECKS;
    GO
    ALTER DATABASE Sandpit SET MULTI_USER;

    DBCC repair

    You probably do not want to set your database to SINGLE_USER mode and run a DBCC repair after every MERGE statement, however. We could also rebuild the indexed view’s clustered index manually, of course.

    Cause

    For the MERGE statement above, the query optimizer builds a plan that does not update the indexed view (click to enlarge):

    Incorrect Update Plan

    In a version of SQL Server with the fix applied, the same MERGE statement produces a plan that does maintain the indexed view:

    Correct Update Plan

    The plan operators used to keep the view index in step with the base tables are highlighted. Without these operators, the changes to the base table are not correctly written to any indexes defined on the view. The root cause is related to the same simplification that allows the optimizer to remove the reference to the Parent table in this query:

    SELECT 
        COUNT_BIG(*)
    FROM dbo.Parent AS p 
    JOIN dbo.Child AS c ON 
        c.parent_id = p.parent_id;

    Simplification Example Plan

    The FOREIGN KEY relationship and NOT NULL constraints on the referencing column together mean that the join to Parent cannot affect the result of the query, so the join is simplified away. In SQL Server 2012, we can see when this simplification is performed because the following message appears when undocumented trace flags 8619 and 3604 are enabled during compilation:

    Trace Flag 8619 output

    The same message is emitted when a MERGE statement contains a WHEN MATCHED THEN UPDATE clause and either a WHEN NOT MATCHED THEN INSERT or WHEN MATCHED … THEN DELETE clause. These conditions combine such that the optimizer incorrectly concludes that a table reference can be removed, when in fact it is needed later on when the update side of the plan is built.

    Other details of the query and database can affect whether the simplification can be misapplied. For example, if the FOREIGN KEY constraint contains an ON DELETE CASCADE clause, and the MERGE contains a DELETE clause, the simplification is not performed, the TF 8619 message does not appear, and the bug does not manifest.

    The key to determining whether a particular query is vulnerable to this bug (TF 8619 aside) is to check whether the query plan includes operators to maintain the indexed view. At a minimum, you should see a update operator for the view:

    View Update Operator

    SQL Sentry Plan Explorer identifies the operator as applying to a view explicitly, in SSMS you need to click on the graphical operator and inspect the Properties window.

    Summary

    The updated conditions for incorrect results are:

    • An indexed view that joins tables
    • Two of the tables have a single-column FOREIGN KEY constraint
    • A MERGE statement contains an UPDATE action that affects one of the tables
    • The MERGE statement also contains an INSERT or DELETE action (or both)
    • The optimizer applies a simplification that removes a table reference based on the FK relationship and other metadata
    • As a result, the MERGE execution plan does not contain the operators necessary to correctly maintain the indexed view
    • A subsequent query plan accesses an index on the view, either explicitly or via indexed-view matching (Enterprise Edition)

    Note:

    • The simplification is not applied in tempdb
    • The simplification is not applied to multi-column foreign key constraints

    Under these conditions, the indexes on the view do not reflect the state of the base tables and incorrect results are returned. Once the hot fix is applied, the optimizer does not misapply the simplification so the correct indexed view maintenance features are built into execution plans.

    Update: I am adding this based on Ian Yates’ great question in the comments: my expectation is that applying this hotfix will not remove any existing indexed view corruption. You would need to test the hotfix as usual, apply it, and then either rebuild all affected indexed views manually, or run DBCC CHECKDB with the EXTENDED_LOGICAL_CHECKS option (which could take a long time).

    © 2013 Paul White – All Rights Reserved

    Ill_Be_There4

    Twitter: @SQL_Kiwi
    Email: SQLKiwi@gmail.com

  • A creative use of IGNORE_DUP_KEY

    I'll_Be_There

    Let’s say you have a big table with a clustered primary key, and an application that inserts batches of rows into it. The nature of the business is that the batch will inevitably sometimes contain rows that already exist in the table.

    The default SQL Server INSERT behaviour for such a batch is to throw error 2627 (primary key violation), terminate the statement, roll back all the inserts (not just the rows that conflicted) and keep any active transaction open:

    CREATE TABLE #Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
    GO
    -- Start with values 1 & 5 in the table
    INSERT #Big (pk) VALUES (1), (5);
     
    -- Our batch transaction
    BEGIN TRANSACTION;
        -- Insert a batch containing some pre-existing rows
        INSERT #Big (pk) VALUES (1), (2), (3), (4), (5);
     
        -- Show the contents of the table after the insert statement
        SELECT pk FROM #Big;
     
        -- Show the transaction count
        SELECT tran_count = @@TRANCOUNT;
     
    -- Rollback
    ROLLBACK TRANSACTION;
     
    -- Final table contents
    SELECT pk FROM #Big;
    GO
    -- Tidy up
    DROP TABLE #Big;

    The output is:

    INSERT with duplicate values

    Ignoring Duplicates

    Let us further imagine that the desired behaviour is that new rows in a batch should be inserted, and any duplicates silently rejected. Most importantly, no error messages should be returned to the client. Ideally, this would be achieved without immediate application changes, and without impacting concurrent users of the table (the instance is Enterprise Edition, so online operations are available).

    This seems like an ideal use for the IGNORE_DUP_KEY option:

    CREATE TABLE dbo.Big
    (
        pk int NOT NULL,
        
        CONSTRAINT PK_Big 
        PRIMARY KEY (pk)
        WITH (IGNORE_DUP_KEY = ON)
    );
    GO
    -- Unique values
    INSERT dbo.Big (pk)
    VALUES (1), (3), (5);
    GO
    -- key 3 already exists
    INSERT dbo.Big (pk)
    VALUES (2), (3), (4);

    That script executes successfully with just a warning (not an error!) about the duplicate key:

    IGNORE_DUP_KEY output

    The problem

    We would like to add the IGNORE_DUP_KEY option to the existing primary key constraint, but the ALTER TABLE command does not have an ALTER CONSTRAINT clause. We do not want to drop the existing primary key and recreate it with the new option, because the table is large and we want to avoid disrupting concurrent users. Dropping and recreating would result in rebuilding the entire table first as a heap and then as a clustered table again.

    Although there is no ALTER CONSTRAINT syntax, we do know that certain constraint modifications can be performed using ALTER INDEX REBUILD on the index used to enforce the constraint. We can, for example, change the type of compression used, the index fill factor, and whether row locking is enabled:

    CREATE TABLE dbo.Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
    GO
    INSERT dbo.Big (pk)
    VALUES (1), (2), (3), (4), (5);
    GO
    ALTER INDEX PK_Big ON dbo.Big
    REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);
    GO
    ALTER INDEX PK_Big ON dbo.Big
    REBUILD WITH (FILLFACTOR = 90, ALLOW_ROW_LOCKS = OFF, ONLINE = ON);
    GO
    DROP TABLE dbo.Big;

    Unfortunately, we cannot use the same trick to add the IGNORE_DUP_KEY option to the underlying index for the primary key:

    ALTER INDEX with IGNORE_DUP_KEY

    The same error message results even if the ONLINE = ON option is not specified. There will be a range of views about how accurate the error message text is here, but we cannot avoid the fact that the operation we want to perform is not supported by SQL Server.

    A creative workaround

    One idea is to add a new UNIQUE constraint (or index) on the same columns as the primary key, but with the IGNORE_DUP_KEY option added. This operation can be performed ONLINE:

    -- Existing table
    CREATE TABLE dbo.Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
    GO
    -- Existing data
    INSERT dbo.Big (pk) VALUES (1), (2), (3);
    GO
    -- New constraint (or index) with IGNORE_DUP_KEY, added ONLINE
    ALTER TABLE dbo.Big
    ADD CONSTRAINT UQ_idk
    UNIQUE NONCLUSTERED (pk)
    WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);
     
    --CREATE UNIQUE NONCLUSTERED INDEX UQ_idk 
    --ON dbo.Big (pk)
    --WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);
    GO
    -- Key 3 is a duplicate, just a warning now
    INSERT dbo.Big (pk) VALUES (3), (4), (5);
    GO
    SELECT pk FROM dbo.Big;

    The new arrangement results in the correct final state of the database, without throwing an error. The effect is the same as if we had been able to modify the existing primary key constraint to add IGNORE_DUP_KEY:

    Output with duplicate unique index

    Execution plan analysis

    There are some drawbacks to this idea, which we will explore in some detail. The drawbacks are significant, so adding the extra index with IGNORE_DUP_KEY is only really suitable as a temporary solution, as we will see. The first INSERT in the previous script (without the extra constraint or index with IGNORE_DUP_KEY) is pretty trivial; it just shows the constants from the VALUES clause being inserted to the clustered index:

    Trivial INSERT

    The second INSERT (with the IGNORE_DUP_KEY index) is rather more complex (click to enlarge):

    INSERT with IGNORE_DUP_KEY

    An extra index to maintain

    One fairly obvious consequence of adding the new index is that the Clustered Index Insert operator now has to maintain the new nonclustered index too. I used Plan Explorer above because it shows the per-row nonclustered index insert more explicitly than SSMS, where you have to dig into the Object node in the Properties window with the relevant graphical operator selected:

    SSMS Object Node

    Another way to see the nonclustered index maintenance explicitly is to run the query again with undocumented trace flag 8670 to produce a wide update plan:

    Wide update plan

    Besides maintaining the extra index, the IGNORE_DUP_KEY plans seem to be doing a lot of extra work: there are lots of new operators compared with the simple insert. As you would expect, all the new operators are associated with ignoring duplicate keys, and there are two distinct cases to consider.

    Rows that already exist

    The first case relates to checking for INSERT rows that already exist in the base table. This checking is implemented in the execution plan by the Left Semi Join, Index Seek, and Assert operators:

    Plan Right Fragment

    The Index Seek looks for the key value of the current row we are looking to insert. The Semi Join cannot be an Inner Join because that would reject rows where a match was not found (and we need to insert a row in that case). Nevertheless, the query processor needs to know if an existing row was found with the same key value. The Semi Join sets a value to indicate this, which is stored in the Probe Column:

    Nested Loops Operator Properties

    The Assert operator (known internally as a Stream Check) tests a condition and raises an error if the test fails. Assert operators are often seen in plans that affect columns with CHECK constraints, for example. In this case, however, the Assert does not raise an error, it emits the ‘Duplicate key was ignored.’ message instead, and only passes a row on to its parent (the Sort operator) if the condition passes. The check performed by the Assert is based on the Probe value stored in the expression labelled [Expr1013] above:

    Assert Operator Tooltip

    The Asset passes rows where the predicate evaluates to anything other than NULL. If a match was found by the Index Seek (so the Probe Column, Expr1013 is not NULL) the Assert does not pass on the row and raises a warning instead. The following TF 8607 output fragment for the INSERT statement shows the option (there is nothing equivalent in regular show plan output unfortunately):

    WarnIgnoreDuplicate

    Duplicates within the Insert Set

    The first check only looks for rows from the insert set that already exist in the base table. The query processor also needs to check for duplicates within the set of values we are inserting. The Segment and Top operators in the plan combine to meet this requirement:

    Plan Middle Fragment

    The Segment requires a stream ordered by the index keys, and adds a flag to indicate the start of a new group of key values. The Top operator returns only one row from each group. The overall effect is to remove rows with duplicate index key values.

    The Top and Segment execution operators together implement a single physical query processor operator called a “Group-By Top”. I mention this detail because we need to be familiar with the internal name to understand the TF 8607 output, which indicates that this operator also emits a `Duplicate key was ignored.’ warning when it encounters a group with more than one key value:

    WARN-DUP

    Locking

    Adding the new index means that locks will now be taken on the nonclustered index when checking for existing rows:

    Index Seek Locks

    It might surprise you to learn that the index seek acquires Range S-U locks on the nonclustered index (extract from the Books Online link below):

    Range Locks

    Key-range locks like this are only taken under the SERIALIZABLE isolation level, but our INSERT statement uses them whatever isolation level the user runs the query under (for example, range locks are still taken if the session is running under the default READ COMMITTED isolation level).

    SQL Server raises the effective isolation level to SERIALIZABLE for the Index Seek (and just that operator) because it needs to be sure that if it does not find a match in the index, that situation remains the same until it inserts the new row. The Index Seek is looking for a key value that does not exist, so a range lock is necessary to prevent a concurrent transaction inserting a row with that key value before we do. There is no existing key in the index to lock (because the row does not exist) so a range lock is the only option. If SQL Server did not do this, our INSERT query might fail with a duplicate key error despite the IGNORE_DUP_KEY setting (we check for a row, don’t find it, someone else inserts it, then we try to).

    The optimizer adds a SERIALIZABLE hint for the Index Seek operator (internally, a physical Range operator) as we can see using another extract from the TF 8607 output:

    Query Processor Hints

    The execution plan is also forced to use the nonclustered index with the IGNORE_DUP_KEY setting for this seek (FORCEDINDEX) and an update lock (UPDLOCK) hint is also given to help prevent conversion deadlocks. Nevertheless, you may find increased deadlocking if you choose to add an extra IGNORE_DUP_KEY index like this.

    Permanent solutions

    Adding a nonclustered unique index with the IGNORE_DUP_KEY option and the same key as the clustered primary key allowed us to solve an immediate problem without code changes, while keeping the table online, but it does come at a price. The execution plan is much more complex (and expensive) than the original INSERT, and there is a chance of deadlocks. The biggest performance impact of adding the extra index is of course the cost of maintaining it, meaning we need to look at a more permanent solution via a code change.

    IGNORE_DUP_KEY

    The (trivial) test I am going to run inserts 5000 unique rows from my table of numbers into the table we have been using so far. To establish a baseline, we will first look at the execution plan for an INSERT to the table with a nonclustered IGNORE_DUP_KEY index:

    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000;

    IGNORE_DUP_KEY test

    This plan is very similar to the one we analysed earlier, though a Merge is used to perform the Left Semi Join, and a wide update plan has been chosen, making the nonclustered index insert easier to see. The other main feature is an Eager Table Spool, required for Halloween Protection. The estimated cost of this execution plan is 0.363269 cost units on my installation.

    Modified INSERT

    The first code alternative to the extra index is to modify the INSERT statement to check that rows do not already exist in the destination. There are a number of SQL syntaxes for this, each with different characteristics and performance in different circumstances. To keep things simple, and because I only want to make a couple of specific points here, I have chosen my simple example to produce the same execution plans for all the common syntax variants. The first thing to do is to drop our extra constraint:

    ALTER TABLE dbo.Big
    DROP CONSTRAINT UQ_idk;

    Now we can evaluate the modified INSERT statement:

    -- I prefer this syntax
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND NOT EXISTS (SELECT * FROM dbo.Big WHERE pk = n);
     
    -- With EXCEPT
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    EXCEPT
    SELECT pk FROM dbo.Big;
     
    -- Not recommended
    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND n NOT IN (SELECT pk FROM dbo.Big);

    All three produce this execution plan:

    Modified INSERT test

    This has an estimated cost of 0.0981188 units – much cheaper than the 0.363269 cost seen previously. The Halloween Protection spool still features in the plan, but there is a weakness in our queries that you might have already spotted. We are not doing anything to protect against duplicate key violation errors in case someone else inserts a row after we have checked to see if it exists, but before we insert it. The query optimizer added a SERIALIZABLE hint when it added an existence check, so if avoiding errors is important to us, we need to do the same:

    INSERT dbo.Big (pk)
    SELECT n FROM dbo.Numbers
    WHERE n BETWEEN 1 AND 5000
    AND NOT EXISTS (SELECT * FROM dbo.Big WITH (SERIALIZABLE) WHERE pk = n);

    We do not need the UPDLOCK hint for two reasons. First, the engine automatically takes update locks for us when reading the source table. Second, we are reading from the same index we are inserting to (not reading from a nonclustered index and inserting to the clustered index) so the previous deadlock scenario is not applicable.

    Using MERGE

    Another option is to use the WHEN NOT MATCHED feature of the Merge statement. This time we will add the necessary SERIALIZABLE hint up front:

    MERGE dbo.Big WITH (SERIALIZABLE) AS b
    USING (SELECT n FROM dbo.Numbers WHERE n BETWEEN 1 AND 5000) AS s
    ON s.n = b.pk
    WHEN NOT MATCHED THEN 
    INSERT (pk) VALUES (s.n);

    MERGE test

    This plan has an estimated cost of 0.0950127 units – slightly less than the 0.0981188 units for the modified INSERT plans. Some of this improvement is due to the lack of a Halloween Protection spool, for interesting reasons I cover in depth in a short series of articles to be published shortly (update: part one is out).

    These are not meant to be performance tests by any stretch of the imagination. There are any number of subtle factors that will affect the execution plans and run times for different numbers of rows, different distributions, and so on. I should stress that I normally find MERGE plans perform less well than separate INSERT/UPDATE/DELETE more often than not. Anyway, in case you are interested, typical performance results on my machine for this specific test are (INSERT first, then MERGE), timings in milliseconds:

    Performance results

    IGNORE_DUP_KEY and Clustered Indexes

    When IGNORE_DUP_KEY is specified for a unique clustered index (primary key or otherwise), duplicates are handled by the storage engine rather than the query processor. For example:

    CREATE TABLE dbo.T (pk int CONSTRAINT PK_idk PRIMARY KEY WITH (IGNORE_DUP_KEY = ON));
    INSERT T VALUES (1), (1), (2), (3);

    The execution plan shows none of the complexity of the nonclustered index case:

    image

    The TF 8607 output contains none of the query processor hints to ignore duplicates and raise warnings:

    image

    The IGNORE_DUP_KEY side of things is handled entirely by the storage engine – if it finds a row in the unique index where it was going to insert one, it flags a warning but does not try to insert (which would cause an error). The estimated cost of this query plan is identical whether the unique clustered index has IGNORE_DUP_KEY or not – there is no extra work for the query processor.

    If you are wondering why the same mechanism is not used for a nonclustered unique index with the IGNORE_DUP_KEY option, my understanding is it is because the Clustered Index is always updated before the nonclustered ones (even in a narrow plan). By the time the storage engine detected a duplicate in the nonclustered index, it would have already added the row to the base table. So, the query processor handles IGNORE_DUP_KEY for nonclustered indexes.

    This is my understanding of the mechanics of IGNORE_DUP_KEY, I don’t know for sure and will happily correct any details I may have wrong. Any storage engine experts reading this please contact me if so.

    Acknowledgement

    This post was inspired by a post by Daniel Adeniji and our subsequent discussion of it. My thanks to him for permission to analyse the issue further here, and for allowing me to reference his original. The issue described here may not accurately reflect Daniel’s original problem – I used a certain amount of artistic licence. Even if the specific issue is not particularly interesting to you, I hope you enjoyed some aspects of the analysis and maybe picked up some new information along the way.

    © 2013 Paul White – All rights reserved
    twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com

  • Optimizing T-SQL queries that change data

    I'll_Be_There[4]

    Most tuning efforts for data-changing operations concentrate on the SELECT side of the query plan. Sometimes people will also look at important storage engine considerations like locking and transaction log throughput that can have dramatic effects. As a consequence, a number of common practices have emerged, such as avoiding large numbers of row locks and lock escalation, splitting large changes into smaller batches of a few thousand rows and combining a number of small changes into a single transaction in order to optimize log flushes.

    This is all good of course, but what about the data-changing side of the query plan – the INSERT, UPDATE, DELETE or MERGE operation itself – are there any query processor considerations we should take into account? The short answer is yes. The query optimizer considers different plan options for the write-side of an I/U/D/M query, though there isn’t a huge amount of T-SQL language support that allows us to affect these choices directly. Nevertheless, there are things to be aware of, and things we can look to change.

    Side note: In this post I am going to use the term update (lower case) to apply to any operation that changes the state of the database (INSERT, UPDATE, DELETE and MERGE in T-SQL). This is a common practice in the literature, and is used inside SQL Server too as we will see.

    The Three Basic Update Forms

    First a bit of background. All query plans execute using a demand-driven iterator model, and updates are no exception. Parent operators drive child operators (to the right of the parent) to do work by asking them for a row at a time. Take the following simple AdventureWorks INSERT for example:

    DECLARE @T AS TABLE (ProductID int);
    INSERT @T (ProductID) 
    SELECT p.ProductID 
    FROM Production.Product AS p;

    Query Plan Execution Order Example

    Plan execution starts at the far left, where you can think of the green T-SQL INSERT icon as representing rows returned to the client. This root node asks its immediate child operator (the Table Insert) for a row. The Table Insert requests a row from the Index Scan, which provides one. This row is inserted into the heap, and an empty row is returned to the root node (if the INSERT query contained an OUTPUT clause, the returned row would contain data). This row-by-row process continues until all source rows have been processed. Notice that the XML show plan output shows the Heap Table Insert is performed by an Update operator internally. Ok, on to the matter at hand:

    1. Wide, per-index updates

    Wide per-index update plans have a separate update operator for each clustered and nonclustered index. It is always possible for the optimizer to produce this type of plan; the options described later have not been (or cannot be) implemented when certain engine features are used. For example, if the update target has an indexed view defined on it, a wide update plan has to be used to maintain the indexed view. An example per-index update plan is shown below:

    Wide update plan

    This plan updates the base table using a Clustered Index Delete operator. This operator may also read and output extra column values necessary to find and delete rows from nonclustered indexes. The iterative delete activity on the base table is driven by the Eager Table Spool on the top branch. The spool is eager because it stores all the rows from its input in a worktable before returning the first row to its parent operator (the Index Delete on the top branch). The effect is that all qualifying base table rows are deleted before any nonclustered index changes occur.

    The spool in this plan is a common subexpression spool. It is populated once and then acts as a row source for multiple consumers. In this case, the contents of the spool are consumed first by the top-branch Index Delete, which removes index entries from one of the nonclustered indexes. When the Sequence operator switches to asking for rows from the lower branch, the spool is rewound and replays its rows for the second nonclustered index. Note that the spool contains the UNION of all columns required by nonclustered index changes.

    2. Narrow, per-row updates

    Narrow per-row updates have a single update operator which maintains the base table (heap or clustered) and one or more nonclustered indexes. Each row arriving at the update operator updates all indexes associated with the operator before processing the next row. An example:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int UNIQUE, col2 int UNIQUE);
    DECLARE @S AS TABLE (pk int PRIMARY KEY, col1 int, col2 int);
     
    INSERT @T (pk, col1)
    SELECT
        s.pk,
        s.col1 
    FROM @S AS s;

    The execution plan viewed using SQL Sentry Plan Explorer shows the nonclustered index maintenance explicitly (hover over the Clustered Index Insert for a tooltip showing the names of the nonclustered indexes involved):

    Plan Explorer Narrow Update Plan

    In SQL Server Management Studio, there is no information about the nonclustered index maintenance in the graphical plan or tooltips, we need to click on the update operator (Clustered Index Insert) and then check the Properties window:

    SSMS narrow index updates

    This shows the clustered index and two nonclustered indexes being maintained. Because all indexes are updated for each row streaming through the update operator, this plan shape is also known as a per-row update plan.

    The query optimizer uses cost-based reasoning to decide whether to update each nonclustered index separately (per-index) or as part of the base table changes (per-row). An example that updates one nonclustered index per-row, and another per-index is shown below:

    CREATE TABLE #T (pk int IDENTITY NOT NULL, col1 int, col2 varchar(8000) DEFAULT '');
    ALTER TABLE #T ADD CONSTRAINT PK_T PRIMARY KEY (pk);
    CREATE INDEX nc1 ON #T (col1);
    CREATE INDEX nc2 ON #T (col1) INCLUDE (col2);

    The combination strategy can be seen with Plan Explorer (click to expand in a new tab):

    Combination index update

    The details are also displayed in the SSMS Properties window when the Clustered Index Insert operator is selected as well.

    3. Single-operator updates

    The third form of update is an optimization of the per-row update plan for very simple operations. The cost-based optimizer still emits a per-row update plan, but a rewrite is subsequently applied to collapse the reading and writing operations into a single operator:

    -- Simple operations
    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
    INSERT @T (pk, col1) VALUES (1, 1000);
    UPDATE @T SET col1 = 1234 WHERE pk = 1;
    DELETE @T WHERE pk = 1;

    The complete execution plans (and extracts from the XML plans) are shown below:

    Single operator update plans

    The UPDATE and DELETE plans have been collapsed to a `Simple Update` operation, while the INSERT is simplified to a `Scalar Insert`. We can see the original optimizer output tree with trace flags 3604 and 8607:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
     
    DBCC TRACEON (3604, 8607);
     
    INSERT @T (pk, col1) VALUES (1, 1000) OPTION (RECOMPILE);
    UPDATE @T SET col1 = 1234 WHERE pk = 1 OPTION (RECOMPILE);
    DELETE @T WHERE pk = 1 OPTION (RECOMPILE);
     
    DBCC TRACEOFF (3604, 8607);

    The trace output for the INSERT statement shown above is:

    Optimizer output tree

    Notice the two physical operators: a constant scan (in-memory table) containing the literal values specified in the query; and a stream update operator that performs the database changes per row. All three statements produce a similar optimizer tree output (and all use a stream update). We can prevent the single-operator optimization being applied with undocumented trace flag 8758:

    DECLARE @T AS TABLE (pk int PRIMARY KEY, col1 int);
     
    DBCC TRACEON (8758);
     
    INSERT @T (pk, col1) VALUES (1, 1000) OPTION (RECOMPILE);
    UPDATE @T SET col1 = 1234 WHERE pk = 1 OPTION (RECOMPILE);
    DELETE @T WHERE pk = 1 OPTION (RECOMPILE);
     
    DBCC TRACEOFF (8758);

    This exposes the expanded per-row update plans produced by the optimizer before the single-operator rewrite:

    Expanded single operator plan

    Single operator plans can be significantly more efficient than the original multiple-operator form in the right circumstances, for example if the plan is executed very frequently.

    Update plan optimizations

    From this point forward, I’m going to use simple AdventureWorks DELETE examples, but the general points apply equally well to INSERT, UPDATE and MERGE as well. The examples will not have a complicated SELECT component, because I want to concentrate on the update side of the plan rather than the reading side.

    Per-row updates

    It’s worth emphasising again that per-row update plans are an optimization that is not available for every update query. The optimizer favours a per-row plan if the expected number of rows at the update operator is low. In the example below, the optimizer expects to delete 25 rows:

    Per-row updates

    This plan updates the base table (a clustered index in this case) and all nonclustered indexes in sequence per row. The row source here is an ordered scan of a nonclustered index, which means rows will not generally be presented in clustered index key order to the update operator.

    An unordered stream like this tends to result in random I/O (assuming physical I/O is required at all, of course). If the plan is expected to process only a small number of rows, the optimizer decides it is not worth adding extra operations to encourage sequential I/O.

    Unordered Prefetch

    If the expected number of rows is a bit larger, the optimizer may decide to build a plan that applies prefetching to one or more update operators. The basic idea is the same as ordinary prefetching (a.k.a read-ahead) for scans and range seeks. The engine looks ahead at the keys of the incoming stream and issues asynchronous IOs for pages that will be needed by the update operator soon. Prefetching helps reduce the impact of the random I/O mentioned a moment ago. An example of prefetching on the update operator is shown below for the same query and an expected cardinality of 50 rows:

    image

    The prefetching is unordered from the perspective of the clustered index. Neither SSMS nor Plan Explorer show the prefetch information prominently. In Plan Explorer, we need to have the With Unordered Prefetch optional column selected. To do this, switch to the Plan Tree tab, open the Column Chooser, and drag the column to the grid. The unordered prefetch property is ticked for the Clustered Index Delete operator in the screenshot below:

    Unordered Prefetch in Plan Explorer

    In SSMS, click on the Clustered Index Delete icon in the graphical plan, and then look in the Properties window:

    SSMS Unordered Prefetch

    If the query plan were reading from the clustered index, the read operation would issue regular read-ahead, so there would be no point prefetching from the Clustered Index Delete as well. The example below shows the expected cardinality increased to 100, where the optimizer switches from scanning the nonclustered index with unordered prefetching to scanning the clustered index. No prefetching occurs on the update operator in this plan:

    No clustered index prefetch

    Where the plan property With Unordered Prefetch is not set, it is simply omitted – it does not appear set to False as you might expect.

    Ordered Prefetch

    Where rows are expected to arrive at an update operator in non-key order, the optimizer might consider adding an explicit sort. The idea is that presenting rows in key order will encourage sequential I/O for the index pages. The optimizer weighs the cost of the sort against the expected benefits of avoiding random I/O. The execution plan below shows an example of a sort being added for this reason:

    Sort to reduce random I/O

    The sort is on Transaction ID, the clustering key for this table. With the incoming stream now sorted, the update operator can use ordered prefetching. The plan is still scanning the nonclustered index on the read side, so read-ahead issued by that operator does not bring in the clustered index pages needed by the update operator. Ordered prefetching tends to result in sequential I/O, compared with the mostly random I/O of unordered prefetching. The With Ordered Prefetch column is also not shown by default by Plan Explorer, so it needs to be added as we did before:

    Plan Explorer Ordered Prefetch

    Not every update operator with ordered prefetching requires a sort. If the optimizer finds a plan that naturally presents keys in nonclustered index order, an update operator may still use ordered prefetching.

    DML Request Sort

    When the optimizer decides to explore a plan alternative that requires ordered input to an update operator, it will set the DML Request Sort property for that operator. Plan Explorer shows this setting in the update operator tooltip:

    Plan Explorer DML Request Sort

    SSMS shows With Ordered Prefetch and DML Request Sort in the Properties window:

    SSMS DMLRequestSort and WithOrderedPrefetch

    Nonclustered index updates

    Sorting and ordered prefetching may also be considered for the nonclustered indexes in a per-index update plan.

    Sort for nonclustered index

    This plan shows all three update operators with DML Request Sort set. The Clustered Index Delete does not require an explicit sort because rows are being read (with internal ordering guarantees) from the Clustered Index Scan. The two non-clustered indexes do require explicit sorts, however. Both nonclustered index update operators also use ordered prefetching; the clustered index update does not because the scan automatically invokes read-ahead for that index.

    The next example shows that each update operator is considered separately for the various optimizations:

    Indexes optimized separately

    That plan features unordered prefetching on the Clustered Index Delete (because the row source is a scan of a nonclustered index), an explicit sort with ordered prefetching on the top branch Index Delete, and unordered prefetching on the bottom branch Index Delete.

    The per-index and per-row update plan

    Is the following a per-index or per-row update plan?

    Per-index and per-row plan

    It appears to be a per-index plan because there are two separate update operators, but there is no blocking operator (eager spool or sort) between the two. This plan is a pipeline: each row returned from the seek will be processed first by the Clustered Index Delete and then by the Index Delete. In that sense, the both updates (clustered and nonclustered) are certainly per-row.

    The important thing is not the terminology, it is being able to interpret the plan. Now we know what the various optimizations are for, we can see why the optimizer chose the plan above over the seemingly equivalent but more efficient-looking narrow plan:

    Narrow plan

    The narrow plan has no prefetching. The Seek provides read-ahead for the clustered index pages, but the single non-clustered index has nothing to prefetch any pages from disk it might need. By contrast, the previous wide update plan has two update operators. The Clustered Index Delete has no prefetching for the same reason as in the narrow plan, but the Index Delete has unordered prefetching enabled.

    So, although the smaller narrow plan looks like it ought to be more efficient, it might perform less well if nonclustered index pages are required from disk and unordered prefetching proves effective. On the other hand, if all required nonclustered index pages are in memory at the time the query is executed, the wide plan might perform less well since it features an extra operator and has to perform all the work associated with prefetching (even though ultimately no physical reads are issued).

    Performance Impacts

    You may have noticed a lack of performance numbers (I/O statistics, elapsed times and so on) in this post. There is a good reason for this. The optimizations presented here are quite dependent on SQL Server configuration and hardware. I don’t want you drawing general conclusions based on the performance of the rather ordinary single SATA drive in my laptop, the 256MB of RAM I have allowed for my SQL Server 2012 instance, and some rather trivial AdventureWorks examples.

    Which Optimizations are Good?

    It depends, naturally.

    If the data structures you are changing are very likely to be in memory, and/or if you have a very fast random I/O system, the optimizer’s goal of minimizing random I/O and enhancing sequential I/O may not make much sense for you. The strategies the optimizer employs may well end up costing you much more than they save.

    On the other hand, if your system has a much smaller buffer pool than database working set size, and your I/O system works very much faster with large sequential I/O than with smaller random I/O requests, you should generally find the optimizer makes reasonable choices for you, subject to the usual caveats about useful up-to-date statistics and accurate cardinality estimation, of course.

    If your system fits the optimizer’s model, at least to a first approximation, you will usually find that narrow update plans are best for low cardinality update operations, unordered prefetching helps a bit when more rows are involved, ordered prefetching helps more, and explicit sorts before nonclustered index updates helps most of all for the largest sets. That is the theory, anyway.

    What are the Problems?

    The problem I see most often is with the optimization that is supposed to help most: the explicit sort before a nonclustered index update. The idea is that sorting the input to the index update in key order promotes sequential I/O, and often it does. The problem occurs if the workspace memory allocated to the sort at execution time proves to be too small to perform the sort entirely in memory. The memory grant is fixed based on cardinality estimation and row size estimates, and may be reduced if the server is under memory pressure at the time.

    In any case, however it occurs, a sort that runs out of memory spills whole sort runs to physical tempdb disk, often repeatedly. Unsurprisingly, this is not at all good for performance, and can result in queries that complete very much slower than if the sort had not been introduced at all. SQL Server reuses the general Sort operator for sorting index keys – it does not have a sort specially optimized for index updates which could make a best effort to sort within the memory allocated, but never spill to disk. Perhaps I should suggest this on Connect :)

    The narrow plan optimization can cause problems where it is selected due to a low cardinality estimate, where several nonclustered indexes need to be maintained, the necessary pages are not in memory, and the I/O system is slow at random reads.

    The ordered and unordered prefetch options cause problems much more rarely. For a system with all data in memory, there is a small but possibly measurable impact due to the prefetch overhead (finding pages to consider read-ahead for, checking to see if they are already in the buffer pool, and so on). Even if no asynchronous I/O is ever issued by the worker, it still spends time on the task that it could spend doing real work.

    What options are there?

    The usual answer to deal with poor execution plan choices due to a system’s configuration or performance characteristics not matching the optimizer’s model is to try different T-SQL syntax, and/or to try query hints. There are times when it is possible to rewrite the query to get a plan that performs better, and other times where rethinking the problem a bit can pay dividends. Various creative uses of partitioning, minimal logging, and bulk loading are possible in some cases, for example.

    There are very few hints that can help with the update side of a plan that does not respond to the usual tricks, however. The two I use most often affect the optimizer’s cardinality estimation - OPTION (FAST n) and a trick involving TOP and the OPTIMIZE FOR query hint. OPTION (FAST n) affects cardinality estimates in a plan by setting a final row goal, but its effects can be difficult to predict, and may not have much of an effect if the plan contains blocking operators. Anyway, the general idea is to vary the value of ‘n’ in the hint until the optimizer chooses the desired plan shape or optimization options as described in this post. Best of luck.

    The idea with TOP is similar, but tends to work better in my experience. The trick is to declare a bigint variable with the number of rows the query should return (or a very large value such as the maximum value of a bigint if all rows are required), use that as the parameter to a TOP, and then use the OPTIMIZE FOR hint to set a value for ‘n’ that the optimizer should use when considering alternative plans. This option is particularly useful for encouraging a narrow plan.

    Most of the examples in this post used the TOP trick in the following general form:

    DECLARE @n bigint = 9223372036854775807;
    DELETE TOP (@n) 
    FROM Production.TransactionHistory
    OPTION (OPTIMIZE FOR (@n = 50));

    No magic trace flags?

    Sure there are trace flags that affect the optimizations in this post. They are undocumented and unsupported of course, and may only work on some versions, but they can be handy to validate a performance analysis, or to generate a plan guide for a particularly crucial query. They can also be fun to play with on a personal system to explore their effects. The main ones that affect the optimizations described here are:


    2332 : Force DML Request Sort (CUpdUtil::FDemandRowsSortedForPerformance)

    8633:  Enable prefetch (CUpdUtil::FPrefetchAllowedForDML and CPhyOp_StreamUpdate::FDoNotPrefetch)

    8744 : Disable prefetch (CUpdUtil::FPrefetchAllowedForDML)

    8758 : Disable rewrite to a single operator plan (CPhyOp_StreamUpdate::PqteConvert)

    8790 : Force a wide update plan (CUpdUtil::PexprBuildWideUpdatePlan)

    8795 : Disable DML Request Sort (CUpdUtil::FDemandRowsSortedForPerformance)

    9115 : Disable prefetch (CUpdUtil::FPrefetchAllowedForDML)


    These trace flags can all be manipulated using the usual DBCC commands or enabled temporarily for a particular query using the OPTION (QUERYTRACEON xxxx) hint.

    Final Thoughts

    The optimizer has a wide range of choices when building the writing side of an update plan. It may choose to update one or more indexes separately, or as part of the base table update. It may choose to explicitly sort rows as part of a per-index update strategy, and may elect to perform unordered or ordered prefetching as well. As usual, these decisions are made based on costs computed using the optimizer’s model. This model may not always produce plans that are optimal for your hardware (and as usual the optimizer’s choices are only as good as the information you give it).

    If a particular update query is performance critical for you, make sure cardinality estimates are reasonably accurate. Test with alternate syntax and/or trace flags to see if an alternative plan would perform significantly better in practice. It is usually possible to use documented techniques like TOP clauses and OPTIMIZE FOR hints to produce an execution plan that performs well. Where that is not possible, more advanced tricks and techniques like plan guides may be called for.

    Thanks for reading today, I hope this post was interesting and provided some new insights into update query optimization.

    © 2013 Paul White – All rights reserved
    email: SQLkiwi@gmail.com
    twitter: @SQL_Kiwi

  • MERGE Bug with Filtered Indexes

    A MERGE statement can fail, and incorrectly report a unique key violation when:

    • The target table uses a unique filtered index; and
    • No key column of the filtered index is updated; and
    • A column from the filtering condition is updated; and
    • Transient key violations are possible

    Example Tables

    Say we have two tables, one that is the target of a MERGE statement, and another that contains updates to be applied to the target.  The target table contains three columns, an integer primary key, a single character alternate key, and a status code column.  A filtered unique index exists on the alternate key, but is only enforced where the status code is ‘a’:

    CREATE TABLE #Target 
    (
        pk          integer NOT NULL, 
        ak          character(1) NOT NULL,
        status_code character(1) NOT NULL,
     
        PRIMARY KEY (pk)
    );
     
    CREATE UNIQUE INDEX uq1
    ON #Target (ak)
    INCLUDE (status_code)
    WHERE status_code = 'a';

    The changes table contains just an integer primary key (to identify the target row to change) and the new status code:

    CREATE TABLE #Changes
    (
        pk          integer NOT NULL,
        status_code character(1) NOT NULL,
     
        PRIMARY KEY (pk)
    );

    Sample Data

    The sample data for the example is:

    INSERT #Target 
        (pk, ak, status_code)
    VALUES 
        (1, 'A', 'a'),
        (2, 'B', 'a'),
        (3, 'C', 'a'),
        (4, 'A', 'd');
     
    INSERT #Changes
        (pk, status_code)
    VALUES
        (1, 'd'),
        (4, 'a');

             Target                     Changes
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣
    ║  1 ║ A  ║ a           ║    ║  1 ║ d           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ a           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝
    ║  4 ║ A  ║ d           ║
    ╚════╩════╩═════════════╝

    The target table’s alternate key (ak) column is unique, for rows where status_code = ‘a’.  Applying the changes to the target will change row 1 from status ‘a’ to status ‘d’, and row 4 from status ‘d’ to status ‘a’.  The result of applying all the changes will still satisfy the filtered unique index, because the ‘A’ in row 1 will be deleted from the index and the ‘A’ in row 4 will be added.

    Merge Test One

    Let’s now execute a MERGE statement to apply the changes:

    MERGE #Target AS t
    USING #Changes AS c ON 
        c.pk = t.pk
    WHEN MATCHED 
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code;

    The MERGE changes the two target rows as expected.  The updated target table now contains:

    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ d           ║ <—changed from ‘a’
    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ a           ║ <—changed from ‘d’
    ╚════╩════╩═════════════╝

    Merge Test Two

    Now let’s repopulate the changes table to reverse the updates we just performed:

    TRUNCATE TABLE #Changes;
     
    INSERT #Changes
        (pk, status_code)
    VALUES
        (1, 'a'),
        (4, 'd');

    This will change row 1 back to status ‘a’ and row 4 back to status ‘d’.  As a reminder, the current state of the tables is:

             Target                        Changes
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣
    ║  1 ║ A  ║ d           ║    ║  1 ║ a           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ d           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝
    ║  4 ║ A  ║ a           ║
    ╚════╩════╩═════════════╝

    We execute the same MERGE statement:

    MERGE #Target AS t
    USING #Changes AS c ON 
        c.pk = t.pk
    WHEN MATCHED 
        AND c.status_code <> t.status_code 
        THEN UPDATE SET status_code = c.status_code;

    However this time we receive the following message:

    Msg 2601, Level 14, State 1, Line 1
    Cannot insert duplicate key row in object 'dbo.#Target' with unique index 'uq1'. The duplicate key value is (A).
    The statement has been terminated.

    Applying the changes using UPDATE

    Let’s now rewrite the MERGE to use UPDATE instead:

    UPDATE t
    SET status_code = c.status_code
    FROM #Target AS t
    JOIN #Changes AS c ON
        t.pk = c.pk
    WHERE
        c.status_code <> t.status_code;

    This query succeeds where the MERGE failed.  The two rows are updated as expected:

    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ a           ║ <—changed back to ‘a’
    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ d           ║ <—changed back to ‘d’
    ╚════╩════╩═════════════╝

    What went wrong with the MERGE?

    In this test, the MERGE query execution happens to apply the changes in the order of the ‘pk’ column.

    In test one, this was not a problem: row 1 is removed from the unique filtered index by changing status_code from ‘a’ to ‘d’ before row 4 is added.  At no point does the table contain two rows where ak = ‘A’ and status_code = ‘a’.

    In test two, however, the first change was to change row 1 from status ‘d’ to status ‘a’.  This change means there would be two rows in the filtered unique index where ak = ‘A’ (both row 1 and row 4 meet the index filtering criteria ‘status_code = a’).

    The storage engine does not allow the query processor to violate a unique key (unless IGNORE_DUP_KEY is ON, but that is a different story, and doesn’t apply to MERGE in any case).  This strict rule applies regardless of the fact that if all changes were applied, there would be no unique key violation (row 4 would eventually be changed from ‘a’ to ‘d’, removing it from the filtered unique index, and resolving the key violation).

    Why it went wrong

    The query optimizer usually detects when this sort of temporary uniqueness violation could occur, and builds a plan that avoids the issue.  I wrote about this a couple of years ago in my post Beware Sneaky Reads with Unique Indexes (you can read more about the details on pages 495-497 of Microsoft SQL Server 2008 Internals or in Craig Freedman’s blog post on maintaining unique indexes).  To summarize though, the optimizer introduces Split, Filter, Sort, and Collapse operators into the query plan to:

    1. Split each row update into delete followed by an inserts
    2. Filter out rows that would not change the index (due to the filter on the index, or a non-updating update)
    3. Sort the resulting stream by index key, with deletes before inserts
    4. Collapse delete/insert pairs on the same index key back into an update

    The effect of all this is that only net changes are applied to an index (as one or more insert, update, and/or delete operations).  In this case, the net effect is a single update of the filtered unique index: changing the row for ak = ‘A’ from pk = 4 to pk = 1.  In case that is less than 100% clear, let’s look at the operation in test two again:

             Target                     Changes                   Result
    ╔════╦════╦═════════════╗    ╔════╦═════════════╗    ╔════╦════╦═════════════╗
    ║ pk ║ ak ║ status_code ║    ║ pk ║ status_code ║    ║ pk ║ ak ║ status_code ║
    ╠════╬════╬═════════════╣    ╠════╬═════════════╣    ╠════╬════╬═════════════╣
    ║  1 ║ A  ║ d           ║    ║  1 ║ d           ║    ║  1 ║ A  ║ a           ║
    ║  2 ║ B  ║ a           ║    ║  4 ║ a           ║    ║  2 ║ B  ║ a           ║
    ║  3 ║ C  ║ a           ║    ╚════╩═════════════╝    ║  3 ║ C  ║ a           ║
    ║  4 ║ A  ║ a           ║                            ║  4 ║ A  ║ d           ║
    ╚════╩════╩═════════════╝                            ╚════╩════╩═════════════╝

    From the filtered index’s point of view (filtered for status_code = ‘a’ and shown in nonclustered index key order) the overall effect of the query is:

      Before           After
    ╔════╦════╗    ╔════╦════╗
    ║ pk ║ ak ║    ║ pk ║ ak ║
    ╠════╬════╣    ╠════╬════╣
    ║  4 ║ A  ║    ║  1 ║ A  ║
    ║  2 ║ B  ║    ║  2 ║ B  ║
    ║  3 ║ C  ║    ║  3 ║ C  ║
    ╚════╩════╝    ╚════╩════╝

    The single net change there is a change of pk from 4 to 1 for the nonclustered index entry ak = ‘A’.  This is the magic performed by the split, sort, and collapse.  Notice in particular how the original changes to the index key (on the ‘ak’ column) have been transformed into an update of a non-key column (pk is included in the nonclustered index).  By not updating any nonclustered index keys, we are guaranteed to avoid transient key violations.

    The Execution Plans

    The estimated MERGE execution plan that produces the incorrect key-violation error looks like this (click to enlarge in a new window):

    MERGE plan

    The successful UPDATE execution plan is (click to enlarge in a new window):

    UPDATE execution plan

    The MERGE execution plan is a narrow (per-row) update.  The single Clustered Index Merge operator maintains both the clustered index and the filtered nonclustered index.  The UPDATE plan is a wide (per-index) update.  The clustered index is maintained first, then the Split, Filter, Sort, Collapse sequence is applied before the nonclustered index is separately maintained.

    There is always a wide update plan for any query that modifies the database. The narrow form is a performance optimization where the number of rows is expected to be relatively small, and is not available for all operations.  One of the operations that should disallow a narrow plan is maintaining a unique index where intermediate key violations could occur.

    Workarounds

    The MERGE can be made to work (producing a wide update plan with split, sort, and collapse) by:

    • Adding all columns referenced in the filtered index’s WHERE clause to the index key (INCLUDE is not sufficient); or
    • Executing the query with trace flag 8790 set e.g. OPTION (QUERYTRACEON 8790).

    Undocumented trace flag 8790 forces a wide update plan for any data-changing query (remember that a wide update plan is always possible).  Either change will produce a successfully-executing wide update plan for the MERGE that failed previously.

    Conclusion

    The optimizer fails to spot the possibility of transient unique key violations with MERGE under the conditions listed at the start of this post.  It incorrectly chooses a narrow plan for the MERGE, which cannot provide the protection of a split/sort/collapse sequence for the nonclustered index maintenance.

    The MERGE plan may fail at execution time depending on the order in which rows are processed, and the distribution of data in the database.  Worse, a previously solid MERGE query may suddenly start to fail unpredictably if a filtered unique index is added to the merge target table at any point.

    Connect bug filed here

    Bug reproduced on the following SQL Server versions (all x64 Developer Edition):

    2012 SP1 CUI (build 11.0.3321 – November 2012)
    2008 R2 SP2 CU3 (build 10.50.4266 – October 2012)
    2008 SP3 CU8 (build 10.0.5828 – November 2012)

    © 2012 Paul White – All Rights Reserved

    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Cardinality Estimation Bug with Lookups

    Estimated row counts on key or RID lookups where a filtering predicate is applied can be wrong in SSMS execution plans.  This error does not affect the optimizer’s ultimate plan selection, but it does look odd.  There are other cases where estimated row counts are inconsistent (for defensible reasons) but the behaviour shown in this post in certainly a bug.

    SQL Server 2005

    The following AdventureWorks sample query displays TransactionHistory information for ProductID #1 and the last three months of 2003 (if you are using a more recent version of AdventureWorks, you will need to change the year from 2003 to 2007):

    SELECT
        th.ProductID,
        th.TransactionID,
        th.TransactionDate
    FROM Production.TransactionHistory AS th 
    WHERE 
        th.ProductID = 1 
        AND th.TransactionDate BETWEEN '20030901' AND '20031231';

    The query plan is:

    Plain Plan

    The execution flow is pretty straightforward.  The plan seeks a non-clustered index on the ProductID column to find rows where ProductID = 1.  This non-clustered index is keyed on ProductID alone, but the index also includes the TransactionID clustering key (to point to the parent row in the base table).  The index does not cover the query (the TransactionDate column is not present in the index) so a Key Lookup is required for the TransactionDate column.  The nested loops join drives a look up process such that each row from the Index Seek on ProductID = 1 results in a lookup in the clustered index to find the TransactionDate value for that row.  A final Filter operator passes only rows that meet the date range condition.

    The next diagram shows the same plan with cardinality estimates and extra details for the Key Lookup:

    SQL Server 2005 plan

    The cardinality estimate at the Index Seek is exactly correct.  The table is not very large, there are up-to-date statistics associated with the index, so this is as expected.  If you run a query to find all rows in the TransactionHistory table for ProductID #1, 45 rows will indeed be returned.

    The estimate for the Key Lookup is exactly correct.  Each lookup into the Clustered Index to find the Transaction Date is guaranteed to return exactly one row (each non-clustered index entry is associated with exactly one base table row).  The Key Lookup is expected to be executed 45 times (shown circled).

    The estimate for the Inner Join is exactly correct – 45 rows from the seek joining to one row each time from the lookup, gives a total of 45 rows.

    The Filter estimate is also good: SQL Server estimates that of the 45 rows entering the filter, 16.9951 rows will match the specified range of transaction dates.  In reality, only 11 rows are produced by this query, but that small difference in estimates is quite normal and certainly nothing to worry about here.  You might like to keep that estimate of 16.9951 rows in mind, however.

    SQL Server 2008 onward

    The same query executed against an identical copy of AdventureWorks on SQL Server 2008 (or R2, or 2012) produces a quite different execution plan:

    SQL Server 2008 plan

    Instead of an explicit Filter to find rows with the requested date range, the optimizer has pushed the date predicate down to the Key Lookup (the Key Lookup now has a predicate on Transaction Date).  This is a good optimization in general terms; it makes sense to filter rows as early as possible.  Unfortunately, it has made a bit of a mess of the cardinality estimates in the process.  The post-Filter estimate of 16.9951 rows seen in the 2005 plan has been moved to the Key Lookup.  Instead of estimating one row per lookup, the plan now suggests that 16.9951 rows will be produced by each clustered index lookup – clearly not right!

    To my way of thinking, the execution plan cardinality estimates should look something like this:

    image

    As shown, I personally prefer to see the inner side of a nested loops join estimate the number of rows per execution, but I understand I may be in a minority on this point.  If the estimate were aggregated over the expected number of executions, the inner-side estimate would be also be 16.9951 (45 executions * 0.37767 per execution).

    Cause

    The query tree produced by the 2008+ optimizer looks much the same as the 2005 version – the explicit Filter is still present.  However, a post-optimization rewrite occurs in the ‘copy out’ phase that removes the Filter and incorporates it in the Key Lookup seek, resulting in a residual predicate on that operator.  This rewrite applies to regular scans and seeks too (so all residual predicates on scans and seeks are a result of this late rewrite).

    I would like to thank Dima Piliugin (twitter | blog) for introducing me to undocumented trace flag 9130, which disables the rewrite from Filter + (Scan or Seek) to (Scan or Seek) + Residual Predicate.  Enabling this flag retains the Filter in the final execution plan, resulting in a SQL Server 2008+ plan that mirrors the 2005 version, with correct estimates.

    The bug is that this rewrite that does not correctly update cardinality estimates when the filter is pushed down to a lookup.  I should stress that the rewrite occurs after all cost-based decisions have been made, so the incorrect estimate just looks odd and makes plan analysis harder than it ought to be.  Cardinality estimates with regular scans and seeks appear to work correctly, as far as I can tell.  The bug applies to both RID lookups and Key Lookups where a residual predicate is applied.

    Workarounds

    Using the trace flag is not a workaround because (a) it is (very) undocumented and unsupported; and (b) it results in a less efficient plan where rows are filtered much later than is optimal.  One genuine workaround is to provide a covering non-clustered index (avoiding the lookup avoids the problem):

    CREATE INDEX nc1 
    ON Production.TransactionHistory (ProductID) 
    INCLUDE (TransactionDate);

    With the Transaction Date filter applied as a residual predicate in the same operator as the seek, the final estimate is again as expected:

    image

    We could also force the use of the ultimate covering index (the clustered one) with a suitable table index hint:

    SELECT
        th.ProductID,
        th.TransactionID,
        th.TransactionDate
    FROM Production.TransactionHistory AS th WITH (INDEX(1))
    WHERE 
        th.ProductID = 1 
        AND th.TransactionDate BETWEEN '20030901' AND '20031231';

    image

    Again, the estimate is 16.9951 as expected.

    Fixed in SQL Sentry Plan Explorer build 7.2.42.0

    After this post was published on October 15 2012 I was contacted by the SQL Sentry people to see if a good workaround could be incorporated in their free Plan Explorer product.  I was happy to provide a little testing and general feedback during a process that ultimately resulted in a new build being released on 31 October 2012.  If only SSMS limitations could be addressed so quickly!  Once you upgrade to the new version, the plan displayed for our test query is:

    image

    This is exactly the representation we would expect if the SQL Server bug did not exist (note that Plan Explorer aggregates estimated inner-side executions for you, and rounds to integer).  Well done to the SQL Sentry team, especially Brooke Philpott (@MacroMullet).

    Summary

    Providing a covering non-clustered index to avoid lookups for all possible queries is not always practical, and scanning the clustered index will rarely be the optimal choice either.  Nevertheless, these are the best workarounds we have today in SSMS.

    Watch out for poor cardinality estimates when a predicate is applied as part of a lookup.

    This particular cardinality estimation issue does not affect the final plan choice (the internal estimates are correct) but it does look odd and will confuse people when analysing query plans in SSMS.  If you think this situation should be improved, please vote for this Connect item.  It will be interesting to see how long it takes to catch up with Plan Explorer.

    © 2012 Paul White – All Rights Reserved

    twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com

  • Why Doesn’t Partition Elimination Work?

    Given a partitioned table and a simple SELECT query that compares the partitioning column to a single literal value, why does SQL Server read all the partitions when it seems obvious that only one partition needs to be examined?

    Sample Data

    The following script creates a table, partitioned on the char(3) column ‘Div’, and populates it with 100,000 rows of data:

    USE Sandpit;
    GO
    CREATE PARTITION FUNCTION PF (char(3))
    AS RANGE RIGHT FOR VALUES
    (
        '1', '2', '3', '4', '5', '6', '7', '8', '9',
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
        'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
        'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
    );
    GO
    CREATE PARTITION SCHEME PS
    AS PARTITION PF
    ALL TO ([PRIMARY]);
    GO
    CREATE TABLE dbo.Test
    (
        Div     char(3) NOT NULL,
        Data    char(50) NOT NULL,
    )
    ON PS(Div);
    GO
    INSERT dbo.Test WITH (TABLOCKX)
        (Div, Data)
    SELECT TOP (100000)
        CONVERT(char(3), CONVERT(char(36), NEWID())),
        CONVERT(char(50), REPLICATE('X', 50))
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3;

    Sample data:

    Partition Elimination Sample Data

    The ‘Div’ column is pseudo-random so your results will be slightly different, but this is the distribution across partitions I saw:

    SELECT
        PartitionID = F.PartitionID,
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t WITH (TABLOCK)
    CROSS APPLY (VALUES($PARTITION.PF(t.Div))) AS F (PartitionID)
    GROUP BY
        F.PartitionID
    ORDER BY
        F.PartitionID;

    Sample Data Distribution

    (The partition function defines a total of 36 partitions but only 16 are used by the sample data.)

    The Test Query

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF';

    Since ‘Div’ is the partitioning column, the expectation is that only one partition will be scanned to count the rows that match the WHERE clause predicate.  However, the execution plan shows that query execution scans all 36 partitions (a full table scan):

    Execution Plan 1

    Why did SQL Server not just look at the one partition it knows the value ‘ABF’ must lie in?

    Here’s Why:

    The answer lies in the predicate:

    Predicate

    The value ‘ABF’ is nowhere to be seen; it has been replaced by [@1].  This means the query has been considered for simple parameterization by SQL Server to promote query plan reuse.  The presence of the [@1] does not mean that simple parameterization was successful (considered safe for all possible values by the optimizer).  We have to check the plan cache to see if an associated prepared plan exists and is reused for different ‘parameter’ values.  It turns out that this time the simple parameterization attempt is considered safe, so a prepared plan is created and is reused for different values.  The full parameterized text is:

    (@1 varchar(8000))SELECT COUNT_BIG(*) [RowCnt] FROM [dbo].[Test] [t] WHERE [t].[Div]=@1

    Static Partition Elimination

    The reason SQL Server cannot apply static partition elimination (determining the partition number at compile time) is that the plan now contains a parameter.  Subsequent executions that reuse the plan might specify a different value for the parameter, so static elimination would not be safe.  It is not possible to disable simple parameterization, but we can prevent it applying to this query in a number of ways.  One way is to incorporate an inequality predicate of the form expression <> constant:

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
        AND 1 <> 2;

    The redundant 1 <> 2 predicate is completely removed by the optimizer, and the query still qualifies for TRIVIAL optimization as it did previously:

    Execution Plan 2

    There are a number of interesting things to notice here:

    • The query is no longer parameterized (the literal value ‘ABF’ appears)
    • Only one partition is touched (partition 11)
    • This unindexed heap table has a seek predicate
    • The ‘ordered’ property has changed from False to True

    The problem with this plan is that it cannot be reused for different predicate values – so you could end up with a separate cached plan for each possible literal value.

    Adding OPTION (RECOMPILE) also defeats simple parameterization and therefore achieves static elimination.  In an ad-hoc SQL batch, OPTION (RECOMPILE) also means the query plan will not be cached (stored procedures are not so lucky).  The downside is a fresh trip through the optimizer on each call.  This query does qualify for trivial plan, so optimization time is minimal, but more complex queries with the same desire for partition elimination might well require full optimization.

    Dynamic Partition Elimination

    Ideally, we would like a query plan that is reusable but still only touches one partition for any given string literal value.  What we need is dynamic partition elimination.  In case the concept is new to you, the optimizer can produce plans that determine the correct partition at execution time (rather than at compile-time, as for static elimination).  To achieve this, the optimizer builds a plan that uses an internal function called RangePartitionNew().  One might think the original simple-parameterized plan ought to have included this feature; the reason it didn’t is quite interesting.

    When SQL Server parameterizes a query using simple or forced parameterization, it has to decide on a data type for the parameter.  For string literals, it always chooses varchar(8000) for non-Unicode strings and nvarchar(4000) for Unicode strings.  If it chose a specific type like varchar(3), we could end up with the same plan cache bloating and non-reuse problem as for the non-parameterized case – and there are potentially 8000 different lengths for varchar.  Also, prior to SQL Server 2005 SP2, the parameter-declaration part of the query text was not included in the hash value used to decide in which cache bucket to store the plan.  This could result in hundreds of same-text plans (with different parameters) occupying the same hash bucket (this does not work well).

    Anyway, faced with the parameterized predicate Div = [@1], where Div is char(3) and [@1] is varchar(8000), dynamic partition elimination does not occur because the value of [@1] might be truncated.  One way to address this is to explicitly convert the to-be-parameterized string literal to the same type as the target column (char(3)):

    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(3), 'CDC');

    Now we have a parameterized query plan that uses dynamic partition elimination:

    Dynamic Elimination 1

    This plan has all the interesting features of the static elimination plan, but the heap seek no longer selects a hard-coded partition ID.  The single partition searched is the one returned at runtime by the call to RangePartitionNew().  The explicit convert prevents data truncation when the parameterized value is compared with the partition boundary values (defined as char(3) in the partition function).  The extra CONVERT_IMPLICIT to char(3) is required for internal reasons, and only appears with non-Unicode partition function types.

    If we explicitly convert to char(4) instead of char(3), the possibility of truncation arises again and we are back to scanning all 36 partitions:

    No Partition Elimination

    Using varchar(3) instead of char(3) also results in dynamic elimination, which surprises some people:

    Dynamic Elimination 2

    Implicit Conversions and Precedence

    An alternative explanation I have seen for the original behaviour is that the string literal ‘ABF’ is implicitly typed as a varchar(3) by SQL Server and data type precedence means the char(3) ‘Div’ column has to be implicitly converted to varchar(3) to perform the comparison.  This is incorrect, but it is interesting to look at why that is so:

    Literal Types

    We can see the implicit type of the string literal ‘ABF’ using a query:

    SELECT
        BaseType = SQL_VARIANT_PROPERTY(V.v, 'BaseType'),
        MaxLength = SQL_VARIANT_PROPERTY(V.v, 'MaxLength')
    FROM (VALUES(CONVERT(sql_variant, 'ABF'))) AS V (v);

    Literal Data Type

    The implied type of ‘ABF’ does seem to be varchar(3), so that checks out.

    Type Precedence

    The data type precedence table in Books Online shows that varchar does indeed have a higher precedence than char:

    Data Type Precedence

    Also, from the Books Online topic Data Types (Transact-SQL):

    Precedence

    The example below uses a UNION ALL over char(5) and varchar(3) columns:

    DECLARE @T1 AS TABLE (col1 char(5) NOT NULL);
    DECLARE @T2 AS TABLE (col1 varchar(3) NOT NULL);
     
    INSERT @T1 VALUES ('12345');
    INSERT @T2 VALUES ('123');
     
    SELECT
        BaseType = SQL_VARIANT_PROPERTY(CONVERT(sql_variant, t.col1), 'BaseType'),
        MaxLength = SQL_VARIANT_PROPERTY(CONVERT(sql_variant, t.col1), 'MaxLength'),
        t.col1
    FROM
    (
        SELECT col1 FROM  @T1 
        UNION ALL
        SELECT col1 FROM @T2
    ) AS t;

    Result Data Type

    The result column ‘col1’ has to have a well-defined type.  As expected, the result is varchar(5) since varchar has a higher precedence than char and 5 is the maximum length of the result.  All good so far.

    Strings Are Special

    We get used to seeing (and preferably avoiding) implicit conversions in query plans where two mismatched types are compared:

    DECLARE @T1 AS TABLE (col1 integer NULL);
    DECLARE @v tinyint;
    SELECT col1 FROM @T1 AS t WHERE col1 = @v;

    Mismatched Types Comparison

    The plan shows the tinyint variable being implicitly converted to the higher-precedence integer data type, as expected.  Now let’s do the same thing with varchar and char (perhaps expecting char to be converted to the higher-precedence varchar):

    DECLARE @T1 AS TABLE (col1 varchar(5) NULL);
    DECLARE @v char(3);
    SELECT col1 FROM @T1 AS t WHERE col1 = @v;

    No Implicit Conversion

    Note the lack of an implicit conversion in the predicate.

    Important

    SQL Server can compare char with varchar without performing a conversion.  They are fundamentally the same type, it’s just that one is fixed-length and the other is variable-length.  The SQL standard requires padding for the character strings used in comparisons so that their lengths match before comparing, thereby removing the only distinction between char and varchar.  The same optimization can apply to a comparison between nchar and nvarchar (though not to Unicode versus non-Unicode comparisons, since they are genuinely different types of course).

    The present case involves a predicate “WHERE Div = ‘ABF’”.  This is a comparison that can use this optimization, so there is no implicit conversion.  This is the reason the explicit conversion to varchar(3) works in the main example – the precedence rules do not need to be invoked, and the data column is not converted from char(3) to varchar(3).

    Why Use a Partitioned Heap?

    In case you were wondering, the only reason I chose a partitioned heap for this post is just because I could (it’s more interesting in many ways).  All the demonstrations work just as well for a partitioned clustered table.  A complete script using a clustered partitioned table is below:

    USE Sandpit;
    GO
    CREATE PARTITION FUNCTION PF (char(3))
    AS RANGE RIGHT FOR VALUES
    (
        '1', '2', '3', '4', '5', '6', '7', '8', '9',
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
        'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
        'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
    );
    GO
    CREATE PARTITION SCHEME PS
    AS PARTITION PF
    ALL TO ([PRIMARY]);
    GO
    CREATE TABLE dbo.Test
    (
        ID      integer IDENTITY NOT NULL,
        Div     char(3) NOT NULL,
        Data    char(50) NOT NULL,
     
        CONSTRAINT PK__dbo_Test_Div_ID
        PRIMARY KEY CLUSTERED (Div, ID)
        ON PS(Div)
    );
    GO
    INSERT dbo.Test WITH (TABLOCKX)
        (Div, Data)
    SELECT TOP (100000)
        CONVERT(char(3), CONVERT(char(36), NEWID())),
        CONVERT(char(50), REPLICATE('X', 50))
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3;
    GO
    -- Show the distribution of data
    SELECT
        PartitionID = F.PartitionID,
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t WITH (TABLOCK)
    CROSS APPLY (VALUES($PARTITION.PF(t.Div))) AS F (PartitionID)
    GROUP BY
        F.PartitionID
    ORDER BY
        F.PartitionID;
    GO
    -- Seek on all 36 partitions
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF';
    GO
    -- Static elimination = 1 hard-coded partition ID
    -- Cached plan unlikely to be reused
    SELECT
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
        AND 1 <> 2;
    GO
    -- Static elimination
    -- Compilation overhead
    SELECT
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = 'ABF'
    OPTION (RECOMPILE);
    GO
    -- Dynamic elimination with convert to char(3)
    -- Reusable plan
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(3), 'ABF');
    GO
    -- Dynamic elimination with convert to varchar(3)
    -- Data type precedence not applied:
    -- varchar(3) convert does not cause the char(3)
    -- column Div to be converted (would prevent a seek)
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(varchar(3), 'ABF');
    GO
    -- char(4) means truncation could occur when
    -- comparing with the char(3) partition function
    -- No partition elimination, seek on 36 partitions
    SELECT 
        RowCnt = COUNT_BIG(*)
    FROM dbo.Test AS t 
    WHERE 
        t.Div = CONVERT(char(4), 'ABF');
    GO
    -- Clean up
    DROP TABLE dbo.Test;
    DROP PARTITION SCHEME PS;
    DROP PARTITION FUNCTION PF;

    This post applies to SQL Server 2008, 2008 R2 and 2012 only (partitioned table query execution was very different in 2005).  As far as I can tell, SQL Server 2005 did not attempt simple parameterization on a partitioned table.  I’m in two minds whether the SQL Server 2008+ behaviour is a bug, an oversight, or an undesirable consequence of fixing something else…so I opened a Connect item for it.  Update: fixing this is under consideration for the next release.

    Sean Broadley raises an important practical point in the comments below – the issue highlighted in this post only occurs when a literal value is used.  By correctly parameterizing your queries (including application code and dynamic SQL) you will avoid simple parameterization, achieve dynamic partition elimination, and cache a more reusable query plan.

    Acknowledgement

    Thanks to Jonathan Kehayias (blog | twitter) who first altered me to the MSDN forum question that prompted this post.

    Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    © 2012 Paul White – All rights reserved

    image454

  • Compute Scalars, Expressions and Execution Plan Performance

    Compute Scalar Operator

    The humble Compute Scalar is one of the least well-understood of the execution plan operators, and usually the last place people look for query performance problems.  It often appears in execution plans with a very low (or even zero) cost, which goes some way to explaining why people ignore it.

    Some readers will already know that a Compute Scalar can contain a call to a user-defined function, and that any T-SQL function with a BEGIN…END block in its definition can have truly disastrous consequences for performance (see this post by Rob Farley for details).  This post is not about those sorts of concerns.

    Compute Scalars

    The Books Online description of this operator is not very detailed:

    Books Online Description

    This leads people to think that Compute Scalar behaves like the majority of other operators: as rows flow through it,  the results of whatever computations the Compute Scalar contains are added to the stream.  This is not generally true.  Despite the name, Compute Scalar does not always compute anything, and does not always contain a single scalar value (it can be a vector, an alias, or even a Boolean predicate, for example).  More often than not, a Compute Scalar simply defines an expression; the actual computation is deferred until something later in the execution plan needs the result.


    Compute Scalars are not the only operators that can define expressions. You can find expression definitions with labels like [Expr1008] in many query plan operators, including Constant Scans, Stream Aggregates and even Inserts.

    Deferred expression evaluation was introduced in SQL Server 2005 as a performance optimization.  The idea is that the number of rows tends to decrease in later parts of the plan due to the effects of filtering and joining (reducing the number of expression evaluations at runtime).  To be fair to the Books Online entry, it does mention this behaviour in a note:

    Books Online Note

    The take away from this first section is that a Compute Scalar is most often just a placeholder in the execution plan.  It defines an expression and assigns a label to it (e.g. [Expr1000]) but no calculations are performed at that point.  The defined expression may be referenced by multiple operators later in the execution plan.  This makes counting the number of ‘executions’ more complex than for a regular plan operator., and goes some way to explaining why Compute Scalar operators only ever show estimated row counts (even in an ‘actual’ execution plan).

    Execution Plans and the Expression Service

    The execution plans you see in Management Studio (or Plan Explorer) are a highly simplified representation that is nevertheless useful for many kinds of plan analysis.  The XML form of show plan has a more detail than the graphical representation, but even that is a shallow depiction of the code that actually runs inside SQL Server when a query is executed.  It is easy to be seduced into thinking that what we see as an ‘execution plan’ is more complete than it actually is.  As a tool for high-level analysis, it is excellent, but do not make the mistake of taking it too literally.  More detail on this later.

    One more thing to appreciate before we get into a concrete example: the SQL Server engine contains a component known as the Expression Service.  This is used by components such as the query optimizer, query execution, and the storage engine to perform computations, conversions and comparisons.  It is the call to the expression service in the execution plan that can be deferred and performed by a plan operator other than the source Compute Scalar (or other expression-defining operator).

    An Example

    I came across a forum thread recently which demonstrated the concepts I want to explore in this post really well.  The topic of discussion was delimited string-splitting (an old favourite of a topic, for sure) but I want to emphasise that the particular code is not at all important – it is the query plan execution I want to focus on.  The particular string-splitting method employed was one I first saw written about by Brad Shultz; it replaces the delimiters with XML tags, and then uses XML shredding to return the results.  I should say that I am not a fan of this technique because it is (a) a strange use of XML; and (b) it does not work for all possible inputs.  Anyway, I said the example was not that important in itself, so let’s get on to the first example:

    Test One – XML Split

    DECLARE
        @ItemID bigint,
        @Item varchar(8000),
        @Input varchar(8000);
     
    SET @Input = REPLICATE('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', 32);
     
    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(@Input, ',', '</r><r>') + 
                '</r>', 0
                )
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The code above creates a single comma-delimited string containing the letters of the alphabet, repeated 32 times (a total string length of 1632 characters).  The string-splitting SELECT statement replaces the commas with XML tags, fixes up the start and end of the string, and then uses the nodes() XML method to perform the actual splitting.  There is also a ROW_NUMBER call to number the elements returned in whatever order the nodes() method spits them out.  Finally, two variables are used to ‘eat’ the output so returning results to the client does not affect execution time.  The execution plan is as follows (click to enlarge):

    XML String Splitter Test 1

    The Constant Scan operator defines an expression labelled as [Expr1000] which performs the string replacement and conversion to the XML type.  There are four subsequent references to this expression in the query plan:

    • Two XML Reader table-valued functions (performing the nodes() shredding and value() functions)
    • The Filter operator (a start-up predicate checks that the expression is not NULL)
    • The Stream Aggregate (as part of a CASE expression that determines what the value() XML method should finally return)

    The query takes 3753ms to execute on my laptop.  This seems a rather long time to shred a single string containing just 801 elements.

    Test Two – The Empty String

    The following code makes one small modification to the test one script:

    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(@Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(@@DBTS, 0)  -- NEW
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The only difference in the execution plan is in the definition of [Expr1000] (click to enlarge):

    XML String Splitter Test 2

    This plan returns the same correct results as the first test, but it completes in 8ms (quite an improvement over 3753ms).

    Beyond the Execution Plan – Test One

    There is nothing in the query plan to explain why the second form of the query runs 500 times faster than the first.  To see what is going on under the covers of the execution engine when executing the slow query, we can attach a debugger and look closer at the code SQL Server is really running.  Using SQL Server 2012 and setting a breakpoint on the CXMLTypeProvider::ConvertStringToXMLForES method (as the name suggests, it converts a string to XML for the Expression Service) the first debugger break occurs with this stack trace:

    ConvertStringToXMLForES

    I know that not everyone reading this post will routinely analyse stack traces, so for this first example I will describe the interesting calls from the bottom up (the currently executing code is on top, each line below is the method that called that routine):

    1. There is nothing very interesting below CQueryScan::GetRow (it is just start-up code)
    2. CQueryScan:GetRow drives the plan as we are used to seeing it.  You might think of it as approximately the code behind the green SELECT query plan icon in a graphical plan.  It reads a row at a time from the first plan operator to its right
    3. The first visible plan operator to execute is the leftmost Nested Loops join (CQScanNLJoinTrivialNew::GetRow).  (The Compute Scalar to its left in the graphical plan simply defines an alias for an expression computed by the Stream Aggregate – it does not even exist in the code executed by the execution engine).
    4. The nested loops join operator asks the Sequence Project operator for a row (CQScanSeqProjectNew::GetRow)
    5. The sequence project operator asks the Segment operator for a row (CQScanSegmentNew::GetRow)
    6. The segment operator asks the second Nested Loops Join for a row (CQScanNLJoinTrivialNew::GetRow)
    7. The nested loops join has obtained a row from the Constant Scan and is now preparing to fetch a row from the inner side of the join
    8. The start-up Filter operator is executing its Open() method (CQScanStartupFilterNew::Open)
    9. The start-up Filter calls into the general entry point for the Expression Service to evaluate expression [Expr1000] defined by the Constant Scan
    10. CEsExec::GeneralEval4 calls an ES method to convert a string to XML (ConvertFromStringTypesAndXmlToXml)
    11. The XML type provider executes the code that actually converts the string to XML (ConvertStringToXMLForES)

    The stack trace illustrates two things nicely:

    • The iterative nature of plan operators (one row at a time, Open, GetRow, and Close methods)
    • An expression defined by one operator (the Constant Scan) being deferred for evaluation until a later operator (the start-up filter) requires the result

    Continuing execution with the debugger, the next break occurs when the Table-valued Function to the right of the filter is initialised:

    Table-valued Function Stack Trace

    Another break when the Table-valued Function on the lowest branch of the plan initialises:

    Table-valued Function Stack Trace 2

    And finally, a break from the Stream Aggregate, when it needs to evaluate [Expr1016] (which contains a reference to [Expr1000]):

    Stream Aggregate Stack Trace

    The last two breaks (lowest table-valued function and stream aggregate) execute a total of 801 times.  The reason for the slow performance of this query is now clear: [Expr1000] is being computed more than 1604 times.  Yes, that means the REPLACE and CONVERT to XML really does run 1604 times on the same value – no wonder performance is so poor!

    If you are inclined to test it, the expression service method that performs the REPLACE is sqlTsEs!BhReplaceBhStrStr.  It turns out that the string REPLACE is very much faster than conversion to XML (remember XML is a LOB type and also has complex validation rules) so the dominant performance effect is the conversion to XML.

    Beyond the Execution Plan – Test Two

    Running test two with the debugger attached reveals that only one call is made to CXMLTypeProvider::ConvertStringToXMLForES:

    FillExprCache

    Reading down the stack trace we see that execution of the plan (as we know it) has not started yet.  The call to convert the string to XML is called from InitForExecute() and SetupQueryScanAndExpression.  The conversion to XML is evaluated once and cached in the execution context (CQueryExecContext::FillExprCache).  Once the iterative part of plan execution begins, the cached result can be retrieved directly without calling ConvertStringToXMLForES() 1604 times.

    Test Three – Column Reference

    Before you go off adding LEFT(@@DBTS, 0) to all your expressions (!) look what happens when we use a table to hold the input data rather than a variable:

    DECLARE
        @ItemID bigint,
        @Item varchar(8000);
     
    DECLARE @Table AS TABLE (Input varchar(8000) NOT NULL);
     
    INSERT @Table (Input)
    VALUES (REPLICATE('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', 32));
     
    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM @Table AS t
    CROSS APPLY
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(t.Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(@@DBTS, 0)
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    Instead of the Constant Scan we saw when using a variable, we now have a table scan and a Compute Scalar that defines the expression:

    [Expr1003] = Scalar Operator(
    CONVERT(xml,'<r>'+replace(@Table.[Input] as [t].[Input],',','</r><r>')+'</r>'+
    substring(CONVERT_IMPLICIT(varchar(8),@@dbts,0),(1),(0)),0))

    XML String Splitter Test 3

    The other plan details are the same as before; the expression label [Expr1003] is still referenced by the Filter, two Table-valued functions and the Stream Aggregate.  The test three query takes 3758ms to complete.  Debugger analysis shows that CXMLTypeProvider::ConvertStringToXMLForES is again being called directly by the Filter, Table-valued Functions, and Stream Aggregate.

    Test Four – CRYPT_GEN_RANDOM

    This final test replaces @@DBTS with the CRYPT_GEN_RANDOM function:

    SELECT
        @ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 
        @Item = x.i.value('./text()[1]', 'varchar(8000)')    
    FROM @Table AS t
    CROSS APPLY
    (
        VALUES
            (
            CONVERT
                (xml, 
                '<r>' + 
                REPLACE(t.Input, ',', '</r><r>') + 
                '</r>' +
                LEFT(CRYPT_GEN_RANDOM(1), 0)    -- CHANGED
                , 0)
            )
    ) AS ToXML (result) 
    CROSS APPLY ToXML.result.nodes('./r') AS x(i);

    The execution plan shows an extra Compute Scalar next to the Table Scan:

    [Expr1018] = Scalar Operator('<r>'+replace(@Table.[Input] as [t].[Input],',','</r><r>')+'</r>')

    The Compute Scalar to the left of that is the same one seen previously, though it now references the new expression in its definition:

    [Expr1003] = Scalar Operator(
    CONVERT(xml,[Expr1018]+
    substring(CONVERT_IMPLICIT(varchar(8000),Crypt_Gen_Random((1)),0),(1),(0)),0))

    XML String Splitter Test 4

    The remainder of the query plan is unaffected; the same operators reference [Expr1003] as before.  As you might have been expecting, this query executes in 8ms again.

    Explanation

    In test one, there is nothing to stop the query processor deferring execution of the expression, which results in the same expensive computation being performed 1604 times (once each for the filter and first table-valued function, and 801 times each for the second function and the stream aggregate).  As an interesting aside, notice that even with OPTION (RECOMPILE) specified, constant-folding only applies to the REPLACE and string concatenation; constant-folding cannot produce a LOB type (including XML).

    In test two, we introduced the non-deterministic function @@DBTS.  This is one of the functions (like RAND and GETDATE) that are implemented to always return the same value per query (despite being non-deterministic, see the links in the Further Reading section below for more details).  These runtime constants are evaluated before the query starts processing rows and the result is cached.  There are some hints (ConstExpr labels) in the execution plan that may indicate a runtime constant but I have not found this to be reliable.

    In test three, we changed the game again by replacing the reference to a variable with a column reference from a table.  The value of this column reference could change for each table row so the engine can no longer treat the expression as a runtime constant.  As a consequence, the result is no longer cached and we see the same behaviour (and slow performance) as in test one.

    In test four, we replaced the @@DBTS function with the side-effecting function CRYPT_GEN_RANDOM (the other such function is NEWID).  One of the effects of this is to disable the deferred evaluation of the expression defined by the Compute Scalar.  The result is a XML LOB and a second optimization occurs where handles to the same LOB (evaluated by the defining operator) can be shared by other operators that reference the LOB.  This combination of effects means the expensive computation is only performed once in test four, even though the runtime constant caching mechanism is not used.  By the way, there was a bug with CRYPT_GEN_RANDOM which meant it was incorrectly assessed by the optimizer in early versions of SQL Server 2008.  If you encounter this issue when running the scripts above, apply a later service pack (or use the NEWID function for SQL Server 2005).

    Final Thoughts

    • There is a lot more going on in query execution than we can see in execution plans
    • In general, there are no guarantees about the order of execution of expressions, or how many times they will be evaluated
    • Please do not start using @@DBTS or CRYPT_GEN_RANDOM for their undocumented effects
    • It might be nice if show plan output did indicate if a computation was deferred or cached, but it is not there today
    • In many cases we can work around the potential for poor performance caused by repeated evaluation by explicitly materializing the result of the expression using a variable, table, non-inline function…though not always
    • Please do not suggest improvements to the XML string splitter unless it handles all input characters and can accept a semicolon as a delimiter ;c)
    • I always use SQLCLR for string splitting

    Acknowledgements

    I would like to particularly thank Usman Butt and Chris Morris from SQL Server Central for their contributions and thoughts which helped me write this entry.

    Further Reading

    When a Function is indeed a Constant – Andrew Kelly
    Runtime Constant Functions – Conor Cunningham
    RAND() and other runtime constant functions – Conor again
    Scalar Operator Costing – Guess who

    © 2012 Paul White
    twitter: @SQL_Kiwi
    email: SQLkiwi@gmail.com

    image454

  • Deletes that Split Pages and Forwarded Ghosts

    Can DELETE operations cause pages to split?  Yes.  It sounds counter-intuitive on the face of it; deleting rows frees up space on a page, and page splitting occurs when a page needs additional space.  Nevertheless, there are circumstances when deleting rows causes them to expand before they can be deleted.  The mechanism at work here is row versioning (extract from Books Online below):

    Row versioning space usage

    Isolation Levels

    The relationship between row-versioning isolation levels (the first bullet point) and page splits is reasonably clear.  Any data that existed before either of the isolation levels was enabled will need to have the 14 bytes added by future data modifications, perhaps causing pages to split.  In this scenario, tables will likely contain a mix of records to start with, but over time (particularly as index maintenance is performed) the database will end up with row versions on most records, reducing the chances of a page split for that particular reason.

    There is nevertheless a window of opportunity where adding the 14 bytes to an existing record could cause a page split.  No doubt there’s a recommendation out there somewhere to rebuild all tables and indexes when enabling or disabling a row-versioning isolation level on a database.  This is not all that interesting though, so let’s look at the second bullet point instead:

    Triggers

    The documentation says that versioning information is added if the table has a trigger.  What it doesn’t say is:

    • The extra bytes for row versioning can be added even where both READ_COMMITTED_SNAPSHOT and SNAPSHOT isolation are OFF.
    • This only applies to AFTER triggers, not INSTEAD OF triggers
    • The AFTER trigger also needs to be enabled to generate row versions, the mere existence of a trigger is not enough.
    • There is a very important exception to all the above…

    SQL Server can still avoid adding the row-versioning information even where an enabled AFTER TRIGGER exists (the remainder of this post assumes that both row-versioning isolation levels are OFF, by the way).

    Avoiding Row Versioning with Triggers

    To explore this behaviour in a bit of detail, we’ll need a test rig:

    USE tempdb;
    GO
    IF  OBJECT_ID(N'dbo.Test', N'U') IS NOT NULL
        DROP TABLE dbo.Test;
    GO
    -- Test table
    CREATE TABLE dbo.Test
    (
        ID          integer IDENTITY PRIMARY KEY,
        Column01    nvarchar(20) NULL,
        Column02    nvarchar(4000) NULL,
    );
    GO
    -- Add some rows
    INSERT dbo.Test WITH (TABLOCKX)
        (Column01)
    SELECT TOP (100000)
        CONVERT(nvarchar(20), N'X')
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3
    OPTION (MAXDOP 1);
    GO
    -- A trigger that does nothing
    CREATE TRIGGER trg
    ON dbo.Test
    AFTER DELETE
    AS RETURN;
    GO
    -- Write any dirty pages to disk
    CHECKPOINT;
    GO
    -- Physical storage before any changes
    SELECT
        ddips.index_type_desc,
        ddips.alloc_unit_type_desc,
        ddips.page_count,
        ddips.record_count,
        ddips.max_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
    WHERE
        ddips.index_level = 0;
    GO
    -- Buffer pool pages before any changes
    SELECT 
        dobd.[file_id],
        dobd.page_id,
        dobd.page_type,
        dobd.row_count,
        dobd.free_space_in_bytes,
        dobd.is_modified
    FROM sys.partitions AS p
    JOIN sys.allocation_units AS au 
        ON au.container_id = p.hobt_id
    JOIN sys.dm_os_buffer_descriptors AS dobd ON 
        dobd.allocation_unit_id = au.allocation_unit_id
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    ORDER BY
        dobd.page_id;
    GO
    SET STATISTICS IO ON;
     
    -- Delete 1 in 10 rows
    DELETE dbo.Test
    WHERE ID % 10 = 0;
     
    SET STATISTICS IO OFF;
    GO
    SELECT
        [Page Splits] = COUNT_BIG(*)
    FROM sys.fn_dblog(NULL,NULL) AS fd 
    WHERE 
        fd.[Transaction Name] = N'SplitPage';
    GO
    -- Ensure ghosted records are processed so
    --  we see accurate per-page row counts
    DBCC FORCEGHOSTCLEANUP;
    GO
    -- Physical storage after the delete operation
    SELECT
        ddips.index_type_desc,
        ddips.alloc_unit_type_desc,
        ddips.page_count,
        ddips.record_count,
        ddips.max_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
    WHERE
        ddips.index_level = 0;
    GO
    -- Buffer pool pages after the delete operation
    SELECT 
        dobd.[file_id],
        dobd.page_id,
        dobd.page_type,
        dobd.row_count,
        dobd.free_space_in_bytes,
        dobd.is_modified
    FROM sys.partitions AS p
    JOIN sys.allocation_units AS au 
        ON au.container_id = p.hobt_id
    JOIN sys.dm_os_buffer_descriptors AS dobd ON 
        dobd.allocation_unit_id = au.allocation_unit_id
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
    ORDER BY
        dobd.page_id;
    GO
     
    --DBCC TRACEON (3604);
    --DBCC PAGE (tempdb, 1, 4720, 3);

    The script performs the following actions:

    1. Creates a clustered table with an ID and two data columns
    2. Adds 100,000 rows each with a single ‘X’ character in the first column, and NULL in the second
    3. Creates an AFTER DELETE trigger that does nothing at all except exist and be enabled
    4. Displays physical storage and buffer pool information using DMVs
    5. Deletes every tenth row in the table
    6. Shows the number of leaf-level page splits that occurred
    7. Displays the physical storage and buffer pool information again

    Test One – Clustered Table

    Typical output:

    Clustered Table Delete Test 1

    There are 235 data pages with a maximum physical record size of 17 bytes before and after the delete.  Before the delete, each data page contains 426 rows with 2 bytes of free space.  After the DELETE:

    • A total of 10,000 records have been deleted
    • The data page count remains at 235
    • The maximum record size is still 17 bytes
    • Each data page has lost 42 or 43 rows
    • The free space on each page has risen to 800 or 819 bytes
    • All data pages are marked as being modified in memory
    • A total of 237 logical reads are reported

    No surprises there.

    Test Two – Clustered Table

    Now, run the script again with the only change being that Column01 is defined as nvarchar(22) instead of nvarchar(20).  The before picture is the same as before, but the situation after the DELETE is very different:

    Clustered Table Delete Test 2

    There have been 234 page splits, increasing the data page count from 235 pages to 469 pages, and halving the number of rows on each data page.  The number of reads reported has also blown out from 237 previously to 2342 logical reads in this run (a factor of ten worse).

    Explanation

    The cause of the page splitting is that the deleted records must be versioned.  SQL Server 2005 and later uses the version store to build the inserted and deleted pseudo-tables used by AFTER triggers.  Where the data has no pre-existing versioning data, adding the 14 bytes will result in a clustered index page split if the page contains insufficient free space to accommodate this expansion.

    Temporarily turning off ghost record clean-up using global trace flag 661 and examining an affected data page using DBCC PAGE shows the following (remember to turn the trace flag off afterward if you try this):

    Row versioning ghost data record

    Slots 8 and 10 on this page hold records that were unaffected by the DELETE; the physical row length is 17 bytes as displayed previously.  The record that was in slot 9 has been deleted.  It is a ghost record with versioning information added.  The record size is 17 + 14 = 31 bytes, and this expansion with only 2 bytes of free space on the page caused it to split.

    This explains why the nvarchar(22) DELETE test caused page splitting, but why didn’t the original nvarchar(20) script behave the same?

    There is a performance optimization that can avoid adding row versioning information, but only if the table cannot generate ROW_OVERFLOW or LOB allocation units.  This means that the definition of the table must not allow for LOBs or for the possibility of variable length columns moving off row.  The actual size of the data stored is immaterial – it is the potential size that matters.

    In our test, the nvarchar(22) column definition caused the maximum possible INROW size to just exceed the 8060 byte limit.  (The exact INROW limit also depends on the table definition; marking one of the data columns SPARSE would reduce the limit to 8015-8019 bytes – whatever the right number is).

    Heaps of Forwarded Ghosts

    “Page splitting only occurs in index structures, are heap structured tables affected by this issue too?”

    It is true that heap pages do not split, but when a heap row needs to expand, the engine will move the row to another page if insufficient free space exists on the current page.  When the storage engine does this, it leaves a forward pointer behind to avoid updating all non-clustered indexes to reflect the new physical row locator.

    For a heap table with an active AFTER trigger, and a LOB column (or the possibility of row-overflow) the row has to be versioned and ghosted.  If the page contains insufficient free space to accommodate the versioning,  the row moves to another page leaving a forwarding stub behind.  This results in a forwarded ghost record.  Ghost clean-up will normally remove this record pretty quickly, so we will need to disable that process temporarily.  The following script creates the very special circumstances necessary to produce a forwarded ghost record this way (note this script is for test systems only should not be run in tempdb):

    USE Sandpit;
    GO
    IF  OBJECT_ID(N'dbo.Test', N'U') IS NOT NULL
        DROP TABLE dbo.Test;
    GO
    -- Heap test
    CREATE TABLE dbo.Test
    (
        ID          integer IDENTITY PRIMARY KEY NONCLUSTERED,
        Column01    nvarchar(16) NULL,
        Column02    nvarchar(4000) NULL
    );
    GO
    -- Add some all-NULL rows
    INSERT dbo.Test WITH (TABLOCKX)
        (Column01)
    SELECT TOP (100000)
        NULL
    FROM sys.columns AS c
    CROSS JOIN sys.columns AS c2
    CROSS JOIN sys.columns AS c3
    OPTION (MAXDOP 1);
    GO
    -- Ensure rows are tightly packed
    ALTER TABLE dbo.Test REBUILD;
    GO
    -- A trigger that does nothing
    CREATE TRIGGER trg
    ON dbo.Test
    AFTER DELETE
    AS RETURN;
    GO
    -- Write any dirty pages to disk
    CHECKPOINT;
    GO
    -- Physical storage before any changes
    SELECT
        ddips.index_type_desc,
        ddips.page_count,
        ddips.record_count,
        ddips.max_record_size_in_bytes,
        ddips.ghost_record_count,
        ddips.forwarded_record_count
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
    WHERE
        ddips.alloc_unit_type_desc = N'IN_ROW_DATA'
        AND ddips.index_level = 0;
    GO
    -- Buffer pool pages before any changes
    SELECT 
        dobd.[file_id],
        dobd.page_id,
        dobd.page_type,
        dobd.row_count,
        dobd.free_space_in_bytes,
        dobd.is_modified
    FROM sys.partitions AS p
    JOIN sys.allocation_units AS au 
        ON au.container_id = p.hobt_id
    JOIN sys.dm_os_buffer_descriptors AS dobd ON 
        dobd.allocation_unit_id = au.allocation_unit_id
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
        AND dobd.page_type = N'DATA_PAGE'
    ORDER BY
        dobd.page_id;
    GO
    -- Disable ghost clean-up
    DBCC TRACEON (661, -1);
    GO
    SET STATISTICS IO ON;
     
    -- Delete three records on the same page
    DELETE dbo.Test
    WHERE ID BETWEEN 1 AND 3;
     
    SET STATISTICS IO OFF;
    GO
    -- Physical storage after the delete operation
    SELECT
        ddips.index_type_desc,
        ddips.page_count,
        ddips.record_count,
        ddips.max_record_size_in_bytes,
        ddips.ghost_record_count,
        ddips.forwarded_record_count
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID(N'dbo.Test', N'U'), NULL, NULL, 'DETAILED') AS ddips
    WHERE
        ddips.alloc_unit_type_desc = N'IN_ROW_DATA'
        AND ddips.index_level = 0;
    GO
    -- Buffer pool pages after the delete operation
    SELECT 
        dobd.[file_id],
        dobd.page_id,
        dobd.page_type,
        dobd.row_count,
        dobd.free_space_in_bytes,
        dobd.is_modified
    FROM sys.partitions AS p
    JOIN sys.allocation_units AS au 
        ON au.container_id = p.hobt_id
    JOIN sys.dm_os_buffer_descriptors AS dobd ON 
        dobd.allocation_unit_id = au.allocation_unit_id
    WHERE
        p.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
        AND dobd.page_type = N'DATA_PAGE'
        AND dobd.is_modified = 1
    ORDER BY
        dobd.page_id;
    GO
    -- View the appropriate page
    --DBCC TRACEON (3604);
    --DBCC PAGE (0, 1, 339408, 3);
     
    -- Enable ghost clean-up
    DBCC TRACEOFF (661, -1);

    Example output, showing page 453648 receiving a versioned forwarded ghost record (click to enlarge in a new tab):

    Versioned Forwarded Ghost Record

    Partial DBCC PAGE output for the highlighted page:

    DBCC PAGE forward ghost record

    Summary

    If you ever wonder why your deletes are so slow, it is worth checking to see if you are suffering from page splitting due to an enabled trigger and a table definition that allows for LOB allocations or ROW_OVERFLOW.  Any table with a LOB column (including the max data types) qualifies, as does one with even a surprisingly small number of variable-length columns, as shown in the examples in this post.  This is a great reason to avoid using ‘max’ or old-style LOB data types unnecessarily, and to be careful about the maximum length of ‘normal’ variable-length data types too.  Remember, it is the potential maximum row size that is important, not the actual row size.

    On a related note, remember that deletes on a heap can only deallocate empty pages if a table lock is acquired?  A table definition that allows for LOB or ROW_OVERFLOW prevents that optimization too.  So, if your heaps are growing despite DELETE WITH (TABLOCKX), check the maximum possible row length!  You could also convert them to clustered tables as well, of course, but that’s a quite different debate.

    I would like to acknowledge and thank SQL Server MVP Dmitri V. Korotkevitch who first brought the basic issue to my attention with UPDATE queries.  I would strongly encourage you to also read his blog entry showing how this behaviour also affects UPDATE queries, resulting in slow performance and excessive fragmentation.

    Thanks for reading.

    image45

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

  • Temporary Table Caching Explained

    SQL Server 2005 onward caches temporary tables and table variables referenced in stored procedures for reuse, reducing contention on tempdb allocation structures and catalogue tables.  A number of things can prevent this caching (none of which are allowed when working with table variables):

    • Named constraints (bad idea anyway, since concurrent executions can cause a name collision)
    • DDL after creation (though what is considered DDL is interesting)
    • Creation using dynamic SQL
    • Table created in a different scope
    • Procedure executed WITH RECOMPILE

    Temporary objects are often created and destroyed at a high rate in production systems, so caching temporary objects can be an important optimization.  The temporary object cache is just another SQL Server cache (using the general framework) though it’s entries are a bit more visible than most.  The sys.dm_os_performance_counters DMV exposes a number of counters under the ‘Temporary Tables & Table Variables’ instance of the Plan Cache object.  The cache is also visible through the usual cache DMVs, for example as CACHESTORE_TEMPTABLES in sys.dm_os_memory_cache_counters.

    Cached Object Names

    The cached objects themselves are visible in tempdb.sys.tables, named with a single # character followed by the 8-character hexadecimal representation of the object id.  This is different from the names of ordinary temporary tables, which have the user-supplied name followed by a bunch of underscores and an id.

    The following procedure shows a total of nine cache objects created using CREATE TABLE #xyz syntax, table variables, and SELECT…INTO:

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
        CREATE TABLE #T2 (dummy int NULL);
        CREATE TABLE #T3 (dummy int NULL);
     
        DECLARE @T1 AS TABLE (dummy int NULL);
        DECLARE @T2 AS TABLE (dummy int NULL);
        DECLARE @T3 AS TABLE (dummy int NULL);
        
        SELECT * INTO #T4 FROM #T1;
        SELECT * INTO #T5 FROM @T2;
        SELECT * INTO #T6 FROM #T3;
     
        WAITFOR DELAY '00:00:01'
    END;
    GO
    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo;
    GO
    SELECT
        t.* 
    FROM tempdb.sys.tables AS t 
    WHERE
        t.name LIKE N'#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]';

    The results show nine separate cached objects:

    Cached temporary objects

    Notice the relationship between object id and the name e.g. –1383692523 = hex AD868715.

    Caching is per object not per procedure

    If any of the temporary objects in a procedure are not cacheable for any reason, the others may still be cached. So, for example, if we modify the test above to create an index on table #T5, that particular table will not be cached, but the other temporary objects will be:

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
        CREATE TABLE #T2 (dummy int NULL);
        CREATE TABLE #T3 (dummy int NULL);
     
        DECLARE @T1 AS TABLE (dummy int NULL);
        DECLARE @T2 AS TABLE (dummy int NULL);
        DECLARE @T3 AS TABLE (dummy int NULL);
        
        SELECT * INTO #T4 FROM #T1;
        SELECT * INTO #T5 FROM @T2;
        SELECT * INTO #T6 FROM #T3;
     
        -- Prevents caching of #T5
        CREATE INDEX nc1 ON #T5 (dummy);
     
        WAITFOR DELAY '00:00:01'
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    EXECUTE dbo.Demo;

    There are now only eight cached objects:

    Only some objects cached

    Apparently, DROP TABLE is not DDL

    Dropping a temporary table in a procedure does not count as DDL, and neither does TRUNCATE TABLE, nor UPDATE STATISTICS.  None of these things prevent temporary table caching (so it does not matter whether you explicitly drop a temporary table at the end of a procedure or not).

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
        CREATE TABLE #T2 (dummy int NULL);
        CREATE TABLE #T3 (dummy int NULL);
     
        DECLARE @T1 AS TABLE (dummy int NULL);
        DECLARE @T2 AS TABLE (dummy int NULL);
        DECLARE @T3 AS TABLE (dummy int NULL);
        
        SELECT * INTO #T4 FROM #T1;
        SELECT * INTO #T5 FROM @T2;
        SELECT * INTO #T6 FROM #T3;
     
        -- Does not prevent caching
        DROP TABLE #T1, #T4, #T6;
        TRUNCATE TABLE #T2;
        TRUNCATE TABLE #T5;
        UPDATE STATISTICS #T3;
     
        WAITFOR DELAY '00:00:01'
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    EXECUTE dbo.Demo;

    There are nine cached objects again:

    DROP TABLE is not DDL

    Concurrent executions

    If a stored procedure is executed concurrently, multiple separate cached objects may be created in tempdb.  There is one cached plan for the procedure but one cached temporary object per execution context derived from that cached plan.  Recall that execution contexts are relatively lightweight instances of a cached plan, populated with execution-specific data such as temporary object ids and parameter values (image reproduced from the plan caching white paper):

    Cached Plans and Execution Contexts

    The runtime contents of a temporary object (table or variable) are obviously specific to a particular execution, so it makes sense for the cached object to be associated with execution contexts rather than the parent plan.  There may also be more than one cached plan for a procedure in cache (for example due to compilations with different SET options) and each parent plan will have its own collection of execution contexts, so there can be one cached tempdb object per execution context per plan.

    There does not appear to be a fixed limit on the number of these cached objects; I was able to quickly create 2,000 of them using the test procedures above and Adam Machanic’s SQL Query Stress tool running 200 threads.  This is the reason for the 1 second delay in the procedure – to make sure the procedure runs for a little while so new execution contexts are generated for each execution rather than reused.  The contents of tempdb after that test were as follows:

    2000 cached temporary tables

    Statistics on cached temporary objects

    Any auto-stats created are linked to the cached temporary table:

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #T1 (dummy int NULL);
     
        DECLARE @dummy int;
     
        -- Trigger auto-stats
        SELECT @dummy = dummy 
        FROM #T1
        WHERE dummy > 0;
     
        WAITFOR DELAY '00:00:01'
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    EXECUTE dbo.Demo;
    GO
    SELECT
        t.name,
        t.[object_id],
        s.name,
        s.auto_created
    FROM tempdb.sys.tables AS t
    JOIN tempdb.sys.stats AS s ON
        s.[object_id] = t.[object_id];

    Cached temporary object statistics

    In the case of multiple execution contexts, you might be wondering if each of the tempdb objects can have auto-created statistics associated with them.  The answer is no: auto-stats are used to compile the parent plan, and execution contexts are derived from that same plan as needed.  The metadata looks a little odd though; the statistics are explicitly linked to the object id of the cached temporary object that caused the auto-stats to be created.  Other cached tables for the same plan have different ids, and so do not link to the sys.stats entry.

    Statistics created using an explicit CREATE STATISTICS statement are not linked to a cached temporary object, for the simple reason that CREATE STATISTICS is considered DDL, and prevents caching from occurring in the first place.

    Drop and Create in Detail

    The first time a procedure containing a cacheable temporary object is executed, the temporary object is created as normal, then renamed to the hexadecimal internal form described previously when the object is dropped (explicitly or implicitly at the end of the procedure).  On subsequent executions, the cached object is renamed to the normal user-visible name when ‘created’ in the procedure, and renamed back to the internal form when it is ‘dropped’.  The following script demonstrates the creation and renaming of cached temporary objects:

    USE tempdb;
    GO
    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #Demo (i int);
     
        SELECT 
            t.name,
            t.object_id,
            t.type_desc,
            t.create_date
        FROM sys.tables AS t
        WHERE
            t.name LIKE N'#Demo%';
     
        DROP TABLE #Demo;
     
        SELECT
            t.name,
            t.object_id,
            t.type_desc,
            t.create_date
        FROM sys.tables AS t 
        WHERE
            t.name LIKE N'#________';
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    CHECKPOINT;
    EXECUTE dbo.Demo;
    GO
    SELECT
        fd.[Current LSN],
        fd.Operation,
        fd.AllocUnitName,
        fd.[Transaction Name],
        fd.[Transaction ID],
        CONVERT(sysname, SUBSTRING(fd.[RowLog Contents 0], 3, 256)),
        CONVERT(sysname, SUBSTRING(fd.[RowLog Contents 1], 3, 256))
    FROM sys.fn_dblog(NULL, NULL) AS fd;

    The first time it is run the first part of the output is:

    Name remapping

    Notice that the object ids are the same, and the object has familiar external name while in scope, but is renamed after the DROP TABLE statement. The transaction log entries displayed are (click to enlarge):

    Transaction Log for First Execution

    The highlighted section shows the table being renamed in the internal catalogue tables from the user-visible name to the internal name. On the second run, there is an extra renaming log entry in the CREATE TABLE system transaction, as the object is renamed from the internal form to the user-visible name:

    Transaction Log 2

    The demonstration shows that CREATE TABLE and DROP TABLE for a cached temporary object are replaced by renaming operations.

    Cached Object Scope

    The cached object is scoped to the query plan that references it.  If the plan is evicted from cache for any reason (perhaps by ALTER or DROP PROCEDURE or an explicit DBCC FREEPROCCACHE command) a background thread removes the tempdb object. This is not synchronous to the command that causes the eviction; the delay seems to be 5 seconds or less on current SQL Server versions, and is performed by a system process id.  The following code shows the link between the cached temporary table and the cached plans for the stored procedure:

    USE tempdb;
    GO
    IF  OBJECT_ID(N'dbo.Demo', N'P') IS NOT NULL
        DROP PROCEDURE dbo.Demo;
    GO
    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        CREATE TABLE #Demo (i int);
    END;
    GO
    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    EXECUTE dbo.Demo;
    GO
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
    GO
    SELECT
        decp.cacheobjtype,
        decp.objtype,
        decp.plan_handle,
        dest.[text]
    FROM sys.dm_exec_cached_plans AS decp
    CROSS APPLY sys.dm_exec_plan_attributes(decp.plan_handle) AS depa
    CROSS APPLY sys.dm_exec_sql_text(decp.plan_handle) AS dest
    WHERE
        decp.cacheobjtype = N'Compiled Plan'
        AND decp.objtype = N'Proc'
        AND depa.attribute = N'objectid'
        AND CONVERT(integer, depa.value) = OBJECT_ID(N'dbo.Demo', N'P');

    Example output:

    Link to the cached plan

    Aside from checking for suitably-named objects in tempdb, there are a number of ways to see how many cached temporary objects (tables and variables) exist, and how many are in use by executing code.  One way is to query the sys.dm_os_memory_cache_counters DMV:

    SELECT
        domcc.name,
        domcc.[type],
        domcc.entries_count,
        domcc.entries_in_use_count
    FROM sys.dm_os_memory_cache_counters AS domcc
    WHERE domcc.[type] = N'CACHESTORE_TEMPTABLES';

    Cache Counters DMV

    Another way is to check the performance counters (also accessible via DMV):

    SELECT
        dopc.[object_name], 
        dopc.counter_name, 
        dopc.cntr_value
    FROM sys.dm_os_performance_counters AS dopc
    WHERE
        dopc.[object_name] LIKE N'MSSQL%Plan Cache%'
        AND dopc.instance_name = N'Temporary Tables & Table Variables'
        AND dopc.counter_name IN (N'Cache Object Counts', N'Cache Objects in use');

    Perf Counters 1

    This second example shows 200 cached objects on a different run:

    Perf Counters 2

    On SQL Server 2008 and later, we can evict a particular plan handle from cache to show that this removes the cached temporary table:

    DBCC FREEPROCCACHE(0x05000200CC7BDE0940E16D8C000000000000000000000000);

    As mentioned, there can be a delay of up to 5 seconds before the cached object is removed after the DBCC statement completes (though the DMVs reflect the cache changes immediately).  The more comprehensive test below shows all these things combined:

    -- Cache a temporary object
    EXECUTE dbo.Demo;
    GO
    -- Show the cached table
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
     
    DECLARE @plan_handle varbinary(64);
     
    -- Find the plan handle
    SELECT
        @plan_handle = decp.plan_handle
    FROM sys.dm_exec_cached_plans AS decp
    CROSS APPLY sys.dm_exec_plan_attributes(decp.plan_handle) AS depa
    CROSS APPLY sys.dm_exec_sql_text(decp.plan_handle) AS dest
    WHERE
        decp.cacheobjtype = N'Compiled Plan'
        AND decp.objtype = N'Proc'
        AND depa.attribute = N'objectid'
        AND CONVERT(integer, depa.value) = OBJECT_ID(N'dbo.Demo', N'P');
     
    -- Truncate the log
    CHECKPOINT;
     
    -- Evict the plan
    DBCC FREEPROCCACHE(@plan_handle);
     
    -- Show log entries
    SELECT
        fd.[Current LSN],
        fd.SPID,
        fd.Operation,
        fd.Context,
        fd.AllocUnitName,
        fd.[Transaction Name],
        fd.[Transaction ID],
        fd.[Begin Time],
        fd.[End Time]
    FROM sys.fn_dblog(NULL,NULL) AS fd;
     
    -- Cached object not dropped yet
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
     
    WAITFOR DELAY '00:00:05';
     
    -- Show cleanup
    SELECT
        fd.[Current LSN],
        fd.SPID,
        fd.Operation,
        fd.Context,
        fd.AllocUnitName,
        fd.[Transaction Name],
        fd.[Transaction ID],
        fd.[Begin Time],
        fd.[End Time]
    FROM sys.fn_dblog(NULL,NULL) AS fd;
     
    -- Gone!
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';

    The first result set shows the cached temporary object:

    Cached Temp Object

    The transaction log entries immediately after evicting the plan from cache show no activity aside from the CHECKPOINT we issued to truncate the log:

    Empty Transaction Log

    Then we see that the cached object still exists at this point (though the DMVs now show zero cached temporary objects):

    Cached Object Still Exists

    After an up-to-five-second delay, the transaction log contains:

    Asynchronous Log Activity

    Notice the system transaction named ‘droptemp’ is performed by system SPID 14, and instead of the renaming we saw earlier, all references to the cached object are deleted from the system tables.

    More about Table Variables

    You might recognise the internal hexadecimal name for cached temporary objects; the same is used for table variables outside a stored procedure:

    DECLARE @T AS TABLE (dummy int NULL);
     
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
    GO 5

    The batch runs five times and produces output like this:

    Table Variable Output

    Notice that the object id is different each time (and so, therefore, is the name).  As already mentioned, table variables in a stored procedure can be cached just like temporary tables:

    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        DECLARE @T AS TABLE (dummy int NULL);
     
        SELECT
            t.name,
            t.[object_id],
            t.type_desc,
            t.create_date
        FROM tempdb.sys.tables AS t
        WHERE
            t.name LIKE N'#________';
    END
    GO
    EXECUTE dbo.Demo;
    GO 5

    Now, though, we see the same object id and create date each time:

    Cached Table Variables

    Once a table variable is cached, the transaction log records for a simple procedure are quite interesting:

    DBCC FREEPROCCACHE;
    WAITFOR DELAY '00:00:05';
    GO
    CREATE PROCEDURE dbo.Demo
    AS
    BEGIN
        DECLARE @T AS TABLE (dummy int NULL);
     
        INSERT @T VALUES (1);
    END
    GO
    -- Cache the table variable
    EXECUTE dbo.Demo;
    GO
    SELECT
        t.name,
        t.[object_id],
        t.type_desc,
        t.create_date
    FROM tempdb.sys.tables AS t
    WHERE
        t.name LIKE N'#________';
    GO
    CHECKPOINT;
    GO
    EXECUTE dbo.Demo;
    GO
    SELECT
        fd.[Current LSN],
        fd.Operation,
        fd.AllocUnitName,
        fd.[Transaction Name],
        fd.[Transaction ID]
    FROM sys.fn_dblog(NULL, NULL) AS fd;

    Table Variable Transaction Log

    Notice the reference to the internal name #628FA481 and the clean up activity.  The same procedure with a temporary table instead of a table variable generates a bit more work for the server:

    Temporary Table Transaction Log

    Many of the entries are similar to the table variable case, with extra steps to rename the cached object when CREATE TABLE and the implicit DROP TABLE at the end of the procedure are executed.  Clearly, some efforts have been made to make table variables more efficient than temporary tables, while sharing many features in common at quite a low level.

    Another interesting thing, as I mentioned right at the start of this post, is that table variables disallow just about all of the actions that prevent caching of a temporary table.  Table variables do not allow named constraints or DDL that affects caching (e.g. CREATE INDEX, CREATE STATISTICS).  Table variables are also scoped more tightly than temporary tables.  While we can create a temporary table in one procedure, and refer to it in  another, the same thing cannot be done with table variables.  For the same scoping reasons, table variables cannot be defined using dynamic SQL and referenced outside that context.  One oddity is TRUNCATE TABLE; disallowed by table variables, but which does not affect caching.

    Anyway, the restrictions mean that table variables can always be cached, and don’t allow some of the crazy things that are possible with temporary tables (particularly as regards scoping, but also the cached statistics issue I described in my last post).  They also have the potential to perform better (no hidden renaming) at least in some high-volume circumstances.  If only we were able to create statistics (with intuitive behaviour!) and indexes after creation, table variables might well make the old Sybase ‘non-shareable temporary tables’ finally redundant.  Until then, we are left having to choose one or the other as best we can.

    © 2012 Paul White
    Twitter: @SQL_Kiwi
    Email: SQLkiwi@gmail.com

    image454

  • Temporary Tables in Stored Procedures

    Ask anyone what the primary advantage of temporary tables over table variables is, and the chances are they will say that temporary tables support statistics and table variables do not.  This is true, of course; even the indexes that enforce PRIMARY KEY and UNIQUE constraints on table variables do not have populated statistics associated with them, and it is not possible to manually create statistics or non-constraint indexes on table variables.  Intuitively, then, any query that has alternative execution plans to choose from ought to benefit from using a temporary table rather than a table variable.  This is also true, up to a point.

    The most common use of temporary tables is in stored procedures, where they can be very useful as a way of simplifying a large query into smaller parts, giving the optimizer a better chance of finding good execution plans, providing statistical information about an intermediate result set, and probably making future maintenance of the procedure easier as well.  In case it is not obvious, breaking a complex query into smaller steps using temporary tables makes life easier for the optimizer in several ways.  Smaller queries tend to have a smaller number of possible execution plans, reducing the chances that the optimizer will miss a good one.  Complex queries are also less likely to have good cardinality (row count) estimates and statistical information, since small errors tend to grow quickly as more and more operators appear in the plan.

    This is a very important point that is not widely appreciated.  The SQL Server query optimizer is only as good as the information it has to work with.  If cardinality or statistical information is badly wrong at any point in the plan, the result will most likely be a poor execution plan selection from that point forward.  It is not just a matter of creating and maintaining appropriate statistics on the base tables, either.  The optimizer does use these as a starting point, but it also derives new statistics at every plan operator, and things can quickly conspire to make these (invisible) derived statistics hopelessly wrong.  The only real sign that something is wrong (aside from poor performance, naturally) is that actual row counts vary widely from the optimizer’s estimate.  Sadly, SQL Server does not make it easy today to routinely collect and analyse differences between cardinality estimates and runtime row counts, though some small (but welcome) steps forward have been made in SQL Server 2012 with new row count information in the sys.dm_exec_query_stats view.

    The benefits of using simplifying temporary tables where necessary are potentially better execution plans, now and in the future as data distribution changes and new execution plans are compiled.  On the cost side of the ledger we have the extra effort needed to populate the temporary table, and maintain the statistics.  In addition, we expect a higher number of recompilations for optimality reasons due to changes in statistics.  In short, we have a trade-off between potential execution plan quality and maintenance/recompilation cost.

    The problem, though, is that temporary tables do not work quite how (almost) everyone expects them to…

    A World Without Temporary Objects

    Imagine if temporary tables and table variables did not exist, and we had a complex query with cardinality-estimation difficulties that would be best addressed by breaking the query into parts.  The example presented below is necessarily simplified, and uses the AdventureWorks sample database to make it accessible.  It is not a query with insurmountable optimization problems by any means, but bear with me.

    IF  OBJECT_ID(N'dbo.Demo', N'P') IS NOT NULL
        DROP PROCEDURE dbo.Demo;
    GO
    CREATE PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        SELECT
            p.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM Production.Product AS p
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = p.ProductID
        WHERE
            p.Name LIKE @StartsWith + N'%'
        GROUP BY
            p.Name;
    END;

    This procedure returns a count of orders containing a part whose name starts with the specification passed in as a parameter.  For example:

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'E';

    image

    A Short Digression about Parameter Sniffing

    The only real optimization issue in this simplified example is parameter-sniffing.  The number of rows that match the LIKE predicate varies widely depending on the value of the parameter, which can impact execution plan selection.  When the SELECT query is compiled or recompiled, SQL Server uses the actual value of the parameter at the time to estimate the number of rows qualified from the Product table.  If the SELECT statement happens to compile or recompile with a parameter that matches very few rows, we get an execution plan like this (click to enlarge the graphic):

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'[0-9]';

    Parameter Sniffing [0-9]

    This plan is cached and reused next time the procedure is called, perhaps with a parameter value that matches more rows than before, but not enough to cause a recompilation:

    EXECUTE dbo.Demo @StartsWith = N'M';

    Parameter Sniffing M

    This results in a large number of Key Lookups and a Sort that spills to tempdb physical disk (the warning triangle is new in SQL Server 2012).  The memory grant for the sort was sized for the expected 370 rows, not the 22,527 that actually turned up.  (If you are curious about the Constant Scan and Compute Scalar in these plans, see my previous post on dynamic seeks for details).

    One way (certainly not the only way) to address this parameter-sniffing problem is to ask SQL Server to compile a fresh query plan for the SELECT statement on every execution, using the OPTION (RECOMPILE) query hint:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        SELECT
            p.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM Production.Product AS p
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = p.ProductID
        WHERE
            p.Name LIKE @StartsWith + N'%'
        GROUP BY
            p.Name
        OPTION (RECOMPILE);
    END;

    Now we get plans optimized for the particular parameter value at runtime, and (in SQL Server 2008 SP1 CU5 or later) the added bonus that the runtime value of the parameter is embedded into the query so the dynamic seek is not required:

    EXECUTE dbo.Demo @StartsWith = N'[0-9]';

    Parameter Sniffing with RECOMPILE

    EXECUTE dbo.Demo @StartsWith = N'M';

    Parameter Sniffing with RECOMPILE 2

    Back to the Main Example

    However, we are imagining that the example query is much more complex than shown, and we decide to break it into two parts.  The first part will fetch qualified items from the Products table, and the second will join the results (with statistics) to the History table.  Remember that temporary objects do not exist for this part of the discussion, so we have to use a permanent table:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        -- Real table to hold intermediate result
        CREATE TABLE dbo.Temp
        (
            ProductID   integer NOT NULL, 
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        -- First part of the 'complex' query
        INSERT INTO dbo.Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        -- Second part referencing the 'temp' table
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM dbo.Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name;
     
        -- Show the statistics for the Name column
        DBCC SHOW_STATISTICS (Temp, Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE dbo.Temp;
    END;

    This procedure is not very practical as it stands, because the CREATE TABLE and DROP TABLE statements would make it impossible for more than one user to execute it at the same time.  Nevertheless, it does show how most people expect things to work regarding statistics, temporary table lifetimes, and compilation:

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'A';

    There are three matching Product records, shown below together with the SHOW_STATISTICS output:

    Permanent Table Output A

    Nothing too surprising in the execution plans: the table INSERT is a trivial plan (no choices for the optimizer to make), and the SELECT query has cardinality estimates that are very close to the actual numbers encountered.  The warning triangle on the SELECT (again, new for SQL Server 2012) is just suggesting an index for the History table, which we won’t be adding on this occasion.

    Permanent Execution Plan A

    While the procedure was executing, a Profiler trace captured the following compilation and statistics activity:

    image

    The table is created, and since this is the first time the procedure has been run, the INSERT and SELECT statements both incur a deferred compilation.  As part of the SELECT compilation, new statistics are created on the Name and ProductID columns of the new table, the DBCC SHOW_STATISTICS command executes, and finally the table is dropped.  Now we execute the procedure again but for products that start with ‘E’ rather than ‘A’ as above:

    EXECUTE dbo.Demo @StartsWith = N'E';

    Permanent Output E

    This time there are nine matching records, though the statistics-gathering algorithms compress the information into the five histogram steps shown.  The execution plan is below (trivial INSERT plan not repeated):

    image

    Again, the cardinality estimates are extremely good, and the optimizer selected a nested loops plus key-lookup plan based on costs computed according with the smaller number of expected rows.  The trace captured the following:

    image

    The only change is that this time recompilation is triggered because the cached INSERT and SELECT plans reference a table that no longer exists, leading to a ‘schema changed’ recompilation.  The crucial observations here are that the table is recreated and new statistics are gathered on every execution.  This is the way most people expect tables, statistics, and compilations to behave inside stored procedures, so don’t worry if there’s nothing too surprising so far.

    Using a Temporary Table

    The previous example is not terribly practical.  It is possible to conjure up schemes where ordinary tables can serve the purpose effectively, but as it stands, the procedure would likely cause runtime errors if more than one user or process attempted to execute it concurrently.  In addition, ordinary tables do not benefit from engine optimizations that only apply to genuine temporary tables, so they tend to be less efficient, even if created in tempdb.  We now return to the real world (where temporary tables do exist) and modify the procedure to use a temporary table instead of a permanent table:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        CREATE TABLE #Temp
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        INSERT INTO #Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM #Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name;
     
        DBCC SHOW_STATISTICS (N'tempdb..#Temp', Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE #Temp;
    END;

    Executing this procedure with an ‘E’ parameter produces exactly the same output as when the permanent table was used.  Same execution plan, same statistics, same Profiler output, everything:

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'E';

    Temporary Table Execution Plan E

    Now we will execute the procedure with a parameter value of ‘T’, which has 65 matching records in the Product table.  This is the output (only the first nine of the 65 matching names are shown in the first result set):

    image

    The results are as expected (products that begin with ‘T’) but the SHOW_STATISTICS output displays ‘E’ statistics from the previous execution!

    Temporary Table Execution Plan T

    The execution plan has serious cardinality estimation errors.  It has the same nested loops and key-lookup shape and estimated row counts as the ‘E’ execution, instead of being optimized for the ‘T’ data.  This causes slow performance and a sort that spills data to physical tempdb disk.

    image

    The Profiler output explains why: no recompilations occurred, and no statistics were created.  In fact, with the AdventureWorks database, there is no way to persist a different execution plan or update the statistics for this procedure without changing it (more on that later).  Even executing with [A-Z] as the parameter reuses the plan optimized for the ‘E’ value:

    EXECUTE dbo.Demo @StartsWith = N'[A-Z]';

    Temporary Table Execution Plan A-Z

    The key lookup is now executed 113,443 times, and the sort has to perform a multi-pass spill to tempdb (previously only a single pass spill was required).  Executing the procedure with the WITH RECOMPILE option will produce better plans, at the cost of recompiling the whole procedure, and the resulting plans will not be cached for reuse.  In general, recompiling the whole procedure every time could be bad since many procedures have many more statements to recompile than just a single INSERT and SELECT.  Although we have not yet identified a cause for the behaviour we are seeing, perhaps the answer lies in adding a RECOMPILE query hint to the SELECT statement as we did earlier for the parameter-sniffing example?

    OPTION (RECOMPILE) to the Rescue?

    Thinking that the cause might be a new form of the parameter-sniffing problem, we now modify the procedure to add the RECOMPILE query hint to just the SELECT statement.  This avoids recompiling the whole procedure on each call as EXECUTE … WITH RECOMPILE would do, and will hopefully solve our optimization problems:

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        CREATE TABLE #Temp
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        INSERT INTO #Temp
            (ProductID, Name)
        SELECT
            p.ProductID,
            p.Name
        FROM Production.Product AS p
        WHERE
            p.Name LIKE @StartsWith + N'%';
     
        SELECT
            t.Name,
            OrderCount = COUNT_BIG(DISTINCT th.ReferenceOrderID)
        FROM #Temp AS t
        JOIN Production.TransactionHistory AS th ON
            th.ProductID = t.ProductID
        GROUP BY
            t.Name
        OPTION (RECOMPILE);
     
        DBCC SHOW_STATISTICS (N'tempdb..#Temp', Name) WITH STAT_HEADER, HISTOGRAM;
     
        DROP TABLE #Temp;
    END;

    Unsurprisingly, clearing the plan cache and executing with a parameter value of ‘E’ produces the exact same nested loops and key lookup plan as previously shown (and the same statistics and Profiler events too).

    DBCC FREEPROCCACHE;
    EXECUTE dbo.Demo @StartsWith = N'E';

    This is fine because it is the optimal plan for this value.  The test for OPTION (RECOMPILE) comes when we try a second execution with the 65-row ‘T’ value:

    EXECUTE dbo.Demo @StartsWith = N'T';

    Temporary Table Execution Plan RECOMPILE

    The execution plan has a different shape now, with the optimizer correctly preferring a hash join over the nested loops plus key lookup.  The estimated cardinality is correct for the temporary table (65 rows versus 9 previously seen for ‘E’), but it is badly wrong for the hash join (811 rows expected, 12,273 actual) and the distinct sort, which has to perform a single-pass spill to tempdb disk.  The query output shows the reason for the inaccuracy:

    Temporary Table RECOMPILE T

    The results are correct of course (65 products starting with ‘T’) but the statistics are again still reflecting the parameter value ‘E’.  The RECOMPILE query hint allows the optimizer to see the true cardinality of the temporary table (65 rows) but the 100% wrong statistics on the ProductID column used to estimate the cardinality of the join to the History table result in a mess of a plan from that point forward.  (Note that statistics on the Name column are also considered interesting by the optimizer for this query due to the GROUP BY clause.  The SHOW_STATISTICS output above shows the Name statistics because it more immediately shows that ‘E’ values are stored; the ProductID statistics are integers, and much harder for the reader to correlate with the parameter).

    image

    The Profiler output above confirms the recompilation due to the query hint, and that no statistics were created or updated.  An interesting thing occurs if we execute the procedure four more times with the same ‘T’ parameter (a total of five times with ‘T’):

    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';
    EXECUTE dbo.Demo @StartsWith = N'T';

    The first three of those executions are exactly as before, but the fourth one suddenly produces this different execution plan:

    Temporary Table RECOMPILE T5

    The estimates are now well within the boundaries of expected accuracy, and the sort now receives enough workspace memory to avoid spilling rows to tempdb.

    image

    The statistics have been updated, and now show 65 rows that start with ‘T’ sampled (compressed into 49 histogram steps).  This is confirmed by the trace output:

    image

    Notice the StatMan and AutoStats entries showing the ProductID and Name statistics being updated.  Unfortunately, executing the procedure again with a new parameter value takes us back to inaccurate cardinality estimates again.  The execution plans for this procedure aren’t always a disaster, but this is pure luck – poor estimates are poor estimates, whether the query plan happens to perform acceptably or not.  We really need to get to the root cause here.  Perhaps OPTION (RECOMPILE) was not the way to go, and we should try a manual statistics update instead of relying on automatic statistics updates…?

    Manual Statistics Update to the Rescue?

    We know we want to add an UPDATE STATISTICS command to our stored procedure instead of the OPTION (RECOMPILE) … but where to put the new statement exactly?  The statistics we want to update are created during compilation or recompilation of the SELECT statement, suggesting the UPDATE STATISTICS should come after the SELECT; but then how could it possibly affect the preceding SELECT plan?  It seems the only sensible place to put it is before the SELECT, however counter-intuitive it might seem.

    ALTER PROCEDURE dbo.Demo
        @StartsWith nvarchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
     
        CREATE TABLE #Temp
        (
            ProductID   integer NOT NULL,
            Name        nvarchar(50) COLLATE DATABASE_DEFAULT NOT NULL
        );
     
        INSERT INTO #Temp
            (ProductID, Name)