
Paul White: Page Free Space

A technical SQL Server blog from New Zealand. See also my articles on SQLperformance.com

Execution Plan Analysis: The Mystery Work Table


I love SQL Server execution plans. It is often easy to spot the cause of a performance problem just by looking at one. The task is considerably easier if the plan includes run-time information (a so-called ‘actual’ execution plan), but even a compiled plan can be very useful. Nevertheless, there are still times where the execution plan does not tell the whole story, and we need to think more deeply about query execution to really understand a performance problem. This post looks at one such example, based on a recent question posted on the SQL Performance Q & A site.

The Execution Plan

Original Query Plan

This plan is reasonably large (20MB cached plan size) but not massively complex once you break it down (click on the image above to view it full-size in a new window). The context of the question is that this query usually executes in less than a minute, but sometimes it runs for nearly twenty minutes – though the plan appears identical.

High-Cost Operators

There are many different things to look for in execution plans. What you choose to look at first is as much a matter of personal preference as anything, but many people are drawn to high-cost operators, so I will start there. In this plan, the cost of one operator dominates all others, shown as being responsible for 100% of the cost of the query. It is highlighted in red in Plan Explorer; I have expanded the relevant plan section (the top right) below:

100% operator cost

There is no doubt that this seek is a busy little thing. It is executed 249,484 times, though it only produces a grand total of 167,813 rows over all iterations of the loop join – an average of just 0.7 rows per seek. There are all sorts of interesting details in the plan about this seek – I could write a whole blog post about it – but two details that stand out are the “Force Seek: True” and “Partitioned: True” attributes. These tell us that the base table is partitioned, and the query writer had to use a FORCESEEK table hint to get this plan.
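For reference, FORCESEEK is an ordinary table hint. A minimal sketch of the syntax (written against the dbo.Numbers and dbo.T1 tables created for the repro later in this post, not against the original query) looks like this:

-- Illustrates the hint only; dbo.Numbers and dbo.T1 are created later in this post
SELECT n.n, t.id
FROM dbo.Numbers AS n
JOIN dbo.T1 AS t WITH (FORCESEEK)   -- force an index seek on the inner-side table
    ON t.id = n.n
WHERE n.n BETWEEN 1 AND 10000;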

Without this hint, the optimizer would almost certainly choose a Hash Match or Merge Join rather than Nested Loops. This is understandable given the optimizer’s cost model and the simplifying assumptions it makes (such as assuming every query starts with a cold buffer cache). That’s fine, but we can see from the query plan that the inner-side table has 643 million rows. Left to its own devices, the optimizer would estimate that it would be faster to perform a sequential scan of 643 million rows (with large-block read-ahead) than it would be to run a quarter-million randomly-distributed seeks driven by a Nested Loops join.

I doubt that the optimizer’s reasoning here is sound (at least on any reasonably modern hardware) but there we go. The query author probably knows that a good fraction of this table is likely to be in cache, so with all that in mind, I think we can reasonably assume at this stage that the FORCESEEK hint is genuinely needed here, and this part of the plan is at least reasonably optimal.

Important note: The seek certainly does not account for 100% of the runtime cost of this query. Remember cost percentages are always estimates – even in ‘actual’ plans. It can be useful to check the reasons for high-estimated-cost operators, but they should never be used as a primary tuning metric.

Execution Warnings

Sort Warning

This query was executed on SQL Server 2012, so there is a handy warning triangle on the Sort operator indicating that one or more sort runs had to be spilled to physical tempdb disk. The plan clearly shows this spilling is a result of an inaccurate cardinality estimate at the Filter operator (the estimates are not bad at all prior to this). The Sort expects 9,217 rows totalling approximately 5MB, but actually encountered 61,846 rows in 35MB. As you may know, memory for sorts and hashes is allocated before execution starts, and generally cannot expand dynamically at run time.

The spilled sort is undesirable, of course, but it is unlikely to be a major cause of the occasional poor performance given the small size of the spilled data. Nevertheless, this might be a good place to split this query up. The idea would be to write the results of the query (up to and including the Filter) to a temporary heap table using SELECT INTO, and then create a clustered index with the same keys as the Sort operator. The temporary table would not be large, and may well perform better overall than the spilled sort, including the cost of creating the clustered index. Of course, creating this index will involve a sort, but it will be one based on the known cardinality of the temporary heap table. The part of the plan that could be replaced by a temporary table is shown below:

Plan subtree replaced with a temp table
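In T-SQL terms, the split might look something like the sketch below. The object and column names are hypothetical (the original query text is not reproduced in this post); the point is simply a SELECT INTO followed by a clustered index keyed the same way as the Sort:

-- Sketch only: hypothetical object names
-- Materialize the result of the plan section up to and including the Filter
SELECT
    src.SortKey,
    src.Payload
INTO #Filtered
FROM dbo.FilteredSource AS src;   -- stands in for the first part of the original query

-- Index the temporary table on the same keys the Sort operator used,
-- so the rest of the query can consume the rows in sorted order
CREATE CLUSTERED INDEX cuq ON #Filtered (SortKey);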

I am a big fan of simplifications like this. Smaller execution plans tend to optimize better for all sorts of reasons, and the source code usually becomes easier to maintain as well. I should mention there is another warning triangle in the 2012 execution plan (shown on the root icon), which relates to some implicit conversions that I will mention later.

I/O Information

The execution plan was captured with Plan Explorer, so we can also easily see I/O statistics for the two executions. The first is for a fast (sub-60-second) run:

I/O data - fast

Overall, these I/O numbers show pretty much what we would expect: a decent number of logical reads associated with the seeks into the Trans table (though certainly not 100% of the total, despite what the estimated cost percentages suggest), a very small number of physical reads, and a small amount of read-ahead activity on a couple of tables.

The second set of I/O data is from a slow run (18 minutes or so):

I/O data - slow

The very obvious difference is the appearance of a work table, with 178 million logical reads and 130 million LOB logical reads. It seems very likely this work table, and its 300 million logical reads, is responsible for the dramatic decrease in query performance. But given that the execution plans are identical (right down to the XML) what is causing this?

My answer to that question (on the Q & A site) was that it is related to the increased level of read-ahead activity, but to see why that is the case, we will need to reproduce the issue and dig a bit deeper.

Execution Outline

Before we really get going on this, it will be useful to take a look at what the execution plan is doing in outline. We saw the first part of the plan earlier when looking at the spilling sort. The data set at that point (which we would like to write to a temporary table, remember) essentially represents source data for a second query, which uses a series of Nested Loops Left Joins to lookup information from other tables:

Nested Loop Lookups

The inner side of each join involves some reasonably involved logic, which is thankfully not important to the present discussion. What is important is that the result of each lookup is a LOB data type. This begins to shed some light on the LOB logical reads reported against the work table, but it does not explain why the work table (and the 300 million associated reads) do not appear when the query runs quickly (with the same execution plan).

Reproducing the problem

Table Creation

The first part of the repro involves creating six tables that represent the lookup tables in the original query plan. Each table will have 10,000 rows, consisting of a sequential reference number and a second column containing a 2,048-character single-byte string. The source table used to drive the lookups will be a regular Numbers table containing just a single integer column.
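The Numbers table is not part of the original script; if you do not already have one, a minimal version sufficient for this repro might be:

-- A minimal Numbers table (assumed; not part of the original script)
CREATE TABLE dbo.Numbers (n integer NOT NULL PRIMARY KEY);
 
INSERT dbo.Numbers WITH (TABLOCKX) (n)
SELECT TOP (10000)
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2;

With that in place, the lookup tables and their data: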

CREATE TABLE dbo.T1 (id integer IDENTITY PRIMARY KEY, d char(2048));
CREATE TABLE dbo.T2 (id integer IDENTITY PRIMARY KEY, d char(2048));
CREATE TABLE dbo.T3 (id integer IDENTITY PRIMARY KEY, d char(2048));
CREATE TABLE dbo.T4 (id integer IDENTITY PRIMARY KEY, d char(2048));
CREATE TABLE dbo.T5 (id integer IDENTITY PRIMARY KEY, d char(2048));
CREATE TABLE dbo.T6 (id integer IDENTITY PRIMARY KEY, d char(2048));
GO
INSERT dbo.T1 WITH (TABLOCKX)
SELECT REPLICATE('A', 2048)
FROM dbo.Numbers AS n WHERE n.n BETWEEN 1 AND 10000;
GO
INSERT dbo.T2 WITH (TABLOCKX)
SELECT REPLICATE('B', 2048)
FROM dbo.Numbers AS n WHERE n.n BETWEEN 1 AND 10000;
GO
INSERT dbo.T3 WITH (TABLOCKX)
SELECT REPLICATE('C', 2048)
FROM dbo.Numbers AS n WHERE n.n BETWEEN 1 AND 10000;
GO
INSERT dbo.T4 WITH (TABLOCKX)
SELECT REPLICATE('D', 2048)
FROM dbo.Numbers AS n WHERE n.n BETWEEN 1 AND 10000;
GO
INSERT dbo.T5 WITH (TABLOCKX)
SELECT REPLICATE('E', 2048)
FROM dbo.Numbers AS n WHERE n.n BETWEEN 1 AND 10000;
GO
INSERT dbo.T6 WITH (TABLOCKX)
SELECT REPLICATE('F', 2048)
FROM dbo.Numbers AS n WHERE n.n BETWEEN 1 AND 10000;

The next step is to ensure that each lookup table is optimally organized for read-ahead:

ALTER TABLE dbo.T1 REBUILD WITH (MAXDOP = 1);
ALTER TABLE dbo.T2 REBUILD WITH (MAXDOP = 1);
ALTER TABLE dbo.T3 REBUILD WITH (MAXDOP = 1);
ALTER TABLE dbo.T4 REBUILD WITH (MAXDOP = 1);
ALTER TABLE dbo.T5 REBUILD WITH (MAXDOP = 1);
ALTER TABLE dbo.T6 REBUILD WITH (MAXDOP = 1);

Test Query

The original query translates into our simplified test rig as:

DECLARE @d nvarchar(max) = NCHAR(10000);
 
SELECT n.n,
    DATALENGTH
    (
        CONCAT
        (
            (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T1 AS t WHERE t.id = n.n),
            (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T2 AS t WHERE t.id = n.n),
            (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T3 AS t WHERE t.id = n.n),
            (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T4 AS t WHERE t.id = n.n),
            (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T5 AS t WHERE t.id = n.n),
            (SELECT CONCAT(t.d, t.d, t.d, t.d, t.d, t.d, @d) FROM dbo.T6 AS t WHERE t.id = n.n)
        )
    )
FROM dbo.Numbers AS n
WHERE n.n BETWEEN 1 AND 10000
ORDER BY n.n
OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);

The broad idea is to concatenate our 2048-character column to itself five times and include a Unicode character that was used in the original query as a delimiter that could not appear in the source data. Each lookup performs the same basic operation against its target table, and the final output is the concatenation of all the intermediate results. The query hints are necessary to get the right plan shape, simply because my test rig tables are so much smaller than the real ones.

Note that the Unicode delimiter means the 2048-character single-byte data is implicitly converted to Unicode, doubling in size. It is not a crucial feature of the test, but it did appear in the original query and explains the type conversion warnings in the execution plan I mentioned earlier. The execution plan for the test query is (click to enlarge if necessary):

Test query execution plan

I should also stress that the CONCAT operator (new in SQL Server 2012) is not crucial either. If you are using an earlier version of SQL Server, an equivalent query (for present purposes) is shown below. I’m going to stick with CONCAT for the remainder of the post, however.

DECLARE @d nvarchar(max) = NCHAR(10000);
 
SELECT n.n,
    DATALENGTH
    (
        (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T1 AS t WHERE t.id = n.n) +
        (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T2 AS t WHERE t.id = n.n) +
        (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T3 AS t WHERE t.id = n.n) +
        (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T4 AS t WHERE t.id = n.n) +
        (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T5 AS t WHERE t.id = n.n) +
        (SELECT @d+t.d+t.d+t.d+t.d+t.d+t.d FROM dbo.T6 AS t WHERE t.id = n.n)
    )
FROM dbo.Numbers AS n
WHERE n.n BETWEEN 1 AND 10000
ORDER BY n.n
OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);

Warm cache results

With all data in memory, the test query (in either form) completes in about 1.6 seconds on my laptop. The result shows that each output row contains 147,468 bytes of Unicode character data: six lookups, each producing six copies of the 2,048-character column plus the one-character delimiter, all converted to Unicode, so 6 × (6 × 2,048 + 1) × 2 bytes. A typical set of I/O statistics follows:

I/O statistics - warm cache

Nothing too exciting to see there, but this is just our baseline.

Cold cache results

-- Flush dirty pages to disk, then drop all clean pages from the buffer pool
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

With no data in memory, the test query now runs for 18.6 seconds – almost 12x slower. The I/O statistics show the expected (but still mysterious!) work table and its associated reads:

I/O statistics - cold cache

The Extended Events wait statistics show SQL Server spent very little of that time waiting on my laptop’s slow hard drive – just 402 ms:

Wait statistics - cold cache

Explanation

There are a number of factors in play here, which we will look at in turn.

Nested Loops Prefetching

One of the reasons the optimizer prefers Hash Match and Merge Join for larger inputs is that the data access patterns tend to favour large sequential read-ahead. Both hash and merge tend to scan (range-scan in the case of a seek) their inputs, and the SQL Server Storage Engine automatically issues read-ahead when it detects this type of access. There is nothing in the execution plan to show that a base table will be read with read-ahead, it just happens.

A very basic implementation of Nested Loops join would not benefit from read-ahead at all on its inner side. The outer (driving) side of the loops join might well be a scan or range-scan of an index, and so benefit from automatic read-ahead, of course. The inner side is executed once per outer row, resulting in a rapid succession of small index seeks for different values. These small seeks will typically not be large enough to trigger the automatic read-ahead mechanism. Indeed, in our test, each inner side seek is for precisely one value.

SQL Server improves on this by implementing a second read-ahead mechanism especially for Nested Loops joins (not all N-L joins, it is a cost-based decision the optimizer makes). The basic idea is to buffer extra rows from the outer side of the join, and to use the row values in the buffer to drive read-ahead for the inner side. The effect is that the Nested Loops join becomes a partly blocking operator as outer-side rows are read into the buffer and read-ahead issued based on buffered index key values.

This read-ahead may be either order-preserving or not, and is indicated in the execution plan by the Nested Loop attributes With Ordered Prefetch and With Unordered Prefetch, respectively. When unordered prefetch occurs, the inner side is processed in whatever order the asynchronous reads happen to complete. With ordered prefetching, the mechanism is careful to ensure that the order of rows entering the join is preserved on the output.
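These properties appear as attributes of the NestedLoops element in showplan XML, so a query along the following lines can locate cached plans that use prefetching (a sketch only, and potentially expensive to run against a large plan cache):

WITH XMLNAMESPACES
    (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    cp.usecounts,
    cp.objtype,
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE qp.query_plan.exist
    ('//NestedLoops[@WithOrderedPrefetch = "true" or @WithUnorderedPrefetch = "true"]') = 1;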

In the test rig, the ORDER BY clause means there is a need to preserve row order, so Ordered Prefetch is used:

Ordered Prefetch

The issue described in this post is not specific to ordered prefetching – the same behaviour is just as likely with unordered prefetching. The point is that Nested Loops prefetching is one of the requirements.

Documented trace flags 652 and 8744 may be used (with care, and after serious testing) to disable automatic read-ahead and Nested Loops prefetching respectively. This is sometimes beneficial where all data is expected to be in memory (in which case read-ahead processing consumes resources better used by query execution) or where the I/O subsystem is extremely fast. In case you were wondering, there is no background thread for prefetching – all the work of checking whether the data is in memory, and issuing I/O if not, is performed by the worker thread executing the query.
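For completeness, a sketch of enabling the flags in a test environment (both change behaviour instance-wide when enabled globally, so treat this as test-only code):

-- Test environments only
DBCC TRACEON (652, -1);    -- disable read-ahead (global)
DBCC TRACEON (8744, -1);   -- disable Nested Loops prefetching (global)
 
-- ...run the tests of interest...
 
DBCC TRACEOFF (652, -1);
DBCC TRACEOFF (8744, -1);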

I should stress that read-ahead and Nested Loops prefetching is generally A Very Good Thing with typical storage solutions (e.g. SANs) and both work best (or at all) when indexes have low logical fragmentation.

Manufactured LOBs

The issue described here also requires that a large object data type is manufactured before prefetching. The Compute Scalar operators in the test execution plan perform that function:

Manufactured LOB

By ‘manufactured’, I mean that the source columns are not LOB types, but the expression output is – notice the implicit conversion to nvarchar(max). To be clear about it, the issue we are analysing here does not occur when Nested Loops prefetching occurs with an expression that was a LOB type to begin with.

The Outer Join

The optimizer is quite good, generally speaking, at moving scalar expressions around. If the query had featured inner joins (whether by query design or through optimizer activities) the chances are quite good that the problematic expressions (the LOB manufacturers) would have moved beyond the prefetching, and so out of harm’s way. It is quite tricky to preserve NULL-extension and other outer-join semantics properly when moving expressions above an outer join, so the optimizer generally does not even try. In essence, the outer join represents an optimization barrier to the LOB-manufacturing expressions.

Memory Allocation

When Nested Loops prefetching occurs with a manufactured LOB, the question arises of where to store the created LOBs when buffering rows for prefetch. If the source data were already a LOB type, the execution engine would already have memory structures in place to handle them. When prefetching encounters a manufactured LOB, it needs to store it somewhere, since the engine is no longer processing a stream of one row at a time. It turns out that there is a small memory buffer set aside for this eventuality, which empirical tests show to be 24KB.

However, this 24KB (directly allocated, not via workspace memory grant) is shared across all concurrently executing prefetching joins in the query. With six such joins in the test rig plan and large manufactured LOBs, the buffer stands no chance. As a result, query execution engages a bail-out option: a work table created in tempdb. Though the pages of the worktable may in fact remain memory-resident, overheads (including latching and using general-purpose code interfaces for access to the buffered rows) mean this is very much slower than using the direct-memory cache.

As with most internal work tables, the logical reads reported on this work table indicate the number of rows processed (not 8KB pages, as for regular I/O statistics). This fact, together with the large number of items processed via the worktable in our test, accounts for the millions of reads reported.

The creation and use of the work table depends on run time conditions and timing. If execution finds the data it needs is already in memory, the prefetch checks are still performed, but no asynchronous read requests end up being posted. The 24KB buffer is never filled, so the need to create a work table never arises. The more prefetch that actually occurs, the higher the chances that the buffer will fill. It is quite possible to experience a low level of prefetch with manufactured LOBs without the engine needing to bail out to a work table, especially if the LOBs are not very big and the I/O system is quite fast.

Workaround

We can rewrite the query to avoid feeding manufactured LOB data to the prefetch buffer. The idea is to use OUTER APPLY to return the data that contributes to the concatenation, rather than the result of the concatenation. We can then perform the CONCAT operation (which handles NULLs nicely without extra work) after the join, avoiding the prefetch buffer issue completely. In SQL Server versions prior to 2012, we would need to use direct string concatenation, and handle rows that are NULL-extended explicitly using ISNULL or COALESCE.

DECLARE @d nvarchar(max) = NCHAR(10000);
 
SELECT 
    n.n,
    DATALENGTH
    (
        CONCAT
        (
            CONCAT(oa1.i0, oa1.i1, oa1.i2, oa1.i3, oa1.i4, oa1.i5, oa1.i6),
            CONCAT(oa2.i0, oa2.i1, oa2.i2, oa2.i3, oa2.i4, oa2.i5, oa2.i6),
            CONCAT(oa3.i0, oa3.i1, oa3.i2, oa3.i3, oa3.i4, oa3.i5, oa3.i6),
            CONCAT(oa4.i0, oa4.i1, oa4.i2, oa4.i3, oa4.i4, oa4.i5, oa4.i6),
            CONCAT(oa5.i0, oa5.i1, oa5.i2, oa5.i3, oa5.i4, oa5.i5, oa5.i6),
            CONCAT(oa6.i0, oa6.i1, oa6.i2, oa6.i3, oa6.i4, oa6.i5, oa6.i6)
        )
    )
FROM dbo.Numbers AS n
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T1 AS t WHERE t.id = n.n) AS oa1
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T2 AS t WHERE t.id = n.n) AS oa2
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T3 AS t WHERE t.id = n.n) AS oa3
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T4 AS t WHERE t.id = n.n) AS oa4
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T5 AS t WHERE t.id = n.n) AS oa5
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T6 AS t WHERE t.id = n.n) AS oa6
WHERE n.n BETWEEN 1 AND 10000
ORDER BY n.n
OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);
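For completeness, here is a sketch of what the pre-2012 equivalent of this workaround might look like, using ISNULL to supply an empty string for NULL-extended rows (the OUTER APPLY blocks are unchanged). The rest of the post sticks with the CONCAT form:

DECLARE @d nvarchar(max) = NCHAR(10000);
 
SELECT 
    n.n,
    DATALENGTH
    (
        ISNULL(oa1.i0 + oa1.i1 + oa1.i2 + oa1.i3 + oa1.i4 + oa1.i5 + oa1.i6, N'') +
        ISNULL(oa2.i0 + oa2.i1 + oa2.i2 + oa2.i3 + oa2.i4 + oa2.i5 + oa2.i6, N'') +
        ISNULL(oa3.i0 + oa3.i1 + oa3.i2 + oa3.i3 + oa3.i4 + oa3.i5 + oa3.i6, N'') +
        ISNULL(oa4.i0 + oa4.i1 + oa4.i2 + oa4.i3 + oa4.i4 + oa4.i5 + oa4.i6, N'') +
        ISNULL(oa5.i0 + oa5.i1 + oa5.i2 + oa5.i3 + oa5.i4 + oa5.i5 + oa5.i6, N'') +
        ISNULL(oa6.i0 + oa6.i1 + oa6.i2 + oa6.i3 + oa6.i4 + oa6.i5 + oa6.i6, N'')
    )
FROM dbo.Numbers AS n
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T1 AS t WHERE t.id = n.n) AS oa1
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T2 AS t WHERE t.id = n.n) AS oa2
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T3 AS t WHERE t.id = n.n) AS oa3
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T4 AS t WHERE t.id = n.n) AS oa4
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T5 AS t WHERE t.id = n.n) AS oa5
OUTER APPLY (SELECT i0 = @d, i1 = t.d, i2 = t.d, i3 = t.d, i4 = t.d, i5 = t.d, i6 = t.d FROM dbo.T6 AS t WHERE t.id = n.n) AS oa6
WHERE n.n BETWEEN 1 AND 10000
ORDER BY n.n
OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);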

The execution plan for the rewritten query looks visually similar to the problematic one:

Rewritten query plan

However, the Compute Scalars no longer manufacture a LOB data type, they just emit column and variable references:

Compute Scalar defined values - column and variable references only

All the concatenation work (and LOB manufacture) is performed by the final top-level Compute Scalar in a single monster expression [Expr1056]:

Final Compute Scalar - concatenation expression [Expr1056]

Warm cache results

With all data in memory, the new query completes in 1.8 seconds (very slightly up on 1.6 seconds before):

I/O statistics - rewritten query, warm cache

Cold cache results

When all data must be fetched from disk, the query issues optimal prefetching and completes in 7.3 seconds (down from 18.6 seconds) with no work table:

I/O statistics - rewritten query, cold cache

The Extended Events wait statistics now show 3.8 seconds spent waiting for my laptop’s slow spinning disk (which is a good thing!):

Wait statistics - rewritten query, cold cache

Final Thoughts

Work tables can appear in STATISTICS IO output for a wide range of reasons, but if you encounter one with a very large number of reads – particularly LOB reads – you may be encountering this issue. The rewrite proposed above may not always be possible, but you should be able to refactor your query to avoid the issue now you know it exists.
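If you are following along without Plan Explorer, the same 'Worktable' line appears in the standard session output:

SET STATISTICS IO ON;
 
-- run the query of interest here
 
SET STATISTICS IO OFF;
 
-- The Messages tab then includes a line of the form:
-- Table 'Worktable'. Scan count ..., logical reads ..., lob logical reads ...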

I am not a fan of doing large amounts of string manipulation in SQL Server. I am always particularly suspicious of the perceived need to split or concatenate large volumes of strings.

I am, however, a fan of always using explicit data types (rather than relying on implicit conversions) and generating relatively small query plans that offer the query optimizer clear and obvious choices. By necessity, this often means writing small SQL queries in logical steps (and no, long chains of common table expressions do not count!)

The real world does not always make these things possible, of course, but it is good to have goals :)

© 2013 Paul White – All Rights Reserved
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi

Screenshots acquired using SnagIt by TechSmith
Query plan details obtained using Plan Explorer PRO by SQLSentry

Published Friday, March 08, 2013 8:42 AM by Paul White


Comments

 

mjswart said:

Well done Paul, I always like these in depth posts of yours. The time spent running through the examples always pays off.

There's a lot of things going on here to make this issue occur. There's the nested loops prefetching, the manufactured lobs etc...

I like to put myself in the position of a DBA who's asked to look at this query. To be honest I don't think I'd be up to the task of understanding the exact cause without a humungous amount of mental effort (so good on you).

But I like to think that maybe some knee-jerk rules of thumb I exercise might have done the trick: Concatenating strings in the app for example.

But thanks again for the write up! Your internals posts are always fascinating.

March 7, 2013 5:16 PM
 

Alejandro Mesa said:

A delightful reading!

Adam, could you add the "Like" button?

--

AMB

March 7, 2013 5:22 PM
 

Paul White said:

Hi Michael,

I agree, though in this particular case, it turns out the string concatenation was to feed input to an in-query CLR function - a valid use case with only partial workarounds and alternatives AFAIK. Nevertheless, this was a fun post to write.

Paul

P.S. Cheers, Alejandro!

March 7, 2013 5:50 PM
 

SomewhereSomehow said:

Very interesting detective story on the one hand, and useful, informative reading on the other - thank you!

Small question: "the logical reads reported on this work table indicate the number of rows" - curious note! Is it documented somewhere, or just an observation?

March 11, 2013 1:37 AM
 

Paul White said:

Hi Dima,

More details at http://stackoverflow.com/questions/4230838/why-are-logical-reads-for-windowed-aggregate-functions-so-high and http://sqlblog.com/blogs/paul_white/archive/2010/07/28/partitioning-and-the-common-subexpression-spool.aspx

As far as being documented or not - depends: it is mentioned in a book somewhere - I think the author might have been Craig Freedman, I'm afraid I don't recall more than that.

March 11, 2013 2:21 AM
 

SomewhereSomehow said:

Anyway, that's interesting. Thanks for the links!

March 11, 2013 4:40 AM
 

John Waclawski said:

Great, great information!!   Thank you!!

Is there any way to repost this information not using SQL Sentry? Some of us can't afford to buy the software and would rather use what is available to us - unless the free version gives us the same information.

Thank you again!!

April 4, 2013 9:09 AM
 

Paul White said:

Hi John,

The free version is all that is required to reproduce the screenshots in this post. I do prefer to use P/E for blog work because if nothing else the graphical plans fit better in the space available. As you can see, there are other benefits in the way it presents information too.

I do offer a full refund to anyone not happy with the way I write my blog posts of course :-p

April 17, 2013 7:37 PM
 

Tony said:

Nice sharing.

October 3, 2014 9:54 AM
