Back in 2008, Marc Friedman of the SQL Server Query Processor Team wrote a blog entry entitled “Distinct Aggregation Considered Harmful”, in which he shows a way to work around the poor performance that often results simply from adding the keyword DISTINCT to an otherwise perfectly reasonable aggregate function in a query. This post is an update to that, presenting a query optimizer enhancement in SQL Server 2012 RC0 that reduces the need to perform the suggested rewrite manually. First, though, it makes sense for me to cover some broader aggregation topics in general.
Query Plan Options for DISTINCT
There are two main strategies to extract distinct values from a stream of rows. The first option is to keep track of unique values in a hash table. The second is to sort the incoming rows into groups and return one value from each group. SQL Server uses the Hash Match operator to implement the hash table option, and Stream Aggregate or Sort Distinct for the sorting option.
Hash Match Aggregate
The optimizer tends to prefer the Hash Match Aggregate on larger rowsets, with fewer groups, where there is no reason to produce a sorted output, and where the incoming rows are not sorted on the DISTINCT expression(s). Larger inputs favour hash matching because the algorithm generally scales well (although it does require a memory grant) and can make good use of parallelism. Fewer groups are better for hashing because it means fewer entries in the hash table, and the memory needed to store unique values is proportional to the number of groups (and the size of the group). Hash matching does not require or preserve the order of the incoming row stream. The query and plan below show a Hash Match Aggregate building a hash table on values in the Quantity column:
SELECT DISTINCT
th.Quantity
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

The standard Hash Match Aggregate is blocking: it produces no output until it has processed its entire input stream. If we restrict the number of rows using TOP or SET ROWCOUNT the aggregate can operate in a streaming fashion, producing new unique values as soon as they are encountered. This streaming mode is known as Flow Distinct, and is activated depending on the estimated number of unique values in the stream. In the example above, the estimated number of output rows from the Hash Match Aggregate is 455 (based on the statistics on my test machine). Limiting the query to 455 or fewer rows using TOP or ROWCOUNT produces a Hash Match Aggregate running in Flow Distinct mode:
SET ROWCOUNT 10
SELECT DISTINCT
th.Quantity
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)
SET ROWCOUNT 0

This plan is interesting because it limits the output to 10 rows without including a specific TOP operator for that purpose. TOP is generally preferred to ROWCOUNT for the reasons set out in the Books Online topic “Limiting Result Sets by Using TOP and PERCENT”. Ignore the part that says “because SET ROWCOUNT is used outside a statement that executes a query, its value cannot be used to generate a query plan for a query” – we have just seen how it can affect a query plan.
For completeness, this is the TOP query and execution plan:
SELECT DISTINCT TOP (10)
th.Quantity
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

Stream Aggregate
Performing a DISTINCT is the same as applying a GROUP BY on every expression in the SELECT list. A stream of rows sorted on these GROUP BY expressions has the useful property that all rows with the same GROUP BY values appear together, so all we need to do is output a row of GROUP BY keys each time a change occurs. If the required order can be obtained without an explicit sort, the query plan can use a Stream Aggregate directly:
SELECT DISTINCT
th.ProductID
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

This query plan shows an ordered scan of an index on the ProductID column, followed by a Stream Aggregate with a GROUP BY on the same column. The Stream Aggregate emits a new row each time a new group is encountered, similar to the Hash Match Aggregate running in Flow Distinct mode. This operator does not require a memory grant because it only needs to keep track of one set of GROUP BY expression values at a time.
Sort Distinct
The third alternative is to perform an explicit Sort followed by a Stream Aggregate. The optimizer can often collapse this combination into a single Sort operating in Sort Distinct mode:
SELECT DISTINCT
p.Color
FROM Production.Product AS p
OPTION (RECOMPILE)

To see the original Sort and Stream Aggregate combination, we can temporarily disable the GbAggToSort optimizer rule that performs this transformation:
DBCC RULEOFF('GbAggToSort')
SELECT DISTINCT
p.Color
FROM Production.Product AS p
OPTION (RECOMPILE)
DBCC RULEON('GbAggToSort')
The query plan now shows a regular (non-distinct) Sort followed by the Stream Aggregate:

Distinct Aggregates
The DISTINCT keyword is most commonly used with the COUNT and COUNT_BIG aggregate functions, though it can be specified with a variety of built-in and CLR aggregates. The interesting thing is that SQL Server always processes distinct aggregates by first performing a DISTINCT (using any of the three methods shown previously) and then applying the aggregate (for example COUNT) as a second step:
SELECT
COUNT_BIG(DISTINCT p.Color)
FROM Production.Product AS p
OPTION (RECOMPILE)

One thing worth highlighting is that COUNT DISTINCT does not count NULLs. The previous query that listed the distinct colours from the Product table produced 10 rows (9 colours and one NULL), but the the COUNT DISTINCT query here returns the value 9. The query plan uses a Distinct Sort to perform the DISTINCT (which includes the NULL group) and then counts the groups using a Stream Aggregate. The aggregate expression uses COUNT(expression) rather than COUNT(*), which eliminates the NULL group produced by the Distinct Sort. This second example shows a Hash Match Aggregate performing the DISTINCT, followed by a Stream Aggregate to count the non-NULL groups:
SELECT
COUNT_BIG(DISTINCT th.Quantity)
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)

So far, the counting step has always been performed by a Stream Aggregate. This is because it has been a scalar aggregate – an aggregate without a GROUP BY clause – and SQL Server always implements scalar aggregates using Stream Aggregate (there would be no point using hashing since there will only be one group by definition). If we add a GROUP BY clause, the final COUNT is no longer a single (scalar) result, so we can get a plan with two Hash Match Aggregates:
SELECT
th.ProductID,
COUNT_BIG(DISTINCT th.Quantity)
FROM Production.TransactionHistory AS th
GROUP BY
th.ProductID
OPTION (RECOMPILE)

In this plan, the rightmost aggregate is performing the DISTINCT, and the leftmost one implements the COUNT (with GROUP BY).
Combining Aggregates
The decision to always implement distinct aggregates as separate DISTINCT and aggregate steps makes life easier for the optimizer. Separating the two operations makes it easier to plan and cost parallelism, partial aggregation, combining similar aggregates, moving aggregates around (for example pushing an aggregate below a join) and so on. On the downside, it creates problems for queries that include multiple distinct aggregates, or combine a single distinct aggregate with an ‘ordinary’ aggregate.
Multiple Aggregates
As you probably know, there is no problem combining multiple ordinary aggregations into a single operator:
SELECT
COUNT_BIG(*),
MAX(th.ActualCost),
STDEV(th.Quantity)
FROM Production.TransactionHistory AS th
GROUP BY
th.TransactionType

The single Hash Match Aggregate operator computes the three separate aggregates concurrently on the incoming stream.
Multiple Distinct Aggregates
This next query does a similar thing, but with three DISTINCT aggregations on the stream:
SELECT
COUNT_BIG(DISTINCT th.ProductID),
COUNT_BIG(DISTINCT th.TransactionDate),
COUNT_BIG(DISTINCT th.Quantity)
FROM Production.TransactionHistory AS th
OPTION (RECOMPILE)
It produces a more complex query plan:

The issue here is that once a DISTINCT has been performed on a stream, the stream no longer contains the columns necessary to perform the other DISTINCT operations. To work around this, the optimizer builds a plan that reads the source stream once per DISTINCT, computes the aggregates separately, and then combines the results using a cross join (which is safe because these are all scalar aggregates, guaranteed to produce one row each). The same basic pattern is employed if the query contains an outer GROUP BY clause, but instead of cross joins there will be inner joins on the GROUP BY columns.
More often than not, the source of rows will not be an unrestricted table scan. Where the source is complex (and therefore expensive to re-run for every distinct aggregate) or where a filter significantly reduces the number of rows, the query optimizer may choose to Eager Spool the source rows and replay them once per distinct aggregate:
SELECT
COUNT_BIG(DISTINCT th.ProductID),
COUNT_BIG(DISTINCT th.TransactionDate),
COUNT_BIG(DISTINCT th.Quantity)
FROM Production.TransactionHistory AS th
WHERE
th.ActualCost < $5
GROUP BY
th.TransactionType
OPTION (RECOMPILE)

This is a plan shape that you are likely to encounter in the real world, since most queries will likely have a filtering condition or have a row source that is more complex than a simple scan of an index or table. For queries over larger tables than Adventure Works provides, this plan is likely to perform poorly, as you might expect. Aside from the obvious concerns, inserting rows into a spool has to be performed on a single thread (like all data modification operations in today’s SQL Server). Another limitation is that the spool does not support parallel scan for reading, so the optimizer is very unlikely to restart parallelism after the spool (or any of its replay streams). In queries that operate on large data sets, the parallelism implications of the spool plan can be the most important cause of poor performance.
Performing Multiple Distinct Aggregates Concurrently
If SQL Server were able to perform the whole COUNT DISTINCT aggregate in a single operator (instead of splitting the DISTINCT and COUNT into two steps) there would be no reason to split plans with spools as seen above. This could not be done with a Stream Aggregate since that operator requires the stream to be sorted on the DISTINCT expression, and it is not possible to sort a single stream in more than one way at the same time.
On the other hand, the Hash Match Aggregate does not require sorted input (it keeps the distinct values in a hash table, remember), so it should be possible to design a Hash Match Aggregate that computes COUNT DISTINCT in a single operation. We can test this idea with a User-Defined Aggregate (UDA) – SQL Server 2008 required:
using System;
using System.Collections.Generic;
using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;
[SqlUserDefinedAggregate
(
Format.UserDefined,
IsInvariantToDuplicates = true,
IsInvariantToNulls = true,
IsInvariantToOrder = true,
IsNullIfEmpty = false,
MaxByteSize = -1
)
]
public struct CountDistinctInt : IBinarySerialize
{
// The hash table
private Dictionary<int, object> dict;
// Recreate the hash table for each new group
public void Init()
{
dict = new Dictionary<int, object>();
}
// Ignore NULLs, store key values in the hash table
public void Accumulate(SqlInt32 Data)
{
if (!Data.IsNull)
{
dict[Data.Value] = null;
}
}
// Merge partial aggregates
public void Merge(CountDistinctInt Group)
{
foreach (var item in Group.dict.Keys)
{
dict[item] = null;
}
}
// Return the DISTINCT COUNT result
public int Terminate()
{
return dict.Count;
}
// Required by SQL Server to serialize this object
void IBinarySerialize.Write(BinaryWriter w)
{
w.Write(dict.Count);
foreach (var item in dict.Keys)
{
w.Write(item);
}
}
// Required by SQL Server to deserialize this object
void IBinarySerialize.Read(BinaryReader r)
{
var recordCount = r.ReadInt32();
dict = new Dictionary<int, object>(recordCount);
for (var i = 0; i < recordCount; i++)
{
dict[r.ReadInt32()] = null;
}
}
}
This UDA does nothing more than create a hash table, add (non-NULL) values to it, and return the count of values when aggregation is complete (the hash table takes care of duplicates). There is a little extra infrastructure code in there to allow SQL Server to serialize and reconstruct the hash table when needed (and merge partial aggregates) but the core of the function is just four lines of code. The example above only aggregates integer values, but it is easy to extend the idea to include other types. Anyway, armed with the integer and date time versions of the UDA, I now return to the multiple-distinct-count query that caused all the spooling, with COUNT DISTINCT replaced by UDA references:
SELECT
dbo.CountDistinctInt(th.ProductID),
dbo.CountDistinctDateTime(th.TransactionDate),
dbo.CountDistinctInt(th.Quantity)
FROM Production.TransactionHistory AS th
WHERE
th.ActualCost < $5
GROUP BY
th.TransactionType
OPTION (RECOMPILE)
Instead of all the Eager Spools, we now get this query plan:

You may be surprised to see that the three distinct count aggregates are being performed by a Stream Aggregate; after all I just finished explaining why a Stream Aggregate could not possibly do what we want here. The thing is that all CLR UDAs are interfaced to query plans using the Stream Aggregate model. The fact that this UDA uses a hash table internally does not change that.
The Sort in this plan is there to ensure that groups of rows arrive at the ‘Stream Aggregate’ interface in the required GROUP BY order. This is so SQL Server knows when to call the Init() and Terminate() methods on our UDA. The COUNT DISTINCT aggregation that is happening inside the UDA for each group could not care less about ordering, of course. (In case you were wondering, yes, the UDA produces the same results as the original T-SQL code).
The point here is to demonstrate that multiple DISCOUNT COUNT operations can be performed within a single operator, not that UDAs are necessarily always a great way to do that in general. As far as performance is concerned, the original spool query runs in around 220ms, and the UDA settles down around 160ms which isn’t bad, all things considered.
We can also improve the performance of the T-SQL query by rewriting it to avoid the spools by scanning the source table three times (this executes in around 75ms). Part of the problem here (to go with the spool plan issues mentioned earlier) is that the optimizer assumes that all queries start executing with an empty data cache, and it does not account for the fact that the three scans complete sequentially, so the pages are extremely likely to be available from memory for the second and third scans. The rewrite and query plan are below.
WITH
Stream AS
(
SELECT
th.TransactionType,
th.ProductID,
th.TransactionDate,
th.Quantity
FROM Production.TransactionHistory AS th
WHERE
th.ActualCost < $5
),
CountDistinctProduct AS
(
SELECT
TransactionType,
COUNT_BIG(DISTINCT ProductID) AS c
FROM Stream
GROUP BY
TransactionType
),
CountDistinctTransactionDate AS
(
SELECT
TransactionType,
COUNT_BIG(DISTINCT TransactionDate) AS c
FROM Stream
GROUP BY
TransactionType
),
CountDistinctQuantity AS
(
SELECT
TransactionType,
COUNT_BIG(DISTINCT Quantity) AS c
FROM Stream
GROUP BY
TransactionType
)
SELECT
p.c,
d.c,
q.c
FROM CountDistinctProduct AS p
JOIN CountDistinctTransactionDate AS d ON
d.TransactionType = p.TransactionType
JOIN CountDistinctQuantity AS q ON
q.TransactionType = d.TransactionType

Combining a Single Distinct Aggregate with other Aggregates
Marc Friedman’s blog post presented a way to rewrite T-SQL queries that contain a single distinct aggregate and one or more non-distinct aggregates so as to avoid spools or reading the source of the rows more than once. The essence of the method is to aggregate first by the GROUP BY expressions in the query and the DISTINCT expressions in the aggregate, and then to apply some relational math to aggregate those partial aggregates to produce the final result. I encourage you to read the full post to see all the detail, but I will quickly work through an example here too:
SELECT
dp.EnglishProductName,
SUM(frs.SalesAmount),
COUNT_BIG(DISTINCT frs.SalesTerritoryKey)
FROM dbo.FactResellerSales AS frs
JOIN.dbo.DimProduct AS dp ON
frs.ProductKey = dp.ProductKey
GROUP BY
dp.EnglishProductName
OPTION (MAXDOP 1, RECOMPILE)
This query contains a regular SUM aggregate and a COUNT DISTINCT, so as expected the query optimizer produces a plan with an Eager Spool (click to enlarge):

To the left of the spool, the top branch performs the DISTINCT followed by the COUNT per group, and the spool replay on the lower branch of the plan computes the SUM per group. Finally, the two branches join on the common GROUP BY expression (EnglishProductName).
The rewrite starts by grouping on EnglishProductName (the GROUP BY expression) and SalesTerritoryKey (the DISTINCT expression) to produce a partial aggregate:
SELECT
dp.EnglishProductName,
frs.SalesTerritoryKey,
SUM(frs.SalesAmount) AS ssa
FROM dbo.DimProduct AS dp
JOIN dbo.FactResellerSales AS frs ON
frs.ProductKey = dp.ProductKey
GROUP BY
frs.SalesTerritoryKey,
dp.EnglishProductName
This query contains no distinct aggregates, so we get a plan with an ordinary join and and ordinary SUM aggregate:

To produce the results specified by the original query, we now need to SUM the partial SalesAmount sums, and COUNT (ignoring NULLs) the SalesTerritoryKey values. The final rewrite looks like this:
WITH PartialAggregate AS
(
SELECT
dp.EnglishProductName,
frs.SalesTerritoryKey,
SUM(frs.SalesAmount) AS ssa
FROM dbo.DimProduct AS dp
JOIN dbo.FactResellerSales AS frs ON
frs.ProductKey = dp.ProductKey
GROUP BY
frs.SalesTerritoryKey,
dp.EnglishProductName
)
SELECT
pa.EnglishProductName,
SUM(pa.ssa),
COUNT_BIG(pa.SalesTerritoryKey)
FROM PartialAggregate AS pa
GROUP BY pa.EnglishProductName
OPTION (RECOMPILE)
The query plan adds another layer of aggregation on top of the partial aggregate plan above:

This plan avoids the Eager Spools seen earlier and improves execution time from 320ms to 95ms in this example. On larger sets, where parallelism becomes important and the spools might need to use physical tempdb storage, the gains are likely to be much larger.
New for SQL Server 2012 RC0
The good news is that SQL Server 2012 RC0 adds a new optimizer rule (ReduceForDistinctAggs) to perform this rewrite automatically on the original form of the query. This is particularly good because the rewrite, while ingenious, can be somewhat inconvenient to do in practice, and all too easy to get wrong (particularly ensuring that NULL partially-aggregated groups are handled correctly). The new optimizer rule was not available in CTP3, so you will need RC0 to see SQL Server turn this T-SQL:
SELECT
dp.EnglishProductName,
SUM(frs.SalesAmount),
COUNT_BIG(DISTINCT frs.SalesTerritoryKey)
FROM dbo.FactResellerSales AS frs
JOIN.dbo.DimProduct AS dp ON
frs.ProductKey = dp.ProductKey
GROUP BY
dp.EnglishProductName
OPTION (RECOMPILE)
…directly into this query plan:

This transformation is only valid for queries with a single distinct aggregate (and at least one non-distinct aggregate of course). If your query contains multiple distinct aggregates, it may not help you directly, though you may be able to refactor the T-SQL to take advantage.
If you want to see SQL Server 2012 RC0 produce the Eager Spool plan instead (created by the long-standing ExpandDistinctGbAgg rule), disable the new rule with:
DBCC RULEOFF ('ReduceForDistinctAggs')
…and then recompile. Don’t forget to enable it again afterward using RULEON or by reconnecting to the server.
Even with the new rule enabled, you may still see the spool or multiple-scan plan from time to time. As always, the optimizer may explore many alternative plan forms and choose the one that looks cheapest. In some cases, the optimizer may still choose the spool plan, though it probably won’t be right to do so…
Thanks for reading; please consider voting for the following Connect item (to allow CLR UDA hash aggregates). Thank you.
http://connect.microsoft.com/SQLServer/feedback/details/629920/allow-option-hash-group-with-sqlclr-udas
© 2011 Paul White
Twitter: @SQL_Kiwi
Email: SQLkiwi@gmail.com