THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

Swinging From Tree to Tree Using CTEs, Part 2: Adjacency to Nested Intervals

In our previous installment, we saw how to convert Adjacency Lists into Nested Sets using a CTE.

In this episode, we will convert the Adjacency List into a Nested Intervals encoding.  Specifically, this encoding will make use of the Nested Intervals with Continued Fractions technique that Tropashko presented in a later paper.

The key to this technique lies in using a slightly different form of materialized path than was used in the last post.  Rather than materializing the EmployeeIds into a path, the path will be created as an enumerated representation, based on sibling ordering.  For example, the first two levels of the AdventureWorks HumanResources.Employee table's tree look like:

EmployeeId 109 (CEO)
EmployeeId 6 (Marketing Manager)
EmployeeId 12 (VP Engineering)
EmployeeId 42 (IS Manager)
EmployeeId 140 (CFO)
EmployeeId 148 (VP Production)
EmployeeId 273 (VP Sales)

Using the materialized path representation from the previous post, the paths to the second-level employees would be:

6: 109.6
12: 109.12
42: 109.42
140: 109.140
148: 109.148
273: 109.273

However, for this post, paths will instead be materialized based on sibling ordering.  We don't know anything about how siblings should be ordered in the AdventureWorks employee hierarchy (that's probably a business rule question, if it matters at all).  So ordering will be done by EmployeeId:

6: 1.1
12: 1.2
42: 1.3
140: 1.4
148: 1.5
273: 1.6
The significance of this encoding is that for each path a rational number can be generated, using a Euclidian algorithm (described very well on this web site).  The algorithm works by iterating over each element on the path, building a rational number as it goes.  The beauty of this, as shown by Tropashko in his paper, is that by using these paths we can determine an interval, beween which all children of a given path will fall.

In order to accomplish getting the path, the ROW_NUMBER function will be used, but slightly differently than last time.  Instead of creating a second CTE to reference the first, thereby getting the row number for each element of the path, the ROW_NUMBER function will be embedded within the recursive CTE itself, as in the following example:
WITH EmployeeRows AS
(
    SELECT
        EmployeeId,
        ManagerId,
        ROW_NUMBER() OVER (ORDER BY EmployeeId) AS theRow
    FROM HumanResources.Employee
    WHERE ManagerId IS NULL

    UNION ALL

    SELECT
        e.EmployeeId,
        e.ManagerId,
        ROW_NUMBER() OVER (ORDER BY e.EmployeeId) AS theRow
    FROM EmployeeRows x
    JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
)
SELECT *
FROM EmployeeRows
ORDER BY
    ManagerId,
    EmployeeId

Interestingly, this example as-is will return exactly the same results as the following, non-recursive CTE example:
SELECT
    EmployeeId,
    ManagerId,
    ROW_NUMBER() OVER (PARTITION BY ManagerId ORDER BY EmployeeId) AS theRow
FROM HumanResources.Employee
ORDER BY
    ManagerId,
    EmployeeId
So, why do we care?  Take a close look at the two examples.  In the CTE example, the ROW_NUMBER function does not use PARTITION BY.  Yet, results are implicitly partitioned.  This gives an interesting view into the inner-workings of CTEs.  As it turns out, the recursive part of the CTE is called once per row returned by the anchor or previous recursion.  This is not how I originally expected CTEs to behave (I thought the recursive part would be called once per rowset returned by the anchor or previous recursion), but it does help us with this particular task!

The current row number, at any given point in the recursion, represents the enumeration for that node.  But because we're using a recursive CTE, we also have access to the parent's enumaration.  For instance, to build an enumerated path, the following T-SQL would be used:
WITH EmployeeRows AS
(
    SELECT
        EmployeeId,
        ManagerId,
        CONVERT(VARCHAR(MAX), ROW_NUMBER() OVER (ORDER BY EmployeeId)) AS thePath
    FROM HumanResources.Employee
    WHERE ManagerId IS NULL

    UNION ALL

    SELECT
        e.EmployeeId,
        e.ManagerId,
        x.thePath + '.' + CONVERT(VARCHAR(MAX), ROW_NUMBER() OVER (ORDER BY e.EmployeeId)) AS thePath
    FROM EmployeeRows x
    JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
)
SELECT *
FROM EmployeeRows
ORDER BY
    thePath
Note that this sample can be made a bit more readable (and more functional for later) by embeding the ROW_NUMBER in a derived table:
WITH EmployeeRows AS
(
    SELECT
        y.EmployeeId,
        y.ManagerId,
        CONVERT(VARCHAR(MAX), y.theRow) AS thePath
    FROM
    (
        SELECT
            EmployeeId,
            ManagerId,
            ROW_NUMBER() OVER (ORDER BY EmployeeId) AS theRow
        FROM HumanResources.Employee
        WHERE ManagerId IS NULL
    ) y

    UNION ALL

    SELECT
        y.EmployeeId,
        y.ManagerId,
        y.thePath + '.' + CONVERT(VARCHAR(MAX), y.theRow) AS thePath
    FROM
    (
        SELECT
            e.EmployeeId,
            e.ManagerId,
            x.thePath,
            ROW_NUMBER() OVER (ORDER BY e.EmployeeId) AS theRow
        FROM EmployeeRows x
        JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
    ) y
)
SELECT *
FROM EmployeeRows
ORDER BY
    thePath
From here, it's a simple step to implement the Euclidian algorithm.  The algorithm is quite simple:
  1. Set parentNumerator <- 1
  2. Set parentDenominator <- 0
  3. Set theElement <- first enumeration in the path
  4. Set currentNumerator <- theElement
  5. Set currentDenominator <- 1
  6. Set theElement <- next enumeration in the path
  7. Set previousParentNumerator <- parentNumerator
  8. Set previousParentDenominator <- parentDenominator
  9. Set parentNumerator <- currentNumerator
  10. Set parentDenominator <- currentDenominator
  11. Set currentNumerator <- (parentNumerator * theElement) + previousParentNumerator
  12. Set currentDenominator <- (parentDenominator * theElement) + previousParentDenominator
  13. If the current element is not the final node in the path, goto 6.
This seems a bit hairy, but I think that looking at the algorithm and spending a few minutes with our old friends pencil and paper will make it quite clear.  Also, look at the web site linked above.  Here is how I've implemented the algorithm using a CTE:
WITH EmployeeRows AS
(
    SELECT
        EmployeeId,
        CONVERT(VARCHAR(MAX), theRow) AS thePath,
        CONVERT(BIGINT, 1) AS prevNumer,
        CONVERT(BIGINT, 0) AS prevDenom,
        CONVERT(BIGINT, theRow) AS currNumer,
        CONVERT(BIGINT, 1) AS currDenom
    FROM
    (
        SELECT
            EmployeeId,
            ROW_NUMBER() OVER (ORDER BY EmployeeId) AS theRow
        FROM HumanResources.Employee
        WHERE ManagerId IS NULL
    ) y
   
    UNION ALL
   
    SELECT
        y.EmployeeId,
        y.thePath + '.' + CONVERT(VARCHAR(MAX), y.theRow) AS thePath,
        prevNumer = y.currNumer,
        prevDenom = y.currDenom,
        (y.currNumer * y.theRow) + y.prevNumer  AS currNumer,
        (y.currDenom * y.theRow) + y.prevDenom  AS currDenom
    FROM
    (
        SELECT
            e.EmployeeID,
            x.thePath,
            x.currNumer,
            x.currDenom,
            x.prevNumer,
            x.prevDenom,
            ROW_NUMBER() OVER (ORDER BY e.EmployeeID) AS therow
        FROM EmployeeRows x
        JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
    ) y
)
Note that in this case I didn't require the use of the temporary variables; the previous anchor/recursive parts act as temporary storage enough. 

Readers will also hopefully notice that I haven't yet included a SELECT to get the data from the CTE!  This is because I'd like to explain briefly what it will do.  In his paper, Tropashko explains that for each node, the intervals for the children of that node will fall into an interval between the encoding of that node (currNumer / currDenom) and the numerator of the previous node plus the numerator for the current node, divided by the denominator of the previous node plus the denominator for the current node ((currNumer + prevNumer) / (currDenom + prevDenom)).  Quite wordy here.  Refer to the paper for a better explanation and proof.
Anyway, the completed query follows:
WITH EmployeeRows AS
(
    SELECT
        EmployeeId,
        CONVERT(VARCHAR(MAX), theRow) AS thePath,
        CONVERT(BIGINT, 1) AS prevNumer,
        CONVERT(BIGINT, 0) AS prevDenom,
        CONVERT(BIGINT, theRow) AS currNumer,
        CONVERT(BIGINT, 1) AS currDenom
    FROM
    (
        SELECT
            EmployeeId,
            ROW_NUMBER() OVER (ORDER BY EmployeeId) AS theRow
        FROM HumanResources.Employee
        WHERE ManagerId IS NULL
    ) y
   
    UNION ALL
   
    SELECT
        y.EmployeeId,
        y.thePath + '.' + CONVERT(VARCHAR(MAX), y.theRow) AS thePath,
        prevNumer = y.currNumer,
        prevDenom = y.currDenom,
        (y.currNumer * y.theRow) + y.prevNumer  AS currNumer,
        (y.currDenom * y.theRow) + y.prevDenom  AS currDenom
    FROM
    (
        SELECT
            e.EmployeeID,
            x.thePath,
            x.currNumer,
            x.currDenom,
            x.prevNumer,
            x.prevDenom,
            ROW_NUMBER() OVER (ORDER BY e.EmployeeID) AS therow
        FROM EmployeeRows x
        JOIN HumanResources.Employee e ON e.ManagerId = x.EmployeeId
    ) y
)
SELECT
    EmployeeId,
    thePath,
    currNumer AS startNumer,
    currDenom AS startDenom,
    currNumer + prevNumer AS endNumer,
    currDenom + prevDenom AS endDenom
FROM EmployeeRows
For each node (EmployeeId), you now have an interval (start and end) within which all children intervals will fall.  Note that computation must be done by the interval, not by the current node's encoding.  The reason becomes apparent when looking at the encodings for 1.1.1 and 1.2.  They are the same; however, their intervals do not overlap.  As a matter of fact, encodings will be the same for every next sibling/first child pair.  But the intervals remain nested, and if proper queries are written there will be no confusion.

So that's a first step towards using the Nested Intervals Model in SQL Server 2005.  Stay tuned for more... And as always, feel free to post questions or comments.  I know some of this material can be confusing (at least, it was to me before I wrote this post!)

Published Wednesday, July 12, 2006 10:49 PM by Adam Machanic
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Andy Austin said:

What would be the query for retrieving the children of a node?

September 19, 2008 10:31 AM
 

Kerr said:

I was excited to look at nested intervals as an alternative to other methods, but I believe there may be an issue with this implementation.

Doing this exercise, the first level subordinates Numer and Denom columns do not appear to be assigned as designed.  I'll use the first employee who has both a supervisor and subordinate, EmployeeID 273.  Their record reads as:

thePath 1.6

startNumer 7

startDenom 6

endNumer 8

endDenom 7

Doing the math, the left edge (startNumer / startDenom) for this employee is 1.1666 and the right edge (endNumer / endDenom) is 1.1428.  All direct subordinates of the CEO, EmployeeID 109, experience this phenomenon where the right edge is numerically less than the left edge.

By contrast, here is the record for EmployeeID 268, whose supervisor is EmployeeID 273:

thePath 1.6.1

startNumer 8

startDenom 7

endNumer 15

endDenom 13

left edge = 1.1428 (same as supervisor's right edge!)

right edge = 1.1538

The relationship with left and right edges is appropriate here, and traversing deeper within the hierarchy, the respective left and right edges are bound within each parent as designed.

Unrelated to this oddity, I've also been running into issues with precision at very deep levels of my own hierarchy.  I'm at a point where the numerators and denominators are in the hundreds of millions, and no amount of nested casting or conversion to higher precision data types produces a unique left edge.  Even while creating a column of decimal(38,1), I cannot obtain a scale of over 19 digits.

June 16, 2011 12:08 AM
 

Kerr said:

Just a quick update on the precision and scale problem I had.  Given the nature of my hierarchy, the highest precision required by my numerators with 0 scale would've been safely 11 digits.  I ended up doing something like the following and was able to get atomic left and right edges.

cast(endNumer as decimal(12,0)) / cast(endDenom as decimal(38,0))

This gave me some breathing room and pushed things out to a decimal(27,1), which still fits into the 13 byte storage breakpoint. ;)

This still doesn't address the apparent flipping of the relationships between left and right edges at the 1st subordinate level.  I have a workaround in place where I transpose them, though it's definitely not ideal as I'm likely missing a simple piece of things that would make the solution cleaner.

June 16, 2011 12:58 AM
 

Adam Machanic said:

Hi Kerr,

I wrote this a long, long time ago, and at this point I no longer believe that nested intervals can be a viable solution in SQL Server. I think that a materialized path model (either using hierarchyId or something similar) is the best bet.

--Adam

June 16, 2011 10:10 AM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement