THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

Solving the Net Changes Problem with Temporal Tables (T-SQL Tuesday #087)

SQL Server 2008 was perhaps not the meatiest of SQL Server releases, but it did have one especially promising feature: Data change detection. Or maybe two features, depending on how you look at things. I’m talking, of course, about the almost confusingly similarly named Change Data Capture and Change Tracking.

Change Data Capture was the big gun that promised to solve everyone’s change tracking woes (well at least everyone running Enterprise Edition). In practice it was a heavy, overwrought hack of a feature. Built on the foundations of replication—a feature designed to move data, not record and store changes to it—the entire system was quirky and problematic. Users had to become familiar with odd APIs, low-level transaction identifiers, and the occasional call to sp_repldone for that wonderful moment when the whole thing crashed and burned. But enough about this feature. If you want to know more about it and its future in SQL Server, look it up in the Azure SQL Database documentation.

Change Tracking, on the other hand, was designed to solve only one problem. It did so using a totally integrated approach, with changes recorded as part of the write transaction, first-class T-SQL syntax, and a very simple and well thought out API. The Change Tracking problem? Net changes. Simply put: First give me all of the rows. Then let me periodically ask what’s changed since last time I asked. When I ask, tell me how to synchronize my copy, cache, app, or whatever is on the other end, to get it into the same state as the host data set.

Fast-forward to SQL Server 2016, and now we have Temporal Tables. This feature is, in many ways, the best of both worlds when it comes to change detection. Temporal Tables is as well integrated into the query engine—and the transaction—as Change Tracking, but captures all changes, much like Change Data Capture. It has a clean and simple interface, and while it’s definitely a v1 technology it seems like a much more solid foundation than either of the SQL Server 2008 features ever did.

This month’s T-SQL Tuesday, hosted by Matt Gordon, asks for solutions to old problems using “shiny new toys.” I wondered—could I solve the Change Tracking net changes problem using Temporal Tables? Unlike its predecessors, net changes is a major use case that Temporal Tables simply doesn’t do right out of the box. Turning on both Temporal Tables and Change Tracking side-by-side is an option, but that seems like overkill. One change detection technology should be enough…or is it?

Thinking through the problem, it quickly became apparent to me that once you have all changes—as we have with Temporal Tables—getting to a net scenario is not especially difficult. Given a primary key and a time interval over which we’re concerned about changes,  there are of only four possible outcomes:

  • A new row for the key was inserted
  • The row was updated
  • The row was deleted
  • Nothing happened

For now we can ignore the final outcome and focus on the first three, but of course any proper solution must ensure that the negative case is taken into account so as to not send false changes.

For now, consider the following set of propositions, given that we’re asking at time Y for all changes since a prior time X.

  • INSERT: The key did not exist at time X but does exist at time Y.
  • DELETE: The key existed at time X but does not exist at time Y.
  • UPDATE: The key existed at both time X and at time Y, and at least one change occurred between time X and time Y.

Given these assumptions, we can begin work on a temporal queries that return the necessary rows. Solving for these conditions will require all rows that were active as of time X, rows that were (or are) active at time Y, and for the final case, all rows that were active in-between times X and Y. Since this is a range-based scenario, our best Temporal predication option will be either FROM or BETWEEN. The difference between these two is subtle: FROM uses an open interval (non-inclusive at both endpoints), whereas BETWEEN uses a half-open interval, inclusive on the end date. Given the choice in the scenario, BETWEEN makes more sense, as we can take advantage of the inclusive endpoint to avoid dropping a badly-timed row. But more on that in a moment.

To begin solving for net changes, we must first get all changes over the time interval in question:

SELECT
    t.pk,
    t.[other columns],
    t.valid_start_time,
    t.valid_end_time
FROM [table] AS t
FOR SYSTEM_TIME BETWEEN @start_time AND @end_time

This query will return all rows with a valid_end_time greater than the passed-in @start_time and a valid_start_time less than or equal to the passed-in @end_time. With that in mind, we can start putting together some expressions for each of the operations in question.

First step in tackling the insert: Did the key only come into existence after @start_time? The following expression finds out by testing the minimum valid_start_time per key; if it’s after the passed-in @start_time, we know it’s a new key.

CASE
    WHEN 
        @start_time < 
            MIN(t.valid_start_time) OVER (PARTITION BY t.pk)
                THEN 1
    ELSE 0
END AS is_new_key

A key could have been both inserted and deleted in the interval between @start_time and @end_time—or simply deleted altogether—so a second expression is necessary to determine whether the row still exists at @end_time. The expression below accomplishes this by checking the @end_time against the maximum valid_end_time per key. If the @end_time is greater than or equal to the maximum valid_end_time then the key must have been deleted. (This is where we’re taking advantage of the inclusive endpoint used by the BETWEEN predicate.)

CASE
    WHEN
        @end_time >=
            MAX(t.valid_end_time) OVER (PARTITION BY t.pk)
                THEN 1
    ELSE 0
END AS is_deleted_key

The final case is that of update operations. A ranged Temporal predicate will return at least one row per key that was active during the range. If the query returns more than one row for a given key, some change must have occurred. A simple COUNT will suffice in this case.

CASE
    WHEN
        1 <
            COUNT(*) OVER (PARTITION BY t.pk) THEN 1
    ELSE 0
END AS is_changed_key

Putting this all together, we can come up with a general template query to answer net changes style questions. To make this work I put the above expressions into a derived table, to which I added a ROW_NUMBER. The numbering is used so that we get only a single version for each row – the most recent one. (Consider cases where a row was updated multiple times during the interval.) The query below uses the derived table [x], and applies a WHERE clause that filters the set based on the expressions, in order to output only rows we’re interested in: new, deleted, or changed rows, except in cases where a row was both inserted and deleted in the input range. The CASE expression in the outer SELECT list is used to determine what type of operation will have to be done on the synchronizing side in order to bring the two sets into parity.

SELECT
    x.*,
    CASE
        WHEN x.is_new_key = 1 THEN 'INSERT'
        WHEN x.is_deleted_key = 1 THEN 'DELETE'
        ELSE 'UPDATE'
    END AS change_operation
FROM
(
    SELECT
        t.pk,
        t.[other columns],
        t.valid_start_time,
        t.valid_end_time,
        CASE
            WHEN 
                @start_time < 
                    MIN(t.valid_start_time) OVER (PARTITION BY t.pk)
                        THEN 1
            ELSE 0
        END AS is_new_key,
        CASE
            WHEN
                @end_time >=
                    MAX(t.valid_end_time) OVER (PARTITION BY t.pk)
                        THEN 1
            ELSE 0
        END AS is_deleted_key,
        CASE
            WHEN
                1 <
                    COUNT(*) OVER (PARTITION BY t.pk) THEN 1
            ELSE 0
        END AS is_changed_key,
        ROW_NUMBER() OVER (PARTITION BY t.pk ORDER BY t.valid_end_time DESC) AS rn
    FROM [table] AS t
    FOR SYSTEM_TIME BETWEEN @start_time AND @end_time
) AS x
WHERE
    x.rn = 1
    AND
    (
        x.is_new_key = 1
        OR x.is_changed_key = 1
        OR x.is_deleted_key = 1
    )
    AND NOT
    (
        x.is_new_key = 1
        AND x.is_deleted_key = 1
    )

And in theory, that’s that. This query does in fact solve for net changes. But the devil’s in the details, and there is a subtle bug in the above query: What happens if the passed in @end_time happens to fall at exactly the same time that an update occurred on one of my rows? The answer is that the MAX(t.valid_end_time) for that key will be equal to the passed in @end_time, and so the logic above will consider the update to be a delete. Clearly not desirable behavior!

The solution? In an update operation, Temporal Tables logs the new row with the same valid start time as the old row’s valid end time. Therefore, to figure out whether an update occurred, we must look forward exactly one row from our actual intended end the. To accomplish this, the following modified version of the query creates a new search end time by bumping the input end date up by 100ns (the finest resolution supported by DATETIME2).

DECLARE @search_end_time DATETIME2 =
    DATEADD(NANOSECOND, 100, @end_time)

SELECT
    x.*,
    CASE
        WHEN x.is_new_key = 1 THEN 'INSERT'
        WHEN x.is_deleted_key = 1 THEN 'DELETE'
        ELSE 'UPDATE'
    END AS change_operation
FROM
(
    SELECT
        y.*,
        CASE
            WHEN
                1 <
                    COUNT(*) OVER (PARTITION BY y.pk) THEN 1
            ELSE 0
        END AS is_changed_key,
        ROW_NUMBER() OVER (PARTITION BY y.pk ORDER BY y.valid_end_time DESC) AS rn
    FROM
    (
        SELECT
            t.pk,
            t.[other columns]
            t.valid_start_time,
            t.valid_end_time,
            CASE
                WHEN 
                    @start_time < 
                        MIN(t.valid_start_time) OVER (PARTITION BY t.pk)
                            THEN 1
                ELSE 0
            END AS is_new_key,
            CASE
                WHEN
                    @end_time >=
                        MAX(t.valid_end_time) OVER (PARTITION BY t.pk)
                            THEN 1
                ELSE 0
            END AS is_deleted_key
        FROM [table] AS t
        FOR SYSTEM_TIME BETWEEN @start_time AND @search_end_time
    ) AS y
    WHERE
        y.valid_start_time <= @end_time
) AS x
WHERE
    x.rn = 1
    AND
    (
        x.is_new_key = 1
        OR x.is_changed_key = 1
        OR x.is_deleted_key = 1
    )
    AND NOT
    (
        x.is_new_key = 1
        AND x.is_deleted_key = 1
    )

This query uses @search_end_time in the Temporal predicate, but then does all of the comparison work using the original @end_time. For the is_new_key check, this doesn’t matter, as we’re only comparing start times. For the is_deleted_key check it makes all the difference in the world, as an update done at exactly @end_time will populate a new row which will increase the value of MAX(t.valid_end_time). Because the extended search predicate can return an extra row, the is_changed_key and rn expressions had to be moved to an outer table expression filtered by the original @end_time. If an update occurs exactly at @end_time, we don’t want it to trigger a delete operation now, but we also don’t want to actually see it until next time we synchronize.

That bug solved, we now must consider a second, slightly more important unsolved problem: What do you use for the @start_date and @end_date?

Naïvely speaking we should be able to ask for a base set of rows by using some long-past date as the @start_date—say, 1900-01-01—and the current SYSUTCDATETIME() as the @end_date. Then we should be able to pass back that same @end_date next time as the start date, and so on and so forth. But that approach will open you to a major issue; Temporal Tables simply wasn’t designed for this.

Backing up a bit: Change Tracking, which was in fact designed for answering net changes questions, works by placing a surrogate internal transaction identifier on each row touched during a transaction, and then exposing an external transaction identifier that corresponds to the commit time of the transaction. It does this by using an external “transactional commit table” in conjunction with the main changes table. But Temporal Tables uses an entirely different system, wherein there is no external table, and each row is populated with its one-and-only, both internal and external identifier, as it is being written. And every row touched by a given transaction gets the same exact time: The start time of the transaction.

So what’s does this mean when we’re talking about net changes? Pretend for a moment that you start a transaction, jump on a plane, go hang out on the beach in Jamaica for a week (highly recommended), and then return to your office (lightly sunburned), update a table of your choosing, and commit the transaction. The time stamped on the row will correspond to the start of the transaction—a week ago, before you ever boarded the plane. Meanwhile, your synchronization process has been running regularly, let’s say once an hour, and it thinks it’s lagged by only an hour. It’s never going to ask for rows that changed a week ago. This is referred to in change detection parlance as a lost update.

Does this mean we’re completely out of luck? Of course not—but it does mean that a complete solution will have to consider the transactional state of the system. And you’ll have to monitor to make sure that no one starts a transaction and leaves it hanging around for a week. (You should probably be doing that anyway.)

The key to my full solution is the sys.dm_tran_active_transactions DMV. This view contains one row per current transaction—either explicit or implicit—and a handy column called transaction_type which categorizes the transactions as read, read/write, system, or distributed. It also includes a transaction start time, which can be used to figure out which times we might not want to touch just yet, if some transaction is still outstanding from then.

The following expression uses the DMV and is designed to output a safe @end_date, that will avoid lost updates. It starts by asking for the minimum start time for any active read/write transactions (transaction_type 1). If there are none, it uses the current SYSUTCDATETIME instead. Just to be on the extra safe side—in case there was a delay in processing due to sitting on a long runnable queue, for example—the expression pulls five seconds off of the end time. Note that this code also needs to be able to covert the local time as reported by the DMV into UTC as required by our Temporal predicate. There is currently no reasonable way to get the local time zone in T-SQL (there is at least one hack involving a registry read, which I do not recommend), so you’ll have to replace the hardcoded Eastern Standard Time with whatever time zone is appropriate for your server.

CONVERT
(
    DATETIME2(0),
    DATEADD
    (
        SECOND,
        -5,
        ISNULL
        (
            (
                SELECT
                    CONVERT(DATETIMEOFFSET, MIN(transaction_begin_time))
                        AT TIME ZONE 'Eastern Standard Time' AT TIME ZONE 'UTC'
                FROM sys.dm_tran_active_transactions
                WHERE
                    transaction_type = 1
            ),
            SYSUTCDATETIME()
        )
    )
)

The final operation I’ve added to the end date expression is to truncate the fractional (sub-second) precision off of the DATETIME2. The reason for this is that it’s important that subsequent change requests use the exact same start time as the prior end time. If you’re storing the value from your last synchronization in a system that has slightly different date rules than SQL Server, you may lose some precision and wind up with a rounding error. So better, in my opinion, to delay changes by up to one additional second, by ensuring that any possible date/time data type will suffice.

Once all of this is in place on your server, pointing at your table, usage is pretty simple:

  • Set a start date of 1900-01-01.
  • Set an end date using the expression above.
  • Ask for “changes.” That’s your input set. (Similar to Charge Tracking’s VERSION.)
  • Record the end date, and use it as the start date next time. The same date functions as both end date one time and start time the next because of the BETWEEN predicate’s half-open interval; if a delete operation occurred at exactly the end date, it will be included in that synchronization due to the end date being inclusive, but it will not be included in the next synchronization because the start time comparison is non-inclusive.
  • Re-populate the end date each time using the expression. It may be worthwhile to log the start and end dates so that you can detect conditions where the new end date is not much greater than the start date. This would indicate either that you might want to increase your polling interval or figure out why user transactions are running for a long time.
  • Repeat as needed. Remember that every cache is effectively stale the moment you’re done synchronizing it, so you must strike a balance between synchronized enough and not putting excessive stress on the host system trying to keep things up to date. And given the safety measures I’ve taken in the above expression, if you use this system and ask for changes more often than once every six seconds, you’ll tend to not get very far.

That’s it—net changes from Temporal Tables. Not the simplest exercise in the world, and maybe not quite as perfect as we’d like, but completely workable none the less. And of course, it’s always fun to put shiny new toys completely through their paces.

Enjoy, and happy synchronizations!

Published Tuesday, February 14, 2017 10:10 AM by Adam Machanic

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Fanis said:

Please update "@end_time <= MAX(t.valid_end_time) OVER (PARTITION BY t.pk)                          THEN 1 ELSE 0 END AS is_deleted_key" to "@end_time >= ..."

March 7, 2017 1:30 AM
 

Adam Machanic said:

@Fanis

Good catch, thanks.

March 7, 2017 9:27 AM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Privacy Statement