Rounding out the PASS Summit today I attended Davide Mauri’s post-con on Building a BI Solution A to Z. It was excellent – so good that I went down to registration and bought the pre/post conference BI track DVD’s just so I can lock our BI team in a room to watch it when I am back at my office :-).
He did remind me of a topic I meant to blog about but had forgotten, namely how to avoid round trips to the SQL Server from an SSIS package that has to look up against a type II slowly changing dimension.
The basic issue is this: suppose you have a canonical type II dimension, with each row having a start and end date to establish when the row is in effect historically. When you go to load fact rows that use that dimension, the fact rows have to be resolved not only to the right dimension row by business key, but also to the right row according to the date ranges in the dimension. This means the lookup operation has to perform some sort of “between” comparison to ensure the date in a given fact row is inside the effective date range for it’s partner row from the dimension.
The issue is that the typical lookup operator in SSIS can only perform equality comparisons on columns, but cannot handle the “between” logic. If you want to use caching, there’s not an obvious way to manage the “between” logic. For that reason, often people resort to a lookup with no caching, and a custom parameterized query inside the lookup transform, defined to handle this. This causes a round trip to the SQL Server for each row, and one-at-a-time processing of the “between,” which is expensive.
There is a workaround for this which, depending on the number of rows and the details of your load, can enable the use of cache and save time. The idea is simple – suppose that a row existed in the slowly changing dimension table for every member, for every single day. If that were the case, then we could select all the rows from the dimension for the range of dates in our to-be-loaded facts. From those, hypothetically we could match up dimension rows and fact rows using only equality. That is, just pick the individual row that matches the fact date and business key.
Materializing all those rows would waste a lot of space, but we can use a view to simply generate the needed daily rows, only for the required date range, at run time. Those can be put into cache and then referenced by a lookup using equality. The cache will almost certainly be bigger than the size of the source rows, but round trips are really expensive, so we can come out ahead in the end.
Here’s the method
First write a select that joins your date dimension table to your slowly changing dimension, using between logic in the ON clause, to form a distinct row for every dimension member as of each day:
myDates.[Date] AS effectiveDate
INNER JOIN myDates
ON myDates.[Date] BETWEEN myDimension.startDate AND myDimension.endDate
Adjust according to the details of your tables, but just make sure you get one and only one effective row, for every member, for every day. This might be a whole lot of rows, but don’t fear – we won’t ever select all of them.
Next, encapsulate that Select in a view, so that the lookup component in SSIS can pull from the view instead of the underlying dimension table.
Now, in SSIS, it will be important not to attempt to pull all the rows for all dates from this view. Instead, add logic to the fact-loading package that can, as a preliminary step, fetch the minimum and maximum dates from your incoming/staged fact data and put those dates into variables. We will use those to limit the range selected from the view.
Define the required cache transform and cache connection to pull rows from this new view into a lookup cache. In that setup, be sure to use a parameterized query that can fetch and cache only the rows from the expanded dimension view that correspond to the date range from the oldest to the newest fact row, which we now should have available in variables. In a nightly loading scenario, this might ideally be just one day, which means you’ll get one “copy” of all effective members from the dimension. If the facts are for two days or a week, then you might get two or seven times the number of dimension members, one set for each day. (For long date ranges, this technique is a bad idea.)
Last, go into the fact loading logic proper, and you can change the lookup component from fetching from the base table to fetching from the cached rows. Voila!
So, like everything, there are tradeoffs here. For loading a single day’s facts, this is highly likely to be faster than the alternative, because we save (number of facts) round trips to the server at run time, and trade that for one visit to fill the cache. Also, we can, probably, get a nice, fast set-based operation to produce all those rows in one pass. However, as the range of dates in the loaded facts grows, then the advantage of this method probably goes down, because we have to “fabricate” so many extra dimension rows using the view, and those all take memory, network bandwidth, and so on.