
Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Inequality joins, Asynchronous transformations and Lookups : SSIS

It is pretty much accepted by SQL Server Integration Services (SSIS) developers that synchronous transformations are generally quicker than asynchronous transformations (for a description of synchronous and asynchronous transformations go read Asynchronous and synchronous data flow components). Notice I said “generally” and not “always”; there are circumstances where using asynchronous transformations can be beneficial and in this blog post I’ll demonstrate such a scenario, one that is pretty common when building data warehouses.

Imagine I have a [Customer] dimension table that manages information about all of my customers as a slowly changing dimension. If it is a type 2 slowly changing dimension then you will likely have multiple rows per customer in that table. Furthermore, you might also have datetime fields that indicate the effective time period of each member record.

Here is such a table that contains data for four dimension members {Terry, Max, Henry, Horace}:

[Image: customer dimension table]

Notice that we have multiple records per customer and that the [SCDStartDate] of a record is equivalent to the [SCDEndDate] of the record that preceded it (if there was one). (Note that I am on record as saying I am not a fan of this technique of storing an [SCDEndDate] but for the purposes of clarity I have included it here.)
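For reference, a minimal T-SQL sketch of such a table might look like the following. The column names are the ones used throughout this post; the datatypes, the sample dates and the far-future end date on the current record are purely illustrative and may differ from the table in the screenshot.

CREATE TABLE dbo.Customer (
    CustomerId   INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName NVARCHAR(50) NOT NULL,
    SCDStartDate DATETIME     NOT NULL,
    SCDEndDate   DATETIME     NOT NULL   -- a far-future sentinel date is assumed for the current record
);

-- Two hypothetical records for one member; note that the second record's [SCDStartDate]
-- equals the first record's [SCDEndDate], as described above
INSERT INTO dbo.Customer (CustomerName, SCDStartDate, SCDEndDate)
VALUES (N'Terry', '20090101', '20090601'),
       (N'Terry', '20090601', '99991231');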

Anyway, the idea here is that we will have some incoming data containing [CustomerName] & [EffectiveDate] and we need to use those values to lookup [Customer].[CustomerId]. The logic will be:

Lookup a [CustomerId] WHERE [Customer].[CustomerName] = incoming [CustomerName] AND [SCDStartDate] <= [EffectiveDate] AND [EffectiveDate] <= [SCDEndDate]
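Expressed as T-SQL for a single incoming row, that logic looks something like this (the @CustomerName and @EffectiveDate variables, and their values, are hypothetical stand-ins for the values arriving in the pipeline):

DECLARE @CustomerName  NVARCHAR(50) = N'Terry',   -- hypothetical incoming values
        @EffectiveDate DATETIME     = '20090315';

SELECT c.CustomerId
FROM   dbo.Customer AS c
WHERE  c.CustomerName  = @CustomerName
  AND  c.SCDStartDate <= @EffectiveDate
  AND  @EffectiveDate <= c.SCDEndDate;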

The conventional approach to this would be to use a full-cache Lookup, but that isn't an option here because we need inequality conditions. The obvious next step, then, is to use a non-cached Lookup, which enables us to change the SQL statement to use inequality operators:

[Image: no cache lookup]
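In no-cache mode the Lookup's SQL statement can be modified on the Advanced tab and parameterised with ? placeholders that are mapped to the incoming columns, so the modified statement looks roughly like this (a sketch; the exact statement in the demo package may differ):

SELECT CustomerId
FROM   dbo.Customer
WHERE  CustomerName  = ?    -- mapped to the incoming [CustomerName]
  AND  SCDStartDate <= ?    -- mapped to the incoming [EffectiveDate]
  AND  ? <= SCDEndDate      -- mapped to the incoming [EffectiveDate] again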

Let’s take a look at the dataflow:

[Image: dataflow lookup]

Notice these are all synchronous components. This approach works just fine; however, it has the limitation that it issues a SQL statement against the lookup set for every row, so we can expect the execution time of our dataflow to increase linearly with the number of rows flowing through it. That's not good.

OK, that's the obvious method. Let's now look at a different way of achieving this using an asynchronous Merge Join transform coupled with a Conditional Split. I've shown it post-execution so that I can include the row counts, which help to illustrate what is going on here:

[Image: dataflow merge join]

Notice that there are more rows output from our Merge Join component than were input. That is because we are joining on [CustomerName] and, as we know, we have multiple records per [CustomerName] in our lookup set. Notice also that there are two asynchronous components in here (the Sort and the Merge Join).
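Logically, the Merge Join (joining on [CustomerName] only) followed by the Conditional Split (which keeps just the rows whose [EffectiveDate] falls between [SCDStartDate] and [SCDEndDate]) does the same work as a single set-based query. As a sketch, with Staging.IncomingData as a hypothetical stand-in for the source of the incoming rows:

SELECT i.CustomerName,
       i.EffectiveDate,
       c.CustomerId
FROM   Staging.IncomingData AS i              -- the incoming rows
JOIN   dbo.Customer         AS c              -- the lookup set
   ON  c.CustomerName = i.CustomerName        -- the Merge Join condition
WHERE  c.SCDStartDate <= i.EffectiveDate      -- the Conditional Split conditions
  AND  i.EffectiveDate <= c.SCDEndDate;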

I have embedded a video below that compares the execution times of these two methods. The video is just over 8 minutes long.

View on Vimeo 

For those who can't be bothered to watch the video, here are the results: the dataflow that used the Lookup transform took 36 seconds whereas the dataflow that used the Merge Join took less than two seconds. An illustration in case it is needed:

[Image: execution time comparison]

Pretty conclusive proof that in some scenarios it may be quicker to use an asynchronous component than a synchronous one. Your mileage may of course vary.

The scenario outlined here is analogous to performance tuning procedural SQL that uses cursors. It is common to eliminate cursors by converting them to set-based operations and that is effectively what we have done here. Our non-cached lookup is performing a discrete operation for every single row of data, exactly like a cursor does. By eliminating this cursor-in-disguise we have dramatically sped up our dataflow.
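To make the analogy concrete, the non-cached Lookup behaves much like the following T-SQL cursor (a sketch only, reusing the hypothetical Staging.IncomingData table from above), whereas the Merge Join approach behaves like the single set-based join shown earlier:

DECLARE @CustomerName NVARCHAR(50), @EffectiveDate DATETIME, @CustomerId INT;

DECLARE incoming_cursor CURSOR FOR
    SELECT CustomerName, EffectiveDate FROM Staging.IncomingData;

OPEN incoming_cursor;
FETCH NEXT FROM incoming_cursor INTO @CustomerName, @EffectiveDate;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- one discrete lookup per row, just like the non-cached Lookup component
    SELECT @CustomerId = c.CustomerId
    FROM   dbo.Customer AS c
    WHERE  c.CustomerName  = @CustomerName
      AND  c.SCDStartDate <= @EffectiveDate
      AND  @EffectiveDate <= c.SCDEndDate;

    FETCH NEXT FROM incoming_cursor INTO @CustomerName, @EffectiveDate;
END
CLOSE incoming_cursor;
DEALLOCATE incoming_cursor;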

I hope all of that proves useful. You can download the package that I demonstrated in the video from my SkyDrive at http://cid-550f681dad532637.skydrive.live.com/self.aspx/Public/BlogShare/20100514/20100514%20Lookups%20and%20Merge%20Joins.zip

Comments are welcome as always.

@Jamiet

Published Friday, May 14, 2010 11:13 PM by jamiet


Comments

 

Matt Masson said:

You do need to be careful about relative dataset sizes when considering using this approach. If you have a very large reference table (your OLE_SRC), and not many change rows (SCR_SRC), you'll end up spending a lot of time waiting for the OLE_SRC to finish reading, even though there are no more matches coming in from SCR_SRC.

May 14, 2010 6:51 PM
 

jamiet said:

Matt,

Yes indeed. Others' mileage may vary; as always you must test and measure to see what works best for you.

-Jamie

May 15, 2010 3:16 AM
 

Marco Russo (SQLBI) said:

Matt, I agree with you but Jamie is right - each data warehouse (and each dimension) has specific characteristics that don't change much over time - thus, I think there are many scenarios where Jamie's approach is worth the effort (and the risk, which is calculated if you know what you're doing!)

Great idea, Jamie!

Marco

May 15, 2010 12:22 PM
 

AlbertoFerrari said:

Jamie,

The major drawback of this approach is, IMHO, the fact that the inputs of the merge need to be sorted. If one of the inputs is the fact table and you have, for example, three SCD lookups, then you will end up paying for three in-memory blocking sorts of the fact table, something that is seldom acceptable.

When I face these situations, I normally end up with a script component that caches the full SCD in a dictionary where the key is the customer code and each element contains the sorted list of <date, id>. It runs at a speed very close to the normal equi-join non-blocking lookup, even if it needs some coding, of course.

I wish MS would sometime create an SCD lookup... and I think I'm not the only one (Matt... got the point? :))

May 15, 2010 1:00 PM
 

Nitin said:

I've incorporated this suggestion in our test env and it works like a charm.

Thanks for the Tip Jamie.

@nitinsalgar

May 17, 2010 12:47 AM
 

JoshuaDD said:

Hi Jamie,

On an unrelated topic, I left you a question in this forum thread.

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/53e3f59d-9442-43f9-99c5-37dd964408fc

It's on an example you posted a couple years ago so I'm sorry if this isn't fresh in your mind.  Thanks in advance for any help you can give.

Josh

May 19, 2010 9:40 AM
 

François JEHL said:

Really great post Jamie. Even if I agree with Matt on the potential risks, it's the best solution I know of to handle theta-joins. Even though a specific SCD lookup would be great, I think a Lookup with the ability to write the comparison expression (an IComparer thing, I dunno) without losing the cache functionality would be even better. Anyway, thanks for this one.

July 29, 2010 8:49 AM
 

SSIS Junkie said:

I saw a thread on the SSIS forum today that went something like this: I have the following dataset: AccountNo

August 31, 2010 5:22 PM
 

MartinIsti said:

What I haven't been able to figure out is how to produce in SSIS the same result as the following simple T-SQL JOIN statement:

SELECT ...

FROM same_table t1 JOIN same_table t2

ON  t1.column1 =  t2.column1

AND t1.column2 >= t2.column2

The tables are the same (it's a self join) and consist of, for example, 3 rows. The end result can be more than 3 rows.

My problem is the following:

- I've tried the custom query feature on the Lookup component's Advanced tab to create inequality JOINs, and that works fine apart from the fact that a Lookup returns a maximum of 1 row per match

- I've tried using the MERGE JOIN but it does NOT allow inequality JOIN conditions.

I think it is quite a basic requirement that SSIS should provide a feature covering this type of T-SQL JOIN. I hope there is one and it's just me who hasn't found it yet. If not... well, it should provide a solution for this, I think.

If anyone has an idea or just knows that it's already been discussed and solved please let me know, I've been desperately seeking the answer in the last 7 days.

Thanks!

November 30, 2010 11:06 PM
 

jamiet said:

MartinIsti,

Use the same technique that I used in this blog post. Put the "greater than or equal to" condition into a Conditional Split immediately after the Merge Join.

Regards

Jamie

December 1, 2010 3:34 AM
 

MartinIsti said:

Thanks Jamie, in the end I managed to solve the issue by modifying something in the source query (but I purely used SSIS tricks :))

I have another question (not totally related to this post, but it's also about JOINs and LOOKUPs and I haven't found useful info on the net about it - or I might have used the wrong keywords in the search engine):

I have a fact table with 4 date columns. For each of those I would like to get the surrogate key and put only those into the final fact table. Is there a way of doing it without 4 lookups (since the same table is used, it would negatively affect performance, I presume) - apart from doing it with T-SQL JOINs in the source task?

It just does not feel elegant and efficient to use that many lookups. What's your opinion?

Thanks for every idea!

December 5, 2010 8:50 PM
 

jamiet said:

Martin,

Unfortunately it's pretty much unavoidable - if you want to use the Lookup component then you will need four of them. However, if the lookup set is the same for each one then you could alleviate some of the impact on your lookup source by using a Cache Connection (http://consultingblogs.emc.com/jamiethomson/archive/2007/11/16/Katmai_3A00_-SSIS_3A00_-Lookup-component-gets-a-makeover.aspx)

-Jamie

December 6, 2010 8:59 AM
 

MartinIsti said:

Thanks for the tip, Jamie. Though this time I handled it in the source SQL query. I think it's not really elegant (not to mention efficient) to use that many lookups, but it seems we have to live with it.

December 6, 2010 4:16 PM
 

Debo said:

Hi Jamie,

I appreciate the way you've done the video. I am new to SSIS and it is quite useful to watch the experts.

Please post a few more videos on different topics. That will be helpful.

December 13, 2010 11:14 PM
 

SSIS Junkie said:

Introduction The Dataflow task is one of the core components (if not the core component) of SQL Server

January 13, 2011 5:04 AM
 

ragini gupta said:

It's a really great post and it taught me how to use a non-equi join using the Lookup and Merge Join transformations. Thanks a lot.

July 31, 2013 9:00 AM
