Lookup. For many SQL Server Integration Services Developers, it's the transformation we most love to dislike - especially if you are using it against data living on remote server not on the same local network as the host running your package. This combination of circumstances sometimes drives us to some very inventive things. In this post I want to look at a situation I ran into a few months ago, how I addressed that situation with SSIS90 and how SSIS100 improves on that.
For the sake of keeping things simple, let us start with a scenario. You have a list of 5,000 email addresses in an XML file. You need do a look-up of those email address and get the person's name and mailing address. You will write that to a CSV file. This is trivial task using SSIS. Where SSIS90 runs into a problem is when the server you are performing the look-up against is on the far-end of a network connection. Following the best practices of SSIS development, supposed your look-up task was based on a SQL statement that specified only the columns of interest. However, what do you do about the rows of interest? In SSIS90, there is not much you can actually do easily.
Remember that unless you use a memory restriction, SSIS90 fully populates the look-up cache with all of the records in the look-up query before processing. That is a double-edge sword: it definitely increases the "wait time" before a data flow task starts processing records. This is especially true if you are pulling data from a remote source with a slow network connection. Yet, the once the data flow starts, the records are processed at blistering speeds.
A straightforward solution to this situation is to use SSIS's ability to serialize buffers to a file - also known as RAW files. In this case, an initial streaming of the lookup data in made into a RAW file. That RAW file is then used as a data source in a second package. The RAW file is "join merged" with the XML source. This effectively provides the same functionality as the look-up did in the previous package. Take care not assume that the RAW file will have all of the needed information - after all, people can move and new email addresses could be added at any time. Therefore, you should when you are designing the package, you should handle unmatched records using the lookup transformation. In addition, you should update the RAW file with the updated data.
In SSIS100, the idea is fundamentally the same, but the tasks are slightly different. The problem being addressed is that there is no obvious way to serialize a look-up cache to a file (aside from the aforementioned technique that may not be at all obvious.) Neither is using a "merge join" as a way of doing a "look-up." The SSIS team has added a new transformation - the Cache Transform -- and modified another - the look-up - to make it more obvious how to serialize the cache.
The Cache Transform component itself is simple enough: within a given Data Flow, you connect this component to a path and it writes the buffers from that path to a new version of the raw-file format know as a Cache-Raw file (or CAW, its file extension.) Like a normal RAW file, the output contains the binary version of the data. It also contains an index covering one or more of the columns. This indexing helps the look-up component efficiently use the cached data. When using this to solve our slow data problem we would run a one-time process to initialize the cache from a data source.
The Look-up component in SSIS100 is expanded to accommodate using the Cache-Raw file. As before, this component can be configured to use an OLE-DB data source. However, it can also be configured to use a Cache-Raw file instead. This can dramatically improve the performance of data flow since it eliminates the start-up delay in acquiring the data. However, there is still the possibility that they cached data may be stale or missing desired matches. Another new feature in the SSIS100 look-up is the ability to direct rows that were not found in the look-up to a new data path. You could "kind of" do this in SSIS90 if you assume that the only row-level error was a failure to find a match. In the new design, rows not matching from the Cache-Raw file could be redirected to another look-up configured to read recently updated data from the remote database.
One thing to keep in mind about the new look-up transform is that it does not seem to maintain the Cache-Raw file automatically. What you should do is re-direct the no-match rows to a traditional look-up. As your complete processing in the data flow, add a second Cache Transform that writes the union of the matched rows and the not-matched but by then looked-up rows. You will need to name that CAW file something other than what you used as source, of course. The File Task can be used to delete the old version of the Cache and rename the new file to the name you used when configuring the first look-up.
Examples of how to build the packages discussed in the post can be downloaded from http://gosqlserver.net/downloads/ttfnfssis100part2.zip .