Microsoft Data & Analytics consultant and Microsoft Data Platform MVP from the Netherlands
In this blog post I will explain the principles of SQL Server Replication Services without going into too much detail, and I will take a look at the BI capabilities that Replication Services could offer in my opinion.
SQL Server Replication Services provides tools to copy and distribute database objects from one database system to another and to maintain consistency afterwards. These tools basically copy or synchronize data with little or no transformation; they do not offer capabilities to transform data or apply business rules like ETL tools do.
The only “transformations” Replication Services offers are filters on the records or columns of your data set. You can achieve this by selecting the desired columns of a table and/or by using a WHERE clause like this:
SELECT <published_columns> FROM [Table] WHERE [DateTime] >= getdate() - 60
There are three types of replication:
Transactional replication replicates data on a transactional level. The Log Reader Agent reads the transaction log of the source database (Publisher) directly and copies the transactions to the Distribution Database (Distributor), which acts as a queue for the destination database (Subscriber). Next, the Distribution Agent moves the copied transactions from the Distribution Database to the Subscriber.
The Distribution Agent can run either at scheduled intervals or continuously, which offers near real-time replication of data!
So for example, when a user executes an UPDATE statement on one or multiple records in the publisher database, this transaction (not the data itself) is copied to the distribution database and is then also executed on the subscriber. When the Distribution Agent is set to run continuously, this process runs all the time and transactions on the publisher are replicated in small batches (near real-time). When it runs at scheduled intervals it executes larger batches of transactions, but the idea is the same.
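The flow described above (log, distribution queue, subscriber) can be illustrated with a small toy model. This is only a sketch in Python, not how SQL Server implements replication: the function names mirror the agents from the text, while the dictionaries, tuples and the `batch_size` parameter are assumptions made up for the illustration.

```python
from collections import deque


def apply_transaction(table, txn):
    """Apply one logged command (not a data copy) to a table held as a dict."""
    op, key, value = txn
    if op == "DELETE":
        table.pop(key, None)
    else:  # INSERT and UPDATE both set the row in this toy model
        table[key] = value


def log_reader_agent(transaction_log, start, distribution_queue):
    """Copy log entries from position `start` onward into the distribution queue."""
    for txn in transaction_log[start:]:
        distribution_queue.append(txn)
    return len(transaction_log)  # new read position in the log


def distribution_agent(distribution_queue, subscriber_table, batch_size=None):
    """Replay queued transactions on the subscriber; None drains the whole queue."""
    applied = 0
    while distribution_queue and (batch_size is None or applied < batch_size):
        apply_transaction(subscriber_table, distribution_queue.popleft())
        applied += 1
    return applied


publisher_table, transaction_log = {}, []
subscriber_table, distribution_queue = {}, deque()

# Users change data on the publisher; every change is also written to the log.
for txn in [("INSERT", 1, "open"), ("INSERT", 2, "open"), ("UPDATE", 1, "closed")]:
    apply_transaction(publisher_table, txn)
    transaction_log.append(txn)

position = log_reader_agent(transaction_log, 0, distribution_queue)
distribution_agent(distribution_queue, subscriber_table)
print(subscriber_table == publisher_table)
```

Running the distribution step often with few queued transactions corresponds to the continuous mode, and running it rarely with a long queue corresponds to the scheduled mode; the replay logic is identical in both cases.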
Snapshot replication makes an initial copy of the database objects that need to be replicated, including the schemas and the data itself. All types of replication must start with a snapshot of the database objects from the Publisher to initialize the Subscriber. Transactional replication needs an initial snapshot of the replicated publisher tables/objects to run its copied transactions on and maintain consistency.
The Snapshot Agent copies the schemas of the tables that will be replicated to files that are stored in the Snapshot Folder, which is a normal folder on the file system. When all the schemas are ready, the data itself is copied from the Publisher to the snapshot folder. The snapshot is generated as a set of bulk copy program (BCP) files. Next, the Distribution Agent moves the snapshot to the Subscriber; if necessary it applies schema changes first and copies the data itself afterwards. The application of schema changes to the Subscriber is a nice feature: when you change the schema of the Publisher with, for example, an ALTER TABLE statement, that change is propagated by default to the Subscriber(s).
Merge replication is typically used in server-to-client environments, for example when subscribers need to receive data, make changes offline, and later synchronize changes with the Publisher and other Subscribers, like mobile devices that need to synchronize once in a while. Because I don’t really see BI capabilities here, I will not explain this type of replication any further.
Replication Services in a BI environment
Transactional Replication can be very useful in BI environments. In my opinion you never want users to run custom (SSRS) reports or PowerPivot solutions directly on your production database; this can slow down the system and can cause deadlocks in the database, which in turn cause errors. Transactional Replication can offer a read-only, near real-time database for reporting purposes with minimal overhead on the source system.
Snapshot Replication can also be useful in BI environments. If you don’t need a near real-time copy of the database, you can choose this form of replication. Besides being an alternative to Transactional Replication, it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
In many solutions I have seen developers create multiple SSIS packages that simply copy data from one or more source systems to a staging database that serves as the source for the ETL process. Creating these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It can filter out columns and/or records and it can even apply schema changes automatically, so I think it offers enough features here. I don’t know how the performance will be or whether it really works as well for this purpose as I expect, but I want to try this out soon!
I got a question regarding the supported Replication Services features in the different editions of SQL Server (Standard, Enterprise, etc.). There is a nice table on MSDN that shows this!
While I was watching the SQL PASS session “What’s Coming Next in SSIS?” by Steve Swartz, the Group Program Manager for the SSIS team, an interesting question came up:
Why is SSIS thought of to be BI, when we use it so frequently for other sorts of data problems?
Steve’s answer was that he breaks the world of data work into three parts:
- Process of inputs
- Enterprise Information Management
All the work you have to do when you have a lot of data to make it useful and clean and get it to the right place. This covers master data management, data quality work, data integration and lineage analysis to keep track of where the data came from. All of these are part of Enterprise Information Management.
Next, Steve said that Microsoft is developing SSIS as part of a large push in all of these areas in the next release of SQL Server. So SSIS will be, next to a BI tool, part of Enterprise Information Management in the next release of SQL Server.
I'm interested in the different ways people use SSIS. I've basically used it for ETL, data migrations and processing inputs. In which ways have you used SSIS?
At the PASS Summit that is happening in Seattle at the moment, Microsoft announced the “BI Semantic Model” (BISM).
It looks like BISM is something like the UDM that we now know from SSAS. While the UDM was the bridge from relational data to multidimensional data, BISM is the bridge from relational data to the column-based Vertipaq engine. Some compare BISM to Business Objects universes.
The next version of SSAS will be able to run either in the old “UDM” mode or in “BISM” mode; a combination is not possible. Of course this will have some radical consequences, because there are a few major differences between the two modes:
- The switch from multidimensional cubes to the in-memory Vertipaq engine
- The switch from MDX to DAX
So multidimensional cubes and MDX will be deprecated? No, not really: SSAS as we know it now will stay a product in the future and will remain supported. But it looks like Microsoft will concentrate on BISM, mainly because multidimensional cubes and MDX are very difficult to learn. Microsoft wants to make BI more approachable and less difficult, just like with Self Service BI.
I would say that it’s really time to start learning PowerPivot and DAX right now, if you have not already started. If Microsoft focuses on the new BISM/Vertipaq technology, that will be the future if you ask me.
Chris Webb wrote an interesting article about BISM and it looks like he is not very enthusiastic about the strategy Microsoft takes here because this could be the end of SSAS cubes within a few years: “while it’s not true to say that Analysis Services cubes as we know them today and MDX are dead, they have a terminal illness. I’d give them two, maybe three more releases before they’re properly dead, based on the roadmap that was announced yesterday.”
What’s also very interesting is the comprehensive comment on this article from Amir Netz. He explains BISM and UDM will live together in Analysis Services in the future and MOLAP is here to stay: “Make no mistake about it – MOLAP is still the bread and butter basis of SSAS, now and for a very long time. MDX is mature, functional and will stay with us forever.”
Read the article from Chris Webb here and make sure you don’t miss the comment from Amir!
SQL Server Denali (SQL Server 2011) CTP1 has been released!
Download it here
SQL 2011 is expected to be ready in the third quarter of 2011! I’ve already blogged about a few new SSIS features here
I will keep you posted!
With SQL Azure Reporting Services you can use SSRS as a service on the Azure platform, with all the benefits of Azure and most of the features and capabilities of the on-premises version. It’s also possible to embed your reports in your Windows or Azure applications.
Benefits of the Azure platform for Azure Reporting Services are:
- Highly available, the cloud services platform has built-in high availability and fault tolerance
- Scalable, the cloud services platform automatically scales up and down
- Secure, your reports and SQL Azure databases are in a safe place in the cloud
- Cost effective, you don’t have to set up servers and you don’t have to invest in managing servers
- Use the same tools you use today to develop your solutions. Just develop your reports in BIDS or Report Builder and deploy to Azure
Disadvantages of the first version are:
- SQL Azure databases are the only supported data sources in the first version, more data sources are expected to come
- No developer extensibility in the first version, so no custom data sources, assemblies, report items or authentication
- No subscriptions or scheduled delivery
- No Windows Authentication, only SQL Azure username/password is supported in the first version, similar to SQL Azure database. When SQL Azure database gets Windows Authentication, Azure Reporting will follow
Despite the disadvantages of the first version I think SQL Azure Reporting Services offers great capabilities and can be extremely useful for a lot of organizations.
I’m really curious about the CTP, which will be available before the end of this year. You can sign up for the SQL Azure Reporting CTP here
Read more about SQL Azure Reporting here
Recently I passed the 70-455 exam. This exam upgrades your SQL 2005 MCTS and MCITP certifications to SQL 2008.
The exam contains 2 sections (basically separate exams), each with 25 questions:
- A part which covers exam 70-448: TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance
- A part which covers exam 70-452: PRO: Designing a Business Intelligence Infrastructure Using Microsoft SQL Server 2008
You need to pass both sections with a score of at least 700. If you fail one section, you fail the entire exam.
How did I study
I searched the internet and the conclusion was that there is no preparation material available for the 70-452 exam, but fortunately there was a self-paced training kit for the 70-448 exam, which also covers this exam. So I bought the book, scanned it for subjects that needed attention, and fortunately that was enough for me to pass the exam.
For the entire list of preparation materials for the 70-448 and 70-452 exams follow the links below:
70-448 preparation materials
70-452 preparation materials
My Current Transcript
The latest releases of SQL Server contained (almost) no new SSIS features. With the release of SSIS 2008, the ability to use C# scripts, the improved data flow and the cached lookup were the most thrilling new features. The release of SQL 2008 R2 only gave us the ability to use a bulk insert mode for the ADO.NET destination, which was a bit disappointing.
Fortunately Matt Mason from the SSIS team announced that the next version of SQL Server (SQL 11) will contain quite some exciting new functionality for SSIS!
- Undo/Redo support. Finally, this should have been added a long time ago ;-)
- Improved copy/paste mechanism. Let’s hope we keep the formatting of components after copy/pasting them!
- Data flow sequence container
- New icons and rounded corners for tasks and transformations
- Improved backpressure for data flow transformations with multiple inputs (for example a Merge Join). When one of the inputs gets too much data compared to the other, the component that receives the data can tell the data flow that it needs more data on the other input
- The Toolbox window will automatically locate and show newly installed custom tasks
I’m curious about the first CTP!
Quite often one or more sources for a data warehouse consist of flat files. Most of the time these files are delivered as a zip file with a date in the file name, for example FinanceDataExport_20100528.zip
Currently I work on a project that does a full load into the data warehouse every night. A zip file with some flat files in it is dropped in a directory on a daily basis. Sometimes there are multiple zip files in the directory; this can happen because the ETL failed or somebody put a new zip file in the directory manually. Because the ETL isn’t incremental, only the most recent file needs to be loaded. To implement this I used the simple code below; it checks which file is the most recent and deletes all other files.
Usage is quite simple, just copy/paste the code in your script task and create two SSIS variables:
- SourceFolder (type String): The folder that contains the (zip) files
- DateInFilename (type Boolean): A flag, set it to True if your filename ends with the date YYYYMMDD, set it to False if the creation date of the files should be used
Note: In a previous blog post I wrote about unzipping zip files within SSIS, you might also find this useful: SSIS – Unpack a ZIP file with the Script Task
Public Sub Main()
    'Requires "Imports System.IO" at the top of the script.
    'Use this piece of code to loop through a set of files in a directory
    'and delete all files except for the most recent one based on a date in the filename.
    'File name example: FinanceDataExport_20100528.zip
    Dim rootDirectory As New DirectoryInfo(Dts.Variables("SourceFolder").Value.ToString) 'Set the directory in SSIS variable SourceFolder. For example: D:\Export\
    Dim mostRecentFile As String = ""
    Dim currentFileDate As Integer
    Dim mostRecentFileDate As Integer
    Dim currentFileCreationDate As Date
    Dim mostRecentFileCreationDate As Date
    Dim dateInFilename As Boolean = CBool(Dts.Variables("DateInFilename").Value) 'If your filename ends with the date YYYYMMDD set SSIS variable DateInFilename to True. If not set to False.
    If dateInFilename Then
        'Check which file is the most recent
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
            currentFileDate = CInt(Left(Right(fi.Name, 12), 8)) 'Get date from current filename (based on a file that ends with: YYYYMMDD.zip)
            If currentFileDate > mostRecentFileDate Then
                mostRecentFileDate = currentFileDate
                mostRecentFile = fi.Name
            End If
        Next
    Else 'Date is not in filename, use creation date
        'Check which file is the most recent
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
            currentFileCreationDate = fi.CreationTime 'Get creation date of current file
            If currentFileCreationDate > mostRecentFileCreationDate Then
                mostRecentFileCreationDate = currentFileCreationDate
                mostRecentFile = fi.Name
            End If
        Next
    End If
    'Delete all files except the most recent one
    For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
        If fi.Name <> mostRecentFile Then
            File.Delete(fi.FullName) 'FullName already contains the directory, so no manual "\" concatenation is needed
        End If
    Next
    Dts.TaskResult = ScriptResults.Success
End Sub
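The selection rule from the script (keep the file with the latest YYYYMMDD date in its name, delete the rest) can also be checked in isolation, without touching the file system. The sketch below is Python rather than Script Task code, and the function names `date_from_name` and `split_most_recent` are made up for this illustration. One deliberate difference, which is an assumption on my side: names without a valid date are skipped instead of raising an error.

```python
from datetime import datetime


def date_from_name(file_name):
    """Return the date from a name ending in YYYYMMDD.zip, or None if it has none."""
    stem = file_name[:-4] if file_name.lower().endswith(".zip") else file_name
    try:
        return datetime.strptime(stem[-8:], "%Y%m%d").date()
    except ValueError:
        return None


def split_most_recent(file_names):
    """Return (file to keep, files to delete); names without a date are left alone."""
    dated = [(date_from_name(name), name) for name in file_names]
    dated = [(file_date, name) for file_date, name in dated if file_date is not None]
    if not dated:
        return None, []
    keep = max(dated)[1]  # latest date wins
    return keep, sorted(name for _, name in dated if name != keep)


names = [
    "FinanceDataExport_20100526.zip",
    "FinanceDataExport_20100528.zip",
    "FinanceDataExport_20100527.zip",
    "readme.zip",
]
keep, delete = split_most_recent(names)
print(keep)
print(delete)
```

Parsing the date instead of comparing integers also rejects impossible values such as 20101341, which the integer comparison would accept.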
Since my last blog post about a SSIS package design pattern I’ve received quite some positive reactions and feedback. Microsoft also added a link to the post on the SSIS portal which made it clear to me that there is quite some attention for this subject.
The feedback I received was mainly about two things:
1. Can you visualize the process or make it clearer without the whole technical story, so it's easier to understand?
2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimensions and/or fact tables?
In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.
SSIS package design pattern for loading a data warehouse
Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite some benefits and has been advised by Kimball since SSIS was released. I have mentioned these benefits in my previous post and will not repeat them here.
When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table in multiple packages. I have thought about a design pattern with the benefits of a single modular package approach and still having all the flexibility and debugging functionalities developers need.
If you have a little bit of programming knowledge you must have heard about classes and functions. Now think about your SSIS package as a class or object that exists within code. These classes contain functions that you can call separately from other classes (packages). That would be some nice functionality to have, but unfortunately this is not possible within SSIS by default.
To realize this functionality in SSIS I thought about SSIS Sequence Containers as functions and SSIS packages as classes.
I personally always use four Sequence Containers in my SSIS packages:
- SEQ Extract (extract the necessary source tables to a staging database)
- SEQ Transform (transform these source tables to a dimension or fact table)
- SEQ Load (load this table into the data warehouse)
- SEQ Process (process the data warehouse table to the cube)
The technical trick that I performed - you can read about the inner working in my previous post - makes it possible to execute only a single Sequence Container within a package, just like with functions in classes when programming code.
The execution of a single dimension or fact table can now be performed from a master SSIS package like this:
1 - [Execute Package Task] DimCustomer.Extract
2 - [Execute Package Task] DimCustomer.Transform
3 - [Execute Package Task] DimCustomer.Load
4 - [Execute Package Task] DimCustomer.Process
The package is executed 4 times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.
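The analogy (packages as classes, Sequence Containers as functions) can be written down literally. The following is a minimal Python sketch of the idea only; it is not SSIS code, and the class `ChildPackage` and the helper `run_single_phase` are hypothetical names chosen for the illustration.

```python
PHASES = ("Extract", "Transform", "Load", "Process")


class ChildPackage:
    """Stands in for one dimension or fact table package."""

    def __init__(self, name):
        self.name = name
        self.executed = []  # log of the containers that actually ran

    def execute(self, disabled=()):
        """Run every phase that is not disabled, always in E/T/L/P order."""
        for phase in PHASES:
            if phase not in disabled:
                self.executed.append(f"{self.name}.{phase}")
        return self.executed


def run_single_phase(package, phase):
    """Mimic one Execute Package Task: disable everything except `phase`."""
    return package.execute(disabled=[p for p in PHASES if p != phase])


dim_customer = ChildPackage("DimCustomer")
for phase in PHASES:  # the master package calls the same package four times
    run_single_phase(dim_customer, phase)
print(dim_customer.executed)
```

Calling `execute()` without arguments runs all four phases, which corresponds to running the package standalone.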
If we look at this in a UML sequence diagram we see the following:
I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.
How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?
One of the questions that came up with using this design pattern is how to handle the extraction of source tables that are used in multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times which is, of course, undesirable.
By coincidence I was reading the book “SQL Server 2008 Integration Services: Problem – Design - Solution” (which is a great book!) and one of the data extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages would have a very simple data flow from the source table to the destination table within the staging area.
Of course this approach will be more time consuming than using one big extract package with all table extracts in it but fortunately it also gives you some benefits:
- Debugging, sometimes a source has changed, i.e. a column’s name could have been changed or completely deleted. The error that SSIS will log when this occurs will point the administrators straight to the right package and source table. Another benefit here is that only one package will fail and needs to be edited, while the others can still execute and remain unharmed.
- Flexibility, you can execute a single table extract from anywhere (master package or dim/fact package).
I recently created some solutions using this extract approach and really liked it. I used 2 SSIS projects:
- one with the dimension and fact table packages
- one with only the extract packages
I have used the following naming conventions on the extract packages: Source_Table.dtsx and deployed them to a separate SSIS folder. This way the packages won’t bother the overview during development.
A tip here is to use BIDS Helper; it has great functionality to deploy one or more packages from BIDS.
Merging this approach in the design pattern will give the following result:
- The Extract Sequence Containers of the dimension and fact table packages will no longer contain Data Flow Tasks; instead they contain Execute Package Tasks that point to the extract packages.
- The Extract Sequence Container of the master package will execute all the necessary extract packages at once.
This way a single source table will always get extracted only one time when executing your ETL from the master package and you still have the possibility to unit test your entire dimension or fact table packages.
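To make the "extract once" idea concrete: the master package only needs the distinct set of source tables over all dimension and fact packages. A small Python sketch of that bookkeeping follows; the package and table names are hypothetical examples, and the function `plan_extracts` is made up for the illustration.

```python
def plan_extracts(package_sources):
    """Return every source table exactly once, in first-seen order."""
    planned = []
    for tables in package_sources.values():
        for table in tables:
            if table not in planned:
                planned.append(table)
    return planned


# Hypothetical mapping of dim/fact packages to the source tables they need.
package_sources = {
    "DimCustomer": ["Customer", "Address"],
    "FactSales": ["Orders", "Customer"],
}
print(plan_extracts(package_sources))
```

When a single dimension package runs standalone, it still calls only its own extract packages, so unit testing one package keeps working.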
Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):
And like this with a run of a single Dimension package:
Overall, the design pattern will now always look like this when executed from a master package:
I think this design pattern is now good enough to be used as a standard approach for most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is of course more than welcome!
I recently had a chat with some BI developers about the design patterns they’re using in SSIS when building an ETL system. We all agreed on creating multiple packages for the dimensions and fact tables and one master package for the execution of all these packages.
These developers even created multiple packages per single dimension/fact table:
- One extract package where the extract(E) logic of all dim/fact tables is stored
- One dim/fact package with the transform(T) logic of a single dim/fact table
- One dim/fact package with the load(L) logic of a single dim/fact table
I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages.
I asked them why they chose this solution and there were multiple reasons:
To me these are good reasons, running the E/T/L phases separately is a thing a developer often wants during the development and testing of an ETL system.
Keeping the loading window on the source system as short as possible is something that’s critical in some projects.
Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:
- All the logic is in one place
- You can perform unit testing
- If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and more efficient
- You can see your packages as separate ETL “puzzle pieces” that are reusable
- It’s good from a project manager point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards
- The overview in BIDS; having an enormous number of packages does not make it clearer ;-)
- It simplifies deployment after changes have been made
- Changes are easier to track in source control systems
- Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other
So basically my goal was clear: to build a solution that has all the possibilities the aforesaid developers asked for, but in one package per dimension / fact table; the best of both worlds.
The solution I’ve created is based on a parent-child package structure. One parent (master) package will execute multiple child (dim/fact) packages. This solution is based on a single (child) package for each dimension and fact table. Each of these packages contains the following Sequence Containers in the Control Flow:
Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.
To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:
Each of these configurations is set on the Disable property of one of the Sequence Containers:
Using this technique makes it possible to run separate Sequence Containers of the child package from the master package, simply by disabling or enabling the appropriate Sequence Containers with parent package variables.
Because the default value of the Disable property of the Sequence Containers is False, you can still run an entire standalone child package, without the need to change anything.
Ok, so far, so good. But, how do I execute only one phase of all the dimension and fact packages simultaneously? Well quite simple:
First add 4 Sequence Containers to the Master package. One for each phase of the ETL, just like in the child packages
Add Execute Package Tasks for all your packages in every Sequence Container
If you would execute this master package now, every child package would run 4 times as there are 4 Execute Package Tasks that run the same package in every sequence container.
To get the required functionality I have created 4 variables inside each Sequence Container (Scope). These will be used as parent variable to set the Disable properties in the child packages. So basically I’ve created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.
Variables for the EXTRACT Sequence Container (vDisableExtract False):
Variables for the TRANSFORM Sequence Container (vDisableTransform False):
The variables in the LOAD and PROCESS Sequence Containers are based on the same technique.
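The 4 x 4 = 16 variables follow one simple rule: inside each Sequence Container of the master package, only the variable for that container's own phase is False and the other three are True. The Python sketch below just generates that matrix. The names vDisableExtract and vDisableTransform come from the text above; vDisableLoad and vDisableProcess are assumed by analogy, and the function `master_variables` is made up for this illustration.

```python
PHASES = ("Extract", "Transform", "Load", "Process")


def master_variables():
    """Build the variables per master Sequence Container (scope).

    A value of True disables the matching container in the child package,
    so each scope leaves only its own phase enabled.
    """
    return {
        container: {f"vDisable{phase}": phase != container for phase in PHASES}
        for container in PHASES
    }


variables = master_variables()
for container, scoped in variables.items():
    print(container, scoped)
```

A child package that is run standalone receives none of these values, so its Disable properties keep their default of False and all four containers run.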
Run all phases of a standalone package: Just execute the package:
Run a single phase of the ETL system (Extract/Transform/Load/Process): Execute the desired sequence container in the main package:
Run a single phase of a single package from the master package:
Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:
Run all the child packages in the right order from the master package:
When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent:
When pressing Continue the package completes:
This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.
With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously like the developers asked for and still have the benefits that come with the one package per dimension/fact table approach. Of course having a single package per dimension or fact table will not be the right choice in all cases but I think it is a good standard approach.
The same applies to the ETL phases (Sequence Containers). I use E/T/L/P, but if you have different phases, which is fine, you can still use the same technique.
Download the solution with template packages from the URLs below. The only thing you need to do is change the connection managers to the child packages (to your location on disk) and run the master package!
Download for SSIS 2008
Download for SSIS 2005
If you have any suggestions, please leave them as a comment. I would like to know what your design pattern is as well!
ATTENTION: See Part-2 on this subject for more background information!
How to: Use the Values of Parent Variables in a Child Package: http://technet.microsoft.com/en-us/library/ms345179.aspx
Recently I had to create a fact table with a lower grain than the source database. My source database contained order lines with a start- and end date and monthly revenue amounts.
To create reports that showed overall monthly revenue per year, lowering the grain was necessary. Because the lines contained revenue per month I decided to blow out the grain of my fact table to monthly records for all the order lines of the source database. For example, an order line with a start date of 1 January 2009 and an end date of 31 December 2009 should result in 12 order lines in the fact table, one line for each month.
To achieve this result I exploded the source records against my DimDate. I used a standard DimDate:
The query below did the job; use it in an SSIS source component and it will explode the order lines to a monthly grain:
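SELECT OL.LineId
FROM OrderLine OL
INNER JOIN DimDate DD
   -- StartDate and EndDate are assumed column names; both are formatted as YYYYMM
   ON DD.Month BETWEEN YEAR(OL.StartDate) * 100 + MONTH(OL.StartDate)
                   AND YEAR(OL.EndDate) * 100 + MONTH(OL.EndDate)
WHERE DD.DayOfMonth = 1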
Some explanation about this query below:
- I always want to connect a record to the first day of the month in DimDate; that’s why the WHERE clause on DD.DayOfMonth = 1 is used.
- Because I want to join on the month (format: YYYYMM) of DimDate, I need to format the start and end date in the same way (YYYYMM).
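To check the expected row count of such an explosion outside the database, the same logic can be sketched in a few lines of Python. This is an illustration only: the function names `month_key` and `explode_to_months` are made up, and the YYYYMM key is computed as year * 100 + month, which is one possible way to get the format described above.

```python
from datetime import date


def month_key(d):
    """Format a date as the integer YYYYMM, the format used for the join."""
    return d.year * 100 + d.month


def explode_to_months(line_id, start, end):
    """Return one (line_id, first day of month) row per month from start to end."""
    rows = []
    year, month = start.year, start.month
    while year * 100 + month <= month_key(end):
        rows.append((line_id, date(year, month, 1)))
        # Step to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return rows


rows = explode_to_months(1, date(2009, 1, 1), date(2009, 12, 31))
print(len(rows))
print(rows[0], rows[-1])
```

For the example from the text, an order line running from 1 January 2009 to 31 December 2009, this yields 12 rows, one for the first day of each month.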
The source, order lines with a start and end date:
The Result, monthly order lines:
The Excel connection manager scans the first 8 rows to determine the data type for a column in your SSIS source component. So if an Excel sheet column has integers on the first 8 rows and a string value on the 9th row, your data flow task will crash when executed because SSIS expects integers.
Fortunately you can change the number of rows that Excel will scan with the TypeGuessRows registry property.
1. Start Registry Editor by typing "regedit" in the run bar of the Start menu.
2. Search the registry (CTRL-F) for "TypeGuessRows".
3. Double click "TypeGuessRows" and edit the value.
Todd McDermid (MVP) commented the following useful addition:
"Unfortunately, that reg key only allows values from 1 to 16 - yes, you can only increase the number of rows Excel will "sample" to 16."
Robbert Visscher commented:
"The reg key also allows the value 0. When this value is set, the excel connection manager scans every row to determine the data type for a column in your SSIS source component."
Thanks Robbert, I think setting it to 0 can be very powerful in some scenarios!
So the conclusion of the comments of Todd and Robbert is that a value from 0 to 16 is possible:
- TypeGuessRows 0: All rows will be scanned. This might hurt performance, so only use it when necessary.
- TypeGuessRows 1-16: A value between 1 and 16 is the default range for this reg key; use this in normal scenarios.
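Why the sample size matters can be shown with a deliberately simplified model. The Python sketch below is not the Excel driver's real algorithm (its rules are more involved); it only assumes that the type is guessed from the first N rows and that 0 means "scan all rows", as described above. The function name `guess_column_type` is made up for this illustration.

```python
def guess_column_type(values, type_guess_rows=8):
    """Guess 'int' or 'str' from a sample of rows; 0 means sample every row."""
    sample = values if type_guess_rows == 0 else values[:type_guess_rows]
    return "int" if all(isinstance(v, int) for v in sample) else "str"


# Eight integers followed by a string on the 9th row, as in the example above.
column = [1, 2, 3, 4, 5, 6, 7, 8, "N/A"]

print(guess_column_type(column, 8))   # only sees the integers
print(guess_column_type(column, 16))  # the sample now includes row 9
print(guess_column_type(column, 0))   # scans every row
```

In this model a string that first appears after row 16 is only detected with the value 0, which is why scanning all rows can be worth the extra time.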
A while ago I needed to unpack a couple of zip files from SSIS. There is no Microsoft SSIS task that contains this functionality so I searched the Internet. It seems that there are quite some third party tools that offer this functionality. It's also possible to download custom SSIS tasks. I personally always try to avoid third party tools and custom tasks so I searched on.
It seemed there is a way to unzip files from SSIS with the Script Task. With some Visual Basic code using the Visual J# Library you can do the job. In this blog post I will use a Foreach Loop Container to loop through a folder that contains multiple zip files and unzip them one-by-one.
Make sure you have the Microsoft Visual J# Redistributable Package installed because a reference to vjslib.dll (Visual J# Library) is needed in the Script Task. Download it here for free.
Drag and drop a Foreach Loop Container on the Control Flow and create three variables with scope on the Foreach Loop Container:
Now configure the Foreach Loop Container:
- Enumerator: Foreach File Enumerator
- Files: *.zip
- Retrieve file name: Name and extension
Next, click the + next to Expressions and add the following expression to connect the SourceFolder variable to the Directory property of the Foreach Loop Container:
Now go to the Variable Mappings and select the FileName variable on Index 0. This way we can access the current file name while the Foreach Loop Container enumerates the zip files.
Now drag and drop a Script Task on the Control Flow, inside the Foreach Loop Container:
Open the Script Task Editor and do the following:
- Set the ScriptLanguage to: Microsoft Visual Basic 2008
- Select our three ReadOnlyVariables using the new SSIS2008 Select Variables window:
Now click Edit Script and copy/paste the following script:
' Requires at the top of the script file: Imports java.util.zip
Public Sub Main()
    Try
        Dim strSourceFile As String
        Dim strDestinationDirectory As String

        'MsgBox("Current File: " & Dts.Variables("FileName").Value.ToString)
        ' The folder variables are concatenated as-is, so both must end with a backslash
        strDestinationDirectory = Dts.Variables("DestinationFolder").Value.ToString
        strSourceFile = Dts.Variables("SourceFolder").Value.ToString & Dts.Variables("FileName").Value.ToString

        Dim oFileInputStream As New java.io.FileInputStream(strSourceFile)
        Dim oZipInputStream As New java.util.zip.ZipInputStream(oFileInputStream)
        Dim sbBuf(1024) As SByte

        ' Walk through all entries of the zip file
        While 1 = 1
            Dim oZipEntry As ZipEntry = oZipInputStream.getNextEntry()
            If oZipEntry Is Nothing Then Exit While

            If oZipEntry.isDirectory Then
                ' Directory entry: create the folder if it does not exist yet
                If Not My.Computer.FileSystem.DirectoryExists(strDestinationDirectory & oZipEntry.getName) Then
                    My.Computer.FileSystem.CreateDirectory(strDestinationDirectory & oZipEntry.getName)
                End If
            Else
                ' File entry: copy its content to the destination in blocks
                Dim oFileOutputStream As New java.io.FileOutputStream(strDestinationDirectory.Replace("\", "/") & oZipEntry.getName())
                While 1 = 1
                    Dim iLen As Integer = oZipInputStream.read(sbBuf)
                    If iLen < 0 Then Exit While
                    oFileOutputStream.write(sbBuf, 0, iLen)
                End While
                oFileOutputStream.close()
            End If
        End While

        oZipInputStream.close()
        oFileInputStream.close()
    Catch ex As Exception
        Throw New Exception(ex.Message)
    End Try

    Dts.TaskResult = ScriptResults.Success
End Sub
Now only one thing remains to be done: add a reference to vjslib.dll (Visual J# Library):
Your unzip solution is ready now! For testing purposes you can uncomment the following line in the script to see the file name of each processed zip file in a message box at runtime:
'MsgBox("Current File: " & Dts.Variables("FileName").Value.ToString)
You can use this solution in many ways. For example, I used it in the solution below, where I download multiple zip files from an FTP server. These zip files contain CSV files that are used as the source for loading a data warehouse.
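If you want to try out the logic outside SSIS first, the same loop (every *.zip file in a source folder, extracted into a destination folder) can be sketched in a few lines of Python with the standard zipfile module. This is only an illustration of the idea and not part of the package; the function name `unzip_all` and the folder paths in the usage comment are made up for this example.

```python
import zipfile
from pathlib import Path


def unzip_all(source_folder, destination_folder):
    """Extract every *.zip file in source_folder into destination_folder.

    Returns the names of the zip files that were processed, in sorted order.
    """
    source = Path(source_folder)
    destination = Path(destination_folder)
    destination.mkdir(parents=True, exist_ok=True)

    processed = []
    # Same role as the Foreach File Enumerator with Files set to *.zip
    for zip_path in sorted(source.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            # Creates sub folders and writes the files, like the inner loop of the script
            archive.extractall(destination)
        processed.append(zip_path.name)
    return processed


# Hypothetical usage:
# unzip_all(r"C:\Temp\Source", r"C:\Temp\Destination")
```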
A while ago I figured out that the lookup transformation is case sensitive.
I used a lookup to find dimension table members for my fact table records. This was done on a string business key like ‘AA12BB’. I attached a table to the error output, and after running the package I found one record in this table. This record had the business key ‘Aa12BB’. I searched the dimension table for the missing member and, to my surprise, it DID exist, but with the business key ‘AA12BB’. It seemed the lookup transformation is case sensitive.
The next thing I tried was a T-SQL query in SQL Server 2005 Management Studio. In the WHERE clause I referred to the business key ‘Aa12BB’, and the query returned the record with business key ‘AA12BB’. Conclusion: SQL Server (with the case-insensitive collation used here) is not case sensitive, but the SSIS lookup component IS case sensitive… Interesting.
After some research I found a few solutions for this interesting feature of the lookup transformation. Before I explain these solutions you must know something about the inner working of the lookup component.
A lookup transformation uses full caching by default. This means that the first thing it does on execution is load all the lookup data into its cache. From then on the comparisons are done by SSIS itself against that cache, and those comparisons are case sensitive.
One solution is to set the CacheType property of the lookup transformation to Partial or None. The lookup comparisons will then be done by SQL Server and not by the SSIS lookup component.
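Another solution is to bring the data into the same case before you do the lookup. You can do this with the T-SQL LOWER() or UPPER() functions in the source query and in the lookup query, or with the equivalent LOWER() or UPPER() expression functions in a Derived Column transformation. Make sure both sides of the comparison, the incoming rows and the lookup reference data, are converted in the same way.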
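The difference between an exact, case-sensitive match and a match on normalized keys can be illustrated with a small Python sketch. It only mimics the behaviour described above with a dictionary and is not how the lookup component is implemented; the dimension content and the function names are made up for this example.

```python
# Hypothetical dimension: business key -> surrogate key
dimension = {"AA12BB": 1, "CC34DD": 2}


def lookup_exact(business_key):
    """Case-sensitive match, comparable to the full cache behaviour."""
    return dimension.get(business_key)


# Normalize the reference data once, comparable to UPPER() in the lookup query
normalized_dimension = {key.upper(): value for key, value in dimension.items()}


def lookup_normalized(business_key):
    """Match after converting the incoming key to upper case as well."""
    return normalized_dimension.get(business_key.upper())


print(lookup_exact("Aa12BB"))       # None: the row would go to the error output
print(lookup_normalized("Aa12BB"))  # 1: the dimension member is found
```

Note that the normalization is applied to both the reference keys and the incoming key. Converting only one side would still leave mismatches.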