In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool:
“Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards.
In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!”
Well I have tried it out and I must say it worked well. I was able to let replication services do work in a fraction of the time it would cost me to do the same in SSIS.
What I did was the following:
- Configure snapshot replication for some Adventure Works tables, this was quite simple and straightforward.
- Create an SSIS package that executes the snapshot replication on demand and waits for its completion.
This is something that you can’t do with out of the box functionality. While configuring the snapshot replication two SQL Agent Jobs are created, one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are asynchronous which means that if you execute them they immediately report back if the job started successfully or not, they do not wait for completion and report its result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues.
Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it.
Configure snapshot replication
The first step is to create a publication on the database you want to replicate.
Connect to SQL Server Management Studio and right-click Replication, choose for New.. Publication…
The New Publication Wizard appears, click Next
Choose your “source” database and click Next
Choose Snapshot publication and click Next
You can now select tables and other objects that you want to publish
Expand Tables and select the tables that are needed in your ETL process
In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example.
Its possible to filter out rows and/or columns. In this example I did not apply any filters.
Schedule the Snapshot Agent to run at a desired time, by doing this a SQL Agent Job is created which we need to execute from a SSIS package later on.
Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.
In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!
On the next screen choose to create the publication at the end of the wizard
Give the publication a name (SnapshotTest) and complete the wizard
The publication is created and the articles (tables in this case) are added
Now the publication is created successfully its time to create a new subscription for this publication.
Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions
The New Subscription Wizard appears
Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)
You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor.
Of course we need a database for the subscription and fortunately the Wizard can create it for you. Choose for New database
Give the database the desired name, set the desired options and click OK
You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.
You now need to set the security settings for the Distribution Agent. Click on the …. button
Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here
Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on.
Initialize the subscription at first synchronization
Select the first box to create the subscription when finishing this wizard
Complete the wizard by clicking Finish
The subscription will be created
In SSMS you see a new database is created, the subscriber. There are no tables or other objects in the database available yet because the replication jobs did not ran yet.
Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:
Rename this job to “CreateSnapshot”
Now search for the job that distributes the snapshot:
Rename this job to “DistributeSnapshot”
Create an SSIS package that executes the snapshot replication
We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started the package needs to wait until its finished before the package execution finishes.
The Execute SQL Server Agent Job Task is designed to execute SQL Agent Jobs from SSIS. Unfortunately this SSIS task only executes the job and reports back if the job started succesfully or not, it does not report if the job actually completed with success or failure. This is because these jobs are asynchronous.
The SSIS package I’ve created does the following:
- It runs the CreateSnapshot job
- It checks every 5 seconds if the job is completed with a for loop
- When the CreateSnapshot job is completed it starts the DistributeSnapshot job
- And again it waits until the snapshot is delivered before the package will finish successfully
Quite simple and the package is ready to use as standalone extract mechanism. After executing the package the replicated tables are added to the subscriber database and are filled with data:
Download the SSIS package here (SSIS 2008)
In this example I only replicated 5 tables, I could create a SSIS package that does the same in approximately the same amount of time. But if I replicated all the 70+ AdventureWorks tables I would save a lot of time and boring work! With replication services you also benefit from the feature that schema changes are applied automatically which means your entire extract phase wont break. Because a snapshot is created using the bcp utility (bulk copy) it’s also quite fast, so the performance will be quite good.
Disadvantages of using snapshot replication as extraction tool is the limitation on source systems. You can only choose SQL Server or Oracle databases to act as a publisher.
So if you plan to build an extract phase for your ETL process that will invoke a lot of tables think about replication services, it would save you a lot of time and thanks to the Extract SSIS package I’ve created you can perfectly fit it in your usual SSIS ETL process.