
Jorg Klein

Microsoft Business Intelligence consultant from the Netherlands

  • The next version of SSIS is coming!

    The latest releases of SQL Server contained (almost) no new SSIS features. With the release of SSIS 2008, the ability to use C# scripts, the improved data flow engine and the cached lookup were the most exciting new features. The release of SQL Server 2008 R2 only gave us a bulk insert mode for the ADO.NET destination, which was a bit disappointing.

    Fortunately Matt Mason from the SSIS team announced that the next version of SQL Server (SQL 11) will contain quite some exciting new functionality for SSIS!

    - Undo/Redo support. Finally, this should have been added a long time ago ;-)

    - Improved copy/paste mechanism. Let’s hope we keep the formatting of components after copy/pasting them!

    - Data flow sequence container

    - New icons and rounded corners for tasks and transformations

    - Improved backpressure for data flow transformations with multiple inputs (for example a Merge Join). When one of the inputs gets too much data compared to the other, the component that receives the data can tell the data flow engine that it needs more data on the other input

    - The Toolbox window will automatically locate and show newly installed custom tasks

    I’m curious about the first CTP!

  • SSIS – Delete all files except for the most recent one

    Quite often one or more sources for a data warehouse consist of flat files. Most of the time these files are delivered as a zip file with a date in the file name, for example FinanceDataExport_20100528.zip

    I currently work on a project that does a full load into the data warehouse every night. A zip file with some flat files in it is dropped in a directory on a daily basis. Sometimes there are multiple zip files in the directory; this can happen because the ETL has failed or because somebody has put an extra zip file in the directory manually. Because the ETL isn’t incremental, only the most recent file needs to be loaded. To implement this I used the simple code below; it checks which file is the most recent and deletes all other files.

    Usage is quite simple: just copy/paste the code into your Script Task and create two SSIS variables:

    • SourceFolder (type String): The folder that contains the (zip) files
    • DateInFilename (type Boolean): A flag; set it to True if your filename ends with the date YYYYMMDD, set it to False if the creation date of the files should be used

    Note: In a previous blog post I wrote about unzipping zip files within SSIS; you might also find that useful: SSIS – Unpack a ZIP file with the Script Task

    'Add Imports System.IO at the top of the script file; DirectoryInfo, FileInfo and File live in that namespace.
    Public Sub Main()
    
        'Use this piece of code to loop through a set of files in a directory
        'and delete all files except for the most recent one based on a date in the filename.
    
        'File name example:
        'DataExport_20100413.zip
    
        Dim rootDirectory As New DirectoryInfo(Dts.Variables("SourceFolder").Value.ToString) 'Set the directory in SSIS variable SourceFolder. For example: D:\Export\
        Dim mostRecentFile As String = ""
        Dim currentFileDate As Integer
        Dim mostRecentFileDate As Integer
        Dim currentFileCreationDate As Date
        Dim mostRecentFileCreationDate As Date
    
        Dim dateInFilename As Boolean = CBool(Dts.Variables("DateInFilename").Value) 'If your filename ends with the date YYYYMMDD set SSIS variable DateInFilename to True. If not, set it to False.
    
    
        If dateInFilename Then
    
            'Check which file is the most recent
            For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
                currentFileDate = CInt(Left(Right(fi.Name, 12), 8)) 'Get date from current filename (based on a file that ends with: YYYYMMDD.zip)
    
                If currentFileDate > mostRecentFileDate Then
    
                    mostRecentFileDate = currentFileDate
                    mostRecentFile = fi.Name
    
                End If
    
            Next
    
        Else 'Date is not in filename, use creation date
    
            'Check which file is the most recent
            For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
                currentFileCreationDate = fi.CreationTime 'Get creation date of current file
    
                If currentFileCreationDate > mostRecentFileCreationDate Then
    
                    mostRecentFileCreationDate = currentFileCreationDate
                    mostRecentFile = fi.Name
    
                End If
    
            Next
    
        End If
    
    
        'Delete all files except the most recent one
        For Each fi As FileInfo In rootDirectory.GetFiles("*.zip")
    
            If fi.Name <> mostRecentFile Then
    
                File.Delete(rootDirectory.ToString + "\" + fi.Name)
    
            End If
    
        Next
    
        Dts.TaskResult = ScriptResults.Success
    End Sub
  • SSIS - Package design pattern for loading a data warehouse - Part 2

    Since my last blog post about an SSIS package design pattern I’ve received quite a lot of positive reactions and feedback. Microsoft also added a link to the post on the SSIS portal, which made it clear to me that there is quite a lot of attention for this subject.

    The feedback I received was mainly about two things:
    1. Can you visualize the process, without the whole technical story, so it's easier to understand?
    2. How should the Extract phase of the ETL process be implemented when source tables are used by multiple dimensions and/or fact tables?

    In this post I will try to answer these questions. By doing so I hope to offer a complete design pattern that is usable for most data warehouse ETL solutions developed using SSIS.


    SSIS package design pattern for loading a data warehouse

    Using one SSIS package per dimension / fact table gives developers and administrators of ETL systems quite some benefits and has been advised by Kimball since SSIS was released. I have mentioned these benefits in my previous post and will not repeat them here.

    When using a single modular package approach, developers sometimes face problems concerning flexibility or a difficult debugging experience. Therefore, they sometimes choose to spread the logic of a single dimension or fact table over multiple packages. I have thought about a design pattern that keeps the benefits of a single modular package approach while still offering all the flexibility and debugging capabilities developers need.

    If you have a little bit of programming knowledge you must have heard about classes and functions. Now think about your SSIS package as a class or object that exists within code. These classes contain functions that you can call separately from other classes (packages). That would be some nice functionality to have, but unfortunately this is not possible within SSIS by default.
    To realize this functionality in SSIS I thought about SSIS Sequence Containers as functions and SSIS packages as classes.
    I personally always use four Sequence Containers in my SSIS packages:
    - SEQ Extract (extract the necessary source tables to a staging database)
    - SEQ Transform (transform these source tables to a dimension or fact table)
    - SEQ Load (load this table into the data warehouse)
    - SEQ Process (process the data warehouse table to the cube)

    The technical trick that I performed - you can read about the inner workings in my previous post - makes it possible to execute only a single Sequence Container within a package, just like calling a function on a class in programming code.
    The execution of a single dimension or fact table can now be performed from a master SSIS package like this:

    1 - [Execute Package Task] DimCustomer.Extract
    2 - [Execute Package Task] DimCustomer.Transform
    3 - [Execute Package Task] DimCustomer.Load
    4 - [Execute Package Task] DimCustomer.Process

    The package is executed 4 times with an Execute Package Task, but each time only the desired function (Sequence Container) will run.

    If we look at this in a UML sequence diagram we see the following:
    ETL_Example1

    I think this sequence diagram gives you a good overview of how this design pattern is organized. For the technical solution and the download of a template package you should check my previous post.


    How should the Extract phase of the ETL process be implemented when a single source table is used by multiple dimensions and/or fact tables?

    One of the questions that came up with using this design pattern is how to handle the extraction of source tables that are used in multiple dimensions and/or fact tables. The problem here is that a single table would be extracted multiple times which is, of course, undesirable.

    By coincidence I was reading the book “SQL Server 2008 Integration Services: Problem – Design – Solution” (which is a great book!) and one of the data extraction best practices (Chapter 5) is to use one package for the extraction of each source table. Each of these packages would have a very simple data flow from the source table to the destination table within the staging area.
    Of course this approach will be more time consuming than using one big extract package with all table extracts in it, but fortunately it also gives you some benefits:
    - Debugging: sometimes a source changes, e.g. a column is renamed or deleted. The error that SSIS will log when this occurs will point the administrators straight to the right package and source table. Another benefit here is that only one package will fail and needs to be edited, while the others can still execute and remain unharmed.
    - Flexibility: you can execute a single table extract from anywhere (master package or dim/fact package).

    I recently created some solutions using this extract approach and really liked it. I used 2 SSIS projects:
    - one with the dimension and fact table packages
    - one with only the extract packages
    I have used the following naming convention for the extract packages: Source_Table.dtsx, and deployed them to a separate SSIS folder. This way the packages won’t clutter the overview during development.
    A tip here is to use BIDS Helper; it has great functionality to deploy one or more packages from BIDS.

    Merging this approach into the design pattern gives the following result:
    - The dimension and fact table extract Sequence Containers will no longer have data flow tasks in them, but Execute Package Tasks that point to the extract packages.
    - The Extract Sequence Container of the master package will execute all the necessary extract packages at once.

    This way a single source table will only get extracted once when executing your ETL from the master package, and you still have the possibility to unit test your entire dimension or fact table packages.
    Drawing this approach again in a sequence diagram gives us the following example with a run from the master package (only the green Sequence Containers are executed):
    ETL_Example2

    And like this with a run of a single Dimension package:
    ETL_Example2_Customer

    Overall, the design pattern will now always look like this when executed from a master package:
    ETL_Overview


    Conclusion
    I think this design pattern is now good enough to be used as a standard approach for most data warehouse ETL projects using SSIS. Thanks for all the feedback! New feedback is of course more than welcome!

  • SSIS – Package design pattern for loading a data warehouse

    I recently had a chat with some BI developers about the design patterns they’re using in SSIS when building an ETL system. We all agreed on creating multiple packages for the dimensions and fact tables and one master package for the execution of all these packages.

    These developers even created multiple packages per single dimension/fact table:

    • One extract package where the extract(E) logic of all dim/fact tables is stored
    • One dim/fact package with the transform(T) logic of a single dim/fact table
    • One dim/fact package with the load(L) logic of a single dim/fact table

    I like the idea of building the Extract, Transform and Load logic separately, but I do not like the way the logic was spread over multiple packages.
    I asked them why they chose this solution and there were multiple reasons:

    • Enable running the E/T/L parts separately, for example: run only the entire T phase of all dim/fact tables.
    • Run the extracts of all dimensions and fact tables simultaneously to keep the loading window on the source system as short as possible.

    To me these are good reasons; running the E/T/L phases separately is something a developer often wants during the development and testing of an ETL system.
    Keeping the loading window on the source system as short as possible is critical in some projects.

    Despite the good arguments to design their ETL system like this, I still prefer the idea of having one package per dimension / fact table, with complete E/T/L logic, for the following reasons:

    • All the logic is in one place
    • Increase understandability
    • Perform unit testing
    • If there is an issue with a dimension or fact table, you only have to make changes in one place, which is safer and more efficient
    • You can see your packages as separate ETL “puzzle pieces” that are reusable
    • It’s good from a project manager point of view; let your customer accept dimensions and fact tables one by one and freeze the appropriate package afterwards
    • A better overview in BIDS; having an enormous number of packages does not make it clearer ;-)
    • Simplifies deployment after changes have been made
    • Changes are easier to track in source control systems
    • Team development will be easier; multiple developers can work on different dim/fact tables without bothering each other.

    So basically my goal was clear: to build a solution that has all the possibilities the aforementioned developers asked for, but in one package per dimension / fact table; the best of both worlds.

    Solution:

    The solution I’ve created is based on a parent-child package structure. One parent (master) package will execute multiple child (dim/fact) packages. This solution is based on a single (child) package for each dimension and fact table. Each of these packages contains the following Sequence Containers in the Control Flow: 
     ChildControlFlow 

    Normally it would not be possible to execute only the Extract, Transform, Load or (cube) Process Sequence Containers of the child (dim/fact) packages simultaneously.

    To make this possible I have created four Parent package variable configurations, one for each ETL phase Sequence Container in the child package:

    clip_image003

    clip_image005

    Each of these configurations is set on the Disable property of one of the Sequence Containers:
    clip_image007

    Using this technique makes it possible to run separate Sequence Containers of the child package from the master package, simply by disabling or enabling the appropriate Sequence Containers with parent package variables.
    Because the default value of the Disable property of the Sequence Containers is False, you can still run an entire standalone child package, without the need to change anything.

    Ok, so far, so good. But how do I execute only one phase of all the dimension and fact packages simultaneously? Well, quite simple:


    First add 4 Sequence Containers to the master package, one for each phase of the ETL, just like in the child packages


    Add Execute Package Tasks for all your packages in every Sequence Container


    clip_image008


    If you would execute this master package now, every child package would run 4 times, as there are 4 Execute Package Tasks that run the same package, one in every Sequence Container.
    To get the required functionality I have created 4 variables inside each Sequence Container (scope). These will be used as parent variables to set the Disable properties in the child packages. So basically I’ve created 4 variables x 4 Sequence Containers = 16 variables for the entire master package.

    Variables for the EXTRACT Sequence Container (vDisableExtract = False):
    clip_image009

    Variables for the TRANSFORM Sequence Container (vDisableTransform = False):
    clip_image010

    The LOAD and PROCESS Sequence Containers contain variables based on the same technique.

    Results:

    Run all phases of a standalone package: Just execute the package:
    clip_image011

    Run a single phase of the ETL system (Extract/Transform/Load/Process): Execute the desired sequence container in the main package:

    RunAllTransforms 

    Run a single phase of a single package from the master package:
    RunSinglePhaseOfOnePackage


    Run multiple phases of the ETL system, for example only the T and L: Disable the Sequence Containers of the phases that need to be excluded in the master package:

    RunMultiplePhasesAndExcludeOthers

    Run all the child packages in the right order from the master package:
    When you add a breakpoint on, for example, the LOAD Sequence Container you see that all the child packages are at the same ETL phase as their parent: 
    RunCompleteMasterPackageBreakPoint


    When pressing Continue the package completes: 
    RunCompleteMasterPackageBreakPointCompleted


    Conclusion:

    This parent/child package design pattern for loading a Data Warehouse gives you all the flexibility and functionality you need. It’s ready and easy to use during development and production without the need to change anything.

    With only a single SSIS package for each dimension and fact table you now have the functionality that separate packages would offer. You will be able to, for example, run all the Extracts for all dimensions and fact tables simultaneously like the developers asked for and still have the benefits that come with the one package per dimension/fact table approach.

    Of course having a single package per dimension or fact table will not be the right choice in all cases but I think it is a good standard approach.
    The same applies to the ETL phases (Sequence Containers). I use E/T/L/P, but if you have different phases that's fine; you can still use the same technique.

    Download the solution with template packages from the URLs below. The only thing you need to do is change the connection managers to the child packages (to your location on disk) and run the master package!


    Download for SSIS 2008

    Download for SSIS 2005


    If you have any suggestions, please leave them as a comment. I would like to know what your design pattern is as well!


    ATTENTION: See Part-2 on this subject for more background information!


    Backgrounds:

    How to: Use the Values of Parent Variables in a Child Package: http://technet.microsoft.com/en-us/library/ms345179.aspx

  • SSIS - Blowing-out the grain of your fact table

    Recently I had to create a fact table with a lower grain than the source database. My source database contained order lines with a start and end date and monthly revenue amounts.

    To create reports that showed overall monthly revenue per year, lowering the grain was necessary. Because the lines contained revenue per month I decided to blow out the grain of my fact table to monthly records for all the order lines of the source database. For example, an order line with a start date of 1 January 2009 and an end date of 31 December 2009 should result in 12 order lines in the fact table, one line for each month.

    To achieve this result I exploded the source records against my DimDate. I used a standard DimDate:
    clip_image001[4]

    The query below did the job; use it in an SSIS source component and it will explode the order lines to a monthly grain:

    SELECT OL.LineId
          ,DD.ActualDate
          ,OL.StartDate
          ,OL.EndDate
      FROM OrderLine OL
     INNER JOIN DimDate DD
        ON DD.Month BETWEEN (YEAR(OL.StartDate) * 100 + MONTH(OL.StartDate))
                        AND (YEAR(OL.EndDate) * 100 + MONTH(OL.EndDate))
     WHERE DD.DayOfMonth = 1

    Some explanation about this query below:

    · I always want to connect a record to the first day of the month in DimDate; that’s why this WHERE clause is used:

    WHERE DD.DayOfMonth = 1


    · Because I want to do a join on the month (format: YYYYMM) of DimDate, I need to format the start and end date the same way (YYYYMM):

    (YEAR(OL.StartDate) * 100 + MONTH(OL.StartDate))
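
    As a quick check, the snippet below shows what this expression produces for an example date (the literal date is just an illustration, not part of the original query):

    -- Turns a date into an integer in YYYYMM format, so it can be compared to DimDate.Month
    SELECT YEAR('2009-01-15') * 100 + MONTH('2009-01-15') AS MonthKey  -- returns 200901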

    The source, order lines with a start and end date:
    clip_image002[4]


    The Result, monthly order lines:
    clip_image003[4]

  • SSIS - Let the Excel connection manager pick the right column data types from an Excel source

    The Excel connection manager scans the first 8 rows to determine the data type of a column in your SSIS source component. So if an Excel sheet column has integers in the first 8 rows and a string value in the 9th row, your data flow task will fail when executed because SSIS expects integers.

    Fortunately you can change the number of rows that Excel will scan with the TypeGuessRows registry property.

    Change TypeGuessRows:

    1. Start Registry Editor by typing "regedit" in the run bar of the Start menu.

    2. Search the registry (CTRL+F) for "TypeGuessRows".
    Search

    3. Double click "TypeGuessRows" and edit the value.
    Edit
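
    If you prefer to script this change, a command like the one below should work. Note that the exact registry path depends on the Excel provider, Office version and bitness (this example assumes the 32-bit Jet 4.0 provider used for .xls files), so verify the path you found in step 2 first:

    REM Set TypeGuessRows to 0 (scan all rows) for the Jet 4.0 Excel driver; adjust the path to match your system
    reg add "HKLM\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel" /v TypeGuessRows /t REG_DWORD /d 0 /f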

    Todd McDermid (MVP) commented the following useful addition:
    "Unfortunately, that reg key only allows values from 1 to 16 - yes, you can only increase the number of rows Excel will "sample" to 16."

    Robbert Visscher commented:
    "The reg key also allows the value 0. When this value is set, the excel connection manager scans every row to determine the data type for a column in your SSIS source component."

    Thanks Robbert, I think setting it to 0 can be very powerful in some scenarios!

    So the conclusion of the comments of Todd and Robbert is that a value from 0 to 16 is possible:
    • TypeGuessRows 0: All rows will be scanned. This might hurt performance, so only use it when necessary.
    • TypeGuessRows 1-16: A value between 1 and 16 is the default range for this reg key; use this in normal scenarios.
  • SSIS – Unpack a ZIP file with the Script Task

    A while ago I needed to unpack a couple of zip files from SSIS. There is no Microsoft SSIS task that contains this functionality, so I searched the Internet. It seems that there are quite some third party tools that offer this functionality. It's also possible to download custom SSIS tasks. I personally always try to avoid third party tools and custom tasks, so I searched on.
    It turned out there is a way to unzip files from SSIS with the Script Task. With some Visual Basic code using the Visual J# Library you can do the job. In this blog post I will use a Foreach Loop Container to loop through a folder that contains multiple zip files and unzip them one by one.

    Make sure you have the Microsoft Visual J# Redistributable Package installed because a reference to vjslib.dll (Visual J# Library) is needed in the Script Task. Download it here for free.

    Drag and drop a Foreach Loop Container on the Control Flow and create three variables with scope on the Foreach Loop Container:
    Variables

    Now configure the Foreach Loop Container:
    - Enumerator: Foreach File Enumerator
    - Files: *.zip
    - Retrieve file name: Name and extension
    Editor

    Next, click on the + next to Expressions and add the following expression to connect the SourceFolder variable to the Directory property of the Foreach Loop Container:
    Properties

    Now go to the Variable Mappings and select the FileName variable on Index 0. This way we will be able to access the current file name when the Foreach Loop Container enumerates the zip files.
    Editor2

    Now drag and drop a Script Task on the Control Flow, inside the Foreach Loop Container:
    Control Flow

    Open the Script Task Editor and do the following:
    - Set the ScriptLanguage to: Microsoft Visual Basic 2008
    - Select our three ReadOnlyVariables using the new SSIS 2008 Select Variables window:
    SelectVariables

    Now click Edit Script and copy/paste the following script:

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Runtime
    Imports java.util.zip

        'Paste the Main method below inside the ScriptMain class that the Script Task generates
        '(it replaces the existing Main method; the End Class at the bottom belongs to that generated class).
        Public Sub Main()

            Try

                Dim strSourceFile As String
                Dim strDestinationDirectory As String

                'MsgBox("Current File: " & Dts.Variables("FileName").Value.ToString)

                strDestinationDirectory = Dts.Variables("DestinationFolder").Value.ToString 
                strSourceFile = Dts.Variables("SourceFolder").Value.ToString & Dts.Variables("FileName").Value.ToString

                Dim oFileInputStream As New java.io.FileInputStream(strSourceFile)
                Dim oZipInputStream As New java.util.zip.ZipInputStream(oFileInputStream)
                Dim bTrue As Boolean = True
                Dim sbBuf(1024) As SByte

                While 1 = 1

                    Dim oZipEntry As ZipEntry = oZipInputStream.getNextEntry()

                    If oZipEntry Is Nothing Then Exit While

                    If oZipEntry.isDirectory Then

                        If Not My.Computer.FileSystem.DirectoryExists(strDestinationDirectory & oZipEntry.getName) Then

                            My.Computer.FileSystem.CreateDirectory(strDestinationDirectory & oZipEntry.getName)

                        End If

                    Else

                        Dim oFileOutputStream As New java.io.FileOutputStream(strDestinationDirectory.Replace("\", "/") & oZipEntry.getName())

                        While 1 = 1

                            Dim iLen As Integer = oZipInputStream.read(sbBuf)

                            If iLen < 0 Then Exit While

                            oFileOutputStream.write(sbBuf, 0, iLen)

                        End While

                        oFileOutputStream.close()

                    End If

                End While

                oZipInputStream.close()
                oFileInputStream.close()

            Catch ex As Exception

                Throw New Exception(ex.Message)

            End Try

            Dts.TaskResult = ScriptResults.Success

        End Sub

    End Class


    Now only one thing needs to be done: add a reference to vjslib.dll (Visual J# Library):
    Add Reference

    &
    Library

    Your unzip solution is ready now! For testing purposes you can uncomment the following line in the script to see the file name of each processed zip file in a message box at runtime:

    'MsgBox("Current File: " & Dts.Variables("FileName").Value.ToString)

    MsgBox



    You can use this solution in many ways. For example, I used it in the solution below, where I download multiple zip files from an FTP server. These zip files contain CSVs that are used as a source for loading a data warehouse.
    Solution

  • SSIS - Lookup is case sensitive

    A while ago I figured out that the lookup transformation is case sensitive.
    I used a lookup to find dimension table members for my fact table records. This was done on a String business key like ‘AA12BB’. I attached a table to the error output and after running the package I found one record in this table. This record had a business key like ‘Aa12BB’. I searched the dimension table for this missing record and, to my surprise, it DID exist, but with the following business key: ‘AA12BB’. It seemed the lookup transformation is case sensitive. The next thing I tried was a T-SQL query in SQL Server 2005 Management Studio. In the WHERE clause I referred to the business key ‘Aa12BB’. The query returned the record with business key ‘AA12BB’. Conclusion: SQL Server (with its default, case-insensitive collation) is not case sensitive, but the SSIS lookup component IS case sensitive… Interesting.

    Solution:
    After some research I found a few solutions for this interesting feature of the lookup transformation. Before I explain these solutions you must know something about the inner working of the lookup component.

    A lookup transformation uses full caching by default. This means that the first thing it does on execution, is loading all the lookup data in its cache. When this is done it works as expected, but with case sensitivity.

    The first solution is to set the CacheType property of the lookup transformation to Partial or None; the lookup comparisons will then be done by SQL Server and not by the SSIS lookup component.
    Another solution is to format the data before you do the lookup. You can do this with the T-SQL LOWER() or UPPER() functions. These functions can be used in a query or, for example, in a Derived Column SSIS component, so that both sides of the lookup have the same casing.
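
    A minimal sketch of the second solution, assuming a hypothetical DimCustomer dimension with a BusinessKey and a CustomerKey column; upper-casing the key on both sides makes the cached comparison match regardless of the casing in the source:

    -- Query used inside the Lookup transformation (instead of selecting the table directly)
    SELECT UPPER(BusinessKey) AS BusinessKey
          ,CustomerKey
      FROM dbo.DimCustomer

    -- SSIS Derived Column expression applied to the incoming business key before the Lookup
    UPPER([BusinessKey])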

  • SSAS - Clear SSAS cache with an SSIS package

    Often I see developers on different forums asking how they can clear the SSAS cache. You can achieve this by restarting SSAS, which is done quite often.
    Restarting is not necessary though; it's possible to clear the cache of an SSAS database with an XMLA script. Use the script below and replace "YourSsasDatabaseId" with the ID of your SSAS database and "YourSsasCubeId" with the ID of your SSAS cube.

    <ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Object>
        <DatabaseID>YourSsasDatabaseId</DatabaseID>
        <CubeID>YourSsasCubeId</CubeID>
      </Object>
    </ClearCache>
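
    If you want to clear the cache of the entire SSAS database instead of a single cube, omitting the CubeID element from the script above should do the trick:

    <ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Object>
        <DatabaseID>YourSsasDatabaseId</DatabaseID>
      </Object>
    </ClearCache>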

    Please note that you must use the IDs of the database/cube and not the names! You can find the IDs of these objects in the properties window in BIDS or in SSMS. As you can see in the screenshot below, the cube/database name is not always the same as the ID. This particular cube has ‘Finance' as its name and ‘DW' as its ID!

    Cube Properties

    You can run this script manually from SSMS or automatically using SSIS. To run it from SSMS, right-click your SSAS database and choose New Query > MDX. Although this is an XMLA script and not MDX, it can be executed as an MDX query.

    Running this script from SSMS is useful, but I think most developers would want to clear the cache automatically. Fortunately SSIS has an ‘Analysis Services Execute DDL Task' that can execute this script!

    task1

    Just configure the task for your SSAS instance and copy/paste the script in the SourceDirect box as shown below.

    task2

    With the possibilities SSIS offers you can now clear the cache anytime you want. For example you could run it right after you have processed the cube. A lot of developers restart the server after processing but with this script that is no longer necessary!

    Thanks to Jamie Thomson for sharing this script on his blog!

  • SSAS - Speed up dimensions using a NULL default cube measure

    Recently I faced some problems with the performance of SSAS dimensions. The cube users were using a large dimension with more than 100.000 members that didn't perform well.
    They tried to add a leaf-level dimension member to an Excel 2007 pivot table. When dragging this member onto the rows they had to wait a very long time before the members were returned from SSAS and shown on the screen.

    After some mailing with Chris Webb he thought this could have something to do with the default cube measure. It seems that when you query dimension members without picking a measure, SSAS takes the first measure from the first measure group as its default measure. So even when you only query a dimension, SSAS is still using a measure!

    You can find out which measure SSAS will take with the following query:

    SELECT [Measures].DefaultMember ON 0 FROM [YourCube]

    In this case the default measure that SSAS picked came from a measure group that was not connected to the dimension that was giving the performance problems. This, plus the fact that returning NULL 100.000 times is faster than returning some big float number, explained the performance issue.

    Chris advised me to use a NULL calculation as the default measure, as he explains on his blog. The only problem here is that you can't select a calculation as the default measure in the cube properties in BIDS (only normal measures are allowed):

    Default measure cube properties

    Fortunately pasting this MDX statement in the calculations script (use the script view on the calculations tab) did the trick. Just paste it right under the CALCULATE command (or somewhere else):

    CREATE MEMBER CURRENTCUBE.MEASURES.UseAsDefaultMeasure
     AS NULL,
    VISIBLE = 1;

    ALTER CUBE CURRENTCUBE UPDATE DIMENSION Measures,
    DEFAULT_MEMBER = [Measures].UseAsDefaultMeasure;

    When you return to the form view it should look like this:

    Calculations Script with default measure

    The next time you deploy the cube, SSAS will use the [UseAsDefaultMeasure] calculation as its default measure.


    Test results:

    I now had a nice opportunity to test the cube performance with and without the NULL calculation as default cube measure. I cleared the cache before each query to get a good comparison.

    In SSMS I used the following query:

    SELECT [Measures].DefaultMember ON 0,
    [Dimension].[AttributeHierarchy].Members ON 1
    FROM [Cube]

    The results are shown in the bar-chart below:

    • The query with the NULL default measure took 13 seconds.
    • The query without the NULL default measure took 4 minutes and 35 seconds.
    Chart

    Conclusion:

    Using a NULL default measure can be much faster when querying dimension members without selecting a measure. In this particular case the result was returned more than 20 times faster.

    Thanks to Chris Webb for helping me out on this issue!

  • Start / Stop SQL Server, SSIS, SSAS, SSRS and SQL Server Agent at once with a batch file

    A lot of developers have SQL Server and its different services running on their PC or notebook to develop or test BI solutions. Unfortunately this slows down your system quite a lot. To speed things up when not using SQL Server, I used to stop and start each service manually quite often.

    Recently I found out that it's possible to start and stop all services at once with a simple batch file. It now only takes a couple of seconds instead of a few minutes and some annoying steps.

    Copy/paste the following into a .txt file and rename it to .bat to make it a batch file; execute it by double-clicking the file.

    START SCRIPT:

    NET START "SQL Server Agent (MsSqlServer)"
    NET START "MsSqlServer"
    NET START "MsSqlServerOlapService"
    NET START "ReportServer"
    NET START "SQL Server Integration Services"

    STOP SCRIPT:

    NET STOP "SQL Server Agent (MsSqlServer)"
    NET STOP "MsSqlServer"
    NET STOP "MsSqlServerOlapService"
    NET STOP "ReportServer"
    NET STOP "SQL Server Integration Services"

  • SSIS - Decrease your fact table loading time up to 40%

    Replace the multiple "Lookup Error Output" Derived Columns and Union All's with a single Derived Column and get a performance boost...

    Almost every BI developer needs to perform lookups while loading the data warehouse. When loading the fact tables, for example, lookups are used to retrieve the (surrogate) primary keys of the dimension tables that are connected to the fact table.

    I, and I think most developers, perform the DWH fact-dimension lookups in the following way:
    1. Each fact record contains business keys to the different dimensions.
    2. Each dimension record contains a business key and a surrogate key (integer).
    3. Each dimension contains one unknown member, with a surrogate key of value 0.
    4. Match the dimension business key in the fact record to the business key in the dimension to retrieve the dimension surrogate key, using a Lookup component. If the lookup finds no match, point to the unknown member in the dimension (surrogate key 0).
    5. Store the fact record in the DWH fact table with the retrieved surrogate keys of the dimensions.

    Loading a fact table in the way described above would typically look like this:


    I have tested this solution with 313.341 records. This took 1 minute and 23 seconds.
    This way of loading the fact table contains a lot of semi-blocking components: the Union Alls. Read more about semi-blocking components in this blog post I wrote a little while ago. These components cause an unnecessary negative impact on performance, as you can read there.

    There is a much more efficient way to load a fact table in a data warehouse:
    1. Set all the Lookup Error Outputs to "Ignore Failure".
    2. Delete all the Union All and Derived Column components.
    3. Add a Derived Column component to the end of the flow and add the unknown surrogate keys like this:

     der
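
    In that single Derived Column component, every surrogate key column coming out of the lookups is replaced with 0 when its lookup did not find a match. The expressions look something like this (CustomerKey and ProductKey are just example column names, not taken from the package above):

    Replace 'CustomerKey':  ISNULL(CustomerKey) ? (DT_I4)0 : CustomerKey
    Replace 'ProductKey':   ISNULL(ProductKey) ? (DT_I4)0 : ProductKey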

    The data flow now looks like this:

    df2

    The loading of exactly the same number of records (313.341) now took just 45 seconds! This is a performance boost of almost 40%. Loading your fact table like this does not only decrease your loading time, it also takes less development time. So a win-win situation!

    A minor downside might be that you can't easily see how many records failed a particular lookup component, but it's not so hard to make a simple (SSRS) report which gives you this overview. It might be a good opportunity to include the information of all your fact tables in one single report; it will give you a much better overview of lookup failures during the load of your fact tables than all the separate SSIS data flow tasks.

  • MCTS - I passed the 70-556 “Microsoft Office PerformancePoint Server 2007” exam!

    My last blog post is from a few months ago because I have been quite busy lately. Fortunately I have quite a few topics that I want to blog about, starting with this short post about the latest Microsoft certification I've gained.

    In March of this year I passed the 70-556 “Microsoft Office PerformancePoint Server 2007” exam. This makes me one of the 486 (September 2007) Microsoft Certified Technology Specialists worldwide! (Nr. of MCPs worldwide)

    It wasn't a tough exam compared to the previous exams I took. Time was no problem, and there were only 44 questions that were all of the multiple choice type where only 1 answer is correct. The exam contained almost no focus on PerformancePoint Planning, so make sure you focus on the Monitoring and Analyzing parts of PPS while studying for this exam.

    I used the Rational Guides to PPS books to study for the exam. Next to the books I followed a course and together this was enough to complete the exam successfully.

    Find more information here:
    Microsoft 70-556 Exam - MCTS Performance Point

    On the MCTS Exam for PerformancePoint Server 2007

    Preparation Guide for Exam 70-556

    My Microsoft Transcript is now:

    Microsoft Certification Status:

    Credential                                  Certification / Version                                                                   Date Achieved
    Microsoft Certified IT Professional         Business Intelligence Developer                                                           Mar 28, 2008
    Microsoft Certified Technology Specialist   Office PerformancePoint Server 2007, Applications                                         Jun 19, 2008
    Microsoft Certified Technology Specialist   Microsoft SQL Server 2005 Business Intelligence Development                               Oct 19, 2007

    Microsoft Certification Exams Completed Successfully:

    Exam ID   Description                                                                                 Date Completed
    556       TS: Microsoft Office PerformancePoint Server 2007, Application Development                  Jun 19, 2008
    446       PRO: Designing a Business Intelligence Infrastructure by Using Microsoft SQL Server 2005    Mar 28, 2008
    445       TS: Microsoft SQL Server 2005 Business Intelligence - Implementation and Maintenance        Oct 19, 2007

  • SSRS – Matrix that adds a new column each time 5 rows are filled with data

    What if you want a dynamic list of values in a matrix, but with a maximum of 5 rows? How do you create a matrix like this? I thought this would be an easy job, but I found out it was not that simple…

    I tried to create a matrix like this for a dynamic list of countries. In this blog I will explain how you can achieve this with a few simple steps.
    1. You need to create an MDX query (I used an SSAS data source) that returns the list of countries with a numbering (see the MDX sketch after this list):


    2. Next thing you need to do is create a matrix:

    3. Next and last thing you need to do is the following:

    • Use the following expression for the row group: =(Fields!Country_Number.Value - 1) Mod 5
    • Use the following expression for the column group: =Floor((Fields!Country_Number.Value - 1) / 5)
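
    Going back to step 1: a minimal sketch of the numbering query, assuming an Adventure Works-style cube with a [Customer].[Country] attribute (your own dimension and cube names will differ). The calculated Country_Number measure simply ranks each country within the full set of countries:

    WITH MEMBER Measures.[Country_Number] AS
        Rank([Customer].[Country].CurrentMember,
             [Customer].[Country].[Country].Members)
    SELECT { Measures.[Country_Number] } ON COLUMNS,
           [Customer].[Country].[Country].Members ON ROWS
    FROM [Adventure Works]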

    Result:

  • SSRS – Static column headers in a Matrix

    How do you create a static column header centered above your dynamic columns? One way to try to achieve this is to place a textbox above your dynamic columns. One thing is for sure: the textbox will never be perfectly centered, and what happens if the number of dynamic columns grows or shrinks?


    What you need to do is create a static column group. You do this by adding a new column group to the matrix and giving it a static expression, for example: =”static”
    Now make it the top group by clicking Up for the static column group on the Groups tab of the matrix’s properties. You can also achieve this by just dragging the column group up in the layout view.

    The result: a centered and perfectly aligned column header with the text “YTD” above some dynamic columns containing years:
