SQLblog.com - The SQL Server Blog Spot on the Web

Alberto Ferrari

  • Parent/Child Hierarchies in Tabular with Denali

    Some time ago, I wrote a post about how to handle Parent/Child hierarchies in PowerPivot 1.0. It was pretty intricate, because P/C hierarchies are not handled natively in PowerPivot. Now, with the advent of Denali, Microsoft has given us all the functions needed to handle P/C in a convenient way. Kasper has already written a nice post about P/C here. His post is very informative and I suggest you read it to look at all the P/C functions. In this post, I will need to repeat something he has already said but, hopefully, I will add some more insights into the DAX language by solving a small issue that still exists with Parent/Child hierarchies, at least in CTP3.

    Let us start with the data model. We have this data model (wow, take a look at how wonderful the new graphical view of tables in PowerPivot is… great job guys!):

    image

    where the Hierarchy table contains this unbalanced P/C hierarchy, using the classical ParentID data model:

    image

    In order to show this P/C data structure in a PivotTable, we need to understand how DAX implements P/C handling. It is based on some very powerful functions, the most important of which are:

    • PATH: computes the full path of a node following the ParentID column, starting from the root
    • PATHITEM: extracts an item from a path, as returned by PATH
    • LOOKUPVALUE: searches for a value inside a table, following a relationship that is valid only inside the formula

    Thus, by defining some simple calculated columns:

    • HierarchyPath = PATH([NodeId], [ParentNodeID])
    • Level1= LOOKUPVALUE ([Node], [NodeId], PATHITEM ([HierarchyPath], 1))
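    The remaining level columns follow exactly the same pattern; here is a sketch, assuming the three-level hierarchy shown above:

    • Level2 = LOOKUPVALUE ([Node], [NodeId], PATHITEM ([HierarchyPath], 2))
    • Level3 = LOOKUPVALUE ([Node], [NodeId], PATHITEM ([HierarchyPath], 3))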

    You get this result, which is self-explanatory:

    image

    In order to make the Parent/Child hierarchy work, we still need to create a user-defined hierarchy putting the various levels, from 1 to 3, inside a single hierarchy:

    image

    Et voilà, the work is done. We can put this hierarchy in a PivotTable, expand all the levels and we get this result:

    image

    There is both good news and bad news here.

    • The good news is that all the levels, once put into a hierarchy, behave exactly like a Multidimensional P/C.
    • The bad news is that if the hierarchy is a ragged one, as is often the case with P/C, Denali CTP3 still makes a load of empty nodes visible.

    SSAS handles this scenario with the HideMemberIf property of hierarchies, which hides useless nodes from the final result shown by the PivotTable. In CTP3, there seems to be no way to set the HideMemberIf property for hierarchies, leading to this unwanted behavior. Hopefully, in the final release of Tabular, HideMemberIf will be available (if you are interested in having it, please vote for this Connect item, to make the Microsoft guys aware of how useful this feature would be). If the item is resolved, then the next part of this post will be useless and life will be much easier. Thus, you might want to follow that item and check for its resolution.

    Nevertheless, if we want to mimic the HideMemberIf feature in CTP3, we still need to do some work with DAX. So, let us roll up our sleeves and start thinking about a viable solution.

    The key to hiding the unwanted rows is to note that:

    • We can define, for each node, its depth. Annabel and Bill have a depth of one, being root nodes. Catherine, being a child of Annabel, has a depth of two. Vincent, being a grandchild of Bill, has a depth of three. The depth can be defined at the row level of the Hierarchy table.
    • We can define, for each cell, a browse depth, which is the depth up to which the P/C hierarchy has been unfolded to make that cell visible.

    With these definitions in mind, our PivotTable can be seen in this way:

    image

    If we hide, from this PivotTable, all the rows where NodeDepth is greater than BrowseDepth, we will reach the desired result. In PowerPivot 1.0 this was a big pain in the neck, as can be seen in my previous post. In Denali CTP3, the solution is much easier and more elegant, because there is a very interesting function available in DAX: ISFILTERED.

    ISFILTERED takes a column reference and returns a Boolean value indicating whether the column is filtered in the current filter context or not. Now, it is easy to note that a cell with a BrowseDepth of 1 has a filter on the column Level1 and no filter on Level2 and Level3. A cell with a BrowseDepth of 2 will have a filter on Level1 and a filter on Level2, with no filter on Level3.

    Thus, we can define the BrowseDepth measure with this formula:

    BrowseDepth :=
        IF (ISFILTERED ('Hierarchy'[Level3]), 3,
        IF (ISFILTERED ('Hierarchy'[Level2]), 2,
        IF (ISFILTERED ('Hierarchy'[Level1]), 1
        )))

    This measure will compute, for each cell, the browse depth. We are still missing the node depth of each row. Another useful DAX function will help us: PATHLENGTH, which computes the length of a path column. We can add a calculated column NodeDepth=PATHLENGTH([HierarchyPath]) and get this:

    image

    Clearly, the NodeDepth column cannot be aggregated using SUM; we need to use MIN. Thus, we define a new measure, MinNodeDepth, which simply aggregates the NodeDepth column using MIN, as sketched below.
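    A minimal sketch of the measure, using the names defined above:

    MinNodeDepth := MIN ('Hierarchy'[NodeDepth])

    The final result, with all these measures in a PivotTable, can be seen here: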

    image

    The final touch is to make the Amount measure BLANK whenever BrowseDepth is greater than MinNodeDepth, with this definition:

    SumOfAmount := 
        IF (
            [BrowseDepth] > [MinNodeDepth], 
            BLANK(),
            SUM (Invoices[Amount])
        )

    All the measures in a PivotTable:

    image

    And, after removing all the technical measures, we get the final result, which is as nice as we wanted it to be:

    image

    Pretty easy, elegant and super fast: all of the requirements of DAX are met. :)

    I am confident that the SSAS team will take the time to let us use Parent/Child hierarchies with the HideMemberIf option. In the meantime, this is a viable solution and a good exercise to start getting acquainted with the new DAX functions.

    Last, but not least, in Tabular, P/C hierarchies are standard user hierarchies. Thus, you can have as many P/C hierarchies on a single table as you want. I know of many customers who will be happy to have more than one P/C hierarchy on the same dimension, without the need to define separate dimensions, as was the case with Multidimensional in SSAS.

  • PowerPivot, basket analysis and the hidden many to many

    Surfing the web, I came across this intriguing question (here):

    How do we ask something like "Show me how many customers have an iPad but don't have a book?"

    We are speaking about a sales analysis where the canonical entities are Customers, Products and Sales. Moreover, because the nature of AdventureWorks has had us speaking about bikes for many years now, the question can be posed in AdventureWorks’ language as:

    Of all the customers who have bought a mountain bike, how many have never bought at least one mountain tire tube?

    The marketing need to answer this question is evident: if my customers have bought a mountain bike, they need to buy tire tubes. If they don’t buy them from me, it means they are buying them somewhere else, and this is not good.

    There are some interesting points in this simple question that turn the scenario into a challenging one:

    1. We want to filter customers based on sales, and then perform further analysis on the behavior of these customers.
    2. The simple presence of “not” in the question makes it much more complicated, because we need to find the customers who did not buy tires, and this is not very intuitive to compute: we need to search for missing information.
    3. The question has been posed for a single product, but it is clear that the customer might want to perform the analysis on categories of products too. Moreover, he will probably want to be able to slice at the single product level.
    4. Finally, we need to find a generic solution that is easy to query with a PivotTable, because users like Excel.

    The scenario is an interesting one, definitely worth a post. What about the solution, is it interesting too? It is, for one simple reason: in order to solve it we will need to search for a many-to-many relationship where we are not used to finding one. Moreover, I love all the solutions that force me to think in DAX and, as we are going to see, the scenario is quite simple in DAX, while it would be much more complicated using any other tool. Well… beware that “quite simple” does not mean “simple”: the DAX code is a bit intricate but, hopefully, I am going to help you read it.

    A quick side note: this data model has been developed with PowerPivot but, as the new data modeling paradigm was disclosed just a few days ago under the name “tabular” in the BISM architecture, I will refer to this as a “tabular data model” to distinguish it from the “multidimensional data model”.

    In fact, this scenario is very similar to the survey data model described in “The Many-to-Many Revolution” by Marco Russo, even if the data structure is not so evident at first glance.

    Let us start speaking about the data model. It is a canonical star schema as taken from the AdventureWorks data warehouse:

    image

    Importing these tables (and their relationships) inside PowerPivot is a good first step, but soon we will discover that something is missing. We need two instances of the DimProduct table. One is needed to detect the customers who have bought one of the selected products, in our example the mountain bike. The other one will be needed to select the products we want to check, in our example the tire tubes. Thus, we will need to load DimProduct twice inside the tabular data model. Because the new instance of DimProduct will be used for filtering, we name it “ProductFilter”.

    Thus, the complete data model we are going to work with is the following:

    image

    This definitely looks like a strange diagram: the new ProductFilter table has no relationships with the fact table, nor with any dimension. In fact, if you think about it carefully, you’ll discover that we have no way to create a relationship between the fact table and the ProductFilter table, because the only available ProductKey in the fact table has already been used for the relationship with DimProduct. Tabular, among its limitations, can use a column for one relationship only. Nevertheless, we don’t care: the point is that that relationship is not needed, and we will write the DAX code without leveraging its existence.

    Before diving into the DAX solution, let us spend some time describing the desired result. We want to be able to produce a report like this:

    image

    In the report, we have selected four model types in the vertical slicer (mountain bikes from 100 to 500), which are the products we require the customer to have bought. Then we have selected two product models in the horizontal slicer (tire tubes and patch kits), which represent the products we want to check whether the customer has bought or not.

    On the rows we have put the FilterModelName and ProductName columns. On the columns, as usual, there is the time. The two measures shown are:

    • HavingProduct: counts the number of customers who bought a mountain bike and some tire tube or patch kit.
    • NotHavingProduct: counts the number of customers who bought a mountain bike but no tire tube nor patch kit.

    Due to the nature of the measures, we have decided to make them cumulative over time. This means that, in 2003, we count all the customers who bought a mountain bike before the end of 2003 and bought tire tubes before the same date. In 2004, we perform the same computation, thus including the customers of 2003. This can be changed, but it makes sense to compute the value this way.

    Thus, the PivotTable can be read as follows (first line): 396 (43+353) people bought a Mountain-100 bicycle before the end of 2003 and, of those, 43 bought either a mountain tire tube or a patch kit, while 353 of them did not buy tires or tubes (at least not from us). The same applies for 2004 and for all other cells.

    Ok, we have spent enough time describing the scenario; it is now time to write some DAX code. Because the final formulas turn out to be quite complicated, we will describe them step by step, trying to figure out the algorithm before giving the final formula.

    Let us start with the HavingProduct measure, which is somewhat simpler since it does not contain the “not” part.

    First of all, we need to compute the set of customers who have bought a mountain bike before the end of 2003. Let us focus on the worksheet: there is a slicer filtering model names on the ProductFilter table and we want to use this filter to detect the customers who have bought these products. Moreover, the presence of the year on the columns creates a filter on the time table, showing only the sales of 2003. The steps we need to carry out are two:

    • We need to push the filter from ProductFilter to the customers, using the fact table as a bridge.
    • We need to extend the filter context on time to cover everything before the end of 2003, since we want “before 2003” (see the fragment below).
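    The second step translates into a filter pattern that appears in every formula of this post: it replaces the filter on DimTime with all the dates up to the last date visible in the original filter context (MAX is evaluated in the outer context, i.e. the selected year, while ALL removes the year filter). This is the fragment, extracted from the formulas shown later:

    FILTER (
        ALL (DimTime),
        DimTime[TimeKey] <= MAX (DimTime[TimeKey])
    )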

    We also need to take into account the fact that the scenario is a bit more complicated than what we have described so far. In reality, we forgot to mention the horizontal slicer. Its presence means that the fact table is filtered by DimProduct too. That filter exists because there is a relationship between DimProduct and the fact table. Moreover, remember that our task is to push the filter on ProductFilter to DimCustomer, passing through the fact table, and there are no relationships between ProductFilter and the fact table. It turns out that the fact table, for our first task, is automatically filtered by the wrong product dimension.

    Now, with all these considerations in mind, the algorithm looks like:

    • For each customer
      • Check if there are sales
        • Of at least one of the products selected in ProductFilter (iterating them one by one, as the SUMX below does)
        • Anytime before the end of the currently selected year

    The key to the solution is to think of the fact table as a bridge table that implements a many-to-many relationship between customers and products. While a sales table is not usually intended to implement such a relationship, it is clear that the relationship holds. Thus, for each customer, we can filter the fact table showing only the sales of that customer, of the products selected in ProductFilter and of the dates before the end of the current year. If at least one row exists, it means that the customer has bought one of the ProductFilter products. But, remember, the fact (now bridge) table is already filtered by the wrong product dimension. Thus, we need to remove that filter and replace it with the one we want to use.

    Although not very easy, the formula that detects these customers should now be understandable:

    FILTER (
        DimCustomer,
        SUMX (
            ProductFilter,
            CALCULATE (
                COUNTROWS (FactInternetSales),
                ALL (FactInternetSales),
                FactInternetSales[CustomerKey] = EARLIER (DimCustomer[CustomerKey]),
                FactInternetSales[ProductKey] = EARLIER (ProductFilter[ProductKey]),
                FILTER (
                    ALL (DimTime),
                    DimTime[TimeKey] <= MAX (DimTime[TimeKey])
                )
            )
       ) > 0
    )

    If we use this FILTER inside a CALCULATE, we will be sure that only the customers who have bought one of the ProductFilter products will be visible. Please note that the inner CALCULATE needs to call ALL on the fact table to extend the filter context outside of the current year and remove the existing filter based on DimProduct (which is still active at that point). Having done that, we re-apply the filters on the customer, on the product (this time based on ProductFilter) and, finally, on the time (the fact table gets filtered through its relationship with DimTime).

    Before continuing to read, please focus on the SUMX (third row) and answer this simple question: “why is it there?”. I’m not going to give the answer; use it to check whether you got a good insight into the formula’s behavior. If not, read it again. :)

    As far as HavingProduct is concerned, the hard part ends here. The final formula of HavingProduct is simply a COUNTROWS (DISTINCT (…)) applied within this complex filter context:

    CALCULATE (
        COUNTROWS (DISTINCT (FactInternetSales[CustomerKey])),
        FILTER (
            ALL (DimTime),
            DimTime[TimeKey] <= MAX (DimTime[TimeKey])
        ),
        FILTER (
            DimCustomer,
            SUMX (
                ProductFilter,
                CALCULATE (
                    COUNTROWS (FactInternetSales),
                    ALL (FactInternetSales),
                    FactInternetSales[CustomerKey] = EARLIER (DimCustomer[CustomerKey]),
                    FactInternetSales[ProductKey] = EARLIER (ProductFilter[ProductKey]),
                    FILTER (
                        ALL (DimTime),
                        DimTime[TimeKey] <= MAX (DimTime[TimeKey])
                    )
                )
           ) > 0
        )
     )

    Nevertheless, before continuing, try to understand exactly which filter contexts are active in this formula, because we are using completely different filter contexts in different parts of the same formula, computing different values. In fact, the interesting part is understanding in which parts of the formula the DimProduct filter (the horizontal slicer) is still active.

    Now, after you have digested this first formula, it is time to think about the other one: the one implying the “not” part. If you got the insights of the first one, this formula should be pretty easy. It is enough to note that the customers who did not buy any of the products selected in DimProduct will simply have no sales for those products in the required period. Thus, using the fact table again as a bridge between customers and products, we can apply the same pattern one more time, reversing the condition.

    The formula for NotHavingProduct is very similar to the previous one, the only difference being in the first argument of the first CALCULATE:

    =CALCULATE (
        COUNTROWS (
            FILTER (
                DimCustomer,
                CALCULATE (COUNTROWS (FactInternetSales)) = 0
            )
        ),
        FILTER (
            ALL (DimTime),
            DimTime[TimeKey] <= MAX (DimTime[TimeKey])
        ),
        FILTER (
            DimCustomer,
            SUMX (
                ProductFilter,
                CALCULATE (
                    COUNTROWS (FactInternetSales),
                    ALL (FactInternetSales),
                    FactInternetSales[CustomerKey] = EARLIER (DimCustomer[CustomerKey]),
                    FactInternetSales[ProductKey] = EARLIER (ProductFilter[ProductKey]),
                    FILTER (
                        ALL (DimTime),
                        DimTime[TimeKey] <= MAX (DimTime[TimeKey])
                    )
                )
           ) > 0
        )
    )

    You should recognize the many-to-many pattern twice in this formula: once for ProductFilter (same as above) and once in the expression computed by the first argument of the outer CALCULATE. Please note that this time (in the first part of the formula) we had no need to use SUMX and, again, understanding why is one of the keys to understanding the formula. Here we leverage the automatic relationships, whereas before we had to manually implement a sort of “IN” operator, and the SUMX came from that need.

    Well, the work is done: the two formulas can be used as measures in a PivotTable and the user is free to slice using any attribute of the two product dimensions, going down to the customer level, if he wants to, with the usual blazing speed of DAX. And… if you think that these are long and complex formulas, please think about the length of your average SQL query: is it really shorter than these two formulas? Mine are not.
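    For the curious, here is a rough T-SQL equivalent of NotHavingProduct for a single cell. It is just a sketch, not tested: the model names and the OrderDateKey surrogate format are illustrative assumptions over the AdventureWorks-style schema shown above, and the slicers become hard-coded predicates:

    DECLARE @EndOfPeriod INT = 20031231;              -- "before the end of 2003"

    SELECT COUNT (DISTINCT S.CustomerKey)
    FROM FactInternetSales S
        INNER JOIN DimProduct PF ON PF.ProductKey = S.ProductKey
    WHERE PF.ModelName = 'Mountain-100'               -- the "must have bought" product
      AND S.OrderDateKey <= @EndOfPeriod
      AND NOT EXISTS (                                -- ...and no purchase of the checked product
            SELECT 1
            FROM FactInternetSales S2
                INNER JOIN DimProduct P ON P.ProductKey = S2.ProductKey
            WHERE S2.CustomerKey = S.CustomerKey
              AND P.ModelName = 'Mountain Tire Tube'
              AND S2.OrderDateKey <= @EndOfPeriod
          );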

    The post should end here, or it might end with an invitation to our PowerPivot Workshop, as I often do. But, this time, I want to end it in a different way, sharing with you the considerations I made while I was writing the post.

    Long before starting to think about this scenario, I already had a good knowledge of how to handle many-to-many relationships with DAX. Nevertheless, I had never thought about using the sales fact table as a bridge table, and the “not” part of the problem was completely new to me. I started to work on this data model and reached the final formulas in less than one hour. Looking at the final formula, I am sure that no Excel power user will be able to write such a formula on his own, so I agree that this is not self-service BI.

    But now we all know that DAX is the foundation of the tabular side of BISM. This means that BI consultants, with good data modeling skills and a good DAX knowledge, will be able to solve such a scenario for their customers in less than one hour, ending up with a working model the customer can play with. Not a prototype… a working model!

    I don’t know what your feelings about DAX, BISM, tabular and multidimensional are, but there is a fact that simply cannot be denied: DAX and the tabular data model are cool. Once you get used to thinking in DAX, formulas can be written at an amazing speed, and they end up working at an even more amazing speed. I think there is no doubt that customers will be happy to see what DAX can do for their needs, even if they will not be able to write the code by themselves. How many of your customers would be happy to pose you such a question and get the answer: “ok, give me an hour and I’ll bring you a working Excel workbook”?

    To come to an end, I think that the future of BI is brighter than ever, thanks to the effort of all the guys at Microsoft who spend their time figuring out what we will eventually need and who have decided to implement such a wonderful environment for BI, years before we even started thinking about it. I often blame the MS guys when something does not work as expected, but I am very happy to say that, whenever I am able to solve complex problems so quickly, I can think only one thing of them: “You guys rock, DAX rocks. Thank you and keep up the great work”.

  • PowerPivot: Counting active days

    Browsing the web, I saw an interesting question that is worth a post, since I think it is a very common pattern and a good example of “thinking in DAX”.

    Starting from a table that contains people and two dates (ActiveFrom, ActiveTo), we want to compute how many days an individual has been active over some period of time, let us say at the month level.

    As always, a picture is worth a thousand words, here is the input and the desired result:

    image

    The solution is very easy by means of a calendar table. To create one, it is enough to create an Excel table like the following one and use “Create Linked Table” from the PowerPivot tab of the ribbon.

    image

    Once linked into PowerPivot, you end up with two tables, which I called Activities and Calendar. Beware that there is no need at all to create relationships between the two tables: we will solve the scenario using DAX only. Moreover, there is no way to create such a relationship, since the dates in Activities represent a time period, not single dates.

    To reach the desired result, it is enough to note that, for each record in the Activities table, the number of rows in the calendar table that fall between ActiveFrom and ActiveTo represents the total number of active days. Thus, the formula is straightforward:

    =SUMX (
        Activities,
        CALCULATE (
            COUNTROWS (Calendar),
            Calendar[Date] >= EARLIER (Activities[ActiveFrom]),
            Calendar[Date] <= EARLIER (Activities[ActiveTo])
        )
    )

    This formula seems to always compute, for each row in the Activities table, the total number of active days for the individual, since it does not take years and months into account. Surprisingly, if you write this code in a measure and put it in a PivotTable, you will get the result shown at the beginning.

    The reason the formula works is that the inner CALCULATE creates a filter on Calendar[Date] but does not change the current filter context on year and month. Thus, for each cell, the filter on year and month is still active and produces the correct result. Moreover, the same formula works for any calendar period without any change.

    I guess that for PowerPivot newbies this behavior seems like magic; at least it was so for me a few months ago… in reality, understanding PowerPivot is just a matter of understanding row and filter contexts: once you get them, formulas are really simple to write.

    An interesting exercise, left to the reader, is to count the number of working days of activity. It is easier than you might think and surely worth spending some time on if you want to learn some DAX data modeling. :)
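    If you want a hint, here is one possible starting point: a sketch that assumes you add an IsWorkingDay column to the Calendar table (a hypothetical column, to be filled in the linked Excel table):

    =SUMX (
        Activities,
        CALCULATE (
            COUNTROWS (Calendar),
            Calendar[IsWorkingDay] = TRUE (),
            Calendar[Date] >= EARLIER (Activities[ActiveFrom]),
            Calendar[Date] <= EARLIER (Activities[ActiveTo])
        )
    )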

  • PowerPivot: improve the performances by sorting tables

    While it is well known that in UDM data should be loaded sorted, to get better file compression and map structures, the world of PowerPivot (and BISM) is not yet explored. Working with a 50 million row table, I tested various scenarios to check whether sorting affects PowerPivot data structures too.

    The fact table structure is very simple:

    CREATE TABLE Fact_Transaction (
          ID_Account INT NOT NULL ,
          ID_Type INT NOT NULL ,
          ID_Date INT NOT NULL ,
          Amount MONEY NOT NULL
    ) 

    There are 240,000 accounts, 1,000 dates and 4 types. Thus, ID_Account is the most selective column, followed by the date and the type. I tried loading data inside PowerPivot using different sort orders; the final workbook sizes are in the next table:

    Order                  File Size (KB)
    HEAP (unsorted)        498,905
    Type, Date, Account    372,505
    Date, Type, Account    372,209
    Account, Date, Type    374,001

    Thus, the compression algorithm in PowerPivot seems to prefer sorted data (which is somewhat expected, even if not obvious). The big difference is between sorted and unsorted data, while changing the sort order does not affect the final size in an evident way. This is definitely expected, since the ratio between transactions (50 million) and accounts (240,000) is very high, resulting in good runs of sorted values even with the most selective column. Different distributions in real-world scenarios might lead to different results and are definitely worth trying.

    What is interesting is that the query speed of the final workbook is much better with the smaller ones, compared with the biggest, i.e. the unsorted one. This might indicate that it is not just a matter of workbook size, but also of internal data structures that end up in a better optimized format when data is fed sorted to PowerPivot. I don’t have clear numbers here, but the difference is evident at first glance when pivoting over the data model.

    The final hint is straightforward: if you use a heap, or a table with an INT IDENTITY column, for your fact table, then it might be useful to check different sort orders and (as always) verify the final result. It might be the case that you are wasting memory and CPU due to a poorly designed data structure or, in other words, that there is room for improvement.
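    In practice, feeding PowerPivot sorted data just means adding an ORDER BY to the query used to import the table; a minimal sketch, using the best-performing order from the test above:

    SELECT ID_Account, ID_Type, ID_Date, Amount
    FROM Fact_Transaction
    ORDER BY ID_Type, ID_Date, ID_Account;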

  • PowerPivot: remove useless columns!

    I always teach my students to remove all useless columns from PowerPivot data models, because they simply waste precious memory. While I was working on a test model with 50 million rows, I made the naive mistake of loading the complete table into PowerPivot. Then I saved the file to check the compression of Vertipaq and widened my eyes when I saw a SQL table of 1.6 GB resulting in a workbook of 1.3 GB. It seemed to me that data compression was simply not working.

    Looking closer, I discovered that the fact table contained a PK INT IDENTITY column, which I had stupidly loaded into the PowerPivot data model. Simply removing that column (which has 50 million distinct values) resulted in a 490 MB workbook, i.e. about 1/3 of the original size. Much faster to open, save and work with.

    The lesson to remember is: SELECT * is evil in the world of SQL and is still evil in the world of PowerPivot. Always check for the presence of useless columns since, when present in a big fact table, they can really change the user experience with the workbook. Needless to say, publishing such a workbook on SharePoint and wasting 1 GB of the server’s RAM is a crime that should be properly punished! :)
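    In practice, this just means listing the columns explicitly in the import query; a sketch based on the fact table of the previous post, assuming its surrogate key is called ID_Transaction (a hypothetical name):

    -- instead of SELECT *, which drags the IDENTITY key into the model:
    SELECT ID_Account, ID_Type, ID_Date, Amount   -- no ID_Transaction here!
    FROM Fact_Transaction;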

  • A new SQL Server MVP from Italy

    Yep… it’s me. :)

  • Great post from Jeffrey Wang about DAX

    In case you missed it, this post from Jeffrey about DAX and the way filter contexts work is simply awesome!

    The way he explains the filter context internals is great: deeply technical and yet easy to understand. Very strongly suggested as a good read for anybody serious about DAX, which simply means anybody working with the MS BI stack.

  • Visio Forward Engineer is on Codeplex

    As you might already know, I have written an add-in for Microsoft Visio 2010 that makes it possible to forward-engineer database diagrams to SQL Server. A huge number of people downloaded it and started asking for new features, as is always the case with useful projects.

    That said, I have neither the skills nor the time to follow a software project but, luckily, a very smart guy, Mario Majcica, dedicated his time to publishing the project on CodePlex here: http://forwardengineer.codeplex.com/. He is working on a better installation process, bug fixes and many exciting new features, and he is searching for skilled programmers to help with the development.

    If you want to download the latest release of the add-in, or have some spare time to dedicate to the project, helping it grow into a full-featured add-in that solves the well-known issue of Visio not being able to create databases, go to CodePlex and have fun. I am going to become a happy user of the add-in, leaving the development in better hands. After all, I am a BI guy. :)

  • Thinking in DAX: Counting Products in the Current Status with PowerPivot

    One of my readers came to me with an interesting formula to compute in PowerPivot. Even if I don’t normally post about very specific scenarios, I think this time it is worth a blog post, since the formula can be easily created if you think about it in DAX, while it is very hard if you approach it with an MDX or SQL mindset. Thinking in DAX is something that comes after a lot of formula authoring, something that all BI professionals should strive for, as Vertipaq in the new SSAS engine is approaching pretty fast!

    The scenario is pretty simple. A processing facility handles items (A, B, C) and, during the working process, the status of a product changes. Each time a product changes its status, a new row is added to a logging table, along with a value indicating the cost of the transformation. The raw data can be simulated with this table:

    image

    You can see, from the highlighted cells, that product A changed its status twice on the first of January: first it became an Apple, and then an Orange (don’t ask me why my reader decided to use fruits to model statuses, this is beyond my comprehension :) but I still find it amusing, so I have not changed it).

    Now, with this set of data, the requirement is to compute, at the end of each day, how many products are in a specific status. Thus, product A, at the end of the first of January, was an Orange, even if it had been an Apple for some hours. Moreover, it should be noted that A keeps the status of Orange during the period from the first to the fifth of January, even if no more rows were added to the table. No new rows, in this case, means “keep the last status”.

    When I received the workbook for some testing, it contained two tables: a pretty standard calendar table and the History table, as shown in the previous picture. Moreover, a relationship existed between the two tables, based on the date.

    In such a scenario, the relationship is not only incorrect, it is misleading. From the functional point of view, there is no relationship between the fact table and the calendar table. For example, on the second of January there is no data in the fact table, but all of the measures should be computed taking into account the dates before the current one. Thus, the first step I took was to delete the relationship, so that I would leverage DAX only to solve the scenario. PowerPivot will keep asking for permission to auto-detect relationships… disable that feature: we know that, in this scenario, we don’t want to leverage automatic relationships.

    Now, to solve the scenario we use a variation of the SCD handling formulas I provided in a previous post. In fact, the table looks very similar to a slowly changing dimension but, this time, we want to compute the last status of a product at a given point in time, where the point in time is not fixed but changes following the filters imposed by the PivotTable. In other words, we are going to compute the SCD status of a product at a variable point in time.

    Please stop one second and think twice about what I have just said, since this is where “DAX thinking” comes into play. In this data model we don’t have a Product dimension, nor do we have any canonical SCD handling. Nevertheless, by leveraging the tremendous speed of the Vertipaq engine, we are going to create a sort of “running SCD” on the fly, during the formula computation. This kind of computation is something that you would never try with MDX or SQL; it is a feature that can be leveraged only with DAX, where heavy leaf-level computation is carried out at a blazing speed.

    Now, it is time to show the final result and the formula, which I will describe in a few moments:

    image

    The result is straightforward: the grand total always shows three products (A, B and C), while the counts in the status columns show the number of products that, at the end of that day, are in the given status. You can see that data is shown even for days where no fact table data exists, since the formula leverages previous information to compute the SCD status at that day.

    Now the formula:

    =CALCULATE (
        COUNTROWS (
            FILTER (
                HistoryT,
                HistoryT[Id] = 
                    CALCULATE (
                        MAX (HistoryT[Id]), 
                        ALL (HistoryT),
                        HistoryT[ItemId] = EARLIER (HistoryT[ItemId]),
                        HistoryT[ShortDate] <= VALUES (Dates[Date])
                  )
            )
        ),
        LASTDATE (Dates[Date])
    )

    Let us spend some words to describe its behavior.

    First of all, we use CALCULATE to introduce a filter context that shows, from the calendar table, only the last available date, using LASTDATE. This is useful if the PivotTable does not filter a single date but a range of them (as I am doing with the slicer, which filters only the first fifteen days of the month). In that case, we use the last available date of the selected set of dates to perform the computation.

    Inside that filter context, we use the FILTER function to recreate the SCD status at that date. Let us see how we manage this:

    We FILTER the History table, removing all the rows that have an ID earlier than the last ID for that specific product, using an inner CALCULATE. Let us check the filters inside the inner CALCULATE:

    • ALL (HistoryT) is necessary since, in the current filter context, only the products with a specific status are shown. For example, during the computation of <1/1/2020, Apple>, only products with the “Apple” status are visible, while we want to check whether the product has changed status. Thus, the ALL filter removes the condition on the current status, showing all the products.
    • HistoryT[ItemId] = EARLIER (HistoryT[ItemId]) filters all the instances of the current product, where “current” means the current iteration of the row context introduced by FILTER.
    • HistoryT[ShortDate] <= VALUES (Dates[Date]) filters all the rows that come before the current date in the calendar table.

    Now, it is important to remember that these conditions are ANDed together and that the iteration of FILTER produces a table which is, in the end, the current SCD status of the History table at the last date in the calendar selection. It is enough to count the number of rows of this table to get the number of products that are in the selected status. Always remember that the “selected status” is injected into the inner CALCULATE due to the presence of a filter context on the status, which has not been removed and, thus, is still working, filtering only the rows of the SCD that have the correct status.

    Working directly with the fact table, this formula might not be a top performer on a really huge table. If we need to improve performance, then some more data modeling should be applied, normalizing the product table or performing similar tasks. Nevertheless, what is important to note here is that DAX lets you perform very complex computations without any ETL or modeling: just plain DAX.

    If you think about how to solve the same scenario with MDX and… well, I would not even try it. UDM is great, but it has a rigid structure and this scenario is not easy to solve without a deep data modeling change. On the other hand, DAX lets you shape the data on the fly and provides a very elegant and neat solution to a complex scenario like this one. The only problem is that, to solve the scenario, you need to think in DAX.

    This is just another example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops Marco Russo and I are bringing all over Europe in the next few months.

    You can find more info on www.powerpivotworkshop.com. Hope to see you there!

  • PowerPivot, Parent/Child and Unary Operators

    Following my last post about parent/child hierarchies in PowerPivot, I worked a bit more to implement a very useful feature of Parent/Child hierarchies in SSAS which is obviously missing in PowerPivot: unary operators. A unary operator is simply the aggregation operator used to aggregate the values of children over their parent.

    Unary operators are very useful in accounting, where you might have incomes and expenses in the same hierarchy and, at the total level, you want to subtract expenses from incomes in order to get a meaningful total. Clearly, you might obtain a good result by adding a sign in front of the values in the fact table but, doing so, you miss the opportunity to create more than one hierarchy over the same data, where values might get different signs in different hierarchies. For example, I might have a hierarchy where expenses are subtracted from the total and another one, with a different structure, where the expenses need to be summed with other values. Thus, I assume that signing the values in the fact table is not a viable solution and I want to put the signs into the hierarchy.

    Implementing unary operators with all four mathematical operations is not easy at all. It can be done, but with quite a big effort and some severe limitations that I am not going to describe here. Moreover, in my personal experience I have never used either multiplication or division in any hierarchy; I have always used sum and subtraction only. Thus, I am going to focus on these two basic operations, leaving the four operations to a later post, if somebody asks me to. :)

    Let us start with the table structure for the demo. We have an Accounts table, which holds the hierarchy, and a Transactions table with the transactions. All amounts in the transactions table are positive numbers.

    CREATE TABLE Accounts (
        AccountId         INT, 
        Account           VARCHAR (100),
        ParentAccountId   INT,
        Aggregator        VARCHAR(1)
    )
    
    CREATE TABLE Transactions (
        AccountId    INT,
        Description  VARCHAR (100),
        Amount       MONEY
    )
    

    Having the two tables, we can now fill them with some data:

    INSERT INTO Accounts (AccountId, Account, ParentAccountId, Aggregator)
    VALUES  
            ( 1, 'Final Total',       NULL, '*'), -- Grand Total (the star avoids a blank, see below)
            ( 2, 'Revenues',             1, '+'), --   ADD Revenues to Grand Total
            ( 3, 'Courses Provided',     2, '+'), --     ADD Courses to Revenues
            ( 4, 'Instructors paid',     3, '-'), --       SUBTRACT paid instructors from Courses
            ( 5, 'Catering',             3, '-'), --       SUBTRACT catering from Courses
            ( 6, 'Attendees rates',      3, '+'), --       ADD attendee rates to Courses
            ( 7, 'Consultancy',          2, '+'), --     ADD Consultancy to Revenues
            ( 8, 'Expenses',             1, '-'), --   SUBTRACT Expenses from Grand Total
            ( 9, 'Travel',               8, '+'), --     ADD Travel to Expenses
            (10, 'Travel refund',        9, '-'), --       SUBTRACT travel refunds from Travel
            (11, 'Travels paid',         9, '+'), --       ADD travels paid to Travel
            (12, 'Courses Taken',        8, '+'); --     ADD courses taken to Expenses
    
    INSERT INTO Transactions VALUES 
        ( 4, 'Marco Russo',         200),
        ( 4, 'Rob Collie',          500),
        ( 5, 'Sandwiches',          300),
        ( 6, 'Course in Seattle',   800),
        ( 6, 'Course in Boston',   1200),
        ( 7, 'Work in Microsoft',   400),
        ( 7, 'Work in Apple',       500),
        ( 7, 'Work in SQL Blog',    300),
        (10, 'Travel to Seattle',    80),
        (11, 'Travel to Boston',    150),
        (11, 'Travel to London',    190),
        (12, 'SSAS Maestros',      1000);
    
    

    Take a look at the Accounts table, where I provided some descriptions of the operations. You can see, for example, that account number 4 (Instructors Paid), even if it belongs to Revenues, needs to be subtracted from there because it is, in reality, an expense. The same applies to account number 10 (Travel Refund), which belongs to an expense (Travel) but must be subtracted from it. Moreover, since Expenses is subtracted from the grand total, it happens that Travel Refund needs to be summed to the grand total and subtracted from expenses. As you can see, an account can change its sign during the hierarchy navigation, which is a perfectly normal and desired behavior. Nevertheless, we need a way to handle this.

    You should already know, from my previous post, that the hierarchy needs to be flattened in order to work with PowerPivot. I am not going to bore you again with the flattening technique; here I am focusing on the unary operators only.

    The idea is that we need to compute the sign that each account should have at the different levels of the hierarchy. Let us take, as an example, account number 10: Travel Refund. It is a leaf node in the hierarchy at level 4. Let us look at how we should aggregate its value at the various levels:

    • At level 4 (itself) it has a minus, to show that its value will be subtracted from Travel
    • At level 3 (Travel) it keeps the minus, since its value is subtracted from Travel
    • At level 2 (Expenses) its sign becomes a plus, since it is aggregated into Expenses, whose value is then subtracted from the grand total. Having traversed two minus signs, its sign is now a plus.
    • At level 1 (Final Total) it remains a plus, since the grand total does not change the sign of the last aggregation (which was Expenses)

    You see that at each level of the hierarchy an account might be summed or subtracted, depending on the number of minus signs encountered along the “path” from the leaf node to the final total. With this basic idea in mind, it is now clear that we need to compute the “sign path” of each node and, from there, the sign of the account at the different levels of the hierarchy.

    This is easily accomplished with the next query:

    WITH AggPaths as (
        SELECT 
            AccountID   = A.accountID, 
            Account     = CAST (A.Account AS VARCHAR (100)),
            NodeDepth   = 1,
            AggPath     = CAST (a.aggregator AS VARCHAR (100)) 
        FROM 
            Accounts a where ParentAccountId IS NULL
        UNION ALL
        SELECT 
            AccountID   = Children.accountID, 
            Account     = CAST (REPLICATE (' ', Parent.NodeDepth * 4) + Children.Aggregator + ' ' + Children.Account AS VARCHAR (100)),
            NodeDepth   = Parent.NodeDepth + 1,
            AggPath     = CAST (Parent.AggPath + Children.aggregator AS VARCHAR (100)) 
        FROM Accounts Children
            INNER JOIN AggPaths Parent on Children.ParentAccountId = Parent.AccountID)
    SELECT 
        AggPaths.*, 
        SignAtLevel1 = CASE WHEN NodeDepth < 1 THEN NULL WHEN dbo.COUNTMINUS (SUBSTRING (AggPath, 1, 4)) % 2 = 0 THEN 1 ELSE -1 END,
        SignAtLevel2 = CASE WHEN NodeDepth < 2 THEN NULL WHEN dbo.COUNTMINUS (SUBSTRING (AggPath, 2, 4)) % 2 = 0 THEN 1 ELSE -1 END,
        SignAtLevel3 = CASE WHEN NodeDepth < 3 THEN NULL WHEN dbo.COUNTMINUS (SUBSTRING (AggPath, 3, 4)) % 2 = 0 THEN 1 ELSE -1 END,
        SignAtLevel4 = CASE WHEN NodeDepth < 4 THEN NULL WHEN dbo.COUNTMINUS (SUBSTRING (AggPath, 4, 4)) % 2 = 0 THEN 1 ELSE -1 END
    FROM AggPaths
    ORDER BY AccountID
    

    I have used a function, CountMinus, which counts the number of minus signs in a string and whose definition is simply:

    CREATE FUNCTION CountMinus (@Param VARCHAR (100)) 
    RETURNS INT AS
    BEGIN
        RETURN LEN (@Param) - LEN (REPLACE (@Param, '-', ''));
    END
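
    A quick sanity check of the function:

    SELECT dbo.CountMinus ('-+-');  -- returns 2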
    

    The result of the query is this:

    image

    You can see that the AggPath column contains the “sign path” of each account, and the various SignAtLevel columns contain the sign that the account should have when aggregated at the various levels. At the root level I used a star to avoid having a blank, which would return incorrect values for SUBSTRING.

    Using the query and the original table, joined together, you can easily load the data inside PowerPivot and start some analysis. Now, to show the values in a PivotTable, you can choose between two different methods:

    • Use, at each level, the sign that the value should have when aggregated with its parent
    • Always use the sign that the node will have when shown at level 1

    The difference between the two visualizations is evident in the next figure:

    image

    In “Value” I used the first technique. You can see that “Travel refund” has a minus sign, indicating that it will be subtracted from Expenses. The same account, in the other measure, has a plus sign, indicating that its value will be summed to the final total. Both pieces of information are correct; it all depends on how the user is used to looking at numbers. I personally prefer the visualization of “FinalValue”, since my brain recognizes it as correct, while I refuse to understand the other one. Nevertheless, it is a matter of taste: I have met plenty of accounting managers who strongly prefer the other visualization.

    Now, how do we compute the two formulas? It is pretty easy, indeed. It is enough to note that, in the filter context of each cell, we are browsing the hierarchy at a well-defined and known level. We simply need to sum all the amounts with a plus (at that level) and subtract all the amounts with a minus (again, at that level). Thus, the formula for “FinalValue” is:

    =IF ([ShowRow] && [CurrentLevel] > 0,
            CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel1] = +1) 
          - CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel1] = -1),
        BLANK ()
    )

    Here I make use of a couple of measures: [ShowRow], which returns true/false depending on whether the row should be visible or not, and [CurrentLevel], which computes the current level of visualization. See my previous post to learn how they can easily be defined. Apart from those two measures, the formula is simply the subtraction of two CALCULATEs, which impose the correct filtering on the accounts, separating them into two distinct groups: the ones that need to be summed and the ones that need to be subtracted. It might be worth noting that the CALCULATE operates in a pre-existing filter context where all and only the children of the current node are visible, and the new filter added to the context does not interfere with that selection, since it operates on a different column.

    The formula for “Value” is very similar, just a bit tedious to write:

    =IF ([ShowRow],
        IF ([CurrentLevel] = 4, 
            CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel4] = +1) 
          - CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel4] = -1),
        IF ([CurrentLevel] = 3, 
            CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel3] = +1) 
          - CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel3] = -1),
        IF ([CurrentLevel] = 2, 
            CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel2] = +1) 
          - CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel2] = -1),
        IF ([CurrentLevel] = 1, 
            CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel1] = +1) 
          - CALCULATE (SUM (Transactions[Amount]), Accounts[SignAtLevel1] = -1),
        BLANK ()
    )))))

    Here we perform the same computation but, instead of always using SignAtLevel1, we use the correct sign for each level, resulting in different signs at different levels. As I said before, it is a matter of personal taste.

    Please note that, for performance reasons, I preferred to use CALCULATE with a filter on the sign columns where I could have used a more intuitive SUMX over the accounts, multiplying each resulting value by the correct sign. On a big hierarchy with a big fact table, SUM with CALCULATE should perform much better than SUMX, requiring fewer iterations over the fact table to gather the final result.
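    For reference, here is a sketch of that more intuitive (but presumably slower) SUMX version of “FinalValue”, under the same [ShowRow] and [CurrentLevel] conditions shown above:

    =IF ([ShowRow] && [CurrentLevel] > 0,
        SUMX (
            Accounts,
            CALCULATE (SUM (Transactions[Amount])) * Accounts[SignAtLevel1]
        ),
        BLANK ()
    )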

    This very same technique can be used to compute a hierarchy with multiplications and divisions only. It cannot, however, handle the four operations together because, in that case, we would need to follow precedence order during computation, something that we can ignore with any set of operators that obey the commutative law, as we are doing here.

    The biggest difference between unary operators as defined in this post and the same feature in SSAS is that, in SSAS, the aggregation over a hierarchy with unary operators is always applied to the final result. Thus, once you define the hierarchy in SSAS and ask for the total amount, the result is computed according to the hierarchy, while in PowerPivot this is not the case: the result is computed with the signs only when the hierarchy is displayed; otherwise SUM is used. I personally prefer the PowerPivot method since, if you have more than one hierarchy with unary operators, SSAS becomes terribly slow, due to the need to perform very complex computations. In PowerPivot you are free to define as many hierarchies as you want and only pay the CPU price when you aggregate over one of them.

    Clearly, if you want to handle more than one hierarchy, it is strongly advisable to use a separate table for each hierarchy; otherwise, the formulas would become a real pain to write, since they would need to take care of more than one hierarchy in the same table.

    This is just another example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops Marco Russo and I are bringing all over Europe in the next few months.

    You can find more info on www.powerpivotworkshop.com. Hope to see you there!

  • Microsoft BI Conference 2011 in Lisbon

    Anyone interested in BI from Portugal or Spain should not miss the Microsoft BI Conference 2011 in Lisbon: one full day (March 25, 2011) with three tracks on Business Intelligence:

    • Decision Makers
    • BI pros
    • Intro to BI.

    I am going to present two sessions on PowerPivot: one is a nice deep dive into DAX for BI pros, the other is about self-service BI for decision makers. Titles and the complete agenda will be published in the next few days, but I suggest you save the date.

    The full event is free and it seems to me a great opportunity to meet readers of my blog face to face and have a good time speaking about our common passion!

  • PowerPivot and Parent/Child hierarchies

    Does PowerPivot handle Parent/Child hierarchies? The common answer is “no”, since it does not handle them natively. During the last PowerPivot course in London, I was asked the same question once more and had an interesting discussion about this severe limitation of the PowerPivot data modeling and visualization capabilities. On my way back to Italy, I started thinking about a possible solution and, after some work, I managed to make PowerPivot handle Parent/Child hierarchies in a very nice way, indistinguishable from SSAS. This post explains the technique.

    Before starting the real topic of the post, I need to say a couple of things:

    1. This is definitely NOT self-service BI. I am going to adopt advanced data modeling techniques and, to fully understand how the formulas work, you need a good understanding of DAX behavior in terms of filter contexts. From the SQL point of view, on the other hand, only a basic understanding of CTEs is required.
    2. After one day of hard work on this topic, my desk was covered with sheets of paper full of doodles and the test workbook contained an insane number of non-working formulas. It was a phone call with Marco Russo that really turned on the light for the solution; we definitely work well as a team! Thus, I share the merit of this post with Marco, and I feel the need to make that clear.

    And now, let’s get back to business, and start the topic.

    Let us suppose that we have a hierarchy of sales persons, like this:

    image

    This hierarchy has two root nodes and then some children, with a maximum depth of three levels. Each individual in this chart produces sales, stored in an invoices table that contains a reference to the individual. Data is stored in the classical way inside two tables: Tree contains the hierarchy and Invoices contains the invoices. The table structure is the following:

    CREATE TABLE Tree (
        NodeID INT,
        Node VARCHAR (100),
        ParentID INT)
    
    CREATE TABLE Invoices (
        NodeID INT, 
        InvoiceID INT, 
        Amount INT, 
        City VARCHAR (100)
    )

    Once you have created the tables, you can populate them with sample data using this script, which creates the hierarchy and puts some values in the invoices:

    INSERT INTO Tree
    VALUES  ( 1, 'Bill',       NULL),
            ( 2, 'Brad',       1),
            ( 3, 'Julie',      1),
            ( 4, 'Chris',      2),
            ( 5, 'Vincent',    2),
            ( 6, 'Annabel',    NULL),
            ( 7, 'Catherine',  6),
            ( 8, 'Harry',      6),
            ( 9, 'Michael',    6);
    
    INSERT INTO Invoices
    VALUES (2,  3, 200, 'Chicago'),
           (2,  4, 200, 'Seattle'),
           (3,  5, 300, 'Chicago'),
           (4,  6, 400, 'Seattle'),
           (5,  7, 500, 'Chicago'),
           (6,  8, 600, 'Seattle'),
           (7,  9, 600, 'Seattle'),
           (7, 10, 600, 'Chicago'),
           (8, 11, 400, 'Chicago'),
           (8, 12, 400, 'Seattle'),
           (9, 13, 300, 'Chicago'),
           (9, 14, 300, 'Seattle');

    Before explaining the technique, let us take a look at the final result. We want to obtain this PivotTable:

    image

    The notable aspects to look at are only two:

    • There are three levels in the hierarchy but, since the first tree, starting with Annabel, has a maximum depth of two, we don’t want to see the third level, which does not contain useful information.
    • The total of each node is the sum of all of its children plus the value produced by the node itself. If you look, for example, at the value of Brad in Chicago, it is 700.00, even if Vincent has produced 500.00. This means that 200.00 has been produced by Brad himself.

    We are going to reach the final result step by step, starting from a very basic visualization and then solving the various issues as they become evident.

    If we load the two tables inside PowerPivot, set up the relationships and create a PivotTable, we end up with this:

    image

    The totals are correct, but they don’t take the hierarchy into account: they represent only the total sales of each individual. In other words, PowerPivot has no knowledge at all of the existence of the self-relationship and behaves in the classical way, aggregating values by node. In order to create a hierarchy, we need to add columns to the Tree table that let us slice data following the Parent/Child hierarchy. There is a well-known technique for this, called “Naturalized Parent/Child”, which requires us to:

    • Define the maximum depth of the hierarchy. Since in our case the maximum depth is three levels, we can leave some free space: let us say that five levels are good.
    • Create a column for each level which contains the parent of the node at that level.

    Since this is a classical technique, I am not going to explain it in full detail. The result can be obtained with a SQL query that uses recursive CTEs, like this one:

    DECLARE @KeyLength INT = 3;
    
    WITH 
        -- Recursively build, for each node, the full path from the root as a
        -- string of fixed-length, zero-padded keys, each preceded by a space
        FindNodePaths AS (
            SELECT 
                NodeId    = NodeID, 
                NodePath  = CAST (' ' 
                                  + RIGHT (REPLICATE ('0', @KeyLength) 
                                  + CAST (NodeID AS VARCHAR (100)), @KeyLength) AS VARCHAR (100))
            FROM Tree WHERE ParentID IS NULL
            UNION ALL
            SELECT 
                NodeId    = Tree.NodeID, 
                NodePath  = CAST (NodePath + ' ' 
                                  + RIGHT (REPLICATE ('0', @KeyLength) 
                                  + CAST (Tree.NodeID AS VARCHAR (100)), @KeyLength) AS VARCHAR (100))
            FROM Tree
                INNER JOIN FindNodePaths AS Parent ON Parent.NodeID = Tree.ParentID
        ),
        -- Slice the path into one ancestor key per level (empty beyond the node depth)
        ComputeParentNodeIdAtLevels AS (
            SELECT 
                NodeID                = NodeID,
                NodePath              = NodePath,
                ParentNodeIdAtLevel1  = SUBSTRING (NodePath, 2, @KeyLength),
                ParentNodeIdAtLevel2  = SUBSTRING (NodePath, (@KeyLength + 1) * 1 + 2, @KeyLength),
                ParentNodeIdAtLevel3  = SUBSTRING (NodePath, (@KeyLength + 1) * 2 + 2, @KeyLength),
                ParentNodeIdAtLevel4  = SUBSTRING (NodePath, (@KeyLength + 1) * 3 + 2, @KeyLength),
                ParentNodeIdAtLevel5  = SUBSTRING (NodePath, (@KeyLength + 1) * 4 + 2, @KeyLength)
            FROM FindNodePaths
        )
    -- Translate the ancestor keys into node names, one column per level
    SELECT 
        NodeId   = C.NodeId,
        Node     = T.Node,
        Level1   = T1.Node,
        Level2   = T2.Node,
        Level3   = T3.Node,
        Level4   = T4.Node,
        Level5   = T5.Node
    FROM 
        ComputeParentNodeIdAtLevels C
            LEFT OUTER JOIN Tree T  ON T.NodeID = C.NodeID
            LEFT OUTER JOIN Tree T1 ON T1.NodeID = C.ParentNodeIdAtLevel1
            LEFT OUTER JOIN Tree T2 ON T2.NodeID = C.ParentNodeIdAtLevel2
            LEFT OUTER JOIN Tree T3 ON T3.NodeID = C.ParentNodeIdAtLevel3
            LEFT OUTER JOIN Tree T4 ON T4.NodeID = C.ParentNodeIdAtLevel4
            LEFT OUTER JOIN Tree T5 ON T5.NodeID = C.ParentNodeIdAtLevel5

    This code, even if it seems complex, is indeed pretty simple. Moreover, I did not even try to make it optimal: the intent is educational, so I preferred suboptimal code that better shows the technique:

    1. The first CTE (FindNodePaths) computes the path of each node, recursively traversing the tree thanks to recursive CTEs in SQL Server. It is standard tree traversal code.
    2. The second CTE (ComputeParentNodeIdAtLevels) computes, for each node, the keys of the ancestors to be used for Levels 1 to 5. In case you need more than five levels, it is pretty easy to adapt the code by adding some columns.
    3. The last SELECT simply translates keys into names, since names are much better to show in reports than keys.
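
    As a concrete example (worked out by hand from the sample data above), consider Vincent: inside the first CTE his NodePath is ' 001 002 005' (Bill → Brad → Vincent), and the final SELECT translates it into this row:

    NodeId  Node     Level1  Level2  Level3   Level4  Level5
    5       Vincent  Bill    Brad    Vincent  NULL    NULL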

    This SQL code can be used straight as the source for a PowerPivot table and will return this table:

    image

    Now, we can use the various Level columns to browse the PivotTable and… well, the result is not very sexy. Please note that I removed the invoice city from the following reports, to make the PivotTables simpler to look at.

    image

    Even if the values are correct, and this time they take into account the hierarchy, there are a few issues:

    • Annabel has an empty child (the first one), which holds the value of Annabel herself. This happens for all the nodes, up to the level selected; in the figure, I used three levels. Moreover, there is no evidence that the empty nodes hold, indeed, the same value as their parent.
    • All three levels are shown, even when they are not useful.

    Let us start by solving the first issue. Who are those empty nodes? We used NULL for Level2 of Annabel since the node of Annabel has an ancestor at level 1 (Annabel herself) but no ancestor at levels greater than one. We can assume, with no loss of information, that Annabel is her own parent for all deeper levels. The same applies to any node: once a node reaches its last level, all the following levels should repeat the node name. Doing so, we remove all the NULLs from the levels, and with them the empty nodes.

    To obtain this result, it is enough to replace the last SELECT of our query with this:

    SELECT 
        NodeId   = C.NodeId,
        Node     = T.Node,
        Level1   = COALESCE (T1.Node, T.Node),
        Level2   = COALESCE (T2.Node, T.Node),
        Level3   = COALESCE (T3.Node, T.Node),
        Level4   = COALESCE (T4.Node, T.Node),
        Level5   = COALESCE (T5.Node, T.Node)
    FROM 
        ComputeParentNodeIdAtLevels C
            LEFT OUTER JOIN Tree T  ON T.NodeID = C.NodeID
            LEFT OUTER JOIN Tree T1 ON T1.NodeID = C.ParentNodeIdAtLevel1
            LEFT OUTER JOIN Tree T2 ON T2.NodeID = C.ParentNodeIdAtLevel2
            LEFT OUTER JOIN Tree T3 ON T3.NodeID = C.ParentNodeIdAtLevel3
            LEFT OUTER JOIN Tree T4 ON T4.NodeID = C.ParentNodeIdAtLevel4
            LEFT OUTER JOIN Tree T5 ON T5.NodeID = C.ParentNodeIdAtLevel5

    By simply adding the COALESCE, the final table becomes this:

    image

    Using this new format for the table, the result is much nicer:

    image

    Now all the empty nodes have disappeared from the PivotTable: when a node is a leaf, its name is repeated at all subsequent levels. Still not what we really want to show, but definitely better than before.

    Everything I have said up to now is pretty well known in the BI world. Thus, the interesting part of the post starts here.

    Our second goal is to hide all the useless levels. For example, the two “Annabel” rows under the first one should be hidden, since Annabel is a level 1 node and should disappear from the PivotTable when level 2 is browsed. In other words, a node at level N-1 should not be made visible at level N of the hierarchy.

    Now, let us try to understand first why the two undesired rows appear in the PivotTable. If we look at the filter contexts of the first three cells of the PivotTable, we get this:

    image

    The filter context changes for the three cells. The first cell computes over 4 rows; the second and the third ones compute over only one row, which is the Annabel row. It is clear that we cannot leverage the fact that there is only one row in the filter context to hide the row because, doing so, we would compute incorrect results for all the leaves. The number of rows in the filter context is not the way to go.

    Let us focus on the single row of Annabel, the one with NodeID equal to six. It should be computed at the first level and hidden at the next ones. Thus, the DAX formulas will work on the very same row, but under different filter contexts. The key to solving this scenario seems to be to find a way to detect the current filter context and then, if the row should be hidden under that context, a way to hide it. Unfortunately, in PowerPivot there is no way to inspect the current filter context. We can leverage the VALUES function, which returns the distinct values of a column in the current filter context but, in this specific scenario, VALUES (Tree[Level1]) returns “Annabel”, and the same holds for all the Level columns in all three of our contexts. In other words, inside DAX there is no way to discriminate the computation based on the filter context itself: we can only discriminate based on column values and, in our scenario, the column values are always the same.

    If we want to discriminate the behavior of DAX in these three filter contexts, we need some column that has a different value in the different filter contexts. But, in order to get that, we need more than one row with the same set of attributes filtered by the contexts, plus a new column that takes different values in the various filter contexts.

    Think of it this way: one row contains Annabel and a column which says “show me in level 1”; another row contains Annabel too, but the column says “show me in level 2”, and so on. This change in the data model lets us discriminate among the various instances of Annabel and decide whether a row should be made visible at each level. With this basic idea in mind, we can move a step further and note that we don’t need all of the filter contexts to be different: we only need to discriminate between levels in which Annabel should be visible and levels in which Annabel should be hidden. Thus, two rows for Annabel are enough; we don’t need a row for each level.

    We are moving toward the final solution but, before showing it, it is worth explaining the technique in more detail. If we duplicate the row of Annabel, mixing the last two queries (the one with COALESCE and the one without it), we get this:

    image

    Now, these two rows are identical for the user, but the first filter context shows both of them, while the second and the third, filtering Level2=”Annabel”, hide the row whose Level2 is NULL. Inside DAX, by counting the rows, we are able to discriminate between the two situations and hide the second instance of Annabel when only one of the two rows is visible. Well, we still need a way to hide the row, but we will come to that later.

    There is still a small problem to face before duplicating rows: NodeID is a key in the table and cannot contain duplicates. Moreover, in the first filter context we want Annabel’s invoices to be counted only once, even if two rows are visible. This is easily accomplished by setting the NodeID of the duplicated row to a negative value, which does not conflict with any other ID and does not relate to any row in the fact table.

    Since it is clear that the duplication needs to affect all the rows (each single row can be visible at one level and hidden at another), the last SELECT of our query becomes:

    SELECT 
        NodeId = C.NodeId,
        Node   = T.Node,
        Level1 = COALESCE (T1.Node, T.Node),
        Level2 = COALESCE (T2.Node, T.Node),
        Level3 = COALESCE (T3.Node, T.Node),
        Level4 = COALESCE (T4.Node, T.Node),
        Level5 = COALESCE (T5.Node, T.Node)
    FROM 
        ComputeParentNodeIdAtLevels C
            LEFT OUTER JOIN Tree T  ON T.NodeID = C.NodeID
            LEFT OUTER JOIN Tree T1 ON T1.NodeID = C.ParentNodeIdAtLevel1
            LEFT OUTER JOIN Tree T2 ON T2.NodeID = C.ParentNodeIdAtLevel2
            LEFT OUTER JOIN Tree T3 ON T3.NodeID = C.ParentNodeIdAtLevel3
            LEFT OUTER JOIN Tree T4 ON T4.NodeID = C.ParentNodeIdAtLevel4
            LEFT OUTER JOIN Tree T5 ON T5.NodeID = C.ParentNodeIdAtLevel5
    UNION ALL
    SELECT 
        NodeID = -C.NodeId,
        Node   = T.Node,
        Level1 = T1.Node,
        Level2 = T2.Node,
        Level3 = T3.Node,
        Level4 = T4.Node,
        Level5 = T5.Node
    FROM 
        ComputeParentNodeIdAtLevels C
            LEFT OUTER JOIN Tree T  ON T.NodeID = C.NodeID
            LEFT OUTER JOIN Tree T1 ON T1.NodeID = C.ParentNodeIdAtLevel1
            LEFT OUTER JOIN Tree T2 ON T2.NodeID = C.ParentNodeIdAtLevel2
            LEFT OUTER JOIN Tree T3 ON T3.NodeID = C.ParentNodeIdAtLevel3
            LEFT OUTER JOIN Tree T4 ON T4.NodeID = C.ParentNodeIdAtLevel4
            LEFT OUTER JOIN Tree T5 ON T5.NodeID = C.ParentNodeIdAtLevel5 

    You can see that I added a UNION ALL: the new rows contain NULL at the levels at which the row should be hidden, and the negated ID avoids duplicates and false relationships. The final result looks like this:

    image

    Now that the data model is ready to be queried, it is time to move to the PivotTable and take a look at the last points to solve:

    • Detect when a row should be hidden
    • Find a way to hide it

    Let’s go back to our PivotTable and add a measure called CountOfNodeID, defined by this simple formula:
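
    =COUNTROWS (VALUES (Tree[NodeId]))

    Adding it to the PivotTable, we get this result: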

    image

    I have highlighted the cells that need to be made visible; all the other ones should be hidden. Moreover, this figure makes the different filter contexts evident (Annabel now appears under five different filter contexts: one for the root and two for each of the deeper levels, where the level is either Annabel or NULL). The other thing to note is that all the rows that need to be hidden have a COUNTROWS of 1. This is because there are always at least two rows in the filter context, until we reach the point where the NULL value in the level becomes a discriminant: there, the two rows are separated into two different contexts, each of which shows only one of the two.

    Well, the technique is now clear: we just need to hide all the rows that have a CountOfNodeID equal to 1. Oh… yes, we still need a way to hide the rows. Ok, last point.

    If we remove the CountOfNodeID from the PivotTable, we will get this:

    image

    See what happened? All the rows that do not contain a value for Amount disappeared. This is due to the automatic NON EMPTY behavior of the PivotTable: all rows with empty values are automatically removed from the final result. Thus, if we return BLANK from a measure whenever we want to hide a row, that row will automatically disappear from the PivotTable. Let us create a measure, called SumOfAmount, with this formula:

    =IF([CountOfNodeID] > 1, SUM (Invoices[Amount]), BLANK ())

    Adding it to the PivotTable results in this:

    image

    You see that only the highlighted rows contain a value for the SumOfAmount measure. Thus, leaving SumOfAmount alone in the PivotTable, we reach the final result:

    image

    Et voilà, the visualization of Parent/Child looks exactly as it does in SSAS, with no useless nodes. Since the value is computed with a measure, it can easily be sliced by the city of the invoice, as I did in the first figures of this post.

    Before going to the end of the post, there are still some small considerations that I want to share.

    First. If you want to make the contribution of each node to its total evident (e.g. Brad has a value of 400 which is not shown in the report; it needs to be deduced as the difference between Chris plus Vincent and the total of Brad), you can easily add one visible level to the hierarchy, so that each node starts to be hidden one level later. This can be accomplished by adding a COALESCE to the second part of the query, as in the sketch below. Another interesting technique is to differentiate between nodes that behave as aggregators and nodes that contain values: in our case, we would end up with two Brads, one who has invoices and another one who is the aggregator. This latter technique, IMHO, leads to a better data model and user experience.
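
    As a sketch of this variation (my own adaptation, not the code of the original workbook): in the second SELECT of the UNION ALL, extend each duplicated row one level beyond its natural depth, so that every node survives one extra level and shows its own contribution there:

    SELECT 
        NodeID = -C.NodeId,
        Node   = T.Node,
        Level1 = T1.Node,
        -- keep each level filled one step beyond the node's own depth,
        -- then leave NULL as before
        Level2 = COALESCE (T2.Node, CASE WHEN T1.Node IS NOT NULL THEN T.Node END),
        Level3 = COALESCE (T3.Node, CASE WHEN T2.Node IS NOT NULL THEN T.Node END),
        Level4 = COALESCE (T4.Node, CASE WHEN T3.Node IS NOT NULL THEN T.Node END),
        Level5 = COALESCE (T5.Node, CASE WHEN T4.Node IS NOT NULL THEN T.Node END)
    FROM 
        ComputeParentNodeIdAtLevels C
            LEFT OUTER JOIN Tree T  ON T.NodeID = C.NodeID
            LEFT OUTER JOIN Tree T1 ON T1.NodeID = C.ParentNodeIdAtLevel1
            LEFT OUTER JOIN Tree T2 ON T2.NodeID = C.ParentNodeIdAtLevel2
            LEFT OUTER JOIN Tree T3 ON T3.NodeID = C.ParentNodeIdAtLevel3
            LEFT OUTER JOIN Tree T4 ON T4.NodeID = C.ParentNodeIdAtLevel4
            LEFT OUTER JOIN Tree T5 ON T5.NodeID = C.ParentNodeIdAtLevel5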

    Second. The speed of Parent/Child with this technique is great. In SSAS, Parent/Child suffered from the lack of aggregations but, since Vertipaq is superfast without any aggregation, it is still superfast with this implementation of Parent/Child, which behaves exactly like any other hierarchy. Try it with your data and share the results: I have not performed many performance tests and I am eager to see some figures published by my readers.

    Third. If you need to get the aggregated value of a node in a measure without having the hierarchy displayed on a PivotTable, the Level columns are of no help. To accomplish that, you need to push the NodePath column computed by the first CTE into the final table, like this:

    image

    Then, to compute the aggregated value of a node, simply sum up all the rows that contain that NodeID somewhere in the path, using the DAX FIND function, which works fine for this. Having used a space as a separator and fixed-length keys, we will not incur wrong substring detection. A measure computed in this way can then be used to perform further computations, even if the hierarchy is not visible in the PivotTable.
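
    As a minimal sketch (my own illustration, not code from the original workbook), assuming the NodePath column has been imported into the Tree table and hard-coding the padded key of Annabel, ' 006', which in a real model would come from a parameter table or a slicer:

    =CALCULATE (
        SUM (Invoices[Amount]),
        FILTER (
            ALL (Tree),
            IFERROR (FIND (" 006", Tree[NodePath], 1), 0) > 0
        )
    )

    The IFERROR is needed because FIND raises an error when the text is not found; thanks to the fixed-length keys and the space separator, a match can only happen at a real node boundary.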

    Fourth: the count of nodes, which determines the visibility of a node, is carried out on the dimension table only. Thus, it runs very fast, because there is no need to access the fact table to determine whether a node should be made visible or not. This means that the addition of the negative IDs to the hierarchy does not affect performance, since those nodes are never matched against the fact table. We do need to duplicate the dimension table, but this is not very worrying since, normally, dimensions tend to be pretty small (e.g. less than one million rows).

    This is a long post, so I think a small recap might be useful:

    • Since a Parent/Child hierarchy cannot be handled natively by PowerPivot, we need to flatten it using a fixed number of levels. I think ten levels would suffice for most scenarios.
    • The flattening of the hierarchy needs to be done in SQL, since SQL handles recursion in a snap, while PowerPivot cannot.
    • To be able to detect the nodes to be hidden, we had to duplicate all the nodes in the hierarchy, using the levels as the discriminator.
    • To hide a row, it is enough to blank ALL the measures shown in the PivotTable and the row will disappear.

    And the most important lesson here is: when formulas start to look too complicated, a simple change in the data model will often result in a much cleaner solution. You have seen that the DAX formulas used in this post are really plain; the key to the solution is in the data model, not in the formulas.

    To come to a conclusion: Parent/Child hierarchies are not natively present in PowerPivot; nevertheless, they can be implemented pretty well, even if with some SQL and DAX acrobatics. The final result is very nice, as elegant as the SSAS implementation. It works fine and it benefits from the tremendous speed of the Vertipaq engine, providing a great user experience. In the end, it might even be the case that PowerPivot handles Parent/Child hierarchies better than SSAS does.

    This is just another example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops Marco Russo and I are bringing all over Europe in the next few months.

    You can find more info on www.powerpivotworkshop.com. Hope to see you there!

  • PowerPivot course: Copenhagen, Dublin and Zurich new dates available for booking

    Wow, this time I have been able to make an announcement before Marco Russo! The PowerPivot course we are bringing on tour all over Europe has three new dates available for booking:

    • Copenhagen: 21-22, March 2011 at the Radisson Blu Royal Hotel.
    • Dublin: 28-29, March 2011 at the Microsoft Ireland Building 3.
    • Zurich: 4-5, April 2011 at Digicom Academy AG.

    Registrations are now open on the www.powerpivotworkshop.com website, where you can find all the relevant information about the course.

    As always, registering during the early bird period will let you save some money off the full price. Thus, hurry up! I really look forward to seeing you at the workshops, since both the Amsterdam and London events have been a GREAT experience and I’d like to repeat them in other cities.

  • PowerPivot and the Slowly Changing Dimensions

    Slowly changing dimensions are very common in data warehouses; basically, they store many versions of the same entity, adding a new one whenever a change happens in the columns for which history needs to be maintained. For example, the AdventureWorks data warehouse has a type 2 SCD in the DimProduct table. This can easily be checked with the product code “FR-M94S-38”, which shows three different versions of itself, with changing product cost and list price.

    image
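
    In case you want to reproduce the figure, a query like the following (assuming the standard AdventureWorksDW schema) returns those rows:

    SELECT ProductAlternateKey, EnglishProductName, StandardCost, ListPrice, StartDate, EndDate, Status
    FROM dbo.DimProduct
    WHERE ProductAlternateKey = 'FR-M94S-38'
    ORDER BY StartDate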

    This is exactly what we can expect to find in any data warehouse: each row contains the historical values of the attributes, so that we can perform analysis on how the attributes have changed over time. Moreover, a column (Status, in the example) holds information about which version of the product is the current one. Thus, looking at the table, it is very easy to detect that the current price of the product is 1,346.50. Nevertheless, reporting on the current list price is not easy, because there is no way to use the current value to slice data in a PivotTable: no column holds it; it has to be computed in some way.

    The issue might not be evident with the list price but, if you think of a slowly changing dimension with the customer address as a historical attribute, it might be interesting to know where a customer lived, but it is much more interesting to know where he lives now. Better yet, we would like to get both pieces of information, which may be useful for different purposes. Unfortunately, the AdventureWorks company seems to be interested only in price changes, so I had to use that as the example.

    Now, if you live in a corporate BI environment and want to perform current versus historical price comparisons, then somebody will create a view for you which, through some complex JOINs, exposes both the historical and the current version of some columns. If you dare to ask him to consolidate the values in the dimension table instead, he will complain about ETL time, UPDATEs needed on a dimension, log file usage and, in the end, he will probably refuse to do it.
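
    Something like this minimal sketch of my own (the view name is hypothetical, and I am assuming that every product has exactly one row with Status = 'Current', as in AdventureWorksDW):

    CREATE VIEW DimProductWithCurrentPrice AS
    SELECT
        H.ProductKey,
        H.ProductAlternateKey,
        H.EnglishProductName,
        H.ListPrice,                       -- historical list price of this version
        CurrentListPrice = C.ListPrice     -- list price of the current version
    FROM dbo.DimProduct AS H
        INNER JOIN dbo.DimProduct AS C
            ON  C.ProductAlternateKey = H.ProductAlternateKey
            AND C.Status = 'Current'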

    Luckily, in the Self-Service BI world, things are much easier. If we want to add a calculated column to the table that computes, for each version of the product, the current list price (which happens to be the same for all the versions of the same product), it is enough to use this formula:

    =CALCULATE (
        VALUES (DimProduct[ListPrice]),
        ALL (DimProduct),
        DimProduct[ProductAlternateKey] = EARLIER (DimProduct[ProductAlternateKey]),
        DimProduct[Status] = "Current"
    )

    Its behavior is very easy to understand: compute the value of ListPrice searching, among all the products, for the one which has the same ProductAlternateKey as the current product and a Status equal to “Current”. The only part worth noting is the usage of ALL (DimProduct) inside the CALCULATE filters. This is needed because CALCULATE transforms the row context of the calculated column into a filter context, which we need to remove in order to replace it with our specific filter. The result is straightforward:

    image

    Now CurrentListPrice is a calculated column and can be used to slice data in a PivotTable, like any other column:

    image

    Some words need to be spent in case you ever face a data warehouse which does not hold a “Current” column like AdventureWorks does. There is really no standard technique to identify the current version of a product: sometimes we have a historical/current column, as in this example; sometimes we need to infer it from the EndDate containing NULL; sometimes we need to find the last version of the record and use that. I am writing down the various flavors of the formula here, just to keep them at hand in case they are needed.

    If we need to rely on the emptiness of the EndDate column, the formula is easily adapted:

    =CALCULATE (
        VALUES (DimProduct[ListPrice]),
        ALL (DimProduct),
        DimProduct[ProductAlternateKey] = EARLIER (DimProduct[ProductAlternateKey]),
        ISBLANK (DimProduct[EndDate])
    )

    While, if we need to search for the last occurrence of the record with the same code, we need some more DAX acrobatics to express the formula:

    =CALCULATE (
        VALUES (DimProduct[ListPrice]),
        FILTER (
            ALL (DimProduct),
            DimProduct[ProductAlternateKey] = EARLIER (DimProduct[ProductAlternateKey]) &&
            DimProduct[StartDate] = CALCULATE (
                MAX (DimProduct[StartDate]),
                ALL (DimProduct),
                DimProduct[ProductAlternateKey] = EARLIER (DimProduct[ProductAlternateKey])
            )
        )
    )

    This last formula is the most interesting one, since it relies on a minimum of information (just the start date) and performs the whole logic of current-value detection, searching, for each row, for the one containing the same code but the latest date of validity. Needless to say, the three formulas return the very same value.

    As a side note, I think that in BISM this will be a very easy and convenient method to handle slowly changing dimensions, even in corporate BI environments, since the current and historical values of any attribute can be computed with minimal CPU effort and without the need to perform any update on the historical dimension in the database. I wish I had had similar functionality a long time ago, when I needed to write ETL code to handle this scenario and spent quite some time on optimizations and locking issues…

    This is just another example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops Marco Russo and I are bringing all over Europe in the next few months.

    You can find more info on www.powerpivotworkshop.com. Hope to see you there!

  • PowerPivot: Putting two stocks on the same PivotChart

    In a previous post, I used a stock exchange scenario to speak about how to compute moving averages in a complex scenario. Playing with the same scenario, I felt the need to compare two stocks on the same chart, choosing the stock names with slicers. As always, a picture is worth a thousand words; the final result I want to achieve is something like this, where I am comparing Microsoft against Apple over the last 10 years.

    image

    It is clear that I am not going to comment in any way on why traders seem to prefer buying Apple instead of Microsoft… that would be the task of a trader, and I am only a BI guy.

    What makes this scenario interesting is that there are two values in the chart (the Close of Stock 1 and of Stock 2) and their values are determined by slicers. I want to use the same chart to compare, for example, Oracle vs. SQLBI.COM. Thus, the value of each measure needs to be determined dynamically, depending on the value of the slicer. This is a pretty common example of slicers used to “inject” parameters into DAX formulas, i.e. the DAX formula computes different values depending on parameters that the user selects with filters, slicers or any other UI tool.

    First of all, let us briefly recall the content of our tables: we have a Fixing table, containing the values of the stocks as they change over time, with a granularity at the day level, and a standard calendar table. The Fixing table is very easy:

    image
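
    Its structure is more or less the following (my reconstruction from the formulas used in this post; the date column name is a guess, the actual table is the one shown in the figure):

    CREATE TABLE Fixing (
        [Date]  DATE,             -- hypothetical name for the fixing date column
        Stock   VARCHAR (100),    -- stock name, referenced as Fixing[Stock]
        [Close] MONEY             -- closing value, referenced as Fixing[Close]
    )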

    Now, it should be evident that “Close of Stock 1” cannot be computed as a calculated column, since its value depends on the evaluation context defined by the slicer. Thus, the starting point is a measure. Similarly, the slicers for Stock 1 and Stock 2 cannot use the Stock column in the Fixing table, since their purpose is to change the measure behavior, not to filter the Fixing table (or, in other words, the filtering will be controlled by the DAX formula).

    Thus, the first step is to create two Excel tables that will be used as the sources for the slicers. This is pretty easy: just create two Excel tables like the ones shown below, link them into PowerPivot and call them Stock1 and Stock2. The first part is easily done.

    image

    Then, in order to define the value of “Close of Stock 1”, we need to:

    • Check if the slicer for Stock1 filters one stock (the user might remove the filter and, in this case, the value of the measure is undefined)
    • Compute the value of the measure based on the stock name selected in the Stock1 slicer

    This is pretty easy to accomplish, using this formula:

    =IF (
        COUNTROWS (VALUES (Stock1[Stock 1])) <> 1,
        BLANK (),
        CALCULATE (
            AVERAGE (Fixing[Close]),
            Fixing[Stock] = VALUES (Stock1[Stock 1])
        )
    )

    The key to this formula is in the definition of the filter context under which CALCULATE computes the AVERAGE of Fixing[Close]. We ask for a filter context where the Fixing[Stock] column shows only the values of the stock in Stock1[Stock 1], which is the name of the stock selected in the slicer. Moreover, we know that VALUES returns exactly one row, since we test for that in the preceding IF. As always, it is easier to read than to create; take your time to understand it well.

    If you define this formula and create a PivotChart putting years and dates on the axis, you get this first nice result:

    image

    Clearly, you can define a new measure tied to the Stock2 table and add it to the chart, but the result does not look as pretty as before (in the example, I have filtered only the last two years to make the issue more evident):

    image

    You can see that there are points in time (near November 2011 and July 2010) where both stocks show a spike down to zero. The interesting part is that these spikes do not appear when only one stock is added to the chart; they appear only when we put two stocks on the same chart.

    The reason for this behavior is that, in my set of data, there are some dates where the value of one stock has not been recorded, for whatever reason. When only one stock is in the chart, those dates are removed from the graph. Nevertheless, when both stocks are present on the same chart, if the value of one of the two stocks is present, then that date is present in the chart too. For that date, the value of the other stock yields zero, resulting in the annoying issue we are facing on this chart: spikes down to zero whenever only one of the two stocks has been recorded.

    We could (and should) remove the problem by updating the original table, adding interpolated values where they are missing. Nevertheless, since this is a post about DAX, we had better resolve the issue using DAX. The idea is that, if Stock2 is not present for a period of time, Stock1 should not return any value either. The converse holds true as well: if Stock1 is not present, then Stock2 should not return any value. Thus, in the end, both Stock1 and Stock2 return values only for dates where both are present.

    The formula, for Close of Stock 1 is the following:

    =IF (
        COUNTROWS (VALUES (Stock1[Stock 1])) <> 1 || COUNTROWS (VALUES (Stock2[Stock 2])) <> 1,
        BLANK (),
        IF (
            ISBLANK (
                CALCULATE (
                    AVERAGE (Fixing[Close]),
                    Fixing[Stock] = VALUES (Stock2[Stock 2])
                )
            ),
            BLANK (),
            CALCULATE (
                AVERAGE (Fixing[Close]),
                Fixing[Stock] = VALUES (Stock1[Stock 1])
            )
        )
    )

    If both slicers select exactly one value each then, if Stock2 is present, calculate Stock1; otherwise return BLANK. The formula for Stock2 is very similar, with the 1s and 2s interchanged, as shown below. Updating the two formulas is enough to get the nice result of the first chart, which was our final goal.
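
    For completeness, this is how the formula for Close of Stock 2 should look, derived by swapping the roles of the two slicer tables (my derivation, not taken verbatim from the original workbook):

    =IF (
        COUNTROWS (VALUES (Stock1[Stock 1])) <> 1 || COUNTROWS (VALUES (Stock2[Stock 2])) <> 1,
        BLANK (),
        IF (
            ISBLANK (
                CALCULATE (
                    AVERAGE (Fixing[Close]),
                    Fixing[Stock] = VALUES (Stock1[Stock 1])
                )
            ),
            BLANK (),
            CALCULATE (
                AVERAGE (Fixing[Close]),
                Fixing[Stock] = VALUES (Stock2[Stock 2])
            )
        )
    )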

    Clearly, the very same technique can be used to add more interesting measures, like Bollinger Bands and moving averages, or, with a bit more effort, to turn PowerPivot into a complete trading system that suggests when to buy or sell a stock. This is left as an interesting exercise for the reader.

    This is just another example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops Marco Russo and I are bringing all over Europe in the next few months.

    You can find more info on www.powerpivotworkshop.com. Hope to see you there!
