This post is part of a Methodology discussion - other posts will follow. I would be happy to get your feedback!
One interesting question is: “How do you measure the complexity of a BI solution?” Being able to classify a solution is essential when we need to decide whether a specific technique is suitable for it.
We want to define a classification of solutions based on their size and complexity. Clearly, size leads to complexity: even a simple solution made up of a few dimensions becomes complex if it has to handle several billion rows in the fact table.
Because data warehouse processing normally happens during the night, we propose a classification of BI solutions based on the time required for both the ETL execution and the cube processing to finish.
A small BI solution can be rebuilt from scratch each time, because the entire ETL phase consumes only a few hours of computation and there is no need to keep a complex history of changes in the dimensions.
If you are developing a small BI solution, you will normally recreate the whole relational database each time the ETL phase starts, producing a fresh database each night. This simple method of creating the data warehouse is very convenient, when applicable, because it greatly reduces the complexity of handling the database. We can apply any change without worrying about the old data, simply because there is no old data (where “old data” means data already processed and imported into the BI solution).
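The full-rebuild pattern can be sketched in a few lines. The snippet below is a minimal illustration in Python with an in-memory SQLite database; the table and column names are invented for the example and are not part of any specific solution.

```python
import sqlite3

# Hypothetical source rows; in a real solution these would be extracted
# from the OLTP systems during the nightly ETL run.
source_facts = [("2024-01-01", "P1", 100.0), ("2024-01-01", "P2", 250.0)]

def one_shot_rebuild(conn, facts):
    """Recreate the data warehouse from scratch: drop everything, then
    reload. No old data survives, so schema changes are painless."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS FactSales")
    cur.execute("CREATE TABLE FactSales (SaleDate TEXT, ProductKey TEXT, Amount REAL)")
    cur.executemany("INSERT INTO FactSales VALUES (?, ?, ?)", facts)
    conn.commit()

conn = sqlite3.connect(":memory:")
one_shot_rebuild(conn, source_facts)
one_shot_rebuild(conn, source_facts)  # running it again yields the same fresh state
print(conn.execute("SELECT COUNT(*) FROM FactSales").fetchone()[0])  # 2, not 4
```

Note that running the rebuild twice leaves exactly the same database as running it once: this idempotence is what makes the pattern so easy to manage.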
Even if this approach might seem strange, we find that many small or medium sized companies have data warehouses whose full processing lasts no more than six or seven hours. Using this pattern leads to simple solutions that are very easy to handle and have a very high return on investment.
We like to call a small BI solution a “one shot solution”, because it can be built with “one shot” of ETL computation.
If we need to trace the history of changes in the dimensions (i.e. we have some slowly changing dimensions, SCDs, in the solution), the one shot solution no longer seems viable. This is only partially true.
If the complete solution can be rebuilt overnight, we will always try to keep it as a one shot solution, computing it each night. If, for some reason, we need to maintain the history of changes of some specific dimensions, we think it is easier to store the different versions of the members of those dimensions in a persistent database, while still reloading all the facts and dimensions each night.
In this way, we keep the advantages of the one shot solution, adding only a slight complexity for the storage of the different versions of the members of the changing dimensions. At the same time, we still have full control over any database change, because we can rebuild the database in one pass of computation.
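The idea of a small persistent store for dimension versions can be sketched as follows. This is only an illustration, assuming an SCD type 2 pattern with invented field names (`valid_from`, `valid_to`): the version store survives between nights, while everything else is rebuilt.

```python
from datetime import date

def upsert_scd2(history, current_members, today):
    """Merge tonight's source members into a persistent version store.
    `history` maps a business key to a list of versions; a version with
    valid_to=None is the current one. Only this store is kept across
    nights; facts and other dimensions are reloaded from scratch."""
    for key, name in current_members.items():
        versions = history.setdefault(key, [])
        if not versions:
            # New member: create its first version.
            versions.append({"name": name, "valid_from": today, "valid_to": None})
        elif versions[-1]["name"] != name:
            # Attribute changed: close the current version, open a new one.
            versions[-1]["valid_to"] = today
            versions.append({"name": name, "valid_from": today, "valid_to": None})
    return history

history = {}
upsert_scd2(history, {"C1": "Contoso"}, date(2024, 1, 1))
upsert_scd2(history, {"C1": "Contoso Ltd"}, date(2024, 2, 1))  # customer renamed
print(len(history["C1"]))  # 2 versions kept, even though facts are rebuilt nightly
```

The nightly ETL then joins the reloaded facts against this version store to assign the correct historical member to each fact.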
When the size of the fact tables becomes so big that the “one shot solution” is no longer viable, the only choice is to load the data warehouse incrementally each night, applying changes to the dimensions and adding new facts.
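An incremental load can be sketched with a simple watermark: each night we append only the facts that arrived after the last load. This is a minimal illustration; the convention that each row carries a sortable load date as its first element is an assumption made for the example.

```python
# Minimal sketch of an incremental nightly load. The warehouse is
# persistent across nights, unlike in the one shot solution.
def incremental_load(warehouse, source_rows, watermark):
    """Append only facts newer than the watermark, then advance it so
    the next night skips tonight's rows."""
    new_rows = [row for row in source_rows if row[0] > watermark]
    warehouse.extend(new_rows)
    return max((row[0] for row in new_rows), default=watermark)

warehouse = [("2024-01-01", "P1", 100.0)]           # facts already loaded
source = [("2024-01-01", "P1", 100.0),
          ("2024-01-02", "P2", 250.0)]              # tonight's extraction
watermark = incremental_load(warehouse, source, "2024-01-01")
print(watermark, len(warehouse))  # 2024-01-02 2
```

The price of this pattern is exactly the loss described above: since old facts are never reloaded, every schema or logic change must be applied carefully to data that is already in place.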
When a solution is very large, the overall complexity rises considerably. We can no longer rebuild the database easily, nor smoothly add attributes to the solution. Any change requires great care, because every operation consumes a large amount of CPU and disk resources.
Our experience is that the majority of solutions on the market are small or medium sized. Whenever we start a new BI solution, we always need to understand which scenario we are in, because the architectural choices will depend heavily on the size of the BI solution.