THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Davide Mauri

A place for my thoughts and experiences on SQL Server, Business Intelligence and .NET

  • PASS Summit 2014 Pre-Con Preview: Davide Mauri

    If you’re into Data Warehousing, you may be interested in attending to the workshop I’ll deliver at PASS Summit 2014 in Seattle on November 4th.

    http://www.sqlpass.org/summit/2014/Sessions/PreConferenceSessions.aspx

    The workshop is entirely dedicated to explaining why and how a *successful* Data Warehouse can be thought, designed, architected, built, loaded and tested, using the  Agile approach that, so far, has mainly be applied to the application development field and in the last year has gained traction also (and finally I would say) in the BI field. Both Gartner and Forrester also underline that the Agile is a key factor for success in modern BI world, since has been verified that 50% of the requirement change in the first year in a BI project.

    If you want to read more about the workshop, we you read the Q&A just published here:

    http://www.sqlpass.org/Community/PASSBlog/tabid/1476/entryid/676/PASS-Summit-2014-Pre-Con-Preview-Davide-Mauri.aspx

    In addition to that I’d also like to share the agenda of the workshop, that will give you even more information on what we’ll discuss on that day:

    • Why a Data Warehouse?
    • The Agile Approach
    • Modeling the Data Warehouse
      • Kimball, Inmon & Data Vault
      • Dimensional Modeling
      • Dimension, Fact, Measures
      • Star & Snowflake Schema
      • Transactional, Snapshot and Temporal Fact Tables
      • Slowly Changing Dimensions
    • Engineering the Solution
      • Building the Data Warehouse
        • Solution Architecture
        • Naming conventions, mandatory columns and other rules
        • Views and Stored Procedure usage
      • Loading the Data Warehouse
        • ETL Patterns
        • Best Practices
      • Automating Extraction and Loading
        • Making the solution automatable
        • BIML
    • Unit Testing Data
    • The Complete Picture
      • Where Big Data comes into play?
    • After the Data Warehouse
      • Optimized Hardware & Software
    • Conclusions

    As you can see it will be a fully packed day...so brings two cups of coffee and you'll be good :)

    See you in Seattle!

  • Sketch notes from 24 Hours of PASS

    24 Hours of PASS has passed and, beside the slides, demo and video (that will come soon on http://www.sqlpass.org/24hours/2014/summitpreview/Schedule.aspx), this time, thanks to Matt Penny (@salisbury_matt) you can also have very nice and well done sketch notes that summarizes the concept of the sessions Matt attended to, in a very nice, quick, effective and friendly way. Here’s what Matt did for my session:

    BxKPva4IEAAVZ_f

    I love it! I must say I’m a fan on sketch notes. It’s quite an art on its own IMHO, ‘cause good sketch notes mix written and visual language such in a way that make much more easier to the read to get the message and memorize it. I’ll be using the notes that Matt took for my session quite a lot in future, for sure.

    Beside notes of my session, you can find here

    http://mattypenny.net/2014/09/10/sketchnotes-from-24-hours-of-pass/

    sketch notes for the following sessions:

    • Brent Ozar on ‘Developers: Who Needs a DBA?’
    • Brian Knight on ‘Performance Tuning SQL Server Integration Services (SSIS)’
    • Allan Hirt on ‘Availability Groups vs. Failover Cluster Instances: What’s the Difference?’
    • Erin Stellato, Jonathan Kehayias on ‘Everything You Never Wanted to Know about Extended Events’
    • Gail Shaw on ‘Guessing Games: Statistics, Heuristics, and Row Estimations’
    • Tim Chapman, Denzil Ribeiro on ‘Troubleshoot Customer Performance Problems Like a Microsoft Engineer’
    • Argenis Fernandez on ‘Secure Your SQL Server Instance without Changing Any Code’
    • Joe Webb on ‘Hiring the Right People: Interviewing and Selecting the Right Team’
    • Robert Cain, Bradley Ball, Jason Strate on ‘Zero to Hero with PowerShell and SQL Server'
    • Chris Shaw, John Morehouse on ‘Real World SQL 2014 Migration Path Decisions’
    • Julie Koesmarno on ‘”I Want It NOW!” Data Visualization with Power View’
    • Jen Stirrup on ‘Business Intelligence Toolkit Overview: Microsoft Power BI and R’
    • Ryan Adams on ‘SQL Server AlwaysOn Quickstart’

    Thanks Matt!

  • SSIS Dashboard v 0.6.1

    Yesterday I’ve released the latest SSIS Dashboard update. There quite a lot of new features included that I found to very useful when you have a server full of packages and logs. Here the complete list: Highlighted the feature I think worth the most:

    • Updated Morris.js to v 0.5.1
    • Updated MetisMenu to v 1.1.1
    • Added information on "Child" Packages
    • Added more detail to the "Package Execution History" page. Also added an estimated end time / elapsed time for running packages, using a moving average of 7 steps.
    • Added navigation sidebar in the main page that shows available folders and projects
    • Added support for folders and project filtering
    • Changed configuration file in order to comply with Python/Flask standards
    • Cleaned Up code in order to follow Python best practices (still a lot to do :))

    Have you had a chance to give it a try? What features you’d like to see added?

    My plans for the next releases is to

    • Add a configuration page so that you can choose the maximum number of rows return (now set to 15) and the time interval you want to analyze (not set to 72 hours in the config file)
    • Use a EWMA instead of the simple Moving Average
    • Do a video to show how to install and use the Dashboard
    • Package everything in only one executable file / directory / VM (I want to be able to offer a xcopy deplyment “all-included”…not only the .py files)
    • Include additional information taken from [event_message_context], [executable_statistics], [execution_parameters_values]
    • Fix the layout for small / medium screens (smartphones / tablet)
    • Add historical / average elapsed time also for Child Packages and Executables
    • Include DataFlow informations

    Once all those things will be do, version 1.0 will be ready

    If you want to help, fork the code from Github:

    https://github.com/yorek/ssis-dashboard

    if you want to try it go here

    http://ssis-dashboard.azurewebsites.net/

  • On Agile Data Warehousing

    In the last month, I’ve been working on the slide deck of my Agile Data Warehousing workshop. In order to give to it additional value that goes beyond the pure technical aspects, and since now the “Agile” approach is becoming more and more mainstream also (and finally!) on BI, I did a small research to check what one can find on the web regarding this topic. Many things happened from the very first time I presented at PASS 2010, where I first mentioned the need to be Agile (or “Adaptive” as I prefer to say when talking about BI & Agility). In 2011 Gartner, at their BI Summit, stated through the voice of Andy Bitterer that

    50% of requirements change in the first year of a BI project

    and, as a result, the only possible way to succeed in a BI project is to be able to adapt quickly to the new requirements and requests. The doors to Agile BI were opened.

    Agile BI as grown from that point on, until the point that Forrester even started to evaluate Agile Business Intelligence Platform, even nominating Microsoft as one of the Leaders:

    Microsoft named a Leader in Agile Business Intelligence by Forrester

    I must say I’m not 100% with the definition of Agile BI the Forrester gives, since it puts together to many things (Data Visualization, Automation, Self-Service BI just to name a few), but I understand that they see the things from the end user perspective, that simply wants to “do everything, immediately, easily and nicely” with its data. There is also a definition on Wikipedia (page created on January 2013) that is better, more complete and less marketing-oriented:

    http://en.wikipedia.org/wiki/Agile_Business_Intelligence

    Beside those definitions, terms like Agile BI and Lean BI became quite common. Of course, with them, came also the idea of Agile Project Management and Agile Modeling. Especially this latter subject seems to be very hot and of course is something that is also close to my interests. Now, I won’t want to go into a deep discussion of the topic, telling you what it good and what is bad. There is already a lot on the web for or against any possible modeling solutions. Data Vault, BEAM, Model-Storming…a simple search on the web and you’ll find thousands of articles. Who’s the best? Should we go for Data Vault? Or for an Inmon-style DWH? Or Kimball? Or something else?

    Well…I don’t really care. Or, to be honest, a care just a little bit.

    Now, since “Ideas without execution are hallucinations”, and models are ideas after all, it’s my strong opinion that you don’t model the be agile: you “engineer” to be agile. Why? It’s simple: all models are agile…since they are models, and nothing more. Is not a problem to change a model, since it’s “just” a formal definition of a system…(of course, I’m bringing the idea to the extreme here)  and, since we’re assuming that business requirement will be changing, you known in advance that no model that will satisfy them all (immediately) exists (yeah, you can try to model the “Universal Data Model” but it’s going to be *very* complex…). So, the main point is to be able to bring changes quickly, with a measurable quality, in a controlled and economic way.

    We all know that the one and only one model that should be presented to the end user is the Dimensional Model. This is how your Data Mart should look like. But how do you model your Data Warehouse is completely up to you. And it will change over time, for sure. So how you implement the process in order to extract, transform and load the data, is the key point. That implementation must be agile. What lies behind the scenes, following the information hiding principle, should be considered a simple “technical enabler” that could change at any time. So, if one prefer to use Data Vault, or Inmon, or just store anything in some hundreds Hadoop server…I don’t see any problem with that. As soon as you have defined an engineered approach with naming conventions, design pattern automation, quality checks, metadata and all the stuff in order to make sure that when you have to change something, you can do the smallest change possible, measure its impact, and test the result.

    I’ve been trying to apply Agile principles to BI since 2003…I’ve been through any possible changes that you can imagine (even a complete change of an ERP that was the main source of data) and the most important thing I’ve learned is that the only model that works is the one that is liquid and is able to adapt quickly to changing requirements. I usually start modeling in the easiest way possible, and thus I apply the Dimension Model, and then I make all the changes to it in order to be able to keep

    • All the data at the highest granularity
    • Optimal performances
    • Historical Informations (that may not be visible to end user, but may be needed to correctly transform data)

    Which, for complex DWH, means that at the beginning the DWH and the Data Mart overlaps, and that they diverge as the project goes on. In one project we even decided to go for a normalized model of data since the DWH became the source not only for reporting and analysis but also for other, more operative, duties.

    Now, in order to be really agile, it’s mandatory to have an engineered approach that make sure that from agility the project doesn’t fall into anarchy. Because this is the biggest risk. The line that separates the two realities it’s very thin and crossing it is very easy. When you have a team of people, or they work as one, or Agile BI is not for you. Otherwise chaos will reign. And to make sure this does not happen, you have to have a streamlined building process, tools and methods (design patterns, frameworks and so on) so that everyone can technically do a good job and technical quality of the outcome is not only proportional to the experience of the person doing it.

    It’s really important that everyone who wants to approach Agile BI understand the “engineering” part. I found it always underestimated and in all post I’ve found on the web, I never read someone stressing the importance of that part. That’s why I felt the urge to write this post, and that’s why I’ll go very deep in this topic during my PASS Workshop.

    Now, before finishing the post, there is still one thing missing, but vital, for the success of an Agile VI solution: testing. Agility cannot exist if you don’t have an automated (or semi-automated) testing framework that assures you and your users that no errors will be introduced in the data as a result of a change done to satisfy some new or changed requirements. This is mandatory and I’m quite disappointed to see that almost no-one underline this point enough. Forrester doesn’t even took into consideration this point when evaluating the existing “Agile BI Platforms”. That’s a very big mistake in my opinion…since everyone give for granted data quality, but it’s actually the most difficult thing to obtain and maintain.

    Testing frameworks are quite common in development, even Visual Studio has a testing engine integrated, and they should become common in BI to. Something is starting to appear (https://nbi.codeplex.com/), but I wish that also big players (Microsoft above all) start to take this subject more seriously. How cool and useful will be a strong integration of testing in SSIS? After DWH/DM/Cube loading one could launch all the tests (maybe done right from Excel, from a power-user, or even created automatically if certain conditions are met…say the generation of year balance) and make sure that the freshly produced data are of good quality.

    Just like water. Because data IS water. I won’t drink it if not tested.

  • SSIS Dashboard 0.5.2 and Live Demo Website

    In the last days I’ve worked again on the SQL Server Integration Service Dashboard and I did some updates:

    Beta Added support for "*" wildcard in project names. Now you can filter a specific project name using an url like:

    http://<yourserver>/project/MyPro*
    

    Added initial support for Package Execution History. Just click on a package name and you'll see its latest 15 executions

    and I’ve also created a live demo website for all those who want to give it a try before downloading and using it:

    http://ssis-dashboard.azurewebsites.net/

  • SSIS Dashboard v0.4

    Following the post on SSISDB script on Gist, I’ve been working on a HTML5 SSIS Dashboard, in order to have a nice looking, user friendly and, most of all, useful, SSIS Dashboard.

    Since this is a “spare-time” project, I’ve decided to develop it using Python since it’s THE data language (R aside), it’s a beautiful & powerful, well established and well documented and with a rich ecosystem around.

    Plus it has full support in Visual Studio, through the amazing Python Tools For Visual Studio plugin,

    I decided also to use Flask, a very good micro-framework to create websites, and use the SB Admin 2.0 Bootstrap admin template, since I’m all but a Web Designer.

    The result is here:

    https://github.com/yorek/ssis-dashboard

    and I can say I’m pretty satisfied with the work done so far (I’ve worked on it for probably less than 24 hours). Though there’s some features I’d like to add in t future (historical execution time, some charts, connection with AzureML to do prediction on expected execution times) it’s already usable. Of course I’ve tested it only on my development machine, so check twice before putting it in production but, give the fact that, virtually, there is not installation needed (you only need to install Python), and that all queries are based on standard SSISDB objects, I expect no big problems.

    ssis-dashboard-v0.4

    If you want to test, contribute and/or give feedback please fell free to do it…I would really love to see this little project become a community project!

    Enjoy!

  • SSISDB Analysis Script on Gist

    I've created two simple, yet very useful, script to extract some useful data to quickly monitor SSIS packages execution in SQL Server 2012 and after.

    I've started to use gist since it comes very handy, for this "quick'n'dirty" scripts and snippets, and you can find the above scripts and others (hopefully the number will increase over time...I plan to use gist to store all the code snippet I used to store in a dedicated folder on my machine) there.

    Now, back to the aforementioned scripts. The first one ("get-ssis-execution-status") returns a list of all executed and executing packages along with

    • latest successful and running executions (so that on can have an idea of the expected run time)
    • error messages
    • warning messages related to duplicate rows found in lookups

    the second one ("get-ssis-data-pumped-rows") returns information on DataFlows status. Here there's something interesting, IMHO. Nothing exceptional, let it be clear, but nonetheless useful: the script extract information on destinations and row sent to destinations right from the messages produced by the DataFlow component. This helps to quickly understand how many rows as been sent and where...without having to increase the logging level.

    Enjoy! 

    PS
    I haven't tested it with SQL Server 2014, but AFAIK they should work without problems. Of course any feedback on this is welcome. 

  • Submitted Abstract Feedback – My 2 cents

    "WOW. That's exactly the feedback I would have loved to have for ALL my proposals in the past year. And not only for PASS Summit". No more and no less this is what I thought after having received the mail with all the feedback from the review committee of my PASS Summit proposals. Yeah, sure, I thought it in Italian, but for your convenience, I translated it in English. Being a written translation you cannot also see the typical Italian gestures related to the "WOW" phase, but you can get nice sense of that here: https://www.youtube.com/watch?v=M0n4Vw6twKo (at minute 0:41 J and 1:16 J)

    Now, beside this J, I encourage everyone who submitted session abstract for PASS Summit to ask for detailed feedback, as explained here:

    http://www.sqlpass.org/Community/PASSBlog/tabid/1476/entryid/642/PASS-Summit-Session-Selection.aspx

    After reading the feedbacks, I already have several ideas on how to improve submitted abstract of sessions that have been rejected, and I'm sure the changes I'm going to do will help to increase the chances to have the session accepted in future. After all, it's all about being able to communicate to others why you think your session is useful and how it can help people in their work. And since we all have different way of seeing things, the more feedback I can have on how people get and understand my submission, the better for me, since I can fine-tune the abstract for the audience.

    Good work, PASS.

  • PASS Summit 2014 – Agile BI & Agile Data Warehousing

    I'm really happy to be able to announce that I'll be delivering one full-day workshop and a regular session at PASS Summit 2014. Here's the titles and the short summary of my workshop and session:

    Agile Data Warehousing: from Start to Finish
    Full-Day Workshop
    In this workshop you'll learn how to create a Data Warehouse from the beginning to end: from the architectural definition to its building using an engineered and proven process, you'll see how effective and efficient an Agile approach can be.

    Agile BI: Unit Testing and Continuos Integration
    Regular Session
    The Agile principles can be applied to BI projects too, even the most extreme ones like Continuos Integration. In this session, you'll see how you can do that using several tools and processes all together to achieve a common goal: be Agile!

    These topics are among the one I love at most and I'm really happy (at, proud J) that they've been selected. I believe that the Agile approach IS the correct approach to make BI project a success, simply because it's the only one that can quickly return value to the business that, in turn, will start to appreciate the BI solution giving feedback to improve it and thus helping the IT department to improve it more and more every day. It's all about creating a virtuous circle that will make sure people will love your BI solution.

    Both Gartner and Forrester agrees that the Agile approach is the way to go. The first one, already back in 2011 noted that "50% of requirements change in the first year. […] Adopting Agile delivery allows you to fail fast and fail early, reducing the cost of failure and turning it into something to embrace and learn from", and the second one, in the latest research (Q1 2014) say that "business agility often equals the ability to adopt, react, and succeed in the midst of an unending fountain of customer driven requirements" and thus an Agile BI is needed to support this approach.

    Now, this is all nice, but what about costs? Agility can be too costly. In order to be effective all the low-add-value work MUST be automated, since you're doing thing in cycles, and thus each minute spent doing non-valuable work will be wasted over and over again. In the end, to be really agile, there must be a smooth and highly automated "construction yard" where quality is not directly dependent from (the quality and experience of your) people, and where results are testable and measureable.

    The workshop will be all about these topics. Agility, engineering, best practices, automation, change management, design patterns, testability.

    On the other side, the regular session will focus 100% (and even more) on the testability topics. I'll show that testing is one of the pillars that allow the Agile approach to exists and I'll also show how we can apply it to BI. This is really a frontier topic, and as such there is dedicated tool for that. We'll have to borrow them from Application Development and forge them again to be useful in the BI field. We'll push this concept to the limit, reaching the point where we'll also see how also Continuos Integration can be used in BI development to keep under strict control each change on improvement we do to our BI solution.

    Because BI is all about trust, delivered at right time.

    See you in Seattle!

  • GUID, fragmentation and performance: the eternal flame.

    In the last months a lot of discussion has been done around the topic of GUIDs and their usage as Clustered Keys. We had a very “hot” discussion on that several years in Italy, and looking on the web you can see that this kind of question surface gain periodically. Recently I discussed it also with my friend Thomas. Discussion that gave me some food for thought, as usual. So with this post I’d like to share my findings, hoping to write a blog post that can help to better understand this phenomenon that is always an hot topic, in any DBA and DEV discussion.

    Actually, the question is only a part of the more broader topic of “fragmentation and performance”, that is – IMHO – much  more interesting. It is well known that if you try to insert rows in a table as fast as possible, and you’re using an IDENTITY as your clustered index, this, instead of helping, will pose a barrier to the performance you can reach. A nice whitepaper explain the problem in detail:

    Diagnosing and Resolving Latch Contention on SQL Server

    Not only that, ORM like Hibernate or the Entity Framework loves to work with GUID since in that case you can created values without having to do a round-trip to SQL Server, allowing developers to be more free and independent from SQL Server, build a cleaner and easier to maintain code. (At least this is the common idea, of course I don’t agree with it at all, since it cover only one side of the problem: the developer point of view. But this is another story.)

    Also Microsoft uses it *a lot*. I you ever worked with Dynamics CRM you know that GUID simply pours out of the server. They are everywhere.

    So, are they a good thing or a bad one?

    As usual, forget to have simple, catch-all, answer. Computer science is so wide that it’s almost impossible to have a black or white approach. But luckily we work in a field where numbers and measurement still have a lot of importance, so I did some test to really understand how much fragmentation can impact the performance of a query.

    Here’s the test I did and the results they show.

    Test Setup

    First of all I created two tables

    image

    and loaded them with ostress:

    ostress -Slocalhost -E -dIndexTest -Q"insert into dbo.IndexTest3F default values" -oC:\FT\xxx\TMP\ -q -n70 -r50000

    After that I reindex the table IndexTest3N so that I can do some test with a badly fragmented (IndexTest3F) and a total unfragmented (IndexTest3N) table using the following command:

    alter index pk__3N on dbo.IndexTest3N rebuild with (fillfactor = 50)

    I used fillfactor 50 to try to keep the page count for both table at the same level. As you can notice, and as quite obvious from the row size, this is not possible, since exactly two rows can fit in any page. For some reason, without reindexing the table, the fill factor is lower than expected.


    image

    Unfragmented Table Tests

    To make sure that Buffer Cache was completely empty, I ran the following commands:

    CHECKPOINT
    GO
    DBCC DROPCLEANBUFFERS
    GO

    And than I searched for a value I know it wasn’t in the table, using a non-indexed column, so that a table scan would be the only possible plan. Here’s the results, with I/O and TIME statistics set to on:

    image

    This is a Fast Track machine (32 Cores, no HT, 256 GB Ram, 4 P2000 SANs) where I expect 6GB/Sec of transfer rate. Doing the math, this is the exact throughput I got

    image

    Monitoring the Avg.Disk Bytes/Reads counter, I noticed that SQL Server was able to do I/O up to 512 Kb, as I would expect in this situation: SQL Server is doing an Index-Order Scan, and pages are all in the correct physical order.

    Forcing an Allocation-Order Scan, using a TABLOCK hint, shows the very same results.

    If table is all in memory, so we’re not doing any physical I/O, performance are phenomenal, as one would expect. Scanning of all 13GBs of data, happens to take little more than 100msec, with a CPU a little over 3000 msec.

    image

    So far so good. Now let’s see what happens if we’re using an heavily fragmented table

    Fragmented Table Tests

    As soon as we run the same reference query again, we have some very different results:

    image

    The query took more than 10 seconds instead of the original 2! This means being 5 time slower, with a throughput of less than 2GB/Sec. Due to such high fragmentation, and since it’s using the Index-Order scan strategy, SQL Server was never able to do I/O bigger than 64Kb, bringing the performance bar down a lot.

    Helping SQL Server to do an Allocation-Order scan with a TABLOCK hint helps a bit:

    image

    but still performance a much lower than the one obtained with a non-fragmented table.

    Now, let’s see if something changes also with warm cache. We all know that memory is so fast, when compared to spindles, that fragmentation should not give any problem. Well, the tests shows something very different:

    image

    Of course performance is great here if compared with cold-cache results, after all we’re still getting an answer only after 500msec. But look at CPU time. It’s 16 SECONDS! 16 seconds instead of 3, again, 5 time (and more) more expensive than the non-fragmented table, even when in memory.

    I must admit that I wasn’t expecting such difference. Differences that disappear when going for the Allocation-Order scan, where fragmentation is not key factor anymore, since SQL Server doesn’t follow the double-linked list found at the leaf leave of index pages:

    image

    Other Tests

    I ran the very same test also on a much cheaper machine (the Fast Track machine used here has a cost, in hardware only, of more the 100.000$), not optimized for DWH workload and using SSD as a storage, and without any SAN, but using DAS.

    The test machine was a VM on my notebook, with 2 vCore, 8GB of RAM and Samsung 840 SSD for the storage.

    Results showed the very same behavior.

    Conclusions

    Well, it’s pretty clear that fragmentation has a non-trivial impact on table scan performances. So is a non-sequential GUID such a bad thing? Here’s my opinion, give the test I ran:

    Do It When
    if you need a table where you only need to persist data and you need to have a very high insert rate (persist = insert row and then access only by PK, and only for singleton lookups), then GUID or any other better way to avoid contention in the last page is good. In my test on the Fast Track I could only reach 15.000 Batch/Sec with the usual IDENTITY on the ID column, while I went up to 50.000 Batch/sec with the GUID. In both cases I had 140 concurrent connections (using ostress.exe) doing a very simple INSERT…DEFAULT VALUES and CPU usage were around 20%. With GUID bottleneck was then the PFS page. I only have one filegroup (in addition to the PRIMARY), since this is a DWH machine and was configured to optimize that specific workload, so I’m sure that on a proper OLTP configured machine, the saturation point is much after that value.

    Don’t Do It When
    if you are doing anything else where you cannot foretold and be 100% sure that you’re never going to do a table scan, I wouldn’t recommend the usage of a value (GUID or not) that brings your table to a 99% fragmentation guaranteed. For example if you’re developing and ERP or accounting system, or a website or anything else where the optimization of insert-rate is not your primary goal…well, think twice before adopting GUID blindly.

    I’m sure this won’t be the last post on that subject, but at least there are now some numbers that we can use to discuss on it. And I would really love to see the SQL family act as scientists would do (aren’t we Data Scientists? Winking smile): do that same tests on your machines, post the results and let’s discuss them!

  • Data Warehouse Advanced Modeling Techniques

    I’m yet again pleased to announce that on the next month (19th February 2014) I’ll deliver the Advanced version of our Data Warehouse Workshop with my friend Thomas Kejser. The venue will be Stockholm again (really a lovely city) but this time we’ll be at the

    World Trade Center
    Klarabergsviadukten 70
    107 24 Stockholm, Sweden

    This is the agenda:

    • Logical data models: What are they good for?
      • Comparing Inmon and Kimball
    • Demystifying meta data: Meta Models and auto generated ETL
      • Metadata Driven ETL
    • Understanding modern hardware: Co-Located data in SMP and MPP systems
      • How to create a balanced system
    • Why normalization is bad for data warehouses, and what you can do about it
    • Best Practices, Naming Conventions, Lesson Learned
    • Choosing between column and row stores
    • Keys, master data and mapping layers
    • Tracking history in the warehouse
    • Handling changing user requirements
    • High speed data loading – and reloading
    • Unit Testing

    http://www.sqlserverutbildning.se/ViewEvent.aspx?eventId=11&name=Data+Warehouse+Advanced+Modeling+Techniques

    I can tell you that this is really something unique, just like the one we already did one year ago and the one we delivered at SQL Rally, the blends and summarize years and years of experience, that, if you’re into BI, you should not miss at all.

    See you there!

  • PowerQuery, ExpandTableColumn and the parent data

    I’ve been playing with PowerQuery in the last days and I’ve come across the need of accessing “parent” object data when using the ExpandTableColumn function. The solution is more tricky then one could imagine and I have to say a BIG “THANK YOU” to Miguel Llopis for showing it me.

    Now, let’s say you have a bunch of files in a directory. All files has the same structure (csv, json…anything you want) and you need to import all their content in one PowerPivot table.

    image

    Of course the PowerQuery “From Folder” option is exactly what you need

    image

    And there’s the result

    image

    If you then click on the “Content” header  you will have the content of the files merged in just one resulting table.

    image

    Cool, isn’t it? Now, let’s say they you also want to have the information about the source of your data. Like, for example, the name of the source file, so that you can know from which file the data comes from. Now things become much more complex since there is no object that allows you to access the “parent object” information. In order to solve this, you have to start to use PowerQuery functions (http://office.microsoft.com/en-us/excel-help/learn-about-power-query-formulas-HA104003958.aspx page 11).

    First of all you need to create a function that will do the same operation you just did for all files in a folder, but limiting it only to a specific file, and turn that file in the parameter of such function. Here’s an example to do that:

    (filepath) =>
    let
        Source = Table.FromColumns({Lines.FromBinary(File.Contents(filepath))}),
        SplitColumnDelimiter = Table.SplitColumn(Source,"Column1",Splitter.SplitTextByDelimiter(","),2)
    in
        SplitColumnDelimiter

    the function will take a file, whose path will be passed in the filepath parameter, turn the file into a table and split the values using comma as a delimiter. Since it is a function, you don’t need to have it visible in an Excel sheet, so you can deselect the “Load To Worksheet” option. Let’s give a name to our function and call it DoStuff.

    Now that our function is in place we just have to

    1. Load the folder content again
    2. Create a new column with the file path that will be passed to the freshly created function
    3. Remove all unneeded columns
    4. Create an additional column where we’ll call the created function for each file in the folder
    5. Expand the loaded content into columns

    Translated into PowerQuery language it becomes:

    let
        Source = Folder.Files("D:\Temp\PowerQuery"),
        InsertedCustom = Table.AddColumn(Source, "File Path", each [Folder Path] & [Name]),
        RemovedOtherColumns = Table.SelectColumns(InsertedCustom,{"File Path"}),
        InsertedCustom1 = Table.AddColumn(RemovedOtherColumns, "Custom", each DoStuff([File Path])),
        #"Expand Custom" = Table.ExpandTableColumn(InsertedCustom1, "Custom", {"Column1.1", "Column1.2"}, {"Custom.Column1.1", "Custom.Column1.2"})
    in
        #"Expand Custom"

    and here’s the result:

    image

    Enjoy!

  • SQL Rally Nordic & Amsterdam slides & demos

    Last week I had the pleasure to speak at two GREAT conferences (as you can see from the wordcloud I’ve posted, here for Stockholm and here for Amsterdam. I used two different filtering techniques to produce the wordcloud, that’s why they look different. I’m playing a lot with R in these days…so I like to experiment different stuff).

    The workshop with my friend Thomas Kejser on “Data Warehouse Modeling – Making the Right Choices” and my sessions on “Automating DWH Patterns through Metadata” has been really appreciated by attendees, give the amount of good feedback I had on twitter and on some blog posts (Here and here).

    Of course many asked for slides & demos to download, so here you are!

    Automating DWH Patterns through Metadata – Stockholm
    http://sdrv.ms/1bcRAaW

    Automating DWH Patterns through Metadata – Amsterdam
    http://sdrv.ms/1cNDAex

    I’m still trying to understand if and how I can publicly post slides & demos of the workshop, so for that you have to wait a couple of days.

    I will post about it as soon as possible. Anyway, if you were in the workshop and would like to get the slide & demos ASAP, just send me an email, I’ll happily sent the protected link to my skydrive folder to you.

    Enjoy!

  • SQL Rally Pre-Con: Data Warehouse Modeling – Making the Right Choices

    As you may have already learned from my old post or Adam’s or Kalen’s posts, there will be two SQL Rally in North Europe.

    In the Stockholm SQL Rally, with my friend Thomas Kejser, I’ll be delivering a pre-con on Data Warehouse Modeling:

    Data warehouses play a central role in any BI solution. It's the back end upon which everything in years to come will be created. For this reason, it must be rock solid and yet flexible at the same time. To develop such a data warehouse, you must have a clear idea of its architecture, a thorough understanding of the concepts of Measures and Dimensions, and a proven engineered way to build it so that quality and stability can go hand-in-hand with cost reduction and scalability. In this workshop, Thomas Kejser and Davide Mauri will share all the information they learned since they started working with data warehouses, giving you the guidance and tips you need to start your BI project in the best way possible―avoiding errors, making implementation effective and efficient, paving the way for a winning Agile approach, and helping you define how your team should work so that your BI solution will stand the test of time.

    You'll learn:

    • Data warehouse architecture and justification
    • Agile methodology
    • Dimensional modeling, including Kimball vs. Inmon, SCD1/SCD2/SCD3, Junk and Degenerate Dimensions, and Huge Dimensions
    • Best practices, naming conventions, and lessons learned
    • Loading the data warehouse, including loading Dimensions, loading Facts (Full Load, Incremental Load, Partitioned Load)
    • Data warehouses and Big Data (Hadoop)
    • Unit testing
    • Tracking historical changes and managing large sizes

    With all the Self-Service BI hype, Data Warehouse is become more and more central every day, since if everyone will be able to analyze data using self-service tools, it’s better for him/her to rely on correct, uniform and coherent data. Already 50 people registered from the workshop and seats are limited so don’t miss this unique opportunity to attend to this workshop that is really a unique combination of years and years of experience!

    http://www.sqlpass.org/sqlrally/2013/nordic/Agenda/PreconferenceSeminars.aspx

    See you there!

  • When To Break Down Complex Queries

     

    Some days ago my SolidQ colleague Luca  pointed me to the followin SQLCAT article

    When To Break Down Complex Queries
    http://blogs.msdn.com/b/sqlcat/archive/2013/09/09/when-to-break-down-complex-queries.aspx

    that, as part of the reorganization of SQL CAt website, has emerged again from the depth of Internet.

    The article is very interesting and also allows me to underline again the importance of having a way to tell the optimizer when it should *not* expand subqueries into the main query. In this way the usage of #temp tables could be avoided in many cases, allowing for the usage of views or inline UDF instead of stored procedures.

    I opened the connect item back in 2010, so it’s time to bring more attention to it in order to have it in the product ASAP, and this article give me the option to do that.

    Please vote for it:

    https://connect.microsoft.com/SQLServer/feedback/details/533766/make-noexpand-hint-usable-also-with-cte

    Now that data volumes are increasing every day, more and more I find myself fighting with bad query plans due to bad cardinality estimation. When you are joining several big tables, you can be sure you’ll have to break your query in several pieces, to be sure to have good performance. Last time I had to do this was….the day before yesterday!

    I think it’s really time for an extended NOEXPAND hint. If you’re with me, vote vote and vote!

More Posts Next page »

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement