Welcome to Part 6 in the “So You Want To Learn BI” series! The first five posts in the series were on the old blog at blogspot – you can find them here. This is my first post on SQLBlog. All future posts will be made here, so please subscribe to this blog or to the main SQLBlog feed if you want to follow along!
Data Mining. Probably the most overused and abused term in modern computing. Not really. But it’s overused an awful lot, and very few people understand what data mining really is. Data mining is not comparable to anything we’ve looked at so far.
With OLTP queries, we’re answering a very specific question – for instance, how many units of a given product were sold, by territory, for the last year. OLAP (Analysis Services) makes this sort of querying significantly more accessible by presenting the dataset to a business user in a format that they can easily consume. Said business user can then pivot and slice the data any way they wish. This is great, but it still relies on said user actually knowing what they’re looking for. What happens when we’re looking for an answer to a more open-ended question? For example, let’s say that you’re “mining” a dataset containing retail purchase information. Using OLAP or OLTP, you can ask “how many copies of Guitar Hero were sold, by country, over the last 12 months?”, and you’d find out. But if you wanted to answer the question….who buys the most Guitar Hero…not just by region….but by every demographic that we collect – well, that’s where data mining kicks in. You see, data mining employs algorithms to find patterns in your data. Using one particular data mining algorithm (called a clustering model), you can indeed find out that 42 year old men from Wisconsin play way, way too much Guitar Hero.
Data mining involves a number of different types of mining models (to mine for different sorts of patterns), and the Self-Paced Training Kit (see the rest of the posts in this series if you don’t know what I’m talking about) does a great job of introducing them. You go through the process of creating and training models, and then you use the data mining viewers built into BIDS to explore the patterns that your mining models generated. The authors show you how to measure the accuracy of your mining models, and how to select the best model for a given application.
More than any other chapter in this book, I found the data mining one – to be brief and a little melodramatic – riveting. Truly extracting value from the data we work so hard to collect and maintain is like Christmas for database geeks. More and more businesses are realizing that employing Business Intelligence is required to survive and thrive in today’s world – as a database professional, it’s in your best interests to be able to hold your own in a discussion on the subject.
As with any technology, you don’t implement Data Mining “just because you can”. Unless you work in a retail or online sales environment, you may find it hard to apply some of the mining techniques you learn in this lesson to solve problems in the industry that you work in. If this is the spot that you find yourself in (as I have), I recommend you pick up a copy of Smart Business Intelligence Solutions with Microsoft SQL Server 2008.
I’ll post a review for that book at a later date, but in a nutshell it takes you through the process of designing and building real-life BI solutions using the SQL Server BI toolset. Even with this book, you may not find an application for data mining in your environment – and that’s OK. Being familiar with data mining and the types of conclusions that can be drawn using mining algorithms means that you’ll have an ace up your sleeve when the right situation comes along. Such situations can crop up in the oddest of places – for instance, check out SQL MVP Brent Ozar’s excellent post on Mining Perfmon Output using Excel and Analysis Tools for the Cloud.
That’s all for this time. The next (and final) post in this series will be on SSRS. See you soon!