If you keep an eye on the tech industry you cannot have failed to notice that an Open Source technology called Hadoop is getting lots of coverage these days. You’ll be able to find lots of descriptions around the web about what Hadoop actually is so I won’t try and come up with one myself other than to say its a technology for storing and querying over large sets of data.
Microsoft recently released a preview of their Hadoop offering which they are calling HDInsight; it can be installed on your own hardware (HDInsight Server) or you can sign up temporarily for their Azure-based offering (HSInsight Service). Learn more at http://www.microsoft.com/bigdata.
Cindy Gross has provided an excellent step-by-step tutorial on how to store data on HDInsight (either via the service or server) and query it in her blog post Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View. A few days ago I downloaded HDInsight server and followed Cindy’s instructions for installing it, inserting data, querying it and then building a Power View report atop it and the whole thing took less than two hours. Hence, if you’re looking to get familiar with Hadoop and you’re primarily a Windows guy like me then this is a really quick and easy (and free) way of going about it.
Cindy’s tutorial demonstrates how to combine data from Hadoop with data from SQL Server using PowerPivot and the output of that combined data is a Power View report:
My attempt is available on SkyDrive at http://sdrv.ms/Zhv4tB although be aware that Power View doesn’t work on SkyDrive so you’ll have to download the containing Excel workbook and open it in Excel 2013 if you want to see anything useful.
As I said above the steps to installing HDInsight, sticking data in it, and querying it were very easy thanks to Cindy’s thorough blog post and hence I was able to make some observations about this murky world of Hadoop that I’ve been hearing so much about.
Firstly, I gather that the “normal” way of querying Hadoop is to write programs in Java however a technology called Hive provides an abstraction over Hadoop that can make the data therein appear as a table with rows and columns just as those of us immersed in the RDBMS world are used to. For example, the data provided by Cindy was provided as a text file:
which literally gets stored in Hadoop as-is. Hive allows us to surface that file as a table using the following syntax:
CREATE EXTERNAL TABLE census (
, County_FIPS int
, Population bigint
, Pop_Age_Over_69 bigint
, Total_Households bigint
, Median_Household_Income bigint
, KeyID string
COMMENT 'US Census Data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/demo/census';
Its fairly intuitive to look at this CREATE EXTERNAL TABLE definition and understand exactly what Hive is doing. Its taking a text file whose “columns” are delimited by tabs and defining data types for those columns.
There also exists an ODBC driver for Hive that enables one to query the data from anything that can talk ODBC (e.g. Excel).
I have often heard it said that Hadoop is for querying over unstructured data however, to me, that text file has got lots of structure to it; its got column delimiters and row delimiters, data in each row is stored consistently. That observation led me to question the use of the word "”unstructured” and I asked Cindy to clarify what that word meant in this context. She replied that its more a question of when structure is applied (in this case its being applied by Hive after the data is stored) and she correctly pointed out that a different structure could have been defined in the CREATE EXTERNAL TABLE command without modifying the underlying data.
If nothing else this tutorial helped to crystalize some of the marketing fluff that gets bandied about regarding Hadoop. Yes, you can probably store anything you like in there (hence “unstructured”) however in order to get any value from it you’re going to have to apply structure at some point even if that’s as simple as breaking a document into all the individual words by specifying a whitespace delimiter.
That’s a short rundown of what was an equally short investigation. If Hadoop is something that interests you then downloading HDInsight Server and following Cindy’s tutorial might be a good place to start.