I’ve been very quiet on the blogging front of late. There are a few reasons for that but one of the main ones is that I’ve spent the past three months on a new gig immersing myself in Hadoop, primarily in Microsoft’s Hadoop offering called HDInsight. I’ve got a tonne of learnings that I want to share at some point but in this blog post I’ll start with a handy little script that I put together yesterday.
I’m using a tool in the Hadoop ecosystem called HBase which was made available on HDInsight in preview form about a month ago. HBase is a NoSQL solution intended to provide very very fast access to data and my colleagues and I think it might be well suited for a problem we’re currently architecting a solution for. In order to evaluate HBase we wanted to shove lots of meaningless data into it and in the world of HDInsight the means of communicating with your HDInsight cluster is Powershell. Hence I’ve written a Powershell script that will use HBase’s REST API to create a table and insert random data into it. Likely if you’ve googled this post then you’re already familiar with Hadoop, HDInsight, REST, Powershell, HBase, column families, cells, rowkeys and other associated jargon so I won’t cover any of those, what is important is the format of the XML payload that has to get POSTed/PUTted up to the REST API. That payload looks like this:
<?xml version="1.0" encoding="UTF-8"?>
The payload can contain as many cells as you like. When the payload gets POSTed/PUTted the values therein need to be base64 encoded but don’t worry, the script I’m sharing herein takes care of all that for you. The script will also create the table for you. The data that gets inserted is totally meaningless and is also identical for each row, modifying the script to insert something meaningful is an exercise for the reader.
Another nicety of this script is that it uses Invoke-RestMethod which is built into Powershell 4. You don’t need to install other Powershell modules, nothing Azure specific. If you have Powershell 4 you’re good to go!
Embedding code on this blog site is ugly so I’ve made it available on my OneDrive: CreateHBaseTableAndPopulateWithData.ps1 Screenshot below gives you an idea of what’s going on here.
Hope this helps!
UPDATE. I’ve posted a newer script CreateHBaseTableAndPopulateWithDataQuickly.ps1 which loads data in much quicker. This one sends multiple rows in each POST and hence I was able to insert 13.97m rows in 3 hours and 37 minutes which, given latency to the datacentre and that this was over a RESTful API, isn’t too bad. The previous version of the script did singleton inserts and hence would have taken weeks to insert that much data.
The number of POSTs and the number of rows in each POST are configurable.