Today's the last day of the Summit, and Rick Heiges introduced Rob Farley and Buck Woody, who sang Rob's "Query Sucks" song. As is everything done by these two, it was way too much fun. Rick also did a retrospective of Wayne Snyder, PASS Immediate Past President. Wayne recognized Rick, who's rolling off the board this fall as well. Wayne ended with the wish that "as you slide down the banister of life, may the splinters of success stick in your career."
SQL Rally Nordic is now sold out! SQL Rally Dallas will take place May 10 and 11. Many SQL Saturdays are already on the schedule, and more are coming. PASS Summit 2012 will take place November 6-9, 2012, with two precon days on November 5 & 6. All PASS attendees are getting an ebook copy with four chapters from the SQL Server MVP Deep Dives (Volume I) as a thank you for attending. Rick then introduced Dr. David DeWitt, to talk about Big Data.
Dr. DeWitt first introduced Rimma Nehme, who's part of his team at the Jim Gray Systems Lab in Madison, Wisconsin.
Big Data breaks down to massive volumes of records, whether recorded in relational databases or not. By 2020, we'll be managing data in the range of zettabytes, with estimates of around 35 ZB. Automated sources are generating the incredible quantities we're anticipating, and the dramatic drop in the cost of hardware is the reason we can keep so much more of that data.
eBay uses a parallel data system, averaging about 10 PB on 256 nodes, while Facebook and Bing use NoSQL systems, managing 20 PB and 150 PB, respectively. He told us that NoSQL doesn't mean that SQL is dead; it means Not Only SQL. Other systems will manage data alongside relational systems. NoSQL offers more data model flexibility, relaxed consistency models such as eventual consistency, low upfront costs, etc.
The NoSQL model stores data on arrival, without data cleansing or staging, and evaluates the data as it arrives. It can use a key/value store, as in MongoDB, Couchbase, etc., where the data model is very flexible and supports single-record retrievals through a key, or it can use other systems like Hadoop, which Microsoft is now supporting. Relational systems impose a structure, where NoSQL uses an "unstructured" model. Relational systems provide maturity and reliability, and NoSQL systems provide flexibility.
This is NOT a paradigm shift. SQL is NOT going away. (The move from CODASYL to relational in the 70s was a paradigm shift.) Businesses will end up with data in both systems.
Big Data started at Google, because they had massive amounts of click stream data that had to be stored and analyzed. It had to scale to petabytes and thousands of nodes. It had to be fault tolerant and simple to program against. They built a distributed file system called GFS and a programming system called MapReduce. Hadoop = HDFS + MapReduce. Hadoop and MapReduce make it easy to scale to large amounts of data, with fault tolerance and low hardware and software costs.
HDFS underpins the entire ecosystem. It's scalable to 1000s of nodes, and assumes that failures are common. It's write once, read many times, uses a traditional file system underneath, and is highly portable. A large file is broken up into 64MB blocks, which are stored separately on the native file system. The blocks are replicated so that hardware failures are handled: block 1, after being written on its original node, will also be stored on two additional nodes (2 and 4). This allows a high level of fault tolerance.
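To make the block-and-replica idea concrete, here's a minimal sketch in Python. It is not HDFS code: the function names and the round-robin placement are my own assumptions for illustration (real placement is more sophisticated), and only the 64MB block size and the three copies per block come from the talk.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted in the talk
REPLICATION = 3      # the original copy plus two additional nodes

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many blocks are needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(num_blocks, num_nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes.

    Hypothetical round-robin policy, chosen only to keep the sketch short.
    """
    if replication > num_nodes:
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [(block + i) % num_nodes for i in range(replication)]
        for block in range(num_blocks)
    }

def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one copy on a surviving node."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in nodes)
        for nodes in placement.values()
    )

if __name__ == "__main__":
    blocks = split_into_blocks(200)              # a 200MB file
    placement = place_replicas(blocks, num_nodes=5)
    print(blocks, placement)
    print(readable_after_failure(placement, failed_nodes=[0, 1]))
```

With three copies of every block, any two nodes can fail and the file is still readable, which is the fault tolerance described above.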
Inside Hadoop there's a name node, with one instance per cluster. There's also a backup node in case the name node fails, and there are data nodes. In HDFS the name node is always checking the state of the data nodes and ensuring that the data nodes are alive and balanced. The application has to send a message to the name node to find out where to put the data it needs to write. The name node will report where to place the data, but then gets out of the way and lets the application manage the data writes. Data retrieval is similar in that the application asks the name node where the data lives, then gets it from the nodes where it's written.
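The write and read conversation described above can be sketched in a few lines of Python. Everything here is a toy model under my own assumptions: the class and method names (`NameNode.allocate`, `NameNode.locate`, `DataNode`) are hypothetical and are not the Hadoop API. The point is only that the name node hands out locations while the client moves the data itself.

```python
class DataNode:
    """Toy data node: holds block contents and an alive/dead flag."""
    def __init__(self):
        self.blocks = {}
        self.alive = True

class NameNode:
    """Toy name node: keeps metadata only, never the data itself."""
    def __init__(self, data_nodes, replication=3):
        if replication > len(data_nodes):
            raise ValueError("need at least as many data nodes as replicas")
        self.data_nodes = data_nodes      # name -> DataNode
        self.replication = replication
        self.block_map = {}               # (file, block index) -> node names
        self._next = 0

    def allocate(self, filename, index):
        """Tell the client which nodes should receive this block."""
        names = sorted(self.data_nodes)
        chosen = [names[(self._next + i) % len(names)]
                  for i in range(self.replication)]
        self._next += 1
        self.block_map[(filename, index)] = chosen
        return chosen

    def locate(self, filename, index):
        """Tell the client which nodes hold this block."""
        return self.block_map[(filename, index)]

def write_file(name_node, filename, blocks):
    """Client side: ask where to write, then write to the data nodes directly."""
    for index, data in enumerate(blocks):
        for node_name in name_node.allocate(filename, index):
            name_node.data_nodes[node_name].blocks[(filename, index)] = data

def read_file(name_node, filename, num_blocks):
    """Client side: ask where each block lives, read from the first live copy."""
    result = []
    for index in range(num_blocks):
        for node_name in name_node.locate(filename, index):
            node = name_node.data_nodes[node_name]
            if node.alive:
                result.append(node.blocks[(filename, index)])
                break
        else:
            raise IOError(f"no live replica for block {index} of {filename}")
    return result

if __name__ == "__main__":
    nodes = {f"dn{i}": DataNode() for i in range(4)}
    nn = NameNode(nodes)
    write_file(nn, "clicks.log", ["a", "b", "c"])
    nodes["dn0"].alive = False            # simulate a failed data node
    print(read_file(nn, "clicks.log", 3))
```

Because the name node only answers "where" questions, the data itself never flows through it.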
Failures are handled as an intrinsic part of HDFS. The multiple writes ensure that the data is stored on nodes on multiple devices, so that even after a rack or switch failure the data can still be reached on another device that's available. When additional hardware is added, the data nodes are rebalanced to make use of it. HDFS is highly scalable and doesn't make use of mirroring or RAID, but you have no clue where your data really is.
MapReduce is a programming framework to analyze the data sets stored in HDFS. Map pulls in the data from the smaller chunks, then Reduce analyzes the data against each of the smaller chunks until the work is done. There's a JobTracker function which manages the workload, then TaskTracker functions which manage the data analysis against all the blocks. The JobTracker task lives on top of the Name Node, and the TaskTracker tasks live on the systems with the Data Nodes.
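Here's a single-process Python sketch of the map and reduce phases, using the usual word-count illustration (my choice of example, not one from the talk). The function names are hypothetical, and the grouping step in the middle, commonly called the shuffle, is something a real framework does for you across the cluster.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(mapped_outputs):
    """Group all emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)

def run_job(blocks):
    """Run map over every block, group by key, then reduce each group."""
    mapped = [map_phase(block) for block in blocks]   # one map task per block
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(run_job(["big data big", "data nodes"]))
```

In a real cluster each map task would run on the node that already holds its block, which is why the TaskTrackers sit alongside the data nodes.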
The actual number of map tasks is larger than the number of nodes in the cluster. This allows other map tasks to pick up the work of tasks that fail. Failures are detected by master pings. MapReduce is highly fault tolerant, relatively easy to write and removes the burden of dealing with failures from the programmer. The downside is that the schema is embedded in the application code. There is no shared schema, and there's no declarative query language. Both Facebook's Hive language and Yahoo's Pig language use Hadoop's MapReduce methodology in their implementations.
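A rough Python sketch of that idea: more tasks than nodes, and tasks moved off any node that stops answering. The function names and the round-robin policy are assumptions of mine, and the "ping" is modeled simply as a set of nodes that responded, so treat this as an illustration rather than how Hadoop's scheduler actually works.

```python
def assign_tasks(tasks, nodes):
    """Spread tasks across nodes round-robin; there are more tasks than nodes."""
    return {task: nodes[i % len(nodes)] for i, task in enumerate(tasks)}

def reassign_failed(assignment, nodes, responding):
    """Move tasks from nodes that did not answer the ping onto healthy nodes."""
    healthy = [node for node in nodes if node in responding]
    if not healthy:
        raise RuntimeError("no healthy nodes left to run tasks")
    result = {}
    moved = 0
    for task, node in assignment.items():
        if node in responding:
            result[task] = node                           # unaffected
        else:
            result[task] = healthy[moved % len(healthy)]  # re-run elsewhere
            moved += 1
    return result

if __name__ == "__main__":
    nodes = ["n0", "n1", "n2"]
    tasks = [f"map{i}" for i in range(6)]
    first = assign_tasks(tasks, nodes)
    # Simulate n1 failing to answer the master's ping.
    print(reassign_failed(first, nodes, responding={"n0", "n2"}))
```

The programmer writing the map function never sees any of this, which is the "removes the burden" part.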
Hive introduces a richer environment than pure MapReduce and approaches standard SQL in functionality. Facebook runs 150K jobs daily, and maybe 500 of those are pure MapReduce applications; the rest are HiveQL. In a side-by-side comparison, a couple of standard queries in Hive ran about 4 times longer than the same queries using Microsoft's PDW (next release, not yet available).
Sqoop provides a bridge between the world where unstructured data exists and the structured data warehouse world. It's a command line utility to move data between those two worlds. Some analyses are hard to do in a query language and are more appropriate for a procedural language, so moving data between them makes sense. The problem with Sqoop is that it's fairly slow.
The logical answer is to build a data management system that understands both worlds. Dr. DeWitt terms this kind of system an Enterprise Data Manager. Relational systems and Hadoop are designed to solve different problems. The technologies complement each other, and each is best used where appropriate.
It's so wonderful that PASS brings Dr. DeWitt to help us get back to the fundamentals behind what we do every day. I love the technical keynotes and really wish Microsoft would learn that marketing presentations aren't why we're here.
It's time to get on to the next session, but this has been a great PASS Summit.