There is always a threat from somewhere. Old-style DBAs are standing on the shaky ground.
5 years ago I remember myself defending the SQL Server against the domineering Oracle DBAs. The arguing was always about “who is better”. The Management Studio was a great weapon against sqlplus. And you could always use a “the DDLs are not a part of the transaction “ thing. Without any Oracle DBAs around I must admit that Oracle is pretty amazing product although it requires a huge deal of tuning.
Then came an era of NoSQL. It’s not easy to compete against such fast loads, easy scale-up and fast development. The NoSQL solutions support the usecases that do not really fit into the relational model, like graphs or unstructured data. On the other hand, the eventual consistency thing does not fit for everyone, the joins between the entities are hard and there are stability problems with many NoSQL products. Twitter apparently has been trying to move from MySQL to Cassandra for over a year. In August 2011 Foursquare reported an 11-hour downtime because of the MongoDB.
A few years ago a new trend/treat arrived to our lands. Analytical databases.
There are quite a lot of quite expensive products positioning themselves as a revolutionary technology that can easily handle an overwhelming amount of Big Data querying and analytical processing, self-tuning, making no need for query tuning, prepared aggregations or calculations. There is no locking or logging overhead therefore data loading into the analytical DBMS is declared to be extremely fast. On the other hand, those DBMS admit that they are not built for the OLTP loads, massive updates and deletes are their weak point.
The Data Warehouse team at my company has got to the decision that they need such analytical database solution. The vision is to have a scalable self-tuning database that will ingest terabyte or two on a weekly basis and will allow to any number of the concurrent users capability to execute an ad-hoc analytical queries on any period of time.
I strongly believe that with the proper database design SQL Server 2012 can provide good performance under the heavy analytical workload. However, I was really curious to understand how the analytical databases achieve their self-tuning performance magic. There is nothing better than the competitors to show us where we have weaknesses that need some strengthening.
First of all, here is a great map of all the technologies from the Big Data puzzle http://www.bigdatamap.net/technologies/ .
In order to get a better understanding of the strong and weak points of the analytical databases world, I will drill down a little bit into the technologies that are standing behind the buzzwords.
There are 4 categories of products inside the analytic databases part of the Data Repositories world:
- In-Memory solutions (like SAP HANA, XtremeData, Kognitio etc.):
- When querying the DBMSs from this group, you are working primarily against the tables in the memory. Those DBMSs are the fastest solution in the analytical databases world. In-memory DBMS primary relies on the memory optimization algorithms which are much simpler than the disk access. There are no locking/ concurrency issues and queries are amazingly fast. With those systems you can expect sub-second results in the truly ad-hoc DWH workload with unpredictable queries coming through, sometimes in a size of 6 pages of the code. I have asked whether there are partitions on the tables and the Kognitio representative was quite surprised. “What for?” – he asked me.”Well, how do you delete the old data?”. “You just write a DELETE statement and data gets deleted” – he answered. That’s it. Just use DELETE statement.
- There are several flavors of the in-memory products, each requires a different memory to data ratio. For instance, SAP HANA loads 100% of the database into the memory which makes this solution extremely expensive for the databases with hundreds of terabytes. Kognitio, on the other hand, allows you to choose which data goes into the memory, usually 20% - 40%, and the rest of the data stays in the persistency layer. For instance, if your queries usually access only last month data, you can have sliding window of this data in the memory and, in case any unpredicted query comes in, the system will load the missing data off the disc. However, on each cluster node 60% of memory will be used for data and the other 40% for the users activity and temporary calculations. When summed up you get to the quite pricey solution anyway.
- All products in this category save the data to the disk as well. This persistency layer can be their proprietary files, Hadoop or any other DBMS.
- One might think it will take time after the server downtime to get all data back into the memory. Kognitio states that it will take only 5 min for each 10 TB of data.
- Watch out for some limitations coming along with in-memory architecture. Some products are not available during the maintenance tasks, like backups, or new nodes additions ( ExtremeData) and should be taken down. This does not suit the systems that must be up 24/7.
- Disk- oriented solutions (Vertica, Greenplum, Netezza, SybaseIQ, VectorWise, Microsoft PWD, etc.)
- Disk oriented products are very well optimized for working with the data on the disk. The main trend in this category is about storing the data in the columnar way which allows high compression ratio. However, some products are row-based like Netezza. Microsoft PDW ( V2 which is based on SQL 2012) allows both flavors of storage, rowbased and columnar. Their release is actually in February 2013 so we'll see how they will compete with the others.
- Disk oriented solutions usually do not rely that there is a lot of RAM.
- Since the world around us is row-based, all the columnar products are loading their data from the row-based files. They inject the incoming data into some write-optimized row store. The background process afterwards is converting the inserted data into the read-optimized sorted columnar storage.
- Due to the fact that each column in columnar table is being stored in the separate file, the smaller the amount of requested columns the faster the queries perform. The wider queries usually require some sort of manual optimizations. Vertica, for instance is keeping several copies of the table, called projections. The projections, like covering indexes, contain columns grouped together or columns from different tables (for faster joins), nullifying the benefits of the compression. This is a hybrid columnar/row-based storage solution.
- Hybrid DBMS products ( ParAccel, Exasol, etc. )
- Those products are built on any disk-oriented technology, usually columnstores, but provide the best performance given as much RAM as they can get. For instance ParAccel submitted tcp-h benchmark on a system with 288% data/memory ratio.
- Daemons running on the Hadoop nodes. (Claudera Impala)
- Kind of HIVE alternative, those solutions are running on the same physical cluster as Hadoop creating a seamless environment. Kognitio is considering such architecture as a future enhancement of their product.
- Claudera Impala is a public beta, it goes out sometime in 2013. According to some web forum, the fastest Impala queries take > 1 second. Impala was tested and survived a load of 100 concurrent queries. Another test showed Impala running 10 cloned copies of a query with 25%~ish performance degradation.
If you are evaluating an analytical database solution for your Big Data, the features below can help you to understand better the different flavors of the products:
- Product scalability. Scalability is about whether the product is MPP (massively parallel processing) system or single node system. MPP systems usually are built as a cluster of commodity servers where the tables data distributed across the cluster according to the administrator’s choice, like some sort of partitioning. Cluster nodes combined into a single very powerful analytic machine have vast amounts of processing power. New nodes can be easily added to the cluster when needed. Vertica, ParAccel, Exasol, Greenplum are the MPP solutions. Sybase IQ and VectorWise are single node systems though VectorWise promises that there is a MPP version coming out soon.
- Single point of failure. While exploring an MPP product, it’s good to know if it’s share-nothing system or there is a main node (ParAccel, Netezza or Greenplum) that is in charge of producing query plans and processing the serial parts of the execution plans. For instance, the count distinct operation requires to run distinct sort again on the merged results. The main node might become a bottleneck in the future when the system load increases.
- Replication factor/data redundancy. Most MPP solutions are replicating every subset of the data to an additional node for the redundancy. In case any node goes down, there is always another copy of the data which means that the cluster continues to work. The question is which kind of algorithm they use to store the second copy. Two phaze commit is too pricey because it slows down the inserts. And most of the products use optimistic approach. Each one implements that approach differently but the data is usually being replicated by the background process. What happens if the node failed after the data was inserted and before it has got replicated ?
- Watch out for the limitations. VectorWise can support databases up to 30TB on single node. Another common limitation for the columnar solutions is character data type length. SAP HANA has the 2 billion records limit in a single partition.
- Data Encryption. Most of the analytical databases that I have looked into did not have a data encryption feature.
- Row to columnar conversion. The freshly loaded data querying is definitely worth a check. Especially on the column storage solutions. Your queries will probably be merging data from the write optimized row-based storage and column based files.
- Data modifications. Vertical and Exasol are locking the entire table during the row modification operations ( updates/deletes) locking other modifying processes. I must note here that this does not impact read flows against that table.
- Management tool. All this can sound great till you realize that your main tool will be the command line like in ParAccel. In such case you will probably need to purchase some third party management tool, like Toad. Make sure there is an ODBC/JDBC provider.
- Languages support. Analytical solution should usually support ANSI SQL including analytical extensions. Many products support other languages, like R, JAVA, MDX etc.
- Development/QA environment – It’s worth a check whether there it is possible to restore the production multi-node database to single-node server for development or testing purposes.
- Hadoop connectivity. Many products have implemented this, it’s quite a useful feature, called External tables. This is very tight products integration, making Hadoop data visible for the user queries without pre-loading it. Although, integration might be not so tight for some products: under the hood they simply connect to Hadoop using the Hive and performance is slow in such cases. Kognitio, on the other hand, have an agent installed on the Hadoop cluster nodes and thus can “push down the predicates” – ask Hadoop to filter the data prior to loading.
- Number of customers. If there are only 10 customers behind the revolutionary technology, would you consider paying a huge amount of money for it?
Machine generated data grows at the amazing speed therefore the requirements from the DBMS products are changing. SQL Server is evolving as well having a new columnstore index feature inside SQL 2012. In the next version more features from the analytical databases world will be implemented, like in-memory technology <Hekaton> , Hadoop External tables feature <Polybase> and true updatable columnnar data storage implemented by using new column-based clustered index.
Any software solution has it’s strength and there is no fit-for-everyone solution. There are many shades of grey besides the black and white colors.
We are starting the POC for the several analytical DBMS. I will surely test the same load on SQL 2012. I accept the architectural limitation that it’s not scalable. I also understand that before Hekaton comes to town, SQL Server cannot compete with the in-memory products. However, I cannot wait to compare the performance of the columnstore index against the other columnstore products.
Are there any performance benchmarks between the columnar non-mpp product and SQL 2012 that you were a part of ?