Update on Stinger: the view from a Microsoft Committer
This guest post from Eric Hanson, Principal Software Development Engineer on Microsoft HDInsight, and Apache Hive committer.
Hive has a substantial community of developers behind it, including a few from the Microsoft HDInsight team. We’ve been contributing to the Stinger initiative since it was started early in 2013, and have been contributing to Hadoop since October of 2011. It’s a good time to step back and see the progress that’s been made on Apache Hive since fall of 2012, and ponder what’s ahead.
Hive has a lot going for it with respect to both functionality and scalability. The external table model of Hive, input adaptors for many file formats, and the on-by-default UDF support and large base of Java code that can be applied in UDFs make it very attractive for data transformation applications. For non-traditional analysis, the ability to embed custom Java mappers and reducers inside Hive SQL queries is also quite useful. Hive’s SQL language coverage has expanded to include much of SQL-92 and some SQL-99 OLAP extensions. And it scales to thousands of nodes because of its integration with Hadoop and HDFS. But it’s been criticized for being slow – more specifically for having a slow inner loop that used to process rows on the order of 100X slower than a state-of-the-art query executer. Hive has been a favorite whipping boy when it comes to performance. Look around and it’s not hard to find statements like “Our <database-or-big-data-system-name> can run SQL queries <number> times faster than Hive.”
This is changing. Over the last 15 months or so, the following big things have happened with Hive to improve performance:
- ORC: ORC is a high-quality columnstore that can compress data around 10 times or more. It has the columnstore virtues people have come to expect, such as good compression and the need to only read columns your query touches.
- Vectorized query execution: Hive now supports vectorized query execution. This technology can reduce CPU costs in the inner loop of query execution by 10X or more. It was popularized by the MonetDB/X100 project and has made its way into the top-performing data warehouse (DW) DBMSs, including Microsoft SQL Server and PDW. Vectorized query execution is a once-in-a-generation technological breakthrough. Any DW DBMS (proprietary or open source, integrated with Big Data or not) that does not have it is not in the game. You wouldn’t sell an airliner with piston engines today would you?
- Tez: Hive on Tez reduces data communication costs between nodes and query phases tremendously by allowing more general data flows and reducing spooling to disk.
- Container re-use: This allows processes to be re-used in the same query or user session, reducing start-up overhead. It’s a little like multi-threading. I’ve seen results that allow queries to finish in under 10 seconds, down from over 20 seconds, with this turned on, on a 20 node cluster.
What this means is that you need to verify statements of the form “<systemname> is <X> times faster than Hive” carefully because the code in the Hive trunk today is an order of magnitude faster (sometimes more) than it was 15 months ago. Here’s an example from Hortonworks. The left bar is Hive 10, the middle bar is Hive 11 with ORC, and the right is the latest Hive trunk. These results are at scale factor 20 (approximately 200GB of data).
As you can see, for this query, Hive has moved from the “I’ll go for coffee while I run this query” stage to the “I don’t mind waiting for my answer” stage.
Even with this progress, Hive still has room for improvement. The biggest things it’s missing from a query execution performance perspective are:
- the ability to run tasks with low startup overhead using threads rather than heavy-weight processes
- an in-memory data cache or buffer pool to reduce or eliminate I/O
Hive is already attractive because of its functionality, ability to scale, established community and user base, and open source distribution. When the enhancements of the last 15 months get into production, its performance on a per-node basis won’t be too bad. Add in light weight scheduling and in-memory caching, and it can be downright good. Then Hive will be poised to grab the whip away and hit back.
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.