This is a guest post from Eric Hanson, Principal Software Development Engineer on Microsoft HDInsight and Apache Hive committer.
Hive has a substantial community of developers behind it, including a few from the Microsoft HDInsight team. We’ve been contributing to the Stinger initiative since it was started early in 2013, and have been contributing to Hadoop since October of 2011. It’s a good time to step back and see the progress that’s been made on Apache Hive since fall of 2012, and ponder what’s ahead.
Hive has a lot going for it with respect to both functionality and scalability. The external table model of Hive, input adaptors for many file formats, on-by-default UDF support, and the large base of Java code that can be applied in UDFs make it very attractive for data transformation applications. For non-traditional analysis, the ability to embed custom Java mappers and reducers inside Hive SQL queries is also quite useful. Hive’s SQL language coverage has expanded to include much of SQL-92 and some SQL-99 OLAP extensions. And it scales to thousands of nodes because of its integration with Hadoop and HDFS. But it’s been criticized for being slow, more specifically for having a slow inner loop that used to process rows on the order of 100X slower than a state-of-the-art query executor. Hive has been a favorite whipping boy when it comes to performance. Look around and it’s not hard to find statements like “Our <database-or-big-data-system-name> can run SQL queries <number> times faster than Hive.”
This is changing. Over the last 15 months or so, the following big things have happened with Hive to improve performance:
What this means is that you need to verify statements of the form “<systemname> is <X> times faster than Hive” carefully because the code in the Hive trunk today is an order of magnitude faster (sometimes more) than it was 15 months ago. Here’s an example from Hortonworks. The left bar is Hive 10, the middle bar is Hive 11 with ORC, and the right is the latest Hive trunk. These results are at scale factor 20 (approximately 200GB of data).
As you can see, for this query, Hive has moved from the “I’ll go for coffee while I run this query” stage to the “I don’t mind waiting for my answer” stage.
Even with this progress, Hive still has room for improvement. The biggest things it’s missing from a query execution performance perspective are:
Hive is already attractive because of its functionality, ability to scale, established community and user base, and open source distribution. When the enhancements of the last 15 months get into production, its performance on a per-node basis won’t be too bad. Add in lightweight scheduling and in-memory caching, and it can be downright good. Then Hive will be poised to grab the whip away and hit back.