Written with Vinod Kumar Vavilapalli and Gopal Vijayaraghavan
A few weeks back we blogged about the Stinger Initiative and promised to work within the open community to make Apache Hive 100 times faster for SQL interaction with Hadoop. We have a broad set of scenarios queued up for testing, but we are so excited about the early results of this work that we thought we’d take the time to share some of them with you.
To get a fair assessment, we styled our tests after the TPC Benchmark™ DS (TPC-DS). For this initial report we provide detail on two of the most common use cases; as we execute more queries and make further improvements, we will share more detail.
In this report we provide results for two of our performance queries. In the first, we perform a star-schema join: the small dimension tables are loaded into memory, and the fact table is scanned independently on every node. In the second, we join two large fact tables, both of which are too large to fit in memory.
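To make the first pattern concrete, here is a minimal HiveQL sketch of a star-schema join of the kind described above. The table and column names (`sales_fact`, `date_dim`, `store_dim`) are illustrative, not the actual TPC-DS queries used in the benchmark; the key point is that Hive can broadcast the small dimension tables to every node as a map join while the fact table is scanned in parallel.

```sql
-- Let Hive automatically convert joins against small tables into map joins,
-- so the dimension tables are held in memory on each node.
SET hive.auto.convert.join=true;

SELECT d.d_year,
       s.store_name,
       SUM(f.sales_price) AS total_sales
FROM sales_fact f                               -- large fact table, scanned on all nodes
JOIN date_dim  d ON f.date_key  = d.date_key    -- small dimension, broadcast in memory
JOIN store_dim s ON f.store_key = s.store_key   -- small dimension, broadcast in memory
GROUP BY d.d_year, s.store_name;
```

With the map-join conversion, no shuffle is needed for the dimension tables: each mapper joins its slice of the fact table locally.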
Our test environment consisted of a 10-node EC2 cluster with a total of 100 containers over 40 disks. We measured query execution times with Hive on raw data, and with Hive with all of the optimizations enabled on partitioned data stored in RCFile format. We used a scale factor of 200, which corresponds to a data set of roughly 200 GB.
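For readers unfamiliar with the "partitioned data stored in RCFile format" setup, here is a hedged sketch of what such a table layout looks like. The table and column names are hypothetical stand-ins, not the benchmark's actual schema.

```sql
-- Columnar, partitioned fact table: partition pruning skips irrelevant dates,
-- and RCFile stores data column-wise for faster scans.
CREATE TABLE sales_fact_rc (
  item_key    BIGINT,
  store_key   BIGINT,
  sales_price DECIMAL(7,2)
)
PARTITIONED BY (sold_date STRING)
STORED AS RCFILE;

-- Populate it from a raw (e.g. text-format) staging table,
-- letting Hive create partitions dynamically.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales_fact_rc PARTITION (sold_date)
SELECT item_key, store_key, sales_price, sold_date
FROM sales_fact_raw;
```

Queries that filter on `sold_date` then read only the matching partitions, and only the columns they reference, which is a large part of the gap between "raw data" and the optimized numbers reported below.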
For the first query, we measured as much as a 35X improvement over native Hive, reducing query time from around 1,400 seconds to 39! And for the second query we measured a whopping 45X improvement… all in open source Apache Hive.
This is a preliminary look at test results, but it already shows significant improvements. We are pretty excited about the results, as the work has only just begun. The tests above do not even include the Tez improvements, nor do they include the new ORCFile format! And there is still a handful of other Hive-specific improvements coming. Some of the improvements included in these tests are HIVE-3784, HIVE-3952 and HIVE-2340. A before and after of the execution looks like this:
In subsequent posts we will deliver more explicit results and provide a more in-depth specification of the hardware and software used for the benchmarks. We’ll also describe a few other efforts that will complete our story of attaining the 100x gains we talked about earlier, so stay tuned.
We truly believe that the fastest path to innovation is the open community, and this is a great example of how quickly the community can prove this true. These advances are not the work of Hortonworks alone; they were completed in partnership with the community, with contributions from Yahoo!, Facebook, Twitter, SAP and Microsoft.
NOTE: We are talking about all of this and much more at Hadoop Summit Amsterdam. Attend our talk “Innovations In Apache Hadoop MapReduce, Pig and Hive for improving query performance” to learn more about this effort. “Optimizing Hive Queries” by Owen O’Malley and “What’s New and What’s Next in Apache Hive” by Gunther Hagleitner are two other talks you should attend to learn more about other threads in the Stinger Initiative.