Stinger Early Results: 45X Performance Increase for Hive

Written with Vinod Kumar Vavilapalli and Gopal Vijayaraghavan

A few weeks back we blogged about the Stinger Initiative and set a promise to work within the open community to make Apache Hive 100 times faster for SQL interaction with Hadoop. We have a broad set of scenarios queued up for testing but are so excited about the early results of this work that we thought we’d take the time to share some of this with you.

In order to get a fair assessment we styled our tests after the TPC Benchmark™ DS (TPC-DS). For this initial report, we provide detail around two of the most common use cases and as we execute more queries and make more improvements we will provide more detail.

Performance Tests & Environment

In this report we provide results for two of our performance queries.  In the first, we perform a star schema join where we load all the small tables in memory and do a scan through the fact table independently on all nodes. In the second query, we perform a join between two large fact tables, both of which are too large to fit in memory.

Our test environment comprised of a 10 node ec2 cluster with a total of 100 containers over 40 disks, to obtain query execution times with Hive on raw data and with Hive with all the optimizations enabled on partitioned data stored in RCFile format. We also use a scale factor of 200, which implies a data set of around 200GB.

Results

For the first query, we’ve calculated as much as a 35X improvement over native Hive and have reduced query times from around 1400 seconds to 39! And for second query we calculate a whopping 45X improvement… all in open source Apache Hive.

StingerQuery1  StingerQuery2

This is a preliminary look at test results but it indicates significant improvements already. We are pretty excited about the results, as the work has only just begun.  The test results above do not even include the Tez improvements, nor do they include the new ORCFile format!  And there are still a handful of other Hive specific improvements coming.  Some of the improvements are included in these tests include HIVE-3784, HIVE-3952 and HIVE-2340.  A before and after of the execution looks like this:

BeforeStinger

AfterStinger

In subsequent posts we will deliver more explicit results and provide a more in-depth specifications for hardware/software used for the benchmarks. We’ll also describe few other efforts, which will complete our story of attaining 100x gains we talked about earlier, so stay tuned.

We truly believe that the fastest path to innovation is the open community and this is a great example of how quick the community can prove this true.  These advances are not attributed to Hortonworks alone; they were completed in partnership with the community with resources from Yahoo!, Facebook, Twitter, SAP and Microsoft all contributing.

NOTE: We are talking about all of this and much more at Hadoop Summit Amsterdam. Attend our talk “Innovations In Apache Hadoop MapReduce, Pig and Hive for improving query performance” to learn more about this effort. “Optimizing Hive Queries” by Owen O’Malley and “What’s New and What’s Next in Apache Hive” by Gunther Hagleitner are two other talks that you should attend to learn more about other threads in our stinger initiative.

Categorized by :
Hive Performance

Comments

Matthew McBride
|
July 8, 2013 at 2:46 pm
|

For the benchmark testing environment, was Hive-0.11.0 running on YARN? I am having a hard time determining this as the diagram specifies YARN as phase 2 implementation. Also, what version of HDP was the testing performed?

Thanks!! :)

Carl Steinbach
|
March 22, 2013 at 4:03 pm
|

Would you be willing to publish your benchmark code (queries, schemas, ETL scripts, etc) on the HortonWorks Github account?

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.

Get Sandbox

Stinger Initiative

The Stinger Initiative is a broad, community-based effort to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics. More »

Recently in the Blog

Modern Data Architecture
Tackle the challenges of big data. Hadoop integrates with existing EDW, RDBMS and MPP systems to deliver lower cost, higher capacity infrastructure.
Contact Us
Hortonworks provides enterprise-grade support, services and training. Discuss how to leverage Hadoop in your business with our sales team.
Integrate with existing systems
Hortonworks maintains and works with an extensive partner ecosystem from broad enterprise platform vendors to specialized solutions and systems integrators.

Thank you for subscribing!