Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop

The Stinger Initiative is Hortonworks’ community-facing roadmap laying out the investments Hortonworks is making to improve Hive performance 100x and evolve Hive to SQL compliance to simplify migrating SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.

SQL for Big Data

We also believe this should be done in the open community and that Apache Hive is the right choice to evolve SQL-IN-Hadoop. First, Hive is the only SQL engine proven at multi petabyte scale so it is a good baseline.  Second, it has the most comprehensive SQL semantics among any Hadoop solution for SQL so it is “closest to the pin”.  Third, Hive is built on MapReduce so it can it effortlessly handles hundreds or thousands of simultaneous queries, even at petabyte scale.  Finally, there is a large vibrant community of practitioners and developer using and working on Hive.  Why not rally the community to innovate in the open?

So, where are we?

The Stinger initiative outlines three phases.  In the first phase of delivery we saw:

  • Performance improvements of 35x-45x for common analytical queries and
  • Introduction of SQL windowing functions such as Rank, Lead, Lag, etc.

Our recent HDP 2.0 beta marks our second major release of Stinger based improvements for Hive, introducing:

  • A preview of the vectorized query engine that speeds all types of queries, adding another 5x-10x improvement,
  • Simplified SQL interoperability through the new VARCHAR and DATE datatypes and
  • A new query optimizer that speeds complex queries by several factors.

While this represents great progress, we still have one remaining piece of Stinger Phase 2: Hive on Apache Tez.  This will be released separately in beta form soon and will deliver order-of-magnitude improvements in query latency and push several types of queries past the 100x barrier.

We’re encouraged by the extremely positive reaction we’ve seen from both Hive users and the Hive ecosystem. Let’s take a deeper dive into some of the new HDP 2 features.

Performance

Increasing Hive performance 100x remains the primary goal of the Stinger Initiative. The HDP 2.0 Beta introduces several major new performance features that benefit both small reporting queries and deep analytical queries.  Some of which are describe in this table:

We’ll deep dive on these major new improvements in subsequent performance oriented blog posts. These improvements represent important steps on the path to improving Hive’s performance 100x. Here’s a few examples of what they add up to.

We looked at TPC-DS Query 27, a fairly simple reporting query, back in February and showed that some improvements to the Hive query planner led to massive performance benefits. HDP 2 brings incremental progress by introducing vectorized query, which makes the map stages far more efficient. The next big boost in Query 27 will come when we introduce the upcoming Tez Beta which unlocks true low latency on Hadoop. We believe Apache Tez will push this query and others like it past the 100x improvement mark.

Hive is not just for simple queries but can also handle quite complex queries. TPC-DS Query 95 is an extremely complex query including a 3-way fact table join. This query benefitted from our Hive 11 improvements but not to the extent that simple star schema join queries like Query 27 did. HDP 2 and the upcoming Hive 12 introduce a query optimizer that benefits complex queries by generating more efficient map/reduce plans. Even in the fastest time, Query 95 runs in 6 distinct MapReduce jobs, so the introduction of Tez will prove to be a massive boost here as well.

We’ll discuss these performance benefits in more depth in later blogs.

SQL Compliance

Our goal with SQL support is simple: Make Apache Hive a comprehensive and compliant SQL engine that meets Enterprise class needs. This round of Hive development introduces 2 critical new data types, VARCHAR, a very commonly used SQL type, and DATE, which is also very common and a natural choice for partitioning.

These new datatypes are extremely important for interoperability with existing databases. Historically, loading data from an external RDBMS into Hive (or systems claiming to be Hive compatible) required mapping VARCHAR to strings. This mapping process is no longer needed. These interoperability improvements further solidify Hive’s position as the most comprehensive SQL system for Hadoop, and greatly simplify the process of migrating SQL workloads to Hive.

The Hive Ecosystem: Stronger Than Ever

Our philosophy is to innovate SQL in Hadoop with 100% vendor-neutral Apache Foundation software that welcomes contributions from anyone. Although Hive 12 is not officially released yet, a lot of work has been flowing in from more than 60 developers, both within Hortonworks and from many other contributors in the community.

UPDATE: Thanks to Ed Capriolo, who pointed out that at least one large patch must have been missing due to an unexpectedly small number in the Cloudera section. Analysis has been re-run, resulting in updated numbers and the following corrections:
  • Cloudera contribution count was incorrectly reported as 4244, corrected to 13754
  • Hortonworks contribution count was incorrectly reported as 83742, corrected to 90681
  • NexR contribution count was incorrectly reported as 12103, corrected to 15236
  • “Other” contribution count was incorrectly reported as 8920, corrected to 10228

loc

Hortonworks customers value our 100% open source, 0% proprietary, 0% lock-in approach. Our partners have noticed this preference among our mutual customers and we’re excited to see them respond by deepening their investments in Hive. Here are a few noteworthy examples:

 microstrategy_logo Microstrategy focuses on deep analytics and MicroStrategy Server 9.4 has expanded deep analytics on Hadoop by embracing the SQL windowing capabilities introduced in Hive 11. Find out more about the 9.4 release on the MicroStrategy website.
karmasphere Karmasphere were early adopters of Hive 11, introducing support in July when Hortonworks HDP was the only distribution that included Hive 11. Karmasphere pushes the boundaries of deep analytics in other ways, partnering with Zmentis to bring machine learning and predictive analytics to Hive and Hadoop. Learn more on the Karmasphere website.
 mongo-db-logo MongoDB is a very popular NoSQL database thanks in large part to its simplicity in deployment and use. Mongo doesn’t focus on deep analytical capabilities, so Hive is a natural complement. MongoDB has released a new connector that allows you to create tables within Hive and run SQL queries directly on data stored in MongoDB. Learn more on the MongoDB website.
 20347_new-esri-logo In March, ESRI released “GIS Tools for Hadoop”, a powerful spatial analytics toolkit for Hive and Hadoop that makes spatial analytics simple. Learn more on their project page. You can also read about a sample application we built that has working code for using the ESRI toolkit.
 tableau Tableau recently released version 8.0.4 which adds support for HiveServer2 providing additional security and the ability to support more concurrent users. Find more on working with Tableau and Hadoop here.

Try Hive Today

HDP 2.0 Beta is available for download today and includes the fastest and most comprehensive Hive to date. Whether you’re looking for faster and more scalable SQL processing, or if you’re an existing Hive user looking to test drive the new performance enhancements, we encourage you to download and give us your feedback.

Categorized by :
Apache Hadoop Business Analytics Hadoop in the Enterprise HDP 1.x HDP 2 Hive Tez

Comments

Carter Shanklin
|
September 17, 2013 at 9:57 pm
|

Thanks to Ed Capriolo, who pointed out that at least one large patch must have been missing due to an unexpectedly small number in the Cloudera section. Analysis has been re-run, resulting in updated numbers and the following corrections.
Cloudera contribution count was incorrectly reported as 4244, corrected to 13754
Hortonworks contribution count was incorrectly reported as 83742, corrected to 90681
NexR contribution count was incorrectly reported as 12103, corrected to 15236
“Other” contribution count was incorrectly reported as 8920, corrected to 10228

The accompanying image has been replaced with the updated figures.

Carter Shanklin
|
September 10, 2013 at 8:38 pm
|

@Jim
I started off with a clean Hive git repo.
Next off I identified all JIRAs marked as fixVersion = Hive 0.12.0 and status Resolved. There are several hundred of these.
Third step is to trace this back to a git commit ID.
Hive developers are pretty good about including the JIRA number in the commit log so for the most part you can just parse “git log” and match JIRAs to commits. Just to be extra sure I also extracted comments made by Hudson within the JIRAs themselves to cross-validate. I found the Hive commit log complete enough that you don’t really need to do this to get essentially the same results I got.
This maps JIRAs to a set of commits, sometimes more than 1 as patches sometimes get made to multiple branches.
With this mapping I use git to identify the specific lines that have changed and count them up.
For JIRAs that have more than one commit associated I take the max value, trying to represent the effort of making the patch.
At this point I have Assignee, JIRA and number of lines. The last step is to map assignee back to company. This step is purely manual and is based on my best info. A few of the people who wound up in “other” might have belonged elsewhere.
I attributed a few large patches by Namit Jain to Facebook even though I want to say he had actually moved over to Nutanix by the time they went in. Namit can correct me if need be.
There were a few other Facebook contributions, not to the extent of Namit. I think it’s premature to say they’ve stopped, we certainly hope to see more from them.
The single largest patch was the Hive Correlation Optimizer, work that had been developed by Yin Huai for many years and committed while he was interning at Hortonworks this summer. It was about 20% of the whole count.

@Dan
Facebook published some impressive numbers on their Hive usage, more than 60,000 Hive queries per day with 100+PB data under management in their largest cluster. Source = http://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920

    edwardcapriolo
    |
    September 10, 2013 at 9:37 pm
    |

    The problem with counting lines of code is that a one line patch could cause cascading in hundreds of unit test and create an artificial lines of code count. This kinda goes with the thinking that 20% of the code was the optimizer.

    Without crunching the numbers i have a hard time accepting that. Like they say, whatever you count you get more of :)

Dan Graham
|
September 10, 2013 at 11:50 am
|

Regarding your statement “Hive is built on MapReduce so it can it effortlessly handles hundreds or thousands of simultaneous queries”. How many sites are running this many simultaneous Hive queries against a single cluster? Can you name any of them?

Also, what are the business partners listed above doing with Hadoop 2.0?

Jim Thompson
|
September 9, 2013 at 12:54 pm
|

Could you publish exactly how your calculated those LOC contributions? From what I can tell Facebook has stopped contributing to Hive altogether.

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.

Get Sandbox

Join the Webinar!

Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache Knox
Thursday, October 23, 2014
1:00 PM Eastern / 12:00 PM Central / 11:00 AM Mountain / 10:00 AM Pacific

More Webinars »

HDP 2.1 Webinar Series
Join us for a series of talks on some of the new enterprise functionality available in HDP 2.1 including data governance, security, operations and data access :
Contact Us
Hortonworks provides enterprise-grade support, services and training. Discuss how to leverage Hadoop in your business with our sales team.
Integrate with existing systems
Hortonworks maintains and works with an extensive partner ecosystem from broad enterprise platform vendors to specialized solutions and systems integrators.