Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop
The Stinger Initiative is Hortonworks’ community-facing roadmap laying out the investments Hortonworks is making to improve Hive performance 100x and evolve Hive to SQL compliance to simplify migrating SQL workloads to Hive.
We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.
SQL for Big Data
We also believe this should be done in the open community and that Apache Hive is the right choice to evolve SQL-IN-Hadoop. First, Hive is the only SQL engine proven at multi petabyte scale so it is a good baseline. Second, it has the most comprehensive SQL semantics among any Hadoop solution for SQL so it is “closest to the pin”. Third, Hive is built on MapReduce so it can it effortlessly handles hundreds or thousands of simultaneous queries, even at petabyte scale. Finally, there is a large vibrant community of practitioners and developer using and working on Hive. Why not rally the community to innovate in the open?
So, where are we?
The Stinger initiative outlines three phases. In the first phase of delivery we saw:
- Performance improvements of 35x-45x for common analytical queries and
- Introduction of SQL windowing functions such as Rank, Lead, Lag, etc.
Our recent HDP 2.0 beta marks our second major release of Stinger based improvements for Hive, introducing:
- A preview of the vectorized query engine that speeds all types of queries, adding another 5x-10x improvement,
- Simplified SQL interoperability through the new VARCHAR and DATE datatypes and
- A new query optimizer that speeds complex queries by several factors.
While this represents great progress, we still have one remaining piece of Stinger Phase 2: Hive on Apache Tez. This will be released separately in beta form soon and will deliver order-of-magnitude improvements in query latency and push several types of queries past the 100x barrier.
We’re encouraged by the extremely positive reaction we’ve seen from both Hive users and the Hive ecosystem. Let’s take a deeper dive into some of the new HDP 2 features.
Increasing Hive performance 100x remains the primary goal of the Stinger Initiative. The HDP 2.0 Beta introduces several major new performance features that benefit both small reporting queries and deep analytical queries. Some of which are describe in this table:
We’ll deep dive on these major new improvements in subsequent performance oriented blog posts. These improvements represent important steps on the path to improving Hive’s performance 100x. Here’s a few examples of what they add up to.
We looked at TPC-DS Query 27, a fairly simple reporting query, back in February and showed that some improvements to the Hive query planner led to massive performance benefits. HDP 2 brings incremental progress by introducing vectorized query, which makes the map stages far more efficient. The next big boost in Query 27 will come when we introduce the upcoming Tez Beta which unlocks true low latency on Hadoop. We believe Apache Tez will push this query and others like it past the 100x improvement mark.
Hive is not just for simple queries but can also handle quite complex queries. TPC-DS Query 95 is an extremely complex query including a 3-way fact table join. This query benefitted from our Hive 11 improvements but not to the extent that simple star schema join queries like Query 27 did. HDP 2 and the upcoming Hive 12 introduce a query optimizer that benefits complex queries by generating more efficient map/reduce plans. Even in the fastest time, Query 95 runs in 6 distinct MapReduce jobs, so the introduction of Tez will prove to be a massive boost here as well.
We’ll discuss these performance benefits in more depth in later blogs.
Our goal with SQL support is simple: Make Apache Hive a comprehensive and compliant SQL engine that meets Enterprise class needs. This round of Hive development introduces 2 critical new data types, VARCHAR, a very commonly used SQL type, and DATE, which is also very common and a natural choice for partitioning.
These new datatypes are extremely important for interoperability with existing databases. Historically, loading data from an external RDBMS into Hive (or systems claiming to be Hive compatible) required mapping VARCHAR to strings. This mapping process is no longer needed. These interoperability improvements further solidify Hive’s position as the most comprehensive SQL system for Hadoop, and greatly simplify the process of migrating SQL workloads to Hive.
The Hive Ecosystem: Stronger Than Ever
Our philosophy is to innovate SQL in Hadoop with 100% vendor-neutral Apache Foundation software that welcomes contributions from anyone. Although Hive 12 is not officially released yet, a lot of work has been flowing in from more than 60 developers, both within Hortonworks and from many other contributors in the community.
- Cloudera contribution count was incorrectly reported as 4244, corrected to 13754
- Hortonworks contribution count was incorrectly reported as 83742, corrected to 90681
- NexR contribution count was incorrectly reported as 12103, corrected to 15236
- “Other” contribution count was incorrectly reported as 8920, corrected to 10228
Hortonworks customers value our 100% open source, 0% proprietary, 0% lock-in approach. Our partners have noticed this preference among our mutual customers and we’re excited to see them respond by deepening their investments in Hive. Here are a few noteworthy examples:
|Microstrategy focuses on deep analytics and MicroStrategy Server 9.4 has expanded deep analytics on Hadoop by embracing the SQL windowing capabilities introduced in Hive 11. Find out more about the 9.4 release on the MicroStrategy website.|
|Karmasphere were early adopters of Hive 11, introducing support in July when Hortonworks HDP was the only distribution that included Hive 11. Karmasphere pushes the boundaries of deep analytics in other ways, partnering with Zmentis to bring machine learning and predictive analytics to Hive and Hadoop. Learn more on the Karmasphere website.|
|MongoDB is a very popular NoSQL database thanks in large part to its simplicity in deployment and use. Mongo doesn’t focus on deep analytical capabilities, so Hive is a natural complement. MongoDB has released a new connector that allows you to create tables within Hive and run SQL queries directly on data stored in MongoDB. Learn more on the MongoDB website.|
|In March, ESRI released “GIS Tools for Hadoop”, a powerful spatial analytics toolkit for Hive and Hadoop that makes spatial analytics simple. Learn more on their project page. You can also read about a sample application we built that has working code for using the ESRI toolkit.|
|Tableau recently released version 8.0.4 which adds support for HiveServer2 providing additional security and the ability to support more concurrent users. Find more on working with Tableau and Hadoop here.|
Try Hive Today
HDP 2.0 Beta is available for download today and includes the fastest and most comprehensive Hive to date. Whether you’re looking for faster and more scalable SQL processing, or if you’re an existing Hive user looking to test drive the new performance enhancements, we encourage you to download and give us your feedback.
Get Started using Hadoop to Analyze Data. This guide includes tutorials, videos and advice on integrating Hadoop with popular analytics packages.
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.