February 20, 2013

The Stinger Initiative: Making Apache Hive 100 Times Faster


UPDATE: Since this article was posted, the Stinger initiative has continued to drive to the goal of 100x Faster Hive. You can read the latest information at

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface have become the de facto SQL interface for Hadoop. Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organization or customers, either directly or through a broad ecosystem of existing BI tools that rely on this key proven interface. The who’s who of business analytics have already adopted Hive.

Apache Hive was originally built for large-scale operational batch processing and it is very effective with reporting, data mining and data preparation use cases.  These usage patterns remain very important but with widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real time or interactive has increased in importance as well. At Hortonworks, we believe in the power of the open source community to innovate faster than any proprietary offering and the Stinger initiative is proof of this once again as we collaborate with others to improve Hive performance.

So, What is Stinger?

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

To this end, we have launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases. All the while, HiveQL remains the same before and after these advancements so it just gets better. And in keeping with the ecosystem of existing tools, it is complementary to best-of-breed data warehouses and analytic platforms.

  • First, we are making Hive a more suitable tool for the decision support queries people want to perform on Hadoop. This includes adding analytics features like the OVER clause, support for subqueries in WHERE, and aligning Hive’s type system more with the standard SQL model.
  • Second, we are optimizing Hive’s query execution plans and based on our initial changes, we have already seen query time drop by 90% on some of our test queries. We are also looking at additional changes inside Hive’s execution engine that we believe will significantly increase the number of records per second that a Hive task can process.
  • Third, we have introduced a new columnar file format (i.e. ORCFile) within the Hive community to provide a more modern, efficient, and high-performing way to store Hive data.
  • And lastly, we’ve introduced a new runtime framework, called Tez, which aims to eliminate Hive’s latency and throughput constraints that result from its reliance on MapReduce. Tez optimizes Hive job execution by eliminating unnecessary tasks, synchronization barriers, and reads from and writes to HDFS. This optimizes the execution chain within Hadoop and drastically speeds Hive’s workload processing.
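To illustrate why a columnar layout like ORCFile helps the analytic queries described above, here is a minimal Python sketch (not Hive's or ORCFile's implementation; the table and column names are made up) contrasting a row-oriented and a column-oriented layout for a scan that touches a single column:

```python
# Hypothetical sketch: why columnar storage reduces IO for analytic scans.
rows = [
    {"id": 1, "state": "CA", "sales": 100.0},
    {"id": 2, "state": "NY", "sales": 250.0},
    {"id": 3, "state": "CA", "sales": 75.0},
]

# Row layout: a scan must touch every field of every record.
row_store = rows

# Columnar layout: each column is stored (and can be read) independently.
column_store = {
    "id":    [r["id"] for r in rows],
    "state": [r["state"] for r in rows],
    "sales": [r["sales"] for r in rows],
}

# SELECT SUM(sales) FROM t -- a columnar reader touches only the "sales" column.
total = sum(column_store["sales"])
print(total)  # 425.0
```

In a real columnar format the per-column layout also enables type-specific encoding and compression, which is part of what makes ORCFile more efficient than row-oriented storage.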

All of these modifications to Hive are underway in the open and an initial preview will be available in advance of Hadoop Summit Amsterdam in March.

Embrace the community, Embrace Hive…

A diverse group of individuals within the Hive community is collaborating on these efforts, including contributors from SAP, Microsoft, Facebook and Hortonworks.

Harish Butani from SAP has led an effort to add analytics and windowing functions to Hive.  This will add the OVER clause for use with existing aggregate functions as well as adding analytics functions like RANK and NTILE and windowing functions like LEAD and LAG; you can see this work at HIVE-896.  Namit Jain from Facebook has been spending a lot of time lately optimizing Hive’s query execution planning so that it performs joins and other operations more efficiently and with less need for hints from the user.  Hortonworks engineers have been collaborating on these and other community efforts to improve Hive.
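The window-function semantics being added to Hive (e.g. RANK and LAG over an ordered partition) can be sketched in a few lines of Python; the functions below are illustrative stand-ins for the SQL semantics, not Hive's implementation:

```python
# Sketch of SQL window-function semantics over one ordered partition.
def rank(values):
    """SQL RANK(): ties share a rank; the next distinct value skips ahead."""
    ranks, prev, r = [], None, 0
    for i, v in enumerate(values):
        if v != prev:
            r = i + 1
        ranks.append(r)
        prev = v
    return ranks

def lag(values, offset=1, default=None):
    """SQL LAG(): the value from `offset` rows earlier in the window order."""
    return [values[i - offset] if i >= offset else default
            for i in range(len(values))]

# e.g. SELECT amount, RANK() OVER (ORDER BY amount DESC),
#                     LAG(amount) OVER (ORDER BY amount DESC) FROM t
amounts = [300, 300, 100]          # already sorted descending
print(rank(amounts))               # [1, 1, 3]
print(lag(amounts))                # [None, 300, 300]
```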

Owen O’Malley, a Hortonworks co-founder and early Hadoop developer, has been working with Facebook on the new ORCFile in order to greatly improve performance when Hive is reading, writing, and processing data; you can see this work at HIVE-3874. We are also working on farther-reaching changes and optimizations such as reworking Hive’s operators to process records in blocks of a thousand or more and thus be much more efficient than they are today.

We believe the performance changes we are making today, along with the work being done in Tez will transform Hive into a single tool that Hadoop users can use to do report generation, ad hoc queries, and large batch jobs spanning 10s or 100s of terabytes.

Why reinvent the wheel?



Eric Hanson says:

Microsoft strongly supports efforts to improve the enterprise-readiness and performance of Hive as a true open source community project. We have been collaborating behind the scenes with Hortonworks to help drive performance improvements in ORC and vectorized query execution. We’re looking forward to becoming peer contributors of new technology in the open source Hadoop community. Our first substantive contribution to Hive performance work is a “VectorizedRowBatch” object class in ORC to hold 1024-row batches of data coming out of the storage layer, optimized for fast, vector-oriented query execution. Over time, we envision more and more work will be done in a vectorized fashion in the Hive query executor. This and other planned enhancements make the 100X performance improvement goal of the Stinger project a realistic target.
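The batch-at-a-time idea behind vectorized execution can be sketched as follows. The class and names below are illustrative, not Hive's actual VectorizedRowBatch API; the point is that operators work on arrays of up to 1024 values per call instead of dispatching once per row:

```python
# Hypothetical sketch of batch-at-a-time ("vectorized") query execution.
BATCH_SIZE = 1024

class RowBatch:
    """A batch of rows held column-wise: column name -> list of values."""
    def __init__(self, columns):
        self.columns = columns
        self.size = len(next(iter(columns.values())))

def batches(column_data, batch_size=BATCH_SIZE):
    """Slice column-wise data into RowBatch objects of up to batch_size rows."""
    n = len(next(iter(column_data.values())))
    for start in range(0, n, batch_size):
        yield RowBatch({name: vals[start:start + batch_size]
                        for name, vals in column_data.items()})

# A vectorized filter-and-aggregate: one tight loop per batch,
# no per-row operator dispatch.
data = {"price": list(range(3000)), "qty": [2] * 3000}
total = 0
for batch in batches(data):
    prices, qtys = batch.columns["price"], batch.columns["qty"]
    total += sum(p * q for p, q in zip(prices, qtys) if p >= 1000)
```

Processing a whole batch inside one loop amortizes interpretation overhead and keeps the working set cache-friendly, which is where much of the vectorization speedup comes from.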

Eric Hanson

Kaushal Jha says:

Hi Eric,
Why not allow Excel to talk to HiveServer via OLAP cubes? Today Excel tries to pull in all the data before a pivot can be used, which means it won’t really work directly with big datasets (> 1 million rows).

If Excel allowed an OLAP interface to HiveServer (which probably needs an OLAP service for Hive?), would that be beneficial to the community at large?


Christian says:

These are fantastic developments. The lag Hive has today does hamper its full adoption on medium sized & interactive query workloads.

Slim Baltagi says:

Is Stinger Hortonworks’ answer to Cloudera’s Impala and MapR’s Drill? Besides revitalizing/tuning Hive (an already established project in the Hadoop ecosystem) versus offering a completely new tool, how is Stinger differentiating itself from Impala and Drill?

Clarissa says:

Another sincerely interesting post

David says:

Stinger is one of the more interesting developments lately when it comes to Hive’s lag.

Laughing Skeptic says:

Impala gets much of its performance improvements by making optimal use of the metadata embedded in the Parquet files (its preferred format). This allows Impala to skip a lot of IO. This article does not mention metadata, and the ORC file format does not support metadata in the same way that Parquet does. The metadata does not have to be stored in the data files, and some might even say it is a mistake to do so. However, it has to go somewhere, and without improved metadata Hive will never match up with Impala in performance.
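The metadata-based IO skipping the commenter describes can be sketched like this: keep min/max statistics per chunk of rows so a reader can skip chunks that cannot match a predicate. (ORC does keep comparable min/max statistics per stripe; the layout below is simplified and illustrative, not either format's actual structure.)

```python
# Hypothetical sketch of predicate pushdown using per-chunk min/max statistics.
def build_chunks(values, chunk_size):
    """Split values into chunks, recording min/max metadata for each."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        part = values[i:i + chunk_size]
        chunks.append({"min": min(part), "max": max(part), "rows": part})
    return chunks

def scan_greater_than(chunks, threshold):
    """Evaluate `value > threshold`, reading only chunks that could match."""
    read, hits = 0, []
    for c in chunks:
        if c["max"] <= threshold:
            continue                 # skipped: no row in this chunk qualifies
        read += 1
        hits.extend(v for v in c["rows"] if v > threshold)
    return read, hits

chunks = build_chunks(list(range(100)), chunk_size=10)   # 10 chunks of 10 rows
read, hits = scan_greater_than(chunks, 80)
print(read, len(hits))  # 2 19
```

Here a selective predicate lets the reader touch 2 of 10 chunks; on disk, each skipped chunk is IO that never happens.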

saurabh Sharma says:

My Hive query is running very slowly. Please suggest.

saurabh Sharma says:

Even with Tez enabled… please suggest how we can improve.
