June 30, 2014

Did Cloudera Just Shoot Their Impala?

We certainly live in interesting times. About 20 months ago, in an effort to find proprietary differentiation that could be used to monetize and lock customers into their model, Cloudera unveiled Impala, and at that time Mike Olson stated “Our view is that, long-term, this will supplant Hive”. Only 6 months ago, in his Impala v Hive post, Olson defended his “decision to develop Impala from the ground up as a new project, rather than improving the existing Apache Hive project”, stating “Put bluntly: We chose to build Impala because Hive is the wrong architecture for real-time distributed SQL processing.”

So, 20 months after abandoning Hive, and after repeated marketing attempts to throw Hive and many other SQL alternatives under the bus in favor of their “better” approach, I’m certainly puzzled as Cloudera unveils their plan to enable Apache Hive to run on Apache Spark; please see HIVE-7292 for details. I can only interpret this move as recognition of:

  • The success of the Stinger Initiative in modernizing Hive’s architecture for interactive SQL applications while preserving the investments of Hive’s users and the broader ecosystem.
  • The new era of “Enterprise Hadoop” with YARN as its architectural center. YARN has transformed Hadoop into a platform that goes far beyond the batch-oriented mappers and reducers of “Traditional Hadoop”.

And the move seriously calls the future of Cloudera’s Impala into question: did Cloudera just accidentally shoot their Impala, or was the kill shot intentional?

Stinger Initiative: A broad community investment on behalf of the customer

Shortly after Impala was unveiled, the Stinger Initiative was launched, driven by the community and spearheaded by Hortonworks. Rather than abandon the Apache Hive community, we chose to trust the community model, double down on Hive, and preserve the investments of Hive’s end users and broad ecosystem of vendors already integrated with Hive. A multi-phase roadmap of investment was published aimed at ensuring Hive remained the de facto standard for SQL-in-Hadoop. In the 18 months that followed, Hive 11, Hive 12, and Hive 13 were delivered in support of enabling both batch and interactive SQL query workloads in a single engine. While speed was just one facet of the Stinger efforts, the benchmark results comparing Hive 10 versus Hive 13 are nonetheless impressive.

Stinger also gave rise to Apache Tez on YARN, a data processing engine inspired by Microsoft’s Dryad paper that Hive now uses to enable its interactive SQL queries. Microsoft also brought innovations from SQL Server to Hive, including Hive’s new Optimized Row Columnar (ORC) file format and vectorized query execution, which together deliver impressive CPU and throughput gains.
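To make that concrete, here is a minimal HiveQL sketch of how these Stinger-era features are typically switched on. The table and column names are hypothetical, and the property names assume Hive 0.13-era configuration keys.

    -- Run queries on Tez instead of classic MapReduce (Hive 0.13+).
    SET hive.execution.engine=tez;

    -- Enable vectorized query execution, which processes rows in batches
    -- for better CPU efficiency and works against ORC-backed tables.
    SET hive.vectorized.execution.enabled=true;

    -- Store data in the columnar ORC format; web_logs_orc and web_logs_raw
    -- are hypothetical tables used only for illustration.
    CREATE TABLE web_logs_orc (
      ip    STRING,
      ts    TIMESTAMP,
      url   STRING,
      bytes BIGINT
    ) STORED AS ORC;

    INSERT INTO TABLE web_logs_orc
    SELECT ip, ts, url, bytes FROM web_logs_raw;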

The Stinger Initiative rallied the involvement of 145 contributors from across 44 different organizations, and we’re pleased to have been able to grow our active Hive committers at Hortonworks from 2 to 14 along the way. While we are proud of our engineers who earned their stripes as committers, we’re more proud of the collective effort of the 145 contributors and the 392,000 lines of new code they added towards the shared goal of improving Hive’s speed, scale, and SQL semantics.

Speaking of Hive-focused initiatives, The Register’s Jack Clark posted an article today that provides added color around Cloudera’s pivot away from Impala and towards Hive on Spark. I hold out hope that their interests in enabling Hive on Spark are genuine and not part of some broader aspirational marketing campaign laced with bombastic FUD. I find Cloudera’s newfound appreciation for Hive the sincerest form of flattery, and a great validation that the right choice 20 months ago was to stay focused on the success of Hive.

Hive on Spark on YARN… a worthy idea

As with other successful open source innovations, the build-out of Hive on Spark on YARN will progress in the open over the coming months. The Stinger Initiative proved that the smart engineers in the Hive community can collaborate in a way that moves Hive forward for the broad ecosystem of users who appreciate the improvements made over the past 18 months and look forward to more. This means the Hive community will continue its momentum on making Hive faster, more scalable, and richer in its SQL semantics in support of broader and more valuable use cases.

If Cloudera is genuinely interested in (re)embracing the Hive community, then I assume they are OK with Impala’s progressively slimming set of use cases getting squeezed out entirely over the next couple of Hive releases. Moreover, it’s a good thing that Hive’s use of Tez has blazed the trail for how to think about alternative Hive execution engines. In Hive 13, Tez finally provides a modern approach for translating Hive’s SQL queries into full, expressive, and efficient graphs of data processing execution. The Hive-Tez query plans have set the bar for how alternative execution engines should plug into Hive.

I hope the aforementioned HIVE-7292 avoids the rearview-mirror thinking of simply running Hive’s MapReduce-style query plans on Spark via mappers and reducers; after all, Spark’s APIs support a more expressive means of integration, and Hadoop’s architecture has moved well beyond the Traditional Hadoop era where mappers and reducers were the only choice. For example, I believe Shark, the original Hive-on-Spark implementation, used more sophisticated query plans than Hive-MapReduce, but Shark’s approach has recently been superseded by Spark SQL in the Apache Spark community. While there are details to be worked out and more SQL engine confusion to help people sort through along the way, Hive on Spark could turn out to be a very good idea.
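For illustration, here is how the execution engine is chosen per session in Hive today, along with what a Spark option might look like if HIVE-7292 lands as proposed; the spark value is speculative at the time of writing, and web_logs_orc is the same hypothetical table used earlier.

    -- The engine is a per-session setting; "mr" and "tez" are the values
    -- available as of Hive 0.13.
    SET hive.execution.engine=tez;

    EXPLAIN
    SELECT url, COUNT(*) AS hits
    FROM web_logs_orc
    GROUP BY url;

    -- If HIVE-7292 is implemented as proposed, one would expect something
    -- along these lines (illustrative only; not available today):
    -- SET hive.execution.engine=spark;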

What else is on deck for Hive?

Work continues on the integration of Apache Optiq, which brings Hive a first-class cost-based optimization framework that can take best advantage of the superior performance, scalability, throughput, and expressiveness that modern engines such as Tez provide. Also interesting are the efforts around “Discardable Memory and Materialized Queries” and “Discardable Distributed Memory: Supporting Memory Storage in HDFS”. These efforts are squarely aimed at making sure the broader Hadoop platform and its data processing engines, including both Apache Tez and Apache Spark, make the most efficient use of available memory across a wide range of use cases.
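As a rough sketch of what the cost-based optimization work looks like from the Hive shell, assuming the hive.cbo.enable switch and the statistics-related properties this effort builds on (table names are again hypothetical):

    -- A cost-based optimizer is only as good as its statistics, so gather
    -- table and column statistics first.
    ANALYZE TABLE web_logs_orc COMPUTE STATISTICS;
    ANALYZE TABLE web_logs_orc COMPUTE STATISTICS FOR COLUMNS url, bytes;

    -- Switch on the Optiq-based optimizer and let it use column statistics;
    -- exact property names may differ by release.
    SET hive.cbo.enable=true;
    SET hive.stats.fetch.column.stats=true;

    -- EXPLAIN shows the plan the optimizer chose, e.g. the join ordering.
    EXPLAIN
    SELECT l.url, COUNT(*) AS hits
    FROM web_logs_orc l
    JOIN urls u ON l.url = u.url
    GROUP BY l.url;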

A public commitment to Spark on YARN for Enterprise Hadoop

Consistent with our investments in YARN as the architectural center of Enterprise Hadoop, just last week we announced that Apache Spark is YARN Ready. This is a vital step forward: it ensures that memory- and CPU-intensive Spark-based applications can co-exist within a single Hadoop cluster alongside all the other workloads you have deployed. Concurrent with the announcement, Hortonworks became an inaugural member of the Databricks Certified Spark Distribution program. The combination of both programs provides enterprises and the broader Hadoop ecosystem with the assurance that their tools and applications are fully compatible with Apache Spark, Apache Hadoop YARN, and the Hortonworks Data Platform.

Our focus remains on delivering a fast, safe, scalable, and manageable data platform on a consistent footprint that includes HDFS, YARN, Hive, Tez, Spark, Storm, Ambari, Knox, and Falcon to name just a few of the critical components of Enterprise Hadoop. Integrating Spark within this comprehensive set of components helps make it “enterprise ready” so that our customers can confidently adopt it.


The above only reinforces that an open source, community-driven model is the right one, and that all of Enterprise Hadoop should be delivered in open source.

If the effort proposed in HIVE-7292 moves Hive forward in a useful and valuable way, then you’ll find Hortonworks at the party. Unlike others, we’ve always been there, never left, and we actually brought more friends to the party over the past 18 months.



  • This is not a validation of Hive, just a strategic move to undermine the need for Tez (and therefore Hortonworks). It’s pretty obvious.

  • This is good for customers and partners who are all in on Hive. My read on this is that it is a simple way to wrest control of enhanced Hive (leveraging Tez) from Hortonworks.

  • I appreciate aggressive competition as much as anyone, but this is a cheap shot. The truth is that the emergence of Spark has changed the game for all of the current Hadoop distributors. Shark has been great and will allow people to get existing systems off Map/Reduce, but for those building new ones I would advocate investing in Spark SQL going forward.

    Regarding your article specifically, I give someone credit for trying to improve on existing solutions even if it does not work out. Do you have the same criticism for Spark SQL?

  • “I hold out hope that their interests in enabling Hive on Spark are genuine and not part of some broader aspirational marketing campaign laced with bombastic FUD.” I really think YOU are the one being FUDLY. Cloudera has had 1-2 people involved with the Hive project for a while now, maybe like 6 years. Carl is the Hive lead; he previously worked for Cloudera. Cloudera has 2 people adding features now. Hortonworks is relatively new to the Hive project, 2-3 years tops?

    So even though Cloudera did build Impala, they have kept steady support on the Hive project for a very long time.

    Spark is just very buzzy now. Everyone wants to have it or be involved with it, like DevOps or cloud, but it’s actually 3-4 years old, right? Everyone is in their enterprising pissing match again.

  • Where does Hortonworks stand on Apache Pig? Will it be supported on Spark?

  • That’s one of many proofs that open-source-driven projects can be developed at amazingly high speed and deliver great results. And, on top of that, it’s possible to combine open source and running a business. There are so many companies around “distributed” (if not yet big) data, like Hortonworks, MongoDB, and Elasticsearch, who seem to be really successful and empower the whole Internet by open-sourcing their code at the same time.

  • Hive and Impala serve 2 entirely different purposes.

    Hive: ETL using SQL with rich language support
    Impala: low-latency, interactive SQL queries.

    Have you ever tried to hook up Hive to a BI tool like Tableau? It provides a bad user experience, with latencies that are cringe-worthy. Even Stinger or Hive-on-Spark won’t solve that. Impala solves that problem. It allows users of BI tools like Tableau to have all that data, at “big data” scale, at their fingertips, with a bearable latency.

    What Hive-on-Spark does solve, is making the ETL processes that Hive is really good at, a lot quicker and more manageable.

    What is really golden is the following quote:
    “community driven model is the right one”

    That is great coming from a company that feels Hadoop innovation should be driven by companies willing to pay rather than by hard-working community members, as illustrated by the Hadoop Open Data Platform initiative, which is pay-to-play.

  • This article is trying to deflect the reader from the facts by randomly choosing arguments. We have been using both Impala and Hive for over a year now, and I can emphatically say that each has its purpose. Big Data workflows differ depending on the use case; while Impala is great for some, Hive is great for others.

  • @sarnath – Went through the link. Running on a desktop is not representative of production workloads at scale.
    I believe Hive on Tez shines if SQL needs to run in parallel.
