We certainly live in interesting times. About 20 months ago, in an effort to find proprietary differentiation that could be used to monetize and lock customers in to their model, Cloudera unveiled Impala, and at that time Mike Olson stated, “Our view is that, long-term, this will supplant Hive”. Only 6 months ago, in his Impala v Hive post, Olson defended his “decision to develop Impala from the ground up as a new project, rather than improving the existing Apache Hive project”, stating, “Put bluntly: We chose to build Impala because Hive is the wrong architecture for real-time distributed SQL processing.”
So, 20 months after Cloudera abandoned Hive, and after repeated marketing attempts to throw Hive and many other SQL alternatives under the bus in favor of their “better” approach, I’m certainly puzzled to see Cloudera unveil a plan to enable Apache Hive to run on Apache Spark; please see HIVE-7292 for details. I can only interpret this move as recognition of:
And the move casts serious doubt on the future of Cloudera’s Impala, raising the question: Did Cloudera just accidentally shoot their Impala, or was the kill shot on purpose?
Shortly after Impala was unveiled, the Stinger Initiative was launched, driven by the community and spearheaded by Hortonworks. Rather than abandon the Apache Hive community, we chose to trust the community model, double down on Hive, and preserve the investments of Hive’s end users and broad ecosystem of vendors already integrated with Hive. A multi-phase roadmap of investment was published aimed at ensuring Hive remained the de facto standard for SQL-in-Hadoop. In the 18 months that followed, Hive 11, Hive 12, and Hive 13 were delivered in support of enabling both batch and interactive SQL query workloads in a single engine. While speed was just one facet of the Stinger efforts, the benchmark results comparing Hive 10 versus Hive 13 are nonetheless impressive.
Stinger also gave rise to Apache Tez on YARN, a data processing engine inspired by Microsoft’s Dryad paper, that’s being used by Hive for enabling its interactive SQL queries. Microsoft also brought innovations from SQL Server to Hive including Hive’s new Optimized Row Columnar (ORC) file format and vectorized query execution which enable impressive CPU and throughput gains.
The Stinger Initiative rallied the involvement of 145 contributors from across 44 different organizations, and we’re pleased to have been able to grow our active Hive committers at Hortonworks from 2 to 14 along the way. While we are proud of our engineers who earned their stripes as committers, we’re more proud of the collective effort of the 145 contributors and the 392,000 lines of new code they added towards the shared goal of improving Hive’s speed, scale, and SQL semantics.
Speaking of Hive-focused initiatives, The Register’s Jack Clark posted an article today that provides added color around Cloudera’s pivot away from Impala and towards Hive on Spark. I hold out hope that their interests in enabling Hive on Spark are genuine and not part of some broader aspirational marketing campaign laced with bombastic FUD. I find Cloudera’s newfound appreciation for Hive the sincerest form of flattery, and a great validation that the right choice 20 months ago was to stay focused on the success of Hive.
As with other successful open source innovations, the build out of Hive on Spark on YARN will progress in the open, over the coming months. The Stinger Initiative proved that the smart engineers in the Hive community can collaborate in a way that moves Hive forward for the broad ecosystem of users who appreciate the improvements made over the past 18 months and look forward to more. This means the Hive community will continue its momentum on making Hive faster, more scalable, and richer in its SQL semantics in support of broader and more valuable use cases.
Assuming Cloudera is genuinely interested in (re)embracing the Hive community, I take it they are OK with Impala’s progressively shrinking set of use cases getting squeezed out entirely over the next couple of Hive releases. Moreover, it’s a good thing that Hive’s use of Tez has blazed the trail for how to think about alternative Hive execution engines. In Hive 13, Tez finally provides a modern approach for translating Hive’s SQL queries into full, expressive, and efficient graphs of data processing execution. The Hive-Tez query plans have set the bar for how alternative execution engines should plug into Hive.
I hope the aforementioned HIVE-7292 avoids the rearview-mirror thinking of taking the Hive-MapReduce query plans and executing them on Spark via mappers and reducers. After all, Spark’s APIs can support a more expressive means of integration, and Hadoop’s architecture has moved on significantly from the Traditional Hadoop era, where mappers and reducers were the only choice. For example, I believe Shark, the original Hive on Spark implementation, used more sophisticated query plans than Hive-MapReduce, but Shark’s approach has recently been superseded by SparkSQL in the Apache Spark community. While there are details to be worked out and more SQL engine confusion to help people sort through along the way, Hive on Spark could turn out to be a very good idea.
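To make the contrast between mapper-and-reducer plans and graph-style plans a bit more concrete, here is a toy sketch in Python. It is only an illustration under stated assumptions: the stage names, the example query shape (join, then aggregate, then sort), and both plan layouts are hypothetical and are not output from Hive, Tez, or Spark. It simply models a chain of MapReduce jobs, where each job hands its result to the next through the distributed file system, next to a single directed acyclic graph (DAG) of the same logical operators.

```python
from collections import deque

# Toy model only: stage names and plan shapes are illustrative assumptions,
# not real Hive, Tez, or Spark planner output.

# MapReduce-style plan: a chain of separate jobs, each one map phase plus
# one reduce phase. Later jobs start by re-reading the previous job's output.
mapreduce_plan = [
    {"job": "join", "map": ["scan_orders", "scan_customers"], "reduce": "join_on_customer_id"},
    {"job": "aggregate", "map": ["read_join_output"], "reduce": "sum_by_region"},
    {"job": "sort", "map": ["read_aggregate_output"], "reduce": "order_by_total"},
]

# DAG-style plan: the same logical operators as one graph.
# Each key is a stage; its list holds the downstream stages it feeds.
dag_plan = {
    "scan_orders": ["join_on_customer_id"],
    "scan_customers": ["join_on_customer_id"],
    "join_on_customer_id": ["sum_by_region"],
    "sum_by_region": ["order_by_total"],
    "order_by_total": [],
}


def intermediate_writes(jobs):
    """Count job-to-job handoffs: every job except the last writes its
    result out so that the next job can read it back in."""
    return max(len(jobs) - 1, 0)


def mapreduce_stage_count(jobs):
    """Count map and reduce stages across the whole chain of jobs."""
    return sum(len(job["map"]) + 1 for job in jobs)


def topological_order(dag):
    """Return one valid execution order for the DAG (Kahn's algorithm)."""
    indegree = {stage: 0 for stage in dag}
    for downstream in dag.values():
        for stage in downstream:
            indegree[stage] += 1
    ready = deque(stage for stage, count in indegree.items() if count == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in dag[stage]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(dag):
        raise ValueError("plan contains a cycle")
    return order


if __name__ == "__main__":
    print("MapReduce jobs:", len(mapreduce_plan))
    print("Job-to-job handoffs:", intermediate_writes(mapreduce_plan))
    print("MapReduce stages:", mapreduce_stage_count(mapreduce_plan))
    print("DAG stages:", len(dag_plan))
    print("DAG order:", topological_order(dag_plan))
```

In this toy model the chained plan needs extra "read the previous output" stages that the DAG does not, which is the kind of overhead a graph-based plan is meant to avoid. How much of that carries over to a real engine depends on the planner and the workload.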
Work continues on the integration of Apache Optiq which brings Hive a first-class cost-based optimization framework that can take best advantage of the superior performance, scalability, throughput, and expressiveness that modern engines, such as Tez, provide. Also interesting are the efforts around “Discardable Memory and Materialized Queries” and “Discardable Distributed Memory: Supporting Memory Storage in HDFS”. These efforts are solidly aimed at making sure the broader Hadoop platform and data processing engines, including both Apache Tez and Apache Spark, are able to make best and most efficient use of available memory across a wide range of use cases.
Consistent with our investments in YARN as the architectural center of Enterprise Hadoop, just last week, we announced that Apache Spark is YARN Ready. This is a vital step forward. It ensures memory and CPU intensive Spark-based applications can co-exist within a single Hadoop cluster with all the other workloads you have deployed. Concurrent with the announcement, Hortonworks became an inaugural member of the Databricks Certified Spark Distribution program. The combination of both programs provides enterprises and the broader Hadoop ecosystem with the assurance that their tools and applications are fully compatible with Apache Spark, Apache Hadoop YARN, and the Hortonworks Data Platform.
Our focus remains on delivering a fast, safe, scalable, and manageable data platform on a consistent footprint that includes HDFS, YARN, Hive, Tez, Spark, Storm, Ambari, Knox, and Falcon to name just a few of the critical components of Enterprise Hadoop. Integrating Spark within this comprehensive set of components helps make it “enterprise ready” so that our customers can confidently adopt it.
The above only reinforces that an open source, community-driven model is the right one, and that all of Enterprise Hadoop should be delivered in open source.
If the effort proposed in HIVE-7292 moves Hive forward in a useful and valuable way, then you’ll find Hortonworks at the party. Unlike others, we’ve always been there, never left, and we actually brought more friends to the party over the past 18 months.