Posts by Bob Page:


Hortonworks Data Platform 2.0 Alpha 2 now available: focus on performance

We are very pleased to announce that the Alpha 2 release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha 2) is now available for download!

A key focus of HDP 2.0 Alpha 2 is performance, as announced in the Stinger Initiative: the release includes a series of enhancements to the performance of Apache Hive for interactive SQL queries.  In fact, HDP 2.0 Alpha 2 was used to perform the tests announced yesterday, which showed a 45X performance increase using Hive.  There is much more to come, but we are pleased with the early results and encourage Hive users to take a look and continue to give us feedback.

Consistent with HDP 2.0 Alpha 1, this version is built from the developmental Apache Hadoop 2.0 line and includes Apache YARN, a next-generation resource-management and application framework that enables Hadoop to support an ever-expanding range of use cases.  We are extremely excited about the opportunities that YARN enables – for background, check out Arun Murthy’s blog post series where he provides a YARN overview.

Notable new components over Alpha 1 include:

  • Apache Tez: A new Apache project that provides an optimized data processing framework on top of YARN. Tez is a general-purpose, highly customizable framework that simplifies data processing tasks across both small-scale, low-latency and large-scale, high-throughput workloads in Hadoop. Tez can provide an order of magnitude performance boost for the broader ecosystem of data processing tools such as Apache Hive and Apache Pig.
  • Apache Hive Interactive Query: Beyond the speedups made possible by Apache Tez, several new features were added to speed Hive queries. A new file format called the ORCFile (optimized RC file) optimizes how data is stored and accessed in Hive, and significant query optimizations reduce latency and improve performance.

Note that Tez is not enabled by default.  Instructions for enabling it, and for allowing Hive to use Tez, are in the installation guide.

Learn More
Please take a look at the Hortonworks Documentation to learn more about installing and using HDP 2.0 Alpha 2.

Download It
You can download HDP 2.0 Alpha 2 from the Hortonworks Download site.

Tell Us About It
Please visit the HDP 2.0 Alpha Forum to ask questions, get help, provide feedback and hear what others are doing with HDP. 

We are excited about the opportunities that Hadoop 2 provides for the future of Hadoop and large-scale data processing. HDP 2.0 Alpha 2 is a key milestone that provides organizations with a packaged release to evaluate and gain experience with the upcoming Apache Hadoop 2 technology stack. We look forward to your feedback on HDP 2.0 Alpha 2 while we work with the community to make Hadoop 2 a stable reality. Enjoy!

Note: This Alpha release is a technology preview intended to gather feedback from outside Hortonworks. Some features are missing or incomplete. Some APIs may change. Do not use Alpha 2 in production. Keep away from open flame. Support is available only via the Forums.

Plastics, SQL and the Extensible Future of Hadoop

Mr. McGuire: I just want to say one word to you. Just one word.

Benjamin: Yes, sir.


Mr. McGuire: Are you listening?

Benjamin: Yes, I am.

Mr. McGuire: Plastics.

 

The advice given by Mr. McGuire in 1967’s The Graduate was certainly prophetic — plastics has become one of the largest manufacturing industries in the U.S. (Today, Mr. McGuire would probably say “Data.” But this post isn’t about career choices.)

Plastics initially took on familiar roles, providing rough equivalents for materials that were more expensive, in short supply, or otherwise limited in ways that made plastics a viable alternative; glass, wood and metal were commonly imitated. But plastics were often seen as a poor replacement. Eventually, two things happened: new uses were found that went far beyond existing use cases, and the technology got better at resembling the materials it mimicked.

I think history is repeating itself, this time with Hadoop.

First though: Analyzing all that Hadoop data via native MapReduce doesn’t leverage existing SQL skills and technologies, which represent a significant investment. Because Apache Hive, the most widely used technology that brings SQL to Hadoop, is not a complete implementation of SQL, nor designed for interactive queries, we’re seeing a bevy of announcements, both open source and proprietary, that allow SQL-on-Hadoop to meet those use cases. I will avoid enumerating them — more may have appeared since you started reading this.

These SQL-on-Hadoop efforts are like the early days of plastics. Making Hadoop mimic the characteristics of a relational database query language is important and worth investing in. Some will be discarded as poor imitations — especially for customers that are used to enterprise-class warehouse SQL engines like Teradata. Others will get better, and even implement Hadoop-specific innovations, moving SQL forward. Even for the really good technologies, users are still stuck with a thirty-plus-year-old framework and relational model, regardless of how many UDFs and “calls to Hadoop” functions exist – otherwise many BI tools will need to be modified for each of these one-off implementations. Not to mention the operational overhead of new storage layers, resource management, etc.

To be clear: SQL is incredibly important, and will be for a long time. Making SQL-on-Hadoop work well is a very high priority across the industry, including at Hortonworks: witness the Stinger Initiative. But SQL alone doesn’t demonstrate the firepower of this fully armed and operational battle station. Like plastics, Hadoop is a breakthrough technology platform, and innovation on that platform is where customers will ultimately get the real value.

Unfortunately, in the rush to meet market demand, many of the SQL-on-Hadoop efforts ignore Hadoop’s emerging architecture. While YARN generalizes the Hadoop resource management framework, Apache Tez generalizes the data processing framework, in order to support an amazing array of future applications. YARN+Tez represents the future of Hadoop, and the future of enterprise data.

I recently met with a customer who had an interesting observation: “I don’t understand why [vendor] would implement SQL outside of YARN and Tez. All the extra resource management, operational cost, all the additional work involved. It is like they don’t really understand where Hadoop is going.” The obvious answer is that nobody builds SQL on YARN and Tez because YARN and Tez aren’t available today. But that’s a short-term answer. YARN and Tez have wide community support and represent a large investment across the community. The community also continues to invest heavily in advancing Apache Hive. Letting Hive use the Tez speed innovations, and freeing it from MapReduce, will give it the faster execution analysts need, within the Apache Hadoop framework. If this customer uses a non-Hive solution today, how will that solution compare with Hive on Tez?

Objectively comparing future products to future products is a fool’s errand. Technology moves forward, and the solutions will get better. The issue is really one of technology philosophy and approach. Enterprise customers don’t simply make decisions based on the bits that exist today and will change tomorrow. They want to make sure they are investing in a future.

Which brings me back to the larger issue.

If yesterday’s “early adopter” Hadoop use case was ETL, and today’s “early majority” is SQL, tomorrow may be streaming, or iterative programming, or machine learning, or something we haven’t thought of yet — but it should all work within the data framework we call Hadoop. That is why at Hortonworks, we’re putting our energy into improving Hadoop, rather than coding around it, or adding proprietary extensions to it. The community investments in HDFS and YARN, and generalizing the fundamental building blocks of Hadoop, will allow us to both create a new data ecosystem that makes Apache Hive a first-class SQL engine and enable a new wave of innovation in integrated data management and analytics. That’s a huge opportunity for the industry, and I’m excited to see what comes next.