Plastics, SQL and the Extensible Future of Hadoop
Mr. McGuire: I just want to say one word to you. Just one word.
Benjamin: Yes, sir.
Mr. McGuire: Are you listening?
Benjamin: Yes, I am.
Mr. McGuire: Plastics.
The advice given by Mr. McGuire in 1967’s The Graduate was certainly prophetic — plastics has become one of the largest manufacturing industries in the U.S. (Today, Mr. McGuire would probably say “Data.” But this post isn’t about career choices.)
Plastics initially took on familiar roles, providing rough equivalents for materials that were more expensive, in short supply, or otherwise worth replacing. Glass, wood and metal were commonly imitated, but plastics were often seen as a poor substitute. Eventually, two things happened: new uses were found that went far beyond the existing use cases, and the technology got better at mimicking the materials it replaced.
I think history is repeating itself, this time with Hadoop.
First though: Analyzing all that Hadoop data via native MapReduce doesn’t leverage existing SQL skills and technologies, which represent a significant investment. Because Apache Hive, the most widely used technology for bringing SQL to Hadoop, is neither a complete implementation of SQL nor designed for interactive queries, we’re seeing a bevy of SQL-on-Hadoop announcements, both open source and proprietary, that aim to fill those gaps. I will avoid enumerating them; more may have appeared since you started reading this.
These SQL-on-Hadoop efforts are like the early days of plastics. Making Hadoop mimic the characteristics of a relational database query language is important and worth investing in. Some will be discarded as poor imitations, especially by customers used to enterprise-class warehouse SQL engines like Teradata. Others will get better, and even implement Hadoop-specific innovations that move SQL forward. But even with the really good technologies, users are still confined to a thirty-plus-year-old framework and relational model, no matter how many UDFs and “calls to Hadoop” functions exist; step outside that model and many BI tools would need to be modified for each of these one-off implementations. Not to mention the operational overhead of new storage layers, resource management and so on.
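For readers who haven’t written one, here is a minimal sketch of the kind of Hive UDF mentioned above, based on Hive’s classic org.apache.hadoop.hive.ql.exec.UDF base class; the class name and the lowercasing behavior are purely illustrative.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A minimal, illustrative Hive UDF: lowercases a string column.
// Packaged into a jar and registered with CREATE TEMPORARY FUNCTION,
// it can then be called from HiveQL like any built-in function.
public class LowerCaseUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;        // Hive passes NULLs through to the UDF.
        }
        return new Text(input.toString().toLowerCase());
    }
}
```

Useful as such extensions are, they still live entirely inside the relational, SQL-shaped view of the data, which is exactly the limitation described above.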
To be clear: SQL is incredibly important, and will be for a long time. Making SQL-on-Hadoop work is a very high priority across the industry, including at Hortonworks; witness the Stinger Initiative. It just doesn’t demonstrate the firepower of this fully armed and operational battle station. Like plastics, Hadoop is a breakthrough technology platform, and innovation beyond imitation is where customers will ultimately get the real value.
Unfortunately, in the rush to meet market demand, many of the SQL-on-Hadoop efforts ignore Hadoop’s emerging architecture. While YARN generalizes the Hadoop resource management framework, Apache Tez generalizes the data processing framework, in order to support an amazing array of future applications. YARN+Tez represents the future of Hadoop, and the future of enterprise data.
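To make the resource-management point concrete, here is a rough sketch of how any application, SQL engine or otherwise, asks YARN for cluster resources through the public YarnClient API; the application name and resource sizes are placeholders, and a real submission would also need an ApplicationMaster launch context.

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnResourceSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the YARN ResourceManager using the cluster configuration.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask YARN for a new application id. MapReduce, Tez, and any other
        // framework go through this same negotiation for cluster resources.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("hypothetical-analytics-engine"); // placeholder name

        // Resources are expressed generically (memory in MB, virtual cores),
        // not in MapReduce-specific terms such as map or reduce slots.
        context.setResource(Resource.newInstance(1024, 1));

        System.out.println("New application id: " + context.getApplicationId());
        yarnClient.stop();
    }
}
```

A SQL engine that plugs into this layer shares scheduling, capacity and operations with everything else on the cluster; one that does not has to rebuild all of that on its own.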
I recently met with a customer who had an interesting observation: “I don’t understand why [vendor] would implement SQL outside of YARN and Tez. All the extra resource management, operational cost, all the additional work involved. It is like they don’t really understand where Hadoop is going.” The obvious answer is that nobody builds SQL on YARN and Tez, because YARN and Tez aren’t available today. But that’s a short-term answer. YARN and Tez have wide support and represent a large investment across the community. The community also continues to invest heavily in advancing Apache Hive. By letting Hive use Tez’s speed innovations and freeing it from MapReduce, Hive will get the faster execution analysts need, within the Apache Hadoop framework. If this customer adopts a non-Hive solution today, how will that solution compare with Hive on Tez?
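For concreteness, here is a hedged sketch of what “Hive on Tez” looks like from the analyst’s side, over the standard HiveServer2 JDBC driver; the host, credentials and web_logs table are invented, and setting hive.execution.engine=tez assumes a Hive build where the Tez engine is available.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveOnTezSketch {
    public static void main(String[] args) throws Exception {
        // Older Hive JDBC drivers need explicit registration.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, database and user are placeholders for a real HiveServer2.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
        try (Statement stmt = conn.createStatement()) {
            // Ask Hive to plan queries as Tez DAGs instead of chains of MapReduce jobs.
            stmt.execute("SET hive.execution.engine=tez");

            // The SQL itself does not change; the same statement runs on either engine.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM web_logs "
                    + "GROUP BY page ORDER BY hits DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        } finally {
            conn.close();
        }
    }
}
```

The point is that existing SQL tooling and skills carry over unchanged; only the execution engine underneath gets faster.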
Objectively comparing future products to future products is a fool’s errand. Technology moves forward, and the solutions will get better. The issue is really one of technology philosophy and approach. Enterprise customers don’t simply make decisions based on the bits that exist today and will change tomorrow; they want to make sure they are investing in a future.
Which brings me back to the larger issue.
If yesterday’s “early adopter” Hadoop use case was ETL, and today’s “early majority” use case is SQL, tomorrow’s may be streaming, or iterative programming, or machine learning, or something we haven’t thought of yet; all of it should work within the data framework we call Hadoop. That is why, at Hortonworks, we’re putting our energy into improving Hadoop rather than coding around it or adding proprietary extensions to it. The community investments in HDFS and YARN, and in generalizing the fundamental building blocks of Hadoop, will allow us both to create a new data ecosystem that makes Apache Hive a first-class SQL engine and to enable a new wave of innovation in integrated data management and analytics. That’s a huge opportunity for the industry, and I’m excited to see what comes next.