Moving Hadoop Beyond Batch with Apache YARN
Apache Hadoop 2.0 continues to make its way through the open source community process at the Apache Software Foundation and is getting closer to being declared “ready” from a community development perspective. Once ready, our team at Hortonworks will apply our usual enterprise rigor in providing a tested and integrated distribution that includes Hadoop 2.0 along with the other enterprise-focused services our customers and partners require.
In my roles both at Hortonworks and in the open-source Apache Hadoop community, I’m asked a lot of questions regarding the key aspects and motivations behind Hadoop 2.0. Here is some information to sate the curious mind.
First-generation success inspires second-generation focus
In the early days of Hadoop at Yahoo!, we had a very particular objective: store and process very large amounts of data to support our internet search efforts. And so the first generation of Hadoop was a purpose-built system for web-scale data processing that was embraced by Yahoo! as well as other technology-savvy early adopters such as Facebook.
As usage at Yahoo! began to expand, so did the number of ways that users wanted to interact with the data stored in Hadoop. As with any successful open-source project, the broader ecosystem of Hadoop users responded by contributing additional capabilities to the Hadoop community, with some of the most popular examples being Apache Hive for SQL-based querying, Apache Pig for scripted data processing and Apache HBase as a NoSQL database.
These additional open source projects opened the door for a much richer set of applications to be built on top of Hadoop, but they didn’t address the design limitation inherent in Hadoop itself: it was designed as a single-application system with MapReduce at the core (i.e., batch-oriented data processing).
Do we need SQL ON Hadoop or SQL IN Hadoop?
Fast forward to today, and Hadoop’s momentum has continued. Many more enterprises (not just web-scale companies) want to store ALL incoming data in Hadoop and then enable their users to interact with it in a host of different ways: batch, interactive, analyzing data streams as they arrive, and more. Most importantly, they need to do all of this simultaneously, without any single application or query consuming all of the resources of the cluster.
Nothing illustrates this dynamic more clearly than the current industry noise around SQL on Hadoop. All kinds of vendors are clamoring to provide better SQL access to data stored in Hadoop, and so they should, since SQL is understood by many users. Because Apache Hive has been the de facto SQL interface to Hadoop data for many years, we’ve found most users would like to continue to leverage the power of Hive in support of these additional interactive SQL use cases.
But building SQL access on top of Hadoop just highlights the challenge of Hadoop being a single-application system. When I run a SQL query on that data, it could consume all the resources of the cluster and cause performance issues for the other applications and jobs running there, which is not a good outcome, to say the least.
YARN enables SQL IN Hadoop and many more applications
When we set out to build Hadoop 2.0, we wanted to fundamentally re-architect Hadoop to be able to run multiple applications against relevant data sets, and to do so in a way where multiple types of applications can operate efficiently and predictably within the same cluster. This is really the reason behind Apache YARN, which is foundational to Hadoop 2.0. By managing the resource requests across a cluster, YARN turns Hadoop from a single-application system into a multi-application operating system.
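To make the idea of managing resource requests a little more concrete, here is a toy sketch in Python. It is not YARN's API or its scheduling algorithm; the class name ToyArbiter, the notion of "slots", and the percentage shares are all assumptions invented for illustration. It only shows the general principle: one arbiter sees every request, so no single application can take the whole cluster.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArbiter:
    """Hypothetical cluster-wide arbiter; an illustration, not YARN's real API."""

    total_slots: int            # assumed unit of cluster capacity
    share_pct: dict             # application name -> max percent of the cluster
    in_use: dict = field(default_factory=dict)

    def request(self, app: str, slots: int) -> int:
        """Grant as many of the requested slots as the app's share and free capacity allow."""
        cap = self.total_slots * self.share_pct[app] // 100
        held = self.in_use.get(app, 0)
        free = self.total_slots - sum(self.in_use.values())
        granted = max(0, min(slots, cap - held, free))
        self.in_use[app] = held + granted
        return granted

    def release(self, app: str, slots: int) -> None:
        """Return slots to the cluster when an application finishes work."""
        self.in_use[app] = max(0, self.in_use.get(app, 0) - slots)


arbiter = ToyArbiter(total_slots=100, share_pct={"sql": 30, "batch": 50, "stream": 20})
print(arbiter.request("sql", 100))   # a greedy query is held to its share: prints 30
print(arbiter.request("batch", 40))  # batch work still gets what it asked for: prints 40
print(arbiter.request("sql", 5))     # sql is already at its cap: prints 0
```

In this sketch the greedy SQL query asks for the entire cluster but receives only its configured share, so the batch job running alongside it is unaffected. A real scheduler is far more sophisticated, but the shape of the guarantee is the same.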
Getting back to the SQL ON Hadoop point: with YARN we now have the ability to run SQL IN Hadoop. By being IN Hadoop (built on YARN), SQL access becomes part of the platform itself and can be managed by YARN to ensure that multiple use cases can be addressed. Why stop at SQL? What about machine learning or modeling? What about processing events (data) as they arrive? Wouldn’t it be nice to manage all of these through a common system?
By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than through commercial bolt-ons that complicate the environment for customers.
And so that is the trailer for the story for Hadoop 2.0: Unleashing the Power of YARN. Coming soon to a cluster near you, summer of 2013! Stay tuned!