Merv Adrian couldn’t have said it better. In his blog post from the weekend, he continued his quest to define Hadoop. And it is no easy quest, as Hadoop’s components are evolving at a pace that is, frankly, astounding.
Hadoop – the data processing engine based on MapReduce – is being superseded by new processing engines: Apache Tez, Apache Storm, Apache Spark and others. YARN makes any data processing future possible.
But Hadoop the platform – thanks to YARN as its architectural center – is the future for data management, with a selection of best-fit processing engines for any given use case: from batch to interactive to real-time. More than that, Hadoop has become a movement. It is the center of gravity for everyone engaged in the challenge of big data.
If you’re only just switching on to Hadoop, then it’s worth exploring Hadoop then, Hadoop now, and Hadoop next.
I joined Hortonworks in late 2011, during the height of what’s being referred to as the era of “Traditional Hadoop”. Traditional Hadoop uses HDFS for scalable storage and MapReduce as the sole system and framework that workloads ride atop. In this era, mappers and reducers ruled, and an ecosystem of tools climbed onto the elephant: Hive for SQL processing and Pig for scripting data flows, to name just a few. Early Hadoop vendors also focused on bolting on basic levels of operations management, security, and other capabilities.
The sentiments of this era were:
and my favorite
In January 2008, Arun Murthy and the team at Yahoo! saw the need for Hadoop to move beyond its MapReduce-only roots and described this requirement for the Hadoop community in MAPREDUCE-279, which gave rise to YARN. Work on Hadoop’s next-generation architecture powered by YARN accelerated in 2011, when Hortonworks was founded, and culminated in 2013 with the GA release of the next generation of Hadoop (Hadoop 2), with YARN as the thing “at the center that holds it all together like an operating system”.
I’m always amused by bombastic, saber-rattling statements such as “Hadoop is Dead”. So when I read the recent BusinessWeek article by Ashlee Vance about Google’s Dataflow and Hadoop, and the InfoWorld response by Serdar Yegulalp titled “Why Google Cloud Dataflow is no Hadoop killer”, two things happened. It highlighted the need to correct a common misconception: MapReduce, a computational framework, is NOT the same thing as Hadoop. It also elicited my response of:
Yes. Traditional Hadoop and the era of batch-only mappers and reducers is dead!
The new era of “Enterprise Hadoop” with YARN as its architectural center has transformed Hadoop into a platform that goes FAR BEYOND traditional batch-oriented mappers and reducers. Enterprise Hadoop allows for multiple ways of interacting with data, including interactive SQL processing (a la Hive), iterative in-memory analytics (a la Spark), real-time stream processing (a la Storm) and online data processing (a la HBase and Accumulo), and much more. YARN serves as a common data operating system that enables the Hadoop ecosystem (both open source and commercial) to natively integrate innovative data processing engines into Hadoop while extending consistent security, governance and operations across the platform. Hortonworks’ YARN Ready program helps commercial vendors certify that their YARN-based solutions plug in properly, further expanding the choice of Data Access engines that can run natively in Hadoop. It’s important to note that YARN does all of this in a way that allows traditional batch-oriented mappers and reducers to co-exist, preserving the investments made in all of the Traditional Hadoop applications.
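To make the “data operating system” idea concrete, here is a toy sketch of several processing engines drawing containers from one shared pool. It is an illustration only: `ToyResourceManager`, its methods, and the container counts are hypothetical names and numbers invented for this example, and none of it reflects YARN’s actual API or scheduling policy.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager (not YARN's API)."""

    total_containers: int
    allocations: dict = field(default_factory=dict)

    def available(self) -> int:
        """Containers not currently held by any engine."""
        return self.total_containers - sum(self.allocations.values())

    def request(self, engine: str, containers: int) -> bool:
        """Grant the request only if enough containers are free."""
        if containers <= 0 or containers > self.available():
            return False
        self.allocations[engine] = self.allocations.get(engine, 0) + containers
        return True

    def release(self, engine: str) -> None:
        """Return all of an engine's containers to the shared pool."""
        self.allocations.pop(engine, None)


# Batch, interactive, streaming and in-memory engines share one pool of 10.
rm = ToyResourceManager(total_containers=10)
granted = {
    "mapreduce": rm.request("mapreduce", 4),  # batch
    "hive": rm.request("hive", 3),            # interactive SQL
    "storm": rm.request("storm", 2),          # stream processing
    "spark": rm.request("spark", 5),          # refused: only 1 container is free
}

# Once the batch job finishes, its containers become available to other engines.
rm.release("mapreduce")
spark_retry = rm.request("spark", 5)
```

The point of the sketch is the design choice rather than the mechanics: because every engine asks the same manager for resources, a new engine can be added without the existing batch workloads having to change.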
Many companies depend on Enterprise Hadoop and the power of YARN for transformative use cases, such as moving from a world of nightly targeted emails to real-time recommendations using kiosk data, GPS data, and more. There was an excellent panel of Enterprise Hadoop users at the recent Hadoop Summit where representatives from BNY Mellon, British Gas/Centrica, Kohl’s, Rogers, Target, and TrueCar shared their Enterprise Hadoop experiences.
This means that Enterprise Hadoop powered by YARN is truly an extensible PLATFORM that facilitates both ongoing innovation and mainstream adoption by enterprises of all types and sizes across a wide range of production use cases at scale. As we know from this week, Apache Spark has garnered tremendous interest, much as Apache Storm continues to do. The point is, new data engines will continue to emerge, and YARN is there to provide a clean and easy way for that innovation to plug in to Hadoop so that enterprises can benefit.
So repeat after me:
Traditional Hadoop and the era of batch-only mappers and reducers is dead!
YARN makes any data processing future possible.
I encourage you to watch Arun Murthy’s keynote from the recent Hadoop Summit to hear how YARN is unlocking Enterprise Hadoop’s true potential.
In this video, Arun talks about the rationale for YARN as well as Apache Tez, a modern data processing engine inspired by Microsoft’s Dryad paper that Apache Hive uses to enable interactive SQL queries. Also covered is Apache Slider, which enables “always on” services such as Apache HBase and Apache Storm to run more easily on YARN.