Thank you for reading our Data Lake 3.0 series! In part 1 of the series, we briefly introduced the power of leveraging prepackaged applications in Data Lake 3.0 and how the focus will shift from platform management to solving business problems. In this post, we build on that idea to answer how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0.
Apache® Hadoop™ YARN is the modern resource-management platform that enables applications to share a common infrastructure of servers and storage. YARN is now morphing into a multi-colored platform of choice! YARN’s vision has always been to enable Hadoop to run many different workloads. The next steps in the journey are about dialing up workload diversity and making the creation and deployment of modern data apps easy. Without further ado, let’s first recap how YARN has acted as the platform of choice thus far, before elaborating on the evolution of a multi-colored YARN as part of Data Lake 3.0.
Apache Hadoop YARN is built as a general-purpose resource-management platform. YARN’s core concepts are applications, containers, and resources. A container is a virtualized execution environment in which a set of processes or tasks utilizes the physical resources of the underlying machine. Administrators set up a cluster of machines to support many such containers. Users then write applications, each a set of tasks or processes executing in a collection of containers.
|“YARN’s core concepts – applications, containers and resources.”|
Making use of these concepts of applications and containers, YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.
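To make these concepts concrete, here is a minimal sketch in plain Python of how resources, containers, and applications relate. This is purely illustrative and is not YARN’s actual API; the class and function names are invented, and the placement logic is far simpler than YARN’s real locality- and queue-aware scheduling.

```python
# Illustrative sketch (not YARN's real API): resources, nodes, and a
# naive scheduler that places container requests on the first node
# with enough remaining capacity.
from dataclasses import dataclass, field

@dataclass
class Resource:
    memory_mb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        return (self.memory_mb <= other.memory_mb
                and self.vcores <= other.vcores)

@dataclass
class Node:
    name: str
    capacity: Resource
    used: Resource = field(default_factory=lambda: Resource(0, 0))

    def try_allocate(self, request: Resource) -> bool:
        remaining = Resource(self.capacity.memory_mb - self.used.memory_mb,
                             self.capacity.vcores - self.used.vcores)
        if request.fits_in(remaining):
            self.used.memory_mb += request.memory_mb
            self.used.vcores += request.vcores
            return True
        return False

# An "application" here is just a list of container requests.
def schedule(requests, nodes):
    placements = []
    for req in requests:
        for node in nodes:
            if node.try_allocate(req):
                placements.append((req, node.name))
                break
    return placements

nodes = [Node("n1", Resource(8192, 4)), Node("n2", Resource(8192, 4))]
app = [Resource(4096, 2), Resource(4096, 2), Resource(4096, 2)]
print(schedule(app, nodes))
```

The third request spills over to the second node once the first is full, which is the essence of what YARN’s centralized scheduler does, at vastly larger scale and with far richer policies.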
YARN is being used in production at a wide variety of organizations to host a wide variety of data-intensive applications, such as batch workloads (Hadoop MapReduce), interactive query processing (Apache Hive, Apache Tez, Apache Spark), and real-time processing (Apache Storm). For those of you familiar with its history, YARN originated from the need to evolve Hadoop to support not just MapReduce but any arbitrary processing engine. As the engines mentioned above came to the fore over time, YARN’s core architectural design served their needs well, requiring only occasional incremental improvements. Over the years, YARN has easily supported a wide spectrum of frameworks.
The power of YARN is not limited to just enabling all these different programming paradigms on shared datasets (typically over a distributed storage system like HDFS) and physical hardware. YARN brings to the table a variety of platform features that users rely on for an end-to-end big data success story. YARN can apply its key strengths – cost-effective resource management, powerful scheduling primitives, resource isolation and multi-tenancy – to a myriad of resources, varying from small pools of special-purpose machines to datacenter-scale infrastructure built out of commodity hardware.
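As a small illustration of how multi-tenancy and elastic sharing are typically expressed, the fragment below uses Capacity Scheduler properties from `capacity-scheduler.xml`. The property names are real, but the queue names and percentages are invented for this example:

```
<!-- Two tenant queues sharing the cluster: "analytics" is guaranteed
     40% of cluster capacity but can elastically grow to 60% when the
     "default" queue is idle. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
  <value>60</value>
</property>
```

The gap between `capacity` and `maximum-capacity` is what gives tenants elastic sharing without sacrificing their guaranteed minimum.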
YARN is the Data Operating System that powers our Data Lake 3.0 vision. While YARN initially focused on large-scale but short-running apps (often simply referred to as jobs), it is also the perfect platform for long-running services, as well as apps that mix both. YARN’s scheduler and its key abstractions are general enough to support a variety of applications, including batch jobs, long-running streaming workloads, and classical services. What separates YARN from other platforms, however, is its special support for data-intensive applications.
Extending YARN’s inherent capabilities to handle data-intensive applications, we are seeing clear signals of a perfect storm driven by two major forces. On the business front, our advanced users are looking to solve end-to-end business problems as the next phase in the big-data maturity curve. On the technology front, we are seeing wide adoption of containerized workloads, which provide ease of distribution, packaging, and isolation. We discuss both of these drivers next.
Let’s revisit how the Hadoop ecosystem has historically been built. Since the beginning, the Apache ecosystem has focused on singular storage and compute engines, each addressing a specific problem in the larger big-data space. This is akin to the Unix mantra of “do one thing and do it well”. So far, this approach has served the developer community and the user base well. Developers could focus on a single problem (or set of problems) with undivided attention and solve it all the way through. Users could then bring these different but ultimately well-integrated tools together to address their business use-cases.
During the past few years though, end-to-end business use-cases have evolved to another level.
Manual plumbing of all these differently colored services is tiresome! Further, there is a clear need for seamless aggregate deployment, lifecycle management, and application wire-up. This is the gap that must be bridged between what these end-to-end business use-cases need from the platform and what the platform offers today. With these features in place, the authors of business use-cases can focus singularly on their business logic.
|“Modern data applications – assemblies – span across multiple tools and must be 100X easier to build, wire up, deploy, manage, monitor, scale, secure and govern!”|
Further, composability and reusability need to be starting assumptions: once a service (like Kafka on YARN) or an end-to-end application (like an IoT app) is made to work well, other members of the community should be able to build more complex structures on top of these existing components.
We thus want to enable businesses to care a bit less about the infrastructure and more about driving their end-to-end use-cases. We call such an end-to-end business application an Assembly.
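To make the idea concrete, here is a purely hypothetical sketch of what an assembly definition could look like: a named, versioned unit composed of reusable components, each with its own instance count and resource needs. Every field name below is invented for illustration; the actual specification is part of the ongoing community work described later in this post.

```
{
  "name": "iot-ingest-assembly",
  "version": "1.0",
  "components": [
    { "name": "kafka",     "instances": 3, "resource": { "memory_mb": 4096, "vcores": 2 } },
    { "name": "storm",     "instances": 5, "resource": { "memory_mb": 2048, "vcores": 1 } },
    { "name": "dashboard", "instances": 1, "resource": { "memory_mb": 1024, "vcores": 1 } }
  ]
}
```

The point of such a declarative spec is that the platform, not the user, handles deployment, wiring, monitoring, and scaling of the whole unit.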
You may be wondering: “Why not statically manage all these applications, services and assemblies?”
This type of ad hoc management works at small scale but is undesirable at larger scale, given the ubiquity of hardware failures, the need for upfront capacity planning, and the burden of manual scaling and elasticity. This is fundamentally the same resource-management problem that YARN was built to address!
Simplified deployment and scaling, enhanced discovery, management, monitoring of assemblies as a unit are some of the needs from the platform. An assembly can further be a fundamental unit of version control (of business logic), component-reuse, and security.
Why not build assemblies manually? Beyond simple applications and services, manually managed assemblies pose a much tougher problem for both operators and application developers. Having the platform automate the management of assemblies frees up significant productivity for building and managing higher-order apps.
On the technology front, there is another revolution underway in the industry: containers. Simply put, containers are a lightweight virtualization mechanism for executing programs in isolated environments, popularized by the open-source technology Docker. While restricted to processes, they offer isolation and resource-management benefits similar to those of virtual machines, but with very little overhead. Further, their packaging mechanisms offer the same management simplicity as VM images.
YARN has always had the notion of a logical container – it can be a single application process, a group of processes forming a process tree, or a process tree placed under a memory/CPU cgroup.
With Docker, we can now also enable users to leverage industry-standard packaging of their bits.
The packaging story is one of the cornerstones of enabling varied types of applications. To this end, the YARN community has been working on native integration of Docker containers in YARN. The primary effort in this area is support for “Container Runtimes” in YARN, so that in addition to process-tree containers, one can run Docker containers.
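As a sketch of how this integration surfaces to operators, the in-progress work exposes Docker as an alternate runtime through NodeManager configuration. The property name below reflects the community work at the time of writing and may change before it is finalized:

```
<!-- yarn-site.xml: allow the Docker runtime alongside the default
     process-tree runtime (illustrative; subject to change) -->
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
```

An application then opts in per container through environment variables such as `YARN_CONTAINER_RUNTIME_TYPE=docker` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<image>`, leaving applications that don’t set them running exactly as before.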
To top it all off, irrespective of the container type, users can rely on the same platform features – isolation, queuing models, scheduling strategies, and so on.
These new use-cases deserve the same set of powerful platform features that short-running, disparate jobs have long enjoyed – multi-tenancy, massive scale, security, elastic sharing, and more. Not reinventing the wheel and simply reusing these platform features is also a massive productivity boost.
We are close to delivering a kaleidoscopic YARN to encompass all these different use-cases, with much more agility.
To this end, the YARN community is working towards enabling containers, long-running services, and complex assemblies in a first-class manner. YARN as a technology has always had the right foundations to support a wide variety of applications and services. So, the next leg of our journey will focus on simplified application authoring and packaging, simplified and first-class services, and the notion of reusable, composable assemblies.
I talked about some of this very early integration at last year’s Hadoop Summit in San Jose. The talk recording is embedded below:
And the corresponding slides here:
Please stay tuned for more upcoming blogs in our Data Lake 3.0 series, where we shed more light on some of the concrete sub-efforts happening in the Apache Hadoop YARN community. We will first follow up this post with another exciting blog that puts it all together, showcasing a Deep Learning TensorFlow Assembly deployed on YARN’s cluster-wide resources (including GPUs).
Read the next blog post in the series: Data Lake 3.0 Part 3 – Distributed Tensorflow Assembly on Apache Hadoop YARN