Thank you for reading our Data Lake 3.0 series! In part 1 of the series, we introduced what a Data Lake 3.0 is. In part 2 of the series, we talked about how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0. In part 3 of the series, we looked under the hood and demonstrated a Deep Learning framework (TensorFlow) assembly on Apache Hadoop YARN. In part 4 of the series, we explored creating a Big Data archive to power the Data Lake 3.0 storage and cut storage footprint in half. In this blog, we will discuss running pre-packaged dockerized applications in our Data Lake 3.0, powered by YARN.
As YARN has grown to support more diverse workloads, one requirement that emerged was adding first-class support in YARN for containerization, of which Docker is the most popular and widely used solution. Docker containerization provides the benefits of easy packaging and distribution. The customer no longer needs to worry about the additional modules required to run an application and can instead focus on running the application, followed by configuration fine-tuning. This cuts down both the “time to deployment” and the “time to insight” significantly. Additionally, Docker containerization provides isolation and enables running multiple versions of the same application side by side. The customer can now keep a stable, hardened production version of an application while trying out the latest version in a test/dev evaluation. While YARN can run any type of containerized workload, this blog focuses on Docker containers.
Since its inception, YARN has supported the notion of the ContainerExecutor abstraction. The ContainerExecutor is responsible for three things: localizing the resources a container needs on the node, launching and managing the lifecycle of the container, and cleaning up after the container exits.
In the past, Apache Hadoop shipped with three ContainerExecutors – DefaultContainerExecutor, LinuxContainerExecutor, and WindowsSecureContainerExecutor. Each one of these was created to address a specific need. DefaultContainerExecutor is meant for non-secure clusters where all YARN containers are launched as the same user as the NodeManager (providing no security). LinuxContainerExecutor is meant for secure clusters where tasks are launched and run as the user who submitted them. WindowsSecureContainerExecutor provides similar functionality but on Windows.
As Docker grew in popularity, DockerContainerExecutor was added to the list of ContainerExecutors. DockerContainerExecutor was the first attempt to add support for Docker in YARN. It would allow users to run tasks as Docker containers. It added support in YARN for Docker commands to allow the NodeManager to launch, monitor and clean up Docker containers as it would for any other YARN container.
There were a couple of limitations of the DockerContainerExecutor – some related to implementation and some architectural. The implementation-related limits included things such as not allowing users to specify the image they wished to run (it required all users to use the same image).
However, the bigger architectural issue is that in YARN, only one ContainerExecutor can be used per NodeManager. All tasks will use the ContainerExecutor specified in the node’s configuration. As a result, once the cluster was configured to use DockerContainerExecutor, users would be unable to launch regular MapReduce, Tez, or Spark jobs. Additionally, implementing a new ContainerExecutor means that all of the benefits of the existing LinuxContainerExecutor (such as cgroups and traffic shaping) would need to be reimplemented in the new ContainerExecutor. As a result of these challenges, DockerContainerExecutor has been deprecated in favor of a newer abstraction – container runtimes – and DockerContainerExecutor will be removed in a future Apache Hadoop release.
To address these deficiencies, YARN added support for container runtimes in LinuxContainerExecutor. Container runtimes split up the ContainerExecutor into two distinct pieces – the underlying framework required to carry out the functionalities and a runtime piece which can change depending on the type of container you wish to launch. With these changes, we solve the architectural problem of being able to run regular YARN process containers alongside Docker containers. The lifecycle of the Docker container is managed by YARN just like any other container. The change also allows YARN to add support for other containerization technologies in the future.
Currently, two runtimes exist: the process-tree based runtime (DefaultLinuxContainerRuntime) and the new Docker runtime (DockerLinuxContainerRuntime). The process-tree based runtime launches containers the same way YARN has always done, whereas the Docker runtime launches Docker containers. Interfaces exist that can be extended to add new container runtimes. Support for container runtimes, and specifically the DockerLinuxContainerRuntime, is being added via YARN-3611.
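The delegation between the two runtimes can be pictured with a small shell sketch. This is illustrative only – it is not Hadoop code – but it mimics, in spirit, how the LinuxContainerExecutor routes a container launch to one runtime or the other based on the container's environment:

```shell
#!/usr/bin/env bash
# Illustrative sketch only (not Hadoop code): pick a container runtime
# based on the environment supplied with the container, the way the
# LinuxContainerExecutor delegates between its runtimes.
pick_runtime() {
  if [ "${YARN_CONTAINER_RUNTIME_TYPE:-}" = "docker" ]; then
    echo "DockerLinuxContainerRuntime"   # launch as a Docker container
  else
    echo "DefaultLinuxContainerRuntime"  # classic process-tree launch
  fi
}

# A container that asks for Docker gets the Docker runtime...
( export YARN_CONTAINER_RUNTIME_TYPE=docker; pick_runtime )  # DockerLinuxContainerRuntime
# ...while containers on the same node that don't ask for it run as before.
pick_runtime                                                 # DefaultLinuxContainerRuntime
```

This per-container choice is exactly what the deprecated DockerContainerExecutor could not offer: there, the decision was made once, per node, for every container.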
In the sections that follow, we will outline the steps necessary to run the distributed shell sample application in Docker containers via the container runtime.
For the sake of this example, it is expected that you have already installed Docker on all of the machines running NodeManagers in the cluster. It is recommended that a recent version of Docker be used. Currently, only Docker on Linux is supported by the container runtime.
Please note that the Docker on YARN support is alpha/experimental and is not secure or recommended for production use.
To enable running Docker containers on YARN, several configuration properties need to be set in yarn-site.xml and container-executor.cfg. Make sure to restart the NodeManagers following the configuration changes. The configuration is as follows:
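As a rough sketch of what that configuration looks like – property names follow the Apache Hadoop Docker-on-YARN documentation, and exact names, values, and availability may differ by release, so treat this as an assumption to verify against the docs for your version – yarn-site.xml needs the LinuxContainerExecutor enabled and the Docker runtime allowed:

```xml
<!-- yarn-site.xml (illustrative snippet; verify property names for your release) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
```

and container-executor.cfg needs the Docker feature switched on, along the lines of:

```ini
; container-executor.cfg (illustrative snippet; verify for your release)
yarn.nodemanager.linux-container-executor.group=hadoop
[docker]
  module.enabled=true
  docker.binary=/usr/bin/docker
```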
Environment variables are currently used (this will likely change in the future) to configure which container runtime should be used and to provide configuration to that specific container runtime. To use the DockerLinuxContainerRuntime, the environment variable YARN_CONTAINER_RUNTIME_TYPE must be set to docker and an image must be supplied. Below is an example that runs the supplied shell command inside a Docker container using the centos image from Docker Hub with the “latest” tag.
export DJAR=./yarn/hadoop-yarn-applications-distributedshell-*.jar
yarn jar $DJAR \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos \
  -shell_command "sleep 120" \
  -jar $DJAR \
  -num_containers 1
While the foundation for Docker in YARN is solid, the Docker ecosystem is large and comes with many considerations. Below is a list of several key efforts currently on-going in the YARN community.
Container runtimes greatly enhance YARN’s ability to support a multitude of containerization technologies without reworking the core execution path. These improvements allow YARN to go beyond data-centric applications and allow nearly any workload to take advantage of YARN’s resource-management capabilities at scale. Support for Docker in YARN is well underway and available for early testing, though as noted above it is alpha/experimental and not recommended for production use. Thanks to the Apache Hadoop community, and specifically the co-authors of this blog: Varun Vasudev, Sidharta Seethana, and Vinod Vavilapalli.
Read the next blog post in the series: Data Lake 3.0 Part 6 – A Self-Diagnosing Data Lake