May 10, 2017

Part 5 of Data Lake 3.0: YARN and Containerization: Supporting Docker and Beyond


Thank you for reading our Data Lake 3.0 series! In part 1 of the series, we introduced what Data Lake 3.0 is. In part 2, we talked about how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0. In part 3, we looked under the hood and demonstrated a Deep Learning framework (TensorFlow) assembly on Apache Hadoop YARN. In part 4, we explored creating a Big Data archive to power Data Lake 3.0 storage and cut the storage footprint in half. In this blog, we will discuss running pre-packaged dockerized applications in our Data Lake 3.0, powered by YARN.

As YARN has grown to support more diverse workloads, one of the requirements that emerged was first-class support in YARN for containerization, of which Docker is the most popular and widely used solution. Docker containerization provides the benefits of easy packaging and distribution. The customer no longer needs to worry about the additional modules required to run an application and can instead focus on running the application, followed by configuration fine-tuning. This cuts down both the “time to deployment” and the “time to insight” significantly. Additionally, Docker containerization provides isolation and enables running multiple versions of the same application side by side. The customer can now keep a stable, hardened production version of an application while trying the latest version of the application in test-dev evaluation. While YARN can run any type of containerized workload, this blog is focused on Docker containers.

Background

Previous state

Since its inception, YARN has supported the notion of the ContainerExecutor abstraction. The ContainerExecutor is responsible for three things:

  1. Localizing (downloading and setting up) the resources required for running the container on any given node
  2. Setting up the environment for the container to run (such as creating the directories for the container)
  3. Managing the lifecycle of the YARN container (launching, monitoring and cleaning up the container)
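
To make the abstraction concrete, below is a heavily simplified, illustrative Java sketch of a ContainerExecutor shaped around those three responsibilities. The method names approximate the real class in the hadoop-yarn-server-nodemanager module, and the placeholder context types are hypothetical stand-ins so the sketch is self-contained; consult the Hadoop source for the exact API.

// Simplified, illustrative sketch only; not the exact upstream API.
// Placeholder context types stand in for the real NodeManager ones.
class LocalizerStartContext { /* user, app id, local dirs, ... */ }
class ContainerStartContext { /* container, launch script, env, ... */ }
class ContainerSignalContext { /* container pid, signal, ... */ }

public abstract class ContainerExecutor {

  // 1. Localization: download and set up the resources (jars, files,
  //    archives) the container needs on this node.
  public abstract void startLocalizer(LocalizerStartContext ctx)
      throws java.io.IOException;

  // 2. Environment setup and launch: create the container's working
  //    directories, then start the container process and return its
  //    exit code.
  public abstract int launchContainer(ContainerStartContext ctx)
      throws java.io.IOException;

  // 3. Lifecycle management: signal (monitor, kill) a running
  //    container; its on-disk state is cleaned up after exit.
  public abstract boolean signalContainer(ContainerSignalContext ctx)
      throws java.io.IOException;
}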

In the past, Apache Hadoop shipped with three ContainerExecutors – DefaultContainerExecutor, LinuxContainerExecutor, and WindowsSecureContainerExecutor. Each one of these was created to address a specific need. DefaultContainerExecutor is meant for non-secure clusters where all YARN containers are launched as the same user as the NodeManager (providing no security). LinuxContainerExecutor is meant for secure clusters where tasks are launched and run as the user who submitted them. WindowsSecureContainerExecutor provides similar functionality but on Windows.

Experimental DockerContainerExecutor

As Docker grew in popularity, DockerContainerExecutor was added to the list of ContainerExecutors. DockerContainerExecutor was the first attempt to add support for Docker in YARN. It would allow users to run tasks as Docker containers. It added support in YARN for Docker commands to allow the NodeManager to launch, monitor and clean up Docker containers as it would for any other YARN container.

DockerContainerExecutor had a couple of limitations, some related to implementation and some architectural. The implementation limits included things such as not allowing users to specify the image they wished to run (it required all users to use the same image).

However, the bigger architectural issue is that YARN allows only one ContainerExecutor per NodeManager. All tasks use the ContainerExecutor specified in the node’s configuration. As a result, once a cluster was configured to use DockerContainerExecutor, users could no longer launch regular MapReduce, Tez, or Spark jobs. Additionally, implementing a new ContainerExecutor means that all of the benefits of the existing LinuxContainerExecutor (such as cgroups and traffic shaping) need to be reimplemented in the new ContainerExecutor. As a result of these challenges, DockerContainerExecutor has been deprecated in favor of a newer abstraction, container runtimes, and will be removed in a future Apache Hadoop release.

Introducing Container Runtimes

To address these deficiencies, YARN added support for container runtimes in LinuxContainerExecutor. Container runtimes split the ContainerExecutor into two distinct pieces: the underlying framework required to carry out the core functionality, and a runtime piece that can change depending on the type of container you wish to launch. With these changes, we solve the architectural problem of being able to run regular YARN process containers alongside Docker containers. The lifecycle of the Docker container is managed by YARN just like any other container. The change also allows YARN to add support for other containerization technologies in the future.

Currently, two runtimes exist: the process-tree-based runtime (DefaultLinuxContainerRuntime) and the new Docker runtime (DockerLinuxContainerRuntime). The process-tree-based runtime launches containers the same way YARN always has, whereas the Docker runtime launches Docker containers. Interfaces exist that can be extended to add new container runtimes. Support for container runtimes, and specifically the DockerLinuxContainerRuntime, is being added via YARN-3611.
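
As a rough illustration of the runtime abstraction, the sketch below approximates the interface that LinuxContainerExecutor delegates to; the names are close to, but should not be read as, the exact upstream API (see YARN-3611 and the Hadoop source for the authoritative definition). The placeholder types make the sketch self-contained.

// Illustrative sketch of the container runtime abstraction.
class ContainerRuntimeContext { /* container, env, image name, ... */ }
class ContainerExecutionException extends Exception { }

public interface ContainerRuntime {

  // Work needed before launch; for the Docker runtime this is where
  // an image could be pulled if it is not already cached locally.
  void prepareContainer(ContainerRuntimeContext ctx)
      throws ContainerExecutionException;

  // Start the container: a plain process tree for
  // DefaultLinuxContainerRuntime, a "docker run" invocation for
  // DockerLinuxContainerRuntime.
  void launchContainer(ContainerRuntimeContext ctx)
      throws ContainerExecutionException;

  // Deliver a signal (for example SIGTERM or SIGKILL) to the
  // running container.
  void signalContainer(ContainerRuntimeContext ctx)
      throws ContainerExecutionException;

  // Clean up runtime-specific state once the container has exited.
  void reapContainer(ContainerRuntimeContext ctx)
      throws ContainerExecutionException;
}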

Example

In the sections that follow, we will outline the steps necessary to run the distributed shell sample application in Docker containers via the container runtime.

Install Docker

For the sake of this example, it is expected that you have already installed Docker on all of the machines running NodeManagers in the cluster. It is recommended that a recent version of Docker is used. Currently, only Docker on Linux is supported by the container runtime.
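
As an illustration only (package names and service commands vary by distribution and Docker version), installing and verifying Docker on a CentOS/RHEL-style NodeManager host might look like the following:

# Example only: commands vary by distribution and Docker version.
sudo yum install -y docker
sudo systemctl enable docker
sudo systemctl start docker

# Verify the daemon is reachable before configuring YARN.
sudo docker info
sudo docker run --rm hello-world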

Configuring YARN

Please note that the Docker on YARN support is alpha/experimental and is not secure or recommended for production use.

To enable running Docker containers on YARN, several configuration properties need to be set in yarn-site.xml and container-executor.cfg. Make sure to restart the NodeManagers after making the configuration changes. The configuration is as follows:

yarn-site.xml

  • yarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
  • yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user=nobody
  • yarn.nodemanager.runtime.linux.docker.default-container-network=bridge
  • yarn.nodemanager.runtime.linux.docker.capabilities=CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP,SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE,DAC_READ_SEARCH,SYS_PTRACE,SYS_ADMIN
  • yarn.nodemanager.runtime.linux.docker.privileged-containers.allowed=true
  • yarn.nodemanager.runtime.linux.docker.privileged-containers.acl=root (or “*” to allow everyone to run privileged containers)
  • yarn.nodemanager.linux-container-executor.cgroups.mount-path=/sys/fs/cgroup
  • yarn.nodemanager.pmem-check-enabled=false
  • yarn.nodemanager.resource.cpu.enabled=true
  • yarn.nodemanager.resource.disk.enabled=true
  • yarn.nodemanager.resource.memory.enabled=true
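
For reference, settings in yarn-site.xml are written as <property> elements. The first two entries above, for example, would appear as follows; the remaining properties follow the same pattern:

<!-- Excerpt of yarn-site.xml showing the property format. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>nobody</value>
</property>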

container-executor.cfg

  • yarn.nodemanager.linux-container-executor.group=yarn
  • banned.users=#
  • min.user.id=50
  • allowed.system.users=#
  • docker.binary=/usr/bin/docker
  • feature.docker.enabled=1
  • feature.tc.enabled=1
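
Assembled into a file, the settings above make up container-executor.cfg verbatim, as it is a plain key=value file. As an operational note not covered above, the container-executor binary typically requires this file to be owned by root and not writable by other users:

# container-executor.cfg; typically must be owned by root.
yarn.nodemanager.linux-container-executor.group=yarn
banned.users=#
min.user.id=50
allowed.system.users=#
docker.binary=/usr/bin/docker
feature.docker.enabled=1
feature.tc.enabled=1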

Running Distributed Shell through Docker-on-YARN

Environment variables are currently used (this will likely change in the future) to configure which container runtime should be used and to provide configuration to that specific runtime. To use the DockerLinuxContainerRuntime, the environment variable YARN_CONTAINER_RUNTIME_TYPE must be set to docker and an image must be supplied. Below is an example that runs the supplied shell command inside a Docker container using the centos image from Docker Hub with the “latest” tag.

export DJAR=./yarn/hadoop-yarn-applications-distributedshell-*.jar

yarn jar $DJAR -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
    -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos \
    -shell_command "sleep 120" \
    -jar $DJAR -num_containers 1
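
While the 120-second sleep is running, you can confirm that the task really is a Docker container by listing containers on the NodeManager host that was assigned the task:

# On the NodeManager host running the task; one container using the
# centos image should be listed while the sleep is in progress.
sudo docker ps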

Future Improvements

While the foundation for Docker in YARN is solid, the Docker ecosystem is large and comes with many considerations. Below is a list of several key efforts currently ongoing in the YARN community.

  • Docker image management – Docker distributes applications as images built from reusable layers. Images are typically stored in a registry and downloaded at runtime. Images downloaded to a node are cached locally for future use. Localization (YARN-3854) and image cleanup/management (YARN-5670) are key topics that need to be addressed.
  • Volume support – It is common for files and directories to need to be bind-mounted from the host into the container. Care must be taken, as allowing arbitrary volumes has security implications; conversely, hardcoding all “safe” volumes is restrictive. YARN-5534 will enable volume support from a whitelist of “safe” volumes.
  • Persistence – Currently, ephemeral containers are the target use case, meaning that volumes are neither persisted nor highly available across the cluster. Leveraging HDFS for highly available volumes seems to be the logical approach to pursue.
  • User remapping – The process in the container may require running as a particular user while still being able to write the application logs for use with YARN’s log aggregation service. How are UID/GID/username differences between the container and the host OS handled? YARN-4266 is addressing this.
  • Service Discovery – Easily finding services in a distributed system can be difficult; the Service Discovery effort will add the ability to map containers and services to easy-to-remember DNS names.
  • Documentation – For those interested in trying out Docker on YARN, the initial proposed documentation is being reviewed and can be found on YARN-5258.

Conclusion

Container Runtimes greatly enhance YARN’s ability to support a multitude of containerization technologies without reworking the core execution path. These improvements allow YARN to go beyond data-centric applications and let nearly any workload take advantage of YARN’s resource-management capabilities at scale. Support for Docker in YARN is well underway and available for early testing. Please note that the Docker on YARN support is alpha/experimental and is not secure or recommended for production use. Thanks to the Apache Hadoop community, and specifically the co-authors of this blog: Varun Vasudev, Sidharta Seethana, and Vinod Vavilapalli.
