October 22, 2014

Docker & Kubernetes on Apache Hadoop YARN

Merv Adrian, the widely respected Gartner analyst, recently remarked on the continuing evolution of Apache Hadoop:

YARN is the one that really matters because it doesn’t just mean the list of components will change, but because in its wake the list of components will change Hadoop’s meaning. YARN enables Hadoop to be more than a brute force, batch blunt instrument for analytics and ETL jobs. It can be an interactive analytic tool, an event processor, a transactional system, a governed, secure system for complex, mixed workloads.

We couldn’t agree more!

In fact, we’ve talked about how we want to further leverage Apache Hadoop YARN to enhance the meaningfulness of Hadoop for our users by bringing together the worlds of Data and PaaS – leveraging Docker, Google Kubernetes, and Red Hat OpenShift on YARN. The goal is to enable common resource management across data and PaaS workloads in a seamless fashion. Furthermore, the exciting work in the Apache Hadoop HDFS community to develop Ozone, an object store on HDFS, is another giant step in this direction.

In this follow-up blog post, we’ll describe some of the internals of how we are integrating Google Kubernetes, an open-source container management implementation for PaaS, with YARN.

Kubernetes-on-YARN architecture

The architecture for Kubernetes-on-YARN is depicted below.

[Figure: Kubernetes-on-YARN architecture diagram]

The crux of the integration between Kubernetes and YARN is that the Kubernetes Master now allocates resources (to run Docker containers) from YARN. We provide an alternate implementation of the Kubernetes DefaultScheduler called the YARNScheduler. On startup, the scheduler process registers itself with YARN as an ApplicationMaster. Hence YARN, designed to be oblivious to the type of workload, now runs PaaS (on Kubernetes) as just one of several possible types of workloads on the cluster. A different workload (depicted by AppMaster2 above, for example) could be running simultaneously.
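The shape of this integration can be sketched in Go (the language the scheduler is actually written in). This is a minimal illustration only: the names ResourceManager, RegisterApplicationMaster, Allocate, and stubRM are hypothetical stand-ins for the real YARN binding APIs, not the actual signatures.

```go
package main

import "fmt"

// Hypothetical interface standing in for the YARN ResourceManager calls
// the scheduler needs; the real golang bindings differ.
type ResourceManager interface {
	RegisterApplicationMaster(name string) error
	Allocate(cpu, memMB int) (string, error)
}

// YARNScheduler replaces Kubernetes' DefaultScheduler and delegates
// placement decisions to YARN.
type YARNScheduler struct {
	rm         ResourceManager
	registered bool
}

// Start registers the scheduler process with YARN as an ApplicationMaster,
// as described above.
func (s *YARNScheduler) Start() error {
	if err := s.rm.RegisterApplicationMaster("kubernetes-scheduler"); err != nil {
		return err
	}
	s.registered = true
	return nil
}

// Schedule asks YARN for a container and returns the node YARN chose.
func (s *YARNScheduler) Schedule(cpu, memMB int) (string, error) {
	if !s.registered {
		return "", fmt.Errorf("scheduler not registered with YARN")
	}
	return s.rm.Allocate(cpu, memMB)
}

// stubRM simulates RM responses so the sketch runs standalone.
type stubRM struct{}

func (stubRM) RegisterApplicationMaster(name string) error { return nil }
func (stubRM) Allocate(cpu, memMB int) (string, error) {
	return "kubernetes-minion-2", nil
}

func main() {
	s := &YARNScheduler{rm: stubRM{}}
	if err := s.Start(); err != nil {
		panic(err)
	}
	node, _ := s.Schedule(1, 512)
	fmt.Println("YARN placed container on", node)
}
```

The point of the design is visible even in the stub: Kubernetes keeps its scheduler abstraction, while the actual placement decision moves behind the YARN ApplicationMaster protocol.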

Resource Negotiation: Docker Container Allocation/Deallocation

The resource negotiation protocol between Kubernetes and YARN is straightforward. The Kubernetes scheduler is notified of all pod creations/deletions, which are then forwarded to YARN. YARN keeps track of all resource usage on the cluster – not just for pods running under Kubernetes, but also for resources used by any other running YARN workloads. When YARN receives an allocation request, it finds a suitable location for the container and responds to the scheduler with the corresponding node information. The scheduler then informs Kubernetes of the node where the container is to be run. Subsequently, Kubernetes provisions the pod by spinning up a docker container on the specified node. When a pod is deleted, the scheduler informs YARN, which updates its resource tracking accordingly. Pretty simple and straightforward.
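The allocate-on-creation / release-on-deletion bookkeeping can be illustrated with a small Go sketch. ClusterTracker and its methods are hypothetical stand-ins for YARN's internal resource accounting, not real APIs; the sketch only shows why every pod event must reach YARN for its cluster-wide view to stay consistent.

```go
package main

import "fmt"

// ClusterTracker models per-node memory bookkeeping the way YARN's
// ResourceManager tracks usage across all workloads (illustrative only).
type ClusterTracker struct {
	capacityMB map[string]int // total memory per node
	usedMB     map[string]int // memory currently allocated per node
}

func NewClusterTracker(capacityMB map[string]int) *ClusterTracker {
	return &ClusterTracker{capacityMB: capacityMB, usedMB: map[string]int{}}
}

// Allocate handles a pod creation: find a node with enough free memory,
// record the usage, and return the chosen node to the scheduler.
func (t *ClusterTracker) Allocate(memMB int) (string, error) {
	for node, total := range t.capacityMB {
		if total-t.usedMB[node] >= memMB {
			t.usedMB[node] += memMB
			return node, nil
		}
	}
	return "", fmt.Errorf("no node with %d MB free", memMB)
}

// Release handles a pod deletion: YARN updates its tracking accordingly.
func (t *ClusterTracker) Release(node string, memMB int) {
	t.usedMB[node] -= memMB
}

func main() {
	t := NewClusterTracker(map[string]int{"kubernetes-minion-1": 2048})
	node, _ := t.Allocate(512)
	fmt.Println("pod placed on", node)
	t.Release(node, 512)
	fmt.Println("free after release:", t.capacityMB[node]-t.usedMB[node], "MB")
}
```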

Interesting sidebar: Since both Docker & Kubernetes are implemented in golang, we felt particularly clever 🙂 hooking them up with YARN by implementing the YARNScheduler using the golang bindings for YARN (updated to work with secure hadoop-2.x and available here on GitHub).

Setting up a local Kubernetes-YARN cluster

We created a prototype implementation of Kubernetes-YARN, along with support for a simple Vagrant-based setup. Instructions for setting up a local Kubernetes-YARN cluster are available here. Note that, depending on your local machine and available bandwidth, it may take some time for the cluster to come up.

Running Docker Containers

Once the cluster is up, the YARN UI (accessed here) shows that a Kubernetes application master has been registered with YARN.

[Figure: YARN UI showing the registered Kubernetes application master]

This can also be seen in the scheduler logs on the Kubernetes master:
~/demo/kubernetes-yarn$ vagrant ssh master
[vagrant@kubernetes-master ~]$ journalctl -u scheduler -n 100 | grep "registered application master" -B 2

Oct 15 18:03:59 kubernetes-master scheduler[17346]: I1015 18:03:59.319887 17346 logs.go:39] Successfully completed SASL negotiation!
Oct 15 18:03:59 kubernetes-master scheduler[17346]: I1015 18:03:59.781095 17346 logs.go:39] starting periodic allocate routine with interval(ms): 1000
Oct 15 18:03:59 kubernetes-master scheduler[17346]: I1015 18:03:59.781125 17346 logs.go:39] Successfully registered application master.
[vagrant@kubernetes-master ~]$

At this point no containers are running, as indicated by YARN and Kubernetes.

[Figure: YARN UI showing no running containers]

[Screenshot: Kubernetes reporting no running pods]

Now, we can run a test container using a known docker image.

[Screenshot: submitting a test container from a known docker image]

When we submit a container creation request to Kubernetes, YARN allocates a container (through the scheduler integration) as seen here in the YARN UI and the scheduler logs.

[Figure: YARN UI showing the allocated container]

AMRMClient.Allocate #asks: 1 #releases: 0
Oct 15 18:46:50 kubernetes-master scheduler[17346]: I1015 18:46:50.537471 17346 logs.go:39] received allocated containers notification. writing to channel: [id:allocated container on: kubernetes-minion-2
Oct 15 18:46:50 kubernetes-master scheduler[17346]: I1015 18:46:50.539989 17346 logs.go:39] YARN node kubernetes-minion-2 maps to minion: 10.245.2.3

Following successful creation of the container, the command below shows that the pod is created/running on the specified node. (The pod may be in the ‘Waiting’ state until the docker image is pulled and the container started.)

[Screenshot: pod listed as created/running on the specified node]

The running docker container can also be found on the corresponding node.

[Screenshot: the running docker container on the corresponding node]

Since we forwarded the container’s port 80 to the host (VM) port 8090, we can see nginx’s welcome page at http://:8090

[Screenshot: nginx welcome page served from the container]

For more examples of running docker containers (including multi-container applications), see the Kubernetes project here.

Summary

Integrating Kubernetes with YARN allows us to seamlessly manage resources across heterogeneous PaaS (Kubernetes) and Data (YARN) workloads, bringing these two worlds together. This continues the evolution of Hadoop YARN into the de facto standard for cluster resource management in the datacenter.

If you would like to play around with a local Kubernetes-YARN cluster, a prototype implementation along with simple Vagrant-based instructions is available here.

Stay tuned for more; we certainly hope you will join us on this exciting journey!

Comments

  • Which version of YARN do I need to run docker and Kubernetes? Will this feature be part of HDP 2.2 (or 2.3)?

    Can’t wait to see it happen.

    Jianshi

  • Are there any updates on this project? There are many updates to Kubernetes and the bindata-assetfs libraries that are needed for it to work.
