Merv Adrian, the widely respected Gartner analyst, recently remarked on the continuing evolution of Apache Hadoop:
YARN is the one that really matters because it doesn’t just mean the list of components will change, but because in its wake the list of components will change Hadoop’s meaning. YARN enables Hadoop to be more than a brute force, batch blunt instrument for analytics and ETL jobs. It can be an interactive analytic tool, an event processor, a transactional system, a governed, secure system for complex, mixed workloads.
We couldn’t agree more!
In fact, we’ve talked about how we want to leverage Apache Hadoop YARN to further enhance the meaningfulness of Hadoop for our users by bringing together the worlds of Data and PaaS, running Docker, Google Kubernetes, and Red Hat OpenShift on YARN. The goal is to enable seamless, common resource management across data and PaaS workloads. Furthermore, the exciting work in the Apache Hadoop HDFS community on Ozone, an object store built on HDFS, is another giant step in this direction.
In this follow-up blog post, we’ll describe some of the internals of how we are integrating Google Kubernetes, an open-source container management implementation for PaaS, with YARN.
The architecture for Kubernetes-on-YARN is depicted below.
The crux of the integration is that the Kubernetes Master now allocates resources (to run Docker containers) from YARN. We provide an alternate implementation of the Kubernetes DefaultScheduler, called the YARNScheduler. On startup, the scheduler process registers itself with YARN as an ApplicationMaster. YARN, designed to be oblivious to the type of workload it runs, thus treats PaaS (on Kubernetes) as just one of several possible workloads on the cluster; a different workload (depicted by AppMaster2 above, for example) could run simultaneously.
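To make the shape of this concrete, here is a minimal Go sketch of a YARN-backed scheduler that registers as an ApplicationMaster before accepting placement requests. All type and method names (Scheduler, yarnClient, YARNScheduler, fakeYARN) are illustrative assumptions and do not match the actual Kubernetes-YARN source:

```go
package main

import "fmt"

// Scheduler is the interface the Kubernetes master calls to place pods;
// the prototype swaps the DefaultScheduler for a YARN-backed one.
// (Hypothetical interface for illustration.)
type Scheduler interface {
	Schedule(pod string) (node string, err error)
}

// yarnClient abstracts the YARN ApplicationMaster protocol (assumed API).
type yarnClient interface {
	RegisterApplicationMaster() error
	Allocate(pod string) (node string, err error)
}

// YARNScheduler implements Scheduler by delegating placement to YARN.
type YARNScheduler struct {
	yarn yarnClient
}

// NewYARNScheduler registers with the ResourceManager as an
// ApplicationMaster before serving any scheduling requests,
// mirroring the startup behavior described above.
func NewYARNScheduler(y yarnClient) (*YARNScheduler, error) {
	if err := y.RegisterApplicationMaster(); err != nil {
		return nil, err
	}
	return &YARNScheduler{yarn: y}, nil
}

func (s *YARNScheduler) Schedule(pod string) (string, error) {
	return s.yarn.Allocate(pod)
}

// fakeYARN stands in for a real ResourceManager in this sketch.
type fakeYARN struct{ registered bool }

func (f *fakeYARN) RegisterApplicationMaster() error { f.registered = true; return nil }

func (f *fakeYARN) Allocate(pod string) (string, error) { return "kubernetes-minion-2", nil }

func main() {
	sched, _ := NewYARNScheduler(&fakeYARN{})
	node, _ := sched.Schedule("nginx-pod")
	fmt.Println("pod placed on", node)
}
```

The key design point is that the Kubernetes master never talks to the ResourceManager directly; only the swapped-in scheduler does.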
The resource negotiation protocol between Kubernetes and YARN is straightforward. The Kubernetes scheduler is notified of all pod creations and deletions, which it forwards to YARN. YARN tracks all resource usage on the cluster – not just for pods running under Kubernetes, but also for any other running YARN workloads. When YARN receives an allocation request, it finds a suitable location for the container and responds to the scheduler with the corresponding node information. The scheduler then tells Kubernetes which node the container should run on, and Kubernetes provisions the pod by spinning up a Docker container on that node. When a pod is deleted, the scheduler informs YARN, which updates its resource accounting accordingly.
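The bookkeeping side of that protocol can be sketched as a tiny Go model: YARN-style per-node accounting, an allocate step that answers a pod request with a node, and a release step on pod deletion. This is a simplified illustration under assumed names (cluster, allocate, release), not the project’s actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// cluster tracks per-node capacity the way the ResourceManager does
// for every workload on the cluster, Kubernetes pods included.
type cluster struct {
	free map[string]int    // node -> free container slots
	pods map[string]string // pod -> node it was placed on
}

func newCluster(nodes map[string]int) *cluster {
	return &cluster{free: nodes, pods: map[string]string{}}
}

// allocate finds a node with spare capacity and records the pod
// against it: the "allocation request -> node information" step.
func (c *cluster) allocate(pod string) (string, error) {
	for node, n := range c.free {
		if n > 0 {
			c.free[node]--
			c.pods[pod] = node
			return node, nil
		}
	}
	return "", errors.New("no capacity")
}

// release is invoked on pod deletion so the resource accounting
// stays current.
func (c *cluster) release(pod string) {
	if node, ok := c.pods[pod]; ok {
		c.free[node]++
		delete(c.pods, pod)
	}
}

func main() {
	c := newCluster(map[string]int{"kubernetes-minion-1": 1})
	node, _ := c.allocate("nginx-pod")
	fmt.Println("run pod on", node) // the scheduler relays this node to Kubernetes
	c.release("nginx-pod")          // deletion frees the slot for other workloads
}
```

Because YARN sees every allocation and release, pods and other YARN applications draw from the same pool without stepping on each other.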
Interesting sidebar: since both Docker and Kubernetes are implemented in Go, we felt particularly clever 🙂 hooking them up with YARN by implementing the YARNScheduler using the Go bindings for YARN (updated to work with secure hadoop-2.x and available here on GitHub).
We created a prototype implementation of Kubernetes-YARN, along with support for a simple Vagrant-based setup. Instructions for setting up a local Kubernetes-YARN cluster are available here. Note that, depending on your local machine and available bandwidth, it may take some time for the cluster to come up.
Once the cluster is up, the YARN UI (accessed here) shows that a Kubernetes application master has been registered with YARN.
This can also be seen in the scheduler logs on the Kubernetes master:
~/demo/kubernetes-yarn$ vagrant ssh master
[vagrant@kubernetes-master ~]$ journalctl -u scheduler -n 100 | grep "registered application master" -B 2
Oct 15 18:03:59 kubernetes-master scheduler: I1015 18:03:59.319887 17346 logs.go:39] Successfully completed SASL negotiation!
Oct 15 18:03:59 kubernetes-master scheduler: I1015 18:03:59.781095 17346 logs.go:39] starting periodic allocate routine with interval(ms): 1000
Oct 15 18:03:59 kubernetes-master scheduler: I1015 18:03:59.781125 17346 logs.go:39] Successfully registered application master.
At this point no containers are running, as indicated by YARN and Kubernetes.
Now, we can run a test container using a known Docker image.
When we submit a container creation request to Kubernetes, YARN allocates a container (through the scheduler integration) as seen here in the YARN UI and the scheduler logs.
AMRMClient.Allocate #asks: 1 #releases: 0
Oct 15 18:46:50 kubernetes-master scheduler: I1015 18:46:50.537471 17346 logs.go:39] received allocated containers notification. writing to channel: [id:
Oct 15 18:46:50 kubernetes-master scheduler: I1015 18:46:50.539989 17346 logs.go:39] YARN node kubernetes-minion-2 maps to minion: 10.245.2.3
Following successful creation of the containers, the command below shows that the pod is created and running on the specified node. (The pod may remain in the ‘Waiting’ state until the Docker image is pulled and the container started.)
The running docker container can also be found on the corresponding node.
Since we forwarded the container port 80 to the host (VM) port 8090, we can find nginx’s welcome page on http://
For more examples of running Docker containers (including multi-container applications), see the Kubernetes project documentation here.
Integrating Kubernetes with YARN allows us to seamlessly manage resources across heterogeneous PaaS (Kubernetes) and Data (YARN) workloads, bringing these two worlds together. This continues the evolution of Hadoop YARN into the de facto standard for cluster resource management in the datacenter.
If you would like to play around with a local Kubernetes-YARN cluster, a prototype implementation along with simple Vagrant-based instructions is available here.
Stay tuned for more; we certainly hope you will join us on this exciting journey!