This is the second post in the Engineering @ Hortonworks blog series that explores how we in Hortonworks Engineering build, test and release new versions of our platforms.
In this post, we deep dive into something we are extremely excited about – running a container cloud on YARN! We have been using this next-generation infrastructure for more than a year to run all of Hortonworks' internal CI/CD infrastructure.
With this, we can now run Hadoop on Hadoop to certify our releases! Let’s dive right in!
The introductory post on Engineering @ Hortonworks gave readers an overview of the scale of the challenges we face in delivering an enterprise-ready data platform.
Essentially, for every new release of a platform, we provision Hadoop clusters on demand, with specific configurations like authentication on/off, encryption on/off, DB combinations, and OS environments, run a bunch of tests to validate changes, and shut them down.
And we do this over and over, day in and day out, throughout the year.
Certifying our platforms is thus an arduous undertaking. The specific challenge relevant to this post is how resources are shared and utilized efficiently as we bring these Hadoop clusters up and down. We discuss these resource-management issues below.
There is a dogfood opportunity here that helps both us internally as well as our customers and users: Use what we ship and ship what we use.
We recently talked about how Apache Hadoop is moving towards a multi-colored YARN. The need for such a multi-colored YARN is coming from the demand for new use-cases and technology drivers discussed in that post.
Looking at the internal pains discussed above, we asked ourselves: “Our users and customers have a very powerful platform for scheduling and managing their applications. Why not use that same platform for our internal needs?” To this end, nearly 15 months ago, we began migrating our workloads to a YARN-based container cloud.
What is this new infrastructure? It uses the well-known HDP components you are already familiar with – Ambari, ZooKeeper, HDFS, and YARN newly augmented with Docker support and Slider. It is also one of the few fully functioning production clusters running around the clock inside Hortonworks.
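To give a flavor of what the Docker augmentation involves, here is an illustrative yarn-site.xml fragment enabling the Docker runtime on the NodeManagers. This is a sketch based on the upstream Docker-on-YARN work; exact property names and defaults can differ across Hadoop versions, so treat it as an assumption, not a drop-in config:

```xml
<!-- Illustrative sketch only: enable the Linux container executor and
     allow the Docker runtime alongside the default process runtime.
     Property names follow the upstream Docker-on-YARN work and may
     vary by Hadoop version. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
<property>
  <!-- Keep privileged Docker containers disabled unless explicitly needed. -->
  <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.allowed</name>
  <value>false</value>
</property>
```

With a configuration along these lines, individual applications can opt in to running their containers as Docker containers while everything else on the cluster runs unchanged.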
Look carefully and, through this mechanism, you can spot a YARN cluster running inside (as containers on) another YARN cluster! That is, we have finally realized YInception – YARN on YARN!
This is not a radically new idea, though. It is very similar to how source-code compilers go through their own evolution: through the bootstrapping process, developers use an older version of a compiler to build the newer version of the compiler! What we are doing here is using an older version of Hadoop as a compute platform to test newer versions of Hadoop. And Hive. And Spark. And everything else!
Different types of apps run on this platform: (a) containerized apps and services – a key promise of YARN in Apache Hadoop 3.0 – and (b) legacy applications that run inside containers made to look like VMs, each with its own IP address, SSH-based access, etc.
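As an illustration of the first category, here is a minimal service definition in the style of the YARN services API (the "first class support for services" work mentioned later in this post). The service name, Docker image, command, and sizing are all hypothetical, and the exact spec schema may differ by release:

```json
{
  "name": "sleeper-service",
  "components": [
    {
      "name": "worker",
      "number_of_containers": 3,
      "artifact": { "id": "library/centos:7", "type": "DOCKER" },
      "launch_command": "sleep 3600",
      "resource": { "cpus": 1, "memory": "1024" }
    }
  ]
}
```

A spec like this asks YARN to keep three Docker containers of the given image running as long-lived service components, rather than as short-lived batch tasks.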
Let’s look at one specific workload. The following is a depiction of a system-test cluster running on this container cloud. On the base YARN cluster, we dynamically bring up N containers, and then use Ambari to install HDP components inside those containers!
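The Ambari step is driven by blueprints. The fragment below is a minimal, hypothetical Ambari blueprint sketching the kind of HDP layout we install into those containers; the blueprint name, stack version, and component placement are illustrative, not our actual configuration:

```json
{
  "Blueprints": {
    "blueprint_name": "hdp-system-test",
    "stack_name": "HDP",
    "stack_version": "2.6"
  },
  "host_groups": [
    {
      "name": "master",
      "cardinality": "1",
      "components": [ { "name": "NAMENODE" }, { "name": "RESOURCEMANAGER" } ]
    },
    {
      "name": "worker",
      "cardinality": "3",
      "components": [ { "name": "DATANODE" }, { "name": "NODEMANAGER" } ]
    }
  ]
}
```

Because the containers look like ordinary hosts (IP address, SSH), Ambari can map them into these host groups exactly as it would map physical machines.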
Similarly, running all the unit tests of the Apache Hadoop project can take more than 6-7 hours. We have developed a parallel master-worker framework that runs on the YARN container cloud and brings the unit-test run time down to 10-15 minutes! We will talk more about this in an upcoming post.
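The framework itself is internal, but the core idea can be sketched in a few lines: a master shards the test modules, farms each shard out to workers in parallel, and aggregates the results. In this sketch the worker side is simulated in-process; a real worker would dispatch its shard to a YARN container and run the tests there. All names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def run_shard(shard):
    # Stand-in for a worker: in the real framework this would ship the
    # shard of test modules to a YARN container, run them, and collect
    # the results. Here we simply mark every module as passed.
    return {module: "PASSED" for module in shard}

def shard_tests(modules, num_workers):
    # Round-robin the test modules into num_workers shards so that
    # slow and fast modules are spread across workers.
    shards = [[] for _ in range(num_workers)]
    for i, module in enumerate(modules):
        shards[i % num_workers].append(module)
    return [s for s in shards if s]

def run_all(modules, num_workers=4):
    # Master: shard, dispatch in parallel, and merge per-shard results.
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(run_shard, shard_tests(modules, num_workers)):
            results.update(partial)
    return results
```

The speedup comes purely from width: with enough containers, total wall-clock time approaches the runtime of the slowest shard rather than the sum of all modules.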
We have been running hard and fast with this new architecture for more than a year! The following graphs show some key metrics of this cluster over the last 15 months.
Growth in containers’ memory resource over time (yellow: used + reserved memory; blue: total cluster memory).
Growth in containers’ CPU resource over time (yellow: used + reserved CPU; blue: total cluster CPU).
As you can see, the cluster was built out gradually over time, and workloads were likewise migrated gradually from the old platform to the new one.
Looking just at the raw numbers of apps and containers, we have had more than 2.4M containers allocated so far!
And close to 400K applications completed so far!
We first lifted the curtain on our next-generation infrastructure in a tweet from Mithun Radhakrishnan (@mithunrk) at DataWorks Summit San Jose 2017, on June 15, 2017.
Shane and Jian gave an excellent talk on this topic at the DataWorks Summit, San Jose: https://dataworkssummit.com/san-jose-2017/sessions/running-a-container-cloud-on-yarn/, and we will condense some of that information into an upcoming blog post!
To conclude, this container-cloud infrastructure on YARN has been running inside Hortonworks at large scale for more than a year and continues to ratchet up in machine count as well as in the volume and variety of workloads. Gour Saha, Billie Rinaldi, Shane Kumpf, Jian He (and Varun Vasudev and Jon Maron in the past) have all been part of this crazy team building such an amazing infrastructure.
The underlying source code for this infrastructure is derived from two key initiatives in the Apache Hadoop project: (1) support in YARN for containerized applications (YARN-3853) and (2) first-class support for services on YARN. Even though we have been running this code on a “production” cluster, these efforts are largely works in progress that we hope to bring to your clusters as part of Apache Hadoop 3.0!