This is the introductory post in a blog series that explores how we in Hortonworks Engineering build, test and release new versions of our platforms. In this post, we introduce the basic themes and set context for deeper discussions in subsequent blogs.
We at Hortonworks are very proud of the work we do. Together with the open-source communities, we are aggressively pushing the frontiers of data infrastructure with YARN, Hive/LLAP, Atlas, Ranger, Spark, and more.
Our ability to continue doing so depends not only on having the brightest minds in the building, but increasingly on giving ourselves the tools to validate that incredible work at scale, to a level of readiness that lets hundreds of enterprise customers (of Hortonworks or other distributions) run their businesses on it securely and reliably.
Getting to an Enterprise-ready release of our platforms is a long road. Roughly speaking, here’s what happens in any given release:
Sounds neat, you’d say? The reality is far more complex.
Just to provide a perspective on the breadth of the task at hand to integrate and validate 25+ open-source projects into a coherent distribution (HDP or HDF), here are some of the vectors we in Hortonworks engineering deal with on a daily basis – captured below.
Mathematically, this leads to over 30,000 combinations – finite, yes, but overwhelming to validate!
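To see how the count explodes, multiply the sizes of the matrix dimensions together. A minimal sketch, assuming hypothetical per-dimension counts (the real numbers vary by release and are not stated in the post):

```python
from math import prod

# Hypothetical counts per support-matrix dimension -- for illustration
# only; the actual dimensions and sizes differ per release.
matrix = {
    "operating systems": 8,
    "databases": 6,
    "filesystems": 4,
    "JDK versions": 3,
    "security modes": 3,
    "deployment topologies": 9,
}

# Every combination of choices is a distinct configuration to validate.
combinations = prod(matrix.values())
print(f"{combinations:,} configurations")  # 15,552 configurations
```

Even with these modest made-up counts the product is in the tens of thousands; add one more dimension, or a few more options to any existing one, and the total quickly crosses the 30K mark described above.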
Navigating the “Matrix”, as we all reverentially refer to it, is a really hard engineering problem – at least as hard as working on YARN or Atlas or LLAP – if not harder!
Moreover, we usually have several releases in flight at the same time – major releases, maintenance releases, and hotfixes – each requiring a different amount of testing.
Last but not least, we have a corpus of over 30,000 functional tests built up over the years, covering different aspects of validating the platforms:
Each of these tests has to be run on each “configuration,” or “Matrix slice” (OS/DB/FS/JDK/…), before we feel comfortable shipping a release to our enterprise customers.
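The multiplication here is simple but sobering. A back-of-the-envelope sketch, where the test count comes from the post and the number of slices is an assumption for illustration:

```python
tests = 30_000    # functional tests in the corpus (figure from the post)
slices = 20       # hypothetical number of Matrix slices to cover

# One full validation pass runs every test on every slice.
executions = tests * slices
print(f"{executions:,} test executions per full pass")  # 600,000
```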
To put everything into perspective, here are some stats on the Hortonworks machinery every single day for each “slice”:
This, naturally, necessitates a degree of sophistication and innovation in the infrastructure that is fairly unprecedented!
Further, once the infrastructure is available, analyzing the output of the tests is a huge challenge, given the sheer breadth of tests we have built over time. This is itself a big-data problem! Take a moment to imagine it – if 1,500 tests fail due to a broken merge, analyzing them and pinpointing the root cause would require enormous amounts of human time.
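A quick estimate makes the triage cost concrete. The failure count is the post’s example; the per-failure triage time is an assumed average, not a measured figure:

```python
failures = 1_500             # failing tests in the broken-merge example
minutes_per_failure = 10     # hypothetical average triage time per failure

# Total engineer time just to look at each failure once.
total_hours = failures * minutes_per_failure / 60
print(f"{total_hours:.0f} hours of engineer time")  # 250 hours
```

That is weeks of a full-time engineer’s work for a single bad merge, before anyone has fixed anything.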
Wouldn’t it be nice if we sprinkle some pixie dust and conjure up infrastructure to help us deal with this?
Unfortunately, that’s a viable option in a Disney movie, but for us – not so much.
So, as we started to look at a Version 2 (aka Project Pixie Dust) of our internal infrastructure a couple of years ago, we had some lofty ambitions. How about we build our packages much faster, and deploy system-test clusters in minutes instead of hours? Wouldn’t it be better to use text analytics and machine learning to categorize test failures and report them with a possible root cause? Why stop there – let’s go further and file the tracking ticket automatically! 🙂
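To give a flavor of what text-based failure categorization means, here is a deliberately tiny, dependency-free sketch. The category names, labeled snippets, and scoring scheme are all hypothetical stand-ins for whatever a production system would actually use:

```python
import re
from collections import Counter

# Hypothetical hand-labeled failure snippets per category -- a real
# system would learn from thousands of historical failure logs.
LABELED = {
    "environment": [
        "connection refused to database host",
        "no space left on device",
    ],
    "product": [
        "NullPointerException in query planner",
        "assertion failed: row count mismatch",
    ],
}

def tokens(text):
    """Lowercase bag-of-words for a log line."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def categorize(log_line):
    """Pick the category whose labeled examples share the most tokens."""
    line_tokens = tokens(log_line)

    def score(examples):
        seen = Counter()
        for example in examples:
            seen += tokens(example)
        # Count overlapping tokens between the log line and the category.
        return sum(min(line_tokens[w], seen[w]) for w in line_tokens)

    return max(LABELED, key=lambda cat: score(LABELED[cat]))

print(categorize("java.lang.NullPointerException in planner"))  # product
```

A real pipeline would swap the token-overlap score for TF-IDF features and a trained classifier, but the shape of the problem – map a raw failure log to a category, then route or auto-file it – is the same.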
We brainstormed on these ambitions and set ourselves the following concrete goals:
So, how far did we get, and how?
The rest of this series will walk you through the feats of gymnastics we made HDP perform to get there.
A teaser: a Hadoop 3.0 YARN-based Docker container cloud running several million containers per release and several thousand HDP clusters per day, Ambari deploys in 10 minutes or less, ML-based text analytics for auto-categorization of failures, and more.
Stay tuned! We are sure you will enjoy reading – we are certainly very proud of this work, for it represents not only some very hard engineering problems solved, but also a massive competitive differentiator for Hortonworks!