September 18, 2017

Engineering @ Hortonworks – The Matrix

This is the introductory post in a blog series that explores how we in Hortonworks Engineering build, test and release new versions of our platforms. In this post, we introduce the basic themes and set context for deeper discussions in subsequent blogs.

We at Hortonworks are very proud of the work we do. Together with the open-source communities, we are aggressively pushing the frontiers of data infrastructure with YARN, Hive/LLAP, Atlas, Ranger, Spark, and more.

Our ability to keep doing so depends not only on having the brightest minds in the building, but increasingly on giving ourselves the tools to validate that work at scale, and to a level of readiness that hundreds of Enterprise customers (of Hortonworks or other distributions) can securely and reliably run their businesses on.

Getting to an Enterprise-ready release of our platforms is a long road. Roughly speaking, here’s what happens in any given release:

[Figure: Hortonworks Engineering Release Process]

That looks neat, you might say. The reality is far more complex.

To give a sense of the breadth of the task of integrating and validating 25+ open-source projects into a coherent distribution (HDP or HDF), here are some of the vectors we in Hortonworks engineering deal with on a daily basis, captured below.

[Figure: The Matrix]

Mathematically, this leads to over 30K combinations: a finite number, but an overwhelming one to validate!
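
To make the combinatorics concrete, here is a minimal sketch of how the slice count explodes as a Cartesian product of support vectors. The dimension names and counts below are illustrative assumptions, not the actual Matrix.

```python
# Illustrative sketch only: the real Matrix has different dimensions and counts.
from itertools import product

# Assumed (hypothetical) support vectors for a release -- not actual HDP numbers.
matrix = {
    "os":           ["centos6", "centos7", "debian7", "ubuntu14", "sles11", "sles12"],
    "jdk":          ["jdk7", "jdk8"],
    "database":     ["mysql", "postgres", "oracle", "mssql"],
    "filesystem":   ["hdfs", "wasb", "s3"],
    "security":     ["simple", "kerberos"],
    "deployment":   ["fresh-install", "express-upgrade", "rolling-upgrade"],
    "release_line": ["major", "maintenance", "hotfix"],
}

slices = list(product(*matrix.values()))
print(f"{len(slices)} slices")  # 6*2*4*3*2*3*3 = 2,592 for these toy counts

# Each slice then multiplies against the projects and services being validated,
# which is how the total climbs into the tens of thousands.
```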

Navigating the “Matrix”, as we all reverentially refer to it, is a really hard engineering problem – at least as hard as working on YARN or Atlas or LLAP – if not harder!

Moreover, we usually have several releases in flight at the same time, each requiring a different amount of testing: major releases, maintenance releases, and hotfixes.

[Figure: Concurrent release lines]

Last but not least, we have a corpus of over 30,000 tests we’ve built up over the years, which cover different aspects of validating the platforms:

  • Functional
  • Unit
  • Reliability/HA
  • Stress
  • Concurrency
  • Scale
  • Security
  • Performance
  • Integration
  • Operational Readiness
  • Upgrades
  • Longevity

Each of these tests has to be run on each “configuration” or “Matrix slice” (OS/DB/FS/JDK/…) before we feel comfortable shipping a release to our Enterprise customers.

To put everything into perspective, here are some stats on what the Hortonworks machinery churns through every single day, for each “slice”:

  • 3,500 VMs
  • 21,000 compute hours
  • 30,000 tests
  • 50+ projects (including Apache projects, connectors, etc.)
  • 100+ commits

This, naturally, necessitates a fairly unprecedented degree of sophistication and innovation in the infrastructure!

Further, once the infrastructure is available, analyzing the output of the tests is a huge challenge, given the sheer breadth of tests we have built over time. This is itself a big-data problem! Take a moment to imagine it: if 1,500 tests fail due to a broken merge, it would take enormous amounts of human time to analyze them all and pinpoint the root cause.
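
To see why grouping matters, here is a minimal sketch of the idea (an illustrative approach, not a description of our actual pipeline): collapsing failures by a normalized error signature can reduce 1,500 individual failures to a handful of candidate root causes.

```python
import re
from collections import defaultdict

def signature(stack_trace: str) -> str:
    """Normalize a failure message into a rough fingerprint for grouping."""
    first_line = (stack_trace.strip().splitlines() or [""])[0]
    # Strip volatile details (ports, counters, temp paths) so that the same
    # underlying failure hashes to the same bucket.
    sig = re.sub(r"\d+", "N", first_line)
    sig = re.sub(r"/tmp/\S+", "/tmp/...", sig)
    return sig

def group_failures(failures):
    """failures: iterable of (test_name, stack_trace) pairs."""
    buckets = defaultdict(list)
    for test_name, trace in failures:
        buckets[signature(trace)].append(test_name)
    # Largest buckets first: a broken merge typically shows up as one huge
    # bucket rather than 1,500 unrelated failures.
    return sorted(buckets.items(), key=lambda kv: -len(kv[1]))

# Hypothetical usage:
# for sig, tests in group_failures(load_failures_from_ci()):
#     print(len(tests), sig)
```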

Pixie Dust to the Rescue

Wouldn’t it be nice if we could sprinkle some pixie dust and conjure up infrastructure to help us deal with all this?

Unfortunately, that’s a viable option in a Disney movie; for us, not so much.

So, as we started to look at a Version 2 (aka Project Pixie Dust) of our internal infrastructure a couple of years ago, we had some lofty ambitions. How about we build our packages much faster, and deploy system-test clusters in minutes instead of hours? Wouldn’t it be better to use text analytics and machine learning to categorize test failures and report them with a possible root cause? Why stop there? Let’s go further and file the tracking ticket automatically! 🙂

We brainstormed on these ambitions and set ourselves the following concrete goals:

  • Build the entire stack in 1 hour
  • Run unit tests for every component in 10 minutes
  • An always-on CI process that finishes in 1 hour
  • Deploy a single HDP cluster in 15 minutes as part of system testing
  • Validate 30,000 tests across 500 HDP clusters per “slice” within 6 hours (a quick back-of-envelope check follows this list)
  • Automated analysis and reporting of test-case failures from log analytics!
  • Most important of them all: do all this using HDP projects! That is, use what we ship, and ship what we use!
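
As the back-of-envelope check on that throughput goal, here is a rough calculation that assumes tests are spread evenly across clusters and ignores cluster spin-up and teardown time.

```python
# Rough sizing of the 6-hour goal; evenly spread tests, no overhead assumed.
tests, clusters, hours = 30_000, 500, 6

tests_per_cluster = tests / clusters                 # 60 tests per cluster
tests_per_cluster_hour = tests_per_cluster / hours   # 10 tests per cluster-hour

print(tests_per_cluster, tests_per_cluster_hour)
# -> 60.0 10.0, i.e. each cluster must average roughly one test every 6 minutes.
```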

So, how far did we get, and how?

The rest of this series will walk you through the fantastic feats of gymnastics we made HDP perform to get a long way toward those goals.

A teaser: a Hadoop 3.0, YARN-based Docker container cloud running several million containers per release and several thousand HDP clusters per day, Ambari deploys in 10 minutes or less, ML-based text analytics for auto-categorization of failures, and more.

Stay tuned! We are sure you will enjoy reading it; we are certainly very proud of it, for it involves not only some very hard engineering problems, but also a massive competitive differentiator for Hortonworks!
