At Hortonworks we are constantly striving to achieve high-quality releases. HDP/HDF releases are deployed by thousands of enterprises and are used in business-critical environments to crunch several petabytes of data every single day. Maintaining the highest standards of quality, and investing in infrastructure that makes that quality repeatable, is one of the key guiding principles of our organization.
Unlike traditional enterprise software, we deal with an inflow of hundreds of Apache commits, across 25+ projects in the Apache Hadoop ecosystem. The Apache community has a rich set of unit tests, which run continuously (often, for every commit) to catch regressions early. However, they are not always sufficient to assess the impact on integrated functionality. This is where a robust, scalable, and reliable testing infrastructure for validating the multi-layer stack becomes crucial.
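To make the distinction concrete, here is a minimal sketch of how a per-commit unit-test gate differs from a cross-project integration suite. This is not Hortonworks' actual tooling; the project name, Maven invocation, and scenario files are illustrative assumptions:

```python
import subprocess

# Hypothetical per-commit gate: fast, and scoped to the one project that
# changed. Per-commit CI blocks on this verdict alone.
def run_unit_tests(project: str) -> bool:
    """Run a single project's unit tests and report pass/fail."""
    result = subprocess.run(["mvn", "-pl", project, "test"])
    return result.returncode == 0

# Hypothetical integration suite: exercises several layers of the stack
# together, which no single project's unit tests can cover.
INTEGRATION_SCENARIOS = [
    "ingest_via_kafka_land_in_hdfs",
    "hive_query_over_hbase_snapshot",
    "oozie_workflow_with_spark_action",
]

def run_integration_suite() -> dict:
    """Run end-to-end scenarios against a deployed multi-node cluster."""
    return {
        name: subprocess.run(["pytest", f"tests/e2e/{name}.py"]).returncode == 0
        for name in INTEGRATION_SCENARIOS
    }
```

The key point is in the second half: each scenario spans multiple projects at once, so no amount of per-project unit testing can substitute for it.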
At Hortonworks, innovation in test, build, and release infrastructure is as important as delivering any new feature in HDP. Our internal infrastructure has evolved through many avatars over the past few years to cater to the growing demands of the Apache Hadoop ecosystem.
The problem with this approach was:
To address these issues, we implemented a phased approach, guided by a few key design principles:
Advantages of this approach:
This model has proven to scale very well for point fixes. For big feature-branch merges, we have a similar pipeline, but the set of tests run before the feature branch is merged is more exhaustive. So when the merge happens to mainline, it has already gone through thorough testing and the chances of finding regressions are almost nil.
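As a rough illustration of that gating logic (the stage names and helper below are our own sketch, not Hortonworks' internal tooling), a feature-branch merge can be thought of as an escalating ladder of suites, where the merge to mainline proceeds only if every rung passes:

```python
# A minimal sketch of escalating pre-merge gates for a feature branch.
# Stage names and the run_stage helper are illustrative assumptions.
MERGE_GATES = [
    ("unit", "per-commit unit tests"),
    ("smoke", "short integration smoke suite"),
    ("regression", "exhaustive cross-project regression suite"),
    ("upgrade", "express/rolling upgrade scenarios"),
]

def run_stage(stage: str) -> bool:
    """Placeholder for dispatching a suite to CI and polling its verdict."""
    print(f"running {stage} suite...")
    return True  # in reality: the CI system's pass/fail result

def can_merge_to_mainline() -> bool:
    # Fail fast: a failure at any rung blocks the merge, so mainline only
    # ever receives branches that survived the full ladder.
    return all(run_stage(stage) for stage, _ in MERGE_GATES)

if __name__ == "__main__":
    print("merge allowed:", can_merge_to_mainline())
```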
Putting everything into perspective, we have a combination of:
| | Per-commit | Nightly | Per-release | Always-on |
|---|---|---|---|---|
| Test duration | 30 mins | 4 hrs | 24-72 hrs | 7-10 days, and continues to stay on |
| Frequency | For every commit | Once a day | 3-5 times in a release | Once a release |
| Scenarios | | | Backward compatibility; Express/Rolling Upgrades; Partner scenarios | Hortonworks production cluster running customer workflows; Always ON cluster running e2e scenarios simulating production workload |
| Number of tests | Hundreds | Tens of thousands | Hundreds | Hundreds |
| Compute hours | ~250 per day | ~21,000 per day | ~2,000 per release | ~2,400 per release |
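To get a feel for what those compute-hour figures imply, here is a back-of-the-envelope calculation. It assumes compute hours = machines × wall-clock hours, which the table does not state explicitly, so treat the result as an order-of-magnitude estimate:

```python
# Parallelism implied by the nightly column above, under the assumption
# compute_hours = machines * wall_clock_hours (our assumption, not stated).
nightly_compute_hours = 21_000   # ~21,000 compute-hours per day
nightly_wall_clock_hours = 4     # a nightly run takes ~4 hours

machines_in_parallel = nightly_compute_hours / nightly_wall_clock_hours
print(f"~{machines_in_parallel:,.0f} machines busy for the full 4-hour run")
# ~5,250 -- parallelism on roughly this order is what keeps a suite of
# tens of thousands of tests down to a 4-hour wall clock.
```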
To summarize, building a scalable, reliable test infrastructure at such a large scale is a hard engineering problem, especially for a software stack that evolves at an unprecedented pace. In addition to having a robust process that works, we need to allocate the right amount of compute, networking/HA capabilities, and storage, and to develop automation tools that accommodate ongoing releases while preparing for new ones to come. And, most importantly, we need the best minds in the industry: people who understand how the Apache community works and who continuously find new ways to improve the efficiency of the release process.
In our next blog, we will talk more about best practices and processes that help us test and debug efficiently (for both machines and humans). Stay tuned!