Delivering High-Quality Apache Hadoop Releases
As enterprises increasingly adopt Apache Hadoop for critical data, the need for high-quality releases of Apache Hadoop becomes even more crucial. Storage systems in particular require robustness and data integrity, since enterprises cannot tolerate data corruption or loss. Further, Apache Hadoop offers an execution engine for customer applications, which comes with its own challenges. Apache Hadoop handles failures of disks, storage nodes, compute nodes, networks and applications. Its distributed nature, scale and rich feature set make testing Apache Hadoop non-trivial.
Testing Apache Hadoop does not just involve writing a test plan based on the design spec. It requires understanding the numerous use cases of an API, not just the API's specification. The intriguing part has always been analyzing the impact that every feature, improvement and bug fix has on the various Hadoop subsystems and user applications. Additionally, one has to go beyond unit, functional, scale and reliability tests and run tests against live data and live user applications to validate the integration of Hadoop with the other products in the ecosystem.
Delivering a high-quality Apache Hadoop release has been a focus for our team since its early days at Yahoo!, where Apache Hadoop has been used in production across thousands of nodes. Over the years we have developed elaborate test suites and procedures. Every stable Apache release of Hadoop, from the early days through hadoop-0.20.2xx, has gone through this rigorous test procedure.
This work now continues as part of the Yahoo! and Hortonworks partnership. The next generation of Hadoop (hadoop-0.23) has significant new features co-developed by the two companies. It will be hardened as an enterprise-quality product and rolled out across Yahoo! and other organizations around the globe.
Our rigorous process has resulted in a remarkable record of data integrity and robustness. Even the large commercial storage vendors will have a hard time matching the level of testing managed by Yahoo! and Hortonworks.
Can you trust your data to anything less?
Hortonworks QA Process for Apache Hadoop
This section of the post describes the stringent process followed to test and qualify Apache Hadoop releases.
The process consists of the following procedures:
- Nightly QA tests
- Release certification
- Deployment to sandbox, research and production clusters
1. Nightly QA Tests
At Hortonworks, we have a nightly automated deploy setup that deploys the latest Apache Hadoop 0.20.x and 0.23.x code base to two QA clusters. Once the deployment succeeds we run 1200+ automated tests that include the following:
Benchmarks and end-to-end tests
Benchmark tests help track any performance degradation caused by recent code check-ins. End-to-end tests ensure that the new code does not break the overall functioning of the existing framework. These tests serve as acceptance tests before an exhaustive set of functional tests is run.
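The benchmark gate can be sketched as a simple comparison of nightly runtimes against a stored baseline. This is an illustrative toy, not Hortonworks' harness; the benchmark names and the 10% tolerance are assumptions made for the example.

```python
# Hypothetical sketch: flag performance regressions by comparing nightly
# benchmark runtimes against a stored baseline. The 10% tolerance and the
# benchmark names are illustrative assumptions.

TOLERANCE = 0.10  # allow up to a 10% slowdown before flagging

def find_regressions(baseline, nightly, tolerance=TOLERANCE):
    """Return benchmarks whose nightly runtime exceeds baseline * (1 + tolerance)."""
    regressions = {}
    for name, base_secs in baseline.items():
        current = nightly.get(name)
        if current is not None and current > base_secs * (1 + tolerance):
            regressions[name] = (base_secs, current)
    return regressions

baseline = {"terasort_1tb": 1800, "dfsio_write": 600}
nightly = {"terasort_1tb": 2100, "dfsio_write": 610}
print(find_regressions(baseline, nightly))  # only terasort_1tb breaks the budget
```

A real gate would feed this from the nightly runs and fail the build when the returned dictionary is non-empty.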
Functional Testing
This area of testing is the most challenging one due to the advanced, leading-edge feature set of Apache Hadoop, which has to be tested in a distributed environment. Below is a sample of the depth of functional testing needed before distributing an Apache Hadoop release:
- The entire breadth of the product is tested, covering all the subsystems such as HDFS, MapReduce, streaming, distcp, archives and so forth.
- QA does a deep dive into individual subsystems, such as block replication, quotas and the balancer in HDFS, and job scheduling, the distributed cache and the task controller in MapReduce.
- Detailed testing of each of these components is completed, such as user limits, high-RAM jobs, reservations and queue limits in the capacity scheduler. In queue limits alone there are numerous use cases, such as verification of limits on tasks per job, jobs per user, and pending/running tasks per queue and per user.
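One of the queue-limit checks above can be modeled as a small admission predicate. This is a loose sketch of the concept, not the capacity scheduler's actual code or configuration keys; the limit names and values are assumptions.

```python
# Toy model of a capacity-scheduler-style admission check: admitting a new
# job must keep both the user and the queue within their configured limits.
# The field names and limits are illustrative, not Hadoop's real ones.

def can_admit_job(queue_state, user, max_jobs_per_user, max_running_per_queue):
    if queue_state["jobs_per_user"].get(user, 0) >= max_jobs_per_user:
        return False  # per-user limit reached
    if queue_state["running_jobs"] >= max_running_per_queue:
        return False  # queue-wide limit reached
    return True

state = {"running_jobs": 9, "jobs_per_user": {"alice": 4, "bob": 1}}
print(can_admit_job(state, "alice", max_jobs_per_user=5, max_running_per_queue=10))  # True
print(can_admit_job(state, "alice", max_jobs_per_user=4, max_running_per_queue=10))  # False
```

A functional test exercises each limit in isolation and in combination, exactly because the boundary cases (limit reached vs. one below the limit) are where scheduler bugs hide.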
Thus, the Hortonworks QA process catches any regressions introduced by a new patch and provides early insight into the quality of upcoming releases.
2. Release certification
Prior to calling for an Apache release vote, Hortonworks QA will ensure that the following tests succeed:
- All the unit tests succeed on Apache Jenkins
- No degradation is observed in the benchmark numbers
- No regressions are introduced in the nightly test run
Once the above tests succeed, the following non-functional tests are executed:
Compatibility with existing clusters
To ensure a smooth upgrade of existing clusters, Hortonworks QA verifies the compatibility of the new code with the existing cluster. For this we run all the existing tests on the upgraded cluster and make sure that old user jobs are still able to run.
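The compatibility pass boils down to replaying a suite of pre-upgrade jobs on the upgraded cluster and requiring every one to succeed. A minimal sketch, where `submit_job` is a hypothetical stand-in for actually launching the job on the cluster:

```python
# Hedged sketch of a backward-compatibility check: replay "old" job
# definitions on the upgraded cluster and collect any that fail.
# submit_job is a hypothetical callable standing in for real job submission.

def verify_backward_compat(old_jobs, submit_job):
    failures = [job for job in old_jobs if not submit_job(job)]
    return failures  # an empty list means the upgrade preserved compatibility

old_jobs = ["wordcount-0.20", "streaming-grep-0.20", "distcp-copy-0.20"]
print(verify_backward_compat(old_jobs, submit_job=lambda job: True))  # []
```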
Hadoop Stack Integration Testing
We also run tests to verify that other products in the Apache Hadoop ecosystem such as Pig, HCatalog, Hive and Oozie seamlessly integrate with the new code.
Security Testing
To guarantee privacy, security and integrity, and to ensure that users are correctly authenticated to the edge service, Hortonworks QA verifies the following security scenarios:
- User level authorization and authentication to perform HDFS operations
- User level authorization and authentication to submit, execute and administer MapReduce jobs
- Service level authorization to access HDFS and to run MapReduce jobs
The framework is also tested for scenarios such as unauthorized users, services, expired/cancelled/invalid kerberos tickets, block tokens, delegation tokens and corrupt credentials.
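The expired-token and corrupt-credential cases can be illustrated with a toy signed-token check. Real Hadoop delegation tokens are HMAC-signed identifiers validated by the services; this simplified version only captures the intent of the test cases, and the secret, format and messages are assumptions.

```python
# Toy delegation-token-style check used to illustrate the security test
# cases above: reject expired tokens and tampered (corrupt) credentials.
# The token format and secret are illustrative, not Hadoop's actual scheme.

import hashlib
import hmac

SECRET = b"test-secret"  # assumed shared secret for the example

def make_token(user, expiry_ts):
    payload = f"{user}:{expiry_ts}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload, sig

def validate(payload, sig, now):
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return "rejected: corrupt credentials"
    user, expiry = payload.decode().rsplit(":", 1)
    if now >= int(expiry):
        return "rejected: expired token"
    return f"accepted: {user}"

payload, sig = make_token("alice", expiry_ts=1000)
print(validate(payload, sig, now=500))          # accepted: alice
print(validate(payload, sig, now=2000))         # rejected: expired token
print(validate(payload + b"x", sig, now=500))   # rejected: corrupt credentials
```

The QA scenarios enumerate exactly these negative paths (expired, cancelled, invalid, tampered) for Kerberos tickets, block tokens and delegation tokens alike.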
Scalability Testing
The focus of scalability tests is to verify that Apache Hadoop can gracefully handle increases in requests, data sets, jobs, etc. without degrading performance. To run scale tests, Hortonworks feeds load captured from Yahoo!’s production grids onto our 800+ node QA cluster using GridMixV3, rumen and folder.
We also test the framework with:
- High data volume
- Increased number of files and directories in the namespace
- Large number of HDFS and local FS read and writes
- Increased number of jobs and tasks in various states such as pending, running and completed
- Large number of users in the system and queues
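The shape of such a scale test can be sketched as a load ramp with a latency budget. Everything here is a stand-in: `measure` abstracts replaying production traces (e.g. via GridMix) at a given load, and the load levels and budget are invented for the example.

```python
# Simplified sketch of a scale-test driver: ramp the request load and
# verify that per-request latency stays within a budget. measure() is a
# hypothetical stand-in for replaying production traces on a test cluster.

def run_scale_ramp(measure, loads, latency_budget_ms):
    """Return the first load level whose latency breaks the budget, else None."""
    for load in loads:
        latency_ms = measure(load)
        if latency_ms > latency_budget_ms:
            return load
    return None

# Toy stand-in: latency grows gently, then degrades past 10k requests/s.
def fake_measure(load):
    return 5 + load / 1000 + (0 if load <= 10_000 else (load - 10_000) / 10)

print(run_scale_ramp(fake_measure, [1_000, 5_000, 10_000, 20_000], latency_budget_ms=50))
```

The interesting output of a real run is not pass/fail but the knee of the curve: the load level at which the framework stops degrading gracefully.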
Reliability Testing
Reliability of the framework refers to its ability to function normally in the presence of failures. The Hortonworks QA reliability tests broadly cover:
- Service failures – failure of:
  - Secondary Namenode
- Network failures – connection timeouts, lost Tasktrackers and fetch failures
- Bad hardware – corrupt disks, missing blocks, corrupt data and lost map outputs
Testing reliability in Apache Hadoop is critical because it is not sufficient merely to check the recovery mechanism; one also has to confirm that the state of the system is reflected correctly. For example, a task running on a lost Tasktracker will eventually be rescheduled, but the testing is complete only after verifying that no further tasks are scheduled on the lost TT, that the total cluster capacity is reduced, and that the TT no longer appears in the active TT list. And in the case where the lost TT rejoins, tasks should once again be scheduled on it. We cover all of the above failure scenarios to unravel any unreliability in the code.
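The lost-Tasktracker checks above can be captured in a tiny state model. This mirrors the assertions a reliability test would make, not Hadoop's actual JobTracker code; the slot count per node is an assumption.

```python
# Toy model of the lost-Tasktracker scenario: on loss, its tasks become
# pending for rescheduling, it leaves the active list, and cluster capacity
# shrinks; on rejoin it becomes schedulable again. Illustrative only.

SLOTS_PER_TT = 4  # assumed task slots per Tasktracker

class Cluster:
    def __init__(self, trackers):
        self.active = set(trackers)
        self.tasks = {}  # task -> tracker it is running on

    @property
    def capacity(self):
        return len(self.active) * SLOTS_PER_TT

    def schedule(self, task, tracker):
        assert tracker in self.active, "must not schedule on a lost TT"
        self.tasks[task] = tracker

    def lose(self, tracker):
        self.active.discard(tracker)
        # tasks on the lost TT go back to pending for rescheduling
        return [t for t, tt in self.tasks.items() if tt == tracker]

    def rejoin(self, tracker):
        self.active.add(tracker)

c = Cluster(["tt1", "tt2"])
c.schedule("map_0", "tt1")
orphaned = c.lose("tt1")
print(orphaned, sorted(c.active), c.capacity)  # ['map_0'] ['tt2'] 4
c.rejoin("tt1")
print(c.capacity)  # 8
```

A reliability test asserts each of these observations separately: the orphaned task list, the shrunken capacity, the membership of the active list, and schedulability after rejoin.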
3. Release Testing at Yahoo!
Once the release is certified by QA, it is deployed onto three of Yahoo!’s sandbox clusters, each having 400-1000 nodes. The release remains available there for two months, awaiting signoff from all the production projects.
After the sandbox environment, the release moves to six of Yahoo!’s research clusters, where it is deployed for another two months before reaching the production clusters. The average number of jobs per week on the research clusters varies from 0.25 to 0.5 million; thus, by the time we exit this stage, the Apache Hadoop release has run more than 10 million jobs and stored tens of petabytes of data.
Only then is the release deployed onto the production clusters.
Also, the production logs are made available to QA to run future scale tests on QA clusters.
The highly stringent Dev-QA-Operation process described in this post for rolling out new releases to the grids is followed for every release of Apache Hadoop. Testing the Apache Hadoop release in such a manner, at very large scale, has helped Yahoo! qualify high-quality releases. This will now help Hortonworks do the same.
— Ramya Sunil