Hortonworks, Scaled Risk and eBay Collaborate to Improve HBase Mean Time to Recovery (MTTR)
This post’s Principal Author: Ming Ma, Software Development Manager, eBay. With contribution from Mayank Bansal (eBay), Devaraj Das (Hortonworks), Nicolas Liochon (Scaled Risk), Michael Weng (eBay), Ted Yu (Hortonworks), John Zhao (eBay)
eBay runs Apache Hadoop at extreme scale, with tens of petabytes of data. Hadoop was created for computing challenges like ours, and eBay runs some of the largest Hadoop clusters in existence.
Our business uses Apache HBase to deliver value to our customers in real-time and we are sensitive to any failures because prolonged recovery times significantly degrade site performance and result in material loss of revenue. Over the past few months, we have been testing and using the Mean Time to Recovery (MTTR) feature improvements, delivered in the open HBase community.
We partnered with Hortonworks and Scaled Risk to work together to improve HBase failure detection and MTTR, a key issue that concerns everybody in the open source community.
Nicolas Liochon, from Scaled Risk introduced our collaboration in an earlier Hortonworks blog post called “Introduction to HBase Mean Time to Recovery (MTTR)”.
In this post, we summarize the findings of our simulation effort to demonstrate the MTTR capabilities of HBase 0.95+ in a large, physical-node environment with real-world workloads and data.
Our approach was to stabilize on a tuned simulation cluster (“Test Cluster”) and make multiple runs of each test scenario, taking an average of those runs, and comparing with a historical, anecdotal average MTTR of real-world production clusters running HBase 0.92.1 (“Baseline”).
We considered certain types of HBase failures, such as:
- Node/RegionServer failed while writing
- Node/RegionServer failed while reading
- Rack failure
- Whole cluster failure
- Machine reboot because CPU too hot
- NIC speed flip to 100Mb/s
We decided to focus on the first two of these failure types for our testing. In the future, we plan to test rack failure, whole cluster failure, and machine reboot because of a hot CPU, and NIC speed flip to 100Mb/s. We plan to use the results of those tests as guidance for future HBase improvements in the Apache community.
Testing Scenarios, Environment Details and Test Data
For the Node/RegionServer failure test scenarios, we killed the Node or the RegionServer while they were either being written to or read from. We ran these four tests at two different activity levels: “hot cluster” or “cold cluster”. “Cold cluster” means that no active jobs were running at the time of the test. “Hot clusters” were running aggressive row counter jobs during the test.
Therefore, we had eight different end-state scenarios: 2 actions (write or read), times 2 machine types (Node or RegionServer), times two activity levels (hot or cold). These are shown in the following diagram.
We compared these eight different outcomes on the Test Cluster to the HBase MTTR Baseline average observed over time on 500 nodes running HBase 0.92.1, HDFS 2.0 and MapReduce 1.0. The Test Cluster environment was comprised of 220 nodes running HBase 0.95.1+, HDFS 2.1 and MapReduce 2.0 on YARN.
The average MTTR across our Baseline was 180 seconds and includes a mix of Node failure and RegionServer failure measurements across the different read/write and hot/cold scenarios.
Hardware specifications, inter-connectivity, and rack layout (~30-40 nodes per rack) was identical/comparable in both environments. The Large Table data used for the simulation consisted of about 900 million rows, 3600 regions and 10TB of data on HDFS (with snappy compression).
Scenario 1 – Node/RegionServer Killed On Write
This scenario simulated Large Table data writes for 10 minutes (using the same schema and compression for the Test Cluster as with the Baseline, with real-world data). We then identified the RegionServer being written to and simulated Node failure using firewall rules to isolate the server. The Node and RegionServer were killed when data was being written to them.
As one can see in the following summarization plot, the Test Cluster MTTR after writing data exceeded expectations, comparing very favorably against the Baseline average of 180 seconds.
Scenario 2 – Node/RegionServer Killed on Read
This scenario simulated reading data from the Large Table. We identified the RegionServer being read from and simulated machine failure using firewall rules to isolate the server.
This summarization plot shows similar performance gains as in the first scenario. MTTR after reading data on the Test Cluster, exceeds expectations and compares very favorably against the Baseline average of 180 seconds.
The simulation findings summarized in this post demonstrate the application of significant ongoing work in the Apache HBase community around MTTR. For the scenarios tested, HBase 0.95+ on HDFS 2.1 and YARN (Test Cluster) compares favorably against historic MTTR performance of HBase 0.92 on HDFS 2.0 and MapReduce 1.0 (Baseline).
In addition, in the Test Cluster, Namenode failure was shown to have minimal impact on actively running jobs. Outright HBase performance was superior on the Test Cluster.
As an integral part of the simulation effort, the collaboration between Hortonworks, eBay, and Scaled Risk has resulted in numerous improvements (proposed and implemented) in Apache HBase and HDFS itself.
One immediate result of this collaboration was the creation of new test tools. Automated tools were developed and utilized for simulation scenarios, and they are generally applicable and will evolve to become a framework for executing integration and unit tests. These tools will automate the failure detection process, and now anyone can use them to track HBase MTTR on their own cluster. See this JIRA thread for more details.
For additional details around MTTR as it relates to HBase and the ongoing work in the community please refer to the following supplemental materials:
- Introduction to HBase Mean Time to Recover (MTTR)
- Slideshare: How To Get the MTTR Below One Minute and More