Data Integrity and Availability in Apache Hadoop HDFS
Data integrity and availability are important for Apache Hadoop, especially for enterprises that use Apache Hadoop to store critical data. This blog will focus on a few important questions about Apache Hadoop’s track record for data integrity and availability and provide a glimpse into what is coming in terms of automatic failover for HDFS NameNode.
What is Apache Hadoop’s Track Record for Data Integrity?
In 2009, we examined HDFS’s data integrity at Yahoo! and found that HDFS lost 650 blocks out of 329 million blocks on 10 clusters with 20,000 nodes running Apache Hadoop 0.20.3. Of the 650 lost blocks:
- 533 blocks were temporary blocks abandoned by a failed client. Due to a bug in release 0.20’s append, when a client abandons a file, its temporary blocks are not correctly deleted and are instead reported as corrupt. These aren’t blocks lost by HDFS but merely incorrectly reported. This bug has since been fixed with the new implementation of append in release 0.21 onwards.
- 98 blocks were blocks explicitly created with a single replica. Applications sometimes create files with a replication factor of 1 to gain an additional performance advantage when the data is not critical or is easily reproduced.
- 19 blocks were lost due to roughly 7 bugs that were promptly fixed in dot releases of 0.20.
The customer who lost those 19 blocks would not find HDFS’s 99.99999% data reliability very comforting. Hence we take losing 19 blocks out of 329 million blocks very seriously.
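The arithmetic behind these figures can be checked with a short sketch. The counts come from the text above; the variable and function names are illustrative only, and the reliability figure here counts just the 19 blocks lost to bugs.

```python
# Block-loss figures quoted in the text above.
TOTAL_BLOCKS = 329_000_000

lost = {
    "abandoned temporary blocks (misreported)": 533,
    "single-replica blocks": 98,
    "lost to bugs": 19,
}


def loss_fraction(lost_blocks: int, total_blocks: int) -> float:
    """Return the fraction of all blocks that were lost."""
    return lost_blocks / total_blocks


# The three categories should add up to the 650 blocks reported lost.
total_lost = sum(lost.values())

# Only the blocks lost to bugs count against data reliability here.
bug_loss = loss_fraction(lost["lost to bugs"], TOTAL_BLOCKS)
reliability_pct = 100.0 * (1.0 - bug_loss)

print(f"total reported lost: {total_lost}")
print(f"fraction lost to bugs: {bug_loss:.2e}")
print(f"reliability: {reliability_pct:.6f}%")
```

The three categories sum to the 650 reported blocks, and the 19 bug-related losses work out to a little under 6 blocks per 100 million.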
Why is Apache Hadoop’s Track Record So Exemplary?
- HDFS has a simple yet robust architecture that was explicitly designed for data reliability in the face of faults and failures in disks, nodes and networks.
- HDFS, by default, replicates each data block 3 times on different nodes and on at least 2 racks.
- Data is also verified end-to-end against its checksum whenever it is read or written.
- The system constantly checks the integrity of data stored on disk against its checksum. Any faults are promptly reported to the NameNode, which in turn promptly creates additional new replicas and deletes corrupt ones.
- HDFS stores data using the underlying operating system’s file system (typically Ext3 on Linux). This means one can take advantage of the platform operating system’s hardened file system. Furthermore, one can easily take advantage of improvements in the operating system’s file system; for example, we expect many customers to move to Ext4 to take advantage of its improved performance. Solaris customers can use its excellent ZFS.
- Extensive testing and hardening of Apache Hadoop is a critical part of Hadoop’s impressive data integrity and availability story. Stable Hadoop releases are extensively tested and hardened by Hortonworks and Yahoo! as described in Delivering High-Quality Apache Hadoop Releases.
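The checksum verification mentioned in the list above can be illustrated with a minimal sketch. This is not HDFS code: the 512-byte chunk size, the use of CRC32 via Python's `zlib`, and the function names `compute_checksums` and `find_corrupt_chunks` are all assumptions chosen for illustration. The idea is simply that checksums recorded at write time are recomputed later and compared.

```python
import zlib
from typing import List

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


def compute_checksums(data: bytes, chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Compute one CRC32 value per fixed-size chunk of the data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(
    data: bytes, stored: List[int], chunk_size: int = BYTES_PER_CHECKSUM
) -> List[int]:
    """Return the indices of chunks whose checksum no longer matches.

    A chunk that is missing on either side also counts as a mismatch.
    """
    current = compute_checksums(data, chunk_size)
    longest = max(len(current), len(stored))
    return [
        i for i in range(longest)
        if i >= len(current) or i >= len(stored) or current[i] != stored[i]
    ]


# Checksums are recorded when the data is written...
block = bytes(range(256)) * 8            # 2048 bytes, i.e. four chunks
stored = compute_checksums(block)

# ...and compared again later, after a simulated on-disk fault.
damaged = bytearray(block)
damaged[600] ^= 0xFF                     # corrupt one byte in the second chunk
print(find_corrupt_chunks(bytes(damaged), stored))
```

In the scheme described above, a replica that fails such a check would be reported to the NameNode so that a fresh replica can be created from a healthy copy.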
What is Apache Hadoop’s Track Record for Data Availability?
As described above, HDFS tolerates failures of storage servers (called DataNodes) and their disks. The NameNode stores its metadata on multiple disks that typically include a non-local file server; hence when the HDFS NameNode is restarted it recovers its metadata. The HDFS file system is temporarily unavailable whenever the HDFS NameNode is down.
We examined failures of the HDFS NameNode over the last 18 months and found 22 failures across 25 clusters, which equates to 0.58 failures per cluster per year. The Mean-Time-Between-Failures (MTBFs) are:
- MTBF = 610.5 days (3 hour restart time, 4000 nodes with 12PB)
- MTBF = 612.6 days (1 hour restart time, 750 nodes with 3PB)
Of these, only 8 would have benefited from an automatic failover of the NameNode. These 8 included both hardware failures (some due to memory errors) and software failures. Due to the low failure rate, the MTBFs improve only slightly with a 5 minute automatic failover:
- MTBF = 611.6 days (8 failovers in 5 minutes and the rest in 3 hours)
- MTBF = 612.9 days (8 failovers in 5 minutes and the rest in 1 hour)
Failures that would not have benefited from automatic failover included cluster power failure, failure of non-redundant switches, configuration errors, etc.
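For a rough feel for these numbers, the sketch below recomputes the per-cluster failure rate and the average outage length with and without a 5 minute failover. It does not reproduce the MTBF figures above, since the post does not spell out how they were derived. The function names are illustrative, and the simple weighted-average model is an assumption.

```python
# Figures quoted in the text above.
FAILURES = 22            # NameNode failures observed
CLUSTERS = 25
PERIOD_YEARS = 1.5       # 18 months
FAILOVER_ELIGIBLE = 8    # failures that automatic failover would have helped
FAILOVER_MINUTES = 5


def failures_per_cluster_year(failures: int, clusters: int, years: float) -> float:
    """Average number of failures per cluster per year."""
    return failures / (clusters * years)


def mean_outage_minutes(
    total: int, eligible: int, restart_minutes: float, failover_minutes: float
) -> float:
    """Average outage length when `eligible` failures recover via failover
    and the remaining ones still need a full restart (assumed model)."""
    full_restarts = total - eligible
    return (eligible * failover_minutes + full_restarts * restart_minutes) / total


rate = failures_per_cluster_year(FAILURES, CLUSTERS, PERIOD_YEARS)
print(f"failures per cluster per year: {rate:.3f}")

for restart_hours in (3, 1):
    manual = restart_hours * 60
    with_failover = mean_outage_minutes(
        FAILURES, FAILOVER_ELIGIBLE, manual, FAILOVER_MINUTES
    )
    print(
        f"{restart_hours} hour restart: mean outage {manual} min "
        f"without failover, {with_failover:.1f} min with failover"
    )
```

Under this model, failover shortens the average outage noticeably, but because failures are so rare to begin with, the effect on overall availability stays small.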
More details are available in Robert Chansler’s Hadoop Summit 2011 presentation (which will soon be available as a separate blog).
The Upcoming HA NameNode – Automatic NameNode Failover
Automatic failover for the HDFS NameNode is in active development. Suresh Srinivas (also from Hortonworks) and I have published a detailed design in JIRA HDFS-1623. The design has been well received by other active HDFS developers, and the feature is being built by engineers at Hortonworks, Cloudera, eBay and Facebook. Several sub-task JIRAs have been filed and are being addressed by developers from these organizations.
The design was recently presented at a Hadoop User Group meeting on July 27th, 2011 (HA NameNode). Be sure to stay tuned to the Hortonworks blog for more details.
Much of the data reported in this blog was collected at Yahoo! and reported by Robert Chansler at Hadoop Summit 2011.
— Sanjay Radia