Best Practices: Failure of Active NameNode in Hadoop Prior to HA
Failure of Active NameNode in a non-HA deployment
The best approach to mitigating the risk of data loss due to a NameNode failure is to harden the NameNode system and components to meet the desired level of redundancy.
Because the journal is not flushed with every operation, the state persisted on disk can be up to several seconds behind the NameNode's current state. This latency determines the scope of potential data loss in the event of a NameNode failure.
A highly fault-tolerant NameNode system mitigates the potential for data loss. In the future, when the NameNode is distributed, this latency will no longer be a concern and data loss scenarios will become much less probable.
This level of fault tolerance and availability can be reached through various mechanisms: hardware, software, or some combination of the two.
Until NameNode HA (High Availability) becomes available, the current solution is to set up a secondary NameNode that stores a duplicate set of the NameNode data.
When setting up the secondary NameNode, decide whether it will take over the NameNode role if the NameNode fails, or serve only as a means of replicating NameNode data.
If the secondary host will assume the role of the NameNode, make sure that no other services running on it would be affected by an IP/FQDN change, because the failover NameNode must resolve to the same IP as the failed node. For more information, see JIRA issue HDFS-34.
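The name-resolution requirement above can be checked before and after the IP change. The following is a minimal sketch, not part of Hadoop: the function name `resolves_to` and its `resolver` parameter are hypothetical and exist only for illustration.

```python
import socket


def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname):
    """Return True if hostname resolves to expected_ip.

    The resolver argument is an illustrative hook so the check can be
    exercised without real DNS; by default a normal lookup is performed.
    """
    try:
        return resolver(hostname) == expected_ip
    except OSError:
        # socket.gaierror is a subclass of OSError: the name did not resolve.
        return False
```

For example, `resolves_to("namenode.example.com", "10.0.0.10")` would be called with the NameNode's FQDN and the failed node's IP; both values here are placeholders. Running the check from a DataNode host as well as from the failover host helps catch stale local resolver entries.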
It is also advisable to mirror the Hadoop NameNode binaries and supporting libraries onto the secondary NameNode. If the system has been architected to be fault tolerant, this should already be addressed. If not, the binaries and configuration must be duplicated before the new node is promoted to NameNode.
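One way to confirm that the mirrored binaries and configuration really match is to compare checksums of the two installation trees. The sketch below assumes both trees are reachable as local paths (for example, the secondary's tree mounted or copied locally). The helper names `tree_digests` and `diff_trees` are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(primary, standby):
    """Return the sorted relative paths that are missing or differ."""
    a = tree_digests(primary)
    b = tree_digests(standby)
    # A path counts as different if it exists in only one tree
    # or if its contents hash differently in the two trees.
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty result from `diff_trees` suggests the two installations are identical. Any path it returns should be reconciled before the node is promoted.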
The overall steps for manually switching to a new NameNode are as follows (see http://wiki.apache.org/hadoop/NameNodeFailover):
- Make a copy of the data before promoting the host to NameNode
- Change the IP address of the target to the IP of the failed NameNode
- Ensure Hadoop is installed and configured identically to the original
- DO NOT FORMAT THIS NODE
At this point the new NameNode should begin processing the journal/logs and, once all nodes have reported their image / blocks, come up.
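The first step in the list, copying the data before promotion, could be scripted along the following lines. This is a sketch under stated assumptions: the metadata directory and backup location are supplied by the operator, and `backup_metadata` is a hypothetical helper, not a Hadoop tool. It only reads the source directory and never formats or modifies it.

```python
import shutil
import time
from pathlib import Path


def backup_metadata(name_dir, backup_root):
    """Copy the NameNode metadata directory to a timestamped backup folder.

    Returns the path of the new backup. The source directory is only
    read; nothing in it is modified or formatted.
    """
    name_dir = Path(name_dir)
    if not name_dir.is_dir():
        raise FileNotFoundError(f"metadata directory not found: {name_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{name_dir.name}-{stamp}"
    # copytree raises FileExistsError if the target already exists,
    # so an earlier backup is never overwritten.
    shutil.copytree(name_dir, target)
    return target
```

Taking the copy while the NameNode process on the target host is stopped avoids capturing a half-written journal.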