The Hortonworks Blog

More from Sanjay Radia

Traditionally, HDFS, Hadoop’s storage subsystem, has focused on one kind of storage medium, namely spindle-based disks. However, a Hadoop cluster can contain significant amounts of memory, and with the continued drop in memory prices, customers are willing to add memory targeted at caching storage to speed up processing.

Recently, HDFS generalized its architecture to include other kinds of storage media, including SSDs and memory [1]. We also added support for caching hot files in memory [2].…

Introduction

A Highly Available NameNode for HDFS has been in development since last year. That effort focused solely on the automatic failover of the NameNode for Hadoop 2.0. During that time, we realized two things.

First, we realized we should use an outside-in approach to the HA problem: start by designing the availability of the Hadoop system as a whole and then focus on the high availability of individual components; that work led to the Full Stack HA Architecture.…

Data integrity and availability are important for Apache Hadoop, especially for enterprises that use Apache Hadoop to store critical data. This blog post focuses on a few important questions about Apache Hadoop’s track record for data integrity and availability and provides a glimpse into what is coming in terms of automatic failover for the HDFS NameNode.

What is Apache Hadoop’s Track Record for Data Integrity?

In 2009, we examined HDFS’s data integrity at Yahoo!…