The Hortonworks Blog

More from Sanjay Radia

Ram Venkatesh also contributed to this blog series. Why Apache Hadoop in the Cloud? Ten years ago, Hadoop the elephant began its Big Data journey inside the firewall of a data center: the Apache Hadoop components were deployed on commodity servers in a private data center. Now, the public cloud is another viable option for […]

Today, organizations use the Apache Hadoop™ stack as a central data lake to store their critical datasets and power their analytical processing workloads. A key requirement for the Hadoop cluster and the services running on it is to be highly available and to continue functioning flawlessly while the software is being upgraded. […]

Traditionally, HDFS, Hadoop’s storage subsystem, has focused on one kind of storage medium, namely spindle-based disks. However, a Hadoop cluster can contain significant amounts of memory, and with the continued drop in memory prices, customers are willing to add memory dedicated to caching in order to speed up processing. Recently, HDFS generalized its architecture to include […]
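As a rough illustration of how an application can target the memory tier that this post describes, here is a minimal Java sketch using the LAZY_PERSIST create flag introduced in Hadoop 2.6; the path, buffer size, replication, and block size below are placeholder values for illustration, not taken from the post:

    import java.util.EnumSet;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.CreateFlag;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class LazyPersistWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical target path; LAZY_PERSIST asks the DataNode to buffer
            // the replica in memory first and persist it to disk lazily.
            Path path = new Path("/tmp/lazy-persist-example");
            EnumSet<CreateFlag> flags =
                EnumSet.of(CreateFlag.CREATE, CreateFlag.LAZY_PERSIST);

            try (FSDataOutputStream out = fs.create(
                    path,
                    FsPermission.getFileDefault(),
                    flags,
                    4096,               // buffer size
                    (short) 1,          // single replica: memory writes are best-effort
                    128 * 1024 * 1024,  // block size
                    null)) {            // no progress callback
                out.writeBytes("written to the memory storage tier first\n");
            }
        }
    }

Note the single replica: a memory-resident write is best-effort, so applications that use it should tolerate occasional data loss, for example by being able to regenerate the data.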

A Highly Available NameNode for HDFS has been in development since last year. That effort focused solely on automatic failover of the NameNode for Hadoop 2.0. During that time we realized two things. First, we should take an outside-in approach to the HA problem: start by designing the availability of […]

Data integrity and availability are important for Apache Hadoop, especially for enterprises that use Apache Hadoop to store critical data. This blog focuses on a few important questions about Apache Hadoop’s track record on data integrity and availability, and provides a glimpse of what is coming in automatic failover for the HDFS NameNode. […]