Category Archives: HDFS


Snapshots for HDFS

This blog covers our ongoing work on Snapshots in Apache Hadoop HDFS. I will cover the motivations for the work, a high-level design and some of the design choices we made. Having seen snapshots in use with various filesystems, I believe that adding snapshots to Apache Hadoop will be hugely valuable to the Hadoop community. With luck this work will be available to Hadoop users in late 2012 or 2013.

A snapshot is a point-in-time image of the entire filesystem or a subtree of a filesystem. Some of the scenarios where snapshots are very useful:

  1. Protection against user errors: Admin sets up a process to take read-only (RO) snapshots periodically in a rolling manner so that there are always x number of RO snapshots on HDFS. If a user accidentally deletes a file, the file can be restored from the latest RO snapshot that contains the file.
  2. Backup: Admin wants to back up the entire file system, a subtree in the file system or just a file. Depending on the requirements, admin takes an RO snapshot and uses this snapshot as the starting point of a full backup. Incremental backups are then taken by doing a diff between two snapshots.
  3. Experimental/Test setups: A user wants to test an application against the main dataset. Normally, without doing a full copy of the dataset, this is a very risky proposition because the test setup can corrupt or overwrite production data. Admin creates a read-write (RW) snapshot of the production dataset and assigns the RW snapshot to the user for the experiment. Changes made to the RW snapshot will not be reflected on the production dataset.
  4. Disaster Recovery: RO snapshots can be used to create a consistent point-in-time image for replication, and this image can be copied over to a remote site for Disaster Recovery.
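To make the incremental backup scenario concrete, here is a minimal sketch in Python of diffing two snapshots. It is an illustration only, not the HDFS snapshot design: it assumes each snapshot can be listed as a mapping from file path to a (length, modification time) pair, and the function name `snapshot_diff` and the sample listings are hypothetical.

```python
def snapshot_diff(older, newer):
    """Compare two snapshot listings and report what changed.

    Each listing is assumed to map a file path to a (length, mtime)
    tuple. This is a hypothetical representation for illustration,
    not an HDFS data structure.
    """
    created = sorted(p for p in newer if p not in older)
    deleted = sorted(p for p in older if p not in newer)
    modified = sorted(
        p for p in newer if p in older and newer[p] != older[p]
    )
    return {"created": created, "deleted": deleted, "modified": modified}


# Hypothetical listings of two RO snapshots taken a day apart.
monday = {
    "/data/a.log": (100, 1000),
    "/data/b.log": (200, 1000),
    "/data/c.log": (300, 1000),
}
tuesday = {
    "/data/a.log": (100, 1000),  # unchanged
    "/data/b.log": (250, 2000),  # appended to since Monday
    "/data/d.log": (50, 2000),   # new file; c.log was deleted
}

changes = snapshot_diff(monday, tuesday)
print(changes)
```

An incremental backup would then copy only the created and modified files and record the deleted ones, rather than copying the whole subtree again.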

Read More

Namenode HA Reaches a Major Milestone

We reached a significant milestone in HDFS: the Namenode HA branch was merged into the trunk. With this merge, HDFS trunk now supports HOT failover.

Significant enhancements were completed to make HOT failover work:

  • Configuration changes for HA
  • The notion of active and standby states was added to the Namenode
  • Client-side redirection
  • Standby processing journal from Active
  • Dual block reports to Active and Standby

We have extensively tested HOT manual failover in our labs over the last few months. The HDFS team is now working on completing automatic failover. Please see HDFS-1623 for more details.

~Jitendra Pandey

Apache Hadoop 0.23.1 is Released!

A very short while ago, Vinod blogged about some of the significant improvements in Hadoop.Next (a.k.a. hadoop-0.23.1).

To recap, the Hortonworks and Yahoo! teams have done a huge amount of work to test, validate and benchmark Hadoop.Next, the next generation of Apache Hadoop that includes HDFS Federation, NextGen MapReduce (a.k.a. YARN) and many other significant features and performance improvements.

Today, I’m very excited to announce that the Apache Hadoop community voted to release hadoop-0.23.1 and it’s now available for all to use!

Please head over to the Apache Hadoop Releases page to download and play with it. Happy Hadoop-ing!

Of course, many thanks to everyone in the community who contributed!

~Arun

Delivering on Hadoop .Next: Benchmarking Performance

In our previous blogs and webinars we have discussed the significant improvements and architectural changes coming to Apache Hadoop .Next (0.23). To recap, the major ones are:

  • Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
  • NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change to split up the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs. Thus, Hadoop becomes a general-purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, graph processing, iterative processing, etc.

Read More

Delivering the Next Generation of Apache Hadoop

Today we announced our plans to release a public preview of the Hortonworks Data Platform (HDP) version 2. HDP v2 will leverage Apache Hadoop 0.23, which is the first major update to Hadoop in more than three years. Among other advancements, HDP v2 will include the NextGen MapReduce architecture, HDFS NameNode HA and HDFS Federation. It will also include the most up-to-date stable components including HCatalog, HBase, Hive and Pig; all fully integrated and tested at scale.

In order to avoid confusion, let me explain the two versions of HDP:

  • HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop and is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
  • HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available in the second half of 2012.

Read More

Apache Hadoop Reaches Milestone: Release 1.0.0

Congratulations! The Hadoop Community has given itself a big holiday present: Release 1.0.0! This release has been six years in the making, and has involved:

  • Hard work and cooperation from dozens of software developers and contributors from across the industry, including of course Doug Cutting and Mike Cafarella’s early work in Nutch and the founding Hadoop team at Yahoo, Doug, Owen O’Malley and many others, with leadership from Eric14.  Special thanks to all the Hadoop committers.
  • Commitment to stability, joined with testing and indispensable production experience at scale, at industry-leading companies like Yahoo!, Facebook, LinkedIn, and others, including hundreds of millions of compute-hours and exabytes of data processed.
  • Feedback from hundreds of knowledgeable users, data scientists, systems engineers and architects.
  • Commitment to the philosophy and practice of open source from Google, who published their seminal papers and have long supported Apache.
  • The Apache Software Foundation, which provided a structured home for the growth of the ecosystem and blossoming of multiple associated projects.

Read More

WebHDFS – HTTP REST Access to HDFS

Motivation

Apache Hadoop provides a high-performance native protocol for accessing HDFS. While this is great for Hadoop applications running inside a Hadoop cluster, users often want to connect to HDFS from the outside. For example, some applications have to load data in and out of the cluster, or interact with the data stored in HDFS from the outside. Of course they can do this using the native HDFS protocol, but that means installing Hadoop and a Java binding alongside those applications. To address this we have developed an additional protocol, called WebHDFS, to access HDFS using an industry-standard RESTful mechanism. WebHDFS takes advantage of the parallelism that a Hadoop cluster offers and retains the security that the native Hadoop protocol offers. It also fits well into the overall strategy of providing web services access to all Hadoop components.
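As a rough illustration of what REST access looks like from outside the cluster, here is a minimal Python sketch that builds WebHDFS URLs. WebHDFS exposes HDFS paths under the `/webhdfs/v1` prefix and selects the operation with an `op` query parameter (for example `OPEN`, `LISTSTATUS` or `GETFILESTATUS`). The helper name `webhdfs_url`, the host name and the port are assumptions made for this example; use whatever HTTP address your NameNode is configured with.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for an absolute HDFS path.

    host and port are the NameNode's HTTP address (assumed values in
    the example below). params holds any extra query parameters.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = {"op": op}
    query.update(params or {})
    # quote() percent-encodes unsafe characters but keeps '/' intact.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


# Hypothetical NameNode address and file path.
url = webhdfs_url("namenode.example.com", 50070, "/user/alice/data.txt", "OPEN")
print(url)
```

Any HTTP client can then issue a GET against the resulting URL, so the application needs neither a Hadoop installation nor a Java binding.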

Read More

Update on Apache Hadoop-0.23

There has been a lot of progress on hadoop-0.23. We’re continuing to crank through issues as we get ready to ship.

We are mostly past the initial challenges of moving our entire build infrastructure to Maven. Many thanks to Alejandro, Tom, Giri & Eric Yang for making it happen.

HDFS is nearly there:

  • HDFS Federation and Client-side mount tables have been tested with ~300 node clusters with security on.
  • HDFS upgrades have been tested from 0.20.2xx.
  • Functional tests for HDFS are complete.

Read More

Do You Have an Interesting HDFS Use Case?

Hi Folks,

I’m talking at a storage conference this month and I’d like to see if crowdsourcing will generate interesting examples and studies that I can include in my presentation.

What I’d like are interesting cases where HDFS has been compared to other storage technologies. I’m especially interested in cases where the decision was made to deploy HDFS rather than to buy an alternative technology. I’m also interested in any large deployments where HDFS is being used for interesting things beyond being the serving layer for MapReduce and HBase. If you have an interesting story, slides or other material that you think might be helpful for an HDFS presentation, please send me a note at HdfsCases2011-group@hortonworks.com.

Read More

Preparing for Next Release of Apache Hadoop

We are glad to have branched for a hadoop-0.23 release. We have already talked about some of the significant enhancements coming in this release, such as HDFS Federation and NextGen MapReduce, and we are excited to begin stabilizing it. Please check out this presentation for more details.

As always, this is a community effort and we are very thankful for all the contributions from the Apache Hadoop community. As the Release Manager for Apache Hadoop 0.23, I look forward to getting a great release out of the door!

Read More

An Introduction to HDFS Federation

HDFS Federation

HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling a generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of an HDFS cluster to new implementations and use cases.


Overview of Current HDFS

HDFS has two main layers:

Read More

Data Integrity and Availability in Apache Hadoop HDFS

Data integrity and availability are important for Apache Hadoop, especially for enterprises that use Apache Hadoop to store critical data.  This blog will focus on a few important questions about Apache Hadoop’s track record for data integrity and availability and provide a glimpse into what is coming in terms of automatic failover for HDFS NameNode.

What is Apache Hadoop’s Track Record for Data Integrity?

In 2009, we examined HDFS’s data integrity at Yahoo! and found that HDFS lost 650 blocks out of 329 million blocks on 10 clusters with 20,000 nodes running Apache Hadoop 0.20.3. Of the 650 lost blocks:

Read More
