Community-driven Snapshots for HDFS – Part Two

This blog is a follow-up to our previous blog, "Snapshots for HDFS."

In June we posted an early prototype of snapshots that allowed us to experiment with a few ideas in HDFS-2802. Since then we have added more detail to the design document and made significant progress on a brand-new implementation (over 40 subtasks in HDFS-2802).

Some of the highlights of this new design include:

  • Read-only copy-on-write (COW) snapshots (can be extended to read-write later)
  • Snapshots of the entire namespace or of subdirectories
  • Snapshots are managed by the administrator, but users are allowed to take snapshots
  • Snapshots are efficient:
    • Creation is instantaneous, with O(1) cost.
    • Additional memory is used only when modifications are made relative to a snapshot; memory usage is O(M), where M is the number of modified files/directories.
    • Snapshots do not adversely affect regular HDFS operations.
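The copy-on-write idea behind these cost bounds can be illustrated with a toy sketch (this is not the actual HDFS code, and the class and method names here are purely hypothetical): creating a snapshot records only an empty marker in O(1), and old state is retained lazily, only for paths modified after the snapshot was taken, giving O(M) extra memory.

```python
# Toy sketch of copy-on-write snapshots (hypothetical names, not HDFS internals).
class Namespace:
    def __init__(self):
        self.files = {}      # live view: path -> contents
        self.snapshots = {}  # snapshot name -> {path: pre-modification contents}

    def create_snapshot(self, name):
        # O(1): nothing is copied at creation time, only a marker is recorded.
        self.snapshots[name] = {}

    def write(self, path, contents):
        # Copy-on-write: before overwriting, preserve the old value in every
        # snapshot that has not yet recorded this path. Memory grows only
        # with the number of modified paths (O(M)).
        for diff in self.snapshots.values():
            if path not in diff:
                diff[path] = self.files.get(path)
        self.files[path] = contents

    def read(self, path, snapshot=None):
        if snapshot is not None:
            diff = self.snapshots[snapshot]
            if path in diff:
                return diff[path]  # value as of snapshot-creation time
        return self.files.get(path)  # unmodified paths fall through to the live view

ns = Namespace()
ns.write("/user/a.txt", "v1")
ns.create_snapshot("s0")          # instantaneous, no data copied
ns.write("/user/a.txt", "v2")     # only now is the old value retained for s0
print(ns.read("/user/a.txt"))                 # v2 (live view)
print(ns.read("/user/a.txt", snapshot="s0"))  # v1 (snapshot view)
```

The sketch also shows why regular operations are barely affected: reads of the live view never consult snapshot state, and writes pay only a small per-snapshot bookkeeping cost.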

An initial implementation of snapshots, with some tests, is already complete. We are now working on improvements, new snapshot tools, and additional tests.

The major work-in-progress items are:

  • A solution based on persistent data structures, for efficient snapshot creation and memory usage
  • Snapshot diff tool
  • Restore/rollback snapshots
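To make the planned diff tool concrete, here is a hedged sketch of what such a tool fundamentally computes (the function name and dict-based representation are assumptions for illustration, not the actual tool's interface): given the listings of two snapshots, report which paths were created, deleted, or modified between them.

```python
# Hypothetical sketch of a snapshot diff: compare two snapshot listings,
# each a dict mapping path -> contents (or a checksum standing in for them).
def snapshot_diff(older, newer):
    created = sorted(set(newer) - set(older))    # paths only in the newer snapshot
    deleted = sorted(set(older) - set(newer))    # paths only in the older snapshot
    modified = sorted(p for p in set(older) & set(newer) if older[p] != newer[p])
    return {"created": created, "deleted": deleted, "modified": modified}

s1 = {"/a": "x", "/b": "y"}           # older snapshot
s2 = {"/b": "y2", "/c": "z"}          # newer snapshot
print(snapshot_diff(s1, s2))
# {'created': ['/c'], 'deleted': ['/a'], 'modified': ['/b']}
```

A restore/rollback tool is conceptually the inverse: apply such a diff in reverse to bring the live namespace back to the snapshot's state.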

Meetup at Hortonworks: rallying the community
We recently held a Meetup at the Hortonworks office, where over 30 people attended to discuss the design and some of the features in detail. A wide range of topics was covered, from snapshot usage by HBase, administration aspects of snapshots, and the overhead of creating and maintaining snapshots, to lower-level details such as the length of files that are open for writing. We had representation from HDFS developers, HBase developers, and engineers with deep experience in managing Hadoop and other storage systems. We thank the community for the valuable discussion and feedback on the feature requirements and the open questions.

