This post covers our ongoing work on snapshots in Apache Hadoop HDFS. I will cover the motivations for the work, a high-level design, and some of the design choices we made. Having seen snapshots in use with various filesystems, I believe that adding snapshots to Apache Hadoop will be hugely valuable to the Hadoop community. With luck, this work will be available to Hadoop users in late 2012 or 2013.
A snapshot is a point-in-time image of the entire filesystem or a subtree of a filesystem. Some of the scenarios where snapshots are very useful:
- Protection against user errors: Admin sets up a process to take read-only (RO) snapshots periodically in a rolling manner so that there are always x number of RO snapshots on HDFS. If a user accidentally deletes a file, the file can be restored from the latest RO snapshot that contains the file.
- Backup: Admin wants to back up the entire file system, a subtree in the file system, or just a file. Depending on the requirements, admin takes an RO snapshot and uses this snapshot as the starting point of a full backup. Incremental backups are then taken by computing the diff between two snapshots.
- Experimental/Test setups: A user wants to test an application against the main dataset. Normally, without making a full copy of the dataset, this is a very risky proposition because the test setup can corrupt or overwrite production data. Admin creates a read-write (henceforth referred to as RW) snapshot of the production dataset and assigns the RW snapshot to the user for the experiment. Changes made to the RW snapshot are not reflected in the production dataset.
- Disaster Recovery: RO snapshots can be used to create a consistent point-in-time image for replication, which can then be copied over to a remote site for disaster recovery.
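
To make the first two scenarios concrete, here is a minimal sketch of rolling RO snapshots, restore from the latest snapshot that contains a file, and a diff between two snapshots. It is a toy in-memory model, not the HDFS design or API: `ToyFileSystem`, `take_snapshot`, `restore`, and `snapshot_diff` are hypothetical names made up for illustration, and a snapshot here is a plain full copy of a dictionary, which a real filesystem would avoid.

```python
from collections import OrderedDict


class ToyFileSystem:
    """Hypothetical in-memory stand-in for a filesystem: path -> contents."""

    def __init__(self, max_snapshots):
        self.files = {}
        self.max_snapshots = max_snapshots  # the "x" in the rolling scheme
        self.snapshots = OrderedDict()      # name -> frozen copy, oldest first

    def take_snapshot(self, name):
        # Model an RO snapshot as a copy of the namespace at this instant.
        self.snapshots[name] = dict(self.files)
        # Rolling retention: drop the oldest snapshots beyond the limit.
        while len(self.snapshots) > self.max_snapshots:
            self.snapshots.popitem(last=False)

    def restore(self, path):
        # Walk from the newest snapshot backwards until one contains the path.
        for name in reversed(self.snapshots):
            snap = self.snapshots[name]
            if path in snap:
                self.files[path] = snap[path]
                return name
        raise FileNotFoundError(path)


def snapshot_diff(old, new):
    """Return (created, deleted, modified) paths between two snapshots."""
    created = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return created, deleted, modified


fs = ToyFileSystem(max_snapshots=2)
fs.files["/data/a"] = "v1"
fs.take_snapshot("s1")
fs.files["/data/a"] = "v2"
fs.files["/data/b"] = "v1"
fs.take_snapshot("s2")

del fs.files["/data/b"]           # accidental delete by a user
print(fs.restore("/data/b"))      # name of the snapshot the file came from
print(snapshot_diff(fs.snapshots["s1"], fs.snapshots["s2"]))
```

The diff result is what an incremental backup would need to ship: paths created, deleted, or modified since the previous snapshot.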