HDFS metadata represents the structure of HDFS directories and files in a tree. It also includes the various attributes of directories and files, such as ownership, permissions, quotas, and replication factor. In this blog post, I’ll describe how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files. All examples shown are from testing a build of the soon-to-be-released Apache Hadoop 2.6.0.
WARNING: Do not attempt to modify metadata directories or files. Unexpected modifications can cause HDFS downtime, or even permanent data loss. This information is provided for educational purposes only.
Persistence of HDFS metadata broadly breaks down into 2 categories of files:
- fsimage – An fsimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the file system state after all modifications up to a specific transaction ID.
- edits – An edits file is a log that lists each file system change (file creation, deletion or modification) that was made after the most recent fsimage.
Checkpointing is the process of merging the content of the most recent fsimage with all edits applied after that fsimage is merged in order to create a new fsimage. Checkpointing is triggered automatically by configuration policies or manually by HDFS administration commands.
Here is an example of an HDFS metadata directory taken from a NameNode. This shows the output of running the tree command on the metadata directory, which is configured by setting dfs.namenode.name.dir in hdfs-site.xml.
│ ├── VERSION
│ ├── edits_0000000000000000001-0000000000000000007
│ ├── edits_0000000000000000008-0000000000000000015
│ ├── edits_0000000000000000016-0000000000000000022
│ ├── edits_0000000000000000023-0000000000000000029
│ ├── edits_0000000000000000030-0000000000000000030
│ ├── edits_0000000000000000031-0000000000000000031
│ ├── edits_inprogress_0000000000000000032
│ ├── fsimage_0000000000000000030
│ ├── fsimage_0000000000000000030.md5
│ ├── fsimage_0000000000000000031
│ ├── fsimage_0000000000000000031.md5
│ └── seen_txid
In this example, the same directory has been used for both fsimage and edits. Alternatively, configuration options are available that allow separating fsimage and edits into different directories. Each file within this directory serves a specific purpose in the overall scheme of metadata persistence:
- VERSION – Text file that contains:
- layoutVersion – The version of the HDFS metadata format. When we add new features that require changing the metadata format, we change this number. An HDFS upgrade is required when the current HDFS software uses a layout version newer than what is currently tracked here.
- namespaceID/clusterID/blockpoolID – These are unique identifiers of an HDFS cluster. The identifiers are used to prevent DataNodes from registering accidentally with an incorrect NameNode that is part of a different cluster. These identifiers also are particularly important in a federated deployment. Within a federated deployment, there are multiple NameNodes working independently. Each NameNode serves a unique portion of the namespace (namespaceID) and manages a unique set of blocks (blockpoolID). The clusterID ties the whole cluster together as a single logical unit. It’s the same across all nodes in the cluster.
- storageType – This is either NAME_NODE or JOURNAL_NODE. Metadata on a JournalNode in an HA deployment is discussed later.
- cTime – Creation time of file system state. This field is updated during HDFS upgrades.
- edits_start transaction ID-end transaction ID – These are finalized (unmodifiable) edit log segments. Each of these files contains all of the edit log transactions in the range defined by the file name’s through . In an HA deployment, the standby can only read up through the finalized log segments. It will not be up-to-date with the current edit log in progress (described next). However, when an HA failover happens, the failover finalizes the current log segment so that it’s completely caught up before switching to active.
- edits_inprogress__start transaction ID – This is the current edit log in progress. All transactions starting from are in this file, and all new incoming transactions will get appended to this file. HDFS pre-allocates space in this file in 1 MB chunks for efficiency, and then fills it with incoming transactions. You’ll probably see this file’s size as a multiple of 1 MB. When HDFS finalizes the log segment, it truncates the unused portion of the space that doesn’t contain any transactions, so the finalized file’s space will shrink down.
- fsimage_end transaction ID – This contains the complete metadata image up through . Each fsimage file also has a corresponding .md5 file containing a MD5 checksum, which HDFS uses to guard against disk corruption.
- seen_txid – This contains the last transaction ID of the last checkpoint (merge of edits into a fsimage) or edit log roll (finalization of current edits_inprogress and creation of a new one). Note that this is not the last transaction ID accepted by the NameNode. The file is not updated on every transaction, only on a checkpoint or an edit log roll. The purpose of this file is to try to identify if edits are missing during startup. It’s possible to configure the NameNode to use separate directories for fsimage and edits files. If the edits directory accidentally gets deleted, then all transactions since the last checkpoint would go away, and the NameNode would start up using just fsimage at an old state. To guard against this, NameNode startup also checks seen_txid to verify that it can load transactions at least up through that number. It aborts startup if it can’t.
- in_use.lock – This is a lock file held by the NameNode process, used to prevent multiple NameNode processes from starting up and concurrently modifying the directory.
In an HA deployment, edits are logged to a separate set of daemons called JournalNodes. A JournalNode’s metadata directory is configured by setting dfs.journalnode.edits.dir. The JournalNode will contain a VERSION file, multiple edits__ files and an edits_inprogress_, just like the NameNode. The JournalNode will not have fsimage files or seen_txid. In addition, it contains several other files relevant to the HA implementation. These files help prevent a split-brain scenario, in which multiple NameNodes could think they are active and all try to write edits.
- committed-txid – Tracks last transaction ID committed by a NameNode.
- last-promised-epoch – This file contains the “epoch,” which is a monotonically increasing number. When a new writer (a new NameNode) starts as active, it increments the epoch and presents it in calls to the JournalNode. This scheme is the NameNode’s way of claiming that it is active and requests from another NameNode, presenting a lower epoch, must be ignored.
- last-writer-epoch – Similar to the above, but this contains the epoch number associated with the writer who last actually wrote a transaction. (This was a bug fix for an edge case not handled by last-promised-epoch alone.)
- paxos – Directory containing temporary files used in implementation of the Paxos distributed consensus protocol. This directory often will appear as empty.
Although DataNodes do not contain metadata about the directories and files stored in an HDFS cluster, they do contain a small amount of metadata about the DataNode itself and its relationship to a cluster. This shows the output of running the tree command on the DataNode’s directory, configured by setting dfs.datanode.data.dir in hdfs-site.xml.
│ ├── BP-1079595417-192.168.2.45-1412613236271
│ │ ├── current
│ │ │ ├── VERSION
│ │ │ ├── finalized
│ │ │ │ └── subdir0
│ │ │ │ └── subdir1
│ │ │ │ ├── blk_1073741825
│ │ │ │ └── blk_1073741825_1001.meta
│ │ │ │── lazyPersist
│ │ │ └── rbw
│ │ ├── dncp_block_verification.log.curr
│ │ ├── dncp_block_verification.log.prev
│ │ └── tmp
│ └── VERSION
The purpose of these files is:
- BP-random integer-NameNode-IP address-creation time – The naming convention on this directory is significant and constitutes a form of cluster metadata. The name is a block pool ID. “BP” stands for “block pool,” the abstraction that collects a set of blocks belonging to a single namespace. In the case of a federated deployment, there will be multiple “BP” sub-directories, one for each block pool. The remaining components form a unique ID: a random integer, followed by the IP address of the NameNode that created the block pool, followed by creation time.
- VERSION – Much like the NameNode and JournalNode, this is a text file containing multiple properties, such as layoutVersion, clusterId and cTime, all discussed earlier. There is a VERSION file tracked for the entire DataNode as well as a separate VERSION file in each block pool sub-directory. In addition to the properties already discussed earlier, the DataNode’s VERSION files also contain:
- storageType – In this case, the storageType field is set to DATA_NODE.
- blockpoolID – This repeats the block pool ID information encoded into the sub-directory name.
- finalized/rbw – Both finalized and rbw contain a directory structure for block storage. This holds numerous block files, which contain HDFS file data and the corresponding .meta files, which contain checksum information. “Rbw” stands for “replica being written”. This area contains blocks that are still being written to by an HDFS client. The finalized sub-directory contains blocks that are not being written to by a client and have been completed.
- lazyPersist – HDFS is incorporating a new feature to support writing transient data to memory, followed by lazy persistence to disk in the background. If this feature is in use, then a lazyPersist sub-directory is present and used for lazy persistence of in-memory blocks to disk. We’ll cover this exciting new feature in greater detail in a future blog post.
- dncp_block_verification.log – This file tracks the last time each block was verified by checking its contents against its checksum. The last verification time is significant for deciding how to prioritize subsequent verification work. The DataNode orders its background block verification work in ascending order of last verification time. This file is rolled periodically, so it’s typical to see a .curr file (current) and a .prev file (previous).
- in_use.lock – This is a lock file held by the DataNode process, used to prevent multiple DataNode processes from starting up and concurrently modifying the directory.
Various HDFS commands impact the metadata directories
||NameNode startup automatically saves a new checkpoint. As stated earlier, checkpointing is the process of merging any outstanding edit logs with the latest fsimage, saving the full state to a new fsimage file, and rolling edits. Rolling edits means finalizing the current edits_inprogress and starting a new one.
|hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
|This saves a new checkpoint (much like restarting NameNode) while the NameNode process remains running. Note that the NameNode must be in safe mode, so all attempted write activity would fail while this is running.
|hdfs dfsadmin -rollEdits
||This manually rolls edits. Safe mode is not required. This can be useful if a standby NameNode is lagging behind the active and you want it to get caught up more quickly. (The standby NameNode can only read finalized edit log segments, not the current in progress edits file.)
|hdfs dfsadmin -fetchImage
||Downloads the latest fsimage from the NameNode. This can be helpful for a remote backup type of scenario.
Several configuration properties in hdfs-site.xml control the behavior of HDFS metadata directories.
- dfs.namenode.name.dir – Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
- dfs.namenode.edits.dir – Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. Default value is same as dfs.namenode.name.dir.
- dfs.namenode.checkpoint.period – The number of seconds between two periodic checkpoints.
- dfs.namenode.checkpoint.txns – The standby will create a checkpoint of the namespace every ‘dfs.namenode.checkpoint.txns’ transactions, regardless of whether ‘dfs.namenode.checkpoint.period’ has expired.
- dfs.namenode.checkpoint.check.period – How frequently to query for the number of uncheckpointed transactions.
- dfs.namenode.num.checkpoints.retained – The number of image checkpoint files that will be retained in storage directories. All edit logs necessary to recover an up-to-date namespace from the oldest retained checkpoint will also be retained.
- dfs.namenode.num.extra.edits.retained – The number of extra transactions which should be retained beyond what is minimally necessary for a NN restart. This can be useful for audit purposes or for an HA setup where a remote Standby Node may have been offline for some time and need to have a longer backlog of retained edits in order to start again.
- dfs.namenode.edit.log.autoroll.multiplier.threshold – Determines when an active namenode will roll its own edit log. The actual threshold (in number of edits) is determined by multiplying this value by dfs.namenode.checkpoint.txns. This prevents extremely large edit files from accumulating on the active namenode, which can cause timeouts during namenode startup and pose an administrative hassle. This behavior is intended as a failsafe for when the standby fails to roll the edit log by the normal checkpoint threshold.
- dfs.namenode.edit.log.autoroll.check.interval.ms – How often an active namenode will check if it needs to roll its edit log, in milliseconds.
- dfs.datanode.data.dir – Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. Heterogeneous storage allows specifying that each directory resides on a different type of storage: DISK, SSD, ARCHIVE or RAM_DISK.
We briefly discussed how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files, the relevant configurations that drive specific behaviors, and appropriate HDFS metadata directory commands that print out the directory tree, initiate checkpoint, and create a fsimage.
In a future blog, we’ll explore lazy persistence, a scheme to persist in-memory data to disk, in more in details.