The Hadoop Distributed File System (HDFS) is the reliable and scalable data storage core of the Hortonworks Data Platform (HDP). In HDP, HDFS and YARN combine to form the distributed operating system for your data platform, providing resource management for diverse workloads and scalable data storage for the next generation of analytical applications.
In this blog post, we describe the key concepts introduced by Heterogeneous Storage in HDFS and how they are used to enable key tiered storage scenarios.
In HDP 2.1, we introduced Phase I of Heterogeneous Storage into HDFS, laying the groundwork for specifying different storage types in an HDP cluster. With HDP 2.2, HDFS builds on this groundwork and can now use heterogeneous storage media within the HDFS cluster to enable three tiered storage scenarios, each described below: an Archival storage tier for cold datasets, an SSD storage tier for workloads that need higher throughput, and a memory storage tier for temporary or re-generatable data.
Each DataNode in HDFS is configured with a set of disk volumes that serve as the storage mount points on which HDFS files are persisted. With HDP 2.2, administrators can tag each volume with a Storage Type that identifies the characteristics of the storage media backing the volume. For example, one mounted volume may be designated as archival storage and another as flash storage.
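To make the tagging concrete, here is a minimal sketch that builds a tagged value for the `dfs.datanode.data.dir` property in `hdfs-site.xml`. The bracketed-prefix format and the `DISK` type name follow our reading of the Apache Hadoop documentation; the helper name `tagged_data_dirs` and the mount points are hypothetical, so verify the result against your own release before using it.

```python
from typing import Iterable, Tuple

# Storage type names; DISK is assumed to be the default for untagged volumes.
VALID_TYPES = {"DISK", "SSD", "ARCHIVE", "RAM_DISK"}


def tagged_data_dirs(volumes: Iterable[Tuple[str, str]]) -> str:
    """Build a comma-separated dfs.datanode.data.dir value with storage type tags."""
    parts = []
    for storage_type, mount_point in volumes:
        if storage_type not in VALID_TYPES:
            raise ValueError(f"unknown storage type: {storage_type}")
        # Each entry is "[TYPE]" followed by the volume URI.
        parts.append(f"[{storage_type}]file://{mount_point}")
    return ",".join(parts)


# Hypothetical mount points for one DataNode.
value = tagged_data_dirs([
    ("DISK", "/grid/dn/disk0"),
    ("SSD", "/grid/dn/ssd0"),
    ("ARCHIVE", "/grid/dn/archive0"),
])
print(value)
```

The printed string is what would go into the property value for that DataNode.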
A Storage Policy defines how HDFS persists the block replicas of a file, including the desired Storage Type(s) for each replica. Storage Policies allow for fallback strategies: if the desired Storage Type is out of space, a fallback Storage Type is used to store the file blocks. Policies are applied to directories and cover all files within them.
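The fallback behavior can be illustrated with a toy model. This is not how HDFS implements block placement; it only shows the idea of a preferred Storage Type per replica with an ordered fallback list. The `StoragePolicy` class, the `place_replicas` function, the notion of "free slots", and the fallback lists in the table are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class StoragePolicy:
    name: str
    preferred: Tuple[str, ...]  # desired type per replica; the last entry repeats
    fallbacks: Tuple[str, ...]  # tried in order when the desired type is full


# Illustrative table only; the fallback choices here are assumptions.
POLICIES = {
    "ALL_SSD": StoragePolicy("ALL_SSD", ("SSD",), ("DISK",)),
    "ONE_SSD": StoragePolicy("ONE_SSD", ("SSD", "DISK"), ("SSD", "DISK")),
    "COLD": StoragePolicy("COLD", ("ARCHIVE",), ()),
}


def place_replicas(policy: StoragePolicy, replication: int,
                   free_slots: Dict[str, int]) -> List[Optional[str]]:
    """Pick a storage type for each replica; None means it could not be placed."""
    free = dict(free_slots)  # work on a copy so the caller's dict is untouched
    placement: List[Optional[str]] = []
    for i in range(replication):
        wanted = policy.preferred[min(i, len(policy.preferred) - 1)]
        for candidate in (wanted, *policy.fallbacks):
            if free.get(candidate, 0) > 0:
                free[candidate] -= 1
                placement.append(candidate)
                break
        else:
            placement.append(None)  # neither the desired type nor a fallback had space
    return placement


print(place_replicas(POLICIES["ONE_SSD"], 3, {"SSD": 5, "DISK": 5}))
print(place_replicas(POLICIES["ALL_SSD"], 3, {"SSD": 1, "DISK": 5}))
```

In the second call only one SSD slot is free, so the remaining replicas fall back to the next type in the list.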
Storage Policies can be enforced at file creation and at any point during the lifetime of a file. For files whose Storage Policy has changed since they were written, HDFS introduces a new tool called Mover, which can be run periodically to migrate files across the cluster to the Storage Types their Storage Policies require.
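As a rough sketch of the kind of comparison a migration tool has to make, the snippet below pairs replicas that sit on the wrong Storage Type with the types the policy still lacks. It is a simplified model, not the Mover's actual algorithm, and the function name `pending_moves` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def pending_moves(current: List[str], desired: List[str]) -> List[Tuple[str, str]]:
    """Return (from_type, to_type) pairs needed to satisfy the policy."""
    # Counter subtraction keeps only positive counts.
    surplus = Counter(current) - Counter(desired)  # replicas on types the policy does not want
    missing = Counter(desired) - Counter(current)  # types that still need a replica
    return list(zip(sorted(surplus.elements()), sorted(missing.elements())))


# A block written to spinning disk whose directory was later set to 'Cold'.
print(pending_moves(["DISK", "DISK", "DISK"], ["ARCHIVE", "ARCHIVE", "ARCHIVE"]))
# A block that already satisfies its policy needs no moves.
print(pending_moves(["SSD", "DISK", "DISK"], ["SSD", "DISK", "DISK"]))
```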
Over the lifetime of a dataset, the frequency with which processing workloads read it decreases, until the dataset is deemed ‘cold’. As the amount of data under storage grows, there is a need to optimize the storage of such ‘cold’ datasets. An Archival storage tier, consisting of nodes with slow-spinning, high-density storage drives and low compute power, provides cost efficiency for storing these cold datasets.
HDP 2.2 introduces an ‘ARCHIVE’ Storage Type and related Storage Policies – ‘Hot’, ‘Warm’, ‘Cold’.
All disk volumes in the Archival storage tier nodes are tagged with the ‘ARCHIVE’ storage type. Administrators can then apply the ‘Cold’ Storage Policy to datasets that need to be stored on the Archival storage tier nodes. Since the ‘Cold’ Storage Policy is applied after the dataset has been created, the policy will be enforced when the HDFS Mover tool is run.
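The two administrative steps above (apply the policy, then run the Mover) can be sketched as a pair of command lines. The snippet only builds and prints them; it does not run anything. The `hdfs storagepolicies` and `hdfs mover` invocations follow our reading of the Apache Hadoop documentation, and some releases expose policy setting through `hdfs dfsadmin -setStoragePolicy` instead, so check the help output on your cluster. The path `/data/logs/2013` and the helper name `archive_commands` are hypothetical.

```python
import shlex
from typing import List


def archive_commands(path: str) -> List[List[str]]:
    """Build the argv lists for tagging a directory as Cold and migrating its blocks."""
    return [
        # Step 1: apply the Cold policy to the directory (assumed CLI form).
        ["hdfs", "storagepolicies", "-setStoragePolicy", "-path", path, "-policy", "COLD"],
        # Step 2: run the Mover over the same path so existing blocks are migrated.
        ["hdfs", "mover", "-p", path],
    ]


for argv in archive_commands("/data/logs/2013"):
    print(shlex.join(argv))
```

Keeping the commands as argument lists makes it straightforward to hand them to `subprocess.run` from a scheduled job once they have been verified by hand.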
Solid State Drives (SSDs) provide higher read/write throughput and more IO operations per second than spinning hard disk drives, with the tradeoff of a higher cost per GB.
HDP 2.2 introduces an ‘SSD’ Storage Type and related Storage Policies: ‘All_SSD’ and ‘One_SSD’.
Each SSD used as a DataNode storage volume is tagged with the ‘SSD’ storage type. When data applications elect ‘All_SSD’, all block replicas are stored only on SSD volumes. With ‘One_SSD’, one block replica is written to an SSD volume while the other block replicas are written to spinning disk volumes in the cluster. The SSD storage policies are enforced when the file is created in HDFS.
For applications that need to write data that are temporary or re-generatable, memory (RAM) is an appropriate storage medium that provides low latency for reads and writes. Since memory is a volatile storage medium, data written to the memory tier will be asynchronously persisted to disk.
HDP 2.2 introduces, as a Technical Preview, the ‘RAM_DISK’ Storage Type and ‘LAZY_PERSIST’ Storage Policy.
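The write-to-memory, persist-later idea can be shown with a small analogy. This is not the HDFS implementation; it is a minimal sketch of asynchronous (write-behind) persistence in which a write returns once the data is in memory and a background thread copies it to disk. The class `LazyPersistStore` and its method names are hypothetical.

```python
import queue
import tempfile
import threading
from pathlib import Path


class LazyPersistStore:
    """Toy store: writes land in memory first and reach disk asynchronously."""

    def __init__(self, disk_dir: str) -> None:
        self.memory = {}
        self.disk_dir = Path(disk_dir)
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._persist_loop, daemon=True)
        worker.start()

    def write(self, name: str, data: bytes) -> None:
        self.memory[name] = data  # low-latency path: the caller returns immediately
        self._pending.put(name)   # schedule the disk copy for later

    def read(self, name: str) -> bytes:
        if name in self.memory:
            return self.memory[name]
        return (self.disk_dir / name).read_bytes()

    def _persist_loop(self) -> None:
        while True:
            name = self._pending.get()
            (self.disk_dir / name).write_bytes(self.memory[name])
            self._pending.task_done()

    def wait_until_persisted(self) -> None:
        self._pending.join()  # block until every queued write is on disk


with tempfile.TemporaryDirectory() as tmp:
    store = LazyPersistStore(tmp)
    store.write("scratch.bin", b"intermediate result")
    print(store.read("scratch.bin"))  # served from memory
    store.wait_until_persisted()
    print((Path(tmp) / "scratch.bin").exists())
```

Data written just before a crash may not have reached disk yet, which is why this pattern suits temporary or re-generatable data.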
With this scenario, we are making headway towards our vision of enabling Hadoop to take advantage of large amounts of memory to speed up processing for a wide variety of workloads.
With HDP 2.2, HDFS offers enhanced performance and cost efficiency for diverse workloads by intelligently utilizing heterogeneous storage media in the cluster.