HDFS 2.0 Next Generation Architecture
The Hadoop Distributed File System is the reliable and scalable data core of the Hortonworks Data Platform. In HDP 2.0, YARN + HDFS combine to form the distributed operating system for your Data Platform, providing resource management and scalable data storage to the next generation of analytical applications.
Over the past six months, HDFS has added a slew of major features covering Enterprise Multi-tenancy, Business Continuity Processing and Enterprise Integration:
- Enabled automated failover with a hot standby and full stack resiliency for the NameNode master service
- Added enterprise standard NFS read/write access to HDFS
- Enabled point in time recovery with Snapshots in HDFS
- Added wire encryption for the HDFS Data Transfer Protocol
Looking forward, there are evolving patterns in Data Center infrastructure and Analytical applications that are driving the evolution of HDFS.
Evolving Needs – Applications and Infrastructure
With YARN in HDP 2.0, new applications are emerging that will execute on the same Hadoop cluster against data in HDFS. This range of applications has different data access patterns and requirements, going beyond just batch.
Data Center infrastructure is evolving to include a variety of storage devices across the nodes in a Hadoop cluster. This is a shift from prior Hadoop cluster design, where all disks in each node were treated equally – JBODs attached to each DataNode. There is a need to take advantage of all storage and memory hardware: spinning disks, solid-state drives, RAM and external storage.
Enabling the Next Generation of Hadoop Applications
To enable these new applications, HDFS is evolving to take advantage of the emerging variety of hardware options available in the Data Center infrastructure.
The cluster system administrator will be able to configure the storage media available on each node. HDFS will then allow datasets to be given a storage tier preference. Applications will be able to specify a storage medium preference when creating files that suits the application's read workloads.
For example, HBase can request that its data files (HFiles) be stored on SSD. Then when HBase reads from and writes to HDFS, these requests will hit SSD, meeting the latency requirements that HBase needs to support near-real-time applications.
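Conceptually, storage-type-aware block placement works like this. The sketch below is a hypothetical Python model of the idea, not the actual HDFS implementation; the names `StorageType`, `Volume` and `choose_target` are illustrative assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum

# Hypothetical storage media tiers, mirroring those described above.
class StorageType(Enum):
    DISK = "disk"
    SSD = "ssd"
    RAM = "ram"

@dataclass
class Volume:
    storage_type: StorageType
    free_bytes: int

@dataclass
class DataNode:
    name: str
    volumes: list = field(default_factory=list)

def choose_target(nodes, preference, block_size):
    """Pick a node with a volume matching the file's storage preference,
    falling back to spinning disk when the preferred tier has no room."""
    for wanted in (preference, StorageType.DISK):
        for node in nodes:
            for vol in node.volumes:
                if vol.storage_type == wanted and vol.free_bytes >= block_size:
                    vol.free_bytes -= block_size  # reserve space for the block
                    return node, wanted
    raise RuntimeError("no volume has space for the block")

cluster = [
    DataNode("dn1", [Volume(StorageType.DISK, 10**12)]),
    DataNode("dn2", [Volume(StorageType.SSD, 10**9),
                     Volume(StorageType.DISK, 10**12)]),
]

# An HBase-like client asks that a 128 MB HFile block land on SSD.
node, medium = choose_target(cluster, StorageType.SSD, 128 * 1024**2)
```

Here the request is satisfied by dn2, the only node with an SSD volume; a request for a tier no node offers would quietly fall back to disk rather than fail, which matches the idea of a *preference* rather than a hard requirement.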
For details on the work that is in progress for Tiered Storage support in HDFS, feel free to check out HDFS-2832.
To support the interactive workloads required by the Stinger Initiative, HDFS is evolving to support coordinated caching of datasets. Users and applications (such as Hive, Pig or HBase) will be able to identify a set of files that need to be cached. For example, dimension tables in Hive can be configured for caching in DataNode RAM, enabling quick reads for Hive queries against these frequently looked-up tables.
After a user or application requests caching for a file, the NameNode coordinates with every DataNode hosting one of that file’s blocks to pin the data into RAM as part of the DataNode process’s working set. This prevents page faults that can harm performance while clients attempt to read the block. HDFS will expose whether or not a block is cached on a particular DataNode. This enables enhancements in the MapReduce scheduler to consider placing map tasks on nodes that already have the data in memory.
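The coordination described above can be sketched as follows. This is a simplified, hypothetical Python model of the protocol, not HDFS code; `pin`, `cache_file` and `cached_locations` are illustrative names, and real DataNodes would pin pages with OS-level memory locking rather than a Python set:

```python
from collections import defaultdict

class DataNode:
    def __init__(self, name):
        self.name = name
        self.pinned = set()  # block IDs held in RAM (stand-in for mlock'd pages)

    def pin(self, block_id):
        self.pinned.add(block_id)

class NameNode:
    def __init__(self):
        self.block_locations = defaultdict(list)  # block_id -> [DataNode]
        self.file_blocks = defaultdict(list)      # path -> [block_id]

    def add_block(self, path, block_id, nodes):
        self.file_blocks[path].append(block_id)
        self.block_locations[block_id].extend(nodes)

    def cache_file(self, path):
        # Coordinate with every DataNode hosting one of the file's blocks,
        # asking each to pin its replica into memory.
        for block_id in self.file_blocks[path]:
            for dn in self.block_locations[block_id]:
                dn.pin(block_id)

    def cached_locations(self, block_id):
        # Expose which replicas are in memory, so a scheduler can
        # prefer placing tasks on nodes that already hold the data in RAM.
        return [dn.name for dn in self.block_locations[block_id]
                if block_id in dn.pinned]

dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
nn = NameNode()
nn.add_block("/warehouse/dim_dates", "blk_1", [dn1, dn2])
nn.add_block("/warehouse/dim_dates", "blk_2", [dn2, dn3])
nn.cache_file("/warehouse/dim_dates")
```

After `cache_file` runs, `nn.cached_locations("blk_1")` reports both replicas as in-memory, which is exactly the signal a MapReduce scheduler could use for memory-locality-aware task placement.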
HDFS-4949 captures the design details and progress for HDFS Caching.
You can find more about HDP 2.0 here.