Many organizations use traditional direct-attached storage (DAS) Hadoop clusters for storing big data. As data requirements grow, organizations are finding the traditional Hadoop storage architecture inefficient, costly, and difficult to manage.
In most Hadoop deployments, as more data is retained for longer periods, the demand for storage outstrips the demand for compute. Organizations using Hadoop need a cost-effective, easy-to-manage solution to this storage dilemma. Current solutions are inadequate:
Remove cold data – identify and manually delete old data
Add more nodes – adds unnecessary compute capacity to the cluster
The HDFS Tiered Storage solution from Dell EMC® has been validated with Hortonworks to decouple growing storage capacity from compute capacity. The validation covers extensive test cases using MapReduce, Hive, and Spark workloads, with DAS Hadoop clusters configured with default security, with Kerberos security, or with Kerberos plus Ranger HDFS and Hive policies enabled. Together, these configurations cover a majority of Hadoop deployment scenarios.
Dell EMC® Isilon® is a scale-out NAS platform with an integrated Hadoop Distributed File System (HDFS). Using HDFS as an over-the-wire protocol with Isilon, organizations can now quickly expand their Hadoop storage capacity without the need to add more compute nodes. Dell EMC Isilon easily scales to support petabytes of Hadoop data with unmatched simplicity, reliability, flexibility, and efficiency. Key benefits over DAS include:
Over 60% more storage efficiency
Up to 75% reduction in storage footprint
Automated tiering and storage performance that scales independently of compute nodes
Seeing the challenges with traditional Hadoop storage architecture, and the pace at which file-based data is increasing, Dell EMC® Isilon® has optimized its storage operating system, the OneFS® Operating System, with various HDFS performance enhancements.
Isilon OneFS HDFS Protocol optimizations include:
HDFS protocol written in C++ (increases parallel processing and performance)
Integrated Name Node Redundancy (increases NN fault tolerance and performance)
Data Node Load Balancing (increases DN fault tolerance and performance)
Web GUI Enhancements (Ranger Integration, AD/LDAP integration, and more)
To leverage Hadoop tiering with Isilon, users simply reference the remote Isilon filesystem using an HDFS path, for example,
Every node in the Isilon cluster transparently acts as a Name Node and a Data Node for its local namespace. Unlike the single active Name Node design of traditional DAS Hadoop clusters, all Name Nodes on Isilon are always active. This provides enhanced Name Node redundancy and performance for the entire Isilon HDFS cluster without the need for Name Node compute nodes, Secondary Name Nodes, Name Node HA management, and so on.
Administration is easy with Dell EMC Isilon. There is no need to modify the DAS Hadoop configuration or to configure HDFS storage policies in order to use the additional HDFS storage capacity available on Isilon. Isilon is accessible as a remote HDFS file system: users point to the Isilon HDFS path and have immediate access to all the available HDFS storage space, independent of the number of compute nodes in the DAS Hadoop cluster.
Each Isilon node boosts performance and expands the cluster's storage capacity. As storage requirements increase, simply add more Isilon nodes to increase both capacity and performance.
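As a rough illustration of what "pointing to the Isilon HDFS path" can mean in practice, the Python sketch below builds a fully qualified HDFS URI and checks whether a path refers to a filesystem other than the cluster default. The host names (`isilon.example.com`, `das-nn.example.com`), the directory, and port 8020 are placeholders assumed for illustration, not values from the validated configuration, and both helper functions are hypothetical.

```python
from urllib.parse import urlparse


def remote_hdfs_uri(host: str, path: str, port: int = 8020) -> str:
    """Build a fully qualified HDFS URI (port 8020 is an assumed default)."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return f"hdfs://{host}:{port}{path}"


def is_remote(uri: str, default_fs_host: str) -> bool:
    """Return True when the URI names an HDFS host other than the default one."""
    parsed = urlparse(uri)
    return parsed.scheme == "hdfs" and parsed.hostname not in (None, default_fs_host)


# Hypothetical host and directory, used only for illustration.
uri = remote_hdfs_uri("isilon.example.com", "/tiered/sales/2016")
print(uri)
print(is_remote(uri, "das-nn.example.com"))
print(is_remote("/user/hive/warehouse", "das-nn.example.com"))
```

A path without a scheme and host resolves against the cluster's default filesystem, which is why only the fully qualified form reaches the remote tier.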
Hive is a key component of Hadoop. Hive provides the metadata that can organize countless directories and files into tables and columns that can be queried using standard SQL. Hive also provides a SQL engine that executes a SQL query by converting it into a series of MapReduce or Tez jobs and then running those jobs. Additionally, other applications such as Spark and HBase use the metadata services provided by Hive to organize files into tables but do their own query processing.
The Dell EMC® Isilon® HDFS tiering solution allows for a common Hive Metastore across both the DAS and Isilon clusters. There is no need to maintain separate Metastores with Dell EMC Isilon HDFS tiering. By creating external databases, tables, or partitions that specify Isilon as the remote filesystem location in Hive, users can transparently access remote data on Isilon. This is a powerful use case: external Hadoop users do not have to change any client-side configurations or path statements, because Hive directs the traffic based on the location information specified in the Metastore. With our new Gen 6 Isilon nodes, performance can even be faster than DAS, as shown in the TPC-DS benchmark results below:
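As a sketch of the external-table approach described in this section, the Python helper below assembles a Hive `CREATE EXTERNAL TABLE` statement whose `LOCATION` clause points at a remote HDFS path. The table name, columns, ORC storage format, and the `isilon.example.com` URI are all hypothetical, chosen only to show the shape of the statement. The helper itself is illustrative and is not part of the validated solution.

```python
def external_table_ddl(table, columns, location):
    """Assemble a Hive DDL statement for an external table at a given location.

    columns is a list of (name, hive_type) pairs. ORC is an assumed format.
    """
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED AS ORC LOCATION '{location}'"
    )


# Hypothetical table and remote path, for illustration only.
ddl = external_table_ddl(
    "archive.sales_2015",
    [("order_id", "BIGINT"), ("amount", "DECIMAL(10,2)")],
    "hdfs://isilon.example.com:8020/tiered/sales/2015",
)
print(ddl)
```

Because the location lives in the Metastore entry, a query against `archive.sales_2015` would need no path of its own. That is the property the text relies on.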
Versions & Models Tested
HDP v 2.6.3
OneFS v 18.104.22.168 (Gen 5), OneFS 22.214.171.124 (Gen 6)
o Existing customers can download OneFS from: https://support.emc.com
o OneFS Simulator also available at:
Isilon Models Tested
o Gen 6
- Isilon H600-4U-Single-256GB-1x1GE-2x40GE SFP+-36TB-6554GB SSD
o Gen 5
- Isilon X410-4U-Dual-256GB-2x1GE-2x10GE SFP+-96TB-3277GB SSD
A high-level reference architecture of Hadoop tiered storage with Isilon is shown below. This reference architecture provides hot tier data in high-throughput, low-latency local storage and cold tier data in capacity-dense remote storage. You can deploy the Hadoop cluster on physical hardware servers or on a virtualization platform. Data is accessible via any HDFS application, e.g. Hive, DistCP, Spark, MapReduce, etc.
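One way to picture the hot and cold tiers is a small Python sketch that selects directories not used within a cutoff period and builds a `hadoop distcp` command line to copy each one from the DAS filesystem to the remote filesystem. The age threshold, directory names, last-used dates, and both filesystem URIs are assumptions made up for illustration, and the sketch only constructs the command without running it.

```python
from datetime import date, timedelta


def cold_paths(last_used, today, max_age_days):
    """Return the paths whose last-used date is older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return [path for path, used in last_used.items() if used < cutoff]


def distcp_command(path, src_fs, dst_fs):
    """Build the argument list for copying one path between two filesystems."""
    return ["hadoop", "distcp", f"{src_fs}{path}", f"{dst_fs}{path}"]


# Hypothetical usage data and filesystem URIs.
last_used = {
    "/data/sales/2015": date(2016, 1, 10),
    "/data/sales/2018": date(2018, 3, 1),
}
for path in cold_paths(last_used, today=date(2018, 4, 1), max_age_days=365):
    cmd = distcp_command(
        path, "hdfs://das-nn.example.com:8020", "hdfs://isilon.example.com:8020"
    )
    print(" ".join(cmd))
```

Keeping the same directory path on both tiers is a design choice of this sketch, made so that only the filesystem prefix changes when data moves to the cold tier.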
Each Isilon node includes, at a minimum, dual 10G interfaces for the access network and dual InfiniBand interfaces for a private data interconnect. Newer Gen 6 models (H and A Series) may be upgraded to dual 40G interfaces for the access network and 40G interfaces for the private data interconnect. In any configuration, high-speed redundant network connectivity is a key design aspect of the Isilon Scale-Out Hadoop tiering solution.
Isilon Scale-Out NAS Model Options
Isilon Hybrid Nodes (Recommended for Hadoop Tiering)
If you are interested in reading more, check out the HDFS Tiering Solution Guide covering both Isilon and ECS at Hortonworks.com.