HDFS

A distributed Java-based file system for storing large volumes of data

HDFS and YARN form the data management layer of Apache Hadoop. YARN is the architectural center of Hadoop, the resource management framework that enables the enterprise to process data in multiple ways simultaneously—for batch, interactive and real-time data workloads on one shared dataset. YARN provides the resource management and HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data.

HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.

In HDP 2.2, the rolling upgrade feature and the underlying HDFS High Availability configuration enable Hadoop operators to upgrade the cluster software and restart upgraded services, without taking the entire cluster down.
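
Operators normally drive a rolling upgrade with the hdfs dfsadmin -rollingUpgrade prepare/query/finalize commands; the upgrade state can also be checked from Java. The following is a minimal sketch, assuming the DistributedFileSystem#rollingUpgrade call that backs those commands in Hadoop 2.4 and later; it is not a complete upgrade procedure.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants.RollingUpgradeAction;
    import org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo;

    public class RollingUpgradeStatusCheck {
      public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and the HA settings from core-site.xml/hdfs-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs instanceof DistributedFileSystem) {
          DistributedFileSystem dfs = (DistributedFileSystem) fs;
          // QUERY only reports status; PREPARE and FINALIZE are the actions
          // an operator issues at the start and end of the upgrade window.
          RollingUpgradeInfo info = dfs.rollingUpgrade(RollingUpgradeAction.QUERY);
          System.out.println(info == null
              ? "No rolling upgrade in progress"
              : "Rolling upgrade in progress: " + info);
        }
      }
    }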

Hortonworks Focus for HDFS

Since its first deployment at Yahoo in 2006, HDFS has established itself as the de facto scalable, reliable and robust file system for big data. Over that time, it has addressed several fundamental challenges of distributed storage at unparalleled scale and with enterprise rigor.

The Apache community continues innovating. For example, a new initiative called Ozone introduces an object store, which extends HDFS beyond a file system toward a more versatile, object-enabled enterprise storage layer for use cases such as storing all the photos uploaded to Facebook or all the email attachments in Gmail.

Fortune 1000 CIOs see the Hadoop cluster’s storage and compute resources as valuable infrastructure for running both Hadoop and non-Hadoop applications and services. This emerging trend of PaaS-on-Hadoop, propelled by YARN, opens up the Hadoop infrastructure to new use cases. An object store is a natural fit for the storage component of a PaaS model, and the Apache community is thrilled to work on adding these new capabilities to HDFS and Apache Hadoop.

The Apache Hadoop HDFS team is working on the following improvements:

Focus | Planned Enhancements
Reliable and Secure Operations
  • Transparent data-at-rest encryption to comply with privacy and security regulations, for example:
    • HIPAA regulations in healthcare,
    • PCI DSS regulations in the financial services industry, or
    • FISMA regulations in the US government
  • Credential provider — the client-server protocol will provide three types of security:
    • Authentication via Kerberos HTTP SPNEGO
    • Confidentiality and integrity via HTTPS for transport
    • Authorization via Hadoop ACLs
Scalability and Efficiency
  • Erasure coding
  • Scale NameNode metadata
Support for Heterogeneous Hardware
  • Support for APIs to expose storage types
  • Support for storage in memory

Recent Progress in HDFS

Hadoop Version | Progress
2.5.0
  • Support for incremental data copy of files with the same name but different lengths
  • Support for extended attributes (XAttrs) for associating metadata with files (see the API sketch after this table)
2.4.0
  • Support for rolling upgrades
  • Support for Access Control Lists (ACLs), also covered in the sketch after this table
  • Metadata compatibility through FSImage protobuf usage
2.3.0
  • Support for heterogeneous storage
  • Centralized cache management
  • Filesystem implementation for OpenStack Swift
  • HTTPS support for HDFS protocols (WebHDFS, HttpFS, web UIs)
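
Two items above, ACLs (2.4.0) and extended attributes (2.5.0), surface directly in the public FileSystem API. Below is a minimal Java sketch, assuming a cluster reachable through the default configuration; the path, user name, and attribute value are hypothetical.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.AclEntry;
    import org.apache.hadoop.fs.permission.AclEntryScope;
    import org.apache.hadoop.fs.permission.AclEntryType;
    import org.apache.hadoop.fs.permission.FsAction;

    public class AclAndXAttrExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/reports/q3.csv"); // hypothetical path

        // Grant an extra user read access via an ACL entry (Hadoop 2.4.0+).
        AclEntry entry = new AclEntry.Builder()
            .setScope(AclEntryScope.ACCESS)
            .setType(AclEntryType.USER)
            .setName("analyst")            // hypothetical user
            .setPermission(FsAction.READ)
            .build();
        fs.modifyAclEntries(path, Arrays.asList(entry));

        // Attach user-defined metadata via an extended attribute (Hadoop 2.5.0+).
        fs.setXAttr(path, "user.origin", "nightly-etl".getBytes(StandardCharsets.UTF_8));
        byte[] value = fs.getXAttr(path, "user.origin");
        System.out.println(new String(value, StandardCharsets.UTF_8));
      }
    }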

What HDFS Does

HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at any scale.

These specific features ensure that data is stored efficiently in a Hadoop cluster and that it is highly available:

Feature | Description
Rack awareness | Considers a node’s physical location when allocating storage and scheduling tasks
Minimal data motion | Hadoop moves compute processes to the data on HDFS, not the other way around. Processing tasks can run on the physical node where the data resides, which significantly reduces network I/O and provides very high aggregate bandwidth.
Utilities | Dynamically diagnose the health of the file system and rebalance the data across nodes
Rollback | Allows operators to return to the previous version of HDFS after an upgrade, in case of human or systemic errors
Standby NameNode | Provides redundancy and supports high availability (HA)
Operability | Requires minimal operator intervention, allowing a single operator to maintain a cluster of thousands of nodes

How HDFS Works

An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas.

The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system on the DataNodes.
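
To make the block layout concrete: both the block size and the replication factor can be set per file at creation time, and the resulting block placement can then be read back from the NameNode. A minimal sketch using the public FileSystem API; the path and sizes are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLayoutExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/blocks-demo.bin"); // hypothetical path

        // Create the file with a 128 MB block size and 3 replicas per block.
        try (FSDataOutputStream out = fs.create(
            path, true /* overwrite */, 4096 /* buffer size */,
            (short) 3 /* replication */, 128L * 1024 * 1024 /* block size */)) {
          out.write(new byte[1024]);
        }

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(path);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println(loc); // offset, length, and replica hosts
        }
      }
    }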

The NameNode actively monitors the number of replicas of each block. When a replica is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
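
The same machinery reacts to configuration changes as well as failures: raising a file’s replication factor leaves its blocks under-replicated, and the NameNode schedules the extra copies. A minimal sketch (the path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/blocks-demo.bin"); // hypothetical path

        // Change the target replica count; the NameNode detects that blocks
        // are now under-replicated and schedules copies on other DataNodes.
        boolean accepted = fs.setReplication(path, (short) 5);
        System.out.println("Replication change accepted: " + accepted);
      }
    }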

The NameNode does not directly send requests to DataNodes. Instead, it sends instructions by replying to heartbeats sent by those DataNodes (the pattern is sketched after this list). The instructions include commands to:

  • replicate blocks to other nodes,
  • remove local block replicas,
  • re-register and send an immediate block report, or
  • shut down the node.
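
To illustrate the shape of this reply-driven protocol, the sketch below models a NameNode-side handler that piggybacks pending commands on each heartbeat response. All of the types here are hypothetical; the real Hadoop internals differ.

    import java.util.List;

    // Illustrative sketch only: hypothetical types mirroring the
    // heartbeat-reply pattern described above.
    enum CommandType { REPLICATE_BLOCKS, REMOVE_REPLICAS, RE_REGISTER, SHUT_DOWN }

    final class DataNodeCommand {
      final CommandType type;
      final List<String> blockIds; // empty for RE_REGISTER and SHUT_DOWN
      DataNodeCommand(CommandType type, List<String> blockIds) {
        this.type = type;
        this.blockIds = blockIds;
      }
    }

    interface PendingCommandQueue {
      // Commands queued for a DataNode since its last heartbeat.
      List<DataNodeCommand> drainFor(String dataNodeId);
    }

    final class HeartbeatHandler {
      private final PendingCommandQueue pending;

      HeartbeatHandler(PendingCommandQueue pending) { this.pending = pending; }

      // The NameNode never opens a connection to a DataNode on its own;
      // any pending instructions ride back on the heartbeat response.
      List<DataNodeCommand> onHeartbeat(String dataNodeId) {
        return pending.drainFor(dataNodeId);
      }
    }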

Try these Tutorials

Try HDFS with Sandbox

Hortonworks Sandbox is a self-contained virtual machine with HDP running alongside a set of hands-on, step-by-step Hadoop tutorials.

Get Sandbox

View Past Webinars

Discover HDP 2.2: Data Storage Innovations in Hadoop Distributed File System (HDFS)
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN and HDFS