Best Practices: Linux File Systems for HDFS

ISSUE:

Choosing the appropriate Linux file system for HDFS deployment

SOLUTION:

The Hadoop Distributed File System (HDFS) is platform independent and can run on top of any underlying file system and operating system. Linux offers a variety of file system choices, each with caveats that affect HDFS.

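As a general best practice, if you are mounting disks solely for Hadoop data, mount them with the ‘noatime’ option. This disables access-time updates, so reading a file no longer triggers a metadata write, which speeds up reads.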
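One way to audit this setting is sketched below. It is a minimal illustration, not an official Hadoop tool: it scans text in the format of /proc/mounts (device, mount point, file system type, options) and lists data mounts whose options do not include noatime. The function name and the /grid/ mount prefix are assumptions made for the example; substitute the prefix your data disks actually use.

```python
def mounts_missing_noatime(mounts_text, data_prefix="/grid/"):
    """Return mount points under data_prefix whose options lack 'noatime'.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    data_prefix is a hypothetical location for Hadoop data disks.
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if mount_point.startswith(data_prefix) and "noatime" not in options:
            missing.append(mount_point)
    return missing


# Illustrative input only; on a live host, pass the contents of /proc/mounts.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,relatime 0 0
/dev/sdb1 /grid/0 ext3 rw,noatime 0 0
/dev/sdc1 /grid/1 ext3 rw,relatime 0 0
"""

print(mounts_missing_noatime(SAMPLE_MOUNTS))
```

Any mount point the function reports can be fixed by adding noatime to the options field of its /etc/fstab entry and remounting.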

Three Linux file systems are popular choices:

  • ext3
  • ext4
  • XFS
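To find out which of these file systems currently backs a given DataNode directory, the mount table can be searched for the longest mount point that contains the path. The sketch below assumes input in /proc/mounts format; the function name and the example paths are hypothetical.

```python
import posixpath


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the mount containing path, or None.

    The deepest (longest) matching mount point wins, mirroring how the
    kernel resolves which file system a path lives on.
    """
    path = posixpath.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so /grid/0 does not match /grid/01.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


# Illustrative input only; on a live host, pass the contents of /proc/mounts.
EXAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw 0 0
/dev/sdb1 /grid/0 xfs rw,noatime 0 0
"""

print(fs_type_for("/grid/0/hdfs/data", EXAMPLE_MOUNTS))
```

Running the same lookup over every directory configured for DataNode storage gives a quick inventory of which file systems a node is using.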

Yahoo uses the ext3 file system for its Hadoop deployments. ext3 is also the default file system for many popular Linux distributions. Since HDFS on ext3 has been publicly tested on Yahoo’s cluster, it is a safe choice for the underlying file system.

ext4 is the successor to ext3 and performs better with large files. ext4 also introduced delayed allocation of data, which reduces fragmentation and improves performance but adds some risk of data loss during unplanned server outages.

XFS offers better disk space utilization and much quicker disk formatting times than ext3. This means that it is quicker to get started with a data node using XFS.

Most often, the performance of a Hadoop cluster will not be constrained by disk speed; I/O and RAM limitations will be more important. ext3 has been extensively tested with Hadoop and is currently the stable option to go with. ext4 and XFS are also worth considering, as they offer some performance benefits.
