Hortonworks Knowledgebase
HOWTO: Test HDFS Setup
ISSUE
How do I run simple Hadoop Distributed File System (HDFS) tasks, or test that HDFS services are working?
SOLUTION
Make sure the name node and the data nodes are started.
To start the name node:
su - hdfs -c "hadoop-daemon.sh --config /etc/hadoop start namenode"
To start a data node:
su - hdfs -c "hadoop-daemon.sh --config /etc/hadoop start datanode"
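To confirm that the daemons actually came up, one option is to look for them in the output of jps, the JDK tool that lists running Java processes by main class name. The following is a minimal sketch, not part of the original article: the function name daemon_running is hypothetical, and it assumes jps is on the PATH and that the script runs as the user that owns the daemons (hdfs here).

```shell
# Succeed if a Java process whose main class is exactly "$1" shows up in jps.
# Assumption: run as the daemon owner (e.g. hdfs); jps only lists JVMs that
# the current user is allowed to inspect.
daemon_running() {
    jps 2>/dev/null | awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Usage (not executed here):
#   daemon_running NameNode && echo "name node is up"
#   daemon_running DataNode && echo "data node is up"
```

The exact match on the class name keeps a SecondaryNameNode process from being mistaken for the NameNode.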
Put data files into HDFS. This command takes a file from the local disk and copies it into HDFS:
su - hdfs -c "hadoop fs -put trial_file.csv /user/hdfs/trial_file.csv"
Read data from HDFS. This command reads the contents of a file from HDFS and displays it on the console:
su - hdfs -c "hadoop fs -cat /user/hdfs/trial_file.csv"
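The put and cat steps can be combined into a single round-trip check that writes a file and verifies that the same bytes come back. This is only a sketch under stated assumptions: the function name hdfs_roundtrip is hypothetical, and it assumes the current user is already allowed to write to the destination directory (for example, because it runs as hdfs).

```shell
# Copy a local file into HDFS, read it back, and compare it byte for byte.
# $1 = local source file, $2 = destination path in HDFS.
# Returns non-zero if the put fails or the contents differ.
hdfs_roundtrip() {
    src="$1"
    dest="$2"
    hadoop fs -put "$src" "$dest" || return 1
    hadoop fs -cat "$dest" | cmp -s - "$src"
}

# Usage (not executed here):
#   hdfs_roundtrip trial_file.csv /user/hdfs/trial_file.csv && echo "HDFS read/write OK"
```

Note that the put step fails if the destination file already exists, so remove the file or pick a fresh destination path before repeating the check.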
References:
http://hadoop.apache.org/common/docs/current/file_system_shell.html…
Tags: HDFS, setup
Best Practices: Linux File Systems for HDFS
ISSUE:
Choosing the appropriate Linux file system for HDFS deployment
SOLUTION:
The Hadoop Distributed File System is platform independent and can function on top of any underlying file system and Operating System. Linux offers a variety of file system choices, each with caveats that have an impact on HDFS.
As a general best practice, if you are mounting disks solely for Hadoop data, mount them with the ‘noatime’ option. This turns off access-time updates, which speeds up file reads.
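Access-time updates are turned off by mounting a file system with the noatime option, which on Linux shows up in the options column of /proc/mounts. The sketch below checks a data mount for that option. The function name has_noatime and the mount point /grid/0 are hypothetical and stand in for wherever your Hadoop data disks are mounted.

```shell
# Succeed if mount point "$1" is mounted with the noatime option.
# The optional second argument is the mount table to read; it defaults to
# /proc/mounts and exists only so another table can be supplied.
has_noatime() {
    awk -v mp="$1" '$2 == mp { opts = $4 } END { print opts }' "${2:-/proc/mounts}" |
        tr ',' '\n' | grep -qx noatime
}

# Usage (not executed here; /grid/0 is a hypothetical data mount):
#   has_noatime /grid/0 || echo "/grid/0 is not mounted with noatime"
#
# A matching /etc/fstab entry might look like this (hypothetical device):
#   /dev/sdb1  /grid/0  ext3  defaults,noatime  0 0
```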
Three Linux file systems are popular choices:
- Ext3
- Ext4
- XFS
Yahoo uses the ext3 file system for its Hadoop deployments. ext3 is also the default file system for many popular Linux distributions. Since HDFS on ext3 has been publicly tested on Yahoo’s cluster, it is a safe choice for the underlying file system.
ext4 is the successor to ext3. ext4 has better performance with large files.…
Tags: HDFS, Linux
HOWTO: Check the Health of an HDFS Cluster
ISSUE
How do I check the health of my HDFS cluster (name node and all data nodes)?
SOLUTION
Hadoop includes the dfsadmin command line tool for HDFS administration. This tool allows the user to view the status of the HDFS cluster.
To view a comprehensive status report, execute the following command:
hadoop dfsadmin -report
This command outputs basic statistics on cluster health, including the status of the namenode, the status of each datanode, disk capacity, and block health.
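For scripted checks, individual figures can be pulled out of the report text. The sketch below assumes the report's cluster summary contains "Name: value" lines such as "Missing blocks: 0" and "Under replicated blocks: 0". Treat that layout as an assumption and compare it against the output of your Hadoop version. The function name report_value is hypothetical.

```shell
# Print the value of the first "Name: value" line whose name is exactly "$1".
# Reads the report text on stdin; prints nothing if the name is not found.
# Stopping at the first match prefers the cluster summary at the top of the
# report over the per-datanode sections that follow it.
report_value() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

# Usage (not executed here):
#   missing=$(hadoop dfsadmin -report | report_value "Missing blocks")
#   [ "$missing" = "0" ] || echo "cluster reports $missing missing blocks"
```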
The same information can be found on the NameNode web status page at http://<namenode IP>:50070/dfshealth.jsp
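The status page can also be probed from a script to confirm that the NameNode web interface is answering. This is a minimal sketch that assumes curl is installed. The function name namenode_ui_up and the host name in the usage line are hypothetical, while the port and path are the ones given above.

```shell
# Succeed if the NameNode status page on host "$1" answers with an HTTP
# success code within 10 seconds. The page body is discarded.
namenode_ui_up() {
    curl -fsS -o /dev/null --max-time 10 "http://$1:50070/dfshealth.jsp"
}

# Usage (not executed here; the host name is a placeholder):
#   namenode_ui_up namenode.example.com && echo "NameNode web UI is reachable"
```

A successful response only shows that the web interface is up. It does not replace reading the report itself.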
References:
http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html#DFSAdmin+Command…