Manage Files on HDFS via CLI/Ambari Files View

Tutorial 1: Manage Files on HDFS with the Command Line

Introduction

In this tutorial, we will walk through many of the basic Hadoop Distributed File System (HDFS) commands you will need to manage files on HDFS. The datasets we will use to learn HDFS file management are San Francisco salaries from 2011-2014.

Pre-Requisites

We will download the sf-salaries-2011-2013.csv and sf-salaries-2014.csv data onto the local filesystem of the sandbox. The commands are tailored for Mac and Linux users.

1. Open a terminal on your local machine and SSH into the sandbox:

ssh root@127.0.0.1 -p 2222

Note: If you're on VMware or Azure, insert your sandbox's IP address in place of 127.0.0.1. Azure users will need to replace port 2222 with 22.
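
For example (the VMware IP and Azure hostname below are placeholders; substitute the address your own environment reports):

# VMware sandbox (replace 172.16.110.129 with the IP shown on your sandbox console)
ssh root@172.16.110.129 -p 2222
# Azure sandbox (replace the hostname with your VM's public address; note port 22)
ssh root@your-sandbox.cloudapp.net -p 22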

2. Copy and paste the commands to download the sf-salaries-2011-2013.csv and sf-salaries-2014.csv files. We will use them while we learn file management operations.

# download sf-salaries-2011-2013
wget https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/using-the-command-line-to-manage-hdfs/sf-salary-datasets/sf-salaries-2011-2013.csv
# download sf-salaries-2014
wget https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/using-the-command-line-to-manage-hdfs/sf-salary-datasets/sf-salaries-2014.csv

sf_salary_datasets

Outline

Step 1: Create a Directory in HDFS, Upload a file and List Contents

Let’s learn by writing the syntax. You will be able to copy and paste the following example commands into your terminal. Let’s log in as the hdfs user so we can give the root user permission to perform file operations:

su hdfs
cd

We will use the following command format to run filesystem commands on the Hadoop file system:

hdfs dfs [command_operation]

Refer to the File System Shell Guide to view various command_operations.

hdfs dfs -chmod:

  • Affects the permissions of the folder or file. Controls who has read/write/execute privileges.
  • We will give root access to read and write to the user directory. Later we will perform an operation in which we send a file from our local filesystem to HDFS.
hdfs dfs -chmod 777 /user
  • Warning: in production environments, setting a folder with the permissions above is not a good idea because anyone can read/write/execute files or folders in it.
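
If you would rather not open /user to everyone, a minimal sketch of one more restrictive alternative (the directory name, owner and mode here are illustrative; the rest of this tutorial still assumes the chmod 777 shortcut above) is:

# create a dedicated directory for root and hand it over instead of loosening /user
hdfs dfs -mkdir /user/root
hdfs dfs -chown root:hdfs /user/root
hdfs dfs -chmod 755 /user/root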

Type the following command so we can switch back to the root user. We can perform the remaining file operations under the user folder since its permissions were changed.

exit

hdfs dfs -mkdir:

  • Takes path URIs as arguments and creates a directory or multiple directories.
# Usage:
        # hdfs dfs -mkdir <paths>
# Example:
        hdfs dfs -mkdir /user/hadoop
        hdfs dfs -mkdir /user/hadoop/sf-salaries-2011-2013 /user/hadoop/sf-salaries /user/hadoop/sf-salaries-2014
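
-mkdir also accepts a -p flag that creates any missing parent directories along the way, so the commands above could be collapsed into one, for example:

hdfs dfs -mkdir -p /user/hadoop/sf-salaries-2011-2013 /user/hadoop/sf-salaries /user/hadoop/sf-salaries-2014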

hdfs dfs -put:

  • Copies a single src file or multiple src files from the local file system to the Hadoop Distributed File System.
# Usage:
        # hdfs dfs -put <local-src> ... <HDFS_dest_path>
# Example:
        hdfs dfs -put sf-salaries-2011-2013.csv /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv
        hdfs dfs -put sf-salaries-2014.csv /user/hadoop/sf-salaries-2014/sf-salaries-2014.csv
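
hdfs dfs -copyFromLocal behaves the same way as -put when the source is a local file, so the first upload could equally be written as:

hdfs dfs -copyFromLocal sf-salaries-2011-2013.csv /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv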

hdfs dfs -ls:

  • Lists the contents of a directory
  • For a file, returns stats of a file
# Usage:  
        # hdfs dfs  -ls  <args>  
# Example:
        hdfs dfs -ls /user/hadoop
        hdfs dfs -ls /user/hadoop/sf-salaries-2011-2013
        hdfs dfs -ls /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv

list_folder_contents
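
-ls also accepts an -R flag that lists a directory tree recursively, which is handy for a quick overview of everything created so far:

hdfs dfs -ls -R /user/hadoop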

Step 2: Find Out Space Utilization in an HDFS Directory

hdfs dfs -du:

  • Displays the size of the files and directories contained in the given directory, or the size of a file if it's just a file.
# Usage:  
        # hdfs dfs -du URI
# Example:
        hdfs dfs -du  /user/hadoop/ /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv

displays_entity_size
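
-du also accepts -h for human-readable sizes and -s for a single summary line per path, for example:

# human-readable sizes
hdfs dfs -du -h /user/hadoop/
# one summary line per argument instead of one line per file
hdfs dfs -du -s /user/hadoop/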

Step 3: Download File From HDFS to Local File System

hdfs dfs -get:

  • Copies/Downloads files from HDFS to the local file system
# Usage:
        # hdfs dfs -get <hdfs_src> <localdst>
# Example:
        hdfs dfs -get /user/hadoop/sf-salaries-2011-2013/sf-salaries-2011-2013.csv /home/
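
To confirm the download landed where you expect, list it on the sandbox's local filesystem:

ls -l /home/sf-salaries-2011-2013.csv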

Step 4: Explore Two Advanced Features

hdfs dfs -getmerge

  • Takes a source directory, file or files as input and concatenates the files in src into the local destination file.
  • Concatenates files in the same directory or from multiple directories as long as we specify their locations, and outputs them to the local file system, as can be seen in the Usage below.
  • Let’s concatenate the San Francisco salaries from two separate directories and output them to our local filesystem. The result will be that the salaries from 2014 are appended below the last row of the 2011-2013 salaries.
# Usage:
        # hdfs dfs -getmerge <src> <localdst> [addnl]
        # hdfs dfs -getmerge <src1> <src2> <localdst> [addnl]
# Option:
        # addnl: can be set to enable adding a newline on end of each file
# Example:
        hdfs dfs -getmerge /user/hadoop/sf-salaries-2011-2013/ /user/hadoop/sf-salaries-2014/ /root/output.csv

Merges the files in sf-salaries-2011-2013 and sf-salaries-2014 into output.csv in the root directory of the local filesystem. The first file contains about 120,000 rows and the second almost 30,000 rows. This file operation is important because it saves you from having to concatenate the files manually.
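
A quick sanity check on the merged file (the exact line count depends on the datasets, but given the figures above it should be roughly 150,000):

# peek at the first few rows and count the total lines
head -n 3 /root/output.csv
wc -l /root/output.csv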

hdfs dfs -cp:

  • Copies files or directories recursively; all of the directory’s files and subdirectories down to the bottom of the directory tree are copied.
  • It can be used for inter/intra-cluster copying, though very large copies are better served by DistCp (see the note after the example below).
# Usage:
        # hdfs dfs -cp <src-url> <dest-url>
# Example:
        hdfs dfs -cp /user/hadoop/sf-salaries-2011-2013/ /user/hadoop/sf-salaries-2014/ /user/hadoop/sf-salaries

-cp: copies sf-salaries-2011-2013, sf-salaries-2014 and all their contents to sf-salaries
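
Note that -cp is different from DistCp, the MapReduce-based tool Hadoop ships for very large intra- and inter-cluster copies. A minimal DistCp sketch (the namenode hostnames and port are placeholders) looks like:

hadoop distcp hdfs://namenode1:8020/user/hadoop/sf-salaries hdfs://namenode2:8020/user/hadoop/sf-salaries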

Verify the files or directories successfully copied to the destination folder:

hdfs dfs -ls /user/hadoop/sf-salaries/
hdfs dfs -ls /user/hadoop/sf-salaries/sf-salaries-2011-2013
hdfs dfs -ls /user/hadoop/sf-salaries/sf-salaries-2014

visual_result_of_distcp

Visual result of the -cp file operation. Notice that both src1 and src2 directories and their contents were copied to the dest directory.

Step 5: Use Help Command to access Hadoop Command Manual

The help command lists the commands supported by the Hadoop Distributed File System (HDFS) shell.

# Example:  
        hdfs dfs  -help
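
You can also ask for help on a specific command, for example:

hdfs dfs -help ls
hdfs dfs -help getmerge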

hadoop_help_command_manual

We hope this short tutorial was useful for learning the basics of HDFS file management.

Summary

Congratulations! We just learned to use commands to manage our sf-salaries-2011-2013.csv and sf-salaries-2014.csv dataset files in HDFS. We learned to create directories, upload files and list their contents. We also acquired the skills to download files from HDFS to our local file system and explored a few advanced features of HDFS file management using the command line.

Further Reading

Tutorial 2: Manage Files on HDFS with Ambari Files View

Introduction

In the previous tutorial, we learned to manage files on the Hadoop Distributed File System (HDFS) with the command line. Now we will use Ambari Files View to perform many of the file management operations on HDFS that we learned with CLI, but through the web-based interface.

Pre-Requisites

We will download the sf-salaries-2011-2013.csv and sf-salaries-2014.csv data onto the local filesystem of our computer. The commands are tailored for Mac and Linux users.

1. Open a terminal on your local machine, then copy and paste the commands to download the sf-salaries-2011-2013.csv and sf-salaries-2014.csv files. We will use them while we learn file management operations.

cd ~/Downloads
# download sf-salaries-2011-2013
wget https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/using-the-command-line-to-manage-hdfs/sf-salary-datasets/sf-salaries-2011-2013.csv
# download sf-salaries-2014
wget https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/using-the-command-line-to-manage-hdfs/sf-salary-datasets/sf-salaries-2014.csv
mkdir sf-salary-datasets
mv sf-salaries-2011-2013.csv sf-salaries-2014.csv sf-salary-datasets/
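
If wget is not installed on your machine (it is not present by default on macOS), curl can download the same files:

curl -O https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/using-the-command-line-to-manage-hdfs/sf-salary-datasets/sf-salaries-2011-2013.csv
curl -O https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/using-the-command-line-to-manage-hdfs/sf-salary-datasets/sf-salaries-2014.csv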

Goals for this Module:

  • Learn how to use HDFS from Ambari Files View

Outline

Step 1: Create Directories in HDFS, Upload files and List Contents

Create Directory Tree in User

1. Log in to the Ambari interface at 127.0.0.1:8080 using the login credentials in Table 1.

Table 1: Ambari Login credentials

Username    Password
admin       **setup process

**For the Ambari password setup process, refer to Step 2.2: Setup Ambari Admin Password Manually of Learning the Ropes of the Hortonworks Sandbox.

2. Now that we have admin privileges, we can manage files on HDFS using Files View. Hover over the Ambari Selector Icon ambari_selector_icon and enter the Files View web interface.

files_view

The Files View Interface will appear with the following default folders.

files_view_web_interface

3. We will create four folders using the Files View web interface. Three of them, sf-salaries-2011-2013, sf-salaries and sf-salaries-2014, will reside in the hadoop folder, which itself resides in user. Navigate into the user folder. Click the new folder button new_folder_button; an add new folder window appears. Name the folder hadoop and press enter or Add.

folder_name

4. Navigate into the hadoop folder. Create the three folders: sf-salaries-2011-2013, sf-salaries and sf-salaries-2014 following the process stated in the previous instruction.

hadoop_internal_folders

Upload Local Machine Files to HDFS

We will upload two files from our local machine: sf-salaries-2011-2013.csv and sf-salaries-2014.csv to appropriate HDFS directories.

1. Navigate to /user/hadoop/sf-salaries-2011-2013, or if you’re already in hadoop, enter the sf-salaries-2011-2013 folder. Click the upload button upload-button to transfer sf-salaries-2011-2013.csv into HDFS.

An Upload file window appears:

upload_file_window

2. Click on the cloud with an arrow. A window with files from your local machine appears. Find sf-salaries-2011-2013.csv in the Downloads/sf-salary-datasets folder, select it and then press the Open button.

sf_salaries_2011_2013_csv

3. In Files View, navigate to the hadoop folder and enter the sf-salaries-2014 folder. Repeat the upload file process to upload sf-salaries-2014.csv.

sf_salaries_2014_csv

View and Examine Directory Contents

Each time we open a directory, the Files View automatically lists the contents. Earlier we started in the user directory.

1. Let’s navigate back to the user directory to examine the details listed for its contents. Refer to the image below while you read the Directory Contents Overview.

Directory Contents: Overview of Columns

  • Name is the name of the file or folder
  • Size is the size of the contents in bytes
  • Last Modified is the date/time the content was created or modified
  • Owner is who owns the content
  • Group is who can make changes to the files/folders
  • Permissions establishes who can read, write and execute the data

files_view_web_interface

Step 2: Find Out Space Utilization in an HDFS Directory

On the command line, when the directories and files are listed with hdfs dfs -du /user/hadoop/, the size of each directory and file is shown. In Files View, we must navigate to a file to see its size; we are not able to see the size of a directory, even if it contains files.

1. Let’s view the size of the sf-salaries-2011-2013.csv file. Navigate to /user/hadoop/sf-salaries-2011-2013. How much space has the file used? Files View shows 11.2 MB for sf-salaries-2011-2013.csv.
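
For reference, the equivalent check from the command line (run inside the sandbox, with -h for human-readable sizes) would be:

hdfs dfs -du -h /user/hadoop/sf-salaries-2011-2013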

sf_salaries_2011_2013_csv

Step 3: Download File From HDFS to Local Machine (Mac, Windows, Linux)

Files View enables users to download files and folders to their local machine with ease.

1. Let’s download the sf-salaries-2011-2013.csv file to our computer. Click on the file’s row; the row turns blue and a group of file operations appears. Select the Download button. By default, the file downloads to the Downloads folder on our local machine.

download_file_hdfs_local_machine

Step 4: Explore Two Advanced Features

Concatenate Files

File concatenation merges two files together. If we concatenate sf-salaries-2011-2013.csv with sf-salaries-2014.csv, the data from sf-salaries-2014.csv will be appended to the end of sf-salaries-2011-2013.csv. A typical use case for this feature is when a user has similar large datasets that they want to merge. The manual process of combining large datasets is inconvenient, so file concatenation was created to do the operation instantly.

1. Before we merge the csv files, we must place them in the same folder. Click on the sf-salaries-2011-2013.csv row; it will highlight in blue. Press Copy, and in the Copy to window that appears, select the sf-salaries-2014 folder and press Copy to copy the csv file into it.

copy_to_sf_salaries_2014

2. We will merge the two large files by selecting them both and performing the concatenate operation. Navigate to the sf-salaries-2014 folder. Select sf-salaries-2011-2013.csv, hold shift and click on sf-salaries-2014.csv. Click the concatenate button. The merged file will be downloaded into the Downloads folder on your local machine.

concatenate_csv_files

3. By default, Files View saves the merged files as a txt file. We can open the file and save it as a csv file. Open the csv file and you will notice that all the salaries from 2014 are appended to the salaries from 2011-2013.

Copy Files or Directories Recursively

Copying files or directories recursively means that all of a directory’s files and subdirectories, down to the bottom of the directory tree, are copied. For instance, we will copy the hadoop directory and all of its contents to a new location within our Hadoop cluster. In production, the copy operation is used to copy large datasets within a Hadoop cluster or between two or more clusters.

1. Navigate to the user directory. Click on the row of the hadoop directory. Select the Copy button copy_button.

2. The Copy to window will appear. Select the tmp folder; the row will turn blue. If you select the folder icon, the contents of tmp become visible. Make sure the row is highlighted blue before doing the copy. Click the blue Copy button to copy the hadoop folder recursively to this new location.

copy_hadoop_to_tmp

3. A new copy of the hadoop folder and all of its contents can be found in the tmp folder. Navigate to tmp for verification. Check that all of the hadoop folder’s contents copied successfully.

hadoop_copied_to_tmp

Summary

Congratulations! We just learned to use Files View to manage our sf-salaries-2011-2013.csv and sf-salaries-2014.csv dataset files in HDFS. We learned to create directories, upload files and list their contents. We also acquired the skills to download files from HDFS to our local file system and explored a few advanced features of HDFS file management.

Further Reading