Get Started with Cascading on Hortonworks Data Platform 2.1
- Install Hortonworks Sandbox, a single-node cluster
- Code a simple Java log parsing application using Cascading SDK
- Build the single unit of execution, the jar file, using the Gradle build tool
- Deploy the jar file onto the Sandbox
- Examine the resulting MapReduce Jobs
- View the output stored as an HDFS file
Step 1: Downloading and installing HDP 2.1 Sandbox
- Download and install HDP 2.1 Sandbox
- Familiarize yourself with the navigation on the Linux virtual host through a shell window
- Log in to your Linux Sandbox as root (password is hadoop)
ssh -p 2222 firstname.lastname@example.org
Step 2: Downloading and installing Gradle
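Before moving on, it can help to confirm that the command-line tools used in the remaining steps are on the PATH. The following is a minimal sketch; `require_tool` is a hypothetical helper name introduced here, not something shipped with the Sandbox or with Gradle.

```shell
#!/bin/sh
# require_tool is a hypothetical helper (not part of HDP or Gradle):
# it reports whether a command can be found on the PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Tools invoked in Steps 3 through 5 of this tutorial.
for tool in git gradle hadoop hdfs; do
    require_tool "$tool" || true   # keep checking even if one is missing
done
```

Any tool reported as missing needs to be installed, or added to the PATH, before the later steps will work.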
Step 3: Downloading sources and log data file
git clone git://github.com/dmatrix/examples.git
Step 4: Building the single unit of execution
gradle clean jar
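If the build succeeds, the jar should be available at the path that the hadoop jar command in Step 5 expects, ./build/libs/dataprocessing.jar. A minimal sketch for confirming that the artifact exists before deploying it; `check_jar` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# check_jar is a hypothetical helper: it succeeds only if the given
# path exists and is a non-empty file.
check_jar() {
    jar_path="$1"
    if [ -s "$jar_path" ]; then
        echo "ok: $jar_path"
    else
        echo "not found or empty: $jar_path" >&2
        return 1
    fi
}

# Path taken from the hadoop jar command used later in this tutorial.
check_jar ./build/libs/dataprocessing.jar || echo "run 'gradle clean jar' first"
```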
Step 5: Running the jar on Sandbox
- create a logs directory in HDFS
hdfs dfs -mkdir /user/guest/logs
- create an output directory in HDFS
hdfs dfs -mkdir /user/guest/output
- copy the log file from the local filesystem to the HDFS logs directory
hdfs dfs -copyFromLocal ./NASA_access_log_Aug95.txt /user/guest/logs
- Finally, run the Cascading application on the Sandbox, the single-node HDP cluster
hadoop jar ./build/libs/dataprocessing.jar /user/guest/logs /user/guest/output/logs
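The Step 5 commands can also be collected into one script. The following is a sketch, not part of the tutorial's sources: the `run` and `stage_and_run` helpers and the `DRY_RUN` variable are hypothetical names introduced here. With `DRY_RUN=1` (the default in this sketch) the commands are only printed, so the sequence can be reviewed before anything touches HDFS.

```shell
#!/bin/sh
# DRY_RUN, run, and stage_and_run are hypothetical names for this sketch.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# The four Step 5 commands, in order; stop at the first failure.
stage_and_run() {
    run hdfs dfs -mkdir /user/guest/logs &&
    run hdfs dfs -mkdir /user/guest/output &&
    run hdfs dfs -copyFromLocal ./NASA_access_log_Aug95.txt /user/guest/logs &&
    run hadoop jar ./build/libs/dataprocessing.jar /user/guest/logs /user/guest/output/logs
}

stage_and_run
```

Setting `DRY_RUN=0` in the environment before running the script on the Sandbox executes the commands instead of printing them.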
Tracking the MapReduce Jobs on the Sandbox
Once the job is submitted (or running), you can visually track its progress from the Job Browser in the Sandbox's Hue. By default, it displays all jobs submitted by the user hue; filter by the user guest instead. You can follow any of the links to explore further details about the MapReduce jobs running in their respective YARN containers. For example, clicking on one of the job IDs shows all the map and reduce tasks created.
Viewing the Log Parsing Output
When the job is finished, the 10 IP addresses are written to the HDFS file part-00000. Use the File Browser in the Sandbox's Hue to navigate to the HDFS directory /user/guest/output/logs and view its contents.
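The same file can also be read from the shell with hdfs dfs -cat /user/guest/output/logs/part-00000. As a rough local cross-check, the sketch below counts requests per client in the raw log using standard Unix tools. It rests on two assumptions that this tutorial does not state: that the client host or IP address is the first whitespace-separated field of each log line, and that the application ranks clients by request count. `top_clients` is a hypothetical helper name.

```shell
#!/bin/sh
# top_clients FILE [N]: print the N most frequent first fields (default 10),
# each prefixed by its count. Hypothetical helper for this sketch only.
top_clients() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Run against the tutorial's log file if it is in the current directory.
if [ -f ./NASA_access_log_Aug95.txt ]; then
    top_clients ./NASA_access_log_Aug95.txt 10
fi
```

If the two assumptions hold, the clients listed here should match the contents of part-00000, though the formatting of the two outputs may differ.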
In the next tutorial, we will examine how to use Cascading Driven to discover in-depth information on the Flow (including logical, physical, and performance views).