Get Started with Cascading on Hortonworks Data Platform 2.1

Implementing Log Parsing with Java Cascading SDK on HDP 2.1 Sandbox

If you run into any errors while completing this tutorial, please ask questions or notify us on Hortonworks Community Connection!

This is the second tutorial in a series that helps you, as a Java developer, learn about Cascading and Hortonworks Data Platform (HDP).

In this tutorial, you will do the following:

  • Install Hortonworks Sandbox, a single-node cluster
  • Code a simple Java log parsing application using Cascading SDK
  • Build the single unit of execution, the jar file, using the Gradle build tool
  • Deploy the jar file onto the Sandbox
  • Examine the resulting MapReduce jobs
  • View the output stored as an HDFS file

This example code is derived from Concurrent Inc.’s training class by Alexis Roos (@alexisroos). It shows how the Cascading Java framework lets you write MapReduce jobs that parse a large file for analysis without using the MapReduce API directly. The example only finds the ten most frequently occurring IP addresses in a log file, but the same approach applies to far more complex processing, so it serves as a simple introduction to what the framework can do.
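To make the goal concrete, here is a minimal plain-Java sketch of the computation the tutorial's job performs: take the leading host/IP token from each log line, count occurrences, and keep the ten most frequent. This is not the Cascading code from the training class; the class name TopIpSketch, the method topHosts, and the sample lines are hypothetical, and the sketch assumes each line starts with the client host, as in the Apache common log format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopIpSketch {
    // Assumption: the first whitespace-delimited token of a line is the client host/IP.
    private static final Pattern HOST = Pattern.compile("^(\\S+)\\s");

    /** Counts lines per host and returns up to n entries, highest count first. */
    public static List<Map.Entry<String, Integer>> topHosts(List<String> lines, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = HOST.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // Sort by count descending; break ties by host name so the order is deterministic.
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> sample = Arrays.asList(
            "10.0.0.1 - - [01/Aug/1995:00:00:01 -0400] \"GET /a.html HTTP/1.0\" 200 100",
            "10.0.0.2 - - [01/Aug/1995:00:00:02 -0400] \"GET /b.html HTTP/1.0\" 200 200",
            "10.0.0.2 - - [01/Aug/1995:00:00:03 -0400] \"GET /c.html HTTP/1.0\" 404 0");
        for (Map.Entry<String, Integer> e : topHosts(sample, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In the tutorial itself this grouping, counting, and sorting runs as MapReduce jobs on the cluster rather than in a single JVM; the sketch only illustrates the logic on a small in-memory list.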

Step 1: Downloading and installing HDP 2.1 Sandbox

  • Download and install HDP 2.1 Sandbox
  • Familiarize yourself with the navigation on the Linux virtual host through a shell window
  • Log in to your Linux Sandbox as root (password is hadoop)
    • ssh -p 2222 root@
    • su guest
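Several readers reported "Connection refused" at this step, which usually means the Sandbox VM is not running or SSH port forwarding (host port 2222 to guest port 22) is missing. The following is a minimal sketch of a reachability check in Java. The class name PortCheckSketch and the method canConnect are hypothetical, and the host must be supplied as an argument because the ssh command above does not show it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheckSketch {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and hosts that cannot be resolved.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("usage: java PortCheckSketch <host> [port]");
            return;
        }
        String host = args[0];
        // 2222 is the forwarded SSH port used in this tutorial.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2222;
        if (canConnect(host, port, 3000)) {
            System.out.println("Port " + port + " on " + host + " is reachable.");
        } else {
            System.out.println("Cannot reach " + host + ":" + port
                + ". Check that the VM is running and port forwarding is set up.");
        }
    }
}
```

Run it from a shell on your own machine, not from inside the Sandbox console, since the forwarded port is exposed on the host side.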

Step 2: Downloading and installing Gradle

  • cd ~
  • chmod +x gradle-1.9/bin/gradle

Step 3: Downloading sources and log data file

  • git clone git://
  • cd /home/guest/examples/dataprocessing
  • wget

Step 4: Building the single unit of execution

  • cd /home/guest/examples/dataprocessing
  • ~/gradle-1.9/bin/gradle clean jar

Step 5: Running the jar on Sandbox

  • Create a logs directory in HDFS
    • hdfs dfs -mkdir /user/guest/logs
  • Create an output directory in HDFS
    • hdfs dfs -mkdir /user/guest/output
  • Copy the log file from the local filesystem to the HDFS logs directory
    • hdfs dfs -copyFromLocal ./NASA_access_log_Aug95.txt /user/guest/logs
  • Finally, run the Cascading application on the Sandbox, the single-node HDP cluster
    • hadoop jar ./build/libs/dataprocessing.jar /user/guest/logs /user/guest/output/logs
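Readers reported that a command copied from a web page can carry a typographic dash in place of the ASCII hyphen, in which case the hdfs command fails. If a pasted command is rejected unexpectedly, a check like the following sketch can locate the offending character. The class name NonAsciiCheckSketch and the method nonAsciiPositions are hypothetical, and the sketch simply flags any character outside the 7-bit ASCII range.

```java
import java.util.ArrayList;
import java.util.List;

public class NonAsciiCheckSketch {
    /** Returns the zero-based positions of characters outside the 7-bit ASCII range. */
    public static List<Integer> nonAsciiPositions(String command) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < command.length(); i++) {
            if (command.charAt(i) > 127) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // \u2013 is an en dash, a common substitute for "-" on formatted web pages.
        String pasted = "hdfs dfs \u2013mkdir /user/guest/output";
        for (int position : nonAsciiPositions(pasted)) {
            System.out.printf("Non-ASCII character U+%04X at position %d%n",
                (int) pasted.charAt(position), position);
        }
    }
}
```

Typing the command by hand, as one reader suggested, avoids the problem altogether.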

This run should create the following output:

[Screenshot: console output of the job run]

Tracking the MapReduce Jobs on the Sandbox

Once the job is submitted (or running), you can visually track its progress from the MapReduce Job Browser. Log in to Ambari and click MapReduce 2. Then use Quick Links to open the JobHistory UI.

[Screenshot: JobHistory UI]

You can drill down on any link to explore further details about the MapReduce jobs running in their respective YARN containers. For example, clicking one of the job IDs shows all the map and reduce tasks created for that job.

Viewing the Log Parsing Output

When the job is finished, the top ten IP addresses are written to an HDFS file named part-00000. Use the Ambari HDFS Files view to navigate to the HDFS directory /user/guest/output/logs and view its contents.

[Screenshot: output directory in the HDFS Files view]

Voila! You have written a Cascading log processing application, executed it on the Hortonworks HDP Sandbox, and perused the respective MapReduce jobs and the output generated.

In the next tutorial, we will examine how to use Cascading Driven to discover in-depth information on the Flow (including logical, physical, and performance views).

We hope you enjoyed the tutorial! If you’ve had any trouble completing this tutorial or require assistance, please head on over to Hortonworks Community Connection where hundreds of Hadoop experts are ready to help!


August 12, 2014 at 4:00 am

At the first step I entered ssh -p 2222 root@, but it says connection refused.
How can I solve this problem? Kindly advise as early as possible.


August 31, 2014 at 11:19 pm

After typing ssh -p 2222 root@ I am getting connection refused. I want to run MapReduce programs on Hortonworks.
Can you please advise?

    Jules S. Damji
    October 6, 2014 at 12:44 pm

    If you are getting an immediate connection refused error, it suggests that your Sandbox VM instance in VirtualBox is not running.

    ssh root@ -p 2222
    ssh: connect to host port 2222: Connection refused

    I start my VM from the VirtualBox.
    W10866:~ jdamji$ ssh root@ -p 2222
    root@’s password:

September 12, 2014 at 2:29 am

Guys, is the local IP of the sandbox. I guess there is something wrong with the network adapter settings of the VM. @Hortonworks: the command “hdfs dfs –mkdir /user/guest/output” has the wrong dash.

September 18, 2014 at 12:18 pm

Make sure the HDP Sandbox is running.

Zack Riesland
October 6, 2014 at 7:47 am

Doesn’t look like git is installed on the 2.1 sandbox.

Zack Riesland
October 6, 2014 at 7:56 am

Nevermind about git… I had a typo when I updated my $PATH.

October 14, 2014 at 1:19 pm

Be careful when you copy and paste.
In this step:
“hdfs dfs –mkdir /user/guest/output”

The dash here is an HTML-formatted dash (char is xD0).
When you copy and paste this command, it does not work.
Type the command manually to avoid errors.


    Jules S. Damji
    October 14, 2014 at 2:49 pm

    Good point. The perils of cut-and-paste.

October 27, 2014 at 3:02 pm

Getting error in Step 4 when running “gradle clean jar”? … “Could not find method mavenRepo() for arguments … ”

I ran into this error and determined that it was due to using a Gradle 2.0+ version. The build.gradle file in the dataprocessing directory needs to be updated.

Basically replace
mavenRepo name: 'conjars', url: ''
with
maven {
url ''
}

    Mungeol Heo
    November 4, 2014 at 5:11 pm


    Danilo Sanchez
    November 3, 2015 at 3:35 am

    Very helpful. Thanks!

February 5, 2015 at 11:29 pm

Similar error.
Connection refused for ssh root@ -p 2222.
Is there any solution for this?


    Jules S. Damji
    February 6, 2015 at 12:00 pm

    Is port forwarding enabled in VirtualBox? By default, VirtualBox (VB) has an ssh port forwarding entry in its table.
    In VB, click Settings -> Network -> Port Forwarding, and check for this entry in the table:
    ssh TCP 2222 22

surya nemani
February 6, 2015 at 7:21 pm

same error:
ssh -p 2222 root@

Can somebody please share the solution.

Thanks in Advance.

March 1, 2015 at 1:25 pm

This worked for me in step 2 instead of the last command line:
export PATH=/home/guest/gradle-1.12/bin:$PATH

March 2, 2015 at 9:16 pm

To connect, just type ssh root@

May 12, 2015 at 6:49 am

Is ssh run from a normal command prompt? I got the connection refused error when I ran it from the Sandbox command prompt (ALT+F5). I also got "process information unavailable" from jps. I am a newbie.

    Jules S. Damji
    May 12, 2015 at 9:21 am

    Do it from your shell window on your laptop.
