
Get Started with Cascading on Hortonworks Data Platform 2.1

If you run into any errors while completing this tutorial, please ask questions or notify us on Hortonworks Community Connection!

This tutorial will enable you, as a Java developer, to:

  • Explore Hortonworks Data Platform 2.1 on the Hortonworks Sandbox, a single-node cluster
  • Get acquainted with the Java Cascading SDK
  • Examine the WordCount program in Java
  • Build the single unit of execution, the jar file, using the gradle build tool
  • Deploy the jar file to the Sandbox
  • Examine the resulting MapReduce jobs
  • View the output stored as an HDFS file

To start this tutorial, you must do two things: First, download the Sandbox and follow the installation instructions. Second, download the Cascading SDK.

The example WordCount is derived from part 2 of the Cascading Impatient Series.

Downloading and installing the HDP 2.1 Sandbox

  1. Download and install HDP 2.1 Sandbox.
  2. Familiarize yourself with the navigation on the Linux virtual host through a shell window.
  3. Log in to your Linux Sandbox and create a user named cascade. You can do this with the following command:

    useradd cascade

Git Clone Cascading example and Build it

First, run su cascade to log in as the cascade user.

  1. Download and install gradle-1.9 onto the Linux sandbox.

    cd ~
    wget https://services.gradle.org/distributions/gradle-1.9-bin.zip
    unzip gradle-1.9-bin.zip
    chmod +x gradle-1.9/bin/gradle

  2. Next, cd ~
  3. git clone git://github.com/Cascading/Impatient.git
  4. cd /home/cascade/Impatient/part2
  5. ~/gradle-1.9/bin/gradle clean jar (this builds the impatient.jar file, which is your WordCount unit of execution)
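If you prefer, the clone-and-build steps above can be run as one sequence. This is only a sketch: it assumes gradle-1.9 was already unpacked in the home directory (step 1) and that the Sandbox has a working network connection.

```shell
# Sketch: clone the Impatient examples and build part2 in one go.
# Assumes gradle-1.9 is already unpacked in ~ (see step 1 above).
cd ~
git clone git://github.com/Cascading/Impatient.git
cd ~/Impatient/part2
~/gradle-1.9/bin/gradle clean jar   # produces build/libs/impatient.jar
ls build/libs/impatient.jar         # verify the jar was built
```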

Deploying and running the Cascading Java application

Now you’re ready to deploy and run your impatient.jar file on the cluster.

cd /home/cascade/Impatient/part2
hadoop fs -mkdir -p /user/cascade/data/
hadoop fs -copyFromLocal data/rain.txt /user/cascade/data/
hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc
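Conceptually, the WordCount flow tokenizes the input text, groups identical tokens, and counts each group. A local pipeline with standard Unix tools (no Hadoop required) illustrates the same computation on a toy input:

```shell
# Toy word count: tokenize on whitespace, sort so identical tokens
# are adjacent, then count each run of identical lines.
# Counts produced: dry 1, rain 2, shadow 2.
printf 'rain shadow rain\nshadow dry\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c
```

The MapReduce job does the same thing at scale: the map phase emits tokens, the shuffle groups identical tokens together, and the reduce phase counts each group.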

This command will produce the following output:

[Screenshot: console output of the hadoop jar command]

Tracking the MapReduce Jobs on Sandbox

Once the job is submitted (or running), you can track its progress from the Sandbox MapReduce Job Browser. Click on Job History UI.

[Screenshot: Sandbox Job Browser with the Job History UI link]

By default, it will display all jobs run by the user. Look for the latest one, which should show the user cascade.

[Screenshot: list of MapReduce jobs run by user cascade]

Viewing the WordCount Output

When the job is finished, the word counts are written as an HDFS file part-00000. Use the Sandbox’s HDFS Files view to navigate to the HDFS directory and view its contents.
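If you prefer the command line to the Files view, you can also print the result directly from HDFS. The path below assumes the hadoop jar command above was run as the cascade user:

```shell
# Print the word-count output straight from HDFS
# (output/wc resolves to /user/cascade/output/wc for user cascade).
hadoop fs -cat output/wc/part-00000 | head -20
```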

[Screenshot: HDFS Files view showing the part-00000 word counts]

Above and Beyond

For the adventurous, you can try the entire Impatient Series after you have downloaded the sources from GitHub. Beyond the Impatient series, there are other tutorials and case examples to play with.

Have Fun!

We hope you enjoyed the tutorial! If you’ve had any trouble completing this tutorial or require assistance, please head on over to Hortonworks Community Connection where hundreds of Hadoop experts are ready to help!