Get Started with Cascading on Hortonworks Data Platform 2.1

Implementing WordCount with Cascading on HDP 2.1 Sandbox

If you run into any errors while completing this tutorial, please ask questions or notify us on Hortonworks Community Connection!

In this tutorial you, as a Java developer, will:

  • Get an introduction to Hortonworks Data Platform 2.1 on the Hortonworks Sandbox, a single-node cluster
  • Get an introduction to the Java Cascading SDK
  • Examine the WordCount program in Java
  • Build the single unit of execution, the jar file, using the Gradle build tool
  • Deploy the jar file onto the Sandbox
  • Examine the resulting MapReduce jobs
  • View the output stored as an HDFS file

To start this tutorial, you must do two things: First, download the Sandbox and follow the installation instructions. Second, download the Cascading SDK.

The example WordCount is derived from part 2 of the Cascading Impatient Series.
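
To give a sense of what the WordCount job computes before you build it, here is a minimal plain-Java sketch of the same idea: split each line into tokens, then count how often each token occurs. It uses only the JDK and is not the Cascading API or the Impatient source. The class name WordCountSketch, the method countWords, the delimiter set, and the sample sentences are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: plain JDK code, not the Cascading API
// and not the actual Impatient part 2 source.
class WordCountSketch {

    // Assumed delimiter set: whitespace plus a few punctuation characters.
    private static final String DELIMITERS = "[\\s\\[\\]\\(\\),.]+";

    // Split every line into tokens and count the occurrences of each token.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // split() can yield a leading empty string
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input, not the contents of data/rain.txt.
        List<String> lines = Arrays.asList(
                "A rain shadow is a dry area.",
                "The rain falls on the mountain.");
        countWords(lines).forEach((token, count) ->
                System.out.println(token + "\t" + count));
    }
}
```

On the cluster, the tokenizing and counting are spread across MapReduce tasks rather than run in a single loop, but the result has the same shape: one count per distinct token.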

Downloading and installing the HDP 2.1 Sandbox

  1. Download and install HDP 2.1 Sandbox.
  2. Familiarize yourself with the navigation on the Linux virtual host through a shell window.
  3. Log in to your Linux Sandbox and create a user named cascade. You can do this with the following command:

    useradd cascade

Cloning and building the Cascading example

First, run su cascade to log in as the cascade user.

  1. Download and install gradle-1.9 onto the Linux sandbox, unpacked in the home directory to match the paths used below.

    cd ~
    chmod +x gradle-1.9/bin/gradle

  2. Next, cd ~
  3. git clone git://
  4. cd /home/cascade/Impatient/part2
  5. ~/gradle-1.9/bin/gradle clean jar (this builds the impatient.jar file, which is your wordcount unit of execution)

Deploying and running the Cascading Java application

Now you’re ready to deploy your impatient.jar file onto the cluster and run it.

cd /home/cascade/Impatient/part2
hadoop fs -mkdir -p /user/cascade/data/
hadoop fs -copyFromLocal data/rain.txt /user/cascade/data/
hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc

This command will produce the following output:

[Screenshot: console output of the hadoop jar command]

Tracking the MapReduce Jobs on Sandbox

Once the job is submitted (or running), you can track its progress from the Sandbox MapReduce Job Browser. Click on Job History UI.

[Screenshot: Job History UI in the Sandbox MapReduce Job Browser]

By default, it displays all jobs run by the user. Look for the most recent one, which should list cascade as its user.

[Screenshot: job list showing the job run by user cascade]

Viewing the WordCount Output

When the job is finished, the word counts are written to the HDFS file part-00000. Use the Sandbox’s HDFS Files view to navigate to the HDFS directory and view its contents.

[Screenshot: HDFS Files view showing the contents of part-00000]
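
If you would rather inspect the counts programmatically than in the Files view, the following is a minimal sketch that reads word count lines and prints the most frequent tokens. It assumes each line of part-00000 holds a token and a count separated by a tab, possibly preceded by a header line; that layout is an assumption, so check it against your own output. The class name WordCountOutputSketch and the method topWords are hypothetical, and the code uses only the JDK.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: assumes lines of the form token<TAB>count.
class WordCountOutputSketch {

    // Parse the lines and return the n rows with the highest counts.
    static List<Map.Entry<String, Long>> topWords(List<String> lines, int n) {
        List<Map.Entry<String, Long>> rows = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                continue; // not a token/count pair
            }
            try {
                rows.add(new SimpleEntry<>(cols[0], Long.valueOf(cols[1].trim())));
            } catch (NumberFormatException e) {
                // Skip a header line or any row whose count is not a number.
            }
        }
        // Highest count first; the sort is stable, so ties keep input order.
        rows.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return rows.subList(0, Math.min(n, rows.size()));
    }

    public static void main(String[] args) {
        // Read all lines from standard input.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = in.lines().collect(Collectors.toList());
        for (Map.Entry<String, Long> row : topWords(lines, 10)) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```

One way to feed it the job output, assuming the class has been compiled in the current directory, is to pipe the file in with hadoop fs -cat output/wc/part-00000 | java WordCountOutputSketch.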

Above and Beyond

For the adventurous, you can try the entire Impatient series after you have downloaded the sources from GitHub. Beyond the Impatient series, there are other tutorials and case examples to play with.

Have Fun!

We hope you enjoyed the tutorial! If you’ve had any trouble completing this tutorial or require assistance, please head on over to Hortonworks Community Connection where hundreds of Hadoop experts are ready to help!


October 27, 2014 at 12:11 pm

Running into an error when building the Cascading example with gradle? … “You can’t change configuration ‘providedCompile’ because it is already resolved!”

../common/providedCompile.gradle needs to be updated to build with gradle 2.1 … see solution in

    Jules S. Damji
    October 27, 2014 at 12:40 pm

    Thanks for the pointer.

    Mungeol Heo
    November 4, 2014 at 1:16 am


    December 26, 2014 at 12:26 pm

    I had a similar error; it was caused by running a version of gradle >= 2.0. If you change your gradle version to 1.x, you may resolve this issue.

    January 12, 2015 at 3:49 pm

    I was able to fix my error by going back to ../common/providedCompile.gradle and surrounding all of the += second arguments with [], like

    foo += [bar]

November 1, 2014 at 5:44 pm

Hello, I think there is a step missing. When I type the gradle clean jar command, it says that the command is not found. I have no idea what to do and the internet has been no help. Can someone please help?

    Jules S. Damji
    November 2, 2014 at 1:05 pm


    You need to download gradle, install it on your Sandbox, and put its home path in your $PATH. Downloading and installing are part of step 1.

    I hope that helps

Krishna Reddy Munnangi
December 24, 2014 at 11:31 pm

The following are the commands for the step “Download and install gradle-1.1 onto the Linux sandbox”:

Commands to install Gradle in the Sandbox


unzip -d /root/gradle

Export PATH

gradle -v

Krishna Reddy Munnangi
December 24, 2014 at 11:37 pm

The following are the commands for installing on the sandbox for the step “Download and install gradle-1.1 onto the Linux sandbox”:


unzip -d /root/gradle

Export PATH

gradle -v

December 26, 2014 at 12:01 pm

For anyone having trouble installing gradle, the easy way to do it (at least for me) is through the shell with the following commands:

# installs to /opt/gradle
# existing versions are not overwritten/deleted
# seamless upgrades/downgrades
# $GRADLE_HOME points to latest *installed* (not released)
mkdir /opt/gradle
wget -N${gradle_version}
unzip -oq ./gradle-${gradle_version} -d /opt/gradle
ln -sfnv gradle-${gradle_version} /opt/gradle/latest
printf "export GRADLE_HOME=/opt/gradle/latest\nexport PATH=\$PATH:\$GRADLE_HOME/bin" > /etc/profile.d/
. /etc/profile.d/
# check installation
gradle -v

Taken from here:

For the current latest install change gradle_version to gradle_version=2.2.1

January 9, 2015 at 8:20 am

I am new to Cascading and just starting to learn it. I read that Cascading is a proven application development platform for building data applications on Hadoop. Is it another processing model on top of HDFS, just like MapReduce?
So I guess the entire API set to read, write, and process data is different in Cascading.
Please correct or elaborate.

    Jules S. Damji
    January 12, 2015 at 10:23 am

    The API provides expressive high-level operators as abstractions to MapReduce. So I wouldn’t call it different; rather, it hides or abstracts them.
    Are you using Cascading 3.0 on HDP2.2?

January 12, 2015 at 3:51 pm

If you have overridden the default queue by following another tutorial, you can specify a queue adding a property in, like this:

Properties properties = new Properties();
properties.put("", "Development");

January 26, 2015 at 5:03 pm

For anyone getting the following error:

* What went wrong:
A problem occurred configuring project ‘:impatient-docs’.
> Could not resolve all dependencies for configuration ‘:impatient-docs:classpath’.
> Could not download artifact ‘asciidoctor-java-integration.jar (org.asciidoctor:asciidoctor-java-integration:0.1.3)’
> Host may not be null

cd impatient-docs
vi build.gradle

Change the following:

dependencies {
    classpath 'org.asciidoctor:asciidoctor-gradle-plugin:1.5.0'
}

apply plugin: 'org.asciidoctor.gradle.asciidoctor'

It should fix the error.

January 8, 2016 at 11:14 pm

I used Gradle 1.1 but it was giving an error. After switching to Gradle 2.1, the problem was resolved. The following is the way to get Gradle 2.
The following are the commands for installing on the sandbox for the step “Download and install gradle-1.1 onto the Linux sandbox”:


unzip -d /root/gradle


gradle -v
