HDP on Linux – Installation Forum

MapReduce in Sandbox

  • #28816
    Duncan Gunn

    I might be missing something here, but when I try to run the standard word count MapReduce job in the sandbox, it runs successfully but the generated output is just the input file!

    I know this code works as I have verified it separately using Amazon EMR.

    I create the job in the job designer and specify the following properties:


    I point the input dir at a words.txt file.

    I would expect the output to be a count, e.g. apple 3 orange 1 and so on.

    Instead I just get the original input file back as output!

    What am I doing wrong? It’s as if the map and reduce aren’t running at all!



  • #28823
    Abdelrahman

    Hi Duncan,

    How is your day so far? Can you please provide the exact steps that you have followed to run the word count?


    Duncan Gunn

    Hi Abdelrahman

    Good thanks! Hope you are well also.

    My exact steps (from memory) are:

    – copy my wordcount.jar to /user/hue/examples directory
    – create new MapReduce job in the job designer
    – complete the path to the wordcount.jar file
    – add two properties: mapred.output.dir and mapred.input.dir and set them as ${vars}
    – save and submit the job
    – enter the input and output parameters
    – the job seems to run fine and completes OK, but when I look at the part-00000 file that is produced, it contains exactly the same content as the input file! It’s as if the job has just copied the input.

    I’ve looked at the logs and it seems to be running through all the right steps from what I can see.

    I’m obviously doing something very wrong, but I’m lost!
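
    A likely explanation for this pass-through behaviour, for reference: the job designer's MapReduce action submits the job with whatever properties are configured rather than running the jar's main() method, so if only mapred.input.dir and mapred.output.dir are set, the default identity mapper and reducer run and the input is effectively passed straight through to the output. A minimal sketch of the extra properties one would typically add, assuming the jar uses the old mapred API; the class names here are hypothetical:

    mapred.input.dir      = ${input}
    mapred.output.dir     = ${output}
    mapred.mapper.class   = com.example.wordcount.WordCountMapper    (hypothetical class name)
    mapred.reducer.class  = com.example.wordcount.WordCountReducer   (hypothetical class name)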



    Hi Duncan,

    Thank you for providing the details. Let us first run a simple word count from the command line. Before that, create the output directory /tmp/output_wordcount in HDFS by running:
    hadoop fs -mkdir /tmp/output_wordcount
    Then run:
    bin/hadoop jar hadoop-*-examples.jar wordcount -m 4 -r 1
    Let me know if this works for you.
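
    For reference, a complete end-to-end run looks something like the following; the examples jar name and the HDFS paths are illustrative and vary by Hadoop version, and the -m/-r flags above may not be accepted by newer examples jars:

    # put a local words.txt into HDFS first (file name and path are just an example)
    hadoop fs -put words.txt /tmp/input_wordcount
    # the job refuses to run if the output directory already exists, so clear it first
    hadoop fs -rmr /tmp/output_wordcount
    # run the built-in word count example
    hadoop jar hadoop-*-examples.jar wordcount /tmp/input_wordcount /tmp/output_wordcount
    # inspect the result (one "word<TAB>count" pair per line)
    hadoop fs -cat /tmp/output_wordcount/part-00000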


    Duncan Gunn

    This is maybe a silly question, but how do I get to the command line? I’ve tried to set up a shell job in the past but that doesn’t seem to work!



    Hi Duncan,

    To run a Hadoop job from the command line you need to either ssh into the sandbox or open a shell prompt directly in the VM. The easiest is the latter: click in the Sandbox VM window and press the key combination shown in the window (usually Alt+F5). It will ask you for a username; use ‘root’, then enter ‘hadoop’ as the password. You are now at a shell prompt where you can run command-line jobs.
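
    Alternatively, with the sandbox VM's default NAT port forwarding you can usually ssh in from the host machine along these lines (port 2222 is the usual VirtualBox forwarding for the sandbox and may differ in your setup):

    # connect to the sandbox from the host; the password is 'hadoop'
    ssh root@127.0.0.1 -p 2222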


    Duncan Gunn

    Excellent; thanks very much it worked!

    Bit strange that it doesn’t work via the Sandbox GUI though…


    Hi Duncan,

    Yup, it is a bit strange that it doesn’t work there for you. I am checking to see if I get the same problem.



    Hi Abdelrahman,
    Please find the exact steps below:
    1) Log in to the sandbox shell with the credentials "root" and "hadoop"
    2) Go to the home directory:
    cd /home
    3) Make a directory dft here (you can use any directory of your choice):
    mkdir dft
    cd dft
    4) wget http://www.gutenberg.org/files/4300/4300.zip (the input file whose words you will count)
    5) unzip 4300.zip
    6) rm 4300.zip
    7) hadoop dfs -copyFromLocal /home/dft dft (copy the files to HDFS)
    8) hadoop dfs -ls
    9) hadoop dfs -ls dft
    10) hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount dft dft-output
    Check the output:
    11) hadoop dfs -ls
    12) hadoop dfs -ls dft-output
    13) hadoop dfs -cat dft-output/part-00000 | less
    14) hadoop dfs -copyToLocal dft-output/part-00000 . (copy the output back to the local directory)
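
    As a quick sanity check on the result (each output line is a word and its count separated by a tab), something like the following lists the most frequent words:

    # sort numerically on the count column, largest first, and show the top ten
    hadoop dfs -cat dft-output/part-00000 | sort -k2,2nr | head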


    Hi Chandra,

    When I follow the steps you’ve given the output is as it should be, a count of the words in the input file.



    Could anyone please suggest the exact location of the wordcount program?
    I logged into the sandbox using the root/hadoop credentials and navigated to the home directory, but I could not find any usr or lib directory there. Please help.



    Hi Suthan,

    The example programs are in /usr/lib/hadoop.
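
    For example, something along these lines should show the examples jar on the sandbox (the exact file name varies by HDP version):

    # list the bundled example jars
    ls /usr/lib/hadoop/hadoop-examples*.jar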


    Anand Saraf

    There are multiple samples available. To see the list of samples, run:
    yarn jar /usr/hdp/<ver>/hadoop-mapreduce/hadoop-mapreduce-examples-<ver>.jar

    Valid program names are:
    wordcount: A map/reduce program that counts the words in the input files.
    wordmean: A map/reduce program that counts the average length of the words in the input files.
    wordmedian: A map/reduce program that counts the median length of the words in the input files.
    wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

    To see the required input for any specific program, run it without arguments, e.g. for wordcount:
    yarn jar /usr/hdp/<ver>/hadoop-mapreduce/hadoop-mapreduce-examples-<ver>.jar wordcount
    Usage: wordcount <in> [<in>…] <out>

    Now run it, passing the required arguments:
    yarn jar /usr/hdp/<ver>/hadoop-mapreduce/hadoop-mapreduce-examples-<ver>.jar wordcount in.txt outFolder
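
    The result can then be read back with something like the following (with MRv2/YARN the reducer output files are named part-r-00000, part-r-00001, and so on):

    # print the word counts produced by the job above
    hdfs dfs -cat outFolder/part-r-00000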
