HDP on Linux – Installation Forum

MapReduce in Sandbox

  • #28816
    Duncan Gunn
    Member

    I might be missing something here, but when I try to run the standard word count MapReduce job in the sandbox, it runs successfully but the generated output is just the input file!

    I know this code works as I have verified it separately using Amazon EMR.

    I create the job in the job designer and specify the following properties:

    mapred.output.dir
    mapred.input.dir

    I point the input dir at a words.txt file.

    I would expect the output to be a count, e.g. apple 3, orange 1, and so on.

    Instead I just get the original input file back as output!

    What am I doing wrong? It’s as if the map and reduce aren’t running at all!

    Thanks


  • #28823
    abdelrahman
    Moderator

    Hi Duncan,

    How is your day so far? Can you please provide the exact steps that you have followed to run the word count?

    Thanks
    -Abdelrahman

    #28826
    Duncan Gunn
    Member

    Hi Abdelrahman

    Good thanks! Hope you are well also.

    My exact steps (from memory) are:

    – copy my wordcount.jar to the /user/hue/examples directory
    – create a new MapReduce job in the job designer
    – fill in the path to the wordcount.jar file
    – add two properties, mapred.output.dir and mapred.input.dir, and set them as ${vars} (see the side note after this list)
    – save and submit the job
    – enter the input and output parameters
    – the job appears to run fine and finishes OK, but when I look at the part-00000 file that is produced, it contains exactly the same content as the input file! It’s as if the job has just copied the input.
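
    Side note on those two properties: they only tell Hadoop where to read and where to write. With the old mapred API, a job submitted without an explicit mapper and reducer class falls back to the identity mapper and reducer, which pass every record straight through, so the output ends up being a copy of the input. A job-designer entry for a custom jar would therefore usually also carry properties along these lines; the ${…} variable names and the class names below are only placeholders for whatever is actually inside wordcount.jar:

    mapred.input.dir = ${input}
    mapred.output.dir = ${output}
    mapred.mapper.class = org.myorg.WordCount$Map
    mapred.reducer.class = org.myorg.WordCount$Reduce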

    I’ve looked at the logs and it seems to be running through all the right steps from what I can see.

    I’m obviously doing something very wrong, but I’m lost!

    Thanks

    #28829
    abdelrahman
    Moderator

    Hi Duncan,

    Thank you for providing the details. Let’s first run a simple word count from the command line:
    bin/hadoop jar hadoop-*-examples.jar wordcount -m 4 -r 1 <input-dir> <output-dir>
    Before this step, create /tmp/output_wordcount in HDFS by running the following command:
    hadoop fs -mkdir /tmp/output_wordcount
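    For example, assuming your words.txt is sitting in the local home directory, a full run would look roughly like this (the examples jar path and name vary between sandbox versions, and the final output directory must not already exist when the job starts):
    hadoop fs -mkdir /tmp/input_wordcount
    hadoop fs -put words.txt /tmp/input_wordcount
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/input_wordcount /tmp/output_wordcount/run1
    hadoop fs -cat /tmp/output_wordcount/run1/part-*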
    Let me know if this works for you.

    Thanks
    -Abdelrahman

    #28830
    Duncan Gunn
    Member

    This is maybe a silly question, but how do I get to the command line? I’ve tried to set up a shell job in the past but that doesn’t seem to work!

    Thanks

    #28862
    tedr
    Moderator

    Hi Duncan,

    To run a hadoop job from the command line you need to either ssh into the sandbox or open a shell prompt directly in the VM; the latter is easiest. To do this, click in the Sandbox VM window and then press the key combination shown in the window (usually Alt+F5). It will ask you for a username; use ‘root’, then enter ‘hadoop’ as the password. You are now at a shell prompt where you can run command-line tools.
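
    If you would rather ssh in from your host machine, the sandbox VM normally forwards host port 2222 to the VM’s port 22 (assuming the default NAT networking), so something like the following should also get you a prompt, again with ‘hadoop’ as the password:
    ssh root@127.0.0.1 -p 2222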

    Thanks,
    Ted.

    #28890
    Duncan Gunn
    Member

    Excellent; thanks very much, it worked!

    Bit strange that it doesn’t work via the Sandbox GUI, though…

    #28900
    tedr
    Moderator

    Hi Duncan,

    Yup, it is a bit strange that it doesn’t work there for you. I am checking to see if I get the same problem.

    Thanks,
    Ted.

    #28996
    Chandra

    Hi Abdelrahman,
    Please find the exact steps below:
    1) log in to the sandbox shell with the credentials “root” and “hadoop”
    2) go to the home directory
    cd /home
    3) make a directory dft here (you can make any directory of your choice)
    mkdir dft
    cd dft
    4) wget http://www.gutenberg.org/files/4300/4300.zip (the input file whose words you will count)
    5) unzip 4300.zip
    6) rm 4300.zip
    7) hadoop dfs -copyFromLocal /home/dft dft (copy the files to HDFS)
    8) hadoop dfs -ls
    9) hadoop dfs -ls dft
    10) hadoop jar /usr/lib/hadoop/hadoop-examples-1.2.0.1.3.0.0-107.jar wordcount dft dft-output
    check the output:
    11) hadoop dfs -ls
    12) hadoop dfs -ls dft-output
    13) hadoop dfs -cat dft-output/part-00000 | less
    14) hadoop dfs -copyToLocal dft-output/part-00000 . (copy the output back to the local dft directory)
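    Side note: on newer HDP releases the hadoop dfs form is deprecated in favor of hdfs dfs; the same commands work either way, for example:
    hdfs dfs -cat dft-output/part-00000 | less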
    thanks

    #29010
    tedr
    Moderator

    Hi Chandra,

    When I follow the steps you’ve given the output is as it should be, a count of the words in the input file.

    Thanks,
    Ted.

    #30173
    Suthan

    Hi,
    Could anyone please suggest the exact location of the wordcount program?
    I logged into the sandbox using the root/hadoop credentials and navigated to the home directory, but I could not find any usr or lib directory there. Please help.

    Thanks,
    Suthan

    #30176
    Teja
    Member

    Hi Suthan,

    The example programs are in /usr/lib/hadoop.

    Thanks,
    Teja

    #76540
    Anand Saraf
    Participant

    There are multiple example programs available. To see the list of samples, run:
    Command:
    yarn jar /usr/hdp/<ver>/hadoop-mapreduce/hadoop-mapreduce-examples-<ver>.jar

    Output:
    Valid program names are:
    wordcount: A map/reduce program that counts the words in the input files.
    wordmean: A map/reduce program that counts the average length of the words in the input files.
    wordmedian: A map/reduce program that counts the median length of the words in the input files.
    wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

    To see the required input for any specific program, run it without arguments, e.g. for wordcount:
    Command:
    yarn jar /usr/hdp/<ver>/hadoop-mapreduce/hadoop-mapreduce-examples-<ver>.jar wordcount
    Output:
    Usage: wordcount <in> [<in>…] <out>

    Now, run it, passing the required arguments:
    yarn jar /usr/hdp/<ver>/hadoop-mapreduce/hadoop-mapreduce-examples-<ver>.jar wordcount in.txt outFolder
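    Once the job finishes, you can view the counts with something like this (recent example jars write part-r-00000 files):
    hdfs dfs -cat outFolder/part-r-00000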
