Elephant-bird to analyse Tweets

This topic contains 3 replies, has 3 voices, and was last updated by  Hadoop Developer 1 month ago.

  • Creator
    Topic
  • #45901

    Rodulfo
    Participant

    Hello, I wanted to use Twitter's Elephant-bird to analyze tweets without having to convert them to another format like CSV, keeping them in their original JSON format.

    I have built Elephant-bird and wrote the following simple script to load tweets from a file, following some examples I saw:


    REGISTER /user/rmrodriguez/jar/json-simple-1.1.jar;
    REGISTER /user/rmrodriguez/jar/elephant-bird-pig-4.4.jar;
    REGISTER /user/rmrodriguez/jar/elephant-bird-core-4.4.jar;
    REGISTER /user/rmrodriguez/jar/google-collections-1.0.jar;

    A = LOAD 'tweets.20131201-215958.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

    tweets = FOREACH A GENERATE (CHARARRAY)$0#'id' AS id;

    DUMP tweets;

    and I get the following error:

    2013-12-19 07:55:55,364 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
    2013-12-19 07:55:55,367 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2013-12-19 07:55:55,370 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. com/twitter/elephantbird/util/HadoopCompat
    Details at logfile: /hadoop/yarn/local/usercache/rmrodriguez/appcache/application_1387366430472_0012/container_1387366430472_0012_01_000002/pig_1387457753323.log

    Does anyone with Elephant-bird experience know what causes this error, or can anyone suggest another way to load tweets in JSON format?

    Greetings,
    Rod


  • Author
    Replies
  • #57215

    Hadoop Developer
    Participant

    Hi,

    I have 3 records in JSON format in the source file, but after processing in Pig it only returns 1 record. Please suggest how to proceed.
    Here is the detailed description:

    Source Data :

    hadoop fs -cat /tmp/json.txt

    [
    {"key": "313032","columns": [["name","name102",1405333129634000]]},
    {"key": "313030","columns": [["name","name1",1405333115374000]]},
    {"key": "313031","columns": [["name","name101",1405333123538000]]}
    ]

    Pig Code:

    register /tmp/jsontest/elephant-bird-pig-4.4.jar;
    register /tmp/jsontest/elephant-bird-core-4.4.jar;
    register '/tmp/jsontest/elephant-bird-hadoop-compat-4.4.jar';
    register '/tmp/jsontest/google-collections-1.0-rc1.jar';
    register '/tmp/jsontest/json_simple-1.1.jar';

    a = load '/tmp/json.txt' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
    dump a;

    Output:

    ([columns#{({(name),(name101),(1405333123538000)})},key#313031])
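
    A possible explanation, assuming Elephant Bird's JsonLoader parses one JSON record per input line: the source file above is a pretty-printed JSON array, so lines like the opening "[" and the records ending in "}," are not valid JSON on their own, and only the final object on its own line parses, which matches the single record in the output. A minimal sketch of the same data laid out one object per line, which '-nestedLoad' can read directly:

    {"key": "313032","columns": [["name","name102",1405333129634000]]}
    {"key": "313030","columns": [["name","name1",1405333115374000]]}
    {"key": "313031","columns": [["name","name101",1405333123538000]]}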

    #46254

    Rodulfo
    Participant

    Thank you Ramanan,

    That was it. I just needed to register the HadoopCompat jar. Now it works!

    #46226

    Check the logfile 'pig_1387457753323.log'. You may need to register the HadoopCompat jar in your Pig script.
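
    For reference, a sketch of the script from the original post with that extra REGISTER added (assuming the hadoop-compat jar from the 4.4 build is placed in the same jar directory):

    REGISTER /user/rmrodriguez/jar/json-simple-1.1.jar;
    REGISTER /user/rmrodriguez/jar/elephant-bird-hadoop-compat-4.4.jar;
    REGISTER /user/rmrodriguez/jar/elephant-bird-pig-4.4.jar;
    REGISTER /user/rmrodriguez/jar/elephant-bird-core-4.4.jar;
    REGISTER /user/rmrodriguez/jar/google-collections-1.0.jar;

    A = LOAD 'tweets.20131201-215958.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
    tweets = FOREACH A GENERATE (CHARARRAY)$0#'id' AS id;
    DUMP tweets;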
