Elephant-bird to analyse Tweets

This topic contains 3 replies, has 3 voices, and was last updated by  Hadoop Developer 3 months, 1 week ago.

  • Creator
  • #45901


Hello, I want to use Twitter's Elephant-bird to analyze tweets in their original JSON format, without first converting them to another format such as CSV.

I have built Elephant-bird and, following some examples I found, wrote the following simple code to load tweets from a file:

    REGISTER /user/rmrodriguez/jar/json-simple-1.1.jar;
    REGISTER /user/rmrodriguez/jar/elephant-bird-pig-4.4.jar;
    REGISTER /user/rmrodriguez/jar/elephant-bird-core-4.4.jar;
    REGISTER /user/rmrodriguez/jar/google-collections-1.0.jar;

    A = LOAD 'tweets.20131201-215958.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

    tweets = FOREACH A GENERATE (CHARARRAY)$0#'id' AS id;

    DUMP tweets;

    and I get the following error:

    2013-12-19 07:55:55,364 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
    2013-12-19 07:55:55,367 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2013-12-19 07:55:55,370 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. com/twitter/elephantbird/util/HadoopCompat
    Details at logfile: /hadoop/yarn/local/usercache/rmrodriguez/appcache/application_1387366430472_0012/container_1387366430472_0012_01_000002/pig_1387457753323.log

Does anyone with Elephant-bird experience know the cause of this error, or can you suggest another way to load tweets in JSON format?
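For readers unfamiliar with the `-nestedLoad` option: each input line becomes a Pig map, so `$0#'id'` is simply a key lookup on that map. A rough Python analogy (the sample tweet below is hypothetical, not taken from the poster's file):

```python
import json

# Hypothetical, minimal tweet line; real tweets carry many more fields.
line = '{"id": 412345678901234567, "text": "hello", "user": {"screen_name": "alice"}}'

# JsonLoader('-nestedLoad') does roughly this per input line,
# yielding a map that the Pig script indexes with $0#'key'.
tweet = json.loads(line)
tweet_id = str(tweet["id"])   # analogue of (CHARARRAY)$0#'id' AS id
print(tweet_id)               # -> 412345678901234567
```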


Viewing 3 replies - 1 through 3 (of 3 total)


  • Author
  • #57215

    Hadoop Developer


I have 3 records in JSON format in the source file, but after processing in Pig only 1 record comes back. Please suggest how to proceed. Here is a detailed description:

    Source Data :

    hadoop fs -cat /tmp/json.txt

{"key": "313032","columns": [["name","name102",1405333129634000]]},
{"key": "313030","columns": [["name","name1",1405333115374000]]},
{"key": "313031","columns": [["name","name101",1405333123538000]]}

    Pig Code:

register /tmp/jsontest/elephant-bird-pig-4.4.jar;
register /tmp/jsontest/elephant-bird-core-4.4.jar;
register /tmp/jsontest/elephant-bird-hadoop-compat-4.4.jar;
register /tmp/jsontest/google-collections-1.0-rc1.jar;
register /tmp/jsontest/json_simple-1.1.jar;

a = load '/tmp/json.txt' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
dump a;
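One possible explanation (a guess based on the pasted data, not verified against the actual file): Elephant-bird's JsonLoader parses one JSON object per line and skips lines it cannot parse. In the paste, the first two lines end with a trailing comma, which makes each of them invalid as standalone JSON; only the third line parses cleanly, which would account for exactly one record coming back. A quick check in Python:

```python
import json

# The three lines exactly as pasted (curly quotes normalized to straight quotes).
lines = [
    '{"key": "313032","columns": [["name","name102",1405333129634000]]},',
    '{"key": "313030","columns": [["name","name1",1405333115374000]]},',
    '{"key": "313031","columns": [["name","name101",1405333123538000]]}',
]

for i, line in enumerate(lines, 1):
    try:
        json.loads(line)
        print(f"line {i}: valid JSON")
    except ValueError as e:
        print(f"line {i}: invalid -> {e}")
```

If that is the cause, removing the trailing commas so each line is a complete JSON object should make all 3 records appear.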





    Thank you Ramanan,

That was it. I just needed to register the HadoopCompat jar. Now it works!


Check the logfile 'pig_1387457753323.log'. You may need to register the HadoopCompat jar in your Pig script.
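For the original poster's script, the suggested fix would look something like this (jar paths are taken from the original post; the compat jar filename is assumed to match the 4.4 release used for the other jars):

```pig
REGISTER /user/rmrodriguez/jar/json-simple-1.1.jar;
REGISTER /user/rmrodriguez/jar/elephant-bird-pig-4.4.jar;
REGISTER /user/rmrodriguez/jar/elephant-bird-core-4.4.jar;
-- The missing piece: this jar provides com.twitter.elephantbird.util.HadoopCompat
REGISTER /user/rmrodriguez/jar/elephant-bird-hadoop-compat-4.4.jar;
REGISTER /user/rmrodriguez/jar/google-collections-1.0.jar;

A = LOAD 'tweets.20131201-215958.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
```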
