Home Forums Hortonworks Sandbox Error running tutorial 2


This topic contains 4 replies, has 4 voices, and was last updated by Jamal Diab 11 months, 2 weeks ago.

  • Creator
    Topic
  • #28380

    I am trying to run the following pig script:

    batting = load 'Batting.csv' using PigStorage(',');
    runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
    grp_data = GROUP runs by (year);
    max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
    join_max_run = JOIN max_runs by ($0, max_runs), runs by (year, runs);
    join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
    dump join_data;

    It's giving me the following error:
    Failed to read data from "hdfs://sandbox:8020/user/hue/Batting.csv"

    I ran each of those lines followed by a dump in the Grunt interactive shell, and it's erroring out on this line:
    max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;

    Any suggestions on how to get past this?

    I appreciate your help!
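
    (One way to sanity-check a "Failed to read data" error like this is to confirm the file really is at the path in the message. This is just a sketch using the Grunt shell's Hadoop fs commands, with the /user/hue/Batting.csv path taken from the error above:)

    fs -ls /user/hue/Batting.csv    -- does the file exist at the path Pig is reading?
    fs -tail /user/hue/Batting.csv  -- peek at the end of the file to confirm it has data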

Viewing 4 replies - 1 through 4 (of 4 total)


  • Author
    Replies
  • #39018

    Jamal Diab
    Member

    I am getting an error. I checked each line, and it seems it's from the last line, which is:
    dump join_data;

    This is the error message:
    # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201310012238_0035_m_000000

    This is the log:

    Input(s):
    Failed to read data from "hdfs://sandbox:8020/user/hue/Batting.csv"

    2013-10-02 00:52:54,768 [main] ERROR org.apache.pig.tools.grunt.Grunt -
    ERROR 1066: Unable to open iterator for alias join_data

    I have run the code several times and it gives the same problem.
    I deleted the Batting.csv file and then uploaded it again; that didn't help.

    What should I do next?

    This is the code I am plugging in:
    batting = LOAD 'Batting.csv' using PigStorage(',');
    runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
    grp_data = GROUP runs BY (year);
    max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
    join_max_run = JOIN max_runs by ($0, max_runs), runs by (year, runs);
    join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
    dump join_data;

    When I run a syntax check, I get this message:

    2013-10-02 01:04:46,222 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
    2013-10-02 01:04:46,266 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
    2013-10-02 01:04:46,282 [main] WARN org.apache.pig.tools.grunt.GruntParser - 'dump' statement is ignored while processing 'explain -script' or '-check'
    script.pig syntax OK
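
    (The IMPLICIT_CAST_TO_DOUBLE warnings usually just mean the columns are loaded as untyped bytearrays, so they get cast to double wherever a number is needed, here in MAX. A hedged sketch of one way to silence them is to declare types on the LOAD; the c2-c7 column names below are placeholders, since only $0, $1 and $8 are used, and this by itself does not fix the failed read, the header row still has to be dealt with as discussed in the other replies:)

    -- placeholder names c2..c7 for the unused columns; only playerID, year and R get types
    batting = LOAD 'Batting.csv' USING PigStorage(',')
              AS (playerID:chararray, year:int, c2, c3, c4, c5, c6, c7, R:int);
    runs = FOREACH batting GENERATE playerID, year, R AS runs;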

    Any suggestions?

    #28876

    tedr
    Moderator

    Hi Michael,

    Thanks for the fixed code. In future updates to the tutorials, the first line of the data file will be dropped, so this error will not come up when loading the data.

    Thanks,
    Ted.

    #28814

    Michael Roux
    Member

    Same problem here. The fixed code looks like:

    batting = LOAD 'Batting.csv' USING PigStorage(',');
    raw_runs = FILTER batting BY $1>0;
    runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
    grp_data = GROUP runs BY (year);
    max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
    join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
    join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
    DUMP join_data;

    #28613

    tedr
    Moderator

    Hi Gopal,

    I believe this is due to a change in how Pig works between when the tutorial was created and the current version of Pig in the Sandbox. The specific change is that Pig used to be permissive and would automatically ignore the first line of the data file, or interpret it as column names; in the new version of Pig that permissiveness was removed. The workaround is to insert a line before the 'runs = FOREACH ...' line to filter out the first row of the file, with something like 'raw_runs = FILTER batting BY $1 > 0' (I'm not exactly sure that line is correct, but it should point you in the right direction).
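
    (A sketch of how that workaround reads in context, assuming the tutorial's layout where column $1 holds the year; it matches the fixed code Michael posted above. The header row's $1 is the text "yearID", so the numeric comparison can't be evaluated for it and that row is dropped:)

    batting = LOAD 'Batting.csv' USING PigStorage(',');
    -- header row: $1 is "yearID", which fails the cast needed for the numeric comparison, so FILTER drops it
    raw_runs = FILTER batting BY $1 > 0;
    runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;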

    Thanks,
    Ted.
