Hortonworks Sandbox Forum

Error running tutorial 2

  • #28380

    I am trying to run the following pig script:

batting = load 'Batting.csv' using PigStorage(',');
    runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
    grp_data = GROUP runs by (year);
    max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
    join_max_run = JOIN max_runs by ($0, max_runs), runs by (year, runs);
    join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
    dump join_data;

It's giving me the following error:
Failed to read data from "hdfs://sandbox:8020/user/hue/Batting.csv"

I ran each of those lines, each followed by a dump, in the grunt interactive shell, and it's erroring out on this line:
    max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;

    Any suggestions on how to get past this?

    I appreciate your help!


  • #28613
    tedr
    Moderator

    Hi Gopal,

I believe this is due to a change in how Pig works between when the tutorial was created and the current version of Pig in the Sandbox. The specific change is that Pig used to be permissive: it would automatically ignore the first line of the data file, or interpret that line as column names. In the new version this permissiveness was removed. The workaround is to insert a line before the 'runs = FOREACH …' line to filter out the first row of the file, with something like 'raw_runs = FILTER batting by >0' (I'm not exactly sure that line is correct, but it should point you in the right direction).

    Thanks,
    Ted.
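
    A minimal sketch of the filter Ted describes, assuming the failure really does come from the header row of Batting.csv ($1 is the year column in the tutorial's file layout):

    batting = LOAD 'Batting.csv' USING PigStorage(',');
    -- Drop the CSV header row: casting the non-numeric header text in $1 to a number
    -- yields null, so the comparison evaluates to null and FILTER discards that row.
    raw_runs = FILTER batting BY $1 > 0;

    The corrected script in the next reply applies the same fix.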

• #28814
    Michael Roux
    Member

Same problem here. The fixed code looks like:

batting = LOAD 'Batting.csv' USING PigStorage(',');
    raw_runs = FILTER batting BY $1>0;
    runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
    grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
    join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
    join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
    DUMP join_data;

• #28876
    tedr
    Moderator

    Hi Michael,

Thanks for the fixed code. In future updates to the tutorials, the first line of the data file will be dropped, so this error will not occur when loading the data.

    Thanks,
    Ted.

• #39018
    Jamal Diab
    Member

I am getting an error. I checked each line, and it seems it's from the last line, which is:
    dump join_data;

This is the error message:
    # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201310012238_0035_m_000000

This is the log:

    Input(s):
Failed to read data from "hdfs://sandbox:8020/user/hue/Batting.csv"

2013-10-02 00:52:54,768 [main] ERROR org.apache.pig.tools.grunt.Grunt -
    ERROR 1066: Unable to open iterator for alias join_data

I have run the code several times and it gives the same problem.
I deleted the Batting.csv file and then uploaded it again; that did not help.

    What should I do next?

This is the code I am plugging in:
batting = LOAD 'Batting.csv' using PigStorage(',');
    runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
    grp_data = GROUP runs BY (year);
    max_runs = FOREACH grp_data Generate group as grp,MAX(runs.runs) as max_runs;
    join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
    join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
    dump join_data;

When I run a syntax check, I get this message:

2013-10-02 01:04:46,222 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2013-10-02 01:04:46,266 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2013-10-02 01:04:46,282 [main] WARN org.apache.pig.tools.grunt.GruntParser - 'dump' statement is ignored while processing 'explain -script' or '-check'
    script.pig syntax OK

    Any suggestions?

