Hortonworks Sandbox Forum

Error while running sand box tutorial for pig script

  • #33124

    Hi Folks,

    I am getting the below error while executing a Pig script from the Sandbox tutorial:

    # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

    Can someone help me proceed?


  • #33259
    Akki Sharma
    Moderator

    Hello Krishna,

    In your mapred-site.xml file, please check the value of the property "mapred.job.reuse.jvm.num.tasks".

    It should be 1. The property entry in the file should look like:

    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <value>1</value>
    </property>

    and run the script again.

    Best Regards,
    Akki
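As a quick way to verify that setting, the value can be read out of mapred-site.xml programmatically. A minimal Python sketch, with an inline sample fragment standing in for the real file (on the sandbox the file typically lives under /etc/hadoop/conf/, but that path is an assumption):

```python
import io
import xml.etree.ElementTree as ET

# Sample mapred-site.xml fragment so this sketch is self-contained;
# point ET.parse at the real file path on your sandbox instead.
sample = """<configuration>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>1</value>
  </property>
</configuration>"""

root = ET.parse(io.StringIO(sample)).getroot()
jvm_reuse = next(
    prop.findtext("value")
    for prop in root.findall("property")
    if prop.findtext("name") == "mapred.job.reuse.jvm.num.tasks"
)
print(jvm_reuse)  # 1
```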

    #33691
    Dave
    Moderator

    Hi Krishna,

    Which tutorial are you hitting an issue on?

    Thanks

    Dave

    #33860

    Thanks Dave and Sharma for the response.

    Sharma, I can see the value is 1 for the property you mentioned.

    Dave, I am trying the baseball statistics in Tutorial 2.

    #34151
    Robert
    Participant

    Hi Krishna,
    Is the problem consistent? Can you try running the Pig script multiple times to verify that it reproduces? If so, please provide the virtualization application you are using and your operating system.

    Regards,
    Robert

    #34719

    Hi Robert,

    Below is the code copied from the tutorial.

    batting = load 'Batting.csv' using PigStorage(',');
    runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
    grp_data = GROUP runs by (year);
    max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
    dump max_runs;

    but it is giving the errors below:

    2013-09-07 07:04:04,968 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2106: Error executing an algebraic function
    2013-09-07 07:04:04,970 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
    2013-09-07 07:04:04,987 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    1.2.0.1.3.0.0-107 0.11.1.1.3.0.0-107 mapred 2013-09-07 07:02:20 2013-09-07 07:04:04 GROUP_BY

    Failed!

    Failed Jobs:
    JobId Alias Feature Message Outputs
    job_201309070520_0002 batting,grp_data,max_runs,runs GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

    After googling, I found the fix: adding the additional FILTER statement shown below.

    batting = LOAD 'Batting.csv' using PigStorage(',');
    runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
    runs = FILTER runs_raw BY runs > 0;
    grp_data = group runs by (year);
    max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
    dump max_runs;

    So what is the difference?
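For reference, the logic of the fixed script can be sketched in Python: compute the per-year maximum of the runs column ($8), after a filter that drops non-positive values, including the header line. The sample rows below are invented for illustration, not taken from the real Batting.csv:

```python
import csv
import io

# Invented sample in the same shape as the tutorial data: the last
# column (index 8, Pig's $8) holds runs, and the first line is a header.
sample = """playerID,yearID,stint,teamID,lgID,G,G_batting,AB,R
aardsda01,2004,1,SFN,NL,11,11,0,0
aaronha01,1954,1,ML1,NL,122,122,468,58
aaronha01,1955,1,ML1,NL,153,153,602,105
"""

max_runs = {}
for row in csv.reader(io.StringIO(sample)):
    year, runs = row[1], row[8]
    # Equivalent of "runs = FILTER runs_raw BY runs > 0": the header
    # value "R" is not numeric, so it is dropped here as well.
    if not runs.isdigit() or int(runs) <= 0:
        continue
    max_runs[year] = max(max_runs.get(year, 0), int(runs))

print(max_runs)  # {'1954': 58, '1955': 105}
```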

    #39166
    Dave
    Moderator

    Hi Krishna,

    This is because Batting.csv has a header row of column names at the top, which cannot be handled by an algebraic function like MAX.
    If you were to remove that first line, then you could remove the filter and it would work.

    Thanks

    Dave
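Dave's point can be illustrated with a rough Python analogy (this mimics the failure mode, not Pig's internals): an aggregate over a column that still contains the header text fails, while filtering down to positive numbers first succeeds.

```python
# The runs column as loaded, with the header text "R" still present.
runs_column = ["R", 0, 58, 105]

# Aggregating over the mixed column fails, loosely analogous to
# Pig's "ERROR 2106: Error executing an algebraic function".
failed = False
try:
    max(runs_column)  # mixed str/int comparison raises TypeError
except TypeError:
    failed = True
print(failed)  # True

# The "FILTER ... BY runs > 0" step keeps only positive numbers,
# discarding the header value, so MAX then works.
numeric_runs = [v for v in runs_column if isinstance(v, int) and v > 0]
print(max(numeric_runs))  # 105
```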

    #40177

    Hello Dave,
    Your suggestion works. I would appreciate it if you could explain in more detail why the row of column names is not valid and needs to be removed.

    In other words, can we rename the column names to make it work instead of removing the header row?

    #40315
    Dave
    Moderator

    Hi Ravi,

    No, renaming the columns will not work.
    This is because the values are passed into an algebraic function and the header values are not numeric.
    This is why applying the filter works, and it is the best practice.

    Thanks

    Dave

The topic ‘Error while running sand box tutorial for pig script’ is closed to new replies.
