Tutorial 2: ERROR 2106: Error executing an algebraic function

This topic contains 19 replies, has 10 voices, and was last updated by Tim Wise 5 months, 4 weeks ago.

  • Creator
    Topic
  • #27288

I have done Tutorial 2 exactly as stated.
    I get the following error:
2013-06-12 10:59:32,536 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-06-12 10:59:32,555 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306120551_0040/attempt_201306120551_0040_m_000000_0/work/pig_1371059910260.log

Strangely, I also can't find the logfile when I log in to the sandbox directly.



  • Author
    Replies
  • #49720

    Tim Wise
    Participant

I ran into this, too, with Sandbox 2.0 using the tutorials on the web site. In the logs I see:


2014-03-06 13:45:42,134 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Exception while executing (Name: year_group: Local Rearrange[tuple]{bytearray}(false) - scope-33 Operator Key: scope-33): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function

Input(s):
Failed to read data from "hdfs://sandbox.hortonworks.com:8020/user/hue/Batting.csv"

After reading this thread, I modified Batting.csv to remove the column-header row, and then it worked.

    Is there not a way in PigStorage() to tell it the file starts with a row of column headers?
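The closest thing I've found is piggybank's CSVExcelStorage, which in newer Pig versions (0.12+, I believe) can skip a header row. Untested on the Sandbox, and the jar path below is a guess:

REGISTER /usr/lib/pig/piggybank.jar; -- jar location is a guess; adjust for your install
batting = LOAD 'Batting.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');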

    #48616

    Philippe Back
    Participant

    This one is working for me:

batting = LOAD '/user/hue/tutorial/day02/Batting.csv' USING PigStorage(',');
-- casts are needed because the first row contains non-numerical values (headings)
runs = FOREACH batting GENERATE (chararray)$0 AS playerID, (int)$1 AS year, (int)$8 AS runs;
DESCRIBE runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
DESCRIBE max_runs;
join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
DESCRIBE join_max_runs;
join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DESCRIBE join_data;
DUMP join_data;

    #32859


    Member

    A more direct solution would be to use explicit type casting.
    Something like:

batting = LOAD 'Batting.csv' USING PigStorage(',');
runs = FOREACH batting GENERATE
    (chararray)$0 AS playerID,
    (int)$1 AS year,
    (int)$8 AS runs;
-- DESCRIBE runs;

grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE
    group AS grp, MAX(runs.runs) AS max_runs;
-- DESCRIBE max_runs;

join_max_run = JOIN
    max_runs BY (grp, max_runs),
    runs BY (year, runs);
-- DESCRIBE join_max_run;

join_data = FOREACH join_max_run GENERATE
    $0 AS year,
    $2 AS playerID,
    $1 AS runs;
DESCRIBE join_data;
STORE join_data INTO 'baseball1';

The real problem here is that one of the rows, the column-header row, contains non-numeric values. Unless I'm missing something, the clean way to address it is to enforce numeric values with explicit casts.
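Since an invalid cast in Pig just yields null (with a warning in the task logs), the header row comes out of the casts above as (text, null, null), and MAX ignores nulls anyway. If you would rather drop that row explicitly, a minimal untested sketch:

-- drop any row where the casts failed (e.g. the header row)
runs = FILTER runs BY year IS NOT NULL AND runs IS NOT NULL;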

    #32817


    Member

I am sure that by now most folks have figured this out. I worked out a solution based on suggestions found through googling. I recently downloaded the sandbox; the original code has not been changed and still causes the error. If the code is changed as below, it generates the required output:

batting = LOAD 'Batting.csv' USING PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;

I added the line "runs = FILTER runs_raw BY runs > 0;" based on solutions I saw and experimented with. That is the only difference. Enjoy!

    #27941

    tedr
    Moderator

    Hi everyone watching this thread,

The script ran under the older version of Pig included with the previous Sandbox because of a bug: that version wasn't enforcing the standard behavior when reading the first row of data. Since the issue is fixed in the latest version of Pig, the script that relied on the bug broke. The script can be fixed as Daniel points out, or the data can be changed so that the column names are not the first row; the latter is the approach the Sandbox folks are going to take.

    Thanks,
    Ted.

    #27928

    Nice work Daniel. That worked like a charm.

    Craig

    #27710

    @Daniel,

    This works indeed.
    Thanks

    Wim

    #27651

    tedr
    Moderator

    Hi Daniel,

Thanks for the info. This points to a difference between how the version of Pig in this version of the Sandbox and the version in previous Sandboxes operate. We'll need to get the issue looked at in the latest version of Pig.

    Thanks,
    Ted.

    #27644

I think it pulls the header row's title string into the numeric data. The runs > 0 comparison is null for that non-numeric value, and FILTER drops rows where the condition isn't true, so once you filter on runs > 0 things seem to work:

batting = LOAD 'Batting.csv' USING PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
runs = FILTER runs_raw BY runs > 0;

grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;

DUMP max_runs;

    #27598

@Sriram and @alex: So when there is an error, which is exactly when you need the log file the most, it gets wiped away? Strange way of working :-)

@all: thanks for looking into it. In the meantime I finished most of the other tutorials without problems. Any suggestions for what I could do now to lift my knowledge of Hadoop to the next level? :-)

    thanks,

    Wim

    #27576

    Sriram Mohan
    Participant

@TedR: I don't think it is a problem with MAX as such in this Sandbox version. I tried using MAX in Tutorial 1 on the nyse-stocks dataset: I uploaded the data into HCatalog and loaded it into Pig as specified in Tutorial 1, and MAX works great there.

MAX and MIN fail on the Batting.csv dataset in Tutorial 2. It seems to be a problem either specific to this data set or with loading directly from HDFS without using HCatalog.

    #27575

    tedr
    Moderator

    Hi Guys,

Yup, AVG works and MAX doesn't. I'm digging in to see why MAX fails in this build of the Sandbox, and I'm going to see if it will run on a different cluster with the same version of Pig.

    @Alex,

    If you change the avg to AVG it should run.
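That is, since built-in function names like AVG and MAX are case-sensitive in Pig, the GENERATE line becomes:

max_runs = FOREACH grp_data GENERATE group as grp, AVG(runs.runs) as max_runs;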

    Thanks,
    Ted.

    #27569

    alex Gordon
    Member

    When I use AVG instead of MAX, I am getting:

2013-06-14 14:38:25,694 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:38:25,695 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
2013-06-14 14:38:26,198 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:38:26,438 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:38:26,824 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:38:28,238 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve avg using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log

    here’s my code:

batting = load 'batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, avg(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year, runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

    #27567

    alex Gordon
    Member

    I am also getting the same error in the same tutorial. Has there been a resolution?

2013-06-14 13:26:18,222 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-06-14 13:26:19,866 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201306140401_0017
2013-06-14 13:26:19,866 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2013-06-14 13:26:19,867 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[1,10],runs[2,7],max_runs[4,11],grp_data[3,11] C: max_runs[4,11],grp_data[3,11] R: max_runs[4,11]
2013-06-14 13:26:19,867 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://sandbox:50030/jobdetails.jsp?jobid=job_201306140401_0017
2013-06-14 13:27:26,511 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-06-14 13:27:30,379 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-06-14 13:27:30,379 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201306140401_0017 has failed! Stop running all dependent jobs
2013-06-14 13:27:30,381 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete

2013-06-14 13:27:30,435 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2106: Error executing an algebraic function
2013-06-14 13:27:30,435 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-06-14 13:27:30,453 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
    2013-06-14 13:27:30,453 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats – Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    1.2.0.1.3.0.0-107 0.11.1.1.3.0.0-107 mapred 2013-06-14 13:26:10 2013-06-14 13:27:30 HASH_JOIN,GROUP_BY

    Failed!

    Failed Jobs:
    JobId Alias Feature Message Outputs
job_201306140401_0017 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201306140401_0017_m_000000

    Input(s):
Failed to read data from "hdfs://sandbox:8020/user/hue/Batting.csv"

    Output(s):

    Counters:
    Total records written : 0
    Total bytes written : 0
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:
    job_201306140401_0017 -> null,
    null

2013-06-14 13:27:30,453 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-06-14 13:27:30,455 [main] ERROR org.

    #27564

    alex Gordon
    Member

@wim derw… in fact it does create the log file, but right after it finishes writing to it, it deletes the entire contents of the directory. I don't know why.

    #27555

    Sriram Mohan
    Participant

I am able to access the logfile. Also, if you get rid of the max function and use an avg function, it works. For instance:

batting = LOAD 'Batting.csv' using PigStorage(',');
bat = FOREACH batting GENERATE $0 as playerID, $1 as yearID, $8 as runs;
year = GROUP bat by (yearID);
avgData = FOREACH year GENERATE avg(bat.runs) as avg_runs;
dump avgData;

When you use MAX, it claims it is unable to read the file on my end as well.

    #27553

    tedr
    Moderator

    Hi Sriram,

In my testing it seemed that, for some reason, Pig was unable even to read the datafile. I'm digging into that.

    thanks,
    Ted

    #27552

    Sriram Mohan
    Participant

I have the exact same issue. It seems to be associated with the MAX function call for this particular data set. MAX works fine on other datasets (Tutorial 1, for instance), and using avg works fine on the dataset in Tutorial 2. Any help will be greatly appreciated.

    #27310

    tedr
    Moderator

    Hi Wim,

I too am having problems getting the job to run successfully as written in the tutorial. I'm looking into what could be causing this.

    Thanks,
    Ted.
