How To Process Data with Apache Pig

What is Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete, in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages such as JRuby, Jython and Java. Conversely, you can execute Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
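
As a quick, hypothetical sketch of the UDF facility (the jar name, class name and input file below are placeholders, not part of this tutorial), a Java UDF is registered from a jar and then invoked inside a script:

-- myudfs.jar and com.example.pig.ToUpper are placeholder names for your own UDF jar and class
REGISTER myudfs.jar;
DEFINE TOUPPER com.example.pig.ToUpper();
players = LOAD 'players.csv' USING PigStorage(',');
upper_ids = FOREACH players GENERATE TOUPPER($0);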

A good example of a Pig application is the ETL transaction model that describes how a process will extract data from a source, transform it according to a rule set and then load it into a datastore. Pig can ingest data from files, streams or other sources using User Defined Functions (UDF). Once it has the data it can perform select, iterate, and other transforms over the data. Again, the UDF feature allows passing the data to more complex algorithms for the transform. Finally, Pig can store the results into the Hadoop Distributed File System.
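
Sketched in Pig Latin, a minimal version of that extract-transform-load flow might look like this (the file names and the filter rule are illustrative only, not part of this tutorial):

-- extract: read raw comma-separated records (illustrative file name)
raw_events = LOAD 'input/events.csv' USING PigStorage(',');
-- transform: keep only rows whose first field is present (illustrative rule)
good_events = FILTER raw_events BY $0 IS NOT NULL;
-- load: write the results back out to HDFS
STORE good_events INTO 'output/clean_events' USING PigStorage(',');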

Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster. As part of the translation the Pig interpreter performs optimizations to speed execution on Apache Hadoop. We are going to write a Pig script that will do our data analysis task.

Our data processing task

We are going to read in a baseball statistics file and compute the highest number of runs scored by a player in each year. The file contains all the statistics from 1871 through 2011 and has over 90,000 rows. Once we have the highest runs we will extend the script to translate a player ID field into the first and last names of the players.

Downloading the data

The data file we are using comes from the site www.seanlahman.com. You can download it as a zip archive of CSV files from:

http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

Once you have the file you will need to unzip it into a directory. We will be uploading just the Master.csv and Batting.csv files.

Uploading the data files

We start by selecting the File Browser from the top tool bar. The File Browser allows us to view the Hortonworks Data Platform (HDP) file store. This is separate from the local file system. In a Hadoop cluster this would be your view of the Hadoop Distributed File System (HDFS). For the Hortonworks Sandbox it will be part of the file system in the Hortonworks Sandbox VM.


Click on the Upload button to select the files we want to upload into the Hortonworks Sandbox environment.

When you click on the Upload a file button you will get a dialog box. Navigate to where you stored the Batting.csv file on your local disk and select Batting.csv. Do the same thing for Master.csv. When you are done you will see there are two files in your directory.
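
If you prefer the command line and have shell access to the sandbox, the same upload can be done with the HDFS client; /user/hue is the directory the File Browser shows in this environment, so adjust the paths if yours differs:

hadoop fs -put Batting.csv /user/hue/
hadoop fs -put Master.csv /user/hue/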

Now that we have our data files we can start writing our Pig script. Click on the Pig icon at the top of the screen.


We see the Pig user interface in our browser window. On the left is a list of the saved scripts. On the right is the composition area where we will be writing our script. Below the composition area are buttons to Save, Execute, Explain and perform a Syntax check of the current script. At the very bottom are status boxes where we will see logs, error messages and the output of our script.


To get started, fill in a name for your script. You cannot save it until we add our first line of code. The first thing we need to do is load the data. We use the load statement for this. The PigStorage function does the loading, and we pass it a comma as the data delimiter. Our code is:

batting = load 'Batting.csv' using PigStorage(',');

The next thing we want to do is name the fields. We will use a FOREACH statement to iterate through the batting data object. We can use the Pig Helper at the bottom of the composition area to provide us with a template. We click on Pig Helper, select Data processing functions and then click on the FOREACH template. We can then replace each template element by hitting the tab key.


So the FOREACH statement will iterate through the batting data object, and GENERATE pulls out the selected fields and assigns them names. The new data object we are creating is named runs. Our code will now be:

runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;

The next line of code is a GROUP statement that groups the elements in runs by the year field. The grp_data object will therefore be indexed by year, and in the next statement, as we iterate through grp_data, we will go through it year by year. Type in the code:

grp_data = GROUP runs by (year);

In the next FOREACH statement we are going to find the maximum runs for each year. Here group refers to the grouping key (the year), and runs.runs projects the runs field out of each bag of grouped rows so that MAX can be applied to it. The code for this is:

max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;

Now that we have the maximum runs, we need to join this with the runs data object so we can pick up the player ID. The result will be a dataset with Year, PlayerID and Max Runs. At the end we dump the data to the output.

join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);  
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;  
dump join_data;
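
If you want to inspect the schema Pig has built up at each step, you can optionally sprinkle describe statements through the script; they are a debugging aid, not part of the tutorial's script as written:

describe runs;
describe grp_data;
describe max_runs;
describe join_data;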

Let’s take a look at our script. The first thing to notice is that we never really address single rows of data: on the left of the equals sign we simply name a new data object, and on the right we describe what we want to do to each row, assuming it is applied to all the rows. We also have powerful operators like GROUP and JOIN to sort rows by a key and to build new data objects.

At this point we can save our script. Fill in a name in the box below “Pig script:” if you haven’t already. Click on the Save button and your script will show up in the bar on the left.

We can execute our code by clicking on the execute button at the bottom of the composition area. As the jobs are run you will get a progress bar at the bottom.


When the job completes, the results are displayed in the green box at the bottom.


If you scroll down to the “Logs…” and click on the link you can see the log file of your jobs.


So we have created a simple Pig script that reads in some comma-separated data. Once we have that set of records in Pig, we pull out the playerID, year and runs fields from each row. We then group them by year with one statement, GROUP. Then for each year we find the maximum runs. This is finally mapped back to the playerID and we produce our final dataset.
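
Putting it all together, the complete script from the steps above is:

batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;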

As mentioned before, Pig operates on data flows. We consider each group of rows together and we specify how we operate on them as a group. As the datasets get larger and/or gain fields, our Pig script will remain pretty much the same because it concentrates on how we want to manipulate the data.

Comments

Ethels | February 14, 2014 at 10:35 am

This was really insightful and inspiring. Thank you so much for this, I really appreciate you guys.

Cheers!

S,Moalla | February 24, 2014 at 5:14 pm

The type needs to be specified in order for MAX to work:
batting = load 'Batting.csv' using PigStorage(',') AS (playerID:chararray,yearID:int,stint:int,teamID:chararray,lgID:chararray,G:int,G_batting:int,AB:int,R:int,H:int,att1:int,att3:int,HR:int,RBI:int,SB:int,CS:int,BB:int,SO:int,IBB:int,HBP:int,SH:int,SF:int,GIDP:int,G_old:int);
runs = FOREACH batting GENERATE $0 as (playerID:chararray), $1 as (year:int), $8 as (runs:int);
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

Torrey | March 5, 2014 at 3:24 pm

The script no longer works. I had to make the below changes.
batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE (chararray)$0 as playerID, (int)$1 as year, (int)$8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_runs = JOIN max_runs by ($0, max_runs), runs by (year, runs);
join_data = FOREACH join_max_runs GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;

Satish | March 18, 2014 at 4:30 am

It is a really nice tutorial. Keep up the good work.

Bilal Abu Salih | April 1, 2014 at 9:04 am

That was a wonderful tutorial, appreciated.

Aran | May 5, 2014 at 9:41 pm

The code doesn’t work. I found a working version on your forum:

batting = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) AS max_runs;
join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;

    Jim M | June 8, 2014 at 2:32 am

    Thank you for posting this here, Aran. Helped me get through this tutorial.

    Keith | June 13, 2014 at 8:06 am

    Thank you Aran, your version of the code works for me.

    Mauricio | June 18, 2014 at 6:59 pm

    Thanks Aran for sharing the code in this forum. Actually the code example is not wrong in terms of logical flow, but it is corrupted in line 5, where the statement “join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);” has one typo: in the expression “(year,runs)” the variables are not separated by a space, and this simple error will crash the job. I’ve spent a lot of time debugging jobs for typos like this, and I can tell you they are a pain in the neck. I hope you will find this post useful.

    Piotr Sobolewski | June 25, 2014 at 12:27 pm

    Thanks for this solution. It works.

    Ben | June 30, 2014 at 2:03 pm

    Thank you!

    Chris H. | July 3, 2014 at 1:15 pm

    Thanks for reposting this Aran, it was very helpful.

    Satya | July 17, 2014 at 3:03 pm

    Thanks Aran for the working code. It looks like the Filter statement made the code work.

    Ben | August 12, 2014 at 12:58 pm

    Good work.

    If you are copy and pasting Aran’s code, be sure to paste into notepad or notepad++ to make sure the apostrophes don’t become formatted incorrectly in line 1.

    batting = LOAD 'Batting.csv' USING PigStorage(',');

Peter Klavins | May 15, 2014 at 6:01 am

Great tutorial, thanks! The data and/or Pig and/or HDP have moved on and the example doesn’t work as at 15 May 2014. I’ve made three changes: 1) Add FILTER to remove header line from input file; 2) Replace $0 with column name ‘grp’ when calculating ‘join_max_run’ (didn’t work otherwise); 3) Replace $1 with $4 when calculating ‘join_data’ (output original string instead of calculated float value for better visualisation). The resultant working version is as follows:

batting = LOAD 'Batting.csv' USING PigStorage(',');
batting_minus_header = FILTER batting BY $1!='yearID';
runs = FOREACH batting_minus_header GENERATE $0 AS playerID, $1 AS year, $8 as runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
join_max_run = JOIN max_runs BY (grp, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $4 AS runs;
DUMP join_data;

Matt | July 8, 2014 at 11:46 pm

Hi all, I wonder if someone could help me. As a complete neophyte with Hadoop and Pig I could do with some basic explanation of what exactly is happening in the code below, a step-by-step guide to what is actually happening under the covers. Can anyone help me or guide me to a website that would assist?

max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;

I understand this much:
Create a relation called “max_runs”; this is created by stepping through each row in the “grp_data” relation and creating…

Then I’m kind of lost. I know what the output is, but I don’t really UNDERSTAND it well enough to take this away and apply it elsewhere.

Thanks for any help

    Brad Stone | July 24, 2014 at 7:34 am

    Matt,

    The original script uses duplicate names for variables, so some of the lines, like the one you described, are confusing (e.g. runs and runs).

    The following script attempts to describe what is happening at each step and outputs the first five lines of the results at each step. Hopefully this will help.


    batting = LOAD 'Batting.csv' USING PigStorage(',');

    -- Strip off the first row (column headings) so the Max function can be used later without errors
    raw_runs = FILTER batting BY $1>0;

    -- Create a table with all rows, but only 3 columns
    -- Columns are numbered starting with zero, so the first column is $0, the second is $1, etc.
    all_runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
    -- Show sample output of all_runs
    limit_all_runs = limit all_runs 5;
    describe all_runs;
    dump limit_all_runs;

    -- Group by year
    grp_data = GROUP all_runs BY (year);
    -- Show sample output of grp_data
    limit_grp_data = limit grp_data 5;
    describe grp_data;
    dump limit_grp_data;

    -- Create a table that contains each year and the max runs for that year
    max_runs_year = FOREACH grp_data GENERATE group as max_year, MAX(all_runs.runs) AS max_runs;
    -- Show sample output of max_runs_year
    limit_max_runs_year = limit max_runs_year 5;
    describe max_runs_year;
    dump limit_max_runs_year;

    -- Join max_runs_year and all_runs by matching on both year and runs to find the playerID with the max runs each year
    join_max_runs = JOIN max_runs_year BY (max_year, max_runs), all_runs BY (year, runs);
    -- Show sample output of join_max_runs
    limit_join_max_runs = limit join_max_runs 5;
    describe join_max_runs;
    dump limit_join_max_runs;

    -- Clean up the output so that only the year, playerID, and the maximum runs are included (columns zero, two and four)
    join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $4 AS runs;
    -- Show sample output of join_data
    limit_join_data = limit join_data 5;
    describe join_data;
    dump limit_join_data;

      Saurabh Agrawal | September 8, 2014 at 3:46 am

      Thanks Brad, this was extremely helpful.

      Max Mir | September 24, 2014 at 10:27 pm

      Very helpful! Thank you! Perhaps you can contribute/modify this article? We need people like you who can explain things succinctly and clearly as you’ve done in your reply.

      Srini Kesavan | November 2, 2014 at 2:30 pm

      Excellent explanation. Thanks

      Thanuja | November 24, 2014 at 2:56 am

      Extremely helpful and clear code segments. Now I understand the original article’s code.

      Anne Troop | November 30, 2014 at 3:28 pm

      Thank you! Debugging the tutorial with distinct names increased the usefulness of the tutorial. Good suggestion!

      Matt | April 15, 2015 at 3:36 am

      Hi Brad, sorry I never responded to this. Can’t believe it’s a year since I started this! But now I have some time to pick up tutorials again.

      Just wanted to thank you for your explanation here; it now means much more than just “what the point is”, it explains what the code is actually doing “under the hood/bonnet/covers”. Also, in my confusion I had forgotten to use describe and dump regularly to monitor what’s being generated/understood by the script.

      Once again, thanks for such a complete dissection of the code and explanation

Murali | July 24, 2014 at 8:20 am

Thanks for sharing this article, it is helpful

ramesh | September 15, 2014 at 10:20 am

Thanks for the useful information.

Quick question: how can we accomplish the transformation of each field (if the file has 100 fields) based on the field name and its UDF rule in a lookup file?

So we read the file, look up the related UDF based on the field name, and apply it.

Thanks
Ramesh

bijender | September 26, 2014 at 11:48 am

It is good and very useful for learning Pig. We need more tutorials like this, but with more complexity.

thanks
br/bijender

Anand | October 6, 2014 at 7:07 am

Awesome, really helpful to understand the concepts.

Theodore Wong | October 9, 2014 at 3:48 pm

To get the tutorial to run, I had to change the JOIN statement to select the runs column from max_runs correctly:

join_max_run = JOIN max_runs BY ($0, runs), runs BY (year, runs);

Dhanesh Kothari | October 16, 2014 at 6:39 am

I am getting an error:
Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_1413436689840_0036 batting,grp_data,max_runs,raw_runs,runs MULTI_QUERY,COMBINER Message: Job failed!

Input(s):
Failed to read data from “hdfs://sandbox.hortonworks.com:8020/user/hue/Batting.csv”

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1413436689840_0036 -> null,
null

2014-10-16 06:31:53,360 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Failed!
2014-10-16 06:31:53,360 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 1066: Unable to open iterator for alias join_data

Nalin | November 4, 2014 at 1:45 pm

I have the code as below
batting = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) AS max_runs;
join_max_runs = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_runs GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;

When I execute it, it fails; below are the logs. It seems the job is not executing.

[04/Nov/2014 12:43:25 +0000] access WARNING 192.168.49.1 hue – “GET /logs HTTP/1.0″
[04/Nov/2014 12:43:23 +0000] middleware INFO Processing exception: Could not find job application_1415146267118_0007. The job might not be running yet.: Traceback (most recent call last):
File “/usr/lib/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/core/handlers/base.py”, line 100, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File “/usr/lib/hue/apps/jobbrowser/src/jobbrowser/views.py”, line 58, in decorate
raise PopupException(_(‘Could not find job %s. The job might not be running yet.’) % jobid, detail=e)
PopupException: Could not find job application_1415146267118_0007. The job might not be running yet.
[04/Nov/2014 12:43:23 +0000] resource DEBUG GET Got response: <!DOCTYPE html PUBLIC "-//W3C//D…
[04/Nov/2014 12:43:23 +0000] http_client DEBUG GET http://sandbox.hortonworks.com:8088/proxy/application_1415146267118_0007/ws/v1/mapreduce/jobs/job_1415146267118_0007
[04/Nov/2014 12:43:23 +0000] resource DEBUG GET Got response: {"app":{"id":"application_141514…
[04/Nov/2014 12:43:23 +0000] http_client DEBUG GET http://sandbox.hortonworks.com:8088/ws/v1/cluster/apps/application_1415146267118_0007

Please help, I have just started learning and this is my first Pig script.

Ahil PonArul | November 21, 2014 at 12:13 pm

I keep getting this error at the top of my logs
ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory

I am not sure how this can be happening. Anyone else getting this?

Thanuja | November 24, 2014 at 2:58 am

This article and the Hortonworks YouTube video were helpful. Also, thanks to the guys who posted working code segments (especially the FILTER code) for sharing. Hopefully we will talk again on the site soon.

Jumsheed | December 12, 2014 at 1:17 pm

For a few commands I got this warning:

WARN [main] org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
What does that mean?

Chidi | December 15, 2014 at 12:55 am

I have issues executing this script successfully: the status bar shows the script running fine, but towards the end it fails, and when I look at the log I get the failure message below. Please can anyone assist me with this?

2014-12-14 23:42:00,820 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Failed!
2014-12-14 23:42:00,824 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 1066: Unable to open iterator for alias join_data
Details at logfile: /hadoop/yarn/local/usercache/hue/appcache/application_1418056014812_0015/container_1418056014812_0015_01_000002/pig_1418629302506.log

Aarthi | December 16, 2014 at 3:22 pm

When I executed the commands below I received the following error and log file message.
Can anyone help me with this?
grunt> REGISTER /opt/ibm/biginsights/pig/contrib/piggybank/java/piggybank.jar;
grunt> records = LOAD 'googlebooks-1988.csv' AS (word: chararray, year: int, wordcount: int, pagecount: int, bookcount: int);
grunt> grouped = GROUP records BY org.apache.pig.piggybank.evaluation.string.LENGTH(word);
grunt> final = FOREACH grouped GENERATE group, SUM(records.wordcount);
grunt> DUMP final;
ERROR [main] org.apache.pig.tools.grunt.Grunt – ERROR 2017: Internal error creating job configuration.
Details at logfile: /home/biadmin/pig_1418766622888.log
Log file message displayed below
ERROR 2017: Internal error creating job configuration.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias final
at org.apache.pig.PigServer.openIterator(PigServer.java:857)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:490)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias final
at org.apache.pig.PigServer.storeEx(PigServer.java:956)
at org.apache.pig.PigServer.store(PigServer.java:919)
at org.apache.pig.PigServer.openIterator(PigServer.java:832)
… 12 more
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:731)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:259)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:180)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
at org.apache.pig.PigServer.storeEx(PigServer.java:952)
… 14 more
Caused by: java.lang.IllegalArgumentException
at java.util.zip.ZipInputStream.getUTF8String(ZipInputStream.java:329)
at java.util.zip.ZipInputStream.getFileName(ZipInputStream.java:448)
at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:267)
at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:93)
at java.util.jar.JarInputStream.getNextEntry(JarInputStream.java:141)
at java.util.jar.JarInputStream.getNextJarEntry(JarInputStream.java:178)
at org.apache.pig.impl.util.JarManager.mergeJar(JarManager.java:212)
at org.apache.pig.impl.util.JarManager.mergeJar(JarManager.java:206)
at org.apache.pig.impl.util.JarManager.createJar(JarManager.java:126)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:415)
… 19 more

Aneesh | December 29, 2014 at 10:30 pm

Images and screenshots for the tutorial are not displaying.

alexander | December 31, 2014 at 7:29 am

I followed your exact steps on a clean installation of the sandbox and got the following error:
Input(s):
Failed to read data from “hdfs://sandbox.hortonworks.com:8020/user/hue/Batting.csv”

Murali | January 29, 2015 at 2:04 pm

The above code doesn’t work. I have added an additional filter (runs1).

Change this to:

batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs1= FILTER runs BY runs > 0;
grp_data = GROUP runs1 by (year);
max_runs_year = FOREACH grp_data GENERATE group as max_year, MAX(runs1.runs) AS max_runs;
dump max_runs_year;

It works.
