How To Process Data with Apache Pig

What is Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility, you can have Pig invoke code written in languages such as JRuby, Jython and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

A good example of a Pig application is the ETL (extract, transform, load) transaction model, which describes how a process will extract data from a source, transform it according to a rule set and then load it into a datastore. Pig can ingest data from files, streams or other sources using User Defined Functions (UDFs). Once it has the data it can perform select, iteration, and other transforms over the data. Again, the UDF feature allows passing the data to more complex algorithms for the transform. Finally, Pig can store the results into the Hadoop Distributed File System (HDFS).

Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster. As part of the translation the Pig interpreter performs optimizations to speed up execution on Apache Hadoop. We are going to write a Pig script that will do our data analysis task.

Our data processing task

We are going to read in a baseball statistics file and compute the highest number of runs scored by a player in each year. The file has all the statistics from 1871–2011 and contains over 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.

Downloading the data

The data file we are using comes from the site You can download the data file in csv zip form from:

Once you have the file you will need to unzip it into a directory. We will be uploading just the Master.csv and Batting.csv files.

Uploading the data files

We start by selecting the HDFS Files view from the Off-canvas menu at the top. The HDFS Files view allows us to view the Hortonworks Data Platform (HDP) file store. This is separate from the local file system. For the Hortonworks Sandbox it will be part of the file system in the Hortonworks Sandbox VM.

Navigate to /user/admin and click on the Upload button to select the files we want to upload into the Hortonworks Sandbox environment.

When you click on the browse button you will get a dialog box. Navigate to where you stored the Batting.csv file on your local disk, select Batting.csv and click Upload again. Do the same thing for Master.csv. When you are done you will see the two files in your directory.

Now that we have our data files we can start writing our Pig script. Click on the Pig button from the Off-canvas menu.

We see the Pig user interface in our browser window. On the left we can choose between our saved Pig Scripts, UDFs and the Pig Jobs executed in the past. To the right of this menu bar we see our saved Pig Scripts.

To get started, push the "New Script" button at the top right and fill in a name for your script. If you leave the “Script HDFS Location” field empty, it will be filled in automatically.

After clicking on “create”, a new page opens.
At the center is the composition area where we will be writing our script. At the top right of the composition area are buttons to Execute, Explain and perform a Syntax check of the current script.

At the left are buttons to save, copy or delete the script and at the very bottom we can add an argument.

The first thing we need to do is load the data. We use the load statement for this. The PigStorage function is what does the loading and we pass it a comma as the data delimiter. Our code is:

batting = load 'Batting.csv' using PigStorage(',');

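The first row of the file is a header line rather than data. Its second field is a column name, not a year, so we can filter it out by keeping only the rows whose second field is greater than zero: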

    raw_runs = FILTER batting BY $1>0;
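To see why this filter removes the header, here is a minimal sketch in plain Python (not Pig). It assumes that a text field which cannot be converted to a number is treated as null and that a null never satisfies the comparison, which is our understanding of how Pig handles the cast. The sample rows, player ids and the helper names `to_int_or_none`, `load` and `filter_positive_second_field` are hypothetical and only for illustration.

```python
import csv
import io

# Hypothetical three-line sample shaped like Batting.csv:
# a header row followed by comma-separated records.
SAMPLE = """playerID,yearID,stint
playerA01,2004,1
playerB01,1871,1
"""

def to_int_or_none(text):
    """Mimic a cast that yields a null (None) instead of raising."""
    try:
        return int(text)
    except ValueError:
        return None

def load(text):
    """Split each line on commas; fields are addressed by position only."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]

def filter_positive_second_field(rows):
    """Keep rows whose second field ($1) casts to an integer above zero."""
    kept = []
    for row in rows:
        year = to_int_or_none(row[1])
        if year is not None and year > 0:
            kept.append(row)
    return kept

batting = load(SAMPLE)
raw_runs = filter_positive_second_field(batting)
```

In this sketch `batting` holds three rows including the header, while `raw_runs` holds only the two data rows.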

The next thing we want to do is name the fields. We will use a FOREACH statement to iterate through the raw_runs data object. We can use the Pig Helper at the bottom of the composition area to provide us with a template. We will click on Pig Helper, select Data processing functions and then click on the FOREACH template. We can then replace each element by hitting the tab key.

So the FOREACH statement will iterate through the raw_runs data object and GENERATE pulls out selected fields and assigns them names. The new data object we are creating is then named runs. Our code will now be:

runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;

The next line of code is a GROUP statement that groups the elements in runs by the year field. So the grp_data object will then be indexed by year. In the next statement as we iterate through grp_data we will go through year by year. Type in the code:

grp_data = GROUP runs by (year);
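As a rough picture of what the grouped data looks like, here is a plain-Python sketch (not Pig) of grouping by year. The tuples, the player ids and the function name `group_by_year` are hypothetical. The dictionary only approximates the result of GROUP, where each year is paired with the bag of rows that share it.

```python
from collections import defaultdict

# Hypothetical (playerID, year, runs) tuples standing in for the runs object.
runs = [
    ("playerA01", 2001, 90),
    ("playerB01", 2001, 120),
    ("playerC01", 2002, 75),
]

def group_by_year(rows):
    """Return {year: [rows for that year]}, similar in spirit to GROUP ... BY."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)

grp_data = group_by_year(runs)
```

Each key in `grp_data` is a year, and each value still contains the complete rows for that year, which is why the next statement can refer to the runs field inside every group.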

In the next FOREACH statement we are going to find the maximum runs for each year. The code for this is:

max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
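The same step can be sketched in plain Python (not Pig). It assumes the grouped data is a dictionary keyed by year. The sample values and the function name `max_runs_per_year` are hypothetical.

```python
# Hypothetical grouped data: year -> list of (playerID, year, runs) tuples.
grp_data = {
    2001: [("playerA01", 2001, 90), ("playerB01", 2001, 120)],
    2002: [("playerC01", 2002, 75)],
}

def max_runs_per_year(groups):
    """For each year emit (year, highest runs value found in that group)."""
    return [
        (year, max(row[2] for row in rows))
        for year, rows in sorted(groups.items())
    ]

max_runs = max_runs_per_year(grp_data)
```

Note that the result has only the year and the maximum. The player id is gone at this point, which is why a join back to the runs data is needed.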

Now that we have the maximum runs we need to join this with the runs data object so we can pick up the player id. The result will be a dataset with Year, PlayerID and Max Run. At the end we DUMP the data to the output.

join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);  
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;  
DUMP join_data;
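The join matches on two keys at once, the year and the runs value. A plain-Python sketch (not Pig) with hypothetical sample rows shows the effect, including what happens when two players tie for the highest runs in a year. The function name `join_on_year_and_runs` is made up for this example, and the set lookup stands in for an inner join because each (year, max) pair is unique.

```python
# Hypothetical (playerID, year, runs) rows; two players tie in 2002.
runs = [
    ("playerA01", 2001, 90),
    ("playerB01", 2001, 120),
    ("playerC01", 2002, 75),
    ("playerD01", 2002, 75),
]

# Hypothetical (year, max_runs) rows as produced by the previous step.
max_runs = [(2001, 120), (2002, 75)]

def join_on_year_and_runs(maxima, rows):
    """Keep rows whose (year, runs) pair equals a (year, max_runs) pair,
    then reorder the fields to (year, playerID, runs)."""
    wanted = set(maxima)
    return [
        (year, player, scored)
        for (player, year, scored) in rows
        if (year, scored) in wanted
    ]

join_data = join_on_year_and_runs(max_runs, runs)
```

With this sample, `join_data` has one row for 2001 and two rows for 2002, so a tie produces more than one row for that year.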

Let’s take a look at our script. The first thing to notice is that we never address single rows of data: to the left of the equals sign we name a whole data object, and on the right we just describe what we want to do for each row. We just assume things are applied to all the rows. We also have powerful operators like GROUP and JOIN to group rows by a key and to build new data objects.

At this point we can save our script.

We can execute our code by clicking on the execute button at the top right of the composition area, which opens a new page.

As the jobs are run we will get status boxes where we will see logs, error messages, the output of our script and our code at the bottom.

If you scroll down to the “Logs…” and click on the link you can see the log file of your jobs. We should always check the logs to confirm that the script was executed correctly.

So we have created a simple Pig script that reads in some comma separated data.
Once we have that set of records in Pig we pull out the playerID, year and runs fields from each row.
We then group them by year with one statement, GROUP.
Then we find the maximum runs for each year.
This is finally mapped to the playerID and we produce our final dataset.

As mentioned before, Pig operates on data flows. We consider each group of rows together and we specify how we operate on them as a group. As the datasets get larger or gain fields, our Pig script will remain pretty much the same because it concentrates on how we want to manipulate the data.


February 14, 2014 at 10:35 am

This was really insightful and inspiring. Thank you so much for this, I really appreciate you guys.


March 18, 2014 at 4:30 am

It's a really nice tutorial. Keep up the good work.

Bilal Abu Salih
April 1, 2014 at 9:04 am

That was a wonderful tutorial, appreciated.

    August 12, 2015 at 5:15 am

    yes wonderful . it was not possible using good old SQL ???

Jim M
June 8, 2014 at 2:32 am

Thank you for posting this here, Aran. Helped me get through this tutorial.

June 13, 2014 at 8:06 am

Thank you Aran, your version of the code works for me.

Piotr Sobolewski
June 25, 2014 at 12:27 pm

Thanks for this solution. It works.

June 30, 2014 at 2:03 pm

Thank you!

July 24, 2014 at 8:20 am

Thanks for sharing this article, it is helpful

Saurabh Agrawal
September 8, 2014 at 3:46 am

Thanks Brad, this was extremely helpful.

Max Mir
September 24, 2014 at 10:27 pm

Very helpful! Thank you! Perhaps you can contribute/modify this article? We need people like you who can explain things succinctly and clearly as you’ve done in your reply.

September 26, 2014 at 11:48 am

it is good and very useful to learn pig, we need more tutorial like this but have more complexity


October 6, 2014 at 7:07 am

Awesome ..really helpful to understand the concepts

Theodore Wong
October 9, 2014 at 3:48 pm

To get the tutorial to run, I had to change the JOIN statement to select the runs column from max_runs correctly:

join_max_run = JOIN max_runs BY ($0, runs), runs BY (year, runs);

Srini Kesavan
November 2, 2014 at 2:30 pm

Excellent explanation. Thanks

November 24, 2014 at 2:56 am

Extremely helpful and clear code segments by you. Now I did understand the original article coding.

November 24, 2014 at 2:58 am

This article and HortonWorks youtube video was helpful. Also for the guys who put working coding segment (especially the FILTER code) thanks for sharing. Hopefully we talk soon in the site again

Anne Troop
November 30, 2014 at 3:28 pm

Thank you! Debugging the tutorial with distinct names increased the usefulness of the tutorial. Good suggestion!

April 15, 2015 at 3:36 am

Hi Brad, Sorry I never responded to this, Can’t believe its a year since I started this! But now have some time to pickup tutorials again.

Just wanted to thank you for your explanation here, it now means much more than just “what the point is” its explaining what the code is actually doing “under the hood/bonnet/Covers” Also in confusion I forget the use of describe, and dump regularly to monitor what’s being generated/understood by the script

Once again, thanks for such a complete dissection of the code and explanation

July 4, 2015 at 4:30 am

Excellent.. Simply superb..
Very easy to understand without naming conflicts and really impressive

July 18, 2015 at 6:04 pm

Hi, the tutorial is quite helpful and well explained. However, it lacks one thing: a picture of the data. I think if a snapshot of the data object were provided after each code block, the tutorial would be much easier to understand.


August 12, 2015 at 5:13 am

yes , this is wonderful , This was not possible using the good old SQL ??

Manish Joshi
November 3, 2015 at 6:48 pm

It’s very useful for new learners, and description is in very simple and understanding language.

November 11, 2015 at 1:42 pm

Great tutorial.

Some of the images seem to have gone missing over time.
