In this tutorial you will gain a working knowledge of Pig through the hands-on experience of creating Pig scripts to carry out essential data operations and tasks.
We will first read in two data files that contain New York Stock Exchange dividend prices and stock prices, and then use these files to perform a number of Pig operations including:
- Define a relation with and without
- Define a new relation from an
Selectspecific columns from within a relation
- Sort the data using
- FILTER and Group the data using
What is Pig?
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.
Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig. Through the User Defined Functions(UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
Pig works with data from many sources, including structured and unstructured data, and store the results into the Hadoop Data File System.
Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster.
Download the Data
You’ll need sample data for this tutorial. The data set you will be using is stock ticker data from the
New York Stock Exchange from the years 2000-2001. Download this sample data from the following location:
The file is about 11 megabytes, and might take a few minutes to download.
Open the folder infochimps_dataset_4778_download_16677 > NYSE and locate the two data files that you will be using for this tutorial:
Step 1: Upload the data files
HDFS Files view from the Off-canvas menu at the top. That is the
views menu. The HDFS Files view allows you to view the Hortonworks Data Platform(HDP) file store. The HDP file system is separate from the local file system.
/user/admin, click Upload and Browse, which brings up a dialog box where you can select the
NYSE_daily_prices_A.csv file from you computer.
NYSE_dividends_A.csv file in the same way. When finished, notice that both files are now in HDFS.
Step 2: Create Your Script
Open the Pig interface by clicking the
Pig Button in the
On the left we can choose between our saved
UDFs and the
Pig Jobs executed in the past. To the right of this menu bar we see our saved Pig Scripts.
Click on the button
"New Script", enter “Pig-Dividend” for the title of your script and leave the location path empty:
Below you can find an overview about which functionalities the pig interface makes available.
A special feature of the interface is the PIG helper at the top left of the composition area, which provides templates for Pig statements, functions, I/O statements, HCatLoader() and Python user defined functions.
Step 3: Define a relation
In this step, you will create a script to load the data and define a relation.
- On line 1
definea relation named STOCK_A that represents the
NYSE stocksthat start with the letter “A”
- On line 2 use the
DESCRIBEcommand to view the STOCK_A relation
The completed code will look like:
STOCK_A = LOAD 'nyse/NYSE_daily_prices_A.csv' using PigStorage(','); DESCRIBE STOCK_A;
Step 4: Save and Execute the Script
Click the Save button to save your changes to the script. Click Execute to run the script. This action creates one or more MapReduce jobs. After a moment, the script starts and the page changes. Now, you have the opportunity to Kill the job in case you want to stop the job.
Next to the
Kill job button is a
progress bar with a text field above that shows the
When the job completes, check the results in the green box. You can also download results to your system by clicking the download icon. Notice STOCK_A does not have a schema because we did not define one when loading the data into relation STOCK_A.
Step 5: Define a Relation with a Schema
Let’s use the above code but this time with a schema. Modify line 1 of your script and add the following AS clause to
define a schema for the daily stock price data. The complete code will be:
STOCK_A = LOAD 'NYSE_daily_prices_A.csv' using PigStorage(',') AS (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); DESCRIBE STOCK_A;
Save and execute the script again. This time you should see the schema for the STOCK_A relation:
Step 6: Define a new relation from an existing relation
You can define a new relation based on an existing one. For example, define the following B relation, which is a collection of 100 entries (arbitrarily selected) from the STOCK_A relation.
Add the following line to the end of your code:
B = LIMIT STOCK_A 100; DESCRIBE B;
Save and execute the code. Notice B has the same schema as STOCK_A, because B is a
subset of A relation.
Step 7: View the Data
To view the data of a relation, use the
Add the following
DUMP command to your Pig script, then save and execute it again:
The command requires a MapReduce job to execute, so you will need to wait a minute or two for the job to complete. The output should be 100 entries from the contents of
NYSE_daily_prices_A.csv (and not necessarily the ones shown below, because again, entries are arbitrarily chosen):
Step 8: Select specific columns from a relation
DESCRIBE B and
DUMP B commands from your Pig script; you will no longer need those.
One of the key uses of Pig is data transformation. You can define a new relation based on the fields of an existing relation using the
FOREACH command. Define a new relation
C, which will contain only the
symbol, date and close fields from relation B.
Now the complete code is:
STOCK_A = LOAD 'NYSE_daily_prices_A.csv' using PigStorage(',') AS (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); B = LIMIT STOCK_A 100; C = FOREACH B GENERATE symbol, date, close; DESCRIBE C;
Save and execute the script and your output will look like the following:
Step 9: Store relationship data into a HDFS File
In this step, you will use the
STORE command to output a relation into a new file in
HDFS. Enter the following command to output the C relation to a folder named
output/C (then save and execute):
STORE C INTO 'output/C';
Again, this requires a MapReduce job (just like the
DUMP command), so you will need to wait a minute for the job to complete.
Once the job is finished, go to
HDFS Files view and look for a newly created folder called “output” under
Click on “output” folder. You will find a subfolder named “C”.
Click on “C” folder. You will see an output file called “part-r-00000”:
Click on the file “part-r-00000”. It will download the file:
Step 10: Perform a join between 2 relations
In this step, you will perform a
join on two NYSE data sets: the daily prices and the dividend prices. Dividends prices are shown for the quarter, while stock prices are represented on a daily basis.
You have already defined a relation for the stocks named STOCK_A. Create a new Pig script named “Pig-Join”. Then define a new relation named DIV_A that represents the dividends for stocks that start with an “A”, then
join A and B by both the
symbol and date and describe the schema of the new relation C.
The complete code will be:
STOCK_A = LOAD 'NYSE_daily_prices_A.csv' using PigStorage(',') AS (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); DIV_A = LOAD 'NYSE_dividends_A.csv' using PigStorage(',') AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float); C = JOIN STOCK_A BY (symbol, date), DIV_A BY (symbol, date); DESCRIBE C;
Save the script and execute it. Notice C contains all the fields of both STOCK_A and DIV_A. You can use the
DUMP command to see the data stored in the relation C:
Step 11: Sort the data using “ORDER BY”
ORDER BY command to sort a relation by one or more of its fields. Create a new Pig script named “Pig-sort” and enter the following commands to sort the dividends by symbol then date in ascending order:
DIV_A = LOAD 'NYSE_dividends_A.csv' using PigStorage(',') AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float); B = ORDER DIV_A BY symbol, date asc; DUMP B;
Save and execute the script. Your output should be sorted as shown here:
Step 12: Filter and Group the data using “GROUP BY”
GROUP command allows you to group a relation by one of its fields. Create a new Pig script named “Pig-group”. Then, enter the following commands, which group the DIV_A relation by the dividend price for the “AZZ” stock.
DIV_A = LOAD 'NYSE_dividends_A.csv' using PigStorage(',') AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float); B = FILTER DIV_A BY symbol=='AZZ'; C = GROUP B BY dividend; DESCRIBE C; DUMP C;
Save and execute. Notice that the data for stock symbol “AZZ” is grouped together for each dividend.
Congratulations! You have successfully completed the tutorial and well on your way to pigging on Big Data.