In this tutorial, we will learn to use Hive and Pig along with other tools to process, analyze and filter large datasets.
We will use a baseball statistics file. This file has
all the statistics for each American player by year from 1871-2011. The
data set is fairly large (over 95,000 records), but to learn Big Data
you don’t need to use a massive dataset. You need only use tools that
scale to massive datasets.
The data files we are using in this tutorial come from Sean Lahman’s
extensive historical baseball database
(http://seanlahman.com/), and are being used
under a Creative Commons Attribution-ShareAlike license:
Download and unzip the data file below:
If the above link doesn’t work you can get the file from this link as well.
The zip archive includes many statistics files. We will only use the
following two files:
Start by using the HDFS Files view from the views dropdown menu in Ambari
Navigate to the folder
/tmp and create a new folder called data.
Then use the menus to upload to upload the
master.csv file and
After uploading both files head back to the
data folder we created. Right click on it’s row and select Permissions. Make sure all boxes are checked (blue).
Look at the top bar above the files. Look for Upload. Then click the Browse to search for the files that we unzipped earlier.
When you are done, you will see the two files in your directory.
HCatalog has been merged with Hive project. This means that your Hive queries will utilize HCatalog when using commands like
create table and
We are now going to utilize the Hive view to create tables with our data. Use the same dropdown menu that you used to select the HDFS Files view, and instead click Hive.
Notice some of the basic items in the Hive view which are outlined in the image below.
We’re now going to create a table from our CSV using a Hive query. Copy and paste the following query and click Execute to run the command and create the table.
CREATE TABLE IF NOT EXISTS batting_data (playerid string, yearid string, stint string, teamid string, lgid string, g string, g_batting string, ab string, runs int, h string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1")
You’ll now need to load the data file into the table. Use the following command to do so.
LOAD DATA INPATH '/tmp/data/Batting.csv' OVERWRITE INTO TABLE batting_data
You will see a new table “batting_data” has been created and has all of the data contained within it.
Repeat above steps for the second data set (master.csv) using the following queries to create the
CREATE TABLE IF NOT EXISTS master_data (lahmanID string,playerID string,managerID string ,hofID string, birthYear string,birthMonth string,birthDay string, birthCountry string,birthState string,birthCity string, deathYear string,deathMonth string ,deathDay string, deathCountry string,deathState string ,deathCity string, nameFirst string, nameLast string, nameNote string, nameGiven string, nameNick string, weight string, height string, bats string, throws string, debut string, finalGame string, college string, lahman40ID string, lahman45ID string, retroID string, holtzID string, bbrefID string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1")
LOAD DATA INPATH '/tmp/data/Master.csv' OVERWRITE INTO TABLE master_data
You should now have two different tables inside the database explorer:
In the previous sections you:
Apache Hive™ provides a data warehouse function to the Hadoop cluster.
Through the use of HiveQL you can view your data as a table and create
queries just as you would in a database.
To make it easy to interact with Hive, you can we can use Ambari’s built in views to run queries on this data.
In the latest versions of the Hortonworks sandbox we can execute our Hive queries using Tez, a fast execution engine. It improves on MapReduce in many areas and allows us near-realtime querying on our datasets in Hive.
Notice the query window and Execute. Type your queries in the
Query window. When you are done with a query, click the Execute.
To see tables that Hive knows about, in Query Editor type the query:
and click on Execute.
Notice the tables that you previously created are in the list (“batting_data” and “master_data”).
To see the records type:
select * from batting_data
in Query Editor and click Execute.
You can see the columns in the table by executing:
You can make a join with other tables in Hive the same way you do with
other database queries.
Let’s make a join between batting_data and master_data tables:
Enter the following into the query editor:
select a.playerid, a.namefirst, a.namelast, b.yearid, b.runs from master_data a join batting_data b ON (b.playerid = a.playerid);
This job is more complex so it might take longer than previous queries. You can
watch the job running in the log.
When the job completes, you can see the results.
In this tutorial, you will create and execute a Pig script.
To access the Pig interface, use the dropdown menu for views in Ambari. Select Pig.
A special feature of the interface is the Pig helper. The Pig helper
provides templates for the statements, functions, I/O statements,
HCatLoader() and Python user defined functions. Another feature is the
Pig arguments which provides pre-formatted command line arguments
used during execution.
Click New Script and create a name for it.
In this section, you will load the data from the table that is stored in
HCatalog/Hive. Then you will make a join between two data sets on the Player
ID field in the same way that you did in the Hive section.
The data is already in HDFS through HCatalog. HCatalog stores schema and
location information, so we can use the HCatLoader() function within the
Pig script to read data from HCatalog-managed tables. In Pig, you now
only need to give the table a name or alias so that Pig can process the
Follow this procedure to load the data using HCatLoader:
Choose PIG helper -> HCatalog -> LOAD…template. This action
pastes the Load template into the script area.
IMPORTANT! Note that the statement should be
org.apache.hive.hcatalog.pig.HCatLoader();. Note the addition of the hive component.
Repeat this sequence for “master_data” and add ” b = ”
The completed lines of code will be:
a = LOAD 'batting_data' using org.apache.hive.hcatalog.pig.HCatLoader(); b = LOAD 'master_data' using org.apache.hive.hcatalog.pig.HCatLoader();
It is important to note that at this point, we have merely defined the
aliases for our tables to hold the data (alias “a” for batting data and
alias “b” for master data). Data is not loaded or transformed until we
execute an operational command such as DUMP or STORE
Next, you will use the JOIN operator to join both tables on the Player
ID. Master.data has the player’s first name and last name and player ID
(among other fields). Batting.data has the player’s run record and
player ID (among other fields). You will create a new data set using the
Pig Join function that will match the player ID field and include all of
the data from both tables.
Complete these steps:
So, the final code will be:
a = LOAD 'batting_data' using org.apache.hive.hcatalog.pig.HCatLoader(); b = LOAD 'master_data' using org.apache.hive.hcatalog.pig.HCatLoader(); c = join a by playerid, b by playerid;
Now you have joined all the records in both of the tables on Player ID.
To complete the Join operation, use the DUMP command to execute the
results. This will show all of the records that have a common PlayerID.
The data from both tables will be merged into one row. Complete this
The full script should be:
a = LOAD 'batting_data' using org.apache.hive.hcatalog.pig.HCatLoader(); b = LOAD 'master_data' using org.apache.hive.hcatalog.pig.HCatLoader(); c = join a by playerid, b by playerid; dump c;
First you need to add the
-useHCatalog (Case Sensitive) argument using the box box in the bottom right hand corner
At the top of the screen, make sure the box “Execute on Tez” is checked. Then click Execute to run the script. This action creates one or more Tez jobs.
Below the Execute is a progress bar that shows the job’s status.
The progress bar can be blue (indicating job is in process), red (job
has a problem), or green (job is complete).
When the job completes, you will see the results show up in one of the dropdown menus. The result is that each line that starts with an open parenthesis “(” has
data from both tables for each unique player ID.
Click the Logs dropdown menu if you want to see what happened when your
script ran, including any error messages. (You might need to scroll down
to view the entire log.)
Congratulations! You have successfully completed HCatalog, Basic Pig &
If you need help or have questions with this tutorial, please first check HCC for existing Answers to questions on this tutorial using the Find Answers button. If you don’t find your answer you can post a new HCC question for this tutorial using the Ask Questions button.
|Find Answers||Ask Questions|
Tutorial Name: Learning the Ropes of the Hortonworks Sandbox
HCC Tutorial Tag: tutorial-260 and HDP-2.4
If the tutorial has multiple labs please indicate which lab your question corresponds to. Please provide any feedback related to that lab.
All Hortonworks, partner and community tutorials are posted in the Hortonworks github and can be contributed via the Hortonworks Tutorial Collaboration Guide. If you are certain there is an issue or bug with the tutorial, please create an issue on the repository and we will do our best to resolve it!