In this tutorial, we will learn to use Hive and Pig along with other tools to process, analyze and filter large datasets.
driver data file from here.
Once you have the file you will need to unzip the file into a directory. We will be uploading two csv files –
Start by using the
HDFS Files view from the views drop down menu in Ambari:
Navigate to the folder
/tmp and create a new folder called
Then use the menus to upload the
drivers.csv file and
After uploading both files head back to the data folder we created. Click on data row and select
Permissions. Make sure all boxes are checked blue.
HCatalog has been merged with Hive project. This means that your Hive queries will utilize HCatalog when using commands like create table and drop table.
We are now going to utilize the
Hive view to create tables with our data. Use the same drop down menu that you used to select the HDFS Files view, and instead click
You will view a page like this:
We’re now going to create a table from our CSV using a Hive query. Copy and paste the following query and click
Execute to run the command and create the table.
create table drivers (driverId int, name string, ssn bigint, location string, certified string, wageplan string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1");
You’ll now need to
load the data file into the table. Use the following command to do so.
LOAD DATA INPATH '/tmp/data/drivers.csv' OVERWRITE INTO TABLE drivers;
You will see a new table
drivers has been created and has all of the data contained within it. Click on the box next to the drivers table in
Database Explorer to view the data.
Repeat above steps for the second data set
truck_event_text_partition.csv using the following queries to create the
create table truck_events (driverId int, truckId int, eventTime string, eventType string, longitude double, latitude double, eventKey string, correlationId bigint, driverName string, routeId int, routeName string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1");
Load the data by running the following command:
LOAD DATA INPATH '/tmp/data/truck_event_text_partition.csv' OVERWRITE INTO TABLE truck_events;
You should now have two different tables inside the database explorer:
You can view the data by clicking on the box next to the table truck_events.
In the previous sections you:
Apache Hive™ provides a data warehouse function to the Hadoop cluster. Through the use of HiveQL you can view your data as a table and create queries just as you would in a database.
To make it easy to interact with Hive, you can we can use Ambari’s built in views to run queries on this data.
In the latest versions of the Hortonworks sandbox we can execute our Hive queries using Tez, a fast execution engine. It improves on MapReduce in many areas and allows us near-realtime querying on our datasets in Hive.
Notice the query window and Execute. Type your queries in the Query window. When you are done with a query, click the
To see tables that Hive knows about, in Query Editor type the query:
and click on
Notice the tables that you previously created are in the list (“drivers” and “truck_events”). You can see the columns in the table by executing:
You can make a join with other tables in Hive the same way you do with other database queries. Let’s make a join between
Enter the following into the query editor:
select a.driverId,a.driverName,a.eventType,b.certified from truck_events a join drivers b ON (a.driverId = b.driverId);
This job is more complex so it might take longer than previous queries. You can watch the job running in the log. When the job completes, you can see the results.
In this tutorial, you will create and execute a
Pig script. To access the Pig interface, use the dropdown menu for views in Ambari. Select
A special feature of the interface is the Pig helper. The Pig helper provides templates for the statements, functions, I/O statements, HCatLoader() and Python user defined functions. Another feature is the Pig arguments which provides pre-formatted command line arguments used during execution. You will see a page like this:
New Script and create a name for it.
In this section, you will load the data from the table that is stored in HCatalog/Hive. Then you will make a join between two data sets on the
driverId field in the same way that you did in the Hive section.
The data is already in HDFS through HCatalog. HCatalog stores schema and location information, so we can use the HCatLoader() function within the Pig script to read data from HCatalog-managed tables. In Pig, you now
only need to give the table a name or alias so that Pig can process the
Follow this procedure to load the data using HCatLoader:
Choose PIG helper -> HCatalog -> LOAD…template. This action
pastes the Load template into the script area.
IMPORTANT! Note that the statement should be
org.apache.hive.hcatalog.pig.HCatLoader();. Note the addition of the hive component.
Repeat this sequence for “truck_events” and add “ b = “
The completed lines of code will be:
a = LOAD 'drivers' using org.apache.hive.hcatalog.pig.HCatLoader(); b = LOAD 'truck_events' using org.apache.hive.hcatalog.pig.HCatLoader();
It is important to note that at this point, we have merely defined the aliases for our tables to hold the data (alias “a” for drivers and alias “b” for truck_events). Data is not loaded or transformed until we execute an operational command such as
Next, you will use the
JOIN operator to join both tables on the
driverid. You will create a new data set using the Pig Join function that will match the driverid field and include all of the data from both tables.
Complete these steps:
So, the final code will be:
a = LOAD 'drivers' using org.apache.hive.hcatalog.pig.HCatLoader(); b = LOAD 'truck_events' using org.apache.hive.hcatalog.pig.HCatLoader(); c = join b by driverid, a by driverid;
Now you have joined all the records in both of the tables on driverid.
To complete the Join operation, use the
DUMP command to execute the results. This will show all of the records that have a common driverid. The data from both tables will be merged into one row. Complete this steps:
The full script should be:
a = LOAD 'drivers' using org.apache.hive.hcatalog.pig.HCatLoader(); b = LOAD 'truck_events' using org.apache.hive.hcatalog.pig.HCatLoader(); c = join b by driverid, a by driverid; dump c;
First you need to add the
-useHCatalog (Case Sensitive) argument using the box box in the bottom right hand corner
At the top of the screen, make sure the box “Execute on Tez” is checked. Then click
Execute to run the script. This action creates one or more Tez jobs.
Execute is a progress bar that shows the job’s status. The progress bar can be blue (indicating job is in process), red (job has a problem), or green (job is complete).
When the job completes, you will see the results show up in one of the dropdown menus. The result is that each line that starts with an open parenthesis “(“ has data from both tables for driverid.
Logs drop down menu if you want to see what happened when your script ran, including any error messages. (You might need to scroll down to view the entire log.)
Congratulations! You have successfully completed HCatalog, Basic Pig &