
How to Process Data with Apache Hive

Introduction

In this tutorial, we will use the Ambari HDFS Files view to store data files of baseball statistics, then implement Hive queries to analyze, process, and filter that data.

Hive

Hive is a component of the Hortonworks Data Platform (HDP). Hive provides a SQL-like interface to data stored in HDP. In the previous tutorial we used Pig, a scripting language with a focus on dataflows. Hive provides a database query interface to Apache Hadoop.

Hive or Pig?

People often ask why Pig and Hive exist when they seem to do much of the same thing. Because of its SQL-like query language, Hive is often used as the interface to an Apache Hadoop based data warehouse, and it is considered friendlier and more familiar to users who are used to querying data with SQL. Pig fits in through its data flow strengths: it takes on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is in Alan Gates' posting on the Yahoo Developer blog titled Pig and Hive at Yahoo! From a technical point of view, both Pig and Hive are feature complete, so you can do tasks in either tool. However, you will find that one tool or the other is preferred by the different groups that use Apache Hadoop. The good part is they have a choice, and both tools work together.

Our Data Processing Task

We are going to do the same data processing task we just did with Pig in the previous tutorial. We have several files of baseball statistics, and we are going to bring them into Hive and do some simple computation with them. We are going to find the player with the highest runs for each year. The batting file holds all the statistics from 1871 to 2011 and contains more than 90,000 rows. Once we have the highest runs, we will extend the queries to translate the player id field into the first and last names of the players.

Step 1: Download The Data

The data files we are using come from the site SeanLahman.com. Download the data file lahman591-csv.zip and unzip it into a directory. We will be uploading just the Master.csv and Batting.csv files from the dataset.

Step 2: Upload The Data Files

We start by selecting the HDFS Files view from the Off-canvas menu at the top. The HDFS Files view shows you the files in the HDP file
store. In this case, the file store resides in the Hortonworks Sandbox VM.

[Image: HDFS Files view icon]

Navigate to /user/maria_dev and click on the Upload button.

[Image: /user/maria_dev folder in the HDFS Files view]

Clicking on Browse will open a dialog box. Navigate to where you stored the Batting.csv file on your local disk and select Batting.csv. Do the same for Master.csv. When you are done, you will see two files in your directory.

[Image: Batting.csv and Master.csv uploaded to /user/maria_dev]

Step 3: Start the Hive View

Let's open the Hive view by clicking on the Hive button in the top bar, just as we selected the HDFS Files view earlier. The Hive view provides a user interface to the Hive data warehouse system for Hadoop.

[Image: Hive view icon]

3.1 Explore The Hive User Interface

On the right is the query editor. A query may span multiple lines. At the bottom there are buttons to Execute the query, Explain the query, Save the query with a name, and open a new Worksheet window for another query.

[Image: Hive view user interface]
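
The Explain button corresponds to Hive's EXPLAIN keyword, which shows how Hive plans to execute a statement without actually running it. You can also type it directly in the editor; for example, once the temp_batting table from step 3.2 exists, something like this (illustrative) shows the plan for a full scan:

EXPLAIN SELECT * FROM temp_batting;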

Hive and Pig Data Model Differences

Before we get started, let's take a look at how the Pig and Hive data models differ. In the case of Pig, all data objects exist and are operated on in the script; once the script is complete, all data objects are deleted unless you stored them. In the case of Hive, we are operating on the Apache Hadoop data store: any query you make, any table you create, and any data you copy persists from query to query. You can think of Hive as providing a data workbench where you can examine, modify, and manipulate the data in Apache Hadoop.

So when we perform our data processing task, we will execute it one query or line at a time. Once a line executes successfully, you can look at the data objects to verify that the last operation did what you expected. All your data is live, unlike in Pig, where data objects only exist inside the script unless they are copied out to storage. This kind of flexibility is Hive's strength: you can solve problems bit by bit and change your mind about what to do next depending on what you find.
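
"Looking at the data objects" between steps is nothing more than a quick query. A pattern we will use throughout (temp_batting is the table we create in the next step) is:

SELECT * FROM temp_batting LIMIT 5;  -- peek at the first few rows before moving on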

3.2 Create Table temp_batting

The first task is to create a table to hold the data. Type the query into the composition area on the right-hand side, then hit the Execute button at the bottom.

create table temp_batting (col_value STRING);

[Image: create table temp_batting query]

Hint: press CTRL + Space for autocompletion

The query does not return any results because at this point we have just created an empty table; we have not copied any data into it.

Once the query has executed, we can refresh the Database Explorer to the left of the composition area. Expanding the default database, we will see a new table called temp_batting.

[Image: temp_batting table in the default database]
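
If you prefer to stay in the query editor, the same check can be made with two standard HiveQL statements:

SHOW TABLES;
DESCRIBE temp_batting;  -- shows the single col_value column of type string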

Clicking on the icon next to the table name opens a new worksheet that loads sample data from the table. We see the table is empty right now. This is a good example of the interactive feel you get when using Hive.

[Image: sample data for the empty temp_batting table]

The next line of code will load the data file Batting.csv into the table temp_batting.

[Image: query to load Batting.csv into temp_batting]

3.3 Create Query to Populate Hive Table temp_batting with Batting.csv Data

The complete query looks like this.

LOAD DATA INPATH '/user/maria_dev/Batting.csv' OVERWRITE INTO TABLE temp_batting;

After executing the query, we can look at the tables again; when we browse the data for temp_batting, we see that the data has been read in. Note that Hive consumed the data file Batting.csv during this step: LOAD DATA INPATH moves the file into Hive's warehouse, so if you look in the HDFS Files view you will see Batting.csv is no longer in /user/maria_dev.

[Image: temp_batting sample data after loading]
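
To confirm the whole file made it in, count the rows; since the batting data covers 1871 to 2011, the count should be well over 90,000:

SELECT COUNT(*) FROM temp_batting;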

3.4 Create Table batting

Now that we have read the data in, we can start working with it. The next thing we want to do is extract the data, so first we will type in a query to create a new table called batting to hold it. That table will have three columns: player_id, year, and the number of runs.

create table batting (player_id STRING, year INT, runs INT);

[Image: create table batting query]

3.5 Create Query to Extract Data from temp_batting and Store It to batting

Next we extract the data we want from temp_batting and copy it into batting. We will do this with a regexp pattern, building up a multi-line query. The first line of the query tells Hive to overwrite the batting table with the results of the SELECT. The three regexp_extract calls extract the player_id, year, and runs fields from the lines stored in temp_batting. When you are done typing, the query will look like this. Be careful: there are no spaces in the regular expression pattern.

insert overwrite table batting  
SELECT  
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,  
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,  
  regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) runs  
from temp_batting;

[Image: query extracting player_id, year, and runs from temp_batting]
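
The pattern ^(?:([^,]*),?){n} matches the first n comma-separated fields from the start of the line, and because the capture group sits inside the repetition it ends up holding the n-th field, which is what regexp_extract returns. You can try this on a literal string (the sample row here is illustrative, not necessarily a real line from Batting.csv):

SELECT regexp_extract('aardsda01,2004,1,SFN,NL', '^(?:([^,]*),?){2}', 1);  -- returns 2004, the second field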

Execute the query and look at the batting table. You should see data that looks like this.

[Image: sample data from the batting table]

3.6 Create Query to Filter The Data (year, runs)

Now we have the data fields we want. The next step is to group the data by year so we can find the highest number of runs for each year. This query groups all the records by year and then selects the maximum runs from each year; we will recover the matching player in the next step.

SELECT year, max(runs) FROM batting GROUP BY year;

The results of the query look like this.

[Image: highest runs grouped by year]
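
GROUP BY alone does not guarantee any output order, so if you want the years listed chronologically you can add an ORDER BY clause (an optional refinement):

SELECT year, max(runs) FROM batting GROUP BY year ORDER BY year;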

3.7 Create Query to Filter The Data (year, player, runs)

Now we need to go back and get the player_id(s) so we know who the player(s) were. For a given year, we can use the maximum runs to find the player(s) who scored them, so we take the previous query and join it against the batting records to get the final table. Note that if two players tie for the highest runs in a year, both will appear in the results.

SELECT a.year, a.player_id, a.runs from batting a  
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b  
ON (a.year = b.year AND a.runs = b.runs);

The resulting data looks like:

[Image: year, player_id, and runs results]

So now we have our results. As described earlier, we solved this problem using Hive step by step. At any time we were free to look around at the data, decide we needed to do another task, and come back. At all times the data was live and accessible to us.
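
As a final exercise, we can make good on the promise from the start of the tutorial and translate the player ids into first and last names using the Master.csv file we uploaded. Here is a sketch that follows the same load-and-extract pattern we used for Batting.csv; run the statements one at a time, as before. The field positions {2}, {17}, and {18} are assumptions based on the playerID, nameFirst, and nameLast columns of the Lahman 5.9 layout, so check the header row of your copy of Master.csv and adjust them if they differ.

create table temp_master (col_value STRING);

LOAD DATA INPATH '/user/maria_dev/Master.csv' OVERWRITE INTO TABLE temp_master;

create table master (player_id STRING, first_name STRING, last_name STRING);

insert overwrite table master
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*),?){17}', 1) first_name,
  regexp_extract(col_value, '^(?:([^,]*),?){18}', 1) last_name
from temp_master;

SELECT a.year, m.first_name, m.last_name, a.runs
from batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs)
JOIN master m ON (a.player_id = m.player_id);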

Tutorial Q&A and Reporting Issues

If you need help or have questions with this tutorial, please first check HCC for existing Answers to questions on this tutorial using the Find Answers button. If you don’t find your answer you can post a new HCC question for this tutorial using the Ask Questions button.


Tutorial Name: How to Process Data with Apache Hive
HCC Tutorial Tag: tutorial-110 and HDP-2.4

If the tutorial has multiple labs, please indicate which lab your question corresponds to, and provide any feedback related to that lab.

All Hortonworks, partner and community tutorials are posted in the Hortonworks github and can be contributed via the Hortonworks Tutorial Collaboration Guide. If you are certain there is an issue or bug with the tutorial, please create an issue on the repository and we will do our best to resolve it!