How to Process Data with Apache Hive

Data processing with Hive

Hive is a component of Hortonworks Data Platform (HDP). Hive provides a SQL-like interface to data stored in HDP. In the previous tutorial we used Pig, a scripting language with a focus on dataflows. Hive, by contrast, provides a database query interface to Apache Hadoop.

People often ask why Pig and Hive exist when they seem to do much the same thing. Because of its SQL-like query language, Hive is often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are accustomed to using SQL to query data. Pig fits in through its dataflow strengths: it takes on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is in Alan Gates' posting on the Yahoo Developer blog titled Pig and Hive at Yahoo! From a technical point of view, both Pig and Hive are feature complete, so you can do tasks in either tool. However, you will find that one tool or the other is preferred by the different groups that use Apache Hadoop. The good part is that they have a choice, and the two tools work together.

Our data processing task

We are going to do the same data processing task that we just did with Pig in the previous tutorial. We have several files of baseball statistics, and we are going to bring them into Hive and do some simple computations with them. We are going to find the player with the highest runs for each year. The batting file has all the statistics from 1871–2011 and contains more than 90,000 rows. Once we have the highest runs, we will extend the script to translate a player id field into the first and last names of the players.

Downloading the data

The data files we are using come from the following site. You can download the data file from:

Once you have the file you will need to unzip it into a directory. We will be uploading just the Master.csv and Batting.csv files from the dataset.

Uploading the data files

We start by selecting the File Browser from the top tool bar. The File Browser shows you the files in the HDP file store. In this case the file store resides in the Hortonworks Sandbox VM.

Click on the Upload button

You want to select Files. Then you will get a dialog box.

When you click on the Upload a file button you will get a dialog box. Navigate to where you stored the Batting.csv file on your local disk and select Batting.csv. Do the same thing for Master.csv. When you are done you will see there are two files in your directory.

Starting Beeswax, the Hive UI

Let's start Beeswax by clicking on the bee icon in the top bar. Beeswax is a user interface to the Hive data warehouse system for Hadoop.

Beeswax provides a GUI to Hive. On the right is a query editor. There is a limit of one query per execute cycle, though a query may span multiple lines. At the bottom there are buttons to Execute the query, Save the query with a name, Explain the query, and start a new query.

Before we get started, let's take a look at how the Pig and Hive data models differ. In Pig, all data objects exist and are operated on in the script; once the script completes, all data objects are deleted unless you stored them. In Hive, we are operating on the Apache Hadoop data store: any query you make, any table you create, and any data you copy persists from query to query. You can think of Hive as providing a data workbench where you can examine, modify and manipulate the data in Apache Hadoop.

So when we perform our data processing task, we will execute it one query or line at a time. Once a line executes successfully, you can look at the data objects to verify that the last operation did what you expected. All your data is live, whereas in Pig data objects only exist inside the script unless they are copied out to storage. This kind of flexibility is Hive's strength: you can solve problems bit by bit and change your mind about what to do next depending on what you find.

The first task we will do is create a table to hold the data. We will type the query into the composition area on the right like this. Once you have typed in the query hit the Execute button at the bottom.

create table temp_batting (col_value STRING);

The query returns “No data available in the table” because at this point we have just created an empty table and have not copied any data into it.

Once the query has executed we can click on Tables at the top of the composition area and we will see we have a new table called temp_batting.

Clicking on the Browse Data button will let us see the data and right now the table is empty. This is a good example of the interactive feel you get with using Hive.

The next line of code will load the data file Batting.csv into the table temp_batting. As we start typing the code, we will notice a helper feature that fills in the correct path to our file.

The complete query looks like this.

LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;

After executing the query we can look at the Tables again and when we browse the data for temp_batting we see that the data has been read in. Note Hive consumed the data file Batting.csv during this step. If you look in the File Browser you will see Batting.csv is no longer there.

Now that we have read the data in, we can start working with it. The next thing we want to do is extract the data. First we will type a query to create a new table called batting to hold the data. That table will have three columns: player_id, year and the number of runs.

create table batting (player_id STRING, year INT, runs INT);

Then we extract the data we want from temp_batting and copy it into batting. We will do this with a regexp pattern. To do this we are going to build up a multi-line query. The first line of the query inserts into the table batting. The three regexp_extract calls extract the player_id, year and run fields from the table temp_batting. When you are done typing the query it will look like this. Be careful: there are no spaces in the regular expression pattern.

insert overwrite table batting  
select  
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,  
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,  
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run  
from temp_batting;
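If the pattern looks opaque, here is a small Python sketch of what each regexp_extract call pulls out of a CSV line (an illustration only, not part of the Hive script; the sample row is made up). The repeated group in `^(?:([^,]*),?){n}` matches the first n comma-separated fields, and because a repeated capture group retains only its last match, group 1 ends up holding the n-th field, which is what regexp_extract(..., 1) returns:

```python
import re

def extract_field(line, n):
    # Mimics Hive's regexp_extract(col_value, '^(?:([^,]*)\,?){n}', 1):
    # the group is matched n times, and only the last capture survives,
    # so group(1) is the n-th comma-separated field.
    match = re.match(r'^(?:([^,]*),?){%d}' % n, line)
    return match.group(1)

row = "aardsda01,2004,1,SFN,NL,11,0,0,0"
print(extract_field(row, 1))  # player_id -> aardsda01
print(extract_field(row, 2))  # year      -> 2004
print(extract_field(row, 9))  # runs      -> 0
```

Hive's version escapes the comma as `\,`; the escape is harmless, and the plain comma used here matches the same strings.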

Execute the query and look at the batting table. You should see data that looks like this.

Now we have the data fields we want. The next step is to group the data by year so we can find the highest score for each year. This query groups all the records by year and then selects the highest run total for each year.

SELECT year, max(runs) FROM batting GROUP BY year;

The results of the query look like this.
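Conceptually, the GROUP BY collapses all rows that share a year, and max picks the largest run total inside each group. Here is the same logic as a plain-Python sketch, with made-up sample rows (an illustration only, not something you run in Hive):

```python
# Hypothetical (player_id, year, runs) rows, standing in for the batting table.
rows = [
    ("playerA", 1871, 66),
    ("playerB", 1871, 60),
    ("playerC", 1872, 94),
    ("playerD", 1872, 91),
]

# Equivalent of: SELECT year, max(runs) FROM batting GROUP BY year;
max_runs_by_year = {}
for player_id, year, runs in rows:
    max_runs_by_year[year] = max(runs, max_runs_by_year.get(year, 0))

print(max_runs_by_year)  # {1871: 66, 1872: 94}
```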

Now we need to go back and get the player_id(s) so we know who the player(s) were. We know that, for a given year, we can use the runs to find the player(s) with that total. So we can take the previous query and join it with the batting records to get the final table.

SELECT a.year, a.player_id, a.runs from batting a  
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b  
ON (a.year = b.year AND a.runs = b.runs) ;

The resulting data looks like:
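The subquery b holds one (year, max runs) pair per year, and the join keeps every batting row whose runs equal its year's maximum. The same logic in plain Python, with made-up rows (an illustration only): note that if two players tie for the most runs in a year, both rows survive the join.

```python
# Hypothetical (player_id, year, runs) rows, standing in for the batting table.
rows = [
    ("playerA", 1871, 66),
    ("playerB", 1871, 66),   # tied with playerA for the 1871 maximum
    ("playerC", 1872, 94),
]

# The subquery b: one maximum per year.
max_runs = {}
for _, year, runs in rows:
    max_runs[year] = max(runs, max_runs.get(year, 0))

# The join: keep each row whose runs equal its year's maximum.
leaders = [(year, pid, runs) for pid, year, runs in rows
           if runs == max_runs[year]]
print(leaders)  # both tied 1871 players appear, plus the 1872 leader
```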

This query may take a couple of minutes. While you’re waiting, if you have internet access, take a look at this video from Alan Gates to hear about the future of HCatalog:
Future of HCatalog

So now we have our results. As described earlier we solved this problem using Hive step by step. At any time we were free to look around at the data, decide we needed to do another task and come back. At all times the data is live and accessible to us.


Robin Dong
March 5, 2014 at 11:44 pm

I have found your tutorials to be excellent, among the best on the Internet.

June 3, 2014 at 11:12 am

There is a syntactic error in the last query. The temporary table generated does not have a name for the max runs column, and hence cannot be used outside unless it is named.


select a.player_id, a.year, a.runs
from batting a
join (select year, max(runs) as max_runs
      from batting
      group by year) b
on (a.year = b.year AND a.runs = b.max_runs)

    September 14, 2015 at 7:14 am

    There is a column alias mentioned, max(runs) runs, if you observe carefully. I’ve run the query as-is and it works alright.

June 25, 2014 at 2:18 am

Nice tutorials!
Can you please share some statistics on performance with respect to the amount of data Hive queries through HCatalog?

July 17, 2014 at 11:29 am

this is really useful, thank you

August 5, 2014 at 8:32 pm

Excellent article. However I have a question:
We first wrote the data into HDFS, then created a table and loaded data from the HDFS files into the Hive table. Does this mean the Hive table actually creates a copy of the HDFS file that was loaded into it, or does it only point to the data on the HDFS file system?

    Anand Suresh
    August 19, 2014 at 12:02 am

    As you have said, Hive only points to the data on the HDFS file system. Hive only stores the structure of the table, not the data. The data is always accessed from HDFS.

    August 19, 2014 at 10:22 am

    @Shekhar – No Shekhar, when you load data into a Hive table, the file is moved from its HDFS location into the Hive-controlled file system namespace. Although I did not understand why this is done, I found this note at []. I’m still researching why it is done that way. Please share if you find more details.

      November 25, 2014 at 11:52 pm

      This is not the case; only the schema/structure is copied, as per my understanding. Only a pointer is maintained from the Hive table to HDFS.

      November 5, 2014 at 12:29 am

      Create the Hive table and give the HDFS file location as the table’s storage location; then the data stays in HDFS only and is not moved into the Hive warehouse. The table reads the data from HDFS.

August 25, 2014 at 1:32 pm

Hello, this tutorial was really helpful. I have a question on partitioning and is as follows:

“Is it necessary to have the partitioning variable as the last column in the source table?”

September 4, 2014 at 12:05 am

Hi, can we use Hive for interactive reporting purposes? For example, accessing the Hive data using Jasper Reports and displaying the reports through a browser (using web technologies: Java, PHP, etc.).

September 16, 2014 at 6:53 pm

Can you explain more on the topic below?
I thought that after executing the command below, the HDFS file stays the same and additional data files are created in /users/hive/warehouse. I believe this command is for internal tables.
Can you elaborate?

LOAD DATA INPATH '/user/sandbox/Batting.csv' OVERWRITE INTO TABLE temp_batting;
After executing the query we can look at the Tables again and when we browse the data for temp_batting we see that the data has been read in. Note Hive consumed the data file Batting.csv during this step. If you look in the File Browser you will see Batting.csv is no longer there.

Pavel V
September 17, 2014 at 2:07 am

The Yahoo article link 404s. The correct link seems to be

Arya Basu
October 14, 2014 at 6:10 am

How can we update a record in Hive table?

Mungeol Heo
November 3, 2014 at 10:47 pm

Note: as of HDP 2.1, use LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting; instead of LOAD DATA INPATH '/user/sandbox/Batting.csv' OVERWRITE INTO TABLE temp_batting;

justin joseph
January 8, 2015 at 9:01 pm

Please tell me how I can do these things through the REST API.

Anwar Mian
July 3, 2015 at 2:05 am

Excellent article. Step by step with screen shots was very helpful.

September 2, 2015 at 1:24 pm

Question — what if there are multiple players with the “max(runs)” in a given year. For example, the most runs by a player in 1882 was 99…but what if there were two players who had 99 runs that year?

Would the result set show two rows, one for each of those two players with 99 runs in 1882…or would the result set show just one row, in which case which of the ‘tied’ players would it show?

September 6, 2015 at 12:52 pm


Very awesome tutorial.

Any chance you’re able to get the screenshots and the tutorial steps updated to reflect the latest version of Ambari / Hive please?

I got stuck on the bit where you need to upload the CSV to Hive in order to process it into the TEMP_BATTING table.

I kept getting an access error, but I don’t have it on me at the moment as I’m currently at work.

September 26, 2015 at 11:42 am

When trying to upload a file from the Hue file browser and then trying to load it from PuTTY, I get the error “root user not authorized”. How can the root user get access to files uploaded using the Hue file browser?

James Song
October 3, 2015 at 10:56 pm

So, what happens behind the scenes?
Is a MapReduce job created when the insert query is executed?
How does Hive interact with HDFS?

October 8, 2015 at 7:09 pm

I am using CentOS, so how can I run Hive on it?

October 8, 2015 at 7:11 pm

I am using CentOS Linux and downloaded the data file from the given website, so what do I do next?
Please explain.
