June 26, 2012

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. The series explores the full lifecycle of data in the enterprise: introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Sharing Data with HCatalog

Apache HCatalog is a table and storage management service for data created using Apache Hadoop. It streamlines the sharing of Hadoop data between Pig, MapReduce and Hive. HCatalog is available for download here. An excellent tutorial on installing HCatalog is available here, and you can follow along with it. The HCatalog wiki is an excellent source of up-to-date information. Finally, the HCatalog mailing lists are a great way to get involved with the HCatalog community.

Once again, we’re going to use Hive, Pig and Amazon Elastic MapReduce (EMR) to process the Enron emails. The Enron emails are available for download on S3 in Avro format here. Hive is available here. To run Hive locally we have to run Hadoop locally, and all the configuration can get confusing, so we take this opportunity to hit the cloud. Instructions for using Amazon’s Elastic MapReduce service, which is Hadoop in the cloud, are available here.

Please complete post two first to create a derived dataset in Pig, so that we may share it via HCatalog. We’re going to store the results of our script in HCatalog.

Install Hive 0.9

First, ssh to the master node of your EMR cluster, and install the latest version of Hive, version 0.9. Then fire up Hive to make sure everything runs well:

[bash]$ wget http://apache.cs.utah.edu/hive/hive-0.9.0/hive-0.9.0.tar.gz
[bash]$ tar -xvzf hive-0.9.0.tar.gz
[bash]$ cd hive-0.9.0
[bash]$ bin/hive
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201206120020_2040132181.txt
hive> show tables;
Time taken: 26.524 seconds
hive> quit;

Don’t forget to copy your existing Hive configuration into the new Hive’s conf directory, where HCatalog will expect it, and set HIVE_HOME:

[bash]$ cp hive/conf/hive-site.xml hive-0.9.0/conf
[bash]$ export HIVE_HOME=/home/hadoop/hive-0.9.0

Install Forrest

Install Apache Forrest, which the HCatalog build needs (we pass its location to ant later on):

[bash]$ wget http://mirrors.axint.net/apache//forrest/apache-forrest-0.9-sources.tar.gz
[bash]$ wget http://mirrors.axint.net/apache//forrest/apache-forrest-0.9-dependencies.tar.gz
[bash]$ tar -xvzf apache-forrest-0.9-sources.tar.gz
[bash]$ tar -xvzf apache-forrest-0.9-dependencies.tar.gz
[bash]$ cd apache-forrest-0.9
[bash]$ bin/forrest


Total time: 5 seconds


[bash]$ echo 'export FORREST_HOME=/home/hadoop/apache-forrest-0.9' >> ~/.bash_profile
[bash]$ source ~/.bash_profile
[bash]$ echo $FORREST_HOME

Install MySQL Connector (Java Driver)

Now install the MySQL JDBC Connector.

[bash]$ cd
[bash]$ wget http://mirror.services.wisc.edu/mysql/Downloads/Connector-J/mysql-connector-java-3.1.14.tar.gz
[bash]$ tar -xvzf mysql-connector-java-3.1.14.tar.gz

Now start the Hive metastore in the background, from the hive-0.9.0 directory:

./bin/hive --config ./conf --service metastore &

Install Pig 0.10

Next, install and set up Pig v0.10 and ensure it works well:

wget http://mirrors.gigenet.com/apache/pig/pig-0.10.0/pig-0.10.0.tar.gz
tar -xvzf pig-0.10.0.tar.gz
cd pig-0.10.0
export PIG_HOME=/home/hadoop/pig-0.10.0
bin/pig
grunt> pwd
grunt> mkdir /enron
grunt> ls /enron

Set Up Our Environment

Add the following to .bash_profile and source it again to set up our environment:

export HADOOP_HOME=/home/hadoop
export HCAT_HOME=/usr/local/hcat
export PIG_HOME=/home/hadoop/pig-0.10.0
export HIVE_HOME=/home/hadoop/hive-0.9.0
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HIVE_HOME/lib/hive-metastore-0.9.0.jar:
export PIG_OPTS=-Dhive.metastore.uris=thrift://localhost:10001
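Before moving on, it can help to confirm that each of these variables is set and points at a real directory. The following is only a minimal sketch: check_homes is a hypothetical helper name, not part of Hadoop, Hive, Pig or HCatalog.

```shell
#!/bin/sh
# check_homes: hypothetical helper (not shipped with any tool above).
# For each variable NAME given as an argument, report whether it is set
# and whether its value is an existing directory.
check_homes() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name, POSIX-compatible.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        else
            echo "$name ok: $value"
        fi
    done
    return $status
}

# Example call (left commented out so sourcing this file has no side effects):
# check_homes HADOOP_HOME HCAT_HOME PIG_HOME HIVE_HOME
```

Note that HCAT_HOME will be reported as missing until HCatalog has actually been installed into /usr/local/hcat.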

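Set Up HCatalog

Now set up HCatalog on the master node. This assumes you have already downloaded hcatalog-src-0.4.0-incubating.tar.gz into the current directory: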

[bash]$ mkdir /tmp/hcat_source_release
[bash]$ cp hcatalog-src-0.4.0-incubating.tar.gz /tmp/hcat_source_release
[bash]$ cd /tmp/hcat_source_release
[bash]$ tar -xzf hcatalog-src-0.4.0-incubating.tar.gz
[bash]$ cd hcatalog-src-0.4.0-incubating
[bash]$ ant -Dhcatalog.version=0.4.0 -Dforrest.home=$FORREST_HOME tar
[bash]$ cp build/hcatalog-0.4.0.tar.gz /tmp
[bash]$ cd /tmp
[bash]$ tar -xvzf hcatalog-0.4.0.tar.gz
[bash]$ cd hcatalog-0.4.0
[bash]$ sudo share/hcatalog/scripts/hcat_server_install.sh -r /usr/local/hcat -d /home/hadoop/mysql-connector-java-3.1.14 -h $HADOOP_HOME -p 9933
Installing into [/usr/local/hcat]
Installation successful

Configure Hive and HCatalog

[bash]$ sudo cp $HIVE_HOME/conf/hive-site.xml /usr/local/hcat/etc/hcatalog/
[bash]$ sudo ln -s $HIVE_HOME/bin/hive /bin/hive
[bash]$ sudo ln -s $HIVE_HOME/bin/hive-config.sh /bin/hive-config.sh

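Now we need to configure Hive and HCatalog. Thanks to some help from @khorgath, a working Hive configuration file is available for download. We fetch it below and symlink it into Hive’s conf directory, so that Hive and HCatalog share a single configuration.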

[bash]$ cd /usr/local/hcat/etc/hcatalog
[bash]$ mv hive-site.xml hive-site.orig.xml
[bash]$ wget http://s3.amazonaws.com/rjurney_public_web/hive-site.xml
[bash]$ mv /home/hadoop/hive-0.9.0/conf/hive-site.xml  /home/hadoop/hive-0.9.0/conf/hive-site.orig.xml
[bash]$ ln -s /usr/local/hcat/etc/hcatalog/hive-site.xml /home/hadoop/hive-0.9.0/conf/hive-site.xml
[bash]$ hadoop fs -mkdir /tmp
[bash]$ hadoop fs -chmod 777 /tmp
[bash]$ hadoop fs -mkdir /user/hive/warehouse
[bash]$ hadoop fs -chmod 777 /user/hive/warehouse

And finally…

Start HCatalog

[bash]$ cd /usr/local/hcat
[bash]$ sudo sbin/hcat_server.sh start

Started metastore server init, testing if initialized correctly...
Metastore initialized successfully on port[10001].
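The port reported here should be the one Pig is pointed at through hive.metastore.uris in PIG_OPTS. As a rough sketch of that consistency check, the helpers below pull the port out of a thrift URI and out of the startup message and compare them. All three function names are hypothetical, and the message format is assumed to match the line shown above.

```shell
#!/bin/sh
# Hypothetical helpers; none of these are part of HCatalog, Hive or Pig.

# Print the port from a thrift URI such as thrift://localhost:10001.
port_from_uri() {
    echo "${1##*:}"
}

# Print the port from a startup line of the form
# "Metastore initialized successfully on port[10001]."
port_from_startup_line() {
    echo "$1" | sed -n 's/.*port\[\([0-9][0-9]*\)].*/\1/p'
}

# Succeed only if a port was found in the startup line and it equals
# the port named in the URI.
same_metastore_port() {
    uri_port=$(port_from_uri "$1")
    line_port=$(port_from_startup_line "$2")
    [ -n "$line_port" ] && [ "$uri_port" = "$line_port" ]
}
```

For example, same_metastore_port 'thrift://localhost:10001' "$startup_line" succeeds when the line captured from hcat_server.sh names port 10001, and fails if Pig were still pointed at a different port.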

Set Up HCatalog with Pig

Congratulations, we’ve booted HCatalog on Amazon Elastic MapReduce! Now let’s get to work. Start up Pig and let’s store some records in HCatalog, then access them in Hive.

First, we must build Pig and Piggybank to access the jars we need.

[bash]$ cd
[bash]$ cd pig-0.10.0
[bash]$ ant
... ivy doing its thing for a while ...
     [echo] svnString : unknown
      [jar] Building jar: /home/hadoop/pig-0.10.0/build/pig-0.10.0-SNAPSHOT-withouthadoop.jar
     [copy] Copying 1 file to /home/hadoop/pig-0.10.0


Total time: 3 minutes 47 seconds

[bash]$ cd contrib/piggybank/java
[bash]$ ant
... ant doing its thing for a while ...
     [echo]  *** Creating pigudf.jar ***
      [jar] Building jar: /home/hadoop/pig-0.10.0/contrib/piggybank/java/piggybank.jar

Total time: 8 seconds

Store Pig Relations to HCatalog

[bash]$ cd
[bash]$ pig-0.10.0/bin/pig -l /tmp -v -w
/* HCatalog */
register /usr/local/hcat/share/hcatalog/hcatalog-0.4.0.jar
register /home/hadoop/hive-0.9.0/lib/*.jar

/* Avro */
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/avro-1.5.3.jar
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/json-simple-1.1.jar
register /home/hadoop/pig-0.10.0/contrib/piggybank/java/piggybank.jar
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar

define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

/* Date rounding into weekly buckets */
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/joda-time-1.6.jar
define ISOToWeek org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToWeek();

/* Cleanup the last run */
rmf /tmp/test

/* Load the enron emails from s3 */
emails = load 's3://rjurney.public/enron.avro' using AvroStorage();

/* Only include emails with both a from and at least one to address (some emails are only bcc) */
emails = filter emails by (from is not null) and (tos is not null) and (date is not null);

/* Project all pairs and round to the week */
pairs = foreach emails generate from.(address) as from,
                                FLATTEN(tos.(address)) as to,
                                ISOToWeek(date) as week;

/* Count the emails between pairs per week */
from_to_weekly_counts = foreach (group pairs by (from, to, week) parallel 10) generate
                                FLATTEN(group) as (from, to, week),
                                COUNT_STAR($1) as total;

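store from_to_weekly_counts into '/tmp/test';

/* Publish the pairs to HCatalog as well. Assumption: this statement is reconstructed from the
   job output (which reports a second store into from_to_week), not copied from the original
   script. HCatStorer expects the from_to_week table to exist already and the relation's field
   names and types to match the table's columns. */
store pairs into 'from_to_week' using org.apache.hcatalog.pig.HCatStorer();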
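The grouping step in the script amounts to counting identical (from, to, week) triples. For intuition only, the same counting can be sketched outside Hadoop on a small tab-separated sample. The name count_triples is hypothetical, and this is not how Pig executes the job.

```shell
#!/bin/sh
# count_triples: hypothetical helper for illustration.
# Reads tab-separated lines (from, to, week) on stdin and prints each
# distinct line followed by a tab and the number of times it occurred.
count_triples() {
    awk '{ seen[$0]++ } END { for (k in seen) print k "\t" seen[k] }' | sort
}

# Example (commented out):
# printf 'a@x.com\tb@x.com\t2001-01-15\n' | count_triples
```

The trailing sort only makes the output order deterministic, since awk does not guarantee any iteration order over its arrays.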

Saving the script as hcat.pig and running it gives us…

cd; pig-0.10.0/bin/pig -l /tmp -v -w hcat.pig


2012-06-22 21:56:34,124 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete
2012-06-22 21:57:09,198 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 22% complete
2012-06-22 22:02:59,547 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete
2012-06-22 22:10:09,594 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-06-22 22:10:09,596 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
0.20.205	0.10.0-SNAPSHOT	hadoop	2012-06-22 21:47:39	2012-06-22 22:10:09	FILTER


Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	Alias	Feature	Outputs
job_201206120007_0007	14	0	475	378	425	0	0	0	emails,pairs	MULTI_QUERY,MAP_ONLY	/tmp/test,from_to_week,

Successfully read 246391 records (4886 bytes) from: "s3://rjurney.public/enron.avro"

Successfully stored 1159680 records (82710669 bytes) in: "/tmp/test"
Successfully stored 1159680 records in: "from_to_week"

Total records written : 2319360
Total bytes written : 82710669
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

HCatalog in Hive

Success! Now let’s check out our results in Hive.

hive> show tables;
Time taken: 2.14 seconds

hive> select * from from_to_week limit 1;
crandallm@ndu.edu	jdasovic@enron.com	2001-01-15T00:00:00.000Z

hive> describe from_to_week;
from_address	string
to_address	string
week	string
Time taken: 0.245 seconds

hive> select from_address, to_address, week, count(*) as total
             from from_to_week group by from_address, to_address, week
             having total > 100
             order by total desc
             limit 100;

pete.davis@enron.com	pete.davis@enron.com	2002-01-07T00:00:00.000Z	494
pete.davis@enron.com	pete.davis@enron.com	2002-01-28T00:00:00.000Z	491
pete.davis@enron.com	pete.davis@enron.com	2002-01-21T00:00:00.000Z	443
pete.davis@enron.com	pete.davis@enron.com	2002-01-14T00:00:00.000Z	399
pete.davis@enron.com	pete.davis@enron.com	2001-12-31T00:00:00.000Z	386
michelle.nelson@enron.com	mike.maggi@enron.com	2001-11-19T00:00:00.000Z	249
pete.davis@enron.com	pete.davis@enron.com	2002-02-04T00:00:00.000Z	211
mike.maggi@enron.com	michelle.nelson@enron.com	2001-11-19T00:00:00.000Z	194
pete.davis@enron.com	pete.davis@enron.com	2001-12-24T00:00:00.000Z	168
pete.davis@enron.com	pete.davis@enron.com	2001-10-22T00:00:00.000Z	168
pete.davis@enron.com	pete.davis@enron.com	2001-12-17T00:00:00.000Z	168
pete.davis@enron.com	pete.davis@enron.com	2001-10-08T00:00:00.000Z	167
pete.davis@enron.com	pete.davis@enron.com	2001-04-23T00:00:00.000Z	165
pete.davis@enron.com	pete.davis@enron.com	2001-04-09T00:00:00.000Z	165
pete.davis@enron.com	pete.davis@enron.com	2001-10-15T00:00:00.000Z	163
pete.davis@enron.com	pete.davis@enron.com	2001-04-16T00:00:00.000Z	162
pete.davis@enron.com	pete.davis@enron.com	2001-04-02T00:00:00.000Z	153
pete.davis@enron.com	pete.davis@enron.com	2001-12-10T00:00:00.000Z	139
michelle.nelson@enron.com	mike.maggi@enron.com	2001-11-26T00:00:00.000Z	134
pete.davis@enron.com	pete.davis@enron.com	2001-02-26T00:00:00.000Z	113
pete.davis@enron.com	pete.davis@enron.com	2001-03-05T00:00:00.000Z	104
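
Most of the rows above are pete.davis@enron.com mailing himself. A small sketch for dropping such self-addressed rows from tab-separated query output follows. The name drop_self_mail is hypothetical, and the column order is assumed to be from_address, to_address, week, total as in the query above.

```shell
#!/bin/sh
# drop_self_mail: hypothetical filter for illustration.
# Reads tab-separated rows on stdin and prints only those rows whose
# first column (sender) differs from the second column (recipient).
drop_self_mail() {
    awk -F '\t' '$1 != $2'
}

# Example (commented out): pipe saved query results through the filter.
# drop_self_mail < results.tsv
```

One way to get such a file is to run the query non-interactively with hive -e and redirect its output. The file name results.tsv here is only a placeholder.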

And we’re cooking with HCatalog!


  • Hi Russell,

    Fantastic article!

    Can you explain how the data gets into HCatalog (“from_to_week”)? The only store statement I see in your Pig is:

    store from_to_weekly_counts into '/tmp/test';

  • Problem solved:

    Syntax for HCatLoader() and HCatStorer():

    A = LOAD 'default.nysefinal_part1' USING org.apache.hcatalog.pig.HCatLoader();
    B = filter A by stock_symbol == 'Q';

    store B into 'default.Hcatfinal_part11' USING org.apache.hcatalog.pig.HCatStorer();

    Here I am storing into a table without a partition. If we want to create a new partition, we can specify it in the HCatStorer parameter.

    Please refer to http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html
