The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce
Series Introduction
This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In this series, we’re exploring the full lifecycle of data in the enterprise: introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.
- Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
- Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
- Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.
Sharing Data with HCatalog
Apache HCatalog is a table and storage management service for data created using Apache Hadoop. It streamlines the sharing of Hadoop data between Pig, MapReduce and Hive. HCatalog is available for download here. An excellent tutorial on installing HCatalog is available here. The HCatalog wiki is an excellent source of up-to-date information. Finally, the HCatalog mailing lists are a great way to get involved with the HCatalog community.
Once again, we’re going to use Hive, Pig and Amazon Elastic MapReduce (EMR) to process the Enron emails. The Enron emails are available for download on S3 in Avro format here. Hive is available here. To run Hive locally we have to run Hadoop locally, and all the configuration can get confusing. So we take this opportunity to hit the cloud. Instructions for using Amazon’s Elastic MapReduce service – a Hadoop in the cloud – are available here.
Please complete post two first to create a derived dataset in Pig, so that we can share it via HCatalog. We’re going to store the results of our script in HCatalog.
Install Hive 0.9
First, ssh to the master node of your EMR cluster, and install the latest version of Hive, version 0.9. Then fire up Hive to make sure everything runs well:
[bash]$ wget http://apache.cs.utah.edu/hive/hive-0.9.0/hive-0.9.0.tar.gz
[bash]$ tar -xvzf hive-0.9.0.tar.gz
[bash]$ cd hive-0.9.0
[bash]$ bin/hive
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201206120020_2040132181.txt
hive> SHOW TABLES;
OK
Time taken: 26.524 seconds
hive> quit;
Don’t forget to copy your existing Hive configuration into the new installation’s conf directory, where HCatalog will expect it, and set HIVE_HOME:
[bash]$ cp hive/conf/hive-site.xml hive-0.9.0/conf
[bash]$ export HIVE_HOME=/home/hadoop/hive-0.9.0
Install Forrest
Install Apache Forrest, using the download URLs in the commands that follow:
[bash]$ wget http://mirrors.axint.net/apache//forrest/apache-forrest-0.9-sources.tar.gz
[bash]$ wget http://mirrors.axint.net/apache//forrest/apache-forrest-0.9-dependencies.tar.gz
[bash]$ tar -xvzf apache-forrest-0.9-sources.tar.gz
[bash]$ tar -xvzf apache-forrest-0.9-dependencies.tar.gz
[bash]$ cd apache-forrest-0.9
[bash]$ bin/forrest
...
BUILD SUCCESSFUL
Total time: 5 seconds
Now set FORREST_HOME:
[bash]$ echo 'export FORREST_HOME=/home/hadoop/apache-forrest-0.9' >> ~/.bash_profile
[bash]$ source ~/.bash_profile
[bash]$ echo $FORREST_HOME
Install MySQL Connector (Java Driver)
Now install the MySQL JDBC Connector.
[bash]$ cd
[bash]$ wget http://mirror.services.wisc.edu/mysql/Downloads/Connector-J/mysql-connector-java-3.1.14.tar.gz
[bash]$ tar -xvzf mysql-connector-java-3.1.14.tar.gz
Now start the Hive metastore from the Hive installation directory:
[bash]$ cd $HIVE_HOME
[bash]$ ./bin/hive --config ./conf --service metastore &
Install Pig 0.10
Next, install and set up Pig 0.10 and make sure it works:
[bash]$ wget http://mirrors.gigenet.com/apache/pig/pig-0.10.0/pig-0.10.0.tar.gz
[bash]$ tar -xvzf pig-0.10.0.tar.gz
[bash]$ cd pig-0.10.0
[bash]$ export PIG_HOME=/home/hadoop/pig-0.10.0
[bash]$ bin/pig
grunt> pwd
hdfs://10.4.115.51:9000/user/hadoop
grunt> mkdir /enron
grunt> ls /enron
grunt>
Set Up Our Environment
Edit and re-run our .bash_profile to set up our environment:
export HADOOP_HOME=/home/hadoop
export HCAT_HOME=/usr/local/hcat
export PIG_HOME=/home/hadoop/pig-0.10.0
export HIVE_HOME=/home/hadoop/hive-0.9.0
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HIVE_HOME/lib/hive-metastore-0.9.0.jar:$HIVE_HOME/lib/libthrift-0.7.0.jar:$HIVE_HOME/lib/hive-exec-0.9.0.jar:$HIVE_HOME/lib/libfb303-0.7.0.jar:$HIVE_HOME/lib/jdo2-api-2.3-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:$HIVE_HOME/lib/slf4j-api-1.6.1.jar
export PIG_OPTS=-Dhive.metastore.uris=thrift://localhost:10001
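A hand-typed classpath this long is easy to break with a stray space or a wrapped line. As a rough sketch, the same value can be assembled from a list of entries. `join_classpath` and `pig_classpath` are hypothetical helper names of our own, not part of Pig, Hive, or HCatalog, and the jar paths are simply the ones from the profile above:

```shell
# join_classpath: hypothetical helper. Joins its arguments with ':'.
# The body runs in a subshell so the IFS change does not leak to the caller.
join_classpath() (
  IFS=:
  printf '%s\n' "$*"
)

# pig_classpath: hypothetical helper. Prints the classpath entries from the
# profile above as one colon-separated string. The defaults mirror this post.
pig_classpath() (
  hcat=${HCAT_HOME:-/usr/local/hcat}
  hive=${HIVE_HOME:-/home/hadoop/hive-0.9.0}
  hadoop=${HADOOP_HOME:-/home/hadoop}
  join_classpath \
    "$hcat/share/hcatalog/hcatalog-0.4.0.jar" \
    "$hive/lib/hive-metastore-0.9.0.jar" \
    "$hive/lib/libthrift-0.7.0.jar" \
    "$hive/lib/hive-exec-0.9.0.jar" \
    "$hive/lib/libfb303-0.7.0.jar" \
    "$hive/lib/jdo2-api-2.3-ec.jar" \
    "$hive/conf" \
    "$hadoop/conf" \
    "$hive/lib/slf4j-api-1.6.1.jar"
)

PIG_CLASSPATH=$(pig_classpath)
export PIG_CLASSPATH
```

Keeping one entry per line makes a missing or misspelled jar much easier to spot than in a single long export.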
Set Up HCatalog
Now set up HCatalog on the master node:
[bash]$ mkdir /tmp/hcat_source_release
[bash]$ cp hcatalog-src-0.4.0-incubating.tar.gz /tmp/hcat_source_release
[bash]$ cd /tmp/hcat_source_release
[bash]$ tar -xzf hcatalog-src-0.4.0-incubating.tar.gz
[bash]$ cd hcatalog-src-0.4.0-incubating
[bash]$ ant -Dhcatalog.version=0.4.0 -Dforrest.home=$FORREST_HOME tar
[bash]$ cp build/hcatalog-0.4.0.tar.gz /tmp
[bash]$ cd /tmp
[bash]$ tar -xvzf hcatalog-0.4.0.tar.gz
[bash]$ cd hcatalog-0.4.0
[bash]$ sudo share/hcatalog/scripts/hcat_server_install.sh -r /usr/local/hcat -d /home/hadoop/mysql-connector-java-3.1.14 -h $HADOOP_HOME -p 9933
Installing into [/usr/local/hcat]
Installation successful
Configure Hive and HCatalog
[bash]$ sudo cp $HIVE_HOME/conf/hive-site.xml /usr/local/hcat/etc/hcatalog/
[bash]$ sudo ln -s $HIVE_HOME/bin/hive /bin/hive
[bash]$ sudo ln -s $HIVE_HOME/bin/hive-config.sh
Now we need to configure Hive and HCatalog. Thanks to some help from @khorgath, a Hive configuration file is available at http://s3.amazonaws.com/rjurney_public_web/hive-site.xml.
[bash]$ cd /usr/local/hcat/etc/hcatalog
[bash]$ mv hive-site.xml hive-site.orig.xml
[bash]$ wget http://s3.amazonaws.com/rjurney_public_web/hive-site.xml
[bash]$ mv /home/hadoop/hive-0.9.0/conf/hive-site.xml /home/hadoop/hive-0.9.0/conf/hive-site.orig.xml
[bash]$ ln -s /usr/local/hcat/etc/hcatalog/hive-site.xml /home/hadoop/hive-0.9.0/conf/hive-site.xml
[bash]$ hadoop fs -mkdir /tmp
[bash]$ hadoop fs -chmod 777 /tmp
[bash]$ hadoop fs -mkdir /user/hive/warehouse
[bash]$ hadoop fs -chmod 777 /user/hive/warehouse
And finally…
Start HCatalog
[bash]$ cd /usr/local/hcat
[bash]$ sudo sbin/hcat_server.sh start
Started metastore server init, testing if initialized correctly...
Metastore initialized successfully on port[10001].
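The port reported at startup is the same one our PIG_OPTS setting points Pig at through hive.metastore.uris. Here is a small sketch for pulling the port out of the startup message and building the matching PIG_OPTS value. `metastore_port` and `pig_opts_for_port` are hypothetical helpers of our own, and the sketch assumes the message keeps the `port[NNNN]` shape shown above:

```shell
# metastore_port: hypothetical helper. Reads text on stdin and prints the
# number found inside "port[...]". Prints nothing if no such text is present.
metastore_port() {
  sed -n 's/.*port\[\([0-9][0-9]*\)].*/\1/p'
}

# pig_opts_for_port: hypothetical helper. Prints the PIG_OPTS value for a
# metastore listening on localhost at the given port.
pig_opts_for_port() {
  printf '%s\n' "-Dhive.metastore.uris=thrift://localhost:$1"
}

# Example using the startup message from this post.
message='Metastore initialized successfully on port[10001].'
port=$(printf '%s\n' "$message" | metastore_port)
pig_opts_for_port "$port"
```

If the value printed here differs from the PIG_OPTS in .bash_profile, Pig will be looking for the metastore on the wrong port.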
Set Up HCatalog with Pig
Congratulations, we’ve booted HCatalog on Amazon Elastic MapReduce! Now let’s get to work. Start up Pig and let’s store some records in HCatalog, then access them in Hive.
First, we must build Pig and Piggybank to access the jars we need.
[bash]$ cd
[bash]$ cd pig-0.10.0
[bash]$ ant
... ivy doing its thing for a while ...
buildJar-withouthadoop:
[echo] svnString : unknown
[jar] Building jar: /home/hadoop/pig-0.10.0/build/pig-0.10.0-SNAPSHOT-withouthadoop.jar
[copy] Copying 1 file to /home/hadoop/pig-0.10.0
jar-all:
BUILD SUCCESSFUL
Total time: 3 minutes 47 seconds
[bash]$ cd contrib/piggybank/java
[bash]$ ant
... ant doing its thing for a while ...
jar:
[echo] *** Creating pigudf.jar ***
[jar] Building jar: /home/hadoop/pig-0.10.0/contrib/piggybank/java/piggybank.jar
BUILD SUCCESSFUL
Total time: 8 seconds
Store Pig Relations to HCatalog
[bash]$ cd
[bash]$ pig-0.10.0/bin/pig -l /tmp -v -w
/* HCatalog */
register /usr/local/hcat/share/hcatalog/hcatalog-0.4.0.jar
register /home/hadoop/hive-0.9.0/lib/*.jar
/* Avro */
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/avro-1.5.3.jar
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/json-simple-1.1.jar
register /home/hadoop/pig-0.10.0/contrib/piggybank/java/piggybank.jar
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
/* Date rounding into weekly buckets */
register /home/hadoop/pig-0.10.0/build/ivy/lib/Pig/joda-time-1.6.jar
define ISOToWeek org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToWeek();
/* Cleanup the last run */
rmf /tmp/test
/* Load the enron emails from s3 */
emails = load 's3://rjurney.public/enron.avro' using AvroStorage();
/* Only include emails with both a from and at least one to address (some emails are only bcc) */
emails = filter emails by (from is not null) and (tos is not null) and (date is not null);
/* Project all pairs and round to the week */
pairs = foreach emails generate from.(address) as from,
    FLATTEN(tos.(address)) as to,
    ISOToWeek(date) as week;
/* Count the emails between pairs per week */
from_to_weekly_counts = foreach (group pairs by (from, to, week) parallel 10) generate
    FLATTEN(group) as (from, to, week),
    COUNT_STAR($1) as total;
store from_to_weekly_counts into '/tmp/test';
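As listed, the script writes only to /tmp/test, while the job output that follows also reports a second output named from_to_week. A statement of roughly the following shape, using the HCatStorer class from the HCatalog load/store documentation, would send a relation to an HCatalog table. This is a sketch, not the statement from the original run: `hcat_store_statement` is a hypothetical helper, the choice of the `pairs` relation is inferred from the job's alias list, and we assume the from_to_week table already exists with column names matching the relation's fields.

```shell
# hcat_store_statement: hypothetical helper. Prints a Pig store statement
# that writes relation $1 to HCatalog table $2 through HCatStorer.
hcat_store_statement() {
  printf "store %s into '%s' using org.apache.hcatalog.pig.HCatStorer();\n" "$1" "$2"
}

# Print the statement for the relation and table assumed in this sketch.
hcat_store_statement pairs from_to_week

# To append it to the script instead (hcat.pig is the file name used in the
# run command in this post):
#   hcat_store_statement pairs from_to_week >> hcat.pig
```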
Running our code gives us…
[bash]$ cd; pig-0.10.0/bin/pig -l /tmp -v -w hcat.pig
...
2012-06-22 21:56:34,124 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete
2012-06-22 21:57:09,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 22% complete
2012-06-22 22:02:59,547 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete
2012-06-22 22:10:09,594 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-06-22 22:10:09,596 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.205 0.10.0-SNAPSHOT hadoop 2012-06-22 21:47:39 2012-06-22 22:10:09 FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201206120007_0007 14 0 475 378 425 0 0 0 emails,pairs MULTI_QUERY,MAP_ONLY /tmp/test,from_to_week,
Input(s):
Successfully read 246391 records (4886 bytes) from: "s3://rjurney.public/enron.avro"
Output(s):
Successfully stored 1159680 records (82710669 bytes) in: "/tmp/test"
Successfully stored 1159680 records in: "from_to_week"
Counters:
Total records written : 2319360
Total bytes written : 82710669
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
HCatalog in Hive
Success! Now let’s check out our results in Hive.
hive> SHOW TABLES;
OK
from_to_week
Time taken: 2.14 seconds
hive> SELECT * FROM from_to_week LIMIT 1;
OK
crandallm@ndu.edu jdasovic@enron.com 2001-01-15T00:00:00.000Z
hive> DESCRIBE from_to_week;
OK
from_address string
to_address string
week string
Time taken: 0.245 seconds
hive> SELECT from_address, to_address, week, COUNT(*) AS total FROM from_to_week GROUP BY from_address, to_address, week HAVING total > 100 ORDER BY total DESC LIMIT 100;
OK
pete.davis@enron.com pete.davis@enron.com 2002-01-07T00:00:00.000Z 494
pete.davis@enron.com pete.davis@enron.com 2002-01-28T00:00:00.000Z 491
pete.davis@enron.com pete.davis@enron.com 2002-01-21T00:00:00.000Z 443
pete.davis@enron.com pete.davis@enron.com 2002-01-14T00:00:00.000Z 399
pete.davis@enron.com pete.davis@enron.com 2001-12-31T00:00:00.000Z 386
michelle.nelson@enron.com mike.maggi@enron.com 2001-11-19T00:00:00.000Z 249
pete.davis@enron.com pete.davis@enron.com 2002-02-04T00:00:00.000Z 211
mike.maggi@enron.com michelle.nelson@enron.com 2001-11-19T00:00:00.000Z 194
pete.davis@enron.com pete.davis@enron.com 2001-12-24T00:00:00.000Z 168
pete.davis@enron.com pete.davis@enron.com 2001-10-22T00:00:00.000Z 168
pete.davis@enron.com pete.davis@enron.com 2001-12-17T00:00:00.000Z 168
pete.davis@enron.com pete.davis@enron.com 2001-10-08T00:00:00.000Z 167
pete.davis@enron.com pete.davis@enron.com 2001-04-23T00:00:00.000Z 165
pete.davis@enron.com pete.davis@enron.com 2001-04-09T00:00:00.000Z 165
pete.davis@enron.com pete.davis@enron.com 2001-10-15T00:00:00.000Z 163
pete.davis@enron.com pete.davis@enron.com 2001-04-16T00:00:00.000Z 162
pete.davis@enron.com pete.davis@enron.com 2001-04-02T00:00:00.000Z 153
pete.davis@enron.com pete.davis@enron.com 2001-12-10T00:00:00.000Z 139
michelle.nelson@enron.com mike.maggi@enron.com 2001-11-26T00:00:00.000Z 134
pete.davis@enron.com pete.davis@enron.com 2001-02-26T00:00:00.000Z 113
pete.davis@enron.com pete.davis@enron.com 2001-03-05T00:00:00.000Z 104
And we’re cooking with HCatalog!
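The table check can also be scripted rather than typed at the Hive prompt. A minimal sketch, assuming `hive -S -e` prints one table name per line; `table_listed` is a hypothetical helper of our own:

```shell
# table_listed: hypothetical helper. Reads table names on stdin, one per
# line, and succeeds only if one of them is exactly $1.
table_listed() {
  grep -qxF -- "$1"
}

# Sample input standing in for the output of: hive -S -e 'SHOW TABLES;'
if printf '%s\n' from_to_week | table_listed from_to_week; then
  echo "from_to_week is registered"
fi

# On the cluster, the pipeline would look like this:
#   hive -S -e 'SHOW TABLES;' | table_listed from_to_week
```

Matching the whole line as a fixed string avoids a false positive from a table whose name merely contains from_to_week.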
Hi Russell,
Fantastic article!
Can you explain how the data gets into HCatalog (“from_to_week”)? The only store statement I see in your Pig is:
store from_to_weekly_counts into '/tmp/test';
Thanks for noticing! It looks like I screwed up and forgot to also use HCatStorage. I’ll update the code.
Waiting for your updated code. I have the same question as thelabdude.
Problem Solved :
Syntax for HCatLoader() and HCatStorer():
A = LOAD 'default.nysefinal_part1' USING org.apache.hcatalog.pig.HCatLoader();
B = filter A by stock_symbol == 'Q';
store B into 'default.Hcatfinal_part11' USING org.apache.hcatalog.pig.HCatStorer();
Here I am storing into a table without partitions. If we want to create a new partition, we can specify it in the HCatStorer parameter.
Please refer to http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html
Has anyone worked with HCatalog + MapReduce?
I am facing a problem with that.