The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. Over the course of the series, we're going to explore the full lifecycle of data in the enterprise: introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails into document format and serialize the documents in Avro. Now that we've done this, we're ready to get down to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We're also going to use Amazon Elastic MapReduce (EMR) on EC2, as HIVE local mode requires Hadoop local mode, which can be tricky to get working.

Loading Avros in Pig with AvroStorage

In this post, we're going to extract records about the people sending and receiving these emails from our email documents and store them in a format HIVE can read. We'll use HIVE to check the results of our analysis in Pig, and then run additional descriptive queries on this data. The Enron emails are available for download on S3 in Avro format here.

For analysts interested in time series, we might prepare a table specifically for time-series analysis of traffic between pairs of email addresses. The following Pig code loads the Enron emails in Avro format, and it shows how to use DESCRIBE and ILLUSTRATE to watch our data change as we transform it.

register /me/pig/contrib/piggybank/java/piggybank.jar;
 
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar;
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar;
register /me/pig/build/ivy/lib/Pig/joda-time-1.6.jar;
 
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
 
set default_parallel 10
set aggregate.warning true
 
/* Remove any previous output directory so the store at the end won't fail on re-runs. */
rmf /enron/from_to_date
 
emails = load '/enron/emails.avro' using AvroStorage();
 
/* We can view our schema with describe. */
describe emails;
 
/* Get samples of the data instantly with illustrate! */
illustrate emails;
 
/* Flattening tos.address emits one (from_address, to_address, date) record per recipient. */
froms = foreach emails generate flatten(from.address) as from_address, flatten(tos.address) as to_address, date;
froms = filter froms by to_address is not null and to_address != '';
 
illustrate froms;
 
store froms into '/enron/from_to_date';
 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| emails     | message_id:chararray                        | date:chararray           | from:tuple(address:chararray,name:chararray)             | subject:chararray     | body:chararray                                                                                                                                                                                   | tos:bag{ARRAY_ELEM:tuple(address:chararray,name:chararray)}                 | ccs:bag{ARRAY_ELEM:tuple(address:chararray,name:chararray)}                 | bccs:bag{ARRAY_ELEM:tuple(address:chararray,name:chararray)}                 |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|            |  | 2002-02-14T07:37:36.000Z | (kevin.hyatt@enron.com, Kevin Hyatt)                     | Hey                   | Hope you're enjoying the parent visit!  Thanks for the VD card, that was cute.  If Jesse Jackson shows up here one more time, we're getting a group together and going downstairs to moon him.\n | {(lmfoust@aol.com, )}                                                       | {}                                                                          | {}                                                                           |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
| froms     | from_address:chararray     | to_address:chararray     | date:chararray           |
------------------------------------------------------------------------------------------------
|           | kevin.hyatt@enron.com      | lmfoust@aol.com          | 2002-02-14T07:37:36.000Z |
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
| froms     | from_address:chararray     | to_address:chararray     | date:chararray           |
------------------------------------------------------------------------------------------------
|           | kevin.hyatt@enron.com      | lmfoust@aol.com          | 2002-02-14T07:37:36.000Z |
------------------------------------------------------------------------------------------------

DESCRIBE shows us the schema of our data, which is handy when we load Avros via AvroStorage (the schema travels with the records) or perform more complex operations.

My favorite command, though, is ILLUSTRATE. ILLUSTRATE is described in detail in the paper Generating Example Data for Dataflow Programs. As you can see above, ILLUSTRATE gives us an instant run-through of our code. This is particularly useful when we employ UDFs, or User Defined Functions, which allow us to perform arbitrary operations on any data flowing through our Pig script. This rapidly accelerates development on Hadoop, which is a batch system: on a production Hadoop cluster you can easily wait minutes or even hours for a long data pipeline, a big or intense job, or even a simple job on an overloaded cluster to complete. ILLUSTRATE lets you bypass batch and iterate on your script in real time in grunt.
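To make the idea of a Pig UDF concrete, here is a hypothetical example of my own (not code from this series): a UDF is just a Java class extending EvalFunc, and once its jar is registered, ILLUSTRATE will exercise it against sample records just like any built-in function.

package com.example.pig.udf;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

/* Hypothetical UDF: normalize an email address to lower case. */
public class LowerEmail extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Pig passes the UDF's arguments packed into a Tuple.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toLowerCase();
    }
}

After registering its jar, you could call it in a foreach, for example generate com.example.pig.udf.LowerEmail(from_address), and ILLUSTRATE would immediately show its output on example data.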

No other Hadoop interface provides this unique, powerful capability. Combined with in-record schemas via AvroStorage, ILLUSTRATE amplifies Pig's power greatly. Pig 0.10 is a strongly recommended upgrade.

To the Cloud!

Now that we've stored our time series data, let's query it naturally in SQL on Hadoop with HIVE. HIVE is available here. To run HIVE locally we would have to run Hadoop locally, and all the configuration can get confusing, so let's take this opportunity to hit the cloud! Instructions for using Amazon's Elastic MapReduce service, a Hadoop in the cloud, are available here.

s3cmd is an easy-to-use command-line tool for manipulating files on Amazon's Simple Storage Service (S3). We'll use it to upload our local Pig results to S3 for consumption by HIVE on EMR.

[bash]$ s3cmd put --recursive from_to_date s3://enron.data/from.to.date/

Querying Data with HIVE

Now we can load the data into HIVE as an external table and query it using SQL. Pig's store wrote tab-delimited text (PigStorage's default), so we declare the fields as terminated by '\t' (note: do NOT use a '_' in the name of a bucket or folder):

create external table from_to(from_address string, to_address string, dt string)
    row format delimited fields terminated by '\t' stored as textfile location 's3://enron.data/from.to.date';
 
-- Get this from https://github.com/rjurney/timeseriesserde
add jar /home/hadoop/timeseriesserde.jar;
create temporary function TimeSeries as 'com.example.hive.udf.TimeSeries';

Our table is simple:

hive> describe from_to; 
 
OK 
 
from_address string
to_address string
dt string 
 
Time taken: 0.091 seconds

Let's inspect our records:

hive> select * from from_to limit 10; 
 
OK 
 
crandallm@ndu.edu jdasovic@enron.com 2001-01-16T09:51:00.000Z
sara.shackleton@enron.com wayne.gresham@enron.com 1999-08-11T06:02:00.000Z
sara.shackleton@enron.com dale.rasmussen@enron.com 1999-08-11T06:02:00.000Z
louise.kitchen@enron.com tana.jones@enron.com 2000-07-27T03:25:00.000Z
sally.beck@enron.com wanda.curry@enron.com 2000-02-23T04:15:00.000Z
jeff.dasovich@enron.com susan.landwehr@enron.com 2001-05-14T04:03:00.000Z
taffy.milligan@enron.com kay.mann@enron.com 2001-05-24T05:20:00.000Z
chris.germany@enron.com wdgofalto@aol.com 2000-03-30T13:04:00.000Z
matthew.lenhart@enron.com erichardson@sarofim.com 2001-10-10T12:25:20.000Z
d..steffes@enron.com howard.fromer@enron.com 2001-10-16T11:21:38.000Z 
 
Time taken: 3.038 seconds

Let's count the emails sent between each pair of addresses:

hive> select from_address, to_address, count(*) as total from from_to group by from_address, to_address order by total desc limit 100; 
 
OK
 
pete.davis@enron.com pete.davis@enron.com 4489
vince.kaminski@enron.com vkaminski@aol.com 1143
jeff.dasovich@enron.com susan.mara@enron.com 935
jeff.dasovich@enron.com paul.kaufman@enron.com 879
jeff.dasovich@enron.com richard.shapiro@enron.com 864
jeff.dasovich@enron.com james.steffes@enron.com 781
jeff.dasovich@enron.com karen.denne@enron.com 724
jeff.dasovich@enron.com harry.kingerski@enron.com 546
jeff.dasovich@enron.com skean@enron.com 529
jeff.dasovich@enron.com sandra.mccubbin@enron.com 500
michelle.nelson@enron.com mike.maggi@enron.com 494

Manipulating Data with Custom UDFs and SerDes in HIVE

Finally, one of the benefits of using HIVE over a traditional data warehouse is the ability to extend it with simple UDFs and SerDes, which play a role much like Pig UDFs and storage functions: UDFs transform records, while SerDes teach HIVE to read and write different data formats. I've written a simple UDF called TimeSeries, available here (the repository is named timeseriesserde); to use it, download the jar into your working directory, then add the jar and create the temporary function in hive as shown above.
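For the curious, here is a minimal sketch of what such a UDF might look like. This is my own approximation, not necessarily the exact code in the timeseriesserde repository; it simply renders an integer count as a bar of asterisks.

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

/* Hypothetical sketch: turn a count into a bar of '*' characters. */
public final class TimeSeries extends UDF {
    public Text evaluate(final IntWritable count) {
        if (count == null) {
            return null;
        }
        StringBuilder bar = new StringBuilder();
        for (int i = 0; i < count.get(); i++) {
            bar.append('*');
        }
        return new Text(bar.toString());
    }
}

With the jar added and the function created as shown earlier, we can chart email volume per day: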

select to_date(dt) as dt, TimeSeries(CAST(count(*) AS INT)) as stars, count(*) as total from from_to group by to_date(dt) order by to_date(dt);

The… admittedly humorous hack of an output looks like this :)

2002-05-23 ********************************************************************** 70
2002-05-24 ******************************************************* 55
2002-05-27 **** 4
2002-05-28 ******************************************************************* 67
2002-05-29 ***************************************************************************************** 89
2002-05-30 ******************************************************************************************************************* 115
2002-05-31 **************************************************************************************************** 100
2002-06-03 *************************************************************************** 75
2002-06-04 ******************************************************* 55
2002-06-05 ************************************************************************* 73
2002-06-06 ******************************************************************************** 80
2002-06-07 ****************************************************************** 66

Or, if you zoom out…

~ Russell Jurney


Comments

June 18, 2012 at 7:45 am

Hey Russell,

This is a good article, but I have some concerns:
1) In practice you should be using Hive tables with partitions, how would you store data from pig to Hive partitions?
2) I think it would be easier to use Cloudera VM instead of EMR to run Hive for testing.

Thanks

    June 18, 2012 at 9:25 am

    Thanks for your comment. We'll cover #1 in the next post in the series.

Chris
June 15, 2012 at 10:15 am

Hi Russell,

This is a really nice series, thank you.

Is there a reason you aren’t also storing the output from Pig as Avro format and then using Hive to query that?

Regards,
Chris

    June 15, 2012 at 11:25 am

    Chris, no reason in particular. I wrote TSV for simplicity’s sake. Being new to Hive, I had some trouble getting the data to load in Hive on EMR.

June 13, 2012 at 4:27 pm

I’m following your process but keep running into a parsing error when trying to write out the avro data or even read it in. What version of pig and the avro UDF were you using? I’m using the Cloudera Hadoop VM if it helps.

    June 14, 2012 at 1:45 pm

    You need to be using Pig v0.10.0+, which Cloudera doesn’t yet offer. You can download this newer version using the link above.


June 5, 2012 at 9:43 am

Very interesting, thank you for writing.

