Search Data at Scale in Five Minutes with Pig, Wonderdog and ElasticSearch

Working code examples for this post (for both Pig 0.10 and ElasticSearch 0.18.6) are available here.

ElasticSearch makes search simple. Built on Lucene, it provides a simple but rich JSON-over-HTTP query interface for searching clusters of one or one hundred machines. You can get started with ElasticSearch in five minutes, and it can scale to support heavy loads in the enterprise. ElasticSearch has a Whirr recipe, and there is even a Platform-as-a-Service provider, Bonsai.io.
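
To get a feel for that interface, here is a minimal sketch of a search request, assuming a local node on the default port 9200 and a hypothetical index named myindex that already holds documents:

# ElasticSearch speaks plain JSON over HTTP; any HTTP client will do.
curl -XGET 'http://localhost:9200/myindex/_search?pretty=true' -d '{
    "query" : { "query_string" : { "query" : "hello" } }
}'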

Apache Pig makes Hadoop simple. In a previous post, we prepared the Berkeley Enron Emails in Avro format. The entire dataset is available here: https://s3.amazonaws.com/rjurney.public/enron.avro. Let's check them out:

/* Register the jars Pig needs to read Avro records */
register /me/pig/contrib/piggybank/java/piggybank.jar;
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar;
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar;

define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

emails = LOAD '/enron/emails.avro' USING AvroStorage();
describe emails;
 
emails: {message_id: chararray,orig_date: chararray,datetime: chararray,from_address: chararray,from_name: chararray,subject: chararray,body: chararray,tos: {ARRAY_ELEM: (address: chararray,name: chararray)},ccs: {ARRAY_ELEM: (address: chararray,name: chararray)},bccs: {ARRAY_ELEM: (address: chararray,name: chararray)}}
 
illustrate emails;
 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| emails     | message_id:chararray                         | orig_date:chararray    | datetime:chararray       | from_address:chararray    | from_name:chararray                       | subject:chararray     | body:chararray                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | tos:bag{ARRAY_ELEM:tuple(address:chararray,name:chararray)}             | ccs:bag{ARRAY_ELEM:tuple(address:chararray,name:chararray)}             | bccs:bag{ARRAY_ELEM:tuple(address:chararray,name:chararray)}             |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|            |  | 2000-09-05 13:05:00    | 2000-09-05T13:05:00.000Z | david@ddh-pd.com          | DDH Product Design, Inc." "David Hayslett | Family Reunion Photos | Rod,\n\n It was nice to talk to you this evening. It did sound like you\n had a cold. There is no way to protect from going from air\n conditioning to the outside heat/humidity then back into\n the air conditioning. Just try to get some rest and we'll think positive\n for some cooler weather for you.\n\n Attached pls. find the photos I spoke of. There were 30 of them and I\nnarrowed them to the family I could name. I'll write more later.\n It would be great if you all came out around the holidays!\n Love,\n\n Dave........... \n - Family_Reunion_2000.zip\n | {(hayslettr@yahoo.com, )}                                               | {(rod.hayslett@enron.com, )}                                            | {(rod.hayslett@enron.com, )}                                             |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Wonderdog (contributed to open source via the Apache 2.0 license by Infochimps) makes searching Pig relations easy. Let's make the Enron emails searchable:

/* Now load Wonderdog and the ElasticSearch libraries */
register /me/wonderdog/target/wonderdog*.jar;
register /me/elasticsearch-0.18.6/lib/*.jar;

/* Nuke any previous email index, as we are about to replace it. */
sh curl -XDELETE 'http://localhost:9200/enron'

/* Create an ElasticSearch index for our emails */
sh curl -XPUT 'http://localhost:9200/enron/'

/* Store our emails as JSON, then re-load each record as a single
   chararray field full of JSON data, ignoring the pig_schema information. */
rmf /tmp/enron_emails_elastic
store emails into '/tmp/enron_emails_elastic' using JsonStorage();
json_emails = load '/tmp/enron_emails_elastic' AS (json_record:chararray);

/* Now we can store our email JSON data to ElasticSearch for indexing. */
store json_emails into 'es://enron/email?json=true&size=1000' USING
  com.infochimps.elasticsearch.pig.ElasticSearchStorage('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

/* Voila! We've made the Enron emails searchable! */
sh curl -XGET 'http://localhost:9200/enron/email/_search?q=oil&pretty=true&size=10'
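
As a quick sanity check, ElasticSearch's _count API reports how many documents landed in the index (a sketch, assuming the same local node):

# Hypothetical check: how many documents did we index under enron/email?
curl -XGET 'http://localhost:9200/enron/email/_count?pretty=true'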

Let's look at some emails from Tim Belden, former head of trading at Enron Energy Services:

curl -XGET 'http://localhost:9200/enron/email/_search?pretty=true' -d '{
    "query": {
        "query_string" : { "query" : "from.name:\"Tim Belden\"" }
    }
}'
{
      "_index" : "enron",
      "_type" : "email",
      "_id" : "NG7A1ZsASTe6kYo1UpKUqQ",
      "_score" : 5.0232315, "_source" : {"subject":"Correction -  Floor Meeting @ 12:15","tos":[{"address":"center.dl-portland@enron.com","name":"DL-Portland World Trade Center"}],"date":"2002-01-11T11:41:00.000Z","message_id":"","body":"The government affairs seminar will run until about 12:10.  Therefore, we will have the floor meeting at 12:15.\\n\\nTim","from":{"address":"tim.belden@enron.com","name":"Tim Belden"}}
    }, {
      "_index" : "enron",
      "_type" : "email",
      "_id" : "dLZkbG0USzyXa2I9liygFA",
      "_score" : 5.0232315, "_source" : {"subject":"Layoffs","tos":[{"address":"center.dl-portland@enron.com","name":"DL-Portland World Trade Center"}],"date":"2001-12-10T16:47:57.000Z","message_id":"","body":"Last week was the hardest week that our office has ever experienced.  When I look down the list of people who were laid off I see the names of some very talented people.  Things have unfolded so quickly -- the reality of the situation is settling in for me, and probably many of us, today.  \\n\\nAgain, I wanted to express my thanks and gratitude to everyone on our floor -- both current and former employees.  In the interest of keeping people informed about who is here and who is gone, below is a complete list of layoffs done last week within the west power team.\\n\\nLast Name\\tFirst Name\\t\\nCoffer\\tWalter\\t\\nAlport\\tKysa\\t\\nAusenhus\\tKara\\t\\nAxford\\tKathy\\t\\nBryson\\tJesse\\t\\nBurry\\tJessica\\t\\nButler\\tEmily\\t\\nCadena\\tAngela\\t\\nChen\\tAndy\\t\\nChen\\tLei\\t\\nCocke\\tStan\\t\\nCox\\tChip\\t\\nDalia\\tMinal\\t\\nDeas\\tPatty\\t\\nFrost\\tDavid\\t\\nFuller\\tDave\\t\\nGang\\tLisa\\t\\nGuillaume\\tDavid\\t\\nHall\\tErin\\t\\nHarasin\\tLeaf\\t\\nLinder\\tEric\\t\\nMainzer\\tElliot\\t\\nMaxwell\\tDan\\t\\nMcCarrel\\tSteven\\t\\nMehrer\\tAnna\\t\\nMerriss\\tSteven\\t\\nMerten\\tEric\\t\\nMiles\\tDarryl\\t\\nMullen\\tMark\\t\\nMumm\\tChris\\t\\nNalluri\\tSusmitha\\t\\nPorter\\tDavid\\t\\nPresto\\tDarin\\t\\nSoderquist\\tLarry\\t\\nSymes\\tKate\\t\\nThome\\tJennifer\\t\\nTully\\tMike\\t\\nVan Gelder\\tJohn\\t\\nWarner\\tNicholas\\t\\nWilson\\tSusan\\t\\nBrowner\\tVictor\\t\\nCalvert\\tGray\\t\\nCarranza\\tOctavio\\t\\nHa\\tVicki\\t\\nMara\\tSue\\t\\nPerrino\\tDave\\t\\nQureishi\\tIbrahim\\t\\nTurnipseed\\tEdith\\t\\nWong\\tMichael\\t","from":{"address":"tim.belden@enron.com","name":"Tim Belden"}}
    } ]
  }
}

Now that our data is indexed by ElasticSearch, we can load this same search query directly into a Pig relation.

register /me/wonderdog/target/wonderdog*.jar;
register /me/elasticsearch-0.18.6/lib/*.jar;
 
/* All emails from a sender with the name "Tim Belden".
   Note the use of url encoded values for ' ' (%20) and '"' (%22) */
tim_emails = LOAD 'es://enron/email?q=from.name:%22Tim%20Belden%22' USING com.infochimps.elasticsearch.pig.ElasticSearchStorage('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins') AS (doc_id:chararray, contents:chararray);
store tim_emails INTO '/tmp/tim.json' USING JsonStorage();
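
To spot-check what came back (each record is a document id plus the raw JSON source), a quick sketch that dumps a few rows:

/* Peek at a handful of (doc_id, contents) records */
sample_emails = LIMIT tim_emails 3;
dump sample_emails;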

There are only 258 emails from Tim Belden in the set, which is interesting given that he is considered the central figure in Enron's manipulation of the California energy market. Let's verify that result against our original data.

tim_emails = FILTER emails BY from.name == 'Tim Belden';
total = FOREACH (GROUP tim_emails ALL) GENERATE 'total' AS foo, COUNT_STAR($1) AS total;
dump total;
(total,258)

The ElasticSearch query is correct. The sparsity of emails from Tim was noted by Jeff Heer while preparing this dataset.

Note, however, that the ElasticSearch UDF returns almost instantly, while the FILTER is slooooooowwww… which brings us to the real boon of ElasticSearch: indexing data so that small portions of a larger dataset can be loaded by search query. The other boon, of course, is serving search to applications via HTTP.
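
For example, here is a sketch along the lines of the load above, with a hypothetical query that pulls only the emails mentioning California out of the index rather than scanning the whole dataset:

/* Load just the matching documents; the cluster does the filtering for us */
ca_emails = LOAD 'es://enron/email?q=body:california' USING
  com.infochimps.elasticsearch.pig.ElasticSearchStorage('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins')
  AS (doc_id:chararray, contents:chararray);
store ca_emails INTO '/tmp/california_emails.json' USING JsonStorage();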

~ Russell Jurney


Comments

Viv
August 21, 2012 at 12:17 pm

It would be wonderful if someone could suggest a workaround for the above-mentioned error, or for PIG-2872.

    September 24, 2012 at 4:21 pm

    Sorry to hear you had trouble. I had the same issues, and I was able to address them via https://github.com/infochimps-labs/wonderdog/pull/8 as well as by using these parameters:

    com.infochimps.elasticsearch.pig.ElasticSearchStorage('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

Evert
August 13, 2012 at 2:21 am

I’m giving up on trying to make this work. It’s not hard to see what’s going wrong, but it’s harder to figure out how to fix it:

ElasticSearchStorage.elasticSearchSetup() adds elasticsearch.yml to mapred.cache.files in the Configuration of the given Job, via Wonderdog's implementation of Pig's StoreFuncInterface.setStoreLocation(). However, the configuration passed to setStoreLocation() through Pig's JobControlCompiler.getJob() is a copy of the Configuration created in getJob() (new org.apache.hadoop.mapreduce.Job(nwJob.getConfiguration())), so nothing added to the Configuration in setStoreLocation() makes it into nwJob's Configuration. This seems to mean that the DistributedCache cannot be used from within an implementation of StoreFuncInterface; in fact, nothing can be added to the Configuration at all.

I don’t know whether this is intentionally done by Pig. A quick fix could be to not pass a new Job object to setStoreLocation from within getJob, but rather use the existing instance nwJob. However, in the current getJob code a new Job object is passed explicitly, and I’m not deep enough in Pig to understand whether these are the correct semantics of StoreFunc.setStoreLocation.

I’ve filed a Pig jira: https://issues.apache.org/jira/browse/PIG-2872.

Evert
August 7, 2012 at 7:43 am

Fun post! I can't get Wonderdog to work, though: ElasticSearchStorage can't find my elasticsearch.yml, even though it's there:

org.elasticsearch.env.FailedToResolveConfigException: Failed to resolve config path [home/evert/Downloads/elasticsearch-0.18.6/config/elasticsearch.yml], tried file path [/home/evert/Downloads/elasticsearch-0.18.6/config/elasticsearch.yml], path file [/disk3/mapred.local.dir/taskTracker/evert/jobcache/job_201208051131_0152/attempt_201208051131_0152_m_000000_3/work/config/home/evert/Downloads/elasticsearch-0.18.6/config/elasticsearch.yml], and classpath

but:

$ ls -lahtr /home/evert/Downloads/elasticsearch-0.18.6/config/elasticsearch.yml
-rw-r--r-- 1 evert evert 12K Dec 13 2011 /home/evert/Downloads/elasticsearch-0.18.6/config/elasticsearch.yml

I’m on a Hadoop 0.20 cluster and run Elasticsearch 0.18.6 (but got the same error on 0.19.8), wonderdog trunk (to include a config fix @ https://github.com/infochimps-labs/wonderdog/pull/8), and Pig 0.10 on my local machine.

Buffy
July 10, 2012 at 11:36 am

It's "voila", not "wallah"! Argh.

July 10, 2012 at 11:28 am

Could you include an example in the README of how to run the .pig files, and what the prerequisites are?
