August 16, 2012

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js

Series Introduction

Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.
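The dataflow style is easy to picture outside Hadoop. Below is an illustrative sketch of the same pipeline shape (load, filter, group, count) over plain JavaScript arrays; the records are invented for illustration, and Pig would of course run this as MapReduce jobs over HDFS rather than in memory.

```javascript
// Illustrative only: a Pig-style pipeline (load -> filter -> group -> count)
// over a hypothetical in-memory set of email records.
var emails = [
  { from: 'jeff', subject: 'markets' },
  { from: 'jeff', subject: 'rates' },
  { from: 'kay',  subject: null }
];

// filter: keep emails that have a subject
var withSubject = emails.filter(function(e) { return e.subject !== null; });

// group ... by from / COUNT: tally emails per sender
var counts = {};
withSubject.forEach(function(e) {
  counts[e.from] = (counts[e.from] || 0) + 1;
});

console.log(counts); // → { jeff: 2 }
```

Each step consumes a relation of tuples and emits another, which is exactly how a Pig script reads.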

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Working code for this post, as well as setup instructions for the tools we use, is available in our GitHub project, and you can download the Enron emails we use in the example in Avro format. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.

Part two of this series, on HBase and JRuby, is also available.


In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.

Pig and Avro

Pig’s Avro support is solid in Pig 0.10.0. To use AvroStorage, we need only load piggybank.jar and the jars for avro and json-simple. A shortcut to AvroStorage is convenient as well. Note that all paths are relative to your Pig install path. We load Avro support into Pig like so:

/* Load Avro jars and define shortcut */
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage; /* Shortcut */

MongoDB’s Java Driver

To connect to MongoDB, we’ll need the MongoDB Java Driver. We’ll load this jar in our Pig script.


mongo-hadoop

The mongo-hadoop project provides integration between MongoDB and Hadoop. You can download the latest version from GitHub. Once you download and unzip the project, you’ll need to build it with sbt.

./sbt package

This will produce the following jars:

$ find . | grep jar
./core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

We load these MongoDB libraries in Pig like so:

/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar /* MongoDB Java Driver */
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off so we don't have the chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
set default_parallel 5 /* By default, lets have 5 reducers */

Writing to MongoDB

Loading Avro data and storing records to MongoDB are one-liners in Pig.

avros = load 'enron.avro' using AvroStorage();
store avros into 'mongodb://localhost/enron.emails' using MongoStorage();

From Avro to Mongo in One Line

I’ve automated loading Avros and storing them to MongoDB in the script avro_to_mongo.pig, using Pig’s parameter substitution:

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();

We can then call our script like this, and it will load our Avros to Mongo:

pig -l /tmp -x local -v -w -param avros=enron.avro \
   -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
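Under the hood, -param is simple textual substitution: each $name token in the script is replaced with its value before the script runs. Here is a toy sketch of the idea in JavaScript; this is only a model, not Pig’s actual preprocessor (which also supports %default, %declare, and parameter files).

```javascript
// Toy model of Pig's -param substitution: replace each $name token
// with its value, leaving unknown tokens untouched.
function substituteParams(script, params) {
  return script.replace(/\$(\w+)/g, function(match, name) {
    return params.hasOwnProperty(name) ? params[name] : match;
  });
}

var template = "avros = load '$avros' using AvroStorage();";
var result = substituteParams(template, { avros: 'enron.avro' });
console.log(result); // → avros = load 'enron.avro' using AvroStorage();
```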

We can verify our data is in MongoDB like so:

$ mongo enron

MongoDB shell version: 2.0.2
connecting to: enron

> show collections
emails

> db.emails.findOne({message_id: "<3607504.1075843446517.JavaMail.evans@thyme>"})
{
	"_id" : ObjectId("502b4ae703643a6a49c8d180"),
	"message_id" : "<3607504.1075843446517.JavaMail.evans@thyme>",
	"date" : "2001-04-25T12:35:00.000Z",
	"from" : {
		"address" : "",
		"name" : "Jeff Dasovich"
	},
	"subject" : null,
	"body" : "Breathitt's hanging tough, siding w/Hebert, standing for markets.  Jeff",
	"tos" : [
		{
			"address" : "",
			"name" : null
		}
	],
	"ccs" : [ ],
	"bccs" : [ ]
}

To the Web with Node.js

We’ve come this far, so we may as well publish our data on the web via a simple web service. Let’s use Node.js to fetch a record from MongoDB by message ID, and then return it as JSON. To do this, we’ll use Node’s mongodb package. Installation instructions are available in our GitHub project.

Our node application is simple enough. We listen for an http request on port 1337, and use the messageId parameter to query an email by message id.

// Dependencies
var mongodb = require("mongodb"),
    http = require('http'),
    url = require('url');

// Set up Mongo
var Db = mongodb.Db,
    Server = mongodb.Server;

// Connect to the MongoDB 'enron' database and its 'emails' collection
var db = new Db("enron", new Server("localhost", 27017, {}));
db.open(function(err, n_db) { db = n_db });
var collection = db.collection("emails");

// Setup a simple API server returning JSON
http.createServer(function (req, res) {
  var inUrl = url.parse(req.url, true);
  var messageId = inUrl.query.messageId;

  // Given a message ID, find one record that matches in MongoDB
  collection.findOne({message_id: messageId}, function(err, item) {
    // Return 404 on error, or when no matching record exists
    if(err || !item) {
      if(err) { console.log("Error: " + err); }
      res.writeHead(404, {'Content-Type': 'text/plain'});
      res.end();
    }
    // Return 200/json on success
    else {
      res.writeHead(200, {'Content-Type': 'application/json'});
      res.end(JSON.stringify(item));
    }
  });
}).listen(1337, 'localhost');

console.log('Server running at http://localhost:1337/');

Navigating to http://localhost:1337/?messageId=%3C3607504.1075843446517.JavaMail.evans@thyme%3E returns an Enron email as JSON.

We’ll leave the CSS as an exercise for your web developer, or you might try Bootstrap if you don’t have one.


Conclusion

The Hadoop Filesystem serves as a dumping ground for aggregating events. Apache Pig is a scripting interface to Hadoop MapReduce. We can manipulate and mine data on Hadoop, and when we’re ready to publish it to an application, we use mongo-hadoop to store our records in MongoDB. From there, creating a web service is a few lines of JavaScript with Node.js – or your favorite web framework.

MongoDB is a popular NoSQL database for web applications. Using Hadoop and Pig we can aggregate and process logs at scale and publish new data-driven features back to MongoDB – or whatever our favorite database is.

Note: we should ensure that there is sufficient I/O capacity between our Hadoop cluster and our MongoDB cluster, lest we overload Mongo with writes from Hadoop. Be careful out there! I have, however, verified that writing from an Elastic MapReduce Hadoop cluster to a replicated MongoHQ cluster (on Amazon EC2) works well.

About the Author

Russell Jurney is a data scientist and the author of the book Agile Data (O’Reilly, Dec 2012), which teaches a flexible toolset and methodology for building effective analytics applications using Apache Hadoop and cloud computing.



roman says:

Please add a note that mongo-hadoop/mongo-2.7.3.jar is needed

krati jain says:


I ran the code, but while running the Pig script I get the following exception (which is possibly due to incompatible versions of MongoDB and Hadoop/Pig). Can you please specify the minimum versions required for the Pig-Mongo interaction?

Pig Stack Trace
ERROR 2998: Unhandled internal error. Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected

java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected
at com.mongodb.hadoop.MongoOutputFormat.checkOutputSpecs(
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(
at org.apache.pig.newplan.logical.relational.LOStore.accept(
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(
at org.apache.pig.newplan.DepthFirstWalker.walk(
at org.apache.pig.newplan.PlanVisitor.visit(

