Category Archives: Uncategorized


Hadoop Summit Session for Your Consideration: Taking Hadoop to the Clouds

If you been following #hadoopsummit on twitter you might have noticed some excitement around the community choice, a public voting system that enables the entire Apache Hadoop community to have a say in the sessions chosen for #hadoopsummit EU. Anyone can vote and the top vote getters in each track will automatically be included in the #hadoopsummit EU agenda, March 20-21, 2013.

If you’re still deciding which sessions, in which tracks, should be so lucky to get your vote, I have one for your consideration. Our very own Steve Loughran went beyond the twitter-sphere and created a blog to promote why you should vote for his session: Taking Hadoop to the Clouds.

Before we proceed to Steve’s case, remember to vote in the Community Choice process. Help us shape the conference agenda by getting in your vote! Deadline is December 14, so vote today!

This is a guest blog post from Steve; making a strong case to why you should pick his session. 

The Hadoop summit vote list is up, and I have two proposals -currently undervoted. Even though I’m on the review committee for the futures strand, not even I could push through a talk, which had zero votes on it, -ideally I’d like my talks to get in through popular acclaim. I could just create 400 fake email addresses and vote-stuff that way, but I’m lazy.

For that reason, I’m going to talk in detail about why my talks will be so excellent that to even think about having them left out could be detrimental to the entire conference.

 

 

 

 

 

 

 

 

 

 

One of my talks is “Taking Hadoop to the Clouds”.

There are two competitors

  1. Deploying Hadoop in the Cloud, which looks at options, details and best practices. I don’t see anything particularly compelling in the abstract -I assume it’s got more votes as it’s the one that comes up first. Or they are trying the many-email-address-vote-stuffing technique(*).
  2. How to Deploy Hadoop Applications on Any Cloud & Optimize Price Performance.  This could be interesting, as it covers how CliQr deploys Hadoop on different infrastructures. It sounds like a rackable-style orchestration layer above infrastructures, for Hadoop it may have similarities with MastodonC’s Kixi work,

Why then, should people vote for mine?

I’m giving the talk.

This is not me being egocentrically smug about the quality of my presentations, but because I’m reasonably confident I know a lot about the area.

  1. My last time at HP Labs was spent on the implementation of the “Cells” virtual infrastructure: declarative configuration of the entire cluster design. The details were presented at the 5th IEEE/ACM conference on Utility and Cloud Computing, and will no doubt be in the ACM library. This means I know about IaaS implementation details; the problems of placement, why networking behaves the way it does, image management, what UIs could look like, what the APIs could be, etc.
  2. I’ve spent a lot of time publicly making Hadoop cloud-friendly. I presume that MS Azure and AWS ElasticMR have put in more hours, but unless they’re going to talk about their work, Tom White and myself are the next choices. Jun Ping and VMWare colleagues have done a lot too -and big patches into the codebase, but I don’t see any submissions from them.
  3. I have opinions on the matter. They aren’t clear cut “cloud good/physical bad” or “physical bad/cloud good”. There are arguments either way; it depends on what you want to do, what your data volume is, and where it lives.
  4. I’m still working in the area, in Hadoop itself and the code nearby.

Recent cloud-related activities include

  • HADOOP-8545: a  Swift Filesystem driver for OpenStack. This is something everyone running Hadoop on Rackspace or other OpenStack clusters will want. This week two different implementations have surfaced, getting them merged together is going to be the next activity,
  • WHIRR-667: Add whirr support for HDP-1 installation
  • Ambari with Whirr. Proof of concept more than anything else.
  • Jclouds and Rackspace UK throttling. Adrian Cole managed to reduce the impact of issue-549, which is good as I don’t really want to get sucked into a different OSS codebase,
  • Other things that I’m not going to talk about -yet.

That’s why people should vote for me. The other talks will be about “how we got Hadoop to work in a virtual world” -mine will be about how we improved Hadoop to work in a virtual world.

(*) ps, for anyone planning the many-email-accounts approach, remember that the email addresses are something we reviewers can look at, and many sequential accounts all doing three votes to a single talk will show up as “statistically significant”. Russ has the data, he likes his analyses. He may even have the IP addresses.

[Photo: an interview with Page 6 Guy at ApacheCon]

====

You can also access Steve’s blog here.

Thankful…

Happy Thanksgiving!

Today, like the rest of the U.S., we take a pause from our regular blog schedule to give thanks…

We are thankful for mappers and reducers. We are thankful for namenodes and jobtrackers. We give thanks to speculative execution battling the march of the last reducer. Give thanks to every petabyte, terabyte, gigabyte, file and block of data. We are thankful for the capacity scheduler.

We are very thankful for many things here at Hortonworks and I know many of us are thankful for an extra long weekend. This has been an amazing year at Hortonworks. We have seen our team double and then triple in size and we are thankful for our smart and hard-working Hortonworkers. We are thankful for sushi lunches, an office of candy, snacks, drinks AND paid gym memberships.

We are thankful for everyone in the Apache Hadoop community and to all those who have downloaded HDP. We are thankful for a ecosystem of partners who are second to none. MOST of all, we are thankful to our investors and to all those companies who have chosen to partner with us as customers.

Happy Thanksgiving!

Rackspace and Hortonworks, a Match Made in the Clouds

As we speed towards wide spread enterprise adoption of Apache Hadoop, it has become readily apparent that this new data platform must not only capture, process and distribute data, but it also must be able to be deployed in a variety of ways, be it on premise, in a VM, as an appliance or better yet in the cloud…

Today we announced a new relationship with Rackspace in which we will develop an OpenStack based Hadoop solution for the public and private cloud. This is not just a paper relationship.  It is a joint effort to produce and make available Hortonworks Data Platform for OpenStack in early 2013.

There are customers today that deploy Hadoop clusters using HDP on dedicated hardware at Rackspace and this is now available as a turn-key, on-demand service running on the Rackspace open cloud and in clusters on private cloud infrastructure in data centers or a customer’s data center.

Why does this make sense?
Well, when you speak of the OpenStack we think of compute, networking and storage as the three main components. OpenStack was created by Rackspace as a collaborative software project designed to create freely available code, badly needed standards, and common ground for the benefit of both cloud providers and cloud customers. In this environment, Hortonworks just makes sense.  Our 100% open source approach is freely available; standards based and better yet open to integrate with the ecosystem and other stack components. More importantly, core Hadoop is compute and storage and Hortonworks provides the most stable and reliable distribution for this.  For wide scale adoption, Hadoop must be enterprise ready and HDP represents this.

Avoid Vendor Lock
The point of an OpenStack is to provide an open and scalable operating system for building public and private clouds. It provides both large and small organizations an alternative to closed cloud environments, reducing the risks of lock-in associated with proprietary platforms. With Rackspace you simply provision the service and you are “good to go”.  With Hortonworks, we add a new service to the stack that is also provisioned via Rackspace so you can be up and running in minutes and without license and without the vendor lock.

The main reason we can do this is we package a fully open Apache Ambari for monitoring and managing a cluster.  With other distributions you need to purchase these same capabilities, which not only locks you in to the vendor for license but also closes the ecosystem, as the open source community can no longer be a source for patches or upgrades.  You need to wait for your vendor to release their proprietary fix, even for the open source bits they built on top of. Not with Hortonworks.

This approach allows customers to invest further into the open cloud future to confidently invest in a technology for the long term.

Where exactly IS your data?
Many have turned to the cloud to store or process data.  Doesn’t it make sense to extend this processing for big data in the cloud where much data already resides?  Well with this new offer you can do just that and in only a matter of minutes.  You can easily extend your current Rackspace environment by firing up a Hadoop cluster and there is no need to move data from internal resources to the cloud the data is already there.  While this may not be the case for every Hadoop project, it makes sense for many and it may make sense for many Rackspace customers.

Rackspace & Hortonworks… seems like a match made in heaven, well, maybe in the clouds

 

If you would like more information, please contact us or Rackspace.

 

Hortonworks at Strata Conference 2012 in New York City!

Visit Hortonworks at Strata New York!

We are so excited to attend O’Reilly Strata Conference in New York next week! If you are going to be there,  please come by booth 16 meet the members of the Hortonworks team who will be happy to discuss any questions you have about Hortonworks Data Platform, business benefits, see a nice demo and walk away with cool swags!

Hortonworks will also be participating in an array of sessions and meet-ups at this conference. And we hope you can join us.

Attend our sessions!

Hadoop’s Role in the Big Data Architecture  (part of Bridge to Big Data)
Jim Walker @jaymce, Director Product Marketing
Tuesday, October 23, 3:30pm, Nassau

Future of Data Processing with Apache Hadoop 
Arun Murthy @acmurthy, Co-founder and Architect and VP, Apache Hadoop at the ASF
Wednesday,October 24, 1:40pm, Grand East (NY Hilton)

Drive Smarter Decisions with Microsoft Big data
Wednesday,October 24, 1:40pm, Regent Parlor

HDFS: What is new and future
Sanjay Radia @ssr, Co-founder of Hortoworks and Apache Hadoop Committer  and Todd Lipcon @tlipcon
Wednesday, October 24, 4:10pm

Making Pig Fly: Optimizing Data Processing on Hadoop
Thejas Madhavan Nair @thejasn and Jianyong Dai, both PMC members and committers of Apache Pig project
Thursday, October 25, 5pm, Murray West (NY Hilton)

Let’s “meet-up”!

Big Data Camp
Monday, October 22,  5:30pm -10pm, Murry Hill Suite

Apache Accumulo meetup
Wednesday, October 24 , 7:00pm, Regent Parlor

Apache Pig Strata
Wednesday, October 24, 6:30pm, Beekman Parlor + Sutton North Room

Remember to follow us on twitter for #Strataconf updates!

My Summer Internship at Hortonworks

Hortonworks Summer Internship 2012

As a first time intern, I can undoubtedly say that Hortonworks was the perfect place for me to gain real world work experience and have the chance to team up with many incredibly talented, driven people. Of course, I didn’t get to fully interact with everyone in the company in the three months that I was here but even after such a short time it is clear to me that it is the welcoming atmosphere and the determined team here that have allowed Hortonworks to achieve so many goals in just over a year.

During this summer, I was awarded the opportunity to be part of something big, something that is gaining impressive momentum in the world of technology and will not be slowing down any time soon. I have received insightful information from people who are overflowing with innovative ideas for how to utilize the big data of today’s world and this has provided me with knowledge that I did not expect to gain from a big data company.

Throughout the course of my internship, I was able to learn about and work in various areas of marketing, with my main focus being on creating blog content. John Kreisa, the VP of marketing, was very helpful in explaining how important this was for bringing attention to Hortonworks and the best ways to do this in the social media sphere. Writing blogs, alone, was a very educational experience because I was able to explore and research real world use cases in which big data was present.

What I found most intriguing was that big data is everywhere now (and if it’s not somewhere, then it will be in a few years). Cancer research becomes efficient and provides more and more substantial results. Retail shoppers receive a personal shopping experience catered to them, and only to them. A hospital patient is provided the most up-to-date medical care and information possible. Police departments learn to harness extensive amounts of information and crime is decreased significantly. The list goes on. It isn’t at all just about big tech companies coming out with the next cool gadget (although this is important too).

So much positive change can happen with the emergence of big data. Three months ago I would have considered this a bit of a far-fetched or even cheesy statement… But now that I know about so many companies and organizations that really are learning how to harness big data in order to benefit the health and safety of people and the environment, I am truly excited for the developments that will arise in the near future, especially with the help of Apache Hadoop. Global change really is attainable.

Of course the most impressive aspect of my internship that really showed me the magnitude of how big big data really is, was Hadoop Summit 2012. I cannot stress enough how lucky I am to have been part of this event. Over 2,200 people gathered on June 13th and 14th to discuss the most recent and innovative developments in big data that were being made possible by Hadoop. I was impressed by how many people were there to educate themselves on the workings of Hadoop and its power in today’s business world. Lecture rooms were overflowing, people were exchanging ideas, connections were being made… The world was climbing the Summit of innovative progress.

Along with this, I was able to see the amount of work that went into planning such an impressive event. Sponsors, press releases, the venue itself, registration, catering, displays, the awesome party at the Tech Museum and so many other equally important things had to be taken care of. Denise Maudru, the marketing director in charge of the event, gave me some small tasks to do before the Summit like preparing dinner name cards and totaling up registrants and payments but these were miniscule parts of a much bigger project and I was really impressed by how Denise and the rest of the Summit planning group organized everything so perfectly and thoughtfully.

After the Summit, I got to work with Masha Finkelstein, the interactive marketing manager, on analytical and inbound marketing. She gave me a sense of how marketing flows work and I helped her create some e-mail templates and landing pages in Marketo. I learned that inbound marketing is actually a pretty sensitive aspect of marketing and has to be both perfect and efficient.

Thousands of leads have to receive the correct e-mail with the correct links and if they click on one of those links they have to get the correct landing page and this landing page has to take them to the next correct landing page and, in my eyes, it is all really a plethora of errors waiting to happen… However, Masha showed me how to keep the errors at bay and how to simultaneously filter out unnecessary leads to move along the marketing process more quickly. After this, Steven further explained how strict sales has to be with incoming leads in order to not waste time with people that may not become Hortonworks customers in the future.

Additionally, I helped the HR team to come up with some revamping ideas for the careers page with the goal of bringing a more pronounced sense of community and culture to our website. We stemmed off the Hortonworks Dr. Seuss theme and jointly came up with some great, creative concepts which could engage potential applicants and employees and show them the fun side of Hortonworks. This project is still in the works but hopefully these ideas will come to life on the careers page very soon.

After three months, I am still far from being familiar with all of the workings of the company. Although I was a marketing intern, I still believe it is necessary to be aware of all of the bolts and screws that make up the bigger picture. Even still, I have learned more and have gotten more real world experience in these past three months than I have in any of my other summers combined. Sometimes it is difficult to notice change within one’s self over a long period of time. Yet, I realize that I have become more outgoing, less afraid, more driven by my future, and generally more aware of the world around me ever since I nervously walked through the Hortonworks doors to a smiling and welcoming Rachele on my first day as a marketing intern.

It has been a liberating, educational, and inspiring experience and I owe it to the people that have been part of it and have made it so wonderful for me.

Thanks Hortonworks!

Welcome Hortonworks Data Platform 1.1

Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop

It is exactly three months to the day that Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…

  • Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
  • Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
  • And, our Hortonworks team grew by leaps and bounds!

In these same three months our growing team of committers, engineers, testers and writers have been busy knocking out our next release, Hortonworks Data Platform 1.1.  We are delighted to announce availability of HDP 1.1 today! With this release, we expand our high availability options with the addition of Red Hat based HA, add streaming capability with Flume, expand monitoring API enhancements and have made significant performance improvements to the core platform.

Ask our sales and support teams, adoption of Apache Hadoop is clearly growing.  In order to accelerate this wide spread interest and adoption our customers demand that their Hadoop distribution is both stable and reliable. It is overwhelming… the enterprise needs to have confidence in the platform.  To this end, we are dedicated to meeting these expectations and these key new features in HDP 1.1 represent a step in that right direction.

Highly Available Hadoop
Not only is HDP 1.1 built on the most stable and reliable release of Hadoop, we are the only distribution to provide full stack high availability on this release. With HDP 1.1, we extend our HA options with the ability to include the most current versions of Red Hat Enterprise Linux (RHEL) and the High Availability Add On. So, now our customers have an option to use industry leading solutions from both VMware and Red Hat as well.

Capturing Data Streams
The addition of Apache Flume into the distribution enables expanded streaming data capture for analysis within the Hortonworks Data Platform. Organizations can now easily and reliably collect and analyze real-time data streams, such as high-volume web logs, in Apache Hadoop, driving additional insights from data that was previously too bulky to capture and process.

Empowering Ops
Operations is a key player in a Hadoop implementation as they are tasked with monitoring and managing the Hadoop infrastructure.  HDP 1.1 delivers easier and deeper integration into third-party management tools and systems so that operations can more easily manage a cluster along side other resources… through a single pane of glass.

Faster, Faster
Hadoop is fast, but why not make it faster?  With this release, we have tested out a 10% + performance improvement on MapReduce jobs over our previous release.  Faster read and writes speed data capture and delivery within the platform. Improved Map Reduce execution performance means that jobs process data more quickly.

To get started with HDP 1.1, please visit our downloads page.

There are also a wealth of useful technical resources available as well, including online documentation, community forums and a Hortonworks knowledge base. Please visit the Community section of our website for these resources and more.

Finally, please join us for our next “What’s New” webinar this week where we will talk more about the new 1.1 features.

Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra

Series Introduction

Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-jruby-sinatra-hbase-pig and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.

Part one of this series on MongoDB is available here: http://hortonworks.com/blog/pig-as-connector-part-one-pig-mongodb-and-node-js/.

Introduction

Hadoop is about freedom as much as scale: providing you disk spindles and processor cores together to process your data with whatever tool you choose. Unleash your creativity. Pig as duct tape facilitates this freedom, enabling you to connect distributed systems at scale in minutes, not hours. In this post we’ll demonstrate how you can turn raw data into a web service using Hadoop, Pig, HBase, JRuby and Sinatra. In doing so we will demonstrate yet another way to use Pig as connector to publish data you’ve processed on Hadoop.

Apache HBase is the Hadoop database. It has emerged as the dominant database used with Hadoop, supporting billions of rows and millions of columns. HBase is a highly available, fast column store, supporting realtime workloads while providing easy access to your data to and from Hadoop via scans. A really great introduction to HBase’s data model is here: http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable.

JRuby is Ruby implemented in Java. JRuby has emerged as an invaluable tool in helping enterprises with aging Java infrastructures bring new life into their codebase by wrapping complex Java with simple Ruby to provide more productive interfaces for application developers. Using JRuby in this example also helps us to learn the HBase Java APIs, as we’ll be calling them directly from JRuby. We’ll be using JRuby to serve data from HBase as a web service, with a lightweight, ‘no frills’ web framework called Sinatra.

Booting HBase

An excellent HBase quickstart tutorial is available here: http://hbase.apache.org/book/quickstart.html. For more detailed information, check out the HBase Reference Guide – a full blown book on Apache HBase. We’ll be booting HBase in local mode for testing. HBase runs on top of Hadoop and Zookeeper in production, but local mode takes care of that for us for experimentation.

We’ll be using the latest stable version of HBase, version 0.94.1.

$ wget http://archive.apache.org/dist/hbase/hbase-0.94.1/hbase-0.94.1.tar.gz
$ tar -xvzf hbase-0.94.1.tar.gz
$ sudo mkdir /var/hbase

Now edit hbase-0.94.1/conf/hbase-site.xml to include our hbase working directory:

<property>
  <name>hbase.rootdir</name>
  <value>file:///var/hbase</value>
</property>

Launch HBase in local mode:

$ cd hbase-0.94.1
$ bin/start-hbase.sh

Now checkout the “Shell Exercises” section of the HBase Book: http://hbase.apache.org/book/quickstart.html#shell_exercises. Lets create a new table for testing. Start the HBase shell (really JIRB under the covers). The help command is there to guide us.

$ bin/hbase shell
...
1.8.7-p352 :002 > help

Lets create our first HBase table called ‘enron’ with a single column family called ‘email’. We might make another column family later for an organizational chart or extracted entities named ‘people.’ Column families are groups of columns.

create 'enron', 'email'
0 row(s) in 1.7900 seconds

Verify that its there with list.

list 'enron'
TABLE                                                                                                                                
enron                                                                                                                                
1 row(s) in 0.0690 seconds

We can put, get and scan records easily. The beauty of HBase is that we can update records in realtime from our application, and then scan them in batch using Hadoop without worrying about stale data.

> put 'enron', 'row1', 'email:address', 'bob@enron.com'
0 row(s) in 0.0190 seconds
> put 'enron', 'row2', 'email:address', 'stevo@enron.com'
0 row(s) in 0.0190 seconds
 
> get 'enron', 'row2'
COLUMN                             CELL                                                                                              
 email:address                     timestamp=1345691920800, value=stevo@enron.com                                                    
1 row(s) in 0.0110 seconds
 
> scan 'enron'
ROW                                COLUMN+CELL                                                                                       
 row1                              column=email:address, timestamp=1345691847565, value=bob@enron.com                                
 row2                              column=email:address, timestamp=1345691920800, value=stevo@enron.com                              
2 row(s) in 0.1190 seconds

Note that HBase doesn’t care what kind of data we store into it, and it returns a timestamp. HBase can hold a history of values for each cell, and we can even use this feature in our applications to store historical data!

Storing Data in HBase with Pig

Pig supports HBase via HBaseStorage. An excellent guide is here, and a here is a good presentation on Pig and HBase at Twitter from 2010.

We need to tell Pig where to find HBase via the HBASE_HOME environment variable.

$ echo 'export HBASE_HOME=/me/hbase-0.94.1' >> ~/.bash_profile
source ~/.bash_profile

We also need to replace the HBase jar distributed with Pig with 0.94.1.

$ rm /me/pig/build/ivy/lib/Pig/hbase-0.90.0.jar
$ cp target/hbase-0.94.1.jar /me/pig/build/ivy/lib/Pig/

Now we can load records in Avro, process them and store them in HBase.

/* Load Avro jars and define shortcut */
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
 
/* HBaseStorage libraries */
register /me/pig/build/ivy/lib/Pig/hbase-0.94.1.jar
register /me/pig/build/ivy/lib/Pig/zookeeper-3.3.3.jar
register /me/pig/build/ivy/lib/Pig/guava-11.0.jar
 
/* Load JRuby to validat emails and creat UUIDs for HBase rowIds */
register 'udf.rb' using jruby as udfs;
 
emails = load '/me/tmp/enron.avro' using AvroStorage();
/* Project all unique combinations of from/to for each email to more than one person */
from_to = foreach emails generate from.address as from_address, FLATTEN(tos.(address)) as to_address;
 
/* Group by email from/to pairs and count emails between those addresses.
   Also, generate a UUID for storing rows in HBase. */
by_pair = group from_to by (from_address, to_address);
sent_counts = foreach by_pair generate udfs.uuid() as id, 
                                       FLATTEN(group) as (from_address, to_address), 
                                       COUNT_STAR(from_to) as total_sent;
 
/* Store to the HBase table 'enron' using a UUID as row key with the loadKey option. */
store sent_counts into 'enron' using 
     org.apache.pig.backend.hadoop.hbase.HBaseStorage('address.pairs:from_address address.pairs:to_address address.pairs:total_sent', 'loadKey true');

ILLUSTRATE shows us our dataflow:

-----------------------------------------------------------------------
| from_to     | from_address:chararray     | to_address:chararray     | 
-----------------------------------------------------------------------
|             | jane.mcbride@enron.com     | tana.jones@enron.com     | 
|             | jane.mcbride@enron.com     | tana.jones@enron.com     | 
-----------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| by_pair     | group:tuple(from_address:chararray,to_address:chararray)             | from_to:bag{:tuple(from_address:chararray,to_address:chararray)}                                 | 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|             | (jane.mcbride@enron.com, tana.jones@enron.com)                       | {(jane.mcbride@enron.com, tana.jones@enron.com), (jane.mcbride@enron.com, tana.jones@enron.com)} | 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------
| sent_counts     | id:chararray                         | from_address:chararray     | to_address:chararray     | total_sent:long     | 
----------------------------------------------------------------------------------------------------------------------------------------
|                 | 8d0434e4-b50d-47ac-8278-fc1bb16ad8e4 | jane.mcbride@enron.com     | tana.jones@enron.com     | 2                   | 
----------------------------------------------------------------------------------------------------------------------------------------

Loading Data from HBase with Pig

Loading data from HBase in Pig is easy, and you can pick individual columns or filter the data. Lets look at the top 100 most prolific email relationships, which could be a measure of how strong a relationship is.

/* HBaseStorage shortcut */
register /me/pig/build/ivy/lib/Pig/hbase-0.94.1.jar
register /me/pig/build/ivy/lib/Pig/zookeeper-3.3.3.jar
register /me/pig/build/ivy/lib/Pig/guava-11.0.jar
 
/* Grab the top 100 most prolific email relationships from HBase and dump them. */
address_pairs = LOAD 'hbase://enron4' using 
  org.apache.pig.backend.hadoop.hbase.HBaseStorage('address.pairs:from_address address.pairs:to_address address.pairs:total_sent')
  as (from_address:chararray, to_address:chararray, total_sent:long);
sorted = order address_pairs by total_sent DESC;
top_100 = limit address_pairs 100;
dump top_100
(pete.davis@enron.com,pete.davis@enron.com,4489)
(vince.kaminski@enron.com,vkaminski@aol.com,1143)
(jeff.dasovich@enron.com,susan.mara@enron.com,935)
(jeff.dasovich@enron.com,paul.kaufman@enron.com,879)
...

Pig UDFs in JRuby

Checkout the id:chararray field in the above example. We created that with a JRuby UDF.

Pig added JRuby UDFs in version 0.10.0. Writing UDFs in JRuby is much simpler than in Java. Our UDF class in udf.rb looks like this:

require 'pigudf'
require 'lib/data_utils'
 
# Refer to our Utils class to share JRuby code between Pig and Sinatra
class Udfs < PigUdf  
  outputSchema "uuid:chararray"
  def uuid()
    DataUtils.uuid()
  end
end

Notice how we employ an external utility class, which in turn calls Java’s java.util.UUID.randomUUID().toString() method. We put our code in a utility class called DataUtils so that it might be shared with other applications, like our Sinatra web app. Code sharing between Hadoop and other systems using JRuby is efficient.

Our utility class looks like this:

# Magic line
require 'java'
 
import java.util.UUID
 
class DataUtils
  # Create Unique IDs - code adapted from https://github.com/jdamick/uuid/blob/master/lib/uuid.rb
  def self.uuid()
    self.generate()
  end
 
  def self.generate()
    java.util.UUID.randomUUID().toString()
  end
end

Using ILLUSTRATE lets us see our UDF code run on real data, without waiting on Hadoop jobs to finish. This is great for development!

HBase and JRuby

You can download JRuby at http://jruby.org/download or better yet, install it via rvm, which you can install via the instructions here: https://rvm.io/rvm/install/.

$ rvm install jruby

JRuby can use the Java native HBase client, which is fast and efficient (the HBase shell is actually a modified JRuby Interactive Ruby Shell). Thrift and JSON APIs are provided for other languages. Details about JRuby and HBase are available at http://wiki.apache.org/hadoop/Hbase/JRuby, although the example uses old APIs that we’ve updated for this example.

A great resource to see JRuby in action against HBase is in the HBase shell itself: https://github.com/apache/hbase/tree/trunk/hbase-server/src/main/ruby.

To connect to HBase in JRuby, we’ll need to setup our CLASSPATH to import the HBase jars.

cd $HBASE_HOME
wget http://central.maven.org/maven2/org/jruby/jruby-complete/1.6.7.2/jruby-complete-1.6.7.2.jar
export CLASSPATH=$CLASSPATH:`java -jar $HBASE_HOME/jruby-complete-1.6.7.2.jar -e "puts Dir.glob('$HBASE_HOME/{.,build,lib}/*.jar').join(':')"`

Now lets import the Java HBase client classes in JRuby and connect to HBase. htable.rb from the Hbase Shell is helpful: https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/ruby/hbase/table.rb.

$ jirb

We begin by importing the relevant Java classes into JRuby.

# Adapted from obsolete example at http://wiki.apache.org/hadoop/Hbase/JRuby
 
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.client.ResultScanner
import org.apache.hadoop.hbase.util.Writables
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.Text

Next, we take a queue from the HBase shell and create a to_string utility method.

# Make a String of the passed kv
def to_string(column, kv, maxlength = -1)
  if kv.isDelete
    val = "timestamp=#{kv.getTimestamp}, type=#{org.apache.hadoop.hbase.KeyValue::Type::codeToType(kv.getType)}"
  else
    val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getValue)}"
  end
  val
end

Connecting to HBase is easy. Lets wrap the code in a connect method.

# Connect to HBase and our table
def connect(table_name)
  @conf = HBaseConfiguration.create
  admin = HBaseAdmin.new(@conf)
  @table = HTable.new(@conf, table_name)
end

Fetching a record is fairly easy. Lets make a get method.

def get_key(key)
  my_get = Get.new(key.to_java_bytes)
  result = @table.get(my_get)
  result_ary = []
  for kv in result.list
    family = String.from_java_bytes(kv.get_family)
    qualifier = org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.get_qualifier)
    column = "#{family}:#{qualifier}"
    value = to_string(column, kv, -1)
    timestamp = kv.get_timestamp
    str_value = org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.get_value)
    result_ary << str_value.to_s
  end
  result_ary
end

I’ve wrapped connect and get in a simple JRuby HBase Client: https://github.com/rjurney/enron-jruby-sinatra-hbase-pig/blob/master/lib/hbase_client.rb.

A simple unit test verifies things work:

$ jruby test/hbase_client.rb
require 'lib/hbase_client'
 
hclient = HBaseClient.new
hclient.connect('enron')
hclient.get('row1')
1345691847565
bob@enron.com

JRuby and Sinatra

Sinatra is a simple Ruby framework for web applications. It is summarized nicely in its README. Installing Sinatra is easy.

jgem install sinatra

Our sinatra app, sinatra.rb, is simple enough:

require 'rubygems'
require 'sinatra'
require 'json'
require 'lib/hbase_client'
require 'lib/data_utils'
 
hclient = HBaseClient.new
hclient.connect('enron')
 
get '/:message_id' do |message_id|
  JSON hclient.get(message_id)
end
 
get '/create_uuid' do
  "This web service shares code with a Pig JRuby UDF to produce this UUID: " + DataUtils.uuid
end

Run our app: jruby ./web.rb and navigate to our UUID web service.

Now pick out a message ID from a scan in the HBase shell and fetch it as JSON:

> scan 'enron'
 ffffe8b6-2abb-43ba-aad8-6ee8c4526 column=address.pairs:from_address, timestamp=1345904219080, value=stanley.horton@enron.com
 2b1                                                                                                                                 
 ffffe8b6-2abb-43ba-aad8-6ee8c4526 column=address.pairs:to_address, timestamp=1345904219080, value=maria.pavlou@enron.com            
 2b1                                                                                                                                 
 ffffe8b6-2abb-43ba-aad8-6ee8c4526 column=address.pairs:total_sent, timestamp=1345904219080, value=1                                 
 2b1                                                                                                                                 
310192 row(s) in 246.2150 seconds
> get 'enron', 'ffffe8b6-2abb-43ba-aad8-6ee8c45262b1'
COLUMN                             CELL                                                                                              
 address.pairs:from_address        timestamp=1345904219080, value=stanley.horton@enron.com                                           
 address.pairs:to_address          timestamp=1345904219080, value=maria.pavlou@enron.com                                             
 address.pairs:total_sent          timestamp=1345904219080, value=1                                                                  
3 row(s) in 0.0300 seconds

Our web service publishes stats on messages between email addresses from HBase.

Conclusion

Starting with emails in Avro format, we have processed our data using Pig and published it to HBase, where a simple JRuby Sinatra app serves it as JSON. We’ve also managed to share UUID code between our JRuby Pig UDF and our Sinatra web application.

About the Author

Russell Jurney is Hortonworks Hadoop Evangelist and the author of the book Agile Data (O’Reilly, Dec 2012), which teaches a flexible toolset and methodology for building effective analytics applications using Apache Hadoop and cloud computing.

Hadoop: Your Partner in Crime

Pre-crime? Pretty close…

If you have seen the futuristic movie Minority Report, you most likely have an idea of how many factors and decisions go into crime prevention. Yes, Pre-crime is an aspect of the future but even today it is clear that many social, economic, psychological, racial, and geographical circumstances must be thoroughly considered in order to make crime prediction even partially possible and accurate. The predictive analytics made possible with Apache Hadoop can significantly benefit this area of government security.

The essence of crime prevention is to understand and narrow down thousands of “what if” cases to a manageable and plausible handful of scenarios. Crime can happen anywhere and can be categorized as anything from cyber fraud to kidnapping, which provides a lot of combinations for possible misdemeanors or felonies. With the help of big data analytics, government agencies can zone in on certain areas, demographics, and age groups to pick out specific types of crimes and move towards decreasing the one trillion dollar annual cost of crime in the United States.

Zach Friend, a crime analyst for the Santa Cruz Police Department, explained that there aren’t enough cops on the streets due to insufficient funds. Not only that, but many police departments are still technologically behind in the crime-monitoring field, so big data analytics tools could be a huge step forward for police all over the country. Evidence and information about cases could be stored much more efficiently, police action could be more proactive, and crime awareness could be much more prevalent.

Who’s on the case?

The Crime and Corruption Observatory (created by the European company, FuturICT) is pushing for this kind of development and aims to predict the dynamics of criminal phenomena by running massive data mining and large-scale computer simulations. The Observatory is structured as a network that involves scientists from varying fields – “from cognitive and social science to criminology, from artificial intelligence to complexity science, from statistics to economics and psychology”.

This Observatory will be used through the framework of the developing Living Earth Simulator project – “a big data and supercomputing project that will attempt to uncover the underlying sociological and psychological laws that underpin human civilization.” The project, funded by the European Union, is an impressive advancement in technology, which will not only aid in pin pointing crime but will also effectively utilize the big data of today’s world.

PredPol has made predictive crime analytics available to police departments so that “pre-crime”, in a sense, could be put into action. Zach Friend explains, “We’re facing a situation where we have 30 percent more calls for service but 20 percent less staff than in the year 2000, and that is going to continue to be our reality. So we have to deploy our resources in a more effective way. This model does that.” PredPol allows law enforcement agencies to collect and organize data about crimes that have already happened and to use this data to predict future incidents in certain areas at a radius of 500 square foot blocks. It may not be the same as knowing the exact perpetrator, victim, and cause of the crime ahead of time as was possible in Minority Report but it is an impressive step towards perfecting crime prediction.

The Santa Cruz Police Department, which is using PredPol’s software, has already seen significant improvements in police work. SCPD began by locating areas of possible burglaries, battery, and assault and handing out maps of these areas to officers so they could patrol them. Since then, the department has seen a 19% decrease in these types of crimes.

PredPol software is able to make calculations about crimes based on previous times and locations of other incidents while cross-referencing these with criminal behavior and patterns. Here is an example of how large-scale this could get: George Mohler, a UCLA mathematician who was testing the effectiveness of PredPol, looked at 5,000 crimes which required 5,000! comparisons (i.e. 5,000 x 4,999 x 4,998…). With impressive results already materializing from calculations like these, it is exciting to think how much more accurate predictive crime analytics could become.

Hadoop lays down the law

With Apache Hadoop, perfecting crime prevention becomes an attainable goal. CTOlabs presented some very important points in a recent white paper about big data and law enforcement, showing how Hadoop could be beneficial to smaller police departments that don’t have very much financial leeway. The LAPD for example, is very well-funded and can afford to work with companies such as IBM to develop crime predicting techniques.

Smaller or less advanced departments, however, do not have the financial advantage to use supercomputers or extensive command centers and will use less efficient techniques (such as simple spreadsheets and homegrown databases) to keep track of all of the information involved in law enforcement. “Nationwide, agencies and departments have to reduce their resources and even their manpower but are expected to continue the trend of a decreasing crime rate. To do so requires better service with fewer resources.” Open source presents an extremely effective and less expensive option – Apache Hadoop is the super hero that can save the day, one cluster at a time.

With Hadoop’s capability to store and organize data, police departments can filter through unnecessary information in order to focus on the aspects of crime that are more important. By applying advanced analytics to historical crime patterns, weather trends, traffic sensor data, and a wealth of other sources, police can place patrol cops in areas with higher crime probability instead of evenly distributing man power throughout quiet and dangerous neighborhoods. This conserves money, effort, and time. Hadoop can also help organize a number of other factors such as police back up, calls for service, or screening for biases and confounding variables. Phone calls, videos, historical records, suspect profiles, or any other important information that is necessary for law agencies to keep for a long time can be systematized and referenced whenever need be.

Increasing public safety through effective use of technology is not a panacea but it is here and is an effective tool in combating crime. Apache Hadoop serves as a foundation for this new approach and, most importantly, it is accessible to a wider range of police departments all over the country and the world. Yes, predictive policing and crime prevention still have a lot of room for development and have yet to tackle issues like specific crimes that depend on interpersonal relationships or random events. However, it is all very possible, especially with the use of Hadoop as a predictive analytics platform. Crime can be stopped. No PreCogs necessary.

UC Irvine Medical Center: Improving Quality of Care with Apache Hadoop

This is the first part of a series written by Charles Boicey from the UC Irvine Medical Center.  The series will demonstrate a real case study for Apache Hadoop in healthcare and also journal the architecture and technical considerations presented during implementation.

With a single observation in early 2011, the Hadoop strategy at UC Irvine Medical Center started. While using Twitter, Facebook, LinkedIn and Yahoo we came to the conclusion that healthcare data although domain specific is structurally not much different than a tweet, Facebook posting or LinkedIn profile and that the environment powering these applications should be able to do the same with healthcare data.

In healthcare, data shares many of the same qualities as that found in the large web properties.  Each has a seemingly infinite volume of data to ingest and it is all types and formats across structured, unstructured, video and audio. We also noticed the near zero latency in which data was not only ingested but also rendered back to users was important. Intelligence was also apparent in that algorithms were employed to make suggestion such as people you may know.

We started to draw parallels to the challenges we were having with the typical characteristic of Big Data, volume, velocity and variety.

In the beginning, our first project was to build an environment capable of ingesting Continuity of Care Documents (CCD) via a JSON pipeline, store them in MongoDB and then render them via a web user interface that had search capabilities. From that initial success project Saritor was launched.

Saritor is the Roman god for cultivation, in this case the cultivation of healthcare data for the purposes of rapidly progressing through the data to information, to knowledge, to wisdom continuum. We saw this project as vehicle for demonstrating the value of Applied Clinical Informatics and promoting the translational effects of rapidly moving from “code side to bedside”.

Why Saritor? The Electronic Medical Record (EMR) cannot handle complex operations such as anomaly detection, machine learning, building complex algorithms or pattern set recognition and the Enterprise Data Warehouse (EDW) supports quality, operations, clinicians & researchers. We, like many organizations with data warehouses run ETL processes at night to minimize the load on the production systems. We have some have real time interfaces with the data warehouse,but not all data is ingested in real time. In turn, our data suffers from a latency factor of up to 24 hours in many cases making this environment suboptimal. An adjunctive environment is needed to fill in the gaps.

Why Apache Hadoop?

Hadoop has a very attractive scale to cost ratio because it is A) open source and B) the server requirements are minimal and VM is an option. We currently deploy eight nodes, which is a far cry from the multiple 4000+ node clusters that Yahoo employs but our small environment is providing us big value.

Hadoop is uniquely capable of storing a wide range of healthcare environment data not matter the type or amount of structure.  For us, this includes:

  • all ancillary HL7 feeds (without the need for modification),
  • EMR generated data,
  • genomic data,
  • financial data,
  • RTLS data from assets,
  • patient and caregiver data,
  • smart pump data,
  • incremental physiological monitoring measurements (across one minute increments),
  • ventilator data in one minute or less increments
  • and temperature and humidity data.

Any electronically generated data in a healthcare environment can be ingested and stored in Hadoop and most importantly on commodity hardware.

But wait, that’s not all. The Hadoop ecosystem is modular and within those modules lays the functionality to build algorithms for surveillance, detection and notification of conditions such as sepsis or the prediction of potential 30 day readmits. Other uses cases we are working on include monitoring “Sink Time”, that is how much time caregivers spend washing their hands; patient throughput with the ability to capture actual hand off times; patient scorecards pushed to the patient portal and the ability to discover the unknown unknowns in our data.

Hadoop has also answered the problem of legacy data. UC Irvine Healthcare like many healthcare organizations has a legacy system, clinicians and researchers needed access to the data. Data conversion from the legacy system to the new EMR or data warehouse was not feasible. Our legacy system like others has the ability to print to text the patient record. For UCI that meant 1.2 million patients and over 3 million records. Those records are now in Saritor and are searchable. Solving this use case was our first deliverable with a demonstrable ROI.

We believe that Hadoop is the right environment for developing an analytic ecosystem to aide in the delivery of quality care at the lowest possible cost and an environment to enable clinical researchers to examine healthcare data in its entirety.

Next time we’ll dive deepr into the Saritor Hadoop ecosystem, ongoing and future development as well as collaborations with our partners.

Apache Hadoop, the Energy Softgrid and my Imaginary Tesla

This week, I spent some time and enjoyed speaking at the Softgrid 2012 conference in San Francisco. It was a great collection of speakers and attendees and opened my eyes to some Hadoop driven possibilities that not only differentiate utilities companies but will also transform our day-to-day lives.

The conference focused on software (in this case intelligent analytics) as a competitive advantage to enable value and growth for utilities.  These often large and historically conservative organizations have moved beyond the notion that their sole business is to distribute electric power efficiently, reliably, and cost-effectively to consumers. They now rely on analysis of massive amounts of data they already collect from smart meters and existing networks about distribution and consumption, and are taking progressive action on that data.

As we have seen in other markets, such as Financial Services and Retail, data is becoming the currency for an energy market transformation.

While I am not a Prius, Volt or Tesla (unfortunately) driver, I am sensitive to eco-friendly causes that have a large and immediate impact on the way we consume our natural resources.  I feel I am like many consumers in that saving five to ten or twenty dollars on my monthly bill is important but honestly I am more interested in knowledge and insight into usage and just how green I am. Call me an armchair activist I guess.

This conference opened my eyes to a broad range of possibilities for the utilities to really change the way we live and increase their bottom line through green tech. Here are two possible uses of big data in Energy.

Generation vs. consumption

My friend and Hortonworker, Rikin Shah, walked me through one potential use case of Hadoop in energy before I even left for San Francisco.  There is no such thing as a big battery that will hold any excess energy that is generated by the utility companies.  That means if we burn the coal or split the atoms we have to use all the energy produced or it gets wasted.  The challenge is that the consumption curve is erratic and this leads to waste, as we have to produce more than necessary to avoid a brownout when consumption extends beyond generation.  It is difficult at best to predict consumption.  However we can get a lot better through data.

In some companies they use smart meter technology that can automatically read meters at any desired interval. For many organizations this is once or twice a month, however they are moving to collect readings every four hours. That’s 6/day x 30 – 180x growth in data points collected per month per house! Why shouldn’t this be eve more frequent?  Well the amount of data is massive.  What if we could extend this to near continuous meter reads and analyze in near real time.  It could get us to better predictability of spikes and reduce the padding between production and consumption.  Further, new technologies (such as Nest thermostats) bring this direct touch to the point of consumption.  As we evolve, certainly smart light switches and wall outlets could all be tied into the grid to provide real touch with real consumption.  Perhaps we combine usage data with detailed weather data that drills down to a square meter.  The profound analysis could revolutionize help us conserve through near real time production.

Individually provisioned consumers

My phone goes everywhere I go.  I use it… a lot and I am often found borrowing a charger or asking someone if I could plug in and give it a charge.  They pay the bill.  This is ok when it is just a few kilowatts but what if I was out of electricity and I was at your house with my (pretend) Tesla?  I will presume I would consume much more than a few kilowatts to give my car a jump.  Currently, there is no way to track and provision usage per device and through to the owner.    Why not?  In part it is a data problem.

A more intelligent provisioning system would require a massive registry of devices and locations and the ratings engine…  well, that would be heck of an algorithm and would require some pretty heavy computation.  If you think the telecom call data record analysis is complex, this would be insane.  Devices come and go and there are a magnitude more options for plugging in.

Enter Hadoop

Apache Hadoop with massively parallel process and widespread storage.  Many utilities companies are already enjoying the benefits of this open source data platform.  There are probably a few innovations and some fairly substantial capital expense necessary to make a fully connected grid a reality but it is not that far off and definitively a possibility.  When I was a kid, my dad used to yell at me when I left lights on in the house.  Maybe if the light switch registered my presence, I could ask my pop to take the cost out of my allowance.

Back to our green Earth

IT was an interesting day full of great speakers and roomful of people very interested in how technology and software in general can aid in creating amore efficient grid.  Sure, a more intelligent grid will reduce costs through more intelligent production, but it can also change the way we all think about consumption.

I’m still waiting to get that Tesla!

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js

Series Introduction

Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-node-mongo and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.

Part two of this series on HBase and JRuby is available here: http://hortonworks.com/blog/pig-as-hadoop-connector-part-two-hbase-jruby-and-sinatra/.

Introduction

In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.

Pig and Avro

Pig’s Avro support is solid in Pig 0.10.0. To use AvroStorage, we need only load piggbank.jar, and the jars for avro and json-simple. A shortcut to AvroStorage is convenient as well. Note that all paths are relative to your Pig install path. We load Avro support into Pig like so:

/* Load Avro jars and define shortcut */
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage(); /* Shortcut */

MongoDB’s Java Driver

To connect to MongoDB, we’ll need the MongoDB Java Driver. You can download it here: https://github.com/mongodb/mongo-java-driver/downloads. We’ll load this jar in our Pig script.

Mongo-Hadoop

The mongo-hadoop project provides integration between MongoDB and Hadoop. You can download the latest version at https://github.com/mongodb/mongo-hadoop/downloads. Once you download and unzip the project, you’ll need to build it with sbt.

./sbt package

This will produce the following jars:

$ find .|grep jar
./core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
./target/mongo-hadoop-1.1.0-SNAPSHOT.jar

We load these MongoDB libraries in Pig like so:

/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar /* MongoDB Java Driver */
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
 
/* Set speculative execution off so we don't have the chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
set default_parallel 5 /* By default, lets have 5 reducers */

Writing to MongoDB

Loading Avro data and storing records to MongoDB are one-liners in Pig.

avros = load 'enron.avro' using AvroStorage();
store avros into 'mongodb://localhost/enron.emails' using MongoStorage();

From Avro to Mongo in One Line

I’ve automated loading Avros and storing them to MongoDB in the script at https://github.com/rjurney/enron-node-mongo/blob/master/avro_to_mongo.pig, using Pig’s parameter substitution:

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();

We can then call our script like this, and it will load our Avros to Mongo:

pig -l /tmp -x local -v -w -param avros=enron.avro \
   -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig

We can verify our data is in MongoDB like so:

$ mongo enron

MongoDB shell version: 2.0.2
connecting to: enron

> show collections
emails
system.indexes

> db.emails.findOne({message_id: "%3C3607504.1075843446517.JavaMail.evans@thyme%3E"})
{
	"_id" : ObjectId("502b4ae703643a6a49c8d180"),
	"message_id" : "",
	"date" : "2001-04-25T12:35:00.000Z",
	"from" : {
		"address" : "jeff.dasovich@enron.com",
		"name" : "Jeff Dasovich"
	},
	"subject" : null,
	"body" : "Breathitt's hanging tough, siding w/Hebert, standing for markets.  Jeff",
	"tos" : [
		{
			"address" : "7409949@skytel.com",
			"name" : null
		}
	],
	"ccs" : [ ],
	"bccs" : [ ]
}

To the Web with Node.js

We’ve come this far, so we may as well publish our data on the web via a simple web service. Lets use Node.js to fetch a record from MongoDB by message ID, and then return it as JSON. To do this, we’ll use Node’s mongodb package. Installation instructions are available in our github project.

Our node application is simple enough. We listen for an http request on port 1337, and use the messageId parameter to query an email by message id.

// Dependencies
var mongodb = require("mongodb"),
    http = require('http'),
    url = require('url');

// Set up Mongo
var Db = mongodb.Db,
    Server = mongodb.Server;

// Connect to the MongoDB 'enron' database and its 'emails' collection
var db = new Db("enron", new Server("127.0.0.1", 27017, {}));
db.open(function(err, n_db) { db = n_db });
var collection = db.collection("emails");

// Setup a simple API server returning JSON
http.createServer(function (req, res) {
  var inUrl = url.parse(req.url, true);
  var messageId = inUrl.query.messageId;

  // Given a message ID, find one record that matches in MongoDB
  collection.findOne({message_id: messageId}, function(err, item) {
    // Return 404 on error
    if(err) {
      console.log("Error:" + err);
      res.writeHead(404);
      res.end();
    }
    // Return 200/json on success
    if(item) {
      res.writeHead(200, {'Content-Type': 'application/json'});
      res.send(JSON.stringify(item));
      res.end();
    }
  });

}).listen(1337, '127.0.0.1');

console.log('Server running at http://127.0.0.1:1337/');

Navigating to http://localhost:1337/?messageId=%3C3607504.1075843446517.JavaMail.evans@thyme%3E returns an enron email as JSON:

We’ll leave the CSS as an exercise for your web developer, or you might try Bootstrap if you don’t have one.

Conclusion

The Hadoop Filesystem serves as a dumping ground for aggregating events. Apache Pig is a scripting interface to Hadoop MapReduce. We can manipulate and mine data on Hadoop, and when we’re ready to publish it to an application we use mongo-hadoop to store our records in MongoDB. From there, creating a web service is a few lines of javascript with Node.js – or your favorite web framework.

MongoDB is a popular NoSQL database for web applications. Using Hadoop and Pig we can aggregate and process logs at scale and publish new data-driven features back to MongoDB – or whatever our favorite database is.

Note: we should ensure that there is sufficient I/O between our Hadoop cluster and our MongoDB cluster, lest we overload Mongo with writes from Hadoop. Be careful out there! I have however verified that writing from an Elastic MapReduce Hadoop cluster to a replicated MongoHQ cluster (on Amazon EC2) works well.

About the Author

Russell Jurney is a data scientist and the author of the book Agile Data (O’Reilly, Dec 2012), which teaches a flexible toolset and methodology for building effective analytics applications using Apache Hadoop and cloud computing.

Apache Hadoop YARN – Concepts and Applications

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Apache Hadoop YARN – Concepts & Applications

As previously described, YARN is essentially a system for managing distributed applications. It consists of a central ResourceManager, which arbitrates all available cluster resources, and a per-node NodeManager, which takes direction from the ResourceManager and is responsible for managing resources available on a single node.

Resource Manager

In YARN, the ResourceManager is, primarily, a pure scheduler. In essence, it’s strictly limited to arbitrating available resources in the system among the competing applications – a market maker if you will.  It optimizes for cluster utilization (keep all resources in use all the time) against various constraints such as capacity guarantees, fairness, and SLAs. To allow for different policy constraints the ResourceManager has a pluggable scheduler that allows for different algorithms such as capacity and fair scheduling to be used as necessary.

ApplicationMaster

Many will draw parallels between YARN and the existing Hadoop MapReduce system (MR1 in Apache Hadoop 1.x). However, the key difference is the new concept of an ApplicationMaster.

The ApplicationMaster is, in effect, an instance of a framework-specific library and is responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the containers and their resource consumption. It has the responsibility of negotiating appropriate resource containers from the ResourceManager, tracking their status and monitoring progress.

The ApplicationMaster allows YARN to exhibit the following key characteristics:

  • Scale: The Application Master provides much of the functionality of the traditional ResourceManager so that the entire system can scale more dramatically. In tests, we’ve already successfully simulated 10,000 node clusters composed of modern hardware without significant issue. This is one of the key reasons that we have chosen to design the ResourceManager as a pure scheduler i.e. it doesn’t attempt to provide fault-tolerance for resources. We shifted that to become a primary responsibility of the ApplicationMaster instance. Furthermore, since there is an instance of an ApplicationMaster per application, the ApplicationMaster itself isn’t a common bottleneck in the cluster.
  • Open: Moving all application framework specific code into the ApplicationMaster generalizes the system so that we can now support multiple frameworks such as MapReduce, MPI and Graph Processing.

It’s a good point to interject some of the key YARN design decisions:

  • Move all complexity (to the extent possible) to the ApplicationMaster while providing sufficient functionality to allow application-framework authors sufficient flexibility and power.
  • Since it is essentially user-code, do not trust the ApplicationMaster(s) i.e. any ApplicationMaster is not a privileged service.
  • The YARN system (ResourceManager and NodeManager) has to protect itself from faulty or malicious ApplicationMaster(s) and resources granted to them at all costs.

It’s useful to remember that, in reality, every application has its own instance of an ApplicationMaster. However, it’s completely feasible to implement an ApplicationMaster to manage a set of applications (e.g. ApplicationMaster for Pig or Hive to manage a set of MapReduce jobs). Furthermore, this concept has been stretched to manage long-running services which manage their own applications (e.g. launch HBase in YARN via an hypothetical HBaseAppMaster).

Resource Model

YARN supports a very general resource model for applications. An application (via the ApplicationMaster) can request resources with highly specific requirements such as:

  • Resource-name (hostname, rackname – we are in the process of generalizing this further to support more complex network topologies with YARN-18).
  • Memory (in MB)
  • CPU (cores, for now)
  • In future, expect us to add more resource-types such as disk/network I/O, GPUs etc.

ResourceRequest and Container

YARN is designed to allow individual applications (via the ApplicationMaster) to utilize cluster resources in a shared, secure and multi-tenant manner. Also, it remains aware of cluster topology in order to efficiently schedule and optimize data access i.e. reduce data motion for applications to the extent possible.

In order to meet those goals, the central Scheduler (in the ResourceManager) has extensive information about an application’s resource needs, which allows it to make better scheduling decisions across all applications in the cluster. This leads us to the ResourceRequest and the resulting Container.

Essentially an application can ask for specific resource requests via the ApplicationMaster to satisfy its resource needs. The Scheduler responds to a resource request by granting a container, which satisfies the requirements laid out by the ApplicationMaster in the initial ResourceRequest.

Let’s look at the ResourceRequest – it has the following form:

<resource-name, priority, resource-requirement, number-of-containers>

Let’s walk through each component of the ResourceRequest to understand this better.

  • resource-name is either hostname, rackname or * to indicate no preference. In future, we expect to support even more complex topologies for virtual machines on a host, more complex networks etc.
  • priority is intra-application priority for this request (to stress, this isn’t across multiple applications).
  • resource-requirement is required capabilities such as memory, cpu etc. (at the time of writing YARN only supports memory and cpu).
  • number-of-containers is just a multiple of such containers.

Now, on to the Container.

Essentially, the Container is the resource allocation, which is the successful result of the ResourceManager granting a specific ResourceRequest. A Container grants rights to an application to use a specific amount of resources (memory, cpu etc.) on a specific host.

The ApplicationMaster has to take the Container and present it to the NodeManager managing the host, on which the container was allocated, to use the resources for launching its tasks. Of course, the Container allocation is verified, in the secure mode, to ensure that ApplicationMaster(s) cannot fake allocations in the cluster.

Container Specification during Container Launch

While a Container, as described above, is merely a right to use a specified amount of resources on a specific machine (NodeManager) in the cluster, the ApplicationMaster has to provide considerably more information to the NodeManager to actually launch the container.

YARN allows applications to launch any process and, unlike existing Hadoop MapReduce in hadoop-1.x (aka MR1), it isn’t limited to Java applications alone.

The YARN Container launch specification API is platform agnostic and contains:

  • Command line to launch the process within the container.
  • Environment variables.
  • Local resources necessary on the machine prior to launch, such as jars, shared-objects, auxiliary data files etc.
  • Security-related tokens.

This allows the ApplicationMaster to work with the NodeManager to launch containers ranging from simple shell scripts to C/Java/Python processes on Unix/Windows to full-fledged virtual machines (e.g. KVMs).

YARN – Walkthrough

Armed with the knowledge of the above concepts, it will be useful to sketch how applications conceptually work in YARN.

Application execution consists of the following steps:

  • Application submission.
  • Bootstrapping the ApplicationMaster instance for the application.
  • Application execution managed by the ApplicationMaster instance.

Let’s walk through an application execution sequence (steps are illustrated in the diagram):

  1. A client program submits the application, including the necessary specifications to launch the application-specific ApplicationMaster itself.
  2. The ResourceManager assumes the responsibility to negotiate a specified container in which to start the ApplicationMaster and then launches the ApplicationMaster.
  3. The ApplicationMaster, on boot-up, registers with the ResourceManager – the registration allows the client program to query the ResourceManager for details, which allow it to  directly communicate with its own ApplicationMaster.
  4. During normal operation the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol.
  5. On successful container allocations, the ApplicationMaster launches the container by providing the container launch specification to the NodeManager. The launch specification, typically, includes the necessary information to allow the container to communicate with the ApplicationMaster itself.
  6. The application code executing within the container then provides necessary information (progress, status etc.) to its ApplicationMaster via an application-specific protocol.
  7. During the application execution, the client that submitted the program communicates directly with the ApplicationMaster to get status, progress updates etc. via an application-specific protocol.
  8. Once the application is complete, and all necessary work has been finished, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.

In our next post in this series we dive more into guts of the YARN system, particularly the ResourceManager – stay tuned!

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Apache Hadoop YARN – Background and an Overview

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Apache Hadoop YARN – Background & Overview

Celebrating the significant milestone that was Apache Hadoop YARN being promoted to a full-fledged sub-project of Apache Hadoop in the ASF we present the first blog in a multi-part series on Apache Hadoop YARN – a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.

MapReduce – The Paradigm

Essentially, the MapReduce model consists of a first, embarrassingly parallel, map phase where input data is split into discreet chunks to be processed. It is followed by the second and final reduce phase where the output of the map phase is aggregated to produce the desired result. The simple, and fairly restricted, nature of the programming model lends itself to very efficient and extremely large-scale implementations across thousands of cheap, commodity nodes.

Apache Hadoop MapReduce is the most popular open-source implementation of the MapReduce model.

In particular, when MapReduce is paired with a distributed file-system such as Apache Hadoop HDFS, which can provide very high aggregate I/O bandwidth across a large cluster, the economics of the system are extremely compelling – a key factor in the popularity of Hadoop.

One of the keys to this is the lack of data motion i.e. move compute to data and do not move data to the compute node via the network. Specifically, the MapReduce tasks can be scheduled on the same physical nodes on which data is resident in HDFS, which exposes the underlying storage layout across the cluster. This significantly reduces the network I/O patterns and keeps most of the I/O on the local disk or within the same rack – a core advantage.

Apache Hadoop MapReduce, circa 2011 – A Recap

Apache Hadoop MapReduce is an open-source, Apache Software Foundation project, which is an implementation of the MapReduce programming paradigm described above. Now, as someone who has spent over six years working full-time on Apache Hadoop, I normally like to point out that the Apache Hadoop MapReduce project itself can be broken down into the following major facets:

  • The end-user MapReduce API for programming the desired MapReduce application.
  • The MapReduce framework, which is the runtime implementation of various phases such as the map phase, the sort/shuffle/merge aggregation and the reduce phase.
  • The MapReduce system, which is the backend infrastructure required to run the user’s MapReduce application, manage cluster resources, schedule thousands of concurrent jobs etc.

This separation of concerns has significant benefits, particularly for the end-users – they can completely focus on the application via the API and allow the combination of the MapReduce Framework and the MapReduce System to deal with the ugly details such as resource management, fault-tolerance, scheduling etc.

The current Apache Hadoop MapReduce System is composed of the JobTracker, which is the master, and the per-node slaves called TaskTrackers.

The JobTracker is responsible for resource management (managing the worker nodes i.e. TaskTrackers), tracking resource consumption/availability and also job life-cycle management (scheduling individual tasks of the job, tracking progress, providing fault-tolerance for tasks etc).

The TaskTracker has simple responsibilities – launch/teardown tasks on orders from the JobTracker and provide task-status information to the JobTracker periodically.

For a while, we have understood that the Apache Hadoop MapReduce framework needed an overhaul. In particular, with regards to the JobTracker, we needed to address several aspects regarding scalability, cluster utilization, ability for customers to control upgrades to the stack i.e. customer agility and equally importantly, supporting workloads other than MapReduce itself.

We’ve done running repairs over time, including recent support for JobTracker availability and resiliency to HDFS issues (both of which are available in Hortonworks Data Platform v1 i.e. HDP1) but lately they’ve come at an ever-increasing maintenance cost and yet, did not address core issues such as support for non-MapReduce and customer agility.

Why support non-MapReduce workloads?

MapReduce is great for many applications, but not everything; other programming models better serve requirements such as graph processing (Google Pregel / Apache Giraph) and iterative modeling (MPI). When all the data in the enterprise is already available in Hadoop HDFS having multiple paths for processing is critical.

Furthermore, since MapReduce is essentially batch-oriented, support for real-time and near real-time processing such as stream processing and CEPFresil are emerging requirements from our customer base.

Providing these within Hadoop enables organizations to see an increased return on the Hadoop investments by lowering operational costs for administrators, reducing the need to move data between Hadoop HDFS and other storage systems etc.

Why improve scalability?

Moore’s Law… Essentially, at the same price-point, the processing power available in data-centers continues to increase rapidly. As an example, consider the following definitions of commodity servers:

  • 2009 – 8 cores, 16GB of RAM, 4x1TB disk
  • 2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk.

Generally, at the same price-point, servers are twice as capable today as they were 2-3 years ago – on every single dimension.  Apache Hadoop MapReduce is known to scale to production deployments of ~5000 nodes of hardware of 2009 vintage. Thus, ongoing scalability needs are ever present given the above hardware trends.

What are the common scenarios for low cluster utilization?

In the current system, JobTracker views the cluster as composed of nodes (managed by individual TaskTrackers) with distinct map slots and reduce slots, which are not fungible.  Utilization issues occur because maps slots might be ‘full’ while reduce slots are empty (and vice-versa).  Fixing this was necessary to ensure the entire system could be used to its maximum capacity for high utilization.

What is the notion of customer agility?

In real-world deployments, Hadoop is very commonly deployed as a shared, multi-tenant system. As a result, changes to the Hadoop software stack affect a large cross-section if not the entire enterprise. Against that backdrop, customers are very keen on controlling upgrades to the software stack as it has a direct impact on their applications. Thus, allowing multiple, if limited, versions of the MapReduce framework is critical for Hadoop.

Enter Apache Hadoop YARN

The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and per-application ApplicationMaster (AM).

The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, system for managing applications in a distributed manner.

The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container which incorporates resource elements such as memory, cpu, disk, network etc.

The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress. From the system perspective, the ApplicationMaster itself runs as a normal container.

Here is an architectural view of YARN:

One of the crucial implementation details for MapReduce within the new YARN system that I’d like to point out is that we have reused the existing MapReduce framework without any major surgery. This was very important to ensure compatibility for existing MapReduce applications and users. More on this later.

The next post will dive further into the intricacies of the architecture and its benefits such as significantly better scaling, support for multiple data processing frameworks (MapReduce, MPI etc.) and cluster utilization.

 

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Introducing Apache Hadoop YARN

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Introducing Apache Hadoop YARN

I’m thrilled to announce that the Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF!

Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of the Apache Hadoop which, itself, is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project and now is poised to stand up on it’s own as a sub-project of Hadoop.

In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.

As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework, where by, one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN and I see several others given my vantage point – in future you will see MPI, graph-processing, simple services etc.; all co-existing with MapReduce applications in a Hadoop YARN cluster.

Implications for the Apache Hadoop Developer community

I’d like to take a brief moment to walk folks through the implications of making Hadoop YARN as a sub-project, particularly for members of the Hadoop developer community.

  • We will now see a top-level hadoop-yarn-project source folder in Hadoop trunk.
  • We will now use a separate jira project for issue tracking for YARN i.e. https://issues.apache.org/jira/browse/YARN
  • We will also use a new yarn-dev@hadoop.apache.org mailing list for collaboration.
  • We will continue to co-release a single Apache Hadoop release that will include the Common, HDFS, YARN and MapReduce sub-projects.

If you would like to play with YARN please download the latest hadoop-2 release from the ASF and start contributing – either to core YARN sub-project or start building your cool application on top!

Please do remember that hadoop-2 is still deemed alpha quality by the Apache Hadoop community, but YARN itself shows a lot of promise and we are excited by the future possibilities!

Conclusion

Overall, having Hadoop YARN as a sub-project of Apache Hadoop is a significant milestone for Hadoop several years in the making. Personally, it is very exciting given that this journey started more than 4 years ago with https://issues.apache.org/jira/browse/MAPREDUCE-279. It’s a great pleasure, and honor, to get to this point by collaborating with a fantastic community that is driving Apache Hadoop.

Kudos to everyone!

Healthcare Goes Big

Earlier, in the “Big Data in Genomics and Cancer Treatment” blog post, I explored how the extensive amount of information in DNA analysis mostly comes from the vast array of characteristics associated with people’s DNA make up and with different cancer variations. The case with today’s healthcare is very similar. Each patient is unique and has thorough medical history records that allow doctors to make evaluations and recommendations for future treatments. These records also contain various drugs, therapies, diets, and regimens that must coincide with the patient’s condition and which, if not followed correctly, could endanger the patient’s life.

“Doctor, can I have some of that Big Data?”

Currently, the medical field is overflowing with big data and there is huge potential for improvement in treatment quality and overall patient experience. With the use of big data analytics, health care and pharmaceutical companies could significantly advance the services that they offer their patients.

Through big data analytics, there could be much more control over hospital operations. According to the Top Ten Innovations for 2012 article for the 2012 Medical Innovation Summit, data could “track outcomes for clinical and surgical procedures, including length of stay, readmission rates, infection rates, mortality, and comorbidity prevention”. The article also stated that, “Healthcare big data requires advanced technologies to efficiently process it with tolerable elapsed time, so organizations can create, collect, search, and share data, while still ensuring privacy.” This brings up a very important point: how can healthcare organizations take advantage of the benefits that many big data technologies provide while also ensuring privacy?

Healthcare companies must be able to balance the privacy of their individual patients with the overall health of the population. To meet that need, companies can still analyze patient records through the HIPAA-compliant privacy framework – a security framework that eliminates patient identification and still allows data analysis. This framework complies with federal law and, most importantly, brings about exciting improvements at various hospitals in their goals to improve today’s healthcare. Aside from that, there is also the Nationwide Health Information Network (NHIN) Exchange – a way for healthcare professionals to securely exchange information while following specific standards, services, and policies. The NHIN Exchange is helping to achieve the Health Information Technology for Economic and Clinical Health act (HITECH) of 2009.

Big Data Role Models in Healthcare

With the help of big data companies, hospitals and pharmacies have the ability to understand a wide range of patient data, which decreases the chances of missing any warning signs or medical miscalculations. So far, for example, New York Presbyterian Hospital has decreased potentially fatal blood clot cases by approximately 25% and the Seton Healthcare Family hospital has been able to predict (and prepare for) probable congestive heart failure cases. A clinical analytics company called Humedica is focusing specifically on congestive heart failure and has developed a predictive analytic model that also allows doctors to be aware of high-risk CHF patients before they are admitted into a hospital. The Sax Institute has also launched a project called The Secure Unified Research Environment (SURE), which allows health researchers to access patients’ medical information (identities are protected) through a data center. While still in its infancy, this project will compile a lot of research about the consistency of care in respect to the age, wealth, and overall living condition of various patients, so that doctors will be able to analyze all the factors contributing to a patient’s medical case.

Another very impressive project is IBM’s Watson (yes, the one that became popular after a game of “Jeopardy!”)– a computing system that can be used as a tool for doctors and researchers in the medical field. Watson is capable of analyzing the meaning and context of human language, allowing doctors to have an evidence-providing adviser on patient conditions in near real time. To give you a sense of its power: Watson is able to examine about one million books and analyze the information in them all in about three seconds. This kind of speed and precision can prove very helpful for doctors when they are faced with difficult medical cases, especially ones that require quick treatment. This big data system can positively change doctor-patient communication and help to facilitate efficient health care.

Explorys, a healthcare data company, has already developed a secure software platform with the help of Apache Hadoop and offers doctors the ability to aggregate, analyze, manage, and research all of the information they need to make the right decisions every day. Through its platform, Explorys has compiled an extensive healthcare database, which is already being used by 11 other major healthcare companies.

Apixio, another medical search company, uses Hadoop to analyze structured and unstructured data to provide meaningful results when healthcare professionals search specific issues in Apixio’s Medical Information Navigation Engine (MINE). Any kind of data can be put through MINE (forms, CT scans, emails) and doctors can then extract the information they need based on specific symptoms. Vishnu Vyas, a natural language scientist at Apixio, explained MINE as “Google for doctors, only better, because it’s patient-centric and determines how data relate to one another.”

Big Potential

Bill Schmarzo, chief technology officer at EMC, shared a helpful list of Big Data Business Opportunities in health care. Here are a few:

  • Ability to access any data source, no matter where it is located, using new federated query, data discovery and semantic management technologies.  This allows health care providers to gain a more timely, more complete understanding of the patient’s current situation so that they can prescribe the appropriate and most effective treatments.
  • New instrumentation opportunities to increase the amount and real-time nature of data being captured about patients’ health care (blood monitoring, smart toothbrushes, etc.)
  • In-memory capabilities to facilitate real-time, life-saving decisions at the point of care, especially in high stress, immediate need areas like the emergency room.
  • Real-time monitoring of key patient health care metrics that leverages in-memory computing to more rapidly evaluate incoming patient data streams (from the multitude of new health metrics capturing sensors), flag areas of concern, and score potential health-related issues.

By harnessing big data, healthcare industries can see significant benefits and accelerate development, particularly by using the power of Apache Hadoop. Healthcare in the United States is costly and, as an open source platform, Hadoop makes big data analytics affordable. Professionals could revolutionize their medical businesses and provide the best care possible to their patients. Most importantly, if healthcare companies learned to manage big data efficiently, there could be a wider availability of data and, consequently, a much more global knowledge of patient treatments, therapies, and drugs. For the healthcare world, in this case, Apache Hadoop may be just what the doctor ordered.

Go to page:1234