Stream Processing in Hadoop: YARN, Storm and the Hortonworks Data Platform

Hortonworks will make a preview of Apache Storm integration available in Q4 of this year and will include Apache Storm in the Hortonworks Data Platform in the first half of 2014.

Any time now, the Apache Hadoop community will declare the General Availability of Hadoop 2.0, which includes the much-anticipated Apache Hadoop YARN. The YARN-based architecture of Hadoop 2 is the most significant change to Hadoop in the past six years. It enables Hadoop to expand from a single-purpose, batch-oriented data platform based on MapReduce into a truly multi-purpose platform supporting a wide range of data processing approaches. The general availability of YARN, which in 2.0 essentially becomes the Hadoop Operating System, promises to open up the range of ways Hadoop is used to process data.

In fact, one of the most common use cases that we see emerging from our customers is the antithesis of batch: stream processing in Hadoop.  Early adopters are using stream processing to analyze some of the most common new types of data such as sensor and machine data in real time.

For some users, this means monitoring a continuous stream of server log data and taking action immediately in the case of component failure. For others, it means monitoring a stream of market data for signals and then taking action in real time, or powering real-time analytic dashboards. This is being done on dedicated Hadoop clusters today.

A recent customer example of the need for streaming comes from UC Irvine Medical Center. The organization recently launched a new technology called SensiumVitals® to monitor and transmit patient vital signs every minute. The minute-by-minute snapshots of vital signs (4,320 per patient, per day) are the building blocks for algorithms that ultimately lead to dramatically reduced average time-to-insight.

“This sensor data enables real-time predictive analytics that can allow caregivers like us to respond before a patient’s vital signs ever cross a dangerous threshold.”
Charles Boicey, UCI’s Informatics Solutions Architect

Enter Storm

A few weeks ago the Storm project, originally conceived and built by the team at BackType/Twitter to analyze the tweet stream in real time, officially entered the Apache Incubator. Over the past year or so, interest has grown as many early adopters have embraced it as the preferred option for streaming analytics in Hadoop. Yahoo!, an unabashed Hadoop trailblazer, picked up Storm late last year and started to build out Storm on YARN. At Hadoop Summit this summer they presented a use case for Storm on YARN in which they achieved five-second analytics windows on streaming data. Broader usage of Storm is well documented on GitHub.

“Many applications use Storm for low-latency processing and Map/Reduce for batch processing while sharing data between Storm and Map/Reduce. By placing Storm physically closer to the data source and/or other components in the same pipeline we can reduce network transfers and in turn the total cost of acquiring the data.”
Andy Feng, Distinguished Architect, Yahoo!
YDN: Storm-YARN Released as Open Source

An engineering commitment to deeply integrate Apache Storm with Hadoop

At Hortonworks, we know the fastest path to innovation is the open community, and we have built our entire development model around this principle: every Hortonworks developer is a contributor and every contribution is made in the open. To that end, we are pleased to announce an engineering commitment to deeply integrate Storm with Hadoop, specifically as a supported component of the 100% Open Source Hortonworks Data Platform.

Availability of Storm with the Hortonworks Data Platform

Hortonworks will be making a preview of Storm integration available in Q4 of this year and will be including Apache Storm in the Hortonworks Data Platform in H1 of 2014.

We’re bullish about the possibilities that stream processing brings, and excited to be bringing Storm to HDP.  Please drop us a line and let us know how you intend to use Storm!

Follow the progress on Storm integration with HDP on our Labs page.
