August 27, 2014

Hadoop Summit Curated Content: Apache Storm

Chaos Before The Storm … and a Brief History

In both its name and the image it evokes, Apache Storm lives up to its purpose and promise: to ingest, absorb, and digest an avalanche of real-time data, as an unbounded stream of discrete events, with scale, speed, and success.

Before Storm, developers used a set of queues and workers to process a stream of real-time events. That is, events were placed on worker queues, and worker threads plucked events off those queues and processed them: transforming them, persisting them, or forwarding them to another queue for further processing. In fact, Nathan Marz used this initial paradigm at Twitter to process a storm of tweets from the Twitter firehose.
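
As a rough illustration of the queue-and-worker model described above, here is a minimal Python sketch. It is not Marz's or Twitter's code: the function names, the `SENTINEL` shutdown marker, and the sample tweets are all hypothetical, and a real deployment would use a distributed message queue rather than in-process threads.

```python
import queue
import threading

SENTINEL = None  # hypothetical shutdown marker, one per worker


def worker(inbox, outbox, transform):
    """Pluck events from inbox, transform them, and forward them to outbox."""
    while True:
        event = inbox.get()
        if event is SENTINEL:
            break
        outbox.put(transform(event))


def run_stage(events, transform, num_workers=2):
    """Run one queue-and-worker stage over a finite list of events."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, outbox, transform))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for event in events:
        inbox.put(event)
    for _ in threads:
        inbox.put(SENTINEL)  # tell each worker to stop
    for t in threads:
        t.join()
    # All workers have finished, so the outbox can be drained safely.
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results


def normalize(tweet):
    """Stage one: clean up the raw text."""
    return tweet.strip().lower()


def count_words(tweet):
    """Stage two: derive a simple metric from the cleaned text."""
    return len(tweet.split())


tweets = ["  Hello Storm ", "Real-time DATA at scale", "ok"]
stage_one = run_stage(tweets, normalize)       # first queue and workers
stage_two = run_stage(stage_one, count_words)  # forwarded to a second queue
```

Every stage, worker count, and shutdown signal here is wired by hand, output order is not guaranteed, and an event is lost if its worker dies mid-processing.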

But this model, Marz said, fell short for real-time computation in three ways:

  1. It did not scale (adding a worker or a queue was difficult).
  2. It lacked fault-tolerance (handling failure became unmanageable).
  3. It was tedious to build and manage (coordinating tasks added undue complexity).

Motivated by these and additional requirements, Nathan Marz (then at BackType, which Twitter acquired) designed and built a real-time distributed data processing system that not only addressed these requirements but also supported three use cases: stream processing, distributed RPC, and continuous computation. Marz open-sourced Storm in September 2011.

Calm After the Storm

Apache Storm has been an incubating project at the Apache Software Foundation (ASF) since September 2013. Today, a growing number of companies use it to process massive amounts of data in real time for analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Since Storm’s incubation, a growing community has contributed to the project.

As a component that handles real-time data, Apache Storm fits well in the data access and management layer of the enterprise blueprint, running on Apache Hadoop YARN, the architectural center of Hadoop 2.

Apache Storm Curated Content

Some of the contributors, committers, and companies shared their Apache Storm usage, best practices, and cluster architecture at the Hadoop Summit San Jose 2014. Although the Summit is behind us, the invaluable content is available here.

We have selected a few sessions below for practitioners, curating them under the Apache Storm theme. Here are a few sessions that demonstrate Apache Storm’s success, speed, and scale for real-time data processing:

| Session Title | Watch | View |
|---|---|---|
| Scaling Storm – Cluster Sizing and Performance Optimization | Video | Slides |
| Analyzing 1.2 Million Network Packets per Second in Real Time | Video | Slides |
| Multi-tenant Storm as a Service (on and off Hadoop) | Video | Slides |
| Real-time Analytics and Anomalies Detection using Elasticsearch, Hadoop and Storm | Video | Slides |
| Pig on Storm | Video | Slides |
| Architecting R into the Storm Application Development Process | Video | Slides |
| Real-time Energy Data Analytics with Storm | Video | Slides |

We cherry-picked these few tracks because they best demonstrate Storm’s usage, but you can always browse the session descriptions on the schedule, for any time slot on any day, to find whatever piques your curiosity. For example, when you hover over and click a session description, a popup appears in which you can choose to watch the video or view the slides.

Discover and Learn More

  • Visit our Apache Storm Roadmap.
  • Read the future of Apache Storm.
  • Read about Apache Storm.
  • Try Apache Storm Tutorials.