True to its name and the image it evokes, Apache Storm lives up to its purpose and promise: to ingest, absorb, and digest an avalanche of real-time data, arriving as an unbounded stream of discrete events, at scale and speed, and with success.
Before Storm, developers used a set of queues and workers to process a stream of real-time events. That is, events were placed on worker queues, and worker threads plucked events off the queues and processed them: transforming, persisting, or forwarding them to another queue for further processing. In fact, Nathan Marz used this initial paradigm at Twitter to process a storm of tweets from the Twitter firehose.
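To make the queues-and-workers pattern concrete, here is a minimal sketch in Python using only the standard library. It is an illustration of the general pattern, not Marz's or Twitter's actual code; the names `worker`, `run_pipeline`, and `SENTINEL` are hypothetical, and the in-process queues stand in for what would have been a distributed message queue in practice.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker that tells a worker to stop


def worker(inbox, outbox, transform):
    """Pluck events from inbox, transform them, and forward them to outbox."""
    while True:
        event = inbox.get()
        if event is SENTINEL:
            break
        outbox.put(transform(event))


def run_pipeline(events, transform, num_workers=2):
    """Feed events through a pool of worker threads and collect the results."""
    inbox = queue.Queue()
    outbox = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, outbox, transform))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for event in events:
        inbox.put(event)
    # One sentinel per worker, queued after all events, so every worker exits.
    for _ in threads:
        inbox.put(SENTINEL)
    for t in threads:
        t.join()
    # All workers have finished, so the outbox can be drained safely.
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results


if __name__ == "__main__":
    # Output order depends on thread scheduling, so sort before printing.
    print(sorted(run_pipeline(["storm", "tweet", "event"], str.upper)))
```

Even in this toy version, the routing of events, the worker lifecycle, and the shutdown logic are all wired by hand, and a worker that dies takes its in-flight event with it.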
But this model, Marz said, had problems for real-time computation for three reasons:
Motivated by these and additional requirements, Nathan Marz (then at BackType, which Twitter acquired) designed and built a real-time distributed data processing system that not only addressed these requirements but also enabled three use cases: stream processing, distributed RPC, and continuous computation. Marz open-sourced Storm in September 2011.
Today, Apache Storm has been an incubating project at the Apache Software Foundation (ASF) since September 2013, and a growing number of companies use it to process massive amounts of data in real time for analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Since Storm’s incubation, a growing community has contributed to the project.
As a component that handles real-time data, Apache Storm fits well in the enterprise blueprint’s data access and management layer, running on Apache Hadoop YARN, the architectural center of Hadoop 2.
Some of the contributors, committers, and companies shared their Apache Storm usage, best practices, and cluster architecture at the Hadoop Summit San Jose 2014. Although the Summit is behind us, the invaluable content is available here.
We have selected a few sessions below for practitioners, curated under an Apache Storm theme. They demonstrate Apache Storm’s success, speed, and scale for real-time data processing:
|Scaling Storm – Cluster Sizing and Performance Optimization||Video||Slides|
|Analyzing 1.2 Million Network Packets per Second in Real Time||Video||Slides|
|Multi-tenant Storm as a Service (on and off Hadoop)||Video||Slides|
|Real-time Analytics and Anomalies Detection using Elasticsearch, Hadoop and Storm||Video||Slides|
|Pig on Storm||Video||Slides|
|Architecting R into the Storm Application Development Process||Video||Slides|
|Real-time Energy Data Analytics with Storm||Video||Slides|
We cherry-picked these few sessions because they best demonstrate Storm’s usage, but you can always browse all the tracks on the schedule and open the description of any session, in any time slot on any day, that piques your curiosity. For example, when you hover over and click on a session description, a popup will display in which you can choose to watch the video or view the slides.