Apache Storm

A system for processing streaming data in real time

Apache™ Storm adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.

Storm integrates with YARN via Apache Slider. YARN manages Storm while also allocating cluster resources to the data governance, security and operations components of a modern data architecture.

What Storm Does

Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. Enterprises harness this speed and combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.

Specific new business opportunities include real-time customer service management, data monetization, operational dashboards, and cybersecurity analytics and threat detection.

Here are some typical “prevent” and “optimize” use cases for Storm.

Financial Services
  • Prevent: securities fraud; operational risks & compliance violations
  • Optimize: order routing; pricing
Telecom
  • Prevent: security breaches; network outages
  • Optimize: bandwidth allocation; customer service
Retail
  • Prevent: shrinkage; stock outs
  • Optimize: offers; pricing
Manufacturing
  • Prevent: preventative maintenance; quality assurance
  • Optimize: supply chain optimization; reduced plant downtime
Transportation
  • Prevent: driver monitoring; predictive maintenance
  • Optimize: routes; pricing
Web
  • Prevent: application failures; operational issues
  • Optimize: personalized content

Now with Storm in Hadoop on YARN, a Hadoop cluster can efficiently process a full range of workloads from real-time to interactive to batch. Storm is simple and developers can write Storm topologies using any programming language.

Five characteristics make Storm ideal for real-time data processing workloads. Storm is:

  • Fast – benchmarked as processing one million 100-byte messages per second per node
  • Scalable – with parallel calculations that run across a cluster of machines
  • Fault-tolerant – when workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.
  • Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
  • Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.
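The reliability guarantee works by tracking each tuple until it is acknowledged, and replaying it if processing fails. The following is a minimal sketch of that ack-and-replay idea in plain Python; it is a conceptual simulation, not the Storm API, and all names are illustrative:

```python
import random

class ReplayQueue:
    """Toy at-least-once delivery: tuples are retried until acked."""

    def __init__(self, tuples):
        self.pending = list(tuples)    # tuples awaiting successful processing
        self.delivered = []            # tuples that were fully processed

    def process_all(self, bolt, fail_rate=0.0):
        while self.pending:
            t = self.pending.pop(0)
            try:
                if random.random() < fail_rate:
                    raise RuntimeError("simulated worker failure")
                bolt(t)                      # hand the tuple to downstream logic
                self.delivered.append(t)     # "ack": processing succeeded
            except RuntimeError:
                self.pending.append(t)       # "fail": replay the tuple later

results = []
q = ReplayQueue([1, 2, 3])
q.process_all(results.append, fail_rate=0.3)
# Every tuple is eventually processed, even though some attempts fail.
```

Note that a tuple is only replayed after a failure, which is why Storm's message replay adds no overhead on the happy path.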

How Storm Works

A Storm cluster has three sets of nodes:

  • Nimbus node (master node, similar to the Hadoop JobTracker):
    • Uploads computations for execution
    • Distributes code across the cluster
    • Launches workers across the cluster
    • Monitors computation and reallocates workers as needed
  • ZooKeeper nodes – coordinate the Storm cluster
  • Supervisor nodes – communicate with Nimbus through ZooKeeper, and start and stop workers according to signals from Nimbus


Five key abstractions explain how Storm processes data:

  • Tuples – ordered lists of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
  • Streams – unbounded sequences of tuples
  • Spouts – sources of streams in a computation (e.g. a Twitter API)
  • Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases
  • Topologies – the overall calculation, represented visually as a network of spouts and bolts (as in the following diagram)
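The relationship among these abstractions can be sketched in a few lines of plain Python using the classic word-count example. This is a conceptual simulation, not the actual Storm API; all function names are illustrative:

```python
def sentence_spout():
    """Spout: a source of tuples (unbounded in Storm; finite here)."""
    for sentence in ["the quick brown fox", "the lazy dog"]:
        yield (sentence,)                  # each emitted item is a tuple

def split_bolt(stream):
    """Bolt: consumes one stream and emits another (split into words)."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: aggregates tuples into running word counts."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology: wiring the spout and bolts together into a dataflow graph.
counts = count_bolt(split_bolt(sentence_spout()))
```

In real Storm, each spout and bolt runs as parallel tasks distributed across the cluster, and the topology declaration specifies how tuples are routed between them.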


Storm users define topologies that describe how to process data as it streams in from spouts. As data arrives, it is processed and the results can be passed into Hadoop.

Learn more about how the community is working to integrate Storm with Hadoop and improve its readiness for the enterprise.

Hortonworks Focus for Storm

Apache Storm adds open source stream processing to Enterprise Hadoop and the Hortonworks Data Platform. The Storm community is working to improve capabilities related to three important themes: business continuity, operations and developer productivity. The team is working to deliver high availability (HA), user authentication, advanced scheduling and declarative wiring to Storm.

Business continuity
    Enhance Storm’s enterprise readiness with high availability (HA) and failover to standby clusters
Operations
    Apache Ambari support for Nimbus HA node setup and elastic topologies via YARN and Apache Slider. Incremental improvements to the Storm UI to easily deploy, manage and monitor topologies.
Developer productivity
    Declarative wiring of data sources, spouts and bolts into topologies

Recent Progress in Storm

The Apache Storm open source community has already begun working on those themes.

Storm Version Progress
Version 0.9.2
  • Netty transport overhaul – significantly improves performance through better utilization of thread, CPU, and network resources
  • UI improvements with a new REST API – the REST API exposes metrics and operations in JSON format, used by the UI
  • Pluggable serialization for multilang – enables the use of more performant serialization frameworks than JSON, like protocol buffers
  • Apache Kafka spout – for consuming data from Kafka
Version 0.9.0
  • Netty-based messaging transport
  • Log viewer UI
  • Improved Windows platform support
  • Security improvements
  • API compatibility and upgrading


