Apache Storm

A system for processing streaming data in real time

Apache™ Storm is a distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop® 2.x. Running Storm in Hadoop helps enterprises capture new business opportunities with low-latency dashboards, security alerts, and operational enhancements integrated with the other applications running in their Hadoop cluster.

What Storm Does

Now, with Storm and MapReduce running together in Hadoop on YARN, a Hadoop cluster can efficiently process a full range of workloads, from real-time to interactive to batch. Storm is simple to use, and developers can write Storm topologies in any programming language.

Five characteristics make Storm ideal for real-time data processing workloads. Storm is:

  • Fast – benchmarked at processing one million 100-byte messages per second per node
  • Scalable – with parallel calculations that run across a cluster of machines
  • Fault-tolerant – when workers die, Storm automatically restarts them; if a node dies, its workers are restarted on another node
  • Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are replayed only when there are failures (see the sketch after this list).
  • Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.
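
The "at least once" guarantee comes from tuple tracking: a bolt anchors every tuple it emits to the input tuple it came from, then acks the input on success or fails it so the spout can replay it. The following is a minimal sketch of that pattern, assuming the Storm 2.x Java API (org.apache.storm packages); the bolt name and the "line" field are illustrative, not part of any particular application.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// At-least-once processing: anchor each emitted tuple to its input, then
// ack on success or fail on error so the spout replays only the failures.
public class UppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String line = input.getStringByField("line");    // field name is illustrative
            // Anchoring (passing `input` to emit) ties the new tuple to the original,
            // so Storm can track whether the whole tuple tree completes.
            collector.emit(input, new Values(line.toUpperCase()));
            collector.ack(input);                             // mark the input as fully processed
        } catch (Exception e) {
            collector.fail(input);                            // ask the spout to replay this tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}
```

Exactly-once semantics are usually layered on top of this mechanism (for example via Storm's Trident API) rather than achieved with plain acking.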

Enterprises use Storm to prevent certain outcomes or to optimize their objectives. Here are some “prevent” and “optimize” use cases.

  • Financial Services – Prevent: Securities Fraud, Compliance Violations; Optimize: Order Routing, Pricing
  • Telecom – Prevent: Security Breaches, Network Outages; Optimize: Bandwidth Allocation, Customer Service
  • Retail – Prevent: Shrinkage, Stock Outs; Optimize: Offers, Pricing
  • Manufacturing – Prevent: Machine Failures, Quality Assurance; Optimize: Supply Chain, Continuous Improvement
  • Transportation – Prevent: Driver Monitoring, Predictive Maintenance; Optimize: Routes, Pricing
  • Web – Prevent: Application Failures, Operational Issues; Optimize: Personalized Content

How Storm Works

A Storm cluster has three sets of nodes:

  • Nimbus node (master node, similar to the Hadoop JobTracker):
    • Uploads computations for execution
    • Distributes code across the cluster
    • Launches workers across the cluster
    • Monitors computation and reallocates workers as needed
  • ZooKeeper nodes – coordinate the Storm cluster
  • Supervisor nodes – communicate with Nimbus through ZooKeeper, and start and stop workers according to signals from Nimbus

[Diagram: Storm cluster architecture – Nimbus, ZooKeeper, and Supervisor nodes]
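
To make the division of labor concrete, the sketch below shows how a client submits a topology: the client ships the topology (and its jar) to Nimbus, Nimbus records worker assignments in ZooKeeper, and the Supervisors start the assigned workers. This assumes the Storm 2.x Java API; the topology name and the worker/pending settings are illustrative values.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        // Wire up spouts and bolts here (see the word-count example further below).
        TopologyBuilder builder = new TopologyBuilder();

        Config conf = new Config();
        conf.setNumWorkers(4);          // worker JVMs Nimbus should allocate across Supervisors
        conf.setMaxSpoutPending(1000);  // cap on un-acked tuples in flight per spout task

        // The client uploads the code to Nimbus; Nimbus writes the assignment to
        // ZooKeeper, and Supervisors start or stop worker processes to match it.
        StormSubmitter.submitTopology("example-topology", conf, builder.createTopology());
    }
}
```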

Five key abstractions help explain how Storm processes data:

  • Tuples – an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
  • Streams – an unbounded sequence of tuples.
  • Spouts – sources of streams in a computation (e.g. a Twitter API)
  • Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases.
  • Topologies – the overall calculation, represented visually as a network of spouts and bolts (as in the following diagram)

[Diagram: a Storm topology – a network of spouts and bolts]

Storm users define topologies that describe how to process data as it streams in from the spouts. As the data arrives, it is processed and the results can be passed on to Hadoop.
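
To tie the abstractions together, here is a minimal word-count topology, a hedged sketch assuming the Storm 2.x Java API: a spout emits sentence tuples, one bolt splits them into word tuples, and a second bolt keeps running counts. Class names, field names, and parallelism hints are illustrative, and the example runs in an in-process LocalCluster for testing; a production deployment would be submitted with StormSubmitter as sketched earlier.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountTopology {

    // Spout: the source of the stream. Here it just emits random sentences;
    // in practice it would wrap a real feed such as Kafka or the Twitter API.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] sentences = {"the cow jumped over the moon", "an apple a day"};
        private final Random rand = new Random();

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(sentences[rand.nextInt(sentences.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each sentence tuple into one tuple per word (a "run functions" bolt).
    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getStringByField("sentence").split("\\s+")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: keeps a running count per word (the "aggregate" role described above).
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            int count = counts.merge(word, 1, Integer::sum);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        // The topology is the network of spouts and bolts from the diagram above.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");
        builder.setBolt("count", new CountBolt(), 2).fieldsGrouping("split", new Fields("word"));

        // Run in-process for testing; on a real cluster, submit with StormSubmitter instead.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
            Thread.sleep(10_000);
        }
    }
}
```

The fieldsGrouping on "word" routes every tuple carrying the same word to the same CountBolt task, which keeps the per-word counts correct even when the bolt runs with parallelism greater than one; shuffleGrouping simply load-balances sentences across the split tasks.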

Learn more about how the community is working to integrate Storm with Hadoop and improve its readiness for the enterprise.

Try these Tutorials

Try Storm with Sandbox

Hortonworks Sandbox is a self-contained virtual machine with HDP running alongside a set of hands-on, step-by-step Hadoop tutorials.

Get Sandbox
