Apache Storm

A system for processing streaming data in real time

Apache™ Storm adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.

Storm integrates with YARN via Apache Slider. YARN manages Storm while also considering cluster resources for the data governance, security and operations components of a modern data architecture.

What Storm Does

Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. Enterprises harness this speed and combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.

Specific new business opportunities include real-time customer service management, data monetization, operational dashboards, and cyber security analytics and threat detection.

Here are some typical “prevent” and “optimize” use cases for Storm.

Financial Services
  • Prevent: securities fraud; operational risks & compliance violations
  • Optimize: order routing; pricing
Telecom
  • Prevent: security breaches; network outages
  • Optimize: bandwidth allocation; customer service
Retail
  • Prevent: shrinkage; stock outs
  • Optimize: offers; pricing
Manufacturing
  • Prevent: preventative maintenance; quality assurance
  • Optimize: supply chain optimization; reduced plant downtime
Transportation
  • Prevent: driver monitoring; predictive maintenance
  • Optimize: routes; pricing
Web
  • Prevent: application failures; operational issues
  • Optimize: personalized content

Storm is simple to use, and developers can write Storm topologies in any programming language.

Five characteristics make Storm ideal for real-time data processing workloads. Storm is:

  • Fast – benchmarked as processing one million 100-byte messages per second per node
  • Scalable – with parallel calculations that run across a cluster of machines
  • Fault-tolerant – when workers die, Storm automatically restarts them. If a node dies, the worker is restarted on another node.
  • Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures (illustrated in the bolt sketch after this list).
  • Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.
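
The at-least-once guarantee rests on bolts anchoring each emitted tuple to the input it came from and then acknowledging that input. Below is a minimal sketch, not taken from this article, of such a bolt written against the Storm 1.x Java API (org.apache.storm.*); the class name UppercaseBolt and its single string field are illustrative assumptions.

    // Minimal sketch of a bolt that anchors and acks tuples (Storm 1.x API).
    // UppercaseBolt is an illustrative name, not a class from this article.
    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class UppercaseBolt extends BaseRichBolt {
        private OutputCollector collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple input) {
            try {
                // Emit the result anchored to the input tuple, so a failure anywhere
                // downstream causes the spout to replay the original message.
                collector.emit(input, new Values(input.getString(0).toUpperCase()));
                collector.ack(input);   // mark the input as fully processed
            } catch (Exception e) {
                collector.fail(input);  // ask the spout to replay it
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

Exactly-once semantics are provided by Storm's higher-level Trident API rather than by this low-level bolt interface.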

How Storm Works

A Storm cluster has three types of nodes:

  • Nimbus node (master node, similar to the Hadoop JobTracker):
    • Uploads computations for execution
    • Distributes code across the cluster
    • Launches workers across the cluster
    • Monitors computation and reallocates workers as needed
  • ZooKeeper nodes – coordinate the Storm cluster
  • Supervisor nodes – communicate with Nimbus through ZooKeeper, and start and stop workers according to signals from Nimbus (see the storm.yaml sketch below)
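
In a typical deployment these roles are wired together through each node's storm.yaml configuration file. The following is a minimal sketch using standard Storm 1.x keys; the host names and paths are illustrative assumptions, and older releases use nimbus.host in place of nimbus.seeds.

    # Illustrative storm.yaml fragment (host names and paths are placeholders)
    storm.zookeeper.servers:                # ZooKeeper ensemble coordinating the cluster
      - "zk1.example.com"
      - "zk2.example.com"
    nimbus.seeds: ["nimbus1.example.com"]   # candidate Nimbus (master) hosts
    storm.local.dir: "/var/storm"           # local working directory for daemon state
    supervisor.slots.ports:                 # one worker slot per port on each supervisor
      - 6700
      - 6701
      - 6702
      - 6703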

Storm Architecture

Five key abstractions help explain how Storm processes data:

  • Tuples – an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
  • Streams – an unbounded sequence of tuples
  • Spouts – sources of streams in a computation (e.g. a Twitter API)
  • Bolts – process input streams and produce output streams. They can run functions; filter, aggregate, or join data; or talk to databases.
  • Topologies – the overall calculation, represented visually as a network of spouts and bolts (as in the following diagram)

[Diagram: a Storm topology as a network of spouts and bolts]

Storm users define topologies for how to process the data when it comes streaming in from the spout. When the data comes in, it is processed and the results are passed into Hadoop.
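
To make these abstractions concrete, here is a minimal word-count topology sketched in Java against the Storm 1.x API (org.apache.storm.*). It is not taken from this article: SentenceSpout and WordCountBolt are illustrative names, and the spout replays a fixed sentence rather than reading a real source such as the Twitter API.

    // Minimal word-count topology sketch (Storm 1.x API). Class names are illustrative.
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class WordCountTopology {

        // Spout: an unbounded stream of sentences (here, the same test sentence).
        public static class SentenceSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            public void nextTuple() {
                Utils.sleep(1000);  // avoid a tight loop in this toy example
                collector.emit(new Values("the quick brown fox"));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sentence"));
            }
        }

        // Bolt: splits each sentence into words and keeps a running count per word.
        public static class WordCountBolt extends BaseBasicBolt {
            private final Map<String, Integer> counts = new HashMap<>();

            public void execute(Tuple tuple, BasicOutputCollector collector) {
                for (String word : tuple.getStringByField("sentence").split(" ")) {
                    int count = counts.merge(word, 1, Integer::sum);
                    collector.emit(new Values(word, count));
                }
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }

        public static void main(String[] args) throws Exception {
            // Wire the spout to the bolt; a single bolt task keeps one consolidated
            // count map, and shuffleGrouping routes every sentence to it.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 1);
            builder.setBolt("counts", new WordCountBolt(), 1).shuffleGrouping("sentences");

            // Run in-process for testing; a production job would use StormSubmitter.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
        }
    }

Submitting the same topology to a real cluster would replace LocalCluster with StormSubmitter.submitTopology, which hands the packaged topology to Nimbus for distribution across the supervisors.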

Learn more about how the community is working to integrate Storm with Hadoop and improve its readiness for the enterprise.

Hortonworks Focus for Storm

Hortonworks is focused on developer productivity, enterprise readiness and operational simplicity of Storm.

Developer Productivity
  • New Storm Connectors
  • Storm-Kafka Spout using new client APIs
  • Storm Distributed Log Search
  • Storm Dynamic Worker Profiling
Enterprise Readiness
  • Improved Nimbus HA
  • Storm Automatic Back Pressure
  • Storm Distributed Cache
  • Storm Windowing and State Management
  • Storm Performance Improvements
Operational Simplicity
  • Storm Topology Event Inspector
  • Storm Resource Aware Scheduling
  • Storm Dynamic Log Levels
  • Pacemaker Storm Daemon

For more information, see the Apache Storm 1.0 announcement.

Recent Progress in Storm

The Apache Storm open source community has already begun working on those themes.

Version 1.0.2 (HDP 2.5, HDF 2.0)

Version 0.10.0 (HDP 2.4, HDF 1.2)
  • Netty transport overhaul – significantly improves performance through better utilization of thread, CPU, and network resources
  • UI improvements with a new REST API – the REST API exposes metrics and operations in JSON format and is used by the UI
  • Pluggable serialization for multilang – enables the use of more performant serialization frameworks than JSON, such as protocol buffers
  • Apache Kafka spout – for consuming data from Kafka
