A system for processing streaming data in real time
Apache™ Storm adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.
Storm integrates with YARN via Apache Slider, YARN manages Storm while also considering cluster resources for data governance, security and operations components of a modern data architecture.
What Storm Does
Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. Enterprises harness this speed and combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.
Some of specific new business opportunities include: real-time customer service management, data monetization, operational dashboards, or cyber security analytics and threat detection.
Here are some typical “prevent” and “optimize” use cases for Storm.
|“Prevent” Use Cases||“Optimize” Use Cases|
Now with Storm in Hadoop on YARN, a Hadoop cluster can efficiently process a full range of workloads from real-time to interactive to batch. Storm is simple and developers can write Storm topologies using any programming language.
Five characteristics make Storm ideal for real-time data processing workloads. Storm is:
- Fast – benchmarked as processing one million 100 byte messages per second per node
- Scalable – with parallel calculations that run across a cluster of machines
- Fault-tolerant – when workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.
- Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
- Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.
How Storm Works
A storm cluster has three sets of nodes:
- Nimbus node (master node, similar to the Hadoop JobTracker):
- Uploads computations for execution
- Distributes code across the cluster
- Launches workers across the cluster
- Monitors computation and reallocates workers as needed
- ZooKeeper nodes – coordinates the Storm cluster
- Supervisor nodes – communicates with Nimbus through Zookeeper, starts and stops workers according to signals from Nimbus
Five key abstractions help to understand how Storm processes data:
- Tuples– an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
- Streams – an unbounded sequence of tuples.
- Spouts –sources of streams in a computation (e.g. a Twitter API)
- Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases.
- Topologies – the overall calculation, represented visually as a network of spouts and bolts (as in the following diagram)
Storm users define topologies for how to process the data when it comes streaming in from the spout. When the data comes in, it is processed and the results are passed into Hadoop.
Learn more about how the community is working to integrate Storm with Hadoop and improve its readiness for the enterprise.
Hortonworks Focus for Storm
Apache Storm added open source, stream data processing to Enterprise Hadoop and Hortonworks Data Platform. The Storm community is working to improve capabilities related to three important themes: business continuity, operations and developer productivity. The team is working to deliver high availability (HA), user authentication, advanced scheduling and declarative wiring to Storm.
Recent Progress in Storm
The Apache Storm open source community has already begun working on those themes.