February 12, 2015

Storm and Kafka Together: A Real-time Data Refinery

Hortonworks Data Platform’s YARN-based architecture enables multiple applications to share a common cluster and data set while ensuring consistent levels of response, made possible by a centralized architecture. Hortonworks led the efforts to bring open source data processing engines such as Apache Hive, HBase, Accumulo, Spark, and Storm onto Apache Hadoop YARN.

In this blog, we will focus on one of those data processing engines—Apache Storm—and its relationship with Apache Kafka. I will describe how Storm and Kafka form a multi-stage event processing pipeline, discuss some use cases, and explain Storm topologies.

An oil refinery takes crude oil, distills it, processes it and refines it into useful finished products such as the gas that we buy at the pump. We can think of Storm with Kafka as a similar refinery, but data is the input. A real-time data refinery converts raw streaming data into finished data products, enabling new use cases and innovative business models for the modern enterprise.

Apache Storm is a distributed real-time computation engine that reliably processes unbounded streams of data. While Storm processes stream data at scale, Apache Kafka processes messages at scale. Kafka is a distributed pub-sub real-time messaging system that provides strong durability and fault tolerance guarantees.

Storm and Kafka naturally complement each other, and their powerful cooperation enables real-time streaming analytics for fast-moving big data. HDP 2.2 contains the results of Hortonworks’ continuing focus on making the Storm-Kafka union even more powerful for stream processing.

Conceptual Reference Architecture for Real-Time Processing in HDP 2.2

Conceptual Introduction to the Event Processing Pipeline


In an event processing pipeline, we can view each stage as a purpose-built step that performs some real-time processing against upstream event streams for downstream analysis. This produces increasingly rich event streams as data flows through the pipeline:

  • raw events stream from many sources,
  • those are processed to create events of interest, and
  • events of interest are analyzed to detect significant events.
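The three stages above can be sketched as a toy pipeline. This is purely illustrative Python, not a Storm or Kafka API; the event fields, thresholds, and function names are assumptions made for the example.

```python
# Toy three-stage event pipeline: raw events -> events of interest -> significant events.
# Field names and thresholds are invented for illustration only.

RAW_EVENTS = [
    {"source": "sensor-1", "temp": 21},
    {"source": "sensor-2", "temp": 87},
    {"source": "sensor-1", "temp": 95},
]

def events_of_interest(raw_events, threshold=80):
    """Stage 2: filter raw events down to those worth tracking."""
    return [e for e in raw_events if e["temp"] > threshold]

def significant_events(interesting, critical=90):
    """Stage 3: flag events that demand immediate action."""
    return [dict(e, alert=True) for e in interesting if e["temp"] > critical]

interesting = events_of_interest(RAW_EVENTS)
alerts = significant_events(interesting)
```

Each stage consumes the previous stage's output and emits a smaller, richer stream, which is exactly the shape of the refinery analogy.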

Here are some typical uses for this event processing pipeline:

  • a. High Speed Filtering and Pattern Matching
  • b. Contextual Enrichment on the Fly
  • c. Real-time KPIs, Statistical Analytics, Baselining and Notification
  • d. Predictive Analytics
  • e. Actions and Decisions

Building the Data Refinery with Topologies

To perform real-time computation on Storm, we create “topologies.” A topology is a graph of a computation, containing a network of nodes called “Spouts” and “Bolts.” In a Storm topology, a Spout is the source of data streams and a Bolt holds the business logic for analyzing and processing those streams.
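The Spout/Bolt relationship can be mimicked with a minimal in-memory analogue. This sketch models the concepts only; it is not the Apache Storm Java API, and the class names are invented for the example.

```python
# Minimal in-memory analogue of a Storm topology:
# a Spout emits tuples and a Bolt applies business logic to each one.
# Conceptual sketch only -- not the real Apache Storm API.

class WordSpout:
    """Source of the data stream: emits raw words."""
    def emit(self):
        yield from ["storm", "kafka", "storm", "hdfs"]

class CountBolt:
    """Business logic: counts occurrences of each word."""
    def __init__(self):
        self.counts = {}

    def execute(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

spout = WordSpout()
bolt = CountBolt()
for tup in spout.emit():
    bolt.execute(tup)
```

In real Storm, a `TopologyBuilder` wires Spouts to Bolts into a graph and the cluster runs many instances of each in parallel; the single-threaded loop here stands in for that wiring.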


Hortonworks’ focus for Apache Storm and Kafka has been to make it easier for developers to ingest and publish data streams from Storm topologies. The first topology ingests raw data streams from Kafka and fans out to HDFS, which serves as a persistent store for raw events. Next, a filter Bolt emits the enriched event to a downstream Kafka Bolt that publishes it to a Kafka Topic. As events flow through these stages, the system can track data lineage, allowing drill-down from aggregated events to their constituents for forensic analysis. In a multi-stage pipeline architecture, allocating the right cluster resources to the most intensive processing stages is critical; Storm’s “Isolation Scheduler” provides the ability to easily and safely share a cluster among many topologies.
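The fan-out described above can be sketched as follows. The lists stand in for HDFS storage and a downstream Kafka topic; no real Kafka or HDFS clients are involved, and the event fields are assumptions for illustration.

```python
# Sketch of the fan-out stage: every raw event is persisted (the "HDFS Bolt"),
# and high-priority events are enriched and published downstream (the "Kafka Bolt").
# Conceptual only -- the stores are plain lists, not real HDFS/Kafka clients.

raw_store = []         # stands in for HDFS raw-event storage
downstream_topic = []  # stands in for the enriched Kafka topic

def process(event):
    raw_store.append(event)                       # HDFS Bolt: persist everything
    if event.get("priority") == "high":           # Filter Bolt: select events of interest
        enriched = dict(event, stage="refined")   # contextual enrichment
        downstream_topic.append(enriched)         # Kafka Bolt: publish downstream

for e in [{"id": 1, "priority": "low"}, {"id": 2, "priority": "high"}]:
    process(e)
```

Because every raw event is kept alongside its refined form, a later consumer can drill down from an enriched event back to the raw record it came from, which is the lineage property the text describes.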

In summary, a refinery-style data processing architecture enables you to:

  • Incrementally add more topologies/use cases
  • Tap into raw or refined data streams at any stage of the processing
  • Dedicate key cluster resources to the most intensive processing phases of the pipeline

Learn More

Try out Storm and Kafka Tutorials

Read These Blog Posts


