We recently hosted the sixth of our seven Discover HDP 2.1 webinars, entitled Apache Storm for Stream Data Processing in Hadoop. Over 200 people attended the webinar and joined in the conversation.
Thanks to our presenters: Justin Sears (Hortonworks’ Product Marketing Manager), Himanshu Bari (Hortonworks’ Senior Product Manager for Storm), and Taylor Goetz (Hortonworks’ Software Engineer and Apache Storm Committer). The speakers covered:
If you missed the webinar, here is the complete recording:
And here is the presentation deck.
|What is the difference between Apache Storm and Apache HBase?||HBase and Storm were created to address separate problems. Storm provides data processing in real time, while HBase (over HDFS) offers low-latency reads of processed data for later querying. Storm processes but does not store. HBase stores but does not process. Normally, you would put your Storm cluster at the front, where data is processed in real time as it is ingested; alerts and actions can be raised if needed, and the data can then be persisted in HDFS and accessed via HBase.|
|Can Storm and Hadoop co-exist within the same cluster?||Yes, they can co-exist in the same cluster. You may install Storm on the same nodes as your Hadoop cluster. However, Storm is not yet a native YARN application. We are working on it and it will be available soon.|
|Is Apache Storm related to Apache Flume? How are they different?||This is similar to the earlier question about HBase. Apache Flume ingests data but does not process it. Apache Storm processes data but does not ingest it. Normally, you would extract data from Flume, put it on a messaging pipeline like Kafka, and then feed or connect that to Storm for real-time processing.|
|What is the difference between Apache Storm and Apache Spark?||One way to describe the difference is that Spark is a batch processing framework that also does micro-batching (Spark Streaming), while Storm is a stream processing framework that also does micro-batching (Trident). So architecturally they are very different, but they have some similarity on the functional side. With micro-batching, you can achieve higher throughput at the cost of increased latency. With Spark, this is unavoidable. With Storm, you can use the core API (spouts and bolts) to do one-at-a-time processing to avoid the inherent latency overhead imposed by micro-batching. And with Trident, you get state management out of the box, and sliding windows are supported as well. Finally, many enterprises use Storm in production for real-time stream processing, whereas Spark Streaming is still new.|
|Can a Storm bolt replace the functionality done with Apache Pig or Apache Hive?||The Bolt component in Storm is open-ended. You can write your processing logic in whatever language you want, and it gets plugged into the Storm processing topology. There are efforts underway within the Apache community to do just that. For example, Yahoo! is working on “Pig on Storm.” These Pig scripts can be part of the Bolt semantics.|
|Can Apache Storm and Apache Hadoop coexist on the same cluster?||Absolutely. However, under heavy load or particularly resource-intensive use cases, it is best to deploy the Supervisor services to dedicated nodes (i.e. not stacked on top of other services such as ZooKeeper). This helps avoid resource contention.|
|How does a data stream flow into Storm?||Data flows into Storm via Spouts. A Spout wraps a data source and emits Storm Tuples. Spouts are already available for several streaming data sources, such as Kafka and JMS. Writing a custom Spout is also a straightforward process.|
|Can you explain Trident in more detail, with a use case?||Trident use cases are no different from Storm’s. Trident is a higher-level API than Storm’s core API. One thing that you can easily achieve with Trident’s API (that would take much more work with Storm’s core primitives) is exactly-once semantics. With core Storm, it is easy to do at-least-once processing, whereas it’s much more difficult to do exactly-once processing. Trident brings you the ability to do that easily. It supports transactions, so you can transactionally process data in small batches. We mentioned in the presentation that Trident is a micro-batch processing framework. As such, Trident supports higher throughput at the cost of some additional latency overhead.|
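
As a rough illustration of why transactional micro-batches make exactly-once processing easier, the sketch below models a state store that remembers the ID of the last batch it applied. It is plain Python, not Trident’s actual API: the names `NaiveCounter`, `TransactionalCounter`, `make_batches`, and `run` are hypothetical, and the model assumes that batches are delivered in order and that a replayed batch carries the same ID and the same tuples as the original.

```python
class NaiveCounter:
    """At-least-once: every delivery is counted, including replays."""

    def __init__(self):
        self.count = 0

    def apply_batch(self, txid, batch):
        self.count += len(batch)


class TransactionalCounter:
    """Exactly-once: the batch ID is stored with the count, so replays are skipped."""

    def __init__(self):
        self.count = 0
        self.last_txid = 0  # ID of the most recent batch applied (0 = none yet)

    def apply_batch(self, txid, batch):
        # Assumption: batches arrive in increasing txid order, and a replay
        # reuses the original txid, so "already seen" is a simple comparison.
        if txid <= self.last_txid:
            return
        self.count += len(batch)
        self.last_txid = txid


def make_batches(events, batch_size):
    """Split events into (txid, batch) pairs, numbering batches from 1."""
    starts = range(0, len(events), batch_size)
    return [(txid, events[i:i + batch_size]) for txid, i in enumerate(starts, start=1)]


def run(deliveries, store):
    """Feed every delivered batch to the store and return the final count."""
    for txid, batch in deliveries:
        store.apply_batch(txid, batch)
    return store.count


batches = make_batches(["a", "b", "c", "d", "e"], batch_size=2)
# Simulate a failure after batch 2: the source replays it before moving on.
deliveries = [batches[0], batches[1], batches[1], batches[2]]

naive_total = run(deliveries, NaiveCounter())
exact_total = run(deliveries, TransactionalCounter())
print(naive_total, exact_total)  # prints: 7 5
```

With five events and one replayed batch, the naive counter reports 7 while the transactional counter reports the correct 5. Storing the batch ID next to the state is what turns a replay into a no-op in this model.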
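
The throughput-versus-latency trade-off of micro-batching can also be illustrated with a toy model. The sketch assumes that tuples arrive at fixed times (in milliseconds), that a micro-batch is released only when its time window closes, and that every dispatch pays a fixed overhead. The function names and all numbers are illustrative assumptions, not measurements of Storm, Trident, or Spark Streaming.

```python
import math


def micro_batch_wait_ms(arrival_ms, batch_interval_ms):
    """Time a tuple waits for the batch window it arrived in to close."""
    window_end = (arrival_ms // batch_interval_ms + 1) * batch_interval_ms
    return window_end - arrival_ms


def dispatch_overhead_ms(num_tuples, tuples_per_dispatch, overhead_per_dispatch_ms):
    """Total fixed overhead paid when tuples are sent in groups."""
    dispatches = math.ceil(num_tuples / tuples_per_dispatch)
    return dispatches * overhead_per_dispatch_ms


# Hypothetical workload: 100 tuples, one every 10 ms.
arrivals = list(range(0, 1000, 10))

# One-at-a-time: no waiting for a window, but one dispatch per tuple.
one_at_a_time_overhead = dispatch_overhead_ms(len(arrivals), 1, 5)

# Micro-batching with 100 ms windows: 10 tuples share each dispatch.
waits = [micro_batch_wait_ms(t, 100) for t in arrivals]
mean_wait = sum(waits) / len(waits)
micro_batch_overhead = dispatch_overhead_ms(len(arrivals), 10, 5)

print(one_at_a_time_overhead, micro_batch_overhead, mean_wait)  # prints: 500 50 55.0
```

In this model, batching cuts the total dispatch overhead from 500 ms to 50 ms, but each tuple waits 55 ms on average before its batch is released. One-at-a-time processing avoids that wait at the price of paying the overhead for every tuple.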