On December 4th, Hortonworks presented the fifth of eight Discover HDP 2.2 webinars: Apache Kafka and Apache Storm for Stream Data Processing, hosted by Taylor Goetz, Rajiv Onat, and Justin Sears.
After Justin Sears set the stage for the webinar by explaining the drivers behind the Modern Data Architecture (MDA), Rajiv Onat and Taylor Goetz introduced Apache Kafka and Apache Storm and discussed how to use them for stream data processing. They also covered the new streaming innovations for Kafka and Storm included in HDP 2.2.
Here is the complete recording of the Webinar.
Here are the presentation slides.
Register for all remaining webinars in the series.
We’re grateful to all 381 participants who joined the HDP 2.2 webinar and asked excellent questions. This is the list of questions with their corresponding answers:
|Is it possible to use MQ-Series instead of Kafka as the messaging queue?||Apache Storm supports JMS spouts. It is currently certified with ActiveMQ and Oracle JMS. We ran into issues with IBM MQ-Series with respect to how messages are acknowledged: IBM MQ requires that the thread that receives a message be the same thread that acks it. Storm’s framework cannot support this requirement, as the receiving and acking threads are, by design, different threads in order to achieve higher throughput.|
|Are there any tutorials for Storm and Kafka?||Yes, our website has a number of tutorials on both Kafka and Storm.|
|Is Nimbus a single point of failure?||Nimbus is fail-fast: a Nimbus failure does not impact the execution of running topologies. What would be impacted is the assignment of new tasks to supervisors and the deployment of new topologies. However, we are currently looking into providing HA for Nimbus.|
|Can streams be enriched with other systems, such as Redis or in-memory databases?||Absolutely, yes. Storm is a processing framework; as such, it is agnostic to the input source of the data stream for its bolts. Data streams can come from any source: your RDBMS, filesystem, etc.|
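To make the enrichment pattern concrete, here is a minimal sketch (not Storm API) of the join logic a bolt's execute method might apply against an in-memory lookup table; the data and field names are purely illustrative:

```python
# Hypothetical reference data, e.g. loaded from an RDBMS or Redis
# when the bolt is prepared.
USER_PROFILES = {
    "u1": {"country": "US", "tier": "gold"},
    "u2": {"country": "DE", "tier": "silver"},
}

def enrich(event):
    """Attach profile fields to an incoming event; unknown users pass through."""
    profile = USER_PROFILES.get(event["user_id"], {})
    return {**event, **profile}

print(enrich({"user_id": "u1", "action": "click"}))
# → {'user_id': 'u1', 'action': 'click', 'country': 'US', 'tier': 'gold'}
```

In a real topology the lookup table would typically be populated in the bolt's prepare phase and refreshed periodically, so each tuple is enriched without a per-tuple round trip to the external store.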
|Can you please provide details on Flume use cases vs. Storm use cases?||Flume and Storm complement each other and do not overlap. Flume is a data ingest framework, whereas Storm is a stream processing framework. While Flume provides interceptors to process the data in its pipeline, it does not provide the processing guarantees that Storm provides. We are seeing some use cases implemented using Flume -> Kafka -> Storm pipelines.|
|What kind of enterprise production support is available for Storm/Kafka layers?||Hortonworks provides enterprise support subscriptions for Storm and Kafka.|
|Will Storm integrate with Falcon in the future such that a topology can be executed from Falcon?||Currently, we do not have plans to support it.|
|What is the difference between Spark Streaming and Storm streaming?||Storm is a stream processing platform that provides comprehensive support for both tuple-based processing and micro-batch processing (the framework allows you to implement sliding and tumbling count/time-based windows). Spark Streaming, on the other hand, supports only micro-batch, sliding-time-window processing. If you are already a Spark user, Spark Streaming may be a natural choice because of its consistent APIs; however, for real-time, event-driven use cases that require sub-second latency, Storm provides better support.|
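To illustrate the distinction, here is a small sketch (not Storm or Spark API) of a tumbling count window, the kind of micro-batching a Storm bolt can implement on top of tuple-at-a-time processing; all names are illustrative:

```python
class TumblingCountWindow:
    """Buffer tuples and emit them in fixed-size, non-overlapping batches."""

    def __init__(self, size, on_flush):
        self.size = size          # tuples per window
        self.buffer = []
        self.on_flush = on_flush  # callback invoked with each full window

    def add(self, tup):
        self.buffer.append(tup)
        if len(self.buffer) == self.size:
            self.on_flush(list(self.buffer))  # hand off a completed window
            self.buffer.clear()

windows = []
w = TumblingCountWindow(3, windows.append)
for n in range(7):
    w.add(n)
print(windows)   # → [[0, 1, 2], [3, 4, 5]]  (tuple 6 is still buffered)
```

A sliding window differs only in that consecutive windows overlap; a time-based window flushes on a timer tick instead of on a count.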
|One of the slides showed Storm writing into Hive, but mentioned HBase. Was this a typo, or does this only work for Hive running on HBase?||The storm-hive connector uses Hive streaming ingest; it requires neither HBase nor intermediate writes to HDFS.|
|How is Apache Kafka different from Sqoop or Flume data ingestion?||Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. By contrast, Apache Kafka is a real-time, fault-tolerant, distributed messaging infrastructure that provides publish-subscribe semantics (conceptually similar to a JMS system, but at linear scale).|
|Are there any specific cluster resource requirements for effective real-time processing? Do you need extra RAM on worker nodes?||System resources for Storm topologies are really use-case dependent. As a rule of thumb, with a high degree of parallelism the number of cores on worker nodes comes into play: the more cores you have, the more threads you can support, which directly correlates with parallelism. If you are caching data for high-speed lookups or joins, or performing aggregations, you should provision more RAM. Machine-learning-type use cases on Storm are, in general, CPU-heavy.|
|Is the Storm topology configuration externalized so that we can change it at runtime?||There are different answers to this question. If you are looking at rebalancing, the current rebalance capability supports it; for example, you can increase a topology’s parallelism from the command line. On the other hand, Storm does not yet support defining topologies declaratively, but we are working with the Hadoop community to enable defining new topologies in a declarative fashion.|
|Can Storm be used with AWS SQS/SNS?||Technically, yes; you may need to write a custom bolt to write to or look up data from AWS SQS. Our roadmap includes a generic JDBC bolt for connecting to an RDBMS. AWS has published a spout implementation for Kinesis, which is very similar to Kafka: https://github.com/awslabs/kinesis-storm-spout.|
|Do you have plans to support Kafka brokers being managed through YARN?||Yes, there is community work currently underway on deploying Kafka on YARN via Apache Slider.|
|When the local state of a Storm topology goes bad on a restart, is there a way to get rid of the state files for only a specific topology?||Typically, killing the topology with `storm kill topologyName` and then resubmitting it takes care of deleting the necessary nimbus/supervisor/worker directories from local disk and cleaning up the corresponding ZooKeeper entries.|
|Can Storm provide ETL functionality?||One of the popular use cases for Apache Storm is continuous ETL, where real-time events are filtered, enriched, joined, and aggregated for downstream persistence into a data warehouse and for analytics.|
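As a concrete (and deliberately tiny) sketch of the continuous-ETL pattern, this filters raw events and then aggregates counts per key before they would be persisted downstream; the event fields are illustrative, not from any particular API:

```python
from collections import Counter

# A micro-batch of raw events as a Storm pipeline might receive them.
raw_events = [
    {"user": "u1", "status": 200},
    {"user": "u2", "status": 500},
    {"user": "u1", "status": 200},
]

# Filter: keep only successful events.  Aggregate: count per user.
counts = Counter(e["user"] for e in raw_events if e["status"] == 200)
print(dict(counts))   # → {'u1': 2}
```

In a topology, the filter and the aggregation would typically live in separate bolts, with a fields grouping on `user` so that all events for one user land on the same aggregator task.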
|Is Kafka still in beta stage?||Kafka 0.8.1.1 is the current stable version.|
|How do you solve the problem of aggregating only once (processing an event and incrementing counters in HBase) when you have out-of-order streams and possible duplicate events caused by a failure to ack after the aggregation is done?||Aggregation and stateful stream processing use cases are best suited to Storm Trident topologies; topologies built using Trident can provide exactly-once processing semantics.|
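The idea behind Trident's exactly-once state updates can be sketched as follows: store the last applied batch (transaction) id alongside each counter, and skip a batch that has already been applied, so a replay after a failed ack does not double-count. This is an illustrative sketch of the technique, not Trident's actual API:

```python
# key -> {"count": running total, "txid": last applied batch id}
state = {}

def apply_batch(txid, increments):
    """Apply a batch of counter increments idempotently by batch id."""
    for key, delta in increments.items():
        entry = state.setdefault(key, {"count": 0, "txid": None})
        if entry["txid"] == txid:
            continue  # replayed batch: this update was already applied
        entry["count"] += delta
        entry["txid"] = txid

apply_batch(1, {"page_a": 3})
apply_batch(1, {"page_a": 3})   # replay after a failed ack: ignored
apply_batch(2, {"page_a": 1})
print(state["page_a"]["count"])  # → 4
```

This only works if batches are applied in order per key, which is exactly the ordering guarantee a transactional spout provides to Trident state updaters.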
|Which Storm version is shipped with HDP 2.2?||The official version shipped with HDP 2.2 is Apache Storm 0.9.3.|