Processing streaming data in Hadoop with Apache Storm
In this tutorial we will:
- Review the pre-installed Apache Storm infrastructure
- Run a sample use case end to end
What is Apache Storm?
Apache Storm is an open source, distributed engine for processing data in real time. Storm is simple and flexible, and it can be used with any programming language of your choice.
Let’s look at the various components of a Storm Cluster:
- Nimbus node. The master node (similar to the Hadoop JobTracker)
- Supervisor nodes. Start and stop workers and communicate with Nimbus through ZooKeeper
- ZooKeeper nodes. Coordinate the Storm cluster
Here are a few terms and concepts you should be familiar with before we go hands-on:
- Tuples. An ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
- Streams. An unbounded sequence of tuples.
- Spouts. Sources of streams in a computation (e.g. a Twitter API)
- Bolts. Process input streams and produce output streams. They can:
  - Run functions;
  - Filter, aggregate, or join data;
  - Talk to databases.
- Topologies. The overall calculation, represented visually as a network of spouts and bolts
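To make these terms concrete, here is a minimal sketch in plain Java. It deliberately does not use the Storm API: the class `StormConceptsSketch` and the names `numberSpout`, `sumBolt`, and `runTopology` are hypothetical, chosen only for illustration, and the "stream" is finite so that the example terminates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Plain-Java model of tuples, spouts, bolts, and topologies.
// This is an illustration only; it is NOT the Storm API.
class StormConceptsSketch {

    // A tuple is an ordered list of elements, e.g. the 4-tuple (7, 1, 3, 7).
    static List<Object> tuple(Object... elements) {
        return Arrays.asList(elements);
    }

    // A spout is modeled as an iterator that emits tuples. A real stream is
    // unbounded; this hypothetical source is finite so the example ends.
    static Iterator<List<Object>> numberSpout() {
        return Arrays.asList(
                tuple(7, 1, 3, 7),
                tuple(2, 2, 2, 2),
                tuple(0, 5, 0, 5)).iterator();
    }

    // A bolt is modeled as a function from one input tuple to zero or more
    // output tuples. This one emits a 1-tuple holding the sum of the input.
    static List<List<Object>> sumBolt(List<Object> input) {
        int sum = 0;
        for (Object element : input) {
            sum += (Integer) element;
        }
        List<List<Object>> out = new ArrayList<>();
        out.add(tuple(sum));
        return out;
    }

    // A topology wires spouts to bolts. Here it is a single spout feeding a
    // single bolt, and the bolt's output tuples are collected in order.
    static List<List<Object>> runTopology(
            Iterator<List<Object>> spout,
            Function<List<Object>, List<List<Object>>> bolt) {
        List<List<Object>> results = new ArrayList<>();
        while (spout.hasNext()) {
            results.addAll(bolt.apply(spout.next()));
        }
        return results;
    }

    public static void main(String[] args) {
        // Prints [[18], [8], [10]]
        System.out.println(runTopology(numberSpout(), StormConceptsSketch::sumBolt));
    }
}
```

In a real cluster the spout and bolt would run as parallel tasks on the worker processes managed by the Supervisor nodes, rather than in one loop as above.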
Prerequisite: a working HDP cluster. The easiest way to get an HDP cluster is to download the HDP Sandbox.
Installation and Setup Verification:
Let’s check that the sandbox has the Storm processes up and running by logging in to Ambari and looking for Storm in the list of services:
Step 2 :
Now let’s look at a streaming use case built from Storm spouts and bolts. The use case is simple, but it should give you a realistic feel for running and operating on streaming data in Hadoop with a Storm topology.
Let’s get the jar file, which is available in the Storm Starter kit. It contains other examples as well, but we will use the WordCount topology and see how to start it. We will also track it in the Storm UI.
Step 3 :
The Storm example topology has three main parts:
- Sentence Generator Spout
- Sentence Split Bolt
- WordCount Bolt
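The logic behind these three parts can be sketched in plain Java without any Storm dependency. This is an illustration under stated assumptions, not the code shipped in the starter jar: the class `WordCountSketch`, its method names, and the two sample sentences are hypothetical stand-ins, and the sentence source is a fixed list rather than an endless random generator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the word count pipeline; it does NOT use the Storm API.
class WordCountSketch {

    // Stand-in for the sentence generator spout. The real spout keeps emitting
    // random sentences; this hypothetical version returns a fixed list.
    static List<String> sentences() {
        return Arrays.asList(
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away");
    }

    // Stand-in for the sentence split bolt: one sentence in, many words out.
    static List<String> split(String sentence) {
        List<String> words = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Stand-in for the word count bolt: keeps a running count per word.
    static Map<String, Integer> count(List<String> sentences) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String word : split(sentence)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print each word with the number of times it was seen.
        for (Map.Entry<String, Integer> entry : count(sentences()).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

With the two sample sentences, "the" is counted three times and every other word once. In the running topology the counts keep growing, because the spout never stops emitting sentences.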
You can check the classes available in the jar as follows:
jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Sentence
jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Split
jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep WordCount
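If you prefer to do the same check from code, the sketch below lists the entries of a jar with the standard `java.util.jar.JarFile` class and keeps those whose name contains a keyword. The class `JarClassFinder` and its method names are hypothetical, and the jar path is taken from the command line rather than assumed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Lists jar entries and filters them by keyword, similar to piping a jar
// listing through grep. Only JDK classes are used.
class JarClassFinder {

    // Read the names of all entries in the jar without extracting anything.
    static List<String> entryNames(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    // Keep only the entry names that contain the keyword (case-sensitive).
    static List<String> matching(List<String> names, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (name.contains(keyword)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java JarClassFinder <path-to-jar>");
            return;
        }
        List<String> names = entryNames(args[0]);
        // The same three keywords used in the tutorial.
        for (String keyword : new String[] {"Sentence", "Split", "WordCount"}) {
            System.out.println(keyword + ": " + matching(names, keyword));
        }
    }
}
```

For example, `java JarClassFinder storm-starter-0.0.1-storm-0.9.0.1.jar` prints the matching entries for each of the three keywords.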
Step 4 :
Let’s run the Storm job. It has a spout that generates random sentences, a split bolt that breaks each sentence into words, and a WordCount bolt that counts the words. Submit the jar file as follows:
/usr/lib/storm/bin/storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
Step 5 :
Let’s open the Storm UI and look at the topology graphically:
You should see the WordCount topology listed in the Topology summary.
Step 6 :
Click on the WordCount topology. You will see the following:
Step 7 :
On this page, click on count in the Bolt section.
Step 8 :
Now click on any port in the executor section to view the results.
Step 9 :
Last but most important, you can always look at the log files in the following folder. These logs are extremely useful for debugging and for checking status.
You just processed streaming data using Apache Storm
These tutorials are designed to work with the Sandbox, a simple and easy way to get started with Hadoop. The Sandbox offers a full HDP environment that runs in a virtual machine.