In this tutorial, we will review the Apache Storm infrastructure, download a Storm jar file, and deploy a WordCount topology. After we run the topology, we will view the Storm log files, which are helpful for debugging.
Apache Storm is an open-source engine that can process data in real time using its distributed architecture. Storm is simple and flexible, and it can be used with any programming language of your choice.
Let’s look at the various components of a Storm Cluster:
Architecture: Nimbus, ZooKeeper, Supervisor
Here are a few terminologies and concepts you should get familiar with before we go hands-on:
Basic Concepts Map: A topology processes data as it streams in; the spout emits tuples, bolts process them, and the results can be passed on to Hadoop.
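To make the dataflow model concrete, here is a minimal Python sketch. This is not the Storm API, just an illustration of how a spout emits a stream of tuples and a bolt consumes and transforms them:

```python
# Illustration only: simulates Storm's spout -> bolt dataflow in plain Python.

def sentence_spout():
    """Acts like a spout: emits a stream of tuples (here, sentences)."""
    for sentence in ["the cow jumped over the moon", "an apple a day"]:
        yield sentence

def split_bolt(stream):
    """Acts like a bolt: consumes incoming tuples and emits new ones (words)."""
    for sentence in stream:
        for word in sentence.split():
            yield word

# Wire spout to bolt, just as a topology wires components into a dataflow.
words = list(split_bolt(sentence_spout()))
print(words[:3])  # ['the', 'cow', 'jumped']
```

In real Storm, the framework distributes these components across the cluster and streams tuples between them continuously; the generator chaining above only mimics the shape of that flow.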
Let’s check that the sandbox has the Storm processes up and running by logging into Ambari and looking for Storm in the list of services:
Now let’s look at a streaming use case built on Storm’s spouts and bolts. We will use a simple example, but it should give you hands-on experience of running and operating on streaming data in Hadoop using this topology.
Let’s get the jar file, which is available in the Storm Starter kit. The kit contains other examples as well, but here we will use the WordCount topology and see how to run it. We will also track it in the Storm UI.
In the Storm example topology, we will be using three main components: a sentence-generating spout, a sentence-splitting bolt, and a word-counting bolt.
You can check the classes available in the jar as follows:
jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Sentence
jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Split
jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep WordCount
Now let’s run the Storm job. The topology has a spout that generates random sentences, a split bolt that breaks each sentence into words, and a WordCount bolt that counts the occurrences of each word.
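The logic of these three components can be sketched in plain Python. This is a simulation of what the storm-starter classes do, not the actual Storm code (the sample sentences below are made up for illustration):

```python
# Simulation of the WordCount topology's spout and two bolts.
import random
from collections import Counter

def random_sentence_spout(n):
    """Simulates the spout: emits n randomly chosen sentences."""
    sentences = [
        "the cow jumped over the moon",
        "four score and seven years ago",
        "snow white and the seven dwarfs",
    ]
    for _ in range(n):
        yield random.choice(sentences)

def split_sentence_bolt(sentences):
    """Simulates the split bolt: emits one word per incoming sentence word."""
    for sentence in sentences:
        yield from sentence.split()

def word_count_bolt(words):
    """Simulates the count bolt: keeps a running tally per word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

counts = word_count_bolt(split_sentence_bolt(random_sentence_spout(100)))
print(counts.most_common(3))
```

In Storm, each of these stages runs as parallel tasks across the cluster and the counts are updated continuously as tuples arrive, rather than in a single batch pass as shown here.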
Let’s run the Storm Jar file.
[root@sandbox ~]# storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
Note: For Sandbox versions without Storm preinstalled, navigate to the /usr/lib/storm/bin/ directory to run the command above.
Let’s open the Storm UI and look at the topology graphically:
You should see the WordCount topology listed under Topology Summary. Click on it and you will see the following:
Click on the count bolt.
Click on any port and you will be able to view the results.
You just processed streaming data using Apache Storm. Congratulations on completing the Tutorial!
Lastly, and most importantly, you can always look at the log files. These logs are extremely useful for debugging and for checking job status. They are located in the following directory:
[root@sandbox ~]# cd /var/log/storm
[root@sandbox storm]# ls -ltr
Download and install Apache Maven as shown in the commands below:
curl -o /etc/yum.repos.d/epel-apache-maven.repo https://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo
yum -y install apache-maven
mvn -version
Download the Storm Starter Kit and try other topology examples, such as ExclamationTopology and ReachTopology.
git clone https://github.com/apache/storm.git && cd storm/examples/storm-starter
If you need help or have questions with this tutorial, please first check HCC for existing Answers to questions on this tutorial using the Find Answers button. If you don’t find your answer you can post a new HCC question for this tutorial using the Ask Questions button.
Tutorial Name: Processing streaming data in Hadoop with Apache Storm
HCC Tutorial Tag: tutorial-240 and HDP-2.4