Processing streaming data in Hadoop with Apache Storm


In this tutorial, we will review the Apache Storm infrastructure, download a Storm JAR file, and deploy a WordCount topology. After running the topology, we will view the Storm log files, which are helpful for debugging.



What is Apache Storm?

Apache Storm is an open source engine that processes data in real time using a distributed architecture. Storm is simple and flexible, and it can be used with any programming language of your choice.

Let’s look at the various components of a Storm Cluster:

  1. Nimbus node. The master node (similar to the Hadoop JobTracker)
  2. Supervisor nodes. Start and stop workers and communicate with Nimbus through ZooKeeper
  3. ZooKeeper nodes. Coordinate the Storm cluster

Storm Architecture

Architecture: Nimbus, ZooKeeper, Supervisor

Here are a few terms and concepts you should get familiar with before we go hands-on:

  • Tuples. An ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
  • Streams. An unbounded sequence of tuples.
  • Spouts. Sources of streams in a computation (e.g. a Twitter API)
  • Bolts. Process input streams and produce output streams. They can:
    • Run functions;
    • Filter, aggregate, or join data;
    • Talk to databases.
  • Topologies. The overall calculation, represented visually as a network of spouts and bolts
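To make these terms concrete, here is a minimal plain-Python sketch of the concepts above. It does not use Storm's API: the names `number_spout`, `even_filter_bolt`, `square_bolt`, and `run_topology` are hypothetical, and the spout emits a finite stream so that the example terminates (a real Storm stream is unbounded).

```python
from typing import Iterable, Iterator, List, Tuple

# A tuple is an ordered list of elements; here each input tuple has one field.
NumberTuple = Tuple[int]


def number_spout(limit: int) -> Iterator[NumberTuple]:
    """Spout: the source of a stream. Finite here so the sketch terminates."""
    for n in range(limit):
        yield (n,)


def even_filter_bolt(stream: Iterable[NumberTuple]) -> Iterator[NumberTuple]:
    """Bolt that filters: only tuples holding an even number pass through."""
    for (n,) in stream:
        if n % 2 == 0:
            yield (n,)


def square_bolt(stream: Iterable[NumberTuple]) -> Iterator[Tuple[int, int]]:
    """Bolt that runs a function: emits (n, n squared) for each input tuple."""
    for (n,) in stream:
        yield (n, n * n)


def run_topology(limit: int) -> List[Tuple[int, int]]:
    """Topology: the spout and bolts wired together into one computation."""
    return list(square_bolt(even_filter_bolt(number_spout(limit))))
```

Calling `run_topology(6)` pushes the numbers 0 through 5 through both bolts. In a real cluster, Storm would distribute the spout and bolts across worker processes instead of chaining them in a single process as this sketch does.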

Storm Basic Concepts

Basic Concepts Map: A topology processes data as it streams in from the spout; the bolts process it, and the results are passed into Hadoop.

Installation and Setup Verification:

Step 1: Check Storm Service is Running

Let’s check that the sandbox has the Storm processes up and running by logging in to Ambari and looking for Storm in the list of services.

Step 2: Download the Storm Topology JAR file

Now let’s look at a streaming use case built from Storm’s spouts and bolts. The use case is simple, but it should give you real-life experience of running and operating on streaming data in Hadoop with this topology.

Let’s get the JAR file, which is available in the Storm Starter kit. The kit has other examples as well, but let’s use the WordCount topology and see how to turn it on. We will also track it in the Storm UI.

wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/Storm/


Step 3: Check Classes Available in jar

In the Storm example Topology, we will be using three main parts or processes:

  1. Sentence Generator Spout
  2. Sentence Split Bolt
  3. WordCount Bolt

You can check the classes available in the jar as follows:

jar -tvf storm-starter-0.0.1-storm- | grep Sentence
jar -tvf storm-starter-0.0.1-storm- | grep Split
jar -tvf storm-starter-0.0.1-storm- | grep WordCount
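A JAR file is a ZIP archive, so the same check can be scripted. The sketch below uses Python's standard `zipfile` module; the function name `find_classes` is hypothetical, and you would pass it the path of the JAR you downloaded.

```python
import zipfile
from typing import List


def find_classes(jar_path, keyword: str) -> List[str]:
    """Return the .class entries in the JAR whose path contains keyword.

    jar_path may be a file path or an open binary file object.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return [
            name
            for name in jar.namelist()
            if name.endswith(".class") and keyword in name
        ]


# Example call (the path is a placeholder for your downloaded JAR):
#   find_classes("path/to/storm-starter.jar", "WordCount")
```

Unlike extracting the archive, this only reads the archive's table of contents, so nothing is written to the current directory.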


Step 4: Run Word Count Topology

Let’s run the Storm job. It has a spout that generates random sentences, a split bolt that breaks each sentence into words, and a WordCount bolt that counts the occurrences of each word.
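Before submitting the job, it may help to see the data flow in miniature. The following plain-Python sketch imitates the three stages in a single process. It is not the code inside the Storm Starter JAR: all function names are hypothetical, and fixed sample sentences replace the random generator so the result is repeatable.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Fixed sample sentences (an assumption for this sketch) stand in for the
# random sentence generator so that the output is deterministic.
SENTENCES = [
    "the cow jumped over the moon",
    "an apple a day keeps the doctor away",
]


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: emits one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Split bolt: emits each whitespace-separated word of every sentence."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Count bolt: emits (word, running count) every time a word arrives."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield (word, counts[word])


def word_count(sentences: Iterable[str]) -> Dict[str, int]:
    """Wire spout -> split bolt -> count bolt and keep the latest counts."""
    latest: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest
```

For example, `word_count(SENTENCES)["the"]` evaluates to 3, because "the" appears twice in the first sentence and once in the second. The count bolt emits a running count rather than a final total because a real stream never ends.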

Submit the topology with the Storm JAR file:

[root@sandbox ~]# storm jar storm-starter-0.0.1-storm- storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com

Note: For Sandbox versions without Storm preinstalled, navigate to the /usr/lib/storm/bin/ directory to run the command above.


Step 5: Open Storm UI

Let’s open the Storm UI to view the topology graphically.

You should see the WordCount topology listed in the Topology summary.

Step 6: Click on WordCount Topology

The topology is listed under Topology Summary. Click it to view its details.

Click on count.


Click on any port and you will be able to view the results.


You just processed streaming data using Apache Storm. Congratulations on completing the Tutorial!

Appendix A: View Storm Log Files

Finally, and most importantly, you can always look at the log files. These logs are extremely useful for debugging and for checking status. They are located in the following directory:

[root@sandbox ~]# cd /var/log/storm

[root@sandbox storm]# ls -ltr
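If you want to script this check, the sketch below lists the files in a log directory from oldest to newest, similar in spirit to `ls -ltr`. The function name `list_logs_oldest_first` is hypothetical, and the default directory is the one used above.

```python
import os
from typing import List


def list_logs_oldest_first(log_dir: str = "/var/log/storm") -> List[str]:
    """Return the names of regular files in log_dir, oldest first.

    Unlike `ls -ltr`, subdirectories are skipped and only names are returned.
    """
    entries = [entry for entry in os.scandir(log_dir) if entry.is_file()]
    # Sort by last modification time so the most recent log ends up last.
    entries.sort(key=lambda entry: entry.stat().st_mtime)
    return [entry.name for entry in entries]
```

The last element of the returned list is the most recently written log, which is usually the first place to look when a topology misbehaves.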


Appendix B: Install Maven and Get Started with Storm Starter Kit

Install Maven

Download and install Apache Maven as shown in the commands below:

curl -o /etc/yum.repos.d/epel-apache-maven.repo https://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo
yum -y install apache-maven
mvn -version


Get Started with Storm Starter Kit

Download the Storm Starter Kit and try other topology examples, such as ExclamationTopology and ReachTopology.

git clone https://github.com/apache/storm.git && cd storm/examples/storm-starter

Tutorial Q&A and Reporting Issues

If you need help or have questions about this tutorial, first check HCC for existing answers to questions on this tutorial using the Find Answers button. If you don’t find your answer, you can post a new HCC question for this tutorial using the Ask Questions button.


Tutorial Name: Processing streaming data in Hadoop with Apache Storm
HCC Tutorial Tag: tutorial-240 and HDP-2.4