Processing streaming data in Hadoop with Apache Storm

Real-time data processing

In this tutorial we will walk through the process of

  • Review the pre-installed Apache Storm infrastructure
  • Run a sample use case end to end

What is Apache Storm?

Apache Storm is an open source engine that can process data in real time using its distributed architecture. Storm is simple and flexible, and it can be used with any programming language of your choice.

Let’s look at the various components of a Storm Cluster:

  1. Nimbus node. The master node (Similar to JobTracker)
  2. Supervisor nodes. Starts/stops workers & communicates with Nimbus through Zookeeper
  3. ZooKeeper nodes. Coordinates the Storm cluster

Here are a few terminologies and concepts you should get familiar with before we go hands-on:

  • Tuples. An ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
  • Streams. An unbounded sequence of tuples.
  • Spouts. Sources of streams in a computation (e.g. a Twitter API)
  • Bolts. Process input streams and produce output streams. They can:
    • Run functions;
    • Filter, aggregate, or join data;
    • Talk to databases.
  • Topologies. The overall calculation, represented visually as a network of spouts and bolts
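The relationship between these pieces can be sketched in plain Python. This is only an analogy to make the terminology concrete, not Storm's actual API: the spout and bolt below are ordinary generator functions, and "topology" is just the wiring between them.

```python
# A minimal, Storm-free sketch of the concepts above:
# a spout is a source of tuples, a bolt transforms a stream,
# and a topology wires them together.

def number_spout(limit):
    """Spout: emits a stream of 2-tuples (id, value)."""
    for i in range(limit):
        yield (i, i * i)

def even_filter_bolt(stream):
    """Bolt: filters the input stream, passing only even values."""
    for tup in stream:
        if tup[1] % 2 == 0:
            yield tup

def topology(limit):
    """Topology: the spout's output stream feeds the bolt."""
    return list(even_filter_bolt(number_spout(limit)))

print(topology(5))  # [(0, 0), (2, 4), (4, 16)]
```

In real Storm the streams are unbounded and the spouts and bolts run as distributed tasks; the data flow, however, follows exactly this shape.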


Prerequisite: a working HDP cluster. The easiest way to get one is to download the HDP Sandbox.

Installation and Setup Verification:

Step 1:

Let’s check whether the Sandbox has the Storm processes up and running. Please SSH into the Sandbox from a terminal window on a Mac, or using PuTTY on Windows:

ssh root@ -p 2222;

Then list the processes:

ps -ef | grep storm

If the Storm processes are running, you will see them listed in the output.

If the Storm processes are not running, you can start them manually using the steps below:

Step 2:

Let’s look at the Storm home folder, /usr/lib/storm, to see the location of its binaries, libraries, etc.

Step 3:

Let’s review a few configuration items:

  • storm.yaml is one of the config files you can open and review. It configures the Storm daemons such as Nimbus, the Supervisors, and the ZooKeeper connection.

Go to the following directory and open storm.yaml file.

cd /usr/lib/storm/conf


Review the entire body of this file to see all of the configuration options.

If you don’t modify this file, Storm will pick up the default values listed at the following link: Storm Default Values

In this case, we will use all the default values as a trial setup in the standalone Sandbox.

Step 4:

Let’s review a few of the parameters.

  • storm.zookeeper.servers. This must be configured if your ZooKeeper is running on one or more other servers, e.g.:
    • storm.zookeeper.servers: “111.222.333.444” “555.666.777.888”
    • storm.zookeeper.port must also be set if you are not using the default port.
    • In our case, running Storm on the Sandbox, we will use the default, which is ‘localhost’.
  • nimbus.host. The worker nodes need to know which machine is the master in order to download topology jars and confs. In our case, we are using the default, which is ‘localhost’.
  • ui.port. This is the port the Storm UI listens on. The Storm default is 8080; on the Sandbox it is set to 8744.
  • supervisor.slots.ports. This is another important configuration: the number of ports defined here determines how many workers can run per node. Here are the default ports:
    • supervisor.slots.ports:
      • 6700
      • 6701
      • 6702
      • 6703
    • Your storm.yaml file lists a few of these settings.
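Putting the parameters above together, a minimal storm.yaml for a single-node setup might look like the following. The host and port values mirror the Sandbox defaults described in this tutorial; adjust them for your own cluster.

```yaml
# Sample storm.yaml for a single-node (Sandbox-style) setup.
storm.zookeeper.servers:
    - "localhost"            # ZooKeeper host(s)
nimbus.host: "localhost"     # master node the workers contact
ui.port: 8744                # Storm UI port (Storm's default is 8080)
supervisor.slots.ports:      # one worker slot per listed port
    - 6700
    - 6701
    - 6702
    - 6703
```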


Please notice that the ZooKeeper server and the Nimbus host are both set to localhost, and that ui.port is set to 8744.

Step 5:

Let’s start the Storm Nimbus and Supervisor processes. From the /usr/lib/storm/bin directory, run:

./storm nimbus


Let’s start the Supervisor Server:

./storm supervisor

Nimbus is the master process, and the supervisors manage the worker processes.

Step 6:

Let’s start the Storm UI so that we can access Storm through a web browser. Since we didn’t run the processes above in the background, open another terminal window and start the Storm UI as follows:

./storm ui


If you run into an already-running Storm UI, you can always look for what is listening on port 8744 using netstat -ntlp | grep 8744 and kill the process with kill -9 PID.

Step 7 :

Before you can access your Storm cluster via the UI, please set up the /etc/hosts file on your local host machine (not the Sandbox) to map your Sandbox’s IP address to localhost, if it is not already set up.

The hosts file is usually located in the C:\Windows\System32\drivers\etc folder on Windows and in the /etc folder on a Mac.

Set it up as follows:   localhost

Save the file. Note that the hosts file does not have an extension.

Step 8 :

Let’s open the Storm UI using your browser.


Step 9 :

Now let’s look at a Hadoop streaming use case built on Storm’s spouts and bolts. We will use a simple example, but it should give you real-life experience of running and operating on streaming data in Hadoop with a Storm topology.

Let’s get the jar file, which is available in the Storm starter kit. It contains other examples as well, but we will use the WordCount topology, see how to run it, and track it in the Storm UI.



Step 10 :

In the Storm example Topology, we will be using three main parts or processes:

  1. Sentence Generator Spout
  2. Sentence Split Bolt
  3. WordCount Bolt
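The data flow through these three components can be sketched in plain Python. This is a simulation of the topology's logic only — the real storm-starter classes are written in Java against the Storm API — but it shows what each stage contributes.

```python
import random
from collections import Counter

# A plain-Python simulation of the WordCount topology's logic
# (not the Storm API): spout -> split bolt -> count bolt.

# Sample sentences; the real spout emits random sentences like these.
SENTENCES = [
    "the cow jumped over the moon",
    "an apple a day keeps the doctor away",
]

def sentence_spout(n):
    """Sentence Generator Spout: emits n random sentences."""
    for _ in range(n):
        yield random.choice(SENTENCES)

def split_bolt(sentences):
    """Sentence Split Bolt: turns each sentence into one tuple per word."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def word_count_bolt(words):
    """WordCount Bolt: keeps a running count per word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

counts = word_count_bolt(split_bolt(sentence_spout(10)))
print(counts.most_common(3))
```

In the real topology each of these stages runs as parallel tasks on the worker ports you saw in supervisor.slots.ports, and the stream never ends; here the spout is bounded so the script terminates.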

You can check the classes available in the jar as follows:

jar -tf storm-starter-0.0.1-storm- | grep Sentence
jar -tf storm-starter-0.0.1-storm- | grep Split
jar -tf storm-starter-0.0.1-storm- | grep WordCount


Step 11 :

Let’s run the Storm job. It has a spout that generates random sentences, a bolt that splits the sentences into words, and a bolt that counts the words.

Let’s run the Storm Jar file.

/usr/lib/storm/bin/storm jar storm-starter-0.0.1-storm- storm.starter.WordCountTopology WordCount -c


Step 12 :

Let’s use Storm UI and look at it graphically:

You should see the WordCount topology listed in the Topology summary section.

Step 13 :

Please click on the WordCount topology to open its detail page.

Step 14 :

On this page, please click on count in the Bolts section.


Step 15 :

Now click on any port in the Executors section and you will be able to view the results.

Step 16 :

Lastly, but most importantly, you can always look at the log files in /var/log/storm. These logs are extremely useful for debugging or checking status.


You just processed streaming data using Apache Storm.


October 14, 2014 at 9:29 am

Where can I see the topology’s results? For example, if I use the WordCount topology, which returns word counts, where does Storm show me the results of the count?

September 8, 2014 at 9:35 pm

where should i find input and output directories



Kuldeep Dhole
August 3, 2014 at 6:28 pm

Thanks a lot Hortonworkers! for making it so simple.

Can you please explain me why did you use the class “storm.starter.WordCountTopology WordCount” twice in following command:

/usr/lib/storm/bin/storm jar storm-starter-0.0.1-storm- storm.starter.WordCountTopology WordCount -c storm.starter.WordCountTopology WordCount -c


Ajit Singh
July 24, 2014 at 7:56 am

worker-6700.log & worker-6701.log in /var/log/storm are continuously increasing, how to stop it?

    August 3, 2014 at 6:22 pm

    Hi Ajit,

    You are right, the logs will go on increasing as this is “stream processing”.

    From what I understood, there are two things in WordCount topology:
    1) Spout & Bolts both are running on 6700 and 6701.
    2) Being a stream processing, it is supposed to keep running till we close it.

    So, we are supposed to deactivate or kill the job from the UI:
    1) go to http://localhost:8744/ Storm UI
    2) Under Topology summary section, click on WordCount
    3) On newly directed page, under Topology Actions section, click on Deactivate or Kill.

    You are all set.

    But I agree that this should have been mentioned considering beginner’s point of view.

      Jules S. Damji
      October 3, 2014 at 12:36 pm

      Thanks for the note! We’ll rectify and mention it early to stop the jobs.

Chakra Sankaraiah
May 26, 2014 at 6:44 am

I was able to start the storm process via Ambari and it worked great.
After that you should be able to proceed from step 8 onward by using the storm UI link as
where is your HDP sandbox IP address.

Chakra Sankaraiah
May 26, 2014 at 6:26 am

I was not able to start the supervisor and UI, not sure why

Here are a few pieces that I think need to be changed:
— Before you run ./storm supervisor, you need to be in the /usr/lib/storm/bin directory
— In the hosts file, I think you need to have the IP address of your own Hortonworks Sandbox

Chakra Sankaraiah
May 26, 2014 at 6:21 am

Just a minor comment: we need to be in the /usr/lib/storm/bin directory before starting the supervisor using this command
./storm supervisor

Sandeep Kumar
May 9, 2014 at 8:14 pm

Nice info.
Is there any streaming jar like in hadoop so that we can make use of shell/perl/php scripts as spouts and bolts? By this way spouts will emit to stdout and bolts will receive tuples from stdin.

Gandhi Manalu
April 23, 2014 at 9:35 pm

Thank you for the article, however I think you forgot to mention one more step to start the logviewer: ./storm logviewer. If this command was not executed, the result can’t be viewed. Also, don’t forget to change the default port for the logviewer (8000) since it’s already occupied by Hue.

April 3, 2014 at 12:15 pm

In Step #5, when I attempt to start the nimbus server, I get similar output, but also some additional lines relating to the JMXetricAgent. Also, the item never returns to a command prompt. I’ve downloaded/am using the HDP 2.1 tech preview.

Vladislav Pernin
April 2, 2014 at 12:23 pm

The different Storm processes do not seem to run under Supervisord. See this link for an example.
Storm seems to follow a fail-fast policy, so it is a good thing to have the processes restarted automatically.
Does HDP 2.1 include the matching monitoring alerts, Nagios scripts, and Ambari pieces specifically for Storm?

