Processing streaming data in Hadoop with Apache Storm

Real-time data processing

In this tutorial we will walk through the process of:

  • Reviewing the pre-installed Apache Storm infrastructure
  • Running a sample use case end to end

What is Apache Storm?

Apache Storm is an open-source engine that can process data in real time using its distributed architecture. Storm is simple and flexible, and it can be used with any programming language of your choice.

Let’s look at the various components of a Storm Cluster:

  1. Nimbus node. The master node (similar to the JobTracker in Hadoop)
  2. Supervisor nodes. Start and stop worker processes and communicate with Nimbus through ZooKeeper
  3. ZooKeeper nodes. Coordinate the Storm cluster

Here are a few terms and concepts you should be familiar with before we go hands-on:

  • Tuple. An ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
  • Stream. An unbounded sequence of tuples.
  • Spout. A source of streams in a computation (e.g. the Twitter API); a minimal sketch follows this list
  • Bolt. Processes input streams and produces output streams. Bolts can:
    • Run functions;
    • Filter, aggregate, or join data;
    • Talk to databases.
  • Topology. The overall computation, represented visually as a network of spouts and bolts
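
To make these concepts concrete, here is a minimal spout sketch in Java. It is illustrative only and not part of the storm-starter kit: the class name, the sentence list, and the 100 ms pause are our own, and it assumes the Storm 0.9.x API (the backtype.storm packages).

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;
import java.util.Map;
import java.util.Random;

// Illustrative spout: emits an endless stream of one-field "sentence" tuples.
public class RandomSentenceSpoutSketch extends BaseRichSpout {
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };
    private SpoutOutputCollector collector;
    private Random rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.rand = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100); // throttle a little; Storm calls nextTuple() in a loop
        collector.emit(new Values(SENTENCES[rand.nextInt(SENTENCES.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}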

Prerequisites:

A working HDP cluster – the easiest way to get an HDP cluster is to download the HDP Sandbox.

Installation and Setup Verification:

Step 1:

Let’s check that the sandbox has the Storm processes up and running by logging into Ambari and looking for Storm in the list of services.

Step 2:

Now let’s look at a streaming use case using Storm’s spout and bolt processes. We will use a simple example, but it should give you real-life experience of running and operating on streaming data in Hadoop using this topology.

Let’s get the jar file, which is available in the Storm starter kit. The kit contains other examples as well, but we will use the WordCount topology and see how to turn it on. We will also track it in the Storm UI.

wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/Storm/0.9.0.1/storm-starter-0.0.1-storm-0.9.0.1.jar

Step 3:

In the Storm example topology, we will use three main parts, or processes, wired together as sketched below:

  1. Sentence Generator Spout
  2. Sentence Split Bolt
  3. WordCount Bolt
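
Before running it, it helps to see how these three components fit together. The sketch below approximates what the WordCountTopology main class does when submitting to a cluster; the component IDs and parallelism hints are illustrative, and it assumes RandomSentenceSpout, SplitSentence, and WordCount are on the classpath (they ship inside the storm-starter jar).

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountWiringSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: emits random sentences.
        builder.setSpout("spout", new RandomSentenceSpout(), 5);

        // shuffleGrouping spreads sentences randomly across the split bolt's tasks.
        builder.setBolt("split", new SplitSentence(), 8)
               .shuffleGrouping("spout");

        // fieldsGrouping on "word" sends every occurrence of the same word
        // to the same task, so each running count stays consistent.
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("split", new Fields("word"));

        // args[0] is the topology name shown in the Storm UI, e.g. "WordCount".
        StormSubmitter.submitTopology(args[0], new Config(), builder.createTopology());
    }
}

The grouping choice is the interesting design decision here: shuffle grouping balances load across the split bolt, while fields grouping gives each counting task a consistent partition of the words.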

You can check the classes available in the jar as follows (jar -tvf lists the archive’s contents without extracting them):

jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Sentence
jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Split
jar -tvf storm-starter-0.0.1-storm-0.9.0.1.jar | grep WordCount

Step 4:

Let’s run the Storm job. It has a spout that generates random sentences, a bolt that splits each sentence into words, and a WordCount bolt that keeps a running total for each word.
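
To give a feel for what the WordCount bolt does internally, here is a sketch in the spirit of the storm-starter counting bolt; treat the exact code as an approximation rather than the shipped implementation.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.HashMap;
import java.util.Map;

// Keeps an in-memory running count per word and emits the updated count.
public class WordCountBoltSketch extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0); // the word emitted by the split bolt
        Integer count = counts.get(word);
        count = (count == null) ? 1 : count + 1;
        counts.put(word, count);
        collector.emit(new Values(word, count)); // emit the updated (word, count) pair
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}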

Let’s run the Storm jar file:

/usr/lib/storm/bin/storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
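
To unpack that command: storm jar uploads the jar to the cluster and runs the given main class (storm.starter.WordCountTopology); WordCount is an argument to that class, used as the topology name; and -c passes a client-side configuration override, here telling the client where Nimbus runs. As a general sketch:

/usr/lib/storm/bin/storm jar <topology-jar> <main-class> [arguments...] -c key=value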

Step 5:

Let’s open the Storm UI and look at the topology graphically. On the sandbox, the Storm UI is available on port 8744 (for example, http://sandbox.hortonworks.com:8744/).

You should see the WordCount topology in the Topology summary section.

Step 6:

Click on the WordCount topology. This opens a page with the topology’s statistics and its spouts and bolts.

Step 7:

On this page, click on count in the Bolts section.

Step 8:

Now click on any port in the Executors section to view the results. (The Storm logviewer process must be running for these links to work; see the comments below.)

Step 9:

Last but not least, you can always look at the log files under /var/log/storm. These logs are extremely useful for debugging and for checking job status.
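
For example, to watch a worker’s output as the topology runs (worker log file names include the worker’s port number, e.g. worker-6700.log):

tail -f /var/log/storm/worker-6700.log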

You just processed streaming data using Apache Storm.

Comments

Vladislav Pernin | April 2, 2014 at 12:23 pm

The different Storm processes do not seem to run under supervisord; see this link for an example: http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/#running-storm-daemons-under-supervision
Storm seems to follow a fail-fast policy, so it is a good thing to have the processes restarted automatically.
Does HDP 2.1 include the matching monitoring alerts, Nagios scripts, and Ambari integration specifically for Storm?

Jeff | April 3, 2014 at 12:15 pm

In Step #5, when I attempt to start the Nimbus server, I get similar output, but also some additional lines relating to the JMXetricAgent. Also, it never returns to a command prompt. I’ve downloaded and am using the HDP 2.1 tech preview.

Gandhi Manalu | April 23, 2014 at 9:35 pm

Thank you for the article; however, I think you forgot to mention one more step to start the logviewer: ./storm logviewer. If this command is not executed, the results can’t be viewed. Also, don’t forget to change the default port for the logviewer (8000), since it’s already occupied by Hue.

Sandeep Kumar | May 9, 2014 at 8:14 pm

Nice info.
Is there any streaming jar, as in Hadoop, so that we can use shell/Perl/PHP scripts as spouts and bolts? That way, spouts would emit to stdout and bolts would receive tuples from stdin.

Chakra Sankaraiah | May 26, 2014 at 6:21 am

Just a minor comment: you need to be in the /usr/lib/storm/bin directory before you start the supervisor with this command:
./storm supervisor

Chakra Sankaraiah | May 26, 2014 at 6:26 am

I was not able to start the supervisor and the UI; I am not sure why.

Here are a few pieces that I think need to be changed:
— Before you run ./storm supervisor, you need to be in the /usr/lib/storm/bin directory
— In the hosts file, I think you need to have the IP address of your own Hortonworks sandbox

Chakra Sankaraiah | May 26, 2014 at 6:44 am

I was able to start the Storm processes via Ambari, and it worked great.
After that, you should be able to proceed from step 8 onward by using the Storm UI link
http://192.168.202.133:8744/
where 192.168.202.133 is your HDP sandbox IP address.

Ajit Singh | July 24, 2014 at 7:56 am

worker-6700.log and worker-6701.log in /var/log/storm are continuously growing; how do I stop them?

    August 3, 2014 at 6:22 pm

    Hi Ajit,

    You are right, the logs will keep growing, as this is “stream processing”.

    From what I understood, there are two things in the WordCount topology:
    1) The spout and bolts are both running on ports 6700 and 6701.
    2) Being stream processing, it is supposed to keep running until we close it.

    So we are supposed to deactivate or kill the job from the UI:
    1) Go to the Storm UI at http://localhost:8744/
    2) Under the Topology summary section, click on WordCount
    3) On the page that opens, under the Topology actions section, click Deactivate or Kill.

    You are all set.

    But I agree that this should have been mentioned from a beginner’s point of view.

      Jules S. Damji | October 3, 2014 at 12:36 pm

      Thanks for the note! We’ll rectify this and mention early on how to stop the jobs.

Kuldeep Dhole | August 3, 2014 at 6:28 pm

Thanks a lot, Hortonworkers, for making it so simple!

Can you please explain why you used the class “storm.starter.WordCountTopology WordCount” twice in the following command:

/usr/lib/storm/bin/storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com

Regards,
Kuldeep

jhon | September 8, 2014 at 9:34 pm

Where should I find the input and output directories?

Emiliano | October 14, 2014 at 9:29 am

How can I see the topology’s results? For example, if I use the WordCount topology, which returns the count of each word, where does Storm show me the results?

Vinod | November 18, 2014 at 10:15 am

I was able to deploy WordCount using the script provided, but it does not seem to bring up the spout or bolts after that. I do not see worker logs either. Could someone tell me what is wrong?

These tutorials are designed to work with the Hortonworks Sandbox, a simple and easy way to get started with Hadoop. The Sandbox offers a full HDP environment that runs in a virtual machine.