Using Spring XD to stream Tweets to Hadoop for Sentiment Analysis

Community Tutorial

This tutorial is from the Community section of the tutorial series for the Hortonworks Sandbox (1.3) – a single-node Hadoop cluster running in a virtual machine. Download the Sandbox to run this and other tutorials in the series.

This community tutorial was submitted by mehzer, with source available on GitHub. Feel free to contribute edits or your own tutorial and help the community learn Hadoop.


This tutorial builds on the previous tutorial – 13 – Refining and Visualizing Sentiment Data – by using Spring XD to stream tweets into HDFS. Once they're in HDFS, we'll use Apache Hive to process and analyze them, before visualizing the results.

1 – Download and Install Spring XD

Spring XD can be downloaded from the Spring XD project site. This tutorial uses the 1.0.0.M3 version, so conventions may change in the next release.

Follow the install instructions and spin up Spring XD with a test stream to make sure everything is working. Note that module names are lowercase (and that in Spring XD releases after this milestone, you'll also need to append --deploy):

stream create --name ticktock --definition "time | log"

That simple instruction should begin showing output in the server terminal window similar to:

2013-10-12 17:18:09
2013-10-12 17:18:10
2013-10-12 17:18:11
2013-10-12 17:18:12
2013-10-12 17:18:13
2013-10-12 17:18:14

Congrats, Spring XD is running.

2 – Download and Install Hortonworks Sandbox

The Hortonworks Sandbox environment can be downloaded from the Hortonworks site. This tutorial uses the 1.3 version, so conventions may change in the next release. This tutorial also uses the VirtualBox version of the image.

Running Sandbox with Bridged Networking

The current Sandbox uses a NAT adapter with port forwarding by default. This makes it convenient to access the Sandbox from the host browser, but unfortunately Spring XD appears to locate and attempt to use the VM's internal IP address (in my case a fairly standard VirtualBox internal address). As this IP won't resolve from the host, the simplest workaround is to use Bridged Networking so the Sandbox has an IP address on the local physical network.

Steps to do this are as follows:

  • Power off Sandbox
  • Access the Network settings for the Sandbox.
  • Disable the NAT Adapter

  • Set up a Bridged Adapter (and don’t forget to change it back later if necessary)

  • Power on Sandbox

NB: The Sandbox boot screen will still display the old access address, but owing to the changes it is now incorrect. You can find the new IP address of the Sandbox as it loads.

In my case, the Sandbox picked up an IP address on my local network, and I could access it in the browser at that address. Check you can do the same, and then we can configure Spring XD to use the Sandbox.

Configuring Spring XD to use Hadoop (Hortonworks Sandbox)

NB: If you have Ambari activated on the Sandbox, then both it and Spring XD attempt to use port 8080. This means you’ll need to run Spring XD on a different port, for example: --httpPort 8090

Step 1 – Edit the config file

Edit the properties file at XD_HOME/xd/config/ to enter the namenode config.
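For illustration, the entry looks something like the fragment below. Both the property key (the classic Hadoop 1.x name, which the hadoop11 distro option we use later implies) and the placeholder address are assumptions – check the comments in your own config file for the exact key:

```properties
# Point Spring XD at the Sandbox namenode (placeholder address)
fs.default.name=hdfs://<sandbox-ip>:8020
```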

Step 2 – Spin up the Spring XD Service with Hadoop

In a terminal window, start the server from the XD_HOME/xd/ folder:

./xd-singlenode --hadoopDistro hadoop11

Step 3 – Spin up the Spring XD Client with Hadoop

In a separate terminal window, start the shell from the XD_HOME/shell/ folder:

./xd-shell --hadoopDistro hadoop11

Then set the namenode for the client using the IP address of the Sandbox (8020 is the default namenode port):

hadoop config fs --namenode hdfs://<sandbox-ip>:8020

Next, test out whether you can see HDFS with a command like:

hadoop fs ls /

You should see something like:

drwxr-xr-x   - hdfs   hdfs          0 2013-05-30 10:34 /apps
drwx------   - mapred hdfs          0 2013-10-12 17:06 /mapred
drwxrwxrwx   - hdfs   hdfs          0 2013-10-12 17:19 /tmp
drwxr-xr-x   - hdfs   hdfs          0 2013-06-10 14:39 /user

Once that’s confirmed we can set up a simple test stream. In this case, we can re-create TickTock but store it in HDFS.

stream create --name ticktockhdfs --definition "time | hdfs"

Leave it a few seconds, then destroy or undeploy the stream.

stream destroy --name ticktockhdfs

You can then view the small file that will have been generated in HDFS.

hadoop fs ls /xd/ticktockhdfs

Found 1 items
-rwxr-xr-x   3 root hdfs        420 2013-10-12 17:18 /xd/ticktockhdfs/ticktockhdfs-0.log

Which you can quickly examine with:

hadoop fs cat /xd/ticktockhdfs/ticktockhdfs-0.log

2013-10-12 17:18:09
2013-10-12 17:18:10
2013-10-12 17:18:11
2013-10-12 17:18:12
2013-10-12 17:18:13
2013-10-12 17:18:14

Cool, but not so interesting, so let’s get to Twitter.

3 – Create the Tweet Stream in Spring XD

In order to stream in information from Twitter, you’ll need to set up a Twitter Developer app so you can get the necessary keys.

Once you have the keys, you can add them to the Twitter properties file at XD_HOME/xd/config/.

In our case, we’ll take a look at the stream of current opinion on that current icon of popular culture: Miley Cyrus. The stream can be set up as follows with some simple tracking terms:

stream create --name cyrustweets --definition "twitterstream --track='mileycyrus, miley cyrus' | hdfs"

You might want to build up these files for a little while. You can check in on the data at:

hadoop fs ls /xd/cyrustweets/

Found 12 items
-rwxr-xr-x   3 root hdfs    1002252 2013-10-12 19:33 /xd/cyrustweets/cyrustweets-0.log
-rwxr-xr-x   3 root hdfs    1000126 2013-10-12 19:33 /xd/cyrustweets/cyrustweets-1.log
-rwxr-xr-x   3 root hdfs    1004800 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-10.log
-rwxr-xr-x   3 root hdfs          0 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-11.log
-rwxr-xr-x   3 root hdfs    1003357 2013-10-12 19:33 /xd/cyrustweets/cyrustweets-2.log
-rwxr-xr-x   3 root hdfs    1000903 2013-10-12 19:33 /xd/cyrustweets/cyrustweets-3.log
-rwxr-xr-x   3 root hdfs    1000096 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-4.log
-rwxr-xr-x   3 root hdfs    1001072 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-5.log
-rwxr-xr-x   3 root hdfs    1001226 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-6.log
-rwxr-xr-x   3 root hdfs    1000398 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-7.log
-rwxr-xr-x   3 root hdfs    1001404 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-8.log
-rwxr-xr-x   3 root hdfs    1006052 2013-10-12 19:34 /xd/cyrustweets/cyrustweets-9.log

The default rollover for the logs is 1MB, so there are a lot of files. You might want to increase that or change other options.
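For instance, the hdfs sink exposes a rollover option, so a definition along these lines should cut larger files (the option name and suffix syntax are per the Spring XD milestone docs – verify against your version before relying on it):

```
stream create --name cyrustweets --definition "twitterstream --track='mileycyrus, miley cyrus' | hdfs --rollover=64M"
```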

After a cup of coffee or two, we should have some reasonable data to begin processing and refining. It took around 30 mins to generate 100MB of log files – clearly a fairly popular topic.

At this point, you can undeploy the stream so we can do some sample analysis:

stream undeploy --name cyrustweets

We’re now done with Spring XD. It’s a fun way to pull in a bunch of data from various sources. We can now switch over to Sandbox.

4 – Refine the Data using Hive

To process and analyze the data, we’ll borrow the technique from the previous tutorial. First of all, we can take a look in the File Browser to see the logs we’ve ingested.

Next, we need to position the reference files for the analysis. You can follow the steps (Step 1 and Step 2) in the previous tutorial to load in the dictionary file, and the time_zone_map file. If you’ve already completed that tutorial, then you have everything you need already in the Sandbox.

Next, let’s run some Hive queries.

Create the Tables for the logs, dictionary and time zone map

In the previous tutorial, we had the luxury of pre-formatted files, but in this case the incoming tweets are stored as JSON, so we need a JSON SerDe to map the files onto tables. This project on Github provides a great JSON SerDe. We need to clone it, build it, and then move the result to the Sandbox. Use the following to build the JAR:

git clone
cd Hive-JSON-Serde
mvn package

This will place a JAR called json-serde-1.1.7-jar-with-dependencies.jar in the target folder.

This JAR is needed for the Hive queries we’ll perform. To do that, we will create a new query, first loading the JAR by selecting ‘Add File’ > ‘Upload File’ and then finally selecting that JAR for use.

Once done, the following script creates a fresh table for the twitter logs:

CREATE EXTERNAL TABLE cyrustweets_raw (
   id BIGINT,
   created_at STRING,
   source STRING,
   favorited BOOLEAN,
   retweet_count INT,
   retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING, name:STRING>>,
   entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
   text STRING,
   user STRUCT<
      screen_name:STRING,
      name:STRING,
      friends_count:INT,
      followers_count:INT,
      statuses_count:INT,
      verified:BOOLEAN,
      utc_offset:INT,
      time_zone:STRING>,
   in_reply_to_screen_name STRING,
   year INT,
   month INT,
   day INT,
   hour INT
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/xd/cyrustweets';

NB: If you’ve already completed the previous tutorial, there’s no need to recreate the next two tables. Note that the LOCATION paths may differ for you.

-- Add the dictionary table
CREATE EXTERNAL TABLE dictionary (
    type string,
    length int,
    word string,
    pos string,
    stemmed string,
    polarity string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hue/data/dictionary';

-- Add the time zone map table
CREATE EXTERNAL TABLE time_zone_map (
    time_zone string,
    country string,
    notes string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hue/data/time_zone_map';

Refine the Data

With the essential data now in place, we can refine the data a little.

CREATE VIEW cyrustweets_simple AS
SELECT
    id,
    cast ( from_unixtime( unix_timestamp(concat( '2013 ', substring(created_at,5,15)),
        'yyyy MMM dd hh:mm:ss')) as timestamp) ts,
    text,
    user.time_zone
FROM cyrustweets_raw;
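The nested functions in that view are easier to see outside Hive: it slices the month-day-time fragment out of Twitter's created_at string, pastes a hard-coded year on the front, and parses the result. A quick Python sketch of the same transformation (the sample tweet timestamp is made up):

```python
from datetime import datetime

# Twitter's created_at format looks like: "Sat Oct 12 17:18:09 +0000 2013"
created_at = "Sat Oct 12 17:18:09 +0000 2013"  # synthetic sample

# Hive's substring(created_at, 5, 15) is 1-indexed: 15 chars starting at char 5
fragment = created_at[4:19]  # "Oct 12 17:18:09"

# concat('2013 ', ...) then parse, as in the view's unix_timestamp(...) call
ts = datetime.strptime("2013 " + fragment, "%Y %b %d %H:%M:%S")
print(ts)  # 2013-10-12 17:18:09
```

This is also why the year is hard-coded: created_at puts the year at the end of the string, outside the slice, so the view only works for tweets collected in 2013.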

CREATE VIEW cyrustweets_clean AS
SELECT
    id,
    ts,
    text,
    m.country
FROM cyrustweets_simple t
LEFT OUTER JOIN time_zone_map m ON t.time_zone = m.time_zone;

Run the Sentiment Analysis

Then we’ll create some views that can be used in the sentiment calculations.

-- Compute sentiment
CREATE VIEW l1 AS
SELECT id, words
FROM cyrustweets_raw LATERAL VIEW explode(sentences(lower(text))) dummy AS words;

CREATE VIEW l2 AS
SELECT id, word
FROM l1 LATERAL VIEW explode( words ) dummy AS word;

CREATE VIEW l3 AS
SELECT
    id,
    l2.word,
    CASE d.polarity
        WHEN 'negative' THEN -1
        WHEN 'positive' THEN 1
        ELSE 0
    END AS polarity
FROM l2 LEFT OUTER JOIN dictionary d ON l2.word = d.word;

CREATE VIEW tweets_sentiment AS
SELECT
    id,
    CASE
        WHEN sum( polarity ) > 0 THEN 'positive'
        WHEN sum( polarity ) < 0 THEN 'negative'
        ELSE 'neutral'
    END AS sentiment
FROM l3 GROUP BY id;
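The sentiment views amount to: split each tweet into lowercase words, look each word up in the polarity dictionary, sum the scores per tweet, and classify the sign of the total. A minimal Python sketch of that logic (the tiny dictionary and tweets here are made up for illustration):

```python
# Toy polarity dictionary, standing in for the dictionary table
dictionary = {"love": 1, "great": 1, "hate": -1, "awful": -1}

# Synthetic tweets keyed by id
tweets = {
    1: "I love this song, great stuff",
    2: "I hate this awful noise",
    3: "just another tuesday",
}

def sentiment(text):
    # l1/l2: explode the lowercased text into words
    words = text.lower().replace(",", " ").split()
    # l3: look up each word's polarity, defaulting to 0 for unknown words
    score = sum(dictionary.get(w, 0) for w in words)
    # tweets_sentiment: classify the summed polarity
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for tweet_id, text in tweets.items():
    print(tweet_id, sentiment(text))
# 1 positive
# 2 negative
# 3 neutral
```

The LEFT OUTER JOIN in l3 matters for the same reason as the `.get(w, 0)` default here: words that aren't in the dictionary contribute zero rather than dropping the tweet.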

Finally, we execute the analysis.

-- Put everything back together and re-number sentiment
CREATE TABLE cyrustweetsanalysis AS
SELECT
    t.*,
    CASE s.sentiment
        WHEN 'positive' THEN 2
        WHEN 'neutral' THEN 1
        WHEN 'negative' THEN 0
    END AS sentiment
FROM cyrustweets_clean t
LEFT OUTER JOIN tweets_sentiment s ON =;

Once this job has completed, a quick browse of the data in cyrustweetsanalysis will show the results of the analysis.
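A quick way to eyeball the distribution is a grouped count over the new table (an illustrative query, not from the original tutorial):

```sql
-- Count tweets in each sentiment bucket (2 = positive, 1 = neutral, 0 = negative)
SELECT sentiment, COUNT(*) AS tweets
FROM cyrustweetsanalysis
GROUP BY sentiment;
```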

5 – Visualize the Data in a BI Tool

We’ve created the same table definition as in Tutorial 13, so you could now follow the rest of those instructions to visualize this data in Power BI and Excel, or you could follow a tutorial for another visualization tool such as Tableau.


Chinni Ramesh Pentakota
May 18, 2014 at 9:25 pm

I am new to Hadoop. I was performing the steps in the blog and when starting the sandbox with “Running Sandbox with Bridged Networking” using Oracle VM VirtualBox I get the error:
safemode: call from to failed on connection exception: …:[datanode] Error 255 (ignored)
When logged into the VM I was able to verify that the data node was running on port 8020, but Spring XD was unable to connect from the desktop to the VM.
Please help.

May 22, 2014 at 7:19 pm

I think you missed the “--deploy” argument.

Once that’s confirmed we can set up a simple test stream. In this case, we can re-create TickTock but store it in HDFS.

stream create --name ticktockhdfs --definition "Time | HDFS"

September 8, 2014 at 5:59 pm

I am using the HDP 2.1 sandbox, and to configure Spring XD I use --hadoopDistro hadoop22 or --hadoopDistro hadoop24; both error with “is not a valid value for the option --hadoopDistro”.

    September 6, 2015 at 9:53 am

    Yeah, I’m facing this error too. Anyone know why?

Sravan Mattevada
October 30, 2014 at 2:04 pm

Having trouble making this work. I have HW 2.2 Preview.
I am able to get to the step where “hadoop fs ls /” shows me the directory structure in the XD shell client,
but when I run the “stream create --name ticktockhdfs --definition "time | hdfs" --deploy” command, with or without a hadoop distro, a “connection refused” error is thrown on the XD server.

I was able to run the non-hadoop version for spring XD and was able to get the tweets but didn’t have success creating the Hcat table using the SerDe when I copied the tweets file to HDFS.

The table creation failed (the one that needed the SerDe) even for the previous tutorial, with data already placed from the example and the SerDe jar.

Can someone help me with this please?

Mungeol Heo
November 9, 2014 at 6:53 pm

In order to solve the ‘connection refused’ error, set the namenode info in servers.yml instead of the properties file.
Hope it helps.

    February 15, 2015 at 5:54 pm

    Thank you for the hint.

December 25, 2014 at 6:48 am

Could you please tell me how we can create a Spring XD job that creates a Hive table?

February 20, 2015 at 11:50 am

I went through the install process several times and I can get it to work up to where you list out the hdfs folders. However, when I attempt to create/deploy the stream, it can’t seem to find the HDFS module? Looks like it has no issues with connectivity but just cannot resolve the appropriate module to use with hdfs. I get the following error:

Command failed: Could not find module with name ‘HDFS’ and type ‘sink’

giovanni gadaleta
February 25, 2015 at 1:10 pm

The statement “create external table cyrustweets_raw …..” fails because of the error below.

I think it’s because the jar file used in the create statement is not the right one. I prepared the jar with:
git clone
cd Hive-JSON-Serde
mvn package

the jar json-serde-1.1.7-jar-with-dependencies.jar is not created.
There is one jar created with a similar name: json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar

I uploaded it in hdfs and then “ADD jar … ” in hive. But then I get the error below when I create the table.

any clue ?

2015-02-25 20:58:00,816 ERROR [HiveServer2-Background-Pool: Thread-251]: operation.Operation ( – Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V
at org.apache.hive.service.cli.operation.Operation.toSQLException(
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(
at org.apache.hive.service.cli.operation.SQLOperation.access$100(
at org.apache.hive.service.cli.operation.SQLOperation$1$
at Method)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(
at org.apache.hive.service.cli.operation.SQLOperation$
at java.util.concurrent.Executors$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

Paul Eddie
March 10, 2015 at 12:31 pm

At the start, it is actually (for the 1.1.0 release):
stream create --name ticktock --definition "time | log" --deploy

(the phrases ‘create stream’ and ‘stream create’ were reversed)
