How To Refine and Visualize Sentiment Data

This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0


This tutorial describes how to refine raw Twitter data using the Hortonworks Data Platform, and how to analyze and visualize this refined sentiment data using the Power View feature in Microsoft Excel 2013.

Demo: Here is the video demo of what you’ll be doing in this tutorial:

Sentiment Data

Sentiment data is unstructured data that represents opinions, emotions, and attitudes contained in sources such as social media posts, blogs, online product reviews, and customer support interactions.

Potential Uses of Sentiment Data

Organizations use sentiment analysis to understand how the public feels about something at a particular moment in time, and also to track how those opinions change over time.

An enterprise may analyze sentiment about:

    • A product – For example, does the target segment understand and appreciate messaging around a product launch? What products do visitors tend to buy together, and what are they most likely to buy in the future?
    • A service – For example, a hotel or restaurant can look into its locations with particularly strong or poor service.
    • Competitors – In what areas do people see our company as better than (or weaker than) our competition?
    • Reputation – What does the public really think about our company? Is our reputation positive or negative?

In this tutorial, we will focus on a product launch. Specifically, we will look at public sentiment during the days leading up to and immediately following the recent release of the movie Iron Man 3.

How did the public feel about the debut, and how might that sentiment data have been used to better promote the movie’s launch?


  • Hortonworks Sandbox (installed and running)
  • Hortonworks ODBC driver installed and configured

Refer to:


  • In this tutorial, the screenshots show the Hortonworks Sandbox is installed on an Oracle VirtualBox virtual machine (VM) – your screens may be different.
  • Install the ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
  • In this tutorial, we will use the Power View feature in Excel 2013 to visualize the sentiment data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
  • Note, other versions of Excel will work, but the visualizations will be limited to charts. You can connect to any other visualization tool you like


To refine and visualize website sentiment data, we will:

  • Download and extract the sentiment tutorial files.
  • Load Twitter data into the Hortonworks Sandbox.
  • Copy a Hive script to the Sandbox.
  • Run the Hive script to refine the raw data.
  • Access the refined sentiment data with Excel.
  • Visualize the sentiment data using Excel Power View.

Step 1: Download and Extract the Sentiment Tutorial Files

  • You can download a set of sample Twitter data contained in a compressed (.zip) folder

    The Twitter data was obtained using Hortonworks Flume. Flume can be used as a log aggregator, collecting log data from many diverse sources and moving it to a centralized data store. In this case, Flume was used to capture the Twitter stream data, which we can now load into the Hadoop Distributed File System (HFDS).

  • Save the file to your computer, and then extract the files. You should see a SentimentFiles folder.

Step 2: Load Twitter Data into the Hortonworks Sandbox

We will now load Twitter data into the Sandbox.

To make things simpler we have prepared a zip file with all the data files we need uploaded into the sandbox. Click on the File Browser tab at the top and then select Upload -> Zip File.

You will see a file selection box and navigate into the SentimentFiles folder. You will see a file called Select that and start the upload.

The file will upload into the Sandbox VM and be automatically unpacked into the directory.

When the upload it complete you will see a new folder called upload.

Step 3: Copy a Hive Script to the Sandbox

We will now use SCP to copy the hiveddl.sql file to the Sandbox. The procedure is slightly different for Windows and Mac, so both methods are described here.

Mac OS X — Copy the hiveddl.sql File to the Sandbox

    • Open a Terminal window and navigate to “SentimentFiles/upload/hive” subfolder in the SentimentFiles folder you extracted previously. Type in the following commands, then press the Enter key:scp -P 2222 hiveddl.sql root@

scp -P 2222 json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar root@


    • You must use an uppercase “P” for the “-P” in this command.
    • You may be prompted to validate the authenticity of the host. If so, type “yes” when prompted.

Here is what our screen looked like after we completed the transfers.

    • When prompted, type in the Sandbox password (“hadoop”), then Press Enter. This command will copy the hiveddl.sql file to the root folder on the Sandbox.When the file transfer is complete, a confirmation message will appear in the terminal window:

After the files have uploaded you can connect to the sandbox with an SSH session and verify the files were uploaded.

Windows 7: Copy the files to the Sandbox

On Windows you will need to download and install the free WinSCP application.

    • Open WinSCP and type in the following settings, then click Login.
    • Host name:
    • Port: 2222
    • User name: root

  • Type the Sandbox password (“hadoop”) in the Password box, then click OK.

  • Use the WinSCP file browser to navigate to the SentimentFiles\hive folder in the left-hand pane, and to the Sandbox /root folder in the right-hand pane.Drag-and-drop the hiveddl.sql file from the SentimentFiles\hive folder to the /root folder on the Sandbox.

Click Copy on the Copy pop-up to confirm the file transfer. Do the same with the json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar file.

Step 4: Run the Hive Script to Refine the Raw Data

    • In the Hortonworks Sandbox virtual machine (VM) console window, press the Alt and F5 keys, then log in to the Sandbox using the following user name and password:Login: root\ Password: hadoop

      After you log in, the command prompt will appear with the prefix [root@sandbox ~]#:

    • At the command prompt, type in the following command, then press the Enter key:hive -f hiveddl.sql

Lines of text appear as the script runs a series of MapReduce jobs. It will take a few minutes for the script to finish running. When the script has finished running, the time taken is displayed, and the normal command prompt appears.

The hiveddl.sql script has performed the following steps to refine the data:

  • Converted the raw Twitter data into a tabular format.
  • Used the dictionary file to score the sentiment of each Tweet by the number of positive words compared to the number of negative words, and then assigned a positive, negative, or neutral sentiment value to each Tweet.
  • Created a new table that includes the sentiment value for each Tweet.

We can look at the data using the Hive command line. We can start Hive by typing hive at the prompt.

Remember to add the json serde jar file so we can look at the tables.

The command “show tables” will show you the tables. You can browse the data using the “select * from

limit 10;” command. The limit 10 gives you the first 10 records instead of the whole table.

Let’s use HCatalog to take a quick look at the data.

  • Open the Sandbox HUE user interface, then click HCatalog in the menu at the top of the page. Select the check box next to the “tweetsbi” table, then click Browse Data. The “tweetsbi” table is the table created by the Hive script that added a column with the sentiment value for each tweet. (Note, you may need to scroll right to see all of the columns.)

Now that we have refined Twitter data in a tabular format with sentiment ratings, we can access the data with Excel.

Step 5: Access the Refined Sentiment Data with Excel

In this section, we will use Excel Professional Plus 2013 to access the refined sentiment data.

    • In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.

  • On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK.The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC.

  • After the connection to the Sandbox is established, the Query Wizard appears. Select the “tweetsbi” table in the Available tables and columns box, then click the right arrow button to add the entire “tweetsbi” table to the query. Click Next to continue.

  • Select the “text” column in the “Columns in your query” box, then click the left arrow button to remove the text column.

  • After the “text” column has been removed, click Next to continue.

  • On the Filter Data screen, click Next to continue without filtering the data.

  • On the Sort Order screen, click Next to continue without setting a sort order.

  • Click Finish on the Query Wizard Finish screen to retrieve the query data from the Sandbox and import it into Excel.

  • On the Import Data dialog box, click OK to accept the default settings and import the data as a table.

  • The imported query data appears in the Excel workbook.

Now that we have successfully imported the Twitter sentiment data into Microsoft Excel, we can use the Excel Power View feature to analyze and visualize the data.

Step 6: Visualize the Sentiment Data Using Excel Power View

Data visualization can help you optimize your website and convert more visits into sales and revenue. In this section we will see how sentiment varies by country, and review the sentiment data for the United States.

    • In the Excel worksheet with the imported “tweetsbi” table, select Insert > Power View to open a new Power View report.

  • The Power View Fields area appears on the right side of the window, with the data table displayed on the left. Drag the handles or click the Pop Out icon to maximize the size of the data table.

  • In the Power View Fields area, clear the checkboxes next to the id and ts fields, then click Map on the Design tab in the top menu.

  • The map view displays a global view of the data.

  • Now let’s display the sentiment data by color. In the Power View Fields area, click sentiment, then select Add as Color.

  • Under SIZE, click sentiment, then select Count (Not Blank).

  • Now the map displays the sentiment data by color:
  • Orange: positive
  • Blue: negative
  • Red: neutral

  • Use the map controls to zoom in on Ireland. About half of the tweets have a positive sentiment score, as indicated by the color orange.

  • Now use the map controls to zoom in on Mexico. In Mexico, about one-fifth of the tweets expressed negative sentiment (shown in blue), and only a small portion of the tweets were positive. Most tweets from Mexico were neutral, as shown in red.

  • Next, use the map controls to zoom in on the sentiment data in China. Marvel studios and the Chinese studio DMG co-financed Iron Man 3, and the cast included a famous Chinese actress.We can see that the majority of tweets from China are neutral, with positive sentiment slightly outweighing negative sentiment.

  • The United States is the biggest market, so let’s look at sentiment data there. The size of the United States pie chart indicates that a relatively large number of the total tweets come from the US.About half of the tweets in the US show neutral sentiment, with a relatively small amount of negative sentiment.

We’ve shown a visualization of Twitter sentiment data for the release of Iron Man 3. This information will be useful for planning marketing activities for any future Iron Man movie releases.


We are eager to hear your feedback on this tutorial. Please let us know what you think. Click here to take survey


Sagar Prasad
August 6, 2014 at 12:52 am

For issue : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

I tried suggestion from Brandon ( and it worked for me.
This issue is caused by an incompatibility with the SerDe jar that is packed with the demo. To fix this, use the version of SerDe that is now packaged with HCatalog. Make the following changes to hiveddl.sql

1) comment out or remove the first line that adds the SerDe jar
–ADD JAR json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar;
2) Change line 34 to refer to the HCatalog version of SerDe

    September 6, 2014 at 10:14 pm

    Hi prasad .. Thanks … I followed same steps. I see Map 100% Reduce 100% but still got exception. And now trying to rerun gettong AlreadyExistsException Table tweets_raw already exists

July 15, 2014 at 5:56 am

Thank you for this very interesting tutorial. But I have a problem:
I try to connect to WinSCP with the arguments you offer but the connection is refused.
Do you know this problem?
Thank you in advance

    August 5, 2014 at 10:39 am

    Port should be 22 not 2222.

    August 5, 2014 at 10:29 am

    I have the same problem.

July 12, 2014 at 12:45 pm

Somebody essentially help to make critically articles I
would state. This is the first time I frequented your website page and so
far? I surprised with the research you made to create this actual
submit extraordinary. Magnificent job!

July 11, 2014 at 2:00 pm

I’ve been surfing online more than 3 hours today, yet
I never found any interesting article like yours. It
is pretty worth enough for me. Personally, if all site owners and bloggers made good content as
you did, the net will be a lot more useful than ever before.

July 9, 2014 at 5:53 am

Hi I have tried to execute this from sandbox , logged in with root…but during the execution of hiveddl.sql, i faced this error:

Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJav aObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive /PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

Any idea what it might be due to?

Tri Nguyen
June 19, 2014 at 2:23 pm

Came here from the flume page
The description about the data collection using Flume is almost non-existent. The article should at least add a description about the flume configuration file.

    Jules S. Damji
    July 16, 2014 at 12:59 pm

    Thanks. Will take note.

June 10, 2014 at 1:32 am

I have followed the same steps, but while executing the hiveddl.sql, i get the following error.

Executed in putty :

Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTa sk. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJav aObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive /PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

Execeuted from HUE shell

Driver returned: 1. Errors: OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Could not initialize class

Please help.

June 10, 2014 at 1:13 am

I followed the same steps above, but while executing the hiveddl.sql, i get an error:

Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTa sk. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJav aObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive /PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

Please help.

Yogesh Sobale
May 20, 2014 at 3:19 am

In given example, the version for json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar is not compatible with hive installed in HDP 2.1.
After debugging I found that hive installed in sandbox has jar hive-serde- which is not compatible with json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar. It is having issue with the constructor AbstractPrimitiveJavaObjectInspector(). Can you please provide the correct jar ?

May 19, 2014 at 3:16 pm

Does not work on HDP2.1 with Hive .13; Created the json-serde again, but still fails.

Driver returned: 1. Errors: OK
converting to local hdfs://
Added /tmp/08cf0f24-0df6-4b44-8890-6150f2873398_resources/json-serde- to class path
Added resource: /tmp/08cf0f24-0df6-4b44-8890-6150f2873398_resources/json-serde-
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Could not initialize class

May 16, 2014 at 5:49 am

Hi, when i run the hiveddl.sql script all it does it just creates bunch of tables and views and no data.. when I opened the script I found no LOAD statement in there.Is it possible the script is not complete or am i missing something here?

Will H
May 14, 2014 at 11:00 am

I am having an issue running the initial hive script. Not sure if I’m doing something wrong but am seeing the following error:

[root@sandbox ~]# hive -f hiveddl.sql

Logging initialized using configuration in file:/etc/hive/conf.dist/
Added json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar to class path
Added resource: json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V
[root@sandbox ~]#

April 18, 2014 at 8:25 am

how do you it for some other movie or key word . What files are there to be modified.

February 18, 2014 at 8:05 pm

Thanks for some other fantastic article. The place else
may just anybody get that kind of information in such an ideal method of
writing? I have a presentation subsequent week, and I am at the look for such information.

February 11, 2014 at 9:13 am

A person essentially lend a hand to make severely articles I would state.
This is the first time I frequented your web page and to this point?

I amazed with the research you made to make this particular submit incredible.
Excellent activity!

February 10, 2014 at 9:16 pm

The bottom line here is simple: use supplements only after doing
research and exercising caution. So, what happens when we incorporate a probiotic fermentation
process to whole food nutritional ingredients. Some brands contain extra substances that you really don’t need any may
actually cause you more harm than good.

February 2, 2014 at 9:08 am

It’s actually a cool and helpful piece of info.
I’m satisfied that you just shared this helpful info with us.

Please stay us up to date like this. Thanks for sharing.

Leave a Reply

Your email address will not be published. Required fields are marked *

If you have specific technical questions, please post them in the Forums

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Try this tutorial with :

These tutorials are designed to work with Sandbox, a simple and easy to get started with Hadoop. Sandbox offers a full HDP environment that runs in a virtual machine.