How To Refine and Visualize Sentiment Data
This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0
This tutorial describes how to refine raw Twitter data using the Hortonworks Data Platform, and how to analyze and visualize this refined sentiment data using the Power View feature in Microsoft Excel 2013.
Demo: Here is the video demo of what you’ll be doing in this tutorial:
Sentiment data is unstructured data that represents opinions, emotions, and attitudes contained in sources such as social media posts, blogs, online product reviews, and customer support interactions.
Potential Uses of Sentiment Data
Organizations use sentiment analysis to understand how the public feels about something at a particular moment in time, and also to track how those opinions change over time.
An enterprise may analyze sentiment about:
- A product – For example, does the target segment understand and appreciate messaging around a product launch? What products do visitors tend to buy together, and what are they most likely to buy in the future?
- A service – For example, a hotel or restaurant can look into its locations with particularly strong or poor service.
- Competitors – In what areas do people see our company as better than (or weaker than) our competition?
- Reputation – What does the public really think about our company? Is our reputation positive or negative?
In this tutorial, we will focus on a product launch. Specifically, we will look at public sentiment during the days leading up to and immediately following the recent release of the movie Iron Man 3.
How did the public feel about the debut, and how might that sentiment data have been used to better promote the movie’s launch?
- Hortonworks Sandbox (installed and running)
- Hortonworks ODBC driver installed and configured
- Tutorial 7: Installing and Configuring the Hortonworks ODBC driver on Windows 7
- Tutorial 11: Installing and Configuring the Hortonworks ODBC driver on Mac OS X
- Microsoft Excel 2013 Professional Plus
- Sentiment tutorial files (included)
- In this tutorial, the screenshots show the Hortonworks Sandbox is installed on an Oracle VirtualBox virtual machine (VM) – your screens may be different.
- Install the ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
- In this tutorial, we will use the Power View feature in Excel 2013 to visualize the sentiment data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
- Note, other versions of Excel will work, but the visualizations will be limited to charts. You can connect to any other visualization tool you like
To refine and visualize website sentiment data, we will:
- Download and extract the sentiment tutorial files.
- Load Twitter data into the Hortonworks Sandbox.
- Copy a Hive script to the Sandbox.
- Run the Hive script to refine the raw data.
- Access the refined sentiment data with Excel.
- Visualize the sentiment data using Excel Power View.
Step 1: Download and Extract the Sentiment Tutorial Files
- You can download a set of sample Twitter data contained in a compressed (.zip) folder here:SentimentFiles.zip
The Twitter data was obtained using Hortonworks Flume. Flume can be used as a log aggregator, collecting log data from many diverse sources and moving it to a centralized data store. In this case, Flume was used to capture the Twitter stream data, which we can now load into the Hadoop Distributed File System (HFDS).
- Save the SentimentFiles.zip file to your computer, and then extract the files. You should see a SentimentFiles folder.
Step 2: Load Twitter Data into the Hortonworks Sandbox
We will now load Twitter data into the Sandbox.
To make things simpler we have prepared a zip file with all the data files we need uploaded into the sandbox. Click on the File Browser tab at the top and then select Upload -> Zip File.
You will see a file selection box and navigate into the SentimentFiles folder. You will see a file called upload.zip. Select that and start the upload.
The file will upload into the Sandbox VM and be automatically unpacked into the directory.
When the upload it complete you will see a new folder called upload.
Step 3: Copy a Hive Script to the Sandbox
We will now use SCP to copy the hiveddl.sql file to the Sandbox. The procedure is slightly different for Windows and Mac, so both methods are described here.
Mac OS X — Copy the hiveddl.sql File to the Sandbox
- Open a Terminal window and navigate to “SentimentFiles/upload/hive” subfolder in the SentimentFiles folder you extracted previously. Type in the following commands, then press the Enter key:
scp -P 2222 hiveddl.sql email@example.com:
scp -P 2222 json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar firstname.lastname@example.org:
- You must use an uppercase “P” for the “-P” in this command.
- You may be prompted to validate the authenticity of the host. If so, type “yes” when prompted.
Here is what our screen looked like after we completed the transfers.
- When prompted, type in the Sandbox password (“hadoop”), then Press Enter. This command will copy the hiveddl.sql file to the root folder on the Sandbox.When the file transfer is complete, a confirmation message will appear in the terminal window:
After the files have uploaded you can connect to the sandbox with an SSH session and verify the files were uploaded.
Windows 7: Copy the files to the Sandbox
On Windows you will need to download and install the free WinSCP application.
- Open WinSCP and type in the following settings, then click Login.
- Host name: 127.0.0.1
- Port: 2222
- User name: root
- Type the Sandbox password (“hadoop”) in the Password box, then click OK.
- Use the WinSCP file browser to navigate to the SentimentFiles\hive folder in the left-hand pane, and to the Sandbox /root folder in the right-hand pane.Drag-and-drop the hiveddl.sql file from the SentimentFiles\hive folder to the /root folder on the Sandbox.
Click Copy on the Copy pop-up to confirm the file transfer. Do the same with the json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar file.
Step 4: Run the Hive Script to Refine the Raw Data
- In the Hortonworks Sandbox virtual machine (VM) console window, press the Alt and F5 keys, then log in to the Sandbox using the following user name and password:Login: root\ Password: hadoop
After you log in, the command prompt will appear with the prefix [root@sandbox ~]#:
- At the command prompt, type in the following command, then press the Enter key:
hive -f hiveddl.sql
Lines of text appear as the script runs a series of MapReduce jobs. It will take a few minutes for the script to finish running. When the script has finished running, the time taken is displayed, and the normal command prompt appears.
The hiveddl.sql script has performed the following steps to refine the data:
- Converted the raw Twitter data into a tabular format.
- Used the dictionary file to score the sentiment of each Tweet by the number of positive words compared to the number of negative words, and then assigned a positive, negative, or neutral sentiment value to each Tweet.
- Created a new table that includes the sentiment value for each Tweet.
We can look at the data using the Hive command line. We can start Hive by typing hive at the prompt.
Remember to add the json serde jar file so we can look at the tables.
The command “show tables” will show you the tables. You can browse the data using the “select * from
limit 10;” command. The limit 10 gives you the first 10 records instead of the whole table.
Let’s use HCatalog to take a quick look at the data.
- Open the Sandbox HUE user interface, then click HCatalog in the menu at the top of the page. Select the check box next to the “tweetsbi” table, then click Browse Data. The “tweetsbi” table is the table created by the Hive script that added a column with the sentiment value for each tweet. (Note, you may need to scroll right to see all of the columns.)
Now that we have refined Twitter data in a tabular format with sentiment ratings, we can access the data with Excel.
Step 5: Access the Refined Sentiment Data with Excel
In this section, we will use Excel Professional Plus 2013 to access the refined sentiment data.
- In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.
- On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK.The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC.
- After the connection to the Sandbox is established, the Query Wizard appears. Select the “tweetsbi” table in the Available tables and columns box, then click the right arrow button to add the entire “tweetsbi” table to the query. Click Next to continue.
- Select the “text” column in the “Columns in your query” box, then click the left arrow button to remove the text column.
- After the “text” column has been removed, click Next to continue.
- On the Filter Data screen, click Next to continue without filtering the data.
- On the Sort Order screen, click Next to continue without setting a sort order.
- Click Finish on the Query Wizard Finish screen to retrieve the query data from the Sandbox and import it into Excel.
- On the Import Data dialog box, click OK to accept the default settings and import the data as a table.
- The imported query data appears in the Excel workbook.
Now that we have successfully imported the Twitter sentiment data into Microsoft Excel, we can use the Excel Power View feature to analyze and visualize the data.
Step 6: Visualize the Sentiment Data Using Excel Power View
Data visualization can help you optimize your website and convert more visits into sales and revenue. In this section we will see how sentiment varies by country, and review the sentiment data for the United States.
- In the Excel worksheet with the imported “tweetsbi” table, select Insert > Power View to open a new Power View report.
- The Power View Fields area appears on the right side of the window, with the data table displayed on the left. Drag the handles or click the Pop Out icon to maximize the size of the data table.
- In the Power View Fields area, clear the checkboxes next to the id and ts fields, then click Map on the Design tab in the top menu.
- The map view displays a global view of the data.
- Now let’s display the sentiment data by color. In the Power View Fields area, click sentiment, then select Add as Color.
- Under SIZE, click sentiment, then select Count (Not Blank).
- Now the map displays the sentiment data by color:
- Orange: positive
- Blue: negative
- Red: neutral
- Use the map controls to zoom in on Ireland. About half of the tweets have a positive sentiment score, as indicated by the color orange.
- Now use the map controls to zoom in on Mexico. In Mexico, about one-fifth of the tweets expressed negative sentiment (shown in blue), and only a small portion of the tweets were positive. Most tweets from Mexico were neutral, as shown in red.
- Next, use the map controls to zoom in on the sentiment data in China. Marvel studios and the Chinese studio DMG co-financed Iron Man 3, and the cast included a famous Chinese actress.We can see that the majority of tweets from China are neutral, with positive sentiment slightly outweighing negative sentiment.
- The United States is the biggest market, so let’s look at sentiment data there. The size of the United States pie chart indicates that a relatively large number of the total tweets come from the US.About half of the tweets in the US show neutral sentiment, with a relatively small amount of negative sentiment.
We’ve shown a visualization of Twitter sentiment data for the release of Iron Man 3. This information will be useful for planning marketing activities for any future Iron Man movie releases.
We are eager to hear your feedback on this tutorial. Please let us know what you think. Click here to take survey
Try this tutorial with :
These tutorials are designed to work with Sandbox, a simple and easy to get started with Hadoop. Sandbox offers a full HDP environment that runs in a virtual machine.