Tutorial 13: How To Refine and Visualize Sentiment Data
This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series.
This tutorial describes how to refine raw Twitter data using the Hortonworks Data Platform, and how to analyze and visualize this refined sentiment data using the Power View feature in Microsoft Excel 2013.
Demo: This video shows a demo of what you’ll be doing in this tutorial with this sentiment data.
Sentiment data is unstructured data that represents opinions, emotions, and attitudes contained in sources such as social media posts, blogs, online product reviews, and customer support interactions.
Potential Uses of Sentiment Data
Organizations use sentiment analysis to understand how the public feels about something at a particular moment in time, and also to track how those opinions change over time.
An enterprise may analyze sentiment about:
- A product – For example, does the target segment understand and appreciate messaging around a product launch? What products do visitors tend to buy together, and what are they most likely to buy in the future?
- A service – For example, a hotel or restaurant can look into its locations with particularly strong or poor service.
- Competitors – In what areas do people see our company as better than (or weaker than) our competition?
- Reputation – What does the public really think about our company? Is our reputation positive or negative?
In this tutorial, we will focus on a product launch. Specifically, we will look at public sentiment during the days leading up to and immediately following the recent release of the movie Iron Man 3.
How did the public feel about the debut, and how might that sentiment data have been used to better promote the movie’s launch?
- Hortonworks Sandbox (installed and running)
- Hortonworks ODBC driver installed and configured
- Tutorial 7: Installing and Configuring the Hortonworks ODBC driver on Windows 7
- Tutorial 11: Installing and Configuring the Hortonworks ODBC driver on Mac OS X
- Microsoft Excel 2013 Professional Plus
- Sentiment tutorial files (included)
- In this tutorial, the screenshots show the Hortonworks Sandbox is installed on an Oracle VirtualBox virtual machine (VM) – your screens may be different.
- Install the ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
- In this tutorial, we will use the Power View feature in Excel 2013 to visualize the sentiment data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
- Note, other versions of Excel will work, but the visualizations will be limited to charts. You can connect to any other visualization tool you like
To refine and visualize website sentiment data, we will:
- Download and extract the sentiment tutorial files.
- Load Twitter data into the Hortonworks Sandbox.
- Copy a Hive script to the Sandbox.
- Run the Hive script to refine the raw data.
- Access the refined sentiment data with Excel.
- Visualize the sentiment data using Excel Power View.
Step 1 – Download and Extract the Sentiment Tutorial Files
- You can download a set of sample Twitter data contained in a compressed (.zip) folder here:
- The Twitter data was obtained using Apache Flume. Flume can be used as a log aggregator, collecting log data from many diverse sources and moving it to a centralized data store. In this case, Flume was used to capture the Twitter stream data, which we can now load into the Hadoop Distributed File System (HFDS).
- Save the SentimentFiles.zip file to your computer, and then extract the files. You should see a SentimentFiles folder that contains a loaddemo.sh file.
Step 2 – Load Twitter Data into the Hortonworks Sandbox
We will now load Twitter data into the Sandbox.
- Open the Sandbox HUE and click the File Browser icon in the toolbar at the top of the page, then click the first slash to the right of Home (and to the left of “user”) to navigate to the top level of the Sandbox file system.
- Click New, then select Directory.
- On the Create Directory pop-up, type “data” in the Directory Name box, then click Submit.
- The “data” folder appears in the list of files.
- Click the “data” folder. In the “data” folder, use New > Directory to create three folders with the names “dictionary”, “time_zone_map”, and “tweets”. You must use these exact names as the rest of the tutorial is based on these names.
- Click the “dictionary” folder. In the “dictionary” folder, click Upload, then select Files.
- On the Uploading pop-up, click Select files.
- Use the File Upload dialog to browse to the “data” folder in the SentimentFiles folder you extracted previously. Select the dictionary.tsv file, then click Open.
- The dictionary.tsv file will appear in the /data/dictionary folder.
- Navigate back up to the “data” folder. Use the same procedure to upload the time_zone_map.tsv file to the data/time_zone_map folder.
- Navigate back up to the “data” folder, then click the “tweets” folder. This folder contains all the tweets we collected. To keep things simple we will use a subset. Once you complete the tutorial you can upload the entire dataset and explore it. Select Upload > Files, then click Select files.Use the File Upload dialog to browse to the SentimentFiles\data\tweets.rc folder. Select the “00” file, then Shift-click the “06” file to select files 00-06. Click Open to upload the files.
- A progress indicator appears while the files are being uploaded.
- When the upload is complete, files 00-06 will appear in the /data/tweets folder.
Step 3 – Copy a Hive Script to the Sandbox
We will now use SCP to copy the hiveddl.sql file to the Sandbox. The procedure is slightly different for Windows and Mac, so both methods are described here.
Mac OS X — Copy the hiveddl.sql File to the Sandbox
- Open a Terminal window and navigate to “hive” subfolder in the SentimentFiles folder you extracted previously. Type in the following command, then press the Enter key:
scp -P 2222 hiveddl.sql firstname.lastname@example.org:
- You must use an uppercase “P” for the “-P” in this command.
- You may be prompted to validate the authenticity of the host. If so, type “yes” when prompted.
- When prompted, type in the Sandbox password (“hadoop”), then Press Enter. This command will copy the hiveddl.sql file to the root folder on the Sandbox.When the file transfer is complete, a confirmation message will appear in the terminal window:
hiveddl.sql 100% 3368 3.3KB/s 00:00
Windows 7: Copy the hiveddl.sql File to the Sandbox
On Windows you will need to download and install the free WinSCP application.
- Open WinSCP and type in the following settings, then click Login.
- Host name: 127.0.0.1
- Port: 2222
- User name: root
- Type the Sandbox password (“hadoop”) in the Password box, then click OK.
- Use the WinSCP file browser to navigate to the SentimentFiles\hive folder in the left-hand pane, and to the Sandbox /root folder in the right-hand pane.Drag-and-drop the hiveddl.sql file from the SentimentFiles\hive folder to the /root folder on the Sandbox.
Click Copy on the Copy pop-up to confirm the file transfer.
Step 4 – Run the Hive Script to Refine the Raw Data
- In the Hortonworks Sandbox virtual machine (VM) console window, press the Alt and F5 keys, then log in to the Sandbox using the following user name and password:Login: root
Password: hadoopAfter you log in, the command prompt will appear with the prefix [root@sandbox ~]#:
- At the command prompt, type in the following command, then press the Enter key:
hive -f hiveddl.sql
Lines of text appear as the script runs a series of MapReduce jobs. It will take a few minutes for the script to finish running. When the script has finished running, the time taken is displayed, and the normal command prompt appears.
The hiveddl.sql script has performed the following steps to refine the data:
- Converted the raw Twitter data into a tabular format.
- Used the dictionary file to score the sentiment of each Tweet by the number of positive words compared to the number of negative words, and then assigned a positive, negative, or neutral sentiment value to each Tweet.
- Created a new table that includes the sentiment value for each Tweet.
Let’s use HCatalog to take a quick look at the data.
- Open the Sandbox HUE user interface, then click HCatalog in the menu at the top of the page. Select the check box next to the “tweets_raw” table, then click Browse Data.
- The “tweets_raw” table was created by the Hive script from the raw Twitter data. You should see columns with data for creation time, the number of re-tweets, the tweet content, the user name, and other Twitter data.
- Click HCatalog in the menu at the top of the page, select the check box next to the “tweetsbi” table, and then click Browse Data. The “tweetsbi” table is the table created by the Hive script that includes a column with the sentiment value for each tweet. (Note, you may need to scroll right to see all of the columns.)
Now that we have refined Twitter data in a tabular format with sentiment ratings, we can access the data with Excel.
Step 5 – Access the Refined Sentiment Data with Excel
In this section, we will use Excel Professional Plus 2013 to access the refined sentiment data.
- In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.
- On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK.The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC.
- After the connection to the Sandbox is established, the Query Wizard appears. Select the “tweetsbi” table in the Available tables and columns box, then click the right arrow button to add the entire “tweetsbi” table to the query. Click Next to continue.
- Select the “text” column in the “Columns in your query” box, then click the left arrow button to remove the text column.
- After the “text” column has been removed, click Next to continue.
- On the Filter Data screen, click Next to continue without filtering the data.
- On the Sort Order screen, click Next to continue without setting a sort order.
- Click Finish on the Query Wizard Finish screen to retrieve the query data from the Sandbox and import it into Excel.
- On the Import Data dialog box, click OK to accept the default settings and import the data as a table.
- The imported query data appears in the Excel workbook.
Now that we have successfully imported the Twitter sentiment data into Microsoft Excel, we can use the Excel Power View feature to analyze and visualize the data.
Step 6 – Visualize the Sentiment Data Using Excel Power View
Data visualization can help you optimize your website and convert more visits into sales and revenue. In this section we will see how sentiment varies by country, and review the sentiment data for the United States.
- In the Excel worksheet with the imported “tweetsbi” table, select Insert > Power View to open a new Power View report.
- The Power View Fields area appears on the right side of the window, with the data table displayed on the left. Drag the handles or click the Pop Out icon to maximize the size of the data table.
- In the Power View Fields area, clear the checkboxes next to the id and ts fields, then click Map on the Design tab in the top menu.
- The map view displays a global view of the data.
- Now let’s display the sentiment data by color. In the Power View Fields area, click sentiment, then select Add as Color.
- Under SIZE, click sentiment, then select Count (Not Blank).
- Now the map displays the sentiment data by color:
- Orange: positive
- Blue: negative
- Red: neutral
- Use the map controls to zoom in on Ireland. About half of the tweets have a positive sentiment score, as indicated by the color orange.
- Now use the map controls to zoom in on Mexico. In Mexico, about one-fifth of the tweets expressed negative sentiment (shown in blue), and only a small portion of the tweets were positive. Most tweets from Mexico were neutral, as shown in red.
- Next, use the map controls to zoom in on the sentiment data in China. Marvel studios and the Chinese studio DMG co-financed Iron Man 3, and the cast included a famous Chinese actress.We can see that the majority of tweets from China are neutral, with positive sentiment slightly outweighing negative sentiment.
- The United States is the biggest market, so let’s look at sentiment data there. The size of the United States pie chart indicates that a relatively large number of the total tweets come from the US.About half of the tweets in the US show neutral sentiment, with a relatively small amount of negative sentiment.
We’ve shown a visualization of Twitter sentiment data for the release of Iron Man 3. This information will be useful for planning marketing activities for any future Iron Man movie releases.