How To Refine and Visualize Sentiment Data
With Hadoop, you can mine Twitter, Facebook, and other social media conversations for sentiment data about you and your competition, and use it to make targeted, real-time decisions that increase market share. This demo takes a Hadoop user through the steps for extracting sentiment data from Twitter and analyzing the performance of a recent movie release.
Video: http://www.youtube.com/watch?v=y3nFfsTnY3M
In this tutorial, you'll learn how to:
- Stream Twitter feeds into HDFS with Flume
- Use HCatalog to build a relational view of the data
- Use Hive to query and refine the data
- Import the data into Microsoft Excel with the ODBC connector
- Visualize data with Power View
Sentiment Data
Sentiment data is unstructured data that represents opinions, emotions, and attitudes contained in sources such as social media posts, blogs, online product reviews, and customer support interactions.
Potential Uses of Sentiment Data
Organizations use sentiment analysis to understand how the public feels about something at a particular moment in time, and also to track how those opinions change over time. An enterprise may analyze sentiment about:
- A product – For example, does the target segment understand and appreciate messaging around a product launch? What products do visitors tend to buy together, and what are they most likely to buy in the future?
- A service – For example, a hotel or restaurant can look into its locations with particularly strong or poor service.
- Competitors – In what areas do people see our company as better than (or weaker than) our competition?
- Reputation – What does the public really think about our company? Is our reputation positive or negative?
Prerequisites:
- Hortonworks Sandbox (installed and running)
- Hortonworks ODBC driver installed and configured
- Tutorial 7: Installing and Configuring the Hortonworks ODBC driver on Windows 7
- Tutorial 11: Installing and Configuring the Hortonworks ODBC driver on Mac OS X
- Microsoft Excel 2013 Professional Plus
- Sentiment tutorial files (included)
Notes:
- In this tutorial, the screenshots show the Hortonworks Sandbox installed on an Oracle VirtualBox virtual machine (VM); your screens may be different.
- Install the ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
- In this tutorial, we will use the Power View feature in Excel 2013 to visualize the sentiment data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
- Other versions of Excel will work, but the visualizations will be limited to charts. You can also connect to any other visualization tool you like.
Overview
To refine and visualize Twitter sentiment data, we will:
- Download and extract the sentiment tutorial files.
- Load Twitter data into the Hortonworks Sandbox.
- Copy a Hive script to the Sandbox.
- Run the Hive script to refine the raw data.
- Access the refined sentiment data with Excel.
- Visualize the sentiment data using Excel Power View.
Step 1: Download and Extract the Sentiment Tutorial Files
- You can download a set of sample Twitter data contained in a compressed (.zip) folder here: SentimentFiles.zip. The Twitter data was obtained using Apache Flume. Flume can be used as a log aggregator, collecting log data from many diverse sources and moving it to a centralized data store. In this case, Flume was used to capture the Twitter stream data, which we can now load into the Hadoop Distributed File System (HDFS).
- Save the SentimentFiles.zip file to your computer, and then extract the files. You should see a SentimentFiles folder.
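If you prefer to script the extraction rather than use your operating system's archive tool, the following is a minimal sketch using Python's standard zipfile module. The function name extract_archive and the destination argument are illustrative assumptions, not part of the tutorial files.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, destination="."):
    """Extract every member of zip_path into destination.

    Returns the sorted top-level names found in the archive,
    for example ["SentimentFiles"].
    """
    destination = Path(destination)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        names = archive.namelist()
    # Keep only the first path component of each archive member.
    return sorted({name.split("/")[0] for name in names})
```

Calling extract_archive("SentimentFiles.zip") from the folder where you saved the download should leave a SentimentFiles folder next to the archive, assuming the archive contains that top-level folder as described above.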
Step 2: Load Twitter Data into the Hortonworks Sandbox
We will now load Twitter data into the Sandbox. To make things simpler, we have prepared a zip file containing all the data files that need to be uploaded into the Sandbox. Click the File Browser tab at the top, then select Upload -> Zip File. In the file selection box, navigate to the SentimentFiles folder, select the file called upload.zip, and start the upload. The file will upload into the Sandbox VM and be automatically unpacked into the directory. When the upload is complete, you will see a new folder called upload.
Step 3: Copy a Hive Script to the Sandbox
We will now use SCP to copy the hiveddl.sql file to the Sandbox. The procedure is slightly different for Windows and Mac, so both methods are described here.
Mac OS X -- Copy the hiveddl.sql File to the Sandbox
- Open a Terminal window and navigate to the "SentimentFiles/upload/hive" subfolder of the files you extracted previously. Type in the following commands, pressing the Enter key after each one:
scp -P 2222 hiveddl.sql root@127.0.0.1:
scp -P 2222 json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar root@127.0.0.1:
Notes:
- You must use an uppercase "P" for the "-P" in this command.
- You may be prompted to validate the authenticity of the host. If so, type "yes" when prompted.
- When prompted, type in the Sandbox password ("hadoop"), then press Enter. This command will copy the hiveddl.sql file to the /root folder on the Sandbox.
When the file transfer is complete, a confirmation message will appear in the terminal window.
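If you need to repeat the copy for several files, the command line can be assembled programmatically. The sketch below only builds the argument list; the helper name build_scp_command is hypothetical, and the default user, host, and port are assumptions taken from the Sandbox connection settings used in this tutorial (root, 127.0.0.1, port 2222).

```python
def build_scp_command(local_file, user="root", host="127.0.0.1", port=2222, remote_dir=""):
    """Return the scp argument list for copying local_file to the Sandbox.

    An empty remote_dir targets the remote user's home folder.
    Note the uppercase -P: scp uses it for the port number.
    """
    return ["scp", "-P", str(port), local_file, f"{user}@{host}:{remote_dir}"]


# To actually run the copy (you will be prompted for the password):
#   import subprocess
#   subprocess.run(build_scp_command("hiveddl.sql"), check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems if a file name contains spaces.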
Windows 7: Copy the Files to the Sandbox
On Windows, you will need to download and install the free WinSCP application.
- Open WinSCP and type in the following settings, then click Login.
- Host name: 127.0.0.1
- Port: 2222
- User name: root
- Type the Sandbox password ("hadoop") in the Password box, then click OK.
- Use the WinSCP file browser to navigate to the SentimentFiles\hive folder in the left-hand pane, and to the Sandbox /root folder in the right-hand pane. Drag and drop the hiveddl.sql file from the SentimentFiles\hive folder to the /root folder on the Sandbox.
Step 4: Run the Hive Script to Refine the Raw Data
- In the Hortonworks Sandbox virtual machine (VM) console window, press the Alt and F5 keys, then log in to the Sandbox using the following user name and password: Login: root; Password: hadoop. After you log in, the command prompt will appear with the prefix [root@sandbox ~]#.
- At the command prompt, type in the following command, then press the Enter key:
hive -f hiveddl.sql
The hiveddl.sql script has performed the following steps:
- Converted the raw Twitter data into a tabular format.
- Used the dictionary file to score the sentiment of each Tweet by the number of positive words compared to the number of negative words, and then assigned a positive, negative, or neutral sentiment value to each Tweet.
- Created a new table that includes the sentiment value for each Tweet.
- Open the Sandbox HUE user interface, then click HCatalog in the menu at the top of the page. Select the check box next to the "tweetsbi" table, then click Browse Data. The "tweetsbi" table is the table created by the Hive script that added a column with the sentiment value for each tweet. (Note, you may need to scroll right to see all of the columns.)
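To make the scoring idea concrete, here is a minimal Python sketch of dictionary-based sentiment scoring as described in the steps above: count positive and negative words, then label the tweet. It is not the contents of hiveddl.sql. The word lists, the function names, and the "id" and "text" JSON keys are illustrative assumptions; the tutorial's real dictionary file and table schema may differ.

```python
import json
import re

# Hypothetical word lists standing in for the tutorial's dictionary file.
POSITIVE_WORDS = {"great", "love", "awesome"}
NEGATIVE_WORDS = {"boring", "hate", "awful"}


def score_sentiment(text):
    """Label text by comparing its counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for word in words if word in POSITIVE_WORDS)
    negatives = sum(1 for word in words if word in NEGATIVE_WORDS)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"


def tweet_to_row(raw_line):
    """Flatten one JSON-encoded tweet into a row with a sentiment column."""
    tweet = json.loads(raw_line)
    text = tweet.get("text", "")
    return {"id": tweet.get("id"), "text": text, "sentiment": score_sentiment(text)}
```

For example, tweet_to_row('{"id": 1, "text": "Awesome film"}') yields a row whose sentiment is "positive", while a tweet with equal numbers of positive and negative words is labeled "neutral".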
Step 5: Access the Refined Sentiment Data with Excel
In this section, we will use Excel Professional Plus 2013 to access the refined sentiment data.
- In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.
- On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK. The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC.
- After the connection to the Sandbox is established, the Query Wizard appears. Select the "tweetsbi" table in the Available tables and columns box, then click the right arrow button to add the entire "tweetsbi" table to the query. Click Next to continue.
- Select the "text" column in the "Columns in your query" box, then click the left arrow button to remove the text column.
- After the "text" column has been removed, click Next to continue.
- On the Filter Data screen, click Next to continue without filtering the data.
- On the Sort Order screen, click Next to continue without setting a sort order.
- Click Finish on the Query Wizard Finish screen to retrieve the query data from the Sandbox and import it into Excel.
- On the Import Data dialog box, click OK to accept the default settings and import the data as a table.
- The imported query data appears in the Excel workbook.
Step 6: Visualize the Sentiment Data Using Excel Power View
Data visualization can help you optimize your website and convert more visits into sales and revenue. In this section we will see how sentiment varies by country, and review the sentiment data for the United States.
- In the Excel worksheet with the imported "tweetsbi" table, select Insert > Power View to open a new Power View report.
- The Power View Fields area appears on the right side of the window, with the data table displayed on the left. Drag the handles or click the Pop Out icon to maximize the size of the data table.
- In the Power View Fields area, clear the checkboxes next to the id and ts fields, then click Map on the Design tab in the top menu.
- The map view displays a global view of the data.
- Now let’s display the sentiment data by color. In the Power View Fields area, click sentiment, then select Add as Color.
- Under SIZE, click sentiment, then select Count (Not Blank).
- Now the map displays the sentiment data by color:
- Orange: positive
- Blue: negative
- Red: neutral
- Use the map controls to zoom in on Ireland. About half of the tweets have a positive sentiment score, as indicated by the color orange.
- Now use the map controls to zoom in on Mexico. In Mexico, about one-fifth of the tweets expressed negative sentiment (shown in blue), and only a small portion of the tweets were positive. Most tweets from Mexico were neutral, as shown in red.
- Next, use the map controls to zoom in on the sentiment data in China. Marvel Studios and the Chinese studio DMG co-financed Iron Man 3, and the cast included a famous Chinese actress. We can see that the majority of tweets from China are neutral, with positive sentiment slightly outweighing negative sentiment.
- The United States is the biggest market, so let’s look at sentiment data there. The size of the United States pie chart indicates that a relatively large number of the total tweets come from the US. About half of the tweets in the US show neutral sentiment, with a relatively small amount of negative sentiment.
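The pie charts on the map are, in effect, per-country counts of each sentiment value. If you want to check the proportions numerically rather than by eye, the following Python sketch computes the same breakdown. The function name sentiment_shares and the "country" and "sentiment" keys are assumptions for illustration; adjust them to the column names in your imported table.

```python
from collections import Counter, defaultdict


def sentiment_shares(rows):
    """Return {country: {sentiment: share}} for rows with 'country' and 'sentiment' keys."""
    counts = defaultdict(Counter)
    for row in rows:
        # Mirror "Count (Not Blank)": skip rows with no sentiment value.
        if row.get("sentiment"):
            counts[row["country"]][row["sentiment"]] += 1
    shares = {}
    for country, counter in counts.items():
        total = sum(counter.values())
        shares[country] = {label: n / total for label, n in counter.items()}
    return shares
```

With two made-up rows for a country "A", one positive and one neutral, sentiment_shares returns {"A": {"positive": 0.5, "neutral": 0.5}}, which corresponds to a pie chart split evenly between the two colors.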