Analyzing Social Media and Customer Sentiment With Apache NiFi and HDP Search

Introduction

In this tutorial, we will learn how to install Apache NiFi on the Hortonworks Sandbox if it is not pre-installed already. Using NiFi, we will create a data flow to pull tweets directly from the Twitter API.

We will use Solr and the LucidWorks HDP Search to view our streamed data in real time and gather insights as the data arrives in our Hadoop cluster.

Next, we will use Hive to analyze the social sentiment after we have finished collecting our data from NiFi.

Finally, we will use Apache Zeppelin to create charts, so we can visualize our data directly inside of our Hadoop cluster.

List of technologies in this tutorial:

  • Apache NiFi
  • Solr + LucidWorks HDP Search
  • Apache Hive
  • Apache Zeppelin

Pre-Requisites

  • Downloaded and installed the Hortonworks Sandbox
  • Reviewed Learning the Ropes of the Hortonworks Sandbox

Outline

  1. Install Apache NiFi
  2. Configure and Start Solr
  3. Create a Twitter Application
  4. Create a Data Flow with NiFi
  5. (Optional) Generating Random Twitter Data
  6. Analyze and Search Data with Solr
  7. Analyze Tweet Data in Hive
  8. Visualize Sentiment with Zeppelin

Install Apache NiFi


If you haven't done so already, the first thing you need to do is install the Apache NiFi service on your Sandbox. Follow the Set up NiFi Environment section of Analyze Traffic Pattern with Apache NiFi.

Configure and Start Solr

Make sure that Ambari Infra is stopped; we now need to install HDP Search.

Log in to Ambari with the user credentials: Username raj_ops and Password raj_ops. Click on the Actions button at the bottom and then Add Service:

actions_button

Next, you will view a list of services that you can add. Scroll to the bottom and select Solr, then press Next.

check_solr

Accept all default values on the next few pages, and then you can see the progress of your installation:

solr_progress

After a minute, you can see Solr successfully installed:

solr_install_success

Press Next; you will be asked to restart some services. Restart HDFS, YARN, MapReduce2 and HBase.

We just need to make a few quick changes.

Open your terminal shell and SSH back into the sandbox. We're going to need to run the following commands as the solr user, so run:

su solr

Then we need to edit the Solr configuration so that Solr can recognize a tweet's timestamp format. First we're going to copy the default config set over to a new tweet_configs folder:

cp -r /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs /opt/lucidworks-hdpsearch/solr/server/solr/configsets/tweet_configs
vi /opt/lucidworks-hdpsearch/solr/server/solr/configsets/tweet_configs/conf/solrconfig.xml

su_solr

Once the file is opened in vi, type:

Note In vi the command below should not be run in INSERT mode. / will do a find for the text that you type after it.

/solr.ParseDateFieldUpdateProcessorFactory

This will bring you to the part of the config where we need to add the following:

<str>EEE MMM d HH:mm:ss Z yyyy</str>

Make sure this is inserted just above all of the other <str> tags.

Note In vi, to type or insert anything you must be in insert mode. Press i on your keyboard to enter insert mode in vi.

After inserting the above, the portion of the file should look something like this:

<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <arr name="format">
      <str>EEE MMM d HH:mm:ss Z yyyy</str>
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
      </arr>
    </processor>
</processor>

Finally press the Escape key on your keyboard and type :wq to save and close the solrconfig.xml file.

Next we need to replace a JSON file. Use the following commands to move the original and download the replacement file:

cd /opt/lucidworks-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/

mv default.json default.json.orig

wget https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/default.json

wget_json

Then we are going to add a collection called “tweets”:

/opt/lucidworks-hdpsearch/solr/bin/solr create -c tweets -d tweet_configs -s 1 -rf 1 -p 8983

Note: In this command:
  • -c indicates the name of the collection
  • -d is the config directory
  • -s is the number of shards
  • -rf is the replication factor
  • -p is the port on which Solr is running

add collection tweets
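If you want to double-check that the collection was created, here is an optional sanity check, a sketch assuming Solr is still running on the default port 8983 used above:

/opt/lucidworks-hdpsearch/solr/bin/solr status

# list the cores/collections Solr knows about via the CoreAdmin API
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"

You should see an entry for the tweets collection in the output.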

We can now go back to running commands as the root user. Run

exit

This will log you out of the solr user.

Great! Now Solr should be installed and running on your sandbox!

Ensure that you can access the Solr UI by navigating to http://sandbox.hortonworks.com:8983/solr/

Solr UI

Create a Twitter Application

If you would rather not register your own Twitter application and use previous data, please head to the next section where you can download the sample dataset.

If you want to pull live data from Twitter in this tutorial, you'll need to register your own Twitter application. It's quite simple and only takes a few short steps.

First, head over to the Twitter Apps website and sign in using your Twitter account (or make one if you don't have one yet!).

Then click Create a New App.

Creating Twitter App

After you’ve clicked that you’ll need to fill in some details about your application. Feel free to put whatever you want.

Twitter App Details

Then click Create Your Twitter Application at the bottom of the screen after reading the developer agreement.

Note that you might need to add your mobile phone number to your Twitter account before creating your application.

Once you've done that, you should be greeted by a dashboard for your Twitter application. Head over to the Permissions tab, select the Read Only option, and update your application.

Changing App Permission

Finally you need to generate your OAuth key. You can do this by clicking Test OAuth on the top of the permissions page, or by heading to Keys and Access Tokens and then finding the option that allows you to generate your OAuth tokens.

Once generated, your keys and access tokens should look similar to the following:

Twitter Tokens

Please make note of your Consumer Key, Consumer Secret, Access Token, and Access Token Secret. You will need these to create the data flow in NiFi.

Create a Data Flow with NiFi

The first thing you'll need to do is download the NiFi data flow template for the Twitter Dashboard here.

Make note of where you download this file. You’ll need it in the next step.

Open up the NiFi user interface found at http://sandbox.hortonworks.com:9090/nifi. Then you’ll need to import the template you just downloaded into NiFi.

Import the template by clicking the Templates icon on the right of the Operate box.

click_import_template

Then click on the search icon to select the template and navigate to the Twitter_JSON_Flow.xml file that you just downloaded.

search_template

Once you’ve selected the file you can click UPLOAD.

upload_template_json

You should now see a success message confirming that your file has been uploaded. Press OK.

template_uploaded

Now that we've got the template imported into NiFi, we can instantiate it. Drag the template icon (the 7th from the left) onto the workspace.

drag_template

Then a dialog box should appear. Make sure that Twitter_JSON_Flow is selected and click Add.

add_template_json

After clicking ADD you should have a screen similar to the following:

work_flow

Great! The NiFi flow has been set up. The boxes are what NiFi calls processors. Processors can be connected to one another to move data through the flow, and each processor performs a specific task. They are at the very heart of NiFi's functionality.

Note! You can make your flows look very clean by keeping the connections between all of your processors at 90-degree angles with respect to one another. You can do this by double-clicking a connection arrow to create a vertex, which allows you to customize the look of your flow.

Try right-clicking on a few of the processors and looking at their configuration. This can help you better understand how the Twitter flow works.

Now we’ll need to configure the Twitter Hose processor with the access tokens that we made earlier for our Twitter application.

Right-click on the Grab Garden Hose element and click Configure.

grab_garden

You're going to need to place all of those Twitter API tokens from earlier in their respective fields, then hit Apply.

configure_processor

Once you've got all of your properties set up, you can take a look at the configurations of some of the other processors in our data flow.

The processors are now valid, since the warning symbols have disappeared. Notice that the processors have a stop symbol stop_signal in the upper left corner and are ready to run. To select all processors, hold down the shift key and drag your mouse across the entire data flow.

Now that all processors are selected, go to the actions toolbar and click the start button play_signal. You can see your workflow running.
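Once the flow has been running for a minute or two, you can optionally verify from a terminal that tweets are landing in HDFS. This is a hedged check that assumes the template's PutHDFS processor writes to /tmp/tweets_staging, which is the directory the Hive section reads from later:

hadoop fs -ls /tmp/tweets_staging | head

If files are listed there, the flow is working end to end.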

Generating Random Tweet Data for Hive and Solr

This section is for anyone who didn’t want to set up a Twitter app so they could stream custom data. We’re just going to use a script to generate some data and then put that into Hive and Solr. Skip to the next section if you have already set up NiFi to collect tweets.

First you'll need to SSH into the sandbox and execute the following command:

wget https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/nifi-sentiment-analytics/assets/twitter-gen.sh

Then run the script with the number of tweets that you would like to generate:

bash twitter-gen.sh {NUMBER_OF_TWEETS}

Example:

bash twitter-gen.sh 2000

The script will generate the data and put it in the directory /tmp/data/.
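As an optional check, you can confirm the generated files exist before moving on. Depending on how the script stages the data, the files may be on the local filesystem, in HDFS, or both, so try either of the following:

ls /tmp/data/
hadoop fs -ls /tmp/data/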

You can now continue on with the rest of the tutorial.

Analyze and Search Data with Solr


Now that we have our data in HDP Search/Solr, we can go ahead and start searching through it.
If you are using NiFi to stream the data, you can head over to the Banana dashboard at http://sandbox.hortonworks.com:8983/solr/banana/index.html

The dashboard is defined by the default.json file that we downloaded previously. You can find out more about Banana here.

You should be able to see the constant flow of data here, and you can analyze some of it as it is dropped into the Solr index from NiFi. Try exploring the charts and see what each one does. It is important to note that all of the graphs on the page are built with d3.js from data queried straight from Solr. You can see the queries for each graph by clicking the small gear icon located in each box.

Banana Dashboard

Note If you didn’t use NiFi to import the data from Twitter then you won’t see anything on the dashboard.

Let's do some custom searching on the data! Head back to the normal Solr dashboard at http://sandbox.hortonworks.com:8983/solr.

Select the tweets shard that we created before from the Core Selector menu on the bottom left of the screen.

Solr Core Selector

Once you’ve selected the tweets shard we can take a look to see what Solr has done with our data.

Solr Tweets Index

  1. We can see how many documents or records have been stored into this index in Solr. As long as NiFi continues to run, this number will grow as more data is ingested. If you used the twitter-gen.sh script, then this number should be close to the number of tweets that you generated.
  2. Here we can see the size on the disk that the data is taking up in Solr. We don’t have many tweets collected yet, so this number is quite small.
  3. On the left side bar there are a number of different tabs to view the data that’s stored within Solr. We’re going to focus on the Query one, but you should explore the others as well.

Click on the query tab, and you should be brought to a screen similar to the following:

Solr Query Dash

We're only going to be using three of these fields before we execute any queries, but let's quickly outline the different query parameters:

  • fq: This is the filter query parameter. It lets us retrieve data that only contains certain values we're looking for. For example, we can specify that only tweets after a certain time should be returned.
  • sort: Self-explanatory. You can sort by a specified field in ascending or descending order. We could return all tweets in alphabetical order of Twitter handles, or possibly by the time they were tweeted as well.
  • start, rows: This tells us where exactly in the index we should start searching, and how many rows should be returned when we execute the query. The defaults for these are 0 and 10 respectively.
  • fl: Short for field list, this specifies which fields should be returned. If the data has many, many fields, you can choose to return only a few of them in the query.
  • df: Short for default field, this tells Solr which fields it should search in. You will not need this if the query fields are already defined.
  • Raw Query Params: These will be added directly to the URL that is requested when Solr sends the request with all of the query information.
  • wt: This is the format of the data that Solr will return. We can specify many formats such as JSON, XML, or CSV.

We aren't going to worry about the rest of the flags. Without entering any parameters, click Execute Query.

Solr Query Results 1

From this you should be able to view all of the tweet data that has been collected. Try playing with some of the parameters and increasing the rows value in the query to see how many results you can obtain.

Now let’s do a real query and see if we can find some valuable data.

  • For q type language_s:en
  • For sort type screenName_s asc
  • For rows type 150
  • For fl type screenName_s, text_t
  • For wt choose csv

Solr Query Results 2
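The same query can also be issued directly against Solr's HTTP API, which is essentially what the admin UI does behind the scenes. This is an optional sketch that assumes the sandbox hostname and port used earlier:

curl "http://sandbox.hortonworks.com:8983/solr/tweets/select?q=language_s:en&sort=screenName_s+asc&rows=150&fl=screenName_s,text_t&wt=csv"

The response should be the same CSV list of screen names and tweet text that the UI returned.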

Let's try one last query. This time you can omit the sort field and choose whichever wt format you like. Keep the fl parameter as is, though.

  • Specify an fq parameter as language_s:en
  • In the query box, pick any keyword. I am going to use stock.

Solr Query Results 3

Further Reading

Analyze Tweet Data in Hive


Now that we’ve taken a look at some of our data and searched it with Solr, let’s see if we can refine it a bit more.

But before moving ahead, let us set up the Hive-JSON-Serde to read the data in JSON format.
We have to use Maven to compile the SerDe. Go back to the terminal and follow the steps below to set up Maven:

wget http://mirror.olnevhost.net/pub/apache/maven/binaries/apache-maven-3.2.1-bin.tar.gz

Now, extract this file:

tar xvf apache-maven-3.2.1-bin.tar.gz

install_maven
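As a quick optional check that the extraction worked, you can print the Maven version (run from the directory where you extracted the archive):

./apache-maven-3.2.1/bin/mvn -version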

Now that Maven is installed, let us download the Hive-JSON-Serde. Type the following command:

git clone https://github.com/rcongiu/Hive-JSON-Serde

This command creates a new directory; go into that directory using cd:

cd Hive-JSON-Serde

Next, run the command to compile the SerDe:

./../apache-maven-3.2.1/bin/mvn -Phdp23 clean package

install_json_serde

Wait for it to complete, and then copy the SerDe jar to the Hive lib directories:

cp json-serde/target/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar /usr/hdp/2.5.0.0-1245/hive/lib
cp json-serde/target/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar /usr/hdp/2.5.0.0-1245/hive2/lib

copy_jars

DO NOT forget to restart Hive from Ambari.

restart_hive

We’re going to attempt to get the sentiment of each tweet by matching the words in the tweets with a sentiment dictionary. From this we can determine the sentiment of each tweet and analyze it from there.

First off, if your Twitter flow on the NiFi instance is still running, you'll need to shut it off. Open up the NiFi dashboard at sandbox.hortonworks.com:9090/nifi and click the stop button stop_signal at the top of the screen.

Next, you'll need to SSH into the sandbox again and run the following commands (use the Virtualbox or Azure set depending on your environment):

# Virtualbox  
	sudo -u hdfs hadoop fs -chown -R maria_dev /tmp/tweets_staging
	sudo -u hdfs hadoop fs -chmod -R 777 /tmp/tweets_staging
# Azure	    
	sudo -u hdfs hadoop fs -chown -R azure /tmp/tweets_staging
	sudo -u hdfs hadoop fs -chmod -R 777 /tmp/tweets_staging

change permission of tweets staging
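If you want to confirm the new owner and permissions took effect, an optional check:

hadoop fs -ls -d /tmp/tweets_staging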

After the commands complete, let's go to the Hive view. Head over to http://sandbox.hortonworks.com:8080 and log in to Ambari. Refer to Learning the Ropes of the Hortonworks Sandbox if you need assistance with logging into Ambari.

Note: the login credentials are maria_dev/maria_dev (Virtualbox) or azure/azure (Azure). Use the dropdown menu at the top to get to the Hive view.

Execute the following command to create a table for the tweets:

CREATE EXTERNAL TABLE IF NOT EXISTS tweets_text(
  tweet_id bigint,
  created_unixtime bigint,
  created_time string,
  lang string,
  displayname string,
  time_zone string,
  msg string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_staging';

create_table_json
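As an optional sanity check (assuming NiFi or the twitter-gen.sh script has already written some JSON files to /tmp/tweets_staging, and that the SerDe jar copied earlier is on Hive's classpath), you can confirm the table is readable:

SELECT displayname, lang, msg
FROM tweets_text
LIMIT 5;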

Now we’re going to need to do some data analysis.

First you're going to need to head to the HDFS Files View and create a new directory, /tmp/data/tables.

Then create two new directories inside of /tmp/data/tables: one named time_zone_map and another named dictionary.

Data Table Folders

You'll need to upload the dictionary.tsv file and the time_zone_map.tsv file to their respective directories.

After doing so, you’ll need to run the following command on the Sandbox:

sudo -u hdfs hadoop fs -chmod -R 777 /tmp/data/tables

modify permissions tables folder

Finally, run the following two commands:

CREATE EXTERNAL TABLE if not exists dictionary (
	type string,
	length int,
	word string,
	pos string,
	stemmed string,
	polarity string )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/tmp/data/tables/dictionary';

create dictionary table

CREATE EXTERNAL TABLE if not exists time_zone_map (
    time_zone string,
    country string,
    notes string )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/tmp/data/tables/time_zone_map';

create time zone map table

This will create two tables from that data which we will use to analyze the tweet sentiment. They should appear in the database explorer as shown below.

Note Refresh the page if the explorer doesn't appear automatically.

Data Table Folders
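Optionally, you can verify that both tables picked up the uploaded .tsv files with a couple of quick queries:

SELECT COUNT(*) FROM dictionary;
SELECT COUNT(*) FROM time_zone_map;
-- peek at a few dictionary entries and their polarity
SELECT word, polarity FROM dictionary LIMIT 5;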

Next we'll need to create two views over our tweets which will simplify the columns of the data we have access to.

CREATE VIEW IF NOT EXISTS tweets_simple AS
SELECT
  tweet_id,
  cast ( from_unixtime( unix_timestamp(concat( '2016 ', substring(created_time,5,15)), 'yyyy MMM dd hh:mm:ss')) as timestamp) ts,
  msg,
  time_zone
FROM tweets_text;

create tweets simple view
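To see what the nested date expression in that view does, here is a small illustration. The sample timestamp below is made up, but it follows the tweet format we added to Solr earlier (EEE MMM d HH:mm:ss Z yyyy); substring(...,5,15) keeps just the month, day, and time, which the view then prefixes with a hard-coded year before converting to a timestamp:

-- returns 'Jun 24 17:33:15'
SELECT substring('Fri Jun 24 17:33:15 +0000 2016', 5, 15);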

ADD JAR /usr/hdp/2.5.0.0-1245/hive2/lib/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar;

CREATE VIEW IF NOT EXISTS tweets_clean AS
SELECT
  t.tweet_id,
  t.ts,
  t.msg,
  m.country
 FROM tweets_simple t LEFT OUTER JOIN time_zone_map m ON t.time_zone = m.time_zone;

add_jar

After running the above commands, you should be able to run SELECT * FROM tweets_clean LIMIT 100;, which should yield results:

Data Table Folders

Now that we’ve cleaned our data we can get around to computing the sentiment. Use the following Hive commands to create some views that will allow us to do that.

-- Compute sentiment
create view IF NOT EXISTS l1 as select tweet_id, words from tweets_text lateral view explode(sentences(lower(msg))) dummy as words;

create view IF NOT EXISTS l2 as select tweet_id, word from l1 lateral view explode( words ) dummy as word;

create view IF NOT EXISTS l3 as select
    tweet_id,
    l2.word,
    case d.polarity
      when  'negative' then -1
      when 'positive' then 1
      else 0 end as polarity
 from l2 left outer join dictionary d on l2.word = d.word;

compute sentiment
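To get a feel for what those views are doing, here is a small self-contained illustration. sentences() splits lower-cased text into an array of sentences (each itself an array of words), and the two lateral-view explodes flatten that into one word per row so each word can be joined against the dictionary (the sample text below is made up):

SELECT sentences(lower('NiFi is great. Solr makes search easy!'));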

Now that we have computed sentiment values for individual words, we can determine whether each tweet was positive, neutral, or negative. Use this next Hive command to do that:

create table IF NOT EXISTS tweets_sentiment stored as orc as select
  tweet_id,
  case
    when sum( polarity ) > 0 then 'positive'
    when sum( polarity ) < 0 then 'negative'  
    else 'neutral' end as sentiment
from l3 group by tweet_id;

compute sentiment values
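An optional quick check of how the sentiment labels are distributed before moving on:

SELECT sentiment, COUNT(*) AS tweet_count
FROM tweets_sentiment
GROUP BY sentiment;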

Lastly, to make our analysis somewhat easier, we are going to turn those ‘positive’, ‘negative’, and ‘neutral’ values into numerical values using the next Hive command:

CREATE TABLE IF NOT EXISTS tweetsbi
STORED AS ORC
AS SELECT
  t.*,
  case s.sentiment
    when 'positive' then 2
    when 'neutral' then 1
    when 'negative' then 0
  end as sentiment  
FROM tweets_clean t LEFT OUTER JOIN tweets_sentiment s on t.tweet_id = s.tweet_id;

Hive Sentiment Analysis Results

This command should yield our final results table as shown below.

Hive Sentiment Analysis Results

Try the new Hive Visualization tab!

On the right-hand side of the screen, try clicking the graph icon in the column (row 3). It will bring up a new tab where you can directly create charts using your query results in Hive!

Now that we can access the sentiment data in our Hive table let’s do some visualization on the analysis using Apache Zeppelin.

Visualize Sentiment With Zeppelin


Make sure your Zeppelin service is started in Ambari, then head over to the Zeppelin UI at http://sandbox.hortonworks.com:9995.

Hive Sentiment Analysis Results

Use the Notebook dropdown menu at the top of the screen and click + Create New Note, after which you can name the note Sentiment Analysis.

Hive Sentiment Analysis Results

After creating the note, open it up to the blank Notebook screen and type the following command.

%hive
select * from tweetsbi LIMIT 300

We're limiting our query to just 300 results because right now we don't need to see everything, and if you've collected a lot of data from NiFi, a large result set could slow down your computer.

  • Arrange your results so that your chart is a bar graph.
  • Set the tweetsbi.country column as the key and tweetsbi.sentiment as the value.
  • Make sure that sentiment is labeled as COUNT.
  • Run the query by clicking the arrow on the right hand side, or by pressing Shift+Enter.

Your results should look like the following:

First Zeppelin Query Results

After looking at the results, we see that when we group by country, many tweets are actually labeled as null.

For the sake of visualization, let's exclude any tweets with a country value of “null” from our select statement, and increase our result limit to 500.

Scroll down to the next note, create and run the following query, and set up the results the same way as above.

Note Before running Hive queries, restart the Spark interpreter, since Spark jobs take up cluster resources. Click the Interpreter tab located near the Zeppelin logo at the top of the page, and under Spark click the button that says restart.

%hive
select * from tweetsbi where country != "null" LIMIT 500

Non Null Countries in Results

Great! Now, given the data we have, we can at least get an idea of the distribution of users whose tweets come from certain countries!

You can also experiment with this and try a pie chart as well.

Pie chart of above results

In our original raw tweet data from NiFi, we also collected each user's language, so we can get an idea of the distribution of languages as well!

Run the following query and set:

  • lang as the key
  • COUNT for lang in values

%hive
select lang, time_zone from tweets_text LIMIT 1000

Pie chart of language results

Recall from our earlier analysis in Hive:

  • A bad or negative sentiment is 0.
  • A neutral sentiment value is 1.
  • A positive sentiment value is 2.

Using this we can now look at individual countries and see the sentiment distributions of each.

%hive
select sentiment, count(country), country from tweetsbi group by sentiment, country having country != "null"

Sentiment Comparison

Using this data you can determine how you might want to market your products to different countries!

Further Reading