Analyzing Social Media and Customer Sentiment

If you run into any errors while completing this tutorial, please ask questions or notify us on Hortonworks Community Connection!

Analyzing Twitter Data With Apache NiFi and HDP Search

Introduction

In this tutorial we will learn how to install Apache NiFi on your Hortonworks Sandbox if you do not have it pre-installed already. Using NiFi, we will create a data flow that pulls tweets directly from the Twitter API.

We will then use Solr and LucidWorks HDP Search to view our streamed data in real time and gather insights as the data arrives in our Hadoop cluster.

Next, we will use Hive to analyze the social sentiment after we have finished collecting our data from NiFi.

Finally, we will use Apache Zeppelin to create charts so that we can visualize our data directly inside of our Hadoop cluster.

List of technologies in this tutorial:

Pre-Requisites

If you haven’t added sandbox.hortonworks.com to your list of hosts, you can do so with the following command on a Unix system:

echo "127.0.0.1     sandbox.hortonworks.com >> /etc/hosts

Outline

  1. Install Apache NiFi
  2. Configure and Start Solr
  3. Creating a Twitter Application
  4. Create a Data Flow with NiFi
  5. (Optional) Generating Random Twitter Data
  6. Analyze and Search Data with Solr
  7. Analyzing Tweet Data in Hive
  8. Visualizing Sentiment with Zeppelin

Install Apache NiFi


The first thing you’re going to need to do, if you haven’t done it already, is install the Apache NiFi service on your Sandbox.

Download Apache NiFi

If you haven’t already, you will need to download the GZipped version of Hortonworks DataFlow (HDF) from the website.

Send NiFi to the Sandbox

First we’re going to need to send the HDF file that was just downloaded to the Sandbox via SCP.

Assuming that HDF has been downloaded to your ~/Downloads/ directory and that the file is named nifi-1.1.1.0-12-bin.tar.gz, open up your terminal and type the following command:

Note: the -P argument is case sensitive.

scp -P 2222 ~/Downloads/nifi-1.1.1.0-12-bin.tar.gz root@localhost:/root

Once you’ve done that, you’ll need to SSH into the Sandbox.

SSH into the Sandbox

There are two options for connecting to your sandbox to execute terminal commands. The first is the terminal emulator at http://sandbox.hortonworks.com:4200. Alternatively, you can open a terminal on your computer and use the following command:

ssh root@sandbox.hortonworks.com -p 2222

If you’ve already logged into your sandbox through SSH before, your password will be different than the one below.

username: root
password: hadoop

Note that you will be prompted to change the root user’s password once you login to the sandbox. Do NOT forget this!

First, use the export command to create a temporary variable to store the name of the NiFi file which you just downloaded.

For example, if the filename is nifi-1.1.1.0-12-bin.tar.gz:

export NIFI=nifi-1.1.1.0-12-bin.tar.gz

Once you’ve successfully connected to the sandbox make sure that you’re in the directory /root/. Then run the following commands.

Make a new directory for NiFi

mkdir nifi

Move our file to the folder which we just created, then cd into the folder

mv $NIFI ./nifi
cd nifi

Extract the archive

tar -xvf $NIFI

Then let’s head into the directory we just extracted. Its name will be the same as the $NIFI variable, except without the -bin.tar.gz at the end. In this case the command is:

cd nifi-1.1.1.0-12

Next we’re going to need to change the port which NiFi runs on from 8080 to 9090, since port 8080 is already used by Ambari on the Sandbox.

Inside the conf/nifi.properties file, find the line which has nifi.web.http.port. Make sure it looks like the following:

nifi.web.http.port=9090
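
If you’d rather make that change from the command line instead of a text editor, a one-line sed substitution should do the same thing (a sketch, assuming you are still inside the extracted NiFi directory):

# rewrite the port setting in place, keeping a .bak copy of the original file
sed -i.bak 's/^nifi.web.http.port=.*/nifi.web.http.port=9090/' conf/nifi.properties
# confirm the change took effect
grep 'nifi.web.http.port' conf/nifi.properties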

You can now start NiFi! Use the nifi.sh script to start the application.

bash bin/nifi.sh start

After a few short moments NiFi will start up on the Sandbox.

Make sure you can reach the NiFi user interface at http://sandbox.hortonworks.com:9090/nifi.

If you can’t access it, first wait approximately 10-15 seconds after executing the command and then try again. If you still can’t connect, you might need to forward port 9090 on your virtual machine.

For VirtualBox you can forward the port in two ways: either through the GUI, or using the command line on the host machine.

Forwarding a Port on the Host Machine’s Terminal

First you’ll need to run the following command:

VBoxManage list vms

Look for the Hortonworks Sandbox VM and take note of its ID. Once you’ve taken note of the ID, run the following command to forward the port:

VBoxManage controlvm {INSERT_VM_ID_HERE} natpf1 nifi,tcp,,9090,,9090

Example:

HW11108:~ zblanco$ VBoxManage list vms
"Hortonworks Sandbox with HDP 2.3.2" {2d299b17-3b10-412a-a895-0bf958f98788}

HW11108:~ zblanco$ VBoxManage controlvm 2d299b17-3b10-412a-a895-0bf958f98788 natpf1 nifi,tcp,,9090,,9090

Port 9090 should now be forwarded! You may skip the GUI section of port forwarding.

Forwarding a Port with the GUI

  1. Open VirtualBox Manager
  2. Right click your running Hortonworks Sandbox, click Settings
  3. Go to the Network Tab
  4. Click the button that says Port Forwarding. Add an entry with the following values:

     Name: NiFi
     Protocol: TCP
     Host IP: 127.0.0.1
     Host Port: 9090
     Guest IP: (leave blank)
     Guest Port: 9090

Port Forward NiFi

You should now be able to access the NiFi user interface at http://sandbox.hortonworks.com:9090/nifi.

NiFi Interface

Configure and Start Solr

Hortonworks Sandbox with HDP 2.3.2 has Lucidworks HDP Search pre-installed.

We just need to make a few quick changes.

First, we need to modify some file permissions. Open your terminal shell, SSH back into the sandbox, and execute the following:

chown -R solr:solr /opt/lucidworks-hdpsearch/solr

We’re going to need to run the following commands as the solr user. Run:

su solr

Then we need to edit the following file to make sure that Solr can recognize a tweet’s timestamp format. First we’re going to copy the config set over to a different place:

cp -r /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs /opt/lucidworks-hdpsearch/solr/server/solr/configsets/tweet_configs
vi /opt/lucidworks-hdpsearch/solr/server/solr/configsets/tweet_configs/conf/solrconfig.xml

Once the file is opened in vi type

Note: In vi the command below should not be run in INSERT mode. / will do a find for the text that you type after it.

/solr.ParseDateFieldUpdateProcessorFactory

This will bring you to the part of the config where we need to add the following:

<str>EEE MMM d HH:mm:ss Z yyyy</str>

Make sure this is inserted just above all of the other <str> tags.

Note: In vi, to type or insert anything you must be in insert mode. Press i on your keyboard to enter insert mode.

After inserting the above, the portion of the file should look something like this:

<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <arr name="format">
      <str>EEE MMM d HH:mm:ss Z yyyy</str>
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
      </arr>
    </processor>
</processor>

Finally press the Escape key on your keyboard and type :wq to save and close the solrconfig.xml file.

Next we need to replace a JSON file. Use the following commands to move the original and download the replacement file:

cd /opt/lucidworks-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/

mv default.json default.json.orig

wget https://raw.githubusercontent.com/ZacBlanco/hwx-tutorials/hdp-2.3/assets/nifi-sentiment-analytics/assets/default.json

Now we’re going to start Solr. Execute

/opt/lucidworks-hdpsearch/solr/bin/solr start -c -z localhost:2181

Then we are going to add a collection called “tweets”

/opt/lucidworks-hdpsearch/solr/bin/solr create -c tweets -d tweet_configs -s 1 -rf 1
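
To confirm that the collection was created, you can list all collections through Solr’s Collections API (a quick check, assuming Solr is reachable on its default port 8983):

curl "http://sandbox.hortonworks.com:8983/solr/admin/collections?action=LIST&wt=json"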

We can now go back to running commands as the root user. Run

exit

This will log you out of the solr user.

Lastly, we need to update the system time on the sandbox so that the tweets’ timestamps are processed correctly in NiFi:

yum install -y ntp
service ntpd stop
ntpdate pool.ntp.org
service ntpd start

Great! Now Solr should be installed and running on your sandbox!

Ensure that you can access the Solr UI by navigating to http://sandbox.hortonworks.com:8983/solr/

Solr UI

Creating a Twitter Application

If you would rather not register your own Twitter application and would prefer to use pre-generated data, please skip ahead to the Generating Random Tweet Data section, where a script will generate a sample dataset for you.

If you want to pull live data from Twitter in this tutorial you’ll need to register your own Twitter application. It’s quite simple and only takes a few short steps

First head over to the Twitter Apps Website and Sign In using your Twitter account (or make one if you don’t have one yet!)

Then click Create a New App.

Creating Twitter App

After you’ve clicked that you’ll need to fill in some details about your application. Feel free to put whatever you want.

Twitter App Details

Then click Create Your Twitter Application at the bottom of the screen after reading the developer agreement.

Note that you might need to add your mobile phone to your Twitter account before creating your application

Once you’ve done that you should be greeted by a dashboard for your Twitter application. Head over to the permissions tab and select the Read Only Option and Update your application.

Changing App Permission

Next, you need to generate your OAuth tokens. You can do this by clicking Test OAuth at the top of the permissions page, or by heading to Keys and Access Tokens and finding the option that allows you to generate your OAuth tokens.

Finally, your keys and access tokens should look similar to the following:

Twitter Tokens

Please make note of your Consumer Key, Consumer Secret, Access Token, and Access Token Secret. You will need these to create the data flow in NiFi.

Create a Data Flow with NiFi

The first thing you’ll need to do is download the NiFi data flow template for the Twitter Dashboard using the link below.

Download

Make note of where you download this file. You’ll need it in the next step.

Open up the NiFi user interface found at http://sandbox.hortonworks.com:9090/nifi. Then you’ll need to import the template you just downloaded into NiFi.

Import the template by clicking the Templates icon in the top right corner of the screen (third from the right).

NiFi Templates Icon

Then click Browse and navigate to the Twitter_Dashboard.xml file that you just downloaded.

NiFi Template Browse

Once you’ve selected the file you can click Import.

NiFi Import Template

You should now see the template appear below.

NiFi Template Imported

Now that we’ve got the template imported into NiFi we can instantiate it. Drag the template icon (the seventh from the left) onto the workspace.

Drag Template Icon

Then a dialog box should appear. Make sure that Twitter Dashboard is selected and click Add.

Instantiate Template

After clicking Add you should have a screen similar to the following:

Imported Dashboard

Great! The NiFi flow has been set up. The boxes are what NiFi calls processors. Processors can be connected to one another to make data flow, and each processor performs a specific task. They are at the very heart of NiFi’s functionality.

Note! You can make your flows look very clean by having the connections between all of your processors at 90-degree angles with respect to one another. You can do this by double-clicking a connection arrow to create a vertex. This allows you to customize the look of your flow.

Try right-clicking on a few of the processors and looking at their configuration. This can help you better understand how the Twitter flow works.

Now we’ll need to configure the Twitter Hose processor with the access tokens that we made earlier for our Twitter application.

Right click on the Grab Garden Hose element and click Configure

Configure Garden Hose

Then you’re going to need to place each of the Twitter API tokens from earlier in its respective field. Then hit Apply.

NiFi Tokens

Once you’ve got all of your properties set up, you can take a look at the configurations of some of the other processors in our data flow.

Once you’ve done that, head to the top of the page and click the play button to watch the tweets roll in! Note that all of the red squares have now turned into green arrows.

If only one of the boxes changes when you click Start, make sure that you don’t have any specific processor selected. Deselect things by simply clicking on the blank area of the screen.

Starting NiFi Flow

Generating Random Tweet Data for Hive and Solr

This section is for anyone who didn’t want to set up a Twitter app to stream their own data. We’re just going to use a script to generate some data and then put that into Hive and Solr. Skip to the next section if you have already set up NiFi to collect tweets.

First you’ll need to SSH into the sandbox and execute the following command:

wget https://raw.githubusercontent.com/ZacBlanco/hwx-tutorials/hdp-2.3/assets/nifi-sentiment-analytics/assets/twitter-gen.sh

Then run the script with the number of tweets that you would like to generate:

bash twitter-gen.sh {NUMBER_OF_TWEETS}

Example:

bash twitter-gen.sh 2000

The script will generate the data and put it in the directory /tmp/data/
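
Before moving on, it’s worth making sure the generator actually produced files. This is a quick check that assumes /tmp/data/ is on the sandbox’s local filesystem; if the script loads the data into HDFS instead, use the second command:

# list the generated files on the local filesystem
ls -lh /tmp/data/
# if the data was written to HDFS rather than the local filesystem, check there instead
hadoop fs -ls /tmp/data/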

You can now continue on with the rest of the tutorial.

Analyze and Search Data with Solr


Now that we have our data in HDP-Search/Solr we can go ahead and start searching through our data.
If you are using NiFi to stream the data you can head over to the Banana Dashboard at http://sandbox.hortonworks.com:8983/solr/banana/

The dashboard is defined by the default.json file that we downloaded previously. You can find out more about Banana here.

You should be able to see the constant flow of data here, and you can analyze some of it as it is dropped into the Solr index from NiFi. Try exploring the charts to see what each one does. It is important to note that all of the graphs on the page are built from data queried straight from Solr and rendered with d3.js. You can see the query behind each graph by clicking the small gear icon located in each box.

Banana Dashboard

Note If you didn’t use NiFi to import the data from Twitter then you won’t see anything on the dashboard.

Let’s go do some custom searches on the data! Head back to the normal Solr dashboard at http://sandbox.hortonworks.com:8983/solr

Select the tweets shard that we created before from the Core Selector menu on the bottom left of the screen.

Solr Core Selector

Once you’ve selected the tweets shard we can take a look to see what Solr has done with our data.

Solr Tweets Index

  1. We can see how many documents, or records, have been stored in this index in Solr. As long as NiFi continues to run, this number will grow as more data is ingested. If you used the twitter-gen.sh script, then this number should be close to the number of tweets that you generated.
  2. Here we can see the size on the disk that the data is taking up in Solr. We don’t have many tweets collected yet, so this number is quite small.
  3. On the left side bar there are a number of different tabs to view the data that’s stored within Solr. We’re going to focus on the Query one, but you should explore the others as well.

Click on the Query tab, and you should be brought to a screen similar to the following:

Solr Query Dash

We’re only going to be using 3 of these fields before we execute any queries, but let’s quickly outline the different query parameters

  • fq: This is a filter query parameter; it lets us retrieve only data that contains certain values we’re looking for. For example, we can specify that only tweets after a certain time should be returned.
  • sort: Self-explanatory. You can sort by a specified field in ascending or descending order. We could return all tweets in alphabetical order of Twitter handles, or possibly by the time they were tweeted.
  • start, rows: These tell Solr where in the index to start searching and how many rows to return when the query executes. The defaults are 0 and 10, respectively.
  • fl: Short for field list, this specifies which fields you want returned. If the data has many, many fields, you can choose to return only a few of them in the query.
  • df: Short for default field, this tells Solr which fields it should search in. You will not need this if the query fields are already defined.
  • Raw Query Params: These are added directly to the URL that is requested when Solr sends the request with all of the query information.
  • wt: This is the format in which Solr will return the data. We can specify formats such as JSON, XML, or CSV.

We aren’t going to worry about the rest of the parameters. Without entering any parameters, click Execute Query.
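
The same parameters can also be sent straight to Solr over HTTP, which is useful for scripting. Here is a minimal sketch using the field names that appear later in this section (the exact fields in your index depend on the NiFi flow):

curl "http://sandbox.hortonworks.com:8983/solr/tweets/select?q=language_s:en&rows=5&fl=screenName_s,text_t&wt=json"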

Solr Query Results 1

From this you should be able to view all of the tweet data that has been collected. Try playing with some of the parameters, and increase the rows value in the query to see how many results you can obtain.

Now let’s do a real query and see if we can find some valuable data.

  • For q type language_s:en
  • For sort type screenName_s asc
  • For rows type 150
  • For fl type screenName_s, text_t
  • For wt choose csv

Solr Query Results 2

Let’s try one last query. This time you can omit the sort field and choose whichever wt format you like. Keep the fl parameter as is, though.

  • Specify an fq parameter as language_s:en
  • In the query box, pick any keyword. I am going to use stock

Solr Query Results 3

Further Reading

Analyzing Tweet Data in Hive


Now that we’ve taken a look at some of our data and searched it with Solr, let’s see if we can refine it a bit more.

We’re going to attempt to get the sentiment of each tweet by matching the words in the tweets with a sentiment dictionary. From this we can determine the sentiment of each tweet and analyze it from there.

First off, if your Twitter flow on the NiFi instance is still running, you’ll need to shut it off. Open up the NiFi dashboard at http://sandbox.hortonworks.com:9090/nifi and click the red square (stop button) at the top of the screen.

Turning off NiFi

Next, you’ll need to SSH into the sandbox again and run the following two commands

sudo -u hdfs hadoop fs -chown -R admin /tmp/tweets_staging
sudo -u hdfs hadoop fs -chmod -R 777 /tmp/tweets_staging
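
Before heading to Hive, it can help to confirm that tweets actually landed in HDFS and that the ownership change took effect (a quick check, assuming the NiFi flow wrote to /tmp/tweets_staging as configured):

# list the staged tweet files along with their owner and permissions
hadoop fs -ls /tmp/tweets_staging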

After the commands complete, let’s go to the Hive view. Head over to http://sandbox.hortonworks.com:8080 and log in with the credentials admin/admin. Use the dropdown menu at the top to get to the Hive view.

Execute the following command to create a table for the tweets

CREATE EXTERNAL TABLE IF NOT EXISTS tweets_text(
  tweet_id bigint, 
  created_unixtime bigint, 
  created_time string,
  lang string,
  displayname string, 
  time_zone string,
  msg string,
  fulltext string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY "|"
LOCATION "/tmp/tweets_staging";

Hive Tweets Table

Now we’re going to need to do some data analysis.

First you’re going to need to head to the HDFS Files View and create a new directory in /tmp/data/tables

Then create two new directories inside of /tmp/data/tables: one named time_zone_map and another named dictionary.

Data Table Folders

Upload the dictionary.tsv file to the dictionary directory and the time_zone_map.tsv file to the time_zone_map directory.

After doing so, you’ll need to run the following command on the Sandbox:

sudo -u hdfs hadoop fs -chmod -R 777 /tmp/data/tables

Finally, run the following two statements in the Hive view:

CREATE EXTERNAL TABLE if not exists dictionary (
    type string,
    length int,
    word string,
    pos string, 
    stemmed string, 
    polarity string )
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE
LOCATION '/tmp/data/tables/dictionary';

CREATE EXTERNAL TABLE if not exists time_zone_map (
    time_zone string,
    country string,
    notes string )
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE
LOCATION '/tmp/data/tables/time_zone_map';
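
To verify that Hive picked up the uploaded .tsv files, you can peek at a few rows of each lookup table (a small sanity check against the tables just defined):

-- the dictionary should list words with a polarity; the map should pair time zones with countries
SELECT * FROM dictionary LIMIT 5;
SELECT * FROM time_zone_map LIMIT 5;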

This will create two tables from that data which we will use to analyze the tweet sentiment. They should appear in the database explorer as shown below.

Data Table Folders

Next we’ll need to create two views over our tweets which will simplify the columns of data we have access to.

CREATE VIEW IF NOT EXISTS tweets_simple AS
SELECT
  tweet_id,
  cast ( from_unixtime( unix_timestamp(concat( '2015 ', substring(created_time,5,15)), 'yyyy MMM dd hh:mm:ss')) as timestamp) ts,
  msg,
  time_zone 
FROM tweets_text;

CREATE VIEW IF NOT EXISTS tweets_clean AS
SELECT
  t.tweet_id,
  t.ts,
  t.msg,
  m.country 
 FROM tweets_simple t LEFT OUTER JOIN time_zone_map m ON t.time_zone = m.time_zone;
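
The ts expression in tweets_simple is dense, so here is a worked example of what it does to a single created_time value (an illustrative query; the hard-coded '2015 ' prefix assumes the tweets were collected in 2015, since Twitter’s timestamp format puts the year at the end):

-- substring(...,5,15) extracts "Jun 06 03:21:14" from "Sat Jun 06 03:21:14 +0000 2015";
-- the year is prepended and the string is parsed into a proper Hive timestamp
SELECT cast(from_unixtime(unix_timestamp(
         concat('2015 ', substring('Sat Jun 06 03:21:14 +0000 2015', 5, 15)),
         'yyyy MMM dd hh:mm:ss')) as timestamp);
-- returns 2015-06-06 03:21:14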

After running the above commands you should be able to run SELECT * FROM tweets_clean LIMIT 100; which should yield results:

Data Table Folders

Now that we’ve cleaned our data we can get around to computing the sentiment. Use the following Hive commands to create some views that will allow us to do that.

-- Compute sentiment
create view IF NOT EXISTS l1 as select tweet_id, words from tweets_text lateral view explode(sentences(lower(msg))) dummy as words;

create view IF NOT EXISTS l2 as select tweet_id, word from l1 lateral view explode( words ) dummy as word;

create view IF NOT EXISTS l3 as select 
    tweet_id, 
    l2.word, 
    case d.polarity 
      when  'negative' then -1
      when 'positive' then 1 
      else 0 end as polarity 
 from l2 left outer join dictionary d on l2.word = d.word;
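
If the lateral view explode chain looks opaque, this standalone query illustrates what the first step does to one message (a made-up sample tweet, purely for illustration):

-- sentences() splits lowercased text into an array of sentences, each an array of words;
-- the explode() calls in l1 and l2 then flatten that structure into one word per row
SELECT sentences(lower('Loving the new dashboard! It works great.'));
-- returns something like [["loving","the","new","dashboard"],["it","works","great"]]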

Now that we have computed sentiment values for each word, we can assign whether a tweet was positive, neutral, or negative. Use this next Hive command to do that.

 create table IF NOT EXISTS tweets_sentiment stored as orc as select 
  tweet_id, 
  case 
    when sum( polarity ) > 0 then 'positive' 
    when sum( polarity ) < 0 then 'negative'  
    else 'neutral' end as sentiment 
 from l3 group by tweet_id;

Lastly, to make our analysis somewhat easier we are going to turn those ‘positive’, ‘negative’, and ‘neutral’ values into numerical values using the next Hive command

CREATE TABLE IF NOT EXISTS tweetsbi 
STORED AS ORC
AS SELECT 
  t.*,
  case s.sentiment 
    when 'positive' then 2 
    when 'neutral' then 1 
    when 'negative' then 0 
  end as sentiment  
FROM tweets_clean t LEFT OUTER JOIN tweets_sentiment s on t.tweet_id = s.tweet_id;

This command should yield our final results table as shown below.

Hive Sentiment Analysis Results
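
With tweetsbi in place, a quick aggregate shows how the numeric sentiment values are distributed across the tweets you collected (an optional check using the table just created):

-- 0 = negative, 1 = neutral, 2 = positive
SELECT sentiment, count(*) AS tweet_count
FROM tweetsbi
GROUP BY sentiment;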

Try the new Hive Visualization tab!

On the right hand side of the screen try clicking the graph icon. It will bring up a new tab where you can directly create charts using your query results in Hive!

Now that we can access the sentiment data in our Hive table let’s do some visualization on the analysis using Apache Zeppelin.

Visualizing Sentiment With Zeppelin


Make sure your Zeppelin service is started in Ambari, then head over to the Zeppelin view.

Hive Sentiment Analysis Results

Use the Notebook dropdown menu at the top of the screen and click + Create New Note. After which, you can name the note Sentiment Analysis.

Hive Sentiment Analysis Results

After creating the note, open it up to the blank Notebook screen and type the following command.

%hive
select * from tweetsbi LIMIT 300

We’re limiting our query to just 300 results because we don’t need to see everything right now, and if you’ve collected a lot of data from NiFi, a larger result set could slow down your computer.

  • Arrange your results so that your chart is a bar graph.
  • Set the tweetsbi.country column as the key and tweetsbi.sentiment as the value.
  • Make sure that sentiment is labeled as COUNT.
  • Run the query by clicking the arrow on the right hand side, or by pressing Shift+Enter.

Your results should look like the following:

First Zeppelin Query Results

After looking at the results, we see that when we group by country, many tweets are actually labeled as null.

For the sake of visualization, let’s filter out of our select statement any tweets that have a country value of “null”, and increase our result limit to 500.

Scroll down to the next note, create and run the following query, and set up the results the same way as above.

%hive
select * from tweetsbi where country != "null" LIMIT 500

Non Null Countries in Results

Great! Now, given the data we have, we can at least get an idea of the distribution of users whose tweets come from certain countries!

You can also experiment with this and try a pie chart as well.

Pie chart of above results

In our original raw tweet data from NiFi we also collected the language of each user, so we can get an idea of the distribution of languages as well!

Run the following query and set:

  • lang as the key
  • COUNT of lang as the value

%hive
select lang, time_zone from tweets_text LIMIT 1000

Pie chart of language results

Recall from our earlier analysis in Hive:

  • A negative sentiment value is 0.
  • A neutral sentiment value is 1.
  • A positive sentiment value is 2.

Using this we can now look at individual countries and see the sentiment distributions of each.

%hive
select sentiment, count(country), country from tweetsbi group by sentiment, country having country != "null"

Sentiment Comparison

Using this data you can determine how you might want to market your products to different countries!

Further Reading

We hope you enjoyed the tutorial! If you’ve had any trouble completing this tutorial or require assistance, please head on over to Hortonworks Community Connection where hundreds of Hadoop experts are ready to help!

Comments

srikrishna
|
April 18, 2014 at 8:25 am
|

how do you it for some other movie or key word . What files are there to be modified.

Will H
|
May 14, 2014 at 11:00 am
|

I am having an issue running the initial hive script. Not sure if I’m doing something wrong but am seeing the following error:

[root@sandbox ~]# hive -f hiveddl.sql

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Added json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar to class path
Added resource: json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V
[root@sandbox ~]#

    Rohit Gore
    |
    January 11, 2015 at 10:12 pm
    |

    i am facing the same problem
    [root@sandbox ~]# hive -f hiveddl.sql

    Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
    Added json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar to class path
    Added resource: json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V
    [root@sandbox ~]#

nitesh
|
May 16, 2014 at 5:49 am
|

Hi, when i run the hiveddl.sql script all it does it just creates bunch of tables and views and no data.. when I opened the script I found no LOAD statement in there.Is it possible the script is not complete or am i missing something here?

Vik
|
May 19, 2014 at 3:16 pm
|

Does not work on HDP2.1 with Hive .13; Created the json-serde again, but still fails.

Driver returned: 1. Errors: OK
converting to local hdfs://sandbox.hortonworks.com:8020/user/hue/upload/upload/json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar
Added /tmp/08cf0f24-0df6-4b44-8890-6150f2873398_resources/json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar to class path
Added resource: /tmp/08cf0f24-0df6-4b44-8890-6150f2873398_resources/json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Could not initialize class org.openx.data.jsonserde.objectinspector.JsonObjectInspectorFactory

Yogesh Sobale
|
May 20, 2014 at 3:19 am
|

In given example, the version for json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar is not compatible with hive installed in HDP 2.1.
After debugging I found that hive installed in sandbox has jar hive-serde-0.13.0.2.1.1.0-385.jar which is not compatible with json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar. It is having issue with the constructor AbstractPrimitiveJavaObjectInspector(). Can you please provide the correct jar ?

kavitha
|
June 10, 2014 at 1:13 am
|

http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-sentiment-data/

I followed the same steps above, but while executing the hiveddl.sql, i get an error:

Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

Please help.

kavitha
|
June 10, 2014 at 1:32 am
|

http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-sentiment-data/

I have followed the same steps, but while executing the hiveddl.sql, i get the following error.

Executed in putty :

Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

Execeuted from HUE shell

Driver returned: 1. Errors: OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Could not initialize class org.openx.data.jsonserde.objectinspector.JsonObjectInspectorFactory

Please help.

Tri Nguyen
|
June 19, 2014 at 2:23 pm
|

Came here from the flume page http://hortonworks.com/hadoop/flume/
The description about the data collection using Flume is almost non-existent. The article should at least add a description about the flume configuration file.

    Jules S. Damji
    |
    July 16, 2014 at 12:59 pm
    |

    Thanks. Will take note.

Gaurav
|
July 9, 2014 at 5:53 am
|

Hi I have tried to execute this from sandbox , logged in with root…but during the execution of hiveddl.sql, i faced this error:

Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

Any idea what it might be due to?

    Rohit Gore
    |
    January 12, 2015 at 9:58 pm
    |

    Hiii Gaurav i am also facing same issue. Please help me to resolve this if you have a solution

François
|
July 15, 2014 at 5:56 am
|

Hello,
Thank you for this very interesting tutorial. But I have a problem:
I try to connect to WinSCP with the arguments you offer but the connection is refused.
Do you know this problem?
Thank you in advance
François

    Masoud
    |
    August 5, 2014 at 10:29 am
    |

    I have the same problem.

    Masoud
    |
    August 5, 2014 at 10:39 am
    |

    Port should be 22 not 2222.

Sagar Prasad
|
August 6, 2014 at 12:52 am
|

For issue : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V

I tried suggestion from Brandon (https://github.com/hortonworks/hadoop-tutorials/issues/30) and it worked for me.
This issue is caused by an incompatibility with the SerDe jar that is packed with the demo. To fix this, use the version of SerDe that is now packaged with HCatalog. Make the following changes to hiveddl.sql

1) comment out or remove the first line that adds the SerDe jar
–ADD JAR json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar;
2) Change line 34 to refer to the HCatalog version of SerDe
ROW FORMAT SERDE ‘org.apache.hive.hcatalog.data.JsonSerDe’

    |
    September 6, 2014 at 10:14 pm
    |

    Hi prasad .. Thanks … I followed same steps. I see Map 100% Reduce 100% but still got exception. And now trying to rerun gettong AlreadyExistsException Table tweets_raw already exists

      Mungeol Heo
      |
      November 6, 2014 at 10:21 pm
      |

      Try to delete hive tables that the script created, then rerun it.

Viswanath
|
December 18, 2014 at 6:25 pm
|

can you put more information on how you used FLUME to get twitter data

    Karl
    |
    July 28, 2015 at 11:17 pm
    |

    I found your remark. I would be interested in information getting twitter data with flume into Haddoop?

    I would appreciate your help.

    Kind regards,
    Karl

Pramod
|
December 29, 2014 at 3:37 am
|

Hi,
I followed the exact step mention above. But no data display in excel for tweetsbi table only column header is there. Please help what went wrong.

Thanks.

Angelo
|
August 18, 2015 at 6:08 am
|

Someone could help me? i have this error.
Added [json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [json-serde-1.1.6-SNAPSHOT-jar-with-dependencies.jar]
FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:10924)
at org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:45856)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HiveParser.java:38211)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(HiveParser.java:36342)
at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java:39707)
at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:38655)
at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:38367)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.java:38051)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HiveParser.java:36203)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5214)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2640)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:396)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 12:6 Failed to recognize predicate ‘user’. Failed rule: ‘identifier’ in column specification
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.

|
August 20, 2015 at 8:03 am
|

It looks like some changes have been made in recent HDP releases that prevent this tutorial from working as it is written. I’m working with the tutorial team at Hortonworks to try to get this updated, but in the meantime I’ve put together a script that fixes the issues and installs the tutorial data on the latest HDP sandbox. It’s available here:

https://github.com/hwx-se-ne/hdp-twitterdemo

Until we get the tutorial updated, please try the above to run this.

(Disclaimer: while I am an employee of Hortonworks on the NE US Solutions Engineering team, this is not an official update to the tutorial. It’s just something I put together on my own to get things working and wanted to share. However, if you have any problems I’m happy to try to help. Just reach out to me at rmccollam [at] hortonworks [dot] com.)

|
September 5, 2015 at 8:07 pm
|

There is one more way to avoid the reserved keyword like ‘user’ is to set following property before creating tweets_raw table.

set hive.support.sql11.reserved.keywords=false;

Also its good to download the latest hive json serde if you are using current version of hive – 1.2

ADD JAR json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar;

Here is the place where you can download it https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master/dist

|
October 18, 2015 at 12:17 am
|

what are the automated tools required for analyzing data.

olli_whatever
|
October 18, 2015 at 1:01 pm
|

Perhaps this would be helpful: https://github.com/hortonworks/hadoop-tutorials/issues/52

By the way, I have also edited the paths. As I understand, you upload the whole folder “SentimentFiles”, not only “upload”: e.g.
LOCATION ‘/user/hue/SentimentFiles/SentimentFiles/upload/data/time_zone_map’

Alex_B
|
November 24, 2015 at 3:40 pm
|

Hello all,

well, I tried everything. I applied the fix, changed values but still nothing is working. Is there a “full” solution incoming.. ?

|
January 1, 2016 at 7:52 am
|

Hi All!
This look fantastic. Important word here “LOOK” but doesn’t work…
Like Alex_B, I tried everything with the 2 fix listed on git, without succes.
Don’t waste anytime on this. At the end I get : Error running Hive DDL script! Aborting…

Would be nice to update the tuto when the VM is updated!
When there will be a “full working” solution.. ?

VUSI
|
January 20, 2016 at 2:04 am
|

HOW DO YOU LOAD TWITTER FEEDS INTO HDFS

vusi
|
January 20, 2016 at 11:03 pm
|

how would you load twitter feeds into hdfs using Flume

Saad
|
January 21, 2016 at 8:24 pm
|

Error: Class path contains multiple SLF4J bindings. Please help

vusi
|
January 25, 2016 at 4:45 am
|

i get an error on the following command :
/opt/lucidworks-hdpsearch/solr/bin/solr create -c tweets -d tweet_configs -s 1 -rf 1

it says : Error CREATEing SolrCore ‘tweets_shard1_replica1’: Unable to create core [tweets_shard1_replica1] Caused by: The element type \”procesor\” must be terminated by the matching end-tag \”\”.”}}

omarabdillah
|
January 30, 2016 at 12:29 am
|

I have a problem when sending the downloaded Nifi from my local computer to the Sandbx via SCP.
There was password authentication, but when I type my correct password, the Access denied.

Can anyone help me how to solve it?

    omarabdillah
    |
    January 30, 2016 at 2:00 am
    |

    I’ve solved the problem by changing the permission in root directory.

    Thanks

omarabdillah
|
January 30, 2016 at 4:05 am
|

Hi everyone,

I need your help guys for solving my problem in implementing this service.
My problem is I can not access http://sandbox.hortonworks.com:9090/nifi

I’ve changed nifi.web.http.port=9090 in conf/nifi.properties

And have run the script bash bin/nifi.sh start

I also have forward the port in my VirtualBox Manager in my Azure account and the nifi port was there.

Is there any one have the solution about it?

    Zachary Blanco
    |
    January 31, 2016 at 8:32 am
    |

    Hi there. If you’re using Azure, then instead of using sandbox.hortonworks.com – you should use the Azure hostname/IP instead of using “sandbox.hortonworks.com”.

      omarabdillah
      |
      February 1, 2016 at 3:46 am
      |

      Thanks, Blanco for the response.

      I have tried to use my Azure hostname/IP, but still can not to open the Nifi, and the Solr too.

      And I think, for port 9090, 8080, and the port for Solr 8983.

      Do you have any idea what should I do for the next?

      Thanks

Farid
|
February 1, 2016 at 8:53 am
|

Though the article is interesting, the page http://hortonworks.com/use-cases/sentiment-analysis-hadoop-example/ links to this page as a tutorial for using Flume/HCatalog/Hive combination for analyzing the sentimental data, but this tutorial is for NiFi/Solr/Zepplin.

Malek.BS
|
February 5, 2016 at 6:18 am
|

Hi,
Very interessant tutorials and i’m facing one problem with NIFI when i try to collect data from tweeter.
I created tweeter apps and in NIFI the processor wich is responsable to get tweeter “GetTweeter” show this error:
ERROR
GetTweeter [id=………………………………………….]
received error HTTP_ERROR: HTTP/1.1 401 authorization Requiered. will attempt to reconnect.

the same error is always repeating, i’ve checked the configuration of the processor and the coonsumer secret, token ..etc are well indicated, i can connect to internet and the firewalls are shut down.
Please can anywone help me to solve the problem i didn’t found any good answer on the internet.
Many Thanks
