Natural Language Processing and Sentiment Analysis for Retailers using HDP and ITC Infotech Radar

Introduction

RADAR is a software solution for retailers, built with ITC Handy tools (an NLP and sentiment analysis engine) and Hadoop technologies including HDFS, YARN, Apache Storm, Apache Solr, Oozie and ZooKeeper. It helps retailers maximize sales through data-driven continuous re-pricing.

High Level Architecture diagram

Use Case: Using RADAR, a brick-and-mortar or online retailer can track the following for any number of products in its portfolio:

  • Social Sentiment for each product
  • Competitive Pricing/promotions being offered in social media and on the web

Using this information, retailers can create continuous re-pricing campaigns that can be implemented in real time in their pricing systems. RADAR can then track the impact of re-pricing on sales and continuously display it on a dashboard alongside social sentiment.

Part of the RADAR tool can also be exposed to the retailer's customers, showing them social sentiment alongside comparative reviews for the products the retailer sells. This helps close sales faster because the customer has more information on which to base a decision.

The ITC Infotech HDP 2.1 Real-time Analytics Dashboard Application for Retail (RADAR) demonstration showcases Apache Storm for real-time processing of data and Solr for indexing and data analysis. The demonstration is built on top of a proprietary text analysis engine developed by ITC Infotech.

A data set of about 1,500 TV models was extracted, and the natural language text associated with each model was analyzed using this engine. Data for each television model was gathered from a very large number of user reviews and web pages. Tweets, on the other hand, are processed in real time to extract relevant deals on televisions.

Instructions

Installation of Solr indexes with rpm

  1. First install Solr, following instructions at https://hortonworks.com/hadoop-tutorial/searching-data-Solr/. If Solr is already running, stop it.
  2. Download the appropriate file:
    a) For the UI (HTML) and indexes for tweets and user reviews data (52MB): http://hdp2.1.s3.amazonaws.com/itcinfotech-RADAR-with-reviews-tweets-1.0-1.noarch.rpm OR
    b) For the UI (HTML) and indexes for tweets, web pages and user reviews data (550MB): http://hdp2.1.s3.amazonaws.com/itcinfotech-RADAR-with-web-reviews-tweets-1.0-1.noarch.rpm OR
    c) For the UI (HTML) only (25KB; the indexes must be downloaded and installed separately): http://hdp2.1.s3.amazonaws.com/itcinfotech-RADAR-UIonly.rpm
  3. Install the rpm. The example below assumes option a), saved to /tmp; adjust the file name and path to match your download.
    rpm -ivh /tmp/itcinfotech-RADAR-with-reviews-tweets-1.0-1.noarch.rpm
  4. Test the install
    After you install the rpm and start Solr, browse to http://sandbox/RADAR. You should see a screen with a search box.

    Once you start typing in the search box, an autocomplete function lists the matching TV models (see the autocomplete screenshot below). Once a specific model is selected, the corresponding results appear.
    The results screen shows the attributes automatically generated by the text analysis engine. The relative size of each rectangular box represents the importance of the attribute. Each box is color coded by sentiment (red: negative, green: positive, yellow: evenly balanced).
    Hovering over a box displays the number of matching relevant text snippets. Clicking a box shows some of the snippets for that attribute for that specific television model.
    With the radio buttons, you can choose between results from web pages pertaining to a TV model and results from user reviews for the selected model. Clicking the Twitter button shows deals extracted from tweets (in real time if you set up Apache Storm) for that model or brand.
    Autocomplete for TV models

    Treemap of attributes extracted from web pages for Samsung UN46F7500 46-Inch 1080p 240Hz 3D Ultra Slim Smart LED HDTV

    Deal information extracted from tweets in real time

    Treemap of attributes extracted from text of user reviews for Samsung UN46F7500 46-Inch 1080p 240Hz 3D Ultra Slim Smart LED HDTV

    To uninstall the rpm:

    rpm -e itcinfotech-RADAR
    

Installation of Solr indexes without rpm

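  1. Download the prebuilt indexes from one of the following:
    a) https://s3-us-west-1.amazonaws.com/hdp2.1/indexes.tgz (55MB. Indexes for user reviews, web pages and tweets) OR
    b) http://hdp2.1.s3.amazonaws.com/indexes-small.tgz (55MB. Indexes for user reviews and tweets)
  2. Untar the indexes in the Solr home directory. The commands below assume the archive was saved in that directory; use indexes-small.tgz if you downloaded option b).

    cd /opt/solr-4.7.2/example/solr
    
    tar -xvf indexes.tgz
    
  3. Start Solr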
    

Troubleshooting

If you do not see a treemap, browse to http://sandbox:8983/solr. Clicking on the core selector should show the collections used for this demo.

If you do not see any of the collections, add them one by one: click Core Admin -> Add Core.

Add the rdemo collection. If you downloaded the larger index (or rpm), add the rdemoweb collection as well.
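
The same check and the Add Core step can also be done from the command line through Solr's CoreAdmin HTTP API. The following is a minimal sketch, not part of the original demo: the function names are hypothetical, and it assumes that each core's instanceDir has the same name as the core, which depends on how the index archive was unpacked.

```shell
#!/bin/sh
# Sketch: build Solr CoreAdmin URLs for listing and adding the demo cores.
# SOLR_URL defaults to the address used in this tutorial (assumption: adjust if yours differs).
SOLR_URL="${SOLR_URL:-http://sandbox:8983/solr}"

core_status_url() {
  # action=STATUS reports the cores that Solr currently has loaded.
  printf '%s/admin/cores?action=STATUS&wt=json\n' "$SOLR_URL"
}

core_create_url() {
  # $1 = core name, e.g. rdemo.
  # Assumption: the instanceDir is named after the core.
  printf '%s/admin/cores?action=CREATE&name=%s&instanceDir=%s\n' "$SOLR_URL" "$1" "$1"
}
```

For example, `curl -s "$(core_status_url)"` lists the loaded cores, and `curl -s "$(core_create_url rdemo)"` asks Solr to add the rdemo core.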

Running the Apache Storm component of the RADAR demo:

  1. Download the RADAR jar file from http://hdp2.1.s3.amazonaws.com/radar-0.9.2-incubating-SNAPSHOT-jar-with-dependencies.jar
  2. Download a sample properties file from http://hdp2.1.s3.amazonaws.com/storm.properties
  3. Download a sample set of tweets from http://hdp2.1.s3.amazonaws.com/radar-sample-tweets
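
The three downloads above can be scripted. The sketch below is one possible way to do it with curl; the function names are hypothetical, and /tmp is chosen as the destination only because the storm commands later in this tutorial read the files from /tmp.

```shell
#!/bin/sh
# Sketch: fetch the three Storm demo artifacts into one directory.
# DEST_DIR defaults to /tmp to match the paths used in the storm commands.
DEST_DIR="${DEST_DIR:-/tmp}"
BASE_URL="http://hdp2.1.s3.amazonaws.com"

# File names taken from the download links above, one per line.
RADAR_FILES="radar-0.9.2-incubating-SNAPSHOT-jar-with-dependencies.jar
storm.properties
radar-sample-tweets"

dest_path() {
  # $1 = file name; prints where that file will be saved.
  printf '%s/%s\n' "$DEST_DIR" "$1"
}

fetch_radar_files() {
  # -f fails on HTTP errors, -L follows redirects, -o names the output file.
  for name in $RADAR_FILES; do
    curl -fL -o "$(dest_path "$name")" "$BASE_URL/$name" || return 1
  done
}
```

Calling `fetch_radar_files` downloads all three files and stops at the first failure.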

Starting the Storm related Services

  1. Before starting the Storm services, remove httpclient.jar and httpcore.jar from the /usr/lib/storm/lib directory in the sandbox, because those jars are older than version 4.2 and this solution requires a version above 4.2. You do not need to put updated jars in the lib directory; they are already bundled with the code.
  2. Start Ambari by executing the start_ambari.sh script in the /root directory.
  3. Log in to the Ambari UI (by default, both the user name and the password are admin) and check that all 5 Storm services are running at http://sandbox:8080/#/main/services/STORM/summary. If the services are not running, start them in the following sequence, beginning with Nimbus:
    Nimbus -> Supervisors -> Storm UI Server -> DRPC Server -> Storm REST API Server
  4. Once the Storm services are started, you can run the demo either with the sample tweets or by connecting to the Twitter real-time stream. In both cases the data is processed and indexed into Solr.
  5. To process the sample tweets, submit the topology to Storm with the following command (change the paths to the files as appropriate)
    storm jar /tmp/radar-0.9.2-incubating-SNAPSHOT-jar-with-dependencies.jar com.itcinfotech.radar.TweetFromFileIndexingTopology /tmp/storm.properties /tmp/radar-sample-tweets tweetIndexerFile
  6. To make sure the topology was submitted correctly, go to the Storm UI at http://sandbox:8744/ and verify in the Topology Summary section that the topology is in the Active state.
  7. To process data from Twitter in real time, edit storm.properties and put in the credentials for your Twitter application (create an application at http://twitter.com/apps). Put in values for the following:
    a. consumer.key= (API key)
    b. consumer.secret= (API secret)
    c. access.token=
    d. access.token.secret=
    Then submit the topology to Storm with the following command (adjust your paths as appropriate)

    storm jar /tmp/radar-0.9.2-incubating-SNAPSHOT-jar-with-dependencies.jar com.itcinfotech.radar.TweetIndexingTopology /tmp/storm.properties tweetIndexer
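
    Before submitting the real-time topology, it may help to confirm that all four credential keys in storm.properties have values. The sketch below is a hypothetical helper, not part of the RADAR distribution; it only checks that each key listed in step 7 is followed by a non-blank value and does not validate the credentials themselves.

```shell
#!/bin/sh
# Sketch: report which Twitter credential keys in storm.properties are unset.
# The key names come from step 7; the helper name is hypothetical.

missing_twitter_keys() {
  # $1 = path to storm.properties; prints one missing key per line.
  for key in consumer.key consumer.secret access.token access.token.secret; do
    # Escape the dots so they match literally in the regular expression.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    # A key counts as set only when a non-blank value follows the "=".
    if ! grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*[^[:space:]]" "$1"; then
      printf '%s\n' "$key"
    fi
  done
}
```

    Running `missing_twitter_keys /tmp/storm.properties` prints nothing when every key has a value.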
    

    If these run fine, you will see tweets being added to the Solr index named twdeals. You can then see them in the UI as well at http://sandbox/RADAR -> (twitter radio button)
    *Replace sandbox with your sandbox IP address if the hostname does not work.
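
    One way to check from the command line that tweets are reaching the twdeals index is to ask Solr for the document count. This is a rough sketch with hypothetical function names; it assumes Solr answers at the address used in the Troubleshooting section and that the response is compact JSON with a numFound field.

```shell
#!/bin/sh
# Sketch: count the documents in the twdeals core through Solr's select handler.
# SOLR_URL is an assumption; the core name twdeals comes from the text above.
SOLR_URL="${SOLR_URL:-http://sandbox:8983/solr}"

num_found() {
  # Reads a Solr JSON response on stdin and prints its numFound value.
  sed -n 's/.*"numFound":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

twdeals_count() {
  # rows=0 asks only for the match count, not for the documents themselves.
  curl -s "$SOLR_URL/twdeals/select?q=*:*&rows=0&wt=json" | num_found
}
```

    Running `twdeals_count` a few times while the topology is active should show the number growing if tweets are being indexed and committed.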