In this project, you will play the part of a Big Data Application Developer who draws on the skills of both a Data Engineer and a Data Scientist, using multiple Big Data technologies provided by Hortonworks DataFlow (HDF) and Hortonworks Data Platform (HDP) to build a Real-Time Sentiment Analysis Application. For the application, you will learn to acquire tweet data from Twitter’s Decahose API and send the tweets to the Kafka topic “tweets” using NiFi. After finishing that NiFi flow, you will build a second NiFi flow that ingests data from the Kafka topic “tweetsSentiment” and stores it in HBase.
Next, you will learn to build a Spark machine learning model that classifies tweets as happy or sad and to export the model to HDFS. Before the model can be built, Spark requires the training data to be in a feature array, so you will first do some data cleansing with SparkSQL. Once the model is built, you will use Spark Structured Streaming to load the model from HDFS, pull in tweets from the Kafka topic “tweets”, add a sentiment score to each tweet, and stream the data to the Kafka topic “tweetsSentiment”. Finally, with Hive and HBase integration, you will run queries to verify that the data was stored successfully and to view the sentiment score for each tweet.
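To make the enrichment step concrete, here is a minimal Python sketch of what happens to a single message between the two Kafka topics. It is an illustration only, not the Spark Structured Streaming application built later in the series: plain lists stand in for the topics “tweets” and “tweetsSentiment”, the JSON field names `id`, `text`, and `sentiment` are assumptions, and `score_sentiment` is a hypothetical keyword rule standing in for the trained classification model.

```python
import json


def score_sentiment(text):
    """Hypothetical stand-in for the trained model: 1 means happy, 0 means sad."""
    return 1 if "happy" in text.lower() else 0


def enrich(message):
    """Parse one tweet JSON message, attach a sentiment score, and re-serialize it."""
    tweet = json.loads(message)
    # "text" and "sentiment" are assumed field names, chosen for illustration.
    tweet["sentiment"] = score_sentiment(tweet.get("text", ""))
    return json.dumps(tweet)


# In-memory lists stand in for the Kafka topics "tweets" and "tweetsSentiment".
tweets_topic = [
    json.dumps({"id": 1, "text": "So happy about the launch!"}),
    json.dumps({"id": 2, "text": "Feeling sad today."}),
]
tweets_sentiment_topic = [enrich(message) for message in tweets_topic]
```

Each output message carries every field of the input tweet plus the score, so a downstream consumer of “tweetsSentiment” can store the enriched record as-is.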
Big Data Technologies used to develop the Application:
- Twitter API
- HDF Sandbox
- HDP Sandbox
Goals and Objectives
- Learn to create a Twitter Application using Twitter’s Developer Portal to get KEYS and TOKENS for connecting to Twitter’s APIs
- Learn to create a NiFi Dataflow Application that integrates Twitter’s Decahose API to ingest tweets, perform some preprocessing, and store the data into the Kafka Topic “tweets”
- Learn to create a NiFi Dataflow Application that ingests the Kafka Topic “tweetsSentiment” to stream sentiment tweet data to HBase
- Learn to build a SparkSQL Application to clean the data and get it into a suitable format for building the sentiment classification model
- Learn to build a SparkML Application to train and validate a sentiment classification model using Gradient Boosting
- Learn to build a Spark Structured Streaming Application to stream tweet data from Kafka topic “tweets” on HDP to Kafka topic “tweetsSentiment” on HDF, attaching a sentiment score to each tweet based on the output of the classification model
- Learn to visualize the tweet sentiment score by using Zeppelin’s Hive interpreter mapping to the HBase table
Prerequisites
- Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox
- Read through Learning the Ropes of the HDP Sandbox to set up hostname mapping to IP address
- If you don’t have 32GB of dedicated RAM for HDP Sandbox, then refer to Deploying Hortonworks Sandbox on Microsoft Azure
- Enabled Connected Data Architecture:
The tutorial series consists of the following tutorial modules:
1. Application Development Concepts: You will be introduced to sentiment fundamentals: sentiment analysis, ways to perform the data analysis, and the various use cases.
2. Setting up the Development Environment: You will create a Twitter Application in Twitter’s Developer Portal for access to KEYS and TOKENS. You will then write shell code and perform Ambari REST API calls to set up a development environment.
3. Acquiring Twitter Data: You will build a NiFi Dataflow to ingest Twitter data, preprocess it, and store it into the Kafka Topic “tweets”. The second NiFi Dataflow you will build ingests the enriched sentiment tweet data from Kafka topic “tweetsSentiment” and streams the content of the flowfile to HBase.
4. Cleaning the Raw Twitter Data: You will create a Zeppelin notebook and use Zeppelin’s Spark Interpreter to clean the raw Twitter data in preparation for creating the sentiment classification model.
5. Building a Sentiment Classification Model: You will create a Zeppelin notebook and use Zeppelin’s Spark Interpreter to build a sentiment classification model that classifies tweets as Happy or Sad and exports the model to HDFS.
6. Deploying a Sentiment Classification Model: You will create a Scala IntelliJ project in which you develop a Spark Structured Streaming application that streams the data from Kafka topic “tweets” on HDP, processes the tweet JSON data by adding sentiment, and streams the data into Kafka topic “tweetsSentiment” on HDF.
7. Visualizing Sentiment Scores: You will use Zeppelin’s JDBC Hive Interpreter to perform SQL queries against the NoSQL HBase table “tweets_sentiment” for visual insight into tweet sentiment score.
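As a rough preview of the cleaning and model-building modules, the following Python sketch shows one common way to turn raw tweet text into a fixed-length feature array: tokenize the cleaned text, then count tokens into hashed buckets (the "hashing trick"). This is an assumption-laden illustration rather than the tutorial's SparkSQL or Spark ML code; the cleaning rules, the `clean` and `to_features` helpers, and the vector size of 16 are all hypothetical choices made for readability.

```python
import re
import zlib

NUM_FEATURES = 16  # assumed vector size, kept small so the result is easy to inspect


def clean(text):
    """Lowercase the text, drop URLs and @mentions, and return alphabetic tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return re.findall(r"[a-z]+", text)


def to_features(tokens, size=NUM_FEATURES):
    """Hash each token into one of `size` buckets and count the occurrences."""
    vector = [0] * size
    for token in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vector[zlib.crc32(token.encode("utf-8")) % size] += 1
    return vector


tokens = clean("@friend I am so HAPPY today! https://example.com/pic")
features = to_features(tokens)
```

Every tweet maps to a vector of the same length regardless of its wording, which is the property a classifier needs from its input. The trade-off is that distinct words can collide in the same bucket, so real pipelines typically use a much larger vector size.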