
Hadoop Tutorial – Getting Started with HDP


Data Reporting With Zeppelin

Introduction

This tutorial introduces you to Apache Zeppelin and teaches you how to visualize data with it.

Prerequisites

This tutorial is part of a series of hands-on tutorials to get you started on HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites before proceeding with this tutorial.

Outline

- Apache Zeppelin
- Create a Zeppelin Notebook
- Execute a Hive Query
- Build Charts using Zeppelin
- Summary
- Further Reading

Apache Zeppelin

Apache Zeppelin provides a powerful web-based notebook platform for data analysis and discovery.
Behind the scenes it supports Spark distributed contexts as well as other language bindings on top of Spark.

In this tutorial we will be using Apache Zeppelin to run SQL queries on our geolocation, trucks, and
riskfactor data that we’ve collected earlier and visualize the result through graphs and charts.

Create a Zeppelin Notebook

Navigate to Zeppelin Notebook

Open the Zeppelin interface in your browser at:

http://sandbox-hdp.hortonworks.com:9995/

[Image: welcome-to-zeppelin]

Click the Notebook tab at the top left and select Create new note. Name your notebook Driver Risk Factor.

[Image: create-new-notebook]

Download the Data

If you had trouble completing the previous tutorial, or lost the risk factor data, click here to download it and upload it to HDFS under /tmp/data/.

[Image: save-risk-factor]
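If you only want to follow along with the queries, a tiny stand-in file with the same shape can be created locally. This is a sketch under assumptions: the column names driverid and riskfactor are inferred from the queries later in this tutorial, and the rows are made-up sample values, not the real dataset.

```python
import csv

# Made-up sample rows standing in for the real riskfactor data
# (column names inferred from the queries in this tutorial).
sample_rows = [
    {"driverid": "A1", "riskfactor": "1.5"},
    {"driverid": "A2", "riskfactor": "57.7"},
    {"driverid": "A3", "riskfactor": "4.3"},
]

# Write the file with a header row, as spark.read with
# option("header", "true") expects.
with open("riskfactor.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["driverid", "riskfactor"])
    writer.writeheader()
    writer.writerows(sample_rows)

# Read it back to confirm the header parsed correctly.
with open("riskfactor.csv") as f:
    rows = list(csv.DictReader(f))
print(len(rows))  # 3
```

From a sandbox shell, the file can then be uploaded with `hdfs dfs -put riskfactor.csv /tmp/data/`.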

Execute a Hive Query

Visualize finalresults Data in Tabular Format

In the previous Spark tutorial you created a table, finalresults (also called riskfactor), which gives the risk factor associated with every driver. We will use the data in this table to visualize which drivers have the highest risk factor, writing our queries in Zeppelin with the %spark2 interpreter.

1. Copy and paste the code below into your Zeppelin note.

%spark2
val hiveContext = org.apache.spark.sql.SparkSession.builder().getOrCreate()
val riskFactorDataFrame = hiveContext.read.format("csv").option("header", "true").load("hdfs:///tmp/data/riskfactor.csv")
riskFactorDataFrame.createOrReplaceTempView("riskfactor")
hiveContext.sql("SELECT * FROM riskfactor LIMIT 15").show()

2. Click the play button next to “ready” or “finished” to run the query in the Zeppelin notebook.

Alternatively, press Shift+Enter to run the query.

Initially, the query will produce the data in tabular format as shown in the screenshot.

[Image: output_riskfactor_zeppelin_lab6]

Build Charts using Zeppelin

Visualize finalresults Data in Chart Format

1. Click through each of the tabs that appear underneath the query.
Each tab displays a different type of chart for the data returned by the query.

[Image: charts_tab_under_query_lab6]

2. After clicking on a chart, we can open its advanced settings to tailor how the data is displayed.

[Image: bar_graph_zeppelin_lab6]

3. Click settings to open the advanced chart features.

4. To make a chart of riskfactor.driverid against the SUM of riskfactor.riskfactor, drag the table fields into the boxes as shown in the image below.

[Image: fields_set_keys_values_chart_lab6]
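The keys/values pairing in the chart settings corresponds to a group-by aggregation: one bar per driverid, whose height is the SUM of that driver's riskfactor values. A minimal local sketch of that aggregation, using made-up sample rows:

```python
from collections import defaultdict

# Made-up sample rows standing in for the riskfactor table.
rows = [
    {"driverid": "A10", "riskfactor": 1.0},
    {"driverid": "A10", "riskfactor": 0.5},
    {"driverid": "A97", "riskfactor": 50.75},
]

# keys = driverid, values = SUM(riskfactor): what the bar chart plots.
totals = defaultdict(float)
for row in rows:
    totals[row["driverid"]] += row["riskfactor"]

for driverid, total in sorted(totals.items()):
    print(driverid, total)
```

Changing the chart type only changes how these (key, aggregated value) pairs are drawn; the aggregation itself stays the same.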

5. You should now see an image like the one below.

[Image: driverid_riskfactor_chart_lab6]

6. If you hover over a peak, it shows the driverid and riskfactor.

[Image: hover_over_peaks_lab6]

7. Try experimenting with the different types of charts as well as dragging and
dropping the different table fields to see what kind of results you can obtain.

8. Let's try a different query to find which cities and states contain the drivers with the highest risk factors.

%sql
SELECT a.driverid, a.riskfactor, b.city, b.state
FROM riskfactor a, geolocation b where a.driverid=b.driverid

[Image: queryFor_cities_states_highest_driver_riskfactor]
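The comma-separated FROM clause in the query above is an implicit inner join on driverid. As a local sanity check, the same query shape runs against Python's built-in sqlite3; the rows below are made-up samples, while the real data lives in the Hive tables built in the earlier tutorials.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE riskfactor (driverid TEXT, riskfactor REAL)")
conn.execute("CREATE TABLE geolocation (driverid TEXT, city TEXT, state TEXT)")

# Made-up sample rows for illustration only.
conn.executemany("INSERT INTO riskfactor VALUES (?, ?)",
                 [("A54", 61.0), ("A97", 50.75)])
conn.executemany("INSERT INTO geolocation VALUES (?, ?, ?)",
                 [("A54", "Santa Maria", "California"),
                  ("A97", "Paso Robles", "California")])

# Same shape as the Zeppelin %sql paragraph:
# a comma join plus a WHERE clause is an inner join on driverid.
rows = conn.execute("""
    SELECT a.driverid, a.riskfactor, b.city, b.state
    FROM riskfactor a, geolocation b
    WHERE a.driverid = b.driverid
""").fetchall()

for row in rows:
    print(row)
```

Each output row pairs a driver's risk factor with the city and state from the matching geolocation record, which is exactly what the chart in the next step groups on.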

9. After changing a few of the settings, we can figure out which cities have the highest risk factors.
Change the chart type by clicking the scatterplot icon, then make sure a.driverid
is in the xAxis field, a.riskfactor is in the yAxis field, and b.city is in the group field.
The chart should look similar to the following.

[Image: visualize_cities_highest_driver_riskfactor_lab6]

You can hover over the highest point to determine which driver has the highest risk factor and in which city.

Summary

Great, now we know how to query and visualize data using Apache Zeppelin. We can leverage Zeppelin—along with our newly gained knowledge of Hive and Spark—to solve real-world problems in new and creative ways.

Further Reading

User Reviews


To ask a question, or find an answer, please visit the Hortonworks Community Connection.


Excellent
by reena Bhatt on December 13, 2018 at 3:22 am

Very nicely explained and easy to understand! Very good introduction with nice screenshot and videos.

limit number of paragraphs in sandboxed Zeppelin
by Tom Celuszak on November 29, 2018 at 11:20 am

Good tutorial, introduced me to Zeppelin and let me exercise some of its functions.

Had a problem with the final query – the join of riskfactor and geolocation – hanging. I could get the same query to complete using Ambari and Hive View 2. Finally found that removing all but one paragraph, of which I had… 20 or so? …let the query run to completion in 23 seconds. I had been creating a new paragraph each step; best to reuse the first paragraph.

My config is the sandbox on VirtualBox on Windows 7.

Easy to understand
by Dennis Suhari on October 19, 2018 at 12:27 am

Informative and good practical description of the steps.

Great Tutorial
by scott payne on July 24, 2018 at 8:55 pm

Tutorial was an excellent introduction to HDP data processing using a realistic data set. Each concept is presented succinctly with suggestions to explore the concept further.

My only suggestion is that not enough emphasis is placed on how much faster it is to run your queries using a shell than it is to use the sandbox.

Outstanding
by Christian Lopez on May 8, 2018 at 8:29 pm

This review is written from the perspective of a new HDP user interested in understanding this environment and the tools included in the Sandbox.

First you will be introduced to the technologies involved in the tutorial namely Hadoop, Ambari, Hive, Pig Latin, SPARK, HDFS, and most importantly HDP. Next, you will use IoT data to calculate the risk factor for truck drivers by using the truck’s information and their geo-location, you will accomplish this goal by uploading the needed data to your VM and storing the data as Hive tables. Additionally, you will learn to use PIG Latin and SPARK to extrapolate the data needed to find the risk factor for all drivers in the set and storing the information you found back into the database. Accomplishing the same task using two different tools (SPARK, and PIG) highlights the robustness and flexibility of HDP as all the operations happen flawlessly.

I highly recommend this tutorial as it is highly informative, shows a realistic use-case, and as a new user of HDP I learned about all the cool technologies enabled to work through the Hortonworks platform, most importantly I was left with a great sense of accomplishment and that’s reason alone to try the tutorial.

Excellent Tutorial!
by Ana Castro on May 8, 2018 at 4:05 pm

The tutorial was very informative and had an excellent flow. It had just the right amount of detail per concept. Great introduction to Hadoop and other Apache projects.