Get fresh updates from Hortonworks by email

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.

Sign up for the Developers Newsletter

Once a month, receive latest insights, trends, analytics information and knowledge of Big Data.


Get Started


Ready to Get Started?

Download sandbox

How can we help you?

* I understand I can unsubscribe at any time. I also acknowledge the additional information found in Hortonworks Privacy Policy.
closeClose button
HDP > Develop with Hadoop > Apache Spark

Learning Spark SQL with Zeppelin

cloud Ready to Get Started?



In this two-part lab-based tutorial, we will first introduce you to Apache Spark SQL. Spark SQL is a higher-level Spark module that allows you to operate on DataFrames and Datasets, which we will cover in more detail later.

In the second part of the lab, we will explore an airline dataset using high-level SQL API. We will visualize the dataset and write SQL queries to find insights on when and where we can expect highest delays in flight arrivals and departures.

The lab is part of our Apache Zeppelin based lab series, providing an intuitive and developer friendly web-based environment for data ingestion, wrangling, munging, visualization and more.



Tutorial Details

As mentioned earlier, this is a two-part lab. In the first part of the lab, we will cover Spark SQL’s Datasets and DataFrames, which are distributed collections of data conceptually equivalent to a table in a relational database or a dataframe in Python or R. Both provide rich optimizations and translate to an optimized lower-level Spark code. The main difference between the Datasets and DataFrames is that Datasets are strongly typed, requiring consistent value/variable type assignments. The Dataset is available in Scala and Java (strongly typed languages), while DataFrame additionally supports Python and R languages.

If this is a bit confusing, don’t worry. Once you run through this lab you will find that both the Dataset and DataFrame APIs provide an intuitive way of interacting with the data. We will take you through several steps of exploring and selecting relevant data, and creating User Defined Functions (UDFs) to apply basic filters to columns of interest, e.g. to determine which flights were delayed.

In part two of the lab, we will create a temporary view (in Spark 1.x called a temporary table) to store our DataFrame in memory and make its contents accessible via a SQL API. This will allow us to run SQL queries against this temporary view allowing for an even richer exploration of the data with built in Zeppelin visualizations.

We will wrap up by persisting our results to a permanent table that can then be shared with other people.

One thing to remember is that in both part one and part two of the lab the queries on Datasets/DataFrames or the temporary view will translate to an underlying optimized form of Spark Resilient Distributed Datasets (RDDs) assuring that all code is executed in a parallel/distributed fashion. To learn more about RDDs, which are beyond the scope of this tutorial, see the Spark docs.

Environment Setup

Option 1: Setup Hortonworks Data Cloud (HDCloud) on AWS

This option is ideal if you want to experience a production-ready multi-node cluster in a cloud.

See the Getting Started with HDCloud tutorial for details.

Option 2: Download and Setup Hortonworks Data Platform (HDP) Sandbox

This option is optimal if you prefer to run everything in local environment (laptop/PC).

Keep in mind, that you will need 8GB of memory dedicated for the virtual machine, meaning that you should have at least 12GB of memory on your system.

2a. Download and Install HDP Sandbox 2.6

2b. Review Learning the Ropes of HDP Sandbox

Review Zeppelin Tutorial

If you are new to Zeppelin, review the following tutorial Getting Started with Apache Zeppelin

Notebook Preview

Before you start, here’s a preview of the notebook.

Notebook Preview

Start Tutorial

To begin the tutorial, import the Exploring Spark SQL Module notebook into your Zeppelin environment. (If at any point you have any issues, make sure to checkout the Getting Started with Zeppelin tutorial.)

On the Zeppelin home screen click Import note -> Add from URL and copy and paste the following URL:

Once your notebook is imported, you can open it from the Zeppelin home screen by clicking
Labs -> Spark 2.x -> Data Worker -> Scala -> 101 - Intro to SparkSQL

Once the Exploring Spark SQL Module notebook is up, follow all the directions within the notebook to complete the tutorial.

Final Words

Once you have completed part one and part two of the lab you should have a basic toolset to start exploring new datasets using a high-level programatic Dataset or DataFrame APIs, or a SQL API. Both APIs provide the same performance while giving you the choice to choose one or both to accomplish a task demanding high performance data exploration, wrangling, munging, and visualization.

Next, you may want to checkout a short introductory tutorial on Machine Learning with Spark.

User Reviews

User Rating
0 No Reviews
5 Star 0%
4 Star 0%
3 Star 0%
2 Star 0%
1 Star 0%
Tutorial Name
Learning Spark SQL with Zeppelin

To ask a question, or find an answer, please visit the Hortonworks Community Connection.

No Reviews
Write Review


Please register to write a review

Share Your Experience

Example: Best Tutorial Ever

You must write at least 50 characters for this field.


Thank you for sharing your review!