June 03, 2015

Apache Spark on HDP: Learn, Try and Do

Not a day passes without someone tweeting or re-tweeting a blog on the virtues of Apache Spark.

At a Memorial Day BBQ, an old friend proclaimed: “Spark is the new rub, just as Java was two decades ago. It’s a developers’ delight.”

Spark, as a distributed data processing and computing platform, offers much of what developers desire and delight in, and much more. To the ETL application developer, Spark offers expressive APIs for transforming data; to data scientists, it offers machine learning libraries through its MLlib component; and to data analysts, it offers SQL capabilities for inquiry.

In this blog, I summarize how you can get started, enjoy Spark's delights, and commence a quick journey to Learn, Try, and Do Spark on HDP with a set of tutorials.

Spark on HDP

Spark on Apache Hadoop YARN enables deep integration with Hadoop and gives developers and data scientists alike two modes of development and deployment on Hortonworks Data Platform (HDP).

In the local mode—running on a single node, such as an HDP Sandbox—you can get started using a set of tutorials put together by my colleague Saptak Sen.

  1. Hands on Tour with Apache Spark in Five Minutes. Besides introducing basic Apache Spark concepts, this tutorial demonstrates how to use Spark shell with Python. Often, simplicity does not preclude profundity. In this simple example, a lot is happening behind the scenes and under the hood but it’s hidden from the developer using an interactive Spark shell. If you are a Python developer and have used Python shell, you’ll appreciate the interactive PySpark shell.
  2. Interacting with Data on HDP using Scala and Apache Spark. Building on the concepts introduced in the first tutorial, this tutorial explores how to use Spark with a Scala shell to read data from an HDFS file, perform in-memory transformations on an RDD, iterate over results, and then display them inside the shell.
  3. Using Apache Hive with ORC from Apache Spark. While the first two tutorials explored reading data from HDFS and computing in-memory, this tutorial shows how to persist data as Apache Hive tables in ORC format and how to use SchemaRDDs and DataFrames. Additionally, it shows how to query Hive tables using Spark SQL.
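The shell sessions in the first two tutorials boil down to a short pipeline: read lines, filter them, transform each record, and collect the results. The sketch below mirrors that RDD logic in plain Python so it runs without a cluster; the sample log lines are placeholders, not the tutorials' actual data. In PySpark the same steps would be `sc.textFile(...).filter(...).map(...).collect()`.

```python
# Plain-Python mirror of the RDD pipeline from the shell tutorials.
# (Illustrative data; in PySpark these would be lazy transformations
# on an RDD, evaluated only when collect() is called.)

lines = [
    "2015-06-01,INFO,job started",
    "2015-06-01,ERROR,disk full",
    "2015-06-02,INFO,job finished",
]

# Like rdd.filter(lambda l: "ERROR" in l): keep only ERROR records.
errors = [l for l in lines if "ERROR" in l]

# Like rdd.map(lambda l: l.split(",")[2]): project out the message field.
messages = [l.split(",")[2] for l in errors]

print(messages)  # ['disk full']
```

The interactive shell hides the distribution details, but the shape of the computation is exactly this: a chain of filters and maps over a dataset.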

What’s Next?

Our commitment to Apache Spark is to ensure it's YARN-enabled and enterprise-ready with security, governance, and operations, allowing deep integration with Hadoop and other YARN-enabled workloads in the enterprise—all running under the same Hadoop cluster, all accessing the same dataset.

We continue with that steadfast strategy. Last month, we released a technical preview of Apache Spark 1.3.1 on HDP 2.2. Shortly, we’ll follow with a 1.3.1 GA.

Learn More



PK says:

Great post!! The "3. Using Apache Hive with ORC from Apache Spark" link doesn't work.

Tim Benninghoff says:

It looks like you’ve got an invalid link for “Using Apache Hive with ORC from Apache Spark”.

Jules S. Damji says:

Thanks for catching that, Tim. Fixed!

Maciek says:

Why not Spark 1.4.0 ?

Jules S. Damji says:

Hello Maciek,

Whenever we release a version of Spark on HDP, we ensure that it's enterprise ready. That means we want our customers to easily deploy and manage it via Ambari, we want them to integrate security via Ranger, we want it to support the data formats accessible in Hive, and we want to ensure it runs as a first-class citizen on YARN. All of these requirements, imperative for the enterprise, take a few months to test and validate before general availability. Hence our rollout of technical previews, followed by general availability.

In due time, we will have 1.4.1 as a technical preview on HDP.

Ha Son Hai says:

Do you have any hint on the release date for the technical preview of Spark 1.4.1? People are eager to get the SparkR API with HDP.

Jules S. Damji says:

I can't provide any specific dates. However, it's on our release roadmap to offer HDP developers access to tech previews as early as possible.

Thomas Arehart says:

I'm trying to copy the 'littlelog.csv' file to /tmp using the "hadoop fs -put ./littlelog.csv /tmp/" command at the command line. I have placed the csv file in several locations (my home directory and an HDFS directory), but I keep getting an error message saying "No such file or directory".

Any suggestions on what I am doing wrong?

Jules S. Damji says:

You need to copy and paste the lines shown in the tutorial into a file called 'littlelog.csv', and then use the command to copy it into the HDFS location /tmp.
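A minimal sketch of the steps Jules describes. The file name comes from the tutorial, but the sample rows below are placeholders, not the tutorial's actual data; the HDFS step is commented out since it only works where `hadoop` is on the PATH (e.g., the HDP Sandbox).

```shell
# Create littlelog.csv locally by pasting the tutorial's sample lines.
# (These rows are illustrative placeholders.)
cat > littlelog.csv <<'EOF'
10.0.0.1,2015-06-01,/index.html
10.0.0.2,2015-06-01,/about.html
10.0.0.1,2015-06-02,/index.html
EOF

# Verify the file exists in the *local* working directory first --
# "No such file or directory" usually means -put can't find the local source.
ls -l ./littlelog.csv

# Then copy it into HDFS (run on the HDP Sandbox):
# hadoop fs -put ./littlelog.csv /tmp/
```

The common pitfall here is running `-put` from a directory that doesn't contain the local file: the error refers to the local source path, not the HDFS destination.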

Bala Sethuram says:

Thanks for the tutorials. These deal with interactive shells. Is there any example of creating a simple batch Spark application that can run on HDP in a non-interactive mode? Typically something like reading a file or Hive table (ORC?), doing some data processing/transformation, and writing output to a file. Thanks.

Jules S. Damji says:

This GitHub example shows how to create a Spark app using Hive. You can adapt the interactive code in this example for ORC and Hive.

Even better, once you know how to create an application outside the interactive shell, you can cut and paste code from these interactive examples. The keys are that a) the Spark context is created inside your app, and b) you submit the app to the driver.

An example of how to submit a Spark app on HDP is here (toward the bottom):

I hope that helps. In upcoming tutorials, we'll explore this further by providing an end-to-end app rather than interactive examples.
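As a rough sketch of points (a) and (b) above, assuming Spark 1.3-era APIs: the app file name, input/output paths, and master setting below are illustrative, not from the original post. `SparkConf`, `SparkContext`, `textFile`, `filter`, and `saveAsTextFile` are standard PySpark APIs of that era.

```shell
# Write a minimal self-contained batch app: the SparkContext is created
# inside the app (point a), rather than being handed to you by the shell.
cat > simple_batch_app.py <<'EOF'
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("SimpleBatchApp")
    sc = SparkContext(conf=conf)

    # Read, transform, and write: the same steps as the interactive
    # tutorials, but with no shell involved.
    lines = sc.textFile("/tmp/littlelog.csv")
    errors = lines.filter(lambda l: "ERROR" in l)
    errors.saveAsTextFile("/tmp/littlelog-errors")

    sc.stop()
EOF

# Submit the app on HDP (point b); run on a node with Spark installed:
# spark-submit --master yarn-client simple_batch_app.py
```

The shell tutorials' `filter`/`map` snippets can be dropped into the body of such an app almost unchanged; only the context creation and the submit step are new.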
