April 15, 2015

Hands-on Tour of Apache Spark in 5 Minutes

Introduction

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, and Python that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets. Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN-enabled workloads in the enterprise.

In this blog, we will introduce the basic concepts of Apache Spark and walk through the first few steps needed to get started with Spark on the Hortonworks Sandbox. But first, you must download the Hortonworks Sandbox with Apache Spark 1.2.1 GA before proceeding.

Note: This tutorial was written for HDP 2.2.4 and Spark 1.2.1. We have since released HDP 2.3 with Spark 1.3.1, and you can also run this tutorial with the HDP 2.3 Sandbox.

Prerequisite

Download Hortonworks Sandbox with HDP  

Concepts

At the core of Spark is the notion of a Resilient Distributed Dataset (RDD), which is an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated on in parallel.

Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.

Once an RDD is instantiated, you can apply a series of operations. All operations fall into one of two types: transformations or actions. Transformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing DAG that can then be applied on the partitioned dataset across the YARN cluster. An action operation, on the other hand, executes the DAG and returns a value.
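As a quick preview of this difference before the full walkthrough below, here is a minimal PySpark snippet (this assumes a running PySpark shell, where the SparkContext is already available as sc):

doubled = sc.parallelize([1, 2, 3]).map(lambda x: x * 2)   # transformation: returns a new RDD, nothing is computed yet
doubled.collect()   # action: executes the DAG and returns [2, 4, 6] to the driver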

Let’s try it out.

A Hands-On Example

Let’s open a shell to our Sandbox through SSH:
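For example, with a typical Sandbox VM setup (the exact host and port depend on your environment; the Sandbox usually forwards SSH to port 2222 on localhost):

ssh root@127.0.0.1 -p 2222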

The default password is hadoop

Then let’s get some data by running the command below at your shell prompt:

wget http://en.wikipedia.org/wiki/Hortonworks

Copy the data over to HDFS on Sandbox:

hadoop fs -put ~/Hortonworks /user/guest/Hortonworks
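Optionally, you can verify that the file landed in HDFS by listing the target directory:

hadoop fs -ls /user/guest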

Let’s start the PySpark shell and work through a simple example of counting the lines in a file. The shell allows us to interact with our data using Spark and Python:

pyspark

As discussed above, the first step is to instantiate the RDD, using the SparkContext sc to read the file Hortonworks from HDFS.

myLines = sc.textFile('hdfs://sandbox.hortonworks.com/user/guest/Hortonworks')

Now that we have instantiated the RDD, it’s time to apply some transformation operations on it. In this case, we will apply a simple transformation using a Python lambda expression to filter out all the empty lines.

myLines_filtered = myLines.filter( lambda x: len(x) > 0 )

Note that the previous Python statement returned without any output. This lack of output signifies that the transformation operation did not touch the data in any way but has only modified the processing graph.

Let’s make this transformation real with an action operation like count(), which executes all the preceding transformations and then applies this aggregate function.

myLines_filtered.count()

The final result of this little Spark Job is the number you see at the end. In this case it is 341.
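Note that the same job could also be expressed as a single chained expression, since each transformation returns an RDD that further operations can be applied to:

sc.textFile('hdfs://sandbox.hortonworks.com/user/guest/Hortonworks').filter(lambda x: len(x) > 0).count()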

We hope that this little example whets your appetite for more ambitious data science projects on the Hortonworks Data Platform.

For more on Apache Spark, check out the links below:
