Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets (see Spark API Documentation for more info). Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN enabled workloads in the enterprise.
In this tutorial, we will introduce the basic concepts of Apache Spark and the first few necessary steps to get started with Spark using an Apache Zeppelin Notebook on a Hortonworks Data Platform (HDP) Sandbox.
This tutorial is a part of series of hands-on tutorials to get you started with HDP using Hortonworks sandbox. Please ensure you complete the prerequisites before proceeding with this tutorial.
At the core of Spark is the notion of a Resilient Distributed Dataset (RDD), which is an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated in parallel.
Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.
Once an RDD is instantiated, you can apply a series of operations. All operations fall into one of two types: transformations or actions. Transformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing Directed Acyclic Graph (DAG) that can then be applied on the partitioned dataset across the YARN cluster. An Action operation, on the other hand, executes DAG and returns a value.
Let’s try it out.
First, launch Apache Zeppelin.
If you’re new to Zeppelin, make sure to checkout the Getting Started Guide.
Create a Notebook
Create a new note, and give it a meaningful name, e.g. Apache Spark in 5 Minutes
Zeppelin comes with several interpreters pre-configured on Sandbox. In this tutorial we will use a shell interpreter
%sh and a pyspark interpreter
Let’s start with a shell interpreter
%sh and bring in some Hortonworks related Wikipedia data.
Type the following in your Zeppelin Notebook and hit shift + enter to execute the code:
%sh wget http://en.wikipedia.org/wiki/Hortonworks
You should see an output similar to this
Next, let’s copy the data over to HDFS. Type and execute the following:
%sh hadoop fs -put ~/Hortonworks /tmp
Now we are ready to run a simple Python program with Spark. This time we will use the python interpreter
%pyspark. Copy and execute this code:
%pyspark myLines = sc.textFile('hdfs://sandbox.hortonworks.com/tmp/Hortonworks') myLinesFiltered = myLines.filter( lambda x: len(x) > 0 ) count = myLinesFiltered.count() print count
When you execute the above you should get only a number as an output. I got
311 but it may vary depending on the Wikipedia entry.
Let’s go over what’s actually going on. After the python interpreter
%pyspark is initialized we instantiate an RDD using a Spark Context
sc with a
Hortonworks file on HDFS:
myLines = sc.textFile('hdfs://sandbox.hortonworks.com/tmp/Hortonworks')
After we instantiate the RDD, we apply a transformation operation on the RDD. In this case, a simple transformation operation using a Python lambda expression to filter out all the empty lines:
myLinesFiltered = myLines.filter( lambda x: len(x) > 0 )
At this point, the transformation operation above did not touch the data in any way. It has only modified the processing graph.
We finally execute an action operation using the aggregate function
count(), which then executes all the transformation operations:
count = myLinesFiltered.count()
print count we display the final count value, which returns
That’s it! Your complete notebook should look like this after you run your code in all paragraphs:
We hope that this little example wets your appetite for more ambitious data science projects on the Hortonworks Data Platform (HDP) Sandbox.
If you’re feeling adventurous checkout our other great Apache Spark & Hadoop tutorials.