June 01, 2017

Run Apache Spark 2.1 & Apache Zeppelin in Hortonworks Data Cloud

Apache Spark 2.1 brings improvements to Structured Streaming and machine learning:

  • Structured Streaming: Kafka 0.10 support, plus metrics and stability improvements
  • Machine Learning: SparkR improvements, including new ML algorithms for LDA, random forests, GMM, etc.

The recent release of Hortonworks Data Platform 2.6 (“HDP 2.6”) includes Apache Spark 2.1, and Hortonworks Data Cloud (“HDCloud”) for AWS gives you a quick way to launch a Spark cluster. Let’s use the HDCloud release to launch a Data Science cluster powered by Spark 2.1 and Zeppelin:

  1. Launch a Spark 2.1 cluster with HDCloud
  2. Run a sample notebook in Zeppelin
  3. Run an example Spark job from the command line using Spark 2.1
  4. Use the Spark 2.1 interpreter in Zeppelin

STEP 1: Create and Launch a Data Science Cluster in HDCloud

Grab the latest HDCloud release, launch your Cloud Controller, log in, and create a cluster that includes Spark 2.1 by selecting HDP 2.6 (Cloud) and choosing the “Data Science: Apache Spark 2.1, Apache Zeppelin 0.7.0” cluster type.

Create Cluster for Data Science

During cluster creation, you should also select the checkbox to enable remote access to cluster components. These components include the HDFS NameNode (NN), YARN Resource Manager (RM), Spark History Server (SHS), and MapReduce Job History Server (JHS).

Network Requirements

Keep the default EC2 instance types for the master and worker nodes.

STEP 2: Run a Sample Notebook with Zeppelin

Log in to Zeppelin with the same username and password that you used to create the cluster.

Login

Once you are on the Zeppelin home page, you can run the Zeppelin Tutorial notebook.

STEP 3: Access HDCloud from the Command Line & Run SparkPi

SSH into one of the cluster Worker nodes:

ssh -i "vinay-ec2-us-west.pem" cloudbreak@ec2-34-208-106-165.us-west-2.compute.amazonaws.com
sudo su spark
cd /usr/hdp/current/spark2-client/

And run the SparkPi example:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode client \
  --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 \
  examples/jars/spark-examples*.jar 10
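
For context, SparkPi estimates π with a Monte Carlo method: it samples random points in the square [-1, 1] × [-1, 1] and counts the fraction that fall inside the unit circle. The trailing 10 in the command is the number of partitions ("slices") the sampling is split across. Below is a minimal plain-Python sketch of the same idea, not the source of the Spark example. The function name `estimate_pi`, the per-slice sample count, and the fixed seed are illustrative assumptions, and it runs on a single machine without Spark.

```python
import random


def estimate_pi(slices=10, samples_per_slice=100_000, seed=42):
    """Estimate pi by sampling points in [-1, 1] x [-1, 1].

    `slices` mirrors the trailing argument of the SparkPi command; here it
    only scales the total sample count, since nothing is distributed.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    total = slices * samples_per_slice
    inside = 0
    for _ in range(total):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # point lies inside the unit circle
            inside += 1
    # Area of circle / area of square = pi / 4
    return 4.0 * inside / total


if __name__ == "__main__":
    print(f"Pi is roughly {estimate_pi()}")
```

On the cluster, Spark distributes this sampling loop across the executors requested with --num-executors, so a larger slice count gives both more samples and more parallel tasks.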

 

You can see the completed job in the Spark History Server UI.

Spark History Server
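
The Spark History Server also exposes the same information through Spark's monitoring REST API under /api/v1/applications. The sketch below is one hedged way to list completed applications from a script. The base URL and port are placeholders you would need to replace, the helper names are hypothetical, and an HDCloud cluster with remote access enabled may require going through its gateway with authentication, which this sketch does not handle.

```python
import json
from urllib.request import urlopen

# Placeholder address (an assumption): replace with your History Server URL.
HISTORY_SERVER = "http://localhost:18081"


def completed_apps_url(base_url):
    """Build the REST endpoint that lists completed applications."""
    return base_url.rstrip("/") + "/api/v1/applications?status=completed"


def summarize(apps):
    """Return (id, name, completed) for each application record.

    `completed` is taken from the first listed attempt, or None if the
    record has no attempts.
    """
    rows = []
    for app in apps:
        attempts = app.get("attempts", [])
        completed = attempts[0].get("completed") if attempts else None
        rows.append((app.get("id"), app.get("name"), completed))
    return rows


def list_completed(base_url=HISTORY_SERVER):
    """Fetch and summarize completed applications (performs a network call)."""
    with urlopen(completed_apps_url(base_url)) as response:
        return summarize(json.load(response))
```

Calling list_completed() against a reachable History Server should return one row per finished application, including the Spark Pi run from the previous step.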

STEP 4: Run Spark 2

By default, Zeppelin comes configured with both the Spark and Spark2 interpreters, and you can bind a notebook to either one. The Spark interpreter runs Spark 1.6.3, and the Spark2 interpreter runs Spark 2.1. To use Spark 2 in your Zeppelin notebook, start the paragraph with %spark2.

Spark Version

WHAT’S NEXT?

It is great to see such rapid progress in the Spark Community and we are excited to be able to provide Spark 2.1 with Hortonworks Data Platform 2.6 and Hortonworks Data Cloud for AWS.

To run more Spark examples in HDCloud, visit the “A Lap around Spark” tutorial and try its examples in HDCloud.

If you have issues or need help with launching Spark 2.1 or trying out HDCloud, please visit https://community.hortonworks.com/spaces/61/operations-track_2.html?type=question. We’d love to hear from you.

Are you interested in learning how other practitioners and customers are getting business value from Spark in the cloud? Join us for DataWorks Summit on June 13–15 in San Jose and save 25% on your all-access pass. Enter BLOG when you register.
