Spark 1.0.1 Technical Preview – with HDP 2.1.3

Introduction

The Spark Technical Preview lets you evaluate Apache Spark 1.0.1 on YARN with HDP 2.1.3. With YARN, Hadoop can support multiple types of workloads; Spark on YARN becomes one more workload running against the same set of hardware resources.

This guide describes how to run Spark on YARN. It also provides the canonical examples of running SparkPi and WordCount with the Spark shell. When you are ready to go beyond that level of testing, try the machine learning examples at the Apache Spark site.

Requirements

To evaluate Spark on the HDP 2.1 Sandbox, add an entry to /etc/hosts on your host machine so that sandbox.hortonworks.com (or localhost) resolves to 127.0.0.1. For example:

127.0.0.1 localhost sandbox.hortonworks.com

Installation and Configuration

The Spark 1.0.1 Technical Preview is provided as a single tarball.

Download the Spark Tarball

Use wget to download the Spark tarball:

wget http://public-repo-1.hortonworks.com/spark/centos6/tar/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563.tgz

Copy the Spark Tarball to an HDP 2.1 Cluster

Copy  the downloaded Spark tarball to your HDP 2.1 Sandbox or to your Hadoop cluster.

For example, the following command copies Spark to HDP 2.1 Sandbox:

scp -P 2222 spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563.tgz root@127.0.0.1:/root

Note: The password for HDP 2.1 Sandbox is hadoop.

Untar the Tarball

To untar the Spark tarball, run:

tar xvfz spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563.tgz

Set the YARN environment variable

Specify the appropriate directory for your Hadoop cluster. For example, if your Hadoop and YARN config files are in /etc/hadoop/conf:

export YARN_CONF_DIR=/etc/hadoop/conf

Set the yarn.application.classpath property in yarn-site.xml. On the HDP 2.1 Sandbox this property is already set, so no change is needed to set up Spark there.

If you are running Spark on your own HDP 2.1 cluster, ensure that yarn-site.xml contains the following value for the yarn.application.classpath property:

<property>
    <name>yarn.application.classpath</name>
    <value>/etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*</value>
</property>

Running the Spark Pi Example

To test compute-intensive tasks in Spark, the Pi example estimates π by “throwing darts” at a circle: it generates random points in the unit square ((0,0) to (1,1)) and counts how many fall within the unit circle. That fraction approaches π/4, which is then used to estimate π.
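The same Monte Carlo approach can also be tried interactively once the Spark shell is running (the WordCount section below shows how to start it). The following is a minimal sketch for illustration only, not the bundled SparkPi class submitted in the steps below; the sample count of 100000 is an arbitrary choice:

// Sample random points in the unit square and count how many fall inside the unit circle.
val numSamples = 100000
val inside = sc.parallelize(1 to numSamples).map { _ =>
  val x = math.random
  val y = math.random
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
// The fraction of points inside the circle is roughly Pi/4.
println("Pi is roughly " + 4.0 * inside / numSamples)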

To calculate Pi with Spark:

1. Change to your Spark directory.

cd spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563

2. Run the Spark Pi example.

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Note: The Pi job should complete without any failure messages and produce output similar to:

14/07/16 23:20:34 INFO yarn.Client: Application report from ASM: 
application identifier: application_1405567714475_0008
appId: 8
clientToAMToken: null
appDiagnostics: 
appMasterHost: sandbox.hortonworks.com
appQueue: default
appMasterRpcPort: 0
appStartTime: 1405578016384
yarnAppState: FINISHED
distributedFinalState: SUCCEEDED
appTrackingUrl: http://sandbox.hortonworks.com:8088/proxy/application_1405567714475_0008/A
appUser: root

3. To view the results in a browser, copy the appTrackingUrl and go to:

http://sandbox.hortonworks.com:8088/proxy/application_1405567714475_0008

Note: The application ID and host name in the URL above are specific to your environment. These instructions assume that the HDP 2.1 Sandbox is installed and that /etc/hosts maps sandbox.hortonworks.com to 127.0.0.1.

4. Click the logs link in the bottom right.

The browser shows the YARN container output after a redirect.

Note the following output on the page. (Other output omitted for brevity.)

…..
14/07/14 16:00:25 INFO ApplicationMaster: AppMaster received a signal.
14/07/14 16:00:25 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1405371122903_0002
Log Type: stdout
Log Length: 22
Pi is roughly 3.14102

Running WordCount on Spark

WordCount counts the occurrences of each word in a block of text, read from an input file.

Copy input file for Spark WordCount Example

Upload the input file for WordCount to HDFS. You can use any text file as input. The following example uses log4j.properties:

hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data

Run Spark WordCount

To run WordCount:

1. Run the Spark shell:

./bin/spark-shell

2. If the Spark shell appears to hang, press Enter to get to a scala> prompt, then run the following commands. They split each line into words, map each word to a (word, 1) pair, and sum the counts for each word with reduceByKey:

scala>
val file = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/data")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount")

Viewing the WordCount output using Scala Shell

To view the output in the scala shell:

scala> counts.count()

To print the full output of the WordCount job:

scala> counts.toArray().foreach(println)
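
To see the most frequent words first, the following small variation on the same pipeline is a sketch (not part of the original example): it swaps each (word, count) pair, sorts by count in descending order, and prints the top ten:

scala> counts.map(_.swap).sortByKey(false).take(10).foreach(println)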

Exit the scala shell.

scala> exit

Viewing the WordCount output using HDFS

To view the output of WordCount using HDFS commands:

1. List the WordCount output files:

hadoop fs -ls /tmp/wordcount

It should display output similar to:

/tmp/wordcount/_SUCCESS
/tmp/wordcount/part-00000
/tmp/wordcount/part-00001

2. Use the HDFS cat command to view the WordCount output. For example:

hadoop fs -cat /tmp/wordcount/part-00000
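
Alternatively, the output can be inspected from the Spark shell instead of with HDFS commands. A minimal sketch, assuming the same /tmp/wordcount output path:

scala> sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount").take(10).foreach(println)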

Running the Machine Learning Spark Application

Make sure that the gfortran runtime library is installed on all of your NodeManager nodes. If it is not, install it on each NodeManager node:

sudo yum install gcc-gfortran

Note: The gfortran library is usually available in the updates repos for CentOS. For example:

sudo yum install gcc-gfortran --enablerepo=update

MLlib throws a linking error if it cannot detect these libraries automatically. For example, if you try to run collaborative filtering without the gfortran runtime library installed, you will see the following linking error:

java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dposv(CII[DII[DII)I
   at org.jblas.NativeBlas.dposv(Native Method)
   at org.jblas.SimpleBlas.posv(SimpleBlas.java:369)
   at org.jblas.Solve.solvePositive(Solve.java:68)

Visit http://spark.apache.org/docs/latest/mllib-guide.html for Spark ML examples.
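
As a quick way to exercise the native library path from the Spark shell, the following collaborative filtering sketch uses a tiny made-up ratings set (the data and parameter values here are for illustration only; this is not one of the bundled examples). If gfortran is missing, the ALS.train step is where the UnsatisfiedLinkError above appears:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// A tiny, made-up ratings set: Rating(userId, productId, rating)
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 5.0), Rating(1, 20, 1.0),
  Rating(2, 10, 4.0), Rating(2, 30, 2.0),
  Rating(3, 20, 5.0), Rating(3, 30, 3.0)))

// Train a small ALS model (rank 5, 10 iterations, lambda 0.01); this exercises
// the jblas/gfortran native code path used by collaborative filtering.
val model = ALS.train(ratings, 5, 10, 0.01)

// Predict a rating for a (user, product) pair that is not in the training data.
println(model.predict(1, 30))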

Troubleshooting

Issue:

A submitted Spark job fails to run or appears to hang, and the YARN container log contains the following warning:

14/07/15 11:36:09 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/07/15 11:36:24 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/07/15 11:36:39 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Solution:

The Hadoop cluster must have sufficient memory available for the request. For example, submitting the job with 1 GB allocated for the executor and the Spark driver fails with the above error on the HDP 2.1 Sandbox. Reduce the memory requested for the executor and the Spark driver to 512m and resubmit the job:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Issue:

An error message about a non-existent HDFS input path appears when running the machine learning examples:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/user/root/mllib/data/sample_svm_data.txt
   at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
… (Remainder of stack trace omitted for brevity.)

Solution:

Ensure that the input data has been uploaded to HDFS at the path the example expects, for example with hadoop fs -copyFromLocal.

Known Issues

At the time of this release, there are no known issues for Apache Spark. Visit the forum for the latest discussions on issues:

http://hortonworks.com/community/forums/forum/spark/

Further Reading

Apache Spark documentation is available here:

https://spark.apache.org/docs/latest/
