Using Apache Spark: Technical Preview with HDP 2.2

Apache Spark 1.2.0 on YARN with HDP 2.2

The Spark Technical preview lets you evaluate Apache Spark 1.2.0 on YARN with HDP 2.2. With YARN, Hadoop can now support various types of workloads; Spark on YARN becomes yet another workload running against the same set of hardware resources.

This technical preview describes how to:

  • Run Spark on YARN and run the canonical Spark examples, SparkPi and WordCount.
  • Run Spark 1.2 on HDP 2.2.
  • Work with the built-in UDF collect_list, a key feature of Hive 0.13. This technical preview provides support for Hive 0.13.1 and instructions on how to call this UDF from the Spark shell.
  • Use the SparkSQL Thrift JDBC/ODBC server.
  • View the history of finished jobs with the Spark Job History Server.
  • Use ORC files with Spark, with examples.
  • Run SparkPi with Tez as the execution engine.

When you are ready to go beyond these tasks, try the machine learning examples on the Apache Spark site.

HDP Sandbox Requirements

To evaluate Spark on the HDP 2.2 Sandbox, add an entry to /etc/hosts on your host machine so that the Sandbox hostname (or localhost) resolves to 127.0.0.1. For example:

127.0.0.1 localhost sandbox.hortonworks.com

Ensure that ports 4040, 8042, 18080, and 19188 are forwarded from host to guest in the HDP Sandbox.

Install the Technical Preview

The Spark 1.2.0 Technical Preview is provided as a single tarball.

Download the Spark Tarball

Use wget to download the Spark tarball:

wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.2.0/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041.tgz

Copy the Spark Tarball to an HDP 2.2 Cluster

Copy the downloaded Spark tarball to your HDP 2.2 Sandbox or to your Hadoop cluster.

For example, the following command copies Spark to the HDP 2.2 Sandbox:

scp -P 2222 spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041.tgz root@127.0.0.1:/root

Note: The password for the HDP 2.2 Sandbox is hadoop.

Untar the Tarball

To untar the Spark tarball, run:

tar xvfz spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041.tgz

The directory where the Spark tarball is expanded is referred to as SPARK_HOME throughout this document.

Set up the environment

Specify the appropriate directory for your Hadoop cluster. For example, if your Hadoop and YARN config files are in /etc/hadoop/conf:

  1. Set the environment variable:
    export YARN_CONF_DIR=/etc/hadoop/conf
  2. Create a file SPARK_HOME/conf/spark-defaults.conf and add the following settings:
    spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
    spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

Run the Spark Pi Example

The Pi example tests compute-intensive tasks in Spark by estimating pi with a “dart-throwing” simulation: it generates random points in the unit square ((0,0) to (1,1)) and counts how many fall within the unit circle. That fraction approaches pi/4 and is used to estimate pi.
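For intuition, here is a minimal plain-Scala sketch of the same Monte Carlo idea (illustrative only and not taken from the SparkPi source; the bundled example distributes this work across executors):

import scala.util.Random

// Illustrative sketch: throw n darts at the unit square; the fraction that
// lands inside the quarter circle of radius 1 approximates pi/4.
val n = 100000
val hits = (1 to n).count { _ =>
  val x = Random.nextDouble()
  val y = Random.nextDouble()
  x * x + y * y <= 1.0
}
println(s"Pi is roughly ${4.0 * hits / n}")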

To calculate Pi with Spark:

  1. Navigate to your Spark directory:
    cd <SPARK_HOME>
  2. Run the Spark Pi example:
    ./bin/spark-submit --verbose --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Note: The Pi job should complete without any failure messages and produce output similar to the following:

14/12/19 19:46:38 INFO impl.YarnClientImpl: Submitted application application_1419016680263_0002  
14/12/19 19:46:39 INFO yarn.Client: Application report for application_1419016680263_0002 (state: ACCEPTED)  
14/12/19 19:46:39 INFO yarn.Client:  
      client token: N/A  
      diagnostics: N/A  
      ApplicationMaster host: N/A  
      ApplicationMaster RPC port: -1  
      queue: default  
      start time: 1419018398442  
      final status: UNDEFINED  
      tracking URL: http://sandbox.hortonworks.com:8088/proxy/application_1419016680263_0002/  
      user: root

  3. To view the results in a browser, copy the tracking URL and open it:
    http://sandbox.hortonworks.com:8088/proxy/application_1419016680263_0002/

Notes:

  • The application ID and tracking URL shown above are specific to your environment.
  • These instructions assume that HDP 2.2 Sandbox is installed and that /etc/hosts maps sandbox.hortonworks.com to localhost.

Click the “logs” link at the bottom right of the tracking page.

After a redirect, the browser shows the YARN container output.

Note the following output on the page. (Other output omitted for brevity.)

…  
14/12/22 17:13:30 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED  
14/12/22 17:13:30 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.  
14/12/22 17:13:30 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.  
14/12/22 17:13:30 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1419016680263_0005

Log Type: stdout  
Log Upload Time: 22-Dec-2014 17:13:33  
Log Length: 23  
Pi is roughly 3.143824

Using WordCount with Spark

Copy input file for Spark WordCount Example

Upload the input file you want to use in WordCount to HDFS. You can use any text file as input.

The following example uses log4j.properties as the input file:

hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data

Run Spark WordCount

To run WordCount:

  1. Run the Spark shell:
    ./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

You should see output similar to the following, before the Scala REPL prompt, “scala>”:

14/12/22 17:27:38 INFO util.Utils: Successfully started service 'HTTP class server' on port 41936.  
Welcome to
   ____              __  
  / __/__  ___ _____/ /__  
 _\ \/ _ \/ _ `/ __/  '_/  
/___/ .__/\_,_/_/ /_/\_\   version 1.2.0
   /_/  
Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_71)  
Type in expressions to have them evaluated.  
…
14/12/22 17:28:27 INFO yarn.Client: Application report for application_1419016680263_0006 (state: ACCEPTED)
14/12/22 17:28:28 INFO yarn.Client:  
      client token: N/A  
      diagnostics: N/A  
      ApplicationMaster host: N/A  
      ApplicationMaster RPC port: -1  
      queue: default  
      start time: 1419269306798  
      final status: UNDEFINED  
      tracking URL: http://sandbox.hortonworks.com:8088/proxy/application_1419016680263_0006/  
      user: root  
…
14/12/22 17:29:23 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)  
14/12/22 17:29:23 INFO repl.SparkILoop: Created spark context..  
Spark context available as sc.

scala>

  2. At the Scala REPL prompt, enter:
    val file = sc.textFile("/tmp/data")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("/tmp/wordcount")

Viewing the WordCount output in the Scala Shell

To view the output in the Scala shell:

counts.count()

To print the full output of the WordCount job:

counts.toArray().foreach(println)
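If the full list is long, you can instead look at just the most frequent words. This is an optional check, not part of the original walkthrough; sortByKey is available on pair RDDs in the Spark shell:

// Optional: show only the 10 most frequent words.
counts.map { case (word, count) => (count, word) }.sortByKey(ascending = false).take(10).foreach(println)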

Viewing the WordCount output using HDFS

To read the output of WordCount using the HDFS command:

  1. Exit the Scala shell:
    scala> exit
  2. View the WordCount results:
    hadoop fs -ls /tmp/wordcount

You should see output similar to the following:

/tmp/wordcount/_SUCCESS  
/tmp/wordcount/part-00000  
/tmp/wordcount/part-00001

  3. Use the HDFS cat command to see the WordCount output. For example:
    hadoop fs -cat /tmp/wordcount/part-00000

Running a Hive 0.13.1 UDF

Before running the Hive examples, complete the following steps:

Create hive-site.xml in the Spark conf directory

Create the file SPARK_HOME/conf/hive-site.xml.

Edit the file to contain only the following statements:

<configuration>  
<property>  
  <name>hive.metastore.uris</name>
  <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
  <value>thrift://sandbox.hortonworks.com:9083</value>
  <description>URI for client to contact metastore server</description>  
</property>  
</configuration>

Hive 0.13.1 provides a new built-in UDF collect_list(col), which returns a list of objects with duplicates.

Launch the Spark Shell on YARN cluster

./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client

Create Hive Context

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

You should see output similar to the following:

…  
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@7d9b2e8d
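As an optional sanity check that the metastore connection works, list the existing Hive tables (the table names you see will depend on your cluster):

// Optional check: list tables visible through the Hive metastore.
hiveContext.hql("show tables").collect().foreach(println)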

Create Hive Table

hiveContext.hql("CREATE TABLE IF NOT EXISTS TestTable (key INT, value STRING)")

You should see output similar to the following:

…  
res0: org.apache.spark.sql.SchemaRDD =  
SchemaRDD[0] at RDD at SchemaRDD.scala:108  
== Query Plan ==  
<Native command: executed by Hive>

Load example key-value data into the table

hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE TestTable")

You should see output similar to the following:

14/12/22 18:37:45 INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>  
14/12/22 18:37:45 INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1419273465053 end=1419273465053 duration=0 from=org.apache.hadoop.hive.ql.Driver>  
14/12/22 18:37:45 INFO log.PerfLogger: </PERFLOG method=Driver.run start=1419273463944 end=1419273465053 duration=1109 from=org.apache.hadoop.hive.ql.Driver>  
res1: org.apache.spark.sql.SchemaRDD =  
SchemaRDD[2] at RDD at SchemaRDD.scala:108  
== Query Plan ==  
<Native command: executed by Hive>
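Optionally, verify the load with a quick row count before invoking the UDF (not part of the original walkthrough):

// Optional check: count the rows loaded from kv1.txt.
hiveContext.hql("SELECT count(*) FROM TestTable").collect().foreach(println)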

Invoke Hive collect_list UDF

hiveContext.hql("from TestTable SELECT key, collect_list(value) group by key order by key").collect.foreach(println)

You should see output similar to the following:

…  
[489,ArrayBuffer(val_489, val_489, val_489, val_489)]  
[490,ArrayBuffer(val_490)]  
[491,ArrayBuffer(val_491)]  
[492,ArrayBuffer(val_492, val_492)]  
[493,ArrayBuffer(val_493)]  
[494,ArrayBuffer(val_494)]  
[495,ArrayBuffer(val_495)]  
[496,ArrayBuffer(val_496)]  
[497,ArrayBuffer(val_497)]  
[498,ArrayBuffer(val_498, val_498, val_498)]
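For comparison, Hive's built-in collect_set UDF returns the same groups with duplicates removed; this is an optional variation on the query above:

// Optional variation: collect_set drops duplicate values within each group.
hiveContext.hql("from TestTable SELECT key, collect_set(value) group by key order by key").collect.foreach(println)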

Example: Reading and Writing an ORC File

This tech preview provides full support for ORC files in Spark. The following example reads and writes an ORC file and uses the ORC schema to infer a table.

ORC File Support

Create a new Hive Table with ORC format

hiveContext.sql("create table orc_table(key INT, value STRING) stored as orc")

Load Data into the ORC table

hiveContext.hql("INSERT INTO table orc_table select * from testtable")

Verify that Data is loaded into the ORC table

hiveContext.hql("FROM orc_table SELECT *").collect().foreach(println)

Read ORC Table from HDFS as HadoopRDD

val inputRead = sc.hadoopFile("/apps/hive/warehouse/orc_table", classOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],classOf[org.apache.hadoop.io.NullWritable],classOf[org.apache.hadoop.hive.ql.io.orc.OrcStruct])

Verify that you can manipulate the ORC record through RDD

val k = inputRead.map(pair => pair._2.toString)  
val c = k.collect

You should see output similar to the following:

...  
14/12/22 18:41:37 INFO scheduler.DAGScheduler: Stage 7 (collect at <console>:16) finished in 0.418 s  
14/12/22 18:41:37 INFO scheduler.DAGScheduler: Job 4 finished: collect at <console>:16, took 0.437672 s  
c: Array[String] = Array({238, val_238}, {86, val_86}, {311, val_311}, {27, val_27}, {165, val_165}, {409, val_409}, {255, val_255}, {278, val_278}, {98, val_98}, {484, val_484}, {265, val_265}, {193, val_193}, {401, val_401}, {150, val_150}, {273, val_273}, {224, val_224}, {369, val_369}, {66, val_66}, {128, val_128}, {213, val_213}, {146, val_146}, {406, val_406}, {429, val_429}, {374, val_374}, {152, val_152}, {469, val_469}, {145, val_145}, {495, val_495}, {37, val_37}, {327, val_327}, {281, val_281}, {277, val_277}, {209, val_209}, {15, val_15}, {82, val_82}, {403, val_403}, {166, val_166}, {417, val_417}, {430, val_430}, {252, val_252}, {292, val_292}, {219, val_219}, {287, val_287}, {153, val_153}, {193, val_193}, {338, val_338}, {446, val_446}, {459, val_459}, {394, val_394}, {2…

Copy the example data file into HDFS

cd SPARK_HOME

hadoop dfs -put examples/src/main/resources/people.txt people.txt

Run Spark-Shell

./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client

At the Scala prompt, type the following (the // lines are explanatory comments):

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
// Load and register the Spark table
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val people = sc.textFile("people.txt")
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => {if(fieldName == "name") StructField(fieldName, StringType, true) else StructField(fieldName, IntegerType, true)}))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), new Integer(p(1).trim)))
// Infer the table schema from the RDD
val peopleSchemaRDD = hiveContext.applySchema(rowRDD, schema)
// Register a temporary table from the schema
peopleSchemaRDD.registerTempTable("people")
val results = hiveContext.sql("SELECT * FROM people")
results.map(t => "Name: " + t.toString).collect().foreach(println)
// Save the table to an ORC file
peopleSchemaRDD.saveAsOrcFile("people.orc")
// Create a table from the ORC file
val morePeople = hiveContext.orcFile("people.orc")
morePeople.registerTempTable("morePeople")
hiveContext.sql("SELECT * from morePeople").collect.foreach(println)

Using the SparkSQL Thrift Server for JDBC/ODBC access

With this tech preview, the SparkSQL Thrift Server provides JDBC access to SparkSQL.

1. Start the Thrift Server

From SPARK_HOME, start the SparkSQL Thrift Server. Note the port value of the Thrift JDBC server.
 ./sbin/start-thriftserver.sh --master yarn --executor-memory 512m --hiveconf hive.server2.thrift.port=10001

2. Connect to the Thrift Server over beeline

Launch beeline from SPARK_HOME:

./bin/beeline

3. Issue SQL commands

At the beeline prompt:

beeline> !connect jdbc:hive2://localhost:10001

You should see output similar to the following:

0: jdbc:hive2://localhost:10001> show tables;  
Connected to: Spark SQL (version 1.2.0)  
Driver: null (version null)  
Transaction isolation: TRANSACTION_REPEATABLE_READ  
+------------+  
|   result   |  
+------------+  
| orc_table  |  
| sample_07  |  
| sample_08  |  
| testtable  |  
+------------+  
4 rows selected (6.725 seconds)

Notes:

  • This example does not have security enabled, so any username and password combination should work.
  • The beeline connection might take 10 to 15 seconds to become available in the Sandbox environment. If show tables returns no output, wait 10 to 15 seconds and try again.
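Beeline is only one JDBC client; any JDBC client can connect the same way. The following is a minimal Scala sketch, assuming the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) and its dependencies are on the classpath and the server is listening on port 10001 as configured above:

import java.sql.DriverManager

// Connect to the SparkSQL Thrift Server over JDBC and list tables.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10001", "", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("show tables")
while (rs.next()) println(rs.getString(1))
rs.close(); stmt.close(); conn.close()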

4. Stop the Thrift Server

./sbin/stop-thriftserver.sh

Using the Spark Job History Server

The Spark Job History Server is integrated with YARN’s Application Timeline Server (ATS): it publishes job metrics to ATS, so job details remain available after a job finishes. You can leave the History Server running while you work through the examples in this tech preview, and then go to the YARN ResourceManager page at http://sandbox.hortonworks.com:8088/cluster/apps to see the logs from the finished applications.

1. Add History Services to SPARK_HOME/conf/spark-defaults.conf

spark.yarn.services                org.apache.spark.deploy.yarn.history.YarnHistoryService  
spark.history.provider             org.apache.spark.deploy.yarn.history.YarnHistoryProvider  
## Make sure the host and port match the node where your YARN history server is running
spark.yarn.historyServer.address   localhost:18080

2. Start the Spark History Server

./sbin/start-history-server.sh

3. Stop the Spark History Server

./sbin/stop-history-server.sh

Run SparkPi with Tez as the execution engine

HDP 2.2 provides the option of running Spark DAGs with Tez as the execution engine. Please see this post for details about the benefits of this approach.

1. Copy config files to Spark Home Directory

cp /etc/hadoop/conf/core-site.xml SPARK_HOME/external/spark-native-yarn/conf  
cp /etc/hadoop/conf/yarn-site.xml SPARK_HOME/external/spark-native-yarn/conf
cp /etc/tez/conf/tez-env.sh SPARK_HOME/external/spark-native-yarn/conf  
cp /etc/tez/conf/tez-site.xml SPARK_HOME/external/spark-native-yarn/conf

2. Start SparkPi

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master execution-context:org.apache.spark.tez.TezJobExecutionContext --conf update-classpath=true ./lib/spark-examples*.jar 3

The console prints output similar to the following. Note the value of Pi at the end of the output.

…  
14/12/23 19:47:48 INFO client.DAGClientImpl: DAG initialized: CurrentState=Running  
14/12/23 19:47:50 INFO client.DAGClientImpl: DAG: State: RUNNING Progress: 0% TotalTasks: 3 Succeeded: 0 Running: 0 Failed: 0 Killed: 0  
14/12/23 19:47:50 INFO client.DAGClientImpl:     VertexStatus: VertexName: 0 Progress: 0% TotalTasks: 3 Succeeded: 0 Running: 0 Failed: 0 Killed: 0  
14/12/23 19:47:55 INFO client.DAGClientImpl: DAG: State: RUNNING Progress: 0% TotalTasks: 3 Succeeded: 0 Running: 0 Failed: 0 Killed: 0  
14/12/23 19:47:55 INFO client.DAGClientImpl:     VertexStatus: VertexName: 0 Progress: 0% TotalTasks: 3 Succeeded: 0 Running: 0 Failed: 0 Killed: 0  
14/12/23 19:48:00 INFO client.DAGClientImpl: DAG: State: RUNNING Progress: 0% TotalTasks: 3 Succeeded: 0 Running: 1 Failed: 0 Killed: 0  
14/12/23 19:48:00 INFO client.DAGClientImpl:     VertexStatus: VertexName: 0 Progress: 0% TotalTasks: 3 Succeeded: 0 Running: 1 Failed: 0 Killed: 0  
14/12/23 19:48:03 INFO client.DAGClientImpl: DAG: State: RUNNING Progress: 66.67% TotalTasks: 3 Succeeded: 2 Running: 1 Failed: 0 Killed: 0  
14/12/23 19:48:03 INFO client.DAGClientImpl:     VertexStatus: VertexName: 0 Progress: 66.67% TotalTasks: 3 Succeeded: 2 Running: 1 Failed: 0 Killed: 0  
14/12/23 19:48:03 INFO client.DAGClientImpl: DAG: State: SUCCEEDED Progress: 100% TotalTasks: 3 Succeeded: 3 Running: 0 Failed: 0 Killed: 0  
14/12/23 19:48:03 INFO client.DAGClientImpl:     VertexStatus: VertexName: 0 Progress: 100% TotalTasks: 3 Succeeded: 3 Running: 0 Failed: 0 Killed: 0  
14/12/23 19:48:03 INFO client.DAGClientImpl: DAG completed. FinalState=SUCCEEDED  
14/12/23 19:48:03 INFO tez.DAGBuilder: DAG execution complete  
Pi is roughly 3.1394933333333332

Running the Machine Learning Spark Application

Make sure the gfortran runtime library is installed on all of your NodeManager nodes. If it is not, install it on each NodeManager node:

sudo yum install gcc-gfortran

Note: The library is usually available in the update repo for CentOS. For example:

sudo yum install gcc-gfortran --enablerepo=update

MLlib throws a linking error if it cannot load the native library. For example, if you try to run Collaborative Filtering without the gfortran runtime library installed, you will see the following linking error:

java.lang.UnsatisfiedLinkError:  
org.jblas.NativeBlas.dposv(CII[DII[DII)I  
    at org.jblas.NativeBlas.dposv(Native Method)  
    at org.jblas.SimpleBlas.posv(SimpleBlas.java:369)  
    at org.jblas.Solve.solvePositive(Solve.java:68)
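A quick way to confirm the native dependency is present is to call jblas (the library MLlib uses here) directly from spark-shell on a NodeManager node. This is a minimal check, not part of the original preview; it throws the same UnsatisfiedLinkError if the gfortran runtime is missing:

import org.jblas.{DoubleMatrix, Solve}

// Solving a trivial positive-definite system exercises the native code path
// that fails when the gfortran runtime is absent.
val a = DoubleMatrix.eye(2)
val b = DoubleMatrix.ones(2, 1)
println(Solve.solvePositive(a, b))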

Visit http://spark.apache.org/docs/latest/mllib-guide.html for Spark ML examples.

Troubleshooting

Issue:

spark-submit fails.

Note the error about the unset environment variable:

Exception in thread "main" java.lang.Exception: When running with master 'yarn-cluster' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.  
at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:182)  
…

Solution:

Set the environment variable YARN_CONF_DIR as follows:

export YARN_CONF_DIR=/etc/hadoop/conf

Issue: 

A Spark-submitted job fails to run and appears to hang.

In the YARN container log you will notice the following error:

14/07/15 11:36:09 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory  
14/07/15 11:36:24 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory  
14/07/15 11:36:39 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Solution:

The Hadoop cluster must have sufficient memory available for the request. For example, submitting the following job with 1 GB allocated for both the executor and the Spark driver fails with the above error in the HDP 2.2 Sandbox. Reduce the memory allocation for the executor and the Spark driver to 512 MB, and restart the cluster.

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
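From a running spark-shell you can also confirm which memory settings actually took effect (an optional check; getConf returns a copy of the SparkConf):

// Optional check: print the memory-related Spark configuration values.
sc.getConf.getAll.filter(_._1.contains("memory")).foreach(println)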

Issue:

An error message about a nonexistent HDFS input path appears when running the Machine Learning examples:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:  
hdfs://sandbox.hortonworks.com:8020/user/root/mllib/data/sample_svm_data.txt  
      at  
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)  
      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)  
      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)  
      at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)  
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)  
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)  
      at scala.Option.getOrElse(Option.scala:120)  
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)  
      at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)  
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)  
… 
(Omitted for brevity.)

Solution:

Ensure that the input data is uploaded to HDFS.
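You can verify the path from the Spark shell before re-running the example. This is a minimal check using the Hadoop FileSystem API; the path below is the one from the error message, relative to the user's HDFS home directory:

import org.apache.hadoop.fs.{FileSystem, Path}

// Optional check: does the expected input file exist in HDFS?
val fs = FileSystem.get(sc.hadoopConfiguration)
println(fs.exists(new Path("mllib/data/sample_svm_data.txt")))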

Known Issues:

This tech preview does not work against a Kerberos-enabled cluster.

Additional Information:

Visit the forum for the latest discussions about issues:

http://hortonworks.com/community/forums/forum/spark/

Comments

Sangeeth Jairaj | November 26, 2014 at 10:39 am

We are using a 48-node cluster configured with HDP 2.1. The Spark shell and examples work fine, but setting up the Hive context with the command above fails with the error below. I have followed all the steps in the tech preview.

Sangeeth Jairaj | November 26, 2014 at 10:41 am

We have a 48-node cluster with HDP 2.1 and followed all the steps in the tech preview. The Spark shell works fine with the examples, but the Hive context fails with the following error.

14/11/26 11:37:51 INFO hive.metastore: Connected to metastore.
java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:280)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:12)
at $iwC$$iwC$$iwC.<init>(<console>:17)
at $iwC$$iwC.<init>(<console>:19)
at $iwC.<init>(<console>:21)
at <init>(<console>:23)
at .<init>(<console>:27)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:76)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.tez.dag.api.SessionNotRunning
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
… 42 more

December 12, 2014 at 3:58 pm

Thanks for these instructions. I was confused about a couple things (running from outside the cluster) but I figured it out and blogged about it:

http://clarkupdike.blogspot.com/2014/12/running-spark-on-yarn-from-outside.html

François Pelletier | January 10, 2015 at 11:52 am

Hi,
When executing step 2 in “Run SparkPi with Tez as the execution engine,” I get this error:
https://gist.github.com/franc00018/a9c614910dab7b259255

January 27, 2015 at 9:14 am

Spark can’t arrive GA in HDP soon enough, nor can there be enough investment on this front.

There are still outstanding issues/bugs for real usage, e.g. Kerberos vs. Spark SQL Hive context SASL props, and JDBC ThriftServer ClassCastExceptions on select count(*), although select * works:

select count(*)…
Exception in thread "Thread-7" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
at org.apache.hive.jdbc.HiveStatement.getQueryLog(HiveStatement.java:841)
at org.apache.hive.jdbc.HiveStatement.getQueryLog(HiveStatement.java:786)
at org.apache.hive.beeline.Commands$1.run(Commands.java:841)
at java.lang.Thread.run(Thread.java:745)

Btw there are also a couple minor typos on this page (thrift => “thirft”)

Regards,

Hari Sekhon

February 8, 2015 at 11:12 am

I got a similar error to François Pelletier’s, but upon further investigation, I had a few things I hadn’t done correctly. The most egregious error was not entering the first line of the spark-defaults.conf file properly (apparently “park” isn’t the same as “spark” ;)).

I also didn’t have port forwarding setup at first, so I couldn’t get to all the errors.

Finally, I upped the memory on my VirtualBox VM to ~16 gigs (as high as it would go without producing a warning) and my CPUs from 2 to 8. I doubt this did anything other than speeding things up a bit though.

Hope this helps!

Prashant Nerkar | March 19, 2015 at 7:07 pm

How do I connect to a sample file uploaded to HDFS from Spark?
The Spark code is something like this:
String logFile = "hdfs://……";

Chakra Sankaraiah | March 20, 2015 at 9:27 am

When you try to run the Spark Pi example, make sure you use a double hyphen before each option:

./bin/spark-submit --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10

Alex McLintock | April 23, 2015 at 7:43 am

Now that HDP 2.2.4 is released, is there a binary of just Spark that I can use with HDP 2.2.0?

