Apache Zeppelin is an exciting project for many of our customers who want to use the notebook and visualization capabilities to make big data more approachable and easier to understand.
Zeppelin addresses use cases like data exploration, data discovery, and interactive code snippets. It provides built-in visualization. Many users see Zeppelin as a potential modern data science studio.
This tech preview of Apache Zeppelin provides:
This technical preview can be installed on any HDP 2.3.x cluster, whether a multi-node cluster or a single-node HDP Sandbox. The following instructions assume that Spark (version 1.4.1) is already installed on the HDP cluster.
The Zeppelin Technical Preview is provided as a tarball, compiled against Spark 1.4.1.
The steps below run as the root user.

tar xvfz zeppelin-0.6.0-incubating-SNAPSHOT.tar.gz
cd zeppelin-0.6.0-incubating-SNAPSHOT
cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh

Add the following to conf/zeppelin-env.sh:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_PORT=9995
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=<HDP-version>"
To obtain the HDP version for your HDP cluster, run the following command:

hdp-select status hadoop-client | sed 's/hadoop-client - \(.*\)/\1/'
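As a sketch of what that pipeline does, here is the same sed extraction run against simulated hdp-select output (the version string 2.3.2.0-2950 is only an example; on a real cluster, pipe the actual hdp-select command instead):

```shell
# Simulated hdp-select output; the sed expression captures everything
# after "hadoop-client - " as the HDP version string.
echo 'hadoop-client - 2.3.2.0-2950' | sed 's/hadoop-client - \(.*\)/\1/'
# Prints: 2.3.2.0-2950
```

The captured value is what goes into the -Dhdp.version setting in conf/zeppelin-env.sh.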
cp /etc/hive/conf/hive-site.xml conf/
To avoid a number format exception, remove the trailing "s" from the values of hive.metastore.client.connect.retry.delay and hive.metastore.client.socket.timeout in the hive-site.xml copy in the zeppelin/conf directory, so that each value is a plain number.
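One way to strip those suffixes is with sed, shown here on a stand-alone fragment (the property values below are illustrative, not the actual defaults). Note that on a full conf/hive-site.xml a blanket substitution like this would also touch other time-valued properties, so there you should edit only these two entries:

```shell
# Illustrative hive-site.xml fragment (values are made up for the example).
cat > /tmp/hive-site-snippet.xml <<'EOF'
<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5s</value>
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800s</value>
</property>
EOF

# Drop the trailing "s" so the values parse as plain numbers.
sed -i 's|<value>\([0-9][0-9]*\)s</value>|<value>\1</value>|' /tmp/hive-site-snippet.xml
grep '<value>' /tmp/hive-site-snippet.xml
# Prints:
#   <value>5</value>
#   <value>1800</value>
```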
su hdfs
hdfs dfs -mkdir /user/root
hdfs dfs -chown root /user/root
To launch Zeppelin, run the following commands:
cd zeppelin-0.6.0-incubating-SNAPSHOT
bin/zeppelin-daemon.sh start
The Zeppelin server will start, and it will launch the Notebook UI.
To access the Zeppelin UI, enter the following address into your browser, substituting the host where Zeppelin is running:

http://<zeppelin-host>:9995
Note: If you specified a port other than 9995 in zeppelin-env.sh, use the port that you specified.
Before you run a notebook to access Spark and Hive, you need to create and configure interpreters for the two components.
To create the Spark interpreter, go to the Zeppelin Web UI, switch to the “Interpreter” tab, and create a new interpreter with the following properties:
master	yarn-client
spark.home	/usr/hdp/current/spark-client
spark.yarn.jar	/usr/hdp/current/spark-client/lib/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
Note: the path for spark.yarn.jar assumes that Spark 1.4.1 is installed. If you are running Spark 1.5.1, change this value to the path for Spark 1.5.1.
spark.driver.extraJavaOptions	-Dhdp.version=2.3.2.0-2950
spark.yarn.am.extraJavaOptions	-Dhdp.version=2.3.2.0-2950
Note: make sure that both spark.driver.extraJavaOptions and spark.yarn.am.extraJavaOptions are saved. Without these properties set, the Spark job will fail with a “bad substitution” error.
Configure the Hive interpreter:
Note: the preceding setting uses the default HiveServer2 port of 10000. If your HiveServer2 uses a different port, change this value to match your environment.
To create a notebook:
For example, evaluating sc.version in a paragraph returns the Spark version:

res0: String = 1.4.1
Took 32 seconds.
Zeppelin comes with a notebook called Zeppelin Tutorial. The tutorial is a good way to explore Zeppelin.
You can find many more sample Zeppelin notebooks at https://github.com/hortonworks-gallery/zeppelin-notebooks
Often you will want to use one or more libraries in a notebook. For example, we recently published a blog on Magellan, a library for geospatial analytics on Spark. To create a notebook that explores Magellan, you will need to include the Magellan library in your environment.
There are three ways in Zeppelin to include an external dependency.
%dep
z.load("group:artifact:version")

%spark
import ...
Here is an example that imports the dependencies for Magellan:
%dep
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.esri.geometry:esri-geometry-api:1.2.1")
z.load("harsha2010:magellan:1.
For more information, see https://zeppelin.incubator.apache.org/docs/interpreter/spark.html#dependencyloading.
export SPARK_SUBMIT_OPTIONS="--packages group:artifact:version"
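For example, the Magellan repository and the esri-geometry-api coordinate from the %dep snippet above could be declared this way instead (a sketch; --repositories and --packages are standard spark-submit flags):

```shell
# Declare the dependency once in conf/zeppelin-env.sh instead of per-notebook %dep.
export SPARK_SUBMIT_OPTIONS="--repositories http://dl.bintray.com/spark-packages/maven --packages com.esri.geometry:esri-geometry-api:1.2.1"
echo "$SPARK_SUBMIT_OPTIONS"
```

With this set, the dependency is available to every notebook without a %dep paragraph, but changing it requires a Zeppelin restart.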
To stop the Zeppelin server, issue the following commands:
cd zeppelin-0.6.0-incubating-SNAPSHOT
bin/zeppelin-daemon.sh stop
There is an experimental Ambari stack definition for Zeppelin available for installing Zeppelin and managing its configuration and life cycle (GitHub repository). This stack definition can build Zeppelin or use a pre-built version.
If you need help, or have feedback or questions about the tech preview, please first check Hortonworks Community Connection (HCC) for existing questions and answers. Please use the tags tech-preview and zeppelin.