Apache Zeppelin Technical Preview

Introduction

Apache Zeppelin is an exciting project for many of our customers who want to use the notebook and visualization capabilities to make big data more approachable and easier to understand.

Zeppelin addresses use cases like data exploration, data discovery, and interactive code snippets. It provides built-in visualization. Many users see Zeppelin as a potential modern data science studio.

This tech preview of Apache Zeppelin provides:

  • Instructions for setting up Zeppelin on HDP 2.3.x with Spark 1.4.1
  • Configuration for running Zeppelin against Spark on YARN and Hive
  • Sample Notebooks to explore

HDP Cluster Requirement

This technical preview can be installed on any HDP 2.3.x cluster, whether it is a multi-node cluster or a single-node HDP Sandbox. The following instructions assume that Spark (version 1.4.1) is already installed on the HDP cluster.

If you have an HDP 2.3.0 cluster, it shipped with Spark 1.3.1. You can either upgrade the entire cluster to HDP 2.3.2 with Ambari, which includes Spark 1.4.1, or manually upgrade only Spark to 1.4.1.
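
To confirm which Spark version is deployed before you begin, you can check from the command line. This is a minimal sketch that assumes a standard HDP layout with hdp-select on the PATH:

# Show which Spark build hdp-select currently points at
hdp-select status spark-client

# Or ask Spark itself; spark-submit prints a version banner
/usr/hdp/current/spark-client/bin/spark-submit --version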

Install the Zeppelin Tech Preview

The Zeppelin Technical Preview is provided as a tarball, compiled against Spark 1.4.1.

The steps below are run as the root user.

  1. Download the version of the tarball that corresponds to the version of Spark deployed in your HDP cluster.
    Spark 1.4.1:

    wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/zeppelin/0.6.0-incubating-1.4.1.2.3.2.0-2950/zeppelin-0.6.0-incubating-SNAPSHOT.tar.gz
  2. Unpack the tarball on an HDP node that has the HDFS, Spark, and Hive clients installed, into a directory of your choice; for example, /home/cloud-user/ZeppelinTP
    tar xvfz zeppelin-0.6.0-incubating-SNAPSHOT.tar.gz
  3. Make a copy of zeppelin-env.sh:
    cd zeppelin-0.6.0-incubating-SNAPSHOT
    cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
  4. In the zeppelin-env.sh file, add the following.
    Note: you will use the ZEPPELIN_PORT value to access the Zeppelin Web UI. <HDP-version> corresponds to the version of HDP where you are installing Zeppelin; for example, 2.3.2.0-2950.

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export ZEPPELIN_PORT=9995
    export ZEPPELIN_JAVA_OPTS="-Dhdp.version=<HDP-version>"
    To obtain the HDP version for your cluster, run the following command:
     hdp-select status hadoop-client | sed 's/hadoop-client - \(.*\)/\1/'
  5. Copy hive-site.xml to Zeppelin’s conf directory:
    cd zeppelin-0.6.0-incubating-SNAPSHOT
    cp /etc/hive/conf/hive-site.xml conf/

    In the copied hive-site.xml (zeppelin-0.6.0-incubating-SNAPSHOT/conf), remove the trailing “s” from the values of hive.metastore.client.connect.retry.delay and hive.metastore.client.socket.timeout (for example, change 5s to 5); otherwise Zeppelin fails with a number format exception. A scripted version of this edit is sketched after this list.

  6. Finally, create an HDFS home directory for the root user:
    su hdfs
    hdfs dfs -mkdir /user/root
    hdfs dfs -chown root /user/root
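
The hive-site.xml edit in step 5 can also be scripted. This is a minimal sketch, assuming GNU sed and the usual one-line <value>...</value> layout in hive-site.xml:

# Strip the trailing "s" unit from the two values Zeppelin cannot parse
cd zeppelin-0.6.0-incubating-SNAPSHOT
sed -i '/hive.metastore.client.connect.retry.delay/,/<\/value>/ s/>\([0-9]\+\)s</>\1</' conf/hive-site.xml
sed -i '/hive.metastore.client.socket.timeout/,/<\/value>/ s/>\([0-9]\+\)s</>\1</' conf/hive-site.xml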

Launch Zeppelin

To launch Zeppelin, run the following commands:

cd zeppelin-0.6.0-incubating-SNAPSHOT
bin/zeppelin-daemon.sh start

The Zeppelin server will start and serve the notebook web UI.

To access the Zeppelin UI, enter the following address into your browser:

http://<node_where_zeppelin_is_installed>:9995

Note: If you specified a port other than 9995 in zeppelin-env.sh, use the port that you specified.
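
To confirm that the server is running, you can ask the daemon script for its status (the same script used to start and stop Zeppelin):

bin/zeppelin-daemon.sh status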

Configure Zeppelin Spark and Hive Interpreters

Before you run a notebook to access Spark and Hive, you need to create and configure interpreters for the two components.

To create the Spark interpreter, go to the Zeppelin Web UI. Switch to the “Interpreter” tab and create a new interpreter:

      1. Click the +Create button on the right.
      2. Name the interpreter spark-yarn-client.
      3. Select spark as the interpreter type.
      4. You will see a list of properties. Edit the following property values:
        master           yarn-client
        spark.home       /usr/hdp/current/spark-client
        spark.yarn.jar   /usr/hdp/current/spark-client/lib/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar

        Note: the path for spark.yarn.jar assumes that Spark 1.4.1 is installed. If you are running Spark 1.5.1, change this value to the path for Spark 1.5.1.

      5. Add the following properties and settings:
        spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950 
        spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.2.0-2950

        Note: make sure that both spark.driver.extraJavaOptions and spark.yarn.am.extraJavaOptions are saved. Without these properties set, Spark jobs will fail with an error message related to “bad substitution”.

      6. Save the settings and restart the interpreter.

Configure the Hive interpreter:

        1. From the “Interpreter” tab, find the hive interpreter.
        2. Edit the following property:
          hive.hiveserver2.url    jdbc:hive2://<hive_server_host>:10000

          Note: the preceding setting uses the default Hive Server port of 10000. If you use a different Hive Server port, change this to match the setting in your environment.

        3. Save the settings and restart the interpreter.
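
Before running Hive paragraphs, it can help to verify the JDBC URL outside Zeppelin. A quick check with Beeline (shipped with the Hive client) might look like this, substituting your HiveServer2 host:

# Connect with the same JDBC URL configured in the interpreter
beeline -u "jdbc:hive2://<hive_server_host>:10000" -e "show databases;"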

Create a Note

To create a notebook:

        1. Switch to the “Notebook” tab and click on Create new note.
        2. Navigate to the note that you just created, and click the interpreter binding (settings) icon.
        3. Drag the spark-yarn-client interpreter to the top of the list, and save the binding.
        4. Type sc.version into a paragraph in the note, and click the “Play” button. A SparkContext, SQLContext, and ZeppelinContext will be created automatically and exposed as the variables ‘sc’, ‘sqlContext’, and ‘z’, respectively, in both the Scala and Python environments. Note: the first run will take some time, because it spins up a new Spark job to run against YARN; subsequent paragraphs will run much faster.
        5. You should see output similar to the following, listing the version of Spark in your cluster.
          res0: String = 1.4.1
          Took 32 seconds.

Sample Notebooks

Zeppelin comes with a notebook called Zeppelin Tutorial. The tutorial is a good way to explore Zeppelin.

You can find many more sample Zeppelin notebooks at https://github.com/hortonworks-gallery/zeppelin-notebooks

Import External Libraries

You will often want to use one or more external libraries in a notebook. For example, we recently published a blog on Magellan, a library for geospatial analytics in Spark. To create a notebook that explores Magellan, you will need to include the Magellan library in your environment.

There are three ways in Zeppelin to include an external dependency.

        1. Using the %dep interpreter. Note: this only works for libraries that are published to a Maven repository.
          %dep
          z.load("group:artifact:version")
          %spark
          import ...

          Here is an example that imports the dependencies for Magellan:

          %dep
          z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
          z.load("com.esri.geometry:esri-geometry-api:1.2.1")
          z.load("harsha2010:magellan:1.0.3-s_2.10")
          
          

          For more information, see https://zeppelin.incubator.apache.org/docs/interpreter/spark.html#dependencyloading.

        2. When you have a jar on the node where Zeppelin is running, the following approach can be useful: add the spark.files property in SPARK_HOME/conf/spark-defaults.conf; for example:
          spark.files  /path/to/my.jar
        3. When you have a jar on the node where Zeppelin is running, this approach can also be useful:
          Set the SPARK_SUBMIT_OPTIONS environment variable in the ZEPPELIN_HOME/conf/zeppelin-env.sh file; for example:

          export SPARK_SUBMIT_OPTIONS="--packages group:artifact:version"
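
For instance, loading the Magellan dependencies from the %dep example above via spark-submit options might look like the following; --packages and --repositories are standard spark-submit flags:

# Pull the Magellan packages from the Spark Packages repository at launch time
export SPARK_SUBMIT_OPTIONS="--packages com.esri.geometry:esri-geometry-api:1.2.1,harsha2010:magellan:1.0.3-s_2.10 --repositories http://dl.bintray.com/spark-packages/maven"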

Stop the Zeppelin Server

To stop the Zeppelin server, issue the following commands:

cd zeppelin-0.6.0-incubating-SNAPSHOT
bin/zeppelin-daemon.sh stop

Zeppelin with Ambari

There is an experimental Ambari stack definition for Zeppelin (GitHub repository) that installs Zeppelin and manages its configuration and life cycle. The stack definition can either build Zeppelin from source or use a pre-built version.

Known Issues

        • The Zeppelin tech preview is not certified to run against a Kerberos-enabled cluster.
        • When you create a new Note, the Zeppelin Web UI will not automatically navigate to it.

If you need help or have feedback or questions about the tech preview, please first check Hortonworks Community Connection (HCC) for existing questions and answers. Please use the tags tech-preview and zeppelin.
