Apache Zeppelin on HDP 2.4

In November 2015, we introduced Apache Zeppelin as a technical preview on HDP 2.3. Since then, we have made significant progress on integrating Zeppelin into HDP while working in the Apache community to add new features to Zeppelin.

These features are now available in this Apache Zeppelin technical preview – the second Zeppelin technical preview. This technical preview works with HDP 2.4 and comes with the following major features:

In addition, this tech preview includes improvements made in the community such as auto-save, the ability to quickly add new paragraphs, and stability related fixes.

Overview

This tech preview of Apache Zeppelin provides:

  • Instructions for setting up Zeppelin on HDP 2.4 with Spark 1.6
    • Ambari-managed Install
    • Manual Install of Zeppelin
  • Configuration for running Zeppelin against Spark on YARN and Hive
  • Configuration for Zeppelin to authenticate users against LDAP
  • Sample Notebooks to explore

Note: While both Ambari-managed and manual installation instructions are provided, you only need to follow one of the two sets of instructions to set up Zeppelin in your cluster.

Prerequisites

This technical preview requires the following software:

  • HDP 2.4
  • Spark 1.6 or 1.5

HDP Cluster Requirement

This technical preview can be installed on any HDP 2.4 cluster, whether it is a multi-node cluster or a single-node HDP Sandbox. The following instructions assume that Spark (version 1.6) is already installed on the HDP cluster.

Note the following cluster requirements:

  1. The Zeppelin server should be installed on a cluster node that has the Spark client installed on it.
  2. Ensure the node running Ambari server has the git package installed.
  3. Ensure that the node running Zeppelin server has the wget package installed.
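
The package requirements above can be checked before you start. The following is a minimal sketch, not part of the tech preview itself; the helper name missing_tools is hypothetical, and it only tests whether a command is on the PATH, not which package provides it.

```shell
# Print the name of each command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper name, not an HDP utility.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# git is needed on the Ambari server node, wget on the Zeppelin node.
missing=$(missing_tools git wget)
if [ -n "$missing" ]; then
  echo "Missing commands:" $missing
else
  echo "git and wget are both available"
fi
```

Run it on each node in turn, passing only the command that node needs if you prefer a stricter check.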

Installing Zeppelin on an Ambari-Managed Cluster

To install Zeppelin using Ambari, complete the following steps.

  1. Download the Zeppelin Ambari Stack Definition. On the node running Ambari server, run the following:
    VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
    sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git  /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
  2. Restart the Ambari Server:
    sudo service ambari-server restart
  3. After Ambari restarts and service indicators turn green, add the Zeppelin Service:
    At the bottom left of the Ambari dashboard, choose Actions -> Add Service.
    On the Add Service screen, select the Zeppelin service.
    Step through the rest of the installation process, accepting all default values.
    On the Review screen, make a note of the node selected to run Zeppelin service; call this ZEPPELIN_HOST.
    Click Deploy to complete the installation process.
  4. Launch Zeppelin in your browser:
    http://ZEPPELIN_HOST:9995
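
The version extraction in step 1 can be tried without a cluster. This sketch assumes that hdp-select prints a line shaped like "hadoop-client - 2.4.0.0-169"; the sample line and the function name stack_version are illustrative. In sed's basic regular expressions the capture group must be written with escaped parentheses and referenced as \1.

```shell
# Reduce a line shaped like the output of `hdp-select status hadoop-client`
# to its major.minor stack version.
stack_version() {
  sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'
}

# Sample line (assumed shape); on a real node, pipe in the hdp-select output.
sample='hadoop-client - 2.4.0.0-169'
VERSION=$(printf '%s\n' "$sample" | stack_version)

# The directory that the git clone in step 1 would target.
echo "/var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN"
```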

Zeppelin includes a few sample notebooks, including a Zeppelin tutorial. There are also quite a few notebooks available at the Hortonworks Zeppelin Gallery, including sentiment analysis, geospatial mapping, and IoT demos.

(Optional) Installing Zeppelin Manually

The Zeppelin Technical Preview is available as an HDP package compiled against Spark 1.6.

To install the Zeppelin Technical Preview manually (instead of using Ambari), complete the following steps as user root.

  1. Install the Zeppelin service:
    yum install zeppelin
  2. Make a copy of zeppelin-env.sh:
    cd /usr/hdp/current/zeppelin-server/lib
    cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
  3. In the zeppelin-env.sh file, export the following three values.
    Note: you will use PORT to access the Zeppelin Web UI. <HDP-version> corresponds to the version of HDP where you are installing Zeppelin; for example, 2.4.0.0-169.

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export ZEPPELIN_PORT=9995
    export ZEPPELIN_JAVA_OPTS="-Dhdp.version=<HDP-version>"
  4. To obtain the HDP version for your HDP cluster, run the following command:
    hdp-select status hadoop-client | sed 's/hadoop-client - \(.*\)/\1/'
  5. Copy hive-site.xml to Zeppelin’s conf directory:
    cd /usr/hdp/current/zeppelin-server/lib
    cp /etc/hive/conf/hive-site.xml conf/
  6. Remove the trailing “s” from the values of hive.metastore.client.connect.retry.delay and hive.metastore.client.socket.timeout in the hive-site.xml file that you copied to Zeppelin’s conf directory. (This avoids a number format exception.)
  7. Create a root user in HDFS:
    su hdfs
    hdfs dfs -mkdir /user/root
    hdfs dfs -chown root /user/root
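
Step 6 can be done by hand in an editor, or scripted. The sketch below is one possible approach, assuming that each <value> element sits on the line directly after its <name> element, which may not hold for every hive-site.xml. The function name strip_seconds_suffix, the sample values, and the property example.other.timeout are made up for the demonstration.

```shell
# Strip the trailing "s" from the two metastore timing values only.
# Assumes each <value> is on the line right after its <name>.
strip_seconds_suffix() {
  sed -E '/<name>hive\.metastore\.client\.(connect\.retry\.delay|socket\.timeout)<\/name>/{
n
s/<value>([0-9]+)s<\/value>/<value>\1<\/value>/
}' "$1"
}

# Demonstration on a throwaway file with made-up sample values.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<configuration>
  <property>
    <name>hive.metastore.client.connect.retry.delay</name>
    <value>5s</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>1800s</value>
  </property>
  <property>
    <name>example.other.timeout</name>
    <value>7200s</value>
  </property>
</configuration>
EOF
fixed=$(strip_seconds_suffix "$demo")
printf '%s\n' "$fixed"
rm -f "$demo"
```

The function writes the result to standard output and leaves the input untouched, so you can review the change before copying it over conf/hive-site.xml.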

To launch Zeppelin, run the following commands:

cd /usr/hdp/current/zeppelin-server/lib
bin/zeppelin-daemon.sh start

The Zeppelin server will start, and it will launch the Notebook Web UI.

To access the Zeppelin UI, enter the following address into your browser, where ZEPPELIN_HOST is the node where Zeppelin is installed:
http://ZEPPELIN_HOST:9995

Note: If you specified a port other than 9995 in zeppelin-env.sh, use the port that you specified.
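
If you are unsure which port is configured, it can be read back from zeppelin-env.sh. This is a small sketch under the assumption that the port is set with a plain "export ZEPPELIN_PORT=<number>" line as in the manual install steps; the function name zeppelin_url is hypothetical.

```shell
# Build the Web UI address from a host name and a zeppelin-env.sh file.
# Falls back to 9995 when the file is missing or sets no port.
zeppelin_url() (
  host=$1
  env_file=$2
  port=$(sed -n 's/^export ZEPPELIN_PORT=\([0-9][0-9]*\).*/\1/p' "$env_file" 2>/dev/null | tail -n 1)
  echo "http://${host}:${port:-9995}"
)

# Demonstration with a throwaway env file.
env_demo=$(mktemp)
printf 'export HADOOP_CONF_DIR=/etc/hadoop/conf\nexport ZEPPELIN_PORT=9995\n' > "$env_demo"
zeppelin_url ZEPPELIN_HOST "$env_demo"
rm -f "$env_demo"
```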

Configuring Zeppelin Spark and Hive Interpreters

Before you run a notebook to access Spark and Hive, you need to create and configure interpreters for the two components.

To create the Spark interpreter, go to the Zeppelin Web UI. Switch to the “Interpreter” tab and create a new interpreter:

  1. Click on the +Create button to the right.
  2. Name the interpreter spark-yarn-client.
  3. Select spark as the interpreter type.
  4. The next section of this page contains a form-based list of spark interpreter settings for editing. The remainder of the page contains lists of properties for all supported interpreters.
    1. In the first list of properties, specify the following values (if they are not already set). To add a property, enter the name and value into the form at the end of the list, and click +.
      master           yarn-client
      spark.home       /usr/hdp/current/spark-client
      spark.yarn.jar   /usr/hdp/current/spark-client/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar
    2. Add the following properties and settings (HDP version may vary; specify the appropriate version for your cluster):
      spark.driver.extraJavaOptions -Dhdp.version=2.4.0.0-169
      spark.yarn.am.extraJavaOptions -Dhdp.version=2.4.0.0-169
    3. When finished, click Save.
      Note: Make sure that you save all property settings. Without spark.driver.extraJavaOptions and spark.yarn.am.extraJavaOptions, the Spark job will fail with a message related to bad substitution.
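
The exact assembly jar name and HDP version differ between clusters, so it can help to print the values before typing them into the form. The sketch below assumes the assembly jar lives under the lib directory of the Spark client and matches spark-assembly-*.jar; the function name spark_interpreter_props is hypothetical.

```shell
# Print the interpreter properties for a Spark client directory and a
# full HDP version string. The assembly jar is located by filename pattern.
spark_interpreter_props() (
  spark_home=$1
  hdp_version=$2
  jar=$(ls "$spark_home"/lib/spark-assembly-*.jar 2>/dev/null | head -n 1)
  echo "master yarn-client"
  echo "spark.home $spark_home"
  echo "spark.yarn.jar $jar"
  echo "spark.driver.extraJavaOptions -Dhdp.version=$hdp_version"
  echo "spark.yarn.am.extraJavaOptions -Dhdp.version=$hdp_version"
)

# Example call; substitute the version reported for your own cluster.
spark_interpreter_props /usr/hdp/current/spark-client 2.4.0.0-169
```

If the spark.yarn.jar line comes back without a path, the jar was not found at the assumed location and the value needs to be looked up manually.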

To configure the Hive interpreter:

  1. From the “Interpreter” tab, find the hive interpreter.
  2. Check that the following property references your Hive server node. If not, edit the property value.
    hive.hiveserver2.url  jdbc:hive2://<hive_server_host>:10000

    Note: the default interpreter setting uses the default Hive Server port of 10000. If you use a different Hive Server port, change this to match the setting in your environment.

  3. If you changed the property setting, click Save to save the new setting and restart the interpreter.

Creating a Notebook

To create a notebook:

  1. Under the “Notebook” tab, choose +Create new note.
  2. In the window that appears, type a name for the new note (or accept the default).
  3. You will see the note that you just created, with one blank cell in the note. Click on the settings icon at the upper right. (Hovering over the icon will display the words “interpreter-binding.”)
  4. Drag the spark-yarn-client interpreter to the top of the list, and save it.
  5. Type sc.version into a paragraph in the note, and click the “Play” button (blue triangle).
    Zeppelin creates SparkContext, SQLContext, and ZeppelinContext automatically and exposes them as the variables ‘sc’, ‘sqlContext’, and ‘z’, respectively, in Scala and Python environments.
    Note: The first run will take some time, because it is launching a new Spark job to run against YARN. Subsequent paragraphs will run much faster.
  6. When finished, the status indicator on the right will say “FINISHED”. The output should list the version of Spark in your cluster.

Importing External Libraries

As you explore Zeppelin you will probably want to use one or more external libraries. For example, to run Magellan you need to include the Magellan library and its dependencies in your environment.

There are three ways to include an external dependency in a Zeppelin notebook:

Using the %dep Interpreter

(Note: this will only work for libraries that are published to Maven.)

%dep
z.load("group:artifact:version")
%spark
import ...

Here is an example that imports the dependency for Magellan:

%dep
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.esri.geometry:esri-geometry-api:1.2.1")
z.load("harsha2010:magellan:1.0.3-s_2.10")

For more information, see https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html.

Adding and Referencing a spark.files Property

When you have a jar on the node where Zeppelin is running, the following approach can be useful:

Add the spark.files property to SPARK_HOME/conf/spark-defaults.conf; for example:

spark.files  /path/to/my.jar

Adding and Referencing SPARK_SUBMIT_OPTIONS

When you have a jar on the node where Zeppelin is running, this approach can also be useful:

Add the SPARK_SUBMIT_OPTIONS environment variable to the ZEPPELIN_HOME/conf/zeppelin-env.sh file; for example:

export SPARK_SUBMIT_OPTIONS="--packages group:artifact:version"
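
Both the --packages and --jars options of spark-submit take comma-separated lists, so several dependencies can be combined in one setting. The sketch below shows one way to assemble the value; the helper name join_commas and the path /path/to/other.jar are placeholders for illustration.

```shell
# Join arguments with commas, the list format that spark-submit expects
# for --packages (Maven coordinates) and --jars (local jar files).
join_commas() (
  IFS=,
  echo "$*"
)

packages=$(join_commas "com.esri.geometry:esri-geometry-api:1.2.1")
jars=$(join_commas /path/to/my.jar /path/to/other.jar)

export SPARK_SUBMIT_OPTIONS="--packages $packages --jars $jars"
echo "$SPARK_SUBMIT_OPTIONS"
```

Use --packages for libraries published to a Maven repository and --jars for jar files that exist only on the Zeppelin node.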

Stopping the Zeppelin Server

To stop the Zeppelin server, issue the following commands:

cd /usr/hdp/current/zeppelin-server/lib
bin/zeppelin-daemon.sh stop

LDAP Authentication Configuration

This version of the technical preview lets you authenticate users and provides separation of notebooks.

Note: By default, Zeppelin accepts requests over HTTP, not HTTPS. When you enable LDAP authentication for Zeppelin, usernames and passwords are sent over HTTP. For better security, configure Zeppelin to listen on HTTPS by enabling SSL. You can use SSL properties specified in this doc. Also note that at this time Zeppelin does not send the user identity downstream; we are working to address this before Zeppelin goes GA.

To enable authentication, edit the [urls] section of the /usr/hdp/current/zeppelin-server/conf/shiro.ini file so that anonymous access is commented out and basic authentication is enabled:

    [urls]
    #/** = anon
    /** = authcBasic

For local user configuration, enable the [users] section:

    [users]
    admin = password1
    user1 = password2
    user2 = password3

Alternatively, for LDAP integration, enable the following lines in the [main] section by removing the leading “#” and adjusting the values for your directory:

    [main]
    #ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm
    #ldapRealm.userDnTemplate = cn={0},cn=engg,ou=testdomain,dc=testdomain,dc=com
    #ldapRealm.contextFactory.url = ldap://ldaphost:389
    #ldapRealm.contextFactory.authenticationMechanism = SIMPLE

For more information on Shiro, refer to http://shiro.apache.org/authentication-features.html.
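
The switch from anonymous access to basic authentication can also be scripted. This is a minimal sketch that assumes the shiro.ini you start from contains an active "/** = anon" line and a commented "#/** = authcBasic" line; check your file first, because the shipped defaults may differ. The function name enable_basic_auth is hypothetical.

```shell
# Comment out anonymous access and uncomment basic authentication.
# Prints the edited file to standard output; the input is not modified.
enable_basic_auth() {
  sed -e 's|^/\*\* = anon|#/** = anon|' \
      -e 's|^#/\*\* = authcBasic|/** = authcBasic|' "$1"
}

# Demonstration on a throwaway file with the assumed starting content.
shiro_demo=$(mktemp)
cat > "$shiro_demo" <<'EOF'
[urls]
/** = anon
#/** = authcBasic
EOF
shiro_out=$(enable_basic_auth "$shiro_demo")
printf '%s\n' "$shiro_out"
rm -f "$shiro_demo"
```

Review the output, then save it over shiro.ini and restart the Zeppelin server so that the change takes effect.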

Sample Notebooks

Zeppelin includes a few sample notebooks, including a Zeppelin tutorial. There are also quite a few notebooks available at the Hortonworks Zeppelin Gallery, including sentiment analysis, geospatial mapping, and IoT demos.

Known Issues

  • Zeppelin does not yet send user identity downstream after LDAP authentication.

If you need help or have feedback or questions about the tech preview, please first check Hortonworks Community Connection (HCC) for existing questions and answers. Please use the tags tech-preview and zeppelin.
