
Predictive Analytics on H2O and Hortonworks Data Platform


H2O is the open-source, in-memory solution from 0xdata for predictive analytics on big data. It is a math and machine learning engine that brings distribution and parallelism to powerful algorithms, so you can build more accurate models and make better predictions faster. With familiar APIs such as R and JSON, and with HDFS as a common storage layer, H2O makes advanced analyses available to a broader audience of users. The learning curve for current Hadoop users is minimal, and the following tutorial streamlines the initial setup of H2O on the Hortonworks Sandbox.

A short video walks through each step. Follow the details below for step-by-step instructions.



The procedure has three main steps:

  1. Download the current release of H2O
  2. Launch H2O
  3. Run Analyses

Boot the Hortonworks Sandbox from the VM. Then log in at the VM console, or use a terminal and ssh to root@ -p 2222
Hortonworks VM

Copy the H2O zip file to the Hadoop node where you intend to run Hadoop commands:

$ scp -P 2222 h2o-[version].zip root@
root@'s password: hadoop

Copy downloaded zip file


Alternatively, ssh into the VM and download the zip file there with wget:

ssh root@ -p 2222
root@'s password: hadoop
Last login: Tue Jun 24 15:48:05 2014 from
[root@sandbox ~]# wget 

Download zip file

The next step is to launch H2O on the Hadoop node.

Unzip the H2O file, cd to h2o-[version]/hadoop/, and run the following command, which launches one H2O node with 1 GB of memory as a mapper task in Hadoop:

$ hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 1g -nodes 1 -output hdfsOutputDirName
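
The launch command above can be wrapped in a small script so that the memory size and node count are easy to change. The following is a minimal sketch, not part of the H2O distribution: the function name `h2o_launch_cmd` and the variables `DRIVER_JAR`, `MAPPER_XMX`, `NODES`, and `OUTPUT_DIR` are hypothetical, and the flags are only those shown in the command above.

```shell
#!/bin/sh
# Sketch: assemble the H2O launch command from the tutorial with adjustable settings.
# All variable and function names here are hypothetical; the flags are copied
# from the command shown in the tutorial.

DRIVER_JAR="${DRIVER_JAR:-h2odriver_hdp2.1.jar}"   # driver jar named in the tutorial
MAPPER_XMX="${MAPPER_XMX:-1g}"                     # memory per H2O node
NODES="${NODES:-1}"                                # number of H2O nodes (mapper tasks)
OUTPUT_DIR="${OUTPUT_DIR:-hdfsOutputDirName}"      # HDFS output directory name

h2o_launch_cmd() {
    # Print the command instead of running it, so it can be reviewed first.
    printf 'hadoop jar %s water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx %s -nodes %s -output %s\n' \
        "$DRIVER_JAR" "$MAPPER_XMX" "$NODES" "$OUTPUT_DIR"
}

h2o_launch_cmd
```

With the defaults, the script prints the same command as the tutorial. To actually launch H2O, run the printed command from within h2o-[version]/hadoop/ so that the relative path ../h2o.jar resolves.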

Access the H2O web interface by browsing to any of the launched H2O nodes. Find the callback IP address that the H2O instance launched on and, depending on your network settings, pick the appropriate address to navigate to. For example, if the VM is launched with the VirtualBox Host-Only Ethernet Adapter, H2O’s web GUI is available at and port 54321.

H2O Home Page
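
Before opening a browser, it can help to confirm from the command line that the web interface is answering. The sketch below assumes `curl` is installed; `H2O_HOST` is a placeholder for the callback IP address of your own H2O node, and the helper names `h2o_url` and `h2o_is_up` are hypothetical. Only the port number comes from the tutorial.

```shell
#!/bin/sh
# Sketch: build the H2O web UI URL from a node address and check that it answers.
# H2O_HOST is a placeholder; substitute the callback IP address of your H2O node.

H2O_PORT="${H2O_PORT:-54321}"   # port named in the tutorial

h2o_url() {
    # $1 = IP address or hostname of an H2O node
    printf 'http://%s:%s/\n' "$1" "$H2O_PORT"
}

h2o_is_up() {
    # Returns 0 if the page responds within 5 seconds, non-zero otherwise.
    curl --silent --fail --max-time 5 --output /dev/null "$(h2o_url "$1")"
}

# Example (uncomment and set H2O_HOST to your node's address):
# h2o_is_up "$H2O_HOST" && echo "H2O is reachable at $(h2o_url "$H2O_HOST")"
```

If the check fails, try the other network addresses of the VM, since the reachable address depends on the adapter the VM was launched with.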

H2O is launched as a JVM on the Hadoop cluster, and the job is tracked in the Hortonworks Sandbox Job Browser:
Job Tracker

Locate the data file you want to run regressions on in the Hortonworks Sandbox, either after uploading the file from disk or after exploring the dataset in Hive or Pig.
HDFS Catalog
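
The same check can be done from a terminal on the Sandbox with the standard `hadoop fs` commands. This is a sketch under stated assumptions: the file name `mydata.csv` and the HDFS directory `/user/root` are hypothetical example values, and `hdfs_target` is an illustrative helper, not part of Hadoop or H2O.

```shell
#!/bin/sh
# Sketch: confirm a data file is in HDFS before importing it into H2O.
# LOCAL_FILE and HDFS_DIR are hypothetical example values; change them to match your data.

LOCAL_FILE="${LOCAL_FILE:-mydata.csv}"
HDFS_DIR="${HDFS_DIR:-/user/root}"

hdfs_target() {
    # Full HDFS path the file will have after upload.
    printf '%s/%s\n' "${HDFS_DIR%/}" "$(basename "$LOCAL_FILE")"
}

if command -v hadoop >/dev/null 2>&1; then
    # Upload only if the file is not already in HDFS, then list it.
    hadoop fs -test -e "$(hdfs_target)" || hadoop fs -put "$LOCAL_FILE" "$HDFS_DIR"
    hadoop fs -ls "$(hdfs_target)"
else
    echo "hadoop not found; would upload $LOCAL_FILE to $(hdfs_target)"
fi
```

The path printed by `hdfs_target` is the one to look for when browsing HDFS from the H2O import page.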

Import the dataset into the H2O browser from HDFS, then start building models with H2O’s available features, including GLM, K-Means, and Random Forest.
Navigate to data

Import Page

More H2O tutorials and information are available in the documentation that accompanies each current H2O release. In particular, there is further documentation on running H2O on Hadoop, as well as walk-through tutorials for most of the available features (GLM, K-Means, Random Forest, PCA, and GBM).