Predictive Analytics on H2O and Hortonworks Data Platform
H2O is the open-source, in-memory solution from 0xdata for predictive analytics on big data. It is a math and machine learning engine that brings distribution and parallelism to powerful algorithms, so you can build more accurate models and make better predictions faster. With familiar APIs such as R and JSON, and HDFS as a common storage layer, H2O makes advanced analyses available to a broader audience of users. The learning curve for current Hadoop users is minimal, and this tutorial streamlines the initial setup of H2O on the Hortonworks Sandbox.
The procedure has three main steps:
- Download the current release of H2O
- Launch H2O
- Run Analyses
Copy the H2O zip file to the Hadoop node from which you intend to run Hadoop commands:
$ scp -P 2222 h2o-[version].zip email@example.com:
firstname.lastname@example.org's password: hadoop
Then SSH into the VM. If you did not copy the zip file over, you can download it there with wget:
$ ssh email@example.com -p 2222
firstname.lastname@example.org's password: hadoop
Last login: Tue Jun 24 15:48:05 2014 from 10.0.2.2
[root@sandbox ~]# wget http://s3.amazonaws.com/h2o-release/h2o/rel-kolmogorov/3/h2o-126.96.36.199.zip
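Before unzipping, it can help to confirm that the archive actually arrived. The following is a minimal sketch; `check_archive` is a hypothetical helper written for this tutorial, not an H2O tool, and it only checks that the file has a .zip name and is non-empty.

```shell
#!/bin/sh
# Sketch: confirm the H2O archive is present before unzipping it.
# check_archive is a hypothetical helper, not part of H2O or Hadoop.
check_archive() {
    archive="$1"
    # Reject anything that is not named like a zip file.
    case "$archive" in
        *.zip) ;;
        *) echo "not a .zip file: $archive" >&2; return 1 ;;
    esac
    # -s is true only if the file exists and is larger than zero bytes.
    if [ ! -s "$archive" ]; then
        echo "missing or empty: $archive" >&2
        return 1
    fi
    echo "ok: $archive"
}

# Example (substitute the actual file name you downloaded):
# check_archive "h2o-<version>.zip" && unzip "h2o-<version>.zip"
```

A failed or interrupted transfer often leaves a missing or zero-byte file, which this check catches; it does not verify the archive's contents.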
The next step is to launch H2O on the Hadoop node.
Unzip the H2O file, cd to h2o-[version]/hadoop/, and run the following command, which launches one H2O node with 1 GB of memory as a mapper task in Hadoop:
$ hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 1g -nodes 1 -output hdfsOutputDirName
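To launch a larger cluster, only the -nodes and -mapperXmx values need to change. The sketch below assembles the same command from three settings so that it can be reviewed before it is run; `h2o_launch_cmd` and its defaults are illustrative names chosen here, not part of H2O. It only prints the command and does not contact Hadoop.

```shell
#!/bin/sh
# Sketch: build the H2O-on-Hadoop launch command from a few settings.
# h2o_launch_cmd is a hypothetical helper; the flags are the ones
# used in the command shown above.
h2o_launch_cmd() {
    nodes="${1:-1}"                    # number of H2O nodes (mapper tasks)
    mem="${2:-1g}"                     # memory per node, passed to -mapperXmx
    outdir="${3:-hdfsOutputDirName}"   # HDFS output directory name
    printf 'hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx %s -nodes %s -output %s\n' \
        "$mem" "$nodes" "$outdir"
}

# Print the command for a three-node cluster with 2 GB per node.
h2o_launch_cmd 3 2g h2o_out
```

Run the printed command from the h2o-[version]/hadoop/ directory, since the -libjars path is relative to it. On the Sandbox VM, keep the total memory request within what the virtual machine has available.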
Access H2O's embedded web interface by browsing to any of the H2O nodes that were launched. Find the callback IP address of the launched H2O instance and, depending on your network settings, pick the appropriate address to navigate to. For example, if the VM is launched with the VirtualBox Host-Only Ethernet Adapter, H2O's web GUI is available at 192.168.56.102 on port 54321.
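If the page does not load in a browser, a quick check from the command line can show whether the address is reachable at all. This is a minimal sketch that assumes curl is installed on the machine you are browsing from; `h2o_url` and `h2o_is_up` are hypothetical helper names, and the address is the example one from above.

```shell
#!/bin/sh
# Sketch: build the web UI address and test whether it responds.
# h2o_url and h2o_is_up are hypothetical helpers, not H2O commands.
h2o_url() {
    host="$1"
    port="${2:-54321}"   # port used in the example above
    printf 'http://%s:%s\n' "$host" "$port"
}

h2o_is_up() {
    # -s: no progress output, -f: fail on HTTP errors (status >= 400),
    # --max-time: give up after 5 seconds.
    curl -s -f -o /dev/null --max-time 5 "$(h2o_url "$@")"
}

# Print the address to open in a browser.
h2o_url 192.168.56.102

# Example reachability check:
# h2o_is_up 192.168.56.102 && echo "H2O web UI is reachable"
```

If the check fails for one address, try the other addresses reported for the VM, since which one works depends on the network adapter settings.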
More H2O tutorials and information are available in the documentation that accompanies each current H2O release. In particular, there is further documentation on running H2O on Hadoop, as well as walkthrough tutorials for most of the available features (GLM, K-Means, Random Forest, PCA, and GBM).
These tutorials are designed to work with Sandbox, a simple way to get started with Hadoop. Sandbox offers a full HDP environment that runs in a virtual machine.