Elastic Hadoop on OpenStack

Project Sahara: Hadoop operational agility & deployment flexibility across public and private clouds

Apache Hadoop and OpenStack represent two of the largest open source communities, and both are relatively new to the data center. Hadoop can benefit from the operational agility provided by OpenStack, and Hadoop, in turn, serves as an excellent use case for OpenStack.

To accelerate the adoption of Hadoop over OpenStack, we partnered with Mirantis and Red Hat to collaborate on Project Savanna (since renamed to Project Sahara).

Initiative Goals

    One-click, self-service, template-based provisioning of Hadoop clusters

    Dynamic scaling of clusters for ad-hoc analytics and transient workloads

    Maximum server utilization across Hadoop and non-Hadoop workloads with virtual machine isolation

Our initiative targets the following use cases:

  • One-Click Provisioning 
    • Enable self-service provisioning for frequent requests
    • Simplify migrations from development to production
    • Reduce operator error in provisioning
    • Facilitate migration from Amazon EMR for ad-hoc analytics
  • Elasticity
    • Vary cluster compute capacity based on factors such as time of day, resource utilization, and user job requirements
    • Provide transient Hadoop clusters for analyzing data stored in Swift object store
  • Multi-Tenancy
    • Simplify upgrade and maintenance by running multiple Hadoop versions over common server pools
    • Improve server utilization by sharing resources with non-Hadoop workloads
    • Simplify chargeback/showback
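To make the template-based provisioning use case concrete, here is a minimal sketch of how node-group template payloads for Sahara's REST API might be assembled. The field names reflect our reading of the Sahara v1.1 API, and every value (plugin name, Hadoop version, flavor, process list) is illustrative rather than authoritative:

```python
# Hypothetical sketch of Sahara node-group template payloads.
# Field names follow the Sahara v1.1 REST API as we understand it;
# treat them (and all values) as assumptions, not a reference.

def make_node_group_template(name, flavor_id, processes,
                             plugin_name="hdp", hadoop_version="1.3.2"):
    """Build the JSON body for POST /v1.1/{tenant_id}/node-group-templates."""
    return {
        "name": name,
        "plugin_name": plugin_name,        # e.g. the Hortonworks HDP plugin
        "hadoop_version": hadoop_version,  # version string is illustrative
        "flavor_id": flavor_id,            # Nova flavor backing the VMs
        "node_processes": processes,       # Hadoop daemons this group runs
    }

# A master group and a worker group; combining such templates into a
# cluster template is what enables one-click, repeatable provisioning.
master = make_node_group_template("hdp-master", "2",
                                  ["namenode", "jobtracker", "ambari-server"])
worker = make_node_group_template("hdp-worker", "2",
                                  ["datanode", "tasktracker"])
```

Once saved in the controller, templates like these can be reused across tenants, which is what reduces operator error between development and production.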

Architecture Overview

[Figure: Sahara architecture overview]

The core of Sahara, called the 'controller', serves as the glue between Hadoop and OpenStack. It manages the provisioning and orchestration of virtual machines by working with the underlying OpenStack projects such as Nova, Neutron (formerly Quantum), Cinder and Glance. The Hortonworks OpenStack plugin for Sahara will configure and manage the Hadoop cluster using Ambari. It will also set up the HDFS and Swift object store connectors.


Project Sahara is currently under incubation in the OpenStack community. Hortonworks is working with the community to help mature Sahara to become a top-level OpenStack project. The most recent version of Sahara is 0.3, released on October 17, 2013.

Because the HDP OpenStack plugin is being developed in the open community, it has been included in Project Sahara since the 0.2 release. With the current version, users can provision a simple HDP cluster over OpenStack to run basic MapReduce jobs and use Ambari to manage the clusters.
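As a concrete illustration of provisioning through the controller, the sketch below prepares (without sending) a cluster-creation request against Sahara's REST API. The endpoint path, default port 8386, and payload field names are assumptions based on the v1.1 API; the template ID, image ID, and token are placeholders:

```python
# Hedged sketch: launching a cluster through the Sahara controller's
# REST API. Endpoint shape and field names are assumptions based on
# the Sahara v1.1 API; adjust to your deployment.
import json
import urllib.request

def make_cluster_request(sahara_url, tenant_id, payload, token):
    """Prepare (but do not send) a POST to /v1.1/{tenant_id}/clusters."""
    url = "%s/v1.1/%s/clusters" % (sahara_url, tenant_id)
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "X-Auth-Token": token},  # Keystone auth token
        method="POST",
    )

payload = {
    "name": "demo-hdp-cluster",
    "plugin_name": "hdp",                  # Hortonworks plugin
    "hadoop_version": "1.3.2",             # illustrative version
    "cluster_template_id": "TEMPLATE_ID",  # placeholder
    "default_image_id": "IMAGE_ID",        # Glance image, placeholder
}
req = make_cluster_request("http://sahara:8386", "TENANT", payload, "TOKEN")
# urllib.request.urlopen(req) would submit this against a live controller,
# which then drives Nova, Glance and friends to stand up the VMs.
```

From there, the HDP plugin takes over and configures the cluster through Ambari, as described above.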

We are targeting the HDP OpenStack plugin to be generally available in Q1-2014.

See the Project Sahara OpenStack incubation project page for details.

Essential Timeline

Phase 1:
  • Provisioning & Management
    • Template-based self-provisioning
    • Job flow based provisioning (Savanna EDP)
    • Ambari-based cluster management
    • OpenStack Heat support
  • Elasticity
    • Manual compute & data node elasticity
    • OpenStack Swift to HDFS data movement support
  • Multi-tenancy
    • VM-based CPU, memory & I/O isolation
    • OpenStack Neutron support for network isolation
    • Dedicated Ambari & Hue per cluster
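The Swift-to-HDFS support listed above builds on the Hadoop Swift filesystem (the hadoop-openstack module), which the plugin configures in core-site.xml. Below is a minimal sketch of those properties and the swift:// path scheme; the "sahara" provider name and all values are illustrative assumptions:

```python
# Hedged sketch of the Hadoop Swift connector settings the plugin
# would place in core-site.xml. Property names follow the
# hadoop-openstack filesystem module as we understand it; the "sahara"
# provider name and all values here are illustrative assumptions.

def swift_core_site(auth_url, tenant, user, password, provider="sahara"):
    """Build the fs.swift.service.<provider>.* properties."""
    prefix = "fs.swift.service.%s" % provider
    return {
        prefix + ".auth.url": auth_url,   # Keystone endpoint
        prefix + ".tenant": tenant,
        prefix + ".username": user,
        prefix + ".password": password,
    }

def swift_path(container, obj, provider="sahara"):
    """Data in Swift is addressed as swift://<container>.<provider>/<object>."""
    return "swift://%s.%s/%s" % (container, provider, obj)

conf = swift_core_site("http://keystone:5000/v2.0/", "demo", "hadoop", "secret")
path = swift_path("logs", "2013/12/raw.txt")
```

With these properties in place, a transient cluster can read its input directly from Swift and write results back, so the cluster itself can be torn down after the job completes.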
Phase 2 (coming soon):
  • Self-provisioning
    • Savanna template to Ambari template conversion
    • Additional data sources and job types with Savanna EDP
  • Elasticity
    • Automatic rule-based cluster elasticity
    • Improved OpenStack Ceilometer integration
  • Multi-Tenancy
    • Single Ambari instance per tenant with multi-cluster support
    • OpenStack Horizon to Ambari single sign-on
    • Hadoop node VM to physical server pinning
