Hadoop on OpenStack

Project Sahara: Hadoop operational agility & deployment flexibility across public and private clouds

Apache Hadoop and OpenStack are two of the largest open source communities, and both are relatively new to the data center. Hadoop benefits from the operational agility that OpenStack provides and, in turn, serves as an excellent use case for OpenStack.

To accelerate the adoption of Hadoop on OpenStack, we partnered with Mirantis and Red Hat on Project Savanna (since renamed Project Sahara).

Initiative Goals

  • One-click, self-service, template-based provisioning of Hadoop clusters
  • Dynamic scaling of clusters for ad-hoc analytics and transient workloads
  • Maximum server utilization across Hadoop and non-Hadoop workloads with virtual machine isolation
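
As a rough illustration of what template-based provisioning could look like, the sketch below models a cluster template as plain Python data and expands it into the list of instances to start. The class names, flavor labels, version string and hostname pattern are assumptions made for this example; they are not Sahara's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeGroupTemplate:
    """Hypothetical description of one kind of node in a cluster."""
    name: str
    flavor: str            # VM size label; the values used below are made up
    processes: List[str]   # Hadoop daemons that run on this kind of node
    count: int             # how many instances of this node type to start


@dataclass(frozen=True)
class ClusterTemplate:
    """Hypothetical reusable cluster definition built from node groups."""
    name: str
    hadoop_version: str
    node_groups: List[NodeGroupTemplate]


def expand(template: ClusterTemplate, cluster_name: str) -> List[dict]:
    """Turn a template into the concrete list of instances to provision."""
    instances = []
    for group in template.node_groups:
        for index in range(1, group.count + 1):
            instances.append({
                "hostname": f"{cluster_name}-{group.name}-{index:03d}",
                "flavor": group.flavor,
                "processes": list(group.processes),
            })
    return instances


# One template can be reused for any number of clusters.
small_cluster = ClusterTemplate(
    name="small-analytics",
    hadoop_version="example-version",  # placeholder, not a real release
    node_groups=[
        NodeGroupTemplate("master", "large", ["namenode", "jobtracker"], 1),
        NodeGroupTemplate("worker", "medium", ["datanode", "tasktracker"], 3),
    ],
)

plan = expand(small_cluster, "dev")
for instance in plan:
    print(instance["hostname"], instance["flavor"], instance["processes"])
```

Because the template is only data, the same definition can be expanded under a different cluster name for production, which is the sense in which templates reduce operator error.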

Our initiative targets the following use cases:

  • One-Click Provisioning 
    • Enable self-service provisioning for frequent requests
    • Simplify migrations from development to production
    • Reduce operator error in provisioning
    • Facilitate migration from Amazon EMR for ad-hoc analytics
  • Elasticity
    • Vary cluster compute capacity based on factors such as time of day, resource utilization and user job requirements
    • Provide transient Hadoop clusters for analyzing data stored in Swift object store
  • Multi-Tenancy
    • Simplify upgrade and maintenance by running multiple Hadoop versions over common server pools
    • Improve server utilization by sharing resources with non-Hadoop workloads
    • Simplify chargeback/showback
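
The elasticity use case above can be sketched as a small sizing rule. This is a minimal example under stated assumptions, not how Sahara decides cluster size: the function name, the thresholds, the business-hours window and the "shrink only at night" rule are all invented for illustration.

```python
def target_workers(hour, cpu_percent, current_workers,
                   min_workers=2, max_workers=20,
                   business_hours=(8, 18), target_percent=60):
    """Return how many worker nodes the cluster should have.

    hour            -- hour of day, 0-23
    cpu_percent     -- average CPU utilization of current workers, 0-100
    current_workers -- number of worker nodes running now
    """
    # Size the cluster so the same load would sit near the target utilization.
    # Integer ceiling division avoids floating-point rounding surprises.
    load = current_workers * cpu_percent
    desired = -(-load // target_percent)

    # Outside business hours the cluster may shrink but never grow.
    start, end = business_hours
    if not (start <= hour < end):
        desired = min(desired, current_workers)

    # Always stay within the configured bounds.
    return max(min_workers, min(max_workers, desired))


print(target_workers(hour=10, cpu_percent=90, current_workers=4))  # busy day
print(target_workers(hour=23, cpu_percent=15, current_workers=8))  # quiet night
```

A transient cluster, as in the Swift use case, is the limiting case of the same idea: the cluster exists only while a job needs it.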

Architecture Overview

The core of Sahara, called the ‘controller’, serves as the glue between Hadoop and OpenStack. It provisions and orchestrates virtual machines by working with underlying OpenStack projects such as Nova, Neutron (formerly Quantum), Cinder and Glance.
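
To make the controller's role more concrete, here is a hedged sketch of a controller that walks through provisioning steps and delegates each one to the responsible OpenStack service: Glance for images, the networking service (Quantum, later renamed Neutron) for networks, Nova for virtual machines and Cinder for block storage. The classes, step descriptions and ordering are hypothetical stand-ins; no real OpenStack client is called.

```python
from typing import List, Tuple

# Which OpenStack service handles each provisioning step. The step wording
# and ordering are illustrative assumptions, not Sahara's actual workflow.
PROVISIONING_STEPS: List[Tuple[str, str]] = [
    ("glance", "look up the base VM image"),
    ("neutron", "create or select the tenant network"),
    ("nova", "boot one virtual machine per cluster node"),
    ("cinder", "create and attach block storage volumes"),
]


class RecordingService:
    """Stand-in for a real service client; it only logs what it is asked."""

    def __init__(self, name: str, log: List[str]):
        self.name = name
        self.log = log

    def perform(self, action: str, cluster: str) -> None:
        self.log.append(f"{self.name}: {action} [{cluster}]")


class Controller:
    """Hypothetical controller that runs the steps in order."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.services = {
            name: RecordingService(name, self.log)
            for name, _ in PROVISIONING_STEPS
        }

    def provision(self, cluster: str) -> List[str]:
        for service_name, action in PROVISIONING_STEPS:
            self.services[service_name].perform(action, cluster)
        # Once the VMs exist, Hadoop configuration is handed to a plugin.
        self.log.append(f"plugin: configure Hadoop [{cluster}]")
        return self.log


log = Controller().provision("demo")
for entry in log:
    print(entry)
```

The point of the sketch is the separation of concerns: infrastructure steps go to OpenStack services, while Hadoop-specific configuration is left to a plugin.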

Hortonworks is developing, within the OpenStack community, a plugin for Project Sahara that uses Apache Ambari to configure and manage Hadoop clusters in the cloud. The HDP Plugin also configures the HDFS and Swift object store connectors.
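
For context on the Swift connector: Hadoop's Swift filesystem support addresses objects with URLs of the form swift://<container>.<service>/<path>, where the service label is defined in the connector configuration. The helper below is a small sketch that builds such a URL; the function name and the "example" service label are assumptions for illustration, not values the HDP Plugin is known to use.

```python
def swift_url(container: str, service: str, path: str) -> str:
    """Build a URL of the form swift://<container>.<service>/<path>."""
    for label, value in (("container", container), ("service", service)):
        # The container and service share the host part of the URL, so
        # neither may be empty or contain the separators '.' and '/'.
        if not value or "." in value or "/" in value:
            raise ValueError(f"{label} must be non-empty without '.' or '/'")
    return f"swift://{container}.{service}/{path.lstrip('/')}"


# "example" is a made-up service label.
print(swift_url("logs", "example", "/2013/10/events.txt"))
```

With a connector configured, a URL like this can be passed to Hadoop jobs as an input or output location in place of an hdfs:// path.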


Project Sahara is currently in incubation in the OpenStack community. Hortonworks is working with the community to help Sahara mature into an integrated OpenStack project in the Juno release cycle.

The HDP Plugin is under development in the OpenStack community and has been included with Project Sahara since the 0.3 release. With the current HDP Plugin, users can provision an HDP cluster on OpenStack and manage the cluster with Apache Ambari.

The most recent version of Project Sahara is 0.3, released on October 17, 2013.


Essential Timeline

Phase 1: OpenStack Icehouse
  • Provisioning & Management
    • Template-based self-provisioning
    • Elastic Data Processing
    • Ambari-based cluster management
    • Heat- and Nova-based provisioning
  • Elasticity
    • Manual compute & data node elasticity
    • OpenStack Swift to HDFS data movement support
  • Multi-tenancy
    • VM-based CPU, memory & I/O isolation
    • OpenStack Neutron support for network isolation
    • Dedicated Ambari per cluster
Phase 2: OpenStack Juno (coming soon)
  • Provisioning & Management
    • Native support for Ambari Blueprints
    • Command line interface
    • Kerberos cluster support
  • Platform Support
    • Support for HDP 2.1 Stack
  • Data Worker
    • Support for Ambari Views
    • Additional Elastic Data Processing capabilities
    • Internationalization
