Our customers have many choices of infrastructure for deploying HDP: on-premises, cloud, virtualized, and even as an appliance. Further, they can choose to deploy on either Linux or Windows. You can easily see that this creates a complex matrix. At Hortonworks, we believe you should not be limited to a single option, but should be free to choose the best combination of infrastructure and operating system for each usage scenario. That means: in a hybrid deployment model, you should have all of these options.
Why would an organization use a hybrid deployment model for HDP and Hadoop? Our customers come to us with requirements that fall into three basic scenarios:
Data architects require Hadoop to act like other systems in the data center, and business continuity through replication across on-premises and cloud-based storage targets is a critical requirement. In HDP 2.2 we extend the capabilities of Apache Falcon to establish an automated policy for cloud backup to Microsoft Azure or Amazon S3. For example, this tutorial shows how to incrementally back up data to Microsoft Azure using Apache Falcon.
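To make this concrete, a Falcon replication policy is expressed as a feed entity with a source and a target cluster. The sketch below is illustrative only: the feed name, cluster names, paths, storage account, and schedule are all hypothetical, and the Azure target assumes a cluster entity already registered with a `wasb://` storage location.

```xml
<!-- Hypothetical Falcon feed: replicates hourly partitions from an
     on-premises cluster to an Azure-backed backup cluster.
     All names, paths, and dates are illustrative. -->
<feed name="rawSalesFeed" description="Incremental backup to Azure"
      xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- Source: the production, on-premises cluster -->
    <cluster name="primaryCluster" type="source">
      <validity start="2014-11-01T00:00Z" end="2016-01-01T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
    <!-- Target: a cluster entity whose storage points at Azure blob storage -->
    <cluster name="azureBackupCluster" type="target">
      <validity start="2014-11-01T00:00Z" end="2016-01-01T00:00Z"/>
      <retention limit="months(12)" action="delete"/>
      <locations>
        <location type="data"
          path="wasb://backup@mystorageaccount.blob.core.windows.net/sales/${YEAR}-${MONTH}-${DAY}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Default (source) data location, partitioned by date -->
  <locations>
    <location type="data" path="/apps/falcon/sales/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Once the feed is submitted and scheduled, Falcon drives the incremental copy on the declared frequency, so each new hourly partition lands in the cloud target without manual intervention.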
A development environment is always separate from the production environment, and today many organizations rely on a cloud-based option for their development teams. It allows them to manage multiple environments more easily and to spin up temporary environments to fulfill short-term development requirements. In a hybrid model, you need to be able to port not just the data but the Hadoop applications as well.
Data science continues to be a major interest within many of the organizations we help with Hadoop. With Apache Hadoop YARN acting as a data operating system for a production cluster, some teams want to spin up a temporary cluster (on-premises or in the cloud) to explore data via machine learning. To do so, they need both data and some of the application logic from their existing production Hadoop environment.
In all three of these deployment models, the key to making it work is portability. You need to be able not only to move data back and forth, but also to synchronize data sets. Even more complex is keeping the “bits” consistent across environments: the same version of the entire Hadoop stack must be deployed in each environment, or you risk a job failing as it migrates from one to the next. This portability is a CRITICAL requirement for hybrid deployment of Hadoop.
Setting up a cluster is not a simple task. There are hundreds of options: not only which components of the stack to deploy, but also configuration settings that optimize the cluster for your particular use.
Two new features in Apache Ambari provide a broad set of options to simplify deployment, not just in the cloud but on-premises as well.
Ambari Blueprints make it easy to take a template of one cluster and apply it to another for seamless portability. With a Blueprint, you specify the HDP version, the component layout, and the configurations needed to materialize a Hadoop cluster instance (via a REST API) without any user interaction.
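A minimal Blueprint might look like the sketch below. The host group names, cardinalities, and the `dfs.replication` override are illustrative choices, not a recommended layout. The Blueprint is registered with a POST to Ambari's `/api/v1/blueprints/:name` endpoint, and a cluster is then instantiated from it with a POST to `/api/v1/clusters/:name` that maps real hosts onto the host groups.

```json
{
  "Blueprints": {
    "stack_name": "HDP",
    "stack_version": "2.2"
  },
  "host_groups": [
    {
      "name": "master",
      "cardinality": "1",
      "components": [
        { "name": "NAMENODE" },
        { "name": "RESOURCEMANAGER" },
        { "name": "HIVE_SERVER" }
      ]
    },
    {
      "name": "workers",
      "cardinality": "3",
      "components": [
        { "name": "DATANODE" },
        { "name": "NODEMANAGER" }
      ]
    }
  ],
  "configurations": [
    { "hdfs-site": { "dfs.replication": "3" } }
  ]
}
```

Because the same JSON document can be POSTed to an Ambari server in any environment, the Blueprint itself becomes the portable definition of the cluster.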
The “stack” for a cluster is defined by the set of components running in the environment. This might comprise Hadoop, Pig, and Hive (and more). It is typically a fairly complex list and can even be extended to non-Apache projects. With Apache Ambari you can define a stack once and have the same definition deployed across environments.
Only HDP provides the wide array of options necessary to deploy the same bits across operating systems and environments. Further, we have gone to great lengths to automate the movement of data and to manage each of these environments. You can download HDP today or try some of these features out in our HDP sandbox.