August 15, 2016

Making the Elephant Fly in the Cloud

Ram Venkatesh also contributed to this blog series 

Why Apache Hadoop in the Cloud?

Ten years ago, Hadoop, the elephant, started the Big Data journey inside the firewall of a data center: the Apache Hadoop components were deployed on commodity servers in a private data center. Now, the public cloud is another viable option for Big Data deployment. In addition to long-running workloads, the cloud is enabling many ephemeral, on-demand analytics workloads, which is game changing. The cloud can save time and money: you can spin up a Hadoop cluster with hundreds of nodes in minutes with no up-front cost, then add nodes on demand. The cloud is flexible and has virtually unlimited elastic scale: you pay only for the compute and storage that you use, and tear the cluster down when the work is completed.

Doesn’t Vanilla Hadoop on the Cloud Just Work?

Vanilla Hadoop runs in the cloud, but with caveats ranging from correctness on corner cases to performance. Further, the cloud brings unique features and properties that offer an opportunity for significant improvements in functionality toward a truly first-class cloud experience. Additionally, there are new cloud use cases that require adjustments in Hadoop components. Let us briefly cover some reasons why vanilla Hadoop does not run first-class on the cloud out of the box:

Storage – There are several considerations here. While one could make the Hadoop Distributed File System (HDFS) the primary storage system, the cloud’s native object store is preferable for the shared Data Lake. The storage and its data can then be shared across multiple Hadoop clusters and non-Hadoop cloud applications (native or third-party), as depicted in the deployment model below. As a secondary benefit, this deployment model enables use of the lower-priced cloud object storage (e.g. Windows Azure Storage Blob (WASB) or Amazon Simple Storage Service (S3)) instead of higher-priced options (e.g. Amazon Elastic Block Store (EBS)) for the Data Lake.

[Figure: deployment model – multiple Hadoop clusters and cloud applications sharing a Data Lake on cloud object storage]
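As a concrete illustration, pointing a Hadoop cluster at a cloud object store is largely a matter of configuration. The sketch below shows hypothetical `core-site.xml` entries for the S3 and WASB connectors; the account, bucket, and key names are placeholders, and real deployments should keep secrets in a credential provider rather than plain-text XML.

```xml
<!-- Sketch of core-site.xml entries for cloud object stores; the account
     and key values below are placeholders, not real credentials. -->
<configuration>
  <!-- Amazon S3 via the s3a connector: paths look like s3a://my-datalake/... -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_AWS_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_AWS_SECRET_KEY</value>
  </property>
  <!-- Azure WASB: paths look like wasb://container@myaccount.blob.core.windows.net/... -->
  <property>
    <name>fs.azure.account.key.myaccount.blob.core.windows.net</name>
    <value>YOUR_STORAGE_ACCOUNT_KEY</value>
  </property>
</configuration>
```

With such entries in place, Hadoop tools can address the object store by URI scheme, which is what lets several clusters share one Data Lake.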

Unfortunately, cloud storage poses challenges as the primary Data Lake: the object store API and its semantics do not match the normal filesystem API and semantics that Hadoop and many other applications expect. This is primarily because cloud storage has been designed for scale, a low price point, and geographic distribution, and its APIs are not designed for general file-like access. In particular, cloud object storage offers eventual consistency, which causes hiccups, especially in highly parallel executions. Furthermore, the lack of an atomic rename operation can lead to several issues, ranging from poor performance to unwanted leftover files in output directories.
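To illustrate the rename problem, the toy Python sketch below emulates how an object-store connector must implement a directory rename as a per-object copy-then-delete. The store and job layout are invented for the example; real connectors are far more involved, but the failure mode is the same.

```python
# Illustrative sketch (not Hadoop code): object stores have no rename
# primitive, so connectors emulate a directory rename as copy-then-delete,
# one object at a time. A crash midway leaves partial, duplicated output.

class FakeObjectStore:
    """Toy flat key/value store standing in for S3 or WASB."""
    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def copy(self, src, dst):
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        del self.objects[key]

    def list_prefix(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))

def emulated_rename(store, src_prefix, dst_prefix, fail_after=None):
    """'Rename' a directory by copying then deleting each object.
    fail_after simulates the client dying mid-rename."""
    for i, key in enumerate(store.list_prefix(src_prefix)):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("client died mid-rename")
        store.copy(key, dst_prefix + key[len(src_prefix):])
        store.delete(key)

store = FakeObjectStore()
for part in ("part-0000", "part-0001", "part-0002"):
    store.put("job/_temporary/" + part, b"data")

try:
    emulated_rename(store, "job/_temporary/", "job/output/", fail_after=1)
except RuntimeError:
    pass

# The job directory is now half-renamed: one object committed, two left over.
print(store.list_prefix("job/"))
```

On a real filesystem the whole commit would be a single atomic rename; here a reader can observe the half-finished state, which is exactly the leftover-files problem described above.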

Another cloud storage issue is performance. First, Hadoop’s libraries need to be adjusted to work better against object storage’s performance characteristics. For example, access to columnar data on cloud object storage may not perform well due to the random nature of its I/O. We are addressing this with changes in the Optimized Row Columnar (ORC) layer and in the storage connector implementation. Another significant and fundamental difference is the lack of locality: separation of storage from compute is an inherent cloud property, and hence its performance does not match that of local storage. We are addressing this with LLAP (Live Long and Process) caching of columnar data, and with HDFS storing intermediate data and also acting as a cache.
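The I/O issue can be sketched with a toy cost model: every ranged GET against an object store pays a round-trip overhead, so a columnar reader that coalesces nearby ranges (as a storage connector might) issues far fewer requests. The latency numbers below are invented purely for illustration.

```python
# Illustrative cost sketch (not connector code): why many small ranged GETs
# against an object store hurt columnar readers, and why coalescing adjacent
# ranges helps. The constants are made-up, not measured.

REQUEST_OVERHEAD_MS = 50   # assumed per-GET round-trip to the object store
TRANSFER_MS_PER_MB = 10    # assumed streaming cost once connected

def cost_ms(ranges):
    """Total time if each (offset_mb, length_mb) range is a separate GET."""
    return sum(REQUEST_OVERHEAD_MS + length * TRANSFER_MS_PER_MB
               for _, length in ranges)

def coalesce(ranges, max_gap_mb=1):
    """Merge ranges separated by small gaps: reading a little extra data is
    cheaper than paying another request round-trip."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap_mb:
            prev_off, _ = merged[-1]
            merged[-1] = (prev_off, offset + length - prev_off)
        else:
            merged.append((offset, length))
    return merged

# Ten 1 MB column stripes spaced 1 MB apart, as a columnar reader might request.
stripes = [(i * 2, 1) for i in range(10)]

naive = cost_ms(stripes)                # ten separate requests
coalesced = cost_ms(coalesce(stripes))  # one request, some extra bytes read
print(naive, coalesced)
```

Even in this crude model the coalesced plan is several times cheaper, which is the intuition behind tuning the ORC layer and the connectors for object-store I/O.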

Resilience in the face of VM tear-down – VM failure or tear-down in the cloud is more frequent than machine failure in an on-premise datacenter. Hadoop deployments in the cloud need to take advantage of the auto-restart features of the cloud, deal with Internet Protocol address changes, and reattach storage to the VM as appropriate (e.g. Amazon Elastic Block Store (EBS)).

Shared Metadata and Governance – In addition to long-running clusters, the cloud offers the opportunity to run multiple ephemeral clusters that share the Data Lake on cloud storage. This implies that metadata, security, and other governance need to be shared across such clusters. The Apache Hive Metastore and security modules such as Apache Ranger need to be adjusted and tuned to work in this shared mode.
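To make the shared-metadata point concrete, here is a sketch of how each ephemeral cluster’s `hive-site.xml` might point at one long-lived metastore service fronting the shared Data Lake; the hostname and bucket name are illustrative.

```xml
<!-- Sketch of hive-site.xml on each ephemeral cluster; the hostname and
     bucket name are placeholders for this example. -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://shared-metastore.example.com:9083</value>
    <description>All clusters resolve table definitions through this one
      shared metastore endpoint.</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>s3a://datalake-bucket/warehouse</value>
    <description>Default table location on the shared cloud object
      store.</description>
  </property>
</configuration>
```

Because every cluster resolves tables through the same endpoint and warehouse location, a table created by one ephemeral cluster is immediately visible to the next one that spins up.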

Security – When a Hadoop cluster is shared and allows injection of arbitrary application code (typical for Hadoop), such deployments are open to security exploits. On-premise Hadoop has addressed this issue, and that work now needs to be applied to cloud authentication. Additionally, authentication and authorization need to be seamless across the cloud and the on-premise data center.

Elasticity – Elasticity and spot pricing offer opportunities for a more first-class experience in the cloud. For example, Cloudbreak can take advantage of elastic compute in the cloud and will evolve to take advantage of spot pricing. Furthermore, YARN (Yet Another Resource Negotiator) can also evolve to take advantage of the elasticity of the cloud.

Making Hadoop Work First-Class on All Major Clouds

At Hortonworks, our cloud strategy revolves around providing a first-class cloud experience combined with enterprise-quality HDP on the major public clouds, and we are making improvements in each of the areas mentioned above. Consistent with Hortonworks’ truly open-source philosophy, we will contribute all improvements back into the Apache projects. Additionally, Hortonworks will allow customers to have a hybrid solution across cloud vendors, without cloud-vendor lock-in, and to span data applications across both the cloud and on-premise clusters.

We started our cloud journey with Microsoft Azure. Now, we are extending our Big Data solution to other clouds and have been enhancing the Hadoop libraries, including the cloud storage connectors, in partnership with the cloud vendors.

Microsoft – Microsoft Azure HDInsight is our premier Connected Data Platforms cloud solution: a fully managed cloud service powered by the Hortonworks Data Platform (HDP). Windows Azure Storage Blob (WASB) is fully certified, well tested, and has been running in production for HDInsight for a long time. Azure Data Lake Store (ADLS) is a new option that is implemented specifically for Big Data workloads and is currently in public preview.

Amazon – We recently released a Technical Preview of Hortonworks Connected Data Cloud for Amazon Web Services (AWS) for those interested in running Apache Hive (a query interface) and Apache Spark (a real-time processing engine) workloads in their Amazon environment.

Google – We are collaborating with Google to enable Hadoop workloads on Google Cloud Platform (GCP), tightly integrated with Google Cloud Storage (GCS) and BigQuery.

Hortonworks Vision – Connected Data Platform

[Figure: Connected Data Platform spanning the public cloud and the on-premise data center – connected-car example]

To tie everything together, we envision a “Connected Data Platform” that spans both the public cloud and the on-premise data center. In the era of the Internet of Things, the Connected Data Platform architecture powers modern data applications to deliver actionable intelligence, which is only possible with both data in motion and data at rest.

To further understand next-generation data use cases, let us explore the architecture in the context of a car manufacturer, as shown in the picture above. The blue hexagons represent data in motion, while the green hexagons represent data at rest, in both the cloud and the data center. In the upper left, we have connected-car data: here you want to do streaming analytics on that data close to the edge. You also need to route that data to other cloud-based use cases such as machine learning. The bottom half of the picture illustrates edge data coming from a manufacturing plant with its own streaming analytics. Routing data from the cloud into the private data center is important in order to assemble a deep historical view of the car: if you combine manufacturing-line data with operational data coming from the cars, you can unlock the root causes of manufacturing defects that are manifesting themselves in cars on the highway. This example illustrates the hybrid infrastructure: the cloud and the data center, covering both data at rest and data in motion.

Overview of Rest of the Blog Series: Machinery Under the Hood

We are striking a delicate balance: while we follow a nimble, agile development cadence to serve our cloud customers, we want to stay true to the enterprise-grade performance and quality that enterprise customers expect. This is a journey that starts with enabling both ephemeral and long-running Big Data workloads in the cloud and then connecting on-premise clusters with cloud clusters. We are laying the foundational groundwork, evolving toward a single framework for deployment and management, and for security and governance. Below, we capture the high-level themes of the follow-on blogs and set the stage for more in-depth commentaries from our engineering team; the work broadly applies to our Big Data solutions on Microsoft Azure, Google Cloud Platform (GCP), and Amazon Web Services (AWS).

“Baseline Performance” – We are making the Hadoop components on the cloud perform many times better than vanilla Hadoop on the cloud. The work cuts across the Hadoop stack and will be ongoing: in the base cloud storage connector (e.g. the Amazon Simple Storage Service (S3) connector), in the upper layers (e.g. the ORC file format), and in the components (e.g. Hive). While the first is cloud-vendor-specific, the last two apply to all cloud providers. We are currently observing 2x to 14x improvements on individual Apache Hive queries with our recent Amazon S3 connector; the chart below captures the total response time of an Apache Hive test benchmark. Microsoft and Hortonworks collaborated on similar work and performance enhancements for Microsoft Azure HDInsight, which will be captured in a separate blog post.

[Chart: total response time of an Apache Hive test benchmark with the new Amazon S3 connector]

“Consistency” – We are providing strongly consistent behavior on top of the eventually consistent Amazon S3. Similar consistency work has been done for Azure HDInsight, in collaboration with Microsoft. Last but not least, Google has a connector that deals with consistency; Hortonworks and Google are collaborating to certify the offering with the Hortonworks Data Platform (HDP).
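The idea behind such a consistency layer can be sketched as follows: record every write in a small, strongly consistent metadata store, and merge it with the (possibly stale) object-store listing. The toy Python model below is only an illustration of the approach, not the actual implementation.

```python
# Illustrative sketch of a consistency overlay: a strongly consistent
# metadata record is merged with an eventually consistent listing so that
# a client always sees its own writes. Not the real connector code.

class EventuallyConsistentStore:
    """Toy store whose listing can lag behind recent writes."""
    def __init__(self):
        self.objects = {}
        self.visible = set()      # keys that listings currently show

    def put(self, key, data):
        self.objects[key] = data  # write lands, but may not list yet

    def settle(self):
        self.visible = set(self.objects)  # eventual consistency catches up

    def list_keys(self):
        return set(self.visible)

class ConsistentListingClient:
    """Wraps the store and tracks its own writes in consistent metadata."""
    def __init__(self, store):
        self.store = store
        self.metadata = set()     # stand-in for a consistent side store

    def put(self, key, data):
        self.store.put(key, data)
        self.metadata.add(key)    # recorded alongside the write

    def list_keys(self):
        # Union of what the store admits to and what we know we wrote.
        return self.store.list_keys() | self.metadata

store = EventuallyConsistentStore()
client = ConsistentListingClient(store)
client.put("output/part-0000", b"rows")

# A raw listing can miss the fresh write; the overlay does not.
print(sorted(store.list_keys()), sorted(client.list_keys()))
```

The key property is that a highly parallel job reading through the overlay never misses files it just wrote, even while the underlying store’s listing is still catching up.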

“Caching Performance” – We are bringing cloud performance on par with on-premise HDP, removing friction from the cloud journey. This is achieved with LLAP caching of columnar data (targeted for Hortonworks Data Platform 2.5) and with HDFS storing and caching intermediate data (future).
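The caching idea can be illustrated with a toy LRU cache for column blocks: repeated scans of hot data are served from memory instead of making another trip to remote storage. This sketch is a simplified model of the general technique, not LLAP code.

```python
# Illustrative LRU-cache sketch of the columnar-caching idea: hot column
# blocks stay in memory, so repeated scans avoid remote object-store reads.

from collections import OrderedDict

class ColumnBlockCache:
    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity
        self.fetch_remote = fetch_remote   # slow path to object storage
        self.blocks = OrderedDict()        # block_id -> bytes, LRU order
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = self.fetch_remote(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return data

remote_reads = []
cache = ColumnBlockCache(
    capacity=2,
    fetch_remote=lambda b: remote_reads.append(b) or b"column-bytes")

# A second scan of the same stripe hits the cache; a new stripe does not.
for block in ["orc/stripe1/colA", "orc/stripe1/colA", "orc/stripe2/colA"]:
    cache.read(block)

print(cache.hits, cache.misses, len(remote_reads))
```

In a real deployment the win compounds: the per-request overhead and lack of locality described earlier make each avoided remote read far more valuable than on local disks.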
Please stay tuned for the rest of the blog series!



Syamsul Anuar says:

Any such optimization planned for openstack?

Steve Loughran says:

1. We’re working on the S3a connector right now; it’s the most heavily used. Microsoft have been putting effort into the Azure binding.
2. Once we are happy with the S3a work, pulling some of the optimizations into OpenStack, ideally by sharing common code, is an obvious action.
3. Everything we can improve in the layers above: ORC, Hive, Spark, …, will benefit all object stores, even things not in the ASF code, such as the Google Cloud Storage client.

Andy says:

Gents, too little, too late. Making the elephant fly is necessary but not sufficient. Big Data is much more than Hadoop.

Hariharan says:

This is some really interesting work. Have you filed JIRAs for this work or anything else we could look at to check the details?


Ashish says:

Good to know these all
