Ram Venkatesh also contributed to this blog series
Ten years ago, Hadoop the elephant began its Big Data journey inside the firewall of a data center – the Apache Hadoop components were deployed on commodity servers in a private data center. Now the public cloud is another viable option for Big Data deployments. In addition to long-running workloads, the cloud enables many ephemeral, on-demand analytics workloads, which is game changing. The cloud can save time and money – you can spin up a Hadoop cluster with hundreds of nodes in minutes with no up-front cost, and then add nodes on demand. The cloud is flexible and has elastic scale – you pay only for the compute and storage that you use, and you tear the cluster down when the work is complete.
Vanilla Hadoop runs in the cloud, but with caveats ranging from correctness on corner cases to performance. Further, the cloud brings unique features and properties that offer an opportunity for significant improvements and functionality toward a truly first-class cloud experience. Additionally, there are new cloud use cases that require adjustments in Hadoop components. Let us briefly cover some reasons why vanilla Hadoop does not simply run first-class on the cloud:
Storage – There are several considerations here. While one could make the Apache Hadoop Distributed File System (HDFS) the primary storage system, the cloud’s native object store is preferable for the shared Data Lake. The storage and its data can then be shared across multiple Hadoop clusters and non-Hadoop cloud applications (native or 3rd-party) – as depicted in the deployment model below. As a secondary benefit, this deployment model enables use of lower-priced cloud object storage (e.g. Windows Azure Storage Blob (WASB) or Amazon Simple Storage Service (S3)) instead of higher-priced options (e.g. Amazon Elastic Block Storage (EBS)) for the Data Lake.
Unfortunately, cloud storage poses challenges as the primary Data Lake: the object store API and its semantics do not match the filesystem API and semantics that Hadoop and many conventional applications expect. This is primarily because cloud storage has been designed for scale, a low price point and geographic distribution, and its APIs are not designed for general file-like access. In particular, cloud object storage offers eventual consistency, which causes hiccups, especially in highly parallel executions. Furthermore, the lack of an atomic rename operation can lead to several issues, ranging from poor performance to unwanted left-over files in output directories.
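To make the listing-consistency problem concrete, here is a minimal, self-contained Python sketch – a toy model, not actual connector code. It simulates an object store whose listings lag behind writes, plus a retry loop that waits until all expected output files become visible, similar in spirit to the consistent-view layers that have been built on top of S3. All names here (`EventuallyConsistentStore`, `wait_for_keys`) are hypothetical.

```python
import time

class EventuallyConsistentStore:
    """Toy object store: a written key only shows up in listings
    after a few list calls, mimicking eventual consistency."""

    def __init__(self, visibility_delay=2):
        self.visibility_delay = visibility_delay
        self._pending = {}   # key -> remaining list calls before visible
        self._visible = set()

    def put(self, key):
        self._pending[key] = self.visibility_delay

    def list_keys(self):
        # Each listing "ages" pending keys until they become visible.
        for key in list(self._pending):
            self._pending[key] -= 1
            if self._pending[key] <= 0:
                self._visible.add(key)
                del self._pending[key]
        return set(self._visible)

def wait_for_keys(store, expected, attempts=10, delay=0.0):
    """Retry listing until all expected keys are visible."""
    for _ in range(attempts):
        if expected <= store.list_keys():
            return True
        time.sleep(delay)
    return False
```

A job that writes `part-*` files and then immediately lists the output directory can miss files in such a store; a downstream consumer either retries as above or relies on a strongly consistent metadata layer.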
Another cloud storage issue is performance. First, Hadoop’s libraries need to be adjusted to work better against object storage’s performance characteristics. For example, access to columnar data on cloud object storage may not perform well due to the random nature of its I/O. We are addressing this with changes in the Optimized Row Columnar (ORC) layer and in the storage connector implementation. Another significant and fundamental difference is the lack of locality – separation of storage from compute is an inherent cloud property, so its performance does not match that of local storage. We are addressing this with LLAP (Live Long and Process) caching of columnar data, and with HDFS storing intermediate data and also acting as a cache.
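One common mitigation for random-I/O cost on high-latency object stores is coalescing nearby byte-range reads into fewer, larger requests. The short sketch below illustrates the idea; the function name and the gap threshold are hypothetical, not HDP or ORC code.

```python
def coalesce_ranges(ranges, max_gap=1024):
    """Merge byte ranges whose gap is at most max_gap bytes,
    trading a little wasted read bandwidth for far fewer
    round trips to the object store.

    `ranges` is a list of (start, end) tuples, end exclusive.
    """
    if not ranges:
        return []
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With per-request latencies in the tens of milliseconds, turning many small stripe-footer and column reads into a handful of coalesced GETs can dominate query response time.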
Resilience in the face of VM tear-down – VM failure or tear-down in the cloud is more frequent than machine failure in an on-premise datacenter. Hadoop deployments in the cloud need to take advantage of the auto-restart features of the cloud, deal with Internet Protocol address changes, and reattach storage to the VM as appropriate (e.g. Elastic Block Storage (EBS)).
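A generic building block for this kind of resilience is retrying transient failures with exponential backoff. The sketch below is purely illustrative – it is not code from Hadoop or any Hortonworks component, and the helper name is made up.

```python
import time

def with_retries(fn, attempts=5, base_delay=0.1, retryable=(ConnectionError,)):
    """Call fn(), retrying transient failures with exponential backoff.

    Raises the last error if all attempts fail.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise
            # Back off: 0.1s, 0.2s, 0.4s, ... before the next try.
            time.sleep(base_delay * (2 ** attempt))
```

The same pattern – bounded retries around operations that may fail because a VM went away or an address changed – shows up throughout cloud-aware storage connectors and cluster managers.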
Shared Metadata and Governance – In addition to long-running clusters, the cloud offers the opportunity to run multiple ephemeral clusters that share the Data Lake on cloud storage. This implies that metadata, security and other governance need to be shared across such clusters. The Metastore of Apache Hive and security modules like Apache Ranger need to be adjusted and tuned to work in this shared mode.
Security – When a Hadoop cluster is shared and allows injection of arbitrary application code (typical for Hadoop), such deployments are open to security exploits. On-premise Hadoop has addressed this issue, and the same approach now needs to be applied to cloud authentication. Additionally, there needs to be seamless authentication and authorization across the cloud and the on-premise data center.
Elasticity – Elasticity and spot pricing offer opportunities for a more first-class experience in the cloud. For example, Cloudbreak can take advantage of elastic compute in the cloud and will evolve to take advantage of spot pricing. Furthermore, YARN (Yet Another Resource Negotiator) can also evolve to take advantage of the elasticity of the cloud.
At Hortonworks, our cloud strategy revolves around providing a first-class cloud experience combined with enterprise-quality HDP on the major public clouds, and we are making improvements in each of the above-mentioned areas. Consistent with Hortonworks’ truly open-source philosophy, we will contribute all improvements back into the Apache projects. Additionally, Hortonworks will allow customers to build a hybrid solution across cloud vendors, without cloud vendor lock-in, and to span data applications across both the cloud and on-premise clusters.
We started our cloud journey with Microsoft’s Azure. Now we are extending our Big Data solution to other clouds and have been enhancing the Hadoop libraries, including the cloud storage connectors, in partnership with the cloud providers.
Microsoft – Microsoft Azure HDInsight is our Premier Connected Data Platforms Cloud Solution for a fully-managed cloud service powered by the Hortonworks Data Platform (HDP). Windows Azure Storage Blob (WASB) is fully certified, well-tested and has been running in production for HDInsight for a long time. Azure Data Lake Store (ADLS) is a new option that is implemented specifically for Big Data workloads and is currently in public preview.
Amazon – We recently released Technical Preview of Hortonworks Connected Data Cloud for Amazon Web Services (AWS) for those interested in running Apache Hive (a query interface) and Apache Spark (a real-time processing engine) on workloads in their Amazon environment.
To tie everything together – we envision a “Connected Data Platform” that spans both the public cloud and the on-premise data center. In the era of the Internet of Things, the “Connected Data Platform” architecture powers modern data applications to deliver the value of actionable intelligence, only possible with both data in motion and data at rest. To further understand the next-generation data use cases, we want to explore the architecture in the context of a car manufacturer, as shown in the picture above. The blue hexagon represents data in motion, while the green hexagon represents data at rest, in both the cloud and data center contexts. In the upper left, we have connected-car data – here you want to do some streaming analytics on that data close to the edge. You also need to route that data to other cloud-based use cases such as machine learning. The bottom half of the picture illustrates edge data coming from a manufacturing plant, with its own streaming analytics. Routing data from the cloud into the private data center is important to assemble a deep historical view of the car: if you combine manufacturing line data with operational data coming from the cars, you can root-cause manufacturing defects that are manifesting themselves in cars on the highway. This example illustrates the hybrid infrastructure – the cloud and the data center – covering both data-at-rest and data-in-motion.
We are striking a delicate balance – while we follow a nimble, agile development cadence to serve our cloud customers, we want to stay true to the enterprise-grade performance and quality that enterprise customers expect. This is a journey that starts with enabling both ephemeral and long-running Big Data workloads in the cloud and then connecting the on-premise clusters with the cloud clusters. We are laying the foundational groundwork, evolving towards a single framework for deployment & management and security & governance. Below, we capture the high-level themes of the follow-on blogs and set the stage for more in-depth commentaries from our engineering team – the work broadly applies to our Big Data solutions on Microsoft Azure, Google Cloud Platform (GCP) and Amazon Web Services (AWS).
“Baseline Performance” – we are making the Hadoop components on the cloud perform many times better than vanilla Hadoop on the cloud. The work cuts across the Hadoop stack and will be ongoing – in the base cloud storage connector (e.g. the Amazon Simple Storage Service (S3) connector); in the upper layers (e.g. the ORC file format); and in the components (e.g. Hive). While the first is cloud-vendor-specific, the last two apply to all cloud providers. We are currently observing 2x to 14x improvements on individual Apache Hive queries with our recent Amazon S3 connector – the chart below captures the total response time of an Apache Hive test benchmark. Microsoft and Hortonworks collaborated on similar work and performance enhancement for Microsoft Azure HDInsight, which will be captured in a separate blog post.
“Consistency” – we are providing strongly consistent behavior on top of the eventually consistent Amazon S3. Similar consistency work has been done for Azure HDInsight, in collaboration with Microsoft. Last but not least, Google has a connector that deals with consistency – Hortonworks and Google are collaborating to certify that offering with the Hortonworks Data Platform (HDP).
“Caching Performance” – we are bringing cloud performance on par with on-premise HDP, removing any friction in the cloud journey. This is achieved with LLAP caching columnar data (targeted for Hortonworks Data Platform version 2.5) and with HDFS storing intermediate data and caching (future).
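As a rough intuition for what caching columnar data involves, here is a minimal LRU block cache in Python. This is a toy sketch of the general technique only – LLAP’s actual cache is far more sophisticated – and the class and key shape are hypothetical.

```python
from collections import OrderedDict

class BlockCache:
    """Minimal LRU cache mapping (file, stripe) keys to column blocks.

    Hot blocks stay resident; the least recently used block is
    evicted when capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None          # cache miss: caller reads from cloud storage
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, block):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = block
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```

The payoff in the cloud is that a cache hit avoids a high-latency round trip to the object store entirely, which is why caching can bring cloud query performance close to local-storage performance for hot data.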
Please stay tuned for the rest of the blog series!