The new year brings new innovation and collaborative efforts. Various teams from the Apache community have been working hard for the last eighteen months to bring the EZ button to Apache Hadoop technology and Data Lake. In the coming months, we will publish a series of blogs introducing our Data Lake 3.0 architecture and highlighting our innovations within Apache Hadoop core and its related technologies.
You probably heard of the Deep Learning powered cucumber sorter from a Japanese farmer Makoto Koike! In their cucumber farm, Makoto’s mother spends up to eight hours per day classifying cucumbers into different classes. Makoto is a trained embedded systems designer but not a trained “Machine Learning” engineer. He leveraged TensorFlow, a deep learning framework, with minor configurations to automate his mom’s complex art of cucumber sorting so that they can focus more on cucumber farming instead.
This simple, yet powerful example mirrors the trip we have embarked on with our valued enterprise customers to reduce the time to deployment and insight (from days to minutes), while reducing the Total Cost of Ownership (TCO) by 2x. Instead of a component-centric approach, we envision an application-centric Data Lake 3.0. If you look back, Data Lake 1.0 was a single use system for batch applications and Data Lake 2.0 was a multi-use platform for batch, interactive, online and streaming components. In Data Lake 3.0, we want to deploy pre-packaged applications with minor customizations and the focus will shift from the platform management to solving the business problems.
We begin with a few real-world problems – ranging from simple to complex. The common threads behind the Data Lake 3.0 architecture are: reduce the time to deployment; reduce the time to insight; reduce the TCO of a Petabyte (PB) scale Hadoop infrastructure, while increasing utilization of the cluster with additional workloads.
Our Data Lake 3.0 vision requires us to execute on a complex set of machinery under the hood. While not an exhaustive list, the following sections provides a high-level overview of capabilities and we are setting the stage with this introductory blog.
Application Assemblies: A baseline set of services running on bare-metal facilitates running dockerized services for a longer duration. We can leverage the benefits of docker packaging and distribution, along with the isolation. We can cut down the “time to deployment” from days to minutes and enable use cases, such as running multiple versions of applications side by side; running multiple Hortonworks Data Platform (HDP) clusters logically on a single data lake; running use-case focused data intensive micro-services that we refer to as “Assemblies”.
Storage Enhancements: Naturally, we store the datasets in a single Hadoop data lake to increase the analytics efficacy and reduce the siloes, while providing multiple logical application-centric services on top. Data needs to be kept for many years in an active archive fashion. Depending on the access pattern and temperature, the data needs to sit on both fast (Solid State Drive) and slow (Hard Drive) media. This is where Reed Solomon based Erasure Coding plays a pivotal role in reducing the storage overhead by 2x (vs. existing 3 replica approach) especially for cold storage. In future, we intend to provide an “auto-tiering” mechanism to move the data between hot and cold tiers of media automagically. Liberated from the storage overhead and TCO burden, customers can now retain data for many years. Features such as three NameNode configuration make sure that the administrator has a large servicing window just in case, the Active NameNode goes down on a Saturday night.
Resource Isolation & Sharing: Compute intensive analytics such as deep learning require not only a large compute pool, but also a fast and expensive processing pool made of Graphic Processing Unit (GPU)s in tandem to cut the time of insight from months to days. We intend to provide a resource vector attribute that can be mapped to the cluster-wide GPU resources -so, a customer does not have to dedicate a GPU node to a single tenant or workload. In addition to providing CPU and Memory level isolation, we will provide Network and IO level isolation between tenants and facilitate dynamic allocation of the resources.
At Hortonworks, we are incredibly lucky to be guided by many of the world’s advanced analytics users, representing a wide set of verticals in our customer advisory and briefing meetings. Based on their invaluable input, we are on an exciting journey to supercharge Apache Hadoop. Our trip will have many legs, however, 2017 is going to be the exciting year to deliver on many of our promises. If you have made this far, I encourage you to follow this blog series, as we continue to provide more detailed updates from our rockstar technology leaders.
Read the next blog in this series: Data Lake 3.0 Part 2 – A Multi-Colored YARN