With the flurry of Apache Software Foundation projects that have emerged in recent years in and around the Apache Hadoop project, a common question I get from mainstream enterprises is: What is the definition of Hadoop?
This question goes beyond the Apache Hadoop project itself, since most folks know that it’s an open source technology born out of the experience of web-scale consumer companies such as Yahoo!, Facebook and others who were confronted with the need to store and process massive quantities of data. The question is more about making sense of the wide range of innovative projects that help make Hadoop more relevant and useful for mainstream enterprises.
Before answering, I usually reframe the question as: What is Enterprise Hadoop?
To fully frame out the “Enterprise Hadoop” context, I draw a diagram with 8 key areas worth talking about. The 3 gray boxes set the broader context within the enterprise, while the 5 green boxes outline the core capabilities required of Enterprise Hadoop.
[Figure: Enterprise Hadoop architecture diagram, with 3 gray context boxes and 5 green core-capability boxes]
It all starts with the Presentation & Application box, which covers the new and existing applications that will leverage and derive value from data stored in Hadoop. To maximize its impact, Enterprise Hadoop also needs to support a wide range of Deployment Options spanning physical, virtual, and cloud. And since we’re talking about real applications and data important to the business, the Enterprise Hadoop platform needs to integrate with existing Enterprise Management & Security tools and processes.
That leaves us with the following 5 areas of core Enterprise Hadoop capabilities: Data Management, Data Access, Data Governance & Integration, Security, and Operations.
As Apache Hadoop has taken its role in enterprise data architectures, vendors and users alike have contributed a host of open source projects to the Apache Software Foundation (ASF) that greatly expand Hadoop’s capabilities as an enterprise data platform. Many of the “committers” for these open source projects are Hortonworks employees. For those unfamiliar with the term, committers are the talented individuals who devote their time and energy to specific Apache projects: adding features, fixing bugs, and reviewing and approving changes submitted by others interested in contributing to the project.
At Hortonworks, we have over 100 committers authoring code and providing stewardship within the Apache Community across a wide range of Hadoop-related projects. Since we are focused on serving the needs of mainstream enterprise users, we have a rigorous engineering process and related test suites that integrate, test, and certify this wide range of projects at scale into an easy-to-use Enterprise Hadoop platform called the Hortonworks Data Platform. The talented people on our engineering team provide the foundation for the industry-leading support and services that we deliver to the market directly or through our partners.
At Hortonworks, we’ve maintained a consistent focus on enabling Hadoop to be an enterprise-viable data platform that uniquely powers a new generation of data-driven applications and analytics. Let’s take a look at the 5 areas of core Enterprise Hadoop and the Hortonworks Data Platform in more detail.
Data Management: The Hadoop Distributed File System (HDFS) provides the foundation for storing data in any format at scale across low-cost commodity hardware. YARN, introduced in the Apache Hadoop 2 release, is a must-have for Enterprise Hadoop deployments since it acts as the platform’s data operating system – providing the resource management and pluggable architecture for enabling a wide variety of data access methods to operate on data stored in HDFS with predictable performance and service levels.
Data Access: While classic batch-oriented MapReduce applications are important, thanks to the introduction of YARN they are no longer the only workloads that can run natively in Hadoop. Technologies for scripting, SQL, NoSQL, search, and streaming are integrated into the Hortonworks Data Platform. Apache Pig provides scripting capabilities, and Apache Hive is the de facto standard SQL engine for both batch and interactive SQL data access and is proven at petabyte scale. Apache HBase is a popular column-oriented NoSQL database, and Apache Accumulo, with its cell-level security, is used in high-security NoSQL use cases. Apache Storm supports the real-time stream processing commonly needed for sensor and machine data use cases. Other data access engines include Apache Spark for in-memory iterative analytics, and a wide range of 3rd-party ISV solutions are expected to plug into the platform over 2014 and beyond. Thanks to YARN, all of these data access engines can work across one set of data in a coordinated and predictable manner.
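For readers new to the batch model mentioned above, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the workload. It is a toy illustration in plain Python, not the Hadoop MapReduce API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the grouped values for one key into a single count."""
    return key, sum(values)


# Hypothetical input; on a real cluster these lines would be read from HDFS.
lines = ["hadoop stores data", "yarn schedules work", "hadoop runs yarn"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the map and reduce steps run in parallel across many nodes and the shuffle moves data over the network; the sketch only shows the shape of the computation.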
Data Governance & Integration: Apache Falcon provides policy-based workflows for governing the lifecycle and flow of data in Hadoop, including disaster recovery and data retention use cases. For data ingest, Apache Sqoop makes it easy to bring data from other databases into Hadoop, and Apache Flume enables logs to flow easily into Hadoop. NFS and WebHDFS interfaces provide familiar and flexible ways to store and interact with data in HDFS.
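To give a feel for the WebHDFS interface, the sketch below builds request URLs in the documented `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>` form. It only constructs the URLs and sends nothing. The helper name `webhdfs_url`, the host name, and the paths are hypothetical, and the default port of 50070 is an assumption based on the usual NameNode HTTP port in Hadoop 2; check your own cluster's configuration.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation name.

    The port default is an assumption (typical Hadoop 2 NameNode HTTP port).
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation goes first, followed by any operation-specific parameters.
    query = urlencode({"op": op.upper(), **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and paths, for illustration only.
listing_url = webhdfs_url("namenode.example.com", "/data/logs", "liststatus")
open_url = webhdfs_url(
    "namenode.example.com", "/data/logs/app.log", "open", offset=0, length=1024
)
```

An HTTP GET against `listing_url` would return a JSON directory listing, and `open_url` would read the first 1024 bytes of the file, assuming the cluster has WebHDFS enabled and the caller is authorized.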
Security: Security is handled at every layer of the Hadoop stack, providing a holistic approach to authentication, authorization, accounting, and data protection: from the HDFS storage and YARN resource management layers, to data access components such as Hive and the overall data pipelines coordinated by Falcon, on up through the perimeter of the entire cluster via Apache Knox.
Operations: Apache Ambari offers a comprehensive solution including the necessary user interface and REST APIs for enabling operators to provision, manage and monitor Hadoop clusters as well as integrate with other enterprise management solutions.
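As a rough illustration of how an external management tool might talk to Ambari's REST API, the sketch below prepares (but does not send) an authenticated GET request for the `/api/v1/clusters` resource. The helper name `ambari_request`, the host name, the credentials, and the default port of 8080 are assumptions for this example, not values from any particular deployment.

```python
import base64
from urllib.request import Request


def ambari_request(host, resource, user, password, port=8080):
    """Prepare a GET request for an Ambari REST resource using HTTP Basic auth.

    Host, port, and credentials are placeholders; nothing is sent here.
    """
    url = f"http://{host}:{port}/api/v1/{resource.lstrip('/')}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})


# Hypothetical server and credentials, for illustration only.
req = ambari_request("ambari.example.com", "clusters", "admin", "admin")
```

Passing `req` to `urllib.request.urlopen` would issue the call and return a JSON list of the clusters the server manages, assuming it is reachable and the credentials are valid.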
As you can see, the Hortonworks Data Platform addresses everything that’s needed from an Enterprise Hadoop solution – all delivered as a 100% open source platform that you can rely on.
To learn more about Enterprise Hadoop and how it powers the Modern Data Architecture including the common journey from new analytic applications to a Data Lake, I encourage you to download our whitepaper here.