What is Apache Hadoop?
Apache™ Hadoop® is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.
Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists of three main functions: storage, processing and resource management.
Storage – HDFS
Storage is accomplished with the Hadoop Distributed File System (HDFS) – a reliable and distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
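As a rough illustration of how a distributed file system can spread one large file across many servers, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MiB block size and the replication factor of 3 match common HDFS defaults, but the round-robin placement, the `plan_blocks` function and the node names are assumptions made for this example. Real HDFS placement is rack-aware and more involved.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, a common HDFS default block size
REPLICATION = 3                 # a common HDFS default replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a toy placement plan: one (block_id, replica_nodes) entry per block."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    plan = []
    for block_id in range(num_blocks):
        # Hypothetical round-robin placement, used here only for illustration.
        replicas = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        plan.append((block_id, replicas))
    return plan


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical server names
plan = plan_blocks(file_size=300 * 1024 * 1024, nodes=nodes)
for block_id, replicas in plan:
    print(block_id, replicas)
```

Because every block lives on several nodes, losing a single server in this toy model leaves at least two copies of each block available.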
Processing – MapReduce
Computation in Hadoop is based on the MapReduce paradigm, which distributes tasks across a cluster of coordinated “nodes.” Like the rest of Hadoop, it is designed to run on commodity hardware and to scale up or down without system interruption.
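The MapReduce paradigm can be sketched in a few lines of plain Python. The example below is a single-process word count, not Hadoop's actual API: the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for this illustration, and on a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["Hadoop stores data", "Hadoop processes data"]
counts = word_count(lines)
print(counts)
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, which is what allows the work to be distributed across nodes.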
Resource Management – YARN
YARN performs the resource management function in Hadoop 2.0 and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
A Hadoop Distribution
A number of supporting ASF projects enable the integration of core Apache Hadoop into a data center environment. Typically, these projects are packaged into a Hadoop “distribution,” which is a tested and hardened set of projects that simplifies a Hadoop implementation. Hortonworks Data Platform is an example of a distribution and is the only 100% Apache Hadoop distribution.
The distribution package is crucial because it ensures version compatibility among projects and, more importantly, is typically subjected to significant testing to ensure that it is reliable and stable.
The Ecosystem of Hadoop-Related Projects
There are numerous ASF projects included in a distribution. Each of them has been developed to deliver an explicit function, and each has its own community of developers and its own release cycle.
- Apache Hadoop MapReduce
MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
- Apache Hadoop HDFS
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.
- Apache Hadoop YARN
Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
- Apache Tez
Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
- Apache Pig
A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.
- Apache HCatalog
A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
- Apache Hive
Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
- Apache HBase
A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
- Apache Storm
Storm is a distributed real-time computation system for processing fast, large streams of data, adding reliable real-time data processing capabilities to Apache Hadoop® 2.x.
- Apache Mahout
Mahout provides scalable machine learning algorithms for Hadoop that support data science tasks such as clustering, classification and batch-based collaborative filtering.
- Apache Accumulo
Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Bigtable design that works on top of Apache Hadoop and Apache ZooKeeper.
- Apache Flume
Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
- Apache Sqoop
Sqoop is a tool that speeds and eases the movement of data into and out of Hadoop. It provides reliable parallel loading for a variety of popular enterprise data sources.
- Apache ZooKeeper
A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
- Apache Ambari
An open source system for installation, lifecycle management, administration and monitoring of Apache Hadoop clusters.
- Apache Oozie
Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
- Apache Falcon
Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows.
- Apache Knox
The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.
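Several of the projects above, most directly Apache Tez, describe work as a directed acyclic graph (DAG) of tasks. The sketch below illustrates only the general idea of ordering tasks so that each one runs after the tasks it depends on. It uses Python's standard `graphlib` module rather than any Tez API, and the task names and the `run_dag` function are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dag = {
    "load_logs": set(),
    "load_users": set(),
    "join": {"load_logs", "load_users"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}


def run_dag(dependencies):
    """Visit tasks in an order that respects dependencies and return that order."""
    order = []
    for task in TopologicalSorter(dependencies).static_order():
        order.append(task)  # a real engine would execute the task here
    return order


order = run_dag(dag)
print(order)
```

The two load tasks have no dependency on each other, so an engine is free to run them in either order or in parallel, while `join` must wait for both.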