What is Apache Hadoop?

The open source framework for storing and extracting insight from massive volumes of data.

Apache Hadoop is an open source framework for the distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.

Enterprise Hadoop: The Ecosystem of Projects

Numerous Apache Software Foundation projects make up the services an enterprise requires to deploy, integrate and work with Hadoop. Each has been developed to deliver an explicit function, and each has its own community of developers and individual release cycle.

[Diagram: the Enterprise Hadoop ecosystem of projects.
  • Presentation & Applications: enable both existing and new applications to provide value to the organization.
  • Enterprise Management & Security: empower existing operations and security tools to manage Hadoop.
  • Governance & Integration: data workflow, lifecycle & governance.
  • Data Access: access your data simultaneously in multiple ways (batch, interactive, real-time) via script, SQL, NoSQL, stream, search and in-memory engines.
  • Data Management: store and process your corporate data assets in HDFS, the Hadoop Distributed File System.
  • Security: authentication, authorization, accounting & data protection.
  • Operations: provision, manage & monitor clusters; scheduling.
  • Deployment choice: Linux & Windows, on premise or cloud/hosted.]

Data Management. Store and process vast quantities of data in a scale-out storage layer.

Hadoop Distributed File System (HDFS) is the core technology for the efficient scale-out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the prerequisite for Enterprise Hadoop: it provides the resource management and pluggable architecture that enable a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels.

  • HDFS
    Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers. A minimal usage sketch follows this list.
  • Apache Hadoop YARN
    Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
  • Apache Tez
    Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
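
To make the storage layer concrete, here is a minimal sketch of writing and re-reading a file through the HDFS Java FileSystem API. The NameNode address is a placeholder; in a real deployment, clients usually inherit fs.defaultFS from core-site.xml on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            // The NameNode address below is an assumption for this sketch.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/tmp/hello.txt");

            // Write a small file into the cluster (overwrite if present).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }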

Data Access. Interact with your data in a wide variety of ways – from batch to real-time.

Apache Hive is the most widely adopted data access technology, though there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing, Apache HBase offers columnar NoSQL storage and Apache Accumulo offers cell-level access control. All of these engines can work across one set of data and resources thanks to YARN. YARN also provides flexibility for new and emerging data access methods, such as search and programming frameworks like Cascading.
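
As an illustration of the SQL access path, the following sketch submits a HiveQL query through the HiveServer2 JDBC driver. The server address, credentials and the weblogs table are assumptions made for the example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Older JDBC setups need the driver registered explicitly.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, credentials and table are assumptions for this sketch.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }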

  • MapReduce
    MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. The classic word-count example is sketched after this list.
  • Apache Pig
    A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing those programs.
  • Apache HCatalog
    A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
  • Apache Hive
    Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
  • Apache HBase
    A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
  • Apache Storm
    Storm is a distributed real-time computation system for processing fast, large streams of data, adding reliable real-time data processing capabilities to Apache Hadoop® 2.x.
  • Apache Mahout
    Mahout provides scalable machine learning algorithms for Hadoop that aid data science tasks such as clustering, classification and batch-based collaborative filtering.
  • Apache Accumulo
    Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Bigtable design that works on top of Apache Hadoop and Apache ZooKeeper.
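
The canonical illustration of the MapReduce model referenced above is word count: the mapper emits (word, 1) pairs and the reducer sums them per word. A compact version using the Hadoop Java API:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts collected for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, a job like this is typically submitted with hadoop jar wordcount.jar WordCount <input-dir> <output-dir>.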

Data Governance & Integration. Quickly and easily load data, and manage according to policy.

Apache Falcon provides policy-based workflows for governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS (a WebHDFS ingestion sketch follows the list below).

  • Apache Falcon
    Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows.
  • Apache Flume
    Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
  • Apache Sqoop
    Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various popular enterprise data sources.
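
As a sketch of REST-based ingestion over the WebHDFS interface mentioned above: creating a file is a two-step exchange in which the NameNode redirects the client to a DataNode. The host names, port and user below are assumptions for the example.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class WebHdfsPut {
        public static void main(String[] args) throws Exception {
            // NameNode host, port and user.name are assumptions for this sketch.
            String create = "http://namenode:50070/webhdfs/v1/tmp/events.log"
                    + "?op=CREATE&user.name=hdfs&overwrite=true";

            // Step 1: ask the NameNode where to write; it answers with a
            // 307 redirect whose Location header points at a DataNode.
            HttpURLConnection nn = (HttpURLConnection) new URL(create).openConnection();
            nn.setRequestMethod("PUT");
            nn.setInstanceFollowRedirects(false);
            String dataNodeUrl = nn.getHeaderField("Location");
            nn.disconnect();

            // Step 2: send the file contents to that DataNode.
            HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
            dn.setRequestMethod("PUT");
            dn.setDoOutput(true);
            try (OutputStream out = dn.getOutputStream()) {
                out.write("first event\n".getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + dn.getResponseCode()); // 201 Created on success
            dn.disconnect();
        }
    }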

Security. Address requirements of Authentication, Authorization, Accounting and Data Protection.

Security is provided at every layer of the Hadoop stack, from HDFS and YARN to Hive and the other data access components, on up through the entire perimeter of the cluster via Apache Knox.

  • Apache Knox
    The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access cluster data and execute jobs, and for operators who control access and manage the cluster.
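
A brief sketch of what perimeter access looks like: the client calls WebHDFS through the Knox gateway and authenticates once at that single point. The gateway host, topology name ("default") and demo credentials are assumptions, and the gateway's TLS certificate is presumed trusted by the JVM.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class KnoxListStatus {
        public static void main(String[] args) throws Exception {
            // Gateway host, topology and credentials are assumptions here.
            URL url = new URL("https://knox.example.com:8443/gateway/default"
                    + "/webhdfs/v1/tmp?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            // One HTTP Basic handshake at the perimeter; the client never
            // talks to the NameNode or DataNodes directly.
            String auth = Base64.getEncoder().encodeToString(
                    "guest:guest-password".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON listing of /tmp
                }
            }
        }
    }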

Operations. Provision, manage, monitor and operate Hadoop clusters at scale.

Apache Ambari offers the necessary interface and APIs to provision, manage and monitor Hadoop clusters and integrate with other management console software.
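
For a flavor of those management APIs, the sketch below lists the clusters an Ambari server manages via its REST interface; the server address and credentials are assumptions for the example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class AmbariClusters {
        public static void main(String[] args) throws Exception {
            // Ambari server address and credentials are assumptions.
            URL url = new URL("http://ambari.example.com:8080/api/v1/clusters");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = Base64.getEncoder()
                    .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            // The response is a JSON document describing every managed cluster.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }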

  • Apache ZooKeeper
    A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information. A minimal client sketch follows this list.
  • Apache Ambari
    An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
  • Apache Oozie
    Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
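
A minimal sketch of the ZooKeeper coordination pattern described above: one process publishes a piece of configuration as a znode, and any other process in the cluster reads the same value. The ensemble connection string is an assumption for the example.

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigDemo {
        public static void main(String[] args) throws Exception {
            // The ensemble connection string is an assumption for this sketch.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await(); // the client connects asynchronously

            // Publish a piece of configuration as a znode, if not already there.
            String path = "/app-config";
            byte[] value = "batch.size=128".getBytes(StandardCharsets.UTF_8);
            if (zk.exists(path, false) == null) {
                zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any process in the cluster can now read and watch the same value.
            byte[] read = zk.getData(path, false, null);
            System.out.println(new String(read, StandardCharsets.UTF_8));
            zk.close();
        }
    }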

Modern Data Architecture
Tackle the challenges of big data. Hadoop integrates with existing EDW, RDBMS and MPP systems to deliver lower-cost, higher-capacity infrastructure.
Hortonworks Data Platform
The Hortonworks Data Platform is a 100% open source distribution of Apache Hadoop that is truly enterprise grade, having been built, tested and hardened with enterprise rigor.
Get started with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
