The Hortonworks Blog

Posts categorized by: High Availability
YARN and Apache Storm: A Powerful Combination

YARN changed the game for all data access engines in Apache Hadoop. As part of Hadoop 2, YARN took the resource management capabilities that were in MapReduce and packaged them for use by new engines. Now Apache Storm is one of those data-processing engines that can run alongside many others, coordinated by YARN.

YARN’s architecture makes it much easier for users to build and run multiple applications in Hadoop, all sharing a common resource manager.…

The Journey

Almost two years ago to the day, the Apache Hadoop community voted to make YARN a sub-project of Apache Hadoop. The GA release followed last fall, nearly a year ago.

Since then, it has become plainly obvious that Apache Hadoop 2.x, powered by YARN as its architectural center, is the best platform for workloads such as Apache Hadoop MapReduce, Apache Pig, and Apache Hive, which were designed to process data on Apache Hadoop HDFS.…

This post’s Principal Author: Ming Ma, Software Development Manager, eBay. With contributions from Mayank Bansal (eBay), Devaraj Das (Hortonworks), Nicolas Liochon (Scaled Risk), Michael Weng (eBay), Ted Yu (Hortonworks), John Zhao (eBay)

eBay runs Apache Hadoop at extreme scale, with tens of petabytes of data. Hadoop was created for computing challenges like ours, and eBay runs some of the largest Hadoop clusters in existence.

Our business uses Apache HBase to deliver value to our customers in real time, and we are sensitive to any failures because prolonged recovery times significantly degrade site performance and result in material loss of revenue. …

With HDP 1.3 and HDP 2.0 Beta, we introduced the ability to create snapshots to protect important enterprise data sets from user or application errors.

HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system and are:

  • Performant and Reliable: Snapshot creation is atomic and instantaneous, no matter the size or depth of the directory subtree.
  • Scalable: Snapshots do not create extra copies of blocks on the file system.

The shift to a data-oriented business is happening. The inherent value in established and emerging big datasets is becoming clear. Enterprises are building big data strategies to take advantage of these new opportunities and Hadoop is the platform to realize those strategies.

Hadoop is enabling a modern data architecture where it plays a central role: built to tackle big data sets with efficiency while integrating with existing data systems. As champions of Hadoop, our aim is to ensure the success of every Hadoop implementation and improve our own understanding of how and why enterprises tackle big data initiatives. …

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadoop 1 with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down.…

Introduction

A Highly Available NameNode for HDFS has been in development since last year. That effort focused solely on the automatic failover of the NameNode for Hadoop 2.0. During that time, we realized two things.

First, we realized we should use an outside-in approach to the HA problem: start by designing the availability of the Hadoop system as a whole and then focus on the high availability of individual components. That work led to the Full Stack HA Architecture.…

In Shaun Connolly’s post about balancing community innovation and enterprise stability, he discussed the importance of an open source project forging ahead with big improvements that are expected to be buggy and functionally incomplete at first but then stabilize over time. In the case of Apache Hadoop 2.0, currently in community Alpha release, the big improvements have been underway for the past three years and include such things as:

  • Next-gen MapReduce (aka YARN) that opens up Hadoop’s job processing architecture to other application workloads beyond MapReduce,
  • New HDFS pipeline to support append and flush,
  • HDFS federation and performance improvements that enable Hadoop to scale to thousands more nodes in a cluster, and
  • High availability improvements that address some of the single point of failure issues that are often used as examples of how Hadoop may not be as enterprise-ready as it could be.

We reached a significant milestone in HDFS: the Namenode HA branch was merged into the trunk. With this merge, HDFS trunk now supports HOT failover.

Significant enhancements were completed to make HOT failover work:

  • Configuration changes for HA
  • The notion of active and standby states was added to the Namenode
  • Client-side redirection
  • Standby processing the journal from the Active
  • Dual block reports to Active and Standby

We have extensively tested HOT manual failover in our labs over the last few months.…