The Hortonworks Blog

More from Vinod Kumar Vavilapalli

Mayank Bansal, of EBay, is a guest contributing author of this collaborative blog.

This is the 4th post in a series that explores the theme of enabling diverse workloads in YARN. See the introductory post to understand the context around all the new features for diverse workloads as part of Apache Hadoop YARN in HDP.

Background

 In Hadoop YARN’s Capacity Scheduler, resources are shared by setting capacities on a hierarchy of queues.…

This is the fourth post in a series that explores the theme of enabling diverse workloads in YARN. See the introductory post to understand the context around all the new features for diverse workloads as part of YARN in HDP 2.2.

Introduction

When it comes to managing resources in YARN, there are two aspects that we, the YARN platform developers, are primarily concerned with:

  • Resource allocation: Application containers should be allocated on the best possible nodes that have the required resources and
  • Enforcement and isolation of Resource usage: On any node, don’t let containers exceed their promised/reserved resource-allocation
  • From its beginning in Hadoop 1, all the way to Hadoop 2 today, the compute platform has always supported memory based allocation and isolation.…

    This is the 3rd post in a series that explores the theme of supporting rolling-upgrades & downgrades of a Hadoop YARN cluster. See the introductory post here.

    Background and Motivation

    Before HDP 2.2, Hadoop MapReduce applications depended on MapReduce jars being deployed on all the nodes in a cluster. The java classpath of all the tasks and the ApplicationMaster of a MapReduce job were set to point to the deployed jars.…

    The Apache Hadoop community is happy to announce the release of Apache Hadoop 2.7.0! We want to express our gratitude to every contributor, reviewer and committer.

    The Hadoop community fixed 923 JIRAs in total as part of the 2.7.0 release. Of the 923 fixes:

    • 259 were in Hadoop Common
    • 350 were in HDFS
    • 253 were in YARN
    • 61 were in MapReduce

    Hadoop 2.7.0 is the first Hadoop release in 2015, following late last year’s 2.6.0.…

    Introduction

    Today, organizations use the Apache Hadoop™ stack in the form of a central data lake to store their critical datasets and power their analytical processing workloads. A key requirement for the Hadoop cluster and the services running on it is to be highly available and flawlessly continue to function while software is being upgraded. In the past, the Hadoop community has added enterprise features such as High Availability (HA) to various components of the stack, snapshots, improved disaster recovery etc.…

    This is the third post in a series exploring recent innovations in the Hadoop ecosystem that are included in Hortonworks Data Platform (HDP) 2.2. In this post, we introduce the theme of supporting rolling upgrades and downgrades of a Hadoop YARN cluster.

    HDP 2.2 offers substantial innovations in Apache™ Hadoop YARN, enabling Hadoop users to efficiently store and interact with their data in a single repository, simultaneously using a wide variety of engines.…

    This is the second post in a series that explores recent innovations in the Hadoop ecosystem that are included in HDP 2.2. In this post, we introduce the theme of running service-workloads in YARN to set context for deeper discussion in subsequent blogs.

    HDP 2.2 brings substantial innovations in Apache Hadoop YARN, enabling users of Apache Hadoop to efficiently store their data in a single repository and interact with it simultaneously using a wide variety of engines.…

    This is the first post in a series that explores recent innovations in the Hadoop ecosystem that are included in HDP 2.2. In this post, we introduce themes to set context for deeper discussion in subsequent blogs.

    HDP 2.2 represents another major step forward for Enterprise Hadoop. With thousands of enhancements across all elements of the platform spanning data access to security to governance, rolling upgrades and more, HDP 2.2 makes it even easier for our customers to incorporate HDP as a core component of Modern Data Architecture (MDA).…

    Jian He (Apache YARN Hadoop committer) and I discuss Apache Hadoop YARN’s Resource Manager resiliency upon restart in this blog. This is the third blog post in the series on motivations and architecture for improvements to the Apache Hadoop YARN’s Resource Manager (RM) resiliency. Others in the series are:

    Introduction Phase II – Preserving work-in-progress of running applications

    ResourceManager-restart is a critical feature that allows YARN applications to be able to continue functioning even when the ResourceManager (RM) crash-reboots due to various reasons.…

    This is the second in our series on the motivations and architecture for improvements to the Apache Hadoop YARN’s Resource Manager Restart resiliency. Other in the series are:

    Introduction: Phase I – Preserve Application-queues

    In the introductory blog, we previewed what RM Restart Phase I entails. In essence, we preserve the application-queue state into a persistent store and reread it upon RM restart, eliminating the need for users to resubmit their applications.…

    This is the first post in our series on the motivations and architecture for improvements to the Apache Hadoop YARN’s Resource Manager Restart resiliency. Other in the series are:

    Resource Manager (RM) is the central authority of Apache Hadoop YARN for resource management and scheduling. It is responsible for allocation of resources to applications like Hadoop MapReduce jobs, Apache TEZ DAGs, and other applications running atop YARN.…

    User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application – correctness problems with the application itself, race conditions when running on a cluster, and debugging task/job failures due to hardware or platform bugs. Secondly, one can do historical analyses of the logs to see how individual tasks in job/workflow perform over time. One can even analyze the Hadoop MapReduce user-logs using Hadoop MapReduce(!) to determine any performance issues.…

    This post is authored by Omkar Vinit Joshi with Vinod Kumar Vavilapalli and is the ninth post in the multi-part blog series on Apache Hadoop YARN – a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters. Other posts in this series:

    Introduction

    In the previous post, we explained the basic concepts of LocalResources and resource localization in YARN.…

    This post is authored by Omkar Vinit Joshi with Vinod Kumar Vavilapalli and is the 8th post in the multi-part blog series on Apache Hadoop YARN – a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters. Other posts in this series: 

    Introduction

    In YARN, applications perform their work by running containers, which today map to processes on the underlying operating system.…

    I’ve been sitting on this post for a while as Apache Hadoop 2 GA work was keeping me extremely busy. As they say, better late than never, so here we go – the slides are at the end of the post.

    Three weeks ago, we had a Apache Hadoop YARN meetup at LinkedIn. Kind folks at LinkedIn had offered to host us in addition to talking about exciting projects like usage of YARN at LinkedIn, and applications on YARN like Apache Samza, Apache Giraph and Apache Helix.…