February 03, 2015

Support for long-running services on your Hadoop YARN clusters

This is the second post in a series that explores recent innovations in the Hadoop ecosystem that are included in HDP 2.2. In this post, we introduce the theme of running service-workloads in YARN to set context for deeper discussion in subsequent blogs.

HDP 2.2 brings substantial innovations in Apache Hadoop YARN, enabling users of Apache Hadoop to efficiently store their data in a single repository and interact with it simultaneously using a wide variety of engines. Thematically, YARN in HDP 2.2 encompasses several tracks that we introduced in our first blog. This blog introduces running service workloads in YARN, which we’ll expand on further in subsequent posts.

The World of YARN—So far

Before the advent of YARN, Apache Hadoop MapReduce served as a powerful framework for distributed, scalable and fault-tolerant data processing. With the introduction of YARN in HDP 2.x, MapReduce went through a second incarnation, MRv2, becoming more scalable and much more performant, all without requiring major changes to users’ applications. So even if you are moving from HDP 1.x to HDP 2.x with YARN without immediate plans to leverage other programming abstractions, you still have strong incentives to upgrade to the latest stack: we documented the entire upgrade process, and we have supported many users and customers through that smooth upgrade.

Over the course of last year, we also saw the rise of Apache Tez as the next chapter in data processing on Hadoop. Tez is a distributed framework that runs natively on YARN, taking advantage of all the common infrastructure functionality YARN exposes. Tez is both a natural successor to the MapReduce framework and a radical departure in how user applications translate into system requirements and in how resources get utilized through YARN. In the new world, Tez acts as a foundation layer above YARN for unlocking new potential across various Hadoop ecosystem projects. The success of the Stinger initiative is a great example of this potential.

Enabling long-running service workloads on YARN: The why and the how

MapReduce and Tez go a long way toward unified data processing on Hadoop clusters through custom-written applications such as batch and interactive jobs. Beyond on-demand batch and interactive data processing, organizations today also need to store, process and analyze transactional data and to perform real-time analytics.

Before HDP 2.2, users would run NoSQL systems like Apache HBase and Apache Accumulo, or stream-processing systems like Apache Storm, on separate clusters carved out alongside their YARN clusters. These silos have multiple disadvantages:

  • Administrative overhead: Cluster administrators are forced to do capacity planning for multiple separate clusters: one for batch/interactive workloads, one for operational data, one for stream processing and so on. This has to be done even though, in most cases, the underlying data-processing needs are identical. Further, every time the usage patterns of one or more workload types change, administrators are forced into yet another round of static and inflexible capacity planning.
  • Data locality: In most service use-cases, some or all of the data that the services access ends up in the Hadoop Distributed File System (HDFS), where other applications then perform further analysis. If the nodes running the services are statically partitioned into separate clusters or silos, you break the data-locality advantages that batch and interactive workloads enjoy in a centralized and replicated HDFS storage environment, and you miss out on YARN’s resource-management capabilities.
  • Utilization and elasticity: Given the static partitioning of physical resources and the separate management of different workloads, the overall utilization of the underlying compute and storage resources is difficult to predict or optimize. In many cases, utilization suffers whenever one of the use-cases goes cold for a day, a month or a year.

Increasingly, enterprises have existing applications and services, distributed or not, that they want to run on their Hadoop clusters for various reasons. By migrating these workloads to Hadoop, organizations can leverage the ever-growing, large-scale accumulation of data-sets in HDFS. Also, writing new services from scratch on YARN should be an easier experience than it is today.

To improve data placement and utilization, increase elasticity, and reduce administrative overhead, YARN encourages all workloads to share the cluster’s storage layer and to utilize its compute resources as multiple tenants. By migrating or “sliding” these service workloads under YARN for cluster-wide resource management, you achieve far better data placement and cluster utilization.

Two parallel efforts make this possible:

  1. Native support of long-running services in YARN
  2. Introduction of Apache Slider, a new YARN framework to help ease the evolution of services on YARN

1. Native support of long-running services in YARN

As part of HDP 2.2, we are bringing native support for running long-running services on existing Hadoop YARN deployments.

Except for a few differences, running long-running services on YARN is fundamentally no different from running short-lived applications. Below is a short list of those distinguishing aspects of services on YARN.

  • Fault tolerance of long-running services

    A YARN ApplicationMaster (AM) may crash or disconnect from the ResourceManager (RM) because of hardware failures, application bugs and so on. For a non-service YARN application, when an AM exits, the RM explicitly kills all the containers launched by that AM. This is usually fine for batch applications like MapReduce, but for long-lived services such as HBase or Storm running on YARN it is unacceptable. Further, YARN restarts crashed AMs only a couple of times before failing the application, which is also unacceptable for a service. Both behaviors needed to change for long-running services; a hedged sketch after this list illustrates the kind of submission-time settings involved.

  • YARN security to support long-running services

    Before HDP 2.2, Hadoop applications on a secure cluster were limited to running for a pre-configured number of days (a week by default), because Hadoop delegation tokens must be renewed periodically and expire after a maximum lifetime. This is a fundamental problem for long-running services on YARN, which are expected to run far beyond a week, and often indefinitely. Security is therefore one of the main areas that needed to be addressed.

  • Log handling of long-running services in YARN

    For applications that finish in a short, finite amount of time, the logs of containers running on any given node are aggregated into HDFS only when the application finishes. However, for YARN applications (like services) that never exit, this is not an option. Instead, logs need to be aggregated online while the application is still running, and these aggregated logs must be viewable while the service is still alive.

  • Service registry for long-running services in YARN

    By design, in a YARN cluster, users and clients cannot predict which hosts and ports the individual containers constituting a service will come up on. Discovering those endpoints was always left as an application-layer responsibility, with little support from the platform. This is a particular concern with services, since client interaction with services is more fundamental and frequent than with batch applications.

  • Advanced scheduling and isolation

    Resource scheduling and isolation of service containers on YARN also need some fundamental rethinking. For example, a long-running Storm application in a YARN cluster requires isolation of CPU resources coupled with first-class support for CPU scheduling. Also, some services may require resources on a dedicated set of nodes within a cluster, typically identified by admin-specified node labels.
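
To make the fault-tolerance point above concrete, here is a minimal, hypothetical Java sketch of how a service’s client might shape its ApplicationSubmissionContext at submission time, using the keep-containers-across-attempts and attempt-failures-validity-interval settings that this line of work introduced. The class name and the specific values are placeholders, not tuning recommendations.

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

    // Illustrative sketch only: mark a YARN application as a long-running service
    // so that an ApplicationMaster crash does not take the whole service down.
    public class LongRunningSubmissionSketch {

      public static void markAsLongRunning(ApplicationSubmissionContext ctx) {
        // Keep the service's running containers alive across AM restarts instead
        // of letting the ResourceManager kill them with the failed attempt.
        ctx.setKeepContainersAcrossApplicationAttempts(true);

        // Allow more AM attempts than the short-job default...
        ctx.setMaxAppAttempts(10);

        // ...and only count AM failures that occur within this window (in ms),
        // so occasional crashes spread over months do not exhaust the attempt budget.
        ctx.setAttemptFailuresValidityInterval(10 * 60 * 1000L);
      }
    }

A real client would apply these settings to the submission context it passes to YarnClient, alongside the usual resource, queue and AM launch details.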

In HDP 2.2, YARN addresses all of these differences so that users can interact with long-running services just the way they do with regular short-lived applications. YARN-896 is the Apache JIRA that tracks the core efforts related to supporting long-running services in YARN (the ticket remains open for further improvements). We will cover these service-related efforts in YARN in much more detail in upcoming posts.

2. Apache Slider

We joined the open source community in kick-starting Apache Slider, an incubator project at the Apache Software Foundation (ASF). Simply put, Apache Slider is a Hadoop YARN framework for deploying existing distributed applications on YARN without any code changes, by submitting a declarative specification of how the application should run.

Through Slider, we are making it economical and easy to develop and deploy services on YARN. This helps organizations in two ways: (1) the ability to port existing services to YARN easily, and (2) the ability to rapidly develop new services on YARN from the ground up.
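
As a rough illustration of that declarative approach, a Slider application is typically described by a small set of JSON files, such as a resources.json that states how many instances of each component to run and how much memory and CPU each one gets. The component name, values and file paths below are placeholders, and the exact key names should be verified against the Slider documentation for your release:

    {
      "components": {
        "HBASE_REGIONSERVER": {
          "yarn.component.instances": "4",
          "yarn.memory": "1024",
          "yarn.vcores": "1"
        }
      }
    }

The application is then created from that specification with the slider command-line client, for example: slider create myhbase --template appConfig.json --resources resources.json. Scaling a component up or down later is another declarative change to the instance count rather than a code change.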

First-class support for well-known services on YARN

Finally, in addition to enabling support for services in YARN, we demonstrated this capability with three key services in HDP 2.2: Apache HBase, Apache Accumulo and Apache Storm can now run as long-running services on YARN through Slider.

Our journey to help more services slide onto YARN at scale has begun; more efforts are underway to enable integrations in all dimensions.

Conclusion

In this blog post, we introduced two key efforts in HDP 2.2: YARN support for deploying long-running services, and Apache Slider, a new YARN framework. We are very excited about extending the capabilities of your existing YARN clusters in powerful new ways!

In the coming weeks, we plan to share more posts with richer detail on the efforts described above pertaining to long-running services on YARN.
