
YARN Forum

YARN availability

  • #24678
    Tobias Herb

    Hey all!

    I'm a student just experimenting with YARN, and I have a couple of questions about the “availability mechanisms” in YARN. Maybe one of you can help me and give me some hints…

    I wrote a simple demo application (comparable to the “Distributed Shell” example) and tested it on a single-node setup. That all worked great!
    Now I want to investigate the “Fault Tolerance / Availability” topic, so I examined several scenarios (but YARN did not behave as I expected):

    (1) Start the YARN app (1 Client, 1 ApplicationMaster (AM) and 1 Worker/Task associated with the AM). Kill the ResourceManager process (to simulate a crashed RM node) while the AM/Worker is running. I would expect the RM to be relaunched in a newly allocated container (if so, which component is responsible for that relaunch -> ZooKeeper?). Is that assumption wrong? How must the AM react to reconnect to a newly launched RM? Or is the complete system down after an RM crash?

    (2) If I kill the AM process, I expect the ApplicationsManager (ASM) to restart the AM in a newly allocated container and to execute all tasks that haven't been executed so far.

    (3) (Liveness protocols) If I set RM_AM_EXPIRY_INTERVAL_MS to 2 minutes and let the AM freeze (via a sleep command) for 3 minutes, nothing happens. I would also expect the ASM to restart the AM, but the job finishes without problems and without any log notification.

    (4) Is the liveness management for the worker nodes completely handled by the AM? If a worker node crashes, must the AM restart all crashed tasks itself?
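    For context, the configuration surface behind these four scenarios looks roughly like the yarn-site.xml fragment below. This is a sketch, not a recipe: the property names are from the Hadoop 2.x line, RM restart/recovery and HA only landed in later 2.x releases, and the ZooKeeper hosts shown are hypothetical. In an early 2.x setup, a killed RM is not relaunched anywhere (it is not itself a container); the cluster is effectively down until an operator restarts the RM process.

    ```xml
    <!-- Sketch of availability-related yarn-site.xml properties (Hadoop 2.x names). -->

    <!-- (1) RM recovery: on restart, the RM can reload application state from a
         state store such as ZooKeeper instead of losing all running apps. -->
    <property>
      <name>yarn.resourcemanager.recovery.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.resourcemanager.store.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
    <property>
      <name>yarn.resourcemanager.zk-address</name>
      <!-- hypothetical ZooKeeper quorum -->
      <value>zk1:2181,zk2:2181,zk3:2181</value>
    </property>

    <!-- (2) AM restart: how many attempts the RM will launch for one application
         before declaring it failed (default 2). -->
    <property>
      <name>yarn.resourcemanager.am.max-attempts</name>
      <value>2</value>
    </property>

    <!-- (3) AM liveness: the RM expires an AM whose heartbeats stop for this long
         (default 600000 ms). Note that if the AM heartbeats from a separate
         thread (as AMRMClientAsync does), sleeping the main thread alone will
         not stop the heartbeats, so no expiry is triggered. -->
    <property>
      <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
      <value>120000</value>
    </property>

    <!-- (4) NM liveness: the RM marks a NodeManager as lost after this interval
         and reports its containers as completed to the AM; re-requesting
         containers for the lost work is then the AM's job. -->
    <property>
      <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
      <value>600000</value>
    </property>
    ```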

    I have these behavioral assumptions because of the design document “Architecture of Next Generation Apache Hadoop MapReduce Framework”… But maybe my way of exploring is completely wrong…

    It would be great if someone could give me a few hints!

    Thanks in advance,

  • Author
  • #25368

    Hi Tobi,

    Thanks for trying HDP 2.0. We're looking into these issues and will get back as soon as we have a definitive answer.


