This is the second post in a series exploring the theme of long-running service workloads in YARN. See the introductory post for background.
Long-running services deployed on YARN are by definition expected to run for a long period of time—in many cases forever. Services such as Apache™ HBase, Apache Accumulo and Apache Storm can be run on YARN to provide a layer of services to end users, and they usually have a central master running in conjunction with an ApplicationMaster (AM). The AM is a potential single point of failure for all of these services.
In this blog post, we will discuss a few strategies for making long-running services on YARN more fault-tolerant to AM failures.
Before HDP 2.2, YARN already supported basic mechanisms to restart an AM after a crash or failure. Together, the ResourceManager (RM) and the NodeManagers running the AMs are responsible for automatically detecting an AM crash and restarting it. However, there are two significant limitations to this approach:

- When an AM is restarted, all of the containers belonging to the previous attempt are killed, so any work in progress across the application is lost.
- The number of AM restarts is capped by a pre-configured limit (maxAppAttempts); once that limit is reached, the whole application is failed.
Let’s take a look at solutions available now with HDP 2.2 that can overcome these limitations.
The way to prevent the loss of work-in-progress when an AM crashes is to implement container-preserving AM restart. This allows the AM to be restarted without killing its associated containers, and then rebinds/reconnects the restarted AM with its previously running containers. All the related work for container-preserving AM restart is tracked by the Apache Hadoop YARN JIRA ticket YARN-1489 (note that the ticket is still open for a few minor enhancements).
Here’s how it works:

1. At submission time, the application tells YARN to keep its containers across application attempts.
2. When the AM exits, the RM leaves the application’s running containers alone instead of killing them, and launches a new application attempt.
3. When the new AM registers with the RM, the registration response carries the list of containers still running from the previous attempt (and, in secure clusters, the NMTokens needed to talk to their NodeManagers).
4. The new AM rebinds to those containers and resumes managing them.
In this solution, note that all the non-running containers (such as any reserved containers and acquired-but-not-yet-launched containers) are still killed by the RM. Any outstanding container requests are cleared as well. While this may sound like a limitation, it is a necessary result of an underlying focus on scalability. Even today, the RM does not persistently record all containers’ information—it simply learns the status of running containers from NodeManagers’ (NMs) registration. In the case of an RM crash-reboot event, information about those not-yet-running containers cannot be retrieved unless the RM explicitly persisted the state of each and every container in the system, which is a design anti-goal for RM scalability.
Let’s dig a little deeper into the mechanism for notifying the newly-started AM about the previous running containers. This happens through the ApplicationMasterService registration call.
When a new AM is re-launched, it first re-registers with the RM. The RM’s registration response then carries a list of the containers that were running before the previous AM went down. From this list, the new AM learns the locations of those containers and can map them back to framework-level tasks such as HBase RegionServers or Storm Supervisors.
In a secure environment, the AM also requires the relevant NMTokens in order to communicate with the corresponding NMs. In this case, the RM also re-creates the relevant NMTokens and sends them across to the AM via the registration call along with the list of previous running containers.
Note that the containers themselves are still not notified of the new AM location. This is left as an application-level detail.
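The handoff described above can be pictured with a simplified, self-contained model. The class and method names below are illustrative only, not YARN’s actual implementation; in real YARN the handoff happens through the ApplicationMasterService registration call. The key idea is that with keep-containers enabled, an AM crash removes nothing from the RM’s view of running containers, so the next attempt’s registration returns them intact.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Simplified model (illustrative names, not YARN code) of how the RM
 * hands running containers back to a restarted AM.
 */
public class AmRestartModel {
    /** Running containers per application, as the RM learns them from NM heartbeats. */
    private final Map<String, List<String>> runningContainers = new HashMap<>();

    public void containerStarted(String appId, String containerId) {
        runningContainers.computeIfAbsent(appId, k -> new ArrayList<>()).add(containerId);
    }

    /**
     * Simulates an AM crash with keep-containers enabled: running containers
     * are NOT killed, so nothing is removed from the RM's view.
     */
    public void amCrashedKeepContainers(String appId) {
        // intentionally a no-op: running containers survive the AM crash
    }

    /**
     * Simulates the new attempt's registration: the response carries the
     * list of containers that were running before the AM went down.
     */
    public List<String> registerNewAttempt(String appId) {
        return new ArrayList<>(runningContainers.getOrDefault(appId, new ArrayList<>()));
    }
}
```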
To use this feature, application writers need to be aware of a few APIs:
ApplicationSubmissionContext#setKeepContainersAcrossApplicationAttempts(boolean): this API sets a flag indicating whether YARN should keep containers across application attempts or not. If the flag is true, running containers will not be killed when an AM crashes and restarts. These containers are handed to the new application attempt upon AM registration via RegisterApplicationMasterResponse#getContainersFromPreviousAttempts().
YARN container IDs have always been derived from the ID of the application attempt that originally allocated the container. Even though these transferred containers are now managed by the newly-created AM (with a new/different Application Attempt ID), the container ID is still tied to the ID of the application attempt that originated the container.
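To make this concrete: a container ID string such as container_1410901177871_0001_01_000005 embeds the attempt number (01 here) of the attempt that allocated the container, and that segment does not change when the container is transferred to a later attempt. The small parser below is a sketch written for illustration, not YARN’s own utility code:

```java
/**
 * Illustrative parser for YARN container ID strings of the form
 * container_<clusterTimestamp>_<appId>_<attemptId>_<containerId>,
 * e.g. "container_1410901177871_0001_01_000005".
 * This is a demonstration sketch, not YARN's own implementation.
 */
public class ContainerIdInfo {
    /** Returns the application-attempt number embedded in the container ID string. */
    public static int attemptNumber(String containerId) {
        String[] parts = containerId.split("_");
        // parts: ["container", clusterTimestamp, appId, attemptId, containerSeq]
        return Integer.parseInt(parts[3]);
    }
}
```

Even after the container is transferred to attempt 2, parsing its ID still yields attempt 1, the attempt that originated it.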
The limited number of AM restarts makes sense for short-lived applications like MapReduce jobs, which typically run to completion within hours or days. If the AMs of short-lived applications keep failing for whatever reason, it’s better to fail the application after some number of attempts instead of retrying it forever. However, that logic just doesn’t work for long-running services, especially those designed to run forever.
Sometimes, an AM failure is caused by a restart (during upgrade) or by the decommission of an NM (on which the AM container is running). In cases like these, ideally the AM should not be penalized—but it is, because the ResourceManager still counts each of those conditions as an AM failure.
Because the retry count for AMs (maxAppAttempts) is pre-configured and never resets after application submission, the number of attempts will eventually cross the restart threshold, given enough time. Once that happens, the RM will mark the application as failed and shut it down—a bad end-state for long-running services.
YARN in HDP 2.2 addresses these limitations through two related efforts:

- Not counting platform-induced failures, such as NM restarts or AM-container preemption, toward the AM retry limit (YARN-614).
- A sliding validity window for attempt failures, so that only recent failures count toward maxAppAttempts (YARN-611).
The idea here is to stop penalizing the AM restart count for events that aren’t really AM problems, by separating true AM failures from ones induced by hardware or YARN-level issues. Events like NM restart/failures, AM container preemption, etc., are no longer counted towards the limited number of AM retries. This work is tracked by the Apache JIRA YARN-614.
When an AM fails, the RM checks the exit status of the AM container. If the exit status matches any of the following, the AM failure count is not increased:

- PREEMPTED: the AM container was preempted by the scheduler.
- ABORTED: the container was aborted by YARN, for example because its NM restarted or failed.
- DISKS_FAILED: the container failed because of bad local disks on the node.
- KILLED_BY_RESOURCEMANAGER: the RM killed the container, for example during node decommission.
Because the RM ignores the above exit statuses when determining whether a new application attempt needs to be launched, application resiliency is significantly improved.
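The filtering idea behind YARN-614 can be sketched as a small, self-contained function. The names below are illustrative (the real check lives inside the ResourceManager), and the enum values model Hadoop’s ContainerExitStatus constants:

```java
/**
 * Sketch of the YARN-614 idea: platform-induced AM-container exit
 * statuses do not consume a retry. Illustrative names, not RM code.
 */
public class AttemptFailurePolicy {
    /** Exit conditions modeled after Hadoop's ContainerExitStatus constants. */
    public enum ExitStatus {
        SUCCESS, PREEMPTED, ABORTED, DISKS_FAILED, KILLED_BY_RESOURCEMANAGER, FAILED
    }

    /** True only for genuine AM failures that should count toward maxAppAttempts. */
    public static boolean countsTowardsMaxAttempts(ExitStatus status) {
        switch (status) {
            case PREEMPTED:                  // AM container preempted by the scheduler
            case ABORTED:                    // e.g. NM restart/failure aborted the container
            case DISKS_FAILED:               // local disks on the node went bad
            case KILLED_BY_RESOURCEMANAGER:  // RM-initiated kill, e.g. node decommission
                return false;                // not the AM's fault: don't consume a retry
            default:
                return true;                 // a real AM failure
        }
    }
}
```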
Another method to improve fault tolerance is a policy that resets the AM retry count. We have introduced a new user-specifiable parameter, attempt_failures_validity_interval, which defines a sliding AM-retry window. With it set, the RM counts only the AM failures that happened during the last attempt_failures_validity_interval milliseconds. Only if the failure count reaches the configured maxAppAttempts limit within that window will the application be failed; otherwise, a new attempt is launched. The parameter accepts values in milliseconds and can be set in the ApplicationSubmissionContext during application submission. The related work is tracked at YARN-611.
Every application can have its own attempt_failures_validity_interval, and the RM will use it to determine if a new AM needs to be launched whenever an AM crashes.
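The window check can be sketched as follows. This is a simplified, self-contained model with illustrative names; in reality the bookkeeping happens inside the RM:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of the attempt_failures_validity_interval logic: only AM
 * failures inside the trailing window count toward maxAppAttempts.
 * Illustrative names; the real check lives inside the ResourceManager.
 */
public class FailureWindow {
    private final long validityIntervalMs;
    private final int maxAppAttempts;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    public FailureWindow(long validityIntervalMs, int maxAppAttempts) {
        this.validityIntervalMs = validityIntervalMs;
        this.maxAppAttempts = maxAppAttempts;
    }

    /**
     * Records an AM failure at time {@code nowMs}; returns true if the
     * application should be failed, false if a new attempt may be launched.
     */
    public boolean onAmFailure(long nowMs) {
        failureTimes.addLast(nowMs);
        if (validityIntervalMs > 0) {
            // Drop failures that fell out of the trailing window.
            while (!failureTimes.isEmpty()
                    && failureTimes.peekFirst() <= nowMs - validityIntervalMs) {
                failureTimes.removeFirst();
            }
        }
        return failureTimes.size() >= maxAppAttempts;
    }
}
```

With a 10-second window and maxAppAttempts of 2, a failure that happened 15 seconds ago no longer counts, so a fresh failure still results in a new attempt; two failures within the same 10-second window fail the application.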
To use this feature, application writers need to specify attempt_failures_validity_interval in their application-submission code using the ApplicationSubmissionContext#setAttemptFailuresValidityInterval(long) API.
The default value for this parameter is -1, which disables the window and preserves the previous behavior. When attemptFailuresValidityInterval is set to a value greater than 0 milliseconds, the RM computes the corresponding trailing time window.
In conclusion, this blog post discussed some of the key YARN efforts delivered in HDP 2.2 that enable reliable and fault-tolerant management of long-running services. We hope these features help you bring your service workloads to YARN clusters more easily.