This is the third post in a series that explores the theme of supporting rolling-upgrades & downgrades of a Hadoop YARN cluster. See here for an introductory post.
Carrying out a rolling upgrade/downgrade of all nodes in a Hadoop cluster can be a very disruptive process. Before HDP 2.2, if a NodeManager (NM) were brought down, all active containers on that node would be killed. This would significantly interrupt all applications in the cluster being upgraded/downgraded. It would become worse for applications with long-running containers, especially if some of those containers ended up failing again and again as multiple nodes got upgraded over a short period of time. Thus, from the perspective of a rolling upgrade, it is much better if the NMs minimize impact on running applications in a cluster by supporting a mode where they can be restarted with an updated software version without losing any containers and/or run-state.
Apache Hadoop YARN in HDP 2.2 addresses the above limitations with a new effort work-preserving NodeManager restart.
Work-preserving NM Restart enables NodeManagers to be restarted without losing active containers running on the node. At a high level, the NM stores any necessary state synchronously to a state-store as it processes incoming requests from applications. After the NM restarts, it regains its previous memory by loading the saved state for its various subsystems which then perform recovery by using the loaded state.
YARN in HDP 2.2 ships with LevelDB, a fast key-value store, as the state-store backend for the NodeManagers.
Below is an architectural overview of the state-management within the NodeManager to facilitate recovery:
This feature in its entirety is tracked at the Apache Hadoop™ YARN JIRA ticket YARN-1336 (note that the umbrella ticket is still open for few further enhancements).
When work-preserving NM restart is enabled, any changes to the state of a NodeManager will get persisted synchronously in a LevelDB store and is recovered later during NM restart. Different types of internal-state are treated differently depending on the level of resiliency needed. The following is a detailed explanation of how NodeManager treats each category of its internal state with respect to recovery. (Feel free to skip this section if you aren’t interested in the gory inner workings of the feature!)
In an older blog post, we described the concept of Local Resources in YARN. In short, NodeManagers keep track of LocalResources (jars, libraries etc.) needed by each application’s containers and download them in an optimal manner. In HDP 2.2, the NM in the state-store preserves meta information related to LocalResources each time the NM starts downloading a resource. During the recovery process, the NodeManager
When an application is initialized on the NM, its state information is made persistent on the state-store. It gets updated to FINISHED state when the RM indicates the same to the NodeManager. When the NM no longer needs to track an application, it will remove the application state from the state-store.
During the recovery, the NM loads the applications’ state from the state store. The state for each application indicates whether the application has finished or not. Note that for a finished application no more containers will be launched but it may still be undergoing log- aggregation. As each application is recovered, a new Application object is created and initialization events are triggered to reinitialize the bookkeeping for the application within the NM. When an application finishes, an ApplicationFinishedEvent is dispatched to the Application after all containers are recovered to trigger any log-aggregation and/or cleanup processing for the application.
With work-preserving NM restart, the containers’ states get persisted when container start-requests are received. It can be updated later when the container successfully gets launched, completes or gets killed.
During NM restart, the recovery process loads the state of all containers being tracked by the NM. Unlike the regular container-startup control flow (i.e. requesting local resources, followed by launching the container, etc.), the container-recovery process does not rerun the container but rather attempts to reattach to the previously launched container. It does this by locating the PID file created by the container-executor and asking the executor to reattach to the running process-tree. Below is a graphical representation of this mechanism:
If the container (process-tree) is still running, a special-cased container-launch mechanism called RecoveredContainerLaunch periodically polls to see if the process has exited, and obtains the exit code from the exit-code file created (by the container-executor) once the container finishes. If no exit-code gets located, then the reacquisition fails and the container is reported as LOST (exit code 154).
NMTokens are special tokens used by ApplicationMasters to securely communicate with NodeManagers during start and stop of containers. To help recovery, when the NM registers with the RM, the NM-tokens and the corresponding master-keys (current and previous) are persisted. They may be updated later if RM rolls the master key and notifies the NM as part of the heartbeat response.
During NM restart, the recovery process loads all of the NM-token master-key states, and then uses them to update the NMTokenSecretManagerInNM instance with the appropriate master-key state and repopulates the mapping of application-attempts to master-keys.ds.
ContainerTokens are special tokens issued by the ResourceManager to ApplicationMasters so that they can in turn pass them to NodeManagers to verify their ability to launch containers. To help recovery, when the NM registers with the RM, container-tokens of applications and the corresponding master-keys are persisted in a hierarchical manner in LevelDB. They may be updated later if the RM rolls the keys. The expiration time for a particular container-token is also persisted.
During NM restart, the recovery process loads all of the container-token state. The recovered state is then used to update the NMContainerTokenSecretManager instance to re-populate the master-keys and rebuild the map of expiration-times and container-IDs to track container-tokens that have been in use by applications.
Deletion service is an NM-internal service responsible for deleting files and directories on local-disks as other components within the NM request over time. To support recovery, the deletion-service tracks deletion-tasks that are scheduled to execute at various times. These are persisted to the store by passing a ProtocolBuffer object that describes the deletion-task and any successor task IDs that should be triggered after it executes.
During NM restart, the recovery process loads the state of all of the persisted deletion-tasks, then this state is used to recreate the deletion tasks in the DeletionService instance and to
reconnect the deletion-task dependencies based upon the stored successor task-IDs.
Log-aggregation in NodeManager is responsible for collating and posting logs to a remote file-system for posterity. It is primarily triggered by application-finish and container-completion events. For recovery of the log-aggregation service, no explicit state is stored since (1) the application init/finished events are automatically propagated to the log-aggregation service during application-recovery and (2) container completion-events are propagated to the log-aggregation service during container-recovery. When the log-aggregation service receives the application-finished event, it proceeds to upload the logs as per the regular control-flow. Any log-aggregation that was in progress when the NM restarted will resume from the beginning, overwriting any existing temporary files.
The NM provides the auxiliary-services framework for extending its fringe functionality. There is a rudimentary support in NMs of HDP 2.2 for auxiliary services to support state persistence and recovery. If recovery is supported like the ShuffleHandler in MapReduce, the NM will create a subdirectory in the NM state-storage directory specific to that auxiliary-service, and then invokes the setRecoveryPath() API on the service before initializing it. During re-initialization, the auxiliary-service can invoke the getRecoveryPath() API to determine if recovery is supported and where it should store/recover its state to/from. If the recovery path is not set, then recovery is not enabled.
This feature can be enabled by setting the configuration flag yarn.nodemanager.recovery.enabled to be true and by pointing yarn.nodemanager.recovery.dir to a local FS directory where the state should be stored. Following is an example (in yarn-site.xml):
Besides these two settings, ephemeral ports (port 0, which is default) cannot be used for the NodeManager’s RPC server via the configuration property yarn.nodemanager.address. Configuring the communication port to be ephemeral will very likely lead to the NM using a different port after a restart. This will break any previously running clients that were communicating with the NM before the restart. Explicitly setting yarn.nodemanager.address to an address with a specific port number (for e.g 0.0.0.0:45454) is a precondition for enabling NM restart.
In addition, any auxiliary services configured on the NodeManager should support recovery by reloading previous state when the NM restarts and re-initializes the auxiliary service. An example is the MapReduce ShuffleHandler which supports recovery out-of-the-box with the default settings.
With Work Preserving NM Restart enabled, all NodeManagers in a Hadoop YARN cluster can be restarted while keeping active containers running on each of the nodes with the corresponding process-state and internal-services automatically getting recovered. This feature largely helps rolling upgrades on YARN clusters together with other related efforts in HDP 2.2. to enhance the ResourceManager.
From the client’s perspective, there are relevant effort/enhancements to make upgrading a YARN cluster smoother. In the near future blog posts, we will discuss more on what YARN applications and clients can do to handle rolling-upgrades of the cluster better.