This post is authored by Omkar Vinit Joshi with Vinod Kumar Vavilapalli and is the 8th post in the multi-part blog series on Apache Hadoop YARN – a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters. Other posts in this series:
In YARN, applications perform their work by running containers, which today map to processes on the underlying operating system. More often than not, containers have dependencies on files for execution. These files are either required at startup or may be needed during runtime, once or multiple times. For example, to launch a simple Java program as a container, we need a jar file and potentially more jars as dependencies. Instead of forcing every application to either access (mostly just read) these files remotely every time or manage the files themselves, YARN gives applications the ability to localize these files.
At the time of starting a container, an ApplicationMaster (AM) can specify all the files that a container will require and thus should be localized. Once specified, YARN takes care of the localization by itself and hides all the complications involved in securely copying, managing and later deleting these files.
In the remainder of this post, we’ll explain the basic concepts about this functionality.
Here are some definitions to begin with:
Localization. Localization is the process of copying/downloading remote resources onto the local file-system. Instead of always accessing a resource remotely, it is copied to the local machine so that it can then be accessed locally.
LocalResource. LocalResource represents a file/library required to run a container. The NodeManager is responsible for localizing the resource prior to launching the container. For each LocalResource, Applications can specify:
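To make this concrete, here is a minimal sketch of the information an application supplies per LocalResource. This is a simplified stand-in model, not the real class (which is org.apache.hadoop.yarn.api.records.LocalResource); the field names and the example HDFS path are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a LocalResource specification: where the file lives,
// its expected size and last-modification timestamp, its type, and its
// visibility. The real YARN record carries the same kinds of fields.
class LocalResourceSpec {
    enum Type { FILE, ARCHIVE, PATTERN }
    enum Visibility { PUBLIC, PRIVATE, APPLICATION }

    final String remoteUrl;   // where the NodeManager copies the file from
    final long size;          // expected size in bytes
    final long timestamp;     // last-modification time of the remote file
    final Type type;
    final Visibility visibility;

    LocalResourceSpec(String remoteUrl, long size, long timestamp,
                      Type type, Visibility visibility) {
        this.remoteUrl = remoteUrl;
        this.size = size;
        this.timestamp = timestamp;
        this.type = type;
        this.visibility = visibility;
    }

    public static void main(String[] args) {
        // The ApplicationMaster attaches a map like this to the container
        // launch context, keyed by the link name the container will see in
        // its working directory.
        Map<String, LocalResourceSpec> resources = new HashMap<>();
        resources.put("app.jar", new LocalResourceSpec(
            "hdfs://nn:8020/apps/demo/app.jar", 1048576L, 1370000000000L,
            LocalResourceSpec.Type.FILE,
            LocalResourceSpec.Visibility.APPLICATION));
        System.out.println(resources.get("app.jar").visibility); // APPLICATION
    }
}
```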
What files can a container request for localization? One can use any kind of files that are meant to be read-only by the containers. Typical examples of LocalResources include:
The following are some examples of bad candidates for LocalResources:
As mentioned above, the NodeManager tracks the last-modification timestamp of each LocalResource before container-start. Before downloading, the NodeManager checks that the files haven't changed in the interim. This gives a consistent view of the LocalResources – an application can rely on the very same file-contents for its entire lifetime, without worrying about data corruption issues due to concurrent writers to the same file.
Once the file is copied from its remote location to one of the NodeManager's local disks, it loses any connection to the original file other than the URL (used while copying). Any future modifications to the remote file are NOT tracked, and hence if an external system has to update the remote resource, it should be done via versioning. YARN will fail containers that depend on modified remote resources, to prevent inconsistencies.
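The consistency rule above can be sketched as a simple check; this is an illustrative model, not actual NodeManager source code:

```java
// Sketch of the timestamp consistency check: the timestamp recorded in the
// resource request must match the remote file's current modification time,
// otherwise localization fails the container rather than risk serving
// inconsistent file contents.
class TimestampCheck {
    static void verify(long requestedTimestamp, long remoteModificationTime) {
        if (requestedTimestamp != remoteModificationTime) {
            throw new IllegalStateException(
                "Resource changed on remote file-system: expected timestamp "
                + requestedTimestamp + " but found " + remoteModificationTime);
        }
    }

    public static void main(String[] args) {
        verify(1370L, 1370L);        // unchanged: localization proceeds
        try {
            verify(1370L, 1371L);    // modified in the interim: container fails
        } catch (IllegalStateException expected) {
            System.out.println("rejected modified resource");
        }
    }
}
```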
Note that the ApplicationMaster specifies the resource time-stamps to a NodeManager while starting any container on that node. Similarly, for the container running the ApplicationMaster itself, the client has to populate the time-stamps for all the resources that the ApplicationMaster needs.
In the case of a MapReduce application, the MapReduce JobClient determines the modification-timestamps of the resources needed by the MapReduce ApplicationMaster. The ApplicationMaster itself then sets the timestamps for the resources needed by the MapReduce tasks.
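A client populates these timestamps by asking the file-system for each resource's modification time. A real YARN client would call Hadoop's FileSystem#getFileStatus on the remote (e.g. HDFS) path; the sketch below uses java.nio on a local temporary file purely so it is self-contained:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative look-up of a resource's modification time, the value the
// client records in the LocalResource request so that the NodeManager can
// later detect concurrent modification before localizing.
class TimestampLookup {
    static long modificationTime(Path p) throws IOException {
        return Files.getLastModifiedTime(p).toMillis();
    }

    public static void main(String[] args) throws IOException {
        Path jar = Files.createTempFile("app", ".jar"); // stand-in for the real resource
        long ts = modificationTime(jar);
        System.out.println(ts > 0); // true: a fresh file has a valid timestamp
        Files.delete(jar);
    }
}
```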
Each LocalResource can be of one of the following types: FILE, a regular file that is copied to the node as-is; ARCHIVE, an archive (such as a .jar, .tar.gz or .zip file) that the NodeManager automatically unpacks after copying; and PATTERN, a hybrid of the two, where the part of the archive matching a supplied regular-expression pattern is unpacked while the original file is also retained as-is.
LocalResources can also be classified in three ways depending upon their specified LocalResourceVisibility, i.e. depending on how visible/accessible they are on the original storage/file-system.
All the LocalResources (remote URLs) that are marked PUBLIC are accessible to containers of any user. Typically PUBLIC resources are those that can be accessed by anyone on the remote file-system and, following the same ACLs, are copied into the public LocalCache. If, in the future, a container belonging to this or any other application (of this or any user) requests the same LocalResource, it is served from the LocalCache and thus not copied/downloaded again, provided it hasn't been evicted from the LocalCache by then. All files in the public cache are owned by the "yarn-user" (the user the NodeManager runs as) with world-readable permissions, so that they can be shared by containers of all users whose containers are running on that node.
LocalResources that are marked PRIVATE are shared among all applications of the same user on the node. These LocalResources are copied into the specific user's (the user who started the container, i.e. the application submitter's) private cache. These files are accessible to all containers belonging to different applications, as long as they were all started by the same user. These files on the local file-system are owned by the user and are not accessible by any other user. As in the public LocalCache, even the application submitter has no write permissions – the user cannot modify these files once localized. This is to avoid accidental writes to these files by one container harming other containers – all containers expect them to be in the same state as originally specified (mirroring the original timestamp and/or version number).
All the resources that are marked with APPLICATION scope are shared only amongst containers of the same application on the node. They are copied into the application-specific LocalCache, which is owned by the user who started the container (the application submitter). All these files are owned by that user, with read-only permissions.
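The ownership and permission rules across the three visibilities can be summarised in a small sketch; this is an illustrative model of the policy described above, not NodeManager code, and the user names are hypothetical:

```java
// Maps a LocalResourceVisibility to the owner of the localized file and
// whether it is world-readable, per the cache rules described in the text:
// PUBLIC files belong to the NodeManager's user and are world-readable;
// PRIVATE and APPLICATION files belong to the application submitter.
class CachePolicy {
    static String ownerFor(String visibility, String appUser, String yarnUser) {
        switch (visibility) {
            case "PUBLIC":      return yarnUser;  // shared public cache
            case "PRIVATE":     return appUser;   // per-user private cache
            case "APPLICATION": return appUser;   // per-application cache
            default: throw new IllegalArgumentException(visibility);
        }
    }

    static boolean worldReadable(String visibility) {
        return visibility.equals("PUBLIC");
    }

    public static void main(String[] args) {
        System.out.println(ownerFor("PUBLIC", "alice", "yarn"));   // yarn
        System.out.println(ownerFor("PRIVATE", "alice", "yarn"));  // alice
        System.out.println(worldReadable("APPLICATION"));          // false
    }
}
```

In every case the localized files are read-only to containers; only the owner distinguishes who may share them.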
Note that the ApplicationMaster specifies this resource visibility to a NodeManager while starting the container – the NodeManager itself doesn't make any decisions about how to classify resources. Similarly, for the container running the ApplicationMaster itself, the client has to specify visibilities for all the resources that the ApplicationMaster needs.
In the case of a MapReduce application, the MapReduce JobClient decides the resource-type, which the corresponding ApplicationMaster then forwards to a NodeManager.
As already mentioned, different types of LocalResources have different life-cycles:
One thing of note is that for any given application, there may be multiple ApplicationAttempts, and each attempt may start zero or more containers on a given NodeManager. When the first container belonging to an ApplicationAttempt starts, the ResourceLocalizationService localizes files for that application as requested in the container's launch context. If future containers request more such resources, they too will be localized. If one ApplicationAttempt finishes/fails and another is started, the ResourceLocalizationService doesn't do anything with respect to the previously localized resources. However, when the application eventually finishes, the ResourceManager communicates that information to the NodeManagers, which in turn clear the application LocalCache. In summary, APPLICATION LocalResources are truly application-scoped and not ApplicationAttempt-scoped.
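That application-scoped (not attempt-scoped) life-cycle can be sketched as follows; this is an illustrative model of the behaviour just described, not the actual ResourceLocalizationService:

```java
import java.util.HashSet;
import java.util.Set;

// Models the life-cycle of APPLICATION LocalResources on one node:
// resources are localized on first request, reused by later containers
// and later attempts, and deleted only when the whole application ends.
class AppLocalCache {
    final Set<String> localized = new HashSet<>();

    // Called for every container start; downloads only what is missing.
    void onContainerStart(Set<String> requested) {
        for (String r : requested) {
            if (localized.add(r)) {
                System.out.println("localizing " + r);
            } // already present: reused, not downloaded again
        }
    }

    // An attempt finishing or failing does NOT clear the cache...
    void onAttemptFinished() { /* resources kept for the next attempt */ }

    // ...only application completion (signalled by the ResourceManager) does.
    void onApplicationFinished() {
        localized.clear();
        System.out.println("application cache deleted");
    }
}
```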
That ends our coverage of the basic concepts that application writers will need to know about LocalResources. LocalResources are a very useful feature that application writers can exploit to declare their startup and runtime dependencies. In the next post, we’ll delve deep into how the localization process itself actually takes place in the NodeManager.