We recently hosted the fourth of our seven Discover HDP 2.1 webinars, titled Apache Hadoop 2.4.0, HDFS and YARN. It was well attended and informative. The speakers outlined the new features in YARN and HDFS in HDP 2.1.
Many thanks to our presenters, Rohit Bakhshi (Hortonworks’ senior product manager), Vinod Kumar Vavilapalli (co-author of the YARN Book, PMC, Hadoop YARN Project Lead at Apache and Hortonworks), and Justin Sears (Hortonworks’ Product Marketing Manager).
If you missed the webinar, here is the complete recording.
And here is the presentation deck.
|What is the effective block size in production in general?||The default HDFS block size was 64 MB in Hadoop 1.x (Hadoop 2.x raised the default to 128 MB), and production clusters usually use 128 MB.|
|If YARN and HDFS are (roughly) for storage and memory management, which daemon does the processing? That is, what does the work that the JobTracker did in previous versions of Hadoop (HDP 1.0)? In short, where, or by whom, is the data processed?||In the Hadoop 2.0 architecture with YARN, resource management and data processing are decoupled between the ResourceManager and the per-job ApplicationMaster. The ResourceManager handles all scheduling and hands containers to the ApplicationMasters, while each ApplicationMaster handles job-level orchestration using the containers granted by the ResourceManager. For example, the MapReduce ApplicationMaster is now responsible for launching and tracking all the tasks in an MR job, which the JobTracker previously handled itself. In the YARN world, this decoupling lets the ApplicationMaster do job life-cycle management and data processing while the ResourceManager manages resources efficiently.|
|Does preemption force the scheduler to allocate resources on the nodes where the preemption module releases containers? Could data locality thus be overridden because of preemption?||The scheduler makes a best-effort attempt to honor data locality when allocating resources to a high-priority application's container requests. Preemption frees resources that are being consumed by an over-subscribed queue, and those resources then become available for the scheduler to allocate to the high-priority application's container requests.|
|How is application and task execution handled in YARN and MapReduce? Has it changed from the 1.0 to the 2.0 world?||Application and task execution is a MapReduce responsibility. It works better in 2.0 than in 1.0 in several respects. We have implemented the MR ApplicationMaster, which works with YARN to execute MR jobs. We built it as a library so that other applications can use it in their frameworks. We have also improved how application execution is managed and monitored.|
|What UIs are available to monitor applications in HDP 2.1?||Today, we continue to have the MR Job History Server, which you may already know from 1.0 and 2.0. HDP 2.1 also adds new functionality called the YARN Timeline Server. It is exposed through Ambari, where the Hive-on-Tez job views are built by pulling events and metrics and visualizing them. As a result, you get a DAG view of the Tez runtime engine: you can see metrics for each DAG job and visualize the performance of each job. Over time, we plan to add more visualizations by harnessing the common monitoring framework for HDP.|
|What do you mean by full stack Resource Manager Resiliency?||When we certified YARN ResourceManager HA, we tested the entire data access layer stack (Hive, Pig, Tez jobs, MR jobs, HBase jobs, and so on) and made sure that when the ResourceManager fails over, the components pause, retry, re-establish connections, and resume progress. The idea is that we don't want any service disruption when there is a failure event or when one of the ResourceManagers goes down, which is something every enterprise customer and every enterprise operator wants. Failover is automatic: all the downstream applications recover and restart on their own, with no manual intervention. That is what we mean by restart resiliency: the system is architected to resubmit and restart work after a failure.|
|How does YARN take advantage of data locality?||Data locality is implemented through cooperation between the YARN platform and the individual applications. When an ApplicationMaster requests containers, it specifies the nodes on which its data blocks reside; the YARN scheduler understands these hints and tries to assign containers on those nodes, where the tasks then execute. For example, in MapReduce, the ApplicationMaster knows that the HDFS blocks for its input files are on a specific set of machines, and it converts that information into container requests for those nodes so that tasks run close to their data. YARN then allocates the containers best suited to the tasks, as close as possible to the data those tasks need.|
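The locality answer above can be illustrated with a small toy model. This is a minimal sketch, not YARN's scheduler code: `ContainerRequest`, `allocate`, `pick_node`, the node and rack names, and the one-slot-per-container capacity model are all hypothetical. It assumes the usual fallback order of node-local first, then rack-local, then any node with free capacity.

```python
from dataclasses import dataclass


@dataclass
class ContainerRequest:
    """Hypothetical stand-in for an ApplicationMaster's container request."""
    task: str
    preferred_nodes: list  # nodes that hold the task's HDFS blocks


def pick_node(req, free_slots, rack_of):
    """Return (node, locality_level), or (None, None) if the cluster is full."""
    # 1. Node-local: a preferred node still has a free slot.
    for node in req.preferred_nodes:
        if free_slots.get(node, 0) > 0:
            return node, "node-local"
    # 2. Rack-local: another node on the same rack as a preferred node.
    preferred_racks = {rack_of[n] for n in req.preferred_nodes if n in rack_of}
    for node in sorted(free_slots):
        if free_slots[node] > 0 and rack_of.get(node) in preferred_racks:
            return node, "rack-local"
    # 3. Off-rack: any node with capacity.
    for node in sorted(free_slots):
        if free_slots[node] > 0:
            return node, "off-rack"
    return None, None


def allocate(requests, free_slots, rack_of):
    """Assign each request to a node, consuming one slot per container."""
    assignments = {}
    for req in requests:
        node, level = pick_node(req, free_slots, rack_of)
        if node is None:
            continue  # no capacity anywhere; the request stays pending
        free_slots[node] -= 1
        assignments[req.task] = (node, level)
    return assignments


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
    slots = {"n1": 1, "n2": 1, "n3": 1}
    reqs = [ContainerRequest(f"map-{i}", ["n1"]) for i in range(3)]
    for task, placement in allocate(reqs, slots, racks).items():
        print(task, placement)
```

In this toy run all three map tasks want node `n1`, which has a single slot, so the later requests fall back to the same rack and then to another rack. It is the same best-effort behavior described in the answers above, reduced to a few lines.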
Attend our next Discover HDP 2.1 webinar on Thursday, June 12 at 10 am Pacific Time: Apache Solr for HDP 2.1
And if you have any further questions pertaining to YARN and HDFS—documentation, code examples, tutorials—please post them on the Community forums under YARN and HDFS.