Deployments at Yahoo!
The meetup kicked off with YARN committers from Yahoo! presenting on current Hadoop 2.0 deployments there. The presentation covered the following:
- Scenarios where YARN has advanced the state of the art: scalability, its current stability, the power of the YARN web services, and its markedly better performance compared to previous versions.
- Efforts undertaken to battle-test YARN, including application validation and performance benchmarking.
- Suggestions for improvement, such as UI loading times and the lack of a generic application-history server.
Chris Riccomini on “Building Applications on YARN”
Chris Riccomini from LinkedIn then presented on his experience building applications on YARN. He briefly covered the anatomy of a YARN application and then dove into the various dimensions a YARN application developer should think about: deployment, metrics, logging, and application-specific configuration, to name a few.
The most interesting part of his presentation was how, pre-production, small YARN clusters can be used to develop applications in an agile manner. For example, one could start with the local file system instead of HDFS to minimize operational effort, and then switch over to a full-blown distributed file system once scalability demands it. Also worth attention is how YARN’s web-service APIs can be exploited to build custom dashboards.
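As a rough illustration of the dashboard idea, the snippet below polls the ResourceManager’s REST API and extracts a few fields per running application. The RM address (localhost:8088) and the exact response fields are assumptions in this sketch; verify them against your cluster before relying on them.

```python
import json
from urllib.request import urlopen

def parse_apps(body):
    """Extract (id, name, progress) from the RM's JSON app listing."""
    apps = json.loads(body).get("apps") or {}
    return [(a["id"], a["name"], a["progress"]) for a in apps.get("app", [])]

def fetch_running_apps(rm="http://localhost:8088"):
    """Query the ResourceManager REST API for RUNNING applications.
    (localhost:8088 is an assumed RM address for this sketch.)"""
    with urlopen(rm + "/ws/v1/cluster/apps?state=RUNNING") as resp:
        return parse_apps(resp.read().decode("utf-8"))

# A trimmed sample response, so the parsing is visible without a cluster:
sample = ('{"apps": {"app": [{"id": "application_1_0001", '
          '"name": "wordcount", "progress": 42.0}]}}')
print(parse_apps(sample))  # [('application_1_0001', 'wordcount', 42.0)]
```

A dashboard would call `fetch_running_apps` on a timer and render the tuples however it likes; the JSON endpoint keeps the dashboard decoupled from YARN internals.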
YARN API Discussion
After that, Arun recapped YARN’s powerful scheduling API, through which application developers request cluster resources. He walked us through the scheduling concepts, and rounded off with how scheduling happens in the context of an example MapReduce job.
Bikas and I then gave a brief overview of the APIs available to application developers. We described some of the pain points with the APIs that various users have reported recently, and the efforts underway to address them. To enumerate a few:
- How to make the scheduling logic explicit: for example, that the scheduler first looks for free resources on the requested node, then on the requested rack, and finally off-rack
- Multiple ways to release and reject containers
- Use-cases which require resources on specific nodes and/or racks
- Applications that want to avoid/blacklist some nodes and/or racks
- Limitations on the number of threads making resource requests
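To make the node-then-rack-then-off-rack point concrete, here is a minimal sketch of that fallback. The data structures are toy ones invented for this illustration; this shows the concept only, not YARN’s actual scheduler code:

```python
def assign(request_node, request_rack, free_by_node, node_to_rack):
    """Pick a node for a request, preferring node-local, then rack-local,
    then any node with free capacity (off-rack)."""
    # 1. Node-local: the requested node itself has free resources.
    if free_by_node.get(request_node, 0) > 0:
        return request_node, "NODE_LOCAL"
    # 2. Rack-local: any free node on the requested rack.
    for node, free in free_by_node.items():
        if free > 0 and node_to_rack.get(node) == request_rack:
            return node, "RACK_LOCAL"
    # 3. Off-rack: any free node at all.
    for node, free in free_by_node.items():
        if free > 0:
            return node, "OFF_SWITCH"
    return None, None

free = {"n1": 0, "n2": 2, "n3": 1}
racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(assign("n1", "r1", free, racks))  # ('n2', 'RACK_LOCAL')
```

Making this relaxation explicit in the API, rather than implicit in the scheduler, was exactly the kind of clarification users asked for.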
We then opened the API discussion for further feedback. This exercise was very fruitful: we discovered how various users were experimenting with the APIs and what pitfalls and limitations they ran into. Some concrete suggestions included:
- Libraries for recovering AMs, launching containers
- A generic framework for applications to expose specific data via http or web-services.
- A generic application history server
- Tagging nodes with labels (e.g., GPU) and using those labels for scheduling, as an extension of data locality
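A hypothetical sketch of the label suggestion (YARN had no such feature at the time, and all names here are invented): tag each node with a set of labels and filter candidates before scheduling.

```python
def nodes_with_labels(nodes, required):
    """Return names of nodes whose label set covers all required labels."""
    required = set(required)
    return sorted(name for name, labels in nodes.items()
                  if required <= set(labels))

# Toy cluster: node name -> set of labels.
cluster = {
    "n1": {"gpu"},
    "n2": {"gpu", "ssd"},
    "n3": {"ssd"},
}
print(nodes_with_labels(cluster, {"gpu"}))  # ['n1', 'n2']
```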
Our slides are available here.
After a short break, Alejandro Abdelnur from Cloudera briefly talked about the efforts underway to augment YARN with cpu-isolation using cgroups.
Finally, Siddarth Seth from Hortonworks talked about his work on modifying the MR application master to reuse containers for jobs both large and small. This exciting development opens the door to new innovations in MapReduce, like intermediate output aggregation. You can read through Sid’s presentation below. The core points covered are:
- Decoupling the TaskAttempt and Container concepts inside MR AM
- Adding new first-class concepts of Container, Node, and Scheduler
- The current state of the effort
- New avenues this transition opens up – custom task types, output aggregation, performance optimizations.
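Conceptually, container reuse amounts to pooling: when a task attempt finishes, the AM parks the container instead of releasing it back to the ResourceManager, then hands it to the next pending task. The sketch below is my own minimal illustration of that idea, not Sid’s implementation:

```python
from collections import deque

class ContainerPool:
    """Reuse launched containers across task attempts instead of
    requesting a fresh container per task (conceptual sketch)."""
    def __init__(self):
        self.idle = deque()

    def release(self, container_id):
        # Instead of returning the container to the RM, keep it idle.
        self.idle.append(container_id)

    def acquire(self):
        # Reuse an idle container if one exists; None signals that a
        # new allocation from the RM is needed.
        return self.idle.popleft() if self.idle else None

pool = ContainerPool()
assert pool.acquire() is None    # nothing idle yet: must ask the RM
pool.release("container_01")     # a task attempt finished; park it
print(pool.acquire())            # container_01
```

Skipping the allocate/launch round trip per task is where the win comes from, especially for small jobs with many short tasks.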
His slides are available here.
The success of this meetup reaffirmed the community’s excitement about YARN, and strengthened our desire to make it a recurring event. We look forward to the next one, hopefully with a bigger turnout, extended brainstorming, and of course, more pizza and beer.