Hortonworks hosted the second Apache Hadoop YARN meetup at the Hortonworks office in Palo Alto last Friday (22 February 2013). Following the success of the first one, this meetup continued to enjoy good attendance from the YARN community: about 40 people joined in person and nearly 30 more attended via phone/WebEx.
The Yahoo! grid team responsible for the YARN rollout on their clusters gave an update on their current deployments. Robert Evans and others from the team shared some very impressive numbers about their YARN clusters: tens of millions of jobs run on YARN so far, averaging ~100,000 jobs per day on some clusters. Please go ahead and read their recent blog on the Yahoo! developer network: Hadoop at Yahoo!: More Than Ever Before. They then fielded several questions from the community, such as pain points users hit during the upgrade and issues that only surfaced at scale. The software is deemed sufficiently stable, churning out jobs impressively with maximum uptime, with downtime mostly limited to upgrades.
After the update from Yahoo!, Bikas Saha from Hortonworks talked about the ResourceManager restart functionality. Most of his work is captured in the Apache JIRA issue YARN-128. The effort is divided into phases.
The first phase, which restarts all running applications on RM recovery, is done and shipped with the latest Hadoop release, 2.0.3-alpha. He walked through the overall design on a whiteboard, explaining the implementation.
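As a conceptual illustration of that first phase (this is a sketch of the persist-and-resubmit pattern, not YARN's actual classes — all names here are hypothetical):

```python
# Conceptual sketch: the RM persists each application's submission context
# to a durable store, and on restart replays the saved contexts so running
# applications are resubmitted. Names are hypothetical, not YARN's own.

class StateStore:
    """Stands in for a durable state store behind the ResourceManager."""
    def __init__(self):
        self._apps = {}

    def save_application(self, app_id, submission_context):
        self._apps[app_id] = submission_context

    def load_applications(self):
        return dict(self._apps)

class ResourceManager:
    def __init__(self, store):
        self.store = store
        self.running = {}

    def submit(self, app_id, context):
        # Persist before accepting, so the application survives an RM crash.
        self.store.save_application(app_id, context)
        self.running[app_id] = context

    def recover(self):
        # On restart, resubmit every application found in the store.
        for app_id, context in self.store.load_applications().items():
            self.running[app_id] = context

store = StateStore()
rm = ResourceManager(store)
rm.submit("app_0001", {"queue": "default"})

# Simulate an RM restart against the same store: the application comes back.
rm_restarted = ResourceManager(store)
rm_restarted.recover()
```

The key design point is that the submission context is written to the store before the application is accepted, so a crash between the two steps never loses an accepted application.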
Chris Riccomini then talked a bit about what he continues to do with YARN (see his notes from the last meetup).
Arun talked about the enhancements to YARN resource scheduling to also account for CPU cores in addition to the memory-based scheduling we already have. This effort is captured in Apache JIRA as YARN-2. Arun walked us through the Dominant Resource Fairness (DRF) algorithm on which this work is based, described various scheduling scenarios, and summed up with possible future directions.
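To make the DRF idea concrete, here is a minimal sketch (not YARN's scheduler code; the cluster capacities and per-user demands are illustrative numbers): each user's dominant share is the largest fraction of any single resource they hold, and the scheduler repeatedly grants a task to the user with the smallest dominant share.

```python
# Minimal sketch of Dominant Resource Fairness (DRF) allocation.
# Capacities and demands are made-up numbers for illustration only.

CLUSTER = {"memory_gb": 100, "vcores": 50}

def dominant_share(usage):
    """A user's dominant share is the max fraction used of any resource."""
    return max(usage[r] / CLUSTER[r] for r in CLUSTER)

def drf_allocate(demands, rounds):
    """Repeatedly grant one task to the user with the smallest dominant share."""
    usage = {user: {r: 0 for r in CLUSTER} for user in demands}
    for _ in range(rounds):
        # Pick the user currently furthest behind on their dominant resource.
        user = min(usage, key=lambda u: dominant_share(usage[u]))
        for r in CLUSTER:
            usage[user][r] += demands[user][r]
    return usage

# User A runs memory-heavy tasks, user B runs CPU-heavy tasks.
demands = {"A": {"memory_gb": 4, "vcores": 1},
           "B": {"memory_gb": 1, "vcores": 2}}
allocation = drf_allocate(demands, rounds=10)
```

With these numbers the allocation converges so that both users end up with equal dominant shares (memory for A, CPU for B), which is exactly the fairness property DRF aims for.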
Alejandro also gave a brief summary about adding support for CPU isolation/monitoring of containers. YARN-3 enhances YARN to use cgroups to control the CPU usage of containers. There is still a little work left to expose this feature to end-users.
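For reference, a sketch of what enabling cgroups-based CPU isolation looks like in yarn-site.xml. The property names below follow the 2.0.3-alpha era configuration as I understand it, and the cgroup hierarchy path is a placeholder to adapt to your environment:

```xml
<!-- Sketch only: enable the LinuxContainerExecutor with the cgroups
     resources handler. The hierarchy path is an illustrative placeholder. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
```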
CPU scheduling and support for isolation via cgroups are both available in the most recent Hadoop release, 2.0.3-alpha. Both of these features are big steps toward YARN's goal of becoming the foremost generic resource-management layer and making Hadoop the 'distributed operating system' on which other data systems are built.
I did a quick recap of what we discussed at the last YARN meetup and what we've achieved since. In a few areas, the community has delivered on its promises from last time:
Libraries for helping application writers: YARN-418 is the umbrella ticket tracking this, and we have made good progress. YARN-29 helps with application submission, and YARN-103 simplifies the use of the AM-RM protocol.
CPU isolation and scheduling: YARN-2 and YARN-3 are checked in, as noted above.
RM restart: The first phase, restarting running AMs and NMs, is already in as part of YARN-128.
I then summed up with our roadmap. The YARN community's goal for the next version of Hadoop is to address some rough corners that are hampering its adoption beyond alpha use, and we discussed several areas of focus.
Thanks to everyone for making the YARN meetups a continued success story. All help from the community is welcome as we focus on solidifying our next release. Looking forward to meeting you all again at the next meetup!