Field Notes: Apache Hadoop YARN Meetup at LinkedIn

I’ve been sitting on this post for a while as Apache Hadoop 2 GA work was keeping me extremely busy. As they say, better late than never, so here we go :) – the slides are at the end of the post.

Three weeks ago, we had an Apache Hadoop YARN meetup at LinkedIn. The kind folks at LinkedIn offered to host us, in addition to talking about exciting projects such as their own usage of YARN and applications on YARN like Apache Samza, Apache Giraph and Apache Helix.

It was well attended: about 100 signed up! Many attended in person, lots more joined via WebEx, and a few made their attendance very well known by not turning off their mics :)

Winds of change

With Apache Hadoop 2 going GA, it was a good time to catch up and review where it is all going and how different applications are being built on top. For me, this meetup represented a big thematic change. So far in YARN meetups, our discussions and talks have centered only on YARN internals and the efforts underway on the YARN platform itself. This was the first meetup where upwards of 75% of the content was focused entirely on applications on YARN. Very soon, we may be forced to have separate meetups for the platform and for applications/frameworks :)

Hadoop 2.0 beta and GA

I first recapped the work done to stabilize the YARN APIs and make them future-proof. Most of that work was tracked at Apache JIRA (YARN-386). As you may know, this work lets us support stable and apt APIs for a long time and avoid the potential pain of supporting bad APIs within and after the beta and stable releases. We also have a migration guide for our alpha users.

I then talked about the binary compatibility work that was done to ease the migration of existing MapReduce applications. This work was tracked at MAPREDUCE-5108, and we have written about it in detail in the past. To summarize that post: barring things like the new mapreduce APIs, all your existing applications are either already binary compatible or work directly once you start using the latest versions of Pig, Hive, Oozie etc.

I finally rounded it off with the amount of testing that went into Hadoop 2.0 beta, from the core all the way up through the rest of the stack components. I should get the folks who spent a lot of time validating the beta and GA releases to write about their experiences.

Application History Server – Mayank Bansal

Mayank Bansal talked about the Application History Server effort that he’s been working on with others. He described the motivations for the effort, then explained the architecture and the design, and followed up with pending work and future plans. The icing on the cake was his live demo showing how finished applications are served by the Application History Server. You can get more information about this work at Apache JIRA YARN-321. We are looking to merge this branch into trunk soon, and then into one of the 2.x releases!

ResourceManager reliability – Bikas Saha, Jian He, Karthik Kambatla

Bikas Saha summarized the ResourceManager reliability work that the community has been focusing on. He outlined the design and the work plan.

Jian He then took over and explained the RM-restart effort, mixing it up with great energy and a good amount of humour. He started with a description of the current state of the effort, then moved into the inner details of the architecture, briefly touched on the impact of ResourceManager restart on applications, frameworks and downstream components, and concluded with instructions on how one can use this ‘nice feature’ that he’s been working on. All of this work is tracked at YARN-128.

Karthik Kambatla then concluded the RM reliability topic with his efforts on RM fail-over. He described the overall architecture, the planned changes, and how it looks to admins with respect to configuration. This is a work in progress that is captured at YARN-149, and it is an important effort that closes one of the last gaps in YARN’s reliability story.

Apache Tez – Hitesh Shah

Hitesh Shah from Hortonworks kick-started the applications track with Apache Tez. He covered what Tez is, how it works on top of YARN, its current state and its future direction, and welcomed new contributors. You can learn more about Apache Tez here.

Hitesh and other contributors of the Apache Tez project have been running a multi-part blog series on all things Tez that you can read starting here.

YARN at LinkedIn – Mohammad Islam, Chris Riccomini and Kishore Gopalakrishna

Mohammad Kamrul Islam then took over the baton, kick-starting the sequence of talks about all things YARN at LinkedIn. Among many things, particularly impressive is the usage of Apache Giraph on top of YARN and the massive scale at which it is running inside LinkedIn. He specifically requested that I take his picture during his talk, and I gladly complied :)

After that, Chris Riccomini, one of YARN’s very early adopters, talked about Apache Samza and how it can run on top of YARN. Apache Samza fills the gap for streaming applications in the Hadoop ecosystem. You can read more about it at Apache Incubator here.

Finally, Kishore Gopalakrishna concluded the YARN-at-LinkedIn theme with a talk about managing YARN containers via Apache Helix. You can read more about it at Apache Helix’s incubator website.

Llama – Alejandro Abdelnur

Alejandro Abdelnur talked about getting Impala running on YARN. His talk was the quickest talk of them all, about Llama, a system that mediates between Impala and YARN – to run work in processes outside of the typical container lifecycle in YARN. You can learn more about it here.

Go Hadoop! – Arun C Murthy

Arun C Murthy then talked about what he’s been playing with recently: a native YARN application in Go! It demonstrated how YARN is moving past Hadoop’s Java roots and enabling all kinds of applications irrespective of the choice of language. He described how various efforts, like protocol buffer support in Hadoop and YARN’s usage of cross-platform API descriptions, are helping such innovations. He didn’t stop at theory and concluded his talk with a cool demo of a real Go application running live on top of YARN! You can get the sources to play with it and/or learn more about it from Arun’s blog post.

Conclusion

As we’ve seen, this meetup represents a change in the land of Apache Hadoop YARN and its ecosystem. YARN is increasingly being adopted and taken for granted as the platform of choice, and innovation is now at full throttle in the application and framework layers. We hope this continues at full pace, strengthening YARN’s place in the ecosystem and at the same time enabling it to go beyond its original ambitions. That is it for now. I will be back with more coverage of future YARN meetups, so stay tuned.
