We are right on the verge of celebrating 10 years of Apache Hadoop, and Hadoop Summit San Jose 2016 is almost here to mark the occasion! Held on June 28-30, 2016, it is the event for technical and business audiences to learn how big data continues to be a major force in transforming industries and to dive deep into the technologies that are driving this massive transformation.
Apache Hadoop YARN is also entering its fifth year since its debut in the first alpha releases, having evolved from Hadoop MapReduce into a more general, powerful, massively scalable, multi-tenant compute platform. Helping kick-start the project, being part of this epic journey and watching its momentous growth over time has been nothing short of a privilege. It’s been an amazing experience seeing the exponential growth in the deployments of YARN, coupled with the rapid adoption of applications and workloads running on top of this massive multi-tenant compute platform.
To celebrate these major milestones, I’d like to take this opportunity to introduce some of the talks at Hadoop Summit San Jose 2016 that can help you explore the latest happenings and trends in the Apache Hadoop YARN community.
The following is a curated list of sessions that relate to YARN as the generic resource-management platform. These talks from the contributors and committers of Apache Hadoop YARN cover the entire spectrum of efforts and initiatives targeting the past, present and future of YARN.
I have classified these 13 YARN sessions into 4 groups: (1) Extracting more value from your existing Hadoop YARN clusters, (2) Operationalizing YARN, (3) Workloads on YARN, and finally (4) What’s next in YARN.
If you are running Hadoop YARN clusters, the chances are that you are already looking for ways in which you can maximize the business value you are getting out of them.
In this category of sessions from committers / contributors of YARN, you can learn about various ongoing initiatives that can help you extract more value from your existing clusters – in terms of resource utilization and more fine-grained operational insights into what is happening in your clusters.
By Sangjin Lee, Twitter Inc and Li Lu, Hortonworks Inc.
It is more important now than ever to have comprehensive 360-degree monitoring to understand workload patterns, resource utilization, application performance, etc., and to glean insights that can fuel crucial optimizations. This talk introduces YARN Timeline Service (YTS) v.2, designed from the ground up for high scalability and to support a broad range of such use cases.
By Jason Lowe, Yahoo! Inc.
YARN requires applications to specify the size of the resources they wish to utilize, and this specification is strictly enforced, which can leave resources unutilized. This talk will describe the dynamic over-commit implementation that Yahoo! is running at scale, along with the corresponding results and pitfalls.
By Arun Suresh, Microsoft and Srikanth Kandula, Microsoft
This talk presents GoodFit, a new and efficient multi-resource packing allocator that packs YARN containers onto machines based on their requirements for all resource types: CPU, memory, disk and network. The talk will demonstrate how it simultaneously achieves better performance and fairness than the policies employed by the default YARN schedulers.
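To make the idea of multi-resource packing concrete, here is a minimal Python sketch of one generic heuristic: score each machine by the dot product of a container's demand vector and the machine's free-resource vector, and pick the highest-scoring machine that fits. This is not GoodFit's algorithm, which the abstract does not describe. The function names, the dictionary layout, and the assumption that all quantities are normalized to fractions of machine capacity are hypothetical and purely illustrative.

```python
from typing import Dict, List, Optional

# The four resource types named in the abstract. All values are assumed to be
# normalized to a fraction of machine capacity (0.0 to 1.0).
RESOURCES = ("cpu", "memory", "disk", "network")


def fits(demand: Dict[str, float], free: Dict[str, float]) -> bool:
    """A container fits only if every resource dimension is satisfied."""
    return all(demand[r] <= free[r] for r in RESOURCES)


def alignment(demand: Dict[str, float], free: Dict[str, float]) -> float:
    """Dot product of demand and free capacity: higher means a better match."""
    return sum(demand[r] * free[r] for r in RESOURCES)


def place(demand: Dict[str, float],
          machines: List[Dict[str, float]]) -> Optional[int]:
    """Place one container on the best-aligned machine that fits.

    Returns the index of the chosen machine, or None if nothing fits.
    The chosen machine's free capacity is reduced in place.
    """
    best_index, best_score = None, -1.0
    for i, free in enumerate(machines):
        if fits(demand, free):
            score = alignment(demand, free)
            if score > best_score:
                best_index, best_score = i, score
    if best_index is not None:
        for r in RESOURCES:
            machines[best_index][r] -= demand[r]
    return best_index


if __name__ == "__main__":
    # Hypothetical free capacity on two machines.
    machines = [
        {"cpu": 0.9, "memory": 0.2, "disk": 0.5, "network": 0.5},
        {"cpu": 0.6, "memory": 0.8, "disk": 0.5, "network": 0.5},
    ]
    cpu_heavy = {"cpu": 0.5, "memory": 0.1, "disk": 0.1, "network": 0.1}
    mem_heavy = {"cpu": 0.1, "memory": 0.6, "disk": 0.1, "network": 0.1}
    print(place(cpu_heavy, machines))  # machine with the most spare CPU
    print(place(mem_heavy, machines))  # only the second machine has the memory
```

A scheduler that looks at a single dimension, such as memory alone, cannot tell these two machines apart for the CPU-heavy container; considering all dimensions together is what lets a packing allocator avoid stranding capacity.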
In this category of sessions, you can learn how organizations are operationalizing YARN on-premises as well as in cloud environments. You can also hear about various ongoing initiatives that can help you with YARN cluster operations: which scheduling policies to adopt, what new behaviors the community is building, and how to debug YARN issues found in production environments.
By Abhishek Modi, Qubole Inc.
YARN is a big shift from Hadoop-1.x MapReduce, and operating it in a cloud environment as ephemeral, auto-scaling clusters has its challenges. This talk covers Qubole’s experience of navigating this migration, along with efforts to leverage public cloud features like spot instances and EBS volumes and to use cloud object stores as primary storage.
By Kendall Thrapp, Yahoo! Inc. and Shawna Martell, Yahoo! Inc.
At Yahoo!’s scale, hundreds of teams and thousands of individuals share large multi-tenant Hadoop clusters, and ensuring fair sharing of resources and fair funding of platform cost can be a real challenge. This talk covers Yahoo!’s journey in quantifying resource usage for both individual users and whole projects and the corresponding real dollar cost, based on a variety of platform cost factors, like power and bandwidth, instead of just server cost.
By Varun Vasudev, Hortonworks and Wangda Tan, Hortonworks
As existing workloads evolve and new workloads are executed on YARN, today’s policies and resource types in YARN need to evolve too. This talk focuses on efforts in the YARN community to (a) allow applications to express fine-grained scheduling concepts such as affinity and anti-affinity, fallback policies, and enhanced node-labels support, and (b) support arbitrary resource types, which would allow administrators to add new resource types to the scheduler and to define known resource profiles (or container bucket sizes).
By Jian He, Hortonworks and Ram Venkatesh, Hortonworks
Organizations work towards making YARN run smoothly for their users but invariably deal with various kinds of issues. This session looks at typical problems users face running YARN clusters, corresponding solutions and lessons learned. Areas of focus include hung / failing applications, unexpected scheduling behavior, and multi-tenancy issues that may cause unpredictable cluster slowness or even downtime.
YARN as the foundation of the multi-tenant data operating system has enabled a rich set of workloads and engines running together. The talks below discuss some of these workloads on YARN.
By Nitin Aggarwal, Rocket Fuel Inc. and Ishan Chhabra, Rocket Fuel Inc.
In this talk, the speakers from Rocket Fuel describe Helios, an in-house system built using Storm and HBase, all running dynamically on top of YARN, to combat fraudulent advertising auctions in real time.
By Ashwin Shankar, Netflix and Nezih Yigitbasi, Netflix
This talk by speakers from Netflix covers the technical aspects of their journey of productionizing Apache Spark, running ETL jobs in a multi-tenant YARN environment with hundreds of users and thousands of nodes, across a petabyte-scale data warehouse on Amazon S3.
By Kostas Sakellis, Cloudera
Apache Spark can take advantage of YARN to allow fair sharing of resources, both between Spark applications and between Spark and other frameworks. In this talk, you will learn about the common problems with running concurrent Spark applications on the same YARN cluster and Spark’s ability (and ongoing improvements) to manage cluster resources on demand to better optimize utilization.
If the last few years have been about the rise of YARN, the next steps will be about how this powerful compute platform evolves to support a multitude of new use-cases and how its architecture is foolproofed for unprecedented scale. The following collection of talks focuses on precisely these areas.
By Vinod Kumar Vavilapalli, Hortonworks (that’d be me 🙂 )
Apache Hadoop YARN is a modern resource-management platform that can host multiple data processing engines for various workloads like batch processing (MapReduce), interactive (Hive, Tez, Spark) and real-time processing (Storm). In this talk, I’ll cover a new suite of use-cases that the YARN community is working towards: services and more complex wiring of different application types. Businesses increasingly care less about the infrastructure and more about how to drive end-to-end use-cases. In this context, we will also discuss APIs, the tool-set, and how the new multi-colored YARN story empowers the developer community.
By Subru Krishnan, Microsoft and Kishore Chaliparambil, Microsoft
YARN’s growing popularity and a trend towards workload consolidation are pushing large organizations towards deploying bigger and bigger clusters. In this talk, you can hear from speakers from Microsoft about making YARN work at large datacenter scale, 20k-100k nodes, via a scale-out, federation-based solution.
That’s an impressively varied collection of talks, all driven by individuals from across the globe and from organizations both large and small.
Also, don’t forget the Birds of a Feather (BoF) session on YARN on Thursday, June 30th, from 5:00 to 7:00 pm, which I’ll be coordinating. All BoFs are open to everyone in the community and don’t require a Hadoop Summit pass.