June 12, 2017

Explore the latest of Apache Hadoop YARN at Dataworks Summit San Jose 2017

This post introduces some of the talks and sessions from Dataworks Summit San Jose 2017 that cover the efforts of the Apache Hadoop YARN community. Come explore the latest of Apache Hadoop YARN at Dataworks Summit San Jose 2017!

Dataworks Summit San Jose 2017

Dataworks Summit / Hadoop Summit San Jose 2017 is almost upon us!

Held June 13-15, it is the industry’s largest big data event. With three full days of industry experts speaking about how open source technologies continue to fire the big data revolution on all cylinders, the agenda is simply staggering! Topics cover the latest and greatest of the big data ecosystem projects at Apache, big data practices on premises and in the cloud, and the latest industry advancements in predictive analytics, deep learning and artificial intelligence.

Apache Hadoop 3.0 is also around the corner, with the community kicking the tires on many of the alpha releases. Meanwhile, Apache Hadoop YARN is now entering its sixth year since it first appeared in the early alpha releases of Apache Hadoop 2.0. In Hadoop 3.0, YARN is again going through revolutionary changes, just as it did during 2.0 when it helped Hadoop evolve from a MapReduce-only platform into a more general-purpose, large-scale, multi-tenant compute platform.

Having been part of this journey since YARN’s inception, I continue to be inspired by the project’s evolution. I’d like to take this opportunity to introduce some of the talks and sessions from Dataworks Summit San Jose 2017 that explore the latest happenings and trends in the Apache Hadoop YARN community.

Sessions at the main conference

Apache Hadoop 3.0 Community Update

By Junping Du (Hortonworks) and Andrew Wang (Cloudera)

Apache Hadoop 3 is coming! Go to this talk to learn about the status of the Hadoop 3.0 release work in the community and its path through alpha and beta towards GA. Coverage includes features like Erasure Coding in HDFS, Docker container support, YARN native service support, Application Timeline Service version 2, Hadoop library updates and client-side classpath isolation. Last but not least, the talk will focus on a few incompatible API and CLI changes that could pose challenges for downstream projects and existing Hadoop users who are looking to upgrade.

Apache Hadoop YARN: Present and Future

By Vinod Kumar Vavilapalli, Hortonworks (hey, that’s me!)

In this talk, I’ll first discuss the current status of Apache Hadoop YARN and then the future promise of features and initiatives such as 10x scheduler throughput improvements; Docker container support on YARN; native support for long-running services (alongside applications) without any changes; seamless application upgrades; fine-grained isolation for multi-tenancy using CGroups on disk and network resources; powerful scheduling features like application priorities and intra-queue preemption across applications; and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.

Lessons learned from scaling YARN to 40k machines in a multi tenancy environment

By Hitesh Sharma and Roni Burd, Microsoft

Last year’s Summit saw our friends at Microsoft announce their intent and designs for running YARN at a scale of 100K nodes. This year, they are back with their progress and real-life experience of running it at a scale of more than 40K nodes and 500,000 jobs per day! They will talk about how they leverage federation (YARN-2915) and Mercury (YARN-2877) to scale out to more than 40,000 nodes (spread across clusters) at 3000 allocate/second while achieving <5s response time at the 95th percentile. Join this session to learn about the challenges and lessons of running YARN at humongous scale. Don’t miss this one!

Running a container cloud on YARN

By Shane Kumpf and Jian He, Hortonworks

Two things: (a) YARN now supports running Docker containers alongside process containers, and (b) YARN now also supports running services side by side with apps. This talk will present how you can combine the two and run a container cloud on YARN. Go to this talk for lessons learned from real-life experience running a YARN-based container cloud and handling issues with resource management, debugging application failures, running Docker, and so on.

Building a modern end-to-end open source Big Data reference application

By Edgar Orendain (UC Berkeley / Hortonworks)

In this talk, Edgar Orendain walks through a modern real-time streaming application serving as a reference framework for developing a big data pipeline, complete with a broad range of use cases and powerful reusable core components. Watch how he brings up such a complex app easily on YARN in Hadoop 3.0!

Never late again! Job-Level deadline SLOs in YARN

By Subru Krishnan and Carlo Curino, Microsoft

Microsoft is back again with their Morpheus system, built on the YARN Reservations feature, which addresses the tension between cluster administrators’ expectations of high cluster utilization and users’ need for predictable job performance. Seeing them validate the ideas in this system against production traces from a 50k-node cluster should be fun!

Hadoop ecosystem boosts Tensorflow and machine learning technologies

By Wangda Tan and Yanbo Liang, Hortonworks

Deep learning is all the rage in the industry, and what better poster child for it than TensorFlow? Attend this talk to learn about leveraging YARN to manage large-scale TensorFlow services running on a GPU-equipped cluster, sharing the same cluster with other tenants and applications, using existing big data tools like Spark and Hive for large-scale data preprocessing, and using Zeppelin as an interactive interface to orchestrate and visualize the learning workflow. How the speakers tie it all together to solve a classic machine learning challenge, Click Through Rate (CTR) prediction for online ads, is a must-watch.

Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop

By Jon Eagles and Kuhu Shukla, Yahoo!

In this talk, the speakers introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler. This powers Apache Pig and Hive at scale at Yahoo!, so look for the performance evaluation results from real-world jobs and the future roadmap.

Medea: Expressive Scheduling of Long-Running Applications

By Arun Suresh and Konstantinos Karanasos, Microsoft

Microsoft’s big data clusters have 10% to 30% or more of their machines dedicated to long-running containers for workloads such as streaming, machine learning, and latency-sensitive applications. These applications have stringent and complex scheduling requirements. Go to this talk to learn about Medea, an extension of Apache Hadoop YARN for scheduling such long-running applications.

Yahoo – Moving beyond running 100% of Apache Pig jobs on Apache Tez

By Rohini Palaniswamy, Yahoo!

Last year Yahoo! put great effort into scaling, stabilizing and making Pig on Tez production-ready, and by the end of the year had retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo! achieved after migrating all Pig jobs to run on Tez over YARN.

Sessions at the sister Meetups

Flexible and Scalable Compute Resource Management with Apache Hadoop YARN for Large Organizations

By Jonathan Hung (LinkedIn), Xuan Gong (Hortonworks)

Happening at the 56th Bay Area Hadoop User Group (HUG) Meetup on Monday, June 12, 2017 at 6:00 PM, this joint talk between LinkedIn and Hortonworks covers new improvements to Hadoop YARN that allow cluster and queue configurations to be changed dynamically via APIs, and that give better control of the queue hierarchy by supporting queue add/remove/rename/move without restarting the ResourceManager. Stay to learn how LinkedIn uses these enhancements on multi-thousand-node clusters not only to facilitate queue management, but also to build tools that improve compute utilization and resource usage monitoring.

YARN Scheduling – A Step Beyond

By Sunil Govind (Hortonworks), Jian He (Hortonworks)

Also happening at the 56th Bay Area Hadoop User Group (HUG) Meetup on Monday, this talk covers the latest and greatest of the YARN Capacity Scheduler: global scheduling support, general placement support, a better preemption model to handle resource anomalies across and within queues, support for configuring absolute resources, and priority support between queues and applications.

First-Class GPU Support for Big-Data Apps on Your Apache Hadoop YARN Clusters

By Wangda Tan, Hortonworks

This talk at the pre-summit Spark and TensorFlow meetup on Monday, June 12, 2017 has Wangda speaking about how GPUs enable modern deep learning apps and how, with first-class GPU support (including configuration, discovery, scheduling and isolation) added to Apache Hadoop YARN, applications running on YARN are finally able to leverage the capability of GPUs in a shared cluster.

Closing thoughts

I am sure that is just the tip of the iceberg; there are tons of sessions that cover other exciting topics beyond YARN. Here’s hoping to see you all at Dataworks Summit, and wishing you a great conference!

Also, don’t forget that Hortonworks is sponsoring several Birds of a Feather (BoF) sessions, hosted by Apache committers and Hortonworks’ architects, tech leads, and engineers. I am doing a BoF on YARN on Thursday, June 15th (the last day of the Summit) from 5:00 to 7:00 pm. See you all there!

