In preparation for Hadoop Summit San Jose, I asked Andy Feng, VP Architecture at Yahoo! and Chair of the Apache Committer Insights track, which three sessions he would most recommend. Although it was tough to choose only three, he recommended:
Speakers: Chris Nauroth from Hortonworks and Arpit Agarwal from Hortonworks
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from recurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and will be able to identify helpful new metrics to monitor in their own clusters.
Speaker: Jun Rao from Confluent
To manage the ever-increasing volume and velocity of data within your company, you have successfully made the transition from single machines and one-off solutions to large, distributed stream infrastructures in your data center powered by Apache Kafka. But what needs to be done if one data center is not enough? In this session we describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence. We provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication and mirroring, as well as disaster scenarios and failure handling.
Speakers: Arun Suresh from Microsoft and Srikanth Kandula from Microsoft
Tasks in modern data-parallel clusters have highly diverse resource requirements across CPU, memory, disk and network. We present an efficient Multi-Resource Packing allocator that packs tasks to machines based on their requirements of all resource types. Doing so avoids resource fragmentation as well as over-allocation of the resources that are not explicitly allocated, both of which are drawbacks of the DRF (Dominant Resource Fairness) policies employed by the default YARN schedulers. ‘GoodFit’ adapts heuristics for the multidimensional bin packing problem to the context of cluster schedulers, wherein task arrivals and machine availability change in an online manner and wherein tasks’ resource needs change with time and with the machine that the task is placed on. In addition, GoodFit improves average job completion time by preferentially serving jobs that have less remaining work. This talk will demonstrate how, given that the above heuristics are compatible with a large class of fairness policies, GoodFit is simultaneously able to achieve better performance and fairness than DRF allocations. Trace-driven simulations and deployment of our Apache YARN prototype on a 250-node cluster show improvements of over 30% in makespan and job completion time while achieving nearly perfect fairness.
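To make the packing idea in this abstract a little more concrete, here is a rough Python sketch. It is not GoodFit's actual code. The dot-product "alignment" score, the `weight` parameter, and every name below are assumptions chosen for illustration; the abstract only says that tasks are packed by all of their resource requirements and that jobs with less remaining work are preferred.

```python
from dataclasses import dataclass


@dataclass
class Task:
    job: str        # hypothetical job identifier
    demand: tuple   # resource needs as fractions of machine capacity, e.g. (cpu, memory)


def fits(demand, free):
    """A task fits only if every resource dimension is available."""
    return all(d <= f for d, f in zip(demand, free))


def alignment(demand, free):
    """Assumed packing score: dot product of task demand and free resources."""
    return sum(d * f for d, f in zip(demand, free))


def pick_task(tasks, free, remaining_work, weight=0.0):
    """Pick the fitting task with the best combined score.

    The score rewards good packing (alignment) and penalizes jobs with
    more remaining work. `weight` is an illustrative knob, not a
    parameter taken from the talk.
    """
    candidates = [t for t in tasks if fits(t.demand, free)]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda t: alignment(t.demand, free) - weight * remaining_work[t.job],
    )


free = (0.5, 0.5)  # free (cpu, memory) on one machine
tasks = [
    Task("long", (0.4, 0.4)),
    Task("wide", (0.6, 0.1)),   # needs more CPU than is free, so it never fits
    Task("short", (0.2, 0.1)),
]
remaining_work = {"long": 10.0, "wide": 5.0, "short": 1.0}

packing_only = pick_task(tasks, free, remaining_work)
with_job_preference = pick_task(tasks, free, remaining_work, weight=0.1)
```

With `weight=0.0` the sketch picks the task that fills the machine best; raising `weight` shifts the choice toward the job that is closest to finishing, which is the trade-off the abstract describes.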
Andy recommends you attend all Apache Committer Insights talks, but more importantly, register to attend Hadoop Summit!
Hear about the latest innovation within the Hadoop ecosystem from the community architecting and building Hadoop – the committers. These are the engineers and developers who lead the innovation in open source projects and can provide an insider’s perspective. This track presents technical deep dives across a wide range of Apache topics and projects.