Hadoop Summit San Jose is just around the corner. I am amazed at the depth and breadth of the technical sessions, and was looking at the Application Development track:
YARN has transformed Hadoop into a multi-tenant data platform. It is the foundation for a wide range of processing engines that empower businesses to interact with the same data in multiple ways simultaneously. This means applications can interact with the data in the most appropriate way: from batch to interactive SQL or low-latency access with NoSQL. You will have the opportunity to hear from the rock stars of the Hadoop community and learn how these innovators are building applications. You can then take that knowledge back to your own app projects.
The track committee, led by James Taylor (Architect, Salesforce), had their work cut out getting the submissions down to just 14. To see the top 3, or all 14, you need to register to attend Hadoop Summit San Jose. Here is what the committee chose as their top 3:
Speakers: Mithun Radhakrishnan and Josh Walters from Yahoo Inc.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use-case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo’s Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
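The tuning discussion in this abstract can be sketched with a few commonly adjusted Hive-on-Tez settings. The specific parameters and values below are illustrative assumptions, not the ones the speakers will present:

```sql
-- Illustrative Hive-on-Tez settings for ORC-backed analytics (assumed, not from the talk)
SET hive.execution.engine=tez;              -- run queries on Tez instead of MapReduce
SET hive.optimize.index.filter=true;        -- push predicates into ORC for stripe elimination
SET hive.exec.orc.split.strategy=HYBRID;    -- split-calculation strategy: BI, ETL, or HYBRID
SET hive.vectorized.execution.enabled=true; -- process rows in batches for faster scans
```

Which of these (and which values) actually pay off on a shared multi-tenant cluster is precisely the kind of question the talk promises to answer.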
Speakers: Ankit Singhal and Rajeshbabu Chintaguntla from Hortonworks
In this talk, we will present how HBase and Phoenix can become a company's data-warehouse appliance for fast interactive analytics. We will share some experience on how companies are currently using this appliance for their decision-support systems. We will present use cases that depend on being able to run hundreds of queries in parallel on fact tables of more than 100 billion rows, and yet expect really fast responses on the individual queries. We will take a high-level look at the internals of Phoenix and HBase and talk about a couple of the relevant pieces that make realizing those use cases possible. We will also go one level deeper and show that the architecture presented is viable in terms of the features that enterprises expect: low setup and maintenance cost, high availability and scalability, disaster recovery support, security support, and others. We will shed some light on how the Phoenix + HBase architecture fits in the data-warehouse space in terms of integration with other processing engines for ETL, such as MapReduce, Hive, Spark, and Pig.
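For readers unfamiliar with Phoenix: it exposes HBase tables through standard SQL over JDBC. A minimal sketch of the kind of aggregate query such a use case runs against a large fact table (the table and column names here are hypothetical):

```sql
-- Hypothetical Phoenix query against an HBase-backed fact table
SELECT region,
       COUNT(*)    AS order_cnt,
       SUM(amount) AS revenue
FROM sales_fact                           -- hypothetical 100B-row fact table
WHERE sale_date >= TO_DATE('2016-01-01')  -- Phoenix built-in TO_DATE
GROUP BY region;
```

Serving hundreds of such aggregates concurrently, with interactive latency, is the workload the talk describes.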
Speaker: Kelvin Chu from Uber
Many data-intensive computations are done on Spark at Uber. Some examples are applications in mapping, fraud detection, machine learning, and data science. While Spark is fast, scalable, and has a strong technology stack, the adoption curve can sometimes be steep for engineers and data scientists. To make creating and managing Spark jobs easy, we created the Spark Uber Development Kit (UDK). It is a set of APIs (for job monitoring, message logging, result dispersal, etc.) and tools (including a logs debugger, performance reporter, resource auditing, etc.). UDK helps job developers reduce the time to create and run Spark jobs from weeks to days. The tools make it easy to debug, monitor, and optimize Spark jobs.