Guest blog post from Eric Hanson, Principal Program Manager, Microsoft
Hadoop had a crazy and collaborative beginning as an OSS project, and that legacy continues. There have been over 1,200 contributors across 80 companies since its beginning. Microsoft has been contributing to Hadoop since October 2011, and we’re committed to giving back and keeping it open.
Our first wave of contributions, in collaboration with Hortonworks, has been to port Hadoop to Windows, to enable it both for our HDInsight service on Windows Azure and for on-premises Big Data installations on Windows. Now, we’re starting to contribute to the Stinger initiative to dramatically speed up Hive and make it more enterprise-ready.
Contribution to the core of Apache Hadoop through Stinger
Our main activity in Stinger right now is around Tez and vectorized query execution. One of our developers, Mike Liddell, has experience with DAG-based computations in Microsoft’s internal Dryad-LINQ effort, and has just joined Tez as a founding committer. I kick-started and helped guide our project to introduce columnstore data formats and vectorized (a.k.a. “batch mode”) query execution into SQL Server 2012. After moving to the SQL Server Big Data team, I’ve been collaborating with Hortonworks developers since late last fall on how to make Hive faster. We heard about the ORC project, led by Owen O’Malley of Hortonworks, to improve the RCFile columnstore format. I’ve had several productive design discussions with Owen about ORC, and we really like the way it’s shaping up.
Based on our experience, we knew that a great columnstore format is only one part of the story of making data warehouse-style queries run really fast. A good process and communication architecture is another part, and Tez is a great step there. A third is fast query execution (QE): research and field experience have shown that vectorized query execution can speed up queries on the order of 10X-100X.
Some people were saying that fast QE required a total rewrite in C++. I didn’t buy that, so I prototyped vectorized scan and filter operators in Java and shared them with Hortonworks. For simple conditions like column = constant, we’ve seen the ability to filter about 150 million rows per second on one thread in Java. We now have a two-company team introducing vectorized QE to Hive, consisting of two Hortonworks folks (Jitendra Pandey and Owen) and several Microsoft engineers. We’re going to take it in small steps, adding vectorized scans over ORC and basic filter operations first. Then we’ll move on to vectorized aggregates and joins.
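To give a feel for what a vectorized filter for a condition like column = constant can look like, here is a minimal sketch in Java. It is an illustration only, not the prototype code or the Hive implementation: the names LongBatch, selected, and filterEquals are made up for this example, and it assumes a single column of long values with no nulls. The idea is that the operator works on a whole batch of column values in one tight loop and records the surviving rows in a selection array, rather than evaluating the condition one row object at a time.

```java
import java.util.Arrays;

public class VectorizedFilterSketch {

    /** A batch of rows for one long column (hypothetical type, for illustration only). */
    static final class LongBatch {
        final long[] values;   // column values, one per row in the batch
        final int[] selected;  // indexes of the rows that have passed all filters so far
        int size;              // number of valid entries in 'selected'

        LongBatch(long[] values) {
            this.values = values;
            this.selected = new int[values.length];
            for (int i = 0; i < values.length; i++) {
                selected[i] = i; // initially every row is selected
            }
            this.size = values.length;
        }
    }

    /**
     * Keeps only the rows where column == constant.
     * The loop touches primitive arrays only: no per-row objects, no per-row method dispatch.
     * The selection array is compacted in place, which is safe because 'kept' never exceeds 'j'.
     */
    static void filterEquals(LongBatch batch, long constant) {
        final long[] values = batch.values;
        final int[] sel = batch.selected;
        int kept = 0;
        for (int j = 0; j < batch.size; j++) {
            int row = sel[j];
            if (values[row] == constant) {
                sel[kept++] = row;
            }
        }
        batch.size = kept;
    }

    public static void main(String[] args) {
        LongBatch batch = new LongBatch(new long[] {7, 3, 7, 9, 7, 1});
        filterEquals(batch, 7L);
        // prints "3 rows kept: [0, 2, 4]"
        System.out.println(batch.size + " rows kept: "
                + Arrays.toString(Arrays.copyOf(batch.selected, batch.size)));
    }
}
```

Because the filter only shrinks the selection array, a second filter can run over the same batch and will look only at the rows that survived the first one. A real operator would also need to handle nulls, other data types, and much larger batches.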
We think that the functional surface area of Hive, including its SQL query language, the open, extensible storage model over HDFS, and its easy programmer extensibility with Java UDFs, is quite compelling. It gives non-procedural access to Big Data, with the ability for programmers to create custom Java add-ins that let them do complex calculations more easily than they can with Map-Reduce programs. Hive also has a strong community of OSS developers and users. It works on ultra-scale clusters on data sets vastly bigger than total cluster memory. Stinger aims to boost the speed of Hive to complement its rich functionality in a way that users will love.
An active participant in the open community
We’ve been part of the OSS Big Data world for about a year and a half now. Through the combined efforts of the overall Hadoop community, Microsoft, and Hortonworks, Hadoop is now accessible on Windows Server and Windows Azure. We’ve gained so much from the community. Now we’re helping return the favor by contributing to Stinger, with our eye on 100X performance gains.