The Yahoo! Effect
Yahoo! has received much credit since Hadoop was donated to the Apache Software Foundation in 2006, and the real measure of its contributions is substantial. This post looks at Yahoo!’s contributions to Apache Hadoop and the impact they have had on making Apache Hadoop what it is today. I spent some time analyzing the data and share the details here.
Lines of Code Contributed to Apache Hadoop Trunk (Pre-Hortonworks)
The chart below shows the lines of code contributed to Apache Hadoop trunk through June 2011. From the project’s inception through June 2011, Yahoo! contributed more than 84% of the lines of code still in Apache Hadoop trunk.
Cumulative Lines of Code Contributed to Apache Hadoop Trunk Timeline through June 2011
One of the great things about open source is that contributions can come from a wide range of organizations. It’s great to see that the number of organizations contributing code to Apache Hadoop trunk has been steadily rising, especially in the past 18 months. Facebook, which made its initial contribution in April 2008, and Cloudera, which made its initial contributions in November 2008, together contributed more than 37,000 lines of code to Apache Hadoop trunk (nearly 8% of the total). Other organizations that have made significant contributions to trunk include Inria, Last.fm, Powerset and the University of California, Berkeley. It’s also good to see that LinkedIn has become a considerable contributor since March 2011.
Patches Contributed to Apache Hadoop Trunk (Pre-Hortonworks)
The chart below is similar to the previous chart except that it contains stats for patches contributed to trunk.
Cumulative Patches Contributed to Apache Hadoop Trunk Timeline through June 2011
While Yahoo! has consistently been the primary patch contributor (more than 72% of all patches to trunk), other organizations account for a larger share of patches than they do of lines of code. Facebook in particular has been a regular contributor of patches since mid-2007. Cloudera has also ramped up its number of patches during 2010 and 2011.
Yahoo! and Hortonworks in 2011
As our readers know, Hortonworks was formed by a number of key Yahoo! Hadoop Engineering team members, including architects and major contributors/committers to various Apache Hadoop projects and sub-projects. Some in the market continue to assume, mistakenly, that Yahoo! will no longer be a major contributor to Apache Hadoop moving forward. Nothing could be further from the truth. In fact, as the diagram below shows, even if you exclude all of the code contributions made by the former Yahoo! team now at Hortonworks, Yahoo! is still the largest contributor to Apache Hadoop.
Together, Yahoo! and Hortonworks have contributed a significant amount of code to Apache Hadoop trunk, much of it fueled by the NextGen MapReduce (YARN) recently added to trunk.
The Other Yahoo! Effect
In addition to contributing code and patches to Apache Hadoop, Yahoo! also contributed something that will be even more valuable over the coming years: people. As the original home of Apache Hadoop expertise, Yahoo! employed and trained the vast majority of the project’s early architects and committers. Over the past couple of years, some of these Hadoop experts have left to join other organizations that adopted Apache Hadoop as a critical component of their data architecture. These Yahoo! alumni are applying their Hadoop expertise to new challenges at their new employers while still contributing to Apache Hadoop. As a result, Apache Hadoop will continue to get better thanks to the wider set of companies contributing code and patches, each of which brings its own perspectives and ideas.
A note on methodology: this analysis is focused on Apache Hadoop trunk, which includes HDFS, MapReduce and Hadoop Common, along with the latest contributions from YARN (NextGen MapReduce). It’s also important to point out that the analysis counts contributions, not commits. I believe that contributions are a more accurate representation of the amount of time and effort put in by individual developers.
Apache projects recognize code contributions from individuals, not companies. Each patch’s list of authors is tracked in the CHANGES.txt file. For this analysis, I took the data from CHANGES.txt as of the end of June 2011 and combined it with the data in both JIRA, the issue tracking system, and Subversion, the source code control system. This produced a list of changes, each with its list of authors, the list of files that were changed and the date on which the patch was committed. By splitting the credit for each patch evenly among its authors, I arrived at a list of individual contributions.
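The even split of patch credit can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used for the analysis: the `split_patch_credit` function, the issue ids and the author names are all made up for the example, and exact fractions are used so that the per-author shares add back up to the number of patches.

```python
from collections import defaultdict
from fractions import Fraction


def split_patch_credit(patches):
    """Divide one unit of credit per patch evenly among that patch's authors.

    `patches` is an iterable of (issue_id, [author, ...]) pairs.
    Both the shape of the input and the names are hypothetical.
    """
    credit = defaultdict(Fraction)
    for _issue_id, authors in patches:
        # Drop repeated names while preserving order, so nobody is counted twice.
        unique_authors = list(dict.fromkeys(authors))
        if not unique_authors:
            continue  # no recorded author, nothing to credit
        share = Fraction(1, len(unique_authors))
        for author in unique_authors:
            credit[author] += share
    return dict(credit)


# Made-up sample data, not taken from CHANGES.txt.
sample = [
    ("EXAMPLE-1", ["alice"]),
    ("EXAMPLE-2", ["alice", "bob"]),
    ("EXAMPLE-3", ["alice", "bob", "carol"]),
]
credit = split_patch_credit(sample)
for author, amount in sorted(credit.items()):
    print(author, float(amount))
```

Because each patch hands out exactly one unit of credit, the shares across all authors sum to the number of patches, which makes a handy sanity check.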
When needed, I used LinkedIn and web searches to determine each contributor’s employer, broken down by date ranges, and then mapped the patch authors to their employers. Because not all patches are created equal, I counted the lines of code added or modified by each patch (discounting tests and contrib projects). I also ignored the big source-code-moving patches (HADOOP-1148, HADOOP-2885, HADOOP-2916 and HADOOP-4687).
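The employer mapping and line counting described above might look roughly like the following Python sketch. It is a reconstruction under stated assumptions, not the original tooling: the `EMPLOYMENT` table, the company names, the sample patches and the helper names are hypothetical; line counts are assumed to already exclude tests and contrib projects; and splitting a patch’s lines evenly among its authors is an assumption carried over from the even split of patch credit. The four excluded issue ids are the ones listed above.

```python
from collections import defaultdict
from datetime import date

# The big source-code-moving patches that the analysis ignores.
EXCLUDED = {"HADOOP-1148", "HADOOP-2885", "HADOOP-2916", "HADOOP-4687"}

# Hypothetical employment history: author -> [(start, end, employer)].
# `end` is exclusive; None means the author is still at that employer.
EMPLOYMENT = {
    "alice": [
        (date(2006, 1, 1), date(2009, 1, 1), "CompanyA"),
        (date(2009, 1, 1), None, "CompanyB"),
    ],
    "bob": [(date(2007, 6, 1), None, "CompanyA")],
}


def employer_on(author, day):
    """Return the author's employer on the day the patch was committed."""
    for start, end, employer in EMPLOYMENT.get(author, []):
        if start <= day and (end is None or day < end):
            return employer
    return "Unknown"


def lines_by_employer(patches):
    """Sum lines of code per employer.

    `patches` holds (issue_id, commit_date, authors, lines) tuples, where
    `lines` is assumed to already exclude tests and contrib projects.
    """
    totals = defaultdict(float)
    for issue_id, committed, authors, lines in patches:
        if issue_id in EXCLUDED or not authors:
            continue
        share = lines / len(authors)  # assumed even split among authors
        for author in authors:
            totals[employer_on(author, committed)] += share
    return dict(totals)


# Made-up sample patches; only the excluded issue id is real.
sample = [
    ("EXAMPLE-1", date(2008, 5, 1), ["alice"], 100),
    ("EXAMPLE-2", date(2010, 3, 1), ["alice", "bob"], 60),
    ("HADOOP-1148", date(2007, 3, 1), ["bob"], 50000),
]
totals = lines_by_employer(sample)
print(totals)
```

Looking up the employer by commit date is what lets the same person’s work count toward different organizations before and after a job change.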
— Owen O’Malley