Reality Check: Contributions to Apache Hadoop
Several weeks ago, Hortonworks published a blog post that highlighted the tremendous contributions that Yahoo has made to Hadoop over the years. The point was two-fold: 1) to pay homage to our former employer, and 2) to clarify that Yahoo will continue to be a major contributor to Hadoop.
Earlier this week, Cloudera responded to our post, calling it a misleading story. While we generally don’t comment on another vendor’s blogs, even if they assert things that we find questionable, we felt we had to respond to this one.
Underneath a lot of words, their claim was that Cloudera had made the most contributions to Apache Hadoop this year of any single organization.
While it is true that Cloudera has ramped up the number of patches they have contributed over the past few months, portraying patches as the leading indicator of an organization’s contributions to Apache Hadoop is misguided.
Why? Because patches differ widely in the time and effort they represent. A patch can be as simple as a spelling correction in a line of documentation or as complicated as a major new feature or subsystem. On average, beginners contribute small fixes, and a contributor’s patches grow in size as they gain experience and take on more complicated tasks.
We strongly believe that the number of lines of code contributed is a significantly more relevant metric. While fixing spelling errors is useful for the community, it is not nearly as important as adding major new features or fixing significant bugs.
Compare a one- or two-line patch with, say:
- HDFS Federation: Nearly 15,000 lines of code
- HDFS EditLogs Re-write: Nearly 10,000 lines of code (incidentally, led by a Cloudera engineer; credit where it’s due)
- NextGen MapReduce: Over 150,000 lines of code
When you consider that nearly 40% of all patches contributed to Apache Hadoop this year have contained fewer than 10 lines of code each, it’s easy to see how simply tracking the number of patches dramatically distorts the true picture.
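To make the difference between the two metrics concrete, here is a minimal Python sketch. It is only an illustration: the organization names (OrgA, OrgB), the patch sizes, and the function names are all hypothetical and are not taken from the Apache Hadoop history.

```python
from collections import Counter

# Hypothetical patches as (organization, lines_changed) pairs.
# These numbers are invented for illustration; they are NOT real Hadoop data.
patches = [
    ("OrgA", 2),
    ("OrgA", 1),
    ("OrgA", 5),
    ("OrgA", 3),
    ("OrgB", 1500),
    ("OrgB", 40),
]


def share_by_patch_count(patches):
    """Fraction of all patches submitted by each organization."""
    counts = Counter(org for org, _ in patches)
    total = sum(counts.values())
    return {org: n / total for org, n in counts.items()}


def share_by_lines(patches):
    """Fraction of all contributed lines of code from each organization."""
    lines = Counter()
    for org, n in patches:
        lines[org] += n
    total = sum(lines.values())
    return {org: n / total for org, n in lines.items()}


def small_patch_fraction(patches, threshold=10):
    """Fraction of patches containing fewer than `threshold` lines."""
    small = sum(1 for _, n in patches if n < threshold)
    return small / len(patches)


print(share_by_patch_count(patches))
print(share_by_lines(patches))
print(small_patch_fraction(patches))
```

In this made-up data set, OrgA submits two thirds of the patches but well under 1% of the lines of code, which is the kind of distortion that counting patches alone can produce.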
The simple fact remains that Hortonworks and Yahoo! together have contributed more than 80% of the lines of code in Apache Hadoop trunk. This number, as Owen described in his methodology, attributes code contributions to the organization that employed the developer at the time of the contribution. This seems only fair, since organizations that fund the development of Apache Hadoop by supporting their employees in contributing code to Apache should get the credit they deserve.
Here is an updated chart that shows total lines of code contributed to Apache Hadoop trunk since 2006, based upon the organization that employed each contributor at the time of the contribution:
Cloudera stated that the credit should be given to the current employer of that developer, regardless of the investment made by the previous employer. We agree that individuals, not corporations, contribute to open-source projects. However, we disagree that one should simply ignore the investments made by organizations to build Apache Hadoop. It seems only fair to count work done at Yahoo as, well, work done at Yahoo.
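The two attribution rules discussed above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the actual analysis behind the charts: the developers, employers, dates, and line counts are hypothetical, as are the helper names `employer_at`, `current_employer`, and `lines_by_org`.

```python
from collections import defaultdict
from datetime import date

# Hypothetical employment history: developer -> list of (start_date, employer),
# sorted by start date. Invented for illustration only.
employment = {
    "dev1": [(date(2006, 1, 1), "OrgA"), (date(2011, 7, 1), "OrgB")],
    "dev2": [(date(2009, 1, 1), "OrgC")],
}

# Hypothetical contributions: (developer, commit_date, lines_of_code).
contributions = [
    ("dev1", date(2008, 3, 1), 1000),
    ("dev1", date(2011, 9, 1), 200),
    ("dev2", date(2010, 5, 1), 300),
]


def employer_at(dev, when):
    """Employer of `dev` on the date `when` (employer-at-time-of-contribution rule)."""
    employer = None
    for start, org in employment[dev]:
        if start <= when:
            employer = org
    return employer


def current_employer(dev):
    """Most recent employer of `dev` (current-employer rule)."""
    return employment[dev][-1][1]


def lines_by_org(contributions, attribute):
    """Sum lines of code per organization using the given attribution function."""
    totals = defaultdict(int)
    for dev, when, lines in contributions:
        totals[attribute(dev, when)] += lines
    return dict(totals)


at_time = lines_by_org(contributions, employer_at)
by_current = lines_by_org(contributions, lambda dev, when: current_employer(dev))
print(at_time)
print(by_current)
```

In this invented example, the 1,000 lines that dev1 wrote while at OrgA are credited to OrgA under the first rule and to OrgB under the second, which is exactly the point of disagreement between the two methodologies.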
Regardless, using Cloudera’s methodology, we calculated the lines of code contributed to Apache Hadoop since 2006 and here are the results:
This graph is particularly interesting in that it shows where the most active Apache Hadoop contributors are now employed. We are proud that former colleagues have gone to organizations such as Facebook, LinkedIn, eBay, etc. to spread their knowledge and experience. Note that this methodology actually benefits Hortonworks and other organizations, but doesn’t meaningfully change Cloudera’s contribution.
Again, while it is a net positive for the ecosystem that the talent (much of it Yahoo! talent) is being shared across a wider section of organizations, Hortonworks and Yahoo! still employ the developers that have contributed the majority of code to Apache Hadoop.
If you look only at 2011 data using Cloudera’s methodology, the general picture remains the same:
As you can see, Hortonworks and Yahoo! are the two leading contributors to Apache Hadoop, contributing nearly 68% of the lines of code so far this year. As Owen highlighted in his blog and as I have said many times, it is great to see other organizations contribute to Apache Hadoop at a growing rate. I agree this is indicative of a healthy and growing ecosystem.
Lastly, here are a couple of diagrams that show both lines of code contributed and patches contributed, assigning credit to each contributor’s employer at the time of the contribution. The first one shows the totals since the beginning of 2006 to give some historical perspective:
The second one shows lines of code and patches contributed during 2011 alone:
As Cloudera pointed out in their blog, they have been increasing the volume of patches contributed to Apache Hadoop and we applaud them for doing so. However, the big part of the story, which they omit, is that Hortonworks and Yahoo! continue to contribute the majority of code to the Apache Hadoop projects. This is an absolute fact.
Let me also point out that our analysis focuses on Apache Hadoop core, namely Common, HDFS and MapReduce, simply because they are the core. Every distribution includes two or more of these projects and we wouldn’t have Hadoop without them.
Yahoo! contributed a number of other projects including ZooKeeper, Pig, HCatalog, Avro and Ambari. Both Yahoo! and Hortonworks have deep expertise and continue to contribute code for these and other projects. Other organizations have leading expertise in other projects, such as Facebook with Hive, and Facebook, StumbleUpon and Trend Micro with HBase. Cloudera also has expertise in the projects it recently submitted to Apache, including Bigtop, Sqoop, Whirr and Flume. There are also a number of other projects in the ecosystem that could have been included in an analysis, including Azkaban, Cascading, Cassandra, Giraph, Hama, Hypertable, Kafka, JAQL, Mahout, Mesos, OpenMPI, R, Spark, and Thrift, just to name a few. Adding an arbitrary selection of additional projects to the analysis of the Hadoop project is simply changing the subject.
Let me again state that Hortonworks is absolutely dedicated to making Apache Hadoop better and accelerating the development and adoption of Hadoop across the world. We are excited to see increasing contributions from a growing number of individuals and organizations. We love working with the Apache Hadoop community and we’ve been doing it for nearly six years.
We believe strongly that the Apache Software Foundation’s ownership of Hadoop is one of Hadoop’s greatest strengths. See my recent blog post on the subject. We don’t own the code and we are proud of that. We contribute 100% of our code to Apache. We don’t hold back any proprietary software. All of our code goes back to Apache to make Apache better for everyone. We believe this is the fastest and best way to create a vibrant Hadoop community and ecosystem.
Lastly, we believe strongly that we have the deepest Apache Hadoop domain knowledge of any organization and are well placed to assist enterprises and technology vendors with Apache Hadoop.