Reality Check: Contributions to Apache Hadoop

Several weeks ago, Hortonworks published a blog post that highlighted the tremendous contributions that Yahoo has made to Hadoop over the years. The point was two-fold: 1) to pay homage to our former employer, and 2) to clarify that Yahoo will continue to be a major contributor to Hadoop.

Earlier this week, Cloudera responded to our post, calling it a misleading story. While we generally don’t comment on other vendors’ blog posts, even when they assert things we find questionable, we felt we had to respond to this one.

Underneath a lot of words, their claim was that Cloudera had made the most contributions to Apache Hadoop this year of any single organization.

While it is true that Cloudera has ramped up the number of patches they have contributed over the past few months, portraying patches as the leading indicator of an organization’s contributions to Apache Hadoop is misguided.

Why? Because patches differ widely in the time and effort they represent. The average patch created by a contributor increases in size as they gain experience and begin to work on more complicated tasks. A patch can be as complicated as a major new feature or subsystem, or as simple as a spelling correction in a line of documentation. On average, beginners contribute small fixes and experts contribute larger patches that require much more work.

We strongly believe that lines of code contributed is a significantly more relevant metric. While addressing spelling errors is useful for the community, it is not nearly as important as adding major new features or fixing significant bugs.

Compare a one- or two-line patch with, say:

  • HDFS Federation: Nearly 15,000 lines of code
  • HDFS EditLogs Re-write: Nearly 10,000 lines of code (incidentally, led by a Cloudera engineer; credit where it’s due)
  • NextGen MapReduce: over 150,000 lines of code

When you consider that nearly 40% of all patches contributed to Apache Hadoop this year have contained less than 10 lines of code each, it’s easy to see how simply tracking the number of patches dramatically distorts the true picture.
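The patch-size skew described above is easy to measure directly from a checkout of the repository. The sketch below is not the analysis behind the post’s figures; it is a minimal illustration, assuming the output format of `git log --pretty=format:%H --numstat`, of how one might compute the fraction of commits that touch fewer than 10 lines.

```python
def patch_sizes(log_text):
    """Parse `git log --pretty=format:%H --numstat` output into a
    {commit_hash: total_lines_changed} map."""
    sizes = {}
    current = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) == 3 and current is not None:
            added, deleted, _path = parts
            # binary files report "-" for added/deleted; count them as 0 lines
            sizes[current] += int(added) if added.isdigit() else 0
            sizes[current] += int(deleted) if deleted.isdigit() else 0
        else:
            # with --pretty=format:%H, any non-numstat line is a commit hash
            current = line.strip()
            sizes[current] = 0
    return sizes

def small_patch_fraction(sizes, threshold=10):
    """Fraction of commits that changed fewer than `threshold` lines."""
    if not sizes:
        return 0.0
    small = sum(1 for n in sizes.values() if n < threshold)
    return small / len(sizes)
```

Feeding this the log of Apache Hadoop trunk (e.g. `git log --pretty=format:%H --numstat | python count.py`) would reproduce a patch-size distribution of the kind cited above; the 10-line threshold is the one the post uses.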

The simple fact remains that Hortonworks and Yahoo! together have contributed more than 80% of the lines of code in Apache Hadoop trunk. This number, as Owen described in his methodology, attributed code contributions to the organization that employed the developer at the time of the contribution. This only seems fair since organizations that are funding the development of Apache Hadoop by supporting their employees contributing code to Apache should get the credit they deserve.
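The attribution rule just described, credit each commit’s lines to whichever organization employed the author on the commit date, can be sketched in a few lines. This is an illustration of the idea only, not Owen’s actual methodology or data; the employment records below are entirely hypothetical.

```python
from datetime import date

# (author, employer, start, end) — hypothetical employment records,
# invented here purely to illustrate the attribution rule
EMPLOYMENT = [
    ("alice", "Yahoo!",      date(2006, 1, 1), date(2011, 6, 30)),
    ("alice", "Hortonworks", date(2011, 7, 1), date(9999, 1, 1)),
    ("bob",   "Cloudera",    date(2009, 1, 1), date(9999, 1, 1)),
]

def employer_at(author, when):
    """Return the organization employing `author` on date `when`."""
    for name, org, start, end in EMPLOYMENT:
        if name == author and start <= when <= end:
            return org
    return "Unknown"

def lines_by_org(commits):
    """commits: iterable of (author, commit_date, lines_changed).
    Credits each commit's lines to the author's employer at commit time."""
    totals = {}
    for author, when, lines in commits:
        org = employer_at(author, when)
        totals[org] = totals.get(org, 0) + lines
    return totals
```

Under this rule, lines a developer wrote while at Yahoo! stay credited to Yahoo! even after that developer moves on, which is exactly the point of contention with Cloudera’s current-employer methodology.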

Here is an updated chart that shows total lines of code contributed to Apache Hadoop trunk since 2006, based upon the organization that employed each contributor at the time of the contribution:

[Chart: total lines of code contributed to Apache Hadoop trunk since 2006, credited to each contributor’s employer at the time of contribution]

Cloudera stated that credit should be given to the current employer of each developer, regardless of the investment made by the previous employer. We agree that individuals, not corporations, contribute to open-source projects, but we disagree that one should simply ignore the investments organizations have made to build Apache Hadoop. It seems only fair to count work done at Yahoo as, well, work done at Yahoo.

Regardless, using Cloudera’s methodology, we calculated the lines of code contributed to Apache Hadoop since 2006 and here are the results:

[Chart: lines of code contributed to Apache Hadoop trunk since 2006, credited to each contributor’s current employer (Cloudera’s methodology)]

This graph is particularly interesting in that it shows where the most active Apache Hadoop contributors are now employed. We are proud that former colleagues have gone on to organizations such as Facebook, LinkedIn, and eBay to spread their knowledge and experience. Note that this methodology actually benefits Hortonworks and other organizations, but doesn’t meaningfully change Cloudera’s contribution.

Again, while it is a net positive for the ecosystem that the talent (much of it Yahoo! talent) is being shared across a wider section of organizations, Hortonworks and Yahoo! still employ the developers that have contributed the majority of code to Apache Hadoop.

If you look only at 2011 data using Cloudera’s methodology, the general picture remains the same:

[Chart: lines of code contributed to Apache Hadoop during 2011, using Cloudera’s methodology]

As you can see, Hortonworks and Yahoo! are the two leading contributors to Apache Hadoop, contributing nearly 68% of the lines of code so far this year. As Owen highlighted in his blog and as I have said many times, it is great to see other organizations contribute to Apache Hadoop at a growing rate. I agree this is indicative of a healthy and growing ecosystem.

Finally, here are a couple of diagrams that show both lines of code contributed and patches contributed, assigning credit to the contributor’s employer at the time of their contribution. The first one shows the totals since the beginning of 2006 to give some historical perspective:

[Chart: lines of code and patches contributed to Apache Hadoop since 2006, by employer at time of contribution]

The second one shows lines of code and patches contributed during 2011 alone:

[Chart: lines of code and patches contributed to Apache Hadoop during 2011, by employer at time of contribution]

As Cloudera pointed out in their blog, they have been increasing the volume of patches contributed to Apache Hadoop and we applaud them for doing so. However, the big part of the story, which they omit, is that Hortonworks and Yahoo! continue to contribute the majority of code to the Apache Hadoop projects. This is an absolute fact.

Let me also point out that our analysis focuses on Apache Hadoop core, namely Common, HDFS, and MapReduce, simply because they are the core. Every distribution includes two or more of these projects, and we wouldn’t have Hadoop without them.

Yahoo! contributed a number of other projects, including ZooKeeper, Pig, HCatalog, Avro, and Ambari. Both Yahoo! and Hortonworks have deep expertise in these and other projects and continue to contribute code to them. Other organizations have leading expertise elsewhere, such as Facebook with Hive, and Facebook, StumbleUpon, and TrendMicro with HBase. Cloudera also has expertise in the projects it recently submitted to Apache, including Bigtop, Sqoop, Whirr, and Flume. There are a number of further projects in the ecosystem that could have been included in an analysis, including Azkaban, Cascading, Cassandra, Giraph, Hama, Hypertable, Kafka, JAQL, Mahout, Mesos, OpenMPI, R, Spark, and Thrift, just to name a few. Adding an arbitrary selection of additional projects to the analysis of the Hadoop project is simply changing the subject.

Summary

Let me again state that Hortonworks is absolutely dedicated to making Apache Hadoop better and accelerating the development and adoption of Hadoop across the world. We are excited to see increasing contributions from a growing number of individuals and organizations. We love working with the Apache Hadoop community and we’ve been doing it for nearly 6 years.

We believe strongly that the Apache Software Foundation’s ownership of Hadoop is one of Hadoop’s greatest strengths. See my recent blog post on the subject. We don’t own the code and we are proud of that. We contribute 100% of our code to Apache. We don’t hold back any proprietary software. All of our code goes back to Apache to make Apache better for everyone. We believe this is the fastest and best way to create a vibrant Hadoop community and ecosystem.

Lastly, we believe strongly that we have the deepest Apache Hadoop domain knowledge of any organization and are well placed to assist enterprises and technology vendors with Apache Hadoop.

–E14

@jeric14, @hortonworks


Comments

Nilam | March 27, 2012 at 10:42 pm

I want to contribute to Hadoop by identifying changesets for a backup application. Can you tell me how I can contribute?

    Nilam | March 28, 2012 at 11:25 pm

    I am done with the library; now can you tell me how it can be added to the Hadoop package?

Cos | October 7, 2011 at 5:52 am

BTW, BigTop should be counted 50/50 between Yahoo! and Cloudera, because the whole concept was developed and proven by two former Yahoo! employees who went to Cloudera and repeated the work there. Just for the sake of correctness.

