Innovations and Contributions: Apache Hadoop

Raising an elephant with people, ideas and code

The Apache Software Foundation (ASF) provides valuable stewardship and guide-rails for projects that want to attract the broadest possible community of involvement, especially across a wide range of vendors and end users. While the ASF’s role is not to guarantee wild success for every project, it does a great job of providing a place where the broadest community of people, ideas, and code can come together and raise an elephant, so to speak.

These sentiments were expressed quite nicely in an article by Andy Oliver:

Hadoop is everything an Apache project should be: a community of rival companies, an increasing activity level, and an increasing number of committers…This is Apache at its very finest. It will be messy and there will be kerfuffles, but how else and where else could this happen? Where else could Hadoop be both open source and inaugurate the next stage of the InterWebs? In some ways Hadoop is in fact the successor to the Apache Web Server — or maybe the realization of what it started.

The Absolute Transparency Of It All

One of the remarkable aspects of open source development under the stewardship of the ASF is the absolute transparency of it all.  Want to know how many lines of code have been contributed to a project? Or which committers are contributing to a particular project?  It’s all there…in the open.

With that in mind, a recent blog post from a Hadoop contributor in Japan (thank you @ajis_ka) caught our eye. The author wrote a simple query and produced a handful of graphs identifying some interesting trends in development within the Apache Hadoop project.

The post drew a couple of key conclusions, neither of them surprising:

  1. The pace of contributions to Apache Hadoop from 2011 through 2013 remains healthy and vibrant.
  2. While a diverse group is contributing, Hortonworks engineers continue to deliver a significant share (which is not a surprise since we are maniacally focused on driving innovation in the open).

Let’s Take a Look at the Numbers

Figure 2 below illustrates how the YARN-based architecture of Hadoop 2 worked its way into the project from 2011 through 2013. YARN underpins Hadoop 2’s fundamentally new architecture for supporting workloads that span batch, interactive, and real-time use cases. While the number of changes to the Hadoop codebase stabilized in 2013, Figure 3 shows that significant net-new features were added and completed in 2013 (such as HDFS Snapshots and High Availability), which explains the increase of 260,000+ lines of code versus prior years and demonstrates the ongoing innovation happening within the project.

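[Figure 2 and Figure 3: charts from the blog post showing changes to the Hadoop codebase, 2011 through 2013]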

What’s even more interesting than the absolute lines of code contributed is the increase in the diversity of organizations contributing to Apache Hadoop between 2012 and 2013, as illustrated in the next two charts from the blog post.

[Charts from the blog post: organizations contributing to Apache Hadoop, 2012 and 2013]

The diversification of contributions between 2012 and 2013 illustrates that the framework for community collaboration provided by the Apache Software Foundation is working quite well for the Apache Hadoop project, as the Andy Oliver article quoted above implied.

We are obviously proud of our contributions to the Apache Hadoop project and we plan to continue to catalyze and energize this amazing community with your help.

And thanks again to @ajis_ka for sharing the analysis.

