Did EMC Just Say Fork You To The Hadoop Community?

 

In Derrick Harris’ GigaOM article, “EMC to Hadoop competition: See ya, wouldn’t wanna be ya,” EMC unveiled their new Pivotal HD offering, which effectively re-architects the Greenplum analytic database so that it sits on top of the Hadoop Distributed File System (HDFS). Scott Yara, Greenplum cofounder, is excited about the new product. Since a key focus for us at Hortonworks is to deeply integrate Hadoop with other data systems (a la our efforts with Teradata, Microsoft, MarkLogic, and others), I’m always excited to see data system providers like Greenplum decide to store their data natively in HDFS. And I can’t argue with Scott Yara’s sentiment that “I do think the center of gravity will move toward HDFS.”

Putting HDFS under a proprietary database does not make it Hadoop, however.

All in on Hadoop?

Glancing at the Pivotal HD diagram in the GigaOM article, you can easily distinguish the EMC proprietary components in blue from the Apache Hadoop-related components in green. And Scott Yara claims: “We literally have over 300 engineers working on our Hadoop platform.”

Wow, that’s a lot of engineers focusing on Hadoop! Since Scott Yara admitted that “We’re all in on Hadoop, period,” a large number of those engineers must be working on the open source Apache Hadoop-related projects labeled in green in the diagram, right?

So a simple question is worth asking: How many of those 300 engineers are actually committers* to the open source projects Apache Hadoop, Apache Hive, Apache Pig, and Apache HBase?

John Furrier actually asked this question on Twitter and got a reply from Donald Miner from the Greenplum team.

Since I agree with John Furrier that the number of committers is kinda relevant to Scott Yara’s claim, I did a quick scan through the committers pages for Hadoop, Hive, Pig and HBase to seek out the large number of EMC engineers spending their time improving these open source projects. Hmmm… my quick scan yielded a curious absence of EMC engineers directly contributing to these Apache projects. Oh well, I guess the vast majority of those 300 engineers are working on the EMC proprietary technology in the blue boxes.
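For the curious, here is a minimal sketch of what that quick scan looks like in Python. The roster URLs are my assumptions about where each project published its committer page around this time (those pages do move), and the check is a crude substring match rather than anything authoritative:

```python
# Quick-and-dirty scan of Apache project committer pages for a company
# name. URLs are assumptions and may have moved; results are only as
# good as what each project publishes on its roster page.
from urllib.request import urlopen

COMMITTER_PAGES = {
    "Hadoop": "https://hadoop.apache.org/who.html",
    "Hive":   "https://hive.apache.org/people.html",
    "Pig":    "https://pig.apache.org/whoweare.html",
    "HBase":  "https://hbase.apache.org/team-list.html",
}

def scan(needle="emc"):
    for project, url in COMMITTER_PAGES.items():
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError as err:
            print(f"{project}: fetch failed ({err})")
            continue
        hits = page.lower().count(needle.lower())
        print(f"{project}: {hits} match(es) for '{needle}'")

if __name__ == "__main__":
    scan()
```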

Why Do Committers Matter?

Simply put: Just because you can read Moby-Dick doesn’t mean you’re talented enough to have authored it.

Committers matter because they are the talented authors who devote their time and energy to working within the Apache Software Foundation community: adding features, fixing bugs, and reviewing and approving changes submitted by other committers. At Hortonworks, we have over 50 committers across the various Hadoop-related projects, authoring code and working with the community to make their projects better.

This is simply how the community-driven open source model works. And believe it or not, you actually have to be in the community before you can claim you are leading the community and authoring the code!

So when EMC says they are “all-in on Hadoop” but have nary a committer in sight, that must mean they are “all-in for harvesting the work done by others in the Hadoop community.” Kind of a neat marketing trick, don’t you think?

Scott Yara effectively says that it would take about $50 to $100 million and 300 engineers to do what they’ve done. Sounds expensive, hard, and untouchable, doesn’t it? Well, let’s take a closer look at the Apache Hadoop community in comparison. Over the lifetime of just the Apache Hadoop project, more than 1200 people across more than 80 different companies and entities have contributed code to Hadoop. Mr. Yara, I’ll see your 300 and raise you a community!
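If you want to sanity-check numbers like these yourself, one rough approach is to count distinct author addresses in a local clone of the Apache Hadoop sources. Here’s a sketch, assuming you have git and a clone named hadoop in the working directory; note that this undercounts, since many patches were committed by a committer on the contributor’s behalf and credited only in CHANGES.txt:

```python
# Back-of-the-envelope contributor count for a local clone of the
# Apache Hadoop sources. Undercounts the true number of contributors:
# patches were often committed on a contributor's behalf and credited
# only in CHANGES.txt, not in the git author field.
import subprocess
from collections import Counter

def contributor_stats(repo_path="hadoop", top=10):
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    emails = {line.strip().lower() for line in log.splitlines() if line.strip()}
    domains = Counter(addr.rsplit("@", 1)[-1] for addr in emails)
    print(f"{len(emails)} distinct author addresses")
    for domain, count in domains.most_common(top):
        print(f"  {domain}: {count}")

if __name__ == "__main__":
    contributor_stats()
```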

Are You Forking With Me?

So, assuming EMC has few or no committers on the relevant Apache open source projects, one can only assume their strategy is to fork the Hadoop-related code and maintain their own proprietary version. If they are not actively authoring code within the community, how else are they able to add important new features or fix critical bugs for enterprise customers?

Looking closely at the Pivotal HD diagram, I also wonder why the box at the foundation lists “HDFS or Isilon OneFS.” Doesn’t that just make you wonder how committed to HDFS EMC actually is? And how long will it take them to start throwing HDFS under the bus from a marketing perspective so they can sell more Isilon? They have to pay for those 300 expensive engineers somehow, right?

And Are They Forking EMC Customers and MapR Technologies While They’re At It?

For EMC customers, there is another important tidbit to note in the GigaOM article:

“Yara said Greenplum had known for a while that Hadoop was the key to any big data strategy going forward, but that it would take some time to build up its own technology. So, in 2011, it entered into a reseller agreement with Hadoop startup MapR to offer a premium product to appease enterprise customers while Greenplum’s engineers got to work on what would become Pivotal HD. That deal with MapR is still in place, but it’s no longer the focal point of Greenplum’s Hadoop strategy.”

Yep, as confusing as it sounds, EMC had two Hadoop-like offerings: Greenplum HD and Greenplum MR. Pivotal HD appears to be a reswizzled rendition of Greenplum HD (with magical Hawq dust sprinkled on top). And if you were one of those enterprise customers whom EMC “appeased” into buying Greenplum MR (with the OEM’d MapR distribution inside), then you’re either being abandoned and kicked to the curb or being presented with a fork in the road.

Either way, you are faced with a choice: do you ride out EMC’s changing course yet again, or do you look for a safer harbor elsewhere…

Choose Community Driven Open Source and Avoid Proprietary Lock-in

At Hortonworks, we believe in the relentless march of community driven open source as the fastest path to innovation and adoption of Apache Hadoop. We believe the most effective path is to do our work within the open source community, introduce enterprise feature requirements into that community, and work diligently to advance existing open source projects and incubate new ones to meet those needs. I encourage you to read more about our approach.

We also believe that community driven open source offers the safest path forward since you’re not locked into the whims of a single vendor.

At Hortonworks, when we say “we are ALL IN on Hadoop”, we actually mean it!

And while my post may sound a little harsh, it’s important to note that we’d love to see EMC engineers, and anyone else for that matter, participate in the Apache community and make real contributions.  After all, at the end of the day, community rules!

 

*NOTE: A committer is someone who has “earned their stripes” within the Apache community and has the ability to commit code directly to the corresponding Apache project source code tree. The Apache Hive project has a wiki page that provides a nice explanation of how this process works.


Comments

Alex | March 19, 2013 at 11:19 am

What about the OpenChorus project? I’m not sure “ALL IN on Hadoop” has to be read as “ALL IN on the Apache Hadoop project.”

http://www.greenplum.com/communities/developer/openchorus

February 26, 2013 at 11:33 am

“Over the lifetime of just the Apache Hadoop project, there have been over 1200 people across more than 80 different companies or entities who have contributed code to Hadoop. Mr. Yara, I’ll see your 300 and raise you a community!!!”

That is what I thought, too. But Ohloh says something else: 33 contributors (https://www.ohloh.net/p/Hadoop). Are they wrong, or am I missing something?

