February 02, 2016

Ten Years of Herding Elephants

It was 10 years ago today (Feb 2) that my first patch went into the code that, two days later, became Hadoop.

I had been working on Yahoo Search's WebMap, the back end that analyzed the web for the search engine. We had been building a C++ implementation of GFS and MapReduce, but after hiring Doug Cutting we decided it would be easier to get Yahoo's permission to contribute to code that was already open source than to open source our C++ project.

Last week, I did some software archaeology and checked out the code that Doug Cutting, Mike Cafarella and I (via my small patch!) wrote. I'd like to encourage you to check it out to see how far Hadoop has come over the years. To make it easy, I created a Docker image that lets you play with that early version of Nutch DFS and MapReduce. I also backported the WordCount example that I wrote for Hadoop so that it runs against Nutch and included it in the Docker container.
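For readers who haven't run WordCount before, the essence of the example is tiny. Here is a plain-Java sketch of the two phases it performs; this is just the logic, not the Hadoop or Nutch API, and the class and method names are mine for illustration:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Reduce" phase: sum the emitted counts for each word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"hello hadoop", "hello world"}) {
            pairs.addAll(map(line));
        }
        // counts: hello=2, hadoop=1, world=1
        System.out.println(reduce(pairs));
    }
}
```

The real framework's value is everything around this logic: splitting the input across machines, shuffling the map output to the reduces, and writing the results back to the distributed filesystem.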

Some fun points to notice:

  • All of the primary servers are there: NameNode, DataNode, JobTracker, and TaskTracker.

  • There is no tracking of users or creation/submission times.

  • The JobTracker has a Web UI, but it is really primitive. The NameNode doesn’t have a UI at all.

  • The MapReduce job and task names are all randomly generated.

  • There isn’t a Secondary NameNode, so you need to restart your NameNode every couple of days to compact the edit log.

  • The reduces poll each TaskTracker at random, asking whether it has each specific map output.

  • Rather than programmatically submitting jobs, the developer was expected to create an XML file that described their job.

  • There was no support for retrying failed MapReduce tasks. Any failed task killed the entire job.
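For flavor, that job-description XML would have been a file of name/value properties, in the same style that survives in Hadoop's configuration files today. The sketch below is only illustrative; the property names are hypothetical, not the actual 2006 Nutch schema:

```xml
<?xml version="1.0"?>
<!-- Hypothetical job description. The property names here are
     illustrative stand-ins, not the real 2006 Nutch MapReduce schema. -->
<configuration>
  <property>
    <name>mapred.input.dir</name>
    <value>input</value>
  </property>
  <property>
    <name>mapred.output.dir</name>
    <value>output</value>
  </property>
  <property>
    <name>mapred.mapper.class</name>
    <value>WordCount$MapClass</value>
  </property>
  <property>
    <name>mapred.reducer.class</name>
    <value>WordCount$Reduce</value>
  </property>
</configuration>
```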

It has been an amazing 10-year journey taking Hadoop from a small, unknown project to the world's big data platform. Another measure of how far we've come: NDFS had 5kloc and MapReduce had 6kloc back in February 2006. Compare that to the 300kloc added to the Hadoop project in 2015 alone.

In honor of Hadoop's 10th birthday, be sure to attend one of our two 10 Year Anniversary Parties.

