NextGen MapReduce Hits Apache Hadoop Mainline

We are very excited to announce that NextGen Apache Hadoop MapReduce is getting close. We just merged the code base into Apache Hadoop mainline, and Arun is about to create a hadoop-0.23 branch to prepare for a release!

We’ve talked about NextGen Apache Hadoop MapReduce and its advantages. The drawbacks of current Apache Hadoop MapReduce are both old and well understood. The proposed architecture has been public for over 3 years now. The team started work in August 2010 with a prototype, on which we iterated rapidly. This culminated in an initial check-in to Apache Hadoop SVN in March 2011. Since then we’ve done all development on the MR-279 branch in Apache and have run really hard to get NextGen Hadoop MapReduce ready. We hope to see it soon on *your* cluster.

Some fun stats:

  • NextGen MapReduce has nearly 100,000 lines of code (roughly – just the *.java files). That’s nearly 1/3 of the current Apache Hadoop codebase, added in the last 12 months!
  • We have spent nearly 100 man-months of effort developing and testing this so far. At one point we had over 10 full-time Yahoo/Hortonworks employees working on it.
  • We are excited to be receiving patches from lots of members in the community. We have received/committed patches from members in at least 3 time zones.
  • We know of at least 2 efforts to port non-MapReduce frameworks to work on NextGen Hadoop!

How to contribute

Now, this is just the beginning. There is still much to do. Making the MapReduce framework production quality is the top priority but implementing/porting alternative computing frameworks will excite some contributors as well. To help that cause, I am pasting the new source code directory structure here:

  • trunk/
    • hadoop-mapreduce (was mapreduce before)
  • trunk/hadoop-mapreduce – Classic code. JobTracker/TaskTracker
    • build.xml
    • src
  • trunk/hadoop-mapreduce/ – New code related to YARN
    • assembly
    • pom.xml
    • hadoop-mr-client
    • hadoop-yarn – Yarn APIs, libraries, and server code
      • hadoop-yarn-api
      • hadoop-yarn-common
      • hadoop-yarn-server – Server code: ResourceManager, NodeManager
  • trunk/hadoop-mapreduce/hadoop-yarn/hadoop-yarn-server/ – server libraries and tests.
    • hadoop-yarn-server-common
    • hadoop-yarn-server-nodemanager
    • hadoop-yarn-server-resourcemanager
    • hadoop-yarn-server-tests
  • trunk/hadoop-mapreduce/hadoop-mr-client – MapReduce server and client code
    • hadoop-mapreduce-client-app – The MapReduce ApplicationMaster
    • hadoop-mapreduce-client-core
    • hadoop-mapreduce-client-jobclient – MR JobClient
    • hadoop-mapreduce-client-common
    • hadoop-mapreduce-client-hs – History-server for MR jobs
    • hadoop-mapreduce-client-shuffle – Shuffle code in MR
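
If you want to get oriented in a fresh checkout, one rough way is to list every directory that carries its own pom.xml, since those are the Maven modules. The following is only a sketch in Python and not part of the Hadoop build: the function names find_maven_modules and build_demo_tree are made up for illustration, the demo tree is a tiny mock of the listing above, and which directories actually contain a pom.xml in trunk is an assumption you should check against your own checkout.

```python
import os
import tempfile


def find_maven_modules(root):
    """Return sorted paths (relative to root) of directories that contain a pom.xml."""
    modules = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control metadata and build output so we do not descend into them.
        dirnames[:] = [d for d in dirnames if d not in {".svn", ".git", "target"}]
        if "pom.xml" in filenames:
            modules.append(os.path.relpath(dirpath, root))
    return sorted(modules)


def build_demo_tree(root):
    """Create a small mock layout (hypothetical, loosely mirroring the listing above)."""
    with_pom = [
        os.path.join("hadoop-mapreduce"),
        os.path.join("hadoop-mapreduce", "hadoop-yarn", "hadoop-yarn-api"),
    ]
    without_pom = [
        os.path.join("hadoop-mapreduce", "src"),
    ]
    for rel in with_pom + without_pom:
        os.makedirs(os.path.join(root, rel), exist_ok=True)
    for rel in with_pom:
        # An empty file is enough for the walk; real modules have a full POM here.
        with open(os.path.join(root, rel, "pom.xml"), "w") as handle:
            handle.write("")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_tree(tmp)
        for module in find_maven_modules(tmp):
            print(module)
```

To try it on a real checkout, call find_maven_modules with the path to your trunk directory instead of the temporary mock tree.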

See this wiki page for instructions on building the latest MapReduce trunk until the usual wiki page is updated to reflect the latest changes.


I know no single list is comprehensive given the monstrosity of the effort, but I wanted to recognize all of the contributors – Arun C. Murthy, Christopher Douglas, Devaraj Das, Greg Roelofs, Jeffrey Naisbitt, Josh Wills, Ahmed Radwan, Jonathan Eagles, Krishna Ramachandran, Luke Lu, Mahadev Konar, Robert Evans, Sharad Agarwal, Siddharth Seth, Thomas Graves, Ramya Sunil (testing), Giridharan Kesavan (release engineering), Karam Singh and Santosh Kumar (performance engineering).

Alright, it’s time you checked out the bleeding-edge Hadoop MapReduce trunk. Start hacking and have fun!

– Vinod Kumar Vavilapalli a.k.a @tshooter



