NextGen MapReduce Hits Apache Hadoop Mainline
We are very excited to announce that NextGen Apache Hadoop MapReduce is getting close. We just merged the code base into Apache Hadoop mainline, and Arun is about to create a hadoop-0.23 branch to prepare for a release!
We’ve talked about NextGen Apache Hadoop MapReduce and its advantages. The drawbacks of the current Apache Hadoop MapReduce are both old and well understood. The proposed architecture has been in the public domain for over 3 years now. The team started the work in August 2010 with a prototype on which we iterated rapidly. This culminated in an initial check-in to Apache Hadoop SVN in March 2011. Since then we’ve done all development on the MR-279 branch in Apache and have worked really hard to get NextGen Hadoop MapReduce ready. We hope to see it soon on *your* cluster.
Some fun stats:
- NextGen MapReduce has nearly 100,000 lines of code (roughly – just the *.java files). That’s nearly 1/3 the size of the current Apache Hadoop codebase, all added in the last 12 months!
- We have spent nearly 100 man-months of effort developing and testing this so far. At one point we had over 10 full-time Yahoo/Hortonworks employees working on it.
- We are excited to be receiving patches from lots of members in the community. We have received/committed patches from members in at least 3 time zones.
- We know of at least 2 efforts to port non-MapReduce frameworks to work on NextGen Hadoop!
How to contribute
Now, this is just the beginning. There is still much to do. Making the MapReduce framework production quality is the top priority but implementing/porting alternative computing frameworks will excite some contributors as well. To help that cause, I am pasting the new source code directory structure here:
- hadoop-mapreduce (was mapreduce before)
- trunk/hadoop-mapreduce – Classic code. JobTracker/TaskTracker
- trunk/hadoop-mapreduce/ – New code related to YARN
- hadoop-yarn – YARN APIs, libraries, and server code
- hadoop-yarn-server – Server code, ResourceManager, NodeManager
- trunk/hadoop-mapreduce/hadoop-yarn/hadoop-yarn-server/ – server libraries and tests.
- trunk/hadoop-mapreduce/hadoop-mr-client – MapReduce server and client code
- hadoop-mapreduce-client-app – The MapReduce ApplicationMaster
- hadoop-mapreduce-client-jobclient – MR JobClient
- hadoop-mapreduce-client-hs – History server for MR jobs
- hadoop-mapreduce-client-shuffle – Shuffle code in MR
I know no single list can be comprehensive given the sheer size of the effort, but I wanted to recognize all of the contributors: Arun C. Murthy, Christopher Douglas, Devaraj Das, Greg Roelofs, Jeffrey Naisbitt, Josh Wills, Ahmed Radwan, Jonathan Eagles, Krishna Ramachandran, Luke Lu, Mahadev Konar, Robert Evans, Sharad Agarwal, Siddharth Seth, Thomas Graves, Ramya Sunil (testing), Giridharan Kesavan (release engineering), Karam Singh and Santosh Kumar (performance engineering).
Alright, it’s time you checked out the bleeding edge Hadoop MapReduce trunk. Start hacking and have fun!
– Vinod Kumar Vavilapalli a.k.a @tshooter