NextGen MapReduce Hits Apache Hadoop Mainline

We are very excited to announce that NextGen Apache Hadoop MapReduce is getting close. We just merged the code base into Apache Hadoop mainline, and Arun is about to create a hadoop-0.23 branch to prepare for a release!

We’ve talked about NextGen Apache Hadoop MapReduce and its advantages. The drawbacks of current Apache Hadoop MapReduce are both old and well understood. The proposed architecture has been in the public domain for over 3 years now. The team started the work in August 2010, beginning with a prototype on which we iterated rapidly. This culminated in an initial check-in to Apache Hadoop SVN in March 2011. Since then we’ve done all development on the MR-279 branch in Apache and have run really hard to get NextGen Hadoop MapReduce ready. We hope to see it soon on *your* cluster.

Some fun stats:

  • NextGen MapReduce has nearly 100,000 lines of code (counting just the *.java files). That’s nearly 1/3 of the current Apache Hadoop codebase, all added in the last 12 months!
  • We have spent nearly 100 man-months of effort developing and testing this so far. At one point we had over 10 full-time Yahoo/Hortonworks employees working on it.
  • We are excited to be receiving patches from lots of members in the community. We have received/committed patches from members in at least 3 time zones.
  • We know of at least 2 efforts to port non-MapReduce frameworks to work on NextGen Hadoop!


How to contribute

Now, this is just the beginning. There is still much to do. Making the MapReduce framework production quality is the top priority, but implementing or porting alternative computing frameworks will excite some contributors as well. To help that cause, I am pasting the new source code directory structure here:

  • trunk/
    • hadoop-mapreduce (was mapreduce before)
  • trunk/hadoop-mapreduce – Classic code. JobTracker/TaskTracker
    • build.xml
    • src
  • trunk/hadoop-mapreduce/ – New code related to YARN
    • assembly
    • pom.xml
    • hadoop-mr-client
    • hadoop-yarn – Yarn APIs, libraries, and server code
      • hadoop-yarn-api
      • hadoop-yarn-common
      • hadoop-yarn-server – Server code, ResourceManager, NodeManager,
  • trunk/hadoop-mapreduce/hadoop-yarn/hadoop-yarn-server/ – server libraries and tests.
    • hadoop-yarn-server-common
    • hadoop-yarn-server-nodemanager
    • hadoop-yarn-server-resourcemanager
    • hadoop-yarn-server-tests
  • trunk/hadoop-mapreduce/hadoop-mr-client – MapReduce server and client code
    • hadoop-mapreduce-client-app – The MapReduce ApplicationMaster
    • hadoop-mapreduce-client-core
    • hadoop-mapreduce-client-jobclient – MR JobClient
    • hadoop-mapreduce-client-common
    • hadoop-mapreduce-client-hs – History-server for MR jobs
    • hadoop-mapreduce-client-shuffle – Shuffle code in MR
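
If you want to see how a local checkout lines up with the layout above, here is a minimal sketch that lists every directory containing a pom.xml. It assumes the new modules are Maven modules marked by a pom.xml file (the listing above shows pom.xml under the new code and build.xml under the classic code). The function name find_maven_modules and the set of skipped directories are my own choices for illustration, not part of the Hadoop build.

```python
import os

# Directories that are assumed to hold build output or version-control
# metadata rather than source modules (an assumption, adjust as needed).
SKIPPED_DIRS = {"target", ".svn", ".git"}


def find_maven_modules(root):
    """Return sorted paths, relative to root, of directories holding a pom.xml."""
    modules = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIPPED_DIRS]
        if "pom.xml" in filenames:
            modules.append(os.path.relpath(dirpath, root))
    return sorted(modules)
```

Calling find_maven_modules on the path to your hadoop-mapreduce directory returns one relative path per module found, with "." standing for the root itself.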

Until the usual wiki page is updated to reflect the latest changes, see this wiki page for instructions on building the latest MapReduce trunk.


Acknowledgements

I know no single list is comprehensive given the sheer size of the effort, but I wanted to recognize all of the contributors: Arun C. Murthy, Christopher Douglas, Devaraj Das, Greg Roelofs, Jeffrey Naisbitt, Josh Wills, Ahmed Radwan, Jonathan Eagles, Krishna Ramachandran, Luke Lu, Mahadev Konar, Robert Evans, Sharad Agarwal, Siddharth Seth, Thomas Graves, Ramya Sunil (testing), Giridharan Kesavan (release engineering), Karam Singh and Santosh Kumar (performance engineering).

Alright, it’s time you checked out the bleeding edge Hadoop MapReduce trunk. Start hacking and have fun!

- Vinod Kumar Vavilapalli a.k.a @tshooter

 

