Apache Hadoop 0.23 is Here!

As the Release Manager, it’s my privilege to present Apache Hadoop 0.23:

Release: http://hadoop.apache.org/common/releases.html
Documentation: http://hadoop.apache.org/common/docs/r0.23.0/

I’ll present a short overview of the release in this post; more details are available in my recent talk on Apache Hadoop 0.23 at Hadoop World, 2011.


[Figure: timeline of Apache Hadoop releases]

As shown by the above timeline of Apache Hadoop releases, hadoop-0.23 is the first major release off the Apache Hadoop mainline on track to be stable since hadoop-0.20 in April, 2009 – very exciting times indeed for the Hadoop community!

The Release

As you might be aware, hadoop-0.23 contains significant advances at all levels. Undoubtedly, the highlights are:

  • HDFS Federation
  • NextGen MapReduce

HDFS Federation
HDFS has undergone a transformation to separate out Namespace management from the Block (storage) management to allow for significant scaling of the filesystem – in the current architecture they are intertwined in the NameNode.

However, we have ensured that existing HDFS APIs continue to work as before, and user applications do not need to be modified.
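To make the namespace/block split a little more concrete, here is a minimal toy sketch in Python. It is not HDFS code and does not use any Hadoop API: the class names `BlockStore` and `Namespace`, the 4-byte block size, and the file paths are all assumptions made up for illustration. The only point it tries to show is that path bookkeeping and block storage can live in separate components, so that more than one namespace can sit on top of the same storage layer.

```python
class BlockStore:
    """Toy block (storage) layer: knows block IDs and bytes, nothing about paths."""

    def __init__(self):
        self._blocks = {}
        self._next_id = 0

    def write(self, data):
        # Store one block and hand back its ID.
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = data
        return block_id

    def read(self, block_id):
        return self._blocks[block_id]

    def block_count(self):
        return len(self._blocks)


class Namespace:
    """Toy namespace manager: maps paths to block IDs and holds no data itself."""

    def __init__(self, name, store):
        self.name = name
        self._store = store
        self._files = {}

    def create(self, path, data, block_size=4):
        # Split the data into fixed-size blocks; only the IDs are kept here.
        chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        self._files[path] = [self._store.write(chunk) for chunk in chunks]

    def open(self, path):
        # Reassemble the file from the shared block layer.
        return b"".join(self._store.read(b) for b in self._files[path])

    def listing(self):
        return sorted(self._files)


# Two independent (hypothetical) namespaces sharing one storage layer.
store = BlockStore()
users = Namespace("users", store)
logs = Namespace("logs", store)
users.create("/home/a.txt", b"hello world")
logs.create("/2011/11/app.log", b"started")
```

In this sketch, client code only ever calls `create` and `open` on a namespace, which loosely mirrors why applications written against the existing filesystem interface would not need to change when the layers underneath are separated.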

More details are available in the HDFS Federation release documentation or in the recent HDFS Federation talk at Hadoop World, 2011, by Suresh Srinivas, a Hortonworks co-founder.

NextGen MapReduce aka YARN
MapReduce has undergone a complete overhaul in hadoop-0.23 with the fundamental change to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform where we can support MapReduce and other application execution frameworks such as MPI etc.

However, note that existing MapReduce applications should continue to work as-is, and users shouldn’t notice the underlying framework changes, i.e. the replacement of JobTracker/TaskTracker with ResourceManager/NodeManager.
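The division of labour between the ResourceManager and a per-application ApplicationMaster can be sketched with a small toy model in Python. This is not the YARN API: the method names (`allocate`, `release`, `run`), the notion of a "container" as a simple counter, and the example tasks are all assumptions for illustration. The sketch only shows the shape of the idea, in which one global component hands out capacity while each application schedules and monitors its own tasks.

```python
class ResourceManager:
    """Toy global RM: only tracks and hands out cluster capacity."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        # Grant as many containers as are free, up to the request.
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Toy per-application AM: schedules and monitors one application's tasks."""

    def __init__(self, app_id, tasks, rm):
        self.app_id = app_id
        self.pending = list(tasks)
        self.completed = []
        self.rm = rm

    def run(self):
        # Keep asking the RM for capacity until every task has run.
        while self.pending:
            granted = self.rm.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no capacity available")
            batch = self.pending[:granted]
            self.pending = self.pending[granted:]
            self.completed.extend(task() for task in batch)
            self.rm.release(granted)
        return self.completed


# A hypothetical application with five tasks on a two-container cluster.
rm = ResourceManager(total_containers=2)
squares = ApplicationMaster("squares", [lambda n=n: n * n for n in range(5)], rm)
results = squares.run()
```

Note that the toy `ResourceManager` knows nothing about what a task is or how a job is structured. That is the property that would let frameworks other than MapReduce bring their own ApplicationMaster.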

More details are available in the YARN release documentation or in the recent YARN presentation at Hadoop World, 2011, by Mahadev Konar, a Hortonworks co-founder.


(Lots More)

Note that hadoop-0.23 has significant other enhancements:

  • Performance is 2x+ across the board (HDFS read/write path improvements, the MapReduce shuffle rewrite from Owen and me for the 2009 Terasort record, optimizations for small jobs, etc.)
  • Full mavenization of the build (thanks to Alejandro Abdelnur & Tom White)
  • Re-write of HDFS edits log (thanks to Todd Lipcon)
  • Many, many more …

Next Steps

hadoop-0.23 is a big advance and, as with any big leap, it will take a little while for us to stabilize the release. Thus, please note that hadoop-0.23.0 is very much alpha quality and we do not recommend using it in production – yet!

If you are interested in what it takes and how we stabilize a major Hadoop release, please refer to my Apache Hadoop 0.23 presentation at Hadoop World, 2011.

Oh, the Hadoop HDFS developer community is also working on incorporating High Availability for the HDFS NameNode in an upcoming release from the hadoop-0.23 branch. More details are available at https://issues.apache.org/jira/browse/HDFS-1623 and in the recent HDFS HA talk by Suresh Srinivas & Aaron Myers at Hadoop World, 2011.

We are currently in the process of rolling out hadoop-0.23.0 to test/alpha clusters (small clusters of ~500 nodes) at Yahoo and are excited to report that Pig, Hive, HBase, Oozie etc. should be integrated in very short order.

Conclusion

Apache Hadoop 0.23 is a quantum leap for the Hadoop community and we are very excited to have it released. Please do try the release (download it here) and provide us with feedback and help to stabilize it.

Again, I’d like to emphasize that we have taken great care to ensure existing applications using the HDFS and MapReduce APIs do not need to be modified to use the hadoop-0.23 release.

My personal, biased, highlight: NextGen MapReduce… and I really am proud of the efforts we’ve put in over the last 18 months or so to get this out. Well, I did warn that I was biased! :)

~Arun C. Murthy
@acmurthy


Comments

Tejas | January 29, 2012 at 10:02 pm

In version 0.23, is there going to be an enhancement in monitoring and controlling apis exposed?

Are there any administration apis exposed?

December 12, 2011 at 11:04 am

How does the NextGen MapReduce compare to Platform Computing’s commercial MapReduce product? Is theirs really better as they claim?
