Apache Hadoop 0.23 is Here!
As the Release Manager, it’s my privilege to present Apache Hadoop 0.23:
I’ll present a short overview of the release in this post; more details are available in my recent talk on Apache Hadoop 0.23 at Hadoop World, 2011.
As shown by the above timeline of Apache Hadoop releases, hadoop-0.23 is the first major release off the Apache Hadoop mainline on track to be stable since hadoop-0.20 in April, 2009 – very exciting times indeed for the Hadoop community!
As you might be aware, hadoop-0.23 contains significant advances at all levels. Undoubtedly, the highlights are:
- HDFS Federation
- NextGen MapReduce
HDFS Federation
HDFS has been restructured to separate Namespace management from Block (storage) management, which allows the filesystem to scale significantly; in the existing (pre-0.23) architecture the two are intertwined in the NameNode.
However, we have ensured that existing HDFS APIs continue to work as before and user applications do not need to be modified.
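To make the separation described above a little more concrete, here is a minimal conceptual sketch in Python. It is not HDFS code: the class names `BlockStorage` and `Namespace`, their methods, and the tiny block size are all hypothetical and chosen purely for illustration. The point is only that the namespace layer maps paths to block IDs while the storage layer knows nothing about paths, so several independent namespaces can sit on top of the same storage.

```python
from itertools import count


class BlockStorage:
    """Storage layer (hypothetical): holds block IDs and bytes, no file paths."""

    def __init__(self):
        self._blocks = {}
        self._next_id = count(1)

    def write_block(self, data: bytes) -> int:
        block_id = next(self._next_id)
        self._blocks[block_id] = data
        return block_id

    def read_block(self, block_id: int) -> bytes:
        return self._blocks[block_id]


class Namespace:
    """Namespace layer (hypothetical): maps paths to block IDs only."""

    def __init__(self, name: str, storage: BlockStorage, block_size: int = 4):
        self.name = name
        self._storage = storage
        self._block_size = block_size
        self._files = {}  # path -> ordered list of block IDs

    def create(self, path: str, data: bytes) -> None:
        # Split the file into fixed-size chunks and hand each one to storage.
        chunks = [data[i:i + self._block_size]
                  for i in range(0, len(data), self._block_size)]
        self._files[path] = [self._storage.write_block(c) for c in chunks]

    def read(self, path: str) -> bytes:
        return b"".join(self._storage.read_block(b) for b in self._files[path])

    def block_ids(self, path: str) -> list:
        return list(self._files[path])


# Two independent namespaces share one storage layer.
storage = BlockStorage()
users = Namespace("users", storage)
logs = Namespace("logs", storage)

# The same path can exist in both namespaces without conflict.
users.create("/data/a.txt", b"hello world")
logs.create("/data/a.txt", b"different file")
```

In this toy model, adding another namespace does not require any change to the storage layer, which is the scaling property the split is meant to provide.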
NextGen MapReduce aka YARN
MapReduce has undergone a complete overhaul in hadoop-0.23. The fundamental change is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs. Thus, Hadoop becomes a general-purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI.
However, note that existing MapReduce applications should continue to work as-is, and users shouldn’t notice the underlying framework changes, i.e., the replacement of JobTracker/TaskTracker with ResourceManager/NodeManager.
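The division of labour between the RM and the AM can be sketched with a toy model. This is an illustration only, not the actual YARN API: the class names mirror the daemons named above, but the `allocate`, `release`, and `run` methods, the notion of a fixed pool of "containers", and the task names are assumptions made up for this example.

```python
class ResourceManager:
    """Global arbiter (toy model): only tracks and hands out capacity."""

    def __init__(self, total_containers: int):
        self.free = total_containers

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are free, up to the request.
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count: int) -> None:
        self.free += count


class ApplicationMaster:
    """Per-application (toy model): schedules and monitors its own tasks."""

    def __init__(self, app_id: str, tasks: list, rm: ResourceManager):
        self.app_id = app_id
        self.pending = list(tasks)
        self.completed = []
        self._rm = rm

    def run(self) -> None:
        while self.pending:
            granted = self._rm.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            # Run one wave of tasks in the granted containers.
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            self.completed.extend(f"{task}:done" for task in batch)
            self._rm.release(granted)


# One global RM; each application brings its own AM.
rm = ResourceManager(total_containers=3)
wordcount = ApplicationMaster(
    "wordcount", ["map-0", "map-1", "map-2", "map-3", "reduce-0"], rm)
wordcount.run()
```

Note that the toy `ResourceManager` knows nothing about maps or reduces. All job-specific logic lives in the `ApplicationMaster`, which is why a different framework could in principle plug in its own AM on top of the same RM.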
Note that hadoop-0.23 has other significant enhancements:
- Performance is 2x+ across the board (HDFS read/write path improvements, the MapReduce shuffle rewrite from Owen and me for the 2009 Terasort record, optimizations for small jobs, etc.)
- Full mavenization of the build (thanks to Alejandro Abdelnur & Tom White)
- Rewrite of the HDFS edits log (thanks to Todd Lipcon)
- Many, many more …
hadoop-0.23 is a big advance and, as with any big leap, it will take a little while for us to stabilize the release. Thus, please note that hadoop-0.23.0 is very much alpha quality and we do not recommend using it in production – yet!
If you are interested in what it takes and how we stabilize a major Hadoop release, please refer to my Apache Hadoop 0.23 presentation at Hadoop World, 2011.
Oh, the Hadoop HDFS developer community is also working on incorporating High Availability for the HDFS NameNode in an upcoming release from the hadoop-0.23 branch. More details are available at https://issues.apache.org/jira/browse/HDFS-1623 and in the recent HDFS HA talk by Suresh Srinivas & Aaron Myers at Hadoop World, 2011.
We are currently in the process of rolling out hadoop-0.23.0 to test/alpha clusters (small clusters of ~500 nodes) at Yahoo and are excited to report that Pig, Hive, HBase, Oozie etc. should be integrated in very short order.
Apache Hadoop 0.23 is a quantum leap for the Hadoop community and we are very excited to have it released. Please do try the release (download it here) and provide us with feedback and help to stabilize it.
Again, I’d like to emphasize that we have taken great care to ensure existing applications using the HDFS and MapReduce APIs do not need to be modified to use the hadoop-0.23 release.
My personal, biased, highlight: NextGen MapReduce… and I really am proud of the efforts we’ve put in over the last 18 months or so to get this out. Well, I did warn that I was biased!
~Arun C. Murthy