Apache Hadoop 2.3.0 Released!
hadoop-2.3.0 is the first release for the year 2014, and brings a number of enhancements to the core platform, in particular to HDFS.
With this release, there are two significant enhancements to HDFS:
- Support for Heterogeneous Storage Hierarchy in HDFS (HDFS-2832)
- In-memory Cache for data resident in HDFS via Datanodes (HDFS-4949)
With support for heterogeneous storage classes in HDFS, we can now take advantage of different storage types within the same Hadoop cluster. Hence, we can make better cost/benefit tradeoffs across storage media such as commodity disks, enterprise-grade disks, SSDs, memory etc. More details on this major enhancement are available here.
Along similar lines, it is now possible to use memory available in the Hadoop cluster to centrally cache and administer data-sets in the Datanode’s address space. Applications such as MapReduce, Hive, Pig etc. can now request that data be cached in memory (for the curious, we use a combination of mmap and mlock to achieve this) and then read it directly off the Datanode’s address space for extremely efficient scans that avoid disk altogether. As an example, Hive is taking advantage of this feature by implementing an extremely efficient zero-copy read path for ORC files; see HIVE-6347 for details.
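For readers unfamiliar with mmap, here is a minimal Python sketch of the underlying OS mechanism: a file is mapped into the process's address space and bytes are then read from the mapping rather than through explicit read calls. This is only an illustration, not the Datanode implementation; the helper names and the temporary file standing in for a block file are hypothetical, and the mlock step (which pins the mapped pages in RAM) is omitted.

```python
import mmap
import os
import tempfile


def read_via_mmap(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a memory mapping."""
    with open(path, "rb") as f:
        # Map the whole file read-only into this process's address space.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing copies the requested bytes out of the mapped pages;
            # no explicit read() call is issued against the file.
            return mapped[offset:offset + length]


def demo():
    # Hypothetical stand-in for a block file stored on a Datanode.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"row-0001,row-0002,row-0003")
        # Fetch the second 8-byte record, which starts at byte offset 9.
        return read_via_mmap(path, 9, 8)
    finally:
        os.remove(path)
```

Once the pages of a mapping are resident (and, with mlock, pinned), repeated scans like this are served from memory rather than from disk, which is the effect the cache is after.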
In YARN, we are very excited to see that ResourceManager Automatic Failover (YARN-149) is nearly complete, even though it isn’t ready for primetime yet. We expect it to land in the next release, i.e. hadoop-2.4. Furthermore, a number of key operational enhancements have been driven into YARN, such as better logging, error-handling, diagnostics etc.
On the MapReduce side of the house, a key enhancement is MAPREDUCE-4421; with this we no longer need to install MapReduce binaries on every machine. Instead, we can copy a MapReduce tarball into HDFS and have jobs pick it up via the YARN DistributedCache.
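As a rough sketch of what such a deployment is configured with, the Python snippet below builds the two job properties involved, `mapreduce.application.framework.path` and `mapreduce.application.classpath`, and renders them as a site-file fragment. The HDFS location, the `mr-framework` alias, the tarball's top-level directory name and the exact classpath entries are assumptions chosen for illustration; they depend on where the tarball is uploaded and how it is laid out, so check them against your own build.

```python
import xml.etree.ElementTree as ET


def framework_properties(tarball_uri, alias, tarball_root):
    """Build the properties that point jobs at a MapReduce tarball in HDFS.

    tarball_uri, alias and tarball_root are illustrative inputs, not defaults.
    """
    # The archive is unpacked under $PWD/<alias> in each container.
    base = f"$PWD/{alias}/{tarball_root}/share/hadoop"
    classpath = ",".join([
        "$HADOOP_CONF_DIR",
        f"{base}/mapreduce/*",
        f"{base}/mapreduce/lib/*",
        f"{base}/common/*",
        f"{base}/common/lib/*",
    ])
    return {
        # "<uri>#<alias>" names the link under which the archive is localized.
        "mapreduce.application.framework.path": f"{tarball_uri}#{alias}",
        "mapreduce.application.classpath": classpath,
    }


def to_site_xml(props):
    """Render a dict of properties as a <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical upload location and alias.
    props = framework_properties(
        "hdfs:///apps/mapreduce/hadoop-mapreduce-2.3.0.tar.gz",
        "mr-framework",
        "hadoop-mapreduce-2.3.0",
    )
    print(to_site_xml(props))
```

The classpath has to reference jars inside the localized archive rather than a local install, since the point of the change is that the nodes no longer carry their own copy of the MapReduce binaries.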
Of course, a number of bug-fixes, enhancements etc. have also made it into hadoop-2.3; thereby continuing to improve the core platform. Please see hadoop-2.3.0 release notes for more details.
Looking Ahead to Apache Hadoop 2.4.0
With hadoop-2.3, the community has again delivered a major upgrade to the platform. Looking ahead, a number of exciting features are shaping up for Apache Hadoop 2.4, such as:
- Support for ACLs in HDFS (HDFS-4685)
- Key operability features such as support for Rolling Upgrades in HDFS (HDFS-5535) and FSImage being enhanced to use ProtoBufs (HDFS-5698).
- YARN ResourceManager Automatic Failover (YARN-149)
- YARN Generic Application Timeline (YARN-1530) & History (YARN-321) services to make it significantly easier to develop and manage new frameworks and services in YARN.
Many thanks to everyone who contributed to the release, and everyone in the Apache Hadoop community. Just for the reader’s edification it is instructive to note that hadoop-2.3.0 has 560 JIRAs fixed. Of these, 138 are in Hadoop Common, 203 made it to HDFS, 148 are in YARN and 71 went into MapReduce. So, thank you to every single one of the contributors, reviewers and testers!
In particular, I’d like to call out the following folks: Arpit Agarwal and Tsz Wo Sze for their work on Heterogeneous Storage; Andrew Wang, Colin McCabe and Chris Nauroth for their efforts on the In-Memory Datanode Cache; Jason Lowe for his work on forklifting MapReduce to deploy via the DistributedCache; and several folks from Twitter, such as Gera Shegalov, Lohit V., Joep R. and others, for a number of unsung but very key operational enhancements and fixes to YARN.