Apache Hadoop 2.4.0 Released!

Second Hadoop release in 2014

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.4.0! Thank you to every single one of the contributors, reviewers and testers!

The community fixed 411 JIRAs for 2.4.0 (on top of the 511 JIRAs resolved for 2.3.0). Of the 411 fixes:

  • 50 are in Hadoop Common,
  • 171 are in HDFS,
  • 160 are in YARN and
  • 30 went into MapReduce

Hadoop 2.4.0 is the second Hadoop release in 2014, following Hadoop 2.3.0’s February release and its key enhancements to HDFS such as Support for Heterogeneous Storage and In-Memory Cache.

Hadoop 2.4.0 continues that momentum, with additional enhancements to both HDFS & YARN:

  • Support for Access Control Lists in HDFS (HDFS-4685)
  • Native support for Rolling Upgrades in HDFS (HDFS-5535)
  • Smooth operational upgrades with protocol buffers for HDFS FSImage (HDFS-5698)
  • Full HTTPS support for HDFS (HDFS-5305)
  • Support for Automatic Failover of the YARN ResourceManager (YARN-149), a.k.a. Phase 1 of YARN ResourceManager High Availability
  • Enhanced support for new applications on YARN with Application History Server (YARN-321) and Application Timeline Server (YARN-1530)
  • Support for strong SLAs in YARN CapacityScheduler via Preemption (YARN-185)

HADOOP-1298 introduced file permissions in HDFS in Hadoop 0.16 (a blast from the past: this was in January 2008). Now, HDFS takes a major step forward with support for Access Control Lists (getfacl/setfacl). ACLs enhance the existing HDFS permission model to support controlling file access for arbitrary combinations of users and groups, instead of presenting three predetermined options: a single owner, a single group, and all other users. Take a look at the HDFS ACLs design document for more details.
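To make the difference from the owner/group/other model concrete, here is a minimal sketch of how an ACL with named users and named groups might be evaluated. It assumes POSIX-style evaluation order and mask semantics; it is a simplified illustration, not the HDFS implementation, and the `Acl` class, the `check_access` function and the user and group names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    """Simplified ACL: the classic owner/group/other bits plus named entries."""
    owner: str
    group: str
    owner_perms: str              # e.g. "rw-"
    group_perms: str
    other_perms: str
    named_users: dict = field(default_factory=dict)   # user name -> perms
    named_groups: dict = field(default_factory=dict)  # group name -> perms
    mask: str = "rwx"             # upper bound for named and group entries


def _allows(perms, wanted):
    # True if every requested permission letter is present.
    return all(ch in perms for ch in wanted)


def _masked(perms, mask):
    # Keep only the permission letters that the mask also grants.
    return "".join(ch for ch in perms if ch in mask)


def check_access(acl, user, groups, wanted):
    """Assumed POSIX-style order: owner, named user, groups, then other."""
    if user == acl.owner:
        return _allows(acl.owner_perms, wanted)
    if user in acl.named_users:
        return _allows(_masked(acl.named_users[user], acl.mask), wanted)
    group_entries = []
    if acl.group in groups:
        group_entries.append(acl.group_perms)
    group_entries += [p for g, p in acl.named_groups.items() if g in groups]
    if group_entries:
        # Any matching group entry may grant access; "other" is not consulted.
        return any(_allows(_masked(p, acl.mask), wanted) for p in group_entries)
    return _allows(acl.other_perms, wanted)


# Hypothetical file: owned by alice:sales, with one extra user and one extra group.
acl = Acl(owner="alice", group="sales",
          owner_perms="rw-", group_perms="r--", other_perms="---",
          named_users={"bruce": "rw-"}, named_groups={"execs": "r--"},
          mask="r--")

print(check_access(acl, "bruce", [], "r"))         # named user, read allowed
print(check_access(acl, "bruce", [], "w"))         # write removed by the mask
print(check_access(acl, "carol", ["execs"], "r"))  # named group, read allowed
print(check_access(acl, "dave", ["interns"], "r")) # falls through to "other"
```

In this sketch, bruce and the execs group gain read access without changing the file's owner or owning group, which is the kind of combination the three fixed options cannot express.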

Hadoop clusters are growing, and some operations teams are challenged with upgrading as many as 5000 HDFS nodes, storing more than 100 petabytes of data. Rolling Upgrades make this significantly easier to manage. Switching the HDFS FSImage to use protocol buffers also eases operations, since it allows safe HDFS upgrades to newer versions with better rollback capabilities (in the face of software bugs or human errors).

Security is a key concern for Apache Hadoop, and we are pleased that version 2.4.0 includes full HTTPS support for HDFS across all components: WebHDFS, HsFTP and even the web interfaces.

With automatic failover of the YARN ResourceManager, applications can smoothly fail over to a (cold) standby ResourceManager in case of operational issues such as hardware failures. The new ResourceManager will automatically restart applications. In the next phase we plan to add a hot standby that can continue to run applications from the point of failover, to preserve any work already completed.

We are also seeing the community take advantage of YARN’s promise, with many diverse applications now implemented (or ported over) to run on YARN. From this, we have received important feedback that it would be useful for YARN to provide standard services to track and store application-specific metrics such as containers used and resources consumed.

So we are thrilled to note that YARN now provides better metrics capabilities with a generic Application Timeline Server (ATS). ATS uses a NoSQL store as its backend (which defaults to a single-node LevelDB, with HBase for scale-out) and provides extremely fast writes for millions of metrics, along with some key aggregation capabilities during retrieval. ATS also provides a very simple REST interface to PUT & GET application timeline data.
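The put-then-aggregate-on-read pattern described above can be sketched with a small in-memory stand-in. This is only an illustration of the idea (cheap appends on write, sorting and summarizing on retrieval, JSON out); it does not reproduce the real ATS REST endpoints or schema, and the `TimelineStore` class, its field names and the `HIVE_QUERY` entity type are hypothetical.

```python
import json
from collections import defaultdict


class TimelineStore:
    """Toy stand-in for a timeline service: append on put, aggregate on get."""

    def __init__(self):
        # (entity_type, entity_id) -> list of events in arrival order
        self._events = defaultdict(list)

    def put(self, entity_type, entity_id, timestamp, event_type, info=None):
        # Writes are plain appends, so they stay cheap.
        self._events[(entity_type, entity_id)].append(
            {"timestamp": timestamp, "eventtype": event_type, "info": info or {}}
        )

    def get(self, entity_type, entity_id):
        # Sorting and summarizing happen at retrieval time, newest event first.
        events = sorted(
            self._events.get((entity_type, entity_id), []),
            key=lambda e: e["timestamp"],
            reverse=True,
        )
        stamps = [e["timestamp"] for e in events]
        doc = {
            "entitytype": entity_type,
            "entity": entity_id,
            "event_count": len(events),
            "elapsed": (max(stamps) - min(stamps)) if stamps else 0,
            "events": events,
        }
        return json.dumps(doc, sort_keys=True)


store = TimelineStore()
store.put("HIVE_QUERY", "q1", 100, "QUERY_SUBMITTED")
store.put("HIVE_QUERY", "q1", 250, "QUERY_COMPLETED", {"rows": 42})
print(store.get("HIVE_QUERY", "q1"))
```

A client-side GUI could consume the returned JSON directly, which is the usage pattern the post describes for applications built on ATS.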

ATS is already being used by key applications such as Apache Tez & Apache Hive to store query metrics and to render GUIs on the client side, using JavaScript to present the JSON in a human-friendly manner in the browser!

Preemption in the YARN CapacityScheduler has been available since Hadoop 2.2 but, to my knowledge, this is the first time that anyone has extensively validated the feature, and it came out with flying colors! Many thanks to Carlo Curino & Chris Douglas, the original contributors.

Looking Ahead to Apache Hadoop 2.5.0

As always, the Apache Hadoop community is looking ahead, with our eyes on a number of enhancements to the core platform for Apache Hadoop 2.5. Here is a preview:

  • First-class support for rolling upgrades in YARN, with:
    • Work-preserving ResourceManager restart (YARN-556)
    • Container-preserving NodeManager restart (YARN-1336)
  • Support for admin-specified labels for servers in YARN for enhanced control and scheduling (YARN-796)
  • Support for applications to delegate resources to others in YARN. This will allow external services to share not just YARN’s resource-management capabilities but also its workload-management capabilities. (YARN-1488)
  • Support for automatically sharing application artifacts in the YARN distributed cache. (YARN-1492)

Acknowledgements

Many thanks to everyone who contributed to the release, and everyone in the Apache Hadoop community.

In particular, I’d like to call out the following folks: Chris Nauroth, Haohui Mai & Vinaykumar B. for their work on HDFS ACLs; Haohui Mai for his work on using protobufs for FSImage; Tsz Wo Sze, Kihwal Lee, Arpit Agarwal, Brandon Li & Jing Zhao for their work on Rolling Upgrades for HDFS; Karthik Kambatla, Xuan Gong and Tsuyoshi Ozawa for their work on YARN ResourceManager automatic failover; Zhijie Shen, Mayank Bansal, Billie Rinaldi and Vinod K. V. for their work on YARN ATS/AHS; and, again, several folks from Twitter such as Gera Shegalov, Lohit V., Joep R., Sangjin Lee et al. for a number of unsung, but very key, operational enhancements and bug fixes to YARN. Last, but not least, a big shout-out to folks such as Ramya Sunil, Yesha Vora, Tassapol A., Arpit Gupta and others who helped validate the release and ensured that we, as a community, can continue to deliver very high quality releases of Apache Hadoop.
