Apache Hadoop 2 is now GA!

I’m thrilled to note that the Apache Hadoop community has declared Apache Hadoop 2.x as Generally Available with the release of hadoop-2.2.0!

This represents the realization of a massive effort by the entire Apache Hadoop community which started nearly 4 years ago, and we’re sure you’ll agree it’s cause for a big celebration. Equally, it’s a great credit to the Apache Software Foundation, which provides an environment where contributors from various places and organizations can collaborate to achieve a goal as significant as Apache Hadoop v2.

Congratulations to everyone!

The Journey

Apache Hadoop v2 is not just a major release number, but represents a generational shift in the architecture of Apache Hadoop. With YARN, Apache Hadoop is recast as a significantly more powerful platform – one that takes Hadoop beyond merely batch applications to take its position as a ‘data operating system’.

To recap, Apache Hadoop v1 consisted of HDFS and MapReduce.

With HDFS one could store data of all kinds; however, MapReduce was the only algorithm you could use to process that data in parallel. That was very limiting since MapReduce, although very general, proved inadequate to satisfy all the demands being placed on Apache Hadoop.
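
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases in plain Python. It is not the Hadoop API: the names `map_reduce`, `word_mapper` and `sum_reducer` are made up for illustration, and a real job would run the same phases distributed across a cluster.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run the map, shuffle and reduce phases in one process (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values emitted for the same key.
            groups[key].append(value)
    # Reduce: collapse each key's list of values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    # Total the occurrences recorded for one word.
    return sum(counts)


if __name__ == "__main__":
    lines = ["store all data in HDFS", "process all data in parallel"]
    print(map_reduce(lines, word_mapper, sum_reducer))
```

Work that fits this map-then-reduce shape parallelizes naturally. Work that does not, such as a continuous stream of events, is awkward to express this way.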

As Apache Hadoop crystallizes into a key component of a Modern Data Architecture, users and customers want to store all data in HDFS and interact with that data in multiple ways:

  • Real-time processing of events (sensor, telecommunications, fraud etc.) even before they land in HDFS
  • Interactive query capabilities for interrogating new data for data analysts (SQL) and data scientists (SQL plus scripting etc.)
  • The need to productionize the insight, i.e. batch processing, reporting etc., in a well-defined and timely manner

The community has worked together to make HDFS itself a much more scalable, efficient and enterprise-friendly storage platform by addressing key functionality – High Availability for the HDFS NameNode, Federation for scaling and HDFS Snapshots, to name a few.

With YARN, Apache Hadoop now clearly delineates the system (resource management, security, SLAs etc.) from the application framework (e.g. MapReduce) and allows for multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with Apache Storm, interactive SQL with Apache Hive and Apache Tez).
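
The separation described above can be pictured with a toy model. This is a rough sketch under loud assumptions, not the YARN API: `ToyResourceManager` and the application names are hypothetical, and scheduling is reduced to counting free containers. The point is only that the system layer hands out resources without knowing which framework is asking.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for the system layer: it only tracks capacity."""
    total_containers: int
    allocations: dict = field(default_factory=dict)

    def free(self):
        # Containers not currently held by any application.
        return self.total_containers - sum(self.allocations.values())

    def request(self, app_name, containers):
        # Grant as many containers as are free, whatever the framework is.
        granted = min(containers, self.free())
        if granted:
            self.allocations[app_name] = self.allocations.get(app_name, 0) + granted
        return granted

    def release(self, app_name):
        # Return all of an application's containers to the shared pool.
        return self.allocations.pop(app_name, 0)


if __name__ == "__main__":
    rm = ToyResourceManager(total_containers=10)
    # Three different frameworks share one pool of cluster resources.
    print(rm.request("mapreduce-batch", 6))
    print(rm.request("storm-streaming", 3))
    print(rm.request("hive-on-tez", 4))
    rm.release("mapreduce-batch")
    print(rm.free())
```

In this sketch the batch, streaming and interactive applications all go through the same `request` call, so adding a new framework requires no change to the resource manager itself.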

We are already seeing the benefits of this vision in the form of many and varied applications and services being re-vectored on top of YARN such as Apache Storm for event processing, Apache Giraph for graph processing, Apache Tez for interactive SQL queries, HOYA for running services such as Apache HBase and Apache Accumulo on YARN and so on. Exciting times indeed!

As a result the Hadoop stack looks very different with Hadoop v2:

[Figure: the Hadoop v2 stack]

Personally, it’s a huge thrill to see this baby grow up and reach adulthood since the original Jira ticket (MAPREDUCE-279) was opened more than 5 ½ years ago!

Apache Hadoop v2

As a lot of people are aware, Apache Hadoop 2 landed the Beta tag a few months ago. Since then the community has spent a lot of time validating the APIs, protocols and the system itself. As a result we are now very confident not only in our ability to handle the workloads that will be thrown at Apache Hadoop, but also in our ability to do so in a forward-compatible manner, such that Apache Hadoop v2 represents a stable base atop which the ecosystem can flourish in the future.

For those who, like me, are more comfortable with simplified lists (*smile*), here are the enhancements and major features:

  • YARN
  • High Availability for HDFS
  • HDFS Federation
  • HDFS Snapshots
  • NFSv3 access to data in HDFS
  • Binary Compatibility for MapReduce applications between Hadoop v1 and Hadoop v2 to ease migration
  • Performance
  • Support for running Hadoop on Microsoft Windows
  • Integration testing for the entire Apache Hadoop ecosystem at the ASF.

Onwards

Although it’s a major milestone and a big reason to celebrate, the Apache Hadoop community will continue to drive it forward under the aegis of the ASF. There are ever more things to do, use cases to fulfill and users to thrill. The HDFS community is striving hard to finish up the addition of symlinks to HDFS, which just didn’t make the cut at the last minute. On the YARN side we plan to add more enhancements such as advanced scheduling features, high availability for the YARN ResourceManager and enhanced support for long-running services, and generally make it easier to run other applications such as Apache Storm within YARN. Stay tuned!

Acknowledgements

As always, it’s an honor and pleasure to work with the entire Apache Hadoop community – thanks to everyone who contributed!


Comments

October 20, 2013 at 7:45 am

Congratulations on reaching this important milestone

Padma
October 17, 2013 at 11:35 pm

Is Hadoop 2.2 available on Windows Server and Windows local machines?

Jamie Sutphin
October 17, 2013 at 8:12 am

It’s just hats off to you guys.

To see this high level of achievement in producing this type of platform, one that is bringing a paradigm shift so quickly, seems to me like a feat not only in development, but in the output from the Apache community.

Most companies are still trying just to get their head around the basic concepts!

Really, hats off.

October 16, 2013 at 11:09 am

Will there be a Hadoop 2 VM sandbox released any time soon?

October 16, 2013 at 6:01 am

Congratz on reaching this milestone, looking forward to all the new stuff
