Apache Hadoop 2 is now GA!
This represents the realization of a massive effort by the entire Apache Hadoop community, which started nearly 4 years ago, and we’re sure you’ll agree it’s cause for a big celebration. Equally, it’s a great credit to the Apache Software Foundation, which provides an environment where contributors from various places and organizations can collaborate to achieve a goal as significant as Apache Hadoop v2.
Congratulations to everyone!
Apache Hadoop v2 is not just a major release number, but represents a generational shift in the architecture of Apache Hadoop. With YARN, Apache Hadoop is recast as a significantly more powerful platform – one that takes Hadoop beyond mere batch applications to its position as a ‘data operating system’.
To recap, Apache Hadoop v1 consisted of HDFS and MapReduce.
With HDFS one could store data of all kinds; however, MapReduce was the only algorithm you could use to process that data in parallel. That was very limiting since MapReduce, although very general, proved inadequate to satisfy all the demands being placed on Apache Hadoop.
As Apache Hadoop crystallizes into a key component of a Modern Data Architecture, users and customers want to store all data in HDFS and interact with that data in multiple ways:
- Real-time processing of events (sensor, telecommunications, fraud etc.) even before the data lands in HDFS
- Interactive query capabilities for interrogating new data for data analysts (SQL) and data scientists (SQL plus scripting etc.)
- Productionizing the insight (batch processing, reporting etc.) in a well-defined and timely manner
The community has worked together to make HDFS itself a much more scalable, efficient and enterprise-friendly storage platform by adding key functionality: High Availability for the HDFS NameNode, Federation for scaling, and HDFS Snapshots, to name a few.
With YARN, Apache Hadoop now clearly delineates the system (resource management, security, SLAs etc.) from the application framework (e.g. MapReduce) and allows for multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with Apache Storm, interactive SQL with Apache Hive and Apache Tez).
We are already seeing the benefits of this vision in the many and varied applications and services being re-vectored on top of YARN, such as Apache Storm for event processing, Apache Giraph for graph processing, Apache Tez for interactive SQL queries, HOYA for running services such as Apache HBase and Apache Accumulo on YARN, and so on. Exciting times indeed!
As a result the Hadoop stack looks very different with Hadoop v2:
Personally, it’s a huge thrill to see this baby grow up and reach adulthood since the original Jira ticket (MAPREDUCE-279) opened more than 5 ½ years ago!
Apache Hadoop v2
As a lot of people are aware, Apache Hadoop 2 landed the Beta tag a few months ago. Since then the community has spent a lot of time validating the APIs, protocols and the system itself. As a result we are now very confident not only in our ability to handle the workloads that will be thrown at Apache Hadoop, but also in our ability to do so in a forward-compatible manner, such that Apache Hadoop v2 represents a stable base atop which the ecosystem can flourish in the future.
For those who, like me, are more comfortable with simplified lists (*smile*), here are the enhancements and major features:
- High Availability for HDFS
- HDFS Federation
- HDFS Snapshots
- NFSv3 access to data in HDFS
- Binary Compatibility for MapReduce applications between Hadoop v1 and Hadoop v2 to ease migration
- Support for running Hadoop on Microsoft Windows
- Integration testing for the entire Apache Hadoop ecosystem at the ASF
Although it’s a major milestone and a big reason to celebrate, the Apache Hadoop community will continue to drive it forward under the aegis of the ASF. There are ever more things to do, use cases to fulfill and users to thrill. The HDFS community is striving hard to finish up the addition of symlinks to HDFS, which just didn’t make the cut at the last minute. On the YARN side we plan to add more enhancements such as advanced scheduling features, high availability for the YARN ResourceManager and enhanced support for long-running services, and generally make it easier to run other applications such as Apache Storm within YARN. Stay tuned!
As always, it’s an honor and pleasure to work with the entire Apache Hadoop community – thanks to everyone who contributed!
Try it with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.