In our previous blogs and webinars we have discussed the significant improvements and architectural changes coming to Apache Hadoop .Next (0.23). To recap, the major ones are:
As we have discussed previously, delivering a major Apache Hadoop release takes a significant amount of effort to meet very strict reliability, scalability and performance requirements. Since Apache Hadoop (HDFS & MapReduce) forms the core of the ecosystem, compatibility and integration with the components in the upper layers of the stack (HBase, Pig, Hive, Oozie etc.) are critical to the success of the new release.
In the tradition that we’ve followed for every single major (stable) release of Apache Hadoop, Hortonworks partnered with Yahoo! to benchmark and certify hadoop-0.23.1 on a performance cluster of 350 machines. Although performance improvement has been a continuous process since the beginning, it became the principal focus after the alpha release of Hadoop .Next (0.23.0).
We are pleased to report that almost all of the benchmarks perform significantly better on Hadoop .Next (0.23.1) compared to the current stable hadoop-1.0 release. Even those that don’t perform significantly better are on par with hadoop-1.0.
The performance benchmarks are the same ones that we’ve been using to harden & stabilize major Hadoop releases throughout the lifetime of the project.
The aim of this process is to verify every single aspect of core Hadoop and to validate that there are no regressions at scale. This covers the core HDFS and MapReduce (i.e. NextGen MapReduce, or YARN) as well as the applications that run on top of this framework.
Here are some details on the benchmark tests:
We also started using a couple of new benchmarks to cater to the architectural changes due to YARN:
Many of the performance improvements can be attributed to the new architecture itself. Stay tuned for additional blogs on this topic.
Leaving aside YARN (the resource-management layer), the MapReduce runtime itself (map task, sort, shuffle, merge etc.) has many improvements compared to hadoop-1.0. Some examples are MAPREDUCE-64, MAPREDUCE-318 and MAPREDUCE-240.
More information is available on MAPREDUCE-3561, which is the umbrella Apache Hadoop JIRA where we were tracking all our benchmarking efforts.
Benchmarking distributed systems is a very challenging task. It involves debugging, constant focus on one problem at a time, knowing which threads of investigation to follow and which to ignore and, last but not least, patience and persistence. We had a lot of fun doing it and learnt some valuable lessons along the way. The process itself merits its own post.
We thank the Yahoo! Performance team for the cluster resources, and the development & performance teams for all their help along the way!
We are very excited to be delivering on the promise of Hadoop .Next and hope you can derive even better value from your Hadoop clusters.
– Vinod Kumar Vavilapalli a.k.a @tshooter