Delivering on Hadoop .Next: Benchmarking Performance

In our previous blogs and webinars we have discussed the significant improvements and architectural changes coming to Apache Hadoop .Next (0.23). To recap, the major ones are:

  • Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
  • NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change to split up the major functionalities of the JobTracker, resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, Graph processing, Iterative processing etc.

As we have discussed previously, delivering a major Apache Hadoop release takes a significant amount of effort to meet very strict reliability, scalability and performance requirements. Since Apache Hadoop (HDFS & MapReduce) are the core parts of the ecosystem, compatibility and integration of components in the upper layers of the stack (HBase, Pig, Hive, Oozie etc.) are critical for success of the new release.

In the tradition that we’ve followed for every single major (stable) release of Apache Hadoop, Hortonworks partnered with Yahoo! to benchmark and certify hadoop-0.23.1 on a performance cluster of 350 machines. Although performance improvements have been a continuous process since the beginning, it became the principle focus after the alpha release of Hadoop .Next (0.23.0).

We are pleased to report that almost all of the benchmarks perform significantly better on Hadoop .Next (0.23.1) compared to the current stable hadoop-1.0 release. Even those that don’t perform significantly better are on par with hadoop-1.0.

The performance benchmarks are the same ones that we’ve been using to harden & stabilize major Hadoop releases throughout the lifetime of the project.

The aim of this process is to verify every single aspect of core Hadoop – to validate that there are no regressions at scale. These include the core HDFS and MapReduce (i.e. NextGen MapReduce, or YARN) and the applications that run on top of this framework.

Here are some details on the benchmark tests:

  • The dfsio benchmark for measuring HDFS I/O (read/write) performance.
  • The slive benchmark for measuring NameNode operations.
  • The scan benchmark to measure HDFS I/O performance for MapReduce jobs.
  • The shuffle benchmark to calibrate how fast the map-outputs are shuffled
  • The famous sort benchmark which measures time for sorting data with MapReduce.
  • The compression benchmark to validate how fast we compress intermediate and the final outputs of MapReduce jobs.
  • The gridmix-V3 to measure the throughput of the cluster using a production trace of thousands of actual user jobs.

We also started using a couple of new benchmarks to cater to the architectural changes due to YARN:

  • The ApplicationMaster Scalability benchmark to figure out how fast task/container scheduling happens at the MapReduce ApplicationMaster. Compared to hadoop-1.0, this benchmark ran twice as fast with hadoop-0.23.1.
  • The ApplicationMaster Recoverability benchmark for measuring how fast jobs recover on restart.
  • The ResourceManager Scalability to evaluate the central master’s scalability by simulating lots of nodes in a cluster.
  • The Small Jobs benchmark to measure performance for very small jobs also runs more than twice as fast due to improvements made where the tasks execute within the ApplicationMaster itself (as opposed to launching small number of tasks for the job).

Many of the performance improvements can be attributed to the new architecture itself. Stay tuned for additional blogs on this topic.

Leaving YARN aside, i.e. the resource-management layer, the MapReduce runtime (map task, sort, shuffle, merge etc.) itself has many improvements when compared to hadoop-1.0. Some examples are: MAPREDUCE-64, MAPREDUCE-318, MAPREDUCE-240.

More information is available on MAPREDUCE-3561, which is the umbrella Apache Hadoop JIRA where we were tracking all our benchmarking efforts.

Benchmarking distributed systems is a very challenging task. It involves debugging, constant focus on one problem at a time, knowing which threads of investigation to follow and which to ignore and last, but not the least, patience and persistence. We had so much fun doing it and learnt some valuable lessons along the way. The process itself merits its own post.

Summary & Acknowledgements

We thank the Yahoo! Performance team for the cluster resources, development & performance teams for all the help along the way!

We are very excited to be delivering on the promise of Hadoop .Next and hope you can derive even better value from your Hadoop clusters.

– Vinod Kumar Vavilapalli a.k.a @tshooter

Categorized by :
Business Value of Hadoop Hadoop HDFS MapReduce


Charles Solomon
April 2, 2014 at 3:27 pm

Where can I get YARN benchmarks? Any links appreciated.


Tom Wirtz
October 20, 2012 at 6:34 am

I am interested in trying some of the benchmarks. Where can I find them?


Ashish Attarde
May 5, 2012 at 3:37 am

I am modifying Hadoop version 1.0.2 to improve its scan performance as well as energy efficiency. Can you please guide me about scan benchmarks?

    Joe B
    August 29, 2012 at 12:15 pm

    Hello Ashish

    Do you have a patch for your modification in Scan performance?
    if it can be shared, we can take a look at it

February 29, 2012 at 7:09 am

@Jeff – Drop me an email & I can help. Thanks!

Jeff Buell
February 28, 2012 at 4:10 pm

This is good stuff. I’d like to reproduce these results, and I’m also interested to see what the effect of virtualization would be. Could you supply or point me to a description of the hardware used, the *site.xml files (and other non-default options), and the commands used to run the tests?


Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Join the Webinar!

In Memory Processing with Apache Spark
Thursday, March 12, 2015
1:00 PM Eastern / 10:00 AM Pacific

More Webinars »

Ambari Stacks, Views and Blueprints Workshop
Thursday, March 26, 2015
1:00 PM Eastern / 10:00 AM Pacific

More Webinars »

Hortonworks Data Platform
The Hortonworks Data Platform is a 100% open source distribution of Apache Hadoop that is truly enterprise grade having been built, tested and hardened with enterprise rigor.
Get started with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
Modern Data Architecture
Tackle the challenges of big data. Hadoop integrates with existing EDW, RDBMS and MPP systems to deliver lower cost, higher capacity infrastructure.