Delivering on Hadoop .Next: Benchmarking Performance

In our previous blogs and webinars we have discussed the significant improvements and architectural changes coming to Apache Hadoop .Next (0.23). To recap, the major ones are:

  • Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
  • NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change to split up the major functionalities of the JobTracker, resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, Graph processing, Iterative processing etc.

As we have discussed previously, delivering a major Apache Hadoop release takes a significant amount of effort to meet very strict reliability, scalability and performance requirements. Since Apache Hadoop (HDFS & MapReduce) are the core parts of the ecosystem, compatibility and integration of components in the upper layers of the stack (HBase, Pig, Hive, Oozie etc.) are critical for success of the new release.

In the tradition that we’ve followed for every single major (stable) release of Apache Hadoop, Hortonworks partnered with Yahoo! to benchmark and certify hadoop-0.23.1 on a performance cluster of 350 machines. Although performance improvements have been a continuous process since the beginning, it became the principle focus after the alpha release of Hadoop .Next (0.23.0).

We are pleased to report that almost all of the benchmarks perform significantly better on Hadoop .Next (0.23.1) compared to the current stable hadoop-1.0 release. Even those that don’t perform significantly better are on par with hadoop-1.0.

The performance benchmarks are the same ones that we’ve been using to harden & stabilize major Hadoop releases throughout the lifetime of the project.

The aim of this process is to verify every single aspect of core Hadoop – to validate that there are no regressions at scale. These include the core HDFS and MapReduce (i.e. NextGen MapReduce, or YARN) and the applications that run on top of this framework.

Here are some details on the benchmark tests:

  • The dfsio benchmark for measuring HDFS I/O (read/write) performance.
  • The slive benchmark for measuring NameNode operations.
  • The scan benchmark to measure HDFS I/O performance for MapReduce jobs.
  • The shuffle benchmark to calibrate how fast the map-outputs are shuffled
  • The famous sort benchmark which measures time for sorting data with MapReduce.
  • The compression benchmark to validate how fast we compress intermediate and the final outputs of MapReduce jobs.
  • The gridmix-V3 to measure the throughput of the cluster using a production trace of thousands of actual user jobs.

We also started using a couple of new benchmarks to cater to the architectural changes due to YARN:

  • The ApplicationMaster Scalability benchmark to figure out how fast task/container scheduling happens at the MapReduce ApplicationMaster. Compared to hadoop-1.0, this benchmark ran twice as fast with hadoop-0.23.1.
  • The ApplicationMaster Recoverability benchmark for measuring how fast jobs recover on restart.
  • The ResourceManager Scalability to evaluate the central master’s scalability by simulating lots of nodes in a cluster.
  • The Small Jobs benchmark to measure performance for very small jobs also runs more than twice as fast due to improvements made where the tasks execute within the ApplicationMaster itself (as opposed to launching small number of tasks for the job).

Many of the performance improvements can be attributed to the new architecture itself. Stay tuned for additional blogs on this topic.

Leaving YARN aside, i.e. the resource-management layer, the MapReduce runtime (map task, sort, shuffle, merge etc.) itself has many improvements when compared to hadoop-1.0. Some examples are: MAPREDUCE-64, MAPREDUCE-318, MAPREDUCE-240.

More information is available on MAPREDUCE-3561, which is the umbrella Apache Hadoop JIRA where we were tracking all our benchmarking efforts.

Benchmarking distributed systems is a very challenging task. It involves debugging, constant focus on one problem at a time, knowing which threads of investigation to follow and which to ignore and last, but not the least, patience and persistence. We had so much fun doing it and learnt some valuable lessons along the way. The process itself merits its own post.

Summary & Acknowledgements

We thank the Yahoo! Performance team for the cluster resources, development & performance teams for all the help along the way!

We are very excited to be delivering on the promise of Hadoop .Next and hope you can derive even better value from your Hadoop clusters.

– Vinod Kumar Vavilapalli a.k.a @tshooter

Categorized by :
Apache Hadoop Hadoop in the Enterprise HDFS MapReduce

Comments

Charles Solomon
|
April 2, 2014 at 3:27 pm
|

Where can I get YARN benchmarks? Any links appreciated.

Thanks,

Tom Wirtz
|
October 20, 2012 at 6:34 am
|

I am interested in trying some of the benchmarks. Where can I find them?

Thanks.

Ashish Attarde
|
May 5, 2012 at 3:37 am
|

I am modifying Hadoop version 1.0.2 to improve its scan performance as well as energy efficiency. Can you please guide me about scan benchmarks?

    Joe B
    |
    August 29, 2012 at 12:15 pm
    |

    Hello Ashish

    Do you have a patch for your modification in Scan performance?
    if it can be shared, we can take a look at it

|
February 29, 2012 at 7:09 am
|

@Jeff – Drop me an email & I can help. Thanks!

Jeff Buell
|
February 28, 2012 at 4:10 pm
|

This is good stuff. I’d like to reproduce these results, and I’m also interested to see what the effect of virtualization would be. Could you supply or point me to a description of the hardware used, the *site.xml files (and other non-default options), and the commands used to run the tests?

Jeff

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

HDP 2.1 Webinar Series
Join us for a series of talks on some of the new enterprise functionality available in HDP 2.1 including data governance, security, operations and data access :
Contact Us
Hortonworks provides enterprise-grade support, services and training. Discuss how to leverage Hadoop in your business with our sales team.
Integrate with existing systems
Hortonworks maintains and works with an extensive partner ecosystem from broad enterprise platform vendors to specialized solutions and systems integrators.