Elephants can remember: MapReduce Job History in HDP 2.0

An important tool in the Hadoop developer toolkit is the ability to look at key metrics for a MapReduce job – to understand the performance of each job and to optimize future job runs.

In this blog article, we’ll explore how HDP 2.0 stores and provides insight into the performance of a MapReduce job on YARN.

Change from MapReduce v1 and HDP 1.x

In MapReduce v2 on YARN in HDP 2.0, the JobTracker no longer exists. Job life cycle management is now the responsibility of short-lived Application Masters. Each MapReduce v2 job spins up its own Application Master, which is terminated once the job completes.

For this reason, a new MapReduce JobHistory server was added for MapReduce v2. It maintains information about MapReduce jobs after their Application Master terminates. Once an Application Master has completed, the Resource Manager Web UI forwards requests for that job to the JobHistory server.
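Because the JobHistory server outlives each Application Master, it can also be queried programmatically. The following is a minimal sketch, assuming the History Server REST endpoint documented for stock Apache Hadoop 2 (`/ws/v1/history/mapreduce/jobs`) and its default web port of 19888. The host name, the sample payload, and the helper names `fetch_jobs` and `summarize_jobs` are hypothetical and not taken from HDP itself, so check the address configured on your own cluster.

```python
import json
from urllib.request import urlopen

# Assumption: placeholder host; 19888 is the stock Apache Hadoop default for
# mapreduce.jobhistory.webapp.address and may differ on your cluster.
HISTORY_URL = "http://jobhistory.example.com:19888/ws/v1/history/mapreduce/jobs"


def fetch_jobs(url=HISTORY_URL):
    """Fetch the job listing from the JobHistory server as a parsed JSON dict."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_jobs(payload):
    """Return a (job id, state, elapsed seconds) tuple for each listed job."""
    # The listing nests jobs as {"jobs": {"job": [...]}}; tolerate an empty listing.
    jobs = (payload.get("jobs") or {}).get("job") or []
    summary = []
    for job in jobs:
        # startTime and finishTime are epoch timestamps in milliseconds.
        elapsed_ms = job["finishTime"] - job["startTime"]
        summary.append((job["id"], job["state"], elapsed_ms / 1000.0))
    return summary


# Hand-written sample shaped like a listing response (not real cluster output).
SAMPLE = {
    "jobs": {
        "job": [
            {
                "id": "job_1326381300833_0001",
                "name": "word count",
                "state": "SUCCEEDED",
                "startTime": 1326381344489,
                "finishTime": 1326381356010,
            }
        ]
    }
}

if __name__ == "__main__":
    for job_id, state, seconds in summarize_jobs(SAMPLE):
        print(job_id, state, seconds)
```

Against a live cluster, `summarize_jobs(fetch_jobs())` would produce the same kind of summary for jobs whose Application Masters have long since terminated.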

Viewing Job History in Ambari

With HDP 2.0, Ambari provides a screen to manage and monitor the JobHistory Server.


The JobHistory UI is accessible as a link from this screen. The JobHistory UI lists all executed MapReduce2 jobs.


You can drill down into each job to get the detailed metrics about the job runtime.


Job history data persisted to HDFS

All the underlying data for each job is persisted to HDFS. This means that historical operational metrics for each job are maintained and remain accessible for the lifetime of the HDP cluster.

In HDP 2.0, the MapReduce job history files are stored in the “/mr-history/done” directory on HDFS. Its subdirectories are organized by the date on which each job executed.
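To make the date-based organization concrete, here is a minimal sketch that builds the directory prefix for a given day. It assumes the year/month/day nesting used by stock Apache Hadoop 2. That layout is an assumption rather than something stated above, further subdirectory levels may sit beneath the date, and the helper name `done_dir_for` is hypothetical.

```python
from datetime import date

# Root directory named in the article.
DONE_DIR = "/mr-history/done"


def done_dir_for(day, root=DONE_DIR):
    """Build the date-based history prefix for a day.

    Assumption: subdirectories are nested as <root>/YYYY/MM/DD.
    """
    return "{}/{:04d}/{:02d}/{:02d}".format(root, day.year, day.month, day.day)


if __name__ == "__main__":
    print(done_dir_for(date(2013, 10, 8)))
```

The resulting path can be passed to `hdfs dfs -ls -R` to list the history files recorded for that day.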


Go Get It

Download HDP 2.0 Beta and deploy today!

