January 25, 2016

Community Choice Winner Blog: Advanced Execution Visualization of Spark jobs

Authors: Zoltán Zvara, Márton Balassi, András Garzó (Hungarian Academy of Sciences), in collaboration with Ericsson

Understanding the physical plan of a big data application is often crucial for tracking down bottlenecks and faulty behavior. Although Apache Spark offers a useful Web UI component for monitoring jobs and understanding their logical plan, it lacks a tool for understanding the physical plan produced by the task scheduler, and for monitoring execution at a very low level, including the communication triggered by RDDs and remote block requests. We propose a tool that lets users monitor job executions in real time, and later replay and examine them, on any cluster currently supported by Spark.

Our execution-visualizer implementation offers end users the following benefits:

  • It explains the execution mechanism of Spark and demonstrates how executors and tasks work internally, which helps attract new users;
  • Through advanced visual monitoring of programs, it makes it possible to discover executor and task issues in a more detailed and convenient way;
  • It can highlight bottlenecks of certain workflows, offering insight for advanced, online optimization strategies;
  • It captures and visualizes key distributions at task level to identify bottlenecks introduced by data skew.
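To illustrate the last point, the kind of task-level skew check such a tool can perform is sketched below in plain Python. This is a hypothetical standalone sketch, not the tool's actual implementation: the record data, key names, and the "twice the mean" skew threshold are our own assumptions.

```python
from collections import Counter

def key_distribution(records):
    """Count how many records each key receives (a per-task histogram)."""
    return Counter(key for key, _ in records)

def skewed_keys(distribution, factor=2.0):
    """Flag keys whose record count exceeds `factor` times the mean count,
    a simple heuristic for data skew across partitions."""
    if not distribution:
        return []
    mean = sum(distribution.values()) / len(distribution)
    return [k for k, n in distribution.items() if n > factor * mean]

# Toy input: key "a" is heavily over-represented.
records = [("a", i) for i in range(90)] + [("b", 0), ("c", 1)]
dist = key_distribution(records)
print(skewed_keys(dist))  # → ['a']
```

A real monitor would collect these histograms per task and surface the over-represented keys visually rather than printing them.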

After this talk you will know more about:

  • How Apache Spark, a general-purpose distributed data-processing engine, works in the background.
  • How user code maps to the physical execution plan generated by the Spark scheduler.
  • How to spot bottlenecks and get a general view of the issues you might face during application development.
  • How the visualization was implemented, and our future plans for it.
  • How the visualization is used at Ericsson to enhance telecommunication and IoT analytics.
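To make the second point concrete: the way the Spark scheduler derives a physical plan can be approximated by cutting the RDD lineage graph into stages at wide (shuffle) dependencies, while narrow dependencies stay within a stage. The Python sketch below is a deliberately simplified illustration of that idea; the DAG representation and names are our own, not Spark's internals.

```python
# Each node maps to a list of (parent, is_wide) edges. A wide
# (shuffle) dependency starts a new stage, mirroring how Spark's
# scheduler cuts the lineage graph into stages.
def split_stages(dag, final_rdd):
    stages = []

    def build(rdd):
        stage = set()
        frontier = [rdd]
        while frontier:
            node = frontier.pop()
            if node in stage:
                continue
            stage.add(node)
            for parent, is_wide in dag.get(node, []):
                if is_wide:
                    build(parent)            # shuffle boundary: new stage
                else:
                    frontier.append(parent)  # narrow dep: same stage
        stages.append(sorted(stage))

    build(final_rdd)
    return stages

# Toy lineage: textFile -> map1 (narrow) -> reduceByKey (wide) -> map2
dag = {
    "map2":        [("reduceByKey", False)],
    "reduceByKey": [("map1", True)],
    "map1":        [("textFile", False)],
}
print(split_stages(dag, "map2"))
# → [['map1', 'textFile'], ['map2', 'reduceByKey']]
```

Parent stages are emitted before the stages that depend on them, which is also the order in which they must execute.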

In our talk proposal we stated our intention to extend the tool to support other frameworks in the Hadoop ecosystem. Since then, we have started implementing the data generator on top of Flink’s REST API.
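A minimal sketch of what such a data generator could look like, assuming Flink's monitoring REST API is available on the JobManager: note that endpoint paths and response fields vary across Flink versions, and the sample payload below is fabricated purely for illustration.

```python
import json
from urllib.request import urlopen

def fetch_jobs(base_url):
    """Fetch the job overview from a Flink JobManager's REST API.
    The exact endpoint path depends on the Flink version."""
    with urlopen(base_url + "/joboverview") as resp:
        return parse_jobs(resp.read().decode("utf-8"))

def parse_jobs(payload):
    """Extract (job id, name, state) triples from a JSON job overview."""
    data = json.loads(payload)
    return [(j["jid"], j["name"], j["state"]) for j in data.get("running", [])]

# Offline example with a made-up payload in an overview-like format:
sample = '{"running": [{"jid": "1b3f", "name": "WordCount", "state": "RUNNING"}]}'
print(parse_jobs(sample))  # → [('1b3f', 'WordCount', 'RUNNING')]
```

Polling such an endpoint periodically yields the raw job and task data that the visualizer can then replay.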


Check out the features of a previous version of the tool in this video:

Register for the Hadoop Summit in Dublin here.

