Debugging distributed systems can be difficult largely because they are designed to run across many hosts in a cluster, possibly thousands. The process typically involves monitoring and analyzing log files spread across the cluster, and if the necessary information is not being logged, it may require service restarts and job redeployment. Not only is this process tedious, it can also be disruptive for systems running in production.
The latest 1.0 release of Apache Storm includes a number of important new features that address this difficulty. In this post we’ll take a high-level look at what these features mean for Storm users and administrators.
[Image: The first computer "bug": a moth stuck in a relay, famously recorded by Grace Hopper's team.]
In previous versions of Storm, changing logging levels required manually editing configuration files across all nodes in the cluster. This was especially tedious in large clusters, and to make matters worse, once you were finished you had to repeat the process to revert those changes.
Storm 1.0 allows you to change any log level directly from the Storm UI or the command line, without having to remotely login to machines in the cluster. What’s more, it also allows you to specify an expiration time after which the changes will be automatically reverted.
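For example, the `storm set_log_level` command can raise a logger's level on a running topology and have Storm revert it automatically after a timeout. This is a sketch based on the Storm 1.0 dynamic log level documentation; the topology name `my_topology` is a placeholder for your own.

```shell
# Set the ROOT logger of a running topology to DEBUG for 30 seconds,
# after which Storm reverts it to its previous level automatically.
./bin/storm set_log_level my_topology -l ROOT=DEBUG:30

# Revert the change manually before the timeout expires.
./bin/storm set_log_level my_topology -r ROOT
```

The same change can be made from the topology page in the Storm UI, which offers the identical logger-name, level, and timeout fields.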
The log file viewer added in the Apache Storm 0.9.1 release made accessing Storm’s log files significantly easier, but in some cases still required examining individual log files one by one. In Storm 1.0 the UI now includes a powerful search feature that allows you to search a specific topology log file, or across all topology log files in the cluster, even archived files.
When performing a topology-wide search, the UI will search across all supervisor nodes for a match. The search results include a link to the matching log file, as well as host and port information that allow you to quickly identify on which machine a specific log event occurred. This feature is particularly helpful when trying to track down when and where a particular error occurred.
In the past, it was common practice for developers to insert “debug” bolts or Trident functions into their Storm topologies in order to trace the flow of data through a topology. The problem with this approach was that these “debug” components were usually not meant for production, and removing them necessitated repackaging and redeploying the topology.
The Event Sampling feature introduced in Storm 1.0 eliminates the need for this practice by allowing users to sample a percentage of live data as it flows through a topology, and view and download it directly from the Storm UI. Users can sample data at the topology level, or drill down and sample data from individual spouts and bolts. When you are finished sampling, simply turn it off; there’s no need to stop or redeploy the topology.
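Event sampling relies on dedicated event-logger tasks, which are disabled by default. A minimal configuration sketch, assuming the setting name from the Storm 1.0 event-logging documentation (verify against your Storm version):

```yaml
# storm.yaml or per-topology config: allocate one event logger task per
# topology so the UI's "Debug" action has somewhere to send sampled tuples.
# The default of 0 disables event sampling entirely.
topology.eventlogger.executors: 1
```

With this in place, sampling is toggled per topology or per component from the Storm UI, and sampled tuples are written to a per-worker events log that the UI links to.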
When debugging or tuning JVM applications for performance and memory usage, a few utilities are invaluable: jmap for capturing heap dumps, jstack for capturing thread stack traces, and Java Flight Recorder for detailed profiling recordings.
Typically these tools are used from the command line on the machine where the JVM application is running. With a distributed system such as Storm, using these tools required logging into the specific machine, identifying the target process, and manually running the appropriate tool.
In Storm 1.0, access to these tools is integrated directly into the Storm UI. Getting a heap dump, jstack stack trace, or Java Flight Recorder recording is as easy as clicking a button and downloading the resulting file. Once downloaded, you can use the analysis and visualization tools of your choice to get an in-depth view into the JVM process.
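For reference, these are the stock JDK commands the UI buttons correspond to. This sketch assumes you have shell access to the worker host, a JDK on the path, and the worker's process id in `$PID`; older Oracle JDKs additionally required the worker to be started with JFR unlocked (`-XX:+UnlockCommercialFeatures -XX:+FlightRecorder`).

```shell
# Thread stack trace of the worker JVM (the "jstack" button).
jstack "$PID" > worker-threads.txt

# Binary heap dump of live objects, for offline analysis in a heap
# analyzer such as Eclipse MAT (the "heap dump" button).
jmap -dump:live,format=b,file=worker-heap.hprof "$PID"

# Start a 60-second Java Flight Recorder recording and write it to a file.
jcmd "$PID" JFR.start duration=60s filename=worker.jfr
```

Before Storm 1.0, each of these steps also required first locating which supervisor host was running the worker of interest and identifying its PID; the UI integration removes both steps.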
While debugging a distributed system such as Storm may not fit everyone’s definition of “fun,” it is frequently necessary. These new enhancements in Storm 1.0 make that job significantly easier than it has been in the past.