October 24, 2017

LogAI – Automated Log Analytics for Validation


Previous blog posts discussed The Matrix — a set of over 27 software components that need to work together as part of any big data infrastructure. The automation suite used to perform functional validation of these components consists of over 30,000 tests which are divided into 250+ logical groups (called splits). The splits are executed concurrently on an internal container cloud (see: YInception). Each split consists of a set of tests that verifies entire features using actual instances of the services involved — no mocks are used in these functional tests. Each split spins up a test cluster, deploys and configures a set of services and executes the tests using a test automation framework.

Automated Log Analysis

With this many tests, analyzing the results by hand is impractical, so the analysis is automated as well. The tool that automates test result analysis at Hortonworks is called LogAI.


The Processor is the brain of the system. It is a component that is constantly learning to correlate facts and events. Examples of facts include configuration information associated with a specific version of a component, cluster attributes such as the number of nodes and the quantity and types of resources attached to them (such as CPU, memory, storage, network interfaces), cluster location (on-premise data center, cloud infrastructure), etc. Examples of events include errors reported by a component or a test, observed information such as latencies, etc.


As described previously, test suites contain tests grouped into splits. Each split (which contains a set of tests executed in sequence) is run in a cluster of its own — the cluster is created for that specific run of the split and is not reused. The cluster only contains those components that are needed for the functionality being tested. Needless to say, a given component may be part of multiple functional tests.

Tests fall into a few different categories — release tests are executed to perform end-to-end functional validation of the stack; CI tests are executed on a subset of the stack as part of validating individual code commits; there are also nightly tests that are executed on smaller subsets of components.

A given run of a test suite has a specific objective (for example, functional validation of HDP stack version 2.7.0). Each such run is therefore assigned a unique run identifier. As individual splits are executed, the Processor records the facts associated with each split and the events observed, and persists them against the run identifier.
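The bookkeeping described above can be pictured with a minimal sketch. It assumes an in-memory store; the class names, field names, and identifiers such as `run-001` are hypothetical and are not taken from LogAI itself.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SplitRecord:
    """Facts and events captured for one split (hypothetical layout)."""
    facts: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


class RunStore:
    """In-memory stand-in for a persistence layer keyed by run identifier."""

    def __init__(self):
        # run_id -> split name -> SplitRecord
        self._runs = defaultdict(dict)

    def _split(self, run_id, split):
        return self._runs[run_id].setdefault(split, SplitRecord())

    def record_fact(self, run_id, split, key, value):
        self._split(run_id, split).facts[key] = value

    def record_event(self, run_id, split, component, message):
        self._split(run_id, split).events.append((component, message))

    def facts_for_split(self, run_id, split):
        return dict(self._split(run_id, split).facts)

    def events_for_run(self, run_id):
        """Return (split, component, message) for every event in the run."""
        return [
            (split, component, message)
            for split, record in self._runs[run_id].items()
            for component, message in record.events
        ]


store = RunStore()
store.record_fact("run-001", "split-hdfs-01", "node_count", 5)
store.record_event("run-001", "split-hdfs-01", "hdfs", "DataNode connection refused")
store.record_event("run-001", "split-hive-02", "hive", "query timeout")
print(store.events_for_run("run-001"))
```

Keying everything by run identifier first means all splits of one run can be analyzed together, which is what the later correlation steps need.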

Correlation Models

  • Frequency:

The number of times an event is observed within a given time window can be used to diagnose problems in a component. For example, a small number of packet drops when reading from HDFS during a shuffle operation in map-reduce is a common occurrence and does not necessarily indicate trouble. Packet drop events are taken into account only if their occurrence count exceeds a certain threshold.
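As a rough illustration of the frequency idea, the sketch below flags an event type only when more than a threshold number of occurrences fall inside a sliding time window. The function name, the window length, and the threshold values are assumptions made for this example, not LogAI parameters.

```python
from collections import deque


def exceeds_threshold(timestamps, window_seconds, threshold):
    """Return True if more than `threshold` events fall in any sliding window."""
    window = deque()
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events older than the window that ends at `ts`.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            return True
    return False


# Hypothetical packet-drop timestamps, in seconds since the split started.
packet_drops = [1, 2, 50, 120, 121, 122, 123, 124]

print(exceeds_threshold(packet_drops, window_seconds=10, threshold=3))
print(exceeds_threshold(packet_drops, window_seconds=10, threshold=5))
```

With a threshold of 3 the burst around the 120-second mark is reported, while a threshold of 5 treats the same data as normal background noise.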

  • Co-occurrence:

Co-occurrence of events, facts, and fact-event combinations is used by the Processor for learning. There are a number of different dimensions to this, some of which are illustrated by the following examples:

  • A given component may be a part of multiple functional tests (for example, multiple services depending on it). A problem with this component may manifest itself as problems with the other components that depend on it. The Processor is capable of determining that the root cause of the failures on the dependent components is the failure in the common component.
  • Events that occur at the same (or close enough) time window may also be correlated — for example, Garbage Collection pauses in the JVM of a given service may cause a spike in latencies or even communication errors on another, dependent service.
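Both examples can be sketched with simple rules. The code below is only an illustration under stated assumptions: the dependency map is supplied by hand, the component names and timestamps are invented, and the Processor described in this post learns such correlations rather than applying fixed rules like these.

```python
def transitive_deps(component, depends_on):
    """Collect every component that `component` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(component, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def root_cause_candidates(failed, depends_on):
    """Return failed components that have no failed dependency of their own."""
    failed = set(failed)
    return {c for c in failed if not (transitive_deps(c, depends_on) & failed)}


def events_within(cause_times, effect_times, max_gap):
    """Pair each cause time with effect times that follow it within `max_gap`."""
    return [
        (cause, effect)
        for cause in cause_times
        for effect in effect_times
        if 0 <= effect - cause <= max_gap
    ]


# Hypothetical dependency map: component -> components it depends on.
depends_on = {
    "hive": ["hdfs", "yarn"],
    "hbase": ["hdfs", "zookeeper"],
    "yarn": ["hdfs"],
}
print(root_cause_candidates(["hive", "hbase", "hdfs"], depends_on))

# Hypothetical timestamps (seconds) of GC pauses and latency spikes.
gc_pauses = [100.0, 400.0]
latency_spikes = [101.5, 250.0, 402.0]
print(events_within(gc_pauses, latency_spikes, max_gap=5.0))
```

In the first call only `hdfs` remains as a candidate, because both `hive` and `hbase` depend on a component that also failed. In the second call the spike at 250.0 is left unpaired because no GC pause precedes it closely enough.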

Correlation models are used by the Processor to de-dupe problems seen across splits in a given run. In a typical run, the number of errors generated is of the order of 300k – 400k. After de-duping by the Processor, the number of distinct errors reported is of the order of 50k – 70k across the entire stack of 27+ components. The tool has helped QE reduce test analysis time from several days to a few hours.
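To give a feel for de-duplication across splits, here is a deliberately simplified sketch. It does not use correlation models; it swaps in a much cruder technique, grouping errors by component and by a normalized message in which numbers and hex values are masked. All names, patterns, and sample messages are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Variable parts of a message that should not make two errors look different.
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUMBER = re.compile(r"\d+")


def signature(message):
    """Mask hex values and numbers so similar messages share one signature."""
    message = _HEX.sub("<hex>", message)
    message = _NUMBER.sub("<n>", message)
    return message.strip().lower()


def dedupe(errors):
    """Group (split, component, message) tuples by component and signature."""
    groups = defaultdict(list)
    for split, component, message in errors:
        groups[(component, signature(message))].append(split)
    return dict(groups)


errors = [
    ("split-01", "hdfs", "Connection refused to 10.0.0.12:8020"),
    ("split-07", "hdfs", "Connection refused to 10.0.0.45:8020"),
    ("split-07", "hive", "Query 42 timed out after 300 s"),
]
distinct = dedupe(errors)
for (component, sig), splits in distinct.items():
    print(component, sig, splits)
```

Three raw errors collapse into two distinct ones here, and each distinct error keeps the list of splits in which it was seen.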

User Interface

The LogAI system provides a simple user interface for QE to browse the results of test runs. For each identified failure, the system provides a probable root cause. The UI also allows direct access to the associated test output artifacts and to specific error messages in log files that helped determine the cause.

Dashboard view for a given split involving a number of components.

Components with error (red dots) have popups showing related information.

The popup leads to the RCA view, where a histogram of components and associated errors is shown.

The RCA detail view shows individual errors and the component where the error originated.

Future Work

Ongoing work in LogAI involves the following:

  • Improving correlation models by tapping more input sources for facts and events.
  • Mining data from defect tracking systems.
  • Understanding component dependencies for improved root cause determination.


Stay tuned!




