Previous blog posts discussed The Matrix — a set of over 27 software components that need to work together as part of any big data infrastructure. The automation suite used to perform functional validation of these components consists of over 30,000 tests which are divided into 250+ logical groups (called splits). The splits are executed concurrently on an internal container cloud (see: YInception). Each split consists of a set of tests that verifies entire features using actual instances of the services involved — no mocks are used in these functional tests. Each split spins up a test cluster, deploys and configures a set of services and executes the tests using a test automation framework.
Analyzing such a large number of test results by hand is impractical, so the analysis is automated as well. The tool that automates test result analysis at Hortonworks is called LogAI.
The Processor is the brain of the system: it continuously learns correlations between facts and events. Examples of facts include configuration information associated with a specific version of a component, cluster attributes such as the number of nodes and the quantity and types of resources attached to them (CPU, memory, storage, network interfaces), and cluster location (on-premise data center or cloud infrastructure). Examples of events include errors reported by a component or a test, and observed measurements such as latencies.
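For illustration, here is a minimal sketch of how facts and events could be modeled. The class and field names below are assumptions made for this post, not LogAI's actual schema.

```python
# Illustrative sketch only: hypothetical shapes for the Processor's inputs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """Static context about a run, e.g. a config value or cluster attribute."""
    name: str    # e.g. "cluster.node_count" or "hdfs.version" (hypothetical keys)
    value: str   # e.g. "12" or "2.7.0"

@dataclass(frozen=True)
class Event:
    """Something observed during a run, e.g. an error or a latency sample."""
    source: str       # component or test that reported the event
    kind: str         # e.g. "error", "packet_drop", "latency"
    detail: str       # raw message or measured value
    timestamp: float  # epoch seconds when the event was observed
```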
As described previously, test suites contain tests grouped into splits. Each split (which contains a set of tests executed in sequence) is run in a cluster of its own — the cluster is created for that specific run of the split and is not reused. The cluster only contains those components that are needed for the functionality being tested. Needless to say, a given component may be part of multiple functional tests.
Tests fall into a few different categories: release tests perform end-to-end functional validation of the stack; CI tests run on a subset of the stack to validate individual code commits; and nightly tests run on smaller subsets of components.
A given run of a test suite has a specific objective (for example, functional validation of HDP stack version 2.7.0). Each run is therefore assigned a unique run identifier. As individual splits are executed, the Processor records the facts associated with each split and the events observed, and persists them against the run identifier.
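As a rough sketch of this bookkeeping, the toy in-memory store below keys everything by run identifier. The names, the run identifier, and the split name are all hypothetical; the post does not describe LogAI's actual persistence layer.

```python
# Illustrative sketch only: a toy in-memory store keyed by run identifier.
from collections import defaultdict

class RunStore:
    def __init__(self):
        # run_id -> split_name -> {"facts": [...], "events": [...]}
        self._runs = defaultdict(dict)

    def record_split(self, run_id, split_name, facts, events):
        """Persist the facts and events observed for one split of a run."""
        self._runs[run_id][split_name] = {"facts": facts, "events": events}

    def splits_for_run(self, run_id):
        return self._runs[run_id]

store = RunStore()
store.record_split(
    run_id="HDP-2.7.0-func-20190401",      # hypothetical run identifier
    split_name="hdfs-mapreduce-split-042",  # hypothetical split name
    facts=[("cluster.node_count", "12")],
    events=[("hdfs", "packet_drop", "read timeout during shuffle", 1554100000.0)],
)
```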
The number of times an event is observed within a given time window can be used to diagnose problems in a component. For example, a small number of packet drops when reading from HDFS during a shuffle operation in map-reduce is a common occurrence and does not necessarily indicate trouble. Packet drop events are taken into account only if their occurrence count exceeds a certain threshold.
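A minimal sketch of such threshold-based filtering is shown below, assuming events carry epoch timestamps. The window size and threshold are made-up values for illustration; the post does not state what LogAI actually uses.

```python
# Illustrative sketch: count occurrences of one event kind in a sliding
# time window and only flag it once the count exceeds a threshold.
from collections import deque

class ThresholdFilter:
    def __init__(self, window_seconds=60.0, threshold=50):
        self.window = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def observe(self, timestamp):
        """Record one occurrence; return True if the count in the current
        window now exceeds the threshold and warrants diagnosis."""
        self.timestamps.append(timestamp)
        # Drop occurrences that have fallen out of the window.
        while self.timestamps and timestamp - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold

packet_drops = ThresholdFilter(window_seconds=60.0, threshold=50)
# A handful of drops during a shuffle stays below the threshold and is
# ignored; only a sustained burst would trip the filter.
print(any(packet_drops.observe(t * 0.1) for t in range(10)))  # False
```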
The Processor learns from the co-occurrence of events, facts, and fact-event combinations. There are a number of different dimensions to this, some of which are illustrated by the following examples:
Correlation models are used by the Processor to de-dupe problems seen across splits in a given run. A typical run generates on the order of 300k–400k errors; after de-duping, the Processor reports on the order of 50k–70k distinct errors across the entire stack of 27+ components. The tool has helped QE reduce test analysis time from several days to a few hours.
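The correlation models themselves are not described here. As a stand-in, the sketch below shows a far simpler signature-based de-dupe: volatile tokens (numbers, hex ids, hostnames) are stripped so that repeats of the same underlying error collapse into one bucket. The regex and the example messages are invented for illustration.

```python
# Illustrative sketch only: de-dupe errors by a normalized message signature.
import re
from collections import Counter

VOLATILE = re.compile(r"(0x[0-9a-f]+|\d+|[\w.-]+\.example\.com)")

def signature(error_message: str) -> str:
    """Normalize an error message into a de-dupe key."""
    return VOLATILE.sub("<*>", error_message.lower())

errors = [
    "Connection to node7.example.com timed out after 30000 ms",
    "Connection to node9.example.com timed out after 30000 ms",
    "Block blk_1073741825 missing replica",
]
buckets = Counter(signature(e) for e in errors)
print(len(buckets))  # 2 distinct errors from 3 raw messages
```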
The LogAI system provides a simple user interface for QE to browse the results of test runs. For each identified failure, the system provides a probable root cause. The UI also allows direct access to the associated test output artifacts and to specific error messages in log files that helped determine the cause.
Dashboard view for a given split involving a number of components.
Components with errors (red dots) have popups showing related information.
The popup leads to the RCA view, where a histogram of components and their associated errors is shown.
The RCA detail view shows individual errors and the component where the error originated.
Ongoing work in LogAI involves the following: