In this Hortonworks’ partner guest blog, Jorik Blaas, chief technical officer at SynerScope, explores a use case in a new class of exploratory analytics, using Apache Spark on YARN, HDP and SynerScope.
SynerScope is a pioneering developer of fast, sense-making Big Data analytics technology. Focusing on human-in-the-loop analytics, we excel at combining heterogeneous data sources to enable a new class of exploratory analytics. By leveraging the Hortonworks Data Platform (HDP) through Apache Spark on YARN, we are able to bring agile, lock-in-free analytics at scale to our market.
For this introduction we’d like to show a number of ways to look at generally structured data and provide a brief glimpse of how combining multiple independent sources can lead to novel insights. The examples use publicly available data, so that they can be reproduced within any modern PySpark environment. Our customers often combine large numbers of data sources to find patterns in their data. They are keen on not only finding patterns, but also understanding them. For this, Hadoop gives us the platform to contextualize the findings with rich sources in the form of documents, external feeds, images, videos, sensor readings, etc. By bringing all these sources together in one single visual environment, we can provide your brain with as many flavors of the ‘truth’ as possible, because every single one could fail you in your sense-making objective.
We are using Spark on top of YARN, within the HDP ecosystem. Through a PySpark notebook, the data is loaded from CSV files into Apache Hive, where SQL queries are run; the results flow back into PySpark, where we use seaborn (http://stanford.edu/~mwaskom/software/seaborn/) and matplotlib (http://matplotlib.org) to build a number of visualizations.
Accompanying this blog post is a Python notebook that can be used to run the exact same analytics. We will show how correlations and distributions can be used in direct visualizations for sense-making, but also how deeper analytical methods can be used for similarity clustering of the same data.
The number of open-data initiatives is staggering, and the UK has recently released a dataset covering crimes reported to police forces across the entire country. The data is available as CSV files through the police data website, and historical files can be easily downloaded (https://data.police.uk/data/).
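In the accompanying notebook these CSVs are loaded into Hive through Spark; as a minimal, self-contained sketch of the same load, the snippet below reads an inlined two-record sample with pandas. The column names match the published CSV format, but the values are illustrative only.

```python
import io
import pandas as pd

# Two toy records in the published CSV layout (a subset of its columns);
# real files from data.police.uk contain one such row per reported crime.
csv_text = """Month,Reported by,Falls within,Crime type,Last outcome category,LSOA name,Location
2014-06,Metropolitan Police Service,Metropolitan Police Service,Burglary,Offender sent to prison,Ashford 006C,On or near Foster Road
2014-07,Metropolitan Police Service,Metropolitan Police Service,Drugs,Under investigation,Ashford 006C,On or near Pound Lane
"""

# In the notebook this is a Spark CSV read followed by a save into Hive;
# pandas is used here only to keep the sketch runnable anywhere.
crimes = pd.read_csv(io.StringIO(csv_text))
print(crimes[["Month", "Crime type", "Last outcome category"]])
```

From here the same group-by and pivot operations shown below apply, whether the frame lives in pandas or in a Hive table queried through Spark SQL.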
Each data record looks like this:
| Reported by | Metropolitan Police Service |
| Falls within | Metropolitan Police Service |
| Last outcome category | Offender sent to prison |
| LSOA name | Ashford 006C |
| Location | On or near Foster Road |
A quick inspection reveals that the exact timestamp is not available; the time resolution is limited to a year/month pair. The geographical locations are stored and grouped into regions. Each crime is classified as a specific type, and when the outcome of the investigation is known, the field ‘Last outcome category’ describes what happened with the investigation.
To get a feeling for the contents of the data, we start from both the bottom and the top. First we inspect a few records to get a feel for the contents of the data, after which we are ready to pick a few aggregates to look at the general distributions within the entire set.
For a chosen crime-type, we might be interested in the outcomes of the police investigations. We would expect these outcomes (ranging from ‘Offender sent to prison’ to ‘Unable to prosecute subject’) to be strongly related to the type of criminal activity. A first glimpse into the data is to pick a crime type and look at the distribution of outcomes.
By using the Spark SQL count and group by methods, combined with the pylab.pie plotter, we can quickly get a feel for whether these differences actually exist in the way we suspected.
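In the notebook the counts come from a Spark SQL GROUP BY over the full dataset; the sketch below reproduces the same aggregation and pie plot on a handful of toy records so the step is visible end to end.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")            # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Toy records standing in for the query result; values are illustrative only.
df = pd.DataFrame({
    "crime_type": ["Drugs", "Drugs", "Drugs", "Shoplifting", "Shoplifting"],
    "outcome": ["Offender sent to prison", "Offender sent to prison",
                "Under investigation", "Under investigation",
                "No suspect identified"],
})

# Equivalent of:
#   SELECT outcome, COUNT(*) FROM crimes WHERE crime_type = 'Drugs' GROUP BY outcome
counts = df[df["crime_type"] == "Drugs"].groupby("outcome").size()

# One pie per crime type makes the outcome mix directly comparable.
plt.pie(counts.values, labels=counts.index, autopct="%1.0f%%")
plt.title("Outcomes for 'Drugs'")
```

Running the same aggregation with a different `crime_type` filter produces the second pie for the comparison.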
This confirms our suspicion that just looking at crime outcomes without looking at the type of crime could put us on the wrong track. The outcomes between ‘Shoplifting’ and ‘Drugs’ differ vastly.
To dive more deeply into this effect, we need something other than a pie chart to visualize the distributions. The tool of choice here is a heatmap with normalized outcome rates. We take the already computed summary statistics, but instead of picking two crime types, we format them into a tabular heatmap form. Each row corresponds to an outcome and each column to a crime type. The contents of the cells show how often that specific crime type ends up with that specific outcome.
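The normalization behind those cell values is a straightforward column-wise division, sketched below on toy counts (the real table has one column per crime type in the dataset):

```python
import pandas as pd

# Counts per (outcome, crime type) -- toy numbers for two types only.
counts = pd.DataFrame(
    {"Shoplifting": [40, 10], "Drugs": [5, 45]},
    index=["No suspect identified", "Offender sent to prison"])

# Column-normalise so each crime type's outcomes sum to 1.0; these rates
# are the cell values shown in the heatmap.
rates = counts / counts.sum(axis=0)

# seaborn renders this table directly, e.g.:
#   sns.heatmap(rates, annot=True, cmap="Blues")
print(rates)
```

Because every column sums to one, a dark cell always means "a large share of this crime type", regardless of how common the crime type is overall.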
So the dark blue cell at the top center with a value of 1.0 means that 100% of the corresponding crimes within the category ‘Anti-Social Behavior’ end up with an empty outcome type. These types of data-quality issues are usually present in real-life data, and may be linked to the underlying registration process or to how the data is entered into the registry.
Looking at the full heatmap we spot a few other trends, for example that ‘Bicycle theft’ is very much like general ‘Theft from the person’, and hardly ever leads to anything but ‘Under investigation’ or ‘No suspect identified’.
Crime types vary in outcome and also in frequency, and we can use these frequency counts to organize the regions by similarity. By aggregating, for each region, the number of crimes of each type, we obtain a list of numbers. This list can be seen as a so-called feature vector, which describes the distribution of crime types within the region. By comparing these feature vectors between regions, we can map out regions that are similar in behavior, even though they do not have to be close to each other geographically.
By using a t-distributed stochastic neighbor embedding (t-SNE), we can visualize the similarities in the natural form of a map:
In the generated map, we can see that City of London Police is very close to the British Transport Police, the Metropolitan Police Service and also the Thames Valley Police. To study this phenomenon deeper is outside the scope of this post, because it requires more context to understand. This task is better suited for interactive analysis within the SynerScope Marcato environment, which gives coordinated geographical, numeric and semi-structured views simultaneously for interactive use.
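The two steps behind that map — building per-region feature vectors and embedding them with t-SNE — can be sketched as follows. The region names and counts below are toy stand-ins; in the notebook the per-region, per-type counts come from a Spark aggregation.

```python
import numpy as np
from sklearn.manifold import TSNE

# One row per region, one column per crime type (toy counts).
regions = ["City of London", "British Transport", "Metropolitan",
           "Thames Valley", "Devon & Cornwall"]
features = np.array([
    [120, 30, 5],
    [115, 28, 7],
    [130, 35, 4],
    [100, 25, 10],
    [10, 80, 60],
], dtype=float)

# Normalise each row to a distribution so region size does not dominate
# the similarity -- we compare the *mix* of crime types, not the volume.
features /= features.sum(axis=1, keepdims=True)

# t-SNE embeds the feature vectors into 2-D; nearby points are regions
# with similar crime-type distributions. Perplexity must stay below the
# number of samples, hence the small value for this toy set.
emb = TSNE(n_components=2, perplexity=2.0, init="random",
           random_state=0).fit_transform(features)
```

Scatter-plotting `emb` with the region names as labels yields the similarity map discussed above.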
While some crime types are constant over the year, we do expect strong variations over time for others. One way to look at this is to investigate the seasonal correlation of each crime type.
First let’s define the seasonal temperature change by approximating the mean temperature shift with a sine wave.
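A minimal sketch of such a curve: a sine over the twelve months, phased (as an assumption here) so that it peaks in July and bottoms out in January.

```python
import numpy as np

# Months 1..12; the phase shift of 4 puts the peak (+1) at month 7 (July)
# and the trough (-1) at month 1 (January) -- a rough proxy for the
# yearly mean-temperature shift.
months = np.arange(1, 13)
seasonal = np.sin((months - 4) / 12.0 * 2 * np.pi)
```

The exact phase and amplitude are unimportant; only the sign pattern (summer positive, winter negative) matters for the correlation that follows.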
This sine wave serves as a seasonal vector: it defines a basic weighting curve, weighting the summer with positive values and the winter with negative values. By combining this seasonal vector with the actual counts for each month, we can get an indication as to whether a given crime type is mainly happening in the summer or in the winter. For each crime type, a single value is produced that indicates the strength of correlation with the seasonality curve. A value of 1.0 means a strong correlation with the summer, -1.0 strongly correlates with winter, and 0.0 indicates no strong seasonal influence.
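One way to produce that single value per crime type is a Pearson correlation between the monthly counts and the seasonal vector, sketched here on synthetic counts (the real counts come from the month-by-type aggregation in the notebook):

```python
import numpy as np

# Seasonal weighting curve: +1 mid-summer, -1 mid-winter.
months = np.arange(1, 13)
seasonal = np.sin((months - 4) / 12.0 * 2 * np.pi)

# Toy monthly counts: one crime peaking in summer, one in winter.
bike = 100 + 40 * seasonal
pickpocket = 100 - 40 * seasonal

def season_score(counts):
    # Pearson correlation with the seasonal curve: 1.0 = summer crime,
    # -1.0 = winter crime, ~0.0 = no seasonal pattern.
    return float(np.corrcoef(counts, seasonal)[0, 1])

print(season_score(bike), season_score(pickpocket))
```

Computing this score for every crime type and sorting gives the bar plot of seasonal correlations discussed next.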
The plot confirms that Bicycle theft happens mostly in the summer, as it positively correlates with the seasonal curve. Theft from the person, however, is strongly negatively correlated; this type of crime is more prevalent in the winter.
By plotting the raw aggregates in a heat map we can confirm these correlations:
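The month-by-type matrix behind that heatmap is a simple pivot of the aggregated counts; a sketch on two toy rows per type (in the notebook the aggregate rows come from Spark SQL):

```python
import pandas as pd

# Toy (month, crime type, count) aggregates; values are illustrative only.
rows = [("2014-06", "Bicycle theft", 140),
        ("2014-12", "Bicycle theft", 60),
        ("2014-06", "Theft from the person", 70),
        ("2014-12", "Theft from the person", 130)]
agg = pd.DataFrame(rows, columns=["month", "crime_type", "count"])

# Pivot into the month-by-type matrix rendered as a heatmap,
# e.g. via sns.heatmap(table).
table = agg.pivot(index="month", columns="crime_type", values="count")
print(table)
```

The opposite summer/winter pattern of the two columns is exactly what the correlation scores above summarized in a single number each.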
We’ve shown how data can be interpreted through several visual slices, and how correlations with external phenomena such as the weather can help us understand what is happening. PySpark on YARN is the engine that makes these algorithms tick at scale, and using it within a Python notebook gives the analyst a very powerful toolbox, as it allows direct visualization of individual results.