August 20, 2015

Spark Analytics with SynerScope

In this Hortonworks partner guest blog, Jorik Blaas, chief technical officer at SynerScope, explores a use case in a new class of exploratory analytics, using Apache Spark on YARN, HDP and SynerScope.

Preliminaries

SynerScope is a pioneering developer of fast, sense-making Big Data analytics technology. Focusing on human-in-the-loop analytics, we excel at combining heterogeneous data sources to enable a new class of exploratory analytics. By leveraging the Hortonworks Data Platform (HDP) through Apache Spark on YARN, we are able to bring agile, lock-in-free analytics at scale to our market.

Introduction

For this introduction we’d like to show a number of ways to look at generally structured data and provide a brief glimpse of how combining multiple independent sources can lead to novel insights. The examples have been chosen to use publicly available data, so that reproducing and using them can be done within any modern PySpark environment. Our customers often combine large numbers of data sources to find patterns in their data. They are keen on not only finding patterns, but also understanding them. For this, Hadoop gives us the platform to contextualize the findings with rich sources in the form of documents, external sources, images, videos, sensor readings, etc. By bringing all these sources together in one single visual environment, we can provide your brain with as many flavors of the ‘truth’ as possible, because every single one could fail you in your sense-making objective.

We are using Spark on top of YARN, within the HDP ecosystem. Through a PySpark notebook, the data is loaded from CSV files into Apache Hive, where SQL queries are run; the results flow back into PySpark, where we use seaborn (http://stanford.edu/~mwaskom/software/seaborn/) and matplotlib (http://matplotlib.org) to build a number of visualizations.
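A minimal sketch of that pipeline is shown below. The blog dates from 2015, so the original notebook likely used the Spark 1.x HiveContext; the sketch uses the current SparkSession API instead, and the file path, table name and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Spark session with Hive support, so tables written here are visible to Hive on HDP.
spark = (SparkSession.builder
         .appName("uk-crime-exploration")
         .enableHiveSupport()
         .getOrCreate())

# Read the downloaded police.uk CSV files (each file carries a header row).
crimes = (spark.read
          .option("header", "true")
          .csv("/data/uk-crime/*.csv"))

# Register the data as a Hive table so it can be queried with plain SQL.
crimes.write.mode("overwrite").saveAsTable("uk_crimes")

# Run an aggregate query and pull the (small) result back to the Python side,
# where seaborn/matplotlib can plot it.
outcome_counts = spark.sql("""
    SELECT `Crime type`            AS crime_type,
           `Last outcome category` AS outcome,
           COUNT(*)                AS n
    FROM uk_crimes
    GROUP BY `Crime type`, `Last outcome category`
""").toPandas()
```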


Accompanying this blog post is a Python notebook that can be used to run the exact same analytics. We will show how correlations and distributions can be used in direct visualizations for sense-making, but also how deeper analytical methods can be used for similarity clustering of the same data.

The Data

The number of open-data initiatives is staggering; the UK, for example, has recently released a dataset of crimes reported to the police across the entire UK. The data is available as CSV files through their website, and historical files can be easily downloaded (https://data.police.uk/data/).

Each data record looks like this:

Crime ID: 3d08247b3c0a2db23323bd6d7279280b575123b65e1c07da837558161828ddd6
Reported by: Metropolitan Police Service
Falls within: Metropolitan Police Service
Last outcome category: Offender sent to prison
Longitude: 0.902711
Month: 2014-11
Crime type: Drugs
LSOA name: Ashford 006C
Context: (empty)
Latitude: 51.131689
LSOA code: E01023996
Location: On or near Foster Road

A quick inspection reveals that the exact timestamp is not available; we are limited in resolution to a year/month pair. The geographical locations are stored, and are grouped into regions. Each crime is classified as a specific type. When the outcome of the investigation is known the field ‘Last outcome category’ describes what happened with the investigation.

A first dive, using aggregate statistics in Spark

To get a feeling for the contents of the data, we start from both the bottom and the top: first we inspect a few individual records, after which we pick a few aggregates to look at the general distributions within the entire set.

For a chosen crime-type, we might be interested in the outcomes of the police investigations. We would expect these outcomes (ranging from ‘Offender sent to prison’ to ‘Unable to prosecute subject’) to be strongly related to the type of criminal activity. A first glimpse into the data is to pick a crime type and look at the distribution of outcomes.

By using Spark SQL’s group by and count, combined with the pylab.pie plotter, we can quickly get a feel for whether these differences actually exist in the way we suspect.
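A hedged sketch of that step, using the `outcome_counts` frame from the snippet above (the frame name and the chosen crime type are assumptions):

```python
import matplotlib.pyplot as plt

crime_type = "Drugs"  # any crime type present in the data
subset = (outcome_counts[outcome_counts.crime_type == crime_type]
          .dropna(subset=["outcome"]))  # drop records with an empty outcome

plt.figure(figsize=(6, 6))
plt.pie(subset.n, labels=subset.outcome, autopct="%1.1f%%")
plt.title("Investigation outcomes for '%s'" % crime_type)
plt.show()
```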

 

[Figure: pie charts of investigation outcomes for two crime types, ‘Drugs’ and ‘Shoplifting’]

This confirms our suspicion that just looking at crime outcomes without looking at the type of crime could put us on the wrong track. The outcomes between ‘Shoplifting’ and ‘Drugs’ differ vastly.

Visual comparisons: the Heatmap

To dive more deeply into this effect, we need something other than a pie chart to visualize the distributions. The tool of choice here is a heatmap with normalized outcome rates. We take the already computed summary statistics, but instead of picking two crime types, we format them into a tabular heatmap. Each row corresponds to an outcome and each column to a crime type, and the contents of the cells show how often that specific crime type ends up with that specific outcome.
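A minimal sketch of building such a heatmap with seaborn, again assuming the `outcome_counts` frame from the earlier snippets; normalization is done per crime type, as described above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Rows = outcomes, columns = crime types, cells = raw counts.
table = outcome_counts.pivot_table(index="outcome", columns="crime_type",
                                   values="n", aggfunc="sum", fill_value=0)

# Normalize each column so the outcomes of every crime type sum to 1.0.
normalized = table / table.sum(axis=0)

plt.figure(figsize=(12, 8))
sns.heatmap(normalized, cmap="Blues")
plt.title("Share of outcomes per crime type")
plt.tight_layout()
plt.show()
```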

 

[Figure: heatmap of normalized outcome rates per crime type]

So the dark blue cell at the top center with a value of 1.0 means that 100% of the corresponding crimes within the category ‘Anti-Social Behavior’ end up with an empty outcome type. These types of data-quality issues are usually present in real-life data, and may be linked to the underlying registration process or to how the data is entered into the registry.

Looking at the full heatmap we spot a few other trends, for example that ‘Bicycle theft’ behaves very much like general ‘Theft from the person’, and hardly ever leads to anything other than ‘Under investigation’ or ‘No suspect identified’.

Using the summary features to cluster similar regions

Crime types vary in outcome and also in frequency. We can use these frequency counts to organize the regions by similarity. Each region will have a different distribution of crime types, and some of the regions may be similar, even though they are not necessarily geospatially near to each other.

By aggregating, for each region, the number of crimes of each type, we obtain a list of numbers. Such a list can be seen as a so-called feature vector, which describes the distribution of crime types within the region. By comparing these feature vectors between regions, we can map out regions that behave similarly, even though they do not have to be close to each other geographically.
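A sketch of building these feature vectors, assuming the `uk_crimes` Hive table from the first snippet and using the ‘Reported by’ field as the region:

```python
# Count crimes per (region, crime type) pair.
region_counts = spark.sql("""
    SELECT `Reported by` AS region,
           `Crime type`  AS crime_type,
           COUNT(*)      AS n
    FROM uk_crimes
    GROUP BY `Reported by`, `Crime type`
""").toPandas()

# One row per region, one column per crime type: the feature vector.
features = region_counts.pivot_table(index="region", columns="crime_type",
                                     values="n", fill_value=0)

# Normalize each row so regions of very different sizes become comparable distributions.
features = features.div(features.sum(axis=1), axis=0)
```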

By using a t-distributed stochastic neighbor embedding (t-SNE), we can visualize these similarities in the natural form of a map:

[Figure: t-SNE map of police regions, grouped by similarity of their crime-type distributions]
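A sketch of how such a map could be computed from the `features` frame above, assuming scikit-learn is available next to the notebook (the t-SNE parameters are illustrative):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Embed the per-region crime-type distributions into 2D.
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(features.values)

plt.figure(figsize=(8, 8))
plt.scatter(embedding[:, 0], embedding[:, 1])
for (x, y), name in zip(embedding, features.index):
    plt.annotate(name, (x, y), fontsize=7)
plt.title("Regions embedded by crime-type distribution (t-SNE)")
plt.show()
```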

In the generated map, we can see that the City of London Police is very close to the British Transport Police, the Metropolitan Police Service and the Thames Valley Police. Studying this phenomenon more deeply is outside the scope of this post, because it requires more context to understand. This task is better suited for interactive analysis within the SynerScope Marcato environment, which provides coordinated geographical, numeric and semi-structured views simultaneously for interactive use.
[Figure: the SynerScope Marcato environment with coordinated views]

Looking at the Seasons

While some crime types are constant over the year, we do expect strong variations over time for others. One way to look at this data is to investigate the seasonal correlation of each crime type.

First let’s define the seasonal temperature change by approximating the mean temperature shift with a sine wave.
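A minimal sketch of such a curve; the exact phase and amplitude used in the original notebook are not shown here, so peaking in July and bottoming out in January is an assumption:

```python
import numpy as np

months = np.arange(1, 13)                            # 1 = January ... 12 = December
seasonal = np.sin((months - 4) / 12.0 * 2 * np.pi)   # ~ +1 in July, ~ -1 in January
```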

 

 

[Figure: the seasonal sine-wave weighting curve]

This sine wave serves as a seasonal vector, as it defines a basic weighting curve: summer months receive positive weights and winter months negative ones. By combining this seasonal vector with the actual counts for each month, we can get an indication of whether a given crime type mainly happens in the summer or in the winter. For each crime type, a single value is produced that indicates the strength of its correlation with the seasonality curve. A value of 1.0 means a strong correlation with summer, -1.0 a strong correlation with winter, and 0.0 no strong seasonal influence.
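A sketch of that correlation step, reusing the `seasonal` vector above and the `uk_crimes` table, and assuming every month of the year is present in the data:

```python
import numpy as np

# Aggregate counts per crime type and calendar month (the 'Month' field is 'YYYY-MM').
monthly = spark.sql("""
    SELECT `Crime type` AS crime_type,
           CAST(SUBSTR(`Month`, 6, 2) AS INT) AS month,
           COUNT(*) AS n
    FROM uk_crimes
    GROUP BY `Crime type`, CAST(SUBSTR(`Month`, 6, 2) AS INT)
""").toPandas()

# Rows = crime types, columns = months 1..12, cells = counts.
counts = monthly.pivot_table(index="crime_type", columns="month",
                             values="n", aggfunc="sum", fill_value=0)

# Pearson correlation of each crime type's monthly profile with the seasonal curve:
# +1.0 ~ summer crime, -1.0 ~ winter crime, ~0.0 ~ no clear seasonal effect.
season_corr = counts.apply(lambda row: np.corrcoef(row.values, seasonal)[0, 1], axis=1)
print(season_corr.sort_values())
```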

 

The plot confirms that bicycle theft happens mostly in the summer, as it correlates positively with the seasonal curve. Theft from the person, however, is strongly negatively correlated, so this type of crime is more prevalent in the winter.

[Figure: correlation of each crime type with the seasonal curve]

By plotting the raw aggregates in a heat map we can confirm these correlations:
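(A minimal sketch of producing such a heatmap, reusing the `counts` frame from the previous snippet; the styling of the original figure is unknown.)

```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(counts, cmap="YlGnBu")
plt.xlabel("Month")
plt.ylabel("Crime type")
plt.show()
```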

 

 

 

Recap

We’ve shown how data can be interpreted through several visual slices, and how correlations with external phenomena such as the weather can help us understand what is happening. PySpark on YARN is the engine that makes these algorithms tick at scale, and using it within a Python notebook gives the analyst a very powerful toolbox, as it allows direct visualization of individual results.
