August 17, 2017

Model as a Service: Modern Streaming Data Science with Apache Metron

The Motivation

Many cybersecurity problems are also big data problems.   What is more and more apparent, though, is that these problems are also problems solved by data science.  The modern cybersecurity practitioner solves cybersecurity problems by making sense of data.  This is firmly within the sweet spot where data science meets big data and reaps rewards.  The Metron team at Hortonworks is keenly aware of this relationship as we are building a cybersecurity platform atop big data infrastructure.  While there are many great platforms to use for dealing with the volume and velocity of cybersecurity data, we saw some gaps in the tooling that the modern security practitioner and data scientist can use.

The data scientist’s toolbox is filled with tools for constructing models.  Indeed, as a data scientist, I feel positively overwhelmed by the variety of options available for the construction of useful machine learning models.  There are options in the favored languages of data scientists and more are coming every day.  What is woefully missing, however, is the piece that comes next: using models productively and in an integrated way with the existing data processing infrastructure.  This becomes especially challenging when your data scientists are polyglot and opinionated about the libraries that they want to use.

This practical situation affects most data science teams. We designed Metron to address this situation and enable the working, productive data scientist to deploy their custom models into production and use them within Metron.  We do this via a piece of Metron called Model as a Service.

The Solution

It became clear that such a solution, for us, given our position comfortably nestled in Hadoop’s loving embrace, would have the following characteristics:

  • Data Science First – Let the data scientists choose the right tool for their models.
  • Cluster First – We exist on a cluster which has a resource manager.  We, therefore, should work with YARN.
  • Streaming First – Metron, at its core, is a streaming application.  New models should be immediately discoverable by interested parties without restarting the service.
  • Metron First – Interactions with models should be seamless within our core scripting framework: Stellar.

This solution breaks down into a clean separation of 3 components:

  • Deployment Service – Allows model instances to be started or stopped, and registers the model locations for discovery within Metron.
  • Deployment Client – Takes model collateral (scripts, model binaries, etc.) and submits it to the deployment service along with the number of instances and their size to start on the cluster.
  • Stellar Function – Finds locations of named models on the cluster and interacts with a model instance in a load-balanced way.

Models are expected to expose themselves as a REST endpoint.  It became apparent that in the dominant use case, the security data scientist will create a model, train it on historical data, and then interact with this model from Metron. It is not unreasonable to expect that a data scientist could provide such an interface given that REST is well-supported in the popular languages that data scientists prefer to use.  Hurdling this low bar, we can take it from there, deploy it onto the cluster, then find and interact with the instances from Metron.

Domain Generating Algorithms

Metron, general purpose data science infrastructure aside, is in the business of cybersecurity data analysis.  It is fitting, therefore, that we motivate this bit of data science infrastructure by showing a compelling cybersecurity use case.

Domain Generating Algorithms are used by botnets to communicate with compromised computers in an evasive way.  Traffic to a fixed command and control host would be noticed and firewall rules could be modified to cut communication channels cleanly.  Instead, the botnet command and control hosts must move around to evade detection.  Typically this involves periodically generating a synthetic domain in a repeatable way and having the compromised computers attempt to connect to a few candidate synthetic domains daily with some moderate hope of guessing the right one.  This traffic is small enough to be lost in the shuffle of a large organization, making the evasion effective.
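
To make the "repeatable" part concrete, here is a toy sketch of how a bot and its controller could independently derive the same candidate hostnames from a shared seed and the current date. It is purely illustrative: the function name, the seed, and the hashing scheme are assumptions made for this example, not the algorithm of any particular botnet.

```python
import hashlib
from datetime import date
from typing import List


def candidate_domains(seed: str, day: date, count: int = 5, length: int = 12) -> List[str]:
    """Derive `count` pseudo-random hostnames (without TLD) from a seed and a date."""
    domains = []
    for i in range(count):
        # Both sides hash the same inputs, so both arrive at the same names
        # without ever exchanging them.
        digest = hashlib.sha256(f"{seed}:{day.isoformat()}:{i}".encode()).hexdigest()
        # Map each hex digit onto a lowercase letter ('a' through 'p').
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
        domains.append(label)
    return domains


todays_candidates = candidate_domains("hypothetical-seed", date(2017, 8, 17))
```

The resulting labels look nothing like names a human would register, which is the property a classifier can try to exploit.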

Traditionally this effort was combatted by reverse-engineering the compromised hosts’ binaries and extracting rules and blacklists (or getting a threat intelligence feed with this info) well after 0-day.  Unfortunately, this is help that comes too late and is too rigid to adapt to new threats.  Recently, there is hope that we may be able to reframe this issue as a classification problem suitable for tackling via machine learning.  Not unlike distinguishing between “spam” and “ham” emails, we may be able to learn the patterns that differentiate between synthetic and normal traffic in a way that can quickly generalize to new threats.
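
As a rough illustration of what "learning the patterns" might start from, the sketch below turns a hostname into a few numeric features that a classifier could consume. The feature set (length, character entropy, vowel and digit ratios) is an assumption chosen for this example and is not necessarily what any particular DGA model uses.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character, from the empirical character distribution."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def domain_features(host: str) -> dict:
    """Compute simple, illustrative features for a hostname without its TLD."""
    host = host.lower()
    size = len(host)
    vowels = sum(ch in "aeiou" for ch in host)
    digits = sum(ch.isdigit() for ch in host)
    return {
        "length": size,
        "entropy": shannon_entropy(host),
        "vowel_ratio": vowels / size if size else 0.0,
        "digit_ratio": digits / size if size else 0.0,
    }
```

Synthetic names tend to spread their characters more evenly than dictionary-like names, so a feature such as entropy gives a classifier something to separate on; in practice it would be one signal among several.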

The Model

At the BSides DFW Conference in 2013, ClickSecurity presented a model written in Python for detecting synthetic domains.  In order to demonstrate its integration with Metron, I’ll walk through exposing the model as generated by their script.  ClickSecurity output a few model binaries suitable for use with the popular Python data science library scikit-learn.  For convenience, I’ve wrapped the model application into a class called DGA with a function evaluate_domain, which evaluates a hostname without its TLD and returns legit or dga.

What remains is taking this class and exposing the evaluate_domain function as a REST endpoint.  Thankfully, within Python, this is extremely easy using Flask:

import json
import socket

from flask import Flask, jsonify, request

import model

app = Flask(__name__)
# The model object to expose; created at startup in the main block below.
dga_model = None

@app.route("/apply", methods=['GET'])
def predict():
  # We expect one argument, the hostname without TLD.
  h = request.args.get('host')
  r = {}
  r['is_malicious'] = dga_model.evaluate_domain(h)
  # We will return a JSON map with one field, 'is_malicious' which will be
  # 'legit' or 'dga', the two possible outputs of our model.
  return jsonify(r)

if __name__ == "__main__":
  # Create my model object that I want to expose.
  dga_model = model.DGA()
  # In order to register with model as a service, we need to bind to a port
  # and inform the discovery service of the endpoint. Therefore,
  # we will bind to port 0 so the OS picks a free port, note which one we got,
  # and close the socket so that flask can bind to that port below.
  sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  sock.bind(('localhost', 0))
  port = sock.getsockname()[1]
  sock.close()
  with open("endpoint.dat", "w") as text_file:
    # To inform the discovery service, we need to write a file with a simple
    # JSON Map indicating the full URL that we've bound to.
    json.dump({"url": "http://0.0.0.0:%d" % port}, text_file)
  # Make sure flask uses the port we reserved
  app.run(threaded=True, host="0.0.0.0", port=port)
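
Before deploying, it can be handy to check the service by hand. The following is a minimal client sketch using only the Python standard library; it assumes the service above is running locally and that endpoint.dat is in the current directory. The helper names (read_endpoint, apply_url, score) are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def read_endpoint(path: str = "endpoint.dat") -> str:
    """Return the base URL that the service wrote for the discovery service."""
    with open(path) as f:
        return json.load(f)["url"]


def apply_url(base_url: str, host: str) -> str:
    """Build the GET URL for the /apply route with the 'host' query argument."""
    return base_url.rstrip("/") + "/apply?" + urlencode({"host": host})


def score(base_url: str, host: str, timeout: float = 5.0) -> str:
    """Call the running service and return the 'is_malicious' field of its reply."""
    with urlopen(apply_url(base_url, host), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["is_malicious"]


# Example usage, with the service running:
#   print(score(read_endpoint(), "google"))
```

Note that the URL written to endpoint.dat uses 0.0.0.0; if your platform does not accept that as a client-side address, substituting localhost with the same port should reach the same service.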

Model Deployment

Now that we have our model built and collateral created, we can deploy the model using Model as a Service.

Scoring with Stellar

In Metron, our primary mechanism for enrichment or transformation is via a scripting language named Stellar.  This is a very simple language intended to allow for the most common transformation tasks.  It supports:

  • simple arithmetic operations
  • conditional operations
  • user-defined functions
  • function composition

As stated earlier, there are Stellar functions for interacting with ZooKeeper and calling out to deployed models.  First we will see them operate from the Stellar shell to get a flavor of the functions.

Streaming Scoring

Now that we know how to interact with our model via Stellar, we can use our model anywhere that Stellar is used in Metron.  Let’s look at how we can use our model to enrich some squid proxy data to indicate whether a domain was generated by a DGA.  There is one hitch, however: the model returns a map.  What we really want is a field ‘dga_status’ which is either ‘dga’ or ‘legit’, so the Stellar function becomes:

MAP_GET('is_malicious', MAAS_MODEL_APPLY(MAAS_GET_ENDPOINT('dga'), {'host' : domain_without_tld}))

That’s a lot to take in, so let’s break it down:

  • MAAS_GET_ENDPOINT – Returns the Endpoint location for the model
  • MAAS_MODEL_APPLY – Communicates with the model via REST and returns the results, in our case a Map with a field called ‘is_malicious’
  • MAP_GET – Returns the value associated with a key from a map; the key in question is ‘is_malicious’
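
For readers more at home in Python than Stellar, the data flow of that expression can be mimicked roughly as follows. This is only an analogy under stated assumptions: the three functions are hypothetical stand-ins rather than Metron code, the endpoint is simplified to a plain URL string, random choice stands in for whatever load balancing Metron actually performs, and the HTTP call is replaced by a canned reply so the sketch runs offline.

```python
import random
from typing import Callable, Dict, List


def get_endpoint(registry: Dict[str, List[str]], name: str) -> str:
    """Stand-in for MAAS_GET_ENDPOINT: pick one registered instance of a named model."""
    # Random choice is one simple way to spread load (an assumption of this sketch).
    return random.choice(registry[name])


def model_apply(endpoint: str, args: Dict[str, str],
                http_get: Callable[[str, Dict[str, str]], dict]) -> dict:
    """Stand-in for MAAS_MODEL_APPLY: send the arguments over REST, return the model's map."""
    return http_get(endpoint + "/apply", args)


def map_get(key: str, mapping: dict):
    """Stand-in for MAP_GET: look up a key; this sketch yields None when it is absent."""
    return mapping.get(key)


def fake_http_get(url: str, params: Dict[str, str]) -> dict:
    # Canned reply in place of a real model instance.
    return {"is_malicious": "legit"}


# Hypothetical registry of deployed instances, keyed by model name.
registry = {"dga": ["http://host-a:1234", "http://host-b:5678"]}

dga_status = map_get(
    "is_malicious",
    model_apply(get_endpoint(registry, "dga"), {"host": "google"}, fake_http_get),
)
```

Reading the nested call from the inside out mirrors the Stellar expression: locate an instance, call it with the hostname, then unwrap the single field we care about.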

Conclusion

Hopefully we have demonstrated both the value of using machine learning in cybersecurity as well as the tooling in Metron to make its use easy for the cybersecurity practitioner. Rather than putting in place hundreds of static firewall rules or relying on the correctness and timeliness of a threat intelligence feed, we can learn from history and hopefully adapt to new threats faster.  This is an important part of the journey to being proactive rather than reactive.

The future is bright for this piece of infrastructure, as there is much to add here, including but not limited to:

  • Robust monitoring for model performance
  • A more intelligent, friendlier user interface
  • Easier, more straightforward ways to interact with models
    • Support for transport protocols other than REST (e.g., TensorFlow Serving)
    • Automatic construction of REST endpoints for models that conform to certain specifications (e.g., Spark ML models, PMML, scikit-learn exported pickle files)

To watch Casey’s session on Model as a Service from DataWorks Summit, click here.
