June 20, 2017

HDF Series Part 4: SAM’s Stream Builder – Building Complex Stream Analytics Apps without Code

Last week, in Part 3 of this blog series, we announced the GA of HDF 3.0 and let the cat out of the bag by introducing a new open source component called Streaming Analytics Manager (SAM), an exciting new technology that helps developers, operators, and business analysts build, deploy, manage, and monitor streaming analytics apps. SAM consists of three modules that cater to three different personas within an organization, as the diagram below illustrates.


In this blog, we will focus on the Stream Builder module, which is aimed at app developers within the enterprise.

The Challenges SAM’s Stream Builder Aims to Solve

Having worked over the last few years with users who built streaming apps on engines such as Storm, Spark Streaming, and others, we have seen most large enterprise customers face the following challenges:

  • Building stream applications requires specialized skill sets that most enterprise organizations don’t have today.
  • Stream applications require a considerable amount of low-level programming, testing, and tuning to productionize.
  • The time it takes to design, develop, test, and deploy into production is extremely high.
  • Key streaming primitives such as joining/splitting streams, aggregations over windows of time, and pattern matching are difficult to implement.
  • Service discovery and configuration of big data services are complex.

SAM’s Stream Builder aims to solve each of these challenges. Read on!

SAM’s Service Pool and Environment Concept

A typical stream app works with a number of different big data services to create streams, store events, and perform analytics. Common big data services used in streaming apps include Kafka, HDFS, HBase, Spark, Cassandra, SOLR, Elastic, etc. The app developer building the stream app has to know the internals of how to work with each of these services. For example, to connect to a Kafka cluster, the app developer has to find out the host and port of each Kafka broker. In other words, service discovery and configuration need to be easier.

SAM solves these problems with the powerful Service Pool feature. The following are the fundamental constructs of Service Pools.

  • Service – An entity that an application developer works with to build stream apps. Examples of services include a Storm cluster that the stream app will be deployed to, a Kafka cluster that the stream app uses to create streams, or an HBase cluster that the stream app writes to.
  • Service Pool – SAM allows you to point to a big data cluster manager (e.g., Ambari) and import all the services in the cluster along with their configurations. The set of services managed by one Ambari cluster represents a service pool.
  • Environment – A named entity that represents a set of services chosen from different service pools. A stream app is assigned to an environment; the app can then use the services associated with that environment.
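The relationship between these three constructs can be sketched in plain Python. This is an illustrative model only; the class and field names below are hypothetical and are not SAM’s actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Service:
    """A single big data service imported from a cluster manager."""
    name: str                                   # e.g. "KAFKA", "STORM", "HBASE"
    config: Dict[str, str] = field(default_factory=dict)

@dataclass
class ServicePool:
    """All services managed by one Ambari cluster."""
    ambari_url: str
    services: Dict[str, Service] = field(default_factory=dict)

@dataclass
class Environment:
    """A named selection of services drawn from one or more pools."""
    name: str
    services: List[Service] = field(default_factory=list)

# Build a pool from an (imagined) Ambari import, then pick services from it
# into an environment that a stream app could be assigned to.
pool = ServicePool("http://ambari.example.com:8080/api/v1/clusters/streamanalytics")
pool.services["KAFKA"] = Service("KAFKA", {"bootstrap.servers": "kafka1:6667"})
pool.services["STORM"] = Service("STORM", {"nimbus.host": "storm1"})

dev_env = Environment("DevEnvironment",
                      [pool.services["KAFKA"], pool.services["STORM"]])
print([s.name for s in dev_env.services])   # ['KAFKA', 'STORM']
```

An app assigned to `dev_env` would see only the Kafka and Storm services selected into that environment, regardless of what else the pool contains.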

Adding Service Pools

With SAM, adding a Service Pool is as easy as entering the Ambari REST endpoint and clicking ‘Auto Add’.

When a service pool is created, all of the configuration needed to manage and connect to the big data services in the pool is imported from Ambari into SAM. If the configuration associated with a service changes in Ambari, the service pool can adopt the new configuration by refreshing the pool, as indicated in the following diagram.
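Conceptually, the import step reads each service’s configuration and derives connection settings from it, such as Kafka’s broker list. A minimal sketch, assuming a deliberately simplified response shape (the real Ambari `/api/v1/clusters/{name}/configurations` API returns much richer JSON):

```python
# Hypothetical, simplified shape of an imported service configuration;
# invented here for illustration only.
ambari_export = {
    "KAFKA": {
        "kafka-broker": {"port": "6667"},
        "hosts": ["kafka1.example.com", "kafka2.example.com"],
    }
}

def kafka_bootstrap_servers(export):
    """Derive a Kafka connection string from imported service config."""
    kafka = export["KAFKA"]
    port = kafka["kafka-broker"]["port"]
    return ",".join(f"{host}:{port}" for host in kafka["hosts"])

print(kafka_bootstrap_servers(ambari_export))
# kafka1.example.com:6667,kafka2.example.com:6667
```

Refreshing a pool amounts to re-running this import against Ambari so the derived settings track the cluster’s current state.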

Creating Environments

An environment is a named entity that represents a set of services chosen from different service pools. When a stream app is assigned to an environment, the app can use the services associated with that environment. To create an environment, give it a name and select services for it from the different service pools.

You can create different environments based on your needs.

Creating a Stream App using Stream Builder Canvas

When creating a new stream app, give it a name and select an Environment. The app can then work with each of the big data services within that Environment.

Stream Builder Canvas

The app developer uses the Stream Builder canvas to build streaming apps by dragging components from the palette, configuring them, connecting components together, and then deploying the app.

There are three types of components on the canvas palette: sources, processors, and sinks. SAM provides a number of out-of-the-box components, as well as a simple SDK with which you can register your own components and have them added to the palette. The processors supported out of the box include:

  • Join – Joins multiple streams together based on a field from each stream. Two join types are supported: inner and left. Joins are based on a window that you can configure based on time or count.
  • Rule – Allows you to configure rule conditions that route/filter events to different streams. Standard conditional operators are supported for rules.
  • Aggregate – Performs functions over windows of events. You can create window criteria based on a time interval or a count. Two types of timed windows are supported: tumbling and sliding. Window functions supported out of the box include: stddev, stddevp, variance, variancep, avg, min, max, sum, count. The system is extensible, so custom functions can be added via the SDK.
  • Projection – Applies transformations to the events in the stream. An extensive set of out-of-the-box functions is provided, along with the ability to add your own.
  • PMML – Executes a PMML model that is stored in the Model Registry.
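As a rough illustration of what the Aggregate processor computes (plain Python, not SAM’s actual implementation), a tumbling window slices the stream into fixed, non-overlapping time buckets and applies a function per bucket:

```python
from collections import defaultdict

def tumbling_avg(events, window_secs):
    """Average event values per fixed, non-overlapping time bucket.

    events: iterable of (epoch_seconds, value) pairs.
    Returns {window_start_seconds: average_value}.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        # Round the timestamp down to the start of its window.
        buckets[ts - ts % window_secs].append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

# Speeds (mph) arriving at t=0s, t=50s, and t=130s, averaged per 120s window.
events = [(0, 60.0), (50, 70.0), (130, 90.0)]
print(tumbling_avg(events, 120))   # {0: 65.0, 120: 90.0}
```

A sliding window differs only in that consecutive windows overlap, so an event can contribute to more than one bucket.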

Integration With Hortonworks Schema Registry

In Part 2 of this blog series, we introduced the need for a Schema Registry, a central schema repository that allows applications and HDF components (NiFi, Storm, Kafka, and others) to flexibly interact with each other. Streaming apps require a schema, and hence SAM has first-class integration with the Hortonworks Schema Registry.

The first step in creating an app is to start with a source component. A common source component used by many customers is a Kafka topic. In the diagram below, we drag the Kafka component onto the canvas.

To configure the Kafka component, double-click it. Below is the configuration for the Kafka component.

The above config for the Kafka topic shows two powerful SAM integrations:

  • Service Pool / Environment Integration – Note that the connection settings to the Kafka broker are populated by SAM, so the developer doesn’t have to be concerned with them. This is possible because we associated this app with an environment, and that environment had a service from a service pool containing all the configuration required to connect to the Kafka service.
  • Schema Registry Integration – In the diagram above, note that when the user selects a schema from the drop-down, SAM talks to the Schema Registry, retrieves the schema registered for the Kafka topic, and displays it in the output section. When components get wired together, the schema gets passed as input to the downstream component.
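Conceptually, the lookup resembles the sketch below. This is a mock, not the Schema Registry client API; the topic name and fields are invented for illustration.

```python
# Mock registry keyed by topic name; the "truck_events" schema and its
# fields are invented for this example.
schema_registry = {
    "truck_events": [("driverId", "int"), ("speed", "double"), ("eventTime", "long")],
}

def output_schema(topic):
    """Fetch the schema registered for a topic; a downstream component
    wired to this source would receive these fields as its input schema."""
    return schema_registry[topic]

print(output_schema("truck_events"))
# [('driverId', 'int'), ('speed', 'double'), ('eventTime', 'long')]
```

Because the schema travels with the wiring, each downstream processor can offer the developer the exact field names and types it will receive, with no manual declaration.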

Full End-to-End Streaming App

The diagram below shows a full end-to-end stream app built using the Stream Builder without writing any code.

The above app showcases the following:

  • Connecting to two Kafka Topics to generate streams
  • Joining the streams in real-time
  • Applying rules on the stream to filter on events of interest
  • Doing aggregations over windows of time/count (calculating the average speed of a driver over 3 minutes)
  • Detecting when a driver is speeding (over 80 mph over a 3-minute window)
  • Streaming speeding alerts to a dashboard and generating alerts.
  • Enriching the events with driver HR info, hours/miles driven in the past week, weather info.
  • Normalizing the enriched events to feed into an analytical model.
  • Executing a predictive logistic regression model built with Spark ML to predict whether a driver is going to commit a violation.
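To make the first two steps concrete, a windowed inner join of two streams on a shared field can be sketched as follows. This is plain Python over finite lists; SAM performs the equivalent continuously on live streams, and the event fields here are invented for illustration.

```python
def windowed_inner_join(left, right, key, window_secs):
    """Inner-join two event lists on `key`, pairing events whose
    timestamps fall within `window_secs` of each other.

    Each event is a dict with at least `key` and a 'ts' field.
    """
    joined = []
    for l in left:
        for r in right:
            if l[key] == r[key] and abs(l["ts"] - r["ts"]) <= window_secs:
                joined.append({**l, **r})   # merge fields from both streams
    return joined

# Two streams keyed by driverId, joined over a 3-minute (180s) window.
geo = [{"driverId": 1, "ts": 10, "speed": 85}]
route = [{"driverId": 1, "ts": 12, "route": "I-95"}]
print(windowed_inner_join(geo, route, "driverId", 180))
# [{'driverId': 1, 'ts': 12, 'speed': 85, 'route': 'I-95'}]
```

The joined events would then flow into a rule (e.g., average speed over 80 mph) and onward to the aggregation, enrichment, and model-scoring steps listed above.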

To see a short step-by-step video of how the above app was created using SAM’s Stream Builder, watch the following video: Create a streaming analytics app in 10min. Also check out the detailed steps to recreate the app in the following doc: Getting Started with Streaming Analytics.

The next blog in this series will walk through the Stream Operations module of SAM. Stay tuned!

