
SAM in Trucking IoT on HDF

Stream Processing & SAM

Objective

To become acquainted with the idea of stream processing and where Streaming Analytics Manager (SAM) fits in, and to learn some of SAM’s key concepts.

Introduction to Stream Processing

In the stream processing model, data is sent directly into a data analytics engine and processed record by record as it arrives, so results are available in real time. In the batch processing model, by contrast, data is collected over a period of time and only then fed into a data analytics engine.
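
The difference between the two models can be sketched in a few lines of plain Python. This is only an illustration and does not use SAM or Storm; the function names and the truck speed readings are made up for the example.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch model: wait until all readings are collected, then compute once."""
    return sum(readings) / len(readings)


def stream_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream model: update the result as each reading arrives."""
    total = 0.0
    count = 0
    for value in readings:
        total += value
        count += 1
        # A fresh result is available after every single event.
        yield total / count


# Hypothetical truck speed readings, used only for illustration.
speeds = [60.0, 70.0, 80.0]

batch_result = batch_average(speeds)           # one answer, after all data is in
stream_results = list(stream_average(speeds))  # one answer per incoming event
```

The batch function produces a single result once the whole data set exists, while the stream function yields an updated result for every event, which is the property that makes real-time analytics possible.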

When to use Stream Processing?

When you need analytic results in real time, you can create multiple data stream sources using Apache Kafka, then use SAM, a visual stream processing tool, to pull in these data sources. SAM’s visual programming paradigm lets you feed the data into its suite of analytic tools and obtain near-instant results.

Fraud detection is a typical use case: transaction data is generated continuously, so it calls for stream processing. SAM can detect anomalies that signal fraud in real time, so fraudulent transactions can be stopped before they complete.
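
To make the fraud example concrete, here is a minimal sketch of a per-event anomaly check in plain Python. It is not how SAM detects anomalies; the rule (flag an amount larger than `factor` times the running mean of earlier normal amounts), the function name, and the sample transactions are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple


def flag_anomalies(
    amounts: Iterable[float], factor: float = 3.0
) -> Iterator[Tuple[float, bool]]:
    """Yield (amount, is_suspicious) for each transaction as it arrives."""
    total = 0.0
    count = 0
    for amount in amounts:
        # Compare against the running mean of the normal amounts seen so far.
        suspicious = count > 0 and amount > factor * (total / count)
        yield amount, suspicious
        if not suspicious:
            # Only normal transactions update the baseline.
            total += amount
            count += 1


# Hypothetical transaction amounts.
transactions = [20.0, 25.0, 30.0, 500.0, 22.0]
flags = [suspicious for _, suspicious in flag_anomalies(transactions)]
```

Because each transaction is judged the moment it arrives, a suspicious one can be acted on before later events are even seen, which a batch job run at the end of the day could not do.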

History

Previously, developers had to use headless stream processing tools such as Apache Storm and Spark Streaming. Eventually, developers in the open source community came together to provide the “head”, a visual canvas that uses Apache Storm behind the scenes as the stream processing engine.

Streaming Analytics Manager

Before you can develop a SAM topology, you need to set up a service pool and an environment so that you can add a stream application. Once these three components (service pool, environment, and application) are established, you can start building your SAM topology.

Service Pool

A service pool is the set of services related to your Ambari-managed cluster on HDF. A service is an entity that an application developer uses to build stream topologies. Examples include:

  • a Storm cluster for stream application deployment
  • a Kafka cluster that the stream application uses to create streams
  • a Druid data store to which the stream application writes

The list of managed service pools is located in the Configuration tab. In our case, one service pool is preloaded into the HDF Sandbox, as shown in Figure 1; larger projects will probably have more.

Figure 1: Sandbox Service Pool

Environment

An environment is a named entity that represents a set of services chosen from different service pools. When a new stream application is created, the developer assigns it an environment. The stream application can then use only the services associated with that environment.

The list of managed environments is located in the Configuration tab. In our case, one environment is preloaded into the HDF Sandbox, as shown in Figure 2.

Figure 2: Sandbox Environment

Application

An application is a SAM topology. For example, the application you will create is shown in Figure 3.

Figure 3: Trucking-IoT-Demo

Once the above three components are created, a SAM developer can start building their topology.
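
The relationship between the three components can be sketched as a small data model. This is a conceptual illustration only, not SAM’s internal model or API: the class names, the `add_service` and `can_use` helpers, and the pool and environment names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass(frozen=True)
class ServicePool:
    """The services exposed by one Ambari-managed cluster."""
    name: str
    services: FrozenSet[str]


@dataclass
class Environment:
    """A named set of services chosen from one or more service pools."""
    name: str
    services: Set[str] = field(default_factory=set)

    def add_service(self, pool: ServicePool, service: str) -> None:
        # A service can only be picked if the pool actually offers it.
        if service not in pool.services:
            raise ValueError(f"{service} is not offered by pool {pool.name}")
        self.services.add(service)


@dataclass
class Application:
    """A stream application, bound to exactly one environment."""
    name: str
    environment: Environment

    def can_use(self, service: str) -> bool:
        # An application only sees the services of its environment.
        return service in self.environment.services


# Hypothetical pool and environment names; the services mirror the examples above.
pool = ServicePool("sandbox-pool", frozenset({"STORM", "KAFKA", "DRUID"}))
env = Environment("sandbox-env")
env.add_service(pool, "STORM")
env.add_service(pool, "KAFKA")
app = Application("Trucking-IoT-Demo", env)
```

In this sketch, `app.can_use("KAFKA")` is true, while `app.can_use("DRUID")` is false because Druid was never added to the environment, even though the pool offers it.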

SAM Topology Components

SAM’s visual topology editor lets you drag and drop components onto the canvas, so you can add the first processor to your topology or connect new processors to your existing topology.

Source

A source is a component that ingests data into the topology from data stream sources such as messaging systems (for example, Kafka) or distributed file systems (for example, HDFS).

A helpful analogy, shown in Figure 4, is a body of flowing water fed by multiple streams:

Figure 4: Data Stream Sources

Processor

A processor is a component that performs transformations, computations, joins, and filtering on data as it moves through the stream.

In nature, as portrayed in Figure 5, rocks and other natural solid materials filter water, much as processors filter data down to the insights we want:

Figure 5: Processors

Sink

A sink is a component that stores data in a datastore or distributed file system, such as Druid, HBase, HDFS, or a database reached over JDBC.

A body of water can drain into many sinkholes, much as data can be stored in different locations. Figure 6 gives a visual representation:

Figure 6: Data Sinks
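
The source, processor, and sink roles can be sketched as a chain of plain Python generators. This only mirrors the shape of a topology; it does not use SAM, Kafka, or Druid, and the event fields, the speed limit, and the in-memory list standing in for a datastore are assumptions made for the example.

```python
from typing import Any, Dict, Iterable, Iterator, List

Event = Dict[str, Any]


def source() -> Iterator[Event]:
    """Source: emits events one at a time (stands in for a Kafka topic)."""
    yield {"truck_id": 1, "speed": 55}
    yield {"truck_id": 2, "speed": 80}
    yield {"truck_id": 3, "speed": 62}


def processor(events: Iterable[Event], speed_limit: int = 65) -> Iterator[Event]:
    """Processor: filters the stream and enriches the events that pass."""
    for event in events:
        if event["speed"] > speed_limit:
            yield {**event, "violation": "speeding"}


def sink(events: Iterable[Event], store: List[Event]) -> None:
    """Sink: writes each event to storage (here, an in-memory list)."""
    for event in events:
        store.append(event)


# Wire the components together: source -> processor -> sink.
stored: List[Event] = []
sink(processor(source()), stored)
```

Each event flows through the whole chain before the next one is read, so `stored` ends up holding only the enriched event for truck 2, the one reading above the assumed speed limit.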

Summary

Congratulations! You are now familiar with the concept of stream processing and when you would need to use a stream processing tool such as SAM. Now that you are aware of the fundamental concepts in SAM, let’s jump into building a SAM topology.
