
Real-Time Event Processing In NiFi, SAM, Schema Registry and SuperSet (Mac/Linux)


Introduction

In this tutorial, you will learn how to deploy a modern real-time streaming application. This application serves as a reference framework for developing a big data pipeline, complete with a broad range of use cases and powerful reusable core components. You will explore the NiFi dataflow application, Kafka topics, schemas, the SAM topology and the visualization slices in Superset.

Prerequisites

Outline

Concepts

SuperSet

SuperSet is a visual, intuitive and interactive data exploration platform. It offers a fast way to create and share dashboards of your visualized datasets with colleagues and business clients. A variety of visualization options are available for analyzing and interpreting the data. The semantic layer lets users control how the data stores are displayed in the UI. The security model lets users define intricate rules so that only certain features are accessible to select individuals. SuperSet can be integrated with Druid or other data stores (SQLAlchemy, Python ORM, etc.) of the user’s choice, offering the flexibility to be compatible with multiple systems.
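
Superset's data store integration builds on SQLAlchemy connection URIs. Below is a minimal sketch of that pattern; the URI, credentials and table name are hypothetical placeholders, not values from this tutorial:

```python
# Minimal sketch of the SQLAlchemy pattern Superset builds on; the URI,
# credentials and table name are hypothetical placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://analyst:secret@analytics-db:5432/trucking")

with engine.connect() as conn:
    # Any store with a SQLAlchemy dialect can be queried the same way.
    for row in conn.execute(text(
            "SELECT driverId, AVG(speed) AS avg_speed "
            "FROM truck_events GROUP BY driverId")):
        print(row)
```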

Druid

Druid is an open source analytics database developed for business intelligence queries on event data. Druid provides real-time, low-latency data ingestion, flexible data exploration and fast aggregation. Deployments often scale to trillions of events and petabytes of data.
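
To make the aggregation idea concrete, here is a hedged sketch of querying a Druid broker from Python with the pydruid client; the broker port (8082), interval and datasource name are assumptions, not values verified against this sandbox:

```python
# Hedged sketch: count events per hour from a Druid datasource via pydruid
# (pip install pydruid). Broker URL, interval and datasource are assumptions.
from pydruid.client import PyDruid
from pydruid.utils.aggregators import longsum

client = PyDruid("http://sandbox-hdf.hortonworks.com:8082", "druid/v2")

query = client.timeseries(
    datasource="violation-events-cube-01",
    granularity="hour",
    intervals="2018-01-01/2018-01-02",
    aggregations={"events": longsum("count")},  # sum the rolled-up "count" column
)
print(query.result)
```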

Stream Analytics Manager (SAM)

Stream Analytics Manager is a drag-and-drop tool that enables stream processing developers to build data topologies within minutes, compared to the traditional practice of writing many lines of code. Users can configure and optimize how each component or processor performs computations on the data, including windowing, joining multiple streams and other data manipulation. SAM currently supports Apache Storm as its stream processing engine, with support for other engines such as Spark and Flink planned; at that point, users will be able to choose whichever stream processing engine they prefer.
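
As a plain-Python illustration of what windowing means here (SAM expresses this through processor configuration rather than code; the window size and event tuples below are made up for the example):

```python
# Tumbling-window count in plain Python, illustrating the windowing idea;
# the 10-second window and event tuples are assumptions for the example.
from collections import defaultdict

WINDOW_MS = 10_000  # 10-second tumbling window

def window_counts(events):
    """Count events per (window, routeId); each event is (eventTime_ms, routeId)."""
    counts = defaultdict(int)
    for event_time_ms, route_id in events:
        window = event_time_ms // WINDOW_MS  # which tumbling window the event falls in
        counts[(window, route_id)] += 1
    return dict(counts)

events = [(1_000, "route-107"), (4_000, "route-107"), (12_000, "route-7")]
print(window_counts(events))  # {(0, 'route-107'): 2, (1, 'route-7'): 1}
```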

Schema Registry

Schema Registry (SR) stores and retrieves Avro schemas via a RESTful interface. SR keeps a version history for every schema. Serializers that plug into Kafka clients are provided; they handle schema storage and the retrieval of Kafka messages sent in Avro format.
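
For instance, the latest version of a schema can be fetched straight over REST. The endpoint path and response fields below follow the Hortonworks Schema Registry API as commonly documented; treat them as assumptions and check your registry's Swagger UI if they differ:

```python
# Fetch the latest registered version of a schema over REST; the endpoint
# path and response fields are assumptions based on the Hortonworks
# Schema Registry API, so verify against your registry's Swagger UI.
import requests

REGISTRY = "http://sandbox-hdf.hortonworks.com:7788/api/v1/schemaregistry"

resp = requests.get(REGISTRY + "/schemas/trucking_data_truck/versions/latest")
resp.raise_for_status()
info = resp.json()
print("version:", info.get("version"))
print("schema:", info.get("schemaText", "")[:120], "...")
```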

Overview of Trucking IoT Ref App

The Trucking IoT Reference Application is built using Hortonworks DataFlow platform.

The Trucking IoT data originates from a truck events simulator and is ingested by Apache NiFi. NiFi sends the data to Kafka topics, which are then consumed by Stream Analytics Manager (SAM) and stored in Druid. Superset is used to create visual representations of the Druid data sources. A more in-depth explanation of the pipeline follows as you explore the NiFi dataflow application, Schema Registry, SAM, Druid and Superset.

Step 1: Explore Dataflow Application

1. Open the NiFi UI at http://sandbox-hdf.hortonworks.com:9090/nifi/

2. Drag the NiFi template icon onto the canvas.

3. Choose the Template: Trucking IoT Demo. The NiFi Dataflow application will appear on the canvas. Deselect the Dataflow.

4. Select the NiFi configuration icon. Click on the Controller Services tab:

[Image: nifi_controller_services]

5. Enable the HortonworksSchemaRegistry by selecting the lightning bolt symbol.

6. In the “Enable Controller Service” window, under “Scope”, select “Service and referencing components”. Then click ENABLE.

[Image: enable_hwx_sr]

All controller services referencing HortonworksSchemaRegistry will also be enabled. Head back to the NiFi Dataflow.

7. Select all the processors in the NiFi Dataflow and click on the start button.

[Image: nifi-to-2kafka-2schemas]

8. To reduce resource consumption and footprint, click on the stop button once the PublishKafka_0_10 processors reach about 500 input records. Reaching that count takes approximately 1–2 minutes.

9. Stop the NiFi service in Ambari and make sure to turn on maintenance mode.

Overview of the 7 processors in the NiFi Flow:

  • GetTruckingData – Simulator generates TruckData and TrafficData in bar-delimited CSV (a parsing sketch follows these lists)

  • RouteOnAttribute – filters the TrafficData and TruckData into separate data feeds:

| Data Name | Data Fields |
| --- | --- |
| TruckData | eventTime, truckId, driverId, driverName, routeId, routeName, latitude, longitude, speed, eventType |
| TrafficData | eventTime, routeId, congestionLevel |

TruckData side of Flow

  • EnrichTruckData – appends three weather fields to the end of TruckData: “foggy”, “rainy”, “windy”

  • ConvertRecord – reads incoming data with “CSVReader” and writes out Avro data with “AvroRecordSetWriter”, embedding a “trucking_data_truck” schema onto each flowfile

  • PublishKafka_0_10 – publishes Avro data to the Kafka topic “trucking_data_truck”

TrafficData side of Flow

  • ConvertRecord – converts CSV data into Avro data, embedding a “trucking_data_traffic” schema onto each flowfile

  • PublishKafka_0_10 – publishes Avro data to the Kafka topic “trucking_data_traffic”
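
To make the two flows concrete, here is a hedged sketch of what one TruckData record looks like on its way into Kafka. The field order comes from the table above; the sample line, broker address and plain-bytes payload are assumptions (the real flow publishes binary Avro with a schema reference), and kafka-python stands in for NiFi's PublishKafka_0_10 processor:

```python
# Hedged sketch: parse one bar-delimited TruckData line (field order taken
# from the table above) and publish it with kafka-python
# (pip install kafka-python). The sample line, broker address and
# str()-encoded payload are assumptions; the actual flow publishes
# binary Avro with an embedded schema reference.
from kafka import KafkaProducer

TRUCK_FIELDS = ["eventTime", "truckId", "driverId", "driverName", "routeId",
                "routeName", "latitude", "longitude", "speed", "eventType"]

line = "1488767711734|26|1|Some Driver|107|Springfield to Kansas City|38.959|-92.219|65|Normal"
record = dict(zip(TRUCK_FIELDS, line.split("|")))

producer = KafkaProducer(bootstrap_servers="sandbox-hdf.hortonworks.com:6667")
producer.send("trucking_data_truck", str(record).encode("utf-8"))
producer.flush()
```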

Overview of 5 controller services used in the NiFi Flow:

  • AvroRecordSetWriter – writes the contents of a RecordSet in binary Avro format (trucking_data_truck schema)

  • AvroRecordSetWriter – Traffic – writes the contents of a RecordSet in binary Avro format (trucking_data_traffic schema)

  • CSVReader – returns each row in a CSV file as a separate record (trucking_data_truck schema)

  • CSVReader – Traffic – returns each row in a CSV file as a separate record (trucking_data_traffic schema)

  • HortonworksSchemaRegistry – provides schema registry service for
    interaction with Hortonworks Schema Registry

Step 2: View Schema Registry

1. Open the Schema Registry UI at http://sandbox-hdf.hortonworks.com:7788/

[Image: schema_registry_trucking_data]

Overview of the essential schemas in the Schema Registry:

  • trucking_data_joined – model for truck event data joined with traffic data (EnrichedTruckAndTrafficData)

  • trucking_data_traffic – model for eventTime, routeId, congestionLevel (TrafficData)

  • trucking_data_truck – model for truck event originating from a truck’s onboard computer (EnrichedTruckData)
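
To illustrate what one of these models looks like as an Avro schema, here is a plausible reconstruction of trucking_data_traffic from its field list; the field types are assumptions, and fastavro stands in for the registry-aware serializers:

```python
# Plausible Avro schema for trucking_data_traffic, reconstructed from the
# field list above; the types are assumptions. Uses fastavro
# (pip install fastavro) rather than the registry-aware serializers.
import io
from fastavro import parse_schema, schemaless_writer

traffic_schema = parse_schema({
    "type": "record",
    "name": "TrafficData",
    "fields": [
        {"name": "eventTime", "type": "long"},
        {"name": "routeId", "type": "int"},
        {"name": "congestionLevel", "type": "int"},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, traffic_schema,
                  {"eventTime": 1488767711734, "routeId": 107, "congestionLevel": 45})
print(len(buf.getvalue()), "bytes of binary Avro")
```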

Step 3: Analyze Stream Analytics Application

1. Open Stream Analytics Manager (SAM) at http://sandbox-hdf.hortonworks.com:7777/

2. Click on the Trucking-IoT-Demo SAM topology square:

[Image: sam_home]

3. Click on the EDIT button:

[Image: trucking_iot_demo_view]

4. Click on the start button to deploy the topology:

[Image: sam_start]

A window will appear asking if you want to continue the deployment; choose Ok.

5. You will receive a notification that the SAM topology application deployed successfully, and the topology will show an Active status in the bottom right corner.

[Image: trucking_iot_sam_topology]

Overview of the 7 processors in the SAM topology:

  • TruckingDataTraffic – source data from the “trucking_data_traffic” Kafka topic

  • TruckingDataTruck – source data from the “trucking_data_truck” Kafka topic

  • JOIN – joins the TruckingDataTruck and TruckingDataTraffic streams by “routeId” (see the sketch after this list)

  • IsViolation – checks whether the eventType is not “Normal” and, if so, emits the event

  • HDFS – storage for the joined TruckingDataTruck and TruckingDataTraffic data

  • Violation-Events-Cube – stores violation events into Druid

  • Data-Lake-HDFS – stores violation events into HDFS
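
As a plain-Python sketch of what the JOIN and IsViolation steps compute (SAM expresses this through processor configuration, not code; the event shapes below are simplified assumptions):

```python
# Plain-Python sketch of the JOIN (by routeId) and IsViolation
# (eventType != "Normal") steps; event shapes are simplified assumptions.
truck_events = [
    {"routeId": 107, "driverId": 1, "speed": 79, "eventType": "Lane Departure"},
    {"routeId": 7, "driverId": 2, "speed": 55, "eventType": "Normal"},
]
traffic_events = [
    {"routeId": 107, "congestionLevel": 45},
    {"routeId": 7, "congestionLevel": 10},
]

# JOIN: look up each truck event's traffic data by routeId.
congestion_by_route = {t["routeId"]: t["congestionLevel"] for t in traffic_events}

# IsViolation: keep only non-"Normal" events, enriched with congestion data.
violations = [
    {**truck, "congestionLevel": congestion_by_route[truck["routeId"]]}
    for truck in truck_events
    if truck["eventType"] != "Normal"
]
print(violations)  # only the "Lane Departure" event survives
```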

Step 4: Visualize Trucking Data Via Superset

1. Open Ambari at http://sandbox-hdf.hortonworks.com:8080/

2. Turn on the HDFS, YARN, Druid and Superset services and make sure to turn off maintenance mode.

For example, to turn on HDFS, click on the service name in Ambari, click on the Service Actions dropdown and click Start. In the confirmation window, confirm that you want to start the service and also check the box to turn off maintenance mode.
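
The same steps can also be scripted against Ambari's REST API. A hedged sketch follows; the cluster name, credentials and Ambari service names (DRUID, SUPERSET) are assumptions for the sandbox, so verify them under /api/v1/clusters first:

```python
# Hedged sketch: start services and clear maintenance mode via Ambari's REST
# API. Cluster name, credentials and service names are sandbox assumptions.
import requests

AMBARI = "http://sandbox-hdf.hortonworks.com:8080/api/v1/clusters/Sandbox"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}

def start_service(name):
    # Clear maintenance mode, then request the STARTED state.
    requests.put(f"{AMBARI}/services/{name}", auth=AUTH, headers=HEADERS,
                 json={"RequestInfo": {"context": f"Maintenance off: {name}"},
                       "Body": {"ServiceInfo": {"maintenance_state": "OFF"}}})
    requests.put(f"{AMBARI}/services/{name}", auth=AUTH, headers=HEADERS,
                 json={"RequestInfo": {"context": f"Start {name}"},
                       "Body": {"ServiceInfo": {"state": "STARTED"}}})

for svc in ["HDFS", "YARN", "DRUID", "SUPERSET"]:
    start_service(svc)
```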

3. Open Superset at http://sandbox-hdf.hortonworks.com:9089/

4. Wait about 5–10 minutes for the Kafka data to be consumed, then periodically select the Sources dropdown and click Refresh Druid Metadata. Eventually, the two Druid data sources will appear.

[Image: refresh_metadata]

5. Select the average-speed-cube-01 Druid data source.

6. You will be taken to the Superset visualization slice where you can visualize that Druid data source.

[Image: superset_visual_slice]

7. Under Datasource & Chart Type, select Visualization Type: Sunburst.

[Image: visual_type]

8. Under Hierarchy, add driverId, speed_AVG.

9. Click Query to visualize the data as a Sunburst representation.

10. Select Save as and name the slice: AvgSpeedSunburst. Create a new dashboard and call it: Trucking-IoT-Demo. Click Save.

[Image: save_slice]

The following visualization slice is a “Sunburst” of the average-speed-cube-01 data source.

[Image: sunburst_visual_slice]

The following visualization slice is a “Sunburst” of the violation-events-cube-01 data source:

[Image: sunburst_driverviolation]

All created visualization slices are added to the dashboard you assign them to; in the two examples above, both slices are part of the Trucking-IoT-Demo dashboard.

[Image: superset_dashboard]

Summary

Congratulations! You deployed the Trucking IoT demo, which processes truck event data by using the NiFi dataflow application to separate the data into two flows, TruckData and TrafficData, which are transmitted into two robust Kafka topics tagged with Schema Registry schemas: trucking_data_traffic and trucking_data_truck. The Stream Analytics Manager (SAM) topology pulls in this data to join the two streams (or flows) by routeId and filter non-normal events into the Druid data source violation-events-cube-01. Superset visualizes the data sources as Sunburst and various other visualization slices to add more insight to our Trucking IoT demo.

Further Reading

Appendix A: Trucking IoT Github Repo

For more information on the Trucking IoT Reference Application, visit the documentation and source code at:

https://github.com/orendain/trucking-iot/tree/hadoop-summit-2017