It may seem obvious (or inevitable), but many companies are embracing the Internet of Things (IoT), and for good reasons, notes Forbes' Mike Kavis. First, McKinsey Global Institute reports that IoT business will reach $6.2 trillion in revenue by 2025. Second, more and more objects are being embedded with sensors that communicate real-time data over networks to data centers for processing, explain McKinsey's Chui, Loffler, and Roberts.
While both reasons may be true, what makes IoT possible, beyond ubiquitous embedded sensors, is the sensors' ability to transmit digestible data in real time and Hadoop 2 clusters' capacity to absorb and process voluminous data at petabyte scale. At the heart of processing voluminous sensor data at scale are three major steps:

1. Ingest the data
2. Store the data
3. Analyze the data
These three steps are possible because today's Modern Data Architecture (MDA), with Apache Hadoop YARN as its architectural center, allows multiple data-processing engines to access and transform the same data residing within the same cluster.
In this blog, we briefly introduce three tutorials for the Sandbox, written by Saptak Sen of Hortonworks. They employ two complementary technologies, Apache Kafka and Apache Storm, both running on the Hortonworks Data Platform (HDP) and both essential components for handling sensor data at scale.
Of the three major steps outlined above for IoT, these two components exemplify the first step: data ingestion. We will explore data storage and data analytics in subsequent tutorials and their accompanying blogs.
These tutorials illustrate how Kafka and Storm capture, ingest, and process sensor data from trucks, combining geo-location with real-time events such as speeding, lane departure, and unsafe tailgating. Together, they demonstrate how real-time data processing can be achieved in a Hadoop cluster.
Data must originate from somewhere. An embedded sensor, for example, can produce data at frequent intervals. A consumer can fetch that data from a live stream or read it from a committed log file. In either case, for each datum there is a producer and a consumer. This produce-and-consume paradigm is at the core of any messaging system. Apache Kafka is a publish-subscribe messaging system designed as a distributed commit log: producers write data into Kafka, and consumers read from it.
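To make the paradigm concrete, here is a minimal in-memory sketch of a commit log, with a producer appending records and a consumer reading them back by offset. It models the idea only; the class and method names are illustrative, and this is not Kafka's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory model of a commit log: producers append records,
// consumers read sequentially by offset. Illustrates the produce-and-consume
// paradigm only; this is not Kafka's API.
public class CommitLogSketch {
    private final List<String> log = new ArrayList<>();

    // Producer side: append a record and return its offset in the log.
    public synchronized long produce(String record) {
        log.add(record);
        return log.size() - 1;
    }

    // Consumer side: read the record at a given offset, or null if not yet written.
    public synchronized String consume(long offset) {
        return offset < log.size() ? log.get((int) offset) : null;
    }

    public static void main(String[] args) {
        CommitLogSketch topic = new CommitLogSketch();
        topic.produce("truck_1,speeding");
        topic.produce("truck_2,lane_departure");
        // A consumer tracks its own offset and replays the log in order.
        System.out.println(topic.consume(0)); // truck_1,speeding
        System.out.println(topic.consume(1)); // truck_2,lane_departure
    }
}
```

Because consumers track their own offsets against an append-only log, many independent consumers can read the same data at their own pace, which is central to Kafka's design.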
In the first tutorial, we show how you can use Apache Kafka as a producer of trucking events.
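To give a flavor of what such a producer emits, the sketch below generates hypothetical truck events as CSV strings. The field layout (driverId, eventType, latitude, longitude) and event names are assumptions for illustration, not the tutorial's actual schema; in the tutorial, each event would be handed to a Kafka producer rather than printed.

```java
import java.util.Random;

// Generates hypothetical truck events of the kind a Kafka producer might emit.
// The CSV layout (driverId, eventType, latitude, longitude) is illustrative,
// not the tutorial's actual schema.
public class TruckEventSource {
    private static final String[] EVENT_TYPES =
            {"normal", "speeding", "lane_departure", "unsafe_tailgating"};

    // Build one CSV-formatted event for a given driver.
    public static String nextEvent(int driverId, Random rng) {
        String type = EVENT_TYPES[rng.nextInt(EVENT_TYPES.length)];
        double lat = 37.0 + rng.nextDouble();   // hypothetical geo-location
        double lon = -122.0 - rng.nextDouble();
        return driverId + "," + type + "," + lat + "," + lon;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // In the tutorial, each event would be sent to a Kafka topic;
        // here we just print a few.
        for (int i = 0; i < 3; i++) {
            System.out.println(nextEvent(100 + i, rng));
        }
    }
}
```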
Whereas the first tutorial shows how to produce Kafka truck events, the second tutorial demonstrates how to capture and consume those events in real time with an Apache Storm cluster.
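In Storm, per-event logic of this kind lives inside a bolt's execute method. The sketch below collapses that into plain Java: it flags unsafe driving events and passes over normal ones. The event format and the "unsafe" rule are assumptions for illustration, not the tutorial's actual topology.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the per-event logic a Storm bolt might apply when
// consuming truck events: flag unsafe driving events, skip normal ones.
// The CSV format (driverId,eventType,...) is illustrative, not the
// tutorial's actual schema.
public class UnsafeEventFilter {
    public static boolean isUnsafe(String csvEvent) {
        String eventType = csvEvent.split(",")[1];
        return !eventType.equals("normal");
    }

    public static List<String> filterUnsafe(List<String> events) {
        List<String> alerts = new ArrayList<>();
        for (String e : events) {
            if (isUnsafe(e)) {
                alerts.add(e); // in Storm, the bolt would emit this tuple downstream
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        List<String> events = List.of(
                "101,normal,37.1,-122.3",
                "102,speeding,37.2,-122.4",
                "103,lane_departure,37.3,-122.5");
        System.out.println(filterUnsafe(events)); // the two unsafe events
    }
}
```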
Finally, no data-processing tutorial in a Hadoop cluster can escape the canonical WordCount example. In that tradition, this third tutorial shows how to process and count words in real time using Apache Storm.
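The heart of streaming WordCount is a running tally that is updated as each sentence arrives. The sketch below collapses what Storm would split across a sentence-splitting bolt and a counting bolt into one plain-Java class; the class and method names are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of streaming word count: Storm would split this work across a
// sentence-splitting bolt and a counting bolt; here we collapse both into
// one class, with one method call per incoming sentence.
public class WordCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // In Storm, a bolt would receive one sentence per tuple and update
    // running counts; we model that with a method call per sentence.
    public void process(String sentence) {
        for (String word : sentence.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    public Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        WordCountSketch wc = new WordCountSketch();
        wc.process("the cow jumped over the moon");
        wc.process("the moon");
        System.out.println(wc.counts().get("the"));  // 3
        System.out.println(wc.counts().get("moon")); // 2
    }
}
```

Because the counts are updated incrementally per sentence rather than over a finished dataset, the same logic naturally fits a stream that never ends, which is exactly the setting Storm targets.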