November 22, 2017

IoT and Data Science – A Trucking Demo on DSX Local with Apache NiFi

IBM’s Data Science Experience (DSX) comes in multiple flavors: cloud, desktop, and local. In this post we cover an IoT trucking demo on DSX Local, which runs on top of Hortonworks Data Platform (HDP). We train and deploy a model, and then use that model to score simulated incoming trucking data in Apache NiFi. Throughout, we closely follow the steps of a data science lifecycle process.

Fig 1. Data science lifecycle

Step 1 – Problem Definition

Imagine a trucking company that dispatches trucks across the country. The trucks are outfitted with sensors that collect data such as the driver’s location, weather conditions, and recent events like speeding, weaving out of the lane, or following too closely. This data is generated once per second and streamed back to the company’s servers.

Fig 2. Sample input data

The company needs a way to process and analyze this stream of data so that it can make sure trucks are traveling safely and that drivers are not likely to commit violations anytime soon. And all of this has to happen in real time.
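To make the problem concrete, a single simulated event might look like the sketch below. The field names and values are illustrative assumptions, not the demo’s exact schema.

    # A single simulated sensor event, emitted once per second per truck.
    # Field names and values are illustrative; the actual schema may differ.
    sample_event = {
        "driver_id": 42,
        "truck_id": 17,
        "event_time": "2017-11-22T14:03:05Z",
        "latitude": 41.5265,
        "longitude": -124.0324,
        "event_type": "Lane Departure",  # e.g. Normal, Speeding, Lane Departure
        "weather": "Foggy",
        "miles_driven": 312,
        "hours_driven": 6.5,
    }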

Step 2 – ETL & Feature Extraction

Fig 3. Input features and output label correlation matrix

For predicting violations, we simulate trucking events in terms of location, miles driven, and weather conditions. We perform multiple feature engineering steps and examine correlations between different features.

The first video covers the following:

  • Fetching data from HDFS
  • Feature engineering
  • Data visualization
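As a rough illustration of these steps, the PySpark sketch below loads events from HDFS, engineers two simple features, and checks one pairwise correlation. The HDFS path and column names are assumptions made for this sketch, not the demo’s actual ones.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("trucking-etl").getOrCreate()

    # Load historical trucking events from HDFS; path and schema are placeholders.
    events = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("hdfs:///user/dsx/trucking/events.csv"))

    # Example feature engineering: a binary flag for foggy weather and
    # a miles-per-hour ratio derived from two raw columns.
    trucking = (events
                .withColumn("is_foggy", (F.col("weather") == "Foggy").cast("int"))
                .withColumn("miles_per_hour",
                            F.col("miles_driven") / F.col("hours_driven")))

    # Pairwise Pearson correlation between two candidate features.
    print(trucking.stat.corr("miles_driven", "hours_driven"))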

Step 3 – Learning & Model Deployment

Fig 4. Model testing in DSX’s UI

Once the data is ready, we build a predictive model. In our example we use the Spark ML Random Forest classifier. Classification is a statistical technique which assigns a class to each driver: violation or normal. We train the model on a small dataset containing historical data and evaluate it on several different metrics: accuracy, precision, and area under the ROC curve. Finally, we deploy the model in the DSX UI and test it both there and via RESTful API calls.

The second video covers the following:

  • Building a Random Forest classifier in Spark ML
  • Saving the model in a Machine Learning repository
  • Deploying the model online via UI
  • Testing the model via UI and RESTful API
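The sketch below shows what the training and evaluation steps could look like in Spark ML, reusing the assumed column names from the ETL sketch above; the violation label column and the number of trees are likewise assumptions.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Assemble candidate features into a single vector column.
    assembler = VectorAssembler(
        inputCols=["miles_driven", "hours_driven", "is_foggy"],
        outputCol="features")

    # Binary label assumed: 1 = violation, 0 = normal driving.
    rf = RandomForestClassifier(labelCol="violation", featuresCol="features",
                                numTrees=50)

    train, test = trucking.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, rf]).fit(train)

    # Evaluate the held-out split on area under the ROC curve.
    evaluator = BinaryClassificationEvaluator(labelCol="violation",
                                              metricName="areaUnderROC")
    print(evaluator.evaluate(model.transform(test)))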

Step 4 – Simulating End-to-end Data Flow

Fig 5. Simulated trucking data flow in Apache NiFi

With the model accessible via RESTful API calls, we simulate an end-to-end flow in Apache NiFi. Here, multiple processors handle data simulation (by randomly selecting a combination of acceptable values) and call the model to decide whether a violation is likely to occur. Depending on the model’s prediction, we write the results to a plain-text file.

The third video covers the following:

  • Simulating trucking data
  • Calling the model via RESTful API
  • Routing data based on the API response: violation or no violation
  • Storing results
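Outside of NiFi, the same flow can be approximated in a few lines of Python, mirroring what the simulation and InvokeHTTP-style processors do. The scoring URL, auth header, and response shape below are hypothetical; the real values come from the model’s deployment page in DSX.

    import json
    import random
    import requests

    # Hypothetical scoring endpoint and token; actual values differ per deployment.
    SCORING_URL = "https://dsx-local.example.com/scoring/trucking-rf"
    HEADERS = {"Content-Type": "application/json",
               "Authorization": "Bearer <token>"}

    # Simulate one event by randomly selecting acceptable values,
    # as the NiFi simulation processors do.
    event = {
        "miles_driven": random.randint(0, 500),
        "hours_driven": round(random.uniform(0.0, 11.0), 1),
        "is_foggy": random.choice([0, 1]),
    }

    resp = requests.post(SCORING_URL, headers=HEADERS, data=json.dumps(event))
    result = resp.json()  # assumed response shape, e.g. {"prediction": 1}

    # Route on the prediction, as the NiFi flow does, and append the result.
    outfile = "violations.txt" if result.get("prediction") == 1 else "normal.txt"
    with open(outfile, "a") as f:
        f.write(json.dumps({**event, **result}) + "\n")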

Closing Thoughts

The next step would be to attach dashboards that allow more advanced monitoring and trigger alerts for the trucking fleet. These alerts would be useful both to trucking management and to the individual drivers, who could take corrective action to reduce the probability of a violation.

If we deployed this to production, we would replace the simulated data with actual sensor and weather data. We would also simplify the model by removing redundant features, i.e. features that are highly positively correlated (e.g. hours and miles driven) and therefore provide no additional useful information.
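One simple way to prune such features is to scan pairwise correlations and drop one feature from any pair above a threshold. A sketch, reusing the assumed columns from earlier (the 0.9 threshold is arbitrary):

    # Flag feature pairs whose absolute correlation exceeds a threshold,
    # keeping the first of each pair and dropping the second.
    candidate_cols = ["miles_driven", "hours_driven", "is_foggy"]
    threshold = 0.9

    redundant = set()
    for i, a in enumerate(candidate_cols):
        for b in candidate_cols[i + 1:]:
            if abs(trucking.stat.corr(a, b)) > threshold:
                redundant.add(b)

    slim = trucking.drop(*redundant)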

For Random Forest classifiers, normalizing the data is not necessary, although other classifiers may require it.
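If we swapped in a scale-sensitive classifier, a standardization step could be added after the assembler; a minimal sketch with Spark ML’s StandardScaler:

    from pyspark.ml.feature import StandardScaler

    # Standardize the assembled feature vector for scale-sensitive models
    # (e.g. logistic regression); the Random Forest can use raw features.
    assembled = assembler.transform(trucking)
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
    scaled = scaler.fit(assembled).transform(assembled)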

Fig 6. Example of unbalanced data in the training set
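A quick way to spot such an imbalance is to count rows per label in the training split (assuming the violation label column from the training sketch):

    # A heavy skew toward label 0 ("normal") would motivate resampling
    # or class weighting before training.
    train.groupBy("violation").count().show()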

Finally, we would keep monitoring model health as new data came in, making sure that the model still performs with acceptable metrics, e.g. area under the ROC curve.

Comments

  • Wow, incredible demo! This is going to change the game. We’ve been hearing for years how IoT and data will change logistics, and this application is sure to impact trucking in a big way. Thanks for breaking it down in a post like this. I learned a lot, actually.

  • Excellent example of how a feed of real-time data can fuel predictive analytics. I’d like to build a NiFi workflow that feeds open source tools performing predictive analysis. Can you recommend any good and recent examples I can use as my model?

  • Hey Robert! Great level of explanation. Would like to see a more elaborate discussion of feature engineering because I feel that forms the crux of any data science project. Nevertheless, a nice blog to read.
