December 08, 2014

Data Science with Hadoop: Predicting Airline Delays – Part 2


In this second part of our series on Data Science and Apache Hadoop, together with its accompanying IPython Notebook, we continue to demonstrate how to build a predictive model with Apache Hadoop using existing modeling tools. This time we'll use Apache Spark and MLlib.

Apache Spark is a relatively new entrant to the Hadoop ecosystem. Now running natively on Apache Hadoop YARN, the architectural center of Hadoop, Apache Spark is an in-memory data processing API and execution engine that is effective for machine learning and data science use cases. And with Spark on YARN, data workers can use Spark for data science workloads alongside other data access engines, all operating on the same shared dataset on the same cluster.


Machine Learning with MLlib

In this post, we show how to use Apache Spark via its Scala API to generate our feature matrix, and how to use MLlib (Spark's machine learning library) to build and evaluate supervised learning models.

Recall from part 1 that we are exploring a predictive model for flight delays. Our source dataset resides here, and it includes details about flights in the US from 1987 to 2008. We have also enriched the data with weather information: daily minimum and maximum temperatures, wind speed, snow conditions, and precipitation.

Just as in part 1, we will build a supervised learning model to predict flight delays for flights leaving O'Hare International Airport (ORD). We will use the 2007 data to build the model, and test its validity using data from 2008.

As in part 1, the Hortonworks Data Platform (HDP) cluster configuration remains unchanged.

Pre-Processing with Hadoop and Spark

Apache Spark's basic data abstraction is the RDD (Resilient Distributed Dataset): a fault-tolerant collection of elements that can be operated on and transformed in parallel across your Hadoop cluster.
Spark's API (available in Scala, Python, or Java) supports a variety of operations on RDDs, such as map(), flatMap(), filter(), and join(). For a full description of the API, please see the Spark programming guide.
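Spark itself needs a cluster (or local mode) to run, but the semantics of these transformations can be sketched in plain Python. The flight records, field layout, and 15-minute delay threshold below are made up for illustration only:

```python
# Plain-Python analogue of Spark RDD transformations. In Spark, the same
# operations run in parallel across partitions of a distributed dataset.

records = [
    "2007-01-15,ORD,DEN,23",   # date, origin, destination, delay (minutes)
    "2007-01-15,ORD,JFK,4",
    "2007-01-16,ORD,SFO,41",
]

# map(): exactly one output element per input element
parsed = [line.split(",") for line in records]

# filter(): keep only the elements satisfying a predicate
delayed = [r for r in parsed if int(r[3]) >= 15]

# flatMap(): each input element may yield zero or more output elements
airports = [a for r in parsed for a in (r[1], r[2])]

# join(): pair elements by key, as when enriching flights with weather
weather = {"2007-01-15": 28.0, "2007-01-16": 12.0}
enriched = [(r[0], (r[1:], weather[r[0]])) for r in parsed if r[0] in weather]
```

The list comprehensions here stand in for the lazily evaluated, partitioned execution that Spark performs on a real cluster.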

We will show how to perform with Spark (using Scala) the same pre-processing we did with Apache Pig in part 1 and its accompanying IPython Notebook.

Methodology: Iterative and Interactive IPython Runs

For brevity and readability, the detailed step-by-step construction of the feature matrix, the machine learning models and their implementation details, and the model evaluation steps are demonstrated in the IPython Notebook here, which we encourage you to explore, trying out the Spark and Scala code yourself.

The methodology entails the following steps:

  1. Defining a few Scala functions for pre-processing and feature generation.
  2. Using Spark and Scala to compute the feature matrix.
  3. Modeling with Spark and ML-Lib, using Logistic Regression, Support Vector Machines, and Decision Trees.
  4. Constructing a richer predictive model for flight delays by integrating weather data.
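To make step 1 concrete, here is a rough plain-Python sketch of what a per-flight feature-generation function might look like. The field layout, feature set, and 15-minute delay threshold are assumptions for illustration; the real definitions live in the accompanying notebook:

```python
from datetime import date

# Hypothetical feature-generation step: turn one flight record (plus its
# weather measurements) into a labeled feature vector.
def to_features(year, month, day, dep_hour, distance, dep_delay,
                tmin, tmax, wind, snow):
    label = 1 if dep_delay >= 15 else 0       # assumed delay threshold
    dow = date(year, month, day).weekday()    # day of week, 0 = Monday
    return [label, month, day, dow, dep_hour, distance,
            tmin, tmax, wind, snow]

# One ORD departure, 23 minutes late, on a cold January Monday
row = to_features(2007, 1, 15, 17, 925.0, 23, -8.0, 1.0, 12.0, 0.0)
```

In the notebook, a function like this would be applied inside a Spark map() over the raw records to produce the full feature matrix.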

You can follow the example or re-run it on your own cluster and IPython instance by using the code in the accompanying IPython Notebook.


Summary

In this blog post and its associated IPython Notebook, we demonstrated how to build a predictive model with Hadoop, Spark, and MLlib.

We used Apache Spark on Hadoop to perform various data pre-processing and feature engineering tasks. We then applied MLlib machine learning algorithms (Logistic Regression, Support Vector Machines, and Decision Trees) to the resulting datasets, and showed how successive iterations added new and improved features, resulting in a better-performing model.
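Each model is judged on the 2008 hold-out data. As a minimal sketch of the binary-classification metrics involved (the labels and predictions below are made up, not results from the notebook):

```python
# Minimal sketch of binary-classification evaluation from a confusion matrix.
def evaluate(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    precision = tp / (tp + fp)      # of predicted delays, how many were real
    recall = tp / (tp + fn)         # of real delays, how many were caught
    accuracy = (tp + tn) / len(labels)
    return precision, recall, accuracy

labels = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = flight actually delayed
preds  = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model output
p, r, a = evaluate(labels, preds)
```

Comparing these metrics across iterations is what makes the improvement from each new feature (such as the weather data) measurable.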

In the next part of this multi-part blog series, we will show how to perform the same learning task with R.



