November 12, 2014

Data Science with Apache Hadoop: Predicting Airline Delays


With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN as the architectural center of Modern Data Architecture (MDA) allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.


It is a common misconception that the way data scientists apply predictive learning algorithms like Linear Regression, Random Forest or Neural Networks to large datasets requires a dramatic change in approach, in tooling, or in usage of siloed clusters. Not so: no dramatic change; no dedicated clusters; using existing modeling tools will suffice.

In fact, the big change is in what is known as “feature engineering”—the process by which very large raw data is transformed into a “feature matrix.” Enabled by Apache Hadoop with YARN as an ideal platform, this transformation of large raw datasets (terabytes or petabytes) into a feature matrix is now scalable and not limited by RAM or compute power of a single node.

Since the output of the feature engineering step (the “feature matrix”) tends to be relatively small in size (typically in the MB or GB scale), a common choice is to run the learning algorithm on a single machine (often with multiple cores and high amount of RAM), allowing us to utilize a plethora of existing robust tools and algorithms from R packages, Python’s Scikit-learn, or SAS.
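To make the idea of a feature matrix concrete, here is a minimal sketch of the per-record logic in plain Python. It is not the Pig script from the notebook: the column names are assumed from the public airline on-time dataset, the 15-minute threshold for calling a flight "delayed" is an illustrative assumption, and the three sample records are made up.

```python
import csv
import io

# Assumption: a flight counts as "delayed" when it departs 15+ minutes late.
DELAY_THRESHOLD_MIN = 15

def to_feature_row(rec):
    """Map one raw flight record (a dict of strings) to (features, label)."""
    if rec["Cancelled"] == "1" or rec["DepDelay"] in ("", "NA"):
        return None  # no usable departure delay for this flight
    features = [
        int(rec["Month"]),
        int(rec["DayofMonth"]),
        int(rec["DayOfWeek"]),
        int(rec["CRSDepTime"].zfill(4)[:2]),  # scheduled departure hour
        int(rec["Distance"]),
    ]
    label = 1 if int(rec["DepDelay"]) >= DELAY_THRESHOLD_MIN else 0
    return features, label

# Three made-up records standing in for the raw CSV files.
SAMPLE = """Year,Month,DayofMonth,DayOfWeek,CRSDepTime,DepDelay,Distance,Cancelled
2007,1,1,1,730,25,810,0
2007,1,2,2,1455,-3,515,0
2007,1,3,3,900,NA,688,1
"""

rows = [to_feature_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
rows = [r for r in rows if r is not None]
X = [features for features, _ in rows]  # the feature matrix
y = [label for _, label in rows]        # the target vector
print(X)
print(y)
```

On a cluster the same mapping would run over every raw record in parallel; only the much smaller `X` and `y` need to reach the single machine that trains the model.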

In this multi-part blog post and its accompanying IPython Notebook, we will demonstrate an example step-by-step solution to a supervised learning problem. We will show how to solve this problem with various tools and libraries and how they integrate with Hadoop. In part I we focus on Apache Pig, Python, and Scikit-learn, while in subsequent parts we will explore other alternatives such as R or Spark MLlib.

Cluster Configuration

For all the examples using machine learning techniques in this series, we employed a small Hortonworks Data Platform (HDP) cluster with the following configuration:

  • 4 Nodes
  • Each node with 4 cores and 16GB RAM and 500GB disk space
  • Each node runs CentOS 6 and HDP 2.1

Pig and Python Can’t Fly But Can Predict Flight Delays

Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travelers and airlines. As our example use-case, we will build a supervised learning model that predicts airline delay from historical flight data and weather information.


We start by exploring the airline delay dataset available here. This dataset includes details about flights in the US from the years 1987-2008. Every row in the dataset includes 29 variables (which you can peruse from the link below).

The IPython notebook here shows the step-by-step construction of the feature matrix and the machine learning model, along with implementation details and the model evaluation steps.

You can follow the notebook example or re-run it on your own cluster and IPython instance.

This IPython notebook demonstrates:

  1. Exploring the raw data to determine various properties of features and how predictive these features might be for the task at hand.
  2. Using Pig and Python to prepare the feature matrix from the raw data. We perform 3 iterations. With each iteration, we improve our feature set, resulting in better overall predictive performance. For example, in the 3rd iteration, we enrich the input data with weather information, resulting in predictive features such as temperature, snow conditions, or wind speed. Iterating like this is common practice in data science.
  3. Using Python’s Scikit-learn, we build various models, such as Logistic Regression or Random Forest. The feature matrix fits in memory (as is usually the case), so we run this step on a single machine with 16 GB of memory.
  4. Using Scikit-learn, we evaluate performance of the models and compare between iterations.
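As an illustration of the evaluation in step 4, the sketch below computes accuracy, precision, recall, and F1 by hand in plain Python. Scikit-learn provides the same metrics through `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` in `sklearn.metrics`; the hand-written version only makes the definitions explicit. The label vectors are hypothetical and are not results from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means 'delayed'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for one model."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / float(len(y_true))
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical held-out labels and predictions from two feature iterations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "iteration 1": [0, 0, 1, 0, 0, 1, 0, 0],
    "iteration 3": [1, 0, 1, 0, 0, 1, 1, 0],
}

for name in sorted(predictions):
    print(name, evaluate(y_true, predictions[name]))
```

Comparing the same metrics on the same held-out flights is what makes the iterations comparable; when delays are the minority class, precision and recall tend to be more informative than accuracy alone.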


In this blog post we demonstrated how to build a predictive model with Hadoop and Python using open source tools. We employed Hadoop for data pre-processing and feature engineering, then applied Scikit-learn machine learning algorithms to the resulting datasets. In addition, we showed how iterating on the feature set, adding new and improved features such as weather variables, improves the model's predictive performance.

In the next part of this multi-part blog post, we will show how to perform the same learning task with Spark and MLlib.

  • Can you please guide me on how to install IPython on HDP 2.2 (CentOS 6.6), without using the sandbox, and configure it for Spark 1.2.0 and HDP 2.2?

  • Which weather dataset was used here? Can you please give us the specific link on the NOAA website? There are lots of datasets on that website.

  • Well written detailed article making it very easy for anyone to try out. Liked that the output log was pasted as well.
    Noticed that you wrote rmf statements in the middle of the Pig script, before the STORE statements, which interrupts execution. Your two jobs for 2007 and 2008 were run serially because of that. Moving them before LOAD would make the jobs run in parallel.

  • This is somewhat old but still great data science demo content.
    One thing: I’m trying the 3rd iteration of the Pig script (notebook #1), where it adds weather data to increase the accuracy. However, I’m getting an error because the util.py described in the notebook does not define the to_date() UDF. Could you please correct the util.py definition and/or provide the code you used for the to_date() function?

  • This Python script used to be in the original blog post, but seems to no longer be there. So reposting the original Python file here:

    from datetime import date

    # outputSchema is the decorator Pig makes available to Jython UDFs.

    @outputSchema("value: int")
    def get_hour(val):
        # Pad a time string such as "730" to "0730" and keep the hour.
        return int(val.zfill(4)[:2])

    @outputSchema("date: chararray")
    def to_date(year, month, day):
        # Format the date as YYYYMMDD.
        s = "%04d%02d%02d" % (year, month, day)
        return s

    holidays = [
        date(2007, 1, 1), date(2007, 1, 15), date(2007, 2, 19), date(2007, 5, 28), date(2007, 6, 7), date(2007, 7, 4),
        date(2007, 9, 3), date(2007, 10, 8), date(2007, 11, 11), date(2007, 11, 22), date(2007, 12, 25),
        date(2008, 1, 1), date(2008, 1, 21), date(2008, 2, 18), date(2008, 5, 22), date(2008, 5, 26), date(2008, 7, 4),
        date(2008, 9, 1), date(2008, 10, 13), date(2008, 11, 11), date(2008, 11, 27), date(2008, 12, 25),
    ]

    @outputSchema("days: int")
    def days_from_nearest_holiday(year, month, day):
        # Distance in days to the closest holiday in the list above.
        d = date(year, month, day)
        x = [(abs(d - h)).days for h in holidays]
        return min(x)

    • BTW in the meantime I had it working with the following code:

      # get date in YYYYMMDD format
      @outputSchema("date: chararray")
      def to_date(year, month, day):
          d = date(year, month, day)
          return d.strftime("%Y%m%d")

      I figured the date format from the weather data.
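For anyone who wants to check these UDFs before registering them in Pig, here is a minimal sketch that runs them in plain Python. The `outputSchema` stand-in below is a hypothetical no-op written for this test only (inside Pig the real decorator is supplied for Jython UDFs), and the holiday list is shortened to three dates for brevity.

```python
from datetime import date

def outputSchema(schema):
    """Hypothetical no-op stand-in for Pig's decorator, for local testing."""
    def decorator(func):
        func.output_schema = schema  # keep the schema string for inspection
        return func
    return decorator

@outputSchema("value: int")
def get_hour(val):
    # Pad a time string such as "730" to "0730" and keep the hour.
    return int(val.zfill(4)[:2])

@outputSchema("date: chararray")
def to_date(year, month, day):
    # Format the date as YYYYMMDD.
    return date(year, month, day).strftime("%Y%m%d")

# Shortened list for illustration; the full list is in the comment above.
holidays = [date(2007, 1, 1), date(2007, 7, 4), date(2007, 12, 25)]

@outputSchema("days: int")
def days_from_nearest_holiday(year, month, day):
    d = date(year, month, day)
    return min(abs(d - h).days for h in holidays)

print(get_hour("730"))
print(to_date(2007, 1, 5))
print(days_from_nearest_holiday(2007, 7, 6))
```

When moving the functions back into util.py, drop the stand-in decorator so that Pig's own `outputSchema` is used.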
