In this second part of our series on Data Science and Apache Hadoop, along with its accompanying IPython Notebook, we continue to demonstrate how to build a predictive model on Apache Hadoop using existing modeling tools. This time we'll use Apache Spark and MLlib.
Apache Spark is a relatively new entrant to the Hadoop ecosystem. Now running natively on Apache Hadoop YARN, the architectural center of Hadoop, Apache Spark is an in-memory data processing API and execution engine that is effective for machine learning and data science use cases. With Spark on YARN, data workers can use Spark for data science workloads alongside other data access engines, all accessing the same shared dataset on the same cluster.
In this blog post, we will show how to use Apache Spark via its Scala API to generate our feature matrix, and how to use MLlib (Spark's machine learning library) to build and evaluate supervised learning models.
Recall from part 1 that we are exploring a predictive model for flight delays. Our source dataset resides here, and it includes details about flights in the US from 1987 to 2008. We have also enriched the data with weather information: daily minimum and maximum temperatures, wind speed, snow conditions, and precipitation.
Just as in part 1, we will build a supervised learning model to predict flight delays for flights leaving O'Hare International Airport (ORD). We will use data from 2007 to build the model and test its validity using data from 2008.
Apache Spark's basic data abstraction is the RDD (Resilient Distributed Dataset), a fault-tolerant collection of elements that can be operated on and transformed in parallel across your Hadoop cluster.
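To make this concrete, here is a minimal sketch of creating and transforming RDDs. It assumes a SparkContext named sc is already in scope, as it is in spark-shell or a notebook session, and the HDFS path is hypothetical.

```scala
// Minimal RDD sketch; assumes a SparkContext named sc is in scope,
// as it is in spark-shell or a notebook session.
val nums = sc.parallelize(1 to 10)      // RDD from a local collection
val doubled = nums.map(_ * 2)           // transformation: lazy, returns a new RDD
println(doubled.reduce(_ + _))          // action: triggers computation (prints 110)

// RDDs can also be created from files in HDFS (hypothetical path):
val lines = sc.textFile("hdfs:///tmp/example.txt")
println(lines.count())
```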
Spark's API (available in Scala, Python, or Java) supports a variety of operations on RDDs, such as map(), flatMap(), filter(), join(), and more. For a full description of the API, please see the Spark programming guide.
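Applied to our use case, these operations compose naturally. The sketch below is illustrative only: the file paths and column positions are assumptions, and the real parsing logic lives in the accompanying notebook.

```scala
// Hypothetical paths and column positions, for illustration only.
case class Flight(date: String, origin: String, depDelay: Double)

val flights = sc.textFile("hdfs:///user/demo/airline/2007.csv")
  .map(_.split(","))
  .map(f => Flight(f(0), f(1), f(2).toDouble))  // assumed column layout
  .filter(_.origin == "ORD")                    // keep flights departing O'Hare

// Daily weather keyed by date (again, an assumed layout):
val weather = sc.textFile("hdfs:///user/demo/weather/2007.csv")
  .map(_.split(","))
  .map(w => (w(0), w(1).toDouble))              // (date, maxTemp)

// join() brings flight delays and weather together by date:
val joined = flights.map(f => (f.date, f.depDelay)).join(weather)
```

Because transformations are lazy, none of this touches the data until an action such as count() or collect() is invoked, which lets Spark pipeline the work efficiently.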
For brevity and readability, the detailed step-by-step construction of the feature matrix and machine learning models, along with the implementation details and model evaluation steps, is demonstrated in the IPython Notebook here; we encourage you to explore it and try the Spark and Scala code yourself.
The methodology entails the following steps:

1. Pre-process the raw flight and weather data with Spark: reading, parsing, and cleaning the source files.
2. Generate the feature matrices for 2007 (training) and 2008 (testing); a sketch of this step follows the list.
3. Build classification models with MLlib: Logistic Regression, Support Vector Machines, and Decision Trees.
4. Evaluate each model against the 2008 hold-out data, iterating with new and improved features.
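To make step 2 concrete, here is a hedged sketch that builds on the joined RDD from the earlier snippet: a flight is labeled as delayed if its departure delay is 15 minutes or more, and the single weather value stands in for the full engineered feature set in the notebook.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Label a flight as delayed (1.0) when its departure delay is 15+ minutes;
// the single weather feature is a placeholder for the full feature vector.
val labeledData = joined.map { case (date, (depDelay, maxTemp)) =>
  val label = if (depDelay >= 15) 1.0 else 0.0
  LabeledPoint(label, Vectors.dense(maxTemp))
}
```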
You can follow the example or re-run it on your own cluster and IPython instance by using the code in the accompanying IPython Notebook.
In this blog post and its associated IPython Notebook, we demonstrated how to build a predictive model with Hadoop, Spark, and MLlib.
We used Apache Spark on Hadoop to perform various data pre-processing and feature engineering tasks. We then applied MLlib machine learning algorithms, such as Logistic Regression, Support Vector Machines, and Decision Trees, to the resulting datasets, and showed how iteratively adding new and improved features results in better model performance.
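As a hedged sketch of that workflow, the snippet below trains two of these MLlib models and measures simple hold-out accuracy; trainData (the 2007 feature matrix) and testData (2008) are assumed to be RDDs of LabeledPoints built as above.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.tree.DecisionTree

// trainData (2007) and testData (2008) are assumed RDD[LabeledPoint]s.
val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainData)

// A Decision Tree variant, treating all features as continuous:
val dtModel = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](),
  "gini", 10, 32)  // impurity, maxDepth, maxBins

// Simple hold-out accuracy on the 2008 data:
val predsAndLabels = testData.map(p => (lrModel.predict(p.features), p.label))
val accuracy = predsAndLabels.filter { case (p, l) => p == l }.count().toDouble /
  testData.count()
println(s"Accuracy = $accuracy")
```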
In the next part of this multi-part blog series, we will show how to perform the same learning task with R.