In our series on Data Science and Hadoop, predicting airline delays, we demonstrated how to build predictive models with Apache Hadoop, using existing tools. In part 1, we employed Pig and Python; part 2 explored Spark, ML-Lib and Scala.
Throughout the series, the thesis, theme, topic, and algorithms were similar. That is, we wanted to dismiss the misconception that data scientists – when applying predictive learning algorithms, like Linear Regression, Random Forest or Neural Networks to large datasets – require dramatic changes to the tooling; that they need dedicated clusters; and that existing tools will not suffice.
Instead, we used the same HDP cluster configuration, the same machine learning techniques, the same data sets, and the same familiar tools like PIG, Python and Scikit-learn and Spark.
For the final part, we resort to Scalding and R. R is a very popular, robust and mature environment for data exploration, statistical analysis, plotting and machine learning. We will use R for data exploration, graphics as well as for building our predictive models with Random Forest and Gradient Boosted Trees. Scalding, on the other hand, provides Scala libraries that abstract Hadoop MapReduce and implement data pipelines. We demonstrate how to pre-process the data into a feature matrix using the Scalding framework.
For brevity I shall spare summarizing the methodology here, since both previous posts (and their accompanying IPython Notebooks) expound the steps, iteration and implementation code. Instead, I would urge that you read all parts as well as try the accompanying IPython Notebooks.
Finally, for this last installment in the series in Scaling and R, read its IPython Notebook for implementation details.