January 26, 2015

Data Science and Hadoop: Predicting Airline Delays – Part 3

In our series on Data Science and Hadoop, which focuses on predicting airline delays, we demonstrated how to build predictive models with Apache Hadoop using existing tools. In part 1, we employed Pig and Python; in part 2, we explored Spark, MLlib and Scala.

Throughout the series, the theme and the algorithms stayed the same. We wanted to dispel the misconception that data scientists who apply predictive learning algorithms such as Linear Regression, Random Forest or Neural Networks to large datasets must change their tooling dramatically, need dedicated clusters, or cannot get by with existing tools.

Instead, we used the same HDP cluster configuration, the same machine learning techniques, the same data sets, and the same familiar tools: Pig, Python, Scikit-learn and Spark.

For the final part, we turn to Scalding and R. R is a very popular, robust and mature environment for data exploration, statistical analysis, plotting and machine learning. We will use R for data exploration and graphics, as well as for building our predictive models with Random Forest and Gradient Boosted Trees. Scalding, on the other hand, is a Scala library that abstracts Hadoop MapReduce and is used to implement data pipelines. We demonstrate how to pre-process the data into a feature matrix using the Scalding framework.
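To make the idea of a feature matrix concrete, here is a minimal sketch of the kind of per-record transformation such a pipeline performs. It is written in plain Python rather than Scalding and is not the code from the notebook. The column names (`Month`, `DayofMonth`, `DayOfWeek`, `CRSDepTime`, `Distance`, `DepDelay`), the 15-minute delay threshold, the helper names and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Assumed column names, modeled on public airline on-time data; adjust to your files.
FEATURE_COLUMNS = ["Month", "DayofMonth", "DayOfWeek", "Distance"]
DELAY_THRESHOLD_MIN = 15  # assumption: a flight counts as delayed at 15+ minutes


def to_feature_row(record):
    """Map one flight record (a dict of strings) to (label, features), or None if unusable."""
    try:
        delay = float(record["DepDelay"])
        features = [float(record[name]) for name in FEATURE_COLUMNS]
        # Scheduled departure is HHMM; keep only the hour as a feature.
        features.append(float(int(record["CRSDepTime"]) // 100))
    except (KeyError, ValueError):
        return None  # missing or non-numeric field, e.g. "NA"
    label = 1 if delay >= DELAY_THRESHOLD_MIN else 0
    return label, features


def build_feature_matrix(csv_text):
    """Return (labels, matrix) for all usable rows in a CSV string with a header line."""
    labels, matrix = [], []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = to_feature_row(record)
        if row is not None:
            labels.append(row[0])
            matrix.append(row[1])
    return labels, matrix


# Made-up sample rows, for illustration only.
SAMPLE = """Month,DayofMonth,DayOfWeek,CRSDepTime,Distance,DepDelay
1,3,4,1955,810,8
1,4,5,0735,515,34
1,5,6,1240,688,NA
"""

labels, matrix = build_feature_matrix(SAMPLE)
print(labels)
print(matrix)
```

Each usable record becomes one row of numeric features plus a binary delayed/not-delayed label, and rows with missing values are dropped. In a Scalding job the same mapping and filtering would be expressed as steps of a pipeline that runs over the full data set on the cluster.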

For brevity, I will not summarize the methodology here, since both previous posts (and their accompanying IPython Notebooks) explain the steps, iterations and implementation code. Instead, I urge you to read all parts and try the accompanying IPython Notebooks.

Finally, for this last installment in the series, on Scalding and R, read its IPython Notebook for implementation details.




Ajay Sadhu says:

Excellent Post.
This is the most detailed and elaborated Data Science usecase I have found.
Got a clear picture on how to approach a problem.


John says:

Do you have any examples using SparkR?
