April 11, 2013

4 Reasons to use Hadoop for Data Science

Over the last 10 years or so, large web companies such as Google, Yahoo!, Amazon and Facebook have successfully applied large scale machine learning algorithms over big data sets, creating innovative data products such as online advertising systems and recommendation engines.

Apache Hadoop is quickly becoming a central store for big data in the enterprise, and thus is a natural platform with which enterprise IT can now apply data science to a variety of business problems such as product recommendation, fraud detection, and sentiment analysis.

Building on the patterns of Refine, Explore, Enrich that we described in our Hadoop Patterns of Use whitepaper, let’s review some of the major reasons to use Hadoop for data science, which are also captured in the following presentation:

[slideshare id=18622467&doc=whyhadoopfordatascience-130411110136-phpapp02]


Reason 1: Data exploration with full datasets

Data scientists love their working environment. Whether using R, SAS, Matlab or Python, they always need a laptop with lots of memory to analyze data and build models. In the world of big data, laptop memory is never enough, and sometimes not even close.

A common approach is to use a sample of the large dataset: as large a sample as can fit in memory. With Hadoop, you can now run many exploratory data analysis tasks on full datasets, without sampling. Just write a MapReduce job or a Pig or Hive script, launch it directly on Hadoop over the full dataset, and get the results right back to your laptop.
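To make the map-and-reduce pattern concrete, here is a minimal in-process sketch in plain Python of what a Hadoop Streaming job would do over a toy log file. The records and field layout are hypothetical; on a real cluster the shuffle and the reduce would run distributed over HDFS data rather than in a list comprehension.

```python
from itertools import groupby
from operator import itemgetter

# Toy stand-in for lines of a large log file stored in HDFS.
records = [
    "2013-04-01,clicks,12",
    "2013-04-01,clicks,7",
    "2013-04-02,views,40",
]

def mapper(line):
    # Map phase: emit a (key, value) pair per input line.
    date, metric, value = line.split(",")
    yield (date, metric), int(value)

def reducer(key, values):
    # Reduce phase: aggregate all values that share a key.
    return key, sum(values)

# Shuffle step: sort mapped pairs so identical keys are adjacent.
mapped = sorted(kv for line in records for kv in mapper(line))
results = [reducer(k, [v for _, v in group])
           for k, group in groupby(mapped, key=itemgetter(0))]
print(results)
```

The same mapper and reducer, written as stdin/stdout scripts, could be handed to Hadoop Streaming unchanged in spirit; the framework supplies the distributed shuffle that the `sorted` call simulates here.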

Reason 2: Mining larger datasets

In many cases, machine-learning algorithms achieve better results when they have more data to learn from, particularly for techniques such as clustering, outlier detection and product recommenders.

Historically, large datasets were not available or too expensive to acquire and store, and so machine-learning practitioners had to find innovative ways to improve models with rather limited datasets. With Hadoop as a platform that provides linearly scalable storage and processing power, you can now store ALL of the data in RAW format, and use the full dataset to build better, more accurate models.

Reason 3: Large scale pre-processing of raw data

As many data scientists will tell you, 80% of data science work typically consists of data acquisition, transformation, cleanup and feature extraction. This “pre-processing” step transforms the raw data into a format consumable by the machine-learning algorithm, typically a feature matrix.

Hadoop is an ideal platform for implementing this sort of pre-processing efficiently and in a distributed manner over large datasets, using MapReduce or tools like Pig and Hive, along with scripting languages like Python. For example, if your application involves text processing, you often need to represent data in word-vector format using TF-IDF, which involves counting word frequencies over a large corpus of documents: an ideal fit for a batch MapReduce job.
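As a sketch of that TF-IDF step, the snippet below computes term frequencies and document frequencies over a tiny in-memory corpus. The corpus is invented for illustration; at scale, each counting pass would be its own batch MapReduce job rather than a `Counter` over a list.

```python
import math
from collections import Counter

# Hypothetical three-document corpus; in practice this lives in HDFS.
docs = [
    "hadoop stores big data",
    "data science needs data",
    "hadoop scales storage",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    # Term frequency weighted by inverse document frequency.
    tf = Counter(doc_tokens)
    return {t: (count / len(doc_tokens)) * math.log(n_docs / df[t])
            for t, count in tf.items()}

# One sparse feature vector per document.
vectors = [tfidf(d) for d in tokenized]
```

Terms that appear in every document get a weight of zero (log of 1), which is exactly the behavior you want from TF-IDF: ubiquitous words carry no discriminative signal.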

Similarly, if your application requires joining large tables with billions of rows to create feature vectors for each data object, Hive or Pig are very useful and efficient for the task.
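The join itself is easy to picture in miniature. Below is a hedged Python sketch of the kind of reduce-side join Hive or Pig would perform: a (hypothetical) users table joined against an events table to produce one feature vector per user. On Hadoop this would be a few lines of HiveQL or Pig Latin running over billions of rows, not an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical tables: (user_id, country) and (user_id, event_type).
users = [(1, "US"), (2, "DE")]
events = [(1, "click"), (1, "buy"), (2, "click")]

# Aggregate events per user, as the reduce phase of a join would.
counts = defaultdict(lambda: defaultdict(int))
for uid, etype in events:
    counts[uid][etype] += 1

# Join on user_id to build one feature vector per user.
features = {uid: {"country": country,
                  "clicks": counts[uid]["click"],
                  "buys": counts[uid]["buy"]}
            for uid, country in users}
```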

Reason 4: Data agility

It is often mentioned that Hadoop is “schema on read”, as opposed to most traditional RDBMS systems, which require a strict schema definition before any data can be ingested into them.

“Schema on read” creates “data agility”: when a new data field is needed, one is not required to go through a lengthy project of schema redesign and database migration in production, which can last months. The positive impact ripples through an organization and very quickly everyone wants to use Hadoop for their project, to achieve the same level of agility, and gain competitive advantage for their business and product line.

If you want to learn more about data science with Apache Hadoop, you can Get Started over here, and we also invite you to attend Hortonworks’ “Applying Data Science with Apache Hadoop” classes.


