4 Reasons to use Hadoop for Data Science

Over the last 10 years or so, large web companies such as Google, Yahoo!, Amazon and Facebook have successfully applied large-scale machine learning algorithms to big data sets, creating innovative data products such as online advertising systems and recommendation engines.

Apache Hadoop is quickly becoming a central store for big data in the enterprise, and thus is a natural platform with which enterprise IT can now apply data science to a variety of business problems such as product recommendation, fraud detection, and sentiment analysis.

Building on the patterns of Refine, Explore, Enrich that we described in our Hadoop Patterns of Use whitepaper, let’s review some of the major reasons to use Hadoop for data science, which are also captured in the following presentation:


Reason 1: Data exploration with full datasets

Data scientists love their working environment. Whether using R, SAS, MATLAB or Python, they always need a laptop with lots of memory to analyze data and build models. In the world of big data, laptop memory is never enough, and sometimes not even close.

A common approach is to work with a sample of the large dataset, as large a sample as can fit in memory. With Hadoop, you can now run many exploratory data analysis tasks on full datasets, without sampling. Just write a map-reduce job or a Pig or Hive script, launch it directly on Hadoop over the full dataset, and get the results right back to your laptop.
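As a rough illustration of the map-reduce style of exploration described above, here is a minimal Python sketch that computes per-key summary statistics. The record layout (user, country, amount) and the function names are hypothetical, and the job is simulated locally by sorting the mapper output; on a cluster, the same mapper and reducer logic could read from standard input and write tab-separated key/value pairs under Hadoop Streaming.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (country, amount) for each well-formed 'user,country,amount' record."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # skip malformed rows instead of failing the whole job
        _, country, amount = parts
        try:
            yield country, float(amount)
        except ValueError:
            continue  # skip rows whose amount is not numeric


def reducer(pairs):
    """Aggregate count, mean and max per key; input must be sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        yield key, {
            "count": len(values),
            "mean": sum(values) / len(values),
            "max": max(values),
        }


def run_local(lines):
    """Simulate the shuffle-and-sort phase by sorting mapper output by key."""
    return dict(reducer(sorted(mapper(lines), key=itemgetter(0))))


# Hypothetical sample records; a real job would scan the full dataset.
records = ["u1,US,10.0", "u2,US,30.0", "u3,DE,5.0", "bad row"]
summary = run_local(records)
print(summary)
```

Because the reducer only ever sees one key's values at a time, the same logic scales from this four-line sample to the full dataset without sampling.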

Reason 2: Mining larger datasets

In many cases, machine-learning algorithms achieve better results when they have more data to learn from, particularly for techniques such as clustering, outlier detection and product recommenders.

Historically, large datasets were not available or too expensive to acquire and store, and so machine-learning practitioners had to find innovative ways to improve models with rather limited datasets. With Hadoop as a platform that provides linearly scalable storage and processing power, you can now store ALL of the data in RAW format, and use the full dataset to build better, more accurate models.

Reason 3: Large scale pre-processing of raw data

As many data scientists will tell you, 80% of data science work is typically spent on data acquisition, transformation, cleanup and feature extraction. This “pre-processing” step transforms the raw data into a format consumable by the machine-learning algorithm, typically in the form of a feature matrix.

Hadoop is an ideal platform for implementing this sort of pre-processing efficiently and in a distributed manner over large datasets, using map-reduce, tools like Pig and Hive, or scripting languages like Python. For example, if your application involves text processing, you often need to represent data in word-vector format using TF-IDF, which involves counting word frequencies over a large corpus of documents, a task well suited to a batch map-reduce job.
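To make the TF-IDF example concrete, here is a small Python sketch of the counting involved. It assumes one common variant of the weighting (relative term frequency multiplied by the natural log of the number of documents over the document frequency); other variants exist, and the function names and sample documents are hypothetical. The per-document counting corresponds to a map phase and the document-frequency aggregation to a reduce phase.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer; real pipelines usually do more cleanup."""
    return text.lower().split()


def term_counts(docs):
    """Map-like step: word counts per document, as {doc_id: Counter}."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def document_frequency(counts):
    """Reduce-like step: number of documents that contain each word."""
    df = Counter()
    for words in counts.values():
        df.update(words.keys())
    return df


def tfidf(docs):
    """Build one sparse word vector per document."""
    counts = term_counts(docs)
    df = document_frequency(counts)
    n_docs = len(docs)
    vectors = {}
    for doc_id, words in counts.items():
        total = sum(words.values())
        vectors[doc_id] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in words.items()
        }
    return vectors


# Hypothetical two-document corpus.
docs = {"d1": "hadoop stores big data", "d2": "hadoop runs batch jobs"}
vectors = tfidf(docs)
print(vectors["d1"])
```

In this sample, a word that appears in every document (here "hadoop") gets a weight of zero, while words unique to one document get a positive weight.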

Similarly, if your application requires joining large tables with billions of rows to create feature vectors for each data object, Hive and Pig are well suited to the task.
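The join-to-feature-vector idea can be sketched in plain Python as follows. This is only an in-memory illustration of the grouping that a Hive or Pig join performs across a cluster; the two tables (user profiles and purchases), the field choices and the function name are all hypothetical.

```python
from collections import defaultdict


def join_features(users, purchases):
    """Group both inputs by user_id and merge them into one vector per user.

    Behaves like a left join from users: every user is kept, and purchases
    with no matching user are dropped.
    """
    grouped = defaultdict(lambda: {"age": None, "amounts": []})
    for user_id, age in users:
        grouped[user_id]["age"] = age
    for user_id, amount in purchases:
        grouped[user_id]["amounts"].append(amount)

    features = {}
    for user_id, sides in grouped.items():
        if sides["age"] is None:
            continue  # purchase rows without a user profile
        amounts = sides["amounts"]
        # Feature vector: [age, number of purchases, total spend]
        features[user_id] = [sides["age"], len(amounts), sum(amounts)]
    return features


# Hypothetical sample tables.
users = [("u1", 34), ("u2", 27)]
purchases = [("u1", 20.0), ("u1", 5.0), ("u3", 9.0)]
features = join_features(users, purchases)
print(features)
```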

Reason 4: Data agility

It is often mentioned that Hadoop is “schema on read”, as opposed to most traditional RDBMSs, which require a strict schema definition before any data can be ingested into them.

“Schema on read” creates “data agility”: when a new data field is needed, one is not required to go through a lengthy project of schema redesign and database migration in production, which can last months. The positive impact ripples through an organization and very quickly everyone wants to use Hadoop for their project, to achieve the same level of agility, and gain competitive advantage for their business and product line.
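A minimal Python sketch of the “schema on read” idea: the raw records are stored untouched, and a schema is applied only when they are read. The JSON event layout, the field names and the read_with_schema helper are hypothetical, chosen purely for illustration.

```python
import json


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select and cast only the requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of breaking the read.
            row[field] = None if value is None else cast(value)
        yield row


# Raw events as they were stored; nothing was rewritten or migrated.
raw_events = [
    '{"user": "u1", "amount": "12.5", "device": "mobile"}',
    '{"user": "u2", "amount": "3"}',
]

schema_v1 = {"user": str, "amount": float}
# A new field is needed later: only the schema used for reading changes.
schema_v2 = {**schema_v1, "device": str}

rows_v1 = list(read_with_schema(raw_events, schema_v1))
rows_v2 = list(read_with_schema(raw_events, schema_v2))
print(rows_v2)
```

Adding the device field required no change to the stored data, which is the agility the paragraph above describes.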

If you want to learn more about data science with Apache Hadoop, you can Get Started over here, and we invite you to attend the Hortonworks “Applying data science with Apache Hadoop” classes.
