Over the last 10 years or so, large web companies such as Google, Yahoo!, Amazon and Facebook have successfully applied large scale machine learning algorithms over big data sets, creating innovative data products such as online advertising systems and recommendation engines.
Apache Hadoop is quickly becoming a central store for big data in the enterprise, and thus is a natural platform with which enterprise IT can now apply data science to a variety of business problems such as product recommendation, fraud detection, and sentiment analysis.
Building on the patterns of Refine, Explore, Enrich that we described in our Hadoop Patterns of Use whitepaper, let’s review some of the major reasons to use Hadoop for data science, which are also captured in the following presentation:
Data scientists love their working environment. Whether using R, SAS, Matlab or Python, they always need a laptop with lots of memory to analyze data and build models. In the world of big data, laptop memory is never enough, and sometimes not even close.
A common approach is to work with a sample of the large dataset, as large a sample as can fit in memory. With Hadoop, you can now run many exploratory data analysis tasks on full datasets, without sampling. Just write a map-reduce job or a Pig or Hive script, launch it directly on Hadoop over the full dataset, and get the results right back to your laptop.
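To make this concrete, here is a minimal sketch of what such an exploratory job might look like as a Python mapper and reducer, in the style of Hadoop Streaming. The input format (tab-separated `user_id` and `action`) and the job itself are illustrative assumptions, not from the original post; in practice the mapper and reducer would live in separate scripts passed to the streaming jar.

```python
from itertools import groupby

def mapper(lines):
    """Emit (user_id, 1) for each record of the form 'user_id\\taction'."""
    for line in lines:
        user_id = line.rstrip("\n").split("\t")[0]
        yield (user_id, 1)

def reducer(pairs):
    """Sum counts per user_id; assumes pairs arrive sorted by key,
    as the Hadoop shuffle-and-sort phase guarantees."""
    for user_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (user_id, sum(count for _, count in group))

if __name__ == "__main__":
    # Locally, sorted() stands in for the shuffle between map and reduce.
    records = ["u1\tclick", "u2\tview", "u1\tbuy"]
    for user_id, total in reducer(sorted(mapper(records))):
        print(f"{user_id}\t{total}")
```

The same two functions run unchanged whether the input is three lines on a laptop or billions of records on a cluster; only the framework around them scales.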
In many cases, machine-learning algorithms achieve better results when they have more data to learn from, particularly for techniques such as clustering, outlier detection and product recommenders.
Historically, large datasets were not available or too expensive to acquire and store, and so machine-learning practitioners had to find innovative ways to improve models with rather limited datasets. With Hadoop as a platform that provides linearly scalable storage and processing power, you can now store ALL of the data in RAW format, and use the full dataset to build better, more accurate models.
As many data scientists will tell you, 80% of data science work typically goes to data acquisition, transformation, cleanup and feature extraction. This “pre-processing” step transforms the raw data into a format consumable by the machine-learning algorithm, typically in the form of a feature matrix.
Hadoop is an ideal platform for implementing this sort of pre-processing efficiently and in a distributed manner over large datasets, using map-reduce or tools like Pig, Hive, and scripting languages like Python. For example, if your application involves text processing, you often need to represent data in word-vector format using TF-IDF, which involves counting word frequencies over a large corpus of documents, an ideal task for a batch map-reduce job.
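As a sketch of the TF-IDF computation just described, here is the same logic in plain Python over a tiny in-memory corpus; on Hadoop, each stage (term frequencies, document frequencies, final weights) would typically be its own map-reduce pass. The corpus and tokenization are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return a list of {term: tf-idf weight} dicts, one per tokenized doc."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            # Term frequency scaled by inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["big", "data", "hadoop"], ["big", "model"], ["hadoop", "pig", "hive"]]
vectors = tfidf(docs)
```

Terms that appear in every document get a weight near zero, while terms unique to one document score highest, which is exactly the property that makes TF-IDF vectors useful as model features.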
Similarly, if your application requires joining large tables with billions of rows to create feature vectors for each data object, Hive or Pig is very useful and efficient for this task.
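A minimal sketch of that join-to-features step, assuming two hypothetical tables keyed by `user_id` (the table and field names are mine, not from the post). In Pig this would be a single `JOIN` plus a `GROUP BY`; the Python below shows the same reduce-side logic in miniature.

```python
def join_features(users, purchases):
    """Join a users table (user_id, age) against a purchases table
    (user_id, amount), attaching total spend to each user's features."""
    totals = {}
    for user_id, amount in purchases:
        totals[user_id] = totals.get(user_id, 0.0) + amount
    # Left join: users with no purchases get a total_spend of 0.0.
    return {user_id: {"age": age, "total_spend": totals.get(user_id, 0.0)}
            for user_id, age in users}

users = [("u1", 34), ("u2", 28)]
purchases = [("u1", 9.99), ("u1", 25.00)]
features = join_features(users, purchases)
```

At billions of rows the dictionaries above stop fitting in memory, which is precisely the point where Hive or Pig takes over and distributes the same join across the cluster.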
It is often mentioned that Hadoop is “schema on read”, as opposed to most traditional RDBMSs, which require a strict schema definition before any data can be ingested into them.
“Schema on read” creates “data agility”: when a new data field is needed, one is not required to go through a lengthy project of schema redesign and database migration in production, which can last months. The positive impact ripples through an organization and very quickly everyone wants to use Hadoop for their project, to achieve the same level of agility, and gain competitive advantage for their business and product line.
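Schema on read can be illustrated in a few lines: raw JSON events are stored untouched, and the “schema” is simply the projection applied at query time. The event records and field names below are hypothetical; the point is that asking for a new field means changing the read-side code, not migrating a production database.

```python
import json

def read_events(raw_lines, fields):
    """Project each raw JSON record onto the requested fields at read time.
    Fields absent from a record simply come back as None."""
    for line in raw_lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

raw = ['{"user": "u1", "action": "click", "device": "mobile"}',
       '{"user": "u2", "action": "view"}']
# Yesterday the analysis asked only for user/action; today we add "device"
# with no schema redesign and no backfill of old records.
rows = list(read_events(raw, ["user", "action", "device"]))
```

Older records that never carried the new field still read cleanly, which is what makes iterating on new data fields a matter of hours rather than months.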