How to Get Started in Data Science
A lot of people ask me: how do I become a data scientist? I think the short answer is: as with any technical role, it isn’t necessarily easy or quick, but if you’re smart, committed and willing to invest in learning and experimentation, then of course you can do it.
In a previous post, I described my view on “What is a data scientist?”: it’s a hybrid role that combines the “applied scientist” with the “data engineer”. Many developers, statisticians, analysts and IT professionals have some partial background and are looking to make the transition into data science.
And so, how does one go about that? Your approach will likely depend on your previous experience. Here are some perspectives below from developers to business analysts.
If you’re a Java developer, you are familiar with software engineering principles and thrive on crafting software systems that perform complex tasks. Data science is all about building “data products”, essentially software systems that are based on data and algorithms.
A good first step is to understand the various algorithms in machine learning: which algorithms exist, which problems they solve and how they are implemented. It is also useful to learn how to use a modeling tool like R or Matlab. Libraries like WEKA, Vowpal Wabbit, and OpenNLP provide well-tested implementations of many common algorithms. If you’re not already familiar with Hadoop — learning map-reduce, Pig and Hive and Mahout will be valuable.
To deal with large datasets, learn more about Hadoop and its integration with Python via streaming.
Statisticians and applied scientists
If you’re coming from a statistics or machine-learning background, its likely you’ve already been using tools like R, Matlab or SAS for years to perform regression analysis, clustering analysis, classification or similar machine learning tasks.
R, Matlab and SAS are amazing tools for statistical analysis and visualization, with mature implementations for many machine learning algorithms.
However, these tools are typically used for data exploration and model development, and rarely used in isolation to build production-grade data products. In most cases, you need to mix-in various other software components in like Java or Python and integrate with data platforms like Hadoop, when building end-to-end data products.
Naturally, becoming familiar with one or more modern programming languages such as Python or Java is your first step. I found it very helpful to work closely with experienced data engineers to better understand the mindset and tools they use to build production-quality data products.
If your background is SQL, you have been using data for many years already and understand full well how to use data to gain business insights. Using Hive, which gives you access to large datasets on Hadoop with familiar SQL primitives, is likely to be an easy first step for you into the world of big data.
Data science often entails developing data products that utilize machine learning and statistics at a level that SQL cannot describe well or implement efficiently. Therefore, the next important step towards data science is to understand these types of algorithms (such as recommendation engines, decision trees, NLP) at a deeper theoretical level, and become familiar with current implementations by tools such as Mahout, WEKA, or Python’s Scikit-learn.
If you’re a Hadoop developer, you already know the complexities of large datasets and cluster computing. You are probably also familiar with Pig, Hive, and HBase and experienced in Java.
A good first step is to gain deep understanding of machine learning and statistics, and how these algorithms can be implemented efficiently for large datasets. A good first place to look is Mahout which implements many of these algorithms over Hadoop.
Another area to look into is “data cleanup”. Many algorithms assume a certain basic structure to the data before modeling begins. Unfortunately, in real life data is quite “dirty” and making it ready for modeling tends to take a large bulk of the work in data science. Hadoop is often a tool of choice for large-scale data cleanup and pre-processing, prior to modeling.
The road to data science is not a walk in the park. You have to learn a lot of new disciplines, programming languages, and most important – gain real-world experience. This takes time, effort and a personal investment. But what you find at the end of the road is quite rewarding.
And one more thing: a great way to get started on real world problems is to participate in a data science competition hosted on Kaggle.com. If you do it with a friend, it’s twice the fun.
For more resources on data analysis in Hadoop, take a look here.
Get Started using Hadoop to Analyze Data. This guide includes tutorials, videos and advice on integrating Hadoop with popular analytics packages.