How to Get Started in Data Science

A lot of people ask me: how do I become a data scientist? The short answer: as with any technical role, it isn’t easy or quick, but if you’re smart, committed, and willing to invest in learning and experimentation, you can certainly do it.

In a previous post, I described my view on “What is a data scientist?”: it’s a hybrid role that combines the “applied scientist” with the “data engineer”. Many developers, statisticians, analysts and IT professionals have some partial background and are looking to make the transition into data science.

And so, how does one go about that? Your approach will likely depend on your previous experience. Below are some suggestions tailored to several common backgrounds, from developers to business analysts.

Java Developers

If you’re a Java developer, you are familiar with software engineering principles and thrive on crafting software systems that perform complex tasks. Data science is all about building “data products”, essentially software systems that are based on data and algorithms.

A good first step is to understand the various algorithms in machine learning: which algorithms exist, which problems they solve, and how they are implemented. It is also useful to learn how to use a modeling tool like R or Matlab. Libraries like WEKA, Vowpal Wabbit, and OpenNLP provide well-tested implementations of many common algorithms. If you’re not already familiar with Hadoop, learning MapReduce, Pig, Hive, and Mahout will also be valuable.
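To make “how an algorithm is implemented” concrete, here is a toy k-nearest-neighbors classifier. This is a hedged sketch, not production code: it is written in Python rather than Java purely for brevity, and the two-cluster dataset is invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, labels, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], point))
    votes = [labels[i] for i in nearest[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy dataset: two well-separated clusters labeled "a" and "b"
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.5, 0.5)))  # -> "a"
print(knn_predict(train, labels, (5.5, 5.5)))  # -> "b"
```

Even a toy like this surfaces the questions a real implementation must answer: which distance metric to use, how to break ties, and how to avoid comparing against every training point when the dataset is large.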

Python Developers

If you’re a Python developer, you are familiar with software development and scripting, and may have already used some Python libraries that are often used in data science such as NumPy and SciPy.

Python has great support for data science applications, especially with libraries such as NumPy/SciPy and Pandas for data manipulation, Scikit-learn for machine learning, IPython for exploratory analysis, and Matplotlib for visualization.
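As a small illustration of that stack, here is Scikit-learn fitting a classifier on its bundled Iris dataset. This is just a sketch of the typical workflow (split, fit, score) and assumes scikit-learn is installed.

```python
# Requires scikit-learn (pip install scikit-learn)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The same fit/predict/score interface applies across nearly all Scikit-learn estimators, which is what makes it easy to swap algorithms during exploration.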

To deal with large datasets, learn more about Hadoop and its integration with Python via streaming.
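A minimal sketch of how Hadoop Streaming runs Python code: Hadoop pipes each input line to a mapper process on stdin, sorts the mapper’s output by key, and pipes the result to a reducer process. The word-count logic below is written as plain functions so it can be tested locally without a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' pair per word, as Streaming expects."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts for each word; input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Locally, Hadoop's shuffle-and-sort phase can be simulated with sorted():
for line in reducer(sorted(mapper(["the cat sat", "the mat"]))):
    print(line)
```

On a real cluster the two functions would live in two small scripts reading stdin and writing stdout, submitted via the hadoop-streaming jar; the streaming interface itself is nothing more than lines of text on stdin/stdout.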

Statisticians and applied scientists

If you’re coming from a statistics or machine-learning background, it’s likely you’ve already been using tools like R, Matlab or SAS for years to perform regression analysis, cluster analysis, classification or similar machine learning tasks.

R, Matlab and SAS are amazing tools for statistical analysis and visualization, with mature implementations for many machine learning algorithms.

However, these tools are typically used for data exploration and model development, and are rarely used in isolation to build production-grade data products. In most cases, you need to mix in other software components written in languages like Java or Python, and integrate with data platforms like Hadoop, to build end-to-end data products.

Naturally, becoming familiar with one or more modern programming languages such as Python or Java is your first step. I found it very helpful to work closely with experienced data engineers to better understand the mindset and tools they use to build production-quality data products. 

Business analysts

If your background is in SQL, you have been using data for many years already and understand full well how to use data to gain business insights. Using Hive, which gives you access to large datasets on Hadoop with familiar SQL primitives, is likely to be an easy first step for you into the world of big data.

Data science often entails developing data products that utilize machine learning and statistics at a level that SQL cannot describe well or implement efficiently. Therefore, the next important step towards data science is to understand these types of algorithms (such as recommendation engines, decision trees, NLP) at a deeper theoretical level, and to become familiar with current implementations in tools such as Mahout, WEKA, or Python’s Scikit-learn.
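For a taste of what lies beyond SQL, here is a decision tree trained with Scikit-learn (one of the libraries named above). The tiny churn-style dataset and its feature names are invented for illustration.

```python
# Requires scikit-learn (pip install scikit-learn)
from sklearn.tree import DecisionTreeClassifier

# Features: [age, monthly_spend]; label: 1 = churned, 0 = retained (made-up data)
X = [[25, 10], [30, 15], [45, 80], [50, 90], [35, 20], [60, 100]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[28, 12], [55, 85]]))  # low-spend vs high-spend customers
```

Unlike a SQL aggregate, the fitted tree encodes learned decision rules (here, effectively a split on monthly spend) that generalize to customers it has never seen.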

Hadoop developers

If you’re a Hadoop developer, you already know the complexities of large datasets and cluster computing. You are probably also familiar with Pig, Hive, and HBase and experienced in Java.

A good first step is to gain a deep understanding of machine learning and statistics, and of how these algorithms can be implemented efficiently for large datasets. A good first place to look is Mahout, which implements many of these algorithms over Hadoop.
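One illustration of what “implemented efficiently for large datasets” means is Welford’s online algorithm, which computes mean and variance in a single pass with constant memory — the kind of reformulation that map-reduce-friendly implementations rely on. This is a generic sketch in Python, not Mahout code.

```python
def online_mean_var(stream):
    """Return (mean, sample variance) of an iterable in one streaming pass."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n           # incrementally update the running mean
        m2 += delta * (x - mean)    # accumulate sum of squared deviations
    return mean, (m2 / (n - 1) if n > 1 else 0.0)

print(online_mean_var([1, 2, 3, 4]))  # mean 2.5, sample variance 5/3
```

A naive two-pass formula (compute the mean, then re-scan for squared deviations) would require reading the data twice — often prohibitive when the data lives on a cluster.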

Another area to look into is “data cleanup”. Many algorithms assume a certain basic structure in the data before modeling begins. Unfortunately, real-life data is quite “dirty”, and making it ready for modeling tends to account for a large share of the work in data science. Hadoop is often a tool of choice for large-scale data cleanup and pre-processing prior to modeling.
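As a small-scale illustration of what cleanup involves, here is a pandas sketch that deduplicates records, coerces a dirty numeric column, and normalizes strings. The messy records and column names are invented for this example; at Hadoop scale the same steps would typically run as Pig or MapReduce jobs.

```python
# Requires pandas (pip install pandas)
import pandas as pd

raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "age": ["34", "n/a", "n/a", "29", "51"],
    "country": [" US", "us ", "us ", "DE", None],
})

clean = (
    raw.drop_duplicates(subset="user_id")   # drop repeated records
       .assign(
           age=lambda d: pd.to_numeric(d["age"], errors="coerce"),  # "n/a" -> NaN
           country=lambda d: d["country"].str.strip().str.upper(),  # fix case/whitespace
       )
       .dropna(subset=["age"])              # keep only rows with a usable age
)
print(clean)
```

Each step here corresponds to a common real-world defect — duplicates, sentinel strings in numeric fields, inconsistent text encodings — and most modeling libraries will fail or silently misbehave if these are left in place.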

Final thoughts

The road to data science is not a walk in the park. You have to learn several new disciplines and programming languages and, most importantly, gain real-world experience. This takes time, effort and a personal investment. But what you find at the end of the road is quite rewarding.

There are many resources you might find useful: books, training, and presentations.

And one more thing: a great way to get started on real world problems is to participate in a data science competition hosted on Kaggle.com. If you do it with a friend, it’s twice the fun.

For more resources on data analysis in Hadoop, take a look here.


Comments

Chetan Karkhanis | September 27, 2013 at 8:43 pm

What about simple folks coming from traditional DW/BI backgrounds? Is there any hope for them?

    Reply | October 10, 2013 at 6:35 pm

    Chetan, of course there is hope for them. If you think about it, those folks have strong data engineering skills which are paramount to the role of data scientist. Further, if a person is skilled in BI then I would contend that they ought to have a solid foundation in basic statistics and math.

