September 25, 2013

How to Get Started in Data Science

A lot of people ask me: how do I become a data scientist? I think the short answer is: as with any technical role, it isn’t necessarily easy or quick, but if you’re smart, committed and willing to invest in learning and experimentation, then of course you can do it.

In a previous post, I described my view on “What is a data scientist?”: it’s a hybrid role that combines the “applied scientist” with the “data engineer”. Many developers, statisticians, analysts and IT professionals have some partial background and are looking to make the transition into data science.

So how does one go about that? Your approach will likely depend on your previous experience. Below are some perspectives for different backgrounds, from developers to business analysts.

Java Developers

If you’re a Java developer, you are familiar with software engineering principles and thrive on crafting software systems that perform complex tasks. Data science is all about building “data products”, essentially software systems that are based on data and algorithms.

A good first step is to understand the various algorithms in machine learning: which algorithms exist, which problems they solve and how they are implemented. It is also useful to learn how to use a modeling tool like R or Matlab. Libraries like WEKA, Vowpal Wabbit, and OpenNLP provide well-tested implementations of many common algorithms. If you’re not already familiar with Hadoop, learning MapReduce, Pig, Hive and Mahout will be valuable.

Python Developers

If you’re a Python developer, you are familiar with software development and scripting, and may have already used some Python libraries that are often used in data science such as NumPy and SciPy.

Python has great support for data science applications, especially through libraries such as NumPy/SciPy, Pandas and Scikit-learn, along with IPython for exploratory analysis and Matplotlib for visualizations.

To deal with large datasets, learn more about Hadoop and its integration with Python via streaming.
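
As a rough sketch of what streaming looks like in practice: Hadoop Streaming runs any executable as a mapper or reducer, feeding it records on standard input and collecting tab-separated key/value lines from standard output. The word count below follows that contract. The function names (map_words, reduce_counts, run_locally) and the map/reduce command-line argument are made up for this illustration, and run_locally only imitates the sort that Hadoop performs between the two phases.

```python
import sys
from itertools import groupby


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' record per word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum the counts per word. Input must already be sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Imitate map -> sort -> reduce in a single process (for experimentation only)."""
    return list(reduce_counts(sorted(map_words(lines))))


# Hypothetical entry point: run this file with "map" or "reduce" as the only
# argument so it can serve as a streaming mapper or reducer.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = map_words if sys.argv[1] == "map" else reduce_counts
    for record in step(sys.stdin):
        print(record)
```

Calling run_locally(["the cat", "The dog the"]) returns ["cat\t1", "dog\t1", "the\t3"]. Keeping the mapper and reducer as plain functions over lines makes them easy to try on a small sample before submitting a real job.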

Statisticians and applied scientists

If you’re coming from a statistics or machine-learning background, it’s likely you’ve already been using tools like R, Matlab or SAS for years to perform regression analysis, clustering analysis, classification or similar machine learning tasks.

R, Matlab and SAS are amazing tools for statistical analysis and visualization, with mature implementations for many machine learning algorithms.

However, these tools are typically used for data exploration and model development, and are rarely used in isolation to build production-grade data products. In most cases, when building end-to-end data products you need to mix in various other software components written in languages like Java or Python, and integrate with data platforms like Hadoop.

Naturally, becoming familiar with one or more modern programming languages such as Python or Java is your first step. I found it very helpful to work closely with experienced data engineers to better understand the mindset and tools they use to build production-quality data products. 

Business analysts

If your background is SQL, you have been using data for many years already and understand full well how to use data to gain business insights. Using Hive, which gives you access to large datasets on Hadoop with familiar SQL primitives, is likely to be an easy first step for you into the world of big data.

Data science often entails developing data products that utilize machine learning and statistics at a level that SQL cannot describe well or implement efficiently. Therefore, the next important step towards data science is to understand these types of algorithms (such as recommendation engines, decision trees, NLP) at a deeper theoretical level, and become familiar with current implementations in tools such as Mahout, WEKA, or Python’s Scikit-learn.
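
To make “a deeper theoretical level” a little more concrete, here is a minimal sketch of the core step in a decision tree: choosing the threshold that best separates the classes on one numeric feature. It uses Gini impurity, one common splitting criterion. The function names and the toy age/purchase data are hypothetical, and real libraries such as Scikit-learn handle many features, recursion and pruning far more efficiently.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split.

    The threshold is None when no split improves on the unsplit data.
    """
    best = (None, gini(labels))
    # The largest value is skipped: splitting there would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data (invented): customer ages and whether they bought a product.
ages = [18, 22, 25, 40, 52, 60]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, bought)
```

On this toy data the best split is age <= 25, which separates the two classes completely (impurity 0.0). A full decision tree repeats this search on each resulting subset until the subsets are pure or too small to split.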

Hadoop developers

If you’re a Hadoop developer, you already know the complexities of large datasets and cluster computing. You are probably also familiar with Pig, Hive, and HBase and experienced in Java.

A good first step is to gain a deep understanding of machine learning and statistics, and of how these algorithms can be implemented efficiently for large datasets. A good place to start is Mahout, which implements many of these algorithms over Hadoop.

Another area to look into is “data cleanup”. Many algorithms assume a certain basic structure to the data before modeling begins. Unfortunately, real-life data is often quite “dirty”, and making it ready for modeling tends to take up the bulk of the work in data science. Hadoop is often a tool of choice for large-scale data cleanup and pre-processing, prior to modeling.
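
A minimal sketch of what such cleanup can involve, using only the Python standard library. The record layout (name and age fields) and the clean_records function are invented for illustration. On a large dataset, the per-record normalization and validation would typically run inside a Hadoop job rather than in a single process, with de-duplication handled in a reduce step.

```python
def clean_records(raw_records):
    """Normalize, validate and de-duplicate raw records with 'name' and 'age' fields."""
    seen = set()
    cleaned = []
    for record in raw_records:
        # Normalize: trim whitespace and use consistent capitalization.
        name = (record.get("name") or "").strip().title()
        age = record.get("age")
        age_text = "" if age is None else str(age).strip()
        # Validate: drop rows with a missing name or a non-numeric age.
        if not name or not age_text.isdigit():
            continue
        row = (name, int(age_text))
        # De-duplicate: keep only the first occurrence of each cleaned row.
        if row in seen:
            continue
        seen.add(row)
        cleaned.append({"name": row[0], "age": row[1]})
    return cleaned


# Invented sample of "dirty" input.
raw = [
    {"name": " alice ", "age": "34"},
    {"name": "ALICE", "age": 34},      # duplicate once normalized
    {"name": "", "age": "20"},         # missing name
    {"name": "bob", "age": "n/a"},     # non-numeric age
    {"name": "carol", "age": None},    # missing age
    {"name": "dave", "age": " 41 "},
]
clean = clean_records(raw)
```

Of the six sample rows only two survive: Alice (34) and Dave (41). Even in this tiny case, most of the code deals with malformed input rather than with anything resembling modeling.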

Final thoughts

The road to data science is not a walk in the park. You have to learn a lot of new disciplines and programming languages and, most importantly, gain real-world experience. This takes time, effort and personal investment. But what you find at the end of the road is quite rewarding.

There are many resources you might find useful: books, training, and presentations.

And one more thing: a great way to get started on real-world problems is to participate in a data science competition hosted on Kaggle.com. If you do it with a friend, it’s twice the fun.

For more resources on data analysis in Hadoop, take a look here.



    • Chetan, of course there is hope for them. If you think about it, those folks have strong data engineering skills which are paramount to the role of data scientist. Further, if a person is skilled in BI then I would contend that they ought to have a solid foundation in basic statistics and math.

  • Hey, thanks for the info. But I have often seen people coming from diverse backgrounds like social science, management and economics opting for data science. What are their chances of doing well in this field?

  • Can you please let me know what role a system admin / infrastructure person can play in the Hadoop ecosystem? I am interested to know beyond just setting up a Hadoop / HDFS / HBase cluster and using Sqoop for data transfers.

  • I’m doing my master’s in Mechanical Engineering and am quite familiar with Matlab, SQL, statistics and SAS. What else should I learn to become a data scientist? In other words, how should I use my knowledge to earn the title of data scientist?

    • I would start playing with Spark on YARN and on HDP, Scala, Spark SQL, Hive, R and MLlib, all available by downloading our Sandbox and working through the tutorials.

  • First of all my big thanks to you sir for such a great article.

    I am an Oracle Certified DBA (OCP) in 11g R2 Version. I also have knowledge of Linux and the C language. Now I really want to become a Data Scientist. What should I do? What technologies do I have to learn from now on to get there? Please tell me the first step of this learning process.

    Thanks in Advance

    • I would start by downloading our Sandbox and playing around with some basic tutorials that deal with ETL processing: extracting, transforming and loading data. With a DB background and SQL knowledge you will be able to pick up Apache Hive easily. Second, start playing with Spark on YARN and Spark SQL. For a DBA, it’s a good transition. But the biggest challenge is dealing with the NoSQL concepts of Hadoop and the repositories.

  • Hi. I am an MBA and a consumer market researcher by profession. I am really interested in data analysis and data science. I don’t have a computer science background. How do you suggest I should start?

    • I would look into taking some workshops in Scala, Python, or Java. Attend Spark Summit. Start with basic programming concepts and work your way up.

  • Good advice. Am I right that wherever you mentioned Java/Python I can substitute C++, and wherever you mentioned MATLAB/R I can substitute Octave?
    Nice time

  • Hi there.. couple of questions.. I am pursuing MBA in E commerce, but my bachelors is in management only.. thus, no technical background.. .I want to become a data scientist.. what will be ur suggestion to me… plz suggest fields that can be a good career option for me. and also high paying. 😛 ..

    • I would consider learning some programming languages: Python, Scala, and Java. There may be some graduate or post-graduate courses offered by universities where you reside. In fact, UC Berkeley offers a master’s program in Data Science.

  • Hi,
    I’m from a finance background with a liking for statistics. I have my own audit and taxation practice in India. However, I’m really keen to know whether it is possible for me to venture into data analytics as a practice. I mean, would I be able to get assignments for data analysis? I don’t want to take up a job in IT. Please advise how I should begin and how I can turn this opportunity into reality.

  • Hello Sir,
    I have worked for almost 7 years as a software developer in C++ application development. Now I want to become a data scientist. Would it be a good career path for me?

  • Hey,

    Probably a bit late on the scene, but I have a question.

    I have around 8 years of experience as a DBA, primarily with Oracle databases.
    Over the last few years I got an opportunity to work on various other databases focusing on analytics, such as Greenplum/Redshift, as part of a Big Data project. In that project I got a few days of training on Spark/Hive etc. I was leading the DBA team for setting up the Greenplum cluster. I have good knowledge of scripting in Python/Shell/Perl.

    I somehow really liked the idea of deriving meaning of the data which I was handling and interacting with the Business Side as whole. And that made me curious and look more into Data Science as career change.

    My question is: would it be necessary to get a master’s degree in Data Science, and how helpful would it be? Are there companies who would hire someone with a DBA background as a Data Scientist?

  • Hi,
    The “For more resources on data analysis in Hadoop, take a look here” link doesn’t work.
    Can you please take a look?
