September 25, 2013

How to Get Started in Data Science

A lot of people ask me: how do I become a data scientist? I think the short answer is: as with any technical role, it isn’t necessarily easy or quick, but if you’re smart, committed and willing to invest in learning and experimentation, then of course you can do it.

In a previous post, I described my view on “What is a data scientist?”: it’s a hybrid role that combines the “applied scientist” with the “data engineer”. Many developers, statisticians, analysts and IT professionals have some partial background and are looking to make the transition into data science.

So how does one go about that? Your approach will likely depend on your previous experience. Below are some perspectives for different backgrounds, from developers to business analysts.

Java Developers

If you’re a Java developer, you are familiar with software engineering principles and thrive on crafting software systems that perform complex tasks. Data science is all about building “data products”, essentially software systems that are based on data and algorithms.

A good first step is to understand the various algorithms in machine learning: which algorithms exist, which problems they solve and how they are implemented. It is also useful to learn how to use a modeling tool like R or Matlab. Libraries like WEKA, Vowpal Wabbit, and OpenNLP provide well-tested implementations of many common algorithms. If you’re not already familiar with Hadoop, learning MapReduce, Pig, Hive and Mahout will be valuable.

Python Developers

If you’re a Python developer, you are familiar with software development and scripting, and may have already used some Python libraries that are often used in data science such as NumPy and SciPy.

Python has great support for data science applications, especially with libraries such as NumPy/SciPy, Pandas and Scikit-learn, IPython for exploratory analysis, and Matplotlib for visualizations.

To deal with large datasets, learn more about Hadoop and its integration with Python via Hadoop Streaming.
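
As a rough illustration of the streaming model, here is a minimal word-count sketch. It assumes the usual streaming convention of tab-separated key/value text records, with the reducer receiving records sorted by key. The function names (`mapper`, `reducer`, `run_locally`) and the sample input are hypothetical, and the whole pipeline is simulated on one machine rather than submitted to a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; records must arrive sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce on a single machine."""
    shuffled = sorted(mapper(lines), key=lambda record: record.split("\t")[0])
    return list(reducer(shuffled))


if __name__ == "__main__":
    sample = ["big data is big", "data science"]  # illustrative input only
    for record in run_locally(sample):
        print(record)
```

In a real streaming job, the mapper and reducer would each live in their own script that reads lines from standard input and writes records to standard output. Hadoop handles the sort between the two phases.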

Statisticians and applied scientists

If you’re coming from a statistics or machine-learning background, it’s likely you’ve already been using tools like R, Matlab or SAS for years to perform regression analysis, clustering analysis, classification or similar machine learning tasks.

R, Matlab and SAS are amazing tools for statistical analysis and visualization, with mature implementations for many machine learning algorithms.

However, these tools are typically used for data exploration and model development, and rarely used in isolation to build production-grade data products. In most cases, when building end-to-end data products you need to mix in other software components written in languages like Java or Python, and integrate with data platforms like Hadoop.

Naturally, becoming familiar with one or more modern programming languages such as Python or Java is your first step. I found it very helpful to work closely with experienced data engineers to better understand the mindset and tools they use to build production-quality data products. 

Business analysts

If your background is SQL, you have been using data for many years already and understand full well how to use data to gain business insights. Using Hive, which gives you access to large datasets on Hadoop with familiar SQL primitives, is likely to be an easy first step for you into the world of big data.
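
To make the "familiar SQL primitives" point concrete, here is a small sketch of the kind of aggregation an analyst might write. It runs against an in-memory SQLite database purely as a local stand-in, since a Hive cluster is not assumed here. The `orders` table, its columns and the sample rows are all hypothetical. A similar `SELECT ... GROUP BY` statement is what you would typically write against a Hive table.

```python
import sqlite3

# In-memory SQLite database, used only as a local stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "east", 120.0),
        ("bob", "west", 80.0),
        ("carol", "east", 200.0),
        ("dave", "west", 40.0),
    ],
)

# Plain SQL aggregation: order count and revenue per region.
query = """
    SELECT region, COUNT(*) AS num_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
"""
revenue_by_region = conn.execute(query).fetchall()
for region, num_orders, revenue in revenue_by_region:
    print(region, num_orders, revenue)

conn.close()
```

The query itself is the part that carries over. What changes on Hadoop is where the data lives and how the engine executes the statement, not the way you express the question.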

Data science often entails developing data products that utilize machine learning and statistics at a level that SQL cannot describe well or implement efficiently. Therefore, the next important step towards data science is to understand these types of algorithms (such as recommendation engines, decision trees, NLP) at a deeper theoretical level, and become familiar with current implementations by tools such as Mahout, WEKA, or Python’s Scikit-learn.

Hadoop developers

If you’re a Hadoop developer, you already know the complexities of large datasets and cluster computing. You are probably also familiar with Pig, Hive, and HBase and experienced in Java.

A good first step is to gain a deep understanding of machine learning and statistics, and how these algorithms can be implemented efficiently for large datasets. A good place to start is Mahout, which implements many of these algorithms over Hadoop.

Another area to look into is “data cleanup”. Many algorithms assume a certain basic structure to the data before modeling begins. Unfortunately, real-life data is quite “dirty”, and making it ready for modeling tends to take up the bulk of the work in data science. Hadoop is often a tool of choice for large-scale data cleanup and pre-processing, prior to modeling.
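
To give a feel for what cleanup involves, here is a small single-machine sketch. The record layout (`name` and `age` fields), the validity rules and the function names are assumptions made for illustration. On Hadoop the same per-record logic would typically run inside a map step over a much larger dataset.

```python
def clean_record(raw):
    """Return a normalized record, or None if the row is unusable."""
    name = (raw.get("name") or "").strip().title()
    age_text = (raw.get("age") or "").strip()
    if not name or not age_text.isdecimal():
        return None  # missing name, or age is missing / not a whole number
    age = int(age_text)
    if not 0 < age < 120:
        return None  # implausible value, treated here as a data-entry error
    return {"name": name, "age": age}


def clean_dataset(rows):
    """Normalize rows, dropping unusable records and exact duplicates."""
    seen = set()
    cleaned = []
    for raw in rows:
        record = clean_record(raw)
        if record is None:
            continue
        key = (record["name"], record["age"])
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    dirty = [
        {"name": "  alice SMITH ", "age": "34"},
        {"name": "Bob Jones", "age": "n/a"},
        {"name": "", "age": "51"},
        {"name": "Alice Smith", "age": "34 "},
        {"name": "carol diaz", "age": "207"},
        {"name": "ed LEE", "age": "45"},
    ]
    for record in clean_dataset(dirty):
        print(record)
```

Even in this toy case, only two of the six input rows survive. That is a fair reflection of why cleanup takes so much of the effort.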

Final thoughts

The road to data science is not a walk in the park. You have to learn a lot of new disciplines and programming languages and, most important, gain real-world experience. This takes time, effort and a personal investment. But what you find at the end of the road is quite rewarding.

There are many resources you might find useful: books, training, and presentations.

And one more thing: a great way to get started on real-world problems is to participate in a data science competition. If you do it with a friend, it’s twice the fun.

For more resources on data analysis in Hadoop, take a look here.



Chetan Karkhanis says:

What about simple folks coming from traditional DW/BI backgrounds? Is there any hope for them?

Louis Frolio says:

Chetan, of course there is hope for them. If you think about it, those folks have strong data engineering skills which are paramount to the role of data scientist. Further, if a person is skilled in BI then I would contend that they ought to have a solid foundation in basic statistics and math.

Shaona Mukherjee says:

Hey, thanks for the info. But I have often seen people coming from diverse backgrounds like social science, management and economics opting for data science. What are their chances of doing well in this field?

Vijay says:

Can you please let me know what role a system admin / infrastructure person can play in the Hadoop ecosystem? I am interested to know beyond just setting up a Hadoop / HDFS / HBase cluster and using Sqoop for data transfers.

Raj kishore jaiswal says:

Very useful.
Thank you so much

Santosh says:

I’m doing my master’s in Mechanical Engineering and quite familiar with Matlab, SQL, Statistics and SAS. What should be the other thing for me to learn to become a data scientist. In other sense how should I utilize my knowledge to get a tag of data scientist.

Jules S. Damji says:

I would start playing with Spark on YARN and on HDP, Scala, Spark SQL, Hive, R and MLlib, all available by downloading our Sandbox and working through the tutorials.

Aditya Atreya says:

First of all my big thanks to you sir for such a great article.

I am an Oracle Certified DBA (OCP) in the 11g R2 version. I also have knowledge of Linux and the C language. Now I really want to become a data scientist. What should I do? What technologies do I have to learn from now on? Please tell me the first step of this learning process.

Thanks in Advance

Jules S. Damji says:

I would start by downloading our Sandbox and playing around with some basic tutorials that deal with ETL processing: extracting, transforming and loading data. With a DB background and SQL knowledge you will be able to pick up Apache Hive easily. Second, start playing with Spark on YARN and Spark SQL. For a DBA, it’s a good transition. But the biggest challenge is dealing with the NoSQL concepts of Hadoop and the repositories.

Debolin Dhar says:

Is there any scope for transition to data science for software engineers currently working on legacy systems ?

Jules S. Damji says:

Ofer’s blog post speaks to your question about the composition and spectrum of skills for a data scientist.

Anurag Zutshi says:

Can someone please tell me whether it is feasible to switch from Business Process Analyst (process re-engineering domain) to Data Scientist? Waiting for your reply.

Market Researcher says:

Hi. I have an MBA and I am a consumer market researcher by profession. I am really interested in data analysis and data science. I don’t have a computer science background. How do you suggest I should start?

Jules S. Damji says:

I would look into taking some workshops in Scala, Python, or Java. Attend Spark Summit. Start with basic programming concepts and work your way up.

Wandeto John says:

Good advice. Am I right if, wherever you mentioned Java/Python, I substitute C++? And for MATLAB/R, I substitute Octave?
Nice time

Akshay Jadhav says:

Hi there.. couple of questions.. I am pursuing MBA in E commerce, but my bachelors is in management only.. thus, no technical background.. .I want to become a data scientist.. what will be ur suggestion to me… plz suggest fields that can be a good career option for me. and also high paying. 😛 ..

Jules S. Damji says:

I would consider learning some programming languages: Python, Scala, and Java. There may be some graduate or post-graduate courses offered by universities where you reside. In fact, UC Berkeley offers a master’s program in Data Science.

Melissa James says:

Big Data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications…….

ThomasV says:

There are many free resources on the internet for becoming a data scientist through self-study alone. Some interesting learning paths are listed there:

Niraj Mahajan says:

I’m from a finance background with a liking for statistics. I have my own practice in audit and taxation in India. However, I’m really keen to know whether it is possible for me to venture into data analytics from a practice point of view. I mean, will I be able to get assignments for data analysis? I don’t want to take up a job in IT. Please advise how I should begin and how I can turn the opportunity into reality.

Achilles says:

Hello Sir,
I have worked almost 7 years as a software developer in C++ application development. Now I want to become a data scientist. Would it be a good career path for me?

Abhishek says:


Probably a bit late on the scene. But I have a question.

I have around 8 years of experience as a DBA, primarily with Oracle databases.
Over the last few years I got an opportunity to work on various other databases focused on analytics, such as Greenplum/Redshift, as part of a Big Data project. In that project I got a few days of training on Spark/Hive etc. I was leading the DBA team for setting up the Greenplum cluster. I have good knowledge of scripting in Python/Shell/Perl.

I somehow really liked the idea of deriving meaning from the data I was handling and interacting with the business side as a whole. That made me curious and led me to look more into data science as a career change.

My question is: would it be necessary to get a master’s degree in data science, and how helpful would it be? Are there companies who would hire a guy with a DBA background as a data scientist?

Sujitha Sanku says:

The “For more resources on data analysis in Hadoop, take a look here” link doesn’t work.
Can you please take a look?

Sirjon says:

Thanks for the nice article. Just wanted to understand the difference between machine learning using MATLAB/Octave vs. Spark MLlib. Is Spark MLlib an alternative to MATLAB/Octave, or are these completely different things? I want to do an online course in machine learning and got a little confused between these different products for ML.

Ofer Mendelevitch says:

With Matlab/Octave (just like R or Python’s scikit-learn) you can run machine learning algorithms with the assumption that all the data fits into memory, whereas Spark MLlib is designed to handle larger training sets that may not fit on a single machine. Thus those implementations are distributed in nature.

You might be interested in my new book (just published) that provides more detail on this topic:

xinablak says:

Great article, thanks a lot! It’s always good to have someone knowledgeable giving us some pointers… Could you please give me your recommendations for a Civil (Structural) Engineer with a great passion for Excel, VBA/VB (no fundamental background on IT or business whatsoever) and a very strong interest in data analysis, data visualization and, hence, data science? Will it be a very steep learning curve? Thanks in advance!

Ram Rao says:

Hi there I have a MBA and a masters in civil engineering
I have worked for several years as a civil engineer. I have some programming background.
How do I break into data science? I’m happy to take an entry level job

elysiumacademy says:

Whether you are a beginner who is looking for a way to get started or a professional developer trying to further improve his skills, there is a blog on this list that can suit your needs. Thanks for sharing informative content.
