How to Build a Hadoop Data Science Team

From a software engineer to a research scientist

Data scientists are in high demand these days. Everyone seems to be hiring a team of data scientists, yet many are still not quite sure what data science is all about, and what skill set they need to look for in a data scientist to build a stellar Hadoop data science team. We at Hortonworks believe data science is an evolving discipline that will continue to grow in demand in the coming years, especially with the growth of Hadoop adoption. This role requires experience and knowledge in math, statistics and machine learning, programming and scripting, as well as visualization techniques.

Hadoop data scientists

We tend to think of the data scientist role as a continuum of skills:

Software engineers really enjoy crafting new production-grade software systems that are testable, maintainable, secure, and scalable. Some of those software engineers specialize in working with data. They tend to be highly skilled in technologies like SQL, Hadoop, Hive, Pig, and MapReduce, and excel at building production-quality data pipelines. We call those “data engineers”.

Research scientists focus on academic research in machine learning and statistical techniques, creating brand new algorithms like support vector machines and deep learning, and proving theoretical properties of such algorithms. Applied scientists are those research scientists who thrive on solving real-world problems with real data. They are very good at applying state-of-the-art algorithms and techniques to real-world data.

The data scientist role combines the skill set and experience of a data engineer with that of the applied scientist. It is quite difficult to find good data scientists, because the combination of all these skills and interests is rarely found in a single person.

“Okay, okay, I understand it’s hard to find good data scientists”, you may say, “but I still need to complete my data projects, what should I do?” One option might be to train data engineers to be experts in math, statistics and applied science. Or maybe hire applied scientists and train them to be good software engineers. In my experience that approach has limited success, because good software engineers may not be as good at applied science, or may not be interested in shifting their career in that direction. And vice versa.

Instead, simply build a Hadoop data science team that combines data engineers and applied scientists, working in tandem to build your data products. Back when I was at Yahoo!, that’s exactly the structure we had: applied scientists working together with data engineers to build large-scale computational advertising systems.


April 16, 2013 at 9:51 am

Agreed. A team with varied skills is the answer. Interestingly I had been circulating a quick presentation on the role of a Data Science Engineer – blog at

April 22, 2013 at 12:46 pm

Excellent post. Of course, not all companies have the luxury of hiring a data science team – they have to rely on the skills of a single individual. But even in those cases, often the best approach is to supplement the skillset of an existing tech team with someone who brings the necessary additional skills. Whether that is on the data engineering side or the applied science side will depend on the existing skillsets in the team. I use these same approaches when working with clients to advise on hiring in the data science space.


puneet k agarwal
October 10, 2013 at 8:10 pm

Excellent Post …

April 25, 2014 at 11:41 am

Superb article for Young engineers to think on the path to data science!!

Jeevan Patnaik
September 15, 2014 at 5:42 am

Thank you…it’s a simple and nice post. Cleared my basic doubts about this field.
I have an idea for a project of my own which needs data analysis. So, I want to learn this data science myself and bring the project into shape.
The project is about surveying a particular village or town or a society or a state in overall, and then make it a big that we can ask almost everything…my intention is to find solutions to the common issues like ineffective energy use, ineffective tax collection etc. etc. etc…..there are many….so I first want to bring everything into digital data….form a model…so that it can be implemented everywhere by the government.
If you can get my point, can you suggest me from where exactly do I need to start? Is this data science exactly what I am looking for? I just know DBMS and a little basics of data mining that’s all.
