April 15, 2013

How to Build a Hadoop Data Science Team

Data scientists are in high demand these days. Everyone seems to be hiring a team of data scientists, yet many are still not quite sure what data science is all about, and what skill set they need to look for in a data scientist to build a stellar Hadoop data science team. We at Hortonworks believe data science is an evolving discipline that will continue to grow in demand in the coming years, especially with the growth of Hadoop adoption. This role requires experience and knowledge in math, statistics and machine learning, programming and scripting, as well as visualization techniques.

Hadoop data scientists

We tend to think of the data scientist role as a continuum of skills:

Software engineers enjoy crafting new production-grade software systems that are testable, maintainable, secure, and scalable. Some of those software engineers specialize in working with data. They tend to be highly skilled in technologies like SQL, Hadoop, Hive/Pig, and MapReduce, and excel at building production-quality data pipelines. We call those “data engineers”.

Research scientists focus on academic research in machine learning and statistical techniques, creating brand new algorithms like support vector machines and deep learning, and proving theoretical properties of such algorithms. Applied scientists are those research scientists who thrive on solving real-world problems with real data. They are very good at applying state-of-the-art algorithms and techniques to real-world data.

The data scientist role combines the skill set and experience of a data engineer with that of the applied scientist. It is quite difficult to find good data scientists, because the combination of all these skills and interests is rarely found in a single person.

“Okay, okay, I understand it’s hard to find good data scientists”, you may say, “but I still need to complete my data projects, what should I do?” One option might be to train data engineers to be experts in math, statistics and applied science. Another might be to hire applied scientists and train them to be good software engineers. In my experience that approach has limited success, because good software engineers may not be as good at applied science, or may not be interested in shifting their career in that direction. And vice versa.

Instead, simply build a Hadoop data science team that combines data engineers and applied scientists, working in tandem to build your data products. Back when I was at Yahoo!, that’s exactly the structure we had: applied scientists working together with data engineers to build large-scale computational advertising systems.




  • Excellent post. Of course, not all companies have the luxury of hiring a data science team; they have to rely on the skills of a single individual. But even in those cases, often the best approach is to supplement the skillset of an existing tech team with someone who brings the necessary additional skills. Whether that is on the data engineering side or the applied science side will depend on the existing skillsets in the team. I use these same approaches when working with clients to advise on hiring in the data science space.


  • Hi,
    Thank you…it’s a simple and nice post. Cleared my basic doubts about this field.
    I have an idea for a project of my own which needs data analysis. So, I want to learn data science myself and get the project into shape.
    The project is about surveying a particular village, town, society, or an entire state, and then building a big database from the results, so that we can query almost anything. My intention is to find solutions to common issues like ineffective energy use, ineffective tax collection, and many others. So I first want to bring everything into digital form and build a model that the government can implement everywhere.
    If you get my point, can you suggest where exactly I need to start? Is data science exactly what I am looking for? I only know DBMS and a few basics of data mining.

  • You certainly can hire a consultant to mine the data and see industry-specific knowledge on a macro level or on a micro level. A firm might give you better decision knowledge and help you make prudent choices. In any case, cloud computing should come with support which goes beyond doing simplistic payroll, manpower, and office management. In a super competitive field, economics is the difference between those who perish and those who flourish. Link-In was made for time-specific assignments and as a means to help the small concern. I’m always looking at business models and seeking input from others who have specialized knowledge to consult and brainstorm, looking at problems as opportunities and developing processes.
