How to Build a Hadoop Data Science Team

Data scientists are in high demand these days. Everyone seems to be hiring a team of data scientists, yet many are still not sure what data science is, or what skill set to look for in a data scientist when building a stellar Hadoop data science team. We at Hortonworks believe data science is an evolving discipline that will continue to grow in demand in the coming years, especially as Hadoop adoption grows. The data scientist role requires experience and knowledge in math, statistics and machine learning, programming and scripting, as well as visualization techniques.

Hadoop data scientists

We tend to think of the data scientist role as a continuum of skills:

Software engineers enjoy crafting production-grade software systems that are testable, maintainable, secure, and scalable. Some of those software engineers specialize in working with data. They tend to be highly skilled in technologies like SQL, Hadoop, Hive, Pig, and MapReduce, and excel at building production-quality data pipelines. We call those “data engineers”.

Research scientists focus on academic research in machine learning and statistical techniques, creating brand-new algorithms like support vector machines and deep learning, and proving theoretical properties of such algorithms. Applied scientists are those research scientists who thrive on solving real-world problems with real data. They are very good at applying state-of-the-art algorithms and techniques to real-world data.

The data scientist role combines the skill set and experience of a data engineer with that of the applied scientist. It is quite difficult to find good data scientists, because the combination of all these skills and interests is rarely found in a single person.

“Okay, okay, I understand it’s hard to find good data scientists”, you may say, “but I still need to complete my data projects. What should I do?” One option might be to train data engineers to be experts in math, statistics and applied science. Another might be to hire applied scientists and train them to be good software engineers. In my experience that approach has limited success, because good software engineers may not be as good at applied science, or may not be interested in shifting their career in that direction. And vice versa.

Instead, simply build a Hadoop data science team that combines data engineers and applied scientists, working in tandem to build your data products. Back when I was at Yahoo!, that’s exactly the structure we had: applied scientists working together with data engineers to build large-scale computational advertising systems.

 

 


Comments

Jeevan Patnaik | September 15, 2014 at 5:42 am

Hi,
Thank you…it’s a simple and nice post. Cleared my basic doubts about this field.
I have an idea for a project of my own which needs data analysis. So, I want to learn this data science myself and want to bring it into shape.
The project is about surveying a particular village or town or a society or a state in overall, and then make it a big database..so that we can ask almost everything…my intention is to find solutions to the common issues like ineffective energy use, ineffective tax collection etc. etc. etc…..there are many….so I first want to bring everything into digital data….form a model…so that it can be implemented everywhere by the government.
If you can get my point, can you suggest me from where exactly do I need to start? Is this data science exactly what I am looking for? I just know DBMS and a little basics of data mining that’s all.

Janardhan | April 25, 2014 at 11:41 am

Superb article for young engineers thinking about the path to data science!!

puneet k agarwal | October 10, 2013 at 8:10 pm

Excellent Post …

April 22, 2013 at 12:46 pm

Excellent post. Of course, not all companies have the luxury of hiring a data science team; they have to rely on the skills of a single individual. But even in those cases, often the best approach is to supplement the skill set of an existing tech team with someone who brings the necessary additional skills. Whether that is on the data engineering or the applied science side will depend on the existing skill sets in the team. I use these same approaches when working with clients to advise on hiring in the data science space.

Joanna

April 16, 2013 at 9:51 am

Agreed. A team with varied skills is the answer. Interestingly I had been circulating a quick presentation on the role of a Data Science Engineer – blog at http://doubleclix.wordpress.com/2013/04/16/data-science-engineers-the-new-breed-of-data-scientists/
