Data scientists are in high demand these days. Everyone seems to be hiring a team of data scientists, yet many are still not quite sure what data science is all about, and what skill set they need to look for in a data scientist to build a stellar Hadoop data science team. We at Hortonworks believe data science is an evolving discipline that will continue to grow in demand in the coming years, especially with the growth of Hadoop adoption. This role requires experience and knowledge in math, statistics and machine learning, programming and scripting, as well as visualization techniques.
We tend to think of the data scientist role as a continuum of skills:
Software engineers really enjoy crafting new production-grade software systems that are testable, maintainable and secure, and that scale well. Some of those software engineers specialize in working with data. They tend to be highly skilled in technologies like SQL, Hadoop, Hive/Pig and MapReduce, and excel at building production-quality data pipelines. We call those “data engineers”.
Research scientists focus on academic research in machine learning and statistical techniques, creating brand new algorithms like support vector machines and deep learning, and proving theoretical properties of such algorithms. Applied scientists are those research scientists who thrive on solving real-world problems with real data. They are very good at applying state-of-the-art algorithms and techniques to real-world data.
The data scientist role combines the skill set and experience of a data engineer with that of the applied scientist. It is quite difficult to find good data scientists, because the combination of all these skills and interests is rarely found in a single person.
“Okay, okay, I understand it’s hard to find good data scientists”, you may say, “but I still need to complete my data projects, what should I do?” One option might be to train data engineers to be experts in math, statistics and applied science. Another might be to hire applied scientists and train them to be good software engineers. In my experience, either approach has limited success, because good software engineers may not be as good at applied science, or may not be interested in shifting their careers in that direction. And vice versa.
Instead, simply build a Hadoop data science team that combines data engineers and applied scientists, working in tandem to build your data products. Back when I was at Yahoo!, that’s exactly the structure we had: applied scientists working together with data engineers to build large-scale computational advertising systems.