August 17, 2017

What is a Data Science Workbench and Why Do Data Scientists Need One?

Data science is inherently an exploratory and creative process because there is usually neither a definitive answer to the problem at hand nor a well-defined approach to reaching one. Data scientists research problems, explore data, visualize patterns across data and use their experience and judgment to choose parameters and processes that may be relevant to the specific problem at hand. This makes sharing and collaboration a critical activity that enables teams of data scientists to build on each other’s knowledge and to produce the overall best results.

As data science has evolved over time with big data, new techniques and technologies have emerged. This change is reflected in the background and training of data scientists across organizations, who use a wide spectrum of languages and toolkits. These include open source software such as R, Python, and Spark, as well as commercial software like SAS and SPSS that data scientists may have been trained on or feel comfortable with. For data science initiatives to be successful, companies must enable data scientists to work effectively, without being restricted by their backgrounds, and use the best technique or technology to address the problem at hand.

In this regard, data science workbenches offer great value in enhancing data scientists’ productivity and effectiveness. A data science workbench is an application that empowers data scientists to use their preferred technologies, languages and libraries in an environment that can be local to their machines or part of the broader enterprise-wide infrastructure. Using a workbench, data scientists can access tools that are stored on their machines and in their organizations. For example, data science workbenches provide data scientists with computational notebooks such as Jupyter or Zeppelin, as well as development environments for widely used statistical languages such as R and Python.

Data scientists currently spend a lot of time and effort setting up their analytical environments. This process consists of identifying the data, moving the data from a number of sources into their data science environment and then running the experiments there. Through the workbench, data scientists can connect directly to the data sources in the data lake with minimal setup. Once connected to the data sources, data scientists can simply use the notebook that is part of the workbench to tap into the processing power of the cluster, using the best-in-class support for Spark or their choice of machine learning technologies.

An important part of data scientists’ work is exchanging ideas with their peers and colleagues. A data science workbench provides a collaborative environment, complemented by visualizations, where data scientists with expertise in different techniques and technologies can share their results with each other. Teams can not only share their code but also package the entire notebook, including live datasets, into a reproducible environment so that others can get started quickly without additional setup. The advanced collaborative paradigms supported by a workbench not only foster learning and cross-pollination of ideas but also allow teams with different expertise to work jointly on a predictive model. The opportunity for different teams to test the model with different assumptions and use cases improves its robustness and predictive power. Also, if a data scientist researching the problem finds a notebook, code, or a tutorial that could help address it, some of the leading workbenches on the market enable data scientists to incorporate these assets into the current project.

All these factors combine to make data scientists self-sufficient, improve the effectiveness of their models and, most importantly, accelerate the time to insight.
