
Data Science


What Is Data Science


Organizations now have the capability to acquire, store and process large volumes of data using commodity hardware. At the same time, technologies such as Hadoop and Spark have enabled the collection, organization and analysis of Big Data at scale. The convergence of cost-effective storage and scalable processing allows us to extract richer insights from data, and those insights can then be operationalized to provide commercial and social value. Data scientists process large volumes of structured and unstructured data to extract these meaningful insights.

Data science is about scientific exploration of data to extract meaning or insight, and the construction of software systems to utilize such insights in a business or social context.

This involves the art of discovering data insights combined with the science of operationalizing them. A data scientist uses a combination of econometrics, machine learning, statistics, visualization and computer science to extract valuable business insights hiding in data, and builds operational systems to deliver that value.

Data science is a cross-functional discipline. A data scientist often collaborates with an extended team that includes visualization specialists, developers, business analysts, data engineers, applied scientists, architects, LOB owners and DevOps. The success of data science projects often relies on the communication, collaboration and interaction that takes place within this extended team, both inside and, at times, outside the organization.

Data Science Use Cases

The application of statistics, machine learning and computer science to data analysis is not new; data science has evolved from this heritage. However, with the growth in the volume, velocity and variety of data, along with the adoption of Hadoop and Spark, the demand for deeper insights has fueled the demand for data science. Data science is applicable in many practical situations and use cases across a variety of industries.

Every industry is expanding its use of data science to improve its business by detecting patterns and producing actionable insights, driving organizational change and touching nearly every facet of life. The following examples, from churn prediction to predictive maintenance, show how data science is being used:

  • Churn prediction: Predict whether a customer is likely to “leave”

  • Customer segmentation: Uncover a natural segmentation of customers into groups with similar behavior

  • Product recommendation: Predict a customer’s preference for a product, and recommend the products each customer is most likely to prefer

  • Information security: Detect network traffic anomalies and identify potential hackers

  • Fraud detection: Identify fraudulent patterns in insurance claims or credit card transactions

  • Predictive maintenance: Use sensor data feeds to predict equipment failure before it happens and proactively maintain the equipment
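As a concrete taste of the churn example above, here is a minimal logistic scoring sketch in plain Python. The feature names and weights are hypothetical, invented for illustration; a real model would learn its weights from historical customer data.

```python
import math

def churn_score(weights, features, bias=0.0):
    """Logistic model: estimated probability that a customer will 'leave'."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights a training step might have produced.
weights = {"support_calls": 0.8, "months_active": -0.05, "late_payments": 0.6}

# A customer with many support calls and late payments scores as high-risk;
# a long-tenured customer with a clean record scores as low-risk.
at_risk = churn_score(weights, {"support_calls": 5, "months_active": 3, "late_payments": 2})
loyal = churn_score(weights, {"support_calls": 0, "months_active": 48, "late_payments": 0})
```

In practice the score would feed a retention workflow, e.g. flagging customers above a threshold for outreach.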

How Data Science Works

Data science is a multi-step process, and each step requires a diverse set of skills and technologies. There is no single technology, tool or algorithm (no silver bullet) that would enable a data scientist to extract insights from every potential data set and use case. Instead, data science is an iterative, multi-step process that leverages multiple tools. Let’s take a look at a typical data science workflow from a process and tools perspective.


Data science, like most software development projects, starts with strategic planning and addresses two important questions:

  • What is the question that I am trying to answer?

    • A data scientist must clearly define the business outcome he or she plans to achieve, and ensure that the modeling output is practical and actionable from a business perspective.

  • What data will I need to answer that question?

    • This involves assessing the data currently available, as well as data that might be required but is not yet collected or available within the enterprise. A data scientist also needs to determine the volume of data required to develop the model, and the mechanism for getting that data into the Hadoop environment. Depending on the volume, velocity and variety of the data, the data scientist will select appropriate development tools and technologies.

Following the planning stage, data science proceeds through an iterative macro-process of:

  • Data Acquisition: Identification of sources

  • Data Cleansing: Identification and remediation of data quality issues

  • Data Analysis: Generation of features or attributes that will be part of the model; this is the step where the actual data mining takes place.

Within this macro-process, there is further iteration inside the Data Cleansing and Data Analysis steps.
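The macro-process above can be sketched as three composable steps. This is a toy illustration only: the functions and sample records are invented stand-ins for real data sources and cleansing rules.

```python
def acquire():
    # Data Acquisition: identification of sources. Here, a hard-coded toy feed
    # standing in for files, databases or streams landing in Hadoop.
    return [
        {"age": "34", "spend": "120.5"},
        {"age": "", "spend": "80"},      # missing value
        {"age": "51", "spend": "bad"},   # unparseable value
    ]

def cleanse(rows):
    # Data Cleansing: identify and remediate quality issues.
    # This toy rule simply drops rows whose fields fail to parse.
    clean = []
    for row in rows:
        try:
            clean.append({"age": int(row["age"]), "spend": float(row["spend"])})
        except ValueError:
            continue
    return clean

def analyze(rows):
    # Data Analysis: derive a feature that could feed a model,
    # e.g. spend per year of age.
    return [r["spend"] / r["age"] for r in rows]

features = analyze(cleanse(acquire()))
```

In a real project, each step would be revisited repeatedly as cleansing reveals new quality issues and analysis suggests new features.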

After several iterations, when the data scientist is satisfied with the results, he or she might decide to:

  1. Publish or share the results with colleagues for peer review

  2. Embed the model into a report or dashboard that is used in the organization to make business decisions

  3. Deploy the model to production

Data Science With Hadoop

Hadoop enhances the effectiveness of data science in three key areas. Hadoop:

  1. Breaks down silos and makes more data accessible for modeling

  2. Enables modeling with larger datasets, resulting in higher quality models

  3. Provides faster and more scalable model training

Break down data silos to make more data accessible

With the continued growth in the scope and scale of applications using Hadoop and other data sources, the vision of an enterprise Data Lake starts to materialize. Combining data from multiple silos, including internal and external data sources, helps organizations answer complex questions that no one was previously able to answer.

In a practical sense, a Data Lake is characterized by three key attributes:

  1. Collect everything. A Data Lake contains both raw data from source systems, retained over extended periods of time, and any processed data.

  2. Dive in anywhere. A Data Lake enables users across multiple business units to refine, explore and enrich data on their own terms.

  3. Flexible access. A Data Lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.

As data continues to grow exponentially, investments in Open Enterprise Hadoop provide both efficiency in a next-generation data architecture and opportunity in an enterprise Data Lake.

The end result? With Hadoop, data scientists have instant access to more data, co-located in the Data Lake from new and existing sources. This breaks down data silos and leads to more effective models.

Incorporate comprehensive feature set for higher predictive power

The distributed storage and compute nature of Hadoop enables data scientists to develop better models by leveraging new and larger data sets.

Knowing which features to include in a model can be an art, and traditional data science approaches can limit the number of attributes and the predictive power of the model. These approaches also don’t scale with the volume of Big Data now available to enterprises. Hadoop gives data scientists a larger palette: they can include a more comprehensive set of features in the model, and train on datasets with a far larger number of observations.

Hadoop enables data scientists to use attributes at a much finer grain and over a longer time span. Instead of depending on a statistically significant sample, data scientists can now use the entire data set in Hadoop to build more accurate predictive models.

Important data science use cases, such as fraud detection, require modeling of rare events. For such use cases, traditional data sampling techniques raise the likelihood of filtering out the rare events, so sampling can be detrimental to statistical modeling of rare events. Since Hadoop empowers data scientists to use complete or significantly larger data sets, the probability of capturing rare events is higher, which in turn improves the predictive power of the model.
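The effect of sampling on rare events can be quantified directly: a uniform sample that keeps each record with probability p retains each rare event independently with probability p, so the chance of missing all of them is (1 − p) raised to the number of rare events. The fraud counts below are illustrative, not from the text.

```python
# Probability that a uniform random sample contains none of the rare events,
# assuming each record is kept independently with probability `sample_fraction`.
def miss_probability(sample_fraction, rare_count):
    return (1 - sample_fraction) ** rare_count

# Illustrative numbers: with only 20 fraud cases in the data, a 1% sample
# misses every single case roughly 82% of the time.
p_miss = miss_probability(0.01, 20)
```

Working on the full data set drives this miss probability to zero, which is exactly the argument for modeling rare events over complete data in Hadoop.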

Faster and more scalable model training

Utilizing the parallel computing power of Hadoop, predictive models can be trained in parallel, which results in faster training time and improved productivity.  

This is especially useful when tuning models, where thousands of training instances can run at the same time to find the best possible model.
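A sketch of this idea, using local threads as stand-ins for cluster tasks. The scoring function here is a hypothetical placeholder for a real training job; on Hadoop, each candidate would be trained as a parallel task rather than a thread.

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_score(regularization):
    # Stand-in for a real training job: returns a made-up validation score
    # that peaks at regularization = 0.1.
    return 1.0 - abs(regularization - 0.1)

# Candidate hyperparameter values, evaluated concurrently.
candidates = [0.001, 0.01, 0.1, 1.0, 10.0]
with ThreadPoolExecutor(max_workers=5) as pool:
    scores = list(pool.map(train_and_score, candidates))

best = candidates[scores.index(max(scores))]
```

The key point is that the candidates are independent, so training time for the sweep shrinks roughly with the degree of parallelism available.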

Data Science With Spark

Apache Spark plays an important role within the Hadoop ecosystem when it comes to data science. Spark is designed for data science, and its abstractions make data science easier.

Data scientists commonly use machine learning: a set of techniques and algorithms that learn from data. These algorithms are often iterative, and Spark’s ability to cache a dataset in memory greatly speeds up such iterative processing, making Spark an ideal engine for implementing these algorithms.

Spark as a framework has support and libraries for ETL, machine learning, SQL, stream processing, and graph processing:


Spark also includes MLlib, a library that provides a growing set of machine learning algorithms for common data science techniques: classification, regression, collaborative filtering, clustering and dimensionality reduction.

Spark’s ML Pipeline API is a high-level abstraction that models an entire data science workflow. The ML pipeline package in Spark models a typical machine learning workflow and provides abstractions such as Transformer, Estimator, Pipeline and Parameters. This abstraction layer makes data scientists more productive.
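To make the pattern concrete, here is a minimal plain-Python sketch of the Transformer/Estimator/Pipeline idea. The Scaler and ScalerEstimator classes are invented for illustration; they mimic the shape of Spark’s abstractions, not its actual API.

```python
class Transformer:
    """Transforms a dataset into another dataset."""
    def transform(self, data):
        raise NotImplementedError

class Scaler(Transformer):
    """A fitted transformer: applies a scale factor learned from data."""
    def __init__(self, factor):
        self.factor = factor
    def transform(self, data):
        return [x * self.factor for x in data]

class ScalerEstimator:
    """An estimator: fit() learns from data and returns a Transformer."""
    def fit(self, data):
        return Scaler(1.0 / max(data))

class Pipeline:
    """Chains stages; fitting a pipeline yields a pipeline of fitted transformers."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):          # estimator -> fit it first
                stage = stage.fit(data)
            data = stage.transform(data)       # feed output to the next stage
            fitted.append(stage)
        return Pipeline(fitted)
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

# Fit on training data, then reuse the fitted pipeline on new data.
model = Pipeline([ScalerEstimator()]).fit([2.0, 4.0, 8.0])
scaled = model.transform([4.0, 8.0])
```

The value of the abstraction is that the same fitted pipeline object can be applied unchanged to training, validation and production data.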

Although Spark, MLlib and ML pipelines are powerful tools that are evolving rapidly, the current state of Spark still leaves room for improved algorithms that reduce the effort required to implement data science solutions. For example, the productivity of data scientists could be greatly enhanced through a common infrastructure and libraries built on top of the existing Spark libraries.

A number of data science problems are common across industries and are frequently encountered by data scientists. Two critical dimensions in analytics are time and space. Spark does not currently ship with a geolocation library, yet incorporating geospatial attributes into predictions is increasingly important for use cases such as optimizing the location of search ads, predicting crime hotspots and understanding traffic patterns for services like Uber. Processing geospatial data can be challenging: the geospatial libraries currently available are not scalable, often lack metadata support and are not well integrated with programming languages. To address these common problems, Hortonworks recently contributed Magellan, a package that provides geospatial libraries on top of Spark for leveraging and querying geographic features.
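As a taste of the kind of primitive a geospatial library provides, here is the standard haversine great-circle distance in plain Python. This is a generic textbook formula, not Magellan’s API, and the coordinates below are illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate distance from San Francisco to Palo Alto.
d = haversine_km(37.7749, -122.4194, 37.4419, -122.1430)
```

A scalable library layers primitives like this (plus polygon containment, spatial joins and indexing) over distributed datasets, which is where Spark integration matters.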

Another common problem is entity resolution, or de-duplication. Entities such as a person in a healthcare system may have slightly different representations of name or address across records. A library that provides sophisticated entity matching would allow customers to leverage state-of-the-art algorithms, and extend them with private knowledge bases, to solve this important data cleansing problem.
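A minimal illustration of this kind of fuzzy matching, using only Python’s standard library. The 0.85 threshold and the sample names are arbitrary choices for the sketch; production entity resolution would use richer features (addresses, phonetic encodings, knowledge bases) and blocking strategies to scale.

```python
from difflib import SequenceMatcher

def same_entity(a, b, threshold=0.85):
    """Fuzzy match: treat two name strings as the same entity when their
    character-level similarity ratio meets the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# De-duplicate a list of records, keeping the first representative of each entity.
records = ["Jon Smith", "John Smith", "Jane Doe"]
deduped = []
for name in records:
    if not any(same_entity(name, kept) for kept in deduped):
        deduped.append(name)
```

Here “Jon Smith” and “John Smith” collapse into one entity while “Jane Doe” stays distinct, which is exactly the cleansing behavior the text describes.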

Additional algorithms and a common solution infrastructure can make data scientists more productive. Hortonworks is focused on simplifying data science by contributing packages such as Magellan and additional libraries for entity resolution.

At Hortonworks we believe that Spark & HDP are Perfect Together.

To learn more about Hortonworks’ focus on Spark, read the Hortonworks Apache Spark page.

Data Science with Zeppelin

Interactive browser-based notebooks make data scientists more productive: they can develop, organize, execute and share data science code, and visualize results, without dropping to the command line or needing cluster details. Notebooks allow data scientists not only to execute code but also to work interactively with long workflows. A number of notebooks are available for Spark. iPython remains a mature choice and a great example of a data science notebook. Hortonworks provides an Ambari stack definition to help our customers quickly set up iPython on their Hadoop clusters.

Apache Zeppelin is an up-and-coming web-based notebook that brings data exploration, visualization, sharing and collaboration features to Spark. It supports Python, as well as a growing list of languages and backends such as Scala, Hive, SparkSQL, shell and markdown, via Zeppelin language interpreters. We are excited about this project and are working with the community to bring Zeppelin to a mature state. We plan to make Zeppelin ready for production use by adding security, stability and R support, and by making the visualizations more intuitive.

Data discovery, exploration, reporting and visualization are key components of the data science workflow. Zeppelin provides a “Modern Data Science Studio” that supports Spark and Hive out of the box. In fact, Zeppelin supports multiple language backends with a growing ecosystem of data sources. Zeppelin notebooks provide an interactive, snippet-at-a-time experience to data scientists. You can see a collection of Zeppelin notebooks in the Hortonworks Gallery.

Even with notebooks, the data science process remains challenging. Data scientists often struggle with feature engineering, algorithm selection, tuning, sharing their work with others and deploying that work into production. We are working to improve the Zeppelin notebook in the community: we have added a Hive interpreter and are working to make the editor more stable. We are also deepening our involvement in the Zeppelin community to help deliver features such as security, summary statistics and context-sensitive help to improve the data science experience.

At Hortonworks we believe that Spark & HDP are Perfect Together.   And that Zeppelin is a key component to accelerate data science solutions.

To learn more about Hortonworks’ focus on Zeppelin, read the Hortonworks Apache Zeppelin page.

Why our Data Scientists

The mission of Hortonworks’ data science team is to help our customers deliver business value by applying data science with Hadoop.

Our Data Scientists have expertise in the following areas:

  • Data engineering: data quality, pre-processing, feature engineering

  • Text analytics and natural language processing

  • Graph algorithms

  • Data exploration and visualization

And most importantly, our team knows how to apply these skills and techniques with Spark and Hadoop in HDP, and with data flow in HDF, to address customer Data Science challenges.

Our team can assist customer teams in the following Data Science areas:

  • Strategy/Vision:

    • How to use data to increase business value

    • Use-case analysis

    • How to build a data science team

  • Design/Architecture:

    • Selection of tools, techniques, algorithms

    • Data architecture and flow

  • Implementation:

    • We augment your team during various phases of the project

Why Data Science with Hortonworks

Hortonworks provides deep data science skills to help customers gain industry insight from data science solutions. Hortonworks delivers the following key components for successful solutions:

  • Industry leading Hadoop and Spark in HDP and HDF

  • Deep data science professional services expertise

  • Rich ecosystem of data science partners

  • Insightful and hands-on data science and Spark training

Industry leading Hadoop and Spark in HDP and HDF

Hortonworks continues to invest in Spark for Enterprise Hadoop so users can deploy Spark-based applications alongside other Hadoop workloads in a consistent, predictable and robust way. Current investment includes:

  • Leverage the scale and multi-tenancy provided by YARN so memory- and CPU-intensive applications can deliver optimum performance

  • Deliver HDFS memory tier integration with Spark to allow RDD caching

  • Enhance the data science experience with Spark

  • Continue integrating with HDP’s operations, security, governance and data management capabilities

There are additional opportunities for Hortonworks to contribute to and maximize the value of technologies that interact with Spark. Specifically, we believe that we can further optimize data access via the new DataSources API. This should allow SparkSQL users to take full advantage of the following capabilities:

  • ORCFile instantiation as a table

  • Column pruning

  • Language integrated queries

  • Predicate pushdown
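To illustrate what column pruning and predicate pushdown buy, here is a toy pure-Python scan. This is a conceptual sketch only; the real DataSources API operates on distributed, columnar storage formats such as ORCFile, where skipping rows and columns at the source avoids far more I/O.

```python
def scan(rows, columns=None, predicate=None):
    """Toy data source: pushes the predicate into the scan and prunes columns,
    so filtered-out rows and unused columns never leave the 'storage' layer."""
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # predicate pushdown: rejected at the source
        if columns is not None:
            row = {c: row[c] for c in columns}  # column pruning
        yield row

# Hypothetical table; a real source would read these rows from ORC files.
table = [
    {"id": 1, "city": "Austin", "sales": 10},
    {"id": 2, "city": "Dallas", "sales": 99},
    {"id": 3, "city": "Austin", "sales": 42},
]

result = list(scan(table,
                   columns=["id", "sales"],
                   predicate=lambda r: r["city"] == "Austin"))
```

Only the matching rows, restricted to the requested columns, reach the query engine; everything else is discarded where the data lives.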

At Hortonworks we believe that Spark & HDP are Perfect Together and our focus is on:

  • Data Science Acceleration

    • Improve data science productivity by improving Apache Zeppelin and by contributing additional Spark algorithms and packages to ease the development of key solutions.

  • Seamless Data Analysis

    • Improve the Spark integration with YARN, HDFS, Hive and HBase.

  • Innovate at the Core

    • Contribute additional machine learning algorithms and enhance Spark enterprise operations and security readiness.

Deep data science professional services expertise

Hortonworks’ data science team comprises technical and thought leaders across the field. Our data scientists work closely with customers to explore their data science requirements, define and execute projects, provide expert advice and help them overcome data science challenges. The Hortonworks data science services team also works closely with our development teams, committers and the extended community to continuously drive customer requirements, improve the ecosystem and share best practices.

Rich ecosystem of data science partners

All the major business intelligence vendors offer Hadoop and Spark integration, and specialized analytics vendors offer niche solutions for specific data types and use cases. Since our inception, Hortonworks has worked with leading enterprise technology vendors to enable Open Enterprise Hadoop in next-generation data architectures. Hortonworks has deep relationships, and does co-development, with a large set of partners to provide differentiated solutions. There is a rich ecosystem of partners that provide tools for the various phases of the data science workflow, enabled for Hadoop and Spark on HDP. You can learn more about these partners on our Hortonworks partner page.

Insightful and hands-on data science and Spark training

Hortonworks provides immersive and valuable real-world training designed by Hadoop and Spark experts. Scenario-based training courses are available in-classroom or online from anywhere in the world, and offer unmatched depth and expertise.  Learn more about our HDP and data science training.