August 19, 2015

Featurizing Data: Spark and Beyond

In this Hortonworks partner guest blog, Abhimanyu Aditya, Senior Product Manager and co-founder at Skytree, explains how Skytree’s APIs address challenges facing data engineers and simplify data preparation and data transformation, using Apache Spark on YARN with Hortonworks Data Platform (HDP).

Challenges Facing Data Engineers and Data Scientists

Machine learning as a technology can be challenging. It is difficult to create, understand, and deploy machine learning models. Even before the modeling process can begin, the data needs to be prepared for machine learning, and modern data scientists, developers, hackers, Ph.D.s, analysts, and domain experts spend a significant amount of time and effort doing so. Some of the challenges they face include:

  • Too many data sources (logs, files, RDBMS, Hadoop, databases, sensors)
  • Disparate tools for data blending (Pig, MapReduce, SQL, Commercial ETL offerings etc.)
  • Too many data types (categorical, unstructured, dense/sparse, time series, multiple distributions)
  • Missing values, missing data
  • Noisy, dirty, and non-standardized data
  • Skewed data
  • Too little data on what you are trying to predict (fraud, intrusion)

These issues become significantly more complicated when dealing with big data on a distributed compute environment. Therefore, establishing a scalable process to prepare the data for machine learning workflows is critical.

Data Preparation for Machine Learning with Spark

Enter Apache Spark. Spark’s processing engine is well suited to preparing data for machine learning. Spark’s core functionality lies in its in-memory Resilient Distributed Datasets (RDDs). Datasets are cached in memory and distributed across a large cluster, and map- and reduce-like functions can be applied to these RDDs without leaving memory. This gives a significant speedup over traditional MapReduce for highly iterative workflows, which are typical in machine learning data preparation. The key advantage that Spark brings to machine learning workflows for data scientists is the ability to apply multiple transforms efficiently without unnecessarily shuttling data to and from disk.
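To make the map/filter/reduce pattern concrete, here is a minimal plain-Python sketch of the kind of chained transforms Spark applies to an RDD. Spark would partition the records across a cluster and keep the intermediate results in memory; this local version only illustrates the shape of the pipeline, not Spark's API.

```python
from functools import reduce

# Raw records, some dirty or missing -- typical data-prep input
records = ["3.0", "bad", "7.5", "", "4.5"]

# "map" step: parse each record, marking unparseable values as None
def parse(value):
    try:
        return float(value)
    except ValueError:
        return None

parsed = [parse(r) for r in records]

# "filter" step: drop missing/dirty values (a common cleaning transform)
clean = [v for v in parsed if v is not None]

# "reduce" step: aggregate, e.g. to compute the mean for normalization
total = reduce(lambda a, b: a + b, clean)
mean = total / len(clean)
print(clean, mean)  # [3.0, 7.5, 4.5] 5.0
```

In Spark, each of these steps would be a transformation on a cached RDD, so an iterative workflow can re-run downstream steps without re-reading the source data from disk.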

Skytree APIs Abstraction Simplifies Data Preparation and Transformation

Still, using Spark involves a lot of careful coding because of the need to work with the low-level APIs exposed through native Spark. One way to simplify this is to use Skytree’s machine learning software. Skytree employs Spark for machine learning data preparation and has created a high-level data transformation library directly on top of Spark that abstracts away much of the complexity of native Spark for machine learning data preparation. This reduces the coding burden on data scientists and ensures that transforms executed in Spark are done in a consistent and efficient manner. Once the data is ready to be consumed by machine learning algorithms, it is efficiently and automatically transferred to Skytree’s machine learning engine, which installs directly on top of Hadoop, providing data scientists with a unified big data analytics platform.

The following Spark-based transformations are available in Skytree software:

  1. Normalization/standardization
  2. Joins
  3. Categorization
  4. Dummification/horizontalization of categorical variables
  5. Filter rows/columns
  6. New feature creation based on formulas
  7. Unstructured/NLP/text data featurization
    • Document ingestion: PDF, DOC, HTML etc.
    • Token annotation
    • Feature extraction
    • Bag-of-words vectorization (for classification/clustering)
    • Language identification
    • Sentence annotation (identify sentence boundaries)
    • Part-of-speech tagging (identify nouns, verbs, etc.)
    • Lemmatization/stemming (normalize words)
  8. Many more…
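Two of the transforms above, normalization/standardization (1) and dummification (4), can be sketched in a few lines. This is a standalone plain-Python illustration, not Skytree's Spark-backed library, whose actual API is not shown in this post:

```python
def standardize(values):
    """Z-score standardization: rescale to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

def dummify(values):
    """Dummification: expand a categorical column into one 0/1 column
    per category (one-hot encoding)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(standardize([2.0, 4.0, 6.0]))     # [-1.2247..., 0.0, 1.2247...]
print(dummify(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```

In a distributed setting the mean and variance would be computed with a reduce pass over the partitions before the per-element rescaling map, which is exactly the pattern RDD transformations make efficient.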

A Simple Use Case

Let’s walk through a simple Web page categorization use case that involves Skytree’s Spark functionality:

The goal: to classify about 120,000 Web pages stored on Hadoop into about 100 classification categories. The broad steps to solve this problem are as follows:

  1. Ingest the data into a Skytree dataset (which encapsulates a Spark RDD) using the Python SDK. This is done in one step by pointing the SDK to a corpus index file.
  2. Perform the following NLP/text transformations, which utilize Spark under the hood, on the dataset:
    • Document ingestion
    • Token annotation
    • Feature extraction
    • Bag-of-words vectorization

The data is passed as Spark RDDs to the transforms, which are then distributed across multiple nodes for processing. Once these four steps have been performed, the unstructured dataset has been transformed into a structured dataset, as shown in the light blue circle below.

Figure 1: Web page categorization using Skytree’s Spark functionality
  3. The dataset is now ready to be consumed by Skytree’s machine learning engine. The featurized data is efficiently transferred from RDDs to the engine, where the machine learning is performed independently of Spark. The best model for the dataset is chosen using Skytree’s model automation capability, AutoModel, which automatically finds the classification model with the highest accuracy.
  4. Once the best model has been found, it is ready to move into production. Skytree offers various means to do this, including export to PMML or JAR, as well as batch execution and other streaming techniques.
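The token-annotation and bag-of-words steps in step 2 can be sketched in plain Python. The function names here are illustrative, not Skytree's API, and Skytree would distribute this work over Spark RDDs rather than run it locally:

```python
import re

def tokenize(text):
    """Token annotation: lowercase and split on non-word characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(documents):
    """Bag-of-words vectorization: turn each document into a
    fixed-length term-count vector over a shared vocabulary."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted(set(t for doc in tokenized for t in doc))
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for token in doc:
            vec[index[token]] += 1
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["Spark on YARN", "Spark and Spark"])
print(vocab)    # ['and', 'on', 'spark', 'yarn']
print(vectors)  # [[0, 1, 1, 1], [1, 0, 2, 0]]
```

The resulting fixed-length count vectors are the structured dataset that a classifier can consume, which is the hand-off point to the machine learning engine in step 3.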


In principle, Spark allows data scientists to execute advanced data preparation tasks for machine learning with speed and scalability. Data scientists working on the Hortonworks Data Platform (HDP) distribution of Hadooop, along with Skytree’s machine learning platform, can now apply best-in-class machine learning algorithms to troves of data on a single platform, while taking advantage of Skytree’s Spark integration to simplify arduous and iterative machine learning data preparation tasks, resulting in faster and better insight from their data.

