In this Hortonworks partner guest blog, Abhimanyu Aditya, Senior Product Manager and co-founder at Skytree, explains how Skytree APIs solve challenges facing data engineers and simplify data preparation and data transformation using Apache Spark on YARN with Hortonworks Data Platform (HDP).
Machine learning as a technology can be challenging. It is difficult to create, understand and deploy machine learning models. Even before the modeling process can begin, the data needs to be prepared for machine learning. Modern data scientists, developers, hackers, Ph.D.s, analysts and domain experts spend a significant amount of time and effort doing this. Some of the challenges they face include:
These issues become significantly more complicated when dealing with big data in a distributed compute environment. Establishing a scalable process to prepare the data for machine learning workflows is therefore critical.
Enter Apache Spark. Spark’s processing engine is well suited to preparing data for machine learning. Its core abstraction is the Resilient Distributed Dataset (RDD): a dataset that is partitioned across a large cluster, can be cached in memory, and supports map- and reduce-like functions applied in memory. For highly iterative workflows, which are typical in machine learning data preparation, this gives a significant speedup over traditional MapReduce, which writes intermediate results to disk between jobs. The key advantage that Spark brings to data scientists is the ability to apply multiple transforms efficiently without repeatedly moving data between disk and memory.
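To make the idea of chained in-memory transforms concrete, here is a minimal sketch in plain Python. It deliberately does not use the Spark or Skytree APIs: the `partitions` list, the helper names and the toy text are all hypothetical, and the "cache" is an ordinary Python list. It only illustrates the pattern of mapping over partitions once, keeping the result in memory, and reusing it for several reduce-like passes.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy data: two "partitions" of raw text lines held in memory.
# In Spark these would be partitions of an RDD spread across cluster nodes.
partitions = [
    ["Spark on YARN", "spark caches data"],
    ["YARN schedules Spark jobs"],
]


def map_partition(lines):
    """Map-like step: lowercase and tokenize every line in one partition."""
    return [word for line in lines for word in line.lower().split()]


def count_partition(words):
    """Partial aggregate for one partition: word -> occurrences."""
    return Counter(words)


# The first transform is computed once and kept in memory (the "cached" data).
tokenized = [map_partition(p) for p in partitions]

# Reduce-like step: merge the per-partition counts into one global count.
word_counts = reduce(
    lambda left, right: left + right,
    (count_partition(words) for words in tokenized),
    Counter(),
)

# A second pass reuses the in-memory tokenized data instead of re-reading
# and re-tokenizing the raw input.
vocabulary = sorted(set().union(*tokenized))
```

The second pass is where the in-memory approach pays off: an iterative workflow that ran each step as a separate disk-backed job would re-read its input for every pass.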
Still, using Spark involves a lot of careful coding because of the need to work with the low-level APIs exposed by native Spark. One way to simplify this is to use Skytree’s machine learning software. Skytree employs Spark for machine learning data preparation and has built a high-level data transformation library directly on top of Spark to abstract away much of the complexity of native Spark. This reduces the coding burden on data scientists and ensures that transforms executed in Spark are done in a consistent and efficient manner. Once the data is ready to be consumed by machine learning algorithms, it is efficiently and automatically transferred to Skytree’s machine learning engine, which installs directly on top of Hadoop, providing data scientists with a unified big data analytics platform.
The following Spark-based transformations are available in Skytree software:
Let’s walk through a simple Web page categorization use case that involves Skytree’s Spark functionality:
The goal: to classify about 120,000 Web pages stored on Hadoop into about 100 classification categories. The broad steps to solve this problem are as follows:
The data is passed as Spark RDDs to the transforms, which are then distributed across multiple nodes for processing. Once these four steps have been performed, the unstructured dataset has been transformed into a structured dataset, as shown in the light blue circle below.
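The move from unstructured pages to a structured dataset can be illustrated with a small plain-Python sketch. The specific transforms used here (whitespace tokenization, term counting and TF-IDF weighting) are assumptions chosen for illustration, not a description of Skytree’s transforms or of the steps in this use case; the `pages` data and all function names are hypothetical, and the code runs on a single machine rather than on Spark.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (page_id, raw_text) pairs standing in for Web pages.
pages = [
    ("p1", "Spark makes data preparation fast"),
    ("p2", "Hadoop stores data"),
    ("p3", "Spark runs on Hadoop"),
]


def tokenize(text):
    """Assumed tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


# Term counts per page.
term_counts = {page_id: Counter(tokenize(text)) for page_id, text in pages}

# A fixed, sorted vocabulary defines the columns of the structured dataset.
vocabulary = sorted({term for counts in term_counts.values() for term in counts})

# Document frequency: the number of pages in which each term appears.
doc_freq = Counter(term for counts in term_counts.values() for term in counts)


def tfidf_row(counts):
    """Turn one page's term counts into a numeric row aligned with vocabulary."""
    n_docs = len(term_counts)
    return [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]


# One fixed-length numeric row per page: the structured dataset.
structured = {page_id: tfidf_row(counts) for page_id, counts in term_counts.items()}
```

Each page ends up as a row with one column per vocabulary term, which is the kind of tabular input a classifier can consume.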
In principle, Spark allows data scientists to execute advanced data preparation tasks for machine learning with speed and scalability. Data scientists working on the Hortonworks Data Platform (HDP) distribution of Hadoop, together with Skytree’s machine learning platform, can now apply best-in-class machine learning algorithms to troves of data on a single platform. Skytree’s Spark integration simplifies the arduous, iterative work of machine learning data preparation, resulting in faster and better insight from their data.