Apache Spark

In-Memory Compute for Machine Learning & Data Science Workloads

Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing. With Spark on YARN, they can use Spark for data science workloads alongside other data access engines at the same time, all accessing the same shared dataset.

Hortonworks Focus for Spark

Hortonworks has outlined a set of initiatives to address some of the current challenges with Spark and make it easier for users to consume as an enterprise-ready part of the completely open source Hortonworks Data Platform (HDP). While delivery is planned in the discrete phases laid out below, the work falls into two distinct categories:

YARN-enabled Spark

Deeper integration of Spark with YARN will allow it to become a more efficient tenant alongside other engines, such as Hive, Storm and HBase, all running simultaneously on a single data platform. This avoids the need to create and manage dedicated Spark clusters to support the subset of applications for which Spark is ideally suited, and allows resources to be shared more effectively within a single cluster.

Governance, Security and Operations

As with any data platform, governance, security and operations are table stakes. Our efforts here will focus on enabling consistent governance, security and management policies for Spark, just as for the other data processing engines within HDP.

Hortonworks’ Approach to Apache Spark

We have already certified Spark as YARN Ready. This means that your memory- and CPU-intensive Spark-based applications can coexist within a single Hadoop cluster with all the other workloads deployed in a YARN-enabled cluster.

As we have proven with other engines such as Storm, Hive and HBase, we are committed to contributing to Spark in the same manner: outline a strategy, rally the community, and contribute key features within the community as promised. Below is a summary of the key initiatives we plan to work on to ready Spark for the enterprise. This work is already well underway.

Phase 1: Laying the Groundwork

Phase 1 of our work focuses on enabling Apache Spark to take advantage of the latest innovations across the Hadoop ecosystem. The work around Hive integration and ORC file support is available now, while the remainder of these items will be available by the end of the year.

Focus: Description
Improved integration with Apache Hive: Today SparkSQL can be built and configured to read and write data stored in Apache Hive, but it is limited to using Hive 0.12. Hortonworks is contributing to Spark to enable support for Hive 0.13 and, as the Hive community marches towards Hive 0.14, will contribute additional Hive innovations that can be leveraged by Spark. This allows SparkSQL to use modern versions of Hive to access data for machine learning, modeling and similar work.
Support for ORC file format: As part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format. ORC is a columnar storage format that is tightly integrated with HDFS and provides optimizations for both read performance and data compression. It is rapidly becoming the de facto storage format for Hive. We believe the Spark project will also benefit from this file format and have introduced SPARK-2883 to provide basic support of ORCFile in Spark. Our recently refreshed Spark technical preview allows our HDP users to simply add Spark to their existing Hadoop deployment: they can continue using Hadoop’s advanced capabilities and storage formats while also exploring the additional benefits of Spark.
Security: Many of our customers’ initial use cases for Spark run on Hadoop clusters that either do not contain sensitive data or are dedicated to a single application, and so they are not subject to broad security requirements. However, most of our customers plan to deploy Spark-based applications alongside other applications in a single cluster, and therefore we plan to ensure Spark is integrated with the security constructs of the broader Hadoop platform. Initially, the most common request we hear from our enterprise customers is to provide authorization via integration with LDAP or Active Directory before granting access to the native Spark Web User Interface, and to ensure that Spark runs effectively on a secure Hadoop cluster.
Operations: Hortonworks continues to focus on streamlining operations for Spark through the 100% open source Apache Ambari. Our customers use Ambari to provision, manage and monitor their HDP clusters, and many Hortonworks partners, such as Microsoft, Teradata, Pivotal and HP, have adopted and backed this foundational Hadoop project. Currently, our partners leverage Ambari Stacks to rapidly define new components and services and add them to a Hadoop cluster. With Stacks, Spark components and services can be managed by Ambari, so that you can install, start, stop and configure a Spark deployment through the same interface that is used for all engines in your Hadoop cluster. The Quick Links feature of Ambari will allow the cluster operator to access the native Spark User Interface.


Phase 2: Optimizing Spark on YARN, Advanced Security, and Easing the Debugging Process

The second phase of work will focus on optimizing Spark on YARN, extending security capabilities with wire encryption and improving the developer experience with some key tools for debugging Spark-based workloads/applications.

Focus: Description
Improved Reliability and Scale of Spark-on-YARN: The Spark API allows developers to create iterative, in-memory applications on Apache Hadoop YARN. However, the current model of Spark-on-YARN leads to less than ideal utilization of cluster resources, particularly when large datasets are involved. Spark does not behave like MapReduce or Tez, which execute periodic jobs and release the compute resources once those jobs finish. In some ways it behaves much more like a long-running service, holding onto resources (such as memory) until the end of the entire workload. Using the experience we have already gained in building MapReduce, Apache Tez and other data processing engines, we believe similar concepts can be applied to Spark to optimize its resource utilization and make it a good multi-tenant citizen within a YARN-based Hadoop cluster. Significant improvements can be made to Spark through native integration of YARN and HDFS features such as:

  • Classic Hadoop Execution model (e.g. MapReduce) for batch applications by leveraging Tez and other native features such as the YARN shuffle service to transfer intermediate data within the Spark application
  • YARN Node Labels for isolating Spark applications to memory-heavy nodes in the cluster
  • Heterogeneous storage tiers in HDFS including the in-memory tier to share RDDs across independent Spark applications

This approach has multiple benefits: more efficient use of resources, improved reliability, and an extended ability for Spark to access and process petabytes of data in large-scale, multi-tenant clusters with hundreds of users. We plan to focus on where we can contribute within the Spark community to help promote these concepts and to strongly advocate the requirements of our enterprise customers.
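As a rough illustration of how the YARN features listed above might surface to a user, the following Python sketch assembles a `spark-submit` command line that turns on the external shuffle service together with dynamic executor allocation and restricts executors to labeled nodes. The configuration properties shown are taken from later Apache Spark releases, not from this roadmap, so they are an assumption about where the work led. The helper `build_submit_command`, the node label `highmem` and the script name `job.py` are hypothetical.

```python
from typing import Dict, List, Optional


def build_submit_command(app: str, node_label: Optional[str] = None) -> List[str]:
    """Assemble a spark-submit argument list for a YARN cluster (illustrative only)."""
    conf: Dict[str, str] = {
        # Serve shuffle files from a node-level service so idle executors can be released.
        "spark.shuffle.service.enabled": "true",
        # Let the executor count grow and shrink with the workload.
        "spark.dynamicAllocation.enabled": "true",
    }
    if node_label is not None:
        # Restrict executors to YARN nodes carrying this label (hypothetical label name).
        conf["spark.yarn.executor.nodeLabelExpression"] = node_label

    command = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for key in sorted(conf):
        command += ["--conf", f"{key}={conf[key]}"]
    command.append(app)
    return command


if __name__ == "__main__":
    # Print the command instead of running it; no cluster is assumed here.
    print(" ".join(build_submit_command("job.py", node_label="highmem")))
```

The command is only printed, never executed, so the sketch can be read without a cluster. Whether these settings suit a given deployment depends on the Spark and Hadoop versions in use.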

There are additional opportunities for Hortonworks to contribute and maximize the value of technologies we have contributed to within the open community. Specifically, we believe that we can further optimize data access via ORCFile. This should allow SparkSQL users to take full advantage of the following capabilities:

  • Allowing ORCFile to be instantiated as a table
  • Column pruning
  • Language integrated queries
  • Predicate pushdown
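
To make column pruning and predicate pushdown more concrete, here is a small Python sketch of the two ideas on a toy columnar layout. It is a simplified model written for illustration, not ORC's actual on-disk format and not a Spark API; the names `make_stripe` and `scan` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# A toy columnar "file": each stripe stores values per column plus min/max statistics.
Stripe = Dict[str, Any]


def make_stripe(rows: List[Dict[str, Any]]) -> Stripe:
    """Pivot row dictionaries into per-column lists and record min/max per column."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    stats = {name: (min(values), max(values)) for name, values in columns.items()}
    return {"columns": columns, "stats": stats}


def scan(stripes: List[Stripe], wanted: List[str], filter_column: str,
         lower_bound: Any) -> Tuple[List[Dict[str, Any]], int]:
    """Return the wanted columns for rows where filter_column >= lower_bound."""
    result: List[Dict[str, Any]] = []
    stripes_read = 0
    for stripe in stripes:
        _, stripe_max = stripe["stats"][filter_column]
        # Predicate pushdown: skip a whole stripe when its statistics rule out a match.
        if stripe_max < lower_bound:
            continue
        stripes_read += 1
        filter_values = stripe["columns"][filter_column]
        # Column pruning: besides the filter column, only the requested columns are read.
        pruned = {name: stripe["columns"][name] for name in wanted}
        for i, value in enumerate(filter_values):
            if value >= lower_bound:
                result.append({name: pruned[name][i] for name in wanted})
    return result, stripes_read


if __name__ == "__main__":
    stripes = [
        make_stripe([{"id": 1, "amount": 5, "note": "a"}, {"id": 2, "amount": 9, "note": "b"}]),
        make_stripe([{"id": 3, "amount": 40, "note": "c"}, {"id": 4, "amount": 70, "note": "d"}]),
    ]
    rows, stripes_read = scan(stripes, ["id"], "amount", 50)
    print(rows, stripes_read)
```

In this example the first stripe is skipped entirely because its largest `amount` is below the bound, and the `note` column is never touched. That is the kind of saving the two optimizations aim for on much larger files.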
Improved Debug Capabilities: This item covers improved debugging facilities for YARN applications through integration with YARN ATS and Ambari. One of the more significant challenges with distributed processing across a cluster of compute resources is debugging. Through our work on Apache Tez, we believe Hortonworks has significant experience to contribute to improving the “debug-ability” of Apache Spark. Some of the more recent improvements we’ve made to Apache Tez highlight the opportunity that exists; see the Apache Tez 0.5 blog for more details.
Wire Encryption and Authorization: Integration with Apache Argus in phase three will add authorization capabilities to Spark. This integration should allow other authorization subsystems to be plugged in, so that customers can choose how to deploy and configure authorization based on their requirements and existing security infrastructure. In addition, we plan to focus on securing the communication between the Spark nodes themselves, preventing unauthorized access to the data being exchanged over the network.
YARN ATS Integration: From an operations perspective, Hortonworks plans to integrate Spark with the YARN Application Timeline Server (ATS). ATS provides generic storage and retrieval of applications’ current and historic information. This permits a common integration point for certain classes of operational information and metrics. Once Spark is integrated with ATS, the cluster operator can take advantage of information already surfaced through Ambari to gain additional visibility into the health and execution status of the Spark engines and associated workloads.
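
For readers unfamiliar with ATS, the sketch below shows roughly how an operator's tool might address the Timeline Server's version 1 REST interface to list stored entities of a given type. It only builds the URL and does not contact any server. The host name `ats.example.com`, the entity type `SPARK_APPLICATION` and the helper `timeline_entities_url` are hypothetical, and the port is assumed to be the common default of 8188; the entity types a Spark integration would actually publish are not specified in this document.

```python
from urllib.parse import urlencode


def timeline_entities_url(host: str, entity_type: str, limit: int = 10,
                          port: int = 8188) -> str:
    """Build a Timeline Server v1 REST URL that lists entities of one type."""
    # "limit" caps how many entities the server returns in one response.
    query = urlencode({"limit": limit})
    return f"http://{host}:{port}/ws/v1/timeline/{entity_type}?{query}"


if __name__ == "__main__":
    # Hypothetical host and entity type, used purely for illustration.
    print(timeline_entities_url("ats.example.com", "SPARK_APPLICATION", limit=5))
```

A monitoring tool could issue a GET request against such a URL and render the returned JSON, which is the common integration point the description above refers to.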

Fundamentally, our strategy continues to focus on innovating at the core of Hadoop and we look forward to continuing to support our customers and partners by contributing to a vibrant Hadoop ecosystem that includes Apache Spark as yet another data access application running in YARN.

Apache Top-Level Project Since February 2014
