Apache Spark

In-Memory Compute for ETL, Machine Learning and Data Science Workloads

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, and Python that allow data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast, iterative access to datasets. Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN-enabled workloads in the enterprise.



Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development.

Additional libraries, built atop the core, allow diverse workloads for Streaming, SQL, and Machine Learning.
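A defining trait of the core engine is lazy evaluation: transformations only describe work, and nothing executes until an action asks for a result. The following is a minimal, pyspark-free sketch of that model in plain Python; the `ToyRDD` class and its methods are illustrative names invented here, not Spark's actual API.

```python
# Hedged sketch: a toy illustration of Spark's lazy evaluation model.
# Transformations (map, filter) only record work; nothing runs until
# an action (collect) is called.

class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded transformations

    def map(self, fn):
        return ToyRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):
        # The action: replay the recorded pipeline over the data.
        out = iter(self.data)
        for kind, fn in self.ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

rdd = ToyRDD(range(10))
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
print(result)  # [0, 4, 16, 36, 64]
```

In real Spark the same deferral is what lets the engine plan a whole pipeline of transformations before touching any data.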

Hortonworks Support for Spark

Hortonworks enhanced Spark to be enterprise-ready by enabling Spark on YARN and applying enterprise governance, security, and operations services to Spark applications. We integrated Spark as part of the HDP 2.2 release.

YARN-enabled Spark

Deeper integration of Spark with YARN allows Spark workloads to efficiently share cluster resources alongside other engines, such as Hive, Storm, and HBase, all on a single data platform. This avoids the need to create and manage dedicated Spark clusters for the subset of applications to which Spark is ideally suited, and allows for more efficient resource use within a single cluster.
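In practice, running Spark on YARN is largely a configuration choice. As a hedged illustration, a `spark-defaults.conf` might include entries like the following; the property names are standard Spark settings, but the values are examples only, not recommendations:

```
# Illustrative spark-defaults.conf entries (values are examples only)
spark.master             yarn
spark.executor.memory    2g
spark.executor.cores     2
spark.yarn.queue         default
```

With `spark.master` set to `yarn`, applications request their executors from the YARN ResourceManager rather than from a standalone Spark cluster.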

Governance, Security and Operations

As with any enterprise data platform, governance, security and operations are vital pillars. HDP provides consistent governance, security and management policies for Spark applications, just as it does for the other data processing engines within HDP.

Hortonworks Focus for Spark

Hortonworks is focused on enabling Spark for Enterprise Hadoop so users can deploy Spark-based applications along with their other Hadoop workloads in a consistent, predictable and robust way. We are working to:

  • Leverage the scale and multi-tenancy provided by YARN so that Spark's memory- and CPU-intensive applications run with predictable performance
  • Deliver HDFS memory-tier integration with Spark to allow RDD caching
  • Enhance the data science experience with Spark
  • Continue integrating with HDP's operations, security, governance, and data management capabilities
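RDD caching, mentioned above, is what makes iterative workloads fast: an expensive result is materialized once and reused across later actions instead of being recomputed. Here is a minimal plain-Python sketch of that idea; `CachedDataset` and its fields are invented for illustration and are not Spark's API.

```python
# Hedged sketch: what caching an RDD buys you, in plain Python.
# The first action materializes the (expensive) computation; later
# actions reuse the in-memory result instead of recomputing it.

class CachedDataset:
    def __init__(self, compute):
        self.compute = compute       # the expensive recomputation
        self._cache = None
        self.runs = 0                # how many times we actually computed

    def cache(self):
        return self                  # in this toy, caching is the default

    def collect(self):
        if self._cache is None:
            self.runs += 1
            self._cache = self.compute()
        return self._cache

ds = CachedDataset(lambda: [x * x for x in range(5)]).cache()
print(ds.collect())   # [0, 1, 4, 9, 16]  (computed)
print(ds.collect())   # [0, 1, 4, 9, 16]  (served from the cache)
print(ds.runs)        # 1
```

In Spark, an uncached RDD would rerun its whole lineage on every action; caching trades memory for that repeated compute.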

There are additional opportunities for Hortonworks to contribute to and maximize the value of technologies that interact with Spark. Specifically, we believe that we can further optimize data access via the new DataSources API. This should allow SparkSQL users to take full advantage of the following capabilities:

  • ORCFile instantiation as a table
  • Column pruning
  • Language integrated queries
  • Predicate pushdown
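Column pruning and predicate pushdown both mean doing less work at the storage layer: reading only the columns a query names, and filtering rows during the scan rather than after loading everything. The sketch below illustrates both against a toy column-oriented table in plain Python; it is a conceptual illustration, not Spark or DataSources API code, and the `scan` function is an invented name.

```python
# Hedged sketch: column pruning and predicate pushdown against a toy
# column-oriented table. Only the requested columns are touched, and
# the filter is applied while scanning rather than afterwards.

table = {
    "name": ["ann", "bob", "cat", "dan"],
    "age":  [31, 45, 27, 52],
    "city": ["nyc", "sfo", "nyc", "aus"],
}

def scan(table, columns, predicate):
    """Read only `columns`; keep rows where predicate(row) is True."""
    pruned = {c: table[c] for c in columns}          # column pruning
    n = len(next(iter(pruned.values())))
    rows = []
    for i in range(n):
        row = {c: pruned[c][i] for c in columns}
        if predicate(row):                           # pushed-down filter
            rows.append(row)
    return rows

# "SELECT name, age FROM table WHERE age > 30" never touches `city`.
print(scan(table, ["name", "age"], lambda r: r["age"] > 30))
# [{'name': 'ann', 'age': 31}, {'name': 'bob', 'age': 45}, {'name': 'dan', 'age': 52}]
```

With a columnar format such as ORC, this pruning happens at the file level, so unneeded columns are never even read from disk.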

Hortonworks’ Approach to Apache Spark

We have already certified Spark as YARN Ready. This means that your memory and CPU-intensive Spark-based applications can coexist with all the other workloads deployed in a YARN-enabled cluster.

Hortonworks approached Spark in the same way we approached other data access engines like Storm, Hive, and HBase. We outline a strategy, rally the community, and contribute key features within the Apache Software Foundation’s process. Below is a summary of the various integration points that make Spark enterprise-ready.

Support for the ORCFile format

As part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format. ORC is a columnar storage format that is tightly integrated with HDFS and provides optimizations for both read performance and data compression. It is rapidly becoming the de facto storage format for Hive. Hortonworks contributed to SPARK-2883, which provides basic support for ORCFile in Spark.
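One reason columnar formats like ORC compress so well is that all values of a column sit together, so simple encodings collapse the repeats that row-oriented storage would interleave with other fields. The following is a hedged plain-Python sketch of one such encoding, run-length encoding; real ORC uses a more sophisticated mix of encodings, and this is only meant to show the principle.

```python
# Hedged sketch: run-length encoding over a low-cardinality column,
# the kind of repetition that columnar layouts expose to compressors.

def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1         # extend the current run
        else:
            runs.append([v, 1])      # start a new run
    return runs

city_column = ["nyc", "nyc", "nyc", "sfo", "sfo", "aus"]
print(run_length_encode(city_column))
# [['nyc', 3], ['sfo', 2], ['aus', 1]]
```

In a row-oriented file the three `nyc` values would be separated by each row's other fields, and this kind of run would never form.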

Security

Many of our customers' initial use cases for Spark run on Hadoop clusters that either do not contain sensitive data or are dedicated to a single application, and so are not subject to broad security requirements. But as users plan to deploy Spark-based applications alongside other applications in a single cluster, we have worked to integrate Spark with the security constructs of the broader Hadoop platform. A common request is that Spark run effectively on a secure Hadoop cluster and leverage the authorization offered by HDFS. To that end, we have worked within the community to ensure that Spark runs on a Kerberos-enabled cluster, which means that only authenticated users can submit Spark jobs.

Operations

Hortonworks continues to focus on streamlining operations for Spark through the 100% open source Apache Ambari. Our customers use Ambari to provision, manage, and monitor their HDP clusters, and many Hortonworks partners, such as Microsoft, Teradata, Pivotal, and HP, have taken advantage of and backed this foundational Hadoop project. Currently, our partners leverage Ambari Stacks to rapidly define new components and services and add them to a Hadoop cluster. With Stacks, Spark components and services can be managed by Ambari, so you can install, start, stop, and configure a Spark deployment through the single interface used for all the engines in your Hadoop cluster. The Quick Links feature of Ambari lets the cluster operator access the native Spark user interface.

To simplify the operational experience, HDP 2.2.4 also allows Spark to be installed and managed by Apache Ambari 2.0. Ambari lets the cluster administrator manage Spark's configuration and the life cycle of the Spark daemons.

Improved Reliability and Scale of Spark-on-YARN

The Spark API allows developers to create both iterative and in-memory applications on Apache Hadoop YARN. With strong community interest behind it, Spark is making great strides toward efficient cluster resource usage. With dynamic executor allocation on YARN, Spark uses only as many executors as it needs, within configured bounds. We continue to believe Spark can use cluster resources even more efficiently, and we are working with the community to promote better resource usage.
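As a hedged illustration of the bounds mentioned above, dynamic executor allocation is driven by a handful of Spark properties; the names below are standard settings, but the values are examples only:

```
# Illustrative settings for dynamic executor allocation on YARN
spark.dynamicAllocation.enabled          true
spark.dynamicAllocation.minExecutors     2
spark.dynamicAllocation.maxExecutors     20
spark.shuffle.service.enabled            true
```

The external shuffle service is what lets executors be released safely, since shuffle output remains available after an executor exits.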

YARN ATS Integration

From an operations perspective, Hortonworks has integrated Spark with the YARN Application Timeline Server (ATS). ATS provides generic storage and retrieval of applications' current and historical information, offering a common integration point for certain classes of operational information and metrics. With this integration, the cluster operator can use information already available from YARN to gain additional visibility into the health and execution status of Spark jobs.

Fundamentally, our strategy continues to focus on innovating at the core of Hadoop, and we look forward to continuing to support our customers and partners by contributing to a vibrant Hadoop ecosystem that includes Apache Spark as yet another data access application running on YARN.

Apache Spark has been an Apache top-level project since February 2014.