Apache Spark provides elegant development APIs that let data workers iterate rapidly over data with machine learning and other data science techniques that require fast, in-memory processing. With Spark on YARN, they can run Spark data science workloads alongside other data access engines, all accessing the same shared dataset in a single cluster.
Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development.
Additional libraries, built atop the core, allow diverse workloads for streaming, SQL, and machine learning. Spark also has additional libraries in alpha development.
Hortonworks Focus for Spark
Hortonworks enhanced Spark to be enterprise ready by enabling Spark on YARN and applying enterprise governance, security and operations services for Spark applications.
Deeper integration with YARN lets Spark share cluster resources efficiently with other engines, such as Hive, Storm and HBase, all on a single data platform. This avoids the need to create and manage dedicated Spark clusters for the subset of applications for which Spark is ideally suited, and it allows more efficient resource use within a single cluster.
Governance, Security and Operations
As with any enterprise data platform, governance, security and operations are vital pillars. HDP provides consistent governance, security and management policies for Spark applications, just as it does for the other data processing engines within HDP.
Hortonworks is focused on enabling Spark for Enterprise Hadoop so users can deploy Spark-based applications along with their other Hadoop workloads in a consistent, predictable and robust way. We are working to:
- Leverage the scale and multi-tenancy provided by YARN so that Spark’s memory- and CPU-intensive apps run with predictable performance
- Deliver HDFS memory tier integration with Spark to allow RDD caching
- Enhance the data science experience with Spark
- Continue integrating with HDP’s operations, security, governance and data management capabilities
There are additional opportunities for Hortonworks to contribute to and maximize the value of technologies that interact with Spark. Specifically, we believe that we can further optimize data access via the new DataSources API. This should allow SparkSQL users to take full advantage of the following capabilities:
- ORCFile instantiation as a table
- Column pruning
- Language integrated queries
- Predicate pushdown
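To make column pruning and predicate pushdown concrete, here is a minimal sketch in plain Python. It is not Spark or ORC code: the `ColumnarTable` class, its `scan` method, and the sample data are all hypothetical, written only to show the idea that a columnar source can read just the columns a query needs and apply the filter at the source.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


class ColumnarTable:
    """Toy columnar table (hypothetical; NOT the Spark DataSources API)."""

    def __init__(self, columns: Dict[str, List]):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.num_rows = lengths.pop() if lengths else 0
        self.columns_read: List[str] = []  # which columns the last scan touched

    def _read(self, name: str) -> List:
        # Record the access so we can see which columns were actually read.
        if name not in self.columns_read:
            self.columns_read.append(name)
        return self.columns[name]

    def scan(
        self,
        required: Sequence[str],
        predicate: Optional[Tuple[str, Callable[[object], bool]]] = None,
    ) -> List[tuple]:
        self.columns_read = []
        row_ids = list(range(self.num_rows))
        if predicate is not None:
            # Predicate pushdown: filter at the source, before building rows.
            column, test = predicate
            values = self._read(column)
            row_ids = [i for i in row_ids if test(values[i])]
        # Column pruning: only the requested columns are read.
        data = {name: self._read(name) for name in required}
        return [tuple(data[name][i] for name in required) for i in row_ids]


table = ColumnarTable({
    "id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", "FR"],
    "amount": [10.0, 25.5, 7.25, 3.0],
    "notes": ["a", "b", "c", "d"],
})
rows = table.scan(required=["id", "amount"],
                  predicate=("country", lambda c: c == "US"))
print(rows)                # [(1, 10.0), (3, 7.25)]
print(table.columns_read)  # ['country', 'id', 'amount']
```

The `notes` column is never touched, and non-matching rows are dropped before any output rows are assembled. A real columnar format applies the same two ideas at the file and stripe level.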
Hortonworks’ Approach to Apache Spark
We have already certified Spark as YARN Ready. This means that your memory and CPU-intensive Spark-based applications can coexist with all the other workloads deployed in a YARN-enabled cluster.
Hortonworks approaches Spark the same way we approached other data access engines like Storm, Hive, and HBase: we outline a strategy, rally the community, and contribute key features within the Apache Software Foundation’s process. Below is a summary of the integration points that make Spark enterprise-ready.
Support for the ORCFile format
As part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format. ORC is a columnar storage format that is tightly integrated with HDFS and provides optimizations for both read performance and data compression. It is rapidly becoming the de facto storage format for Hive. Hortonworks contributed to SPARK-2883, which provides basic support of ORCFile in Spark.
Many of our customers’ initial use cases for Spark run on Hadoop clusters that either do not contain sensitive data or are dedicated to a single application, so they are not subject to broad security requirements. But users plan to deploy Spark-based applications alongside other applications in a single cluster, so we worked to integrate Spark with the security constructs of the broader Hadoop platform. A common request is that Spark run effectively on a secure Hadoop cluster and leverage the authorization offered by HDFS.
To improve security further, we have worked within the community to ensure that Spark runs on a Kerberos-enabled cluster, so that only authenticated users can submit Spark jobs.
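As a rough illustration of what submitting a job to a Kerberos-enabled YARN cluster can look like, the sketch below assembles a `spark-submit` command line in Python. The helper `build_spark_submit`, the application name, the principal, and the keytab path are hypothetical placeholders. The `--principal` and `--keytab` options are `spark-submit` options for YARN, but their availability depends on the Spark version, so check `spark-submit --help` on your cluster before relying on them.

```python
import shlex
from typing import List


def build_spark_submit(app: str, principal: str, keytab: str,
                       deploy_mode: str = "cluster") -> List[str]:
    """Assemble a spark-submit argument list for a Kerberized YARN cluster.

    Hypothetical helper for illustration; it only builds the command and
    does not run it.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--principal", principal,  # Kerberos principal to log in as
        "--keytab", keytab,        # keytab file holding that principal's keys
        app,
    ]


# Placeholder values; substitute your own principal, keytab, and application.
cmd = build_spark_submit(
    app="etl_job.py",
    principal="analyst@EXAMPLE.COM",
    keytab="/etc/security/keytabs/analyst.keytab",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list rather than a single string keeps each argument intact if it is later passed to `subprocess.run`, and `shlex.quote` is used only to print a shell-safe version.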
Hortonworks continues to focus on streamlining operations for Spark through the 100% open source Apache Ambari. Our customers use Ambari to provision, manage and monitor their HDP clusters, and many Hortonworks partners, such as Microsoft, Teradata, Pivotal and HP, have taken advantage of and backed this foundational Hadoop project. Currently, our partners leverage Ambari Stacks to rapidly define new components and services and add them to a Hadoop cluster. With Stacks, Spark components and services can be managed by Ambari, so you can install, start, stop and configure a Spark deployment through the same interface used for all engines in your Hadoop cluster. The Quick Links feature of Ambari gives the cluster operator access to the native Spark user interface.
To simplify the operational experience, HDP 2.2.4 also allows Spark to be installed and managed by Apache Ambari 2.0. Ambari lets the cluster administrator manage the Spark configuration and the life cycle of Spark daemons.
Improved Reliability and Scale of Spark-on-YARN
The Spark API allows developers to create both iterative and in-memory applications on Apache Hadoop YARN. With strong community interest behind it, Spark is making great strides in efficient cluster resource usage. With dynamic executor allocation on YARN, Spark scales the number of executors up and down within configured bounds. We continue to believe Spark can use cluster resources more efficiently and are working with the community to promote better resource usage.
YARN ATS Integration
From an operations perspective, Hortonworks has integrated Spark with the YARN Application Timeline Server (ATS). ATS provides generic storage and retrieval of applications’ current and historic information. This permits a common integration point for certain classes of operational information and metrics. With this integration, the cluster operator can take advantage of information already available from YARN to gain additional visibility into the health and execution status of the Spark jobs.
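The dynamic executor allocation mentioned above under Spark-on-YARN reliability is controlled by a handful of Spark configuration properties. The sketch below, with hypothetical helper names `dynamic_allocation_conf` and `to_submit_args`, shows one way to generate those settings as `--conf` arguments. The property names are standard Spark settings, but the bounds shown are arbitrary examples, and the external shuffle service must also be set up on the YARN NodeManagers for this to work.

```python
from typing import Dict, List


def dynamic_allocation_conf(min_executors: int, max_executors: int) -> Dict[str, str]:
    """Return Spark properties that bound dynamic executor allocation.

    Hypothetical helper for illustration; the bounds are supplied by the caller.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("require 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Dynamic allocation on YARN relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a property dict into spark-submit `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = dynamic_allocation_conf(min_executors=2, max_executors=20)
submit_args = to_submit_args(conf)
print(submit_args)
```

The same key/value pairs could instead be placed in `spark-defaults.conf`, which is typically where a cluster administrator would set them for all jobs.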
Fundamentally, our strategy continues to focus on innovating at the core of Hadoop and we look forward to continuing to support our customers and partners by contributing to a vibrant Hadoop ecosystem that includes Apache Spark as yet another data access application running in YARN.
Try these Tutorials
Try Spark with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with HDP running alongside a set of hands-on, step-by-step Hadoop tutorials.