Hortonworks’ strategy, since our inception, has been extremely consistent: enable a modern data architecture whereby users have the ability to store data in a single location and interact with it in multiple ways – using the right data processing engine at the right time. At the core of that strategy is YARN, which, as a part of Apache Hadoop, allows multiple data processing engines to interact with data stored in a single platform, unlocking an entirely new approach to analytics.
And as the Apache Hadoop platform matures, so do new analytic engines, such as Apache Spark, which is ideally suited for a certain class of application workloads. There has been unbridled excitement for Spark over the past few months because it provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques.
Hortonworks Support for Apache Spark
A few months ago we announced support for Spark as an HDP Tech Preview and by year-end we plan to offer support for it within our Enterprise Plus Support Subscription. Based on experiences we’ve gathered over the past few months, we have outlined a set of initiatives to address some of the current challenges with the technology that will make it easier for users to consume as part of the completely open source Hortonworks Data Platform (HDP).
While delivery is planned in the discrete phases laid out below, the work falls into two distinct categories:
- YARN-enabled Spark
Deeper integration of Spark with YARN will allow it to become a more efficient tenant alongside other engines, such as Hive, Storm and HBase, all running simultaneously on a single data platform. This avoids the need to create and manage dedicated Spark clusters for the subset of applications for which Spark is ideally suited, and allows resources to be shared more effectively within a single cluster.
- Governance, Security and Operations
As with any data platform, Governance, Security and Operations are table stakes. And so our efforts here will focus on enabling the application of consistent governance, security and management policies for Spark, just as for the other data processing engines within HDP.
First, it is important to note the work already completed earlier this year to make Spark “YARN Ready.” Over the next year we plan to continue to enrich and further optimize Spark on YARN. There is important work already underway that has significant benefits for Spark and its rapidly growing community of users. Ultimately, the customers we have worked with want Spark to be reliable, secure, and easy to manage and debug.
An Investment Strategy for Enterprise Spark
As we have proven with other engines such as Storm, Hive and HBase, we are committed to contributing to Spark in a similar manner: outline a strategy, rally the community, and contribute key features within the community as promised. Below is a summary of the key initiatives we plan to invest in to ready Spark for the enterprise. This work is already well underway.
Phase 1: Laying the Groundwork
Phase 1 of our investment focuses on enabling Apache Spark to take advantage of the latest innovations across the Hadoop ecosystem. We are happy to announce that the work around Hive integration and the ORC File support will be available this week, while the remainder of these items will be available by the end of the year.
- Improved integration with Apache Hive
Today SparkSQL can be built and configured to read and write data stored in Apache Hive, but is limited to using Hive 0.12. Hortonworks is contributing to Spark to enable support for Hive 0.13, and as the Hive community marches towards Hive 0.14, will contribute additional Hive innovations that can be leveraged by Spark. This allows SparkSQL to use modern versions of Hive to access data for machine learning, modeling etc.
- Support for ORC file format
As part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format. ORC is a columnar storage format that is tightly integrated with HDFS and provides optimizations for both read performance and data compression, and it is rapidly becoming the de facto storage format for Hive. We believe the Spark project will also benefit from this file format and have introduced SPARK-2883 to provide basic support of ORCFile in Spark. Our recently refreshed Spark technical preview allows our HDP users to simply add Spark to their existing Hadoop deployment: they can continue using Hadoop’s advanced capabilities and storage formats while also exploring the additional benefits of Spark.
- Security
Many of our customers’ initial use cases for Spark run on Hadoop clusters which either do not contain sensitive data or are dedicated to a single application, and so they are not subject to broad security requirements.
However, most of our customers plan to deploy Spark-based applications alongside other applications in a single cluster, and therefore we plan to invest heavily to ensure Spark is integrated with the security constructs of the broader Hadoop platform.
Initially, the most common request we hear from our enterprise customers is to provide authorization via integration with LDAP or Active Directory before granting access to the native Spark Web User Interface, and to ensure that Spark runs effectively on a secure Hadoop cluster.
- Operations
Hortonworks continues to focus on streamlining operations for Spark through the 100% open source Apache Ambari. Our customers use Ambari to provision, manage and monitor their HDP clusters, and many Hortonworks partners, such as Microsoft, Teradata, Pivotal and HP, have taken advantage of and backed this foundational Hadoop project.
Currently, our partners leverage Ambari Stacks to rapidly define new components/services and add them to a Hadoop cluster. With Stacks, Spark components and services can be managed by Ambari so that you can install, start, stop and configure them to fine-tune a Spark deployment, all via the single interface that is used for all engines in your Hadoop cluster. The Quick Links feature of Ambari will allow the cluster operator to access the native Spark User Interface.
Phase 2: Optimizing Spark on YARN, Advanced Security, and Easing the Debugging Process
The second phase of investment focuses on optimizing Spark on YARN, extending security capabilities with wire encryption and improving the developer experience with some key tools for debugging Spark-based workloads/applications.
- Improved Reliability and Scale of Spark-on-YARN
The Spark API allows developers to create both iterative and in-memory applications on Apache Hadoop YARN. However, the current model of Spark-on-YARN leads to a less than ideal utilization of cluster resources, particularly when large datasets are involved. Spark does not behave like MapReduce or Tez, which execute periodic jobs and release the compute resources once those jobs finish. In some ways it behaves much more like a long-running service, holding onto resources (such as memory) until the end of the entire workload. Using the experience we have already gained in building MapReduce, Apache Tez and other data processing engines, we believe similar concepts can be applied to Spark in order to optimize its resource utilization and make it a good multi-tenant citizen within a YARN-based Hadoop cluster. Significant improvements can be made to Spark with native integration of YARN and HDFS features such as:
- Classic Hadoop Execution model (e.g. MapReduce) for batch applications by leveraging Tez and other native features such as the YARN shuffle service to transfer intermediate data within the Spark application
- YARN Node Labels for isolating Spark applications to memory-heavy nodes in the cluster
- Heterogeneous storage tiers in HDFS including the in-memory tier to share RDDs across independent Spark applications
This approach has multiple benefits: more efficient use of resources, improved reliability, and an extension of Spark’s ability to access and process petabytes of data in large-scale, multi-tenant clusters with hundreds of users. We plan to focus on where we can contribute within the Spark community to help promote these concepts and to strongly advocate for the requirements of our enterprise customers.
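To make the utilization argument above concrete, here is a toy calculation, not a model of any real scheduler. It compares how long a single executor container is held when it is kept for the whole workload with how long it is held when it is released whenever it is idle. The function name, the interval values and the "container-seconds" metric are all illustrative assumptions, and the sketch ignores real costs such as container startup time and the loss of cached data when a container is released.

```python
def container_seconds_held(busy_intervals, workload_end, release_when_idle):
    """Return how many seconds one executor container is held.

    busy_intervals: sorted, non-overlapping (start, end) pairs, in seconds,
                    during which the executor is actually running tasks.
    workload_end:   time at which the whole workload finishes.
    release_when_idle: if True, the container is given back between busy
                    intervals (MapReduce/Tez-like behaviour); if False, it is
                    held from first use until the workload ends
                    (long-running-service-like behaviour).
    """
    if release_when_idle:
        # Only the time spent doing work counts against the cluster.
        return sum(end - start for start, end in busy_intervals)
    # The container is held from the first task until the workload ends.
    first_start = busy_intervals[0][0]
    return workload_end - first_start


# Hypothetical interactive session: three one-minute bursts of work
# spread over sixteen minutes.
BUSY = [(0, 60), (300, 360), (900, 960)]

held_long_running = container_seconds_held(BUSY, 960, release_when_idle=False)
held_released = container_seconds_held(BUSY, 960, release_when_idle=True)

# Fraction of the held time that was spent doing useful work.
utilization_long_running = held_released / held_long_running
```

In this made-up session the long-running model holds the container for 960 seconds to do 180 seconds of work. That gap is what releasing resources between jobs is intended to recover for other tenants.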
There are additional opportunities for Hortonworks to contribute and maximize the value of technologies we have contributed to within the open community. Specifically, we believe that we can further optimize data access via ORCFile. This should allow SparkSQL users to take full advantage of the following capabilities:
- Allowing ORCFile to be instantiated as a table,
- Column pruning,
- Language integrated queries and
- Predicate pushdown.
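Column pruning and predicate pushdown are easiest to see on a small in-memory example. The sketch below is a simplified illustration of the two ideas and does not use the ORC reader or SparkSQL. The `Stripe` class, the `scan` function and the sample data are hypothetical, and the min/max statistics are computed on demand here, whereas a columnar file format would typically store them alongside the data.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Stripe:
    """A chunk of rows stored column by column (hypothetical stand-in)."""
    columns: Dict[str, List[Any]]

    def stats(self, column: str) -> Tuple[Any, Any]:
        # A real columnar format would keep these statistics precomputed;
        # here they are derived from the data for simplicity.
        values = self.columns[column]
        return min(values), max(values)


def scan(stripes, wanted_columns, filter_column, lower_bound):
    """Return (rows, stripes_read) for rows with filter_column >= lower_bound."""
    rows = []
    stripes_read = 0
    for stripe in stripes:
        _, col_max = stripe.stats(filter_column)
        # Predicate pushdown: if even the largest value in this stripe fails
        # the filter, skip the stripe without looking at its rows.
        if col_max < lower_bound:
            continue
        stripes_read += 1
        for i, value in enumerate(stripe.columns[filter_column]):
            if value >= lower_bound:
                # Column pruning: only the requested columns are materialized;
                # the "note" column is never touched.
                rows.append({name: stripe.columns[name][i] for name in wanted_columns})
    return rows, stripes_read


STRIPES = [
    Stripe({"id": [1, 2, 3], "amount": [10, 20, 30], "note": ["a", "b", "c"]}),
    Stripe({"id": [4, 5, 6], "amount": [40, 50, 60], "note": ["d", "e", "f"]}),
]

rows, stripes_read = scan(STRIPES, ["id", "amount"], "amount", 45)
```

With this data, only the second stripe is read and only the `id` and `amount` columns are returned. The same principle lets a query engine avoid reading most of a large file.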
- Improved Debug Capabilities
We plan to improve debugging facilities for YARN applications by integrating with YARN ATS and Ambari. One of the more significant challenges with distributed processing across a cluster of compute resources is debugging. Through our work on Apache Tez, we believe Hortonworks has significant experience to contribute to improving the “debug-ability” of Apache Spark. Some of the more recent improvements we’ve made to Apache Tez highlight the opportunity that exists – see Apache Tez 0.5 blog for more details.
- Wire Encryption and Authorization
Integration with Apache Argus in phase three will add authorization capabilities to Spark. This integration should allow for other authorization subsystems to be plugged in, allowing the customer to choose how to deploy and configure the authorization capabilities based on their requirements and existing security infrastructure. In addition, we plan to focus on the communication between the Spark nodes themselves and preventing unauthorized access to the data being exchanged over the network.
- YARN ATS Integration
From an operations perspective, Hortonworks plans to integrate Spark with the YARN Application Timeline Server (ATS). ATS provides generic storage and retrieval of applications’ current and historic information. This permits a common integration point for certain classes of operational information and metrics. Once Spark is integrated with ATS, the cluster operator can take advantage of information already surfaced from Ambari to gain additional visibility into the health and execution status of the Spark engine(s) and associated workloads.
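As a rough illustration of what "generic storage" of application information can look like, the sketch below builds a JSON document in the general shape of a timeline entity (an entity id and type, a start time, and a list of timestamped events). It is an assumption-laden sketch, not the planned Spark integration: the field names follow our understanding of the ATS v1 REST format and should be checked against the Hadoop documentation, while the entity type, event names, application id and the `build_timeline_entity` helper are purely hypothetical. Nothing is sent over the network.

```python
import json


def build_timeline_entity(app_id, entity_type, start_ms, events):
    """Build a timeline-style document for one application.

    events: iterable of (timestamp_ms, event_type, info_dict) tuples.
    Field names are assumed from the ATS v1 REST format; verify before use.
    """
    return {
        "entities": [
            {
                "entity": app_id,
                "entitytype": entity_type,
                "starttime": start_ms,
                "events": [
                    {"timestamp": ts, "eventtype": name, "eventinfo": info}
                    for ts, name, info in events
                ],
                "primaryfilters": {},
                "otherinfo": {},
            }
        ]
    }


# Hypothetical application id, entity type and event names.
payload = build_timeline_entity(
    "application_0000000000000_0001",
    "SPARK_APPLICATION_SKETCH",
    1000,
    [
        (1000, "JOB_START", {"jobId": 0}),
        (2500, "JOB_END", {"jobId": 0, "result": "succeeded"}),
    ],
)

# Serialized form, as it might be handed to an HTTP client.
body = json.dumps(payload, sort_keys=True)
```

Because every engine would describe its work in the same entity-and-events shape, an operations tool can query current and historic information for different engines through one integration point.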
Fundamentally, our investment strategy continues to focus on innovating at the core of Hadoop and we look forward to continuing to support our customers and partners by contributing to a vibrant Hadoop ecosystem that includes Apache Spark as yet another data access application running in YARN.
Much of the first phase is available today as an HDP Tech Preview on our website. The remainder of Phase 1 will be available by year-end and we expect Phase 2 to be delivered early in 2015.