Apache Spark & Hadoop
Spark adds in-Memory Compute for ETL, Machine Learning and Data Science Workloads to Hadoop
Apache Spark brings fast, in-memory data processing to Hadoop. Elegant and expressive development APIs in Scala, Java, and Python allow data workers to efficiently execute streaming, machine learning or SQL workloads for fast iterative access to datasets.
What Spark Does
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.
The Hadoop YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of many data access engines that work with YARN in HDP.
Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development.
Additional libraries, built atop the core, allow diverse workloads for streaming, SQL, and machine learning.
Spark & HDP
Spark is certified as YARN Ready and is a part of HDP. Memory and CPU-intensive Spark-based applications can coexist with other workloads deployed in a YARN-enabled cluster. This approach avoids the need to create and manage dedicated Spark clusters and allows for more efficient resource use within a single cluster.
HDP also provides consistent governance, security and management policies for Spark applications, just as it does for the other data processing engines within HDP.
Hortonworks approached Spark in the same way we approached other data access engines like Storm, Hive, and HBase. We outline a strategy, rally the community, and contribute key features within the Apache Software Foundation’s process.
Below is a summary of the various integration points that make Spark enterprise-ready.
- Support for the ORCFile formatAs part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format. ORC is a columnar storage format that is tightly integrated with HDFS and provides optimizations for both read performance and data compression. It is rapidly becoming the defacto storage format for Hive. Hortonworks contributed to SPARK-2883, which provides basic support of ORCFile in Spark.
- SecurityMany of our customers’ initial use cases for Spark run on Hadoop clusters which either do not contain sensitive data or are dedicated for a single application and so they are not subject to broad security requirements. But users plan to deploy Spark-based applications alongside other applications in a single cluster, so we worked to integrate Spark with the security constructs of the broader Hadoop platform. We hear a common request that Spark runs effectively on a secure Hadoop cluster and can leverage authorization offered by HDFS.Also to improve security we have worked within the community to ensure that Spark runs on a Kerberos-enabled cluster. This means that only authenticated users can submit Spark jobs.
- OperationsHortonworks continues to focus on streamlining operations for Spark through the 100% open source Apache Ambari. Our customers use Ambari to provision, manage and monitor their HDP clusters, and many Hortonworks partners, such as Microsoft, Teradata, Pivotal and HP have all taken advantage and backed this foundational Hadoop project. Currently, our partners leverage Ambari Stacksto rapidly define new components/services and add those within a Hadoop cluster. With Stacks, Spark component(s) and services can be managed by Ambari so that you can install, start, stop and configure to fine-tune a Spark deployment all via a single interface that is used for all engines in your Hadoop cluster. The Quick Links feature of Ambari will allow for the cluster operator to access the native Spark User Interface.To simplify the operational experience, HDP 2.2.4 also allows Spark to be installed and be managed by Apache Ambari 2.0. Ambari allows the cluster administrator to manage the configuration of Spark and Spark daemons life cycles.
- Improved Reliability and Scale of Spark-on-YARNThe Spark API allows developers to create both iterative and in-memory applications on Apache Hadoop YARN. With the community interest behind it Spark is making great strides in efficient cluster resource usage. With Dynamic executor Allocation on YARN, Spark only uses Executors within a bound. We continue to believe Spark can use the cluster resources more efficiently and are working with the community to promote a better resource usage.
- YARN ATS IntegrationFrom an operations perspective, Hortonworks has integrated Spark with the YARN Application Timeline Server (ATS). ATS provides generic storage and retrieval of applications’ current and historic information. This permits a common integration point for certain classes of operational information and metrics. With this integration, the cluster operator can take advantage of information already available from YARN to gain additional visibility into the health and execution status of the Spark jobs.
Fundamentally, our strategy continues to focus on innovating at the core of Hadoop and we look forward to continuing to support our customers and partners by contributing to a vibrant Hadoop ecosystem that includes Apache Spark as yet another data access application running in YARN.
Hortonworks' Focus for Spark
Hortonworks continues to invest in Spark for Enterprise Hadoop so users can deploy Spark-based applications alongside other Hadoop workloads in a consistent, predictable and robust way. Current investment includes:
- Leverage the scale and multi-tenancy provided by YARN so its memory and CPU-intensive apps can work with predictable performance
- Deliver HDFS memory tier integration with Spark to allow RDD caching
- Enhance the data science experience with Spark
- Continue Integrating with HDP’s operations, security, governance and data management capabilities
There are additional opportunities for Hortonworks to contribute to and maximize the value of technologies that interact with Spark. Specifically, we believe that we can further optimize data access via the new DataSources API. This should allow SparkSQL users to take full advantage of the following capabilities:
- ORCFile instantiation as a table
- Column pruning
- Language integrated queries
- Predicate pushdown
Getting Started with Spark
For developers new to Spark, our conversations typically revolve around two stages in their journey building Spark-based applications:
Stage 1 – Explore and Develop in Spark Local Mode
The first stage starts with a local mode of Spark where Spark runs on a single node. The developer uses this system to learn Spark and starts to build a prototype of the their application leveraging the Spark API. Using Spark Shells (Scala REPL & PySpark), a developer rapidly prototypes and packages a Spark application with tools such as Maven or Scala Build Tool (SBT). Even though the dataset is typically small (so that it fits on a developer machine), a developer can easily debug the application on a single node.
Stage 2 – Deploy Production Spark Applications
The second stage involves running the prototype application against a much larger dataset to fine tune it and get it ready for a production deployment. Typically, this means running Spark on YARN as another workload in the enterprise data lake and allowing it to read data from HDFS. The developer takes the custom application created against a local mode of Spark and submits the application as a Spark job to a staging or production cluster.
Data Science with Spark
For data scientists, Spark is a highly effective data processing tool. It offers first class support for machine learning algorithms and provides an expressive and higher-level API abstraction for transforming or iterating over datasets. Put simply, Apache Spark makes it easier to build machine learning pipelines compared to other approaches.
Data scientists often use tools such as Notebooks (e.g. iPython) to quickly create prototypes and share their work. Many of data scientists love R, and the Spark community is hard at work to deliver R integration with SparkR. We are excited about this emerging capability.
For ease of use, Apache Zeppelin is an emerging tool that provides Notebook features for Spark. We have been exploring Zeppelin and discovered that it makes Spark more accessible and useful.
Here is a screenshot that provides a view into the compelling user interface that Zeppelin can provide for Spark.
With the new release of Spark 1.3, it brought forth new features such as a Data Frames API and Direct Kafka support in Spark Streaming. With Spark 1.4, SparkR gives R support. Given the pace with which these capabilities continue to appear, we plan to continue to provide updates via tech previews between our major releases to allow customers to keep up with the speed of innovation in Spark.
Learn, Try, and Do
The next step is to traverse the TRY and Do tabs and try some tutorials that demonstrate how to load data into HDFS, create Hive tables, process data in memory, and query data, all using Spark APIs and Spark shell.
To get started, take a look at the following tutorials by traversing the Do tab.