Apache Spark

In-Memory Compute for Machine Learning & Data Science Workloads

As the ratio of memory to processing power rapidly evolves, many within the Hadoop community are gravitating towards Apache Spark for fast, in-memory data processing. With YARN, they can run Spark for machine learning and data science use cases alongside other workloads on the same cluster.

Apache Spark allows data scientists to implement iterative algorithms simply and effectively for advanced analytics such as clustering and classification of datasets. It is currently a top-level Apache project and is emerging as an attractive alternative for running some discrete data science workloads.

It provides three key value points to developers:

  • in-memory compute for iterative workloads,
  • a simplified programming model in Scala,
  • and machine learning libraries to simplify programming.
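
To make "iterative workloads" concrete, here is a minimal sketch of k-means clustering, one of the iterative algorithms mentioned above. It is plain Python using only the standard library, not Spark's API, and the function name `kmeans` and the sample points are illustrative assumptions. What it shows is the access pattern: every iteration rescans the same dataset, which is why keeping that dataset in memory between passes helps this class of workload.

```python
from math import dist
from statistics import fmean


def kmeans(points, centroids, iterations=10):
    """Plain-Python k-means: every iteration rescans the full in-memory dataset."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(fmean(axis) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
result = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(result)
```

If each pass over `points` had to reload the data from disk, the loop would pay that cost on every iteration; holding the working set in memory removes it.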

Our Approach

As a member of the Hadoop community, Hortonworks develops, tests and supports Apache Spark and currently offers a Tech Preview of the component. Together with the broader Hadoop community, we are also working on a series of initiatives that will bring together the best of heterogeneous, tiered storage and resource-based models of computing, starting at the core of HDFS and working up.

YARN Ready

We have certified Spark as YARN Ready. This means that your memory- and CPU-intensive Spark-based applications can co-exist within a single Hadoop cluster with all the other workloads you have deployed. It allows you to use a single cluster with a single set of data for multiple purposes rather than siloing your Spark workloads in a separate cluster.

Now you can deploy interactive SQL query applications with Hive and low-latency applications using HBase alongside your iterative machine learning workloads deployed using Spark. There is no need to have a separate system or a separate set of resources for your data science work.

Our focus remains on delivering a fast, secure, scalable, and manageable platform on a consistent footprint that includes HDFS, YARN, Tez, Ambari, Knox, and Falcon. We are working within this comprehensive set of components and remain committed to making Apache Spark enterprise ready so that our customers can confidently adopt it.

Coming Next

As is the case with many emerging technologies, Spark has a significant road ahead of it in order to make it ready for the enterprise.  Hortonworks will work within the open community to represent the needs of the modern enterprise in the ASF and help push this project forward.

In the coming months, we will include Spark features within Ambari so it can be easily provisioned, managed, and monitored.  We will also integrate Spark with the recently acquired XA Secure technology so that we can include Spark within a centrally administered security policy. This work will prepare Spark for use within a broader Enterprise Hadoop platform.

Usage Considerations

Apache Spark currently scales to meet the needs of just a handful of concurrent users and is typically stretched to its limits with larger clusters. We therefore advise customers to consider multiple deployments (i.e., a handful of users on each Spark instance, which is convenient via YARN). Functional programming and Scala language skills are also a critical requirement, since the Spark framework cannot be used without them.

A number of workloads and use cases have been proposed for Spark, but thus far we have only seen clear benefits around machine learning and iterative workloads.


The YARN Ready, HDP 2.1 Tech Preview for Spark is available now.


Apache Top-Level Project Since February 2014

Download, installation and setup instructions for evaluating Apache Spark with HDP 2.1

Try Spark with HDP

Try Spark with Sandbox

Hortonworks Sandbox is a self-contained virtual machine with HDP running alongside a set of hands-on, step-by-step Hadoop tutorials.
