In-Memory Compute with Apache Spark

Machine Learning & Data Science Workloads in Hadoop
As the ratio of memory to processing power rapidly evolves, many in the Hadoop community are gravitating toward Apache Spark for fast, in-memory data processing. With YARN, they can run Spark for machine learning and data science use cases alongside other workloads simultaneously.

Apache Spark allows data scientists to simply and effectively implement iterative algorithms for advanced analytics such as clustering and classification of datasets. It is a top-level Apache project and is emerging as an attractive option for running certain discrete data science workloads. It provides three key value points to developers (the sketch after this list illustrates them):

  • in-memory compute for iterative workloads,
  • a simplified programming model in Scala,
  • and machine learning libraries to simplify programming.
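The sketch below puts these points together in a minimal Scala program: an MLlib k-means job over a cached dataset, so each iteration reads from memory rather than re-scanning HDFS. The input path, cluster count, and iteration limit are hypothetical placeholders, and the example assumes Spark with MLlib is available on the cluster.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object ClusteringSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ClusteringSketch"))

        // Parse whitespace-delimited feature vectors and cache them in memory so that
        // each pass of the iterative algorithm reads from RAM instead of from disk.
        val features = sc.textFile("hdfs:///data/features.txt")   // hypothetical input path
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        // Run MLlib k-means: 10 clusters, at most 20 iterations over the cached data.
        val model = KMeans.train(features, 10, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }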

Our Approach

As a member of the Hadoop community, Hortonworks develops, tests, and supports Apache Spark and currently offers a Tech Preview of the component. We are also working with the broader Hadoop community on a series of initiatives that bring together heterogeneous, tiered storage and resource-based models of computing, starting at the core of HDFS and working up.

YARN Ready

We have certified Spark as YARN Ready. This means that your memory- and CPU-intensive Spark-based applications can coexist within a single Hadoop cluster with all the other workloads you have deployed. It allows you to use a single cluster with a single set of data for multiple purposes rather than siloing your Spark workloads into a separate cluster.

Now you can deploy interactive SQL query applications with Hive and low-latency applications using HBase alongside your iterative machine learning workloads deployed with Spark. There is no need for a separate system or a separate set of resources for your data science work.
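As an illustration, a Spark job written for a shared YARN cluster needs nothing special in code; the resource requests are made when the job is submitted, so YARN can size its containers alongside the other tenants. The jar name, input path, and submission flags below are illustrative assumptions, not recommendations.

    import org.apache.spark.{SparkConf, SparkContext}

    // Submitted to the shared cluster with something like (values are illustrative):
    //   spark-submit --master yarn-client --num-executors 4 --executor-memory 2g \
    //     --class SharedClusterJob shared-cluster-job.jar
    object SharedClusterJob {
      def main(args: Array[String]): Unit = {
        // Master and resource requests come from spark-submit, so YARN allocates this
        // job's containers alongside Hive, HBase, and any other deployed workloads.
        val sc = new SparkContext(new SparkConf().setAppName("SharedClusterJob"))

        // A trivial job over a dataset that other engines on the same cluster also query.
        val count = sc.textFile("hdfs:///shared/events.log").count()   // hypothetical path
        println(s"event lines: $count")

        sc.stop()
      }
    }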

Our focus remains on delivering a fast, secure, scalable, and manageable platform on a consistent footprint that includes HDFS, YARN, Tez, Ambari, Knox, and Falcon. We are working within this comprehensive set of components and remain committed to making Apache Spark enterprise-ready so that our customers can adopt it with confidence.

Coming Next

As is the case with many emerging technologies, Spark has a significant road ahead before it is ready for the enterprise. Hortonworks will work within the open community to represent the needs of the modern enterprise in the ASF and to help push this project forward.

In the coming months, we will include Spark features within Ambari so that Spark can be easily provisioned, managed, and monitored. We will also integrate Spark with the recently acquired XA Secure technology so that Spark can be governed by a centrally administered security policy. This work will prepare Spark for use within a broader Enterprise Hadoop platform.

Usage Considerations

Apache Spark currently scales to meet the needs of only a handful of concurrent users and is typically stretched to its limits on larger clusters. We therefore advise customers to consider multiple deployments (i.e., a handful of users per Spark instance, which YARN makes convenient). Note that functional programming and Scala language skills are a prerequisite for using the Spark framework.

A number of workloads and use cases have been proposed for Spark, but thus far we have seen clear benefits only for machine learning and other iterative workloads.

Availability

The YARN Ready, HDP 2.1 Tech Preview for Spark is available now.

  • Download, installation, and setup instructions for evaluating Apache Spark with HDP 2.1
  • Try Spark with HDP
