In-Memory Compute with Apache Spark
Apache Spark allows data scientists to implement iterative algorithms for advanced analytics, such as clustering and classification of datasets, simply and effectively. It is currently a top-level Apache project and is emerging as an attractive engine for certain discrete data science workloads. It provides three key value points to developers:
- in-memory compute for iterative workloads,
- a simplified programming model in Scala,
- and machine learning libraries to simplify programming.
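These three points come together in a typical Spark program: the dataset is loaded once and cached in memory, and an MLlib routine then iterates over it without re-reading from disk. The sketch below assumes the Spark 1.0-era Scala API; the HDFS path is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ClusteringSketch"))

    // Parse the points and cache them: K-means is iterative, and every
    // iteration re-reads this RDD, so caching keeps it in memory after
    // the first pass instead of re-reading it from HDFS each time.
    val points = sc.textFile("hdfs:///data/points.txt") // hypothetical path
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // MLlib handles the iterative work: cluster into 3 groups,
    // running up to 20 iterations over the cached RDD.
    val model = KMeans.train(points, 3, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```

Without the `cache()` call the job still runs, but each of the 20 iterations would go back to HDFS, which is exactly the cost the in-memory model is designed to avoid.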
As a member of the Hadoop community, Hortonworks develops, tests, and supports Apache Spark, and currently offers a Tech Preview of the component. We are also working on a series of initiatives with the broader Hadoop community to bring together heterogeneous, tiered storage and resource-based models of computing, starting at the core of HDFS and working up.
We have certified Spark as YARN Ready. This means that your memory- and CPU-intensive Spark applications can coexist with all the other workloads you have deployed in a single Hadoop cluster. You can use one cluster, with a single set of data, for multiple purposes, rather than siloing your Spark workloads into a separate cluster.
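In practice, running Spark on YARN means submitting the application so that YARN allocates the driver and executors as containers alongside the cluster's other workloads. A minimal sketch using `spark-submit` (the application class, jar path, and resource sizes here are illustrative):

```shell
# Submit a Spark application to a YARN cluster. YARN schedules the
# driver and executors as containers next to other workloads, so no
# dedicated Spark cluster is needed.
spark-submit \
  --class com.example.ClusteringSketch \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  /path/to/app.jar
```

Because the executors' memory and cores are requested through YARN, the ResourceManager can balance this job against Hive, HBase, and other tenants on the same cluster.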
Now you can deploy interactive SQL query applications with Hive and low-latency applications using HBase alongside your iterative machine learning workloads deployed with Spark. There is no need for a separate system or a separate set of resources for your data science work.
Our focus remains on delivering a fast, secure, scalable, and manageable platform on a consistent footprint that includes HDFS, YARN, Tez, Ambari, Knox, and Falcon. We are working within this comprehensive set of components and continue to follow our commitment to make Apache Spark enterprise ready so that our customers can confidently adopt it.
As is the case with many emerging technologies, Spark has a significant road ahead before it is ready for the enterprise. Hortonworks will work within the open community to represent the needs of the modern enterprise in the ASF and help push this project forward.
In the coming months, we will include Spark features within Ambari so it can be easily provisioned, managed, and monitored. We will also integrate Spark with the recently acquired XA Secure technology so that we can include Spark within a centrally administered security policy. This work will prepare Spark for use within a broader Enterprise Hadoop platform.
Apache Spark currently scales to meet the needs of only a handful of concurrent users, and it is typically stretched to its limits on larger clusters. We therefore advise customers to consider multiple deployments, for example a handful of users per Spark instance, which YARN makes convenient. Finally, functional programming and Scala language skills are a critical requirement for using the Spark framework.
A number of workloads and use cases have been proposed for Spark, but thus far we have seen clear benefits only for machine learning and other iterative workloads.
The YARN Ready, HDP 2.1 Tech Preview for Spark is available now.