With the release of Apache Hadoop YARN in October of last year, more and more solution providers are moving from single-application Hadoop clusters to a versatile, integrated Hadoop 2 data platform. This allows them to host multiple applications — eliminating silos, maximizing resources and bringing true multi-workload capabilities to Hadoop.
That is why we’re extremely excited to have Paul Kent, Vice President of Big Data at SAS, share his insights on the value of Apache Hadoop YARN and the benefits it brings to SAS and its users.
As much as it could be a refrain from the Montessori school playground, this theme of “Share your Cluster” echoes across many modern Apache Hadoop deployments.
Data Architects are plotting to assemble all their data in one system – something that is now achievable thanks to the economics of modern Apache Hadoop systems. Once assembled, this collection of data now has sufficient gravity to attract the application processing towards it – folks are becoming intolerant of the idea that we should make another copy (and have to reconcile, secure and govern that copy) to facilitate processing.
One of the original themes in Hadoop is to move the work to the data. MapReduce is a beautiful expression of this paradigm – the work is expressed in Java classes, and these classes are literally copied and executed across all the data nodes that participate in processing the data.
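To make "move the work to the data" concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the `node_slices` dictionary, `map_task`, `reduce_task`, and `run_job` are hypothetical names that stand in for data nodes and the classes a real job would ship to them. The point it illustrates is that the same small function visits every slice, and only compact partial results travel back to be merged.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-in for data nodes: each one holds its own slice of the data.
node_slices = {
    "node1": ["yarn shares the cluster", "hadoop stores the data"],
    "node2": ["move the work to the data"],
    "node3": ["the cluster runs the work"],
}

def map_task(lines):
    """Runs where the slice lives and returns only a small summary of it."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_task(left, right):
    """Merges two partial summaries into one."""
    return left + right

def run_job(slices):
    # The same function is applied to every slice; the raw lines never move.
    partials = [map_task(lines) for lines in slices.values()]
    return reduce(reduce_task, partials, Counter())

word_counts = run_job(node_slices)
```

In a real cluster the loop in `run_job` would be replaced by the framework scheduling tasks on the nodes that hold each block, but the shape of the computation is the same.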
Apache Hadoop YARN takes this idea to the next level. YARN allows users to move different kinds of computation to the data while sharing both the data and the resources of the cluster, which lets newer applications run alongside traditional MapReduce workloads.
The SAS High-Performance Analytics (HPA) products and SAS LASR Analytic Server based products have their roots in classical High Performance Computing. At heart, both began life as traditional MPI applications. The “mpirun” command launches an instance of the application on each host in the host list. These instances are bound together by infrastructure so that they can communicate easily with one another. SSH is used to create processes on the cluster, and as a practical matter, we use password-free SSH or Kerberos credentials to avoid entering passwords over and over.
For SAS workloads, we prefer that worker tasks be able to communicate with each other and solve the problem team-style. In the MapReduce model, by contrast, each worker is assigned a slice of the dataset and must process that slice in relative isolation. It hands off its results to a downstream task without learning anything from the other workers doing the same processing on their own slices.
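The difference between the two styles can be sketched in a few lines of Python. This is a single-process simulation, not MPI and not how the SAS products are implemented: `allreduce_sum` is a hypothetical helper that mimics a collective operation in which every worker receives the combined total, and the `slices` values are made up for illustration.

```python
def allreduce_sum(values):
    """Simulated collective: every worker receives the sum of all contributions."""
    total = sum(values)
    return [total for _ in values]

# One slice of the dataset per worker (illustrative numbers only).
slices = [[1.0, 2.0, 3.0], [10.0, 11.0], [20.0]]

# Isolated style: each worker sees only its own slice,
# so each one arrives at a different, purely local answer.
isolated_means = [sum(s) / len(s) for s in slices]

# Team style, round 1: workers exchange local sums and counts,
# so every worker ends up knowing the global mean.
sums = allreduce_sum([sum(s) for s in slices])
counts = allreduce_sum([len(s) for s in slices])
global_means = [t / n for t, n in zip(sums, counts)]

# Team style, round 2: each worker uses the shared mean on its own slice,
# then the workers combine again to agree on the global variance.
squared = allreduce_sum(
    [sum((x - m) ** 2 for x in s) for s, m in zip(slices, global_means)]
)
variances = [q / n for q, n in zip(squared, counts)]
```

After the two rounds every worker holds the same mean and variance, whereas the isolated workers never learn more than their local averages. Iterative analytics that repeat such exchanges many times are the case where this kind of communication pays off.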
Adopting YARN allows us to use the YARN infrastructure to set the boundaries for the processes needed to run SAS HPA products and SAS LASR Analytic Server based products. CPU and memory can be capped, facilitating a better sharing model for the cluster. Our early adoption of YARN, and our expanded integration with it going forward, position SAS as a good citizen at the center of shared Hadoop clusters. Our engineers have been working with Hortonworks engineers on the optimal way to integrate SAS HPA and SAS LASR Analytic Server with Hadoop YARN, so that customers get the most out of their Hadoop clusters.
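As a rough illustration of why capped CPU and memory make sharing easier, the toy model below grants resource requests only while they fit in what a node has left. It is a sketch under stated assumptions, not YARN's scheduler or API: the `Resources` class, the `allocate` function, the workload names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resources:
    """A hypothetical memory and CPU budget, loosely modeled on a container request."""
    memory_mb: int
    vcores: int

    def fits_in(self, other):
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other):
        return Resources(self.memory_mb - other.memory_mb, self.vcores - other.vcores)

def allocate(capacity, requests):
    """Grant requests in order while they fit; return (granted names, what is left)."""
    granted = []
    remaining = capacity
    for name, ask in requests:
        if ask.fits_in(remaining):
            granted.append(name)
            remaining = remaining.minus(ask)
    return granted, remaining

# Illustrative node size and workloads; none of these figures come from a real cluster.
node = Resources(memory_mb=16384, vcores=8)
requests = [
    ("mapreduce-task", Resources(4096, 2)),
    ("analytics-worker", Resources(8192, 4)),
    ("second-analytics-worker", Resources(8192, 4)),
]
granted, remaining = allocate(node, requests)
```

Because every workload states its ceiling up front, the in-memory analytics worker and the MapReduce task can coexist on the node, and the request that would oversubscribe it is simply not granted there.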
This is an exciting development for us. SAS applications bring highly advanced, in-memory analytic processing to the data in Hadoop and enable a rich set of additional use cases with high-performance analytic needs. Together, our combined technologies give customers more flexibility to run best-of-breed SAS HPA and LASR analytic applications alongside their trusted Hadoop workloads as they build and deploy bigger Hadoop clusters.