September 24, 2015

YARN – What’s the big deal?

Since Hortonworks and SAS formed their partnership, we have created some awesome assets (e.g., the SAS Data Loader sandbox tutorial, educational webinars, and an array of blogs) that give Hadoop and Big Data enthusiasts hands-on training with Apache Hadoop and SAS’ powerful analytics solutions. You can find more details about our partnership and resources here: https://hortonworks.com/partner/sas

To continue the momentum, we have Paul Kent, Vice President of Big Data at SAS, sharing his insights on the value of YARN and the benefits it brings to SAS and its users, this time with a focus on SAS Grid and YARN.

On my travels and in the SAS Executive Briefing Center, it has become more obvious that many folks have grabbed on to the idea that Hadoop will allow them to do two things:

  1. to assemble a copy of all their data in one place
  2. to provide enough processing horsepower to actually make some sense (business value) of the patterns contained in a holistic view of said data

As they get closer to this goal, they realize what a valuable resource the data lake has become. They need an effective means to “share nicely”: it’s not likely that every department is going to have the resources to establish its own data lake, and even if they do, you’ll be back to arguing about which version of the truth is the correct one.

YARN is the component in the Hadoop ecosystem that helps folks share the value gained from building a shared pool of the organization’s data.

Move the work to the Data

As data volumes and velocities grow, it has become important to find a strategy that minimizes the number of hard (permanent) copies of data, along with the reconciliation and governance work each copy brings. YARN allows Hadoop to become “the Operating System for your data”: a tool that manages and mediates access to the shared pool of data, as well as the resources to manipulate the pool.

YARN allows the various patterns of work destined for your cluster to form orderly and rational queues, so that you can set the policy for what is urgent, what is important, what is routine, and what should be allowed to soak up resources so long as no one else requires them at the moment.
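To make the queue idea concrete, here is a toy model in Python. It is only a sketch of the policy described above, not YARN's actual scheduler or any YARN API: the `Queue` class, the `allocate` function, the queue names, and all of the percentages are hypothetical and chosen purely for illustration. Each queue is promised a guaranteed share of the cluster, and a queue may borrow idle capacity up to its own ceiling.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    guaranteed: float  # share of the cluster promised to this queue (0..1)
    maximum: float     # ceiling the queue may grow to when others are idle
    demand: float      # share of the cluster the queue is asking for right now


def allocate(queues):
    """Return {queue name: share of cluster} using a toy two-pass policy."""
    # Pass 1: every queue receives its demand, capped by its guarantee.
    shares = {q.name: min(q.demand, q.guaranteed) for q in queues}
    spare = 1.0 - sum(shares.values())
    # Pass 2: hand spare capacity to queues that still want more,
    # in list order, never exceeding a queue's maximum.
    for q in queues:
        want = min(q.demand, q.maximum) - shares[q.name]
        extra = min(max(want, 0.0), spare)
        shares[q.name] += extra
        spare -= extra
    return shares


# Hypothetical policy: guarantees add up to the whole cluster.
queues = [
    Queue("urgent", guaranteed=0.5, maximum=0.75, demand=0.25),
    Queue("routine", guaranteed=0.25, maximum=0.5, demand=0.25),
    Queue("background", guaranteed=0.25, maximum=1.0, demand=1.0),
]
shares = allocate(queues)
print(shares)
```

In this run the urgent queue only needs half of what it is promised, so the background queue soaks up the idle quarter of the cluster on top of its own guarantee. If the urgent queue's demand rose, a real scheduler would hand that capacity back; this sketch simply recomputes the shares from scratch each time it is called.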

[Figure: SAS_YARN]

Expand then Consolidate

Disruptive technologies like Hadoop are often deployed “at the fringes” of an organization (perhaps in an Innovation Lab). Initial ROI is often found by attacking new ground: problems the organization had not attempted to handle (or handle at scale) before. When these early projects succeed, I’ve seen many customers ask themselves, “Well, that worked OK; is there some way to consolidate the older ways of doing things into this new world?” Simplifying and modernizing their Analytics Landscape comes as a delightful side effect!

The blue box for “SAS” above really represents a few distinct patterns of work for the Hadoop cluster:

  1. Long-running server (sometimes called daemon) processes. The SAS LASR server is purpose-built to load your important data into distributed memory and to service requests against that data rapidly, with low latency.
  2. Resource-intensive single-user tasks that require distributed computing. Each invocation of a SAS HPA procedure (to build a regression model, to train a neural network, or to determine a decision tree) needs memory and CPU cycles from several cluster nodes to perform its task, and those resources are returned to the pool immediately after the task completes.
  3. Traditional grid computing, where jobs from many users are distributed over several servers to improve availability and ultimately response times. This is not distributed (massively parallel) computing in the sense that several computers attack one problem, but it is a form of sharing the load where several computers attack the tasks of several users in a divide-and-conquer style.
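The difference between the second and third patterns can be sketched in a few lines of Python. This is an illustration only: threads stand in for cluster nodes, summing numbers stands in for real analytics, and the function names (`run_job`, `grid_style`, `distributed_style`) and the sample users are hypothetical, not part of any SAS or Hadoop API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job):
    """Stand-in for one user's complete, independent job."""
    user, numbers = job
    return user, sum(numbers)


def grid_style(jobs, workers=3):
    """Pattern 3: many independent jobs, each handled whole by one worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_job, jobs))


def distributed_style(numbers, workers=3):
    """Pattern 2: one problem split into slices, partial results combined."""
    slices = [numbers[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, slices))


# Three users each submit their own small job (grid style).
jobs = [("ann", [1, 2, 3]), ("bob", [10, 20]), ("cho", [5])]
per_user = grid_style(jobs)

# One user submits a single large task that every worker helps with.
total = distributed_style(list(range(1, 101)))

print(per_user, total)
```

In the grid case each worker finishes a whole job without talking to the others; in the distributed case no single worker sees all of the data, and the answer only exists once the partial results are combined.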

[Figure: SASGRID1]

The first two patterns above are examples of new-world distributed computing. The third is an example of using the newer infrastructure to replace, at a lower cost, the hardware used for a previous-generation Analytics Landscape. SAS Grid Manager is also the only product to provide horizontal scaling of an application where some parts of the application need to operate on all of the data, such as a Monte Carlo simulation. The “cherry on top” is that you can combine these technologies, such that a single SAS Grid job running on a Hadoop data node could kick off an HPA job that would distribute vertically to send processing to each node to operate on the local data.

I asked Cheryl Doninger, who leads the development for SAS Grid Manager, why customers should be excited about this new flavor of SAS Grid Manager, and she said: “SAS Grid Manager for Hadoop is a perfect fit for our customers who have, or plan to implement in the near future, a multi-application data operating system, as described by Arun. Now they can co-locate all of their SAS Grid jobs on the Hadoop cluster and manage them with YARN along with any other analysis being done on the cluster. The SAS Grid jobs can leverage any of the SAS integration points to Hadoop to maximize the value of this shared pool of data and all through direct integration with YARN or by leveraging other components of the Hadoop ecosystem that are natively managed by YARN.”

All this effort was the result of a tightly integrated joint engineering collaboration with Hortonworks and the Apache YARN team, including committer Arun Murthy.

To learn more about SAS Grid Manager for Hadoop, visit the SAS support site: https://support.sas.com/rnd/scalability/grid/hadoop/index.html
