February 13, 2017

Executing Traditional SAS Workloads on Hadoop

In this blog, we discuss SAS® Grid Manager for Hadoop. There are compelling reasons to modernize data architectures with Hadoop, and anyone responsible for administering SAS workloads on Hadoop, or considering this path, should know about SAS Grid Manager for Hadoop.

What is SAS Grid Computing?

SAS Grid Computing offers SAS shops a lower-cost, shared, multi-tenant, high-performance computing environment to meet their advanced analytic and modeling needs. By implementing a SAS Grid, SAS administrators can centralize individual and/or departmental SAS computing environments onto a SAS Compute Grid, make better use of IT resources, and provide high availability and accelerated processing. A SAS Grid runs on two or more SAS Grid Compute Nodes. Each SAS Grid Compute Node is a candidate to execute SAS jobs submitted into a Grid queue by SAS user groups at a site.

Enter Apache Hadoop and YARN

The benefits of SAS Grid Computing are not new to the SAS user community. It has been running in production at thousands of customer sites around the world for well over a decade. What's new with SAS Grid Manager for Hadoop is that it offers YARN as an orchestration option, in addition to the existing and well-proven Platform Suite for SAS, which includes LSF.

The SAS Grid and Hortonworks YARN engineering teams worked together during the initial development phases of this new YARN-based Hadoop integration.

“We started this initiative with SAS because both companies could see the value for SAS customers who would want to leverage the power of Hadoop with analytic technologies from SAS, using YARN as the resource manager. We are pleased this engineering effort is available to pool SAS workloads using Hadoop and YARN. This speeds the processing and compute jobs and produces the best predictive and historical analytics insights that will drive new business outcomes.”

–Arun Murthy, Founder and Apache Hadoop PMC member, contributor and committer, Hortonworks

SAS Grid Manager for Hadoop was designed to enable customers to co-locate their SAS Grid and all of the associated SAS workloads on their new or existing SAS Hadoop clusters.

“A tight integration with Hadoop and YARN is important to customers wanting to leverage the power of SAS advanced analytics within a Data Lake. SAS Grid Manager for Hadoop is another important level of integration with YARN.  By co-locating all SAS Grid jobs on the Hadoop cluster, managed by YARN, customers are able to leverage their existing cluster hardware investment and have both compute and data in a single environment.”

– Cheryl Doninger, Senior R&D Director, SAS

Why Move to Hadoop

A decision to develop this type of solution typically starts by identifying business needs and the use cases that support them. Any customer interested in consolidating and centralizing departmental SAS servers, while also planning to leverage Hadoop datasets for existing and new use cases, is a potential candidate for SAS Grid Manager for Hadoop. If a newly defined use case involves new data sources such as sensor, clickstream, web log, machine, and IoT device data, then Hadoop is an ideal location to land this data. Other reasons to consider moving to Hadoop include lowering the total cost of ownership of your IT infrastructure or moving existing SAS workloads to Hadoop. Some sites are even considering complementing their existing SAS Grid Manager running on LSF with a SAS Grid orchestrated by YARN.

Once the decision has been made to move new and/or existing SAS workloads to SAS Grid Manager for Hadoop, it is highly recommended that you invest in SAS and Hadoop administrative training and professional services. It is also critical that you have a detailed understanding of how YARN works with Hadoop. In addition, involving SAS and Hortonworks Professional Services during the project startup phase is important.

How It Works

With SAS Grid Manager for Hadoop, a community of SAS users working in SAS clients submits interactive and batch SAS jobs to the SAS Grid Computing infrastructure on Hadoop. YARN schedules each job onto a suitable SAS Grid Compute Node (a Hadoop Worker Node) for execution, based on Hadoop resource availability, queues, and site policies. Below is a conceptual view of the architecture:

SAS Grid Manager for Hadoop Conceptual Architecture. (Reference: SAS Grid Manager for Hadoop)
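The queues that YARN schedules against can be inspected through the ResourceManager REST API (`/ws/v1/cluster/scheduler`). The following is a minimal sketch, not part of SAS Grid Manager for Hadoop itself: it assumes the cluster uses the Capacity Scheduler and that the JSON response has already been fetched and parsed. The function name `list_queues` is hypothetical.

```python
def list_queues(scheduler_payload):
    """Flatten a Capacity Scheduler queue hierarchy into (path, capacity, used) rows.

    scheduler_payload is the parsed JSON body of GET /ws/v1/cluster/scheduler.
    The field names used here (schedulerInfo, queueName, capacity, usedCapacity,
    queues.queue) are assumed to follow the Capacity Scheduler response layout.
    """
    root = scheduler_payload["scheduler"]["schedulerInfo"]
    rows = []

    def walk(queue, parent_path):
        # Build a dotted path such as "root.default" as we descend.
        name = queue["queueName"]
        path = f"{parent_path}.{name}" if parent_path else name
        rows.append((path, queue.get("capacity"), queue.get("usedCapacity")))
        # Leaf queues have no "queues" entry, so fall back to an empty list.
        for child in (queue.get("queues") or {}).get("queue", []):
            walk(child, path)

    walk(root, "")
    return rows
```

Each row gives a queue's configured share of its parent and how much of that share is currently in use, which is a quick way to see where SAS jobs would compete with other Hadoop workloads.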

YARN 101

If you are new to YARN, let me provide a bit of background. YARN is the orchestration engine for Hadoop 2.x, scheduling batch, interactive, and real-time workloads within a single, multi-tenant Hadoop cluster. Its job is to determine the best worker node on which to run a job or task in Hadoop. SAS Grid Manager for Hadoop leverages YARN for workload orchestration within a highly secure, Kerberized Hadoop cluster: SAS jobs are scheduled by YARN onto Hadoop. This means traditional SAS jobs in this architecture now run inside the Hadoop cluster firewall rather than outside it. This can significantly reduce the complexity of a SAS Hadoop configuration, eliminating port conflicts involving traditional SAS jobs, which in the past ran on Hadoop edge nodes and requested additional Hadoop cluster resources from there.

A Day in the Life of a SAS – Hadoop User

When using SAS client interfaces, it should be transparent to SAS users that they are interacting with SAS Grid. Here is a detailed walkthrough of a SAS user running SAS Process Flows and Tasks on Hadoop from SAS Enterprise Guide (EG).

Below, a SAS user logs into the SAS Enterprise Guide UI and opens an existing project. To execute work from this project, the user must launch a SAS Workspace Server. In this case, when the user expands +SASGrid on the left-hand side of the screen, SAS Grid Manager for Hadoop asks YARN's Resource Manager to launch a SAS Workspace Server (WSS) inside the Hadoop cluster.

Click on +SASGrid to launch SAS Workspace Server on Hadoop Worker Node

The green check mark (below and left) next to SASGrid indicates that the SAS EG user can run any Process Flow or Task in Hadoop, because YARN has successfully launched the SAS WSS on a Hadoop Worker Node:

If we take a look at the YARN UI, typically reserved for Hadoop Administrators, we can see a SAS Enterprise Guide – Workspace Server in a RUNNING state. This indicates to the Hadoop Administrator that a SAS user has reserved a YARN container and is currently active on the cluster (see below):

SAS User reserving a YARN container
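The same information shown in the YARN UI is available from the ResourceManager REST API (`/ws/v1/cluster/apps`). The sketch below is an illustration only: `RM_URL` is a hypothetical address, the function names are made up for this example, and matching on the name fragment "Workspace Server" assumes the application name appears as it does in the UI. On a Kerberized cluster the request would also need SPNEGO authentication, which this sketch omits.

```python
import json
from urllib.request import urlopen

# Hypothetical ResourceManager address; replace with your cluster's RM host and port.
RM_URL = "http://resourcemanager.example.com:8088"


def fetch_running_apps(rm_url=RM_URL):
    """Return the parsed JSON body listing applications in the RUNNING state."""
    with urlopen(f"{rm_url}/ws/v1/cluster/apps?states=RUNNING", timeout=10) as resp:
        return json.load(resp)


def find_apps_by_name(payload, name_fragment):
    """Keep only applications whose name contains name_fragment (case-insensitive)."""
    # The RM returns {"apps": null} when no application matches, so guard against None.
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        {"id": a["id"], "user": a["user"], "queue": a["queue"], "state": a["state"]}
        for a in apps
        if name_fragment.lower() in a.get("name", "").lower()
    ]


# Example use against a live, non-Kerberized cluster:
#   for app in find_apps_by_name(fetch_running_apps(), "Workspace Server"):
#       print(app)
```

Each returned entry identifies the user who holds the container and the queue it was scheduled in, which is what a Hadoop administrator would look for in the UI.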

If we switch back to the SAS user's EG session, we can see that a library has been assigned within Enterprise Guide pointing to HiveServer2. In the Explorer window on the left-hand side, we can also see all the Hive tables under the SAS libref HIVE_TPC. The SAS EG user can now run SAS analytic tasks directly against Hive tables within the hive_tpc database (see below):

SAS EG runs SAS analytic tasks directly against Hive tables

At this point, the SAS EG user can run any traditional SAS job or code and can also call any HDP service (e.g. HDFS, Pig, Hive, MapReduce) in-line within this SAS Workspace Server. We can also call SAS High Performance Analytics, which can also run on YARN.

When the SAS EG user issues a “Disconnect” (see below), this initiates a request to the SAS Workspace Server to shut down. YARN then releases the container that was used for running the SAS Workspace Server and reclaims the resources associated with that container.

Disconnect to SAS Workspace Server to Shutdown

Once the “Disconnect” is complete in EG, the Hadoop Administrator will see that the SAS Enterprise Guide Workspace Server is now “FINISHED” (see below). A disconnect also occurs automatically if a SAS EG user shuts down the UI.

SAS Enterprise Guide Workspace Server Finished
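An administrator who wants to confirm from a script that the container has been released could poll the application's state until it reaches a terminal value. This is a hedged sketch rather than anything shipped with SAS or Hadoop: `wait_for_terminal_state` is a hypothetical helper, and `get_state` stands for any callable that returns the current YARN state string, for example a wrapper around the ResourceManager REST endpoint `/ws/v1/cluster/apps/{appid}/state`.

```python
import time

# YARN application states after which the container's resources have been reclaimed.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def wait_for_terminal_state(get_state, poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Poll get_state() until it reports a terminal YARN state, then return that state.

    get_state: zero-argument callable returning a state string such as "RUNNING".
    sleep: injectable so the loop can be exercised without real waiting.
    Raises TimeoutError if the application is still active after max_polls checks.
    """
    for attempt in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        # Do not sleep after the final check; fall through to the timeout instead.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"application still active after {max_polls} polls")
```

A normal EG disconnect should end in "FINISHED"; "FAILED" or "KILLED" would suggest the Workspace Server did not shut down cleanly.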

For sites interested in co-locating traditional SAS workloads on Hadoop, SAS Grid Manager for Hadoop is an ideal solution. In this blog, we discussed the benefits of moving to SAS Grid Manager for Hadoop and walked through the seamless user experience. To learn more about SAS Grid Manager for Hadoop, please see the links below.

Learn More

SAS Global Forum 2017
Come by the Hortonworks Booth at SAS Global Forum (#SASGF) April 2-5 in Orlando, FL.

Attend the Hortonworks Technical Session on Monday April 3rd at 12:30 PM
I will be authoring a paper for this event that dives into more of the technical details behind the SAS Grid Manager for Hadoop architecture. I'll describe the components that make up YARN and lessons learned in the field around configuring this environment. I'll also discuss the additional workloads a Hadoop administrator can expect from traditional SAS jobs running on Hadoop.

Additional Information

Hortonworks SAS partner site on Hortonworks
http://hortonworks.com/partner/sas/

SAS Hortonworks Partner site on SAS.com
http://www.sas.com/en_us/partners/find-a-partner/alliance-partners/hortonworks.html

SAS Global Forum 2016 Paper:
Authored by Cheryl Doninger and Doug Haigh:
http://support.sas.com/resources/papers/proceedings16/SAS6281-2016.pdf

SAS Grid Manager for Hadoop Documentation:
http://support.sas.com/rnd/scalability/grid/hadoop/index.html

YARN
http://hortonworks.com/apache/yarn/
