How to deploy MemcacheD on YARN in Hadoop 2

When I first started to understand what YARN is, I wanted to build an application to understand its core. There was already a great example YARN application called Distributed Shell that I could use as the shell (pun intended) for my experiment. Now I just needed an existing application that could provide massive reuse value to other applications. I looked around and decided on MemcacheD.

This brief guide shows how to get MemcacheD up and running on YARN – MOYA if you will…

Prerequisites

You’re going to need a few things to get the sample application operational.

The source code for my work can be found on GitHub at: https://github.com/josephxsxn/moya

What is MemcacheD?

MemcacheD is a distributed key-value caching system; simply put, it is a giant distributed in-memory hash table. Using MemcacheD in application development can reduce the load on database servers or other systems where you may be fetching a value repeatedly. Because MemcacheD is application-language neutral, it can be integrated with many different applications.

MemcacheD works by attempting to fetch a key from the in-memory hash table. If it exists, we don’t have to hit our database for the key’s value. If it doesn’t exist, we should get the value from the database and then cache it into MemcacheD for later access.
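To make the pattern concrete, here is a minimal cache-aside sketch in Java using the spymemcached client. The client library and the loadFromDatabase helper are my own illustration of the idea, not part of the MOYA code.

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAsideExample {
    // Hypothetical database lookup used only to illustrate the pattern.
    static String loadFromDatabase(String key) {
        return "value-for-" + key;
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        String key = "user:42";
        // Try the in-memory hash table first.
        String value = (String) cache.get(key);
        if (value == null) {
            // Miss: fall back to the database, then cache the value for later access.
            value = loadFromDatabase(key);
            cache.set(key, 3600, value); // expire after one hour
        }
        System.out.println(key + " = " + value);
        cache.shutdown();
    }
}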

Working with the Distributed Shell Example Code

The Distributed Shell (DS) code included with YARN is by far the easiest way to get started writing your own YARN application; think of it as the Word Count of YARN. If you haven't already looked at the DS source code, check it out.

There are three main classes at the core of getting this running on YARN:

  1. Client – Negotiates the first container for our Application Master; it also places all the other resources for our application directly onto HDFS.
  2. ApplicationMaster – The master of our MemcacheD cluster, responsible for starting up the MemcacheD Server Daemons. It also creates a ‘Persistent’ MOYA ZNode for our server daemons.
  3. DSConstants – This file has the key names for the environment variables we will set up. We will end up adding a number of new parameters here. In the MOYA source this class was renamed to MConstants.

JMemcached Server Daemon

Anyone that knows me knows I program in Java and that’s about the end of it. I’ll dabble in Ruby or Python, but at the end of the day, if a bash script won’t cut it I go directly to Java. Because of this I decided to use the JMemcached Server Daemon found here. Being a Java implementation makes it easy to embed in code, but it comes with some minor performance hits, which is okay by me in this situation.

One class is needed for our JMemcached Server Daemon:

  1. StartMemcached – This is the class our Application Master will request a container for and then launch our daemon as a runnable jar. Here our daemon gets configured, joins the ‘Persistent’ ZNode created by the Application Master, and adds itself as an ‘Ephemeral’ ZNode consisting of the daemon’s connection information. Because it is an Ephemeral node, should this Server Daemon ever die, its ZNode is automatically removed from ZooKeeper. A rough sketch of the embedded daemon startup follows.
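For reference, starting an embedded JMemcached daemon looks roughly like the sketch below. It is based on the example usage in the jmemcached 1.0.x documentation; exact class names and signatures may differ between versions, and the ZooKeeper registration that StartMemcached performs is only hinted at in a comment.

import java.net.InetSocketAddress;
import com.thimbleware.jmemcached.CacheImpl;
import com.thimbleware.jmemcached.Key;
import com.thimbleware.jmemcached.LocalCacheElement;
import com.thimbleware.jmemcached.MemCacheDaemon;
import com.thimbleware.jmemcached.storage.CacheStorage;
import com.thimbleware.jmemcached.storage.hash.ConcurrentLinkedHashMap;

public class EmbeddedMemcachedSketch {
    public static void main(String[] args) {
        int port = 8555; // the port the MOYA daemons advertise in ZooKeeper

        // In-memory store: at most 100,000 items or 64 MB, FIFO eviction.
        CacheStorage<Key, LocalCacheElement> storage = ConcurrentLinkedHashMap.create(
                ConcurrentLinkedHashMap.EvictionPolicy.FIFO, 100000, 64 * 1024 * 1024);

        MemCacheDaemon<LocalCacheElement> daemon = new MemCacheDaemon<LocalCacheElement>();
        daemon.setCache(new CacheImpl(storage));
        daemon.setBinary(false);                      // text protocol
        daemon.setAddr(new InetSocketAddress(port));  // bind on all interfaces
        daemon.start();
        // After start(), StartMemcached would register HOSTNAME:PORT as an
        // ephemeral ZNode under /moya (see the ZooKeeper sketch later on).
    }
}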

Gluing it together with ZooKeeper

One of the primary challenges I ran into when thinking up my design was how to deal with the fact that the MemcacheD Server Daemons may start on different nodes in my Hadoop cluster every time. With no centralized repository for the configuration information, my MemcacheD clients wouldn’t know where a Server Daemon is running, making the caching layer inaccessible. Using ZooKeeper, we can tackle these distributed configuration issues with ease.

ZooKeeper provides a centralized service for distributed synchronization, maintaining configuration information, and more. By having the Server Daemons publish their host and port information to the ZooKeeper node created by the Application Master, we end up with a centralized list of all currently alive Server Daemons. Now our client can connect to ZooKeeper and retrieve the Server Daemons it needs to configure its client classes for load balancing across the distributed JMemcached Server Daemons.

To manage and use the ZNodes in ZooKeeper we need three classes:

  1. CreateGroup – Used by our Application Master; it connects to our ZooKeeper servers and creates the initial ‘/moya’ ZNode.
  2. JoinGroup – Used by the StartMemcached class. This allows the daemon to join the initial /moya ZNode and add its information as a ZNode /moya/HOSTNAME:PORT.
  3. ListGroup – The MemcacheD client uses this code to get a list of all the hostnames and ports of the Server Daemons that attached themselves to the /moya ZNode. A condensed sketch of all three follows the list.
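The three classes boil down to a handful of calls against the plain ZooKeeper API. The condensed sketch below shows the pattern in one place; the class and variable names are mine, and the real MOYA classes add watchers and proper error handling.

import java.net.InetAddress;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MoyaZkSketch {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper (no-op watcher for brevity).
        ZooKeeper zk = new ZooKeeper("192.168.17.52:2181", 30000, event -> { });

        // CreateGroup (Application Master): persistent parent ZNode.
        try {
            zk.create("/moya", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException alreadyThere) {
            // Fine - a previous run already created it.
        }

        // JoinGroup (StartMemcached): ephemeral child named HOSTNAME:PORT.
        String member = InetAddress.getLocalHost().getHostName() + ":8555";
        zk.create("/moya/" + member, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // ListGroup (MemcacheD client): enumerate the live Server Daemons.
        List<String> servers = zk.getChildren("/moya", false);
        System.out.println("Live daemons: " + servers);
    }
}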

Refactor, Refactor, Refactor

Only a minimal amount of code changes have to be made to use the DS code as a template for your own application. Most of it involves changing the Client and ApplicationMaster classes to perform their new task – i.e. launching a jar. Below is the basic list of changes made to the Distributed Shell source code to be able to run MemcacheD on YARN.

This assumes you’re following along in the MOYA source code.

Client.java

  • Lines 282–310 add the extra command-line parameters we need for the ZooKeeper hosts and the runnable jar. You will also notice, if you compare this with the DS code, that we have gutted some unused parameters.
  • Lines 409–471 set up the resources for the Application Master jar and the MemcacheD Server Daemon jar; this places the jars onto HDFS.
  • Lines 504–515 set up the environment variables for the locations of the AM and Server Daemon jars. We also add the ZooKeeper hosts string to the environment so the AM can retrieve it later.
  • Lines 548–570 define the initial commands that will be run when the AM container is created. Specifically, on lines 550, 552 and 554 we set up JAVA_HOME, the max JVM heap size for the AM, and finally the class in the jar we wish to launch. A condensed sketch of this client-side flow follows the list.
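Condensed, the Client-side changes amount to staging the jar on HDFS, describing it as a LocalResource, filling in the environment, and handing the ResourceManager a launch command for the AM container. The sketch below shows that flow with the YARN client API; the HDFS path, environment key, class name, and memory numbers are illustrative rather than copied from the MOYA source.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class ClientSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        YarnClientApplication app = yarnClient.createApplication();

        // Stage the AM jar on HDFS and describe it as a LocalResource (lines 409-471).
        FileSystem fs = FileSystem.get(conf);
        Path amJar = new Path("/apps/moya/MOYA-CLIENT.jar"); // illustrative path
        FileStatus stat = fs.getFileStatus(amJar);
        LocalResource amJarRsrc = Records.newRecord(LocalResource.class);
        amJarRsrc.setType(LocalResourceType.FILE);
        amJarRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
        amJarRsrc.setResource(ConverterUtils.getYarnUrlFromPath(amJar));
        amJarRsrc.setTimestamp(stat.getModificationTime());
        amJarRsrc.setSize(stat.getLen());

        // Environment: jar locations and ZooKeeper hosts (lines 504-515).
        Map<String, String> env = new HashMap<String, String>();
        env.put("ZKHOSTS", "192.168.17.52:2181"); // illustrative key name

        // Launch command for the AM container (lines 548-570).
        // Classpath setup via the environment is omitted for brevity.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setLocalResources(Collections.singletonMap("AppMaster.jar", amJarRsrc));
        amContainer.setEnvironment(env);
        amContainer.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java -Xmx256m org.moya.core.yarn.ApplicationMaster"
                + " 1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr"));

        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(512);
        appContext.setResource(capability);
        appContext.setAMContainerSpec(amContainer);
        yarnClient.submitApplication(appContext);
    }
}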

ApplicationMaster.java

  • Lines 407–435 fetch the environment variables we set up for the locations of the jars and the ZooKeeper hosts.
  • Lines 836–879, as in Client.java, set up the resources (the Server Daemon jar) we wish the container to have access to.
  • Lines 884–912 place the ZooKeeper hosts variable into the container environment that will be used by our Server Daemons, and configure all the other classpaths as before.
  • Lines 917–951 finish setting up the runnable jar command we will execute when creating the Server Daemon containers. On line 932 you can see that this time we use the java -jar command. A sketch of this piece follows the list.
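On the AM side, each allocated container gets its own ContainerLaunchContext carrying the Server Daemon jar as a local resource, the environment with the ZooKeeper hosts, and a java -jar command, and is then started through NMClientAsync. A condensed sketch of that piece (the names and the exact command line are illustrative, not the MOYA source):

import java.util.Collections;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;
import org.apache.hadoop.yarn.util.Records;

public class LaunchDaemonSketch {
    private final NMClientAsync nmClientAsync;
    private final Map<String, LocalResource> serverJar;  // MOYA-SERVER jar staged by the Client
    private final Map<String, String> env;               // includes the ZooKeeper hosts string

    LaunchDaemonSketch(NMClientAsync nm, Map<String, LocalResource> jar, Map<String, String> env) {
        this.nmClientAsync = nm;
        this.serverJar = jar;
        this.env = env;
    }

    // Called once per container allocated by the ResourceManager (lines 917-951).
    void launch(Container container) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setLocalResources(serverJar);
        ctx.setEnvironment(env);
        // Run the Server Daemon as a runnable jar, as on line 932 of ApplicationMaster.java.
        ctx.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java -Xmx256m -jar MOYA-SERVER-0.0.1-SNAPSHOT-jar-with-dependencies.jar"
                + " 1><LOG_DIR>/moya.stdout 2><LOG_DIR>/moya.stderr"));
        nmClientAsync.startContainerAsync(container, ctx);
    }
}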

MemcacheD-On-YARN

Compile MOYA Source

If you don’t want to do this part, you can just download the jars. They can be found in their respective target folders in the GitHub repo.

Get the source code for the client and server with git:

git clone https://github.com/josephxsxn/moya

Build the MOYA-CLIENT. You will need the jar in target/MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar

cd moya/MOYA-CLIENT
mvn package

Build the MOYA-SERVER. You will need the server jar with dependencies at target/MOYA-SERVER-0.0.1-SNAPSHOT-jar-with-dependencies.jar

cd ../MOYA-SERVER
mvn package

Let’s Start Our Cluster

Start MOYA Up
Here we are starting up on my 13-node cloud cluster. I only wanted 10 MemcacheD Server Daemons, and you can also see I specified my single ZooKeeper server. If you have multiple ZooKeeper servers, you can list them separated by commas – for example 192.168.17.52:2181,192.168.17.55:2181

sudo -u hdfs yarn jar MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.moya.core.yarn.Client -jar MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar -lib MOYA-SERVER-0.0.1-SNAPSHOT-jar-with-dependencies.jar  -num_containers 10 -container_memory 512 -ZK 192.168.17.52:2181

Check in ZooKeeper to see if they have joined the cluster by starting the zkCli:

sudo -u hdfs /usr/lib/zookeeper/bin/zkCli.sh

List the moya nodes:

[zk: localhost:2181(CONNECTED) 14] ls /moya

And you’ll see a response similar to:

[hdp2-joen-291f306b-12d3-40b7-9d3c-93f523ff7881:8555, hdp2-joen-ab25682c-91e3-4744-bf83-9031cc865c19:8555, hdp2-joen-433bc455-0b6e-47a8-a594-bc27eeef85a2:8555, hdp2-joen-4bd483b0-9d8b-4abd-a544-59be0b9fe49c:8555, hdp2-joen-cb47dd7e-711d-4f99-a022-13666d947a13:8555, hdp2-joen-a4a2eaf4-bc7b-4cc1-a942-a0c9ae001603:8555, hdp2-joen-bcd19b49-8713-4e62-95ab-ca5a2f3a594b:8555, hdp2-joen-797acd7a-79e0-4c4e-b193-6f5f84953c81:8555, hdp2-joen-34ae576c-9368-47a9-b115-b3dd3f1ee938:8555, hdp2-joen-d2662f02-2c25-4e94-b278-0b2f29e4e42e:8555]

Now load it up with some key-value (KV) pairs and query the pairs while causing some misses. The MOYABeatDown class has a MemcacheD client which connects to ZooKeeper to get all the active Server Daemons and loads them up with N KV pairs. In the command below, we are going to make 100,000 KV pairs (0-99,999) and put them into the MemcacheD servers; then, when getting the keys, we start at 1,000 and continue to 100,999, causing misses (in this case 1% of the attempts will be misses: 1,000/100,000). A sketch of this loop follows the commands below.

org.moya.core.memcached.MOYABeatDown [ZKServerList] [#ofKeysToMake] [OffsetWhenGettingKeys]

java -cp MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.moya.core.memcached.MOYABeatDown 192.168.17.52:2181 100000 1000
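Under the hood, the BeatDown loop is essentially the one sketched below. The sketch uses the spymemcached client and a hard-coded server list for brevity; the real MOYABeatDown resolves the Server Daemon addresses from the /moya ZNode as described earlier.

import java.util.ArrayList;
import java.util.List;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class BeatDownSketch {
    public static void main(String[] args) throws Exception {
        // In MOYA these addresses come from the /moya ZNode; hard-coded here for brevity.
        List<String> servers = new ArrayList<String>();
        servers.add("hdp2-node1:8555");
        servers.add("hdp2-node2:8555");
        MemcachedClient client =
                new MemcachedClient(AddrUtil.getAddresses(String.join(" ", servers)));

        int numKeys = 100000, offset = 1000;

        // Load phase: put keys 0 .. numKeys-1.
        for (int i = 0; i < numKeys; i++) {
            client.set("key" + i, 3600, "value" + i);
        }

        // Read phase: get keys offset .. offset+numKeys-1, so the last 'offset' keys miss.
        int misses = 0;
        for (int i = offset; i < numKeys + offset; i++) {
            if (client.get("key" + i) == null) {
                misses++;
            }
        }
        System.out.println("Misses: " + misses + " / " + numKeys); // expect ~1,000 (1%)
        client.shutdown();
    }
}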

Finally, MOYABeatDown will dump out the stats of each Server Daemon. Below are the stats of one of the 10 daemons.

[Screenshot: stats output from one of the ten MemcacheD Server Daemons]

Summary and Next Steps

I hope this was a useful intro to YARN. In the future I hope to add the following:

  1. Getting containers that die to automatically restart
  2. Understanding how to get the Application Master to restart if it dies
  3. Management of the clients. Currently I have to kill clients through the YARN CLI:
    yarn application -kill [app#]
  4. Adding unit tests and sample/test applications
  5. Notifying the MemcacheD clients if a Server Daemon dies
  6. Having MOYA clean things up if the AM dies or exits

As you can see, the amount of changes it took to port MemcacheD onto YARN is minimal. Understanding how you are going to handle distributed synchronization and configuration is the hardest part; luckily, ZooKeeper is here to save us. You can learn a lot about the YARN framework by studying the Distributed Shell application from the YARN examples. But as many have said, the best way to learn is to do, so take Distributed Shell and port your own applications to YARN today.

Download Hortonworks Sandbox 2.0 here and get started with YARN


Comments

Mohit Kumar, November 27, 2013 at 7:45 am

Hi Joseph,

Must thank you for a nice article. I had one problem though: when I kill the application, the Application Master shuts down but jmemcached does not. I then tried killing the containers explicitly by calling “nmClientAsync.stopContainerAsync”. Now it says the container is shut down but the jar is still running.

Hope you could help me with that.

Thanks and Regards
Mohit

    Joseph Niemiec, December 2, 2013 at 3:34 pm

    Hi Mohit,

    This sounds like a very odd issue. Can you give me details about how you’re running this (what OS is YARN running on) and how you kill it (the YARN CLI?)

    Thanks
    Joe

      Mohit Kumar, December 3, 2013 at 5:18 am

      Hi Joseph,
      Our environment is Ubuntu and Linux Mint; this particular issue has been happening on my laptop, which runs Linux Mint. And I kill it with yarn application -kill [#app id].

      Thanks and Regards
      Mohit
