How to deploy MemcacheD on YARN in Hadoop 2
When I first started trying to understand YARN, I wanted to build an application to get at its core. There was already a great example YARN application called Distributed Shell that I could use as the shell (pun intended) for my experiment. Now I just needed an existing application that could provide massive reuse value to other applications. I looked around and decided on MemcacheD.
This brief guide shows how to get MemcacheD up and running on YARN – MOYA if you will…
You’re going to need a few things to get the sample application operational.
- Hortonworks Sandbox 2.0 or an HDP 2.0 cluster
- Zookeeper running in the HDP environment
- Maven 3
- Java 6 or later
The source code for my work can be found on Github at: https://github.com/josephxsxn/moya
What is MemcacheD?
MemcacheD is a distributed key-value caching system; simply put, it is a giant distributed in-memory hash table. Using MemcacheD in application development can reduce load on database servers or other systems from which you fetch a value repeatedly. Because MemcacheD is application-language neutral, it can be integrated with many different applications.
MemcacheD works by attempting to fetch a key from the in-memory hash table. If it exists, we don’t have to hit our database for the key’s value. If it doesn’t exist, we should get the value from the database and then cache it into MemcacheD for later access.
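The fetch-or-populate flow just described (often called the cache-aside pattern) can be sketched in plain Java. Here a `ConcurrentHashMap` stands in for a real MemcacheD client, and `fetchFromDatabase()` is a hypothetical stand-in for a database lookup — both are illustrative assumptions, not MOYA code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the cache-aside pattern: try the cache first, fall back to
// the database on a miss, then cache the value for later access.
public class CacheAsideSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    public int databaseHits = 0; // tracks how often we fell through to the DB

    public String get(String key) {
        String value = cache.get(key);      // 1. try the cache first
        if (value == null) {                // 2. cache miss...
            value = fetchFromDatabase(key); // 3. ...fetch from the database
            cache.put(key, value);          // 4. cache it for next time
        }
        return value;
    }

    // Hypothetical stand-in for a real database query
    private String fetchFromDatabase(String key) {
        databaseHits++;
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        CacheAsideSketch c = new CacheAsideSketch();
        c.get("user:42"); // miss: hits the "database"
        c.get("user:42"); // hit: served straight from the cache
        System.out.println("database hits: " + c.databaseHits);
    }
}
```

Repeated gets for the same key only cost one database round trip, which is where the reduced database load comes from.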
Working with the Distributed Shell Example Code
The Distributed Shell (DS) code included with YARN is by far the easiest way to get started writing your own YARN application; think of it as the Word Count for YARN. If you haven't already looked at the DS source code, check it out.
There are 3 main classes that are the core of getting this running on YARN:
- Client – This negotiates the first container for our Application Master; it also places all other resources for our application directly onto HDFS
- ApplicationMaster – This is going to be the master of our MemcacheD Cluster; it will be responsible for starting up the MemcacheD Server Daemons. We will also create a 'Persistent' MOYA ZNode for our server daemons.
- DSConstants – This file has the key names for the environment variables we will set up. We will end up adding a number of new parameters here. In the MOYA source this class was renamed.
JMemcache Server Daemon
Anyone that knows me knows I program in Java and that’s about the end of it. I’ll dabble in Ruby or Python but at the end of the day if a bash script won’t cut it I go directly to Java. Because of this I decided to use the JMemcache Server Daemon found here. Because it’s a Java implementation it makes it easy to embed in code but will come with some minor performance hits, which is ok by me in this situation.
One class is needed for our JMemcached Server Daemon:
- StartMemcached – This is the class our Application Master will request a container for and then launch our daemon as a runnable jar. Here our daemon gets configured and joins the 'Persistent' MOYA ZNode created by the Application Master, adding itself as an 'Ephemeral' ZNode consisting of the daemon's connection information. Because it is an Ephemeral node, should this Server Daemon ever die, its ZNode is automatically removed from ZooKeeper.
Gluing it together with Zookeeper
One of the primary challenges I ran into when thinking up my design was that a MemcacheD Server Daemon could start on a different node in my Hadoop cluster every time. With no centralized repository for the configuration information, my MemcacheD Clients wouldn't know where a Server Daemon is running, making the caching layer inaccessible. Using ZooKeeper, we can tackle these distributed configuration issues with ease.
ZooKeeper provides a centralized service for distributed synchronization, maintaining configuration information and more. By having the Server Daemons publish their host and port information under the ZooKeeper node created by the Application Master, we end up with a centralized list of all currently alive Server Daemons. Now we can have our client connect to ZooKeeper and fetch the server daemons it needs to configure its client classes for load balancing between the distributed JMemcache Server Daemons.
To manage and use the ZNodes in ZooKeeper we will need 3 classes:
- CreateGroup – This class is used by our Application Master; it connects to our ZooKeeper servers and creates the initial '/moya' ZNode
- JoinGroup – The StartMemcached class uses the JoinGroup class to join the initial /moya ZNode and add its information as a ZNode
- ListGroup – The MemcacheD Client makes use of this code to get a list of all the hostnames and ports of the Server Daemons which attached themselves to the /moya ZNode.
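The three classes boil down to a handful of calls against the stock Apache ZooKeeper Java client. Below is a minimal, uncompiled sketch of the idea; the connect string, session timeout, and host:port values are illustrative assumptions, not values from the MOYA source:

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MoyaGroupSketch {
    public static void main(String[] args) throws Exception {
        // connect string and session timeout are illustrative assumptions
        ZooKeeper zk = new ZooKeeper("192.168.17.52:2181", 5000, null);

        // CreateGroup: the AM creates the 'Persistent' parent ZNode once
        try {
            zk.create("/moya", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException alreadyThere) {
            // fine: a previous run already created it
        }

        // JoinGroup: each Server Daemon adds an 'Ephemeral' child holding its
        // connection info; ZooKeeper removes it if the daemon's session dies
        zk.create("/moya/" + "myhost:8555", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // ListGroup: the MemcacheD client lists the currently live daemons
        List<String> daemons = zk.getChildren("/moya", false);
        System.out.println(daemons);
    }
}
```

The Persistent/Ephemeral split is the whole trick: the parent survives restarts, while the children track daemon liveness for free.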
Refactor, Refactor, Refactor
Only a minimal amount of code has to change to use the DS code as a template for a new application. Most of it involves changing the Client and Application Master classes to perform their new task – i.e. launching a jar. Below is the basic list of changes made to the Distributed Shell source code to run MemcacheD On YARN.
This assumes you're following along in the MOYA source code.
- Lines 282 – 310 have the extra command-line parameters we need for the ZooKeeper Hosts and the Runnable Jar. You will also notice, if you compare this with the DS, that we have gutted some unused parameters.
- Lines 409 – 471 set up the resources for the Application Master Jar and the MemcacheD Server Daemon Jar; this places the jars onto HDFS.
- Lines 504 – 515 set up the environment with the location of the AM and Server Daemon Jars. We also add the ZooKeeper Hosts string to the environment so the AM can retrieve it later.
- Lines 548 – 570 are where we define the initial commands that will be run when the AM container is created. On lines 550, 552 and 554 specifically, we are setting up JAVA_HOME, the max JVM heap size for the AM and finally the class in the jar we wish to launch.
- Lines 407 – 435 are where the AM code fetches the environment variables we set up for the location of the jars and the ZooKeeper hosts.
- Lines 836 – 879 set up the resources (Server Daemon Jar) we wish the container to have access to, just like in Client.java.
- Lines 884 – 912 place the ZooKeeperHosts variable into the container environment that will be used by our Server Daemons, and configure all the other ClassPaths like before.
- Lines 917 – 951 finish setting up the runnable jar command we will execute when creating the Server Daemon containers. On line 932 you can see that this time we use the java -jar command.
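That last launch-command step is really just string assembly: the AM hands the container a list of strings to execute. Here is a small, self-contained sketch of building such a java -jar command; the heap size, jar name, and argument order are illustrative assumptions, not copied from the MOYA source:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of assembling the launch command a YARN AM would pass to a
// container launch context for the Server Daemon. Flags and jar name
// are illustrative assumptions.
public class LaunchCommandSketch {
    public static List<String> buildCommand(String jarName, int heapMb, String zkHosts) {
        List<String> cmd = new ArrayList<>();
        cmd.add("${JAVA_HOME}/bin/java"); // resolved inside the container's env
        cmd.add("-Xmx" + heapMb + "m");   // max JVM heap for the daemon
        cmd.add("-jar");
        cmd.add(jarName);                 // the runnable Server Daemon jar
        cmd.add(zkHosts);                 // so the daemon can join /moya
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
            buildCommand("MOYA-SERVER-0.0.1-SNAPSHOT-jar-with-dependencies.jar",
                         256, "192.168.17.52:2181")));
    }
}
```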
Compile MOYA Source
If you don't want to do this part, you can just download the jars; they can be found in their respective target folders in the GitHub repo.
Get the source code for the client and server with git:
git clone https://github.com/josephxsxn/moya
Build the MOYA-CLIENT. You will need the jar in the target folder.
Build the MOYA-SERVER. We need the server JAR with dependencies
Let's Start Our Cluster
Start MOYA Up
Here we are starting up on my 13-node cloud instance. I only wanted 10 MemcacheD Server Daemons, and you can also see I specified my single ZooKeeper server. If you have multiple ZooKeeper servers, you can list them separated by commas – for example 192.168.17.52:2181,192.168.17.55:2181
sudo -u hdfs yarn jar MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.moya.core.yarn.Client -jar MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar -lib MOYA-SERVER-0.0.1-SNAPSHOT-jar-with-dependencies.jar -num_containers 10 -container_memory 512 -ZK 192.168.17.52:2181
Check in ZooKeeper to see if they have joined the cluster by starting the zkCli:
sudo -u hdfs /usr/lib/zookeeper/bin/zkCli.sh
List the moya nodes:
[zk: localhost:2181(CONNECTED) 14] ls /moya
And you’ll see a similar response to:
[hdp2-joen-291f306b-12d3-40b7-9d3c-93f523ff7881:8555, hdp2-joen-ab25682c-91e3-4744-bf83-9031cc865c19:8555, hdp2-joen-433bc455-0b6e-47a8-a594-bc27eeef85a2:8555, hdp2-joen-4bd483b0-9d8b-4abd-a544-59be0b9fe49c:8555, hdp2-joen-cb47dd7e-711d-4f99-a022-13666d947a13:8555, hdp2-joen-a4a2eaf4-bc7b-4cc1-a942-a0c9ae001603:8555, hdp2-joen-bcd19b49-8713-4e62-95ab-ca5a2f3a594b:8555, hdp2-joen-797acd7a-79e0-4c4e-b193-6f5f84953c81:8555, hdp2-joen-34ae576c-9368-47a9-b115-b3dd3f1ee938:8555, hdp2-joen-d2662f02-2c25-4e94-b278-0b2f29e4e42e:8555]
Now load it up with some Key-Value (KV) pairs and query the pairs while causing some misses. You can use the MOYABeatDown class, which has a MemcacheD client that connects to ZooKeeper to get all the active Server Daemons and loads them up with N KV pairs. In the command below, we are going to make 100,000 KV pairs (0–99,999) and put them into the MemcacheD Servers; then, when getting the keys, we start at 1,000 and continue to 100,999, causing misses (in this case 1% of the attempts will be misses: 1,000/100,000).
org.moya.core.memcached.MOYABeatDown [ZKServerList] [#ofKeysToMake] [OffsetWhenGettingKeys]
java -cp MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.moya.core.memcached.MOYABeatDown 192.168.17.52:2181 100000 1000
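To make the 1% figure concrete, here is a small, self-contained sketch of that load-then-get pattern, with a plain HashMap standing in for the MemcacheD cluster (the key/value naming is an assumption):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the MOYABeatDown hit/miss pattern: put keys 0..N-1, then
// get keys offset..offset+N-1, counting the misses. A HashMap stands
// in for the distributed MemcacheD cluster.
public class BeatDownSketch {
    public static int countMisses(int numKeys, int offset) {
        Map<Integer, String> cache = new HashMap<>();
        for (int i = 0; i < numKeys; i++) {
            cache.put(i, "value-" + i);          // load phase: keys 0 .. N-1
        }
        int misses = 0;
        for (int i = offset; i < offset + numKeys; i++) {
            if (!cache.containsKey(i)) misses++; // keys >= N were never loaded
        }
        return misses;
    }

    public static void main(String[] args) {
        int misses = countMisses(100_000, 1_000);
        // only keys 100,000..100,999 miss: 1,000 of 100,000 gets = 1%
        System.out.println(misses + " misses out of 100000 gets");
    }
}
```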
MOYABeatDown will dump out the stats of each Server Daemon. Below are the stats for one of the 10 Daemons.
Summary and Next steps
I hope this was a useful intro into YARN. In the future I hope to add the following:
- Getting containers that die to automatically restart
- Understanding how to get the Application Master to restart if it dies
- Management of the clients. Currently I have to kill clients through the YARN CLI:
yarn application -kill [app#]
- Adding in unit tests and sample/test applications
- Notifying client MemcacheD systems if a Server Daemon dies.
- Having MOYA clean things up if the AM dies or exits.
As you can see, the amount of change it took to port MemcacheD onto YARN is minimal. Understanding how you are going to handle distributed synchronization and configuration is the hardest part; luckily, ZooKeeper is here to save us. You can learn a lot about the YARN framework by studying the Distributed Shell application from the YARN examples. But as many have said, the best way to learn is to do, so take Distributed Shell and port your own applications to YARN today.