When I first started to understand what YARN is, I wanted to build an application to get at its core. There was already a great example YARN application called Distributed Shell that I could use as the shell (pun intended) for my experiment. Now I just needed an existing application that would offer broad reuse value to other applications. I looked around and decided on MemcacheD.
This brief guide shows how to get MemcacheD up and running on YARN – MOYA if you will…
You’re going to need a few things to get the sample application operational.
The source code for my work can be found on Github at: https://github.com/josephxsxn/moya
MemcacheD is a distributed key-value caching system; simply put, it is a giant distributed in-memory hash table. Using MemcacheD in application development can reduce load on database servers or other systems from which you fetch the same value repeatedly. Because MemcacheD is language neutral, it can be integrated with many different applications.
MemcacheD works by attempting to fetch a key from the in-memory hash table. If it exists, we don’t have to hit our database for the key’s value. If it doesn’t exist, we should get the value from the database and then cache it into MemcacheD for later access.
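The lookup flow described above is the classic cache-aside pattern. Here is a minimal sketch in Java, using a ConcurrentHashMap as a stand-in for the MemcacheD cluster and a hypothetical lookupFromDatabase method in place of a real database query:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the cache-aside pattern. The ConcurrentHashMap stands in for
// the MemcacheD cluster; lookupFromDatabase is a hypothetical stand-in
// for a real database query.
public class CacheAsideSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    int databaseHits = 0; // counts how often we fall through to the database

    String get(String key) {
        String value = cache.get(key);       // 1. try the cache first
        if (value == null) {                 // 2. cache miss:
            value = lookupFromDatabase(key); //    fetch from the database
            cache.put(key, value);           //    and cache it for next time
        }
        return value;
    }

    String lookupFromDatabase(String key) {
        databaseHits++;
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        CacheAsideSketch c = new CacheAsideSketch();
        c.get("user:42");                   // miss: goes to the database
        c.get("user:42");                   // hit: served from the cache
        System.out.println(c.databaseHits); // prints 1
    }
}
```

With a real MemcacheD client the shape is the same; only the map operations are replaced by network calls to the daemons.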
The Distributed Shell (DS) code included with YARN is by far the easiest way to get started writing your own YARN application; think of it as the Word Count for YARN. If you haven’t already looked at the DS source code, check it out.
There are three main classes at the core of getting this running on YARN:
Client– This negotiates the first container for our Application Master; it also places all of our application’s other resources directly onto HDFS
ApplicationMaster– This is the master of our MemcacheD cluster; it is responsible for starting up the MemcacheD Server Daemons. It also creates a ‘Persistent’ MOYA ZNode for our server daemons.
DSConstants– This file has the key names for the environment variables we will set up. We will end up adding a number of new parameters here. In the MOYA source this class was renamed to
Anyone who knows me knows I program in Java, and that’s about the end of it. I’ll dabble in Ruby or Python, but at the end of the day, if a bash script won’t cut it I go directly to Java. Because of this I decided to use the JMemcache Server Daemon found here. Being a Java implementation makes it easy to embed in code, at the cost of some minor performance overhead, which is ok by me in this situation.
One class is needed for our JMemcached Server Daemon
One of the primary challenges I ran into when thinking up my design was that a MemcacheD Server Daemon may start on a different node in my Hadoop cluster each time. With no centralized repository for the configuration information, my MemcacheD clients wouldn’t know where the Server Daemons are running, making the caching layer inaccessible. Using ZooKeeper, we can tackle these distributed-configuration issues with ease.
ZooKeeper provides a centralized service for distributed synchronization, maintaining configuration information, and more. By having each Server Daemon write its host and port information under the ZooKeeper node created by the Application Master, we end up with a centralized list of all currently live Server Daemons. Our client can then connect to ZooKeeper and fetch the list of server daemons it needs to configure its client classes, load balancing between the distributed JMemcache Server Daemons.
To manage and use the ZNodes in ZooKeeper we will need three classes:
The StartMemcached class uses the JoinGroup class, which allows the daemon to join the initial /moya ZNode and add its information as a ZNode under it
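The registration flow these classes implement can be sketched in miniature. The class below is a hypothetical in-memory stand-in, not the real ZooKeeper API: in MOYA the same steps go through the ZooKeeper client (the Application Master creates the persistent /moya ZNode, each daemon registers a child named host:port, and clients list the children to discover live daemons):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// In-memory stand-in for the ZooKeeper group-membership pattern used in MOYA.
// The map key is the parent ZNode path; the list holds its child ZNodes.
public class ZnodeRegistry {
    private final Map<String, List<String>> tree = new ConcurrentHashMap<>();

    // Application Master: create the persistent parent node.
    void createPersistent(String path) {
        tree.putIfAbsent(path, new CopyOnWriteArrayList<>());
    }

    // Server daemon (JoinGroup): register its host:port under the parent.
    void joinGroup(String parent, String hostPort) {
        tree.get(parent).add(hostPort);
    }

    // Client: discover all live daemons by listing the children.
    List<String> children(String parent) {
        return tree.get(parent);
    }

    public static void main(String[] args) {
        ZnodeRegistry zk = new ZnodeRegistry();
        zk.createPersistent("/moya");
        zk.joinGroup("/moya", "node-a:8555");
        zk.joinGroup("/moya", "node-b:8555");
        System.out.println(zk.children("/moya")); // [node-a:8555, node-b:8555]
    }
}
```

The real ZooKeeper calls also let the children be ephemeral, so a daemon that dies drops out of the list automatically.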
Only a minimal amount of code change is needed to use the DS code as a template for an application. Most of it involves changing the Client and Application Master classes to perform their new task – i.e. launching a jar. Below is the basic list of changes made to the Distributed Shell source code to get MemcacheD running on YARN.
This assumes you’re following along in the MOYA source code
If you don’t want to do this part you can just download the jars. They can be found in their respective target folders in the github repo.
Get the source code for the client and server with git:
git clone https://github.com/josephxsxn/moya
Build the MOYA-CLIENT. You will need the jar in
Build the MOYA-SERVER. We need the server JAR with dependencies
Start MOYA Up
Here we are starting up on my 13-node cluster. I only wanted 10 MemcacheD Server Daemons, and you can see I specified my single ZooKeeper server. If you have multiple ZooKeeper servers, list them comma-separated – for example 192.168.17.52:2181,192.168.17.55:2181
sudo -u hdfs yarn jar MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.moya.core.yarn.Client -jar MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar -lib MOYA-SERVER-0.0.1-SNAPSHOT-jar-with-dependencies.jar -num_containers 10 -container_memory 512 -ZK 192.168.17.52:2181
Check in ZooKeeper to see if they have joined the cluster by starting the zkCli:
sudo -u hdfs /usr/lib/zookeeper/bin/zkCli.sh
List the moya nodes:
[zk: localhost:2181(CONNECTED) 14] ls /moya
And you’ll see a response similar to:
[hdp2-joen-291f306b-12d3-40b7-9d3c-93f523ff7881:8555, hdp2-joen-ab25682c-91e3-4744-bf83-9031cc865c19:8555, hdp2-joen-433bc455-0b6e-47a8-a594-bc27eeef85a2:8555, hdp2-joen-4bd483b0-9d8b-4abd-a544-59be0b9fe49c:8555, hdp2-joen-cb47dd7e-711d-4f99-a022-13666d947a13:8555, hdp2-joen-a4a2eaf4-bc7b-4cc1-a942-a0c9ae001603:8555, hdp2-joen-bcd19b49-8713-4e62-95ab-ca5a2f3a594b:8555, hdp2-joen-797acd7a-79e0-4c4e-b193-6f5f84953c81:8555, hdp2-joen-34ae576c-9368-47a9-b115-b3dd3f1ee938:8555, hdp2-joen-d2662f02-2c25-4e94-b278-0b2f29e4e42e:8555]
Now load it up with some key-value (KV) pairs and query them while causing some misses. You can use the MOYABeatDown class, whose MemcacheD client connects to ZooKeeper to get all the active server daemons and loads them up with N KV pairs. In the command below, we make 100,000 KV pairs (keys 0-99,999) and put them into the MemcacheD servers; then, when getting the keys back, we start at 1,000 and continue to 100,999, causing misses (in this case 1% of the attempts will be misses: 1,000/100,000).
org.moya.core.memcached.MOYABeatDown [ZKServerList] [#ofKeysToMake] [OffsetWhenGettingKeys]
java -cp MOYA-CLIENT-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.moya.core.memcached.MOYABeatDown 192.168.17.52:2181 100000 1000
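As a sanity check on the 1% figure above: the gets cover keys 1,000 through 100,999, and any key at or above 100,000 was never stored, so exactly 1,000 of the 100,000 gets must miss. A quick illustration:

```java
// Sanity check of the miss rate: keys [0, loaded) are stored, gets cover
// [offset, loaded + offset), so the `offset` keys at the top can never hit.
public class MissRate {
    static double missPercent(int loaded, int offset) {
        return 100.0 * offset / loaded; // guaranteed misses / total gets
    }

    public static void main(String[] args) {
        System.out.println(missPercent(100_000, 1_000) + "% of gets miss"); // 1.0%
    }
}
```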
MOYABeatDown will dump out the stats of each Server Daemon. Below are the stats for one of the ten daemons.
I hope this was a useful intro to YARN. In the future I hope to add the following:
yarn application -kill [app#]
As you can see, the amount of change it took to port MemcacheD onto YARN is minimal. Understanding how you are going to handle distributed synchronization and configuration is the hardest part; luckily, ZooKeeper is there to save us. You can learn a lot about the YARN framework by studying the Distributed Shell application from the YARN examples. But as many have said, the best way to learn is to do, so take Distributed Shell and port your own applications to YARN today.