Go Hadoop! Err, Hadoop and Go.
Personally, I’ve followed the Go Programming Language (golang) with increasing interest for a while and have been itching to really sink my teeth into it. I’ve always felt you never learn a programming language for real unless you use it to build a fairly large, real-world solution. It’s the only way to tackle real issues and gain some confidence for future battles with destiny… FTR, my first real project in Java was Hadoop, circa 2006. *smile*
So, I figured, what the hell, let’s go for it with Apache Hadoop and YARN! For those of you not familiar with YARN, it is the basis for application architecture in Hadoop 2, separating resource management from data processing to provide a more generalized processing platform and therefore enabling multiple applications and workloads in Hadoop. More details on that here.
This was not only a way for me to learn something new, but also a useful exercise to prove to ourselves that both Hadoop and YARN are ready to support non-Java applications in a native manner. As you may know, both HDFS & YARN switched to a Protocol Buffers-based RPC system a short while ago with the intent of better supporting compatibility across versions and cross-language clients. A shout-out to our friends at Spotify for coming up with snakebite, a native Python client for HDFS! Obviously, I’ve been very keen on supporting native, non-Java applications for YARN too; you can see where this is going…
With that context, the last bit of the puzzle was a free weekend a couple of weeks ago, with the added bonus of a couple of cross-country flights – I had a great time at the Chicago HUG talking YARN this month, particularly on the 66th floor of the Willis Tower… easily the best location ever for a Hadoop User Group! (Thanks to everyone, particularly to Trustwave for sponsoring and Marc Slusar & Mike Segel, the organizers.) People who know me won’t be surprised to hear I look forward to long flights without distractions – they’re great for cutting code! So… game, set, commit.
Fast forward, and here we are. gohadoop (obviously) is now on github and includes a very early version of Hadoop IPC client to talk the Hadoop RPC protocol and YARN client libraries so that one can write a full-fledged, native, go YARN application. To my knowledge, it’s the first-ever native non-Java application in YARN – here is hoping for many, many more!
A quick tour:
- hadoop_common/ipc/client is the go IPC client to talk the Hadoop RPC protocol.
- hadoop_yarn contains the generated protocol-buffer bindings for the three main YARN protocols; see here for more details:
- applicationclient_service.pb.go – Protocol for clients to submit applications to the ResourceManager
- applicationmaster_service.pb.go – Protocol for ApplicationMaster to negotiate resources from the ResourceManager
- containermanagement_service.pb.go – Protocol for ApplicationMaster to start/stop containers with NodeManagers.
- One shouldn’t need to use the raw protocols above; rather, use the simpler yarn_client go module to interact with YARN. These are modelled after their Java counterparts in the org.apache.hadoop.yarn.client.api package.
- hadoop_yarn/examples/dist_shell is a simple distributed-shell which can run ‘n’ copies of any unix command – see the Java equivalent in the Apache Hadoop source-tree or an even simpler version here on github.
- Currently, gohadoop does not have an HDFS module, so you’ll have to use libhdfs or webhdfs to get data in/out of HDFS.
That’s about it. Once you have a YARN cluster up and running, try running the dist_shell go application:
$ HADOOP_CONF_DIR=conf go run hadoop_yarn/examples/dist_shell/client.go
See http://golang.org/ for more about go itself, installation etc.
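For a sense of the submission flow the dist_shell client drives, here’s a deliberately simplified sketch. The type and method names below are hypothetical stand-ins for illustration, not the actual yarn_client API (which, as noted above, mirrors the Java client libraries):

```go
package main

import "fmt"

// Hypothetical stand-ins for illustration only; the real types live in the
// yarn_client module and are modelled on org.apache.hadoop.yarn.client.api.
type ApplicationId struct{ Id int32 }

type YarnClient struct{}

// In the real client this is a GetNewApplication RPC to the ResourceManager,
// which hands back a cluster-unique application id.
func (c *YarnClient) CreateApplication() ApplicationId {
	return ApplicationId{Id: 1}
}

// In the real client this builds an ApplicationSubmissionContext (the
// ApplicationMaster's launch command and resources) and submits it to the
// ResourceManager; the AM then negotiates containers and asks NodeManagers
// to launch the shell command 'n' times.
func (c *YarnClient) SubmitApplication(id ApplicationId, amCommand string) string {
	return fmt.Sprintf("submitted application_%d running %q", id.Id, amCommand)
}

func main() {
	client := &YarnClient{}
	id := client.CreateApplication()
	fmt.Println(client.SubmitApplication(id, "/bin/date"))
}
```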
If all goes well, you should see something like this on your YARN console:
I’ll talk more about this at the Hadoop YARN meetup on 9/27 at LinkedIn – feel free to hit me up with questions. Obviously it’s very early, but I hope it will be fun and useful. I’d love to get patches back too, so keep those pull requests coming.