Anyone Have a Good Recipe for Remote EC2 Clusters?
I am trying to configure an HDP cluster in EC2 and use local client tools with it. The use case is working interactively on projects as time allows. Since the cluster is dormant much of the time, it is set up and torn down as needed. I want to keep the project files and tools on my laptop, where I have a nice configuration for the client-side tools (the data is on S3).
There seem to be several ways to approach this:
1. SSH tunneling of the ports from ‘localhost’ to the correct node in EC2
2. Configure the cluster using external EC2 hostnames and use security groups to limit access to my machine alone
3. Create an AWS VPC
I tried (1), but it’s difficult to get all of the ports and machines correct. The HADOOP_CONF_DIR from a machine with the client tools installed has numerous references to internal EC2 hostnames and ports, and each of them would have to be tunneled and remapped. The result would be a rat’s nest of SSH tunnels, but it might work, barring any user or authentication issues.
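For method 1, the tunnel list could in principle be generated from the config files instead of being written by hand. Below is a minimal Python sketch of that idea, not something I have run against a real cluster. It assumes the internal names look like `ip-10-0-0-12.ec2.internal`, that the `*-site.xml` files use the usual `<property><name>/<value>` layout, and that one reachable gateway host can see every node. The function names `find_endpoints` and `tunnel_command` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches internal EC2 endpoints such as ip-10-0-0-12.ec2.internal:8020.
# Assumption: the internal DNS suffix ends in "internal"; adjust if yours differs.
ENDPOINT_RE = re.compile(r"(ip-\d+-\d+-\d+-\d+\.[\w.-]*internal):(\d+)")


def find_endpoints(site_xml):
    """Return the sorted (host, port) pairs found in one *-site.xml document."""
    root = ET.fromstring(site_xml)
    found = set()
    for value in root.iter("value"):
        for host, port in ENDPOINT_RE.findall(value.text or ""):
            found.add((host, int(port)))
    return sorted(found)


def tunnel_command(endpoints, gateway):
    """Build a single ssh argument list with one -L forward per endpoint."""
    args = ["ssh", "-N"]
    used = set()
    for host, port in endpoints:
        local = port
        while local in used:  # two nodes may expose the same port number
            local += 1
        used.add(local)
        args += ["-L", f"{local}:{host}:{port}"]
    return args + [gateway]
```

Even with the tunnels generated this way, the local copy of the config would still have to be rewritten to point at `localhost` and the chosen local ports, which is the part that makes this approach fragile.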
Method 2 looks like it would be the simplest way, but Ambari refuses to register nodes whose hostnames don’t match. Since the instances use internal EC2 names and I need to reach them via external IP addresses, this configuration won’t install. It’s possible to script an automatic renaming of the hosts, which might solve this, but I don’t know whether it would introduce other problems. Has anyone done this and got it to work?
So, that leaves method 3. Looking at the documentation, this seems as if it would be the best long-term solution and, I believe, may even allow stopping and starting instances, since I can permanently assign IP addresses to the instances. However, I’ve never set up a VPC and at the moment don’t have a spare day or so to learn how to do this.
Has anyone got any clever ways to remotely access an EC2 HDP cluster with local client tools? I’ll be happy if I can just get pig and hive working, but eventually they’ll all need to work remotely. This seems like it would be a common enough use case for EC2, but I can find surprisingly little on the topic.