
HDP on Linux – Installation Forum

Anyone Have a Good Recipe for Remote EC2 Clusters?

  • #48798
    Steve Nunez

    I am trying to configure an HDP cluster in EC2 and use local client tools. The use case behind this is working interactively on projects as time allows. Since the cluster is dormant much of the time, it is set up and torn down as needed. I want to keep the project files and tools on my laptop, where I have a nice configuration for the client-side tools (data on S3).

    There seem to be several ways to approach this:

    1. SSH tunneling of the ports from ‘localhost’ to the correct node in EC2
    2. Configure the cluster using external EC2 hostnames and use security groups to limit access to my machine alone
    3. Create an AWS VPC

    I tried (1); however, it’s difficult to get all of the ports and machines right. The HADOOP_CONF_DIR from a machine with the client tools installed contains numerous references to internal EC2 hostnames and ports, each of which must be forwarded. It would be a rat’s nest of SSH tunnels, but it might work, barring any user or authentication issues.
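
    To give a sense of what approach (1) looks like, here is a sketch that builds a single ssh command forwarding several service ports at once. The hostnames and key path are placeholders, and the port numbers assume HDP defaults (they may differ in your cluster), so treat this as a starting point rather than a verified recipe:

    ```shell
    INTERNAL="ip-10-0-0-1.ec2.internal"   # master node's internal EC2 name (placeholder)
    PORTS="8020 50070 8050 10000"         # NameNode RPC, NameNode web UI, ResourceManager, HiveServer2
    ARGS=""
    for p in $PORTS; do
        ARGS="$ARGS -L $p:$INTERNAL:$p"   # listen on $p locally, forward to the master
    done
    # -N: forward ports only, run no remote command
    echo "ssh -i ~/.ssh/ec2-key.pem -N ec2-user@PUBLIC_DNS_NAME$ARGS"
    ```

    Generating the `-L` arguments from a port list keeps the command maintainable as more services (Oozie, WebHCat, etc.) get added, though it does nothing for services spread across multiple nodes, where each node needs its own tunnel.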

    Method (2) looks like it would be the simplest way; however, Ambari refuses to register nodes whose host names don’t match. Since the instances use internal EC2 names, and I need to access them from external IP addresses, this configuration won’t install. It might be possible to script an automatic renaming of each host, which could solve this problem, but I don’t know whether it would introduce other issues. Has anyone done this and got it to work?
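
    The renaming idea might look something like the following, run on each node before registering it with Ambari. This is an untested assumption, not something I have working: it pulls the instance’s public DNS name from the EC2 metadata service and makes it the machine’s hostname, so the name Ambari registers matches the name an external client resolves.

    ```shell
    # 169.254.169.254 is the EC2 instance metadata service
    METADATA_URL="http://169.254.169.254/latest/meta-data/public-hostname"
    PUBLIC_NAME=$(curl -s "$METADATA_URL")
    hostname "$PUBLIC_NAME"               # takes effect immediately
    echo "$PUBLIC_NAME" > /etc/hostname   # persist across reboots (distribution-dependent)
    # /etc/hosts may also need a "<private-ip> $PUBLIC_NAME" entry so local lookups resolve
    ```

    One caveat: public DNS names change when an instance is stopped and restarted, so this would have to be re-run (and the cluster config updated) after every restart unless Elastic IPs are used.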

    So, that leaves method (3). Looking at the documentation, this seems as if it would be the best long-term solution and, I believe, may even allow stopping and starting instances, since I can permanently assign IP addresses to them. However, I’ve never set up a VPC, and at the moment I don’t have a spare day or so to learn how.
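
    For what it’s worth, the skeleton of the VPC route with the AWS CLI is roughly as follows. All IDs and CIDR blocks below are placeholders, and the internet gateway, route table, and security group steps are omitted for brevity, so this is only an outline of the shape of the work:

    ```shell
    aws ec2 create-vpc --cidr-block 10.0.0.0/16
    aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.0.0/24
    # Inside a VPC, private IPs persist across stop/start, which is what keeps
    # the cluster's internal hostnames stable; an Elastic IP keeps the public
    # address stable as well.
    aws ec2 allocate-address --domain vpc
    aws ec2 associate-address --instance-id i-xxxxxxxx --allocation-id eipalloc-xxxxxxxx
    ```

    The stable private addressing is the part that matters for Hadoop: the cluster can be configured once against names that survive a stop/start cycle.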

    Has anyone got a clever way to remotely access an EC2 HDP cluster with local client tools? I’ll be happy if I can just get Pig and Hive working, but eventually all of the tools will need to work remotely. This seems like a common enough use case for EC2, but I can find surprisingly little on the topic.

    – SteveN

The forum ‘HDP on Linux – Installation’ is closed to new topics and replies.
