Deploying a Hadoop Cluster on Amazon EC2 with HDP2

Wow. Much cloud. Very screenshots.

In this post, we’ll walk through the process of deploying an Apache Hadoop 2 cluster on the EC2 cloud service offered by Amazon Web Services (AWS), using Hortonworks Data Platform.

Both EC2 and HDP offer many knobs and buttons to cater to your specific performance, security, cost, data size, data protection, and other requirements. I will not discuss most of these options in this post, as the goal is to walk through one particular deployment path to get you started.

Let’s go!

Prerequisites

  • An Amazon Web Services account with the ability to launch 7 large EC2 instances.
  • A Mac or Linux machine. You can also use Windows, but you will have to install additional software such as SSH and SCP clients.
  • Lastly, we assume basic familiarity with EC2, to the extent that you have created EC2 instances and SSH’d into them before.

Step 1: Creating a Base AMI with all the OS level configuration common to all nodes

Navigate to your EC2 console from the AWS Dashboard and then click on ‘Launch Instance’:
[Screenshot: EC2 dashboard]

Let’s select the RHEL 64bit and go to the next step:

[Screenshot: Select base image]

Let’s select a large instance with adequate processing power and memory:

[Screenshot: Choose instance size]

Here we adjust storage as required:

[Screenshot: Instance storage]

We are ready for Review and Launch:

[Screenshot: Review and launch]

But, before you Launch the instance, make sure you have downloaded the private key. Keep the private key safe and Launch:

[Screenshot: Private key download]

Everything looks good. Let’s view the instances.


Now that we have an instance up and running, we will need its public DNS name to connect to it:

Let’s SSH in:

[Screenshot: SSH into the instance]
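For reference, the SSH command looks roughly like this. The key file name and DNS name below are hypothetical placeholders; substitute the key you downloaded at launch and your instance's public DNS name:

```shell
KEY=my-hdp-key.pem                          # hypothetical name for the downloaded key
HOST=ec2-54-0-0-0.compute-1.amazonaws.com   # hypothetical public DNS of your instance
chmod 400 "$KEY"                 # ssh refuses private keys that are world-readable
ssh -i "$KEY" ec2-user@"$HOST"   # RHEL AMIs log in as ec2-user, not root
```

Note the user name: on the stock RHEL AMIs you connect as ec2-user (logging in as root will fail).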

Now let’s prep the instance:

[Screenshot: Instance prep]
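The prep shown above boils down to a handful of OS-level tweaks that Ambari's host checks expect. This is a sketch of the typical steps; consult the HDP installation guide for your exact version, since the requirements vary:

```shell
# Run on the instance as root (or via sudo) before imaging it.
prep_node() {
    setenforce 0              # put SELinux in permissive mode for the install
    service iptables stop     # Ambari's host checks expect the firewall to be off
    chkconfig iptables off    # ...and keep it off across reboots
    yum -y install ntp        # clocks must stay in sync across the cluster
    service ntpd start
    chkconfig ntpd on
}
```

Because we bake this into the AMI in the next step, every node launched from it comes pre-configured.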

That was all the prep we need, so we are going to create a private AMI. Go to the EC2 console, select the instance, and from the Actions menu select “Create Image”:

[Screenshot: Create image]

Make sure you check ‘No reboot’ before you click Create Image, as we would like to keep working on this instance:

[Screenshot: No reboot option]

Wait for the creation of the AMI to be complete:


Let’s configure this instance for password-less SSH to all the other nodes in the cluster. The first step is to get the private key onto this instance.


We will need to move the private key to the .ssh folder and rename it to id_rsa:

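Concretely, assuming the key landed in the home directory as my-hdp-key.pem (a hypothetical name; use yours):

```shell
mkdir -p ~/.ssh && chmod 700 ~/.ssh   # create the folder if it does not exist yet
mv ~/my-hdp-key.pem ~/.ssh/id_rsa     # ssh picks up ~/.ssh/id_rsa by default
chmod 600 ~/.ssh/id_rsa               # ssh refuses keys with looser permissions
```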

Let’s provision the other nodes now:


Select the size of the node instances:


I will launch 6 more nodes here; together with the first instance, that gives us 3 nodes dedicated to the management daemons and 4 nodes serving as data nodes. Then click on ‘Review and Launch’:


Click on the “Launch” button:


Ensure you are using the same key as before, so that passwordless SSH works between the Ambari node and the rest of the new nodes. Click on ‘Launch Instance’:

As the instances launch, we will copy the private DNS names of all the instances we have launched so far into a text file:


We will end up with a list like the one below:
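For example, with hypothetical internal DNS names, the file for our 7 nodes might look like:

```
ip-10-0-0-101.ec2.internal
ip-10-0-0-102.ec2.internal
ip-10-0-0-103.ec2.internal
ip-10-0-0-104.ec2.internal
ip-10-0-0-105.ec2.internal
ip-10-0-0-106.ec2.internal
ip-10-0-0-107.ec2.internal
```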

Step 2: Customize the security groups to minimize attack surface area while not blocking essential communication channels

We have to add rules to the security groups that were created by default when we launched the instances.

The first security group should have been created when we launched the first instance. We are running the Ambari server on this instance, so we have to ensure we can get to it and it can communicate with the rest of the instances that we launched later:


Then we also need to open up the ports for IPs internal to the datacenter:

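As a sketch, the inbound rules end up looking something like this (the exact ports you expose externally depend on which web UIs you want to reach from outside; 8080 is the Ambari web UI):

```
Type         Port range   Source                   Purpose
SSH          22           <your IP>/32             admin access
Custom TCP   8080         <your IP>/32             Ambari web UI
All TCP      0 - 65535    <this security group>    intra-cluster traffic
```

If the Ambari login page is unreachable later, the first thing to re-check is the 8080 rule here.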

Step 3: Setting up Ambari

Get the HDP bits by adding the repository:


Next, we will refresh the repo:

Then we will install the Ambari server:

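In command form, the three steps above look roughly like this. The repository URL is illustrative only; grab the exact one for your Ambari/HDP versions from the Hortonworks documentation:

```shell
# Illustrative URL pattern; substitute the one from the HDP docs for your version.
AMBARI_REPO="http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.4.4.23/ambari.repo"
wget -O /etc/yum.repos.d/ambari.repo "$AMBARI_REPO"   # add the repo (run as root)
yum repolist                                          # refresh and confirm the repo shows up
yum -y install ambari-server                          # pull in the Ambari Server bits
```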

Agree to download the bits:

Agree to download the key:


Ambari Server bits are installed:


Now, we will configure the bits:


Just accept the default options at every prompt by pressing Enter:

Let’s start the Ambari Server:

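The configure-and-start sequence is just two commands, wrapped here in a small helper for clarity (run as root on the Ambari host). ambari-server setup also accepts -s to take the default answer for every prompt non-interactively:

```shell
setup_ambari() {
    ambari-server setup -s    # -s (silent) accepts the defaults at every prompt
    ambari-server start       # serves the web UI on port 8080
}
```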

That’s it! We are all set to use Ambari to bring up the cluster.

Step 4: Using Ambari to deploy the cluster

Copy the public DNS name of the Ambari server instance:

Navigate to port 8080 of that public DNS name in your browser. You should see the Ambari login page. The default username and password are both ‘admin’:

This is where we start creating the cluster. Enter any cluster name of your choosing:


We are going to create an HDP 2.0 cluster:

Remember the list of private DNS names that you copied down to a text file? Pull it up and paste it into the Target Hosts input box. We will also upload the private key that we have been using on this page:

We are all set to go. These should all come back as green with no warnings:


At this stage, we need to decide what services we need:


For this demonstration, I will select everything, although in real life you would want to be more judicious and select the bare minimum needed for your requirements:

After we are done selecting the services, it’s time to decide where they will run. Ambari makes reasonable suggestions, but if you have a specific topology in mind you may want to move these around:

The next step is to choose which nodes you want the DataNodes and Clients to run on. I like to have clients on multiple instances, just for convenience:

In the next step, we will configure credentials for some of the services. The ones where you need to supply credentials are marked by a number on a red background:

Once we are done with all the inputs, we are ready to review and then start the deployment:


At this point it will take a while (about 30 minutes) to complete the deployment and test the services:

Voila!! We now have a fully functional and tested cluster on EC2. Happy Hadooping!!!

@saptak


Comments

Carolus Holman | July 16, 2014 at 9:14 am

You should add that for AWS, the default user when adding the Hosts should be set to ec2-user, root will fail.

Jason Rubin | June 26, 2014 at 10:37 am

I got to the beginning of Step 4. But when I try to enter “http://(my public dns):8080”, the browser is unable to connect to it. Any suggestions for how I may fix this?

Thanks!

Iván | June 18, 2014 at 2:38 am

Hi,

I’m trying to follow this setup in Amazon EC2 + VPC but I get a warning message in the confirm host step:

All bootstrapped hosts registered but unable to retrieve cpu and memory related information hortonworks

The problem is that next button is disabled and cannot continue with the installation

Gerry Miller | May 17, 2014 at 3:46 am

Hi Saptak,
I tried to find the url for HUE but couldn’t, is HUE included in this install, or would it have to be installed separately on top?
Thanks,
Gerry

srikrishna | April 1, 2014 at 3:53 pm

when i do yum install ambari-server , I get the following error.

Transaction Check Error:
file /usr/lib64/python2.6/distutils/README from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/site-packages/README from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/bsddb/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/compiler/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/ctypes/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/ctypes/macholib/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/curses/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/distutils/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64

srikrishna | March 30, 2014 at 3:35 pm

Is medium instances sufficient ? How many minimum EC2 instances are required for master /slave for a quick set up ?

srikrishna | March 30, 2014 at 3:34 pm

Is medium instances sufficient .? How many minimum EC2 instances we need for a cluster ?

Divya | March 1, 2014 at 6:36 pm

An excellent introduction to using Ambari on Amazon.
– Is there an automated script that you can use instead of the manual steps?
– Is the base machine image created as a publicly accessible AMI?

Thank you for the well written tutorial.

sushma | February 6, 2014 at 4:22 am

Are large instances really necessary for trial?. Its costing me a lot ! Let me know soon !

Thanks

    February 6, 2014 at 3:11 pm (reply)

    Sushma, it is not necessary to go with large instances. It depends on your workload that you are planning to test. The least expensive way to test is to run the Hortonworks Sandbox on your local machine – http://hortonworks.com/sandbox

srikrishna | February 5, 2014 at 7:56 pm

Do you really need large instances for trial ? What is the minimum amount of resources required ?

Satish | February 4, 2014 at 8:52 pm

Elephant is powerful when flying in the cloud

