February 04, 2014

Deploying a Hadoop Cluster on Amazon EC2 with HDP2

In this post, we’ll walk through the process of deploying an Apache Hadoop 2 cluster on the EC2 cloud service offered by Amazon Web Services (AWS), using Hortonworks Data Platform.

Both EC2 and HDP offer many knobs and buttons to cater to your specific performance, security, cost, data size, data protection and other requirements. I will not discuss most of these options in this post, as the goal is to walk through one particular deployment path to get started.

Let’s go!

Prerequisites

  • Amazon Web Services account with the ability to launch 7 large instances of EC2 nodes.
  • A Mac or a Linux machine. You could also use Windows but you will have to install additional software such as SSH clients and SCP clients, etc.
  • Lastly, we assume that you have basic familiarity with EC2 to the extent that you have created EC2 instances and SSH’d in.

Step 1: Creating a Base AMI with all the OS level configuration common to all nodes

Navigate to your EC2 console from the AWS Dashboard and then click on ‘Launch Instance’:
EC2 Dashboard

Let’s select the 64-bit RHEL image and go to the next step:

SelectBaseImage

Let’s select a large instance with adequate processing power and memory:

ImageSize

Here we adjust storage as required:

Instance_Storage

We are ready for Review and Launch:

ReviewAndLaunch

But, before you Launch the instance, make sure you have downloaded the private key. Keep the private key safe and Launch:

PrivateKey

Everything looks good. Let’s view the instances.

Now that we have an instance up and running, we will need its public DNS name to connect to it.

Let’s SSH in:

SSH
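
As a minimal sketch of the login step, the helper below builds the `ssh` command line for an EC2 instance. The key file name `hdp-cluster.pem` and the host name in the example are hypothetical, and the `ec2-user` login is an assumption based on what readers report in the comments for RHEL images on AWS.

```python
import shlex


def ssh_command(public_dns, key_path="hdp-cluster.pem", user="ec2-user"):
    """Build the ssh argv for logging in to an instance with a key pair.

    key_path and user are assumptions: use the private key you downloaded
    at launch and the default login user of your AMI.
    """
    return ["ssh", "-i", key_path, f"{user}@{public_dns}"]


if __name__ == "__main__":
    # Hypothetical public DNS name, for illustration only.
    argv = ssh_command("ec2-203-0-113-10.us-west-2.compute.amazonaws.com")
    print(shlex.join(argv))
```

Printing the command rather than running it keeps the sketch safe to try; paste the printed line into a terminal once the key path and host name are yours.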

Now let’s prep the instance:

Prep

That was all the prep we need, so we are going to create a private AMI. Go to the EC2 console, select the instance and from the action menu select “Create Image”:

AMI

Make sure you check ‘No reboot’ before you click Create Image, as we would like to continue working on this instance:

NoReboot

Wait for the creation of the AMI to be complete:

Let’s configure this instance for password-less SSH to all the other nodes in the cluster. The first step is to have the private key on this instance.

We will need to move the private key to the .ssh folder and rename it to id_rsa:

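The move-and-rename step can be sketched as below. This is an illustration, not the exact commands from the original screenshot: the function name is made up for this example, and the permission tightening (owner-only access) is an added assumption, since SSH clients normally refuse private keys that other users can read.

```python
import os
import shutil
import stat
from pathlib import Path


def install_private_key(key_file, home=None):
    """Move a downloaded private key to <home>/.ssh/id_rsa.

    home defaults to the current user's home directory. The file ends up
    readable and writable by its owner only (mode 600).
    """
    home_dir = Path(home) if home else Path.home()
    ssh_dir = home_dir / ".ssh"
    ssh_dir.mkdir(exist_ok=True)
    os.chmod(ssh_dir, stat.S_IRWXU)  # 700 on the .ssh folder

    target = ssh_dir / "id_rsa"
    shutil.move(str(key_file), str(target))
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # 600 on the key itself
    return target
```

Calling `install_private_key("hdp-cluster.pem")` (a hypothetical file name) on the Ambari node would leave the key where SSH looks for it by default.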

Let’s provision the other nodes now:

Select the size of the node instances:

I will launch 6 more nodes here, for 7 in total: 3 nodes dedicated to the management daemons and 4 dedicated to data nodes. Then click on ‘Review and Launch’.

Click on the “Launch” button:

Ensure you are using the same key as before, so that passwordless SSH works between the Ambari node and the rest of the new nodes. Click on ‘Launch Instance’.

As the instances are getting launched, we will copy down to a text file the Private DNS names of all the instances we have launched so far:

We will end up with a list like below:

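A hand-copied host list is easy to get wrong, so here is a small sketch that tidies one up before it is pasted into Ambari later. The host names are hypothetical examples in the usual EC2 private DNS form, and `parse_host_list` is a helper invented for this illustration.

```python
def parse_host_list(text, expected_count=None):
    """Return unique, trimmed host names in their original order.

    If expected_count is given (for example 7 for this walkthrough),
    raise ValueError when the list has a different number of hosts.
    """
    hosts = []
    for line in text.splitlines():
        name = line.strip().lower()
        if name and name not in hosts:
            hosts.append(name)
    if expected_count is not None and len(hosts) != expected_count:
        raise ValueError(f"expected {expected_count} hosts, found {len(hosts)}")
    return hosts


# Hypothetical private DNS names, for illustration only.
sample = """
ip-172-31-0-11.us-west-2.compute.internal
ip-172-31-0-12.us-west-2.compute.internal
  ip-172-31-0-12.us-west-2.compute.internal
"""

if __name__ == "__main__":
    print("\n".join(parse_host_list(sample)))
```

Passing `expected_count=7` catches a missed or duplicated line before Ambari does.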

Step 2: Customize the security groups to minimize attack surface area while not blocking essential communication channels

We have to add rules to the security groups that were created by default when we launched the instances.

The first security group should have been created when we launched the first instance. We are running the Ambari server on this instance, so we have to ensure we can get to it and it can communicate with the rest of the instances that we launched later:

Then we also need to open up the ports for IPs internal to the datacenter:

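To make the intent of these rules concrete, the sketch below models a security group as plain data and checks whether a connection would be let in. Every value is an assumption for illustration, not a copy of the original screenshots: the admin address, the internal address range `172.31.0.0/16`, and the choice to open all TCP ports internally. Port 8080 is the Ambari web port used later in this post.

```python
import ipaddress

# Illustrative assumptions, adjust to your own account.
ADMIN_IP = "203.0.113.25/32"  # hypothetical: your workstation's public IP
VPC_CIDR = "172.31.0.0/16"    # assumed internal range shared by the nodes

AMBARI_NODE_RULES = [
    {"ports": (22, 22), "source": ADMIN_IP},      # SSH from your machine
    {"ports": (8080, 8080), "source": ADMIN_IP},  # Ambari web UI
    {"ports": (0, 65535), "source": VPC_CIDR},    # node-to-node traffic
]


def is_allowed(rules, source_ip, port):
    """Return True if any inbound rule admits source_ip on the given TCP port."""
    addr = ipaddress.ip_address(source_ip)
    for rule in rules:
        low, high = rule["ports"]
        if low <= port <= high and addr in ipaddress.ip_network(rule["source"]):
            return True
    return False
```

With these rules, a cluster node can reach the Ambari server on any port, while an arbitrary internet address cannot reach port 8080. Restricting the web UI to your own IP is a design choice to keep the attack surface small.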

Step 3: Setting up Ambari

Get the HDP bits and add them to the repo.

Next, we will refresh the repo.

Then we will install the Ambari server:

Agree to download bits:

Agree to download the key:

Ambari Server bits are installed:

Now, we will configure the bits:

Just accept the default options for all the prompts by pressing Enter.

Let’s start the Ambari Server:

That’s it! We are all set to use Ambari to bring up the cluster.
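
The Step 3 sequence can be summarized as an ordered list of commands. This is a sketch under assumptions: the original commands were shown only as screenshots, so the lines below follow the usual Ambari installation on RHEL or CentOS, and `repo_url` is a placeholder you must replace with the ambari.repo address from the Hortonworks documentation for your Ambari version. The commands need root privileges.

```python
import subprocess


def ambari_setup_commands(repo_url):
    """Return the Step 3 commands in order, as argv lists.

    repo_url is a placeholder for the ambari.repo file location.
    """
    return [
        # Add the Ambari repository definition to yum.
        ["wget", "-O", "/etc/yum.repos.d/ambari.repo", repo_url],
        # Refresh the repository list.
        ["yum", "repolist"],
        # Install the server (yum asks to confirm the download and the key).
        ["yum", "install", "ambari-server"],
        # Configure the server; pressing Enter accepts the defaults.
        ["ambari-server", "setup"],
        # Start the server.
        ["ambari-server", "start"],
    ]


def run_all(commands, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Keeping `dry_run=True` only prints the commands, which is a convenient way to review them before running anything as root.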

Step 4: Using Ambari to deploy the cluster

Copy the public DNS name of the Ambari server.

Navigate to port 8080 of the public DNS name from your browser. You should see the Ambari login page. The default username and password are ‘admin’ and ‘admin’ respectively:

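If the login page does not load, a quick first check is whether port 8080 is reachable at all from your machine. The sketch below tests that with a plain TCP connection; the host name in the usage example is hypothetical. A `False` result usually points at the security group rules or at the server not running, rather than at the browser.

```python
import socket


def port_open(host, port=8080, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical public DNS name of the Ambari node.
    host = "ec2-203-0-113-10.us-west-2.compute.amazonaws.com"
    print(f"http://{host}:8080 reachable: {port_open(host)}")
```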

This is where we start creating the cluster. Enter any cluster name of your choosing:

We are going to create an HDP 2.0 cluster.

Remember the list of private DNS names that you copied down to a text file? We will pull out that list and paste it into the Target Hosts input box. On the same page, we will also upload the private key that we have been using.

We are all set to go. These should all come back as green with no warnings:

At this stage, we need to decide what services we need:

For this demonstration, I will select everything, although in real life you want to be more judicious and select the bare minimum needed for your requirement:

After we are done selecting the services, it’s time to determine where they will run. Ambari is smart enough to make reasonable suggestions, but if you have a specific topology in mind you might want to move these around.

The next step is to configure which nodes you want to be data nodes and clients. I like to have clients on multiple instances just for convenience.

In the next step we will have to configure the credentials for some of the services. The ones where you need to populate credentials are marked by a number on a red background.

Once we are done with all the inputs, we are ready to review and then start the deployment:

At this point it will take a while (about 30 minutes) to complete the deployment and test the services.

Voila!! We now have a fully functional and tested cluster on EC2. Happy Hadooping!!!

@saptak

Comments

Satish says:

Elephant is powerful when flying in the cloud

srikrishna says:

Do you really need large instances for trial ? What is the minimum amount of resources required ?

sushma says:

Do we really need large instances? It’s costing me a lot. Thanks.

sushma says:

Are large instances really necessary for a trial? It’s costing me a lot! Let me know soon!

Thanks

Saptak Sen says:

Sushma, it is not necessary to go with large instances. It depends on the workload that you are planning to test. The least expensive way to test is to run the Hortonworks Sandbox on your local machine – https://hortonworks.com/sandbox

Charley Lingerfelt says:

Is there a Sandbox ready file for AWS?

Manoj says:

Hi,

Can I get steps to deploy manually (I mean hardway) instead of using Apache Ambari

Thanks

Felice says:

Why doesn’t HDP2 support the Amazon Linux AMI? It is derived from Red Hat and has no licence cost.

Janos Matyas says:

For a faster way to deploy HDP2 on EC2 check the following blog post

Janos,
SequenceIQ

Puneet says:

Step 4: Using Ambari to deploy the cluster

The login page for Ambari is not opening. I am using the URL made of my Ambari server’s public DNS name followed by :8080/login.

I followed the security step to add 8080 for this server.

Please help.

Divya says:

An excellent introduction to using Ambari on Amazon.
– Is there an automated script that you can use instead of the manual steps?
– Is the base machine image created as a publicly accessible AMI?

Thank you for the well written tutorial.

Hasib Rahman says:

I got the below error message when I invoked yum install ambari-server

file /usr/lib64/python2.6/zipfile.pyo from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64

Hasib Rahman says:

I got the below error at: yum install ambari-server

file /usr/lib64/python2.6/zipfile.pyo from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64

Hasib Rahman says:

This document seems to be missing a few elements; I will update and provide a link soon.

A question if someone could answer. I am trying to follow the doc but got stuck at Confirm Hosts – It’s failing and here is the message

Host checks were skipped on 7 hosts that failed to register.

Can someone please help me troubleshoot?

Hasib Rahman says:

I just installed HDP via Ambari on AWS. Everything went okay. I am on login prompt, trying to install the cluster now but it’s failing on confirm hosts screen. Here is the message —

Host checks were skipped on 7 hosts that failed to register.

Can someone help me troubleshoot? Please please.

Anshul Vyas says:

You need to install ambari-agent on all the nodes and start it on each of them.
On the register hosts screen in Ambari, select registration by ambari-agent rather than SSH.
Also add all the host names to the /etc/hosts file on every server.
This should solve your issue.

srikrishna says:

Are medium instances sufficient? What is the minimum number of EC2 instances we need for a cluster?

srikrishna says:

Are medium instances sufficient? What is the minimum number of EC2 instances required for master/slave nodes for a quick setup?

srikrishna says:

when i do yum install ambari-server , I get the following error.

Transaction Check Error:
file /usr/lib64/python2.6/distutils/README from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/site-packages/README from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/bsddb/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/compiler/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/ctypes/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/ctypes/macholib/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/curses/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64
file /usr/lib64/python2.6/distutils/__init__.py from install of python26-2.6.8-2.el5.x86_64 conflicts with file from package python-2.6.6-37.el6_4.x86_64

Mario Lischka says:

This also a complete show stopper for me.
RHEL has package python installed which apparently is 2.6.
IMHO easiest solution is to modify the dependencies of Ambari, because replacing the package “python” with “python26” is close to impossible.

Patrick says:

I had to install Ambari 1.5 instead of 1.4.3.38, then open up TCP on the 2nd security group to all addresses, to get this to work.

Patrick says:

I also had to edit the Ambari config file and change the web port from 8080 to 80 in Amazon EC2. I would have liked to leave it at the original value but I could not change the port to 8080 in the security group in EC2.

Gerry Miller says:

Hi Saptak,
I tried to find the url for HUE but couldn’t, is HUE included in this install, or would it have to be installed separately on top?
Thanks,
Gerry

Ramazan FIRIN says:

hi,

I followed the steps one by one, but there is an error when creating the HDP stack.

STDOUT

STDERR
Please login as the user “ec2-user” rather than the user “root”.

scp /usr/lib/python2.6/site-packages/ambari_server/os_type_check.sh done for host ip-172-31-35-90.us-west-2.compute.internal, exitcode=1
Copying os type check script finished
ERROR: Bootstrap of host ip-172-31-35-90.us-west-2.compute.internal fails because previous action finished with non-zero exit code (1)

can you help me ?

Brian says:

Hi

I’ve run through this tutorial before and it worked fine.
I’m redoing it now, and I can’t get past the step where I register all the hosts. The wizard gets through most of it until what I think is the end and then I keep getting:

Setting up agent finished
Registering with the server…
Registration with the server failed.

Is there anywhere to check the logs to see what is happening and why it’s failing?
I can ssh from the original node to all the other nodes as both ec2-user and root. Is there something else I need to do?

Regards
Brian

Iván says:

Hi,

I’m trying to follow this setup in Amazon EC2 + VPC but I get a warning message in the confirm host step:

All bootstrapped hosts registered but unable to retrieve cpu and memory related information hortonworks

The problem is that next button is disabled and cannot continue with the installation

Jason Rubin says:

I got to the beginning of Step 4. But when I try to enter “http://(my public dns):8080”, the browser is unable to connect to it. Any suggestions for how I may fix this?

Thanks!

Birender Saini says:

Open the port 80 and 8080 in your Security Group used for master.

tiru says:

I got same issue. Instead of DNS name i used public IP and i was able to access Ambari…

Sajith PP says:

@Jason Rubin, Any luck for connecting to the server with public DNS?

Carolus Holman says:

You should add that for AWS, the default user when adding the Hosts should be set to ec2-user, root will fail.

Iván says:

So, the problem I had, is well documented and fixed with ambari 1.6.1

Aks says:

Things went fine till I reached Ambari Installation. Only the Ambari host is started successfully. All the other instances have failed Registration. I see the below error in logs

Agent log at: /var/log/ambari-agent/ambari-agent.log
(‘INFO 2014-10-02 13:21:03,954 NetUtil.py:74 – Server at https://myAmbariHostPrivateDNS:8440 is not reachable, sleeping for 10 seconds…
INFO 2014-10-02 13:21:13,965 NetUtil.py:41 – Connecting to the following url https://myAmbariHostPrivateDNS:8440/cert/ca
INFO 2014-10-02 13:22:16,966 NetUtil.py:55 – Failed to connect to https://myAmbariHostPrivateDNS:8440/cert/ca due to [Errno 110] Connection timed out

Anyone faced a similar issue ?

Marco says:

Hi,

I have the same problem.. did you solve it?

Thanks!

theoryno3 says:

Hi,

I’ve encountered the same problem. I’ve only come across updating openssl as the only viable solution. However, this still doesn’t resolve the problem for me either.

This is for HDP 2.2 and Ambari 1.7.0. Have you had much luck?

Thanks!

Antonio Piccolboni says:

All this blog was a single line of code with the now abandoned whirr and it offered also a distributed script execution capability to install custom software. I don’t think we share the same definition of progress.

Ken Williams says:

One big problem with this approach is that as soon as the instances are shut off and turned back on, the IP addresses and hostnames change, which means the cluster doesn’t know where to find its various services. What’s a good way to deal with this by using Elastic IPs or a VPC or something?

Sri says:

I keep getting this error.

Error: Package: ambari-server-1.2.3.7-1.noarch (Updates-ambari-1.2.3.7)
Requires: python26

I am on an EC2 RHEL image (7.1). I used the CentOS 6 version of Ambari.

Tried many of the solutions for this (like installing development tools, clearing repo, install python). Still doesn’t work,

Al-Tamimi says:

Hi Sri,

you are using an outdated Ambari repo.

check the main hortonworks documentation as there are many out of date documentation on google.

check out this link for latest repo for Ambari by hortonworks:
http://docs.hortonworks.com/HDPDocuments/Ambari-2.0.1.0/bk_Installing_HDP_AMB/content/_download_the_ambari_repo.html

i hope this helps.

kiran says:

Can’t I use apache hadoop as wamp
to use it to run PHP scripts please help me to run PHP in hadoop
what is the main use of hadoop?

kajal bhanushali says:

Thank you for the information.
After performing the above steps on Amazon, is there any way I can use the Amazon API and RESTful services for HDP, so that I can access the data from another web application?
Can anyone help me with this?

Naveen says:

Failed in Registering host:

I am installing HDP 2.0 using Ambari on AWS EC2. I installed Ambari and am able to open the console, but when I try to register the host list, it fails. I am not able to get the logs either.

aws training says:

Oh my goodness! Incredible article dude! Many thanks, 

aws training says:

Wow. That is so elegant and logical and clearly explained. Brilliantly goes through what could be a complex process and makes it obvious.

Big Data Analytics Training in Hyderabad says:

Really Nice Explanation with screenshot .It is very useful to users who are looking for deploying hadoop cluster on amazon EC2 with HDP2

Dominika says:

You can also launch HDP clusters using Hortonworks Data Cloud for AWS (HDCloud for AWS). To get started with the HDCloud for AWS general availability version, visit http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/index.html

Aws Training says:

Every information from this website and all other websites were helpful. And more over that it was good experience.

Bhumendra says:

Hi, I am installing Ambari 2.4.1 and HDP 2.5 on an AWS cluster. How will a local repository (on a Windows VM) work for faster installation of all services?

Peter Phelan says:

When using “Red Hat Enterprise Linux Server release 7.4 (Maipo)” for the base instance, the package libtirpc-devel was missing.
To fix that I did the following:
sudo yum-config-manager --enable rhui-REGION-rhel-server-optional
sudo yum install libtirpc-devel

Zhen Zeng says:

Red Hat 7.4 / HDP 2.6.4
Error:
resource_management.core.exceptions.ExecutionFailed: Execution of ‘/usr/bin/yum -d 0 -e 0 -y install hadoop_2_6_4_0_91-mapreduce’ returned 1. Error: Package: hadoop_2_6_4_0_91-hdfs-2.7.3.2.6.4.0-91.x86_64 (HDP-2.6-repo-1)
Requires: libtirpc-devel

Solution:
———-
sudo yum-config-manager --enable rhui-REGION-rhel-server-optional
sudo yum install libtirpc-devel
———-
Note: this blog may turn 2 dashes into 1 long dash, so make sure you use 2 dashes before “enable”.

carlos says:

Hi,

We were following all the steps in this post, but now we have a problem: we cannot SSH between the hosts in EC2, and I don’t know why. We have tried with the internal IP and the hostname, with the root user, the default user and others, and we always receive the same message:

Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

We suppose it is a stupid mistake, but we don’t know what else to do.
Help us, please.

carlos says:

Sorry, we forgot this step:

We will need to move the private key to .ssh folder and rename it to id_rsa:

Everything is ok now.
Thanks.

Sajith says:

@Saptak Sen,
First of all thanks for the well written document.
I am getting errors in the preparation step itself. Could you check and update the document if necessary? If possible, please also explain why we need to set up iptables, ntpd, etc. It would be helpful for people who are new to this, like me.
