Set Up Apache Hadoop in Minutes with RPMs

We have some great news for developers and researchers who want to start using Apache Hadoop quickly. With today's release of Apache Hadoop 0.20.204 comes, for the first time, a set of RPMs that make it much simpler to set up a basic Hadoop cluster. This lets you focus on how to use the features instead of having to learn how they were implemented.

Before we begin, I’d like to apologize up front: these instructions do not tune Hadoop settings for performance. We will leave Hadoop optimization for another day.

Download software

Download the Java JDK RPM.

Download the Apache Hadoop 0.20.204.0 RPM from the Apache mirrors.


Single node system setup

1) Install the JDK on a Red Hat or CentOS 5+ system.

sudo ./jdk-6u26-linux-x64-rpm.bin.sh

Java is now installed, and JAVA_HOME is set to /usr/java/default.
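To double-check the install before moving on (a quick sanity check; you may need a fresh login shell before JAVA_HOME shows up in your environment):

java -version
echo $JAVA_HOME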

2) Install Apache Hadoop 0.20.204.

sudo rpm -i hadoop-0.20.204.0-1.i386.rpm

3) Set up the Apache Hadoop configuration and start the Hadoop processes.

sudo /usr/sbin/hadoop-setup-single-node.sh

The setup wizard will guide you through a list of questions to set up Hadoop. Hadoop should be running after you answer ‘Y’ to all of them.
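If you want to confirm that the daemons actually came up, jps (shipped with the JDK) gives a rough picture; on a working single-node setup you should see entries such as NameNode, DataNode, JobTracker and TaskTracker:

sudo jps
ps -ef | grep java   # alternative if jps is not on root's PATH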

4) Create a user account on HDFS for yourself.

sudo /usr/sbin/hadoop-create-user.sh -u $USER
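As a quick smoke test of your new HDFS home directory (this assumes the script created /user/$USER, which is its usual behavior):

hadoop fs -ls /user/$USER
hadoop fs -put /etc/hosts /user/$USER/hosts.txt
hadoop fs -cat /user/$USER/hosts.txt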


Multi-node setup

1) Install both the JDK and Hadoop 0.20.204.0 RPMs on all nodes.

2) Generate the Hadoop configuration on all nodes:

sudo /usr/sbin/hadoop-setup-conf.sh \
  --namenode-url=hdfs://${namenode}:9000/ \
  --jobtracker-url=${jobtracker}:9001 \
  --conf-dir=/etc/hadoop \
  --hdfs-dir=/var/lib/hadoop/hdfs \
  --namenode-dir=/var/lib/hadoop/hdfs/namenode \
  --mapred-dir=/var/lib/hadoop/mapred \
  --datanode-dir=/var/lib/hadoop/hdfs/data \
  --log-dir=/var/log/hadoop \
  --auto

Here, ${namenode} and ${jobtracker} should be replaced with the hostnames of your namenode and jobtracker.
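For example, if a single machine named hadoop-master (a hypothetical hostname) runs both the namenode and the jobtracker, you could set the variables in your shell before running the command above:

namenode=hadoop-master
jobtracker=hadoop-master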

3) Format the namenode and set up the default HDFS layout.

sudo /usr/sbin/hadoop-setup-hdfs.sh
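To confirm that the default layout was created, you can list the root of HDFS as the hdfs superuser (the exact directory names may differ slightly between releases):

sudo -u hdfs hadoop fs -ls /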

4) Start all data nodes.

sudo /etc/init.d/hadoop-datanode start

5) Start job tracker node.

sudo /etc/init.d/hadoop-jobtracker start

6) Start task tracker nodes.

sudo /etc/init.d/hadoop-tasktracker start
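Before moving on, it is worth checking that the daemons are actually running on each node. The process list is one rough check; the built-in web UIs, which listen on ports 50070 (namenode) and 50030 (jobtracker) by default unless the setup script changed them, are another. Substitute your own hostnames below:

sudo jps
curl http://${namenode}:50070/
curl http://${jobtracker}:50030/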

7) Create a user account on HDFS for yourself.

sudo /usr/sbin/hadoop-create-user.sh -u $USER


Verify Hadoop

Run the word count example.
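A minimal sketch of that run, assuming the RPM placed the examples jar under /usr/share/hadoop (the jar name and location can vary between releases), looks like this:

hadoop fs -mkdir input
hadoop fs -put /etc/hadoop/*.xml input
hadoop jar /usr/share/hadoop/hadoop-examples-*.jar wordcount input output
hadoop fs -cat 'output/part-*' | head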

I hope this information is helpful. For questions about Hadoop RPMs, please contact me directly at eyang at hortonworks dot com.

– Eric Yang


Comments

Prem | May 27, 2014 at 9:40 am

Hi Eric,

Everything went well except step 7, creating a user account.

sudo /usr/sbin/hadoop-create-user.sh -u $USER

The message I get is

mkdir: failed to create /user/hadoopuser
chown: could not get status for ‘/user/hadoopuser’ : File /user/hadoopuser does not exist.

Any suggestions?

Also, once I have created an account, how do I access Hadoop from the CentOS terminal?

Thank you so much
Prem

radhika | June 8, 2013 at 12:30 am

Hi,

sudo -u hdfs hadoop fs -mkdir /var

When I try to execute the above command, it says JAVA_HOME is not set. Can you please help me?

I exported the following two lines in all of the scripts below:

export JAVA_HOME=/usr/java/jdk1.7.0_21
export PATH=$PATH:$JAVA_HOME/bin

/usr/sbin/hadoop-set-hdfs.sh
/etc/hadoop/hadoop-env.sh
/usr/bin/hadoop
/etc/profile
/home/username/.bashrc

And echo $JAVA_HOME retrieves the correct path.

Can anyone please help me?
Thanks,
radhika

vennela | May 30, 2013 at 9:21 am

At step 3 of the multi-node setup I’m getting a “JAVA_HOME not set” error, but I have set JAVA_HOME. When I type “echo $JAVA_HOME” it gives me the path to the JDK. Please help me with this.

Kent Brodie | February 28, 2013 at 12:22 pm

After a whole ton of googling, I *finally* came across this blog entry, and wow, I wish this had been included with the RPM kit(s). I downloaded and installed 1.0.4 stable and had spent a day or so configuring things manually before I discovered the setup scripts that were included (oops!), but I still needed something like this post to figure out how to do things. VERY helpful. (Yes, since 1.0.4 a few changes were required per the replies above, nothing major.)

Thanks!

Eric Yang | December 14, 2012 at 2:44 pm

David, the script works on stock Apache Hadoop 1.x only. Cloudera has its own instructions for installing the CDH4 RPM.

Hadummy, your running system does not have the mr user in the hadoop group. The script is designed to be run as root only, so that the Linux task controller can be set up properly.
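A couple of quick diagnostic commands to see what your system currently has before re-running the script as root (the mr and mapred user names come from the discussion here; which one exists depends on the RPM version):

id mr
id mapred
getent group hadoop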

Hadummy | December 4, 2012 at 11:57 am

Hadummy posting here (first-time installer). After running the hadoop-setup-conf script from the blog, I received a pretty large chunk of errors. I ran Jagane Sundar’s script instead and got fewer errors, but still got hit with “chown: invalid user: `mr:hadoop'”. I’ve Googled, Binged, and Yahoo’d to no avail. Apparently I’m not the first person to get this error, but no one has a method for getting around it. (For the record, this error also appears with the initial script from the blog; it’s just followed by a more verbose string of issues.)

I’ve executed this via sudo and as root; the errors and the result are the same. (Three of the above-mentioned errors, followed by “configuration setup is completed run hadoop-setup-hdfs.sh”. hadoop-setup-hdfs.sh then fails with permission errors in the log and run directories.)

I’m on CentOS 6.3, using hadoop-1.1.1-1.x86_64.rpm as my install package.

David Tucker | September 26, 2012 at 12:47 pm

There are a few other problems with the hadoop-setup-conf.sh script. Most notably, the template *-site.xml files appear to be insufficiently aligned with the environment settings from the script. For example, the mapred-site.xml template has two separate .dir entries that are hard-coded to “/mapred/” rather than “$HADOOP_MAPRED_DIR/”. The result is that attempts to start the Hadoop services fail because the directories cannot be created.
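To check whether a given installation has the same problem, a simple grep of the conf directory (assuming the /etc/hadoop location from the blog post) shows any hard-coded paths:

grep -n '/mapred/' /etc/hadoop/mapred-site.xml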

It’s possible that this has been fixed in later revisions. I was using the CDH4 tarball of hadoop-2.0 from Cloudera.

Eric Yang | July 1, 2012 at 9:52 pm

Hi hadoop-user59,

Use the /etc/init.d/hadoop-* scripts. For example, to start the datanode:

sudo /etc/init.d/hadoop-datanode start

regards,
Eric

hadoop-user59 | June 1, 2012 at 3:03 pm

Okay, I found it in /etc/hadoop/core-site.xml. I changed the port from 8020 to 9000. How do I restart Hadoop?

thanks

hadoop-user59 | June 1, 2012 at 2:52 pm

Where does the RPM install the conf directory? I need to change core-site.xml but can’t find its location on my Amazon Linux AMI instance.

Thanks.

Eric Yang | February 7, 2012 at 2:04 pm

Hi Gopal,

HDFS user creation can fail if there is already an existing user with the same uid. Please check that your system does not already have a uid assigned for the hdfs user. In addition, JAVA_HOME may not be exported to the child process when the scripts spawn an additional shell, so it is best to explicitly set --java-home in the setup script.
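For example, when generating the configuration you could pass the JDK location explicitly (a sketch only; it assumes the --java-home spelling mentioned above, the /usr/java/default path from the top of the post, and whatever other options you used in step 2):

sudo /usr/sbin/hadoop-setup-conf.sh \
  --java-home=/usr/java/default \
  --namenode-url=hdfs://${namenode}:9000/ \
  --jobtracker-url=${jobtracker}:9001 \
  --conf-dir=/etc/hadoop \
  --auto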

Hope this helps.

regards,
Eric

gopal | January 21, 2012 at 10:23 am

hi,

While creating the hdfs user, I am getting this error:

JAVA_HOME is not set.

But when I try echo $JAVA_HOME, I can retrieve the path. I also included the JAVA_HOME value in the file /usr/sbin/hadoop-create-user.sh.

Any ideas?

thanks,
gopal

December 4, 2011 at 10:20 pm

OK. Ignore the last comment I made about not being able to get it to run. I did get 205 to run using the rpm install, and the instructions in your blog post above.

I had to do two more things:

Issue #3: After running the script /usr/sbin/hadoop-setup-conf.sh, I needed to log out and log in again because the environment variables set in /etc/profile.d/hadoop-env.sh were not sourced in my shell, so /usr/sbin/hadoop-setup-hdfs.sh was failing.

Issue #4: I needed to add the new parameter --format to /usr/sbin/hadoop-setup-hdfs.sh in order to format HDFS.

Subsequently, the JT and TTs started up without any problems.

For reference, here is the command line I used for hadoop-setup-conf.sh:

# /usr/sbin/hadoop-setup-conf.sh \
--namenode-host=master \
--jobtracker-host=master \
--conf-dir=/etc/hadoop \
--hdfs-dir=/var/lib/hadoop/hdfs \
--namenode-dir=/var/lib/hadoop/hdfs/namenode \
--mapred-dir=/var/lib/hadoop/mapred \
--datanode-dir=/var/lib/hadoop/hdfs/data \
--log-dir=/var/log/hadoop \
--auto \
--mapreduce-user=mapred \
--dfs-support-append=true

Cheers, Eric. It would be great if you could keep this blog post and the RPM/scripts current and working. This is the easiest way to get Hadoop up and running.

December 4, 2011 at 7:19 pm

Hello Eric,

The 205 release has broken a few things in the blog post above. I installed the 205 RPM and then ran the script as described in the post.
Issue #1: The hadoop-setup-conf.sh parameter --namenode-url has changed to --namenode-host, and --jobtracker-url has changed to --jobtracker-host.
Issue #2: The RPM creates a Linux user mapred, whereas the default user is “mr” in your hadoop-setup-conf.sh script. The additional parameter --mapreduce-user=mapred needs to be added in order to make the script play well with the RPM.

Hmm. OK. I actually cannot get the non-secure version to work when I create a configuration using this script. Oh well…

