The Hortonworks Blog


HCatalog (hcat) requires a persistent database, such as MySQL, in which to store its schema information.
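
For reference, a minimal sketch of where these connection details end up: the metastore settings normally live in the hive-site.xml that HCatalog reads (the Hive test below uses --config /etc/hcatalog). The property names are the standard Hive metastore settings; the database name, host, user, and password are placeholders matching the steps that follow.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://host_FQDN/hcatalog?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>my_user_id</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>pw</value>
</property>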

SOLUTION 1: Specific host access only

grab the latest MySQL server package

> yum -y install mysql-server

configure autostart at boot

> chkconfig mysqld on
> service mysqld start
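
optionally confirm the daemon came up (a quick sanity check)

> service mysqld status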

run the mysql client

> mysql -u root -p

enter your password

mysql> CREATE USER 'my_user_id'@'host' IDENTIFIED BY 'pw';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'my_user_id'@'host' WITH GRANT OPTION;
mysql> FLUSH PRIVILEGES;
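
before moving on, you can optionally verify what was granted (SHOW GRANTS is standard MySQL; a quick sketch)

mysql> SHOW GRANTS FOR 'my_user_id'@'host';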

exit the client

mysql> exit;

test the new account

mysql -h host_FQDN -u my_user_id -p

You should now be logged into the mysql client as the new user

If you get: Error … Can’t connect to MySQL server on …

log into the mysql host and assume root

iptables -A INPUT -i eth0 -p tcp -m tcp --dport 3306 -j ACCEPT
service iptables save
service iptables restart
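
If the firewall rule is in place but connections are still refused, another common culprit (an extra check, not part of the original steps) is mysqld listening on localhost only. Confirm the port is listening and that my.cnf is not restricting it:

netstat -tlnp | grep 3306
grep -E 'bind-address|skip-networking' /etc/my.cnf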

Test from hcat server machine

shell into the hcat server

mysql -h host -u my_user_id -p

* verify you can log in from the hcat host

Test from Hive

Run the Hive shell:

# hive --config /etc/hcatalog
hive> show tables;…
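
To go one step further and confirm the metastore accepts writes, not just reads, a minimal sketch (the table name is arbitrary):

hive> create table hcat_smoke_test (id int);
hive> show tables;
hive> drop table hcat_smoke_test;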

Failure of Active Namenode in a non-HA deployment


The best approach to mitigating the risk of data loss due to a NameNode failure is to harden the NameNode system and components to meet the desired level of redundancy.

Since the journal is not flushed with every operation, it could be up to several seconds out of sync with the persisted disk state. This latency determines the scope of potential data loss in the event of a NameNode failure.…
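
One standard hardening step is to have the NameNode write its metadata to more than one directory, including one on a remote (for example NFS-mounted) volume, so a single disk failure does not lose the namespace. A sketch of the hdfs-site.xml setting, with placeholder paths (the property is dfs.name.dir in Hadoop 1.x; later releases call it dfs.namenode.name.dir):

<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/hdfs/namenode,/mnt/nfs/hadoop/hdfs/namenode</value>
</property>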


How do I check the health of my HDFS cluster (name node and all data nodes)?


Hadoop includes the dfsadmin command-line tool for HDFS administration. Among other things, it lets the user view the status of the HDFS cluster.

To view a comprehensive status report, execute the following command:

hadoop dfsadmin -report

This command outputs basic statistics on cluster health, including the status of the NameNode, the status of each DataNode, disk capacity, and block health.…
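
For a deeper look at block-level health, fsck can be run alongside the report, and safe mode status can be checked explicitly (both are standard HDFS commands):

hadoop fsck /
hadoop dfsadmin -safemode get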


What should one keep in mind when configuring the network for a Hadoop cluster?


The following network configuration best practices are recommended for a stable and performant Hadoop cluster.

  • Machines should be on an isolated network from the rest of the data center. This means that no other applications or nodes should share network I/O with the Hadoop infrastructure. This is recommended as Hadoop is I/O intensive, and all other interference should be removed for a performant cluster.
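
As a quick sanity check that nodes on this isolated network actually deliver the expected bandwidth to each other, one sketch using the iperf utility (assumes iperf is installed on both hosts; host names are placeholders):

on one node, start the server

> iperf -s

on a second node, run the client against the first

> iperf -c first_node_FQDN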

SSH with a passphrase-protected key will prompt the user for the passphrase when connecting to the remote host.


Hadoop needs to be able to establish secure shell connections without being prompted for a passphrase. Alternatively, one could set up ssh-agent, which is inherently more secure but requires the passphrase to be entered at least once when the agent daemon is first started.
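
For the ssh-agent alternative mentioned above, a minimal sketch using standard OpenSSH commands (the key path assumes the id_dsa key generated in the steps below):

> eval `ssh-agent`
> ssh-add ~/.ssh/id_dsa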

This article reviews how to set up a key with no passphrase.

SOLUTION 1: Connection to different host(s)

on the host you will connect FROM:

generate the public private keys

> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
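
* this writes the private key to ~/.ssh/id_dsa and the matching public key to ~/.ssh/id_dsa.pub, which is the file copied in the next step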

copy the public key to every host you will connect TO:

> scp ~/.ssh/id_dsa.pub my_user_id@remote_host:~/.ssh/

* this should prompt you for a password

shell into the remote machine

> ssh my_user_id@remote_host

authorize the key by adding it to the list of authorized keys

> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

log out of the current shell

> exit

test that you can log in with no password

> ssh -i ~/.ssh/id_dsa my_user_id@remote_host

if this prompts for a password

* ensure the remote user is the owner of the pub key
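
Loose permissions on the remote side are another common cause of a password prompt, since sshd ignores authorized_keys when the file or the ~/.ssh directory is group- or world-writable. The usual fix, run as the remote user:

> chmod 700 ~/.ssh
> chmod 600 ~/.ssh/authorized_keys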

SOLUTION 2: connection to localhost

generate the public private keys

> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

authorize the key by adding it to the list of authorized keys

> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

test that you can log in with no password

> ssh localhost

check to make sure this works (doesn’t prompt for password)…