The Hortonworks Blog

Many HDP users are increasing their focus on security within Hadoop and are looking for ways to encrypt their data. Fortunately, Hadoop provides several options for encrypting data at rest. At the lowest level there is volume encryption, which can encrypt all the data on a node and doesn’t require any changes to Hadoop. Volume-level encryption protects against physical theft of disks but lacks a fine-grained approach.

Often, you want to encrypt only selected files or directories in HDFS to save on overhead and protect performance, and this is now possible with HDFS Transparent Data Encryption (TDE).…
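
As a rough sketch of how TDE is used, the commands below create an encryption zone. They assume a Hadoop KMS is already configured; the key name and path are purely illustrative:

# create a key in the KMS (the key name is illustrative)
hadoop key create mykey
# create an empty directory and mark it as an encryption zone
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName mykey -path /secure

Files written under /secure are then encrypted transparently on write and decrypted on read for authorized users.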

Introduction

The Spark Technical preview lets you evaluate Apache Spark 1.2.0 on YARN with HDP 2.2. With YARN, Hadoop can now support various types of workloads; Spark on YARN becomes yet another workload running against the same set of hardware resources.

This technical preview describes how to:

  • Run Spark on YARN and run the canonical Spark examples, SparkPi and WordCount (see the spark-submit sketch after this list).
  • Run Spark 1.2 on HDP 2.2.
  • Work with a built-in UDF, collect_list, a key feature of Hive 0.13.
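
For example, once the technical preview is installed, SparkPi can be submitted to YARN with spark-submit roughly as follows. This is a sketch: the exact path to the examples jar depends on where Spark is installed.

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 512m \
    --executor-memory 512m \
    lib/spark-examples*.jar 10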

These are the installation instructions for Storm on YARN. Our work is based on the code and documentation provided by Yahoo in the Storm-YARN repository at https://github.com/yahoo/storm-yarn.

We initially installed a minimal CentOS 6.4 image on a single VM. This installation can be scaled up to a multi-node configuration.

You will need to make the following changes to prepare for the HDP 2.0 beta installation:

Disable SELinux using the command:

setenforce 0

Edit the SELinux configuration file:

vi /etc/selinux/config

Change SELINUX=enforcing to SELINUX=disabled

Stop the iptables firewall and disable it.…
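
On CentOS 6 this can be done with the following commands (a sketch, assuming the stock iptables service is in use):

service iptables stop
chkconfig iptables off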

If you are having performance issues with the Sandbox, try the following:

  • Run only 1 virtual machine at a time
  • Reboot the virtual machine
  • Allocate more RAM to the Sandbox VM. This assumes you have more than 4GB of physical RAM on your system. To learn how to allocate more RAM to the VM, see the instructions for your virtualization platform.
  • These resources may also be of use:

    Writing a file to Hortonworks Sandbox from Talend Studio

    I recently needed to quickly build some test data for my Hadoop environment and was looking for a tool to help me out. I discovered that this is a very simple process within Talend Studio (you can get the latest Talend Studio from their site).

    Here is how…

    Step 1 – Generating Test Data within Talend Studio
    • Create a New Job within the Job Designer
    • Drag a tRowGenerator onto the Designer
    • Double-click your tRowGenerator component and add the fields you want to generate
    Step 2 – Connecting to HDFS from Talend
    • Drag a tHDFSConnection onto the Designer
    • Change the “Name Node URI” property to point to your Hortonworks Sandbox on port 8020.
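
    The URI typically has the form below; the hostname assumes the default Sandbox name, so adjust it to your VM’s address:

    hdfs://sandbox.hortonworks.com:8020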

    Tableau, Apache Hive and the Hortonworks Sandbox

    As with most BI tools, Tableau can use Apache Hive (via an ODBC connection), the de facto standard for SQL access in Hadoop. Establishing a connection from Tableau to Hadoop and the Hortonworks Sandbox is fairly straightforward, and we will describe the process here.

    1. Install Tableau

    To get started, please download and install Tableau from their website. Tableau is a Windows-only application.…

    Ambari is 100% open source and included in HDP, greatly simplifying installation and initial configuration of Hadoop clusters. In this article we’ll run through some installation steps to get started with Ambari. Most of these steps are also covered in the main HDP documentation.

    The first order of business is getting Ambari Server itself installed. There are different approaches to this, but for the purposes of this short tour, we’ll assume Ambari Server is being installed on its own dedicated node or on one of the nodes of the (future) cluster itself.…
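
    As a minimal sketch, on a CentOS/RHEL node with the Ambari repository already configured, the server can be installed and started with:

    yum install ambari-server
    ambari-server setup
    ambari-server start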

     

    HOWTO: Ambari on EC2

    This document is an informal guide to setting up a test cluster on Amazon AWS, specifically the EC2 service. This is not a best-practice guide, nor is it suitable for a full PoC or production install of HDP.

    Please refer to Hortonworks documentation online to get a complete set of documentation.

    Create Instances

    I created the following RHEL 6.3 64-bit instances:

    • m1.medium ambarimaster
    • m1.large hdpmaster1
    • m1.large hdpmaster2
    • m1.medium hdpslave1
    • m1.medium hdpslave2
    • m1.medium hdpslave3

    Note: When instantiating the instances, I increased the root partition to 100 GB on each of them.…

    ISSUE:

    How can I use HCatalog to discover which files are associated with a partition in a table so that the files can be read directly from HDFS?

    How do I place files in HDFS and then add them as a new partition to an existing table?

    SOLUTION:

    This document describes how to use HCatalog to discover which files are associated with a particular partition in a table so that those files can be read directly from HDFS, and how to place files in HDFS and then add them as a new partition to an existing table.…
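
    As a sketch of both tasks via the Hive CLI (rather than the HCatalog API), assuming a table named mytable partitioned by dt:

    # find the HDFS location backing an existing partition
    hive -e "DESCRIBE FORMATTED mytable PARTITION (dt='2013-01-01');"
    # place new files in HDFS and register them as a new partition
    hadoop fs -mkdir /user/hdfs/mytable/dt=2013-01-02
    hadoop fs -put new_data.csv /user/hdfs/mytable/dt=2013-01-02/
    hive -e "ALTER TABLE mytable ADD PARTITION (dt='2013-01-02') LOCATION '/user/hdfs/mytable/dt=2013-01-02';"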

    ISSUE

    How do I use Apache Sqoop for importing data from a relational DB?

    SOLUTION

    Apache Sqoop can be used to import data from any relational DB into HDFS, Hive or HBase.

    To import data into HDFS, use the sqoop import command and specify the relational DB table and connection parameters:

    sqoop import --connect <JDBC connection string> --table <tablename> --username <username> --password <password>

    This will import the data and store it as comma-delimited text files in a directory in HDFS.…
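
    For example, a hypothetical MySQL import might look like this (the host, database, table, and credentials are illustrative):

    sqoop import --connect jdbc:mysql://dbhost/salesdb --table customers --username sqoopuser --password sqooppass --target-dir /user/hdfs/customers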

    ISSUE

    How do I run an example MapReduce job? Or

    How do I test that the MapReduce services are working?

    SOLUTION

    Make sure the job tracker and the task trackers are started.

    To start the job tracker:

    su - mapred -c "hadoop-daemon.sh --config /etc/hadoop start jobtracker; sleep 25"

    To start a task tracker:

    su - mapred -c "hadoop-daemon.sh --config /etc/hadoop start tasktracker"

    Run a MapReduce job from the Hadoop examples jar.…
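
    For example, the pi estimator can be run like this (the examples jar path assumes a default HDP layout and may differ on your cluster):

    su - hdfs -c "hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 100"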

    ISSUE

    How do I run simple Hadoop Distributed File System tasks? Or

    How do I test that HDFS services are working?

    SOLUTION

    Make sure the name node and the data nodes are started.

    To start the name node:

    su - hdfs -c "hadoop-daemon.sh --config /etc/hadoop/ start namenode"

    To start a data node:

    su - hdfs -c "hadoop-daemon.sh --config /etc/hadoop start datanode"

    Put data files into HDFS. This command takes a file from local disk and puts it into HDFS:

    su - hdfs -c "hadoop fs -put trial_file.csv /user/hdfs/trial_file.csv"

    Read data from HDFS.…
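
    One way to read it back (a sketch using the file written above):

    su - hdfs -c "hadoop fs -cat /user/hdfs/trial_file.csv"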

    ISSUE

    How do I test that HBase is working properly? OR

    What is a simple set of HBase commands?

    SOLUTION

    If HBase processes are not running, start them with the following commands:

    To start the HBase master (‘sleep 25’ is included as the master takes some time to get up and running):

    su - hbase -c "/usr/bin/hbase-daemon.sh --config /etc/hbase start master; sleep 25"

    To start the HBase RegionServer:

    su - hbase -c "/usr/bin/hbase-daemon.sh --config /etc/hbase start regionserver"
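
    The status and table commands below are entered in the HBase shell, which can be opened, for example, with:

    su - hbase -c "hbase shell"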

    This command shows a simple status of the HBase cluster nodes:

    status 'simple'

    This command will create a table with one column family:

    create 'table2', 'cf1'

    This command will add a value to a row of the table (the column is given as family:qualifier):

    put 'table2', 'row1', 'cf1:column1', 'value'

    This command will display all rows in the table:

    scan 'table2'…
    ISSUE

    What is the optimal way to shut down an HDP slave node?

    SOLUTION

    HDP slave nodes are usually configured to run the datanode and tasktracker processes. If HBase is installed, then the slave nodes run the HBase RegionServer process as well.

    To shut down the slave node, it is important to shut down the slave processes first. Each process should be shut down by the respective user account. These are the commands to run:

    Stop the HBase RegionServer:

    su - hbase -c "hbase-daemon.sh --config /etc/hbase/ stop regionserver"

    Stop tasktracker:

    su - mapred -c "hadoop-daemon.sh --config /etc/hadoop/ stop tasktracker"

    Stop datanode:

    su - hdfs -c "hadoop-daemon.sh --config /etc/hadoop/ stop datanode"…
    ISSUE:

    Choosing the appropriate Linux file system for HDFS deployment

    SOLUTION:

    The Hadoop Distributed File System is platform-independent and can function on top of any underlying file system and operating system. Linux offers a variety of file system choices, each with caveats that have an impact on HDFS.

    As a general best practice, if you are mounting disks solely for Hadoop data, mount them with the ‘noatime’ option, which disables access-time updates and speeds up file reads.…
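
    For example, an /etc/fstab entry for a dedicated Hadoop data disk might look like this (the device, mount point, and file system type are illustrative):

    /dev/sdb1   /grid/hadoop/hdfs   ext4   defaults,noatime   0 0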