HOWTO: Working with Files in HCatalog Tables

ISSUE:

How can I use HCatalog to discover which files are associated with a partition in a table so that the files can be read directly from HDFS?

How do I place files in HDFS and then add them as a new partition to an existing table?

SOLUTION:

This document describes how to use HCatalog to discover which files are associated with a particular partition in a table so that those files can be read directly from HDFS, and how to place files in HDFS and then add them as a new partition to an existing table.

If you installed the code as tarballs, you will need to know the following before starting:

  • hadoop_home: location where Hadoop is installed on your client machine. For example, if you did an install under /home/hadoop/hdp10/hadoop then this will be your hadoop_home value.
  • hcat_home location where the HCatalog client is installed on your client machine. For example, if you did an install under /home/hadoop/hdp10/hcatalog then this will be your hcat_home value.
  • table_name the name of the table you wish to read from or write to.
  • templeton_host the hostname of the machine running Templeton, the web services API for HCatalog. This is only necessary if you are doing these calls via Templeton.
  • user_name the name of the user to run these commands as. This is only necessary if you are doing these calls via Templeton.

Throughout the document commands are detailed for both operations done on the command line and those done via web services. For those done on the command line, if the installation of the client was done as a tarball (rather than an rpm) it is assumed that your environment contains the variable HADOOP_HOME set to hadoop_home and that hcat_home/bin is in your shell’s PATH environment variable.

Reading

Step 1: Determine the schema of the table (Optional)

Command line:

hcat -e "describe <table_name>;"

This will return text that looks like:

OK
id bigint
user string
my_p string
my_q string

Values on the left are column names, values on the right are data types.

Web services:

URL: GET to <templeton_host>/templeton/v1/ddl/database/<db-name>/table/<table-name>
Accept: application/json
ContentType: application/json

Example JSON Response:

{ "columns": [
{
"name": "id",
"type": "bigint"
},
{
"name": "user",
"type": "string"
},
{
"name": "my_p",
"type": "string"
},
{
"name": "my_q",
"type": "string"
}
],
"database": "default",
"table": "my_table"
}

Step 2: Get a list of all partitions of the table (Optional)

Command line:

hcat -e "show partitions <table_name>;"

This will return text that looks like:

OK
ds=20110924
ds=20110925

Each line represents one partition, with the partition key to the left of the equal sign and the value for that partition to the right. If there are multiple partition keys they will be comma separated.

Web services:

URL: GET to <templeton_host>/templeton/v1/ddl/database/<db-name>/table/<table-name>/partition
Accept: application/json
ContentType: application/json

Example JSON response:

{
"partitions": [
{
"values": [
{
"columnName": "ds",
"columnValue": "20110924"
},
],
"name": "ds='20110924'"
},
{
"values": [
{
"columnName": "ds",
"columnValue": "20110925"
},
],
"name": "ds='20110925'"
},
],
"database": "default",
"table": "my_table"
}

Step 3: Find location information for the partition you wish to read.

Once you know the partition values for the partition you wish to read, you can find the location information. In the following statements part_col is the name of the partition column, part_value is the value of that column for the partition you are reading.

Command line:

hcat -e "show table extended like <table_name> partition(<part_col>=<part_value>);"

This will return text that looks like:

OK
tableName:studentparttab30k
owner:hortonal
location:hdfs://hrt9n03.cc1.ygridcore.net:9000/user/hcat/tests/data/studentparttab30k/studentparttab30k.20110924
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { string name, i32 age, double gpa}
partitioned:true
partitionColumns:struct partition_columns { string ds}
totalNumberFiles:1
totalFileSize:219190
maxFileSize:219190
minFileSize:219190
lastAccessTime:1328599687969
lastUpdateTime:1325806298324

You need the line starting with location:. This indicates the file or directory name where the partition data is stored. Depending on how the partition was created, this may be either a file or a directory.

Web services:

URL: GET to <templeton_host>/templeton/v1/ddl/database/<db-name>/table/<table-name>/partition/<partition-name>
Accept: application/json
ContentType: application/json

<partition-name> will be the name as given in the list of partitions obtained in Step 2. For example, for a table partitioned by the field ds where you wish to get the partition where ds=20110924, the name will be ds=’20110925′Note that the quotation marks will need to be escaped in your URL.

Example curl command:

curl -HContent-type:application/json 'http://www.myserver.com/templeton/v1/ddl/database/default/table/test_table/partition/ds=%2720110924%27'

Example JSON Response:

{
"minFileSize": 184,
"totalNumberFiles": 1,
"location": "hdfs://localhost:9000/user/hive/warehouse/my_table/my_p=XYZ/my_q=ABC",
"lastUpdateTime": 1329980827336,
"lastAccessTime": 1329980816220,
"columns": [
{
"name": "id",
"type": "bigint"
},
{
"name": "user",
"type": "string"
},
],
"partitionColumns": [
{
"name": "ds",
"type": "string"
},
],
"maxFileSize": 184,
"inputformat": "org.apache.hadoop.hive.ql.io.RCFileInputFormat"
"partitioned": true,
"owner": "you",
"totalFileSize": 184,
"outputformat": "org.apache.hadoop.hive.ql.io.RCFileOutputFormat",
"database": "default",
"table": "my_table",
"partition": "ds=’20110924’"
}

Step 4: Read the file

This can be done via hadoop fs, the Hadoop Java API, or webhdfs.

Writing

Step 1: Load the file you wish to have as a partition into HDFS

This can be done via hadoop fs, the Hadoop Java API, or webhdfs.

Step 2: Add the partition to the table

In the following statements part_col is the name of the partition column, part_value is the value of that column for the partition you are reading, and file_location is where you loaded the file in the previous step.

Command line:

hcat -e "alter table <table_name> add partition (<part_col>='<part_value>') location '<file_location>';"

Web services:

URL: PUT to <templeton_host>/templeton/v1/ddl/database/<db-name>/table/<table-name>/partition/<partition-name>
Accept: application/json
ContentType: application/json

<partition-name> will be the name of the partition. This should match Hive’s partition naming scheme of key=’value’. For example, for a table partitioned by the field ds where you wish to get the partition where ds=20110924, the name will be ds=’20110925′ Note that the quotation marks will need to be escaped in your URL.

The location is passed in a JSON document:

{
“location”: “<location>”
}

Example curl command:

curl -X PUT -HContent-type:application/json -d '{"location": "hdfs://nn.acme.com/user/data/mytable/20110924"}' 'http://www.myserver.com/templeton/v1/ddl/database/default/table/test_table/partition/ds=%2720110924%27'

Example JSON Response:

{
"return-code" : “OK”
}

Full JSON Schema for document

{
"partition": "ds='20110924'",
"table": "mytable",
"database": "default"
}

Thank you for subscribing!