Using Hive to interact with HBase, Part 2

This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration, so the first post isn’t required reading for consuming this one. Still, it might be good context.

“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”

This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase 0.94.6.1.

Grab some data and register it in Hive

We’ll need some data to work with. For this purpose, grab some traffic stats from Wikipedia. Once we have some data, copy it up to HDFS.

$ mkdir pagecounts ; cd pagecounts
$ for x in {0..9} ; do wget "http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-0${x}0000.gz" ; done
$ hadoop fs -copyFromLocal $(pwd) ./

For reference, this is what the data looks like.

$ zcat pagecounts-20081001-000000.gz | head -n5
aa.b Special:Statistics 1 837
aa Main_Page 4 41431
aa Special:ListUsers 1 5555
aa Special:Listusers 1 1052
aa Special:PrefixIndex/Comparison_of_Guaze%27s_Law_and_Coulomb%27s_Law 1 4332

As I understand it, each record is a count of page views of a specific page on
Wikipedia. The first column is the language code, second is the
page name, third is the number of page views, and fourth is the size of the
page in bytes. Each file contains an hour’s worth of aggregated data. None of
the above pages were particularly popular that hour.
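
With that layout in mind, you can poke at the raw files directly before involving Hive at all. Purely as an illustration (this isn’t part of the workflow), here’s a quick pipeline that pulls the three most-viewed English-language pages from the first hour’s file:

$ zcat pagecounts-20081001-000000.gz \
    | awk '$1 == "en" {print $3, $2}' \
    | sort -rn \
    | head -n3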

Now that we have data and understand its raw schema, create a Hive table over
it. To do that, we’ll use a DDL script that looks like this.

$ cat 00_pagecounts.ddl
-- define an external table over raw pagecounts data
CREATE EXTERNAL TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT
  DELIMITED FIELDS TERMINATED BY ' '
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/ndimiduk/pagecounts';

Run the script to register our dataset with Hive.

$ hive -f 00_pagecounts.ddl
OK
Time taken: 2.268 seconds

Verify that the schema mapping works by calculating a simple statistic over the dataset.

$ hive -e "SELECT count(*) FROM pagecounts;"
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
OK
36668549
Time taken: 25.31 seconds, Fetched: 1 row(s)

Hive says the 10 files we downloaded contain roughly 36.7 million records. Let’s get a second opinion to confirm things are working as expected. This isn’t much data, so check it on the command line.

$ zcat * | wc -l
36668549

The record counts match up — excellent.

Transform the schema for HBase

The next step is to transform the raw data into a schema that makes sense for HBase. In our case, we’ll create a schema that allows us to calculate aggregate summaries of pages according to their titles. To do this, we want all the data for a single page grouped together. We’ll manage that by creating a Hive view that represents our target HBase schema. Here’s the DDL.

$ cat 01_pgc.ddl
-- create a view, building a custom hbase rowkey
CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
SELECT concat_ws('/',
         projectcode,
         concat_ws('/',
           pagename,
           regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
       pageviews, bytes
FROM pagecounts;

The SELECT statement uses Hive to build a compound rowkey for HBase. It concatenates the project code, page name, and date, joined by the '/' character. A handy trick: it uses a simple regex over the virtual column INPUT__FILE__NAME to extract the date from the source file names. For example, the sample record aa Main_Page 4 41431 from pagecounts-20081001-000000.gz becomes the rowkey aa/Main_Page/20081001-000000. Run it now.

$ hive -f 01_pgc.ddl
OK
Time taken: 2.712 seconds

This is just a view, so the SELECT statement won’t be evaluated until we query it for data; registering it with Hive doesn’t actually process anything. Again, make sure it works by querying Hive for a subset of the data.

$ hive -e "SELECT * FROM pgc WHERE rowkey LIKE 'en/q%' LIMIT 10;"
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
OK
en/q:Special:Search/Blues/20081001-090000       1       1168
en/q:Special:Search/rock/20081001-090000        1       985
en/qadam_rasul/20081001-090000  1       1108
en/qarqay/20081001-090000       1       933
en/qemu/20081001-090000 1       1144
en/qian_lin/20081001-090000     1       918
en/qiang_(spear)/20081001-090000        1       973
en/qin_dynasty/20081001-090000  1       1120
en/qinghe_special_steel_corporation_disaster/20081001-090000    1       963
en/qmail/20081001-090000        1       1146
Time taken: 40.382 seconds, Fetched: 10 row(s)

Register the HBase table

Now that we have a dataset in Hive, it’s time to introduce HBase. The first step is to register our HBase table in Hive so that we can interact with it using Hive queries. That means another DDL statement. Here’s what it looks like.

$ cat 02_pagecounts_hbase.ddl
-- create a table in hbase to host the view
CREATE TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'pagecounts');

This statement will tell Hive to go create an HBase table named pagecounts with the single column family f. It registers that HBase table in the Hive metastore by the name pagecounts_hbase with 3 columns: rowkey, pageviews, and bytes. The SerDe property hbase.columns.mapping makes the association from Hive column to HBase column. It says the Hive column rowkey is mapped to the HBase table’s rowkey, the Hive column pageviews to the HBase column f:c1, and bytes to the HBase column f:c2. To keep the example simple, we have Hive treat all these columns as the STRING type.
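
The mapping string is positional: its entries line up, in order, with the Hive columns in the table definition, and the :key entry binds a column to the HBase rowkey. Purely as a hypothetical illustration (not part of this walkthrough), a wider table that also carried the project code in its own qualifier within the same column family might look like this:

-- hypothetical variant: one extra Hive column mapped to another qualifier
CREATE TABLE IF NOT EXISTS pagecounts_hbase_wide (rowkey STRING, projectcode STRING, pageviews STRING, bytes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:projectcode,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'pagecounts_wide');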

In order to use the HBase library, we need to make the HBase jars and configuration available to the local Hive process (at least until HIVE-5518 is resolved). Do that by specifying a value for the HADOOP_CLASSPATH environment variable before executing the statement.

$ export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar:/usr/lib/zookeeper/zookeeper.jar
$ hive -f 02_pagecounts_hbase.ddl
OK
Time taken: 4.399 seconds
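
If you want to double-check what the storage handler just did on the HBase side, a quick look from the HBase shell should show the new table with its single column family f. This is just a sanity check; output omitted here.

$ echo "describe 'pagecounts'" | hbase shell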

Populate the HBase table

Now it’s time to write data to HBase. This is done using a regular Hive INSERT statement, sourcing data from the view with SELECT. There’s one more bit of administration we need to take care of though. This INSERT statement will run a mapreduce job that writes data to HBase. That means we need to tell Hive to ship the HBase jars and dependencies with the job.

Note that this is a separate step from the classpath modification we did previously. Normally you can do this with an export statement from the shell, the same way we specified the HADOOP_CLASSPATH. However, there’s a bug in HDP-1.3 that requires me to use Hive’s SET statement in the script instead.

$ cat 03_populate_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path = file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;

-- populate our hbase table
FROM pgc INSERT INTO TABLE pagecounts_hbase SELECT pgc.* WHERE rowkey LIKE 'en/q%' LIMIT 10;

Note there’s a big ugly bug in Hive 0.12.0 which means this doesn’t work with that version. Never fear though, we have a patch in progress. Follow along at HIVE-5515.

If you choose to use a different method for setting Hive’s auxpath, be advised that it’s a tricky process — depending on how you specify it (HIVE_AUX_JARS_PATH, --auxpath), Hive will interpret the argument differently. HIVE-2349 seeks to remedy this unfortunate state of affairs.
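
For reference, the flag-based invocation looks roughly like the sketch below. The jar list must match your install, and how the list should be delimited depends on which mechanism you use, which is exactly the inconsistency HIVE-2349 describes; on HDP-1.3 I stick with the in-script SET and a plain invocation, as below.

$ hive --auxpath /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,/usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,/usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar -f 03_populate_hbase.hql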

$ hive -f 03_populate_hbase.hql
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
...
OK
Time taken: 40.296 seconds

Be advised also that this step is currently broken on secured HBase deployments. Follow along at HIVE-5523 if that’s of interest to you.

Query data from HBase-land

40 seconds later, you now have data in HBase. Let’s have a look using the HBase shell.

$ echo "scan 'pagecounts'" | hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.6.1.3.2.0-111, r410a7a1c151ca953553eae68aa84e2a9f0d6e4ca, Mon Aug 19 19:00:12 PDT 2013

scan 'pagecounts'
ROW                                                 COLUMN+CELL
 en/q:Pan%27s_Labyrinth/20081001-080000             column=f:c1, timestamp=1381534232485, value=1
 en/q:Pan%27s_Labyrinth/20081001-080000             column=f:c2, timestamp=1381534232485, value=1153
 en/q:Special:Search/Jazz/20081001-080000           column=f:c1, timestamp=1381534232485, value=1
 en/q:Special:Search/Jazz/20081001-080000           column=f:c2, timestamp=1381534232485, value=980
 en/q:Special:Search/peinture/20081001-080000       column=f:c1, timestamp=1381534232485, value=1
 en/q:Special:Search/peinture/20081001-080000       column=f:c2, timestamp=1381534232485, value=989
 en/q:Special:Search/rock/20081001-080000           column=f:c1, timestamp=1381534232485, value=1
 en/q:Special:Search/rock/20081001-080000           column=f:c2, timestamp=1381534232485, value=980
 en/qadi/20081001-080000                            column=f:c1, timestamp=1381534232485, value=1
 en/qadi/20081001-080000                            column=f:c2, timestamp=1381534232485, value=1112
 en/qalawun%20complex/20081001-080000               column=f:c1, timestamp=1381534232485, value=1
 en/qalawun%20complex/20081001-080000               column=f:c2, timestamp=1381534232485, value=942
 en/qalawun/20081001-080000                         column=f:c1, timestamp=1381534232485, value=1
 en/qalawun/20081001-080000                         column=f:c2, timestamp=1381534232485, value=929
 en/qari'/20081001-080000                           column=f:c1, timestamp=1381534232485, value=1
 en/qari'/20081001-080000                           column=f:c2, timestamp=1381534232485, value=929
 en/qasvin/20081001-080000                          column=f:c1, timestamp=1381534232485, value=1
 en/qasvin/20081001-080000                          column=f:c2, timestamp=1381534232485, value=921
 en/qemu/20081001-080000                            column=f:c1, timestamp=1381534232485, value=1
 en/qemu/20081001-080000                            column=f:c2, timestamp=1381534232485, value=1157
10 row(s) in 0.4960 seconds

Here we have 10 rows, each with two columns, containing the data loaded using Hive. It’s now accessible in your online world using HBase. For example, perhaps you receive an updated data file and have a corrected value for one of the stats. You can update the record in HBase with a regular put command.
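
A one-off correction from the HBase shell might look something like this; the new value of 2 is made up purely for illustration.

$ echo "put 'pagecounts', 'en/qemu/20081001-080000', 'f:c1', '2'" | hbase shell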

Verify data from Hive

The HBase table remains available to you from the Hive world; Hive’s HBaseStorageHandler works both ways, after all.

Note that this command expects HADOOP_CLASSPATH to still be set, and HIVE_AUX_JARS_PATH (or the equivalent SET statement) as well if your query is complex enough to launch a mapreduce job.

$ hive -e "SELECT * from pagecounts_hbase;"
OK
en/q:Pan%27s_Labyrinth/20081001-080000  1       1153
en/q:Special:Search/Jazz/20081001-080000        1       980
en/q:Special:Search/peinture/20081001-080000    1       989
en/q:Special:Search/rock/20081001-080000        1       980
en/qadi/20081001-080000 1       1112
en/qalawun%20complex/20081001-080000    1       942
en/qalawun/20081001-080000      1       929
en/qari'/20081001-080000        1       929
en/qasvin/20081001-080000       1       921
en/qemu/20081001-080000 1       1157
Time taken: 2.554 seconds, Fetched: 10 row(s)

Continue using Hive for analysis

Since the HBase table is accessible from Hive, you can continue to use Hive for your ETL processing with mapreduce. Keep in mind that the auxpath considerations apply here too, so I’ve scripted out the query instead of just running it directly at the command line.

$ cat 04_query_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path = file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;
-- query hive data
SELECT count(*) from pagecounts_hbase;

Run it the same way we did the others.

$ hive -f 04_query_hbase.hql
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
...
OK
10
Time taken: 19.473 seconds, Fetched: 1 row(s)
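
A plain row count only scratches the surface. As a sketch of where you could take this (untested against this dataset), the same scripted pattern, SET preamble included, extends to heavier aggregations, for instance totalling page views per project code by splitting the compound rowkey:

-- hypothetical follow-up aggregation over the HBase-backed table
SET hive.aux.jars.path = file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;
SELECT projectcode, sum(cast(pageviews AS INT)) AS total_views
FROM (
  SELECT split(rowkey, '/')[0] AS projectcode, pageviews
  FROM pagecounts_hbase
) keyed
GROUP BY projectcode;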

There you have it: a hands-on, end to end demonstration of interacting with HBase from Hive. You can learn more about the nitty-gritty details in Enis’s deck on the topic, or see the presentation he and Ashutosh gave at HBaseCon. If you’re inclined to make the intersection of these technologies work better (faster, stronger), I encourage you to pick up any of the JIRA issues mentioned in this post or the previous.

Happy hacking!

Comments

Afaque KHAN | March 27, 2014 at 5:52 am

Thanks Nick, I created a post on the forum too.

Regards,

Afaque

Afaque KHAN | March 26, 2014 at 9:52 am

Hi Nick,

I am using Sandbox 2.0 and I am trying to register my hbase table into hive using the following query:

CREATE TABLE IF NOT EXISTS Document_Table_Hive (key STRING, author STRING, category STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,metadata:author,categories:category') TBLPROPERTIES ('hbase.table.name' = 'Document');

This does not work: I get the following exception.

2014-03-26 09:14:57,341 ERROR exec.DDLTask (DDLTask.java:execute(435)) - java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.setConf(HBaseStorageHandler.java:249)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getStorageHandler(HiveUtils.java:290)
at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:285)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3549)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1437)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1215)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1043)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
… 24 more

2014-03-26 09:14:57,368 ERROR ql.Driver (SessionState.java:printError(419)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hbase/HBaseConfiguration

the Hbase Table “Document” already exists and the describe command gives the following description

'Document',
{NAME => 'categories',..},
{NAME => 'comments',..},
{NAME => 'metadata',..}

I have tried the following things

1) add hive.aux.jars.path in hive-site.xml

hive.aux.jars.path
file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2.jar,file:///usr/lib/hive/lib/hive-hbase-handler-0.12.0.2.0.6.0-76.jar,file:///usr/lib/hbase/lib/hbase-client-0.96.0.2.0.6.0-76-hadoop2.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar

2) add jars using hive add jar command
add jar /usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2.jar;
add jar /usr/lib/hive/lib/hive-hbase-handler-0.12.0.2.0.6.0-76.jar;
add jar /usr/lib/hbase/lib/hbase-client-0.96.0.2.0.6.0-76-hadoop2.jar;
add jar /usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar;
add file /etc/hbase/conf/hbase-site.xml

3) As you have mentioned
export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2:/usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar.

And it is still not working!!

Could you please help me in figuring out how I can add the jars in the hive classpath so that it finds the hbaseConfiguration class.

Or if it is entirely another issue.

Thanks in advance

Regards

Afaque

    Nick | March 26, 2014 at 10:34 am

    Hi Afaque,

    That stack trace looks like the Hive CLI doesn’t have hbase-common in its classpath. Perhaps your value for HADOOP_CLASSPATH isn’t being respected. Do you mind taking this question to our Hive support forums where we can get you on the right track?

    Thanks,
    Nick

François | January 30, 2014 at 9:06 am

Hi Nick,

I am using hbase 0.96 and hive 0.12.

Insertion in the HBase table (03_populate_hbase.hql) ends with the error:
“Job Submission failed with exception ‘java.lang.IllegalArgumentException(Property value must not be null)’”

Any idea where it comes from?

Regards
Francois

    Nick | January 30, 2014 at 11:03 am

    Hi François,

    Yes, I suspect you’re running into HIVE-5515. Let’s continue the conversation on the support forum.

    Thanks,
    Nick

Harry | December 9, 2013 at 5:27 pm

I got lost with the export HADOOP_CLASSPATH example. Finally, this worked:

export HADOOP_CLASSPATH=`hbase classpath`

Hara | November 28, 2013 at 12:13 am

Hi Nick,

How do I query HBase tables through Hive using a custom SerDe? For example, let’s say I want to store tweets in HBase to avoid the small files problem; how do I integrate the Hive JSON SerDe with HBase for querying tweet data?

Regards
Hara

    Nick | December 2, 2013 at 5:36 pm

    Hi Hara,

    I assume you have tweets stored as JSON blobs in a single column qualifier in HBase. You can’t provide a custom SerDe because the regular HBaseStorageHandler already has its own implementation. One option might be to extend that one with your own custom logic. Doing so would make that table definition specific to your query. Instead, I’d read the JSON data as a String and then “hydrate” it using something else: maybe the get_json_object function, or perhaps a custom UDF.

    Best of luck,
    Nick
