
Pig Forum

Loading into HBase table using Pig fails

  • #44509
    Anand M


    I need to know the steps to integrate Pig with HBase.
    I have an 8-node cluster running HDP 2.0, installed through Ambari.

    I am unable to load data into an HBase table using Pig. Any help would be appreciated.


  • #45322

    Hi Anand,

    Are there any errors in the log? I have done a functional test of Pig and HBase integration and it works for me.

    Here is my Pig script:

    raw = LOAD 'hbase://ambarismoketest'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'family:col01', '-loadKey true -limit 5')
    AS (first_name:chararray);

    dump raw;
    2013-11-06 15:09:14,906 [main] INFO - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    hdfs 2013-11-06 15:06:22 2013-11-06 15:09:14 UNKNOWN


    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_1383768104534_0004 1 0 16 16 16 16 n/a n/a n/a n/a raw MAP_ONLY hdfs://HDP.koelli.localdomain:8020/tmp/temp-807864500/tmp-1599452644,

    Successfully read 1 records (342 bytes) from: "hbase://ambarismoketest"

    Successfully stored 1 records (12 bytes) in: "hdfs://HDP.koelli.localdomain:8020/tmp/temp-807864500/tmp-1599452644"

    Total records written : 1
    Total bytes written : 12
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:

    2013-11-06 15:09:15,137 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2013-11-06 15:09:15,141 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2013-11-06 15:09:15,142 [main] INFO - Key [pig.schematuple] was not set... will not generate code.
    2013-11-06 15:09:15,178 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2013-11-06 15:09:15,178 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
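
    If a script like this fails on your cluster with a ClassNotFoundException for an HBase class, one thing worth checking (a general sketch, not a fix verified on your cluster) is whether Pig can see the HBase client jars. Assuming the hbase command is on the PATH:

    # Put the full HBase classpath in front of whatever Pig already has,
    # then launch the script as usual.
    export PIG_CLASSPATH=$(hbase classpath):$PIG_CLASSPATH
    pig your_script.pig

    (your_script.pig is a placeholder for the script above.)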



    Hi Anand,
    I haven't tried this on an 8-node cluster, but I got it working on my single-node cluster with these settings, which may help you sort out your problem:

    sudo cp /usr/lib/hive/lib/hive-common-0.7.0-cdh3u0.jar /usr/lib/hadoop/lib/
    sudo cp /usr/lib/hive/lib/hbase-0.90.1-cdh3u0.jar /usr/lib/hadoop/lib/

    (Note that these jar names and paths are from a CDH3 install; on an HDP cluster the equivalent Hive and HBase jars will have different versions and may live in different directories.)



    In order to create a new HBase table that is to be managed by Hive, use the STORED BY clause on CREATE TABLE.
    In the Hive shell (hive>):

    CREATE TABLE hbase_table_1(key int, value string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    TBLPROPERTIES ("hbase.table.name" = "xyz");

    After executing the command above, you should be able to see the new (empty) table in the HBase shell:

    $ hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Version: 0.20.3, r902334, Mon Jan 25 13:13:08 PST 2010
    hbase(main):001:0> list
    xyz
    1 row(s) in 0.0530 seconds
    hbase(main):002:0> describe "xyz"
    {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} true
    1 row(s) in 0.0220 seconds
    hbase(main):003:0> scan "xyz"
    0 row(s) in 0.0060 seconds
    Notice that even though a column name "val" is specified in the mapping, only the column family name "cf1" appears in the DESCRIBE output in the HBase shell. This is because in HBase, only column families (not columns) are known in the table-level metadata; column names within a column family are only present at the per-row level.
    Here's how to move data from Hive into the HBase table (see GettingStarted for how to create the example table pokes in Hive first):

    INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;

    Use the HBase shell to verify that the data actually got loaded:

    hbase(main):009:0> scan "xyz"
    98 column=cf1:val, timestamp=1267737987733, value=val_98
    1 row(s) in 0.0110 seconds

    And then query it back via Hive:
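
    The post breaks off here; a simple SELECT along these lines (my reconstruction, not the original author's exact statement) would read the row back:

    SELECT * FROM hbase_table_1;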

    Tom Debus

    Hi Guys,

    I seem to have the same issue with the standard out-of-the-box VM (2.0). All the Hive tutorials work fine, but loading the sample NYSE stock file with Pig fails. Other loads into existing or newly added tables also seem to fail. Happy to attach the log. Or do I need to follow the same procedure to update HBase or Pig?


    Stanley Nguyen

    Hi Tom,

    Any luck resolving the issue? I ran into a similar issue, but it fails at the dump statement. Running from the console works fine, and so does running from the sandbox, so I'm not sure whether some additional configuration is required for a multi-host setup.

    Bob Russell


    Would you be able to post your environment? I have an Ambari-installed cluster, and my insert into HBase is failing with a TableInputFormat class-not-found exception.

    Scott Saufferer

    I'm working with the current HDP 2.1 sandbox and cannot get reads from or writes into HBase either. I used the Pig script below to read from the ambarismoketest table, and I get what looks to be a class-not-found exception when reading.

    raw = LOAD 'hbase://ambarismoketest' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:col01', '-loadKey true -limit 5') AS (first_name:chararray);
    dump raw;

    2014-08-12 16:18:08,159 [main] ERROR - ERROR 1066: Unable to open iterator for alias raw. Backend error : java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.mapreduce.TableSplit not found

    This is out of the box. :( I’ll try a write example next.
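
    One workaround worth trying for the TableSplit ClassNotFoundException (a sketch based on the usual cause, the HBase jars not being shipped to the backend tasks; the paths and jar names below are assumptions about the sandbox layout) is to REGISTER the HBase and ZooKeeper jars at the top of the script:

    -- Ship the HBase and ZooKeeper jars with the MapReduce job;
    -- adjust paths and versions to match your install.
    REGISTER /usr/lib/hbase/lib/hbase-client.jar;
    REGISTER /usr/lib/hbase/lib/hbase-common.jar;
    REGISTER /usr/lib/hbase/lib/hbase-server.jar;
    REGISTER /usr/lib/zookeeper/zookeeper.jar;

    raw = LOAD 'hbase://ambarismoketest'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:col01', '-loadKey true -limit 5')
          AS (first_name:chararray);
    dump raw;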

    Scott Saufferer

    Also, using the 2.1 sandbox environment to test loading into HBase via Pig, per the title of this thread, fails. I followed the instructions in "Using Pig to Bulk Load Data Into HBase" elsewhere on this site, and it fails with the log messages shown below.

    A = LOAD 'hdfs:///tmp/data.tsv' USING PigStorage('\t') AS (id:chararray, c1:chararray, c2:chararray);
    -- DUMP A;
    STORE A INTO 'simple_hcat_load_table' USING org.apache.hcatalog.pig.HCatStorer();

    2014-08-12 16:38:41,719 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
    2014-08-12 16:38:41,737 [main] ERROR - ERROR 2998: Unhandled internal error. org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MutationProto$MutationType
    Details at logfile: /hadoop/yarn/local/usercache/hue/appcache/application_1407882621600_0006/container_1407882621600_0006_01_000002/pig_1407886546063.log
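
    For what it's worth, a missing-class error on ClientProtos$MutationProto often points at the hbase-protocol jar not being on the job classpath. A minimal sketch of a launch that makes it visible (the jar path is an assumption about the sandbox layout, and -useHCatalog is needed because the script uses HCatStorer):

    # Expose the HBase protobuf classes to the MapReduce job, then launch Pig
    # with HCatalog support; adjust the jar path to match your install.
    export HADOOP_CLASSPATH=/usr/lib/hbase/lib/hbase-protocol.jar:$HADOOP_CLASSPATH
    pig -useHCatalog bulk_load.pig

    (bulk_load.pig is a placeholder for the script above.)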

    Chinguun Dorj

    Anand M., how did you install your cluster?

The forum ‘Pig’ is closed to new topics and replies.
