HBase Forum

HBase or Hive? Schema design for HBase?

  • #49784
    Sourabh Potnis

    I have to process CSV files present in HDFS that contain metadata at the start of the file and the actual raw data after it. Each file has START_OF_TAGS, END_OF_TAGS, START_OF_DATA and END_OF_DATA tags marking the sections.

    Using a Java program, I separated the metadata and the actual data into two files: MetaFile and DataFile. I then created a Hive table from the MetaFile and loaded the DataFile into it, so the data can be queried through Hive.
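
    Roughly, the splitter looks like the sketch below (simplified: it reads from the local filesystem rather than HDFS, assumes each tag sits on its own line, and the output file names are just examples):

    ```java
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;

    /** Splits a tagged CSV into a metadata file and a data file. */
    public class TagFileSplitter {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 PrintWriter meta = new PrintWriter("MetaFile");
                 PrintWriter data = new PrintWriter("DataFile")) {
                PrintWriter current = null; // which section we are inside, if any
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.equals("START_OF_TAGS")) { current = meta; continue; }
                    if (line.equals("START_OF_DATA")) { current = data; continue; }
                    if (line.equals("END_OF_TAGS") || line.equals("END_OF_DATA")) { current = null; continue; }
                    if (current != null) current.println(line);
                }
            }
        }
    }
    ```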

    But the issue is that there may be hundreds of such files, so creating hundreds of tables, one per file, is not feasible, as each file will have different metadata. Also, I may need to search across all or some of these files based on certain criteria.

    So I am thinking of using HBase for this scenario.

    What should the design of my HBase schema be? How many tables and column families? Should we have a single table and column family for all the files and dump the data with ‘n’ column qualifiers, so that each row has a different number of column qualifiers?

    Also, what should be used as the row key, given that we are not aware of any primary key?




  • #50064
    Devaraj Das

    For the Hive side, have you considered partitions instead of multiple tables?
    For HBase, have you looked at bulk load and ImportTsv? http://hbase.apache.org/book/arch.bulk.load.html
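
    If it helps: ImportTsv is normally run from the command line, but the same documented arguments can be handed straight to its main method from Java. A minimal sketch (the table name, column mapping and input path are placeholders, not from your setup):

    ```java
    import org.apache.hadoop.hbase.mapreduce.ImportTsv;

    /** Drives the stock ImportTsv tool with the same arguments as the CLI. */
    public class BulkLoadDriver {
        public static void main(String[] args) throws Exception {
            ImportTsv.main(new String[] {
                // the first input column becomes the row key, the rest map to family:qualifier
                "-Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2",
                "filedata",           // target HBase table (placeholder)
                "/user/sourabh/data"  // HDFS input directory (placeholder)
            });
        }
    }
    ```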

    Sourabh Potnis

    Thanks for the reply.

    Yes, for HBase I loaded sample data from two files with different metadata/DDL into a single HBase table with a single column family using ImportTsv.
    But I am not sure whether a single-table, single-column-family schema will give optimal performance. What other designs could be implemented in HBase?
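
    For concreteness, the layout I am describing would look roughly like this through the HBase Java client of that era (the table, the 'd' family, and the fileId:lineNo composite key are illustrative choices, not settled design):

    ```java
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Writes one CSV record as a row keyed by fileId plus line number. */
    public class DynamicColumnWriter {
        public static void writeRecord(HTable table, String fileId, long lineNo,
                                       String[] headers, String[] values) throws Exception {
            // Composite row key: rows cluster by file and stay unique per line.
            Put put = new Put(Bytes.toBytes(fileId + ":" + lineNo));
            // Each file's own header names become the column qualifiers, so rows
            // from different files can carry different qualifier sets in one family.
            for (int i = 0; i < headers.length; i++) {
                put.add(Bytes.toBytes("d"), Bytes.toBytes(headers[i]), Bytes.toBytes(values[i]));
            }
            table.put(put);
        }
    }
    ```

    (Varying qualifiers within one family is cheap in HBase; it is the number of column families that the reference guide recommends keeping small.)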

    For Hive, if the table DDL/columns are not fixed, i.e. they differ from file to file, how can we use a single Hive table, even if we partition it?
    Every partition would need to have the same schema, right?
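
    That is, as I understand it, a partitioned table still has one fixed column set; partitioning only adds a partition column such as a file ID. Something like the following over HiveServer2 JDBC (host, table and columns are made up for illustration):

    ```java
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionedTableExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
                 Statement stmt = conn.createStatement()) {
                // One fixed column set for the whole table; file_id is only a
                // partition column, so each source file can land in its own
                // partition, but every partition must fit these same columns.
                stmt.execute("CREATE TABLE IF NOT EXISTS all_files (c1 STRING, c2 STRING) "
                        + "PARTITIONED BY (file_id STRING) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            }
        }
    }
    ```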

