HBase or Hive? Schema design for HBase?

This topic contains 2 replies, has 2 voices, and was last updated by  Sourabh Potnis 6 months, 2 weeks ago.

  • Creator
    Topic
  • #49784

    Sourabh Potnis
    Participant

    I have to process CSV files stored in HDFS. Each file contains metadata at the start, followed by the actual raw data, delimited by START_OF_TAGS, END_OF_TAGS, START_OF_DATA and END_OF_DATA tags.

    Using a Java program, I separated the metadata and the actual data into two files: MetaFile and DataFile. I then created a Hive query and table from MetaFile and loaded DataFile into that table so it can be queried through Hive.
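
    A minimal sketch of that split step, written in Python rather than Java for brevity. It assumes each tag sits on its own line and that the metadata block is a single comma-separated list of column names; the thread does not confirm either point, and the function name and sample values are hypothetical.

```python
def split_tagged_file(lines):
    """Split lines into (metadata_lines, data_lines) using the four tags.

    Assumption: every tag appears alone on its own line.
    """
    meta, data = [], []
    section = None  # list currently being filled, or None outside any block
    for raw in lines:
        line = raw.strip()
        if line == "START_OF_TAGS":
            section = meta
        elif line == "START_OF_DATA":
            section = data
        elif line in ("END_OF_TAGS", "END_OF_DATA"):
            section = None
        elif section is not None and line:
            section.append(line)
    return meta, data


# Hypothetical sample file content, for illustration only.
sample = [
    "START_OF_TAGS",
    "sensor_id,temp",
    "END_OF_TAGS",
    "START_OF_DATA",
    "s1,20.5",
    "s2,21.0",
    "END_OF_DATA",
]
meta_lines, data_lines = split_tagged_file(sample)
```

    Lines outside any tagged block are ignored, which may or may not be what a real file needs.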

    The issue is that there may be hundreds of such files, each with different metadata, so creating a separate table for every file is not feasible. I may also need to search across all or some of these files based on certain criteria.

    So I am thinking of using HBase for this scenario.

    What should the design of my HBase schema be? How many tables and column families? Should we have a single table and a single column family for all the files, and store the data with 'n' column qualifiers, so that each row can have a different number of column qualifiers?

    Also, what should be used as the row key, given that we do not know the primary key?
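
    One possible shape for the single-table idea, sketched with plain Python data structures and no HBase client calls. It assumes a synthetic row key built from a file identifier plus a zero-padded record number, and a single column family named "d"; the names build_rows, file_id and "d" are hypothetical and not taken from any HBase API. HBase keeps rows sorted by row key, so a shared file prefix would keep one file's rows contiguous.

```python
import csv


def build_rows(file_id, column_names, data_lines, family="d"):
    """Map CSV records to (row_key, {"family:qualifier": value}) pairs.

    Assumption: no natural primary key exists, so the key is synthetic.
    """
    rows = []
    for record_no, line in enumerate(data_lines):
        values = next(csv.reader([line]))  # parse a single CSV record
        # Zero padding keeps lexicographic order equal to numeric order.
        row_key = f"{file_id}#{record_no:010d}"
        cells = {
            f"{family}:{name}": value
            for name, value in zip(column_names, values)
        }
        rows.append((row_key, cells))
    return rows


# Two hypothetical files with different columns share one table layout.
rows_a = build_rows("fileA", ["sensor_id", "temp"], ["s1,20.5", "s2,21.0"])
rows_b = build_rows("fileB", ["city", "zip", "pop"], ["Pune,411001,3100000"])
```

    Each row simply carries whatever qualifiers its source file defines, so rows from different files can have different numbers of columns. Whether this performs well depends on the access pattern, which this sketch does not address.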

    Thanks.

    SSP

Viewing 2 replies - 1 through 2 (of 2 total)

  • Author
    Replies
  • #50167

    Sourabh Potnis
    Participant

    Hi,
    Thanks for the reply.

    Yes, for HBase I loaded sample data from two files with different metadata/DDL into a single HBase table with a single column family, using ImportTSV.
    But I am not sure whether a schema with a single table and a single column family will give optimal performance. What other designs could be implemented in HBase?
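
    For reference, a sketch of how the ImportTsv arguments could be generated per file when the column list differs from file to file. It assumes the first field of each record is the row key and that all columns go into one family called "d"; the table name, input path and helper name are hypothetical. The helper only builds the argument list and does not run anything.

```python
def importtsv_args(table, input_path, column_names, family="d", separator=","):
    """Build the argument list for an ImportTsv run over one data file.

    Assumption: the first field of every record holds the row key.
    """
    columns = ["HBASE_ROW_KEY"] + [f"{family}:{name}" for name in column_names]
    return [
        "hbase",
        "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        f"-Dimporttsv.columns={','.join(columns)}",
        f"-Dimporttsv.separator={separator}",
        table,
        input_path,
    ]


# Hypothetical table name and HDFS path, for illustration only.
args = importtsv_args("all_files", "/data/fileA/DataFile", ["sensor_id", "temp"])
command_line = " ".join(args)
```

    The resulting list could be passed to subprocess.run on a machine where the hbase command is available.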

    For Hive, if the table DDL/columns are not fixed, i.e. they differ from file to file, how can we use a single Hive table, even if we partition it?
    It would need to have the same schema, right?

    Thanks

    #50064

    Devaraj Das
    Participant

    For the Hive side, have you considered partitions instead of multiple tables?
    For HBase, have you looked at bulk load and ImportTSV? http://hbase.apache.org/book/arch.bulk.load.html
