For the Hive side, have you considered partitions instead of multiple tables? For HBase, have you looked at bulk load and ImportTsv? http://hbase.apache.org/book/arch.bulk.load.html
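To make the partition suggestion concrete, here is a minimal sketch of one partitioned Hive table keyed by source file, created over JDBC from Java. It only works if the files happen to share a column layout, which the question below says they may not; the table name `sensor_data`, its columns, and the connection URL are all hypothetical. (ImportTsv, on the HBase side, is a ready-made MapReduce job launched from the command line and is described on the linked book page.)

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionedHiveTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 URL; adjust host/port/database for your cluster.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // One table partitioned by source file, instead of one table per file.
            stmt.execute("CREATE TABLE IF NOT EXISTS sensor_data (c1 STRING, c2 STRING) "
                       + "PARTITIONED BY (source_file STRING) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            // Load one DataFile into its own partition.
            stmt.execute("LOAD DATA INPATH '/data/file1_data.csv' "
                       + "INTO TABLE sensor_data PARTITION (source_file='file1')");
        }
    }
}
```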
HBase or Hive? Schema design for HBase?
I have to process CSV files in HDFS that contain metadata at the start of the file and the actual raw data after it. Each file has START_OF_TAGS, END_OF_TAGS, START_OF_DATA and END_OF_DATA tags.
Using a Java program, I separated the metadata and the actual data into two files: a MetaFile and a DataFile. I then created a Hive table from the MetaFile and loaded the DataFile into it, so the data can be queried through Hive.
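For reference, a minimal sketch of that splitting step, assuming each tag sits on its own line; the class name and the `.meta`/`.data` output suffixes are illustrative:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetaDataSplitter {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path input = new Path(args[0]);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(input)));
             PrintWriter meta = new PrintWriter(fs.create(new Path(args[0] + ".meta")));
             PrintWriter data = new PrintWriter(fs.create(new Path(args[0] + ".data")))) {
            String line;
            boolean inTags = false, inData = false;
            while ((line = in.readLine()) != null) {
                if (line.equals("START_OF_TAGS")) { inTags = true;  continue; }
                if (line.equals("END_OF_TAGS"))   { inTags = false; continue; }
                if (line.equals("START_OF_DATA")) { inData = true;  continue; }
                if (line.equals("END_OF_DATA"))   { inData = false; continue; }
                if (inTags)      meta.println(line);  // metadata section -> MetaFile
                else if (inData) data.println(line);  // raw CSV rows -> DataFile
            }
        }
    }
}
```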
But the issue is that there may be hundreds of such files, and creating hundreds of tables, one per file, is not feasible, since each file has different metadata. I may also need to search across all or some of these files based on certain criteria.
So I am thinking of using HBase for this scenario.
What should the design of my HBase schema be? How many tables and column families? Should we have a single table and column family for all the files and store the data with 'n' column qualifiers, so that each row has a different number of column qualifiers?
Also, what should be used as the ROWKEY, since we are not aware of any primary key?
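For illustration only, one commonly discussed layout for data like this is a single table with one short column family, qualifiers taken from each file's own metadata, and a synthetic rowkey composed of a file identifier plus a zero-padded line number, since the raw rows have no natural primary key. The table name `csv_data`, the family `d`, and the sample values are hypothetical; a minimal sketch using the HBase 1.x client API:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CsvToHBase {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("csv_data"))) {
            String[] header = {"temp", "pressure"};  // parsed from this file's MetaFile
            String[] record = {"21.5", "1.01"};      // one line of its DataFile
            String fileId = "file1";
            long rowIndex = 0;
            // Synthetic rowkey: fileId + separator + zero-padded line number,
            // so all rows from one file sort together and can be prefix-scanned.
            Put put = new Put(Bytes.toBytes(fileId + "|" + String.format("%010d", rowIndex)));
            for (int i = 0; i < header.length; i++) {
                // One column family ("d"); qualifiers vary per file,
                // which HBase allows since columns are not fixed by a schema.
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(header[i]),
                              Bytes.toBytes(record[i]));
            }
            table.put(put);
        }
    }
}
```

With a rowkey prefixed by the file id, searching "some of these files" becomes a prefix scan per file, while searching across all files is a full scan or a filter.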