

HBase Forum

Design HBase Schema for Twitter data


    I have the following Twitter data and want to design an HBase schema for it. The queries I need to run are: tweet volume for a time interval, tweets with the corresponding user info, tweets with the corresponding topic info, etc. Based on the data below, can anyone tell me whether this schema design is correct? (Make the rowkey id+timestamp, use one column family for the user, and group the other fields into a primary column family.) Any suggestions?

    "created_at":"Tue Feb 19 11:16:34 +0000 2013",
    "text":"Unleashing Innovation Conference Kicks Off – Wall Street Journal (India) http:\/\/\/3bkXJBz1",
    "source":"\u003ca href=\"http:\/\/\" rel=\"nofollow\"\\u003c\/a\u003e",
    "name":"Innovation Plaza",
    "description":"All the latest breaking news about Innovation",
    "created_at":"Wed Nov 14 19:49:18 +0000 2012",


    Larry Liu

    Hi, Anups

    I am working on it and will get back to you shortly.




    Thanks, Larry, for checking it.


    Larry Liu

    Hi, Anups,

    After looking at your data, I think your design is fine. However, since the tweet data changes frequently while the user and URL data do not change that often, I suggest storing the data in 3 tables and then aggregating it into a new table for your purpose.

    Table 1:
    Table: tweet
    Rowkey: id+timestamp
    Column family: url, userid, and other information you need

    Table 2:
    Table: user
    Rowkey: userid
    Column family: username and other info

    Table 3:
    Table: url
    Rowkey: md5(url)
    Column family: url information

    Then, after aggregating, you can write the result to a 4th table that meets your query requirements.
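As a rough illustration of the row keys for the three tables above, here is a minimal Python sketch. It only builds the key bytes and does not talk to HBase. The function names, the fixed-width big-endian encoding, and the choice of epoch seconds for the timestamp are assumptions made for the example; the thread only specifies id+timestamp, userid, and md5(url).

```python
import hashlib
import struct
from datetime import datetime

# Format of Twitter's "created_at" field, e.g. "Tue Feb 19 11:16:34 +0000 2013".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(created_at: str) -> int:
    """Convert a created_at string to epoch seconds (UTC)."""
    return int(datetime.strptime(created_at, TWITTER_TIME_FORMAT).timestamp())


def tweet_rowkey(tweet_id: int, created_at: str) -> bytes:
    """tweet table: id+timestamp (assumed encoding: two 8-byte big-endian ints)."""
    return struct.pack(">QQ", tweet_id, parse_created_at(created_at))


def user_rowkey(user_id: int) -> bytes:
    """user table: userid (assumed encoding: one 8-byte big-endian int)."""
    return struct.pack(">Q", user_id)


def url_rowkey(url: str) -> bytes:
    """url table: md5(url), which gives a fixed 16-byte key for any URL length."""
    return hashlib.md5(url.encode("utf-8")).digest()
```

For example, `tweet_rowkey(12345, "Tue Feb 19 11:16:34 +0000 2013")` yields a 16-byte key, where 12345 is a made-up tweet id (the sample data in the question does not show one). Fixed-width keys keep rows of one table sorted consistently, but any unambiguous encoding would serve.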

    Let me know your thoughts.




    Hi Larry,

    Thanks for your answer.
    This is also a good approach. I also had in mind separating the data into two tables, a user table and a tweet table.
    I am actually a newbie to HBase, so could you please explain a little more about what you meant by aggregating the data?



    Larry Liu

    Hi, Anups,

    My initial idea is to let HBase store all the tweet, user, and URL information. The aggregation can be done in a Java program that reads the information you want from the HBase tables. Use the aggregated data to create a new table like the one you described in the original post.

    I think if you don't want to save the raw data in HBase, it is totally OK to design a schema like the one in your original post.
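To make the aggregation step more concrete, here is a small Python sketch of the idea (the post suggests Java; Python is used here only to keep the example short). Plain dictionaries stand in for the tweet and user tables, so nothing here is an HBase API call. The field names `userid`, `ts`, `text`, and `name`, and the hourly bucket size, are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(epoch_seconds: int) -> str:
    """Map a timestamp to the UTC hour it falls in, e.g. '2013-02-19 11:00'."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:00")


def aggregate(tweets: dict, users: dict):
    """Join tweets with their users and count tweets per hour.

    tweets: rowkey -> {"userid": ..., "ts": epoch seconds, "text": ...}
    users:  userid -> {"name": ...}
    Returns (volume per hour, list of joined rows for the 4th table).
    """
    volume = Counter()
    joined = []
    for rowkey, tweet in tweets.items():
        volume[hour_bucket(tweet["ts"])] += 1
        user = users.get(tweet["userid"], {})
        joined.append({
            "rowkey": rowkey,
            "text": tweet["text"],
            "username": user.get("name"),  # None if the user row is missing
        })
    return volume, joined


# Stand-in data; the userid 42 and the rowkey "t1" are made up for the example.
users = {42: {"name": "Innovation Plaza"}}
tweets = {
    "t1": {"userid": 42, "ts": 1361272594,
           "text": "Unleashing Innovation Conference Kicks Off"},
}
volume, joined = aggregate(tweets, users)
```

The `joined` rows are what would be written to the 4th table, and `volume` answers the tweet-volume-per-interval query without scanning the raw tweets again.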

    Hope this explains my idea.



    Hi Larry,

    That explains it well. Thanks.


