HBase Schema for Twitter data


This topic contains 6 replies, has 2 voices, and was last updated by  anups 1 year, 5 months ago.

  • Creator
    Topic
  • #17074

    anups
    Member

    I have the following Twitter data and want to design an HBase schema for it. The queries I need to run are: tweet volume for a time interval, tweets with the corresponding user info, tweets with the corresponding topic info, and so on. Based on the data below, can anyone tell me whether this schema design is correct: row key as id+timestamp, one column family for user, and the other fields grouped into a primary column family? Any suggestions?

    {
    "created_at":"Tue Feb 19 11:16:34 +0000 2013",
    "id":303825398179979265,
    "id_str":"303825398179979265",
    "text":"Unleashing Innovation Conference Kicks Off – Wall Street Journal (India) http:\/\/t.co\/3bkXJBz1",
    "source":"\u003ca href=\"http:\/\/dlvr.it\" rel=\"nofollow\"\u003edlvr.it\u003c\/a\u003e",
    "truncated":false,
    "in_reply_to_status_id":null,
    "in_reply_to_status_id_str":null,
    "in_reply_to_user_id":null,
    "in_reply_to_user_id_str":null,
    "in_reply_to_screen_name":null,
    "user":{
    "id":948385189,
    "id_str":"948385189",
    "name":"Innovation Plaza",
    "screen_name":"InnovationPlaza",
    "location":"",
    "url":"http:\/\/tinyurl.com\/ee4jiralp",
    "description":"All the latest breaking news about Innovation",
    "protected":false,
    "followers_count":136,
    "friends_count":1489,
    "listed_count":1,
    "created_at":"Wed Nov 14 19:49:18 +0000 2012",
    "favourites_count":0,
    "utc_offset":28800,
    "time_zone":"Beijing",
    "geo_enabled":false,
    "verified":false,
    "statuses_count":149,
    "lang":"en",
    "contributors_enabled":false,
    "is_translator":false,
    "profile_background_color":"131516",
    "profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/781710342\/17a75bf22d9fdad38eebc1c0cd441527.jpeg",
    "profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/781710342\/17a75bf22d9fdad38eebc1c0cd441527.jpeg",
    "profile_background_tile":true,
    "profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3205718892\/8126617ac6b7a0e80fe219327c573852_normal.jpeg",
    "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3205718892\/8126617ac6b7a0e80fe219327c573852_normal.jpeg",
    "profile_link_color":"009999",
    "profile_sidebar_border_color":"FFFFFF",
    "profile_sidebar_fill_color":"EFEFEF",
    "profile_text_color":"333333",
    "profile_use_background_image":true,
    "default_profile":false,
    "default_profile_image":false,
    "following":null,
    "follow_request_sent":null,
    "notifications":null
    },
    "geo":null,
    "coordinates":null,
    "place":null,
    "contributors":null,
    "retweet_count":0,
    "entities":{
    "hashtags":[

    ],
    "urls":[
    {
    "url":"http:\/\/t.co\/3bkXJBz1",
    "expanded_url":"http:\/\/dlvr.it\/2yyG5C",
    "display_url":"dlvr.it\/2yyG5C",
    "indices":[
    73,

Viewing 6 replies - 1 through 6 (of 6 total)


  • Author
    Replies
  • #17533

    anups
    Member

    Hi Larry,

    That explains it well. Thanks.

    Regards
    Anups

    #17450

    Larry Liu
    Moderator

    Hi, Anups,

    My initial idea is to let HBase store all the tweet, user, and URL information. The aggregation can be done in a Java program that reads the information you want from the HBase tables. Use the aggregated data to create a new table like the one you described in the original post.
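
    To make the aggregation step a bit more concrete, here is a minimal sketch of counting tweet volume per hour. It is written in Python rather than Java only to keep it short, and it works on a plain list of created_at strings instead of a real HBase scan. The function names and the hourly bucket format (YYYYMMDDHH) are assumptions for illustration, not part of any HBase API.

```python
from collections import Counter
from datetime import datetime, timezone

# Format of Twitter's created_at field, e.g. "Tue Feb 19 11:16:34 +0000 2013"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def hour_bucket(created_at):
    """Map a created_at string to an hourly bucket key in UTC (hypothetical format)."""
    ts = datetime.strptime(created_at, TWITTER_TIME_FORMAT).astimezone(timezone.utc)
    return ts.strftime("%Y%m%d%H")


def tweet_volume_by_hour(created_at_values):
    """Count tweets per hourly bucket; the result could be written to a summary table."""
    return Counter(hour_bucket(value) for value in created_at_values)


# In a real job these values would come from scanning the tweet table.
volume = tweet_volume_by_hour([
    "Tue Feb 19 11:16:34 +0000 2013",
    "Tue Feb 19 11:59:59 +0000 2013",
    "Tue Feb 19 12:00:00 +0000 2013",
])
```

    Each bucket key and its count could then be stored as one row in the new table, so that a time-interval query becomes a short range scan over bucket keys.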

    If you do not want to save the raw data in HBase, it is totally OK to design a schema like the one in your original post.

    Hope this explains my idea.

    Thanks
    Larry

    #17429

    anups
    Member

    Hi Larry,

    Thanks for your answer.
    This is also a good approach. I also had in mind separating the data into two tables, a user table and a tweet table.
    I am new to HBase, so could you please explain a little more about what you meant by aggregating the data?

    Regards

    Anups

    #17252

    Larry Liu
    Moderator

    Hi, Anups,

    After investigating your data, I think your design is fine. One consideration: the tweet data changes frequently, while the user and URL data do not change that often. So my suggestion is to have 3 tables and then aggregate the data into new tables for your purpose.

    Table 1:
    Table: tweet
    Rowkey: id+timestamp
    Column family: url, userid, and other information you need

    Table 2:
    Table: user
    Rowkey: userid
    Column family: username and other info

    Table 3:
    Table: url
    Rowkey: md5(url)
    Column family: url information
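
    As a rough illustration of the three row keys above, here is a minimal Python sketch. The fixed-width big-endian encoding and the function names are assumptions made for this example; the thread only specifies id+timestamp, userid, and md5(url).

```python
import hashlib
import struct
from datetime import datetime

# Format of Twitter's created_at field, e.g. "Tue Feb 19 11:16:34 +0000 2013"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(created_at):
    """Convert a created_at string to epoch seconds."""
    return int(datetime.strptime(created_at, TWITTER_TIME_FORMAT).timestamp())


def tweet_rowkey(tweet_id, created_at):
    """id+timestamp as two unsigned 64-bit big-endian integers (assumed encoding)."""
    return struct.pack(">QQ", tweet_id, parse_created_at(created_at))


def user_rowkey(user_id):
    """userid as one unsigned 64-bit big-endian integer (assumed encoding)."""
    return struct.pack(">Q", user_id)


def url_rowkey(url):
    """md5(url) as the raw 16-byte digest."""
    return hashlib.md5(url.encode("utf-8")).digest()


# Values taken from the sample tweet in the original post
tweet_key = tweet_rowkey(303825398179979265, "Tue Feb 19 11:16:34 +0000 2013")
user_key = user_rowkey(948385189)
url_key = url_rowkey("http://t.co/3bkXJBz1")
```

    Fixed-width binary keys sort in numeric order byte by byte, which is why they are used here instead of concatenated decimal strings.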

    Then, after aggregating, you can have a 4th table holding the data that meets your requirements.
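
    One possible shape for that 4th table, for the "tweets with corresponding user info" query, is one row per tweet with the user columns copied in. The sketch below uses plain Python dictionaries as stand-ins for the tweet and user tables; the function name and the column names (primary:text, user:screen_name, and so on) are hypothetical and only follow the family:qualifier convention.

```python
def join_tweets_with_users(tweets, users):
    """Build denormalized rows: one row per tweet with selected user columns attached."""
    rows = {}
    for rowkey, tweet in tweets.items():
        # Look up the author in the user table; missing users yield None columns.
        user = users.get(tweet["userid"], {})
        rows[rowkey] = {
            "primary:text": tweet["text"],
            "user:id": tweet["userid"],
            "user:screen_name": user.get("screen_name"),
            "user:followers_count": user.get("followers_count"),
        }
    return rows


# Sample values taken from the tweet in the original post (text shortened,
# row key simplified to the tweet id string for readability).
tweets = {
    "303825398179979265": {
        "userid": 948385189,
        "text": "Unleashing Innovation Conference Kicks Off",
    }
}
users = {948385189: {"screen_name": "InnovationPlaza", "followers_count": 136}}
rows = join_tweets_with_users(tweets, users)
```

    The trade-off is that user columns copied into this table go stale when the user table changes, so the aggregation would need to be re-run from time to time.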

    Let me know your thoughts.

    Thanks

    Larry

    #17220

    anups
    Member

    Thanks, Larry, for checking it.

    Regards
    Anups

    #17080

    Larry Liu
    Moderator

    Hi, Anups

    I am working on it and will get back to you shortly.

    Thanks

    Larry
