HOWTO use Hive to SQLize your own Tweets – Part One: ETL and Schema Discovery

Note: Continued in part two

Your Twitter Archive

Twitter has a new feature, Your Twitter Archive, that enables any user to download their tweets as an archive. To find it, look at the bottom of your account settings page for an option labeled ‘Your Twitter archive,’ which will package your tweets as a JSON/JavaScript web application and email them to you as a zip file.

Be patient: this process can take several days, particularly if you have a lot of tweets (I have 24K tweets, and it took 4-5 days to receive mine).

After a few hours or days, you’ll receive an email with a download link. Download your tweets and unzip the archive to reveal its contents.


Digging In: ETL

There is a file called tweets.csv, but it has very little detail and is not the file we are interested in. The files we want, which contain all the tweet data, live in the data/js/tweets directory. There is one file per month, laid out like this:

2008_08.js	2009_03.js	2009_10.js	2010_05.js	2010_12.js	2011_07.js	2012_02.js	2012_09.js
2008_09.js	2009_04.js	2009_11.js	2010_06.js	2011_01.js	2011_08.js	2012_03.js	2012_10.js
2008_10.js	2009_05.js	2009_12.js	2010_07.js	2011_02.js	2011_09.js	2012_04.js	2012_11.js
2008_11.js	2009_06.js	2010_01.js	2010_08.js	2011_03.js	2011_10.js	2012_05.js	2012_12.js
2008_12.js	2009_07.js	2010_02.js	2010_09.js	2011_04.js	2011_11.js	2012_06.js	2013_01.js
2009_01.js	2009_08.js	2010_03.js	2010_10.js	2011_05.js	2011_12.js	2012_07.js	2013_02.js
2009_02.js	2009_09.js	2010_04.js	2010_11.js	2011_06.js	2012_01.js	2012_08.js	2013_03.js

Inconveniently, the first line of each file is JavaScript:

Grailbird.data.tweets_2008_08 =

The first thing we’ll need to do is remove that line. Once we do, what remains is one large JSON array, which we can easily convert to the format Hive expects: one JSON object per line.

I’ve created a Python script, convert.py, that removes the first line of each file and prints the tweets, one JSON object per line:

#!/usr/bin/env python

import glob
import json
import os
import sys

# Path to the unzipped archive, e.g. ~/Downloads/tweets
tweet_base = sys.argv[1]

# The monthly tweet files live under data/js/tweets
os.chdir(os.path.join(tweet_base, "data/js/tweets"))

# Sort the filenames so the months come out in chronological order
for filename in sorted(glob.glob("*.js")):
    with open(filename) as f:
        lines = f.readlines()

    # Chop the first line (it's JavaScript), parse the remaining JSON array of tweets
    tweets = json.loads("".join(lines[1:]))

    # Now print, one JSON object per line
    for tweet in tweets:
        print(json.dumps(tweet))

# Done - we've printed the tweets, one JSON record per line

To run the code:

convert.py ~/Downloads/tweets > all_tweets.json
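
Before going further, it’s worth a quick sanity check that every line of the output parses as standalone JSON. A minimal snippet to do that (mine, not from the original post):

import json

# Sanity check: every line of all_tweets.json should be valid standalone JSON
count = 0
with open("all_tweets.json") as f:
    for line in f:
        json.loads(line)  # raises ValueError on any malformed line
        count += 1
print("%d tweets, all valid JSON" % count)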

Amusingly, my first tweet is about being terrified while surfing during Hurricane Gustav:

{
    "entities": {
        "user_mentions": [],
        "media": [],
        "hashtags": [],
        "urls": []
    },
    "text": "Paddled out, tried to psyche myself into wave for 30 minutes, then was afraid to come in for 30 more. \"I jusy want to go home.\"",
    "created_at": "Mon Sep 01 01:16:47 +0000 2008",
    "source": "Twinkle",
    "id_str": "905266904",
    "geo": {},
    "id": 905266904,
    "user": {
        "verified": false,
        "name": "Russell Jurney",
        "profile_image_url_https": "http://si0.twimg.com/profile_images/2964060639/9a98c1eb08f57472743caa4a5ae3260b_normal.jpeg",
        "protected": false,
        "id_str": "15831927",
        "id": 15831927,
        "screen_name": "rjurney"
    }
}
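
With the tweets in one-object-per-line form, ad hoc inspection is easy even before Hive enters the picture. For example, here is a quick illustrative snippet (again mine, not from the original post) that tallies tweets by the client they were sent from, using the ‘source’ field shown above:

import json
from collections import Counter

# Tally tweets by the 'source' (client) field
counts = Counter()
with open("all_tweets.json") as f:
    for line in f:
        counts[json.loads(line)["source"]] += 1

for source, n in counts.most_common(5):
    print("%5d  %s" % (n, source))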

Schema Discovery

To load our tweets into Hive, we need a schema. JSON data carries no explicit schema, so we must infer one. Along these lines, Hortonworks co-founder Owen O’Malley created a tool, available on GitHub as hive-json, that extracts a Hive schema from a collection of JSON documents. The output schema is the superset of the schemas of all the individual documents, which makes a reasonable SQL schema: fields that are optional in the data simply come out null.
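
To make the superset idea concrete, here is a toy sketch of the approach in Python (illustrative only, not the hive-json implementation): union the field names, and the types observed for each, across every document.

import json
from collections import defaultdict

# Toy illustration of superset schema inference (not the hive-json code):
# record every top-level field name and the set of Python types seen for it.
def infer_fields(path):
    fields = defaultdict(set)
    with open(path) as f:
        for line in f:
            for key, value in json.loads(line).items():
                fields[key].add(type(value).__name__)
    return dict(fields)

print(infer_fields("all_tweets.json"))

Running the actual tool over our converted tweets: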

bin/find-json-schema ~/Downloads/all_tweets.json

Reading /Users/rjurney/Software/hive_tweets/tweets.json
24655 records read

create table tbl (
   created_at string,
   entities struct <
      hashtags: array <struct <
            indices: array <int>,
            text: string>>,
      media: array <struct <
            indices: array <int>,
            media_url: string,
            media_url_https: string,
            sizes: array <struct <
                  h: int,
                  resize: string,
                  w: int>>,
            url: string>>,
      urls: array <struct <
            indices: array <int>,
            url: string>>,
      user_mentions: array <struct <
            indices: array <int>,
            name: string,
            screen_name: string>>>,
   geo struct <
      coordinates: array <double>,
      type: string>,
   id bigint,
   id_str string,
   in_reply_to_screen_name string,
   in_reply_to_status_id bigint,
   in_reply_to_status_id_str string,
   in_reply_to_user_id int,
   in_reply_to_user_id_str string,
   retweeted_status struct <
      created_at: string,
      entities: struct <
         hashtags: array <struct <
               indices: array <int>,
               text: string>>,
         media: array <struct <
               indices: array <int>,
               media_url: string,
               media_url_https: string,
               sizes: array <struct <
                     h: int,
                     resize: string,
                     w: int>>,
               url: string>>,
         urls: array <struct <
               indices: array <int>,
               url: string>>,
         user_mentions: array <struct <
               indices: array <int>,
               name: string,
               screen_name: string>>>,
      geo: struct <
         coordinates: array <double>,
         type: string>,
      id: bigint,
      id_str: string,
      in_reply_to_screen_name: string,
      in_reply_to_status_id: bigint,
      in_reply_to_status_id_str: string,
      in_reply_to_user_id: int,
      in_reply_to_user_id_str: string,
      source: string,
      text: string,
      user: struct <
         id: int,
         id_str: string,
         name: string,
         profile_image_url_https: string,
         protected: boolean,
         screen_name: string,
         verified: boolean>>,
   source string,
   text string,
   user struct <
      id: int,
      id_str: binary,
      name: string,
      profile_image_url_https: string,
      protected: boolean,
      screen_name: string,
      verified: boolean>
)

So we’ve got a newline-delimited JSON version of our tweets and a Hive schema for them. In our next post, we’ll use the Hive-JSON-Serde to load the tweets and begin our analysis!

Continued in part two
