HOWTO use Hive to SQLize your own Tweets – Part One: ETL and Schema Discovery
Note: Continued in part two…
Your Twitter Archive




Digging In: ETL
There is a file called tweets.csv, but that is not the file we are interested in. It has very little detail. The files we are interested is in, which contain all the tweet data, are in the data/js/tweets directory. There is one file per month, laid out like this:
2008_08.js 2009_03.js 2009_10.js 2010_05.js 2010_12.js 2011_07.js 2012_02.js 2012_09.js
2008_09.js 2009_04.js 2009_11.js 2010_06.js 2011_01.js 2011_08.js 2012_03.js 2012_10.js
2008_10.js 2009_05.js 2009_12.js 2010_07.js 2011_02.js 2011_09.js 2012_04.js 2012_11.js
2008_11.js 2009_06.js 2010_01.js 2010_08.js 2011_03.js 2011_10.js 2012_05.js 2012_12.js
2008_12.js 2009_07.js 2010_02.js 2010_09.js 2011_04.js 2011_11.js 2012_06.js 2013_01.js
2009_01.js 2009_08.js 2010_03.js 2010_10.js 2011_05.js 2011_12.js 2012_07.js 2013_02.js
2009_02.js 2009_09.js 2010_04.js 2010_11.js 2011_06.js 2012_01.js 2012_08.js 2013_03.js
Inconveniently, the first line of each file is javascript:
Grailbird.data.tweets_2008_08 =
The first thing we’ll need to do us remove that line. Once we do so, the file is a large json array. Once we have an array, we can easily convert to the JSON format that Hive expects: one json object per newline.
I’ve created a python script that removes the first line of text, and converts and prints a one-json-object-per-newline format here: convert.py.
#!/usr/bin/env python
import sys, os, glob
import json
tweet_base = sys.argv[1]
# Open tweets
os.chdir(tweet_base + "/data/js/tweets")
for filename in glob.glob("*.js"):
f = open(filename)
lines = f.readlines()
# Chop the first line (its javascript), parse the resulting array of tweets
entire = ''
for line in lines[1:]:
entire += line
tweets = json.loads(entire)
# Now print, one json object per line
for tweet in tweets:
print json.dumps(tweet)
# Done - we've printed the tweets, one json record per line
To run the code:
convert.py ~/Downloads/tweets > all_tweets.json
Amusingly, my first tweet is about being terrified surfing Hurricane Gustav:
{
"entities": {
"user_mentions": [],
"media": [],
"hashtags": [],
"urls": []
},
"text": "Paddled out, tried to psyche myself into wave for 30 minutes, then was afraid to come in for 30 more. \"I jusy want to go home.\"",
"created_at": "Mon Sep 01 01:16:47 +0000 2008",
"source": "Twinkle",
"id_str": "905266904",
"geo": {},
"id": 905266904,
"user": {
"verified": false,
"name": "Russell Jurney",
"profile_image_url_https": "http://si0.twimg.com/profile_images/2964060639/9a98c1eb08f57472743caa4a5ae3260b_normal.jpeg",
"protected": false,
"id_str": "15831927",
"id": 15831927,
"screen_name": "rjurney"
}
}
Schema Discovery
To load our tweets into Hive, we need a schema. There is no explicit schema for json data, we must infer it. Along these lines, Hortonworks co-founder Owen O’Malley created a tool, available on github as hive-json, that extracts a Hive schema from a collection of JSON documents. Given a collection of schemas, the output schema will be the superset of them all, which creates a reasonable SQL schema: optional fields are often null.
bin/find-json-schema ~/Downloads/all_tweets.json
Reading /Users/rjurney/Software/hive_tweets/tweets.json
24655 records read
create table tbl (
created_at string,
entities struct <
hashtags: array ,
text: string>>,
media: array ,
media_url: string,
media_url_https: string,
sizes: array >,
url: string>>,
urls: array ,
url: string>>,
user_mentions: array ,
name: string,
screen_name: string>>>,
geo struct <
coordinates: array ,
type: string>,
id bigint,
id_str string,
in_reply_to_screen_name string,
in_reply_to_status_id bigint,
in_reply_to_status_id_str string,
in_reply_to_user_id int,
in_reply_to_user_id_str string,
retweeted_status struct <
created_at: string,
entities: struct <
hashtags: array ,
text: string>>,
media: array ,
media_url: string,
media_url_https: string,
sizes: array >,
url: string>>,
urls: array ,
url: string>>,
user_mentions: array ,
name: string,
screen_name: string>>>,
geo: struct <
coordinates: array ,
type: string>,
id: bigint,
id_str: string,
in_reply_to_screen_name: string,
in_reply_to_status_id: bigint,
in_reply_to_status_id_str: string,
in_reply_to_user_id: int,
in_reply_to_user_id_str: string,
source: string,
text: string,
user: struct <
id: int,
id_str: string,
name: string,
profile_image_url_https: string,
protected: boolean,
screen_name: string,
verified: boolean>>,
source string,
text string,
user struct <
id: int,
id_str: binary,
name: string,
profile_image_url_https: string,
protected: boolean,
screen_name: string,
verified: boolean>
)
So we’ve got JSON/newline version of our tweets, and a schema for them in Hive. In our next post, we’ll use the Hive-JSON-Serde to load the tweets and begin our analysis!
Continued in part two…