tweetharvest
Example Analysis

This is an example notebook demonstrating how to establish a connection to a database of tweets collected using tweetharvest. It presupposes that all the setup instructions have been completed (see the README file for that repository) and that the MongoDB server is running as described there. We start by importing the PyMongo package, the official package for accessing MongoDB databases.
import pymongo
Next we establish a link with the database. We know that the database created by tweetharvest is called tweets_db, and within it is a collection of tweets that goes by the name of the project, in this example emotweets.
db = pymongo.MongoClient().tweets_db
coll = db.emotweets
coll
Collection(Database(MongoClient('localhost', 27017), u'tweets_db'), u'emotweets')
We now have an object, coll, that offers full access to the MongoDB API, which we can use to analyse the data in the collected tweets. For instance, in our small example collection, we can count the number of tweets:
coll.count()
10598
Or we can count the number of tweets that are geolocated, i.e. that have a field containing the latitude and longitude of the user when they sent the tweet. We construct a MongoDB query that looks for a non-empty field called coordinates.
query = {'coordinates': {'$ne': None}}
coll.find(query).count()
607
Or how many tweets had the hashtag #happy in them?
query = {'hashtags': {'$in': ['happy']}}
coll.find(query).count()
8258
In order to perform these analyses there are a few things one needs to know, for instance that tweetharvest uses Python 2.7, and not Python 3. Apart from these skills, one needs to know how each status is stored in the database. Here is an easy way to look at the data structure of one tweet.
coll.find_one()
{u'_id': 610008194618757121L, u'contributors': None, u'coordinates': None, u'created_at': datetime.datetime(2015, 6, 14, 8, 57, 41), u'entities': {u'hashtags': [{u'indices': [0, 4], u'text': u'sad'}], u'symbols': [], u'urls': [], u'user_mentions': []}, u'favorite_count': 2, u'favorited': False, u'geo': None, u'hashtags': [u'sad'], u'id_str': u'610008194618757121', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'is_quote_status': False, u'lang': u'und', u'metadata': {u'iso_language_code': u'und', u'result_type': u'recent'}, u'place': None, u'retweet_count': 1, u'retweeted': False, u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', u'text': u'#sad', u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': datetime.datetime(2012, 10, 26, 5, 15, 26), u'default_profile': False, u'default_profile_image': False, u'description': u'xvii // not subtle', u'entities': {u'description': {u'urls': []}}, u'favourites_count': 10565, u'follow_request_sent': None, u'followers_count': 683, u'following': None, u'friends_count': 374, u'geo_enabled': True, u'id': 905331738, u'id_str': u'905331738', u'is_translation_enabled': False, u'is_translator': False, u'lang': u'en', u'listed_count': 0, u'location': u'ceb, phl ', u'name': u'Kim \u2743', u'notifications': None, u'profile_background_color': u'FFFFFF', u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/438273422448009216/1OKtL--y.png', u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/438273422448009216/1OKtL--y.png', u'profile_background_tile': True, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/905331738/1431745819', u'profile_image_url': u'http://pbs.twimg.com/profile_images/607543257296310272/U5Yflc4l_normal.jpg', u'profile_image_url_https': 
u'https://pbs.twimg.com/profile_images/607543257296310272/U5Yflc4l_normal.jpg', u'profile_link_color': u'94D487', u'profile_sidebar_border_color': u'FFFFFF', u'profile_sidebar_fill_color': u'BADB7C', u'profile_text_color': u'E69DF2', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'kimjereza', u'statuses_count': 30050, u'time_zone': u'Beijing', u'url': None, u'utc_offset': 28800, u'verified': False}}
This JSON data structure is documented on the Twitter API website, where each field is described in detail. It is worth studying that description in order to understand how to construct valid queries. tweetharvest is faithful to the core structure of the tweets as described in that documentation, with minor differences introduced for convenience:

- Dates are stored as native Date objects and returned as Python datetime objects. This makes it easier to work on date ranges, sort by date, and do other date- and time-related manipulation.

- A top-level hashtags field is created for convenience. This contains a simple array of all the hashtags contained in a particular tweet and can be queried directly, instead of looking for tags inside a dictionary, inside a list of other entities. It is included for ease of querying but may be ignored if one prefers.

This notebook establishes how you can connect to the database of tweets that you have harvested and how you can use the power of Python and MongoDB to access and analyse your collections. Good luck!
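As a brief sketch of the first convenience above, a date-range query can be written directly with Python datetime objects. The field name follows this notebook; the date range itself is arbitrary and chosen only for illustration:

```python
from datetime import datetime

# Because created_at is stored as a native date, we can compare it
# against Python datetime objects directly inside a MongoDB query.
start = datetime(2015, 6, 13)
end = datetime(2015, 6, 14)
date_query = {'created_at': {'$gte': start, '$lt': end}}

# With a live connection this would be passed to the collection, e.g.:
# coll.find(date_query).count()
print(date_query)
```

No string formatting of dates is needed; PyMongo serialises the datetime objects for us.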
Further Analysis

Assuming we need some more advanced work to be done on the dataset we have collected, below are some sample analyses to dip our toes in the water.

The examples below further illustrate the use of our dataset with standard Python modules used in data science. The typical idiom is to query MongoDB to obtain a cursor over our dataset, import the results into an analytic tool such as pandas, and then produce the analysis. The analyses below require a few extra packages to be installed on our system (pandas, matplotlib, and, later on, NetworkX).
The dataset used in this notebook is not published on the GitHub repository. If you want to experiment with your own data, you need to install the tweetharvest package, harvest some tweets to replicate the emotweets project embedded there, and then run the notebook. This example notebook is intended simply as an illustration of the type of analysis you might want to carry out with your own tools.
%matplotlib inline
import pymongo # in case we have run Part 1 above
import pandas as pd # for data manipulation and analysis
import matplotlib.pyplot as plt
db = pymongo.MongoClient().tweets_db
COLL = db.emotweets
COLL
Collection(Database(MongoClient('localhost', 27017), u'tweets_db'), u'emotweets')
COLL.count()
10598
def count_by_tag(coll, hashtag):
query = {'hashtags': {'$in': [hashtag]}}
count = coll.find(query).count()
return count
print 'Number of #happy tweets: {}'.format(count_by_tag(COLL, 'happy'))
print 'Number of #sad tweets: {}'.format(count_by_tag(COLL, 'sad'))
Number of #happy tweets: 8258
Number of #sad tweets: 2403
query = {'coordinates': {'$ne': None}}
COLL.find(query).count()
607
# return a cursor that iterates over all documents and returns the creation date
cursor = COLL.find({}, {'created_at': 1, '_id': 0})
# list all the creation times and convert to Pandas DataFrame
times = pd.DataFrame(list(cursor))
times = pd.to_datetime(times.created_at)
earliest_timestamp = min(times)
latest_timestamp = max(times)
print 'Creation time for EARLIEST tweet in dataset: {}'.format(earliest_timestamp)
print 'Creation time for LATEST tweet in dataset: {}'.format(latest_timestamp)
Creation time for EARLIEST tweet in dataset: 2015-06-13 07:24:40
Creation time for LATEST tweet in dataset: 2015-06-14 09:29:21
query = {} # empty query means find all documents
# return just two columns, the date of creation and the id of each document
projection = {'created_at': 1}
df = pd.DataFrame(list(COLL.find(query, projection)))
times = pd.to_datetime(df.created_at)
df.set_index(times, inplace=True)
df.drop('created_at', axis=1, inplace=True)
tweets_all = df.resample('60Min', how='count')
tweets_all.plot(figsize=[12, 7], title='Number of Tweets per Hour', legend=None);
As an example of a more complex query, the following demonstrates how to extract all tweets that are not retweets, contain the hashtag #happy as well as at least one other hashtag, and are written in English. These attributes are passed to the .find method as a dictionary, and the hashtags are then extracted. The hashtags of the first ten tweets meeting this specification are then printed out.
query = { # find all documents that:
'hashtags': {'$in': ['happy']}, # contain #happy hashtag
'retweeted_status': None, # are not retweets
'hashtags.1': {'$exists': True}, # and have more than 1 hashtag
'lang': 'en' # written in English
}
projection = {'hashtags': 1, '_id': 0}
cursor = COLL.find(query, projection)
for tags in cursor[:10]:
print tags['hashtags']
[u'rains', u'drenched', u'happy', u'kids', u'birds', u'animals', u'tatasky', u'home', u'sad', u'life']
[u'quote', u'wisdom', u'sad', u'happy']
[u'truro', u'nightout', u'drunk', u'nationalginday', u'happy', u'fun', u'cornwall', u'girlsnight', u'zafiros']
[u'happy', u'positivity']
[u'vaghar', u'cook', u'ghee', u'colzaoil', u'spices', u'love', u'happy', u'digestion', u'ayurveda', u'intuitive']
[u'happy', u'yay']
[u'kinderscout', u'peakdistrict', u'darkpeaks', u'happy']
[u'ichoisehappy', u'life', u'happy', u'quote', u'instaphoto']
[u'streetartthrowdown', u'me', u'myself', u'wacky', u'pretty', u'cute', u'nice', u'awesome', u'cool', u'smile', u'happy', u'selfie', u'selca']
[u'brothers', u'love', u'forever', u'heart', u'bless', u'live', u'family', u'happy', u'proud']
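To see what such a query is doing, the same conditions can be mimicked in plain Python over a list of tweet dictionaries. The toy documents below are made up for illustration (a real tweet has many more fields):

```python
# Toy stand-ins for tweet documents (illustrative, not real data).
tweets = [
    {'hashtags': ['happy', 'fun'], 'lang': 'en', 'retweeted_status': None},
    {'hashtags': ['happy'], 'lang': 'en', 'retweeted_status': None},          # only one tag
    {'hashtags': ['happy', 'sad'], 'lang': 'es', 'retweeted_status': None},   # not English
    {'hashtags': ['happy', 'yay'], 'lang': 'en', 'retweeted_status': {'id': 1}},  # a retweet
]

def matches(tweet):
    # Mirrors the MongoDB query: contains 'happy', is not a retweet,
    # has more than one hashtag, and is written in English.
    return ('happy' in tweet['hashtags']
            and tweet['retweeted_status'] is None
            and len(tweet['hashtags']) > 1
            and tweet['lang'] == 'en')

selected = [t['hashtags'] for t in tweets if matches(t)]
print(selected)  # only the first toy tweet qualifies
```

In MongoDB, the 'hashtags.1': {'$exists': True} clause plays the role of the len(...) > 1 test: it matches only documents whose hashtags array has an element at index 1.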
We could use this method to produce a network of hashtags. The following illustrates this by extracting every pair of hashtags that co-occur in these tweets, building a weighted graph in which each node is a hashtag and each edge weight counts how often a pair co-occurred, and removing the node for happy itself (since it is connected to all the others by definition). In order to run this, we need to install the NetworkX package (pip install networkx, documentation) and import it as well as the combinations function from Python's standard library itertools module.
from itertools import combinations
import networkx as nx
def gen_edges(coll, hashtag):
query = { # find all documents that:
'hashtags': {'$in': [hashtag]}, # contain hashtag of interest
'retweeted_status': None, # are not retweets
'hashtags.1': {'$exists': True}, # and have more than 1 hashtag
'lang': 'en' # written in English
}
projection = {'hashtags': 1, '_id': 0}
cursor = coll.find(query, projection)
for tags in cursor:
hashtags = tags['hashtags']
for edge in combinations(hashtags, 2):
yield edge
def build_graph(coll, hashtag, remove_node=True):
g = nx.Graph()
for u,v in gen_edges(coll, hashtag):
if g.has_edge(u,v):
# add 1 to weight attribute of this edge
g[u][v]['weight'] = g[u][v]['weight'] + 1
else:
# create new edge of weight 1
g.add_edge(u, v, weight=1)
if remove_node:
# since hashtag is connected to every other node,
# it adds no information to this graph; remove it.
g.remove_node(hashtag)
return g
G = build_graph(COLL, 'happy')
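The weighting step inside build_graph can also be sketched without NetworkX, using collections.Counter keyed on unordered tag pairs. The toy tag lists below are made up for illustration:

```python
from collections import Counter
from itertools import combinations

# Toy lists of co-occurring hashtags (illustrative data only).
tag_lists = [
    ['love', 'me', 'smile'],
    ['love', 'me'],
    ['love', 'smile'],
]

weights = Counter()
for tags in tag_lists:
    for u, v in combinations(tags, 2):
        # frozenset ignores pair order, so ('love', 'me') and
        # ('me', 'love') count towards the same edge.
        weights[frozenset((u, v))] += 1

print(weights[frozenset(('love', 'me'))])  # the pair appears in two lists
```

NetworkX does the same bookkeeping for us via the weight attribute on each edge, while also giving us graph algorithms and drawing for free.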
Finally we remove rare edges (defined here, arbitrarily, as edges that have a weight of less than 25), then print a table of the remaining edges sorted in descending order by weight.
def trim_edges(g, weight=1):
# function from http://shop.oreilly.com/product/0636920020424.do
g2 = nx.Graph()
for u, v, edata in g.edges(data=True):
if edata['weight'] > weight:
g2.add_edge(u, v, **edata)
return g2
G2 = trim_edges(G, weight=25)
df = pd.DataFrame([(u, v, edata['weight'])
for u, v, edata in G2.edges(data=True)],
columns = ['from', 'to', 'weight'])
df.sort(['weight'], ascending=False, inplace=True)
df
 | from | to | weight |
---|---|---|---|
7 | love | me | 78 |
1 | cute | love | 74 |
14 | love | follow | 74 |
11 | love | instagood | 72 |
17 | love | photooftheday | 64 |
48 | me | instagood | 63 |
43 | photooftheday | instagood | 63 |
31 | follow | instagood | 63 |
4 | cute | follow | 63 |
0 | cute | me | 62 |
29 | follow | me | 60 |
3 | cute | instagood | 60 |
41 | photooftheday | me | 59 |
33 | follow | photooftheday | 58 |
32 | follow | followme | 58 |
30 | follow | tbt | 57 |
47 | tbt | instagood | 57 |
46 | tbt | me | 57 |
37 | followme | me | 57 |
10 | love | tbt | 57 |
16 | love | followme | 57 |
6 | cute | photooftheday | 56 |
2 | cute | tbt | 56 |
39 | followme | instagood | 56 |
42 | photooftheday | tbt | 55 |
38 | followme | tbt | 52 |
5 | cute | followme | 51 |
40 | followme | photooftheday | 50 |
12 | love | smile | 37 |
9 | love | family | 33 |
34 | happiness | truth | 31 |
24 | allah | happiness | 31 |
23 | allah | truth | 31 |
13 | love | fun | 29 |
8 | love | life | 29 |
20 | allah | lifegoals | 28 |
44 | good | prophet | 28 |
21 | allah | prophet | 28 |
35 | happiness | lifegoals | 28 |
22 | allah | promise | 28 |
28 | promise | good | 28 |
27 | promise | prophet | 28 |
19 | allah | good | 28 |
45 | lifegoals | truth | 28 |
18 | selfie | smile | 27 |
25 | i_am | positive | 27 |
15 | love | friends | 27 |
36 | positive | affirmation | 27 |
26 | i_am | affirmation | 27 |
G3 = trim_edges(G, weight=35)
pos=nx.circular_layout(G3) # positions for all nodes
# nodes
nx.draw_networkx_nodes(G3, pos, node_size=700,
linewidths=0, node_color='#cccccc')
edge_list = [(u, v) for u, v in G3.edges()]
weight_list = [edata['weight']/5.0 for u, v, edata in G3.edges(data=True)]
# edges
nx.draw_networkx_edges(G3, pos,
edgelist=edge_list,
width=weight_list,
alpha=0.4,edge_color='b')
# labels
nx.draw_networkx_labels(G3, pos, font_size=20,
font_family='sans-serif', font_weight='bold')
fig = plt.gcf()
fig.set_size_inches(10, 10)
plt.axis('off');
#sad

G_SAD = build_graph(COLL, 'sad')
G2S = trim_edges(G_SAD, weight=5)
df = pd.DataFrame([(u, v, edata['weight'])
for u, v, edata in G2S.edges(data=True)],
columns = ['from', 'to', 'weight'])
df.sort(['weight'], ascending=False, inplace=True)
df
 | from | to | weight |
---|---|---|---|
7 | pathetic | rude | 36 |
13 | depressed | quote | 19 |
18 | quote | quotes | 15 |
23 | funy | all_sms_pkg | 13 |
8 | stylish | all_sms_pkg | 13 |
9 | stylish | funy | 13 |
20 | quote | happy | 11 |
17 | quote | quote | 10 |
12 | depressed | quotes | 10 |
2 | info | stylish | 8 |
5 | info | just | 8 |
10 | stylish | just | 8 |
14 | depressed | depression | 8 |
22 | just | funy | 7 |
21 | just | all_sms_pkg | 7 |
0 | teen | sadgirl | 7 |
15 | depressed | happy | 7 |
6 | suicide | suicidal | 7 |
4 | info | funy | 7 |
3 | info | all_sms_pkg | 7 |
16 | quote | love | 6 |
1 | teen | v | 6 |
19 | quote | depression | 6 |
11 | depressed | alone | 6 |
24 | quotes | love | 6 |
25 | quotes | happy | 6 |
26 | v | sadgirl | 6 |
The graph is drawn with a spring layout to bring out the disconnected sub-graphs more clearly.
G3S = trim_edges(G_SAD, weight=5)
pos=nx.spring_layout(G3S) # positions for all nodes
# nodes
nx.draw_networkx_nodes(G3S, pos, node_size=700,
linewidths=0, node_color='#cccccc')
edge_list = [(u, v) for u, v in G3S.edges()]
weight_list = [edata['weight'] for u, v, edata in G3S.edges(data=True)]
# edges
nx.draw_networkx_edges(G3S, pos,
edgelist=edge_list,
width=weight_list,
alpha=0.4,edge_color='b')
# labels
nx.draw_networkx_labels(G3S, pos, font_size=12,
font_family='sans-serif', font_weight='bold')
fig = plt.gcf()
fig.set_size_inches(13, 13)
plt.axis('off');