Research regarding social media analysis is a rapidly evolving field. The objective of this project is to test if we can observe the spread of influenza-like-symptoms (ILI) through an analysis of geo-tagged Twitter posts in Toronto, ON, for October 2012. Using tweets containing geographic information, we should be able to observe that people who come into contact with more sick people are more likely to be sick themselves. In the future I will perform some machine learning techniques to identify sick people in November and December, then rerun this analysis.
This project was inspired by work done by Sadilek A., Kautz H., Silenzi V. (2012) Modeling Spread of Disease from Social Interactions.
There is certainly a lot of room for improvement / next steps, and I would love to hear some suggestions. I also have the code written in R if anyone is interested.
print "\nSick Users: How many sick people they came in contact with in October\n"
print groups_sick.sum().b
print "\nHealthy Users: How many sick people they came in contact with in October\n"
print groups_never_sick.sum().b
print "\nProportion of people in October that were sick\n"
print groups_sick.sum().b / groups_never_sick.sum().b
plt.plot(groups_sick.sum().b / groups_never_sick.sum().b)
plt.xticks([0,1,2,3], ["1-15","15-30","30-45","45-60"])
plt.yticks([0,.1,.2,.3,.4,.5,.6,.7,.8], ["0%","10%","20%","30%","40%","50%","60%","70%", "80%"])
plt.xlabel("How many sick people were contacted", size = 12)
plt.ylabel("Proportion of people who became ill at t + 1", size = 12)
plt.minorticks_on()
plt.grid(b=True, which='major', color='black', alpha=0.5)
plt.title("\nIndividual encounters with sick people\n 8 hour time window, 100 meters", size = 14)
plt.savefig(path+'1.png', dpi=96)
Sick Users: How many sick people they came in contact with in October number_of_sick_people_contacted (1, 15] 2324 (15, 30] 136 (30, 45] 70 (45, 60] 27 Name: b, dtype: int64 Healthy Users: How many sick people they came in contact with in October number_of_sick_people_contacted (1, 15] 20668 (15, 30] 543 (30, 45] 577 (45, 60] 29 Name: b, dtype: int64 Proportion of people in October that were sick number_of_sick_people_contacted (1, 15] 0.112444 (15, 30] 0.250460 (30, 45] 0.121317 (45, 60] 0.931034 Name: b, dtype: float64
The data used is a collection of tweets streamed from the Twitter API: https://dev.twitter.com. The data were saved into a .csv with the following information:
1.Username
2.Tweet text
3.Date & time
4.Latitude / Longitude
I am not going to pretend that I invented this process, there are many many ways to do this. I used a python library called tweepy. See the link below.
The Twitter streaming protocol dictates that a stream can provide up to 1% of the global value of tweets at any given time. Since the stream was specified to collect only geo-tagged tweets in the Toronto area, there were no instances where the stream was throttled; therefore the dataset used in this analysis should contain every usable tweet. The initial dataset consisted of about 900 000 tweets, however, after cleaning the data for errors, missing usernames, and promotional tweets, the dataset contained some 568 192 posts with 35 053 unique users.
The biggest hurdle for this project was classifying which tweets were from sick people, and which were not. Generally some type of support vector machine or other machine learning technique is used.
While this problem might seem trivial at first, the issue is that words and phrases used to describe ILI have inconsistant connotations and uses. For example:
“That was a sick concert last night”
“I am terribly sick with a cough and cold # influenza”
However, I first tackled this data before I learned the advanced techniques to filter them; thus the data were filtered using keywords and then manually classified based on the judgement of me and a few undergraduate students.
I will be importing modules as they are required in this code, I think it makes it easier to keep track of what I am doing for those less familar with the libraries.
import pandas
#set a path for fle operations
path = '/users/ryankelly/projects/twitter_analysis/data/'
tmp = pandas.read_csv(path+'binary_sick.csv')
tmp.columns = columns = ['user', 'tweet', 'lat','long','date','is_sick', '']
tmp.tail()
user | tweet | lat | long | date | is_sick | ||
---|---|---|---|---|---|---|---|
568189 | tennisprincess2 | Finally finished my assignment due 2day! Whew!... | 43.779161 | -79.326051 | 11/5/2012 | 1 | NaN |
568190 | Iyshgrowndsexy | I have a headache.. But i gotta bang out this ... | 43.646013 | -79.530264 | 11/5/2012 | 1 | NaN |
568191 | katkillingbeck | Cuddled in bed. Not feeling the greatest #slee... | 43.638892 | -79.404184 | 11/5/2012 | 1 | NaN |
568192 | OfficialOka | Got The Wickidest Headache | 43.761494 | -79.499430 | 11/5/2012 | 1 | NaN |
568193 | sydmt | I have the worst headache ever. | 43.810971 | -79.174834 | 11/5/2012 | 1 | NaN |
5 rows × 7 columns
print "There are {} unique users in the data".format(len(pandas.unique(tmp['user'])))
There are 35053 unique users in the data
print "There are {} people who become ill in October".format(len(pandas.unique(tmp['user'][tmp['is_sick']==1])))
There are 542 people who become ill in October
print "The are {} geo-tagged tweets posted by sick people".format(len(tmp[tmp['is_sick']==1]))
The are 823 geo-tagged tweets posted by sick people
The goal in the next three functions is to create a temporal buffer around the users who post that they are sick; since all tweets by a sick user will not be concerning ILI symptoms. This is an attempt help capture all the locations and people that a sick person might have interacted with over the duration of thier contagious period.
I define this contagious buffer using information from the Centers for Disease Control and Prevention who describe the contagious period of ILI as:
For the purposes of this analysis, we will assume that 4 days after reporting symptoms is generous enough, since we have no way of knowing if they tweeted via Twitter at the first sign of symptoms.
# This function will be used to reformat the dates in the raw data
# Note that for some reason there were two types of dates present.
from datetime import datetime
def date_reformat(date_):
try:
d = datetime.strptime(date_, "%m/%d/%Y %H:%M")
except:
d = datetime.strptime(date_, "%m/%d/%Y")
finally:
pass
return d
import csv
# Load in the data and loop through
with open(path+"binary_sick.csv", "rU") as data:
# Skip the header
next(data)
user_dict ={}
reader = csv.reader(data, delimiter=",")
for rows in reader:
user = rows[0]
data = rows[1:6]
if not user in user_dict:
user_dict[user] = [data]
else:
user_dict[user].append(data)
Using the dictionary we created and the date reformatting function, we can now define the contagious buffer around the users.
There are certainly other ways to solve this problem, however here is my implimentation described in english first:
from datetime import timedelta
# Add some time deltas to be used in the function
before = timedelta(days=1)
after = timedelta(days=4)
def contagious_users():
path = '/users/ryankelly/projects/twitter_analysis/data/'
with open(path+"sick.csv", "wb") as out:
writer = csv.writer(out)
for user in user_dict:
for tweet in user_dict[user]:
if tweet[4] == "1":
# If user is sick, proceeed
date_ref = date_reformat(tweet[3])
for tweet2 in user_dict[user]:
# Scan all the current user's tweets with this date reference
date = date_reformat(tweet2[3])
# If the date is within the date range specified...
if (date_ref - before) <= date <= (date_ref + after):
tweet2[4] = "1"
else:
pass
#tweet[5] = 0 nope, we dont want to overlap this
else:
# if user is not sick, do nothing but proceed to next user in loop
pass
# After all tweets in user are examined
# Write user to file
for tweet in user_dict[user]:
out = [i for i in tweet]
out.insert(0, user)
writer.writerow(out)
start_time = datetime.now()
contagious_users()
print (datetime.now() - start_time)
0:00:03.481979
It was a little tricky, but very efficient looping through a plain dictionary of nearly 600 000 records.
# Lets just reload the data from the new csv
df = pandas.read_csv(path+"sick.csv", names=['user', 'tweet', 'lat', 'long', 'date', 'is_sick'], parse_dates=['date'])
# We dont really need the Twitter content anymore
del df['tweet']
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 568194 entries, 0 to 568193 Data columns (total 5 columns): user 568194 non-null object lat 568194 non-null float64 long 568194 non-null float64 date 568194 non-null datetime64[ns] is_sick 568194 non-null int64 dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
print "There are now {} tweets from sick people to work with".format(len(df[df['is_sick']==1]))
There are now 21089 tweets from sick people to work with
If we wanted we could change the dates to another format, but that is not needing currently
#change to unix date
#df['date'] = df['date'].apply(lambda x: (x-datetime(1970,1,1)).total_seconds())
We almost have the data looking how we want it too, but there are a few more things to consider. Since we are concerned with the interaction between people, we will remove "non-active" users from the dataset, but retain all users that were ever sick, so that they can still serve as points of infection.
# List of names that have < 100 obs
names = list(pandas.unique(df['user'][df.groupby('user').size().values < 100]))
# List of people that are sick
still_sick = pandas.unique(df['user'][df['is_sick'] == 1])
# If someone is sick in the names to exclude,
# Then remove them from the list so we retain them next step
for user in names:
if user in still_sick:
names.remove(user)
#subset the df with users that are not in the list
df = df[-df['user'].isin(names)]
print "We lost {} posts in this step".format(568192 - len(df))
We lost 30489 posts in this step
print "We lost {} users in this step".format(35053 - len(pandas.unique(df['user'])))
We lost 1915 users in this step
There is certainly room for expirimenting with this threshold in future analysis, however for now we will use the methodologies supported by the current literature and stick with ~100 tweets.
To make sure we are dealing with accurate distances, I transform the geographic coordinates of latitude / longitude to planar coordinates in meters. The Greater Toronto Area can be accurately projected onto UTM zone 17N.
import pyproj
# Define coordindate system projection function
proj = pyproj.Proj("+proj=utm +zone=17N, +ellps=WGS84 \
+datum=WGS84 +units=m +no_defs")
#Project the data
df['long'], df['lat'] = proj(df['long'].values, df['lat'].values)
Now we will loop through every tweet to calculate how many sick people each user was in contact.
The definition of contact is perhaps the most important consideration in this model.
Following conventions imposed by previous literature, we presume that possible contact between users can be captured using a 100-meter buffer (Sadilek et al. 2012). We also impose a temporal lag on the interaction since the influenza virus can last up to 8 hours on surfaces (CDC).
Thus we define contact between Twitter users as:
Future analysis will reach to investigate a more continuous analysis of these constraints, rather than binary cutoffs; as these are some fundamental limitations of the model. Simply because a user was within 100 meters of another sick user does not guarantee contagious transfer. Similarly, we assume that the virus is always deposited on surfaces or airspace where the sick tweet occurred, and that those surfaces have not been cleaned before the 8-hour buffer period.
In any case, these limitations mean that we are currently underestimating the data. ___
Because we cannot easily vectorize this application, it is almost necessary to compute in parallel.
Here we setup the parallel process in IPython
from IPython.parallel import Client
# First initiate how many clusters to start in the Ipython notebook dashboard
cli = Client()
cli.ids
dview=cli[:]
# Import required libraries to workers
with dview.sync_imports():
import numpy as np
import os
from datetime import timedelta
import pandas as pd
importing numpy on engine(s) importing os on engine(s) importing timedelta from datetime on engine(s) importing pandas on engine(s)
df.sort('date')
@dview.parallel(block=True) # Parallel decorator
def work(d):
# Setup local variables
before1 = timedelta(hours = 8)
after1 = timedelta(minutes = 1)
output = []
# Iterate through the data via index reference i
for i in range(0, len(d)):
l = []
# Create a mask to query all datetimes between with 8 hours of datetime i
date_mask = (d['date'] >= d['date'].iloc[i]-before1) & (d['date'] <= d['date'].iloc[i]+after1)
# Make sure we are not computing distances between the same person
user_mask = d['user']!=d['user'].iloc[i]
# These are the data we need to check distances for
dists_to_check = d[date_mask & user_mask]
# a = current location i
# b = all locations from dists_to_check
a = numpy.array((d['long'].iloc[i], d['lat'].iloc[i]))
b = numpy.array((dists_to_check['long'].values, dists_to_check['lat'].values))
for j in range(1, len(dists_to_check)):
x = numpy.linalg.norm(a-numpy.array((b[0][j], b[1][j])))
# If the distance is < 100m append index to list l
if x <=1000:
l.append(j)
else:
pass
try:
# Subset all the data that satisfy both the datetime and distance restriction
data = dists_to_check.iloc[l]
# Add the variables of interest to the dataset
# Sum of sick people within constraints,
# and sum of how many people in total (sick/not sick)
output.append([data['is_sick'].sum(), len(data)])
except IndexError, e:
# If no data exists, add zero's
output.append([0,0])
return output
start = datetime.now()
out = work(df)
print datetime.now() - start
in sync results <function __call__ at 0x123a015f0> 0:52:15.430106
# Combine the data from the 4 parallel workers into one list
l = []
for i in out:
l.append(i)
Since the data from the function above is a sequential list of the original data, we append the output to the original dataframe.
This is more efficient than carrying the entire dataframe through the looping function.
# Merge all data together
# Data from function "work"
df2 = pandas.DataFrame(l, columns = ['sick', 'nearby'])
# Merge with original data
data = pandas.DataFrame(zip(df['user'],df['date'], df['is_sick'], df2['sick'], df2['nearby']), columns =['user', 'date','is_sick','sick','nearby'])
data.sort('date')
data.head()
user | date | is_sick | sick | nearby | |
---|---|---|---|---|---|
0 | NicoleWGKTA | 2012-10-02 23:49:00 | 0 | 0 | 0 |
1 | NicoleWGKTA | 2012-10-02 23:50:00 | 0 | 0 | 0 |
2 | NicoleWGKTA | 2012-10-02 23:58:00 | 0 | 0 | 0 |
3 | NicoleWGKTA | 2012-10-03 00:01:00 | 0 | 0 | 0 |
4 | NicoleWGKTA | 2012-10-03 00:05:00 | 0 | 0 | 0 |
5 rows × 5 columns
We have to be sure we remove users who became sick prior the collection of this data, since we dont know how many sick people they came in contact with.
user = data.groupby('user')
# Remove people that were sick at the start of the month
data_filter = user.filter(lambda x: x['is_sick'].iloc[0] != 1)
print "We lost {} unique users from this step, but {} users remain"\
.format(33138 - len(pd.unique(data_filter['user'])), len(pd.unique(data_filter['user'])))
We lost 172 unique users from this step, but 32966 users remain
# Subset of people that become ill
sick = data_filter.groupby('user').filter(lambda x: x['is_sick'].sum() >0)
print "There are {} people that became ill, and they posted {} times".\
format(len(pandas.unique(sick['user'])), len(sick))
There are 370 people that became ill, and they posted 41959 times
# Subset of people who never became sick
never_sick = data_filter.groupby('user').filter(lambda x: x['is_sick'].sum()==0)
print "There are {} people that never became ill, and they posted {} times".\
format(len(pandas.unique(never_sick['user'])), len(never_sick))
There are 32596 people that never became ill, and they posted 480316 times
# Make sure that these two subsets comprise all of the data
len(sick)+len(never_sick) == len(data_filter)
True
There are many ways to interpret these data, however for now I will summarize the results in order to answer two fundamental questions.
import matplotlib.pylab as plt
# Enable inline plotting
%matplotlib inline
# Set the default image size for this interactive session
plt.rcParams['figure.figsize'] = 6, 4
First we test if coming in contact with many sick people at once increases the likeyhood of becoming sick later.
To test this we summarize the 'sick' and 'never sick' users by how many sick people they came into contact with at single points in time and plot the ratio of the results between the 'sick' and 'not sick' groups. We expect to see that if someone comes in contact with many sick people at once (perhaps in a university lecture hall), that they have a higher probability of becoming sick themselves at a later date.
sick_values = pd.DataFrame(zip(pd.unique(sick['sick'].values),sick['sick'].value_counts()),\
columns = ['number_of_sick_people_contacted', 'b'])
never_sick_values = pd.DataFrame(zip(pd.unique(never_sick['sick'].values),never_sick['sick'].\
value_counts()), columns = ['number_of_sick_people_contacted', 'b'])
bins = [1,15,30,45,60]
groups_sick = sick_values.groupby(pd.cut(sick_values.number_of_sick_people_contacted, bins))
groups_never_sick = never_sick_values.groupby(pd.cut(never_sick_values.number_of_sick_people_contacted, bins))
print "\nSick Users: How many sick people they came in contact with in October\n"
print groups_sick.sum().b
print "\nHealthy Users: How many sick people they came in contact with in October\n"
print groups_never_sick.sum().b
print "\nProportion of people in October that were sick\n"
print groups_sick.sum().b / groups_never_sick.sum().b
plt.plot(groups_sick.sum().b / groups_never_sick.sum().b)
plt.xticks([0,1,2,3], ["1-15","15-30","30-45","45-60"])
plt.yticks([0,.1,.2,.3,.4,.5,.6,.7,.8], ["0%","10%","20%","30%","40%","50%","60%","70%", "80%"])
plt.xlabel("How many sick people were contacted", size = 12)
plt.ylabel("Proportion of people who became ill at t + 1", size = 12)
plt.minorticks_on()
plt.grid(b=True, which='major', color='black', alpha=0.5)
plt.title("\nIndividual encounters with sick people\n 8 hour time window, 100 meters", size = 14)
#plt.savefig(path+'1.png', dpi=96)
Sick Users: How many sick people they came in contact with in October number_of_sick_people_contacted (1, 15] 2324 (15, 30] 136 (30, 45] 70 (45, 60] 27 Name: b, dtype: int64 Healthy Users: How many sick people they came in contact with in October number_of_sick_people_contacted (1, 15] 20668 (15, 30] 543 (30, 45] 577 (45, 60] 29 Name: b, dtype: int64 Proportion of people in October that were sick number_of_sick_people_contacted (1, 15] 0.112444 (15, 30] 0.250460 (30, 45] 0.121317 (45, 60] 0.931034 Name: b, dtype: float64
<matplotlib.text.Text at 0x12e423650>
Great! This result aligns well with our above theory; the likeyhood of becoming sick greatly increases as we are surrounded by more sick people at once. From the table we can see that ~75% of the people that came into contact with 45 - 60 people later became sick. These data actually report strikling similar results to the New York data in Sadelik et al (2012).
Lastly we test if encountering more sick people in general in October increases the likelyhood of becoming sick.
In other words, we expect to see that if the total number of sick people we encounter in October is higher, our likelyhood of becoming sick is also higher. Here we calculate the total number of sick people encountered per user in Toronto, and examine the ratio between the people who became ill, and those who did not.
sick_values = pd.DataFrame(zip(pd.unique(sick['sick'].values), sick.groupby('user')['sick'].sum()), \
columns = ['number_of_sick_people_contacted','b'])
never_sick_values = pd.DataFrame(zip(pd.unique(never_sick['sick'].values), \
never_sick.groupby('user')['sick'].sum()),columns = \
['number_of_sick_people_contacted','b'])
bins = [1,15,30,45,60]
groups_sick = sick_values.groupby(pd.cut(sick_values.number_of_sick_people_contacted, bins))
groups_never_sick = never_sick_values.groupby(pd.cut(never_sick_values.number_of_sick_people_contacted, bins))
print "\nHealthy Users: How many sick people they came in contact with in October\n"
print groups_sick.sum().b
print "\nSick Users: How many sick people they came in contact with in October\n"
print groups_never_sick.sum().b
print "\nProportion of people in October that were sick\n"
print groups_never_sick.sum().b / groups_sick.sum().b
plt.plot(groups_never_sick.sum().b / groups_sick.sum().b)
plt.xticks([0,1,2,3], ["1-15","15-30","30-45","45-60"])
plt.yticks([0,.01,.02,.03,.04,.05,.06,.07,.08, 0.09], ["0%","1%","2%","3%","4%","5%","6%","7%", "8%"])
plt.xlabel("How many sick people were contacted", size = 12)
plt.ylabel("Proportion of people who became ill at t +1", size = 12)
plt.minorticks_on()
plt.grid(b=True, which='major', color='black', alpha=0.5)
plt.title("Total encounters with sick people\n 8 hour time window, 100 meters", size = 14)
Healthy Users: How many sick people they came in contact with in October number_of_sick_people_contacted (1, 15] 167 (15, 30] 109 (30, 45] 148 (45, 60] 528 Name: b, dtype: int64 Sick Users: How many sick people they came in contact with in October number_of_sick_people_contacted (1, 15] 6 (15, 30] 30 (30, 45] 14 (45, 60] 0 Name: b, dtype: int64 Proportion of people in October that were sick number_of_sick_people_contacted (1, 15] 0.035928 (15, 30] 0.275229 (30, 45] 0.094595 (45, 60] 0.000000 Name: b, dtype: float64
<matplotlib.text.Text at 0x12637cc90>
This is certainly an interesting result, though it may be partially due to lack of data on the high end. Either way, we can start to interpret the results.
Most notably, it appears as though there is a threashold effect, whereby after coming in contact with 15-30 sick people there is a severe drop off in the number of people that became sick. This suggests a “resistance effect” whereby users who have come in contact with many sick people are resistant to the infections strain, possibly due to previously being ill prior to October, or a healthier immune system in general.
There is certainly a lot more I could do and test with this data. I would love any feedback / suggestions on the methods, and ideas of where to go next. Let me know if you take this idea and apply it yourself!