Start to Finish: Quantifying Influenza-Like-Illness (ILI) Symptom Contraction using Twitter and Python¶

Introduction:¶

Research regarding social media analysis is a rapidly evolving field. The objective of this project is to test if we can observe the spread of influenza-like-symptoms (ILI) through an analysis of geo-tagged Twitter posts in Toronto, ON, for October 2012. Using tweets containing geographic information, we should be able to observe that people who come into contact with more sick people are more likely to be sick themselves. In the future I will perform some machine learning techniques to identify sick people in November and December, then rerun this analysis.

This project was inspired by work done by Sadilek A., Kautz H., Silenzi V. (2012) Modeling Spread of Disease from Social Interactions.

https://cs.rochester.edu/~kautz/papers/Sadilek-Kautz-Silenzio_Modeling-Spread-of-Disease-from-Social-Interactions_ICWSM-12.pdf

There is certainly a lot of room for improvement / next steps, and I would love to hear some suggestions. I also have the code written in R if anyone is interested.

In [142]:

print "\nSick Users: How many sick people they came in contact with in October\n"
print groups_sick.sum().b
print "\nHealthy Users: How many sick people they came in contact with in October\n"
print groups_never_sick.sum().b
print "\nProportion of people in October that were sick\n"
print groups_sick.sum().b / groups_never_sick.sum().b

plt.plot(groups_sick.sum().b / groups_never_sick.sum().b)
plt.xticks([0,1,2,3], ["1-15","15-30","30-45","45-60"])
plt.yticks([0,.1,.2,.3,.4,.5,.6,.7,.8], ["0%","10%","20%","30%","40%","50%","60%","70%", "80%"])
plt.xlabel("How many sick people were contacted", size = 12)
plt.ylabel("Proportion of people who became ill at t + 1", size = 12)
plt.minorticks_on()
plt.grid(b=True, which='major', color='black', alpha=0.5)
plt.title("\nIndividual encounters with sick people\n 8 hour time window, 100 meters", size = 14)
plt.savefig(path+'1.png', dpi=96)

Sick Users: How many sick people they came in contact with in October

number_of_sick_people_contacted
(1, 15]                            2324
(15, 30]                            136
(30, 45]                             70
(45, 60]                             27
Name: b, dtype: int64

Healthy Users: How many sick people they came in contact with in October

number_of_sick_people_contacted
(1, 15]                            20668
(15, 30]                             543
(30, 45]                             577
(45, 60]                              29
Name: b, dtype: int64

Proportion of people in October that were sick

number_of_sick_people_contacted
(1, 15]                            0.112444
(15, 30]                           0.250460
(30, 45]                           0.121317
(45, 60]                           0.931034
Name: b, dtype: float64

Aquiring the data:¶

The data used is a collection of tweets streamed from the Twitter API: https://dev.twitter.com. The data were saved into a .csv with the following information:

1.Username
2.Tweet text
3.Date & time
4.Latitude / Longitude

I am not going to pretend that I invented this process, there are many many ways to do this. I used a python library called tweepy. See the link below.

http://lmgtfy.com/?q=Streaming+twitter+API+with+tweepy

The Twitter streaming protocol dictates that a stream can provide up to 1% of the global value of tweets at any given time. Since the stream was specified to collect only geo-tagged tweets in the Toronto area, there were no instances where the stream was throttled; therefore the dataset used in this analysis should contain every usable tweet. The initial dataset consisted of about 900 000 tweets, however, after cleaning the data for errors, missing usernames, and promotional tweets, the dataset contained some 568 192 posts with 35 053 unique users.

The biggest hurdle for this project was classifying which tweets were from sick people, and which were not. Generally some type of support vector machine or other machine learning technique is used.

While this problem might seem trivial at first, the issue is that words and phrases used to describe ILI have inconsistant connotations and uses. For example:

“That was a sick concert last night”
“I am terribly sick with a cough and cold # influenza”

However, I first tackled this data before I learned the advanced techniques to filter them; thus the data were filtered using keywords and then manually classified based on the judgement of me and a few undergraduate students.

Lets start the analysis that derived this plot¶

I will be importing modules as they are required in this code, I think it makes it easier to keep track of what I am doing for those less familar with the libraries.

Lets take a look at the data before we begin.¶

In [26]:

import pandas
#set a path for fle operations
path = '/users/ryankelly/projects/twitter_analysis/data/'

tmp = pandas.read_csv(path+'binary_sick.csv')
tmp.columns = columns = ['user', 'tweet', 'lat','long','date','is_sick', '']
tmp.tail()

Out[26]:

	user	tweet	lat	long	date	is_sick
568189	tennisprincess2	Finally finished my assignment due 2day! Whew!...	43.779161	-79.326051	11/5/2012	1	NaN
568190	Iyshgrowndsexy	I have a headache.. But i gotta bang out this ...	43.646013	-79.530264	11/5/2012	1	NaN
568191	katkillingbeck	Cuddled in bed. Not feeling the greatest #slee...	43.638892	-79.404184	11/5/2012	1	NaN
568192	OfficialOka	Got The Wickidest Headache	43.761494	-79.499430	11/5/2012	1	NaN
568193	sydmt	I have the worst headache ever.	43.810971	-79.174834	11/5/2012	1	NaN

5 rows × 7 columns

Now that we understand the structure, lets see a few more basic statistics regarding the data

In [27]:

print "There are {} unique users in the data".format(len(pandas.unique(tmp['user'])))

There are 35053 unique users in the data

In [28]:

print "There are {} people who become ill in October".format(len(pandas.unique(tmp['user'][tmp['is_sick']==1])))

There are 542 people who become ill in October

In [29]:

print "The are {} geo-tagged tweets posted by sick people".format(len(tmp[tmp['is_sick']==1]))

The are 823 geo-tagged tweets posted by sick people

The Contagious Buffer¶

The goal in the next three functions is to create a temporal buffer around the users who post that they are sick; since all tweets by a sick user will not be concerning ILI symptoms. This is an attempt help capture all the locations and people that a sick person might have interacted with over the duration of thier contagious period.

I define this contagious buffer using information from the Centers for Disease Control and Prevention who describe the contagious period of ILI as:

1 day before reporting symptoms
up to 5 days after reporting symptoms

For the purposes of this analysis, we will assume that 4 days after reporting symptoms is generous enough, since we have no way of knowing if they tweeted via Twitter at the first sign of symptoms.

In [54]:

# This function will be used to reformat the dates in the raw data
# Note that for some reason there were two types of dates present.

from datetime import datetime

def date_reformat(date_):
    try:
        d = datetime.strptime(date_, "%m/%d/%Y %H:%M")
    except:
        d = datetime.strptime(date_, "%m/%d/%Y")
    finally:
        pass
    return d

Next we create a dictionary of users, and thier associated data using a standard python dictionary.

In [67]:

import csv

# Load in the data and loop through
with open(path+"binary_sick.csv", "rU") as data:
    # Skip the header
    next(data)
    user_dict ={}
    
    reader = csv.reader(data, delimiter=",")
    
    for rows in reader:
    
        user = rows[0]
        data = rows[1:6]
        
        if not user in user_dict:
            user_dict[user] = [data]
        else:
            user_dict[user].append(data)
    

Using the dictionary we created and the date reformatting function, we can now define the contagious buffer around the users.

There are certainly other ways to solve this problem, however here is my implimentation described in english first:

Open a new file to write to
Loop through all the users in the user dictionary, if a user is sick, then loop through all of thier tweets and check to see if they tweeted within the duration for which we expect them to be sick and record that information in a new field.

In [68]:

from datetime import timedelta

# Add some time deltas to be used in the function
before = timedelta(days=1)
after = timedelta(days=4)

def contagious_users():
    path = '/users/ryankelly/projects/twitter_analysis/data/'
    with open(path+"sick.csv", "wb") as out:
        writer = csv.writer(out)
       
        for user in user_dict:
            for tweet in user_dict[user]:

                if tweet[4] == "1":
                    # If user is sick, proceeed 
                    date_ref = date_reformat(tweet[3])

                    for tweet2 in user_dict[user]:
                        # Scan all the current user's tweets with this date reference
                        date = date_reformat(tweet2[3])
                        # If the date is within the date range specified...
                        if (date_ref - before) <= date <= (date_ref + after):
                            tweet2[4] = "1"

                        else:
                            pass
                            #tweet[5] = 0 nope, we dont want to overlap this
                else:
                    # if user is not sick, do nothing but proceed to next user in loop
                    pass
                
                
            # After all tweets in user are examined
            # Write user to file
            for tweet in user_dict[user]:
                out = [i for i in tweet]
                out.insert(0, user)
                writer.writerow(out)             

In [69]:

start_time = datetime.now()
contagious_users()
print (datetime.now() - start_time)

0:00:03.481979

It was a little tricky, but very efficient looping through a plain dictionary of nearly 600 000 records.

So how many sick tweets do we have after this step?

In [71]:

# Lets just reload the data from the new csv
df = pandas.read_csv(path+"sick.csv", names=['user', 'tweet', 'lat', 'long', 'date', 'is_sick'], parse_dates=['date'])
# We dont really need the Twitter content anymore
del df['tweet']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 568194 entries, 0 to 568193
Data columns (total 5 columns):
user       568194 non-null object
lat        568194 non-null float64
long       568194 non-null float64
date       568194 non-null datetime64[ns]
is_sick    568194 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)

In [77]:

print "There are now {} tweets from sick people to work with".format(len(df[df['is_sick']==1]))

There are now 21089 tweets from sick people to work with

If we wanted we could change the dates to another format, but that is not needing currently

In [10]:

#change to unix date
#df['date'] = df['date'].apply(lambda x: (x-datetime(1970,1,1)).total_seconds())

Some more data cleaning¶

We almost have the data looking how we want it too, but there are a few more things to consider. Since we are concerned with the interaction between people, we will remove "non-active" users from the dataset, but retain all users that were ever sick, so that they can still serve as points of infection.

We define "active" as posting at least 100 times in a month

In [78]:

# List of names that have < 100 obs
names = list(pandas.unique(df['user'][df.groupby('user').size().values < 100]))
# List of people that are sick
still_sick = pandas.unique(df['user'][df['is_sick'] == 1])

# If someone is sick in the names to exclude, 
# Then remove them from the list so we retain them next step
for user in names:
    if user in still_sick:
        names.remove(user)

In [79]:

#subset the df with users that are not in the list
df = df[-df['user'].isin(names)]

In [80]:

print "We lost {} posts in this step".format(568192 - len(df))

We lost 30489 posts in this step

In [81]:

print "We lost {} users in this step".format(35053 - len(pandas.unique(df['user'])))

We lost 1915 users in this step

There is certainly room for expirimenting with this threshold in future analysis, however for now we will use the methodologies supported by the current literature and stick with ~100 tweets.

Project Latitude Longitude coordinates to UTM¶

To make sure we are dealing with accurate distances, I transform the geographic coordinates of latitude / longitude to planar coordinates in meters. The Greater Toronto Area can be accurately projected onto UTM zone 17N.

In [83]:

import pyproj

In [84]:

# Define coordindate system projection function
proj = pyproj.Proj("+proj=utm +zone=17N, +ellps=WGS84 \
                   +datum=WGS84 +units=m +no_defs")

In [85]:

#Project the data
df['long'], df['lat'] = proj(df['long'].values, df['lat'].values)

The Main Loop and The Definition of "Contact" Between Users¶

Now we will loop through every tweet to calculate how many sick people each user was in contact.

The definition of contact is perhaps the most important consideration in this model.

Following conventions imposed by previous literature, we presume that possible contact between users can be captured using a 100-meter buffer (Sadilek et al. 2012). We also impose a temporal lag on the interaction since the influenza virus can last up to 8 hours on surfaces (CDC).

Thus we define contact between Twitter users as:

Being co-located within 100-meters within t – 8 hours.

Future analysis will reach to investigate a more continuous analysis of these constraints, rather than binary cutoffs; as these are some fundamental limitations of the model. Simply because a user was within 100 meters of another sick user does not guarantee contagious transfer. Similarly, we assume that the virus is always deposited on surfaces or airspace where the sick tweet occurred, and that those surfaces have not been cleaned before the 8-hour buffer period.

In any case, these limitations mean that we are currently underestimating the data. ___

Because we cannot easily vectorize this application, it is almost necessary to compute in parallel.

Here we setup the parallel process in IPython

In [95]:

from IPython.parallel import Client
# First initiate how many clusters to start in the Ipython notebook dashboard 
cli = Client()
cli.ids
dview=cli[:]

In [96]:

# Import required libraries to workers
with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd

importing numpy on engine(s)
importing os on engine(s)
importing timedelta from datetime on engine(s)
importing pandas on engine(s)

In [90]:

df.sort('date')

In [115]:

@dview.parallel(block=True) # Parallel decorator
def work(d):
    # Setup local variables
    before1 = timedelta(hours = 8)
    after1 = timedelta(minutes = 1)
    output = []
    
    # Iterate through the data via index reference i
    for i in range(0, len(d)):
        l = []
        # Create a mask to query all datetimes between with 8 hours of datetime i
        date_mask = (d['date'] >= d['date'].iloc[i]-before1) & (d['date'] <= d['date'].iloc[i]+after1)
        # Make sure we are not computing distances between the same person
        user_mask = d['user']!=d['user'].iloc[i]
        
        # These are the data we need to check distances for
        dists_to_check = d[date_mask & user_mask]
        
        # a = current location i
        # b = all locations from dists_to_check
        a = numpy.array((d['long'].iloc[i], d['lat'].iloc[i]))
        b = numpy.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        
        for j in range(1, len(dists_to_check)):
            x = numpy.linalg.norm(a-numpy.array((b[0][j], b[1][j])))
            # If the distance is < 100m append index to list l
            if x <=1000:
                l.append(j)
            else:
                pass
        try:
            # Subset all the data that satisfy both the datetime and distance restriction 
            data = dists_to_check.iloc[l]
            # Add the variables of interest to the dataset
            # Sum of sick people within constraints, 
            # and sum of how many people in total (sick/not sick)
            output.append([data['is_sick'].sum(), len(data)])

        except IndexError, e:
            # If no data exists, add zero's
            output.append([0,0])
        
    return output

In [116]:

start = datetime.now()
out = work(df)
print datetime.now() - start

in sync results <function __call__ at 0x123a015f0>
0:52:15.430106

If you can see of any ways to speed of this process, please let me know! I would like to look into doing 3-d KD-Tree's if possible. The time you see here is on 6 threads of a 2.3 ghz macbook pro.

In [117]:

# Combine the data from the 4 parallel workers into one list
l = []
for i in out:
    l.append(i)

Since the data from the function above is a sequential list of the original data, we append the output to the original dataframe.
This is more efficient than carrying the entire dataframe through the looping function.

In [118]:

# Merge all data together

# Data from function "work"
df2 = pandas.DataFrame(l, columns = ['sick', 'nearby'])
# Merge with original data
data = pandas.DataFrame(zip(df['user'],df['date'], df['is_sick'], df2['sick'], df2['nearby']), columns =['user', 'date','is_sick','sick','nearby'])
data.sort('date')
data.head()

Out[118]:

	user	date
0	NicoleWGKTA	2012-10-02 23:49:00
1	NicoleWGKTA	2012-10-02 23:50:00
2	NicoleWGKTA	2012-10-02 23:58:00
3	NicoleWGKTA	2012-10-03 00:01:00
4	NicoleWGKTA	2012-10-03 00:05:00

5 rows × 5 columns

Some more data cleaning¶

We have to be sure we remove users who became sick prior the collection of this data, since we dont know how many sick people they came in contact with.

In [134]:

user = data.groupby('user')

In [135]:

# Remove people that were sick at the start of the month
data_filter = user.filter(lambda x: x['is_sick'].iloc[0] != 1)
print "We lost {} unique users from this step, but {} users remain"\
.format(33138 - len(pd.unique(data_filter['user'])), len(pd.unique(data_filter['user'])))

We lost 172 unique users from this step, but 32966 users remain

Next we split the data into two subsets, those who became sick in October, and those who did not.

In [136]:

# Subset of people that become ill
sick = data_filter.groupby('user').filter(lambda x: x['is_sick'].sum() >0)
print "There are {} people that became ill, and they posted {} times".\
format(len(pandas.unique(sick['user'])), len(sick))

There are 370 people that became ill, and they posted 41959 times

In [122]:

# Subset of people who never became sick
never_sick = data_filter.groupby('user').filter(lambda x: x['is_sick'].sum()==0)
print "There are {} people that never became ill, and they posted {} times".\
format(len(pandas.unique(never_sick['user'])), len(never_sick))

There are 32596 people that never became ill, and they posted 480316 times

In [123]:

# Make sure that these two subsets comprise all of the data
len(sick)+len(never_sick) == len(data_filter)

Out[123]:

True

There is certainly an overwhelming majority of people who did not become sick, but this makes sense in October for this latitude.

Summarize and Plot Results:¶

There are many ways to interpret these data, however for now I will summarize the results in order to answer two fundamental questions.

Does the likelyhood of becoming sick increase if we encounter more sick people during single incidents (think large lecture halls or malls)?
Do a higher proportion of people become sick after coming in contact with more sick people over time (think the total number of sick people over the course of a month)?

In [137]:

import matplotlib.pylab as plt
# Enable inline plotting
%matplotlib inline

# Set the default image size for this interactive session
plt.rcParams['figure.figsize'] = 6, 4  

Results 1.0¶

First we test if coming in contact with many sick people at once increases the likeyhood of becoming sick later.

To test this we summarize the 'sick' and 'never sick' users by how many sick people they came into contact with at single points in time and plot the ratio of the results between the 'sick' and 'not sick' groups. We expect to see that if someone comes in contact with many sick people at once (perhaps in a university lecture hall), that they have a higher probability of becoming sick themselves at a later date.

In [140]:

sick_values = pd.DataFrame(zip(pd.unique(sick['sick'].values),sick['sick'].value_counts()),\
                           columns = ['number_of_sick_people_contacted', 'b'])
never_sick_values = pd.DataFrame(zip(pd.unique(never_sick['sick'].values),never_sick['sick'].\
                           value_counts()), columns = ['number_of_sick_people_contacted', 'b'])
bins = [1,15,30,45,60]

groups_sick = sick_values.groupby(pd.cut(sick_values.number_of_sick_people_contacted, bins))
groups_never_sick = never_sick_values.groupby(pd.cut(never_sick_values.number_of_sick_people_contacted, bins))

In [141]:

print "\nSick Users: How many sick people they came in contact with in October\n"
print groups_sick.sum().b
print "\nHealthy Users: How many sick people they came in contact with in October\n"
print groups_never_sick.sum().b
print "\nProportion of people in October that were sick\n"
print groups_sick.sum().b / groups_never_sick.sum().b

plt.plot(groups_sick.sum().b / groups_never_sick.sum().b)
plt.xticks([0,1,2,3], ["1-15","15-30","30-45","45-60"])
plt.yticks([0,.1,.2,.3,.4,.5,.6,.7,.8], ["0%","10%","20%","30%","40%","50%","60%","70%", "80%"])
plt.xlabel("How many sick people were contacted", size = 12)
plt.ylabel("Proportion of people who became ill at t + 1", size = 12)
plt.minorticks_on()
plt.grid(b=True, which='major', color='black', alpha=0.5)
plt.title("\nIndividual encounters with sick people\n 8 hour time window, 100 meters", size = 14)
#plt.savefig(path+'1.png', dpi=96)

Sick Users: How many sick people they came in contact with in October

number_of_sick_people_contacted
(1, 15]                            2324
(15, 30]                            136
(30, 45]                             70
(45, 60]                             27
Name: b, dtype: int64

Healthy Users: How many sick people they came in contact with in October

number_of_sick_people_contacted
(1, 15]                            20668
(15, 30]                             543
(30, 45]                             577
(45, 60]                              29
Name: b, dtype: int64

Proportion of people in October that were sick

number_of_sick_people_contacted
(1, 15]                            0.112444
(15, 30]                           0.250460
(30, 45]                           0.121317
(45, 60]                           0.931034
Name: b, dtype: float64

Out[141]:

<matplotlib.text.Text at 0x12e423650>

Great! This result aligns well with our above theory; the likeyhood of becoming sick greatly increases as we are surrounded by more sick people at once. From the table we can see that ~75% of the people that came into contact with 45 - 60 people later became sick. These data actually report strikling similar results to the New York data in Sadelik et al (2012).

Results 2.0¶

Lastly we test if encountering more sick people in general in October increases the likelyhood of becoming sick.

In other words, we expect to see that if the total number of sick people we encounter in October is higher, our likelyhood of becoming sick is also higher. Here we calculate the total number of sick people encountered per user in Toronto, and examine the ratio between the people who became ill, and those who did not.

In [130]:

sick_values = pd.DataFrame(zip(pd.unique(sick['sick'].values), sick.groupby('user')['sick'].sum()), \
                  columns = ['number_of_sick_people_contacted','b'])

never_sick_values = pd.DataFrame(zip(pd.unique(never_sick['sick'].values), \
                   never_sick.groupby('user')['sick'].sum()),columns = \
                                 ['number_of_sick_people_contacted','b'])

bins = [1,15,30,45,60]

groups_sick = sick_values.groupby(pd.cut(sick_values.number_of_sick_people_contacted, bins))
groups_never_sick = never_sick_values.groupby(pd.cut(never_sick_values.number_of_sick_people_contacted, bins))

In [131]:

print "\nHealthy Users: How many sick people they came in contact with in October\n"
print groups_sick.sum().b
print "\nSick Users: How many sick people they came in contact with in October\n"
print groups_never_sick.sum().b
print "\nProportion of people in October that were sick\n"
print groups_never_sick.sum().b / groups_sick.sum().b

plt.plot(groups_never_sick.sum().b / groups_sick.sum().b)
plt.xticks([0,1,2,3], ["1-15","15-30","30-45","45-60"])
plt.yticks([0,.01,.02,.03,.04,.05,.06,.07,.08, 0.09], ["0%","1%","2%","3%","4%","5%","6%","7%", "8%"])
plt.xlabel("How many sick people were contacted", size = 12)
plt.ylabel("Proportion of people who became ill at t +1", size = 12)
plt.minorticks_on()
plt.grid(b=True, which='major', color='black', alpha=0.5)
plt.title("Total encounters with sick people\n 8 hour time window, 100 meters", size = 14)

Healthy Users: How many sick people they came in contact with in October

number_of_sick_people_contacted
(1, 15]                            167
(15, 30]                           109
(30, 45]                           148
(45, 60]                           528
Name: b, dtype: int64

Sick Users: How many sick people they came in contact with in October

number_of_sick_people_contacted
(1, 15]                             6
(15, 30]                           30
(30, 45]                           14
(45, 60]                            0
Name: b, dtype: int64

Proportion of people in October that were sick

number_of_sick_people_contacted
(1, 15]                            0.035928
(15, 30]                           0.275229
(30, 45]                           0.094595
(45, 60]                           0.000000
Name: b, dtype: float64

Out[131]:

<matplotlib.text.Text at 0x12637cc90>

This is certainly an interesting result, though it may be partially due to lack of data on the high end. Either way, we can start to interpret the results.

Most notably, it appears as though there is a threashold effect, whereby after coming in contact with 15-30 sick people there is a severe drop off in the number of people that became sick. This suggests a “resistance effect” whereby users who have come in contact with many sick people are resistant to the infections strain, possibly due to previously being ill prior to October, or a healthier immune system in general.

Thats it for now!¶

There is certainly a lot more I could do and test with this data. I would love any feedback / suggestions on the methods, and ideas of where to go next. Let me know if you take this idea and apply it yourself!

In [ ]: