This notebook explores some initial modeling of sentiment analysis of weather-related tweets. The datasets can be downloaded from Kaggle's "Partly Sunny with a Chance of Hashtags" competition.
Since the competition is over, we will restrict ourselves to the training set only in order to have a good basis for benchmarking. We will make heavy use of nltk as a natural language processing (NLP) library to build a bag-of-words model that we will use to train a Multinomial Naive Bayes classifier.
import pandas as pd
pd.options.display.encoding = 'ascii'
### import the different corpora and packages from nltk as you need them
# import nltk
# nltk.download()
raw_df = pd.DataFrame.from_csv('./datasets/train.csv')
raw_df.head(5)
tweet | state | location | s1 | s2 | s3 | s4 | s5 | w1 | w2 | ... | k6 | k7 | k8 | k9 | k10 | k11 | k12 | k13 | k14 | k15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1 | Jazz for a Rainy Afternoon: {link} | oklahoma | Oklahoma | 0 | 0 | 1 | 0.000 | 0.000 | 0.800 | 0 | ... | 0 | 0.000 | 0 | 0.000 | 1 | 0 | 0 | 0.000 | 0 | 0 |
2 | RT: @mention: I love rainy days. | florida | Miami-Ft. Lauderdale | 0 | 0 | 0 | 1.000 | 0.000 | 0.196 | 0 | ... | 0 | 0.000 | 0 | 0.000 | 1 | 0 | 0 | 0.000 | 0 | 0 |
3 | Good Morning Chicago! Time to kick the Windy C... | idaho | NaN | 0 | 0 | 0 | 0.000 | 1.000 | 0.000 | 0 | ... | 0 | 1.000 | 0 | 0.000 | 0 | 0 | 0 | 0.000 | 0 | 0 |
6 | Preach lol! :) RT @mention: #alliwantis this t... | minnesota | Minneapolis-St. Paul | 0 | 0 | 0 | 1.000 | 0.000 | 1.000 | 0 | ... | 0 | 0.604 | 0 | 0.196 | 0 | 0 | 0 | 0.201 | 0 | 0 |
9 | @mention good morning sunshine | rhode island | Purgatory | 0 | 0 | 0 | 0.403 | 0.597 | 1.000 | 0 | ... | 0 | 0.000 | 0 | 0.000 | 0 | 0 | 0 | 1.000 | 0 | 0 |
5 rows × 27 columns
For completeness we will also show a snippet from the test set to see what kind of features we are allowed to use.
test_df = pd.DataFrame.from_csv('./datasets/test.csv')
test_df.head(2)
tweet | state | location | |
---|---|---|---|
id | |||
4 | Edinburgh peeps is it sunny?? #weather | NaN | birmingham |
5 | SEEVERE T’STORM WARNING FOR TROUSDALE, NORTHW... | NaN | Nashville |
We start out by trying to model the sentiment category of a tweet. To this end we begin by converting the sentiment fields s1-s5 into a single label, namely the index of the field holding the maximum value.
def create_label(row):
    # row is of type pandas.Series; cast it to a list before taking the argmax.
    lst = row.tolist()
    return lst.index(max(lst)) + 1

# apply defaults to working column-wise; use axis=1 to apply the function row-wise.
label_df = raw_df[['s1','s2','s3','s4','s5']].apply(create_label, axis=1)
label_df.head(5)
id
1    3
2    4
3    5
6    4
9    5
dtype: int64
In order to use the tweets as input for a machine learning algorithm we need to convert them into numerical features. One way to do this is to chop up each tweet into words, count how often each word occurs, and turn those counts into a large sparse vector whose length is the size of the vocabulary of all tweets.
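To make the idea concrete, here is a minimal, hedged sketch using two made-up tweets (not taken from the dataset): each tweet becomes a vector of term counts over a shared vocabulary.
from collections import Counter
# two made-up example tweets (purely illustrative)
toy_tweets = ["sunny day in the park", "rainy day rainy mood"]
# shared vocabulary over all tweets
vocabulary = sorted(set(" ".join(toy_tweets).split()))
# each tweet becomes a (sparse) vector of term counts over that vocabulary
vectors = [[Counter(tweet.split())[term] for term in vocabulary] for tweet in toy_tweets]
# vocabulary: ['day', 'in', 'mood', 'park', 'rainy', 'sunny', 'the']
# vectors:    [1, 1, 0, 1, 0, 1, 1] and [1, 0, 1, 0, 2, 0, 0]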
We start very simply by concatenating all tweets into one string and counting how often each term occurs.
from collections import Counter
full_tweet_string = raw_df.tweet.apply(lambda t: t.lower() + " ").sum()
Counter(full_tweet_string.split()).most_common()[:25]
[('the', 36622), ('weather', 23046), ('to', 20942), ('@mention', 19424), ('a', 18712), ('in', 18611), ('and', 17827), ('i', 16907), ('is', 14302), ('{link}', 13491), ('for', 12580), ('this', 12231), ('of', 10703), ('it', 9131), ('rt', 8999), ('on', 8060), ('@mention:', 8008), ('my', 7492), ("it's", 7130), ('at', 6974), ('out', 6397), ('be', 6122), ('its', 5900), ('storm', 5673), ('you', 5385)]
The simple counter already indicates a problem. There are a lot of very common words that obviously carry no signal, such as 'the', 'at', 'of', etc. We need to remove those stopwords. Another problem can be seen by comparing '@mention' and '@mention:', which should clearly be identified as the same word, meaning we need to remove the punctuation. Finally, we might want to identify 'storm' and 'stormy' as the same word and thus require stemming techniques.
The following code snippet provides a tokenizer that does all of the above.
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
from nltk import word_tokenize
from nltk.util import ngrams
import string
def custom_tokenizer(document_string):
    # define the stop word vocabulary
    stops = [unicode(word) for word in stopwords.words('english')] \
        + ["''", "``", 're:', 'fwd:', '-', '@mention', '@mention:', 'mention', 'link', ':', 'f.', '&']
    # create a default stemmer
    stemmer = EnglishStemmer()
    # stem each word, skipping stop words and punctuation
    tokens = [stemmer.stem(unicode(word)) for word in word_tokenize(document_string.lower()) \
              if not (unicode(word.lower()) in stops or unicode(word.lower()) in list(string.punctuation))]
    # return the stemmed unigrams together with their bigrams
    return tokens + list(ngrams(tokens, 2))
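To get a feel for what the tokenizer produces, here is a quick illustration on a single made-up tweet; the exact tokens may vary slightly with the installed nltk version and data.
custom_tokenizer("RT @mention: I love stormy weather in Chicago! {link}")
# roughly: the stemmed unigrams [u'rt', u'love', u'stormi', u'weather', u'chicago'],
# followed by the bigram tuples (u'rt', u'love'), (u'love', u'stormi'), (u'stormi', u'weather'), (u'weather', u'chicago')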
Let's have a look at the word counts again:
Counter(custom_tokenizer(full_tweet_string)).most_common()[:25]
[(u'weather', 34001), (u'...', 17465), (u"'s", 12792), (u'rt', 9189), (u'day', 8824), (u'storm', 7920), (u'sunni', 6601), (u'hot', 5952), (u'today', 5612), (u'outsid', 5357), (u'like', 4831), (u"n't", 4714), (u'rain', 4655), (u'sunshin', 4587), (u'get', 4550), (u'degre', 4441), (u'thunderstorm', 4388), (u'feel', 4339), (u'humid', 4229), (u'go', 4164), (u'cold', 4041), (u"'m", 4032), (u'wind', 3913), (u'raini', 3889), (u'good', 3766)]
This already looks much better, and we can identify weather-related terms such as 'storm', 'sunni', etc. This will be a good starting point for the rest of the model.
One step that is still missing is how to use custom_tokenizer to actually create feature vectors. Luckily, sklearn provides us with the right functionality to do just that.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df = 100, max_df = 10000, tokenizer = custom_tokenizer)
To ensure we always apply the same vectorization, we need to "fit" the vectorizer to a reference set so that any new example can be vectorized against the same vocabulary. We do it here on the full set, from which we will also extract the cross-validation data; we should be aware of the bias this introduces, as it eliminates the possibility of encountering previously unseen terms when we run our test set prediction.
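To make that bias concrete before fitting on the full set, here is a small hedged sketch (with made-up sentences) showing that a fitted CountVectorizer simply ignores terms it has never seen.
# illustrative only: fit on two toy sentences, then transform one containing an unseen word
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["sunny day", "rainy day"])             # vocabulary: 'day', 'rainy', 'sunny'
print(toy_vectorizer.transform(["foggy day"]).toarray())   # 'foggy' is unseen and silently dropped: [[1 0 0]]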
# fit the vectorizer to the full set
X = vectorizer.fit_transform(raw_df.tweet.tolist())
X.shape
(77946, 1260)
As the training data and the labels now live in separate objects of different types, it is tedious to split them into training and test sets manually. Again, sklearn has a tool for us: the cross_validation API.
from sklearn import cross_validation
df_train, df_test, y_train, y_test = cross_validation.train_test_split(raw_df, label_df, test_size = 0.3)
To avoid the above case of having the "perfect" vocabulary, let us now fit a new vectorizer using only the training set.
X_train = vectorizer.fit_transform(pd.DataFrame(df_train)[0].tolist())
X_test = vectorizer.transform(pd.DataFrame(df_test)[0].tolist())
We can now start to train the model. As this is a highly sparse problem, it lends itself to being tackled with a Naive Bayes classifier.
from sklearn.naive_bayes import MultinomialNB
multi_nb_clf = MultinomialNB()
multi_nb_clf.fit(X_train, y_train)
multi_nb_clf.score(X_test, y_test)
0.60549948682860077
With only single terms (unigrams) and bigrams (which, interestingly enough, seem to have no impact at this level) and no additional fine-grained modeling, we already classify 60% of the tweets into the right category. This is impressive insofar as there are still plenty of easy and obvious ways to improve the classification, some of which have already been hinted at throughout the text.
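As one hedged illustration of such an improvement (not benchmarked here, and with a placeholder smoothing value), we could swap the raw counts for tf-idf weights and tune the Naive Bayes smoothing parameter alpha, reusing the same tokenizer and split:
from sklearn.feature_extraction.text import TfidfVectorizer
# illustrative variation: tf-idf weighting instead of raw counts
tfidf_vectorizer = TfidfVectorizer(min_df=100, max_df=10000, tokenizer=custom_tokenizer)
X_train_tfidf = tfidf_vectorizer.fit_transform(pd.DataFrame(df_train)[0].tolist())
X_test_tfidf = tfidf_vectorizer.transform(pd.DataFrame(df_test)[0].tolist())
# alpha is the additive smoothing parameter; 0.5 is just a placeholder worth tuning via cross-validation
tfidf_clf = MultinomialNB(alpha=0.5)
tfidf_clf.fit(X_train_tfidf, y_train)
tfidf_clf.score(X_test_tfidf, y_test)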