This notebook explores some initial modeling of sentiment analysis of weather-related tweets. The datasets can be downloaded from Kaggle's "Partly Sunny with a Chance of Hashtags" competition.
Since the competition is over, we will restrict ourselves to the training set only in order to have a good basis for benchmarking. We will make heavy use of nltk as a natural language processing (NLP) library to build a bag-of-words model that we will use to train a Multinomial Naive Bayes classifier.
import pandas as pd
pd.options.display.encoding = 'ascii'
### import the different corpora and packages from nltk as you need them
# import nltk
# nltk.download()
raw_df = pd.DataFrame.from_csv('./datasets/train.csv')
raw_df.head(5)
tweet | state | location | s1 | s2 | s3 | s4 | s5 | w1 | w2 | ... | k6 | k7 | k8 | k9 | k10 | k11 | k12 | k13 | k14 | k15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1 | Jazz for a Rainy Afternoon: {link} | oklahoma | Oklahoma | 0 | 0 | 1 | 0.000 | 0.000 | 0.800 | 0 | ... | 0 | 0.000 | 0 | 0.000 | 1 | 0 | 0 | 0.000 | 0 | 0 |
2 | RT: @mention: I love rainy days. | florida | Miami-Ft. Lauderdale | 0 | 0 | 0 | 1.000 | 0.000 | 0.196 | 0 | ... | 0 | 0.000 | 0 | 0.000 | 1 | 0 | 0 | 0.000 | 0 | 0 |
3 | Good Morning Chicago! Time to kick the Windy C... | idaho | NaN | 0 | 0 | 0 | 0.000 | 1.000 | 0.000 | 0 | ... | 0 | 1.000 | 0 | 0.000 | 0 | 0 | 0 | 0.000 | 0 | 0 |
6 | Preach lol! :) RT @mention: #alliwantis this t... | minnesota | Minneapolis-St. Paul | 0 | 0 | 0 | 1.000 | 0.000 | 1.000 | 0 | ... | 0 | 0.604 | 0 | 0.196 | 0 | 0 | 0 | 0.201 | 0 | 0 |
9 | @mention good morning sunshine | rhode island | Purgatory | 0 | 0 | 0 | 0.403 | 0.597 | 1.000 | 0 | ... | 0 | 0.000 | 0 | 0.000 | 0 | 0 | 0 | 1.000 | 0 | 0 |
5 rows × 27 columns
For completeness we will also show a snippet from the test set to see what kind of features we are allowed to use.
test_df = pd.DataFrame.from_csv('./datasets/test.csv')
test_df.head(2)
tweet | state | location | |
---|---|---|---|
id | |||
4 | Edinburgh peeps is it sunny?? #weather | NaN | birmingham |
5 | SEEVERE T’STORM WARNING FOR TROUSDALE, NORTHW... | NaN | Nashville |
We start out by trying to model the sentiment category of a tweet. To this end we begin by converting the sentiment fields s1-s5 into a single label, namely the index of the field holding the maximum value.
def create_label(row):
    # row is of type pandas.Series; cast it to a list before taking the argmax.
    lst = row.tolist()
    return lst.index(max(lst)) + 1

# apply defaults to working column-wise; use axis=1 to apply the function row-wise.
label_df = raw_df[['s1','s2','s3','s4','s5']].apply(create_label, axis=1)
label_df.head(5)
id
1    3
2    4
3    5
6    4
9    5
dtype: int64
In order to use the tweets as input for a machine learning algorithm we need to convert them into numerical features. One way to do this is to chop up each tweet into words, count how often each word occurs, and turn those counts into a large sparse vector whose length is the size of the vocabulary of all tweets.
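To make the idea concrete, here is a minimal, hedged sketch using two made-up tweets (not taken from the dataset): each tweet becomes a vector of term counts over a shared vocabulary.
from collections import Counter
# two made-up example tweets (purely illustrative)
toy_tweets = ["sunny day in the park", "rainy day rainy mood"]
# shared vocabulary over all tweets
vocabulary = sorted(set(" ".join(toy_tweets).split()))
# each tweet becomes a (sparse) vector of term counts over that vocabulary
vectors = [[Counter(tweet.split())[term] for term in vocabulary] for tweet in toy_tweets]
# vocabulary: ['day', 'in', 'mood', 'park', 'rainy', 'sunny', 'the']
# vectors:    [1, 1, 0, 1, 0, 1, 1] and [1, 0, 1, 0, 2, 0, 0]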
We start very simply by concatenating all tweets into one string and counting how often each term occurs.
from collections import Counter
full_tweet_string = raw_df.tweet.apply(lambda t: t.lower() + " ").sum()
Counter(full_tweet_string.split()).most_common()[:25]
[('the', 36622), ('weather', 23046), ('to', 20942), ('@mention', 19424), ('a', 18712), ('in', 18611), ('and', 17827), ('i', 16907), ('is', 14302), ('{link}', 13491), ('for', 12580), ('this', 12231), ('of', 10703), ('it', 9131), ('rt', 8999), ('on', 8060), ('@mention:', 8008), ('my', 7492), ("it's", 7130), ('at', 6974), ('out', 6397), ('be', 6122), ('its', 5900), ('storm', 5673), ('you', 5385)]
The simple counter already indicates a problem. There are a lot of very common words that obviously carry no signal, such as 'the', 'at', 'of', etc. We need to remove those stopwords. Another problem can be seen by comparing '@mention' and '@mention:', which should clearly be identified as the same word, meaning we need to remove the punctuation. Finally, we might want to identify 'storm' and 'stormy' as the same word and thus require stemming techniques.
The following code snippet provides a tokenizer that does all of the above.
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
from nltk import word_tokenize
from nltk.util import ngrams
import string
def custom_tokenizer(document_string):
    # define the stop word vocabulary
    stops = [unicode(word) for word in stopwords.words('english')] \
        + ["''", "``", 're:', 'fwd:', '-', '@mention', '@mention:', 'mention', 'link', ':', 'f.', '&']
    # create a default stemmer
    stemmer = EnglishStemmer()
    # stem each word, skipping stop words and punctuation
    tokens = [stemmer.stem(unicode(word)) for word in word_tokenize(document_string.lower()) \
              if not (unicode(word.lower()) in stops or unicode(word.lower()) in list(string.punctuation))]
    # return the stemmed unigrams together with their bigrams
    return tokens + list(ngrams(tokens, 2))
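To get a feel for what the tokenizer produces, here is a quick illustration on a single made-up tweet; the exact tokens may vary slightly with the installed nltk version and data.
custom_tokenizer("RT @mention: I love stormy weather in Chicago! {link}")
# roughly: the stemmed unigrams [u'rt', u'love', u'stormi', u'weather', u'chicago'],
# followed by the bigram tuples (u'rt', u'love'), (u'love', u'stormi'), (u'stormi', u'weather'), (u'weather', u'chicago')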
Let's have a look at the word counts again:
Counter(custom_tokenizer(full_tweet_string)).most_common()[:25]
[(u'weather', 34001), (u'...', 17465), (u"'s", 12792), (u'rt', 9189), (u'day', 8824), (u'storm', 7920), (u'sunni', 6601), (u'hot', 5952), (u'today', 5612), (u'outsid', 5357), (u'like', 4831), (u"n't", 4714), (u'rain', 4655), (u'sunshin', 4587), (u'get', 4550), (u'degre', 4441), (u'thunderstorm', 4388), (u'feel', 4339), (u'humid', 4229), (u'go', 4164), (u'cold', 4041), (u"'m", 4032), (u'wind', 3913), (u'raini', 3889), (u'good', 3766)]
This already looks much better, and we can identify weather-related terms such as 'storm', 'sunni', etc. This will be a good starting point for the rest of the model.
One step that is still missing is how to use custom_tokenizer to actually create feature vectors. Luckily, sklearn provides us with the right functionality to do just that.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df = 100, max_df = 10000, tokenizer = custom_tokenizer)
To ensure we always apply the same vectorization, we need to "fit" the vectorizer to a reference set so that any new example can be vectorized against the same vocabulary. We do it here on the full set, from which we will also extract the cross-validation data; we should be aware of the bias this introduces, as it eliminates the possibility of encountering previously unseen terms when we run our test set prediction.
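To make that bias concrete before fitting on the full set, here is a small hedged sketch (with made-up sentences) showing that a fitted CountVectorizer simply ignores terms it has never seen.
# illustrative only: fit on two toy sentences, then transform one containing an unseen word
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["sunny day", "rainy day"])             # vocabulary: 'day', 'rainy', 'sunny'
print(toy_vectorizer.transform(["foggy day"]).toarray())   # 'foggy' is unseen and silently dropped: [[1 0 0]]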
# fit the vectorizer to the full set
X = vectorizer.fit_transform(raw_df.tweet.tolist())
X.shape
(77946, 1260)
As the training data and the labels now live in separate objects of different types, it is tedious to split them into training and test sets manually. Again, sklearn has a tool for us: the cross_validation API.
from sklearn import cross_validation
df_train, df_test, y_train, y_test = cross_validation.train_test_split(raw_df, label_df, test_size = 0.3)
To avoid the above case of having the "perfect" vocabulary, let us now fit a new vectorizer using only the training set.
X_train = vectorizer.fit_transform(pd.DataFrame(df_train)[0].tolist())
X_test = vectorizer.transform(pd.DataFrame(df_test)[0].tolist())
We can now start to train the model. As this is a highly sparse problem, it lends itself to being tackled with a Naive Bayes classifier.
from sklearn.naive_bayes import MultinomialNB
multi_nb_clf = MultinomialNB()
multi_nb_clf.fit(X_train, y_train)
multi_nb_clf.score(X_test, y_test)
0.60549948682860077
With only single terms (unigrams) and bigrams (which, interestingly enough, seem to have no impact at this level) and no additional fine-grained modeling, we already classify 60% of the tweets into the right category. This is impressive insofar as there are still plenty of easy and obvious ways to improve the classification, some of which have already been hinted at throughout the text.
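As one hedged illustration of such an improvement (not benchmarked here, and with a placeholder smoothing value), we could swap the raw counts for tf-idf weights and tune the Naive Bayes smoothing parameter alpha, reusing the same tokenizer and split:
from sklearn.feature_extraction.text import TfidfVectorizer
# illustrative variation: tf-idf weighting instead of raw counts
tfidf_vectorizer = TfidfVectorizer(min_df=100, max_df=10000, tokenizer=custom_tokenizer)
X_train_tfidf = tfidf_vectorizer.fit_transform(pd.DataFrame(df_train)[0].tolist())
X_test_tfidf = tfidf_vectorizer.transform(pd.DataFrame(df_test)[0].tolist())
# alpha is the additive smoothing parameter; 0.5 is just a placeholder worth tuning via cross-validation
tfidf_clf = MultinomialNB(alpha=0.5)
tfidf_clf.fit(X_train_tfidf, y_train)
tfidf_clf.score(X_test_tfidf, y_test)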