The Naive Bayes classifier is a statistical tool with diverse uses in text analysis. It is famously used to determine whether an email is "spam" or "ham" (the opposite of spam). It also has uses in sentiment analysis (for example, determining whether a movie review is positive or negative), in authorship attribution, and in categorizing documents into topics. Today, however, we will use a Naive Bayes classifier not to classify, but to analyze the competing sides of political debates on the floor of the US Congress.
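Before turning to congressional speeches, here is a minimal sketch of the spam/ham use case on a tiny invented corpus (the training messages below are made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus invented for illustration -- four labeled training messages
train_docs = ["win free money now", "free prize click now",
              "meeting agenda for monday", "lunch with the team monday"]
train_labels = ["spam", "spam", "ham", "ham"]

# Turn each message into a vector of word counts
vec = CountVectorizer()
X = vec.fit_transform(train_docs)

# Fit the classifier and predict the label of an unseen message
clf = MultinomialNB()
clf.fit(X, train_labels)
print(clf.predict(vec.transform(["free money prize"])))  # ['spam']
```

The classifier learns which words are more probable under each label and picks the label that makes the new message most likely; we will use exactly the same machinery on speeches below.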
%pylab inline
from __future__ import division
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split
from speech import Speech
phrase = "abortion"
# rows=0 here: first ask only for the count of matching speeches, then fetch them all
num_speeches = Speech.get(0, 0, phrase=phrase, congress="", start_date="1995-05-04", speaker_party="*")['count']
print "Downloading %i speeches" % num_speeches
speeches = Speech.get(start=0, rows=num_speeches, phrase=phrase, speaker_party="*")['speeches']
print len(speeches), "speeches downloaded"
naive_bayes = MultinomialNB(alpha=1.0, fit_prior=True)  # Laplace smoothing; learn class priors from the data
vectorizer = TfidfVectorizer(min_df=.1, max_df=.6, stop_words='english')  # drop terms in <10% or >60% of speeches
data = [" ".join(speech['speaking']) for speech in speeches]  # join each speech's text blocks into one string
data = vectorizer.fit_transform(data)
target = [speech['speaker_party'] for speech in speeches]
target = [0 if x == "D" else 1 for x in target]  # Democrats -> 0, everyone else -> 1
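Note that this encoding lumps every non-Democrat (Republicans and any independents) into class 1. A quick sanity check of the pattern on hypothetical party labels, invented to mimic what the API returns:

```python
from collections import Counter

# Hypothetical party labels, invented for illustration
parties = ["D", "R", "D", "I", "R", "D"]

# Same encoding as above: Democrats become class 0, everyone else class 1
encoded = [0 if p == "D" else 1 for p in parties]
print(encoded)           # [0, 1, 0, 1, 1, 0]
print(Counter(encoded))  # quick check that the classes are not wildly imbalanced
```

Checking the class balance is worthwhile because `fit_prior=True` means a lopsided split will tilt the classifier toward the majority party.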
data.shape
X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size=0.2)  # hold out 20% for evaluation
print X_train.shape, X_test.shape, len(Y_train), len(Y_test)
naive_bayes.fit(X_train, Y_train)
naive_bayes.score(X_test, Y_test)  # accuracy on the held-out 20%
cross_val_score(naive_bayes, data, target, scoring='accuracy', verbose=1, cv=5)  # 5-fold cross-validation
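Cross-validation averages accuracy over several train/test splits, which is more stable than a single hold-out score. A self-contained sketch on synthetic count data (note that from scikit-learn 0.18 onward, `cross_val_score` lives in `sklearn.model_selection` rather than `sklearn.cross_validation`):

```python
import numpy as np
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
# Synthetic term counts: class 0 favors the first two "words", class 1 the last two
X = np.vstack([rng.poisson([5, 5, 1, 1], size=(50, 4)),
               rng.poisson([1, 1, 5, 5], size=(50, 4))])
y = np.array([0] * 50 + [1] * 50)

scores = cross_val_score(MultinomialNB(), X, y, scoring='accuracy', cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # the summary number usually reported
```

Reporting the mean (and spread) of the fold scores guards against getting lucky or unlucky with one particular split.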
terms = vectorizer.get_feature_names()
# Per-term log-likelihoods, weighted by each class's share of the training data
t1 = [(naive_bayes.feature_log_prob_[0][i] * (naive_bayes.class_count_[0] / naive_bayes.class_count_.sum())) for i in range(len(terms))]
t2 = [(naive_bayes.feature_log_prob_[1][i] * (naive_bayes.class_count_[1] / naive_bayes.class_count_.sum())) for i in range(len(terms))]
[(terms[i], t1[i]) for i in np.array(t1).argsort()]  # Terms for class 0 (Democrats); ascending sort, so the most indicative appear last
[(terms[i], t2[i]) for i in np.array(t2).argsort()]  # Terms for class 1 (Republicans); ascending sort, so the most indicative appear last
What other contexts might you be able to apply this methodology to? How can a classifier be useful in your work? How about the particular way in which we were able to find the words most indicative of a certain class?